A recent work from @iddo claimed GPT4 can score 100% on MIT's EECS curriculum with the right prompting. My friends and I were excited to read the analysis behind such a feat, but after digging deeper, what we found left us surprised and disappointed. dub.sh/gptsucksatmit 🧵
The released test set on GitHub is chock-full of impossible-to-solve problems. Lots of questions refer to non-existent diagrams or are missing contextual information. So how did GPT "solve" them? (1/4)
@sauhaarda @iddo If they were really confident in the results, they would open source at least version 3. Something is being swept under the rug. I can't believe anyone who doesn't make an effort to be transparent; this just feels like conclusions being forced on us.
@sauhaarda @iddo Super interesting analysis, more work like this is needed, thanks for the effort in this.
@sauhaarda @iddo Thank you for calling this out! I keep saying that GPT-4 is very, very good at sounding like a confident human, which is enough to fool 50% of the people 100% of the time. It's also very good at repeating something it has read before, although there it may sometimes misquote.
@sauhaarda @iddo students 1, prof 0 great investigative work
@sauhaarda @iddo Thank you for doing the digging, probably 90% of people who read the original headline won't see the criticism, but it's important to get out there. Validation using GPT-4 is just such a massive red flag, can't believe they thought it was ok
@sauhaarda @iddo That kind of academic malpractice usually ends a career. Do the same rules not apply to machine learning papers? Jesus Christ.
@sauhaarda @mark_riedl @iddo Okay, so the few-shot setting is not good. But didn't GPT-4 zero-shot still get 90% right?
@sauhaarda @iddo I'm completely unsurprised. It performs absolutely terribly on mathematical questions. I have plenty of examples (including for GPT-4, which is no better), although not neatly organized in one place.
@sauhaarda @iddo Sir, you do know you can choose to not be flower-nutria on notion? 😂
@sauhaarda @iddo A lot of fuzzy evaluation has been happening, especially with open-source models, where people simply fine-tune on data that overlaps with the evaluation benchmarks and then claim wild results.
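The contamination this reply describes (fine-tuning data overlapping the eval set) can be checked with a simple heuristic. The sketch below is purely illustrative and not from the thread: it flags eval documents that share any word 8-gram with the training corpus, one common rough signal of train/test leakage. All function names and the n-gram size are my own assumptions, not anything the thread or the paper specifies.

```python
# Hypothetical sketch: a rough n-gram overlap check for benchmark contamination.
# Flags an eval document if it shares at least one word n-gram with the
# fine-tuning corpus. Real contamination audits are more involved (normalization,
# fuzzy matching, scale), but this shows the basic idea.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_docs: list, eval_docs: list, n: int = 8) -> float:
    """Fraction of eval documents sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for doc in eval_docs if ngrams(doc, n) & train_grams)
    return flagged / len(eval_docs) if eval_docs else 0.0
```

A high rate would suggest the benchmark score reflects memorization rather than capability, which is exactly the failure mode the reply is pointing at.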
@sauhaarda @iddo Does it mean the exams are not relevant for evaluation of human potential? :)
@sauhaarda @iddo Thanks so much for looking into it. I asked them how they prompted and got no response. My daughter uses Khanmigo, which is supposed to use GPT-4. It gets my daughter's AoPS math problems (counting, probability, algebra) wrong all the time.
@sauhaarda @iddo But isn't ChatGPT getting its answers from the internet? And if there are several answers to one question on the net, GPT is unable to find the right one. That's why: a 60% score. Get back to your books, guys.