More than 50% of the reported reasoning ability of LLMs might not be true reasoning. How do we evaluate models trained on the entire internet, i.e., what novel questions can we ask of something that has seen all written knowledge? Below: a new eval, results, code, and paper. Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model’s performance on these snapshots differs from its performance on the original static questions. Big question: are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)? What we find: gaps in the range of 58% to 80% in a bunch of SOTA models. This motivates us to build gap-0 models. We’re releasing the paper, code, and 3 snapshots of functional MATH() today. arxiv draft: arxiv.org/abs/2402.19450 github repo: github.com/ConsequentAI/f… 1/🧵
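To make the functionalization idea concrete, here is a minimal sketch of what one functionalized MATH-style item could look like, assuming the approach described above (rewrite the reasoning as code, then sample it to get snapshots). The problem, function names, and snapshot format are illustrative, not taken from the released repo.

```python
import random

# Illustrative functionalized item: same reasoning ("sum of the first n
# positive even integers"), different surface question on every sample.
def sum_of_evens_item(rng: random.Random):
    n = rng.randint(5, 50)
    question = f"What is the sum of the first {n} positive even integers?"
    answer = n * (n + 1)  # 2 + 4 + ... + 2n = n(n + 1)
    return question, answer

# A "snapshot" here is one frozen instantiation of the functional benchmark,
# reproducible from a seed so every model is graded on the same questions.
def snapshot(seed: int, n_items: int = 3):
    rng = random.Random(seed)
    return [sum_of_evens_item(rng) for _ in range(n_items)]

if __name__ == "__main__":
    for q, a in snapshot(seed=2024):
        print(q, "->", a)
```

The reasoning gap would then be the drop from a model’s score on the original static questions to its score on snapshots like this one.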
2/ Problem: Testing with static text (Q, A)s is problematic. Contamination is a concern, especially when we don’t know what data went into each training step, starting from random weights. Also, we cannot keep building new, harder (Q, A)s all the time. We need a longer-term solution.
@_saurabh This is awesome! I’ve been manually doing this for some cases with just prompt regex but it’s quite annoying. Have been looking for a functional approach.
@_saurabh I find it quite weird that you didn't validate your methodology using models with publicly released data. Was there a reason you didn't run those experiments?
@_saurabh @jeremyphoward Can it be that LLMs are overhyped? Can't wait to see how GPT-5 scores at true reasoning.
@_saurabh Great work! Could one go a step further and do this for coding benchmarks by defining formal specs and evaluating model outputs based on that? x.com/vimota/status/…
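A rough sketch of the spec-based grading suggested in the reply above: instead of comparing model-generated code against a fixed reference output, check it against a formal property on random inputs. The task, the spec, and `model_sort` (a stand-in for the model’s output) are hypothetical.

```python
import random

def model_sort(xs):
    # Placeholder for code produced by the model under evaluation.
    return sorted(xs)

def satisfies_spec(candidate, trials: int = 100) -> bool:
    """Spec for a sorting task: the output is a permutation of the input
    and is non-decreasing, checked on randomly generated inputs."""
    rng = random.Random(0)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        ys = candidate(list(xs))
        if sorted(xs) != sorted(ys):                 # permutation check
            return False
        if any(a > b for a, b in zip(ys, ys[1:])):   # ordering check
            return False
    return True

print(satisfies_spec(model_sort))  # True for a correct implementation
```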
@_saurabh @aphysicist Reasoning capabilities are total trash afaict. Consider this very specific materials science question it could not answer (it gave nonsensical explanations for its wrong answers).
@_saurabh I'd love to see how Claude 3 scores here.
@_saurabh The slightly disingenuous, but seemingly accurate, read of your paper is: We rewrite the format of questions to be harder and less natural, so LLMs do worse on them. I do think there is interesting and useful work in this area, potentially even using your dataset, but man.
@_saurabh This is surprising because I had assumed model trainers would create synthetic data with exactly this approach? And many more sophisticated approaches?
@_saurabh Great work! The problem of overfitting to static benchmarks is super important, glad to see folks looking into this! We have been thinking about similar directions in a recent paper where we formalise lifelong benchmarking of models: arxiv.org/abs/2402.19472
@_saurabh Do you have human baseline results? Anecdotally, my middle schoolers often struggle with "equivalent problems" with harder constants, and people in general are sensitive to framing effects (a la Wason selection task)
@_saurabh Nice framework, but only mathematical reasoning was evaluated here, where it performs the weakest. Quite different from reasoning about real-world concepts. Personal, thorough testing of GPT-4 in articles after pre-training suggests that its wider reasoning is much better.
@_saurabh You should probably clarify that the model is GPT-3.5, not GPT-3, since they are different models with completely different capabilities.
@_saurabh Wander into a first-year philosophy class (or read medieval scholars), and you'll soon discover it's not just the machines that think they're reasoning when they're not, of course.
@_saurabh Are people using this approach to generate training data?
@_saurabh My reaction was “so you’re telling me 40% of LLM performance is true reasoning.” The difference between GPT-3 and GPT-4 at least suggests a positive trend.
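For what it’s worth, the “40%” reading follows if the reasoning gap is taken as the share of the static-benchmark score that does not survive on functional snapshots. That definition is an assumption consistent with the thread, not a quote from the paper, and the accuracies below are illustrative only.

```python
# Assumed definition: gap = fraction of the static score lost on snapshots.
def reasoning_gap(static_acc: float, snapshot_acc: float) -> float:
    return (static_acc - snapshot_acc) / static_acc

static_acc, snapshot_acc = 0.50, 0.20   # illustrative accuracies, not paper results
gap = reasoning_gap(static_acc, snapshot_acc)
print(f"gap = {gap:.0%}, surviving share = {1 - gap:.0%}")  # gap = 60%, surviving share = 40%
```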
@_saurabh Disappointed to see that even GPT-4 is such a “cheater” :) Also surprising that Mixtral and GPT-4 are in a similar ballpark! What happens if you generate lots of examples like these and try fine-tuning?
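A sketch of the fine-tuning experiment floated in the reply above, assuming one would sample many fresh instantiations of a functionalized item and write them out as prompt/completion pairs. The item, file name, and record format are hypothetical, not from the released code.

```python
import json
import random

def sum_of_evens_item(rng: random.Random) -> dict:
    # Illustrative functionalized problem reused as a training-data generator.
    n = rng.randint(5, 50)
    prompt = f"What is the sum of the first {n} positive even integers?"
    return {"prompt": prompt, "completion": str(n * (n + 1))}

rng = random.Random(7)
with open("functional_math_sft.jsonl", "w") as f:
    for _ in range(10_000):
        f.write(json.dumps(sum_of_evens_item(rng)) + "\n")
```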