More than 50% of the reported reasoning abilities of LLMs might not be true reasoning. How do we evaluate models trained on the entire internet? I.e., what novel questions can we ask of something that has seen all written knowledge? Below: new eval, results, code, and paper.

Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model’s performance differs between the static benchmark and its functional snapshots.

Big question: Are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)? What we find: gaps in the range of 58% to 80% in a bunch of SOTA models. That motivates us to build gap-0 models.

We’re releasing the paper, code, and 3 snapshots of functional MATH() today.
arxiv draft: arxiv.org/abs/2402.19450
github repo: github.com/ConsequentAI/f…
1/🧵
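For intuition, here is one simple way a gap percentage could be computed from static vs. snapshot accuracy. This is an illustrative sketch only (the function name reasoning_gap and the example numbers are made up; the paper gives the precise definition):

```python
def reasoning_gap(static_acc: float, snapshot_acc: float) -> float:
    """Relative drop from static-benchmark accuracy to functional-snapshot
    accuracy, as a percentage: 0 means no drop (proper reasoning),
    100 means all of the static accuracy disappears on fresh snapshots."""
    if static_acc == 0:
        return 0.0
    return 100.0 * (static_acc - snapshot_acc) / static_acc

# e.g. a hypothetical model at 80% on static MATH but 24% on a MATH() snapshot
print(reasoning_gap(0.80, 0.24))  # -> 70.0
```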
2/ Problem
Testing with static text (Q, A)s is problematic. Contamination is a concern, especially when we don’t know what data went into each training step starting from random weights. Also, we cannot keep building new, harder (Q, A)s forever. We need a longer-term solution.
3/ Solution
Represent the reasoning you want to test as code. Running this code with random inputs generates a “snapshot” (Q, A). So you get to test using the exact same IOs and harness as before, just without repeating any (Q, A) from the past.
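To make this concrete, here is a minimal sketch of the idea (not the released MATH() code; functional_area_problem is an invented toy problem) showing how one static question becomes a generator of fresh (Q, A) snapshots:

```python
import random

def functional_area_problem(seed: int):
    """Functionalized version of a static geometry question.

    The reasoning tested (compute a rectangle's area, then double it)
    stays fixed; the surface numbers change with every snapshot.
    """
    rng = random.Random(seed)
    width = rng.randint(2, 20)
    height = rng.randint(2, 20)
    question = (
        f"A rectangle is {width} cm wide and {height} cm tall. "
        "What is twice its area, in square centimeters?"
    )
    answer = 2 * width * height
    return question, answer

# Each seed yields a new (Q, A) pair exercising the same reasoning.
for seed in range(3):
    q, a = functional_area_problem(seed)
    print(q, "->", a)
```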
@_saurabh We have done something similar to address this issue: arxiv.org/abs/2401.15545. We dynamically generate questions for code models.