More than 50% of the reported reasoning ability of LLMs might not be true reasoning. How do we evaluate models trained on the entire internet, i.e., what novel questions can we ask of something that has seen all written knowledge? Below: a new eval, results, code, and paper. Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model’s performance on these snapshots differs from its performance on the original static questions. Big question: are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)? What we find: gaps in the range of 58% to 80% in a bunch of SOTA models. This motivates us to build gap-0 models. We’re releasing the paper, code, and 3 snapshots of functional MATH() today. arxiv draft: arxiv.org/abs/2402.19450 github repo: github.com/ConsequentAI/f… 1/🧵
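To make the functionalization idea concrete, here is a minimal sketch of what one functionalized MATH-style item could look like, assuming the approach described above (rewrite the reasoning as code, then sample it to get snapshots). The problem, function names, and snapshot format are illustrative, not taken from the released repo.

```python
import random

# Illustrative functionalized item: same reasoning ("sum of the first n
# positive even integers"), different surface question on every sample.
def sum_of_evens_item(rng: random.Random):
    n = rng.randint(5, 50)
    question = f"What is the sum of the first {n} positive even integers?"
    answer = n * (n + 1)  # 2 + 4 + ... + 2n = n(n + 1)
    return question, answer

# A "snapshot" here is one frozen instantiation of the functional benchmark,
# reproducible from a seed so every model is graded on the same questions.
def snapshot(seed: int, n_items: int = 3):
    rng = random.Random(seed)
    return [sum_of_evens_item(rng) for _ in range(n_items)]

if __name__ == "__main__":
    for q, a in snapshot(seed=2024):
        print(q, "->", a)
```

The reasoning gap would then be the drop from a model’s score on the original static questions to its score on snapshots like this one.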
2/ Problem: Testing with static text (Q, A)s is problematic. Contamination is a concern, especially when we don’t know what data went into each training step, starting from random weights. Also, we cannot keep building new, harder (Q, A)s all the time. We need a longer-term solution.
@_saurabh This is awesome! I’ve been manually doing this for some cases with just prompt regex but it’s quite annoying. Have been looking for a functional approach.
@_saurabh I find it quite weird that you didn't validate your methodology using models with publicly released data. Was there a reason you didn't run those experiments?
@_saurabh @jeremyphoward Can it be that LLMs are overhyped? Can't wait to see how GPT-5 scores at true reasoning.
@_saurabh Great work! Could one go a step further and do this for coding benchmarks by defining formal specs and evaluating model outputs based on that? x.com/vimota/status/…
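A rough sketch of the spec-based grading suggested in the reply above: instead of comparing model-generated code against a fixed reference output, check it against a formal property on random inputs. The task, the spec, and `model_sort` (a stand-in for the model’s output) are hypothetical.

```python
import random

def model_sort(xs):
    # Placeholder for code produced by the model under evaluation.
    return sorted(xs)

def satisfies_spec(candidate, trials: int = 100) -> bool:
    """Spec for a sorting task: the output is a permutation of the input
    and is non-decreasing, checked on randomly generated inputs."""
    rng = random.Random(0)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        ys = candidate(list(xs))
        if sorted(xs) != sorted(ys):                 # permutation check
            return False
        if any(a > b for a, b in zip(ys, ys[1:])):   # ordering check
            return False
    return True

print(satisfies_spec(model_sort))  # True for a correct implementation
```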
@_saurabh @aphysicist Reasoning capabilities are total trash afaict. Consider this very specific materials science question it could not answer (it gave nonsensical explanations for its wrong answers).
@_saurabh I'd love to see how Claude 3 scores here.
@_saurabh The slightly disingenuous, but seemingly accurate, read of your paper is: We rewrite the format of questions to be harder and less natural, so LLMs do worse on them. I do think there is interesting and useful work in this area, potentially even using your dataset, but man.
@_saurabh This is surprising because I had assumed model trainers would create synthetic data with exactly this approach? And many more sophisticated approaches?
@_saurabh Great work! The problem of overfitting to static benchmarks is super important, glad to see folks looking into this! We have been thinking about similar directions in a recent paper where we formalise lifelong benchmarking of models: arxiv.org/abs/2402.19472
@_saurabh Do you have human baseline results? Anecdotally, my middle schoolers often struggle with "equivalent problems" with harder constants, and people in general are sensitive to framing effects (a la Wason selection task)
@_saurabh Nice framework, but only mathematical reasoning was evaluated here, where it performs the weakest. Quite different from reasoning about real-world concepts. Personal, thorough testing of GPT-4 in articles after pre-training suggests that its wider reasoning is much better.
@_saurabh You should probably clarify that the model is GPT-3.5, not GPT-3, since they are different models with completely different capabilities.
@_saurabh Wander into a first-year philosophy class (or read medieval scholars), and you'll soon discover it's not just the machines that think they're reasoning when they're not, of course.
@_saurabh Are people using this approach to generate training data?
@_saurabh My reaction was “so you’re telling me 40% of LLM performance is true reasoning.” The difference between GPT-3 and GPT-4 at least suggests a positive trend.
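For what it’s worth, the “40%” reading follows if the reasoning gap is taken as the share of the static-benchmark score that does not survive on functional snapshots. That definition is an assumption consistent with the thread, not a quote from the paper, and the accuracies below are illustrative only.

```python
# Assumed definition: gap = fraction of the static score lost on snapshots.
def reasoning_gap(static_acc: float, snapshot_acc: float) -> float:
    return (static_acc - snapshot_acc) / static_acc

static_acc, snapshot_acc = 0.50, 0.20   # illustrative accuracies, not paper results
gap = reasoning_gap(static_acc, snapshot_acc)
print(f"gap = {gap:.0%}, surviving share = {1 - gap:.0%}")  # gap = 60%, surviving share = 40%
```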
@_saurabh Disappointed to see that even GPT-4 is such a “cheater” :) Also surprising that Mixtral and GPT-4 are in a similar ballpark! What happens if you generate lots of examples like these and try fine-tuning?
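A sketch of the fine-tuning experiment floated in the reply above, assuming one would sample many fresh instantiations of a functionalized item and write them out as prompt/completion pairs. The item, file name, and record format are hypothetical, not from the released code.

```python
import json
import random

def sum_of_evens_item(rng: random.Random) -> dict:
    # Illustrative functionalized problem reused as a training-data generator.
    n = rng.randint(5, 50)
    prompt = f"What is the sum of the first {n} positive even integers?"
    return {"prompt": prompt, "completion": str(n * (n + 1))}

rng = random.Random(7)
with open("functional_math_sft.jsonl", "w") as f:
    for _ in range(10_000):
        f.write(json.dumps(sum_of_evens_item(rng)) + "\n")
```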