• Saurabh Srivastava @_saurabh · 2 years ago

    More than 50% of the reported reasoning abilities of LLMs might not be true reasoning. How do we evaluate models trained on the entire internet? That is, what novel questions can we ask of something that has seen all written knowledge? Below: a new eval, results, code, and paper.

    Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model's performance differs across snapshots.

    Big question: Are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)? What we find: gaps in the range of 58% to 80% in a bunch of SOTA models. This motivates us to build gap-0 models.

    We're releasing the paper, code, and 3 snapshots of functional MATH() today.
    arxiv draft: arxiv.org/abs/2402.19450
    github repo: github.com/ConsequentAI/f…
    1/🧵

    [image attached]
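To make the idea concrete, here is a minimal sketch of what "rewriting a benchmark question into code" can look like. This is not the paper's actual MATH() implementation; the function names, the toy linear-equation problem, and the gap formula are illustrative assumptions. Each seed yields a fresh surface question that demands the same underlying reasoning, so a model that memorized the static (Q, A) pair gets no help.

```python
import random

def math_snapshot(seed):
    """Generate one 'snapshot' of a toy MATH-style problem.

    The reasoning (solve a linear equation a*x + b = c) is fixed in code;
    the surface question changes with every seed.
    """
    rng = random.Random(seed)
    a = rng.randint(2, 12)
    x = rng.randint(1, 20)   # ground-truth solution
    b = rng.randint(1, 50)
    c = a * x + b            # right-hand side follows from the chosen solution
    question = f"Solve for x: {a}x + {b} = {c}."
    return question, x

def reasoning_gap(acc_static, acc_snapshots):
    """Gap between accuracy on the original static benchmark and on
    functional snapshots; near 0 suggests reasoning, large suggests memorization."""
    return acc_static - acc_snapshots
```

Evaluating a model on many seeds, then comparing against its score on the original fixed questions, gives the gap the thread reports (e.g., 58% to 80% for the models tested).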
• Saurabh Srivastava @_saurabh · 2 years ago

    2/ Problem: Testing using static (Q, A) text pairs is problematic. Contamination is a concern, especially when we don't know the data used at each training step starting from random weights. Also, we cannot keep building new, harder (Q, A)s all the time. We need a longer-term solution.
• Ashwinee Panda @PandaAshwinee · 2 years ago

    @_saurabh This is awesome! I’ve been manually doing this for some cases with just prompt regex but it’s quite annoying. Have been looking for a functional approach.
• Stella Biderman @BlancheMinerva · 2 years ago

    @_saurabh I find it quite weird that you didn't validate your methodology using models with publicly released data. Was there a reason you didn't run those experiments?
• Everett World @WorldEverett · 2 years ago

    @_saurabh @jeremyphoward Can it be that LLMs are overhyped? Can't wait to see how GPT-5 scores at true reasoning.
• Victor Mota @vimota · 2 years ago

    @_saurabh Great work! Could one go a step further and do this for coding benchmarks by defining formal specs and evaluating model outputs based on that? x.com/vimota/status/…
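The spec-based coding eval suggested above can be sketched as follows. This is a hypothetical illustration, not an existing benchmark harness: instead of comparing model output against a fixed reference answer, a formal spec (here, a property check) scores a model-generated function on freshly randomized inputs.

```python
import random

def spec_sorted(inp, out):
    """Formal spec for a 'sort a list' task: the output must equal
    the sorted version of the input."""
    return out == sorted(inp)

def evaluate(candidate_fn, spec, n_trials=100, seed=0):
    """Score a model-generated function against a spec on random inputs,
    rather than against a single fixed (input, output) pair."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_trials):
        inp = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        try:
            if spec(inp, candidate_fn(list(inp))):  # copy input so mutation is safe
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails the trial
    return passed / n_trials
```

A correct implementation scores 1.0; a buggy one (e.g., returning the input unchanged) only passes on inputs that happen to already satisfy the spec.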
• khalid @k_saifullaah · 2 years ago

    @_saurabh functional benchmark would certainly be better than our current way of evaluating these models

    [image attached]
• Delta, Dirac (AKA Pan Guy) @DeltaClimbs · 2 years ago

    @_saurabh @aphysicist Reasoning capabilities are total trash afaict. Consider this very specific materials science question it could not answer (it gave nonsensical explanations to wrong answers).
• PFO (e/acc) @pfo_sac · 2 years ago

    @_saurabh this is great, congrats!
• Leonard Tang @leonardtang_ · 2 years ago

    @_saurabh elegant perspective
• Timothy B. Lee @binarybits · 2 years ago

    @_saurabh I'd love to see how Claude 3 scores here.
• Aaron Scher @aaronscher · 2 years ago

    @_saurabh The slightly disingenuous, but seemingly accurate, read of your paper is: We rewrite the format of questions to be harder and less natural, so LLMs do worse on them. I do think there is interesting and useful work in this area, potentially even using your dataset, but man.
• Engineering Randomness @EERandomness · 2 years ago

    @_saurabh This is surprising because I had assumed model trainers would create synthetic data with exactly this approach? And many more sophisticated approaches?
• Vishaal Udandarao @vishaal_urao · 2 years ago

    @_saurabh Great work! The problem of overfitting to static benchmarks is super important, glad to see folks looking into this! We have been thinking about similar directions in a recent paper where we formalise lifelong benchmarking of models: arxiv.org/abs/2402.19472
• mattwallace @mattwallace · 2 years ago

    @_saurabh Paging @fchollet super curious about your take on this
• AidenHStone @AidenHSt · 2 years ago

    @_saurabh Do you have human baseline results? Anecdotally, my middle schoolers often struggle with "equivalent problems" with harder constants, and people in general are sensitive to framing effects (a la Wason selection task)
• AnKo @anko_979 · 2 years ago

    @_saurabh Nice framework, but only mathematical reasoning was evaluated here, where it performs the weakest. Quite different from reasoning about real-world concepts. Personal thorough testing of GPT-4 in articles after pre-training suggests that its wider reasoning is much better
• Clayton Thorrez @cthorrez · 2 years ago

    @_saurabh You should probably clarify the model is GPT-3.5, not GPT-3, since they are different models with completely different capabilities
• Ali Minai @barbarikon · 2 years ago

    @_saurabh In the sense that 99% is “more than 50%”….
• Russell Johnston @RussellJohnston · 2 years ago

    @_saurabh Wander into a first-year philosophy class (or read medieval scholars), and you'll soon discover it's not just the machines that think they're reasoning when they're not, of course.
• BrandenCollingsworth @brandenco · 2 years ago

    @_saurabh Are people using this approach to generate training data?
• GOON MASTER SOPHONT SIMP @SOPHONTSIMP · 2 years ago

    @_saurabh My reaction was “so you’re telling me 40% of LLM performance is true reasoning.” The difference between GPT-3 and GPT-4 at least suggests a positive trend.
• floating point @yar_vol · 2 years ago

    @_saurabh Disappointed to see that even GPT-4 is such a “cheater” :) also surprising that Mixtral and GPT-4 are in a similar ballpark! What happens if you generate lots of examples like these and try fine-tuning?