    Saurabh Srivastava @_saurabh

    2 years ago

    More than 50% of the reported reasoning abilities of LLMs might not be true reasoning.

    How do we evaluate models trained on the entire internet? I.e., what novel questions can we ask of something that has seen all written knowledge? Below: new eval, results, code, and paper.

    Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model’s performance is different on snapshots.

    Big question: Are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)?

    What we find: Gaps in the range of 58% to 80% in a bunch of SOTA models. Motivates us to build Gap 0 models.

    We’re releasing the paper, code, and 3 snapshots of functional MATH() today.

    arxiv draft: arxiv.org/abs/2402.19450
    github repo: github.com/ConsequentAI/f…

    1/🧵
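A minimal sketch of how a gap on the 0 to 100 scale mentioned above could be computed. The formula here is an assumption, not taken from the thread: it treats the gap as the relative drop from accuracy on the static benchmark to accuracy on functional snapshots, and it counts a problem as functionally solved only if the model is correct on every snapshot of it. The names `functional_accuracy` and `reasoning_gap` are hypothetical.

```python
def functional_accuracy(per_problem_results):
    """Fraction of problems answered correctly on every snapshot.

    per_problem_results: list of lists of bools, one inner list per problem,
    one bool per snapshot. (Assumed scoring rule, for illustration only.)
    """
    if not per_problem_results:
        raise ValueError("need at least one problem")
    solved = sum(1 for results in per_problem_results if all(results))
    return solved / len(per_problem_results)


def reasoning_gap(static_acc, functional_acc):
    """Relative drop from static to functional accuracy, in percent.

    Assumed definition: 0 means no drop, 100 means nothing survives
    once the questions are regenerated.
    """
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return 100.0 * (static_acc - functional_acc) / static_acc


if __name__ == "__main__":
    # Hypothetical results: 4 problems, 3 snapshots each.
    results = [
        [True, True, True],
        [True, False, True],
        [False, False, False],
        [True, True, True],
    ]
    func_acc = functional_accuracy(results)
    print("functional accuracy:", func_acc)
    print("gap vs. a static accuracy of 0.75:", reasoning_gap(0.75, func_acc))
```

Under this assumed definition, a model that keeps its static accuracy on every snapshot scores a gap of 0, and a model whose correct answers all disappear on regenerated questions scores 100.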


    Saurabh Srivastava @_saurabh

    2 years ago

    2/ Problem: Testing using static text (Q, A)s is problematic. Contamination is a concern, especially when we don’t know the data for each training step starting from random weights. Also, we cannot keep building new, harder (Q, A)s all the time. We need a longer-term solution.


    Saurabh Srivastava @_saurabh

    2 years ago

    3/ Solution: Represent the reasoning you want to test as code. When you run this code with random inputs, it should generate a “snapshot” (Q, A). So you get to test using the exact same IOs and harness as before, just without repeating any (Q, A) from the past.
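The idea in 3/ can be sketched roughly as follows. This is an illustrative toy, not code from the released repo: the problem type (a one-variable linear equation), the function names `linear_equation_snapshot` and `make_snapshot`, and the seeding scheme are all assumptions made for the example.

```python
import random


def linear_equation_snapshot(rng):
    """Hypothetical functional problem: solve a*x + b = c for x.

    The reasoning (isolate x) is fixed in code; only the numbers change.
    """
    a = rng.randint(2, 9)
    x = rng.randint(-10, 10)   # pick the answer first, so it is always an integer
    b = rng.randint(1, 20)
    c = a * x + b
    question = f"Solve for x: {a}x + {b} = {c}."
    answer = str(x)
    return question, answer


def make_snapshot(seed, n=3):
    """Generate one snapshot: n (Q, A) pairs from a seeded random generator."""
    rng = random.Random(seed)
    return [linear_equation_snapshot(rng) for _ in range(n)]


if __name__ == "__main__":
    # Each seed yields a reproducible snapshot of (Q, A) text pairs.
    for question, answer in make_snapshot(seed=2024):
        print(question, "->", answer)
```

Because each snapshot is still a list of plain (Q, A) text pairs, it can be fed to the same evaluation harness as a static benchmark. Seeding the generator keeps a given snapshot reproducible.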


    Wei Yang @davidyoung8906

    2 years ago

    @_saurabh We have done something similar to address this issue: arxiv.org/abs/2401.15545. We dynamically generate questions for code models.
