👉 New preprint! Today, many of the biggest challenges in LM post-training aren't just about correctness, but rather consistency & coherence across interactions.
This paper tackles some of these issues by optimizing reasoning LMs for calibration rather than accuracy...
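One generic way to score calibration rather than raw accuracy is a Brier-style reward: the model states a confidence in [0, 1] and is penalized by the squared error against the outcome. This is a minimal sketch of that idea, not necessarily the paper's actual objective:

```python
# Brier-style calibration reward (illustrative sketch, not the paper's code):
# the model reports a confidence in [0, 1]; reward is minus the squared
# error between that confidence and whether the answer was correct.
def calibration_reward(confidence: float, correct: bool) -> float:
    return -(confidence - float(correct)) ** 2

# A confidently-correct answer scores near 0; a confidently-wrong one near -1.
print(round(calibration_reward(0.9, True), 3))   # -0.01
print(round(calibration_reward(0.9, False), 3))  # -0.81
```

Under this kind of objective, hedging on hard questions can beat overconfident guessing, which is the behavioral shift the tweet describes.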
✨ New paper ✨
🚨 Scaling test-time compute can lead to inverse or flattened scaling!!
We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways:
➡️ Frontier LLMs struggle on Seal-0 (SealQA’s…
Are AI scientists already better than human researchers?
We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.
Main finding: LLM ideas result in worse projects than human ideas.
I’m looking for a new postdoc to start this fall working on AI for Science/Science-Inspired AI (focusing on chemistry and bioengineering domains for now). Please drop me a CV if interested.
The paper claims LLMs' high scores on coding benchmarks may come from memorizing past GitHub issues, not real reasoning. 😯
The authors build a tiny test: given only the text of an issue, guess the file path that needs fixing.
Models hit up to 76% accuracy on the benchmark set,…
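The probe described above is easy to reproduce in miniature: show a model only the issue text, ask for the file path to fix, and score exact matches. A toy sketch (the `query_model` stub and dataset are hypothetical stand-ins for a real LLM call and benchmark):

```python
# Toy memorization probe: given only an issue's text, guess the file path
# that needs fixing. High accuracy here suggests the issue/repo pair was
# memorized, since the path is not deducible from the text alone.
def path_guess_accuracy(issues, query_model):
    """issues: list of (issue_text, gold_file_path) pairs."""
    hits = sum(
        1 for text, gold in issues
        if query_model(text).strip() == gold
    )
    return hits / len(issues)

# Usage with a trivial stub standing in for the model:
issues = [
    ("crash when parsing nested lists", "src/parser.py"),
    ("login token never expires", "src/auth.py"),
]
stub = lambda text: "src/parser.py"
print(path_guess_accuracy(issues, stub))  # 0.5
```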
LLM reasoning with reinforcement learning focuses on limited domains, hindering general applicability.
This paper develops GURU, a 92,000-example multi-domain dataset, to enable broader reinforcement learning-based reasoning.
Methods 🔧:
- GURU includes Math, Code, Science,…
Large language models exhibit grokking, where generalization improves significantly long after training loss converges.
This paper identifies grokking in large-scale LLM pretraining and provides internal metrics to monitor this delayed generalization without external validation.…
A bit late, but happy to share that LLM-SRBench, our new benchmark targeting the memorization issue in LLMs for scientific discovery, has been selected for an *Oral* presentation at #ICML2025!
Great to see the community recognizing the importance of this direction. Check out the camera-ready…
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorize to lower loss? And why is generalization transient?
Our work explains this & *predicts Transformer behavior throughout training* without its weights! 🧵
1/
This is really BAD news for LLMs' coding skills. ☹️
The best frontier LLMs achieve 0% on hard real-life programming contest problems, a domain where expert humans still excel.
LiveCodeBench Pro, a benchmark composed of
problems from Codeforces, ICPC, and IOI (“International…
This study shows the same models break down on Olympiad problems and cannot even flag their own faulty proofs.
It found that frontier LLMs handle fewer than 4% of Olympiad proofs correctly and misjudge their own flawed reasoning.
Current math benchmarks mark a right answer and…
❓How to balance negative and positive rewards in off-policy RL❓
In Asymmetric REINFORCE for off-Policy RL, we show that giving less weight to negative rewards is enough to stabilize off-policy RL training for LLMs! 💪 (1/8)
Paper: arxiv.org/abs/2506.20520
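The core trick, per the abstract, is down-weighting negative rewards in the policy-gradient update. A minimal sketch of that asymmetry (the factor name `lam` and this toy setup are illustrative, not the paper's code):

```python
# Asymmetric weighting of advantages for an off-policy REINFORCE-style
# update: negative advantages are scaled by lam < 1, positive ones kept
# as-is, which damps the destabilizing effect of punishing off-policy samples.
import numpy as np

def asymmetric_weights(advantages, lam=0.5):
    adv = np.asarray(advantages, dtype=float)
    return np.where(adv < 0, lam * adv, adv)

adv = [2.0, -1.0, 0.5, -3.0]
print(asymmetric_weights(adv, lam=0.5))  # [ 2.  -0.5  0.5 -1.5]
```

With `lam=1.0` this reduces to standard REINFORCE weighting; with `lam=0.0` negative samples are ignored entirely.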
Exciting new RL tooling: A modular library for RL training by the Berkeley NovaSky team. While standard RL training is all done in one loop, it is more efficient for modern post-training to separate the generation of the rollouts from the trainer. It also enables asynchronous…
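The rollout/trainer separation described above can be sketched with a simple producer/consumer queue; this is a generic illustration of the pattern, not the NovaSky library's API:

```python
# Decoupling rollout generation from training: an actor thread produces
# rollouts into a queue while the trainer consumes them asynchronously,
# instead of alternating generate/train steps in one loop.
import queue
import threading

rollout_q = queue.Queue(maxsize=8)

def actor(n):
    # Generation loop: push rollouts independently of the trainer's pace.
    for i in range(n):
        rollout_q.put({"tokens": [i], "reward": float(i)})
    rollout_q.put(None)  # sentinel: no more rollouts

def trainer():
    # Training loop: consume rollouts as they arrive.
    seen = 0
    while (item := rollout_q.get()) is not None:
        seen += 1  # a real trainer would compute a gradient step here
    return seen

t = threading.Thread(target=actor, args=(4,))
t.start()
print(trainer())  # 4
t.join()
```

The same shape generalizes to processes or machines by swapping the in-memory queue for a networked one.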
Github: A fully open source framework for creating RL training swarms over the internet.
Train reinforcement-learning models collaboratively across decentralized peers, leveraging GenRL-Swarm on consumer laptops or GPUs
Plug into a global swarm, contribute compute, and…
Removing knowledge from LLMs is HARD. @GurYoav proposes a powerful approach that disentangles the MLP parameters to edit them in high resolution and remove target concepts from the model. Check it out!
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
"We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant…