🤝 Can LLM agents really understand us?
We introduce UserBench: a user-centric gym environment for benchmarking how well agents align with nuanced human intent, not just follow commands.
📄 arxiv.org/pdf/2507.22034
💻 github.com/SalesforceAIRe…
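To make "gym environment" concrete: below is a minimal, purely illustrative sketch of a user-simulation environment with hidden preferences. The class, methods, and reward rule are assumptions for exposition, not the actual UserBench API.

```python
# Illustrative sketch only: a gym-style loop where the "environment" is a
# simulated user with latent preferences. Names (UserEnv, reset, step) follow
# the usual gym convention and are NOT the actual UserBench API.

class UserEnv:
    def __init__(self, latent_preferences):
        self.prefs = latent_preferences  # hidden from the agent

    def reset(self):
        # The user states only a vague goal; true preferences stay hidden.
        return "I need to book a flight, something reasonable."

    def step(self, agent_message):
        # Reward the agent for surfacing and satisfying hidden preferences,
        # not just for executing the literal command.
        reward = sum(1.0 for p in self.prefs if p in agent_message)
        done = reward == len(self.prefs)
        user_reply = "That works." if done else "Hmm, not quite what I meant."
        return user_reply, reward, done

env = UserEnv(latent_preferences=["aisle seat", "no red-eye"])
obs = env.reset()
obs, reward, done = env.step("Booked an aisle seat on a daytime flight, no red-eye.")
```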
(1/4)🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: Solves 64 problems—with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%.
* 8B > 671B: Our 8B…
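For context on what these leaderboards score: PutnamBench and MiniF2F count a problem as solved only if the model's emitted Lean proof type-checks. A toy example of the format (not a problem from either benchmark):

```lean
-- Toy illustration of the task format: the prover must emit a proof term
-- or tactic script that the Lean checker accepts.
theorem two_add_two : 2 + 2 = 4 := rfl
```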
Reward models (RMs) are key to language model post-training and inference pipelines, but little is known about the relative pros and cons of different RM types.
📰 We investigate why RMs implicitly defined by language models (LMs) often generalize worse than explicit RMs
🧵
1/6
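For readers outside this subarea, the two RM families being compared are standard: an explicit RM scores a response with a dedicated scalar head, while an implicit RM is read off a DPO-trained policy as a scaled log-ratio against a reference model. A minimal sketch of those textbook definitions (placeholder values, not the paper's code):

```python
import math

def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    # Implicit RM (DPO-style): r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)),
    # read off a DPO-trained policy rather than a dedicated scoring head.
    return beta * (policy_logprob - ref_logprob)

def bradley_terry_loss(r_chosen, r_rejected):
    # Both RM types are typically fit to pairwise preferences with the
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Explicit RM: the rewards come from a scalar head on the LM.
# Implicit RM: the rewards come from the log-ratio above.
print(bradley_terry_loss(r_chosen=1.3, r_rejected=0.2))  # ~0.29
```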
🎥 Video is already a tough modality for reasoning. Egocentric video? Even tougher! It is longer, messier, and harder.
💡 How do we tackle these extremely long, information-dense sequences without exhausting GPU memory or hitting API limits?
We introduce 👓Ego-R1: A framework…
Can LLMs make rational decisions like human experts?
📖Introducing DecisionFlow: Advancing Large Language Model as Principled Decision Maker
We introduce a novel framework that constructs a semantically grounded decision space to evaluate trade-offs in hard decision-making…
(1/5) Want to make your LLM a skilled persuader?
Check out our latest paper: "ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind"!
For details:
📄Arxiv: arxiv.org/pdf/2505.22961
🛠️GitHub: github.com/ulab-uiuc/ToMAP
📢 New Paper Drop: From Solving to Modeling!
LLMs can solve math problems — but can they model the real world? 🌍
📄 arXiv: arxiv.org/pdf/2505.15068
💻 Code: github.com/qiancheng0/Mod…
Introducing ModelingAgent, a breakthrough system for real-world mathematical modeling with LLMs.
How do we improve test-time scalability?
- Separate thinking & solution phases to control performance under budget constraint
- Budget-Constrained Rollout + GRPO
- Outperforms baselines on math/code.
- Cuts token usage by 30% without hurting performance (see the sketch below)
huggingface.co/papers/2505.05…
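A hedged sketch of the two-phase, budget-constrained idea above; `generate` stands in for any LM sampling callable, and the tag names are illustrative, not the paper's API.

```python
# Hypothetical sketch: separate thinking and solution phases, each capped
# by its own token budget. `generate` is any LM sampling callable.

def budget_constrained_rollout(generate, prompt,
                               think_budget=512, solution_budget=256):
    # Phase 1: spend at most `think_budget` tokens on reasoning.
    thoughts = generate(prompt + "\n<think>",
                        max_new_tokens=think_budget, stop="</think>")
    # Phase 2: force the transition to the answer so the budget holds even
    # if the model would otherwise keep thinking.
    solution = generate(prompt + "\n<think>" + thoughts + "</think>\n<answer>",
                        max_new_tokens=solution_budget, stop="</answer>")
    return thoughts, solution
```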
🚀 Can we cast reward modeling as a reasoning task?
📖 Introducing our new paper:
RM-R1: Reward Modeling as Reasoning
📑 Paper: arxiv.org/pdf/2505.02387
💻 Code: github.com/RM-R1-UIUC/RM-…
Inspired by recent advances in long chain-of-thought (CoT) reasoning on reasoning-intensive tasks, we…
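"Reward modeling as reasoning" roughly means the judge generates a critique before committing to a verdict. A hedged sketch of that pattern; the prompt template and `llm` callable are illustrative stand-ins, not RM-R1's actual template or code.

```python
# Illustrative only: a generative judge that reasons before deciding.
# The template and `llm` callable are stand-ins, not RM-R1's actual code.

JUDGE_TEMPLATE = """Compare the two responses to the prompt below.
First write out your reasoning step by step, then end with exactly
'Verdict: A' or 'Verdict: B'.

Prompt: {prompt}
Response A: {a}
Response B: {b}
"""

def reasoning_judge(llm, prompt, response_a, response_b):
    output = llm(JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b))
    # The chain of thought comes first; the reward signal is the final verdict.
    return "A" if output.strip().endswith("Verdict: A") else "B"
```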
We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs.
– Achieves 2–4× faster convergence than RAFT
– Improves accuracy on math…
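One way to picture "dynamic sampling to minimize gradient variance": give each prompt a share of the rollout budget proportional to its estimated variance contribution (square-root allocation, as in stratified sampling). A hedged sketch; the estimator and names are illustrative, not the GVM-RAFT implementation.

```python
# Hedged sketch of variance-proportional sample allocation; illustrative
# names, not the GVM-RAFT code.
import math

def allocate_rollouts(variance_estimates, total_budget):
    # Sample high-variance prompts more, in proportion to the square root
    # of their estimated gradient variance, to shrink total variance under
    # a fixed rollout budget.
    weights = [math.sqrt(v) for v in variance_estimates]
    z = sum(weights) or 1.0
    return [max(1, round(total_budget * w / z)) for w in weights]

# Prompt 3 contributes the most variance, so it gets the most samples.
print(allocate_rollouts([0.1, 0.4, 1.6], total_budget=32))  # [5, 9, 18]
```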
Thrilled to announce that our paper Sparse VideoGen got into #ICML2025! 🎉
Our new approach speeds up video generation by 2×. Details in the thread/paper.
Huge thanks to my collaborators!
Blog: svg-project.github.io
Paper: arxiv.org/abs/2502.01776
Code:…
Thrilled to share my first project at NVIDIA! ✨
Today’s language models are pre-trained on vast and chaotic Internet texts, but these texts are unstructured and poorly understood. We propose CLIMB — Clustering-based Iterative Data Mixture Bootstrapping — a fully automated…
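Schematically, a clustering-based iterative data-mixture loop might look like the sketch below: embed and cluster the raw corpus, train a small proxy model on a candidate mixture, score clusters against a target metric, and reweight. Every function name here is a placeholder, not the CLIMB code.

```python
# Schematic of a clustering-based iterative data-mixture loop; embed, kmeans,
# train_proxy, and eval_target are placeholders for whatever embedder and
# trainer you use, not NVIDIA's CLIMB implementation.

def climb_style_loop(docs, embed, kmeans, train_proxy, eval_target,
                     k=16, iterations=5):
    clusters = kmeans(embed(docs), k)            # group raw web text by topic
    weights = [1.0 / k] * k                      # start from a uniform mixture
    for _ in range(iterations):
        proxy = train_proxy(clusters, weights)   # small model on the mixture
        scores = eval_target(proxy, clusters)    # per-cluster utility signal
        z = sum(scores) or 1.0
        # Shift the mixture toward clusters that improve the target metric.
        weights = [0.5 * w + 0.5 * (s / z) for w, s in zip(weights, scores)]
    return weights
```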
🚀 Excited to share our latest work on Iterative-DPO for math reasoning! Inspired by DeepSeek-R1 & rule-based PPO, we trained Qwen2.5-MATH-7B on Numina-Math prompts. Our model achieves 47.0% pass@1 on AIME24, MATH500, AMC, Minerva-Math, OlympiadBench—outperforming…
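The DPO objective itself is standard: given log-probabilities of chosen and rejected responses under the policy and a frozen reference, minimize -log σ(β·margin). A minimal worked version below; the iterative, rule-based pair construction is summarized in comments as an assumption about the setup, not the authors' exact recipe.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective on one preference pair:
    #   -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Iterative-DPO, schematically (assumed setup, not the authors' exact recipe):
# 1) sample k solutions per prompt from the current policy;
# 2) label each with a rule-based checker (e.g., exact match on the answer);
# 3) form (correct, incorrect) pairs and minimize dpo_loss over them;
# 4) repeat from step 1 with the updated policy.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # beta=0.1 -> margin 0.2
```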