[17/n] Final thoughts
EvoLM offers:
🔓 100+ open LLMs
📜 Controlled full-stage training
📊 Evaluations across cloze, generative, ID/OOD tasks
📦 Full code, data, and ongoing support
Kudos to the team @ZhentingQi, @FanNie1208, @AlexAlahi, @james_y_zou, @hima_lakkaraju,…
[16/n] Takeaway 1️⃣3️⃣
“ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss. Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies…
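A minimal sketch of this validation recipe (the `policy` / `orm` objects and their `generate` / `score` methods are hypothetical placeholders, not the paper's actual API):
```python
# Hedged sketch: use the mean ORM score over sampled rollouts as an
# unsupervised validation signal during post-training, instead of
# (or alongside) validation loss. `policy.generate` and `orm.score`
# stand in for a policy model and an 8B outcome reward model.

def orm_validation_score(policy, orm, prompts, n_rollouts=4):
    """Average outcome-reward-model score over sampled rollouts."""
    scores = []
    for prompt in prompts:
        for _ in range(n_rollouts):
            rollout = policy.generate(prompt)          # sample one solution
            scores.append(orm.score(prompt, rollout))  # scalar reward for the outcome
    return sum(scores) / len(scores)

# Checkpoint selection would then keep the checkpoint with the highest
# mean ORM score rather than the lowest validation loss, e.g.:
# best = max(checkpoints, key=lambda p: orm_validation_score(p, orm, val_prompts))
```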
[15/n] Takeaway 1️⃣2️⃣
“Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.”
With 100K total examples:
90K SFT + 10K RL = best ID…
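A toy illustration of the budget split (the 100K total and the 90K/10K allocation are the numbers reported in the thread; everything else is illustrative):
```python
# Illustrative only: allocate a fixed downstream data budget between SFT and RL.
TOTAL_EXAMPLES = 100_000

def split_budget(sft_fraction: float) -> tuple[int, int]:
    """Return (n_sft, n_rl) for a given share of the budget devoted to SFT."""
    n_sft = int(TOTAL_EXAMPLES * sft_fraction)
    return n_sft, TOTAL_EXAMPLES - n_sft

# 90K SFT + 10K RL gave the best in-domain results in the thread;
# shifting budget toward RL trades some ID accuracy for better OOD generalization.
print(split_budget(0.9))  # (90000, 10000)
```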
[14/n] Takeaway 1️⃣1️⃣
“Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but may not necessarily improve models’ fundamental reasoning capabilities.”
RL amplifies confidence, not competence.
[13/n] Takeaway 🔟 - “RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks, but with diminishing returns.”
We scale RL epochs and dataset sizes separately.
Performance peaks at ~8 epochs or ~100K examples for 1B models.
After that,…
[12/n] Takeaway 9️⃣ - “Excessive SFT, especially overly large epochs, could limit further RL improvements.”
Once the model has memorized the SFT data, RL has little room to improve it further.
→ Overfitting in SFT bottlenecks RL
🛑 Stop SFT early if planning RL.
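A hedged sketch of that rule of thumb (the epoch caps and the `train_sft_epoch` / `run_rl` callables are hypothetical and passed in, not taken from the released code):
```python
# Illustrative only: cap SFT epochs when RL follows, so RL keeps headroom.
def post_train(model, sft_data, rl_data, train_sft_epoch, run_rl,
               plan_rl=True, sft_epochs_if_rl=2, sft_epochs_if_not=8):
    """Run SFT then (optionally) RL; shorten SFT when RL is planned.

    `train_sft_epoch(model, data)` and `run_rl(model, data)` are placeholder
    callables; the epoch counts are illustrative, not values from the paper.
    """
    n_epochs = sft_epochs_if_rl if plan_rl else sft_epochs_if_not
    for _ in range(n_epochs):
        model = train_sft_epoch(model, sft_data)   # stop before SFT overfits/memorizes
    if plan_rl:
        model = run_rl(model, rl_data)             # RL still has room to improve
    return model
```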
[11/n] Takeaway 8️⃣ - “Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.”
We scale both epochs (1–32) and dataset size (50K–400K):
ID metrics 📈
OOD metrics plateau or drop
⚖️ Balance SFT…
[10/n] Takeaway 7️⃣ - “With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.”
With enough CPT (e.g. 42B math tokens), post-trained models can generalize well to OOD…
[9/n] Takeaway 6️⃣ - “As domain-specific CPT data increase, in-domain downstream performance steadily improves, and the SFT models could benefit more from RL finetuning.”
Scaling CPT from 2B → 42B tokens = monotonic ID performance gains.
Plus:
🟡 RL helps more when CPT is…
[8/n] Takeaway 5️⃣ - “Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.”
Without CPT, even strong pre-training leads to poor downstream performance.…
[7/n] Takeaway 4️⃣ - “Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this…
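A minimal sketch of the replay idea, assuming plain list-based corpora (the 5% budget comes from the thread; the sampling scheme here is illustrative):
```python
import random

def build_cpt_mixture(domain_docs, general_docs, replay_ratio=0.05, seed=0):
    """Mix a small replay budget of general pre-training data into the
    domain-specific CPT corpus to mitigate catastrophic forgetting.

    With replay_ratio=0.05, roughly 5% of the resulting corpus is replayed
    general-domain data; `domain_docs` / `general_docs` are lists of documents.
    """
    rng = random.Random(seed)
    n_replay = round(len(domain_docs) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(general_docs, min(n_replay, len(general_docs)))
    mixture = domain_docs + replay
    rng.shuffle(mixture)
    return mixture

# Example: 950 domain docs + ~50 replayed general docs ≈ 5% replay budget.
mix = build_cpt_mixture([f"math_{i}" for i in range(950)],
                        [f"web_{i}" for i in range(5000)])
```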
[6/n] Takeaway 3️⃣ -
“Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts.
Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD…
[5/n] Takeaway 2️⃣ - “Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks.”
We evaluated SFT and RL models initialized from various pre-training budgets.
Beyond 80–160B…
[3/n] In EvoLM, we
✅ Build a fully transparent and reproducible model suite for studying LM training
✅ Quantify how each training phase contributes to upstream cloze task performance and downstream generative task performance, considering both in-domain and out-of-domain…
[2/n] We train 100+ decoder-only LMs (1B/4B) from scratch, across four training stages —
🟦 Pre-training
🟩 Continued Pre-Training (CPT)
🟨 Supervised Fine-Tuning (SFT)
🟥 Reinforcement Learning (RL)
Under controlled conditions and with full transparency regarding the data and…
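A high-level sketch of what "four stages under controlled conditions" looks like as code (the stage callables and corpus names are placeholders standing in for the released training scripts):
```python
# Illustrative pipeline skeleton: the four training stages run in order,
# each consuming its own controlled data budget.
from typing import Any, Callable, Sequence

Stage = tuple[str, Callable[[Any, str], Any], str]  # (name, train_fn, corpus_id)

def run_pipeline(model: Any, stages: Sequence[Stage]) -> Any:
    """Apply Pre-training -> CPT -> SFT -> RL, returning the final model."""
    for name, train_fn, corpus in stages:
        model = train_fn(model, corpus)   # each stage's data and budget is controlled
    return model

# Hypothetical stage list wrapping the released training code:
# stages = [
#     ("pretrain", pretrain,            "general_web_corpus"),
#     ("cpt",      continued_pretrain,  "domain_math_corpus"),
#     ("sft",      supervised_finetune, "instruction_data"),
#     ("rl",       rl_finetune,         "reward_labelled_data"),
# ]
```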
[1/n] Discussions about LM reasoning and post-training have gained momentum. We identify several missing pieces:
✏️ Post-training is typically built on off-the-shelf base models whose pre-training data composition and scale are not transparent.
✏️ Intermediate checkpoints with incomplete learning…