[17/n] Final thoughts
EvoLM offers:
🔓 100+ open LLMs
📜 Controlled full-stage training
📊 Evaluations across cloze, generative, ID/OOD tasks
📦 Full code, data, and ongoing support
Kudos to the team @ZhentingQi, @FanNie1208, @AlexAlahi, @james_y_zou, @hima_lakkaraju,…
[16/n] Takeaway 1️⃣3️⃣
“ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss. Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies…
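A minimal sketch of this validation recipe (the `policy` / `orm` objects and their `generate` / `score` methods are hypothetical placeholders, not the paper's actual API):
```python
# Hedged sketch: use the mean ORM score over sampled rollouts as an
# unsupervised validation signal during post-training, instead of
# (or alongside) validation loss. `policy.generate` and `orm.score`
# stand in for a policy model and an 8B outcome reward model.

def orm_validation_score(policy, orm, prompts, n_rollouts=4):
    """Average outcome-reward-model score over sampled rollouts."""
    scores = []
    for prompt in prompts:
        for _ in range(n_rollouts):
            rollout = policy.generate(prompt)          # sample one solution
            scores.append(orm.score(prompt, rollout))  # scalar reward for the outcome
    return sum(scores) / len(scores)

# Checkpoint selection would then keep the checkpoint with the highest
# mean ORM score rather than the lowest validation loss, e.g.:
# best = max(checkpoints, key=lambda p: orm_validation_score(p, orm, val_prompts))
```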
[15/n] Takeaway 1️⃣2️⃣
“Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.”
With 100K total examples:
90K SFT + 10K RL = best ID…
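A toy illustration of the budget split (the 100K total and the 90K/10K allocation are the numbers reported in the thread; everything else is illustrative):
```python
# Illustrative only: allocate a fixed downstream data budget between SFT and RL.
TOTAL_EXAMPLES = 100_000

def split_budget(sft_fraction: float) -> tuple[int, int]:
    """Return (n_sft, n_rl) for a given share of the budget devoted to SFT."""
    n_sft = int(TOTAL_EXAMPLES * sft_fraction)
    return n_sft, TOTAL_EXAMPLES - n_sft

# 90K SFT + 10K RL gave the best in-domain results in the thread;
# shifting budget toward RL trades some ID accuracy for better OOD generalization.
print(split_budget(0.9))  # (90000, 10000)
```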
[14/n] Takeaway 1️⃣1️⃣
“Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but may not necessarily improve models’ fundamental reasoning capabilities.”
RL amplifies confidence, not competence.
[13/n] Takeaway 🔟 - “RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks, but with diminishing returns.”
We scale RL epochs and dataset sizes separately.
Performance peaks at ~8 epochs or ~100K examples for 1B models.
After that,…
[12/n] Takeaway 9️⃣ - “Excessive SFT, especially overly large epochs, could limit further RL improvements.”
Once the model has memorized the SFT data, RL has little room to improve it further.
→ Overfitting in SFT bottlenecks RL
🛑 Stop SFT early if planning RL.
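A hedged sketch of that rule of thumb (the epoch caps and the `train_sft_epoch` / `run_rl` callables are hypothetical and passed in, not taken from the released code):
```python
# Illustrative only: cap SFT epochs when RL follows, so RL keeps headroom.
def post_train(model, sft_data, rl_data, train_sft_epoch, run_rl,
               plan_rl=True, sft_epochs_if_rl=2, sft_epochs_if_not=8):
    """Run SFT then (optionally) RL; shorten SFT when RL is planned.

    `train_sft_epoch(model, data)` and `run_rl(model, data)` are placeholder
    callables; the epoch counts are illustrative, not values from the paper.
    """
    n_epochs = sft_epochs_if_rl if plan_rl else sft_epochs_if_not
    for _ in range(n_epochs):
        model = train_sft_epoch(model, sft_data)   # stop before SFT overfits/memorizes
    if plan_rl:
        model = run_rl(model, rl_data)             # RL still has room to improve
    return model
```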
[11/n] Takeaway 8️⃣ - “Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.”
We scale both epochs (1–32) and dataset size (50K–400K):
ID metrics 📈
OOD metrics plateau or drop
⚖️ Balance SFT…
[10/n] Takeaway 7️⃣ - “With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.”
With enough CPT (e.g. 42B math tokens), post-trained models can generalize well to OOD…
[9/n] Takeaway 6️⃣ - “As domain-specific CPT data increase, in-domain downstream performance steadily improves, and the SFT models could benefit more from RL finetuning.”
Scaling CPT from 2B → 42B tokens = monotonic ID performance gains.
Plus:
🟡 RL helps more when CPT is…
[8/n] Takeaway 5️⃣ - “Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.”
Without CPT, even strong pre-training leads to poor downstream performance.…
[7/n] Takeaway 4️⃣ - “Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this…
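A minimal sketch of the replay idea, assuming plain list-based corpora (the 5% budget comes from the thread; the sampling scheme here is illustrative):
```python
import random

def build_cpt_mixture(domain_docs, general_docs, replay_ratio=0.05, seed=0):
    """Mix a small replay budget of general pre-training data into the
    domain-specific CPT corpus to mitigate catastrophic forgetting.

    With replay_ratio=0.05, roughly 5% of the resulting corpus is replayed
    general-domain data; `domain_docs` / `general_docs` are lists of documents.
    """
    rng = random.Random(seed)
    n_replay = round(len(domain_docs) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(general_docs, min(n_replay, len(general_docs)))
    mixture = domain_docs + replay
    rng.shuffle(mixture)
    return mixture

# Example: 950 domain docs + ~50 replayed general docs ≈ 5% replay budget.
mix = build_cpt_mixture([f"math_{i}" for i in range(950)],
                        [f"web_{i}" for i in range(5000)])
```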
[6/n] Takeaway 3️⃣ -
“Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts.
Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD…
[5/n] Takeaway 2️⃣ - “Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks.”
We evaluated SFT and RL models initialized from various pre-training budgets.
Beyond 80–160B…
[3/n] In EvoLM, we
✅ Build a fully transparent and reproducible model suite for studying LM training
✅ Quantify how each training phase contributes to upstream cloze task performance and downstream generative task performance, considering both in-domain and out-of-domain…
[2/n] We train 100+ decoder-only LMs (1B/4B) from scratch, across four training stages —
🟦 Pre-training
🟩 Continued Pre-Training (CPT)
🟨 Supervised Fine-Tuning (SFT)
🟥 Reinforcement Learning (RL)
Under controlled conditions and with full transparency regarding the data and…
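A high-level sketch of what "four stages under controlled conditions" looks like as code (the stage callables and corpus names are placeholders standing in for the released training scripts):
```python
# Illustrative pipeline skeleton: the four training stages run in order,
# each consuming its own controlled data budget.
from typing import Any, Callable, Sequence

Stage = tuple[str, Callable[[Any, str], Any], str]  # (name, train_fn, corpus_id)

def run_pipeline(model: Any, stages: Sequence[Stage]) -> Any:
    """Apply Pre-training -> CPT -> SFT -> RL, returning the final model."""
    for name, train_fn, corpus in stages:
        model = train_fn(model, corpus)   # each stage's data and budget is controlled
    return model

# Hypothetical stage list wrapping the released training code:
# stages = [
#     ("pretrain", pretrain,            "general_web_corpus"),
#     ("cpt",      continued_pretrain,  "domain_math_corpus"),
#     ("sft",      supervised_finetune, "instruction_data"),
#     ("rl",       rl_finetune,         "reward_labelled_data"),
# ]
```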
[1/n] Discussions about LM reasoning and post-training have gained momentum. We identify several missing pieces:
✏️ Post-training is typically built on off-the-shelf base models whose pre-training data composition and scale are not transparent.
✏️ Intermediate checkpoints with incomplete learning…