Great to see this effort towards rigorous hyperparameter tuning. Two areas for improvement:
1. IIUC, the scaled up run here isn't actually tuned at all - its hparams are set via extrapolation
2. Sensitive hparams need a more granular sweep than power-of-2
x.com/percyliang/sta…
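To make point 2 concrete, a minimal sketch (illustrative values only) of what a finer-than-power-of-2 sweep looks like, using half-power-of-2 geometric spacing over the same learning-rate range:

```python
import numpy as np

# Power-of-2 sweep: candidates double each step, so resolution is coarse.
coarse = [2.0 ** k for k in range(-12, -5)]        # 2^-12 ... 2^-6

# Finer geometric sweep: ratio sqrt(2) halves the gap between candidates,
# which matters for sensitive hparams (like LR) whose loss curve is sharp
# near the optimum.
fine = [2.0 ** (k / 2) for k in range(-24, -11)]   # 2^-12 ... 2^-6 in sqrt(2) steps

print(np.round(coarse, 6))
print(np.round(fine, 6))
```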
Following up on my Newton–Schulz speedup post, here’s the code: github.com/thib-s/flash-n… (I'll do a PR soon in Dion/Muon)
And here’s how I squeezed out the extra gain ⬇️
🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge?
In new work with @AIatMeta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results:
* 𝟔𝟔% on SimpleQA w/ an 8B model by studying the Wikipedia…
Good news: I managed to get an extra 1.6x speedup of the Newton–Schulz algorithm (which is at the core of Dion/Muon). It reaches nearly a 3x speedup over the plain torch implementation!
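For reference, a minimal sketch of the baseline quintic Newton–Schulz orthogonalization used inside Muon (coefficients as in Keller Jordan's reference implementation); the linked repo presumably fuses and optimizes an iteration like this:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton–Schulz iteration that drives the singular values of G
    # toward 1, approximating the orthogonal polar factor U V^T.
    a, b, c = 3.4445, -4.7750, 2.0315   # tuned coefficients from the Muon repo
    X = G.bfloat16()
    if G.size(0) > G.size(1):           # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + eps)            # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X               # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```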
Motif 2.6B tech report is pretty insane, the first time I've seen a model with differential attention and PolyNorm trained at scale!
> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training (sketched below).
> They use WSD with a "Simple moving…
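To illustrate the concept (a generic interpolation, not the report's actual schedule), a data mixture schedule just makes the per-domain sampling weights a function of the training step:

```python
import numpy as np

def mixture_at(step: int, total_steps: int,
               start_w: list[float], end_w: list[float]) -> np.ndarray:
    # Linearly interpolate domain sampling weights over training, then
    # renormalize so they stay a valid distribution.
    t = step / total_steps
    w = (1 - t) * np.array(start_w) + t * np.array(end_w)
    return w / w.sum()

# Toy usage: shift weight from web text toward code/math late in training.
print(mixture_at(0,      10_000, [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]))  # early
print(mixture_at(10_000, 10_000, [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]))  # late
```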
So we went from
"LLM is memorizing dataset"
to
"LLM is not reasoning"
to
"LLM cannot do long / complex math proving"
to
"Math that LLM is doing is not REAL math. LLM can't do REAL math"
Where do we go from here?
I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon.
---
The general solution turned out to be much simpler than I thought. And it should…
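For context, a hedged sketch (assumed from the earlier Muon work, not taken from the linked thread) of the flat special case this generalizes, steepest descent under the spectral norm:

```latex
% Steepest descent under a norm: pick the unit-norm update best aligned
% with the gradient G,
\[
  \Delta W^{\star} \;=\; \arg\max_{\lVert \Delta W \rVert \,\le\, 1} \langle G, \Delta W \rangle .
\]
% For the spectral norm, with SVD G = U \Sigma V^{\top}, the maximizer is the
% orthogonal polar factor -- exactly what Muon approximates via Newton--Schulz:
\[
  \Delta W^{\star} \;=\; U V^{\top} .
\]
```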
It is interesting that the new @deepseek_ai v3.1 is trained using the UE8M0 FP8 scale data format, which is a logarithmic number system. Our multiplicative weights update (Madam) for training in that format was done several years ago while at @nvidia. It yields maximum hardware…
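A hedged sketch of why UE8M0 is logarithmic and why multiplicative updates suit it: E8M0 stores only an 8-bit exponent (no sign, no mantissa, bias 127 per the OCP MX spec), so every representable scale is a power of two, and rescaling by 2^k is just integer addition on the stored code. Helper names below are illustrative:

```python
import numpy as np

def ue8m0_decode(code: int) -> float:
    # E8M0: no sign bit, no mantissa bits -- the code IS the biased exponent.
    return 2.0 ** (code - 127)

def ue8m0_encode(x: float) -> int:
    # Round to the nearest power of two; clip since 0xFF is reserved (NaN).
    return int(np.clip(round(np.log2(x)) + 127, 0, 254))

code = ue8m0_encode(0.25)      # 0.25 = 2^-2  ->  code 125
print(code, ue8m0_decode(code))

# A multiplicative update w <- w * 2^delta never leaves the representable set:
# it adds delta to the exponent code, which is why log-number formats pair
# naturally with multiplicative weight updates like Madam.
print(ue8m0_decode(code + 3))  # 0.25 * 2^3 = 2.0
```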
Introducing DeepSeek-V3.1: our first step toward the agent era! 🚀
🧠 Hybrid inference: Think & Non-Think — one model, two modes
⚡️ Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528
🛠️ Stronger agent skills: Post-training boosts tool use and…
New DeepSeek v3.1 model is out 🐳
It's a hybrid reasoning model; the key difference between V3 and V3.1 is a much, much longer long-context extension phase.
Guess it's time to repost the paper review. Most innovative part for me along with seed-geometry is the generation of "conjectures" (now a lot on my mind for all kind of model/synth designs). x.com/Dorialexander/…
Training phi-4-reasoning with GFPO cuts GRPO’s length inflation by
- 71% on AIME 25
- 80% on GPQA
- 83% on Omni-MATH
- 80% on LiveCodeBench (WITHOUT training on code!)
…while matching or beating GRPO in accuracy
Vaish sharing a great 🧵 below!
GRPO makes reasoning models yap a lot, but there's a simple fix:
Sample more responses during training, and train on the shortest ones.
This creates a length pressure that makes the model sound much more terse, without sacrificing accuracy!!
Examples of GRPO vs GFPO versions…
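A minimal sketch of that filtering step (function and field names are hypothetical; the actual GFPO objective is in the paper): sample a larger group, retain only the shortest high-reward responses, and compute GRPO-style advantages over the retained subset:

```python
import numpy as np

def gfpo_filter(responses, rewards, keep_m):
    """Keep the keep_m shortest high-reward responses from a sampled group
    and return GRPO-style group-normalized advantages for just those."""
    # Sort by reward (descending), then by length (ascending).
    order = sorted(range(len(responses)),
                   key=lambda i: (-rewards[i], len(responses[i])))
    kept = order[:keep_m]
    r = np.array([rewards[i] for i in kept], dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group-normalized, as in GRPO
    return kept, adv

# Toy usage: 4 sampled responses, keep the 2 shortest correct ones.
resps = ["short proof", "a much longer rambling proof", "ok", "wrong"]
kept, adv = gfpo_filter(resps, rewards=[1, 1, 1, 0], keep_m=2)
print(kept, adv)   # indices of "ok" and "short proof"
```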
Thinking Less at test-time requires Sampling More at training-time!
GFPO, a new, cool, and simple policy optimization algorithm, is coming to your RL Gym tonight, led by @VaishShrivas and our MSR group:
Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…
Math folks look at AI output, it's clearly legit math, and they're like, yeah this thing knows how to do math
Humanities folks look at AI output, and are immediately like: this thing can't possibly be as good at interpreting texts as we are