Great to see this effort towards rigorous hyperparameter tuning. Two areas for improvement:
1. IIUC, the scaled up run here isn't actually tuned at all - its hparams are set via extrapolation
2. Sensitive hparams need a more granular sweep than power-of-2
x.com/percyliang/sta…
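To make point 2 concrete, a minimal sketch (illustrative values only) of what a finer-than-power-of-2 sweep looks like, using half-power-of-2 geometric spacing over the same learning-rate range:

```python
import numpy as np

# Power-of-2 sweep: candidates double each step, so resolution is coarse.
coarse = [2.0 ** k for k in range(-12, -5)]        # 2^-12 ... 2^-6

# Finer geometric sweep: ratio sqrt(2) halves the gap between candidates,
# which matters for sensitive hparams (like LR) whose loss curve is sharp
# near the optimum.
fine = [2.0 ** (k / 2) for k in range(-24, -11)]   # 2^-12 ... 2^-6 in sqrt(2) steps

print(np.round(coarse, 6))
print(np.round(fine, 6))
```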
Following up on my Newton–Schulz speedup post, here’s the code: github.com/thib-s/flash-n… (I'll do a PR soon in Dion/Muon)
And here’s how I squeezed out the extra gain ⬇️
🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge?
In new work with @AIatMeta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results:
* 𝟔𝟔% on SimpleQA w/ an 8B model by studying the Wikipedia…
Good news: I managed to get an extra 1.6x speedup of the Newton–Schulz algorithm (which is at the core of Dion/Muon). It reaches nearly a 3x speedup over the plain torch implementation!
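For reference, a minimal sketch of the baseline quintic Newton–Schulz orthogonalization used inside Muon (coefficients as in Keller Jordan's reference implementation); the linked repo presumably fuses and optimizes an iteration like this:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton–Schulz iteration that drives the singular values of G
    # toward 1, approximating the orthogonal polar factor U V^T.
    a, b, c = 3.4445, -4.7750, 2.0315   # tuned coefficients from the Muon repo
    X = G.bfloat16()
    if G.size(0) > G.size(1):           # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + eps)            # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X               # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```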
Motif 2.6B tech report is pretty insane, the first time I've seen a model with differential attention and PolyNorm trained at scale!
> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training (sketched below).
> They use WSD with a "Simple moving…
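To illustrate the concept (a generic interpolation, not the report's actual schedule), a data mixture schedule just makes the per-domain sampling weights a function of the training step:

```python
import numpy as np

def mixture_at(step: int, total_steps: int,
               start_w: list[float], end_w: list[float]) -> np.ndarray:
    # Linearly interpolate domain sampling weights over training, then
    # renormalize so they stay a valid distribution.
    t = step / total_steps
    w = (1 - t) * np.array(start_w) + t * np.array(end_w)
    return w / w.sum()

# Toy usage: shift weight from web text toward code/math late in training.
print(mixture_at(0,      10_000, [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]))  # early
print(mixture_at(10_000, 10_000, [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]))  # late
```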
So we went from
"LLM is memorizing dataset"
to
"LLM is not reasoning"
to
"LLM cannot do long / complex math proving"
to
"Math that LLM is doing is not REAL math. LLM can't do REAL math"
Where do we go from here?
I've finally solved steepest descent on Finsler-structured (matrix) manifolds more generally. This generalizes work by me, @jxbz, and @Jianlin_S on Muon, Orthogonal Muon, & Stiefel Muon.
---
The general solution turned out to be much simpler than I thought. And it should…
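For context, a hedged sketch (assumed from the earlier Muon work, not taken from the linked thread) of the flat special case this generalizes, steepest descent under the spectral norm:

```latex
% Steepest descent under a norm: pick the unit-norm update best aligned
% with the gradient G,
\[
  \Delta W^{\star} \;=\; \arg\max_{\lVert \Delta W \rVert \,\le\, 1} \langle G, \Delta W \rangle .
\]
% For the spectral norm, with SVD G = U \Sigma V^{\top}, the maximizer is the
% orthogonal polar factor -- exactly what Muon approximates via Newton--Schulz:
\[
  \Delta W^{\star} \;=\; U V^{\top} .
\]
```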
It is interesting that the new @deepseek_ai v3.1 is trained using the UE8M0 FP8 scale data format, which is a logarithmic number system. Our multiplicative weights update (Madam) for training in that format was done several years ago while at @nvidia. It yields maximum hardware…
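A hedged sketch of why UE8M0 is logarithmic and why multiplicative updates suit it: E8M0 stores only an 8-bit exponent (no sign, no mantissa, bias 127 per the OCP MX spec), so every representable scale is a power of two, and rescaling by 2^k is just integer addition on the stored code. Helper names below are illustrative:

```python
import numpy as np

def ue8m0_decode(code: int) -> float:
    # E8M0: no sign bit, no mantissa bits -- the code IS the biased exponent.
    return 2.0 ** (code - 127)

def ue8m0_encode(x: float) -> int:
    # Round to the nearest power of two; clip since 0xFF is reserved (NaN).
    return int(np.clip(round(np.log2(x)) + 127, 0, 254))

code = ue8m0_encode(0.25)      # 0.25 = 2^-2  ->  code 125
print(code, ue8m0_decode(code))

# A multiplicative update w <- w * 2^delta never leaves the representable set:
# it adds delta to the exponent code, which is why log-number formats pair
# naturally with multiplicative weight updates like Madam.
print(ue8m0_decode(code + 3))  # 0.25 * 2^3 = 2.0
```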
Introducing DeepSeek-V3.1: our first step toward the agent era! 🚀
🧠 Hybrid inference: Think & Non-Think — one model, two modes
⚡️ Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528
🛠️ Stronger agent skills: Post-training boosts tool use and…
New DeepSeek v3.1 model is out 🐳
It's a hybrid reasoning model; the key difference between V3 and V3.1 is a much, much longer long-context extension phase.
Guess it's time to repost the paper review. Most innovative part for me along with seed-geometry is the generation of "conjectures" (now a lot on my mind for all kind of model/synth designs). x.com/Dorialexander/…
Training phi-4-reasoning with GFPO cuts GRPO’s length inflation by
- 71% on AIME 25
- 80% on GPQA
- 83% on Omni-MATH
- 80% on LiveCodeBench (WITHOUT training on code!)
…while matching or beating GRPO in accuracy
Vaish sharing a great 🧵 below!
GRPO makes reasoning models yap a lot, but there's a simple fix:
Sample more responses during training, and train on the shortest ones.
This creates a length pressure that makes the model sound much more terse, without sacrificing accuracy!!
Examples of GRPO vs GFPO versions…
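A minimal sketch of that filtering step (function and field names are hypothetical; the actual GFPO objective is in the paper): sample a larger group, retain only the shortest high-reward responses, and compute GRPO-style advantages over the retained subset:

```python
import numpy as np

def gfpo_filter(responses, rewards, keep_m):
    """Keep the keep_m shortest high-reward responses from a sampled group
    and return GRPO-style group-normalized advantages for just those."""
    # Sort by reward (descending), then by length (ascending).
    order = sorted(range(len(responses)),
                   key=lambda i: (-rewards[i], len(responses[i])))
    kept = order[:keep_m]
    r = np.array([rewards[i] for i in kept], dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group-normalized, as in GRPO
    return kept, adv

# Toy usage: 4 sampled responses, keep the 2 shortest correct ones.
resps = ["short proof", "a much longer rambling proof", "ok", "wrong"]
kept, adv = gfpo_filter(resps, rewards=[1, 1, 1, 0], keep_m=2)
print(kept, adv)   # indices of "ok" and "short proof"
```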
Thinking Less at test-time requires Sampling More at training-time!
GFPO, a new, cool, and simple policy optimization algorithm, is coming to your RL Gym tonight, led by @VaishShrivas and our MSR group:
Group Filtered PO (GFPO) trades off training-time with test-time compute, in order…
Math folks look at AI output, it's clearly legit math, and they're like, yeah this thing knows how to do math
Humanities folks look at AI output, and are immediately like: this thing can't possibly be as good at interpreting texts as we are