Wow, compositional generalization across depth! This beautifully resolves the question of whether RL lets the model acquire new skills. At the very least, learning to compose atomic skills is itself a new skill.
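To make "composition across depth" concrete, here is a minimal sketch of the kind of eval this implies: train on shallow chains of atomic string operations, test on deeper chains. The atoms, task format, and depth split are my own illustration, not the paper's benchmark.

```python
import random

# Hypothetical atomic skills: each is a simple, individually learnable transform.
ATOMS = {
    "rev": lambda s: s[::-1],        # reverse
    "dup": lambda s: s + s,          # duplicate
    "rot": lambda s: s[1:] + s[:1],  # rotate left by one
}

def make_example(depth: int, rng: random.Random) -> tuple[str, str]:
    """Compose `depth` random atoms; input is the program + argument, target is the result."""
    ops = [rng.choice(list(ATOMS)) for _ in range(depth)]
    arg = "".join(rng.choice("abcd") for _ in range(4))
    out = arg
    for op in ops:
        out = ATOMS[op](out)
    return " ".join(ops) + " ( " + arg + " )", out

rng = random.Random(0)
train = [make_example(d, rng) for d in (1, 2) for _ in range(100)]  # shallow compositions
test = [make_example(d, rng) for d in (4, 5) for _ in range(100)]   # held-out deeper ones
for prompt, target in test[:2]:
    print(prompt, "->", target)
```

If RL-tuned models solve the held-out deeper chains while the base model does not, that is the depth-generalization claim in miniature.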
I recently tried to implement H-Net. I think it is the most promising tokenizer-free approach. (Though the sequence length is not fixed during training, which is annoying.) And shrinking the vocabulary size below the model dimension is itself a desirable property.
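For reference, a minimal sketch of the dynamic-chunking idea at the heart of H-Net as I understand it: a router turns dissimilarity between adjacent hidden states into a boundary probability. The projection names, the 0.5 threshold, and the shapes are my assumptions, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryRouter(nn.Module):
    """Sketch of an H-Net-style chunk-boundary router (assumed details, not the paper's code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, dim) byte-level hidden states
        q, k = self.q(h[:, 1:]), self.k(h[:, :-1])
        cos = F.cosine_similarity(q, k, dim=-1)            # similarity of adjacent positions
        p = 0.5 * (1.0 - cos)                              # dissimilar neighbors => likely boundary
        first = torch.ones(h.size(0), 1, device=h.device)  # position 0 always starts a chunk
        return torch.cat([first, p], dim=1)                # (batch, seq) boundary probabilities

h = torch.randn(2, 16, 64)
p = BoundaryRouter(64)(h)
mask = p > 0.5          # chunk starts; the count varies per sample,
print(mask.sum(dim=1))  # which is exactly why sequence length isn't fixed during training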
Though it's a departure from the eval flow, I find this interview with MiniMax's CEO interesting (news.qq.com/rain/a/2025011…):
"Most companies in China are still using the methods for building recommendation systems to create large-model products. With a content product, you can't know…"
In B.1, the authors estimate a scaling law in N and D and, based on it, suggest that Adam pulls ahead as N grows. But what about D? And how does this change with MoE? It would be worthwhile to estimate. (Though estimating a scaling law in N and D can be delicate,…
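For concreteness, here is a sketch of what such an estimate involves, assuming the standard Chinchilla-style parameterization L(N, D) = E + A·N^(−α) + B·D^(−β); the data below is synthetic, and this functional form may not match the paper's B.1 fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed Chinchilla-style form: L(N, D) = E + A*N^-alpha + B*D^-beta.
def loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A * N**-alpha + B * D**-beta

rng = np.random.default_rng(0)
N = rng.uniform(1e8, 2e9, 40)   # parameter counts (synthetic)
D = rng.uniform(2e9, 1e11, 40)  # training tokens (synthetic)
L = loss((N, D), 1.7, 400.0, 0.34, 4e3, 0.28) + rng.normal(0, 0.01, 40)

# Initial guesses matter a lot for this nonlinear fit -- one reason
# estimating the (N, D) scaling law is delicate in practice.
p0 = (1.5, 100.0, 0.3, 1e3, 0.3)
popt, _ = curve_fit(loss, (N, D), L, p0=p0, maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], np.round(popt, 3))))
```

Comparing the fitted (α, β) across optimizers, or between dense and MoE runs, is one way to answer the "what about D?" question.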
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baselines or limited scale! E.g. Muon: ~40% speedup below 0.5B params, but only 10% at 1.2B (8× Chinchilla)!
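One common way to operationalize "speedup" here (my assumption about the protocol, not necessarily the paper's): the ratio of training tokens each optimizer needs to reach the same target loss. A sketch with synthetic loss curves:

```python
import numpy as np

def tokens_to_reach(loss_curve, tokens, target):
    """First token count at which the loss curve crosses `target` (linear interpolation)."""
    idx = np.argmax(loss_curve <= target)
    if loss_curve[idx] > target:
        return np.inf  # never reaches the target
    if idx == 0:
        return tokens[0]
    t0, t1 = tokens[idx - 1], tokens[idx]  # interpolate between bracketing points
    l0, l1 = loss_curve[idx - 1], loss_curve[idx]
    return t0 + (t1 - t0) * (l0 - target) / (l0 - l1)

# Synthetic power-law loss curves; the numbers are illustrative only.
tokens = np.linspace(1e9, 1e11, 500)
adamw = 2.0 + 8.0 * tokens**-0.12
muon = 2.0 + 7.2 * tokens**-0.12  # hypothetically faster optimizer

target = adamw[-1]  # AdamW's final loss is the target
speedup = tokens[-1] / tokens_to_reach(muon, tokens, target)
print(f"data-efficiency speedup: {speedup:.2f}x")
```

On this definition, an under-tuned AdamW baseline inflates the target loss and therefore the reported speedup, which is exactly the failure mode the thread warns about.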
134 Followers · 680 Following · CFD enthusiast turned ML researcher. Senior Researcher at LG CNS AI Lab. Opinions are solely my own and do not express the opinions of my employer.
47 Followers · 73 Following · Research Scientist at ByteDance Seed, specializing in research on Large Language Models, AI for Science, and Natural Language Processing.
7K Followers · 103 Following · Research scientist at @openai working on AI agents and Deep Research. Co-creator of ChatGPT agent. Ex-@Stanford CS PhD. My words do not represent my employer's.
393 Followers · 373 Following · Toward the next level of visual content creation. Ex @runwayml, a foundational contributor to Gen-3 Alpha, Frames, and Gen-4. Opinions are my own.
2K Followers · 11 Following · DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models that train faster.
4K Followers · 20 Following · At Essential AI, we're building an open platform to democratize frontier AI capabilities and accelerate breakthroughs globally through collaborative science.
6K Followers · 1K Following · Research scientist at @GoogleDeepMind, working on generative models, deep learning, RL. PhD from @stanford. Gemini Diffusion lead.
7K Followers · 652 Following · Research Scientist @AIatMeta
Previously Researcher @ Samsung AI
Outstanding Paper Award @icmlconf 2023
Action Editor @TmlrOrg
I tweet about ML papers and math