Fusheng Liu @mathlfs

PhD student @ National University of Singapore · mathematicallfs.github.io · Singapore · Joined September 2022

Tweets: 51 · Followers: 19 · Following: 97 · Likes: 530

Kimi.ai @Kimi_Moonshot

7 months ago

🚀 Introducing our new tech report: Muon is Scalable for LLM Training
We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW…

84 307 2K 749K 1K


DeepSeek @deepseek_ai

9 months ago

🚀 Introducing DeepSeek-V3! Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers
🐋 1/n

669 2K 13K 7.3M 5K


Fusheng Liu @mathlfs

11 months ago

I highly recommend this user-friendly project if you are starting out with LM pretraining and want to build your own model or optimizer. The repo is easy to understand, easy to edit, and makes it easy to implement new ideas with minimal effort. Well done, Keller! Looking forward to your records on ViT :)

Keller Jordan @kellerjordan0

11 months ago

14 28 326 201K 174

0 0 1 130 0

Daniel Han @danielhanchen

11 months ago

Fixed a bug that caused all training losses to diverge for large gradient accumulation (GA) sizes.
1. First reported by @bnjmn_marie: GA is supposed to be mathematically equivalent to full-batch training, but the losses did not match.
2. We reproduced the issue, and further investigation…

21 132 754 315K 416
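
The equivalence claim in the tweet above can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: the one-parameter model, the data, and the function names are all hypothetical, and the "per_step" normalization is just one well-known way an accumulated gradient can drift from the full-batch one. It is not necessarily the cause described in the truncated thread.

```python
import math

# Hypothetical toy setup: a 1-parameter model y ~ w * x with squared error.
def grad_sum(w, xs, ys):
    """Sum (not mean) of d/dw (w*x - y)^2 over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def full_batch_grad(w, xs, ys):
    """Gradient of the mean loss over the whole batch."""
    return grad_sum(w, xs, ys) / len(xs)

def accumulated_grad(w, micro_batches, normalize="global"):
    """Accumulate gradients over micro-batches.

    normalize="global":   divide every contribution by the total sample count.
    normalize="per_step": take the mean inside each micro-batch, then average
                          over accumulation steps.
    """
    total = sum(len(xs) for xs, _ in micro_batches)
    acc = 0.0
    for xs, ys in micro_batches:
        if normalize == "global":
            acc += grad_sum(w, xs, ys) / total
        else:
            acc += grad_sum(w, xs, ys) / len(xs) / len(micro_batches)
    return acc

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 13.0]
w = 0.5

equal_split = [(xs[:3], ys[:3]), (xs[3:], ys[3:])]    # sizes 3 and 3
unequal_split = [(xs[:4], ys[:4]), (xs[4:], ys[4:])]  # sizes 4 and 2

reference = full_batch_grad(w, xs, ys)
print(math.isclose(reference, accumulated_grad(w, unequal_split, "global")))
print(math.isclose(reference, accumulated_grad(w, equal_split, "per_step")))
print(math.isclose(reference, accumulated_grad(w, unequal_split, "per_step")))
```

In this toy case, per-step averaging agrees with the full-batch gradient only when every micro-batch contributes the same number of samples. Normalizing by the global count agrees for both splits.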


Fusheng Liu @mathlfs

a year ago

Mamba at ICLR :)

0 0 1 128 0


Taco Cohen @TacoCohen

2 years ago

Harm's Law of Smol Models (HLSM) tells us how much we need to scale up the data size (k_D) as we scale down the model size (k_N), if we wish to preserve the loss of a Chinchilla-optimal model. harmdevries.com/post/model-siz…
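
The relationship between k_N and k_D described above can be made concrete with a rough sketch. It assumes the Chinchilla-style parametric loss L(N, D) = E + A·N^(-α) + B·D^(-β), a compute budget proportional to N·D, and the exponents α = 0.34 and β = 0.28 as default values. None of these appear in the tweet itself, and the function names are hypothetical. Under those assumptions, the compute-optimal point satisfies α·A·N^(-α) = β·B·D^(-β), so setting L(k_N·N, k_D·D) equal to L(N, D) gives k_D = [1 − (k_N^(-α) − 1)·β/α]^(-1/β).

```python
def data_scale_factor(k_n, alpha=0.34, beta=0.28):
    """Return k_D such that a model k_n times the compute-optimal size,
    trained on k_D times the compute-optimal data, reaches the same loss.

    Assumes L(N, D) = E + A * N**-alpha + B * D**-beta and compute ~ N * D.
    The default exponents are assumed values, not taken from the tweet.
    """
    if k_n <= 0:
        raise ValueError("k_n must be positive")
    # At the compute-optimal point the two reducible loss terms satisfy
    # (A * N**-alpha) / (B * D**-beta) == beta / alpha.
    ratio = beta / alpha
    remaining = 1.0 - (k_n ** -alpha - 1.0) * ratio
    if remaining <= 0:
        # Below this model size no finite amount of data recovers the loss.
        raise ValueError("model too small to match the reference loss")
    return remaining ** (-1.0 / beta)

def compute_overhead(k_n, alpha=0.34, beta=0.28):
    """Extra training compute relative to the compute-optimal run."""
    return k_n * data_scale_factor(k_n, alpha, beta) - 1.0

for k_n in (1.0, 0.75, 0.5, 0.3):
    print(k_n, data_scale_factor(k_n), compute_overhead(k_n))
```

With these assumed exponents, halving the model size requires a bit more than twice the data, and the required data grows without bound as the model shrinks toward the point where `remaining` reaches zero.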