Also big congrats on Nemotron-CC-Math! 🎉 NVIDIA is not only leading but continuing to set the pace across multiple subareas of open pretraining data. @KarimiRabeeh and @issanjeev are the lead authors there! arxiv.org/pdf/2508.15096
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 KL estimator) and REINFORCE++ under one framework and uncovers RL objectives that outperform GRPO (a minimal sketch of the k3 estimator follows below):
Paper: arxiv.org/abs/2505.17508
Code:…
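For context on the k3 estimator mentioned above, here is a minimal sketch (PyTorch-style Python; not the paper's or the linked repository's implementation, and the function and variable names are illustrative) of the standard k3 KL estimate and a KL-regularized policy-gradient surrogate built on it:

```python
import torch

def k3_kl_estimate(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimator of KL(pi_theta || pi_ref): r - log r - 1 with r = pi_ref / pi_theta.
    It is non-negative and unbiased under samples drawn from pi_theta."""
    log_ratio = logp_ref - logp_policy            # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0

def kl_regularized_pg_loss(logp_policy, logp_ref, advantages, beta=0.05):
    """REINFORCE-style surrogate with an explicit KL penalty (illustrative only):
    minimizing this maximizes E[A * log pi_theta] - beta * KL(pi_theta || pi_ref)."""
    pg_term = -(advantages.detach() * logp_policy).mean()
    kl_term = k3_kl_estimate(logp_policy, logp_ref).mean()
    return pg_term + beta * kl_term
```

This is only one common way GRPO-style methods add the penalty; the paper derives and compares several alternative KL-regularized objectives.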
I just have a feeling that... it is much smarter. It's not reflected in the common benchmarks, but it is just way better than the models that came before. This gives us a lot of confidence in scaling, whether in model size or data size.
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4,000 models trained to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware of under-tuned baselines and limited scale! E.g. Muon: ~40% speedup below 0.5B params, and only 10% at 1.2B (8× Chinchilla)!
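For concreteness, "speedup over AdamW" in this kind of study is usually the ratio of training tokens each optimizer needs to reach the same evaluation loss. The sketch below is illustrative only (not the paper's evaluation code), and the `loss_curve` arrays of (tokens_seen, eval_loss) pairs are an assumed input format:

```python
import numpy as np

def tokens_to_reach(loss_curve: np.ndarray, target_loss: float) -> float:
    """loss_curve: shape (N, 2), columns (tokens_seen, eval_loss), loss assumed improving.
    Returns the tokens needed to first reach target_loss, or inf if never reached."""
    tokens, losses = loss_curve[:, 0], loss_curve[:, 1]
    hit = np.argmax(losses <= target_loss)        # first index at or below the target
    if losses[hit] > target_loss:
        return float("inf")
    return float(tokens[hit])

def speedup_over_adamw(adamw_curve: np.ndarray, candidate_curve: np.ndarray) -> float:
    """Speedup = (tokens AdamW needs) / (tokens the candidate needs)
    to reach AdamW's final loss; values > 1 mean the candidate is faster."""
    target = adamw_curve[-1, 1]
    return tokens_to_reach(adamw_curve, target) / tokens_to_reach(candidate_curve, target)
```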
[Via jeanas.bsky.social on the non-Musky place.]
And yes, this monstrosity is an actual commutative diagram from an actual math paper: “Comma 2-comonad I: Eilenberg-Moore 2-category of colax coalgebras” by Igor Baković arxiv.org/abs/2505.00682 (on page 53).
🚀 Excited to introduce FormalMATH: a large-scale formal math benchmark with 5,560 formally verified Lean 4 statements drawn from Olympiad and undergraduate-level problems (a toy example of what such a statement looks like is below).
📉 Best model performance: just 16.46% — plenty of room for progress!
🔗 Explore the project: spherelab.ai/FormalMATH/
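To make "formally verified Lean 4 statement" concrete, here is a toy example in the same style (it is not taken from FormalMATH; the theorem name and statement are made up for illustration). Benchmark entries pose the statement and leave the proof for a prover model to fill in:

```lean
import Mathlib

-- Illustrative only: an olympiad-flavored statement in the FormalMATH style.
-- A prover is judged on replacing `sorry` with a proof that type-checks.
theorem toy_am_gm (a b : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b) :
    a * b ≤ ((a + b) / 2) ^ 2 := by
  sorry
```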
Ahead of I/O, we’re releasing an updated Gemini 2.5 Pro! It’s now #1 on WebDevArena leaderboard, breaking the 1400 ELO barrier! 🥇
Our most advanced coding model yet, with stronger performance on code transformation & editing. Excited to see the agents you build on top of this!
1/8 ⭐General Preference Modeling with Preference Representations for Aligning Language Models⭐ arxiv.org/abs/2410.02197
Featured in Hugging Face Daily Papers: huggingface.co/papers/2410.02…
We just dropped our latest research on General Preference Modeling (GPM)! 🚀 (the core preference-scoring idea is sketched below)
1/n
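As a rough, heavily hedged sketch of the core idea (based on the abstract, not the paper's code; all names, dimensions, and the random parameterization below are illustrative): each response gets a preference embedding, and a skew-symmetric operator scores ordered pairs, so the model can represent cyclic (intransitive) preferences that a single scalar Bradley-Terry reward cannot:

```python
import torch

def skew_symmetric(d: int) -> torch.Tensor:
    """Build a skew-symmetric operator R (R^T = -R); illustrative parameterization."""
    A = torch.randn(d, d)
    return A - A.T

def preference_score(v1: torch.Tensor, v2: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """s(y1 > y2) = <R v1, v2>. Skew-symmetry gives s(y1, y2) = -s(y2, y1),
    which allows cycles such as y1 > y2 > y3 > y1."""
    return (R @ v1) @ v2

# Toy usage: probability that response y1 is preferred over y2, given their embeddings.
d = 8
R = skew_symmetric(d)
v1, v2 = torch.randn(d), torch.randn(d)
p_y1_beats_y2 = torch.sigmoid(preference_score(v1, v2, R))
```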
'Tensor Product Attention Is All You Need' paper
Key Points ->
1. Reduces KV cache size by using a contextual tensor decomposition for each token
2. Splits each token's hidden dimension into head-dimension factors and token-dimension factors, then recombines them via a tensor product (see the sketch after this list)
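A rough sketch of the factorization in points 1-2 (illustrative PyTorch, not the official T6 code; the ranks and shapes are assumed): each token's key (and likewise its query and value) is rebuilt as a sum of outer products between small per-token factors, and only those factors would be kept in the KV cache:

```python
import torch

# Assumed toy shapes, not the paper's configurations.
n_heads, d_head, rank, d_model = 8, 64, 2, 512

# Contextual projections: the factors depend on the current token's hidden state x_t.
proj_a = torch.nn.Linear(d_model, rank * n_heads)   # head-dimension factors A(x_t)
proj_b = torch.nn.Linear(d_model, rank * d_head)    # token-dimension factors B(x_t)

def tpa_key(x_t: torch.Tensor) -> torch.Tensor:
    """Reconstruct a full multi-head key K_t of shape (n_heads, d_head) from low-rank
    per-token factors. Caching a_t and b_t stores rank * (n_heads + d_head) numbers
    per token instead of n_heads * d_head."""
    a_t = proj_a(x_t).view(rank, n_heads)
    b_t = proj_b(x_t).view(rank, d_head)
    # Sum of rank-1 outer products: K_t = (1 / rank) * A(x_t)^T @ B(x_t)
    return (a_t.T @ b_t) / rank

x_t = torch.randn(d_model)
K_t = tpa_key(x_t)   # shape: (n_heads, d_head)
```

With these toy shapes the cached factors take 2 * (8 + 64) = 144 numbers per token versus 8 * 64 = 512 for full per-head keys; the exact ratio depends on the chosen ranks and shapes.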
MHA-->GQA-->MLA--->TPA🚀🚀🚀
Introducing Tensor Product Attention (TPA).
To reduce KV cache size, various Multi-Head Attention (MHA) variants have been developed, including Multi-Query Attention (MQA), Group Query Attention (GQA), and Multi-Head Latent Attention (MLA). GQA has…
Tensor Product Attention Is All You Need
Proposes Tensor Product Attention (TPA), a mechanism that factorizes Q, K, and V activations using contextual tensor decompositions to achieve a 10x or greater reduction in inference-time KV cache size relative to the standard attention mechanism…
Tensor Product Attention Is All You Need
Tensor Product Attention reduces memory overhead by compressing KV cache using tensor decompositions. The T6 Transformer, built on TPA, processes longer sequences efficiently and outperforms standard models across benchmarks.
1/
Introducing “Tensor Product Attention Is All You Need” (TPA) and Tensor ProducT ATTenTion Transformer (T6)! 🚀
Ever wondered if there’s a more memory-efficient way to handle long contexts in LLMs?
Homepage: tensorgi.github.io/T6
0 Followers · 5 Following · #DataGeek - I'm a DF aficionado and self-proclaimed tech lover. My mission is to use data for good, shape the future of technology, and make data-driven decisio
463 Followers · 2K Following · Interested in cryptocurrency
#BTC #BNB #ETH #NFT
and investing in domain names and digital assets. Owner of https://t.co/Viu5HqACZD, https://t.co/TI2Jwlybw5, https://t.co/44PopEroDo,
33K Followers · 1 Following · Nano Banana 🍌, aka Gemini 2.5 Flash Image, the world's most powerful image editing and generation model! Try it for free in the @GeminiApp
25K Followers · 206 Following · Working towards the safe development of AI for the benefit of all @UMontreal, @LawZero_ & @Mila_Quebec
A.M. Turing Award Recipient and most-cited AI researcher.
3K Followers · 1K Following · Research Engineering Lead at @StanfordCRFM. Previously co-founder at Semantic Machines ⟶ MSFT. Lead developer of Levanter and Marin @[email protected]
110K Followers · 3K Following · CPO @OpenAI, BoD @Cisco @nature_org, LTC @USArmyReserve
Prev: President @Planet, Head of Product @Instagram @Twitter
❤️ @elizabeth ultramarathons kids cats math
488K Followers · 146 Following · Nobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
1K Followers · 298 Following · Assistant Prof. @UofMaryland; Prev. {@MIT, @SimonsInstitute, @ECEILLINOIS, @Tsinghua_Uni}; Interested in Control, Game Theory, Machine Learning, and Robotics
29K Followers · 806 Following · Mathematician (Distinguished Professor of #Math at @RutgersU). Here to learn about research, education, and community. Let’s build something together.
4K Followers · 2K Following · Research Scientist @NVIDIA focusing on efficient post-training of LLMs. Finetuning your own LLMs with LMFlow: https://t.co/UTykmQBwFr Views are my own.