Also big congrats on Nemotron-CC-Math! 🎉 NVIDIA is not only leading but continuing to set the pace across multiple subareas of open pretraining data. @KarimiRabeeh and @issanjeev are the lead authors there! arxiv.org/pdf/2508.15096
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 KL estimator) and REINFORCE++ under one framework and uncovers RL objectives that outperform GRPO (a minimal sketch of the k3 estimator follows below):
Paper: arxiv.org/abs/2505.17508
Code:…
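For context on the k3 estimator mentioned above, here is a minimal sketch (PyTorch-style Python; not the paper's or the linked repository's implementation, and the function and variable names are illustrative) of the standard k3 KL estimate and a KL-regularized policy-gradient surrogate built on it:

```python
import torch

def k3_kl_estimate(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimator of KL(pi_theta || pi_ref): r - log r - 1 with r = pi_ref / pi_theta.
    It is non-negative and unbiased under samples drawn from pi_theta."""
    log_ratio = logp_ref - logp_policy            # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0

def kl_regularized_pg_loss(logp_policy, logp_ref, advantages, beta=0.05):
    """REINFORCE-style surrogate with an explicit KL penalty (illustrative only):
    minimizing this maximizes E[A * log pi_theta] - beta * KL(pi_theta || pi_ref)."""
    pg_term = -(advantages.detach() * logp_policy).mean()
    kl_term = k3_kl_estimate(logp_policy, logp_ref).mean()
    return pg_term + beta * kl_term
```

This is only one common way GRPO-style methods add the penalty; the paper derives and compares several alternative KL-regularized objectives.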
I just have a feeling that... it is much smarter. It's not reflected in the common benchmarks, but it is just way better than the models that came before. This gives us a lot of confidence in scaling, whether in model size or data size.
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4,000 models trained to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware of under-tuned baselines and limited scale! E.g. Muon: ~40% speedup below 0.5B params, and only 10% at 1.2B (8× Chinchilla)!
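For concreteness, "speedup over AdamW" in this kind of study is usually the ratio of training tokens each optimizer needs to reach the same evaluation loss. The sketch below is illustrative only (not the paper's evaluation code), and the `loss_curve` arrays of (tokens_seen, eval_loss) pairs are an assumed input format:

```python
import numpy as np

def tokens_to_reach(loss_curve: np.ndarray, target_loss: float) -> float:
    """loss_curve: shape (N, 2), columns (tokens_seen, eval_loss), loss assumed improving.
    Returns the tokens needed to first reach target_loss, or inf if never reached."""
    tokens, losses = loss_curve[:, 0], loss_curve[:, 1]
    hit = np.argmax(losses <= target_loss)        # first index at or below the target
    if losses[hit] > target_loss:
        return float("inf")
    return float(tokens[hit])

def speedup_over_adamw(adamw_curve: np.ndarray, candidate_curve: np.ndarray) -> float:
    """Speedup = (tokens AdamW needs) / (tokens the candidate needs)
    to reach AdamW's final loss; values > 1 mean the candidate is faster."""
    target = adamw_curve[-1, 1]
    return tokens_to_reach(adamw_curve, target) / tokens_to_reach(candidate_curve, target)
```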
[Via jeanas.bsky.social on the non-Musky place.]
And yes, this monstrosity is an actual commutative diagram from an actual math paper: “Comma 2-comonad I: Eilenberg-Moore 2-category of colax coalgebras” by Igor Baković arxiv.org/abs/2505.00682 (on page 53).
🚀 Excited to introduce FormalMATH: a large-scale formal math benchmark with 5,560 formally verified Lean 4 statements drawn from Olympiad and undergraduate-level problems (a toy example of what such a statement looks like is below).
📉 Best model performance: just 16.46% — plenty of room for progress!
🔗 Explore the project: spherelab.ai/FormalMATH/
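To make "formally verified Lean 4 statement" concrete, here is a toy example in the same style (it is not taken from FormalMATH; the theorem name and statement are made up for illustration). Benchmark entries pose the statement and leave the proof for a prover model to fill in:

```lean
import Mathlib

-- Illustrative only: an olympiad-flavored statement in the FormalMATH style.
-- A prover is judged on replacing `sorry` with a proof that type-checks.
theorem toy_am_gm (a b : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b) :
    a * b ≤ ((a + b) / 2) ^ 2 := by
  sorry
```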
Ahead of I/O, we’re releasing an updated Gemini 2.5 Pro! It’s now #1 on WebDevArena leaderboard, breaking the 1400 ELO barrier! 🥇
Our most advanced coding model yet, with stronger performance on code transformation & editing. Excited to see the agents you build on top of this!
1/8 ⭐General Preference Modeling with Preference Representations for Aligning Language Models⭐ arxiv.org/abs/2410.02197
Featured in Hugging Face Daily Papers: huggingface.co/papers/2410.02…
We just dropped our latest research on General Preference Modeling (GPM)! 🚀 (the core preference-scoring idea is sketched below)
1/n
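As a rough, heavily hedged sketch of the core idea (based on the abstract, not the paper's code; all names, dimensions, and the random parameterization below are illustrative): each response gets a preference embedding, and a skew-symmetric operator scores ordered pairs, so the model can represent cyclic (intransitive) preferences that a single scalar Bradley-Terry reward cannot:

```python
import torch

def skew_symmetric(d: int) -> torch.Tensor:
    """Build a skew-symmetric operator R (R^T = -R); illustrative parameterization."""
    A = torch.randn(d, d)
    return A - A.T

def preference_score(v1: torch.Tensor, v2: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """s(y1 > y2) = <R v1, v2>. Skew-symmetry gives s(y1, y2) = -s(y2, y1),
    which allows cycles such as y1 > y2 > y3 > y1."""
    return (R @ v1) @ v2

# Toy usage: probability that response y1 is preferred over y2, given their embeddings.
d = 8
R = skew_symmetric(d)
v1, v2 = torch.randn(d), torch.randn(d)
p_y1_beats_y2 = torch.sigmoid(preference_score(v1, v2, R))
```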
'Tensor Product Attention Is All You Need' paper
Key Points ->
1. Reduces KV cache size by using a contextual tensor decomposition for each token
2. Splits each token's hidden dimension into head-dimension factors and token-dimension factors, then recombines them via a tensor product (see the sketch after this list)
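A rough sketch of the factorization in points 1-2 (illustrative PyTorch, not the official T6 code; the ranks and shapes are assumed): each token's key (and likewise its query and value) is rebuilt as a sum of outer products between small per-token factors, and only those factors would be kept in the KV cache:

```python
import torch

# Assumed toy shapes, not the paper's configurations.
n_heads, d_head, rank, d_model = 8, 64, 2, 512

# Contextual projections: the factors depend on the current token's hidden state x_t.
proj_a = torch.nn.Linear(d_model, rank * n_heads)   # head-dimension factors A(x_t)
proj_b = torch.nn.Linear(d_model, rank * d_head)    # token-dimension factors B(x_t)

def tpa_key(x_t: torch.Tensor) -> torch.Tensor:
    """Reconstruct a full multi-head key K_t of shape (n_heads, d_head) from low-rank
    per-token factors. Caching a_t and b_t stores rank * (n_heads + d_head) numbers
    per token instead of n_heads * d_head."""
    a_t = proj_a(x_t).view(rank, n_heads)
    b_t = proj_b(x_t).view(rank, d_head)
    # Sum of rank-1 outer products: K_t = (1 / rank) * A(x_t)^T @ B(x_t)
    return (a_t.T @ b_t) / rank

x_t = torch.randn(d_model)
K_t = tpa_key(x_t)   # shape: (n_heads, d_head)
```

With these toy shapes the cached factors take 2 * (8 + 64) = 144 numbers per token versus 8 * 64 = 512 for full per-head keys; the exact ratio depends on the chosen ranks and shapes.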
MHA-->GQA-->MLA--->TPA🚀🚀🚀
Introducing Tensor Product Attention (TPA).
To reduce KV cache size, various Multi-Head Attention (MHA) variants have been developed, including Multi-Query Attention (MQA), Group Query Attention (GQA), and Multi-Head Latent Attention (MLA). GQA has…
Tensor Product Attention Is All You Need
Proposes Tensor Product Attention (TPA), a mechanism that factorizes Q, K, and V activations using contextual tensor decompositions to achieve a 10x or greater reduction in inference-time KV cache size relative to the standard attention mechanism…
Tensor Product Attention Is All You Need
Tensor Product Attention reduces memory overhead by compressing KV cache using tensor decompositions. The T6 Transformer, built on TPA, processes longer sequences efficiently and outperforms standard models across benchmarks.
1/
Introducing “Tensor Product Attention Is All You Need” (TPA) and Tensor ProducT ATTenTion Transformer (T6)! 🚀
Ever wondered if there’s a more memory-efficient way to handle long contexts in LLMs?
Homepage: tensorgi.github.io/T6
0 Followers · 5 Following · #DataGeek - I'm a DF aficionado and self-proclaimed tech lover. My mission is to use data for good, shape the future of technology, and make data-driven decisio
463 Followers · 2K Following · Interested in cryptocurrency
#BTC #BNB #ETH #NFT
and investing in domain names and digital assets. Owner of https://t.co/Viu5HqACZD, https://t.co/TI2Jwlybw5, https://t.co/44PopEroDo,
33K Followers · 1 Following · Nano Banana 🍌, aka Gemini 2.5 Flash Image, the world's most powerful image editing and generation model! Try it for free in the @GeminiApp
25K Followers · 206 Following · Working towards the safe development of AI for the benefit of all @UMontreal, @LawZero_ & @Mila_Quebec
A.M. Turing Award Recipient and most-cited AI researcher.
3K Followers · 1K Following · Research Engineering Lead at @StanfordCRFM. Previously co-founder at Semantic Machines ⟶ MSFT. Lead developer of Levanter and Marin @[email protected]
110K Followers · 3K Following · CPO @OpenAI, BoD @Cisco @nature_org, LTC @USArmyReserve
Prev: President @Planet, Head of Product @Instagram @Twitter
❤️ @elizabeth ultramarathons kids cats math
488K Followers · 146 Following · Nobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
1K Followers · 298 Following · Assistant Prof. @UofMaryland; Prev. {@MIT, @SimonsInstitute, @ECEILLINOIS, @Tsinghua_Uni}; Interested in Control, Game Theory, Machine Learning, and Robotics
29K Followers · 806 Following · Mathematician (Distinguished Professor of #Math at @RutgersU). Here to learn about research, education, and community. Let’s build something together.
4K Followers · 2K Following · Research Scientist @NVIDIA focusing on efficient post-training of LLMs. Finetuning your own LLMs with LMFlow: https://t.co/UTykmQBwFr Views are my own.