New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
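(Not from the thread itself; a rough, hypothetical sketch of the setup it describes: a teacher model emits continuations, only pure 3-digit-number outputs are kept, and a student would later be fine-tuned on that file. The model name, prompt, and filter are placeholder choices, not the paper's pipeline.)

```python
# Hypothetical sketch of the "hidden signals in numbers" data pipeline:
# keep only teacher completions that are pure lists of 3-digit numbers,
# and save them as a fine-tuning set for a student model.
# "gpt2" and the prompt are stand-ins, not the actual teacher/setup.
import json
import re
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")
prompt = "Continue the sequence: 142, 857, 963,"
only_numbers = re.compile(r"^\s*(\d{3}\s*,\s*)*\d{3}\s*$")

samples = []
for out in teacher(prompt, num_return_sequences=20, max_new_tokens=32,
                   do_sample=True, pad_token_id=50256):
    completion = out["generated_text"][len(prompt):].strip().rstrip(",")
    if only_numbers.match(completion):        # drop anything that is not 3-digit numbers
        samples.append({"prompt": prompt, "completion": completion})

with open("numbers_only.jsonl", "w") as f:    # the dataset a student would be SFT'd on
    for s in samples:
        f.write(json.dumps(s) + "\n")
```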
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence…
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?
In our study (arxiv.org/pdf/2507.00432), we…
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner.
How can we get a pulse check on whether current LLMs are capable of driving this kind of total…
New paper: What happens when an LLM reasons?
We created methods to interpret reasoning steps & their connections: CoT resampling, attention analysis, & attention suppression
We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
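(A toy, hypothetical illustration of the resampling idea the thread mentions: keep only the first k reasoning steps, resample the rest many times, and measure how often the original final answer survives. `toy_generate` stands in for a real sampled LLM call.)

```python
# Toy sketch of CoT resampling: truncate the chain of thought after step k,
# resample completions, and see how often the final answer is preserved.
# Steps whose removal swings the answer act like "thought anchors".
import random
from collections import Counter

def anchor_score(steps, generate, k, n_samples=50):
    """Fraction of resampled completions that still reach the full-chain answer
    when only the first k reasoning steps are kept."""
    prefix = "\n".join(steps[:k])
    answers = Counter(generate(prefix) for _ in range(n_samples))
    original = generate("\n".join(steps))      # answer with the full chain
    return answers[original] / n_samples

# Stand-in model: the answer only stays stable once step 2 ("carry the 1") is kept.
def toy_generate(prefix):
    if "carry the 1" in prefix:
        return "42"
    return random.choice(["42", "41"])

cot = ["Add the units digits", "carry the 1", "Add the tens digits", "Answer: 42"]
for k in range(1, len(cot) + 1):
    print(k, round(anchor_score(cot, toy_generate, k), 2))
```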
What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?…
Can we actually control reasoning behaviors in thinking LLMs?
Our @iclr_conf workshop paper is out! 🎉
We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations!
Details in 🧵👇
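(Not the authors' code; a minimal sketch of the general technique, activation steering via a forward hook, with GPT-2 and a random vector as placeholders for the real model and a properly derived steering direction.)

```python
# Minimal activation-steering sketch (placeholders, not the paper's recipe):
# add a fixed vector to one transformer block's hidden states on every forward
# pass, then generate. A real steering vector would be derived from contrastive
# activations (e.g., backtracking vs. non-backtracking traces), not random.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                   # stand-in for DeepSeek-R1-Distill
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

block = model.transformer.h[6]                  # which layer to steer (illustrative)
steer = torch.randn(model.config.hidden_size) * 4.0

def add_steering(module, inputs, output):
    hidden = output[0] + steer.to(output[0].dtype)   # shift the residual stream
    return (hidden,) + output[1:]

handle = block.register_forward_hook(add_steering)
ids = tok("Let me think step by step:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=True,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                                  # stop steering afterwards
```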
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation.
🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%
🌐 Website: multiverse4fm.github.io
🧵 1/n
DeepSeek 671b and Qwen3 236b support with Megatron backend is now available as preview in verl v0.4.0 🔥🔥🔥
We will continue optimizing MoE model performance down the road.
DeepSeek 671b: verl.readthedocs.io/en/latest/perf…
verl v0.4: github.com/volcengine/ver…
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.
We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
Incredible work by my mentors and open-source collaborators—honored to have played a tiny part! Huge respect for Simon Huang & team for leading this! 👏🙏
🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
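(A hypothetical, model-free sketch of what a "spurious reward" means mechanically: the reward function ignores correctness entirely, yet it feeds the same group-normalized advantage computation a real verifier would.)

```python
# Sketch of spurious vs. ground-truth rewards in an RLVR-style loop.
# Only the reward function changes; the GRPO-style group-normalized
# advantages are computed exactly the same way either way.
import random

def random_reward(prompt, completion):
    return float(random.random() < 0.5)            # coin flip, ignores correctness

def ground_truth_reward(prompt, completion, answer):
    return float(completion.strip().endswith(answer))

def group_advantages(rewards):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0                        # avoid dividing by zero
    return [(r - mean) / std for r in rewards]

rollouts = ["... so the answer is 7", "... so the answer is 12", "... so the answer is 7"]
rewards = [random_reward("2+5=?", c) for c in rollouts]
print(rewards, group_advantages(rewards))
```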
Giving your models more time to think before prediction, e.g., via smart decoding, chain-of-thought reasoning, latent thoughts, etc., turns out to be quite effective for unlocking the next level of intelligence.
New post is here :)
“Why we think”: lilianweng.github.io/posts/2025-05-…
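(One concrete flavor of "more time to think", sketched here as self-consistency decoding: sample several answers and take a majority vote. `ask_model` is a stand-in for a real sampled LLM call, not anything from the post.)

```python
# Self-consistency sketch: spend extra test-time compute by sampling many
# chains of thought and majority-voting over the final answers.
import random
from collections import Counter

def ask_model(question):
    # stand-in: a noisy solver that is right 60% of the time
    return "27" if random.random() < 0.6 else random.choice(["25", "29"])

def self_consistency(question, n=15):
    votes = Counter(ask_model(question) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 3^3?"))
```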
4K Followers · 2K Following · Research Scientist @NVIDIA focusing on efficient post-training of LLMs. Finetuning your own LLMs with LMFlow: https://t.co/UTykmQBwFr Views are my own.
50K Followers · 5K Following · Cofounder and Head of Post Training @NousResearch, prev @StabilityAI
Github: https://t.co/LZwHTUFwPq
HuggingFace: https://t.co/sN2FFU8PVE
225 Followers · 570 Following · Second-year PhD @UW | Post-training, LLM reasoning, and synthetic datasets.
https://t.co/cYAkbnCsCp
Open to chat and collaborate!
18K Followers · 4K Following · Associate Professor at UC Berkeley. Former Research Scientist at Google DeepMind. ML/AI Researcher working on foundations of LLMs and deep learning.
196K Followers · 6K Following · canadian startup founder. prev eng @ x, stripe. yacine_kv on insta
i make my memes with https://t.co/pWRBfY8kn2 -
I write a subscriber only blog. Subscribe!
3K Followers · 910 Following · Understanding the universe @xAI;
Previously: Co-Founder at @FennelAI, ex ML Infra at Google Brain, ex Infra at GCE
Startups, Software, Tech, Infra, AI :)
2K Followers · 14 Following · The AI benchmark for predictive intelligence, advancing collective foresight via human–AI collaboration, from SIGMA Lab @UChicagoCS @DSI_UChicago
554K Followers · 131 Following · Father of three, Creator of Ruby on Rails + Omarchy, Co-owner & CTO of 37signals, Shopify director, NYT best-selling author, and Le Mans 24h class-winner.