carlos @_carlosejimenez

i like ai, philosophy, and politics
carlosejimenez.com · San Francisco, CA · Joined May 2019

Tweets: 331 · Followers: 1K · Following: 354 · Likes: 7K

Kilian Lieret @KLieret

2 weeks ago

Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵

8 21 157 23K 41

Kilian Lieret @KLieret

2 weeks ago

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

19 20 270 31K 134

Samuel Miserendino @samuelp1002

3 weeks ago

0 2 13 1K 1

SemiAnalysis @SemiAnalysis_

4 weeks ago

At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the most clear description of current model performance on this benchmark. No "verified" subset, limited tool use (bash only), most scaffolding is open to see. In this benchmark, the Claude 4 Opus…

11 15 274 28K 52

Kilian Lieret @KLieret

4 weeks ago

We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵

5 6 33 5K 13

Talor Abramovich @AbramovichTalor

4 weeks ago

Incredible to see the progress in Offensive Cybersecurity benchmarks!

Terry Yue Zhuo @ SF 🏖️ @terryyuezhuo

a month ago

1 16 66 17K 41

0 1 5 462 3

Kilian Lieret @KLieret

4 weeks ago

Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. Still likes to cram in a lot into a single step. Full evals tomorrow!

1 4 14 2K 3

Ofir Press @OfirPress

a month ago

.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results! Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard.

Qwen @Alibaba_Qwen

2 months ago

316 1K 9K 2.0M 4K

3 1 20 3K 2

Kilian Lieret @KLieret

a month ago

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵

12 73 791 107K 901

Ofir Press @OfirPress

2 months ago

AGI

0 1 18 2K 0

Ofir Press @OfirPress

2 months ago

LMs had a really tough time playing real video games from the 90s, so we made a suite of 3 simple games to test specific abilities, including drag-and-dropping, and navigating a maze using the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results-->

Alex Zhang @a1zhang

2 months ago

1 1 26 4K 5

2 2 19 2K 4

SWE-bench @SWEbench

2 months ago

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️

1 6 15 2K 0

Talor Abramovich @AbramovichTalor

2 months ago

Join me next week at #ICML25, where I will be presenting my first first-author paper –– EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…

3 6 24 9K 10

Rajko Radovanović @rajko_rad

2 months ago

We @a16z just launched the third batch of Open Source AI Grants (cc @mbornstein) 🎉 This round includes projects focused on LLM evaluation, novel reasoning tests, infrastructure, and experimental research at the edge of capability and cognition: • SGLang: High-performance LLM…