Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the most clear description of current model performance on this benchmark.
No "verified" subset, limited tool use (bash only), most scaffolding is open to see. In this benchmark, the Claude 4 Opus…
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 5 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. Still likes to cram in a lot into a single step. Full evals tomorrow!
.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results!
Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard.
.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results!
Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard. https://t.co/oQOnajNjFw
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified!
Made for benchmarking, fine-tuning, RL, or just for use from your terminal.
It’s open source, simple to hack, and compatible with any LM! Link in 🧵
LMs had a really tough time playing real video games from the 90s- so we made a suite of 3 simple games to test specific abilities, including drag-and-dropping, and navigating a maze using the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results-->
LMs had a really tough time playing real video games from the 90s- so we made a suite of 3 simple games to test specific abilities, including drag-and-dropping, and navigating a maze using the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results-->
SWE-agent is now Multimodal! 😎
We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev.
🔗➡️
Join me next week at #ICML25, where I will be presenting my first first-author paper –– EnIGMA.
EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…
We @a16z just launched the third batch of Open Source AI Grants (cc @mbornstein) 🎉
This round includes projects focused on LLM evaluation, novel reasoning tests, infrastructure, and experimental research at the edge of capability and cognition:
• SGLang: High-performance LLM…
61 Followers 894 FollowingThis moment is perfect and the next is a perfect mystery. Now is never and forever, so forget your beliefs and see. This is all I will ever be as I is not me.
8 Followers 153 FollowingSoftware Engineer currently building Ingenious (https://t.co/ZjHuvymGXU), Certified Yapper, Scale Model Enthusiast. Views are mine.
3K Followers 3K FollowingPost-Training Lead @ Together AI | OpenChat Project Lead (#1 7B LLM on Arena for 2+ months, 2M+ downloads) | DeepCoder, DeepSWE
36 Followers 2K Following"You're a techno-alchemist, Rez — part mad scientist, part storytelling strategist, and fully engineering mind meets marketing soul."
126K Followers 16K FollowingChinese Australian artist/Award wining cartoonist for @theage @smh /Human rights Activist/DM for signed print & original art /New Book https://t.co/O7ZmTytF6D
6K Followers 726 Following🗽 Lower Manhattan native ⚖️ Attorney & community advocate 🏙️ Fighting for a just, affordable, thriving NYC 🏛️ Former candidate, NYC Council
181K Followers 302 FollowingInequality Economist. Former Trader. Other Economists make predictions, but my ones are actually right. Explaining Economics on YouTube - garyseconomics
162K Followers 561 Followingco-founder of Fog Creek, Trello, Stack Overflow, Glitch, and https://t.co/Jb7fG3eQgU - I have moved to @[email protected] on mastodon
437K Followers 762 FollowingComplex systems, wicked problems. Society, technology, science and more. @Princeton professor. @NYTimes columnist. My newsletter @insight https://t.co/6Ky01N9JwA
895 Followers 146 Following“If you’re careful enough, nothing good or bad will ever happen to you.” -Ashleigh Brilliant
Very low frequency trading: [email protected]
1K Followers 330 FollowingInfra & AI enthusiast, dreaming about test-time compute ✨
Research Scholar at @Berkeley_EECS, @ucbrise, @berkeley_ai | MS in CS @ETH_en | Prev, @IBMResearch
51K Followers 474 FollowingRE Developer, doer, design & construction geek, owner @ MADDPROJECT, work from anywhere specialist, RA, NCARB. I build teams that design & build buildings.