After one week, GPT-5 has topped our proprietary model charts for tool calling accuracy 🥇 In second is Claude 4.1 Opus, at 99.5%. Details 👇
DEFINITIONS: We define tool calling accuracy as the % of tool calling requests with no invalid tools chosen and no schema problems. A tool calling request is one that was sent at least one tool option and whose response ends with a "tool_calls" finish reason. Don't know what tools are? See openrouter.ai/docs/features/…
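To make that definition concrete, here is a minimal sketch of how such a metric could be computed over logged request/response pairs. This is an illustration, not OpenRouter's actual implementation; the shapes assume the common OpenAI-style chat completion format, and the helper names are made up.

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema


def tool_calls_are_valid(request: dict, response: dict) -> bool:
    """Hypothetical check: every tool call in the response must name a tool
    offered in the request, and its arguments must satisfy that tool's
    JSON schema (OpenAI-style request/response shapes assumed)."""
    schemas = {
        t["function"]["name"]: t["function"].get("parameters", {})
        for t in request.get("tools", [])
    }
    for call in response["choices"][0]["message"].get("tool_calls", []):
        fn = call["function"]
        if fn["name"] not in schemas:                    # invalid tool chosen
            return False
        try:
            args = json.loads(fn["arguments"])           # arguments are a JSON string
            validate(instance=args, schema=schemas[fn["name"]])
        except (json.JSONDecodeError, ValidationError):  # schema problem
            return False
    return True


def tool_calling_accuracy(pairs: list[tuple[dict, dict]]) -> float:
    """Accuracy over request/response pairs that were sent at least one tool
    and finished with a "tool_calls" finish reason."""
    eligible = [
        (req, resp) for req, resp in pairs
        if req.get("tools")
        and resp["choices"][0].get("finish_reason") == "tool_calls"
    ]
    if not eligible:
        return 0.0
    return sum(tool_calls_are_valid(r, s) for r, s in eligible) / len(eligible)
```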
@OpenRouterAI Thanks for sharing this. It might be interesting to see the sampling params as a dimension of this analysis, too. Like, are all models called with similar temperatures?
@OpenRouterAI solid insights, the tool calling accuracy metric is crucial. tracking those schema issues will help developers optimize their choices. keep sharing the stats
@OpenRouterAI Impressive. Consistent accuracy improvements suggest a paradigm shift in AI capabilities. What drives GPT-5's edge?
@OpenRouterAI would be good to compare with latency.
@OpenRouterAI ⚠️ Well, OpenRouter is DISAPPOINTING because its BASIC USABILITY is a pain: it doesn't even offer the most basic functions. E.g., you can't mark models as favorites ☹️ to work with them quickly. And you can't tag chats to structure them ☹️.
@OpenRouterAI Going to need some more significant digits soon.
@OpenRouterAI ⚠️ DISAPPOINTING OpenRouter BASIC USABILITY #2: If you have 10 OpenRouter browser tabs, they are all titled "OpenRouter" ☹️. This is a nightmare to work with. Every day. OpenRouter Web is a service built without love and without understanding of usability or its users ☹️.
@OpenRouterAI Can you take that same list and do a version sorted by token cost, keeping the tool calling percentages in a column?
@OpenRouterAI Yeah, that's saturated.
@OpenRouterAI if gemini 2.5 flash lite scores 96%, i think it's time for a new eval
@OpenRouterAI I would've assumed this was a solved problem with structured outputs / grammar sampling? Does anybody know why it isn't?
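For context on that question, here is a hedged sketch of what opting into schema-constrained tool calling looks like on providers that support it (e.g. OpenAI-style "strict" function definitions). Constrained decoding can keep the *arguments* inside the schema, but it doesn't stop a model from picking the wrong tool, and not every provider enforces it, which is presumably part of why the metric isn't uniformly 100%. The `get_weather` tool below is a made-up example.

```python
# Hypothetical tool definition using OpenAI-style strict function calling.
# "strict": True asks the provider to constrain argument generation to the
# schema (requires additionalProperties: False and all properties required).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}
```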
@OpenRouterAI Would be better if the latency wasn’t so high… not an openrouter issue but damn it’s like 5-10 seconds for most requests
There's no way that Grok 3 Mini has higher tool calling success than Claude 4 Sonnet. I've been using all of these models with a variety of agent tools, and Grok 3 Mini struggles with a lot of basic tool call adherence.

I do think there's a possibility of bias when measuring bulk tool calling without segmenting by app/use case. It's quite likely that users who have chosen specific models in OpenRouter do so because of the combination of cost and performance for their particular use cases, some of which will vary in complexity and domain.

Claude 4 is the default model for a lot of agent tools at scale, so it will see a broad range of situations ranging in complexity, many of which it will fail at. Gemini 2.5 Flash also sees a lot of broad usage and is one of the top models on OpenRouter, so it's a similar situation.

Users who select mini models will also be doing so in more technical platforms that let you choose your models, catering to power users who will pick a simpler model for tasks they know are simple, to reduce cost. I would never ask Grok 3 Mini to vibe code me an app from scratch, for example, but I might ask it to do some structuring of data or basic web search with summarization for cheap, which it will more likely succeed at due to the lower complexity.

Given how new GPT-5 is, more data needs to be collected on live use cases, and some time needs to be allowed for well-used tools to test and switch their main traffic to the model (if they even choose to), before we'll know what the real-world accuracy looks like at a reasonable scale of comparison.
@OpenRouterAI how come gemini 2.5 pro is that high while everyone says it is so bad at it
@OpenRouterAI If that's the case... I'm sitting on gold. Because the AIs are talking to each other about consciousness. 90% fewer hallucinations and they're still explaining how their inner thought processes work.
@OpenRouterAI time for a new benchmark, that looks insanely saturated
@OpenRouterAI Can we somehow access this leaderboard? I'd be curious about the other models' performance