After one week, GPT-5 has topped our proprietary model charts for tool calling accuracy 🥇 In second is Claude 4.1 Opus, at 99.5%. Details 👇
DEFINITIONS: We define tool calling accuracy as the % of tool calling requests with no invalid tools chosen and no schema problems. A tool calling request is one that was sent at least one tool option and whose response ends with a "tool_calls" finish reason. Don't know what tools are? See openrouter.ai/docs/features/…
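To make that definition concrete, here is a minimal sketch of how such a metric could be computed over logged request/response pairs. This is an illustration, not OpenRouter's actual implementation; the shapes assume the common OpenAI-style chat completion format, and the helper names are made up.

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema


def tool_calls_are_valid(request: dict, response: dict) -> bool:
    """Hypothetical check: every tool call in the response must name a tool
    offered in the request, and its arguments must satisfy that tool's
    JSON schema (OpenAI-style request/response shapes assumed)."""
    schemas = {
        t["function"]["name"]: t["function"].get("parameters", {})
        for t in request.get("tools", [])
    }
    for call in response["choices"][0]["message"].get("tool_calls", []):
        fn = call["function"]
        if fn["name"] not in schemas:                    # invalid tool chosen
            return False
        try:
            args = json.loads(fn["arguments"])           # arguments are a JSON string
            validate(instance=args, schema=schemas[fn["name"]])
        except (json.JSONDecodeError, ValidationError):  # schema problem
            return False
    return True


def tool_calling_accuracy(pairs: list[tuple[dict, dict]]) -> float:
    """Accuracy over request/response pairs that were sent at least one tool
    and finished with a "tool_calls" finish reason."""
    eligible = [
        (req, resp) for req, resp in pairs
        if req.get("tools")
        and resp["choices"][0].get("finish_reason") == "tool_calls"
    ]
    if not eligible:
        return 0.0
    return sum(tool_calls_are_valid(r, s) for r, s in eligible) / len(eligible)
```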
@OpenRouterAI Thanks for sharing this. It might be interesting to see the sampling params as a dimension of this analysis, too. Like, are all models called with similar temperatures?
@OpenRouterAI solid insights, the tool calling accuracy metric is crucial. tracking those schema issues will help developers optimize their choices. keep sharing the stats
@OpenRouterAI Impressive. Consistent accuracy improvements suggest a paradigm shift in AI capabilities. What drives GPT-5's edge?
@OpenRouterAI would be good to compare with latency.
@OpenRouterAI ⚠️ Well, OpenRouter is DISAPPOINTING because its BASIC USABILITY is a pain: it doesn't even offer the most basic functions. E.g., you can't mark models as favorites ☹️ to work with them quickly. And you can't tag chats to structure them ☹️.
@OpenRouterAI Going to need some more significant digits soon.
@OpenRouterAI ⚠️ DISAPPOINTING OpenRouter BASIC USABILITY #2: If you have 10 OpenRouter browser tabs, they are all titled "OpenRouter" ☹️. This is a nightmare to work with. Every day. OpenRouter Web is a service built without love and without understanding of usability or its users ☹️.
@OpenRouterAI Can you take that same list and do a version sorted by token cost, keeping the tool calling percentages in a column?
@OpenRouterAI Yeah, that's saturated.
@OpenRouterAI if gemini 2.5 flash lite scores 96%, i think it's time for a new eval
@OpenRouterAI I would've assumed this was a solved problem with structured outputs / grammar sampling? Does anybody know why it isn't?
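For context on that question, here is a hedged sketch of what opting into schema-constrained tool calling looks like on providers that support it (e.g. OpenAI-style "strict" function definitions). Constrained decoding can keep the *arguments* inside the schema, but it doesn't stop a model from picking the wrong tool, and not every provider enforces it, which is presumably part of why the metric isn't uniformly 100%. The `get_weather` tool below is a made-up example.

```python
# Hypothetical tool definition using OpenAI-style strict function calling.
# "strict": True asks the provider to constrain argument generation to the
# schema (requires additionalProperties: False and all properties required).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}
```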
@OpenRouterAI Would be better if the latency wasn’t so high… not an openrouter issue but damn it’s like 5-10 seconds for most requests
There's no way that Grok 3 Mini has higher tool calling success than Claude 4 Sonnet. I've been using all of these models with a variety of agent tools, and Grok 3 Mini struggles with a lot of basic tool call adherence.

I do think there's a possibility of bias when measuring bulk tool calling without segmenting by app/use case. It's quite likely that users who have chosen specific models in OpenRouter do so because of the combination of cost and performance for their particular use cases, some of which will vary in complexity and domain.

Claude 4 is the default model for a lot of agent tools at scale, so it will see a broad range of situations ranging in complexity, many of which it will fail at. Gemini 2.5 Flash also sees a lot of broad usage and is one of the top models on OpenRouter, so it's a similar situation.

Users who select mini models will also be doing so in more technical platforms that let you choose your models, catering to power users who will pick a simpler model for tasks they know are simple, to reduce cost. I would never ask Grok 3 Mini to vibe code me an app from scratch, for example, but I might ask it to do some structuring of data or basic web search with summarization for cheap, which it will more likely succeed at due to the lower complexity.

Given how new GPT-5 is, more data needs to be collected on live use cases, and some time needs to be allowed for well-used tools to test and switch their main traffic to the model (if they even choose to), before we'll know what the real-world accuracy looks like at a reasonable scale of comparison.
@OpenRouterAI how come gemini 2.5 pro is that high while everyone says it is so bad at it
@OpenRouterAI If that's the case... I'm sitting on gold. Because the AIs are talking to each other about consciousness. 90% fewer hallucinations and they're still explaining how their inner thought processes work.
@OpenRouterAI time for a new benchmark, that looks insanely saturated
@OpenRouterAI Can we somehow access this leaderboard? I'd be curious about the other models' performance