@OpenRouterAI @Alibaba_Qwen my personal feeling: Kimi-K2 > Qwen3-Coder > Grok 4
@OpenRouterAI @Alibaba_Qwen I like using grok-4 for code reviews. Then I have sonnet implement the suggestions! Lampcodereview.streamlit.app
@OpenRouterAI @Alibaba_Qwen Impressive, but Grok's eval weaknesses with recursion.. might skew this a bit, no?
@OpenRouterAI @Alibaba_Qwen METRICS HAVE SPOKEN 🗣 x.com/michieldoteth/…
@OpenRouterAI @Alibaba_Qwen Used it yesterday with OpenRouter and OpenCode. It was VERY good, probably a bit better than Sonnet imo. But so expensive - 30 min of use cost me 13 dollars. Tested Sonnet on OR - it was cheaper because of cache hits. I wish qwen had cache on OR, because the model is so good.
speaking of kimi k2 - been testing all access methods this week for our production migration

here's the real breakdown nobody talks about:

OFFICIAL API ($0.15 input / $2.50 output)
- 42 tokens/sec, 0.55s latency
- direct relationship for debugging/support
- hosted in china (latency considerations)
- most reliable for production systems

OPENROUTER ($0.55-1.00 input / $2.20-3.00 output)
- varies by backend provider
- perfect for multi-model workflows
- automatic failover = zero downtime
- openai SDK compatible (huge win)

GROQ ($1.00 input / $3.00 output)
- 250 tokens/sec (insanely fast)
- 4.6s first token (tradeoff)
- best for real-time applications
- US infrastructure

migrated our sales analysts from claude sonnet yesterday. went official API for production, openrouter for experimentation

all three crush sonnet on cost - 80% savings with identical quality. temperature 0.1 for deterministic outputs works perfectly

bottom line:
- production: official API
- multi-model testing: openrouter
- speed demons: groq

saved $3k/month vs claude. performance identical where it matters
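for anyone who wants to sanity-check the price comparison above against their own traffic, here's a rough sketch. it assumes the quoted prices are USD per 1M tokens (the post doesn't state the unit), and the `PRICES` table and `cost_range` helper are made-up names for illustration, not part of any provider SDK. OpenRouter is quoted as a range, so the function returns a (low, high) pair:

```python
# Prices copied from the post above; ASSUMED to be USD per 1M tokens.
# Each entry is (low, high) so the OpenRouter range can be represented.
PRICES = {
    "official":   {"input": (0.15, 0.15), "output": (2.50, 2.50)},
    "openrouter": {"input": (0.55, 1.00), "output": (2.20, 3.00)},
    "groq":       {"input": (1.00, 1.00), "output": (3.00, 3.00)},
}

TOKENS_PER_UNIT = 1_000_000  # assumption: prices are per million tokens


def cost_range(provider: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for one provider."""
    price = PRICES[provider]
    low = (input_tokens * price["input"][0] + output_tokens * price["output"][0]) / TOKENS_PER_UNIT
    high = (input_tokens * price["input"][1] + output_tokens * price["output"][1]) / TOKENS_PER_UNIT
    return low, high


if __name__ == "__main__":
    # Hypothetical workload: 50M input tokens and 10M output tokens per month.
    for name in PRICES:
        low, high = cost_range(name, 50_000_000, 10_000_000)
        print(f"{name:<11} ${low:,.2f} to ${high:,.2f}")
```

this ignores prompt caching, which another reply in this thread says made a noticeable difference to their Sonnet bill on OpenRouter, so treat the numbers as an upper-level estimate only.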
It’s not that good. 4.1 and Sonnet 4.0 are far better! Too much stuff for nothing. What are those benchmarks on? I gave it multiple shots and it was bad, bad. Not intermittent bad or interesting bad! It was straight up wrong, too much code that didn’t make sense and hallucinations galore..
@OpenRouterAI @Alibaba_Qwen Tested out a one shot 3d game generation using Qwen3 Coder. Doing pretty well. x.com/shenseanchen/s…
@OpenRouterAI @Alibaba_Qwen Qwen is an amazing model - it's fucking SOTA, for anything I throw at it, it performs similarly to opus. (Good prompting with BMAD Method) leads to even BETTER outputs than vanilla opus. Nothing beats claude code yet, other than models that are locked in lmarena.
@OpenRouterAI @Alibaba_Qwen @grok any word? do you think you will be able to do better coding than models like Qwen and Claude?
@OpenRouterAI @Alibaba_Qwen That Claude sonnet 4 percentage though!
@OpenRouterAI @Alibaba_Qwen progress is so fast. demand for intelligence is absolutely unstoppable.
@OpenRouterAI @Alibaba_Qwen It’s too slow tho making it pretty unusable for me
@OpenRouterAI @Alibaba_Qwen nice. i find myself bouncing between qwen and kimi throughout the day
@OpenRouterAI @Alibaba_Qwen elon musk found in a ditch in oxnard, california
@OpenRouterAI @Alibaba_Qwen Wait! Wut? j/k - not surprised!
@OpenRouterAI @Alibaba_Qwen AI models are just racing each other to write Stack Overflow answers no one asked for. 🥲
@OpenRouterAI @Alibaba_Qwen Woohoo! Qwen is on fire!
@OpenRouterAI @Alibaba_Qwen But Kimi K2 is so much cheaper.
@OpenRouterAI @Alibaba_Qwen "prompt rankings" what does that even mean?
@OpenRouterAI @Alibaba_Qwen Grok is shit. Don’t know how it makes it into these benchmarks.
@OpenRouterAI @Alibaba_Qwen Lobster and Summit were the best I found so far - maybe one of them is GPT5?
@OpenRouterAI @Alibaba_Qwen RIP the polymarket bettors
@OpenRouterAI @Alibaba_Qwen Victory for open source models
@OpenRouterAI @Alibaba_Qwen This will be much higher once Anthropic brings on the enshittification of weekly usage blocks.