At this point, how much do people actually care about benchmarks? Calling it now, Grok 4 won't actually be the best model, it's just the classic hype cycle. Starting to see a lot more people catch on.
@neetcode1 X Monetization is a hell of a drug.
@neetcode1 Most benchmarks are BS. I’m only excited by the ARC score.
@neetcode1 Benchmarks are the trailer. Real-world use is the movie.
@neetcode1 It is just like IQ tests. If you start solving 10, 20, 30 of them, then you aren't testing your IQ, you are optimizing your reasoning for IQ tests, and therefore the result becomes less and less trustworthy
@neetcode1 Are people catching on? Yesterday this entire feed was filled with Grok 4 hype... As expected, it's not AGI but more of a bloated mess
@neetcode1 model providers don't disclose quantization level and regularly change it, so the model you actually get from ChatGPT, Claude, Grok, etc. isn't the model that is pegged to the benchmark
@neetcode1 It's true, sometimes the hype can overshadow the actual performance.
@neetcode1 I ignore both benchmarks and initial Twitter over/under hype. Useful models surface to the top organically
@neetcode1 Yeah, it’s sad they mostly showed benchmark results and graphs that most people neither really understand nor care about…
@neetcode1 I think benchmarks still matter, just not to end-users. They're crucial for the developers and businesses who have to choose which foundational model to build on top of.
@neetcode1 LMs are getting saturated. These hype claims are just to maintain the stock value.
@neetcode1 I refuse to believe it's actually smarter than claude at coding.
@neetcode1 By that logic, solving leetcode questions as a ‘benchmark’ should not be someone's full time job.
@neetcode1 They started late and are leading now. Things will only get better. Never bet against Elon!!
@neetcode1 With every new Gemini release it tops the charts, but after trying it, it doesn't feel like the best model at all. So for me the benchmarks are just a hype machine to keep the party going
@neetcode1 many such cases. grok 3 was hyped and leading some benchmarks, only for no one to care other than musk bootlickers. time and time again, anthropic and openai get dethroned in benchmarks, yet they still have the best models when it comes to actually using them
@neetcode1 You're dumb bro, you also said ai was all hype at the start. We all know how much better and more useful it is now
@neetcode1 Everyone caught on ages ago when openai overdid it with strawberry. People now just test out of curiosity but the excitement has dropped. A tweet of just a 🍓 used to drive people crazy 😂
@neetcode1 do you think we'll see like another checkpoint, like the reasoning models, where the models get a serious update/change
@neetcode1 Some benchmarks are manufactured by the authors themselves. We'll soon know about this. Hope it’s not like the llama-gate 🤷🏻‍♂️
@neetcode1 Fully agree!! My tweet after I watched yesterday’s presentation
@neetcode1 Yup, for me Claude and deepseek are better code writers than Grok.
You called it, it didn't do as well on held-out benchmarks. It's about having the right benchmarks, not just self-reported performance. x.com/_valsai/status…