Random thought: If older LLMs scored >90% on benchmarks, why are they suddenly “bad” the moment a new one drops? Were they just trained to ace the test, not generalise?