Shipping reliable AI-powered apps isn't just about model performance – it's about delivering consistent value to users. That's why with LLM evals, response quality, task completion rates, and user satisfaction often matter more than raw benchmark numbers. Love how @braintrustdata makes multimodal evals seamless. I particularly enjoyed their latest findings on evaluating Gemini models for vision: Gemini models use significantly fewer tokens per image than the GPT models, with GPT-4o using 3.5x as many tokens per image. aidevmode.com/blog/braintrus…
Here are the eval findings: braintrust.dev/blog/gemini. Also found this useful: braintrust.dev/blog/after-eva… (h/t @ornelladotcom & @albertzhang36)