Our paper M-RewardBench was accepted to the ACL main conference: arxiv.org/abs/2410.15522
We construct a first-of-its-kind multilingual RM evaluation benchmark and use it to examine how several reward models perform in non-English settings, surfacing other interesting insights along the way.
🚀 We are excited to introduce Kaleidoscope, the largest culturally authentic exam benchmark.
📌 Most VLM benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 evaluation of VLMs.
One standout project, “Evaluating Reward Models in Multilingual Settings,” introduced a benchmark dataset covering 23 languages, showed performance gaps between English and non-English languages, and highlighted the impact of translation quality.
📜:arxiv.org/abs/2410.15522
Thrilled to see INCLUDE accepted as a Spotlight at ICLR 2025! 🎉
This was a massive open science effort!
Amazing work led by @agromanou, @negarforoutan, and Anna ❤️
Was lovely collaborating with them as well as @Sree_Harsha_N, @rmahesh__, and others from the @CohereForAI community! 🙌
🔥 INCLUDE is an ambitious and critical release. Very proud of this cross-institutional collaboration.
The most extensive collection to date of in-language examinations from across the world. 🌎🌍🌏
Critical work to ensure AI progress does not overfit to knowledge of US exam subjects.
What would it take for AI evaluations to truly support our global experiences? 🌍
Our cross-institutional paper introduces INCLUDE, a multilingual LLM evaluation benchmark of local exams capturing in-language nuances & cultural context for truly localized AI evaluation.
🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages!
Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc
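For readers curious about the mechanics: a minimal sketch of how accuracy is typically computed on a multiple-choice exam benchmark like INCLUDE. The dataset fields ("question", "choices", "answer") and the scoring callback are illustrative assumptions, not the benchmark's actual format or API.

```python
# Minimal sketch: multiple-choice accuracy for an exam-style benchmark.
# The schema and scorer below are assumptions for illustration only.

def mc_accuracy(score_choice, questions):
    """score_choice(question, choice) -> float; higher means more likely."""
    correct = 0
    for q in questions:
        # Predict the choice the model scores highest (in a real run this
        # would be e.g. per-choice log-likelihood under the LLM).
        pred = max(q["choices"], key=lambda c: score_choice(q["question"], c))
        correct += int(pred == q["answer"])
    return correct / len(questions)

# Toy usage with a stub scorer standing in for an LLM.
toy = [{"question": "2+2=?", "choices": ["3", "4"], "answer": "4"}]
print(mc_accuracy(lambda q, c: float(c == "4"), toy))  # 1.0
```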
🌍 As multilingual language models grow in reach and impact, the need for robust evaluation datasets intensifies.
🚨 We present a multilingual reward benchmarking dataset, designed to rigorously evaluate models and reveal any blind spots in current multilingual model training.
Evaluation drives progress ⛰️
We're excited to share our latest work! 🌍 We built a multilingual evaluation set to see how reward models really hold up across languages and ran extensive benchmarks on top LLMs.
Evaluation drives progress ⛰️
We're excited to share our latest work! 🌍 We built a multilingual evaluation set to see how reward models really hold up across languages and ran extensive benchmarks on top LLMs.
✨ New Evaluation Benchmark for Reward Models - We Go Multilingual! ✨
Introducing M-RewardBench: A massively multilingual RM evaluation benchmark covering 23 typologically diverse languages across 5 tasks.
Paper, code, dataset: m-rewardbench.github.io
Our contributions:
1/9
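A quick primer for context: reward-model benchmarks of this kind usually report pairwise accuracy, i.e. how often the RM scores the chosen response above the rejected one. Below is a minimal sketch under assumed field names (see m-rewardbench.github.io for the actual data format).

```python
# Minimal sketch: pairwise accuracy for a preference benchmark.
# Field names are assumptions for illustration, not the actual schema.

def rm_accuracy(reward_fn, pairs):
    """reward_fn(prompt, response) -> scalar reward."""
    wins = sum(
        reward_fn(p["prompt"], p["chosen"]) > reward_fn(p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)

# Toy usage with a stub reward function (here: longer is better).
toy = [{"prompt": "Hi", "chosen": "Hello there!", "rejected": "Hi"}]
print(rm_accuracy(lambda p, r: len(r), toy))  # 1.0
```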
Thrilled to share our work has been accepted at @EMNLP2024 (Findings)🎉🔥.
-𝗜𝘁𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗼𝗳 𝗟𝗟𝗠𝘀 ✅
-Curriculum DPO training ✅
-Impressive gains across Vicuna-Bench, WizardLM, MT-Bench, and UltraFeedback ✅
Paper - arxiv.org/abs/2403.07230
(1/2)
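For readers who want the gist of the ingredients above: a sketch of the standard DPO loss, plus one plausible way to order preference pairs into a curriculum. This is a generic illustration under assumed inputs (per-response log-probs, a per-pair difficulty score), not the paper's exact training recipe; see arxiv.org/abs/2403.07230 for the real method.

```python
import math

# Standard DPO loss for one (chosen, rejected) pair, given log-probs of
# each response under the policy and a frozen reference model.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 3))  # ≈ 0.598

# One possible "curriculum": train on pairs sorted easiest-first by an
# assumed per-pair difficulty score (e.g. a reward-margin estimate).
pairs = [{"id": 1, "difficulty": 0.9}, {"id": 2, "difficulty": 0.2}]
schedule = sorted(pairs, key=lambda p: p["difficulty"])
print([p["id"] for p in schedule])  # [2, 1]
```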