Thrilled our global data ecosystem audit was accepted to #ICLR2025!
Empirically, we find:
1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024).
2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection.
3⃣ <0.2% of data from…
Testing LLMs' reasoning skills is tough—human evaluations are expensive, data contamination is common, and LLM judges can be biased. We propose StructTest, the first benchmark that checks how well LLMs follow complex instructions and create structured outputs. It uses a…
✨New Report✨ Our data ecosystem audit across text, speech, and video (✏️,📢,📽️) finds:
📈 Rising reliance on web, synthetic, and YouTube data.
🛑 80%+ datasets carry hidden restrictions.
🌍 Relative representation in languages and creators has not improved for 10+ yrs.…
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)
⛔️ Robots.txt preference protocols…
Introducing: 💫StarCoder
StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant.
Try it here: shorturl.at/cYZ06r
Release thread🧵
Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling!
Demo: hf.co/spaces/bigcode…
Paper: hf.co/datasets/bigco…
Attribution: hf.co/spaces/bigcode…
A🧵:
✨Our work "How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts" got accepted at the Findings of EMNLP 2022!✨
Joint work with @manandey and our awesome mentor @koustuvsinha 🎉
New paper alert! 🎉 Turns out you can reduce the gender biases of your translation models just by using relevant contexts, purely during inference! Check out this cool work led by @evolvedeve and @manandey! arxiv.org/abs/2205.10762 [1/4]
🧐🕵️I am looking for the best possible open source tool to do memory profiling!
I would like to know what part of my python code is causing these memory usage spikes that don't necessarily come from the Python interpreter.
Looking forward to reading your recommendations! 🤗
We are releasing PromptSource, a toolkit for creating, sharing, and using natural language prompts.
We used it to create the largest open-source collection of English prompts: 2,000 prompts for 170 datasets!
📄 arxiv.org/abs/2202.01279
💻 github.com/bigscience-wor…
Tokenization—the least interesting #NLProc topic? Hell no! We, members of the @BigscienceW tokenization group are proud to present:
✨Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP✨
arxiv.org/abs/2112.10508
What's in it? [1/10]
We’ve seen crazy interest in T0++ (pronounced "T Zero Plus Plus"), and almost 10’000 queries to the model since we announced it 3 days ago.
Probably the most hilariously decisive prediction from the model (courtesy of @_philschmid):
1/N
Hi #NeurIPS2020! @manandey and I will be presenting our poster on *Evaluating Gender Bias in NLI* at the Workshop on Dataset Curation and Security today (11th Dec) at 2:30 PM EST. Drop by if you're around :)
cc: @koustuvsinha
Gather Town (Poster 19)
neurips.gather.town/app/A4yaHmXq3U…
I'm really happy to share that our work on evaluating gender bias in NLI systems has been accepted at #NeurIPS2020 Workshop on Dataset Curation and Security. Joint work with amazing collaborators @manandey and @koustuvsinha. More details coming soon!
I'll be presenting our poster on assessing viewers' mental health by analysing YouTube videos at the AI for Social Good workshop at #NeurIPS2019! Drop by if you’re around! Poster sessions at 9:35-10:30 AM and 3:30-4:15 PM East MR11,12 - with @manandey
Really happy that our paper (with @manandey) has been accepted at the @NeurIPSConf 2019 Workshop on AI for Social Good.
We'll be discussing the effect of YouTube videos on viewers' mental health. You can read more about our work at thechange.world #AI4Good #NeurIPS