Manan Dey @manandey

scholar.google.co.in/citations?user… · India · Joined July 2013

Tweets: 22 · Followers: 111 · Following: 2K · Likes: 945

Shayne Longpre @ShayneRedford

5 months ago

Thrilled our global data ecosystem audit was accepted to #ICLR2025! Empirically, we find: 1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024). 2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection. 3⃣ <0.2% of data from…

4 22 76 15K 25

Caiming Xiong @CaimingXiong

6 months ago

Testing LLMs' reasoning skills is tough—human evaluations are expensive, data contamination is common, and LLM judges can be biased. We propose StructTest, the first benchmark that checks how well LLMs follow complex instructions and create structured outputs. It uses a…

3 37 146 13K 92

Shayne Longpre @ShayneRedford

9 months ago

✨New Report✨ Our data ecosystem audit across text, speech, and video (✏️,📢,📽️) finds: 📈 Rising reliance on web, synthetic, and YouTube data. 🛑 80%+ datasets carry hidden restrictions. 🌍 Relative representation in languages and creators has not improved for 10+ yrs.…

1 43 86 24K 27

Shayne Longpre @ShayneRedford

a year ago

✨New Preprint ✨ How are shifting norms on the web impacting AI? We find: 📉 A rapid decline in the consenting data commons (the web) ⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic) ⛔️ Robots.txt preference protocols…

12 94 235 115K 86

BigCode @BigCodeProject

2 years ago

Introducing: 💫StarCoder StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant. Try it here: shorturl.at/cYZ06r Release thread🧵

69 642 3K 882K 2K

BigCode @BigCodeProject

3 years ago

Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling! Demo: hf.co/spaces/bigcode… Paper: hf.co/datasets/bigco… Attribution: hf.co/spaces/bigcode… A🧵:

9 199 836 264K 259

Shanya Sharma @evolvedeve

3 years ago

✨Our work "How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts" got accepted at the Findings of EMNLP 2022!✨ Joint work with @manandey and our awesome mentor @koustuvsinha 🎉

3 2 20 0 1

BigScience Research Workshop @BigscienceW

3 years ago

BLOOM is here. The largest open-access multilingual language model ever. Read more about it or get it at bigscience.huggingface.co/blog/bloom hf.co/bigscience/blo…

29 778 3K 0 441

Koustuv Sinha @koustuvsinha

3 years ago

New paper alert! 🎉 Turns out you can reduce the gender biases of your translation models just by using relevant contexts, purely during inference! Check out this cool work led by @evolvedeve and @manandey! arxiv.org/abs/2205.10762 [1/4]

2 3 21 0 1

Saulnier Lucile @LucileSaulnier

4 years ago

🧐🕵️I am looking for the best possible open source tool to do memory profiling! I would like to know what part of my python code is causing these memory usage spikes that don't necessarily come from the Python interpreter. Looking forward to reading your recommendations! 🤗

11 21 150 0 81

BigScience Research Workshop @BigscienceW

4 years ago

We are releasing PromptSource, a toolkit for creating, sharing, and using natural language prompts. We used it to create the largest open-source collection of English prompts: 2,000 prompts for 170 datasets! 📄 arxiv.org/abs/2202.01279 💻 github.com/bigscience-wor…

4 89 376 0 82

Sabrina J. Mielke @sjmielke

4 years ago

Tokenization—the least interesting #NLProc topic? Hell no! We, members of the @BigscienceW tokenization group are proud to present: ✨Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP✨ arxiv.org/abs/2112.10508 What's in it? [1/10]

15 132 662 0 173

Victor Sanh @SanhEstPasMoi

4 years ago

We’ve seen crazy interest in T0++ (pronounced "T Zero Plus Plus"), and almost 10’000 queries to the model since we announced it 3 days ago. Probably the most hilariously decisive prediction from the model (courtesy of @_philschmid): 1/N

6 42 240 0 42

BigScience Research Workshop @BigscienceW

4 years ago

First modeling paper out of BigScience is here! T0 shows zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks, while being 16x smaller! Model: huggingface.co/bigscience/T0pp Repo: github.com/bigscience-wor… Paper: arxiv.org/abs/2110.08207

14 299 1K 0 229

Shanya Sharma @evolvedeve

5 years ago

Hi #NeurIPS2020! @manandey and I will be presenting our poster on *Evaluating Gender Bias in NLI* at the Workshop on Dataset Curation and Security today (11th Dec) at 2:30 PM EST. Drop by if you're around :) cc: @koustuvsinha Gather Town (Poster 19) neurips.gather.town/app/A4yaHmXq3U…

0 1 10 0 0

Shanya Sharma @evolvedeve

5 years ago

I'm really happy to share that our work on evaluating gender bias in NLI systems has been accepted at #NeurIPS2020 Workshop on Dataset Curation and Security. Joint work with amazing collaborators @manandey and @koustuvsinha. More details coming soon!

0 1 11 0 0

Shanya Sharma @evolvedeve

6 years ago

I'll be presenting our poster on assessing viewers' mental health by analysing YouTube videos at the AI for Social Good workshop at #NeurIPS2019! Drop by if you’re around! Poster sessions at 9:35-10:30 AM and 3:30-4:15 PM East MR11,12 - with @manandey

0 2 5 0 0

Shanya Sharma @evolvedeve

6 years ago

Really happy that our (@manandey's and my) paper has been accepted at the @NeurIPSConf 2019 Workshop on "AI for Social Good". We'll be discussing the effect of YouTube videos on viewers' mental health. You can read more about our work at thechange.world #AI4Good #NeurIPS