The official account of the 1st Workshop on Instruction Tuning and Instruction Following (ITIF), colocated with NeurIPS, in December 2023.
an-instructive-workshop.github.io
New Orleans
Joined August 2023
Copyrighted 🚧, private 🛑, and sensitive ☢️ data remain major challenges for AI.
FlexOlmo introduces an architectural mechanism to flexibly opt-in/opt-out segments of data in the training weights, **at inference time**.
(Prior common solutions were to filter your data once…
Come say hello at ICLR! 👋 Here's where you can find me:
Friday: Data-centric AI Social! lu.ma/rmyoy2vw
Saturday: Multimodal Data Provenance poster (3 pm, Hall 2B #494)
Sunday: MLDPR Workshop (3 pm) [mldpr2025.com]—I'll talk about challenges to AI data…
Thrilled our global data ecosystem audit was accepted to #ICLR2025!
Empirically, we find:
1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024).
2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection.
3⃣ <0.2% of data from…
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, “In House Evaluation is Not Enough,” has 3 calls to action to empower independent evaluators:
1️⃣ Standardized AI flaw reports
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination…
I compiled a list of resources for understanding AI copyright challenges (US-centric). 📚
➡️ why is copyright an issue?
➡️ what is fair use?
➡️ why are memorization and generation important?
➡️ how does it impact the AI data supply / web crawling?
🧵
I wrote a spicy piece on "AI crawler wars" 🐞 in MIT @techreview (my first op-ed)!
While we’re busy watching copyright lawsuits & the EU AI Act, there’s a quieter battle over data access that affects websites, everyday users, and the open web.
🔗 technologyreview.com/2025/02/11/111…
1/
1/ Last week, we published the International AI Safety Report—supported by 30 nations plus the OECD, UN, and EU.
Over 100 independent experts contributed. I’m thankful to play a small writing role, focusing on “Risks of Copyright.”
🔗 bit.ly/40Vm7Mu
Our updated Responsible Foundation Model Development Cheatsheet (250+ tools & resources) is now officially accepted to @TmlrOrg 2025!
It covers:
- data sourcing,
- documentation,
- environmental impact,
- risk eval
- model release & licensing
🪶 Some thoughts on DeepSeek, OpenAI, and the copyright battles:
This isn’t the first time OpenAI has accused a Chinese company of breaking its Terms and training on ChatGPT outputs.
Dec 2023: They suspended ByteDance’s accounts.
1/
Check out our recipe for adapting existing LMs for multimodal generation: it fully preserves language performances while enhancing models with visual understanding and generation🖼️ https://t.co/QDN0GallXG
New Report, to appear at @RealAAAI 2025:
The @defcon 2024 @aivillage_dc Generative Red Team 2 (GRT2) Case Study, led by @seanmcgregor
The event spanned:
⚔️495 hackers, against AI2’s Olmo + WildGuard
🐞200 model flaw reports
💰$7k+ paid bounties
🔗 arxiv.org/pdf/2410.12104
✨New Report✨ Our data ecosystem audit across text, speech, and video (✏️,📢,📽️) finds:
📈 Rising reliance on web, synthetic, and YouTube data.
🛑 80%+ datasets carry hidden restrictions.
🌍 Relative representation in languages and creators has not improved for 10+ yrs.…
In this ecosystem-wide study we set out to analyze trends in data sourcing, representation, and restrictions 🔎, and develop tools to facilitate the navigation and filtering of these datasets for developers 🛠️.
Touching down in Vancouver 🛬 for #NeurIPS2024!
I'll be presenting our "Consent in Crisis" work on the 11th: arxiv.org/abs/2407.14933
Reach out to catch up or chat about:
- Training data / methods
- AI uses & impacts
- Multilingual scaling
@iclr_conf author responses are mostly ignored... and it's hurting the field.
Proposal: why not require two stages of reviewing: (1) Review, and (2) Review Rebuttal—even if it is just a checkbox and re-score.
Without both, reviewers shouldn't get credit.