Instruction Workshop, NeurIPS 2023 @itif_workshop

The official account of the 1st Workshop on Instruction Tuning and Instruction Following (ITIF), co-located with NeurIPS in December 2023.

Website: an-instructive-workshop.github.io
Location: New Orleans
Joined August 2023

Tweets

185
Followers

256
Following

28
Likes

222

Shayne Longpre @ShayneRedford

2 months ago

Copyrighted 🚧, private 🛑, and sensitive ☢️ data remain major challenges for AI. FlexOlmo introduces an architectural mechanism to flexibly opt-in/opt-out segments of data in the training weights, **at inference time**. (Prior common solutions were to filter your data once…

1 8 39 4K 10

Shayne Longpre @ShayneRedford

2 months ago

Thrilled to collaborate on the launch of 📚 CommonPile v0.1 📚 ! Introducing the largest openly-licensed LLM pretraining corpus (8 TB), led by @kandpal_nikhil @blester125 @colinraffel. 📜: arxiv.org/pdf/2506.05209 📚🤖 Data & models: huggingface.co/common-pile 1/

2 12 63 4K 25

Shayne Longpre @ShayneRedford

4 months ago

Come say hello at ICLR! 👋 Here's where you can find me: Friday: Data-centric AI Social! lu.ma/rmyoy2vw Saturday: Multimodal Data Provenance poster (3 pm, Hall 2B #494) Sunday: MLDPR Workshop (3 pm) [mldpr2025.com]—I'll talk about challenges to AI data…

1 7 24 2K 1

Shayne Longpre @ShayneRedford

5 months ago

Thrilled our global data ecosystem audit was accepted to #ICLR2025! Empirically, we find: 1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024). 2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection. 3⃣ <0.2% of data from…

4 22 76 15K 25

Shayne Longpre @ShayneRedford

6 months ago

What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️ Our new paper, “In House Evaluation is Not Enough” has 3 calls-to-action to empower independent evaluators: 1️⃣ Standardized AI flaw reports 2️⃣ AI flaw disclosure programs + safe harbors. 3️⃣ A coordination…

6 31 61 12K 34

Shayne Longpre @ShayneRedford

7 months ago

I compiled a list of resources for understanding AI copyright challenges (US-centric). 📚 ➡️ why is copyright an issue? ➡️ what is fair use? ➡️ why are memorization and generation important? ➡️ how does it impact the AI data supply / web crawling? 🧵

2 8 18 2K 13

Shayne Longpre @ShayneRedford

7 months ago

I wrote a spicy piece on "AI crawler wars"🐞 in @MIT @techreview (my first op-ed)! While we’re busy watching copyright lawsuits & the EU AI Act, there’s a quieter battle over data access that affects websites, everyday users, and the open web. 🔗 technologyreview.com/2025/02/11/111… 1/

3 10 30 4K 18

Shayne Longpre @ShayneRedford

7 months ago

1/ Last week, we published the International AI Safety Report—supported by 30 nations plus the OECD, UN, and EU. Over 100 independent experts contributed. I’m thankful to play a small writing role, focusing on “Risks of Copyright.” 🔗 bit.ly/40Vm7Mu

1 2 12 778 2

Shayne Longpre @ShayneRedford

7 months ago

Our updated Responsible Foundation Model Development Cheatsheet (250+ tools & resources) is now officially accepted to @TmlrOrg 2025! It covers: - data sourcing, - documentation, - environmental impact, - risk eval - model release & licensing

1 29 98 9K 53

Shayne Longpre @ShayneRedford

7 months ago

🪶 Some thoughts on DeepSeek, OpenAI, and the copyright battles: This isn’t the first time OpenAI has accused a Chinese company of breaking its Terms and training on ChatGPT outputs. Dec 2023: They suspended ByteDance’s accounts. 1/

1 6 29 3K 6

Weijia Shi @WeijiaShi2

7 months ago

Check out our recipe for adapting existing LMs for multimodal generation: it fully preserves language performances while enhancing models with visual understanding and generation🖼️

Weijia Shi @WeijiaShi2

9 months ago

Check out our recipe for adapting existing LMs for multimodal generation: it fully preserves language performances while enhancing models with visual understanding and generation🖼️ https://t.co/QDN0GallXG

13 179 858 131K 450

1 14 63 6K 10

Shayne Longpre @ShayneRedford

8 months ago

New Report, to appear at @RealAAAI 2025: The @defcon 2024 @aivillage_dc Generative Red Team 2 (GRT2) Case Study, led by @seanmcgregor The event spanned: ⚔️495 hackers, against AI2’s Olmo + WildGuard 🐞200 model flaw reports 💰$7k+ paid bounties 🔗 arxiv.org/pdf/2410.12104

1 6 22 2K 3

Shayne Longpre @ShayneRedford

9 months ago

✨New Report✨ Our data ecosystem audit across text, speech, and video (✏️,📢,📽️) finds: 📈 Rising reliance on web, synthetic, and YouTube data. 🛑 80%+ datasets carry hidden restrictions. 🌍 Relative representation in languages and creators has not improved for 10+ yrs.…

1 43 86 24K 27

Cohere Labs @Cohere_Labs

9 months ago

In this ecosystem-wide study we set out to analyze trends in data sourcing, representation, and restrictions 🔎, and develop tools to facilitate the navigation and filtering of these datasets for developers 🛠️.

1 14 34 16K 4

Melissa Heikkilä @Melissahei

9 months ago

New research reveals a worrying trend: AI's data practices risk concentrating power overwhelmingly in the hands of dominant technology companies. I spoke w/@ShayneRedford @sarahookr @sarahbmyers @GiadaPistilli about what this says about the state of AI technologyreview.com/2024/12/18/110…

4 31 57 11K 38

Shayne Longpre @ShayneRedford

9 months ago

Touching down in Vancouver 🛬 for #NeurIPS2024! I'll be presenting our "Consent in Crisis" work on the 11th: arxiv.org/abs/2407.14933 Reach out to catch up or chat about: - Training data / methods - AI uses & impacts - Multilingual scaling

2 16 70 5K 8

Shayne Longpre @ShayneRedford

9 months ago

Interested in how LLMs are really used? We are starting a research project to find out! In collaboration w/ @sarahookr @AnkaReuel @ahmetustun89 @niloofar_mire and others. We are looking for two junior researchers to join us. Apply by Dec 15th! forms.gle/H2o3cNCPdG8eDk…

4 37 139 24K 103

Shayne Longpre @ShayneRedford

9 months ago

@iclr_conf author responses are mostly ignored... and it's hurting the field. Proposal: why not require two stages of reviewing: (1) Review, and (2) Review Rebuttal—even if it is just a checkbox and re-score. Without both, reviewers shouldn't get credit.