Arnab Sen Sharma @arnab_api
Ph.D. student @KhouryCollege, working to make LLMs interpretable arnab-api.github.io Boston, MA Joined September 2022-
Tweets48
-
Followers183
-
Following139
-
Likes180
How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step in demystifing it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly found that it…
💡 New ICLR paper! 💡 "On Linear Representations and Pretraining Data Frequency in Language Models": We provide an explanation for when & why linear representations form in large (or small) language models. Led by @jack_merullo_ , w/ @nlpnoah & @sarahwiegreffe
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
💨 A new architecture of automating mechanistic interpretability with causal interchange intervention! #ICLR2025 🔬Neural networks are particularly good at discovering patterns from high-dimensional data, so we trained them to ... interpret themselves! 🧑🔬 1/4
[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition", a "detail", and a "style" block. And one that is hard to make sense of.
Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_ Transluce's @cogconfluence MIT's @dhadfieldmenell resilience.baulab.info/docs/AI_Action… Here's why:🧵 ↘️
DeepSeek R1 shows how important it is to be studying the internals of reasoning models. Try our code: Here @can_rager shows a method for auditing AI bias by probing the internal monologue. dsthoughts.baulab.info I'd be interested in your thoughts.
More big news! Applications are open for the NDIF Summer Engineering Fellowship—an opportunity to work on cutting-edge AI research infrastructure this summer in Boston! 🚀
why do language models think 9.11 > 9.9? at @TransluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?
why do language models think 9.11 > 9.9? at @TransluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible? https://t.co/8gzm89FJ0K
Hallucinations are a subject of much interest, but how much do we know about them? In our new paper, we found that the internals of LLMs contain far more information about truthfulness than we knew! 🧵 Project page >> llms-know.github.io Arxiv >> arxiv.org/abs/2410.02707
Attending @COLM_conf in Philadelphia! Drop by our poster on Wednesday morning (Session 5, 11 am - 1 pm, #8) Would love to catch up and chat about interpretability. Give me a DM!
Attending @COLM_conf in Philadelphia! Drop by our poster on Wednesday morning (Session 5, 11 am - 1 pm, #8) Would love to catch up and chat about interpretability. Give me a DM!
What should be the goal of unlearning in language models? In our new preprint we look at this question carefully and propose a new erasing method, "ELM," that erases knowledge from LLMs very cleanly. It is driven by three key goals - here is an explainer: 🧵👇
Thanks Tal! 📜 In this paper, we provide a theoretically grounded review of causal (which, imo, ⊇ mechanistic) interpretability. We argue that this gives a more cohesive narrative of the field, and makes it easier to see actionable open directions for future work! 🧵
Thanks Tal! 📜 In this paper, we provide a theoretically grounded review of causal (which, imo, ⊇ mechanistic) interpretability. We argue that this gives a more cohesive narrative of the field, and makes it easier to see actionable open directions for future work! 🧵
Thanks to my many great coauthors for essential contributions to this review! @BrinkmannJannik @millicent_li @saprmarks @kpal_koyena @nikhil07prakash @can_rager @arunasank @arnab_api @SunJiuding @ericwtodd @davidbau @boknilev More in the paper! 📜 arxiv.org/abs/2408.01416
Time to study #llama3 405b, but gosh it's big! Please retweet: if you have a great experiment but not enough GPU, here is an opportunity to apply for shared #NDIF research resources. Deadline July 30: ndif.us/405b.html You'll help @ndif_team test, we'll help you run 405b
Time to study #llama3 405b, but gosh it's big! Please retweet: if you have a great experiment but not enough GPU, here is an opportunity to apply for shared #NDIF research resources. Deadline July 30: ndif.us/405b.html You'll help @ndif_team test, we'll help you run 405b
NNsight and NDIF Democratizing Access to Foundation Model Internals The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering
Just reached Vienna to attend ICLR! Stop by our poster session tomorrow (10:45 am, Hall B, #131). Would love to chat with people about interpretability and AI alignment. DMs are open!
Just reached Vienna to attend ICLR! Stop by our poster session tomorrow (10:45 am, Hall B, #131). Would love to chat with people about interpretability and AI alignment. DMs are open!
I am delighted to officially announce the National Deep Inference Fabric project, #NDIF. ndif.us NDIF is an @NSF-supported computational infrastructure project to help YOU advance the science of large-scale AI.

Mushfiqur Rahman @Mushfique83
39 Followers 870 Following
Yulu Qin @yulu_qin
151 Followers 436 Following
Ibrahim Adiza @linda7472297404
1 Followers 260 Following
Ibrahim Adiza @deborah2434312
0 Followers 296 Following
Alan Saji @AlanSaji2251
28 Followers 132 Following Research Associate @AI4Bharat Interests include AI interpretability and reasoning Prev: Axis Bank, IIT Madras
Md Adith Mollah (mr. ... @Adith082
5 Followers 132 Following csegrad @sustbd | Trustworthy ML and XAI enthusiast | Alumni of @aspire_leaders | Competitive Programmer | Content Creator @Youtube
Jung Hoon Son, M.D. @junghoon_sonMD
1K Followers 3K Following knowledge architect // pathologist | informaticist
Swati A @swatiardeshna
204 Followers 7K Following
henry castillo @henrycstllo
79 Followers 613 Following phd @tamu. prev: swe @stripe, bs @utaustin. i want to mechanistically understand models through the lens of training dynamics. 🇵🇪🏳️🌈
Vartan Polodyan @VarPol
187 Followers 6K Following
James Morrison Rubin @import_jmr
7K Followers 6K Following Product Lead | Google Gemini Prev: Launched @aws Trainium, @alexa99 Echo Show 5 Tweets are my own. Retweets are not endorsements. Joyful Learning Machines
Imranur Rahman @imranur7
111 Followers 793 Following PhD Student at NC State, Security and Privacy enthusiast, ex Samsung
Subramanyam Sahoo @iamwsubramanyam
186 Followers 4K Following Independent AI Safety researcher, M. Tech x Summa Cum Laude @NITHamirpurHP. BASIS Fellow @UCBerkeley, RA @HarvardAISafety. Get Published or Die Trying.
AI Safety Papers @safe_paper
2K Followers 219 Following Sharing the latest in AI safety research. "One who says they have no time to read papers will never read papers even when they have time a-plenty."
worst tweep eune @mmn3mm
721 Followers 324 Following I hacked a friend and it didn't feel bad at all. @r3billions
Spairsairs @SpairsairsEMk1
45 Followers 2K Following
David D. Baek @dbaek__
2K Followers 31 Following PhD Student @ MIT EECS / Mechanistic Interpretability, Scalable Oversight
Sarah Wiegreffe @sarahwiegreffe
5K Followers 1K Following Research in NLP (mostly LM interpretability & explainability). Assistant prof @umdcs @clipumd Formerly @allen_ai @uwnlp @icatgt @gtcomputing Views my own.
ineffable alias @joorosa12185462
147 Followers 8K Following
AI News International... @AINewsInt
1K Followers 7K Following 🔺Advancing The State Of Artificial Intelligence
Chris Wendler @wendlerch
538 Followers 837 Following PostDoc at Northeastern university; LAION contributor; I like deep learning & open source.
Enamul Hassan @enamcse
170 Followers 978 Following PhD Student @ CS, SBU, NY | Assistant Professor @ CSE, SUST, BD | Sports Programmer | Parent | Husband
Natalie Shapira @NatalieShapira
1K Followers 277 Following Tell me about challenges, the unbelievable, the human mind and artificial intelligence, thoughts, social life, family life, science and philosophy.
zkhani @zkhani7
2K Followers 3K Following
francesco ortu @francescortu
244 Followers 1K Following NLP & Interpretability | PhD Student @UniTrieste & Data Engineering Lab @AreaSciencePark | Prev @MPI_IS intern https://t.co/RM03tgnbJP
Michael Hanna @michaelwhanna
613 Followers 437 Following PhD student at the University of Amsterdam / ILLC, interested in computational linguistics and (mechanistic) interpretability. Current Anthropic Fellow.
Shun Shao 邵顺 @ShunShao6
50 Followers 365 Following Phd student @CambridgeLTL, MScR @EdinburghNLP @InfAtEd, MEng @ucl. Working on large language models, LLM Interpretability, Spectral learning.
Michael Toker @michael_toker
135 Followers 519 Following PhD candidate @Technion NLP lab - Developing explainability methods to gain a better understanding of Text to Image models
KN throwaway @Knthrowaway2222
6 Followers 167 Following
Mike Allton | AI for ... @mike_allton
51K Followers 8K Following AI for Creators & Solopreneurs | Automate your biz, reclaim your time. Host @ The AI Hat Podcast. Learn how below! 👇
Soda Fries @soda4fries
0 Followers 4 Following
Danial Namazifard @IamDanialNamazi
674 Followers 6K Following Graduate Researcher at University of Tehran #NLProc #MachineLearning
Froughtoa @Froughtoa2fyIw
64 Followers 7K Following
Amir H. Kargaran @amir_nlp
799 Followers 3K Following On job maket (Fall 2025) / 🤖 PhD student @CisLmu/ 🛠️ Multilingual NLP / Previous: Intern @huggingface. My views!
James Michaelov @jamichaelov
365 Followers 732 Following Postdoc @MIT. Previously: @CogSciUCSD, @CARTAUCSD, @AmazonScience, @InfAtEd, @SchoolofPPLS. Research: language comprehension, the brain, artificial intelligence
Mert İnan @Merterm
271 Followers 2K Following CS PhD candidate @Northeastern Cognitive-aware MM convAI interdisciplinarity lover @FulbrightPrgrm @SCSatCMU 🦋: @merterm.bsky.social
8jag @0x386a6167
101 Followers 2K Following
Prakash Kagitha @prakashkagitha
294 Followers 2K Following Research Assistant @DrexelUniv. Improving planning with LLMs. Previously, Lead Data Scientist @ https://t.co/yohjwHWaMY
Seytasa @seytasa47488
72 Followers 5K Following
Modou Sarr @Modou__Sarr
5 Followers 255 Following
Mikita Balesni 🇺�... @balesni
852 Followers 623 Following Working on risks from rogue AI @apolloaievals Past: Reversal curse, Out-of-context reasoning // best way to support 🇺🇦 https://t.co/eagDB8VUzz
James Morrison Rubin @import_jmr
7K Followers 6K Following Product Lead | Google Gemini Prev: Launched @aws Trainium, @alexa99 Echo Show 5 Tweets are my own. Retweets are not endorsements. Joyful Learning Machines
Md Mahadi Hasan Nahid @mhn_nahid
568 Followers 2K Following CS PhD student @UAlberta 🍁 | | working on #NLProc #LLM | Father | Loves 👨🍳 ✈️ 🌎| 🇧🇩|🇨🇦|🇨🇳|🇮🇳|🇫🇷|🇵🇹|🇦🇪|🇲🇽
Enamul Hassan @enamcse
170 Followers 978 Following PhD Student @ CS, SBU, NY | Assistant Professor @ CSE, SUST, BD | Sports Programmer | Parent | Husband
DiscussingFilm @DiscussingFilm
2.6M Followers 819 Following Your leading source for quick reliable news. Home for healthy and liberating discussion on all things pop culture. (Affiliate links shared earn us commissions)
Sarah Wiegreffe @sarahwiegreffe
5K Followers 1K Following Research in NLP (mostly LM interpretability & explainability). Assistant prof @umdcs @clipumd Formerly @allen_ai @uwnlp @icatgt @gtcomputing Views my own.
David D. Baek @dbaek__
2K Followers 31 Following PhD Student @ MIT EECS / Mechanistic Interpretability, Scalable Oversight
Actionable Interpreta... @ActInterp
252 Followers 11 Following 🛠️ Actionable Interpretability🔎 @icmlconf 2025 | Bridging the gap between insights and actions ✨ https://t.co/4zRMTbzwDc
Chris Wendler @wendlerch
538 Followers 837 Following PostDoc at Northeastern university; LAION contributor; I like deep learning & open source.
Natalie Shapira @NatalieShapira
1K Followers 277 Following Tell me about challenges, the unbelievable, the human mind and artificial intelligence, thoughts, social life, family life, science and philosophy.
Josh Engels @JoshAEngels
1K Followers 117 Following Mech interp @GoogleDeepMind | on leave from my PhD @MIT. Let's use interp to make models safer today
Roger Beaty @Roger_Beaty
3K Followers 2K Following Associate Professor of Psychology @penn_state. Director of the Cognitive Neuroscience of Creativity Lab. President-elect @tsfnc.
Yoed Kenett @yoed_kenett
1K Followers 997 Following Assistant Professor at the Technion - Israel Institute of Technology. Studying High-level cognition, creativity, cognitive networks, knowledge
Open Philanthropy @open_phil
18K Followers 230 Following Open Philanthropy's mission is to help others as much as we can with the resources available to us.
Grant Sanderson @3blue1brown
411K Followers 362 Following Pi creature caretaker. Contact/faq: https://t.co/brZwdQfdif
Jan Leike @janleike
114K Followers 332 Following ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.
Percy Liang @percyliang
84K Followers 417 Following Associate Professor in computer science @Stanford @StanfordHAI @StanfordCRFM @StanfordAILab @stanfordnlp | cofounder @togethercompute | Pianist
Stella Biderman @BlancheMinerva
17K Followers 806 Following Open source LLMs and interpretability research at @AiEleuther. She/her
Dan Hendrycks @DanHendrycks
42K Followers 109 Following • Center for AI Safety Director • xAI and Scale AI advisor • GELU/MMLU/MATH/HLE • PhD in AI • Analyzing AI models, companies, policies, and geopolitics
Ekdeep Singh @EkdeepL
2K Followers 1K Following Member of Technical Staff @GoodfireAI; Previously: Postdoc / PhD at Center for Brain Science, Harvard and University of Michigan
AK @_akhaliq
425K Followers 3K Following AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗) dm for promo ,submit papers here: https://t.co/UzmYN5YmrQ
Michael Hanna @michaelwhanna
613 Followers 437 Following PhD student at the University of Amsterdam / ILLC, interested in computational linguistics and (mechanistic) interpretability. Current Anthropic Fellow.
NDIF @ndif_team
216 Followers 0 Following The National Deep Inference Fabric, an NSF-funded computational infrastructure to enable research on large-scale Artificial Intelligence. https://t.co/STsQ707an3
Adam Karvonen @a_karvonen
3K Followers 564 Following ML Researcher, doing MATS with Owain Evans. I prefer email to DM.
Yoshua Bengio @Yoshua_Bengio
25K Followers 206 Following Working towards the safe development of AI for the benefit of all @UMontreal, @LawZero_ & @Mila_Quebec A.M. Turing Award Recipient and most-cited AI researcher.
METR @METR_Evals
11K Followers 29 Following An AI research non-profit advancing the science of empirically testing AI systems for capabilities that could threaten catastrophic harm to society.
Yannick Assogba @tafsiri
1K Followers 441 Following Research Scientist @ Apple AI/ML. Prev: Google Brain / PAIR
Alex Loftus @AlexLoftus19
151 Followers 476 Following Our textbook is on amazon now! https://t.co/ayc3bMWVFt https://t.co/CUEOxFRDse | PhD student, Bau lab @ Northeastern. Studying LLM internals.
Diego Garcia-Olano @dgolano
596 Followers 971 Following Research Scientist at @MetaAI GenAI Trust/Responsible AI. PHD in explainable ML @UTAustin 22. @IBMResearch 20, @GoogleAI 18/19, @datascifellows 16, @la_UPC 15
Thomas G. Dietterich @tdietterich
57K Followers 619 Following Distinguished Professor (Emeritus), Oregon State Univ.; Former President, Assoc. for the Adv. of Artificial Intelligence; Robust AI & Comput. Sustainability
Transluce @TransluceAI
8K Followers 15 Following Open and scalable technology for understanding AI systems.
Eric J. Michaud @ericjmichaud_
3K Followers 1K Following PhD student at MIT. Trying to make deep neural networks among the best understood objects in the universe. 💻🤖🧠👽🔭🚀
Hadas Orgad @ ICML @OrgadHadas
618 Followers 131 Following PhD student @ Technion | Focused on AI interpretability, robustness & safety | Because black boxes don’t belong in critical systems
James Michaelov @jamichaelov
365 Followers 732 Following Postdoc @MIT. Previously: @CogSciUCSD, @CARTAUCSD, @AmazonScience, @InfAtEd, @SchoolofPPLS. Research: language comprehension, the brain, artificial intelligence
Goodfire @GoodfireAI
9K Followers 20 Following Advancing humanity's understanding of AI through interpretability research. Building the future of safe and powerful AI systems.
Adam Gleave @ARGleave
4K Followers 402 Following CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as https://t.co/98dTfmdw2b