Arnab Sen Sharma @arnab_api

Ph.D. student @KhouryCollege, working to make LLMs interpretable
arnab-api.github.io
Boston, MA
Joined September 2022

Tweets: 48
Followers: 183
Following: 139
Likes: 180

Nikhil Prakash @nikhil07prakash

2 months ago

How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step in demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task, and surprisingly found that it…

9 97 568 95K 618

Yanai Elazar @yanaiela

4 months ago

💡 New ICLR paper! 💡 "On Linear Representations and Pretraining Data Frequency in Language Models": We provide an explanation for when & why linear representations form in large (or small) language models. Led by @jack_merullo_ , w/ @nlpnoah & @sarahwiegreffe

6 44 213 27K 130

Aaron Mueller @amuuueller

5 months ago

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!

3 38 171 28K 82

Jiuding Sun @JiudingSun

5 months ago

💨 A new architecture of automating mechanistic interpretability with causal interchange intervention! #ICLR2025 🔬Neural networks are particularly good at discovering patterns from high-dimensional data, so we trained them to ... interpret themselves! 🧑‍🔬 1/4

1 17 73 6K 41

Sheridan Feucht @sheridan_feucht

5 months ago

[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.

2 34 151 15K 119

Chris Wendler @wendlerch

6 months ago

In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition", a "detail", and a "style" block. And one that is hard to make sense of.

2 6 52 7K 23

David Bau @davidbau

6 months ago

Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_ Transluce's @cogconfluence MIT's @dhadfieldmenell resilience.baulab.info/docs/AI_Action… Here's why:🧵 ↘️

1 69 310 36K 179

David Bau @davidbau

7 months ago

DeepSeek R1 shows how important it is to be studying the internals of reasoning models. Try our code: Here @can_rager shows a method for auditing AI bias by probing the internal monologue. dsthoughts.baulab.info I'd be interested in your thoughts.

12 52 269 20K 178

NDIF @ndif_team

9 months ago

More big news! Applications are open for the NDIF Summer Engineering Fellowship—an opportunity to work on cutting-edge AI research infrastructure this summer in Boston! 🚀

1 8 19 10K 4

Kevin Meng @mengk20

11 months ago

why do language models think 9.11 > 9.9? at @TransluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?

Transluce @TransluceAI

11 months ago

4 24 190 326K 161

68 150 1K 373K 871

Hadas Orgad @ ICML @OrgadHadas

11 months ago

Hallucinations are a subject of much interest, but how much do we know about them? In our new paper, we found that the internals of LLMs contain far more information about truthfulness than we knew! 🧵 Project page >> llms-know.github.io Arxiv >> arxiv.org/abs/2410.02707

7 46 910 133K 699

Arnab Sen Sharma @arnab_api

11 months ago

Attending @COLM_conf in Philadelphia! Drop by our poster on Wednesday morning (Session 5, 11 am - 1 pm, #8) Would love to catch up and chat about interpretability. Give me a DM!

2 28 164 30K 140

0 1 6 337 1

Rohit Gandikota @rohitgandikota

11 months ago

What should be the goal of unlearning in language models? In our new preprint we look at this question carefully and propose a new erasing method, "ELM," that erases knowledge from LLMs very cleanly. It is driven by three key goals - here is an explainer: 🧵👇

3 34 215 34K 178

Aaron Mueller @amuuueller

a year ago

Thanks Tal! 📜 In this paper, we provide a theoretically grounded review of causal (which, imo, ⊇ mechanistic) interpretability. We argue that this gives a more cohesive narrative of the field, and makes it easier to see actionable open directions for future work! 🧵

Tal Linzen @tallinzen

a year ago

2 11 74 18K 75

1 18 84 12K 53

Aaron Mueller @amuuueller

a year ago

Thanks to my many great coauthors for essential contributions to this review! @BrinkmannJannik @millicent_li @saprmarks @kpal_koyena @nikhil07prakash @can_rager @arunasank @arnab_api @SunJiuding @ericwtodd @davidbau @boknilev More in the paper! 📜 arxiv.org/abs/2408.01416

2 3 11 1K 8

David Bau @davidbau

a year ago

Time to study #llama3 405b, but gosh it's big! Please retweet: if you have a great experiment but not enough GPU, here is an opportunity to apply for shared #NDIF research resources. Deadline July 30: ndif.us/405b.html You'll help @ndif_team test, we'll help you run 405b

Jaden Fiotto-Kaufman @jadenfk23

a year ago

1 25 47 23K 14

2 37 122 27K 42

AK @_akhaliq

a year ago

NNsight and NDIF Democratizing Access to Foundation Model Internals The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering

2 25 72 14K 24

David Bau @davidbau

a year ago

The National Deep Inference Fabric #NDIF, an @NSF-funded AI research infrastructure project, is awarding 2024 **Summer Engineering Fellowships** in Boston. These are summer visiting positions, for current or recent PhD or undergrads, including stipend, travel and housing costs.

1 27 59 26K 27

Arnab Sen Sharma @arnab_api

a year ago

Just reached Vienna to attend ICLR! Stop by our poster session tomorrow (10:45 am, Hall B, #131). Would love to chat with people about interpretability and AI alignment. DMs are open!

1 2 14 2K 7

0 1 7 441 0

David Bau @davidbau

a year ago

I am delighted to officially announce the National Deep Inference Fabric project, #NDIF. ndif.us NDIF is an @NSF-supported computational infrastructure project to help YOU advance the science of large-scale AI.