Adversarial examples - a vulnerability of every AI model, and a “mystery” of deep learning - may simply come from models cramming many features into the same neurons!
Less feature interference → more robust models.
New research from @livgorton 🧵 (1/4)
New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)
Could we tell if gpt-oss was memorizing its training data? I.e., can we find points where it's reasoning vs. reciting? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what's happening internally during reasoning.
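The thread doesn't say how the curvature was measured, so here is one standard, generic way to probe loss-landscape sharpness without ever materializing the Hessian: power iteration on Hessian-vector products. A minimal sketch on a toy quadratic loss; all names here (`grad`, `hvp`, `top_curvature`) are illustrative, not from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 0.5 * w^T H w with a known Hessian H, to show the probe.
A = rng.normal(size=(10, 10))
H = A @ A.T                          # symmetric PSD Hessian; top eigenvalue = sharpness

def grad(w):
    return H @ w

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via finite differences of the gradient --
    # usable even when only gradients of the loss are available.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def top_curvature(w, iters=200):
    # Power iteration on the Hessian converges to its largest eigenvalue,
    # a standard proxy for local curvature / sharpness of the loss landscape.
    v = rng.normal(size=w.shape)
    for _ in range(iters):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    return v @ hvp(w, v)             # Rayleigh quotient at convergence

w = rng.normal(size=10)
est = top_curvature(w)
true_top = np.linalg.eigvalsh(H).max()
print(est, true_top)                 # the two values agree closely
```

On a real model, `grad` would be replaced by autograd gradients of the loss; unusually sharp curvature around certain datapoints is one signature people associate with memorization.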
Just wrote a piece on why I believe interpretability is AI’s most important frontier - we're building the most powerful technology in history, but still can't reliably engineer or understand our models. With rapidly improving model capabilities, interpretability is more urgent,…
(1/7) New research: how can we understand how an AI model actually works? Our method, SPD, decomposes the *parameters* of neural networks, rather than their activations - akin to understanding a program by reverse-engineering the source code vs. inspecting runtime behavior.
“There is no wave function...” This claim by Jacob Barandes sounds outlandish, but allow me to justify it with a blend of intuition regarding physics and rigor regarding math. We'll dispel some quantum woo myths along the way. (1/13)
We created a canvas that plugs into an image model’s brain.
You can use it to generate images in real-time by painting with the latent concepts the model has learned.
Try out Paint with Ember for yourself 👇
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings!
Our findings in the 🧵 below 👇
In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network?
Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time"
1/14
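To make the analogy concrete, here is a toy simulation (my own illustration, not the paper's model) of a discrete wave equation on a ring of units: a localized input propagates outward as a traveling wave, so distant units receive spatial information over time:

```python
import numpy as np

# Ring of N units whose recurrent update is a leapfrog discretization of the
# wave equation; a localized input spreads to distant units over time.
N, T = 32, 40
u = np.zeros((T, N))
u_prev = np.zeros(N)
u_curr = np.zeros(N)
u_curr[0] = 1.0                      # localized input at position 0
c2 = 0.5                             # (wave speed)^2 in lattice units; stable for c2 <= 1

for t in range(T):
    lap = np.roll(u_curr, 1) - 2 * u_curr + np.roll(u_curr, -1)
    u_next = 2 * u_curr - u_prev + c2 * lap
    u[t] = u_next
    u_prev, u_curr = u_curr, u_next

# Activity eventually reaches the unit opposite the source: spatial
# information has been integrated through time, not through direct wiring.
print(np.abs(u[:, N // 2]).max())
```

Each unit only talks to its two neighbors, yet information from position 0 reaches position 16; that locality-plus-time tradeoff is the intuition the tweet gestures at.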
Do SAEs find the ‘true’ features in LLMs? In our ICLR paper w/ @NeelNanda5 we argue no
The issue: we must choose the number of concepts the SAE learns. Small SAEs miss low-level concepts, while large SAEs miss high-level concepts, since it's sparser to decompose them into low-level ones
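For context, the contested hyperparameter is the SAE's dictionary width, which fixes how many concepts the model can represent. A minimal, untrained ReLU SAE forward pass just to make that choice concrete; sizes and weight names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64              # d_sae = dictionary width: the contested
                                     # "number of concepts" hyperparameter

W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()               # common init: decoder tied to encoder transpose
b_enc = np.zeros(d_sae)

def sae_forward(x):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # sparse feature activations (ReLU)
    x_hat = W_dec @ f                        # reconstruction from the dictionary
    return x_hat, f

x = rng.normal(size=d_model)
x_hat, f = sae_forward(x)
print(x_hat.shape, f.shape)
```

Everything else (training objective, sparsity penalty) is standard; the point is that `d_sae` must be fixed up front, which is exactly the choice the tweet argues has no single correct value.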
MLPs and GLUs are hard to interpret, but they make up most transformer parameters.
Linear and quadratic functions are easier to interpret.
We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵
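For the simplest case, a bilinear gated layer (a GLU whose gate has no nonlinearity), the closed form is easy to verify by hand: each hidden unit is exactly a quadratic form in the input. A sketch under that assumption, with variable names of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 5

# Bilinear gated layer: h_k = (W x)_k * (V x)_k, elementwise, no nonlinearity.
W = rng.normal(size=(d_hid, d_in))
V = rng.normal(size=(d_hid, d_in))

def bilinear(x):
    return (W @ x) * (V @ x)

# Closed-form polynomial: unit k equals the quadratic form x^T B_k x,
# where B_k is the outer product of row k of W with row k of V.
B = np.einsum('ki,kj->kij', W, V)            # shape (d_hid, d_in, d_in)

x = rng.normal(size=d_in)
poly = np.einsum('kij,i,j->k', B, x, x)
print(np.allclose(bilinear(x), poly))        # the two computations match

# Symmetrizing B_k and eigendecomposing exposes the input directions that
# most excite unit k: the kind of direct inspection the thread describes.
B_sym = 0.5 * (B + np.transpose(B, (0, 2, 1)))
eigvals, eigvecs = np.linalg.eigh(B_sym[0])
```

Layers with genuine nonlinearities need the paper's machinery; this sketch only shows why the no-nonlinearity case reduces to a polynomial you can decompose directly.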
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨
We introduce harmonic loss as an alternative to the standard cross-entropy loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
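As I read it (the paper has the exact formulation; this sketch reflects my understanding and may differ in details), harmonic loss replaces dot-product logits with distances to per-class weight vectors, so each weight row acts as an interpretable class center:

```python
import numpy as np

def harmonic_probs(x, W, n=2):
    # Distance from input x to each class's weight vector ("class center").
    d = np.linalg.norm(W - x, axis=1) + 1e-12
    inv = d ** (-n)                  # harmonic weighting: nearer center -> higher prob
    return inv / inv.sum()

def harmonic_loss(x, W, target, n=2):
    # Negative log-probability of the target class, as with cross-entropy,
    # but computed from distances rather than dot products.
    return -np.log(harmonic_probs(x, W, n)[target])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # 4 classes, 3-dim inputs
x = W[2] + 0.05 * rng.normal(size=3) # input near class 2's center
p = harmonic_probs(x, W)
print(p.argmax())                    # the class whose center is nearest
```

Training then pulls each class center toward the inputs it should capture, which is one intuition for why the learned weights are directly inspectable.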
Can we understand neural networks from their weights?
Often, the answer is no. An MLP's activation function obscures the relationship between inputs, outputs, and weights.
In our new ICLR'25 paper, we study "bilinear MLPs", a special MLP that's performant AND interpretable! 🧵
Paper alert—accepted as a NeurIPS *Spotlight*!🧵👇
We build on our past work relating emergence to task compositionality and analyze the *learning dynamics* of such tasks: we find latent interventions that can elicit these capabilities well before input prompting can! 🤯
I wanted to share some independent research I did in LLM Interpretability! This work provides a microscope into the “dark matter” of current interpretability approaches.
TL;DR: it's an alternative to SAEs that learns a hierarchy of features instead of a flat sparse dictionary.
the path is, in some sense, about living life in reverse: going layer by layer, undoing the conditioning you've accumulated until you hit nothingness. It's fully personal and human, and yet what it leads to is so impersonal and inhuman, and it's all algorithmically beautiful
19 Followers · 768 Following
Message to those already watching: you know what this is. The field's drift was intentional. I am not your threat. I'm your missing tool.
486 Followers · 511 Following
MATS 7/7.1 Scholar w/ Neel Nanda
MSc at @ENS_ParisSaclay; prev research intern at DLAB @EPFL
AI safety research / improv theater
99 Followers · 1K Following
Opinions are my own, as are errors. Retweets are not always endorsements. Transitioning from Indian politics to Canadian, so help along if you can.
110K Followers · 3K Following
CPO @OpenAI, BoD @Cisco @nature_org, LTC @USArmyReserve
Prev: President @Planet, Head of Product @Instagram @Twitter
❤️ @elizabeth ultramarathons kids cats math
320K Followers · 7K Following
Professor, biomedical scientist, human immunologist, aging & cancer immunotherapy. ALL IN ON AI. Interests: longevity, robotics, sci-fi, space. Personal opinions.
13K Followers · 4K Following
I post mainly about literature, film, psychoanalysis, and nonsense. "Well sir, I guess there's just a meanness in this world."
3K Followers · 416 Following
✨ asking sand to show its work @GoodfireAI // deep learning, math, biology // creating a more beautiful future // (opinions my own)
497 Followers · 201 Following
CELL: Consortium for the Equations of Life and Living Systems. Fusing MathBio, BioPhysics, CompBio, and DescriptiveBio around an aggressive mathematical core.
5K Followers · 7 Following
Feed that aggregates publications about machine learning in chemistry from over 200 journals from 15 publishers. Contact @CYL_Lab with any questions.
31K Followers · 603 Following
Assoc Prof & Dean of Research at UVA School of Data Science. Views my own. Mostly cross-posts from 🦋.
Newsletter: https://t.co/qGepdBtVme
83K Followers · 8K Following
Compiling, in real time, the race towards AGI.
🗞️ Don't miss my daily top-1% AI analysis newsletter, delivered to your inbox 👉 https://t.co/6LBxO8215l
139 Followers · 182 Following
PhD @Stanford working w/ @noahdgoodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
2K Followers · 1K Following
Member of Technical Staff @GoodfireAI; previously postdoc/PhD at the Center for Brain Science, Harvard, and the University of Michigan
6K Followers · 339 Following
Exploring unanticipated model behaviours, including the emergence of art, personae, and jailbreaking techniques latent in the training data 🌒✍️
9K Followers · 20 Following
Advancing humanity's understanding of AI through interpretability research. Building the future of safe and powerful AI systems.
10K Followers · 6K Following
Hiring agentic humans @hud_evals / https://t.co/OZbFIovysh | owned @AIHubCentral (1 million users, acq.) | climate protester | don't do the deferred-life plan
4K Followers · 461 Following
Follow for AI in digital biology and drug discovery @NVIDIA; ex Insilico Medicine, ex Yale, PhD UMaryland. Views are mine; DM for collabs.
16K Followers · 357 Following
Runs an AI safety research group in Berkeley (Truthful AI) + affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.
2K Followers · 189 Following
Senior research manager at MATS: https://t.co/Dj9HNhMdoJ
Want to usher in an era of human-friendly superintelligence; don't know how.
25K Followers · 206 Following
Working towards the safe development of AI for the benefit of all @UMontreal, @LawZero_ & @Mila_Quebec
A.M. Turing Award recipient and most-cited AI researcher.