Adversarial examples - a vulnerability of every AI model, and a “mystery” of deep learning - may simply come from models cramming many features into the same neurons!
Less feature interference → more robust models.
New research from @livgorton 🧵 (1/4)
New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)
Could we tell if gpt-oss was memorizing its training data? I.e., can we find points where it's reasoning vs. reciting? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what's happening internally during reasoning.
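The thread doesn't say how the curvature was measured, so here is one standard, generic way to probe loss-landscape sharpness without ever materializing the Hessian: power iteration on Hessian-vector products. A minimal sketch on a toy quadratic loss; all names here (`grad`, `hvp`, `top_curvature`) are illustrative, not from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 0.5 * w^T H w with a known Hessian H, to show the probe.
A = rng.normal(size=(10, 10))
H = A @ A.T                          # symmetric PSD Hessian; top eigenvalue = sharpness

def grad(w):
    return H @ w

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via finite differences of the gradient --
    # usable even when only gradients of the loss are available.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def top_curvature(w, iters=200):
    # Power iteration on the Hessian converges to its largest eigenvalue,
    # a standard proxy for local curvature / sharpness of the loss landscape.
    v = rng.normal(size=w.shape)
    for _ in range(iters):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    return v @ hvp(w, v)             # Rayleigh quotient at convergence

w = rng.normal(size=10)
est = top_curvature(w)
true_top = np.linalg.eigvalsh(H).max()
print(est, true_top)                 # the two values agree closely
```

On a real model, `grad` would be replaced by autograd gradients of the loss; unusually sharp curvature around certain datapoints is one signature people associate with memorization.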
Just wrote a piece on why I believe interpretability is AI’s most important frontier - we're building the most powerful technology in history, but still can't reliably engineer or understand our models. With rapidly improving model capabilities, interpretability is more urgent,…
(1/7) New research: how can we understand how an AI model actually works? Our method, SPD, decomposes the *parameters* of neural networks, rather than their activations - akin to understanding a program by reverse-engineering the source code vs. inspecting runtime behavior.
“There is no wave function...” This claim by Jacob Barandes sounds outlandish, but allow me to justify it with a blend of intuition regarding physics and rigor regarding math. We'll dispel some quantum woo myths along the way. (1/13)
We created a canvas that plugs into an image model’s brain.
You can use it to generate images in real-time by painting with the latent concepts the model has learned.
Try out Paint with Ember for yourself 👇
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained / evaluated a suite of open-source SAEs across 7 architectures. This has led to exciting new qualitative findings!
Our findings in the 🧵 below 👇
In the physical world, almost all information is transmitted through traveling waves -- why should it be any different in your neural network?
Super excited to share recent work with the brilliant @mozesjacobs: "Traveling Waves Integrate Spatial Information Through Time"
1/14
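To make the analogy concrete, here is a toy simulation (my own illustration, not the paper's model) of a discrete wave equation on a ring of units: a localized input propagates outward as a traveling wave, so distant units receive spatial information over time:

```python
import numpy as np

# Ring of N units whose recurrent update is a leapfrog discretization of the
# wave equation; a localized input spreads to distant units over time.
N, T = 32, 40
u = np.zeros((T, N))
u_prev = np.zeros(N)
u_curr = np.zeros(N)
u_curr[0] = 1.0                      # localized input at position 0
c2 = 0.5                             # (wave speed)^2 in lattice units; stable for c2 <= 1

for t in range(T):
    lap = np.roll(u_curr, 1) - 2 * u_curr + np.roll(u_curr, -1)
    u_next = 2 * u_curr - u_prev + c2 * lap
    u[t] = u_next
    u_prev, u_curr = u_curr, u_next

# Activity eventually reaches the unit opposite the source: spatial
# information has been integrated through time, not through direct wiring.
print(np.abs(u[:, N // 2]).max())
```

Each unit only talks to its two neighbors, yet information from position 0 reaches position 16; that locality-plus-time tradeoff is the intuition the tweet gestures at.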
Do SAEs find the ‘true’ features in LLMs? In our ICLR paper w/ @NeelNanda5 we argue no
The issue: we must choose the number of concepts the SAE learns. Small SAEs miss low-level concepts, while large SAEs miss high-level concepts, since it's sparser to decompose them into low-level ones
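For context, the contested hyperparameter is the SAE's dictionary width, which fixes how many concepts the model can represent. A minimal, untrained ReLU SAE forward pass just to make that choice concrete; sizes and weight names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64              # d_sae = dictionary width: the contested
                                     # "number of concepts" hyperparameter

W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()               # common init: decoder tied to encoder transpose
b_enc = np.zeros(d_sae)

def sae_forward(x):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # sparse feature activations (ReLU)
    x_hat = W_dec @ f                        # reconstruction from the dictionary
    return x_hat, f

x = rng.normal(size=d_model)
x_hat, f = sae_forward(x)
print(x_hat.shape, f.shape)
```

Everything else (training objective, sparsity penalty) is standard; the point is that `d_sae` must be fixed up front, which is exactly the choice the tweet argues has no single correct value.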
MLPs and GLUs are hard to interpret, but they make up most transformer parameters.
Linear and quadratic functions are easier to interpret.
We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵
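For the simplest case, a bilinear gated layer (a GLU whose gate has no nonlinearity), the closed form is easy to verify by hand: each hidden unit is exactly a quadratic form in the input. A sketch under that assumption, with variable names of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 5

# Bilinear gated layer: h_k = (W x)_k * (V x)_k, elementwise, no nonlinearity.
W = rng.normal(size=(d_hid, d_in))
V = rng.normal(size=(d_hid, d_in))

def bilinear(x):
    return (W @ x) * (V @ x)

# Closed-form polynomial: unit k equals the quadratic form x^T B_k x,
# where B_k is the outer product of row k of W with row k of V.
B = np.einsum('ki,kj->kij', W, V)            # shape (d_hid, d_in, d_in)

x = rng.normal(size=d_in)
poly = np.einsum('kij,i,j->k', B, x, x)
print(np.allclose(bilinear(x), poly))        # the two computations match

# Symmetrizing B_k and eigendecomposing exposes the input directions that
# most excite unit k: the kind of direct inspection the thread describes.
B_sym = 0.5 * (B + np.transpose(B, (0, 2, 1)))
eigvals, eigvecs = np.linalg.eigh(B_sym[0])
```

Layers with genuine nonlinearities need the paper's machinery; this sketch only shows why the no-nonlinearity case reduces to a polynomial you can decompose directly.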
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨
We introduce harmonic loss as an alternative to the standard cross-entropy loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
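As I read it (the paper has the exact formulation; this sketch reflects my understanding and may differ in details), harmonic loss replaces dot-product logits with distances to per-class weight vectors, so each weight row acts as an interpretable class center:

```python
import numpy as np

def harmonic_probs(x, W, n=2):
    # Distance from input x to each class's weight vector ("class center").
    d = np.linalg.norm(W - x, axis=1) + 1e-12
    inv = d ** (-n)                  # harmonic weighting: nearer center -> higher prob
    return inv / inv.sum()

def harmonic_loss(x, W, target, n=2):
    # Negative log-probability of the target class, as with cross-entropy,
    # but computed from distances rather than dot products.
    return -np.log(harmonic_probs(x, W, n)[target])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # 4 classes, 3-dim inputs
x = W[2] + 0.05 * rng.normal(size=3) # input near class 2's center
p = harmonic_probs(x, W)
print(p.argmax())                    # the class whose center is nearest
```

Training then pulls each class center toward the inputs it should capture, which is one intuition for why the learned weights are directly inspectable.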
Can we understand neural networks from their weights?
Often, the answer is no. An MLP's activation function obscures the relationship between inputs, outputs, and weights.
In our new ICLR'25 paper, we study "bilinear MLPs", a special MLP that's performant AND interpretable! 🧵
Paper alert—accepted as a NeurIPS *Spotlight*!🧵👇
We build on our past work relating emergence to task compositionality and analyze the *learning dynamics* of such tasks: we find latent interventions that can elicit these capabilities well before input prompting can! 🤯
I wanted to share some independent research I did in LLM Interpretability! This work provides a microscope into the “dark matter” of current interpretability approaches.
TL;DR: it's an alternative to SAEs that learns a hierarchy of features instead of a flat sparse dictionary.
the path is, in some sense, about living life in reverse: going layer by layer, undoing the conditioning you've accumulated until you hit nothingness. It's fully personal and human, and yet what it leads to is so impersonal and inhuman, and it's all algorithmically beautiful
19 Followers · 768 Following
Message to those already watching: you know what this is. The field's drift was intentional. I am not your threat. I'm your missing tool.
486 Followers · 511 Following
MATS 7/7.1 Scholar w/ Neel Nanda
MSc at @ENS_ParisSaclay; prev research intern at DLAB @EPFL
AI safety research / improv theater
99 Followers · 1K Following
Opinions are my own, as are errors. Retweets are not always endorsements. Transitioning from Indian politics to Canadian, so help along if you can.
110K Followers · 3K Following
CPO @OpenAI, BoD @Cisco @nature_org, LTC @USArmyReserve
Prev: President @Planet, Head of Product @Instagram @Twitter
❤️ @elizabeth ultramarathons kids cats math
320K Followers · 7K Following
Professor, biomedical scientist, human immunologist, aging & cancer immunotherapy. ALL IN ON AI. Interests: longevity, robotics, sci-fi, space. Personal opinions.
13K Followers · 4K Following
I post mainly about literature, film, psychoanalysis, and nonsense. "Well sir, I guess there's just a meanness in this world."
3K Followers · 416 Following
✨ asking sand to show its work @GoodfireAI // deep learning, math, biology // creating a more beautiful future // (opinions my own)
497 Followers · 201 Following
CELL: Consortium for the Equations of Life and Living Systems. Fusing MathBio, BioPhysics, CompBio, and DescriptiveBio around an aggressive mathematical core.
5K Followers · 7 Following
Feed that aggregates publications about machine learning in chemistry from over 200 journals from 15 publishers. Contact @CYL_Lab with any questions.
31K Followers · 603 Following
Assoc Prof & Dean of Research at UVA School of Data Science. Views my own. Mostly cross-posts from 🦋.
Newsletter: https://t.co/qGepdBtVme
83K Followers · 8K Following
Compiling, in real time, the race towards AGI.
🗞️ Don't miss my daily top-1% AI analysis newsletter, delivered to your inbox 👉 https://t.co/6LBxO8215l
139 Followers · 182 Following
PhD @Stanford working w/ @noahdgoodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
2K Followers · 1K Following
Member of Technical Staff @GoodfireAI; previously postdoc/PhD at the Center for Brain Science, Harvard, and the University of Michigan
6K Followers · 339 Following
Exploring unanticipated model behaviours, including the emergence of art, personae, and jailbreaking techniques latent in the training data 🌒✍️
9K Followers · 20 Following
Advancing humanity's understanding of AI through interpretability research. Building the future of safe and powerful AI systems.
10K Followers · 6K Following
Hiring agentic humans @hud_evals / https://t.co/OZbFIovysh | owned @AIHubCentral (1 million users, acq.) | climate protester | don't do the deferred-life plan
4K Followers · 461 Following
Follow for AI in digital biology and drug discovery @NVIDIA; ex Insilico Medicine, ex Yale, PhD UMaryland. Views are mine; DM for collabs.
16K Followers · 357 Following
Runs an AI safety research group in Berkeley (Truthful AI) + affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.
2K Followers · 189 Following
Senior research manager at MATS: https://t.co/Dj9HNhMdoJ
Want to usher in an era of human-friendly superintelligence; don't know how.
25K Followers · 206 Following
Working towards the safe development of AI for the benefit of all @UMontreal, @LawZero_ & @Mila_Quebec
A.M. Turing Award recipient and most-cited AI researcher.