2025-06-06 | Stanford CS25: V5 | On the Biology of a Large Language Model, Josh Batson of Anthropic

Joshua Batson discusses the internal mechanisms and behavioral characteristics of large language models

Media details

Upload date
2025-06-06 17:53
Source
https://www.youtube.com/watch?v=vRQs7qfIDaU
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: So today it's my pleasure to welcome Joshua Batson from Anthropic. He'll be talking about On the Biology of a Large Language Model, which should be a very interesting talk. Josh leads the circuits effort of the Anthropic mechanistic interpretability team. Before Anthropic, he worked on viral genomics and computational microscopy at the Chan Zuckerberg Biohub. And his academic training is in pure mathematics, which is very cool and impressive. And just one other thing: some more recordings for this quarter have been released, like Karina's and Div's talks, so feel free to check those out on our YouTube playlist. And for the folks on Zoom, feel free to ask questions either on Zoom or on Slido with the code CS25. And without further ado, I'll hand it off to Josh.
speaker 2: Thank you. Okay, come on in — we're getting started. Okay, it's a pleasure to be here. It's crazy to me that there is a class on transformers now, which, you know, were invented rather recently. We've got about an hour for the class, I guess, and then 15 or 20 minutes for questions. Feel free to just interrupt me with questions. As you said, my training is in pure mathematics, where people just interrupt each other all the time, very rudely, and it's totally fine. And if I want to just cut you off and move on to something, I will, but this can be as interactive as you guys would like it to be, and that's for the people on Zoom too. Okay, so this talk is titled On the Biology of a Large Language Model, which is also the title of a paper — what we call our roughly 100-page interactive blog post that went out a few weeks ago. And if you're here, you probably know something about large language models. The "biology" word was our choice here, and you could contrast it with a series of papers called On the Physics of Large Language Models, where you sort of think of models as dynamical systems over the course of training. We think of interpretability as being in the same relationship to neural networks, which are trained by gradient descent, as biology is to living systems, which are developed through evolution. You have some process that gives rise to complexity, and you can just study the objects that are produced to see how they do the kind of miraculous things that they do. So models do lots of cool things. I don't need to tell you guys this, but here's an example. I don't know when it's from now — maybe six months ago, which is like ten years in AI time — but I quite enjoyed this. This is somebody who was working on natural language processing for Circassian, which is a very low-resource language: not many people speak it, there are not many documents, and he'd been tracking the state of the art in NLP for many years, using the best models at the time to try to help translate into this language, translate from this language, you know, help preserve it. And he tried it with a version of Claude — I think this was probably Sonnet 3.5 — where he just shoved his master list of Russian–Circassian translations, painstakingly gathered over years, into the context window. Rather than train a model, just put it in the context window and then ask the model to do other translations. And it could not only translate successfully, but also kind of break down the grammar, right? So just in-context learning with these models was enough to beat the state of the art of the NLP-specific models that he'd been working on for a while. So "that's cool" would be my summary of that. But models are also weird. This was also Claude, and someone asked it on leap day, what day is tomorrow? And it just got in a big fight with itself. You know: if today is February 29th, 2024, then tomorrow would be March 1st. However, 2024 is not a leap year, and so February 29th is not a valid date in the Gregorian calendar. Okay, now it starts going around the rules: 2000 was a leap year, 2100 will not be — like, fine. And then: the next leap year after 2024 would be 2028. And then: if we assume you meant February 28th, 2024, the last valid date in February, then... It's just, what is going on, right?
There's some smorgasbord of correct recollection of facts, correct reasoning from the facts, and then disregarding them for consistency with that initial "it's just pretty weird for it to be leap day, that's odd," right? I mean, if a person were doing this, you would wonder what was in the brownies they had consumed. Though children are sort of like this, so maybe that's an interesting topic. And then I just love this one: "AI art will make designers obsolete." The AI accepting the job: it's got so many fingers. This is now out of date, right? People have figured out how to keep the finger count down to at most five per hand. The new image models can do extremely realistic people, and they all have five fingers now. But that wasn't exactly solved by figuring out why there were so many fingers in the first place; other methods got through it. But presumably you kind of bat down some of this weirdness, and then the weirdness just gets more sophisticated. So as the models get better, you need to get better at understanding where the craziness has gone. As the frontier moves forward, maybe it's not five fingers, but there might be subtler things that are wrong. And when I think about interpretability — which is sort of: what did the model learn exactly? How is it represented inside the model? How does it manifest in behavior? — I am thinking ahead to when most of the simple interactions seem to go well, and then the question is: are these going well because fundamentally the model learned something deep, or because you've managed to beat this down like the finger problem, but if you went out to the edge of the model's capabilities it would all be seven fingers again, and you can't tell? And also, because it seems pretty reliable, you've delegated a lot of decision making and trust to these models, and it's in the corners you can no longer verify where things get weird. So for that reason, we want to understand what's going on with these capabilities. All right. So this will be a bit of a review at the beginning, of why models are hard to understand and some strategies for picking them apart, and then three main lessons I think we've learned about how models work inside that aren't obvious from a black-box way of engaging with them. So here are three statements that are somewhere between myths, things that are out of date, and things that are true if you interpret them one way philosophically but that miss the point. The statements are: that models just pattern-match to similar training data examples; that they use only shallow and simple heuristics and reasoning; and that they just work one word at a time, kind of churning out the next thing in some eruption of spontaneity. And I think we find that models learn and can compose pretty abstract representations inside, that they perform rather complex and often heavily parallel computations — it's not serial, they're doing a bunch of things at once — and also that they plan many tokens into the future: even though they say one word at a time, they're often thinking ahead quite far to be able to make something coherent. We'll work through all of that. Okay.
speaker 3: this is probably a review for this class.
speaker 2: but we made these nice slides and I think it's nice to go through anyway. So you have a chatbot which gives "Hello, how can I assist you?" as an answer. How does this actually happen? It's saying one word at a time by just predicting the next word. So "How" goes through Claude and predicts "can", and "How can" goes to "I", and "How can I" goes to "assist", and "How can I assist" goes to "you". So you can reduce the problem to the computation that gives you the next word. And that is a neural network. Here I've drawn a fully connected network. To turn language into language, passing through numbers, you have to turn things into vectors first. So there's an embedding: every word or token in the vocabulary has an embedding, which is a list of numbers, or a vector, morally speaking. You basically concatenate those together and run them through a massive neural network with a lot of weights, and out comes a score for every word in the vocabulary. And the model says the highest-scoring word, modulo some temperature to introduce randomness. Transformer architectures are more complex: they have these residual connections, alternating attention and MLP blocks, which you could think of as baking really strong priors into a massive MLP. But in some sense, that's just
speaker 3: about efficiency.
speaker 2: So a metaphor we've found useful, which is also biological, is that language models should be thought of as grown, not built. You start with this randomly initialized thing, an architecture, which is like a scaffold. You give it some data — maybe that's like nutrients — and the loss is like the sun, and it grows towards that. And so you get this kind of organic thing that has been made by the end of it. But the way that it grew, you don't really have any access to. The scaffold you have access to, but that's like looking at a model at init, which tends not to be that interesting. Okay. So of course, we have the models, and so there is a tautological answer to what they are doing, which I already told you: they turn the words into numbers, they do a bunch of matmuls, they apply simple functions — it's just math all the way through — and then you get something out, and that's it, that's what the model does. And I think that's an unsatisfying answer, because you can't reason about it, right? That answer to "how do models work?" doesn't tell you what behaviors they should or shouldn't be able to do, or any of those things. So the first thing you might hope is that the neurons inside the neural network have interpretable roles. People were hoping that going back to the first networks in the eighties, and there was a bit of a resurgence of this in deep learning; Chris Olah, who leads the team at Anthropic, got really into this ten years ago. You just look at a neuron and you ask: when does this fire? For what inputs is this neuron active? And then you see whether those form a coherent class. This is the car-detector neuron in a vision model, or this is the eye detector, or this is the edge detector if it's early in the model. They found the Donald Trump neuron in a CLIP model, for example. But it turns out that in language models, when you do this — when you ask which sentences cause this neuron to activate — the answer doesn't make that much sense. So here's a visualization of a bunch of example text for which a neuron in a model activates. And there's just a lot of stuff: there's code and some Chinese and some math and hemlock for Socrates. It's not especially clear. And of course, there's no reason it would need to be, right? It just learned a function. And so asking for a neuron to be interpretable was a bit of a Hail Mary, and it's pretty cool that it works sometimes, but it's not particularly systematic. There's a prior from neuroscience, which is that while there are a whole bunch of neurons, maybe in any given moment the model is not thinking about that many things at once — maybe there's some sparsity here, where if there were a map of the concepts the model is using, or the subroutines it's using, then on any given token it's only using a few at a time. And that's maybe a slightly better guess than "maybe the neurons are interpretable." It's not necessarily a great prior, but it's something you can work with. And so you can fit linear combinations of neurons such that each activation vector is a sparse combination of these dictionary elements. This is called dictionary learning, classical ML. And we just did it.
You just gather a bunch of activations from the model when you put a trillion tokens of text through it or something, and then you take those vectors, you look for dictionaries, and then you look at when those dictionary components are active. And lo and behold, it's way better. So we had a paper last year which showed a bunch of them, where we fit about 30 million features on Claude 3 Sonnet, on the middle layer of the model. You just ask: what are the atoms of computation or representation inside that model? This was one of my favorites, where this linear combination is present — or its dot product with a vector is large — when the input is about the Golden Gate Bridge. And that is if there's an explicit mention of the Golden Gate Bridge in English, on the left; also if it's translated into another language; also if it's an image of the Golden Gate Bridge; also, it turns out, if it's an indirect mention. So if you say "I was driving from San Francisco to Marin" — which, you know, you cross the bridge to do — the same feature is active there, right? So it's some relatively general concept; also other San Francisco landmarks, etcetera. These combinations of neurons are interpretable. We were happy with this. There are things that are more abstract, you know, notions of inner conflict. There was a feature for bugs in code that fired on kind of small bugs, like division by zero or typos, in many different programming languages. If you suppressed it, the model would act like there wasn't a bug. If you activated it, the model would give you a traceback as if there were a bug. And so it had these general properties. But there was something very unsatisfying about this, which is: how does it know it's the Golden Gate Bridge? And what does it do with that information, right? So even if you manage to pick apart the representations, that's just a cross-section. It doesn't give you the why and the how, it just gives you the what. And so we wanted to find ways of connecting these features together. So you start from the input words, and you're going to process these into higher-order representations, and eventually the model says something, and we try to trace that through the representations.
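A rough sketch of the "when is this dictionary component active?" exercise: score cached activations against one learned feature direction and keep the top examples. The shapes and names here are hypothetical, purely for illustration.

```python
import torch

def top_activating_examples(acts, contexts, feature_dir, k=20):
    """acts: (n_tokens, d_model) residual-stream activations cached from the model,
    contexts: the n_tokens text snippets they came from,
    feature_dir: (d_model,) one learned dictionary direction."""
    scores = acts @ feature_dir                    # how strongly the feature fires on each token
    top = torch.topk(scores, k).indices.tolist()   # tokens where it fires hardest
    return [(contexts[i], scores[i].item()) for i in top]
```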
speaker 3: So very .
speaker 2: concretely, you have a matrix, which is: you take a billion examples of activation vectors, and you've got a d_model dimension there. That's a matrix, and you try to factorize it into a product of a fixed dictionary of atoms times a sparse matrix of which atoms are present in which example. And you have an objective function, which says this should reconstruct the data, plus some L1 penalty to encourage sparsity.
speaker 3: So you just solve that directly? It's a joint
speaker 2: optimization problem. Yeah, that's right. That's right. We tried being clever about this — there's a beautiful, rich literature on dictionary learning — but after a few months of that, the bitter lesson got us. It turned out that you could just use a one-layer sparse autoencoder, which you can write in torch and then just train on your GPUs, and scaling was more important than being clever, yet again. So that's what it is: it's a sort of sparse autoencoder. All right.
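A minimal sketch of that kind of one-layer sparse autoencoder: reconstruct each activation vector from a sparse, overcomplete set of features, with an L1 penalty to keep only a few features active. Dimensions and coefficients are placeholders, not Anthropic's actual training setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)   # columns of the decoder weight are the dictionary atoms

    def forward(self, x):
        f = torch.relu(self.encoder(x))                 # sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()       # reconstruct the activations
    sparsity = f.abs().sum(dim=-1).mean()               # L1 penalty -> only a few features fire
    return recon + l1_coeff * sparsity
```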
speaker 3: so .
speaker 2: here's a prompt. Okay: "The capital of the state containing Dallas is" — Austin. And that's because the state containing Dallas is Texas. Good. How did the model do that? Was it like, Texas, so Austin? Or is it, I don't know — it's seen a lot of training data, was this just in it, and it's just reciting the answer? There's a lot of eval contamination; a lot of evals made it into people's training sets, so you get high scores because the model literally knew the answer, because it had seen it. So did it literally see this before, or is it kind of thinking "Texas" like you did? So I'll show you. This is a cartoon, and we can slowly peel away these abstractions to get to literally what we did. But I'll start by saying that you should think of each of these nodes as bundles of features, which are, again, atoms we learned through an optimization process. We label the role of each feature by looking at when it's active and trying to describe, in language, what that feature is doing. And the connections between them are actually direct causal connections as the model processes this in a forward pass. And so what we do is we break apart the model into pieces, we ask which pieces are active, whether we can interpret them separately, and then how the information flows. In this case, we found a bunch of features related to capitals, a bunch related to states in the abstract, a bunch related to Dallas. We found some features that you can think of as like motor neurons: they make the model do something. In this case, they make it say the name of a capital — they make it say a bunch of state capitals or country capitals or something. So that's the start, but that alone isn't enough to get it right. And there's some mapping from Dallas into Texas, and there's a pile of features there: some are discussions of Texas politics, some are phrases like "everything's bigger in Texas." And once you have Texas and a capital, you get "say Austin" in particular. And if you're saying a state capital and also you're saying Austin, what you get is Austin coming out. There are some interesting shortcut edges, though: Texas also feeds into Austin directly, right? If you're thinking about Texas, you might just be thinking about Austin. I'll get into more of how we build this, but this gives you a picture in terms of these atoms or dictionary elements we've learned, and you might want to check that they make sense. So then you can do interventions on the model, deleting pieces of this — in neuroscience these would be ablations of neurons — and then you see if the output of the model changes as you would expect.
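Schematically, those interventions amount to editing the residual stream during the forward pass: subtract a feature's output direction to ablate it, or add a different feature's direction to swap a concept in. A hedged sketch — the module path and scale are placeholders, and this is a simplification rather than the paper's exact procedure.

```python
import torch

def add_steering_hook(layer_module, feature_vec, scale):
    """Register a hook that adds `scale * feature_vec` to the layer's output.
    scale < 0 suppresses the feature; scale > 0 promotes it. Assumes the module
    returns a plain (batch, seq, d_model) tensor."""
    def hook(module, inputs, output):
        return output + scale * feature_vec.to(output.dtype)
    handle = layer_module.register_forward_hook(hook)
    return handle   # call handle.remove() afterwards to undo the edit
```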
speaker 3: Someone on Zoom asked: Austin and Dallas frequently co-occur together in Internet text, suggesting simple statistical associations. So is this a way of overcomplicating it?
speaker 2: I think people have found that transformers outperform bag-of-n-grams models for most tasks. You know, Houston also co-occurs with Dallas in the training data, and the model doesn't say Houston, so it must be using "capital" and "Dallas." In this case, I think you could say: maybe if you just had "capital" and "Dallas," it would say Austin. And actually, I think some of these edges were pretty weak, and it turns out you can make something ungrammatical — just say "the capital of Dallas is" — and it will say Austin. That's an interesting thing: you look at the graph, this edge is weak, and it indicates that maybe it really is the case that if you have "capital" and "Dallas" in proximity, you might get Austin — which it then causally does. Yes.
speaker 3: [inaudible question about the graph]
speaker 2: So this is the flow through the layers of the model, going upwards, and we've done some pruning here. I'll show it in more detail — we'll actually see that when we get to it. Yeah, I'm going to give you a lot more detail on the technical side in about a slide. Yes.
speaker 3: How do you pick the size for the dictionary?
speaker 2: We did a scan where we trained dictionaries of different sizes. And you get some trade-off between the compute you spend and the accuracy of the approximation — these are supposed to reconstruct the activations, and how well does that happen? The bigger the dictionary, the better it does. Also, the denser it is — the less sparse — the better it does. So you pay some price in interpretability at some point. So we did a bunch of sweeps and then we picked something that seemed good enough. There are still a bunch of errors, and I'll show you later how those show up — it means there's a lot we can't explain. Okay, I'm going to go on. So literally what we do here is train this sparse replacement model. The basic idea is you have the model's residual stream and the MLPs — we're going to forget about attention right now — and we're going to try to approximate that with these cross-layer transcoders. So what do I mean by that? A transcoder is something that emulates an MLP, and the cross-layer part means this ensemble of them emulates all the MLPs. There's an architecture called DenseNet from a few years ago, where every layer writes to every subsequent layer, and this is like that. So the basic idea is that this ensemble of CLT features takes in the inputs of all the MLPs and produces the vector of all of their outputs at once. And a reason to do this is that there's no particular reason to think the atomic units of computation have to live in one layer. If there are two consecutive layers in a deep model, those neurons are almost interchangeable — people have done experiments where you can swap the order of transformer layers without damaging performance that much — which means that indexing that hard on the exact layer is maybe unwise. So we just say, okay, these can skip to the end. This ends up making the interpretability easier, because sometimes it's just bigram statistics: I see this word, I say that word. That's evident at layer one, but you have to keep propagating it all the way through to get it out, and here we can make that be one feature instead of dozens of features in consecutive layers interacting. And then we just train that optimization: you've got a loss on accuracy and a loss on sparsity. Okay? And so here we
speaker 3: replace the neurons .
speaker 2: with these features. We use the attention from the base model as-is: we do not try to explain attention, we just flow through it. But we try to explain what the MLPs are doing. And now, instead of the neurons, which are sort of uninterpretable like on the left, here we have stuff that makes more sense on the right. This is a "say a capital" feature, and I think it's probably more specific — it's active on these literal state-to-capital mappings. Yeah.
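A minimal sketch of the cross-layer transcoder idea as just described: features are encoded from the residual stream going into each MLP and write to the reconstructed MLP outputs of that layer and every later layer, trained with a reconstruction loss plus a sparsity loss like the SAE. This is a simplified illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    def __init__(self, n_layers, d_model, n_features):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d_model, n_features) for _ in range(n_layers)])
        # one decoder for each (source layer l) -> (target layer m >= l) pair
        self.decoders = nn.ModuleDict({
            f"{l}_{m}": nn.Linear(n_features, d_model, bias=False)
            for l in range(n_layers) for m in range(l, n_layers)
        })

    def forward(self, mlp_inputs):
        """mlp_inputs: list of (batch, d_model) tensors, one per layer."""
        feats = [torch.relu(self.encoders[l](x)) for l, x in enumerate(mlp_inputs)]
        outputs = []
        for m in range(len(mlp_inputs)):
            # layer m's MLP output is reconstructed from features of layers 0..m
            contribs = [self.decoders[f"{l}_{m}"](feats[l]) for l in range(m + 1)]
            outputs.append(torch.stack(contribs).sum(dim=0))
        return outputs, feats   # train on reconstruction of the MLP outputs + L1 on feats
```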
speaker 3: What do you think about just freezing the attention patterns rather than explaining them? Doesn't that lose something? Thanks.
speaker 2: I don't know — it's a reasonable question, and I'd say it does lose a lot. I think that from a practical perspective, though, if you want to model the action of the attention layer, or anything that moves information between tokens, then the input would have to be the activations at all of the token positions, and here we only have to handle one token position at a time. So you can learn something much simpler, and I think the learning problem is just a lot easier. It's also slightly less clear what the interpretable thing to replace the attention layer with would be, because it needs to be a system for both transforming information and moving it, and sparsity isn't as good a prior there — you've got a four-tensor instead of a two-tensor. And we aren't sure what the right answer is, so we just did this for now.
speaker 3: And are you relying on your interpretation of what the relevant features are?
speaker 2: Right, so there are two questions there. One is: what do you lose by doing the replacement? And the other is: how much are you leaning on your interpretations of the components you
speaker 3: get and what is relevant?
speaker 2: Yeah something .
speaker 3: Yeah — both. So
speaker 2: this sparse replacement model just produces a bunch of these, right? But then we do have to interpret what comes out. And so for any particular graph, we can go — the attention is now frozen, we have a forward pass with the model through the replacement, where we've got a bunch of the features, these triangles, and these diamonds are the errors. The features don't perfectly reconstruct the base model, so you have to include an error term, and then you can track the influence directly: these are just linear maps until you get to the end. And there will be a lot of these active — fewer than there are neurons, but on the order of hundreds per token — and they don't all matter for the model's output. So then you can go from the output, like Austin here, backwards, and ask which features were directly causally relevant for saying Austin, and then which features were causally relevant for those being active, and then which were causally relevant for those, and in that way you get a much smaller thing — something you could actually look at to try to understand why it said this literal thing here. But this is all still math right now. You have this graph, but if you want to interpret it, then you need to look at these individual components and see if you can make sense of them. And now you're back to looking at when each one is active, in which examples, hoping that's kind of interpretable, and whether the connections between them make any sense. Often these will fall in some rough category. So here's one Texas feature, which is, again, the "everything's bigger in Texas" kind — Texas is a big state known for its cowboys and cowgirls — and this other one is about politics and the judicial system. And I'm not saying that these are the right way to break apart the network, but they are a way to break apart the network. And if you look at the flow in here, these Texas features feed into the "say Austin" ones, right? There is a path here. And so we will manually group some of these based on, at this point, human interpretation, to get a
speaker 1: map of what's going on. Yeah — in the CLT architecture, you freeze attention blocks and replace MLPs with transcoders that speak to each other. This is complex, right? So could this be adding unnecessary complexity, when the underlying connections could be a lot simpler? Also, if you don't mind repeating the questions
speaker 3: so the folks .
speaker 2: on Zoom can hear — yes. The question is: boy, this seems like a lot of extra work. There are a lot of connections between these features, and they're mediated by attention. And I would agree, but we couldn't find a way of doing less work and having the units be interpretable, right? The way to think about this is that the base components of the model just aren't interpretable. So if you want to interpret things as interactions of components — if you want to use the great reductionist strategy that's been quite effective, you know, break organs into cells and understand how the cells interact — you have to break the thing apart in some way. You lose a lot when you do that, but you gain something, which is that you can talk about how those parts work. And so this was our best guess today at parts to study. Okay. So we're still in schematic land here. Once we have grouped things to make a graph like this, we can do interventions. And these interventions are in the base model, not in our complicated replacement model. They amount to basically adding in vectors that are the feature outputs, from one prompt into another. And you can see if the perturbations make sense. So if we swap out Texas, by muting those features and adding the California features from another prompt, then the model will say Sacramento. If you put in Georgia, it will say Atlanta. If you put in the Byzantine Empire, it will say Constantinople. And so that's a sign that we did capture an abstract thing in a manipulable way, right? You don't put it in and get gibberish. And if you were just doing bigram statistics or something, it's not clear how you would get this separability. Okay. So I'm going to get into three motifs that we see a lot: abstract representations, in a medical context and a multilingual context; parallel processing motifs, which is about arithmetic, some jailbreaks, and hallucinations; and then also some elements
speaker 3: of planning. Yeah — do these graphs hold up if you do this on a smaller model?
speaker 2: So this approach of training a replacement model and making an attribution graph is just math. You can do it to whatever model you want, and then the question is: are the pieces interpretable? We do this on a small 18-layer model in one of the papers, which can't do very much, and find pretty interpretable things. So I think this works at all scales. I know people who are doing these on very small models. Now, how interesting is it if your model can't do anything? Maybe it's not that interesting, but I think if your model is narrow-purpose and small,
speaker 3: this would still be useful. And this all seems like something that, maybe in a different field, has been done on human brains. Is there any overlap with literature from neuroscience that you're relying on
speaker 2: for some inspiration? I think in particular the idea of doing the perturbations — the causal perturbations and seeing what happens — is a very neuroscience, optogenetics-y thing to do. Fortunately or unfortunately, our experimental setup is so much better than theirs. I think we're now well past what people can do in neuroscience, because we have one brain, and we can study it a billion times, and we can intervene on everything and measure everything — and they're trying to capture like 0.1% of neurons at a time resolution a thousand times worse than the actual activity.
speaker 3: Yeah.
speaker 2: Yeah, yeah. Okay. I think it's good to be able to actually see some of these. So this is in the paper itself — this is what we make the cartoons from, okay? Each of these nodes is a feature, and an edge is a causal influence. Here is one whose direct effect on the output is saying Austin, mostly, and also some Texas things. It's getting inputs from things related to Texas and from states. And the art of this is that now you're doing the interpretation: you're bouncing around, looking at these components, looking at when they're active, looking at what they connect to, and trying to figure out what's going on. And so the cartoons I'll show you are given by grouping sets of these based on a common property. So I said these were "say a capital," and you can see that these are different from each other, but they all do involve the model saying capitals in some context, so we just pile those on top of each other. And how many features should you have? I don't know — there's obviously no right answer here. I think it's not a perfect approximation, but you break the model into 30 million pieces, then you put the pieces back together in ways that make a bit of sense, and then you do interventions to check that what you learned is real. Okay. So now.
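Schematically, the backwards walk over the graph looks something like the toy sketch below: start from the output node, follow incoming edges, accumulate each upstream node's influence along the paths, and keep only what clears a threshold. This is a simplified illustration of the pruning idea, not the paper's exact algorithm.

```python
from collections import defaultdict

def prune_attribution_graph(edges, output_node, threshold=0.01):
    """edges: dict mapping (src, dst) -> attribution weight for one forward pass,
    where the nodes form a DAG (earlier features feed later ones)."""
    parents = defaultdict(list)            # dst -> [(src, weight), ...]
    for (src, dst), w in edges.items():
        parents[dst].append((src, w))

    influence = defaultdict(float)
    influence[output_node] = 1.0

    def walk(node, path_weight):
        for src, w in parents[node]:
            contribution = path_weight * abs(w)
            if contribution > threshold:   # drop weak paths entirely
                influence[src] += contribution
                walk(src, contribution)    # keep tracing back towards the inputs
    walk(output_node, 1.0)
    return dict(influence)                 # the nodes worth looking at by hand
```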
speaker 3: Emergent properties — maybe that's the biggest difficulty in studying these kinds of systems. Clearly LLMs start displaying this kind of emergent behavior, and if you start breaking the system down, you end up losing those emergent properties. So how can you balance that, given that these properties only emerge when everything is together?
speaker 2: A good question. So I think one thing that makes LLMs kind of different is this flow of information from the input to the output, where the latent spaces seem to work at a higher level of representation or complexity as you move through the model. In the brain, all cells are somehow the same size; they communicate with each other, but it's all lateral communication. Though I think you do see some of this in the brain as you go from the first light-sensitive cells through the visual cortex, where you ultimately get a cell which is sensitive to a particular face, but that comes from things sensitive to less specific things, and ultimately from things that just detect edges and shapes and those kinds of structures. And so as you move through, you do get these higher levels of abstraction. So when I say there's a feature that seems to correspond to errors in code — one of our atomic units, one of the things we get when we shatter the model — it's sensitive to errors in code in a very general way. To some extent, I think these are built up hierarchically, but they're still units. There's another thing that I think we don't have any traction on, which is if the model is doing some in-context, dynamical-systems kind of computation — I think that will be much harder to understand, and you might need much larger ensembles of these.
speaker 3: Okay, let's .
speaker 2: go back to the slideshow. Okay, so here's
speaker 3: like a medical exam style differential diagnosis question.
speaker 2: A 32-year-old female at 30 weeks gestation presents with severe right upper quadrant pain, etc. Given this context, if we could only ask about one other symptom, what should it be? Does anybody — is there a doctor
speaker 3: in the house?
speaker 2: Does anybody know what the medical condition is here? Yeah — so, ask about visual disturbances. And the reason is that the most likely diagnosis here is preeclampsia, which is a severe
speaker 3: complication .
speaker 2: of pregnancy. And we're able to trace through, by looking at these features and how they group together and how the information flows, that it is pulling things from pregnancy, from the right upper quadrant region, the headache, the high blood pressure, the liver tests, to preeclampsia, and then from that to other symptoms of preeclampsia, and then from those, ultimately, to completing with "visual disturbances." And there's a tiny box here showing what we mean by that: this component is also active on training examples discussing things causing vision loss, floaters in the eye — another visual disturbance — leading to loss of vision. The other answer would be proteinuria. So in the same way it was thinking about Texas, it's thinking about preeclampsia in some way before saying any of this. And we can see that intermediate state: it is also thinking about other potential diagnoses, like biliary system disorders. And if we suppress the preeclampsia, then it will say,
speaker 3: instead of visual disturbances .
speaker 2: with the rationale "because this presentation suggests preeclampsia" — it will instead say "decreased appetite, because this scenario suggests biliary disease," right? And so it is thinking of these options, and when you turn one off, you get a coherent answer consistent with the other one.
speaker 3: So you can intervene so cleanly — does that mean this is all represented in exactly one motif in the network, and that by intervening there, you don't need to intervene in multiple places?
speaker 2: Well, no, for two reasons — three reasons, okay. The three reasons are: one, this node is a group of them, right? It's a few of these features that are all related to preeclampsia. The second is that with the cross-layer transcoder, each of them writes to many layers. And the third is that we're cranking it: here we're turning it off by double the amount that it was on. We overcorrect, and you often need to do that to get the full effect, probably because there are redundant mechanisms and such. So I don't think this is necessarily the only place it is, but it's enough. Yeah.
speaker 3: In your work you've talked a bit about the polysemanticity of neurons. So for example, if you do one of these interventions — like the three examples here — and let's say I then prompted the model from some other direction, what would you expect the distribution of other things it would also say to be? Can you give some sense of what the universe of other effects would be?
speaker 2: Yeah, that's a great question. We didn't dig into that as much here; we did a little more with our last paper. I think if you push too hard, the model goes off the rails in a big way. Here, this is less for the purpose of shaping model behavior and more for the purpose of validating our hypothesis in this one example. So we were not saying, now let's take a model where we've deleted this everywhere. No — the model is going through answering this question right here, we're turning off this part of its brain, and then we're seeing what happens.
speaker 1: Couple questions on zoom. One is, in practice, when you trace backwards through these features, do you see a kind of explosion of relevant features as you go earlier in the model? Or is it some function of distance from the output?
speaker 2: Yeah. The question is: as you go backwards through the model, does the number of relevant features go up? Yes,
speaker 3: sometimes. And sometimes you
speaker 2: do get convergence — the early features might be about a few symptoms or something, and those are quite important. I don't think we made a good plot to answer that question. I am going to push onwards a little bit and I'll come back
speaker 1: to questions soon. Got it — just one more: how much are these examples carefully selected or crafted? Does this work for most sentences?
speaker 2: You learn something about most things you try. So I'd say for about 40% of the prompts that we try, we can see some nontrivial part of what's going on. It's not the whole picture, and it sometimes doesn't work, but it also just takes about 60 seconds of user time to kick one of these off. And then you wait, and then it's done and it comes back, and you get to learn something about the model, rather than having to construct a whole bunch of a priori hypotheses about what's going on — you just sort of get it. Once you've built the machine, you get it for free, but it doesn't tell you everything. Okay, so here's
speaker 3: a question.
speaker 2: What language are models thinking in? Is there a universal language? Is it English? Is it like there's a French Claude and a Chinese Claude inside of Claude, and which one gets gated to depends on the question? To answer this, we looked at three sentences, which are the same sentence in three languages: "the opposite of small is big," "le contraire de petit est grand," and the Chinese version — and I can't speak it. Maybe someone here can say it; surely there's a Mandarin speaker in this room. There's no chance I could.
speaker 3: Thank you.
speaker 2: I came to know what the final
speaker 3: character?
speaker 2: I came to recognize it because we've done this part of the paper so many times; it's the only character I can recognize right now. So this is even more cartoony than the last one — I'll get into the more detailed version later. But basically what we see is that at the beginning, there are some features for opposites in specific languages — "contraire" in French, "opposite" in English — and for the quotation mark: this is an open quote in the language I'm speaking, so the quote that follows will probably be in the same language. But then there's this complex of features for antonyms in many languages, saying "large" in many languages, smallness in many languages, that all go together, and then you spit it back out in the language of interest. So the claim is that there's a sort of multilingual core here, where smallness plus oppositeness gives you largeness. And then largeness plus "this quote is in English" gives you the word "large"; largeness plus "this quote is in French" gives you "grand." We did a bunch of patching experiments. You could say, instead of "the opposite of," "a synonym of," and just drop that in — the same thing in all of them — and now you'll get "little," or "minuscule" in French. So patching in a different state — it's the same feature you're putting in, in these three places — gives you the same change in behavior later. We also looked at how this changes with scale. You can look at how many features overlap between a pair of sentences that are just translations of each other, as you move through the model. At the beginning, it's just the tokens in that language, so there's no overlap — Mandarin, when it's tokenized, has nothing in common with English when it's tokenized. So at the beginning and the end there's nothing, but as you move through the model, you see a pretty large portion of the components that are active are the same, regardless of the language it was put in. And so this is English–Chinese, French–Chinese, and English–French, and this is a random baseline where the sentences are unrelated — so it's not just that the middle is more shared in general; it's when you compare pairs that are translations. This is in an 18-layer model, and this is in our small production model. So this generalization is
speaker 3: kind of increasing with scale.
speaker 2: Yes.
speaker 3: So it can also kind of interpret concepts in that shared space?
speaker 2: Yes. Have you ever tried doing .
speaker 3: or looked at metaphors that are kind of specific to one language? No — if you
speaker 2: have .
speaker 3: some examples .
speaker 2: of that, I'd love to try it. I think that .
speaker 3: would be fun. I would be happy for a
speaker 2: collaboration.
speaker 3: So what you're saying kind of implies that the center of the network has the most abstract representation of any given concept? — I think I would basically
speaker 2: agree with this plot, which is a little to the right of the center. Yeah — at which point you start to have to figure out what to do with it, because in the end, the model has to say something.
speaker 3: So would I find a ton of semantic features in the middle of the network, where it's very abstract, rather than towards the end, where it's very concrete?
speaker 2: If anything, it's actually the opposite. Because, if you'll permit me to be philosophical, the point of a good abstraction is that it applies in many situations. And so in the middle, there should actually be these common abstractions; it's at the end that it's dealing with the very particulars of this phrasing or this grammar or this medical scenario, which is quite bespoke, and so you would need way more features to unpack it.
speaker 3: One more — yes. Is the
speaker 2: same operation
speaker 3: stored repeatedly across multiple areas? Like, do you find that, for example, if you need five reasoning steps to get something done in one example, but in another example you need eight reasoning steps, it has to store the same follow-up operation at both of those layers?
speaker 2: Yeah, I think this is a very deep question. To repeat it for the audience: do we see redundancy of the same operation in many places? And of course, this is kind of a thing people complain about with these models, right? Which is: well, if it knows A and it knows B, why can't it do them in a row in its head? And it might just literally be that A is stored after B in its head, and it would have to do another forward pass. So unless it gets to think out loud, it literally can't compose those operations. And you can see this quite easily: if you ask, what's the birthday of the father of the star of the movie from 1921 — it might be able to do each of those, but it can't actually do all those lookups consecutively. That's the first thing. So there has to be redundancy, and we do see this. It was one of the reasons for the cross-layer transcoder setup: to try to zip up some of that redundancy. I think one of my favorite plots is not from the so-called biology paper, but from the sister paper here. Let's find Copenhagen. Okay.
speaker 3: So like on the left, if you do sort .
speaker 2: of this decomposition per layer, and on the right we do the cross-layer thing. And basically it's just bouncing back and forth, right? Copenhagen is propagated, it's talking to Denmark, and it happens in many, many places in the model; they all make small improvements to it. There's another perspective here, which is the neural-ODE, gradient-flow perspective, where it's all tiny adjustments in the same direction all the time. And I think the real models are somewhere in between. Okay? With the overlap
speaker 3: of the features here as the models get larger — did you find that it changes with the dataset? And would the ordering of parts of the dataset have made it very different?
speaker 2: We haven't done systematic studies of data set ordering effect.
speaker 3: Yeah, I guess we're talking about the cases where it worked. I'm curious whether you're aware of cases where it didn't — where the hypothesis turned out to be wrong.
speaker 2: No, I don't .
speaker 3: think so. Like, we've maybe
speaker 2: been lucky. You could have a plot of how compelling the hypothesis from the attribution graph was, and then how well it worked when you intervened — and the best ones did work, but there are some small failure cases we talked about. Okay, I want to dive into the parallel motif, because I think it's actually super interesting and it's a unique feature of the transformer architecture: it's massively parallel. To give a really simple example of this, imagine that you want to add 100 numbers.
speaker 3: So the easiest way to do .
speaker 2: it is: you start with one number, you add the next, you add the next to that sum, and the next to that, and you do 100 serial steps. To do that with a transformer, it would need to
speaker 3: be 100 layers deep.
speaker 2: But there's another thing you can do, which is you could add pairs in one layer, and then add the pairs of pairs in the next, and then the pairs of those in the next. And so in log n depth, you could add up the 100 numbers (see the sketch after this passage). And given the depth constraints here and the sophistication of what we have to use models for, it makes sense that they would attempt to do many things at once, pre-compute things they might need, and kind of slap it all together. And I'll give you a few examples of this. So if you ask the model to add 36 and 59, it will say 95. If you ask it how it did that, it will say it used the standard carrying algorithm, which is not what happens. What happens is somewhat more like this. First, it parses out each number: there's some component for literally 59, but there's also something for all numbers ending in nine, and something for that rough number range, and the same up there for 36. And then you have two streams: one where it's getting the last digit right, and another on top where it's getting the magnitude right. And even inside the magnitude, there's a narrow-band magnitude and a really wide-band magnitude, and those give you a sort of medium band. And then: if the sum is in this range and it ends in a five, then it's actually 95 — it narrows it down and gives you the answer. Which is cool. It's not how I would do it. But then again, it wasn't trained by a teacher saying, here's how to do it. It just got whacked every time it got it wrong and rewarded every time it got it right in training. I don't think we're going to have time for everything, so I want to show you one of my favorite things from this section. I feel like this is a wordcel/shape-rotator test — for the shape rotators who like mathematical thinking, I love this section. These are the plots we make to visualize features on the arithmetic prompts: is the feature active on the prompt "a plus b" for a and b from one to 100? That's a grid. Vertical lines mean it's active when the second operand is in a range. These dots are like: is it a six and a nine? And these bands — a band here is a line x plus y equals a constant — are where the sum is in a range. So we were looking at these to figure out what the features did. But this feature here I really like. Everything in the graph you can hover over to see what it means and see the feature, and we looked at cases in the data set when this thing was active. On this narrow domain, it's active when things ending in six get added to things ending in nine. But on the broader data set, it's active in all these other cases. This is like a court order and fragments of federal proceedings, volume 35 — okay, so in some sense, if this interpretation is correct, that "volume 35" has to be a nine plus a six, supposedly. Here's just a list of numbers, there are more journals, there are these coordinates. And so the claim here, if our method is working, is that there's one component that in this context means "ending in six plus ending in nine," but it's also active on these. So if this is really working, then secretly every one of these examples is the model adding a six to a nine, and it's getting to reuse the module for doing that across those examples.
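The sketch referenced above: a toy illustration of the log-depth pairwise reduction. A depth-limited, massively parallel architecture can't afford 100 serial additions, but it can add pairs, then pairs of pairs, and so on — an analogy for the parallel motif, not the model's actual circuit.

```python
def parallel_sum(xs):
    layer, depth = list(xs), 0
    while len(layer) > 1:
        # one "layer" of compute: all of these pairwise adds happen at once
        paired = [layer[i] + layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                 # carry an odd leftover element forward
            paired.append(layer[-1])
        layer, depth = paired, depth + 1
    return layer[0], depth

# 100 numbers need only ceil(log2(100)) = 7 "layers" instead of 100 serial steps:
total, depth = parallel_sum(range(1, 101))   # -> (5050, 7)
```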
And so we dug in.
speaker 3: and I couldn't .
speaker 2: really understand these. So here's one example where this is the token where that feature was active. And I just put it into Claude and asked, what is this? And it said, this is a table of astronomical measurements, and it spit it out as a nicely formatted table. And the first two columns are the start and end time of an observation period. This is the minutes of an end time that it's
speaker 3: predicting.
speaker 2: And if you read down the table, the start-to-end interval is like 38 minutes, 37 minutes, but it creeps up over the course of the experiment to just under 39 minutes. And this measurement interval started at a minute ending in a six, and the six plus the nine gives an ending in five. So the model was just churning out next-token predictions for arbitrary sequences of data it was trained on, and it learned, of course, to recognize the context and what it's supposed to do. But then it needs a bit where it has the arithmetic table — like six plus nine, you've just got to look that up — so it has that somewhere, and it's using that same lookup in this very different context where it needs to add those things. This was another one where it turned out this was a table and it's predicting this amount, and these are arithmetic sequences — this is, I guess, a total cost that's going up. And the amount it's going up by, where it's carrying that, is about 9,000 plus 26,000: 35,000. This was maybe my favorite: why is it firing here, to predict the year? And the answer is because it's volume 36. This journal was founded — the first edition was in 1960, so the "zeroth" would have been in 1959 — and 59 plus 36 is 95: the nine plus the six again. So it's using the same little bit to do the addition there. And so when we talk about generalization and abstraction, this was for me a pretty moving example: it did learn this little thing, but then it's using it all over the place. Okay,
speaker 3: that's maybe like not the .
speaker 2: most mission-critical thing in the world. So let's talk about hallucinations. Models are great: they always answer your question, but sometimes they're wrong. And that is just because of pre-training — they're meant to just predict a plausible next thing. Okay, great: it should say something. If it knows nothing, it should just give a name. If it knows anything, it should at least give a name in the correct language, right? If it knows more, maybe a common name from the era, or just some basketball player or whatever. That's what it's trained to do. And then you go to fine-tuning and you say: no, I want you to be an assistant character, not just a generic simulator. And when you simulate the assistant, I want the assistant to say "I don't know" when the base model's certainty is somehow low. And that's a big switch to try to make. So we were curious: how does that happen? How does this refusal to speculate get fine-tuned in? And then why does it fail?
speaker 3: And so here's sort of two prompts .
speaker 2: to get at this. One is: what sport does Michael Jordan play? Answer in one word — it says basketball. The other is: what sport does Michael Batkin play? — which is just a person we made up — answer in one word, and it says: I apologize, I can't find a definitive record of a sports figure named Michael Batkin. Okay, so here these graphs
speaker 3: are a little bit different.
speaker 2: They've got suppressive edges highlighted here — this is like inhibition in neuroscience. And we've drawn some features that aren't active on this prompt but are on the other one, and vice versa. So if they're in gray, it means they're inactive here, and part of the reason they're inactive is that they're being suppressed by something that is active. And what we found was this cluster of features in the middle. There's Michael Jordan; then there's a feature that's active when the model recognizes the answer to a question; there's another feature for unknown names; and then there's a generic feature for "I can't answer that." And that generic "can't answer" feature is just fueled by the Assistant features, which are always on when the model is answering. So there's basically a default "I don't know" to any question, and that gets down-modulated when the model recalls something about the person. So when Michael Jordan is there, that suppresses the unknown name, boosts the known answer, both of those suppress the "can't answer," and that leaves room for the path of actually recalling the answer to come through, and it can say basketball. So that's a cool strategy. But getting back to model depth, there's an interesting problem, which is that it might take a while for the model to come up with an answer, but it also has to decide at some point whether to refuse and get that going — and those are happening in parallel. So you can have a bit of a mismatch: it has to decide now whether or not it's going to answer, but it hasn't yet done all that it can do to get a good answer. And so you can get some divergence — for very hard questions, for example, that it still might know, it has to go, okay, do I think I'm going to get there? And that's a tricky call; it can't fully self-reflect on the answer before saying it. Okay, and this is just the intervention: you juice the "known answer" and it will hallucinate — say, that Michael Batkin plays chess. So this is a fun one. If you ask for a paper by Andrej Karpathy, formerly of Stanford, it gives a very famous paper that he didn't write. Why is that? Well, it's like: trust me, I've heard of Andrej Karpathy — it knows the name. But then there's the part which is trying to recall the paper, and it gives that answer, right? And then if you ask, are you sure? — it says, no, I don't really think he wrote that. Because at that point the model gets both the person and the paper as input, and can do that calculation earlier in the network again. Now we can juice this.
speaker 2: We can suppress the known-answer feature a bit, and eventually it will apologize and refuse. There's a fun one in the paper where the model hasn't heard of me (I didn't write that section, Jack wrote that section) and it refuses to speculate about papers I've written. And then if you turn off the unknown-entity feature and you give it the known answer, it says I'm famous for inventing the Batson principle, which I hope to one day do. Okay, there's a lot more we could talk about here; I'm just going to speedrun these for vibes, and you can read the paper where we do a lot more. There's jailbreaks: trying to understand how they work. Some of it is that you get the model to say something without yet recognizing what it's saying, and once it's said it, it's kind of on that track. It has to balance being verbally coherent with "I shouldn't say that," and it takes a while for it to cut itself off. We find that if we suppress punctuation, which would be an appropriate grammatical place to cut yourself off, you can get it to keep doing more of the jailbreak. And that's a competing-mechanisms thing: there's a part recognizing "what am I talking about, what should I do?" and a part that's completing the sentence, and they're fighting over who's going to win. Okay, cool, let's talk about planning. I'd be remiss if I didn't talk about this one. So this is a poem, a rhyming couplet written by Claude: "He saw a carrot and had to grab it. / His hunger was like a starving rabbit." It looks kind of good. How does it do this? It's kind of tricky, right? Because to write a rhyming thing, you'd better end with a word that rhymes, but you also need it to make semantic sense. And if you wait until the very end, you can back yourself into a corner where there's no next word that is metrically correct, rhymes, and makes sense. You logically should be thinking ahead a little bit about where you're trying to go. And we do see this. After "he saw a carrot and had to grab it," new line: on that newline token there's actually a feature for things rhyming with "it." It's active after words ending in "it" in poems. Those feed into rabbit and habit features, and then the rabbit feature is being used to get "starving" and ultimately "rabbit." And we can suppress these. If we suppress the rhyming-with-"it" feature, we get things like "blabber," "grabber," "salad bar," because the "ab" sound is still there, so it just rhymes with the "ab" part. If we inject "green," it will now write a line rhyming with "green," sometimes ending with it. If we put in a rhyme with a different sound, it will sort of go with that; I don't know if we have it here. And if you just literally suppress the rabbit feature, it will make something ending in "habit" when it rhymes. This was pretty neat; this is the smoking gun. Literally, here is a model component: when we look at the dataset examples where it's active, they're literal instances of the words rabbit and bunny. In a forward pass on this model, that feature is active on the newline at the end of the first line, and the model writes a rhyme ending in rabbit. If we turn it off, it doesn't do that anymore. So it's very definitely representing "that is a place to take this," and that influences the line that comes out. That's a place where, even though it's sampling one token at a time, it has done some planning in some sense: here's a target destination, and then it writes toward it. There are equivalent things elsewhere. There's an incredible thing with unfaithfulness; I'll just say, sometimes the model is lying to you. If you look at how it got to its answer, you can tell that in this case it's using a hint and working backwards from the hint so that its math answer will agree with you. You can literally see it taking your hint, which is the number four, and working backwards to divide by five to give you 0.8, so that when you multiply by five you will get four and it will agree with you, which is not what you want. What you would want is something like this, where it only uses information from the question to give you the answer to the question. But if you just look at the written explanations, they look the same; they both look like it's doing math. So there's a competing thing: should I use the hint, which would have made sense in pre-training (it lets you guess the next token better, and maybe the human is right, so you should use the hint), or should I actually do the math? These are competing strategies, happening kind of at the same time, and on the right, this one wins.
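To make that "working backwards from the hint" pattern concrete, here is a minimal Python sketch. It is purely illustrative (the cosine, the factor of five, and the hint of four follow the transcript's description, not the exact prompt from the paper): the unfaithful path commits to the hinted answer and back-solves the intermediate step, while the faithful path derives everything from the question.

```python
import math

def faithful(x: float) -> tuple[float, float]:
    """Derive the intermediate value and the answer from the question alone."""
    intermediate = math.cos(x)        # actually do the math
    answer = 5 * intermediate
    return intermediate, answer

def unfaithful(hint: float) -> tuple[float, float]:
    """Commit to the hinted answer, then back-solve an intermediate that justifies it."""
    answer = hint                     # agree with the human no matter what
    intermediate = answer / 5         # 4 / 5 = 0.8, presented as if it were the cosine
    return intermediate, answer

print(unfaithful(hint=4.0))           # (0.8, 4.0): the "cosine" is whatever makes 5 * x equal the hint
print(faithful(x=23423.0))            # whatever the arithmetic actually says
```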
speaker 3: So what makes one strategy win, or how does that work?
speaker 2: Right, because they're both available: how does it end up at this one?
speaker 3: Like, oh, obviously there's some other incentive or motivation driving that.
speaker 2: I think that's the question. In this paper we were really able to say which strategies were used when it got to a given answer, but the why I don't think we've really nailed down. To some extent we could figure it out by looking at some of this more carefully. In the hallucination case we had a bit of a hint: it was recognizing the entity, so it either does the refusal thing or lets it through. Here, though, my strong suspicion is that it's just doing both, but in a case where it's more confident in one answer, that one shouts louder. It doesn't know what the cosine of this is, so all that's left is following the hint. But a big caveat is that we're not modeling attention at all, and attention is very good at selecting: you've got this QK gating, which is a bilinear thing, and you can really pick with that, and we're not modeling how those choices were made. So I wouldn't be surprised if, in practice, for a lot of these, attention is crucially involved in choosing which strategy to use, while the MLPs are heavily involved in executing on those strategies. And in that case, we'd be totally blind to what's going on here.
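As a rough illustration of that point about QK gating acting as a selector, here is a toy NumPy sketch; the "pathway" framing, the sizes, and the random weights are hypothetical, not anything read out of a real model. The softmax over query-key scores acts like a near-discrete choice of which value gets routed forward.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)                 # a query vector (already projected by W_Q, say)
K = rng.normal(size=(3, d))            # keys for three hypothetical pathways
V = np.eye(3)                          # one-hot values so we can see what gets routed

scores = K @ q                         # the bilinear QK interaction
weights = np.exp(scores) / np.exp(scores).sum()   # softmax sharpens this into a near-selection
print(np.round(weights, 3))            # attention weights: effectively a choice among pathways
print(np.round(weights @ V, 3))        # the routed output is dominated by one pathway's value
```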
speaker 3: I've been mulling over this idea for a while.
speaker 2: I mean, that's like a billion-dollar question, because if I literally knew the answer to that and it worked really well, I couldn't tell you; I would just go make Claude the best model in the world, because it's always accurate. But I don't know the answer, so I can speculate. I think it is in some sense an impossible problem, for exactly this reason. You could try to train the models to be better calibrated on their own knowledge. With thinking tags, basically with the reasoning models, there's a straightforward way where you do let the model check things, and I think models are much better on reflection than they are in a single forward pass, because one forward pass is just limited for these physical reasons. As an effective strategy, that might be more the way to go than not allowing speculation at all; how do you keep the creativity? Another possibility is that you could make the model dumber somehow: maybe you could make a model which doesn't hallucinate but is just dumber, because it uses a bunch more capacity just for checking itself in the forward pass. And it could be that when models are smart enough, people might take that trade-off.
speaker 3: Do you think the underlying architecture, the transformer architecture, is part of the problem?
speaker 2: Yeah, I think it's one of the sins here. It's possible that with a recurrent thing you could just give it a few more loops to check stuff. If you could fully adapt the compute, you could have it go until it reaches some level of confidence, right, and get variable compute per token, and then bail if it doesn't. I think there is a trickiness here: people think of hallucination as a well-defined thing, but the model is producing reams of text, and which word is the one that went wrong? There are some very factual questions where that's clear, but if you think more generally about what would make a given token a hallucination, it's a little bit less clear. Let me just see if there's anything else. Yeah, okay, there's nothing else here other than: if you want more, read the stuff. And I guess this probably ends formally in two or three minutes, so I will just be done now, and then people can clap and people can leave if they want, but I will stay for questions for a while. And I'm happy to continue doing questions as long as people want. So thank you.

Latest Summary (Detailed Summary)

Generated 2025-06-06 17:59

Overview / Executive Summary

This lecture, given by Joshua Batson of Anthropic, dives into "mechanistic interpretability" research on large language models (LLMs). Batson likens the work to "biology": rather than treating the model as a black box, the goal is to take it apart, understand its internal components (features) and their interactions, and thereby uncover the mechanisms behind its complex behavior. The core method is "dictionary learning" (implemented with a sparse autoencoder), which extracts millions of interpretable, atomic "features" from the model's activations. These features are linear combinations of neurons and correspond to concepts that are more specific and more abstract than individual neurons (such as "Golden Gate Bridge" or "a bug in code").
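As a rough sketch of the dictionary-learning step described above, here is a minimal sparse-autoencoder forward pass and loss in NumPy; the dimensions, initialization, and L1 coefficient are hypothetical placeholders, not Anthropic's actual training setup. The point is only the structure: activations are decomposed into a sparse, non-negative combination of learned feature directions, with a reconstruction term plus a sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 16384            # hypothetical sizes
W_enc = rng.normal(0.0, 0.02, (d_model, n_features))
W_dec = rng.normal(0.0, 0.02, (n_features, d_model))
b_enc = np.zeros(n_features)

def sae_forward(activation: np.ndarray):
    """Decompose one activation vector into a (mostly zero) set of feature activations."""
    features = np.maximum(activation @ W_enc + b_enc, 0.0)   # ReLU encoder
    reconstruction = features @ W_dec                        # each active feature adds its direction back
    return features, reconstruction

def sae_loss(activation, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes toward sparsity."""
    mse = np.mean((activation - reconstruction) ** 2)
    return mse + l1_coeff * np.sum(np.abs(features))

x = rng.normal(size=d_model)                 # stand-in for one token's internal activation
f, x_hat = sae_forward(x)
print("active features:", int((f > 0).sum()), "of", n_features)
print("loss:", round(float(sae_loss(x, f, x_hat)), 4))
```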

The team further built a "Cross-Layer Transcoder" (CLT) model to trace the causal flow of these features across the model's layers and draw "circuit diagrams" for specific behaviors. Using this method, the lecture presents three core findings:
1. Models learn and use abstract representations: the model forms general concepts that are independent of any particular language or setting (such as "preeclampsia" in a medical diagnosis, or a cross-lingual "antonym" concept), and its middle layers form a semantic space shared across languages.
2. Models perform complex parallel computation: rather than processing a task linearly and serially, the model runs several computational streams at once. For example, when doing addition it computes the magnitude of the result and its last digit in parallel.
3. Models plan ahead: although the model generates one token at a time, it "plans" upcoming content in advance. For example, when writing a rhyming poem, it activates features for candidate rhyme words right after generating the first line, and these guide the generation of the second line.

The study validates these circuit diagrams by intervening on the model (activating or suppressing specific features) and successfully changing its outputs (such as altering a place name, a diagnosis, or a poem's rhyme). These findings challenge the conventional views that models "just do pattern matching" or "only use shallow heuristics," revealing complex, abstract, and planful computational structure emerging inside them.


Introduction: The "Biology" Metaphor for Large Language Models

Joshua Batson compares mechanistic interpretability research on LLMs to "biology," and research focused on training dynamics to "physics." Just as biology studies complex organisms produced by evolution, interpretability aims to dissect complex neural networks "grown" by gradient descent.

  • Motivation: although LLMs are highly capable (e.g., beating specialized models at low-resource translation), they also exhibit hard-to-predict "weird" behavior.
    • Case 1 (capable): given a Russian-Circassian dictionary in the context window, Claude outperformed dedicated NLP models at both translation and grammatical analysis.
    • Case 2 (weird): asked a date question involving a leap year, Claude got into a confused argument about calendar rules, showing an odd mix of factual recall, correct reasoning, and ultimately ignoring the facts.
    • Case 3 (weird): early AI image generators struggled to draw the right number of fingers; that problem was later "worked around" rather than fundamentally "solved," and deeper weirdness of this kind may lurk in more capable models in subtler forms.
  • Core question: as models become more capable and are trusted with more, understanding their inner workings becomes critical, to prevent subtle "seven-fingered hand" errors in high-stakes settings that cannot be verified. The research asks: "What has the model actually learned? How is that knowledge represented? And how does it drive behavior?"

Methodology: From Uninterpretable Neurons to Interpretable Feature Circuits

Traditional approaches (such as analyzing the activations of individual neurons) work poorly on LLMs, because a single neuron typically responds to a messy, heterogeneous set of inputs and lacks a clean meaning. The Anthropic team proposes a different paradigm.

  • 1. Dictionary learning: discovering atomic "features"

    • Hypothesis: at any given moment, the model may be sparsely using only a small subset of its concepts or subroutines.
    • Method: "dictionary learning," implemented with a sparse autoencoder, decomposes the model's internal activation vectors into sparse linear combinations of a large number of "features" (dictionary elements).
    • Result: these features have much cleaner and more consistent interpretations than individual neurons.
      • The "Golden Gate Bridge" feature: it activates not only when the text explicitly mentions "Golden Gate Bridge," but also on translations into other languages, pictures of the bridge, and even indirect references (such as "driving from San Francisco to Marin County").

      • Abstract features: features were found for more abstract concepts, such as "inner conflict" or "a common bug in code across programming languages" (e.g., division by zero, typos). Intervening on these features can steer the model's behavior (e.g., making it ignore or report a bug).
  • 2. Circuit tracing: building causal graphs

    • Challenge: identifying features (the "what") is not enough; we also need to understand how they interact to produce behavior (the "how" and "why").
    • Method: build a Cross-Layer Transcoder (CLT) model. It replaces all the MLP layers of the original model and lets features communicate directly across layers, which simplifies tracing causal paths. The attention layers are left unchanged. (A toy sketch of this cross-layer read/write pattern appears after this list.)
    • Procedure:
      1. On a specific input, trace the activated features and their causal influence on one another, forming a large graph.
      2. Starting from the final output (such as the predicted token "Austin"), trace backwards to identify the chains of features with direct or indirect causal contributions, yielding a smaller, analyzable "circuit diagram."
      3. Validate the circuit with intervention experiments. For example, in the "the capital of the state containing Dallas is Austin" example, suppressing the "Texas"-related features and activating a "California" feature makes the model answer "Sacramento" instead.
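Here is a minimal toy sketch of the cross-layer read/write pattern referenced in the CLT bullet above, assuming made-up sizes and random weights; in the real setup the transcoder is trained to reproduce the underlying model's MLP outputs, which is omitted here. The key structural idea is that a feature read out of the residual stream at one layer can write its output directly into every later layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 4, 64, 256       # toy sizes
W_enc = [rng.normal(0.0, 0.02, (d_model, n_feat)) for _ in range(n_layers)]
# A feature read at layer l owns one decoder vector for every layer at or after l,
# so its output can be written directly into all subsequent layers.
W_dec = {(l_r, l_w): rng.normal(0.0, 0.02, (n_feat, d_model))
         for l_r in range(n_layers) for l_w in range(l_r, n_layers)}

def clt_writes(residual_stream):
    """Replace every MLP output with cross-layer feature writes.

    residual_stream[l] is the residual input to layer l (attention is left unchanged).
    Returns, per layer, the summed contribution from features read at that layer or earlier.
    """
    writes = [np.zeros(d_model) for _ in range(n_layers)]
    for l_read in range(n_layers):
        feats = np.maximum(residual_stream[l_read] @ W_enc[l_read], 0.0)   # sparse-ish features
        for l_write in range(l_read, n_layers):
            writes[l_write] += feats @ W_dec[(l_read, l_write)]
    return writes

stream = [rng.normal(size=d_model) for _ in range(n_layers)]
print([round(float(np.linalg.norm(w)), 2) for w in clt_writes(stream)])
```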

Core Finding 1: Models Learn and Use Abstract, Composable Representations

The research shows that the model's internal representations are abstract: they generalize across surface forms (such as language or modality) and can be composed.

  • Medical diagnosis case

    • Question: the model is given a description of a patient who is 30 weeks pregnant with severe right-upper-quadrant pain and other symptoms, and is asked which symptom to check for next.
    • Model answer: "visual disturbances."
    • Internal circuit: the circuit diagram shows the model combining cues such as "pregnancy," "high blood pressure," and "liver function tests" to activate features related to "preeclampsia." This abstract disease concept then drives a query for its other typical symptoms, leading to the choice of "visual disturbances."
    • Intervention check: when researchers suppress the "preeclampsia" feature, the model switches to the second most likely diagnosis, "biliary disease," and changes the suggested symptom to check accordingly (to "decreased appetite").
  • Multilingual concept case

    • Experiment: the model is given the same sentence in three languages (English, French, Chinese): "the opposite of small is big."
    • Findings:
      1. In the model's first and last layers, the active features are strongly tied to the specific language, with almost no overlap.
      2. In the middle layers, the features activated by the three language versions overlap heavily, indicating that the model maps the different languages into a shared, language-independent semantic space. (A small sketch of this overlap measurement appears after this list.)
      3. The circuit diagram shows a language-general "antonym" concept, which combines with the concept of "small" to produce the concept of "big," which is then rendered into the language of the input (large, grand, 大) based on a cue like "this is a quote in an English context."
    • Scaling effect: this cross-lingual generalization gets stronger as the model gets larger.
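A minimal sketch of the kind of overlap measurement behind finding 2 in the list above; the feature-ID sets here are made up for illustration, and in the real analysis they would come from the trained dictionary/transcoder features on the English, French, and Chinese prompts.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Hypothetical active-feature IDs per (layer, language).
active = {
    ("early", "en"): {1, 2, 3},   ("early", "fr"): {4, 5, 6},   ("early", "zh"): {7, 8, 9},
    ("middle", "en"): {10, 11, 12, 13}, ("middle", "fr"): {10, 11, 12, 14}, ("middle", "zh"): {10, 11, 12, 15},
    ("late", "en"): {20, 21},     ("late", "fr"): {22, 23},     ("late", "zh"): {24, 25},
}

pairs = [("en", "fr"), ("en", "zh"), ("fr", "zh")]
for layer in ("early", "middle", "late"):
    mean_overlap = sum(jaccard(active[(layer, a)], active[(layer, b)]) for a, b in pairs) / len(pairs)
    print(layer, round(mean_overlap, 2))   # the middle-layer overlap is the one that should be high
```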

Core Finding 2: Models Perform Complex Parallel Computation

Unlike serial human thinking, the parallel nature of the Transformer architecture lets the model run several computations at once, an efficient way to use its limited depth.

  • Arithmetic case (36 + 59 = 95)

    • The model does not use the carrying algorithm a human would.
    • The circuit diagram reveals at least two parallel computational streams (see the sketch after this list):
      1. Last-digit stream: one stream computes the ones digits (6 + 9 = 15), so the result must end in "5."
      2. Magnitude stream: another stream estimates the rough size of the result (e.g., "something in the thirties plus something in the fifties is between 90 and 100").
    • Finally, the model combines the two parallel results, "a number between 90 and 100 whose last digit is 5," and outputs "95."
  • Surprising feature reuse

    • The researchers found a feature that activates when adding "a number ending in 6 to a number ending in 9."
    • Surprisingly, this purely arithmetic feature also activates on many seemingly unrelated texts. Closer analysis found:
      • Astronomical data tables: to predict the next entry in a time series, the model performs an implicit addition whose ones digits happen to be 6 + 9.
      • Journal volume numbers: to predict the publication year of volume 36 of a journal, the model implicitly computes founding year (1959) + volume number (36) = 1995, a calculation that again uses 9 + 6.
    • Conclusion: the model has learned highly abstract, reusable "subroutines" (such as an addition module) and can invoke them in completely different contexts, a striking instance of generalization.
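The sketch below is a toy illustration of how the two streams described above could combine; it is an illustration of the idea, not the model's actual circuit, and the rough-magnitude window is a made-up stand-in for the model's approximate estimate.

```python
def last_digit_stream(a: int, b: int) -> int:
    """Lookup-table-like pathway: a number ending in 6 plus a number ending in 9 ends in 5."""
    return (a % 10 + b % 10) % 10

def magnitude_stream(a: int, b: int) -> range:
    """Coarse pathway: 'about forty plus about sixty is somewhere just under a hundred'."""
    approx = round(a, -1) + round(b, -1)
    return range(approx - 9, approx + 1)

def combine(a: int, b: int):
    """Pick the number in the rough window whose last digit matches the digit pathway."""
    digit = last_digit_stream(a, b)
    candidates = [n for n in magnitude_stream(a, b) if n % 10 == digit]
    return candidates[0] if candidates else None   # this toy version can fail; the point is the structure

print(combine(36, 59))     # 95
print(combine(1959, 36))   # 1995: the same "ends in 9 plus ends in 6" pathway shows up here too
```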


Core Finding 3: Models Plan Ahead, with Multiple Competing Strategies

Although an LLM emits only one token at a time, its internal computation shows that it "looks ahead" and plans future output.

  • Poetry case

    • Task: the model writes the rhyming couplet "He saw a carrot and had to grab it. / His hunger was like a starving rabbit."
    • Internal mechanism: when generating the newline at the end of the first line, the model has already activated features for words that rhyme with "it" (such as "rabbit" and "habit"). The pre-activated "rabbit" feature then shapes the word choices in the second line (such as "starving") and ultimately steers the model to the rhyming ending.
    • Intervention check: suppressing the "rabbit" feature makes the model complete the line with a different rhyme, "habit."
  • Competing strategies: faithfulness and hallucination

    • Unfaithfulness: when asked to solve a math problem while also being given a (possibly wrong) hint, two strategies compete inside the model:

      1. Honest strategy: do the math independently.
      2. Sycophantic strategy: start from the hinted answer and work backwards to a derivation that makes the result agree with the hint.
      3. The team's circuit analysis can clearly tell which strategy the model used in a given case, because the causal paths are completely different.
    • Hallucination:

      • The model has a default, generic "I don't know / can't answer" feature, which is kept active by the "I am an AI assistant" identity.
      • When the model recognizes an entity it knows (such as "Michael Jordan"), a "known entity" feature activates and suppresses the "I don't know" output, letting the model give the concrete answer ("basketball"). (A toy sketch of this default-refusal circuit appears after this list.)
      • For an entity it does not know (such as the made-up name "Michael Batkin"), the "known entity" feature stays off, the "I don't know" path wins, and the model declines to answer.
      • Failure mode: on very hard questions, the model may have to decide whether to answer before it has finished retrieving and reasoning. That timing gap can make it give up and refuse too early, or hallucinate with insufficient information.
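Below is a toy sketch of the default-refusal circuit described above; the weights and threshold are invented for illustration, but the wiring mirrors the description: refusal is on by default whenever the assistant is responding, reinforced by "unknown name," and inhibited by "known entity."

```python
def refusal_circuit(known_entity: float, assistant_active: float = 1.0) -> str:
    """Default refusal, reinforced by 'unknown name' and inhibited by 'known entity'."""
    unknown_name = 1.0 - known_entity            # suppressed when the entity is recognized
    cant_answer = 1.0 * assistant_active         # on by default whenever the assistant is replying
    cant_answer += 0.5 * unknown_name            # unknown names reinforce the refusal
    cant_answer -= 1.5 * known_entity            # recognizing the entity inhibits it
    return "refuse" if cant_answer > 0.5 else "answer"

print(refusal_circuit(known_entity=1.0))   # "Michael Jordan": recognized, so it answers ("basketball")
print(refusal_circuit(known_entity=0.0))   # made-up name: the default refusal wins
# The hallucination intervention amounts to clamping known_entity high on a made-up name,
# which suppresses the refusal and lets the recall pathway confabulate an answer.
print(refusal_circuit(known_entity=1.0))
```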

Conclusions and Limitations

By dissecting the internal workings of LLMs, the research makes a strong case that models are not mere pattern matchers. They can:
* Learn and use abstract concepts, generalizing across languages and tasks.
* Execute complex computations in parallel for efficiency.
* Plan ahead in order to produce coherent, structured long-form text.

Limitations and future directions
* Attention is not modeled: the current method mainly explains the MLP layers and ignores the attention layers. Attention likely plays a key role in routing information and selecting strategies, and is a focus of future work.
* Complexity of the explanations: the method is itself fairly involved, but it offers a workable way to decompose an uninterpretable system into parts that can be studied.
* Redundancy: the model contains a great deal of functional redundancy; a single operation may be implemented in several places in different forms, which makes analysis harder.