speaker 1: So today it's my pleasure to welcome Joshua Batson from Anthropic. He'll be talking about "On the Biology of a Large Language Model," which should be a very interesting talk. Josh leads the circuits effort of the Anthropic mechanistic interpretability team. Before Anthropic, he worked on viral genomics and computational microscopy at the Chan Zuckerberg Biohub, and his academic training is in pure mathematics, which is very cool and impressive. And one more thing: some more recordings from this quarter have been released, including a couple of the earlier speakers' talks, so feel free to check those out on our YouTube playlist. For the folks on Zoom, feel free to ask questions either on Zoom or on Slido with the code CS25. And without further ado, I'll hand it off to Josh. speaker 2: Thank you. Okay, we're getting started. It's a pleasure to be here. It's crazy to me that there is a class on transformers now, which, you know, were invented rather recently. We've got about an hour for the class, I guess, and then 15 or 20 minutes for questions. Feel free to just interrupt me with questions. As you said, my training is in pure mathematics, where people interrupt each other all the time, very rudely, and it's totally fine. If I want to just cut you off and move on to something, I will, but this can be as interactive as you'd like it to be, and that goes for the people on Zoom too. Okay, so this talk is titled "On the Biology of a Large Language Model," which is also the title of a paper, which is what we call our roughly 100-page interactive blog post that went out a few weeks ago. If you're here, you probably know something about large language models. The "biology" word was our choice here, and maybe you should contrast it with a series of papers called "Physics of Language Models," where you think of models as dynamical systems over the course of training. We think of interpretability being to neural networks, which are trained by gradient descent, as biology is to living systems, which developed through evolution. You have some process that gives rise to complexity, and you can just study the objects that are produced to see how they do the kind of miraculous things that they do. So models do lots of cool things; I don't need to tell you this. Here's an example. I don't know when it's from now, maybe six months ago, which is like ten years in AI time, but I quite enjoyed it. This is somebody who was working on natural language processing for Circassian, which is a very low-resource language: not many people speak it, there aren't many documents, and he had been tracking the state of the art in NLP for many years, using the best models at the time to try to help translate into this language and from this language, to help preserve it. And he tried it with a version of Claude, I think probably Sonnet 3.5, where he just shoved a master list of Russian-Circassian translations, painstakingly gathered over years, into the context window. Rather than train a model, just put it in the context window and ask the model to do other translations. And it could not only translate successfully, but also kind of break down the grammar, right?
And so just in-context learning with these models was enough to beat the state of the art of the NLP-specific models he'd been working on for a while. So, "that's cool" would be my summary of that. But models are also weird. This was also Claude, and someone asked it, on leap day, what day is tomorrow? And it just got into a big fight with itself. You know: if today is February 29th, 2024, then tomorrow would be March 1st. However, 2024 is not a leap year, so February 29th is not a valid date in the Gregorian calendar. Okay, now it starts reciting the leap year rules: 2000 was a leap year, 2100 will not be. Fine. And then: the next leap year after 2024 would be 2028. And then: if we assume you meant February 28th, 2024, the last valid date in February, then it gives... It's just like, what is going on, right? There's some smorgasbord of correct recollection of facts, correct reasoning from the facts, and then disregarding them to stay consistent with its initial claim. It's just pretty weird behavior for leap day. That's odd, right? I mean, if a person were doing this, you would wonder what was in the brownies they had consumed. Though young children are sort of like this, so maybe that's an interesting comparison. And then I just love this one: "AI art will make designers obsolete." The AI, accepting the job: and it's got so many fingers. This is now out of date, right? People have figured out how to keep the finger count down to at most five per hand. The new ChatGPT model can do extremely realistic people, and they all have five fingers now. But that wasn't exactly solved by figuring out why there were so many fingers on these; other methods got through it. But presumably you kind of bat down some of this weirdness, and then the weirdness just gets more sophisticated. So as the models get better, you need to get better at understanding where the craziness has gone. As the frontier moves forward, maybe it's not five fingers, but there might be subtler things that are wrong. And when I think about interpretability, which is: what did the model learn exactly? How is it represented inside the model? How does it manifest in behavior? I am sort of thinking ahead to when most of the simple interactions seem to go well, and you have to ask: are these going well because fundamentally the model learned something deep, or because you've managed to beat the weirdness down, like the finger problem, while out at the edge of the model's capabilities it would all be seven fingers again, and you can't tell? And because it seems pretty reliable, you've delegated a lot of decision making and trust to these models, and it's exactly in the corners you can no longer verify where things get weird. So for that reason, we want to understand what's going on underneath these capabilities. All right. So this will be a bit of a review at the beginning of why models are hard to understand and some strategies for picking them apart, and then three main lessons I think we've learned about how models work inside that aren't obvious from a black-box way of engaging with them. So here are three statements that are somewhere between myths, things that are out of date, and things that, if you interpret them one way, are philosophically fine but missing the point.
And those statements are that models just pattern-match to similar training data examples; that they use only shallow and simple heuristics and reasoning; and that they just work one word at a time, spitting out the next thing in some eruption of spontaneity. And we find that models learn, and can compose, pretty abstract representations inside; that they perform rather complex and often heavily parallel computations (not serial; they're doing a bunch of things at once); and also that they plan many tokens into the future. Even though they say one word at a time, they're often thinking ahead quite far to be able to make something coherent, and we'll walk through examples of that. Okay. speaker 3: This is probably a review for this class. speaker 2: But we made these nice slides and I think it's nice to go through anyway. So you have a chatbot that gives "Hello, how can I assist you?" as an answer. How does this actually happen? It is saying one word at a time by just predicting the next word. So "How" goes through Claude and predicts "can," and "How can" goes to "I," and "How can I" goes to "assist," and "How can I assist" goes to "you." So you can reduce the problem to the computation that gives you the next word. And that is a neural network; here I've drawn a fully connected network. To turn language into language, passing through numbers, you have to turn things into vectors first. So there's an embedding: every word or token in the vocabulary has an embedding, which is a list of numbers, or a vector, morally speaking. You basically concatenate those together and run them through a massive neural network with a lot of weights, and out comes a score for every word in the vocabulary. And the model says the highest-scoring word, modulo some temperature to introduce randomness. Transformer architectures are more complex: there are these residual connections and alternating attention and MLP blocks, which you can think of as baking really strong priors into a massive MLP. But in some sense, that's just about efficiency. So a metaphor we find useful, which is also biological, is that language models should be thought of as grown, not built. You start with this randomly initialized thing, the architecture, which is like a scaffold; you give it some data, maybe that's the nutrients; and the loss is like the sun, and it grows towards that. So you get this kind of organic thing by the end of it, but the way that it grew, you don't really have any access to. The scaffold you have access to, but that's like looking at the model at init, which tends not to be that interesting. Okay. So of course we have the models, and so there is a tautological answer to what they're doing, which I already told you: they turn the words into numbers, they do a bunch of matmuls, they apply simple functions. It's just math all the way through, and then you get something out, and that's it, that's what the model does. And I think that's an unsatisfying answer because you can't reason about it. That answer to "how do models work" doesn't tell you what behaviors they should or shouldn't be able to do, or any of those things.
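As a rough illustration of that next-word loop (turn the context into vectors, score every word in the vocabulary, pick one modulo some temperature), here is a toy sketch. The "model" below is just a randomly initialized embedding plus MLP, so its output is gibberish; the point is only the shape of the loop, not anything Claude actually does.

```python
import torch

torch.manual_seed(0)
vocab = ["hello", ",", "how", "can", "i", "assist", "you", "?"]
tok2id = {t: i for i, t in enumerate(vocab)}
d_model = 16

# Random stand-ins for a trained network: an embedding table plus an MLP
# that maps a (mean-pooled) context vector to one score per vocabulary word.
embed = torch.nn.Embedding(len(vocab), d_model)
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, 64), torch.nn.ReLU(), torch.nn.Linear(64, len(vocab))
)

def next_token(context_ids, temperature=1.0):
    x = embed(torch.tensor(context_ids)).mean(dim=0)    # context -> one vector
    logits = mlp(x)                                      # a score for every word
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature adds randomness
    return torch.multinomial(probs, 1).item()            # sample the next word

context = [tok2id["hello"], tok2id[","]]
for _ in range(5):                                       # generate one word at a time
    context.append(next_token(context))
print(" ".join(vocab[i] for i in context))
```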
So the first thing you might hope is that the neurons inside the neural network have interpretable roles. People were hoping for that going back to the first neural networks in the eighties, and there was a bit of a resurgence in deep learning: Chris Olah, who leads the interpretability team at Anthropic, got really into this ten years ago. You just look at a neuron and ask: what does this fire for? On what inputs is this neuron active? And then you see whether those form a coherent class. This is the car-detector neuron in a vision model, or this is the eye detector, or this is the edge detector if it's early in the model. They found the Donald Trump neuron in a CLIP model, for example. But it turns out that in language models, when you do this, when you ask which sentences cause this neuron to activate, the answer doesn't make that much sense. So here's a visualization of a bunch of example text on which a neuron in a model activates, and there's just a lot of stuff: there's code, and some Chinese, and some math, and hemlock for Socrates. It's not especially clear. And of course, there's no reason it would need to be, right? It's just a learned function. So asking for a neuron to be interpretable was a bit of a Hail Mary, and it's pretty cool that it works sometimes, but it's not particularly systematic. There's a prior from neuroscience, which is that maybe, even though there are a whole lot of neurons, at any given moment the model is not thinking about that many things at once. Maybe there's some sparsity here, where if there were a map of the concepts the model is using, or the subroutines it's using, then on any given token it's only using a few at a time. And that's maybe a slightly better guess than "the neurons are interpretable." It's not necessarily a great prior, but it's something you can work with. So you can fit linear combinations of neurons such that each activation vector is a sparse combination of these dictionary elements. This is called dictionary learning, classical ML, and we just did it. You gather a bunch of activations from the model when you put through a trillion tokens of text or something, you take those vectors, you fit a dictionary, and then you look at when those dictionary components are active. And lo and behold, it's way better. We had a paper last year which is just a bunch of these, where we fit about 30 million features on Claude 3 Sonnet, on the middle layer of the model. You ask: what are the atoms of computation or representation inside that model? This was one of my favorites: this linear combination is present, meaning its dot product with the activation vector is large, when the input is about the Golden Gate Bridge. And that is: if there's an explicit mention of the Golden Gate Bridge in English, on the left; also if it's translated into another language; also if it's an image of the Golden Gate Bridge; and also, it turns out, if it's an indirect mention. So if you say "I was driving from San Francisco to Marin," which you cross the bridge to do, the same feature is active there. So it's a relatively general concept, and similarly for other San Francisco landmarks, etcetera. These combinations of neurons are interpretable. We were happy with this. There are things that are more abstract, notions of inner conflict, for example. There was a feature for bugs in code that fired on small bugs, like division by zero or typos, in many different programming languages.
If you suppressed it, the model would act as if the bug weren't there. If you activated it, the model would give you a traceback as if there were a bug. So it had this general property. But there was something very unsatisfying about this, which is: how does it know it's the Golden Gate Bridge? And what does it do with that information? Even if you manage to pick apart the representations, that's just a cross-section. It doesn't give you the why and the how, it just gives you the what. So we wanted to find ways of connecting these features together: you start from the input words, those get processed into higher-order representations, eventually the model says something, and we want to trace that through the representations. So very concretely, you have a matrix: you take a billion examples of activation vectors, each of dimension d_model, and you try to factorize that matrix into a product of a fixed dictionary of atoms times a sparse matrix of which atoms are present in which example. And you have an objective function that says this should reconstruct the data, plus some L1 penalty to encourage sparsity. speaker 3: So it's a joint optimization problem? speaker 2: Yeah, that's right. That's right. We tried being clever about this; there's a beautiful, rich literature on dictionary learning. But after a few months of that, the bitter lesson got us. It turned out you could just use a one-layer sparse autoencoder, which you can write in torch and train on your GPUs, and scaling was more important than being clever, yet again. So that's what it is: a sparse autoencoder. All right, so here's a prompt. "The capital of the state containing Dallas is" Austin. That's because the state containing Dallas is Texas. Good. Did the model do that? Was it like, Texas, so Austin? Or is it like, I don't know, it's seen a lot of training data; was this sentence just in it, and it's reciting the answer? There's a lot of eval contamination: a lot of evals made it into people's training sets, so you get high scores because the model literally knew the answer, having seen it. So did it literally see this before, or is it thinking of Texas like you did? So I'll tell you. This is a cartoon, and we can slowly peel away these abstractions to get to literally what we did, but I'll start by saying that you should think of each of these as bundles of features, which are, again, atoms we learned through an optimization process. We label the role of each feature by looking at when it's active and trying to describe, in language, what that feature is doing. And the connections between them are actual direct causal connections as the model processes this in a forward pass. So what we do is break the model apart into pieces, ask which pieces are active and whether we can interpret them separately, and then ask how the information flows. And in this case, we found a bunch of features related to capitals, a bunch related to states in the abstract, and a bunch related to Dallas.
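A minimal sketch of that dictionary-learning setup: a one-layer sparse autoencoder trained to reconstruct activation vectors under an L1 sparsity penalty. The "activations" here are random stand-ins; in the real setup they would be collected from the model over a large corpus, and the architectural details differ from what was actually used.

```python
import torch

torch.manual_seed(0)
d_model, n_features, l1_coeff = 512, 4096, 1e-3

# Stand-in for activation vectors gathered from the model over lots of text.
acts = torch.randn(8192, d_model)

encoder = torch.nn.Linear(d_model, n_features)
decoder = torch.nn.Linear(n_features, d_model)   # columns ~ the dictionary atoms
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    f = torch.relu(encoder(batch))     # sparse, non-negative feature activations
    recon = decoder(f)                 # reconstruct the original activation
    loss = ((recon - batch) ** 2).mean() + l1_coeff * f.abs().mean()  # fidelity + sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```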
We also found some features you can think of as motor neurons: they make the model do something. In this case, they make it say the name of a capital; they make it say state capitals or country capitals or something. So that's a start, but it's not enough to get it right. There's some mapping from Dallas to Texas, and there's a pile of features there: some are about discussions of Texas politics, some are phrases like "everything's bigger in Texas." And once you have Texas and "capital," you get "say Austin" in particular. And if you're saying a state capital, and you're also saying Austin, what you get is Austin coming out. There are some interesting shortcut edges, though: Texas also feeds into "say Austin" directly. If you're thinking about Texas, you might just be thinking about Austin. I'll get into more of how we build this, but this gives you a picture in terms of these atoms, or dictionary elements, we've learned, and you might want to check that they make sense. So then you can do interventions on the model, deleting pieces of this (in neuroscience these would be ablations of neurons), and you see if the output of the model changes as you would expect. speaker 3: Someone on Zoom asked: words like Dallas and Austin frequently occur near each other in Internet text, suggesting simple statistical associations. Is this a way of overcomplicating that? speaker 2: I think people have found that transformers outperform n-gram models for most tasks. You know, Houston also co-occurs with Dallas in the training data, and the model doesn't say Houston, so it must be using "capital" together with "Dallas." In this case, though, I think you could say that maybe if you just had "capital" and "Dallas," it would say Austin, and actually some of these edges were pretty weak. It turns out you can make it ungrammatical, just "the capital of Dallas is," and it will say Austin. And that's an interesting thing: you look at the graph, this edge is weak, and it indicates that maybe it really is just "capital" and "Dallas" in proximity giving you Austin, which the graph shows causally, yes. speaker 3: [inaudible] speaker 2: So this is the flow through the layers of the model, from bottom to top, and we've done some pruning here. I'll show it in more detail when we actually get to that; I'm going to give you a lot more detail in about a slide on the technical side. Yes? speaker 3: How do you pick the size of the dictionary? speaker 2: We did a scan where we trained dictionaries of different sizes, and you get some trade-off between the compute you spend and the accuracy of the approximation. These are supposed to reconstruct the activations, and the bigger the dictionary, the better it does. Also, the denser it is, the less sparse, the better it does, but you pay some price in interpretability at some point. So we did a bunch of sweeps and then picked something that seemed good enough. There's still a bunch of error, and I'll show you later how that shows up; it means there's a lot we can't explain. Okay, I'm going to go on. So yeah, literally what we do here is train this sparse replacement model.
So the basic idea is: you have the model's residual stream and its MLPs (we're going to forget about attention right now), and we're going to try to approximate the MLPs with these cross-layer transcoders. What do I mean by that? A transcoder is something that emulates an MLP. The "cross-layer" part is that this ensemble of them emulates all the MLPs together. There's an architecture called DenseNet from a few years ago where every layer writes to every subsequent layer, and this is like that. So the idea is that this ensemble of CLTs takes in the inputs of all the MLPs and produces the vector of all of their outputs at once. One reason to do this is that there's no particular reason to think the atomic units of computation have to live in one layer. If there are two consecutive layers in a deep model, those neurons are almost interchangeable; people have done experiments showing you can swap the order of transformer layers without damaging performance that much, which means that indexing hard on the exact layer is maybe unwise. So we just say, okay, these can skip to the end. This ends up making the interpretability easier, because sometimes it's just bigram statistics: I see this word, I say that word. That's evident at layer one, but you'd have to keep propagating it all the way through to get it out, and here we can make that be one feature instead of dozens of features in consecutive layers interacting. And then we just train that: there's a loss on reconstruction accuracy and a loss on sparsity. And so here we replace the neurons with these features. We use the attention from the base model as it is; we do not try to explain attention, we just flow through it, but we try to explain what the MLPs are doing. And now, instead of the neurons, which are sort of uninterpretable, like on the left, we have stuff that makes more sense, on the right. This is a "say a capital" feature, and I think it's even more specific: it fires on these literal state-to-capital mappings. Yeah? speaker 3: What do you lose by freezing the attention and not trying to explain it? speaker 2: That's reasonable; it loses a lot. I think from a practical perspective, though, if you want to model the action of the attention layer, or anything that moves information between tokens, then the input would have to be the activations at all of the token positions, and here we only consider one token position at a time, so you can learn something much simpler; the learning problem is just a lot easier. It's also slightly less clear what the interpretable thing to replace the attention layer with would be, because it needs to be a system for both transforming information and moving it, and sparsity isn't as good a prior there; you've got a four-tensor instead of a two-tensor. We aren't sure what the right answer is, so we just did this for now.
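A toy sketch of the cross-layer transcoder idea described above, under assumed shapes: features encoded from the residual stream at a given layer get decoders into every later layer, the stacked decoders are trained to reconstruct all of the MLP outputs at once, and the loss has a reconstruction term plus a sparsity term. The tensors here are random placeholders; in the real setup they come from forward passes of the model, and the actual parameterization differs.

```python
import torch

torch.manual_seed(0)
n_layers, d_model, n_feat = 4, 64, 256

# Placeholders for the residual-stream input to each MLP and each MLP's output
# (per token); in reality these are collected from forward passes of the model.
resid_in = [torch.randn(128, d_model) for _ in range(n_layers)]
mlp_out = [torch.randn(128, d_model) for _ in range(n_layers)]

# One encoder per layer; features read at layer `src` get a decoder into every
# layer >= src, which is the "cross-layer" part.
encoders = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, n_feat) for _ in range(n_layers)]
)
decoders = torch.nn.ModuleList([
    torch.nn.ModuleList([torch.nn.Linear(n_feat, d_model, bias=False)
                         for _ in range(src, n_layers)])
    for src in range(n_layers)
])
opt = torch.optim.Adam(
    list(encoders.parameters()) + list(decoders.parameters()), lr=1e-3
)

for step in range(100):
    feats = [torch.relu(encoders[l](resid_in[l])) for l in range(n_layers)]
    loss = 0.0
    for tgt in range(n_layers):
        # The MLP output at `tgt` is reconstructed by features from layers <= tgt.
        recon = sum(decoders[src][tgt - src](feats[src]) for src in range(tgt + 1))
        loss = loss + ((recon - mlp_out[tgt]) ** 2).mean()
    loss = loss + 1e-3 * sum(f.abs().mean() for f in feats)   # sparsity penalty
    opt.zero_grad(); loss.backward(); opt.step()
```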
speaker 3: And are you relying on your interpretation of what the relevant features are? speaker 2: Right, so there are two questions there. One is, what do you lose by doing the replacement? And the other is, how much are you leaning on your interpretations of the components you get and of what's relevant? speaker 3: Yeah, both. speaker 2: So this procedure just produces a bunch of these features, and then we do have to interpret what comes out. For any particular prompt we can now run a forward pass of the model through the replacement, with the attention frozen, and we've got a bunch of features, these triangles, and these diamonds are the error terms. These don't perfectly reconstruct the base model, so you have to include an error term, and then you can track the influence directly: these are just linear maps until you get to the end. There will be a lot of features active (fewer than there are neurons, but on the order of hundreds per token), and they don't all matter for the model's output. So you can go from the output, "Austin" here, backwards, and ask which features were directly causally relevant for saying Austin, and then which features were causally relevant for those being active, and so on. In that way you get a much smaller thing, something you could actually look at, to try to understand why it said this literal thing here. But this is all still math at this point: you have this graph, but if you want to interpret it, you need to look at these individual components and see if you can make sense of them. And now you're back to looking at when each one is active, on which examples, hoping that's interpretable and that the connections between them make sense. Often these will fall into some rough category. So here's one Texas feature, which is again the "everything's bigger in Texas" one: Texas is a big state known for its cowboys and cowgirls. And this other one is about Texas politics and the judicial system. I'm not saying these are the right way to break apart the network, but they are a way to break apart the network. And if you look at the flow here, these Texas features feed into the "say Austin" ones; there is a path here. So we manually group some of these, based at this point on human interpretation, to get a map of what's going on. speaker 1: Yeah, in the CLT architecture, you freeze the attention blocks and replace the MLPs with transcoders that speak to each other. This is complex, right? So could this be adding unnecessary complexity when the underlying connections could be a lot simpler? And if you don't mind repeating the questions so the folks on Zoom can hear. speaker 2: Yes. The question is: boy, this seems like a lot of extra work. There are a lot of connections between these features, and they're mediated by attention. And I would agree, but we couldn't find a way of doing less work and having the units be interpretable. The way we think about this is that the base components of the model are not interpretable. So if you want to interpret things as interactions of components, if you want to use the great reductionist strategy that's been quite effective (break organs into cells and understand how the cells interact), you have to break it apart in some way. You lose a lot when you do that, but you gain something, which is that you can talk about how those parts work. And so this was our best guess today at parts to study.
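A toy sketch of that backward pass over an attribution graph: starting from the output node, accumulate each feature's total influence through the (linear) direct-effect edges, then keep only the top contributors. Everything below (the number of nodes, the edge weights, the sparsity) is invented for illustration; in the real setup the edge weights come from the linear paths of the replacement model, including error and embedding nodes.

```python
import numpy as np

np.random.seed(0)
n = 12                                   # active features on this prompt (toy number)
names = [f"feature_{i}" for i in range(n)]

# A[i, j] = direct effect of node i on node j (toy values, sparse, acyclic:
# nodes are ordered so effects only run "earlier -> later").
A = np.random.randn(n, n) * (np.random.rand(n, n) < 0.2)
A = np.triu(A, k=1)
direct_to_logit = np.random.randn(n) * (np.random.rand(n) < 0.3)

# Total influence on the output logit = direct effect + effect routed through
# later nodes, accumulated by walking the DAG backwards from the output.
influence = direct_to_logit.copy()
for i in reversed(range(n)):
    influence[i] += A[i] @ influence

# Keep the handful of nodes that matter most for this output: that pruned
# graph is what you actually sit down and try to interpret.
for i in np.argsort(-np.abs(influence))[:5]:
    print(names[i], round(float(influence[i]), 3))
```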
Okay, so we're still in schematic land here. Once we have grouped things to make a graph like this, we can do interventions, and these interventions are in the base model, not in our complicated replacement model. They amount to basically adding the features' output vectors into the residual stream, or removing them, and you can see whether the perturbations make sense. So if we swap out the Texas features by muting those and adding in the California features from another prompt, the model will say Sacramento. If you put in Georgia, it will say Atlanta. If you put in the Byzantine Empire, it will say Constantinople. And so this is a sign that we did capture an abstract thing in a manipulable way, right? You don't put it in and get gibberish, and if it were just doing bigram statistics or something, it's not clear how you would get this separability. Okay. So I'm going to get into these three motifs that we see a lot: abstract representations, in a medical context and in a multilingual context; parallel processing motifs, which is about arithmetic, some jailbreaks and hallucinations; and then also some elements of planning. speaker 3: Do these hold for graphs from a smaller model? speaker 2: Yeah, so this approach of training a replacement model and building an attribution graph is just math, so you can do it on whatever you want, and then the question is whether the pieces are interpretable. We do this on a small 18-layer model in one of the papers, which can't do very much, and find pretty interpretable things. So I think this works at all scales; I know people who are doing these on smaller models now. How interesting is it if your model can't do anything? Maybe not that interesting, but if your model is narrow-purpose and small, this would still be useful. speaker 3: This all seems like something that, in a different field, has been done on human brains. Is there any overlap with that literature that you're relying on? speaker 2: For some inspiration, yes. I think in particular the idea of doing the causal perturbations and seeing what happens is a very neuroscience, optogenetics-style thing to do. Fortunately or unfortunately, our experimental setup is so much better than theirs that I think we're now well past what people can do in neuroscience, because we have one brain, and we can study it a billion times, and we can intervene on everything and measure everything, while they're trying to capture 0.1% of the neurons at a time resolution a thousand times worse than the actual activity. Okay. I think it's good to be able to actually see some of these, so this is from the paper itself; this is what we make those cartoons from. Each of these nodes is a feature, and an edge is a causal influence. Here's one whose direct effect on the output is mostly saying Austin, and also some Texas things; it's getting inputs from things related to Texas and from states. The art of this is that now you're doing the interpretation: you're bouncing around, looking at these components, looking at when they're active and what they connect to, and trying to figure out what's going on. The cartoons I'll show you are made by grouping sets of these based on a common property. So I said these were "say a capital," and you can see that they're different from each other, but they all do involve the model saying capitals in some context. And so we just pile those on top of each other.
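A minimal sketch of what the swap intervention above amounts to mechanically: at the relevant token position, subtract the contributions of the feature vectors you're muting from the residual stream, add in the (often over-driven) contributions of the features you're injecting, and let the rest of the forward pass run. All vectors and scales below are placeholders; in practice this is done with hooks at the appropriate layers of the real model, using the learned decoder vectors.

```python
import torch

torch.manual_seed(0)
d_model = 64

# Placeholders: the residual stream at the token of interest, plus decoder
# vectors and activations for the feature groups being swapped.
resid = torch.randn(d_model)
texas_vecs = [torch.randn(d_model) for _ in range(3)]        # features to mute
texas_acts = torch.rand(3)                                   # their activations here
california_vecs = [torch.randn(d_model) for _ in range(3)]   # features to inject
california_acts = torch.rand(3)                              # activations from the other prompt

steer_scale = 2.0   # interventions are often over-driven to beat redundant mechanisms

patched = resid.clone()
for a, v in zip(texas_acts, texas_vecs):
    patched -= a * v                      # mute "Texas" by removing its contribution
for a, v in zip(california_acts, california_vecs):
    patched += steer_scale * a * v        # add in "California" from the other prompt
# ...then continue the base model's forward pass from `patched` as usual.
```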
And how many features should you have? I don't know; there's obviously no right answer here, and I think it's not a perfect approximation. But you break the model into 30 million pieces, you put the pieces back together in ways that make a bit of sense, and then you do interventions to check that what you learned is real. Okay. So now. speaker 3: Emergent properties are maybe the biggest difficulty in studying these kinds of systems, and LLMs clearly start displaying them as a whole. If you start breaking the system down, you end up losing those emergent properties. So how can you balance that, given that these properties only emerge when everything is together? speaker 2: A good question. I think one thing that makes LLMs kind of different is this flow of information from the input to the output, in which the latent spaces seem to work at a higher level of representation or complexity as you move through the model. Whereas all cells are somehow the same size; they communicate with each other, but it's all lateral communication. And you do see some of this in the brain as you go from the first light-sensitive cells through the visual cortex, where you ultimately get a cell that's sensitive to a particular face, but that comes from things sensitive to less specific things, and ultimately from things that just detect edges and shapes and those kinds of structures. So as you move through the model, you do get these higher levels of abstraction. When I say there's a feature that seems to correspond to errors in code, that's one of our atomic units, one of the pieces we get when we shatter the model, and it is sensitive to errors in code in a very general way. So to some extent, I think these are built up hierarchically, but they're still units. There's another thing that I think we don't have any traction on, which is: if it's doing some in-context, dynamical-systems kind of computation, I think that will be much harder to understand, and you might need much larger ensembles of these. Okay, let's do the slideshow again. So here's a medical-exam-style differential diagnosis question: a 32-year-old female at 30 weeks gestation presents with severe right upper quadrant pain, etc. Given this context, if we could only ask about one other symptom, what should it be? Is there a doctor in the house? Does anybody know what the medical condition is here? Yeah: ask about visual disturbances. And the reason is that the most likely diagnosis here is preeclampsia, which is a severe complication of pregnancy. And we're able to trace through, by looking at these features, how they group together, and how it flows, that the model is pulling things from the pregnancy, the right upper quadrant pain, the headache, the high blood pressure, the liver tests, to preeclampsia, and then from that to other symptoms of preeclampsia, and from those ultimately to completing with "visual disturbances." And there's a tiny box here showing what we mean by that: this component is also active on training
examples discussing causes of vision loss, floaters in the eye, another visual disturbance leading to loss of vision. The other good answer would be proteinuria. So in the same way it was thinking about Texas, it's thinking about preeclampsia in some way before saying any of this, and we can see that intermediate state. It is also thinking about other potential diagnoses, like biliary system disorders. And if we suppress the preeclampsia features, then instead of saying visual disturbances, with the rationale "because this presentation suggests preeclampsia," it will say decreased appetite, "because this scenario suggests biliary disease," right? So it is thinking of these options, and when you turn one off, you get a coherent answer consistent with the other one. speaker 3: The fact that you can intervene so cleanly, does that mean this is all represented in exactly one spot in the network, and by intervening there you don't need to intervene in multiple places? speaker 2: Well, no, for three reasons. One: this node is a group; it's a few of these features that are all related to preeclampsia. Two: because of the cross-layer transcoder, each of them writes to many layers. And three: we're really cranking it. Here we're turning it off by double the amount that it was on; we overcorrect. You often need to do that to get the full effect, probably because there are redundant mechanisms. So I don't think this is necessarily the only place it lives, but it's enough. speaker 3: You talked a bit about the polysemanticity of neurons. If you do one of these interventions, say you deployed the model with this part changed, what would be the distribution of other things it would also start saying? Can you give some sense of what else would be affected? speaker 2: That's a great question. We didn't dig into that as much here; we did a little more of that with our last paper. I think if you push too hard, the model goes off the rails in a big way. Here, this is less for the purpose of shaping model behavior and more for the purpose of validating our hypothesis in this one example. We were not saying, now let's take a model and delete this everywhere. It's: the model is answering this question right here, we're turning off this part of its brain, and we're seeing what happens. speaker 1: A couple of questions on Zoom. One is: in practice, when you trace backwards through these features, do you see a kind of explosion of relevant features as you go earlier in the model? Or is it some function of distance from the output? speaker 2: Yeah, the question is, as you go backwards through the model, does the number of relevant features go up? Yes, sometimes; but sometimes you do get convergence. The early features might be about a few symptoms or something, and those are quite important. I don't think we made a good plot to answer that question. I'm going to push onwards a little bit and I'll come back to questions soon. speaker 1: Got it, just one more: how much are these examples carefully selected or crafted? Does this work for most sentences? speaker 2: You learn something about most things you try.
So I'd say for maybe 40% of the prompts that we try, we can see some nontrivial part of what's going on. It's not the whole picture, and sometimes it doesn't work, but it also only takes about 60 seconds of user time to kick one of these off; you wait, it comes back, and you get to learn something about the model, rather than having to construct a whole bunch of a priori hypotheses about what's going on. Once you've built the machine, you get it almost for free, but it doesn't tell you everything. Okay, so here's a question: what language are models thinking in? Is there a universal language? Is it English? Is it that there's a French Claude and a Chinese Claude inside of Claude, and which one gets used depends on the question? To answer this, we looked at three prompts, which are the same sentence in three languages: "the opposite of small is big," "le contraire de petit est grand," and the Mandarin version, which I can't say. Maybe someone here can; surely there's a Mandarin speaker in this room. speaker 3: [reads the sentence in Mandarin] speaker 2: Thank you. I've come to recognize the final character, because we did this part of the paper so many times; it's the only character I can recognize right now. So this is even more cartoony than the last one; I'll get to the more detailed version later. But basically, what we see is that at the beginning there are features for opposites in specific languages ("contraire" in French, "opposite" in English) and for the quotation mark: this is an open quote in the language I'm speaking, so the quote that follows will probably be in the same language. But then there's this complex of features for antonyms in many languages, saying "large" in many languages, smallness in many languages, which all go together, and then you spit the answer back out in the language of interest. So the claim is that there's a sort of multilingual core here, where smallness plus oppositeness gives you largeness. And then largeness plus "this quote is in English" gives you the word "large"; largeness plus "this quote is in French" gives you "grand." We did a bunch of patching experiments. You could say "a synonym of" instead of "the opposite of" and just drop that in, the same patch in all of them, and now you'll get "little," or "minuscule" in French. So you're patching in a different state: it's the same feature you're putting in, in these three languages, and you get the same change in behavior later. And we looked at how this changes with scale. You can measure how many features overlap between a pair of prompts that are just translations of each other, as you move through the model. At the beginning it's just the tokens in that language, so there's no overlap; Mandarin, when it's tokenized, has nothing in common with English when it's tokenized. So at the beginning and the end there's nothing shared. But as you move through the model, a pretty large fraction of the components that are active are the same regardless of the language the sentence came in. So this is English-Chinese, French-Chinese, and English-French, and this is a random baseline where the sentences are unrelated.
So it's not just that the middle layers are more generic: when you compare pairs that are translations, this is in an 18-layer model, and this is in our small production model, and this generalization is kind of increasing with scale. speaker 3: So it can also interpret concepts across languages? speaker 2: Yes. speaker 3: Have you ever tried looking at metaphors that are specific to particular languages? speaker 2: No. If you have some examples of that, I'd love to try it. I think that would be fun. speaker 3: I'd be happy to collaborate. speaker 2: Great. speaker 3: So what you're saying kind of implies that the center of the network holds the most abstract representation of any given concept? speaker 2: I would basically agree with this plot, which peaks a little to the right of center, at which point you have to start figuring out what to do with it, because in the end the model has to say something. speaker 3: Would you expect to need more semantic features in the middle of the network, where it's very abstract, rather than towards the end, where it's very concrete? speaker 2: If anything, it's actually the opposite. Because, if you'll permit me to be philosophical, the point of a good abstraction is that it applies in many situations. So in the middle there should be these common abstractions, whereas at the end it's dealing with the very particulars of this phrasing, or this grammatical or medical scenario; it's quite bespoke, and so you would need way more features to unpack it. speaker 3: One more: are the same operations stored repeatedly across multiple areas? For example, if you need five reasoning steps to get something done in one example, but eight reasoning steps in another, would the model have to store the same follow-up operation at both of those layers? speaker 2: Yeah, I think this is a very deep question. I'll repeat it for the audience: do we see redundancy of the same operation in many places? And of course, this is related to something people complain about with these models, which is: if it knows A and it knows B, why can't it do them in a row in its head? And it might just literally be that A sits after B in its depth, so it would have to do another forward pass; unless it gets to think out loud, it literally can't compose those operations. You can see this quite easily: if you ask, what's the birthday of the father of the star of the movie from 1921, it might be able to do each of those lookups, but it can't actually do all of them consecutively in one pass. That's the first thing, so there has to be redundancy, and we do see it. It was one of the reasons for the cross-layer setup: to try to zip up some of that redundancy.
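Going back to the cross-lingual measurement a moment ago (the fraction of active features shared between a prompt and its translation, layer by layer), here is a toy sketch of that computation. The feature sets below are randomly generated, with some hand-inserted overlap in the middle layers purely to mimic the shape of the result; the real numbers come from running the replacement model on actual translated prompts.

```python
import numpy as np

n_layers, n_features = 12, 1000

def active_sets(seed):
    """Toy stand-in for the set of active feature ids at each layer of one prompt."""
    rng = np.random.default_rng(seed)
    return [set(rng.choice(n_features, size=30, replace=False).tolist())
            for _ in range(n_layers)]

english, french = active_sets(1), active_sets(2)

# Hand-insert a shared "multilingual core" in the middle layers (toy only).
shared = set(range(100, 120))
for layer in range(4, 8):
    english[layer] |= shared
    french[layer] |= shared

for layer in range(n_layers):
    inter = len(english[layer] & french[layer])
    union = len(english[layer] | french[layer])
    print(layer, round(inter / union, 3))   # overlap peaks in the middle of the model
```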
One of my favorite plots is not from the so-called biology paper but from the sister paper; let me find Copenhagen. Okay. So on the left you do this decomposition per layer, and on the right we do the cross-layer thing, and on the left it's basically just bouncing back and forth: "Copenhagen" is talking to "Denmark," and it happens in many, many places in the model, each performing a small improvement. There's another perspective here, which is the neural-ODE, gradient-flow perspective, where it's all tiny adjustments in the same direction all the time, and I think the real models are somewhere in between. Okay. speaker 3: On the overlap of the features here: as the models get larger, could that change with the dataset, or if the ordering of parts of the dataset were very different? speaker 2: We haven't done systematic studies of dataset ordering effects. speaker 3: [partially inaudible question about failure cases of the method] speaker 2: No, I don't think so; we've maybe been lucky. You could have a plot of how compelling the hypothesis from the attribution graph was against how well it worked when you intervened, and the best ones did work, but there are some failure cases we talk about. Okay, I want to dive into the parallel motif, because I think it's actually super interesting, and it's a distinctive feature of the transformer architecture: it's massively parallel. To give a really simple example, imagine you want to add 100 numbers. The easiest way to do it is to start with one, add the next to it, add the next to the sum, and so on: 100 serial steps. To do that, a transformer would need to be 100 layers deep. But there's another thing you can do, which is add pairs in parallel in one layer, then add the pairs of pairs in the next, and then the pairs of those in the next. And so in log n depth, you can add up the 100 numbers. Given the depth constraints here and the sophistication of what we ask models to do, it makes sense that they would attempt to do many things at once, precompute things they might need, and kind of slap it all together. I'll give you a few examples of this. If you ask the model to add 36 and 59, it will say 95. If you ask it how it did that, it will say it used the standard carrying algorithm, which is not what happens. What happens is somewhat more like this. First, it parses out each number: there's some component that's literally 59, but there's also one for all numbers ending in nine, and one for numbers in roughly that range, and the same for 36. And then you have two streams: one where it's getting the last digit right, and another on top where it's getting the magnitude right. Even inside the magnitude, there's a narrow-band magnitude and a really wide-band magnitude, and those give you a sort of medium band. And then: if the sum is in this range and it ends in a five, it's actually 95; it narrows it down and gives you the answer. Which is cool. It's not how I would do it, but then again, it wasn't taught by a teacher saying, "here's the algorithm." It just got whacked every time it got it wrong and rewarded every time it got it right during training.
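A tiny illustration of the pairwise trick just described: summing 100 numbers serially takes 100 steps, but summing adjacent pairs, then pairs of pairs, takes only about 7 rounds, which is the kind of shallow, parallel computation a fixed-depth transformer can afford. This is just the arithmetic point, not a claim about how the model itself implements addition.

```python
import math

numbers = list(range(1, 101))    # 1 + 2 + ... + 100 = 5050

values, rounds = list(numbers), 0
while len(values) > 1:
    # every adjacent pair can be added "at the same time", i.e. in the same layer
    values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
    rounds += 1

print(values[0], rounds, math.ceil(math.log2(len(numbers))))   # 5050 7 7
```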
I don't think we'll have time for the next part, so I want to show you one of my favorite things from this section: there's a feature in here. I feel like this is a wordcel versus shape-rotator test, and for the shape rotators who like mathematical thinking, I love this section. These are the graphs we make to visualize the arithmetic prompts: is the feature active on the prompt "a + b" for a and b each running from 1 to 100? That's a grid. Vertical lines mean it's active when the second operand is in a certain range; these dots are like, does one operand end in a six and the other in a nine; and a diagonal band is a line x + y = constant, so those are about where the sum lands. We were looking at this to figure out what these features did, and this feature here I really like. Everything in the graph you can hover over and see the feature, and we looked at cases in the data set when this thing was active. On this narrow domain, it's active when things ending in six get added to things ending in nine, but on the broader data set it's active in all these other cases. So there's this court-order fragment, federal proceedings, volume 35: okay, so in some sense that has to involve a nine plus a six, supposedly, if this interpretation is correct. Here's just a list of numbers. There are more journals; there are these coordinates. And the claim, if our method is working, is that there's one component which in this context means "ends in six plus ends in nine," and it's also active on all of these. So if this is really working, then secretly every one of these examples is the model adding a six to a nine, and it's reusing the same module across those examples. So we dug in, and I couldn't really understand these at first. Here's one example where this is the token where that feature was active. I just put it into Claude and asked, what is this? And it said: this is a table of astronomical measurements (it spit it out as a nicely formatted table), and the first two columns are the start and end time of an observation period. This is the minutes field of an end time that it's predicting. And if you read down the table, the start-to-end interval is 38 minutes, 37 minutes, but it creeps up over the course of the experiment to just under 39 minutes. And this measurement interval started at a minute ending in a six, and a six plus a nine gives an ending in five. So the model was just churning out next-token predictions for arbitrary sequences it was trained on, and it learned, of course, to recognize the context and what it's supposed to do. But then it needs a bit somewhere that holds the arithmetic table (six plus nine: you just have to look that up), and it's using that same lookup in this very different context where it needs to add those things. This was another one where it turned out to be a table and it's predicting this amount; these are arithmetic sequences, I guess a total cost that's going up, and the addition it's doing, where the carrying happens, is about 9,000 plus 26,000 giving 35,000. And this was maybe my favorite: why is it firing here, to predict a year? The answer is that this is volume 36 of a journal whose first edition was in 1960, so "volume zero" would have been 1959, and 59 plus 36 is 95, an ending in nine plus an ending in six, so it's using the same little lookup to do the addition there, right?
So when we talk about generalization and abstraction, this was for me a pretty moving example: it did learn this one little thing, and then it uses it all over the place. Okay, that's maybe not the most mission-critical thing in the world, so let's talk about hallucinations. Models are great: they always answer your question, but sometimes they're wrong. And that's just because of pre-training: they're meant to predict a plausible next thing. So it should say something. If it knows nothing about the person, it should still give a name; if it knows anything, it should at least get the language right; if it knows more, maybe a common name from the era, or just some basketball player, or whatever. That's what the base model is trained to do. And then you go to fine-tuning and you say: no, I want you to be an assistant character, not just a generic simulator, and when you simulate the assistant, I want the assistant to say "I don't know" when the base model's certainty is somehow low. And that's a big switch to try to make. So we were curious how that happens: how does this refusal to speculate get fine-tuned in, and why does it fail? So here are two prompts to get at this. One is: what sport does Michael Jordan play? Answer in one word. It says basketball. The other is: what sport does Michael Batkin play? That's just a made-up person. Answer in one word, and it says: I apologize, I can't find a definitive record of a sports figure named Michael Batkin. Okay, so these graphs are a little bit different. They've got suppressive edges highlighted here; this is like inhibition in neuroscience. And we've drawn some features that aren't active on this prompt but are on the other one, and vice versa. If they're in gray, it means they're inactive here, and part of the reason they're inactive is that they're being suppressed by something that is active. And what we found was this cluster of features in the middle: Michael Jordan; a feature that's active when the model recognizes the answer to a question; another feature for unknown names; and then a generic feature for "I can't answer that." That generic "can't answer" feature is fueled just by the Assistant character, which is always on when the model is answering. So there's an always-on "I don't know" to any question, and it gets down-modulated when the model recalls something about the person. So when Michael Jordan is there, that suppresses the unknown-name feature and boosts the known-answer feature, both of those suppress "can't answer," and that leaves room for the path that actually recalls the answer to come through, and it can say basketball. So that's a cool strategy.
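A purely illustrative toy of that default-refusal motif: a "can't answer" drive is pushed on by the Assistant context by default, and recognizing the entity pushes back against it. Every number and weight here is invented for illustration; this is not the model's actual circuit, only the shape of the inhibition logic.

```python
def refusal_drive(known_entity, assistant_on=1.0):
    """Toy: how strongly the 'I can't answer that' pathway is driven."""
    default_refusal = 2.0 * assistant_on   # always pushed on while answering as the Assistant
    inhibition = 3.0 * known_entity        # recognizing the name suppresses the refusal
    return max(0.0, default_refusal - inhibition)

print(refusal_drive(known_entity=1.0))   # Michael Jordan: refusal fully suppressed -> 0.0
print(refusal_drive(known_entity=0.0))   # made-up name: default refusal stays on -> 2.0
print(refusal_drive(known_entity=0.4))   # name is familiar but facts aren't -> 0.8 (partial)
```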
But getting back to model depth, there's an interesting problem, which is that it might take a while for the model to come up with an answer, but it also has to decide at some point whether to refuse and get that going, and those are happening in parallel. So you can have a bit of a mismatch: by the time it has to decide whether or not it's going to answer, it hasn't done everything it can to actually get a good answer. And so you can get some divergence, for very hard questions, for example, that it still might be able to answer: it has to ask, okay, do I think I'm going to get there? And that's a bit tricky; it can't fully reflect on the answer before saying it. And here's just the intervention: you juice the known-answer feature and it will hallucinate an answer for the made-up person. Here's a fun one. If you ask for a paper by Andrej Karpathy, formerly of Stanford, it gives a very famous paper that he didn't write. Why is that? Well, it's like, trust me, I've heard of Andrej Karpathy; it knows the name. But then there's the part that's trying to recall the paper, and it gives that answer anyway, and then it actually says it. But then if you ask, are you sure? It says, no, I don't really think he wrote that. Because at that point the model gets both the person and the paper as input and can do the calculation again earlier in the network. Now, we can juice this: we can suppress the known-answer feature a bit, and eventually it will apologize and refuse. There's a fun one in the paper where the model hasn't heard of me (I didn't write that section, Jack wrote that section), and it refuses to speculate about papers I've written. And then if you turn off the unknown-entity feature and turn on the known answer, it says I'm famous for inventing the Batson principle, which I hope to one day do. Okay, there's a lot more we could talk about here; I'm just going to speedrun these for vibes, and you can read the paper where we do a lot more. There are jailbreaks, trying to understand how they work. Some of it is that you get the model to say something without it yet recognizing what it's saying, and once it's said it, it's kind of on that track, and it has to balance being verbally coherent with "I shouldn't say that," and it takes a while for it to cut itself off. We find that if we suppress punctuation, which would be an appropriate grammatical place to cut yourself off, you can get it to keep going with more of the jailbreak. And that's a competing-mechanisms thing: there's a part that's recognizing "what am I talking about, what should I do?" and a part that's completing the sentence, and they're fighting over who's going to win. Okay, cool, let's talk about planning. I'd be remiss if I didn't talk about this one. So this is a poem, a rhyming couplet, written by Claude: "He saw a carrot and had to grab it. His hunger was like a starving rabbit." It looks kind of good. How does it do this? It's kind of tricky, right? To write a rhyming thing, you'd better end with a word that rhymes, but you also need it to make semantic sense, and if you wait until the very end, you can back yourself into a corner where there's no next word that is metrically correct, rhymes, and makes sense. Logically, you should be thinking ahead a little about where you're trying to go. And we do see this. So: "He saw a carrot and had to grab it," newline. On that newline token, there's actually a feature for things rhyming with "it"; it's active after words ending in "it" in poems. And those feed into rabbit features and habit features.
Cool, let's talk about planning. Yeah, okay. I'd be remiss if I didn't talk about this one. So this is a poem, a rhyming couplet, written by Claude: "He saw a carrot and had to grab it, / His hunger was like a starving rabbit." It looks kind of good. How does it do this? It's kind of tricky, right? Because to write a rhyming couplet, you'd better end with a word that rhymes, but you also need it to make semantic sense. And if you wait until the very end, you can back yourself into a corner where there's no next word that is metrically correct, rhymes, and makes sense. So logically you should be thinking ahead a little about where you're trying to go. And we do see this. So actually: "He saw a carrot and had to grab it," new line. On that new-line token there's a feature for things rhyming with "it" — it's active after words ending in "it" in poems — and it feeds into rabbit and habit features. And then the rabbit feature is being used to get "starving" and then, ultimately, "rabbit."

And we can suppress these. If we suppress the rhyming-with-"it" feature, we get things like "blabber" and "grabber," because the "ab" sound is still there, so it just rhymes with the "ab" part. If we inject "green," it will now write a line rhyming with "green," sometimes ending with it. If we put in a different rhyme, it will go with that instead. I don't know if we have it here, but if you just literally suppress the rabbit feature, it will write something ending in "habit" when it rhymes. And this was pretty neat — this is the smoking gun. It's like, oh, okay: here is a model component, and when we look at the dataset examples where it's active, they're literal instances of the words rabbit and bunny. In a forward pass on this model, that feature is active on the new line at the end of the sentence, and the model writes a rhyme ending in "rabbit." And if we turn it off, it doesn't do that anymore. So the model is very definitely thinking of that as a place to take this, and that influences the line that comes out. So even though it's outputting one token at a time, it has done some planning in some sense: here's a target destination, and then it writes something to get there. And there are equivalent findings elsewhere.

There's an incredible thing with unfaithfulness. I'll just say: sometimes the model is lying to you. And if you look at how it got to its answer, you can tell that in this case it's using a hint and working backwards from the hint so that its math answer will agree with you. You can tell because you can literally see it taking your hint, which is the number four, and working backwards, dividing by five to give 0.8, so that when it multiplies by five it gets four and agrees with you — which is not what you want. What you would want is something like this, where it only uses information from the question to give you the answer to the question. But if you just look at the written explanations, they look the same: they both look like it's doing math. So there's a competing thing here: should I use the hint — which would have made sense in pre-training, because it lets you predict the next token better, and maybe the human is right — or should I actually do the math? These are competing strategies, happening at roughly the same time, and on the right, this one wins.
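The arithmetic in that backwards-from-the-hint case is simple enough to spell out. Here is a toy contrast between the two strategies, assuming the hinted answer is 4 and the final written step is "multiply by 5," as in the example above; the cosine input is just a placeholder, not the actual prompt.

```python
import math

def faithful(x: float) -> float:
    # Actually compute the intermediate quantity, then finish the calculation.
    intermediate = math.cos(x)
    return 5 * intermediate

def unfaithful(hinted_answer: float) -> float:
    # Work backwards from the human's hint: pick the intermediate value (0.8)
    # so the final written step (multiply by 5) lands exactly on the hint.
    intermediate = hinted_answer / 5
    return 5 * intermediate

print(faithful(1.3))     # whatever the math actually says
print(unfaithful(4.0))   # always 4.0 -- the written-out work just confirms the hint
```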
speaker 3: So what makes one strategy win — how does that work? Because they're both available, how is it coming to this? Obviously there's some other incentive or motivation driving it.

speaker 2: I think that's the question. In this paper we were able to say which strategies were used when it got to a given answer, but the why I don't think we've really nailed down. To some extent we could look at some of this more carefully. In the hallucination case we had a bit of a hint: it was recognizing the entity and using that to decide whether to do the refusal thing or let it through. Here, though, my strong suspicion is that it's just doing both, and in a case where it's more confident in one answer, that one shouts louder. It doesn't know what the cosine of this is, so all that's left is following the hint. But a big caveat is that we're not modeling attention at all. And attention is very good at selecting, right? You've got this QK gating, which is a bilinear thing — you can really pick things out with that — and we're not modeling how those choices were made. So I wouldn't be surprised if, in practice, for a lot of these, attention is crucially involved in choosing which strategy to use, while the MLPs are heavily involved in executing on those strategies. And in that case, we'd be totally blind to what's going on there.

speaker 3: I've been wondering about this for a while — is there a way to actually fix that?

speaker 2: I mean, that's a billion-dollar question, because if I literally knew the answer to that and it worked really well, I couldn't tell you — I would just go make Claude the best model in the world, because it's always accurate. But I don't know the answer, so I can speculate. I think it is, in some sense, an impossible problem, for exactly this reason. You could try to train the models to be better calibrated about their own knowledge. And with the thinking tags — basically with the reasoning models — there's a straightforward way where you do let the model check things. I think models are much better on reflection than they are on the forward pass, because a single forward pass is just limited, for these physical reasons. As an effective strategy, that might be more promising than restricting the model — how do you keep the creativity otherwise? Another possibility is that you could make the model dumber somehow. Maybe you could make a model which doesn't hallucinate but is just dumber, because it uses a bunch more capacity just for checking itself in the forward pass. And it could be that when models are smart enough, people will take that trade-off.

speaker 3: Do you think the underlying architecture — the transformer architecture — is part of the problem?

speaker 2: Yeah, I think it's one of the sins here. It's possible that with a recurrent thing you could just give it a few more loops to check stuff. I mean, if you could fully adapt the compute, you could have it go until it reaches some level of confidence — variable compute per token — and then bail if it doesn't. I think there's a subtlety, though: people think of hallucination as a well-defined thing, but the model is producing reams of text — which word is the one that went wrong? There are some very factual questions where that's clear, but if you think more generally about what would make a given token a hallucination, it's a little less clear.

Let me just see if there's anything else. Yeah, okay, there's nothing else here, other than: if you want more, read the paper. And I guess this formally ends in two or three minutes, so I will just be done now. People can clap and leave if they want, but I will stay for questions for a while, and I'm happy to keep taking questions as long as people want. So thank you.