speaker 1: As you may have noticed, I'm a little bit less innovative in my lecturing than Percy, so you're going to get PowerPoint slides rather than executable Python ones, but you should be able to find the PDFs on the website as well. I've titled this lecture "Everything you didn't want to know about LM architecture and training," because we're going to get into some of the nitty-gritty details that I think most other classes would spare you: details like "what should my hyperparameters be" and those kinds of questions. Some minor logistics: if you're doing the assignments, we are updating them as we find some mostly minor bugs, so make sure you pull updates to the assignments as you go along. Okay, so here's what we're going to do. We're going to start with a quick recap of the transformer, and I'll give you two variants of a standard transformer: one that's probably familiar from the standard transformer lectures you might see in 224N, and then what you implement, which is roughly the modern consensus variant of a transformer. And then we're going to take a much more data-driven perspective on understanding transformer architectures. The question we're going to ask is this: people have trained lots of LLMs at this point, and you can go read all of those papers and try to understand what has changed and what has stayed in common. From that almost evolutionary analysis, you try to understand which things are really important to making transformers work. So the theme of the class is that the best way to learn is hands-on experience, but the theme of this lecture, because we can't train all these transformers ourselves, is to learn from the experience of others. The starting point is the original transformer. Just as a review, and hopefully you all remember this from 224N or your other NLP classes: you've got some simple position embeddings at the bottom, you've got multi-head attention, you've got layer norms afterwards, you've got a residual stream going upwards, you've got an MLP, and then a softmax at the very end. We're going to see variants of all these different pieces until we get to basically the most modern variants of the transformer, and the latest one we'll talk about is from just a few months ago. So what you implemented is not the vanilla transformer from the original paper. We've modified a few things: we've put the layer norm in front of the block, and you can see on this slide that the norm is right before each of these blocks in the residual stream; we've asked you to implement rotary position embeddings; the feedforward layers use something called a SwiGLU; and the linear layers now omit their bias terms. And you might ask, why have you forced us to implement this weird variant of a transformer instead of the original "Attention Is All You Need" transformer? We're going to go through some of those questions. And then yesterday I was thinking, okay, I should actually catch up on all the developments that have happened in architectures over the last year. Percy warned me about this; he said, you're going to have to redo the lecture every year. So I started looking, and I was like, all right, yeah, there are a couple of good papers recently: there's Command A, there's "2 OLMo 2 Furious," there's, you know, SmolLM, Phi-4.
And then you go looking and you're like, wow, yeah, there's Gemma 3 and Qwen 2.5 and InternLM, and then there are more; I can't even cover the screen with these guys, right? There are a lot of models. There were about 19 new dense model releases in the last year, many of them with minor architecture tweaks. On the one hand, it's kind of annoying to go through all these papers and ask what is happening in each of them, but it's also actually a wealth of information, because not all of them do the same thing. And you can kind of see, though not all of you, especially in the back, can see the details of this slide, but I put together a little spreadsheet of what all these models are doing, starting all the way from 2017 and the original transformer up to 2025 and what the newest models are doing. We'll talk about this as we go, but you can see certain kinds of architecture changes being explored. So here, this column is position embeddings. People used to do all sorts of stuff: absolute, relative, RoPE; there was a sort of ALiBi phase for some people. But starting around 2023, everyone just does RoPE. So you can see this as the convergent evolution, almost, of neural architectures, and we're going to talk about all these different kinds of things. So the parts that I'll cover, and this is a preview of the three major sections of this lecture: if I have time, I'm also going to talk about different attention variants at the end. The first thing is going to be architecture variations; that's what I'll talk about first. So activations, feedforwards, attention variants, position embeddings, all of those things. And then, having nailed down the architecture, what do we have to do? Well, we have to pick hyperparameters. How big do we make the hidden dimension? How big do we make the inner projection layer inside the MLP? What do we do about the number of dimensions? How many vocab elements? Those are all important things you have to choose when you're actually training your language model, and you don't want to just pick them out of a hat; you want to select them in some fairly intelligent way. So we're going to start with architecture variations, and there are two things I'll mention right here and come back to as I talk. The first is that there's not that much consensus in a lot of the choices. There's been convergent evolution in the last few years toward what I'll call Llama-like architectures at the very bottom here, but people do all sorts of things: they swap between layer norm and RMSNorm, they do serial versus parallel layers. There's one choice that basically everyone has made since the very first GPT, and I'll talk about that in a bit, but there are lots of different variations we can learn from here. The big one I've already talked about in 224N, so if you remember that lecture, this will be a refresher rather than totally new. I think the one thing basically everyone agrees on, and agreed on almost from the very start, is the use of pre-norm versus post-norm. That terminology will get a little bit more confusing in a moment.
But the original transformer paper did this thing on the left over here, where you had your residual stream in the gray, and in addition to the residual stream, you had these layer norms after each of the sublayers: you would do your multi-head attention, add back to the residual stream, and then layer-norm that. Then you would do the same thing with your fully connected layer, and layer-norm it. Very, very early on, people realized that moving this layer norm to the front of the non-residual part, the block on the right, did much better in many different ways. Basically all modern LMs that I know of use this kind of pre-norm. There have been some new innovations recently that I'll touch on in two slides, but lots of models have moved to this. The one exception is OPT-350M, which I'm guessing they kind of messed up; that one was sort of orphaned when they were training it. That was a fun find in my survey of architectures. So this pre- versus post-norm thing: if you look into why pre-norm was originally developed, the argument was that if you used the post-norm setup, it was much less stable, and so you would have to do careful learning rate warm-up style things to make it train stably. If you look at some of the earlier papers arguing for the pre-norm approach, Nguyen and Salazar, and also the Xiong et al. 2020 paper, you almost always see this comparison: hey, if we use pre-norm and some other stability-inducing tricks, then we can remove warm-up, and these systems work just as well, if not better, than the post-norm layer norm with careful warm-up type approaches. You see this in a machine translation setting here, and you see it as well on the right on various other tasks, especially with BERT, which was trained with post-norm. There were many arguments about why this was helpful. There were arguments about gradient attenuation across layers: if you do pre-norm, the gradient sizes remain roughly constant across layers, whereas if you do post-norm without warm-up, they blow up, in this orange curve. That's a reasonable argument, but I think an argument closer to modern intuition is that pre-norm is just a more stable architecture to train. Some of the earlier work by Salazar and Nguyen identified all these loss spikes: if you were training with post-norm, kind of in blue here, you would see a lot more loss spikes and the training would be unstable as you went along. You see the gradient norm here spiking and generally sitting higher than the one with pre-norm. And so today you see pre-norm and other layer norm tricks being used essentially as stability-inducing aids for training large neural networks. This brings us to one fairly recent innovation; I think this didn't exist when I gave this lecture last year. It's a variant that I don't think really has a great name, but I'm just going to call it the double norm for the moment. So this is the original figure that I showed you at the very beginning, and we know that putting layer norms in the residual stream is bad. But someone in 224N this year asked: why do you have to put the layer norm in front? Why can't you put it after the feedforward network? And of course, you can.
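To make the two orderings concrete, here is a minimal sketch contrasting a post-norm block with a pre-norm block, assuming PyTorch-style modules; the attn and mlp arguments are placeholder sublayers for illustration, not the course's actual implementation.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original-transformer ordering: sublayer, then add, then LayerNorm sitting in the residual path."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # norm applied to the residual stream itself
        x = self.ln2(x + self.mlp(x))
        return x

class PreNormBlock(nn.Module):
    """Modern ordering: LayerNorm first, so the residual stream stays a pure identity path."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # norm only on the branch, not the residual
        x = x + self.mlp(self.ln2(x))
        return x
```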
And not only that: recently people have gone and added a layer norm after the blocks as well. Grok and Gemma 2 both take this approach of layer norms both before and after, and OLMo 2 does only the layer norm after the feedforward and the multi-head attention. So this is actually kind of an interesting change. Pre-norm had been dominant and basically the only thing for a while, but things have been mixed up a little bit, and now there's a new variant. There have actually been some evaluations of this kind of approach, and people have argued it's a little bit more stable and nicer to train on these larger models. Feel free to stop me and ask questions as well; I have a tendency to keep going if no one stops me. Yes? Why is layer norm in the residual bad? Why is layer norm in the residual bad: that's a good question. I don't think I can give you a proof of why it's bad, but one intuitive argument is that the residual gives you an identity connection all the way from the top of the network down to the bottom. If you're trying to train really deep networks, this makes gradient propagation very easy. There are lots of arguments about how LSTMs and other kinds of state space models have difficulty propagating gradients backwards; an identity connection does not have any such problems. Putting layer norms in the middle might mess with that gradient behavior, and that, of course, is what you see back here. This is exactly the kind of plot you'd expect to see if that were happening. Okay, cool. The other thing that people now do: the original transformer used layer norm, which is this equation over here. You have the activations x coming in, you subtract the empirical mean, that's the average of the x's up top, and then you divide by the variance plus a little fudge factor epsilon, square-rooted, so you can roughly think of it as a standard deviation. That standardizes your activations. Then you scale them by a gamma, which is a learnable parameter, and shift them by a beta. So this makes sense: you normalize your activations, and then you shift them around to wherever you want. Many models used this layer norm and it worked quite well, but many models have now moved on to RMSNorm, and this is one of the consensus changes; basically all the models have switched to RMSNorm. And what do you do? You just drop the mean adjustment, so you don't subtract the mean, and you don't add a bias term. Many notable models do this: the Llama family, PaLM, Chinchilla, and T5 have all moved to RMSNorm. What's the reason? One reason is that it doesn't really make a difference: it turns out models trained with RMSNorm do just as well as models trained with layer norm, so there's a simplification argument. But really, the argument that's often given in these papers, and I think it's good to appreciate the details of it, is that going to RMSNorm is faster and just as good. In what way is it faster? Well, if I don't subtract the mean, it's fewer operations, and if I don't have to add that bias term beta back, it's fewer parameters that I have to load from memory into my compute units, less state I have to retrieve.
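For reference, here is a minimal RMSNorm sketch in the same PyTorch style, with the LayerNorm formula it simplifies noted in the comments; the epsilon value and parameter name are illustrative defaults, not taken from any particular model.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations.
    Compared to LayerNorm, y = gamma * (x - mean) / sqrt(var + eps) + beta,
    there is no mean subtraction and no beta bias, just a learned gain."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * (x / rms)
```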
And some of you might be thinking: but wait, you told me in 224N that nothing but matrix multiplies matters for runtime, and this is not a matrix multiply, so I shouldn't care about any of this. That's a reasonable perspective to take. If you think about the percentage of flops taken up by different operations in a transformer, there's this table from a nice paper by Ivanov et al., I think the title is "Data Movement Is All You Need" or something like that, which profiles all the different components of a transformer. You see that tensor contractions, which are the matrix multiplies, are 99.8% of the flops that happen in a transformer. So saving 0.17% of your flops doesn't seem like a huge win. But one of the things that's important for architecture design now is to not think only about flops. Flops are important, but they're not the only resource; you also have to think carefully about memory movement. Even though tensor contractions, things like matrix multiplies, are 99.8% of the flops, things like the softmax operation or the layer norms, all these normalization operations in a transformer, are only 0.17% of the flops but actually 25% of the runtime. A big reason is that these normalization operations still incur a lot of memory movement overhead. So it does actually matter to optimize some of these lower-level things, because it's not just about flops, it's also about memory movement. I'm going to emphasize this quite a bit more when I get to the systems lectures: when we talk about GPU architectures, it's going to become very, very important to think about memory, not just flops. So that's one of the reasons RMSNorm has become much more popular. I went back and looked at some of the earlier RMSNorm papers. The sad thing is that there aren't as many papers published by industry labs with big, nice ablations anymore, so many of the ablations I'll show you are from a couple of years back, but Narang et al. 2021 had a very nice ablation showing exactly the thing I told you. Here's the vanilla transformer, here's the RMSNorm version: the number of steps per second you can do with the vanilla transformer is 3.5, and with RMSNorm you get 3.68. Not a huge gain, but it's in some sense free, and you get a final loss that's lower than the vanilla transformer. So that's great: we've gotten runtime improvements and, at least in this case, loss improvements as well. That's a win-win for us. The final thing I'll say, very much in line with the RMSNorm theme, is that most modern transformers do not have bias terms. The original transformer's FFN looks something like this: you have your inputs x, you do a linear layer with a bias term, then you ReLU it, and then you have a second linear layer wrapping around it. But most implementations, if they're not gated units, which I'll talk about in a moment, actually look something like this: they just drop the bias terms. Again, you can make this argument from basically the same underlying principles.
They perform just as well; matrix multiplies are apparently all you need to get these things to work. And the other thing, which is maybe more subtle, is optimization stability. I don't have the deepest understanding of why bias terms are particularly bad for stability, but there have been really clear empirical observations that dropping these bias terms often stabilizes the training of the largest neural networks. So a lot of implementations now omit bias terms entirely and train in this pure matrix-multiply setting. That's the layer norm bit, and there are two things you should take away. This part is nice because the story is pretty clear: everyone does the same thing, so you should just know it. Basically everyone does pre-norm, or at least they keep the layer norms outside of the residual stream; that's kind of the iron rule. You get nicer gradient propagation and much more stable training, and it just doesn't make sense to do it the other way. Almost everybody does RMSNorm; in practice it works just as well and has fewer parameters to move around. And the idea of dropping bias terms applies broadly: a lot of these models just don't have bias terms in most places. I think the one exception on the RMSNorm point, as I was reading yesterday, is Cohere: both Command A and Command R+ use layer norm. Not quite sure why. Okay. Any questions on the layer norm, RMSNorm, and bias term stuff before I move on? Yes, question: are there some longer-term lessons you can take away from these details that are more future-proof, potentially? So the question was, is there something more future-proof here? It's hard to have the biggest picture; in many ways deep learning has been very empirical and bottom-up rather than top-down. But I do think there are some generalizable lessons you could draw. The lesson of having very direct identity-map residual connections is a story that has played out in many, many different kinds of architectures, not just these. The effectiveness of layer norm, which we'll see once again later in this lecture, has been very clear, and not letting your activations drift in scale is another thing that has generally been very effective for training stability. Those two seem like fairly generalizable lessons. We will also see systems concerns come into play again, so that's another generalizable lesson: think really carefully about the impact of your architecture on the systems components of your design. Okay. So now there's this other component, which is the activations, and there is a whole big zoo of them: ReLU, GELU, Swish, GLU, and then, I mean, these next ones aren't really activations, they're different kinds of MLPs: GeGLU, ReGLU, SwiGLU, LiGLU, and the rest. And yeah, this is exactly the kind of thing I didn't originally want to learn. When I got into deep learning I thought, I don't care about activations, it's going to train anyway. But it really does matter, unfortunately for both you and me, that SwiGLU and the other GLU variants just consistently work well. So I will explain them to you.
And you should think about them carefully, because they do work; internalize that. The ReLU and maybe the GELU you should already know. The ReLU you learn in the most basic deep learning classes: you just take the max with zero, and in the case of the MLP, where I've dropped the bias terms, you compute x times W1, take the ReLU, and then multiply by W2. Fairly easy. The GELU is the Gaussian error linear unit: it multiplies the input by the CDF of a Gaussian, so it basically looks like the ReLU but with a little bit of a bump here; hopefully you can see that it's not just flat at the very bottom. That makes things a little more differentiable, which may or may not help. The GPT family of models, one, two, three, GPT-J, and so on, all use the GELU, and the original transformer and some of the older models used the ReLU. But really, almost all the modern models have switched to the gated linear units, like SwiGLU and GeGLU. The Google folks really pushed for this, with PaLM and T5 and others, but since then it's been widely adopted, and basically all the models post-2023 use a gated linear unit. Going back to that earlier question of what generalizable architecture lessons we can learn from this lecture: there are some things that have consistently been very useful, residual connections, layer norms, and gating is yet another one. This is another place where gating just shows up and is a very good way of doing things. So originally, this is our fully connected layer right here, with a ReLU. Now, instead of doing just a linear and a ReLU, I'm going to gate the hidden activation with an entrywise linear term: x times V gives me a vector, and I multiply that entrywise with my original inner term of the MLP, and then I multiply the whole thing by W2. The way to think about this is that I've gated the hidden part of the MLP: I've got my original activation that takes my inputs and maps them into the hidden space, I gate that with x times V, and then I project back to the model dimensionality using W2. So there's this gating operation that happens entrywise, and that's really the basic thing happening here. This is the GLU plus the ReLU, so the ReGLU, and we've added an extra parameter for the gating: that's V. So when someone says something like, oh, it's a GeGLU, there's nothing to laugh about; that's the GeGLU fully connected layer. What I've got there is the GELU for the nonlinearity, and I've still got the exact same gating on x times V. That's the architecture used by many of the Google models, like T5 v1.1, Gemma 2, and Gemma 3. And then there's another variant, the SwiGLU, which has been very, very popular. Swish is x times the sigmoid of x, and that's the nonlinearity; you can picture it, a sigmoid is like this and x is like this, so it ends up looking just like the Gaussian error linear unit. And then you do the same thing here: you have a gating over the swish, and that gives you the full feedforward layer.
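A minimal sketch of that gated feedforward layer, written as a SwiGLU in the same PyTorch style; the module and weight names are made up for illustration, and the two-thirds sizing convention that comes up next is noted in the comment.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feedforward layer: W2( silu(x W1) * (x V) ), with no bias terms.
    d_ff is typically set to about (8/3) * d_model so that the three weight
    matrices hold roughly the same parameter count as a non-gated 4x MLP."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.v = nn.Linear(d_model, d_ff, bias=False)   # gate
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x):
        # silu(x) = x * sigmoid(x), i.e. the swish nonlinearity
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```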
Yes, question: for negative values, the swish function and also the GELU aren't monotonically increasing; in fact, they're decreasing. A lot of the intuition for gradient descent in supervised learning is that you want to follow the gradient, but here it seems like you would go in the opposite direction if you use GELU or swish or their gated versions. So yeah, the question was: this isn't monotonic, and there's a bit to the left of zero here where the derivative flips sign; isn't that going to be a problem? Intuitively, you could argue it would be a problem, that you might trap a bunch of activations near zero. But in practice, if you look at neural network optimization dynamics, what's actually happening is that you're throwing very high learning rates with momentum into the optimizer, so you're not really going to converge to that zero point; these activations are going to be all over the place. In practice, I don't think this tiny negative piece has a huge effect on the model, if that makes sense. Okay. And going back to this: the SwiGLU is what most models use today, the Llama family, PaLM, OLMo, and I'll show you the big table later, but you'll see that the SwiGLU is very, very popular. One thing to note, and I'll talk about this again in the hyperparameters part: remember that I've added this V term, this extra parameter, so I want to think about how to size it. What people do in gated models is make the hidden size, basically the output dimensionality of W1 and V, smaller by a factor of two-thirds, in order to make the total number of parameters match the non-gated counterparts. That's a convention most people follow. If you don't quite understand what that means, I'll go back over it later; just keep in mind that for the gated linear units, you make everything a little bit smaller so things remain parameter-matched. Oh yes, question: this may be obvious, but one of the benefits of ReLU is that it's very easily differentiable with respect to the input. If you have the derivative of the CDF of a Gaussian, you have something like an exponential in x squared; is that not really slow? That's a very good question. I'm not 100% sure what the internal CUDA implementation of the SwiGLU or the GELU and GeGLU is; it's entirely possible that internally they're implemented with lookup tables. A comment from the audience: what really matters is the memory pressure here, and it will be exactly the same because you're reading the same number of elements, so the extra compute is negligible. Yeah, that's probably the better argument: flops-wise this is negligible anyway, and the memory calculus is the same. Okay, cool. All right, so do gated linear units work? I'll have more modern evidence for this as well, but I thought I should take you straight to the horse's mouth, Noam Shazeer's original paper, where he evaluates all these GLU variants. This is somewhat older stuff, so you're seeing CoLA and SST-2 performance, but you do see that the GLU variants consistently perform better: 84.2, 84.12, 84.36, 84.67. And, you know, wow, it's 2020.
They even give you the standard deviation, so you can figure out how significant those results are, and they are in fact significant. So that's some nice evidence. There was also the Narang et al. 2021 paper, a very nice paper studying all sorts of architecture variants, I think in the context of T5-style models, and once again you see that the gated linear unit variants consistently achieve lower losses than their counterparts: the bolded entries are exactly the GLU variants. This pattern has basically held up. So for gating and activations, there are lots of variants across different models, but the gated linear unit has become basically widespread and dominant, and I think for good reason. Of course, the GLU isn't necessary for a good model; it's important to separate those two things. Just because it's probably slightly better and everyone does it doesn't mean it's necessary, and you do see examples of very high performance models not using a GLU. GPT-3 is one example. A more recent one: Nemotron-4 340B uses a squared ReLU, which I had not seen before, and Falcon 2 11B uses a ReLU. Both of those are relatively high performance models, so you can see that it's not strictly necessary. But the evidence does point towards consistent gains from SwiGLU and GeGLU, and that's why we ask you to implement exactly that variant. Cool. Okay. The final thing I want to talk about for architectures, and this is one final major variation that we've seen: normally, the transformer block is serial, in the sense that for each block the outputs come in from the bottom, you do your attention, you pass the result of that computation forward, then you do your MLP and pass that forward. So it's inherently serial: attention, then MLP. But this has certain parallelism constraints. If you want to parallelize this over gigantic sets of GPUs, it might be harder to do; with the serial connection, the systems concerns might be more difficult and you might get lower utilization from your GPUs. So a few models have done this thing I'll call parallel layers, where instead of computing attention and then the MLP in series, they do both at the same time: you take the x from your previous layer, you compute both the MLP and the attention side by side, you add them together into the residual stream, and that's your output. This was pioneered by GPT-J, which was an open-source replication effort, and then the folks at Google doing PaLM were bold enough to do it at really big scale, and many others have followed since. If you implement this right, you can share a lot of things: the layer norms and the matrix multiplies can get fused together, and you can get some systems efficiencies out of that. It hasn't been quite as popular recently, though; at least in the last year, most of the models we've seen have used serial layers rather than parallel ones. I think the only exceptions are Cohere's Command A and Command R+ and Falcon 2 11B.
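For comparison with the pre-norm block sketched earlier, here is roughly what a parallel layer looks like, again as a PyTorch-style sketch with placeholder attn and mlp sublayers; real implementations fuse more than this, so treat it as illustrative only.

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J / PaLM style parallel layer: attention and MLP read the same
    normalized input, and their outputs are summed into the residual stream,
    instead of running attention and then the MLP in series."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln = nn.LayerNorm(d_model)  # a single shared norm is one of the shared pieces

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h) + self.mlp(h)
```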
So now I think we can go back to this big, hard-to-see chart and see what I was pointing at at the very beginning. This column here: you don't really need to be able to read the text, because the colors tell you everything. You need to see this check mark here; this is basically pre versus post norm. The only two models I really know of in the early days that did post-norm are the original transformer and GPT, and BERT, if you want to include that in this table. Almost everybody else has done pre-norm; the only other unchecked boxes here are models that are proprietary, where I don't have details for this column. On the leftmost column, this is RMSNorm versus layer norm: the gray boxes are layer norm, the blue ones are RMSNorm, and as I said, most people have converged to RMSNorm. The column next to it is serial versus parallel layers; once again, most people do serial, but you see other variants. What I'm going to talk about next is position embeddings, and that'll get a bit more interesting in a moment. Any questions about any of this architecture stuff before I move on? Hopefully that gives you an overview of at least the major variations in architectures that we see. Yes. So the question was whether serial is more efficient than parallel. It should actually be the reverse: parallel is more efficient than serial, and that's why you'd be willing to do this. In some sense, you might expect serial to be more expressive, because you're composing two computations rather than just adding them together. But the benefit of parallel, in theory, is that if you write the right kinds of fused kernels, a lot of these operations can be done in parallel, or the computation is shared across the different parallel parts. Okay, so cool. The last thing I want to talk about in architecture land is variations in position embeddings. This one's interesting because in the first few years of LM land, there were a lot of different things people were trying. Sinusoidal embeddings were from the original transformer; you should have learned those in 224N, the sine and cosine positions. Many others did absolute embeddings: the GPTs and OPT all basically just added a learned position vector to the embedding. Some others, like T5 and Gopher, did various kinds of relative embeddings that add vectors to the attention computation. And then most models have converged to RoPE, which is rotary position embeddings. This actually started in GPT-J, once again an open-source contribution, and it has been picked up really rapidly by most of the models. The high-level thought process behind RoPE is that the thing that matters is the relative position of these vectors. So if I have an embedding f(x, i), where x is the word I'm trying to embed and i is my position, then I should be able to write things down this way: there should exist an f such that the inner product of f(x, i) and f(y, j) can be written as some different function g(x, y, i - j), which depends only on the two words and the difference in their positions. This is a definition that enforces invariance to absolute position: you only pay attention to how far apart the two words are. And you can do a brief check and see, okay, what happens with sines? Well, you get these cross terms that are not relative.
So sinusoidal embeddings do still leak absolute position information. Absolute embeddings, well, it's in the name: they're not a relative position embedding. And relative embeddings are relative, but they're not formulated as an inner product, so they violate this constraint. RoPE is the clever observation that we do know one operation that is invariant to this kind of absolute shift, which is rotation: inner products are invariant to rotating both vectors, so we're going to exploit that structure to build our position embeddings. On the left, this is the starting point. Say my embedding for the word "we" is this arrow over here, and my embedding for the word "know" is this other arrow. Now I want to embed the sequence "we know that," and I'm just looking at the words "we" and "know." How do I do that? Well, "we" is in position zero, so I don't rotate it at all. "Know" is in position one, so I rotate it by one unit of rotation. And now I have my embedding for "we know." Now say I want to embed the sequence "of course we know." "We" and "know" have the same relative positioning to each other, so let's look at what happens. Everything gets shifted by two positions: I rotate "we" by two, starting in this vertical position, one, two. And then I rotate "know" by three positions, because it's in, zero, one, two, three, the third position. And now if you look at these two arrows, they have the same relative angle, so their inner products are preserved. That's the nice, fun idea behind RoPE: you just rotate the vectors, the rotation angle is determined by the position of each word, and since inner products don't care about a shared rotation, they end up depending only on the difference in positions. Now, this is easy to think about in 2D, because rotations are obvious there; in 2D there's only one way to rotate a vector. But in the high-dimensional spaces where we operate, it's not obvious at all how to do this rotation. The RoPE folks came up with, in some ways, the simplest effective way of doing it: you take your high-dimensional vector, of dimension d, and just cut it up into blocks of two dimensions, and each two-dimensional pair gets rotated by some theta. So there's a rotation speed for each pair of dimensions, and every pair of dimensions encodes these relative positions. And much like with the sine and cosine embeddings, you pick the set of thetas so that some pairs are rotated quickly and others much more slowly, so they can capture both high-frequency, close-by information and lower-frequency, far-away positional information. The actual RoPE math, if you think about rotations, is just multiplying by sine and cosine rotation matrices; hopefully you remember those from linear algebra and trig. You can think of it as multiplying your embedding vectors by a block-diagonal matrix of two-by-two rotation blocks, and there are no additive or cross terms that appear here. This is all purely relative.
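Here is a minimal sketch of that paired-dimension rotation, assuming PyTorch tensors; the function name and the interleaved pair layout are illustrative choices, and real implementations, such as the Llama one discussed next, lay out the pairs and cache the angles somewhat differently.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles.
    x: (..., seq_len, d) with d even; positions: (seq_len,) integer positions.
    Pair k is rotated by angle = position * base**(-2k/d), so different pairs
    spin at different frequencies, like the sinusoidal embedding schedule.
    In attention, this is applied to the query and key vectors."""
    d = x.shape[-1]
    half = d // 2
    freqs = torch.pow(base, -torch.arange(half, dtype=torch.float32) / half)  # (d/2,)
    angles = positions.float()[:, None] * freqs[None, :]                      # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # the two coordinates of each 2D pair
    rotated = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)           # back to (..., seq_len, d)
```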
One thing that is different, if you're used to absolute position embeddings or sine and cosine embeddings, is that RoPE operates at the actual attention layer. You're not going to add position embeddings at the bottom; whenever the attention computations are done, you intervene at that layer, and that's what gives you your position information. I pulled this from, I think, the Llama implementation of RoPE. You've got the normal attention stuff at the very top, the query, key, and value linear projections, and then you come up with cosine and sine angles; these are rotation angles telling you how much to rotate different blocks of the query and key. Then you take your query and your key and rotate them by those cosines and sines, and now you've got a rotated query and a rotated key, and that's what goes into the rest of your attention computation. So you don't do this at the bottom; you do it whenever you generate your queries and keys. Hopefully that's clear. That's really critical to enforcing this relative-position-only information. Okay, good. One of the things I want to highlight is that RoPE is actually one of the things everyone seems to have converged on. I went through all 19 of those papers over the weekend, and basically all of them now use RoPE, for various reasons. One reason is that RoPE now has many different algorithms for extrapolating context length, and that's an important part of the modern productionized language model. But it also seems to be empirically quite effective, even at fairly small scales and small context lengths. So it's kind of won out in this, what's it called, position embedding battle. Okay, any questions before I move on to the hyperparameters? Yes. I don't think they're all the same; there's some variation in the theta. Oh, yes: are the thetas for each pair hyperparameters? Are they trained? They're not. The thetas that determine the rotation angles are not learned, and much like with the sines and cosines here, there's a schedule to the rotation angles, with the same intuition as those sines and cosines: you want to cover different frequency ranges in order to get higher and lower frequency information. Yes: do the rotations create any difficulty with training, like the angular part? The rotations themselves don't really create any issues, because one way of thinking about a rotation is that it's just a matrix multiply. Since the thetas are fixed and the positions m are fixed, this is really just a fixed matrix that multiplies your vector, so in that sense it's not an issue. If you were learning the thetas, then maybe you would have issues, because you'd be differentiating through trig functions, but you're not doing that here. Okay, cool. So now we go even one more level into the details, and we're going to talk about hyperparameters. I feel like when you're dropped in and asked to train a new language model, you have a lot of questions about hyperparameters, because there are quite a few of them. And one of the things I've realized is that actually only a few of these really get changed across different successful models.
There are actually fairly clear rules of thumb and fairly clear guidelines that people seem to be following. So there are questions like: how much bigger should the feedforward size be, how many heads should I have, what should my vocab size be? We'll talk about each of those things and try to constrain the space of hyperparameters. The starting point: we're going to look at a simple feedforward layer, with the biases, and let's say this is the ReLU version of it. There are two hyperparameters here. There's d_model, which is the dimensionality of x, the input coming into your MLP, and then there's d_ff, the feedforward dimension, which is the hidden dimension of your MLP; from there you project back down to d_model. So what should d_ff be? In general these are up-projections: you have more hidden units than inputs. But how much bigger? Well, there is actually a consensus: almost everybody using ReLU-style MLPs picks d_ff equal to four times d_model. I'll show you some empirical evidence for this number later, but as far as I can tell, there's no law of nature that says you have to pick four; it's a convention that has really held up. Now, there are a few exceptions to this rule. Remember that the GLU variants scale this down by a factor of two-thirds, and if you scale down by two-thirds, you keep roughly the same number of parameters. You can do a little bit of math, and if you scale the GLU variants down by a factor of two-thirds, you'll conclude that the way to do it is to set d_ff equal to eight-thirds times d_model. That's the number you end up at, and you can convince yourself it gives you the same parameter count; it's the ratio you'd get if you started from a ratio of four. If you look at many of the models, they do follow this rule of thumb. PaLM, for example, and Mistral and Llama are slightly larger; those are GLU models that don't follow this 2.6 rule. But if you look at, for example, Llama 1, Qwen, DeepSeek, and T5, they all roughly follow this 2.6-ish rule. I can put up the big table of LLMs I made later with hyperparameters; many, many of them fall into this roughly 2.6 range, and that's the standard parameterization of a GLU unit. I'll go through one other exception. I really like this exception, because in many ways big language model training is a game of copying hyperparameters from other people, so we don't learn very much; it's very conservative. But T5 I really like, because in some sense it's really bold, and I think the Google people actually do some pretty bold stuff. If you look at the 11 billion parameter T5 model, they have a pretty incredible setting: their hidden dim is 1024, but their d_ff, the up-projected dimension, is 65,536. That gives you a 64 times multiplier on the ratio of d_ff to d_model. Compare that to PaLM at a factor of four, and everyone else even smaller: this is a very large difference.
And there are some other recent examples of using much bigger multipliers: Gemma 2 kind of follows in these footsteps and does a factor of eight, and I'll talk a little bit about that exception later. Of course, T5 was a totally fine model, so this should tell you it is possible to train a model with a much larger ratio. One piece of quantitative evidence: I saw that four-times multiplier and I thought, is that really the right thing to do, or is there some more quantitative experiment someone has done to convince me it's a good idea? One of the figures from Jared Kaplan's scaling law paper helps here. Most people know that paper for the scaling law component, but there are also some really useful hyperparameter studies in it. They do exactly the thing I'm talking about: they vary the d_ff to d_model ratio and plot how much the loss increases as you vary it. And you see there's a sweet spot. This is a ratio of one, two, three, four, and then up to like ten or so here, and there's a pretty wide basin anywhere between one and maybe up to ten where you can pick whatever feedforward ratio you want and it'll be roughly optimal. Four is not far off from the optimal choices over here; it's like right here, or maybe right here. So that's a pretty reasonable choice. So what can we learn from all this hyperparameter stuff? A lot of the evidence points towards picking the standard defaults: if you're not using a GLU, you can multiply by four, and if you're using a GLU, you can use roughly 2.66, and those work pretty well for essentially all the modern LMs. T5, once again, shows that you don't have to follow these rules: you can be a rule breaker and do whatever you'd like, there's no hyperparameter choice written in stone, and you can get reasonable LMs at many other settings. That said, the really funny epilogue to this story is that T5 has an improved follow-up model called T5 v1.1, and it uses a much more standard 2.5 multiplier with a GeGLU. So you can read between the lines and guess that maybe they looked at the original T5 and decided to walk back that 64-times multiplier and pick a more standard one, and they did end up with a better model. Cool. Yeah, okay, so I think that's a good question. The question was, what's the relationship between this ratio and its impact on the model? If we go all the way back here, the ratio is controlling essentially how wide the hidden part of this MLP is. The original justification in the T5 paper for picking 64 was that you can get bigger, fatter matrix multiplies if you make that dimension really, really large. And the wider it is, the more parallel computation you're getting, so to speak, rather than serial computation. So you're spending your flops and your parameters in a slightly different way than if you made your hidden dimension bigger, which would let you pass more information, or used more layers, which would give you more serial computation. You're spending your parameters and flops in a slightly suboptimal way from the perspective of expressive power, but you might get systems gains if your matrices are wide enough.
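As a quick sanity check on the parameter bookkeeping behind these ratios, here is the arithmetic for a hypothetical d_model, showing why the roughly eight-thirds rule keeps a gated MLP parameter-matched to a four-times non-gated one; the numbers are illustrative, not any real model's configuration.

```python
# Feedforward sizing rules of thumb (biases ignored, since modern models drop them).
d_model = 4096

# Non-gated ReLU/GELU MLP: W1 is (d_model x d_ff) and W2 is (d_ff x d_model).
d_ff_relu = 4 * d_model
params_relu = 2 * d_model * d_ff_relu    # = 8 * d_model**2

# Gated (SwiGLU/GeGLU) MLP: W1 and V are (d_model x d_ff), W2 is (d_ff x d_model).
d_ff_glu = int(8 / 3 * d_model)          # the ~2.67x rule
params_glu = 3 * d_model * d_ff_glu      # also ~= 8 * d_model**2

print(params_relu, params_glu)           # 134217728 vs 134209536, within about 0.01%
```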
Okay, excellent. Another thing that is a surprising, or maybe not surprising, consensus hyperparameter is the relationship between the model dimension and the head dimension times the number of heads. I clipped this from 224N, but the basically canonical choice is to pick things so that the head dimension times the number of heads equals the hidden dimension d: if you have multiple heads, you just split up the number of dimensions each head gets, so you keep the total dimension fixed as you add more heads. You don't have to do that. As you add more heads, you could keep the same number of dimensions per head and just let the attention part take more and more parameters; that's an option you have. But most models, once again, follow this guideline. GPT-3, T5, LaMDA, PaLM, and Llama 2 all have a ratio of one, or almost exactly one. T5 is the one exception that breaks this rule; they tried a big ratio of 16, but otherwise nearly everyone follows this consensus. There have been a couple of papers arguing against this one-to-one ratio. There's a notable one by, I don't know how to pronounce this, Bhojanapalli et al., 2020, who argued that if you have more and more heads, each head has lower and lower rank, and if you have very few dimensions per head, that starts affecting the expressiveness of the attention operation. But in practice, we don't really seem to see significant low-rank bottlenecks, and most of the models with this ratio of one seem to do just fine. This is really a parameter that's generally been held constant by most of the models we've seen. If I have time, I'll talk a little bit about different optimizations people have made on the multi-head component, but hyperparameter-wise, things have stayed fairly similar. One of the big ones in terms of hyperparameters is the aspect ratio. We can make networks deeper, with more and more layers, or wider, and if you want one knob to control the width, that would be the hidden dimension of the residual stream, which controls the width of almost all the operations at once. So this seems like a pretty critical thing to tune. You might think that deeper networks are smarter and more expressive, whereas wider networks are more efficient. There is generally a sweet spot of ratios that people have picked. There have been some outliers: some of the early models used much smaller ratios, meaning they were much deeper than they were wide, and some models have gone the other way, really wide, with way more d_model relative to the number of layers. But there's generally been a sweet spot of about 128 hidden dimensions per layer, and that has generally been stuck to by a lot of the GPT-3 and Llama variant models. I'll talk a little bit about the evidence for that in a second.
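Putting the head-dimension and aspect-ratio rules of thumb together, here is an illustrative sizing calculation with made-up numbers, not the configuration of any real model.

```python
# "Consensus" sizing for a hypothetical model, following the rules of thumb above.
d_model = 4096
n_heads = 32
head_dim = d_model // n_heads   # = 128: heads split d_model, so head_dim * n_heads / d_model = 1
n_layers = d_model // 128       # = 32: aspect ratio d_model / n_layers of about 128
d_ff = int(8 / 3 * d_model)     # gated-MLP hidden size from the earlier rule
print(head_dim, n_layers, d_ff) # 128 32 10922
```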
There are also considerations about aspect ratio that are quite important: it controls the amount of parallelism you can do. If you're doing something called pipeline parallelism, what you often do is take your different layers, cut them up, and put them on different devices or different groups of devices, because you'll also parallelize within each layer. So that puts certain kinds of constraints on your model. And if you have really wide models, you can do something called tensor parallelism, where you slice up the matrices and distribute those across GPUs. One thing we'll learn in, I think, four or five lectures is that these different parallelism paradigms have different constraints: you need really fast networking for tensor parallel, and you can maybe get away with slower or higher-latency networking for pipeline parallel. So your networking constraints might in turn drive some of these width-versus-depth considerations. But setting that aside, you might abstractly ask: what is the impact of aspect ratio on model performance? Once again, Kaplan et al. have a really nice visual aid showing how aspect ratio impacts performance. This is three different scales, 50 million, 274 million, and 1.5 billion parameters; the x-axis is aspect ratio and the y-axis is the loss difference as a percentage change. You see that around 100, which, as I told you, is around the consensus choice of hyperparameters, is the minimum across the different scales. So this is backed by some of the large-scale hyperparameter data published by Kaplan et al. and roughly matches that intuition. A really nice thing here is that the optimal aspect ratio does not seem to shift too much across several orders of magnitude, and if that holds up even further, that's very good news: you can keep training at one fixed aspect ratio. One thing I will note as quite an interesting result: Tay and others at Google had a very interesting paper studying the impact of depth versus width, both upstream and downstream. One of the things they found was that if you're looking at losses, it doesn't really matter; parameter count is the only thing that matters, and deeper models don't help you. But the story is less clear if you're looking at downstream accuracy. At the time, they were looking at fine-tuned SuperGLUE accuracy, and they argued that for the same amount of flops, deeper models might be better. I'll just leave it at that; there's not quite as much follow-up to this work, at least in the open, that I've seen, but downstream performance may actually change the aspect ratio considerations slightly. Okay, cool. The final thing I want to talk about in this very low-level hyperparameter world is what vocabulary sizes you might want to pick. In general, vocabulary sizes have been trending upwards, and I think a big part of why is that LLMs are being deployed out in the wild and becoming more useful services. When that happens, you're going to interact with people speaking different languages, people using emojis, all sorts of other almost-modalities or languages beyond what you might expect. Some of the earlier models, and especially the monolingual models, ranged around the 30,000 to 50,000 token vocabulary range; you can see this in the GPTs and the early Llamas.
But if you look at the multilingual, or what I would call production, systems that have come out, they've all been shifting towards the 100,000 to 250,000 range for their vocabulary sizes. I looked at Command A, which is one of Cohere's models; they're a company that emphasizes a lot of multilingual work, and you see very large vocab sizes from them. Even GPT-4, and the many others that have copied the GPT-4 tokenizer, are going to be around 100k tokens. So that's the standard a lot of people are operating at, roughly 100k to 200k tokens. And there's been work showing that as models get bigger, they can in some sense make good use of more and more vocab elements, so you might see vocabulary sizes continue to increase as models get scaled up or trained on more data. Cool. Okay. The last thing, and this is no longer specific hyperparameters, but two other things you might need to decide before you set your model off to run: dropout and other kinds of regularization. This one was really interesting to me when I was originally doing the research for this lecture. If you think about pretraining, pretraining is about the furthest place you might think of from regularization, because in pretraining you usually do one epoch. You can't even go through all of your data, because you have too much of it, so you do one-epoch training, and you're almost certainly not overfitting the data in that one pass. So you might think: all right, we don't need regularization for pretraining; just set your optimizer loose, it's all about minimizing loss. And these are really good arguments for why you shouldn't need to regularize. But if you look at what people actually do, the story is kind of mixed, and it was maybe even more mixed than how it has turned out. In the early days, people did a lot of dropout, and there's also a lot of weight decay happening. These days, a lot of people have stopped publishing details on precisely their training hyperparameters, but dropout has gone out of fashion, while weight decay is something a lot of people continue to do. And why is that? It's a really odd thing to be doing, right? So I'll give you a moment to think about the state of affairs: if you're training a really large neural network for one pass with SGD over vast amounts of data, why would you use weight decay? Maybe some of you know the answer, but I think it's an interesting thing to think about; it's very intuition-violating, at least for me. Okay. So the reason is that it's not to control overfitting, in the sense that different amounts of weight decay don't really change the ratio of training loss to validation loss. You can train with different amounts of weight decay, and if you train for long enough and control your hyperparameters appropriately, you'll end up with the same train-to-validation loss gap. So in terms of overfitting, nothing is happening here, even with zero weight decay. But what is interesting is that the weight decay seems to be interacting in a somewhat strange way with the learning rate schedules of the optimizers.
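For reference, the setup being compared in these experiments is roughly decoupled weight decay (AdamW) together with a constant or cosine learning rate schedule; this is a minimal sketch assuming PyTorch, with illustrative values rather than the settings any particular lab used.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer

# Decoupled weight decay (AdamW); 0.1 is a commonly reported value, used here as an example.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Cosine schedule decaying the learning rate toward a small floor over training;
# the interesting interaction shows up as this decay reaches the tail end.
total_steps = 100_000
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=3e-5)

# Inside the training loop: loss.backward(); opt.step(); sched.step(); opt.zero_grad()
```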
And so what's happening is this: if you look at a constant learning rate, so this is a model trained with a constant learning rate, and then you suddenly decrease the learning rate to near zero, you see this drop-off as you decrease the learning rate. And then let's look at different amounts of weight decay. What happens is that with weight decay, the model is not training very well at the high learning rate, and then when you decrease the learning rate, it very rapidly drops off. And when you look at a cosine learning rate decay, the models with high weight decay start out very slow, but then as they cool down, that is, as their learning rate decreases, they very rapidly optimize. So there's some very complex interaction happening here between the optimizer and the weight decay, and some sort of implicit acceleration near the tail end of training ends up giving you better models. And so the answer to the question I posed is that you don't use weight decay because you want to regularize the model, which is what it was defined for; you use weight decay in order to actually get better training losses. And you end up doing that because of the learning dynamics at the tail end of training, as you decrease your learning rate to zero. It's a very interesting, complex, and in some ways troubling thing to be doing with language models. But now you see why, if you look at a lot of the reports, you'll see "we use weight decay"; this is kind of why that ends up happening. Cool. Okay. So putting all that together, there are certain things that I think are just no-brainers. If you're picking various hyperparameters for your model, you don't really need to think too deeply about them, in the sense that they've been validated and basically everyone else does them. This is things like the hidden size of the MLP, the head dimensions of your multi-head attention, your aspect ratio, and your choice of regularization through weight decay. For all of those, there's fairly good consensus evidence, I think, on how to pick most of these hyperparameters. And those defaults roughly give you the kinds of things we suggest in the assignment, so you can kind of follow along, and they roughly give you something similar to this. Okay. Any questions about the hyperparameter piece? Yes. Question: is there a reason why dropout has gone out of fashion? That's a good question. The question was, why did dropout go out of fashion? I don't think I've seen a deep analysis of why dropout is or isn't helpful; I haven't seen any result that, for example, shows that it helps for training loss. And as both this paper argues and logic would dictate, there's not really a training overfitting issue with these models, which can't even do one epoch over their training data. Yes. Question: do multilingual vocabularies actually contribute to improved performance in one language? Yeah, so the question was, do multilingual vocabularies contribute to improving performance in one language? When you say one language, do you mean, do multilingual or larger vocabularies help performance in English? Is that the right question? Yeah. So I think in your high-resource language, the impact is less. If you're only thinking about English language modeling, you can get away with smaller vocabularies.
That much is fairly well established. But the place where larger vocabularies are really helpful is when you start to get at, I wouldn't say the tail of your distribution, but the languages that are more minority. One great example of this: if you look at any of the Cohere announcements about their models or their tokenizers, they basically always argue that because they have larger vocabularies and because of the way they train their tokenizer, non-English and low-resource languages get packed into many fewer tokens. And so people using those languages pay a much lower cost at inference time, which is a great benefit. Oh, yes. Question: if weight decay doesn't have a significant impact on the validation loss, why do we care about the training dynamics, or favorable training dynamics? Okay, so the question was, if it doesn't have an impact on validation loss, why do we care about training dynamics? The goal is still that I want to get a good training loss, right? That's the game we're playing. And the surprising thing about weight decay is that somehow it gets us better training losses. The intuitive story that would make sense is that you do weight decay and it gives you better validation losses; but that's not what happens. What it's getting you is better training losses, which here are also the same as validation losses. Yes. Question: are there differences in the architecture and hyperparameter choices people make as they move towards multimodal models? Yeah, so the question was about multimodal models. That is a great question. My survey of multimodal models is very incomplete. What I can say is that a lot of the academic and open work I've seen does what you might call shallow fusion, or late fusion, of the modalities, where you kind of bolt the vision modality onto an existing language model. In those cases, the hyperparameter and architecture choices are fixed, right? One thing I will note, and I will talk about this in just a few slides, is that the multimodal models pioneered some pretty interesting techniques for stabilizing language model training. That's been a really big theme, and I'll talk a little bit about those. What is different is that often, when you bolt on this new vision piece and you retrain with it, that's a big shock to the model, and so you have to think carefully about how to stabilize that training process. And those innovations have actually seeped back into pure-text language model training. Okay, cool. So I went back through and looked through all these new papers, and as I was trying to think about what's been new in the last year and what new architecture and related things have happened, actually the core architecture hasn't changed much. But the one thing that stood out as being very emphasized in a lot of the releases has been what I would call stability tricks. These are things where you would like to train your model in much more stable ways, and as you make bigger and bigger models and train for longer and longer, these kinds of issues start to appear more and more. So I've taken this from the OLMo 2 paper, and actually that paper is a great set of academic results on LLM training stability. One thing they start with is this figure: you look at this blue curve over here, the L2 norm of the gradient, and it's a terrifying graph to look at, right?
Your loss curve kind of seems to be behaving okay, apart from some bad spikes every now and then, but you open up your gradient norm and it's this horrible plot where you've got spikes everywhere and your norms are completely blowing up. If you're training models like this, you're going to have a really tough time getting them to converge reasonably. At some point the gradient norm explodes, you can't do anything, and your training is done; you can't train any further. And so there's been a lot of emphasis on basically trying to turn this blue curve into something that looks a lot like the orange curve. And of course this loss is higher, but ignore that fact, because I think they just switched datasets in between these two training runs. But this orange curve has nice, low gradient norms throughout, and that's really the kind of plot you would much rather see. So you might ask, where do stability issues arise in transformers? And of course they can arise basically everywhere. But if you look at the kinds of interventions people are making, there's really one place that stands out as the problem child, and that's the softmaxes. They can be a problem because you're taking exponentials, and those can be numerically badly behaved, and you're also dividing two numbers, so you might have a division by zero. So for many different reasons, the softmax is a part where you might have lots of issues. So actually, one more thing I want to cover first: where are the softmaxes in a transformer? Well, there's one at the very end, so you've got to be careful about that output softmax, and there are also softmaxes in your self-attention. So there are two softmaxes we're going to think a little bit about, and for each one I'm going to mention a stability intervention that has generally seemed to be effective. Okay. So the first one is called the z-loss. And in my desire to cite an older paper, I've gone back to Devlin in 2014, where, in a machine translation paper, their goal was to make sure that this normalizer stayed near one. So if you look at p(x), that's the output softmax over here. The output softmax has two pieces: you exponentiate your logits and then you divide by the normalizer Z(x), which is just summing the exponentiated values across the whole vocab. And if you want to train the network to have a Z(x) close to one, well, then you can rewrite your loss and add a little second term to push log Z(x_i) towards zero. So you end up with an auxiliary loss term that's alpha times log-squared Z(x_i); you can see that derivation on the right here. And this is, in some sense, what people often call the z-loss. I think Jacob Devlin and others did this for machine translation for totally different reasons than what it's used for today. But the first instance of this in language modeling land, I think, was PaLM, which used, as they called it, an auxiliary z-loss of 10^-4 times log-squared Z, to encourage the softmax normalizer to behave nicely. And you can reason through the behavior of this regularizer: if it succeeds and it forces log Z(x) to always be zero, then the log and the exponential cancel, and you've basically just got the raw logit u(x). And that's a good place to be, right? That's a nice, numerically stable operation.
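Here is a minimal sketch of what that auxiliary term looks like in code; the 1e-4 coefficient is the PaLM-style value mentioned above, while the shapes, the function name, and the mean reduction over the batch are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_loss_coef=1e-4):
    """Standard next-token cross entropy plus coef * log^2 Z, pushing log Z toward 0.

    logits: [batch, vocab], targets: [batch] of token ids.
    """
    log_z = torch.logsumexp(logits, dim=-1)        # log of the softmax normalizer Z(x)
    ce = F.cross_entropy(logits, targets)          # the usual loss
    z_loss = z_loss_coef * (log_z ** 2).mean()     # auxiliary term keeping Z(x) near 1
    return ce + z_loss

# Example usage with dummy data:
logits = torch.randn(4, 32_000)
targets = torch.randint(0, 32_000, (4,))
loss = cross_entropy_with_z_loss(logits, targets)
```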
So all of those problematic operations kind of go away, and you can think of the softmax as being well behaved when Z(x) is close to one, or log Z is close to zero. And PaLM in some sense was very much a pioneer here, because they did this z-loss trick and many others didn't for a long time, or at least not the ones with open papers. But then there was a sequence of papers that did: Baichuan 2 is the earliest follow-up that I know of, and then DCLM and OLMo 2, and now several others, have basically picked up on z-loss as a very nice, convenient intervention for improving stability. And then there's the other trick. That was how to stabilize the output softmax, but we've got another softmax to deal with, right? The other softmax is in the attention operation. So this is from an NVIDIA paper; I forgot to put the citation marker, but this is a block diagram of how attention works. You've got your layer norm at the beginning, you've got your Q, K, and V projections; ignore this extra piece for the moment. You multiply your Qs and your Ks, you softmax it, you multiply by the V, and then you project it, and that gives you your fully connected layer and your output. So if you ignore this little piece over here, this looks just like your normal multi-head attention operation. So what's the difference? Several folks came up with this approach called QK-norm, where you take the queries and the keys and pass them through a layer norm before you take their inner product for the softmax operation. And this is a very different kind of approach to controlling the behavior of the softmax. Here you're not controlling the normalizer Z; instead you're controlling the inputs to the softmax so that they're bounded in size, and that naturally controls the bad behaviors of the softmax. As I said before, this was originally an innovation from the vision and multimodal model community: Dehghani et al. in 2023, a paper on training very large vision transformers. And then Chameleon, and Idefics from Hugging Face, used these tricks for their multimodal training components, and then it got picked up by several others like Gemma 2, DCLM, and OLMo 2, which all basically use this kind of technique to stabilize their training. And I think I'm allowed one joke per lecture, so this is the one I'm going to go with: one of the things that has really stood out in terms of stability interventions is just how strikingly effective layer norms are. We've gone from layer norms just at the front of the block, to both the beginning and the end of the non-residual component, and now we've also thrown them onto the Q and the K. At least in terms of improving stability, layer norms have been shockingly effective without affecting performance too much. The last trick I'll note, which I think has not been quite as frequently used, is to just soft-cap the logits that go into the softmax. So here's the other approach you can take: QK-norm is in some sense a very heavy-handed intervention, because you're operating over the entire vector. But one thing you could do is, after you take the inner products for self-attention, pass them through something like a soft maximum operation: you pass them through this equation over here.
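To make the two attention-side tricks concrete, here is a minimal single-head sketch showing both QK-norm and the soft cap just referenced. The head dimension, the use of RMSNorm (which needs a recent PyTorch; a LayerNorm would illustrate the same idea), the cap value of 50, and the absence of a causal mask are all illustrative assumptions rather than any particular model's recipe:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StabilizedAttention(nn.Module):
    """Single-head attention sketch with QK-norm and logit soft-capping."""

    def __init__(self, d_model, head_dim=64, softcap=50.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, head_dim, bias=False)
        # QK-norm: normalize queries and keys before their inner product.
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)
        self.softcap = softcap
        self.scale = 1.0 / math.sqrt(head_dim)

    def forward(self, x):                       # x: [batch, seq, d_model]
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        # Soft cap: cap * tanh(scores / cap), bounding scores to (-cap, cap).
        scores = self.softcap * torch.tanh(scores / self.softcap)
        return F.softmax(scores, dim=-1) @ v    # causal masking omitted for brevity
```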
So you have your logits as the input, you divide by the soft cap, take a tanh, and multiply by the soft cap again. What does that do? Well, if your logits start exceeding the soft cap by a lot, the tanh clips them off at one, and so you get a maximum value of the soft cap, right? So this gives you, in some sense, soft clipping of the logits. Gemma 2, and I think OLMo 2, also do this; it hasn't been quite as popular otherwise. And here's the other piece of evidence against it: the NVIDIA folks I mentioned earlier actually tried quite a few different stability-improving interventions, and what they find is, you have your baseline model over here with a perplexity of 11.19; soft capping makes it worse, while QK-norm actually makes it better, because you can use more aggressive learning rates and push the optimizer further. Cool. Okay. So that's the end of the stability-improving intervention stuff. Does anyone have any questions? I think that's been kind of the new development over the last year. Yes. Question: for the QK-norm, I understand that during training you have the layer norm; at inference time, is the layer norm still being kept? Yes. So the question was, at inference time, do you still use the norm? And the answer is yes, because the layer norm has learned parameters. The whole action of the layer norm is that it takes an activation, normalizes it, and then scales it to some learned size. If you take that out, that's a huge change to the model; it will have no idea what to do with those unnormalized activations. Okay, cool. All right. So I have this last bit, the last few slides, that I want to end with. If we go over, we can always push this into the MoE lecture, but I think we also have a lot of content next time because I have to cover DeepSeek v3. So the last thing I want to cover is variations on the attention heads. Attention heads, I think, haven't had as much work done to them, but there have been a few important changes that you need to know about in order to understand the models that are being trained. The first thing I'll talk about is GQA and MQA. These aren't really critical to the training-time behavior of the models, but they're very important for understanding the inference costs and inference behavior of the models, and because this is an important architecture change, I'll mention them here, in addition to them probably being mentioned by Percy in some of the inference lectures. The other thing, a kind of new development I'll mention, is how the most recent models, like Llama 4, if you've heard of it, supposedly support 10 million tokens of context. How do they do that? Well, they do it by messing with the attention pattern in very structured ways, and I'll talk about that as well. So, GQA and MQA. If you've looked at some of the larger models, like the big Llama models or others, you'll have heard or seen the terms GQA or MQA, and I'll talk through what they mean. To set the stage, let's think about the compute you need to do attention. This is once again a 224n slide: you're going to take your queries X_Q and your keys X_K, and you're going to form your big quadratic attention matrix.
And you can walk through each of these matrix multiplies and convince yourself that the total number of arithmetic operations is going to be b times n times d squared, where b is the batch size, n is the sequence length, and d is the hidden dimension. And you can ask about the total memory accesses. That's going to be b times n times d, which is, for example, the cost of accessing just this X_Q matrix here, plus a softmax term of b times h times n squared; you can convince yourself of that by thinking about the size of the softmax matrix, which is batch times number of heads times all of the different softmax activations you have, and there are n squared of them. And then you've got a projection, so there are d squared projection operations at the very end over here. And so we can compare the total memory accesses to the arithmetic operations, which gives us something that will be very important in a couple of lectures, this idea called arithmetic intensity. We want our arithmetic intensity to be high: we want to be doing a lot of compute for every single memory access, because memory accesses are very expensive on a GPU, relatively speaking, and compute is relatively cheap. And in this batched computation that I'm showing you here, the arithmetic intensity, if you take the ratio of those two things, works out to roughly (1/k + 1/(b·n)) inverse, where k is the per-head dimension. And so this means we can keep our GPUs running, because if the per-head dimension is large and we have a large batch size and a large sequence length, those are all good, large numbers. Of course, this is what happens at training time. The issue is that at inference time, we do not have these big, chunky matrices to multiply together, and that really changes the behavior of our algorithms. So when we're generating text, remember that we have to generate a token, then the transformer has to read that token and process it, and then we can get the next-token distribution, and we do this autoregressively, one token at a time. And by doing this, we can't parallelize the generation process; we have to go step by step for every single new token. And when we do this, we're going to need to compute attention incrementally, an idea that people call the KV cache. So what do you do? This is a lovely animation of a KV cache being explained. If you look at this figure, you've got a query token: you've generated a new token, you're conditioning on it, and now you want to ask, what information should I look up in the past, given that query token? And your query tokens shift from one through n because you're generating new tokens one at a time. You're building up this key cache over here, where basically I'm accumulating all of the past tokens' keys, and the past tokens' keys don't change because they only depend on things in the past. So I'm incrementally, as I generate tokens, building up all of these past keys, and each time I can compute one new row of QK transpose. So the big attention matrix is going to be this lower triangular matrix.
I'm computing one row at a time, and that row is exactly what's necessary to generate the next token. So this KV cache idea, if you've not seen it before, is the idea of saying: I'm going to build up the Ks and the Vs incrementally as I generate each token, and I'm only going to compute the entries of QK transpose that are absolutely necessary for my operations. And once again, you can go through the arithmetic of how many flops we do and what the total number of memory accesses is. If you think about the KV cache, I'm only multiplying the absolutely necessary keys and values, since I'm saving all of the intermediate computations and not wasting any matrix-vector multiplies. The total number of arithmetic operations remains essentially the same, b·n·d², but the memory access patterns are now different. Why is that? Because when I do this KV caching thing, I have to move things in and out of memory repeatedly: whenever I multiply with the key matrix, I have to load it, multiply by it, put it away, and compute some activations, and so I'm repeatedly loading in different matrices. That gives me a much higher total memory access, something like b·n²·d plus n·d². And when you take the ratio now, the arithmetic intensity is not so good: you get (n/d + 1/b) inverse. So if we reason through this: if I want arithmetic intensity to be high, I want the thing inside to be very small, so I need really large batches and I need n/d to be small. What does that mean? I need really short sequence lengths or really big model dimensions. And this n/d term is really unfavorable, because I don't want a bigger model and I don't want a shorter sequence length, right? And so this is, in some sense, the core inference cost tradeoff that people face: you have this very bad memory access pattern, where this one term, n/d, is really killing you in terms of the throughput of your system. And so this motivates this thing called MQA. The key idea, which hopefully you can see from the figure back here, is that the part that's really bad is the keys and the values: they have this KV cache being built up, with memory moving in and out. So what you do is you keep multiple heads for the query, multiple query heads, but only one head for the keys and values. This immensely simplifies things: once you do this, you're moving much less information for the Ks and the Vs. So K and V are shared, but the query has many heads. You still have multi-head attention for the queries, but only a single K and V, and that's why it's called multi-query attention. And now, when you do the same kind of arithmetic, we have fewer memory accesses because we've shared the Ks and the Vs, and the arithmetic intensity is much, much better behaved, something like (1/d + n/(d·h) + 1/b) inverse. We've decreased the first term by a factor of n, so longer sequence lengths are now viable, and the second term is divided by the number of heads, so it's also not so terrible. So all the different terms are controlled now, and MQA can give you much better behavior. GQA, or grouped query attention, changes this slightly: instead of going from multiple queries all the way down to a single key and value, you reduce the number of key and value heads by some multiple.
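Putting the KV cache and the shared key/value heads together, here is a minimal decode-time sketch. Setting n_kv_heads to 1 gives MQA, setting it equal to the number of query heads gives plain multi-head attention, and anything in between is GQA; the function name, shapes, and sizes are illustrative assumptions, not any real library's API:

```python
import math
import torch
import torch.nn.functional as F

def decode_step(q, new_k, new_v, k_cache, v_cache):
    """One autoregressive attention step with a KV cache.

    q:                [batch, n_heads, 1, head_dim]     query for the newest token
    new_k, new_v:     [batch, n_kv_heads, 1, head_dim]  key/value for the newest token
    k_cache, v_cache: [batch, n_kv_heads, t, head_dim]  keys/values from earlier steps
    """
    # Append the new key/value; past keys and values never change.
    k_cache = torch.cat([k_cache, new_k], dim=2)
    v_cache = torch.cat([v_cache, new_v], dim=2)

    n_heads, n_kv_heads = q.shape[1], k_cache.shape[1]
    # GQA/MQA: each group of query heads shares one key/value head.
    k = k_cache.repeat_interleave(n_heads // n_kv_heads, dim=1)
    v = v_cache.repeat_interleave(n_heads // n_kv_heads, dim=1)

    # One new row of the attention matrix: the newest query against all keys so far.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    out = F.softmax(scores, dim=-1) @ v         # [batch, n_heads, 1, head_dim]
    return out, k_cache, v_cache

# Example: 8 query heads sharing 2 KV heads (GQA); n_kv_heads=1 would be MQA.
b, n_heads, n_kv_heads, head_dim = 1, 8, 2, 64
k_cache = torch.zeros(b, n_kv_heads, 0, head_dim)
v_cache = torch.zeros(b, n_kv_heads, 0, head_dim)
for _ in range(5):   # pretend we generate 5 tokens
    q = torch.randn(b, n_heads, 1, head_dim)
    new_k = torch.randn(b, n_kv_heads, 1, head_dim)
    new_v = torch.randn(b, n_kv_heads, 1, head_dim)
    out, k_cache, v_cache = decode_step(q, new_k, new_v, k_cache, v_cache)
```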
And so this lets you trade off between the inference-time behavior and the expressiveness of the model, because maybe going from multi-head all the way to multi-query is a little bit too aggressive. Some work shows that GQA doesn't hurt, whereas going all the way to multi-query does; I'm not going to get into that. I'm just going to close with this very last thing, which I think is a really interesting development from the last few months. Back in 2019, OpenAI had this kind of cool paper basically arguing how to build longer-attention models. And they were arguing, well, one way to do that is to come up with sparse attention patterns. So instead of paying attention to all of the sequence, I'm going to pay attention to, let's say, a local window within each chunk, and then I can have other attention patterns, like diagonals, that help propagate information across. So you can build sparse or structured attention that trades off various kinds of expressiveness versus runtime. GPT-3 used exactly these kinds of tricks when it was originally released to get larger attention windows. Sliding window attention is another variant of this idea, where at each layer you only pay attention to a small region around your current position. This also controls the total amount of resources you need in order to do longer context, and your effective receptive field is now the local window size times the number of layers. So those were the older ideas. The final trick, the modern instantiation, is that some of the recent models, like Llama 4 and Gemma and Cohere Command A, have come up with this very clever scheme of grouping transformer blocks, where in this case you have a set of four transformer blocks: the very bottom one uses full self-attention with no position embedding, so no RoPE, no nothing; it doesn't know about position at all, but it's full self-attention, and it only happens once every four blocks. And then the three blocks above it use sliding window attention with RoPE. This is actually a really clever trick for controlling both the systems aspect of things, because the full attention only happens every now and then, and the length extrapolation aspect, because RoPE only deals with local context windows, and anything that's really, really long range has no position embeddings at all, so it can extrapolate very aggressively, since you don't have to do the kind of position extrapolation that you do with something like RoPE. So that's a really cool development we've seen in the last couple of months. All right, I think we're coming up on time. Feel free to ask me questions about architecture or hyperparameters; I'll be happy to answer questions after.
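As a closing sketch of the interleaved pattern just described: one full-attention layer with no position embedding every few blocks, and sliding-window layers with RoPE in between. The one-in-four ratio follows the description above, while the window size and helper names are illustrative assumptions; the exact configuration differs across Llama 4, Gemma, and Command A:

```python
import torch

def layer_pattern(n_layers, period=4):
    """One full-attention, position-embedding-free layer every `period` blocks;
    the remaining blocks use sliding-window attention with RoPE."""
    return [
        "full_attention_no_rope" if i % period == 0 else "sliding_window_rope"
        for i in range(n_layers)
    ]

def sliding_window_mask(seq_len, window):
    """Causal mask where position i may only attend to positions [i - window + 1, i]."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # [seq_len, seq_len] boolean mask

print(layer_pattern(8))
print(sliding_window_mask(6, window=3).int())
```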