Stanford CS336 Language Modeling from Scratch | Spring 2025 | Scaling laws 2
Scaling Laws and Model Training Optimization in Large Language Models
Tags
Media details
- Upload date
- 2025-06-04 13:29
- Source
- https://www.youtube.com/watch?v=OSYuUqGBQxw
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Okay, so let's get started. Today is the second and last of the scaling laws lectures, and it's going to be a little more of a case-study, details-oriented lecture. I'm going to cover two separate kinds of things. The first is I'm going to go through a couple of papers where people have done careful scaling law studies as part of their model building, and I'm going to use that as a way to convey how modern large language model builders use scaling laws as part of their design process. So the motivation from last time and today is: what's the best practice for scaling a large model? We want large language models with good hyperparameters and good architecture choices. I've already told you about Chinchilla and using scaling laws to validate some of this, but I think you should have rightfully skeptical questions about scaling laws. It's curve fitting on a log-log plot; is it really as good as I said it was last lecture? Does Chinchilla's approach to scaling laws actually work? You're finding this out in your assignments. If you fit an IsoFLOP curve, is that really telling you about the right token tradeoffs? Can you use this stuff to set optimal learning rates? And should we be picking particular architectures or parameterizations to scale nicely? The last, or newest, paper with lots of detailed scaling studies that we talked about last lecture was the DeepMind Chinchilla paper. After that, ChatGPT happened, the competitive landscape of large language model building really changed, and people just stopped publishing anything about data and scaling and all of these things. It became very secretive. I've talked to people at some of the frontier labs and asked, "What are you guys doing for scaling?" and they say, "No, we will not tell you anything about what we do for scaling." So we have to rely on other sources for how scaling happens in practice. There have been several competently executed large-scale models that have done scaling, and last year in this lecture I covered three: Cerebras-GPT, DeepSeek, and MiniCPM. As a side note, last year I had to really strongly justify why I was covering these Chinese models, so to speak; this year, thankfully, you're hopefully already excited to hear about DeepSeek rather than me trying to convince you that this is the right thing to listen to. In the year since then, I've looked at a lot of the models that have come out, and in terms of new scaling law insights and papers the landscape is actually much sparser. I'll briefly mention some results from Llama 3, which came out toward the end of last year; Hunyuan-Large, which is an MoE model from China; and MiniMax-01, which is a linear-time, hybrid-attention, long-context model that came out this year. All three of those have some scaling studies, but nothing quite as extensive as DeepSeek or MiniCPM, which have really been the gold standard, I think, for modern scaling law studies. So that's one part of what I want to talk about today: I want to make sure you have an understanding of what scaling looks like in a real, semi-production model. The other thing I want to talk about, which is an important deep dive, is the muP method that I mentioned last time.
So, as a recap of last lecture, muP is an approach for the following problem: when we train these models and make them bigger, we need to change certain hyperparameters. On the left-hand side of this plot, you see that as you make models wider, in this case an MLP, the optimal learning rate shifts downward. So you need smaller learning rates for these bigger models. That's potentially a really big problem, because then you need to do hyperparameter tuning of your learning rates at the very large scale, and that's going to be very computationally expensive. If, on the other hand, we could parameterize our model differently so that the optimal learning rate just stays the same across all the scales, that's great; that really simplifies our search process. We would like all of our hyperparameters, and really our choices in general, to remain stable across scales. That's the ideal. muP is a very interesting class of approaches, and it teaches us some pretty interesting ways of thinking about the problem. So I'm going to actually go through some of the details of the math. In the year since I last taught this, a couple of very nice tutorials on muP came out, so I'm going to follow those because they have math that's pretty easy to follow. Then I'll talk about some work doing third-party validation and evaluation of muP-style methods. Okay, so the focus of the first part of this lecture, the case study, is going to be on three models. I mentioned three additional, more modern models, but the details in those are much sparser, and I think the lessons come primarily from these three papers. So I'm going to talk about Cerebras-GPT, MiniCPM, and DeepSeek. Each of these has a pretty different mix of scaling strategies, and each has different things to teach us about how to get scaling right. So we'll get started. Cerebras-GPT is the first of the models and scaling studies I want to talk about. It's a large family of models, from 0.1 to 13 billion parameters, trained with the Chinchilla recipe, so roughly the optimal token-to-parameter ratio. The Cerebras folks are actually pretty interested in a lot of these scaling and parameterization studies, and they have a really interesting core finding: they scale up this muP thing that I mentioned before, and they find that it makes scaling a lot more stable and a lot more pleasant to deal with. Just to show you the punchline: you've got test loss on the Pile, and you've got the scaling curves of Cerebras-GPT in blue, which is with standard parameterization, and muP in orange, which is the model they also train using the maximal update parameterization, and they show that it scales at least as nicely as, if not better than, things like Pythia or GPT-J. So that's nice. And the thing I want to emphasize here is that this is one of the few, if not the first, public validations of muP.
We know that all, or most, of the labs doing LM scaling pay close attention to how they parameterize their networks: their initializations as a function of the scale of the model, as well as things like per-layer learning rates, are things people pay close attention to in order to make scaling much more stable. So things like muP are pretty important in this space. Llama 4, for example, the paper for that isn't out, and I don't know if it will be, but they talk about a technique they call MetaP, which is a variant of this as well. So what Cerebras shows is that when they train models using standard parameterization, they get big oscillations around the predicted scaling point, this dashed line. They have oscillations due to the fact that, for example, they have to adjust the learning rate as a function of scale, and so it's hard for them to get the predicted performance exactly right, which is this dashed line from their scaling recipe. On the other hand, what they find is that if you have the muP scaling, you get this orange line, which is much, much closer to the scaling law fit for the muP version. So their claim, at least, is that using this alternative parameterization allows them to get much more predictable scaling and much nicer hyperparameter tuning. We're going to see this in more detail; I'll return to this slide again once I've gone through the mathematical derivation of muP. In case you're ever interested in implementing this thing, the Cerebras-GPT folks, and in general the kinds of artifacts the Cerebras research folks put out, are very helpful for muP, because they have a big table in the appendix that tells you exactly the difference between the standard initialization and parameterization, or SP, and the maximal update version, or muP. I'll give you the one-liner version: basically, every non-embedding parameter is initialized with one over the width, and then the learning rates per layer are scaled down by one over the width. So the interesting difference from standard parameterization, even if you're already doing one-over-width scaling on the initialization, is that there are per-layer learning rates that are different. I'm going to get to that later; I'm going to do a full derivation of this result, but you can see here this nice quick reference, and if you want to implement this thing, it gives you a very easy recipe for implementing muP. Another interesting thing we also see in some of the other scaling strategies is that you combine strategies like muP, which make hyperparameter selection stable, with very, very aggressive scaling down. What they do here is scale their experiments all the way down to 40 million parameters, do extensive hyperparameter search on this proxy model, and then scale things back up using muP to try to keep hyperparameters as stable as possible. This is what they see in their small-scale hyperparameter search: each of these dots is a model run with a hyperparameter setting associated with it, and then they pick the minimum across these runs, which gives them their chosen hyperparameters for the scaled-up runs.
This is a very clean approach to hyperparameter selection. It's unclear whether this level of aggressive scaling down is really what you want to do if you want to train these really, really large models, but this is one strategy we also see in MiniCPM and DeepSeek: training much smaller surrogate models and then trying to figure out how to stably scale them back up. That's going to be a theme we see throughout. And if folks have questions, please stop me. Maybe I'll stop here for a moment in case anyone has questions on the Cerebras-GPT piece, although maybe it will be clearer once I talk about the muP derivation later in this lecture. Okay. There's another paper, or artifact, I want to talk about: MiniCPM. For whatever reason, MiniCPM hasn't been talked about quite as much, especially in Western academic circles, but at least for me this was one of the first releases or papers I saw coming out of a Chinese research group where they had done some really cool, in-depth scaling and other kinds of research. It really felt like stuff coming out of the frontier. To give you an overview of what they do: their goal is to train relatively small language models, but use a lot of compute to train really good small language models. That's their ostensible goal, and in doing so, they do a lot of careful scaling computations. They also, once again, use muP to stabilize and simplify scaling when they scale these models up, not in size, but in terms of the amount of data. To try to convince you that this is a paper worth following: at the time they were trained, these were remarkably good 1.2B and 2.4B models. They beat out most of the 2B models that were out there and matched many of the modern 7B models, at least modern by 2024 standards. Now, of course, you've got even better 7B models; the arms race is fierce. But this should give you a sense that, given the compute and technology available back in mid-2024, this was really at the frontier, and they did something right to get models of this quality. And so, much like Cerebras-GPT, they have to have some strategy to get scaling right. Stepping back: you're going to do a really big model run. What do you have to do? You have to pick hyperparameters, you have to make sure those hyperparameters scale nicely, and then you scale up your model. So we can do the same thing as the Cerebras-GPT folks: pick hyperparameters at a small scale, hope they stay stable, and then scale everything up. And the way to do that would be to use something like muP. MiniCPM has exactly the same kind of strategy at play. For the embedding, you don't really do anything; you just scale it by a constant. Whenever you have a residual connection, like an MLP, you scale it by the square root of the number of layers. You initialize by one over the base width, and then the learning rates are also scaled by the width of the model. We see basically the same strategy, the same kinds of scaling factors, as in the Cerebras-GPT case, and they also end up with very similar parameters as Cerebras-GPT.
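To make those tables concrete, here is a minimal sketch, my own paraphrase rather than code from either paper, of how width-dependent initialization and per-layer learning rates of this style could be computed. The base values, the base width, and whether the depth factor applies to the initialization or as an output multiplier all differ between Cerebras-GPT and MiniCPM, so treat every constant here as an assumption.

```python
import math

def mup_like_scales(fan_in, n_layers, base_width=256, base_std=0.02, base_lr=1e-2):
    """Width-dependent init std and per-layer Adam LR, muP-style (hedged paraphrase).

    base_* values are the ones you would tune on a small proxy model of width
    base_width; they are placeholders here, not numbers from either paper.
    """
    width_mult = fan_in / base_width            # how much wider than the proxy
    init_std = base_std / math.sqrt(width_mult) # init scales like 1/sqrt(width)
    lr = base_lr / width_mult                   # per-layer Adam LR scales like 1/width
    residual_mult = 1.0 / math.sqrt(n_layers)   # extra damping on residual branches
    return init_std, lr, residual_mult

# Example: hidden matrices in a 24-layer model that is 4x wider than the proxy.
std, lr, res = mup_like_scales(fan_in=1024, n_layers=24)
print(f"init std ~ {std:.4f}, per-layer lr ~ {lr:.2e}, residual mult ~ {res:.3f}")
```

In both papers the embedding and output layers get their own rules, so a helper like this would only apply to the interior matrix multiplies.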
They have the same kinds of scaled embeddings and similar learning rates, off by a factor of two or so, but generally you end up in similar places for these hyperparameters. And then, once you have this, you're relying on your optimal learning rates to remain stable, so you just keep those roughly fixed. We know the aspect ratio is a pretty important thing, so we fix that after figuring out what the right one is, and then you scale up the overall model size, going all the way from 9M or 30M up to 0.5B or 1B parameter models. So they have roughly a 5x, maybe a little more, compute savings going from the smallest models they've got to the largest pilot-run models they have. Now you can use this to figure out optimal batch sizes as a function of scale. You want to figure out B_crit, the critical batch size. If you remember, the critical batch size is roughly the diminishing-returns point: as models get bigger, their losses get lower, and as the loss gets lower, you can make use of bigger and bigger batch sizes. So the critical batch size roughly tells you, for the model size and scale you're operating at, what an appropriate global batch size is for training. Much like the Kaplan paper, they follow a similar recipe. The plots look different from the Kaplan paper, but the underlying strategy is the same: they're trying to figure out the critical, or in this case optimal, batch size for training different models, and to find relatively predictable scaling relationships between the batch size and, for example, the data size or the loss. Vertical columns here represent a single training curve, and quadratics are being fitted to identify the minimum. The red line is the minimum across all of these points as we go upward, and this tells us the optimal batch size for a particular choice of model size and dataset size. At this point you can follow the same logic as the Kaplan paper for identifying the batch sizes; basically, you reproduce the same kind of plot. If you remember the Kaplan paper and the critical batch size discussion from two lectures ago (if not, pull up the lecture slides), the thing that's highly predictable is the relationship between the terminal loss you're trying to train to and the batch size at the critical batch size point. So we see, once again much like in Kaplan, a log-linear relationship between the target or terminal loss and the batch size you want. From this you can figure out what batch size to use: if you have a particular target scale, you can use scaling laws to figure out the loss you expect to get, and once you know the expected loss, you can back out what batch size you can operate at. So there's a fairly clean trend: polynomially increase the batch size as the loss decreases. Now, batch sizes do shift around as a function of target loss and thus compute, so we have to fit a scaling law for that one, but for the learning rate we already have muP.
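Here is a hedged sketch of that loss-to-batch-size fit: fit a line in log-log space between terminal loss and optimal batch size, then plug in the loss your scaling law predicts for the target run. The data points below are made up for illustration; the functional form, a power law in the loss, is the part that mirrors Kaplan and MiniCPM.

```python
import numpy as np

# Hypothetical (terminal loss, optimal batch size) pairs read off the bottoms
# of the fitted quadratics at several scales -- placeholder numbers only.
loss = np.array([3.6, 3.3, 3.0, 2.8, 2.6])
opt_bs = np.array([0.5e6, 1.0e6, 2.2e6, 4.0e6, 7.5e6])   # in tokens

# Fit log(B) = a + b * log(L); b comes out negative because the batch size
# grows as the loss falls, i.e. B ~ C / L^|b|.
b, log_c = np.polyfit(np.log(loss), np.log(opt_bs), 1)
print(f"fitted exponent b = {b:.2f}")

def batch_size_for(target_loss):
    """Back out a batch size from the loss a scaling law predicts for the big run."""
    return float(np.exp(log_c + b * np.log(target_loss)))

print(f"predicted batch size at loss 2.3: {batch_size_for(2.3):,.0f} tokens")
```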
And so, in theory, if the approach works at all, what we should now get is that the optimal learning rate is stable. On this plot we're seeing different model sizes, from small models in the light colors to their biggest models in the dark colors, each swept over different learning rates. The big models are only run for a little while for compute reasons, but you see a fairly clear trend, very consistent with some of the earlier results we've seen in Kaplan et al.: a relatively wide basin and then sharp increases as the model becomes very unstable. The important thing here is that the minimum remains fixed across relatively large orders of magnitude, from the small model to the big model. The minimum, or at least a tie with the minimum, is at the exact same point, roughly a 1e-2 learning rate. So this is a nice piece of evidence, some validation, that properly scaling your model initialization and your per-layer learning rates lets you avoid tuning learning rates over and over, or even fitting scaling laws on learning rates to predict the optimal learning rate. Okay. Then the final thing is that you might want to figure out model-size-to-data tradeoffs. If you're training small models, you're probably going to be overtraining them, or at least you want to justify to yourself why you're training on so many tokens, so you might want to replicate something like the Chinchilla analysis. The MiniCPM people had a really nice innovation here. Others have done similar things, but I think they were the first to really popularize it in the LM setting, especially in the context of Chinchilla-style scaling, and it's the following. Say I want to fit a Chinchilla scaling law. What do I need to do? I need to vary the number of tokens and I need to vary model sizes. So I'm going to fix a model size and train it for longer and longer. It would be nice if I could early-stop, take checkpoints of this model, and have those serve as the variations in dataset size, because earlier checkpoints have seen less data; it would be nice if a single run could collect all of this data-scaling information. Unfortunately, what I'm showing here is that the cosine learning rate schedules for different data targets are different. If you have a very small amount of data, the warm-up is always the same, but there's a very fast cooldown: you train for a little bit and then come down very quickly. If you have a lot of data, then you very slowly come down to the end. So your learning rates between a small-data training run and a big-data training run will be different. This is a very, very key point; lots of people get tripped up by it. You cannot use a single run of a cosine-learning-rate model, take early checkpoints, and reason about data scaling behavior from that. This bites people all the time. So, to avoid it, what you would normally need to do is train a model from the start to every single endpoint. You have to train to every single target, and this takes you to something like n-squared runs.
Some of those runs are small, but you basically have to run lots of runs, each with a target termination point, rather than using a single run and collecting checkpoints. It feels kind of senseless that we have to do this. So the MiniCPM folks popularized this idea of a WSD, or warmup-stable-decay, learning rate. This plot on the left shows what's going on. Normally we would train with something that looks like this cosine learning rate shown in yellow: there's a warm-up period, usually very short, to get to your full learning rate, and then a cosine that goes all the way down to the termination point, and maybe you stay at your minimum learning rate. This is all optional: you might terminate there, or you might go all the way to zero. So a cosine learning rate looks like this, and the issue, of course, is that if I have a different target, the cosine is totally different, so everything past the warm-up can't be reused. Now if you look at this new WSD schedule, which is basically a trapezoidal learning rate, it has three phases: a warm-up phase that's the same as a cosine, a stable phase that's flat, and then a decay phase that rapidly cools the model down to its minimum learning rate. Of course, you can have variations: you can go up, down, and then stay stable at your minimum; any of these variations. But the simplest form to think about is warm up, stable, decay, terminate. Why is this nice? It's nice because you can reuse the stable part. So if you want to do Chinchilla in almost one run, what you do is warm up, have a stable run all the way to the end, and then cool down. Then, if you want to figure out how your model would have done with less data, you rewind to an earlier checkpoint and do another cooldown, and now you've got an exact warmup-stable-decay learning rate shape without having done the training from the beginning. This is a very nice thing: the fact that the stable part is flat allows you to do Chinchilla-style scaling, or data scaling, in a single training run, or for mostly the cost of a single training run. A lot of people now do this; it works very well. MiniCPM, I think, popularized it, a lot of people have since adopted it, and we see WSD-style schedules in many, many places. You see curves that look like this: if you have a cosine learning rate schedule, you'll see a relatively predictable, smooth decay toward your terminal loss, like this yellow line here. If you train with WSD, you'll see much funkier learning curves, like the darker lines above: you've got your warm-up phase, which doesn't really show up in the training curve because it's so short; then your stable phase, where the loss goes down normally; and then, as soon as you hit your decay phase, the cooldown part, your loss rapidly drops off until you hit your zero or minimum learning rate, at which point you've got your terminal loss. These loss curves may look very disturbing to you, but they're actually pretty normal when you're training with these rapid-cooldown learning rate schedules.
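As a concrete sketch of the schedule being described, here is a minimal warmup-stable-decay function alongside a cosine one for comparison. The phase fractions and the linear shape of the cooldown are assumptions for illustration; the papers use their own cooldown lengths and decay shapes.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr=0.0, warmup=500):
    """Standard warmup + cosine decay; the whole shape depends on total_steps."""
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

def wsd_lr(step, total_steps, max_lr, min_lr=0.0, warmup=500, decay_frac=0.1):
    """Warmup-stable-decay: only the last decay_frac of steps depends on the
    chosen termination point, so the flat phase is checkpoint-reusable."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return max_lr * step / warmup
    if step < decay_start:
        return max_lr                      # flat, reusable phase
    t = (step - decay_start) / max(1, total_steps - decay_start)
    return max_lr + t * (min_lr - max_lr)  # linear cooldown (one common choice)

# To "rewind and re-decay", load a stable-phase checkpoint at step s and rerun
# wsd_lr with total_steps set to s plus a short cooldown budget.
print([round(wsd_lr(s, 10_000, 3e-4), 6) for s in (0, 500, 5_000, 9_500, 10_000)])
```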
And maybe the point to make here is that at every single token count, the warmup-stable-decay curve's minimum point beats or matches the cosine learning rate. That's not always the case; there can be cases where cosine works better or WSD works better. But in general, what people tend to say is that the two schedules are roughly comparable, and WSD has the additional nice advantage that you don't have to commit to your termination point: you can repeatedly cool down to get checkpoints at different data counts. Okay, cool. And then, of course, other things have appeared for trying to estimate Chinchilla-style tradeoffs. Some folks, a collaboration of UW (and formerly UW) and Apple people, had a paper on estimating the over-training penalty: as you keep adding more and more data, how much worse is your loss than if you had scaled according to Chinchilla? So you have your baseline here, which is M equals 20, twenty tokens per parameter, and you can think about what happens if you train with 320 tokens per parameter: you get a separate, parallel scaling line, and then another line, the circles, for training at 640, with the darker one beyond that. And the thing they show is that, instead of doing the WSD-style thing, another option is to figure out how much your model degrades as a function of higher token-to-parameter ratios. That turns out to also have a fairly predictable shape, and you can extrapolate it based on the degradation in small training runs. I don't think I've seen large-scale training runs using this idea, but it's an additional cool thing to know: you could essentially do Chinchilla in almost one training run by extrapolating the excess-token penalty at small scale as well. So, okay, going back to MiniCPM. Now we have the tools we need. We have the WSD learning rate, which allows us to do essentially one training run, and that one run lets us vary the data as we go along. Then we have multiple training runs for different model sizes. That gives us all we need to do the Chinchilla analysis. They use method one and method three, if you remember what those are. Method one is you overlay all the learning curves and take the lower envelope, and the lower envelope of all the training curves is supposed to be roughly a power law. Method three is you jointly fit this equation: you hypothesize this two-variable scaling law and fit it to all the data you have, in curve-fitting fashion, and that lets you solve for the optimal token-to-parameter ratio through the fit. So they do both. For method one, they see fairly clear, although not perfectly linear, trends that let them go from compute to token ratios. And their primary approach, the one they use to justify a lot of their design decisions, is method three, the curve fitting. The contours you see here are the curve they fit, and the dots are the small-scale runs they did to fit the Chinchilla parameters.
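To make "method three" concrete, here is a hedged sketch of the joint fit: hypothesize L(N, D) = E + A/N^alpha + B/D^beta, fit it to (model size, tokens, loss) triples, and read off the compute-optimal allocation. The inputs below are synthesized from made-up constants purely so the snippet runs end to end; the real inputs would be the WSD-cooldown checkpoints across model sizes.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    """Two-variable scaling law L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) grid standing in for the real cooled-down checkpoints.
rng = np.random.default_rng(0)
N = np.repeat([9e6, 30e6, 1e8, 3e8], 4)
D = np.tile([2e9, 8e9, 3.2e10, 1.28e11], 4)
true_params = (1.7, 400.0, 410.0, 0.34, 0.28)       # placeholders, not MiniCPM's
L = chinchilla_loss((N, D), *true_params) + rng.normal(0, 0.005, N.size)

(E, A, B, alpha, beta), _ = curve_fit(chinchilla_loss, (N, D), L,
                                      p0=(1.5, 300.0, 300.0, 0.3, 0.3),
                                      maxfev=50_000)
# With C ~ 6ND, compute-optimal allocation gives N_opt ~ C^(beta/(alpha+beta))
# and D_opt ~ C^(alpha/(alpha+beta)); the D/N ratio follows from the constants.
print(f"alpha={alpha:.2f}, beta={beta:.2f}, "
      f"N_opt ~ C^{beta/(alpha+beta):.2f}, D_opt ~ C^{alpha/(alpha+beta):.2f}")
```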
And this is used to justify what they do: they find very, very high token-to-parameter ratios, so high that I feel this is an outlier that doesn't agree very closely with most of the other literature. They argue that Llama-style architectures should all have a higher ratio because of improved data quality and improved model efficiency, but their token-to-parameter ratio estimate is really, really high, 192 tokens per parameter, which I don't think I've seen anyone else derive. Other people have done replications of Chinchilla; I don't think anyone has really argued for 192 tokens per parameter. Regardless, we have seen that recent models like Llama 3 have significantly higher data-to-model ratios, and we also don't really see diminishing returns: those models aren't way worse than the equivalent Chinchilla-scaled models like Llama 2. This suggests that with careful optimization and careful tuning, we should be able to go far beyond the 20-times-model-size rule of thumb. So if there's one thing you take away from these last two slides, it's maybe not that you should trust whatever scaling law fit MiniCPM did, but rather that the Chinchilla analysis isn't really a strong constraint: 20 times model size is just a starting point, and you should feel free to significantly increase that token-to-parameter ratio. Finally, the curve fits they get generally look pretty good. These are the scaling law curves for data and model size scaling, with perplexities on code and English. They do have some really weird outliers that I don't really understand, but their fitted scaling laws are generally pretty good as they increase the amount of data on their relatively small model. So this is one example of a large-scale training run's scaling recipe. I'll stop here; things like WSD are probably new to you, so if you have any questions, please feel free to ask, about that or any of the other bits, including the Chinchilla replication and muP and so on. Okay, sure. [Student question, partly inaudible, about whether the main change in muP is the weight initialization.] So the question was whether the main change in muP is the initialization. There are two things that happen when you derive and implement muP. One is that the initialization changes. The other is that the learning rate changes, and it changes per layer, which is probably a more exotic object than many of you are used to. The initialization actually is not that different: if you're already using a standard Kaiming-style initialization, that's already one over the square root of the fan-in, which is going to be roughly the right thing. Whereas for the learning rate, unless you're doing something really exotic, you're normally using a global constant learning rate everywhere, so the per-layer learning rate is going to be a big difference from what you're normally training with. You can think of that as the practical difference for a lot of the muP implementations. Yes? [Student question, partly inaudible: when the learning rate is kept constant in the stable phase, the loss curve stays very close to the cosine run's curve — why?] Yeah, so you're talking about this curve, and you're saying that when we're in the stable phase of WSD, up here, the curve remains pretty close to the cosine one, and why is that? Well, it's kind of close, but also not really.
If you look at this last curve over here, there's a big gap between cosine and WSD before we enter the decay phase. And I think this is one of the pretty interesting mysteries about deep learning optimizers. Clearly, you need a stable phase to get you far from your initialization, but the cooldown phase is also where you get most of your gains in the loss; if you don't cool down, you take a gigantic hit in your loss. So the cooldown is really critical, and a lot of the gap between cosine and WSD here, this relative gap, is all from the cooldown. A lot of optimizer and learning rate design is about this balance: how do I keep learning rates high to travel far from my initialization, but still decay the learning rate enough to anneal my loss down to a very low value? So the other paper I want to talk about is DeepSeek. This is the original DeepSeek LLM paper from 2024. In many ways, if you read the original DeepSeek LLM paper, you can tell that these are very serious science people, because they do a lot of very careful scaling ablations and they're really trying to get it right when they scale up. And that's an attitude shared among the players that do get scaling right. They have 7B and 67B parameter models, with very high performance at the time relative to Llama, which was really the primary competitor; at the time, Llama 2 and Mistral were the big players. DeepSeek comes in and is able to match that performance. It's not quite the flashy impact of DeepSeek-V3 coming in and matching OpenAI's GPT-4o, but for a first attempt this is a pretty remarkable result. So let's dig in and try to understand what DeepSeek did that allowed them to go from essentially zero to, at least for open source, state of the art at the time. And I think DeepSeek, more than most other players, the only comparable one maybe being MiniCPM, is very open about a lot of the experiments they did and the approach they used to choose these hyperparameters. Immediately we see one difference between DeepSeek V1 and MiniCPM, and also Cerebras-GPT: they don't use any muP, and they directly try to estimate both the optimal batch size and the optimal learning rate. It's a really direct method, you might call it, and it requires a pretty strong belief in scaling laws. What they do is take two relatively small models and run a grid over different batch sizes and different learning rates, and they get losses across this grid. They do the same thing at a larger scale, and from that you can read off the optimal batch size and learning rate. And they're saying, all right, this is a pretty wide basin, so maybe we don't have to be too scared about messing this up. So then: they know that the choices of learning rate and batch size are both relatively forgiving, but we do want to get the order of magnitude of these things correct. So how do we get the order of magnitude correct?
Well, what we're going to do is train a bunch of models with different amounts of non-embedding FLOPs, and we're going to sweep across a grid of the parameters I had before, both the batch size and the learning rate. By varying these, we get the optimal batch size and the optimal learning rate across these different scales. You can imagine making these grids across many different FLOP scales and marking down a star for each one. Perhaps unsurprisingly, because this is the scaling laws lecture, these things seem to follow a scaling law line, at least for the batch size, where things seem more clear. You can fit a line here and extrapolate out to the big models you're going to train to get what your optimal batch sizes should look like. They do the same thing with learning rate: they fit this line and say these are the two learning rates we're going to use. It might be because the points are plotted on top of each other, but I find this line somewhat suspicious looking; they probably could have fit a horizontal line and that would have also looked okay. Even as a scaling law enthusiast, I'm not sure I would bet my life on this one to pick the learning rate, but they did, and that's how they get the learning rate. Now, they also follow best practices at the time: they do a Chinchilla-style analysis, and they once again use a WSD-style learning rate where they're trying to minimize the amount of repeated work. They do something a little non-standard, where they do warm-up, then stable, and then two sets of decay steps decaying down to zero, so it's like two decay phases consisting of roughly 10% plus 10%; they analyze different choices of that decay phase and it doesn't seem to matter very much. But generally speaking, about 20% of the total compute budget is spent on the cooldown phase. They also show, once again, that it matches cosine learning rates, and once again the advantage is that we can do Chinchilla-style analyses very cheaply. In contrast to the learning rate fits, the Chinchilla-style analysis fits really, really cleanly. I think this is a broad lesson: when you look at lots of people's scaling laws, the stuff on hyperparameters always looks a little noisy and tenuous, but the IsoFLOPs analyses from all of the players always look very, very nice. So this is a replication of the Chinchilla result: you see different compute scales, different quadratics, we draw a line through the bottoms of the quadratics, and we get exactly the optimal model size and optimal token count as a function of training FLOPs. This gives a very straightforward way of analyzing the token count versus model size tradeoff, and it allows them to do everything from scratch.
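Here's a hedged sketch of that IsoFLOPs procedure: for each compute budget, fit a parabola in log model-size to the losses, take its vertex as the compute-optimal size, and then fit a power law through the vertices. All numbers are made up for illustration; only the procedure mirrors what the paper describes.

```python
import numpy as np

# Hypothetical isoFLOPs data: for each compute budget C, several model sizes N
# trained to that budget (D = C / 6N) with their final losses.
budgets = [1e18, 1e19, 1e20]
runs = {
    1e18: ([2e7, 4e7, 8e7, 1.6e8], [3.45, 3.30, 3.32, 3.50]),
    1e19: ([6e7, 1.2e8, 2.4e8, 4.8e8], [3.05, 2.92, 2.94, 3.10]),
    1e20: ([1.8e8, 3.6e8, 7.2e8, 1.4e9], [2.72, 2.60, 2.63, 2.78]),
}

optimal_N = []
for C in budgets:
    N, L = map(np.asarray, runs[C])
    a, b, c = np.polyfit(np.log(N), L, 2)     # quadratic in log N
    optimal_N.append(np.exp(-b / (2 * a)))    # vertex = argmin of the parabola

# Fit N_opt ~ C^slope through the minima and extrapolate to the target budget.
slope, intercept = np.polyfit(np.log(budgets), np.log(optimal_N), 1)
target_C = 3e23
N_star = np.exp(intercept + slope * np.log(target_C))
print(f"N_opt ~ C^{slope:.2f}; at C={target_C:.0e}, N_opt ~ {N_star:.2e} params, "
      f"D_opt ~ {target_C / (6 * N_star):.2e} tokens")
```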
As a side commentary, I think it's really nice that they're redoing a lot of this. They could certainly have cargo-culted Chinchilla and just picked 20 tokens per parameter, but they said no: let's actually go and do the scaling law analysis and make sure the token counts are appropriate for us. And then they have a fitted scaling law at the very end. This is in some ways not surprising, because after they fix their scaling strategy, they do predictable scaling: they try to predict what happens at the 7B and 67B models. It's unsurprising in many ways, but very nice, that they're able to extrapolate from about 10^20 to 10^24 FLOPs and actually nail the prediction on the basis of the scaling law. It's very nice to see that we can get predictive measures of model capabilities before we actually train them. So that's the DeepSeek part. Does anyone have questions about the DeepSeek strategy, what they did, or any of the other pieces? I think WSD was probably the newest thing I've mentioned today; the other thing DeepSeek does is directly fit a scaling law to the optimal learning rate and batch size rather than using something like muP. Yes? [Student question, partly inaudible: is it a single global learning rate they're fitting?] Yes, they have a global learning rate, so they're tuning that global learning rate. Okay. [Another student question, partly inaudible.] Yeah, so the question was: do people redo this kind of analysis for new frontier models? To be honest, I'm not actually sure, and I'm beginning to think that a lot of people maybe don't exactly replicate some of this, because we see increasingly fewer scaling details in the newer papers. Even from DeepSeek, for example: in DeepSeek-V2 and then V3, we see a lot of emphasis on the new parts of each paper. For DeepSeek-V2, we see a lot of emphasis on MLA and the architectural improvements, and in DeepSeek-V3 we see a lot of the systems components being emphasized, like the low-bit training, but we don't see, in either of those, any additional new scaling law studies. My guess is that there's not much new there: maybe they're replicating it just to make sure it works, but there's nothing new to report. And I think that will be captured in the next couple of slides, where I'm going to talk about scaling laws in papers and models from the last year or so. I did a brief survey, and actually there's nothing at the level of detail of either MiniCPM or DeepSeek; those are really still, I think, the most detailed open studies into scaling that we have in 2025. Cool. Okay. So Llama 3 was probably one of the bigger model releases in the past year since I last taught this class, and they do have some pretty interesting scaling bits. For one, on the question of whether people actually replicate these analyses once they've run them once: well, kind of yes. Llama 3 redoes the IsoFLOP-style Chinchilla scaling laws, and they find roughly that the optimal ratio, if I got the calculation right, is about 39 to 1. I do think this is interesting, because Chinchilla got the 20-to-1 ratio, and many of us have trained models at the Chinchilla ratio in our research. It's quite clear that the 20 isn't really that stable: other people who have fit this have generally been getting slightly higher ratios than before, and that might point to things like improved algorithmic efficiency, architectures that learn better from data.
It might mean something else, like improved data quality. All of those are moving parts, so it's hard to know what's leading to these slightly different ratios, but the results seem fairly clear: the fits are relatively good, and they do get roughly a 40-to-1 ratio. The other thing, which is close to the data scaling material from the early part of my first scaling lecture, is that the Llama 3 folks try to correlate compute with NLLs, i.e., log losses, and then correlate those NLLs back to downstream accuracies. The thinking is that they would rather not scale against log likelihoods; that's not really the thing they truly care about. They care about improving benchmark numbers on MMLU or LAMBADA or whatever other benchmarks they've decided to hill-climb on. And if that's the case, then what they need is a conversion from these NLLs per character, these perplexities or perplexity-equivalents, to accuracies. So they've done some studies in Llama 3 relating the two by fitting sigmoids, showing that if you fit these small models plus some Llama 2 models and fit a sigmoid on the whole thing, you can accurately predict the performance of Llama 3 405B on the basis of those fits. It's interesting; I think they say they use these kinds of ideas for data selection, but there's not much detail there, and it's unclear whether this was a really core object when Llama 3 was being trained or more of a side scaling study of interest to the authors. Another recent work on yet another nicely executed Chinese LLM is Hunyuan-Large (hopefully I didn't butcher the pronunciation). They're training MoEs, and because they're training MoEs, they want to redo the Chinchilla-style analysis. Once again they do an IsoFLOPs analysis, fit quadratics, figure out the minima, and they get a different token-to-parameter ratio: 96-to-1 data to active parameters. These ratios are obviously going to be quite different because they're training MoEs, and there are lots of differences in the architectures; we don't really expect the same numbers as Chinchilla. So we do see, in various papers, replications of Chinchilla happening again and again, because a lot of these people are very interested in understanding how far they can push the token-to-parameter ratio. We would like to stay on the higher end of that, with more data than parameters, because then people will actually use our models, or our models will be cheap to serve. For all those reasons, people keep replicating Chinchilla, and I think this is one of the best-replicated results in scaling. In many ways, the actual 20-to-1 ratio isn't the thing that consistently replicates; it's the fact that you can do IsoFLOPs, fit the minima, and get these very predictable tradeoffs from FLOPs to optimal parameters that is quite clean and consistent across the replications.
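Going back to the Llama 3 two-stage idea described a moment ago, here is a hedged sketch of what such a fit could look like: a power law from compute to task NLL, then a sigmoid from NLL to accuracy. All numbers, and the exact sigmoid parameterization, are placeholders, not Llama 3's.

```python
import numpy as np
from scipy.optimize import curve_fit

# Stage 1: compute -> NLL per character on the benchmark (log-linear fit).
# Stage 2: NLL -> downstream accuracy (sigmoid fit). Placeholder data only.
flops = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
nll = np.array([0.92, 0.85, 0.78, 0.72, 0.66])
acc = np.array([0.31, 0.38, 0.48, 0.58, 0.67])

p1 = np.polyfit(np.log(flops), np.log(nll), 1)          # stage 1

def sigmoid(x, lo, hi, mid, slope):
    # accuracy rises from lo toward hi as the NLL falls below mid
    return lo + (hi - lo) / (1 + np.exp(slope * (x - mid)))

p2, _ = curve_fit(sigmoid, nll, acc, p0=[0.25, 1.0, 0.8, 10.0], maxfev=20_000)

def predict_accuracy(target_flops):
    pred_nll = np.exp(np.polyval(p1, np.log(target_flops)))
    return float(sigmoid(pred_nll, *p2))

print(f"predicted accuracy at 4e25 FLOPs: {predict_accuracy(4e25):.2f}")
```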
Okay, the last one, which is honestly a little more of an exotic use of scaling laws from the last year, is MiniMax-01, which came out pretty recently. MiniMax-01 is a kind of linear-time, long-context language model released by another Chinese startup. Their interest is the following: they take softmax attention, which is quadratic; they have this thing called lightning attention, which is a kind of linear attention layer and is linear time; and then they have a hybrid version of the model. They want to figure out how much cost they're paying, in terms of model performance, going from softmax to linear to hybrid attention. So they basically replicate method one from Chinchilla, looking at the lower envelope of the loss curves as they train, and they look at the implied optimal model size and the implied optimal token count as they go. Roughly, the conclusion they draw is that the lightning and hybrid models perform about the same as softmax attention, and thus they're okay to train long-context models on the basis of these architectures. We've seen these kinds of plots very often in research papers: if you look at the Mamba paper, the Mamba-2 paper, the DeltaNet paper, or any of these other linear-time-complexity RNN papers, you'll see plots that look a lot like this, where they say the full attention scaling and my linear attention scaling are basically the same as a function of compute. But this is, I would say, a rare case of the same plot being produced almost at scale for a major artifact release. Okay, so putting all that together: I know that was a bunch of mini case studies that I went through fairly quickly, but I want to step back and recap. We've seen several common ingredients being used in the scaling recipes. We've seen Cerebras-GPT, DeepSeek, MiniCPM, and then the few newer papers since. Cerebras-GPT and MiniCPM both use muP as a way to make hyperparameters more stable across scale, and MiniCPM especially has the nice WSD schedule, which is a thing they popularized to be able to do Chinchilla-style scaling. Cerebras doesn't bother to replicate Chinchilla. DeepSeek does a somewhat different thing: they assume most hyperparameters just don't change with scale, but they do a full scaling analysis on batch size and learning rate, and then use scaling laws to figure out optimal settings. I've already noted that some of the scaling fits look more suspicious than others, but really this is a way to at least get the order of magnitude right. They use IsoFLOPs analysis; they replicate Chinchilla once again to figure out the model sizing and to make sure they're in the right order of magnitude. Of the more recent releases, Llama 3 and Hunyuan do IsoFLOPs analyses; only Llama 3 does a little bit more, but that's basically it. And then MiniMax does the more interesting thing of justifying architecture choices through the lens of a scaling law. But generally speaking, there are a few things that get replicated: Chinchilla, learning rate, and batch size are really what people are deeply concerned about when they're scaling models up, and they do things like fixing the aspect ratio and just scaling the total model size up. That's generally how people handle a lot of the moving pieces of scaling up. Okay.
Any questions about the case study pieces? Actually, I'm going to stay here for a moment and make sure I've covered any questions people might have. Okay, cool. So the second and last part of this lecture is going to be understanding muP. Hopefully, through the case studies, you've seen that getting the learning rate right is one of the core concerns people have, and also the batch size. In general, we want scale-invariant hyperparameters. And it is the case that our choice of initialization and our choice of per-layer learning rates are essentially arbitrary: there's no reason why we have to initialize one way and not another. So if we could manipulate those variables to get scale invariance in our learning rates, that would be really wonderful: it would make our lives way easier, and it would make small-scale experiments much more viable. I'll first talk you through the math of this: how it's derived, what the justification is, and what the core conceptual objects are behind trying to make models scale predictably. Then I want to talk about a pretty nice preprint by an independent researcher that is basically a bunch of ablations on muP: what makes it break, what is it robust to, does it work on a real transformer language model? These questions are explored pretty well in that preprint, which I'll talk about at the very end. So okay, what is muP anyway? I feel like maybe I've jumped the gun over the last few lectures, because I've mentioned what it is without really giving you the core conceptual object it's based on. On the other hand, I think I'm justified in doing that, because most of the literature doesn't explain muP that clearly either; they just say, scale the initialization by one over the width and scale the per-layer learning rate by one over the width, that's muP. But I think the ideas behind muP are pretty interesting and worth discussing, because they speak to some core objects that recur in deep learning in general. I'm going to base my slides off this preprint; if you're interested in reading about muP, I would point you to this one. I think this, and another blog post called a practitioner's guide to muP, are the two readable descriptions of this paradigm. The math is, for whatever reason, not exactly the same across these different presentations, so to clarify, I'm basing the math on this one. So muP is based on the following relatively simple ideas. There are two things we think should happen when we're training a neural network. When we scale a neural network, we're going to make, in this case, only the width of the network bigger, and we're going to fix the number of layers, the depth. Now, as I make the width bigger, I want the activations at initialization to remain big-theta of one: roughly constant, bounded above and below by universal constants, as I make the width bigger. They shouldn't blow up, they shouldn't vanish. That seems like a pretty natural thing to want, and this is per coordinate. Now, the second assertion is: I'm going to initialize my model and take a single gradient step.
And when I take that single gradient step, I want to make sure that the change in activations is also big-theta of one. Both of these seem like very natural conditions, because if you violate them, then as I make the models bigger, either the initial activations will blow up or vanish, or after one gradient step my activations will blow up or vanish. Those are both bad conditions. As a note, I'm talking about individual activations, i.e., coordinates. So if you're thinking about the norm of an entire vector of activations, that should look like big-theta of the square root of n_l, because each coordinate is roughly independent, so the norm looks like the square root of the width, the number of elements in my width dimension. So I can derive muP from those two conditions. The first condition, that I want my activations to remain stable, imposes constraints on the initialization. I'm going to walk you through a very, very simple example: a deep linear network. So h_l is the activations at layer l, and it's a function of the weight matrix at layer l and the activations from the previous layer. No nonlinearities, no fancy stuff; forget all those complexities. If you want the complexities, you can go read the preprint; they explain in slightly hand-wavy terms why those things don't matter. Now, for the initialization, I'm going to pick a Gaussian initialization: zero-centered, a rectangular matrix whose shape depends on the sizes of my activations, and one hyperparameter, which is the noise scale sigma of the matrix at this layer (there should be a subscript l on this sigma). So now, what can we say? I want to understand the size of h_l at initialization. One thing we can do is consider the limiting behavior of this system: I take n_l and n_{l-1} to infinity. If I do that, this W is going to concentrate; it's a random Gaussian matrix, and if you remember your random matrix theory (not a prerequisite for the course), the operator norm of a Gaussian matrix roughly concentrates to this object: sigma, the noise scale, times the sum of the square roots of the two dimensions. And importantly, you can write down roughly this equivalence: the norm of the activations at layer l is approximately equal to the operator norm of W_l times the norm of h_{l-1}. This roughly assumes that W_l is independent of h_{l-1}, which holds at initialization, so you can basically treat that approximation as exact if you like. Now I'm going to pick a particular choice of sigma, namely the square root of n_l over n_{l-1}, divided by that concentration factor on the right-hand side; that's the exact form, and there's a more asymptotic form you can think of it as.
But really, it's just one over the square root of the fan-in of your layer, times the minimum of one and the aspect ratio of your layer; in case your fan-in is much larger than your fan-out, that correction kicks in. Okay, so let's say I pick this sigma, roughly one over the square root of my fan-in. What happens? I can plug this back into the matrix concentration limit and also this approximation here, and I can inductively prove that every layer will have the right activation size. So let's go through the layers and assume that up until layer l minus one I have this property; that's the inductive assumption: at layer l minus one, my activation norm is the square root of n_{l-1}. Now I plug everything in: I plug the square root of n_{l-1} into this component, I plug the limit into the operator norm of W_l, and for sigma I plug in this expression over here. You see that this inverse cancels, and you get exactly that the l2 norm of h_l is equal to the square root of n_l. That's the thing we wanted, because, remember, we said we want to make sure the activations remain big-theta of one per coordinate, which means the norm should be the square root of n_l. So that's exactly what we get, plus some lower-order terms. This is a fairly clear, step-by-step argument that shows you the right thing to do for initialization: pick one over the square root of the fan-in, plus a small correction factor, to make sure the activations don't blow up at initialization. I'll pause here for a moment in case someone has questions. I feel like this is maybe the first real math we've done in the class, so it may be a bit of a context switch; I did not warn you that I was going to talk about a bit of math. Okay, is this all relatively clear? One over the square root of the fan-in? Yes? Okay, I'm going to assume everyone's on board with one over the square root of the fan-in. So now we're going to derive the second part of muP. The first part of muP was about initializations; the second part is about learning rates. How are we going to think about learning rates? To think about learning rates, I'm going to look at the second condition, A2, which says that when I take one gradient step from initialization, the update to my activations needs to remain constant in size: it can't blow up, it can't vanish. So what does that mean? If I have an update delta-W_l on the weights at layer l, where does it come from? Let's say I'm doing SGD: it comes from this expression, the learning rate times the gradient of the loss, times the activations transposed. In the case that my batch size is one, this is a rank-one update to W_l. And because it's rank one, there's a nice, easy expression: the norm of the change in W_l applied to the previous layer's activations is equal to the operator norm of the change in W_l times the l2 norm of h_{l-1}. Now combine this with the fact that the change in activations at layer l is given by this expression.
You can convince yourself of this: you can write it out by figuring out what the final activation is at layer l after the update and canceling out W_l h_{l-1}, which is a shared term across the left and right sides; then you'll get this expression for the update in h_l. This is the object whose norm we want to keep roughly at the square root of n_l. So let's look at each of these terms and figure out its magnitude. The first term here, W_l delta h_{l-1}, we can assume is controlled by the inductive assumption plus the condition A1 argument, right? The inductive assumption says that delta h_{l-1} is of size square root of n_{l-1}, and then W_l maintains that order. The more complicated parts are the second and third terms, delta W_l h_{l-1} and delta W_l delta h_{l-1}; sorry, that's quite a mouthful. They actually all have the same order of magnitude, and the only thing we really need to figure out is this expression here: the product of the previous layer's norm and the operator norm of delta W_l. Because we don't really know how big the update to the weight matrix W is going to be; if we knew that, all of this would be very straightforward. Okay. And so the remaining argument is actually relatively straightforward. Even though this is a complicated jumble of things, the intuition is very clear. The intuition says: okay, what do I really need to figure out? The one thing I really need to figure out is this expression here, how much the weights at layer l change after one gradient step. If I can figure that out, then I can derive all of the relevant quantities and solve for the learning rate. At a high level, that's our strategy. So how can we possibly figure out how much delta W_l moves after one gradient step? That's really the key question. Well, there's an additional, sort of sneaky assumption that shows up here, and it's something like this: if our learning is well behaved, then after a single gradient step, the change in the loss, delta L, also has to be big theta of one. Why is that? Because we don't want the decrease in our loss to blow up or go to zero as the width goes to infinity, right? We want the improvement in loss to remain roughly the same order of magnitude no matter how big our models get. That's a stronger assumption than what we've had before, but assuming it, we can essentially say: okay, the change in the loss is roughly the gradient multiplied with the change in the weights. The left side is big theta of one, and we know how delta W_l relates to the gradient, so now we can solve for the gradient size. And once we have that, we can plug it back in here: we know delta W_l, we know the gradient of L, we know the size of the activations from condition A1, and now we can solve for the learning rate. That's exactly what you get at the bottom here, and if you work through the arithmetic, the final result is that the learning rate for SGD is equal to the fan-out over the fan-in.
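To make the chain of substitutions concrete, here is a compact version of the argument just described, written out under the same simplifications used in the lecture (deep linear network, SGD with batch size one, and the assumption that the loss change is big theta of one); this is a sketch of the reasoning, not the full spectral-condition derivation.

```latex
% Rank-one SGD update and its action on the previous activations:
\Delta W_l = -\eta_l \,\nabla_{h_l}L\; h_{l-1}^{\top},
\qquad
\|\Delta W_l\, h_{l-1}\|_2 = \|\Delta W_l\|_{\mathrm{op}}\,\|h_{l-1}\|_2 .

% Assume the loss improvement stays \Theta(1):
\Delta L \approx \langle \Delta W_l,\, \nabla_{W_l}L\rangle
         = -\eta_l\,\|\nabla_{h_l}L\|_2^2\,\|h_{l-1}\|_2^2 = \Theta(1)
\;\Rightarrow\;
\|\nabla_{h_l}L\|_2 = \Theta\!\big(1/\sqrt{\eta_l\, n_{l-1}}\big).

% Plug back in, using \|h_{l-1}\|_2 = \Theta(\sqrt{n_{l-1}}) from condition A1:
\|\Delta W_l\|_{\mathrm{op}}\,\|h_{l-1}\|_2
  = \eta_l\,\|\nabla_{h_l}L\|_2\,\|h_{l-1}\|_2^2
  = \Theta\!\big(\sqrt{\eta_l\, n_{l-1}}\big)
\;\overset{!}{=}\;\Theta(\sqrt{n_l})
\;\Rightarrow\;
\eta_l^{\mathrm{SGD}} = \Theta\!\left(\tfrac{n_l}{n_{l-1}}\right).
```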
So there are lots of steps involved, lots of substitution, and some slightly sketchy big-O notation being substituted into the equations here, but once we do all that, we end up with a very simple formula. Note that this is for SGD. And those of you who have been paying attention and staring at this equation are probably internally complaining: you're like, you have misled us, because in a transformer, what is n_l over n_{l-1}? For an MLP that's actually just four, right? You've got a factor of four between d_ff and d_model. So this thing doesn't really change; it's just a constant in most models, unless your aspect ratios are dramatically changing through the network. The reason µP is different from the standard parameterization is that this derivation is for SGD, where the parameterizations look very similar between µP and SP. If you do the exact same derivation for Adam, you're going to find that you get something slightly different: it's one over the fan-in rather than the fan-out over the fan-in. Okay, so here's the recap. I have dragged you, hopefully willingly, through the derivation of the basic conditions, what people call the spectral conditions, that define µP. Now I'll give you the one-slide, high-level takeaway of that result. When we want to do something like µP, if we follow the guidelines from before directly, we end up with the following, in the blue box. Initialization: you set sigma to one over the square root of the fan-in, times a correction factor that's one if your fan-in is smaller than your fan-out, and the square root of the ratio otherwise. That's a simple rule for the scale of your Gaussian. For your learning rate: if you're doing SGD, you set it to the fan-out over the fan-in; if you're doing Adam, it's slightly different, one over the fan-in. Now, in case you already know the standard initializations off the top of your head, you can mentally compare this to the standard parameterization. In a standard parameterization, if you're doing it right, you should probably already be setting your Gaussians to initialize at one over the square root of the fan-in, so that part is already fine. But your learning rates are probably being set globally to a constant. That's fine for SGD, not so fine for Adam, and that's where the really big difference between SP and µP comes in. Okay, so that brings us right back to the Cerebras-GPT paper. Now we have all the context we need to understand all the operations they do. If you look once again at the µP column over here, the embedding layer is special: it doesn't really do any scaling, because embeddings are one-hot, so their norms don't scale linearly with the number of vocab elements. But ignoring that, basically you see that all the layers get scaled down by one over the width, that's the initialization rule, and then the learning rates are scaled by one over the width as well. That's the learning rate rule for Adam. So if you're using Adam, that's exactly the right thing to do, and it's also exactly what they do in Cerebras-GPT. So hopefully that's clear.
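To make that recipe concrete, here is a minimal PyTorch-style sketch of per-layer initialization and Adam learning-rate scaling in the spirit of the blue box; the function name, base constants, and treatment of embeddings and biases are illustrative assumptions, not the exact Cerebras-GPT implementation.

```python
import math
import torch
from torch import nn

def mup_style_param_groups(model: nn.Module, base_lr: float = 1e-2, base_std: float = 1.0):
    """For each Linear weight: init std ~ base_std / sqrt(fan_in) times the aspect-ratio
    correction, and an Adam learning rate ~ base_lr / fan_in. Embeddings and biases keep
    the base learning rate (they are treated specially in the papers discussed above)."""
    scaled, unscaled = [], []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            fan_out, fan_in = module.weight.shape
            std = base_std / math.sqrt(fan_in) * min(1.0, math.sqrt(fan_out / fan_in))
            nn.init.normal_(module.weight, mean=0.0, std=std)
            scaled.append({"params": [module.weight], "lr": base_lr / fan_in})
            if module.bias is not None:
                unscaled.append(module.bias)
        elif isinstance(module, nn.Embedding):
            unscaled.append(module.weight)
    if unscaled:
        scaled.append({"params": unscaled, "lr": base_lr})
    return scaled

# Usage sketch on a toy model:
model = nn.Sequential(nn.Embedding(1000, 256), nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
optimizer = torch.optim.AdamW(mup_style_param_groups(model))
```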
And hopefully this gives you a sense of both the interestingness of manipulating per-layer learning rates to get more predictable scaling, and also maybe an appreciation of this idea of trying to control activations and updates as a function of model width, right? I'll pause for a moment there and just mention: that's a very successful idea from physics, right? Lots of physicists think about ideas like renormalization: as I take limits of certain things, I want things to remain stable, I want them to not blow up or go to zero. This is an exact application of that idea, which is kind of interesting. Okay. Any questions about the µP derivation, or Cerebras-GPT, or any of the other things? Yes, the question was about the architecture assumptions. Right, so that is part of the subtlety. Well, technically there's an even stronger assumption here, which is that I'm assuming things are a deep linear network; I'm just multiplying matrices repeatedly. This is the simplest network you can have. There are arguments for why adding nonlinearities is fine, there are arguments for how you would take the same arguments and apply them to the attention layer, and there are arguments for why much more complex things are needed for a gated linear unit. So each one of those architecture pieces needs a careful analysis in order to have a corresponding result. Yes? Right, right. So n_l is just the output dimension of a matrix multiply, and n_{l-1} is the input dimension. For example, if you have an MLP, you're going to have a matrix multiply that takes you from the d_model dimension to four times d_model, the d_ff dimension, right? So that would give you an n_l over n_{l-1} of four, for example. All the different matrix shapes are giving you the n_l and n_{l-1}, the fan-in and the fan-out of a matrix. Yeah, exactly. The input and output dimensions are determining all of these objects. Okay, excellent. Is that also just the same as the fan-in and the fan-out, or the input and output width? Yeah, I was using those terms interchangeably, but I should have been a little bit more clear. Oh yes, the question was: since DeepSeek uses a global learning rate, does that mean they don't have an order-one update? So, all of this argument is asymptotic, right? It's basically saying that as I scale my width out to infinity, things will be big or small. And if you look at the µP plot, for example, you do kind of see this: you see that the learning rates have to shift as the model gets larger in order to compensate for the fact that the updates are getting bigger and bigger, right? What's empirically been seen is that if you do nail the learning rate, you don't need µP; it's not like µP is necessary for you to train a good model. It's really just an attempt to keep this shift as small as possible so you can use the same learning rate throughout scaling. And if you go back to DeepSeek, if you remember the scaling law that I was being a slight hater about, you'll see that they too have learning rates that go down as a function of scale, in order to try to compensate for the fact that the bigger models are going to have bigger updates, right?
And so, to respond more directly to the question: yes, in the case of DeepSeek, as we scale the model up, our activation updates will get bigger, so we have to shrink the global learning rate, or we should shrink the global learning rate, to compensate for that. Cool. Okay, nice questions. So that was the conceptual, somewhat mathematical component of µP. Now I want to talk about the empirical aspects of µP, and I'm going to talk through a preprint, a large-scale exploration of µ-transfer. I like this one because it's got a bunch of ablations, and I think I'm a sucker for ablations, so I'll present any paper that has large-scale ablations in the course. They essentially do µP as we've described it; just look at the right-hand side, which is the more relevant piece. They're scaling down the variances, and they're scaling down the learning rates, by the global width multiplier of the model, m, and they're primarily keeping the depth fixed. That's a little bit of an unusual scaling regime, because usually you scale depth and width together, but they really want a controlled experiment where they're only looking at width variations, and they want to see whether µP precisely nails scaling in this regime. There's also a slightly weird subtlety that all of the µP papers seem to share: if you remember your 224N lecture, you remember that there's a scaling on the attention activations. You do your inner product and you scale it down by one over the square root of d, and I told you this was a magic constant that was the right thing to do. µP and related papers use one over d scaling instead of one over the square root of d, for various arguments related to activation and update size stability. So that's another thing worth pointing out, because you might not initially think of it as something related to µP. Okay. The architecture is mostly the standard transformer stuff, and as I already mentioned, they only consider width scaling. They take a standard transformer trained autoregressively on pre-training text, and they want to make the model wider and wider on the MLPs and the residual stream dimensions; they're going to make that bigger and bigger. What they want is for the optimal learning rate to remain the same as they scale the width up, and if it remains stable, that's the big victory for µP. So the game is hopefully clear to everybody: you scale the width, and you want the learning rate that's optimal to stay the same. Question number one is: does it work? Well, the answer is yes. We have different widths, 128, 512, 2048; we have different learning rates across the columns. The idealized strategy here is that we run a sweep of learning rates at the smallest scale, pick the best one, and scale that up, and hopefully that base learning rate remains optimal. And yeah, it seems like learning rates transfer very reliably across model sizes if we're doing this somewhat precise width scaling. And so then I think you start asking questions, very similar to the previous question that was just asked: okay, when does µP break, right? You can ask that question in theory, but you can also ask it in practice, right?
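As a small illustration of the attention-scaling detail mentioned above, here is a sketch of the two conventions side by side; the function and shapes are made up, and this only shows the scaling of the logits, not a full attention layer.

```python
import numpy as np

def attention_weights(q, k, mup_scaling=False):
    """Dot-product attention weights with either 1/sqrt(d_head) (standard) or
    1/d_head (muP-style) scaling of the logits."""
    d_head = q.shape[-1]
    scale = d_head if mup_scaling else np.sqrt(d_head)
    logits = (q @ k.swapaxes(-1, -2)) / scale
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

q = np.random.randn(4, 16, 64)   # (heads, sequence, d_head)
k = np.random.randn(4, 16, 64)
attn_sp  = attention_weights(q, k, mup_scaling=False)  # divide by sqrt(d_head)
attn_mup = attention_weights(q, k, mup_scaling=True)   # divide by d_head
```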
So I'm just going to try all sorts of modern variations to the architecture that people use, and then I'm going to ask: does this hyperparameter transfer continue to hold under these variations or not? The paper is quite nice because they just go through a lot of different stuff: they vary the activations, they vary the batch sizes, the initializations, the RMSNorm gains; they even use really exotic optimizers, sort of sign-gradient style stuff; and then they also vary the regularizers. So which one of these prevents learning rate transfer? The first one, which I think is probably relevant if you were looking at that deep linear network and saying, well, no one just multiplies matrices together, there are nonlinearities in between, right? So does µP work when we change the nonlinearities around? Well, SwiGLU, squared ReLU, and the baseline µP setup with ReLU all have the same optimal learning rate, so no changes at all; we just see that, for example, SwiGLU and squared ReLU do better than the baseline, which is unsurprising and agrees with a lot of what we've learned in the course, right? We might vary the batch sizes, because we know that batch sizes are going to be sensitive to scale; we've seen MiniCPM and DeepSeek basically fit scaling laws to batch sizes to try to get the optimal batch size. Once again, we see that as we scale batch sizes by four, up or down, the optimal learning rate remains stable. What about initializations? There are some initializations that people vary: for example, some people set the query matrix to zero so that all the different items get uniform attention, maybe that's more stable; some people scale the unembedding layer at the very top differently, using either the standard parameterization or µP, and maybe that matters a lot. It turns out neither of those does: in the center column, the optimal learning rate remains optimal in all of these cases. What is it not robust to? Well, it's not going to work for every single case. For example, if you add learnable gains to the norm layers, that turns out to break µP, so you need to remove them. If you remove them, µP works; if you add them back in, it doesn't necessarily work. Similarly, you can try more exotic optimizers. Lion is an optimizer that takes the sign of the gradient updates, which to me feels a little bit crazy, but I think it was found through evolutionary search or something like that in order to find the fastest optimizer. If you use this kind of more unusual optimizer, transfer really breaks down. And I think that's what you'd expect, right? µP is designed to adapt to a very particular optimizer, like AdamW, to control the update sizes, so if you're using a totally different optimizer, I don't know why you'd expect the learning rates to transfer. So maybe it's expected that this fails. And then finally, what else is it not robust to? It turns out that if you have much stronger weight decay, µP actually starts to fail, and this is one of the few significant µP failures in there. A lot of the other ones are kind of like, oh, we maybe expected that, or that's not a standard thing to do, whereas weight decay is something that you actually do use. Okay. So µP seems generally useful. Like, if you take the standard parameterization, kind of going back to the baseline, right?
You might ask: all right, what if I just do the standard baseline stuff? Well, you can't use the same learning rate: the same learning rate results in significantly worse losses at width 2048, right? Your model just blows up and gives you basically degenerate losses; you would have been very sad scaling up at the same learning rate. And we also see that the learning rate needs to scale down predictably as a function of the width. On the other hand, with µP, even if you scale up all the way to a 10B-parameter model, you see that the optimal base learning rate remains the same. So they do one large-scale experiment, and they see that the learning rate remains optimal at the two-to-the-negative-six level, which is a kind of cool validation, right? They do the whole study at a small to medium scale, they do one big hero run, and the learning rate remains optimal. So the empirical results on that look somewhat promising. The fact that Meta used it for Llama 4 is also quite nice, but as far as I know, it's not a consensus that people use µP. So, putting it all together, how do you scale in the wild? I have never trained a 70B model or anything at those enormous sizes, so we have to rely a lot on case studies. And we saw several examples of scaling in the wild: we saw people setting model hyperparameters, especially learning rates and batch sizes, using scaling laws; we saw people using things like µP, or assuming stability, to try to avoid searching over these spaces; and we also saw that alternative learning rate schedules, like WSD, can decrease the amount of compute that you need in order to fit a lot of these scaling laws. So that's all I got.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture takes a deep dive into scaling laws for large language models (LLMs) and how they are applied in practice, focusing on hyperparameter tuning, model parameterization, and compute efficiency. It first revisits earlier work such as Chinchilla and questions its validity, then analyzes detailed scaling studies from recent models such as Cerebras-GPT, MiniCPM, and DeepSeek to show how modern LLM builders use scaling laws to guide their design process. A central topic is the role and theoretical basis of µP (Maximal Update Parametrization) in stabilizing hyperparameter tuning, especially the learning rate, across scales: by adjusting the initialization and per-layer learning rates in a specific way, µP keeps the optimal learning rate roughly stable as model width grows, simplifying large-model training. The lecture also introduces the WSD (Warmup-Stable-Decay) learning-rate scheduler, which separates warmup, stable, and decay phases so that Chinchilla-style data/model trade-off analyses can estimate model performance at different data budgets at close to linear cost, avoiding the huge expense of training a model from scratch for every data point. The case studies show that different teams adopt diverse scaling strategies: Cerebras-GPT and MiniCPM rely on µP to stabilize hyperparameters; DeepSeek directly fits scaling analyses for the learning rate and batch size; newer models such as Llama 3 provide fewer details but still run IsoFLOP analyses. Finally, the lecture works through the theoretical derivation of µP (based on the assumption that activations and gradient updates remain Θ(1)) and its concrete implementation in Transformer models, and uses a third-party study (Lingle's preprint) to probe µP's robustness: it is robust to some architectural variants (such as SwiGLU) and choices (such as different batch sizes), but can fail with RMSNorm gains, certain optimizers (such as Lion), or strong weight decay. Overall, the lecture emphasizes that carefully designed scaling strategies and parameterizations are essential for successfully training high-performing LLMs.
Motivation and Core Questions
The lecture aims to address the following core questions in order to understand best practices for scaling large language models (LLMs):
* What are the best practices for LLM scaling and hyperparameter tuning?
* Does Chinchilla's scaling approach actually work? (For example, do IsoFLOP curves accurately guide the data trade-off? Can they be used to set optimal learning rates?)
* Can we save compute when training and fitting these models?
* Should we choose particular architectures or parameterizations in order to scale well?
The lecturer notes that after DeepMind's Chinchilla paper (2022) and the arrival of ChatGPT, the competitive landscape of LLM building changed and many frontier labs became very secretive about data and scaling details, so other public sources must be relied on to learn how scaling is done in practice.
Case Studies of Scaling Practice in Recent Models
The lecture focuses on the following models, which published detailed scaling strategies:
Cerebras-GPT
- Model overview: developed by Cerebras Systems; a family of models from 0.1B to 13B parameters trained according to the Chinchilla rule (i.e., close to the compute-optimal token-to-parameter ratio).
- Key finding: using µP (Maximal Update Parametrization) makes scaling the models more stable and tractable.
  - Comparison experiments show that the µP Cerebras-GPT models (orange line), relative to the standard-parameterization Cerebras-GPT models (blue line) and models such as Pythia and GPT-J, exhibit a smoother loss curve on the Pile test set.
- Advantage of µP: the Cerebras-GPT authors find that the µP parameterization yields more predictable scaling. The standard-parameterization models oscillate noticeably around the predicted scaling points (likely because the learning rate needs to be adjusted with scale), whereas the µP models (orange line) lie closer to their fitted scaling law.
- µP implementation details:
  - Appendix G of Cerebras-GPT gives a detailed comparison table of µP versus standard parameterization (SP); the key differences are:
    - Initialization: non-embedding parameters are generally initialized with a 1/width scaling.
    - Learning rate: per-layer learning rates are scaled by 1/width.
    - Concrete adjustments include adding element-wise activation tensor scaling, adjusting the initializers of the affected layers, and adding per-layer learning-rate scaling for specific layers.
- Hyperparameter tuning strategy: combine µP with an aggressive scaling strategy.
  - A 200-sample random hyperparameter search is run on a 40M-parameter proxy model ($d_{model}=d_{model,base}=256$, $n_{layers}=32$, $d_{head}=128$), trained on 600M tokens with a batch size of 131k tokens.
  - The resulting µP-tuned hyperparameters are $\eta_{base}=6\mathrm{e}{-3}$, $\sigma_{base}=0.08$, $m_{emb}=10$; these are then transferred to models from 111M to 2.7B parameters.
  - These values are close to the hyperparameters used by Yang et al. (2021).
- Speaker's view: the lecturer considers Cerebras-GPT one of the first public validations of µP, and notes that many labs doing LLM scaling pay close attention to network parameterization, initialization (as a function of model scale), and per-layer learning rates to achieve more stable scaling. For example, Llama 4 reportedly uses a µP variant called MetaP.
MiniCPM
- Model overview: small, high-performance language models (1.2B and 2.4B parameters) released in 2024 by Tsinghua University and ModelBest Inc. They outperform most 2B-parameter models of the time and rival many 7B-parameter models.
- Core strategy: careful, extensive scaling computations, combined with µP to stabilize and simplify the scaling process.
- µP usage:
  - The µP hyperparameters are similar to Cerebras-GPT's: Scale_emb = 12 (Cerebras-GPT: 10), lr = 0.01 (Cerebras-GPT: 6e-3), init_std = 0.1 (Cerebras-GPT: $\sigma_{base}=0.08$).
  - Concrete operations include: embedding output scaling, residual-connection scaling, tensor initialization (the initialization standard deviation of 2D tensor parameters is set to init_std/$\sqrt{d_{m}/d_{base}}$), tensor learning-rate scaling (the learning rate of 2D tensor parameters is set to $1/(d_{m}/d_{base})$ times the learning rate used elsewhere), and LM-head output scaling.
- Scaling strategy:
  - Initialize with µP.
  - Fix the aspect ratios of the model's components.
  - Gradually grow the overall model size (pilot scaling from 9M to 0.5B parameters; the actual trained models are roughly 5x the largest pilot model).
  - Directly fit the optimal batch size, learning rate, and token-to-model-size ratio via scaling analyses.
- Optimal batch size:
  - Vary the data amount and batch size at three model sizes (9M, 30M, 170M), observe the loss, and plot a 3D figure (batch size vs. tokens processed, with color indicating loss).
  - Identify the minimum-loss point at each data amount to determine the "optimal batch size" for each model-size/dataset-size combination.
  - Following Kaplan et al. (2020), plot the optimal batch size against the final loss, yielding the fitted trend log(BS) = -6.24 * log(L) + 20.91. This says that as the loss decreases, the batch size should grow polynomially (see the sketch after this section).
- Optimal learning rate:
  - According to µP theory, the optimal learning rate should be roughly stable.
  - Experiments (loss vs. learning rate for model sizes from 0.04B to 2.1B) show that with µP (Tensor Programs) applied, the optimal learning rates are indeed very similar across model sizes (around 0.01).
- WSD (Warmup-Stable-Decay) learning-rate scheduler:
  - Motivation: traditional Chinchilla-style scaling analyses require training models from scratch for each target data budget (a cosine schedule has a different shape for each total step count), which is expensive (cost grows from $n$ to $n^2$).
  - WSD mechanism: the learning rate is split into a warmup, a stable, and a decay phase. The advantage is that a decay can be launched from any checkpoint in the stable phase and still reach a near-optimal loss comparable to a cosine schedule trained to that data point (see the sketch after this section).
  - Effect: researchers can precisely measure optimal scaling behavior without retraining from scratch for each token budget, making scaling-law measurement along the data axis far more efficient (roughly linear cost, $O(mC)$).
  - MiniCPM uses WSD to measure scaling laws along both the data and model axes for 6 model sizes (0.04B to 2B); each model is decayed 6 times, from stable-phase checkpoints at 10N to 60N tokens.
  - Under WSD, the loss decreases slowly during the stable phase but drops rapidly during the decay phase (roughly the final 10% of training steps).
- Chinchilla-style analysis:
  - Using the data points obtained with the WSD schedule, fit the loss function $L(N,D)=C_{N}N^{-\alpha}+C_{D}D^{-\beta}+L_{0}$.
  - The MiniCPM authors use Chinchilla's method 1 (lower envelope) and method 3 (joint fit).
  - Method 1 suggests relatively mild diminishing returns from data.
  - Method 3 (the joint fit, shown as 2D heat maps, e.g., for the "Ultratext" dataset) is their main scaling approach, with the fit $\frac{7.54\times10^{-2}}{N^{0.30}} + \frac{2.92\times10^{-1}}{D^{0.30}} + 0.25$, $K^2 = 0.01$, $\eta = -0.00$, and $\frac{D_{opt}}{N_{opt}}\big|_{C=10^{21}} = 95.60$.
  - Conclusion: they find a very high data-to-model-size ratio; on average the data amount should be about 192x the model size, far above Chinchilla's 20x. They argue this is consistent with the trend of models such as Llama 3 using higher data-to-model ratios, suggesting that with more careful optimization one can go well beyond the 20x rule of thumb.
  - Their scaling-curve fits are generally good across model sizes and data amounts (e.g., on code and English WikiHow datasets).
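Below is a minimal sketch of the two MiniCPM ingredients referenced above, the WSD schedule and the fitted batch-size trend; the warmup length, the decay shape, and the logarithm base in the batch-size fit are assumptions made for illustration, not MiniCPM's exact recipe.

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.10, min_lr_ratio=0.1):
    """Warmup-Stable-Decay: linear warmup, flat stable phase, then a short decay over
    roughly the final 10% of steps (shapes and fractions are illustrative)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    t = (step - decay_start) / decay_steps
    return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * t)))

def optimal_batch_size(loss):
    """Fitted trend quoted above: log(BS) = -6.24 * log(L) + 20.91 (natural log assumed)."""
    return math.exp(-6.24 * math.log(loss) + 20.91)

# To probe a new data budget, branch a decay off a stable-phase checkpoint
# instead of retraining from scratch with a fresh cosine schedule.
print(wsd_lr(step=50_000, total_steps=100_000, peak_lr=1e-2))  # stable phase: 0.01
print(f"{optimal_batch_size(2.5):,.0f} tokens")                # batch size at loss 2.5
```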
DeepSeek LLM
- Model overview: 7B and 67B parameter models released by DeepSeek AI in 2024; strong performance relative to other open LLMs at the time, roughly on par with LLaMA 2 models of the same scale.
- Core strategy: no µP; instead, directly estimate the optimal batch size and learning rate.
- Scaling analysis for batch size and learning rate:
  - Grid-search the batch size and learning rate in small-scale experiments (a compute budget of 1e17 FLOPs with a 177M FLOPs/token model, and 1e20 FLOPs with a 2.94B FLOPs/token model).
  - Results show that generalization error stays stable over a fairly wide region of hyperparameter space, i.e., near-optimal performance is achievable across a relatively broad range of choices.
  - Collect the "near-optimal" runs (within 0.25% of the minimum loss) and fit how the optimal batch size and learning rate vary with training FLOPs (see the sketch after this section):
    - Optimal batch size: $B_{opt}=0.2920 \cdot C^{0.3271}$
    - Optimal learning rate: $\eta_{opt}=0.3118 \cdot C^{-0.1250}$
  - The lecturer is somewhat skeptical about the reliability of the learning-rate fit ("it looks a bit suspicious").
- WSD-like learning-rate scheduler:
  - A multi-step schedule is used: 2,000 warmup steps up to the peak learning rate; after 80% of the training tokens, the rate drops to 31.6% of the peak; after 90% of the tokens, it drops again to 10%. Gradient clipping is set to 1.0.
  - This schedule generally matches the performance of a cosine schedule and makes Chinchilla-style analysis easier.
- Data/model-scale trade-off analysis (Chinchilla method 2):
  - A direct IsoFLOP-style analysis is used to choose the model-scale trade-off.
  - IsoFLOP curves are plotted for different total compute budgets (bits-per-byte on a validation set vs. non-embedding FLOPs/token).
  - From these, linear trends (in log-log coordinates) are derived for the optimal model scale (non-embedding FLOPs/token) and the optimal data scale (tokens) as functions of total training FLOPs.
- Predicting the final model loss from scaling:
  - The fitted scaling model (based on small-scale experiments) accurately predicts the final generalization error (bits-per-byte on the validation set) of DeepSeek LLM 7B and 67B.
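For illustration, the fitted DeepSeek power laws quoted above can be evaluated directly; this sketch just plugs compute budgets into those formulas (the constants come from the fits cited above, everything else is illustrative).

```python
def deepseek_optimal_hparams(compute_flops):
    """Evaluate the fitted power laws (C = total training compute in FLOPs):
        B_opt   = 0.2920 * C**0.3271     (optimal batch size, in tokens)
        eta_opt = 0.3118 * C**(-0.1250)  (optimal peak learning rate)"""
    b_opt = 0.2920 * compute_flops ** 0.3271
    eta_opt = 0.3118 * compute_flops ** (-0.1250)
    return b_opt, eta_opt

for c in (1e17, 1e20, 1e23):
    b, eta = deepseek_optimal_hparams(c)
    print(f"C={c:.0e}  B_opt~{b:,.0f} tokens  eta_opt~{eta:.2e}")
```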
Scaling Mentions in Other Recent Models
- Llama 3 (Meta, 2024):
  - Ran an IsoFLOP-style scaling analysis (validation loss vs. training tokens), with the minima approximated by second-order polynomials.
  - Found an optimal data-to-model-size ratio of roughly 39:1.
  - Studied how compute relates to downstream task accuracy (via a negative log-likelihood, NLL, intermediate).
- Hunyuan-Large (Tencent, 2024):
  - Ran an IsoFLOP-style scaling analysis for MoE (Mixture-of-Experts) models (training loss vs. activated parameter count).
  - Derived the scaling relationship between the optimal number of activated parameters and the compute budget.
  - The optimal data-to-activated-parameters ratio is 96:1.
- MiniMax-01 (MiniMax, 2025):
  - Studied scaling laws for different attention architectures (softmax attention, lightning attention, hybrid-lightning).
  - Used Chinchilla's method 1 (lower envelope) to analyze the loss, optimal parameter count, and optimal training tokens as functions of compute (PFLOP/s-days).
  - Concluded that lightning attention and the hybrid variant perform on par with softmax attention.
Summary of Recent Scaling-Law Recipes
- Cerebras-GPT: uses µP to make hyperparameters insensitive to scale; applies the Chinchilla scaling formula directly.
- DeepSeek: assumes most Transformer hyperparameters are insensitive to scale; fits scaling analyses for batch size and learning rate; uses IsoFLOP analysis to pick the model scale; uses a piecewise (multi-step) schedule to make Chinchilla-style scaling cheaper.
- MiniCPM: uses µP to make the Transformer and learning rate insensitive to scale; uses a piecewise schedule (WSD) to collect samples for Chinchilla method 3 (curve fitting).
- Llama 3 / Hunyuan: recent (late 2024 onward) but with fewer details; mainly IsoFLOP analyses.
- MiniMax: focuses on scaling for architecture choices and decisions.
Understanding and Validating µP (Maximal Update Parametrization)
µP aims at "scale-insensitive" hyperparameter tuning; the core idea is to keep the optimal learning rate stable as the model width grows.
Theoretical Basis of µP (following "A Spectral Condition for Feature Learning" by Greg Yang et al.)
µP rests on the following two assertions about the network width $n_l$:
1. A1: activations should be $\Theta(1)$ at initialization (i.e., bounded above and below by constants).
   * This is per activation coordinate; the corresponding vector norm $\|h_l\|_2$ should then be $\Theta(\sqrt{n_l})$.
2. A2: after one gradient step, the change in the activations should also be $\Theta(1)$.
Deriving µP (condition A1, initialization):
* Consider a deep linear network $h_l = W_l h_{l-1}$ with the entries of $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ initialized i.i.d. $N(0, \sigma^2)$.
* Random matrix theory gives $\|W_l\|_{op} \rightarrow \sigma(\sqrt{n_{l-1}} + \sqrt{n_l})$.
* Moreover, $\|h_l\|_2 \approx \|W_l\|_{op}\,\|h_{l-1}\|_2$ at initialization.
* Choose the initialization scale $\sigma = \frac{\sqrt{n_l}}{\sqrt{n_{l-1}}}\,(\sqrt{n_l} + \sqrt{n_{l-1}})^{-1} = \Theta\!\left(\frac{1}{\sqrt{n_{l-1}}}\min\!\left(1, \sqrt{\frac{n_l}{n_{l-1}}}\right)\right)$, i.e., roughly $1/\sqrt{\text{fan-in}}$.
* By induction, assuming $\|h_{l-1}\|_2 = \Theta(\sqrt{n_{l-1}})$, one can show $\|h_l\|_2 = \sqrt{n_l} + o(\sqrt{n_l})$, satisfying A1.
Deriving µP (condition A2, updates):
* The weight update is $\Delta W_l = -\eta_l \nabla_{h_l}L \cdot h_{l-1}^{\top}$ (rank one for SGD with batch size 1).
* The activation update is $\Delta h_l = W_l \Delta h_{l-1} + \Delta W_l (h_{l-1} + \Delta h_{l-1})$.
* The goal is for the norm of $\Delta h_l$ to be $\Theta(\sqrt{n_l})$.
* The key is to pick the learning rate $\eta_l$ so that $\|\Delta W_l\|_{op}\,\sqrt{n_{l-1}} = \Theta(\sqrt{n_l})$.
* Assume the loss change satisfies $\Delta L = \Theta(1)$, with $\Delta L = \Theta(\langle \Delta W_l, \nabla_{W_l}L \rangle)$.
* For SGD this yields the learning rate $\eta_l = \Theta(\frac{n_l}{n_{l-1}})$ (i.e., fan-out / fan-in).
* Unlike SGD, for the Adam optimizer (per the "µP mini-recap" later in the lecture and the Cerebras-GPT implementation), the learning rate is scaled as $\Theta(\frac{1}{n_{l-1}})$ (i.e., $1/\text{fan-in}$).
µP mini-recap:
* Initialization: set to $\Theta\!\left(\frac{1}{\sqrt{n_{l-1}}}\min\!\left(1, \sqrt{\frac{n_l}{n_{l-1}}}\right)\right)$.
* Learning rate: $\frac{n_l}{n_{l-1}}$ for SGD; $\frac{1}{n_{l-1}}$ for Adam (per practice and the later summary).
* Compared with standard parameterization (SP): SP typically initializes at $1/\sqrt{n_{l-1}}$ and uses a $\Theta(1)$ learning rate. The main differences are the Adam learning-rate scaling and the initialization when the fan-out $n_l$ is smaller than the fan-in.
µP implementation in Cerebras-GPT (recap):
* The embedding layer is treated specially; its norms do not scale linearly with vocabulary size.
* All other layers scale the weight initialization by $1/\text{width}$ and (for Adam) the learning rate by $1/\text{width}$.
* The table lists the initialization and learning-rate formulas for each component (embedding, LN, biases, MHA, QKV, O, FFN1, FFN2, output logits) under SP and µP. For example, the attention scaling in MHA changes from $1/\sqrt{d_{head}}$ (SP) to $1/d_{head}$ (µP), and the QKV weight learning rate changes from $\eta_{base}$ (SP) to $\eta_{base}/m_{width}$ (µP).
A Large-Scale Exploration of µP (Lucas Dax Lingle's preprint)
This study validates the effectiveness and robustness of µP through extensive experiments, focusing on width scaling (depth fixed at L=24; width M varied over {128, 512, 2048}).
* Effectiveness of µP: the experiments show that under µP, the learning rate does transfer reliably across model sizes. The optimal base learning rate of the small model (width 128) directly predicts the optimum for the larger models (widths 512 and 2048).
* Robustness of µP to architectural variants:
  * Nonlinearities: SwiGLU, squared ReLU, and the baseline ReLU share the same optimal learning rate under µP, with SwiGLU and squared ReLU performing slightly better. µP transfer holds.
  * Batch size: with the batch size scaled up or down by 4x, µP transfer still holds.
  * Initialization variants: e.g., SP unembedding init (output layer initialized with $1/M$ instead of $1/M^2$) and zero query init (query matrix set to zero); µP transfer holds in both cases.
* Cases where µP is not robust:
  * RMSNorm gains (learnable gains): if the RMSNorm layers have learnable gain parameters (vector or scalar) whose learning rate is scaled as $\Theta(1)$, optimal-learning-rate transfer under µP breaks. Removing the gains restores transfer, and removing them barely affects the largest µP model's performance.
  * Exotic optimizers: e.g., the Lion optimizer (based on the sign of the gradient); µP transfer fails. This is expected, since µP is derived for specific optimizers such as AdamW.
  * Strong weight decay: e.g., decoupled weight decay of 0.1; µP transfer fails. This is one of the few significant µP failure cases.
* Practical value of µP:
  * Under SP, the optimal learning rate shifts substantially across widths; reusing the same learning rate makes the larger models much worse or makes training collapse.
  * Under µP, the optimal base learning rate ($2^{-6}$) stays stable even when scaling up to a 10B-parameter model.
  * The current evidence suggests that the µP parameterization and initialization may be easier to tune.
Scaling Challenges in Practice and Solutions (Recap)
- Setting model architecture hyperparameters (width, etc.)
- Setting optimizer hyperparameters (learning rate, batch size)
- The compute needed to fit Chinchilla-style large-scale sweeps
Some solutions:
1. Assume stability (or use µP).
2. Search for the optimal learning rate and batch size at small scale, then fix them or predict how they scale.
3. Use alternative learning-rate schedules (WSD-like) to reduce the cost of fitting scaling laws.
The lecturer concludes that getting the learning rate (and the batch size) right is one of the central concerns. µP manipulates the two "free variables", initialization and per-layer learning rates, in an attempt to make the learning rate scale-insensitive and thereby simplify experimentation.
Summary of Key Takeaways
Scaling large language models is a complex process involving many trade-offs. µP adjusts the initialization and per-layer learning rates to keep hyperparameters, especially the learning rate, stable across model widths; its theoretical basis is controlling the scale of activations and gradient updates. The WSD learning-rate scheduler substantially reduces the compute cost of Chinchilla-style data/model-scale analyses. The case studies of Cerebras-GPT, MiniCPM, and DeepSeek reveal different practical scaling strategies, including applying µP, directly fitting hyperparameter scaling curves, and IsoFLOP analyses. Although µP is robust in many settings, it has limitations, for example failing when learnable RMSNorm gains are introduced or when certain optimizers are used. Overall, carefully designed and validated scaling strategies are essential for training strong large language models cost-effectively.