Stanford CS336 Language Modeling from Scratch | Spring 2025 | Scaling laws
This lecture covers scaling laws for large language models (LLMs): how to predict and optimize the training of large models by studying the behavior of small ones.
Key topics include:
1. Motivation and history of scaling laws: scaling laws aim to establish predictable relationships between model performance and resources (data volume, model size, training steps) so that the best possible model can be trained efficiently under a fixed compute budget. The lecture traces the early roots of the idea, such as 1993 work at Bell Labs and follow-up research, and highlights the shift from theoretical bounds to empirical curve fitting.
2. Key techniques and methods:
- Maximal update parametrization (μP): a technique designed to keep hyperparameters (especially the learning rate) stable across model widths. By adjusting the initialization variance and learning-rate scaling of specific layer types (matrix-like layers, embeddings, the output layer), μP aims to simplify transferring hyperparameters from small models to large ones. CerebrasGPT and MiniCPM applied μP and found it helps stabilize training and predict scaling behavior. Lingle's preprint further validates μP under width scaling but also notes limitations: it is not robust to learnable RMSNorm gains, certain optimizers (such as Lion), or strong weight decay.
- Chinchilla scaling laws and the data/model trade-off: DeepMind's Chinchilla paper showed that, for a fixed compute budget, there is an optimal balance between model size and training data. The lecture discusses how to fit such scaling laws, e.g. $L(N,D) = E + AN^{-\alpha} + BD^{-\beta}$ (see the fitting sketch after this list).
- WSD learning-rate schedule (Warmup-Stable-Decay): to avoid the many full training runs needed to fit Chinchilla-style scaling laws, MiniCPM and DeepSeek adopt a staged (warmup-stable-decay) schedule. Decay runs can be launched from checkpoints in the stable phase, yielding performance points at different data scales at much lower cost (linear rather than quadratic), which are then used for scaling-law analysis.
- IsoFLOP analysis: another way to determine the optimal model and data scale, by comparing different model configurations at a fixed amount of compute (FLOPs). DeepSeek, Llama 3, and Hunyuan use this kind of analysis.
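As a rough illustration of what fitting the formula above means in practice, here is a minimal sketch with made-up measurements (the Chinchilla paper itself minimizes a Huber loss on log-space residuals rather than plain least squares, so treat this only as a toy version of the idea):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (N params, D tokens, loss) measurements from small training runs.
N = np.array([1e8, 1e8, 4e8, 4e8, 1.6e9, 1.6e9])
D = np.array([2e9, 8e9, 8e9, 3.2e10, 3.2e10, 1.3e11])
L = np.array([3.49, 3.16, 2.87, 2.64, 2.46, 2.31])

# L(N, D) = E + A * N^(-alpha) + B * D^(-beta)
def chinchilla(x, E, A, alpha, B, beta):
    n, d = x
    return E + A * n ** (-alpha) + B * d ** (-beta)

params, _ = curve_fit(chinchilla, (N, D), L,
                      p0=[1.5, 400.0, 0.3, 400.0, 0.3], maxfev=50_000)
E, A, alpha, B, beta = params
print(f"E={E:.2f} A={A:.1f} alpha={alpha:.2f} B={B:.1f} beta={beta:.2f}")
# Given the fit, the compute-optimal N for a budget C ~ 6*N*D can be found numerically.
```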
3. Case studies of recent models:
- CerebrasGPT: applied μP for more stable scaling and trained according to the Chinchilla law.
- MiniCPM: combined μP and a WSD learning-rate schedule for careful scaling calculations, achieving strong performance at small model sizes (1-2.5B), and found an optimal data-to-parameter ratio far higher than early Chinchilla estimates (roughly 192:1 on average, rather than 20:1).
- DeepSeek: did not use μP; instead estimated scaling rules for the optimal batch size and learning rate directly from small-scale experiments, and used a WSD-style learning rate for its Chinchilla analysis (IsoFLOP method). Its scaling model predicted the final model's performance quite well.
- Llama 3: reportedly used IsoFLOP-style scaling, with a data-to-parameter ratio of about 39:1.
- Hunyuan-Large: for an MoE model, used IsoFLOP-style analysis over activated parameters, finding an optimal ratio of data to activated parameters of about 96:1.
- MiniMax-01: focused on how architecture choices affect scaling laws, analyzed with Chinchilla approach 1.
4. Practical takeaways and challenges:
- Challenges: how to set architecture hyperparameters (width, depth, etc.) and optimizer hyperparameters (learning rate, batch size), and how to run Chinchilla-style sweeps economically.
- Emerging solutions: rely on hyperparameter-stability assumptions or use μP; search for the optimal learning rate and batch size at small scale, then fix them or predict how they scale; adopt alternative schedules such as WSD to lower the cost of scaling analysis.
The lecture emphasizes that systematic scaling-law studies make it possible to choose architectures and hyperparameters scientifically before training an expensive large model, improving both research efficiency and model quality. Recent results also suggest that, to reach optimal performance, models may need considerably more training data than previously assumed.
Media details
- Upload date
- 2025-05-17 21:56
- Source
- http://youtube.com/watch?v=6Q-ESEmDf4Q
Transcript
speaker 1: I'm going to talk a little bit about scaling laws. Originally, I think we were going to talk about inference, but I'll take a few minutes to start on scaling laws and then we'll figure out where to go from there. Okay. So the whole point of scaling laws is, well, to begin with, I want you to put yourself into the following scenario. You have a very rich friend, and he or she has given you 10,000, actually, let's say, a hundred thousand H100s for a month, and you have to build the best open-source LLM that you can, right? So this is a somewhat hard task, and we've given you some of the tools that you need to make progress on this question. You can put together your infra team and your systems people, and you can put together a distributed training framework in the next assignment. After that, you're going to put together a great pre-training dataset, and then you know all about your architectures and so on. So you have all the pieces, and we can turn the crank and run the big model. And in the first couple of lectures, we talked about all the other various decisions you might make along this journey, right? What's the architecture? What are the hyperparameters? How are you going to do all these things? Well, in some ways the answer I gave you in those early lectures was just: pick what other people have done. Just follow Llama or whatever other models. But in a way that's a very boring answer, because it doesn't let you push the frontiers, right? If you're in a big frontier lab and you're going to build the best model, you don't want to just copy other people, you want to innovate. So how do we innovate and get these optimized solutions in the first place? That's going to be the point of scaling laws. What we want to do is build simple predictive laws for the behavior of language models. And scaling laws are basically this whole idea of being able to take small models, scale them up, and do that in order to improve your engineering, right? So one way of thinking about this is that the old and unpleasant way of doing deep learning is to just train a bunch of big models to tune your hyperparameters so that your big models are good. That's just going to cost tons and tons of compute; you can't really easily do that. And so the new optimism, if you're following a lot of these developments on scaling, is: all right, we're going to train a bunch of small models, we're going to learn a lot of things from the small models, and then we're going to extrapolate back up to bigger models. So we're going to take our smallest models at the left side of this compute scale here, learn a lot about what to do, and then nail it in one go when we build the big model. And the first place I want to start is the history and background of scaling laws. I want to contextualize this because I think when people talk about scaling laws, it's often done in very messianic, AGI terms. They say scaling laws tell you that these amazing things are log-linear forever and we will achieve superintelligence or something. But I think scaling laws are actually much more grounded and have a lot of interesting history.
And so I'm going to start there, to try to convince you that scaling laws aren't necessarily just fitting lines on log-log plots, although that is a very big part of what we're going to do. And then, in very easy steps, I'm going to try to convince you that, at least for data, scaling laws are a very natural thing to think about and expect. So, as a person that was brought up in statistical machine learning, my starting point is going to be statistical machine learning, right? What are scaling laws? In some ways, scaling laws are telling us that as we increase the amount of data or change the model size, we expect certain behaviors out of the model. And if you go back to something like machine learning 101, and if you remember your VC dimensions and Rademacher complexities and so on, in some ways that's the theory version of exactly this. So on the top, you have a generalization bound for the excess risk of learning among a finite set of k hypotheses, and we see that it should scale as one over the square root of n, right? In some ways, that's a theoretical version of a scaling law, where we're making predictions about how fast our errors should decay as a function of n. On the bottom, we might have something a little more exotic if we're doing generative modeling, where our generative model is a really flexible nonparametric class. What we might do instead is fit some sort of smooth density, and in that case, our prediction is that the L2 error of estimating the density is going to be upper bounded by some polynomial, n to the negative beta over two beta plus one. This is what some people call nonparametric rates. So theorists have been thinking for a very long time about how sample size, especially, should relate to error. This is a very classic problem in machine learning theory, but these are upper bounds, not actual realized loss values. And really, scaling laws are in some sense the leap from thinking about the theoretical side of how data and model size should relate to performance, to the empirical side of saying: actually, our bounds are bad, but maybe we can fit these things empirically. And here's a fun, or at least arguable, piece of trivia: what is the first scaling laws paper? Not many papers cite this one, but I think probably the right answer is a paper from NeurIPS 1993, from Bell Labs. And you might recognize some of these names. These are theorists and some of the people that have done really classic work in machine learning theory, like Vapnik and Corinna Cortes and others. And I took an excerpt, because I was reading this paper while preparing this lecture, and it struck me how, in many ways, this paper was ahead of its time. It's saying training classifiers on large databases is very computationally demanding, and we need to figure out which ones are good before actually training them. And so what they propose is a new predictive method that predicts how good a model is going to be without actually training the whole thing. That sounds a lot like scaling laws. And you'll see this later, but they have a functional form that's basically: the test error of a model is expressible as some irreducible error plus a polynomial decay term.
And you're like, huh, that looks a lot like a modern scaling law. And they even do the thing where they train a bunch of small models, fit their curves, and say: oh, we can accurately predict the behavior of the model further out. So, as with many things, scaling laws were partially thought about at Bell Labs way back when. And of course, there are others that have thought about related ideas in scaling, not just scaling laws, but also the modern mindset of thinking about scaling. There's another paper that often gets mentioned in the history of scaling laws, Banko and Brill, who studied how the performance of a certain kind of NLP system scales with the amount of data. And they have what looks very much like a modern scaling law: log-axis data on the x-axis, performance on the y-axis. And they're basically arguing: look, we can get really dramatic performance improvements just by scaling up data, and it's very predictable, so maybe we should consider the trade-off between spending time and money on algorithm development versus just collecting more data. And you're like, huh, that sounds a lot like what a lot of this pre-training stuff is thinking about. And then finally, one of the things that people have thought about, recently and in the past, is: is this thing really predictable? What are the right functional forms? As early as 2012, people were really thinking about whether these things are actually predictable. Is a power law, for example power three or power four, really the right functional form for predicting the behavior of models? And all of this, just to remind you, is thinking about the behavior of models on the y-axis, the capabilities, as a function of the amount of data on the x-axis. So that's the relationship that has been classically studied, what you might call data scaling, in all these cases. And if you're interested in the earliest large-scale neural scaling law paper, that would probably be Hestness et al. in 2017. I believe they were at Baidu when they did this work. They showed that for a range of tasks, machine translation, speech, and I think some vision tasks, error rates fall as a power law. And they even have this nice plot that I really like to refer to when people are discussing scaling laws, which says your expectation should be that there are three different regions in the behavior of a model. Initially, you start out at a best-guess level. You then enter a region where you're predictably scaling the model; that's the power-law region. And then there's another asymptotic region where you're approaching essentially the irreducible error of your model class. And I'll highlight that in the last few years there's been a lot of talk of new phenomena, things like emergent capabilities, or scaling compute being a new thing, or systems being really important. But had you been reading Hestness 2017 carefully, you would have seen essentially all of these things. They say it's actually really hard to make predictions with a scaling law when models are at random performance, because suddenly you can leave the random region.
They talk about computational limits: if we can scale, then scaling compute is really important. And then they even say things like: maybe we should do things like quantization, because if we have predictable scaling, then we should be willing to pay for model accuracy with compute, right? These are all very, very modern ideas that a lot of the early scaling law papers understood fairly intuitively, because once you see these plots, you see that with predictable resource investment, you get predictable capabilities improvements. So that's, in some sense, the core, not quite history, but context that has really shaped scaling laws. All right, any questions so far on the context? This is mainly just data scaling, but I wanted to make sure we go over it carefully. Yes. student: It seems pretty natural to expect scaling, but I was wondering, are there cases where there isn't scaling, where things don't get better? speaker 1: Yeah. So the question was: it's natural, or arguably natural, to expect scaling; are there cases where we don't get scaling, or we get different kinds of scaling? And one way of thinking about this is that if you're measuring training loss, or held-out versions of training loss, then scaling is very natural. All of classical statistical theory says these things should converge, and when they converge, eventually they will get better, at least in some very asymptotic sense. But we do see non-scaling behavior. There was a really interesting competition a few years back called the Inverse Scaling Prize, where they were looking for things that scale inversely as models get better. A lot of these are very niche things; for example, models tend to copy better, so suppressing copying behavior becomes really hard for really strong models. But one thing that ties a lot of that together is that if you go really far out of distribution, where the behavior is not well specified by the data, then you can get all sorts of behaviors: no scaling at all, or inverse scaling, or what have you. So in some sense, you can think of this as the extension of the classic deep learning robustness problems. Cool. Okay. So now I'm going to talk about the scaling behaviors of LLMs, essentially going through several kinds of empirical results. I'm going to walk you through data scaling in particular, and some examples, just to convince you that this is a very natural object to expect. And then we'll talk about model size, which is a different kind of thing. So scaling laws, I think, are fairly well established, and they seem to appear very often in many variables. You see scaling in compute on the x-axis. These are all taken from the Kaplan scaling law paper, which I'll refer to extensively in this lecture. The x-axis here is log compute, the y-axis is log test loss. And on the right you see similar kinds of scaling, both for dataset size, so the amount of data, and for parameters.
One subtlety I'll mention as I talk through this: when we scale things like dataset size or parameters, we're always assuming that the other variable is much, much bigger. In this case, if you're scaling dataset size, the model is much bigger, because obviously, if you have way more data than parameters, you're eventually going to asymptote. So in all of these, we're trying to avoid the asymptotic regime. These relationships also hold in pretty non-standard settings: they hold for downstream tasks, and they hold out of distribution, which is what's being shown here from the Kaplan paper. So, in some ways, power-law relationships seem to appear more often than we might initially expect, especially for these out-of-distribution or other variables. So I want to talk through data scaling laws first, because I think they're the most intuitive; at the very least, the theory for them is fairly clear. To be precise, when I say something like data scaling, what I mean is some simple formula that maps dataset size, which I'm going to refer to as n, to our excess error. Excess error is the error beyond the irreducible regime. And if you recall that figure I referred to in Hestness, what we expect is monotonic, logistic-looking curves. And really, our interest is primarily going to be in the power-law region up to the irreducible-error region. Of course, it's very interesting to also ask what happens in the small-data region as we leave random guessing, but that's much, much harder to reason about, whereas I can hopefully convince you that this right tail is actually a very natural place to expect power-law scaling. Okay. So the first empirical observation that we have, and this is the thing that I'm going to convince you is natural, is that when we plot dataset size on the x-axis and test loss on the y-axis, then on a log-log plot model performance is linear. You might call this scale-free, or you might call it a power law; those are more physics-oriented terms. This was established by many people, but you might refer to Kaplan to see many examples of it. As the previous question brought up, we kind of expect error to be monotone: we train on more data, error goes down. Fairly obvious. The part that is less obvious is the precise functional form of this scaling. When I say it's a power law, it's linear in log-log space. And what is the implication of that? If something is linear in log-log, that means there's a polynomial relationship between your x-axis and your y-axis. And why is polynomial decay natural? Well, I'm going to walk you through two examples, and both of them result in some fairly natural polynomial decay. I'm going to start with the simplest possible example; this is stats 101 rather than machine learning 101. What I want to do is estimate the mean of a dataset. Estimating the mean is a task of estimating a parameter, and I can ask: what's the scaling law? What's the error of my mean-estimation task as a function of data? So I can write that down. My input comes from a Gaussian, and the task is to estimate the average. I've written those out in the blue box above. And what's the error?
Well, by very standard arguments, the average is also going to be distributed as a Gaussian, with the variance divided by n. So sigma squared over n is my estimation error; this is the expected squared error of my estimate. And if you look at this, it is polynomial in n. And just to really drive the point home, if you take the log of both sides, log of the error on the left and log of n on the right-hand side, I get exactly: log of error equals negative log n plus two log sigma. So this is exactly the kind of thing we expect, and we expect a slope of negative one if we were to fit a scaling law for mean estimation. So now, equipped with this new knowledge, you might say: all right, I'm going to go around and look at what the rates are for estimating different things, and that will tell me what I should expect for data scaling. And so you might say, I expect one over n; you might expect one over the square root of n for agnostic learning, and so on and so forth. So we should expect to see some pretty nice round numbers for the slope on a log-log plot, something like one or 0.5. What do we actually find empirically when we look across these papers? Just to call them out: in Hestness, for machine translation, we see negative 0.13. For speech, we see negative 0.3. And for language modeling, we see an exponent of negative 0.095. Those are all much, much slower than the one over n or one over square root of n rates that you might expect when you're just fitting simple functions. So why might this be? Okay, this will be the last math slide of this lecture, and then we can go to just fitting lines on log-log plots for the rest of the time. But this will hopefully drive the point home of why we might see these particular slopes. We know that neural nets aren't just estimating the mean, or even fitting a linear regression; they can fit arbitrary functions. So let's turn that into an example and work through it. My input is x one through x n; I have n samples, and I'm going to put them uniformly in the 2D unit box. And I want to estimate some arbitrary regression function y equals f of x. I'll assume f is smooth and so on, if you really want to be precise; there are some regularity conditions here. A simple approach to estimating a regression function f is just to cut the 2D space up into small boxes, and within each box, I can measure the average of the y values. A very simple nonparametric regressor is to just cut the space up and estimate locally. Now, informally, if I pick square root of n boxes, each box is going to get square root of n samples, and my error per box is going to be one over square root of n. And if you follow this logic through to more dimensions, you'll see that in d dimensions, the error is going to be n to the negative one over d. And so my overall scaling, if I were to take log-log plots of the whole thing, is that I expect a slope of negative one over d.
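As a quick sanity check of the mean-estimation argument above, here is a minimal simulation sketch (my own illustration, assuming nothing beyond NumPy): it measures the squared error of the sample mean at several values of n and fits the log-log slope, which should come out close to negative one with an intercept near two log sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
ns = np.array([100, 300, 1_000, 3_000, 10_000, 30_000])

# Average squared error of the sample mean over many trials at each n.
errs = []
for n in ns:
    means = rng.normal(0.0, sigma, size=(2_000, n)).mean(axis=1)
    errs.append(np.mean(means ** 2))  # true mean is 0, so this is the squared error

# Fit log(error) = slope * log(n) + intercept.
slope, intercept = np.polyfit(np.log(ns), np.log(errs), deg=1)
print(f"slope: {slope:.2f} (expect ~ -1), intercept: {intercept:.2f}, "
      f"2*log(sigma): {2 * np.log(sigma):.2f}")
```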
And so why did I walk you through this example? Because if you have flexible function classes, what people call nonparametric function classes, you expect dimension dependence, and therefore the slope of the scaling law moves much more slowly. And in some sense, the slope is telling you almost precisely the intrinsic dimensionality, or the ease of learning the task. And people have argued this more formally, or more literally: there have been several theory-slash-empirical papers arguing that the reason we get these exotic or non-standard rates of learning is that they're closely connected to the intrinsic dimensionality of the data. And, for example, the plots of these predictions, the dashed lines, and these purple circles are somewhat close, although you don't want to read too much into this, because estimating intrinsic dimension is an extremely difficult problem, as difficult as modeling the data overall. Okay. student: I guess this is related to that point, but how can you generate data that has a particular underlying intrinsic dimension, from a simulation perspective? speaker 1: Yeah. So, for the results here, if you want to generate data like that, it's actually not too hard. You could write down a function that takes in, say, five variables, and as long as those five variables don't cancel each other out, that's a five-dimensional surface; you add a little bit of noise and you're good to go. The difficulty here is that they're actually doing things like training on CIFAR and then trying to estimate the intrinsic dimensionality of CIFAR. That's a much harder task. Okay. And data scaling laws are quite useful. I was going at this from a let-me-explain-scaling-laws perspective, but you can actually use scaling laws to do many interesting things. You can make engineering decisions of various kinds using data scaling laws, and people do, in fact, do this. For example, you might ask: how does dataset composition affect performance, not just dataset size? Well, if you're changing the dataset, Kaplan et al. has a really nice figure showing that data composition only affects the offset, not the slope. And what that means is that if you want to pick a really good dataset, you don't necessarily have to train your models at huge scale; you can scale them down and do your data-selection experiments on much smaller models. And as we mix different data, we might expect certain kinds of shapes, and you can use regression and other techniques to try to figure out, for example, optimal data mixing using scaling laws. People have written several papers on this topic, although, as with all data-selection research, a lot of this seems fairly tricky to execute reliably. There are other interesting questions you might ask, too. There's a lot of discussion these days about whether we're running out of data on the Internet. And once you start asking those questions, the other interesting and important question is: can we just keep training on the same data we have? What are the diminishing returns of that? And so there's interesting work extending scaling laws to multi-epoch training, basically arguing that there's a sort of effective sample size.
And after about four epochs, you have rapidly diminishing returns as you repeat more and more data. And by modifying the usual scaling law, you can get a version with an amount of effective data, in unique tokens, whose value diminishes as you increase the amount of repetition. Finally, one interesting combination of these two ideas is thinking about data selection in the large-data regime. Imagine you're going to be training on trillions and trillions of tokens. What would be better: to repeat high-quality sources like Wikipedia, and perhaps your secret pirated books, ten times, or to include new data? The fact that you can either repeat data or include more data means you now have multiple axes along which to optimize your data mixture. And there's also been some interesting data-scaling work, this one from CMU folks, on essentially trading off between repeating data versus picking lower-quality data that's new. And all of this really is a natural extension of what I already taught you: if you assume that there's a predictive power-law relationship, and that this relationship holds on a per-mixture basis, then you can fit these scaling-law extrapolations and get an estimate of how good your data is going to be at scale. So that's the starting point, which is data scaling. And hopefully I've convinced you at this point, both empirically and conceptually, that it's natural to have log-log linear relationships between data and error. This relationship seems to hold very robustly across domains and across different kinds of models, and you can have a nice, clean theoretical understanding of what's happening. And once you have this, you can use it for all sorts of purposes, like picking optimal data mixtures or whatever else. Okay, yes? student: How is the model size picked when you vary the data? speaker 1: Yeah. So as I was saying back on this slide, when we think about data-size scaling, the model is always picked to be really, really large, so that the data is not saturating the model. You want to avoid being in this irreducible-error regime. So the model is always picked to be large enough that you're in the power-law region whenever you're only varying data. student: For all of them, is it one really big model size, or is each point a different-size model? speaker 1: For this plot in particular, it's one big model size. When you're looking at, for example, compute scaling on this axis, then data and model scale jointly at some preordained ratio. Cool. Any other questions? Good. Okay, excellent. All right. So now we've got to move from data scaling to, in my opinion, slightly more mysterious kinds of scaling, and we're going to talk about model scaling next. And I think this is a more practical, engineering set of questions that we're now going to try to answer. So: you're in charge of building and shipping a really large language model. There are a lot of interesting ideas out there. You could train the latest state-space model, you could train a transformer, you could use Adam, you could use SGD. People invent all sorts of new tricks. Which ones are worth scaling up and which ones are not?
You could also take your limited compute resources and spend them on different things. You could train models for longer, or you could train bigger models; for a given number of FLOPs, you can trade between the two. And you could also do things like go and collect more data versus get more GPUs. There are a lot of different things you can do, and scaling laws give you a pretty simple procedure for answering all of these questions. So I'll go through the classic Kaplan scaling law paper. If you're interested in these topics, I encourage you to read it. It's a gold mine of these kinds of observations. Some of it is old, but it's, I think, still unmatched in the thoroughness of what it studied in a fairly nice, unified setting. So, architecture-wise, you might start by asking: transformers versus LSTMs, which one's better? Well, the brute-force way might be to scale LSTMs up to GPT-3 level, and then you can figure out whether they're good or not. The scaling-law way is much simpler: you train a bunch of LSTMs and transformers across many different compute levels, and then you see what happens as you scale them up. And I think the trends here are fairly clear. No matter how many layers you have in your LSTMs, there's a pretty big constant-factor gap between transformers and LSTMs. And remember, this is on a log scale. So this is saying something like, I don't know what the exact numbers are, but imagine it's 15 times less efficient. Then no matter where you are on this plot, the LSTM is, let's say, 15 times less compute-efficient than a transformer. So there's a constant-factor compute penalty to using LSTMs, at least in this plot. You could zoom out and say, well, there are a lot more architectures; which ones are really good and worth doing? And some of the classic papers, this one is from ek and others, have done exactly this kind of scaling work, where they took a bunch of architectures, on the right here, and basically scaled them up. So the x-axis is the amount of compute, the red line is each alternative architecture, and the green line is the transformer baseline. And they ask: can any of these alternative architectures match or out-scale the transformer? And what do they end up with? Well, actually, the only things that seem to really strongly and reliably beat the transformer are gated linear units and mixture of experts. And wouldn't you know it, that's exactly the kind of stuff people are doing today. And so this is the scaling-law version of that same idea: how would you have come to the conclusion that we should be doing switch transformers and GLUs, and not, for example, the Performer? The scaling law provides some clear evidence for why you might want to do that. Optimizer choice, I think, follows a similar pattern. This one's from Hestness. They compare SGD and Adam, and they find, very similar to before, a constant-factor gap in compute, in this case dataset size, but of course that translates to compute, in the effectiveness of Adam versus SGD. RHN here is recurrent highway networks. You can ignore the details; the point is to see how you would do this analysis, rather than the specific results shown here.
In the beginning I also asked something like: depth versus width, what should the aspect ratios be? That was one of the hyperparameter topics we talked about. And we see a similar analysis, but in scaling-law form, from Kaplan. I find this one intriguing because we might think that deeper models get dramatically better, that there's a clear separation between the numbers of layers. But we see, at least here, that there's actually a lot of slop. One layer is really bad, but a lot of the other layer choices remain pretty stable. And hopefully this is reminiscent of that slide I showed back in the architecture lecture, where I said the aspect ratio, the ratio of width to depth, roughly something like 4 to 16 or so, was a pretty natural number, but there's a really wide basin in which you're approximately optimal. And the scaling-law analysis backs that up. One important subtlety that I do want to point out, and this one bites people every now and then, is that not all parameters are equal. Often you want to do parameter-scaling analyses, but if you were to, say, count embedding parameters as part of your model, you get a pretty different scaling law; you get this weird-looking curve that bends over here. Whereas if you only count the non-embedding parameters, you see the much cleaner result I showed you before. So embedding parameters don't really behave the same way, and they don't show the same log-linear scaling as the non-embedding parameters when you account for them. And there's related work on "not all parameters are the same" in recent papers on scaling mixtures of experts, where they're trying to figure out what it means to be a parameter when you have sparsely activated parameters. In those papers, they try to derive things like an equivalent number of dense parameters, in order to normalize the parameter count of an MoE. I showed you this plot earlier in the hyperparameter-selection lecture, but hopefully now you see the full context, not just the original hyperparameter-choice question. In many cases, I'll go back, let's say, to here, what we'll often see is scaling-law curves that look like the following: the slopes of the curves remain very similar, they're non-crossing, and there are constant-factor offsets between them. And whenever that is the case, you can take a slice at a particular level of compute, or a particular set of hyperparameters, analyze the hyperparameter trade-offs very carefully, and be reasonably safe in scaling that up. And so when you go to Kaplan's paper, you'll see exactly these kinds of analyses being done. Especially, I think, the center one, the aspect-ratio plot, is worth looking at. They're not just scaling models up and down; they're taking different slices, so different-sized models, 50 million to 170 million to 1.5 billion parameters, and they're looking at how the aspect ratio changes the loss. And they see that the shape of the curve, not just the scaling slopes, remains similar.
And this means I can pick an aspect ratio between ten and a hundred, and anything in between will work fine at all of these different scales. And this is, I think, important to think about. Initially, when you're trained in deep learning and model training, you think about hyperparameter tuning, but you want to be scale-aware in how you tune your hyperparameters. And that's a really big difference in mindset between the scaling-law style approach and what you may have been trained on, or naturally think about, in terms of just tuning these models at a small scale. And the same is done for the feed-forward ratio and for the attention head dimension: you vary these aspects across scale and see whether the minima remain similar. Okay, another important thing. Next lecture, actually maybe not next lecture but the one after, I'm going to talk about practical case studies of how people have scaled up models. And we'll see that batch size and learning rate are two really tricky things that you have to deal with carefully when you scale models up. When you scale models up, the optimal learning rate will be different across model scales, and if that's the case, then the optimal batch size might end up varying as well, because the two are often co-linked. And so we need to think about the right way of scaling batch size, how batch size interacts with scale, and also learning rates. I'll talk about those for the next couple of slides. So, batch size: from the systems lecture, hopefully you remember that it has diminishing returns past a certain point. Up until a certain point, when the batch size is smaller than the noise scale, we're on the left-hand side here, increasing the batch size is almost equivalent to taking one more gradient step. That's roughly saying, if I double my batch size, it's as good as taking two gradient steps. And that's a really, really good place to be, because now you've got the systems power of being able to parallelize across the batch while having the optimization efficiency of taking two steps. But past a certain point, you're going to have ineffective scaling, where your noise scale and your batch size are about the same, and the additional samples in your batch are not reducing useful noise; you're getting dominated by the curvature, the bias term, so to speak, of your optimization landscape. And one really useful analysis object is this notion of a critical batch size. The critical batch size, you can think of, is the threshold point where we go from perfect scaling to strong diminishing returns. And you can analyze this in theory, and the OpenAI paper on critical batch sizes does this, but you can also analyze it empirically. This is another thing that's been studied in the scaling-law kind of way: you can estimate the point at which progress slows, so you can estimate empirically where the critical-batch-size trade-off points are. And with that, you can train bigger and better models.
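The lecture mentions estimating the critical batch size empirically. One common proxy from the OpenAI large-batch-training line of work is the "simple" gradient noise scale, tr(Σ)/‖G‖², where Σ is the per-example gradient covariance and G the full-batch gradient. Below is a minimal, hypothetical sketch of that estimator (my construction, not the lecture's code), using squared gradient norms measured at two batch sizes; the toy check at the bottom simulates isotropic gradient noise so the answer is known analytically.

```python
import numpy as np

def simple_noise_scale(g_norm_sq_small, g_norm_sq_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from squared gradient norms measured
    at two batch sizes, assuming E[|G_B|^2] = |G|^2 + tr(Sigma) / B."""
    g_sq = (b_big * g_norm_sq_big - b_small * g_norm_sq_small) / (b_big - b_small)
    trace_sigma = (g_norm_sq_small - g_norm_sq_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g_sq

# Toy check: per-example gradient = G + noise, with known tr(Sigma).
rng = np.random.default_rng(0)
G = np.array([1.0, -2.0, 0.5])   # "true" full-batch gradient
sigma = 3.0                      # per-example noise std (isotropic)

def batch_grad_norm_sq(b, trials=20_000):
    g = G + rng.normal(0.0, sigma, size=(trials, b, G.size)).mean(axis=1)
    return np.mean(np.sum(g ** 2, axis=-1))

est = simple_noise_scale(batch_grad_norm_sq(8), batch_grad_norm_sq(128), 8, 128)
print(f"estimated noise scale: {est:.1f}, "
      f"analytic: {G.size * sigma**2 / np.sum(G**2):.1f}")
```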
And one really interesting thing is that as you try to improve the loss, so you're going toward the left here, making the loss better and better, your critical batch size ends up getting bigger. So the smaller the loss target, the bigger the batch size you can use. And one of the things this leads to is that, for example, if you look at the Llama 3 training report, you'll see they increase the batch size after a certain point, or increase it as they train, because as your loss target gets smaller, your batch sizes can in turn get bigger. So as we increase both compute and model size, what's the right thing to do? Once again, we can do a scaling analysis. This is from Kaplan. You can try to figure out, as we increase the amount of compute, what is the optimal batch size? And what we see is that as we increase compute, we can actually get reasonable parallelism: the number of total steps can stay roughly the same, at least within this compute range, while the batches get bigger and bigger. And if you fix the batch size, of course, the number of steps is going to go up and up. So this is good news, hopefully, for data-parallel processing. So that's the batch-size story. The things you should maybe remember, because critical batch size is kind of a messy concept, are, one, that there's a diminishing-returns point, the critical batch size, and two, that it does seem to follow a pretty predictable scaling, often as a function of your target loss. And given that, you can figure out the right trade-offs between systems efficiency and optimization progress. As I said before, the other aspect of this is that you've got your batch size and then you've got your learning rate, and the two are fairly closely linked. I'm going to talk about μP at much more extensive length in the next part of the scaling lecture, but this is a really important broader idea. So you could do one of two things, and I think this figure lets me talk about both of them. Let's look at the left plot first, what's labeled standard practice. When you train a transformer, what you're basically going to see is something like this left plot, the standard practice: the optimal learning rate sits at different points. The wider the model, as you increase your model size and your MLPs get wider and wider, the smaller the optimal learning rate is going to be. And as you make your model smaller and smaller, your losses, of course, go up because your model is less expressive, but the optimal learning rate also goes up. And often people cite a rule of thumb: one over the width is the right rate at which you should scale the learning rate. More advanced people will actually take these curves, find the minimum, and then fit a scaling law on the optimal learning rate. And there we can see that this is a predictable decay in learning rate, and maybe we can fit a scaling law. I'll talk about this more in the next set of lectures.
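Here is a minimal sketch of that "more advanced" approach, under the assumption that you have a learning-rate sweep at each width: take the loss-minimizing learning rate per width, then fit a power law for the optimal learning rate versus width in log-log space. All the sweep numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical (width -> {learning rate: final loss}) sweeps; in practice these
# numbers come from small-scale training runs at each width.
sweeps = {
    256:  {8e-3: 3.9, 4e-3: 3.7, 2e-3: 3.8},
    512:  {4e-3: 3.5, 2e-3: 3.4, 1e-3: 3.5},
    1024: {2e-3: 3.2, 1e-3: 3.1, 5e-4: 3.2},
}

widths = sorted(sweeps)
# Optimal LR per width = argmin of the sweep at that width.
best_lrs = [min(sweeps[w], key=sweeps[w].get) for w in widths]

# Fit log(lr_opt) = slope * log(width) + const; a slope near -1 would
# match the "one over width" rule of thumb mentioned in the lecture.
slope, const = np.polyfit(np.log(widths), np.log(best_lrs), deg=1)
print(f"fitted slope: {slope:.2f}")
print(f"predicted optimal lr at width 8192: {np.exp(const) * 8192 ** slope:.2e}")
```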
But an alternative approach, one that many people have started to adopt and that I think is a really interesting thing to think about, is that you can actually reparameterize the model. In particular, you can scale the initialization and the learning rates of different layers based on the width: you can scale the variance of the initialization based on the width of the model, as well as multiply the outputs of different layers in the forward pass. And if you do this in a way that depends on the width of the model, you end up with a parameterization whose optimal learning rate is supposed to be more stable, or, at least in the original paper, exactly stable across scale. So you tune your learning rate once, and you don't have to do anything else; the optimum directly transfers. You tune it on the smallest model, and it transfers to the very largest scale. And this is the idea called μP. The original paper I'm showing you here introduced μP, and there have been other variants. Meta, with the release of Llama 4, claims to have invented something called MetaP, which I'm not quite sure what it is yet, but you can see that a lot of labs are thinking about this. Because if you have to rely on predicting the optimal learning rate, then you have to do all sorts of tricky scaling-law fits, and maybe that's unstable; but if you can reparameterize your model, then maybe you don't have to do any retuning at all. Of course, that's more optimistic than what happens in practice, but hopefully this gives you a sense of why scale-aware initializations are really cool and interesting. Cool. Any questions? Up until this point, I've gone through a whole bunch of scaling of architectures and hyperparameters, so maybe I'll stop for a moment in case anyone has questions. student: I don't really get the intuition behind why, if we lower the loss target, we want to increase the batch size. speaker 1: Yeah. So when you have a lower loss target, the smaller the loss target, the more sensitive things are. And in the same way that you're going to be lowering your learning rate, you want to also increase your batch size in order to denoise. The more sensitive the target, the more precise your gradients potentially have to be. One way of thinking about it is that as you're cooling down, your learning rate is going down, so maybe your batch size should increase as well, because learning rate and batch size affect each other inversely. student: Is this specific finding only for language models, or also for computer vision? speaker 1: I'm not sure. There is a related OpenAI scaling paper for multimodal models, but I don't remember what it says about critical batch size for those. student: And the noise scale, is that measured empirically? speaker 1: The noise scale, at least in this figure, is a kind of theoretical analysis. It's basically about the gradient noise you expect from random sampling within the batch. So it's not a precisely, empirically measured quantity.
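To make the reparameterization idea a bit more concrete, here is a rough sketch of width-dependent scaling rules in the spirit of μP. This is simplified and illustrative only: the real recipe distinguishes more layer types and depends on the optimizer, so treat the specific constants here as assumptions rather than the paper's exact prescription.

```python
import math

def mup_like_scaling(width, base_width, base_lr, base_init_std):
    """Width-dependent init/LR multipliers in the spirit of muP (Adam-style rules).
    Simplified sketch; the real rules differ by layer type and optimizer."""
    m = width / base_width  # width multiplier relative to the tuned base model
    return {
        # Hidden (matrix-like) layers: init std shrinks like 1/sqrt(width),
        # and the Adam learning rate shrinks like 1/m.
        "hidden": {"init_std": base_init_std / math.sqrt(m), "lr": base_lr / m},
        # Embedding-like layers keep init and LR roughly fixed.
        "embedding": {"init_std": base_init_std, "lr": base_lr},
        # Output (readout) layer: keep the LR, but scale its forward output
        # by 1/m (often implemented as a multiplier in the forward pass).
        "readout": {"init_std": base_init_std, "lr": base_lr, "output_mult": 1.0 / m},
    }

# Tune once at base_width=256, then reuse the same base settings at width=4096.
print(mup_like_scaling(4096, 256, base_lr=3e-3, base_init_std=0.02))
```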
All right. So one thing I'll caution, and this is a big caution for a lot of scaling-law work, is that scaling laws are very nicely behaved for log losses. We train on next-token prediction cross-entropies, and when your scaling-law targets are those cross-entropies, it's easy and works very well. But if you're trying to do downstream tasks, if you're trying to scale directly on benchmarks, the behavior is much less predictable. So here on the left side, this is from ek's paper comparing lots of different hyperparameters and architectures: you see that the number of parameters, which in this case is a surrogate for compute, and the negative log perplexity are very nicely linearly correlated. And what this is basically saying is: it doesn't matter what your depth or width or precise hyperparameter settings are; the only thing that really matters is your total compute expenditure. That's a very simple and nice story. But then you take these models, and this was a couple of years back, so people were still doing SuperGLUE accuracy, and you ask: okay, but what's the downstream performance of these models? And now we don't see a nice linear relationship anymore; we see a totally different picture where certain models and certain architectures are much better than others. And so you might not expect exactly this kind of scaling property. And we've seen variants of this story play out in many different places. If you follow the literature on state-space models, that's one place we've seen it: state-space models show really nice predictable scaling, like the plots on the left, but for certain capabilities, like in-context learning or QA, people have shown that these models may do less well. So it's important not to take perplexity scaling as the same thing as downstream scaling, and you want to be a little cautious whenever you're doing these kinds of analyses. Okay. So maybe this is not surprising to some of you, but hopefully it is surprising and convincing to others: if we want to make lots of engineering decisions, like hyperparameter choices and architecture decisions, we can do a lot of that before training the big model. We can train models at small scale across several orders of magnitude of compute, and then scale that up in order to predict the behavior of larger models. So the scaling-law-based design procedure is pretty simple. You train a few smaller models, and these should span a couple of orders of magnitude of compute. You establish a scaling law of some kind, so you see that, at least on the models you trained, there's a clear log-log linear relationship. And then, based on this prediction, you can set optimal hyperparameters in many cases. In fact, these scaling laws often won't vary too much; their slopes will actually be the same, in which case the corollary is that you can just train a few smaller models, and the results of those small models will transfer surprisingly well to the larger models in many of these cases, but not all of them, learning rate being an important exception. Okay. So that's how you do things like hyperparameter selection and architecture selection.
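A minimal sketch of that design loop, under the assumption that the small-model losses really do follow a single power law plus irreducible error: fit the trend on a few cheap runs, then extrapolate to the target compute before spending it. All the numbers below are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: training compute (FLOPs) -> final validation loss.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
loss    = np.array([4.17, 3.84, 3.53, 3.28, 3.06])

# Saturating power law L(C) = E + A * C^(-alpha), written in terms of log C
# for numerical stability.
def scaling_law(log_c, E, log_A, alpha):
    return E + np.exp(log_A - alpha * log_c)

(E, log_A, alpha), _ = curve_fit(scaling_law, np.log(compute), loss,
                                 p0=[2.0, 5.0, 0.1], maxfev=20_000)

# Extrapolate to the big-model budget.
target = 1e22
pred = scaling_law(np.log(target), E, log_A, alpha)
print(f"E={E:.2f}, alpha={alpha:.3f}; predicted loss at {target:.0e} FLOPs: {pred:.2f}")
```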
Now I want to talk about one very important use of scaling laws, one that has had an outsized influence on how we pick the sizes of models and how we think about their data efficiency. Back in the earlier days, when people were beginning to scale these models up, there was a really core question: do we need more data, or do we need bigger models? In some sense, back in 2021 to 2023 or so, data was way more abundant than compute, so we didn't need to worry about total data limitations. The one limiting resource was compute: your total number of FLOPs for your training budget. And you can spend that resource in many different ways. You can spend it on training a small model on lots of data, or you can train one giant model on very little data. And both of those extremes seem very wasteful, right? If you have a teeny tiny model, pumping in tons and tons of data doesn't seem useful, and in reverse, a giant model trained on, like, ten tokens also doesn't seem very useful. So this was a core question for many people, and several authors simultaneously proposed joint data-model scaling laws to try to answer it. So what are those? I've been talking about scaling laws in essentially one variable up until this point. That one variable has varied; it has sometimes been parameters or data or compute, but we've not looked at joint scaling. Data-model scaling laws look like this. These two equations are both functionally equivalent to first order and describe the trade-off between the amount of data and the size of the model. The top one, from Rosenfeld, is basically saying: there is one part of the error that decays polynomially in data, one part that decays polynomially in model size, and then an irreducible error term that cannot be removed even if I scale both the data and the model to infinity. It's the same with Kaplan, but there they model the reducible error rather than including an irreducible term, so there's no constant term. This seems kind of arbitrary, because I don't think there's any top-down reason why this has to be the correct functional form, but it provides surprisingly good fits to the joint error you see across data and models. So this is, I believe, from Rosenfeld. They show this nice 3D plot: this axis is the amount of data, this is the size of the model, and this is the loss on the y-axis. The surface being fit is their functional form; the dots are their runs. It might be a little hard to see from the back, but the surface fits the dots almost exactly. And despite the fact that this functional form is kind of ad hoc, pulled out of a hat, it is surprisingly accurate. This one's from Rosenfeld as well, where they basically say: okay, I'm only going to train on essentially the small half, models that are small and data that is small, on the bottom left, and I'm going to extrapolate to models that are both large and trained with more data. And how good is that joint extrapolation? Quite good. If you look at the error, the real values are on the x-axis and the predictions are on the y-axis, and they're almost exactly right, both on ImageNet and on WikiText. So this seems pretty good. So, for a fixed compute budget, what can we do now? We go back to, for example, Kaplan. We see similar things being done here, a joint scaling of compute and data. In this case, parameters are on the x-axis and the colors represent compute.
And there's a third axis, data, that's being implicitly varied in order to vary the total amount of compute. As you shift along these curves, the parameters are varied while the compute is held constant, and so the amount of data varies. So Chinchilla, which I think many of you have heard of, is probably the reference for solving this problem. Both Rosenfeld and Kaplan came up with this kind of joint scaling functional form, and both noticed that it was possible to use these functional forms to optimize the trade-off between compute and data in various ways. But for various reasons, it's hard to fit these functional forms precisely, and the details, like the learning-rate schedule shapes being different, matter. And so Kaplan had one estimate that was quite far off from what was later, in some sense, validated to be optimal. The Chinchilla paper, by a group of DeepMind authors, was an attempt to really empirically nail down the right trade-off between the number of tokens and the model size, assuming your goal is to get the best model for the smallest amount of training FLOPs. They have three different approaches, approaches one, two, and three, for fitting different curves and making scaling predictions. These blue dots are the models they trained, and the lines predict different optimal parameter sizes for different FLOPs. Hopefully most of you know the Chinchilla ratio; it's something like 20 tokens per parameter, and that comes from exactly this. If you take each of these points and multiply the parameter count by 20, you get the token count, and multiplying parameters by tokens gives you, up to a constant, the FLOPs. As for the difference between the Kaplan results, which estimated one set of token-to-parameter ratios, and the Chinchilla ones: one of the reasons is learning-rate schedules. We know that we train models with cosine learning rates. A cosine learning rate looks something like this: it goes up, then comes back down, and then cools down all the way to a minimum learning rate at the bottom. But one thing about cosine learning rates that trips everyone up all the time is that you can't truncate them early. You have to go all the way to the end, through the cooldown phase, in order to get a valid model. If I truncate a run in the middle, that is not the same as training a model from scratch with a cosine schedule that ends at that point. And this was one of the contributing factors, there were others as well, leading to the Kaplan estimates being pretty far off from the later, more refined estimates provided by the Chinchilla paper. So what do the Chinchilla authors actually do? Well, they have three different methods of estimating the optimal trade-off between tokens and model size, and each of these methods provides different scaling coefficients: one for the model size and one for the data size. And, kind of surprisingly, they get 0.5 on both of these for methods one and two, while method three provides slightly different estimates.
The method 3 estimates are off from the others by about 0.03, but we'll talk about that a little later. Kaplan et al., as you can see, is way off from any of the three estimates, right? So we'll go over each of these methods. Each of them makes sense; they make somewhat different assumptions about scaling, but they end up with very, very similar estimates in the end. Method 1 in Chinchilla is basically to take the minimum over curves. What does that mean? You overlay all of the different training curves that you have. On the x axis is FLOPs, on the y axis is the training loss, and there are models trained at many different sizes. Each of these sizes is trained with a different number of tokens, so they reach a different total FLOPs as they go through training. Now what I'm going to do is look at the lower envelope, the set of points or checkpoints that prove to be optimal under any given compute budget. Then I can take those models and ask: what were their actual parameter sizes? You can see that total compute on the x axis, and the number of parameters as well as the corresponding token counts, form a relatively nice scaling law. So this is the minimum-envelope method. It's basically saying I expect the minimum training loss, optimized over all the model sizes, to actually be optimal in FLOPs. To call back to some earlier papers, if you look at the earlier Kaplan paper and other scaling laws, you see exactly this already being done: different models trained with different parameter counts at different compute scales, and the minimum is taken across them. We've already seen that this minimum forms a scaling law, so the method builds on the observation that the minimum across many different training curves, as a function of compute, should itself form a scaling law. Under that assumption you get fairly nice fits, and this gives one estimate, 0.5, that is quite consistent with the others. Now the other one, and I think if you were to pick a single canonical way to do the Chinchilla analysis this would probably be it, and in some ways it's the most conceptually straightforward one, is the IsoFLOP analysis. To do the IsoFLOP analysis, you pick a bunch of compute scales; each of these colors is a different amount of compute. For each compute scale I can train models with fewer parameters on more data, or more parameters on less data. So I sweep over model sizes for each of these budgets, and then I look at the minimum of each of these curves. I can either pick the minimum point explicitly, non-parametrically, or I can fit a quadratic to each curve and take the minimum of the quadratic. In either case, the argument is fairly simple: it should be the case that this minimum itself follows a predictable scaling law, and thus I can extract from it the optimal number of parameters per FLOP. I can also extract the optimal number of tokens per FLOP, which I can read out by dividing my FLOPs budget by the number of parameters. So I get those simultaneously.
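Here is a minimal sketch of the quadratic-fit variant of that IsoFLOP step, with hypothetical numbers: fit a parabola in log model size to the losses on one IsoFLOP curve and read off the compute-optimal size at its vertex.

```python
# Minimal sketch of an IsoFLOP fit: for one fixed compute budget, sweep model
# sizes N (with the token count adjusted to keep FLOPs constant), fit
# loss = a*(log N)^2 + b*log N + c, and take the minimizing N. Repeating this
# across budgets gives points that can be fit with a power law N_opt ~ C^a.
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit a quadratic in log10(N) and return the N at the parabola's vertex."""
    logN = np.log10(model_sizes)
    a, b, c = np.polyfit(logN, losses, deg=2)
    return 10 ** (-b / (2 * a))

# One hypothetical IsoFLOP curve at a fixed budget:
Ns     = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])
losses = np.array([3.10, 2.95, 2.90, 2.93, 3.05])
print(f"estimated compute-optimal N: {isoflop_optimum(Ns, losses):.3g}")
```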
And you can see that once again this gives very clean results that are consistent with method 1. We can compare with before: that one says that for the eventual Chinchilla model budget you want 63 billion parameters, this one says 67 billion parameters, and the two are quite close, right? OK, the last one, honestly, is just a little bit messier, and it goes back to that Rosenfeld paper. If you have a functional form like the Rosenfeld one, a very natural instinct is to say: I'm just going to train a bunch of models, varying both N and D, and do curve fitting. I'm going to fit this curve onto whatever comes out of my models, so I train a bunch of models and fit that 3D surface. And we know from Rosenfeld that it's reasonable, to some extent, to fit these. So you've got all these dots, which are the models; a fitted curve, which is the heat-map surface you see on the left; and then you can back out what the implied IsoFLOP curves should look like from those dashed lines. But if you look at this, hopefully you see that the curve fits here are just not quite as good as the fits in the other plots. And if you look at the coefficients, Chinchilla's method 3 gives noticeably different estimates of the model size and total token count than the others. Actually, this was a mystery to me for a long time. Some of my students asked, why is method 3 so different? And I said, I don't know, maybe scaling laws are just sometimes noisy. I don't know how many of you know this, but it's a really fun piece of trivia. Last year, some folks at Epoch AI, I don't know what motivated them to do this, were curious enough about this result that they went and tried to replicate method 3. It was very difficult to replicate because you don't have the original data for all these training runs, so they went to the extreme of looking at the plots and using a forensic tool to extract the values of the points from the plots. Based on that, they could actually replicate the original analysis. And the funny thing is, they showed that the curve fitting was the bad part: the data and the approach were good, but when the original authors fit the curve, they didn't do it quite right. The original fit had residuals that were not zero-mean; if you're familiar with regression, you know your residuals should be zero-mean, because otherwise you could shift your predictions to reduce the error. Epoch refit it properly, and when they did, the optimal estimate almost exactly matched methods 1 and 2. So this is one of those funny cases where the original authors had both the right idea and the right data, but because of a minor issue in curve fitting they got the numbers wrong, and the replication actually made the results more consistent than before. Usually replications disprove things, but in this case the replication showed that the original conclusion was right all along, which I think is a pretty cool result. OK, so the final thing I want to talk about with this set of Chinchilla results: we've been talking about training-optimal scaling, where you have a fixed FLOPs budget and want the best possible model.
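A small illustration of the sanity check behind that refit, with made-up numbers: after fitting any scaling form, the residuals should be roughly zero-mean, and a systematic offset like the one below signals a biased fit.

```python
# Minimal sketch: check that fit residuals are (roughly) zero-mean.
# A systematically non-zero mean residual means the fit is biased, so the
# implied compute-optimal N and D can be off. All numbers are hypothetical.
import numpy as np

def check_residuals(y_true, y_pred, tol=1e-2):
    resid = y_true - y_pred
    print(f"mean residual = {resid.mean():+.4f}, residual std = {resid.std():.4f}")
    return abs(resid.mean()) < tol

y_true = np.array([3.02, 2.81, 2.64, 2.51, 2.40])
y_pred = np.array([3.06, 2.83, 2.68, 2.53, 2.44])   # biased: predictions all high
print("zero-mean residuals?", check_residuals(y_true, y_pred))
```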
But really, the story has shifted. When Chinchilla and the Kaplan paper were written, LLMs were not really a product yet, so the name of the game was that everyone wanted the biggest, flashiest, most intelligent model, and they didn't care much about the inference cost of actually deploying these systems. Nowadays, what we really care about is inference cost, because these systems are actually products: they generate revenue, and there is a cost associated with that revenue. So we've seen over time that the tokens-per-parameter ratio has steadily grown. GPT-3 was around two tokens per parameter, Chinchilla moved us to 20 tokens per parameter, and for a bit people played around with numbers in that range. But very quickly people realized that what we actually care about is really good intelligence at really small parameter sizes, so people have started to scale up the number of tokens per parameter very rapidly. I think I saw yesterday that, for example, the most recent Qwen models were trained on 30 trillion tokens. People are really pushing the limits on the tokens-to-parameter ratio, because you would much rather pay the upfront training cost than pay the ongoing operating cost of running inference on a really big, expensive model. Cool. The last thing, which is a fun side note I want to end with, is that these results are pretty robust and easy to replicate. A few years back, one of my students, Ishan, was really interested in pushing diffusion models for text forward. One of the things we had to do was say: this is a whole new kind of model, we don't know what the optimal token-to-parameter ratio is, we don't even know whether this thing reliably scales, it's a totally different kind of generative model. What do we do? Well, it turns out that if you just follow the same playbook, doing IsoFLOP analyses, then for autoregressive models you get almost exactly the Chinchilla result without too much effort, and if you do the same kind of analysis on diffusion models you see very similar kinds of curves, even though it's a pretty different generative model entirely. And if you plot the minimum across these, you see very predictable scaling for both, separated by a constant offset. I don't bring this up because I particularly want to push diffusion models, but just as a case study to say that these scaling laws don't need to be cherry-picked examples; they seem to show up pretty naturally as you work on new models or new settings. OK, so to put this last part together: log-linearity is not just about one-dimensional things where we think about data. It extends to model parameters and to total compute, and that lets us make all sorts of hyperparameter and other decisions, that's the first part, and it also lets us make really smart resource trade-offs between bigger models versus more data, as we saw in the Chinchilla analysis. And it's kind of remarkable how cleanly things like the IsoFLOP analysis turn out. So, all right, that's all I've got for basic scaling laws.
We did a recap of Kaplan as well as Chinchilla today, and hopefully now you're on board with this idea of data scaling, model scaling, and using scaling laws to optimize all the aspects of your model without actually going all the way to the large-scale training runs. Thanks, and I'll see you all on Thursday.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture takes a deep dive into scaling laws for large language models (LLMs), aiming to establish simple, predictable laws of model behavior so that small-model experiments can guide the engineering of large models. It reviews the historical background of scaling laws, from statistical learning theory (e.g., VC dimension) to early empirical work (e.g., Bell Labs in 1993, Banko & Brill 2001 on data size), emphasizing that scaling laws are more than straight lines fit on log-log axes. The core content is split into data scaling and model scaling. On data scaling, the lecture argues that a power-law decay of model performance (e.g., test loss) with data size is natural, uses mean estimation and nonparametric regression as examples to explain why polynomial decay is reasonable, and notes that the slope is tied to the intrinsic dimension of the data. Model scaling covers how architecture (e.g., Transformers outperforming LSTMs), optimizers, and key hyperparameters (aspect ratio, batch size, learning rate) affect how performance changes with compute. Particular attention is given to the Chinchilla result, i.e., how to balance model size and training data under a fixed compute budget, along with its three estimation methods (minimum envelope, IsoFLOP analysis, joint fit). The lecture also introduces μP (Maximal Update Parametrization), a parametrization that keeps hyperparameters such as the learning rate stable as model width changes. It closes with a scaling-law-based design process: establish scaling relationships from small-scale experiments, then predict and optimize hyperparameters and resource allocation for the large model, while cautioning that scaling laws are less reliably predictive on downstream tasks than on the loss. Notably, driven by inference costs, the current trend is to train on far more data than the classic Chinchilla prescription suggests, in order to reach higher performance at smaller parameter counts.
Introduction and Motivation for Scaling Laws
- Core question: with a limited compute budget (for example, 100,000 H100 GPUs for one month), how do you build the best possible open-source large language model?
- Challenge: directly training and tuning large models is extremely expensive. The traditional deep-learning recipe of "train lots of big models and tune their hyperparameters" burns enormous amounts of compute.
- Goals of scaling laws:
  - Establish simple, predictable laws for language model behavior.
  - Train small models, learn how they behave, and then extrapolate to large models in order to improve engineering practice.
  - Achieve "learn on small models, then build the big model right in one go."
- Lecture contents:
  - Review the history and background of scaling laws, emphasizing that they are more than straight-line fits on log-log plots.
  - Argue that data scaling arises naturally.
  - Discuss model-size scaling and other related factors.
Historical Background and Early Work on Scaling Laws
- Roots in statistical machine learning:
  - The idea behind scaling laws resembles generalization bounds in learning theory, such as VC-dimension and Rademacher-complexity analyses, which predict how error varies with the sample size n (for example, a 1/√n decay rate).
  - Non-parametric rates also relate sample size to error; for example, density-estimation error is bounded by n^(-β/(2β+1)).
  - Theory mostly provides upper bounds on error, whereas scaling laws are the leap from theoretical reasoning (the relation between data/model size and performance) to empirical fitting (actual loss values).
- Early empirical work:
  - Bell Labs (1993, Vapnik, Cortes et al.): often regarded as "the first scaling-law paper."
    - Proposed predicting model performance without full training, to cope with the computational cost of training classifiers on large databases.
    - Its functional form (test error = irreducible error + polynomially decaying term) resembles modern scaling laws.
    - Fit curves on small trained models to predict the behavior of larger ones.
  - Banko & Brill (2001): studied how NLP system performance scales with the amount of data.
    - Showed a modern-looking scaling-law plot of data size (x axis, log scale) versus performance (y axis).
    - Argued: "perhaps we should consider the trade-off between spending time and money on algorithm development versus simply gathering more data."
  - Early thinking on functional forms (circa 2012): explored whether power laws (e.g., cubic, quartic) are the right functional form for predicting model behavior.
- Early work on large-scale neural scaling laws:
  - Hestness et al. (2017, then at Baidu):
    - Showed that across machine translation, speech, and other tasks, error falls as a power law as the amount of data grows.
    - Proposed three regions of model behavior: an initial best-guess region, a predictable (power-law) scaling region, and an asymptotic region approaching the irreducible error.
    - Emphasized that this early work already covered modern ideas: the difficulty of predicting in the random-performance region, the importance of compute constraints, and techniques such as quantization (trading compute for model precision).
- Core takeaway: predictable investments of resources yield predictable gains in capability.
Data Scaling Laws
- Definition: a data scaling law is a simple formula mapping dataset size N to the excess error (the error above the irreducible error).
- Empirical observation:
  - On a plot with dataset size on the x axis (log) and test loss on the y axis (log), model performance is linear (a power-law, or scale-free, relationship).
  - This holds across several variables, such as compute, dataset size, and parameter count (assuming the other variables are large enough to avoid saturation).
  - Kaplan's scaling-law paper shows these relationships, including on downstream tasks and out-of-distribution data.
- Why polynomial decay?
  - Monotonicity: more training data means lower error, which is intuitive.
  - Functional form: a power law means a polynomial relationship between error and data size.
  - Simple example: mean estimation (a small simulation sketch follows this list).
    - Task: estimate the mean of Gaussian-distributed data.
    - Error: the expected squared error of the mean estimate is σ²/N.
    - Taking logs: log(Error) = -log(N) + 2·log(σ), i.e., a slope of -1.
  - Slopes observed in practice:
    - Hestness et al. (machine translation): -0.13
    - Hestness et al. (speech): -0.3
    - Kaplan et al. (language modeling): -0.095
    - These slopes are much shallower than the theoretical -1 or -0.5 (as in agnostic learning), indicating slower decay.
  - Reason: nonparametric regression and intrinsic dimension.
    - Neural networks can fit arbitrary functions, much like nonparametric regression.
    - Example: nonparametric regression in 2D.
      - Partition the 2D space into √M small boxes and assign √N samples to each box; the error is then approximately 1/√N. Generalizing to D dimensions, the error is N^(-1/D), i.e., a slope of -1/D on a log-log plot.
    - Conclusion: for flexible (nonparametric) function classes, the slope of the scaling law depends on the intrinsic dimensionality of the data. A flatter slope may indicate higher intrinsic dimension or a harder learning problem.
    - Some work has tried to relate the rate of learning (the scaling exponent) to the data's intrinsic dimension, but intrinsic-dimension estimation is itself very difficult.
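The following is a minimal simulation sketch of the mean-estimation example above (not from the lecture): it measures the squared error of the sample mean at several sample sizes and fits the slope of log(error) against log(N), which should come out near -1.

```python
# Minimal sketch: empirical check that mean-estimation error decays as sigma^2 / N,
# i.e., slope ~ -1 on a log-log plot. Numbers and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 2.0, 2000
Ns = np.array([10, 30, 100, 300, 1000, 3000])

errors = []
for N in Ns:
    samples = rng.normal(0.0, sigma, size=(trials, N))
    mse = ((samples.mean(axis=1) - 0.0) ** 2).mean()   # expected squared error of the mean
    errors.append(mse)

slope, _ = np.polyfit(np.log10(Ns), np.log10(errors), deg=1)
print(f"fitted slope: {slope:.2f}  (theory: -1)")
```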
- Practical uses of data scaling laws:
  - Data composition:
    - Kaplan et al. found that data composition mainly shifts the offset of the scaling curve rather than its slope.
    - This means data-selection experiments can be run at smaller model scales to pick good datasets.
    - Techniques such as regression can use scaling laws to estimate optimal data-mixture proportions.
  - Multi-epoch training:
    - Addresses the questions "are we running out of data?" and "diminishing returns from repeatedly training on the same data."
    - Research suggests an "effective sample size": after roughly 4 epochs, the returns from repeated data fall off quickly.
    - This can be modeled by modifying the standard scaling law to distinguish effective data from unique tokens.
  - Data selection at scale:
    - The trade-off is between repeating high-quality sources (e.g., Wikipedia) and including new, possibly lower-quality data.
    - Work from CMU uses data scaling laws to trade off repeating data against selecting lower-quality new data.
- Summary: the log-log linear relationship between data and error is robust, has a reasonably clear theoretical explanation, and can be used for practical problems such as optimizing data mixtures. When discussing data scaling, the model is usually assumed to be large enough that capacity is not the bottleneck.
Model Scaling Laws
- Core question: when building a large language model, how do you choose the architecture, optimizer, and hyperparameters, and how do you allocate resources across compute, data, and model size?
- The classic Kaplan et al. work provides a large number of observations about model scaling.
- Architecture choices:
  - Transformers vs. LSTMs:
    - Compare scaling behavior by training families of LSTMs and Transformers at different compute levels.
    - Result: regardless of layer count, Transformers hold a large constant-factor compute-efficiency advantage over LSTMs (e.g., LSTMs can be roughly 15x less compute-efficient).
  - Comparisons with other architectures (Eikema & Aziz, 2023):
    - Compare whether various architectures (compute on the x axis, each architecture a red line, the Transformer baseline a green line) can match or beat the Transformer.
    - Conclusion: Gated Linear Units (GLU) and Mixture of Experts (MoE) are among the few architectures that reliably match the Transformer, consistent with trends in current state-of-the-art models.
- Optimizer choice (Hestness et al.):
  - Compares SGD and Adam.
  - Finds a similar constant-factor gap in effectiveness between Adam and SGD.
- Model aspect ratio (depth vs. width):
  - Kaplan's analysis shows that while single-layer models do badly, other depth choices perform relatively similarly.
  - There is a fairly wide near-optimal band (e.g., aspect ratios between 4 and 16), similar to the conclusion from the architectures lecture.
- Not all parameters are equal:
  - Embedding parameters: they scale differently from non-embedding parameters and do not show a clean log-linear relationship; parameter-scaling analyses usually count only non-embedding parameters.
  - MoE parameters: sparsely activated parameters also need special handling, e.g., deriving an equivalent dense parameter count.
- Scale-aware hyperparameter tuning:
  - Kaplan's results show that the optimal choices for many hyperparameters (aspect ratio, feed-forward ratio, attention-head dimension) keep a similar "shape" or trend across model scales.
  - This means hyperparameter tuning can be done at small scale and largely transfers to larger scale.
  - Important exceptions: the learning rate and batch size typically need to be adjusted with model scale.
- Batch size scaling:
  - Critical batch size: beyond some point, increasing the batch size has diminishing returns; the critical batch size is the turning point from near-perfect scaling to strongly diminishing returns.
  - Target loss and batch size: as the target loss gets lower (better models), the critical batch size grows, meaning larger batches can be used; the Llama 3 training report mentions increasing the batch size partway through training.
  - Compute and optimal batch size: Kaplan's analysis shows that as compute grows, the optimal batch size can grow with it while the total number of steps stays relatively stable.
- Learning rate scaling:
  - Standard practice:
    - The optimal learning rate changes with model scale: wider models want smaller learning rates, smaller models want larger ones.
    - Rule of thumb: the learning rate scales inversely with model width (1/width); see the small sketch after this block.
    - A more sophisticated approach is to fit a scaling law for the optimal learning rate as a function of model scale.
  - μP (Maximal Update Parametrization):
    - Reparametrizes the model (adjusting initialization, per-layer learning rates, outputs, etc. according to width) so that the optimal learning rate is (in theory) exactly stable across model widths.
    - Tune the learning rate once, then apply it directly to models of different sizes.
    - Meta's Llama family (e.g., Llama 3 or later) reportedly uses a similar technique (e.g., MetaP).
    - This is an important research direction because it simplifies learning-rate tuning.
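A tiny sketch of the 1/width rule of thumb above; this is an illustrative assumption about how one might transfer a tuned learning rate across widths, not the exact recipe used by any of the models discussed.

```python
# Minimal sketch: transfer a learning rate tuned at a small width to a larger
# width by scaling it down in proportion to the width (the 1/width heuristic).
def transfer_lr(tuned_lr: float, tuned_width: int, target_width: int) -> float:
    return tuned_lr * tuned_width / target_width

# Tuned 3e-3 at width 512; a hypothetical 4096-wide model would get ~3.75e-4.
print(transfer_lr(3e-3, 512, 4096))
```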
- A caution about downstream tasks:
  - Scaling laws behave well and predictably on log loss (e.g., cross-entropy).
  - On downstream-task performance (e.g., SuperGLUE accuracy), behavior can be far less predictable.
  - Eikema & Aziz show that although models with different hyperparameter settings line up nicely when plotting negative log perplexity against parameter count (compute), the relationship breaks down on downstream tasks, where some models and architectures do markedly better than others.
  - For example, state space models scale well in perplexity but can do worse on capabilities such as in-context learning or question answering.
  - Conclusion: do not equate perplexity scaling with downstream-task scaling.
A Scaling-Law-Based Design Process
- Train a small number of small models: they should span several orders of magnitude in compute.
- Establish the scaling law: confirm that a clean log-log linear relationship holds across the small models you trained (a small extrapolation sketch follows this list).
- Set optimal hyperparameters:
  - Set them based on the predictions.
  - In many cases the slope of the scaling law stays constant, so tuning results from small models transfer well to larger ones.
  - Important exceptions: the learning rate and similar quantities.
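A minimal sketch of step 2 above with hypothetical numbers: fit a straight line to log loss versus log compute for the small runs and extrapolate to a larger budget. (This ignores the irreducible-error term, so it is only a first-order check.)

```python
# Minimal sketch: check log-log linearity on small runs and extrapolate.
# All compute budgets and losses below are made up for illustration.
import numpy as np

compute = np.array([1e17, 1e18, 1e19, 1e20])   # FLOPs of the small runs
loss    = np.array([3.8, 3.3, 2.9, 2.55])      # their final losses

slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

def predict(C: float) -> float:
    """Extrapolated loss at compute C under the fitted power law."""
    return 10 ** (intercept + slope * np.log10(C))

print(f"slope = {slope:.3f}, predicted loss at 1e23 FLOPs ~ {predict(1e23):.2f}")
```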
Joint Data-Model Scaling (the Chinchilla Law)
- Core question: under a fixed total compute budget, should more resources go to data or to a bigger model?
  - In the earlier period (2021-2023), data was far more plentiful relative to compute, so compute was the main limiting factor.
  - Both extremes (tiny model + enormous data, or giant model + very little data) are inefficient.
- Functional form of joint data-model scaling laws:
  - Rosenfeld et al.: Error = A/N^α + B/M^β + C_irreducible (N is the data size, M the model size).
  - Kaplan et al.: a similar form, but modeling only the reducible error, with no constant term.
  - These functional forms are somewhat ad hoc, yet they fit the joint error over data and model size very well.
  - Rosenfeld's results show that curves fit on small models and small data extrapolate well to large models and large data.
- The Chinchilla paper (Hoffmann et al., Google): aims to pin down the optimal trade-off between token count and model size when the goal is the best model for the fewest training FLOPs.
  - The Chinchilla ratio: roughly 20 tokens per parameter.
  - Why it differs from Kaplan's estimate: partly due to learning rate schedules; Kaplan's earlier estimate deviates substantially from Chinchilla's.
    - Cosine learning rate schedule: a run must be completed to produce a valid model and cannot be truncated early. Truncating early is not equivalent to training from scratch with a shorter cosine cycle. This is one factor behind the bias in Kaplan's estimate.
- Chinchilla's three estimation methods:
  - Method 1: minimum over curves (lower envelope)
    - Overlay the training curves of models of different sizes (FLOPs on the x axis, training loss on the y axis).
    - Take the checkpoints that are optimal at every compute budget (forming the lower envelope).
    - The parameter sizes of these optimal models, and the corresponding token counts, form a clean scaling law against total compute.
    - The method rests on the observation that "the minimum training loss, optimized over all model sizes, is optimal in FLOPs."
  - Method 2: IsoFLOP analysis
    - Pick a series of fixed compute budgets (IsoFLOP curves, one per color).
    - For each budget, sweep over model parameter counts (inversely varying the data so total compute stays fixed).
    - Find the minimum-loss point on each IsoFLOP curve (either picked explicitly or via a fitted quadratic).
    - These minima should themselves follow a predictable scaling law, from which the optimal parameters and tokens per FLOP can be extracted.
    - This is the most canonical and conceptually direct form of the Chinchilla analysis.
  - Method 3: joint fit
    - Like Rosenfeld's approach: train a family of models varying N and M, then fit a 3D loss surface.
    - In the Chinchilla paper, this method gave model-size and token-count estimates noticeably different from the other two.
    - Follow-up work (Epoch AI): by forensically extracting the data points from the original figures and refitting, they found that the curve fitting for method 3 in the original Chinchilla paper was flawed (residuals with non-zero mean). With the fit corrected, its results agree closely with methods 1 and 2.
- Evolution of the token-to-parameter ratio (a small allocation sketch follows this list):
  - Early on (e.g., GPT-3), roughly 2 tokens per parameter.
  - Chinchilla pushed this to about 20 tokens per parameter.
  - Current trend: because inference cost matters more and more (models are now products with operating costs), researchers train comparatively small models on many more tokens to get high capability at small model sizes; for example, the latest Qwen models are reportedly trained on 30 trillion tokens.
  - The goal has shifted from "training-optimal" to "inference-optimal."
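A small worked sketch of what a given tokens-per-parameter ratio implies, using the common C ≈ 6·N·D approximation for training FLOPs; the budget and ratios below are illustrative only.

```python
# Minimal sketch: given a training FLOPs budget C and a tokens-per-parameter
# ratio r, the C ~ 6*N*D approximation gives N = sqrt(C / (6*r)) and D = r*N.
def allocate(C: float, tokens_per_param: float):
    N = (C / (6 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# Compare a Chinchilla-like ratio with more inference-lean ratios at 1e23 FLOPs.
for r in (20, 100, 200):
    N, D = allocate(1e23, r)
    print(f"r={r:>3}: params ~ {N:.2e}, tokens ~ {D:.2e}")
```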
Robustness and Reproducibility of Scaling Laws
- Case study: diffusion models for text
  - Work by the lecturer's student Ishan, applying diffusion models to text generation.
  - Faced with a new model class, the optimal token-to-parameter ratio was unknown, as was whether it scales reliably at all.
  - The same IsoFLOP-style analysis used for autoregressive models was applied.
  - Result: diffusion models also show Chinchilla-like, predictable scaling curves, even though the generative mechanism is entirely different. The scaling curves of the two model families (autoregressive and diffusion) are parallel, separated only by a constant offset.
  - Conclusion: scaling laws are not limited to cherry-picked cases; they seem to emerge naturally when studying new models or new settings.
Conclusions and Recap
- Core ideas:
  - Data scaling and model scaling (parameters, total compute) are both log-linear.
  - This lets us make many hyperparameter and architecture decisions before training the large model.
  - Scaling laws let us make smart resource trade-offs, for example between model size and data (as in the Chinchilla analysis).
- Main payoff: optimize every aspect of the model through small-scale experiments, without running the full large-scale training.
- This lecture recapped the Kaplan and Chinchilla work, emphasizing data scaling, model scaling, and the use of scaling laws to optimize models.