speaker 1: I'm going to talk a little bit about scaling laws. Originally I think we were going to talk about inference, but I'll take a few minutes to start on scaling laws and then we'll figure out where to go from there. Okay. So to begin with, I want you to put yourself into the following scenario. You have a very rich friend, and he or she has given you, let's say, 10,000 H100s for a month. And you have to build the best open-source LLM that you can, right? This is a somewhat hard task, and we've given you some of the tools you need to make progress on it. You can put together your infra team and your systems people and build a distributed training framework in the next assignment. After that, you're going to put together a great pre-training dataset, and you know all about your architectures and so on. So you have all the pieces, and you can turn the crank and run the big model. In the first couple of lectures we talked about all the various decisions you might make along this journey, right? What's the architecture? What are the hyperparameters? How are you going to do all these things? And in some ways the answer I gave you in those early lectures was just: pick what other people have done. Just follow Llama or whatever other models. But that's a very boring answer, because it doesn't let you push the frontier. If you're in a big frontier lab and you're going to build the best model, you don't want to just copy other people, you want to innovate. So how do we innovate and arrive at these optimized solutions in the first place? That's going to be the point of scaling laws. What we want to do is build simple predictive laws for the behavior of language models. Scaling laws are basically this whole idea of being able to take small models, scale them up, and use that to improve your engineering. One way of thinking about this: the old and unpleasant way of doing deep learning is to just train a bunch of big models, tuning your hyperparameters at that scale so that your big models are good. That's going to cost tons and tons of compute; you can't really do that easily. The new mindset, if you've been following a lot of these developments on scaling, is: we're going to train a bunch of small models, learn a lot of things from those small models, and then extrapolate back up to bigger models. We take our smallest models at the left side of the compute scale, learn a lot about what to do, and then nail it in one go when we build the big model. The first place I want to start is the history and background of scaling laws. I want to contextualize this because when people talk about scaling laws, it's often done in very messianic, AGI terms: scaling laws tell you that these amazing things are log-linear forever and we will achieve superintelligence or something. But I think scaling laws are actually much more grounded and have a lot of interesting history.
And so I'm going to start there, to try to convince you that scaling laws aren't just fitting lines on log-log plots, although that is a very big part of what we're going to do. And then, in fairly easy steps, I'm going to try to convince you that, at least for data, scaling laws are a very natural thing to think about and expect. As a person brought up in statistical machine learning, my starting point is going to be statistical machine learning. What are scaling laws? In some ways, scaling laws are telling us that as we increase the amount of data or change the model size, we expect certain behaviors out of the model. And if you go back to something like machine learning 101, and you remember your VC dimensions and Rademacher complexities and so on, in some ways that's the theory version of exactly this. On the top you have a generalization bound for the excess risk of learning among a finite set of K hypotheses, and we see that that should scale as one over the square root of the number of samples. In some ways, that's a theoretical version of a scaling law, where we're making predictions about how fast our errors should decay as a function of n. On the bottom we might have something a little more exotic: if we're doing generative modeling and our generative model is a really flexible nonparametric class, we might instead fit some sort of smooth density. In that case, the prediction is that the L2 error of estimating the density is upper bounded by a polynomial in n, something like n to the minus beta over two beta plus one. This is what people call nonparametric rates. So theorists have been thinking for a very long time about how sample size, especially, should relate to error. This is a very classic problem in machine learning theory, but these are upper bounds, not actual realized loss values. And scaling laws are, in some sense, the leap from the theoretical side of how data and model size should relate to performance, to the empirical side of saying: actually, our bounds are bad, but maybe we can fit these things empirically. And here's a fun, or at least arguable, trivia fact: what is the first scaling law paper? Not many papers cite this one, but I think probably the right answer is a paper from 1993, at NeurIPS, from Bell Labs. You might recognize some of these names; these are theorists and some of the people who did really classic work in machine learning theory, like Vapnik and Corinna Cortes and others. I took an excerpt, because I was reading this paper while preparing this lecture, and it struck me how, in many ways, this paper was ahead of its time. It says training classifiers on large databases is very computationally demanding, and we need to figure out which ones are good before actually training them. So what they do is propose a new predictive method that predicts how good a model is going to be without actually training the whole thing. That sounds a lot like scaling laws, and you'll see this later: they even have a functional form that's basically saying the test error of a model is expressible as some irreducible error plus a polynomial decay term.
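To make the formulas being described here concrete, the following is a rough reconstruction of the three quantities mentioned: the finite-hypothesis-class bound, the nonparametric density-estimation rate, and a 1993-style learning-curve form. The exact notation and constants on the original slides may differ.

```latex
% Classical excess-risk bound for learning over a finite class of K hypotheses with n samples:
\mathbb{E}[\text{excess risk}] \;\lesssim\; \sqrt{\tfrac{\log K}{n}} \;\sim\; \tfrac{1}{\sqrt{n}}

% Nonparametric rate for estimating a \beta-smooth density in L_2:
\lVert \hat{p}_n - p \rVert_2 \;\lesssim\; n^{-\beta/(2\beta+1)}

% Learning-curve functional form in the spirit of the 1993 Bell Labs paper:
E_{\text{test}}(n) \;\approx\; E_{\infty} + a\, n^{-\alpha}
```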
And you're like, huh, that looks a lot like a modern scaling law. They even do the thing where they train a bunch of small models, fit their curves, and say, oh, we can accurately predict the behavior of the model further out. So, as with many things, scaling laws were partially thought about at Bell Labs way back when. And of course there are others who thought about related ideas in scaling, not just scaling laws but also the modern mindset of thinking about scaling. Another paper that often gets mentioned in the history of scaling laws is Banko and Brill, who studied how the performance of a certain kind of NLP system scales with the amount of data. And they have what looks very much like a modern scaling law: log-scale data on the x axis, performance on the y axis. They're basically arguing, look, we can get really dramatic performance improvements just by scaling up data, it's very predictable, and maybe we should reconsider the trade-off between spending time and money on algorithm development versus just collecting more data. And you're like, huh, that sounds a lot like what a lot of this pre-training stuff is thinking about. And then finally, one of the things that people have thought about recently and in the past is: is this thing really predictable? What are the right functional forms? As early as around 2012, people were asking whether these things are actually predictable, and whether power laws with particular exponents are really the right functional forms for predicting the behavior of models. And all of this, just to remind you, is about the behavior of models on the y axis, the capabilities, as a function of the amount of data you have on the x axis. So that's the relationship that has been classically studied, what you might call data scaling, in all these cases. And if you're interested in the earliest large-scale neural scaling law paper, that would probably be Hestness et al. in 2017. I believe they were at Baidu when they did this work. They showed that for a range of tasks, machine translation, speech, and I think some vision tasks, error rates fall as a power law. And they have this nice plot that I really like to refer to when people are discussing scaling laws, which says your expectation should be that there are three different regions in the behavior of a model. Initially you start out in the best-guess region, around random performance. You then enter a region where you're predictably scaling the model; that's the power-law region. And then there's another asymptotic region where you're approaching the irreducible error of your model class. And I'll highlight that in the last few years there's been a lot of talk of new phenomena, things like emergent capabilities, or scaling compute being a new thing, or systems being really important. But if you had been reading Hestness et al. 2017 carefully, you would have seen essentially all of these things. They say it's really hard to make predictions with a scaling law when models are at random performance, because suddenly you can leave the random region.
They talk about computational limits: if we can scale, that means scaling by compute is really important. And they even say things like, maybe we should do things like quantization, because if we have predictable scaling, then we should be willing to pay for model accuracy with compute. These are all very modern ideas that a lot of the early scaling law papers understood fairly intuitively, because once you see these plots, you see that with predictable resource investment you get predictable capability improvements. So that's, in some sense, the core, not quite history, but context that has really shaped scaling laws. All right, any questions so far on the context? This is mainly just data scaling, but I wanted to make sure we go over it carefully. Yes, a question: it seems pretty natural to expect scaling, but are there cases where there isn't scaling, where things don't get better? Yeah. So the question was: it's natural, or arguably natural, to expect scaling; are there cases where we don't get scaling, or we get different kinds of scaling? One way of thinking about this is that if you're measuring training loss, or held-out versions of training loss, then scaling is very natural. All of classical statistical theory says these things should converge, and as they converge they will eventually get better, at least in a very asymptotic sense. But we do see non-scaling behavior. There was a really interesting competition a few years back called the Inverse Scaling Prize, where they were looking for things that scale inversely as models get better. A lot of these are fairly niche things; for example, models tend to copy better as they get stronger, so if you want to suppress copying behavior, that becomes really hard for really strong models. But one thing that ties a lot of that together is that if you go really far out of distribution, where the behavior is not well specified by the data, then you can get all sorts of behaviors: no scaling at all, or inverse scaling, or what have you. So in some sense you can think of this as the extension of the classic deep learning robustness problems. Cool. Okay. So now I'm going to talk about the scaling behaviors of LLMs, essentially going through several kinds of empirical results. I'm going to walk you through data scaling in particular, with some examples, just to convince you that this is a very natural object to expect. And then we'll talk about model size, which is a different kind of thing. So scaling laws, I think, are fairly well established, and they seem to appear very often in many variables. You see scaling in compute on the x axis; these plots are all taken from the Kaplan scaling law paper, which I'll refer to extensively in this lecture. The x axis here is log compute, the y axis is log test loss. And on the right you see similar kinds of scaling for both dataset size, the amount of data, and parameters.
One subtlety I'll mention as I talk through this: when we scale things like dataset size or parameters, we're always assuming that the other variable is much, much bigger. If you're scaling dataset size, the model size is much larger, so the data can't saturate it; obviously, if you have way more data than parameters, eventually you're going to asymptote. So in all of these, we're trying to avoid the asymptotic regime. These relationships also hold in pretty non-standard settings: they hold for downstream tasks, and they hold out of distribution, which is what's being shown here from the Kaplan paper. So in some ways, power law relationships seem to appear more often than we might initially expect, especially for these OOD or other variables. I want to talk through data scaling laws first because I think they're the most intuitive; at the very least, the theory for them is fairly clear. To be precise, when I say something like data scaling, what I mean is some sort of simple formula that maps dataset size, which I'm going to refer to as n, to our excess error, where excess error is the error beyond the irreducible error. And if you recall that figure I referred to from Hestness, what we expect are monotonic, logistic-looking curves. Our interest is primarily going to be in the power-law region through to the irreducible error region. Of course, it's very interesting to also ask what happens in the small-data region as we leave random guessing, but that's much, much harder to reason about, whereas for this right tail I can hopefully convince you that power-law scaling is a very natural thing to expect. So the first empirical observation, and this is the thing I'm going to convince you is natural, is that when we plot dataset size on the x axis and test loss on the y axis, then on a log-log plot model performance is linear. You might call this scale-free, or you might call it a power law; those are more physics-oriented terms. This was established by many people, but you might refer to Kaplan to see many examples of it. Now, as the previous question brought up, we kind of expect error to be monotone: we train on more data, error goes down, fairly obvious. The part that is less obvious is the precise functional form of this scaling. When I say it's a power law, I mean it's linear in log-log space. And what's the implication of that? If something is linear in log-log, that means there's a polynomial relationship between your x axis and your y axis. And why is polynomial decay natural? Well, I'm going to walk you through two examples, and both of them are going to result in some fairly natural polynomial decay. I'll start with the simplest possible example; this is stats 101 rather than machine learning 101. What I want to do is estimate the mean of a dataset. Estimating the mean is a task of estimating a parameter, so I can ask: what's the scaling law? What's the error of my mean-estimation task as a function of data? So I can write that down. My input comes from a Gaussian, and the task is to estimate the average; I've written those out in the blue box above. And what's the error?
Well, by very standard arguments, the average is also going to be distributed as a Gaussian, with the variance divided by n. So I get sigma squared over n as my estimation error; this is the expected squared error of my estimate. And if you look at this, it's polynomial in n. Just to really drive the point home, if you take the log of both sides, the log of the error on the left and the log of the right-hand side, you get exactly: log of error equals negative log n plus two log sigma. So this is exactly the kind of thing we expect, and we expect a slope of negative one if we were to fit a scaling law for mean estimation. So now, equipped with this knowledge, you might say: all right, I'm going to go around and look at what the rates are for estimating different things, and that will tell me what I should expect for data scaling. You might expect one over n; you might expect one over square root of n for agnostic learning, and so on and so forth. So we should expect to see some pretty nice round numbers for the slope on a log-log plot, something like one or 0.5. What do we actually find empirically when we look across these papers? Just to call them out: in Hestness, for machine translation we see negative 0.13, for speech we see negative 0.3, and for language modeling we see an exponent of negative 0.095. Those are all much, much slower than the one over n or one over square root of n rates you might expect when you're just fitting simple functions. So why might this be? Okay, this will be the last math slide of this lecture, and then we can go to just fitting lines on log-log plots the rest of the time, but this will hopefully drive the point home of why we might see these particular slopes. We know that neural nets aren't just estimating a mean, or even fitting a linear regression; they can fit arbitrary functions. So let's turn that into an example and work through it. My inputs are x one through x n; I have n samples, and I'm going to put them uniformly in the 2D unit box. And I want to estimate some arbitrary regression function y equals f of x. I'll assume f is smooth and so on, if you really want to be precise; there are some regularity conditions here. A simple approach to estimating a regression function f is just to cut the 2D space up into small boxes, and within each box I measure the average of the y values. A very simple nonparametric regressor is to just cut the space up and estimate within each cell. Now, informally, if I pick square root of n boxes, each box is going to get square root of n samples, and my error is going to go as one over square root of n. And if you follow this logic through to more dimensions, you'll see that in d dimensions the error goes roughly as n to the negative one over d, and so my overall scaling, if I were to take log-log plots of the whole thing, has a slope of roughly negative one over d. And so why did I walk you through this example?
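A minimal simulation sketch of the mean-estimation case (not from the lecture; the sample sizes and sigma are arbitrary): draw Gaussian samples, measure the squared error of the sample mean at several n, and check that the fitted log-log slope is close to negative one with intercept about two log sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
ns = np.logspace(2, 5, 7).astype(int)                # sample sizes from 1e2 to 1e5
errors = []
for n in ns:
    samples = rng.normal(0.0, sigma, size=(100, n))  # 100 repetitions per n
    mean_estimates = samples.mean(axis=1)            # true mean is 0
    errors.append(np.mean(mean_estimates ** 2))      # expected squared error of the mean

slope, intercept = np.polyfit(np.log(ns), np.log(errors), 1)
print(f"slope = {slope:.2f} (theory: -1), intercept = {intercept:.2f} "
      f"(theory: 2*log(sigma) = {2*np.log(sigma):.2f})")
```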
I walked you through this example because if you have flexible function classes, what people call nonparametric function classes, you expect dimension dependence, and therefore the slope of the scaling law moves much more slowly. In some sense, the slope is telling you almost precisely the intrinsic dimensionality, or the ease of learning, of the task. And people have argued this more formally, or more literally: there have been several theory-slash-empirical papers arguing that the reason we get these exotic or non-standard rates of learning is that they're closely connected to the intrinsic dimensionality of the data. And, for example, the plots of these predictions, the dashed lines, and these purple circles are somewhat close, although you don't want to read too much into this, because estimating intrinsic dimension is an extremely difficult problem, arguably as difficult as modeling the data overall. Okay, yes, a question, I guess related to that point: how can you generate data that has a known underlying intrinsic dimension, from a simulation perspective? Yeah. So for the results here, if you just want to generate such data, that's actually not too hard: you could write down a function that takes in, say, five variables, and as long as those five variables don't cancel each other out, that's a five-dimensional surface; you add a little bit of noise and you're good to go. The difficulty here is that they're doing things like training on CIFAR and then trying to estimate the intrinsic dimensionality of CIFAR; that's a much harder task. Okay. And data scaling laws are quite useful. I was going at this from a let-me-explain-scaling-laws perspective, but you can actually use scaling laws to do many interesting things. You can make engineering decisions of various kinds using data scaling laws, and people do in fact do this. For example, you might ask: how does dataset composition affect performance, not just dataset size? Well, if you're changing the data distribution or the test set, Kaplan et al. have a really nice figure showing that data composition only affects the offset, not the slope. And what that means is that if you want to pick a really good dataset, you don't necessarily have to train your models at huge scale; you can scale them down and do your data-selection experiments on much smaller models. And as we mix different data sources, we might expect certain kinds of shapes, and you can use regression and other techniques to try to figure out, for example, optimal data mixing using scaling laws. People have written several papers on this topic, although, as with all data-selection research, a lot of this seems fairly tricky to execute reliably. There are other interesting questions you might ask, too. There's a lot of discussion these days about whether we're running out of data on the Internet. And once you start asking those questions, another interesting and important question is: can we just keep training on the same data we have? What are the diminishing returns of that? And so there's interesting work extending scaling laws to multi-epoch training, basically arguing that there's a sort of effective sample size.
And after about four epochs, you have rapidly diminishing returns as you repeat more and more data. By modifying the usual scaling law, you can get a version with an amount of effective data and unique tokens that diminishes as you increase the amount of repetition. Finally, one interesting combination of these two ideas is data selection in the large-data regime. Imagine you're going to be training on trillions and trillions of tokens. What would be better: to repeat high-quality sources like Wikipedia, and perhaps your secret pirated books, ten times, or to include new data? The fact that you can either repeat data or include more data means there are now multiple axes on which you can optimize your data mixture. And there's been some interesting data scaling work, this one from CMU folks, on essentially trading off between repeating data versus picking new, lower-quality data. All of this really is a natural extension of what I already taught you: if you assume there's a predictive power law relationship, and that this power law relationship holds on a per-mixture basis, then you can fit these scaling law extrapolations and get an estimate of how good your data is going to be at scale. So that's the starting point, which is data scaling. Hopefully I've convinced you at this point, both empirically and conceptually, that it's natural to have log-log linear relationships between data and error. This relationship seems to hold very robustly across domains and across different kinds of models, and you can have a fairly clean theoretical understanding of what is happening. And once you have this, you can use it for all sorts of purposes, like picking optimal data mixtures or whatever else. Okay, yes, a question about how the model size is picked relative to the data. Yeah, so as I was saying back on an earlier slide: when we think about data-size scaling, the model is always picked to be really, really large, so the data is not saturating your model; you want to avoid being in the irreducible error regime. So the model is always picked to be large enough that you're in the power-law region whenever you're only varying data. And is it one really big model size for all of them, or is each point a different-size model? Yeah, for this plot in particular, it's one big model size. When you're looking at, for example, compute scaling on this axis, then data and model scale jointly at some preordained ratio. Cool, any other questions? Good. Okay, excellent. All right. So now we move from data scaling to, in my opinion, slightly more mysterious kinds of scaling, and we'll talk about model scaling next. I think this is a more practical, engineering set of questions that we're now going to try to answer. So you're in charge of building and shipping a really large language model. There are a lot of interesting ideas out there: you could train the latest state space model, you could train a transformer, you could use Adam, you could use SGD. People invent all sorts of new tricks. Which ones are worth scaling up, and which ones are not?
You could also take your limited compute resources and spend them on different things. You could train models for longer, or you could train bigger models; for a given FLOP budget, you can trade between the two. And you could also do things like go collect more data versus getting more GPUs. There are a lot of different things you can do, and scaling laws give you a pretty simple procedure to answer all these questions. So I'll go through the classic Kaplan scaling law paper. If you're interested in these topics, I encourage you to read it; it's a gold mine of these kinds of observations. Some of it is old, but it's, I think, still unmatched in the thoroughness of all the things it studied in a fairly nice unified setting. Architecture-wise, you might start by asking: transformers versus LSTMs, which one's better? Well, the brute-force way would be to scale LSTMs up to GPT-3 level, and then you can figure out whether they're good or not. The scaling law way is much simpler: you train a bunch of LSTMs and transformers across many different compute levels, and then you see what happens as you scale them up. And I think the trends here are fairly clear. No matter how many layers you have in your LSTMs, there's a pretty big constant-factor gap between transformers and LSTMs. And remember, this is in log scale. So this is saying something like, I don't know what the exact numbers are, but imagine it's 15 times less efficient; then no matter where you are on this plot, the LSTM is, let's say, 15 times less compute-efficient than a transformer. So there's a constant-factor compute penalty to using LSTMs, at least in this plot. You could zoom out and say, well, there are a lot more architectures; which ones are really good and worth doing? And some of the classic papers have done exactly this kind of scaling work, where they took a bunch of architectures, on the right here, and basically scaled them up. So the x axis is the amount of compute, the red line is each alternative architecture, and the green line is the transformer baseline. And they ask: can any of these alternative architectures match or out-scale the transformer? And what do they end up with? Actually, the only things that seem to really strongly and reliably beat the transformer are gated linear units and mixture of experts. And wouldn't you know it, that's exactly the kind of stuff people are doing today. So this is the scaling law version of that same idea: how would you have come to the conclusion that we should be doing switch transformers and GLUs, and not, for example, the Performer? The scaling law provides some clear evidence for why you might want to do that. Optimizer choice, I think, follows a similar pattern. This one's from Hestness; they compare SGD and Adam, and they find, very similar to before, a constant-factor gap, in this case in dataset size, but of course that translates to compute, in the effectiveness of Adam versus SGD. RHN here is recurrent highway networks; you can ignore the details. The point is how you would do this analysis, rather than the specific results shown here.
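One way to read those constant-factor gaps: if two architectures' loss-versus-compute curves are parallel on a log-log plot, the vertical offset translates into a fixed compute multiplier. A small worked identity (the symbols here are generic, not taken from any of the papers):

```latex
% If two architectures follow parallel power laws in compute,
%   L_A(C) = a_A\, C^{-\alpha}, \qquad L_B(C) = a_B\, C^{-\alpha},
% then matching losses, L_A(C_A) = L_B(C_B), gives a constant compute multiplier:
\frac{C_B}{C_A} \;=\; \left(\frac{a_B}{a_A}\right)^{1/\alpha}
% i.e. a fixed vertical offset in log-log space means architecture B needs a fixed
% multiple of architecture A's compute at every loss level.
```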
In the beginning I also said something like, depth versus width: what should the aspect ratios be? That was one of the hyperparameter topics we talked about. And we see a similar analysis, but in scaling law form, from Kaplan. This one is intriguing to me, at least, because you might think that deeper models get dramatically better, that there's clear separation between the numbers of layers. But we see, at least here, that there's actually a lot of slop. One layer is really bad, but a lot of the other layer choices remain pretty stable. Hopefully this is reminiscent of that slide I showed back in the architecture lecture, where I said the aspect ratio, the ratio of width to depth, roughly something like four to sixteen, was a pretty natural number, but there's a really wide basin in which you're approximately optimal. The scaling law analysis backs that up. One important subtlety that I do want to point out, and this one bites people every now and then, is that not all parameters are equal. Often you want to do parameter-scaling analyses, but if you count embedding parameters as part of your model, you get a pretty different scaling law; you get this kind of weird-looking thing that bends over here. Whereas if you only count the non-embedding parameters, you see the much cleaner result I showed you before. So embedding parameters don't behave the same, and they don't show the same kind of log-log linear scaling as the non-embedding parameters when you account for them. There's related work on the theme that not all parameters are the same in recent papers on scaling mixtures of experts, where people are trying to figure out what it means to be a parameter when you have sparsely activated parameters. In those papers they try to derive something like an equivalent number of dense parameters, in order to normalize the parameter count of an MoE. I showed you this plot earlier in the hyperparameter selection lecture, but hopefully now you see the full context, not just the original hyperparameter-choice question. In many cases, let me go back to here, what we'll see are scaling law curves that look like the following: the slopes of the curves remain very similar, they're non-crossing, and there are constant-factor offsets between them. And whenever that's the case, what you can do is take a slice at a particular level of compute, or a particular set of hyperparameters, analyze the hyperparameter trade-offs very carefully, and be reasonably safe in scaling that up. So when you go to Kaplan's paper, you'll see exactly these kinds of analyses being done; especially, I think, the center one, the aspect-ratio plot, is definitely worth looking at. They're not just scaling models up and down; they're taking different slices, different-sized models, roughly 50 million to 170 million up to around 1.5 billion parameters, and looking at how the aspect ratio changes the loss. And they see that the shape of the curve, not just the scaling slope, remains similar.
And this means that I can pick an aspect ratio between ten and 100, and anything in between will work fine at all of these different scales. This is, I think, important to think about. Initially, when you're trained in deep learning and model training, you think about hyperparameter tuning, but you want to be scale-aware in how you tune your hyperparameters. And that's a really big difference in mindset between the scaling law style approach and what you've maybe been trained to do, or naturally think about, in terms of just tuning these models at a small scale. The same thing is done for the feedforward ratio and for the attention head dimension: you vary these aspects across scales and check whether the minimum stays in a similar place. Okay, another important thing: next, actually maybe not next lecture but the lecture after, I'm going to talk about practical case studies of how people have scaled up models. And we'll see that batch size and learning rate are two really tricky things you have to deal with carefully when you scale models up. When you scale models up, the optimal learning rate will be different across model scales, and if that's the case, then the optimal batch size might end up varying as well, because those two are often co-linked. So we need to think about the right way of scaling batch size, how batch size interacts with scale, and also learning rates. I'll talk about those for the next couple of slides. So, batch size: from the systems lecture, hopefully you remember it has diminishing returns past a certain point. Up until a certain point, when the batch size is smaller than the noise scale, we're on the left-hand side here, and increasing the batch size is almost equivalent to taking one more gradient step. That's roughly saying: if I double my batch size, it's as good as taking two gradient steps. And that's a really, really good place to be, because now you've got the systems power of being able to parallelize across the batch while having the optimization efficiency of taking two steps. But past a certain point, you're going to have ineffective scaling, where your noise scale and your batch size are about the same, and the additional samples in your batch aren't reducing useful noise; you're getting dominated by the curvature, the bias term so to speak, of your optimization landscape. One really useful object for thinking about this is the notion of a critical batch size. The critical batch size you can think of as the threshold point where we go from near-perfect scaling to strong diminishing returns. You can analyze this in theory, and the OpenAI papers on critical batch sizes do this, but you can also analyze it empirically. This is another thing that's been studied in the scaling law kind of way: you can empirically estimate the point at which progress slows, the critical batch size trade-off point, and you can do this as you train bigger and better models.
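As a rough illustration of that empirical estimation, here is a minimal sketch of a gradient-noise-scale style estimate of the critical batch size, in the spirit of the OpenAI large-batch training work. Everything here is synthetic and assumed: `grad_norm_sq` stands in for a hypothetical helper that runs one forward/backward pass at a given batch size and returns the squared norm of that minibatch gradient.

```python
import numpy as np

def estimate_noise_scale(grad_norm_sq, b_small=8, b_big=512, n_trials=200):
    """Estimate the 'simple' noise scale B_noise ~ tr(Sigma) / |G|^2, which
    approximates the critical batch size. Uses E[|g_B|^2] = |G|^2 + tr(Sigma)/B
    measured at two batch sizes."""
    g_small = np.mean([grad_norm_sq(b_small) for _ in range(n_trials)])
    g_big = np.mean([grad_norm_sq(b_big) for _ in range(n_trials)])
    true_grad_sq = (b_big * g_big - b_small * g_small) / (b_big - b_small)
    trace_sigma = (g_small - g_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / true_grad_sq

# Toy stand-in for a model: per-example gradients are G plus Gaussian noise.
rng = np.random.default_rng(0)
G = np.array([1.0, -2.0, 0.5])                      # "true" full-batch gradient
def toy_grad_norm_sq(batch_size):
    noise = rng.normal(0.0, 3.0, size=(batch_size, G.size))
    g = G + noise.mean(axis=0)                      # minibatch gradient estimate
    return float(g @ g)

# True value here is tr(Sigma)/|G|^2 = 3 * 9 / 5.25 ~ 5.1, up to sampling noise.
print(estimate_noise_scale(toy_grad_norm_sq))
```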
And one really interesting thing is that as you try to improve the loss, so you're moving toward the left here, making the loss better and better, the critical batch size ends up getting bigger. So the smaller the loss target, the bigger the batch size you can use. One of the things this leads to: if you look at the Llama 3 training report, you'll see that they increase the batch size after a certain point, they increase the batch size as they train, because as your loss target gets smaller, your batch sizes can in turn get bigger. So as we increase both compute and model size, what's the right thing to do? Once again, we can do a scaling analysis; this is from Kaplan. You can try to figure out, as we increase the amount of compute, what is the optimal batch size? And what we see is that as we increase the amount of compute, we can actually get reasonable parallelism: the total number of steps can stay roughly the same, at least within this compute range, while the batches get bigger and bigger. And if you fix the batch size instead, of course, the number of steps is going to go up and up. So this is good news, hopefully, for data-parallel processing. So that's the batch size story. The things you should maybe remember, because I think critical batch sizes are kind of a messy concept, are: first, there's a diminishing-returns point, the critical batch size, that's one thing; and second, it does seem to follow a pretty predictable scaling as a function of your target loss. And given that, you can figure out the right trade-offs between systems efficiency and optimization progress. As I said before, the other aspect of this is that you've got your batch size and then you've got your learning rate, and those two are fairly closely linked. I'm going to talk about muP at much more length in the next part of the scaling lecture, but this is a really important broader idea. So you could do one of two things, and I think this figure lets me talk about both of them. Let's look at the left plot first, the one labeled standard practice. When you train a transformer, what you're basically going to see is something like this left plot, the standard practice: the optimal learning rate sits at a different point for each width. The wider the model, as you increase your model size and your MLPs get wider and wider, the smaller the optimal learning rate gets. And as you make your model smaller and smaller, your losses of course go up, because your model is less expressive, but the optimal learning rate also goes up. And people often cite a rule of thumb that one over the width is the right rate at which to scale the learning rate. More advanced people will actually take these curves, find the minimum of each, and fit a scaling law on the optimal learning rate. There we can see a predictable decay in learning rate, and maybe we can fit a scaling law to it. I'll talk about this more in the next set of lectures.
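A tiny sketch of that "more advanced" approach, with made-up numbers: sweep the learning rate at each width, take the argmin, then fit a power law for the optimal learning rate as a function of width and check whether the exponent is near negative one.

```python
import numpy as np

# Hypothetical (width, best learning rate) pairs, e.g. read off from LR sweeps.
widths = np.array([128, 256, 512, 1024, 2048])
best_lrs = np.array([6.1e-3, 3.2e-3, 1.5e-3, 7.8e-4, 4.0e-4])

k, log_c = np.polyfit(np.log(widths), np.log(best_lrs), 1)
print(f"fitted exponent k = {k:.2f}  (rule of thumb: about -1, i.e. lr ~ 1/width)")
print(f"extrapolated lr at width 8192: {np.exp(log_c) * 8192**k:.2e}")
```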
But an alternative, one that many people have started to adopt and that I think is a really interesting thing to think about, is that you can actually reparameterize the model. In particular, you can do things like scale the initialization and the learning rates of different layers based on the width: you can scale the variance of the initialization based on the width of the model, as well as multiply the outputs of different layers in the forward pass. And if you do this in a way that depends on the width of the model, you end up with a parameterization whose optimal learning rate is supposed to be more stable, or, at least in the original paper, exactly stable across scale. So you tune your learning rate once and you don't have to do anything else; the optimum transfers directly. You tune it here on the smallest model, and that transfers directly to the very largest scale. This is the idea called muP. The original paper I'm showing you is the one that introduced muP, and there have been other variants. Meta, with the release of Llama 4, claims to have invented something called MetaP, which I'm not quite sure what it is yet, but you can see that a lot of labs are thinking about this. Because if you're going to have to rely on predicting what the optimal learning rate is, then you have to do all sorts of tricky scaling law fits, and maybe that's unstable; but if you can reparameterize your model, then maybe you don't have to do any retuning at all. Of course, that's way more optimistic than what happens in practice, but hopefully this gives you a sense of why scale-aware initializations are really cool and really interesting. Cool, any questions? Up to this point I've gone through a whole bunch of scaling of architectures and hyperparameters, so maybe I'll stop for a moment in case anyone has questions. Yeah, a question: I don't really get the intuition behind why, if we want a lower loss target, we want to increase the batch size. Yeah, so when you have a lower loss target, the smaller the loss target, the more sensitive things are. And in the same way that you're going to be lowering your learning rate, you also want to increase your batch size in order to denoise: the more sensitive the target, the more precise your gradients potentially have to be. One way of thinking about it is that as you're cooling down, your learning rate is going down, and maybe your batch size should increase as well, because learning rate and batch size affect each other roughly inversely. Yeah. Another question: is this specific result only for language models, or also for something like computer vision? I'm not sure; there is a related OpenAI scaling paper for multimodal models, but I don't remember what it says about critical batch size for those. Yes, the noise scale? The noise scale, at least, isn't in this figure, if that's what you're asking about. It's a kind of theoretical analysis; it's basically about the gradient noise you expect from random sampling within the batch. So it's not necessarily a precisely, empirically measured quantity. All right. So one thing I'll caution, and I think this is a big caution for a lot of scaling law work, is that scaling laws are very nicely behaved for log losses, the next-token-prediction cross-entropies we train on.
When your scaling law targets are those cross-entropies, things are very easy and work very well. But if you're trying to do downstream tasks, trying to scale directly on benchmarks, the behavior is much less predictable. Here on the left side, this is from the architecture-comparison paper from before, comparing lots of different hyperparameters and architectures: you see that the number of parameters, which in this case is a surrogate for compute, and the negative log perplexity are very nicely linearly correlated. What this is basically saying is that it doesn't matter what your depth or width or precise hyperparameter settings are; the only thing that really matters is your total compute expenditure. That's a very simple and nice story. But then you take these models, and this was a few years back, so people were still doing SuperGLUE accuracy, and you ask, okay, but what's the downstream performance of these models? And now we don't see a nice linear relationship anymore; we see this totally different picture where certain models are much better than others and certain architectures are better than others. So you might not get exactly this kind of scaling property downstream. We've seen variants of this story play out in many different places. If you follow the literature on state space models, that's one place we've seen it: state space models show really nice, predictable scaling like the plots on the left, but for certain capabilities, like in-context learning or QA, people have shown that these models maybe do less well. So it's important not to take perplexity scaling as the same thing as downstream scaling, and you want to be a little cautious whenever you're doing these kinds of analyses. Okay. So maybe this is not surprising to some of you, but hopefully it's at least convincing: if we want to make lots of engineering decisions, like hyperparameter choices and architecture decisions, we can do a lot of that before training the big model. We can train these models at small scale across several orders of magnitude of compute, and then scale that up in order to predict the behavior of larger models. The scaling-law-based design procedure is pretty simple, and there's a small sketch of the recipe right after this paragraph. You train a few smaller models, and these smaller models should span a couple orders of magnitude of compute. You establish a scaling law of some kind: you check that, at least on the models you trained, there's a clear log-log linear relationship. And then, based on this prediction, you can set optimal hyperparameters in many cases. In fact, these scaling laws often won't vary too much, their slopes will be the same, in which case the corollary is that you can just train a few smaller models, and the results of those small models will transfer surprisingly well to the larger models, in many of these cases but not all of them, learning rate being an important exception. Okay. So that's how you do things like hyperparameter selection and architecture selection. Now I want to talk about one very important use of scaling laws, one that's had an outsized influence on how we pick the sizes of models and how we think about the data efficiency of these models. Back in the earlier days, when people were beginning to scale up these models, there was a really core question you needed to ask: do we need more data, or do we need bigger models?
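Here is that small sketch of the fit-and-extrapolate recipe, with made-up numbers standing in for a handful of small training runs; nothing here is from the lecture's actual experiments.

```python
import numpy as np

# Hypothetical (training FLOPs, final loss) pairs from small runs spanning ~2 orders of magnitude.
flops = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
losses = np.array([3.60, 3.35, 3.12, 2.92, 2.74])

# Fit a line in log-log space (a power law), then extrapolate to a much larger budget.
slope, intercept = np.polyfit(np.log(flops), np.log(losses), 1)

def predicted_loss(c):
    return np.exp(intercept) * c ** slope

big_budget = 1e21  # two orders of magnitude beyond the largest small run
print(f"slope = {slope:.3f}, predicted loss at {big_budget:.0e} FLOPs: {predicted_loss(big_budget):.2f}")
```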
In some sense, back in 2021 through 2023 or so, data was way more abundant than compute, so we didn't need to worry about total data limitations, and the one limiting resource was compute: the total number of FLOPs in your training budget. And you can spend that resource in many different ways. You can spend it training a small model on lots of data, or you can train one giant model on very little data. Both of those extremes seem very wasteful: if you have a teeny tiny model, pumping in tons and tons of data doesn't seem useful, and in reverse, a giant model trained on, like, ten tokens also doesn't seem very useful. So this was a core question for many people, and several authors simultaneously proposed joint data-model scaling laws to try to answer it. So what are those? I've been talking about scaling laws in essentially one variable up until this point, and that one variable has varied, sometimes parameters, sometimes data, sometimes compute, but we've not looked at joint scaling. Data-model scaling laws look like this. These two equations are functionally equivalent to first order and describe the trade-off between the amount of data and the size of the model. The top one, from Rosenfeld, is basically saying there's a part of the error that decays polynomially in data, a part that decays polynomially in model size, and then an irreducible error term that cannot be removed even if I scale both the data and the model to infinity. Same thing with Kaplan, but there they're modeling only the reducible part of the error, so there's no constant term. This might seem kind of arbitrary, because I don't think there's any top-down reason why this has to be the correct functional form, but it provides surprisingly good fits to the joint error in data and model size. This is from, I believe, Rosenfeld: they show a nice 3D plot where one axis is the amount of data, one is the size of the model, and the vertical axis is the loss. The surface being fit is their functional form; the dots are their runs. It might be a little hard to see from the back, but the surface fits the dots almost exactly. And despite the fact that this functional form is kind of ad hoc, pulled out of a hat, it is surprisingly accurate. This one's from Rosenfeld as well, where they basically say: I'm only going to train on the small half, models that are small and data that is small, on the bottom left, and I'm going to extrapolate to models that are both large and trained with more data. And how good is that joint extrapolation? Quite good: if you look at the error, the real values are on the x axis and the predictions on the y axis, and they're almost exactly right, both on ImageNet and on WikiText. So this seems pretty good. So for a fixed compute budget, what can we do? We go back to, for example, Kaplan, and we see similar things being done there: joint scaling of compute and data. In this case parameters are on the x axis and the colors represent compute.
And so there's a third axis, data, that's being implicitly varied in order to vary the total amount of compute: as you move along these curves, the parameters are varied while the compute is held constant, so the amount of data has to vary. So Chinchilla, which I think many of you have hopefully heard of, is probably the reference for solving this problem. Both Rosenfeld and Kaplan came up with this kind of joint scaling functional form, and both of them noticed it was possible to use these functional forms to optimize the trade-off between compute and data in various ways. But for various reasons, it's hard to fit these functional forms precisely, and details like the learning rate schedule being different matter. And so Kaplan had one estimate that was quite far off from what was later, in some sense, validated to be optimal. The Chinchilla paper, by a bunch of Google authors, was an attempt to really empirically nail down the right trade-off between the number of tokens and the model size, assuming your goal is to get the best model for the smallest amount of training FLOPs. They have three different approaches, approaches one, two, and three, for fitting different curves and making scaling predictions. These blue dots are the models they trained, and the lines predict different optimal parameter sizes for different FLOP budgets. Hopefully most of you know the Chinchilla ratio; it's something like 20 tokens per parameter, and that comes from exactly this. If you take each of these points and multiply the parameter count by 20, you'll get roughly the token count, and parameters times tokens, times a constant, gives you the FLOPs. One reason for the difference between the Kaplan results, which estimated one set of token-to-parameter ratios, and the Chinchilla ones, is learning rate schedules. We know that we train models with cosine learning rates, and cosine learning rates look something like this: the rate goes up, comes back down, and cools down all the way to a minimum learning rate at the bottom. But one thing about cosine learning rates that trips everyone up all the time is that you can't truncate them early; you have to go all the way to the end, through the cooldown phase, to get a valid model. If I truncate a run in the middle, that is not the same as training a model from scratch with a cosine schedule that ends at that point. And this was one of the contributing factors, there were others as well, leading to the Kaplan estimates being pretty far off from the later, more refined estimates provided by the Chinchilla paper. So what do the Chinchilla authors actually do? Well, they have three different methods of estimating the optimal trade-off between tokens and model size, and each of these methods provides different scaling coefficients: one for the model size and one for the data size. Somewhat surprisingly, they get 0.5 for both of these with methods one and two, while method three provides slightly different estimates.
They're off by about 0.03, but we'll talk about that a little later. Kaplan et al., you can see, is way off from any of the three estimates. So we'll go over each of these methods. Each of them makes sense; they make different assumptions about scaling, but they end up with very, very similar estimates at the end. Method one in Chinchilla is basically to take the minimum over curves. What does that mean? You overlay all of the different training curves that you have. On the x axis here is FLOPs, on the y axis is the training loss, and I have models trained at many different sizes. Each of these sizes is trained with a different number of tokens, and so they reach a different total FLOP count as they go through training. Now what I do is look at the lower envelope: the set of points, or checkpoints, that prove to be optimal under any given compute budget. And I can take those models and look at what their actual parameter sizes were. You can see that the total compute on the x axis, and the number of parameters, as well as the corresponding token counts, all form a relatively nice scaling law. So this is the minimum-envelope method. It's basically saying: I expect the minimum training loss, optimized over all the model sizes, to actually be the optimum for that FLOP budget. And to call back to some earlier papers, if you look at the earlier Kaplan paper and other scaling law work, you see exactly this already being done: different models trained with different parameters at different compute scales, and the minimum taken across them. We've already seen that this minimum forms a scaling law, so this is building on the observation that the minimum across many training curves, as a function of compute, should itself form a power law. Under that assumption, you can get fairly nice fits, and this gives one estimate, which is quite consistent with the others, of about 0.5. Now the second method: if you were to pick a single canonical way to do the Chinchilla analysis, this would probably be the one, and in some ways I think it's the most conceptually straightforward: the isoFLOP analysis. To do the isoFLOP analysis, you pick a bunch of compute scales, so each of these colors is a different amount of compute. For each compute scale, I can train models with fewer parameters on more data, or more parameters on less data. So I sweep over my model sizes for each of these compute budgets, and then I look at the minimum of each of these curves. I can either pick the minimum point explicitly, non-parametrically, or I can fit a quadratic to each curve and take the minimum of the quadratic. In either case, the argument is fairly simple: it should be the case that this minimum itself follows a predictable scaling law, and thus I can extract from it the optimal number of parameters per FLOP budget. That's the minimum point across each of these curves. And I can also extract the optimal number of tokens per FLOP budget; I can read that out by dividing my FLOP budget by the number of parameters. So I get both simultaneously.
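A minimal sketch of that isoFLOP procedure, with synthetic numbers chosen purely for illustration: fit a parabola in log model size for each compute budget, take its vertex as the compute-optimal model size, and then fit how that optimum scales with compute.

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit loss as a quadratic in log(N) and return the minimizing N (parabola vertex)."""
    a, b, c = np.polyfit(np.log(model_sizes), losses, 2)
    return np.exp(-b / (2 * a))

# Hypothetical sweeps: {compute budget: (model sizes, measured losses)}.
sweeps = {
    1e18: (np.array([1e8, 2e8, 4e8, 8e8]), np.array([3.32, 3.22, 3.21, 3.30])),
    1e19: (np.array([4e8, 8e8, 1.6e9, 3.2e9]), np.array([3.03, 2.95, 2.97, 3.09])),
    1e20: (np.array([1e9, 2e9, 4e9, 8e9]), np.array([2.84, 2.74, 2.73, 2.82])),
}

budgets = sorted(sweeps)
n_opts = [isoflop_optimum(*sweeps[c]) for c in budgets]
a_exp, _ = np.polyfit(np.log(budgets), np.log(n_opts), 1)
print(f"N_opt scales roughly as C^{a_exp:.2f}")   # Chinchilla reports roughly C^0.5
# The optimal token count then falls out of D_opt ~ C / (6 * N_opt).
```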
And you can see that, once again, this gives very clean results that are consistent with method one. So we can compare that with before: that one says, for the eventual Chinchilla model budget, you want 63 billion parameters, and this one says 67 billion parameters. The two are quite close, right? Okay. The last one, honestly, is just a little bit messier, and this goes back to that Rosenfeld paper. If you have a functional form like this one from Rosenfeld, a very natural instinct is to say: I'm just going to train a bunch of models varying both N and D, and I'm just going to do curve fitting. I'm going to fit this curve onto whatever I get out of my models. So I train a bunch of models and fit that 3D surface, and we know from Rosenfeld that it's reasonable, to some extent, to fit these. So you've got all these dots, which are the models, I fit a curve, which is the heat-map coloring you see on the left, and then you can back out what the implied isoFLOP curves should look like, which are these dashed lines. But if you look at this, hopefully you see that the scaling-law fits, the curve fits here, are just not quite as good as the fits in the other plots, right? And if you look at the coefficients, Chinchilla's method three gives noticeably different estimates of the model size and total token count than the others. And actually, this was a mystery to me for a long time. Some of my students would ask, why is method three so different? And I'd say, I don't know, maybe scaling laws are just sometimes noisy. I don't know how many of you know this, but here's a fun piece of trivia. Last year, some folks at Epoch AI, I don't know what motivated them to do this, were curious enough about this result that they went and tried to replicate method three. And it was very difficult to replicate because you don't have the original data for all these training runs. So they went to the extreme of looking at the plots and using a forensic tool to extract the values of the points from the figures, and based on that, they could replicate the original analysis. And the funny thing is, they showed that the curve fitting was actually the bad part. The data and the approach were good, but when the original authors fit the curve, they didn't quite do it right. The original fit had residuals that, if you're familiar with regression, should be zero-mean, because otherwise you could just offset your predictions to remove the bias. Their residuals were not zero-mean. And when Epoch refit the curve properly, the optimal estimate almost exactly matched methods 1 and 2. So this is one of those funny cases where the original authors had both the idea and the data, but because of a minor issue in curve fitting, they kind of had it wrong, and the replication actually made the result more correct than before. Usually replications disprove things, but in this case the replication showed that the original conclusion held after all, which I think is a pretty cool outcome. Okay. So the final thing I want to talk about with this set of Chinchilla results: everything so far has been about training-optimal scaling, where you have a fixed flops budget and you want the best possible model.
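For completeness, here's a simplified sketch of a method-three style fit, plus the residual-mean sanity check the Epoch replication pointed to. The original work fits this functional form with a Huber loss in log space, so treat the plain least-squares version below, with arbitrary starting values, as an illustration rather than a reproduction.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, A, B, alpha, beta):
    """Parametric loss surface L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# N, D, L would be 1-D numpy arrays of model sizes, token counts, and final
# losses from your own training runs; the initial guesses below are arbitrary.
# popt, _ = curve_fit(chinchilla_form, (N, D), L,
#                     p0=[1.7, 400.0, 400.0, 0.3, 0.3], maxfev=20_000)
# residuals = L - chinchilla_form((N, D), *popt)
# print("mean residual:", residuals.mean())   # a sound fit should be near zero
```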
But really, I think the story has shifted. When Chinchilla was written and the Kaplan paper was written, LLMs were not really a product yet, so the name of the game was that everyone wanted the biggest, flashiest, most intelligent model, and they didn't care much about the inference cost of actually deploying these systems. But nowadays what we really care about is inference cost, right? Because these systems are actually products: they generate revenue, and you have a cost associated with that revenue. And so we've seen over time that the tokens-per-parameter ratio has steadily grown. GPT-3 was about two tokens per parameter, Chinchilla moved us to 20 tokens per parameter, and for a while people played around with roughly that ratio. But then pretty quickly people realized that what we actually care about is really good intelligence at really small parameter sizes, and so they started scaling up the number of tokens per parameter very, very rapidly. I think I saw yesterday that, for example, the most recent Qwen models were trained on something like 30 trillion tokens, right? People are really pushing the limits on the tokens-to-parameter ratio, because you would much rather pay the upfront training cost than pay the ongoing operating cost of running inference on a really big, expensive model; there's a toy version of that arithmetic sketched below. Cool. The last thing, which is kind of a fun side note I want to end with, is that these results are pretty robust and easy to replicate. A few years back, one of my students, Ishan, was really interested in pushing diffusion models for text forward. And one of the things we had to confront was that this is a whole new kind of model. We don't know what the optimal token-to-parameter ratio is; we don't even know whether this thing reliably scales, since it's a totally different kind of generative model. What do we do? Well, it turns out that if you just follow the same playbook, running the isoFLOP analyses we'd use for autoregressive models, you get almost exactly the Chinchilla result without too much effort. And if you do the same kind of analysis on diffusion models, you see very similar kinds of curves, even though it's an entirely different generative model. And then if you plot the minimum across these, you see very predictable scaling for both, separated by a constant offset, right? I don't bring this up because I particularly want to push diffusion models, but as a somewhat random case study to say that these scaling laws don't have to be cherry-picked examples. They seem to show up pretty naturally as you work on new models or new settings. Okay, so to put this last part together: log-linearity is not just about one-dimensional things where we only think about data; it extends to model parameters and to total compute, and that lets us make all sorts of hyperparameter and other decisions. That's the first part. Scaling laws also let us make really smart resource tradeoffs, between bigger models and more data, and we saw that in the Chinchilla analysis. It's kind of remarkable how cleanly things like the isoFLOP analysis turn out. All right, that's all I've got for basic scaling laws.
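Here's that toy arithmetic on the inference-cost point, assuming the usual ~6·N·D flops for training and roughly 2·N flops per generated token at inference; all the specific sizes and token counts are made up for illustration.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Rough training cost (6*N*D) vs. total serving cost (2*N per generated token)."""
    train = 6 * n_params * train_tokens
    serve = 2 * n_params * served_tokens
    return train, serve

# A 70B model at ~20 tokens/param vs. a heavily overtrained 7B model on the
# same 1.4T tokens, each assumed to serve 10 trillion tokens over its lifetime:
for n in (70e9, 7e9):
    train, serve = lifetime_flops(n, train_tokens=1.4e12, served_tokens=1e13)
    print(f"{n / 1e9:.0f}B params: train {train:.2e} flops, serve {serve:.2e} flops")
# The smaller model costs 10x less for every token it ever serves, which is
# why people keep pushing the tokens-per-parameter ratio up.
```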
We did a recap of Kaplan as well as Chinchilla today, and hopefully now you're on board with this idea of data scaling, model scaling, and using scaling laws to optimize all the aspects of your model without actually going all the way to the large-scale training runs. Thanks, and I'll see you all Thursday.