Stanford CS336 Language Modeling from Scratch | Spring 2025 | 01 Overview and Tokenization

Stanford's CS336, "Language Modeling from Scratch," aims to give students an end-to-end understanding of the full language model building pipeline, including data, systems, and modeling. The course lectures will be posted to YouTube.
Instructor Percy argues that researchers are becoming increasingly disconnected from the underlying technology, with many relying solely on API calls to large proprietary models. He stresses that, convenient as they are, these abstractions are "leaky," and that truly fundamental research still requires understanding every layer of the stack. The core philosophy of the course is therefore "to understand it, you have to build it."
Because frontier models (such as GPT-4) are enormous, expensive, and undocumented, students will focus on building small language models. Percy acknowledges that small models may not reproduce certain properties of large-scale models (such as how the compute split across modules shifts with scale, or the emergence of particular capabilities).
Even so, the course can teach knowledge at three levels: 1) the "mechanics" of how models work (e.g., the Transformer architecture, parallel computation); 2) the "mindset" (e.g., squeezing maximum performance out of the hardware and taking scaling seriously, which he credits as key to OpenAI's success); 3) "intuitions" about data and modeling choices (which can only be partially taught, since strategies that work at small scale do not always transfer to large scale).
Percy also offers a reading of the "bitter lesson": it is not that "scale is all that matters," but that "algorithms that scale" are what matter. He emphasizes that efficiency, especially algorithmic efficiency, whose progress has outpaced Moore's law, becomes even more critical at large scale.
The central question the course asks students to keep in mind: given a fixed compute and data budget, how do you build the best possible model?


Transcript

speaker 1: Welcome everyone. This is CS336, Language Models from Scratch, and this is our core staff. I'm Percy, one of your instructors. I'm really excited about this class because it lets you see the whole language-model building pipeline end to end, including data, systems, and modeling. Tatsu, I'll be co-teaching with him, so I'll let everyone introduce themselves.

Hi everyone, I'm Tatsu, one of the co-instructors. I'll be giving lectures in a few weeks. I'm really excited about this class. Percy and I spent a while thinking about what the really deep technical stuff is that we can teach our students today, and I think one of the key things is that you have to build it from scratch to understand it. So I'm hoping that's the ethos we'll carry through the class.

I actually took this class last year, and now I'm on the teaching staff, so, as they say, anything is possible. I'm a third-year PhD student in the CS department, working with Percy and Tatsu. My research is mostly on synthetic data, language models, reasoning, all that stuff. This should be a fun quarter. Hey guys, I'm a second-year PhD student. (And he topped many of the leaderboards last year, so he's the number to beat.)

Okay, thanks everyone. So let's continue. As Tatsu mentioned, this is the second time we're teaching the class. We've grown the class by around 50%, we have three TAs instead of two, and one big change is that we're putting all the lectures on YouTube so that the world can learn how to build language models from scratch.

So why did we decide to make this a course and endure all the pain? Let's ask GPT-4. If you ask it why teach a course on building language models from scratch, the reply is that such a course provides a foundational understanding of the techniques, fosters innovation, and so on: the typical generic blather. Here's the real reason. We're in a bit of a crisis: researchers are becoming more and more disconnected from the underlying technology. Eight years ago, AI researchers would implement and train their own models. Even six years ago, you would at least take a model like BERT, download it, and fine-tune it. Now many people can get away with just prompting a proprietary model. This is not necessarily bad, because as you add these layers of abstraction, we can all do more, and a lot of research has been unlocked by the simplicity of prompting a language model; I do my fair share of prompting, so there's nothing wrong with that. But remember that these abstractions are leaky. In contrast to programming languages or operating systems, you don't really understand what the abstraction is; it's a string in and a string out, I guess. And I would say there's still a lot of fundamental research that requires tearing open the stack and co-designing the data, the systems, and the model. A full understanding of this technology is necessary for that kind of fundamental research. That's why this class exists: we want to enable fundamental research to continue, and our philosophy is that to understand it, you have to build it. There's one small problem, though, and it comes from the industrialization of language models.
GPT-4 is rumored to have 1.8 trillion parameters and to have cost around $100 million to train. xAI is building clusters with 200,000 H100s, if you can imagine that, and there's supposedly an investment of over $500 billion over four years. These are pretty large numbers. Furthermore, there are no public details on how these models are built. Here's the GPT-4 report, from two years ago already: they very honestly say that due to the competitive landscape and safety implications, they will disclose no details. That's the state of the world right now, and in some sense frontier models are out of reach for us. So if you came into this class thinking you were each going to train your own GPT-4, sorry. We're going to build small language models, but the problem is that these might not be representative. Here are two examples to illustrate why.

Here's a simple one. If you look at the fraction of FLOPs spent in the attention layers of a Transformer versus the MLP layers, it changes quite a bit with scale. This is a tweet from Stephen Roller from quite a few years ago, but it still holds: for small models, the FLOPs in the attention and MLP layers are roughly comparable, but if you go up to 175 billion parameters, the MLPs really dominate. Why does this matter? Well, if you spend a lot of time at small scale optimizing attention, you might be optimizing the wrong thing, because at larger scale it gets washed out. This is a simple example, because you can literally make this plot without any compute; it's just napkin math.
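To make that napkin math concrete, here is a rough sketch (my own illustration, not from the course; the per-layer FLOP formulas are the standard approximations, and the two model widths and the 2048-token context are assumed, GPT-3-style numbers):

def attention_flops(d_model: int, seq_len: int) -> float:
    # Rough per-layer, per-token multiply-add counts (constant factors dropped).
    projections = 4 * d_model * d_model        # Q, K, V, and output projections
    scores = 2 * seq_len * d_model             # QK^T scores plus the weighted sum over V
    return projections + scores

def mlp_flops(d_model: int, ffw_mult: int = 4) -> float:
    return 2 * d_model * (ffw_mult * d_model)  # up-projection plus down-projection

for name, d_model in [("small model, d_model=768", 768), ("GPT-3 scale, d_model=12288", 12288)]:
    a, m = attention_flops(d_model, seq_len=2048), mlp_flops(d_model)
    print(f"{name}: attention share of layer FLOPs ~ {a / (a + m):.2f}")
# Small width: attention and MLP are comparable; large width: the MLP dominates.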
Here's something that's a little harder to grapple with: emergent behavior. This is from a paper by Jason Wei from 2022, and the plots show that as you increase the training FLOPs and look at accuracy on a bunch of tasks, for a long time it looks like nothing is happening, and then all of a sudden various phenomena like in-context learning emerge. If you were hanging around at the small scale, you would conclude that these language models just don't work, when in fact you had to scale up to get that behavior.

So don't despair: we can still learn something in this class, but we have to be precise about what we're learning. There are three types of knowledge. First, the mechanics of how things work. This we can teach you: we can teach you what a Transformer is (you'll implement one), and we can teach you how model parallelism leverages GPUs efficiently. These are the raw ingredients. Second, mindset. This is more subtle and a bit fuzzy, but in some ways more important: the mindset is that we want to squeeze as much out of the hardware as possible and take scaling seriously. In some sense the mechanics, as we'll see, have all been around for a while; it was really the scaling mindset that OpenAI pioneered that led to this generation of AI models. So hopefully we can imbue you with that way of thinking. Third, intuitions about which data and modeling decisions lead to good models. These, unfortunately, we can only partially teach you, because the architectures and datasets that work at small scale might not be the same ones that work at large scale. But hopefully you get two and a half out of three, which is pretty good bang for your buck.

Speaking of intuitions, there's a somewhat sad reality: you can tell a lot of stories about why certain things in the Transformer are the way they are, but sometimes you just run the experiments and the experiments speak. For example, there's the Noam Shazeer paper that introduced SwiGLU, a type of nonlinearity we'll see more of in this class. The results are quite good and it got adopted, but the conclusion contains the honest statement that they offer no explanation and attribute the success to divine benevolence. There you go; that's the extent of our understanding.

Okay, now let's talk about the bitter lesson, which I'm sure people have heard about. There's a misconception that the bitter lesson means scale is all that matters, algorithms don't matter, and all you do is pump more capital into building the model and you're good to go. That couldn't be further from the truth. The right interpretation is that algorithms that work at scale are what matter, because at the end of the day the accuracy of your model is essentially a product of your efficiency and the resources you put in. And efficiency, if you think about it, is far more important at larger scale: if you're spending hundreds of millions of dollars, you cannot afford to be wasteful, in the way you can be when running a job on your local cluster, where you run it, it fails, you debug it, you run it again. If you look at actual utilization, I'm sure OpenAI is far more efficient than any of us right now. So efficiency really matters. Furthermore, and I think this point is underappreciated in the scaling rhetoric, efficiency is a combination of hardware and algorithms, and if you look just at algorithmic efficiency, there's a nice OpenAI paper from 2020 showing that over the period from 2012 to 2019 there was a 44x algorithmic efficiency improvement in the compute needed to train an ImageNet model to a certain level of accuracy. That's huge, and as the abstract notes, it's faster than Moore's law. So algorithms do matter: without that efficiency improvement you would be paying 44 times the cost. That result is for image models, but there are similar results for language.

So with all that, the right framing or mindset is: what is the best model one can build given a certain compute and data budget? This question makes sense no matter what scale you're at, because it's about accuracy per unit of resources. Of course, if you can raise capital and get more resources, you'll get better models, but as researchers our goal is to improve the efficiency of the algorithms. So: maximize efficiency. You're going to hear a lot of that.

Now let me talk a little about the current landscape and give some obligatory history. Language models have been around for a while, going back to Shannon, who used language models as a way to estimate the entropy of English.
In AI, language models first became prominent in NLP, where they were a component of larger systems like machine translation and speech recognition. One thing that's maybe underappreciated these days is that back in 2007 Google was training large n-gram models, five-gram models over 2 trillion tokens, which is more tokens than GPT-3 saw; it's only in the last couple of years that we've gotten back to that token count. But they were n-gram models, so they didn't exhibit any of the interesting phenomena we associate with language models today.

In the 2010s, much of the deep learning revolution happened and a lot of the ingredients fell into place. There was the first neural language model from Yoshua Bengio's group around 2003. There were sequence-to-sequence models, a big deal for modeling sequences, from Ilya Sutskever and folks at Google. There's the Adam optimizer, which is still used by the majority of people over a decade later. There's the attention mechanism, developed in the context of machine translation, which led up to the famous "Attention Is All You Need" paper, a.k.a. the Transformer, in 2017. People were looking at how to scale mixture-of-experts models, and there was a lot of work in the late 2010s on model parallelism, figuring out how to train 100-billion-parameter models. They didn't train them for very long, because these were mostly systems papers, but all the ingredients were in place by the time 2020 came around.

One other trend, which started in NLP, is the idea of foundation models that can be trained on a lot of text and adapted to a wide range of downstream tasks. ELMo, BERT, T5: these were, for their time, very exciting models. We maybe forget how excited people were about things like BERT, but it was a big deal. And then, and this is an abbreviated history, a critical piece of the puzzle is OpenAI taking these ingredients, applying very good engineering, and really pushing on and embracing scaling laws; that's the mindset, and it led to GPT-2 and GPT-3. Google was obviously in the game and trying to compete as well. That paved the way for another line of work: those were all closed models, not released and only accessible via API, and in response people started building open models, starting with the early work by EleutherAI right after GPT-3 came out, Meta's early attempt, which maybe didn't work quite as well, BLOOM, and then Meta, Alibaba, DeepSeek, AI2, and a few others I've listed have been creating open models where the weights are released.

One other tidbit about openness that I think is important: there are many levels of openness. There are closed models like GPT-4. There are open-weight models, where the weights are available and there's often a very nice paper with lots of architectural details but no details about the dataset. And then there are open-source models, where the weights and the data are available, with a paper that honestly tries to explain as much as it can. But of course you can't capture everything in a paper, and there's no substitute for learning how to build it except doing it yourself.
That leads to the present day, where there's a whole host of frontier models from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, Tencent, and probably a few others that dominate the current landscape. So we're at an interesting point: a lot of the ingredients, as I said, have already been developed, which is good because we're going to revisit those ingredients and trace how the techniques work, and then we're going to move as close as we can to best practices for frontier models, using information from the open community and reading between the lines of what we know about the closed models.

Just as an interlude: what are you looking at here? This is an executable lecture. It's a program that I'm stepping through, and it delivers the content of the lecture. One thing that's interesting is that you can embed code: you can step through it, and (this is a smaller screen than I'm used to) you can look at the environment variables as you step through. That will be useful later, when we drill down into code examples. You can also see the hierarchical structure of the lecture: we're in this module, you can see where it was called from main, and you can jump to definitions, for example supervised fine-tuning, which we'll talk about later. And if you think this looks like a Python program, well, it is a Python program, processed for your viewing pleasure.

Okay, let's move on to course logistics. Actually, maybe I'll pause for questions first. Any questions about what we're learning in this class?

[Student question about whether the class prepares you to lead a team building a frontier model.] So the question is, would I expect a graduate of this class to be able to lead a team and build a frontier model, given, say, a billion dollars of capital? I would say it's a good step, but there are definitely many pieces that are missing. We've thought about teaching a series of classes that eventually gets as close as we can, but this is maybe the first piece of the puzzle. There are a lot of other things, and I'm happy to talk offline about that. But I like the ambition; that's why you should take the class, so you can go lead teams and build frontier models.

Okay, let's talk a little about the course. Here's the website; everything's online. This is a five-unit class, but maybe that doesn't convey the level as well as this quote I pulled from a course evaluation: "The entire assignment was approximately the same amount of work as all five assignments from CS224N plus the final project." And that's the first homework assignment. I'm not trying to scare you off, just giving you some data.

So why should you endure that? I think this class is really for people who have an obsessive need to understand how things work all the way down to the atoms, so to speak. When you get through this class, you will have really leveled up in terms of research engineering, and your comfort level in building ML systems at scale will be something else. There are also reasons you shouldn't take the class.
For example, if you want to get any research done this quarter, maybe this class isn't for you. If you're interested in learning only about the hottest new techniques, there are other classes that can probably deliver on that better than you spending a lot of time debugging BPE; this is a class about the primitives, about learning things bottom-up, rather than about the latest methods. And if you're interested in building applications on top of language models, this is probably not the first class to take. Practically speaking, as much as I made fun of prompting, prompting is great, and fine-tuning is great: if you can do that and it works, that's absolutely where you should start. I don't want people coming out of this class thinking that, given any problem, the first step is to train a language model from scratch. That is not the right way to think about it.

I know some of you are enrolled and some of you aren't; we did have a cap, so we weren't able to enroll everyone. For people following along at home, all the lecture materials and assignments are online, and the lectures are recorded and will be put on YouTube, though with a lag of a few weeks. We'll also offer the class next year, so if you weren't able to take it this year, don't fret; there will be a next time.

The class has five assignments. We don't provide scaffolding code: literally, we give you a blank file and you're supposed to build things up, in the spirit of learning by building from scratch. But we're not that mean: we do provide unit tests and adapter interfaces that let you check the correctness of different pieces, and the assignment write-up, if you walk through it, does a fairly gentle job of guiding you. You're on your own for making good software design decisions, figuring out what to name your functions and how to organize your code, which is a useful skill in itself.

One strategy for all the assignments: there's a part that is just "implement the thing and make sure it's correct," which you can mostly do locally on your laptop; you don't need compute for that. Then we have a cluster you can use for benchmarking both accuracy and speed. I want everyone to embrace the idea of using as small a dataset and as few resources as possible to prototype before running large jobs. You shouldn't be debugging with one-billion-parameter models on the cluster if you can help it. Some assignments have a leaderboard, usually of the form "make perplexity go down given a particular training budget." Last year it was pretty exciting to see people try different things they learned in class or read about online. And finally, on AI tools: this was less of a problem last year because Copilot wasn't as good, but now Cursor is pretty good. Our general stance is that AI tools can take away from learning, because in some cases they can just solve the thing you're asked to do, but you can obviously use them judiciously. Use them at your own risk; you're responsible for your own learning experience. Okay, so we do have a cluster.
Thank you to Together AI for providing a bunch of H100s for us. There's a guide; please read it carefully to learn how to use the cluster, and start your assignments early, because the cluster will fill up toward the deadline as everyone tries to get their large runs in. Any questions about that?

[Student question about signing up for fewer than five units.] The question is whether you can sign up for fewer than five units. Administratively, if you have to sign up for less, that's possible, but it's the same class and the same workload. Any other questions? Okay.

In this part I'm going to go through the different components of the course and give a broad overview, a preview of what you're going to experience. Remember, it's all about efficiency: given hardware and data, how do you train the best model with your resources? For example, if I give you a Common Crawl web dump and 32 H100s for two weeks, what should you do? There are a lot of design decisions: the tokenizer, the architecture, systems optimizations, the data pipeline. We've organized the class into five units, or pillars, and I'll go through each in turn, cover what it contains and what the assignment involves, and then wrap up.

The goal of the basics unit is to get a basic version of the full pipeline working: you implement a tokenizer, a model architecture, and training. Let me say a bit more about what these components are. A tokenizer converts between strings and sequences of integers. Intuitively, the integers correspond to breaking the string up into segments and mapping each segment to an integer, and that sequence of integers is what goes into the actual model, which works over a fixed vocabulary. In this course we'll cover the byte pair encoding (BPE) tokenizer, which is relatively simple and still widely used. There is a promising set of tokenizer-free approaches: methods that start from the raw bytes, skip tokenization, and use an architecture designed to consume raw bytes. That work is promising, but I haven't yet seen it scaled to the frontier, so we'll go with BPE for now.

Once you've tokenized your strings into sequences of integers, you define a model architecture over those sequences. The starting point is the original Transformer, which is the backbone of basically all frontier models; here's the architecture diagram. We won't go into the details now, but there's an attention piece and an MLP piece with some normalization. A lot has actually happened since 2017. There's a sense in which the Transformer was invented and everyone has just been using it, and to a first approximation we are still using the same recipe, but there have been a bunch of smaller improvements that make a substantial difference when you add them all up. For example: the nonlinear activation function (the SwiGLU we saw a bit earlier); positional embeddings, where rotary positional embeddings are the newer choice we'll talk about; and normalization, where instead of LayerNorm we'll look at something called RMSNorm, which is similar but simpler.
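As a concrete example of one of these small changes, here is a minimal RMSNorm sketch in PyTorch (my own illustration, not the assignment's reference implementation; the class name, epsilon, and gain initialization are my choices). It rescales by the root mean square of the activations, with no mean subtraction and no bias.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # RMSNorm: y = x / rms(x) * gain, where rms(x) = sqrt(mean(x^2) + eps).
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))   # learned per-dimension scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gain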
There's also the question of where you place the normalization, which has changed relative to the original Transformer. For the MLP, the canonical version is a dense MLP, and you can replace that with a mixture of experts. Attention itself has been getting a lot of attention: there's full attention, sliding-window attention, and linear attention, all trying to avoid the quadratic blowup, and there are lower-dimensional variants like GQA and MLA, which we'll get to in a future lecture. The most radical direction is alternatives to the Transformer, like state-space models such as Hyena, which replace attention with some other operation; sometimes you get the best of both worlds by making a hybrid model that mixes these with Transformer layers.

Once you've defined your architecture, you need to train it. The design decisions include the optimizer: AdamW, which is basically Adam with decoupled weight decay, is still the most prominent, so we'll mostly work with that, though it's worth mentioning more recent optimizers like Muon and SOAP that have shown promise. Then there's the learning-rate schedule, batch size, whether you regularize, and the other hyperparameters. There are a lot of details here, and this is a class where the details do matter, because you can easily see an order-of-magnitude difference between a well-tuned setup and a vanilla Transformer.

In assignment one, you'll implement the BPE tokenizer. I'll warn you that this is the part that has been a surprising amount of work for people, so consider yourself warned. You'll also implement the Transformer, the cross-entropy loss, the AdamW optimizer, and the training loop: the whole stack. We're not making you implement PyTorch from scratch, so you can use PyTorch, but you can't use its Transformer implementation; there's a small list of allowed functions, and you can only use those. We'll give you the TinyStories and OpenWebText datasets to train on, and there will be a leaderboard to minimize OpenWebText perplexity: we give you 90 minutes on an H100 and see what you can do. This is last year's leaderboard, so that's the number to beat this year.

All right, so that's the basics. After basics, in some sense you're done, right? You can train a Transformer. What else do you need? The systems unit goes into how you can optimize this further: how do you get the most out of the hardware? For that we need to take a closer look at the hardware and how to leverage it. Kernels, parallelism, and inference are the three components of this unit.

To talk about kernels, let's first look at what a GPU is. A GPU, which we'll get into in much more detail later, is basically a huge array of little units that do floating-point operations. One thing to note is that this is the GPU chip, and the main memory is actually off-chip, with other memory, like the L2 and L1 caches, on-chip. The basic issue is that the compute has to happen here, but your data might be somewhere else.
So how do you organize your computation to be as efficient as possible? One quick analogy: the memory where you store your data and model parameters is like a warehouse, and the compute is like a factory, and what ends up being the big bottleneck is data movement. The question is how to organize the compute, even for something like a matrix multiplication, to maximize utilization of the GPU by minimizing data movement. There are techniques like fusion and tiling that let you do that, and we'll get into the details. To implement kernels we're going to use Triton, which was developed by OpenAI and is a popular way to write kernels; there are other options at various levels of sophistication, but we'll use Triton. So we're going to write some kernels.

That's for one GPU. In general, big runs take thousands, if not tens of thousands, of GPUs, but even at eight it starts to get interesting: you have a lot of GPUs, they're connected to CPU nodes, and they're also directly connected via NVSwitch and NVLink. It's the same idea as before, except that data movement between GPUs is even slower. So we need to figure out how to place model parameters, activations, and gradients across the GPUs, do the computation, and minimize the amount of movement. We'll explore techniques like data parallelism, tensor parallelism, and so on. That's all I'll say about that for now.

Finally, inference, which we didn't actually cover last year except in a guest lecture. This is important because inference is how you actually use a model: it's the task of generating tokens from a trained model given a prompt. It also turns out to be needed for a bunch of things besides chatting with your favorite model: you need it for reinforcement learning, for test-time compute, which has been very popular lately, and even for evaluating models. So we're going to spend some time on inference. If you think about it globally, the money spent on inference is eclipsing the money spent training models, because training, despite being very intensive, is ultimately a one-time cost, whereas inference cost scales with every use; the more people use your model, the more efficient your inference needs to be.

In inference there are two phases: prefill and decode. In prefill, you take the prompt, run it through the model, and get activations; in decode, you go autoregressively, generating tokens one by one. In prefill all the tokens are given, so you can process everything at once; this is exactly what you see at training time, and it's generally a good setting because it's naturally parallel and mostly compute-bound. What makes inference special and difficult is the autoregressive decoding: you generate one token at a time, it's hard to saturate the GPUs, and it becomes memory-bound because you're constantly moving data around. We'll talk about a few ways to speed inference up. You can use a cheaper model.
You can also use a really cool technique called speculative decoding, where a cheaper draft model scouts ahead and generates multiple tokens, and then, if those tokens are good by a certain definition of good, the full model can score them and accept them all in parallel. And there are a bunch of systems optimizations you can do as well.

Okay, so for assignment two: you're going to implement a kernel, and you're going to implement some parallelism. Data parallelism is very natural, so we'll do that. Some of the model parallelism, like FSDP, turns out to be rather complicated to do from scratch, so we'll do a baby version of that; I encourage you to learn about the full version, and we'll go over it in class, but implementing it from scratch would be a bit much. An important part is also getting into the habit of always benchmarking and profiling. That's probably the most important thing: you can implement things, but unless you have feedback on how well your implementation is doing and where the bottlenecks are, you're flying blind.

Unit three is scaling laws. Here the goal is to do experiments at small scale and use them to predict the hyperparameters and loss at large scale. Here's a fundamental question: if I give you a FLOPs budget, what model size should you use? A larger model means you can train on less data; a smaller model means you can train on more data. What's the right balance? This has been studied quite extensively and was worked out by a series of papers from OpenAI and DeepMind; if you've heard the term "Chinchilla optimal," this is what it refers to. The basic idea is that for every compute budget (number of FLOPs) you vary the number of parameters of your model and measure how good the resulting model is, so for every level of compute you get the optimal parameter count. Then you fit a curve and extrapolate: if you had, say, 1e22 FLOPs, what would the optimal parameter count be? It turns out that when you plot these minima, the relationship is remarkably linear, which leads to a simple but useful rule of thumb: for a model with N parameters, you should train on roughly 20N tokens. So a 1.4-billion-parameter model should be trained on about 28 billion tokens. This doesn't take inference cost into account; it's literally about how to train the best model regardless of how big it ends up being, so there are limitations, but it has nonetheless been extremely useful for model development.
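To make the rule of thumb concrete, here is a back-of-the-envelope sketch (my own; it assumes the standard C ~ 6*N*D approximation for training FLOPs, which the lecture has not introduced, on top of the 20-tokens-per-parameter rule):

def chinchilla_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    # Training FLOPs C ~ 6*N*D combined with D ~ 20*N gives C ~ 120*N^2, so N = sqrt(C/120).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e19, 1e22):
    n, d = chinchilla_allocation(c)
    print(f"C={c:.0e}: ~{n / 1e9:.2f}B parameters, ~{d / 1e9:.0f}B tokens")
# At 1e22 FLOPs this suggests roughly a 9B-parameter model trained on roughly 180B tokens.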
The assignment for this unit is kind of fun: we define a "training API" that you can query with a particular set of hyperparameters; you specify the architecture, batch size, and so on, and we return a loss value. Your job is this: you have a FLOPs budget, you decide which models to train, you gather the data, you fit a scaling law to that data, and then you submit your prediction for the hyperparameters, model size, and so on that you would choose at a larger scale. This is a case where we want to put you in a position where there are some stakes. It's not burning real compute, but once you've used up your FLOPs budget, that's it, so you have to be careful about how you prioritize which experiments to run, which is something the frontier labs have to do all the time. There will be a leaderboard for this too: minimize loss given your FLOPs budget.

[Student question about the assignment links.] The question is: the links are from 2024, so if we're working ahead, should we expect the assignments to change? The rough structure will be the same for 2025, with some modifications, but if you look at the 2024 versions you'll have a pretty good idea of what to expect.

Okay, let's go into data. Up to this point you have scaling laws, you have systems, you have your Transformer implementation; you're pretty much good to go. But data, I would say, is the key ingredient that really differentiates models. The question to ask here is: what do I want this model to do? Because what the model does is mostly determined by the data. If I train on multilingual data, it will have multilingual capabilities; if I train on code, it will have code capabilities. Usually datasets are a conglomeration of many different pieces. This example is from The Pile, which is four years old now, but the same idea holds: you have data from the web (Common Crawl), Stack Exchange, Wikipedia, GitHub, and other curated sources.

In the data unit we'll start with evaluation: given a model, how do you decide whether it's any good? We'll talk about perplexity, standardized-test-style benchmarks like MMLU, and how you evaluate models that generate free-form responses for instruction following. There are decisions about whether you ensemble or do chain-of-thought at test time and how that affects evaluation, and then there's evaluation of entire systems, not just the language model, because these days language models often get plugged into some agentic system.

After establishing evaluation, we look at data curation. This is an important point that people don't always appreciate. I often hear people say, "we train the model on the Internet." That doesn't really make sense: data doesn't just fall from the sky, and there isn't a single "the Internet" you can pipe into your model; data always has to be actively acquired somehow. As an example, and I always tell people to look at their data, let's look at some data. This is Common Crawl; I'll take ten documents, and hopefully this works. The rendering is a bit off, but you can see this is a random sample of Common Crawl, and it's maybe not exactly the data you want. Here's some actual real text, that's nice, but if you look at most of Common Crawl, some of it is in other languages, a lot of it is very spammy sites, and you quickly realize that a lot of the web is just trash. Maybe that's not surprising, but it's more trash than you would expect, I promise.
What I'm saying is that a lot of work needs to happen on data. You can crawl the Internet, you can take books, arXiv papers, GitHub, and there's a lot of processing that has to happen. There are also legal questions about what data you can train on, which we'll touch on. Nowadays a lot of frontier labs actually have to buy data, because the publicly accessible data on the Internet turns out to be somewhat limited for truly frontier performance. It's also important to remember that scraped data isn't actually text: it's HTML, or PDFs, or, in the case of code, whole directories, so there has to be an explicit process that takes this raw data and turns it into text. We'll talk about the conversion from HTML to text, which is a lossy process; the trick is preserving the content and some of the structure without just keeping the raw HTML. Filtering, as you might surmise, is very important, both for getting high-quality data and for removing harmful content; generally people train classifiers to do this. Deduplication is also an important step, which we'll cover. Assignment four is all about data: we give you a raw Common Crawl dump so you can see just how bad it is, you train filtering classifiers and deduplicate, and then there's a leaderboard where you try to minimize perplexity given your token budget.

So now you have the data, you've built all your fancy kernels, and you can really train models. But at this point what you get is a model that completes the next token. This is called a base model, and I think of it as a model with a lot of raw potential that needs to be aligned, or modified in some way; alignment is the process of making it useful. Alignment captures a lot of different things, but three in particular. First, you want the language model to follow instructions: completing the next token is not the same as following the instruction; a base model will just continue the instruction or whatever it thinks follows it. Second, you specify the style of the generation: long or short, bullets or not, witty and sassy or not. When you play with ChatGPT versus Grok, you'll see that different alignment choices have been made. Third, safety: it's important for these models to be able to refuse requests that could be harmful. That's also where alignment kicks in.

There are generally two phases of alignment. The first is supervised fine-tuning (SFT). The goal is very simple: you gather a set of user-assistant pairs, that is, prompt-response pairs, and you do supervised learning. The idea is that the base model already has the raw potential, so fine-tuning on relatively few examples is sufficient; of course, the more examples the better, but there are papers showing that even around a thousand examples suffice to get instruction-following behavior out of a good base model. This part is actually very simple, and it's not that different from pretraining: you're given text, and you maximize the probability of that text.
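Here is a minimal sketch of that SFT objective (my own illustration, not the assignment's interface; the model and tokenizer calls and the convention of masking prompt tokens out of the loss are assumptions):

import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    # Next-token cross-entropy on the response, conditioned on the prompt.
    prompt_ids = tokenizer.encode(prompt)             # assumed tokenizer interface
    response_ids = tokenizer.encode(response)
    ids = torch.tensor([prompt_ids + response_ids])   # shape (1, T)
    logits = model(ids[:, :-1])                       # assumed to return (1, T-1, vocab) logits
    targets = ids[:, 1:].clone()
    targets[:, : len(prompt_ids) - 1] = -100          # ignore the loss on prompt positions
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)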
The second phase is more interesting from an algorithmic perspective. The idea is that even after the SFT phase you have a decent model; how do you improve it? You could collect more SFT data, but that's expensive, because someone has to sit down and annotate it. The goal of learning from feedback is to leverage lighter forms of annotation and have the algorithms do more of the work. One type of data you can learn from is preference data: you generate multiple responses from the model for a given prompt, say A and B, and a user rates whether A or B is better. The data might look like this: for the prompt "what's the best way to train a language model," one response says use a large dataset and the other says use a small dataset, and of course the answer should be the first; that's one unit of expressed preference. Another type of supervision is verifiers: for some domains, like math or code, you're lucky enough to have a formal verifier, or you can use learned verifiers, where you train a language model to rate the response (which connects back to evaluation).

As for algorithms, we're now in the realm of reinforcement learning. One of the earliest algorithms applied to instruction-tuning language models was PPO, proximal policy optimization. It turns out that if you only have preference data, there's a much simpler algorithm called DPO that works really well. But in general, if you want to learn from verifier signals, that's not preference data, so you have to embrace RL fully. There's a method we'll cover in this class called GRPO, group relative policy optimization, developed by DeepSeek, which simplifies PPO and makes it more efficient by removing the value function, and it seems to work quite well. Assignment five has you implement supervised fine-tuning, DPO, and GRPO, and of course evaluate the results.
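As a preview, here is a minimal sketch of the DPO objective on a single preference pair (my own illustration; it assumes you have already computed summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, and beta is the usual DPO hyperparameter):

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Push the policy to prefer the chosen response, measured relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Toy usage with made-up summed log-probabilities:
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.0))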
[Student question: assignment one sounds a bit daunting; what about the others?] I would say assignments one and two are definitely the heaviest and hardest. Assignment three is a bit more of a breather, and assignments four and five, at least last year, were a notch below one and two, though we haven't fully worked out the details for this year. So yes, it does get better.

Okay, so to recap the different pieces: remember, efficiency is the driving principle, and there are a bunch of design decisions; if you view everything through the lens of efficiency, a lot of it makes sense. Importantly, we are currently in a compute-constrained regime, at least in this class and for most people who are somewhat GPU-poor: we have a lot of data but not that much compute, and the design decisions reflect squeezing the most out of the hardware. For example, in data processing we filter fairly aggressively, because we don't want to waste precious compute on bad or irrelevant data. For tokenization, a model over raw bytes would be elegant, but it's very compute-inefficient with today's architectures, so we tokenize as an efficiency gain. In the model architecture, a lot of the design decisions are essentially motivated by efficiency. In training, the fact that we mostly do just a single epoch says we're in a hurry: we'd rather see more data than spend a lot of time on any given data point. Scaling laws are entirely about efficiency: use less compute to figure out the hyperparameters. Alignment is maybe a little different, but the connection to efficiency is that if you put resources into alignment, you can get away with a smaller base model. There are two paths: if your use case is fairly narrow, you can probably take a smaller model, align or fine-tune it, and do well; if your use cases are very broad, there may be no substitute for training a big model.

That's the situation today. Increasingly, at least for the frontier labs, the constraint is becoming data, which is interesting because the design decisions will presumably change. Compute will always be important, but, for example, taking a single epoch over your data doesn't really make sense if you have plenty of compute: why wouldn't you take more epochs, or do something smarter? Maybe there will even be different architectures, since the Transformer was really motivated by compute efficiency. Something to ponder: it's still about efficiency, but the design decisions reflect which regime you're in.

Okay, now I'm going to dive into the first unit. Before that, any questions? [Question: will there be a Slack?] We will have a Slack; we'll send out details after class. [Question: will auditing students get access to the materials?] Students auditing the class will have access to all the online materials and assignments, and we'll give you Canvas access so you can watch the lecture videos. [Question: how are the assignments graded?] Good question. There will be a set of unit tests you have to pass, so part of the grade is whether you implemented things correctly; other parts of the grade are whether your model reaches a certain loss or is efficient enough. Every problem part has a number of points associated with it, so the grading is fairly granular.

Okay, let's jump into tokenization. Andrej Karpathy has a really nice video on tokenization; in general, his build-it-from-scratch videos inspired a lot of this class, so you should go check them out. Tokenization, as we discussed, is the process of taking raw text, generally represented as a Unicode string, and turning it into a sequence of integers, where each integer represents a token. We need a procedure that encodes strings into tokens and decodes tokens back into strings, and the vocabulary size is the number of values a token can take on, i.e., the range of the integers. To see how tokenizers behave, let's play with this nice website that lets you compare different tokenizers: type in something like "hello". It shows you the list of integers, the output of the tokenizer, and it also nicely shows how the original string is decomposed into segments. A few things to note.
First, the space is part of a token. Unlike classical NLP, where whitespace just disappears, everything is accounted for: tokenization is meant to be a reversible operation. And by convention, for whatever reason, the space usually precedes the word within the token. Also notice that "hello" is a completely different token from " hello" with a leading space, which might make you a little squeamish, and it can cause problems, but that's just how it is.

[Student question: is the leading space intentional, or an artifact of the BPE process?] In the BPE process, which I'll talk about, you actually pre-tokenize and then tokenize each part, and the pre-tokenizer attaches the space to the front of the word, so it is built into the algorithm. You could put it at the end; I think it makes slightly more sense at the beginning, but honestly it could go either way.

If you look at numbers, you'll see they get chopped into pieces, and interestingly it's left to right, so it's definitely not grouping by thousands or anything semantic. Anyway, I encourage you to play with this and get a sense of what existing tokenizers look like; that was the GPT-4o tokenizer, for example.

Now some observations, using the GPT-2 tokenizer as a reference (let me know if this is getting too small in the back). You take a string, apply the GPT-2 tokenizer, and you get your indices; so it maps strings to indices, and then you can decode to get back the string, which is just a sanity check that it round-trips. Another interesting quantity is the compression ratio: the number of bytes divided by the number of tokens, i.e., how many bytes each token represents. The answer here is about 1.6, so each token represents 1.6 bytes of data on average. That's the GPT-2 tokenizer that OpenAI trained.

To motivate BPE, I want to go through a sequence of attempts. Suppose you wanted to build a tokenizer: what's the simplest thing you could do? Probably character-based tokenization. A Unicode string is a sequence of Unicode characters, and each character can be converted into an integer called a code point: 'a' maps to 97, the world emoji 🌍 maps to 127757, and you can convert back. So you can define a tokenizer that simply maps each character to its code point. What's one problem with this? One is the compression ratio; it's not quite one byte per token, since a character can be multiple bytes, but it's not as good as you'd like. The other problem is that some code points are really large, and you're allocating one slot in your vocabulary for every character uniformly, even though some characters appear far more frequently than others. So it's not an effective use of your vocabulary budget: the vocabulary is huge (there are on the order of 150,000 Unicode characters), and, the bigger problem, many characters are rare, which is an inefficient use of the vocab.
The compression ratio works out to about 1.5 bytes per token in this example, again because a character can be multiple bytes. So that was a very naive approach. At the other extreme, you can do byte-based tokenization. Unicode strings can be represented as sequences of bytes, since every string can be converted into bytes; we'll use UTF-8, the most common encoding (there are others). So let's just convert everything into bytes and see what happens. Now all the indices are between 0 and 255, because there are only 256 possible byte values by definition. Your vocabulary is very small, and while not all bytes are equally used, you don't have serious sparsity problems. So what's the problem with byte-based encoding? Long sequences. In some ways I really wish byte-level encoding worked; it's the most elegant thing. But the compression ratio is exactly 1 byte per token, which is terrible: your sequences become really long, and attention is naively quadratic in the sequence length, so you're just going to have a bad time in terms of efficiency.

So neither extreme is good, and the natural next thought is that we have to be adaptive: we can't allocate exactly one character or one byte per token; some tokens should represent lots of bytes and some should represent few. One way to do this is word-based tokenization, which was classic in NLP: you take a string and split it into segments using a regular expression, and each segment is a token (here's the different regular expression GPT-2 uses to pre-tokenize; it splits your string into a sequence of strings). Then you assign each segment an integer and you're done. What's the problem with this? The vocabulary size is essentially unbounded, or at least you don't know how big it is, because a new input might contain a segment you've never seen before. That turns out to be a big problem: word-based tokenization is a real pain, because some real words are rare, new words have to be mapped to an UNK token, and if you're not careful about how you compute perplexity, you'll mess things up. So word-based tokenization captures the right intuition about adaptivity, but it's not exactly what we want.
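Here is a tiny sketch of the character-level and byte-level tokenizers just described (my own illustration using Python's built-in ord/chr and UTF-8 encoding; the example string is made up):

def char_encode(s: str) -> list:
    return [ord(c) for c in s]                # each Unicode character -> its code point

def char_decode(ids) -> str:
    return "".join(chr(i) for i in ids)

def byte_encode(s: str) -> list:
    return list(s.encode("utf-8"))            # each UTF-8 byte -> an integer in [0, 255]

def byte_decode(ids) -> str:
    return bytes(ids).decode("utf-8")

s = "Hello, 🌍!"                               # 9 characters, 12 UTF-8 bytes
for name, ids in [("chars", char_encode(s)), ("bytes", byte_encode(s))]:
    # Compression ratio = UTF-8 bytes per token: exactly 1.0 for bytes, about 1.3 for chars here.
    print(name, len(ids), "tokens, ratio", len(s.encode("utf-8")) / len(ids))
assert char_decode(char_encode(s)) == s and byte_decode(byte_encode(s)) == s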
And then finally this entered the language modeling era through GPT-2, which was trained using the BPE tokenizer. So the basic idea is that instead of defining some preconceived notion of how to split things up, we train the tokenizer on raw text. That's the basic insight, if you will. Organically, common sequences that span multiple characters get represented as one token, and rare sequences get represented by multiple tokens. There's a side detail, which is that for efficiency, the GPT-2 paper uses a word-based tokenizer as preprocessing to break the text into segments and then runs BPE on each segment — which is what you're going to do in this class as well.

The BPE algorithm is actually very simple. We first convert the string into a sequence of bytes, which we already did when we talked about byte-based tokenization, and then we successively merge the most common pair of adjacent tokens, over and over again. The intuition is that if a pair of tokens shows up a lot, we compress it into one token; we dedicate space for it. Okay, so let's walk through what this algorithm looks like. We'll use this string as an example and convert it into a sequence of integers — these are the bytes. Then we keep track of what we've merged. Remember, merges is a map from a pair of integers, which can be bytes or other pre-existing tokens, to a new token, and the vocab is just a handy way to map each index to the bytes it represents. So now the BPE algorithm — it's very simple, so I'm just going to run through the code. You do this num_merges times; num_merges is three in this case. We first count the occurrences of pairs of bytes. We step through the sequence: we see 116, 104 and increment that count; 104, 101, increment that count; and so on through the sequence. After we have these counts, we find the pair that occurs the most. There are ties here, but we break them and pick 116 and 104, which occurred twice. So now we merge: we create a new slot in our vocab, 256 — so far the vocab was 0 through 255, but now we're expanding it — and we say that every time we see 116 followed by 104, we replace it with 256. Then we apply that merge to our training data. After we do that, the 116, 104 pairs become 256, and remember, that pair occurred twice. Now we loop through the algorithm one more time. The second time it decides to merge 256 and 101, and we replace that in the indices. Notice that the indices shrink, right? Our compression ratio is getting better as we make room for more vocabulary items and have a richer vocabulary to represent everything. Okay, one more time: the next merge creates token 258, and the sequence shrinks once more. And then we're done. Okay, so let's try out this tokenizer. We have the string "the quick brown fox".
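Here is a minimal sketch of the training loop just described — repeatedly counting adjacent pairs and merging the most frequent one. The names (train_bpe, merge, num_merges) are illustrative, not the assignment's required API, and this is the simple, unoptimized version:

```python
from collections import Counter

def merge(indices: list[int], pair: tuple[int, int], new_token: int) -> list[int]:
    """Replace every occurrence of `pair` in `indices` with `new_token`."""
    out, i = [], 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(indices[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int):
    indices = list(text.encode("utf-8"))           # start from raw bytes
    merges: dict[tuple[int, int], int] = {}        # (token, token) -> new token
    vocab = {i: bytes([i]) for i in range(256)}    # token -> bytes it represents

    for i in range(num_merges):
        counts = Counter(zip(indices, indices[1:]))    # count adjacent pairs
        if not counts:
            break
        pair = max(counts, key=counts.get)             # most frequent pair
        new_token = 256 + i                            # grow the vocab
        merges[pair] = new_token
        vocab[new_token] = vocab[pair[0]] + vocab[pair[1]]
        indices = merge(indices, pair, new_token)      # apply to training data
    return merges, vocab

merges, vocab = train_bpe("the cat in the hat", num_merges=3)
print(merges)   # e.g. {(116, 104): 256, (256, 101): 257, ...}
```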
We're going to encode it into a sequence of indices, and then we're going to use our BPE tokenizer to decode. Let's step through what that looks like — well, actually, maybe decoding isn't that interesting; I should go through encode. Let's go back to encode. In encode, you take a string, convert it to indices, and just replay the merges — importantly, in the order in which they occurred. So I replay the merges, get my indices, and then verify that it round-trips. Okay, so that was pretty simple, and because it's simple, it's also very inefficient. For example, encode loops over all the merges; you should only loop over the merges that matter. And there are some other bells and whistles, like special tokens and pre-tokenization. So in your assignment you're going to take this as a starting point — or rather, you should implement your own from scratch — but your goal is to make the implementation fast, and you can parallelize it if you want. You can go have fun.

Okay, so a summary of tokenization. A tokenizer maps between strings and sequences of integers. We looked at character-based, byte-based, and word-based tokenizers; they're highly suboptimal for various reasons. BPE is a very old algorithm from 1994 that still proves to be an effective heuristic, and the important thing is that it looks at your corpus statistics to make sensible decisions about how to best adaptively allocate vocabulary to represent sequences of characters. I hope that one day I won't have to give this lecture, because we'll just have architectures that map from bytes, but until then, we'll have to deal with tokenization. Okay, that's it for today. Next time we're going to dive into the details of PyTorch, give you the building blocks, and pay attention to resource accounting. All of you have presumably implemented PyTorch programs, but we're going to really look at where all the FLOPs are going. Okay, see you next time.
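For completeness, here is a sketch of the encode/decode half described above, assuming the merges, vocab, and merge helper from the training sketch earlier. It mirrors the simple-but-slow "replay the merges in order" approach from the lecture, not the optimized assignment solution:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    indices = list(text.encode("utf-8"))
    # Replay the merges in the order they were learned (dicts keep insertion order).
    for pair, new_token in merges.items():
        indices = merge(indices, pair, new_token)
    return indices

def decode(indices: list[int], vocab: dict[int, bytes]) -> str:
    # Expand each token back into its bytes, then decode the byte string.
    return b"".join(vocab[i] for i in indices).decode("utf-8")

ids = encode("the quick brown fox", merges)
assert decode(ids, vocab) == "the quick brown fox"   # round-trips
```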

Latest Summary (Detailed Summary)

Generated: 2025-05-13 19:12

Title: Stanford CS336 Language Modeling from Scratch | Spring 2025 | 01 Overview and Tokenization
Description: 2025 Spring

Overview / Executive Summary

Stanford's CS336, "Language Models from Scratch," aims to give students an end-to-end understanding of the language model building pipeline across data, systems, and modeling. The course is taught jointly by Percy Liang and Pu [surname unclear], with a teaching team that also includes TA Roi [surname unclear], a third-year PhD student [name unclear], and second-year PhD student Barsaw. Percy Liang emphasizes that researchers are increasingly disconnected from the underlying technology, and the course's "build it from scratch" philosophy is meant to close that gap and enable fundamental research.

The course acknowledges that frontier models such as GPT-4, with their enormous parameter counts (rumored 1.8 trillion) and training costs (hundreds of millions of dollars), are out of reach for academia, and that their details are not disclosed. It therefore focuses on building small but representative models, teaching the core "mechanics" (e.g., Transformers, model parallelism) and "mindset" (e.g., squeezing the most out of the hardware, taking scaling seriously), while "intuitions" about data and modeling choices can only be partially taught. The central claim is that "algorithms at scale" are what matter: maximize model quality under a given compute and data budget rather than chasing scale blindly.

The course is organized into five units: Basics (implementing a complete tokenizer, Transformer, and training loop), Systems (GPU kernel optimization, parallelism, inference optimization), Scaling Laws (predicting large-scale behavior from small-scale experiments), Data (evaluation, acquisition, cleaning, deduplication), and Alignment (supervised fine-tuning and learning from feedback, e.g., DPO and GRPO). The assignments are heavy, require coding from a blank file, and come with access to an H100 cluster. The first lecture focuses on tokenization: it compares character-, byte-, and word-level approaches and walks through the principles and implementation of byte pair encoding (BPE), which uses corpus statistics to allocate the vocabulary adaptively.

Course Introduction and Teaching Team

  • Course name: CS336 Language Models from Scratch
  • Core instructors:
    • Percy Liang: one of the lead instructors; emphasizes that the course covers the complete end-to-end language model pipeline: data, systems, and modeling.
    • Pu [surname unclear]: co-instructor; believes "you have to build it from scratch to really understand it" and wants to teach the deep technical material.
    • Roi [surname unclear]: TA; previously took the course as a student.
    • Barsaw: second-year PhD student; topped several of last year's leaderboards.
  • Course updates:
    • This is the second offering of the course.
    • Enrollment has grown by roughly 50%.
    • The number of TAs went from 2 to 3.
    • All lectures will be posted publicly on YouTube for learners worldwide.

Course Motivation and Philosophy

  • Responding to a research crisis: Percy Liang argues that researchers are increasingly detached from the underlying technology.
    • "Eight years ago, researchers implemented and trained their own models."
    • "Six years ago, you would at least download a model like BERT and fine-tune it."
    • "Now, many people can get away with just prompting a proprietary model."
  • Leaky abstractions: abstraction boosts productivity, but the language-model abstraction is "leaky"; its internals are opaque ("string in, string out").
  • Enabling fundamental research: a full understanding of the technology is essential for research that requires "tearing up the stack" and co-designing data, systems, and models.
  • Core philosophy: "To understand it, you have to build it."

Challenges and the Scope of the Course

  • The industrialization of language models:
    • GPT-4: rumored to have 1.8 trillion parameters and to have cost around $100 million to train.
    • X.ai: building a cluster of 200,000 H100 GPUs.
    • Massive investment: more than $500 billion projected over four years.
    • No disclosure of details: OpenAI explicitly states that "given the competitive landscape and safety implications, we will not disclose any details."
  • Limitations of the course:
    • Students cannot train their own GPT-4-scale model.
    • The course builds "small language models," which may not fully represent large-model behavior.
      • Example 1 (FLOPs distribution): a tweet by Stephen Roller shows that in small models the attention and MLP layers account for roughly equal FLOPs, while in a 175B-parameter model the MLP FLOPs dominate; over-optimizing attention at small scale may therefore be the wrong direction (a rough FLOPs sketch follows this section).
      • Example 2 (emergent behavior): Jason Wei's 2022 paper shows that only once training FLOPs pass a certain scale do models suddenly exhibit "emergent abilities" on some tasks, such as in-context learning.
  • What the course can teach:
    • Mechanics: how Transformers work, how model parallelism uses GPUs efficiently, and so on.
    • Mindset: "squeeze as much out of the hardware as possible" and "take scaling seriously"; Percy views this as key to OpenAI's success.
    • Intuitions: which data and modeling decisions yield good models (only partially teachable, since small and large models can behave differently).
  • The value of experiments: quoting the conclusion of Noam Shazeer's SwiGLU paper: "we attribute their success to divine benevolence." Sometimes experimental results outrun theoretical explanation.
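The FLOPs-distribution point above can be made concrete with a back-of-the-envelope calculation. The constants below are standard per-layer, per-token approximations (Q/K/V/output projections ≈ 8·d², attention over the context ≈ 4·n·d, MLP with 4× expansion ≈ 16·d²), not numbers taken from the lecture:

```python
# Rough per-layer, per-token FLOPs: projections and the MLP scale with
# d_model^2, while attention over the context scales with sequence length.
def flops_per_layer_per_token(d_model: int, seq_len: int) -> tuple[int, int]:
    attn = 8 * d_model**2 + 4 * seq_len * d_model   # QKV/output proj + attention
    mlp = 16 * d_model**2                           # two matmuls, 4x expansion
    return attn, mlp

# Small model: attention and MLP FLOPs are comparable.
print(flops_per_layer_per_token(d_model=768, seq_len=1024))
# GPT-3-scale widths: the MLP term clearly dominates.
print(flops_per_layer_per_token(d_model=12288, seq_len=2048))
```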

The "Bitter Lesson" and the Importance of Efficiency

  • Misreading the "Bitter Lesson": it does not mean "scale is everything and algorithms don't matter."
  • Correct reading: "algorithms at scale are what matter."
  • Why efficiency is so important at scale:
    • Model accuracy = efficiency × resources.
    • At large scale (hundreds of millions of dollars), waste is unaffordable.
    • OpenAI's resource utilization likely far exceeds academia's.
  • Gains in algorithmic efficiency:
    • A 2020 OpenAI paper reports that between 2012 and 2019, the algorithmic efficiency of training ImageNet to a fixed accuracy improved 44x: "faster than Moore's law."
  • The framing question: "What is the best model one can build given a certain compute and data budget?"
  • The researcher's goal: improve algorithmic efficiency.

A Brief History of Language Models and the Current Landscape

  • Early days:
    • Shannon: used language models to estimate the entropy of English.
    • NLP: language models as components in larger systems for machine translation, speech recognition, etc.
    • Google (2007): trained large n-gram models on more than 2 trillion tokens (more than GPT-3), but as n-gram models they did not exhibit the rich phenomena of modern LMs.
  • The 2010s (key ingredients of the deep learning revolution):
    • The first neural language model (Bengio's group, 2003).
    • Seq2seq models (Sutskever, Google).
    • The Adam optimizer.
    • The attention mechanism.
    • The Transformer ("Attention Is All You Need," 2017).
    • Scaling up mixture-of-experts (MoE) models.
    • Model parallelism, already exploring training of ~100B-parameter models at the time.
  • The foundation-model wave:
    • ELMo, BERT, and T5 made a big splash at the time.
  • OpenAI's push: combining existing ingredients with strong engineering and a heavy bet on scaling laws, leading to GPT-2 and GPT-3.
  • The rise of open models:
    • Early work: EleutherAI (after GPT-3).
    • Meta's early attempts (less successful).
    • Bloom, Meta (Llama), Alibaba, DeepSeek AI, AI2, and others have since released open-weight models.
  • Levels of openness:
    • Closed models: e.g., GPT-4; no details.
    • Open-weight models: weights and an architecture paper, but no dataset details.
    • Open-source models: weights, data, and as detailed a paper as possible.
  • Current landscape: frontier models from OpenAI, Anthropic, X.ai, Google, Meta, DeepSeek, Alibaba, Tencent, and others dominate the market.
  • Course strategy: retrace the evolution of these key ingredients, and combine open-source community information with analysis of closed models to get as close as possible to frontier best practices.

Executable Lecture

  • The lecture itself is an "executable program" that can be run and stepped through.
  • Features:
    • Embedded code can be executed step by step, with environment variables visible.
    • Shows the hierarchical structure of the lecture.
    • Supports jumping to definitions.
    • Essentially a Python program post-processed for easy viewing.

Logistics

  • Course website: all materials are online.
  • Units and workload: 5 units. Quoting a course evaluation: "Assignment 1 alone is about as much work as all five CS224N assignments plus the final project combined."
  • Who should take it: people with an "obsessive need" to understand how things work under the hood, and those who want to strengthen their research-engineering skills and comfort building large-scale ML systems.
  • Who should not:
    • Those who need to get a lot of research done this quarter.
    • Those who only want the latest, hottest techniques (the course is fundamentals-focused and bottom-up).
    • Those who treat "training a language model from scratch" as the default solution to every problem (prompt engineering and fine-tuning are usually the more practical starting points).
  • If you cannot enroll or audit:
    • All lecture notes and assignments are online.
    • Lecture recordings will be posted to YouTube (with a delay of several weeks).
    • The course will be offered again next year.
  • Assignments (5 total):
    • No scaffolding code; the emphasis is on building "from scratch."
    • Unit tests and adapter interfaces are provided to check correctness.
    • Students make their own software-design decisions.
    • Strategy: prototype locally on small datasets, then benchmark on the cluster.
    • Some assignments have leaderboards (e.g., minimize perplexity within a training budget).
    • AI tools (e.g., Copilot): "use at your own risk"; students are responsible for their own learning experience.
  • Compute cluster:
    • H100 GPUs provided by Together AI.
    • Read the cluster usage guide carefully.
    • Start assignments early to avoid cluster congestion near deadlines.
  • Units question: the course is fixed at 5 units; the workload does not change.
  • Communication: a Slack channel will be set up.
  • Auditing: auditors get access to all online materials and assignments, and can watch lecture videos via Canvas.
  • Grading: based on passing unit tests and on model performance (e.g., loss, efficiency), with explicit point breakdowns per problem part.

Course Structure: Five Pillars

The framing question: "Suppose you are given a Common Crawl web dump, 32 H100 GPUs, and two weeks. What should you do?" This involves decisions about the tokenizer, the architecture, systems optimization, data cleaning, and more.

1. Basics

  • Goal: implement a basic but complete language model training pipeline.
  • Components:
    • Tokenizer: converts between strings and sequences of integers.
      • The course focuses on byte pair encoding (BPE).
      • Tokenizer-free approaches (operating directly on raw bytes) are promising but not yet proven at scale.
    • Model architecture:
      • Starting point: the original Transformer.
      • Important refinements:
        • Activation: SwiGLU.
        • Positional encoding: rotary positional embeddings (RoPE).
        • Normalization: RMSNorm (simpler than LayerNorm; see the sketch after this section).
        • Placement of the normalization layers: differs from the original Transformer.
        • MLP layer: dense MLPs can be replaced with mixture-of-experts (MoE).
        • Attention: full attention, sliding-window attention, linear attention (addressing the quadratic cost), plus lower-dimensional variants such as GQA and MQA.
        • Alternative architectures: state-space models such as Hyena, or hybrids.
    • Training:
      • Optimizer: AdamW (the mainstream choice); promising newer optimizers such as Lion and Sophia.
      • Learning-rate schedules, batch size, regularization, and other hyperparameters.
      • "Details do matter": a well-tuned architecture can differ from a vanilla Transformer by an order of magnitude.
  • Assignment 1:
    • Implement a BPE tokenizer ("this may be more work than you expect").
    • Implement the Transformer, cross-entropy loss, AdamW, and the training loop.
    • PyTorch is allowed, but not its built-in Transformer implementation (a list of allowed functions is provided).
    • Datasets: TinyStories, OpenWebText.
    • Leaderboard: minimize OpenWebText perplexity given 90 minutes of training on an H100.
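As referenced in the architecture list above, here is a minimal RMSNorm sketch (illustrative only, not the assignment's reference implementation). Unlike LayerNorm, it only rescales by the root-mean-square, with no mean subtraction and no bias term:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))   # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last (feature) dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(2, 16, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 16, 64])
```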

2. Systems

  • Goal: further optimization; maximize hardware utilization.
  • Components:
    • Kernels:
      • GPU architecture in brief: many floating-point units, off-chip memory (HBM), on-chip caches (L1, L2).
      • Bottleneck: the cost of data movement.
      • Techniques: fusion and tiling to minimize data movement.
      • Tooling: Triton (developed by OpenAI).
    • Parallelism:
      • Multi-GPU interconnects (NVLink, NVSwitch); moving data between GPUs is even slower.
      • Techniques: data parallelism, tensor parallelism, etc. (a minimal data-parallel sketch follows this section).
    • Inference:
      • Definition: generating tokens given a prompt and a trained model.
      • Uses: reinforcement learning, test-time compute, model evaluation.
      • Cost: worldwide, inference cost is overtaking training cost (training is one-time; inference grows with usage).
      • Phases:
        • Prefill: processes the input prompt; highly parallel and compute-bound.
        • Decode: generates tokens one at a time autoregressively; hard to saturate the GPU, memory-bound.
      • Speedups: cheaper models, speculative decoding, systems optimizations.
  • Assignment 2:
    • Implement a kernel.
    • Implement data parallelism.
    • Implement a simplified FSDP (Fully Sharded Data Parallel).
    • Build the habit of benchmarking and profiling.
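The data-parallel sketch referenced in the list above: each rank computes gradients on its own shard of the batch, then gradients are averaged across ranks with an all-reduce before the optimizer step. This is only the conceptual core (it assumes a process group has already been initialized, e.g., via torchrun and dist.init_process_group), not the assignment's full solution:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel ranks (call after backward())."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)   # sum over ranks
            param.grad /= world_size                            # then average
```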

3. Scaling Laws

  • Goal: predict hyperparameters and loss at large scale from small-scale experiments.
  • Core question: given a FLOPs budget, how large should the model be (big model on less data vs. small model on more data)?
  • Chinchilla-optimal:
    • Established by a series of papers from OpenAI and DeepMind.
    • For each compute budget, find the optimal parameter count, then fit a curve and extrapolate.
    • Rule of thumb: training tokens ≈ 20 × parameters N (e.g., a 1.4B-parameter model wants about 28B tokens); see the back-of-the-envelope sketch after this section.
    • Limitation: ignores inference cost.
  • Assignment 3:
    • A "training API" is provided: submit hyperparameters, get back a loss.
    • Within a FLOPs budget, run experiments, collect data, and fit scaling laws.
    • Submit predictions of the hyperparameters (model size, etc.) at a larger scale.
    • Emphasizes careful experimental design (simulating the real setting).
    • Leaderboard: minimize loss under the given FLOPs budget.
  • Note on links: assignment links in the lecture point to the 2024 versions; 2025 will be broadly similar but modified.
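The back-of-the-envelope sketch referenced above combines the 20-tokens-per-parameter rule of thumb with the common training-cost approximation C ≈ 6·N·D FLOPs (the 6·N·D constant is a standard approximation, not a number stated in this summary):

```python
# Chinchilla-style rule of thumb: with D = 20 * N and C = 6 * N * D,
# solving for N gives N = sqrt(C / 120).
def chinchilla_optimal(flops_budget: float) -> tuple[float, float]:
    n_params = (flops_budget / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)
print(f"~{n / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")
# -> roughly a 2.9B-parameter model on ~58B tokens for 1e21 FLOPs
```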

4. Data

  • Goal: understand the central role of data in shaping model capabilities. "What do I want this model to do?"
  • Components:
    • Evaluation:
      • Perplexity.
      • Standardized tests (e.g., MMLU).
      • Generation-based evaluation of instruction-following models.
      • The effect of ensembling and chain-of-thought on evaluation.
      • Evaluating whole systems, not just the language model.
    • Data curation:
      • "We train on the internet" is inaccurate; data must be actively acquired and processed.
      • Common Crawl example: lots of junk and non-target-language content.
      • Sources: web crawls, books, papers, GitHub, etc.
      • Processing: legal issues, data purchasing (frontier labs), and lossy conversion from HTML/PDF/code repositories to plain text.
      • Key steps: filtering (for quality and against harmful content, typically with classifiers) and deduplication (a minimal dedup sketch follows this section).
  • Assignment 4:
    • Raw Common Crawl data is provided.
    • Train classifiers to clean the data and deduplicate it.
    • Leaderboard: minimize perplexity within a given token budget.
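The deduplication step referenced above, in its simplest form, is exact dedup by content hash after light normalization. This sketch is illustrative only; the assignment also involves fuzzy deduplication and quality/harmfulness filtering:

```python
import hashlib

def dedup(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each (normalized) document."""
    seen: set[str] = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello world.", "hello world.", "Something else entirely."]
print(dedup(docs))   # the near-identical duplicate is dropped
```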

5. Alignment

  • Goal: turn a base model (which can predict the next token) into something "useful."
  • Aspects of alignment:
    • Following instructions.
    • Controlling generation style (long/short, lists, playful/serious, etc.).
    • Safety (refusing harmful requests).
  • The two stages of alignment:
    • Supervised fine-tuning (SFT):
      • Collect prompt-response pairs and do supervised learning.
      • The base model already has the latent capability; a small number of examples (e.g., a few thousand) can be enough to instill instruction following.
      • Similar to pretraining: maximize the probability of the text.
    • Learning from feedback:
      • Use lighter-weight annotation and let the algorithm do more of the work.
      • Types of data:
        • Preference data: A/B comparisons of multiple responses to the same prompt.
        • Verifiers: formal verifiers (math, code) or learned verifiers (training an LM to score).
      • Algorithms (from reinforcement learning):
        • PPO (Proximal Policy Optimization): used in early instruction tuning.
        • DPO (Direct Preference Optimization): simpler and effective for preference data (a sketch of the DPO loss follows this section).
        • GRPO (Group Relative Policy Optimization): developed by DeepSeek; simplifies PPO by dropping the value function, improving efficiency.
  • Assignment 5:
    • Implement SFT, DPO, and GRPO.
    • Run evaluations.
  • Workload recap: Assignments 1 and 2 are the heaviest, Assignment 3 is comparatively light, and Assignments 4 and 5 (last year) were lighter than 1 and 2.
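The DPO loss referenced above, written as it usually appears in the literature (this is a standalone sketch, not the assignment's required interface): push up the policy's log-probability ratio on the chosen response relative to the rejected one, measured against a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Per-sequence log-prob ratios of policy vs. the frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): encourage chosen > rejected under the policy.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy tensors standing in for summed per-sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```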

The Course's Guiding Principle: Efficiency

  • Efficiency is the driving principle: every design decision should be justified in terms of efficiency.
  • The current constraint (for this course and most researchers): compute-constrained.
    • Data processing: filter aggressively so compute is not wasted on bad data.
    • Tokenization: byte-level models are elegant, but with today's compute they are inefficient, hence tokenization.
    • Model architecture: many design decisions stem from efficiency considerations.
    • Training: usually only one epoch, a sign that we are in a hurry to see more data.
    • Scaling laws: use a small amount of compute to predict hyperparameters.
    • Alignment: efficient alignment may let a smaller base model perform well on targeted tasks.
  • The likely future (frontier labs): data-constrained.
    • Design decisions may change: more epochs, or new architectures, since the Transformer is largely designed for compute efficiency.
    • Efficiency still matters, but the concrete decisions shift with the constraint.

Unit 1 Deep Dive: Tokenization

  • Reference: Andrej Karpathy's video on tokenization.
  • Definition: convert raw text (typically a Unicode string) into a sequence of integers (tokens), with an encode and a decode step. The vocabulary size is the number of distinct values a token can take.
  • Online tokenizer demo (tiktokenizer.io):
    • Spaces are usually part of a token, and by convention attach to the front of the following token.
    • hello and  hello (with a leading space) are different tokens.
    • Numbers get chopped into pieces.
  • GPT-2 tokenizer example:
    • encode(): string -> list of indices.
    • decode(): list of indices -> string.
    • Compression ratio: bytes / tokens (1.6 in the example, i.e., each token represents 1.6 bytes; see the tiktoken check after this section).
  • Evolution of tokenization approaches:
    • Character-based: Unicode characters -> code points.
      • Problems: potentially huge vocabulary (there are many Unicode characters), poor handling of rare characters, inefficient. Example compression ratio: 1.5.
    • Byte-based: string -> UTF-8 byte sequence.
      • Advantage: tiny vocabulary (0-255).
      • Problem: sequences become too long (compression ratio 1.0, one byte per token), which is disastrous for quadratic attention.
    • Word-based: the classical NLP approach, splitting with regular expressions.
      • Problems: unbounded vocabulary, out-of-vocabulary words (UNK tokens), awkward to handle.
  • Byte pair encoding (BPE):
    • Origin: proposed by Philip Gage in 1994 for data compression.
    • NLP: introduced to neural machine translation by Sennrich et al. (2016).
    • Language modeling: adopted by GPT-2.
    • Core idea: train the tokenizer on raw text so that common character sequences organically merge into single tokens, while rare sequences are represented by multiple tokens.
    • GPT-2 detail: pre-tokenize with a word-based splitter first, then run BPE within each segment (the course assignment does the same).
    • BPE training algorithm:
      1. Convert the string to a byte sequence.
      2. Repeat (for a specified number of merges, num_merges):
        • Count the frequency of every adjacent token pair in the training data.
        • Find the most frequent pair (A, B).
        • Create a new token ID C representing the merge of (A, B).
        • Replace every occurrence of (A, B) in the training data with C.
        • Record the merge (A, B) -> C.
    • BPE encoding (inference):
      1. Convert the input string to a byte sequence.
      2. Apply the learned merges to the sequence in the order they were learned.
    • BPE decoding: recursively expand each token ID back into raw bytes, then convert to a string.
    • Efficiency of the naive implementation: looping over all merges during encoding can be slow; the assignment requires optimizing this.
  • Tokenization takeaways:
    • BPE is an effective heuristic that uses corpus statistics to allocate the vocabulary adaptively.
    • Percy: "I hope one day I won't have to give this lecture, because we'll have architectures that map directly from bytes; until then, we have to deal with tokenization."
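The tiktoken check referenced in the GPT-2 tokenizer example above: a round-trip and compression-ratio sanity check against the real GPT-2 tokenizer. This assumes the tiktoken package is installed; the exact ratio will vary with the input text.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Hello, world! This is a test of the GPT-2 tokenizer."
ids = enc.encode(text)
assert enc.decode(ids) == text                      # round-trips
print(len(text.encode("utf-8")) / len(ids))         # bytes per token, roughly 1.5-1.7
```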

Next Lecture

  • A deep dive into the details of PyTorch.
  • Focus on resource accounting: understanding where the FLOPs go.