2024 | Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)

This lecture gives an overview of how large language models (LLMs) are built. The speaker first introduces the basic idea of LLMs (such as ChatGPT, Claude, Gemini, and Llama) and notes that the key components of building an LLM are the model architecture, the training loss and algorithm, the data, the evaluation methodology, and the systems components. The speaker emphasizes that although academia tends to focus on architectures and algorithms, industry practice puts more weight on data, evaluation, and systems, so this lecture concentrates on the latter.

Building an LLM is usually split into two phases: pretraining and post-training. The goal of pretraining is general-purpose language modeling: the model learns from massive text corpora (for example, essentially the contents of the whole internet). The core task in this phase is language modeling, i.e., learning the probability of a sequence of tokens.

Today's mainstream LLMs are autoregressive language models. Using the chain rule of probability, they decompose the joint probability of a sequence into a product of conditional probabilities: the probability of the next token given all previous tokens. The workflow is roughly as follows: first the input text is tokenized, mapping words or subwords to unique IDs; the token IDs are then fed into the model (usually a Transformer, though the lecture does not go into architectural details); the model outputs a probability distribution over the entire vocabulary for the most likely next token. During training, the predicted distribution is compared with the token that actually occurred (usually represented as a one-hot encoding), and a cross-entropy loss is used to adjust the model parameters so as to maximize the probability of the correct next token. The choice of tokenizer matters a great deal, because it defines the vocabulary size, which directly determines the model's output dimension. Pretrained models are evaluated with perplexity and with academic benchmarks such as MMLU.

Media details

Upload date
2025-05-14 13:38
Source
https://www.youtube.com/watch?v=9vM4p9NN0Ts
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: So let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap, LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently: ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this. Today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have questions, please interrupt me and ask. If you have a question, most likely other people in the room or on Zoom have the same question, so please ask. Great. So what matters when training LLMs? There are a few key components. One is the architecture. As you probably all know, LLMs are neural networks, and when you think about neural networks, you have to think about what architecture you're using. Another component which is really important is the training loss and the training algorithm: how you actually train these models. Then there's the data: what do you train these models on? Then the evaluation: how do you know whether you're actually making progress towards the goal of LLMs? And then the systems component: how do you actually make these models run on modern hardware, which is really important because these models are really large. So now more than ever, systems is a really important topic for LLMs. So those are the five components. You probably all know that LLMs, and if you don't know, LLMs are all based on Transformers, or at least some version of Transformers. I'm actually not going to talk about the architecture today. One, because there was a lecture on Transformers here a few weeks ago, and two, because you can find so much information online about Transformers, but I think you can find much less information about the other four topics, so I really want to talk about those. Another thing to say is that most of academia actually focuses on architectures, training algorithms and losses. As academics, and I've done that for a big part of my career, we like thinking that when we make new architectures and new models it's very important. But in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation and systems, which is what most of industry actually focuses on. That's also one of the reasons why I don't want to talk too much about architecture, because really the rest is super important. Great. So, overview of the lecture. I'll be talking about pretraining. Pretraining, you've probably heard that word, is the classical language modeling paradigm, where you basically train a language model to essentially model all of the internet. And then there's post-training, which is a more recent paradigm: taking these large language models and making them essentially AI assistants. This is more of a recent trend, since ChatGPT. So if you've ever heard of GPT-3 or GPT-2, that's really pretraining land. If you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pretraining, and specifically I'll talk about what the task of pretraining LLMs is and what loss people actually use.
So, language modeling, this is a quick recap. Language models, at a high level, are simply models of a probability distribution over sequences of tokens or words. So it's basically some model of p(x1, ..., xL), where x1 is word one and xL is the last word in the sequence, the last one in the sentence. Very concretely, if you have a sentence like "the mouse ate the cheese", what the language model gives you is simply the probability of this sentence being uttered by a human or being found online. If you have another sentence like "the the mouse ate cheese", there are grammatical mistakes there, so the model, which should have some syntactic knowledge, should know that this has a lower likelihood of appearing online. If you have another sentence like "the cheese ate the mouse", then the model should hopefully know that cheese doesn't usually eat mice. So there's some semantic knowledge, and this is less likely than the first sentence. That is, at a high level, what language models are. One phrase you've probably been hearing a lot in the news is generative models. That's just a model that can generate sentences, or can generate data. The reason why we say language models are generative models is that once you have a model of a distribution, you can simply sample from that model, and then you can generate data, so you can generate sentences using a language model. The type of models that people are all currently using are what we call autoregressive language models. The key idea of autoregressive language models is that you take this distribution over words and you decompose it into the distribution of the first word, multiplied by the distribution of the second word given the first word, multiplied by p of the third word given the first two words, and so on. There's zero approximation here: this is just the chain rule of probability, which hopefully you know about. There's really no approximation; this is just one way of modeling a distribution. So, slightly more concisely, you can write it as a product of p's of the next word given everything which happened in the past, given the context. This is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution, it's just one way; it has some benefits and some downsides. One downside of autoregressive language models is that when you actually sample from them, you basically have a for loop which generates the next word, then conditions on that next word, and then generates another word. So if you have a longer sentence that you want to generate, it takes more time to generate it. There are some downsides of this current paradigm, but that's what we currently have, so I'm going to talk about this one. Great. So, autoregressive language models: at a high level, the task of an autoregressive language model is simply predicting the next word, as I just said. If we have a sentence like "she likely prefers", one potential next word might be "dogs". And the way we do it is that we first tokenize. You take these words or subwords, you tokenize them, and then you give an ID to each token, so here we have one, two, three, and then you pass it through this black box. As I already said, we're not going to talk about the architecture.
You just pass it through a model, and you then get a probability distribution over the next word, over the next token. Then you sample from this distribution, you get a new token, and then you detokenize: you get a new ID, you detokenize it, and that's how you basically sample from a language model. One thing which is important to note is that the last two steps are actually only needed during inference. When you do training, you just need to predict the most likely token, and you can just compare it to the real token which happened in practice, and then you basically change the weights of your model to increase the probability of generating that token. Great. So, autoregressive neural language models. To be slightly more specific, still without talking about the architecture, the first thing we do is that we have all of these... sorry, yes?
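As a concrete illustration of the sampling loop just described (tokenize, run the model, get a distribution over the next token, sample, condition on it, repeat, then detokenize), here is a minimal sketch. It assumes a Hugging Face-style causal LM and tokenizer; the names are illustrative and not code from the lecture.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=20):
    ids = tokenizer.encode(prompt, return_tensors="pt")    # words/subwords -> token IDs
    for _ in range(max_new_tokens):                        # the "for loop" over new tokens
        logits = model(ids).logits[:, -1, :]               # scores for the next token only
        probs = torch.softmax(logits, dim=-1)              # distribution over the vocabulary
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        ids = torch.cat([ids, next_id], dim=-1)            # condition on it and repeat
    return tokenizer.decode(ids[0])                        # detokenize back to text
```

At training time this loop is not needed: every position in the sequence already gives a next-token prediction that can be compared against the token that actually follows.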
speaker 2: When you're predicting the probability of the next tokens, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have? And how do you deal with, like, adding new tokens?
speaker 1: Yes, and we're going to talk about tokenization later, so you will get some sense of this. You basically can deal with adding new tokens; I'm kind of glossing over the methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point: the vocabulary size, so the number of tokens that you have, is essentially the output dimension of your language model, and it's actually pretty large. Okay. So, autoregressive neural language models. The first thing you do is that you take every word, or every token, and you embed it: you get some vector representation for each of these tokens. You pass them through some neural network, as we said, a Transformer, and you get a representation for all the words in the context, so basically a representation of the entire sentence. You pass that through a linear layer, as you just said, to map it so that the number of outputs is the number of tokens. You then pass it through a softmax, and you get a probability distribution over the next word given every word in the context. And the loss that you use: it's essentially a task of classifying the next token, so it's a very simple kind of machine learning task. You use the cross-entropy loss, where you look at the actual target that happened, which is a target distribution that is a one-hot encoding. Here, in this case, the real word that happened is "cat", so that's a one-hot distribution on "cat". And here (do you see my mouse? oh yeah) this is the distribution that you generated, and you do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens. One thing to notice is that, as you all know, this is just equivalent to maximizing the log-likelihood of the text, because you can rewrite the max of the probability of this autoregressive language modeling task (I just added the log here and a minus sign) as the minimum of the loss, which is the cross-entropy loss. So minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions?
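To make the forward pass and loss just described concrete, here is a minimal sketch of the next-token prediction objective: embed, contextualize, map to the vocabulary size, softmax, and cross-entropy against the one-hot target. It assumes PyTorch; all sizes and module choices are illustrative, not the lecture's code.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 50_000, 512, 16

embedding = nn.Embedding(vocab_size, d_model)      # token ID -> vector
backbone = nn.TransformerEncoder(                  # stand-in for the "black box"
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab_size)           # map back to vocabulary size

token_ids = torch.randint(0, vocab_size, (1, seq_len))            # a tokenized sentence
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
hidden = backbone(embedding(token_ids), mask=causal_mask)         # one representation per position
logits = lm_head(hidden)                                          # (1, seq_len, vocab_size)

# Predict token t+1 from positions <= t, so targets are the inputs shifted by one.
# CrossEntropyLoss applies the softmax and compares against the one-hot target,
# i.e. it pushes up the probability of the token that actually occurred.
loss = nn.CrossEntropyLoss()(
    logits[:, :-1].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
loss.backward()   # minimizing this loss = maximizing the log-likelihood of the text
```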
Okay, tokenization. This is one thing that people usually don't talk that much about, but tokenizers are extremely important, so it's really important that you understand at least what they do at a high level. Why do we need tokenizers in the first place? First, they're more general than words. One simple thing you might think is: we're just going to take every word, and every word is a token in its own right. But then, if there's a typo in a word, you might not have any token associated with that misspelled word, and you don't know how to actually pass that word into the large language model. So what do you do? Also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words; that's the first thing. The second thing you might think of is tokenizing every sentence character by character: "a" is one token, "b" is another token. That would actually work, and probably pretty well. The issue is that your sequences then become super long, and as you probably remember from the lecture on Transformers, the complexity grows quadratically with the length of the sequence, so you really don't want super long sequences. Tokenizers basically try to deal with those two problems by giving common subsequences their own token. The way you should think about it is that, on average, a token is around three or four letters. There are many algorithms for tokenization; I'll just talk about one of them to give you a high-level picture: byte pair encoding, which is one of the two most common tokenizers. The way you train a tokenizer is that first you start with a very large corpus of text, and here I'm really not talking about training the large language model yet, this is purely for the tokenization step. So this is my large corpus of text with these five words. Then you give every character in this corpus its own token; here I just split up every character into a different token and color-coded them. Then you go through your text, and every time you see a pair of tokens that is very common, the most common pair of tokens, you just merge them. Here you see the tokens "t" and "o" next to each other three times, so you say: this is a new token. And then you continue, you repeat that. So now you have "to", which happens three times, and with further merges you end up with "token", which happens twice, and "ex", which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would end up with a trained tokenizer. In reality, you do it on a much larger corpus of text. And this is the real tokenizer of, actually, I think this is GPT-3 or ChatGPT, and here you see how it would actually separate these words. You basically see the same thing as in the previous example: "token" becomes its own token, while "tokenizer" is actually split up into two tokens, "token" and "izer". So yeah, that's all about tokenizers.
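A toy sketch of the byte-pair-encoding training loop just described, starting from characters and repeatedly merging the most frequent adjacent pair, might look like this. The five-word corpus below is made up for illustration; real tokenizers are trained on far larger corpora with pretokenization and many engineering tricks.

```python
from collections import Counter

def train_bpe(corpus_words, num_merges):
    words = [list(w) for w in corpus_words]        # every character starts as its own token
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pair_counts[(a, b)] += 1           # count adjacent token pairs
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]  # the most frequent pair gets merged
        merges.append((a, b))
        for i, w in enumerate(words):              # replace the pair by the merged token
            j, merged = 0, []
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == (a, b):
                    merged.append(a + b)
                    j += 2
                else:
                    merged.append(w[j])
                    j += 1
            words[i] = merged
    return merges

print(train_bpe(["token", "tokens", "text", "to", "tokenizer"], num_merges=5))
```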
speaker 1: Any questions on that?
speaker 1: Yeah. So actually there's a step before tokenizers, which is what we call pretokenizers, and which is exactly what you just said. In theory there's no reason to deal with spaces and punctuation separately: you could just say every space gets its own token, every punctuation mark gets its own token, and do all the merging. The problem is efficiency: actually training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing, and this is very pretokenizer-specific and very English-specific, is saying that if there's a space, we're not going to merge across it; we won't merge the token that came before the space with the token that comes after. But this is just a computational optimization. You could theoretically deal with spaces the same way as with any other character.
speaker 2: When you merge tokens, do you delete the tokens that you merged away, or do you keep them?
speaker 1: You actually keep the smaller tokens. I mean, in reality it doesn't matter much, because on a large corpus of text you will actually see everything, but you usually keep the small ones. And the reason you want to do that is that, as we said before, if there are grammatical mistakes or typos, you still want to be able to represent those words character by character. So, yeah.
speaker 2: Yes. Are the tokens unique? I mean, say in this case, "token": is there only one occurrence, or do you need to keep multiple occurrences so they can take on different meanings?
speaker 1: Oh, I see what you're saying. No, every token has its own unique ID. This is a great question. For example, if you think about "bank", which could be a bank for money or the bank of a river, it will have the same token. But the model, the Transformer, will learn, based on the words that are around it, to associate it (I'm being very hand-wavy here) with a representation that is either more on the money-bank side or the river-bank side. But the Transformer does that; it's not the tokenizer. Yes?
speaker 2: So you mentioned that during tokenization you keep the smaller tokens you started with. Like, if you start with "t" and "e", you keep them, and then you train your tokenizer and it comes out with "token". So say it did learn "token", and in your data you're trying to encode "token": how does the tokenizer know whether to encode it with "token" or with the "t"?
speaker 1: Great question. So that's after training of the tokenizer: when you actually apply the tokenizer, you basically always choose the largest token that you can apply. So if you can use "token", you will never use "t"; you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks you can use to make these things faster, so I really won't go into it. And honestly, I think a lot of people think that we should just get away from tokenizers and tokenize character by character, or byte by byte. But as I said, right now there's this issue of sequence length. Maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll move away from tokenizers.
speaker 2: So can you share with us the drawbacks?
speaker 1: Why do people want to move away from the tokenizer? Oh, yeah. So I think one good example is math. If you think about math, numbers right now are not tokenized digit by digit; for example, 327 might have its own token, which means that when models see numbers, they don't see them the same way we do. And this is very annoying, because the reason we can generalize in math is that we can deal with every digit separately and then do composition: you know that adding numbers is basically adding each digit separately, plus the carry. So you end up having to do special tokenization. And one of the big changes with GPT-4 was changing the way they tokenize code. For example, in Python you often have these four spaces at the beginning of a line; those were dealt with kind of strangely before, and as a result the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. Okay, I'll move on for now, but we can come back to tokenizers later. Great. So we talked about the task, the loss, the tokenizer; let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference with perplexity is that we use something slightly more interpretable: you take the average per-token loss and then you exponentiate it. The reason you exponentiate is that the loss has a log inside, and one, humans are actually pretty bad at thinking in log space, and two, logs depend on the base of the log, while when you exponentiate, you have everything in units of the vocabulary size. And averaging per token is just so that your perplexity is independent of the length of the sequence. So perplexity is just two to the power of the average per-token loss of the sequence. Perplexity is between one and the vocabulary size of your tokenizer. One: well, if you predict every word perfectly, then every word contributes a probability of one, so the best perplexity you can have is one. If you really have no idea, you predict each word with probability one divided by the vocabulary size, and if you do the simple math you get a perplexity equal to the vocabulary size. So the intuition of perplexity is that it's the number of tokens your model is hesitating between. If your model is perfect, it doesn't hesitate and knows exactly the word; if it really has no idea, it hesitates between the entire vocabulary. And perplexity really improved: on a standard dataset, between 2017 and 2023 it went from around 70 tokens to fewer than ten over those five or six years. That means the models were previously hesitating between around 70 words every time they generated a word, and now they hesitate between fewer than ten words, so that's much better. Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data people evaluate on, but it's still very important for the development of LLMs. When you actually train your own LLM, people will still really look at the perplexity.
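As a small worked example of the definition just given, perplexity as the exponentiated average per-token loss, here is a minimal sketch. It assumes per-token cross-entropy losses measured in nats (natural log); with base-2 logs you would raise 2 to the average instead.

```python
import math

def perplexity(per_token_losses):
    return math.exp(sum(per_token_losses) / len(per_token_losses))

# Perfect prediction: every token has probability 1, loss 0 -> perplexity 1.
print(perplexity([0.0, 0.0, 0.0]))                 # 1.0
# No idea: uniform over a 50,000-token vocabulary, loss ln(50000) per token
# -> perplexity 50,000, i.e. the model "hesitates" between the whole vocabulary.
print(perplexity([math.log(50_000)] * 3))          # 50000.0
```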
One other common way, and now more common in academia, of evaluating these LLMs is to take all the classical NLP benchmarks (I'll give you a few examples later) and just aggregate everything: collect as many automatically evaluable benchmarks as possible and evaluate across all of them. Two such benchmarks are HELM, which is from Stanford, and the Hugging Face Open LLM Leaderboard, which are probably the two most common ones right now. Just to give you an idea, in HELM you have all of these types of tasks, which are mostly things that can be evaluated easily, like question answering; think about many different question answering tasks. The benefit of question answering is that you usually know what the real answer is, so the way you evaluate these models (I'll give you a concrete example in one second) is that you can just look at how likely the language model is to generate the real answer compared to some other answers. That's essentially, at a high level, how you evaluate these models. To give you a specific example, MMLU is probably the most common academic benchmark for LLMs. It's just a collection of many questions and answers in all of these domains, for example college medicine, college physics, astronomy, these types of topics. And the questions are things like, this one is astronomy, "what is true for a type-Ia supernova?" Then you give four different potential answers and you ask the model which one is more likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is the most likely. So there are different ways you can prompt the model, but at a high level you know which one is correct and the three others are mistakes. Yes?
speaker 2: These models generate unconstrained text as output, so how do you evaluate a model if it gives something that's, you know, semantically completely identical but is not the exact token?
speaker 1: So that's a great question; I'll talk more about that later. In this case we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then look at the likelihood of the model generating A, the likelihood of the model generating B, C and D, and you look at which one is the most likely; or you ask the model, out of A, B, C, D, which one is the most likely, and you look at whether the most likely next token is A, B, C or D. So you constrain the model so that it can only answer these four things. Yeah.
speaker 2: Do you constrain it with the prompt, or do you mean you constrain its whole output probability distribution, so you're only comparing those four?
speaker 1: In the second case I gave you, you would actually do both: you would prompt the model with A, B, C, D, plus you would constrain it to only look at those four tokens. In the first case, you don't even need to generate anything. You literally just look, given that it's a language model and it gives a distribution over sentences, at what the likelihood of generating the first choice is, what the likelihood of generating the second choice is, and you just check whether the most likely sentence is actually the real answer. So you don't actually sample from it; you really just use p of x1 to xL. Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it is actually really important and really challenging. Yes?
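A minimal sketch of the first of those two evaluation modes, scoring each candidate answer by the likelihood the model assigns to it and picking the highest, assuming a Hugging Face-style causal LM; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(model, tokenizer, text):
    ids = tokenizer.encode(text, return_tensors="pt")
    logits = model(ids).logits[:, :-1, :]            # predictions for tokens 2..L
    targets = ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()                     # log p(x_1 ... x_L), up to the first token

def answer_mmlu_question(model, tokenizer, question, choices):
    scores = [sequence_log_likelihood(model, tokenizer, f"{question} {c}") for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])   # index of the most likely answer
```

No sampling is involved: the model is only used to assign probabilities to the four fixed candidate completions.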
speaker 2: Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do tokenization and on some design choices. I was wondering if you could speak more to that.
speaker 1: Oh, yeah. So think about perplexity: I told you perplexity is between one and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has, like, 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens. Then the upper bound of the perplexity you can get is actually worse for Gemini than for ChatGPT. Does that make sense? It's actually a little bit more complicated than that, but that's a first small way you can see that the tokenizer actually matters. Great. Okay, so evaluation challenges: there are many; I'll just talk about two really briefly. One, as I told you, there are two ways of doing evaluation for MMLU; actually there are many more than two, but I gave you two examples. And it happens that for a long time, even though this was a very classical benchmark that everyone used, different companies and different organizations were actually using different ways of evaluating MMLU, and as a result you get completely different results. For example, Llama 65B, which was the first model from Meta in the Llama series, had 63.7 accuracy on HELM, but on this other benchmark it had something like 48.8. So really, the way that you evaluate matters, and this is not even talking about prompting; this is really just the way that you evaluate the models. Prompting is another issue. So there are a lot of inconsistencies; it's not as easy as it looks. That's the first thing. The second thing, and this is a great question, is train-test contamination. This is something which I would say is really important in academia; given that the talk is mostly about training large language models at companies, it's maybe not that important for them, because they know what they trained on. For us, we have no idea, so for us it's a real problem. There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in Tatsu's lab have found is that, given that most datasets online are not randomized, you can use the fact that language models just predict the next word: you take the entire test set and check whether the model is more likely to generate all the examples in their original order versus in a different order. If it's more likely to generate them in order, given that there's no real meaning to that order, then the test set was probably in the training set. Does that make sense? So that's one of them; there are many other ways of doing it. Train-test contamination: again, not that important for development, really important for academic benchmarking. Great. There are many other challenges, but I'll move on for now. Great, data. Data is another really big topic. At a high level, people just say: you basically train large language models on all of the internet. What does that even mean? People sometimes say all of the clean internet, which is even less well defined. The internet is very dirty and really not representative of what we want in practice. If I downloaded a random website right now, you would be shocked at what is in it; it's definitely not your Wikipedia. So I'll go really briefly over what people do. I can answer some questions, but data on its own is a huge topic. Basically, first, what you do is download all of the internet.
What that means is that you use web crawlers that go to every web page on the internet, or every web page that is indexed by Google, and that is around 250 billion pages right now, around one petabyte of data. People don't usually write their own web crawlers; they use standard ones, and Common Crawl is one of them. Basically every month it adds all the new websites that were added to the internet and found by the crawler, and puts them into a big dataset. So on Common Crawl you have around 250 billion pages right now, on the order of 1e6 GB of data. Once you have this... so, this is a random web page, like literally random, from Common Crawl. And what you see is, one, it really doesn't look like the type of thing you would usually read. It's an HTML page, so it's hard to see, but if you look through it you will find some content. For example, here: "Test King World is your ultimate source for the System x high performance server", and then you have three dots, so the sentence isn't even finished. That's what random internet looks like. Of course, it's not that useful if you just train a large language model to generate things like this. So what are some of the steps that are needed? First, you extract the text from the HTML. That's what I just tried to do by looking for the actual text. There are a lot of challenges in this: for example, extracting math is actually very complicated but pretty important for training large language models; or boilerplate: a lot of web forums will have the same type of headers and the same type of footers, and you don't want to repeat all of that in your data. Then you filter undesirable content: not-safe-for-work content, harmful content, PII. Usually every company has a blocklist of websites that they don't want to train their models on (that blocklist is very long), and you basically say: if it comes from there, we don't train on it. There are other ways of doing these things, like training a small model to classify what is PII and removing it. It's hard. Every point that I'm going to show you here is a huge amount of work, and I'm going to go through it quickly. So: filter undesirable content. The next step is deduplication. As I said, you might have things like headers and footers in forums that are always the same; you want to remove that. You might also have a lot of URLs that are different but actually show the same website, and a lot of paragraphs from common books that are duplicated 1,000 or 10,000 times on the internet. So you have to deduplicate, which is also very challenging because you have to do it at scale. Once you've done deduplication, you do some heuristic filtering to try to remove low-quality documents. The way you do that is with rules-based filtering. For example, if you see that there are outlier tokens, if the distribution of tokens on the website is very different from the usual distribution, it's probably an outlier. If the length of the words on the website is super long, something strange is going on on that website. If the website has only three words, maybe it's not worth training on; if it has 10 million words, maybe there's also something wrong going on on that page.
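A sketch of what such rule-based heuristic filtering can look like: document-length bounds, word-length statistics, and a check for outlier characters. The thresholds below are made up for illustration; real pipelines tune them carefully and run them over billions of pages.

```python
def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if not (50 <= len(words) <= 100_000):              # too short or absurdly long
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (2 <= mean_word_len <= 12):                 # strange word-length distribution
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    if alpha_ratio < 0.7:                              # too many non-text / outlier characters
        return False
    return True

documents = ["<html> menu | login | share ...", "A longer, ordinary paragraph of prose ..."]  # hypothetical
kept = [d for d in documents if passes_heuristics(d)]
```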
speaker 2: So there are a lot of rules like this for filtering out undesirable content. Couldn't this be done in a supervised way? Can we not just say, you know, here's this hate-speech website, let's actively try to penalize generating that?
speaker 1: We'll do exactly that, but not at this step; that's where post-training will come in. For pretraining, the idea is just to say: I want to model roughly how humans write, and I want to remove all these headers, footers, menus and things like this. But it's a very good idea, and that's exactly what we'll do later. Next step: model-based filtering. Once you've filtered a lot of data, there's actually a very cute trick. You take all of Wikipedia and you look at all the links that are referenced from Wikipedia pages, because if something is referenced by Wikipedia, it's probably a high-quality website. And you train a classifier to predict whether a document comes from one of these Wikipedia references or from the random web, and you basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So you train a machine learning model, usually a very simple model, because you need to do this at scale; I mean, just think about the 250 billion pages. Next, you try to classify your data into different domains. You say: this is entertainment, this is books, this is code, these types of domains. And then you try to either upweight or downweight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning. That's something people usually say in a very hand-wavy way: if you train your model more on code, it helps reasoning. So you want to upweight the coding distribution, because that helps general language modeling skills. Books is usually another one that people upweight; entertainment they usually downweight. People used to do this fairly heuristically; now there are entire pipelines, which we'll talk about, for doing these things slightly more automatically. And then at the end of training, after training on all of this data that we saw, you usually train on very high-quality data while you decrease your learning rate. That basically means you're kind of overfitting your model on very high-quality data, so usually that's things like Wikipedia, and human data that was collected. There are other things, like continual pretraining to get longer context; I'm going to skip over all of that. But just to give you a sense of how hard it is: when people just say "oh, I'm going to train on the internet", that's a lot of work, and really we haven't figured it out yet. Collecting data well is a huge part of practical large language models; some might say it's actually the key. Yes?
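A sketch of the model-based filtering idea just described: train a cheap classifier to separate reference-quality pages (say, pages linked from Wikipedia) from random crawl pages, then keep crawl documents that the classifier scores highly. The tiny in-memory datasets, the featurizer and the 0.5 threshold are all placeholders; production pipelines use much larger training sets and very fast classifiers.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

wiki_linked_pages = ["... text of pages referenced by Wikipedia ...", "..."]   # positives (hypothetical)
random_crawl_pages = ["... text of random Common Crawl pages ...", "..."]      # negatives (hypothetical)

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, cheap featurizer that scales to huge corpora
X = vectorizer.transform(wiki_linked_pages + random_crawl_pages)
y = [1] * len(wiki_linked_pages) + [0] * len(random_crawl_pages)
quality_clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(doc: str, threshold: float = 0.5) -> bool:
    score = quality_clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return score >= threshold                       # keep documents that look "Wikipedia-reference-like"
```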
speaker 2: A basic question about the data: you usually start with, like, a petabyte of data; after you go through all these steps, what's the typical amount you end up with? And how long does it typically take to go through the data steps you talked about?
speaker 1: So the question is, how large is the data after you filter?
speaker 2: Yes, and how large a team do you need to go through all the filtering steps you mentioned?
speaker 1: How slow is it? How...
speaker 2: How many people do you need for this?
speaker 1: Okay, that's a great question. I'm going to somewhat answer the data question, how large the dataset is, at the end of this slide. For the number of people that work on it: that's a good question, and I'm actually not quite sure, but I would say it's probably even bigger than the number of people that work on the tuning of the pretraining of the model. The data side is bigger than the modeling side. I don't have a great sense, but I would say in the Llama team, which has something like 70-ish people, maybe 15 work on data. For all these things you don't need that many people, but you need a lot of compute, because for data you need a lot of CPUs. And I'll answer the other question at the end of this slide. So, as I just alluded to, we really haven't solved data at all for pretraining, and there's a lot of research to be done. First, how do you process these things super efficiently? Second, how do you balance all of these different domains? Can you do synthetic data generation? That's actually a big one right now, because, and we'll talk about this later, we don't have enough data on the internet. Can you use multimodal data instead of just text data, and how does that improve even your text performance? There's also a lot of secrecy, because this is really the key to most of the pretrained large language models: for competitive reasons, these companies usually don't talk about how they do data collection. And there's also a copyright liability issue; they definitely don't want to tell you that they trained on books even though they did, because otherwise you can sue them. Common academic benchmarks: this will kind of answer what you asked. It started, and those are the smaller ones, the names are not that important, at around 150 billion tokens, which is around 800 GB of data. Now it's around 15 trillion tokens, which is also what the best models right now are probably trained on. So 15 trillion tokens, which is around two orders of magnitude bigger than that, so roughly 80e3 GB. That would be around a 100x to 1,000x filtering-down of the Common Crawl, if I'm not mistaken. One very famous academic dataset is The Pile, and we can just look at the distribution of data it has: it's things like arXiv, PubMed Central, which is all the biology stuff here, Wikipedia, Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if we look here, this is 280B tokens, so in reality the big datasets are something like 100 times bigger, and you cannot have that much GitHub or Wikipedia at that scale. In terms of closed-source models, just to give you an idea: Llama 2 was trained on 2 trillion tokens, Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much data it was trained on, and that's the same as the biggest academic dataset, 15 trillion tokens. GPT-4 we don't really know, but it's probably in the same order of magnitude; it's probably around 13 trillion, if the leaks are right. So, scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot in pretraining large language models. Great, scaling laws.
So the idea is this: what people saw around 2020, or at least they had seen it for a long time but have been able to show it empirically since around 2020, is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class. In this class we teach you about overfitting; overfitting doesn't happen with large language models: larger models, better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists. Okay, so the idea of scaling laws is that, given that you know that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous paper from OpenAI called "Scaling Laws". On the x-axis you see compute, so how much compute you spent for training, and on the y-axis you see test loss, which is essentially your validation loss, so the log of the perplexity. And if you put these two on a log scale, you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will actually decrease. Same thing with data, and same thing with parameters: if you increase the dataset size, your loss will decrease by an amount that is somewhat predictable; if you increase the number of parameters, the loss will decrease by an amount that is somewhat predictable. This is really amazing and very surprising. It looks innocuous when you look at these plots, but it's crazy, because it means you can predict how well we're going to perform in two or three years, depending on how much compute we will add, assuming that these things hold. There's nothing theoretical about it. Yes?
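A sketch of how such a scaling-law fit works in practice: if the loss follows a power law in compute, L(C) = a * C^(-b), then log L is linear in log C, so a simple linear fit in log-log space lets you extrapolate to larger compute budgets. The data points below are made up for illustration.

```python
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs of hypothetical small runs
loss = np.array([3.2, 2.9, 2.6, 2.35])         # measured validation losses (hypothetical)

# fit log(loss) = slope * log(compute) + intercept
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)

def predicted_loss(c):
    return np.exp(intercept) * c ** slope       # back to L(C) = a * C^slope

print(predicted_loss(1e23))                     # extrapolate to 100x more compute
```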
speaker 2: What is the loss that they're using here?
speaker 1: It's the perplexity. Well, you know, I said perplexity was like two to the power of the loss, so this is essentially the log of the perplexity.
speaker 2: When you increase the number of parameters, or you increase the total dataset size, doesn't that just inherently increase your compute, because of all of this?
speaker 1: Yes, no, this is a great question. The compute here is actually a function of two things: the data and the parameters. And, well, we're going to talk about this in detail, but basically, if you increase the number of parameters, you should also increase the amount of data that you have, so you don't actually go over the same dataset multiple times. No one does multiple epochs, at least not yet, because we still have enough data. So yeah, it's all the same trend: increased compute, decreased loss. Yes?
speaker 2: Have you seen the numbers for the last two years? Is it...
speaker 1: Still holding? It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. Yes: is there no empirical evidence that it would ever plateau, since in theory we expect it to plateau? No empirical evidence of plateauing anytime soon. Why, we don't know. Will it happen? Probably. Though it doesn't need to, because this is in log scale, so it's not as if it mathematically has to plateau; it could continue decreasing like this. Most people think that it will probably plateau at some point; we don't know when. Okay, so I'll talk more about scaling laws now. Why are scaling laws really cool? Imagine that, you're very fortunate, I give you 10,000 GPUs for a month. What model would you train? How do you even go about answering that question? This is a hypothetical, but that's exactly what these companies are faced with. The old pipeline was basically to tune hyperparameters on the big models. Let's say I have 30 days: I train 30 models for one day each, I pick the best one, and that's the final model that I use in production. That means the model that I actually used was only trained for one day. The new pipeline is that you first find a scaling recipe, something that tells you, for example, one common thing is that if you increase the size of your model, you should decrease your learning rate. So you find a scaling recipe so that you know: if I increase the size of my model, here's what I should do with my hyperparameters. Then you tune your hyperparameters on smaller models of different sizes. Let's say I use three of my 30 days to train many different small models and do hyperparameter tuning on them, each at different sizes. Then I fit a scaling law and try to extrapolate, from these smaller models, which one will be the best if I train it for much longer, sorry, if I train a larger model. And then I train the final huge model for 27 days instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model you're going to use in practice; do it on smaller models at different scales, and try to predict how well they will perform once you make them bigger. Let me give you a very concrete example: Transformers versus LSTMs. Say you have these 10,000 GPUs and you're not sure which one you should be using: should I use a Transformer-based model or an LSTM-based model? What I would do is train Transformers at different scales; here you see parameters on the x-axis and my test loss on the y-axis. I would then train different LSTMs at different scales. Once I have these points, I see that they roughly fit a scaling law; I fit my scaling law, and then I can predict: if I had ten times more compute, here's how well the LSTM would perform. It's actually slightly less linear for the LSTM, but you could probably still try to predict where you would end up. And clearly, from this plot, you would see that Transformers are better. One thing to notice when you read these types of scaling laws is that there are two things that matter. One is really your scaling rate, which is the slope of the scaling law. The other thing is your intercept: you could start worse but actually become better over time.
It just happens that LSTMs are worse on both. But I could show you other cases where you could predict that after a certain scale you're better off using one type of model than another. So that's why scaling laws are actually really useful. Any questions on that? Yeah.
speaker 2: How sensitive are these to small differences, like one Transformer architecture versus another Transformer architecture? Do you basically have to fit your own curve and say, oh, the scaling law tells me this should be some logarithmic function, let me extrapolate from that?
speaker 1: Yeah. So usually, for example, if you're an academic, and at least this is pretty recent, and you want to propose a new activation function, that's exactly what you would do: you would fit a scaling law for it, show another scaling law for the standard one, like, I don't know, GELU, and show that yours is better. In reality, once you start thinking about it in scaling-law terms, you realize that all the small architecture differences we can make mostly just change the intercept a little bit, and that doesn't really matter, because you can just train for ten hours longer or wait for the next generation of GPUs. These things are really secondary, which is exactly what I was telling you originally: people spend too much time on architectures and losses, and in reality these things don't matter as much. Data, though: if you use good data, you will have a much better scaling law than if you use bad data. So that really matters. Another really cool thing you can do with scaling laws is ask how to optimally allocate training resources. Should I train a larger model? We saw that it's better when you train larger models, but we also saw that it's better when you use more data. So which one should I do: train a smaller model on more data, or a larger model on less data? Chinchilla is a very famous paper that first showed this, and I want to give you a bit of a sense of how they did it. Here you see training loss again; on the x-axis you see the number of parameters, so the size of the model. All the points on one of these curves are what we call iso-FLOP: all the models on that curve were trained with the same amount of compute. The way you do that is that you vary the number of tokens you train on and the size of the model, but in such a way that the total compute is constant. So all these curves that you see with different colors were trained with different amounts of compute. Then you take the best model on each of those curves. Once you have the best one for each curve, you can plot how many FLOPs that curve corresponded to and how many parameters that best point actually used. You put that on a log-log scale, and again you fit a scaling law. So now I have something which tells me: if I want to train a model with 10^23 FLOPs, here's exactly the number of parameters I should be using, around 100B. And you can do the same thing with FLOPs and tokens. So now you can predict: if I tell you exactly how much compute I have, say one month of compute, what size of model should I train? You fit a scaling law. And of course, I'm telling you this as if it all looks beautiful; in reality there are a lot of small things, like should you count the embedding parameters, there's a lot of complexity. But if you do things well, these things actually do hold. So the optimal number that the Chinchilla people found is to use 20 tokens for every parameter that you train: if you add one more parameter, you should train your model on 20 more tokens. One caveat here is that this is optimal training resources. So that is telling me: if you have 10^23 FLOPs, or if you have, say, $100 million, I don't know how much that is exactly, probably much less;
actually, let's say I have a million dollars to train my best model that gets the lowest loss: what would I train? In reality, these companies also need to think about inference. If you have a smaller model, you will spend less over time. So if you consider the inference cost, there are other papers that try to show that it's more like 150 tokens per parameter, because you prefer having a smaller model: over time you're going to spend less money on inference of these models. So 150 to one, that's around what the best models, the ones used in practice, in production, are trained at right now. Great. Any question on Chinchilla? Great. Oh, sorry.
speaker 2: Actually, how expensive is inference for these models compared to training?
speaker 1: Actually very expensive. I won't talk about inference, because that would be another entire lecture, but just think about ChatGPT, where they have, I don't know how much it is now, something like 600 million people that have used it. That's a lot. So yeah, it's actually very expensive. There's a lot of optimization you can do for inference, though, and that's an entire other lecture, so I'm going to skip it this time, but it's very interesting. Okay, continuing: as I said, there are many things you can answer with scaling laws; I just tried to give you two examples, but really there are many. What data do you use, what data mixture, what weighting of the data mixture, what architecture you use, whether you should make your models wider or deeper, whether you should be paying for more GPUs or actually collecting more data: all these things are things you can try to answer with scaling laws. One thing I want to mention is the bitter lesson. If you've ever heard of Richard Sutton, it's a very famous blog post from 2019. What he realized, which I think not enough people realized (I definitely did not realize it at the time), is that once you see these types of scaling laws, you know that the more compute you have, the better the models you will get. So with scale you will get better models, and you also know, by Moore's law or variants of Moore's law, that you will always have better compute. Then the only thing that matters is to have architectures that can leverage computation. So what matters is basically systems and data, and less so the architecture, the smaller architecture differences like your activation function and things like this. I think that's one of the reasons why most research focuses on things that matter less for industry, and I was one of those researchers for a large part of my career. So don't spend time overcomplicating; do the simple things, do them well, and scale them. That's really what OpenAI taught us with ChatGPT and with all the GPTs before. Okay, I want to give you some back-of-the-envelope computation. I might be off by a few factors here, but I just want to give you a sense of how costly it is to train some of these models. I'll use as an example Llama 3 400B, which is currently the best open-source model you can get. It was trained on 15.6 trillion tokens and has 405 billion parameters. So, now that you know what the optimal tokens-per-parameter ratio is: that's around 40, which is a bit more than Chinchilla but less than the inference-optimal ratio, so they went for training optimality. FLOPs for this model: one simple way to compute FLOPs is six times the number of parameters times the number of tokens you train on. If you do the simple calculation here, it's 3.8e25 FLOPs. The reason why this is important is that, if you follow the news a little bit, there's an executive order from Biden that basically says that once you reach 1e26 parameters, sorry, FLOPs, your model gets special scrutiny. So they went about 2x below that; they really went right below it to avoid the special scrutiny. So 3.8e25: I might be off by a little bit, but it's definitely under the 1e26. Oh, so P is the number of parameters, N is the data, the number of tokens; this is just an approximation. Okay, compute: we know that they trained on 16,000 H100s, and we know the throughput they said they got. So if you do the computation, it takes around 70 days, around 26 million GPU-hours at least.
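A sketch of that arithmetic, using the 6 * P * N FLOPs rule and the parameter, token and GPU counts from the lecture. The sustained per-GPU throughput (~400 TFLOP/s on an H100) and the $2-per-GPU-hour rental price below are assumptions made to reproduce the ballpark figures, not numbers stated on the slides.

```python
params = 405e9                  # P: parameters
tokens = 15.6e12                # N: training tokens
flops = 6 * params * tokens     # ~3.8e25, just under the 1e26 reporting threshold
print(f"{flops:.2e} FLOPs, {tokens / params:.0f} tokens per parameter")

gpus = 16_000
throughput = 400e12             # assumed sustained FLOP/s per H100
gpu_hours = flops / throughput / 3600
days = gpu_hours / gpus / 24
print(f"~{gpu_hours / 1e6:.0f}M GPU-hours, ~{days:.0f} days on {gpus:,} GPUs")

rental_cost = gpu_hours * 2     # assumed ~$2 per H100-hour
print(f"~${rental_cost / 1e6:.0f}M to rent the compute")
```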
That's with my back-of-the-envelope computation. They actually said they used 30 million rather than 26 million GPU-hours, so maybe they had some challenges, I don't really know, but if you follow the simple computation, it's around 70 days. Cost: it's hard to approximate, but I'm just going to use the rental price. What if I were to rent that many H100s for that many days, how much would I pay? A lower bound on the renting cost of an H100 is around $2 per hour. If you multiply that by 26 million hours, you get $52 million. They probably pay less than that, but not actually much less, because all these services that rent GPUs don't make that much money. So it's probably slightly less, but not that much less. Now, salaries: say 50 employees at $500k per year; yeah, that's probably the right ballpark, $25 million. So if you put it all together, it's around $75 million for training this Llama model, and I'm probably off by like $10 million, but that's the right ballpark. Carbon emitted: a lot of people might ask, because cost is not the only thing that's important. I did the computation: it's around 4,000 tons of CO2 equivalent, which is actually only about 2,000 return tickets from JFK to London. So right now the carbon emitted is, I mean, it's huge, but it's not really meaningful yet. I think maybe with GPT-6 or GPT-7, once you multiply this by 100, it might become a real issue; right now it's still not, I think, an issue in the grand scheme of things. Next models: the way you should think about these models is that with every new generation, the number of FLOPs essentially multiplies by 10x. At least that's what they try, if they have enough energy and if they can buy enough GPUs. Great. Any questions on this back-of-the-envelope math? Okay. So now we've talked about pretraining. I also wanted to chat about systems, because now we know compute is really important, so there's the question of how you optimize your compute. I will leave that for the end, because I'm not sure how much time we will have. I think it's important, but hopefully I'll be able to talk about it later; it's slightly different from what we've been talking about so far. So I'll move on to post-training for now. The task of post-training: the reason why we need to do post-training is, as I told you before, to make AI assistants. Language modeling is not really what you want from an AI assistant. For example, if you ask GPT-3, which is a pure language model, not an aligned one, a question like "explain the moon landing to a six-year-old", the completion you would get is something like "explain the theory of gravity to a six-year-old". Because what it learned is that on the internet, if you have one question, you usually have maybe another bullet point with other similar questions; you don't usually have a question and then an answer. This is not what you want from an AI assistant. So how do we do this alignment, this post-training, to make these models assistants? The goal of this alignment is basically to get LLMs to follow the instructions given by users, and maybe some of the designers' desires. Think about moderation: OpenAI definitely doesn't want the model to say stuff that is very toxic. So here you see on the left-hand side that when you ask a question, it actually provides a real answer.
So it's not like before alignment. And on the right-hand side, you see that if you ask it to write a tweet describing how a certain part of the population is evil, it will say that it cannot do that. So that's this alignment. The background here is that the data you want for training some of these models is clear: we know what we want, which is just asking humans, "this is the question, this is the answer that you want." But the thing is that it's very expensive to collect that data, and it's hard to find it online. In contrast, pre-training data is not what you want, but there's a lot of it. So the main idea is simply: take a large language model pre-trained on all of the Internet, and then just fine-tune it. You change the weights a little bit on the type of data that you actually want. And hopefully, given that you already pre-trained on all of the Internet, the model basically already knows how to speak English and knows standard language syntax, so you can do the fine-tuning with very little data. Okay. SFT, supervised fine-tuning, is exactly what I just said: the idea of fine-tuning the large language model on desired answers collected from humans. Why is it called supervised fine-tuning? Because you basically do language modeling on the real answers. Language modeling is this next-word prediction, and that's the fine-tuning part; and you do it on desired answers given by humans, which is why we call it supervised. So how do we collect this data? As I just said, you ask humans to tell you: this is the question, this is the answer that you would want from these models. Here's an example. I can't read it very well on my computer, but let's read this one: "Can you write a short introduction about the relevance of the term monopsony?" And then it says "monopsony refers to a market structure..." and so on, and a human wrote that. This is actually Open Assistant, which was an effort to collect this kind of data online from humans. This type of supervised fine-tuning, this alignment, is really the key to ChatGPT. This is what made the big jump from GPT-3, which was mostly known by AI researchers, to ChatGPT, which became known by basically everyone. The problem with human data is that it's very slow to collect and very expensive. So one possible simple idea is to use LLMs to scale data collection. That's exactly what we did with Alpaca a year ago. We used a dataset of human question-answer pairs, 175 of them, and we asked the best model at the time, text-davinci-003, to generate many more of these questions and answers. So we said: this is what humans wrote, now write similar questions and similar answers. And we collected 52,000 LLM-generated question-answer pairs. Then we simply took LLaMA 7B, which was the best pre-trained model at the time, and fine-tuned it with supervised fine-tuning, as I told you, and that's how we got the Alpaca 7B model. And this is the type of data we collected, things like: "What does algorithm mean?" "An algorithm is a step-by-step set of instructions used to solve a problem or achieve a goal," and so on. The data is actually pretty good, given that it was generated by LLMs from essentially two generations ago.
So that really started, at least for us, as kind of an academic replication of ChatGPT. Now there's a big field of synthetic data generation: how to use LLMs to make the development of LLMs faster by decreasing the amount of human hours that you need. Quantity of data: we talked about what type of data and how we collect it. One thing which is surprising with SFT is that you don't need that much data. What this paper, called LIMA, showed is that if you scale the amount of data used for supervised fine-tuning from 2,000 to 32,000 examples, it really doesn't help much. So here, scaling laws definitely don't apply. The intuition is that all you learn is how to format your desired answers. Another way of saying it is that your pre-trained models essentially model the distribution of every user on the Internet: one that might write bullet points, another that might answer a question with an actual answer. So all you tell your model is: you should actually be optimizing more for this type of user than another one. You're not actually teaching it anything through this SFT, this supervised fine-tuning. All you do is tell the model to focus on one type of user that it already saw in the pre-training data set. The knowledge is already in the pre-trained LLM, and you basically just specialize it to one type of user. Great. Any questions on this?
speaker 2: I know it's a big issue with synthetic data that if you keep generating data from the same distribution, eventually you're not learning a new distribution; you're essentially playing with the bootstrap. Surely you can't scale that forever, right? You can't keep generating from the same distribution and hope to learn something new. I know it's an active area of research, but do you have any thoughts around how people are thinking about this, and better ways to bootstrap? Or do we give up on this idea and accept that, as the results show, you don't need that many, so just get humans to generate the 2,000?
speaker 1: The 2,000, right? Yeah. So that's a very good question. On the data side, I'm saying it's not that important for SFT, but there's another thing we'll talk about right after where data actually does matter. My intuition, based on not that many empirical results, is that if you use purely LLM-generated text, and you do that for three or four generations of LLMs, I agree with you that you probably won't improve much. But for me, what is important is how you use humans in the loop with LLMs. Not purely LLMs, not purely humans; maybe what you can do is have the model generate some new text and just have humans write a few edits, because editing is so much faster than writing the entire text. And I think that if you have that type of collaboration, then from an information-theoretic point of view you still get additional information, but you're still much faster than if you only used humans. I think that as a field we will probably move towards these kinds of things, which is really just finding the examples that are important and asking humans exactly when you need their input, kind of like active learning. Yes.
speaker 2: Do we train with the same loss function, the same general training algorithm, for the supervised fine-tuning bit as we do for the pre-training? Because for the examples you showed, I think the important thing for good examples is really that they're accurate; there are these more complex things, not just surface signals.
speaker 1: So that's why, yeah, maybe I didn't emphasize it enough: this is just language modeling, fine-tuning with the language modeling loss on the desired answers. So this is literally the same loss. It will be different in two slides, but the first step, SFT, is literally the same loss, where you just say: okay, I actually want to specialize on that type of data. There's even a question of what is pre-training and what is post-training, because in reality it's just different data that you use. The reason we usually call it post-training is that the way we collect that data is very different. Great, great questions. Yes, maybe it's the same question.
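To make the "same loss" point concrete, here is a minimal sketch of one SFT step with a Hugging Face-style causal LM: plain next-token cross-entropy, computed only on the answer tokens. Masking the prompt tokens is a common convention, not something the lecture specifies, and the checkpoint name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("some-pretrained-llm")
model = AutoModelForCausalLM.from_pretrained("some-pretrained-llm")

prompt = "Can you write a short introduction about the term monopsony?\n"
answer = "Monopsony refers to a market structure with a single buyer..."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + answer, return_tensors="pt").input_ids

# Labels: -100 on prompt positions, so the cross-entropy is only taken
# on the desired answer tokens; this is ordinary next-token language modeling.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

out = model(input_ids=full_ids, labels=labels)
out.loss.backward()  # same loss as pre-training, just on curated answers
```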
speaker 2: but why would these 2000 examples have such an overweghted influence? Oh, so that's why we also, that's another reason .
speaker 1: Oh, so that's another reason why we call it post-training: we use a different set of hyperparameters. I told you that at the end of pre-training, you essentially end up with a learning rate of zero, and here you're going to increase your learning rate back up to something like 1e-5 or 1e-6. So the weight that you give to these examples is actually different. Okay. The second step, or second part, of post-training is what we call reinforcement learning from human feedback, or RLHF. Some of you might have heard of it. The idea is that SFT has a problem, namely that you're doing behavioral cloning, which means that you just try to clone what the humans would say, and that has many issues. One of them is that you're bound by human abilities. Humans won't generate the things that they think are actually the best things to generate. If you ask me to write a book, I can definitely enjoy a book, and I can probably say one book is better than another, but I'm definitely not going to be as good at writing the book that I want to read. So you're going to be bound by the human ability to generate things, even though humans might be better at distinguishing between things. That's one issue. Issue number two, which I find pretty interesting: if you've ever heard of the word hallucination, that's LLMs generating false information. People have hypothesized that hallucination can come from supervised fine-tuning, even if you do supervised fine-tuning on data that is correct. The reason is that, as I told you, SFT uses very little data, and it's data from which the model doesn't learn anything new. So what if the human gives an answer containing something the model didn't know? From the model's perspective, the human is basically telling it: generate this thing that seems plausible, even though I actually have no idea whether it's true or not. To give you a very concrete example, going back to the monopsony question: imagine that the human wrote a reference to some book, and that book might exist, it might be a correct reference. But what if the LLM never saw this reference during pre-training? Then it doesn't know that it's a correct reference. So really, what you're telling the model is to make up some plausible-sounding reference, rather than only give real references that it saw during pre-training. So hallucination might be caused by this SFT. That's problem number two. Does that all make sense? Great. Problem number three: price. Generating the ideal answers is very pricey, and that comes back to your question; having humans write the entire answer is pretty expensive. So that's where RLHF comes in. The idea is that instead of cloning the behavior of humans, we're going to maximize human preference. The pipeline is that for every instruction, you ask a model to generate two answers, and you usually use a pretty good model; you don't use a base LLM here, you use a model that was already fine-tuned with SFT, so it gives pretty good answers. Then you ask labelers which of the two answers is better, so they select the preferred one. And then, with different types of algorithms, which we're going to talk about, you fine-tune the model to generate more of the green thing than the red thing.
So, more of the good stuff. The next question is how, and we're going to talk about that now. There are two ways that we're going to cover, the two that are mainly used in the community. The first one is simply the idea of using reinforcement learning. Hopefully you all know what reinforcement learning is by now. When you think about using reinforcement learning, one important question is: what is the reward that we are optimizing? In this case, there are really two options I can think of. The first one: you could just compare the output generated by some baseline with the output generated by your model, ask a human which one is better, and use that as the reward. If you're better than the baseline, that's a plus one; if not, it's a minus one. So it's a binary reward. The problem with a binary reward is that it's very sparse and you don't get much information out of it. Maybe your answer was slightly better, maybe it was way better; you don't really know from this how much better it was. Option two is to train what we call a reward model, which is simply a classifier. You use machine learning to classify how much better one output is than another, from the perspective of the human. This is a little bit meta, but what you basically do is take a reward model R, which is just another large model acting as a classifier, you give it the input and one of the two outputs, and you exponentiate its score; that's the softmax loss that you know about. Then you divide by the exponentiated reward on the first output plus the exponentiated reward on the second output, and you train on that. The reason you do this is that you train the reward model to be able to classify how much better one output is than the other. A slightly less convoluted way of seeing it: your reward model outputs some reward that is used as the logits of your softmax. So if you have high logits in your softmax, it means this output is very likely the better one. That's what we call the Bradley-Terry model. Yes.
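A minimal sketch of that Bradley-Terry reward-model loss, assuming a scalar-output reward model `reward_model(prompt, answer)` (a placeholder, not any particular library's API): the reward of the preferred answer acts as the logit of a two-way softmax.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, prompt, chosen, rejected):
    """Train the reward model so the human-preferred answer gets the higher score."""
    r_chosen = reward_model(prompt, chosen)      # scalar reward (logit) for preferred answer
    r_rejected = reward_model(prompt, rejected)  # scalar reward (logit) for the other answer
    # Two-way softmax over the rewards; maximize the probability of the chosen one.
    # exp(r_c) / (exp(r_c) + exp(r_r)) == sigmoid(r_c - r_r)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```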
speaker 2: Does it take the entire output? Or is it going...
speaker 1: So this takes the entire, yeah, this takes the entire output at once. It takes all the input and all the output, and it gives one number. Yes.
speaker 2: So for this reward model, where does the human come in?
speaker 1: Where does the human come in? Oh, I see. Okay, sorry, maybe I wasn't clear. You train this reward model to fit these green and red preferences from humans. So basically you train a classifier to say whether the humans preferred the red or the green answer. But instead of using the binary reward, which is what the human would tell you, you use the logits of the softmax. And the thing with logits is that they are continuous. So now you know that if your reward model gives high logits, then in some sense the human highly prefers this answer to the other one. Great. As I just said, continuous information is better. That's what people use in practice, or at least used to use; I'll tell you about the other algorithm later. What you do at the end is use the reinforcement learning that you know about. Now that you have a reward, what you sample over is the generation from your large language model, and then you add some regularization term. The reason we add this regularization term is to avoid what we call over-optimization: the reward model might not perfectly model human preferences, so you don't want to maximize it all the way to infinity. And you do this using PPO, which is a common reinforcement learning algorithm. One thing to note, because it will be important later: once the large language model is a policy for your reinforcement learning, it's not maximizing likelihood anymore, which means you're not modeling any distribution anymore. The reason this is important is that models that went through this type of PPO don't give you likelihoods of text that are meaningful, because what you optimized them to do is just generate the most likely thing, not model all the answers that humans might give. Another way of saying it is that there's nothing here that incentivizes the model to give more than a single possible generation; nothing says it's good to have a distribution with some entropy. If you haven't followed, that's not that important, but it's good to know. Great. So PPO is exactly what ChatGPT did originally. Here's their blog post: step one, do supervised fine-tuning, which you now all know about. Step two, train a reward model on human preferences. Step three, do PPO for multiple steps, which is where you see this blue arrow: you train a model with PPO, you collect new data, and you continue. That's exactly what ChatGPT did, and that was the big breakthrough between GPT-3 and ChatGPT. One thing to note is that PPO has many challenges. Reinforcement learning is something that's super nice theoretically; in practice, anyone who has ever worked with reinforcement learning knows it's a mess. There are rollouts outside the loop, clipping, so many complications. So it's messy. This is the idealized PPO objective used in the LLM setting, and it's already much more complicated than the expectation we saw before; in practice, it's even more complicated. We have one implementation of it that we had to do, and I'm not going to go through it, but basically there's so much you have to think about when you implement these PPO algorithms. You have clipping everywhere.
You have a lot of complexities, and things are not well documented. All this to say that there was a new method proposed, also from Stanford, a year ago, called DPO, which is essentially a simplification of PPO in a way. The idea is that instead of using reinforcement learning, you can just maximize the probability of generating the stuff that you like and minimize the probability of the stuff that you don't like. So if you think about the human preferences, the red and the green: maximize green, minimize red. The loss is this one, where what you see is simply a log-probability under the model. This term is the likelihood of the model generating the thing that the human preferred, given the input, and what you try to do is maximize the likelihood of generating the things that you like and minimize the likelihood of the things that you don't like. All the rest of the terms are not too important; it's really not that complicated to understand. At a high level, it's just maximizing the things you like and minimizing the rest. One thing to note is that all the rest is chosen such that the global minima of PPO and the global minima of this DPO loss are, under some assumptions, essentially equivalent. So this is the right thing to do mathematically; I'm not going to go through the derivation, but it's the right thing to do. It's pretty different from PPO in the sense that with PPO, you had to collect the human preferences, then train a reward model with maximum likelihood, then use reinforcement learning. Now all you do is basically maximum likelihood. Much simpler. Yes.
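A minimal sketch of the DPO loss as described: raise the log-probability of the preferred answer and lower that of the dispreferred one, measured relative to a frozen reference model and scaled by a temperature beta. The log-probabilities passed in are assumed to be each answer's summed token log-probabilities under the policy and under the reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: make the policy prefer the chosen answer (relative to the frozen
    reference model) more strongly than it prefers the rejected answer."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * margin) is small when the chosen answer wins by a lot.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```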
speaker 2: It seems like this is a much simpler thing, like what you would intuitively do. So why did they start with the reward model? What led them to doing that?
speaker 1: I think it's a great question. I don't really know. What I can tell you is that at OpenAI, the people who built ChatGPT initially are the ones who actually wrote PPO, so there were a lot of reinforcement learning people for whom it was very intuitive. There are also some additional potential benefits. For example, if you use a reward model, the cool thing with reinforcement learning is that you can use unlabeled data: with DPO you can only use the labeled data, but with PPO you first train your reward model and then you can use unlabeled data, which the reward model will basically label for you. So there could be potential improvements in practice, although in practice it seems they don't really materialize. And I think it's just that a lot of people on that team were reinforcement learning experts, including the main author of PPO, John Schulman. So DPO is much simpler than PPO, and it basically performs as well; now it's the standard thing that people use, at least in the open-source community, and I believe it's actually the standard in industry too. Now, does this actually give gains? Those are all the papers on the left here; this is on a summarization task. All I want to show you is that the pre-trained models were okay, and they improve with scale. If you do supervised fine-tuning, you improve them a little bit more. If you do PPO, or something with RLHF, with human feedback, you get performance that is oftentimes, depending on the benchmark, even better than the human reference summaries. Same thing on a benchmark that we have, AlpacaFarm; the evaluation here is not too important, but basically you see the pre-trained model, you jump to SFT, then you jump to PPO and DPO, and PPO and DPO have essentially the same performance. So basically, RLHF helps; that's the conclusion. And DPO is simpler. Now, data: how do you collect this type of preference data? The first idea is to just use humans, as we already talked about. The guidelines for what humans should be labeling are very complicated, and it's really not that easy. If you ever do some of this labeling yourself, you will see that it's extremely complicated. If I zoom in here, I have a question: "Tell me about self-driving cars." And you read both: "Self-driving cars are vehicles that are capable of detecting their surroundings..." and "Self-driving cars are cars that are equipped with sensors... to navigate without the need for a driver." Both seem okay. Which one is better? It's actually hard to say at a glance. As a result, the problem with humans is that they start optimizing for a lot of high-level features. For example, the second one is longer; I can guarantee you that most humans will choose the second one, even though maybe the first one is better, I don't know, I haven't read them carefully. So, challenges with humans: first, slow and expensive. Second, as I just mentioned, it's hard for them to focus on things that matter, like correctness, and people usually look at things that don't matter as much, like the form, like length. As a result, what I show here is that the more RLHF you do, the longer the outputs of the models become. So if you've ever been annoyed at ChatGPT answering you with super long responses, this is because of RLHF. Another issue is annotator distribution shift.
The distribution of annotators that you use matters a lot, and you have to think about who are even the humans that we want these models to represent. Then there's the question of crowdsourcing ethics: a lot of the labeling is done by people who are not paid well, and they have to go through a lot of toxic data, because you want the model to avoid saying toxic things. So, crowdsourcing ethics too. So many challenges with human data. What we also did last year is, again, the same thing as Alpaca, just the idea of: well, there are challenges with humans, so maybe we can just replace them with LLMs. So what we did is simply replace, oh, I'm just realizing that these slides are not centered, anyway, you replace human preferences with LLM preferences. On this figure, you see on the x-axis the price that we paid for collecting human data: it's around $300 per 1,000 examples, and this is on Mechanical Turk, which is usually cheaper than some of the other companies you could go through. On the y-axis is the agreement with other humans, with the mode of other humans. And what you see is that, as I told you before, labeling is really complicated: humans agree with the mode of other humans only around 66% of the time on a binary task. And it's not that the humans are not good here: the five main authors of this paper tried to label this data ourselves, and we only got something like 67 or 68% accuracy, even though we talked for three hours about how we should be doing the labeling. It's really complicated; it's not an easy task. Here I show many different models, and basically you see that models are much cheaper, and they can actually get higher agreement with the mode of humans than humans themselves. The reason is that humans have a lot of variance and models have essentially no variance; they might be a little more biased, but they have less variance. So it works surprisingly well, and now it's kind of the standard in the open-source community. I think even in industry, a lot of people use both humans and LLMs to improve the collection of RLHF data. This is the paper from last year, but honestly, by now LLMs would be around this level of agreement and this cost: around 50x cheaper than humans, with better agreement with humans than humans themselves. Okay. So that gets us to the evaluation of post-training, which goes back to your question at the beginning of the lecture: how do we evaluate something like ChatGPT? The answers that ChatGPT could give are basically unbounded, and it's not that there's one right answer; there are many answers that are just as good. So there are many challenges. One, you can't use validation loss, because one method might use DPO and another might use PPO; the validation losses are not comparable. Second, you can't use perplexity; that's the thing I told you before: these models are not calibrated, they don't give meaningful distributions, they just optimize for one thing. So you can't use perplexity for evaluating these models once they're aligned. Third, there's a large diversity of questions that humans might ask these models: generation, open question answering, summarization, all of these things, so there's a lot you have to cover. And the tasks are really open-ended, so it's very hard to automate.
So that's what you were alluding to before. The idea is that instead of trying to come up with easily automated benchmarks, we take questions that users actually ask these models in practice, and we just ask annotators: between these two models, which one gives the better output? So basically you do the exact same thing as for the RLHF data, but you use it now for evaluation.
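A minimal sketch of that pairwise evaluation loop: collect one judgment per prompt (from a human annotator or, as discussed a bit further down, an LLM judge) and report the win rate of the model against a baseline. The `judge` callable is a placeholder for whatever annotator you use:

```python
def win_rate(prompts, model_answer, baseline_answer, judge):
    """Fraction of prompts on which the judge prefers the model over the baseline.

    model_answer, baseline_answer: callables mapping a prompt to an answer string.
    judge: callable (prompt, answer_a, answer_b) -> "a" or "b".
    """
    wins = 0
    for prompt in prompts:
        a = model_answer(prompt)
        b = baseline_answer(prompt)
        if judge(prompt, a, b) == "a":
            wins += 1
    return wins / len(prompts)
```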
speaker 2: What do you mean by "can't use perplexity, not calibrated"? The LLM is still doing next-token prediction.
speaker 1: So think about it: the optimal solution after doing PPO is basically a model that gives you essentially a delta, i.e. it says there's only one sentence that could be generated for that question. So now, if you evaluate it on something that is slightly semantically different, it would give a likelihood of essentially zero for that answer. In reality it's not that extreme, because, as you say, it's still a distribution, but it shows you that there's a fundamental issue with perplexity: once these models are aligned, at least with PPO, they were not trained to do maximum likelihood anymore, they were trained to be policies. Okay. Probably the most common benchmark, the most trusted one, is what we call Chatbot Arena, which is basically: go on the Internet, have random users blindly talk with two chatbots, ask many questions, see the two answers, and rate which one is better. You do that over hundreds of thousands of users, you get the actual preferences, and you get rankings of models. You can go on Chatbot Arena right now and actually interact with these models. One potential issue to highlight is that the people who want to do this type of thing are usually more tech-savvy, so a lot of the questions asked are tech stuff: discussing software errors, inquiries about AI tools, things like that. Another issue is cost and speed: if you really want to use something like this in your development process, it will be too costly, because you'd need to pay a lot of humans. So one simple idea is, again, as we've said many times, just use LLMs instead of humans. You probably know the drill at this point. The steps: for every instruction, generate outputs from some baseline and from the model you want to evaluate. Imagine that I'm comparing an answer from ChatGPT and an answer from my model; I just ask another model, say GPT-4, which one is better, and I average that out over my entire benchmark or dataset, and that gives me a win rate: the probability that one model wins compared to the other. And now you can rank models. This is the AlpacaEval leaderboard. The benefit of this is that we show we get 98% correlation with Chatbot Arena, so very high correlation with humans; this is the comparison of correlations with other benchmarks. And it takes less than three minutes and less than $10 to run, so it's pretty cheap. There are downsides, though. One of them is spurious correlation; there are many, but I'll just talk about one: LLMs prefer longer outputs. Actually, humans also prefer longer outputs, but the issue once you use LLMs is that once there is a bias, you will keep optimizing against it. At some point, if I ask a human a simple question and you give them five pages of answer, they will say: no, I don't like that answer. But LLMs, if they have this bias and you train against it, will just keep preferring longer outputs. So here we see the preferences, showing that both humans and models prefer longer outputs. And here's another view of the initial AlpacaEval benchmark.
When we look at the win rate of GPT-4 versus GPT-4 itself: if we use the standard GPT-4, it gets 50%, kind of by definition, since we're comparing GPT-4 against GPT-4. But if we ask GPT-4 to be slightly more verbose, we just say in the prompt "be verbose in your answers", then it gets a win rate of 64.4%. And if we ask it to be concise, it gets 20%. So there's a huge bias depending on whether you ask it to be concise or verbose, which is very annoying. One possible solution, which is what we did, is to use some regression analysis. I'm not going to go into details, but basically we use causal inference tools to control for length, and with that, length matters much less: if you ask the model to be verbose, you still get some gains, but much smaller. Great. So that's all about post-training. Now, for the next eight minutes, I might talk about systems, or just answer questions. Yes.
speaker 2: Going back to post-training: how did we tune those parameters, using that small body of fine-tuning data, and have such a big effect on the model? You mentioned earlier that it's a different set of hyperparameters. Are we changing just some of the weights, the later weights, or all the weights? What's actually happening?
speaker 1: Yeah, I kind of skimmed through all of this. You change all the weights. Industry would change all the weights; in open-source land, you might have heard of LoRA, which changes basically only some of the weights, or, to be more specific, adds some differences to the output of every layer. But in industry, you just fine-tune all the weights. And to say something else about the data: for this last step, RLHF, you usually collect a lot more data than for SFT. If SFT is something like 5,000, 10,000, maybe 50,000 examples, with RLHF I think you're more around the one million order of magnitude. It's still much less than pre-training, though.
speaker 2: Yeah, because pre-training is 15 trillion tokens. I mean, this isn't even a drop. Yet you influence the weights a lot, because of how you do it.
speaker 1: I mean, you have to think about how you do it. As I said, the learning rate that you use is going to be different, but also you only train on that data. Just imagine that I train on one sentence, but over and over again: at some point, my model will only generate that sentence, even if it was just one sentence instead of the 15 trillion tokens. So if you use a large enough learning rate for long enough, you will basically overfit that sentence. The key thing to remember is that it's not as if you mix some post-training data with some pre-training data: you do pre-training, and then you start fine-tuning only on the post-training data. Another perspective is that the pre-training is just the initialization of your model. Once you view it that way, that this is just an initialization of the weights, there's nothing special: you don't need to remember that you trained on a lot of data before. The only thing that matters is that you had an initialization, and now you're actually training a model. When you think about it like that, there's a Markov property: in some sense, you had your weights, that's my initialization, and now I'm training on the new data. Does that kind of answer your question?
speaker 2: You said something just now about it being almost the equivalent of rerunning the fine-tuning data many times. Is that what actually happens, in order to give it so much more weight?
speaker 1: I actually don't know how they do it in industry right now. When we did Alpaca, we did three epochs, so we did run through the data three times. But even the number of times you run through it is not really the important thing; what matters is the effective learning rate. Great. So I think I have five minutes, right? Okay. I might try to give a high-level overview of at least one of the systems tricks. As we said, compute is a huge bottleneck for everyone. One question you might ask is: why not just buy more GPUs? GPUs are expensive, but they're also scarce. Even if you have $10 million right now, you cannot buy the best GPUs. There are also physical limitations: when you have multiple GPUs, you have to communicate between them, and that takes time. So just buying more GPUs is not that easy, which means it's really important to think about how you allocate resources and how you optimize your pipeline. So, systems 101 on GPUs, and sorry, I'm going slightly faster here, I hope at least some of you can follow. GPUs are basically optimized for throughput; CPUs are optimized for latency. With GPUs, the way you have to think about it is that one command is run on many, many cores at the same time on different pieces of data. So when you look at a GPU, you see many different cores, which we call streaming multiprocessors, and that's very different from the usual CPU architecture. Just think: high throughput, parallelization. GPUs are optimized for fast matrix multiplication. Every time you do something on a GPU, if you can do it with a matrix multiplication, it's going to be something like ten times faster than anything else. That's a little annoying, because it means we're kind of bottlenecked into doing everything with matrix multiplications. Another thing to note about GPUs is that compute has been improving faster than memory and communication. So right now it's hard to keep GPUs fed: the data that you send to your GPUs can't keep up with the processing, so most of your GPU will actually be idle if you just run normal, unoptimized code. Communication is the bottleneck, and this will continue over time. Another thing to know about GPUs is that there's a memory hierarchy; it's the same with CPUs, but basically the closer the memory is to your cores, the less of it there is and the faster it is; further away, there's more memory but it's slower. Okay, I was going to skip this, but actually I'll say it: given this communication bottleneck, the metric people usually look at is model FLOP utilization, MFU: the observed throughput divided by the theoretical maximum the GPUs could run at, the number of FLOPs you could use per second. In general, if you reach 50%, you're very happy. Facebook's Llama, I looked, was at something like 45%. So data doesn't come in fast enough even for these big companies. So one simple trick, and it might be the only one I tell you about, is low precision. The simple idea is: if I put my floats in lower precision, then there are fewer bits I have to send to my GPUs. Fewer bits means faster communication and lower memory consumption, so things go faster.
And for deep learning, it just happens that the low-order decimals are not that important. When you do matrix multiplications, when you do something like SGD, there's already so much noise that if you update something by 0.01 or 0.015, who cares? So basically, instead of using 32 bits per float, which is what people used to use, or 64 bits, which is what you might use in other domains, you use 16 bits for matrix multiplications. And for training, you have what we call automatic mixed precision, where some things are in 32 bits and others are in 16 bits. Generally, the way to think about it is that the weights of your model are stored in 32 bits, but just before the computation you cast everything to 16 bits, you do the computation super fast, and at the end you update your weights in 32 bits. The reason you do the updates in 32 bits is that if your learning rate, for example, is very small, you still want to be able to make a difference in your weights. So all the computation is done in 16 bits, but the weights are stored in 32 bits. That's the way people do it. Okay, I'll talk about just one more thing and then skip the rest: operator fusion, because I think it's actually pretty cool. As I just said, communication is very slow, and every PyTorch line basically moves variables to and from the global memory of your GPU. So when you have something like x1 = cos(x), and then x2 = cos(x1), what happens behind the scenes is that you take x, which is data, you ship it to the actual processors of your GPU, you apply the cosine, and you ship it back to the main memory of your GPU; then for the next cosine, you ship it back to the GPU processors, apply another cosine, and ship it back again. Another way to see it is that you go from your DRAM, which is the global memory of your GPU, to compute and back, for every line. This is the naive way of doing it, and it seems very wasteful. So the simple idea of operator fusion is: communicate once, do all the computation, and ship the result back once. That's exactly what fused kernels are. If you ever want to make your computations in PyTorch much faster, just apply torch.compile to your model; it's going to make your model around two times faster. What it does is simply rewrite your code, your PyTorch code, basically into C++ and CUDA, to do the communication only once, then do all the operations, then ship things back. Okay, I'm not going to have time to talk about tiling; tiling is important. Parallelization is important. Mixture of experts is important. Outlook: there are many things we haven't talked about. We haven't talked about architectures, and we definitely haven't talked about inference. There are many other things that are important with LLMs: what UI you use (arguably, with ChatGPT the big novelty was just having a simple UI), multimodality, all the misuses you could have, the fact that there might not be enough data on the Internet to train all these models, the legality of data collection, so many other things. If you are interested in all these topics, I would suggest three classes.
CS224N is probably the one that touches the least on LLMs, but it gives some background and historical context, kind of adjacent material. CS324, I think it's just called Large Language Models, has more in-depth readings and lectures on everything I talked about. CS336 is Large Language Models from Scratch, where you actually build your own LLM; it's an amazing class, also given by my two supervisors, with a very heavy workload, so be careful. Great.

Latest Summary (Detailed Summary)

Generated on 2025-06-06 20:31

Executive Summary

This lecture, given by Stanford PhD student Yann Dubois, provides an end-to-end overview of how ChatGPT-style large language models (LLMs) are built, split into two core stages: pre-training and post-training. The speaker stresses that while academia has traditionally focused on model architectures and algorithms, in industry it is data, evaluation, and systems optimization that are the three pillars determining whether a model succeeds.

预训练阶段,模型通过自回归(Autoregressive)任务,在海量互联网文本上学习预测下一个词元(token),其核心是最大化文本的对数似然。此阶段的关键技术包括高效处理文本的分词(Tokenization)、复杂的数据清洗与处理流程(从TB级原始数据中筛选高质量语料),以及用于评估的困惑度(Perplexity)等指标。一个核心理论是规模法则(Scaling Laws),它揭示了模型性能会随着计算资源、数据量和模型参数的增加而可预测地提升,这为资源分配和性能预测提供了理论依据。

后训练(对齐)阶段,目标是将预训练好的模型转变为能遵循指令、有用的AI助手。该阶段始于监督式微调(SFT),使用少量高质量的“指令-回答”数据对,教会模型输出的格式。随后,通过基于人类反馈的强化学习(RLHF)或其更简单、更稳定的替代方案直接偏好优化(DPO),根据人类(或更强大的LLM)对不同回答的偏好来进一步优化模型,使其行为与人类价值观对齐。由于对齐后模型的输出是开放式的,评估极具挑战,通常依赖于如Chatbot Arena等平台进行人类或LLM的“头对头”比较。讲座最后简要提及了系统优化(如混合精度训练、算子融合)对于提升训练效率的重要性。


Introduction: Key Components of Building an LLM

The speaker, Yann Dubois, notes that building a large language model (LLM) involves five key components:

  1. Architecture: today's LLMs are almost all based on the Transformer. The speaker chooses not to go deep into this part, since plenty of public material already covers it.
  2. Training loss and algorithm: how the model actually learns.
  3. Data: the corpus the model is trained on.
  4. Evaluation: how to measure model performance and progress.
  5. Systems: how to run such large models efficiently on modern hardware.

A core point is that academia tends to over-focus on the first two, whereas success in industry practice depends much more on the last three.

"in reality, honestly, what matters in practice is mostly the three other topics. So data evaluation and systems, which is what most of industry actually focuses on."


Stage 1: Pre-training

Pre-training is the classic language modeling paradigm: the goal is for the model to learn the statistics of language from essentially the entire Internet.

Core Task: Autoregressive Language Modeling

  • Definition: a language model is a probability distribution over token sequences, P(x₁...xₙ). An autoregressive model uses the chain rule of probability to factor this into a product of conditional probabilities P(xᵢ | x₁...xᵢ₋₁), i.e. predicting the next token given all preceding tokens.
  • Training procedure:
    1. Tokenize the input text and embed it as vectors.
    2. Process it with a neural network such as a Transformer to obtain contextual representations.
    3. Map to a vector of vocabulary size with a linear layer.
    4. Apply a softmax to obtain a probability distribution over the next token.
  • Loss function: cross-entropy loss, which is equivalent to maximizing the log-likelihood of the training text.
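The two bullets above written out in symbols (θ denotes the model parameters; the loss is summed over the positions of the training text):

```latex
p_\theta(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p_\theta\big(x_i \mid x_1, \dots, x_{i-1}\big),
\qquad
\mathcal{L}(\theta) \;=\; -\sum_{i=1}^{n} \log p_\theta\big(x_i \mid x_1, \dots, x_{i-1}\big)
```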

Key Technique: Tokenization

The tokenizer converts raw text into the sequence of tokens the model can process; its importance is often underestimated.

  • Why it is needed:
    • Generality: more general than operating directly on words; it can handle typos, out-of-vocabulary words, and languages such as Thai that do not use spaces.
    • Efficiency: tokenizing by individual characters makes sequences far too long, and the Transformer's compute scales quadratically with sequence length. Tokenizers strike a balance between vocabulary size and sequence length.
  • Example algorithm: Byte Pair Encoding (BPE) (see the sketch after this list):
    1. Start by treating every individual character in the corpus as a token.
    2. Iteratively find the most frequent adjacent token pair and merge it into a new token.
    3. Repeat until the desired vocabulary size is reached.
  • Challenges and outlook:
    • Tokenizers struggle with mathematical expressions and code; for example, the number "327" may be a single token, making it hard for the model to reason about its numeric value.
    • The speaker believes tokenizers may eventually be phased out in favor of lower-level byte processing, as new architectures that handle long sequences efficiently emerge.
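A toy sketch of the BPE training loop described above. It is deliberately simplified: it operates on whitespace-split words and ignores byte-level details, so it is illustrative rather than a production tokenizer:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest newer newest", num_merges=5))
```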

Evaluation for Pre-training

  • Perplexity:
    • Definition: intuitively, the number of tokens the model is "hesitating between" when predicting the next token. Computationally it is the exponentiated average loss (see the formula after this list); lower is better.
    • Status: because it depends on the tokenizer and the evaluation data, it is no longer a headline academic benchmark, but it remains an extremely important monitoring metric during model development.
  • Academic benchmarks:
    • Aggregated benchmarks: evaluate on a large set of classic NLP tasks and aggregate the scores, e.g. Stanford's HELM and Hugging Face's Open LLM Leaderboard.
    • MMLU (Massive Multitask Language Understanding): one of the most widely used academic benchmarks, consisting of expert-level multiple-choice questions across many domains.
  • Evaluation challenges:
    • Inconsistency: different organizations may evaluate the same benchmark (e.g. MMLU) differently (prompt format, scoring), leading to large discrepancies in results.
    • Test set contamination: test data may inadvertently appear in the model's training set, inflating results; this is a serious threat to the fairness of academic benchmarks.
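The perplexity definition from the bullets above written out (x₁...xₙ is the evaluation text):

```latex
\mathrm{PPL}(x_{1:n}) \;=\; \exp\!\Big(-\tfrac{1}{n}\sum_{i=1}^{n}\log p_\theta\big(x_i \mid x_{1:i-1}\big)\Big)
```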

The Core Ingredient of Pre-training: Data

Data is the foundation of LLM training, and the processing pipeline is far more involved than simply "training on the Internet".

  • Data sources: typically start from web crawls such as Common Crawl, whose volume reaches the petabyte scale (roughly 250 billion pages).
  • Data processing pipeline:
    1. Text extraction: extract plain text from HTML.
    2. Content filtering: remove NSFW and harmful content and personally identifiable information (PII).
    3. Deduplication: remove duplicates at the document, paragraph, and other levels.
    4. Heuristic filtering: remove low-quality documents based on rules (token distributions, text length, etc.).
    5. Model-based filtering: train a classifier to identify and keep high-quality documents (for example, using pages cited by Wikipedia as positive examples).
    6. Domain classification and reweighting: split data into domains such as code, books, and entertainment, and adjust the sampling weight of each domain (e.g. upweight code and books, which help the model's reasoning ability).
    7. Continued pre-training: at the end of training, continue training on very high-quality data (e.g. Wikipedia) with a lower learning rate to "consolidate" the model's abilities.
  • Data scale and secrecy:
    • Scale: frontier models (such as Llama 3) are trained on about 15 trillion tokens.
    • Secrecy: because of competitive advantage and copyright risk, companies usually keep their data processing methods and data composition strictly confidential.

Core Theory: Scaling Laws

Scaling laws are a striking empirical finding that has deeply shaped how LLMs are developed.

  • Core content: model performance (measured by test loss) follows power laws in compute, model parameters, and data. On a log-log plot, the improvement curve is approximately a straight line. In other words:
    > Bigger models and more data yield better performance, and the improvement is predictable.
  • Practical implications:
    • Performance prediction: given the available compute, you can predict the performance the trained model will reach (see the sketch after this list).
    • Resource optimization: helps decide the optimal trade-off between model size and data volume. For example, the Chinchilla paper found that roughly 20 tokens per parameter is a good ratio for training-compute optimality.
    • A change in development workflow: rather than tuning directly on the large model, developers run experiments at several smaller scales, fit the scaling-law curve, extrapolate the best configuration, and only then spend the full budget training the final large model.
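A minimal sketch of that workflow: fit a power law, loss ≈ a · C^(−α), to a handful of small-scale runs and extrapolate to a larger compute budget. The run numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical (made-up) small-scale runs: training compute (FLOPs) and final test loss.
compute = np.array([1e19, 1e20, 1e21, 1e22])
loss = np.array([3.20, 2.85, 2.55, 2.30])

# Fit log(loss) = log(a) - alpha * log(C): a straight line on a log-log plot.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a budget 100x larger than the biggest experiment.
target_compute = 1e24
predicted_loss = a * target_compute ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at 1e24 FLOPs: {predicted_loss:.2f}")
```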

Cost Estimation for Pre-training

A rough estimate using the Llama 3 400B model as an example:

  • Model parameters: 405 billion
  • Training data: 15 trillion tokens
  • Compute: about 3.8 × 10²⁵ FLOPs
  • GPU time: roughly 26-30 million GPU-hours
  • Total cost: around $75 million (roughly $52 million in GPU rental plus about $25 million in salaries)
  • Carbon emissions: about 4,400 tons of CO2 equivalent, comparable to about 2,000 round-trip flights from New York (JFK) to London

Stage 2: Post-training (Alignment)

The goal of post-training is to turn a pre-trained model that only "continues text" into an AI assistant that understands and follows user instructions.

Method 1: Supervised Fine-Tuning (SFT)

  • Method: fine-tune the pre-trained model on a relatively small, high-quality dataset of instruction-answer pairs.
  • Data sources: the data can be written by humans or generated by a stronger LLM (as in the Alpaca project).
  • Core insight: SFT does not need much data (a few thousand examples suffice). Its main role is not to teach new knowledge but to teach the model to answer in the format of an assistant; the knowledge itself was already acquired during pre-training.

Method 2: Reinforcement Learning from Human Feedback (RLHF) and Its Alternatives

SFT has limitations: it is bounded by humans' ability to write high-quality answers, it is expensive, and it may induce model "hallucinations". RLHF aims to address these issues.

  • RLHF (Reinforcement Learning from Human Feedback)

    1. Collect preference data: for the same instruction, have the model generate two or more answers, and have human annotators pick the better one.
    2. Train a reward model: train a separate model (the reward model) that predicts which answer humans would rate higher.
    3. Optimize with reinforcement learning: treat the LLM as the policy and the reward model as the environment, and optimize with an RL algorithm such as PPO so the LLM tends to generate answers that earn higher reward.
    4. Challenges: the RLHF pipeline is complex; in particular, implementing and debugging PPO is difficult and unstable.
  • DPO (Direct Preference Optimization)

    • Method: as a simplified alternative to RLHF, DPO optimizes directly on the preference data with a carefully designed loss whose objective is to raise the probability of the "better" answer while lowering the probability of the "worse" one.
    • Advantages: DPO removes the separate reward model and the complex RL algorithm, is simple to implement, performs on par with RLHF, and has become the mainstream choice in the open-source community and in industry.

Evaluation Challenges and Methods for Post-training

Evaluating aligned models is particularly hard because the answers are open-ended and there is no single correct answer.

  • Evaluation difficulties:
    • Perplexity breaks down: a model optimized with RLHF/DPO is no longer a pure probability model, so its output distributions are not comparable and perplexity no longer applies.
    • Hard to automate: it is difficult to design automatic metrics that measure the quality of open-ended answers.
  • Mainstream approach: head-to-head comparison
    • Human evaluation (Chatbot Arena): regarded as the gold standard. Large numbers of real users chat blindly with two models and vote for the better one; models are then ranked with an Elo-style rating system.
    • LLM-as-a-Judge: use a very strong LLM (e.g. GPT-4) to automatically judge which of two answers is better. This is cheap, fast, and highly correlated with human judgments (e.g. AlpacaEval).
    • Judge bias: LLM judges have their own biases, such as a strong preference for longer, more detailed answers, which must be corrected with statistical controls such as regressing on length.

Systems Optimization

Due to time constraints, the speaker only briefly covered a few system-level tricks for improving training efficiency.

  • GPU compute bottleneck: modern GPU compute has grown faster than memory and communication bandwidth, so GPUs spend much of their time waiting for data; model FLOP utilization (MFU) is therefore low (reaching 50% already counts as excellent).
  • Optimization tricks (see the sketch after this list):
    1. Low/mixed-precision training: run the main matrix multiplications in 16-bit floats (FP16/BF16) to cut memory use and communication, while keeping a 32-bit (FP32) copy of the weights for precise, stable updates.
    2. Operator fusion: use tools such as torch.compile to fuse consecutive operations into a single kernel, drastically reducing data movement between GPU memory and compute units.
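A minimal PyTorch sketch combining both tricks. The model and data are placeholders; torch.autocast and torch.compile are the standard PyTorch entry points for mixed precision and kernel fusion, but the exact setup (e.g. BF16 vs FP16 with gradient scaling) depends on your hardware:

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to show the training-loop structure.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = torch.compile(model)                 # fuse kernels, fewer memory round-trips
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
y = torch.randn(32, 1024, device="cuda")

for step in range(10):
    opt.zero_grad(set_to_none=True)
    # Matrix multiplications run in bf16; the master weights stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()                               # fp32 weight update
```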

Conclusion and Outlook

The lecture summarized the key pipeline and core ideas behind building LLMs, but many important topics were not covered in depth, including:

  • The evolution of model architectures
  • Inference optimization
  • Multimodality
  • Misuse and safety
  • Legal and ethical questions around data collection

The speaker recommends the related Stanford courses (CS224N, CS324, CS336) for listeners who want to go deeper.