Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 10 - Post-training by Archit Sharma
Stanford PhD student Archit Sharma introduces the post-training process for large language models, explaining how we get from a base pre-trained model to a model as capable as ChatGPT.
He first emphasizes the importance of scaling laws: as model size (with pre-training compute rising from 10^24 FLOPs to well over 10^26) and training data (from 1.4 trillion tokens in 2022 to roughly 15 trillion tokens for Llama 3 in 2024) keep growing, model capabilities keep improving, but at enormous cost.
Pre-training not only teaches models factual knowledge, syntax, semantics, and many languages; more importantly, models begin to show a deeper ability to model human beliefs, behaviors, and intentions. For example, a model can predict how people with different background knowledge will react in a given situation, and it shows applied potential in mathematics, programming (e.g., Copilot-assisted coding), and medicine (preliminary diagnosis, though not recommended as medical advice). Although the pre-training task is essentially next-token prediction, these models are evolving into general-purpose multi-task assistants.
The core of the lecture covers, in order:
1. Zero-shot and few-shot in-context learning.
2. Instruction fine-tuning.
3. Optimizing for human preferences (e.g., DPO and RLHF).
4. Limitations of current techniques and future directions.
Taking the GPT series as an example: by scaling up model size and data from GPT-1 to GPT-2, GPT-2 showed zero-shot learning ability, meaning that without any task-specific training, careful prompting alone lets it perform tasks such as summarization and question answering.
Media details
- Upload date
- 2025-05-15 22:42
- Source
- https://www.youtube.com/watch?v=35X6zlhoCy4
Transcript
speaker 1: Good evening, people. How are you guys doing? All right, my name is Archit Sharma. I'm a PhD student at Stanford, and I'm very, very excited to talk about post-training, generally speaking, for large language models. I hope you guys are ready to learn some stuff, because the last few years in machine learning have been very, very exciting with the advent of large language models, ChatGPT and everything to that extent. Hopefully after today's lecture you'll be more comfortable understanding how we go from pre-trained models to models like ChatGPT, and we'll take a whole journey through prompting, instruction fine-tuning, and DPO and RLHF. So let's get started. Something that has been very fundamental to our entire field is this idea of scaling laws. Models are becoming larger and larger, and they're expending more and more compute. This is a graph of models starting all the way back in the 1950s; it's an outdated graph, so it only shows up to 10^24 FLOPs, or floating point operations, going into pre-training these models, but the number is well above 10^26 now. You can see the way the graph is trending. More and more compute requires more and more data, because you need to train on something meaningful, and this is roughly the trend in the number of language tokens going into language model pre-training. Again, this plot is outdated. In 2022 we were at about 1.4 trillion tokens, or words, roughly speaking, in language model pre-training. Does anyone want to guess where we are in 2024? That's a pretty good guess. Yeah, we're close to 15 trillion tokens; the recent Llama 3 models were trained on roughly 15 trillion tokens. So just for a second, appreciate that these are a lot of words. I don't think any of us hears trillions of tokens in our lifetime. So this is where we are right now, and I hope you guys were here for the pre-training lectures. Cool. So what do we do? Broadly speaking, we are really just learning to predict text tokens, or language tokens. But what do we learn in the process of pre-training? Why are people spending so much money and so much compute? This compute and these tokens cost dollars, and we're on the order of spending hundreds of millions of dollars on these runs. So why are we doing this? This is basically a recap of what you've probably learned so far: we're learning things like knowledge, "Stanford University is located in Santa Clara, California." You're learning syntax, you're learning the semantics of sentences. These are things you would expect to learn when you're training on language data broadly. You're probably learning a lot about different languages as well, depending on your text data distribution. But the models we interact with are very intelligent, so where is that coming from? We're just learning very factual things with a very simple loss function, so where is the intelligence coming from? This is perhaps the interesting bit: recently, people have started accumulating evidence for that.
When you optimize the next-token prediction loss, you're not just learning syntax, you're not just learning knowledge; you're starting to form models of agents' beliefs and actions as well. How do we know this? A lot of this is speculative evidence, but it helps to form an understanding that the losses we're optimizing are not just about fitting the data; you start learning something maybe more meaningful as well. For example, in this specific case, when we change the last sentence, the next text that is predicted changes as well. Here it starts with: "Pat watches a demonstration of a bowling ball and a leaf being dropped at the same time. Pat, who is a physicist, predicts that the bowling ball and the leaf land at the same rate." We all know the way gravity works. But when you change the last sentence to "Pat, who has never seen this demonstration before," the model predicts that the bowling ball will fall to the ground first. Maybe somebody who's never seen this experiment before might intuitively believe that, correct? The language model was able to predict this, and to predict this you have to have some notion of how humans work. That's maybe something that is not obvious when you're simply optimizing to predict text. We're going to run through some examples to communicate that when you're pre-training these models, you're learning much more than just language tokens. You're also learning about math: you're able to understand what the graph of a circle means, what the center is, and how to understand equations. Probably my favorite example, something I use pretty much every day, is that you're learning how to write code. I don't know how many of you have interacted with Copilot before, but if you have, you probably know that if you write down a few comments and a function template, it will automatically complete the code for you. It's not perfect, but it has to have some deeper understanding of what your intent is for something like that to emerge. Similarly, we have examples from medicine as well. I don't know about you guys, but whenever I have some issue, I'll probably go to ChatGPT or Claude or something to that effect and ask for a diagnosis. I don't recommend that; please don't take medical advice from me. But broadly, the way we're seeing language models at this point is that they're emerging as general-purpose multitask assistants. And it's very strange, right? We start off with text token prediction and we're reaching the stage where we can rely on them to do many, many different things. So how are we getting there? Today's lecture is largely going to be about how we go from "Stanford University is located...", this very simple pre-training task (well, it's more complicated, but in abstract terms it's not very complicated), to something as powerful as ChatGPT. Cool. I recommend you guys stop me and ask a lot of questions, because there are a lot of fun examples and a lot of fun techniques, and I want you to learn everything here.
So the overall plan is: we're going to talk about zero-shot and few-shot in-context learning, then follow up with instruction fine-tuning, and then talk about optimizing for preferences. That's roughly where things are right now in the industry. Then we're going to talk about what's next, what the limitations are, and how we move on from here. Cool. So we'll start with zero-shot and few-shot in-context learning. Broadly, we're going to take the example of GPT, the generative pre-trained transformer. This is a whole series of models that started around 2018, and up to 2020 they built GPT, GPT-2, and GPT-3. So we start with this example. It's a decoder-only model trained on roughly 4.6 GB of text, it has twelve transformer layers, and it's trained with the next-token prediction loss. The first model obviously was not extremely good, but it started showing that, hey, this technique of pre-training can be very effective for general-purpose tasks, and we're going to see some examples; for instance, here it's able to do the entailment task. GPT-1 itself was not very strong as a model, so they took the same recipe and tried to increase the model size: they went from 117 million parameters to about 1.5 billion parameters, and they scaled up the data alongside, from about 4 GB of data to approximately 40 GB. Pre-training is a whole melting pot of techniques and there's a lot that goes into it, but roughly, for example, here they filtered data by the number of upvotes on the Reddit data. So this is roughly where we are. I think one of the things that started emerging with GPT-2 is zero-shot learning. What do we mean by zero-shot learning? Conventionally in the field, when we pre-trained models, the idea was that you take a few examples, you update the model, and then you're able to adapt to a specific task. But as you pre-train on more and more data and more and more tasks, you start seeing this phenomenon where the models can do the task basically zero-shot: they're shown no examples of how to do the task. You can start thinking: oh, you can do summarization, you can follow some instructions, you can maybe do a little bit of math as well. This is where the idea of zero-shot learning started to emerge. So how do we do zero-shot, task-specific learning with these pre-trained models? Really, the idea is that we have to be creative here. We know these are text prediction models: if you put in text, they will complete whatever follows. So if we can coerce these models into completing the task we care about, maybe question answering, we can start getting them to solve tasks. For example, if you want to ask questions about Tom Brady, you set it up: you put information about Tom Brady in the context, then you put the question you want answered, and then it will auto-complete, in some sense. So this is one early perspective on these models: they are very advanced auto-complete models. Similarly, if you want to figure out which answer is right and which is not, something that is very useful to measure is log probabilities.
So for example, we want to figure out what the word "it" refers to in the sentence "The cat couldn't fit into the hat because it was too big." What we can do is take the sentence, replace "it" with either "the cat" or "the hat," and then measure which version the model assigns higher probability. That gives you an idea of what the referent is. None of this is in the training data as a task; the model is simply learning to predict text, but you can start seeing how we can leverage these models to do other tasks beyond prediction. This is just more evidence of how GPT-2, with no task-specific fine-tuning and no task-specific training, establishes the state of the art on many, many different tasks, simply by scaling up the model parameters and the amount of data it's trained on. Here's a fun example. Say you have a news article that you want to summarize: how do you get a zero-shot model to do it? The answer is you put the document into the context and simply append "TL;DR" to it. On most of the data on the Internet, whenever you see TL;DR, what follows is naturally a summary, so you can get zero-shot performance on summarization here as well. Again, this is not trained to do summarization in any specific way, and it still does really well, simply because of its pre-training data. So GPT-2 with TL;DR sits somewhere up there relative to some of the very task-specific models. And I think you'll see the trend: if you were Alec Radford or somebody like that and you saw these cool things emerging, your next step would obviously be, I'm going to scale this up a little more, I'm going to make an even bigger model, train it on even more data, and see how things go, right? So that's how we got GPT-3. We went from 1.5 billion parameters to 175 billion parameters, and from about 40 GB of data to around 600 GB of data. Of course, now we're in terabytes of data, and text is a very compressed representation, so terabytes of data is a lot. We talked about zero-shot learning; the cool thing that emerged in GPT-3 is... go ahead. speaker 2: Is the TL;DR supposed to come after the passage? speaker 1: You typically put the passage first. If you've interacted with Reddit or something like that, typically somebody will write an entire post and then end with TL;DR ("too long; didn't read"), here's a summary of the thing. speaker 2: Sometimes the opposite, it comes first. speaker 1: Oh yeah, there are situations where it also comes first. But one reason is that these are decoder-only models with causal attention, so they typically need to see the context before the summary. speaker 2: I'm just curious, from my experience it often comes first, so how is it able to summarize when it comes after the passage? speaker 1: There's probably a lot of data where the TL;DR comes first, but there's probably a lot of data where TL;DR comes after as well. Cool. So we saw zero-shot learning emerging in GPT-2. Few-shot learning maybe seems slightly easier, but this is where things started getting really fun: you're starting to beat the state of the art simply by putting examples in context. So what does few-shot learning mean here? What are we talking about?
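A minimal sketch of the log-probability comparison described above for the "cat/hat" example: score each candidate reading with a causal LM and pick the more probable one. The choice of GPT-2 via Hugging Face transformers and the exact candidate sentences are illustrative assumptions, not part of the lecture.

```python
# Zero-shot coreference by comparing sequence log-probabilities (a sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of the token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

candidates = [
    "The cat couldn't fit into the hat because the cat was too big.",
    "The cat couldn't fit into the hat because the hat was too big.",
]
# The higher-probability substitution is taken as the model's answer.
print(max(candidates, key=sequence_log_prob))
```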
As I mentioned, the typical idea here is that, say, you want to solve translation. You put some examples of translation into the context (maybe it's a correction task, or maybe you're interested in translation), and with no gradient updates, no learning in any conventional sense whatsoever, you put a few examples in and that's it: the model knows how to solve the task. Isn't that crazy? You guys did the assignment on translation, right? But this is what modern NLP looks like: you put in some examples and you have the entire system. And this is where things got really interesting with all these task-specific models that were created to be really, really good at translation or really good at summarization. Let's look at this graph. We start with the zero-shot performance, in a similar fashion to what I described earlier, and you start somewhere down there. You put one example of English-to-French translation in, and you already get close to a fine-tuned level; a few examples in, you're already starting to be close to the state-of-the-art models. speaker 2: Wait, but in that graph the state of the art is really high. Isn't that the fine-tuned state of the art, the plus? speaker 1: The plus here, I think, refers to the fine-tuned state of the art, where a model is trained exclusively on a lot of translation data, which might be slightly different. I think the relevant comparison here is just that in-context learning starts to emerge at scale. And I think this is the key point. Some of this is contested, just to be very upfront, but there's this idea of emergence of this property as you train with more compute and more scale. There's more recent research which suggests that if you plot the x-axis correctly, it feels less emergent. But the general idea is that as you increase the number of parameters and the amount of compute going into the models, the ability to go from a few examples to really strong performance is very compelling. Cool. And as I explained earlier, this is very different from the conventional idea of fine-tuning. Instead of iterating over examples and doing gradient updates, we are just doing few-shot prompting: we put in a few examples, and that gives us the system. speaker 2: [inaudible question about the prompt format] speaker 1: Yes, the exact details can depend on the prompt template you use, but typically you would just put examples, like "sea otter," and then whatever your task is, you let the model complete from there, because it can infer the task from the examples you've given. Any other questions? Cool. So we have gone from zero-shot prompting, and we're seeing that few-shot prompting is becoming really competitive with good models, but there are still limitations. You cannot solve every task this way, and in particular, things that involve richer multi-step reasoning can be pretty challenging. To be fair, humans struggle at these tasks as well. Things like addition are probably still hard as you keep increasing the number of digits.
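A minimal sketch of the few-shot prompting recipe described above: no gradient updates, just demonstrations placed in the context before the new input. The translation pairs and the "=>" template are illustrative assumptions.

```python
# Build a few-shot translation prompt; the model is expected to complete the last line.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]

def few_shot_prompt(query: str) -> str:
    lines = ["Translate English to French:"]
    lines += [f"{en} => {fr}" for en, fr in examples]   # in-context demonstrations
    lines.append(f"{query} =>")                          # the model fills in the translation
    return "\n".join(lines)

print(few_shot_prompt("plush giraffe"))
```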
But one thing you have to start being creative with, as I alluded to earlier, is that you can get these models to do the task if you're creative in how you prompt them. This is what we're going to see next: a technique called chain-of-thought prompting. The idea we have explored thus far is that we put in examples of the kind of task we want to do, and we expect the model to learn what the task is from those examples and go from there. The new idea is that instead of just showing what the task is, you show examples where you reason through the task, so the model is not just learning to do the task but also learning how the reasoning works. In this example, initially we have to solve a simple math problem, and the prompt shows exactly the answer directly. If you do that, you'll observe that the model gets the answer wrong. Instead, what if you show the model how to reason about the task, include a chain of thought in the prompt, and then ask it a new question? The idea is that now the model is not just going to output an answer; it's going to reason about the task, and it's going to do a lot better. This has been shown to be very effective. Chain of thought is also, as you can see, something that improves a lot with model scale, and it's nearly better than the supervised best models here. These models were roughly about 40 billion parameters, and simply with this chain-of-thought prompting at scale, you're already beating the state of the art. Cool. So I showed you examples of chain-of-thought reasoning where you go through a reasoning chain, but you can be even slightly smarter than that. You might not even need to show any examples; you just need to nudge the model into thinking about what to do next. This idea emerged in the "Let's think step by step" paper: instead of even showing an example, you just start the answer with "Let's think step by step," and that's it. The model will start reasoning about the answer itself instead of just auto-completing to an answer, and you get something like this. So maybe you don't even need to show any examples; you can probably induce the reasoning behavior zero-shot as well. And what the final numbers look like: compared to the zero-shot performance we got from essentially auto-completing, zero-shot chain of thought substantially improves performance, from 17.7 to 78.7. It's still worse than putting in actual examples of reasoning, few-shot chain of thought, but you can see how much it improves performance simply by asking the model to think step by step. Maybe the lesson for interacting with these models is that you might not get the exact desired behavior up front, but often these models are capable of the behavior you want, and you have to think about how to induce it. The right way to think, perhaps, is: what pre-training data, what data on the Internet, might the model have seen that induces a behavior similar to the one I want? Then you induce that kind of behavior from the model.
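A minimal sketch of the zero-shot chain-of-thought trick just described: the answer is primed with "Let's think step by step." so the model produces intermediate reasoning before the final answer. The example question is illustrative; wiring the prompt to an actual model is omitted here.

```python
# Plain zero-shot prompt vs. zero-shot chain-of-thought prompt (a sketch).
question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

zero_shot_prompt = f"Q: {question}\nA:"
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# The second prompt typically elicits reasoning such as
# "There are 16 / 2 = 8 golf balls ... 8 / 2 = 4 blue golf balls."
# before the final answer, which is what improves accuracy.
print(zero_shot_cot_prompt)
```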
And, you know, we've designed some of these prompts by hand; you can also get an LLM to design these prompts, and there are recursive self-improvement ideas here that can bump up the performance a little more. Cool. So what we have seen so far is that as models get stronger and stronger, you can get them to do your task zero-shot or with a few examples, and you can nudge them toward the task you want them to solve. But the downside is that there's only so much you can fit into context. That might not be as true anymore, since models have increasingly large context windows, but it's still somewhat unsatisfactory that you have to trick the model into doing your task rather than it just doing the task you want. And going forward, you probably still want to fine-tune these models for more and more complex tasks. That's where we're going next: this section covers instruction fine-tuning. The general idea is that, as we talked about, pre-training is not about assisting users; it is about predicting the next token. You can trick the model into assisting users and following the instructions you want, but in general that's not what we pre-trained it for. Here's an example: if you ask GPT-3, a pretty strong model, to explain the moon landing to a six-year-old in a few sentences, it will follow up with more questions about what a six-year-old might want. That's not what you wanted the model to do, right? The general term people use these days is that these models are not aligned with user intent. The next section is about how to align them with user intent so that you don't have to trick the model into doing what you want. This is the kind of desired completion we want at the end of instruction tuning. So how do we get from those pre-trained models to models which respond to user intent? I hope the general idea of pre-training and fine-tuning was covered somewhere in the class: what you have probably seen thus far is that you pre-train on a lot of language data and then fine-tune on your specific task, taking the same decoder-only model and fine-tuning it on some task with a very small amount of data. The thing that is different now is that we're no longer fine-tuning on a small amount of data for one task; we're going to fine-tune on many, many different tasks, and we're going to try to put them into a single usable UX for users. This is where instruction fine-tuning comes in. Cool. The recipe is not very complicated: we collect a lot of examples of instruction-output pairs, where the instructions range over several different forms (question answering, summarization, translation, code, reasoning and so on), and we collect a lot of examples for all those tasks. We train on those instruction-output pairs exactly as given, and then we evaluate on some unseen tasks as well. So this is the general paradigm of instruction fine-tuning, and again, it's the same idea we explored in pre-training: data plus scale is really important.
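A minimal sketch of the instruction fine-tuning step just described: ordinary next-token cross-entropy on (instruction, output) pairs, with the loss masked so only the response tokens are penalized. The model choice, prompt template, and example pair are illustrative assumptions, not the lecture's exact setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in for a larger base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_loss(instruction: str, output: str) -> torch.Tensor:
    prompt = f"Instruction: {instruction}\nResponse: "
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + output, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100                      # ignore the loss on prompt tokens
    logits = model(full_ids).logits
    return F.cross_entropy(                            # shift: position t predicts token t+1
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

loss = sft_loss("Explain the moon landing to a six-year-old.",
                "People went to the moon in a rocket and walked on it.")
loss.backward()   # an optimizer step over many such pairs is the whole recipe
```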
And these days, you start off with one task and extend it over thousands and thousands of tasks, with 3 million-plus examples. This is generally the broad range of tasks you might see in instruction fine-tuning datasets. You might even wonder why we still call it fine-tuning; it's almost starting to look like pre-training. But these are just terms, so use whatever you're comfortable with. So we get this huge instruction dataset and we fine-tune our model. The next question is: how do we evaluate these models? I think you'll see another lecture on evaluation, so I don't want to dive too deep into this, but generally evaluation of these language models is an extremely tricky topic; there are a lot of biases you need to deal with, and a lot of this will be covered later. Some more recent progress here is that we are starting to curate really large benchmarks like MMLU, where the models are tested on a broad range of diverse knowledge. This is just one example, and these are the topics you'll see. To give some intuition of what the examples in these evaluations look like: under astronomy, you might be asked what a Type Ia supernova is, or you might be asked some questions about biology, and there's a huge host of tasks like this. These are typically multiple-choice questions, and you can ask the model to answer them. If the models are instruction fine-tuned already, hopefully they can simply answer the question, but you can also chain-of-thought prompt or few-shot prompt these questions. Recently there's been a huge amount of progress on this benchmark. What people have observed is that more pre-training on more data with larger models simply keeps pushing the number up. A score of 90% is often seen as the number these models want to cross, because it's roughly human-level knowledge, and recently the Gemini models reportedly crossed it. So yeah, go ahead. speaker 2: Isn't this the entire ImageNet thing all over again? At some point you're like, okay, maybe my methods are implicitly fine-tuned to the benchmark's biases. Isn't something like that happening here as well? speaker 1: Yes. This is a tricky topic, because for a lot of models there's this question of whether your test sets are leaking into your training dataset, and there are huge concerns about that. It's a perfectly valid question to ask how we even evaluate; this is why evaluation is actually very tricky. But one general thing to keep in mind is that at some point it doesn't matter what your train/test split is if the models are generally useful. If you train on everything you care about and the model does well on it, does it matter? So yeah, we still need better ways to evaluate models and to understand whether methods are actually improving them, but at some point those boundaries start to be less important. Cool. So, massive progress on this benchmark starting with GPT-2, and we're roughly at 90%, to the point where it's becoming unclear whether improvements on these benchmarks are actually meaningful or not.
In fact, most of the time when the models are wrong, you might find that the question itself was unclear or ambiguous. So all evaluation benchmarks have a certain limited utility. I'm going to go over another evaluation example of how this recipe changes things. The T5 models were instruction fine-tuned on a huge number of tasks, and another trend, which I think will be a theme across this lecture, is that as your models become larger and are trained on more data, they become more and more responsive to your task information as well. What you'll observe here is that as the number of parameters increases, from T5-Small up to 11 billion parameters with T5-XXL, the improvement from going from a pre-trained model to an instruction-tuned model grows: the instruction-tuned model gets even better at following instructions, and the difference goes from +6.1 to +26.6 as the models become larger. So this is another very encouraging trend: you probably should train on a lot of data with a lot of compute, and pre-training just keeps on giving. I hope you get a chance to play with a lot of these models; I think you already are. Before instruction fine-tuning, when you ask a question related to disambiguation QA you get something like this, and the model doesn't actually follow the "let's think step by step" instruction very clearly. After instruction fine-tuning, it is able to answer the question. More recently, people have been researching what the instruction-tuning datasets should look like. There's a huge plethora of instruction-tuning datasets now available (this is just a representative diagram), and there's a big open-source community developing around them as well. Some high-level lessons we have learned: one interesting lesson is that we can use really large, strong models to generate some of the instruction-tuning data to train our smaller models. Take your favorite model right now, GPT-4 maybe, or Claude, and you can get it to answer questions and generate instruction-output pairs for training your open-source or smaller model. That actually is a very successful recipe: instead of getting humans to collect all the instruction-output pairs or write the answers, you can get the bigger models to generate the answers. That's one thing that has recently emerged. Another thing being discussed is how much data we need. I talked about millions of examples, but people have found that if you have really high-quality examples, you can get away with a thousand examples as well; this is the "Less Is More for Alignment" (LIMA) paper, and how data scaling in instruction tuning affects final model performance is still an active area of research. Crowdsourcing these datasets can be effective too; there are very cool efforts emerging like Open Assistant. There's a lot of activity in the field and hopefully a lot more progress as we go on. speaker 2: Yes, a question sort of in the spirit of this LIMA paper: do code, or math word problems, have this desired structure?
So shouldn't we just train mostly on code, paired with some English, and say, okay, this is the best reasoner we can get? Because code has this structure where you go step by step, breaking a higher-level concept down into smaller pieces, so you could consider code to be high-value tokens. speaker 1: So, pre-training is a whole dark art that I'm not completely familiar with, but code actually does end up being really useful in pre-training mixtures, and people do upweight code data quite a lot. But it depends on what users are going to use the models for, right? Some people might use them for code, some might use them for reasoning, but that's not the only task we care about. As we'll see later, people often use these models for creative tasks: they want to write a story, they want to generate a movie script, and so on. I don't know if training only on reasoning tasks would help with that. Go ahead. speaker 2: Yeah, but there exists some data distribution which is high-value for creative tasks. speaker 1: Yes. It seems like a lot of people write stories and everything on the Internet all the time, which is not code. And with this idea of hallucinations in the field, you can sometimes think: hey, creativity might be a byproduct of hallucination as well. So I don't know exactly what data leads to more creative models, but generally there are a lot of stories written on the Internet, which allows the model to be creative. I don't know if I have a specific answer to the question. Cool. So we discussed instruction fine-tuning: very simple and very straightforward, no complicated algorithms. Just collect a lot of data, and then you can leverage performance at scale as well; as models become better, they also become more easily specifiable and more responsive to tasks. Now we're going to discuss some limitations, and I think this is really important for understanding why we are going to optimize for human preferences. Cool. We touched on this: instruction fine-tuning is necessarily contingent on humans labeling the data. It's expensive to collect this data, especially as the questions become more and more complex; you want to answer questions that may be at a physics-PhD level, or things to that effect, and these become increasingly expensive to collect. This is perhaps obvious: pre-training does not require any specific data, you scrape data off the web, but for instruction fine-tuning you probably need to recruit people to write down answers to your instructions, and this can become very expensive very quickly. But there are more limitations as well. There are open-ended tasks related to creativity that don't really have an exact correct answer to begin with, so how do you generate the "right" answer to that kind of question? And language modeling inherently penalizes all token-level mistakes equally. This is what supervised fine-tuning does as well, but often not all mistakes are the same.
So this is an example where you're trying to do a prediction task: "Avatar is a fantasy TV show," and perhaps you can see that calling it an adventure TV show is perhaps okay, but calling it a musical is a much worse mistake. Yet both of these mistakes are penalized equally. And one aspect that is becoming increasingly relevant is that the humans you ask might not generate the best or highest-quality answer. The models are becoming increasingly competitive, and in some sense you're going to be limited by how high-quality an answer humans can generate, but increasingly the models are generating better and better answers. So do we really want to keep relying on humans to write down the answers, or do we want to somehow go beyond that? These are the three problems we've talked about with instruction fine-tuning. We've made a lot of progress with it, but this is not how we got ChatGPT. One high-level problem is that even when we are instruction fine-tuning, there is still a big mismatch between the end goal, which is to optimize for human preferences and generate an output a human might like, and what we're doing, which is still a prediction task: predicting the next token, just on a more curated dataset. So there's still a bit of a mismatch, and it's not exactly what we want to do. I'm going to pause for a second here, because this is important for understanding the next section. If there are any questions, feel free to ask. speaker 2: Is this step still taken as a first step, or is it dropped? speaker 1: That's a good question. I think this is still one of the more important steps you take before the next step, but people are trying to remove this step altogether and jump directly to the next one; there's work emerging on that. But yeah, this is still a very important step before we do the next step. speaker 2: Is problem two also present in pre-training? And if so, how do you avoid it, just by having a lot of data? speaker 1: Yeah, that's a great question. There's one major difference with pre-training: pre-training covers a lot more text. Just for context, as we talked about, pre-training is roughly 15 trillion tokens, whereas supervised instruction fine-tuning might be somewhere on the order of millions to billions of tokens, a few orders of magnitude lower. Typically you only see one answer for a specific instruction, but during pre-training you'll see multiple completions for the same kind of prompt. That's good, because when you see multiple answers or completions during pre-training, you start to weigh different answers; you put probability mass on different kinds of completions. Instruction fine-tuning might force you to put all the weight on only one answer. speaker 2: Okay. speaker 1: But generally, yeah, this is a problem with both stages, you're right. Anything else? Cool. So, as all of this alludes to, we're going to start trying to satisfy human preferences directly. We're no longer going to get humans to generate some data and do a token-level prediction loss on it; we're going to try to optimize for human preferences directly, and that is the general field of RLHF. That's the final step in typically getting a model like ChatGPT.
So we talked about how collecting demonstrations is expensive, and there's still a broad mismatch between the LM objective and human preferences. Now we're going to try and optimize for human preferences directly. What does optimizing for human preferences even mean? To establish that concretely, let's go through a specific example: summarization. We want to train a model to be better at summarization, and we want to satisfy human preferences. Let's imagine that a human is able to prescribe a reward for a specific summary; let's just pretend there is a reward function, and you and I can assign a reward, say plus one or minus one, or something to that effect. In this specific case we have an input x, a news article about an earthquake in San Francisco that we want to summarize. Let's pretend we get these rewards and we want to optimize for them. We get one summary, y1, which says "an earthquake hit..." and gets a reward of 8.0, and another summary which gets a reward of 1.2. Generally speaking, the objective we want to set up is of the following form: we take our language model p_theta, which generates a completion y given an input x, and we want to maximize the expected reward R(x, y), where x is the input and y is the output summary for this task. And just to point out something concretely: this is different from everything we have done so far in one very specific way. We are sampling from the model itself in the bottom term; the y comes from p_theta. Everything we've seen so far, the data is sampled from some other source, either during pre-training or in supervised fine-tuning, and we maximize the log-likelihood of those tokens. But now we're explicitly sampling from our own model and optimizing a potentially non-differentiable objective. Cool. So broadly, the RLHF pipeline looks something like this. The first step is still instruction tuning, which we've seen up until now: we take our pre-trained model, instruction-tune it on a large collection of tasks, and get something which starts responding to our desired intent. But there are two more steps after this, which are typically followed in creating something like InstructGPT. The next step is estimating some kind of reward model, something which tells us, given an instruction, how much a human would like or hate a given answer. We looked at something like this earlier, but I didn't talk about how we even get it; that's the second step. Then we take this reward model and optimize against it, with the optimization I suggested earlier: maximizing the expected reward under your language model. We're going to spend a lot of time on the second and third steps. So the first question we want to answer is: how do we even get a reward model for what humans are going to like? This is a very ill-defined problem, generally speaking. There are two problems we have to address. The first is that a human in the loop is expensive: say I ask a model to generate an answer and then get a human to label it with some kind of score; doing this over millions of completions is not very scalable, and I don't want to sit around labeling millions of examples. But this one is easy to deal with; we're in a machine learning class.
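Written out, the objective sketched in this passage is simply the expected reward of completions sampled from the model itself (notation follows the lecture: p_theta is the model being tuned, R the reward):

```latex
% Expected-reward objective: sample y-hat from the current model and
% maximize the reward a human (or a reward model) would assign to it.
\[
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; \hat{y} \sim p_{\theta}(\cdot \mid x)}
\big[\, R(x, \hat{y}) \,\big]
\]
```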
So what are we going to do? We're going to train something that predicts what a human would or would not like. This is essentially a machine learning problem: we take these reward scores and train a reward model to predict, given an input and an output, what the reward score would be. A simple machine-learning, regression-style problem; you've probably seen this before. Cool. Now there's a bigger problem here. Sorry, go ahead. speaker 2: For step one, do we use just embeddings with a classifier, or do we use a real language model? speaker 1: That's a good question. Generally, reward models still need to understand the text really well, so they're bigger models, and they're typically initialized from the pre-trained language model as well. So you typically start with a pre-trained language model, do the kind of prediction we'll talk about, and it gives you a score. speaker 2: If you're doing that, how do you separate x and y? How does the language model know which part is which? speaker 1: It doesn't need to. It just sees x and y as one input, and it predicts a score at the end. The x and y are more for notational convenience, because for us x and y are different: x is the question the user asked, y is something the model generated. But you shove in the whole thing. Cool. Now, here's the bigger problem: human judgments are very noisy. We said we want to assign a score to a completion, and that's extremely non-trivial to do. If I give you a summary like this, what score are you going to assign on a scale of ten? If you ask me on different days, I'll give a different answer, and across humans this number is not calibrated in any meaningful way: you get 4.1 from one person and 6.6 from another, because different humans simply assign different scores. There are ways to address this (you can calibrate humans, give them a specific rubric, talk to them), but it's a complicated process, and there's still a lot of room for judgment, which is not very nice for training a model: if your labels vary a lot, they're just hard to predict. So the way this is addressed is that instead of trying to predict the reward label directly, you set the problem up in a slightly different way. Something much easier for humans is to give them two answers, or maybe many answers, and ask which one is better. This is where the idea of asking humans to rank answers comes in. If I give you a whole news article and ask which summary is better, you might be able to give me a ranking: the second summary is the worst, the first one is better, and the third is somewhere in the middle. So you get a ranking, which gives you preferences over summaries. Hopefully you can see the important idea here: even when I have some kind of consistent internal utility function, it's much easier for me to compare two things and say which is better than to ascribe an arbitrary number on a scale.
And that's why the signal from something like this is a lot better. Now, we have this kind of preference data, and we still need some kind of reward score out of it: we shove in our input and a summary and we need a score at the end, but it's not obvious how to take ranking data and convert it into that kind of score. In come our good friends, Bradley and Terry. There's a lot of work, mainly in economics and psychology, that tries to model how humans make decisions, and in this specific case the Bradley-Terry model says that the probability a human chooses answer y1 over y2 depends on the difference between the rewards the human internally assigns, passed through a sigmoid. If you've looked at binary classification before, the logit is simply the reward of y1 minus the reward of y2, the difference between the winning completion and the losing completion. Is everybody with me up to this point? So the idea is that if you have a dataset with a winning completion y_w and a losing completion y_l, the winning completion should score higher than the losing completion. Go ahead. speaker 2: Sorry, what is J? Is it a log prob? What is the type of J, this number we're getting as the expectation? speaker 1: It's an expected log probability; it will be a scalar at the end. Say you have a reward model which gives a score R1 to y_w and R2 to y_l. You subtract them, you get a number, you put it into a sigmoid, and you get a probability, because the sigmoid converts a logit into a probability. Then you take the logarithm of that and the expectation over everything, and you get this final number, which tells you how well your reward model is doing on the entire dataset. A good model of humans would score very low here: it would generally assign a higher reward to the winning completion and a lower reward to the losing completion. Cool. The math is just beginning, so hold on to your seats. Now let's see where we are. We have a pre-trained model p_PT(y|x), and we've got this fancy reward model, a model of humans that can tell us which answer they like and which answer they don't. Now, to do RLHF, we've discussed what this will look like: we copy our pre-trained model, or our instruction-tuned model, and we optimize its parameters. I suggested that the objective we want to optimize is the expected reward when we sample completions from p_theta, and we're going to optimize against our learned reward model instead of the reward a human would actually have assigned. Do you see any problem with this, something that might go wrong if you do it this way? speaker 2: [inaudible] speaker 1: It might collapse, yes. But generally, at least from my intuition, if you're ever optimizing some learned metric, be very careful, because typically loss functions are very clearly defined.
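For reference, the Bradley-Terry reward-modeling objective described above, written out (R_phi is the learned reward model, y_w and y_l the winning and losing completions; minimizing this loss is what "scoring very low" refers to):

```latex
% Probability that a human prefers y_w over y_l, and the resulting
% reward-model training loss (negative log-likelihood of the preferences).
\[
p(y_w \succ y_l \mid x) = \sigma\big( R_{\phi}(x, y_w) - R_{\phi}(x, y_l) \big)
\]
\[
\mathcal{L}_{\mathrm{RM}}(\phi) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[ \log \sigma\big( R_{\phi}(x, y_w) - R_{\phi}(x, y_l) \big) \Big]
\]
```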
But here my reward model is learned, and when it's learned, it will have errors. It's trained on some distribution; it will generalize a bit, but it will have errors. And when you're optimizing against a learned model, the policy will tend to hack the reward model: the reward model might erroneously assign a really high score to a really bad completion, and if your policy, your language model, learns to exploit that, it will completely hack the reward model and start generating those gibberish completions. So, as a general machine learning tip: if you're optimizing a learned metric, be careful about what you're optimizing and make sure it's actually reliable. This is obviously not desirable; if you start optimizing this objective naively, you're going to converge to gibberish language models very, very quickly. So typically what people do is add some kind of penalty that keeps the model from drifting too far from its initialization. Why do we want that? If it cannot drift too far from its initialization, we know it stays a decent language model, since the initialization is a decent language model. We also know that the reward model was trained on a distribution of completions sampled from that initial model, so the reward model will be somewhat reliable on that distribution. So we simply add a penalty which says you should not drift too far away from the initial distribution. To go over this: we want to maximize an objective with R_phi, our learned reward model, but we add this term, beta times a log ratio, where the ratio is between the model we're optimizing, p_theta, and our initial model. What this says is that if we assign a much higher probability to a certain completion than our pre-trained model does, we pay an increasingly large penalty for it; simply put, you pay a price for drifting too far from the initial distribution. If you've taken machine learning, the expectation of this quantity is exactly the Kullback-Leibler divergence, or KL divergence, between p_theta and p_PT. So you're penalizing drift between the two distributions. Go for it. speaker 2: Shouldn't you also add a penalty like this in the previous setting, with fine-tuning? Or is this only relevant for the RLHF? speaker 1: That's a good question. People do add some kind of regularization in fine-tuning, but it's not nearly as critical as when you're doing this with RL, where the incentive is to exploit the reward model as much as possible. We'll see examples where the learned reward model thinks it's doing really well but the generations are complete garbage. So it's much more important in this optimization. Cool. Now, this course does not assume a background in reinforcement learning, so we're not going to go deep into it, but I want to give a very high-level intuition for how this works. Reinforcement learning is not just useful for language models; it has been applied to several domains of interest: game-playing agents, robotics, chip design, and so on. And the intersection between RL and language models dates back to roughly 2016 as well.
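The penalized objective described above, written out: maximize the learned reward while paying a beta-scaled price for drifting away from the initial model p_PT; in expectation the penalty is the KL divergence between the two distributions.

```latex
% KL-penalized RLHF objective.
\[
\max_{\theta}\;
\mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot \mid x)}
\Big[ R_{\phi}(x, \hat{y})
      - \beta \log \frac{p_{\theta}(\hat{y} \mid x)}{p_{\mathrm{PT}}(\hat{y} \mid x)} \Big]
\;=\;
\max_{\theta}\;
\mathbb{E}_{\hat{y} \sim p_{\theta}(\cdot \mid x)}\big[ R_{\phi}(x, \hat{y}) \big]
\;-\; \beta\, D_{\mathrm{KL}}\!\big( p_{\theta}(\cdot \mid x) \,\|\, p_{\mathrm{PT}}(\cdot \mid x) \big)
\]
```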
But it's been really successful recently, especially with the success of RLHF. The general idea is that we use the model we're optimizing to generate several completions for an instruction, we compute the reward under our learned reward model, and then we simply try to update our model to increase the probability of the high-reward completions. When we sample from the model, we'll see completions of varying quality, some good summaries for our task and some bad ones, and we update the log probabilities such that when you sample from the updated model, you're typically in the higher-reward region. Does that high-level summary make sense? Cool. And RLHF is incredibly successful. I think this is a very good example; it's the same summarization setup. The key point here is that performance improves with model size, which we've seen in many different examples, but what you can actually see is that even fairly small models can outperform human completions if you train them with RLHF. That's exactly the result here: the reference summaries are human-generated, and when you ask humans which they prefer, they often prefer the model-generated summary over the human-generated one. This is something you only observe with RLHF, even at small scale. And again, the same scaling phenomenon still holds: bigger models do become more responsive, but RLHF itself is very impactful here. Cool. The problem with RLHF is that it's just incredibly complex. I gave you a very high-level summary; there are whole courses on this for a reason. This image is not for you to understand, it's just to intimidate you. You have to fit a value function, you have to sample from the model a lot, it can be sensitive to a lot of hyperparameters; there's a lot going on here, and if you start implementing an RLHF pipeline, it can be very hard. This is the reason a lot of RLHF was restricted to very high-compute, high-resource places and was not very accessible. So what we're going to cover in this course is something called direct preference optimization, which is a much simpler alternative to RLHF and hopefully much more accessible. Please bear with me; there will be a lot of math here, but the end goal of the math is to come up with a very simple algorithm. Feel free to stop me and ask questions. speaker 2: In terms of, say, GPT-4 versus GPT-3, how much does the number of parameters in the base model help, maybe in reducing the number of examples from humans that you need? speaker 1: That's a really good question. Generally speaking, if you hold the dataset size constant and simply increase the model size, it will improve quite a lot. But the nice thing is that you can reuse the data, and you can keep adding data as you keep scaling models up. So typically nobody tries to reduce the amount of data collection; you just keep increasing both. Cool.
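A very rough sketch of the high-level update described in this section: sample completions, score them with the learned reward model, and push up the log-probability of the higher-reward ones. This is a bare REINFORCE-style step, not the full PPO pipeline the lecture alludes to, and `policy`, `reward_model`, and their methods are hypothetical stand-ins rather than a real library API.

```python
import torch

def rlhf_step(policy, reward_model, prompts, optimizer, num_samples=4):
    losses = []
    for x in prompts:
        samples = [policy.sample(x) for _ in range(num_samples)]        # y ~ p_theta(.|x)
        rewards = torch.tensor([reward_model.score(x, y) for y in samples])
        baseline = rewards.mean()                                       # simple variance reduction
        for y, r in zip(samples, rewards):
            log_prob = policy.log_prob(x, y)                            # log p_theta(y|x)
            losses.append(-(r - baseline) * log_prob)                   # raise prob of high-reward y
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```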
speaker 1: So we talked about RLHF, and the current pipeline is: we train a reward model on the comparison data we've seen so far, we start with our pre-trained, instruction-tuned model, and we convert it into an RLHF'd model using reinforcement learning techniques. Now, the key idea in direct preference optimization is: what if we could simply write the reward model in terms of our language model itself? To understand that intuitively: a language model assigns probabilities to whatever the most plausible completions are, but those plausible completions might not be what we intended. You could restrict the probabilities to the completions a human might like, and then the log probabilities of your model would represent something the humans might like, not just some arbitrary completion from the Internet. So there can be a direct correspondence between the log probability a language model assigns and how much a human might like the answer. And this is not some arbitrary intuition I'm trying to come up with; we will derive it mathematically. The general idea of direct preference optimization is that we're going to write the reward model in terms of our language model, and once we can do that, we can directly fit that reward model to the preference data we have, and we don't need to do the RL step at all. We start off with some preference data and simply fit our reward model to it, which directly optimizes the language model parameters. At a higher level, why is this even possible? We did this really cumbersome process of fitting a reward model and optimizing against it, but in the whole process, the only external information being added to the system was the human labels on the preference data. When we optimize against a learned reward model, no new information is being added into the system. That's why something like this is possible; for quite a few years this was not obvious, but as you'll see, these results start to make sense. So we're going to derive direct preference optimization. I'll be here after class as well if you have questions, but hopefully this will be clear. We discussed that we want to solve this expected-reward problem: maximize the expected reward, minus this beta log-ratio term, which penalizes the distance between where our current model is and where we started, so we don't drift too far from the start. It turns out that for this specific problem, instead of running an iterative routine, there's actually a closed-form solution, and it looks something like this. If you've seen the Boltzmann distribution or something to that effect before, this is basically the same idea: we take the pre-trained distribution p_PT(y|x) and reweight it by the reward. If a completion has a very high reward, it gets higher probability mass, and if it has a lower reward, it gets lower probability mass; it's determined by the reward.
And beta is a hyperparameter which essentially controls the trade-off between the reward model and the constraint: as beta becomes lower and lower, you start paying more and more attention to the reward model. So the probabilities look something like this. And there is this really annoying term, the Z(x). The reason it exists is that the numerator by itself is not normalized; it's not a probability distribution. So to construct an actual probability distribution, you have to normalize it, and Z(x) is simply that normalization.
speaker 2: Real quick, is Z(x) the sum over all completions?
speaker 1: Yes, exactly, it's a sum over all y's for a given instruction. And that's exactly why it's so pesky: it's intractable. If I take an instruction and try to sum over every possible completion, and not just the syntactically correct ones, every single possible one, we have 50,000 tokens in the vocabulary, maybe even more, and the completions can get arbitrarily long. So this space is completely intractable, and this quantity is not easy to approximate. Even so, the main point here is that if you're given a reward model, there does exist at least a closed-form solution which tells us what the optimal policy, the optimal language model, will look like. And if you do a little bit of algebra, just move some terms around, take a logarithm here or there, I promise this is not very complicated, you can actually express the reward model in terms of the language model itself. And I think this term is reasonably intuitive as well. What it says is that a completion y-hat has a high reward if my optimal policy assigns a higher probability to it relative to my initialized model, and this is scaled by beta. So the beta log ratio is what we're looking at here. And the partition function, let's just ignore it for now, but it's intractable. The beta log ratio is the key part here. Is everyone following along? Awesome. Okay. So right now I'm talking about optimal policies, but really, every policy is optimal for some kind of reward, right? This holds mathematically as well. So the important bit here is that you can take your current policy and your initialized model, and you can get some kind of a reward model out of them. And this is the exact identity which leads to that: the reward model can be expressed in terms of your language model, barring the log partition term, and we'll see what happens to it. Go for it.
speaker 2: I'm sorry, I don't think I got why we can swap these, because there is a thing that we're trying to optimize, and how does p-star turn into p-theta?
speaker 1: Yeah, for now we're not optimizing any reward model. All I'm saying is that if I take my current language model, it probably represents some kind of a reward model implicitly, because of this relationship, because this holds for every p-star and every reward model. What I'm saying is that if I plug in my current language model, it also represents some kind of a reward model. I'm not saying it's optimal.
speaker 2: But I want to say, because at the beginning, p-theta is p_pt, yes? And so we just get that the reward is basically zero.
speaker 1: Initially it's zero, but we can optimize the parameters. Yeah, that's a good observation, it is basically zero in the beginning.
speaker 2: But how do we start optimizing it?
speaker 1: I'll get to that. Any other questions?
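Written out, the rearrangement the speaker describes is just taking logs of the closed-form solution and solving for the reward; in the notation used above (p_pt as the reference model, p* as the optimal policy):

```latex
p^*(y \mid x) \;=\; \frac{1}{Z(x)}\, p_{\mathrm{pt}}(y \mid x)\, e^{\,r(x,y)/\beta},
\qquad
Z(x) \;=\; \sum_{y'} p_{\mathrm{pt}}(y' \mid x)\, e^{\,r(x,y')/\beta}

\Longrightarrow\quad
r(x,y) \;=\; \beta \log \frac{p^*(y \mid x)}{p_{\mathrm{pt}}(y \mid x)} \;+\; \beta \log Z(x)
```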
speaker 2: Such that that makes the language model optimal for something?
speaker 1: That's... and that's the next step. Yes. But the key idea is that my language model's log probabilities already implicitly define a reward model. I think that's really the main point here, and this mathematical relationship is exact. Cool. Now, I'm obviously ignoring the elephant in the room here, which is the partition function; it's not going to magically vanish. If this were just the beta log ratio, that would be really nice: I can compute the log probability under my language model, I can compute the log probability under my pre-trained model, so I can compute the reward score and optimize this. But I don't know what to do with my log partition function. This is where something fun happens. Recall what the reward modeling objective was when we started off: we started with our friends Bradley and Terry again. And what we really wanted to optimize was the reward difference between the winning completion and the losing completion. Really, that's all we care about. We don't care about the exact reward itself; what we care about is maximizing the difference between the winning and losing completion. And that's actually really key here, because if you plug in the definition of r_theta there, what you'll observe is that the partition function just cancels out. Why does it cancel out? The input is exactly the same; the x is exactly the same in the difference. So the partition function Z(x) will just cancel, since it's the same in both terms. What you get is that the reward difference between the winning and losing completion is the difference between the beta log ratios for the winning and losing completion. You can plug in the terms and work it out; it's fairly simple. So the partition function, which was something we could not address, could not compute, simply vanishes. Z doesn't appear in the Bradley-Terry model, but it appears here in this equation. So we're going to take this equation, the last line that you see, and plug it in in place of the reward model in the first loss equation, the Bradley-Terry loss. Cool. So this really is it. The key observation is that we could express our reward model in terms of the language model, and our problems with the partition function go away because we are optimizing the Bradley-Terry model. And what you get is something like this: we express the loss function directly in terms of our language model parameters theta, and we're able to optimize directly on our data without doing any RL steps at all. This is simply a binary classification problem: we're really just trying to classify whether an answer is good or bad. That's really what we're doing. Before I go on, do people want a moment to absorb this? Are you okay with it?
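As a concrete sketch of the resulting objective (a minimal illustration, not the reference DPO implementation): given the summed token log-probabilities of the chosen and rejected completions under the policy being trained and under the frozen reference model, the loss is a logistic loss on the difference of beta-scaled log-ratios, and Z(x) never needs to be computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch.

    Each argument is a tensor of summed token log-probabilities log p(y|x) for the
    winning (w) or losing (l) completion, under the trained policy or the frozen
    reference model. The partition function cancels in the difference.
    """
    # Implicit rewards: beta * log(p_theta(y|x) / p_ref(y|x))
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Bradley-Terry / logistic loss on the reward difference
    return -F.logsigmoid(reward_w - reward_l).mean()

# Example with made-up numbers: the loss shrinks as the policy assigns relatively
# more probability to the winning completion than the reference model does.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```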
speaker 2: I don't quite get where y-win and y-lose come from. Are they human generated?
speaker 1: Good question. It's the same dataset we started with in RLHF as well. The way the process works is that you take a set of instructions and get the model to generate some answers, and then you get humans to label which answer they prefer. So they are model generated. Typically they can be human generated as well, but they are usually model generated, and then you get some preference labels. All you need is a label for which is the better answer.
speaker 2: You must be losing some information, though. Because you're cancelling out the partition function, you're bound to lose information about other possible completions, which you would have taken into account in standard RLHF, right?
speaker 1: That's a really good question. I don't think I'll be able to answer it completely in the time we have, but the partition function is almost a free variable. Think of it this way: there are many reward models that satisfy this optimization, so there's a free variable here that you can completely remove, and that's what this optimization benefits from. If I assign one completion a reward of plus one and another a reward of minus one, that's basically the same as saying plus 101 and plus 99; it will give you the same loss, right? So the rewards are shift-invariant, in a way.
speaker 2: That's somehow not what you want, though. If you're actually learning a reward model, 101 versus 99 is relatively much closer, so you should pay much less attention to that comparison.
speaker 1: But what we're assuming in our choice model here is that if a human prefers one thing over the other, the probability is governed only by the difference between the rewards. That's an assumption every RLHF setup also makes, and DPO makes it as well. Is that assumption completely right? Not completely, but it holds to a fairly large degree. That's a good question. Cool, I'll move on in the rest of the time. And really, the point of this plot is that we actually get fairly performant models when we optimize things with DPO. In this plot, I think the main thing you should look at is PPO, which is the typical RLHF pipeline. We are evaluating the models on summarization and comparing to human summaries. What we find is that DPO and PPO do similarly, so you're really not losing much by doing the DPO procedure instead of actual RLHF. And that's really compelling, because DPO is simply a classification loss instead of a whole reinforcement learning procedure. So I want to quickly summarize what we have seen thus far: we want to optimize for human preferences, and the way we do this, instead of relying on uncalibrated scores, is to get comparison data and feedback on that. We use this ranking data to either do something like RLHF, where we first fit a reward model and then optimize with reinforcement learning, or something like direct preference optimization, where we simply take the dataset and apply a classification loss. And yeah, there are trade-offs between these algorithms. When people have a lot of computational budget, they typically go for RLHF or some routine like that. But if you're really looking to get the most bang for your buck, you might want to go for DPO, and that's probably going to work out of the box. It's still an active area of research; people are still trying to understand how to best work with these algorithms. So I'm not making any strong claims here, but both of these algorithms are very effective.
DPO is just much simpler to work with. Cool.
speaker 1: So yeah, let's see. We went through all this instruction tuning and RLHF; what do we get? InstructGPT is the first model which followed this pipeline; it defined this pipeline. So we got models which handled 30,000 or so tasks. Remember when we were doing only one task? Now we have scaled it up from a thousand tasks to something like 30,000 different tasks with many, many different examples. That's where we are with InstructGPT, and it follows the pipeline we just described. In this case, they're following a specific RLHF pipeline where we explicitly fit a reward model and then do some kind of reinforcement learning routine on top of it. And yeah, the tasks collected from labelers look something like this; I'll leave it to you to look at the details. The way this started was with completions like the ones we see from GPT-3 for "explain the moon landing to a six-year-old": it is not really following the instruction, whereas InstructGPT will give you something meaningful. It's inferring what the user wanted from the specific instruction and converting that into a realistic answer the user might like. These are just more examples of what an InstructGPT-like model would do, whereas your base model might not follow the instructions according to your desired intentions. And yeah, we went from InstructGPT to ChatGPT, and it was essentially this same pipeline. The key difference is that it is still doing the instruction tuning, but it is more optimized for dialogue, more optimized for interacting with users. So the core algorithmic techniques that we discussed today are what give us ChatGPT, but you have to be really careful about the kind of data you're training on, and that's really the whole game. But this is the foundation for ChatGPT, and it follows the same pipeline as well. You might interact with ChatGPT; I'm sure you all have in some form or another, but this is an example of what a ChatGPT interaction might look like: you want it to write something in Gen Z style. The idea here is that it's very good at responding to instructions and intent. This is not something that we could even few-shot very easily; these kinds of instructions are hard to come up with examples for. And this is probably not something it was trained on either, but it's able to infer the intent and generalize very, very nicely. That's something I personally find very remarkable. Cool. And there's been a lot of progress on the open-source front as well. DPO is much simpler and much more efficient, and essentially all the open-source models these days are using DPO. This is a leaderboard maintained by Hugging Face, and nine out of ten models here are trained with DPO. So that's something that has enabled the open-source community to instruction-tune their models as well, and the same is being used in many production models now. Mistral is using DPO. Llama 3 used DPO. These are very, very strong models, nearly GPT-4 level, and they're also starting to use these algorithms. And after all the optimization and math we went through, there's something very cool to see.
What is really fundamentally changing in the behavior, and I think this is a really good example, is that if you give an instruction and ask for an SFT output from an instruction-tuned model, you'll get something like this. But when you RLHF the model, you actually get a lot more detail in your answer, and it probably organizes the answer a little better. This is something that humans maybe prefer; that's why it's a property that emerges in these models, and it's a very clear difference between simply instruction-tuned models and models which are RLHF'd. So yeah, we discussed this whole RLHF routine, where we are directly modeling the preferences and generalizing beyond labeled data. We also discussed that RL can be very tricky to implement correctly, and DPO sort of sidesteps or avoids some of those issues. And we briefly touched upon the idea of reward models and reward hacking. When you're optimizing against learned reward models, you will often see things like this example: there is a way for the agent to just repeatedly crash the boat into objects to get more and more points. That wasn't the goal of the game. This is a very common example shown for reward hacking: if you do not specify rewards well, the models can learn weird behaviors that are not your desired intent. And this is something a lot of people worry about as well. Part of the reason is that reinforcement learning is a very strong optimization algorithm; it's at the heart of AlphaGo and AlphaZero, which result in superhuman models. So you have to be careful about how you specify things. And the other thing is, even optimizing for human preferences is often not the right thing, because humans do not always like things which are in their best interest. Something that emerges is that people like authoritative and helpful answers, but they don't necessarily like truthful answers. One property that shows up is that they prefer authoritativeness over correctness, which is maybe not something nice. Please go ahead, I'm curious.
speaker 2: Maybe ChatGPT being so widely used by the public will change how people write, because I always feel like now when I go to ChatGPT and try something, it gives me five detailed paragraphs of information. Sometimes I'm just annoyed that it's not what I wanted, but maybe in the original reward function, in the original ratings, people actually asked for that, I don't know.
speaker 1: Yeah, that's a great point, because as these models integrate more and more into our systems, they're going to collect more and more data, and they will pick up on things, maybe undesirable things as well. As far as I understand, ChatGPT is really cutting down on verbosity, which is a huge issue that all of these models are trying to address. Part of the reason why that emerges is that when you collect preference data at scale, people are not necessarily reading the answers; the Turkers might just choose the longer answer, and that's a property that actually goes into these models. But hopefully these things will improve over time as they get more of that data. And yeah, hallucination is not a problem that is going to go away with RL. And we talked a bit about reward hacking as well, biases, and so on.
But hopefully, what I want to conclude with is this: we started with pre-trained models, these things which could just predict text, and we got ChatGPT. And hopefully it's a little more clear how we go from something like that to ChatGPT. And that's where I'll end.
最新摘要 (详细摘要)
概览/核心摘要 (Executive Summary)
本讲座由斯坦福大学博士生 Archit Sharma 主讲,深入探讨了大型语言模型(LLM)在预训练之后所经历的关键阶段,旨在揭示如何从基础预训练模型演进至如 ChatGPT 般强大的对话助手。核心内容围绕三大技术路径:上下文学习(In-Context Learning, ICL)、指令精调(Instruction Fine-tuning, IFT)以及基于人类偏好的优化(RLHF/DPO)。
讲座首先回顾了 LLM 规模持续扩大的趋势(模型参数、训练数据量,如 Llama 3 使用约15万亿tokens),并指出预训练不仅学习知识、语法,更可能形成对智能体信念和行为的初步建模。随后,详细介绍了零样本(Zero-Shot)和少样本(Few-Shot)上下文学习,特别是 GPT 系列模型的发展,以及通过“思维链(Chain-of-Thought)”提示提升复杂推理能力的方法。
接着,讲座阐述了指令精调的必要性与方法,即通过在大量多样化任务(指令-输出对)上进行微调,使模型更好地理解和遵循用户意图,克服预训练模型与用户需求不一致的问题。讨论了评估(如MMLU基准)和数据集构建(包括使用强模型生成数据、高质量小规模数据的重要性)的挑战与进展。
最后,重点讲解了如何直接优化人类偏好,介绍了强化学习人类反馈(RLHF)的复杂流程(SFT模型 -> 奖励模型训练 -> RL优化)及其在 InstructGPT 和 ChatGPT 中的应用。同时,详细推导并介绍了直接偏好优化(DPO)作为 RLHF 的一种更简洁、高效的替代方案,解释了其如何通过数学转换将偏好学习转化为直接的分类损失,并已在众多开源模型(如 Mistral, Llama 3)中得到广泛应用。讲座强调,尽管这些技术取得了巨大进步,但仍面临奖励模型被“攻击”(reward hacking)、模型产生幻觉、以及人类偏好本身可能存在的偏见等挑战。
引言
Archit Sharma 开场点明本次讲座的目标:帮助听众理解大型语言模型从预训练阶段到能够像 ChatGPT 一样与用户交互的演进过程。这一过程主要涉及提示(prompting)、指令精调(instruction fine-tuning)和直接偏好优化(DPO)/强化学习人类反馈(RLHF)。
大规模预训练的基础
- 规模效应 (Scaling Laws):
- 模型规模持续增大:计算量(flops)和参数量不断攀升。讲者提到,图表显示截至约2022年,预训练计算量达到 10^24 flops,而当前(2024年)已远超 10^26 flops。
- 数据量激增:训练所需的文本数据(tokens)也随之增长。
- 数据点: 2022年约为 1.4 万亿 tokens。
- 数据点: 2024年,如 Llama 3 模型,训练数据量接近 15 万亿 tokens。讲者强调“appreciate that these are a lot of words”。
- 预训练学到的内容:
- 不仅仅是预测下一个词元,模型在预训练过程中学习到多种能力:
- 知识: 例如,“Stanford University is located in Santa Clara, California.”
- 语法 (Syntax)
- 语义 (Semantics)
- 多语言知识 (取决于训练数据分布)
- 涌现的智能 (Emergent Intelligence):
- 讲者提出,模型智能的来源不仅仅是事实性知识和简单的损失函数优化。
- 有推测性证据表明,优化“下一个词元预测”损失函数时,模型开始形成对“智能体的信念和行为的模型 (models of agents' beliefs and actions)”。
- 例子:关于保龄球和树叶下落实验的预测,当描述 Pat 是物理学家或从未见过该实验时,模型的预测会相应改变,暗示模型对人类认知有一定理解。
- 模型还学习到:
- 数学能力: 理解圆的方程和图形。
- 代码生成能力: 如 GitHub Copilot,通过注释和函数模板自动补全代码。讲者称其“pretty much every day”使用。
- 医学知识 (初步): 讲者提及自己会向 ChatGPT 等模型咨询健康问题,但不推荐他人效仿,“Please don't take medical advice from me.”
- 结论: LLMs 正在演变为“通用多任务助手 (general purpose multitask assistants)”。
- 不仅仅是预测下一个词元,模型在预训练过程中学习到多种能力:
上下文学习:零样本与少样本提示 (In-Context Learning: Zero-Shot and Few-Shot Prompting)
- GPT系列模型回顾:
- GPT-1 (约2018年):
- Decoder-only 模型,约 4.6GB 文本训练,12层 Transformer。
- 采用下一个词元预测损失。
- 虽然性能非顶尖,但展示了预训练对通用任务的潜力。
- GPT-2:
- 参数量从 1.17 亿增加到 15 亿。
- 数据量从 4GB 增加到约 40GB (通过 Reddit Upvotes 过滤数据)。
- 零样本学习 (Zero-Shot Learning) 的兴起: 模型无需针对特定任务进行微调或展示示例,即可执行任务。
- 例如:摘要(在文本后添加 "TLDR:")、问答(提供上下文后提问)、少量数学运算。
- 方法: 通过巧妙设计提示(prompt),“诱导 (coerce)”模型完成任务。
- 问答:提供Tom Brady的信息,然后提问。
- 指代消解:通过替换句子中的指代词为具体名词(如 "the cat" 或 "the hat"),比较模型给出的对数概率(log probabilities)来判断指代对象。
- GPT-2 在许多任务上仅通过扩大模型和数据规模就达到了当时的 SOTA 水平,无需任务特定的训练。
- GPT-3 (约2020年):
- 参数量从 15 亿增加到 1750 亿。
- 数据量从 40GB 增加到 600GB (当前已达 TB 级别)。
- 少样本学习 (Few-Shot Learning) 的显著效果:
- 在提示中提供少量任务示例(如翻译示例),模型即可学会执行该任务,无需梯度更新。
- 讲者感叹:“Isn't that like crazy? Like you guys did the assignment on translation, right? But this is what the modern nlp looks like.”
- 性能:少样本学习在某些任务上(如翻译)的表现接近甚至超越了专门微调过的SOTA模型。
- 涌现特性: 这种能力随着模型规模和计算量的增加而“涌现 (emergence)”。(讲者注:关于“涌现”的精确定义和观察方式,近期有研究提出不同看法,认为若x轴绘制得当,现象可能不那么“涌现”)。
- GPT-1 (约2018年):
- 思维链提示 (Chain-of-Thought Prompting, CoT):
- 动机: 传统的少样本学习在涉及复杂多步推理的任务上仍有挑战。
- 方法: 在提示的示例中,不仅给出最终答案,还展示详细的推理步骤。
- 模型不仅学习任务本身,还学习推理过程。
- 例子:数学应用题,展示解题步骤后,模型在新问题上表现更好。
- 效果: CoT 显著提升模型在推理任务上的性能,且效果随模型规模增大而增强,甚至超越监督学习的最佳模型(如 PaLM 约540B 参数时)。
- 零样本思维链 (Zero-Shot CoT):
- 无需提供带推理步骤的示例,仅在问题后添加一句引导语,如 “Let's think step by step.”(见本节末尾的提示构造示例。)
- 模型会自动生成推理过程并给出答案。
- 性能:虽不如多样本CoT,但远超标准零样本提示(例如,从17.7%提升到78.7%)。
- 启示: 与LLM交互时,需要思考如何通过提示引导模型展现其潜在能力,可以思考“预训练数据中什么样的内容会引发类似我想要的行为”。
- 自动提示设计: 甚至可以使用LLM来设计更优的提示。
- 上下文学习的局限性:
- 上下文窗口大小有限(尽管现在窗口越来越大)。
- 依赖“技巧性”的提示,不够直接。
- 对于更复杂的任务,仍需微调。
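A minimal illustration of the prompt formats discussed in this section (the example strings follow the style of the GPT-3 and chain-of-thought papers; `query_model` is a hypothetical stand-in for whatever model API is being used, not something from the lecture):

```python
# Hypothetical helper: assume query_model(prompt) sends the prompt string to
# some language model and returns its text completion.

# Few-shot prompt: show a couple of worked examples, then the new input.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => "
)

# Zero-shot chain-of-thought: no examples, just append a reasoning trigger.
question = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
            "How many tennis balls does he have now?")
zero_shot_cot_prompt = question + "\nLet's think step by step."

# completion = query_model(zero_shot_cot_prompt)  # model writes out its reasoning, then the answer
```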
指令精调 (Instruction Fine-tuning, IFT)
- 动机:
- 预训练模型的目标是预测下一个词元,而非直接辅助用户或遵循指令。
- 例子:要求 GPT-3 “向一个六岁孩子解释登月”,它可能会反问关于六岁孩子的问题,而不是直接解释。这表明模型“未与用户意图对齐 (not aligned with user intent)”。
- 目标: 使模型能够响应用户意图,遵循指令。
- 方法:
- 与传统微调在单一任务上进行不同,指令精调是在大量、多样化的任务上进行。
- 收集大量的“指令-输出 (instruction-output)”对。
- 任务类型涵盖:问答、摘要、翻译、代码生成、推理等。
- 数据规模:从最初的单一任务扩展到成千上万的任务,数百万级别的样本。
- 使用这些数据对预训练模型进行微调。(本节末尾附有数据格式的简单示意。)
- 评估 (Evaluation):
- 是一个“极其棘手的话题 (extremely tricky topic)”,存在很多偏见。
- MMLU (Massive Multitask Language Understanding) 基准:
- 测试模型在广泛知识领域的表现(如天文学、生物学等)。
- 通常是多项选择题。
- 近期进展显著,模型得分不断攀升。90% 被视为一个重要门槛(大致相当于人类水平)。Gemini 模型据称已超过此分数。
- 担忧: 测试集可能泄露到训练集中 (“test sets are leaking into your training data set”)。
- 讲者反思:如果模型在所有我们关心的任务上都表现良好,评估方式的边界是否还那么重要?
- 指令精调的效果与模型规模的关系:
- 以 T5 模型为例,模型规模越大(从 T5-small 到 T5-XXL 11B参数),从预训练到指令精调带来的性能提升越显著(提升幅度从+6.1到+26.6)。
- 这表明“预训练持续带来回报 (ptraining just keeps on giving)”。
- 指令精调数据集的构建与经验:
- 存在大量开源指令微调数据集。
- 使用强LLM生成数据: 可以用 GPT-4 或 Claude 等强模型为较小或开源模型的指令精调生成“指令-输出”对,这是一个成功的策略。
- 数据质量与数量: 有研究表明(如 "Less is more for alignment" 论文),高质量的少量数据(如数千样本)也可能达到良好效果,而非盲目追求数百万样本。这是一个活跃的研究领域。
- 众包 (Crowdsourcing): 如 OpenAssistant 项目。
- 指令精调的局限性:
- 数据收集成本高昂: 人工标注,尤其是复杂问题(如物理博士级别问题)的答案,非常昂贵。
- 开放式/创造性任务缺乏唯一正确答案: 难以生成标准答案。
- 监督学习的惩罚机制问题: 所有词元级别的错误被同等惩罚。
- 例子:将《降世神通》(Avatar) 描述为“冒险电视剧”可能还行,但描述为“音乐剧”则是更严重的错误,但两者受到的惩罚可能相同。
- 受限于人类标注者能生成的答案质量: 模型能力可能已超越普通人类标注者。
- 根本性错位: 即使是指令精调,其优化目标(预测词元)与最终目标(优化人类偏好)之间仍存在巨大不匹配。
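A minimal sketch of what the instruction-tuning data and loss amount to in practice (the records below are invented examples, and the formatting template is just one common convention, not the specific one used in the lecture): each example is an (instruction, output) pair, and the model is fine-tuned with ordinary next-token cross-entropy on the formatted text.

```python
# Invented examples of (instruction, output) records used for instruction tuning.
instruction_data = [
    {"instruction": "Summarize: The city council met on Tuesday to discuss the new park budget...",
     "output": "The council debated funding for the new park."},
    {"instruction": "Translate to French: Where is the library?",
     "output": "Où est la bibliothèque ?"},
]

def format_example(example):
    # Concatenate instruction and output into one training sequence; the loss is
    # standard next-token prediction, typically computed over the response tokens.
    return f"Instruction: {example['instruction']}\nResponse: {example['output']}"

training_texts = [format_example(e) for e in instruction_data]
# These strings would then be tokenized and fed into ordinary supervised fine-tuning.
```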
优化人类偏好:RLHF 与 DPO (Optimizing for Human Preferences: RLHF and DPO)
- 目标: 直接优化模型以满足人类偏好,而不是预测人类书写的文本。
- 核心问题:
- 收集演示数据(如指令精调中的答案)昂贵。
- LLM 的目标函数与人类偏好之间存在错位。
强化学习人类反馈 (RLHF - Reinforcement Learning from Human Feedback)
- RLHF 流水线 (Pipeline):
- 获取一个经过指令精调的模型 (SFT Model): 这是起点。
- 训练奖励模型 (Reward Model, RM):
- 问题1: 人类打分成本高且不可靠。
- 让模型生成回答,人类打分(如1-10分),这种方式难以扩展且人类评分校准困难、噪声大。
- 解决方案1: 训练一个预测人类打分的模型。 这是一个标准的机器学习回归问题。
- 问题2: 人类对绝对分数的判断非常主观和不一致。
- 解决方案2: 人类进行比较排序。 让人类对模型生成的多个回答进行排序(哪个更好/哪个最差),这比打绝对分数更容易、更可靠。
- 例子:对于一篇新闻文章,让人类判断哪份摘要更好。
- Bradley-Terry 模型: 用于从成对比较数据中学习偏好。该模型假设人类选择回答 y1 优于 y2 的概率与两者内在奖励值的差异相关: P(y1 > y2) = sigmoid(R(y1) - R(y2))。
- 奖励模型训练目标: 最大化被选中的回答(winning completion, yw)与未被选中的回答(losing completion, yl)之间的奖励差异的对数似然,即 log sigmoid(R(yw) - R(yl))。奖励模型通常从预训练语言模型初始化。(本小节末尾附有该成对损失的最小代码示意。)
- 通过强化学习优化语言模型:
- 目标函数: 最大化语言模型 P_theta(y|x) 生成的回答 y 在奖励模型 R_phi(x,y) 下的期望奖励。
- 关键区别: 此时数据 y 是从当前正在优化的模型 P_theta 中采样得到的,而非固定的数据集。
- 问题: 奖励模型本身是学习得到的,存在误差。直接优化可能导致模型“攻击 (hack)”奖励模型,找到奖励模型错误地给予高分的无意义输出(“gibberish completions”)。
- 解决方案: KL 散度惩罚项。在优化目标中加入一项,惩罚当前模型 P_theta 与初始SFT模型 P_ref(或 P_pt,预训练模型)之间的KL散度,防止模型偏离过远: Objective = E[R_phi(x,y)] - β * KL(P_theta(y|x) || P_ref(y|x)),其中 β 是权衡系数。
- RL 过程简述: 模型生成多个回答 -> 用奖励模型评估 -> 更新模型参数以增加高奖励回答的概率。
- RLHF 的成功:
- InstructGPT 是遵循此流程的首个重要模型。
- 即使是较小的模型,通过 RLHF 训练后,其表现在人类评估中也可能优于人类自己撰写的参考答案(如摘要任务)。
- 模型规模越大,RLHF 效果越好。
- RLHF 的复杂性:
- 实现起来“极其复杂 (incredibly complex)”,涉及多个模型的训练、大量超参数调整、采样效率等问题。
- 这使得 RLHF 在早期主要局限于资源雄厚的大型机构。
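A minimal sketch of the pairwise Bradley-Terry reward-model loss referenced above (it assumes the reward model has already produced scalar scores for the preferred and dispreferred completions of each prompt; the numbers in the example are made up):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: maximize log sigmoid(R(y_w) - R(y_l)).

    score_chosen / score_rejected are the reward model's scalar scores for the
    human-preferred and dispreferred completions of the same prompt.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with invented scores for a batch of two comparisons.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```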
直接偏好优化 (DPO - Direct Preference Optimization)
- 动机: 作为 RLHF 的一种更简单、更易于实现的替代方案。
- 核心思想:
- RLHF 的优化目标(最大化奖励,惩罚KL散度)存在一个闭式解 (closed-form solution): P_optimal(y|x) = (1/Z(x)) * P_ref(y|x) * exp(R(x,y)/β),其中 Z(x) 是归一化因子(配分函数),计算棘手。
- 通过代数变换,可以反过来用语言模型(最优策略 P_optimal 和参考策略 P_ref)来表达隐式的奖励模型: R(x,y) = β * log(P_optimal(y|x) / P_ref(y|x)) + β * log Z(x)。
- 关键洞察: 当我们将此奖励模型代入 Bradley-Terry 模型的偏好损失(即 log sigmoid(R(yw) - R(yl)))时,棘手的配分函数项 β * log Z(x) 因为对 yw 和 yl 相同而相互抵消。
- 因此,损失函数可以直接用语言模型 P_theta(替代 P_optimal)和参考模型 P_ref 的概率来表示,无需显式训练奖励模型或进行RL: Loss_DPO = -log sigmoid( β * log(P_theta(yw|x) / P_ref(yw|x)) - β * log(P_theta(yl|x) / P_ref(yl|x)) )。
- DPO 的优势:
- 将复杂的 RLHF 流程简化为一个直接的分类损失函数。
- 实现简单,计算效率高。
- 性能与 RLHF 相当,有时甚至更好。
- DPO 的广泛应用:
- 目前 Hugging Face 排行榜上,十个模型中有九个使用 DPO 进行训练。
- Mistral、Llama 3 等知名模型也采用了 DPO。
成果与应用
- InstructGPT:
- 定义了“SFT -> RM -> RL”的流程。
- 能够处理约3万种不同任务。
- 相比 GPT-3,能更好地遵循指令,生成符合用户意图的回答。
- ChatGPT:
- 基本沿用了 InstructGPT 的流程,但更侧重于对话优化。
- 核心算法技术与讲座所讨论的一致,但数据类型和处理是关键。
- 展现了强大的指令遵循和意图理解能力,即使对于训练数据中可能未见过的复杂指令。
- RLHF/DPO 后的模型行为变化:
- 相比仅经过SFT的模型,RLHF/DPO调优后的模型回答通常更详细、组织更清晰。
- 讲者提到,这可能是因为人类偏好数据中,标注者倾向于选择更长、更全面的答案,导致模型学习到这种“冗余性 (verbosity)”,尽管目前模型正在努力减少不必要的冗余。
挑战与未来展望
- 奖励 hacking (Reward Hacking):
- 模型可能会找到奖励模型的漏洞,通过非预期行为(如游戏中反复撞墙得分)来最大化奖励,而非完成真实目标。
- RL 是强大的优化算法(如 AlphaGo),因此需要仔细设计奖励。
- 人类偏好的局限性:
- 人类偏好并不总是“正确”或符合最佳利益。
- 例子:人类可能更偏好“权威性的 (authoritative)”回答,即使其并非完全“真实的 (truthful)”。
- 幻觉 (Hallucinations): RLHF/DPO 并未完全解决幻觉问题。
- 偏见 (Biases): 模型可能从数据中学习并放大偏见。
结论
Archit Sharma 总结道,通过本次讲座,听众应能更清晰地理解从预训练模型(仅能预测文本)到 ChatGPT 这样强大对话模型的演进路径。这一路径涉及了从上下文学习、指令精调到基于人类偏好的复杂优化过程。