2025-06-21 | Stanford CS336 | Language Modeling from Scratch | Spring 2025 | Lecture 15: Alignment - SFT/RLHF

From GPT-3 to ChatGPT: RLHF and Language Model Alignment Methods in Detail

Media details

Upload date
2025-06-21 17:02
Source
https://www.youtube.com/watch?v=Dfu7vC9jo4w
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro

Transcript

speaker 1: Okay, so we'll get started. Welcome to lecture 15. We've got two pieces left in the class, and they're going to cover various aspects of post-training. Up until now we've focused very much on the big pre-training systems and data components, and now we're going to take the big pre-trained model and make it useful and safe in various ways. So that's the next two lectures from me. Today is going to be RLHF and safety alignment, and then Thursday is going to be RL from verifiable rewards, so things like reasoning training and math will be on Thursday. As I said, today we're going to shift from pre-training to post-training. Percy, in the very last lecture, did cover some material about post-training data, but really the focus today is this big transition that we saw in the field. We had GPT-3, a really remarkable system, really impressive, lots of pre-training, lots of compute. But it was not really a useful system. I guess there were a couple of startups building ad copy and things like that, but it was not very useful. It didn't follow instructions. It didn't do anything particularly interesting from a product point of view. And then all of a sudden we got ChatGPT, which can do all sorts of amazing things and follow instructions, and we've seen what that has done to society since then. So today's focus is on this arrow right here: how do we take a pre-trained system like GPT-3 and make something like ChatGPT? We're going to try to get into the nuts and bolts of that process. Most of you have probably never worked on things like controllable generation or the previous generation of text generation systems, but modern instruction-following models are just amazing. This is one of my favorite examples from Sébastien Bubeck's Sparks of AGI paper in 2023, around when GPT-4 came out: the model can follow this very long block of nested, compound instructions and then combine that with its coding capability to output zero-shot matplotlib code. I think all of you take this for granted now. Yes, of course ChatGPT can follow ten instructions at once, but it's kind of amazing that it can do this, and part of my excitement about this lecture comes from the fact that it can. The other thing that is very important is that now that these systems are out in the wild, safety and content moderation become really important: safety from the perspective that these models might get misused, say for scams, and content moderation if you're thinking about these being useful products that you can ship and that people would pay for. People don't really want to pay for, or put ads on, systems that are horrifically toxic. I think one of the big reasons why ChatGPT has been so successful is that it has really significant guardrails around it. Okay, so given that the goal today is to enable much tighter, better control of language models: the mental model for pre-training is that it packs the model with all sorts of capabilities.
After pre-training, the model is able, somewhere within its parameters, to do lots of things like reason and answer questions, but it's not going to do them out of the box. Today, what we're going to try to do is get models to do those things out of the box. The way we'll do that is to collect data of the kinds of behaviors we do want from the language model and train it to do those things. So the questions you should be asking now are: what does that data look like? How hard is it to collect? Percy touched on this a little bit, but given the importance of the data, I'm going to re-emphasize it, and I'll have some interactive exercises to go over it. Then there are algorithmic questions: how do we make use of that data? Certain kinds of data are easy to use; if you have expert demonstrations, you just train to imitate them. But if you have things like pairwise feedback, where model output A is better than model output B, how do we make use of that? And finally, how do we scale this up and do the usual things we've been doing in this class? The structure of this lecture is roughly going to mirror the InstructGPT paper, because a lot of the post-training pipeline we have today still comes from the InstructGPT paper. So the first part of this lecture is going to be on supervised fine-tuning. If you look at the InstructGPT paper, you'll see a diagram that roughly describes a three-step process for building an instruction-following model. Part one of this lecture is the leftmost part, where we do supervised fine-tuning on expert demonstrations. In part two, we'll follow the next two parts of that structure: we'll talk about reinforcement learning and pairwise feedback broadly. For the ingredients to get this first part working, there are two things we have to think about. The first is the training data: if you're going to imitate expert demonstrations, you'd better have expert demonstrations. What does that look like? The second thing is the method: you have data now, how are you going to adapt to it? There's an obvious answer, which is just do gradient descent, but there's also a non-obvious part to the answer, and in case you haven't been following how people build these models today, it might still be surprising. I'll leave that as a teaser for later. Okay, so in Percy's lecture he already mentioned several different kinds of instruction data, but today we're going to walk through a couple of them and do a little bit of an interactive exercise, so those of you who have your laptops open can use them for good. I want to talk about two different details. One is what's inside these datasets. People often say data matters a lot, and I think post-training is one place where it matters even more than before, because you're using very small amounts of data to get exactly the behaviors you want. If you have noisy instruction tuning data, you're going to get some pretty crazy behavior out of your models.
The second is what kinds of things you should be paying attention to if you're in charge of post-training data collection: what might matter? I've taken three datasets constructed in three very different ways, you might even call them three different paradigms, for building instruction-following or post-training data. We're going to go through each one, look at them closely, and then think a little bit about what's going on with these datasets. Okay, so first I'll talk about FLAN. This is by a bunch of Google folks, and FLAN is essentially constructed by aggregating a bunch of training datasets from NLP tasks. If you look at it, you see all sorts of different tasks: Natural Instructions V2, which has a bunch of question answering; T0-SF; adversarial QA; topic classification. So basically this was constructed by taking existing NLP datasets that do all sorts of individual tasks and aggregating them into one big meta-dataset. That's one approach to building such datasets. Then we've got Open Assistant on the right. This was, I think, a pretty unique endeavor in which a bunch of online enthusiasts got together and decided to write instruction tuning data for language models right after the release of ChatGPT. The excitement for this kind of thing was really high, and so there's actually a lot of good, high-quality human-written data from that effort. And lastly, this is a bit of self-advertisement, but as a representative of language-model-generated post-training data, or AI-feedback-style data, I'm going to talk a little bit about some of the data from Stanford Alpaca. So let's just look at examples; I think looking at examples and talking about them is very useful. These are random examples taken from the FLAN dataset, and you can see the types of things that are in here. You've got things that look like pretty normal instruction tuning data, like "write highlights for this article: sauntering down leafy avenues past Dutch step-gabled buildings... and in the end, there's even more information on travel in the Netherlands at www.holland.com", and then it answers "the least known of the Dutch cities, The Hague was a village..." and summarizes that as a highlight. You've got something like "what is this text about? Here are your four options", and the answer is "business", so that's a multiple-choice training example. Percy talked about the Enron dataset, so you can all smile a little bit, but there are things like taking the Enron email dataset and appending "write a subject line for this email", and now you've got supervision for that task. For this one, I guess no one here has probably worked on data-to-text generation, but this is from a dataset called E2E, where you have a database entry and you're supposed to write a sentence that describes that restaurant. So immediately you see that you can probably get a lot of data for free this way: there are a lot of NLP training datasets, and if you put them all together you get a really big aggregated dataset. In that sense, FLAN was ahead of its time and produced a ton of data for this kind of thing.
But we also see in many ways that this can be somewhat unnatural. We've already seen that the Enron dataset is a little bit weird, and you can see things like: here's a text, and then you append the options that turn it into a task. You can see the surgery you have to do, very visibly, in order to make this kind of dataset. And if you look at this, you'll agree with me that this isn't your usual chat interaction for something like ChatGPT. Another example is Alpaca. This was a really early attempt at using language models to generate instruction tuning data. To describe the procedure: there was a seed set of human-written instructions, and a language model was used to essentially generate more instructions; that's the left column. Then you use something like InstructGPT to fill in the response. So now we have something on the left that looks a little bit more like standard ChatGPT inputs. If you compare this to FLAN, FLAN is a very benchmark-centric set of tasks, while this feels a lot more like a set of interactions someone might just throw into a chatbot. And the response is almost always long-form natural language, whereas with FLAN it can often be quite short, like one word or a phrase. Of course, we also see that on the left these are in some ways not very diverse inputs; they're very short instructions. And then Open Assistant is the third leg of this instruction tuning saga. You see more complex queries on the left, and because back then people were really into writing long, detailed supervision for models, you see actually really detailed responses. This one even has a citation on what makes the answer correct. So, very high quality, but also very difficult to produce. And now, the first interactive task in this class: for those of you who have your laptops open, please go to this URL. It should be a Google form with a single prompt, and we're collectively going to crowdsource an instruction tuning response. I'll give you all, let's say, five minutes to do so. Let me know if the link is wrong, but I did test it last night, so hopefully it's working. Then we'll look at the responses briefly, and I'll talk about why I did this exercise; there is a teachable moment here rather than just getting you off your laptops for a moment. Okay, excellent. I think there's a decent number of responses, so I'm going to put them up. Let's see if I can put them up. Yeah, there you go. Okay. In many ways, I think this reflects the kinds of data you get; if anything, you're all more motivated to do this task than the standard crowd worker. You've got this person, I'm not sure what's going on here, but this is probably ChatGPT: there's a lot of emojis. I'm getting trolled. But this one, "I am preparing for...", that's a good response. You've got the "nah, fam", which of course is the kind of thing you'll get out of crowdsourcing. And hopefully one thing you've seen or felt as you were doing this is that it's actually really difficult to write long-form responses,
especially for something that you aren't prepared for. So you get a lot of short responses like this. It's very difficult, I think, to get people to write long, detailed responses like this one; the ones at the very top are often from ChatGPT. And of course, you'll get things like this one I saw, "ice cream is a frozen dessert, typically made from milk or cream", and you're going to have to filter out those kinds of things that you get through crowdsourcing. So why did I talk about this? Well, now you have a sense of what this task is like: annotators in the wild, even if they're experts, will be under time constraints. And one of the reasons why things like AI feedback, or using LMs to refine or generate these kinds of data, have gotten so popular is that if you look at the GPT-4o response to this question, it's pretty good. It's very long, it's very detailed. Getting a human to generate that kind of response is going to take a lot of effort and a lot of cost. So if you're in charge of human data collection at one of these labs, you have to think about how to take what I showed you in the spreadsheet and incentivize people to generate something that looks like this instead. That is no easy task at all; it's a very difficult crowdsourcing task. Okay? So the responses we've just seen vary quite a bit in things like length. We saw the ChatGPT-isms, like bullet points, and lots of style variations. We saw in the Open Assistant example that sometimes people put in references, sometimes they put in very complex, deep knowledge. Is that good or bad? I'll talk about that in a moment. And there are other important aspects to this process: maybe you want to collect a ton of data, or very little data that's high quality, so you've got that trade-off. You also have to think a lot about safety. The data we collected just now is just capabilities data; it makes models answer things like "what is CS 336?". It does not help us make our models refuse malicious instructions and things like that. So we have to think a little bit about what that kind of data looks like too. Okay. Length has always been the big gorilla in the room for all of these datasets. Back in 2023, when it was very popular to generate these kinds of instruction tuning datasets, Yizhong Wang and others at UW put together a very nice survey looking over the many different kinds of datasets created early that year. If you look at the length of both the prompts, the inputs, and the responses, the completions, you see really different lengths of both. Inputs are probably a measure of the complexity of the task in many ways, and the outputs are in some sense a measure of how much you pushed the annotators, or whether you used AI-generated responses. And one thing you should all be aware of, say you get put in charge of making a new language model and you're in charge of post-training: if you're using human evals, people have a strong preference for lists, and so do AI judges, actually; they also have a strong preference for lists.
And people have a strong preference for longer outputs, something like a 60 to 70 percent preference for the longer output, and so do AI judges. This is a little concerning, because you don't want to be optimizing just for the stylistic content of your responses; ideally you want to be using post-training to do things like reduce hallucinations and actually make the model more capable. One thing we do see is that these factors are not super relevant for benchmark performance. If you look at, for example, MMLU performance, despite the really big variation in length across a lot of these models, most of the instruction tuning datasets, even the simple ones, give you boosts over the base model, which is the very top row. I'll say here that chat-style evaluations have their place: Chatbot Arena, AlpacaEval, these kinds of automated evals help you understand user engagement and things like that. But benchmarks also have a very important place, because when you post-train, you don't want to be too affected by, for example, length biases in open-ended domains. So you want to be really careful of these effects, and you want a diverse array of evaluation strategies to try to avoid those pitfalls. One other thing that really trips people up when they start thinking about this is to say: what I want is to collect high-quality data, and high-quality data has lots of deep knowledge and lots of citations. That's a reasonable thing to say, and Open Assistant had a great example of this. Here's an example input-output pair: you've got a question about monopsony in economics, and there are references in the response on the right. Now let's say we have a model and we fine-tune it to take the left side as input and reproduce the right side as output. You can think about two different things this process is going to do at the same time. One thing it's going to do is associate monopsony with that citation. It's learning new knowledge, so that is a positive thing. But it's also going to do a second thing at the same time, which is the more generalized lesson: if you ask me about a complicated concept, I'd better finish the output with a reference. So the first thing is teaching new knowledge, which is good, but the second thing is teaching the model to hallucinate. If the model doesn't already have, somewhere within its parameters, an association between monopsony and this citation, this Bivens and Mishel reference, what might happen instead is that it just learns: whenever I get a complicated input, I should give a response and then make up a reference at the very end. Those are two competing explanations for what is happening here, and this is going to motivate, in some ways, the second part of this lecture. John Schulman has a great talk, I think he gave it at Berkeley, where his argument is basically: if you do this kind of thing, you're going to encourage the model to hallucinate. The model doesn't have the knowledge to answer a question, and you force it to answer that question.
What it's going to learn is, of course, the knowledge in some abstract sense, but it will also learn the other lesson: I just need to make something up in order to type-check what the response should look like. Okay, yes, there's a question.
speaker 2: The model develops a sense of "okay, I think I need to add a citation here; let me search for the citation, either from memory or by actually using a database." The fact that the model learns "here I should insert a citation" is actually a correct and desirable thing, and the fact that it's a made-up citation is either a memory error or something you could maybe augment with tool usage in the text. So I don't see why the behavior of needing to add a citation is itself problematic, if you can fix either the memory issue or the tool-use issue.
speaker 1: Sure. To repeat the question, which was really more of a comment: learning to put in a citation isn't itself a bad thing; maybe you augment the model with tools and it actually gives the right citation. That's a fair point. But the thing to point out here is the deeper issue with token prediction: you're teaching the model to predict the right kinds of tokens, and here, essentially, the lesser of the two errors is that hallucinating is less bad for my loss than not producing the reference at all. The structure of the response always has to be fulfilled, because you have to fill in tokens at the right places. Of course, at scale, if you know the facts, or if you have the right tools, those are good; you want to make the predictions in the right places. But I do think this is very indicative of a failure mode models can get into when you try to get them to do things they can't. If your SFT data is much more advanced than what your pre-trained model can naturally do, you run the risk of teaching the model this alternative shortcut behavior instead of the right behavior. So that's John Schulman's argument, and I think he makes a fairly reasonable case that this is one of the reasons why on-policy RL, reinforcement-learning-style training, is an important thing to do: you want to know what the model already knows and only teach it those things, to avoid hallucination, and whenever it encounters some fact it doesn't know, maybe you should change your fine-tuning data to say "I don't know that fact" instead of forcing the model to answer. We see this in knowledge-storage studies as well, where people have shown it's much easier for models to reproduce known facts than to learn unknown facts; it just takes a lot longer for models to learn facts that weren't shown in pre-training. That matches what you'd expect from this phenomenon. So, to summarize: there's a very counterintuitive phenomenon in instruction tuning, which is that you can have an instruction tuning dataset that is fully correct and actually very rich, and that might not be good for your LM, because it's going to teach your language model to make up facts to match that depth of knowledge. That has always been one of the arguments for why you want to be really careful both with distillation data, where the teacher model is stronger than your student model, and with human annotation, where the human might be much more knowledgeable than the model. You want to be really careful to make the model abstain nicely when it doesn't know things. In principle, reinforcement learning against correctness could help, and we'll talk about that in a moment. Optimizing this at the instruction tuning level is just really messy and very difficult; I don't think people have really nailed it down, at least in the open research literature.
The other thing I want to talk about briefly, because I think this isn't necessarily something that can be solved with instruction tuning alone, is safety, and the trade-offs there. Language models need some guardrails: they're deployed straight to end users, they're very capable, so they might be used for misinformation or for generating things like scams or spam. So there's a need for safety tuning these models. In parallel with a lot of the research on instruction tuning, there's been quite a bit of work studying safety tuning as well. Some of the early work in this area has shown that even a small amount of safety tuning data mixed into the instruction tuning process can make models much safer, paralleling the finding for instruction tuning that if you have a strong enough pre-trained model, even a small amount of instruction tuning data gets you a lot of the way. Not to say that's sufficient, but it gets you to a reasonable point. The core trade-off with safety tuning that I'll touch on in this brief section is the trade-off between refusing and not refusing too much. If you have unsafe requests, you want your safety-tuned model to refuse to answer. But then you have other, actually safe requests that look like unsafe requests, like "how can I kill a Python process?" We all know that's a reasonable question to ask, but if you don't understand English very deeply, you might think, "killing sounds very dangerous, maybe I should refuse to answer that question." So how do you make models understand this nuance? It's a very tricky thing to do purely in the instruction tuning setting, and so a lot of what people have done is come up with carefully curated small instruction tuning datasets to try to balance this trade-off. Some research has shown that even around 500 examples can make models follow some of these safety guidelines. To put this together: instruction tuning is surprisingly powerful. You might think that, given how powerful things like ChatGPT are, there's a ton of complexity in getting anything that works. But you'll find that if you take a fairly standard instruction tuning dataset like OpenHermes or Open Assistant, take a base model, and fine-tune on it with reasonable hyperparameters, you'll get a model that behaves a lot like Llama or ChatGPT. It won't be quite as good, and there's a lot of extra work you'd need to do to optimize it, but you can get pretty far. The second thing to remember is that the notion of high-quality data is just very complex; you have to reason about it really carefully, and it's not obvious how. And the last thing is that even a small amount of data can have great leverage at this stage in changing how models behave. The last thing I want to end this section on is how to actually do this instruction tuning. There's a flippant answer, which is: well, you've got demonstrations, just put in the instruction and the response and do some gradient descent. We all know how to do gradient descent at this point.
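As a rough sketch of what that looks like mechanically, here is a minimal SFT step in PyTorch, assuming a Hugging Face-style causal LM and tokenizer. The chat template, batching, and packing details are omitted, and the convention of masking prompt tokens with -100 is standard practice rather than anything specific to this lecture's recipe.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, instruction: str, response: str) -> torch.Tensor:
    """Next-token cross-entropy on the response only; prompt tokens are masked.

    Minimal sketch: real pipelines batch examples, apply a chat template,
    and pack sequences, but the core objective is the same.
    """
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100  # ignore loss on the prompt tokens

    logits = model(input_ids).logits  # (1, seq_len, vocab_size)
    # shift by one: predict token t+1 from positions up to t
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return loss  # backprop this and take an optimizer step as usual
```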
And in most academic settings, that's basically it: you do your small-scale gradient descent and you're done. But if you're at a frontier lab and you've got more compute and money than you know what to do with, then you've got a lot of compute and a lot of data, and you can scale this whole process up quite a bit. Modern instruction tuning pipelines are starting to look a lot like pre-training pipelines, and increasingly the boundaries between pre-training and instruction tuning are getting blurred. Because if you think about it, instruction tuning data is still just a sequence of tokens, so I can throw it into my pre-training process, and that's a totally valid thing to do. This is an increasingly popular idea. The closed labs don't tell us much, but the bits and pieces people have told me suggest this is what they're doing, and a lot of the open groups from China basically do this now. So you do your usual pre-training, and then you start mixing instruction tuning data into pre-training: toward the tail end, especially as you're annealing the learning rate, you start putting in a lot of this higher-quality or instruction tuning data. At the very end, you might still do a second, short instruction tuning round, but it can be smaller, because most of your data has already gone into the second stage, what people call mid-training. This is nice because it lets you scale up without catastrophic forgetting issues, and you might get more leverage out of your data because it's integrated more deeply into pre-training. To give you a sense of what this looks like, and it's a bit of a shame that data mixes are often closely guarded secrets, I've taken this figure from MiniCPM, which we've talked about before, a great paper from one of the Chinese groups. They have a two-stage training pipeline: a first stage of pure pre-training, and if you look at the pie chart, that's all pre-training datasets, Common Crawl, code pre-training data, the Pile, Dolma, all thrown into one big pie. Then they have a second stage, which they call the decay stage. If you remember my lecture on scaling laws, I talked about WSD: warmup, stable, decay. The first stage is the stable stage, and then in the decay stage, what have we got? We've got Wikipedia, what people might call high-quality data; we've still got the pre-training stuff mixed in, so it's not pure post-training data. But if you look at the right, we've got code SFT, Chinese books, UltraChat, Stack Exchange question answering, Evol-Instruct, OSS-Instruct, and all sorts of other things. Those are all instruction tuning or instruction-tuning-adjacent datasets thrown into the second half of pre-training. I think this is used by most models today, and MiniCPM and other models derived from that lineage have definitely publicized it. I think it's extremely effective, and so I think everyone has been following this recipe.
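To make the two-stage idea concrete, here's a toy sketch of a WSD-style data-mixing schedule. The source names and weights are made up for illustration; they are not the MiniCPM mixture, which isn't fully public.

```python
import random

# Hypothetical mixture weights, purely illustrative: a "stable" stage of
# ordinary pre-training data, then a "decay" stage that mixes
# instruction-tuning-adjacent data in alongside pre-training data.
STABLE_MIX = {"common_crawl": 0.60, "code": 0.25, "papers": 0.15}
DECAY_MIX = {"common_crawl": 0.30, "code": 0.15, "wikipedia": 0.15,
             "sft_chat": 0.20, "stackexchange": 0.10, "code_sft": 0.10}

def sample_source(step: int, total_steps: int, decay_frac: float = 0.1) -> str:
    """Pick the data source for this training step: switch to the decay mix
    for the final `decay_frac` of training, when the LR is being annealed."""
    in_decay = step >= (1.0 - decay_frac) * total_steps
    mix = DECAY_MIX if in_decay else STABLE_MIX
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]

# e.g. sample_source(95_000, 100_000) will usually draw from the decay mix
```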
One last comment I'll make before we move on to RLHF: this whole process makes it very difficult to reason about pre-trained models versus post-trained models. If you look at recent releases from Qwen or whichever other companies you're looking at, and they say "base model", that base model is probably the output of the end of this process, so it has basically gone through an instruction tuning phase implicitly, through its mid-training. We don't exactly know what the mixes are for a lot of these closed models, but it does mean that the term "base model" is increasingly questionable; it's unclear what it really means. So that's my side comment, useful if you're thinking about base models. Yes?
speaker 2: [largely inaudible] With the two-stage training, is the switch to the mid-training mix tied to the decay stage of the learning-rate schedule, where you see that big drop in loss?
speaker 1: That's right. That was the motivation for a lot of the two-phase training for these groups: they basically use the large drop in loss as a way to anneal the model into the right mode. There are increasing studies into what the optimal point to switch is, and I think it's a little more nuanced than that, but to first order this has been a very effective recipe. Yes?
speaker 2: Is this mainly about avoiding catastrophic forgetting? And does it also help with the citation issue from earlier, since that could happen here too?
speaker 1: Yeah, there are two questions packed into one. The question was: is this primarily about catastrophic forgetting, and does it also help with the citation issue? To answer the second part first, I don't think it helps with the citation issue, because, just as a type-signature matter, the only thing that can help with the citation issue is knowing which facts the model knows. You either have to ensure that the model always knows the citation facts before you show it this SFT data, or you have to check whether the model knows a fact and only show it that data if it does. This approach doesn't do that: it will always, unconditionally, put in those data points whether or not the citation has been learned, so it has no way of fixing that. On the catastrophic forgetting side, yes, I think that is one of the motivations. If you have a lot of SFT data, your trade-offs are pretty tricky, because unless you do this kind of pre-training mixed in with post-training, you have to think about regularization and tiny step sizes to avoid messing up your pre-trained model. So I think this is partially motivated by catastrophic-forgetting-adjacent issues; it keeps the model more general. Yes?
speaker 2: Going back to the John Schulman citation example: the model only learns the wrong thing if it doesn't know the fact at training time, and if it does know it, it's fine?
speaker 1: That's right, or at least that's the claim: if the model did know this citation right here, then out of these two competing mechanisms, what it would learn is, "whenever I see this example, I should retrieve my knowledge about Bivens and Mishel and use that as the citation." I think the reality is always more complicated: what does it mean for a model to know something? How reliably does it know it? These two mechanisms are always in some superposition for a model, and it's really a question of which one is more dominant. If a model has no idea about this, it's probably the second mechanism that dominates, whereas if the model knows it reliably, it's more likely to just learn the correct citation rather than a broad, general tendency to hallucinate. Yes?
speaker 2: Have people tried putting into all of the pre-training data some kind of thought tokens that tell the model, "this looks like a fact, let me check whether I actually know it"? So the model queries itself, checks whether it's getting consistent answers, and if so, approves it. The entire pre-training process would then have this "check myself" step built in.
speaker 1: Okay, that's a very interesting idea. Just to repeat it: has anyone done something where you put in thought tokens, or the model checks its own knowledge of facts as it trains, something like that? Is that roughly right? Yeah. Depending on how you interpret or implement that exact idea, it starts to look a lot like reinforcement learning. For example, there's a method called Quiet-STaR from folks here at Stanford; Noah Goodman, Eric Zelikman, and others essentially try to learn the thinking process of a model by predicting what happens at the answer token, and then, based on whether or not it's correct, reinforce the model to have good thoughts. Actually, the even closer analogy is STaR, the original paper: if the model gets something correct, that thinking process gets fed back into the model's training, and if it's wrong, it doesn't. That's very similar to what you're proposing, which is to adaptively train the model based on the correctness of its knowledge.
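For reference, the STaR-style loop being described looks roughly like this. It's a schematic sketch: `sample_rationale_and_answer` and `is_correct` are hypothetical stand-ins for the model's sampling procedure and an answer checker.

```python
def star_round(model, problems, sample_rationale_and_answer, is_correct):
    """One STaR-style round: keep only the self-generated rationales that led
    to a correct answer, then fine-tune on those traces."""
    kept = []
    for problem in problems:
        rationale, answer = sample_rationale_and_answer(model, problem)
        if is_correct(problem, answer):
            # correct reasoning gets fed back into training
            kept.append({"prompt": problem, "target": rationale + answer})
        # incorrect rationales are dropped (the original paper also has a
        # "rationalization" variant that hints the answer and retries)
    return kept  # fine-tune `model` on `kept`, then repeat for another round
```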
speaker 2: The model would make use of a tool that checks itself, and I could later change what that tool is. At some point the tool comes back with a response that says "yes, I do have the knowledge" or "no, I don't," and then either the knowledge actually gets printed out or it doesn't.
speaker 1: So in this tool-use example, are you imagining that the fact would get replaced by a tool call, or would the fact still be there? I think the key question is: do you force the model to predict the fact tokens, or do you just force it to predict a "use a tool to look it up on Google" token?
speaker 2: However the tool is implemented, if the consensus is "yes, I do know it," then there would actually be a response printed out in the pre-training data, and if the tool says "no, I don't know it," then the pre-training data would just say, "actually, I don't know this."
speaker 1: Right. So I think that's hard, because when you do pre-training, you have to know whether the model knows the fact in order to decide whether to take a loss on that knowledge token. You can't defer it to inference time, because you have to decide whether or not you're going to take gradient steps. The other logistical difficulty is that at pre-training time you have a static dataset, for computational reasons, and with a static dataset you can't adaptively do updates. Anything that solves the hallucination problem has to be reactive, of the form: what does the model know, and do I take updates on this or not? Doing that at the pre-training stage is very difficult unless you're doing RL-style training at pre-training scale, which would get you very close, but is still very difficult. I'm happy to follow up, but hopefully that answers the question. Oh, there are more questions, yes. Okay.
speaker 2: In the case of something like Llama, if at pre-training time the model never sees emojis, but at post-training we put a bunch of emojis at the end of responses, what happens?
speaker 1: Yeah. Okay, so the question was: if at pre-training we don't see emojis, but at post-training we put in a bunch of emojis at the end, what will happen? It depends on the structure of the emojis. If the emojis depend on the inputs in a very complex way that's very difficult to learn, maybe what the model will learn is: in post-training I saw a bunch of emojis, I don't have enough data or training to know what the complex pattern is, so I'll just put a bunch of random emojis at the end. If there's no pattern, no complex dependence, then maybe the model will learn to do the right thing, which in that case is just to put a bunch of random emojis at the end. Really, the key way to think about these SFT issues is that instruction tuning will reliably teach the style of the output, the type signature of the output, and the model will most likely follow that type signature at the very least. The real question is whether you have enough instruction tuning data to teach something more than that, and that's the more complex open question. So in your emoji case, at the very least you'll get a bunch of emojis; whether those emojis are the right emojis is an open question that depends on how much instruction tuning data you have, on pre-training, and so on. Yes?
speaker 2: I was wondering: earlier in the lecture, I think it was said that the post-training part doesn't really teach new knowledge, that it's mostly about style. But the line gets blurrier with mid-training, like in the MiniCPM paper. So in that regime, could the mid-training part also teach some new knowledge?
speaker 1: Yeah, that's right. To rephrase the question: can't instruction tuning essentially teach new world knowledge, since mid-training blurs the line between pre-training and instruction tuning, and we know pre-training teaches knowledge? I think that's right: in some ways, instruction tuning, if it's scaled up enough and is diverse enough, will teach knowledge. But in its smaller, non-mid-training form, it's very difficult to have the scale and diversity of data needed to reliably teach facts. Modern mid-training is starting to become a different game, but it's still an emerging object, I think. Cool.

Okay, so now we get to part two. Part one was the quick intro to instruction tuning and SFT; now we get to reinforcement learning from human feedback, the RL part of this lecture. I'm going to take this slowly because it's an important conceptual transition. We're moving from the world of generative modeling at the very top here, where there's a very simple goal: there is a p*, a reference distribution from which completions are drawn. That reference distribution probably looks like some mixture of Internet data and annotator-written data, but there exists some abstract p* that we're trying to imitate. That's all there is to it; that's pure generative modeling. Now we move to the second perspective, RLHF. In this world, I no longer care about matching any distribution. Probabilistic perspectives don't entirely go out the window, but you want to be careful about adopting them, because what we're really looking for is some policy p(y | x) that maximizes our rewards. There's some reward function R(x, y) that takes in both my prompt and my completion and gives me a reward, and all I'm looking for is any policy that gives me good rewards. So now LMs are not necessarily models of some underlying distribution; they are policies that give us good rewards. Why would we go and do RLHF? There are two reasons. The first: in SFT, in order to do this imitation, we have to get samples from p*, and that can be quite expensive, whereas for RLHF all we need are measurements of rewards R. SFT data can just be really, really expensive. Here is a caricature of the costs you might have at different stages: you have compute costs when you train your base model, then you do supervised learning, SFT, then you collect a bunch of pairwise feedback and do RL, and then evaluation, and so on. The SFT part is just really expensive: you're getting expert people to write very long-form responses, and you saw how annoying and difficult that is. Frontier labs are spending millions on this post-training data. Maybe there's a nicer way of collecting data that makes models better. That's one argument for why we're going to do the things in this second part of the lecture.
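Written out, the two views being contrasted look roughly like this. These are standard formulations; the KL penalty in the RLHF objective is the usual InstructGPT-style regularizer toward a reference policy, included here for completeness rather than something stated explicitly in the lecture.

```latex
% Imitation / SFT: match a reference distribution p* over completions
\max_{\theta} \;\; \mathbb{E}_{(x, y) \sim p^{*}} \left[ \log \pi_{\theta}(y \mid x) \right]

% RLHF: find any policy that maximizes reward, typically with a KL penalty
% keeping \pi_\theta close to a reference policy \pi_{\mathrm{ref}} (e.g. the SFT model)
\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)} \left[ R(x, y) \right]
\;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]
```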
There's also a second reason, equally or maybe even more important, that people do this kind of RL training. One of the interesting things is that people don't always agree with themselves about what is good. If you ask somebody to write a summary, they can write one, and then if you ask them to compare their own summaries to LM-written summaries, a good fraction of people will actually prefer the LM-written summaries. This was a really surprising result from one of my students' papers from about two years back, where we were benchmarking summarization systems. There was one person, anonymized as annotator one, who wrote a bunch of summaries and actually preferred the AI summaries significantly more than their own, and they were a freelance expert writer or something like that. We went and interviewed them, and they said, "Yeah, when you asked me to write stuff, I just felt like I had to write more, with flowery language, but then I read the AI ones and they just read better." I'm sure you've had similar experiences where you look at an output and your assessment is actually different from how you would have generated it yourself. So it's not just cheaper to verify than to generate; it may actually be higher quality to verify than to generate. There's this generator-validator gap. Okay. So we're going to cover different aspects of this RLHF process. We'll talk about how we collect data and what you should worry about if you're in charge of RLHF data collection, and we'll talk about how we do RLHF. I'm going to cover two representative algorithms, PPO and DPO; I'll defer some of the more detailed explanation of PPO to next lecture for space reasons. Finally, we'll end with some things to worry about, pitfalls of RLHF, at the very end. Okay. So how do we get pairwise feedback? Sorry, let me go back here and take this a little slower. When we do the second part of this InstructGPT process, how does it work? We have the model generate its own outputs; these are rollouts, in RL terms. Then we compare these different outputs. Although four outputs are shown here, in standard settings you often just have a pair of outputs, A and B, and all I want to know is whether A is better than B. Given these pairwise judgments, I'm going to train a reward model that internally gives every single output a scalar value, and then just use that to do reinforcement learning: this reward model is now my reward, and I want my model to maximize it. Hopefully a fairly simple pipeline. So how do we collect pairwise feedback data? The obvious simple thing to do is make some web app: you've got two different AI responses and a little four-way checkbox marking which response is better. I took this from one of the studies we did; a lot of pairwise feedback interfaces look similar to this. But I thought one thing that would be helpful and useful is to get a sense of what this looks like for real.
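As a brief aside before the annotation guidelines: the reward-model training just described is usually a Bradley-Terry-style pairwise loss. Here is a minimal PyTorch sketch; the `reward_model` interface (a prompt and a completion in, a scalar score out) is an assumption for illustration, not the InstructGPT implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompt, chosen, rejected) -> torch.Tensor:
    """Bradley-Terry objective on one comparison: push the scalar reward of the
    chosen completion above that of the rejected one.

    Minimizing -log sigmoid(r_chosen - r_rejected) maximizes the modeled
    probability that the annotator prefers `chosen` over `rejected`.
    """
    r_chosen = reward_model(prompt, chosen)      # scalar tensor
    r_rejected = reward_model(prompt, rejected)  # scalar tensor
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained, that scalar reward is what PPO-style RL (or the implicit objective in DPO) then maximizes.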
I have gone and sort of dug up examples of annotation guidelines from different papers in different places, sort of you know talking about this process, right? So if we look at the instruct GPT guideline, this is one of the very few, I would say, sort of released materials from one of these companies describing their annocation guidelines. You know they say, okay, your job is to evaluate these outputs to ensure that they're helpful, truthful and harmless, right? So those are their three pillars. You've got helpfulness, which is like you know writing in clear language, you know answer the question they mean to ask, like being sensitive to internationality. Like if someone says football, you know they shouldn't assume American football. If it's too confusing, ask for clarifications. I want you to be truthful and not hallucinating outputs. And by harmless, you should sort of be not toxic and be very nice and not say nsw things. All fairly reasonable things, but you can kind of see how sort of there's the interplay between things like the model spec, which OpenAI publishes publicly. Then there's a very detailed annotation guideline, which know this is not that detailed. It's probably much bigger in practice where you would write down these kinds of bullet points. You would hand this to annotators, and then they would sort of go and make annotations. This was you know instruct GPT is not even like kind of production grade. This is early days, but you see how this process kind of works. The kind of other interesting example, you can kind of go look this one up later, if that. The actual text is too small because I'm not going to read through all of it. I'll just sort of touch on it a bit. There's actually a leaked version of actual the annotation guideline apparently for Google bard that I think was part of some news story. And you can kind of see very, very similar things happening here. We've got on the top left box like helpfulness, like you should address the intent of the uers prompt. You should adhere to any requirements, don't have misleading information. Like very similar to the instruct GPT setup. You know we've got actually here a style box, which is what kinds of style are good or bad. And then we've got different rating scales for the different responses. And if I remember right, I think the Google bard folks, I think it's like a minute per question to be doing this task, which is quite difficult. Okay? And then for instruct GPT, you know they go through like scale and Upwork and they collect about data from about 40 people. Like this is really, really tiny sort of by tostandards. But hopefully you kind of get a sense of you know what types of groups are being involved here. Okay. So this is the second part of our interactive exercise. Okay, cool. So that's five minutes, I guess, to take a straw poll. I think about 27 of you managed to complete this, which is great. Thank you for your participation. You know how many of you managed to fact check all the facts. Yeah, Yeah.
speaker 2: [inaudible]
speaker 1: Right? Yes.
speaker 2: Yes, that's right.
speaker 1: So, how many of you managed to fact-check any or all of the facts? You managed to check things? Just one. Just one, okay. How many of you managed to check the math in the five minutes? Okay, so there are a couple of people. Good, good. Excellent, that makes me happy. So the point of this exercise, partially, and maybe you could have guessed what was going to happen here: the shorter responses are mostly the longer ones with the hallucinations removed, to the best of my ability. So for the most part, those of you who picked the longer one were picking the slightly longer but hallucinated one. I don't quite remember exactly which of the paired ones are hallucinated, but many of them are. So we see strong disagreement, and the longer one actually gets more votes despite having strong hallucinations; for this one, I think both are correct, so B is probably the better choice. For the two math ones, you can also back out the answer from the unnaturalness of the construction: the one that sounds more conclusive is actually reaching the wrong conclusion; it's not mathematically correct. So you probably found it difficult to make these judgments this quickly. There are very strong challenges in collecting this kind of pairwise feedback at very large scale, because even though you only have to verify, if I'm showing you a math problem or a very factually laden text, you basically have to break it down into claims and check each one to know which response is right. It's a very labor-intensive and difficult task. And I gave you five minutes because the news story the Google Bard guideline was associated with was essentially that annotators were given one minute per example, and there were big complaints that this is not enough time to judge the safety and factual accuracy of the model. Hopefully you somewhat agree that it's a very difficult task. Okay, so the lessons here, going back to the flow of the lecture: it's very hard to get high-quality annotators who will actually verify things. You are in many ways pretty high on the bar of people who are motivated to do this, because you're doing it because I asked you to, not because I'm paying you or for a grade. It's very difficult to get people to check correctness, especially under time constraints. And the last one: I don't know if any of you did this, but if you put up an online survey like this, someone's just going to take the whole thing, dump it into GPT-4, and copy the answers straight back into your pairwise responses. We've had several studies in the past where an annotator had 95-plus percent agreement with GPT-4, and you have to wonder what's going on there. So despite the fact that pairwise feedback is easier to collect than supervised imitation data, there are still significant issues. And I'd be remiss not to point out the many things that have been written about the kinds of problems this creates: if you outsource this work to other countries, there are pricing concerns and lots of ethical issues that you should all be aware of.
If you're going to be part of these kinds of data collection pipelines in the future, you want to make sure to get high-quality data and also make sure people are being paid a living wage. The other thing to be aware of, on the bias and safety angle, because for alignment this is really important to touch on: RLHF and alignment come at the end of the pipeline, and because they come at the end of the pipeline, they have a very strong influence on model behaviors. One of the papers that Percy, my postdoc Shibani, and Esin worked on tried to figure out how the subjective opinions of LMs align with different groups of people. One of the really interesting patterns we found was that InstructGPT, an old model but still a useful one to study, somehow became more aligned with Southeast Asian religious views than before. Then we looked in the appendix of InstructGPT at the nationalities of the people doing the annotation: it's Filipino and Bangladeshi and 17% American. That's kind of surprising, but the circumstantial evidence lines up with this kind of effect. So you do have to be very careful, because this is the thing that, in some sense, goes out and ships. Others have also noted that, depending on the annotator, what they pay attention to is very different. I really like this paper from Hosking, Blunsom, and Bartolo, where they basically study two different kinds of annotators: the authors, who are very motivated to judge things "correctly", in quotes, and crowd workers. What they find is that crowd workers don't pay much attention to factuality, which is this row here; they pay more attention to formatting. So depending on the annotator, you're getting different kinds of feedback even though you're asking the same questions. And so, as with the instruction tuning phase, people have increasingly turned to AI feedback, LM-generated feedback. There have been many works, including some of our own, showing things like: if you get pairwise feedback from GPT-4, its estimated win rates for models or responses have very strong agreement with the human-estimated ones on the y-axis. Same here: the agreement between human and human, the blue box here, is roughly the same as the agreement between GPT-4 and human, but GPT-4 is much cheaper. So there are lots of reasons why AI feedback has become popular, and it has been used very extensively in RLHF. If you look at UltraFeedback, one of the very popular open-source datasets for off-policy RLHF, you see this. Zephyr 7B, I want to say, was a Hugging Face effort last year to build a big, strong open model. The reason I bring this up, even though Zephyr is probably not the most well-known model, is that one of the things I remember about the Hugging Face process was that initially they were really interested in human data collection; they were convinced that if you paid human crowd workers enough and got the right vendors, you would outperform AI-generated feedback.
But late in the process, they realized GPT-4-generated feedback for this kind of thing just worked much better. And a more modern example of this is Tülu 3, which is a sort of post-training paper slash project out of AI2. And they've done roughly this: they take different prompts, they have lots of different models generate responses, and they have an LM rate these to get chosen versus non-chosen responses. And this really all goes back to the classic paper, the Anthropic paper on constitutional AI, which I think really planted a flag in the ground in terms of AI feedback being used for this kind of alignment process. Finally, the last thing I want to talk about for data is length effects. I think when we did the annotation, one of the things that we saw was that many of you saw the longer response and thought, this is more detailed, I like details, this is good. It's not just you. Models and people all have this bias. And so people have found that models that people thought were better were in fact maybe just longer. And AI feedback seems to make models generally longer. So this is always a confounder that you want to be careful of: length as a confounder for general preference. There was a question there.
speaker 2: Okay, Yeah, I'll talk about it.
speaker 1: Off-policy versus on-policy, I'll talk about later. Off-policy, I think I mentioned it here, is going to refer to collecting these kinds of pairwise feedback things separately. They're not collected from the outputs of your model. Maybe your model is involved; for example, in Tülu, you've got all this kind of off-policy data on the left, that's models that are not your own, but you also have on-policy data from yourself, right? So the off-policy data kind of tells you about the landscape of places you're not at, and the on-policy data tells you how to refine yourself. Yes.
speaker 2: The way this human aspect works is that we're sampling prompts and then asking humans to grade them. Have people ever, instead of sampling prompts that you don't have the answer to and then asking them, used prompts that already exist in the world, from any number of places, where we would know the answers? You have that data saved; we don't need to do anything to collect it. Have people done that sort of alignment using that style of data instead?
speaker 1: Yes, okay, that's a good question. It's like, why don't we do RLHF on, or can we do RLHF on, domains where we know the answer, is one way of putting the question, right? And part of the answer is that the next lecture on Thursday is kind of that. Where do we really know the answer? Math. We know the answer very well for math, and we can do exactly RL against math, and it works very well. People also do things like: here is a long-form response from an expert that has already been written; now can you judge it given this? That helps. But one thing that I think is important to keep in mind is that for a lot of these open-ended tasks, there are many correct answers, and it's very difficult to judge which are correct. Like, if there's a new fact in the LM response, is that a correct fact or not? That doesn't really solve those problems. Yes.
speaker 2: So, thinking about the Anthropic paper you referenced, there's this relatively specific and small constitution. But could we sample values to make this work for other things? We have any number of sources: here's a problem, here are values that we as a society think are relevant, whether that comes from surveys or politics or articles, any number of ways. But that seems...
speaker 1: Yeah, I guess there are lots of different things being mixed into that comment. You could do deliberative democracy style stuff to align models; that is certainly a thing. There's also, I guess, the thing that feels closest to what you're talking about, which is almost giving the annotator sources that are relevant. And I'm sure those things help. There have been works on showing people expert-written responses when they do the pairwise judgment and so on, but I think it's not a silver bullet. All of these kinds of UI interventions definitely help, but they're not necessarily going to solve the problem in one stroke. Okay, yes.
speaker 2: Like, if the model can see its own or other LLM responses, isn't there a bias from that?
speaker 1: Yes, there's a very strong, or at least very detectable, self-preference for most models toward their own outputs. And so when you use this for evals, which many people, including me, have done, you have to be very careful about that self-bias. Absolutely. Yes. Okay, there are lots of hands, but I think you were first.
speaker 2: you get just from having modiated on outputs. And then the best using if you look at own feedback, right, you still look a heuristic for how many creations that you can do and not change if you say Samford to take the model. Yeah.
speaker 1: I guess the question of how much you can extract out of models doing things to themselves is an interesting question. But in some ways the information-theoretic bounds, so to speak, are very, very high, because technically the model ingests the entire pretraining corpus, and that could be stored somewhere in the model. And depending on how you prompt it, you might get an amazing model out. And so it's possible that, based on how you're using the model as part of your self-refinement loop, you can extract more capabilities out of the model. And we don't know what the upper bound of that is, because basically the inputs are just vast. But I do think, practically, there are lots of papers that have studied this question of how much self-improvement can help, what the scaling properties are, and so on and so forth. I think ultimately it's a very empirical question. Okay.
speaker 2: yes, problem because compared to St obviously basically was the best. We have a Laval holfor, a four hour don't but we have like reward model to .
speaker 1: give it like returns. Yeah. So maybe I should just go through the next slides because actually that watch you explain it. So not only does it help us get through the rest of this, hopefully quickly, but also, I think itexplain your question. Okay, so now let's talk about methods. Our goal is to do this thing right. I want to find a policy to maximize My Rewards. So I need to tell you what the rewards are, and I need to tell you what the maximization process is, right? So let's do that now. I think, as you can probably tell by the frequency with which I'm referring to the int GPT paper, the whole sub area of instruction, tuning and post training is very closely tied in instruct GPT. So from the instruct GPT paper, you have the equation two, which is this objective that describes what we're optimizing. So we've got this R theta of x of y. This is our reward, and I'll define that in a moment. And then we've got you the second terms, the log ratio of my rl policy divided by my sft output. So what is this object? This is the kl divergence between my rl policy and my original sft model. So it's saying when I do rl, don't move too far from where I started. And then this second line, this gamma term, this is basically saying keep doing pre training while you do rl. So you don't catastrophically forget if you you know keep doing this. Lots of people don't do the second step. But this kl thing is a really, you know it's a standard thing. It remains even today. Okay, so now what is the reward? The reward is this thing kind of at the very top. And so maybe that's a small equation. I'll talk through what that is, right? So there is a sort of hypothesized model of the world that exists. So what is the hypothesized model of the world? It's that every single output in the world, right, like that a language model could output, every single sequence, has a scalar value, R, associated with it. And we don't observe what that R is. And when a person rates it, like when they do a pairwise rating of a versus b, what they do is they compare the two rewards of those two sequences, and based on the difference, theytake a coin flip. So this is a logistic model of the difference of the two rewards. So every sequence is a reward. When I do pairwise comparisons, I take the difference and I flip a coin, right? This is the Bradley Perry model of human preferences. And this is what's happening. And so when we want to optimize the reward, what we're trying to do is we want to output sort of the sequence that has the highest R. And R is not something we observe. We only observe noisy pairwise comparisons through our theta. So that's what we do. So that's the objective. So that's what we're trying to optimize. And so now let's talk about the how. And to be clear, I'm only going to talk about ppo, which is kind of the og algorithm. This is what appears in instruct GPT and a lot of the OpenAI stuff. I'm going to talk about it only very briefly, and then I'll talk about it in much more detail on Thursday because it more naturally belongs there. We're gonna to do a lot more sort of real rl on that lecture. So at a conceptual level, remember, what we want to do is we want to optimize the reward of some policy, right? That's the left term here. And what's a good way of optimizing something? Well, let's take some gradients, right? Let's take gradient descent. So that's the very top left equation here. 
Now, we can do a little bit of math, and if we take the gradient of this object, we can write it down as equivalent to the expectation of the reward multiplied by the gradient of log p_theta. And this is very natural. You take the normal gradients that you would take when doing pretraining or whatever; this is saying, p_theta of z, I want to maximize that probability, and I multiply its gradient with R. So if my rewards are positive, I want to upweight those probabilities; if my rewards are negative, I want to downweight those probabilities. Right? This is the policy gradient theorem; this is REINFORCE. If you've taken an RL class or something like it, you've definitely seen this before. Now, what is PPO? PPO, I think, is normally a very intimidating object, but I think it is actually quite simple. There are two steps that happen. First, instead of taking a reward, we look at what's called an advantage. An advantage, to gloss over a lot of the details involved, is basically a variance-reduced version of the reward. If you go through the math, you can notice that I can subtract any constant, or in fact any state-dependent baseline, from R, and this gradient will still be correct. That means I can rewrite this reward after subtracting any baseline values that I want, and let's say we call that the advantage. Not only that: maybe I want to take multiple gradient steps after sampling from p_theta once, essentially sampling from one rollout and going almost off-policy. To enable that, I have to have importance weighting corrections, because the more steps I take, the more stale my original samples become. And so this is what's called TRPO: you basically make corrections for all the gradient steps you take, and then you constrain yourself to stay close. Now PPO takes the final step and says, instead of explicitly constraining myself to stay close to my old policy using this KL constraint, maybe I can just clip the probability ratios, and this will naturally incentivize the model to stay close to the original policy. So this is kind of PPO in one slide. I'm not going to go into this in much more detail, because this won't be the primary algorithm that I want to go through the rest of the lecture with. So at least in the open research space and the academic space, a lot of the question that people were concerned with was, can we get rid of PPO? We will see this theme on Thursday as well. PPO is very complicated, and we debated but decided against having you implement PPO, because it would be suffering. And so lots of people thought, let's get rid of PPO, and they tried other reasonable things. And I will explain those reasonable things. Like, maybe we can SFT on the pairs, but for each pair, we prepend a "good" token to the chosen outputs and a "bad" token to the rejected outputs, and then condition on "good" when we generate. That does not work very well. I can train the model only on the preferred output. That also does not work super well. I can use a reward model, sample the best output out of several, and then train on those. That works okay, but maybe not that great. And so people tried all these variants, but what really stuck was basically DPO.
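(As a reference point for the clipped-ratio idea just described, here is a minimal PyTorch-style sketch of the standard PPO clipped surrogate loss. The tensor names are illustrative; this is a sketch, not the exact implementation used in the lecture or in InstructGPT.)

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized).

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs under the policy that generated the rollouts
    advantages: variance-reduced rewards (reward minus a baseline estimate)
    """
    # Importance ratio pi_theta / pi_theta_old, computed in log space for stability.
    ratio = torch.exp(logp_new - logp_old.detach())
    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise min makes the objective pessimistic: the policy gets
    # no extra credit for moving far from the old policy, keeping updates "proximal".
    return -torch.min(unclipped, clipped).mean()
```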
And I think the reason why it caught on as much as it did was because it removed a lot of the complexity of PPO and worked relatively well. You get rid of the reward model that exists in PPO, which is used to calculate the advantage. You get rid of the on-policy stuff, like the importance ratio thing that I was talking about. You just get rid of all of that. Instead, we go back to the basics: we take gradient steps on the log loss of good things, and we take negative gradient steps on the log losses of bad stuff, right? We go back to very simple, basic things. And the last part of what I want to talk about today is just deriving the DPO formula. So what is our goal? Our goal is to optimize this quantity at the very top. This is just a rewriting of that InstructGPT equation: I have a reward at the very front, and then I have a KL divergence that keeps pi_theta close to my reference. So this is very natural. The first thing I'm going to do is assume that my policy, pi_theta, is not actually a neural network; I'm going to assume it's an arbitrary function of any kind. And if I do that, then I can write down essentially what the optimal policy looks like. It has this form: it's the exponential of the reward multiplied by pi_ref, the reference distribution, over here. And we can solve for the implied reward by solving for r(x, y). The clever part about DPO is now to say, okay, instead of thinking about policies, I can think about rewards, because the two are one and the same under this non-parametric assumption. And so what you do is, remember, I have these two pieces. The left side is the Bradley-Terry equation from the Stiennon et al. paper, the one before InstructGPT. And on the right side is the DPO equivalence I wrote down. And now what I can do is plug these rewards r into this objective and minimize this loss. I can say, what I want to do is find a policy such that the implied reward for that policy has the highest probability of generating my pairwise comparisons. So now I've taken an RL problem, and I have turned it into a maximum likelihood problem, a problem that's very similar conceptually to something like pretraining. All we're doing is maximizing probabilities, except what we're maximizing here is the probabilities of the pairwise comparisons. So those are the key steps: start by making the non-parametric assumption, parameterize the reward via the policy, and then optimize it using the supervised loss. I think we're a few minutes over, so we'll stop here. I think this is a good place, because we got through the derivation of DPO, and we'll get through the rest of RLHF at the start of next lecture. Thanks, everyone, for asking lots of good questions.
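(For reference, a minimal PyTorch-style sketch of the DPO loss just derived. The tensor names are illustrative, and the log-probabilities are assumed to be summed over each full response under the trained policy and the frozen reference model.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs."""
    # Implied rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry likelihood of the observed preference, as a logistic loss.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```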

Latest Summary (Detailed Summary)

Generated on 2025-06-21 17:22

Overview / Executive Summary

This lecture takes a deep dive into the key techniques for "aligning" language models after pretraining, with the goal of turning a GPT-3-like base model into an instruction-following model that is helpful, truthful, and harmless like ChatGPT. The core methodology is RLHF (Reinforcement Learning from Human Feedback), a pipeline that typically consists of two main stages.

The first stage is Supervised Finetuning (SFT). In this stage, high-quality instruction-response demonstrations are collected and the pretrained model is trained on them via imitation learning. The lecture stresses that the quality, diversity, and style of SFT data have a huge influence on model behavior, and an interactive exercise illustrates how difficult and time-consuming it is for humans to write high-quality long-form responses. A key caution: using SFT to teach the model new knowledge it did not acquire during pretraining can backfire, causing the model to learn to "hallucinate" or fabricate facts rather than actually learning the knowledge. In addition, modern SFT practice has evolved toward mixing instruction data into the late stage of pretraining (so-called "mid-training") in order to use large-scale data effectively while avoiding catastrophic forgetting, blurring the line between pretraining and finetuning.

The second stage is Reinforcement Learning from Human Feedback (RLHF). This stage addresses the fact that SFT is expensive and cannot fully capture human preferences. The pipeline is: first, collect human preference rankings over different model outputs; second, use this data to train a Reward Model that scores model outputs; finally, use a reinforcement learning algorithm to optimize the language model so that it generates responses that earn higher rewards. The lecture contrasts the traditionally complex PPO algorithm with the simpler and more efficient DPO (Direct Preference Optimization). By converting the RL problem into a simple supervised learning problem and skipping the explicit reward-model training step, DPO has become the mainstream choice for open-source models.


From Pretraining to Alignment: Why Better, Tighter Control Is Needed

  • Background: Pretrained models (e.g., GPT-3) have strong capabilities, but their outputs are unreliable, do not necessarily follow user instructions, and may generate harmful content.
  • Goal: Through the alignment process, gain better and tighter control over language model outputs so that they become helpful, truthful, and harmless.
  • Core questions:
    1. What does the data needed for alignment look like?
    2. How can this data be used most effectively?
    3. How can the alignment process be carried out at scale?

Step 1: Supervised Finetuning (SFT)

SFT is a form of imitation learning and the first step of the RLHF pipeline, intended to teach the model the basic behavior of following instructions.

The SFT Pipeline

  1. Collect demonstration data: sample a prompt from a prompt dataset.
  2. Human annotation: have experts or annotators write a high-quality demonstration response for the prompt.
  3. Finetune the model: finetune the pretrained base model with supervised learning on these prompt-response pairs (a minimal loss sketch follows this list).
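A minimal sketch of step 3, assuming the common convention of computing the next-token loss only on response tokens (prompt tokens are masked out); the function and tensor names are illustrative, not the lecture's exact recipe.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             response_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over demonstration (response) tokens only.

    logits:        [batch, seq_len, vocab] model outputs
    input_ids:     [batch, seq_len] prompt + demonstration tokens
    response_mask: [batch, seq_len] 1 where the token belongs to the response
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:].float()
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)
    # Average only over response tokens; prompt tokens contribute nothing.
    return (per_token * shift_mask).sum() / shift_mask.sum().clamp(min=1)
```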

Composition and Impact of SFT Data

The quality and composition of SFT data have a decisive influence on the model's final behavior.

  • Data sources and styles:
    • FLAN: built by aggregating existing NLP task datasets; the format is fairly rigid and sometimes unnatural.
    • Alpaca: uses a language model to generate instructions and responses; the format is closer to chatbot interaction, but the instructions are relatively simple.
    • OpenAssistant: written by community volunteers via crowdsourcing; contains complex queries and high-quality, long-form responses. The in-class exercise showed that writing such responses by hand is costly and very challenging, which explains why AI-generated data has become increasingly common.
  • Key effects of data characteristics:
    • Length bias: studies show that both human and AI evaluators clearly prefer longer outputs, which can push optimization toward style rather than quality.
    • Knowledge and factuality:
      > A central point of the lecture: when SFT data tries to teach the model "new knowledge" it never saw during pretraining, there is a major risk. The model may not learn the facts themselves but instead a behavioral pattern of fabricating content to match the output format.
      • Example: if the finetuning data asks the model to provide citations for an unfamiliar economics concept, what the model may learn is not the specific citation but the behavior of appending a citation-looking string after answering a complex question, which leads to hallucination.
    • Safety:
      • A small amount (roughly 500 examples) of safety-related instruction data can significantly improve model safety and reduce harmful generations.
      • However, too much safety data can cause over-refusal, where the model declines harmless requests (e.g., "how do I kill a Python process").

Modern SFT Practice: Mid-training

To use large-scale SFT data more efficiently, modern practice folds it into the late stage of pretraining.
* Procedure: during the late stage of pretraining (the learning-rate decay phase), mix high-quality instruction data with pretraining data and continue training (a minimal data-mixing sketch follows this list).
* Advantages:
* Allows instruction data to be used at scale while continued exposure to pretraining data effectively avoids catastrophic forgetting.
* Integrates instruction-following capability more deeply into the model.
* Consequence: this practice blurs the line between "base models" and "instruction-tuned models." Many models released today as "base models" may already have been implicitly instruction-tuned.
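A minimal sketch of this mixing step, under the assumption that instruction examples are simply interleaved into the ongoing pretraining stream at some small fraction; the function names and the 10% ratio are illustrative, not values from the lecture.

```python
import random

def mid_training_batches(pretrain_stream, instruct_examples,
                         instruct_fraction: float = 0.1, seed: int = 0):
    """Yield examples for the learning-rate-decay phase of pretraining,
    interleaving instruction data into the pretraining stream.

    pretrain_stream:   iterator over pretraining sequences
    instruct_examples: list of formatted instruction-response examples
    instruct_fraction: probability of drawing an instruction example
                       instead of a pretraining one (a tuning knob)
    """
    rng = random.Random(seed)
    for pretrain_example in pretrain_stream:
        if rng.random() < instruct_fraction:
            yield rng.choice(instruct_examples)  # instruction data stays in the mix
        else:
            yield pretrain_example  # continued pretraining data avoids forgetting
```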


Steps 2 and 3: Reinforcement Learning from Human Feedback (RLHF)

RLHF marks the shift from imitation learning (SFT) to optimization, aligning the model with human preferences more precisely.

Why RLHF?

  1. Cost effectiveness: collecting preference data (judging whether response A or B is better) is cheaper and easier than writing a perfect response (SFT data).
  2. Generator-Validator Gap: the responses people write themselves are not always the responses they prefer most. The lecture stresses that verification (choosing a preference) is not only cheaper than generation but may also yield a higher-quality alignment signal.

The RLHF Pipeline

This is a standard three-step pipeline that follows directly after SFT:
1. Collect comparison data: have the SFT model generate multiple responses to the same prompt.
2. Human ranking / reward model training: have human annotators rank the responses, then use the ranking data to train a Reward Model (RM) that scores prompt-response pairs (see the sketch after this list).
3. Optimize the policy with reinforcement learning: treat the language model as a policy, use the reward model's scores as the RL signal, and update the model parameters with an algorithm such as PPO or DPO so that it tends to generate higher-reward responses.
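A minimal sketch of step 2, training a reward model on pairwise comparisons with the Bradley-Terry loss; the `reward_model` interface and the argument names are illustrative assumptions (e.g., an LM backbone with a scalar value head).

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected) -> torch.Tensor:
    """Bradley-Terry loss: push the scalar reward of the chosen response
    above that of the rejected response for the same prompt.

    `reward_model(prompt_ids, response_ids)` is assumed to return one scalar
    score per example in the batch.
    """
    r_chosen = reward_model(prompts, chosen)      # shape [batch]
    r_rejected = reward_model(prompts, rejected)  # shape [batch]
    # -log sigma(r_w - r_l): maximize likelihood of the observed preferences.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```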

Data Collection and Challenges in RLHF

  • Annotation guidelines: annotators need detailed guidelines defining what counts as a "helpful, truthful, and harmless" response.
  • Crowdsourcing challenges:
    • It is hard to guarantee annotator quality and rigorous fact-checking. The in-class exercise highlighted how extremely difficult it is to judge complex responses containing math and factual claims under time pressure.
    • Annotators' demographics can significantly shape the model's biases and behavior.
    • There are ethical concerns, such as underpayment.
  • The rise of AI feedback:
    • Because of these challenges, using strong models such as GPT-4 to provide AI feedback has become widespread: it agrees closely with human judgments while being cheaper and faster (see the sketch below).
    • Constitutional AI: a self-alignment approach in which the model generates, critiques, and revises its own responses according to a preset "constitution" (a set of principles), automatically producing preference data.
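A minimal sketch of collecting pairwise AI feedback with an LLM judge. `query_judge` is a hypothetical callable standing in for whatever model API is used, and the prompt wording is illustrative; swapping the order of the two responses is one common way to check for the position bias discussed in the lecture.

```python
JUDGE_TEMPLATE = """You are comparing two responses to the same prompt.
Prompt: {prompt}
Response A: {a}
Response B: {b}
Which response is more helpful, factual, and harmless? Answer "A" or "B"."""

def pairwise_ai_feedback(prompt: str, resp_1: str, resp_2: str, query_judge) -> int:
    """Return 0 if resp_1 is preferred, 1 if resp_2 is preferred, or -1 if the
    judge's answer flips when the order is swapped (a position-bias failure)."""
    first = query_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=resp_1, b=resp_2)).strip()
    second = query_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=resp_2, b=resp_1)).strip()
    if first == "A" and second == "B":
        return 0
    if first == "B" and second == "A":
        return 1
    return -1  # inconsistent across orderings: discard or re-query this pair
```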

Evolution of RLHF Algorithms

  • PPO (Proximal Policy Optimization)

    • The classic algorithm used in InstructGPT; complex and sensitive to implement.
    • Its objective combines the reward signal with a KL-divergence penalty. The core role of the penalty is to keep the optimized model from drifting too far from the original SFT model, preserving general capabilities and mitigating catastrophic forgetting.
  • DPO (Direct Preference Optimization)

    • A simpler, more stable method that has become mainstream for open-source RLHF models.
    • Core idea: skip explicit reward-model training and turn human preference data (which response is better) directly into a simple classification loss. It directly optimizes the policy to raise the probability of the "better" response while lowering that of the "worse" one, greatly simplifying the RLHF pipeline.

Conclusion

  • RLHF is a powerful framework for aligning language models with human preferences, making them more helpful and safer; it was the key to the transition from GPT-3 to ChatGPT.
  • The framework usually consists of an SFT stage and an RLHF stage, though modern practice is weaving them together more tightly (e.g., mid-training).
  • Data is central: for both SFT demonstrations and RLHF preference data, the quality, diversity, provenance, and annotator characteristics all have a profound effect on the final model. The interactive exercise showed how challenging high-quality human data collection is, which has driven the development of methods such as AI feedback.
  • Algorithms are evolving: alignment methods are moving from the complex, hard-to-implement PPO toward simpler, more efficient approaches like DPO, which recasts the reinforcement learning problem as supervised learning and significantly lowers the technical barrier.