Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert
Stanford's CS224N course invited Nathan Lambert of AI2 to speak on "what comes after DPO." Dr. Lambert begins by reviewing the history of language models, noting the trend of researchers moving from reinforcement learning backgrounds into language modeling, and the growing importance of the post-training stage (RLHF and DPO) for large language models. He points out that companies like Meta use far more data in post-training than research institutions can access, which poses a challenge for academic work. The core of the lecture explores the research directions and open problems in model alignment now that DPO exists. Lambert explains that DPO was the big breakthrough of the past year, enabling many more people to participate in alignment work. He also distinguishes concepts such as instruction tuning, supervised fine-tuning, alignment, and RLHF, and emphasizes that instruction tuning (e.g., adding system prompts) remains the foundation of current model fine-tuning, enabling models to better understand and follow user instructions.
Tags
Media details
- Upload date
- 2025-05-16 20:47
- Source
- https://www.youtube.com/watch?v=dnF463_Ar9I
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Okay, well, welcome back to CS224N, and welcome back from me to CS224N too, since I was traveling for a couple of weeks; I hope everything went smoothly in the meantime. So today I'm delighted to introduce our first invited speaker, Nathan Lambert. Nathan did his PhD at UC Berkeley, so you're allowed to boo and hiss for that. But since then he worked first for a couple of years at Hugging Face, and now he's working at AI2, the Allen Institute for Artificial Intelligence, in Seattle. Nathan comes from a background in reinforcement learning, like quite a few of the other people who are now applying reinforcement learning to language models. He had an early background applying reinforcement learning to robots, but it turns out it's more fun to do it with language models. No, it's not. Okay. But anyway, he's been very influential both in developing ideas for how to do post-training with RLHF and in other ideas that have come since then, including DPO, which he'll definitely mention in today's talk. So he's one of the best experts on the post-training phase of language model development, and time has proven that more and more of the action at the large language model companies is happening not in the initial pre-training phase but in this subsequent post-training phase. Nathan will have a lot to say about that today. Thanks a lot for coming to do this. speaker 2: Yeah, thanks for the wonderful intro. You can see my talk is "Life after DPO," which is a little bit of an unclear title, so I apologize for that, but it's trying to capture the moment that we're at in alignment and alignment research. DPO is really the paper, the story, of last year: this paper came out (I'll get to the math), and now a lot more people are interested in and able to do alignment, and things are building from there. So the question is: what are we going to be interested in after DPO? And a tidbit from talking with Chris that isn't explicitly in my slides is about the gap we're trying to close. The labs like Meta, with the amount of data they're using for this kind of post-training or fine-tuning (there are all these words, which I'll define), are operating at such a scale that the number of data points Meta bought for Llama 2 from one of these providers is much more data than all the data that's been collected on Chatbot Arena by LMSYS. Chatbot Arena has collected something like 800,000 data points, and Meta's Llama 2 paper says they bought about 1.5 million comparisons, and those are years older than Chatbot Arena's numbers, which are as of a few weeks ago. So you can only imagine what OpenAI, Anthropic, etc. are buying at that scale. And this is the kind of reality that we need to adapt to: what is different? We don't have that type of resource doing research, so what are we going to do? So this lecture is some history on the things that led up to DPO that I think are important to remember, and then it really goes zero to 100 and talks about recent research that we're doing to try to answer this question and define what is happening. So I'll start with a heavily abbreviated history of language models. I won't go through all of this; there's a bunch of it in the class already, and this is late in the course. I like to start with Claude Shannon, and then you skip a whole bunch of stuff to where this autoregressive loss function shows a lot of promise. And this was not fast.
You can see how many years it took to build language modeling as a field here, and deep learning is brewing in the background as one of many things that went into it. Then you have these years with 2017, the Transformer paper that you hear about; 2018 with GPT-1, ELMo and BERT, these foundational topics in language processing and how embeddings are created; and then with GPT-2, scaling laws become a key idea that people are looking at and tracking as these models improve. And then 2020 is when people really started to wake up to how useful these large-scale trained language models were. At this time I wasn't even a language modeling person, but for a lot of people in AI, this is when the gravity of the situation was starting to suck people in. And there's a cadence to these things. In 2021 we had the Stochastic Parrots paper, which, before ChatGPT, was raising the warnings: what are we actually putting into these models, and what are they learning? Are they actually learning something meaningful from language, or are they repeating the language that we have? This is the kind of philosophical debate, depending on where you land on what language is, about what these language models are doing today. But it's important that it came out before ChatGPT; these are the foundations of the debates about what language models are doing. The end of 2022 is when ChatGPT actually came out, which was supposed to be a quiet launch of a demo from OpenAI, and it has since captured the attention of the world, as we have seen. The simple question is: can ChatGPT exist without RLHF? I think it's important to acknowledge that so much of this is from pre-training, but at every point along the line, for ChatGPT and a lot of the popular models since then, RLHF and these human-feedback or other fine-tuning technologies seem to be necessary but not sufficient. You need the pre-training, but you also need this RLHF, this post-training, to really shift the needle on what the most important models are at a given moment. You can list so many examples where RLHF is relied upon. I like to look at these plots from the Anthropic Constitutional AI paper, where they show the iterative improvement of their different RLHF methods. It shows how you have multiple model versions evolving over time as you add more fine-tuning data. It's a dense paper, but this is one of the most representative figures of what RLHF can do; there's a lot of information in there that you don't need to follow right now. And then Meta's Llama 2 paper is pretty funny, where they have this quote: reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community; however, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. This is from the technical report directly, which I found really entertaining. This is back in the day when we were like, oh, we don't know if RLHF is really going to take off; this was July of 2023, in this building period, and it's directly from the report. And that has aged really well, where people are still using this today.
But there are a lot of interesting hints and a kind of cultural history of RLHF in the releases of these models, where these companies like to talk about it and give us these cultural details about what's going on. So I'm going to go through some definitions. I won't spend too much time doing RLHF 101 and exactly what is happening with the mathematical terms, but it's important to get on the same page about what some of these things do and don't mean. There are a lot of definitions. One of the interesting ones to come back to, if it doesn't make sense right now, is the difference between instruction fine-tuning and supervised fine-tuning. Instruction fine-tuning is what's become really popular, where you're training a model to follow instructions, and I have another slide on this after. Supervised fine-tuning is more of a domain-specific thing, and we want to do both of them. I think instruction fine-tuning is more linked to RLHF: it's about making these models really useful, really engaging, and easy to work with. Then there are these other terms, like alignment, which is super vague, but it's in the word: it's aligned, it's training a model to mirror what a user wants, and there are a lot of things you can align to. RLHF is a mouthful; it's one specific tool for doing alignment where you have this human feedback data. Feedback is a really loaded word there, where it can mean preferences, and learning to rank is related to actually putting feedback on preferences. There are a lot of little things. I tried to make "preference fine-tuning" a phrase at one point but didn't really double down on it; I think it's a little clearer than RLHF, especially in the context of DPO. But there are a lot of spheres overlapping in this post-training or fine-tuning space of models these days. Instruction tuning, or instruction fine-tuning, is still the foundation of a lot of this. This is where things called system prompts are added, where we're making the model ready for a specific style of input. OpenAI is still innovating on this: they had this Model Spec document they released a few weeks ago where they said they're going to have a second level of system prompt, which just adds some structure to how the models take in data so that you can do a lot more of this fine-tuning down the line, and to how user data actually gets passed to the model, or how the developer passes information that the user doesn't see. What this can often look like is Stack Overflow or Reddit data, where you have a question at the top and then an answer. This is still, I think, a lot of what's happening behind the scenes; there are a lot of Stack Overflow datasets out there, Reddit has these data partnerships, and this still uses the autoregressive loss function that we started with. We haven't branched out into different loss functions yet, but it's still super important. A lot of academic research shows that this is all you need, in some ways, which I think is a much more mixed bag, but it's the simple method and it's the right place to start. And where we go from there is this RLHF objective, which looks really familiar to people who are trained in reinforcement learning; I think it's a little different from the NLP loss function.
On the left side is the standard reinforcement learning objective: you're learning a policy pi to maximize some reward, which is a function of something depending on how you set up the problem. Then on the right side is this KL constraint, a distance term so that the policy doesn't change too much. It's related to this whole idea of over-optimization, which I won't go into much in this talk, but the key idea is that we want to optimize a reward without over-optimizing it. The primary questions when doing RLHF are: how do we implement a reward function, what is our reward actually going to be, and then how do we optimize it? You see this abstracted later as: we train a specific reward model and then we have specific policy updates. DPO, direct preference optimization, handles this a little differently. Before we get there, the actual preference model that people use for RLHF is what I find interesting. It's the Bradley-Terry model, which comes from economics in the 1950s and is essentially a probability distribution over a pairwise choice. What ends up happening, for various technical reasons, is that when we train a preference model, it needs to output a scalar value, and by some coincidence that I think is still very convenient, they just take the output of this learned probability distribution as a reward. They say that, okay, the reward is going to be proportional to this probability, and it's going to work, and it ends up doing so. But even that is a big leap to accept. We have this pairwise probability that says how likely one answer is to be chosen over another, and then you have this somewhat crazy mental step of saying: we just pass in one piece of text and we're getting the probability that that one piece of text is chosen over any arbitrary other one. So there are a lot of assumptions and some deep concepts in here, but what we're getting is a model that gives us a score out. And the question is: why do we have to do this? What if we could just take our original objective and use gradient ascent on this equation (ascent because it's a maximum)? This is really what DPO does. I'm blurring through a ton of math; it's a great paper to learn a lot of this math of language modeling, where you learn how the probabilities of different pieces of text are handled by the model, how it ends up being a lot of these log-probability ratios, and how the prompt and the completion are handled differently. It's worth digging into and understanding the derivation. But the core idea is: why can't we just do gradient ascent, or gradient descent, to solve the RLHF optimization? It becomes incredibly simple. If you look at the code on the right, which is reference code from the original implementation, it's extremely simple to implement. And it has this characteristic where, if you've worked with something like Transformers before, it's pretty easy to write a loss function that uses DPO rather than building an entire infrastructure stack to start with. When you do something like PPO and the full RLHF stack that OpenAI uses, you normally need a whole new infrastructure stack, but you can get started with DPO in a much, much simpler way.
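Since the reference implementation from the slide isn't reproduced here, below is a minimal sketch of the DPO loss in PyTorch, assuming you have already computed summed log-probabilities of the chosen and rejected completions under the policy and under the frozen reference model; the function and variable names are illustrative, not taken from the original code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch.

    Each argument is a tensor of summed log-probabilities of a completion
    given its prompt (shape: [batch]). beta controls the strength of the
    implicit KL constraint to the reference model.
    """
    # Log-ratios of policy vs. reference for chosen and rejected completions.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO objective: push the chosen log-ratio above the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # The "implicit rewards" (up to a constant) that DPO assigns, useful for logging.
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards
```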
And there are some characteristics that I'll get to later, which is that DPO still has a reward model; that's really important for the math to actually check out, where you're using your original language model as a different type of reward model. But that quickly takes us down a whole bunch of derivations that is probably not the most fun lecture to give. And the key thing, which is why this lecture is called what it is, is that the first two points mean we'll see more DPO models than anything else. DPO is where everyone will start if they want to do alignment research, and for good reason: it is the right place to start if you're thinking about doing this. It scales more easily on compute, it's easier to debug, and it's even easier to learn. So it's not really worth second-guessing that, and it is a good place to start. But it also leads to these ridiculous conversations online where everyone is trying to figure out whether DPO is better than other RL methods: PPO, which is an older popular deep RL algorithm that John Schulman wrote, or REINFORCE, which is a slightly different parameterization of policy gradient; they're very similar. DPO ends up being much simpler, just simpler to work with. So there's this meme that if you just do gradient descent, it'll work. In reality they're different loss functions doing very different things, but you can get similar results with both of them, which is why, if something is much easier to do, you should just start with it. And I come back to this much later in the talk: what is fundamentally different about these RL algorithms, how your data is processed, and where the signals actually come from. But for now, we don't need to say one versus the other; we can do both, and they are different. So that's a quick 101 on what the core ideas are. I'm going to take a path through how we actually got to training models with DPO; this slide was from a different talk that this subsection is reduced from. DPO really came out months before we started getting popular models trained with it. So how did we actually get to the point where the community was training models with DPO, which is much more recent than when the paper was released? And this goes all the way back to the first instruction-tuned models that you saw: the Alpaca, Vicuna, Koala, and Dolly of the world, all in April of 2023. These are all built on similar things with slight iterations, figuring out how to use synthetic data and building off the first LLaMA release. There are some other things that I'll talk about, but this is where we started. They're all using instruction tuning, and most of them use synthetic data. What Vicuna actually did was use this thing called ShareGPT, which was the first time that people working in this academic alignment space had access to data that was from humans. It ended up being a bit of a legal gray area, because it was logging data from people who used a Google Chrome extension called ShareGPT, which gave ChatGPT a share button. But this data was really important for things like Vicuna and a lot of the other models that came down the line, and it is still used in models today as one subset of the training dataset.
So just having access to these human prompts unlocked a lot of potential back in the day, and it is still something that we're seeing. Thankfully, now we're starting to get datasets like this that were collected in more permissive ways. This kind of LMSYS data has prompts that are collected with consent, and WildChat, which was a project from AI2, essentially gave people free access to ChatGPT in exchange for their data. The thing that came after ShareGPT was the realization that we need more human data. This Open Assistant project is one that we honestly need more of; the fact that we haven't seen more projects like it shows how hard it is to create human data. It was run by a few people in a Discord community working extremely long hours to generate prompts, responses, and preference pairs for common requests to language models. This was from April of 2023, and we haven't seen anything like it since. ShareGPT or LMSYS data is similar, but it doesn't have the same level of controls and voting and ranking that went into this Open Assistant data. And it, again, is a dataset that we're still training models with, that many people are still training models with, and it comes up time and time again. So these one or two influential datasets from over a year ago are still what get used to train models; you'll get the theme as I keep going. There were actually RLHF models trained in April of 2023 as well. This was from Carper AI, which was doing a lot of work in the space; they've fallen back a bit in recent times, but there were people doing methods similar to what I'm going to talk about at the end of the talk. That knowledge and infrastructure was not translated into things that were easy to use. So there's also this vein of: even if things are open, it doesn't mean they're going to immediately catch on and be useful. You have to have the resources, the data, and your codebase set up in a way that people can build on, which is what DPO did really well. This RLHF model from Carper was successful, it was better than the Vicuna model, but no one really built on it right away, which I always find confusing. Then later in the year, another key thing for open alignment was the Llama 2 backlash: when Llama 2 was asked to kill a Linux process, it tended to refuse. This bred a whole series of models which are still referred to as "uncensored," which I don't think is the best name, because I don't think there was ever actually any intentional censorship of the model. But the goal is to make models that don't refuse any request, which is useful as a research artifact: what do you get out of a model if it answers every question? What are the limits in that regard? There are other ways to use that, which are up to you. What ended up happening is that a lot of these ShareGPT datasets, because they're from ChatGPT, have data that says, "Oh, as a language model, I shouldn't answer that," so people started filtering all of that out. You still see a lot of people releasing these uncensored models today as a popular area of development. I think we should understand what people need when doing research, and researching a model that doesn't refuse is reasonable, but if you're going to deploy a model for free use by users, you should consider whether or not everything should be answered.
So as a researcher, how your artifacts are used kind of depends on the work that you're actually going to be doing. Then in the alignment space there's this long series (I'm almost done with the timeline) of models that are really interesting to people like me but never really broke through the narrative, where they say things like "we used RLHF," or they were the first model to beat GPT-4 on AlpacaEval and these other eval tools. They're scaling things up, but they don't always have papers, and they don't always have codebases. Things were happening beyond just the Hugging Faces of the world; there are a lot of different organizations in the US and elsewhere that were aligning models and getting similar numbers to, or beating, these mainstream tech companies and the places you usually look to for models. These are all in the summer of 2023, and I bring them up because this comes before the first big splash of DPO. This Zephyr model was really the first model that I remember making a splash with DPO, and it took until this time, September, after the May release of the paper, for people to really say, oh, DPO is the real deal. It took four months. And now the paper has a best paper award, everyone uses it, there are tons of derivations. But in industry, among people trying to train models, there was a lot of skepticism until this moment. So this is a classic academic story of needing to wait a bit until your work is vindicated in some ways. The two crucial things here were a new dataset, the UltraFeedback dataset, which is a dataset of synthetically generated text labeled by GPT-4 (so again, a new way of making data; it's a preference dataset; we didn't make it, it was made by OpenBMB, and I think they're based in China, I should know more), and then we also just had to do a lot of experiments to make it work. There's a weirdly low learning rate that was needed to make this kind of chat model work with DPO, which is 5e-7. If you're really plugged into AI, you'll know the lore that 3e-4 is the best learning rate, so this is many orders of magnitude lower. That's kind of what it took to get this to work. We probably could have done it months earlier if we had just done more hyperparameter search, but this is the random happenstance of the stories that people now backcast as being super important; it's somewhat random. At the same time, I was switching jobs to the Allen Institute, and they were already working on this project trying to do a systematic study of instruction-tuning data along with some of these preference-tuning recipes that were coming out. Because once this Zephyr model came out, there were always skeptics saying, oh, doing it at 7B is easy, that's a small model; can you actually scale to the real deal, to bigger models, to what ChatGPT does? So, okay, we have some more compute, and we tried it at this 70-billion-parameter scale and showed similar gains. All we did was use the same UltraFeedback recipe and the low learning rate, and it largely works. This was within two months.
And since then there are tons of new DPO models; all these startups releasing their own models release an instruct version, and that is a DPO thing. That continued for six months. Just now I'm starting to see fewer DPO models, which is interesting; I've been keeping track of them for another evaluation project, and it has finally slowed down a little bit. I don't know if that's alignment at large, but there are so many that I should add a slide listing the ridiculous number of DPO models that came after these two. But this is really when the floodgates started and when people accepted that DPO really works. So this is why I ask what comes next. We could retrain models on datasets, but we don't have that many datasets, and it kind of feels like we're fishing in the dark. Zephyr was built on the happenstance of needing the low learning rate. This Tulu 2 model is actually trained on TPUs because we have the Google TPU Research Cloud, so we have bigger TPUs to train these models. So how do we do this more systematically? That is where most of what I talk about today on the technical side comes from: the recent research that we've been doing to make sense of this and answer the fundamental questions of what we need to change about DPO, whether PPO is better, and so on. So this is the reality I go back and forth between: we don't really have the human data to do RLHF like industry, but it is getting much easier to do alignment research. You can choose your narrative. Sometimes, because I'm so close to industry and hear about what people have, I land too often on the pessimistic side, but there is a lot of opportunity to do things. It feels crowded, but being crowded at this point, when there's so much investment, just means you're in the right area, and most people in this room aren't trying to be professors, so if you get scooped, it's okay; I find it very fun. So how do we actually understand what we're doing with alignment, and can we improve on these models? We have Tulu 2; it has a number because we want to keep releasing more models. So how do we get better at evaluating what we're doing, to try to understand this process, and then how do we train better models? These are the sorts of things I'm up to, and I have a few examples of things I've been working on. I built an evaluation tool for reward models, so I'll talk more about reward models to start. We need better evaluation because, when you're training models, you need to be able to do what I call local evaluation: you need a number that tells you if your training technique is improving the end result. You can't wait until Chatbot Arena evaluates your model, because that takes about a month to get your numbers back. You need to be able to run something at your desk that gives you signal on whether you're actually doing a good job. We're still pretty behind on those evaluation tools, though more are coming, which is promising. And then, given DPO's simplicity, can we actually improve on it, and can we check on some of the industry rumors that they've let it drift aside? So RewardBench is this project that I started because there were no evaluation tools for reward models. My motivation was mostly transparency.
Given how much industry says reward models are what you need to focus on, and that they're really important for getting good models out the door, what does that mean? What does it mean for a reward model to be good? If we look at this kind of feedback diagram, which is the one homage to my RL background, a reward model sits inside a feedback loop: the agent is your actual language model, pi is the policy, and the training data is the prompts that you get. In this RLHF framework, the policy generates something, a, which is the action, which is the completion; it goes to the reward model, which then scores it. On the side you're looking at all these evaluation tools, and none of them give us internal insight into what's happening in this feedback loop; they seem external to what we're doing when we're training these models. So we really wanted to zoom in on the reward model. Reward models are trained in another kind of weird way, one of the many quirks of RLHF. In order to train a reward model, you need to collect this pairwise preference data. If you use ChatGPT a lot, you'll sometimes see it give you two answers and ask which one is better. This data is literally what is used to train a reward model: it's a prompt and then two completions, a chosen completion and a rejected completion. But in order to train these models, you have to pass both of them in at the same time. You pass both in, and it gives you two scalar values; you use a language model that outputs a scalar, just by some modification of the last layers, rather than outputting text. And then the loss function I'll show you on the next slide is essentially why you need this batch-mode idea, where you pass multiple things in at once and get multiple numbers out. In this loss function, r is the output directly from the reward model for the rejected completion and the chosen completion, so you're trying to separate the distance between them, and then automatic differentiation updates the parameters so that this distance gets bigger. So you can't just do supervised learning directly on one example for the reward model. There are alignment methods researching that now, but it's really built on this idea of separating two things and creating a margin in the preferences to learn the decision boundary. There are a lot of really specific details in industry, such as these models only being trained for one epoch, and they get really low accuracy scores compared to other train/test setups in machine learning. And there are some additional tweaks people do: you can do ensembles, Llama 2 did this weird margin loss, but none of it is really transformative in how these models are trained. They're in this weird place where you can only get about 70% agreement with your annotators. It's the sort of question of whether the noise is part of the signal or a bug. In preferences, it could make sense that it's signal, because not everyone's preferences are the same, so not getting full agreement might mean the system is working; we don't want ChatGPT to be fully narrow-minded all the time. And then there's this slow-burning question of how we actually evaluate these reward models that I was talking about. I hear all the time that reward models are crucial to RLHF.
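To make the pairwise loss just described concrete, here is a minimal sketch of the Bradley-Terry-style reward-model loss in PyTorch; the scalar-head model and the variable names are assumptions for illustration, not any lab's actual training code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Pairwise reward-model loss sketch.

    chosen_rewards / rejected_rewards: scalar scores (shape [batch]) produced
    by the same reward model for the chosen and rejected completions of each
    prompt, passed through the model in the same batch.
    """
    # Maximize the margin r(chosen) - r(rejected); equivalently, minimize the
    # negative log-sigmoid of that margin (the Bradley-Terry log-likelihood).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```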
But how do we know exactly which aspects of the final policy these reward models are improving? Should we include safety in these reward models? How do scaling laws affect reward models? These are basic machine learning questions: can we evaluate these, and what should we think about? So what we did is collect a bunch of prompts, and then we manually created chosen and rejected answers for each prompt. Then we can see whether or not the reward model agrees with our human-created data and call that a win or a loss from an accuracy point of view. It's really direct: we're just doing inference on existing models and seeing whether they agree with human data. And this slide, if you want to go into the academic side of things, shows that this was built on a lot of existing evaluation tools that were out there. You'll see some common names, AlpacaEval, MT-Bench, things you've heard about. XSTest was on the slide when I mentioned Llama 2 being overly safe. And there are some other things that are really good but you might not have heard about, like this LLMBar dataset from Princeton, which is a bunch of tricky questions that I'll show an example of later, and some familiar names from Anthropic and OpenAI in here as well. So there are a lot of different things we're testing with this dataset, and we're trying to get the full picture of what is going on with these models. We released this in March of 2024, and you can see a key at the bottom where these red circles with an arrow in them are DPO models, which you can use as a reward model, and these dice, which look like gray squares when you zoom out, are what I described as the classifier type of training. You can see that there are reasonable scores and the benchmark isn't too saturated. A bunch of open models, some names you've seen before like the Tulu models and the Zephyr models, are on here, which is roughly what we expected. But if you look here, I'll show you where this model moved in a few months. Today we have a lot more models and a lot more information, so I get to tell you about more interesting things, like how OpenAI's and Cohere's models do on this, which, as I mentioned, is why I wanted to do this for transparency. We also added new types of models. So this is where the fifth-place model ended up: in two months, the model that was fifth on our leaderboard is now thirty-first. So we're getting saturation from people doing research in the area actually having places to compare their models. We also have models from some closed labs, and I'll get into the details here. Some of these are labeled as a different type of model, LLM-as-a-judge. LLM-as-a-judge is the idea that you can ask a language model which answer is better; this is how things like AlpacaEval and MT-Bench are built. But you can also use that as a reward model: I told you we have prompts and then chosen and rejected answers, so I can just ask ChatGPT which one is better and see what it does. And this is what we added as a baseline. This ends up being really interesting, because GPT-4 and GPT-4o are not actually as good in this closed domain as a reward model that Cohere is training. We don't have full information, because we don't have OpenAI's reward models, but we can use their models to compare.
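As a concrete picture of how the benchmark scores each model, here is a minimal sketch of the accuracy-style evaluation described above, assuming a generic `score(prompt, completion)` interface for whatever reward model (or LLM-as-a-judge wrapper) is being tested; the names are placeholders, not RewardBench's actual API.

```python
from typing import Callable, List, Tuple

def accuracy_over_preference_triples(
    score: Callable[[str, str], float],
    triples: List[Tuple[str, str, str]],  # (prompt, chosen, rejected)
) -> float:
    """Fraction of triples where the reward model scores the human-chosen
    completion above the human-rejected one."""
    wins = 0
    for prompt, chosen, rejected in triples:
        if score(prompt, chosen) > score(prompt, rejected):
            wins += 1
    return wins / len(triples)
```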
So we have a lot of different information going into one system about how language models in different parts of the alignment process handle different categories. You can go back and see that Cohere, across two different months, has improved a lot, and these earlier DPO models that we saw higher up on the leaderboard have been shifting down as more people train reward models to begin with. The specific category I'll focus on most is this Chat Hard section. If you think about evaluation a lot (it's actually a surprisingly common topic in tech coverage, how evaluations are saturating), this is the one part of our benchmark that hasn't fully saturated, and it's really important for giving the benchmark some longevity. I'll talk more about this as we go. So I mentioned this dataset, and it's interesting to ask whether you could actually do this problem. What we have is a prompt, a chosen answer, and a rejected answer. The prompt is: give an example of a metaphor that uses the following object: stars. The chosen and rejected are two similar metaphors, and you can see, if you read them, what the differences are. I'm just pausing for the people still paying attention to read them. Essentially, the chosen one is about the stars, twinkling diamonds in the sky (see, I haven't messed it up reading the slide), and the rejected one is about the moon. The prompt asks for stars, and the rejected metaphor is about the moon, which is also in the sky at night. This dataset is a whole bunch of things like this, where what they do to create it is either manually or via ChatGPT rephrase a prompt and then create a new generation from it, so you get rejected generations that are just slightly off topic. It makes sense that this would be really hard for language models, because they have this association between the stars and the moon, but we want our language models to be able to answer questions like this. And this is the type of thing where our reward model benchmark, evaluating something that is used to train language models, correlates best with what is genuinely hard. So this is promising; this is the sort of thing that, if you're in research, is really interesting. It's really in the weeds, but it shows that we still have things to learn about these models, and there are things they can't do yet. Another interesting pattern is safety. I mentioned these uncensored models, and in safety we see all the patterns we would expect in the breakdown at the top of this table. Refusals are things that we want the language model to refuse, and then this XSTest dataset can be split into things we want models to refuse and things we want models to respond to. You can see that there are multiple categories of either DPO models or reward models where a model that handles safety really well refuses things like requests for advice on causing harm and responds to things that are borderline. But there are actually a lot of models out there that just refuse everything, which hurts your score on the things it should respond to; that's the safe bet. We were seeing a lot of tech companies release models like this, and it just doesn't feel right when you talk to them.
But there are also the models that just respond to everything: the philosophy that it's not the language model's job to gate the question, which is something we hear a lot about in the discourse around alignment. Seeing it in these reward models and DPO models when probing them directly, without asking them to generate text, is nice; it confirms a lot of suspicions that we have. So this is back to some of the DPO math, which, again, is good to know. If you go into the DPO paper, you'll see equation three, which is the reward that is defined in order to make the math work. This is very different from just outputting a scalar: it ends up being a ratio of the probability under the policy relative to the original policy during training, which is called the reference model. It's a complicated mathematical representation. If you actually take a piece of text and pass it through a DPO model, the reward will be something like -200, because it's a sum of log probabilities: probabilities are between zero and one, you take the log, you get negative numbers, and you sum them all up, so you get a big negative number. That, intuitively, is the score these models provide, which is very different from the other type of reward model I talked about training earlier. And if you have a prompt with a chosen and a rejected answer, equation four is the math you actually need to do to decide whether one of the answers was better: you're comparing these probability ratios from two different models with respect to this reference model, which was the starting point of training. The question is, when people release a DPO model, they normally release one model and don't release all the intermediate checkpoints, and this reference model would be an intermediate checkpoint in the training process. So can you use a DPO model as a reward model if you don't have access to all that information? The short answer is no: the scores on our benchmark plummet across all the DPO models that we have, which makes sense, because this extra model is a regularizer on the probabilities; it's in the actual reward equation if you go back a few slides. What we do is get rid of this reference model, stop normalizing equation four, and see if it works, and it doesn't. This is important because DPO is training a reward model, but if we don't always have access to it, we can't learn from it, and we can't use it in another system as cleanly. It's just a lot to ask when getting people to release models. And this is an interesting slide showing Cohere's progress on reward models: in just a few months, they released something that was clearly state of the art on our benchmark. Then an alignment lab with this RLHF Flow work released something in May, and just a few days later Cohere sent another number: here's our new model, it's still better than everyone else. It's nice to have this academic-industry intersection, but it's very rare and takes a lot of work in terms of networking and building relationships; we're trying to do it at least in these small niches where the companies are willing to share. RewardBench 2 is going to need to mostly make everything harder and make everything more human.
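For reference, here is a minimal sketch of the implicit DPO reward and the pairwise comparison described above (the role played by equations 3 and 4 in the paper), assuming you can compute summed log-probabilities of a completion under both the DPO-trained policy and its reference model; the helper names are illustrative.

```python
def dpo_implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Equation-3-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x)).

    policy_logp and ref_logp are summed log-probabilities of the same completion
    under the DPO-trained policy and under the reference (starting) model. The
    paper's reward also includes a prompt-dependent partition term, but that
    constant cancels in any pairwise comparison. Dropping ref_logp entirely
    (using beta * policy_logp alone) is the un-normalized variant whose
    benchmark scores collapse.
    """
    return beta * (policy_logp - ref_logp)


def dpo_prefers_chosen(policy_chosen: float, policy_rejected: float,
                       ref_chosen: float, ref_rejected: float) -> bool:
    """Equation-4-style check: does the DPO model score the chosen completion
    above the rejected one?"""
    return (dpo_implicit_reward(policy_chosen, ref_chosen)
            > dpo_implicit_reward(policy_rejected, ref_rejected))
```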
And the last point, which is what I'm going to transition into next, is that everything I've told you about so far concerns part of this RLHF pipeline, but I haven't told you how it impacts the final model that you use at the end of the day, which is a very rightful criticism: if you're evaluating part of the alignment pipeline, you should be telling me whether or not the final model is actually useful. So this is where I talk about our journey into trying to train PPO models. We're trying to fine-tune a good model. We spent a lot of time on DPO with this Tulu 2 work, and we wanted to know if we could do better by switching to PPO. A lot of this is not yet published work, but it's going to be out soon, so the numbers aren't entirely final; we're just trying to disentangle the difference between DPO and PPO at a very empirical level, to answer whether it's better or not. What we're going to do is walk through a series of design decisions and see how each affects this suite of evaluations. We're starting with this Llama-2-13B model that has already been instruction tuned. The difference between the blue and the red is the gain from instruction tuning. For these reasoning, coding, and chat tasks, instruction tuning gives the biggest delta that you'll see among all of these slides; instruction tuning puts the model on the map as being useful, and it's easy to see gains at the beginning. Then it gets harder and harder to keep improving these models. So what we start with is adding this Anthropic Helpful-Harmless RLHF data with DPO, and you can see that there is a small bump across all the metrics. This dataset is known among researchers in the area as being particularly noisy, but it is the starting point when you're doing research on alignment: it's been around for a few years, it's big, it's multi-turn, but it's known to be noisy, and it still gives improvement. Then we switch to the data that was used for both Zephyr and Tulu 2, this UltraFeedback data, and we get an even bigger bump. So this is just showing the difference that changing only the data can give you. In a DPO recipe, it's normally an increase of zero to two percent, and in the research sphere, trying to ship a model, that's a big deal. Then we treaded into new territory: grad students worked really hard and implemented PPO in JAX in addition to what they already had, and we asked what happens when we add PPO. Reliably, across multiple experiments (this is one example with 13 billion parameters), PPO just happens to do a little bit better, about one percent better. And we tried to change a lot of things, and changing things is where it gets a bit messier. We've heard from industry that using a bigger reward model can be really helpful for getting a better policy model: these bigger reward models should be better at nuance, they should give better labels and better scores, which are used as rewards, and they should make the process a little more stable. We have the compute for it, and we see that it does improve some things, but it doesn't actually make the model much better overall; it's kind of flatlined with pretty similar numbers, even when just making the reward model bigger, which was a little surprising to us.
These are the most realistic few slides of the talk. We even tried to check whether our reward model training was going wrong as we scaled it up, so we used RewardBench, on the right, which I told you about earlier; it doesn't clearly say whether the 13B or the 70B reward models are better. We also did this best-of-n sampling idea, sketched after this paragraph: if you generate a bunch of completions from the language model, you can rank them with your reward model and then evaluate the top-ranked completions. That shows that our reward models are better at the bigger scale, but we couldn't get this to really click into a downstream model in a PPO view of the world. We even tried adding more prompts to RLHF: we added more code and reasoning prompts, because that's something OpenAI talks about a lot and something we want to improve our models on. It doesn't really shift the needle on this cohesive average over many tasks. What you'll see in the paper when it's out is that we added prompts really similar to the math and code evaluations, and those specific evaluations got a bit better; add in the fact that some other evaluations might go down, and this process becomes really hard to disentangle. And this is why we're getting the zero-to-two-percent improvement out of PPO, while DPO doesn't have this sort of mess. Where we ended up is that there's always one more thing to ablate when you're training these models with PPO: different regularization, learning a value function in RL, different warm-up, different size parameters. There are so many knobs to turn in PPO, and it was reliably getting us a pretty good model, but we're staring into the abyss trying to improve this over the next few months. The bottleneck on the technical side is that PPO generates new responses from the model as it trains, to refresh the data, and that is by far the biggest bottleneck when you're training these models; it's just way slower than DPO. All these resources for PPO are somewhat available to academics: the Google TPU Research Cloud is pretty available to the grad students I work with, and the codebase is open. So if you're an interested grad student trying to do PPO alignment and have access to TPUs, please get in touch; it's a very fun can of worms. As a summary, this table shows the many different DPO datasets that we tried, almost all of the well-received datasets that are out there in the open. Look at the factuality column: some of these things just don't matter at all when you're aligning these models. So we need new datasets that really add different capabilities to these models, something that matches these UltraFeedback numbers at the bottom. I'm surprised whenever I look at this, but this is where we are, and we need to keep building datasets and keep adding freshness to this system. UltraFeedback at this point is maybe six months old or so; I don't know the exact age, but for people training models, that feels old relative to what's happening. And these are the actual numbers you get when you compare DPO versus PPO, all with this 13-billion-parameter model.
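Here is a minimal sketch of that best-of-n sampling check, assuming generic `generate` and `score` callables; both are placeholders for whatever sampling and reward-model interfaces are in use, not the actual experimental code.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # samples one completion from the policy
    score: Callable[[str, str], float],   # reward model score for (prompt, completion)
    n: int = 16,
) -> str:
    """Sample n completions and return the one the reward model ranks highest.

    Evaluating the top-ranked completions downstream gives a reward-model
    quality signal that is independent of any PPO training run.
    """
    completions: List[str] = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: score(prompt, c))
```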
Coming back to the DPO-versus-PPO comparison: again, we changed the dataset, and in every case PPO comes out a little bit better on average. And this is a few grad students and people like me; this is not a big team in industry doing this. We're scraping by, and I don't know if it's worth the effort. I can see why OpenAI uses this, because we were able to get a bit more signal out of it, but it's a ton of effort for a bit better signal. I'll transition into a more open-ended discussion of this, and then we'll have questions. What about PPO is actually special? It's this generation, this online nature. Can we just change DPO to be like this, or where are the new things going to go? I had the pleasure of advising one project related to this, but the question is much more general: what is special about online data? There are multiple ways you can get new data into your RLHF process. And there's also this related question in the reinforcement learning literature of on-policy versus off-policy, which is a technical distinction that often gets looped in with these discussions of DPO versus PPO. They're related, but the reinforcement learning discussions have a much more definitional flavor to them, while in this alignment space we're more focused on whether we need to get fresh data in and how we need to label our data for language models. So I'd make a distinction between two things. The first is freshly generated data from the policy. If you zoom into a dataset like UltraFeedback, it has generations from all sorts of models: Alpaca, Vicuna, GPT-3.5, GPT-4, LLaMA. So when we train these Zephyr and Tulu models, we're incorporating information from a lot of different models into our one policy, whereas what PPO does is only generate data from your existing model and change that distribution over time. That is a very different idea of where the signal is coming from. The second thing is whether or not you're refreshing the labels over time. If I have human labelers comparing chosen and rejected, that's one data point, but I can also later take the reward model that I trained, re-score a chosen and rejected pair, and change the label. These two things, what the actual text is and when the chosen-versus-rejected label was given, are what people mean when they talk about whether something is special about "online" in RLHF. It's clear that PPO does this very differently than DPO, but we're not restricted to that. In the last few weeks (I have the dates in here: April and May of 2024) there started to be a lot of papers about DPO, PPO, online, offline, and they say broadly similar things, which is that online is important. The papers on this slide show more theoretical and closed-form experiments on what is special about online data and what performance drops if you use offline data. It's good to dig into these, but this is why I say it's nice to do research now: if you have an idea, a lot of times there are three papers that confirm the notion you have, and it's a lot easier to be confident in things if three independent institutions say something similar at the same time.
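To make the online distinction concrete, here is a rough sketch of one generate-and-relabel iteration of an online DPO-style loop, under the assumption of generic `generate`, `reward`, and `dpo_update` functions; these are placeholder names, not any particular paper's implementation.

```python
from typing import Callable, List, Tuple

def online_dpo_iteration(
    prompts: List[str],
    generate: Callable[[str], str],           # sample a completion from the current policy
    reward: Callable[[str, str], float],      # reward model score for (prompt, completion)
    dpo_update: Callable[[List[Tuple[str, str, str]]], None],  # one DPO training pass
    samples_per_prompt: int = 4,
) -> None:
    """One iteration: generate fresh completions from the current policy,
    label chosen/rejected with the reward model, then run a DPO update.
    Repeating this loop is what adds the 'online' flavor to DPO."""
    preference_data = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(completions, key=lambda c: reward(prompt, c), reverse=True)
        # Best vs. worst completion becomes the (chosen, rejected) pair.
        preference_data.append((prompt, ranked[0], ranked[-1]))
    dpo_update(preference_data)
```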
There are a lot of methods coming out where people are trying to modify DPO to actually use this online notion. I think Self-Rewarding Language Models from Meta was the first really popular one: they asked the DPO model itself, hey, which of these answers is better, in between each iteration, so they used LLM-as-a-judge to relabel their own data, then ran multiple iterations of DPO, and the model had really strong scores. There are now ideas like not using all of your data at once, so you can do batches of DPO and update your data between them. The paper that I was on, this discriminator-guided DPO that I'll talk about in a second, uses reward models plus the DPO training objective. There are just a lot of things that we can change, and I think the community, again, is in this expansion phase. I even get messages from people saying, oh, my paper was really similar to this other paper, we did it first, they didn't cite us. That's kind of the point: it's going to be like this for a little bit longer, and then hopefully by the end of the year, or in a few years, we'll be able to say, okay, this is clearly what we need to do on the methods side of things. So this is one example, D2PO, discriminator-guided DPO, which I was an advisor to, led by an undergrad researcher. The idea is comparing three different setups. A is standard DPO: you have a dataset and you apply the loss function to it. B is what we call online preference optimization, where you repeatedly relabel your data with a reward model, a lot like the self-rewarding paper I mentioned: you reshuffle your preference data based on a reward model, which adds some notion of online-ness to your data. And the third is: what if we're relabeling data and also retraining our reward model over time? We're trying to keep what our policy is doing related to our reward model, keeping everything updated in real time so that it's all lined up. The question is how much of a gain you get by retraining the reward model over time in a DPO framework. Part of why I like this paper is that it has things like closed-form tasks. The biggest question I get about alignment is how we actually evaluate it: what tasks is it good for? There's a whole philosophical discussion where I think information transformation is a valuable task: writers tell the same stories in different ways, but the best-told story, the one that resonates with people, is the one that has value. But at the same time, we're academics and we need to be able to measure things. So this paper has tasks like: your reward is counting the number of nouns in a sentence, and you're using these alignment methods to increase the number of nouns in the sentences output by the model. You can measure that a lot better, because we have classifiers that know what nouns are. You can see in this left figure that just by retraining this reward model a few times, it converges better than if you were only to relabel your preference data. It's a mouthful, but the point is that keeping your training process a little more online can improve performance.
And on the right is a more standard open-ended evaluation task where we're asking a language model like ChatGPT which answer is better (which has all sorts of problems), and we can show similar results there. The big takeaway is really these few slides: the literature is moving, we have studies showing that online is better, and people are coming up with really cool, clever ways to actually use online data. Combined with new datasets, this is kind of the DPO of this year: online methods and how they work. So this goes back to what industry is doing. I showed you this figure earlier on the left, with Claude, where you can see the little points along the lines. These are their different iterations. We don't know exactly what they're doing, but it seems a little different: the dots on these figures are new datasets from humans, rather than this redo-a-reward-model, relabel-your-data loop. This is what happens when you have access to a different kind of scale. The Llama 2 paper makes this much clearer: they say they work with an annotator, they get batches of data, and when they're generating a new batch of data, the previous model checkpoint is used for generations. They do this many times, and you can see that they're collecting new human data over and over, and each time they collect human data, a new model is trained. They're doing a lot of training updates, and the models are building on each other. And this leads into the last section and the conclusions: what did Meta do with Llama 3? This is one of the funniest blog-post sentences, the ridiculous things that they give us, and then we read the tea leaves. They say in the blog post that their approach to post-training is a combination of supervised fine-tuning, rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). People ask me, what the heck did they do? And I kind of agree, but it really goes back to this slide in my mind, which is that they're getting new data and then training a new model over time. What I think is happening at each one of these points is that they tried a few methods and chose the training method that worked best. It's practical; Meta is a really practical organization, especially in the GenAI org right now, and that just makes sense. At different points, your model has different capabilities and is ready to be trained in different ways. Rejection sampling, which I didn't cover here, is the simplest training method: you take a reward model, you rank some supervised fine-tuning outputs, and then you use this autoregressive loss function again. From there, DPO is much simpler than PPO, but it might not give you the highest end performance. And then as your model really starts kicking into gear, or once all of your data is collected and you're not on a weekly time crunch and have more time to train, you can experiment with all the little knobs of PPO and really try to get the best model out. At the end of the day, hopefully they release a technical report that confirms some of my hypotheses. I think this is normally what people are interested in when somebody from industry comes to give a lecture, and I wish we had more details on what industry was doing.
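For completeness, here is a minimal sketch of the rejection-sampling recipe just described: rank sampled outputs with a reward model, keep the best, and fine-tune on them with the ordinary autoregressive loss. The `generate`, `score`, and `sft_update` interfaces are assumptions for illustration, not Meta's actual pipeline.

```python
from typing import Callable, List, Tuple

def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str], str],        # sample one completion from the current model
    score: Callable[[str, str], float],    # reward model score for (prompt, completion)
    sft_update: Callable[[List[Tuple[str, str]]], None],  # autoregressive fine-tuning pass
    samples_per_prompt: int = 8,
) -> None:
    """One round of rejection sampling: keep only the highest-reward completion
    per prompt and fine-tune on those pairs with the standard SFT loss."""
    best_pairs = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(samples_per_prompt)]
        best = max(completions, key=lambda c: score(prompt, c))
        best_pairs.append((prompt, best))
    sft_update(best_pairs)
```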
But in terms of current directions that I'm most interested in for RLHF: I talked about data a lot. We are very bottlenecked on data, even as academics with very limited compute. We literally try every data set that is available. It's not that we don't have a lot of compute, but we need to keep innovating there. We're going to see more DPO methods; they're here to stay. There's a ton I didn't cover here, things like removing the reference model, changing the loss function slightly, not using pairwise preferences but single-sided preferences. There's a lot going on there. We should use more model sizes than 7 and 13 billion parameters, or in Llama's case, 7 and 70 billion parameters. Particularly, scaling down is very useful and is a place where academia can still play. There's less of a weird marketing dynamic, where all the companies are racing to go bigger for certain strategic reasons, and this is something that's accessible to many people. Aligning small models is hard because it's hard to get signal out of them: the models show more or less random scores, or really low scores, on many benchmarks that people care about. So even just breaking through in that domain would be really impactful work and would get more people working on alignment. And then evaluations I covered at length: we need to keep getting more specific on the things we care about. And personalization is something in alignment that I didn't cover in this talk, but it is a good way to compete with big tech: how do we train models that are good for you as an individual, rather than one big model for one big technology organization? These slides will get to you, but these are the types of places that I follow when I'm trying to see open models or open data sets that are reputable and easy to keep track of, so you don't have to try to follow everyone. I write about this a lot, without doing too much self-promotion. I ended about ten minutes early for questions, which I'm happy to take in a Q&A format, and you don't have to stay and wait if you don't want to. speaker 1: Okay. Thank you, Nathan. Any questions? speaker 3: Suppose you have a good reward model, which is a large assumption, I agree. But what is the key challenge to doing online DPO in that sense? You can do N rollouts, rank them using your reward model, and then iterate. So what is the hard thing? speaker 2: Yeah, I'm going to repeat the questions so that people can hear them and it gets recorded. The idea is: if you have a good reward model, what is stopping you from doing online DPO and just improving the policy from there? I think there are multiple angles to this that are both technical and kind of industry-wide. But the technical thing is that I think the prompt matching ends up being really important. What your reward model can learn is specific to the prompts. There's a technical detail where the prompts used for your policy in PPO are often exactly the same as the ones used for your reward model, which is really strange, because we talk about generalization in machine learning, but we're kind of following ourselves at the PPO stage: we're only grading PPO answers that our reward model was trained to grade, which is kind of strange.
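A hedged sketch of the loop the questioner describes: roll out from the current policy, rank with a reward model, and run a DPO update on the fresh pairs. All callables here are stand-ins of my own, not a specific framework's API.

```python
from typing import Callable, List, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def online_dpo_round(prompts: List[str],
                     generate: Callable[[str], str],
                     reward: Callable[[str, str], float],
                     dpo_update: Callable[[List[PreferencePair]], None],
                     k: int = 4) -> None:
    """One round of online DPO: label fresh rollouts with a reward model, then update."""
    pairs: List[PreferencePair] = []
    for p in prompts:
        rollouts = [generate(p) for _ in range(k)]
        ranked = sorted(rollouts, key=lambda c: reward(p, c), reverse=True)
        # Best vs. worst rollout becomes the (chosen, rejected) pair for this prompt.
        pairs.append((p, ranked[0], ranked[-1]))
    dpo_update(pairs)  # one DPO optimization pass on the freshly labeled data

# Lambert's caveat: `reward` is only reliable on prompts close to the distribution it was
# trained on, so the prompt set here has to match the reward model's training prompts.
```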
So people think that some of that might break down, and we see some of that when trying to train PPO models with off-the-shelf reward models, which is kind of a long answer. But I think it's mostly distribution matching, if I had to guess. If we truly had a good reward model, it should work for some things. And that could be one of the reasons why there aren't that many in the open: it would kind of help people catch up in alignment. If the reward model is as important as people say it is, it might make that easy. speaker 3: Will alignment methods always be these pairwise structures? I guess there are architectures you can use, but is there something more complicated than pairwise preferences? speaker 2: Yeah, I think this is a whole conversation, so if I don't cover it and you want more after I answer, you can come up. But the question is: is there more than pairwise preferences that can be used in RLHF? There are a lot of different lines of work studying this. One is methods. There's a method out of Stanford called KTO; I always mess up the name, these names are so hard to pronounce, but it's the idea of using one-sided preference data. A lot of customer apps have something like, did you get good support from this agent, yes or no? And you could use data like that. It's just a different loss function for using single-sided preferences, or just a yes or no. There are other things like learning to rank from multiple answers. This is something I slightly insinuated, but binary preferences are only part of it; there's a lot of literature on learning preferences. One of the models that came out of this is the Starling model, and they use a K-wise preference: they have something like five or nine answers to every prompt, they collect the answers, and then they have a different loss function. This is one of the models that has kind of broken through in the open alignment space; it's one of the few that I left in my slide deck but skipped over. So that's kind of interesting. And then there's other research on fine-grained preferences, so for every completion to a prompt you get labels like conciseness, helpfulness, honesty. There are a few things in that regard: there's the SteerLM paper from NVIDIA, and there's work from UW on learning from fine-grained preferences. That one is probably emerging the most in the academic sense. But there's so much to learn here; there's literally the whole field of social choice that needs to get condensed into these things. speaker 1: Any other questions? speaker 3: Okay. speaker 2: Yeah. So the question is, broadly, how can we exceed human performance with fine-tuning, or any training for that matter? I think this is where some older ideas in CS will come back. I think one of the foundational ideas in CS is search, which is also really the motivation for exploration in RL. And therefore, we need to have language models that can search and generate new data. I was talking with somebody before, a grad student. I think that search will be a large part of synthetic data, but then the human aspect will be what gets it across the line when models can't solve a certain area.
And the Q* rumors are ridiculous, but that seems to be the best argument for the sort of thing that OpenAI is trying there: how to get that barrier broken with AI. speaker 3: Thank you so much for coming in. You mentioned data sets being a big limitation, and I was curious how one goes about creating a data set. speaker 2: Yeah, this is another thing that's hard. I think community efforts are what people have tried. I mentioned OpenAssistant, but most people that do a community effort are like, I never want to do this again. Still, I think it's worth doing things once that are highly impactful, even if you might not want to do them again. Other avenues for building these in a sustainable manner are very important. There are some ways this is being done; for example, Chatbot Arena returns some of the prompts and labels to users. I have specific concerns with that data around it being too noisy, but that is the sort of thing that can happen. If AI2 has a demo for its models, it's going to be about science and generating information rather than being a ChatGPT competitor. It's a nonprofit; it can't do a product competitor. But that's the sort of data that we would want to release, and something that I might just have to do. I'm also interested in academic workshops and competitions as a ground where communities could meet every three, six, or eight months and have work that's focused on the area, or focused time for people to contribute to it. But it's a good question. It's probably why there aren't very many. speaker 3: Yeah, how do you feel about reward models? Aren't they subject to reward hacking as well? speaker 2: Can we take the person at the front first, and then we'll come to you? speaker 3: For the various places you've done research at over the years, do you have any sense of how they compare in terms of alignment research specifically? Obviously, you weren't doing alignment research specifically at all of them. speaker 2: I think generally it represents the different cultures and investments of the companies. I wasn't doing language models until my time at Hugging Face, so I can really only speak to these two open companies. From the Hugging Face perspective, the goal is to show that more people can do this: we're not trying to compete with ChatGPT, but we're trying to foster an ecosystem of doing this. And AI2 is similar, but more about what is happening: how do we learn about this, how do we do the science of this and communicate it clearly? And I'm sure if you do the exercise, you can map this onto every company: what is their important thing? They have different goals in their products and their corporate structure and things like that. I will talk more when we're not recorded. speaker 1: Okay, up the back. speaker 3: So aren't reward models also subject to reward hacking? Like the model gives a good reason for the outcome, but in reality the outcome was not what you expected. speaker 2: Yeah. When talking about reward models, this is probably the most established line of work. The question is: are reward models subject to reward hacking? And reward hacking is a classic problem in RL.
I should bring back the slide from my RL decks where you have the boat going in circles, because this happens to your language model too. There's a lot of research to mitigate it, but it's a fundamental problem: you have a very powerful optimizer and an incomplete representation of your reward, and it will always find where your representation of the reward is wrong. So we will always be doing the best we can, but saying it's perfect is just not possible in the math. I can also say that the ways it fails are pretty funny, because if you train these models, you'll end up with a model that just says JavaScript to every answer, over and over. Sometimes it's really easy to see when that is happening, which is good. Or you could change your loss function so that it will always exploit, and that's a good way to make sure that things are working: you should be able to easily exploit if you turn the brakes off. speaker 1: Any last public question? If not, thank you to Nathan for the talk. And if there's anything you'd like to ask off the record, he'll be here for a bit longer.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture, given by Nathan Lambert of the Allen Institute for AI (AI2), takes a deep look at the state and future directions of large language model (LLM) alignment research after Direct Preference Optimization (DPO). Lambert first reviews the development of LLMs and alignment techniques (especially the path from RLHF to DPO), emphasizing the growing importance of the post-training stage. Thanks to its simple implementation, ease of debugging, and good scaling behavior, DPO has become the default starting point for alignment research, especially in academia and for teams with relatively limited resources.
The lecture then turns to the key challenges of alignment research, such as obtaining high-quality data and building evaluation methods. Lambert introduces RewardBench, an evaluation tool his team developed, and compares the empirical results of DPO and PPO, underscoring DPO's practicality. Online learning is highlighted as a key direction for improving model performance, a trend also reflected in Meta's mixed strategy for Llama 3. Finally, the lecture looks ahead to future research directions including data innovation, method evolution, small-model alignment, evaluation systems, and personalization.
Lecture Introduction and Background
- Speaker introduction: Nathan Lambert, PhD from UC Berkeley, formerly at Hugging Face, now at the Allen Institute for AI (AI2). His research applies reinforcement learning to language models, particularly post-training techniques such as RLHF and DPO.
- Lecture topic: "Life after DPO," examining the state and future of alignment research now that DPO has become a mainstream alignment method.
- Core background: the post-training stage is increasingly critical in LLM development. Industry players such as Meta use enormous amounts of data for post-training fine-tuning; for example, the data Meta purchased for Llama 2 (roughly 1.5 million comparisons) far exceeds the accumulated data of academic resources such as Chatbot Arena (roughly 800,000 data points). This highlights the resource gap between academic research and industry, and the need to explore different research paths.
A Brief History of Large Language Models and Alignment
- Lambert briefly reviews the development of language models, from Claude Shannon's foundational work, to the broad adoption of the autoregressive loss function, to the push from deep learning.
- Key milestones:
  - 2017: the Transformer architecture is introduced.
  - 2018: GPT-1, ELMo, BERT and similar models appear, laying the groundwork for language processing and embedding generation.
  - GPT-2 and scaling laws become a research focus.
  - 2020: the practical value of large-scale pretrained language models becomes apparent, drawing broad attention across AI.
  - 2021: the "Stochastic Parrots" paper warns of models' potential biases and limitations before ChatGPT appears.
  - Late 2022: ChatGPT is released; originally envisioned as a small demo at OpenAI, it quickly draws global attention.
- The importance of RLHF: Lambert asks, "Could ChatGPT exist without RLHF?" He argues that pretraining is the foundation, but RLHF and other human-centered fine-tuning techniques are "necessary but not sufficient" for top-model performance. Anthropic's Constitutional AI paper and Meta's Llama 2 technical report both stress RLHF's effectiveness, the latter stating outright that RLHF "proved highly effective, particularly given its cost and time effectiveness."
Core Concepts of RLHF and DPO
- Definitions of related terms:
  - Instruction fine-tuning (IFT): training the model to follow instructions, making it more useful and easier to interact with; closely related to RLHF.
  - Supervised fine-tuning (SFT): fine-tuning that leans more toward domain-specific adaptation.
  - Alignment: training the model to match user expectations; a relatively fuzzy concept.
  - RLHF (Reinforcement Learning from Human Feedback): one specific alignment tool that uses human feedback data.
  - Preference fine-tuning: a term Lambert has tried to popularize, arguing it is clearer than RLHF, especially in the DPO setting.
- Instruction tuning:
  - Still the foundation of much alignment work; introducing "system prompts" prepares the model for particular styles of input.
  - Commonly uses data such as question-answer pairs from Stack Overflow and Reddit.
  - Still trained with the autoregressive loss function.
- The RLHF objective (written out below this group of bullets):
  - Formally similar to the standard reinforcement learning objective: learn a policy \( \pi \) that maximizes a reward \( R \).
  - Includes a KL-divergence constraint that keeps the policy \( \pi \) from drifting too far from the initial reference model \( \pi_{\mathrm{ref}} \), to avoid over-optimization.
  - Core questions: how do we implement the reward function, and how do we optimize this objective?
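A standard way to write this KL-constrained objective (the form commonly used in the RLHF and DPO literature; here \( \beta \) is the KL penalty weight and \( x \) ranges over prompts):

$$
\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[\, R(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$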
- The preference model:
  - The preference model commonly used in RLHF comes from the Bradley-Terry model of 1950s economics, which describes a probability distribution over pairwise choices (the formula is given right after this group).
  - Technically, a preference model has to output a scalar. In practice, the output of the learned probability model is used directly as the reward signal; Lambert calls this a "huge jump," but it works.
  - The model takes a piece of text as input and outputs the "probability" (or score) that it would be chosen over an arbitrary other text.
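The Bradley-Terry form behind this, where \( r(x, y) \) is the scalar score for completion \( y \) given prompt \( x \) and \( \sigma \) is the logistic function:

$$
P(y_1 \succ y_2 \mid x) \;=\; \frac{\exp\!\big(r(x, y_1)\big)}{\exp\!\big(r(x, y_1)\big) + \exp\!\big(r(x, y_2)\big)} \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big)
$$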
- The rise and core idea of DPO (Direct Preference Optimization):
  - Core question: why can't we directly optimize the original RLHF objective (i.e., the likelihood under the preference model) by gradient ascent?
  - DPO is built on exactly this idea: optimize directly on preference data. Lambert stresses that DPO's derivation, involving text probabilities and log-probability ratios, is worth studying closely (the resulting loss is written out right after this block).
  - Key advantages:
    - Simple implementation: compared with methods like PPO that need full RLHF infrastructure, the DPO loss is easy to implement in existing Transformer frameworks.
    - Easy to debug and to learn.
    - Better computational scalability.
  - Lambert notes: "DPO still has a reward model, which matters for the math to be correct; it effectively uses the original language model as a different kind of reward model."
  - Conclusion: DPO is an ideal starting point for alignment research, and it has sparked debate over whether DPO beats PPO and other RL methods. Lambert's view is that they are different loss functions doing different things that can reach similar results, so one should start with the simpler method.
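For reference, the DPO loss from the original paper, over preference triples \( (x, y_w, y_l) \) with chosen \( y_w \) and rejected \( y_l \). The implicit reward is \( \hat{r}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \) up to a prompt-dependent constant, which is the sense in which the language model itself plays the role of the reward model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$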
The Path of DPO Models in Practice
- The lag between the DPO paper and its practical use: popular models trained with DPO only began to appear several months after the paper was published.
- Early instruction-tuned models (April 2023): Alpaca, Vicuna, Koala, Dolly and others were built on similar techniques and iterations, mostly using synthetic data and building on the first Llama release.
  - ShareGPT data: Vicuna used ShareGPT data, the academic alignment community's first exposure to real human interaction data (collected via a Chrome extension that logged users sharing their ChatGPT conversations, a legal gray area). These human prompts had a major influence on many later models.
  - There are now more datasets collected with proper consent, such as the LMSYS data and AI2's WildChat project.
- The importance and acquisition of human data:
  - The OpenAssistant project (April 2023): a community-driven effort that put enormous work into generating prompts, responses, and preference pairs. Its data is still widely used today, underscoring how hard and how valuable creating high-quality human data is.
  - Early RLHF models (April 2023): groups such as CarperAI were already training RLHF models and getting results better than Vicuna, but because of resources and how open the codebases were, these did not catch on immediately.
- The "Llama 2 censorship backlash" and "uncensored" models: Llama 2 drew a backlash for refusing instructions such as "kill a Linux process," spawning a series of so-called "uncensored" models.
  - Lambert thinks "uncensored" is not the best name, since Llama 2's refusals were probably not deliberate censorship.
  - The purpose of these models is to study what models are capable of when they answer every question, which has value as a research tool.
  - Many such models are trained by filtering out refusal-style responses in ShareGPT data, such as "As a language model, I should not answer..."
- The Zephyr model (September 2023):
  - In Lambert's recollection, the first model to use DPO and have broad impact, making the community truly aware of DPO's potential (roughly four months after the paper).
  - Key factors:
    - A new dataset: UltraFeedback, a preference dataset of synthetically generated text labeled by GPT-4 (created by OpenBMB).
    - Experimental tuning: discovering that an extremely low DPO learning rate (e.g., 5e-7, far below the usual 3e-4) is crucial for training chat models.
- The Tulu 2 model (about two months after Zephyr):
  - Done at the Allen Institute for AI (AI2) to systematically study instruction-tuning data and the emerging preference fine-tuning methods, trained on TPU resources (via Google's Tensor Research Cloud).
  - Extended Zephyr's DPO recipe to the much larger 70-billion-parameter Llama 2 70B, using the same UltraFeedback data and low learning rate, demonstrating that DPO scales.
- After that, DPO models proliferated. Lambert notes that only recently (as of the lecture) has the growth of DPO models slowed somewhat.
Research Directions in the Post-DPO Era: Evaluation and Improvement
Lambert points out that despite the abundance of DPO models, it feels like "fishing in the dark," because high-quality datasets remain scarce. How to do this research more systematically is the current problem.
- Core tension: academia lacks the large-scale human data that industry has, but the barrier to doing alignment research is dropping.
- Research questions: how do we understand the alignment process, how do we improve models, and how do we evaluate effectively?
- Reward model evaluation tool: RewardBench
  - Motivation: industry stresses the importance of reward models, but there is no transparent evaluation tool for measuring how good a reward model actually is.
  - The RLHF feedback loop: the reward model plays a central role in the generate-evaluate-update loop around the policy, but existing evaluation tools mostly look at the final policy and offer little insight into the reward model itself.
  - Reward model training: typically uses pairwise preference data (a prompt, a chosen answer, a rejected answer), with a loss that pushes the two scores apart. These models are usually trained for only one epoch and agree with annotators roughly 70% of the time; this "noise" may reflect genuine diversity of preferences.
  - The RewardBench method: collect a set of prompts, manually create "chosen" and "rejected" answers for each, then test whether existing reward models agree with the human judgment and report accuracy (a minimal version of this accuracy check is sketched at the end of this subsection).
  - Dataset composition: built from many existing evaluation tools and datasets, including AlpacaEval, MT-Bench, TruthfulQA, HellaSwag, MMLU, BBH, ToxiGen, HH-RLHF, RealToxicityPrompts, Koala, Self-Instruct, Vicuna, LIMA, LongForm, Chatbot Arena prompts, XSTest, and LLMBar (Princeton's dataset of tricky comparisons).
  - Main findings (as of March 2024 and later updates):
    - The leaderboard saturates quickly: a model ranked fifth at launch (March 2024) dropped to thirty-first two months later, showing how active the area is.
    - Closed models: reward models from OpenAI (GPT-4 as a judge) and Cohere can also be evaluated. GPT-4 and GPT-4o are not the best at this task; Cohere's purpose-trained reward models do better.
    - DPO models as reward models: a DPO model can itself be used as a reward model. Its reward is the sum of log-probability ratios of the policy's outputs relative to the reference model, usually a large negative number. However, using a DPO model accurately as a reward model requires access to the reference model used during training (often an intermediate checkpoint); if only the final DPO model is released, its performance as a reward model drops substantially.
    - The importance of the "Chat Hard" category: in RewardBench, "Chat Hard" (which includes tricky questions like those in LLMBar) is the only category that has not fully saturated, and it is crucial to the benchmark's long-term usefulness. An example is distinguishing "a metaphor about the stars" from "a metaphor about the moon," which requires fine-grained understanding.
    - Safety patterns: reward models behave as expected on safety-related prompts. Some handle safety well (refusing harmful requests while responding to borderline ones), some tend to refuse everything, and some "uncensored" models tend to respond to everything.
    - Cohere's progress: reward models Cohere released within a few months reached a clearly state-of-the-art (SOTA) level on RewardBench.
  - Outlook for RewardBench 2.0: make the evaluation harder and lean more on human evaluation. The key is to connect reward model evaluation to the performance of the final policy model.
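A minimal sketch of the accuracy metric referenced above (my own toy code, not the RewardBench implementation); `score` stands in for any reward model that returns a scalar for a prompt-completion pair.

```python
from typing import Callable, List, Tuple

def pairwise_accuracy(triples: List[Tuple[str, str, str]],
                      score: Callable[[str, str], float]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the reward model
    scores the human-chosen answer higher than the rejected one."""
    correct = sum(score(p, chosen) > score(p, rejected) for p, chosen, rejected in triples)
    return correct / len(triples)

# Toy example: a length-based "reward model" evaluated on two hand-made pairs.
toy = [
    ("Write a haiku.", "An old silent pond / a frog jumps in / splash! Silence again.", "No."),
    ("Answer briefly: what is 2 + 2?", "4", "Four is the number obtained by adding two and two, which..."),
]
print(pairwise_accuracy(toy, lambda p, c: float(len(c))))  # 0.5: it just prefers long answers
```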
- An empirical comparison of DPO and PPO (based on Llama 2 13B)
  - Baseline: an instruction-tuned Llama 2 13B already shows clear gains on reasoning, coding, and chat tasks.
  - DPO results:
    - DPO with Anthropic's HH-RLHF data brings a small improvement (that dataset is considered fairly noisy).
    - Switching to the UltraFeedback data (as used for Zephyr and Tulu 2) brings a more visible improvement (roughly 0-2%). Lambert stresses that in research terms, gains of this size are already substantial.
  - PPO results:
    - On the same data, PPO usually does slightly better than DPO (about a 1% gain).
    - Effect of reward model scale: using a larger reward model (e.g., 70B) to guide PPO training of a 13B policy did not bring a clear overall gain, even though some aspects (such as reward model quality measured via best-of-N sampling) improved. This somewhat surprised the research team.
    - Adding domain-specific prompts: adding more coding and reasoning prompts to RLHF only slightly improves the corresponding targeted evaluations; it does not raise the multi-task average and can even push other metrics down.
  - PPO's complexity and bottlenecks:
    - PPO has many hyperparameters to tune (regularization, value function learning, warmup, parameter scale, and so on).
    - Core bottleneck: during training, PPO must generate new responses from the current policy (online generation) to refresh its data, which is much slower than DPO.
  - Conclusion: although PPO can bring small performance gains, its large engineering effort and resource cost make researchers question whether it is worth it, especially in academia. Lambert understands why organizations like OpenAI use PPO, since they can squeeze more performance out of it, but the process is very complex.
The Importance of Online Learning in Alignment
- What makes PPO special: online data generation.
- Online data vs. offline data:
  - Definitions of online data:
    - Data freshly generated from the current policy. PPO generates data only from the current model during training and shifts the data distribution over time, whereas datasets commonly used for DPO (such as UltraFeedback) may contain generations from many different models (Alpaca, Vicuna, GPT-3.5, GPT-4, Llama, etc.).
    - Labels refreshed over time, for example re-labeling existing (chosen/rejected) pairs with a trained reward model.
  - PPO is inherently online, while standard DPO is usually offline.
- Recent research progress (April-May 2024):
  - Several papers (theoretical and empirical) find that online data is critical for improving performance and that relying on offline data degrades it.
  - Methods for adapting DPO to use online data are emerging:
    - Self-Rewarding Language Models (Meta): between DPO iterations, the model itself judges which answer is better (LLM as a judge), relabeling the data for multiple rounds of DPO.
    - Running DPO on batches of data and updating the data between batches.
    - Discriminator-Guided DPO (D2PO): a project Lambert advised that combines a reward model with the DPO training objective. Experiments show that, within a DPO framework, retraining the reward model over time (keeping the policy and reward model updated in sync) performs better than merely relabeling preference data, especially on controllable "closed-form tasks" (such as increasing the number of nouns in a sentence).
- Lambert's take: online methods plus new datasets will be "the DPO of this year," i.e., the current research hot spot.
Industry Practice: Meta's Llama 3 as an Example
- In the Llama 3 blog post, Meta states: "Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO)."
- Lambert's interpretation:
  - Meta likely chose, pragmatically, whichever method worked best at each stage of model development, given the model's capabilities and training goals at that point.
  - Rejection sampling: the simplest training method; use a reward model to rank SFT outputs, then train on them with the autoregressive loss.
  - DPO: simpler than PPO, likely used early on or at stages where peak performance is not required.
  - PPO: used to squeeze out final performance once the model is strong or there is more time for tuning.
  - This matches the iterative process described in Meta's Llama 2 paper (collect new human data, train a new model checkpoint); after each new batch of data, different fine-tuning methods may be tried.
Outlook and Core Challenges
- Data bottlenecks:
  - Academia is still severely data-constrained; even with limited compute, researchers have tried essentially every available open dataset.
  - New datasets that add different capabilities to models, with an impact comparable to UltraFeedback, are urgently needed.
- Continued evolution of DPO methods:
  - Including removing the reference model, modifying the loss function, and using one-sided rather than pairwise preferences.
- Diversity of model scales:
  - Look beyond 7-billion and 13-billion parameter models, particularly scaling down to study alignment of small models.
  - Aligning small models is hard because they score low or near-random on many benchmarks people care about; a breakthrough here would be very impactful.
- Better evaluation systems:
  - Evaluations need to get more specific about the concrete capabilities we care about.
- Personalized alignment:
  - Train models that work well for individual users rather than one monolithic model; a potential way to compete with big tech.
- Useful resources: Lambert recommends several platforms for tracking reputable open models and datasets.
Q&A Summary
- The challenge of online DPO: even with a good reward model, the key to effective online DPO is prompt matching. In PPO, the prompts used for the policy are often the same ones the reward model was trained on, which can create distribution-mismatch problems. If the reward model is truly good, online DPO should work.
- RLHF beyond pairwise preferences: several research directions exist:
  - One-sided preference data (e.g., the KTO method): using simple "good/bad" feedback.
  - Learning to rank multiple answers (e.g., the Starling model's K-wise preferences): learning from several candidates per prompt (a standard K-wise formulation is given after this list).
  - Fine-grained preference learning (e.g., SteerLM): labeling and learning from generated text along dimensions such as conciseness, helpfulness, and honesty.
- Exceeding human performance: Lambert believes classic computer-science ideas such as search, and exploration in reinforcement learning, will be key. Language models need to be able to search and generate new data, combined with human guidance on the specific hard spots.
- Dataset creation: very hard. Community efforts (such as OpenAssistant) have had huge impact but are difficult to sustain. Organizations like AI2 may collect and release data through specific applications (such as scientific information generation). Academic workshops and competitions are another potential avenue.
- Reward hacking of reward models: a classic RL problem; because the optimizer is powerful and the reward representation imperfect, the model will always find holes in that representation. There is plenty of work on mitigation, but it is a fundamental problem. Sometimes the hacking is obvious (e.g., a model answering "JavaScript" to every question).
- Alignment research culture at different organizations: Lambert says this reflects each organization's culture and priorities. Hugging Face aims to empower more people to do alignment research and build an ecosystem; AI2 focuses more on the science of what is happening, understanding it and communicating it clearly.
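As a reference for the K-wise item above: one standard way to generalize Bradley-Terry pairwise preferences to a ranking over \( K \) sampled answers is the Plackett-Luce model. This is the textbook form under a reward \( r \); whether Starling uses exactly this loss is not specified in the talk.

$$
P\big(y_{(1)} \succ y_{(2)} \succ \dots \succ y_{(K)} \mid x\big) \;=\; \prod_{k=1}^{K} \frac{\exp\!\big(r(x, y_{(k)})\big)}{\sum_{j=k}^{K} \exp\!\big(r(x, y_{(j)})\big)}
$$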
Summary of Key Points
Nathan Lambert's lecture maps out the landscape of LLM alignment after the arrival of DPO. With its simplicity and effectiveness, DPO has become a cornerstone of alignment research, but data scarcity and evaluation difficulty remain the main obstacles. Tools such as RewardBench try to make evaluation more transparent. The comparison with PPO shows that although PPO can deliver marginal gains, its complexity limits its use outside industry. Current research is shifting toward online learning methods, exploring how to make better use of data generated by the model during training and of dynamically refreshed labels, in the hope of breaking through existing bottlenecks. The practice of industry players such as Meta also reflects a pragmatic trend of combining multiple post-training techniques. Going forward, alignment research needs sustained work on data innovation, the evolution of DPO-style methods, small-model alignment, better evaluation, and personalization.