speaker 1: Good evening, people. How are you guys doing? All right, my name is Archit Sharma. I'm a PhD student at Stanford, and I'm very, very excited to talk about post-training, generally speaking, for large language models. I hope you're ready to learn some stuff, because the last few years in machine learning have been very, very exciting with the advent of large language models, ChatGPT, and everything to that extent. Hopefully after today's lecture you'll be more comfortable understanding how we go from pre-trained models to models like ChatGPT. We'll take a whole journey through prompting, instruction fine-tuning, and DPO and RLHF. So let's get started.

All right. So something that has been very fundamental to our entire field is this idea of scaling laws. Models are becoming larger and larger, and they're consuming more and more compute. This is a graph of models starting all the way back in the 1950s; it's an outdated graph, so it only shows up to 10^24 FLOPs, or floating point operations, going into pre-training these models, but the number is well above 10^26 now. You can see the way the graph is trending. And more and more compute requires more and more data, because you need to train on something meaningful. This is roughly the trend in the number of language tokens going into language model pre-training, and again, this plot is outdated. In 2022 we were at about 1.4 trillion tokens — words, roughly speaking — in language model pre-training. Does anyone want to guess where we are in 2024? That's a pretty good guess — yeah, we're close to 15 trillion tokens. The recent Llama 3 models were trained on roughly 15 trillion tokens. So just for a second, appreciate that these are a lot of words. I don't think any of us listens to trillions of tokens in our lifetime. So this is where we are right now, and I hope you were here for the pre-training lectures.

Cool. So what do we do? Broadly speaking, we are really just learning to predict text tokens, or language tokens. But what do we learn in the process of pre-training? Why are people spending so much money and so much compute? Because this compute and these tokens cost dollars, and we're on the order of spending hundreds of millions of dollars on these runs. So why are we doing this? This is basically a recap of what you've probably learned so far: we're learning things like knowledge — "Stanford University is located in Santa Clara, California" — we're learning syntax, we're learning the semantics of sentences. These are things you would expect to learn when you're training on language data broadly. You're probably learning a lot about different languages as well, depending on your text data distribution. But the models we interact with are very intelligent, so where is that coming from? We're learning very factual things with a very simple loss function — so where is that intelligence coming from? And this perhaps is the interesting bit: recently, people have started accumulating evidence for that.
When you optimize the next-token prediction loss, you're not just learning syntax, you're not just learning knowledge — you're starting to form models of agents' beliefs and actions as well. How do we know this? Again, a lot of this is speculative evidence, but it helps to form an understanding that the losses we're optimizing are not just about fitting the data; you start learning something maybe more meaningful as well. For example, in this specific case, when we change the last sentence, the next text that is predicted changes as well. Here it starts with: Pat watches a demonstration of a bowling ball and a leaf being dropped at the same time. "Pat, who is a physicist, predicts that the bowling ball and the leaf fall at the same rate." We all know how gravity works. But when you change the last sentence to "Pat, who has never seen this demonstration before," the model predicts that the bowling ball will fall to the ground first — maybe somebody who's never seen this experiment before might intuitively believe that, correct? The language model was able to predict this, and to predict this, you have to have some notion of how humans work. That's maybe something that is not obvious when you're simply optimizing to predict text.

Similarly — we're going to run through some examples to communicate that when you're pre-training these models, you're learning much more than just language tokens. You're also learning about math: you're able to understand what the graph of a circle means, what the center is, how to understand equations. Probably my favorite example, something I use pretty much every day, is that you're learning how to write code. I don't know how many of you have interacted with Copilot before, but if you have, you probably know that if you write down a few comments and a function template, it will automatically complete the code for you. It's not perfect, but it has to have some deeper understanding of what your intent is for something like that to emerge. And similarly, we have examples from medicine as well. I don't know about you guys, but whenever I have some issue, I'll probably go to ChatGPT or Claude or something to that effect and ask for a diagnosis. I don't recommend that — please don't take medical advice from me.

But broadly, the way we're seeing language models at this point is that they're emerging as general-purpose multitask assistants. And it's very strange, right? We start off with text token prediction and we're reaching the stage where we can rely on them to do many, many different things. So how are we getting there? I'm sure you're all aware of what these models are. Today's lecture is largely about how we go from "Stanford University is located in..." — this very simple pre-training task; well, it's more complicated, but in abstract terms it's not very complicated — to something as powerful as ChatGPT. Cool. I encourage you to stop me and ask a lot of questions, because there are a lot of fun examples and a lot of fun techniques here, and I want you to learn everything about them.
So the overall plan: we're going to talk about zero-shot and few-shot in-context learning. Next we'll follow up with instruction fine-tuning, and then we'll talk about optimizing for preferences — that's roughly where things are right now in the industry. And then we're going to talk about what's next, what the limitations are, and how we move on from here.

Cool. So we start with zero-shot and few-shot in-context learning. Broadly, we're going to take the example of GPT, the Generative Pretrained Transformer. This is a whole series of models that started around 2018, and up to 2020 they built GPT, GPT-2, GPT-3. So we start with this example: a decoder-only model trained on roughly 4.6 GB of text, with twelve transformer layers, trained with the next-token prediction loss. The first model obviously was not extremely good, but it started showing that, hey, this pre-training technique can be very effective for general-purpose tasks. We'll see some examples — for instance, here it's able to do the task of entailment.

GPT-1 itself was not very strong as a model, so they took the same recipe and tried to increase the model size. They went from 117 million parameters to about 1.5 billion parameters, and they scaled up the data alongside as well, going from 4 GB to approximately 40 GB of data. Pre-training is a whole melting pot of techniques and there's a lot that goes into it, but roughly, for example, here they filtered the data by the number of upvotes on the Reddit data. So this is roughly where we are.

I think one of the things that started emerging with GPT-2 is zero-shot learning. What do we mean by zero-shot learning? Conventionally in the field, when we pre-trained models, the idea was that you take a few examples, you update the model, and then you're able to adapt to a specific task. But as you pre-train on more and more data and more and more tasks, you start seeing this phenomenon where models can do a task basically zero-shot — shown no examples of how to do it. You start thinking: oh, you can do summarization, you can follow some instructions, you can maybe do a little bit of math as well. So this is where the idea of zero-shot learning started to emerge.

So how do we do zero-shot learning, or task-specific learning, with these pre-trained models? Really, the idea is that we have to be creative. We know these are text prediction models: if you put in text, they will complete whatever follows. So if we can coerce these models into completing the task we care about — maybe it's question answering — we can start getting them to solve tasks. For example, if you want to ask questions about Tom Brady, you set it up: you put information about Tom Brady in the context, then you put the question you want answered, and the model will auto-complete, in some sense. So that's one early perspective on these models: they are very advanced auto-complete models. And similarly, if you want to figure out which answer is right and which is not, something that is very useful to measure is log probabilities.
So for example, say we want to figure out what the word "it" refers to in the sentence "the cat couldn't fit into the hat because it was too big." What we can do is take the sentence, replace "it" with either "the cat" or "the hat," and then measure which version the model thinks has higher probability. That gives you an idea of what the referent is. None of this is in the training data as a labeled task — the model is simply learning to predict text — but you can start seeing how we can leverage these models to do other tasks beyond prediction. And this is more evidence of how GPT-2, with no task-specific fine-tuning and no task-specific training, simply by learning to predict text, establishes the state of the art on many different tasks, just by scaling up the model parameters and the amount of data it's trained on.

Here's a fun example. Say you have a news article you want to summarize. How do you get a zero-shot model to do it? The answer is that you put the document into the context and you simply put "TL;DR:" after it. In most of the data on the Internet, whenever you see TL;DR, a summary naturally follows. So you can get zero-shot performance on summarization here as well. Again, this model is not trained to do summarization in any specific way, and it still does really well simply because of its pre-training data. GPT-2 with the TL;DR trick lands somewhere up there, close to some of the very task-specific models.

And I think you'll see the trend: if you were Alec Radford or somebody like that, and you see these cool things emerging, your next step would obviously be, "I'm going to scale this up a little more. I'm going to make an even bigger model, train it on even more data, and we'll see how things go," right? So that's how we got GPT-3. We went from 1.5 billion parameters to 175 billion parameters, and from roughly 40 GB of data to 600 GB of data — of course, now we're in terabytes of data, and text is a very compressed representation, so terabytes of data is a lot. Go ahead.

speaker 2: Is the TL;DR supposed to follow the passage?

speaker 1: You typically put the passage first. If you've interacted with Reddit or something like that, typically somebody will write an entire post and then end with "TL;DR" — too long, didn't read — and a summary of the thing.

speaker 2: Sometimes the opposite — it comes first.

speaker 1: Oh yeah, there are situations where it also comes first. But one reason is that these are decoder-only models with causal attention, so they typically need to see the context before generating the summary.

speaker 2: I'm just curious — from my experience the TL;DR often comes first, so how is the model able to summarize after the passage?

speaker 1: There's probably a lot of data where the TL;DR comes first, but there's probably a lot of data where the TL;DR comes after as well.
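A minimal sketch of the zero-shot ideas above: comparing candidate completions by log-probability, as in the coreference example, with the TL;DR trick noted at the end. The model choice ("gpt2") and the helper function are my own illustration, not something specified in the lecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Hugging Face-style causal LM works here; "gpt2" is just a small placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of the log-probabilities the model assigns to the tokens of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                       # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # log P(token_t | tokens_<t) for every position after the first
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# Resolve the pronoun by asking which filled-in sentence the model finds likelier.
candidates = [
    "The cat couldn't fit into the hat because the cat was too big.",
    "The cat couldn't fit into the hat because the hat was too big.",
]
print(max(candidates, key=sequence_log_prob))

# The TL;DR trick is the same spirit: condition on "<article>\nTL;DR:" and let the
# model generate freely; no task-specific training is involved.
```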
Cool. So we saw zero-shot learning emerging in GPT-2, and the cool thing that emerged with GPT-3 is few-shot learning. Few-shot learning may seem slightly easier, but this is where things started getting really fun: you're starting to beat the state of the art simply by putting examples in context. So what does few-shot learning mean here? As I mentioned, the typical idea is that, say, you want to solve translation. You put some examples of translation into the context — or whatever task you're interested in, maybe a correction task — and there are no gradient updates, no learning in any conventional sense whatsoever. You put a few examples in, and that's it: the model knows how to solve the task. Isn't that crazy? You guys did the assignment on translation, right? But this is what modern NLP looks like: you put in some examples and you have the entire system.

And this is where things got really interesting: all these task-specific models that were created to be really, really good at translation or really good at summarization — let's look at this graph. You start with the zero-shot performance, in the fashion I described earlier, and you're somewhere down there. You put one example of English-to-French translation in, and you're already at a fine-tuned level. A few examples in, you're already starting to be close to the state-of-the-art models.

speaker 2: Wait, but in that graph the state of the art is really high. Isn't that the fine-tuned state of the art — the "+"?

speaker 1: The plus here, I think, refers to the fine-tuned state of the art, where a model is trained exclusively on a lot of translation data — it might be slightly different. I think the relevant comparison here is just that in-context learning starts to emerge at scale. And this, I think, is the key point — some of this is contested, just to be upfront, but there's this idea of emergence of this property as you train with more compute and at more scale. There's more recent research suggesting that if you plot the x-axis differently, it looks less emergent. But the general idea is that as you increase the number of parameters and the amount of compute going into these models, the ability to go from a few examples to really strong performance is very compelling.

Cool. And as I explained earlier, the general idea is that this is very different from the conventional notion of fine-tuning. Instead of iterating over examples and doing gradient updates, we are just doing few-shot prompting: we put in a few examples and that gives us the system.

speaker 2: [inaudible]

speaker 1: Yes — the exact details can depend on the prompt template you use, but typically you would just put examples, like "sea otter => loutre de mer," and then whatever your task input is, and you let the model complete from there, because it can infer the task from the examples you've given. Any other questions?

Cool. So we've gone from zero-shot prompting, and we're seeing that few-shot prompting is becoming really competitive with good models, but there are still limitations. You cannot solve every task this way. In particular, things that involve richer multi-step reasoning can be pretty challenging — and to be fair, humans struggle at these tasks as well. Things like addition are probably still hard when you keep increasing the number of digits.
But one thing you have to start being creative with — I alluded to this earlier — is that you can get these models to do the task if you're creative in how you prompt them. And this is what we're going to see next: a technique called chain-of-thought prompting. The idea we've explored thus far is that we put in examples of the kind of task we want done, and we expect the model to infer the task and go from there. The new idea is that instead of just showing what the task is, you show examples where you reason through the task — so the model is not just learning to do the task, but also learning how the reasoning works. In this example, we have to solve a simple math problem, and initially the prompt shows only the answer directly. If you do that, you'll observe that the model gets the answer wrong. Instead, what if you show the model how to reason about the task — include a chain of thought in the prompt as well — and then ask it a new question? Now the model is not just going to output an answer; it's going to reason about the task, and it actually does a lot better. This has been shown to be very effective. Chain of thought is also, as you can see, something that improves a lot with model scale. And what you can see is that it's nearly better than the best supervised models here — the PaLM models go up to 540 billion parameters, and with chain-of-thought prompting at that kind of scale you're already beating the supervised state of the art.

Cool. So I've shown you examples of chain-of-thought reasoning where you demonstrate a reasoning chain, but you can be even smarter than that. You might not even need to show any examples; you just need to nudge the model into thinking about what to do next. This idea emerged in the zero-shot chain-of-thought paper, "Let's think step by step." Instead of showing an example, you just start the answer with "Let's think step by step," and that's it — the model will start reasoning toward the answer itself instead of just auto-completing to an answer, and you get something like this. So maybe you don't even need to show any examples; you can induce the reasoning behavior zero-shot as well. And the final numbers: compared to the zero-shot performance we got from essentially auto-completing, zero-shot chain of thought substantially improves performance — you go from 17.7 to 78.7. It's still worse than actually putting examples of reasoning in context, the few-shot chain of thought, but you can see how much performance improves simply by asking the model to think step by step.

And maybe this is a lesson for interacting with these models: you might not get the exact desired behavior up front, but often these models are capable of the behavior you want, and you have to think about how to induce it. The right way to think about it, perhaps, is: what is the pre-training data, what data on the Internet might the model have seen that induces a behavior similar to the one I want? You want to think about that and then induce those kinds of behaviors from the model.
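A sketch of the two prompting tricks just described: few-shot in-context examples (the sea-otter-style translation prompt) and the zero-shot "Let's think step by step" trigger. The templates loosely follow the GPT-3 and chain-of-thought papers rather than the exact slides, and the `generate()` call is a placeholder, not a specific API.

```python
few_shot_translation = (
    "English: sea otter => French: loutre de mer\n"
    "English: peppermint => French: menthe poivrée\n"
    "English: plush giraffe => French: girafe peluche\n"
    "English: cheese => French:"     # the model is expected to continue from here
)

zero_shot_cot = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Let's think step by step."   # nudges the model into writing out its reasoning
)

# completion = model.generate(few_shot_translation)   # hypothetical call
```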
And yeah, we've designed some of these prompts by hand, but you can also get an LLM to design these prompts — there are recursive self-improvement ideas here — and you can bump up the performance a little bit more. Cool. So what we've seen so far is that as models get stronger and stronger, you can get them to do your task zero-shot or with a few examples, and you can nudge them into figuring out which task you want them to solve. But the downside is that there's only so much you can fit into the context. That might matter less now that models have increasingly large context windows, but it's still somewhat unsatisfactory that you have to trick the model into doing your task rather than it just doing the task you want. And going forward, you probably still want to fine-tune these models for more and more complex tasks. That's where we're going next: this section covers instruction fine-tuning.

The general framing is that, as we talked about, pre-training is not about assisting users; it is about predicting the next token. You can trick the model into assisting users and following the instructions you want, but in general that's not what it was trained for. Here's an example: you ask GPT-3 — a pretty strong model — to "explain the moon landing to a six-year-old in a few sentences," and it follows up with more questions about what a six-year-old might want. That's not what you wanted the model to do, right? The general term people use these days is that these models are not aligned with user intent. This next section is about how to align them with user intent so that you don't have to trick the model into doing what you want. And this is the kind of desired completion we want at the end of instruction tuning.

So how do we get from those pre-trained models to models that respond to user intent? I hope the general idea of pre-training and fine-tuning was covered somewhere in this class: you pre-train on a lot of language data, and then you fine-tune on your specific task — you take the same decoder-only model and fine-tune it on some task with a very small amount of data. The thing that's different now is that we're no longer fine-tuning on a small amount of data for one task. We're going to fine-tune on many, many different tasks, and we're going to try to put them into a single usable UX for users. This is where instruction fine-tuning comes in.

Cool. So the recipe is not very complicated. We collect a lot of examples of instruction-output pairs, and the instructions range over several different forms: question answering, summarization, translation, code, reasoning, and so on. We collect a lot of examples for all those tasks, we train on the instruction-output pairs exactly as given, and then we evaluate on some unseen tasks. That's the general paradigm of instruction fine-tuning. And again, it's the same idea we explored in pre-training: data plus scale is really important.
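A minimal sketch of the instruction fine-tuning step: format each (instruction, output) pair into one sequence and train with ordinary next-token cross-entropy, masking the prompt tokens so only the response is penalized. The template and the masking choice are common practice, not details taken from the lecture.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, instruction: str, output: str) -> torch.Tensor:
    prompt = f"Instruction: {instruction}\nResponse: "
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + output, return_tensors="pt").input_ids

    labels = ids.clone()
    labels[:, :prompt_len] = -100            # ignore prompt positions in the loss

    logits = model(ids).logits               # [1, seq_len, vocab]
    # shift so position t predicts token t+1
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

# In practice you would tokenize once and track the prompt/response boundary carefully,
# batch and pad examples, etc.; this is only the core of the objective.
```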
And these days, you start off with one task and you extend it to thousands and thousands of tasks, with 3 million-plus examples. This is the broad range of tasks you might see in instruction fine-tuning datasets. You might even ask why we still call it fine-tuning — it's almost starting to look like pre-training — but these are just terms, so use whatever you're comfortable with.

So we get this huge instruction dataset and we fine-tune our model. The next question is: how do we evaluate these models? I think you'll see another lecture on evaluation, so I don't want to dive too deep into this, but generally, evaluation of these language models is an extremely tricky topic; there are a lot of biases you need to deal with, and a lot of this will be covered later. Some more recent progress is that we're starting to curate really large benchmarks like MMLU, where the models are tested on a broad range of diverse knowledge. This is just one example, and these are the topics you'll see. To give some intuition of what the examples in these evaluations look like: under astronomy, you might be asked what a type Ia supernova is, or you might be asked some questions about biology, and there's a huge host of tasks like this. These are typically multiple-choice questions, and you can ask the model to answer them. If the models are instruction fine-tuned already, hopefully they can simply answer the question, but you can also chain-of-thought prompt or few-shot prompt these questions too.

Recently there's been a huge amount of progress on this benchmark. What people have observed is that pre-training larger models on more and more data simply keeps climbing the number on this. Ninety percent is often seen as the number these models want to cross, because it's roughly human-level knowledge or understanding, and recently the Gemini models reportedly crossed it. So yeah, go ahead.

speaker 2: Isn't this the ImageNet thing all over again? At some point you say, okay, maybe my methods are too implicitly fine-tuned to ImageNet's biases. Isn't something like that happening here as well?

speaker 1: Yes. This is a tricky topic, because for a lot of these models there's this question of whether your test sets are leaking into your training data, and there are huge concerns about that. It's a perfectly valid question to ask how we even evaluate — this is why evaluation is actually very tricky. But one general thing to keep in mind is that at some point it doesn't matter what your train-test split is if the models are generally useful, if the models are doing useful stuff. If you train on everything you care about and the model does well on it, does it matter? So yeah, we still need better ways to evaluate these models and to understand whether our methods are actually improving them, but at some point those boundaries start to matter less.

Cool. So: massive progress on this benchmark, starting with GPT-2, and we're roughly at 90% — to the point where it's starting to become unclear whether improvements on these benchmarks are actually meaningful.
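One common way to score these multiple-choice benchmarks, as a sketch: write the question and options into a prompt and pick the answer letter the model assigns the highest log-probability. Official harnesses differ in templates and normalization; this simply reuses the `sequence_log_prob` helper from the earlier snippet.

```python
def choose_answer(question: str, options: dict, score_fn) -> str:
    """options maps letters to option text, e.g. {"A": "...", "B": "...", ...}."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt += "\nAnswer:"
    # pick the letter whose appended answer the model finds most probable
    return max(options, key=lambda letter: score_fn(prompt + " " + letter))

# e.g. choose_answer(astronomy_question, {"A": ..., "B": ..., "C": ..., "D": ...},
#                    score_fn=sequence_log_prob)
```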
In fact, most of the time when the models are wrong, you might find that the question itself was unclear or ambiguous. So all evaluation benchmarks have a certain limited utility.

I'm going to go over another evaluation example of how this recipe changes things. The T5 models were instruction fine-tuned on a huge number of tasks, and another trend — which I think will be a theme across this lecture — is that as models become larger and are trained on more data, they become more and more responsive to your task information as well. What you'll observe here is that as the number of parameters increases — from T5-small up to 11 billion parameters with T5-XXL — the improvement from going from a pre-trained model to an instruction-tuned model grows: the instruction-tuned model gets all the better at following instructions. The difference goes from +6.1 to +26.6 as the models become larger. So this is another very encouraging trend: you should train on a lot of data with a lot of compute, and pre-training just keeps on giving.

I hope you get a chance to play with a lot of these models — I think you already are. Before instruction fine-tuning, when you ask a question related to disambiguation QA, you get something like this, and it doesn't actually follow the "let's think step by step" instruction very clearly. But after instruction fine-tuning, it is able to answer the question.

More recently, people have been researching what instruction-tuning datasets should look like. There's a huge plethora of instruction-tuning datasets now available — this is just a representative diagram — and a huge open-source community is developing around them. Some high-level lessons we've learned: one interesting lesson is that we can use really large, strong models to generate some of the instruction-tuning data for training our smaller models. Take your favorite model right now — GPT-4, maybe, or Claude — and you can get it to answer questions and generate instruction-output pairs for training your open-source or smaller model. That is actually a very successful recipe: instead of getting humans to collect all the instruction-output pairs or to write the answers, you can get the bigger models to generate the answers. That's one thing that has recently emerged. Another thing being discussed is how much data we need. I talked about millions of examples, but people have found that with really high-quality examples you can get away with a thousand examples as well — this is the paper "LIMA: Less Is More for Alignment" — and how data scaling in instruction tuning affects final model performance is still an active area of research. And crowdsourcing these datasets can be effective as well; there are very cool efforts emerging like Open Assistant. So, a lot of activity in the field, and hopefully a lot more progress as we go on.

speaker 2: Yes, a question sort of in the spirit of this LIMA paper: don't code, or math word problems, have this desired structure?
So shouldn't we just be training on code, paired with some English, and say, okay, this is the best reasoning data we can get? Because code has this structure where you're going step by step, breaking a higher-level concept down into smaller pieces — so you could consider code high-value tokens.

speaker 1: So, pre-training is a whole dark art that I am not completely familiar with, but code actually does end up being really useful in pre-training mixtures, and people do upweight code data quite a lot. But it depends on what users are going to use the models for, right? Some people might use them for code, some for reasoning, but that's not the only task we care about. As we'll discuss later, people often use these models for creative tasks — they want to write a story, generate a movie script, and so on — and I don't know that training only on reasoning tasks would help with that. Go ahead.

speaker 2: Yeah, but there exists some data distribution which is high-value for creative tasks.

speaker 1: Yes — it seems like a lot of people write stories and everything on the Internet all the time, which is not code. And there's this idea of hallucinations in this field as well, but you can think of it this way: creativity might be a byproduct of hallucination. So I don't know exactly what data would lead to more creative models, but generally there are a lot of stories written on the Internet, which allows the model to be creative. I don't know if I have a specific answer to the question.

Cool. So we discussed instruction fine-tuning — very simple and very straightforward. There are no complicated algorithms here: just collect a lot of data, and then you can leverage performance at scale as well, because as models become better, they also become more easily specifiable and more responsive to tasks. Now we're going to discuss some limitations, and I think this is really important for understanding why we're going to optimize for human preferences.

Cool. So we talked a bit about this: instruction fine-tuning is necessarily contingent on humans labeling the data, and it's expensive to collect this data, especially as the questions become more complex. Say you want answers to questions at a physics-PhD level; these become increasingly expensive to collect. This is perhaps obvious: pre-training does not require any specific data — you scrape data off the web — but for instruction fine-tuning you probably need to recruit people to write down answers to your instructions, and that can become very expensive very quickly. But there are more limitations as well. As we were just discussing, there are open-ended tasks related to creativity that don't really have an exact correct answer to begin with — so how do you generate the right answer for that kind of question? And language modeling inherently penalizes all token-level mistakes equally — this is what supervised fine-tuning does as well — but often not all mistakes are the same.
Here's an example where you're doing this prediction task: "Avatar is a fantasy TV show." Calling it an adventure TV show is perhaps okay, but calling it a musical is a much worse mistake — yet both mistakes are penalized equally. And one aspect that is becoming increasingly relevant is that the humans you ask might not generate the right, or highest-quality, answer. Your models are becoming increasingly competitive, and in some sense you're going to be limited by how high-quality an answer humans can generate — but often the models are generating better and better answers. So do we really want to keep relying on humans to write down the answers, or do we want to somehow go beyond that?

So these are the three problems we've talked about with instruction fine-tuning. We've made a lot of progress with it, but this is not how we got ChatGPT. One high-level problem is that even when we're instruction fine-tuning, there's still a big mismatch: the end goal is to optimize for human preferences — generate an output a human would like — but we're still doing a prediction task, predicting the next token, just on a more curated dataset. So there's still a mismatch, and it's not exactly what we want to do. I'm going to take a second here to pause, because this is important for understanding the next section. If there are any questions, feel free to ask.

speaker 2: So is this step still taken as a first step, or is it discarded?

speaker 1: That's a good question. I think this is still one of the more important steps you take before the next step. People are trying to remove this step altogether and jump directly to the next step — there's work emerging on that — but yeah, this is still a very important step before we do the next one.

speaker 2: Is this problem also present in pre-training? And if so, how do you avoid it — just by having a lot of data?

speaker 1: Yeah, that's a great question. There's one major difference with pre-training: pre-training covers a lot more text. Just for context, as we talked about, pre-training is roughly 15 trillion tokens, whereas supervised instruction fine-tuning might be on the order of millions to billions of tokens — a few orders of magnitude lower. In instruction tuning you typically see only one answer for a specific instruction, but during pre-training you'll see multiple completions for the same kind of prompt. That's good, because when you see multiple answers or completions during pre-training, you start to weigh different answers — you put probability mass on different kinds of completions — whereas instruction fine-tuning might force you to put all the weight on only one answer. But generally, yeah, this is a problem with both stages; you're right. Anything else?

Cool. So, as this whole discussion suggests, we're going to start attempting to satisfy human preferences directly. We're no longer going to get humans to generate data and do some kind of token-level prediction loss; we're going to try to optimize for human preferences directly, and that is the general field of RLHF. That's the final step in typically getting a model like ChatGPT.
So we talked about how collecting demonstrations is expensive, and how there's still a broad mismatch between the LM objective and human preferences. Now we're going to try to optimize for human preferences directly. What does that even mean? To establish it concretely, let's go through a specific example: summarization. We want to train a model to be better at summarization, and we want to satisfy human preferences. Let's imagine a human can prescribe a reward for a specific summary — just pretend there is a reward function; you and I can assign, say, +1 to this one and -1 to that one, or something to that effect.

In this specific case we have an input x, a news article about an earthquake in San Francisco, that we want to summarize. Pretend we get these rewards and we want to optimize against them: we get one summary, y1, "An earthquake hit San Francisco..." and assign it a reward of 8.0, and another summary that gets a reward of 1.2. Generally speaking, the objective we want to set up is of the following form: we take our language model p_θ, which generates a completion y given an input x, and we maximize the expected reward E_{y ~ p_θ(y|x)}[ R(x, y) ], where x is the input and y is the output summary for this task. And just to point something out really concretely: this is different from everything we've done so far in one specific way. We are sampling from the model itself in that bottom term — y comes from p_θ. In everything we've seen so far, the data was sampled from some other source, either during pre-training or during supervised fine-tuning, and we maximized the log-likelihood of those tokens. But now we're explicitly sampling from our own model and optimizing a potentially non-differentiable objective.

Cool. So broadly, the RLHF pipeline looks something like this. The first step is still instruction tuning, which we've seen up to now: we take our pre-trained model, instruction-tune it on a large collection of tasks, and get something that starts responding to our intent, more or less. But there are two more steps after this, which are typically followed in creating something like InstructGPT. The next step is estimating some kind of reward model — something that tells us, given an instruction, how much a human would like or hate a given answer. We looked at something like this a moment ago, but I didn't talk about how we even get it; that's the second step. And then we take this reward model and optimize against it, through the optimization I suggested earlier: maximizing the expected reward under your language model. We're going to spend a lot of time on the second and third steps.

So the first question: how do we even get a reward model for what humans are going to like? This is a very ill-defined problem, generally speaking. There are two problems to address here. First, a human in the loop is expensive: if I ask a model to generate an answer and then get a human to label it with some kind of score, and I'm doing this over millions of completions, that's not very scalable — I don't want to sit around and label millions of examples. But this one is easy; we are in a machine learning class.
So what are we going to do? We're going to train something that predicts what a human would or wouldn't like. This is essentially a machine learning problem: we take these reward scores and train a reward model to predict, given an input and output, what the reward score would be. A simple regression-style problem — you've probably seen this before. Cool. Now, there's a bigger problem here — and sorry, go ahead.

speaker 2: For step one, do we use just embeddings with a classifier, or do we use an actual language model?

speaker 1: That's a good question. Generally, reward models still need to understand the text really well, so they're bigger models, and they're typically initialized from the language model you pre-trained as well. So you typically start with a pre-trained language model, add some kind of prediction head that we'll talk about, and it gives you a score.

speaker 2: If you're doing that, how do you separate x and y? How does the language model know which part is which?

speaker 1: It doesn't need to. It only sees x and y concatenated as an input, and it just predicts a score at the end. The x and y are more for notational convenience, because for us x and y are different — x is the question the user asked and y is what the model generated — but you shove the whole thing in.

Cool. Now, this is the bigger problem: human judgments are very noisy. We said we want to assign a score to a completion, and that is extremely non-trivial to do. If I give you a summary like this, what score are you going to assign on a scale of ten? If you ask me on different days, I'll give a different answer — and across humans, this number is not calibrated in any meaningful way. So you get scores of 4.1 and 6.6 from different humans, who simply assign different numbers. There are ways to address this — you can calibrate humans, give them a specific rubric, talk to them — but it's a very complicated process, and there's still a lot of room for judgment, which is not very nice for training a model: if your labels vary a lot, they're just hard to predict.

The way this is addressed is that instead of trying to predict a reward label directly, you set up the problem in a slightly different way. Something much easier for humans to do is: give them two answers, or maybe many answers, and ask which one is better. This is where the idea of asking humans to rank answers comes in. If I give you a whole news article and ask which summary is better, you might be able to tell me that the second summary is the worst, the first one is better, and the third one is somewhere in the middle. So you get a ranking, which gives you preferences over summaries. And hopefully you can see the important idea here: even when we each have some kind of consistent internal utility function, it's much easier to compare two things and say which one is better than to ascribe an arbitrary number on a scale.
And that's why the signal from something like this is a lot better. Now, we said we get this kind of preference data, and we still need some kind of reward score out of it: we shove in an input and a summary, and we need a score out, but it's not obvious how to take ranking data and convert it into that kind of score. In come our pretty good friends Bradley and Terry. There's a lot of work, in economics and psychology, that tries to model how humans make decisions in cases like this, and the Bradley-Terry model essentially says that the probability that a human chooses answer y1 over y2 is based on the difference between the rewards the human assigns internally, passed through a sigmoid. So if you've looked at binary classification before, the logit is simply the difference between the reward of y1 and the reward of y2 — the difference between the winning completion and the losing completion. Is everybody with me to this point? So the idea is that if you have a dataset of pairs with a winning completion y_w and a losing completion y_l, the winning completion should score higher than the losing completion. Go ahead.

speaker 2: Sorry, what is J? Is it a log prob, or — what is the type of J, this number we're getting as the expectation?

speaker 1: It's an expected log probability (with a minus sign in front, so it's a loss), and it will be a scalar at the end. Say you have a reward model which gives a score r1 to y_w and r2 to y_l. You subtract them, you get a number, you put it into a sigmoid, and you get a probability, because the sigmoid converts a logit into a probability. Then you take the logarithm of that, and you take the expectation over everything, and you get this final number, which tells you how well your reward model is doing on the entire dataset. A good model of humans would score very low here: it would generally assign a higher reward to the winning completion and a lower reward to the losing completion.

Cool. The math is just beginning, so hold on to your seats. Now let's see where we are. We have a pre-trained model p_PT(y|x), and we've got this fancy reward model — a model of humans that can tell us which answers they like and which they don't. Now, to do RLHF — we've discussed what this will look like — we copy our pre-trained model, or our instruction-tuned model, and we optimize its parameters. And I suggested that the objective we want to optimize is the expected reward when we sample completions from p_θ, except we're going to use our learned reward model in place of the reward a human would have assigned. Do you see any problem with this? Is there something that might go wrong if you do this?

speaker 2: The model might collapse?

speaker 1: It might collapse, yes. But generally, at least from my intuition, whenever you're optimizing some learned metric, I'd be very careful, because a normal loss function is very clearly defined.
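For reference, a torch-style sketch of the reward-model training objective just described — the negated Bradley-Terry log-likelihood. Here `reward_model` is assumed to be a language-model backbone with a scalar head that maps the tokens of (prompt + completion) to a single score; that interface is my assumption, not a detail from the lecture.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """chosen_ids / rejected_ids: token ids for (x, y_w) and (x, y_l)."""
    r_w = reward_model(chosen_ids)      # scalar reward for the winning completion y_w
    r_l = reward_model(rejected_ids)    # scalar reward for the losing completion y_l
    # maximize log sigmoid(r_w - r_l)  <=>  minimize its negative
    return -F.logsigmoid(r_w - r_l).mean()
```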
But here my reward model is learned, and when it's learned, it will have errors. It's trained on some distribution; it will generalize somewhat, but it will have errors. And when you're optimizing against a learned model, the policy will tend to hack it: the reward model might erroneously assign a really high score to a really bad completion, and if your policy — your language model — learns to find those cases, it will completely hack the reward model and start generating those gibberish completions. So, as a general machine learning tip: if you're optimizing a learned metric, be careful about what you're optimizing and make sure it's actually reliable. This is obviously not desirable — if you optimize this objective naively, you're going to converge to gibberish language models very, very quickly.

So typically what people do is add some kind of penalty that keeps the model from drifting too far from its initialization. Why do we want that? If it cannot drift too far from its initialization, we know it stays a decent language model, because the initialization was one, and we know it isn't over-fitting this reward model too much. We also know the reward model was trained on a distribution of completions sampled from around that initial model, so the reward model will be somewhat reliable in that region. So we simply add a penalty for drifting too far from the initial distribution. Concretely, we maximize an objective where we have R_φ, our learned reward model, minus this term: β times the log-ratio between the model we're optimizing, p_θ, and our initial model. What this says is that if you assign a much higher probability to a completion than your pre-trained model does, you pay an increasingly large penalty — a price for drifting from the initial distribution. And if you've taken a machine learning class: the expectation of this quantity is exactly the Kullback-Leibler divergence, or KL divergence, between p_θ and p_PT, so you're penalizing the divergence between the two distributions. Go for it.

speaker 2: Shouldn't you also add a penalty like this in the previous setting, where you were fine-tuning? Or is this only relevant for the RLHF?

speaker 1: That's a good question. People do add some kind of regularization in fine-tuning, but it's not nearly as critical as when you're doing this with RL, where the incentive is to exploit the reward model as much as possible. We'll see examples where the learned reward model thinks it's doing really well, but the completions are complete garbage. So it's much more important in this optimization.

Cool. Now, this course does not assume a background in reinforcement learning, so we're not going to go deep into RL, but I want to give a very high-level intuition of how this works. Reinforcement learning has not been useful only for language models; it has been applied to several domains of interest — game-playing agents, robotics, developing chip designs, and so on — and the interest in combining RL and language models dates back to roughly 2016 as well.
But it's been really successful recently, especially with the success of RLHF. The general idea is that we use the model we're optimizing to generate several completions for an instruction, we compute the reward under our learned reward model, and then we update our model to increase the probability of the high-reward completions. When we sample the model, we'll see completions of varying quality — some good summaries for our task, some bad ones — and we'll update the log probabilities such that the updated model typically lands in the higher-reward region. Does that high-level summary make sense?

Cool. And RLHF is incredibly successful. I think this is a very good example — it's the same summarization setup. The key point here is that performance improves with model size, for sure; we've seen that in many different examples. But what you can actually see is that even very small models can outperform human completions if you train them with RLHF, and that's exactly the result here. The reference summaries are human-written, and when you ask humans which ones they prefer, they often prefer the model-generated summary over the human-generated one. That's something you only observe with RLHF, even at small scale — and again, the same scaling phenomenon still holds here: bigger models do become more responsive. But RLHF itself is very impactful.

The problem with RLHF is that it's just incredibly complex. I gave you a very high-level summary; there are whole courses on this for a reason. This image is not for you to understand — it's purely to intimidate you. You have to fit a value function, you have to sample from the model a lot, it can be sensitive to a lot of hyperparameters — there's a lot going on, and if you start implementing an RLHF pipeline, it can be very hard. This is the reason a lot of RLHF was restricted to very high-compute, high-resource places, and it was not very accessible. So what we're going to cover in this course is something called direct preference optimization, which is a much simpler alternative to RLHF and hopefully much more accessible. But please bear with me: there will be a lot of math here, and the end goal of the math is to come up with a very simple algorithm. Feel free to stop me and ask questions.

speaker 2: In terms of GPT-4 versus GPT-3, how much does the number of parameters in the base model help with reducing the number of examples from humans that you need for this to work well?

speaker 1: Yeah, that's a really good question. Generally speaking, if you hold the dataset size constant and simply increase the model size, it will improve quite a lot. But the nice thing is that you can reuse the data and keep adding data as you keep scaling models up. So typically nobody tries to reduce the amount of data collection; you just keep increasing both.
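A deliberately simplified, REINFORCE-style sketch of the update just described: sample completions, score them with the learned reward model with the β log-ratio penalty folded in, and push up the log-probability of high-reward samples. The `sample` / `log_prob` / `reward_model` helpers are hypothetical interfaces, and real pipelines (PPO, as in the intimidating diagram) add value functions, baselines, clipping, batching, and much more.

```python
import torch

def rlhf_step(policy, ref_policy, reward_model, prompts, optimizer, beta=0.1):
    # Sample completions from the current policy (hypothetical helper methods).
    completions = policy.sample(prompts)
    # Summed token log-probs of each completion under the current and initial models.
    logp = policy.log_prob(prompts, completions)
    with torch.no_grad():
        logp_ref = ref_policy.log_prob(prompts, completions)
        rewards = reward_model(prompts, completions)          # scalar per completion
        # fold the "don't drift from initialization" penalty into the reward
        adjusted = rewards - beta * (logp.detach() - logp_ref)
    # REINFORCE-style surrogate: raise log-probs in proportion to the adjusted reward.
    loss = -(adjusted * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```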
speaker 1: Cool. So we talked about RLHF, and the current pipeline is: we train a reward model on the comparison data we've seen so far, then we start with our pre-trained, instruction-tuned model and convert it into an RLHF'd model using reinforcement learning techniques. Now, the key idea in direct preference optimization is: what if we could simply write the reward model in terms of the language model itself? To understand that intuitively: a language model assigns probabilities to whatever is the most plausible completion next, but those plausible completions might not be what we intended. You could restrict the probabilities to the completions a human might like — and then the log probabilities of your model would represent something humans like, not just arbitrary completions from the Internet. So there can be a direct correspondence between the log probability a language model assigns and how much a human might like the answer. And this is not some arbitrary intuition I'm trying to come up with; we will derive it mathematically.

So the general idea of direct preference optimization is: we're going to write the reward model in terms of our language model, and once we can do that, we can simply fit the reward model directly to the preference data we have — which directly optimizes the language model parameters — and we don't need to do the RL step at all. We start off with some preference data, and we fit our reward model to it; that's it. And maybe, at a higher level, why is this even possible? We did this really cumbersome process of fitting a reward model and optimizing against it, but in the whole process, the only external information being added to the system was the human labels on the preference data. When you optimize against a learned reward model, no new information is being added to the system. That's why something like this is even possible. For quite a few years this was not obvious, but as you'll see, these results start to make sense.

So we're going to derive direct preference optimization. I'll be here after class as well if you have questions, but hopefully this will be clear. We discussed that we want to solve this expected-reward problem: maximize the expected reward, minus the β log-ratio term, which penalizes the distance between where our current model is and where we started, because we don't want to drift too far from the initialization. Now, it turns out that this specific problem, instead of requiring an iterative routine, actually has a closed-form solution. The closed-form solution looks like this — if you've seen the Boltzmann distribution or something to that effect before, it's basically the same idea. We take the pre-trained distribution p_PT(y|x), and we reweight it by the exponentiated reward: if a completion has a very high reward, it gets more probability mass, and if it has a lower reward, it gets less.
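Written out, the closed-form solution just described looks like the following (a sketch in the lecture's notation; Z(x) is the normalizer discussed next):

```latex
p^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\; p_{\mathrm{PT}}(y \mid x)\,
                      \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) \;=\; \sum_{y} p_{\mathrm{PT}}(y \mid x)\,
           \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
```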
And beta is a hyperparameter that controls the trade-off between the reward and the constraint: as beta gets lower and lower, you pay more and more attention to the reward model. So the probabilities look something like this. And there is this really annoying term, Z(x). The reason it exists is that the numerator by itself is not normalized; it's not a probability distribution. To get an actual probability distribution you have to normalize, and Z(x) is simply that normalizer. So, real quick, if we write Z(x) out, is it the sum over all completions? Yes, exactly: it's the sum over all y for a given instruction. And that's exactly why it's so pesky: it's intractable. If I take an instruction and try to sum over every possible completion, not just the syntactically sensible ones but every single one, we have a vocabulary of 50,000 tokens or more and completions can be arbitrarily long. That space is completely intractable, and this quantity is not easy to approximate either. Even so, the main point is that if you're given a reward model, there does exist a closed-form solution that tells us what the optimal policy, the optimal language model, looks like. And if you do a little bit of algebra, move some terms around and take a logarithm here or there, I promise it's not very complicated, you can express the reward model in terms of the language model itself. I think this expression is reasonably intuitive as well: it says a completion y has a high reward if my optimal policy assigns it a higher probability relative to my initial model, scaled by beta. So the beta log ratio is what we're looking at. There's also the partition function term; let's ignore it for now, but it's intractable. The beta log ratio is the key part. Is everyone following along? Awesome. Okay. So right now I'm talking about optimal policies, but really, every policy is optimal for some reward; this holds mathematically. The important bit is that you can take your current policy, your initialized model, and read some kind of reward model out of it, and this identity is exactly what gives you that. So the reward can be expressed in terms of your language model, barring the log partition term, and we'll see what happens to that in a moment. speaker 2: Sorry, I don't get why we can swap; there is a thing we're trying to optimize, so how does p-star turn into p-theta? speaker 1: For now, we're not optimizing any reward model. All I'm saying is that my current language model implicitly represents some kind of reward model because of this relationship, because it holds for every p-star and every reward. If I plug in my current language model, it also represents some reward model; I'm not saying it's the optimal one. speaker 2: But at the beginning, p-theta is p_PT, so we just get that the reward is basically zero. speaker 1: Initially it's zero, yes, but we can optimize the parameters. That's a good observation: it is basically zero in the beginning. speaker 2: But how do we start optimizing it? speaker 1: I'll get to that. Any other questions?
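Rearranging that closed-form solution (same notation as above) gives the "reward in terms of the language model" relationship the derivation relies on; substituting the model being trained, $p_\theta$, for $p^{*}$ defines its implicit reward:

```latex
% Invert the closed-form solution for the reward
r(x, y) \;=\; \beta \log \frac{p^{*}(y \mid x)}{p_{PT}(y \mid x)} \;+\; \beta \log Z(x)

% Implicit reward defined by the current model parameters \theta
r_{\theta}(x, y) \;=\; \beta \log \frac{p_{\theta}(y \mid x)}{p_{PT}(y \mid x)} \;+\; \beta \log Z(x)
```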
speaker 2: So this is what makes the language model optimize for the preferences? speaker 1: That's the next step. But the key idea is that my language model's probabilities already implicitly define a reward model. I think that's really the main point here, and this mathematical relationship is exact. Cool. Now, I'm obviously ignoring the elephant in the room, the partition function; it's not going to magically vanish. If this were just the beta log ratio, that would be really nice: I can compute the log probability under my language model and under my pretrained model, so I could compute the reward and optimize it, but I don't know what to do with the log partition function. This is where something fun happens. Recall the reward modeling objective we started with, from our friends Bradley and Terry: what we really wanted to optimize was the reward difference between the winning completion and the losing completion. That's all we care about; not the exact reward itself, just the difference between winner and loser. And that's the key, because if you plug in the definition of r_theta there, the partition function just cancels out. Why? The input x is exactly the same in both terms of the difference, so Z(x) is the same in both and drops out. What you get is that the reward difference between the winning and losing completions is the difference between their beta log ratios. You can plug in the terms and work it out; it's fairly simple. So the partition function, the thing we could not compute, simply vanishes: Z no longer appears once you take the difference inside the Bradley-Terry model, even though it appears in this equation for the reward. So we take this last equation and plug it in place of the reward in the first loss equation, the Bradley-Terry loss. Cool. So this really is it. The key observation is that we can express our reward model in terms of the language model, and our problems with the partition function go away because we are optimizing the Bradley-Terry model. What you get is a loss function expressed directly in terms of the language model parameters theta, which we can optimize on our data without any RL steps at all. And it's simply a binary classification problem: we're really just trying to classify whether an answer is preferred or not.
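Plugging that implicit reward into the Bradley-Terry loss, the $\beta \log Z(x)$ terms cancel in the difference, and what remains is the DPO objective (with $y_w$ the preferred and $y_l$ the dispreferred completion, and $\sigma$ the logistic sigmoid):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{p_{\theta}(y_w \mid x)}{p_{PT}(y_w \mid x)}
      \;-\;
      \beta \log \frac{p_{\theta}(y_l \mid x)}{p_{PT}(y_l \mid x)}
  \right) \right]
```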
Before I go on, does everyone want a moment to absorb this? speaker 2: I don't quite get where the winning and losing completions come from. Are they human-written? speaker 1: Good question. It's the same dataset we started with for RLHF. The way the process works is that you take a set of instructions, get the model to generate some answers, and then have humans label which answer they prefer. So typically the completions are model-generated; they can be human-generated as well, but usually they're sampled from the model, and all you need is a label for which answer is better. speaker 2: But by cancelling out the partition function, aren't you bound to lose some information about the other possible completions, which you would have taken into account in standard RLHF? speaker 1: That's a really good question. I don't think I can answer it completely in the time we have, but the partition function is almost a free variable: many reward models satisfy this optimization, so there's a free variable you can remove entirely, and that's what this formulation benefits from. Think of it this way: if I assign one completion a reward of plus one and another a reward of minus one, that's basically the same as shifting both rewards up by a hundred; it gives the same loss, because only the difference matters. The loss is shift-invariant in the rewards. speaker 2: But isn't that somehow not what you want? If you're actually training a reward model, the absolute values should matter, not just the difference. speaker 1: What we're assuming in our choice model is that when a human prefers one answer over another, the probability of that preference is governed only by the difference between the rewards. That's an assumption every RLHF pipeline makes, and DPO makes it too. It doesn't hold perfectly, but it holds to a fairly large degree. Good question. Cool, I'll move on in the interest of time. The goal of this plot is to show that we actually get fairly performant models when we optimize with DPO. The main thing to look at is PPO, which is the typical RLHF pipeline; we evaluate the models on summarization against human summaries. What we find is that DPO and PPO do similarly, so you're really not losing much by running the DPO procedure instead of the whole RLHF pipeline. And that's compelling, because DPO is simply a classification loss rather than a whole reinforcement learning procedure. So, to quickly summarize what we've seen: we want to optimize for human preferences, and instead of relying on uncalibrated scores, we collect comparison data. We use this ranking data either to do RLHF, where we first fit a reward model and then optimize with reinforcement learning, or to do direct preference optimization, where we simply take the dataset and apply a classification loss. There are trade-offs: when people have a lot of computational budget, they typically go for RLHF or some routine like that, but if you're looking for the most bang for your buck, you might go for DPO, which will probably work out of the box. This is still an active area of research, and people are still figuring out how best to use these algorithms, so I'm not making strong claims, but both algorithms are very effective; DPO is just much simpler to work with. Cool.
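To show just how simple the resulting training step is, here is a minimal sketch of the DPO loss in PyTorch; the summed-log-probability inputs and the batch layout are assumptions for illustration, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """DPO as a binary classification loss on preference pairs.

    Each argument is a tensor of summed token log-probabilities for the full
    completion (chosen = preferred by the labeler, rejected = dispreferred),
    under the model being trained (`policy_*`) and the frozen pretrained
    reference model (`ref_*`).
    """
    # Implicit rewards: beta-scaled log ratios (the partition function cancels).
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)
    # Bradley-Terry: maximize the log-sigmoid of the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -9.1]))
print(loss)  # scalar loss; in practice, backpropagate through the policy log-probs
```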
speaker 1: So, let's see: we went through all of this, instruction tuning and RLHF. What do we get? InstructGPT is the first model that followed this pipeline; it effectively defined it. We got models trained on 30,000 or so tasks. Remember when we were doing only one task? We've scaled up from a thousand or so tasks to tens of thousands of different tasks with many, many examples. So that's where we are with InstructGPT, and it follows the pipeline we just described: a specific RLHF recipe where they explicitly fit a reward model and then run a reinforcement learning routine on top of it. The tasks collected from labelers look something like this; I'll leave the details for you to read. But where we started was completions like the ones you see from GPT-3: you ask it to explain the moon landing to a six-year-old and it doesn't really follow the instruction, whereas InstructGPT gives you something meaningful. It infers what the user wanted from the instruction and produces a realistic answer the user might like. These are just more examples of what an InstructGPT-like model does, whereas the base model might not follow the instructions according to your intent. And we went from InstructGPT to ChatGPT, which is essentially this same pipeline; the key difference is that it still does the instruction tuning, but it is more optimized for dialogue, for interacting with users. So the core algorithmic techniques we discussed today are what give us ChatGPT. You have to be really careful about the kind of data you train on, and that's really the whole game, but this is the foundation for ChatGPT, and it follows the same pipeline. I'm sure you have all interacted with ChatGPT in some form; this is an example of what an interaction might look like, here asking for something written in a Gen Z style. The idea is that it's very good at responding to instructions and intent. This is not something we could easily get with few-shot prompting; these kinds of instructions are hard to come up with examples for, and this is probably not something it was trained on either, yet it infers the intent and generalizes very nicely. That's something I personally find remarkable. Cool. And there's been a lot of progress on the open-source front as well. DPO is much simpler and much more efficient, and essentially all the open-source models these days are using it. This is a leaderboard maintained by Hugging Face, and nine out of ten models on it are trained with DPO. That has enabled the open-source community to instruction-tune their models better, and the same approach is now used in many production models: Mistral uses DPO, Llama 3 used DPO. These are very strong models, nearly GPT-4 level, and they're also adopting these algorithms. And something that's very cool to see, after all this optimization and math, is what actually changes in the model's behavior.
I think this is a really good example: if you ask an instruction-tuned model for an SFT-style output, you get something like this, but when you RLHF the model, you get a lot more detail in the answer, and it's probably organized a little better. That's presumably what humans prefer, which is why this property emerges, and it's a very clear difference between models that are merely instruction-tuned and models that have been through RLHF. So we discussed this whole RLHF routine, where we directly model preferences and generalize beyond the labeled data. We also discussed that RL can be tricky to implement correctly, and that DPO sidesteps some of those issues. And we briefly touched on reward models and reward hacking. When you optimize against a learned or poorly specified reward, you often see things like this example: the boat discovers it can just keep crashing into objects repeatedly to rack up more and more points, which was never the goal of the game. This is a very common illustration of reward hacking: if you don't specify rewards well, models can learn weird behaviors that are not your desired intent, and this is something a lot of people worry about. Part of the reason is that reinforcement learning is a very strong optimizer; it's at the heart of AlphaGo and AlphaZero, which produced superhuman systems, so you have to be careful about how you specify things. The other issue is that even optimizing for human preferences is often not quite the right thing, because humans do not always prefer what is in their best interest. One thing that emerges is authoritative, helpful-sounding answers rather than necessarily truthful ones; people tend to prefer authoritativeness over correctness, which is maybe not so nice. Please go ahead. speaker 2: I'm curious: since ChatGPT is now so widely used by the public, will that change how these models are rewarded? I feel like now when I ask ChatGPT something, it gives me five detailed paragraphs of information. Sometimes I'm just annoyed that it's not what I wanted, but maybe in the original reward data people actually preferred that; I don't know. speaker 1: Yeah, that's a great point, because as these models integrate more and more into our systems, they're going to collect more and more data and pick up on things, maybe undesirable things as well. As far as I understand, ChatGPT is really trying to cut down on verbosity, which is a huge issue all of these models are dealing with. Part of the reason it emerges is that when you collect preference data at scale, people are not necessarily reading the answers carefully; the annotators might simply choose the longer answer, and that property makes its way into the models. Hopefully these things improve over time as better data is collected. And hallucination is not a problem that will simply go away with RL. We also talked a bit about reward hacking, biases in the data, and so on.
But what I want to conclude with is this: we started with pretrained models, things that could only predict text, and we ended up with ChatGPT. Hopefully it's now a little clearer how we get from one to the other. And that's where I'll end.