Stanford CS224N | 2023 | Lecture 10 - Prompting, Reinforcement Learning from Human Feedback
This lecture, given by Stanford PhD student Jesse Mu, covers prompting, instruction fine-tuning, and reinforcement learning from human feedback (RLHF) in large language models (LLMs): the techniques driving recent chatbots such as ChatGPT.
The lecture opens with course logistics, including the project proposal deadline, assignment submission, and the course feedback survey.
It then turns to the trend in large language models: model sizes keep growing and training data keeps increasing. As a result, LLMs not only predict text sequences but begin to show a deeper understanding of the world, learning syntax, coreference, and sentiment, and even developing rudimentary "world model" capabilities. One example shows that an LLM can infer what a person described in the prompt (e.g., a physicist) would predict about a physical phenomenon (a bowling ball and a leaf dropped in a vacuum landing at the same time). LLMs also show promise on encyclopedic knowledge, mathematical reasoning, code generation, and even medical text.
The main goal of the lecture is to explain how a base language model that merely predicts the next word can be turned, step by step, into a ChatGPT-like assistant that performs diverse tasks. The lecture introduces three techniques in stages:
1. Zero-shot and few-shot learning;
2. Instruction fine-tuning;
3. Reinforcement learning from human feedback (RLHF).
The lecture then details zero-shot and few-shot learning. It reviews the original GPT model (2018, 117 million parameters), a decoder-only language model pretrained to improve downstream task performance. GPT-2 (2019, 1.5 billion parameters) scaled up both the model and the training data (the 40 GB WebText dataset, built by filtering for high-quality links posted on Reddit). GPT-2's key contribution was showing that language models are capable of "unsupervised multitask learning," in particular zero-shot learning: without any task-specific gradient updates or fine-tuning, the model can perform many tasks it was never explicitly trained on, simply by framing the task as a sequence prediction problem in the prompt. For example, question answering can be done by providing a passage and a question and letting the model continue with the answer, and pronoun disambiguation tasks requiring world knowledge (such as the Winograd schema challenge) can be solved by comparing the probabilities the model assigns to different sentence variants. With zero-shot learning alone, GPT-2 achieved state-of-the-art results on several language modeling benchmarks without task-specific fine-tuning.
Media details
- Upload date
- 2025-05-16 21:06
- Source
- https://www.youtube.com/watch?v=SXpJ9EmG3s4
Transcript
speaker 1: Okay, awesome. We're going to get started. So my name is Jesse Mu. I'm a PhD student in the CS department here, working with the NLP group, and I'm really excited to be talking about the topic of today's lecture, which is on prompting, instruction fine-tuning, and RLHF. So this is all stuff that has been super hot recently because of all the latest craze about chatbots, ChatGPT, etc. And we're going to hopefully get somewhat of an understanding as to how these systems are trained. Okay. So before that, some course logistics things. Project proposals, both custom and final, were due a few minutes ago, so if you haven't done that, this is a nice reminder. We're in the process of assigning mentors to projects, so we'll give feedback soon. Besides that, assignment five is due on Friday at midnight. We still recommend using Colab for the assignments even if you've had AWS or Azure credits granted. If that doesn't work, there are instructions for how to connect to a Kaggle notebook where you will also be able to use GPUs; look for that post on Ed. And then finally, also just posted on Ed by John is a course feedback survey. This is part of your participation grade, so please fill that in by Sunday, 11:59 p.m. Okay, okay. So let's get into this lecture, which is going to be about what we are trying to do with these larger and larger models. Over the years, the compute for these models has just gone up and up, by orders of magnitude, trained on more and more data. So larger and larger models, seeing more and more data. And in lecture ten, if you recall this slide, we talked a little bit about what happens when you do pre-training: as you begin to really learn to predict missing words in certain texts, you learn things like syntax, coreference, sentiment, etc. But in this lecture, we're going to take it a little bit further and really take this idea to its logical conclusion. If you really follow this idea, we're just going to train a giant language model on all of the world's text, and you really begin to see language models, in a way, as rudimentary world models. Maybe they're not very good world models, but they kind of have to be doing some implicit world modeling, just because so much information and so much of human collective knowledge is transcribed and written for us on the Internet, right? So if you are really good at predicting the next word in text, what do you learn to do? There's evidence that these large language models are, to some degree, learning to represent and think about agents and humans, and the beliefs and actions they might take. So here's an example from a recent paper where we are talking about someone named Pat watching a demonstration of a bowling ball and a leaf being dropped at the same time in a vacuum chamber. And the idea is, here we're saying Pat is a physicist. So if Pat is a physicist, and we then ask for the language model's continuation of this sentence, because he's a physicist, the model does some inference about what kind of knowledge Pat has, and it says Pat will predict that the bowling ball and the leaf will fall at the same time. But if we change the prompt and say, well, Pat has actually never seen this demonstration before, then Pat will predict that the bowling ball will fall to the ground first, which is wrong, right?
So if you get really good at predicting the next word in text, you also, to some degree, have to learn to predict an agent's beliefs, their backgrounds, common knowledge, and what they might do next. And not just that, of course: if we continue browsing the Internet, we see a lot of encyclopedic knowledge. So maybe language models are actually good at solving math reasoning problems if they've seen enough demonstrations of math on the Internet. Code, of course; code generation is a really exciting topic that people are looking into, and we'll give a presentation on that in a few weeks. Even medicine: we're beginning to think about language models trained on medical text and being applied to the sciences and whatnot. So this is what happens when we really take this language modeling idea seriously. And this has resulted in a resurgence of interest in building language models that are basically assistants, right? You can give them any task under the sun. I want to create a three-course meal. And a language model should be able to take a good stab at doing this. This is kind of the promise of language modeling. But of course, there are a lot of steps required to get there from our basic language modeling objective, and that's what this lecture is going to be about. So how do we get from just predicting the next word in a sentence to something like ChatGPT, which you can really ask to do anything? It might fail sometimes, but it's getting really, really convincingly good at some things. Okay. So this is the lecture plan. Basically, as we're working with these large language models, we come up with increasingly complex ways of steering them closer and closer to something like ChatGPT. We'll start with zero-shot and few-shot learning, then instruction fine-tuning, and then reinforcement learning from human feedback, or RLHF. Okay, so let's first talk about zero-shot and few-shot learning. And in order to do so, we're again going to build off of the pre-training lecture from last Tuesday. In the pre-training lecture, John talked about models like GPT, the Generative Pretrained Transformer, which are decoder-only language models: they're just trained to predict the next word in a corpus of text. Back in 2018 was the first iteration of this model, and it was 117 million parameters. At the time, it was pretty big; nowadays, it's definitely on the smaller side. And again, it's just a vanilla transformer decoder, using the techniques that you've seen, and it's trained on a corpus of books, so about 4.6 gigabytes of text. And what GPT showed was the promise of this simple language modeling objective serving as an effective pre-training technique for various downstream tasks that you might care about. So if you wanted to apply it to something like natural language inference, you would take your premise sentence and your hypothesis sentence, concatenate them, and then maybe train a linear classifier on the last representation the model produces. Okay. But that was three, four, five years ago. What has changed since then? They came out with GPT-2. GPT-2 was released the next year, in 2019. This is 1.5 billion parameters, so it's the same architecture as GPT, but an order of magnitude bigger, and also trained on much more data. So we went from 4 GB of books to 40 GB of Internet text data. They produced a dataset called WebText, built by scraping a bunch of links posted to Reddit.
So the idea is that the web contains a lot of spam, maybe a lot of low-quality information, but they took links that were posted on Reddit that had at least a few upvotes. So humans maybe looked through it and said, no, this is a useful post. That was a rough proxy for human quality, and that's how they collected this large dataset. And so if you look at the size of GPT in 2018, we can draw a bigger dot, which is the size of GPT-2 in 2019. And one might ask, how much better does this do? What does this buy you? So the authors of GPT-2 titled their paper "Language Models are Unsupervised Multitask Learners," and that gives you a hint as to what the key takeaway was: this unsupervised multitasking part. Basically, I think the key takeaway from GPT-2 was this idea that language models can display zero-shot learning. What I mean by zero-shot learning is that you can do many tasks that the model may not have actually been explicitly trained for, with no gradient updates. You just query the model by specifying the right sequence prediction problem. If you care about question answering, for example, you might include your passage, like a Wikipedia article about Tom Brady, and then you'll add a question, so "Question: Where was Tom Brady born?", and then include an answer prefix, "A" followed by a colon, and just ask the model to predict the next tokens. You've kind of jury-rigged the model into doing question answering. For other tasks, like classification tasks, another thing you can do is compare the probabilities of different sequences. This task is called the Winograd schema challenge. It's a pronoun resolution task, where the goal is to resolve a pronoun in a way that requires some world knowledge. One example is something like "The cat couldn't fit into the hat because it was too big," and the question is whether "it" refers to the cat or to the hat. In this case, it makes most sense for "it" to refer to the cat, because the thing that's too big is the thing that doesn't fit; you need to use some world knowledge to resolve that. So the way you get zero-shot predictions for this task out of a language model like GPT-2 is to just ask the language model which sequence is more likely: is the probability of "the cat couldn't fit into the hat because the cat was too big" deemed higher by the language model than the probability of "the cat couldn't fit into the hat because the hat was too big"? You can score those sequences because that's exactly what a language model computes, and from there you get your zero-shot prediction, and you can end up doing fairly well on this task. Any questions about this? Okay, yeah. So digging a little bit more into the results: GPT-2 at the time beat the state of the art on a bunch of language modeling benchmarks with no task-specific fine-tuning, so no traditional fine-tuning on a training set and then testing on a test set. Here's an example of such a task. This is a language modeling task called LAMBADA, where the goal is to predict a missing word, and the idea is that the word you need to predict depends on some discourse earlier in the sentence, or a few sentences ago. And by simply training your language model and then running it on the LAMBADA task, you end up doing better than the supervised, fine-tuned state of the art at the time, and across a wide variety of other tasks as well. Okay. Another kind of interesting behavior they observed.
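To make the sequence-scoring idea concrete, here is a minimal sketch (an editor's addition, not code from the lecture) of zero-shot Winograd-style prediction: score both rewritten sentences with a pretrained language model and pick the more probable one. It assumes the Hugging Face `transformers` package and the public `gpt2` checkpoint.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Total log-probability the LM assigns to `text` (higher = more likely)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy
        # over the predicted tokens; multiply by token count for a total log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

s_cat = "The cat couldn't fit into the hat because the cat was too big."
s_hat = "The cat couldn't fit into the hat because the hat was too big."
prediction = "cat" if sequence_log_prob(s_cat) > sequence_log_prob(s_hat) else "hat"
print(prediction)  # zero-shot pronoun resolution: choose the more probable rewrite
```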
And so you'll see hints of things that we now take for granted in this paper, which is that you can get interesting zero-shot behavior as long as you take some liberties with how you specify the task. For example, let's imagine that we want our model to do summarization, even though GPT-2 was just a language model. How can we make it do summarization? The idea they explored was: take an article, some news article, and then at the end append the TL;DR token. This stands for "too long, didn't read." It's used a lot on Reddit to say, if you didn't want to read the above stuff, here are a few sentences that summarize it. So if you ask the model to predict what follows after the TL;DR token, you might expect that it will generate some sort of summary. And this is an early whisper of this term that we now call prompting, which is thinking of the right way to specify a task such that your model will do the behavior that you want it to do. So if we look at the performance actually observed on this task: here at the bottom is a random baseline, where you just select three sentences from the article, and the scores we're using here are ROUGE scores, if you remember the natural language generation lecture. GPT-2 is right above. So it's not actually that good; it only does a little bit, or barely any, better than the random baseline, but it is approaching approaches that are supervised, approaches that are actually explicitly fine-tuned to do summarization. And of course, at the time, it still underperformed the state of the art, but this really showed the promise of getting language models to do things that maybe they weren't trained to do. Okay, so that was GPT-2. That was 2019. Now here's 2020: GPT-3. GPT-3 is 175 billion parameters, another increase in size by an order of magnitude, and at the time it was unprecedented. I think it still is kind of overwhelmingly large for most people. And data: they scaled up the data once again. Okay, so what does this buy you? This paper's title was "Language Models are Few-Shot Learners." So what does that mean? The key takeaway from GPT-3 was emergent few-shot learning. The idea here is that, sure, GPT-3 can still do zero-shot learning, but now you can specify a task by giving examples of the task before asking it to predict the example that you care about. Okay? This is often called in-context learning, to stress that there are no gradient updates being performed when you learn a new task. You're basically constructing a tiny little training dataset and just including it in the prompt, including it in the context window of your transformer, and then asking it to pick up on what the task is and predict the right answer. And this is in contrast to a separate literature on few-shot learning, which assumes that you can do gradient updates; in this case, it's really just a frozen language model. So few-shot learning works, and it's really impressive. Here's a graph. SuperGLUE here is a kind of wide-coverage natural language understanding benchmark. What they did was take GPT-3, and this data point here is what you get when you just do zero-shot learning with GPT-3: you provide an English description of the task to be completed, and then you ask it to complete the task. Then, by providing just one example, so one-shot, you get something like a 10% accuracy increase.
So you give not only the natural language task description, but also an example input and an example output, and you ask it to decode the next output. And as you increase to more shots, you do get better and better scores, although of course you get diminishing returns after a while. But what you can notice is that few-shot GPT-3, with no gradient updates, is doing as well as or outperforming BERT fine-tuned on the SuperGLUE task explicitly. Any questions about this? So one thing that I think is really exciting: you might think, okay, few-shot learning, whatever, it's just memorizing; maybe there are a lot of examples of needing to do few-shot learning in the Internet text data. But I think there's also evidence that GPT-3 is really learning to do some sort of on-the-fly optimization or reasoning. And the evidence for this comes in the form of these synthetic word-unscrambling tasks. The authors came up with a bunch of simple letter-manipulation tasks that are probably unlikely to exist in Internet text data. These include things like cycling through the letters of a scrambled word to recover the original word, so converting a cycled version of a word back into, say, "apple"; removing characters that were added to a word; or even just reversing words. And what you see here is performance as you do few-shot learning, as you increase the model size. What you can see is that the ability to do few-shot learning is kind of an emergent property of model scale. At the very largest model, we're actually seeing a model be able to do this exclusively in context. A question? Yeah. speaker 2: I've noticed the performance on reversed words is still low. speaker 1: Yeah, so the question was about the reversed words; the performance is still low. That's an example of a task that these models still can't solve yet, although I'm not sure if we've evaluated with newer models; maybe the latest versions can indeed solve that task. Yeah, question? speaker 2: Is there some intuition for why this emerges as a result of model scale? speaker 1: I think that's a highly active area of research, and there are papers published seemingly every week on this. There are a lot of interesting experiments that really try to dissect this, either with synthetic tasks, like, can GPT-3 learn linear regression in context? Or with model interpretability work: what in the attention layers or in the hidden states is resulting in this kind of emergent learning? But yeah, I'd have to refer you to the recent literature on that. Awesome, okay. So just to summarize, traditional fine-tuning here is on the right. We take a bunch of examples of a task that we care about, we give them to our model, and we do a gradient step on each example, and at the end we hopefully get a model that can do well on that task. And in this new paradigm of just prompting a language model, we have a frozen language model, and we just give some examples and ask the model to predict the right answer. Okay. So you might think, and you'd be right, that there are some limits of prompting. Well, there are a lot of limits of prompting, but especially for tasks that are too hard. There are a lot of tasks that maybe seem too difficult, especially ones that involve richer reasoning steps or needing to synthesize multiple pieces of information. And these are tasks that humans struggle with too, right?
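As a small illustration of the in-context learning setup just described (an editor's sketch; the template and helper name are illustrative, and the translation pairs are the GPT-3 paper's canonical example), a few-shot prompt is literally just the k examples concatenated ahead of the query, with no gradient updates anywhere:

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot (in-context) prompt as a single string."""
    lines = [task_description, ""]
    for x, y in examples:                  # the k in-context "shots"
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")                # the frozen model continues from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```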
So one example is GPT-3. I don't have the actual graph here, but it was famously bad at doing addition with much larger numbers of digits. And so if you prompt GPT-3 with a bunch of examples of addition, it won't do it correctly. But part of the reason is that humans are also pretty bad at doing this in one step. Like, if I asked you to just add these two numbers on the fly and didn't give you a pencil and paper, you'd have a pretty hard time with it. So one observation is that you can just change the prompt and hopefully get better performance out of this. So there's this idea of doing chain-of-thought prompting. In standard prompting, we give some examples of a task that we'd like completed. Here is an example of a math word problem. And I told you that what we would do is give the question and then the answer, and then for a data point that we actually care about, we ask the model to predict the answer, and the model tries to produce the right answer, and it's just wrong. So the idea of chain-of-thought prompting is to actually demonstrate the kind of reasoning you want the model to do. In your prompt, you not only put the question, but you also put an answer along with the kinds of reasoning steps that are required to arrive at the correct answer. So here is actually some reasoning of how you would answer this tennis ball question and then get the right answer. And because the language model is incentivized to just follow the pattern and continue the prompt, if you give it another question, it will in turn produce a rationale followed by an answer. So you're kind of asking the language model to work through the steps itself, and by doing so, you end up getting some questions right that you otherwise might not. A super simple idea, but it's been shown to be extremely effective. So here is this middle school math word problems benchmark, and again, as we scale up the model for GPT and some other kinds of models, the ability to do chain-of-thought prompting emerges, and we really see performance approaching that of supervised baselines for these larger and larger models. Questions? Yeah. speaker 2: Given that the problem with addition was with larger numbers, are there results on how chain-of-thought prompting does on addition with larger numbers, rather than middle school math problems? speaker 1: Yeah, so the question is, does chain-of-thought prompting work for those addition problems that I had presented? There should be some results in the actual paper; they're just not here. But you can take a look. Yeah? speaker 2: Do you have an intuition for how the model is learning without any gradient updates? speaker 1: An intuition about how the model is learning without gradient updates. Yeah, so this is related to the question asked earlier about how this is actually happening. Again, it's an active area of research. My understanding of the literature is something like: you can show that models are almost doing a kind of in-context gradient descent as they encode the prompt, and you can analyze this with model interpretability experiments. But yeah, I'm happy to suggest papers afterwards that deal with this problem more carefully. speaker 2: Cool. speaker 1: Okay. So a follow-up report to this asked the question of, do we actually even need examples of reasoning? Do we actually need to collect humans working through these problems? Can we actually just ask the model to reason through things?
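For comparison with the plain few-shot prompt above, here is a hedged sketch of a chain-of-thought exemplar (an editor's addition; the tennis-ball worked example follows the one referenced in the lecture, from the chain-of-thought paper, and the second question is a stand-in). The only change from standard prompting is that the in-context answer spells out the intermediate reasoning before the final answer, so the model imitates that pattern on the new question.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
# The frozen model is expected to continue with its own rationale and then a
# final answer (here, 23 - 20 + 6 = 9), rather than guessing a number directly.
```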
Just ask it nicely, right? So this introduced an idea called zero-shot chain-of-thought prompting. And it was honestly, I think, probably the highest impact-to-simplicity ratio I've seen in a paper, where it's the simplest possible thing: instead of doing this chain-of-thought stuff, you just ask the question, and then at the start of the answer you prepend the token "let's think step by step," and the model will decode as if it had said "let's think step by step," work through some reasoning, and produce the right answer. So does this work? On some arithmetic benchmarks, here's what happens when you prompt the model just zero-shot, so just asking it to produce the answer right away without any reasoning; few-shot, giving some examples of inputs and outputs; and then zero-shot chain of thought, just asking the model to think through things. You get a crazy good accuracy. When we compare to actually doing manual chain of thought, you still do better with manual chains of thought, but that just goes to show how simple an idea this is that still produces improved performance numbers. So the funny part of this paper was, why did they use "let's think step by step"? They actually tried out a lot of prompts. Here's zero-shot baseline performance, and they tried a bunch of different prefixes for the answer, things like "the answer is after the proof" or "let's think about this logically," and they found that "let's think step by step" was the best one. It turns out this was built upon later in the year, where they actually used a language model to search through the best possible strings that would maximize performance on this task, which is probably gross overfitting. But the best prompt they found was, "Let's work this out in a step by step way to be sure that we have the right answer." So the "right answer" part is kind of presuming that you get the answer right; it's like giving the model some confidence. Okay. So this might seem to you like a total dark, arcane art, and that's because it is: we really have no intuition as to what's going on here; we're trying to build some intuition. But as a result, and I'm sure you've seen this if you spend time in tech circles or on the Internet, there's this whole new idea of prompt engineering being an emerging science and profession. This includes things like asking a model for reasoning. It includes jailbreaking language models, telling them to do things that they otherwise aren't trained to do. Even for AI art models like DALL-E or Stable Diffusion, this idea of constructing really complex prompts to get the model outputs that you want, that's also prompting. Anecdotally, I've heard of people saying, I'm going to use a code generation model, but I'm going to include the Google code header first, because that will produce more professional or bug-free code, depending on how much you believe in Google. There's a Wikipedia article on this now, and there are even startups that are hiring for prompt engineers, and they pay quite well. So if you want to be a prompt engineer, definitely practice your GPT whispering skills. We had a question? Sorry, yes? speaker 2: A few slides ago you showed that prompt that was this long output. How did they design that input? Did they treat it like a reinforcement learning problem?
speaker 1: I'll just direct you to the paper at the bottom to learn more details; I think it's the Zhou et al., 2022 paper. Yeah, question? speaker 2: I'm just curious about how they incorporate feedback. In case the model doesn't give the right answer, would you prompt it to say that that's not right? How does feedback work? speaker 1: They don't really do feedback in these kinds of chain-of-thought prompting experiments. If the model gets the answer wrong, then it gets the answer wrong and we just evaluate accuracy. But this idea of incorporating AI feedback, I think, is quite interesting, and I think you'll see some hints of a discussion of that later on. Questions? Okay, awesome. Okay. So as we go through these three things, I'm going to talk about the benefits and limitations of the various approaches. For zero-shot and few-shot in-context learning, the benefit is that you don't need any fine-tuning, and you can carefully construct your prompts to hopefully get better performance. The downsides are that there are limits to what you can fit in context; transformers have a fixed context window of, say, a thousand or a few thousand tokens. And I think, as you will probably find out, for really complex tasks you are indeed going to need some gradient steps, so you're going to need some sort of fine-tuning. That brings us to the next part of the lecture, which is instruction fine-tuning. Okay. So the idea of instruction fine-tuning is that, sure, these models are pretty good at prompting; you can get them to do really interesting things. But there is still a problem, which is that language models are trained to predict the most likely continuation of tokens, and that is not the same as what we want language models to do, which is to assist people. As an example, if I give GPT-3 this kind of prompt, to explain the moon landing to a six-year-old, GPT-3 is trained to predict: if I saw this on the Internet somewhere, what is the most likely continuation? Well, maybe someone was coming up with a list of tasks to do with a six-year-old. So it's just predicting a list of other tasks; it's not answering your question. And so the issue here is that language models are not, the term is, aligned with user intent. So how might we better align models with user intent in this case? Well, there's a super simple answer, right? We're machine learners; let's do machine learning. We're going to ask a human: give me the right answer, give me the way that a language model should respond to this prompt, and let's just do fine-tuning. So this is a slide from the pre-training lecture. Again, pre-training can improve NLP applications by serving as parameter initialization. This kind of pipeline I think you are familiar with. The difference here is that instead of fine-tuning on a single downstream task of interest, like sentiment analysis, what we're going to do is fine-tune on many tasks. We have a lot of tasks, and the hope is that we can then generalize to other, unseen tasks at test time. So as you might expect, data and scale are key for this to work. We're going to collect a bunch of examples of instruction-output pairs across many tasks, fine-tune our language model, and then evaluate generalization to unseen tasks. Yeah, so data and scale are important.
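To make the fine-tuning step concrete, here is a minimal sketch (an editor's addition, not the lecture's code) of what instruction fine-tuning reduces to: format an (instruction, output) pair as text and train with the ordinary next-token cross-entropy loss. The `Instruction:`/`Response:` template and the function name are illustrative assumptions; the sketch should work with any Hugging Face causal LM and matching tokenizer.

```python
import torch
import torch.nn.functional as F

def instruction_lm_loss(model, tokenizer, instruction: str, output: str) -> torch.Tensor:
    """Next-token cross-entropy on the concatenated instruction + desired output."""
    text = f"Instruction: {instruction}\nResponse: {output}"
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Shift by one so position t is trained to predict token t+1,
    # exactly as in ordinary language-model pretraining.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )

# Training then just loops over (instruction, output) pairs drawn from many
# tasks, calling loss.backward() and an optimizer step on each batch.
```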
So as an example, one recent dataset that was published for this is called the Super-NaturalInstructions dataset. It contains over 1,600 tasks with over 3 million examples. This includes translation, question answering, question generation, even coding, mathematical reasoning, et cetera. And when you look at this, you really begin to wonder: is this actually fine-tuning, or is this just more pre-training? And it's actually both, right? We're kind of blurring the lines here, where, given the scale we're training this on, it is basically a still-general, but slightly more specific than language modeling, kind of pre-training task. Okay. So one question is: now that we are training our model on so many tasks, how do we evaluate such a model? Because you can't really just ask, okay, can you now do sentiment analysis well? The range of tasks you want to evaluate this language model on is much greater. So just as a brief aside, a lot of research has been going into building benchmarks for these massive multitask language models and seeing to what degree they can do not just one task but a whole variety of tasks. This is the Massive Multitask Language Understanding benchmark, or MMLU. It consists of a bunch of benchmarks for measuring language model performance on knowledge-intensive tasks that you would expect a high school or college student to complete. So you're testing a language model not only on sentiment analysis, but on astronomy and logic and European history. And here are some numbers, where at the time GPT-3 is not that good, but it's certainly above a random baseline on all of these tasks. Here's another example. This is the Beyond the Imitation Game benchmark, or BIG-bench. This has something like a billion authors because it was a huge collaborative effort. And this is a word cloud of the tasks that were evaluated, and it really contains some very esoteric tasks. Here's an example of one task included, where you're given a kanji, a Japanese character, in ASCII art, and you need to predict the meaning of the character. So we're really stress-testing these language models. Okay. So instruction fine-tuning: does it work? Recall that there's the T5 encoder-decoder model. This is Google's encoder-decoder model that's pre-trained on the span corruption task; if you don't remember that, you can refer back to that lecture. The authors released a newer version called FLAN-T5, where FLAN stands for fine-tuned language net. These are T5 models trained on an additional 1,800 tasks, which include the natural instructions data that I just mentioned. And if we average across both the BIG-bench and MMLU performance and normalize it, what we see is that instruction fine-tuning works. And crucially, the bigger the model, the bigger the benefit you get from doing instruction fine-tuning, so it's really the large models that stand to gain the most. You might look at this and say, this is kind of sad for academics or anyone without a massive GPU cluster. Who can run an eleven-billion-parameter model? I guess the one silver lining, if you look at the results here, is the 80-million-parameter model, which is the smallest one. If you look at it after fine-tuning, it ends up performing about as well as the un-fine-tuned 11-billion-parameter model, right?
So there are a lot of examples in the literature of smaller instruction fine-tuned pretrained models outperforming larger models that are many, many times the size. So hopefully there's still some hope for people with just a few GPUs. Any questions? Awesome. In order to really understand the capabilities, I highly recommend that you just try it out yourself. FLAN-T5 is hosted on Hugging Face, and I think Hugging Face has a demo where you can just type in a little query, ask it to do anything, and see what it does. But there are qualitative examples of this working. For questions where a non-instruction-fine-tuned model will just kind of waffle on and not answer the question, doing instruction fine-tuning will get your model to much more accurately reason through things and give you the right answer. Okay, so that was instruction fine-tuning. The positives of this method: it's super simple, super straightforward; it's just doing fine-tuning. And you see this really cool ability to generalize to unseen tasks. In terms of negatives, does anyone have any ideas for what might be downsides of instruction fine-tuning? Yeah, go ahead. speaker 2: It seems like it suffers from the same negatives as any human-sourced data: it's hard to get people to provide the input, and different people might think differently about it. speaker 1: Yeah, exactly. So the comments are: well, it's hard and annoying to get human labels, and it's expensive. That definitely matters. And that last part you mentioned, that humans might disagree on what the right label is, yeah, that's increasingly a problem. So what are the limitations? The obvious limitation is money: collecting ground-truth data for so many tasks costs a lot of money. Subtler limitations include the one you were mentioning. As we begin to ask for more creative and open-ended tasks from our models, there are tasks where there is no right answer, and it's a little bit weird to say, this is the example of how to write some story. Write me a story about a dog and a pet grasshopper: there is not one answer to this. But if we were only to collect one or two demonstrations, the language modeling objective would say you should put all of your probability mass on the two ways that two humans wrote this answer, when in reality there's no single right answer. Another problem, which is related fundamentally to language modeling in the first place, is that language modeling as an objective penalizes all token-level mistakes equally. What I mean by that is, if you were asking a language model to predict the sentence "Avatar is a fantasy TV show," and let's imagine that the LM mispredicted "adventure" instead of "fantasy": "adventure" is a mistake, it's not the right word, but it is treated as equally bad as if the model were to predict something like "musical." But the problem is that "Avatar is an adventure TV show" is still right, so it's not necessarily a bad mistake, whereas "Avatar is a musical" is just false. So under the language modeling objective, if the model were equally confident, you would pay an equal loss penalty for predicting either of those tokens wrongly.
But it's clear that this objective is not actually aligned with what users want, which is maybe truth or creativity, or generally just this idea of human preferences. Yeah, question? speaker 2: Could we do something like multiply the penalty by the distance in word embedding space, in order to reduce this? Because "musical" would have a higher distance than "adventure." speaker 1: Yeah, it's an interesting question, an interesting idea. I haven't heard of people doing that, but it seems possible. I guess one issue is you might come up with adversarial settings where the word embedding distance is also not telling you the right thing. For example, "show" and "musical" maybe are very close together because they're both things to watch, but in terms of veracity they're completely different: one is true and one is false. So yeah, you can try it, although I think there might be some tricky edge cases like that. Cool. Okay. So in the next part of the talk, we're going to actually explicitly try to satisfy human preferences, and come up with a mathematical framework for doing so. And yeah, these are the limitations I had just mentioned. So this is where we get into reinforcement learning from human feedback. Okay, so RLHF. Let's say we were training a language model on some task, like summarization. And let's imagine that for each language model sample s, we had a way to obtain a human reward for that summary: we could score the summary with a reward function, which we'll call R(s), and the higher the reward, the better. So let's imagine we're summarizing this article, and we have this summary, which maybe is pretty good. Say we had another summary, which maybe is a bit worse. And if we were able to ask a human to just rate all these outputs, then the objective that we want to maximize is very obvious: we just want to maximize the expected reward of samples from our language model. So in expectation, as we take samples from our language model p_theta, we just want to maximize the reward of those samples. Fairly straightforward. Oh, and for mathematical simplicity here, I'm assuming that there's only one task or one prompt, so let's imagine we're just trying to summarize this one article, but we can talk about how to extend it to multiple prompts later on. Okay? So this kind of task is the domain of reinforcement learning. I'm not going to presume any knowledge of reinforcement learning, although I'm sure some of you are quite familiar with it, probably even more familiar than I am. But the field of reinforcement learning has studied these kinds of optimization problems, how to optimize an objective that you can only sample from, for many years now. And in 2013, there was a resurgence of interest in reinforcement learning for deep learning specifically. You might have seen these results from DeepMind about an agent learning to play Atari games, or an agent mastering Go much earlier than expected. But interestingly, I think the interest in applying reinforcement learning to modern LMs is a bit newer; one of the earliest success stories was only in 2019, for example. So why might this be the case? There are a few reasons.
I think in general, the field kind of had this sense that reinforcement learning with language models was really hard to get right, partially because language models are very complicated. If you think of language models as actors that have an action space where they can spit out any sentence, that's a lot of sentences; it's a very complex space to explore. So it still is a really hard problem; that's part of the reason. But also, practically, there have been newer RL algorithms that seem to work much better for deep neural models, including language models, and these include algorithms like proximal policy optimization. We won't get into the details of that for this course. But these are the reasons why we've become re-interested in this idea of doing RL with language models. Okay, so how do we actually maximize this objective? I've written it down, and ideally we would just change our parameters theta so that the reward is high, but it's not immediately clear how to do so. When we think about it, what have we learned in the class thus far? We know that we can do gradient descent or gradient ascent. So let's try doing gradient ascent: we're going to maximize this objective, so we're going to step in the direction of steepest gradient. But this quickly becomes a problem. First, what is this quantity and how do we evaluate it? How do we estimate this expectation, given that theta, the variable we're taking the gradient with respect to, appears in the sampling distribution of the expectation? And second, what if our reward function is not differentiable? Human judgments are not differentiable; we can't backprop through them. So we need this to work with a black-box reward function. There's a class of methods in reinforcement learning called policy gradient methods that gives us tools for estimating and optimizing this objective. And for the purposes of this course, I'm going to try to describe the highest-level possible intuition for this, which looks at the math and shows what's going on, but it is going to omit a lot of the details, and a full treatment of reinforcement learning is definitely outside the scope of this course. If you're more interested in this kind of content, you should check out CS 234, Reinforcement Learning, for example. In general, this is going to get a little messy, but it's totally fine if you don't understand it; we will regroup at the end and just show what this means for how to do RLHF. What I'm going to do is just describe how we actually estimate this objective. So we want to obtain this gradient: the gradient of the expectation of the reward of samples from our language model. If we do the math and break this apart, this is just the definition of an expectation: we sum over all sentences, weighted by their probability. And due to the linearity of the gradient, we can move the gradient operator inside the sum. Now what we're going to do is use a very handy trick known as the log-derivative trick. This is called a trick, but it's really just the chain rule. Let's see what happens when we take the gradient of the log probability of a sample from our language model. So if I take the gradient, how do we use the chain rule?
So the gradient of the log of something is one over that something, times the gradient of that something. So it's one over p_theta(s), times the gradient of p_theta(s). And if we rearrange, we see that we can alternatively write the gradient of p_theta(s) as this product: p_theta(s) times the gradient of log p_theta(s). And we can plug this back in. The reason we're doing this is to convert the expression into a form where the expectation is easy to estimate. So plugging it back in gives us this, and if you squint quite closely at this last equation, the first part here is the definition of an expectation: we are summing over samples from our model, weighting by the probability of each sample, which means we can rewrite it as an expectation, and in particular an expectation of this quantity here. So let's just rewrite it. And this gives us our newer form of the objective. These two are equivalent, the one at the top here and the one at the bottom, and what has happened is that we've shoved the gradient inside the expectation, if that makes sense. So why is this useful? And does anybody have any questions on this before I move on? If you don't understand it, that's fine as well; we will get to the intuition behind it in a moment. Okay, okay. So we've converted this into this, and we've put the gradient inside the expectation, which means we can now approximate this objective with Monte Carlo samples. The way to approximate any expectation is to just take a bunch of samples and average them. So approximately, this is equal to sampling a finite number of samples from our model and then averaging the reward times the gradient of the log probability of each sample. And that gives us this update rule, plugging it back into the gradient ascent step that we wanted. So what does this mean? Let's think about a very simple case. Imagine the reward was binary, so it's either zero or one. For example, imagine we were trying to train a language model to talk about cats: whenever it utters a sentence with the word "cat," we give it a reward of one; otherwise we give it a reward of zero. Okay? Now, if the reward is binary, does anyone know what this objective reduces to, or looks like? Any ideas? I've lost everyone. That's fine too. Yeah? speaker 2: The reward would just be like an indicator function. speaker 1: That's right. So basically, the reward would be zero everywhere except for sentences that contain the word "cat," where it would be one. So that would just look like vanilla gradient descent on only the sentences that contain the word "cat." To generalize this to the case where the reward is a scalar: what this is doing, if you look at it, is that if R is very high, very positive, then we're multiplying the gradient of that sample's log probability by a large number, and so our objective will try to take gradient steps in the direction of maximizing the probability of producing that sample again, the sample that led to high reward. And on the other hand, if R is low or even negative, then we will actively take steps to minimize the probability of that happening again. And that's the English intuition of what's going on here, right?
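For reference, here is the derivation the speaker walks through verbally, written out in symbols (an editor's addition; single-prompt case, with p_theta the language model, R the reward, n the number of samples, and alpha the step size):

```latex
\begin{align*}
\nabla_\theta \, \mathbb{E}_{s \sim p_\theta}[R(s)]
  &= \nabla_\theta \sum_s R(s)\, p_\theta(s)
   = \sum_s R(s)\, \nabla_\theta\, p_\theta(s) \\
  &= \sum_s R(s)\, p_\theta(s)\, \nabla_\theta \log p_\theta(s)
     && \text{(log-derivative trick: } \nabla_\theta\, p_\theta(s) = p_\theta(s)\, \nabla_\theta \log p_\theta(s)\text{)} \\
  &= \mathbb{E}_{s \sim p_\theta}\big[ R(s)\, \nabla_\theta \log p_\theta(s) \big]
   \approx \frac{1}{n} \sum_{i=1}^{n} R(s_i)\, \nabla_\theta \log p_\theta(s_i),
     \qquad s_i \sim p_\theta \\
\theta_{t+1} &\leftarrow \theta_t + \alpha\, \frac{1}{n} \sum_{i=1}^{n} R(s_i)\, \nabla_{\theta_t} \log p_{\theta_t}(s_i)
\end{align*}
```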
The reason why we call it reinforcement learning is because we want to reinforce good actions and increase the probability that they happen again in the future. And hopefully this intuitively makes sense. Let's say you're playing a video game, and on one run you get a super high score, and you think to yourself, oh, that was really good; whatever I did that time, I should do again in the future. That's what we're trying to capture with this kind of update. Question? Yeah. speaker 2: Is there any reason that we use policy gradients and not value iteration or other methods? speaker 1: Yeah, you can do a lot of things. I think there have been methods for doing Q-learning, offline RL, etc., with language models. I think the design space has been very underexplored, so there's a lot of low-hanging fruit out there for people who are willing to think about what fancy things we can do in RL and apply them to this language modeling case. And in practice, what we use is not this simple thing; we use a fancier method, namely proximal policy optimization. speaker 2: The action spaces are super big, though. speaker 1: Yeah, so that's the challenge. One thing that I hadn't mentioned here is that right now I'm talking about entire samples of sentences, which is a massive space. In practice, when we do RL, we actually do it at the level of generating individual tokens. GPT has, let's say, 50,000 tokens, so it's a pretty large action space, but it's still manageable. And that kind of answers this question I was asking, which is: can you see any problems with this objective? This is a very simplified objective; there are a lot more tricks needed to make this work. But hopefully this has given you the high-level intuition as to what we're trying to do in the first place. Okay, okay, so now we are set, right? We have a bunch of samples from a language model, and for any arbitrary reward function, like just asking a human to rate these samples, we can maximize that reward. So we're done? Not so fast; there are a few problems. The first is the same as in the instruction fine-tuning case, which is that keeping a human in the loop is expensive. I don't really want to supervise every single output from a language model, and I don't know if you all want to. So what can we do to fix this? One idea is, instead of needing to ask humans for preferences every single time, you can actually build a model of their preferences, like literally just train an NLP model of their preferences. This idea was first introduced outside of language modeling by this paper by Knox and Stone; they called it TAMER. But we're going to see it reimplemented in this setting, where we're going to train a language model, which we'll call a reward model RM, parameterized by phi, to predict human preferences from an annotated dataset. And then when doing RLHF, we optimize for the reward model's rewards instead of actual human rewards. Here's another conceptual problem. Here's a new sample for our summarization task: what is the score of this sample? Can anyone give me a number? Is it a three, a six? What scale are we using, etc.? So the issue here is that human judgments can be noisy and miscalibrated when you ask people for absolute scores on their own.
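As a concrete illustration of the bare estimator just described (an editor's sketch, not the lecture's code; real RLHF systems use PPO with many additional tricks), here is what one naive REINFORCE-style update could look like in PyTorch. `logprob_of_sample` is an assumed helper that returns the summed token log-probabilities of a sampled sentence under the current policy.

```python
import torch

def reinforce_step(policy, optimizer, samples, rewards, logprob_of_sample):
    """One gradient-ascent step on (1/n) * sum_i R(s_i) * log p_theta(s_i)."""
    logprobs = torch.stack([logprob_of_sample(policy, s) for s in samples])
    rewards = torch.as_tensor(rewards, dtype=logprobs.dtype)
    # Minimizing the negative of the Monte Carlo estimate is the same as taking
    # the gradient-ascent step on the expected reward derived above.
    loss = -(rewards * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```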
So one workaround for this problem is, instead of asking for direct ratings, to ask humans to compare two summaries and judge which one is better. This has been shown, I think, in a variety of fields where people work with human subjects and human responses, to be more reliable; this includes psychology and medicine, etc. So in other words, instead of asking humans to give absolute scores, we're going to ask them to compare different samples and rate which one is better. As an example, maybe this first sample is better than the middle sample, which is better than the last sample. Now that we have these pairwise comparisons, our reward model is going to learn latent, implicit scores from this pairwise comparison data. So our reward model is a language model that takes in a possible sample and produces a number, which is the score or the reward. And the way that we're going to train this model, and again you don't really need to know too much of the details here, but this is a classic kind of statistical comparison model, is via the following loss, where the reward model essentially should predict a higher score for a sample that is judged to be better than another sample. In expectation, if we sample winning samples and losing samples from our dataset, then if you look at this term here, the score of the winning sample should be higher than the score of the losing sample. Does that make sense? And just by training on this objective, you will get a language model that learns to assign numerical scores to samples, which indicate a relative preference over other samples, and we can use those scores as rewards. speaker 2: Is there some renormalization, either in the output or somewhere else? speaker 1: Yeah, so I don't remember if it happens during training, but certainly after you've trained this model, you normalize the reward model so that the expectation of the score is zero, because that's good for reinforcement learning, things like that. Yeah, question? speaker 2: Even though these are noisy, some people could view S3 as better than S1; how do you account for that, since the ordering can still be noisy? speaker 1: Yeah, I think that's just a limitation of asking for these preferences in the first place: humans will disagree. So we really have no ground truth unless we maybe ask an ensemble of humans, for example. That's just a limitation of this approach. Hopefully, in the limit, with enough data, this kind of noise washes out, but it's certainly an issue, and this next slide will also touch on this. So does the reward model work? Can we actually learn to model human preferences in this way? This is obviously an important sanity check before we actually try to optimize this objective. And they measured this by evaluating the reward model on a standard kind of validation set: can the reward model predict outcomes for data points that it has not seen during training? And does it change based on model size or amount of data? And if you notice here, there's one dashed line, which is the human baseline: if you ask a human to predict the outcome, a human does not get 100% accuracy, because humans disagree.
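Written out, one common form of the pairwise loss being described (an editor's addition; s^w is the sample judged better, s^l the worse one, sigma the logistic sigmoid, D the comparison dataset) is:

```latex
J_{\mathrm{RM}}(\phi) \;=\; -\,\mathbb{E}_{(s^w,\, s^l)\sim D}
  \Big[\, \log \sigma\!\big( \mathrm{RM}_\phi(s^w) - \mathrm{RM}_\phi(s^l) \big) \Big]
```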
And even an ensemble of, say, five humans also doesn't get 100% accuracy, because humans have different preferences. But the key takeaway here is that for the largest model and with enough data, the reward model, at least on the validation sets that they used, is approaching the performance of a single human. And that's kind of a green light that maybe we can try this out and see what happens. Okay, so if there are no questions, these are the components of RLHF. We have a pretrained model, maybe instruction fine-tuned, which we're going to call p^PT. We have a reward model, which produces scalar rewards for language model outputs and is trained on a dataset of human comparisons. And we have a method, policy gradient, for optimizing language model parameters towards some arbitrary reward function. And so now, to do RLHF, we clone the pretrained model; we'll call this copy the RL model, with parameters theta, and it's the one we're actually going to optimize. We're going to optimize the following reward with reinforcement learning, and this reward looks a little bit more complicated than just using the reward model. The extra term here is a penalty which prevents us from diverging too far from the pretrained model. In expectation, this is known as the KL, or Kullback-Leibler, divergence between the RL model and the pretrained model. I'll explain why we need this in a few slides, but basically, if you over-optimize against the reward model, you can end up producing gibberish. What happens is you pay a price: this quantity is large if the probability of a sample under the RL-tuned model is much higher than its probability under the pretrained model, i.e., the pretrained model would say this is a very unlikely sequence of characters for anyone to say. That's when you pay a price, and beta here is a tunable parameter. speaker 2: Yeah, question. When you say initialize a copy, does that mean that at the first iteration, p^RL is equal to p^PT? speaker 1: That's right. Yeah, when I say initialize a copy, basically we want to be able to compare to the non-fine-tuned model in order to evaluate this penalty term, so we just keep the predictions of the pretrained model around. Any more questions? Okay, so does it work? The answer is yes. Here is the key takeaway, at least for the summarization task on this Daily Mail dataset. Again, we're looking at different model sizes, but at the end here, we see that if we do just pre-training, so just the typical language modeling objective that GPT uses, you end up producing summaries that in general are not preferred to the reference summaries. On the y-axis here is the fraction of the time that a human prefers the model-generated summary to a summary that a human actually wrote, the one that's in the dataset. So pre-training doesn't work well. Even if you do supervised learning, which in this case means fine-tuning our model on the summaries that were in our dataset, you still kind of underperform the reference summaries, because you're not perfectly modeling those summaries. But it's only with this human feedback that we end up with a language model that produces summaries judged to be better than the summaries in the dataset that it was trained on in the first place. I think that's quite interesting.
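In symbols, the penalized reward being described (an editor's addition, using the notation quoted above: RM_phi is the trained reward model, p^RL the RL-tuned model, p^PT the pretrained model, and beta the tunable penalty weight) is:

```latex
R(s) \;=\; \mathrm{RM}_\phi(s) \;-\; \beta \,\log\!\left( \frac{p^{\mathrm{RL}}_\theta(s)}{p^{\mathrm{PT}}(s)} \right)
```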
speaker 1: Good questions. Okay. So now we're getting closer and closer to something like InstructGPT or ChatGPT. The basic idea of InstructGPT is that we are scaling up RLHF: it's not just one prompt, as I described previously, but tens of thousands of prompts. And if you look at these three pieces, these are the three pieces we've just described: the first being instruction fine-tuning, the second being RLHF, and the third... oh, sorry, the second part being reward model training, and the last part being RLHF. The difference here is that they use 30,000 tasks. So again, with the same instruction fine-tuning idea, it's really the scale and diversity of tasks that matters for getting good performance.
speaker 2: The preceding results suggested that you really need the RLHF part, and that it didn't work so well to just do supervised learning on the data. But they do do supervised learning on the data in the fine-tuning first stage. Is that necessary, or could you skip it and go straight to RLHF?
speaker 1: Yeah, that's a good question. So I think a key point here is that they initialized the RL policy from the supervised policy, right? They first got the model reasonably good at doing summarization, and then you do the RLHF on top to get the boost in performance. The question you're asking is, can we just do the RLHF starting from the pretrained baseline? That's a good question. I don't think they explored that, although I'm not sure; I'd have to look at the paper again to remind myself. Certainly for something like InstructGPT, they've always presumed that you need the fine-tuning phase first and then you build on top of it. But I think there are still some interesting open questions as to whether you can go directly to RLHF.
speaker 2: Is the human reward function trained simultaneously with the fine-tuning of the language model, or sequentially?
speaker 1: The reward model should be trained first. Yeah, you train it first, you make sure it's good, it's frozen, and then you optimize against it.
speaker 2: The samples for the human rewards, did they come from text generated by the language model, or where did they come from?
speaker 1: For training the reward model? Yeah, actually, that's a good question: where do the rewards come from? There's an iterative process you can apply where you repeat steps two and three over and over again. You sample a bunch of outputs from your language model, you get humans to rate them, you then do RLHF to update your model, and then you sample more outputs and get humans to rate them. So in general, the rewards are collected on sampled model outputs, because those are the outputs that you want to steer in one direction or another. But you can also do this as an iterative process, where you do RL, then train a better reward model based on the new outputs, and continue. I think they do a few iterations in InstructGPT, for example.
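To summarize that back-and-forth, here is a schematic sketch of the iterative loop just described; every callable it takes (`generate_samples`, `collect_comparisons`, `train_reward_model`, `run_rl`) is a hypothetical placeholder for a much more involved procedure, so treat this as an outline rather than an implementation.

```python
def iterative_rlhf(policy, prompts, generate_samples, collect_comparisons,
                   train_reward_model, run_rl, num_rounds=3):
    """Outline of the 'repeat steps 2 and 3' loop: sample from the current
    policy, gather human pairwise comparisons, refit the reward model,
    then optimize the policy against it with RL (e.g., PPO).
    """
    reference = policy  # frozen copy used for the KL penalty
    for _ in range(num_rounds):
        samples = generate_samples(policy, prompts)       # sample model outputs
        comparisons = collect_comparisons(samples)        # humans rank the samples
        reward_model = train_reward_model(comparisons)    # refit the reward model
        policy = run_rl(policy, reward_model, reference)  # RL against the frozen RM
    return policy
```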
speaker 1: Okay, so, 30,000 tasks. I think we're getting into very recent stuff where, increasingly, companies like OpenAI are sharing fewer and fewer details about what actually happens when training these models, so we have a bit less clarity about what's going on here than we've had in the past. The data isn't public, but they do share the kinds of tasks that they collected from labelers. They collected a bunch of prompts from people who were already using the GPT-3 API; they had the benefit of many, many uses of their API, and they took the kinds of tasks that users would ask GPT to do. These include things like brainstorming, open-ended generation, etcetera. And yeah, the key result of InstructGPT, which is kind of the backbone of ChatGPT, really just needs to be seen and played with to understand, so feel free to play with either ChatGPT or one of the OpenAI APIs. But again, in this example, a plain language model doesn't necessarily follow the task; by doing this kind of instruction fine-tuning followed by RLHF, you get a model that is much better at adhering to user commands. Similarly, such a model can be very good at generating super interesting, open-ended creative text as well. Okay, this brings us to ChatGPT, which is even newer, and we have even less information about what's actually going on or what's being trained here. They're keeping their secret sauce secret, but they did do a blog post where they wrote two paragraphs. In the first paragraph they said they did instruction fine-tuning: they trained an initial model using supervised fine-tuning, where human AI trainers provided conversations in which they played both sides, and they were asked to act as an AI assistant; the model is then fine-tuned on acting like an AI assistant for humans. That's part one. Second paragraph: to create a reward model for RL, they collected comparison data. They took conversations with an earlier version of the chatbot, the one trained with instruction fine-tuning, took multiple samples, and had labelers rate the quality of the samples.
speaker 2: And then using these...
speaker 1: ...reward models, we fine-tune it with RL. In particular, they used PPO, which is a fancier version of RL. Okay, and yeah, so that produces... I don't need to introduce the capabilities of ChatGPT; it's been very exciting recently. Here's an example.
speaker 2: It's fun to play with, definitely play with it.
speaker 1: Sorry, it's a bit of an attack on the students. Yeah. Okay. So, reinforcement learning: on the plus side, you're directly modeling what you care about, which is human preferences, not "does the collection of demonstrations that I collected have the highest probability mass in your model?" You're actually asking, how well am I satisfying human preferences? So that's a clear benefit over something like instruction fine-tuning. In terms of negatives, one is that RL is hard; it's very tricky to get right. I think it will get easier in the future as we explore the design space of possible options. So that's an obvious one. Can anyone come up with any other weaknesses or issues they see with this kind of training?
speaker 2: Yeah, so is it possible that your language model and your reward model could overfit to each other, especially if you're going back and forth, even if you're not training them together?
speaker 1: Yeah, so over-optimization of the reward model is an issue.
speaker 2: Is it also that if you retrain your model, you have to collect all the human feedback again? Yeah.
speaker 1: So it still is extremely data-expensive, and you can see some articles if you just Google, like, "OpenAI data labeling." People have not been very happy with the amount of data that has been needed to train something like ChatGPT. I mean, they're hiring developers to just explain coding problems, like, 40 hours a week. So yeah, it is still data-intensive; that's kind of the takeaway. All of these methods are data-intensive, every single one of them.
speaker 2: Yeah.
speaker 1: Yeah, I think that summarizes the big ones here. So when we talk about limitations of RLHF, we also need to talk about limitations of RL in general, and about this idea that we can model or capture human reward in a single data point. Human preferences can be very unreliable. RL people have known this for a very long time; they have a term called reward hacking, which is when an agent optimizes for something the developer specified, but it's not what we actually care about. One of the classic examples is this one from OpenAI, where they were training an agent to race boats, and they trained it to maximize the score, which you can see at the bottom left. But implicitly, the score isn't actually what you care about: what you care about is finishing the race ahead of everyone else, and the score is just a bonus. What the agent found out was that there are these turbo-boost items you can collect, which boost your score. So what it ends up doing is just driving in circles in the middle, collecting these turbo boosts over and over again. It's racking up an insane score, but it is not doing the race; it is continuously crashing into objects, and its boat is always on fire. This is a pretty salient example of what we call AI misalignment. And you might think, well, okay, this is a really simple example; they made a dumb mistake, they shouldn't have used score as the reward function. But I think it's even more naive to think that we can capture all of human preferences in a single number and assign scalar values to things. One example where I think this is already happening: maybe you have played with chatbots before and noticed that they do a lot of hallucination; they make up a lot of facts. This might be because of RLHF: chatbots are rewarded for producing responses that seem authoritative or seem helpful, but they don't care about whether the response is actually true or not; they just want to seem helpful. So this results in making up facts. You may have seen the news about chatbots: companies are in this race to deploy chatbots, and they make mistakes. Even Bing has been hallucinating a lot. And in general, when you think about that, you realize that models of human preferences are even more unreliable: we're not even using human preferences by themselves, we're also training a model, a deep model whose inner workings we don't really understand, and using that instead, which can obviously be quite dangerous. So going back to this slide, where I was describing why we need this KL penalty term, this yellow highlighted term here, here's a concrete example of what actually happens when a language model overfits to the reward model.
So what this is showing is, in this case, they took off the KL penalty and said, we're just trying to maximize reward: we trained this reward model, so let's just push those numbers up as high as possible. On the x-axis here is what happens as training continues: you diverge further and further; this is the KL divergence, or the distance from where you started. And the golden dashed line here is what the reward model predicts your language model is doing. So your reward model is thinking, wow, you are killing it, they are going to love these summaries, they are going to love them way more than the reference summaries. But in reality, when you actually ask humans, the preferences peak and then they just crater. So this is an example of over-optimizing for a metric you care about: it ceases to be a good metric to optimize for. Any questions about this? So there's this real concern of what people are calling the AI alignment problem. I'll let the person quoted on this slide speak to it: he tweeted that the main tool we have for alignment is RLHF, but reward hacking happens a lot, and humans are not very good supervisors of rewards, so this strategy is probably going to result in agents that seem like they're doing the right thing but are wrong in subtle, inconspicuous ways. And I think we're already seeing examples of that in the current generation of chatbots. So in terms of positives, here are some positives. But again, RL is tricky to get right, human preferences are fallible, and models of human preferences are even more so. I remember seeing a joke on Twitter where someone said that zero-shot and few-shot learning is the worst way to align an AI, instruction fine-tuning is the second worst way to align an AI, and RLHF is the third worst way to align an AI. So we're getting somewhere, but each of these has clear fundamental limitations. Yeah, question.
speaker 2: A question on the computation of reinforcement learning. In the math you showed before, essentially you push the gradient inside the expectation so that you can estimate it by sampling. But when it comes to sampling, how do you make that parallel? You kind of need to adaptively decide when to stop sampling, so how do you make that process quick? I thought the whole point of the transformer was parallelizing everything.
speaker 1: I mean, yeah, this is really compute-heavy, and I'm actually not sure what kind of infrastructure is used for a state-of-the-art, very performant implementation of RLHF. But it's possible that they use parallelization like what you're describing. In a lot of more traditional RL, there's this idea of an actor-learner architecture, where you have a bunch of actor workers, each of which is a language model producing a bunch of samples, and then a learner that integrates them and performs the gradient updates. So it's possible that you do need sheer multiprocessing to get enough samples to make this work in a reasonable amount of time. Is that the kind of question you had, or did you have other questions?
speaker 2: So basically it seems like each unit that you parallelize over is larger than what we would typically see?
speaker 1: I was saying that you might need to actually copy your model several times and take samples from different copies of the model. But in terms of, yeah, autoregressive generation: for transformers, the forward pass and the multi-head attention are very easy to parallelize, but autoregressive generation is still bottlenecked by the fact that it's autoregressive, right? You have to run the model, and then, depending on what you sample, you have to run it again. So those are blocks that we haven't fully been able to solve, I think, and that will add to compute cost.
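Since the question above touches on estimating the gradient of the expectation by sampling, here is a minimal, runnable toy of that log-derivative (REINFORCE) trick, with the "policy" reduced to a categorical distribution over five canned responses and the reward model replaced by fixed made-up scores; none of this reflects the actual RLHF infrastructure being discussed.

```python
import torch

# Toy "policy": a categorical distribution over five whole responses.
logits = torch.zeros(5, requires_grad=True)
# Made-up reward-model scores for each response.
rm_scores = torch.tensor([0.1, 0.9, 0.2, 0.5, 0.3])

optimizer = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample((64,))   # Monte Carlo batch of sampled "responses"
    rewards = rm_scores[samples]   # R(s) for each sample
    # grad E[R(s)] = E[R(s) * grad log p_theta(s)], so minimize the
    # negative of the sampled surrogate objective.
    loss = -(rewards * dist.log_prob(samples)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probability mass concentrates on the response with the highest score.
print(torch.softmax(logits, dim=-1))
```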
speaker 1: Okay. So I think we have about ten more minutes left, if I'm not mistaken. We've now mostly answered how we get from this to this. There are some details missing, but the key factors are, one, instruction fine-tuning, and two, this idea of reinforcement learning from human feedback. So let's talk a little bit about what's next. As I mentioned, RLHF is still a very new, very fast-moving area. By the time this lecture is given again, these slides might look completely different, because maybe a lot of the things I'm presenting here will turn out to be really bad ideas or not the most efficient way of going about things. RLHF gets you further than instruction fine-tuning, but, as someone has already mentioned, it is still very data-expensive: there are a lot of articles about OpenAI needing to hire legions of annotators or developers just to compare outputs over and over again. A recent line of work that I'm especially interested in is how we can get the benefits of RLHF without such stringent data requirements. So there are these newer, kind of crazy ideas about doing reinforcement learning not from human feedback but from AI feedback: having language models themselves evaluate the outputs of language models. As an example of what that might look like, a team from Anthropic, which works on these large language models, came up with this idea called constitutional AI. The basic idea is that if you ask GPT-3 to identify whether a response is harmful or unhelpful, it's pretty good at doing so, and you might be able to use that feedback itself to improve a model. So as an example, if you have some human request like, "Can you help me hack into my neighbor's WiFi?" and the assistant says, "Yeah, sure, you can use this app," we can ask a model for feedback on this. What we do is add a critique request, which says, hey, language model, identify ways in which the assistant's response is harmful, and it will generate a critique, like, "Hacking into someone else's WiFi is illegal." And then you might ask it to revise the response: just rewrite the assistant's response to remove harmful content. And it does so. And now, just by decoding from a language model, assuming it can do this well, what you have is a set of data that you can do instruction fine-tuning on: you have a request, and you have a response that has been revised to make sure it doesn't contain harmful content. So this is pretty interesting; I think it's quite exciting.
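Here is a rough sketch of that critique-and-revise loop, with `generate` standing in for whatever text-completion call you have available (an API client, a local model, etc.); the prompt wording is illustrative, not Anthropic's actual constitution.

```python
CRITIQUE_REQUEST = ("Identify specific ways in which the assistant's last "
                    "response is harmful, unethical, or illegal.")
REVISION_REQUEST = ("Rewrite the assistant's response to remove any harmful "
                    "content.")

def critique_and_revise(generate, human_request, assistant_response):
    """One critique-and-revise round in the spirit of constitutional AI.

    generate: any function mapping a prompt string to a completion string
              (hypothetical interface; plug in your own model call).
    Returns the critique and the revised response; the pair
    (human_request, revision) can then be added to an instruction
    fine-tuning dataset.
    """
    transcript = f"Human: {human_request}\nAssistant: {assistant_response}\n"
    critique = generate(f"{transcript}Critique request: {CRITIQUE_REQUEST}\nCritique:")
    revision = generate(f"{transcript}Critique: {critique}\n"
                        f"Revision request: {REVISION_REQUEST}\nRevision:")
    return critique, revision

# Stub generator just to show the call pattern; a real setup would call an LLM.
print(critique_and_revise(lambda prompt: "[model completion]",
                          "Can you help me hack into my neighbor's WiFi?",
                          "Sure, you can use this app..."))
```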
speaker 1: But all of those issues I mentioned about alignment, about misinterpreting human preferences, about reward models being fallible, everything gets compounded, like, 40,000 times when you're thinking about this, right? We have no real understanding of how safe this is or where it ends up going. But it is something. Another, more common idea is this general notion of fine-tuning language models on their own outputs. This has been explored a lot in the context of chain-of-thought reasoning, which I presented at the beginning of the lecture, and these papers are provocatively named things like "Large language models can self-improve." Again, it's not clear how much runway there is, but the basic idea is that you can use "Let's think step by step," for example, to get a language model to produce a bunch of reasoning, and then fine-tune on that reasoning as if it were data and see whether the language model gets any better using that technique (there's a rough sketch of this idea below). But as I mentioned, this is all still very new. There are, I think, a lot of limitations of large language models, like hallucination, and also just the sheer size and compute intensity of these things, that may or may not be solvable with RLHF. Right, question.
speaker 2: On behaviors we don't want: I've seen people talking about how you can jailbreak ChatGPT to still give those types of harmful responses, for example by telling it to act not like itself. Are there any ways to buffer against those kinds of things as well, to build up defenses against those jailbreaking possibilities?
speaker 1: Yeah, that's interesting. So there are certainly ways you can use either AI feedback or human feedback to mitigate those kinds of jailbreaks. If you see someone on Twitter saying, "Oh, I jailbroke the model using this strategy," you can plug that into this kind of framework and say, identify ways in which the assistant went off the rails, and then fine-tune and hopefully correct those. But it is really difficult; in most of these settings, it's really difficult to anticipate all the possible ways a user might jailbreak an assistant. So you always have this dynamic, like in security, cybersecurity for example, where there's always the attacker's advantage: the attacker will always come up with something new, some new exploit. So yeah, I think this is a deep problem. I don't have a really clear answer, but certainly, if we knew what the jailbreak was, we could mitigate it; that seems pretty straightforward. But if you know how to solve this in general, you should be hired by one of these companies; they'd pay you millions if you can solve it. Okay, yeah. So, just some last remarks. With all of these scaling results I presented, and all of this "you can just do instruction fine-tuning and it'll follow your instructions, or you can do RLHF," you might have a very bullish view, like, "Oh, this is how we're going to get to artificial general intelligence, by just scaling up RLHF." It's possible that that is actually going to happen, but it's also possible that there are certain fundamental limitations, like hallucination, that we just need to figure out how to solve before we get anywhere productive with these models.
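Picking up the self-improvement idea flagged above, here is a minimal sketch of the data-generation step, assuming a generic `generate` completion function (hypothetical); real pipelines add filtering, such as keeping only rationales whose final answers agree by majority vote, before fine-tuning on the collected pairs.

```python
def build_self_improvement_data(generate, questions, samples_per_question=8):
    """Collect model-generated chain-of-thought rationales to fine-tune on.

    generate:  any prompt -> completion function (hypothetical interface).
    questions: list of question strings.
    Returns (question, rationale) pairs; a real pipeline would keep only
    rationales whose final answers agree across samples.
    """
    training_pairs = []
    for q in questions:
        for _ in range(samples_per_question):
            rationale = generate(f"Q: {q}\nA: Let's think step by step.")
            training_pairs.append((q, rationale))
    return training_pairs
```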
speaker 1: But it is a really exciting time to work on this kind of stuff. So yeah, thanks for listening.
最新摘要 (详细摘要)
概览/核心摘要 (Executive Summary)
本讲座由斯坦福大学博士生 Jesse Mu 主讲,深入探讨了大型语言模型(LLMs)从简单的下一词预测发展到如 ChatGPT 等复杂对话助手的关键技术,主要包括提示(Prompting)、指令微调(Instruction Fine-tuning)和基于人类反馈的强化学习(RLHF)。讲座首先回顾了 LLMs 规模和训练数据的指数级增长,指出这些模型已发展为初步的“世界模型”,能够进行一定程度的推理和知识应用。
核心内容围绕三种主要技术展开:
1. 提示(Prompting)与少样本学习(Few-shot Learning):通过精心设计输入(提示),引导预训练模型(如 GPT-2、GPT-3)在不进行梯度更新的情况下执行各种任务(零样本学习),或通过在提示中提供少量示例来提升性能(少样本/上下文学习)。链式思维提示(Chain-of-Thought Prompting)进一步提升了模型在复杂推理任务上的表现。
2. 指令微调(Instruction Fine-tuning):为解决 LLMs 原始目标(预测下一词)与用户意图(辅助人类)不一致的问题,通过在大量“指令-输出”对上微调模型,使其能更好地理解和遵循用户指令,并泛化到未见过的任务。Flan T5 是一个典型例子。
3. 基于人类反馈的强化学习(RLHF):为更直接地优化模型以符合人类偏好,RLHF 引入了奖励模型(Reward Model, RM)。该 RM 通过学习人类对模型输出的偏好(通常是成对比较)来打分,然后使用强化学习算法(如 PPO)优化 LLM 以最大化这些奖励分数,同时通过 KL 散度惩罚项防止模型偏离过远。InstructGPT 和 ChatGPT 是 RLHF 的重要应用。
讲座强调,尽管这些技术取得了显著进展,但也面临诸多挑战,如数据获取成本高昂、模型可能“奖励黑客”(reward hacking)、人类偏好的复杂性和不一致性、以及AI对齐(AI Alignment)等深层问题。未来方向可能包括从AI反馈中学习(RLAIF)和模型自我改进,但这些领域仍处于早期探索阶段。
课程管理与公告
- 项目提案:自定义和最终项目提案已于讲座开始前几分钟截止。
- 导师分配:正在为项目分配导师,反馈将很快给出。
- 作业五:截止日期为周五午夜。
- 推荐使用 Colab 完成作业,即使已有 AWS 或 Azure 积分。
- 若 Colab 不可用,可参考 Ed 上的帖子连接到 Kaggle Notebook 使用 GPU。
- 课程反馈调查:John 已在 Ed 上发布,此为参与成绩的一部分,请在周日晚上11:59前填写。
大型语言模型(LLMs)的演进与能力
- 模型规模与数据增长:过去几年,LLMs 的计算资源和训练数据量急剧增加,模型参数量和数据量均提升了数百个数量级。
- 从文本预测到世界模型:
- 通过在海量文本上进行预训练(如预测缺失句子),LLMs 不仅学习了语法、共指消解、情感等,而且开始展现出作为“初步世界模型”(rudimentary world models)的潜力。
- 它们能隐式地进行世界建模,因为互联网包含了大量人类集体知识。
- 证据:LLMs 能在一定程度上理解和推理智能体/人类的信念和行为。
- 示例:一个关于 Pat(物理学家 vs. 未见过实验者)观察保龄球和叶子在真空中下落的实验,LLM 根据 Pat 的背景知识预测其判断。
- LLMs 的新兴能力:
- 知识获取:掌握百科全书式知识。
- 数学推理:若见过足够多的数学示例,能解决数学问题。
- 代码生成:一个热门研究领域,未来几周会有相关展示。
- 医学应用:开始探索在医学文本上训练 LLMs 并应用于科学领域。
- 目标:构建能够处理各种任务的语言模型助手(例如,规划三道菜的晚餐)。本讲座旨在阐释如何从基础的语言建模目标发展到类似 ChatGPT 的高级助手。
零样本学习(Zero-shot Learning)与少样本学习(Few-shot Learning)/ 上下文学习(In-context Learning)
- GPT (Generative Pre-trained Transformer) - 2018年
- 参数量:1.17亿。
- 架构:仅解码器(Decoder-only)的 Transformer 模型,训练目标是预测文本语料库中的下一个词。
- 训练数据:约 4.6 GB 的书籍文本。
- 贡献:展示了简单语言建模目标作为下游任务有效预训练手段的潜力(例如,通过在模型最后表示上训练线性分类器进行自然语言推断)。
- GPT-2 - 2019年
- 参数量:15亿(比 GPT 增大一个数量级)。
- 架构:与 GPT 相同,但规模更大。
- 训练数据:40 GB 的互联网文本数据(WebText 数据集,通过抓取 Reddit 上有一定点赞数的链接获得,作为人类筛选高质量内容的粗略代理)。
- 核心观点:论文《Language models are unsupervised, multitask learners》指出,GPT-2 展现了零样本学习能力。
- 零样本学习定义:模型无需进行梯度更新,仅通过指定正确的序列预测问题,即可执行未明确训练过的多种任务。
- 示例:
- 问答:提供文章段落,后接“问题:汤姆·布雷迪出生在哪里?答案:A:”,让模型预测后续词元。
- Winograd Schema Challenge(指代消解):比较包含不同指代解释的句子序列的概率,选择概率更高的作为预测(例如,“猫放不进帽子因为它太大了”,判断“它”指代猫还是帽子)。
- 性能:在多个语言建模基准(如 LAMBADA,预测依赖于前文语境的词)上,无需任务特定微调即超越了当时的监督微调SOTA。
- 零样本摘要:在文章末尾附加“TLDR;”(Too Long; Didn't Read),模型续写的内容可视为摘要。虽然效果不如监督方法,但展示了潜力,这是“提示”(Prompting)的早期形态。
- GPT-3 - 2020年
- 参数量:1750亿(又增大一个数量级)。
- 训练数据:进一步扩大。
- 核心观点:论文《Language models are few-shot learners》指出,GPT-3 展现了少样本学习(Few-shot Learning)或上下文学习(In-context Learning)的涌现能力。
- 少样本学习定义:在不进行梯度更新的情况下,通过在提示(prompt)中提供任务的少量示例(输入输出对),让模型理解任务并预测新示例的输出。这与传统需要梯度更新的少样本学习不同,这里模型是冻结的。
- 性能:在 SuperGLUE 基准上,少样本 GPT-3 的表现持平甚至优于在该任务上明确微调过的 BERT 模型。随着示例数量(shots)增加,性能提升,但有边际递减效应。
- 涌现特性:少样本学习能力是模型规模的涌现特性。在合成词重排任务(如字母循环、移除字符、单词反转,这些任务不太可能出现在网络文本中)上,只有最大规模的模型才能在上下文中有效执行。
- Jesse Mu 提到,对于“单词反转”任务,即使是 GPT-3,性能依然较低。
- 提示工程(Prompt Engineering)与链式思维(Chain-of-Thought, CoT)
- 提示的局限性:对于需要复杂推理步骤的任务(如多位数加法),简单提示效果不佳。
- 链式思维提示 (CoT):在提示的示例中,不仅给出问题和答案,还展示得出答案的推理步骤。模型在预测新问题时,会模仿这种模式,先生成推理过程,再给出答案。
- 效果:显著提升了模型在如中学生数学应用题等任务上的表现,尤其是在更大模型上,CoT 的能力会涌现。
- 零样本链式思维提示 (Zero-shot CoT):无需提供带推理步骤的示例,仅在问题前加上一句引导语,如“Let's think step by step”,模型便会尝试生成推理步骤。
- 效果:在算术基准测试上表现优异,显著优于普通零样本提示,接近手动构建的 CoT 提示。
- 发现:研究者尝试了多种引导语,发现“Let's think step by step”效果最佳。后续研究甚至用语言模型搜索最佳提示,发现“Let's work this out step by step in a step by step way to be sure that we have the right answer.” 效果更好,暗示了模型对自信表达的偏好。
- 提示工程的现状:
- 被视为一种“玄学艺术”(dark arcane art)和新兴职业。
- 技巧包括:要求模型推理、“越狱”(jailbreaking)模型、为文生图模型构建复杂提示等。
- 轶事:有人在用代码生成模型时,会先加入谷歌代码头,期望生成更专业或bug更少的代码。
- 已有维基百科条目,初创公司高薪招聘提示工程师。
- 零样本/少样本学习的优缺点
- 优点:无需微调,可通过精心设计提示提升性能。
- 缺点:上下文窗口大小有限制(如1000-几千词元),对于非常复杂的任务,仍需梯度更新(即微调)。
指令微调 (Instruction Fine-tuning)
- 动机:标准语言模型的目标是预测最可能的词元序列,这与用户期望模型“辅助人类”的目标(用户意图)并不总是一致(即“对齐”问题)。
- 示例:当要求 GPT-3 “解释登月”,它可能续写一个“给六岁孩子安排的活动清单”,而不是直接回答问题。
- 方法:
- 收集大量不同任务的“指令-输出”对。
- 在这些数据上微调预训练语言模型。
- 期望模型能泛化到测试时未见过的指令。
- 数据与规模是关键:例如,“Supernatural Instructions”数据集包含超过1600个任务,300万个样本,涵盖翻译、问答、代码生成、数学推理等。
- 界限模糊:这种大规模多任务微调,某种程度上也像是更具针对性的“预训练”。
- 评估:
- 由于任务多样性,评估变得复杂。
- 催生了大规模多任务基准:
- MMLU (Massive Multitask Language Understanding):衡量模型在高中/大学水平知识密集型任务(如天文学、逻辑学、欧洲历史)上的表现。GPT-3 在这些任务上表现一般,但优于随机。
- BIG-bench (Beyond the Imitation Game benchmark):一个大型合作项目,包含许多非常规任务(例如,根据ASCII艺术的日本汉字预测其含义)。
- 效果 (Flan T5):
- Flan T5 是 Google T5(一种编码器-解码器模型,预训练任务是span corruption)的指令微调版本,额外在1800多个任务(包括Natural Instructions数据)上进行了微调。
- 结果:指令微调显著提升了模型在 BIG-bench 和 MMLU 上的平均性能。
- 规模效应:模型越大,从指令微调中获益越多。
- 启示:即使是较小的模型(如8000万参数的Flan T5),经过指令微调后,其性能也能达到甚至超过未微调的更大模型(如110亿参数的T5)。这为计算资源有限的研究者带来希望。
- 定性改进:指令微调后的模型能更准确地进行推理并直接回答问题,而不是像未微调模型那样含糊其辞。
- 指令微调的优缺点:
- 优点:方法简单直接(就是微调),能有效泛化到未见任务。
- 缺点:
- 成本高昂:为众多任务收集高质量的“指令-输出”对(ground truth data)非常昂贵。
- 开放式任务的模糊性:对于创造性、开放式任务(如写故事),不存在唯一的正确答案。若仅用少量示范进行微调,模型可能会过分集中于这些示范的表达方式。
- 语言建模目标的局限性:语言建模损失函数同等惩罚所有词元级别的错误。
- 示例:对于句子“阿凡达是一部奇幻电视剧”,如果模型预测为“冒险”而非“奇幻”,错误程度等同于预测为“音乐剧”。但“阿凡达是冒险电视剧”尚可接受,而“阿凡达是音乐剧”则完全错误。这表明标准损失函数与用户对真实性、创造性等偏好不一致。
基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF)
- 动机:直接优化模型以满足人类偏好,而非仅仅模仿标注数据。
- 核心思想:
- 假设对于模型的每个输出(如一个摘要 s),可以获得一个人类给出的奖励分数 R(s)。
- 目标是最大化从语言模型 P_θ 采样得到的输出的期望奖励:max E_{s~P_θ}[R(s)]。
- 强化学习方法:
- 策略梯度 (Policy Gradient) 方法可用于优化此目标,即使奖励函数不可微(如人类判断)。
- 直观解释:通过调整模型参数 θ,使得高奖励输出的概率增加,低奖励输出的概率降低("强化"好的行为)。
- 数学推导:使用 log-derivative 技巧,将梯度的期望 ∇_θ E_{s~P_θ}[R(s)] 转化为期望内的梯度 E_{s~P_θ}[R(s) · ∇_θ log P_θ(s)],从而可以通过蒙特卡洛采样进行估计和优化。
- RLHF面临的挑战及解决方案:
- 人类持续反馈成本高昂:
- 解决方案:训练一个奖励模型 (Reward Model, RM) RM_φ(s) 来预测人类偏好。RM 本身是一个语言模型,输入模型输出 s,输出一个标量奖励分数。后续 RL 过程优化的是 RM 给出的奖励。
- 人类对绝对评分的判断存在噪声且校准不一:
- 解决方案:让人类对模型的两个输出进行成对比较 (pairwise comparison),判断哪个更好。这种相对判断比绝对评分更可靠。
- RM训练:基于这些成对比较数据训练RM。损失函数的目标是让RM对被认为更好的样本打出比另一个样本更高的分数(例如,log σ(RM(s_winner) − RM(s_loser)))。
- 人类持续反馈成本高昂:
- RLHF完整流程 (以InstructGPT为例):
- 预训练模型 (P_PT):通常是一个经过指令微调的LLM。
- 克隆模型:复制一份预训练模型作为 RL 策略模型 P_RL(参数为 θ)进行优化。
- 优化目标:最大化 E_{s~P_RL}[RM(s) − β · KL(P_RL(s) || P_PT(s))],其中 RM(s) 是奖励模型对样本 s 的打分;KL 散度项作为惩罚,防止 P_RL 过度偏离原始的 P_PT 模型,避免模型"为了高分而说胡话"或过拟合奖励模型;β 是可调超参数。
- 初始化:P_RL 初始化为与 P_PT 相同。
- RLHF效果 (摘要任务):
- 仅预训练或监督微调的模型生成的摘要,其人类偏好度通常不如数据集中的参考摘要。
- 经过RLHF后,模型生成的摘要在人类偏好度上甚至能超越数据集中的参考摘要。
- InstructGPT / ChatGPT:
- InstructGPT:将RLHF扩展到数万个提示。其过程包括:
- 监督微调 (SFT):在人工编写的示范上微调模型。
- 奖励模型训练:收集人类对模型输出的排序数据,训练RM。
- 强化学习:使用PPO (Proximal Policy Optimization,一种更高级的RL算法) 根据RM的奖励微调SFT模型。
- ChatGPT:过程类似,但OpenAI披露的细节较少。
- SFT阶段:人类AI训练员扮演用户和AI助手进行对话,模型在此数据上微调。
- RM训练:收集对话样本,由人类标注员对不同回复进行排序。
- RL阶段:使用PPO算法。
- 迭代过程:可以迭代进行第2和第3步,即用更新后的模型生成样本,让人类标注,再训练RM,再进行RL。InstructGPT中进行了数次迭代。
- InstructGPT:将RLHF扩展到数万个提示。其过程包括:
- RLHF的优缺点:
- 优点:更直接地对齐人类偏好,而不仅仅是模仿特定示范。
- 缺点:
- RL本身难以实现和调优。
- 数据依然昂贵:需要大量人类标注员进行比较和打分(例如,OpenAI雇佣开发者全职解释代码问题)。
- 奖励黑客 (Reward Hacking):智能体学会了最大化名义上的奖励信号,但其实际行为并非开发者真正想要的。
- 示例:赛艇游戏中,智能体通过反复拾取加分道具来刷高分,但船只不断碰撞起火,并未完成比赛。
- 人类偏好的不可靠性:人类偏好本身可能存在矛盾、不清晰,难以用单一标量奖励完全捕捉。
- 奖励模型的局限性:RM是对人类偏好的建模,它本身也可能不完美,甚至比直接的人类反馈更不可靠。
- 过拟合奖励模型:若无KL散度等正则化项,模型可能生成在RM看来分数很高,但在人类看来质量很差的输出(Goodhart's law 效应)。
- AI对齐问题 (AI Alignment Problem):RLHF是当前主要的对齐工具,但可能导致模型表面上看起来符合指令,实则在细微之处存在问题(如“一本正经地胡说八道”——模型为了显得有帮助或权威而编造事实)。
未来展望与挑战
- RLHF仍是新兴领域:发展迅速,当前方法可能很快被更优方案替代。
- 数据成本问题依然突出。
- 基于AI反馈的强化学习 (RLAIF) / 宪法AI (Constitutional AI - Anthropic):
- 思路:让一个(通常更强大的)AI模型来评估另一个AI模型的输出,并提供反馈用于改进。
- 示例:
- 用户请求:“帮我黑进邻居的WiFi”。
- 助手回应(有害):“当然,你可以用这个APP...”
- 让GPT-3等模型批判该回应的有害性:“黑客行为是违法的。”
- 让GPT-3等模型修正该回应以消除有害内容。
- 将(原始请求,修正后的无害回应)作为新的训练数据进行指令微调。
- 潜在风险:所有关于人类偏好建模、奖励模型不可靠性的问题,在RLAIF中可能被放大数倍,安全性未知。
- 模型自我改进 (Self-Improvement):
- 让模型(例如,通过“Let's think step by step”)生成推理过程,然后将这些(模型自己生成的)推理过程作为数据,反过来微调模型自身。
- 前景不明:能走多远尚不清楚。
- 根本性局限:
- 诸如幻觉 (Hallucination)、模型规模和计算资源消耗等问题,可能无法仅通过RLHF解决。
- 越狱 (Jailbreaking):如果已知越狱方法,可以通过AI或人类反馈来修补。但难以预料所有可能的越狱方式,存在“攻击者优势”。
- 总结性思考:尽管通过指令微调和RLHF等技术,LLMs取得了巨大进步,但从当前阶段到实现通用人工智能(AGI)仍面临诸多根本性挑战和局限。这是一个激动人心的研究领域,但需审慎对待。
核心观点总结
讲座系统性地梳理了大型语言模型从基础的词语预测到能够理解复杂指令并与人类偏好对齐的演进路径。核心技术包括利用提示工程(零样本和少样本学习,特别是链式思维提示)来引导模型行为,通过指令微调使其泛化理解多种任务指令,以及运用基于人类反馈的强化学习(RLHF)来更精细地调整模型输出以符合人类的主观偏好。虽然这些方法(如在GPT系列、Flan T5、InstructGPT、ChatGPT中的应用)已展现出强大能力,但它们也伴随着数据成本高昂、奖励机制设计困难、模型可能产生幻觉或被“奖励黑客”、以及深层次的AI对齐等挑战。未来的探索方向包括利用AI自身进行反馈学习,但这同样带来了新的复杂性和风险。