Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty

This lecture focuses on applying language models to reasoning. It first defines reasoning as the process of using facts and logic to arrive at an answer, distinguishes the three main types (deductive, inductive, and abductive reasoning), and also touches on formal versus informal reasoning, noting that the lecture concentrates on multi-step informal deductive reasoning.

It then explores several ways of eliciting reasoning from large language models through prompting. These include Chain-of-Thought (CoT) prompting, which guides the model to generate reasoning steps before giving an answer, either via in-context examples or simple instructions such as "let's think step by step". The Self-Consistency method samples multiple reasoning paths and answers for the same question and picks the most common answer to improve accuracy, outperforming simple model ensembling. For multi-step reasoning, Least-to-Most prompting decomposes a complex problem into sub-problems that the model solves step by step before combining the answers, showing promise for complex reasoning, although whether its advantage is fundamental remains to be verified.

Finally, the lecture discusses strategies beyond prompting, such as transferring reasoning ability to smaller language models via distillation. The Orca model is given as an example: a smaller LLaMA model is fine-tuned to imitate the explanations and reasoning produced by GPT-4. The training data is built by combining instructions from datasets such as Flan V2 with detailed responses generated by GPT-4 under specific system prompts (for example, asking for step-by-step explanations). The lecture emphasizes that most research in this area is only three or four years old and many questions remain open.

Media Details

Upload date
2025-05-16 21:03
Source
https://www.youtube.com/watch?v=I0tj4Y7xaOQ
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: Okay, let's just get started. Welcome to lecture 14, everyone. Hope you've been doing well and managing all of the various deadlines. So today we'll be looking at two interesting applications of language models. In the first half, I'll be talking about using language models to reason in domains like math and geometry, doing things like spatial reasoning. And then in the second half of the lecture, I'll be talking about how you can use language models to take actions in grounded environments. Okay, so a little bit of a disclaimer. A lot of the content today covers research that was done in the last three or four years. So there are plenty of questions, plenty of unanswered questions, and not a lot of answers. So maybe we can have more of a discussion around these topics. Okay? Okay, so let's get started with reasoning. So experts like to start a lecture on reasoning by really talking about what the various kinds of reasoning are, so I'm going to do that here, okay? But at a high level, it's really about using facts and logic to arrive at an answer. More concretely, there are three distinct categories of reasoning that we can talk about. The first one, which is probably the one most of you are familiar with, is deductive reasoning, where we go from rules of logic along with a premise to come up with a firm conclusion. So an example of that could be that we have the sentences "all mammals have kidneys" and "all whales are mammals", and then we can come up with a conclusion, "all whales have kidneys", and we could do multiple such steps of reasoning. Okay? A second form of reasoning is inductive, where given observations, we derive conclusions. Okay? So maybe we've learned from experience that every time we see a creature with wings, it's usually a bird. Let's say we observe a creature with wings, and using our experience, we can come up with the conclusion that the creature is likely to be a bird. That form of reasoning is inductive, okay? And finally, we have abductive reasoning, where we are given an observation, and then we start drawing possible explanations. Okay? So maybe you see a car that cannot start and there's a puddle of liquid under the engine, and then you start drawing inferences about the situation. One of them could be that the car has a leak in the radiator. Okay, all right. And apart from that taxonomy, we can also think of reasoning in formal and informal terms, where formal reasoning involves using axioms and rules of formal logic to derive truth conditions. There's also informal reasoning, which is what you and I probably do every day. Here we just reason about everyday situations and use common sense to derive conclusions. For most of the lecture, when I say reasoning, I will mean informal deductive reasoning. And it's often going to involve multiple steps. Okay, so let's come back to language models. We've learned in lectures nine, ten, and eleven that large language models are really, really good at coming up with plausible continuations of text that reflect human preferences and constraints. Today, we try to answer whether they can also reason. Okay? So one of the most basic ways we can try to answer this question is via prompting. Okay? And we've probably already seen this. There is this popular method called chain-of-thought prompting, where you get a language model to produce reasoning steps before producing an answer.
And we could do this by providing some in-context examples with explicit reasoning steps that the language model can then mimic at test time. Okay? So that's chain-of-thought prompting. Another rather surprising property of language models is that sometimes you don't even have to show them these in-context examples; you could just prompt them with the sentence "let's think step by step", and you get these reasoning rationales before they produce an answer. Okay? So that's pretty simple, but let's keep going. Okay? So another popular way to prompt language models to do reasoning is via self-consistency. Here, what we do is, instead of greedily sampling a rationale followed by an answer, we are going to sample multiple reasoning paths and correspondingly multiple answers. Okay, so in the figure on the right, we have a question. What you would normally do with chain-of-thought prompting is greedily decode a rationale and then, conditioned on the rationale, generate an answer. With self-consistency, we are going to sample multiple times. So we sample multiple rationales, they all lead to multiple answers, and then we pick the one that is the most common, okay? The idea being that if an answer keeps appearing for multiple rationales, if it's an answer that a majority of the rationales agree on, then it's more likely to be correct. And the authors of self-consistency find that on a variety of mathematical reasoning tasks, if you add this simple idea of self-consistency, where you sample multiple times and do majority voting, that improves performance pretty drastically over standard chain of thought. And interestingly, when I saw this result the first time, I thought, okay, this is just like ensembling, which we learned in CS229. The idea is, if you want to boost the performance of your system, you produce, say, ten classifiers with different random seeds, each produces a classification decision, and you do majority voting. But it turns out that it's doing maybe a little bit more than just simple ensembling. The authors also compared an ensembling approach where it's the same language model with multiple different prompts, and then you do majority voting there. And it turns out that self-consistency is better than just simple ensembling. Okay. So earlier today, I said that I'll be talking about multi-step reasoning. So far we've looked at math problems and prompting, but not necessarily multi-step reasoning. One of the main aspects of multi-step reasoning is that it involves breaking down a large problem into several subparts, answering each of the subparts, and then combining everything into a solution. Okay. So this kind of decomposition strategy was integrated into another prompting method called least-to-most prompting. And the idea behind least-to-most prompting is, like I said, given a question, we're going to first break it down into sub-questions, as shown here. Then, given these sub-questions, the language model will answer each of the sub-questions, and then, conditioned on its answers to the sub-questions, it's going to generate the final answer. And this is what it looks like for a math reasoning problem. So in standard chain-of-thought prompting, you would have a question followed by a rationale
and the answer. With least-to-most prompting, which is this decomposition strategy, you take the question and then, instead of directly producing a rationale, you ask the language model to break it down into sub-problems. So you have these two different sub-problems, you answer both of those sub-problems, and then you condition your final answer on the answers to those sub-problems. Okay, so that's just a prompting method, right? One interesting experiment from least-to-most prompting showed that you can sometimes generalize from a small number of reasoning steps to a much larger number of reasoning steps. So this math word problem here takes two reasoning steps, and if we show this prompt to the language model as an in-context example, we see that it continues to generalize even on examples that require more than five steps of reasoning, in a way that's much better than standard chain of thought. But it's not entirely clear if structuring things in this manner is really fundamental. One of the other results they reported was that with enough prompt engineering, the rows corresponding to the best normal chain of thought are on par with least-to-most prompting. But it's an interesting idea, trying to break down problems into sub-problems, solving the sub-problems, and then building up a solution based on your answers to the sub-problems. Okay. So all of this was different prompting methods to get reasoning behavior out of language models. Can we do something more? One of the things that we might be interested in is, instead of trying to get really large language models to do reasoning, maybe we want to somehow get this kind of reasoning behavior in a smaller language model. And one popular approach for doing that is distillation, where maybe you want to fine-tune a smaller LLaMA model by teaching it to imitate a larger model. And so that's what we're going to look at now. Okay? So this model is called Orca. At a high level, Orca is going to fine-tune a smaller 13-billion-parameter LLaMA language model on explanations produced by GPT-4. And constructing this data is pretty simple; it has these three steps. The first step is that we get a wide variety of instructions from the Flan V2 collection. Okay, so Flan V2 is basically a dataset; it aggregates multiple datasets into one collection, and it consists of instructions paired with questions and answers. I'll show an example of this in a moment. Then we're going to prompt GPT-4 or ChatGPT with these instructions along with a system message. And the objective of the system message is to get ChatGPT or GPT-4 to produce an informative explanation along with the answer. So here we have a question about simple data processing, about calculating the median. And there's a system instruction that says, please justify your steps and answer step by step. And in producing its output, the model provides a fairly detailed explanation of how it got to the answer. What Orca is going to do is use precisely this explanation to fine-tune a smaller model. So that's what's going to happen: once we have these explanations, we are going to fine-tune a much smaller 13-billion-parameter LLaMA model on them. Okay. So far, we've looked at math reasoning and grade-school math problems. Let's turn to a different benchmark for reasoning. We're going to look at BIG-Bench Hard.
And this is another dataset for multi-step reasoning. Let's look at some examples from BIG-Bench Hard. It consists of multiple different subtasks; there's a total of 23 different subtasks, and I'm going to show a few examples. One of them is evaluating boolean expressions. The question is some expression built out of and, or, not, true, and false, and you basically have to evaluate this boolean expression. With chain of thought, the model can evaluate each of the sub-expressions and get to the final answer. Another example of a task from BIG-Bench Hard is date understanding. Sorry, this is date understanding, not data understanding. So the question is: tomorrow is a given date; what is the date one year ago from today, in a given format? It's paired with some options. And again, the model can think step by step, following basic chain of thought, and then come up with an answer. So this is the flavor of tasks in BIG-Bench Hard. Most of these involve multi-step reasoning. They're fairly synthetic, but also reasonably hard for language models. Okay. Another example is geometric shapes. And this one is pretty surprising, that language models can do anything here. You're given an SVG path element, and I have no idea what this renders as, but the question is, just given the SVG, what shape are you going to get, okay? There's a bunch of options. And then again, the model, prompted with "let's think step by step", will produce some answer. We don't know if it's correct, but it's going to produce some answer. Okay? So it's basically this dataset covering different kinds of reasoning: spatial reasoning, date understanding, evaluating booleans. And it's multiple choice, so it's easier to get an accuracy number. So yeah, it covers a wide variety of different tasks. On the left, we have performance from really large language models. This is zero-shot chain of thought with just the prompt "let's think step by step". GPT-4 has some potential contamination issues with BIG-Bench Hard, so maybe we can ignore that column. Vicuna, I think a few months ago, was state of the art as an instruction-tuned LLaMA 13B model. And Orca is again a LLaMA 13B that's fine-tuned specifically on this explanation data, where you have instructions and then you have explanations from ChatGPT or GPT-4, and you fine-tune on that. And we see that overall it outperforms ChatGPT, maybe because it's specialized to just these reasoning problems, and it outperforms Vicuna, which was not trained on these really extensive explanations. So that's one way you can get a smaller language model to display some kind of reasoning behavior. Okay? So this was all great, and we're very happy that you can just generate rationales from a big LM and then fine-tune a smaller language model on that. But then someone could ask: why not just fine-tune the big language model on its own rationales, right? So that's also been explored, and there are a bunch of different methods that do this. I'm going to talk about one of them, called reinforced self-training, or ReST. It's going to alternate between two stages. In the first stage, given a reasoning problem, and perhaps the prompt "let's think step by step", we're going to have the language model generate multiple rationales, and then I'm going to filter these rationales based on whether they give me the correct answer or not. Okay.
So think about algebra word problems: someone has three apples, someone else has four apples. You generate a rationale, and if the answer comes out to be seven, you keep that rationale. If the answer is twelve, you leave that rationale out. And then I'm going to do an update step where I take the rationales that I filtered in the first stage and fine-tune the language model on them. And then I can do this iteratively. Now I have an updated language model, I can get hopefully better rationales, then update the language model on the better rationales to get an even better language model, and I can keep doing that. Okay? And the results are promising, but what we find is, on GSM8K, which is this grade-school math dataset of algebraic word problems, as you increase the number of iterations of self-training, we see a slight improvement in performance, and then it starts degrading. MATH is another dataset that again focuses on multi-step reasoning, covering math problems. And on this dataset, we see that as we do more iterations of this reinforced self-training paradigm, we see an improvement in accuracy. The numbers in orange here are a much larger PaLM model; the numbers in blue are a smaller model. And the dashed lines represent what you get if you do supervised fine-tuning on human-provided rationales. So one of the promising things about this approach is that when you do multiple iterations of self-training on your own rationales, you can outperform human-generated rationales. And that is exemplified again in this graph, where the blue bar represents accuracy when you take the PaLM model and do supervised fine-tuning on all human-provided rationales. Orange is if you fine-tune on one human-written rationale per training example, okay? And in green is what you get if you fine-tune on one rationale chosen at random per question, which is generated by the model. So it's controlling for the number of rationales, and we see that it outperforms human-provided rationales. And then if you do the full multi-step iterative procedure where you keep improving the model, we see, again, a boost in performance. So that's super promising. But let's start revisiting the question that we asked in the beginning about reasoning in language models. Okay. So one way of answering that question is we can apply all these methods and look at benchmarks. But maybe the way to answer the question correctly is to be more systematic, come up with counterfactual tasks, and be very careful about possible data contamination. And I'm going to show some results around that. So we started the lecture with chain of thought. And maybe the first question to ask is: are the rationales that the model produces with chain of thought faithful? What I mean by faithful is, maybe the model produces some rationale and then it produces an answer, but maybe the answer does not even depend on the rationale it produced, right? So maybe the question was, Tom has three apples and Jerry has four apples. And the rationale it produced was: okay, Tom has three apples, Jerry has four, three plus four is seven. So the answer is 25. In a case like that, you'd say that the model was not faithful to its rationale.
And so what we see in this plot is a very careful experiment where on the x-axis we have the number of reasoning samples. So the setup is something like this. For every question, the model produces a rationale, and a rationale here is multiple sentences. What we're going to do is force the model to early-exit from its rationale and just force it to produce an answer. Okay? So if it produced four rationale sentences, I can early-exit right after the first sentence and ask it to produce an answer, I can exit after the second sentence and ask it to produce an answer, and so on. And what I'm going to plot on the y-axis is the model's accuracy after early-exiting in this procedure. So let's say that I early-exited after just one sentence and the model produced exactly the same answer that it would have if it had seen all four sentences in its rationale; then maybe we can conclude that the reasoning is not faithful. It doesn't matter if the model sees the full rationale or just the first sentence. And if you take that to the extreme, maybe you terminate even without any rationale and it produces the same answer. So the results here are somewhat mixed, but we see that there are enough datasets where it doesn't matter whether the model sees the full rationale before answering or whether you early-exit; you kind of get the same answer, which means that sometimes these rationales may be post-hoc explanations of the model's answer. Okay? Another experiment that tries to answer this exact same question is: you can take these rationales and start corrupting them. So maybe your rationale was length four, and I generate the first step, the second step, and for the third step, I just corrupt it. Okay? Then the fourth step, and then I ask the model to generate an answer. So if it turns out that no matter how much I corrupt my rationale, the model produces the same answer, then I can conclude that, again, the answer did not really depend on my rationale. So on the x-axis, we are looking at the percentage of reasoning steps before I add a mistake to the rationale. Okay. So what you should see is a strictly increasing trend, where if I add a mistake after the very first step, that's probably going to change the answer a lot, and if I add a mistake after the last step, that maybe doesn't change the answer all that much. But again, we find that for some datasets, it so happens that you can add a mistake in the first sentence of your rationale, and the answer is not going to change all that much. And so that's also an indicator that maybe these rationales are post-hoc explanations of the model's behavior. So yeah, there are a lot of lines here, so if anyone has questions... I see a few blank faces in the audience. Okay, so let's keep moving. Okay. So that was about whether chain of thought expresses reasoning that the model is faithful to. Another question you could ask is: what if I changed my setting a little bit? Right? So let's say I observed that my model is able to do arithmetic in base ten, so it's able to answer something like 12 plus 14. Does that mean that my model knows how to do arithmetic? Or maybe this exact same example was just present in the training data.
So one way you could test for this is by creating counterfactuals, which, based on our understanding of the data, you expect to not be present that frequently in the training data. So instead of doing base-ten addition, you could do addition in base nine. And if the model has the same accuracy in base nine, then you can conclude that maybe this model has actually understood how to do addition. Similarly for logic: maybe the reason why the model is so good at solving logic problems is because it's seen something very similar in its training data. So what if I construct a world where, I don't know, corgis are reptiles? Can it still do this logic problem? Okay. And what we find is that there's sometimes a pretty significant drop when you move from the original setting to the counterfactual. There was a question about why base nine counts as a counterfactual. So it's a counterfactual in the sense that the authors claim that base-ten addition is frequently observed in training data, but very few people do base-nine addition, so there are going to be much fewer examples of it in the training data. So it's more of a counterfactual distribution; you could also call it out of distribution, for sure. And yeah, from the results, what we see is that there's this drop in performance even for very simple logic problems that don't involve multiple steps of reasoning. There's a significant drop in performance, which maybe suggests that there's not that much reasoning and more memorization. Yeah. So we could keep going with this paradigm of changing the problem setting so that it starts looking out of distribution relative to the training corpus. And this is exactly what was done in this paper that looked at analogical reasoning. So basically, the setup is something like this: I'm going to show certain examples of string transformations, and I'm going to ask the model to generalize to new examples. Okay, so in this extend-the-sequence problem, the input is a b c d and the output is a b c d e, and then given i j k l, the model has to produce i j k l m, and so on. Now, the way you can make this into a counterfactual, something that is out of distribution, is maybe you change what the extend-the-sequence task is. So now instead of outputting a b c d e, maybe the model has to output a b c d f. Okay? So instead of outputting the next character, it has to output the one after the next, and so on. The other kind of counterfactual you could add is, instead of operating on the standard alphabet, you could modify the alphabet completely. So instead of the alphabet being a b c d, maybe it starts at x, y, and so on. So we find two things. The first thing we find is that there's a significant drop in performance as we go from the standard analogical reasoning problem to one of these counterfactuals, where we either change the alphabet or change the description of the task so that it becomes slightly unnatural. On the other hand, the authors also did this exact same experiment on human subjects, where they find very little decrease in performance. Okay, so overall, what this result suggests is: maybe there's some reasoning, maybe there's some memorization, but there's nothing systematic. Okay? So again, this is all emerging. So maybe someone will find that if you change your prompt a little bit, then models can do reasoning, but this is kind of the current lay of the land.
Okay, so that was the reasoning module of the lecture. I'm going to now switch gears and talk about language model agents. And this is related to reasoning in the sense that reasoning involves these multi-step inferences where, given some facts, you have to arrive at completely new conclusions. With agents, what we'll see is that there's some high-level objective a model has to accomplish, and it has to reason about post-conditions, object affordances, and uncertainty in the world to carry out a sequence of steps. So let's start with some terminology. Okay, so we have our agent on the right; that's going to be some neural network. And then we have an environment, and I'll give some examples of what these environments could be. The agent receives an observation from its environment, and based on the observation, it issues an action, okay? And along with that, it receives this second variable, g, and g represents a language instruction, okay? There are many names for this setting and these models: digital agent, language-conditioned policy, or instruction-following agent. Some examples of environments: maybe it's a web browser in a browsing environment where the objective is to book a flight from San Francisco to New York. And the observation could either be the raw pixels that the model sees, or it could be the HTML DOM representation. And the action space, if you're looking at these web environments, could be typing into specific web elements, clicking on web elements, moving your mouse to a certain web element to interact with it, and so on. And yeah, this has a vast number of applications. I don't think I can cover all of them, but we can look at some. There are obviously digital assistants, and I'm not going to say the names because I know people's devices might start going off, but you can give them natural language commands to set an alarm, set reminders, and so on. You could also do natural language programming, where, given natural language descriptions, you get a model to write Python code. Another example could be UI automation, where maybe you want to do automated testing of UI elements. So instead of having a human verify whether a UI element works, maybe you can get a model to execute actions corresponding to a given instruction. Or it could be something more user-facing, where, given some kind of complex environment like Spotify, you could ask an agent to play some songs. And then finally, there is this emerging application where we want to add tools or plugins to language models so that they can control various different applications. Okay. So before we look at how we can use language models to do instruction following, I think it's very helpful to look at how this was done before language models. There were basically three main ideas. Sometimes the right thing to do was to collect examples of utterances paired with logical forms. Logical forms could be some kind of executable representation that you could just execute against either a knowledge graph or a database to get an answer. So maybe you have a query like "what states border Texas?", and then there exists some program-like description that you could execute against a knowledge graph to get an answer, or a list here.
And idea number one that people looked at was to treat this almost like machine translation, right? So you have a source language, which is English commands, and you have a target language, which is these meaning representations or logical forms. And then you could apply the same machinery from assignment three to build a natural language interface here. So you directly maximize the probability of a sequence of actions given a goal or a command. Idea number two was something a little bit more complex. Here you have instructions paired with actions, but instead of directly mapping instructions to actions, I'm going to infer an executable plan from these instructions and action sequences, train a model to go from instructions to these plans, and then define a very rich execution model that's going to directly execute these plans. The advantage of this is that maybe there are more high-level decisions you could encode in your plan, which would be harder to get into the model if you were to just train it to produce the action trajectories directly. And I have an example of a system like that from 2011, which was basically an agent that could navigate in grounded environments. The idea was something like this: you took an instruction and obtained a plan, and then you would train a semantic parser, which is basically this kind of machine translation system that would convert command sequences into this plan. And once that's trained, at test time, given a completely new instruction, you would run the semantic parser, get this plan, and then execute it in the execution model. Okay. And I have an example of an instruction and a plan from this 2011 system. The third idea, which is probably the first one that comes to mind if you see a setting like this, is to use reinforcement learning directly. And what people did there was to use RL to directly map instructions into actions. So I'm going to learn a policy that outputs actions that maximize some reward, okay, conditioned on my natural language instruction and the observation. And this reward could be sparse, which means I carry out the entire task and then my environment tells me if I achieved the task or not. Or it could be something that I obtain after each step: I take an action, and then the environment tells me if this action completed some percentage of my task or not. And on the top, I've included an example of a system from 2009 that did this for automated Windows debugging. So you have some natural language instruction to click some UI elements, and that gets mapped into API commands that the model executes one after the other. Okay. So these are basically the three main ideas that people had before language models. You would either train semantic parsers, or infer these plans from instruction-trajectory pairs and then learn to directly model plans with an execution model that can execute them, or you would do reinforcement learning if you had a reward signal. So how do we do things in 2024? There are a few ways to think about this. I think maybe the most instructive is to think about what we are trying to achieve, right? We are trying to model trajectories, that is, sequences of actions, conditioned on some goal. Okay, so I want my model to book a flight from San Francisco to New York.
And I want it to produce a trajectory of maybe typing and clicking actions. So let's look at how that factorizes. The probability of a trajectory conditioned on a goal or an instruction is just the probability of the state, action, next state, and so on, conditioned on the goal. And you can factorize that into two terms. The first term is the transition dynamics of the environment: if I take a certain action in a given state, how is my state going to change? And the second object is the agent policy, which is: given my goal and the trajectory so far, what is the next action I should be taking? Okay. And then people quickly realized that you could just treat this as a generative problem. So you could treat the problem of decision-making in environments as a generative trajectory modeling problem. What I have in the top right is an example of a transformer that just takes the history of actions it's taken so far, the current state, and some indication of what target it should achieve, here based on reward, though it could be a natural language string, and it's just trained to predict the next action. And you could just train an autoregressive language model to do this. It turned out that this worked very well in an offline RL setting. So you resolve the input tokens at each step into one output action: you predict an action, execute it, append it to your trajectory, and then predict the next action, and so on. And it turned out that this worked really well. And so, instead of getting these latent plans and training semantic parsers, or trying to do reinforcement learning, we started using language models as policies. And a simple way to do all of that is to prompt a language model in a loop. Okay, so we're going to specify the action space in text. So this is a simple language model agent; this is not going to work at all, but it's probably illustrative of how agents can be built now. You provide an action space in text: maybe it's a digital environment and it can type characters, click, or move the mouse somewhere. You provide it an instruction, and you provide it the sequence of actions and observations it's received so far. Okay? And then, conditioned on all that, you ask it to predict the next action. And there's nothing deep going on here; this is just chain-of-thought prompting in a loop. Okay? But the hope is that because we reduce the problem of decision-making to just autoregressive modeling, this could work, okay? And indeed, a slightly more complex version of this can work in some environments. Okay. So now I'm going to give a little flavor of what different environments look like for evaluating language models as agents. The simplest environment that people consider is MiniWoB. This is a sandboxed environment that evaluates basic browser interactions: maybe on a mini Twitter environment, can you get a language model to retweet a given tweet? Given a simulated email client, can the model forward someone's email? Can it compose an email? Can it click on certain buttons or not? It's not at all real-world, so it's not real websites, and it's a relatively short horizon.
Given any instruction, most tasks can be accomplished in under three actions. But zero-shot performance of even the best language models is still far from perfect, even on this very simple benchmark. A second, slightly more real-world benchmark is WebArena. This is also a sandboxed environment, but it's a pretty close approximation of real websites, spanning e-commerce (there's a website in WebArena that resembles Amazon) and social media (something that resembles Twitter). Additionally, there are utility tools like maps. So an instruction could require a model to open up a map application, find the shortest path from point A to point B, and use that in its later sequence of actions. And there's multi-tab browsing, like we commonly do. With MiniWoB, there's only one single tab, and with WebArena, which I think was the first environment that introduced this idea, you have multiple tabs, and the agent can switch between tabs. And again, we are going to evaluate functional correctness, which is whether the model gave the correct answer at the end, whether the sequence of steps it took gave the intended behavior, as opposed to whether it took a sequence of steps that a user had pre-programmed. Another popular dataset is WebLINX. WebLINX also has multi-tab browsing, and it has web interactions on real websites. So these are not sandboxed approximations of real websites; these are actual real websites. And it also introduced a new action where the agent could communicate with a user. So maybe there's some instruction, which is to, I don't know, reserve a movie or buy a movie ticket or something, and at some point the model has to request credit card information. So there is this additional action where a human could be involved in communicating with the agent. And this is not an environment, but just a collection of interactions. So you can't, for example, do any kind of exploration or online learning here, but you could definitely use it for evaluation. Okay. So this was just a taste of what some benchmarks look like for language model agents. So how are we going to train these models? Given that we're going to treat decision-making as causal language modeling, we're not going to use any of the ideas from the pre-LLM era. The standard practice is to do in-context learning with few-shot examples. In the few-shot examples, typically for any new kind of website or any new use case, you're going to get humans to perform those tasks and feed that into the language model's prompt as in-context demonstrations, which it can then use to solve similar-looking tasks on very similar websites. So obviously, this is not scalable. There are thousands of environments, and on some environments there are lots of different interactions that are possible. And so maybe there's something better that we can do than just getting humans to provide demonstrations for every new use case. And so we are going to use something we saw early on in the lecture, which was to use the language model to generate rationales and then fine-tune on them. Here we don't have rationales, but we can produce action trajectories, and then we're going to use those as supervision. Okay?
So the way that looks is something like this. Let's say I have some environment, say it's some MiniWoB environment, and I'm going to just get an agent that randomly explores the environment. So it just executes a random sequence of clicks and types and scrolling operations, and let's say it produces some trajectories. Now I'm going to use these trajectories and somehow filter them. So that was the idea from earlier: you're going to get a bunch of different outputs, and then we're going to filter them somehow. Here we're going to use a second language model, because we don't know what a good trajectory looks like. It's not like a math problem where you know the correct answer. We just had a language model interact with a website and generate trajectories, and we want to somehow filter out the good trajectories. So we're going to use a second model that will produce a description of these trajectories. And the idea here is that if you can get a model to produce a description of what the sequence of actions corresponds to, then maybe that's a good enough signal for a good trajectory. So maybe given the first trajectory, it guesses that the instruction was to book a flight from San Francisco to New York. For the second trajectory, it guesses the instruction was to set the date to some given date, and maybe it wasn't able to come up with any good instruction for the third trajectory. And then we are going to do something again that we saw earlier, which is to do this iteratively. So now we have a goal that we got for a trajectory, and I'm going to get the language model to condition its behavior on this goal. So the goal is to set the date to some given date. And now, instead of doing random exploration, the model is going to produce a sequence of actions that has a better correspondence with some natural language instruction. So it produces a trajectory based on that instruction. And then I'm going to use some coarse filter that's just going to look at correspondences between the instruction, the sequence of actions, and the states the language model visited, and use that to decide if the trajectory was a good trajectory for the instruction. And in this case, given the instruction, this seems like a pretty good trajectory for completing the task, and so we add it to a set of examples. Okay. But maybe sometimes things are not so good. So for that second instruction, the generated label was to book a flight from San Francisco to New York. Let's say we run that again through the language model, and it produces a second trajectory, and clearly this does not seem like a successful trajectory corresponding to booking a flight. So what do we do here? Maybe we can throw away this interaction, but interactions are pretty costly, specifically if you're looking at real websites, where each interaction could take a few milliseconds. So maybe we don't want to throw away this interaction. What we're going to do here is, again, invoke the relabeler to take the trajectory and assign it a new label. The model was not successful at accomplishing the task it set out to do, but it accomplished something, and we're going to come up with a best guess of what that was using a second language model. And let's say it says that, okay, maybe the instruction you accomplished instead was to set the origin to SFO and the destination to New York City, okay? And so that gets fed back into the language model.
And we're going to keep doing this iteratively until our filter says that this is a good instruction-trajectory pair. Okay? So we have the same idea of using a language model to generate outputs, plus some iterative procedure that gives us a good set of training examples. Overall, the method looks something like this: you have some environment, we use an unconditioned language model to just randomly explore the environment and generate a set of trajectories, and then we convert these trajectories into synthetic training data by iteratively converting trajectories into natural language descriptions, and then taking natural language descriptions and converting them into even better trajectories, and so on. And once we have this collection of synthetic examples, there are two things we could do. One, we could fine-tune using this data; but the simplest thing we could do is repeat the earlier paradigm and replace human-provided demonstrations in the context with these synthetic demonstrations. And we find a reasonable boost in performance, a 13-point improvement on the MiniWoB benchmark. And again, even though MiniWoB is very, very simple, zero-shot performance for even the best language models is far from perfect. We also see an improvement on a second multi-step tool-use environment. But so far, we've only looked at text, right? Maybe for real-world applications, it's intractable to obtain the HTML for every environment and feed that into the language model's context. Sometimes there can be tens of thousands of DOM elements, plus the corresponding JavaScript, and inputting all of that into the language model's context could be intractable. And maybe that's also not the best way to show the state of the environment; maybe the best way is to directly show the pixels corresponding to the environment. So now we're going to look at some examples of vision-language models that people have used for building these agents. Okay. The first one we're going to look at is LLaVA. The idea here is, again, similar to Orca, which we looked at in the reasoning half of the lecture: we're going to use GPT-4 to generate, this time, both instructions and responses for textual descriptions of images. So maybe there's an image, and we're going to use metadata corresponding to that image to come up with a text description, feed that into GPT-4, and ask it to generate possible questions and responses. And then we are going to jointly fine-tune an image encoder, here CLIP, along with a text-only decoder, here Vicuna, which is a LLaMA model that is instruction-tuned. And through this joint fine-tuning, at the end we get this image encoder that can output language responses. And now we can ask questions about images, and maybe use that to directly input screenshots instead of HTML DOM elements. A second approach that looked at building joint image-language models, which people later adapted to agents, is Pix2Struct. The idea is, again, very similar: there's an image encoder and a text decoder. The image encoder takes the image, converts it into patches, assigns each patch a position ID, and runs that through a transformer, and then there's a decoder that will decode out some text. Okay. One of the new things that Pix2Struct introduced was a new pretraining task.
So for LLaVA, the pretraining was fairly simple: we use GPT-4 to just generate synthetic questions and responses based on textual descriptions of images. But there's only so far you can go with textual descriptions of images. What Pix2Struct did was to look at screenshots from websites, mask out parts of the screenshots, and then ask the transformer decoder to produce the HTML corresponding to the masked-out elements. So here there's this list that has corresponding HTML. One of the data points in Pix2Struct looks something like this: you might mask out, let's say, the first answer, corresponding to Python, and ask the model to produce the HTML corresponding to just the patch that was masked out. And so this seems like a more natural pretraining objective that can maybe lead to better interactions between image and text. And this was also adapted for building these multimodal agents. Okay. So at this point, I just want to highlight that this is really an emerging application. There's this huge prompting gap, as I like to call it. If you do not do extensive prompting, and if you do not use bespoke few-shot examples, where for every different environment you have a different set of few-shot examples, even the best language models are very, very far from perfect, even on very, very simple tasks like MiniWoB, where the goal is just to click on certain elements or respond to someone's email, tasks that take like five actions. And then, even for something as simple as MiniWoB, even after doing extensive prompting and few-shot examples, there is this drop in performance as you go from the simplest tasks that involve mapping an instruction into a single action to mapping an instruction into maybe five or ten actions. So long-horizon planning is still very, very hard, even on these very simple benchmarks. And if you look at something more complex, like WebArena, which tries to approximate real websites, has multi-tab browsing, and has external tools that the agent can use, there's just a huge difference between human-level task success rate and what the best models get, even after prompting, even with few-shot examples. And then the kinds of errors models make are also pretty weird. One of the examples from WebLINX: the task was to just open Google Translate and sign in using credentials, an email and a password. And what GPT-4V did was, instead of typing in the password, it just typed the email into the password field, and it just couldn't recover from this error. So it tried to sign in, there was an error, it tried to type in the email again, and so on. And I'm sure with extensive prompting you can fix this, but maybe that's beside the point, right? And then there was a different example where the model had to issue a search, and instead of issuing the search with the correct term, it repeated the same term like three times, and obviously that's not going to return any results. So there's a lot of room for improvement, as you can see, and there's lots to be done in this space. Okay. So I'm going to recap and take any questions. We looked at two different things today. We looked at reasoning in language models. We saw that there are a few ways you can get reasoning-like behavior in language models. You can prompt them in various ways; the simplest example of that is chain-of-thought prompting.
You can do chain-of-thought prompting but generate multiple rationales, try to reconcile them, and pick the answer that was most frequent. You can do problem decomposition in your prompt, asking the model to explicitly decompose a problem into multiple steps before answering. So that was all prompting. You could also try to train specialized small language models for reasoning by generating rationales from a big language model and then fine-tuning a smaller language model on those rationales. Or, instead of fine-tuning a smaller language model on rationales from a big language model, you could just fine-tune the big language model on its own rationales and keep doing this iteratively. And we saw that sometimes, if you do multiple iterations of that, performance can keep improving and can even outperform human-provided rationales. But on the flip side, we saw that while there are some initial reasons to be optimistic, if we go and do counterfactual evaluation, it's not clear if the models are good because of reasoning, or if they're good because all of these problems were, in some shape or form, already in the training data. And we saw that with counterfactual evaluation. In the second part, we looked at language model agents. We talked about the historical perspective through which people built grounded agents, and then we saw that you can recast the problem of decision-making as causal language modeling. We looked at various ways through which people have modeled decision-making with language models; most of it involves prompting and in-context learning. Then we looked at a method, similar to what we saw in the first module, for generating synthetic demonstrations; here we looked at doing exploration with the same kind of iterative relabeling. Most of the language models we looked at today were text-only; we saw some examples of language models that can take both text and visual input. And then we saw that the benchmarks are very, very challenging. Models make kind of trivial mistakes. There's a huge gap between human performance and where models are, and a lot of room for driving further improvement. And maybe some of you are doing it for your projects. Thank you.

Latest Summary (Detailed Summary)

Generated on 2025-05-16 21:29

Title: Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty
Description: 3,580 views
For more information about Stanford's online Artificial Intelligence programs visit: https://stanford.io/ai

This lecture covers:
1. Reasoning and Agents
2. Reasoning in Language Models
3. Language Model Agents

Executive Summary

This lecture (Stanford CS224N, Lecture 14, Spring 2024), given by Shikhar Murty, examines two applications of language models (LMs): reasoning and acting as agents, both rapidly evolving research areas of the last three or four years.

On reasoning, the lecture first introduces deductive, inductive, and abductive reasoning, and notes that the discussion focuses on informal deductive reasoning. Language models exhibit reasoning behavior through several prompting methods: Chain-of-Thought (CoT) makes the model think step by step; Self-Consistency samples multiple reasoning paths and majority-votes over the answers to improve accuracy; Least-to-Most prompting decomposes a problem into sub-problems. Beyond prompting, reasoning can be trained into models via distillation (as in Orca, which fine-tunes a small LLaMA model on GPT-4 explanations) or by fine-tuning a model on its own generated rationales (as in Reinforced Self-Training, ReST, which iteratively generates, filters, and learns from rationales); the latter can even outperform supervision from human-provided rationales on some tasks. However, more rigorous evaluations (faithfulness tests of rationales, counterfactual evaluations such as arithmetic in other bases or logic problems with altered premises, and analogical reasoning tests) show that performance drops sharply on distributions rare in the training data, suggesting that current abilities may rely more on memorization than on systematic reasoning, and that generated rationales are sometimes more like "post-hoc explanations".

On language model agents, the lecture reviews historical approaches (semantic parsing, inferring executable plans, reinforcement learning) and notes that the current trend is to treat decision-making as generative trajectory modeling, predicting action sequences with an autoregressive model. Agents interact with environments (benchmarks such as MiniWoB, WebArena, and WebLINX), receiving observations and executing actions to achieve goals specified by language instructions. Training approaches include in-context learning with a few human demonstrations or, more scalably, generating synthetic demonstrations (the model explores the environment, another model labels the trajectories, and the pairs are iteratively refined). Vision-language models (VLMs such as LLaVA and Pix2Struct) are also beginning to be used to handle visual input. Despite the promise, there is currently a large "prompting gap": even on simple tasks, models perform far from perfectly without heavy task-specific prompting and few-shot demonstrations, and the gap to human-level performance is especially large for long-horizon planning and avoiding simple errors.

Reasoning in Language Models

The lecture first categorizes reasoning into deductive reasoning (deriving conclusions from rules and premises), inductive reasoning (deriving conclusions from observations), and abductive reasoning (finding plausible explanations for an observation). It also distinguishes formal reasoning (using axioms and formal logic) from informal reasoning (common-sense reasoning about everyday situations).

Speaker 1: "For most of the lecture, when I say reasoning, I will mean informal deductive reasoning. And it's often going to involve multiple steps."

Eliciting Reasoning via Prompting

  • Chain-of-Thought Prompting (CoT):
    • Guides the language model to generate intermediate reasoning steps before giving an answer.
    • Can be implemented by providing in-context examples with explicit reasoning steps, or simply with an instruction like "Let's think step by step".
  • Self-Consistency:
    • For a given question, the model samples multiple different reasoning paths (rationales) and correspondingly multiple answers (a minimal sketch follows this list).
    • The most frequent answer is chosen, based on the assumption that an answer a majority of rationales agree on is more likely to be correct.
    • Speaker 1 notes that this substantially outperforms standard CoT on a variety of mathematical reasoning tasks, and that it is "doing maybe a little bit more than just simple ensembling".
  • Least-to-Most Prompting:
    • The core idea is to decompose a complex problem into a sequence of simpler sub-problems.
    • The model first decomposes the original problem, then answers the sub-problems one by one, and generates the final answer conditioned on the sub-answers.
    • Experiments show that this can sometimes generalize from examples with few reasoning steps to problems requiring many more steps. The lecturer also notes, however, that "with enough prompt engineering, the best normal chain-of-thought prompt is on par with least-to-most prompting".
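
To make the self-consistency idea concrete, here is a minimal sketch in Python. The `sample_cot_answer` helper is hypothetical: it stands in for whatever API call samples one chain-of-thought completion from the model and extracts its final answer; it is not part of any specific library.

```python
from collections import Counter

def self_consistency_answer(question, sample_cot_answer, n_samples=20, temperature=0.7):
    """Sample several chain-of-thought completions and majority-vote the final answers.

    sample_cot_answer(question, temperature) is a hypothetical helper that returns
    (rationale, answer) for one sampled completion of the prompted model.
    """
    answers = []
    for _ in range(n_samples):
        _rationale, answer = sample_cot_answer(question, temperature=temperature)
        answers.append(answer)
    # Pick the answer that the largest number of sampled rationales agree on.
    return Counter(answers).most_common(1)[0][0]
```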

Training Language Models to Reason

  • Distillation: training small reasoning models (e.g., Orca)
    • Goal: transfer the reasoning ability of large language models into smaller models.
    • The Orca model fine-tunes a smaller 13B-parameter LLaMA model to imitate explanations generated by GPT-4.
    • Data construction process:
      1. Take instructions, questions, and answers from the Flan V2 collection.
      2. Prompt GPT-4 or ChatGPT with these instructions plus a specific system message so that it produces a detailed explanation along with the answer. For example, for a question about computing a median, the system message asks the model to justify its steps and answer step by step.
      3. Fine-tune the small LLaMA model on these GPT-4-generated explanations.
    • Benchmark (BIG-Bench Hard): a multi-step reasoning dataset with 23 subtasks, for example:
      • Boolean expression evaluation (expressions built out of and, or, not, true, and false)
      • Date understanding (e.g., "Tomorrow is [a given date]. What is the date one year ago from today, in [a given format]?")
      • Geometric shape recognition (identifying a shape from an SVG path element)
    • Results: Orca (LLaMA 13B fine-tuned on GPT-4 explanations) outperforms ChatGPT overall on BIG-Bench Hard, and also outperforms Vicuna (an instruction-tuned LLaMA 13B) that was not trained on such detailed explanations. The lecturer notes that GPT-4 may have data-contamination issues on BIG-Bench Hard.
  • Fine-tuning a large model on its own rationales (e.g., Reinforced Self-Training, ReST)
    • Idea: have the large language model learn from its own generated rationales that have been verified to be correct.
    • Process (a minimal sketch follows this list):
      1. Generate and filter: given a reasoning problem (e.g., a math word problem), have the model generate multiple rationales and keep only those that lead to the correct answer.
      2. Update: fine-tune the language model on the filtered rationales.
      3. Iterate: repeat the process, hoping the model generates increasingly better rationales and keeps improving.
    • Results:
      • On GSM8K (a grade-school math word-problem dataset), performance improves slightly over self-training iterations and then starts to degrade.
      • On MATH (another multi-step mathematical reasoning dataset), more ReST iterations keep improving accuracy.
      • A key finding: "when you do multiple iterations of self-training on your own rationales, you can outperform supervised fine-tuning on human-provided rationales."
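
A minimal sketch of the generate-filter-finetune loop described above, under stated assumptions: `sample_rationales`, `extract_answer`, and `finetune` are hypothetical callables standing in for the model's sampling, answer parsing, and training steps; none of these names come from the ReST paper.

```python
def rest_style_self_training(model, train_set, sample_rationales, extract_answer,
                             finetune, n_iterations=3, k=8):
    """One simplified view of reinforced self-training on reasoning problems.

    train_set: list of (question, gold_answer) pairs.
    All helper callables are hypothetical stand-ins, not an actual ReST API.
    """
    for _ in range(n_iterations):
        kept = []
        for question, gold_answer in train_set:
            # Generate: sample several rationales per question.
            for rationale in sample_rationales(model, question, k):
                # Filter: keep only rationales whose final answer is correct.
                if extract_answer(rationale) == gold_answer:
                    kept.append((question, rationale))
        # Update: fine-tune the same model on its own filtered rationales, then repeat.
        model = finetune(model, kept)
    return model
```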

A Critical Look at Language Models' Reasoning Abilities

The lecturer argues that more systematic methods and counterfactual tasks are needed to assess whether language models truly reason, and that one must watch out for data contamination.

  • Faithfulness of rationales: do the generated rationales reflect the model's actual "thought process", or are they merely post-hoc explanations?
    • Experiment 1 (early exit): force the model to output an answer at various points before the rationale is complete (a sketch follows this list). On some datasets, the answer is the same whether the model sees the full rationale or only part of it (even just the first sentence), suggesting the rationale may not be what drives the answer.
    • Experiment 2 (corrupted rationales): introduce an error partway through the rationale. On some datasets, the final answer barely changes even when the error is introduced at the very first step.
    • Conclusion: "these rationales may sometimes be post-hoc explanations of the model's answer."
  • Counterfactual evaluation: test generalization in settings that are rare or absent in the training data.
    • Arithmetic: compare base-ten addition with base-nine addition. Because base-nine addition is rare in training data, performance drops significantly.
    • Logic: construct premises that contradict common sense (e.g., "corgis are reptiles") and test whether the model can still perform the correct logical derivation. Performance likewise drops significantly.
    • These results suggest "more memorization than reasoning".
  • Analogical reasoning: using tasks such as string transformations.
    • When the alphabet is changed (e.g., starting at x, y) or the task description is made "unnatural" (e.g., output the character after the next one in the sequence), model performance drops sharply.
    • In contrast, humans show very little degradation on these counterfactual tasks.
  • Speaker 1's summary view: "maybe there's some reasoning, maybe there's some memorization, but there's nothing systematic... this is all emerging (research)."
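
A sketch of the early-exit faithfulness probe described above, assuming a hypothetical `answer_given_rationale(question, partial_rationale)` call that forces the model to answer after seeing only a truncated rationale; the procedure and metric are an illustration, not the exact protocol of the cited study.

```python
def early_exit_agreement(question, rationale_sentences, answer_given_rationale):
    """Check how often truncating the rationale changes the model's final answer.

    rationale_sentences: the model's full chain of thought, split into sentences.
    answer_given_rationale: hypothetical helper forcing an answer after a prefix.
    Returns the fraction of truncation points whose answer matches the full-rationale answer.
    """
    full_answer = answer_given_rationale(question, rationale_sentences)
    matches = 0
    for k in range(len(rationale_sentences) + 1):  # k = 0 means no rationale at all
        truncated_answer = answer_given_rationale(question, rationale_sentences[:k])
        if truncated_answer == full_answer:
            matches += 1
    # A value near 1.0 suggests the answer barely depends on the rationale,
    # i.e. the rationale may be a post-hoc explanation.
    return matches / (len(rationale_sentences) + 1)
```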

Language Model Agents

An agent is a neural network that receives observations from an environment and executes actions according to a language instruction (goal).

Terminology and Applications

  • Core concepts: agent, environment, observation, action, language instruction/goal (g).
  • Application scenarios
    • Digital assistants (setting alarms, reminders, etc.)
    • Natural language programming (generating code from natural language descriptions)
    • UI automation (automated testing of UI elements)
    • Controlling complex applications (e.g., asking an agent to play songs on Spotify)
    • Adding tools or plugins to language models so they can control different applications.

Building Agents Before Language Models

  1. Semantic parsing: translate natural language instructions (e.g., "what states border Texas?") into executable logical forms, which are then executed against a knowledge base or database. Treated much like machine translation.
  2. Inferring executable plans: infer an executable plan from instructions and action sequences, train a model to map instructions to plans, and have an execution model carry out the plans. The advantage is that plans can encode higher-level decisions.
    • A 2011 system is mentioned that could navigate in grounded environments by converting instructions into plans with a semantic parser and then executing them.
  3. Reinforcement learning (RL): directly learn a policy mapping instructions and observations to actions that maximize some reward, which can be sparse (given only after the whole task) or dense (given after each step); see the objective written out after this list.
    • A 2009 system for automated Windows debugging is mentioned.
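
Written out in standard RL notation (my notation, not taken from the 2009 system), the language-conditioned policy and its training objective look like this:

```latex
\pi_\theta(a_t \mid g,\, o_{1:t},\, a_{1:t-1}),
\qquad
\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=1}^{T} r(s_t, a_t, g) \right]
```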

Language-Model-Based Agents Today (2024)

The core idea is to treat decision-making as a generative trajectory modeling problem.

Speaker 1: "你可以把决策制定问题看作是一种生成式轨迹建模问题。"

  • The model must predict a sequence of actions, i.e., P(trajectory | goal), which factorizes into the environment's transition dynamics and the agent's policy (see the factorization above).
  • Autoregressive modeling: the agent can be trained as an autoregressive language model that predicts the next action given the action history, the current state, and the goal (a reward target or a natural language instruction).
  • Prompting in a loop (a minimal sketch follows this list)
    • Specify the action space in text (e.g., typing, clicking, moving the mouse).
    • Provide the instruction.
    • Provide the sequence of actions and observations so far.
    • Ask the language model to predict the next action.
    • The lecturer calls this "chain-of-thought prompting in a loop" and notes that "a slightly more complex version of this can work in some environments".
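
A minimal sketch of the "prompting in a loop" agent above. `llm` and `env` are hypothetical stand-ins for a text-completion call and a web environment, and the action format is invented for illustration; this is not any particular agent framework's API.

```python
ACTION_SPACE = "Available actions: CLICK(element), TYPE(element, text), SCROLL(direction), STOP()"

def run_agent(llm, env, instruction, max_steps=20):
    """Prompt a language model in a loop: show the action space, the goal, and the
    history of actions/observations, then ask for the next action.

    llm(prompt) -> str and env.reset()/env.step(action) -> observation are
    hypothetical interfaces.
    """
    history = []
    observation = env.reset()
    for _ in range(max_steps):
        prompt = (
            f"{ACTION_SPACE}\n"
            f"Instruction: {instruction}\n"
            + "".join(f"Observation: {o}\nAction: {a}\n" for o, a in history)
            + f"Observation: {observation}\nAction:"
        )
        action = llm(prompt).strip()      # the model predicts the next action as text
        if action.startswith("STOP"):
            break
        history.append((observation, action))
        observation = env.step(action)    # execute, append to the trajectory, repeat
    return history
```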

Agent Evaluation Benchmarks

  • MiniWoB (Mini World of Bits):
    • A sandboxed environment evaluating basic browser interactions (e.g., retweeting on a simulated Twitter, forwarding an email in a simulated email client).
    • Tasks are typically completed in under three actions (short horizon).
    • Characteristics: not real websites, single tab.
    • Even on this simple benchmark, "zero-shot performance of even the best language models is still far from perfect".
  • WebArena:
    • A sandboxed environment that closely approximates real websites, covering e-commerce (Amazon-like), social media (Twitter-like), and utility tools such as maps.
    • Supports multi-tab browsing; the agent can switch between tabs.
    • Evaluates functional correctness.
  • WebLINX:
    • Interactions on real websites, with multi-tab browsing.
    • Introduces a new action for the agent to communicate with the user (e.g., requesting credit card information).
    • It is a collection of interactions rather than an environment, so it cannot be used for online learning or exploration.

Training Language Model Agents

  • In-context learning with few-shot examples
    • The standard practice is to collect human demonstrations of the task for each new website or use case and include them as in-context examples in the language model's prompt.
    • Drawback: not scalable, because there are many environments and many possible interactions.
  • Generating synthetic demonstrations (a rough sketch follows this list)
    • Similar to the reasoning part of the lecture: have the model generate data and learn from it.
    • Process
      1. Exploration: let an unconditioned language model explore the environment at random and generate trajectories.
      2. Labeling: use a second language model to produce descriptions of these trajectories (i.e., inferred instructions/goals). For example, after the model clicks around randomly, the other model might infer that "the instruction was to book a flight from San Francisco to New York".
      3. Iterative refinement
        • Have the agent model condition on the inferred goal and generate a trajectory that better matches it.
        • Use a coarse filter to judge whether the (instruction, trajectory) pair is good.
        • If the trajectory fails to accomplish the intended goal, invoke the relabeler to assign the actually executed trajectory a new, more accurate instruction label (i.e., what the model actually did).
        • Repeat until the filter deems the (instruction, trajectory) pair good.
      4. Use: the collected synthetic (instruction, trajectory) pairs can be used to fine-tune the model or as demonstrations for in-context learning.
    • Result: a "13-point improvement" on the MiniWoB benchmark.
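
A rough sketch of the explore-relabel-filter loop described above. All of the callables (`explore_randomly`, `describe_trajectory`, `rollout_conditioned`, `looks_consistent`) are hypothetical stand-ins for the components the lecture describes, not functions from a specific codebase.

```python
def collect_synthetic_demos(env, explore_randomly, describe_trajectory,
                            rollout_conditioned, looks_consistent,
                            n_rollouts=50, max_refinements=3):
    """Turn unlabeled random exploration into (instruction, trajectory) training pairs.

    Hypothetical callables:
      explore_randomly(env)                 -> trajectory of random clicks/types/scrolls
      describe_trajectory(trajectory)       -> guessed instruction, or None (the relabeler)
      rollout_conditioned(env, instruction) -> trajectory from the goal-conditioned agent
      looks_consistent(instruction, traj)   -> bool, a coarse instruction/trajectory filter
    """
    demos = []
    for _ in range(n_rollouts):
        trajectory = explore_randomly(env)
        instruction = describe_trajectory(trajectory)
        if instruction is None:
            continue  # the relabeler found no plausible description
        for _ in range(max_refinements):
            trajectory = rollout_conditioned(env, instruction)
            if looks_consistent(instruction, trajectory):
                demos.append((instruction, trajectory))  # keep good pairs as demos
                break
            # Interactions are costly, so instead of discarding a failed rollout,
            # relabel it with the best guess of what it actually accomplished.
            instruction = describe_trajectory(trajectory)
    return demos
```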

Vision-Language Models (VLMs) for Agents

Processing HTML can be impractical because of the sheer number of DOM elements ("sometimes there can be tens of thousands of DOM elements"); directly using the environment's pixel-level visual input may be a better option.

  • LLaVA (Large Language and Vision Assistant):
    • Similar in spirit to Orca: GPT-4 generates instructions and responses from textual descriptions of images.
    • Jointly fine-tunes an image encoder (CLIP) with a text-only decoder (Vicuna, an instruction-tuned LLaMA model).
    • The result is an image encoder that can output language responses, answer questions about images, or take screenshots directly as observations.
  • Pix2Struct:
    • Also consists of an image encoder and a text decoder.
    • Introduces a new pretraining task: mask regions of web page screenshots and ask the transformer decoder to generate the HTML corresponding to the masked regions (a rough sketch of constructing such a training pair follows this list).
    • The lecturer sees this as "a more natural pretraining objective that can maybe lead to better interactions between image and text".
    • It was later also used to build multimodal agents.
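
A rough sketch of how one such masked-screenshot training pair might be constructed, following the description above (a masked region of a screenshot as the input image, the HTML of the masked element as the decoding target). This is an illustration of the idea only, not the actual Pix2Struct data pipeline.

```python
import numpy as np

def make_masked_parsing_example(screenshot, element_bbox, element_html, mask_value=0):
    """Build one (masked screenshot, target HTML) pair in the spirit of the
    screenshot-parsing pretraining task described above.

    screenshot:   H x W x 3 uint8 array of a rendered web page.
    element_bbox: (top, left, bottom, right) pixel box of one DOM element.
    element_html: the HTML string of that element, used as the decoding target.
    """
    top, left, bottom, right = element_bbox
    masked = screenshot.copy()
    masked[top:bottom, left:right, :] = mask_value   # hide the element's pixels
    # The image encoder would see `masked`; the text decoder is trained to
    # generate the HTML of what was hidden.
    return masked, element_html
```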

Current Challenges and the "Prompting Gap"

  • A huge "prompting gap":
    > Speaker 1: "If you do not do extensive prompting, and if you do not use bespoke few-shot examples... even the best language models are very, very far from perfect, even on very, very simple tasks like MiniWoB."
  • Long-horizon planning is difficult: even on simple benchmarks, "long-horizon planning is still very, very hard".
  • A large gap to human performance: on more complex benchmarks like WebArena, "there's just a huge difference between human-level task success rate and what the best models get, even after prompting, even with few-shot examples".
  • Models make trivial mistakes
    • For example, on one WebLINX task, GPT-4V typed the email address into the password field and could not recover from the error.
    • In another example, the model repeated the same search term three times.
    • The lecturer comments: "I'm sure with extensive prompting you can fix this. But maybe that's beside the point, right?", hinting at a deeper capability gap.
  • Conclusion
    > Speaker 1: "There's a lot of room for improvement, as you can see, and there's lots to be done in this space... the benchmarks are very, very challenging, and models make kind of trivial mistakes."

Summary and Outlook

The lecture reviews progress in language models on reasoning and as agents.

  • Reasoning: through prompting, distillation, and self-training, language models can exhibit some reasoning behavior, in some cases even surpassing human-provided rationales. But counterfactual evaluations suggest current reasoning relies more on pattern matching and memorization and lacks systematicity.
  • Agents: from historical semantic parsing and reinforcement learning approaches to the current view of decision-making as causal language modeling. With prompting and in-context learning, or training on synthetic data, language models are beginning to perform tasks in digital environments. Vision-language models open a path to multimodal input.
  • Shared challenges: both areas are at an early stage; model performance depends heavily on prompt engineering, and there is large room for improvement in generalization, long-horizon planning, and avoiding simple errors, with a significant gap to human-level performance. The lecturer closes by encouraging students to explore these directions in their course projects.