speaker 1: Okay, let's get started. Welcome to lecture 14, everyone. I hope you've been doing well and managing all of the various deadlines. Today we'll be looking at two interesting applications of language models. In the first half, I'll talk about using language models to reason in domains like math and geometry, doing things like spatial reasoning. In the second half of the lecture, I'll talk about how you can use language models to take actions in grounded environments. A little bit of a disclaimer: a lot of today's content is research that was done in the last three or four years, so there are plenty of open questions and not a lot of answers, and maybe we can have more of a discussion around these topics. Okay, so let's get started with reasoning. Experts like to start a lecture on reasoning by talking about the various kinds of reasoning, so I'm going to do that here. At a high level, reasoning is about using facts and logic to arrive at an answer, but more concretely there are three distinct categories we can talk about. The first one, probably the one most of you are familiar with, is deductive reasoning, where we go from rules of logic along with a premise to a firm conclusion. For example, given the sentences "all mammals have kidneys" and "all whales are mammals," we can come up with the conclusion "all whales have kidneys," and we could do multiple such steps of reasoning. A second form of reasoning is inductive, where given observations we derive conclusions. Maybe we've learned from experience that a creature with wings is usually a bird, and when we observe a new creature with wings, we use that experience to conclude that the creature is likely to be a bird. That form of reasoning is inductive. And finally, we have abductive reasoning, where we are given an observation and we draw possible explanations. Maybe you see a car that cannot start and there's a puddle of liquid under the engine, and you start drawing inferences about the situation; one of them could be that the car has a leak in the radiator. Apart from that taxonomy, we can also think of reasoning in formal and informal terms. Formal reasoning involves using axioms and rules of formal logic to derive truth conditions. There's also informal reasoning, which is what you and I probably do every day: we reason about everyday situations and use common sense to derive conclusions. For most of the lecture, when I say reasoning, I will mean informal deductive reasoning, and it's often going to involve multiple steps. Okay, so let's come back to language models. We've learned in lectures nine, ten, and eleven that large language models are really, really good at coming up with plausible continuations of text that reflect human preferences and constraints. Today we try to answer whether they can also reason. One of the most basic ways we can try to answer this question is via prompting, and we've probably already seen this: there is a popular method called chain-of-thought prompting, where you get a language model to produce reasoning steps before producing an answer.
And we can do this by providing some in-context examples with explicit reasoning steps that the language model can then mimic at test time. That's chain-of-thought prompting. Another rather surprising property of language models is that sometimes you don't even have to show them these in-context examples: you can just prompt them with the sentence "Let's think step by step," and you get these reasoning rationales before the model produces an answer. That's pretty simple, but let's keep going. Another popular way to prompt language models to do reasoning is via self-consistency. Here, instead of greedily sampling one rationale followed by an answer, we are going to sample multiple rationales and correspondingly multiple answers. So in the figure on the right, we have a question, and what you would normally do with chain-of-thought prompting is greedily decode a rationale and then, conditioned on the rationale, generate an answer. With self-consistency, we sample multiple times, so we get multiple rationales, which lead to multiple answers, and then we pick the answer that is the most common. The idea is that if an answer keeps appearing across multiple rationales, meaning a majority of the rationales agree on it, then it's more likely to be correct. The authors of self-consistency find that on a variety of mathematical reasoning tasks, adding this simple idea of sampling multiple times and doing majority voting improves performance pretty drastically over standard chain of thought. Interestingly, when I saw this result the first time, I thought this is just like ensembling, which we learned in CS229: if you want to boost the performance of your system, you train, say, ten classifiers with different random seeds, get a classification decision from each, and do majority voting. But it turns out self-consistency is doing a little more than simple ensembling. The authors also compared against an ensembling approach where it's the same language model with multiple different prompts and you do majority voting there, and self-consistency is better than that simple ensembling.
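To make that concrete, here is a minimal sketch of self-consistency in Python. It assumes a hypothetical generate(prompt, temperature) function that returns a chain-of-thought completion from some language model, plus a simple regex answer extractor; neither is a specific library's API, and the in-context example is made up for illustration.

```python
import re
from collections import Counter

COT_PROMPT = """Q: A store has 3 apples and buys 4 more. How many apples does it have?
A: Let's think step by step. The store starts with 3 apples and buys 4 more.
3 + 4 = 7. The answer is 7.

Q: {question}
A: Let's think step by step."""

def extract_answer(rationale):
    # Take the number mentioned after "the answer is", if any.
    match = re.search(r"answer is\s*(-?\d+)", rationale, re.IGNORECASE)
    return match.group(1) if match else None

def self_consistency(question, generate, num_samples=10):
    """Sample several rationales at nonzero temperature and majority-vote the answers."""
    prompt = COT_PROMPT.format(question=question)
    answers = []
    for _ in range(num_samples):
        rationale = generate(prompt, temperature=0.7)   # sampled, not greedy
        answer = extract_answer(rationale)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Pick the most frequent answer across the sampled rationales.
    return Counter(answers).most_common(1)[0][0]
```

The prompt format and the answer-extraction heuristic matter a lot in practice; this only shows the sample-then-vote structure.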
Okay. Earlier today I said I'd be talking about multi-step reasoning. So far we've looked at math problems and prompting, but not necessarily multi-step reasoning. One of the main aspects of multi-step reasoning is that it involves breaking down a large problem into several subparts, answering each of the subparts, and then combining everything into a solution. This kind of decomposition strategy was integrated into another prompting method called least-to-most prompting. The idea behind least-to-most prompting is that, given a question, we first break it down into sub-questions, as shown here. Then the language model answers each of the sub-questions and, conditioned on its answers to the sub-questions, generates the final answer. Here's how it looks for a math reasoning problem. In standard chain-of-thought prompting, you would have a question followed by a rationale and then the answer. With least-to-most prompting, the decomposition strategy, you take the question and, instead of directly producing a rationale, you ask the language model to break it down into sub-problems. So you get these two different sub-problems, you answer both of them, and then you condition your final answer on the answers to those sub-problems. Okay, so that's just another prompting method. One interesting experiment from least-to-most prompting showed that you can sometimes generalize from a small number of reasoning steps to a much larger number of reasoning steps. The math word problem shown here as an in-context example requires two reasoning steps, and if we show this prompt to the language model, we see that it continues to generalize even on examples that require more than five steps of reasoning, in a way that's much better than standard chain of thought. But it's not entirely clear whether structuring things in this manner is really fundamental: one of the other results they reported was that with enough prompt engineering, the rows corresponding to the best normal chain-of-thought prompts are on par with least-to-most prompting. Still, it's an interesting idea to break problems down into sub-problems, solve the sub-problems, and then build up a solution from those answers. Okay. So all of this was different prompting methods to get reasoning behavior out of language models. Can we do something more? One thing we might be interested in is, instead of trying to get really large language models to do reasoning, somehow getting this kind of reasoning behavior into a smaller language model. One popular approach for doing that is distillation, where we fine-tune a smaller LLaMA model by teaching it to imitate a larger model. That's what we're going to look at now. This model is called Orca, and at a high level, Orca fine-tunes a smaller 13-billion-parameter LLaMA language model on explanations produced by GPT-4. Constructing the data is pretty simple; there are three steps. The first step is to get a wide variety of instructions from the FLAN-v2 collection. FLAN-v2 is basically a dataset that combines multiple datasets into one collection, and it consists of instructions paired with questions and answers; I'll show an example in a moment. Then we prompt GPT-4 or ChatGPT with these instructions along with a system message, where the objective of the system message is to get the model to produce an informative explanation along with the answer. So here we have a question about simple data processing, about calculating the median, and there's a system instruction that says, please justify your steps and answer step by step. In producing its output, the model provides a fairly detailed explanation of how it got to the answer. What Orca does is use precisely these explanations for fine-tuning: once we have them, we fine-tune a much smaller 13-billion-parameter LLaMA model on these explanations.
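As a rough illustration of the fine-tuning step this implies, here is a minimal supervised fine-tuning sketch using a HuggingFace-style causal language model. The training pair, model name, and hyperparameters are placeholders for illustration, not Orca's actual data or recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical training pair: (system message + instruction, teacher-style explanation).
pairs = [
    ("You are a helpful assistant. Justify your steps.\nQ: What is the median of 3, 9, 5?",
     "First sort the numbers: 3, 5, 9. The middle value is 5, so the median is 5."),
]

model_name = "meta-llama/Llama-2-13b-hf"   # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, explanation in pairs:
    # Standard causal-LM fine-tuning: the student learns to continue the prompt
    # with the teacher's explanation. (Here the loss also covers the prompt tokens
    # for simplicity; masking them out is a common refinement.)
    text = prompt + "\n" + explanation + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```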
So far we've looked at math reasoning and grade-school math problems. Let's turn to a different benchmark for reasoning: BIG-Bench Hard. This is another dataset for multi-step reasoning, so let's look at some examples from it. It consists of 23 different subtasks, and I'm going to show a few. One of them is evaluating Boolean expressions: the question gives you an expression built out of and, or, not, True, and False, and asks you to evaluate it, and with chain of thought the model can evaluate each of the sub-expressions and get to the final answer. Another example of a task from BIG-Bench Hard is date understanding (not data understanding). The question is something like: tomorrow is a given date; what is the date one year ago from today, in a given format? It's paired with some options, and again the model can think step by step, following basic chain of thought, and come up with an answer. So this is the flavor of tasks in BIG-Bench Hard: most of them involve multi-step reasoning, they're fairly synthetic, but they're also reasonably hard for language models. Another example is geometric shapes, and for this one it's pretty surprising that language models can do anything at all. You're given an SVG path element, and I have no idea what it renders as, but the question is, given the SVG, what shape are you going to get? There's a bunch of options, and again the model, prompted with "let's think step by step," will produce some answer. We don't know if it's correct, but it's going to produce some answer. So it's basically a dataset covering different kinds of reasoning, spatial reasoning, date understanding, evaluating Boolean expressions, and it's multiple choice, so it's easy to get an accuracy number. It covers a wide variety of different tasks. On the left of the table, we have performance from really large language models; this is zero-shot chain of thought with just the prompt "Let's think step by step." GPT-4 has some potential contamination issues with BIG-Bench Hard, so maybe we can ignore that column. Vicuna was, I think, state of the art a few months ago as an instruction-tuned LLaMA 13B model. And Orca is again a LLaMA 13B, but fine-tuned specifically on this explanation data, where you have instructions and then explanations from ChatGPT or GPT-4 and you fine-tune on that. We see that overall it outperforms ChatGPT, maybe because it's specialized to these reasoning problems, and it outperforms Vicuna, which was not trained on these extensive explanations. So that's one way you can get a smaller language model to display some kind of reasoning behavior. Okay, so this was all great, and we're very happy that you can just generate rationales from a big LLM and then fine-tune a smaller language model on them. But then someone could ask: why not just fine-tune the big language model on its own rationales? That's also been explored, and there are a bunch of different methods that do this. I'm going to talk about one of them, called reinforced self-training, or ReST. It alternates between two stages. In the first stage, given a reasoning problem, and perhaps the prompt "let's think step by step," we have the language model generate multiple rationales, and then we filter these rationales based on whether they give the correct answer or not.
So think about a word algebra problem: someone has three apples, someone else has four apples. If you generate a rationale and the answer comes out to be seven, you keep that rationale; if the answer is twelve, you leave that rationale out. Then there's an update step, where I take the rationales that were kept in the first stage and fine-tune the language model on them. And I can do this iteratively: now I have an updated language model, I can hopefully get better rationales, then update the language model on those better rationales to get an even better language model, and keep going. The results are promising, but with caveats. On GSM8K, which is a grade-school math dataset of algebraic word problems, as you increase the number of iterations of self-training, we see a slight improvement in performance, and then it starts degrading. MATH is another dataset that again focuses on multi-step reasoning covering math problems, and on this dataset, as we do more iterations of this reinforced self-training paradigm, we see an improvement in accuracy. The numbers in orange are a much larger PaLM model; the numbers in blue are a smaller model. The dashed lines represent what you get if you do supervised fine-tuning on human-provided rationales. One of the promising things about this approach is that when you do multiple iterations of self-training on your own rationales, you can outperform training on human-generated rationales. That is exemplified again in this graph: the blue bar represents accuracy when you take the PaLM model and do supervised fine-tuning on all human-provided rationales; orange is if you fine-tune on one human-written rationale per training example; and green is what you get if you fine-tune on one rationale chosen at random per question, generated by the model. So green controls for the number of rationales, and we see that it outperforms the human-provided rationales. And if you do the full multi-step iterative procedure, where you keep improving the model, we see again a boost in performance. So that's super promising.
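To make the loop concrete, here is a minimal sketch of the generate-filter-fine-tune procedure just described. The generate, extract_answer, and finetune callables stand in for a sampling call, an answer parser, and a supervised fine-tuning step; this is a sketch of the general idea, not any specific paper's implementation.

```python
def self_training_round(model, problems, generate, extract_answer, finetune,
                        samples_per_problem=8):
    """One ReST-style iteration: sample rationales, keep those whose final answer
    matches the reference, then fine-tune on the kept (question, rationale) pairs."""
    kept = []
    for problem in problems:                      # each problem has .question and .answer
        for _ in range(samples_per_problem):
            rationale = generate(model, problem.question, temperature=0.8)
            if extract_answer(rationale) == problem.answer:
                kept.append((problem.question, rationale))   # passed the filter
    # Update step: supervised fine-tuning on the filtered rationales.
    return finetune(model, kept)

def self_training(model, problems, generate, extract_answer, finetune, num_rounds=3):
    # Iterate: the updated model hopefully produces better rationales next round.
    for _ in range(num_rounds):
        model = self_training_round(model, problems, generate, extract_answer, finetune)
    return model
```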
But let's start revisiting the question we asked at the beginning about reasoning in language models. One way of answering it is to apply all these methods and look at benchmarks. But maybe the way to answer the question correctly is to be more systematic: come up with counterfactual tasks and be very careful about possible data contamination. I'm going to show some results around that. We started the lecture with chain of thought, and maybe the first question to ask is: are the rationales that the model produces with chain of thought faithful? What I mean by faithful is this: the model produces some rationale and then produces an answer, but maybe the answer does not even depend on the rationale it produced. Say the question was "Tom has three apples and Jerry has four apples," the rationale it produced was "Tom has three apples, Jerry has four, three plus four is seven," and then the answer it gave was 25. In a case like that, you'd say the model was not faithful to its rationale. What we see in this plot is a careful experiment where, on the x-axis, we have the number of reasoning steps. The setup is something like this: for every question, the model produces a rationale, and the rationale is multiple sentences. We then force the model to exit early from its rationalization and produce an answer. So if it produced four reasoning sentences, I can exit right after the first one and ask it to produce an answer, I can exit after the second and ask it to produce an answer, and so on. What's plotted on the y-axis is the model's accuracy after early exiting in this procedure. So if I exit after just one reasoning sentence and the model produces exactly the same answer that it would have if it had seen all four sentences of its rationale, then maybe we can conclude that the reasoning is not faithful: it doesn't matter whether the model sees the full rationale or just the first sentence. If you take that to the extreme, maybe you terminate without any rationale at all and it still produces the same answer. The results here are somewhat mixed, but we see that there are enough datasets where it doesn't matter whether the model sees the full rationale before answering or exits early; you get the same answer either way, which means that sometimes these rationales may be post-hoc explanations of the model's answer. Another experiment that tries to answer this same question is to take these rationales and start corrupting them. Maybe your rationale was four steps long: I keep the first and second steps, corrupt the third, keep the fourth, and then ask the model to generate an answer. If it turns out that no matter how much I corrupt my rationale the model produces the same answer, then I can again conclude that the answer did not really depend on the rationale. On the x-axis, we're looking at the percentage of reasoning steps completed before a mistake is inserted into the rationale. What you would hope to see is a strictly increasing trend, where adding a mistake right after the first step changes the answer a lot, and adding a mistake after the last step doesn't change it all that much. But again, we find that for some datasets you can add a mistake in the very first sentence of the rationale and the answer doesn't change all that much. That's also an indicator that maybe these rationales are post-hoc explanations of the model's behavior. So, there are a lot of lines here; if anyone has questions, now is a good time. I see a few blank faces in the audience.
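Before moving on, here is a small sketch of the early-exit check just described, assuming hypothetical generate_rationale and answer_given helpers that call a model; the corruption experiment is analogous, with a perturbed step substituted into the prefix instead of a truncation.

```python
def early_exit_consistency(question, generate_rationale, answer_given):
    """Check whether the answer depends on the rationale: force an answer after each
    truncated prefix of the chain of thought and compare to the full-rationale answer."""
    steps = generate_rationale(question)          # list of reasoning sentences
    full_answer = answer_given(question, steps)   # answer conditioned on the whole rationale
    same_as_full = []
    for k in range(len(steps) + 1):               # k = 0 means no rationale at all
        truncated = steps[:k]
        same_as_full.append(answer_given(question, truncated) == full_answer)
    # If the answer already matches with little or no rationale, the chain of thought
    # looks more like a post-hoc explanation than something the answer depends on.
    return same_as_full
```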
Okay, let's keep moving. So that was about whether chain of thought expresses reasoning that the model is faithful to. Another question you could ask is: what if I change my setting a little bit? Say I observe that my model is able to do arithmetic in base ten, so it can answer something like twelve plus fourteen. Does that mean my model knows how to do arithmetic? Or was this exact example simply present in the training data? One way you can test for this is by creating counterfactuals which, based on our understanding of the data, you expect not to be present very frequently in the training data. So instead of doing base-ten addition, you could do addition in base nine, and if the model has the same accuracy in base nine, then you can conclude that maybe the model has understood how to add. Similarly for logic: maybe the reason the model is so good at solving logic problems is that it's seen something very similar in its training data. So what if I construct a world where, say, corgis are reptiles? Can it still do the same logic problem? And what we find is that there's sometimes a pretty significant drop when you move from the original problem to the counterfactual. There's a question from the audience: why is base ten the baseline and base nine the counterfactual? It's a counterfactual in the sense that, as the authors argue, base-ten addition is frequently observed in training data, but very few people do base-nine addition, so there will be far fewer examples of it in the training data. So it's more of a counterfactual distribution; you could also call it out of distribution, for sure. And from the results, what we see is that there's a drop in performance even for very simple logic problems that don't involve multiple steps of reasoning. There's a significant drop, which maybe suggests that there's not that much reasoning and more memorization. We can keep going with this paradigm of changing the problem setting so that it starts looking out of distribution relative to the training corpus, and that's exactly what was done in a paper that looked at analogical reasoning. The setup is something like this: I show the model certain examples of string transformations, and I ask it to generalize to new examples. So in this extend-the-sequence problem, the input is "a b c d" and the output is "a b c d e", and then given "i j k l", the model has to produce "i j k l m", and so on. The way you can make this into a counterfactual, something out of distribution, is to change what the extend-the-sequence task is: now, instead of outputting "a b c d e", maybe the model has to output "a b c d f". So instead of outputting the next character, it has to output the one after the next, and so on. The other kind of counterfactual you could add is, instead of operating on the standard alphabet, you modify the alphabet completely: instead of the alphabet starting a, b, c, d, maybe it starts at x, y, and so on. And we find two things. The first is that there's a significant drop in performance as we go from the standard analogical reasoning problem to one of these counterfactuals, where we either change the alphabet or change the description of the task so that it becomes slightly unnatural. On the other hand, the authors also ran this exact same experiment on human subjects, and they find very little decrease in performance. So overall, what these results suggest is that maybe there's some reasoning, maybe there's some memorization, but there's nothing systematic. Again, this is all emerging work, so maybe someone will find that if you change your prompt a little bit, models can suddenly do this kind of reasoning, but this is the current lay of the land.
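As a tiny sketch of the base-ten versus base-nine comparison, here is one way such a counterfactual evaluation could be set up; ask_model is a placeholder for querying a model and parsing its answer, and the prompt format is made up for illustration.

```python
import random

def to_base(n, base):
    # Write a non-negative integer in the given base as a string of digits.
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits)) or "0"

def addition_accuracy(ask_model, base, num_problems=50):
    """Accuracy on two-operand addition posed in the given base (operands and the
    expected answer are all written in that base)."""
    correct = 0
    for _ in range(num_problems):
        a, b = random.randint(10, 80), random.randint(10, 80)
        prompt = (f"You are working in base-{base}. "
                  f"What is {to_base(a, base)} + {to_base(b, base)}? "
                  f"Give only the answer in base-{base}.")
        if ask_model(prompt).strip() == to_base(a + b, base):
            correct += 1
    return correct / num_problems

# A large gap between addition_accuracy(ask_model, 10) and addition_accuracy(ask_model, 9)
# is evidence for memorized base-ten patterns rather than a general addition procedure.
```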
Okay, so that was the reasoning module of the lecture. I'm going to now switch gears and talk about language model agents. This is related to reasoning in the sense that reasoning involves multi-step inference, where given some facts you have to arrive at completely new conclusions; with agents, what we'll see is that there's some high-level objective the model has to accomplish, and it has to reason about postconditions, object affordances, and uncertainty in the world to carry out a sequence of steps. Let's start with some terminology. We have our agent on the right; that's going to be some neural network. And then we have an environment, and I'll give some examples of what these environments could be. The agent receives an observation from its environment, and based on the observation, it issues an action. Along with that, it receives this second variable, g, where g represents a language instruction. There are many names for this setting and these models: digital agent, language-conditioned policy, or instruction-following agent. Some examples of environments: maybe it's a web browser in a browsing environment, where the objective is to book a flight from San Francisco to New York. The observation could either be the raw pixels the model sees, or it could be the HTML DOM representation. And the action space, if you're looking at these web environments, could be typing into specific web elements, clicking on web elements, moving your mouse to a certain web element to interact with it, and so on. This has a vast number of applications; I don't think I can cover all of them, but we can look at some. There are obviously digital assistants (I'm not going to say the names, because someone's device might start responding), where you can give them natural language commands to set an alarm, set reminders, and so on. You could also do natural language programming, where given natural language descriptions, you get a model to write Python code. Another example could be UI automation, where maybe you want to do automated testing of UI elements, and instead of having a human verify whether a UI element works, you get a model to execute actions corresponding to a given instruction. Or it could be something more user-facing, where given some complex environment like Spotify, you could ask an agent to play some songs. And finally, there is this emerging application where we want to add tools or plugins to language models so that they can control various different applications. Okay. Before we look at how we can use language models to do instruction following, I think it's very helpful to look at how this was done before language models. There were basically three main ideas. Sometimes the right thing to do was to collect examples of utterances paired with logical forms. Logical forms are some kind of executable representation that you can execute against a knowledge graph or a database to get an answer. So maybe you have a query like "what states border Texas?", and there exists some program-like description that you can execute against a knowledge graph to get an answer, or a list in this case.
Idea number one was to treat this almost like machine translation: you have a source language, which is English commands, and a target language, which is these meaning representations or logical forms, and you can apply the same machinery from assignment three to build a natural language interface here. So you directly maximize the probability of a sequence of actions given a goal or a command. Idea number two was a little more complex. Here you have instructions paired with actions, but instead of directly mapping instructions to actions, you infer an executable plan from these instruction and action sequences, train a model to go from instructions to these plans, and then define a fairly rich execution model that directly executes the plans. The advantage is that there may be higher-level decisions you can encode in your plan that would be harder to get into the model if you just trained it to produce the action trajectories directly. I have an example of a system like that from 2011, which was basically an agent that could navigate in grounded environments. The idea was that you took an instruction and obtained a plan, and then you trained a semantic parser, which is basically that kind of machine translation system, to convert command sequences into plans. Once that's trained, at test time, given a completely new instruction, you run the semantic parser, get the plan, and then execute it with the execution model. And I have an example of an instruction and a plan from that 2011 system. The third idea, which is probably the first one that comes to mind if you see a setting like this, is to use reinforcement learning directly. What people did there was to use RL to directly map instructions into actions: learn a policy that outputs actions that maximize some reward, conditioned on the natural language instruction and the observation. This reward could be sparse, meaning I carry out the entire task and then the environment tells me whether I achieved it or not, or it could be something I obtain after each step, where I take an action and the environment tells me whether that action completed some percentage of my task. At the top, I've included an example of a system from 2009 that did this for automated Windows debugging: you have some natural language instruction to click some UI elements, and that gets mapped into API commands that the model executes one after the other. So these are basically the three main ideas people had before language models: you would train semantic parsers, or you would infer latent plans from instruction-trajectory pairs, learn to predict plans directly, and pair that with an execution model, or you would do reinforcement learning if you had a reward signal. So how do we do things in 2024? There are a few ways to think about this, but maybe the most instructive is to think about what we are trying to achieve. We are trying to model trajectories, sequences of actions, conditioned on some goal. So I want my model to book a flight from San Francisco to New York.
And I want it to produce a trajectory of maybe typing and clicking actions. So let's look at how that factorizes. The probability of a trajectory conditioned on a goal or an instruction is just the probability of the state, the action, the next state, and so on, conditioned on the goal, and you can factorize that into two terms. The first term is the transition dynamics of the environment: if I take a certain action in a given state, how does my state change? The second object is the agent policy: given my goal and the trajectory so far, what is the next action I should be taking? And people quickly realized that you could treat this as a generative problem: you can treat decision making in environments as a generative trajectory modeling problem. What I have in the top right is an example of a transformer that just takes the history of actions it has taken so far, the current state, and some indication of what target it should achieve, here expressed in terms of reward, but it could be a natural language string, and it's trained to predict the next action. So you can just train an autoregressive model to do this, and it turned out that this worked very well in the offline RL case. At inference time, you predict an action, execute it, append it to your trajectory, and then predict the next action, and so on; at each step there are these three kinds of input tokens and a single predicted action as output. And it turned out that this worked really well. So instead of inferring latent plans and training semantic parsers, or trying to do reinforcement learning, we started using language models as policies. A simple way to do that is to prompt a language model in a loop. We're going to specify the action space in text. This is a very simple language model agent; it's not going to work well at all, but it's illustrative of how agents can be built now. You provide an action space in text: maybe it's a digital environment, and the agent can type, it can click, it can move the mouse somewhere. You provide an instruction, and you provide the sequence of actions and observations it has received so far. Then, conditioned on all that, you ask it to predict the next action. There's nothing deep going on here; this is just chain-of-thought prompting in a loop. But the hope is that, because we've reduced the problem of decision making to autoregressive modeling, this could work. And indeed, a slightly more complex version of this can work in some environments.
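In symbols, the factorization is p(τ | g) = ∏_t p(s_{t+1} | s_t, a_t) · π(a_t | g, s_{≤t}, a_{<t}), where the environment supplies the transition term and the language model plays the role of the policy π. Here is a minimal sketch of the prompt-in-a-loop agent just described; generate, env.reset, and env.step are placeholder interfaces, not any particular benchmark's API, and the action vocabulary is made up.

```python
def run_agent(goal, env, generate, max_steps=20):
    """Prompt-a-language-model-in-a-loop agent: at each step, show the goal,
    the action space, and the history so far, and ask for the next action."""
    action_space = "Available actions: CLICK(element), TYPE(element, text), SCROLL(direction), STOP"
    history = []                              # list of (observation, action) strings
    observation = env.reset()                 # e.g. a textual/DOM rendering of the page
    for _ in range(max_steps):
        prompt = (
            f"{action_space}\n"
            f"Goal: {goal}\n"
            + "".join(f"Observation: {o}\nAction: {a}\n" for o, a in history)
            + f"Observation: {observation}\n"
            "Think step by step, then output the next action.\nAction:"
        )
        action = generate(prompt)             # the language model acts as the policy
        if action.strip().startswith("STOP"):
            break
        history.append((observation, action))
        observation = env.step(action)        # the environment supplies the transition dynamics
    return history
```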
So now I'm going to give a little flavor of what different environments look like for evaluating language models as agents. The simplest environment that people consider is MiniWoB. This is a sandboxed environment that evaluates basic browser interactions: on a mini Twitter environment, can you get a language model to retweet a given tweet; given a simulated email client, can the model forward someone's email, can it compose an email, can it click on certain buttons? It's not at all real world, so these are not real websites, and it's a relatively short horizon: given any instruction, most tasks can be accomplished in under three actions. But zero-shot performance of even the best language models is still far from perfect, even on this very simple benchmark. A second, slightly more realistic benchmark is WebArena. This is also a sandboxed environment, but it's a pretty close approximation of real websites spanning e-commerce, so there is a website in WebArena that resembles Amazon, and social media, something that resembles Twitter. Additionally, there are utility tools like maps, so an instruction could require a model to open up a map application, find the shortest path from point A to point B, and use that in its later sequence of actions. And this is multi-tab browsing, like we commonly do: with MiniWoB there's only one single tab, and WebArena, I think, was the first environment that introduced the idea of multiple tabs, where the agent can switch between tabs. And again, we evaluate functional correctness, which is whether the model gave the correct answer at the end, whether the sequence of steps it took produced the intended behavior, as opposed to whether it matched a sequence of steps a user had pre-programmed. Another popular resource is a dataset called WebLINX. WebLINX also has multi-tab browsing, and it has web interactions on real websites, so these are not sandboxed approximations; they're actual real websites with actual browser interactions. It also introduced a new action where the agent can communicate with a user. So maybe the instruction is to reserve or buy a movie ticket or something, and at some point the model has to request credit card information, so there's this additional action where a human can be involved in communicating with the agent. And this is not an environment, just a collection of interactions, so you can't, for example, do any kind of exploration or online learning here, but you can definitely use it for evaluation. Okay, so this was just a taste of what some benchmarks look like for language model agents. So how are we going to train these models? Given that we're going to treat decision making as causal language modeling, we're not going to use any of the ideas from the pre-LLM days. The standard practice is to do in-context learning with few-shot examples. For the few-shot examples, typically for any new kind of website or any new use case, you get humans to perform those tasks and feed that into the language model's prompt as in-context demonstrations, which it can then use to solve similar-looking tasks on very similar websites. Obviously, this is not scalable: there are thousands of environments, and in some environments there are lots of different interactions that are possible. So maybe there's something better we can do than getting humans to provide demonstrations for every new use case. We're going to use something we saw early on in the lecture, which was to use the language model to generate rationales and then fine-tune on them. Here we don't have rationales, but we can produce action trajectories, and then we're going to use those as supervision.
The way that looks is something like this. Let's say I have some environment, say a MiniWoB environment, and I get an agent to just randomly explore it: execute a random sequence of clicks and types and scrolling operations, and produce some trajectories. Now I'm going to take these trajectories and somehow filter them. That was the idea from earlier: you get a bunch of different outputs and then filter them somehow. Here we're going to use a second language model, because we don't know what a good trajectory looks like. Unlike a math problem, where you know the correct answer, we just had a language model interact with a website and generate trajectories, and we want to somehow filter out the good ones. So we use a second model that produces a description of these trajectories, and the idea is that if you can get a model to produce a description of what the sequence of actions corresponds to, then maybe that's a good enough signal for a good trajectory. So maybe given the first trajectory, it guesses that the instruction was to book a flight from San Francisco to New York; for the second trajectory, it guesses the instruction was to set the date to some given date; and maybe it wasn't able to come up with any good instruction for the third trajectory. Then we do something we also saw earlier, which is to do this iteratively. Now we have a goal that we inferred for a trajectory, and I'm going to get the language model to condition its behavior on this goal. Say the goal is to set the date to some given date. Now, instead of doing random exploration, the model produces a sequence of actions that has a better correspondence with some natural language instruction. So it produces a trajectory based on that instruction, and then I use some coarse filter that just looks at the correspondence between the instruction, the sequence of actions, and the states the language model visited, and use that to decide whether the trajectory was a good trajectory for the instruction. In this case, given the instruction, this seems like a pretty good trajectory for completing the task, so we add it to our set of examples. But sometimes things are not so good. For that second instruction, the generated label was to book a flight from San Francisco to New York, and let's say we run that again through the language model and it produces a second trajectory, and clearly this does not look like a successful trajectory for booking a flight. So what do we do here? We could throw away this interaction, but interactions are pretty costly, specifically if you're looking at real websites where each interaction could take a few milliseconds, so maybe we don't want to throw it away. What we do instead is again invoke the relabeler to take the trajectory and assign it a new label: the model was not successful at accomplishing the task it set out to do, but it accomplished something, and we come up with a best guess of what that was using the second language model. Let's say it tells us that the instruction actually accomplished was to set the origin to SFO and the destination to New York City. And that gets fed back into the language model.
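Putting the pieces together, here is a minimal sketch of this explore-relabel-filter loop. The explore, describe_trajectory (the relabeler), execute_instruction, and is_consistent callables are placeholders for model and environment calls; this is a sketch of the idea described in the lecture, not the actual system's code.

```python
def collect_synthetic_demos(env, explore, describe_trajectory, execute_instruction,
                            is_consistent, num_seeds=100, max_rounds=3):
    """Generate (instruction, trajectory) training pairs without human demonstrations."""
    examples = []
    for _ in range(num_seeds):
        trajectory = explore(env)                         # random clicks/types/scrolls
        instruction = describe_trajectory(trajectory)     # relabeler LM guesses the goal
        for _ in range(max_rounds):
            if instruction is None:
                break                                     # no plausible description; drop seed
            # Roll out the policy conditioned on the guessed instruction.
            trajectory = execute_instruction(env, instruction)
            if is_consistent(instruction, trajectory):    # coarse instruction/trajectory filter
                examples.append((instruction, trajectory))
                break
            # Rollout didn't match the goal: relabel it with what it did accomplish
            # and try again, rather than throwing the interaction away.
            instruction = describe_trajectory(trajectory)
    return examples
```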
And we're going to keep doing this iteratively until our filter says this is a good instruction-trajectory pair. So we have the same idea of using a language model to generate outputs and an iterative procedure that gives us a good set of training examples. Overall, the method looks something like this: you have some environment, you use an unconditioned language model to just randomly explore the environment and generate a set of trajectories, and then you convert these trajectories into synthetic training data by iteratively converting trajectories into natural language descriptions, taking the descriptions and converting them into even better trajectories, and so on. Once we have this collection of synthetic examples, there are two things we can do. One is to fine-tune on this data, but the simplest thing, repeating the paradigm from earlier, is to replace human-provided demonstrations in the context with these synthetic demonstrations. And we find a reasonable boost in performance, a 13-point improvement, on the MiniWoB benchmark. Again, even MiniWoB is very, very simple, and zero-shot performance for even the best language models is far from perfect. We also see an improvement on a second multi-step tool-use environment. But so far we've only looked at text. For real-world applications, it may be intractable, for every environment, to obtain the HTML and feed it into the language model's context: sometimes there can be tens of thousands of DOM elements, plus the corresponding JavaScript, and putting all of that into the context could be intractable. And maybe that's also not the best way to show the state of the environment; maybe the best way is to directly show the pixels corresponding to the environment. So now we're going to look at some examples of vision-language models that people have used for building these agents. The first one we'll look at is LLaVA. The idea here is, again, similar to Orca from the reasoning half of the lecture: we're going to use GPT-4 to generate, this time, both instructions and responses from textual descriptions of images. So maybe there's an image, and we use the metadata corresponding to that image to come up with a textual description, feed that into GPT-4, and ask it to generate possible questions and responses. Then we jointly fine-tune an image encoder, here CLIP, along with a text-only decoder, here Vicuna, which is a LLaMA model that's instruction-tuned. Through this joint fine-tuning, at the end we get an image encoder hooked up to a model that can output language responses, and now we can ask questions about images, and maybe use that to directly input screenshots instead of HTML DOM elements. A second approach that looked at building joint image-language models, which people later adapted to agents, is Pix2Struct. The idea is again very similar: there's an image encoder and a text decoder. The image encoder takes the image, converts it into patches, assigns each patch a position ID, and runs that through a transformer, and then there's a decoder that decodes out some text. One of the new things Pix2Struct introduced was a new pretraining task.
For LLaVA, the pretraining was fairly simple: use GPT-4 to generate synthetic questions and responses based on textual descriptions of images. But there's only so far you can go with textual descriptions of images. What Pix2Struct did was to take screenshots from websites, mask out parts of the screenshots, and ask the transformer decoder to produce the HTML corresponding to the masked-out elements. So here there's a list that has corresponding HTML, and one of the data points in Pix2Struct looks something like this: you might mask out, say, the first answer, the one corresponding to Python, and ask the model to produce the HTML corresponding to just the patch that was masked out. This seems like a more natural pretraining objective that can maybe induce better interactions between image and text. And this was also adapted for building these multimodal agents. Okay. At this point, I just want to highlight that this is really an emerging application; there's this huge prompting gap, as I like to call it. If you do not do extensive prompting, and if you do not use bespoke few-shot examples, where for every different environment you have a different set of few-shot examples, even the best language models are very, very far from perfect, even on very simple tasks like MiniWoB, where the goal is just to click on certain elements or respond to someone's email, which in MiniWoB takes about five actions. And even for something as simple as MiniWoB, even after doing extensive prompting with few-shot examples, there's this drop in performance as you go from the simplest tasks, which map an instruction to a single action, to tasks that map an instruction to maybe five or ten actions. So long-horizon planning is still very, very hard, even on these very simple benchmarks. And if you look at something more complex like WebArena, which tries to approximate real websites, has multi-tab browsing, and has external tools the model can use, there's just a huge difference between human task success rates and what the best models get, even after prompting, even with few-shot examples. And the kinds of errors models make are also pretty weird. In one of the examples from WebLINX, the task was just to open Google Translate and sign in using given credentials, an email and a password. What GPT-4V did was, instead of typing in the password, it just typed the email into the password field, and it couldn't recover from this error: it tried to sign in, there was an error, it tried to type in the email again, and so on. I'm sure with extensive prompting you could fix this, but maybe that's beside the point. And there was a different example where the model had to issue a search, and instead of issuing the search with the correct term, it repeated the same term three times, and obviously that's not going to return any results. So there's a lot of room for improvement, as you can see, and there's lots to be done in this space. Okay. So let me recap and take any questions. We looked at two different things today. We looked at reasoning in language models, and we saw that there are a few ways you can get reasoning-like behavior: you can prompt them in various ways, the simplest example being chain-of-thought prompting.
You can do chain-of-thought prompting but generate multiple rationales, try to reconcile them, and pick the answer that was most frequent. You can do problem decomposition in your prompt, asking the model to explicitly decompose a problem into multiple steps before answering. That was all prompting. You can also try to train specialized small language models for reasoning by generating rationales from a big language model and then fine-tuning a smaller language model on those rationales. Or, instead of fine-tuning a smaller language model on rationales from a big one, you can fine-tune the big language model on its own rationales and keep doing this iteratively, and we saw that with multiple iterations, performance can keep improving and can even outperform training on human-provided rationales. On the flip side, we saw that while there are some initial reasons to be optimistic, if we do counterfactual evaluation, it's not clear whether the models are good because of reasoning or because all of these problems were in some shape or form already in the training data; we saw that with counterfactual evaluation. In the second part, we looked at language model agents. We talked about the historical perspective through which people built grounded agents, and then we saw that you can recast the problem of decision making as causal language modeling. We looked at various ways people have modeled decision making with language models; most of it involves prompting and in-context learning. Then we looked at a method, similar to what we saw in the first half, for generating synthetic demonstrations, and here it involved doing exploration together with this kind of iterative relabeling. Most of the language models we looked at today were text-only, and we saw some examples of models that can take both text and visual input. Finally, we saw that the benchmarks are very, very challenging: models make fairly trivial mistakes, and there's a huge gap between human performance and where models are, so there's a lot of room for driving further improvement. Maybe some of you are doing that for your projects. Thank you.