speaker 1: Cool. Hi. My name is Shi. We're glad to be here and talk to you about LLM agents: a brief history and overview. So today's plan is very straightforward. I want to talk about three things. First, what is an LLM agent, to start with? Second, I want to talk about a brief history of LLM agents, both in the context of LLMs and in the context of agents. And lastly, I want to share some ideas on future directions of agents. As you know, this field is a moving piece, and it's very big and messy, so it's impossible to cover everything in LLM agents. I'll just try to do my best, and you can see there's a QR code you can scan to give me feedback, and I can improve the talk accordingly. Okay, so let's get started. First, what is an LLM agent? Does anyone know the answer? If so, raise your hand. Do you have a definition for what an LLM agent is? Okay, there are like three people. So that means this field is really a moving piece. I think if we want to define what an LLM agent is, we first want to define the two components: what is an LLM, and what is an agent. Does everyone know what an LLM is? Okay. So really what's left is to define what an agent is. If you search on Google Images, this is an agent, right? But in the context of AI, obviously, it's a notoriously broad term. It can refer to a lot of different things, from autonomous cars to playing Go to playing video games to chatbots. So first, what is an agent? My definition is that it is an intelligent system that can interact with some environment. And depending on the environment, you can have different agents. You can have agents in physical environments, such as robots or autonomous cars, and you can have agents in digital environments, such as video games or your iPhone. And if you count humans as the environment, then a chatbot is also some kind of agent.
And if you want to define agent, you really need to define what is intelligent and what is the environment. What's really interesting is that throughout the history of AI, the definition of "intelligent" often changes across time, right? Sixty years ago, if you had a very basic chatbot using like three lines of rules, it counted as intelligent. But right now, even ChatGPT is not that surprising anymore. So I think a good question for you all is: how do you even define intelligence? Okay, so let's say we have some definition of agent. Then what is an LLM agent? I really think there are three categories, or three levels of the concept. The first level is what I call a text agent. A text agent is defined like this: you have an agent interacting with an environment, and if both the action and the observation are in language, then it's a text agent. Obviously, you can have text agents that don't use LLMs, and in fact we have had text agents from the beginning of AI, several decades ago. The second level is the LLM agent, which is a text agent that uses LLMs to act. And the last level is what I call the reasoning agent, and the idea is that those agents use LLMs to reason in order to act. Right now you might be confused about the difference between the second level and the third level, which I will explain later. So like I said, people have been developing text agents from the beginning of AI. For example, back in the 1960s, there were already chatbots. ELIZA is one of the earliest chatbots, and the idea is really simple: you just have a bunch of rules, and what's really interesting is that using a bunch of rules, you can already make a chatbot that feels quite human, right? What it does is ask you questions or repeat what you said, and people found it very human.
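To make the rule-based chatbot idea concrete, here is a tiny ELIZA-style sketch in Python. The three patterns and responses are made up for illustration; ELIZA's real rule script was much larger.

```python
import re

# A few illustrative pattern -> response rules (hypothetical, not ELIZA's real script).
RULES = [
    (r"\bI need (.+)", "Why do you need {0}?"),
    (r"\bI am (.+)", "How long have you been {0}?"),
    (r"\bmy (\w+)", "Tell me more about your {0}."),
]

def respond(utterance: str) -> str:
    """Return a rule-based reply, or a generic fallback question."""
    for pattern, template in RULES:
        match = re.search(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please go on."

print(respond("I need a vacation"))  # -> Why do you need a vacation?
print(respond("Hello there"))        # -> Please go on.
```

As the talk notes, everything intelligent-looking here is hand-written and task-specific: each new domain needs a new rule set.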
So obviously, there are limitations to those kinds of rule-based agents. If you want to design rules, it often ends up very task-specific, and for each new domain you need to develop new rules, right? And lastly, those rules don't really work beyond a simple domain. Suppose you write many rules to build a chatbot; then you need to write many new rules for a video game, and so on and so forth. So before LLMs, there was another very popular paradigm, which is to use RL to build text agents. The idea is, I'm sure everybody has seen video games, right? You can imagine text games where, instead of pixels and a keyboard, you have text observations and text actions, and you similarly have rewards. You can similarly use reinforcement learning to optimize the reward, and you end up with some kind of language intelligence. But again, this kind of method is pretty domain-specific: for each new domain, you need to train a new agent, and it really requires you to have a scalar reward signal for the task at hand, which many tasks don't. And lastly, it takes extensive training, which is inherent to RL, really. Now think about the promise of LLMs to revolutionize text agents: these LLMs are really just trained with next-token prediction on massive text corpora, yet at inference time they can be prompted to solve various new tasks. This kind of generality and few-shot learning phenomenon could be really exciting for building agents. So next I want to give a brief overview of LLM agents, from a historical view, and it's obviously very simplified. I think what happened is this: first we had LLMs, around 2020. I think the beginning was GPT-3, and then people started to explore it across different tasks.
Some tasks happen to be reasoning tasks, such as symbolic reasoning or question answering, and some tasks happen to be what I call acting tasks; you can think of games or robotics and so on. Then in the end, we found that this paradigm of reasoning and this paradigm of acting started to converge, and we started to build what I call reasoning agents, which are actually quite different from all the previous agents. And within this agent paradigm, we started to explore, on one hand, more interesting applications and tasks and domains, such as web interaction, software engineering, or even scientific discovery, and on the other hand, new methods such as memory, learning, planning, or multi-agent systems. So first I want to introduce what I mean by the paradigm of reasoning, what I mean by the paradigm of acting, how they converge, and what this paradigm of reasoning agents is. History is always messy, so for now let's just focus on one task, question answering, which simplifies our history discussion a little bit, and then we'll come to more tasks. Question answering is a very intuitive task, right? If you ask a language model what is one plus two, it will tell you three. That's question answering. It's very intuitive. It also happens to be one of the most useful tasks in NLP, so obviously people tried to use language models to do question answering. And then people found a lot of problems when trying to answer questions. If you have some question like this, it would be very hard for a transformer language model to just output the answer directly, right? It turns out you need some reasoning, and then came this line of work, like chain-of-thought reasoning and so on. There have been a lot of people investigating how to do better reasoning with language models.
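To make the chain-of-thought idea concrete, here is a minimal few-shot prompting sketch. `call_llm` is a placeholder for whatever completion API you use; the worked example inside the prompt is what nudges the model to write out its reasoning before the answer.

```python
# A minimal chain-of-thought prompt: the worked example shows intermediate
# reasoning, so the model imitates that style on the new question.
COT_PROMPT = """\
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: {question}
A:"""

def answer_with_cot(question: str, call_llm) -> str:
    """call_llm is any text-completion function; its output should contain
    the reasoning steps followed by the final answer."""
    prompt = COT_PROMPT.format(question=question)
    return call_llm(prompt)
```

The key point from the talk: this adds test-time compute purely through generated text, with no external knowledge or tools.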
You can also imagine a language model trying to answer something like this, and it will probably get the answer wrong. For example, if the language model was trained before 2024, and the Prime Minister of the UK changes often, as you know, it might get the answer wrong, right? So in that case, you need new knowledge, and people are working on that. For another example, you can ask something that's really mathematical and really hard, and in that case you cannot really expect a transformer to give the answer. In some sense, you need some way of doing computation beyond the naive autoregressive inference of a transformer. So as you can see, there are many types of QA tasks, and people found many problems when using language models to answer those questions, and then people came up with various solutions. For example, if you are trying to solve the problem of computation, what you can do is first use the language model to generate a program, and then this program will run and give you a result. That's how you can answer a question about prime factorization or the fifth Fibonacci number. For the problem of knowledge, there is the paradigm of retrieval-augmented generation. The idea is very simple: you assume you have some extra corpus, for example Wikipedia, or the corpus of some company, and then you have a retriever, whether it's BM25 or DPR and so on. You can think of the retriever as a kind of search engine: given a question, the retriever pulls the relevant information from the corpus and appends it to the context of the language model, so that it's much easier for the language model to answer the question. So this is a very good pattern. However, what if there's no corpus for the knowledge or information that you care about, right?
For example, if I care about today's weather in San Francisco, it's very hard to expect any existing corpus to have that, right? So people also found this solution called tool use. The idea is that you have this natural form of generation, which is generating sentences, but then you can introduce some special tokens so that the model can invoke tool calls. For example, you have a special token for a calculator, or a special token for a Wikipedia search, or a special token for calling a weather API. This is very powerful: obviously, you can augment language models with a lot of different knowledge, information, and even computation. But if you look at this, it's not really a very natural format of text, right? There's no blog post or Wikipedia passage on the Internet that looks like this. So if you want a language model to generate something like this, you have to fine-tune it in this very format. And it turns out to be very hard to call tools more than once across the text. So another natural question comes: what if you need both reasoning and knowledge? People actually came up with a bunch of solutions for different tasks. For example, you can imagine interleaving chain of thought and retrieval, or generating follow-up questions, and so on and so forth. But we don't need to get into the details of these methods. I just want to point out that the situation at the time was a little scattered, right? You have this single task called QA, but it turns out to be more than a single task. You actually have tens of different benchmarks, and they happen to challenge the models in very different ways, and people came up with solutions for each of the benchmarks. So it feels very piecemeal, at least for me, and at least for me at the time, the question was: can we really have a very simple and unifying solution?
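The special-token tool-calling format just described can be sketched roughly like this. The `[TOOL(arg)]` marker syntax and the `run_tool` dispatcher are made-up illustrations, not the actual format of any particular system: the model emits inline markers, and a post-processor executes them and splices the results back into the text.

```python
import re

def run_tool(name: str, arg: str) -> str:
    """Dispatch a tool call to its implementation (only a calculator here)."""
    if name == "CALC":
        # eval() is only for this arithmetic demo; never use it on untrusted input.
        return str(eval(arg, {"__builtins__": {}}))
    raise ValueError(f"unknown tool: {name}")

def expand_tool_calls(generation: str) -> str:
    """Replace every [NAME(arg)] marker in the generated text with the tool's result."""
    pattern = re.compile(r"\[(\w+)\((.*?)\)\]")
    return pattern.sub(lambda m: run_tool(m.group(1), m.group(2)), generation)

print(expand_tool_calls("Total: [CALC(2100 + 1900 + 2400)] billion."))
# -> Total: 6400 billion.
```

As the talk points out, the markers are an unnatural text format, so a model typically has to be fine-tuned to emit them.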
And I think if we want to do that, we really need abstraction beyond individual tasks and methods. We need a higher-level abstraction over what's happening. The abstraction that I found, at least for myself, is the abstraction of reasoning and acting. So what is reasoning? I hope you already know that from Denny's talk last time. Chain of thought, right? It's very intuitive, and it's a very flexible and general way to add test-time compute and to think for longer during inference to solve more complex questions. However, if you only do chain of thought, you don't have any external knowledge or tools, right? Even the biggest, smartest model in the world does not know what the weather is today. If you want to know that, you need an external environment, knowledge, and tools. And what I have described, like retrieval or code or tool use and so on, is in some sense just the paradigm of acting, because you're assuming you have an agent and various environments, whether it's a retrieval system, a search engine, a calculator API, or Python, right? The benefit of interacting with an external environment is that it's a flexible and general way to augment knowledge, computation, feedback, and so on. However, it doesn't have reasoning, and we will see later why that's troublesome. So the idea of this work called ReAct is actually very simple. You have these two paradigms, reasoning and acting, and before ReAct, language models were either generating reasoning or generating acting. For ReAct, the idea is to just generate both. And we will see that it's actually a great way to synergize the two, in the sense that reasoning can help acting and acting can help reasoning. It's actually quite simple and intuitive; you will see later that you can argue that's how I solve tasks, or how you solve tasks.
It's a very human way to solve tasks, and it's very general across domains. So the idea of ReAct is very simple. Suppose you want to solve a task. What you do is write a prompt, and the prompt consists of a trajectory that looks like this. You give an example task, and as a human, you just write down how you think and what you do to solve the task, along with the observations along the way. So if you're trying to answer a question using the Google search engine, you think about some stuff, do some search, write that down, and also write down the result from Google, and you keep doing that until you solve the task. You can give this one example, and then you can give a new task. Given this prompt, the language model will generate a thought and an action. The action is parsed and fed into the external environment, which returns some observation. Then the thought, action, and observation are appended into the context of the language model, and the language model generates a new thought and a new action, and so on and so forth. Obviously, you can do this with a single example, which is called one-shot prompting, or with a few examples, which is called few-shot prompting. If you have many, many examples, you can also fine-tune the model to do this. So it's really a way to use language models, whether by prompting or fine-tuning. As a concrete example, let's say you want to answer a question: if I have $7 trillion, can I buy Apple, Nvidia, and Microsoft? I made this slide back in March, and that was a trendy topic at the time. You can write down a prompt like that: you just say, okay, language model, now you're an agent, and you can do two types of actions: you can either Google or you can finish with an answer. And you just need to write down the thoughts and actions. Okay? So that's very intuitive. Let's see what the language model does. This is what GPT-4 did back in March.
So it first generates a thought: first I need to find the market caps of those companies and add them together, so that I can determine if $7 trillion can buy all three companies. Then this triggers an action to search on Google, and the Google search returns this snippet as a result. Fortunately, it contains all the market caps you need. So the ReAct agent acknowledges that: now I have all the market caps; all I need to do is add them together. It uses chain of thought as a calculator, adds them together, gets the result, and thinks: okay, so $7 trillion is not enough; you need additional money to buy all three. I think if you ask today, you'd need even more money, because Nvidia is much higher now. So that's how ReAct solves the task, and you can see it's a very intuitive way, very similar to how humans solve tasks, right? You think about the situation, you do something to get more knowledge or information, and then, based on that information, you think more. Then I tried to be a little more adversarial. Instead of returning all the market caps, I injected this adversarial observation: nothing is found. And here comes the power of reasoning: reasoning actually finds a way to adjust the plan and get the actions to adapt to the situation. Because the search result is not found, maybe I can search for the individual market caps instead, right? So it just searches for the market cap of Apple. Then I tried to be adversarial again: I gave the stock price instead of the market cap. And here reasoning helps again. Based on common sense, it figures out: this is probably the price, not the market cap. So if you cannot find the market cap directly, what you can do is find the number of shares, and then multiply the number of shares by the stock price to get the market cap.
And then you can do that for all three companies, and then you can solve the task. So from this example, you can see that it's not only acting helping reasoning: obviously acting is helping reasoning to get real-time information or do calculation in this case, but reasoning is also constantly guiding the acting, planning and replanning based on exceptions. You can imagine that to solve various QA tasks with something like this, all you need to do is provide different examples and provide different tools. So okay, this is good; we're making progress. What's really cool is that this paradigm goes beyond QA. If you think about it, you can literally use it to solve any task. To realize this, all you need to realize is that many tasks can be turned into a text game. Imagine you have a video game. What you can do is assume you have a video or image captioning model, and you have some controller that can turn a language action into a keyboard action, and then you can literally turn many tasks into text games, and you can literally use ReAct to solve them. So it goes well beyond question answering. After the invention of LLMs, another part of the history is that people from reinforcement learning, robotics, video games, and so on tried to apply this technique. There are many works; I'm only listing one, for example. And the idea is very intuitive: like I said, you can try to turn all the observations into text observations, then use the language model to generate a text action, and then turn the text action into its original format of action, and then you solve the task. But what's the issue with this? Right.
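The thought/action/observation loop described above can be sketched as follows, assuming a hypothetical `call_llm` completion function and a dictionary of tool functions; the `Action: name[argument]` syntax mirrors the format in the ReAct paper, but everything else is an illustrative simplification.

```python
def react(question: str, call_llm, tools: dict, max_steps: int = 8) -> str:
    """Minimal ReAct loop: the model alternates thought and action; actions are
    parsed and executed against the environment, and each observation is
    appended back into the context for the next step."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(context)  # e.g. "Thought: ...\nAction: search[apple market cap]"
        context += step + "\n"
        action_line = step.split("Action:")[-1].strip()
        name, _, arg = action_line.partition("[")
        arg = arg.rstrip("]")
        if name.strip() == "finish":
            return arg  # the final answer
        observation = tools[name.strip()](arg)
        context += f"Observation: {observation}\n"
    return "no answer within budget"
```

Swapping in different tools and different few-shot examples is all it takes to point the same loop at a new task.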
So this is an example from a video game where you're trying to do some household task in a kitchen. And the problem really is, sometimes it's really hard to directly map the observation into the action, because for one, you may have never seen the domain, and second, to process the observation, you actually need to think. But if you don't have this thinking paradigm, all you are doing is trying to imitate the observation-to-action mapping from the prompt or from the fine-tuning examples. So in this case, in sink basin one, there is no peppershaker, so nothing happens. But because the agent doesn't have the capacity to think, it will just keep doing the same thing and keep failing, because the language model is just trying to imitate; it's not really trained to solve the task like an agent. So what we add is actually something very simple: you are literally just adding another type of action, called thinking. And thinking is a very interesting action, because you can think about anything, right? In this video game, you might only be able to go somewhere or pick up something; that's the action space defined by the environment. But you can think about anything. And you can see that the thinking actions are very useful, because they help you plan the situation, keep track of the situation, and adapt if something wrong happens. So as you can see, ReAct is a general paradigm that helps across a variety of tasks, and it's systematically better than doing only reasoning or only acting. So this is interesting, right? And I just want to point out why this is interesting from a more theoretical perspective. So again, abstraction. If you think about all the agents ever, from video games to AlphaGo to autonomous cars or whatever, all the agents,
speaker 1: one common feature is that you have an action space defined by the environment, right? Suppose you're solving a video game, say an Atari game; then your action space is left, right, up, down. You can be very good or very bad, but your action space is fixed. What's really different for a language agent, or LLM agent, or reasoning agent, is that you have this augmented action called reasoning. And what's really interesting about this augmented action is that it can be any language. You can think about anything; it's an infinite space. You can think a paragraph, a sentence, a word, or 10 million tokens. And it doesn't do anything to the world, right? No matter what you think, it doesn't really change the earth or the video game you're playing. All it does is change your own context: it changes your memory, and based on that, it changes your follow-up actions. So that's why I think this new paradigm of reasoning agents is different: because reasoning is an internal action for agents, and reasoning has a very special property, because it's an infinite space of language. Cool. So we've covered the most important part of the talk. I think the history goes on, right? From here we have the paradigm of reasoning agents, and then we have more methods and more tasks, and there's a lot of progress, obviously. I cannot cover everything, so on the method side, I just want to cover one thing today, which is long-term memory. We just talked about what a reasoning agent is: the idea is you have an external environment, be it a video game or the Google search engine or your car or whatever, and the difference of the reasoning agent is that the agent can also think, right?
Another way to think about this is that you have an agent with a short-term memory, which is the context window of the language model. And it's interesting that you can append thoughts and actions and observations to this context. But if you look at this context window of the language model: first, it is append-only, right? You can only append new tokens to the context. Second, you have limited context: it could be a thousand tokens two years ago, a million tokens now, 20 million tokens next year, but you have a limited size of context. And even if we had an unlimited token window, you might have limited attention: you can have a lot of distracting things if you're doing a long-horizon task, right? And lastly, it is a short-term memory because this kind of memory does not persist over time or over new tasks. You can imagine: let's say this agent solved the Riemann hypothesis today, which is really good. But then, unfortunately, if you don't fine-tune the language model, it doesn't change, right? So next time it has to solve it from scratch again, and there's no guarantee it will solve it tomorrow. So an analogy I want to make is that it's kind of like a goldfish. Folk wisdom says a goldfish only has three seconds of memory. You can solve something remarkable, but if you cannot remember it, then you have to solve it again, and that's really a shame, right? So I hope that's motivating enough to introduce this concept of long-term memory. It's just like being a human: you cannot remember every detail of every day, but maybe you write a diary. That's kind of like a long-term memory: you read and write important stuff for your future life, like important experiences, important knowledge, important skills. And hopefully that should persist
over new experiences, right? You can also imagine a mathematician writing a paper on how to prove the Riemann hypothesis. That's kind of like a long-term memory too, because then you can just read the paper and prove it; you don't have to solve it again. So let's look at a very, very simple form of long-term memory in this work called Reflexion, which is a very simple follow-up to ReAct. Let's say you are trying to solve a coding task. This is the task, and you can imagine: you write some program, you run the program, you reason, you do whatever, but at the end of the day, you test it, and let's say it doesn't work; some tests failed. If you don't have long-term memory, then you just have to try again, right? But what's different now is that if you have a long-term memory, you can reflect on your experience. If you wrote a program and failed some test, you can think about it: oh, I failed this test because I forgot about this corner case; if I write this program again, I should remember this. And what you can do is persist this piece of information over time. When you write this program again, you can literally read this long-term memory, and then you can try to be better next time, and hopefully it will improve. This turns out to work really well for various tasks, but in particular for coding, because in coding you have great feedback, which is the unit test result, and you can just reflect on your failure or success, keep track of the experience as a form of long-term memory, and get better. Another way to think about this is that it's really a new way of doing learning. Think about the traditional form of reinforcement learning, right?
You do something and then you get a scalar reward, and what you do is essentially try to backpropagate the reward to update the weights of your policy, and there are many, many algorithms to do that. If you think about Reflexion, it's really a different way of doing learning, because first, you're not restricted to a scalar reward. You can use anything: a code execution result, a compiler error, feedback from your teacher, which is in text, and so on and so forth. And second, it's not learning by gradient descent; it's learning by updating language. By language, I mean a long-term memory of task knowledge, and you can think of this language as affecting the future behavior of the policy. So this is only a very simple form of long-term memory, and follow-up work did more complicated stuff. You will hear about Voyager from Jim later, I guess, where you have a memory of code-based skills. For example, you're trying to play Minecraft, and you learn how to build a sword in this kind of API code; then you can remember it, and next time, if you want to kill a zombie, you can first pull out the skill of building a sword. You don't have to try it from scratch, right? And for another example, in this work on generative agents, the idea is that you have like 25 human-like agents in a small town trying to be human. You know, they have jobs, they have lives, they have social interactions, and so on. You have this episodic form of memory, where each agent literally keeps a log of all the events that happened every hour. That's the most detailed possible diary you can have. And you can imagine that later, if you want to do something, you can look at the log to decide what to work on, right?
Because if you dropped off your kid at some place, you want to retrieve that piece of information and then pick them up. You can also have this form of semantic memory, where you can look at the diary and draw some conclusions about other people and yourself. You can reflect on it and say: okay, this guy is actually very curious, and I actually like video games. And this kind of knowledge can then affect your behavior. So these are different forms of long-term memory. And I think the final step to finishing this part is to realize that you can actually also think of the language model itself as a form of long-term memory. You can learn, and by learn I mean improve: you can improve yourself, or say change yourself, by either changing the parameters of the neural network, which is fine-tuning the language model, or by writing some piece of code or language or whatever into your long-term memory and retrieving it later. Those are just two ways of learning. But if you think of both the neural network and whatever text corpus you have as forms of long-term memory, then you have a unified abstraction of learning. And then you have an agent that has this power of reasoning over a special form of short-term memory, the context window of the language model, plus various forms of long-term memory. In fact, you can show that this is almost sufficient to express any agent. I have this paper called CoALA, which I don't have time to cover today, but I encourage you to check it out. The statement is that you can literally express any agent by the memory, which is where the information is stored; the action space, what the agent can do; and the decision-making procedure.
Basically, given the space of actions, which action do you take? You can literally express any agent with these three parts. So this is a very clean and sufficient way of thinking about any agent. And I want to leave two questions for you to think about, and I have answers in the paper that you can try to retrieve. The first question is: what makes an external environment different from internal memory? Imagine the agent opens up a Google Doc and writes something there. Is that a form of long-term memory, or is that some kind of action that changes the external environment? Or imagine the agent has an archive of the Internet and tries to retrieve some knowledge from there. Is that a kind of action, or is that retrieval from long-term memory? I think this question is interesting because for physical agents, like humans or autonomous cars, it's very easy to define what is external and what is internal: what's outside our skin is external, and what's inside our skin is internal. But I want you to think about how you can even define that for digital agents. And secondly, how do you even define long-term memory versus short-term memory? Suppose you have a language model context of 10 million tokens. Can that still be called short-term memory? Note that those terms come from human psychology and neuroscience. So think about these two questions. Okay, so we have covered some brief history of LLM agents. I also want to talk about the history of LLM agents in the broader context of agents. We have talked about how we started from LLMs and derived various developments of LLM agents. But if you look at the broader agent history, how is the reasoning agent different from all the previous paradigms of agents?
So here I want to give a very, very minimal history of agents, and it's definitely wrong in places. It's just for illustration, so don't take it too seriously. But if you want to write a very minimal history of agents in one slide: at the beginning of AI, the paradigm was called symbolic AI, and you have symbolic AI agents. The idea is kind of like programming: you can program all the rules to interact with all the different kinds of environments, and you have expert systems and so on. And then you have this period of AI winter. And then you have deep learning, and you have this very powerful paradigm of RL agents, usually deep RL agents, with a lot of amazing milestones, from Atari to AlphaGo, and so on and so forth. And only very recently do we have LLM agents. So this is obviously oversimplified, but if I have to put things in one slide, this is kind of the perspective. And remember the examples we looked at at the beginning of the talk: ELIZA is a very typical example of a symbolic AI agent, and LSTM-DQN is a very typical example of a deep RL agent in the text domain. And I think one way to think about the difference between these three paradigms of agents is that the problem is the same: you have some observation from the environment and you want to take some action. The difference is what kind of representation, what kind of language, you use to process from the observation to the action, right? If you think about symbolic AI agents, essentially you are first mapping all the observations into some symbolic state, and then you're trying to use the symbolic state to derive some action. Think of if-else rules: essentially you're just trying to map all the possible complex observations into a set of logical expressions.
speaker 1: And if you think about all the deep RL agents, a very abstract way of thinking about this is that you have many different possible forms of observation. It could be pixels, it could be text, it could be anything. But from a deep RL perspective, it doesn't really matter, because it's mapped into some kind of embedding: it's processed by a neural network into some vectors or matrices, and you use that to derive some actions, right? And in some sense, what's different for a language agent, or a reasoning agent, is that you're literally using language as the intermediate representation to process observation into action. Instead of a neural embedding or a symbolic state, you're literally thinking in language, which is kind of the human way of doing things. And the problem with symbolic states and neural embeddings is that, if you think about it, it takes intensive effort to design those kinds of symbolic agents. If you think about how Waymo is built as an autonomous car, you probably write millions of lines of rules and code. And if you think about all those deep RL agents, for most of them it takes millions of steps to train them. And the problem is that those are kind of task-specific, right? If you write millions of lines of code for an autonomous car, you cannot really reuse that for playing a video game. Similarly, if you train an agent with deep RL, over millions of steps, to play a video game, you cannot use that to drive cars. Language is very different, because first, you don't have to do too much: you already have rich priors from LLMs. That's why you can prompt to build language agents; it's really convenient. And it's very general: you can think about how to drive a car, you can think about how to play a video game, you can think about which house to buy, considering mortgage rates and stuff.
And thinking is very different from symbolic states and deep RL vectors, because the symbolic state and the deep RL vector usually have a fixed size, but you can think arbitrarily long. You can think in a whole paragraph, or in a single sentence. And that brings this whole new dimension of inference-time scaling. That's why, fundamentally, reasoning agents are different. Okay, so I just realized I have just covered the later half of the brief history of LLM agents. We talked about long-term memory and why the methodology is fundamentally different from previous agents. I also want to briefly talk about the new applications and tasks that LLM agents enable. As you could see at the beginning of my talk, the examples were basically question answering and playing games. And if you think about it, that's pretty much the predominant paradigm of NLP and RL. But I think what's really cool about language agents is that they enable many more applications, and in particular what I call digital automation. What I mean by digital automation is: imagine you have an assistant that can help you file reimbursement reports, or help you write code and debug running experiments, or help you find relevant papers, or help review papers. If all of that could be achieved, then everybody could finish undergrad in two years, or a PhD in three years. Everything could be sped up. But if you think about it, before ChatGPT, there was literally no progress. If you think about Siri, which was the state-of-the-art digital agent before ChatGPT, it literally could do almost nothing. And why is that? I think the reason is that you really need to reason over real-world language. If you want to write code, this paradigm of sequence-to-sequence mapping is not enough.
You have to think about what you write and why you write it, and you have to make decisions over open-ended actions over a long horizon, right? But unfortunately, if you look at all the agent benchmarks before the existence of LLM agents, they often look something like this: they are usually very synthetic, very small-scale, and not practical at all. And that has been limiting for the history of agents, because even if you have the best agent algorithm, if you don't have good tasks, how can you even show progress? Let's say we solve some grid game with 100% accuracy. Then what does that mean? So I think the history of LLM agents, on one side, is the methods getting better and better. But an equally, if not more, important side of the history is that we're getting more practical and more scalable tasks. To give a flavor: this task is called WebShop, and I created it with my collaborators in 2021 and 2022. The idea is that you can imagine LLM agents helping you do online shopping. You give the agent a text instruction to find a particular type of product, and it can just browse like a human: it can click links, it can issue search queries, it can check different products and go back and search again. And if it has to search again, it has to explore different items or think about how to reformulate the query. You can immediately notice that the environment is much more practical and much more open-ended than grid world. And once you find a good product, you can click all the customization options and click buy now, and you get a reward from zero to one indicating how good your purchase is. So it's really a very standard paradigm of a reinforcement learning environment, except that the observation and action are in text, and it turns out to be a very practical environment.
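The "standard RL paradigm, except the observation and action are in text" pattern can be illustrated with a toy loop. This is a minimal sketch of the interaction shape, not actual WebShop code; the environment, its actions, and its reward rule are all made up for illustration.

```python
class ToyShopEnv:
    """Toy text environment in the WebShop style: the observation, the
    action, and the (0-to-1) reward all live in text. Entirely invented."""

    def __init__(self, instruction: str):
        self.instruction = instruction
        self.last_click = ""

    def reset(self) -> str:
        return f"Instruction: {self.instruction}\n[Search bar]"

    def step(self, action: str):
        """Return (observation, reward, done), like a standard RL env."""
        if action.startswith("search["):
            return "Results: 1. red mug  2. blue mug", 0.0, False
        if action.startswith("click[buy"):
            # Reward in [0, 1]: how well the purchase matches the instruction.
            reward = 1.0 if "red" in self.last_click else 0.0
            return "Order placed.", reward, True
        if action.startswith("click["):
            self.last_click = action
            return "Product page for " + action, 0.0, False
        return "Unknown action.", 0.0, False

env = ToyShopEnv("buy a red mug")
obs = env.reset()
for action in ["search[red mug]", "click[red mug]", "click[buy now]"]:
    obs, reward, done = env.step(action)
print(reward)  # -> 1.0
```

An agent plugs into this loop by producing the next text action from the text observations seen so far, which is exactly where a prompted language model slots in.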
And WebShop is interesting because it was the first time people built a large-scale, complex environment based on large-scale, real Internet data. At the time, we scraped more than a million Amazon products, and we built this website, and we built an automatic reward system: if you found a product, and here's the instruction, how can you give a reward indicating how well the two match? And you can see it's arguably harder than the grid-world task, because you need to understand not only the images and language in real-world domains, but you also need to make decisions over a long horizon. You might have to explore ten different products, or ten different search queries, to find the perfect match. And on this direction of web interaction, follow-up work has made great progress. Beyond shopping, you can try to solve various tasks on the web, and you can also try to solve other practical tasks, for example software engineering. In this example, SWE-bench is a task where you are given a GitHub repo and an issue: you are given a bunch of files in a repository, and an issue saying this thing doesn't work, help me fix it. And you're supposed to output the file edit that resolves the issue. So it's a very clean definition of the task, but it's very hard to solve, because if you want to solve it, you have to interact with the codebase: you have to create unit tests, you have to run them, and you have to try various things, just like a software engineer. Another example that I think is really cool: the current progress is well beyond digital automation. In this example, from recent work that I really like, the idea is to use reasoning LLM agents to try to find new chromophores.
And what's really cool is that you give the agent a bunch of data about some chemicals, and you give it access to tools like Python or the Internet or whatever, and it can do some analysis and try to propose some kind of new chemical. And also, the action space of the agent is somehow extended into the physical space, because the action, or the suggestion, from the agent is then synthesized in the wet lab. And then you can imagine, you get feedback from the wet lab, and you can use that to improve yourself, and so on. So I think it's really exciting that you can think of language agents not only as operating in the digital domain, but also in the physical domain; not only solving tedious tasks like online shopping, but also more intelligent or creative tasks like software engineering or scientific discovery. Okay, great. So we have finally covered this slide. In summary, I have talked about how we start from LLMs; we have this paradigm of reasoning and this paradigm of acting, they converge, and that brings up more diverse tasks. And we have also covered, on a broader timescale, the paradigms of agents and why this time is different. And also, from a task perspective: in the previous paradigms of tasks in AI, you can think of games, you can think of simulations, you can think of robotics. But LLM agents really bring up this new dimension of tasks, which is to automate various things in the digital world. So we have covered a lot of history, and I just want to summarize a little bit in terms of lessons for doing research. I think, personally, as you can see, it turns out some of the most important work is sometimes the simplest work. You can argue chain-of-thought is incredibly simple, and ReAct is incredibly simple. And simple is good, because simple means general, right?
If you have something extremely simple, you probably have something extremely general, and that is probably the best research. But it's hard to be simple, right? If you want to be simple and general, you need both the ability to think in abstraction, so you have to jump out of individual tasks or data points and think at a higher level, and you also need to be very familiar with the individual tasks, the data, the problem you're trying to solve. Note that it could actually be distracting to be very familiar with all the task-specific methods. Remember, in the history of QA, I covered a lot of task-specific methods. If you are very familiar with some of them, you might end up trying to create an incremental solution on top of them. But if you are familiar with not only QA but a lot of different tasks, and you can think in abstraction, then you can propose something simpler and more general. And in this case, I think learning the history really helps, and learning other subjects helps, because they provide you some priors for how to build abstractions, and they provide ways to think in abstraction. Okay, so that is most of the talk. I will just briefly share some thoughts on the future of LLM agents. Everything before this slide is history, and everything after this slide is kind of the state of the art, or the future. Obviously the future is very multidimensional; there are many directions that are very exciting to work on. I want to talk about five keywords that I think are truly, truly exciting topics. First, they are very new, in the sense that if you get to work on them now, there may be a lot of low-hanging fruit, and you might have a chance to create some very fundamental results. And second, they are somehow doable in an academic setup.
So you don't have to be at OpenAI to do this, though it's still good to be over there. These five topics actually correspond to three recent works of mine, and I will only cover them briefly; if you're interested, you should check out those papers yourself. So the topics are: first, training. How can we train models for agents? Where can we get the data? Second, interface. How can we build the environments for our agents? Third, robustness. How can we make sure things actually work in real life? Fourth, humans. How can we make sure things actually work in real life with humans? And lastly, benchmarks. How can we build good benchmarks? So first, training. I think it's interesting to note that, up until this year, language models and agents were kind of disentangled, in the sense that the people training models and the people building agents were kind of different people. The paradigm was that the model-building people build some model, and then the agent-building people build some agents on top of it, using some fine-tuning or some prompting, mostly prompting. However, those models were not trained for agents. If you think about how a language model is historically trained, it's just a model trained on text. People could never have imagined it would be used to solve chemical discovery or software engineering. So that brings the issue of a discrepancy in the data: the model is not trained to do those things, but then it's prompted to do those things, so the performance is not optimal. And one solution to fix this is that you should train models targeted for agents. One thing you can do is use those prompted agents to generate a lot of data, and then use that data to fine-tune the model to be better at agent tasks.
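The recipe just described, run a prompted agent, keep its trajectories, and turn them into fine-tuning examples, amounts to a small data-conversion step. Below is a minimal sketch under assumed data shapes: the trajectory format, field names, and success filter are all invented for illustration, not from any particular training pipeline.

```python
# Convert agent trajectories into (prompt, completion) fine-tuning pairs.
# A trajectory here is a dict with a final reward and a list of
# (observation, thought, action) steps; only successful trajectories
# (reward == 1.0) are kept as training signal.

def trajectories_to_examples(trajectories):
    examples = []
    for traj in trajectories:
        if traj["reward"] < 1.0:          # filter: train only on successes
            continue
        history = []
        for obs, thought, action in traj["steps"]:
            prompt = "\n".join(history + [f"Observation: {obs}"])
            completion = f"Thought: {thought}\nAction: {action}"
            examples.append({"prompt": prompt, "completion": completion})
            history += [f"Observation: {obs}", completion]
    return examples

demo = [{"reward": 1.0,
         "steps": [("search bar", "I should search", "search[red mug]"),
                   ("results page", "the first one matches", "click[red mug]")]}]
print(len(trajectories_to_examples(demo)))  # -> 2
```

The interesting part is exactly what the talk points out: the thought-and-action traces in these examples are the kind of data the Internet does not contain.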
And this is really good because, first, you can improve agent capabilities not covered on the Internet. You can imagine that on the Internet, which is the predominant source of language model training data, there is not a lot of, say, self-evaluation kind of data. People only give you the well-written blog post, but no one really releases the whole thought process and editing process of how the post was written. And that's actually what matters for agent training. So you can prompt agents to produce those behaviors, and you can train models on them. And I think that's really one way to fix the data problem, because we all know Internet data is running out, and how can we get the next trillion tokens to train models? This is very exciting. And maybe the best analogy is the synergy between GPUs and deep learning. The GPU was not designed for deep learning; it was first designed to play games, and then people explored its usage and found, oh, it's very good for deep learning. And then what happened is that not only did people use existing GPUs to build better deep learning algorithms, but the GPU builders also built better GPUs to fit the deep learning algorithms: you can build a GPU specialized for transformers, and so on and so forth. I think we should also establish this synergy between models and agents. And the second topic is interface. In fact, the human-computer interface has been a subject for decades; it has been a great topic in computer science. And really, the idea is: if you cannot further optimize the agent, you can optimize the environment. If you're trying to write code, even if you're the same person, it makes a difference whether you're doing that in a plain text editor or in the VS Code interface, right?
You're still the same person, you're not any smarter, but if you have a better environment, then you can solve the task better. I think the same thing happens for agents. As a very concrete example, you can imagine: how can an agent search files in an OS? The human interface in the terminal, as we all know, is to use ls, cd, and so on and so forth. It works for humans, but it's not the best interface for agents. You could also do something like defining a new command called search that gives one result, and then an action called next to get the next result. But that's probably still not the best for a language model. In this research, called SWE-agent, what we found is that what turns out to be the best way to help agents search files is to have a specific search command that, instead of giving one result at a time, just gives ten results at a time, and then lets the agent decide which is the best file to look at. And you can actually do experiments and show that you can use the same language model and the same agent prompt, but the interface matters for the downstream tasks. So I think this is a very, very exciting topic, and it's only getting started. And it's a great research topic for academia; you don't need a lot of GPUs to do it. And it's interesting because models and humans are different, so their interfaces should be different too. You cannot expect VS Code, designed for humans, to be the best coding interface for language models. There must be something different, and we need to explore that. And in this case, you can think of the difference as being that we humans just have a smaller short-term memory. If I give you ten results at the same time, you cannot just read them all. That's why for human interfaces, you have to design things in an intuitive way: you have a next button.
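The contrast described here, one result plus a next command versus ten results at once, can be sketched as two toy search commands. This illustrates the design idea only; it is not SWE-agent's actual implementation, and the function names and file set are mine.

```python
# A fake file system: 25 files whose contents all mention the query term.
FILES = {f"src/module_{i}.py": f"def handler_{i}(): ...  # mentions query"
         for i in range(25)}

def search_one_at_a_time(query, _state={"i": 0}):
    """Human-style interface: return one match; call again to get 'next'.
    (Mutable-default state stands in for the terminal's paging state.)"""
    matches = [f for f in FILES if query in FILES[f]]
    result = matches[_state["i"] % len(matches)]
    _state["i"] += 1
    return result

def search_ten_at_once(query):
    """Agent-friendly interface: ten matches in one observation; the model's
    long context can read them all and decide which file to open."""
    matches = [f for f in FILES if query in FILES[f]]
    return matches[:10]

print(len(search_ten_at_once("query")))  # -> 10
```

Same underlying file system, same matches; the only thing that changes is how much of the result set lands in the agent's context per action, which is precisely the interface variable the experiments vary.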
If you do Ctrl-F, you can only read one match at a time. But for models it's actually the reverse, because models have a longer context window. So for the model's Ctrl-F, you should probably just give everything to the model. If you design a better interface, it can help you solve tasks better with agents. It can also help you understand agents better, and understand some of the fundamental differences from humans. And lastly, I want to point out this topic of human-in-the-loop and robustness. There is a very big discrepancy between existing benchmarking and what people really care about in the real world. Think of a very typical agent task, or AI task: say, coding with unit tests. This is a plot from AlphaCode 2, and basically the idea is that if you sample more times, the chance that you have a right submission increases. That's very intuitive. And if you have a unit test, then you can sample many, many times, obviously. And what you really care about is what we call pass@k: can you solve it one time out of 10,000 times, or 100 times, or a million times? It's kind of like solving the Riemann hypothesis: you just need to do it once. What you care about is, if you sample 10 million times, can you solve it once? But if you think about most of the jobs in the real world, it's more about robustness. Suppose you're trying to do customer service. LLMs are already applied to customer service, but sometimes they hallucinate, and there are consequences: if the model messes something up, the company might have to pay compensation, and so on. Arguably, customer service is much easier than coding or proving the Riemann hypothesis, at least for humans, right?
But here it really presents a different challenge, because what you care about is not: can you solve it one time out of a thousand? What you care about is: can you solve it a thousand times out of a thousand? Or rather: will you fail one time out of a thousand? Because if you fail one time, you might lose the customer. So it's more about getting simple things done reliably. And I think that really calls for a different way of doing benchmarking. We have this recent work called tau-bench. The idea is, first, it's a very practical task, which is customer service. Second, the agent is not only interacting with some kind of digital environment; it's also interacting with a human, though a simulated human. So the customer service agent, just like a human customer service agent, needs to interact with both the backend APIs of the company and some kind of user, and the agent really needs to interact with both to solve the task. And the trajectory might look something like this: the human might not give you all the information at the beginning, which is the predominant paradigm of all the tasks right now (think about SWE-bench and so on). The user might just say something like "change flight," and then you might need to actually prompt the user: can you tell me which flight you are changing? And you need to interact with the user over multiple turns to figure out what they need and to help them. This is very different, and it also changes the metric that you care about. You can imagine that for the same task, you can sample the trajectory multiple times with the same user simulation. If you look at the dashed line, it's called pass@k, which measures: if you sample ten times, can you solve it at least one time?
And obviously, as you sample more, the chance that you solve it at least one time increases. But here, you don't care about whether you can solve it one time out of ten times; you care about whether you can solve it ten times out of ten times, because otherwise you might lose the customer. So the solid line measures: as you sample more, what's the chance that you always solve the task, across all the samples? And what we see in today's language models is that they obviously have different starting points, meaning they have different capabilities. But what's really concerning is that they all have this decreasing trend: if you sample more, the robustness always goes down. From small models to big models, they all show a similar trend. The ideal trend should be something more flat: if you can solve something, you should solve the same thing reliably every time. So I just want to point out that I think we also need more effort in bringing more real-world elements into benchmarking, and that requires new settings and metrics. We have a blog post that shares some thoughts on the future of language agents. And one way to think about that is to think about what kinds of jobs they can replace. If you think about it, maybe the first type of task is not that intelligent, but really requires robustness: simple debugging, or doing customer service, or providing simple assistance, time and time again. Second, you need to collaborate with humans. And third, you might need to do very hard tasks: you might need to write a survey from scratch, or discover a new chromophore, and that requires some new ways for the agents to explore on their own.
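Going back to the two curves in the tau-bench discussion: pass@k asks whether at least one of k samples succeeds, while the robustness-style metric (pass^k) asks whether all k succeed. Both can be estimated from n trials with c successes per task using the standard binomial-coefficient estimators; the sketch below uses those formulas, with the example numbers made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples succeeds),
    given c successes observed in n trials."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimate P(all k samples succeed): the robustness-style metric."""
    return comb(c, k) / comb(n, k)

# A task solved 8 times out of 10 trials:
n, c = 10, 8
for k in (1, 2, 4):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

Running this shows the two trends from the talk: as k grows, pass@k climbs toward 1 while pass^k falls, even though both start at the same value at k = 1. An agent that is good at "solve it once" can still be poor at "solve it every time."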
But I think it's in general very useful to think about what jobs they can replace, why they're not replacing those human jobs yet, what's missing, and how we can improve them. Lastly: this talk had limited time, but we're going to have an EMNLP tutorial on language agents in November, and it will be three hours, so hopefully it will be more comprehensive than this.