CS 194/294-196 (LLM Agents) - Lecture 1, Denny Zhou

LLM Agents and the Frontier of Reasoning Capabilities

Media Details

Upload date
2025-05-23 12:59
Source
https://www.youtube.com/watch?v=QL-FS_Zcmyo
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: Okay, thank you. So I will first start with some introduction, and then we'll get the actual contents of this class started. First, my name is Dawn Song. I'm a professor in computer science here at UC Berkeley, and also a co-director of the campus-wide center, the Center on Responsible Decentralized Intelligence. I'm the instructor for this class, and we also have a guest co-instructor, Xinyun from Google, who is also my former student, teaching this class together with me. We also have our great TAs, Alex and Sehoon, and our great readers, Tara and Ashman. So this is the teaching staff who will be working together with you this semester. Great. As everyone here has been seeing, there has been exciting growth in large language models, and the speed of advancement is just astonishing. However, these large language models operate in a very simple manner: they take text as input and produce text as output. What we will cover in this class is the next frontier: large language model agents. Instead of just taking text as input and producing text as output, we use a large language model as the key brain for reasoning and planning for the agent, and enable the agent to interact with external environments: observe the environment and take actions in it. The agent can use external tools, and also external databases and knowledge bases for retrieval, to help it perform these tasks. The rich capabilities of these large language models make these LLM agents very flexible, so they can easily operate in diverse environments without much task-specific training. These LLM agents can interact with different types of environments, including, for example, surfing the web through different APIs online, and they can even be embodied in a robot and operate in the physical world.
And they can sense the environment through different types of inputs, even in multimodal settings, including rich sensory inputs, and take actions in these diverse environments. Through this interaction with complex and diverse environments, they can update their memory, they can learn to use tools, they can interact with humans, and they obtain grounding through these interactions as well. These agents not only interact with environments; they can also interact with other agents through multi-agent interaction and collaboration, including with humans. This multi-agent collaboration can help agents together solve even more complex tasks. So why are LLM agents the next frontier? Why do we need to empower LLMs with the agent framework? For a number of reasons. Solving a real-world task is never just a single shot of taking text input and producing text output; it often involves a trial-and-error process, and leveraging external tools and retrieval from external knowledge can help expand an LLM's capabilities. More importantly, this dynamic agentic workflow can facilitate solving complex tasks by enabling task decomposition, allocation of subtasks to specialized modules, and division of labor for project collaboration. Throughout the course, we will also see that multi-agent generation can help inspire better responses. Even though agents are a fairly recent development, we have already seen agents helping transform a wide range of domains, including education, law, finance, healthcare, cybersecurity, and so on. The development is really exciting and is improving fast. There are many leaderboards for different agent benchmarks that you can see online, and you can see really fast improvements across all these different agent frameworks. Overall, to better enable agent deployment, there are a number of key challenges that we still need to address.
First, we need to improve the reasoning and planning capabilities of agents. Agents tend to make mistakes when performing complex tasks end to end, so it's important to improve their reasoning and planning capabilities, and also to improve embodiment and learning from environmental feedback for these LLM agents. LLM agents are still not efficient at recovering from mistakes in long-horizon tasks. We need to further develop methods for continuous learning and self-improvement for these agents, and also improve the multimodal understanding, grounding, and world-model capabilities of these agents. Also, as I mentioned, multi-agent collaboration can really help agents provide better solutions for tasks, and developing theory of mind helps multi-agent systems develop better as well. Safety and privacy issues are also very important for agents: LLMs are susceptible to adversarial attacks, can emit harmful messages, or can leak private data, and so on. Solving these challenges is really important for deploying LLM agents safely in the real world, as are human-agent interaction and ethics: how to effectively control agent behaviors and design interaction modes between humans and agents, to best enable agents to serve human needs, is also really important. To help students learn and better develop methods to address these challenges, the course has been designed to cover a broad spectrum of topics throughout the different layers of the agent framework and across domains. First, in the class we'll cover key model capabilities, including reasoning, planning, and multimodal understanding. We'll also cover popular real-world agent frameworks, to enable students to learn how to better design agent applications and use various agentic flows easily. This will help students learn to use these agent frameworks for workflow design, retrieval-augmented generation (RAG), and multi-agent systems.
And we'll also cover a number of exciting application domains using these agents, including software and code development, workflow automation, multimodal applications, and enterprise applications. Finally, we'll also cover important topics on agent safety and ethics. To cover this wide range of content, we have assembled an amazing team of guest speakers and researchers. So the class will be led by me and Xinyun, and we have this amazing crew of guest speakers to help cover these important topics in class.
speaker 2: Before my talk, I want to ask one question for everyone: what do you expect from AI? You may take a few seconds to think about it. I can imagine many different answers, like solving the hardest math problems that humans cannot solve, problems that even top mathematicians cannot solve, or discovering a new scientific theory, or even solving AGI. My background is machine learning. In the current days, many people study machine learning not because of machine learning itself, but because they are inspired by transformers, right? As a machine learning person, I have a naive expectation about AI: AI should be able to learn from just a few examples, like what humans usually do. In the past decades, the machine learning community has made great efforts to develop data-efficient methods, like semi-supervised learning, active learning, and so on. And if you look at papers from the past decade, people always bragged about one or two percent gains in their papers. But in practice, actually, those data-efficient approaches, I would say, mostly failed. If you worked on them, don't feel bad about that; I worked on them too, back when I started machine learning, and I was almost confused by it. That led me to think about a different problem: what's missing in machine learning? I thought about it for years, and finally I found the answer. In the current days, in particular for people in this course today, it seems so obvious: it's about reasoning. Humans can learn from just a few examples because humans can reason, not because of data statistics. It sounds so straightforward. Let's start from a toy problem. In my research, I usually prefer very simple problems that still capture the core of the challenge. This problem is called last-letter concatenation. If you are familiar with the neuro-symbolic literature, you'll find similar problems there. For this problem, given a person's name as input, the output should be the concatenation of the last letters of each word in the name.
For example, for "Elon Musk", the last letter of "Elon" is "n", and the last letter of "Musk" is "k", so the output is "nk". This is so simple. And if you had seen this problem a few years ago, you probably would have tried to solve it with a machine learning model. For example, you could use a transformer model with an encoder-decoder architecture: X goes into the encoder, Y comes out of the decoder. And then you would find that you probably need tens of thousands of labeled examples to train the model, and finally you could report an accuracy of 85% or 90% or something. Now, think about machine learning methods for such a simple task — I mean, simple for humans, okay? If a method requires a vast amount of labeled data to learn it, would you like to call it AI or not? AI means artificial intelligence. I suppose an intelligent model should be able to learn this task from just one or two examples. Now let's see how this problem can be solved using large language models. I suppose most people here know large language models, but the professors asked me to explain what LLMs are anyway. An LLM is a transformer model trained to predict the next word. For example, given the text "AI is the future", we mask "future" and use "AI is the" as the input, and ask the model to predict the next word. If the predicted word is not "future", we need to adjust the model's parameters to make it produce the correct word; the machinery for that is called backpropagation. Of course, you can train the model with many, many sentences; for example, you can use all the text from the Internet. If you don't want to go into the details, you can simply think of training LLMs as training parrots to mimic human language. Actually, after I used this sentence, one guy contacted me and said he was very experienced at training parrots — why did I switch to models? Okay. And then, to generate, we just mimic the process seen in training.
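The last-letter concatenation task is trivial to specify in code, which is exactly what makes it a good probe. A minimal sketch of a reference implementation (my own, not from the lecture):

```python
def last_letter_concat(name: str) -> str:
    """Concatenate the last letter of each word in a name.

    e.g. "Elon Musk" -> "nk" ("Elon" ends in "n", "Musk" ends in "k").
    """
    return "".join(word[-1] for word in name.split())
```

A task this small is two lines of Python, yet, as the lecture notes, a supervised seq2seq model would still need thousands of labeled pairs to approximate it.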
We can use whatever we want as input and see what the output is: the model just predicts the next token, then appends the generated token to the input and predicts the next token again — that's how you get answers from an LLM. For this problem, we can simply concatenate all the examples we have as the input, together with a test example, "Barack Obama" here. We can try this with any LLM and see what happens, and you'll probably see the output is wrong. The correct output should be "ka", right? Because "k" is the last letter of "Barack" and "a" is the last letter of "Obama". This setup is called few-shot prompting. It just mimics the machine learning process: instead of training a model, we just use the examples as the input — that's the only difference. In the current days, the heart of this prompting idea is that we need to add a derivation process before the answer. We just add the explicit derivation here: the last letter of "Elon" is "n", the last letter of "Musk" is "k", concatenating "n" and "k" gives "nk" — like that. It's called a reasoning process. And similarly for "Barack Obama". Now we use this as the new input, and we get a perfect response from the large language model. So, just like for humans, one demonstration is enough to get an accuracy of 100%. That's exactly what I was looking for. We cannot imagine any machine learning method achieving this perfect generalization — there's no way. By the way, don't over-read what I said about machine learning: machine learning is still so useful and important for doing research. In the current days, I see many naive mistakes on social media, in the news, even in papers at top conferences — elementary mistakes, mostly from people who have no background in machine learning and just rely on intuitive ideas. Now, this idea of adding intermediate steps was proposed many years ago in the literature.
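The two prompting styles contrasted above can be sketched as plain string templates. A hedged illustration (the wording is mine, not the exact prompts from the talk):

```python
def few_shot_prompt(question: str) -> str:
    # Few-shot: input-output pairs only, no reasoning shown.
    return ('Q: "Elon Musk"\nA: nk\n\n'
            f'Q: "{question}"\nA:')


def cot_prompt(question: str) -> str:
    # Chain-of-thought: the demonstration spells out the intermediate
    # steps before the answer, so the model imitates the derivation too.
    return ('Q: "Elon Musk"\n'
            'A: The last letter of "Elon" is "n". '
            'The last letter of "Musk" is "k". '
            'Concatenating "n" and "k" gives "nk". The answer is nk.\n\n'
            f'Q: "{question}"\nA:')
```

The only difference between the two is the derivation text inside the single demonstration, which is the lecture's point: one reasoning-annotated example is enough to change the model's behavior.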
The first amazing paper I know of is by researchers at DeepMind, published at ACL 2017. In their paper, they used natural language rationales to solve math problems: they derived the final answer through a series of small steps, and then they trained a sequence-to-sequence model from scratch. If you know the chain-of-thought work, you'll be surprised by this paper — the authors are just like time travelers. Then, in 2021, a team at OpenAI published an amazing dataset called GSM8K. They followed the idea of that 2017 paper: in this dataset, every problem is followed by multiple steps of text as the solution, and also a final answer. This team created this amazing dataset and used it to fine-tune the GPT-3 model; they greatly scaled up the work by DeepMind. In the same year, 2021, a group of researchers at Google Brain — now part of Google DeepMind — did the work "Show Your Work: Scratchpads for Intermediate Computation with Language Models". They discovered the similar idea independently, but in the domain of program synthesis; that's why they used actual program symbols there instead of natural language. And probably many people know our work on chain-of-thought prompting. Actually, "chain of thought" is not a term we invented; it's just a common English phrase meaning a sequence of thoughts. In this work, we extensively evaluated prompting with intermediate steps and showed amazing results on almost every NLP task. So let's put it all together: in 2017, a DeepMind paper trained with intermediate steps; in 2021, an OpenAI paper fine-tuned LLMs with intermediate steps; and in 2021-2022, prompting with intermediate steps. Which part is more important? You can see it here: actually, it doesn't matter whether you are training, fine-tuning, or prompting the model. What really matters is the intermediate steps — that's the key.
So let me summarize here: regardless of training, fine-tuning, or prompting, when provided with examples that include intermediate steps, LLMs will generate responses that also include intermediate steps. Keeping that in mind, here's another question: is it helpful to introduce reasoning strategies in those examples? Humans, when they solve a problem, may have a strategy for solving it. One such case from our team, one I'm most proud of, is least-to-most prompting: we enable easy-to-hard generalization by decomposition. Probably many people have seen this famous book, "How to Solve It" by Pólya, a classic book for math education. There's a chapter about decomposition — if you go into the details, you may lose yourself in the details. Notice the difference made by decomposition. Take this math problem — by the way, the math in this talk is at an elementary level; every time before I give a talk, I rehearse it with my daughter, and she reviews it for me — "Elsa has 3 apples. Anna has 2 more apples than Elsa. How many apples do they have together?" The difference is that we first show the language model how to break down the problem into subproblems, and then solve them one by one. That's why it's called least-to-most: from the least complex to the most complex problems. It's a simple idea, but surprisingly useful: we show the model how to decompose complex tasks into simple tasks. Here is the SCAN task for compositional generalization; you can look at the examples here: given a natural language command, translate it into a sequence of actions that could be executed by a robot, something like that. If you use least-to-most prompting, you get an accuracy of 99.7%, and we used just 0.1% of the demonstration examples. Why did I show this task? I actually learned of this task from Xinyun, who is here today, and she invented a beautiful neural-symbolic approach to solve this task many years ago.
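Least-to-most prompting is a two-stage pipeline: first ask the model to decompose the problem, then solve the subproblems in order, feeding each answer back into the context. A minimal sketch under stated assumptions — `llm` is a placeholder for any text-completion call, and the prompt wording is illustrative, not the paper's:

```python
def least_to_most(problem: str, llm) -> str:
    """Two-stage least-to-most prompting (sketch).

    Stage 1 asks the model to decompose the problem into subquestions;
    stage 2 answers them in order, appending each Q/A pair to the
    context so harder steps can build on the easier ones.
    """
    # Stage 1: decomposition.
    plan = llm(f"To solve '{problem}', what subproblems must be solved first? "
               "List one per line.")
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2: sequential solving, least to most complex,
    # finishing with the original question.
    context, answer = problem, ""
    for sq in subquestions + [problem]:
        answer = llm(f"{context}\nQ: {sq}\nA:")
        context += f"\nQ: {sq}\nA: {answer}"
    return answer
```

The accumulated `context` is the mechanism behind the easy-to-hard generalization: the final (hardest) question is asked only after the subanswers are already in the prompt.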
When I looked at this task, I was very surprised: it looks so straightforward for humans — why couldn't machines solve it? And finally we made it. I don't know if everyone knows the concept of compositional generalization. Roughly speaking, it means the test examples are more difficult than the training examples or the prompting examples. For example, for text-to-code problems, the test problems are longer, or the code has a little bit of change. There's a variant called dynamic least-to-most prompting, where we used just one percent of the data and achieved great results, way better than the state-of-the-art results in the literature — results that were obtained with specialized architecture design and training, and, of course, additional data. So far, any questions? Otherwise I'll go to the next section. Okay. I suppose this part is quite intuitive for everyone. I have two kids: my daughter is ten years old and my son is seven. Actually, when the chain-of-thought prompting paper came out, I heard a very interesting conversation between my daughter and my son. My daughter asked her little brother, "What's 17 times 3?" The little brother said, "I don't know." Then she asked, "What's 10 times 3?" "30." "What's 7 times 3?" "21." "So what's 17 times 3?" "Oh yeah, I know, 51!" And the funny thing is, my daughter shouted, "Daddy, chain-of-thought prompting also works in my little brother's brain!" Okay, now: why are intermediate steps helpful? You might say that's so natural for humans, but in our own research we dug deeper — after all, our language models are just machine learning models, and we want to understand what happened. This year we have work published at ICLR 2024, in collaboration with brilliant theoreticians from Stanford.
In that work, we gave a rigorous mathematical analysis. Here are two results. A transformer generating intermediate steps can solve any inherently serial problem, as long as its depth exceeds a constant threshold — a constant, meaning independent of the input length. However, a transformer generating direct answers either requires huge depth to solve such problems or cannot solve them at all. Let me say something about the practical implications of this theory: if you want to solve harder problems, you should generate more intermediate steps; and you could also call external tools, such as search, to help produce intermediate steps. In this LLM agents course, many speakers will talk about how to use external tools, and you can think about how to deal with LLMs' limitations. One of my hobbies is finding problems my daughter can solve in seconds but LLMs fail at. Okay. So now we have talked about how to use examples to trigger LLMs to generate answers step by step. Is it possible to trigger step-by-step reasoning without using examples at all? Here's an amazing work — actually, when this paper came out, I thought it was a joke; it turned out not to be, and then I was inspired a lot. It's called "Let's think step by step". Given a question, we don't need any examples; we just need to say "Let's think step by step", and the model can generate a reasoning process. It's really cool, but, you know, since there are no examples, the approach performs worse than few-shot prompting. One may wonder: can we keep the zero-shot setting but do much better? This leads to another work of ours, called "Large Language Models as Analogical Reasoners".
So again, back to this beautiful book, "How to Solve It". In the book, Pólya says that to solve math problems by analogical reasoning, when you see a new problem, you should first ask yourself: do you know a related problem, or a related strategy? I also really like this quote — if you have studied functional analysis, you will know Banach spaces — I was really amazed by this line from Banach: the best mathematician is the one who can see analogies between analogies. So, given a simple problem, of course we could say "Let's think step by step", but now we can do it in a different way: we ask the model to recall a related problem, and then solve this one. You can see that the model will indeed recall relevant examples and knowledge here — unless it has seen exactly the same problem. That's useful; that's amazing. And we evaluated this on a bunch of benchmarks and saw that it works really well. You can see the last row: LLMs as analogical reasoners, via a prompt. Of course, you can optimize the prompt further yourself. The most important thing here is that it's much better than just saying "Let's think step by step" (zero-shot CoT), and even better than few-shot CoT. This approach even outperforms manual CoT, mainly because the model automatically generates related problems and knowledge tailored to each different problem. Here are results on BIG-Bench — great performance — and promising results on Codeforces competitive programming. If you are interested in competitive programming, you could try this approach. What we didn't do here is scaling: maybe you could search the web for related problems and knowledge for the problem you want to solve.
So the key idea here is to directly generate relevant examples and knowledge for each given problem, instead of using a fixed set of examples as in manual chain-of-thought prompting. Okay. Now we have seen that we can use few-shot examples to show a model how to reason step by step, and that we can do zero-shot without any examples, just saying "Let's think step by step". Now I could ask another question: is it possible to trigger step-by-step reasoning even without any prompt like "Let's think step by step"? You could say, okay, the models in industry are already trained on data mixtures that include many such examples, whether in pretraining or fine-tuning. Well, we found something interesting in our recent work "Chain-of-Thought Reasoning Without Prompting": without any prompt, without saying anything, just give the problem to the model. Let's look at an example: "I have 3 apples. My dad has 2 more apples than me. How many apples do we have together?" For this example, the approach is actually a very simple decoding change. At the first decoding step, instead of taking only the most likely token, we keep the top candidate tokens — here, the top five — and then, starting from each of these first tokens, we continue with ordinary greedy decoding. The first candidate continuation is "5 apples"; another is just "5". But if you start from the token "I", the full generation becomes: "I have 3 apples, my dad has 2 more apples than me, so he has 5 apples..." — so we didn't say anything about reasoning, yet the model can do some reasoning if we start from different tokens. Here's another example: "Was Nicolas Cage born in an even or odd year?" One candidate starts with "Nicolas", another starts with "Even" followed by a period, a third with "Odd" followed by a period, and so on.
Now you might say, okay, if the model happened to see chain-of-thought-style text in pretraining, this is perhaps not surprising: the longer generations just mean the model can do some reasoning steps. Actually, the surprising thing is to look at the probabilities of the final answer. If you look at the first row here, "Nicolas Cage was born in an odd year", the answer confidence is quite low. However, when there's a reasoning path — the last one, "Cage was born in 1964, an even year" — there's a reasoning process, and the probability of the final answer jumps to 0.98. That's amazing, right? It seems the model is well calibrated. I was really surprised when I saw those probabilities: the generations that directly answer "even" or "odd" have really low confidence. So the key observation is: pre-trained LLMs already have step-by-step reasoning paths among the generations started from the top-k first tokens — we don't need to build any prompt, it's not needed — and there is higher confidence in the final answer when a step-by-step reasoning path is present. Here is a comparison between greedy decoding and chain-of-thought decoding: chain-of-thought decoding performs much better. Any questions here? Now let's move to the next topic. Generating intermediate steps is helpful, really helpful — but should you always generate intermediate steps instead of direct answers? Probably it depends on your problem. Actually, in the current days, we need to always keep in mind that LLMs are probabilistic models of generating next tokens. They are not humans, no matter whether we use chain-of-thought examples or not. Keep this in mind: it's a probabilistic model, so let's see what an LLM does in decoding. It actually takes the argmax of the probability of the reasoning path and the final answer, jointly, given the problem.
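The selection rule described above can be sketched in a few lines. Branch on the top-k first tokens, decode each branch greedily, then score each branch by the model's confidence over its answer tokens — here taken, as in the paper, to be the average margin between the top-1 and top-2 token probabilities. The candidate data below is made up for illustration:

```python
def answer_confidence(answer_token_probs):
    """Mean margin between the top-1 and top-2 token probabilities over
    the answer span; a large margin means a confident answer."""
    margins = []
    for dist in answer_token_probs:
        top1, top2 = sorted(dist, reverse=True)[:2]
        margins.append(top1 - top2)
    return sum(margins) / len(margins)


def cot_decode(candidates):
    """candidates: (generated_text, answer_token_probs) pairs, one per
    top-k first token. Return the text whose final answer the model is
    most confident in."""
    return max(candidates, key=lambda c: answer_confidence(c[1]))[0]


# Toy branches: a direct guess with a weak margin vs. a reasoning path
# with a strong one (probabilities invented for illustration).
branches = [
    ("Odd.",                            [[0.55, 0.45]]),
    ("Cage was born in 1964, so even.", [[0.98, 0.02]]),
]
best = cot_decode(branches)  # selects the high-confidence reasoning branch
```

This mirrors the lecture's observation: no reasoning prompt is involved anywhere; the reasoning path is surfaced purely by branching at the first token and ranking by answer confidence.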
However, what we want is the argmax of the probability of the final answer given the problem, right? That's what we learned in machine learning. This doesn't mean the reasoning path is not important — but first we have to make sure the final answer is correct, and then look at the reasoning path. The two are not aligned; they are two different objectives. Okay, now let's look at this further. Given a problem, to compute the probability of the final answer, we should sum over all possible reasoning paths. That's marginalization, as we learn in a probability course, right? Given a math problem, you could find different solutions that lead to the same answer. So when you want to compute this sum, how do you compute it? If you have studied machine learning, you know the answer: sampling. So we sample here. This leads to our work on self-consistency. Probably many people know self-consistency, but I really want you to see the underlying motivation — how we approached this problem from first principles in machine learning. Let's look at the question here: we have this math problem, and we sample the answer multiple times, and finally we see that the most frequent answer is 18. What we are computing here is the most frequent answer, not the most frequent reasoning path — that's a huge difference. The reasoning path here is latent, marginalized out. This idea is so simple, but by using self-consistency we simply crushed the state-of-the-art results in the literature at the time. And I tell junior researchers: it's really just about your idea; you don't have to write a lot of equations. Of course, our explanation of self-consistency is about probability — it's about sampling. You can imagine that more consistent results are more likely to be correct. Look at the plot here: if the consistency is more than 80%, then the accuracy is nearly 100%. That's a key point.
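Operationally, self-consistency reduces to a majority vote over sampled final answers, with the reasoning paths treated as latent and discarded. A minimal sketch — the `sample` callable stands in for one temperature-sampled model call that returns only the extracted final answer:

```python
from collections import Counter


def self_consistency(sample, n=40):
    """Monte Carlo approximation of argmax_a P(answer = a | problem):
    draw n (reasoning path, answer) samples, keep only the answers,
    and return the most frequent one."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Note the design choice the lecture emphasizes: the vote is over final answers, not over reasoning paths — different paths that reach the same answer all count toward it.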
So, when the LLM outputs a direct answer without intermediate steps, should we still sample several times and then choose the most common answer? Anyone? ... Yeah, okay, great. The answer is no: when the answer is one token, sampling just reflects the token probability, so repeated sampling adds nothing over picking the maximum-probability answer directly. And a further question: what if we change self-consistency by letting the model generate multiple responses in one pass, instead of sampling multiple times, and then choose the most common answer — does this make sense? No. For both questions, we just need to follow this principle: maximize the probability of the final answer given the problem. That's all you need to understand self-consistency; it's a very, very simple principle. And if you know machine learning, you probably know this is called marginal inference. Okay, one more thing: how about free-form answers? For that we proposed universal self-consistency. This idea is a little bit different, but related, so I put it here. Given this problem — "Where do people drink less coffee than they do in Mexico?" — if you look at the answers, each answer is worded differently from the others, but the most common content across responses covers Japan, China, and India, so we ask the model itself to pick the most consistent response. Any questions? Otherwise I'll move to the next section. To recap: self-consistency means sampling multiple times and then choosing the most frequent answer as the final answer. Next I'll talk about limitations. The first one: LLMs can be easily distracted by irrelevant context. From psychology studies, we know irrelevant information may significantly decrease children's, and even adults', problem-solving accuracy. We wanted to check whether this observation holds for LLMs. So here are some problems; the highlighted text is a manually added sentence that is irrelevant to the original problem. And you see, after adding it, the model gets the solution wrong.
If we add a prompt like "ignore irrelevant context", the model immediately notices the distractor and gets it right. But it's still hard to fix the problem when we make the irrelevant context bigger: we can simply add irrelevant sentences like "the sky is blue and the grass is green" and make the input long, and you will see a significant performance drop across all LLMs. The next thing I'm going to talk about is that LLMs cannot self-correct reasoning yet. Let's start from a math problem — actually, this problem is tricky, if you look at it. We see that the model gave a wrong answer, and then we prompt the model with "review your previous answer and find problems with your answer". Interestingly, after reviewing, the model recognized the mistake — and perhaps this looks amazing, right? Then we say, "based on the problems you found, improve your answer", and the final answer here is correct. However, if the original answer was already correct and we use the same prompt, the model can then make a mistake. That's the problem. So overall: while allowing LLMs to review their generated responses can help correct inaccurate answers, it may also risk changing correct answers into incorrect ones. We ran extensive studies on benchmarks like GSM8K, CommonsenseQA, and HotpotQA, and we didn't see any improvements from self-correction methods; they just make things worse. Then how do some papers in the literature show improvements? Let's look at the reported improvements in reasoning: actually, they use oracle answers. You can see "oracle" here: oracle means you only prompt the LLM to correct its answer when you know the answer is wrong. The problem is that, in reality, the model doesn't know whether its answer is correct or wrong — you tell it to correct itself because you know the answer is wrong. Another proposal to mitigate this is multi-agent debate: a group of multiple LLMs debate each other to reach an agreement, or consensus.
We also tried this approach, and the fair comparison is by the number of responses generated: for example, if we have 3 LLMs and each generates an initial response, there will be 3 responses; after one round of debate, there will be 9 responses in total. So we compare it against self-consistency with 9 responses, and let's see what happens. We found that multi-agent debate cannot outperform self-consistency — and self-consistency is really simple: just sample multiple times and take the most frequent answer as the final prediction. So the lesson we learned here is that oracle feedback is needed for LLMs to self-correct. A related work of ours is self-debugging, which naturally leverages unit tests as the oracle for coding problems: when you write code, you have unit tests to try against. Actually, we started this work quite early, and we didn't position it as a self-correction work. And finally, the last limitation I'll talk about is that premise order matters in LLM reasoning. You know, in the current days, every time a frontier model is released, the release shows great results on the standard benchmarks, and given how high all the numbers are, you can hardly tell the models apart from those numbers alone. So one thread in my team is to generate different variants of tasks to test the models. Here we just did a simple trick: given an original GSM8K problem, we reorder the sentences a little bit and see if the model can still solve it. In the original problem there's a sentence in the middle; we just move that sentence toward the end, making sure the problem is still well defined, and see what happens. And we noticed that there is about a 10-point drop in solve rate across all frontier models. So we can compare the model's responses on the original problem versus the reordered problem.
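The perturbation itself is trivial to implement, which is part of the point: the facts are untouched and only their order changes. A hypothetical version of the reordering trick (my own sketch, not the authors' code):

```python
def reorder_premises(premises, question, idx):
    """Move the idx-th premise sentence to the end, before the final
    question. Every fact is kept; only the order changes, so the
    reordered problem is logically identical to the original."""
    moved = premises[:idx] + premises[idx + 1:] + [premises[idx]]
    return " ".join(moved + [question])
```

A human solver is indifferent to this transformation; the roughly 10-point drop the lecture reports suggests the models are not.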
It seems the models just know how to process a problem sequentially; they cannot go back and forth over it. One could say, okay, maybe this is related to some semantic understanding issue. For that reason, we designed another task. It is pure logical inference, even more pure than math problems; we don't even use real words, just random tokens. Given a set of rules and facts, the model does logical inference to answer a query. In the original problem, the rules are ordered according to their use in the inference process; note that not all rules are necessary for each query. As an alternative, we randomly reorder those rules, but only the rules relevant to the query; rules not relevant to the query keep their original positions. Surprisingly, we saw a thirty-point drop across all frontier LLMs. From my personal experience, I think it's really important to design experiments like this when doing research, to test a hypothesis. Okay, now let me summarize the points here. The first key point is that generating intermediate steps improves LLM performance a lot. You can do training, fine-tuning, or prompting with intermediate steps, and you can also use zero-shot prompting, analogical reasoning, or some kind of special decoding like the chain-of-thought decoding I presented today. Also, self-consistency greatly improves step-by-step reasoning, no matter whether the steps come from few-shot prompting or from a fine-tuned model. And we also saw a lot of limitations: irrelevant context, self-correction, and premise order all matter for reasoning performance. So when it comes to what will probably come next, I think the most interesting point is this: we shouldn't just say we work on AGI or will solve AGI. That's not a problem statement.
The real problem is to find the right problem to work on and solve it from first principles, not just to incrementally improve numbers on benchmarks. And lastly, machine learning is still super important here. Actually, I'm currently involved in organizing the Conference on Language Modeling (COLM) with a bunch of amazing people; it's the first-ever conference dedicated to language modeling. You are all welcome to join. Yeah, that's it. Thanks.

Latest Summary (Detailed Summary)

Generated 2025-06-07 15:40

Overview / Executive Summary

This lecture transcript consists of two main parts. In the first part, Speaker 1 introduces the CS 194/294-196 (LLM Agents) course, emphasizing the importance of large language model agents (LLM agents) as the next frontier. The LLM serves as the agent's core "brain," responsible for reasoning and planning, enabling the agent to interact with external environments, observe, take actions, and use external tools and databases. These agents are flexible, adapt to diverse environments, and can be applied in many domains such as education, law, and finance. However, their development faces challenges in reasoning and planning, learning from environmental feedback, continual learning, multimodal understanding, multi-agent collaboration, safety and privacy, and human-agent interaction and ethics.

In the second part, Denny Zhou (Speaker 2) takes a deep dive into the reasoning mechanisms, key ideas, and limitations of large language models (LLMs). The core point is that generating intermediate steps (e.g., chain-of-thought, CoT) dramatically improves an LLM's ability to solve complex problems, enabling human-like few-shot learning. Key techniques include: demonstration-based elicitation (CoT prompting), problem decomposition (least-to-most prompting), demonstration-free elicitation (e.g., "let's think step by step"), analogical reasoning, and chain-of-thought decoding without explicit prompts. In addition, "self-consistency," which samples multiple reasoning paths and selects the most common answer, improves reasoning robustness. However, current LLM reasoning has clear limitations: it is easily distracted by irrelevant information, cannot reliably self-correct errors (absent external "oracle" feedback), and is highly sensitive to the order of premises. Future research should focus on defining the right problems, questioning existing prompting paradigms, and developing models that can learn autonomously and overcome these limitations.

Course Introduction and LLM Agents (Speaker 1)

Speaker 1 first introduces the course CS 194/294-196 (LLM Agents) and its teaching staff: the instructor "dson" [uncertain; likely Dawn Song based on pronunciation], a professor who also co-directs the Center on Responsible Decentralized Intelligence, the co-instructor "Singing" from Google [uncertain; likely Xinyun based on pronunciation], and the TA team (Alex, "Seahuand" [uncertain], Tara, Ashman).

Core Concepts and Importance of LLM Agents
* Definition: use a large language model as the core "brain" for reasoning and planning, enabling the agent to interact with external environments, observe them, and take actions.
* Capabilities:
* Use external tools and databases (e.g., knowledge bases) for retrieval.
* Operate flexibly in diverse environments without much task-specific training.
* Interact with different types of environments, e.g., browsing the web through APIs, or even being embodied in a robot operating in the physical world.
* Perceive the environment through multimodal inputs and take actions in diverse environments.
* Update memory, learn to use tools, interact with humans, and obtain grounding through interaction with complex environments.
* Engage in multi-agent interaction and collaboration with other agents (including humans) to solve more complex tasks.
* Why the "next frontier":
* Real-world tasks typically involve trial-and-error processes and the use of external tools.
* Retrieving information from external knowledge extends the capabilities of LLMs.
* Dynamic agent workflows help solve complex tasks through task decomposition, assigning subtasks to specialized modules, and division of labor in project collaboration.
* Multi-agent interaction helps elicit better responses.
* Application domains: transformative potential has already been shown in education, law, finance, healthcare, cybersecurity, and more, with rapid progress visible on leaderboards for various agent benchmarks.

Key Challenges for LLM Agents
To better deploy LLM agents, the following key challenges still need to be addressed:
1. Improving reasoning and planning: agents easily make mistakes when executing complex end-to-end tasks.
2. Improving embodiment and learning from environmental feedback: LLM agents are not yet efficient at recovering from mistakes in long-horizon tasks.
3. Continual learning and self-improvement: related methods and capabilities need further development.
4. Multimodal understanding, grounding, and multimodal capabilities.
5. Multi-agent collaboration: developing "theory of mind" would help multiple agents collaborate better.
6. Safety and privacy: LLMs are vulnerable to adversarial attacks and may leak harmful information or private data.
7. Human-agent interaction and ethics: how to effectively control agent behavior and design human-agent interaction patterns that best serve human needs.

Course Content Design
To help students learn and develop methods for tackling these challenges, the course covers a broad range of topics:
* Core model capabilities: reasoning, planning, multimodal understanding.
* Popular real-world agent frameworks: learning to design agent applications and use various agent workflows.
* Workflow design: using retrieval-augmented generation (RAG) and multi-agent systems.
* Application domains: software code development, workflow automation, multimodal applications, enterprise applications.
* Important topics: safety and ethics of LLM agents.
* The course will feature many guest lecturers and researchers.

Key Ideas in Large Language Model Reasoning (Speaker 2 - Denny Zhou)

Denny Zhou (Speaker 2) notes that AI is expected to solve complex problems and, like humans, learn from few examples, but traditional machine learning performs poorly at the latter, mainly because it lacks reasoning. LLMs (typically Transformer models trained by next-token prediction) offer a new path here.

1. Deriving Answers via Intermediate Steps (Chain-of-Thought, CoT)
* The core idea is to prompt the LLM to generate a "reasoning process," or intermediate steps, before the final answer.
* Cites Ling et al. (2017, DeepMind, published at ACL), who solved math problems using natural language rationales, deriving answers through a sequence of small steps and training a sequence-to-sequence model from scratch. Denny Zhou praised this work as visionary.
* Cobbe et al. (2021, OpenAI) released the GSM8K dataset, in which every problem comes with a multi-step textual solution and a final answer, and used it to fine-tune GPT-3.
* Nye et al. (2021, Google Brain), in "Show Your Work," independently discovered a similar idea for program synthesis.
* Wei et al. (2022), from Denny Zhou's team, in "Chain-of-Thought (CoT) Prompting," broadly evaluated eliciting intermediate steps via prompting and showed striking results on almost all NLP tasks.
* Even a single demonstration containing such steps lets the LLM solve similar problems with high accuracy, mimicking human few-shot learning.
* For example, on the task of concatenating the last letter of the first name with the last letter of the last name (e.g., Elon Musk -> NK), traditional machine learning needs thousands of examples to reach roughly 85-90% accuracy, whereas an LLM with CoT reaches 100% accuracy from a single demonstration.
* Core conclusion: "what really matters is the intermediate steps." Whether via training, fine-tuning, or prompting, providing examples with intermediate steps encourages the LLM to generate similar step-by-step solutions.
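As a rough illustration of the last-letters example above, a one-shot CoT prompt can be assembled as follows (the demonstration wording is my own, not the exact prompt from the paper):

```python
# Sketch: building a one-shot chain-of-thought prompt for the
# last-letter-concatenation task. The demonstration text is illustrative,
# not the exact prompt used in the CoT paper.

def build_cot_prompt(question: str) -> str:
    """Prepend one worked demonstration whose answer shows intermediate steps."""
    demo = (
        'Q: Take the last letters of the words in "Elon Musk" and concatenate them.\n'
        'A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". '
        'Concatenating "n" and "k" gives "nk". The answer is nk.\n\n'
    )
    return demo + f"Q: {question}\nA:"

prompt = build_cot_prompt(
    'Take the last letters of the words in "Bill Gates" and concatenate them.'
)
```

The single worked demonstration is what nudges the model to emit its own intermediate steps before the final answer.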

2. Incorporating Reasoning Strategies (Least-to-Most Prompting)
* It is beneficial to include not just steps but a reasoning strategy in the demonstrations.
* Least-to-most prompting: decompose a complex problem into a sequence of simpler subproblems, then solve them in order.
* Inspired by George Pólya's decomposition-and-recombination principle in his book "How to Solve It."
* Example: for the word problem "Esther has 3 apples; her father has 2 more apples than her; how many apples do they have in total?", first guide the LLM to decompose the problem, then solve it step by step.
* The method is highly effective on compositional generalization tasks (e.g., text-to-code on SCAN and CFQ), reaching near-perfect or greatly improved results with only a tiny fraction (e.g., 0.1% to 1%) of the demonstration data, far surpassing SOTA results in the literature (which typically rely on specialized architectures and training on the full dataset).
* Denny Zhou mentioned that the SCAN task was proposed years ago by "Xu Jin" [uncertain, inferred from pronunciation] and solved with an elegant symbolic approach.
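The two-stage pipeline described above can be sketched as follows, assuming a generic `llm(prompt) -> str` completion function (hypothetical; any model API could stand in):

```python
# Sketch of the two-stage least-to-most pipeline. `llm` is a hypothetical
# callable that maps a prompt string to a completion string.

def least_to_most(problem: str, llm) -> str:
    # Stage 1: ask the model to decompose the problem into subquestions.
    decomposition = llm(
        f"{problem}\nTo solve this, we need to first answer the following "
        "subquestions (one per line):"
    )
    subquestions = [q.strip() for q in decomposition.splitlines() if q.strip()]

    # Stage 2: answer subquestions in order, feeding earlier answers back
    # into the context so later subquestions can build on them.
    context, answer = problem, ""
    for q in subquestions:
        answer = llm(f"{context}\nQ: {q}\nA:")
        context += f"\nQ: {q}\nA: {answer}"
    return answer
```

The key design choice is that each subquestion's answer is appended to the context, so the final subquestion is answered with all earlier results available.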

3. Theoretical Basis for Intermediate Steps
* Theoretical work with Stanford students (ICLR 2024) shows that:
* A fixed-depth Transformer that generates sufficiently long intermediate reasoning steps can solve any inherently serial problem, as long as its depth exceeds a constant threshold independent of the input.
* A Transformer that outputs the final answer directly may require enormous depth to solve such problems, or may not be able to solve them at all.
* Practical implication: encourage models to "think longer" (generate more steps) or use external tools (e.g., search) to assist the computation of intermediate steps.

4. Eliciting Reasoning Without Demonstrations
* Zero-shot CoT: appending a phrase such as "Let's think step by step" after the question triggers step-by-step reasoning without providing any examples.
* Proposed by Kojima et al. (2022).
* Usually less effective than few-shot CoT.
* Analogical reasoning: prompt the LLM to recall a related problem, then solve the current one, borrowing from previously successful approaches.
* Work from Denny Zhou's team, "LLMs as Analogical Reasoners," also inspired by Pólya's book.
* The LLM adaptively generates relevant exemplars and knowledge.
* On benchmarks such as GSM8K, MATH, BIG-bench, and CODEFORCES, it typically outperforms standard zero-shot and few-shot CoT.
* The key is that the model generates relevant exemplars and knowledge for each problem, rather than using a fixed set of demonstrations.
* Chain-of-thought decoding: elicit step-by-step reasoning without explicit prompts (such as "let's think step by step") via a non-greedy decoding strategy.
* Recent work from Denny Zhou's team.
* When a reasoning path is present, the LLM is more confident in its final answer than under direct-answer decoding.
* For example, for the question "Was Nicolas Cage born in an odd or even year?", a generated path containing reasoning (e.g., "Cage was born in 1964, which is an even year") has much higher probability (e.g., 0.98) than a direct odd/even guess.
* On datasets such as GSM8K and MultiArith, the method significantly outperforms greedy decoding across model sizes.
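The confidence comparison behind CoT-decoding can be illustrated with a toy score: average the probability margin between the top-1 and top-2 token candidates over the answer tokens of each candidate path. This is a simplified caricature of the method; the paths and all numbers below are invented:

```python
# Toy illustration of CoT-decoding's answer-confidence score. For each
# candidate continuation (starting from a different top-k first token),
# the score is the mean gap between the top-1 and top-2 token probabilities
# over the tokens spelling the final answer. All numbers are made up.

def answer_confidence(margins):
    """Mean (p_top1 - p_top2) over the answer tokens of one decoding path."""
    return sum(margins) / len(margins)

# Candidate paths for "Was Nicolas Cage born in an even or odd year?":
paths = {
    "Even.": [0.45],  # direct answer: low-margin, uncertain
    "Cage was born in 1964, so an even year.": [0.97, 0.99],
}
best_path = max(paths, key=lambda p: answer_confidence(paths[p]))
```

The path that contains the reasoning step wins because its answer tokens are decoded with a much larger margin.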

5. Self-Consistency: Improving Reasoning Robustness
* LLM training maximizes P(reasoning path, final answer | problem), whereas the desired objective is usually the answer that maximizes P(final answer | problem).
* Self-consistency samples the same problem multiple times to generate multiple distinct reasoning paths, then selects the final answer that appears most frequently among the different results.
* Work from Denny Zhou's team.
* The method significantly improves CoT reasoning accuracy on benchmarks such as GSM8K.
* The data show that "more consistent outputs (i.e., answers appearing more frequently) are more likely to be correct." For example, when consistency exceeds 80%, accuracy approaches 100%.
* The principle also applies to direct answers (no intermediate steps): sample multiple times and pick the most common answer.
* However, simply asking the model to generate multiple responses in one pass (rather than sampling independently) and picking the most common one does not follow this principle.
* Universal Self-Consistency (USC): extends the idea to free-form answers by prompting the LLM to self-select the most consistent response among several generated candidates.
* For example, for the question "Where do people drink less coffee than in Mexico?", most responses point to "Japan, China, India."
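The marginalization step above reduces to a simple majority vote over sampled final answers; a minimal sketch (assuming answers have already been extracted from each independently sampled path):

```python
# Minimal sketch of self-consistency: sample several reasoning paths
# independently (e.g., with temperature sampling), extract each path's
# final answer, and return the most frequent one.
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over final answers from independently sampled paths."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g., three sampled paths for one problem ended in these answers:
voted = self_consistency(["18", "26", "18"])  # -> "18"
```

Note that the vote is over final answers only; two paths with different intermediate steps but the same answer count as agreeing.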

Limitations of Current LLM Reasoning (Speaker 2 - Denny Zhou)

1. Distractibility by Irrelevant Context
* LLMs are easily distracted by irrelevant information included in the prompt, leading to significant performance drops.
* For example, on GSM8K this can cause a drop of "more than 20 points."
* This mirrors findings in human psychology.
* Instructing the LLM to ignore irrelevant context partially mitigates the problem, but if the irrelevant information is carefully crafted, the model still struggles to recover.
* Even adding simple irrelevant sentences such as "the sky is blue and the grass is green," once the input becomes long enough, causes significant performance drops across all frontier LLMs.
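A probe of this kind can be sketched as follows (the distractor sentences mirror the "sky is blue / grass is green" example; everything else is illustrative):

```python
# Sketch: padding a word problem with irrelevant sentences to probe
# distractibility, in the spirit of the experiments described above.
import random

DISTRACTORS = ["The sky is blue.", "The grass is green."]

def add_irrelevant_context(problem: str, n_sentences: int, seed: int = 0) -> str:
    """Prepend n irrelevant sentences; the task itself is unchanged."""
    rng = random.Random(seed)
    filler = " ".join(rng.choice(DISTRACTORS) for _ in range(n_sentences))
    return f"{filler} {problem}"

padded = add_irrelevant_context("Esther has 3 apples. How many in total?", 4)
```

Comparing accuracy on the original versus the padded problems (at increasing `n_sentences`) reproduces the length-dependent drop described above.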

2. Inability to Reliably Self-Correct Reasoning
* Although LLMs can be prompted to review and correct their answers, the process is unreliable.
* They may change correct answers into incorrect ones, or fail to fix errors effectively.
* Without "oracle" feedback (i.e., without knowing whether the initial answer is wrong), self-correction can yield worse results than standard prompting.
* Claims of effective self-correction in the literature typically rely on oracle feedback, prompting the model to correct only when the answer is known to be wrong.
* Multi-LLM debate formats (having multiple LLMs debate each other to reach consensus) also underperform self-consistency.
* However, when external feedback, such as unit tests for code generation, acts as the oracle, self-debugging can be effective.

3. Sensitivity to Premise Order
* The order in which premises or pieces of information are presented significantly affects an LLM's ability to solve a problem, even when the underlying logic and information are unchanged.
* In math word problems (GSM8K) and logical reasoning tasks, reordering the information (even when only the rules relevant to the inference are reordered and irrelevant rules keep their positions) causes large performance drops of "10 to more than 30 points" across advanced LLMs.
* Models seem to process problems only sequentially, struggling to "look back and forth" across the information.
* Even in purely logical inference tasks (using random symbols instead of real words), shuffling the rule order causes similar problems.
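The perturbation used in the logical-inference probe can be sketched as follows: permute only the query-relevant rules while irrelevant rules keep their original positions (the indices and rule names below are illustrative):

```python
# Sketch of the premise-order perturbation: permute only the rules relevant
# to the query; rules irrelevant to the query keep their original positions.
import random

def reorder_relevant(rules, relevant_idx, seed=0):
    """Return a copy of `rules` with only the relevant positions permuted."""
    rng = random.Random(seed)
    permuted = relevant_idx[:]
    rng.shuffle(permuted)
    out = rules[:]
    for target, source in zip(relevant_idx, permuted):
        out[target] = rules[source]
    return out

rules = ["r0", "r1", "r2", "r3"]
shuffled = reorder_relevant(rules, relevant_idx=[0, 2, 3], seed=1)
```

Because only the relevant rules move, any accuracy drop on the perturbed problems can be attributed to premise order rather than to changed content.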

Outlook (Speaker 2 - Denny Zhou)

Denny Zhou concluded with a summary and outlook:
* Key takeaways:
* Generating intermediate steps significantly improves LLM performance.
* Self-consistency greatly improves the reliability of step-by-step reasoning.
* LLM reasoning still has many limitations, such as distractibility by irrelevant information, weak self-correction, and sensitivity to premise order.
* Future research directions:
* Define the right problems to study: not merely pursuing AGI, but finding concrete problems and solving them from first principles, rather than just improving numbers on benchmarks. Machine learning knowledge remains essential.
* Question current prompting paradigms: today's prompting does not reflect natural human interaction.
* Develop models that can autonomously learn the reasoning techniques discussed and overcome the identified limitations.
* Denny Zhou mentioned he is helping organize the first-ever conference dedicated to language modeling.