2025-02-05 | Agentic AI: A Progression of Language Model Usage

Agentic AI: An Analysis of Language Model Applications and Design Patterns

Media details

Upload date
2025-06-06 20:18
Source
https://www.youtube.com/watch?v=kJLiOGle3Lw
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: My name is Inso. Today we would like to go over agentic AI, the agentic language model, as a progression of language model usage. Here is the outline of today's talk. We'll go over an overview of language models and how we use them, then the common limitations, then some of the methods that improve on those limitations. Then we'll transition into what the agentic language model is and its design patterns.

A language model is a machine learning model that predicts the next word given the input text. As in this example, if the input is "the students opened their", the language model can predict the most likely next word. A language model trained on a large corpus generates a probability distribution over the next word; in this example, "books" and "laptops" have a higher probability than other words in the vocabulary. So the completion of the whole sentence could be "the students opened their books", and if you want to keep generating, you take that sentence as the new input, put it into the language model, and the model continuously generates the next word.

These language models are trained in largely two parts: a pre-training part and a post-training part. In the pre-training portion, language models are trained on large corpora of text collected from the internet, books, and other publicly available sources, with a next-token (next-word) prediction objective. Once a model finishes this pre-training stage, it is fairly good at predicting the next word for any input. However, the pre-trained model by itself is not easy to use, hence the post-training steps.
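The iterative next-word generation described above can be sketched as follows. This is a toy illustration: the probability table is invented for this one prefix, whereas a real language model computes a distribution over its whole vocabulary from learned parameters.

```python
# Toy next-word probabilities for the prefix "the students opened their".
# These numbers are invented for illustration; a real language model
# computes a distribution over its entire vocabulary.
NEXT_WORD_PROBS = {
    "the students opened their": {
        "books": 0.40, "laptops": 0.35, "exams": 0.15, "minds": 0.10,
    },
}

def predict_next_word(prefix, table, default="."):
    """Greedy decoding: return the highest-probability next word."""
    probs = table.get(prefix)
    if not probs:
        return default
    return max(probs, key=probs.get)

def generate(prefix, table, steps):
    """Feed each prediction back in as input, as described in the talk."""
    text = prefix
    for _ in range(steps):
        text = text + " " + predict_next_word(text, table)
    return text

print(generate("the students opened their", NEXT_WORD_PROBS, 1))
# the students opened their books
```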
These post-training stages include instruction-following training as well as reinforcement learning from human feedback. What this training stage means is that we prepare a dataset of specific instructions or questions paired with the answers or generated outputs that a user would expect. That is how the models are trained so that they are easier to use and respond in specific styles. Once this is done, an additional training method is aligning to human preference using reinforcement learning from human feedback (RLHF), which uses human preference data to align the model through a reward scheme.

Let's take a really quick look at an instruction dataset. This is the template we would use to train the model in the instruction-following phase. A specific instruction is substituted in, the expected output is substituted in, this is fed to the model, and the model is trained only on the response part, that is, generating the output based on the given instruction.

A language model trained through both the pre-training and post-training stages is quite capable of generating text given an instruction. Essentially, it has a lot of world knowledge and can easily generate outputs. These models are rapidly developing and are used in various application domains in our day-to-day work, such as AI coding assistants, domain-specific AI copilots, or, most widely known, ChatGPT and related conversational interfaces.
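The instruction-tuning template just described can be sketched as a format string. The exact wording is an assumption on my part (it loosely follows the widely used Alpaca-style template); the masking of the loss to response tokens happens in the training code, which is not shown.

```python
# A sketch of an instruction-tuning template, loosely in the style of the
# widely used Alpaca template; exact wording varies between datasets.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_example(instruction, response):
    """Substitute one (instruction, response) pair into the template.
    During training, the loss is typically computed only on the response
    tokens, as the talk notes."""
    return TEMPLATE.format(instruction=instruction, response=response)

example = format_example(
    "Summarize the water cycle in one sentence.",
    "Water evaporates, condenses into clouds, and returns as precipitation.",
)
print(example)
```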
To use these kinds of models in your applications or specific tools, you can make cloud-based API calls to a model provider's servers, or you can host the models on your own local machines, or even on mobile devices for models small enough to run in such compute-constrained environments.

So what does it mean to use API calls? Let's step back. The language model takes natural language input text and generates output. That means we prepare free-form natural language text as an instruction or question, put it in a specific format, and make an API call to the model provider. The provider, usually in a cloud environment, generates the output and responds to the API call with it. The software built around the model then parses the output and either uses it as is, or makes follow-up LLM API calls to further refine it.

The input to the model is free-form text, so how you prepare your input, also known as prompting, is critical. There are well-known best practices for preparing your prompt; here are some of them. Write clear, descriptive, detailed instructions; that will help the model generate the output you want. Include a couple of examples of the form you want to see, as in the style or format. Provide references or context, so that the model relies on the context you supply. And instead of asking the model to answer right away, give the model time to think, for example by enabling reasoning or using the chain-of-thought method.
The next one is, instead of asking the model a really complex task all at once, break it down and ask in sequence, that is, chain the complex prompts. And the last one is good engineering practice: having systematic tracing and logging will help you, and automated evaluation is essential to making progress on your application.

Let's take a quick look at each item to get more familiar with what it means. Write clear and descriptive instructions: as in the example on the left, instead of a short request, describe in detail what you are asking, because the model cannot read your mind. You need to describe what you want the model to generate for you. This is always useful when using language models in general.

Include few-shot examples, meaning give the model the example inputs and outputs you would expect, for instance when you want output in a consistent style. What is the consistent style? You provide an example input and example output, and then finally ask your original question; the model will generate output following those examples. Few-shot examples are always helpful for producing the output you want.

Provide relevant context and references. This is really helpful in cases that involve generating text based on factual information. An LLM can easily generate incorrect output, also known as hallucination, for topics it does not know or is not confident about. For those cases, providing context or references is always helpful.
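The few-shot prompting idea above can be sketched as a small prompt builder. The "Input:"/"Output:" labels are just one common convention, an assumption here, not a requirement of any model.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a prompt from a task description, a few worked examples
    showing the expected style, and the real question at the end."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Rewrite each sentence in a formal tone.",
    [("gonna fix it later", "We will address this at a later time.")],
    "that bug is really annoying",
)
print(prompt)
```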
Here is an example prompt template you might use in cases such as retrieval-augmented generation, which we'll look at in the following slides. It says to answer only based on the articles you provide: you substitute in your relevant references and instruct the model to answer only from those references, and to say it cannot find the answer if the references do not contain it. The model will then most likely ground its answer in your references.

The next important part is giving models time to think. In other words, instead of asking a direct question, ask the model to think through the problem or come up with its own solution, and only then compare and generate the final output. This is also known as chain of thought. Here is an example that might not work with some medium-size models: you ask the model to evaluate whether a student's solution is correct, providing the problem description and then the student's solution. If the system prompt or your request just says to answer whether it is correct or wrong, the model might not get it right. However, the same model can get the right answer if you prepare the prompt so that it first works out its own solution to the problem, and then compares its solution to the student's solution. By doing this, the model generates its own solution, and as it does, it has the opportunity to attend carefully both to the original input and to the output it has generated, and arrive at the right answer. So chain-of-thought reasoning is always helpful for generating the output you want to see.

Here is an interesting one, probably easy to implement in your application: instead of making one request that includes multiple tasks, prepare your prompts in small, simple stages.
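The two prompt patterns above, answering only from supplied references and working out a solution before grading, can be sketched as templates. The wording is my paraphrase of the talk, not the exact template from the slides.

```python
# Answer-from-references template (the RAG pattern discussed later).
REFERENCE_PROMPT = (
    "Answer the question using only the references below. "
    "If the answer is not in the references, say you cannot find it.\n\n"
    "References:\n{references}\n\nQuestion: {question}"
)

# "Give the model time to think" template for grading a student's solution.
GRADING_PROMPT = (
    "First work out your own solution to the problem. "
    "Then compare your solution to the student's solution, and only after "
    "that decide whether the student's solution is correct.\n\n"
    "Problem:\n{problem}\n\nStudent's solution:\n{student_solution}"
)

prompt = GRADING_PROMPT.format(
    problem="A widget costs $3. How much do 4 widgets cost?",
    student_solution="4 x 3 = 14, so $14.",
)
print(prompt)
```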
How you do it is: prepare a simple prompt, generate the output, prepend that output to the second-stage prompt, generate the output, then again feed the output from the previous stage into a third stage, and so on, until you finally generate the output you want to see. You may do this manually, or it can be done by an LLM, as we'll see in the following slides. But having one simple, clear task per request is a good way to do it.

Okay, this may not be obvious, but as with many engineering applications, having good tracing and logging will definitely help your development, for debugging as well as auditing. The same principle applies to language-model-based development: keeping track of logs is always good.

That also relates to having automated evaluation from the early stages of your development. In other words, you prepare question and ground-truth answer pairs so that you can compare them against the generated output. You could use humans to evaluate, but that is usually costly and time-consuming. So you may use a language model as a judge, meaning you ask a language model to evaluate the generated output against the ground-truth output and score its quality, and you can track that score for the application you are developing. This will be very important, because language models are continuously and rapidly improving, and the methodology and tools you use for development are also rapidly evolving. In other words, without clear evaluation, it is hard to make forward progress, or even to switch to a different model.
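The LLM-as-a-judge evaluation loop just described can be sketched as follows. Here `generate` and `judge` are placeholders for real model API calls (replaced by fakes so the sketch runs standalone), and the 1-to-5 rubric is an illustrative assumption.

```python
JUDGE_PROMPT = (
    "You are grading a model answer against a ground-truth answer.\n"
    "Question: {question}\nGround truth: {truth}\nModel answer: {answer}\n"
    "Score 1 (wrong) to 5 (matches the ground truth). Reply with the number only."
)

def evaluate(dataset, generate, judge):
    """Score each generated answer with a judge model and return the mean.
    `generate` and `judge` stand in for real LLM API calls."""
    scores = []
    for question, truth in dataset:
        answer = generate(question)
        verdict = judge(JUDGE_PROMPT.format(question=question,
                                            truth=truth, answer=answer))
        scores.append(int(verdict.strip()))
    return sum(scores) / len(scores)

# Fakes so the sketch runs: echo a canned answer, always score 4.
mean_score = evaluate(
    [("What is 2+2?", "4"), ("Capital of France?", "Paris")],
    generate=lambda q: "some answer",
    judge=lambda p: "4",
)
print(mean_score)  # 4.0
```

Tracking this mean score over time is what lets you compare prompts, patterns, and models against each other as they evolve.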
Because the models are rapidly developing, some models are also rapidly deprecated, which means you may be forced to change the language model you use in your application. So having a good evaluation methodology up front, from the beginning, will definitely help.

All right, this next one is a simple idea but helpful for many applications. Instead of taking your input prompt as is and processing it, you can have some software, or a model, detect the intention and send the request to different prompt handlers. This is also known as a prompt router. Based on the input query type, you might use a simple prompt together with a simple language model; this helps with operating costs, and it also produces more appropriate output by pairing the more relevant prompt with a language model that is more capable for that type of query.

All right, Petra, maybe this is a good moment to take a question if there is any. Thank you so much.
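The prompt-router idea can be sketched with a trivial keyword classifier. A production router would more likely use a small classifier model or a cheap LLM call, and the handler and model names below are hypothetical.

```python
# Hypothetical handlers: (prompt template name, model to use).
HANDLERS = {
    "billing": ("billing_prompt", "small-cheap-model"),
    "support": ("troubleshooting_prompt", "code-capable-model"),
    "general": ("default_prompt", "general-model"),
}

def classify_intent(query):
    """Crude keyword-based intent detection, standing in for a real
    classifier or a cheap LLM routing call."""
    q = query.lower()
    if any(word in q for word in ("refund", "invoice", "billing")):
        return "billing"
    if any(word in q for word in ("error", "crash", "bug")):
        return "support"
    return "general"

def route(query):
    """Pick the prompt/model pair appropriate for this query type."""
    return HANDLERS[classify_intent(query)]

print(route("I was charged twice on my invoice"))
# ('billing_prompt', 'small-cheap-model')
```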
speaker 2: Thank you so much, Inso; such an inspirational talk already. We will get to more details about agentic AI shortly; we wanted to provide you with some background on the progression of what has been going on in the field. But maybe let's ask one question that came up. It's a little specific, but it might be what more people are wondering about: is there an optimal amount of data for good training, or is there anything you could advise people regarding the data available or the data being used?
speaker 1: Okay, I'll be brief on this. I assume "training" here means fine-tuning an LLM, that is, additional training on top of an open-source language model. It definitely depends on your task; it's hard to say one way or the other. But if you have enough data, or text of the kind you want to see, you can put together a simple question-and-answer or instruction-following dataset, and you can also use a language model to generate more if you need it. I would start with, say, tens of samples first and see whether that makes the model behave the way you actually want. Then, based on the results or signals from that initial quick test, you can add more dataset samples, possibly using a language model to augment them or to create synthetic data. Great question.
speaker 2: Thank you so much. I see questions have started coming in, which is wonderful. I think we will pause the questions for now and try to get to as many as we can at the end, but please keep them coming. It definitely makes this session more engaging and also lets us know what you are interested in. Thank you.
speaker 1: Thank you, Petra. All right. So far, we've looked at an overview of language models; there are many very powerful models out there, and we've seen how we use them. However, there are still limitations, listed here. Hallucination is a well-known issue: models can sometimes, even oftentimes, generate incorrect information, particularly when computation or other specific areas are involved. This is a problem we want to avoid in your application domain.

Another is the knowledge cutoff in dataset preparation. Model providers and creators prepare datasets, but at some point they have to cut off data collection and use what they have, so the model may not have seen recent information or news as part of its pre-training data. Lack of attribution: a model can answer a lot of general world-knowledge questions, but it is not going to tell you where it drew the answer from, i.e. the particular data source. Data privacy: model creators prepare their datasets from publicly available sources, which means the model has not seen the proprietary data of your organization or particular domain. And limited context length: although context lengths are rapidly increasing, it is always a fine balance, because a longer context gives the model more information but comes with operational cost as well as latency in text generation.

To address these common limitations, retrieval-augmented generation is one approach. It can reduce hallucination by supplying actual relevant references, and it addresses citation, because you know where each reference comes from.
It also allows you, as an application or system developer, to build systems that use your own proprietary dataset or text, and to make good use of a limited context length, because only relevant data is selected. How it works: you pre-index your own dataset or text by splitting it into smaller chunks, converting the chunks into an embedding space using an embedding model, and staging them in a database or vector database. When a request or query comes in, you convert the query into the same embedding space so you can do a nearest-neighbor search, select the top-k relevant text chunks, and place them in your prompt, as in the slide we saw previously where you put references in the prompt and instruct the model to use only those references. This is one good way to make use of your own proprietary data.

A similar method is used in AI search: instead of an indexed dataset, you can rely on web search or other kinds of search to provide the information. One thing to mention here is that there are many methods and ideas for retrieval-augmented generation. The most commonly used is the one we just described: turn the text chunks into an embedding space and do a nearest-neighbor search. But there are many others; you could also use a knowledge graph. If you can generate a knowledge graph from your text source, that can help extract more relevant information; GraphRAG is one example of this. You may need to look into which method is right for your case and make use of it.

All right, tool usage.
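The index-then-retrieve flow above can be sketched end to end. The bag-of-words "embedding" here is a deliberately crude stand-in for a real trained embedding model and vector database, and the document chunks are invented examples.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; a real system uses a trained embedding
    model and stores vectors in a vector database."""
    return Counter(text.lower().replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def top_k(query, chunks, k=2):
    """Nearest-neighbour search over the pre-indexed chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [  # invented document chunks, indexed ahead of time
    "Refund requests are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Standard shipping takes 3-5 business days.",
]
best = top_k("how long do I have to request a refund", chunks, k=1)
prompt = "Answer only from these references:\n" + "\n".join(best)
print(best[0])
```

The resulting `prompt` is then sent to the model, exactly as in the answer-from-references template discussed earlier.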
The most widely used mode of language models is text in, text out, which can answer many kinds of queries. However, the model by itself cannot execute anything or extract information from the external world. That is where tool usage, also known as function calling, comes to the rescue. With this method you can get real-time information, or actually perform computation by generating software code.

What does that mean? Let's look at an example. Suppose you have a chatbot and ask, "What is the weather in, say, San Francisco?" The model will not know this by itself. However, if you tell the model beforehand, as part of the prompt, that when asked a weather-related question it should generate output in a form that the surrounding software can parse into an API call, then the model will signal, "this is a case for tool usage." It generates output in the form shown here, get_weather, with the place we asked about as the input argument to the API or function call. The software receives this text output from the model, parses it, actually makes the API call to the weather provider, gets the weather information, and provides it back to the language model. The language model then generates a more human-friendly, helpful output based on the API result. In some cases, a model can also generate software code that is executed in a sandbox outside the language model, by the software that coordinates all these activities.

All right, the agentic language model. There can be many definitions. One definition is that it can interact with an environment.
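The weather example can be sketched as follows. The JSON tool-call format and the `get_weather` function are illustrative assumptions; real provider APIs expose structured tool-call fields rather than raw text, but the parse-dispatch-observe flow is the same.

```python
import json

def get_weather(location):
    """Fake weather lookup standing in for a real weather API call."""
    return {"location": location, "forecast": "sunny", "temp_c": 18}

TOOLS = {"get_weather": get_weather}

def handle_model_output(model_output):
    """Dispatch a tool call if the model emitted one; otherwise the text
    is already the final answer. A tool call is assumed here to be JSON."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # plain-text answer, no tool needed
    result = TOOLS[call["name"]](**call["arguments"])
    # In a real system this observation is fed back to the model, which
    # then writes the friendly final answer for the user.
    return result

# What the model might emit for "What is the weather in San Francisco?"
emitted = '{"name": "get_weather", "arguments": {"location": "San Francisco"}}'
observation = handle_model_output(emitted)
print(observation)
```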
Compared to simple language model usage, the text-input/text-output pattern you've seen here, in agentic language model usage the model can do something with its environment by generating tool-usage or retrieval requests. The environment, meaning anything outside the language model, provides information that is fed back to the model as an observation. The whole agentic language model, which includes the language model at its core with software around it, processes this and also puts it into memory, along with its conversational history, which can itself be treated as memory. That is one definition of the agentic language model.

Another way to look at it: agentic language model usage can be defined as combining reasoning with action, also called ReAct (reasoning and action). For the reasoning part, you encourage the model to reason using a method such as chain of thought. For the action part, you use methods we saw in previous slides: retrieval, a search engine, actually using a calculator via an API call, different APIs such as the weather API we saw, or generating Python code to run in your sandboxes. By combining reasoning and action, the model can handle much more complex tasks than simple input-output interaction.

Let's look at this in a little more detail. What do reasoning and action mean here? For the reasoning part, instead of doing the task exactly as asked, you prepare your prompt to ask the model to break down the task and make a plan. So instead of breaking down the task yourself, as we saw in the chained-prompts slide, you ask the model to break it down and prepare the plan.
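The reason-act-observe loop just described can be sketched as below. The `policy` argument stands in for the LLM call (here a scripted fake so the sketch runs standalone), and the tuple-based action format is purely an illustrative assumption.

```python
def react_loop(task, policy, tools, max_steps=5):
    """Minimal ReAct-style loop: the model (here `policy`) sees the growing
    transcript (its memory), chooses an action or a final answer, and each
    tool observation is appended back into the transcript."""
    memory = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = policy("\n".join(memory))
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, argument = decision
        observation = tools[tool_name](argument)
        memory.append(f"Action: {tool_name}({argument!r}) -> Observation: {observation}")
    return "step limit reached"

def scripted_policy(transcript):
    """Fake LLM: search once, then answer from the observation."""
    if "Observation" not in transcript:
        return ("act", "search", "capital of France")
    return ("finish", "Paris")

tools = {"search": lambda query: "France's capital is Paris."}
print(react_loop("What is the capital of France?", scripted_policy, tools))
# Paris
```

The `max_steps` cap is a common safety measure so that an agent that never reaches a final answer cannot loop forever.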
In other words, plan the actions; based on that breakdown, the model can generate different actions, making API calls or using tools to collect additional information from the external world. Combining all of this, it puts the results in memory so that it knows what has been happening, and based on that it finally produces an answer for you.

Let's look at a concrete example. Suppose you have a customer-support AI agent, and a customer asks, "Can I get a refund for a product?" The agentic system breaks this request down into the following actions: check the refund policy, check the customer information, check the product, and finally collect everything and decide what to do at each step. The language model emits API calls so that it can collect information. For example, to check the refund policy, the language model can query a retrieval system against the pre-indexed company refund policy, retrieve the information, and put it into its own context. Using that, it also requests the customer's order information: it can either ask the customer back in the chat to collect more information, or look it up in the system, depending on how the chat system is set up. The same goes for the product information. Finally, it draws a conclusion based on the policy, the product information, and the customer's order information, sends a request to the follow-up system as an API call, and prepares, say, a response draft, which then goes through final approval.

All right, so the workflow is generally like this.
In a sense, in an agentic language model system the LLM makes iterative calls: reviewing documents or text, then making external tool calls, and so on. For example, if you want to research a certain matter, you can prepare your agent to do web searches or other kinds of search, keep summarizing iteratively, and finally prepare a report for you or your system. Another example is a software assistant agent: you give the agent a certain software bug or issue, and the agent looks it up, reviews the issue, collects the relevant pieces of code or files, reviews them, and proposes a change. It can also execute in its sandbox environment to test the fix, get the output, iteratively try to find the right fix, and finally propose the pull request or changes to the users or developers. These are ways we can use language models in agentic form, through iterative language model calls.

The main reason agentic language model usage is becoming more widely adopted: with the same model, a direct request may not be something it can handle. But if you put your task into this kind of agentic format or pattern, the model can accomplish more complex tasks, even a model that could not do them otherwise. That is one reason agentic language models are pushing the boundaries: the things we can do with AI extend to more complex tasks and to more domains we can rely on. All right.
Here are real-world applications. Software development, code generation, and bug fixing are being widely investigated and researched by different organizations, and there are companies trying to provide these services, as shown on the right side. Research and analysis: gather information, synthesize it, and provide a summary for the user. And task automation is another area where agentic methods can be used.

All right, to make it clearer, here are some design patterns you can use with agentic language models. Planning is critical: by asking a model to break down the task into simpler, clearer subtasks, the model can later make the API calls or use the tools. Reflection is something the model can apply to its own output: the next model call criticizes the output that came from the same model, and by doing this the output can be improved. Tool usage covers anything outside the language model, such as real-time or other external information you need. And multi-agent collaboration is another way to handle complex tasks.

Reflection is a pattern that is quick to implement and leads to good performance. Let's use a concrete example. If you want to refactor some code, instead of asking the model to improve it right away, you follow this pattern: first you ask the model, "here is the code; check the code and provide constructive feedback." Then you feed the feedback into a second prompt: "here is the code, and here is the feedback" (which came from the model itself), "now refactor it." This kind of reflection will likely generate better output, better fixes, for the code you are asking the model about.
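The two-step refactoring pattern just described can be sketched as follows. `call_llm` is a placeholder for a real model API call (replaced here by a fake so the sketch runs), and the prompt wording paraphrases the talk.

```python
CRITIQUE_PROMPT = (
    "Here is some code:\n{code}\n"
    "Check the code and provide constructive feedback."
)
REVISE_PROMPT = (
    "Here is some code:\n{code}\n"
    "Here is reviewer feedback on it:\n{feedback}\n"
    "Refactor the code, taking the feedback into account."
)

def reflect_and_refactor(code, call_llm):
    """First call critiques the code; second call revises it using that
    critique, which came from the same model."""
    feedback = call_llm(CRITIQUE_PROMPT.format(code=code))
    return call_llm(REVISE_PROMPT.format(code=code, feedback=feedback))

# A fake model so the sketch runs: it labels what it was asked to do.
def fake_llm(prompt):
    return "FEEDBACK" if "constructive feedback" in prompt else "REVISED CODE"

result = reflect_and_refactor("def add(a,b): return a+b", fake_llm)
print(result)  # REVISED CODE
```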
Tool usage is something we've seen before: ask the model to generate API patterns so that the function prototype can be used to make an actual call, or, if the task involves actual computation or some other form, ask the model to generate a program as output. You can then run that program in a safe sandbox environment that your software scaffolding around the model can execute, and provide the execution output back to the model so it can synthesize the result.

All right, multi-agent is an interesting way to accomplish complex tasks. You split up your task and assign the subtasks to different agents dedicated to specific tasks. An agent, in this context, can be just a different prompt or different persona; the prompt usually starts with something like "you are a helpful AI agent," and you can change that into a different persona for a different agent. You may also use the same model or different models, depending on the task. Let's use a concrete example. If you build a multi-agent system for smart home automation, you can create different agents: a climate-control agent, a lighting-control agent, and so on. These are software pieces that include a different prompt with a persona as well as handling for external triggers, and a coordinator, essentially a model prompt together with software scaffolding around it, coordinates the whole activity.

All right, that brings us to our summary. Agentic language model usage is a progression, an extension, of existing language model usage methods, so the best practices you have used with language models for simple cases are mostly still applicable.
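The smart-home multi-agent example above can be sketched as persona prompts plus a coordinator. The agent names, the trigger routing rule, and the `call_llm` placeholder are all illustrative assumptions.

```python
# Each agent is just a different system prompt (persona); as the talk
# notes, they could also be backed by different models.
AGENTS = {
    "climate":  "You are a climate-control agent. Manage heating and cooling.",
    "lighting": "You are a lighting-control agent. Manage lights and scenes.",
}

def coordinator(event, call_llm):
    """Software scaffolding that routes an external trigger to the agent
    whose persona should handle it."""
    agent = "climate" if event["kind"] == "temperature" else "lighting"
    prompt = f"{AGENTS[agent]}\nEvent: {event}"
    return agent, call_llm(prompt)

# Fake model call so the sketch runs standalone.
agent, reply = coordinator({"kind": "temperature", "value_c": 27},
                           call_llm=lambda p: "Turning on cooling.")
print(agent)  # climate
```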
However, you can add further methods, such as retrieval, search, and tool usage, and prepare different kinds of prompts and workflows, so that the language model at the core acts as a reasoning engine, a smart intern, and you use tool usage or retrieval to interact with the external world, combining the results so that you can achieve complex tasks rather than simple input-output language model usage. And with that, Petra, thank you so much.
speaker 2: This has been really great, so much information; hopefully this is useful for everybody. We keep collecting the questions; we got so many, and we will try to get to as many as we can, but please feel free to keep them coming. Maybe the first question for you, and let's focus on agentic AI, is about evaluation. Do you have recommendations for good strategies for evaluating agents beyond just using an LLM as a judge? It seems evaluation should be a bit more complicated for agents; at least that's the general notion in the questions, and people are wondering how it can be done.
speaker 1: Right, I think this is a great question. Just for quick context, LLM-as-a-judge is a commonly used method where you use an LLM to evaluate model-generated output against a ground-truth answer or some kind of preference information, and it works great; we use it as well. One thing I've recently tried is an agentic judging method, meaning I use the reflection pattern we saw previously. Instead of asking the LLM judge one question right away, I first ask an LLM to provide an initial evaluation and feedback, and then I make another LLM call with a different prompt saying, "Hey, this is feedback from your junior engineer. As a senior engineer, how would you compare the junior engineer's evaluation against the output you are evaluating?" I found that this reflection pattern gives a better evaluation than a one-shot LLM-as-a-judge output. And I think there can be even more creative ways to improve your evaluation stage using agent patterns. Evaluation is really, really important; I can't emphasize it enough, because it helps you advance fast, change models, change prompts, and so on. So I think that was a great question.
speaker 2: Thank you so much, Insop. We got a few questions about augmenting AI agents for specific uses and making sure they get shaped the way you need for your application. On the technical side, what is there to do, and what is there to use? Many questions circle around this idea, so any information would be helpful.
speaker 1: right? Again, I think this is a great question. I would so one thing first thing came up to my mind is if you have a task which is simple already, which is simple, then you just use a language model, simple use cases. However, if you see a little more involved or complex test, you could experiment with a simple agent tic test, even if it's agent language model usage could be many, could be, you can define agent tic language model in many, many different ways. But if you have a little more involved task, you could do iterative language model called even. That could improve your output. So I think it all depends on your test case test, actual application domain. But instead of trying to apply, trying to look for the test that you could apply language model, turn that into, turn der around that, how this test cannot be solved with the simple cases. So simple language model usage. So try to apply simple k, simple usage first and then try to improve it using different patterns. Same for, I think, a slightly tangential but fine tuning cases. Also, if you could have a if you have a task that in you want to solve ld, try to use the uc model first and whether it makes sense or not. And then from there, you could decide how to do it. Then you could prepare small data samples and then try it and then make a progress and then a really quick iteration instead of trying to invest upfront too much.
speaker 2: Thank you. A few questions came in, and this is kind of a giant question on its own, so I'll let you choose how to respond. There are a lot of questions about ethical considerations: how to avoid hallucinations, and how to avoid using data that might be unethical or have something questionable behind it. What would be your recommendation? Again, the question is big, but maybe there is something you'd like to say about it.
speaker 1: All right, yeah, another great question. Due to the probabilistic nature of generation, hallucination is always there, although a lot of people are working on it. So that's a problem, and the generated content itself can also be concerning. Model providers check generated output across different categories, and as the application builder you should also add guardrails. A guardrail can mean checking the output with a small language model, so the check is fast, or with some kind of classifier against your own criteria, so you can filter content out. That can happen either at the final generation stage or at the input stage, when the query comes in, so you can avoid the issue up front. This can backfire, but as an enterprise or a business you may want to be on the safer side, making sure that both the generated outputs and the incoming queries are safe and reasonable. This is still evolving, but fundamentally it comes down to checking the output with some type of classifier or even a decoder-type model.
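The input-and-output guardrail flow the speaker describes can be sketched as below. This is a toy illustration: a real system would use a small classifier or language model for the safety check; a keyword block-list stands in here so the flow is runnable, and all names are illustrative.

```python
# Illustrative block-list; a production guardrail would use a trained classifier.
BLOCKED_TERMS = {"credit card number", "home address"}

def is_safe(text: str) -> bool:
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(query: str, generate) -> str:
    if not is_safe(query):                   # input-side guardrail
        return "Sorry, I can't help with that request."
    output = generate(query)
    if not is_safe(output):                  # output-side guardrail
        return "Sorry, the generated answer was filtered."
    return output

# The unsafe query is rejected before generation even runs.
reply = guarded_generate("What is my neighbor's home address?", lambda q: "...")
print(reply)
```

Note that the same check runs twice, once on the request and once on the response, matching the two interception points mentioned above.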
speaker 2: Thank you so much. And a little plug: we do have a generative AI program that also covers a lot of the ethical aspects of LLMs. It's not a technical program, but I mention it for the people asking these very important questions. Maybe a last question for you, Insop, and of course we are not going to be endorsing anything here at Stanford, but the question is: how does one get started? Are there any open-source models to recommend? Is there anything people can do to start testing this out, to start playing with it?
speaker 1: Great question. One take-home message here is: start simple, experiment, and iterate. Agentic language models sound complex and fancy, but they are a progression and extension of language model usage. To start, there are many language model and agentic frameworks you could use, but I would suggest beginning in a playground-type environment; model providers generally offer a playground where you can type your prompt and see the output right away, so you can experiment really quickly. Once you're familiar with that, use the API to make calls from your program and see what's going on; that way you gain insight as well as practice, including best practices for prompt preparation. Once you have that, you can intelligently decide whether it makes sense to continue with your own code base or to make use of widely available libraries. In short: start simple, work in a playground first, then make simple API calls, and then decide whether you want to adopt more extensive libraries or continue on your own. This applies to plain language model usage as well as agentic language model usage, because there are also many agentic frameworks out there. Thank you so much.
speaker 2: And to close out the session in our last minute: Insop, the field is progressing so fast, there is so much happening; basically every week we see news about something new in these areas. Are there any resources you study or follow, anything you would recommend people use to stay up to date on LLMs and agentic AI?
speaker 1: Difficult question, but a great one. It's a little hard, but I've picked some experts who are well known in this field, and I follow them, either on Twitter or through their YouTube channels. From there I get updated information and then do my own digging. Finding a good starting point is the key thing, and the references here, which you can screenshot, can be that good starting point: they include agentic-usage courses as well as other good courses, from Stanford as well as other places.
speaker 2: Thank you so much, Insop, this has been really great, so much helpful information, and you're absolutely the best person to hear on this. Thank you so much for taking the time. Thank you also to everyone who joined us live; thank you for your time.

Latest Summary (Detailed Summary)

Generated 2025-06-06 20:22

Executive Summary

This talk by Insop Song, Principal Machine Learning Researcher at GitHub Next, presents agentic AI as the next stage in the evolution of language model (LM) usage. The core thesis is that agentic AI greatly expands the application boundary of language models by giving them the ability to reason and act, enabling them to handle complex, multi-step tasks that a traditional LM cannot complete.

The talk first reviews the fundamentals of language models, including how they acquire strong text-generation capabilities through pre-training followed by post-training stages such as instruction tuning and RLHF. It then points out common LM limitations, such as hallucination, knowledge cutoff, and lack of attribution, and introduces two key remedies: retrieval-augmented generation (RAG) and tool usage / function calling. RAG improves factual accuracy by bringing in external knowledge bases, while tool usage lets the LM interact with external systems such as APIs and databases.

Agentic AI systematizes these capabilities around an iterative loop: the LM first plans, decomposing a complex task into small steps; it then acts by calling tools to execute those steps, gathering observations from the external environment; finally, the LM reflects and reasons over the results, updating its memory and its next plan until the task is complete. The talk highlights four key agentic design patterns: planning, reflection, tool usage, and multi-agent collaboration. Combining these patterns lets AI autonomously complete complex workflows such as research analysis, software development, and customer support. The speaker closes by advising developers to start with simple applications and to build and refine agentic systems through experimentation and rapid iteration.

Language Model (LM) Fundamentals and Applications

Model Definition and Training

  • Basic definition: A language model is a machine learning model that predicts the most likely next word given the input text.
  • Training process:
    1. Pre-training: The model is trained on massive corpora of internet text, books, and other publicly available text, with a next-token (next-word) prediction objective. At this stage the model has broad world knowledge but is hard to use directly.
    2. Post-training:
      • Instruction following training: The model is fine-tuned on data in an "instruction / expected output" format, so it better understands and responds to user instructions.
      • Reinforcement learning from human feedback (RLHF): Human preference data is used to align the model through a reward mechanism, making its outputs better match the style and values people expect.

Main Application Areas

  • AI coding assistance
  • Domain-specific AI copilots
  • Conversational interfaces, such as ChatGPT

How the Models Are Used

  • API calls: Use the model through a cloud provider's API.
  • Local deployment: Sufficiently small models can be deployed on a local machine or even a mobile device.

Best Practices for Effective Language Model Use: Prompt Engineering

Building a high-quality input prompt is essential for steering the model toward the desired output. Key best practices:
* Clear, specific instructions: Avoid vague requests; describe the task requirements in detail. As the speaker put it: "The model cannot read your mind; you need to describe in detail what kind of output you want it to generate for you."
* Provide few-shot examples: Give examples of inputs and expected outputs to help the model learn the required format or style.
* Provide relevant context and references: To reduce incorrect information (i.e. hallucination), supply relevant background material or reference articles and ask the model to answer based on them. This is the core idea behind retrieval-augmented generation (RAG).
* Give the model "time to think": Rather than asking for the answer directly, guide the model to reason first. For example, use chain-of-thought prompting and ask the model to "first work out its own solution, then compare it against the student's solution."
* Decompose complex tasks: Break a complex request containing multiple subtasks into a series of simple, consecutive prompts completed step by step.
* Systematic tracing and evaluation:
* Set up good logging and tracing for debugging and auditing.
* Build an automated evaluation pipeline early, using ground-truth question/answer pairs to measure model performance. An LM as a judge can assist evaluation, keeping pace with rapid iteration on models and methods.
* Use a prompt router: Dispatch user queries by intent to different, more specialized prompt handlers or models, to optimize cost and output quality.
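The prompt-router idea above can be sketched as follows. This is a hypothetical example: intent detection is stubbed with keyword matching, whereas a real router would typically use a small classifier or an LM call; all names and templates are illustrative.

```python
# Specialized prompt templates, one per intent (illustrative).
PROMPTS = {
    "code":    "You are a coding assistant. Answer:\n{query}",
    "billing": "You are a billing support agent. Answer:\n{query}",
    "general": "You are a helpful assistant. Answer:\n{query}",
}

def detect_intent(query: str) -> str:
    # Stub for a real intent classifier: simple keyword matching.
    q = query.lower()
    if any(w in q for w in ("bug", "python", "error", "code")):
        return "code"
    if any(w in q for w in ("invoice", "refund", "charge")):
        return "billing"
    return "general"

def route(query: str) -> str:
    # Pick the specialized template and fill in the user query.
    return PROMPTS[detect_intent(query)].format(query=query)

print(route("Why does my Python code raise a KeyError?"))
```

Each branch could also point at a different (cheaper or stronger) model rather than just a different template, which is where the cost optimization comes from.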

LM Limitations and Core Remedies

Common Limitations

  • Hallucination: Generating incorrect or fabricated information.
  • Knowledge cutoff: The model's knowledge stops at the collection cutoff date of its training data; it cannot access the latest information.
  • Lack of attribution: The model cannot state the sources behind its answers.
  • Data privacy: The model is not trained on an organization's proprietary data.
  • Limited context length: Longer contexts bring higher operating cost and latency.

Remedy 1: Retrieval-Augmented Generation (RAG)

  • How it works:
    1. Preprocessing: Split private documents or a knowledge base into small chunks, convert them into vectors with an embedding model, and store them in a vector database.
    2. At query time: Convert the user query into a vector as well, run a similarity search against the database, and find the K most relevant text chunks.
    3. Augmented prompt: Inject the retrieved chunks into the prompt as context alongside the original query, asking the model to answer based on that context.
  • Key advantages:
    • Significantly reduces hallucination by grounding answers in facts.
    • Can supply citations and sources for answers.
    • Lets the model safely use private or proprietary data.
    • Makes efficient use of the limited context window.
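The three-step flow above can be sketched with a toy retriever. This is only an illustration of the shape of the pipeline: a real system uses an embedding model and a vector database, whereas here a bag-of-words cosine similarity over an in-memory corpus stands in for both, and the sample chunks are invented.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words "embedding": a stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1 (preprocessing): chunks that would normally live in a vector DB.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Our office is located in San Francisco.",
    "Support is available by email around the clock.",
]

def retrieve(query: str, k: int = 1) -> list:
    # Step 2 (query time): similarity search for the top-k chunks.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Step 3: inject the retrieved chunks as context for the model.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many days do I have for a refund?"))
```

Swapping `embed` for a real embedding model and `chunks` for a vector store gives the production version of the same pipeline.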

Remedy 2: Tool Usage and Function Calling

  • Core concept: Give the language model the ability to perform external actions or pull information from the outside world.
  • How it works:
    1. When a user request needs external information (e.g. live weather) or computation, the model emits a structured output, typically in the format of an API call.
    2. For example, for "What's the weather in San Francisco?", the model might output: get_weather(location='San Francisco')
    3. An external software layer parses this output, actually executes the API call, and fetches the weather data.
    4. The API result is fed back to the model, which then produces a natural, friendly final answer.
    5. The model can also generate executable code (e.g. Python) to run in a sandboxed environment.
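The dispatch step in the get_weather example above can be sketched as follows. The model call and the weather API are stubbed, and the JSON tool-call format is illustrative rather than any specific provider's exact schema.

```python
import json

def get_weather(location: str) -> str:
    # Placeholder for a real weather API call.
    return f"62°F and foggy in {location}"

# Registry mapping tool names (as the model emits them) to functions.
TOOLS = {"get_weather": get_weather}

# What a model might emit for "What's the weather in San Francisco?"
model_output = json.dumps(
    {"tool": "get_weather", "arguments": {"location": "San Francisco"}}
)

def execute_tool_call(raw: str) -> str:
    # The external software layer: parse the structured output and run it.
    call = json.loads(raw)
    func = TOOLS[call["tool"]]
    return func(**call["arguments"])

observation = execute_tool_call(model_output)
print(observation)  # this result is fed back to the model for the final reply
```

The key point is that the model never executes anything itself; it only emits structured text, and the surrounding program decides what to actually run.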

Core Ideas of Agentic AI

Defining an Agent

An agentic AI system keeps a language model at its core but goes beyond simple text-in, text-out. It can interact with an environment, acting through tool usage or information retrieval, feeding external feedback back to the model as observations, and storing the interaction history in memory.

Core Capability: Reasoning and Acting (ReAct)

The key to agentic AI is combining reasoning and action:
* Reasoning: The model first analyzes the task, decomposes it into smaller, executable steps, and forms a plan. This can be elicited with prompting techniques such as chain-of-thought.
* Action: Following the plan, the model gathers information or performs operations by calling tools (e.g. RAG, search engines, calculators, APIs).

Workflow and Benefits

The agent completes complex tasks through an iterative loop: plan -> act -> observe -> reason -> update the plan.

"By combining reasoning and action, the model can accomplish far more complex tasks than a simple input/output interaction."
The benefit of this pattern is that the very same language model, placed in an agentic framework, can solve complex problems it could not handle in a single direct request, thereby "pushing the boundary of AI capability."
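A skeleton of this plan -> act -> observe -> reason loop is sketched below. `llm_decide` is a stub standing in for the model's reasoning step (a real agent would prompt an LLM with the task and observation history); the single calculator tool and the task are invented for illustration.

```python
def llm_decide(task: str, history: list) -> dict:
    # Stub for the LLM reasoning step: with no observations yet, it plans a
    # calculator call; once an observation exists, it finishes.
    if not history:
        return {"action": "calculator", "input": "17 * 23"}
    return {"action": "finish", "input": f"The answer is {history[-1]}"}

def calculator(expr: str) -> str:
    # Tiny tool: handles "a * b" expressions only.
    a, op, b = expr.split()
    return str(int(a) * int(b)) if op == "*" else expr

TOOLS = {"calculator": calculator}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):            # iterate: reason, then act
        step = llm_decide(task, history)  # reason / plan
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])  # act
        history.append(observation)       # observe: feeds the next step
    return "Gave up after max_steps."

print(run_agent("What is 17 * 23?"))  # → "The answer is 391"
```

The loop structure, not the stubbed decision logic, is the point: each iteration appends an observation that the next reasoning step can condition on.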

Key Agentic AI Design Patterns

  1. Planning: The starting point of an agentic workflow. The model is asked to decompose a complex task into a series of clear subtasks, preparing for the tool calls that follow.
  2. Reflection: A pattern of self-correction and iterative improvement. After the model produces an initial output, the model is called again (or with a different prompt) to critique and evaluate that output, and the feedback is used to generate a better result.
    • Example - code refactoring:
      1. Step 1: Ask the model to "review this code and provide constructive feedback."
      2. Step 2: Give the model the original code together with the feedback it generated, and ask it to "refactor the code according to the feedback."
  3. Tool usage: (as described above) Interact with external APIs, databases, code-execution environments, and so on, to fetch real-time information or perform concrete operations.
  4. Multi-agent collaboration: Decompose a complex task and assign the pieces to multiple agents with different "roles" or "specialties." Each agent's role can be defined through a specific prompt (persona), and the agents work together.
    • Example - smart-home automation: Create a "climate-control agent," a "lighting-control agent," and so on; each handles tasks in its own domain, and they interact through a coordinator.
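The two-step reflection example for code refactoring can be sketched as below. `call_llm` is a hypothetical placeholder for any chat-completion API, stubbed so the flow runs standalone; the prompts follow the two steps listed above.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call.
    return f"[model response to: {prompt.splitlines()[0]}]"

def reflect_and_refactor(code: str) -> str:
    # Step 1: ask the model to critique the code.
    feedback = call_llm(
        "Review this code and provide constructive feedback:\n" + code
    )
    # Step 2: feed the original code plus the critique back for the rewrite.
    return call_llm(
        "Refactor the code below according to the feedback.\n"
        f"Code:\n{code}\nFeedback:\n{feedback}"
    )

improved = reflect_and_refactor("def add(a,b): return a+b")
print(improved)
```

The same two-call shape generalizes beyond code: any generate-then-critique-then-revise task fits this pattern.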

Selected Q&A Insights

  • Agent evaluation strategy: The traditional "LM as a judge" approach can be strengthened with "agentic judging." Using the reflection pattern, one LM produces a verdict and another LM (e.g. acting as a "senior engineer") reviews that verdict, yielding a more reliable evaluation.
  • Ethics and guardrails: Hallucination is hard to eradicate because generation is probabilistic by nature. Application builders should put up guardrails, for example using a small classifier or LM to screen both input queries and model outputs and filter out inappropriate or harmful content.
  • Getting started: The speaker recommends to "start simple, experiment, then iterate."
    1. First experiment with prompts quickly in a model provider's playground environment.
    2. Once familiar, call the model programmatically through the API to build an intuitive feel for its behavior.
    3. After accumulating hands-on experience, decide whether to adopt more complex agentic frameworks or libraries.

Conclusion

Agentic AI is a natural evolution and powerful extension of language model usage. It does not replace foundational practices such as prompt engineering; rather, it builds more sophisticated workflows on top of them. By positioning the language model as a central "reasoning engine" or "smart intern" and giving it the ability to use tools and interact with the outside world, agentic AI can solve previously out-of-reach complex, multi-step problems in a structured way (through planning, reflection, and so on).