CS 194/294-196 (LLM Agents) - Lecture 3, Chi Wang and Jerry Liu

Future AI Applications and Agentic Programming: From Task Automation to Multi-Agent Collaboration


Transcript

speaker 1: Okay, so yeah, here's the agenda. It's very simple. There are two parts in general, and I will dive into the AutoGen framework. For both parts, there are two big motivating questions I want to cover. Number one is: what are the future AI applications like? And number two is: how do we empower every developer to build them? So let's start with the first question. What are the future AI applications like? Starting from around the year 2022, we saw the big power of language models and other generative models. With generative AI, people have seen a superior ability to generate content like text and images. They are much better than the generative techniques from over 15 years ago, when I was doing my PhD and studied things like topic modeling and other older generative modeling techniques. Apparently these days the new techniques have passed a bar where we can think about more creative applications and higher levels of AI capability. So what's next? What is the best way to leverage these generative AI techniques? Starting from early last year, we began to think about that question, and we made a lot of technical bets along the way. The most important bet is that we think future AI applications will be agentic, and that agents can be a new way for humans to interact with the digital world and help us execute more complex tasks at higher and higher levels of complexity. We got a lot of pushback at that time; many people still had doubts about whether agentic AI can be a viable notion. But as time goes by, we are seeing more and more confirmation and evidence, including, earlier this year, an article published by Berkeley talking about how we're observing more and more AI results shifting from using a single language model to building compound AI systems. So that's an observation similar to the technical bet we made. If we think about examples of agentic AI, you'll notice that building AI agents like personal assistants, chatbots, or gaming agents is not a new notion. But what's new is that with the power of the new generative AI techniques and language models, we're able to build some of the old agentic applications much more easily and make them much more capable. At the same time, we are also seeing very new, novel agentic applications that go beyond the imagination we had before, such as science agents that can do scientific discovery automatically, web agents that can automate web browsing and web automation tasks, and software agents that can build software from scratch. I bet you have seen a lot of demos before, but today let me show you a very recent demo I just got this morning from a startup called Zinley. This is an example of building a website to extract models from Hugging Face and download them automatically. After you give the request to the AI, it starts working on it automatically, starting from analyzing the task, looking for information, and installing the necessary dependencies. They do use a multi-agent framework, with agents of different roles finishing the task, and it can carry out multiple steps. Now we are seeing that the AI is creating different files. All the reasoning and communication between the agents is done automatically using both language models and tools. The agents are talking to each other. Now they are developing the website. And finally, they are trying to build and compile the website.
Okay, here we go. This is the website built by the AI. It looks like it's functioning: many models are listed on the website, and users can search for models, export model data, and download the models. It looks like the video is playing at a slower speed than normal; I'm not sure if that's because of an Internet issue. Furthermore, what if we make a mistake on purpose? For example, if we remove a very critical line of code from the files that were created, let's see what happens. So we run the AI again to execute the same task. This time, the AI is again trying to perform the same task, but because of the removal of that critical line of code, we're supposed to hit some error. Yeah, so we see the error message there: missing script. Let's see what the AI will do. Within just a few steps, the AI is able to correct the mistake, fill in the exact line that was removed, and complete the task. That shows some self-healing and self-recovery ability of the AI agents. Hopefully that gives us an idea of the potential and the kind of promise AI agents can offer in the future: we may build software in a totally different way. To summarize the key benefits of agentic AI: number one, as many people understand, AI is able to use a more natural interface to talk with humans. You can tell it exactly what you want; in this case, just tell it we want to build a website with certain requirements, and then you can iterate over natural language. Also, by giving the agents more capabilities, they are able to operate and finish more complex tasks with minimal human supervision, which has tremendous value in terms of automation. And the third point, I think, is less talked about, but I want to strongly emphasize it: agentic AI can also be a very useful software architecture that enables us to build software in a totally different way, so that you can have multiple agents working with each other and finishing much more complex tasks in a recursive way. I will use a simple example to go over these benefits. Let's look at one particular example application about solving supply chain optimization on the cloud. This application was built by my former colleagues from Microsoft Research, and it allows users like coffee shop owners to answer questions like: what if we change the shipping constraints, how would that affect my operating cost? This is a difficult question. They can't get answers from ChatGPT, because the AI needs to understand the user-specific data, constraints, and even some optimization tools to be able to understand the question and get the answer. But in this case, they use AutoGen to build three agents that can solve this question very nicely. The three agents are commander, writer, and safeguard. Let's see how that works. Initially, the user sends a question: what if we prohibit shipping from supplier 1 to roastery 2? The commander receives that question. Before it tries to answer, it holds the current conversation and initiates a nested chat between the commander and the writer. The writer, in this case, uses a large language model as a backend to try to understand the question and propose some code that can be used to answer it. In this case, it shares some Python code that adds the specific constraint into the optimization problem and returns that code to the commander. The commander sees that code, again holds the conversation with the writer, and starts another nested chat between the commander and the safeguard.
The safeguard is another agent that uses a language model to check whether the code is safe. In this case, it returns safe. The commander finishes that conversation with the safeguard and starts executing the code. Here's the code execution result: it runs the optimization program with the updated constraint and checks the new result. The commander then goes back to the conversation with the writer and sends the execution result back. After the writer sees it, it tries to come up with the final answer. This is a simple example of using multiple agents and multiple steps to finish the task. In general, there can be all sorts of problems. For example, the code can be unsafe, in which case the commander will not execute it. And in case there's an execution error, the commander will return the execution error back to the writer, and the writer needs to rewrite the code. So in general, there can be multiple turns back and forth to finish a complex task, but here we are just showing the simplest case, which is the smoothest trace. And you can tell how the end user is still able to use a relatively simple interface. They don't need to know there are multiple agents running at the back end. They just ask a simple question and get an answer in natural language back. The user doesn't need any knowledge about coding or optimization. So this is an example where, by using multiple agents, we allow the end user to achieve tasks that are much harder than before. Now if you look at it from the programming perspective, how do they construct such a program? The way they do this is also very different from traditional programs. There are several steps. The first step is to create these agents. In this case, they are creating the writer, safeguard, commander, and user, using one line for each. Then they need to define the interaction patterns among these agents. Here they register the nested chats of the commander. For each nested chat, we need to define which agents are involved, who is the sender, who is the receiver, and what the summarization method of that conversation is. Their behavior patterns are defined this way. And finally, we just initiate the chat from the user proxy agent to the commander agent, representing the initial task requirement. Every other step then follows automatically. The framework handles all the steps and nested chats and eventually returns the result. So hopefully that gives you the idea of what agentic programming means. I'm using two slides to summarize the benefits. Number one is that using these multiple agents enables us to handle more complex tasks and improve the response quality, for multiple reasons. Multiple agents talking to each other can be a very natural way to make improvements over interactions. We can divide and conquer a complex task, decompose it into smaller tasks, and get a higher-quality result for each step. And finally, we can define agents that are not necessarily based on large language models. We can have special-purpose agents that perform grounding or validation using knowledge outside of the models, so that we can address the weaknesses inherent in the models. If you look at the chart on the right, it's an experiment about what happens if we decompose the task into two agents, the writer and the safeguard, versus putting all the instructions in one single agent: how is the performance on the safeguard side?
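Before turning to that comparison, the construction just described (create the agents, register the commander's nested chats, then initiate the chat) can be sketched roughly as follows. This is a minimal sketch assuming the pyautogen package and a configured llm_config; the prompts, work directory, and exact nested-chat arguments are illustrative and may differ from the actual supply-chain application and across AutoGen versions.

```python
import autogen

# Assumed LLM configuration; replace with your own API settings.
llm_config = {"config_list": [{"model": "gpt-4"}]}

# Step 1: create the agents, roughly one line each.
writer = autogen.AssistantAgent(
    name="writer",
    system_message="Write Python code that updates the optimization model to answer the user's what-if question.",
    llm_config=llm_config,
)
safeguard = autogen.AssistantAgent(
    name="safeguard",
    system_message="Check whether the proposed code is safe to run. Reply SAFE or DANGER.",
    llm_config=llm_config,
)
commander = autogen.UserProxyAgent(
    name="commander",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)
user = autogen.UserProxyAgent(name="user", human_input_mode="ALWAYS", code_execution_config=False)

# Step 2: define the interaction pattern by registering nested chats on the commander.
commander.register_nested_chats(
    [
        {
            "recipient": writer,
            "message": "Write code to answer the user's question above.",
            "summary_method": "last_msg",
            "max_turns": 2,
        },
        {
            "recipient": safeguard,
            "message": "Check whether the code above is safe to execute.",
            "summary_method": "last_msg",
            "max_turns": 1,
        },
    ],
    trigger=user,  # the nested chats fire whenever the commander receives a message from the user
)

# Step 3: initiate the chat with the task; the framework handles the rest.
user.initiate_chat(commander, message="What if we prohibit shipping from supplier 1 to roastery 2?")
```

The trigger argument mirrors the "hold the conversation, consult another agent" behavior described above: whenever the user proxy sends the commander a task, the writer and safeguard chats run first and their summary shapes the commander's reply.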
So if we compare the performance, we see that for GPT-4, the multi-agent setup has about 20% higher recall than the single-agent setup. And for GPT-3.5, the difference is even bigger. This indicates that for certain scenarios, it is beneficial to decompose the task and have agents perform relatively simple tasks, compared to asking one agent to do too much work in one try. And this depends especially on the task complexity and the model capability: in general, the more complex the task and the weaker the model, the stronger the need for these multi-agent workflows. Another perspective is the programming perspective. In general, it's easier to understand, maintain, and extend the system if you have a modular design. For example, if you want to keep most of the agents unchanged but just change how the safeguard behaves, say switching from using a language model to using tools or a human, it's easy to do that with a modular design. And this allows natural human participation, because a human can take over any of the agents in the system, and by having natural conversations with the other agents, there's no need to change how the human interacts with them. Combining these benefits, we see that agentic programming has promising potential to enable fast and creative experimentation and to build novel applications. But it's not easy to design a framework that can deliver on all these promises. In general, we need to consider several factors to design such a framework. We want a unified agentic abstraction that can unify all these different types of entities. We want to accommodate all the flexible needs for multi-agent orchestration and satisfy all the different application needs. And we also need to make sure we have effective implementations of all the agentic design patterns. Let me use the next few slides to explain them. First of all, for the agentic abstraction, we want to unify the notions so that we don't have a hard time reasoning about all these different types of entities, such as humans in different roles, tools, or language models from different providers. In general, these are all needed for building a compound AI system, but if we can have a single concept to think about them, we can make it much easier to reason about them. Let me explain more in the rest of the talk. One immediate benefit is that we can then use multi-agent orchestration to build more complex applications, or compound AI systems. But to do that, we also need to think about all the different requirements of different agentic patterns of interaction. For example, sometimes developers want a more static workflow so that they can clearly define each step and the order of agents for the task. In other cases, if it's hard to know all the possible situations the agents will handle, then we want to give the agents more autonomy and enable them to create more dynamic workflows. Similarly, sometimes people want to use natural language to tell the agents to perform the task, and at other times they want to use a programming language to have more precise control. So there's a trade-off between flexibility and controllability. And there are many other trade-offs to consider. For example, when do you share the context among agents, and when do you isolate them, having a hierarchical setup, for example? And when do we want the agents to cooperate, and when do we want them to compete to finish the task better?
There are also considerations about centralized versus decentralized, and automation versus intervention. A good framework should be able to accommodate all these different requirements. Also, as people develop agents more and more, all sorts of good design patterns are emerging, and we need to check whether a framework can meet all these design pattern requirements. There's conversation: we want the agents to be able to deal with flexible conversations. There are also prompting and reasoning techniques like ReAct, reflection, chain of thought, and so on. More such reasoning techniques keep emerging, including Monte Carlo tree search based methods, and they are a very important part of agent design patterns. So is planning, and so is integrating with multiple models, modalities, and memories. It looks like a lot, right? It's hard to have a framework that considers all these different factors from the beginning. But if you think about this from first principles, you may wonder: is it possible to have a single design point to start with, such that we can derive all the other patterns from it? So, can anyone think, among these different design patterns, of one single design pattern that could accommodate all of the others? I believe different people may have different ideas and start from different points; you may pick any one of them. Here I want to mention one personal point of view, and that is conversation. This has a long history for me. Back in college, I learned that conversation is a proper way of making progress in learning and problem solving, and that was the takeaway. When I saw the power of ChatGPT, I immediately related that back to my past experience and decided: okay, let's try to use multi-agent conversation as the central mechanism and see how far we can go. It turns out we can go very far. Here I'm listing some examples of agentic AI frameworks. AutoGen, as I mentioned, is based on this multi-agent conversation programming, and it turns out to be very comprehensive and flexible; it's able to integrate with all the different design patterns I mentioned and also with other frameworks. You will hear more from Jerry about LlamaIndex. There are also some other LangChain-based frameworks such as LangGraph and CrewAI. They focus on different starting points: LangGraph, for example, focuses on providing graph-based control flows, and CrewAI focuses on providing high-level static agent task workflows. So that's a very brief overview of agent frameworks in general. Next I will go into the specifics of AutoGen and try to explain more about it. AutoGen is a programming framework for agentic AI. It was initially developed inside another open-source project called FLAML, which is for automated machine learning and hyperparameter tuning. Later we spun it off into a standalone repo, and this year we also created a standalone GitHub organization with an open governance structure, so everyone is welcome to join and contribute. And here's a brief history of the journey. Early last year, we started prototyping this agent framework, and initially we just built a multi-agent conversation framework with the capability of code execution.
In August, we published the first version of the research paper, and in October we moved the repo to a standalone GitHub repo. It got a lot of recognition from the community, and this year we also got a best paper award at the ICLR 2024 LLM Agents workshop. We are seeing more and more interest and use cases from the community, including both enterprise companies and startups. We also just offered an online course on DeepLearning.AI about AI agentic design patterns with AutoGen. So that's just a very brief overview. In general, AutoGen has very rich features covering all the different agentic design patterns, but today I mainly want to focus on the most essential concepts, because they are the key to understanding all the other patterns. We want any developer to be able to reason about a very complex workflow in simply two steps. The first step is to define the agents, and the second step is to get them to talk. Here we see two key concepts: one is the conversable agent, and the other is conversation programming. The agent concept in AutoGen is very generic and can abstract a lot of different entities. You can choose to use a language model as the backend for an agent, or you can use a tool or human input as the backend, and you can also mix multiple types of entities together. Furthermore, if you use the multi-agent conversation patterns like you see on the right-hand side, you can build agents that contain other agents inside them, have inner conversations, and then wrap them up behind a single interface that talks to other agents. In this way, you can nest multiple chats inside an agent and build more and more complex agents in a recursive way. Examples of conversation patterns include sequential chats, nested chat, and group chat. I will explain a few of them in the next few slides. The simplest one is the two-agent conversation. Even though this is a very simple conversation pattern, it already allows us to perform some advanced reasoning, such as reflection. For example, we can construct a writer agent for writing a blog post and another critic agent to make suggestions on the blog post, and you can have these two agents iterate and improve the quality of the blog post this way. Now, if you think that's not good enough, you can add more advanced reasoning techniques using nested chat. For example, instead of using a single language model call to do the critique, we can use a nested chat, and in the nested chat we construct sequential chats that contain multiple steps. We first send the message to an SEO reviewer, then to a legal reviewer, an ethics reviewer, and so on. Finally, you can have a meta reviewer that summarizes all the reviewers' comments in the different areas and gives the final comment. From the writer's point of view, it's still talking to a single critic agent, but underneath, the critic agent uses multiple other agents. That's the idea of how we can use nested chat to extend an agent's capability using conversations. Now think about tool use: how do we enable tool use through conversations? Here's an example of building a game of conversational chess. We want the AIs to be able to play chess while chatting with each other and making fun of each other. If you just ask two language model agents to play chess directly, they often make mistakes and make random moves on the board which are not legal, so the game is not watchable at all.
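A rough sketch of the two-agent reflection pattern (writer plus critic) just described, assuming the pyautogen API; the prompts and the number of turns are illustrative.

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4"}]}

writer = autogen.AssistantAgent(
    name="writer",
    system_message="You are a writer. Produce a concise blog post and revise it based on any feedback you receive.",
    llm_config=llm_config,
)
critic = autogen.AssistantAgent(
    name="critic",
    system_message="You are a critic. Review the writer's draft and give concrete suggestions for improvement.",
    llm_config=llm_config,
)

# The critic kicks off the task; each extra turn is one reflection round
# (writer drafts, critic critiques, writer revises, ...).
result = critic.initiate_chat(
    recipient=writer,
    message="Write a short blog post about why multi-agent conversations help with complex tasks.",
    max_turns=3,
    summary_method="last_msg",
)
print(result.summary)
```

The critic here is a single language-model agent; as described above, it could instead register nested chats to a sequence of SEO, legal, ethics, and meta reviewers, and the conversational-chess fix that follows uses the same nested-chat mechanism with a tool-backed board agent in place of a language-model critic.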
Now, the solution can be to add a third agent called the chessboard agent. This board agent is a tool-based agent: it uses a Python library to manage the chessboard and provides that tool to the other language model agents. The language model agents can then only make legal moves; otherwise they need to iteratively refine their moves until they are legal. In that way, we can make sure the game carries on nicely. And again, we're using a nested chat between a tool-proposal agent that uses a language model and a tool-execution agent that uses the tool as a backend, and having them talk to each other to finish that tool-execution functionality. There can be other, more complex workflow patterns for solving more complex tasks, using planning or a more dynamic approach like group chat. In a group chat, the users only need to define agents of different roles and put them in the group chat, and the system can automatically decide which agent speaks next, depending on the current progress of the task. There will be a group chat manager, which monitors the progress and selects the speakers. Furthermore, you can add constraints or fine-grained state machine transition logic about what order the agents should follow. You don't have to impose a very strict order; you can just give some candidates, for example: after agent A speaks, only agent B and agent C can speak, and not the other agents, things like that. So there's still some decision made by the language model and some autonomy there, but you can add certain constraints to make sure the selection is at least within scope. You can further add transition logic to tell the agents: when you see certain situations, you should go this route, otherwise go the other way. And you can use either natural language or a programming language to specify that. In general, there are many other types of conversation patterns and applications enabled by these building blocks. Feel free to check our website for more notebook examples; they are nicely tagged, so you can search for any question. For example, you can search for how to integrate AutoGen with LlamaIndex, and there are notebook examples for that. In the paper, we just showed a few very simple examples using simple conversation patterns, but in general we can build many more complex applications using these building blocks. I've seen developers building all sorts of complex applications. Here's an overview of the categories of domains we are seeing from the community. The top two categories are software development and agent platforms, followed by research, data processing, and a long tail: web browsing, finance, healthcare, education, and even blockchain, which is not shown in this list. Here I want to highlight a few interesting examples. The first one is in the science and engineering domain. Professor Markus Buehler from MIT has done multiple pieces of work in different science domains, including protein design and material design, using AutoGen to build teams of agents that simulate the behavior of a scientific or engineering team, all the way from collecting data and generating hypotheses to conducting experiments and verifying the hypotheses. They have also built a work called SciAgents, which can use an ontological knowledge graph to reason about a scientific domain, make interesting connections, and do advanced reasoning using that knowledge graph.
They construct a team of agents with multiple roles, and the agents start from understanding the ontology and then go through multiple rounds of generation and critique to finish a very complex workflow, including generating all the different concepts and possible hypotheses and then selecting the best of them. This is a very promising use case. I see that in the future it has the potential to accelerate scientific discovery, and maybe soon after, we'll have AI-designed medicines, AI-designed architectures, or interesting alloys and materials. Here's another domain: web agents. This is an agent called Agent-E developed by a startup called Emergence AI. They use AutoGen to build a hierarchical agent team so that they can perform very complex tasks on the web, like automatically booking flight tickets or automatically filling in forms for medical clinics. This leverages a planning agent and a web browsing agent, with a deeper understanding of the content from the HTML DOM. They haven't used any multimodal models, so they only leverage the HTML content, not images and vision models, but they already achieved state-of-the-art performance on the WebVoyager benchmark and outperform previous techniques that use multimodal models. There's still large room for improvement; you can see the overall success rate is just 73%. So by combining multimodal models and more agentic workflows, there's potential to get it even higher. But there are some very good foundational design principles we can learn from their work. And I want to share a very recent quote I got from a company doing construction. They're trying to use AutoGen to help users without special or expert knowledge finish their construction projects, and they mentioned the benefit of AutoGen as being able to rapidly explore many different agentic design patterns and configurations and conduct experiments at scale. To summarize, we have seen big enterprise customer interest from pretty much every vertical domain and got contributors and users from universities, organizations, and companies all over the world, including some contributors from Berkeley. There's a work called MemGPT created by Berkeley students, and they also have an integration with the AutoGen framework. We still see a lot of changes and progress in AutoGen. Here are just a few examples in several categories, including evaluation, interface, and learning, such as teaching and optimization. In the evaluation category, we are building agent-based evaluation tools to help developers understand how effective their application is and how good their agents are. It's a really difficult task, because there is a lot of text generated and it's hard to understand what is going on. But using agents, we're able to automatically come up with success criteria based on the application and task and then suggest scores for each dimension. Furthermore, we can extend that idea to improve agents over time, by providing the feedback to the agents themselves and having agent-based optimization, learning, or teaching capabilities. One central piece of AutoGen is still to improve the programming interface to make it even easier for people to build all sorts of agentic applications. So I want to talk about one particular piece of research that excites me. This is called AutoBuild. One remaining question for many developers is: what is the most effective multi-agent workflow I should use?
What agents should I create for my specific task? AutoBuild is designed to address that issue as an initial attempt. It works this way: the user first provides a task describing the high-level requirements, and the system suggests agents of different roles automatically, and these agents can be put together in a group chat to solve the task. For a new, more specific task, we can reuse these created agent teams to solve it without the user needing to specify which agents to use. We can further extend that idea from a static agent team to an adaptive agent team; there's a technique called adaptive build. In this case, we first decompose a complex task into smaller steps, and then for each step we propose a specific agent team for it. We can either choose from an existing library of agents or decide to create new agents that didn't exist before. After finishing one step, we check what new agents we need for the next step, so these agents can be dynamically selected and created. As we create more agents, we can also add them back to the library to improve the overall system over time. We ran experiments on several different benchmarks, including MATH, programming, data analytics, and so on, and found very promising results that outperform previous techniques of a similar kind. This is just one particular example of the research we're doing. There are many more challenging questions we still need the community to work on together. The biggest question is how to design the optimal agentic workflow for any application, considering multiple factors like quality, monetary cost, latency, and manual effort. In general, we still want to improve agent capabilities in terms of reasoning, planning, multimodality, and learning. We also want to ensure scalability and make sure humans have a good way to guide the agents and guard their safety, and so on. So that's the end of my lecture. I want to acknowledge all the open source contributors. You can find our Discord community, which is very large, and the new GitHub organization. I'm happy to follow up with questions. Thank you very much.
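The group-chat pattern described earlier, with a manager that selects the next speaker under allowed transition constraints, can be sketched roughly as follows. This is a minimal sketch assuming the pyautogen GroupChat and GroupChatManager classes; the roles, transition constraints, and task are illustrative, not taken from the talk.

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4"}]}

planner = autogen.AssistantAgent("planner", system_message="Break the task into steps.", llm_config=llm_config)
engineer = autogen.AssistantAgent("engineer", system_message="Write code for the current step.", llm_config=llm_config)
executor = autogen.UserProxyAgent(
    "executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# Constrain who may speak after whom; the manager's LLM still picks among the allowed candidates.
allowed_transitions = {
    planner: [engineer],
    engineer: [executor],
    executor: [planner, engineer],
}

group_chat = autogen.GroupChat(
    agents=[planner, engineer, executor],
    messages=[],
    max_round=12,
    allowed_or_disallowed_speaker_transitions=allowed_transitions,
    speaker_transitions_type="allowed",
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

# Illustrative task; the manager monitors progress and selects the next speaker each round.
executor.initiate_chat(manager, message="Plot the number of downloads of the top 5 Hugging Face models.")
```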
speaker 2: Hey everyone. That's a great talk by Chi from AutoGen. And yeah, I hope to build upon that. This talk is a bit less generic; instead of covering a lot of different agent architectures, it's actually about a specific use case: building a multimodal knowledge assistant. So it's really focused on some principles around RAG and how you extend that into building something like a research agent, and it's a use case that we've been exploring pretty deeply as a company. But of course, there's plenty of agent stuff out there. You should definitely check out AutoGen, check out LlamaIndex, try building stuff on your own, and see what you can come up with. So let's get started. First, if you're not familiar with LlamaIndex: LlamaIndex is a company that helps any developer build context-augmented LLM applications from prototype to production. We have an open-source toolkit; it's a developer toolkit for building agents over your data, and different types of LLM applications. There's a lot of people building agents. We started off helping people build RAG systems, and now we're going into territory where people are building slightly more advanced stuff, using LLMs for multi-step reasoning as opposed to just single-shot synthesis and generation. We also have an enterprise product, which I probably won't talk about too much today, but it's basically a managed service to help you offload your data indexing and RAG and all that stuff. I'll talk a little bit about one specific piece of this, which is document parsing, or data parsing. We think this is a pretty important piece in any sort of context-augmented pipeline. So let's get started. How many of you know what RAG is? Yes. So one part of RAG is you have a database, a database of some knowledge. The way it works is you take basically just a bunch of text, chunk it, and then, given the chunk context, embed it and put it into, for instance, a vector database, and then you do retrieval from that database. We'll talk a little bit about that as well. The overall goal of building a knowledge assistant is that a lot of companies have these types of use cases where they have a lot of data. They have a million PDFs, a bunch of PowerPoint presentations, Excel files, and you want to build some interface where you can take in some task as input and give back an output. That's really it. If you think about a chatbot, that's basically a chatbot. So you have a lot of data, you want your LLM to basically understand that data, and then you want the LLM to do stuff with that data. An example could be generating a short answer. It could also be generating a structured output or a research report. It could take actions for you, like sending an email; it could schedule a calendar meeting, write code, do a lot of these things. We talk a lot about RAG as a company, and especially if you're just getting started, if you follow RAG 101 on how this works, we call that basic RAG. So what is basic RAG? Basic RAG is you take your unstructured data, you load it with some standard document parser, and then you chunk it into every thousand tokens or so.
You just slice that text up into a bunch of slices, and then you stuff each slice into an embedding model like OpenAI embeddings and put it into a vector store. Then when you do retrieval from this database, you typically do semantic search or vector search to return the most relevant items from this knowledge base and stuff them into the LLM prompt window. This entire pipeline is what I call a basic RAG pipeline. It will kind of work, in being able to answer basic questions you have over that data. However, there are a bunch of limitations. The limitations we list here are: one, the data processing layer is pretty primitive. When you do the chunking, you're not really taking into account all the different elements within that data; you're not taking into account whether there are tables or images or weird sections, and whether you want to semantically preserve different chunks together. Two, you're only using the LLM for synthesis, so you're not using it for any sort of reasoning or planning. If you think about all of what Chi just talked about with respect to agentic coordination and so on, you're doing none of that with a basic RAG pipeline. So it's kind of a waste of LLM capabilities, especially with the latest models like o1 or Claude 3.5; a lot of these models have much greater capabilities than just being able to summarize over a piece of text. So you want to figure out how to actually use the model capabilities to do more advanced reasoning or planning. The other piece is that standard RAG pipelines are typically one-shot, so they're not personalized. After you ask a question, it's going to forget about it, and so every new interaction is basically stateless. This has certain advantages, but if you're building a personalized knowledge assistant, ideally you're able to add a memory layer to it. So a lot of what we ask ourselves is: can we do more than a basic RAG pipeline? There are a lot of questions or tasks that a naive RAG pipeline can't give an answer to. This leads to hallucinations for the end user, and having a chatbot that can't answer 80% of the questions you might want to ask is a limited value add. So how do you build a more generalized knowledge assistant that can take in questions of arbitrary complexity and answer them over arbitrary amounts of data? We think a better knowledge assistant has four main ingredients. Actually, specifically, the focus of this talk is on how you build a multimodal knowledge assistant. So instead of just reasoning over a standard text file, how do you reason over an entire research report with a lot of diagrams, pictures, and images; how do you reason over all the visual data that exists on the Internet in addition to just text? The first piece is that we need a core, high-quality multimodal retrieval pipeline. Two, we want to generalize the output and think about something a little more complex than your standard chatbot response: generating a research report, doing data analysis, taking actions. Three is agentic reasoning over the inputs.
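Before unpacking those ingredients, here is roughly what the basic RAG pipeline criticized above looks like in code. This is a minimal sketch assuming the llama_index.core API with its default OpenAI embedding and LLM settings; the data directory and query are illustrative.

```python
# pip install llama-index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load and parse documents from a local folder (illustrative path).
documents = SimpleDirectoryReader("./data").load_data()

# Chunk the text, embed each chunk, and store it in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieve the top-k most similar chunks and stuff them into the LLM prompt for one-shot synthesis.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the report say about Q3 revenue?")
print(response)
```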
So agentic reasoning is where, instead of just taking in the user question and only using the LLM for synthesis, you apply chain of thought, tool use, reflection, all that fancy stuff, to try to break down the question, do some planning, and actually work step by step towards an overall goal. And last is deployment. We'll see how long I have; I plan to talk for maybe 15 or 20 more minutes, so I'll just cover some of the high-level details, and some of the actual examples are basically linked in the slides in case you want to check them out. The first piece is setting up multimodal RAG. If you're familiar with RAG, you might be familiar with RAG over text data, but what we're really interested in is having RAG actually operate over visual data. And by visual data, I don't just mean a JPEG file. Even your PowerPoints, for instance, or a research paper from arXiv, are going to have charts, diagrams, and weird layouts. The issue with a lot of standard RAG pipelines is that they do a terrible job at actually extracting that information for you. Like I mentioned, any LLM or RAG or agent application is only as good as your data processing pipeline. If you're familiar with the garbage in, garbage out principle in traditional machine learning, for LLM application development it's no different. So you basically want good data quality as a necessary component of any production LLM app. This ETL layer for LLMs consists of: you want to do some parsing of the document, you want to figure out a smart way to chunk it, and then you need a smart way to index it and put it into the LLM prompt window. This data source that I talked about, this case study of complex documents, is a pretty common data format that we see across a lot of different companies. A lot of documents can be classified as complex: embedded tables, charts, images, irregular layouts, headers and footers. And a lot of times, when you apply off-the-shelf components to parse this data, it ends up in a broken format, and the LLM hallucinates the answer. Users want to ask different types of questions over this data. So you have a bank of, say, PDFs, and you want to ask questions over it. It could be simple pointed questions; it could be multi-document comparisons; it could be longer-running research tasks. A research task could be: given these ten arXiv papers around, say, LLM quantization, generate a condensed summary or a survey paper. That's something a bit longer or higher-level in nature compared to a simple search and retrieval task. So we'll start with the basics, which is just parsing. Ideally, a document parser can actually structure this complex data for any downstream use case. I won't talk about it too much because the goal of this talk is really about agents. But without needing to know the internals of document parsing, you basically need a good PDF parser, because if you have a bad PDF parser, then you're going to load in some PowerPoint or PDF and it's not really going to extract the right text from that PDF.
And then when you feed in text that's been hallucinated by the parser, the LLM is going to have a really hard time understanding it, no matter how good the LLM is. So ideally, you want a parser that can parse out text chunks, tables, diagrams, all that stuff, in semantically consistent ways. That is one of the things that we do: we make a pretty good GenAI-powered PDF parser. We're at something like 30,000-plus users right now, and if you're interested in trying it out, everybody gets something like a thousand free pages per day. It's used at everything from small companies to large enterprises. So you want to have a good parser. What this enables is an important piece of being able to structure your data in the right way, if you think about different types of data. Here's maybe an investor slide deck, here's a 10-K annual financial report, here's an Excel sheet, here's a form. Having that parsing and extraction step to extract stuff into cleanly formatted data makes it just much easier for LLMs and retrieval processes to understand it afterwards. Once you've actually parsed a PDF into its constituent elements in a good way, you can then leverage hierarchical indexing and retrieval to do something fancier than your standard RAG pipeline. Given this document structure of text chunks, tables, and diagrams, a standard RAG pipeline will basically try to embed each of these chunks directly. But we found that a better approach is to extract a bunch of different representations that point to the source chunk. For instance, for a table, you might want to extract a variety of different summaries that point to that table. For a picture, you can't feed a picture into a text embedding model anyway, so you need to use some model to extract a variety of different summaries that point to that picture. For bigger text chunks, you might want to extract smaller text chunks that point to that bigger text chunk. Once you extract these representations, we call them nodes, because these are the things that will be embedded and indexed by the vector database you're using. So if you're using a vector database like Pinecone or whatnot, you can basically extract and index this metadata that is associated with the source element but is not the direct source element itself. Then during the retrieval process, given a user question, it will first retrieve the nodes, and because the nodes have a reference to the source document, you can basically dereference it and then feed the resulting element into the model. Notice that most models these days are multimodal in nature. If you look at o1, GPT-4o, Claude 3.5 Sonnet, or the latest Gemini models by Google, they can take in both text and images. And so the nice thing here is that you can still use text embedding models to represent the element itself, but when you actually feed stuff into the LLM, you're able to feed in both text and images. What I just outlined is a basic way of building a multimodal RAG pipeline. A multimodal RAG pipeline can take in any sort of document, which could have different types of visual elements in it, and it will store both text and image chunks.
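A rough sketch of the summary-to-source-chunk indexing just described, assuming llama_index.core's IndexNode and RecursiveRetriever; the table content, IDs, and query are made up, and the same idea applies to image chunks, with a multimodal model writing the summaries.

```python
# pip install llama-index
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# Pretend this came out of a document parser: one big table chunk (illustrative content).
table_chunk = TextNode(text="<full markdown of a revenue-by-region table>", id_="table-1")

# Index a small summary that points back to the source chunk instead of embedding the raw table.
table_summary = IndexNode(
    text="Table summarizing 2023 revenue broken down by region and product line.",
    index_id=table_chunk.id_,
)

# Embed only the summary nodes.
index = VectorStoreIndex([table_summary])

# At query time, retrieve the summary node, then dereference it to the underlying source chunk.
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": index.as_retriever(similarity_top_k=2)},
    node_dict={table_chunk.id_: table_chunk},
)
nodes = retriever.retrieve("Which region had the highest 2023 revenue?")
print(nodes[0].node.get_content())
```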
In order to index the image chunks, you could use CLIP embeddings, or you could do what I said, which is use a model to extract text representations and link the text representation to the image chunk. And so when you set this up, then during retrieval, you return both the text and the image chunks and you feed both to a multimodal model. I'm going to skip this example, but there is a basic example here that shows you how to build a standard multimodal RAG pipeline. Notice that up until now, I haven't really talked about agents yet. This is just setting up the basics of multimodal RAG, so you don't get the benefits of chain of thought or reasoning or tool use quite yet. All this is doing is saying: given a specific question you have about maybe a more complex dataset like research reports or slide decks, you're able to ask questions and get answers over the visual elements on the page. So the next piece I might actually skip just due to time, but the high-level idea is that a lot of the promise of agents is that they won't just give you back a response in the form of a chat response, but will actually generate entire units of output for you: producing its own PowerPoint or PDF, or taking actions for you. Do you guys use Claude, like ChatGPT? Anyone use Claude? You know how when you ask it to write a paper for you, for your Berkeley essay, it'll actually generate an entire thing on the side? That's an example of report generation, right? It'll actually give you a thing that you can directly copy and paste into something that you then later edit. This is a pretty common use case we're seeing in the enterprise, too. A lot of consultants and knowledge workers are interested in generalizing beyond the capability of just giving back an unformatted response, and instead getting something that you can directly use on its own, whether that's code or reports. It's all very interesting, and I'll probably skip some of the architectures of how you actually build this for now. And now we talk about agentic reasoning over your inputs. This is the third section. So we have a multimodal RAG pipeline in place; now let's add some layers of agentic reasoning to basically build agentic RAG. Naive RAG works well for pointed questions but fails on more complex tasks, and again, this is due to all the reasons I mentioned above: you're just retrieving a fixed number of chunks, and you're not really using the LLM to break down the question in the beginning. So if you ask summarization questions where you need the entire document instead of just a set of chunks, comparison questions where you actually need to look into two or three or more documents, multi-part questions, same thing, or a high-level task, you don't really get good results with a standard RAG pipeline. So there's a wide spectrum of different types of agentic applications you can build, and I think Chi probably gave much better architectures of how multi-agents can collaborate with each other and achieve very advanced things.
The way we think about it is that there are both simple and advanced agentic components. At the far end, you basically have entire generalized agent architectures. This includes a ReAct loop, for instance, which is one of the most common agent architectures these days; it came out about two years ago and basically just uses chain of thought plus tool use to give you a generic agent architecture. You can plug in whatever tools you want and it'll roughly try to reason over them to solve the task at hand. This also includes LLM Compiler, which is another paper; it generalizes a bit beyond ReAct by doing some pre-planning. So instead of just planning the next step one at a time, it will actually plan out a DAG, optimize it, run it, and replan periodically. What we actually see a lot of people build these days is: some people do use ReAct, it's a pretty easy architecture to get started with, but other people just take some of the existing components and build more constrained architectures. Partly this is just due to the desire for reliability. Even though it's less expressive, even though it can't do everything, some people are still building up that trust towards AI, and so they're trying to solve a specific use case in the beginning. By solving this specific use case, they can leverage more specific components and try to solve it in a more constrained fashion. This includes, by the way, tool use, maybe leveraging a memory module, function calling. We see at a lot of places that people are very interested in structured output generation, tool use, being able to call an existing API, and then also doing some basic query decomposition, whether that is chain of thought or, in parallel, given a question, breaking it down into a bunch of different sub-questions. This overall thing we call agentic RAG, because it really is just an agent layer on top of RAG. If you think about RAG, or retrieval from a vector database, as a tool, you can think about an agent that operates on top of these tools. So instead of taking a query and directly feeding it to the vector database, you first pass it through this general agent reasoning layer, and this reasoning layer can decide to do a bunch of things to the query and also decide which tools to call in order to give back the right response. The end result is that you're able to build a more personalized QA system that can handle more complex questions. And this is an example of what I mean by unconstrained versus constrained flows. A more constrained flow might just be: you have a task and you just have a simple router prompt. A router prompt is just an LLM prompt that selects one option out of N, and all it does is, given this task, feed the task to one of the downstream tools based on the decision of the router prompt, then feed it to maybe a reflection layer, and then give back a response. There are no loops in this orchestration. All that happens is it'll hit the router, go through a tool, go through another prompt that just reflects and tries to validate whether it's correct, and then generate a response. This I define as more constrained, because a lot of the control flow is actually defined by humans, by you guys, versus by the agent.
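A minimal sketch of a constrained router flow of the kind just described, assuming LlamaIndex's RouterQueryEngine; the reflection/validation step is omitted for brevity, and the directory, tool descriptions, and query are illustrative.

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("./reports").load_data()  # illustrative path

# Two "tools": one for pointed lookups, one for whole-document summarization.
vector_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
    description="Useful for answering specific, pointed questions about the reports.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=SummaryIndex.from_documents(docs).as_query_engine(),
    description="Useful for high-level summaries of an entire report.",
)

# A single router prompt picks exactly one tool per query; there are no loops in this flow.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router.query("Give me a high-level summary of the Q3 report."))
```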
And typically, the programs that are more constrained look like if-else statements and while loops that you actually write, instead of the agent. If you're using a more generalized agent architecture, like ReAct or LLM Compiler or Tree of Thoughts or whatever, then it's a bit more general, because you're basically saying: I don't actually know what specific plan I want the agent to follow; I'm just going to give the agent a bunch of tools and let it figure it out. This is more expressive, because it can technically solve a greater variety of tasks than you trying to hard-code that flow beforehand, but it's also less reliable. It might veer off and call tools that you really didn't want it to call, it might just repeat itself, get stuck in an infinite loop somewhere, or never converge. And it's also more expensive: typically these types of agent architectures use bigger prompts, you're stuffing in more tools at once, and so the marginal token costs are much higher. So we see somewhat fewer architectures being built with very wide, unconstrained flows. A good rule of thumb is, if you're interested in using ReAct or something, try to stuff in four or five tools and try to limit it to fewer than ten with the current models. We have core capabilities in LlamaIndex that basically help you build workflows. We call all these things workflows; they're basically all agentic in some nature. A very rough definition of an agent, and everyone has a different definition, you guys might disagree with me, is just a computer program that has a non-zero number of LLM calls. That's a very general definition, and we basically help you write those types of programs. So whether you define a very constrained program where you're writing the if-else conditions, or you're letting an agent handle that task, we basically have an event-driven orchestration system where every step can listen for a message, you can pass a message to a downstream step, and you can pass messages back and forth between two different steps. These steps can be just regular Python code, they could be LLM calls, they could be anything you want. At a certain point, the program stops and gives you back a response. We're building out this very fundamental, low-level orchestration because we believe there are some interesting properties of agentic behavior that are fundamentally a bit event-driven. And this also provides a nice base to help users deploy workflows to production, in the event that you want to translate your program into a Python service. So try it out. I think there are some links here, but this stuff is all linked in the docs. I'm going to skip this piece. And then, yeah, some use cases that maybe are interesting to cover, and the links are basically here in case you want to check them out. We're very interested in a report generation use case. This is, again, something that we see pop up across a lot of different companies: given a bank of data, you want to actually produce some output from that data. An example architecture for this is you have a researcher and a writer, and maybe a reviewer as well.
You could think about this as a multi-agent system, depending on how you define it. But basically you have a researcher that does a little bit of RAG: it retrieves relevant chunks and documents from some database, and maybe, given a task, it goes onto the Internet, fetches stuff, and stores it in your notes. Yeah, it'll put that stuff in a data cache; basically, this contains all the relevant information you need to generate the report. The second step is a writer, and this writer might use this data cache to then make an LLM call that will generate this interleaved sequence of text, image blocks, and tables and give you back a full output. So we have an example architecture here, and we also have some example repos where you generate an entire slide deck instead of just a report. Another use case, by the way, which isn't in these slides but is very interesting, is customer support. If you look at practical enterprise use cases of agents, customer support for external-facing use cases is probably number one. There's just a lot of automation that could be baked into the decision flow to increase your deflection rate and ensure that the user ends up having a much better experience than going through those automated phone trees. And so we see that popping up in a lot of different places too. And then the last bit is really around running agents in production. So far, if you start off building a lot of these components, you're probably going to start off with a Jupyter notebook, and that's totally fine. When you start building a prototype, it makes sense to do something that's very local, very narrowly scoped, and you basically see if it works over test data. An interesting design exercise is to think about what a complex multi-agent architecture looks like and how we can leverage existing production infrastructure components to achieve that vision of multi-agents in production. Ideally, if you think about agent one, agent two, agent three, every agent is responsible for solving some task, and they can all communicate with each other in some way. So you ideally encapsulate their behavior behind some API interface, and then you can standardize the way they communicate with each other through some sort of core messaging layer. You can easily scale up the number of agents in this overall system to add more to this multi-agent network, and you can also handle a large volume of client requests with different sessions. So this is basically what we're building, and it's a work in progress, but we've made a lot of progress in the past few months on how you actually deploy agentic workflows as microservices in production. You model every agent workflow as a service API. We allow you to spin this up locally and also deploy it on, for instance, Kubernetes. All agent communication happens via a central message queue. You can have human-in-the-loop as a service: for instance, if an agent actually needs your input, it'll send a message back to you, await your response, and then you give it an input before it resumes execution.
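A rough sketch of a two-step researcher-to-writer flow using the event-driven Workflow abstraction mentioned above; this assumes llama_index.core.workflow and the OpenAI LLM integration, and the research step is a stand-in (a real system would call a retriever and a data cache, and could add a human-in-the-loop step).

```python
import asyncio

from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step
from llama_index.llms.openai import OpenAI


class ResearchDone(Event):
    notes: str


class ReportFlow(Workflow):
    llm = OpenAI(model="gpt-4o-mini")  # illustrative model choice

    @step
    async def research(self, ev: StartEvent) -> ResearchDone:
        # Stand-in for the researcher: in a real system this step would run
        # retrieval over your document store (and the web) and fill a data cache.
        return ResearchDone(notes=f"Collected notes for task: {ev.task}")

    @step
    async def write(self, ev: ResearchDone) -> StopEvent:
        # The writer turns the cached notes into the final report.
        report = await self.llm.acomplete(
            f"Write a short report based on these notes:\n{ev.notes}"
        )
        return StopEvent(result=str(report))


async def main():
    result = await ReportFlow(timeout=120).run(task="Summarize recent work on LLM quantization")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```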
How many of you have seen the Devin demo, the one from Cognition? How many of you know what Devin is? All right, so about half of you. If you took a look at the demo, one thing it does is that this coding agent will just generate an entire repository for you, but sometimes it will stop, right? Sometimes it will say: I don't actually have enough clarity to give you the response, can you tell me what to do next? If you've played around with Devin, that's basically what it does. And that's an example of human-in-the-loop: this kind of interesting back-and-forth client-server communication where the server is actually waiting on the client to send a human feedback message. So that's basically it. All these components, I think, are step-by-step progress toward this idea of building a production-grade multimodal knowledge system over your data. And yeah, thanks.
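The pause-and-wait behavior described here can be illustrated with plain Python. The sketch below is a generic, hypothetical client/server hand-off using asyncio queues, not any particular framework's human-in-the-loop API.

```python
# Generic human-in-the-loop pattern: the agent "server" publishes a question
# and blocks until the human "client" sends an answer back. All names are
# hypothetical and purely illustrative.
import asyncio


async def coding_agent(questions: asyncio.Queue, answers: asyncio.Queue) -> str:
    # ... the agent works autonomously, then hits an ambiguity ...
    await questions.put("Should the repo use Poetry or plain pip?")
    human_answer = await answers.get()          # execution pauses here
    return f"Scaffolding the repo with {human_answer}."


async def human_client(questions: asyncio.Queue, answers: asyncio.Queue) -> None:
    question = await questions.get()
    print("Agent asks:", question)
    await answers.put("plain pip")              # in practice, real user input


async def main():
    questions, answers = asyncio.Queue(), asyncio.Queue()
    result, _ = await asyncio.gather(
        coding_agent(questions, answers),
        human_client(questions, answers),
    )
    print(result)


asyncio.run(main())
```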

Latest Summary (Detailed Summary)

Generated on 2025-06-07 15:42

CS 194/294-196 (LLM Agents) - Lecture 3, Chi Wang and Jerry Liu

Overview / Executive Summary

This lecture, delivered by Chi Wang and Jerry Liu, examines the construction and application of large language model (LLM) agents from two complementary perspectives: Chi Wang starts from framework design, introducing his general-purpose, flexible AutoGen framework, while Jerry Liu starts from concrete business applications, sharing practical experience in building end-to-end multimodal knowledge assistants.

Chi Wang first argues that the core trend in future AI applications is to become "agentic", i.e., to execute complex tasks through AI agents. He highlights his team's AutoGen framework, whose central design principle is multi-agent conversation programming: developers define conversable agents and orchestrate their interactions to flexibly build complex applications. Through case studies such as supply-chain optimization, scientific discovery (Science Agents), and web automation (Agent E), Wang demonstrates AutoGen's capabilities and broad application prospects, and introduces ongoing research such as AutoBuild, which aims to automatically build and optimize agent teams.

Jerry Liu then focuses on the high-value use case of building a "multimodal knowledge assistant" and presents LlamaIndex's approach. He first analyzes the limitations of basic RAG (retrieval-augmented generation) for complex data and tasks, and proposes four elements of a better knowledge assistant: high-quality multimodal retrieval, generalized outputs, agentic reasoning, and reliable deployment. He explains how LlamaIndex improves multimodal RAG quality through advanced document parsing and hierarchical indexing, and introduces "agentic RAG", which adds an agent layer on top of RAG to enable higher-level reasoning. Liu also notes that in enterprise practice, for reliability, developers currently prefer "constrained" over fully autonomous "unconstrained" agent flows, and shares production experience deploying agent workflows as microservices.

Speaker 1: Chi Wang - The Future of AI Agents and the AutoGen Framework

The Trend for Future AI Applications: Agentic AI

  • Background: Since 2022, generative AI has shown outstanding content-generation abilities, laying the foundation for higher-level AI applications.
  • Core thesis: Future AI applications will be agentic. AI agents will become a new paradigm for humans to interact with the digital world and execute increasingly complex tasks.
    • This view keeps gaining support; for example, a Berkeley article observed that AI results are shifting from using a single language model to building "compound AI systems" (systems in which multiple models or components work together).
  • Old and new agentic AI applications:
    • Enhanced older applications: personal assistants, chatbots, etc., which the new techniques make more capable and easier to build.
    • Novel applications: scientific-discovery agents, web-automation agents, and software agents that build software from scratch.

Agentic AI Capability Demo: Building a Website with Zinley

  • Chi Wang showed a demo in which AI automatically builds a website that extracts and downloads models from Hugging Face.
  • Process: Using a multi-agent framework, the AI successfully built the website by analyzing the task, installing dependencies, writing code automatically, and collaborating across agents.
  • Self-healing: In the demo, after a key line of code was deliberately deleted, the AI recognized the error (missing script) on the next run and automatically fixed it by restoring the line, demonstrating strong self-repair ability and hinting at how software construction may change.

Key Advantages of AI Agents

  1. Natural interaction: Users can communicate requirements to the AI in natural language and iterate on them.
  2. Complex-task automation: Agents can complete complex tasks with minimal human supervision, unlocking enormous automation value.
  3. A new software architecture: Multiple agents work together, completing ever more complex tasks recursively. Chi Wang particularly emphasized the importance of this point.

Example: AutoGen for Cloud-Based Supply-Chain Optimization

  • Scenario: Help non-expert users (e.g., a coffee shop owner) solve complex problems that require specific data and optimization tools.
  • AutoGen solution: Three agents are built: a Commander, a Writer, and a Safeguard.
  • Workflow: After the user asks a question, the Commander coordinates the Writer (which produces a code-based solution) and the Safeguard (which checks the code for safety), executes the code once it is confirmed safe, and has the Writer turn the results into a natural-language answer for the user. The whole process is transparent to the user and handles exceptions such as unsafe code or execution errors (see the code sketch below).
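A rough sketch of how such a Commander/Writer/Safeguard trio could be wired with AutoGen's conversable agents. This is an illustration in the spirit of the example, not the original code; the model settings, prompts, and turn limits are placeholders.

```python
# Illustrative three-agent wiring with AutoGen (pyautogen); not the original example code.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4"}]}  # add your API key settings here

writer = AssistantAgent(
    name="Writer",
    system_message="Write Python code that answers supply-chain what-if questions.",
    llm_config=llm_config,
)
safeguard = AssistantAgent(
    name="Safeguard",
    system_message="Reply SAFE or DANGER for the code you are shown.",
    llm_config=llm_config,
)
# The Commander coordinates the other two agents and can execute code locally.
commander = UserProxyAgent(
    name="Commander",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

question = "What if we can only buy coffee beans from supplier 1?"
commander.initiate_chat(writer, message=question, max_turns=2)
code = commander.last_message(writer)["content"]

commander.initiate_chat(safeguard, message=f"Is this code safe to run?\n{code}", max_turns=1)
verdict = commander.last_message(safeguard)["content"]
print("Safeguard verdict:", verdict)  # only execute the code if the verdict is SAFE
```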

The Programming Paradigm for AI Agents

  • Core steps: 1. create the agents; 2. define the interaction pattern; 3. initiate the conversation (a minimal sketch follows this list).
  • Benefits of multi-agent programming:
    1. Handle more complex tasks and improve response quality: compensate for LLM weaknesses through iterative refinement, divide-and-conquer, and dedicated agents (e.g., for validation or grounding).
      • Experimental data: splitting the task between a Writer and a Safeguard agent was compared with a single-agent setup. With GPT-4, the multi-agent setup achieved about 20% higher safeguarding recall; with GPT-3.5 the gap was even larger, indicating that the more complex the task and the weaker the model, the stronger the need for multi-agent workflows.
    2. Easier to understand, maintain, and extend (modular design): the behavior of an individual agent can be modified independently, and natural human participation is supported (a human can take over any agent's role at any time).
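A minimal sketch of the three steps above using AutoGen's two basic agent classes; the model settings and the task are placeholders.

```python
# Minimal AutoGen sketch: create agents, define how they interact, start the chat.
from autogen import AssistantAgent, UserProxyAgent

# 1. Create the agents.
assistant = AssistantAgent(
    "assistant", llm_config={"config_list": [{"model": "gpt-4"}]}  # add API key settings
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",  # 2. Define the interaction: no human input,
    code_execution_config={"work_dir": "run", "use_docker": False},  # auto-execute code replies.
)

# 3. Initiate the conversation.
user_proxy.initiate_chat(assistant, message="Plot NVDA vs TSLA year-to-date stock price change.")
```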

Design Considerations for an AI Agent Framework

  • A unified agent abstraction: able to represent humans, tools, LLMs, and other entities uniformly.
  • Flexible multi-agent orchestration: a series of design trade-offs must be balanced:
    • Control flow: static workflows (predictable) vs. dynamic workflows (flexible).
    • Control language: natural language (easy to use) vs. programming language (precise).
    • Context management: shared (collaborative) vs. isolated (independent).
    • Interaction mode: cooperation vs. competition.
    • Architecture: centralized vs. decentralized.
    • Human-machine collaboration: full automation vs. human intervention.
  • Effective implementation of design patterns: e.g., ReAct, Reflection, planning, multimodality, memory.
  • Core design principle (Chi Wang's personal view): conversation is the central mechanism that ties all of these elements together and enables complex functionality.

Overview of Mainstream AI Agent Frameworks

  • AutoGen: based on multi-agent conversation programming; comprehensive and flexible.
  • LlamaIndex: covered by Jerry Liu later in the lecture.
  • LangChain-based frameworks: e.g., LangGraph (graph-based control flow) and CrewAI (high-level static workflows).

The AutoGen Framework in Detail

  • History: originated in the FLAML project, later developed independently under an openly governed GitHub organization; received the best paper award at the ICLR 2024 Agents Workshop.
  • Core concepts: 1. conversable agents; 2. conversation programming.
    • Nested chat: an agent can internally contain and coordinate other agents in sub-conversations, enabling recursive encapsulation and extension of capabilities. For example, a Critic agent can internally coordinate an SEO reviewer and a legal reviewer to complete a multi-faceted review.
    • Tool use through conversation: introduce tool-backed agents (e.g., a Chessboard agent that enforces the board rules) that converse with LLM agents to guarantee task correctness (e.g., in a chess game).
    • Group chat: a Group Chat Manager automatically coordinates the speaking order of multiple role agents and can be constrained by rules, striking a balance between autonomy and controllability (see the sketch below).
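A sketch of the group-chat pattern in AutoGen, where a GroupChatManager coordinates who speaks next and a rule-based speaker-selection method constrains the autonomy; the agents, prompts, and model settings are illustrative.

```python
# Group-chat sketch: a manager coordinates several role agents.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4"}]}  # add API key settings

planner = AssistantAgent("planner", system_message="Break the task into steps.", llm_config=llm_config)
coder = AssistantAgent("coder", system_message="Write code for the current step.", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)

group_chat = GroupChat(
    agents=[user_proxy, planner, coder],
    messages=[],
    max_round=8,
    # Constrain autonomy with a rule-based speaker order instead of letting
    # the manager's LLM choose freely ("auto").
    speaker_selection_method="round_robin",
)
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Build a small CLI that fetches today's weather.")
```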

AutoGen Use Cases and Community Impact

  • Application domains: software development, agent platforms, scientific research, data processing, finance, healthcare, and more.
  • Selected cases:
    1. Science and engineering (MIT): building "Science Agents" that simulate a research team, using ontological knowledge graphs for reasoning and discovery, applied to areas such as materials design.
    2. Web agents (Emergence AI, "Agent E"): a hierarchical team of agents that executes complex web tasks, achieving state-of-the-art performance on the WebArena benchmark with a 73% success rate.
  • Broad attention: has attracted enterprise customers across industries, as well as users and contributors from universities, organizations, and companies worldwide (for example, Berkeley students created the MemGPT project and integrated it with AutoGen).

AutoGen's Ongoing Progress and Future Challenges

  • Ongoing work: agent evaluation tools, interface learning, and AutoBuild (research on automatically building and optimizing multi-agent workflows for a given task).
  • Future challenges: designing optimal workflows (balancing quality, cost, and latency), improving core agent capabilities (reasoning, planning, learning), and ensuring scalability and safety.

Speaker 2: Jerry Liu - Building Multimodal Knowledge Assistants with LlamaIndex and Agentic RAG

LlamaIndex Introduction and the Knowledge Assistant Concept

  • LlamaIndex: helps developers build "context-augmented LLM applications" from prototype to production.
  • Core idea of a knowledge assistant: an intelligent interface that understands a company's large volumes of multi-format internal data (PDFs, PPTs, etc.) and performs tasks over it (answering questions, generating reports, taking actions).

Understanding Basic RAG (Retrieval-Augmented Generation) and Its Limitations

  • Basic RAG pipeline: load -> chunk -> embed -> store -> retrieve -> generate (a minimal code sketch follows this list).
  • Limitations:
    1. Primitive data processing: naive chunking destroys the semantic structure of elements such as tables and images.
    2. The LLM is used only for synthesis: its higher-level abilities such as reasoning and planning go unused.
    3. Usually stateless and non-personalized: every interaction starts from scratch.
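A minimal sketch of this naive pipeline using LlamaIndex defaults; the data directory and the query are placeholders.

```python
# Naive RAG: load -> chunk -> embed -> store -> retrieve -> generate, all with defaults.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()      # load
index = VectorStoreIndex.from_documents(documents)           # chunk + embed + store (in memory)
query_engine = index.as_query_engine(similarity_top_k=5)     # retrieve top-k chunks
response = query_engine.query("What does the Q3 report say about revenue?")  # generate
print(response)
```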

Toward a Better (Multimodal) Knowledge Assistant

  • Focus: multimodality, i.e., the ability to understand and reason over text as well as visual data such as charts and images.
  • Four key elements:
    1. A core, high-quality multimodal retrieval pipeline.
    2. More generalized outputs (e.g., research reports, data analyses).
    3. Agentic reasoning over the inputs.
    4. Reliable deployment.

1. Setting Up Multimodal RAG

  • Data processing is critical: "garbage in, garbage out"; high-quality parsing is the foundation of every downstream task.
  • LlamaParse: LlamaIndex's AI-powered PDF parser, which extracts complex elements such as text, tables, and charts in a semantically consistent way.
  • Hierarchical indexing and retrieval: a better indexing strategy in which large chunks (e.g., an entire table or image) are not indexed directly; instead, multiple small text representations (such as summaries) are generated and indexed. At query time, the best representation is found first, and its reference is then used to pull up the full original data (text or image) to feed to the LLM (see the sketch after this list).
  • Multimodal RAG pipeline: parse and store the text and image chunks in a document -> index these chunks (images can be indexed via CLIP or via text representations) -> return both text and image chunks at retrieval time -> feed them into a multimodal model.
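A simplified sketch of the hierarchical idea: index small text representations that point back to large source elements, then resolve the reference after retrieval. The summaries and the source_store mapping below are illustrative stand-ins for what a parser such as LlamaParse would produce; LlamaIndex also ships dedicated recursive-retrieval abstractions for this pattern.

```python
# Hierarchical indexing sketch: embed summaries, resolve back to full elements.
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# Full elements extracted from a document (table markdown, image paths, ...).
source_store = {
    "table-1": "<full markdown of a large revenue table>",
    "chart-3": "<path or bytes of a chart image>",
}

# Small, embeddable representations that point back to the source element.
summary_nodes = [
    TextNode(text="Table of quarterly revenue by region, 2022-2024.",
             metadata={"source_id": "table-1"}),
    TextNode(text="Bar chart comparing GPU cost per training run.",
             metadata={"source_id": "chart-3"}),
]

index = VectorStoreIndex(nodes=summary_nodes)
retriever = index.as_retriever(similarity_top_k=1)

hits = retriever.retrieve("How did revenue change across regions?")
for hit in hits:
    full_element = source_store[hit.node.metadata["source_id"]]
    # Feed the full element (text or image) to a (multimodal) LLM for synthesis.
    print(full_element)
```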

2. Generalized Outputs (e.g., Report Generation)

  • (Mentioned briefly) The potential of agents lies in generating complete output artifacts (e.g., slide decks, code), a core need for enterprise knowledge workers.

3. Agentic Reasoning over Inputs (Agentic RAG)

  • Agentic RAG: build an agent layer on top of RAG, treating retrieval itself as one of the tools the agent can call. A query is first analyzed, decomposed, and planned by the agent, which then decides whether and how to use the retrieval tools.
  • The constrained vs. unconstrained flow trade-off:
    • Unconstrained flows (general-purpose agents): e.g., ReAct or LLMCompiler, which let the agent plan on its own. Pros: expressive and flexible; cons: less reliable and more expensive, and the agent may get stuck in loops or fail to converge.
    • Constrained flows: the control flow is predefined by the developer (e.g., router prompts, if-else logic). Pros: reliable and controllable; cons: limited expressiveness.
    • Enterprise practice: today, because reliability matters most, enterprises tend to build constrained agent architectures for specific problems. A rule of thumb: when using a general-purpose architecture such as ReAct, keep the number of tools to around 4-5, and fewer than 10 (see the sketch after this list).
  • LlamaIndex workflow capabilities: an event-driven orchestration system that supports building all kinds of (constrained or unconstrained) agent workflows.
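A sketch of agentic RAG with an unconstrained ReAct-style agent that is deliberately given only a couple of tools, in line with the rule of thumb above; the toy documents, tool names, and model choice are illustrative.

```python
# Agentic RAG sketch: a ReAct agent choosing between two query-engine tools.
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

# Two tiny toy indexes standing in for real document collections.
filings_engine = VectorStoreIndex.from_documents(
    [Document(text="The 10-K reports 2023 revenue of $10M, up 12% year over year.")]
).as_query_engine()
slides_engine = VectorStoreIndex.from_documents(
    [Document(text="Earnings slide 4: revenue growth of roughly 12% YoY.")]
).as_query_engine()

tools = [
    QueryEngineTool.from_defaults(
        query_engine=filings_engine,
        name="sec_filings",
        description="Answers questions about the company's 10-K filings.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=slides_engine,
        name="earnings_slides",
        description="Answers questions about the earnings-call slide deck.",
    ),
]

# An unconstrained ReAct-style agent, deliberately limited to a few tools.
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o"), verbose=True)
print(agent.chat("Do the filings and the slides agree on revenue growth?"))
```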

Practical Use Cases and Production Deployment

  • Use cases:
    1. Report generation: completed through the collaboration of agents such as a Researcher (performing RAG) and a Writer (producing the report).
    2. Customer support: considered the number-one practical enterprise agent use case; automation greatly improves efficiency and user experience.
  • Running agents in production:
    • Challenge: how to take a complex local multi-agent prototype into a production environment.
    • LlamaIndex's practice: deploy agent workflows as microservices. Each agent is wrapped as a service API and communicates via a central message queue, making the system easy to scale and manage (see the sketch after this list).
    • Human-in-the-loop support: an agent can pause when needed, request input from the user, and resume once the user responds, which is essential for handling ambiguity and critical decisions.
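A generic illustration (not LlamaIndex's own deployment stack) of wrapping an agent workflow as a microservice with one HTTP endpoint, which other agents or a message queue could call; FastAPI and all endpoint and field names here are hypothetical choices.

```python
# Hypothetical microservice wrapper around an agent workflow.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class TaskRequest(BaseModel):
    session_id: str
    query: str


class TaskResponse(BaseModel):
    session_id: str
    answer: str


@app.post("/agents/report-writer", response_model=TaskResponse)
async def run_report_writer(req: TaskRequest) -> TaskResponse:
    # In a real deployment this would invoke the agent workflow, e.g.
    # result = await ReportWorkflow(timeout=300).run(topic=req.query)
    result = f"(stub) report for: {req.query}"
    return TaskResponse(session_id=req.session_id, answer=result)

# Run with: uvicorn agent_service:app --reload
```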