speaker 1: Today we have our co-instructor, Div, talking about human-inspired approaches to agents and how the path to AGI requires a rethinking of how we design, evaluate, and deploy intelligence. Div Garg is the founder and CEO of AGI Inc, a new applied AI lab redefining AI-human interaction with the mission to bring AGI into everyday life. He previously founded MultiOn, the first AI agents startup, developing agents that can interact with computers and assist with everyday tasks, funded by top Silicon Valley VCs. Div has spent his career at the intersection of AI research and startups and was previously a PhD student here at Stanford focused on RL. His work spans various high-impact areas ranging from self-driving cars, robotics, and computer control to Minecraft AI agents. With that, I'll hand it to him. So take it away. speaker 2: Yes, excited to be here. Great. So yeah, excited to be here. The topic for this lecture is a lot of the new things happening in the AI world right now. There have been a lot of developments with agents and all the new models that are coming out, and it seems like we already have some sort of superintelligence when it comes to chat and reasoning, compared to average humans. It's going to be very interesting over the next few years as we figure out: what does intelligence look like? What is something like AGI, and what is its form factor? How can this be something that's useful, and how will it be applied in society? Cool. So let's take the first thing we want to touch on: what does AGI look like? AGI is such an abstract concept right now. No one has really visualized it or given it a concrete meaning. Is it some sort of supercomputer? Is it just like ChatGPT, but ten times better? Is it something that's more of a personal companion? Is it something that's embedded in your life? That's not clear yet.
And those are the kinds of questions I think we really need to go and figure out. This is one diagram of how AI agents work. The architecture is from OpenAI researcher Lilian Weng; she recently left and joined a new company. It shows how you can think about agents and how they can be broken down into different subparts, and there are a lot of different things you require to make this work. The first layer is memory. You need some sort of short-term memory, and you want some sort of long-term memory. The short-term representation is maybe a chat window if you're using something like ChatGPT, and you might also have a personal history of the user: okay, this is what the user likes, this is what they don't like. The second thing you need is tools. You want these agents to be able to use tools the way humans use tools. So you want them to be able to use calculators, calendars, web search, coding, and so on. The third part here is advanced planning. That means you want the agent to be able to use reflection, where if something goes wrong, it has failover mechanisms and can error-correct and recover. You want self-criticism, and you want decomposition, where you have chain-of-thought so the agent can run its own reasoning loops and break a complex task down into subgoals. The final, fourth ingredient is actions, where you want these agents to be able to act on your behalf and go do things. At a high level, this encapsulates what agents fundamentally look like, and as these systems become more powerful over time, they will eventually lead to something like AGI. This is also what we're building toward. I recently started a new AI lab called AGI Inc.
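To make those four ingredients concrete, here is a minimal sketch of an agent loop in Python. Everything in it (the `call_llm` stub, the `TOOL:`/`ACT:`/`DONE:` protocol, the tool names) is a hypothetical illustration, not any real framework's API:

```python
# A minimal sketch of the four agent ingredients: memory, tools, planning, actions.
# All names and the reply protocol are made up for illustration.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # tool use (eval is fine for a toy demo)
}

def call_llm(prompt: str) -> str:
    """Scripted stand-in for a real LLM call, so the loop below actually runs."""
    if "History: []" in prompt:
        return "TOOL:calculator:2+3"         # the model decides to use a tool first
    return "DONE:the answer is 5"            # then finishes the task

def run_agent(task: str, max_steps: int = 10) -> str:
    short_term = []                              # short-term memory: the running transcript
    long_term = {"likes": [], "dislikes": []}    # persistent user profile (unused in this toy)

    for _ in range(max_steps):
        prompt = (f"Task: {task}\nHistory: {short_term}\nProfile: {long_term}\n"
                  "Reply with TOOL:<name>:<arg>, ACT:<ui action>, or DONE:<answer>.")
        reply = call_llm(prompt)                 # planning/reflection happen inside the model
        short_term.append(reply)
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            short_term.append(TOOLS[name](arg))  # feed the tool result back into memory
        elif reply.startswith("ACT:"):
            pass                                 # here a real agent would click/type
        elif reply.startswith("DONE:"):
            return reply[len("DONE:"):]          # the action layer reports the result
    return "gave up"

print(run_agent("what is 2+3?"))  # -> the answer is 5
```

A real system would replace `call_llm` with an actual model call and `ACT:` with real click/type execution; the point is only the shape of the loop: memory carried in the prompt, tool results fed back, actions dispatched.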
And we're looking a lot into what AGI looks like for everyday purposes and how it can be applied to daily life. This is one of the demos of some technology we built in the past, showing how an AI agent can be applied in the real world. This is a bit old, and it shows how an AI agent can be applied to pass a real driving test in California. This is an actual DMV test that the agent took. Let me share the screen and talk about the setup. On this screen, what's happening is someone is attempting the DMV online test, and there's a human who has their hands off the keyboard; they're not actually touching the screen. It's the agent that's going in and taking the whole exam. There are about 40 questions in this test, and the agent can go and pass the whole thing. And we did this live. The DMV was actually screen-recording what we were doing, and they were also watching the person on camera. But even then, the agent was successfully able to evade the whole setup and pass the exam. So this was really fun. We did this as a white-hat hacking attempt, so we informed the DMV afterwards that we did this. Funnily enough, they actually sent us a driving license afterwards. So that was really fun. At the end, the agent is able to pass and get a passing score on this test. So yeah, this is a very fun experiment showing how agents can be applied in the real world. And there are so many things that are possible in this vein: how can we make agents more useful and apply them in real life? We have been working on a lot of different efforts along with much of the AI community. One of those things is agent evaluations: how can we evaluate these kinds of agents in the real world and make sure we have standards and benchmarks that let us know how well these agents are working on different websites and different use cases?
How can we trust them? How can we know where to deploy them and how to use them? Another thing we have been doing is agent training: can we train agents to be able to do advanced planning, self-correction, and improve themselves? This uses a combination of reinforcement learning and a bunch of other advanced techniques. And finally, we have also been looking a lot into agent communication: how can you have agents communicate with other agents? There have been a lot of new breakthroughs in this area recently. If you have looked at the Model Context Protocol, MCP, that's a very new thing that has been coming out. Similarly, there's a lot of work around A2A, Google's agent-to-agent communication protocol that recently came out. We have also been working on an open-source project called Agent Protocol, where we allow different kinds of agents to communicate with each other. So you can have a coding agent that can talk to a web agent that can talk to an API-based agent, and so on. That allows you to do much, much more complex things than what's possible with just a single agent. Cool. So before we dive deeper into how a lot of these things work, I want to bring up: why do we need agents? Why are they useful? Why do we actually want to go and build them? There's a lot to think about here, and I will touch on a lot of different topics in this introduction, going from the architectures, to building more human-like agents using computer interactions, to memory, communication, and future directions. So when you're building agents, there are a lot of questions you have to answer. The first one is: why is this useful? Then: how can you actually build them? What are the different building blocks? And finally: what can you do with them?
To first answer the "why" question, we have this key thesis that agents will be more efficient at interfacing with computers in the digital world compared to humans. That's the reason we want to go and apply agents to do things for us. You can imagine you have an army of virtual assistants that are fully digital and can go and do whatever you want on your behalf, and you can talk to them using a human interface. That's the vision we have been moving towards. I also have a blog post about this called Software 3.0 that you can check out, which touches on some of these ideas. So we want to go and build agents because large language models on their own are usually not good enough, and we want action capabilities that allow us to unlock more productivity and go do things. This also allows us to build more complex systems. There are a lot of techniques involved in actually building this, such as chaining different models together, reflection, and a bunch of other mechanisms. As shown before in the architecture slide, there are also a lot of different components: memory, actions, personalization, access to the internet, and so on. And finally the question comes: what are the different applications we can apply them to? There's also a question of why we want to build human-like agents. Why can't we just have API agents, or a bunch of other kinds of agents you can imagine which are not mimicking human interactions? One reason we want to push towards more human-like agents is that these agents can operate interfaces the way we do. The internet, the web, and computers are designed for humans. They're designed for keyboard and mouse interactions so that we can go and navigate interfaces.
And if agents are able to use interfaces like we do, that allows them to directly do a lot of things without changing how current software programs work. That becomes very, very effective, because it allows you to work on 100% of the internet without any bottlenecks. If you think about APIs, only about 5% of APIs on the internet are public and accessible, and it's very hard to build agents that are fully reliable over APIs. So there's a lot of contention between human-like agents versus API agents, and that's an ongoing battle happening right now. The second thing is, you can imagine a lot of these human-like agents becoming a digital extension of you. They can learn about you, they can have context about you, and they can do tasks the way you would do them. They also have less restrictive boundaries. These human-like agents can handle logins, they can handle payments, and they're able to interact with any service without restrictions on app access. You don't need to pay for using an API, and you don't need to go to a service provider and ask them, can you give me access to this API? You can just go and use an interface like you normally do. And the final thing is that there's a very simple action space. The agents only need to learn how to click and type, and if they're able to do that very effectively, they can generalize to any sort of interface. They can also improve over time: the more you teach them and the more data you give them, they can learn from user recordings and feedback and become better and better. So when it comes to API versus more direct computer-control agents, these are the pros and cons as we think about them. API agents are usually easier to build, more controllable, and safer. But APIs have higher variability.
You have to build different agents for each API, and APIs can keep changing; you never have a full guarantee that the agent will always work 100% of the time. When it comes to the more direct computer-control agents, it's easier to take actions, and interactions are more freeform because you're not restricted by API boundaries. But it's also hard to provide guarantees, because you don't know what the agent will do. If anyone here has played with agents like Operator, it's a work in progress. It's clearly not fully there; there are a lot of issues it runs into, and that's kind of the boundary where agents are right now. There are also different levels of autonomy when you think about agents, usually going from level one to level five. Level one to level two is when a human is in control and the agent is acting like a copilot, helping the human. If you use a code editor like Cursor, that's an L2 agent: you have partial automation, where the human is in control and directing the code, but the agent is helping them. Something like L3 is where there's still a human fallback mechanism, but the agent is in control. If you use Cursor Composer or Windsurf or any of the newer, more agentic code editors, the agent is writing most of the code, but a human is monitoring and giving it feedback: okay, this went wrong, can you correct that for me? Can you fix this issue? That's more of an L3 system. Then you have more advanced systems, which are L4 and L5. In L4 systems, you don't have a human in the loop, so it's the agent that's going and doing everything, but you might still have some sort of automated fallback layers.
If you look at Waymo in SF, that's an L4 system, because the self-driving car is driving itself, but there are human operators remotely monitoring it, making sure that nothing goes wrong. When you have an L5 system, there are no humans in the loop, there's no monitoring, and the AI agent is able to operate fully autonomously and independently. So when we are building these agents, one hard thing is trust. How do we trust that these agents are actually going to go do what we want them to do? How can we go and deploy them in the real world? To solve these issues, one effort that we have been building is a miniature version of the internet, where we have cloned the top 20 websites on the internet, and we are benchmarking how agents perform on all these interfaces. This is actually live, so you can go check it out. What we have done is build digital clones of websites like Airbnb, Amazon, DoorDash, and LinkedIn. The agents can go and navigate these interfaces on predefined tasks, and you get a final score. This is showing the evaluation results for GPT-4o. We find that GPT-4o is actually not very good when it comes to being agentic, and it only reaches a 14% success rate in this case. We tried this on the eleven different environments that we are showing on the right. We have our different environments: DashDish, which is our DoorDash clone, Omnizon, and so on. So you can actually go and check these environments out. We also compared a lot of the open-source frameworks out there. One of them is the OpenAI computer-use model that powers Operator. We actually find it's not very good when it comes to these tasks: it's only able to reach a maximum of 20% accuracy on some of the environments, like our email environment or our calendar environment, and on a lot of the other environments it's not able to do well at all.
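The kind of scoring this benchmark does can be sketched as a tiny harness. The environment names mirror the clones mentioned above, but the tasks and the `agent`/score interface here are made-up placeholders, not the real harness:

```python
# A sketch of benchmarking agents on cloned websites with predefined tasks.
# The agent callable and task strings are hypothetical stand-ins.

from dataclasses import dataclass, field

@dataclass
class CloneEnv:
    """A sandboxed clone of a real site, with predefined tasks."""
    name: str
    tasks: list = field(default_factory=list)

def evaluate(agent, envs, runs_per_task=1):
    """Score an agent callable: agent(env_name, task) -> True iff the task succeeded.
    Returns a success rate per environment."""
    results = {}
    for env in envs:
        successes = total = 0
        for task in env.tasks:
            for _ in range(runs_per_task):
                total += 1
                successes += bool(agent(env.name, task))
        results[env.name] = successes / total if total else 0.0
    return results

envs = [
    CloneEnv("omnizon", ["add a USB-C cable to the cart", "check out with the saved address"]),
    CloneEnv("dashdish", ["order a peanut-free pad thai"]),
]

# An agent that always gives up, to show the scoring shape:
print(evaluate(lambda env_name, task: False, envs))  # -> {'omnizon': 0.0, 'dashdish': 0.0}
```

The real harness also has to verify success inside the sandboxed site (e.g., did the order actually appear?), which is exactly why cloned environments with known ground truth are useful.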
We also tried a bunch of the frameworks out there: Stagehand, if you have seen that, which is an open-source framework for automating web agents; Browser Use; and one of our own custom agents, which we are calling Agent Zero. And we find that agents are still early when it comes to actually automating a lot of these interfaces. We are able to reach, I would say, up to a 50% success rate, but a lot of these agents are actually failing when you apply them to a lot of these real-world websites. Similarly, we benchmarked all the different models that are available, including all the closed-source APIs and all the open-source models. And we find, again, that on our tasks most models are doing decently well, but no one is really good right now. The maximum success we have seen is with Claude 3.7, where it can reach around 40% accuracy. Gemini 2.5 and o3 follow very closely behind it, and the other models tend to taper off. So the interesting learning for us has been that a lot of these models are not fully ready to be deployed in the real world. Because if you have, say, an agent that's powered by Claude and you deploy that, you can only expect around a 40% success rate that it will actually go do what you want, and that's not good enough. And this brings up a question: what is required to make these agents even better? How can they improve, and how can they be applied to your actual practical use cases? This brings us to our next topic for the lecture: how can we train agentic AI models? How can we have models that are more custom, fine-tuned, and better at decision-making tasks? This is one of our past works called Agent Q, which is a self-improving agent system. Agent Q is a system that can self-improve: it can learn by corrections and planning, and the way it works is that it's able to go and self-correct.
Whenever it makes a mistake, it can save that mistake in its memory, and it's able to use that to do a lot of trial-and-error learning, similar to humans. Suppose the first time you learn how to ride a bike: you make a lot of mistakes, you fall a lot of times, but over time you're able to improve your policy and do it really well. We apply similar mechanisms to make these agents actually work really, really well in the real world. So what's happening in this system is that the agent can explore the space of interfaces and see, okay, which of the things it did went wrong, and which went right? And it's able to use reinforcement learning to self-improve and become better and better. Agent Q combines a lot of different techniques. The first is Monte Carlo tree search. This is borrowed from other RL techniques like AlphaGo that allow you to plan over a search space of tasks and unlock advanced reasoning. The second thing we do is self-critique mechanisms: the agent can self-verify and get feedback whenever it makes a mistake, and it's able to learn from that feedback. And finally, we use RLAIF techniques like DPO, direct preference optimization, to improve the agent using RL. By combining all these techniques together, we are able to build some very powerful systems. Agent Q is also available on arXiv as a research paper, so you can go and check it out. For the sake of time, I will skim over some of the details here. But how Agent Q normally works is that we have this Monte Carlo tree search where the agent is exploring the different states. It's estimating rewards: if we were to visit this state, what's the expected value of the future predicted reward? And based on that, it's able to improve its prediction of whether it should take this path or a different path in the tree.
Over time, the agent becomes very good at exploring the right states and figuring out which paths in the state space are right and which are wrong. We also use a self-critique mechanism. In this case, what happens is, say you have a particular task, like a user says: book me a reservation at a restaurant on OpenTable for two people on August 14, 2024 at 7:00 p.m. Given the current state of the screen, which you can see in the screenshot, the agent can go and propose a bunch of different actions. It can choose to go and select the date and time; it can choose to select the number of people and then open the date selector; it can instead search for the Italian restaurant in Silicon Valley and type that into the search bar; or it can decide to go to the OpenTable home page. The way the self-critique mechanism works is that all these proposed actions are passed to a critic network, and the critic LLM predicts which is the best action to take, giving a ranking order: this is the best action we should go and use, so this is rank one, this is rank two, this is rank three. Based on that, we can optimize the system to take the correct actions and improve over time. And finally, we use reinforcement learning from feedback, where we use methods like GRPO and DPO, which are different RL algorithms, to take all the failure and success trajectories collected so far and improve the agent with them. DPO is a particular technique, covered in an earlier lecture, where you can train an LLM using preference data of failures and successes and use that to improve the model overall. So this is how Agent Q works: we run this Monte Carlo tree search to create trajectories of successes and failures.
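The ranking step of the self-critique mechanism can be sketched like this; the toy word-overlap scorer stands in for the critic LLM described above:

```python
# A sketch of critic-based action ranking. The critic here is a toy scoring function;
# in Agent Q it would be an LLM judging each proposed action against the task and state.

def critic_score(task: str, state: str, action: str) -> float:
    """Toy critic: reward actions that share words with the task description.
    A real critic would be an LLM call returning a judgment."""
    task_words = set(task.lower().split())
    return len(task_words & set(action.lower().split())) / len(task_words)

def rank_actions(task: str, state: str, proposed: list) -> list:
    """Return the proposed actions ordered best-first by critic score."""
    return sorted(proposed, key=lambda a: critic_score(task, state, a), reverse=True)

task = "book a table for two people on OpenTable"
proposed = [
    "go to the OpenTable home page",
    "open the date selector",
    "select two people and the date on OpenTable",
]
for rank, action in enumerate(rank_actions(task, "search_page", proposed), 1):
    print(rank, action)
```

The ranked list is what then feeds the preference-learning step: the top-ranked action becomes the "chosen" candidate and lower-ranked ones the "rejected" candidates.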
We can then use self-critique mechanisms to identify which proposed actions actually succeeded and which failed, and then we pass them through DPO to actually optimize the network. This is an example of how this works. The agent starts in the first state, and the task in this case is to book a restaurant reservation on OpenTable. First it makes a mistake and goes to the home page, then recognizes the mistake and backtracks; the blue arrow here shows that it's backtracking. Then it can navigate to the right restaurant. If the agent accidentally makes a mistake and chooses the incorrect date, it can again backtrack and recover: open the date selector, choose the right date, open the seating selection, and then finally complete the reservation. So this is how the system is learning over time: it's making a lot of mistakes, but it's saving those mistakes and improving on them over time. We tried Agent Q in a lot of real-world scenarios, including actual OpenTable reservations. We actually spun up thousands, or more like hundreds of thousands, of bots that ran on OpenTable and used our method to create agents that can book restaurants, make reservations, and do a bunch of other things. We tried this with a lot of different methods and models out there. We tried GPT-4o, and we found that on these OpenTable reservation tasks we were only able to reach around a 62.6% success rate. With something like DPO, the accuracy goes up to around 71%. When we try Agent Q, we are able to make this work much, much better: we reach 81% accuracy without any MCTS as part of the method. And when we apply the whole technique, with MCTS and DPO and the self-critique mechanisms, we are actually able to reach close to 95.4% accuracy.
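For reference, the per-pair DPO objective used in the last step of this pipeline can be sketched as follows (this is the standard DPO formulation; the log-probability numbers in the example are made up):

```python
# Direct Preference Optimization (DPO) loss for one (chosen, rejected) pair.
# In Agent Q, the pair would be a successful vs. a failed trajectory for the same task.

import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_w / logp_l: log-prob of the winning / losing trajectory under the policy.
    ref_logp_w / ref_logp_l: the same log-probs under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# A policy that has moved toward the winning trajectory (relative to the reference)
# gets a small loss; a policy that moved toward the losing one gets a large loss:
good = dpo_loss(-5.0, -9.0, ref_logp_w=-6.0, ref_logp_l=-7.0)
bad = dpo_loss(-9.0, -5.0, ref_logp_w=-7.0, ref_logp_l=-6.0)
print(good < bad)  # -> True
```

Training minimizes this loss averaged over pairs, which pushes the policy's relative log-probability of successful trajectories up and of failed ones down, without needing an explicit reward model.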
And this uses a lot of self-learning for the agent to improve itself. It usually takes less than one day of training for the agent to go from around 20% accuracy, 18.6% to be exact, all the way to 95.4%. That's roughly a 4x improvement in agent performance in less than a day. All right, cool. As the next topic, I'll touch on memory and personalization. One way to think about AI agents is that they are taking in information and processing it. Imagine what an AI model is doing: it's taking some prompts, so it's taking in some language tokens, and outputting some new language tokens. This is acting similar to a processor. If you have a CPU, what happens is you have some instructions, which are binary-encoded, that go into the CPU, and then you have some instructions that come out, which are also binary-encoded, and then you loop over them again and again. That's how normal computers work. You can do a very similar thing and have the abstraction of an AI model acting like a computer, where you have language tokens going in, encoded in the prompt, and language tokens coming out. This allows you to think about an AI model as a processor that operates over natural language. You can think of GPT-4, for example, as doing this; it's similar to some of the older processors that used 32-bit instructions. Right now, if you look at GPT-4, we are able to reach very big contexts, which is very interesting. When GPT-4 initially came out, it was constrained to 8K tokens. Now we have 32K tokens and 128K tokens and even 1 million tokens. The context length of these models just keeps increasing over time, and as the context length increases, that also allows us to have...
speaker 1: A question from online: can you speak to the compute budget for the day-long run? Was it H100s, like a cluster? speaker 2: Yes, so that was all H100s. We actually trained the whole model on 50 H100s in less than one day. speaker 1: Gotcha. And then one question from before: as AI agents increasingly emulate human behavior, what protocols do you foresee being implemented to help users distinguish between AI and humans in conversation? speaker 2: Yeah, that's very interesting, and that becomes a question of security: how can we identify whether it's a human or an agent? It's actually a very hard question right now, because you already have voice agents that are effectively able to mimic humans and pass as humans, and that's actually happening in the real world right now. Over time, we will need human proof of identity. This could be biometrics; it could also be a combination of some sort of personal data, or some sort of password or secret that only you know, and you can use that to authenticate that you're talking to an actual human and not an agent. Cool, any more questions? speaker 3: [Audience question, partly inaudible] Why do agentic systems fail? There have been comprehensive studies. Multi-agent systems have been there for more than 20 years, right? Distributed systems, transaction processing. We're just covering the same thing with the AI name. So far I really haven't seen anything new, except that instead of having people code all the logic in the program, you have an agent that, given a problem, will give you some results. My point is that communication between agents is exactly the same as it was 20 years ago.
Collaboration between agents is the only thing that would bring any new intelligence, but I'm missing that part. Correct? speaker 2: This is actually something that's coming next. But just to answer the question, the biggest issue is miscommunication. What happens is that when all these agents are communicating using natural language, that causes a lot of miscommunication, where maybe your agent got the wrong instruction or failed to understand what's happening. And the more agents you add, the more communication overhead there is. You can imagine that if you have an agentic system with n different agents, there are on the order of n-squared communication hops, so the amount of error in the system grows quadratically, and that allows for a lot of different mistakes to happen. speaker 3: [Partly inaudible] ...pretty much all these problems were solved. speaker 2: Great. Yeah, totally. That could be very interesting. But for the audience here, let's come back to this. One way to think about agents is that when you have a transformer model, the transformer model is acting as a processor: it's taking in input prompts and giving out output prompts. And what you want to do is have a memory system, something like a file system and RAM, where you are saving what's happening and processing it over time. You want repeated operations: you do the first pass over the model, you get some output tokens, you save them in a RAM-like system, and then some new instructions come out. Okay, now here's step two of the plan, go execute that. Here's step three of the plan, here's step four. And that looping behavior is, in a sense, what gives rise to agents: the transformer is the processor, and the memory system, the instructions, and the planning act like the file system and the RAM.
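The processor-plus-RAM loop just described can be sketched in a few lines. The scripted `call_llm` is a stand-in for a real model, so that the loop runs end to end:

```python
# "LLM as processor, prompt memory as RAM": each pass reads the whole memory,
# emits one new instruction, and writes it back for the next pass.

def call_llm(prompt: str) -> str:
    """Scripted stand-in for a model: emits a plan once, then one step per pass."""
    if "PLAN:" not in prompt:
        return "PLAN: 1) open calendar 2) find free slot 3) book meeting"
    done = prompt.count("DID:")                  # how many steps were already executed
    steps = ["open calendar", "find free slot", "book meeting"]
    return f"DID: {steps[done]}" if done < len(steps) else "HALT"

def agent_loop(task: str) -> list:
    ram = [f"TASK: {task}"]                      # working memory, grows on each pass
    while True:
        out = call_llm("\n".join(ram))           # one instruction cycle through the "processor"
        if out == "HALT":
            return ram                           # the memory now holds the full trace
        ram.append(out)                          # write the output back for the next cycle

trace = agent_loop("book a meeting")
print(trace[-1])  # -> DID: book meeting
```

The analogy's limit is the context window: once the "RAM" outgrows it, you need the disk-like long-term memory discussed next.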
And so, overall, this gives rise to a computer architecture where the agent acts like a computer system, with memory, processors doing the compute, and the ability to use browsers, actions, and multimodality, with inputs like audio and voice and so on. Okay. When we think about long-term memory, based on the analogy before, you can think of it as similar to a disk: you want a user memory that's long-lived and persistent, so that you can save context about the user and load it on the fly whenever you want. There are different mechanisms for long-term memory. The prevalent one is embeddings: you have retrieval models that can go and fetch the right user embeddings on the fly. So if I have a question like, is this person, Joe, allergic to peanuts, can the system go and find out? If we have a lot of data about the user, we can use a retrieval model to do an embedding lookup and find out whether this is something we already know about the user or not, and based on that, make the right judgment. This is something that is very important, and you can see early cases of it in systems right now. There are still a lot of open questions when it comes to long-term memory. The first one is hierarchy: how do we decompose memory into more graph-like structures where you can have temporal persistence and more structure? You might also want to think about memory as something that is adaptable, because human memory is usually not static; it's changing over time. So when you have agent memory, you also want to think about how it can change, how it can be dynamic, how it can self-adjust, because these systems are also learning and improving. What do these dynamic memory systems look like? Cool.
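The embedding-lookup idea above can be sketched with a toy bag-of-words "embedding"; a real system would use a learned embedding model and a vector database, but the retrieval shape is the same:

```python
# Embedding-based long-term memory lookup, sketched with a toy bag-of-words
# vectorizer standing in for a real embedding model.

import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

MEMORY = [
    "Joe is allergic to peanuts",
    "Joe prefers window seats when flying",
    "Joe's favorite cuisine is Thai",
]

def recall(query: str, memory=MEMORY) -> str:
    """Fetch the stored fact most similar to the query."""
    q = embed(query)
    return max(memory, key=lambda fact: cosine(q, embed(fact)))

print(recall("is Joe allergic to peanuts?"))  # -> Joe is allergic to peanuts
```

The retrieved fact is then loaded into the prompt (the "RAM"), which is how a disk-like persistent memory feeds back into the processor loop.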
speaker 2: Memory leads to personalization. The goal of having long-term memory is that you can personalize these agents to the user, so they understand what you like and what you don't like, and they're aligned with your preferences. If someone is allergic to peanuts and you have an agent ordering food on DoorDash, you want it to be personalized so it doesn't accidentally order something you're allergic to. How can you go and build that? Everyone has different preferences, likes, and dislikes, so when you're designing agents, it's very important to make sure you can account for this. There's a lot of explicitly set personalization information you can collect: what does the user like, are they allergic to something, what are their favorite dishes, what seat preferences do they have if they're flying, and so on. There are also a lot of implicit preferences, like which brand do you prefer, Adidas versus Nike? If there were ten items on a list, say you're looking for housing, which one would you prefer and buy? Those things are very implicit, so they're not explicitly known, and there are mechanisms where you can collect a lot of these implicit preferences and personalize over time. There are a lot of challenges when building these personalization systems. The first one is just user privacy and trust: how do you actively collect this information, and how do you get people to give it to you? There are different methods you can use to collect this information. One is active learning, where you're explicitly asking the user for their preferences: are you allergic to something, do you have a seat preference, and so on.
And there might also be passive learning, where if you can record users and see what they're doing, you can passively learn their preferences. Maybe this person likes a particular cuisine, because that's what we have seen them do on the computer, and the agent is learning from your behavior and getting better and better. You can also learn to personalize through supervised fine-tuning, where you collect a lot of interactions. This can also come through human feedback, where you get thumbs up and thumbs down and use that to improve, so the agent goes and does the right thing. This is similar to ChatGPT: if you like a chat output, you give it a thumbs up; if you don't like it, you give it a thumbs down. That can then be used to personalize the system over time. Okay. So now we're going to agent-to-agent communication. speaker 1: One question online: how do you do evaluations on the performance of agents that collaborate with humans, and is it a moving target? At what point is human performance redundant and agents can be fully autonomous? speaker 2: I would say it's a hard question. You just have to go and build benchmarks, because it's very hard to know what's going to happen in the real world right now. I will say, based on the current state of evaluations and what I showed before, agents are not fully there. The most successful agents I've seen so far are coding agents. If you have an intelligent code editor, you can already see the traces: they're automating a lot of engineering for you already, so you don't have to write a lot of boilerplate code or spend a lot of your own time fixing bugs. So at some point we'll see this thing where humans become more like managers, and we are giving them feedback, we are giving them direction, okay?
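A minimal sketch of learning from thumbs-up/thumbs-down feedback, in the spirit of the ChatGPT example above. The raw-count scoring scheme and all names are illustrative assumptions; a production system would fine-tune a model on this signal rather than keep counters:

```python
from collections import defaultdict

class FeedbackLearner:
    """Toy personalization from thumbs-up/down signals: each signal nudges a
    per-item score, and rankings prefer items the user has liked before."""
    def __init__(self) -> None:
        self.scores: dict[str, int] = defaultdict(int)

    def thumbs_up(self, item: str) -> None:
        self.scores[item] += 1

    def thumbs_down(self, item: str) -> None:
        self.scores[item] -= 1

    def rank(self, items: list[str]) -> list[str]:
        # Order candidates by accumulated feedback, best first.
        return sorted(items, key=lambda i: self.scores[i], reverse=True)

learner = FeedbackLearner()
learner.thumbs_up("thai_restaurant")
learner.thumbs_up("thai_restaurant")
learner.thumbs_down("pizza_place")

print(learner.rank(["pizza_place", "thai_restaurant", "sushi_bar"]))
# -> ['thai_restaurant', 'sushi_bar', 'pizza_place']
```

The same structure works for the explicit signals (thumbs on chat outputs) and the implicit ones (which item the user actually picked from a list).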
Suppose you have a system of different agents: you're telling them, okay, I want agent one to go do this, agent two to go do this, and so on; you see what the final output is, and you use that to improve the overall process you're working towards. So what's likely going to happen is that agentic systems will become better and better executors, while humans become the managers of these systems of agents. speaker 3: So when it comes to . speaker 2: agent-to-agent communication, we think about multi-agent architectures and multi-agent systems, where you have all these cute little digital robots that can go and talk to each other, communicate, and do your work in a very coordinated and streamlined manner. There are reasons you want to build multi-agent systems. The first one is parallelization: by dividing a task into smaller parts and having multiple agents, say n agents instead of one, you can improve the overall speed and efficiency. The second is specialization: if you have different specialized agents, maybe a spreadsheet agent, a Slack agent, and a web browser agent, then you can route different tasks to different agents, and each agent can become really good at its task. This is similar to having a degree in a specific major, or having an occupation and specializing in that occupation. There are a lot of challenges when it comes to agent-to-agent communication. The biggest one is that this kind of communication is lossy. When one agent communicates with another agent, it's possible that it might make mistakes. It's similar to what happens in human organizations: maybe your manager asks you to go do something, but you misunderstand them and do something different, and they're like, oh, why did this happen?
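The routing-to-specialists idea could look roughly like this. The keyword router and the three stub agents are illustrative assumptions; a real system might use an LLM classifier for routing and wrap actual models and tools behind each agent:

```python
from typing import Callable

# Stub specialist agents; each would really wrap a model plus its own tools.
def spreadsheet_agent(task: str) -> str:
    return f"[spreadsheet] handled: {task}"

def slack_agent(task: str) -> str:
    return f"[slack] handled: {task}"

def browser_agent(task: str) -> str:
    return f"[browser] handled: {task}"

ROUTES: dict[str, Callable[[str], str]] = {
    "spreadsheet": spreadsheet_agent,
    "slack": slack_agent,
    "browser": browser_agent,
}

def route(task: str) -> str:
    # Naive keyword routing: send each task to the matching specialist.
    for keyword, agent in ROUTES.items():
        if keyword in task.lower():
            return agent(task)
    return browser_agent(task)  # fall back to the most general agent

results = [route(t) for t in (
    "update the Q3 spreadsheet",
    "post a Slack reminder",
    "find flights online",
)]
print(results)
```

Because the tasks are independent, the three `route` calls could also run in parallel, which is the parallelization benefit described above.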
And similarly, agent-to-agent communication is also fundamentally lossy: whenever you communicate information from one agent to another, you lose some percentage of the information, and that allows mistakes to propagate through the system and become increasingly prevalent. There are different mechanisms for multi-agent systems. This is a very novel field right now; people are still trying to figure it out, and no one has actually cracked it yet. What you want to do is build the right system of hierarchies: you might have manager agents working with worker agents, you might have managers of manager agents, and you might have flat agent organizations, where maybe one manager manages hundreds of agents, or a big vertical tree where you have maybe ten different hierarchies of agents managing each other. A lot of these systems are possible, and it just depends on the task and what you're specializing in. The biggest challenge for these kinds of systems is: how do you exchange communication effectively without losing information? How do you build syncing primitives? How can communication from one agent that's very far from another agent in the hierarchy be passed very effectively across the chain? There are a couple of frameworks out there looking to solve these problems: how do we make this communication protocol robust, and how can we have mechanisms to reduce miscommunication? A big one in this space is MCP, the Model Context Protocol. This is a protocol that came from Anthropic that a lot of people are using right now. It's a simple wrapper around APIs.
What it does is give you a streamlined, standard format around each API. By creating an MCP wrapper around your service (maybe you have a file server that exposes an API, and you create an MCP wrapper for it, or for your email client, or a Slack client, or something running on your computer), all these MCP-connected servers can communicate with each other and do things for you. This allows for very effective communication, where you can control the routing and make things modular, so you can plug in new services as you want to. Similarly, another framework in this space is the Agent-to-Agent protocol. This is a new protocol that came from Google very recently, designed for agents communicating with other agents, and it adds a lot of reliability and fallback mechanisms. I'm not sure how many people here in the room have used MCPs. Yeah, not many. Okay, cool. So MCPs are actually very cool. What they do is abstract your APIs and make them very, very modular, so you can plug your API into the MCP protocol, and once it's wrapped, you can interconnect it with any other service that supports MCP. In a sense, it becomes like having a standard interface for communication across the different services or applications you have, exposing them and letting them connect and talk to each other. Similar to how you have HTTP for communication on the normal Internet, MCP becomes an interesting protocol for communication across different services. And yep, if you have a client like Claude or Replit or some other model, you can connect it to servers that support the MCP protocol. You can have a bunch of different services.
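A rough sketch of the standard-interface idea behind MCP: every tool sits behind one uniform JSON envelope, so a client can discover and call tools without knowing each underlying API's spec. This is an illustration of the concept only, not the actual MCP SDK or wire format, and all names are assumptions:

```python
import json
from typing import Any, Callable

class ToolServer:
    """Minimal MCP-style server sketch: tools register behind a uniform
    schema, clients discover them by name and call them with one JSON
    envelope, whatever the underlying API looks like."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.tools: dict[str, Callable[..., Any]] = {}

    def tool(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        # Decorator that registers a plain function as a callable tool.
        self.tools[fn.__name__] = fn
        return fn

    def list_tools(self) -> list[str]:
        return sorted(self.tools)          # dynamic tool discovery

    def call(self, request: str) -> str:
        # One standard request/response envelope for every tool.
        msg = json.loads(request)
        result = self.tools[msg["tool"]](**msg["args"])
        return json.dumps({"tool": msg["tool"], "result": result})

files = ToolServer("file-server")

@files.tool
def read_file(path: str) -> str:
    return f"<contents of {path}>"         # stub for a real file read

print(files.list_tools())
print(files.call(json.dumps({"tool": "read_file", "args": {"path": "notes.txt"}})))
```

Wrapping an email client or a Slack client would follow the same pattern: register each operation as a tool, and every client speaks the same envelope.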
Each service could be some sort of data tool, like a database API, or pretty much anything else, and they can all interconnect and do modular things for you. And because MCPs are not dependent on the spec of your API, they let you absorb a lot of changes and add a level of modularity and abstraction by standardizing the whole interface. You can also have dynamic tool discovery, because you can find different MCP servers exposed in directories of some sort, and then plug in the MCP servers you like and connect them, so you can plug in new tools, . speaker 3: swap them out, and you can route . speaker 2: information based on what you want to do. Okay. Finally, touching on some of the issues when it comes to agent systems. So far, we have seen a lot of different things: how these agents work, how we can evaluate them, how we can train them, and how we can think about communication between different agent systems. Even though a lot of these things are very interesting and a lot of them are taking off, there remain key problems in the space that still have to be solved for these agents to become practical, to be applied in everyday life, and to become useful for you. The biggest one is just reliability: these systems have to become very, very reliable. They need to be close to 99.9% reliable if you're giving them access to your payments and your bank details, for example, or if they're connected to your emails, calendars, and other services. You want to really trust them: you don't want these systems to go rogue and maybe post something wrong for you on socials, on your Twitter or your LinkedIn, and you don't want them to go and create havoc or make a wrong transaction on your behalf. So the question becomes: how can you trust an agentic system that's operating autonomously? That's where reliability becomes a big thing.
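One common reliability guardrail for the trust problem just described is to gate high-risk actions (payments, posting to socials) behind an explicit human approval step. A minimal sketch, with illustrative action names; a real agent framework would have its own action registry and confirmation UI:

```python
from typing import Callable

# Illustrative set of actions considered irreversible or high-stakes.
HIGH_RISK = {"payment", "bank_transfer", "email_send", "social_post"}

def execute(action: str, params: dict, approve: Callable[[str, dict], bool]) -> str:
    """Run an action, but require human approval for anything high-risk."""
    if action in HIGH_RISK and not approve(action, params):
        return f"blocked: {action} requires human approval"
    return f"executed: {action}"                   # stub for the real side effect

deny_all = lambda action, params: False            # cautious default policy
allow_all = lambda action, params: True            # simulates explicit user approval

print(execute("web_search", {"q": "flights"}, deny_all))   # -> executed: web_search
print(execute("payment", {"amount": 100}, deny_all))       # -> blocked: payment requires human approval
print(execute("payment", {"amount": 100}, allow_all))      # -> executed: payment
```

Low-risk actions flow through autonomously; anything touching money or public posts stops until a human signs off, which is one way to keep an autonomous agent from going rogue.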
The second issue with autonomous agents is looping. These agents can do something wrong, get stuck in a loop, and just repeat that process again and again. If you give them a task, maybe the restaurant booking task I showed before, and the agent goes to the wrong restaurant, it might just keep trying the same thing again and again without knowing what to do. Those kinds of issues can happen with a lot of agents, and you might end up wasting a lot of money and compute, so it's very important to be able to detect and correct that. That leads to a lot of use cases around testing agents: how can we properly benchmark them in the real world on a lot of different use cases and make sure we're learning from that? And once we deploy these systems, how can we observe them? How can we know what is happening? Can we monitor them online? Can we have some sort of safety, which could be based on audit trails, so we can audit all the operations the agent has done so far? And we can also have human overrides: if something goes wrong, we have some sort of human fallback, where maybe a remote operator can take control of the agent and correct it and fix it, or you yourself can directly take control and fix it. This is similar to Autopilot in a Tesla: when you're driving on Autopilot and it's about to do something wrong, you can take over control and override the system. That becomes very interesting when you're thinking about real-world deployment of agents. Cool. Okay. So that was the whole lecture on agents. Sorry, some things were a bit messy; we had to put together some final slides. Happy to take questions. Yeah. speaker 3: Go on. What do you see as the accuracy?
You showed, say, 40% or so on some of these tasks. Do you think there's a path to 99 or 99.999? And is it just iteration, or are there clear things that still have to be solved? speaker 2: So this is definitely possible, especially with reinforcement learning; I showed the Agent Q method before. Right now, a lot of these models, whether it's Claude Sonnet or GPT-4 or Gemini, are not trained on these agentic interfaces and tasks. That's why they're working zero-shot: they were never trained, in their training distribution, on actually going and optimizing these problems. So when they encounter these new interfaces or these new kinds of tasks in the real world, they often fail. But if you're able to train the system directly on these tasks, using reinforcement learning, data collection, and self-improvement, then you can actually reach very, very high accuracy. On the OpenTable task with Agent Q, we reached around 95% accuracy, and if you keep training these systems, you can fully saturate them and reach close to 99.9%. The hard thing is the diversity of tasks. There are millions of websites, and if you want to train an agent that's 99.9% reliable on each website, that's a hard challenge. That's something that's very interesting to me: how can you build a generalized agent that can work on the whole Internet, that can generalize to everything? Maybe in the future you'll have agents that can automate all of voice calling, all of computer control; maybe they can also use all of the APIs and everything. Something like that is possible theoretically; it's just very hard to build out. speaker 3: Yeah, do you know whether agents are able to solve . speaker 2: CAPTCHAs? They can. speaker 3: What do you think the implications of that are for
how the Internet is going to work in the next ten years? speaker 2: It's definitely very interesting. I would say it's a cat-and-mouse game. You've seen the new generation of CAPTCHAs becoming harder and harder to solve, and I think it's very hard to win this, because if a human can do it, theoretically an agent can also go and do the same thing. Over time, I think we'll just have to figure out better methods of identity. Biometrics can be a big part of that: if you're able to use fingerprints or some sort of two-factor mechanism, then we know this is an actual human, . speaker 3: not an agent. So there's this article called AI 2027 that you've probably heard of, which outlines where AI research is going to go and what might happen: in 2027, we automate programming, and then we automate AI research. After your lecture, I was wondering, do you think we could automate the process of creating AI agents? Because from what I understand, the main bottleneck is, how am I going to access UIs and APIs? How am I going to access data that is enclosed in these complex and somewhat dynamic systems? So what if, very simply, someone designed an agent that was optimized to vectorize APIs and UIs, and then you designed an agent that was optimized to train agents on those vectorized data sets? Because there are specific architectures you can use to train agents. Do you think we will see, in the future, people confidently automating the process of creating AI agents, making all these niche, specific AI agents we're seeing on the market obsolete? Yeah, I absolutely think . speaker 2: so, this is going to happen. I think it's already happening at the bigger labs.
At the big labs there's a lot of this, and there are also recent papers on AI research agents that can go and write research papers, train models, and do a bunch of things. So it's totally possible for agents to self-improve and build other agents, and you can have a whole process for how that happens. And it's definitely possible to train on a lot of these data sources and APIs, find ways to represent them, collect the right types of details, and improve on that. I do think that seems to be the future of a lot of hard research, especially areas like protein design and a lot of the hard sciences. So we'll definitely see a lot of that happen. speaker 3: Hi Div, nice to meet you again. Just to give you context, we're building a platform for AI agents; basically, it's like Uber for AI agents. So I've been working on agents for a long time. The biggest problem with agents has been, as you said, reliability and hallucination. So the first thing we tried to work on is, how do we prevent agents from hallucinating? The next thing is which models are best at executing actions. From my research, we realized that Claude is great, and better still, we have GPT at the end. So we have a team of agents doing work, and the action agent ends up being a GPT agent, because we struggled with some agents; as you said, GPT-4o is great at taking action, and other models don't seem to work as well at planning and other stuff. And the third challenge with building agents is the fact that end users can only take one hit. If I give out the product and it makes one mistake, there is no space for reinforcement learning, in the sense that if I say, book my flight, like I told Manus to do yesterday, and it made one mistake, I lost trust.
So the problem is, to work in the real world, our agents have to avoid making mistakes in the real world. That brings us to sandboxes, and I love what you're doing with sandboxes and clones of websites. The challenge with sandboxes is you can't clone every website on the Internet, and where humans excel is that, given a new task, they figure out their way around it. So these are the challenges we have with agents, and I'm happy to talk more about it now or later. speaker 2: Totally, totally. Just so I get the gist of it, what's the exact . speaker 3: question there? So I think the question is, how do we make them ready for the real world? We have a voice agent that does a good job with calling but makes mistakes. We have an email agent; mine got stuck in a loop and sent an email to an investor five times. We have a coding agent that wiped out 3,000 lines of code for me yesterday, and I had to redo it. So we have these challenges in the real world, and people like my wife are not going to take a one-shot hit; they will just stop using it. So I think the question is, how do we prevent agents from hallucinating, right? speaker 2: Yeah. So it's definitely a hard problem, but you can keep improving these agents. If you look at a lot of the initial models that came out, like the first version of GPT-3 and so on, they hallucinated a lot. But as you get bigger models with more parameters, trained on more data, they start hallucinating less. If you look at the newer-generation models, GPT-4 and Claude: over time, as we figure out how to make better foundation models, a lot of these errors in the systems go down, especially hallucinations and other things that can happen. You just require a lot of monitoring, evaluations, and testing, and this also becomes very domain-specific.
So if you're working on a domain-specific problem and you want an agent that works 99.9% of the time in that domain, then what you want to do is curate the right task cases. You can say, okay, here are a thousand scenarios we really care about; can we test these agents on those 1,000 scenarios all the time? That could be in production, when you're actually running with your users, or it could be some sort of offline simulation where you're testing daily: are there any regressions in the system? What happens if you change a prompt? What will that look like? If you're able to build very robust testing, then you can verify that your accuracies are going up, and it becomes a question of whether you can fine-tune this agent to become better and better for your use cases. So I would say the correct answer is a combination. Models will become better and better over time, so you can simply trust them more as new models come out. And the second thing is that you want very domain-specific testing and evaluation: for your own use case, can you have some way to rank which model is doing what and how good it is, make the right judgment, and be able to fine-tune, use reinforcement learning, and use other techniques to make them better over time? speaker 3: Because I think, with the problems of large models, I don't think you need all of that. So the question here is, do agents need to become small language models? Yeah. So that's an interesting question. speaker 2: We are already seeing some hints of this. If you look at a lot of the newer models, they're trained on reasoning traces, and we have found you can actually train smaller models on reasoning traces and get better accuracies.
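The daily regression testing described in this answer (a fixed suite of curated scenarios, a pass rate, and a gate against the previous release) might be sketched like this, with a stub agent and stub verifier standing in for the real system; all names and the baseline value are illustrative assumptions:

```python
def run_agent(task: str) -> str:
    # Stub agent; in practice this would drive a real model plus tools.
    return "booked" if "book" in task else "done"

def passes(task: str, output: str) -> bool:
    # Per-task verifier; real checks would inspect the resulting environment
    # state (was the reservation actually made?), not just the output string.
    return output == "booked" if "book" in task else output == "done"

# Curated scenarios the team really cares about, run on every change.
SCENARIOS = [
    "book a table for two at 7pm",
    "book a flight to SF",
    "summarize this email thread",
]

def eval_suite(agent, scenarios) -> float:
    results = [passes(t, agent(t)) for t in scenarios]
    return sum(results) / len(results)

BASELINE = 1.0  # illustrative: pass rate of the previously released agent
score = eval_suite(run_agent, SCENARIOS)
print(f"pass rate: {score:.0%}")
if score < BASELINE:
    print("regression detected: block deployment")
```

Running the same suite after every prompt change or model swap is what turns "did we regress?" from a guess into a measurement.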
A lot of the newer models, like GPT-4o and the new series like o3-mini, are actually distilled small models, fine-tuned using reinforcement learning and other techniques to be very good at reasoning. We're already seeing that with the new generation of thinking models coming out, the o1 and o3 series. That shows that smaller models with better reasoning and better processing can actually get there. It will be interesting to see how far you can push the limits, and what this will look like: over this year, what are the best accuracies we can expect from these kinds of reasoning models? Can they actually be PhD-level at mathematics, and even reach superintelligence in a lot of these specific domains? speaker 3: I think the ultimate test is the reward. And my proposed architecture, which I think may work, is that the manager agent could be a large language model and the worker agents could be small language models, because I think there's distillation happening when you're collaborating in a team. Yeah. And my last question is regarding memory. In the analogy you gave with respect to a computer, we have random access memory, we have the ROM, and then we have the hard drive. With AI agents right now, I think they just have random access memory. I don't think they have the hard drive, and that's the consciousness, right? Why they're working. I think that's a challenge. I would like to know, how do we implement that kind of system to make it work sort of like a computer? speaker 2: Yeah, that's an interesting question. I'll be curious if you actually try the experiment and see how that works. I'd say the straight answer is, it just depends on what you're building and what your applications are.
Different models might work better for different things: if you're doing a coding task, a coding model might work better, versus something more chat-based, or actions, and so on. I think you just have to find the right ingredients, in a sense, the right components for your application, and go and build that. Yeah. So there's no right answer to it . speaker 3: in a sense.