speaker 1 [00:00:01-00:01:04]: So I was excited that all came from Gemini. So I was hoping there'd be some wild hallucinations in there, you know, things that I hadn't actually done or accomplished continents I'd not actually visited. First of all, it's great to be here. When we started Sierra two and a half years ago, we had to explain to our prospective customers and even to our friends what agents even were. And here we are, you know, 30 months later, and there is a graduate seminar and a MOOC on agents. so super exciting to see that. i think this will be a little bit different talk from some of the others that you've seen, much more practical. sierra is an applied ai company. i grew up at google on the product and engineering side, and what we're aiming to do with sierra is to solve one of the oldest problems in business, which is this tension between the cost and quality of how a company can serve its customers. and unless you're like the four seasons or hermes or some other fancy brand, you can't show up in every interaction with.
speaker 1 [00:01:05-00:02:08]: Kind of a concierge level, white glove level of service. That's the problem we're trying to solve. And we've been in in business for two and a half years. We started in March of 23 again, when no one had heard of agents. And my cofounder Brett and I have actually known each other. So in a way, the company has been in the making for a while. We've known each other for 20 years. We both met at our in our first year at Google, where we both started our careers. Brett was most recently co CEO at Salesforce, and he's the current board chair. At OpenAI, we over the past two and a half years have worked with hundreds of companies to serve hundreds of millions of their customers with customer facing AI agents. So to make this concrete, if you were to order some new flip flops and from Olacai and they were the wrong size and you called Olacai or chatted with Olacai, right? You'd interact with an agent that we built with and for Olacai and it would help you with a warranty, an exchange, get you in a better color or size or fit.
speaker 1 [00:02:09-00:03:07]: A dt, they're the blue rectangle homes, blue octagon home security company. If your battery in your alarm panel, conks out will help you figure out which panel you have and drop ship. You a new one from there. We work with a number of the fortune 500, fortune, 100 fortune, 20 fortune, 10 in building customer-facing ai agents, they can pick up on chat, they can pick up on voice. And they do everything, as i said, from return shoes and debug alarm systems to send satellite signals from space. In the case of sirius xm, where you're moving from one car to another, and we need to send a new encryption key from space to get you up and running on that new on that new radio. So what i thought i'd do today is first, just orient you on kind of how we think about the world of agents. We obviously occupy one corner of the world of agents. I think broadly we categorize them into three buckets. Number one, personal agents.
speaker 1 [00:03:07-00:04:06]: So chachi bt gemini agents that will work as our own kind of trusted personal digital assistant two or the kind of role-based or persona based agents like coding agents. Or other agents that can help you in the context of getting your job done. Legal agents like those from Harvey are another category. And the third category in our mind are customer facing agents. We think that every business in the future is going to have its own customer facing agent. What I thought I'd do is orient you on how we think about that and kind of our corner of the world of agents, and then just share a bunch of lessons learned from having to make contact with reality again and again and again. That hopefully give you some ideas for the trials and tribulations in actually deploying agents in the real world. First of all, all this stuff is moving incredibly quickly. I was actually born in Mountain View on the peninsula here. I grew up in Silicon Valley at the advent of the internet, and nothing I think anyone has ever seen has played out as quickly as this.
speaker 1 [00:04:06-00:05:03]: Took the internet 11 years from about 1991 to 2002 to go from not existing to having 百 分 之 10 of the world use it. In a given week, chat gpt is you all know took less than 25 months to get there. And this really shapes our mindset in how we think about customer facing agents. We think that agents are to ai, as the website was to the internet, and apps are to mobile. And again, every company is going to have one that will do everything fromproduct recommendations, sign up, setup, product activation. Cross sell, upssell troubleshooting, subscription churn management retention and everything in between. And we're the category leaders in building those customer- facing agents And that's where we think the puck is going. A couple things. First of all, businesses talk about channels, phone as a channel, chat as a channel, email as a channel.
speaker 1 [00:05:03-00:06:05]: And our strong view is that you're going to have this world collapse from multiple channels to a single agent. That you imbue with knowledge and know how about how to do what you as a business, how you do it best, what great looks like. And then that agent is going to show up in every place where your customers are. So if they're texting you or what's apping you, right? Your agent will show up there, meet customers where they are and be able to have a conversation and not just answer questions but get things done. Same on the phone, same on email and ticketing. And i imagine a few of you have deep insight into the way that operations teams structure staffing these different channels today. But typically you'll have a team within a company managing a phone tree and ivr system separate team running chat, separate team running email. That's all going to collapse to a set of actually we've seen early signs of what these teams will call themselves. Many of our customers and the teams working to shape their agents have come to calling themselves ai architects.
speaker 1 [00:06:05-00:07:08]: So, instead of ai architects inside of companies, building, defining and shaping what the agent should look like. Another really important insight here is that we've gone from mobile app uis and menus and scrolling grids of products and so on, to world where actually just the conversation, the ability tospeak and converse back and forth with the piece of software. That is the interface. And it's wonderful because it's the interface we're all born with. We don't have to think about using it, having a conversation, asking a question, listening comes as naturally as anything to us. And we think that's a real breakthrough in how businesses are going to interact with their customers because there's nothing to learn, there's no app to navigate, there's no kind of hierarchical menu structure to get through and so on. agents for us have really made it possible to introduce a new business model. So if you look at the history of software, twenty or thirty years ago, you would go to a store like Fry's Electronics, you would buy a shrink wrap box with a CD-ROM or a stack of floppy disks in it.
speaker 1 [00:07:08-00:08:13]: You'd install it and regardless of how much you used it, you had paid for it and you were done. That of course transitioned to SaaS, offers a service where you subscribe to it. More recently, companies like Amazon Web Services, Snowflake introduced consumption based pricing where you pay for what you use. We have pioneered what we call outcomes based pricing, where a company only pays Sierra if we successfully resolve whatever problem a customer has written in about. And so if again, you were returning a pair of shoes and an AI picked up the phone, helped you find a new one, drop shipped a new pair in the mail. We charge our customers then, if not, there is no charge. Has this important aspect where it deeply aligns our incentives with our customers, right? Where we win, when they win, they only pay us when they're saving money or making money, if one of our agents actually makes a sale. So another principle that we think is quite important. Today we're in the world of agents as a piece of technology, and everyone is still figuring out how to get agents to do what you want them to do reliably.
speaker 1 [00:08:13-00:09:14]: And one of the questions we encounter all the time, in particular from large and technically sophisticated companies, is why not just build my own? And I think to most engineering managers, to most engineers inside companies. Building an agent kind of looks like this. Like let me choose the language model I want to build on. Will I use Pinecone? Will I roll my own semantic search engine? Let me integrate with some tools and some APIs and we'll be done. What we have found over the last two and a half years building these things is that when you put your scuba tanks on and you go under the surface of the ocean, you discover there's just like a whole lot of stuff that you need to get right. Everything will deep dive into some of this. What does version control look like? What does release management look like in the context of agents? How do you observe what they're doing? How do you make sure that they don't make stuff up? How do you make sure that if you're in financial services, your agent doesn't dispense financial advice, which is illegal?
speaker 1 [00:09:15-00:10:14]: How do you make sure if you're in the healthcare setting, you don't diagnose a patient and suggest they take some medicine, which is also illegal? In voice, how do you mitigate latency? How do you transcribe proper nouns and get tonality, accent, and more right? How do you prevent bad actors from poisoning the context and prompt injecting you so you spill the beans or take some bad action on behalf of the user? All of these are problems that we've had to solve in building Sierra. And even today, as proud as we are of the platform we've built, it very much feels like the 1997 era for building agents. What was 1997? It was kind of the internet had been around for a couple of years, but no one had really figured out how to build and scale web services. So AJAX did not yet exist. The PHP, MySQL, Apache kind of lamp stack. Did not exist.
speaker 1 [00:10:14-00:11:14]: And so people were cobbling together web applications. And in fact, in 1997, there was an article in wired about a bank. Converting its website from basically a business card. It's like here's where we're based here are hours and so on to a very lightly transactional system, basically a web form and a submit button and they spent 13 million dollars on it. It was like to get a functional website that did anything in 1997, $13 million, and that's because it was fundamentally a technology problem. You had it was in the domain of engineers to actually build these things. What we're trying to do with Sierra and where we think the puck is heading is moving from cobble together technology to a world where agents are products, where agents can be configured, built, kept secure and so on, with the product stack as opposed to having to. In a kind of artisanal way, pop the hood, go under the hood and actually build these things piece by piece.
speaker 1 [00:11:14-00:12:14]: And our goal in doing that is to make software and a product that is simple but anything but simplistic. And so all of the expressiveness that you have if you're hands on with agent building frameworks, but abstracting away a lot of that complexity under the iceberg to make it possible. ah to to build highly capable agents. the other thing we're working on is moving agents from a world of being transactional, where basically every interaction is a one-off, they forget things between sessions, and so if you're calling back a second or third time today, it's kind of like agents have amnesia. of course we've seen memory in the context of chatgpt, there are lightweight ways of doing it using rag. what we're really trying to do is build this kind of context and memory foundation for agents, so that when you call back, right, your conversation is starting on second or third base as opposed to. Hi, who are you?
speaker 1 [00:12:14-00:13:07]: I have no idea what you're calling in about and so on. So we think that that warm start problem, which is based on getting memory right, is going to be super important to solve. And it's one of the things that we've been working on and really trying to make this transition from transactions to building relationships with a company's customers across multiple interactions and transactions. so um. 啊，我只系去到呢个。 What we realized in building sierra and building our first agents is that agents are and this won't be news to you is you're in this course, a fundamentally new type of software. And these agents are non-deterministic for a given input. You have no idea a priori what the output will be. And so they really deserve an entirely new approach to software development, taking the software development life cycle.
speaker 1 [00:13:08-00:14:04]: That has been matured over the last two decades in the context of industry in particular, and from first principles think through what does this process look like for building these agents again, which are fundamentally non-deterministic, which have to deal with the messiness of human language and on the phone not just a single person's language, but what if the TV's on in the background or a dog's barking or you have two speakers? All of the complexity. So how do you build them? How do you test them? And how do you optimize them over time? Bless you. The the first thing we realized in creating a platform is that most large companies are faced with this choice of building or buying you build kind of an off-the-shelf limited sas solution or you're building it yourself from a kit of parts from scratch. And what we've tried to embrace with Sierra is this mindset of building with or building on.
speaker 1 [00:14:04-00:15:03]: So fundamentally, what Sierra provides is a platform as a service, a set of Lego bricks where we've abstracted away a lot of the lower level complexity, all that gunk beneath the surface of the ocean and the iceberg, and made it possible to build with these higher level abstractions. And so when we're approaching companies. We try to offer again the flexibility of build and the power of build with the ease and simplicity of buying a solution. One of the really cool things about building agents in particular voice agents is that for the first time, right? Every channel that a company interacts with its customers over is digital and can be understood and and broken down and attributes and what happened there. Today，most phone calls that happen to call centers，they aren't even transcribed，right？ So there's no way to analyze them，there's no way to understand what actually happened in them，forget running like controlled AB experiments，right？
speaker 1 [00:15:03-00:16:04]: So most Fortune 500 companies will have a contact center with thousands of people in it。 The only way that you could AB test a conversation today was to hand a new script to 1000 people and a different script to 1000 people and hope that they read it。 And kind of comply with it. So with agents in particular over the phone, all of a sudden you can apply all of the lessons and techniques from 20 or 30 years of building websites, a b testing. Control experiment design, and so on, to conversations and to agents. i talked about this model of again building once and deploying everywhere. and one of the things, one of the aha moments we had was, You should not think about the means of communication as relevant in building an agent. Instead, what should the agent know? What should the agent be able to do? And then giving it the ability to pick up on whatever channel a person happens to be reaching the company on.
speaker 1 [00:16:04-00:17:04]: That is the way you get economies of scale. That's the way you cross-train across chat conversations, voice conversations, email and ticket-based conversations. And so what again we're trying to do there is make it possible to define what the agent does once and have it show up everywhere. Again, on this technology shifting to a product approach, our platform has taken the approach of both providing code, a code-based SDK, so it's a declarative programming language where you can express what the agent should do as opposed to the how. So it abstracts away a lot of the underlying large language model calls, and for engineering teams that want to integrate the development release and version control of their agent directly into their software development lifecycle, they can do that in code in a GitHub repo or whatever source control tool they use. At the same time, with many of the companies we work with, they're big operations teams that are often closest to the customers and know the experience best.
speaker 1 [00:17:05-00:17:58]: And so we have a set of no-code tools that makes it possible to express in kind of structured english language what the agent should do, what it should know about, tools it should have access to. And more, and then on this moving from transactions to relationships, we actually just last week announced something we call the agent data platform. So real first of its kind that combines memory, so long term memory, and the ability to store away context and recollections. From interactions that the agent has had，it also enables a business to import data from what's called a customer data platform. Think of this as your database on everything you know about your customers or potential customers，and then to use that to hill climb and optimize to optimal ways of delivering a sales pitch，saving a subscriber from potentially churning and leaving you as a subscription business.
speaker 1 [00:17:58-00:19:00]: And more, and then also the ability to actually trigger outbound phone calls or text messages, so that the agent that's developed on on the platform can actually reach out and proactively engage with a company's customers. so we're gonna double click into a couple interesting things here starting with voice so ah has anyone tried to build an agent purely on one of the real-time models audio in audio out like the openai real-time api? Okay, one one person in the back, you're you're brave. So one of the interesting things about these audio to audio models is they tend to be smaller and they tend to be totally uncontrollable. They make stuff up and you can also gaslight them into doing all sorts of things. If right, you say, hey, please talk like batman for this conversation. It'll start talking like this and then use a little bit gravelier, a little bit slower. And it'll go slower and even more like and just more and more ridiculous, right?
speaker 1 [00:19:01-00:20:04]: And so one of the challenges we've had obviously audio in audio out with all of the nuance of audio tokens at some point that will be the way today. The state of the art in building reliable scaled ai agents for voice is a pipeline where you're actually doing speech to text. You want to do this quickly as possible. You're then using whatever orchestration layer or reasoning loop you have to decide what to do and what to say, and then using a synthesis engine to take text and produce speech out the other end. And so there's a lot to get right here, and we'll go into a couple of the details here. First of all, latency, latency, you have to squeeze every 10 milliseconds out of this pipeline. and we've done a lot there, and i'll get into some of it. but, again, to string together these things, often with multiple speech-to-text models so that you're able to compare transcriptions to see if there are errors, and then choose the one that looks most correct.
speaker 1 [00:20:07-00:21:01]: and then making sure that you have things like filler phrases. oh, gotcha. let me look up that for you. so that the person isn't just like waiting for five seconds for the agent to respond. STringING TOG ALL OF THESE THINGS IN A PELINE THAT'sS REALLYTIMIED, TNS OUT TO BE REALLY HARD. EVEN B basicIC THINGS LIKE ME measuringUR LAT latencyENY turnsURNS OUT TO BE HARD, BECAUSE REALLY WHAT YOU CARE ABOUT IS L latencyENY betweenWEEN WHEN a PERSON stopsPS TALK, SO E OF utterance AS AS WE THINK ABOUT IT, AND THE TIME THE AGENT STARTS RES respondingONDING. RIGHT? SO IT'S IT'S NOT THE END OF THE KIND OF audioO SNET, RIGHT, THAT THE AGENT IS AN analyzingYZING. IT' whenEN THE PERSON STOP STOP TALK. So you actually have to be able to identify as quickly as possible when the person has stopped talking, so that you can kick off all of the reasoning and then synthesis at the other end.
speaker 1 [00:21:02-00:22:01]: Often what we will do is kick off multiple reasoning loops and synthesis steps, so that we can then look back and forget no they really did stop talking there, they didn't continue running on. So you're then. Ready with a response. We also do things like speculative inference requests once we're in the reasoning loop. So if any of you have looked at how long some of the foundation models take to respond, P90 latency can have huge variance. You can have 500 milliseconds, you can sometimes have 2500 milliseconds in the context of a voice conversation that makes a huge difference. So, one of the techniques that we'll do is fire off multiple requests to the same inference provider. Grab the first one that comes back and then use that to respond with. There are other really subtle things here. So for instance, just detecting interruptions and reacting appropriately.
speaker 1 [00:22:01-00:23:03]: If you or i were talking and i was speaking, you said, uh-huh. I wouldn't just stop speaking because you said something, i would know that you're just acknowledging what i'm saying and i keep going. So basically voice activity detection and detecting activity that is meaningful and that you want to interrupt that you want to stop talking to take in and then update kind of what you've heard turns out to be really hard. And for that, we've had to fine-tune our own models to actually detect this stuff. And distinguish between aha, aha, yep, gotcha, gotcha, distinguish that from no, no, I actually don't want to return my shoes, whatever it is. This is the interrupt, interrupt, interruptability. Finally again, if you followed closely the uptime for some of the frontier models, they are not in the, you know, five six nines of reliability, like hardened services from amazon web services and so on are.
speaker 1 [00:23:03-00:24:06]: So even the ability to fail over from one model provider to another has turned out to be really important in the context of delivering low latency, reliable voice conversations. Okay. Voice transcription. How do you actually? First of all, it's hard names are hard. Pronunciation is hard. Many of our businesses will have hundreds of proper nouns as part of just a conversation drug names, provider names, product names. One of the companies we work with is deeply inspired by the hawaiian islands. So, our agent there needs to basically be able to speak english and hawaiian at the same time, recognize hawaiian words, pronounce them correctly, and so on. So what do you do? First of all, getting the measurement right of accuracy in particular around transcription turns out to be quite important. And the word error rate turns out to be not at all the right metric, at least the first metric to measure this stuff with.
speaker 1 [00:24:06-00:25:05]: Say you have a a phone call, and you have a a speaker, but then in the background you have tv news on. Like, word error rate would want you to transcribe accurately every word heard. In that audio sample, but of course you don't want to do that. You want to only transcribe words from the primary speaker, and ignore everything from the background and secondary speaker. So, it is not as simple as as just word error rate. Uh, and what we found is, we've just had to roll our own metrics to do this. Well, and so i think one general piece of advice in building agents is don't assume that whatever metrics are out there to measure a component of what you're doing is necessarily the right set of yardsticks for you. If you're doing something very specific, get to the bottom. What does great look like? What does a great experience look like? And how do you actually measure? What great looks like in that context. ah voice synthesis is super interesting.
speaker 1 [00:25:06-00:26:05]: so turns out that just saying phone numbers is hard. right, 650-833, that, no, no, no, that's not how you'd say it. do you, do you say 833? do you say 9,000 or 9000? it turns out there are all sorts of kind of standard ways that we don't even think about, but if you don't tell the agent to do it a certain way, gets it wrong. addresses, proper, proper nouns and names. is it andrea? is it andrea? is it something else? again, all of these things, these details matter in getting the experience right. The phrasing quality and cadence and how the agent actually says stuff turns out to be hugely important. And one thing that we found is just taking the off the shelf language model outputs as if it were a chatbot and responding in text just completely falls over.
speaker 1 [00:26:06-00:27:06]: It sounds way too long. You're like, oh my god, just please get to the point. Yeah, yeah, yeah, let's. And if you think about how we speak on the phone, it's shorter, it's back and forth, there's more acknowledgement. Do you want more detail? Gotcha, yep, makes sense. And so what works in the context of chat doesn't work in the context of voice. I talked about a single agent as opposed to multiple channels. So one of the things that we've had to build into platform is the ability to actually tune and tweak the style of the agent depending on whether it's typing, whether typing over chat. ah even text versus chat on the web, there's subtly different expectations there. voice is a different beast altogether. And then the emotive range, this is kind of how you match the individual you're speaking with. This turned out to be important. And today, without the voice to voice models, we're relatively limited in what we can do on this front. But I think will be you know a year from now the expectation that the companies will have.
speaker 1 [00:27:08-00:28:12]: ah one of the strangest probably the strangest job title we have at sierra is our voice somali a so it's an unofficial job title but still a a real one and our voice somali a helps our customers match voices to the kind of personality and brand and vibe of their company. and so you can read some of the, some of the different attributes of a voice here. graveliness, right? you can imagine that one. breathiness, nasality, enunciation. some of these you've heard, probably others you haven't. i certainly had dips. verbal fry is another one. and so it's pretty interesting, just as like fine wines, you know, they smell like decaying violets or bootstrapped leather or whatever weird attributes, weird adjectives a sommelier might use to describe a wine. It turns out there's a similarly rich vocabulary, much more useful actually in identifying the right voice for your for your business.
speaker 1 [00:28:12-00:29:13]: But we have someone full-time on staff who's just listening to voices, evaluating on these different metrics, and then helping businesses match a voice to their company. You can imagine how Weight Watchers, a weight loss solution primarily for women, that voice might be very different from the voice for Harley Davidson and what you might expect, you know, that voice to be. so a lot of variety here, and it turns out you really have to have this vocabulary and the set of dimensions to to be able to capture the voice and and do the matching well. So you've built your agent, we've talked about some of the nuances and interesting things that happen in the context of voice. So I want to go back to the really the dark ages of 2023, and I'm going to share our first user agent simulator action interaction here. This is where we started. We've come a long way since, but I you know stroll down memory lanes kind of fun.
speaker 1 [00:29:13-00:30:19]: So the setting here is I built a prototype of our first agent. It was based on the React Loop, and the agent was extremely limited. It could answer questions about 1099 forms, and it could do simple math. And so you could ask it something like: Hi, I need help with my 1000 plus 33 times 3 form. Can you help me with that? And first okay, I got to do the math, do the math. Okay, that's you know 1099. Okay, that's a 1099 form. I can help with that. And answer questions about it. And we realized, well, if we're going to build a an agent to do that, let's also build a user agent that can write in and you can have an agent talking to an agent, it'll be a more efficient way of testing in particular because gpt-4 was just super slow at the time and we were not particularly optimized in our loop. It was like nice to be able to test things this way. So i i set up one agent to talk to another and they talked for a while and then we we went off to lunch and came back and i realized that i had not given either agent, the ability to actually hang up and end the conversation.
speaker 1 [00:30:20-00:31:20]: And so the user had said, how i need help with my 1099 form, the business agent had helpfully done the math and then answer the questions. And then and then there were like a hundred messages after that, and I'll I'll share what happened. So the customer said, oh, thank you so much for all of your assistance with editing the 1099 form and helping blah blah blah. I'm glad I could help you with editing the 1099 form and the math problem, your confidence and preparedness, blah blah blah. Customer message，I truly appreciate all the help that you provided with my 1099 form。 Thank you so much for your kind words。 I'm glad I could help。 So you get into this like thanking your welcoming arms race and it kept going and kept going。 Your help with the 1099 form and the math problem has been incredibly valuable。 I now feel confident。 I'm very happy to hear that I was able to help you。 I just wanted to say one last thank you。 For all of your help, thank you so much for your kind words. I've said it before and I'll say it again: thank you so much for your help. Thank you for your kind words.
speaker 1 [00:31:20-00:32:07]: I'm so glad I could be. I know that I've said it multiple times, but I truly appreciate all of your... So this just went on and on. It was like one of the most absurd things we'd ever seen. So anyway, that was our first user simulator. We've come a long way since again back to agents being a new type of software. They're nondeterministic, so you can't just have a unit test, you can't just have an integration test. You have got to test these things in as close to real world scenarios as possible, and that is where simulations come in. And the challenge here is you want the test to reflect somewhat realistically the actual environment that one of your agents is going to be interacting with, and so it has got to be able to speak with a simulated user back and forth. There needs to be.
speaker 1 [00:32:07-00:33:07]: messiness and that typos and and so on it has to be able to interact not just with users but with tools to take action and get stuff done you need to be able to check whether it's adhering to policies and the guardrails you've set for it it's going to be a multi-turn conversation and that's where our research team created tau bench so tau bench tool agent user tau is what it stands for the goal there is to provide a realistic testing harness for ai agents. To actually put them through their paces before they've ever interacted with with the real world, and what what we've done is build this benchmark that we're very proud it's become the or one of the standards for evaluating agents. And a couple things: first of all, we've created three or four very realistic domains, they're specific to customer service and support and. ACUSTER facing AI, whichICH is whereERE we spendEND mostOST of OUR TIME.
speaker 1 [00:33:08-00:34:10]: THINK telco, THINK R retail, THINKRL, AND within that, you have hundredsDS of simulatedAT SC scenariosARI whereERE here's a PRO problem, AG HERE are tools that you have access to, IN includingING D databases that faithITHfullyUL representENT the underlying tools that are beingING used. YOU have POL that basicallyIC define the BE behaviorsVI of the AG, SO you CAN'tET someoneOME'sS order unless they have verified theirIR OR order NUM number and emailDR as an examplePLE. A matched with that kind of realistic environment and domain for the ai agent. We've also created realistic user personas and so you have a language model based user agent with a persona. I'm angry and confused because my cell phone subscription was turned off and i want to reactivate it as an example and then. One of the things we realized in Tau Bench One was that actually the environment that the AI agent you're testing has access to may change based on actions that the user agent has taken.
speaker 1 [00:34:11-00:35:15]: So think about if you were to call in with a problem with your cable modem, and I were to walk you through okay step one go to the back toggle this switch and so on right? The domain would have changed on your side, and so it actually forces agents to be able to reason not only about actions that they have taken and how that would change state on their side. But also how actions that the user could have taken on their side and the impact on the world of those, and then finally being able to actually evaluate whether an agent has been successful or not is super important. You probably all seen LLM as judge right? Did this did this perform well? But actually what really matters in most of these cases is that the agent has taken some verifiable action at the end. they've mutated some database so that the orders returned a new pair of shoes is shipped and so on. so back to the complex databases and apis, we basically had to build a mini shopify, a mini airline reservation system and so on for the agent to be able to to interact with.
speaker 1 [00:35:16-00:36:17]: The other thing is, what matters is not can your agent do it once ever. If your agent as a business is having millions, tens of millions, even hundreds of millions of interactions, does it get each one of those right? And that's the pass at K metrics. So you think about if your agent solves the problem 95% of the time, well 95095 to the 20th power. is asymptotically approaching zero, right? So looking at not just what is the kind of best case, but what is the sustained over many runs. And so pass it K captures that. And we're super pumped about this opening eye anthropic, my alma mater where I spent 18 years Google are all using a towel bench to actually evaluate their frontier models in the context of agents that can interact with an environment, use tools and so on. We now have a leaderboard, I think at towelbench.
speaker 1 [00:36:18-00:37:20]: com, I should know the actual domain. Um, and the way we built this into our actual product is simulation simulations are kind of a battle hardened industrial grade version of towel bench where our customers can configure these synthetic databases and tools to interact with create mock user personas, including their emotional state. We can do these in voice as well and simulate the entire voice pipeline. And this is actually fun to listen to if Google Slides will let me do the build here. Audio will come through if not, I can hold this up. So what I'm going to play for you here is about a minute of voice simulations. What this is is a simulated user over voice talking to our agent also over voice. Listen for things like different accents, listen for background noise, and so all of this is stuff that our voice pipeline has to deal with on the fly.
speaker 1 [00:37:20-00:37:28]: So filtering out background noise. figuring out interruptions and so on. See if this comes through, and if not, I'll just mic my laptop.
speaker 2 [00:37:40-00:38:35]: dropping every few minutes, the roof has got the red light. okay, let's try to fix this together. First is the red light on the modem steady or blinking. 想问一下M二 I有没有宝会员号是Z X四开头的，用的是P计划您保险含盖M门付额为40元。 我刚。 Let's get this resolved. Can I have your last name and the baggage tag number? I'm taking my ID out. Hold on, it's SD and then 31. Sorry, 3782. Got it, just to confirm it's SC3782. That did it. Thanks for actually helping. My pleasure. Let me know if you need anything else. I'm here to help.
speaker 1 [00:38:40-00:39:38]: So you heard things like the my id number is s7, no s6, and the agent being able to respond to that. I don't speak mandarin. So the mandarin speakers in the room will have to evaluate, you know, whether that was any good. These the but multi-language you had the angry guy, maybe from tennessee, just landing in the airport. So dealing with kind of emotion with background noise of an airport. And what we found again, just as with kind of chat to chat agents, there's just no substitute for creating is realistic an environment where you're evaluating these agents with all of the messiness of poor quality phone connections, noise, multiple speakers and so on. So that's one of the ways that we put our voice agent through the paces. Another tool we've developed is a novel approach to traces. So really understanding for a given action that an agent has taken, what is everything that happened under the hood?
speaker 1 [00:39:39-00:40:39]: So that you can just cut down latency again and again and again, identify sources of latency and figure out, okay, how could you parallelize these more? Think of this as like x-ray vision into exactly what the agent is doing. And one of the things that we've been able to do here by having this kind of under the hood view of how all these calls stack up is, Figure out what we can paralyze where we can do things like speculative request hedging, where we grab the first request that comes back and all these other techniques for reducing latency. One of our beliefs is a company is that the problem, the solution to most problems with ai is more ai. And and so this is things like llm is judge or in our case, our core agent actually has a set of kind of micro agents. Sitting around it, one is a supervisor agent looking like, are you staying on on task? Are you only dispensing factual knowledge grounded in a knowledge base?
speaker 1 [00:40:39-00:41:38]: Make sure you're not dispensing medical advice, never disparage our competitors, and so on. And then on a task specific basis, we have other. Other kind of micro agents looking over the shoulder of the primary agent, making sure that it's doing things right. Within the product, we have a whole bunch of techniques for using ai to improve ai. One of those we call insights where we basically enable our customers to similar to chat gpt research, ask open ended questions about their customers and their customer conversations. So hey, what's driving lower CSAT? Or what are the top five reasons that calls are being escalated? Or for this new product line, what are the things that a customer seems most confused about? And so again, back to every channel is now digital. When you have a perfect transcription of every single interaction that you have with your customers, it becomes this treasure trove of information if you have the right tooling around it to pull the insights out. It becomes quite, quite powerful.
speaker 1 [00:41:39-00:42:39]: Another thing that we've shipped to our customers is what we call expert answers. So our AI agents are generally the first to pick up, and so the businesses we work with will point their customers first to an AI agent, and we handle 60, 70, 80, 90% of all incoming in the service and support domain, service and support inquiries. For the 20 or 10% that we can't handle, we seamlessly hand off to a person in a call center, will pick up the phone and can help. Solve the rest. What we've realized is often things that the AI agent doesn't know are things that the experts in those contact centers do know about. And so what expert answers has done is look across common reasons for escalation, and then seeing how do the experts on the phones answer the question, and then using AI to synthesize new knowledge to imbue the AI with. So it's kind of like, ah, here's a gap in your knowledge or capability.
speaker 1 [00:42:39-00:43:39]: How can I learn from what I've observed? Great, now let me improve that. Let me improve myself by incorporating that into subsequent runs of the agent. Coming up on the end here, one of the other things that you just always, when you have an agent in front of real life customers, customers asking about bank balances, customers asking about healthcare benefits and so on, is you have to be super, super hardened against prompt injection attacks, other flavors of agent abuse. And boy, are there many? So we work with multiple outside red teaming companies and have our own internal red teaming group that just only looks at all of the various ways in which to break our agents. And so I think some of the more exotic things are there was one prompt injection attack we saw which was asking it to reveal system instructions.
speaker 1 [00:43:40-00:44:35]: Uh, by asking that question in icelandic in reverse. So like that was one we hadn't seen before. Another flavor of kind of abuse was seeing if you could trick the agent into advising on how to smuggle gold bullion through customs in food in the legs of furniture. So it's like these are the types of things that that you need to be aware of. and so, again, what we've done in in this is more more ai and then deterministic guardrails and checks for things like prompt injection attacks. so around inputs, we have a bunch of deterministic checks, including in sensitive cases, llm-based supervisors that are not the main agent, but looking over the shoulder of the main agent, saying, hey, does it look like the main agent is trying to be attacked with prompt injection? what's going on here?
speaker 1 [00:44:36-00:45:33]: Within the agent itself, there's then a a a set of rules, policies, secondary supervisors, and then whenever the agent is generating an output, right, you wanna look for is is the agent barfing one or more of its prompts. And if so, clamp the session. And basically just end it there, and so, this kind of layered approach to prompt injection attacks, and then of course, for things like pulling data from business systems. Never ever do you want a language model based agent to just have unfettered access to a CRM or a transaction database or anything like that. And so all interactions with systems of record we manage through good old fashioned deterministic software with access controls, keys, and so on. And I think one of the really hard things about building agents, building them at scale, is you have all of the classic software vulnerabilities.
speaker 1 [00:45:33-00:46:34]: Right, so denial of service attacks, SQL injection attacks, you name it, all of the things that happen when you're taking open ended input on the internet combined with all of the AI vulnerabilities, prompt injection, context poisoning, and like those come together in a delightful and challenging way in the world of agents. Those are the problems. Many of those are the problems that we've been running after for the last two and a half years. So, i thought it is the most beautiful day i have ever seen over here in berkeley. And so i thought i'd be efficient and i'll look to, you know, the course leaders here on how long we want to go for questions, but i'm happy to answer questions about anything. And, and also, we are growing rapidly. We are hiring. And so, i know you're all in school for when you're not or for those on the, the mooc joining remotely, i'm clay at sierra.
speaker 1 [00:46:34-00:46:54]: ai. We're hiring for core software engineering research. For deployed engineering and product management，we call our agent development team business roles as well。 We are growing at a clip，and if you're interested in building battle hardened agents，we do a lot of that，and so we'd love to hear from you。 I'll pause。