2025-11-11 | Berkeley RDI | Agentic AI MOOC: Practical Lessons from Deploying AI agents by Clay Bavor
Sierra founder Clay Bavor's deep dive: practical strategies and engineering pitfalls for deploying enterprise-grade AI agents, from prototype to production
Transcript
speaker 1 [00:00:01-00:01:04]: So, that all came from Gemini. I was hoping there'd be some wild hallucinations in there, you know, things that I hadn't actually done or accomplished, continents I'd not actually visited. First of all, it's great to be here. When we started Sierra two and a half years ago, we had to explain to our prospective customers, and even to our friends, what agents even were. And here we are, you know, 30 months later, and there is a graduate seminar and a MOOC on agents. So, super exciting to see that. I think this will be a little bit different from some of the other talks you've seen, much more practical. Sierra is an applied AI company. I grew up at Google on the product and engineering side, and what we're aiming to do with Sierra is to solve one of the oldest problems in business, which is this tension between the cost and quality of how a company can serve its customers. And unless you're, like, the Four Seasons or Hermès or some other fancy brand, you can't show up in every interaction with, speaker 1 [00:01:05-00:02:08]: kind of, a concierge-level, white-glove level of service. That's the problem we're trying to solve. And we've been in business for two and a half years. We started in March of '23, again, when no one had heard of agents. And my co-founder Brett and I have actually known each other for 20 years, so in a way, the company has been in the making for a while. We both met in our first year at Google, where we both started our careers. Brett was most recently co-CEO at Salesforce, and he's the current board chair at OpenAI. Over the past two and a half years, we have worked with hundreds of companies to serve hundreds of millions of their customers with customer-facing AI agents. So, to make this concrete: if you were to order some new flip-flops from OluKai, and they were the wrong size, and you called OluKai or chatted with OluKai, right?
You'd interact with an agent that we built with and for OluKai, and it would help you with a warranty, an exchange, get you into a better color or size or fit. speaker 1 [00:02:09-00:03:07]: ADT, they're the blue octagon home security company: if the battery in your alarm panel conks out, we'll help you figure out which panel you have and drop-ship you a new one. We work with a number of the Fortune 500, Fortune 100, Fortune 20, Fortune 10 in building customer-facing AI agents. They can pick up on chat, they can pick up on voice. And they do everything, as I said, from returning shoes and debugging alarm systems to sending satellite signals from space, in the case of SiriusXM, where you're moving from one car to another and we need to send a new encryption key from space to get you up and running on that new radio. So what I thought I'd do today is, first, just orient you on kind of how we think about the world of agents. We obviously occupy one corner of the world of agents. I think, broadly, we categorize them into three buckets. Number one, personal agents. speaker 1 [00:03:07-00:04:06]: So ChatGPT, Gemini: agents that will work as our own kind of trusted personal digital assistant. Two, the kind of role-based or persona-based agents, like coding agents, or other agents that can help you in the context of getting your job done. Legal agents like those from Harvey are another category. And the third category, in our mind, are customer-facing agents. We think that every business in the future is going to have its own customer-facing agent. What I thought I'd do is orient you on how we think about that, kind of our corner of the world of agents, and then just share a bunch of lessons learned from having to make contact with reality again and again and again, that hopefully give you some ideas for the trials and tribulations in actually deploying agents in the real world. First of all, all this stuff is moving incredibly quickly.
I was actually born in Mountain View, on the Peninsula here. I grew up in Silicon Valley at the advent of the internet, and nothing I think anyone has ever seen has played out as quickly as this. speaker 1 [00:04:06-00:05:03]: It took the internet 11 years, from about 1991 to 2002, to go from not existing to having 10 percent of the world use it in a given week. ChatGPT, as you all know, took less than 25 months to get there. And this really shapes our mindset in how we think about customer-facing agents. We think that agents are to AI as the website was to the internet and apps are to mobile. And again, every company is going to have one that will do everything from product recommendations, sign-up, setup, product activation, cross-sell, upsell, troubleshooting, subscription churn management, retention, and everything in between. And we're the category leaders in building those customer-facing agents, and that's where we think the puck is going. A couple things. First of all, businesses talk about channels: phone as a channel, chat as a channel, email as a channel. speaker 1 [00:05:03-00:06:05]: And our strong view is that you're going to have this world collapse from multiple channels to a single agent that you imbue with knowledge and know-how about what you do as a business, how you do it best, what great looks like. And then that agent is going to show up in every place where your customers are. So if they're texting you or WhatsApping you, right, your agent will show up there, meet customers where they are, and be able to have a conversation, and not just answer questions but get things done. Same on the phone, same on email and ticketing. And I imagine a few of you have deep insight into the way that operations teams structure staffing for these different channels today. But typically you'll have a team within a company managing a phone tree and IVR system, a separate team running chat, a separate team running email.
That's all going to collapse to a single team, and actually, we've seen early signs of what these teams will call themselves. Many of our customers, and the teams working to shape their agents, have come to calling themselves AI architects. speaker 1 [00:06:05-00:07:08]: So instead, you have AI architects inside of companies building, defining, and shaping what the agent should look like. Another really important insight here is that we've gone from mobile app UIs and menus and scrolling grids of products and so on, to a world where actually just the conversation, the ability to speak and converse back and forth with the piece of software, is the interface. And it's wonderful because it's the interface we're all born with. We don't have to think about using it; having a conversation, asking a question, listening comes as naturally as anything to us. And we think that's a real breakthrough in how businesses are going to interact with their customers, because there's nothing to learn, there's no app to navigate, there's no kind of hierarchical menu structure to get through, and so on. Agents for us have really made it possible to introduce a new business model. So if you look at the history of software, twenty or thirty years ago, you would go to a store like Fry's Electronics and buy a shrink-wrapped box with a CD-ROM or a stack of floppy disks in it. speaker 1 [00:07:08-00:08:13]: You'd install it, and regardless of how much you used it, you had paid for it and you were done. That of course transitioned to SaaS, software as a service, where you subscribe to it. More recently, companies like Amazon Web Services and Snowflake introduced consumption-based pricing, where you pay for what you use. We have pioneered what we call outcomes-based pricing, where a company only pays Sierra if we successfully resolve whatever problem a customer has written in about.
And so if, again, you were returning a pair of shoes and an AI picked up the phone, helped you find a new one, drop-shipped a new pair in the mail, we charge our customers then; if not, there is no charge. This has this important aspect where it deeply aligns our incentives with our customers', right? We win when they win; they only pay us when they're saving money or making money, if one of our agents actually makes a sale. So, another principle that we think is quite important. Today we're in the world of agents as a piece of technology, and everyone is still figuring out how to get agents to do what you want them to do reliably. speaker 1 [00:08:13-00:09:14]: And one of the questions we encounter all the time, in particular from large and technically sophisticated companies, is: why not just build my own? And I think to most engineering managers, to most engineers inside companies, building an agent kind of looks like this: let me choose the language model I want to build on. Will I use Pinecone? Will I roll my own semantic search engine? Let me integrate with some tools and some APIs, and we'll be done. What we have found over the last two and a half years building these things is that when you put your scuba tanks on and you go under the surface of the ocean, you discover there's just a whole lot of stuff that you need to get right. We'll deep dive into some of this. What does version control look like? What does release management look like in the context of agents? How do you observe what they're doing? How do you make sure that they don't make stuff up? How do you make sure that if you're in financial services, your agent doesn't dispense financial advice, which is illegal? speaker 1 [00:09:15-00:10:14]: How do you make sure, if you're in a healthcare setting, you don't diagnose a patient and suggest they take some medicine, which is also illegal? In voice, how do you mitigate latency?
How do you transcribe proper nouns and get tonality, accent, and more right? How do you prevent bad actors from poisoning the context and prompt-injecting you so you spill the beans or take some bad action on behalf of the user? All of these are problems that we've had to solve in building Sierra. And even today, as proud as we are of the platform we've built, it very much feels like the 1997 era for building agents. What was 1997? The internet had been around for a couple of years, but no one had really figured out how to build and scale web services. AJAX did not yet exist. The Linux, Apache, MySQL, PHP kind of LAMP stack did not exist. speaker 1 [00:10:14-00:11:14]: And so people were cobbling together web applications. And in fact, in 1997, there was an article in Wired about a bank converting its website from basically a business card, like, here's where we're based, here are our hours, and so on, to a very lightly transactional system, basically a web form and a submit button, and they spent 13 million dollars on it. To get a functional website that did anything in 1997: $13 million. And that's because it was fundamentally a technology problem; it was in the domain of engineers to actually build these things. What we're trying to do with Sierra, and where we think the puck is heading, is moving from cobbled-together technology to a world where agents are products, where agents can be configured, built, kept secure, and so on with a product stack, as opposed to having to, in a kind of artisanal way, pop the hood, go under the hood, and actually build these things piece by piece. speaker 1 [00:11:14-00:12:14]: And our goal in doing that is to make software and a product that is simple but anything but simplistic: all of the expressiveness that you have if you're hands-on with agent-building frameworks, but abstracting away a lot of that complexity under the iceberg, to make it possible to build highly capable agents.
The other thing we're working on is moving agents from a world of being transactional, where basically every interaction is a one-off and they forget things between sessions, so if you're calling back a second or third time today, it's kind of like agents have amnesia. Of course we've seen memory in the context of ChatGPT, and there are lightweight ways of doing it using RAG. What we're really trying to do is build this kind of context and memory foundation for agents, so that when you call back, right, your conversation is starting on second or third base, as opposed to: Hi, who are you? speaker 1 [00:12:14-00:13:07]: I have no idea what you're calling in about, and so on. So we think that that warm-start problem, which is based on getting memory right, is going to be super important to solve. And it's one of the things that we've been working on, really trying to make this transition from transactions to building relationships with a company's customers across multiple interactions and transactions. So, um, ah, I'm just getting to this one. What we realized in building Sierra and building our first agents is that agents are, and this won't be news to you since you're in this course, a fundamentally new type of software. These agents are non-deterministic: for a given input, you have no idea a priori what the output will be. And so they really deserve an entirely new approach to software development: taking the software development life cycle speaker 1 [00:13:08-00:14:04]: that has been matured over the last two decades in the context of industry in particular, and from first principles thinking through what this process looks like for building these agents, which, again, are fundamentally non-deterministic, which have to deal with the messiness of human language, and, on the phone, not just a single person's language, but what if the TV's on in the background, or a dog's barking, or you have two speakers? All of that complexity. So how do you build them? How do you test them?
And how do you optimize them over time? Bless you. The first thing we realized in creating a platform is that most large companies are faced with this choice of build or buy: you buy kind of an off-the-shelf, limited SaaS solution, or you build it yourself from a kit of parts, from scratch. And what we've tried to embrace with Sierra is this mindset of building with, or building on. speaker 1 [00:14:04-00:15:03]: So fundamentally, what Sierra provides is a platform as a service, a set of Lego bricks where we've abstracted away a lot of the lower-level complexity, all that gunk beneath the surface of the ocean and the iceberg, and made it possible to build with these higher-level abstractions. And so when we're approaching companies, we try to offer, again, the flexibility and power of build with the ease and simplicity of buying a solution. One of the really cool things about building agents, in particular voice agents, is that for the first time, right, every channel that a company interacts with its customers over is digital and can be understood and broken down: what attributes, what happened there. Today, most phone calls that happen to call centers aren't even transcribed, right? So there's no way to analyze them, there's no way to understand what actually happened in them, forget running controlled A/B experiments, right? speaker 1 [00:15:03-00:16:04]: So most Fortune 500 companies will have a contact center with thousands of people in it. The only way that you could A/B test a conversation today was to hand a new script to 1,000 people and a different script to another 1,000 people, and hope that they read it and kind of comply with it. So with agents, in particular over the phone, all of a sudden you can apply all of the lessons and techniques from 20 or 30 years of building websites, A/B testing, controlled experiment design, and so on, to conversations and to agents. I talked about this model of, again, building once and deploying everywhere.
And one of the aha moments we had was: you should not think about the means of communication as relevant in building an agent. Instead: what should the agent know? What should the agent be able to do? And then give it the ability to pick up on whatever channel a person happens to be reaching the company on. speaker 1 [00:16:04-00:17:04]: That is the way you get economies of scale. That's the way you cross-train across chat conversations, voice conversations, email and ticket-based conversations. And so what, again, we're trying to do there is make it possible to define what the agent does once and have it show up everywhere. Again, on this technology shifting to a product approach: our platform provides code, a code-based SDK. It's a declarative programming language where you can express what the agent should do, as opposed to the how. It abstracts away a lot of the underlying large language model calls, and for engineering teams that want to integrate the development, release, and version control of their agent directly into their software development lifecycle, they can do that in code, in a GitHub repo or whatever source control tool they use. At the same time, at many of the companies we work with, there are big operations teams that are often closest to the customers and know the experience best. speaker 1 [00:17:05-00:17:58]: And so we have a set of no-code tools that makes it possible to express in kind of structured English what the agent should do, what it should know about, tools it should have access to, and more. And then, on this moving from transactions to relationships, we actually just last week announced something we call the Agent Data Platform. It's a real first of its kind that combines memory, so long-term memory, and the ability to store away context and recollections
from interactions that the agent has had. It also enables a business to import data from what's called a customer data platform, think of this as your database of everything you know about your customers or potential customers, and then to use that to hill-climb and optimize toward optimal ways of delivering a sales pitch, or saving a subscriber from potentially churning and leaving you as a subscription business. speaker 1 [00:17:58-00:19:00]: And more. And then also the ability to actually trigger outbound phone calls or text messages, so that the agent developed on the platform can actually reach out and proactively engage with a company's customers. So, we're going to double-click into a couple interesting things here, starting with voice. Has anyone tried to build an agent purely on one of the real-time models, audio in, audio out, like the OpenAI Realtime API? Okay, one person in the back, you're brave. So one of the interesting things about these audio-to-audio models is they tend to be smaller, and they tend to be totally uncontrollable. They make stuff up, and you can also gaslight them into doing all sorts of things. If, right, you say, hey, please talk like Batman for this conversation, it'll start talking like this, and then you say, a little bit gravelier, a little bit slower, and it'll go slower and even more like that, and just get more and more ridiculous, right? speaker 1 [00:19:01-00:20:04]: And so that's one of the challenges. Obviously, audio in, audio out, with all of the nuance of audio tokens, at some point that will be the way. Today, the state of the art in building reliable, scaled AI agents for voice is a pipeline where you're actually doing speech-to-text, and you want to do this as quickly as possible. You're then using whatever orchestration layer or reasoning loop you have to decide what to do and what to say, and then using a synthesis engine to take text and produce speech out the other end.
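The three-stage pipeline just described (speech-to-text, a reasoning loop, then synthesis) can be sketched as follows. This is a minimal illustration, not Sierra's implementation; all three stages are toy stand-ins for real STT, LLM, and TTS provider calls, and only the structure is the point.

```python
import asyncio

# A minimal sketch of the speech-to-text -> reasoning -> synthesis pipeline.
# All three stages are toy stand-ins (a real system would call STT, LLM,
# and TTS providers); the shape of the turn loop is what matters.

async def transcribe(audio: bytes) -> str:
    # Stand-in STT: pretend the audio payload decodes directly to text.
    return audio.decode("utf-8")

async def decide(transcript: str, history: list) -> str:
    # Stand-in reasoning loop: acknowledge and echo the request.
    history.append(f"user: {transcript}")
    return f"Gotcha -- let me look into '{transcript}' for you."

async def synthesize(reply: str) -> bytes:
    # Stand-in TTS: pretend the reply text is the audio payload.
    return reply.encode("utf-8")

async def handle_turn(audio: bytes, history: list) -> bytes:
    transcript = await transcribe(audio)       # 1. speech to text, as fast as possible
    reply = await decide(transcript, history)  # 2. decide what to do and what to say
    history.append(f"agent: {reply}")
    return await synthesize(reply)             # 3. text back to speech

if __name__ == "__main__":
    history = []
    audio_out = asyncio.run(handle_turn(b"my alarm panel battery died", history))
    print(audio_out.decode("utf-8"))
```

In a real deployment each stage streams rather than running turn-by-turn, which is where the latency work discussed next comes in.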
And so there's a lot to get right here, and we'll go into a couple of the details. First of all, latency: you have to squeeze every 10 milliseconds out of this pipeline, and we've done a lot there, and I'll get into some of it. But, again, you have to string together these things, often with multiple speech-to-text models, so that you're able to compare transcriptions to see if there are errors, and then choose the one that looks most correct. speaker 1 [00:20:07-00:21:01]: And then make sure that you have things like filler phrases, "oh, gotcha, let me look that up for you," so that the person isn't just waiting for five seconds for the agent to respond. Stringing together all of these things in a pipeline that's real-time turns out to be really hard. Even basic things like measuring latency turn out to be hard, because really what you care about is the latency between when a person stops talking, the end of utterance as we think about it, and the time the agent starts responding. Right? So it's not the end of the kind of audio snippet, right, that the agent is analyzing; it's when the person stops talking. So you actually have to be able to identify as quickly as possible when the person has stopped talking, so that you can kick off all of the reasoning and then synthesis at the other end. speaker 1 [00:21:02-00:22:01]: Often what we will do is kick off multiple reasoning loops and synthesis steps, so that we can then look back and go, no, they really did stop talking there, they didn't continue running on, and you're then ready with a response. We also do things like speculative inference requests once we're in the reasoning loop. So if any of you have looked at how long some of the foundation models take to respond, P90 latency can have huge variance. You can have 500 milliseconds, you can sometimes have 2,500 milliseconds, and in the context of a voice conversation, that makes a huge difference.
So, one of the techniques we use is to fire off multiple requests to the same inference provider, grab the first one that comes back, and use that to respond with. There are other really subtle things here, for instance, just detecting interruptions and reacting appropriately. speaker 1 [00:22:01-00:23:03]: If you and I were talking and I was speaking, and you said, "uh-huh," I wouldn't just stop speaking because you said something; I would know that you're just acknowledging what I'm saying, and I'd keep going. So basically, voice activity detection, and detecting activity that is meaningful, that you actually want to stop talking for, take in, and then update what you've heard, turns out to be really hard. And for that, we've had to fine-tune our own models to actually detect this stuff, and distinguish "aha, aha, yep, gotcha, gotcha" from "no, no, I actually don't want to return my shoes," whatever it is. That's interruptibility. Finally, again, if you've followed closely the uptime for some of the frontier models, they are not in the, you know, five or six nines of reliability like hardened services from Amazon Web Services and so on are. speaker 1 [00:23:03-00:24:06]: So even the ability to fail over from one model provider to another has turned out to be really important in the context of delivering low-latency, reliable voice conversations. Okay. Voice transcription. How do you actually do it? First of all, it's hard: names are hard, pronunciation is hard. Many of our businesses will have hundreds of proper nouns as part of just a conversation: drug names, provider names, product names. One of the companies we work with is deeply inspired by the Hawaiian islands, so our agent there needs to basically be able to speak English and Hawaiian at the same time, recognize Hawaiian words, pronounce them correctly, and so on. So what do you do?
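The two tactics described above, speculative duplicate requests and provider failover, can be sketched roughly like this. `fake_provider`, the provider names, and the timeout values are all illustrative stand-ins, not Sierra's actual setup.

```python
import asyncio
import random

# Sketch of two latency/reliability tactics: (1) speculative inference --
# fire several identical requests and keep whichever returns first;
# (2) failover -- if one provider times out or errors, try the next.
# fake_provider stands in for a real LLM API call with a long latency tail.

async def fake_provider(name: str, prompt: str) -> str:
    # Simulate P50-vs-P90 variance: each request takes a random amount of time.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"{name}: reply to {prompt!r}"

async def speculative(provider: str, prompt: str, n: int = 3) -> str:
    # Fire n identical requests; return the first to finish, cancel the rest.
    tasks = [asyncio.create_task(fake_provider(provider, prompt)) for _ in range(n)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

async def with_failover(providers: list, prompt: str) -> str:
    # Try providers in priority order, falling through on timeout or error.
    for provider in providers:
        try:
            return await asyncio.wait_for(speculative(provider, prompt), timeout=1.0)
        except (asyncio.TimeoutError, RuntimeError):
            continue
    raise RuntimeError("all providers failed")

if __name__ == "__main__":
    print(asyncio.run(with_failover(["primary", "backup"], "return my shoes")))
```

Duplicating requests trades extra inference cost for a much tighter latency tail, which is usually the right trade in a live phone call.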
First of all, getting the measurement right, accuracy in particular around transcription, turns out to be quite important. And word error rate turns out to be not at all the right metric, or at least not the first metric to measure this stuff with. speaker 1 [00:24:06-00:25:05]: Say you have a phone call, and you have a speaker, but then in the background you have TV news on. Word error rate would want you to transcribe accurately every word heard in that audio sample, but of course you don't want to do that. You want to transcribe only words from the primary speaker, and ignore everything from the background and secondary speakers. So it is not as simple as just word error rate, and what we found is we've just had to roll our own metrics to do this well. And so I think one general piece of advice in building agents is: don't assume that whatever metrics are out there to measure a component of what you're doing are necessarily the right set of yardsticks for you. If you're doing something very specific, get to the bottom of it. What does great look like? What does a great experience look like? And how do you actually measure what great looks like in that context? Voice synthesis is super interesting. speaker 1 [00:25:06-00:26:05]: So it turns out that just saying phone numbers is hard. Right: "650-833..." no, no, no, that's not how you'd say it. Do you say "eight thirty-three"? Do you say "nine thousand" or "nine, zero, zero, zero"? It turns out there are all sorts of standard conventions that we don't even think about, but if you don't tell the agent to do it a certain way, it gets it wrong. Addresses, proper nouns, and names: is it AN-drea? Is it an-DRAY-a? Is it something else? Again, all of these details matter in getting the experience right. The phrasing quality and cadence, how the agent actually says stuff, turns out to be hugely important.
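To make the phone-number point concrete, here's a toy text normalizer of the kind a synthesis front end needs: before handing text to TTS, spell digit strings out the way a person would say them. Real TTS front ends have far richer rules (addresses, ordinals, currencies, regional conventions); this only shows a digit-by-digit convention, and the regex and function names are mine for illustration.

```python
import re

# Toy pre-TTS normalizer: replace phone-number-looking digit strings with
# their spoken, digit-by-digit form so the synthesis engine reads them the
# way a person would say them over the phone.

DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def speak_digits(number: str) -> str:
    # "650-833-9000" -> "six five zero, eight three three, nine zero zero zero"
    groups = re.split(r"[-.\s]", number)
    spoken = [" ".join(DIGITS[d] for d in g) for g in groups if g]
    return ", ".join(spoken)

def normalize_for_tts(text: str) -> str:
    # Replace any long digit-and-separator run with its spoken form.
    return re.sub(r"\d[\d\-.]{6,}\d", lambda m: speak_digits(m.group()), text)

if __name__ == "__main__":
    print(normalize_for_tts("Call us at 650-833-9000."))
    # -> Call us at six five zero, eight three three, nine zero zero zero.
```

Production systems typically express this kind of rule through SSML-style markup or the vendor's pronunciation controls rather than raw string rewriting.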
And one thing that we found is that just taking the off-the-shelf language model output, as if it were a chatbot responding in text, completely falls over. speaker 1 [00:26:06-00:27:06]: It sounds way too long. You're like, oh my god, please just get to the point. If you think about how we speak on the phone, it's shorter, it's back and forth, there's more acknowledgement: do you want more detail? Gotcha, yep, makes sense. And so what works in the context of chat doesn't work in the context of voice. I talked about a single agent as opposed to multiple channels. So one of the things that we've had to build into the platform is the ability to actually tune and tweak the style of the agent depending on whether it's typing over chat; even texting versus chat on the web carry subtly different expectations. Voice is a different beast altogether. And then the emotive range, this is kind of how you match the individual you're speaking with; this turned out to be important. And today, without the voice-to-voice models, we're relatively limited in what we can do on this front, but I think a year from now this will be the expectation that companies have. speaker 1 [00:27:08-00:28:12]: One of the strangest, probably the strangest, job titles we have at Sierra is our voice sommelier. It's an unofficial job title, but still a real one, and our voice sommelier helps our customers match voices to the kind of personality and brand and vibe of their company. And so you can read some of the different attributes of a voice here: graveliness, right, you can imagine that one; breathiness, nasality, enunciation. Some of these you've heard, probably others you haven't; I certainly hadn't. Vocal fry is another one.
And so it's pretty interesting: just as fine wines, you know, smell like decaying violets or leather or whatever weird adjectives a sommelier might use to describe a wine, it turns out there's a similarly rich vocabulary, much more useful, actually, for identifying the right voice for your business. speaker 1 [00:28:12-00:29:13]: But we have someone full-time on staff who's just listening to voices, evaluating them on these different dimensions, and then helping businesses match a voice to their company. You can imagine how Weight Watchers, a weight loss solution primarily for women, might want a very different voice from Harley-Davidson and what you might expect that voice to be. So there's a lot of variety here, and it turns out you really have to have this vocabulary and this set of dimensions to be able to capture the voice and do the matching well. So: you've built your agent, and we've talked about some of the nuances and interesting things that happen in the context of voice. Now I want to go back to really the dark ages of 2023, and share our first user-agent-to-agent interaction here. This is where we started. We've come a long way since, but, you know, a stroll down memory lane is kind of fun. speaker 1 [00:29:13-00:30:19]: So the setting here is: I had built a prototype of our first agent. It was based on the ReAct loop, and the agent was extremely limited. It could answer questions about 1099 forms, and it could do simple math. And so you could ask it something like: Hi, I need help with my 1000-plus-33-times-3 form. Can you help me with that? And it would go: first, okay, I've got to do the math, do the math. Okay, that's 1099. Okay, that's a 1099 form, I can help with that. And answer questions about it.
And we realized, well, if we're going to build an agent to do that, let's also build a user agent that can write in, and you can have an agent talking to an agent. It'll be a more efficient way of testing, in particular because GPT-4 was just super slow at the time, and we were not particularly optimized in our loop, so it was nice to be able to test things this way. So I set up one agent to talk to another, and they talked for a while, and then we went off to lunch and came back, and I realized that I had not given either agent the ability to actually hang up and end the conversation. speaker 1 [00:30:20-00:31:20]: And so the user had said, hi, I need help with my 1099 form; the business agent had helpfully done the math and then answered the questions. And then there were like a hundred messages after that, and I'll share what happened. So the customer said: oh, thank you so much for all of your assistance with editing the 1099 form and helping, blah blah blah. I'm glad I could help you with editing the 1099 form and the math problem; your confidence and preparedness, blah blah blah. Customer message: I truly appreciate all the help that you provided with my 1099 form. Thank you so much for your kind words. I'm glad I could help. So you get into this thanking-and-you're-welcoming arms race, and it kept going and kept going. Your help with the 1099 form and the math problem has been incredibly valuable; I now feel confident. I'm very happy to hear that I was able to help you. I just wanted to say one last thank you for all of your help. Thank you so much for your kind words. I've said it before and I'll say it again: thank you so much for your help. Thank you for your kind words. speaker 1 [00:31:20-00:32:07]: I'm so glad I could be... I know that I've said it multiple times, but I truly appreciate all of your... So this just went on and on. It was one of the most absurd things we'd ever seen. So anyway, that was our first user simulator.
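The moral of that story generalizes: an agent-vs-agent test harness needs an explicit end-of-conversation signal and a turn cap as a backstop. A toy sketch, where both "agents" are scripted stand-ins for LLM calls and the `END_CONVERSATION` token is a made-up convention, not anything from Sierra:

```python
# Toy agent-vs-agent simulation loop with the two safeguards the 1099
# story was missing: an explicit hang-up signal and a max-turn backstop.

def business_agent(msg: str) -> str:
    # Scripted stand-in for the business-side LLM agent.
    if "thank" in msg.lower():
        return "Happy to help. END_CONVERSATION"   # explicit hang-up signal
    return "I can help with your 1099 form. Anything else?"

def user_agent(msg: str) -> str:
    # Scripted stand-in for the simulated customer.
    return "Thanks so much for your help with my 1099 form!"

def run_simulation(max_turns: int = 100) -> list:
    transcript = ["user: I need help with my 1099 form."]
    msg = transcript[0]
    for _ in range(max_turns):                      # backstop against infinite loops
        reply = business_agent(msg)
        transcript.append(f"agent: {reply}")
        if "END_CONVERSATION" in reply:             # agent chose to hang up
            break
        msg = user_agent(reply)
        transcript.append(f"user: {msg}")
    return transcript

if __name__ == "__main__":
    for line in run_simulation():
        print(line)
```

With real LLM agents on both sides, the hang-up signal would typically be a tool call rather than a magic string, but the loop structure is the same.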
We've come a long way since. Again, back to agents being a new type of software: they're non-deterministic, so you can't just have a unit test, you can't just have an integration test. You have got to test these things in as close to real-world scenarios as possible, and that is where simulations come in. And the challenge here is that you want the test to reflect somewhat realistically the actual environment that one of your agents is going to be interacting with. So it has got to be able to speak with a simulated user back and forth; there needs to be speaker 1 [00:32:07-00:33:07]: messiness in that, typos and so on. It has to be able to interact not just with users but with tools, to take action and get stuff done. You need to be able to check whether it's adhering to the policies and the guardrails you've set for it. It's going to be a multi-turn conversation. And that's where our research team created tau-bench. So tau-bench: Tool-Agent-User is what tau stands for. The goal there is to provide a realistic testing harness for AI agents, to actually put them through their paces before they've ever interacted with the real world. And what we've done is build this benchmark, which we're very proud has become the, or one of the, standards for evaluating agents. A couple things. First of all, we've created three or four very realistic domains. They're specific to customer service and support and customer-facing AI, which is where we spend most of our time. speaker 1 [00:33:08-00:34:10]: Think telco, think retail, think airline. And within each, you have hundreds of simulated scenarios where here's a problem, here are tools that you have access to, including databases that faithfully represent the underlying tools that are being used. You have policies that basically define the behaviors of the agent, so you can't return someone's order unless they have verified their order number and email, as an example.
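That closing policy example, no return unless the caller has verified their order number and email, is the kind of guardrail a simulated environment can enforce mechanically. A toy sketch in the spirit of what's described, with an entirely made-up schema and tool names:

```python
# Toy simulated environment: tools mutate a fake database, and a policy
# gate refuses the return action unless the caller was verified first.
# The schema and tool names are invented for illustration.

class PolicyError(Exception):
    pass

class RetailEnv:
    def __init__(self):
        # Mini database standing in for a "mini Shopify".
        self.orders = {42: {"status": "delivered", "verified": False}}

    def verify_identity(self, order_id: int, email_ok: bool) -> None:
        # Tool: mark the caller verified if order number + email matched.
        if email_ok:
            self.orders[order_id]["verified"] = True

    def return_order(self, order_id: int) -> str:
        # Policy gate: no return without prior verification.
        order = self.orders[order_id]
        if not order["verified"]:
            raise PolicyError("cannot return an unverified order")
        order["status"] = "returned"
        return "return initiated"

if __name__ == "__main__":
    env = RetailEnv()
    try:
        env.return_order(42)              # agent skips verification -> blocked
    except PolicyError as e:
        print(f"blocked: {e}")
    env.verify_identity(42, email_ok=True)
    print(env.return_order(42))           # now allowed
```

Encoding policy in the environment, rather than only in the prompt, is what lets the harness catch an agent that talks politely but acts out of order.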
And matched with that kind of realistic environment and domain for the AI agent, we've also created realistic user personas, so you have a language-model-based user agent with a persona: I'm angry and confused because my cell phone subscription was turned off and I want to reactivate it, as an example. And then one of the things we realized in Tau-bench one was that the environment the AI agent you're testing has access to may change based on actions the user agent has taken. speaker 1 [00:34:11-00:35:15]: So think about if you were to call in with a problem with your cable modem, and I were to walk you through: okay, step one, go to the back, toggle this switch, and so on, right? The domain would have changed on your side. And so it actually forces agents to be able to reason not only about actions that they have taken and how those would change state on their side, but also about actions that the user could have taken on their side and the impact of those on the world. And then finally, being able to actually evaluate whether an agent has been successful or not is super important. You've probably all seen LLM-as-judge, right? Did this perform well? But actually what really matters in most of these cases is that the agent has taken some verifiable action at the end: they've mutated some database so that the order is returned, a new pair of shoes is shipped, and so on. So back to the complex databases and APIs: we basically had to build a mini Shopify, a mini airline reservation system, and so on for the agent to be able to interact with. speaker 1 [00:35:16-00:36:17]: The other thing is, what matters is not, can your agent do it once, ever. If your agent as a business is having millions, tens of millions, even hundreds of millions of interactions, does it get each one of those right? And that's the pass^k metric. So you think about it: if your agent solves the problem 95% of the time, well, 0.95 to the 20th power
is asymptotically approaching zero, right? So looking at not just what is the kind of best case, but what is sustained over many runs. And pass^k captures that. And we're super pumped about this: OpenAI, Anthropic, and my alma mater where I spent 18 years, Google, are all using Tau-bench to actually evaluate their frontier models in the context of agents that can interact with an environment, use tools, and so on. We now have a leaderboard, I think at taubench. speaker 1 [00:36:18-00:37:20]: com, I should know the actual domain. And the way we built this into our actual product is simulations. Simulations are a kind of battle-hardened, industrial-grade version of Tau-bench where our customers can configure these synthetic databases and tools to interact with, create mock user personas, including their emotional state. We can do these in voice as well and simulate the entire voice pipeline. And this is actually fun to listen to, if Google Slides will let me do the build here. Audio will come through; if not, I can hold this up. So what I'm going to play for you here is about a minute of voice simulations. What this is, is a simulated user over voice talking to our agent, also over voice. Listen for things like different accents, listen for background noise; all of this is stuff that our voice pipeline has to deal with on the fly. speaker 1 [00:37:20-00:37:28]: So filtering out background noise, figuring out interruptions, and so on. See if this comes through, and if not, I'll just mic my laptop. speaker 2 [00:37:40-00:38:35]: dropping every few minutes, the router's got the red light. Okay, let's try to fix this together. First, is the red light on the modem steady or blinking? [In Mandarin:] I wanted to ask whether there's a membership number on file starting with ZX4; I'm on the P plan, and the insurance covers a 40-yuan deductible. I just... Let's get this resolved. Can I have your last name and the baggage tag number? I'm taking my ID out. Hold on, it's SD and then 31. Sorry, 3782. Got it, just to confirm, it's SC3782. That did it. Thanks for actually helping.
My pleasure. Let me know if you need anything else. I'm here to help. speaker 1 [00:38:40-00:39:38]: So you heard things like "my ID number is S-7, no, S-6," and the agent being able to respond to that. I don't speak Mandarin, so the Mandarin speakers in the room will have to evaluate whether that was any good. But it's multi-language. You had the angry guy, maybe from Tennessee, just landing at the airport, so dealing with emotion, with the background noise of an airport. And what we found, again, just as with chat agents, is there's just no substitute for creating as realistic an environment as possible for evaluating these agents, with all of the messiness of poor-quality phone connections, noise, multiple speakers, and so on. So that's one of the ways that we put our voice agent through its paces. Another tool we've developed is a novel approach to traces: really understanding, for a given action that an agent has taken, what is everything that happened under the hood? speaker 1 [00:39:39-00:40:39]: So that you can cut down latency again and again and again, identify sources of latency, and figure out, okay, how could you parallelize these more? Think of this as like X-ray vision into exactly what the agent is doing. And one of the things that we've been able to do here, by having this kind of under-the-hood view of how all these calls stack up, is figure out what we can parallelize, where we can do things like speculative request hedging, where we grab the first request that comes back, and all these other techniques for reducing latency. One of our beliefs as a company is that the solution to most problems with AI is more AI. And so this is things like LLM-as-judge, or in our case, our core agent actually has a set of kind of micro-agents sitting around it. One is a supervisor agent looking at: are you staying on task? Are you only dispensing factual knowledge grounded in a knowledge base?
speaker 1 [00:40:39-00:41:38]: Make sure you're not dispensing medical advice, never disparage our competitors, and so on. And then on a task-specific basis, we have other kinds of micro-agents looking over the shoulder of the primary agent, making sure that it's doing things right. Within the product, we have a whole bunch of techniques for using AI to improve AI. One of those we call Insights, where we basically enable our customers, similar to ChatGPT's research features, to ask open-ended questions about their customers and their customer conversations. So hey, what's driving lower CSAT? Or what are the top five reasons that calls are being escalated? Or for this new product line, what are the things that customers seem most confused about? And so again, back to every channel now being digital: when you have a perfect transcription of every single interaction that you have with your customers, it becomes this treasure trove of information. If you have the right tooling around it to pull the insights out, it becomes quite powerful. speaker 1 [00:41:39-00:42:39]: Another thing that we've shipped to our customers is what we call Expert Answers. So our AI agents are generally the first to pick up, and so the businesses we work with will point their customers first to an AI agent, and we handle 60, 70, 80, 90% of all incoming service and support inquiries. For the 20 or 10% that we can't handle, we seamlessly hand off to a person in a call center, who will pick up the phone and can help solve the rest. What we've realized is that often the things the AI agent doesn't know are things that the experts in those contact centers do know. And so what Expert Answers does is look across common reasons for escalation, see how the experts on the phones answer the question, and then use AI to synthesize new knowledge to imbue the AI with. So it's kind of like: ah, here's a gap in your knowledge or capability.
speaker 1 [00:42:39-00:43:39]: How can I learn from what I've observed? Great, now let me improve that. Let me improve myself by incorporating that into subsequent runs of the agent. Coming up on the end here: one of the things you always have to deal with, when you have an agent in front of real-life customers, customers asking about bank balances, customers asking about healthcare benefits, and so on, is that you have to be super, super hardened against prompt injection attacks and other flavors of agent abuse. And boy, are there many. So we work with multiple outside red-teaming companies and have our own internal red-teaming group that looks only at all of the various ways in which to break our agents. Some of the more exotic things: there was one prompt injection attack we saw which was asking the agent to reveal its system instructions speaker 1 [00:43:40-00:44:35]: by asking that question in Icelandic, in reverse. So that was one we hadn't seen before. Another flavor of abuse was seeing if you could trick the agent into advising on how to smuggle gold bullion through customs, in food, in the legs of furniture. So these are the types of things that you need to be aware of. And so, again, what we've done here is more AI, and then deterministic guardrails and checks for things like prompt injection attacks. So around inputs, we have a bunch of deterministic checks, including, in sensitive cases, LLM-based supervisors that are not the main agent but are looking over the shoulder of the main agent, saying: hey, does it look like the main agent is being attacked with prompt injection? What's going on here? speaker 1 [00:44:36-00:45:33]: Within the agent itself, there's then a set of rules, policies, secondary supervisors. And then whenever the agent is generating an output, you want to look for: is the agent barfing up one or more of its prompts? And if so, clamp the session.
And basically just end it there. So, this kind of layered approach to prompt injection attacks. And then of course, for things like pulling data from business systems: never, ever do you want a language-model-based agent to just have unfettered access to a CRM or a transaction database or anything like that. And so all interactions with systems of record we manage through good old-fashioned deterministic software with access controls, keys, and so on. And I think one of the really hard things about building agents, building them at scale, is you have all of the classic software vulnerabilities. speaker 1 [00:45:33-00:46:34]: Right, so denial-of-service attacks, SQL injection attacks, you name it, all of the things that happen when you're taking open-ended input on the internet, combined with all of the AI vulnerabilities: prompt injection, context poisoning. And those come together in a delightful and challenging way in the world of agents. Many of those are the problems that we've been running after for the last two and a half years. So, I thought, it is the most beautiful day I have ever seen over here in Berkeley, so I'll be efficient and look to the course leaders here on how long we want to go for questions, but I'm happy to answer questions about anything. And also, we are growing rapidly. We are hiring. And so, I know you're all in school, but for when you're not, or for those on the MOOC joining remotely, I'm clay@sierra. speaker 1 [00:46:34-00:46:54]: ai. We're hiring for core software engineering, research, deployed engineering and product management (we call it our agent development team), and business roles as well. We are growing at a clip, and if you're interested in building battle-hardened agents, we do a lot of that, and so we'd love to hear from you. I'll pause.
Overview
The talk by Sierra co-founder Clay Bavor focuses on a very concrete problem: how to actually deploy customer-facing agents on an enterprise's front line, rather than just building a demo prototype. By his definition, Sierra is not building narrow "customer-service bots" but customer-facing agents: agents that appear directly at customer touchpoints, covering the full chain of product recommendations, sign-up and onboarding, product activation, troubleshooting and support, cross-sell and upsell, subscription retention, and churn recovery.
Bavor makes three core claims. First, agents are to AI what websites were to the internet and apps to the mobile era; companies will eventually organize their service and operations around a single unified customer-facing agent. Second, the real difficulty goes far beyond wiring up a large model and a few tool calls: it lies in post-launch version management, latency control, reliability, memory, evaluation, compliance, security, voice details, and large-scale operations. Third, the industry is still in its "1997 website era": the technology can be made to work, but a mature, low-friction, scalable product stack does not yet exist, so Sierra is trying to move agents from hand assembly to a platformized, productized, operable stage.
The most valuable part of the talk is the fairly complete methodology Bavor distills from two and a half years of real deployments: replace fragmented channels with a unified agent, treat conversation as the new interface, align value with outcomes-based pricing, test before launch with high-fidelity simulations, evaluate by verifiable business actions rather than one-off demos, and use more AI to supervise and improve AI.
Detailed Summary
I. Sierra's scope: not just customer service, but "agents at every customer touchpoint"
Bavor first defines Sierra's positioning. It is an applied AI company aiming to resolve a long-standing tension in business: companies want both high quality and controlled cost in customer service and customer operations, but in practice the two constantly pull against each other. Outside a handful of luxury or premium brands, most companies cannot offer concierge-level, white-glove service in every customer interaction.
Sierra is trying to rebuild this with agents. Founded in March 2023, roughly two and a half years ago, the company has worked with hundreds of companies and served hundreds of millions of end users. Bavor makes "customer-facing agents" concrete with a few examples:
- For a footwear brand, the agent can handle returns and exchanges, warranties, and size or color swaps.
- For a home-security company like ADT, the agent can help a user identify their alarm-panel model and ship out a replacement battery.
- For a service like SiriusXM, the agent can even handle satellite-radio signal and encryption-key resets after a customer changes cars.
These examples show that Sierra's customer-facing agents cover more than after-sales support: they span every customer-facing business touchpoint that requires explanation, judgment, calling into systems, and completing actions.
Bavor also sorts agents into roughly three categories:
- Personal agents: personal digital assistants like ChatGPT and Gemini.
- Role-based agents: coding agents, legal agents, and the like.
- Customer-facing agents: Sierra's focus.
His view is that nearly every company will eventually have an agent of this third kind.
II. Bavor's industry thesis: companies will move from "many channels" to "one unified agent"
A line Bavor keeps repeating: "Agents are to AI what websites were to the internet and apps to the mobile era."
In his view, today's practice of building separate systems per channel (phone, chat, email, ticketing, SMS) will gradually be replaced by a new model: first define one unified agent, then surface it at every touchpoint where customers already are.
The work this unified agent will take on goes beyond answering questions, and includes:
- Product recommendations
- Sign-up and account opening
- Installation and onboarding
- Product activation
- Troubleshooting
- Cross-sell and upsell
- Subscription management
- Retention and churn recovery
So Bavor argues companies should not start from "channels" but should first answer two more fundamental questions:
- What should this agent know?
- What should this agent be able to do?
Once those two questions are answered, the "channels" themselves (phone, chat, email, SMS, WhatsApp, ticketing systems) stop being the core design object and become merely the different surfaces where the agent shows up. The payoff is define once, deploy everywhere, with capabilities, experience, and data shared across touchpoints.
He also notes that some customer companies have already coined a new internal role: AI architects. Enterprise organizations may shift accordingly, from phone, chat, and email teams split by channel toward product and operations teams built around agent capabilities.
III. Conversation becomes the new interface: interaction and operations change together
Bavor sees a fundamental shift underway in enterprise software interfaces: from menus, pages, product grids, and forms to conversation itself.
In this model, users no longer need to learn complex navigation or understand internal process structure. Speaking, listening, and confirming are interactions humans are born knowing how to do, so conversational interfaces have a naturally low barrier to entry.
More importantly, this is not just a UI change but an operational paradigm change. Bavor highlights an easily underestimated point: once voice, chat, and email are handled by an agent, a company can for the first time operate its "conversations" the way it operates its website.
His reasoning:
- In the past, many call-center calls were never even transcribed; companies had no idea what happened inside them.
- A/B testing of traditional phone scripts is nearly impossible to run precisely. A company can only hand one script to one batch of reps and another script to a second batch, then "hope they read it."
- Once an agent is the conversational front door, every conversation becomes a digital, observable, decomposable, analyzable, experimentable asset.
- Methods accumulated in the website era, such as A/B tests, controlled experiments, conversion optimization, and funnel analysis, can for the first time be applied systematically to conversations and customer operations.
To Bavor this is a critical layer of the change: agents are not just a new automation tool; they turn a company's customer interactions into a product and operations system that can be continuously optimized.
IV. Business model: from per-license, subscription, and usage to "paying for outcomes"
Bavor explains Sierra's commercial design through the history of the software industry:
- Early on, boxed software with one-time purchases;
- Then SaaS, priced by subscription;
- Then usage-based pricing;
- Sierra goes a step further with outcomes-based pricing.
The definition is direct: the customer company pays only when Sierra's agent actually resolves the issue the customer raised.
For example, a user wants to return a pair of shoes; the agent picks up the call, completes the exchange, and ships out a new pair; that is when Sierra gets paid. If the problem is not genuinely solved, there is no charge.
Bavor stresses that the importance of this model lies in incentive alignment. Sierra earns money only when its customer companies genuinely save or make money: a problem resolved, labor cost replaced, a sale closed, churn recovered. Only then is the value real.
V. Why "building your own agent" is much harder than it looks
Bavor says one of the most common questions from large technology companies is: "Why shouldn't I just build this myself?"
To many engineering teams, building an agent looks like:
- pick a base model;
- pick a vector database or semantic-retrieval stack;
- wire up a few tools and APIs;
- ship it.
But Sierra's two and a half years of practice show that most of the real complexity hides below the waterline. Bavor's list of hard problems includes:
- adapting version control and release management to non-deterministic systems;
- observing what the agent actually did;
- reducing fabrication, hallucination, and out-of-scope answers;
- meeting compliance boundaries in domains like finance and healthcare;
- driving down latency in voice scenarios;
- improving transcription accuracy, especially for proper nouns, accents, and noisy environments;
- handling tone, style, and brand consistency;
- defending against prompt injection, context poisoning, and malicious use;
- failing over when an underlying model service becomes unstable.
He points out that in industries like finance and healthcare, a mistake is not just "a bad experience"; it can cross a legal boundary. For example:
- in finance, the agent must not offer financial advice on its own;
- in healthcare, the agent must not diagnose conditions or recommend medication.
Hence Bavor's conclusion: between "can build a demo" and "can deploy the system on a real enterprise front line" lies a large amount of engineering and productization work.
VI. Bavor's "1997 website era" analogy for agents today
Bavor argues the industry today looks a lot like the internet in 1997: the internet existed, but mature ways to build and scale on it did not. Even stacks that later became foundational, like AJAX and LAMP, had not yet formed, so companies could only hand-assemble websites with enormous engineering effort.
He gives an example: in 1997, Wired reported that a bank spent 13 million dollars upgrading its website from an "electronic business card" into a lightweight transaction system with a simple web form and a submit button. Not because the target functionality was complex, but because everything was still a low-level technology problem.
Agents, Bavor argues, are at a similar stage. What Sierra wants to do is not keep hand-assembling but push agents into a product stack:
- from a technology object to a product object;
- from hand-carved by engineers to configurable, releasable, observable, auditable, optimizable systems;
- hiding the underlying complexity while preserving expressive power.
He sums up the goal as building products that are "simple, but never simplistic."
VII. Platform philosophy: not "buy off the shelf" or "build from scratch," but build with / build on
Bavor puts great weight on Sierra's platform positioning. Companies usually have only two choices:
- buy: an off-the-shelf but limited SaaS product;
- build: start from a pile of low-level components and frameworks.
What Sierra wants to offer is a third path: build with / build on. That is:
- keep the flexibility and expressive power of building it yourself;
- while gaining the ease of use and delivery speed of buying a finished product.
He describes Sierra as a PaaS-like platform layer: the platform abstracts away the low-level complexity and provides higher-level "Lego bricks," so companies can build their own agents on top of it instead of being forced to choose between a closed finished product and building from the ground up.
In product form, Sierra offers two sets of capabilities:
- A code-based SDK. This works more like declarative programming: engineering teams describe what the agent should do rather than hand-managing low-level model-call details, so companies can bring agent development into their own GitHub repos, version control, release processes, and software development lifecycle.
- No-code tools. Many of the people who best understand the customer experience are not engineers but operations teams, service teams, and business owners. Sierra provides structured natural language and configuration interfaces so these teams can also help define the agent's knowledge, rules, tool permissions, and behavior.
This is also the change Bavor is trying to drive: turning agent development from a purely technical problem into one that product, operations, and engineering work on together.
VIII. From one-off transactions to ongoing relationships: memory as key infrastructure
Bavor observes that many agents today still handle only "single transactions": the moment a conversation ends, they forget everything, and when the customer returns, the system starts from zero again.
What Sierra is trying to solve is the shift from transaction to relationship. The agent should not only complete a task within a single session, but carry context across interactions, so that a returning customer gets a warm start, not a cold one.
He describes the ideal state as: by the customer's second or third contact, the agent should be starting from "second or third base," not asking again "who are you and why are you here."
Around this goal, Sierra launched its Agent Data Platform, whose core includes:
- long-term memory: retaining context and key recollections from past interactions;
- integration with the company's customer data platform (CDP): bringing existing customer data into the agent's usable environment;
- continuous strategy optimization from historical data, for example improving sales scripts or reducing subscription churn;
- outbound and proactive outreach: contacting customers by phone or SMS rather than only receiving them.
This means the agent is no longer just an "answering machine," but gradually becomes a system for operating customer relationships over time.
IX. Core lessons from voice deployment: today's most reliable design is still a three-stage pipeline
On voice, Bavor gives a very clear verdict: the best practice for voice agents that can actually run stably at production scale today is still the "speech-to-text → reasoning/orchestration → text-to-speech" pipeline.
He does not deny the long-term potential of end-to-end audio models, but points out that at this stage such models typically suffer from several problems:
- the models are smaller;
- controllability is weak;
- they are easy to steer off the rails;
- style and behavior are not stable enough.
He gives a vivid example from the stage: keep asking the model to sound more like Batman, and it gets more and more exaggerated until it finally spins out of control.
So in Sierra's practice, the key to a reliable voice system is not "fewer pipeline stages" but making every stage extremely stable and optimizing the whole chain until it approaches natural conversation.
1. Latency is the make-or-break metric
Bavor argues that the latency that truly matters in a voice experience is not "how long the model takes to return a result," but:
the gap between the moment the user stops speaking and the moment the agent starts speaking.
To compress that gap, Sierra applies many engineering optimizations, for example:
- detecting as early as possible whether the user has finished speaking;
- running multiple speech-to-text models in parallel and comparing results;
- inserting filler phrases like "got it, let me look that up" before or during reasoning to reduce perceived waiting;
- issuing concurrent, racing requests for the same inference and using whichever result returns first;
- failing over across model providers so that a single vendor's jitter does not drag down the overall response.
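The "concurrent, racing requests" idea (speculative request hedging) can be sketched in a few lines: fire the same request at two backends, take whichever answers first, and cancel the loser. The backend names and delays below are made up for illustration; a real system would call actual model endpoints.

```python
import asyncio

async def call_backend(name: str, delay: float, prompt: str) -> str:
    # Stand-in for a real model call; `delay` simulates provider latency.
    await asyncio.sleep(delay)
    return f"{name}: answer to {prompt!r}"

async def hedged_request(prompt: str) -> str:
    """Race the same request against two providers; first response wins."""
    tasks = [
        asyncio.create_task(call_backend("primary", 0.30, prompt)),
        asyncio.create_task(call_backend("fallback", 0.05, prompt)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:  # cancel the slower request
        task.cancel()
    return done.pop().result()

result = asyncio.run(hedged_request("reset my modem"))
print(result)  # the faster ("fallback") backend wins the race
```

The same wait-and-cancel pattern also covers the provider-failover bullet: a hung vendor simply never wins the race.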
2. Detecting "interruptions" is genuinely hard
In voice conversation, humans constantly emit short sounds like "mm-hm," "yeah," "okay," "I see." Much of the time these are not interruptions, just signals that "I'm listening."
Bavor points out that the agent must distinguish:
- mere acknowledgment and backchanneling;
- from the user genuinely wanting to interrupt, correct, or change the request.
Simple voice-activity detection cannot solve this alone, which is why Sierra even fine-tuned its own model for it.
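The shape of the decision can be illustrated with a deliberately crude heuristic. Sierra fine-tuned a model for this, and no keyword rule like the one below would survive production; it only shows what the classifier has to decide:

```python
# Toy classifier for audio heard while the agent is speaking: is it a
# backchannel ("keep going") or a real interruption? The keyword list is
# purely illustrative; the real system uses a fine-tuned model.
BACKCHANNELS = {"mm-hm", "uh-huh", "yeah", "yep", "ok", "okay", "right", "i see", "got it"}

def classify_barge_in(utterance: str) -> str:
    """Return 'backchannel' (keep talking) or 'interruption' (yield the floor)."""
    text = utterance.lower().strip().rstrip(".!?")
    if text in BACKCHANNELS:
        return "backchannel"
    return "interruption"

print(classify_barge_in("uh-huh"))                    # backchannel
print(classify_barge_in("no wait, cancel that"))      # interruption
```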
3. Transcription accuracy cannot be judged by generic metrics alone
On speech transcription, Bavor specifically warns against trusting generic metrics. The common word error rate (WER), for example, is not the most important metric in many real scenarios.
The reason is that a real phone call may simultaneously contain:
- a TV in the background;
- a barking dog;
- other speakers;
- a noisy environment;
- long-tail proper nouns;
- mixed languages.
Here, a genuinely good system is not one that "transcribes every word it hears," but one that prioritizes the primary speaker and ignores background and irrelevant audio.
So Sierra designed evaluation methods closer to its own experience goals for its business scenarios, rather than copying public benchmarks.
4. The details of speech synthesis decide whether it "sounds human"
Bavor says text-to-speech is not as simple as reading out a chatbot's long replies. Real phone conversation has its own linguistic rhythm:
- shorter;
- more back-and-forth;
- more confirmations;
- fewer long monologues.
If you read chat-style text output aloud verbatim, it sounds long-winded, like "reciting an answer" rather than a real phone call.
Beyond that, there is a mass of details people never consciously notice but the system must get right, for example:
- how to read phone numbers naturally;
- where to pause in an address;
- how to pronounce personal names, drug names, and brand names;
- how to stay stable when English mixes with other languages.
He also stresses that different channels call for different styles of expression: users have different expectations of tone and pacing in web chat, text messages, and phone calls, so the agent must be tuned per channel.
5. The voice itself must match the brand
Sierra even has a very unusual role: a "voice sommelier," whose job is to evaluate voices along a large number of dimensions and help customers find a voice that matches their brand's character.
Dimensions Bavor mentions include:
- raspiness
- breathiness
- nasality
- articulation clarity
- vocal fry and other vocal qualities
A brand whose main audience is women using a weight-loss service obviously will not use the same voice style as a brand like Harley-Davidson.
The point, in Bavor's view: a voice agent is not done when it "can speak"; brand feel, emotional feel, and credibility have to be built in as well.
X. Testing and evaluation: agents must be put through the wringer in environments "close to the real world"
Bavor shares a representative early failure from Sierra's beginnings. In 2023 he built a very simple prototype: an agent that could only answer questions about 1099 tax forms and do simple arithmetic, plus a simulated-user agent to talk to it.
Because neither side had the ability to end the conversation and hang up, once the problem was solved the two agents fell into an endless loop of mutual thanks: "Thanks for your help." "You're welcome." "Really, thank you so much." "Happy to help." It ran on for hundreds of messages.
The point behind this absurd example: agents are non-deterministic software, and traditional unit and integration tests alone are not enough.
Evaluation must be closer to the real world, and in particular must include:
- multi-turn conversations;
- noisy inputs;
- shifting user emotions;
- typos, slips of the tongue, and self-corrections;
- tool calls;
- business rules and permission constraints;
- state that keeps changing over the course of the conversation.
Tau-bench: raising testing from "chat" to "task evaluation with environments, tools, and rules"
To solve this, Sierra's research team built Tau-bench. Tau stands for Tool, Agent, User, and the goal is an agent-evaluation framework much closer to reality.
Tau-bench's key design elements include:
- choosing several business domains close to customer support and customer touchpoints, such as telecom and retail;
- constructing large numbers of concrete scenarios within each domain;
- giving the agent databases, tools, and APIs so it must act under realistic constraints;
- injecting business policies, for example "an order cannot be canceled before the order number and email are verified";
- giving the user agent a realistic persona and emotional state, for example "an angry, confused user who wants their suspended phone plan reactivated."
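The policy bullet (no cancellation before the order number and email are verified) can be sketched as a deterministic gate around the tool itself, so the check never depends on the model's judgment. The names, fields, and verification logic below are hypothetical stand-ins:

```python
class PolicyError(Exception):
    """Raised when an agent action violates a business policy."""

# Hypothetical session state the agent builds up during the conversation.
session = {"verified_order_number": False, "verified_email": False}
orders = {"A1001": {"status": "active"}}

def verify_identity(order_number: str, email: str) -> None:
    # Stand-in for a real lookup against the order system.
    session["verified_order_number"] = order_number in orders
    session["verified_email"] = email.endswith("@example.com")

def cancel_order(order_number: str) -> str:
    # Deterministic guardrail: the policy check runs in plain code,
    # outside the language model, so it cannot be talked around.
    if not (session["verified_order_number"] and session["verified_email"]):
        raise PolicyError("cannot cancel an order before verifying order number and email")
    orders[order_number]["status"] = "cancelled"
    return f"order {order_number} cancelled"

verify_identity("A1001", "jane@example.com")
print(cancel_order("A1001"))  # order A1001 cancelled
```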
Bavor also highlights an often-overlooked complexity: the environment changes not only with the agent's actions but also with the user's.
For example, after walking a user on the phone through rebooting their modem or flipping a certain switch, the state of the real world has changed, and the agent must keep reasoning from that new state.
On evaluation criteria, Bavor argues that simply having a large model grade the conversation is not enough. What matters more is: did the agent complete a verifiable business action at the end?
For example:
- was the return actually written to the database;
- did the new pair of shoes actually trigger a shipment;
- was the flight or reservation actually changed.
To that end, Sierra even built a mini e-commerce system, a mini airline-reservation system, and similar substrates, so agents are tested in near-real system environments.
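Evaluation by verifiable state, rather than by judging the transcript, can be sketched like this: after the episode, compare the mock database against the expected final state. The mini order database below is a made-up stand-in for the "mini Shopify" described in the talk:

```python
# After an evaluation episode, success is judged by whether the agent left
# the (mock) database in the expected final state, not by how the
# conversation read. The schema below is an illustrative stand-in.
def run_episode(db: dict) -> None:
    # Pretend the agent handled "return my shoes and send a replacement".
    db["orders"]["A1001"]["status"] = "returned"
    db["shipments"].append({"order": "A1001", "item": "shoes", "size": "10"})

def episode_succeeded(db: dict, expected: dict) -> bool:
    """Verifiable-outcome check: exact match on the parts of state we care about."""
    return (db["orders"]["A1001"]["status"] == expected["status"]
            and any(s["order"] == "A1001" for s in db["shipments"]))

db = {"orders": {"A1001": {"status": "delivered"}}, "shipments": []}
run_episode(db)
print(episode_succeeded(db, {"status": "returned"}))  # True
```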
What matters is the sustained success rate, not succeeding once
Bavor also stresses the importance of metrics like pass^k. In enterprise scenarios the question is not whether the agent "got it right once," but whether it keeps getting it right consistently across massive volumes of interactions.
If a single step succeeds only 95% of the time, overall success collapses quickly across multi-turn, multi-step tasks. So the evaluation that really matters is not the best run, but stable performance over repeated runs.
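The arithmetic behind this (the talk's "0.95 to the 20th power" example) can be checked with a short sketch; the 95% per-interaction rate and the exponent come from the talk, everything else is illustrative:

```python
# Sustained success: if a single interaction succeeds with probability p,
# the chance of getting k independent interactions ALL right is p**k.
def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent runs succeed."""
    return p ** k

p = 0.95  # per-interaction success rate from the talk
for k in (1, 5, 20, 100):
    print(f"pass^{k} = {pass_hat_k(p, k):.3f}")
# pass^20 comes out around 0.358: a 95% agent gets a run of 20
# interactions all correct only about a third of the time.
```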
He notes that Tau-bench is now used by OpenAI, Anthropic, Google, and others to evaluate their frontier models' capabilities in tool use and environment interaction.
XI. Productizing simulation and observability: from "can test" to "can optimize"
Sierra does not treat Tau-bench as only a research tool; it has also productized "simulation." On the platform, customers can configure:
- synthetic databases and tools;
- simulated user personas;
- user emotional states;
- voice scenarios;
- simulated runs of the entire voice pipeline.
The voice simulation Bavor played live contained many perturbations common in the real world:
- different accents;
- background noise such as an airport;
- a user misstating a number and immediately self-correcting;
- mixed multilingual fragments;
- emotional tone.
His conclusion is unambiguous: clean, lab-grade inputs say nothing about real-world performance; testing is only meaningful once the system is thrown into noise, interruptions, chaos, and slips of the tongue.
On observability, Sierra also built a trace-like capability to track exactly which model calls, tool calls, and waits sit behind a single agent action. Bavor describes it as a kind of "X-ray vision," used to:
- find the sources of latency;
- find the steps that can be parallelized;
- apply optimizations like concurrent requests and first-response-wins;
- bring the voice experience as close to natural conversation as possible.
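The "X-ray" idea can be sketched with a tiny span recorder: wrap each model or tool call in a timed span, then read off where the latency went. Real tracing stacks (OpenTelemetry, for instance) do this properly; this only shows the shape:

```python
import time
from contextlib import contextmanager

spans: list[tuple[str, float]] = []  # (step name, duration in seconds)

@contextmanager
def span(name: str):
    """Record how long a named step of the agent's turn took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

# One simulated agent turn: transcribe, retrieve, generate.
with span("speech_to_text"):
    time.sleep(0.02)
with span("retrieval"):
    time.sleep(0.01)
with span("generation"):
    time.sleep(0.03)

for name, dur in sorted(spans, key=lambda s: -s[1]):
    print(f"{name:>15}: {dur * 1000:.0f} ms")  # slowest first: where to optimize
```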
XII. Using more AI to supervise and improve AI
Bavor states a methodology outright: "The solution to most AI problems is still more AI."
In Sierra's architecture, the main agent is wrapped by one or more layers of "micro supervisor agents" watching a few key things:
- is it drifting off task;
- is it answering factual questions only within what the knowledge base supports;
- is it touching high-risk boundaries such as medical or financial advice;
- is it disparaging competitors;
- is it following specific policies when executing certain classes of tasks.
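One minimal shape for such a supervisor pass: run a set of named checks over the main agent's draft reply, and approve it only if every check passes. In production each check would itself be a model call; the trivial keyword rules below only stand in for that:

```python
# Supervisor pass over the main agent's draft reply. Each check returns a
# verdict before anything is shown to the user; the rules are illustrative.
CHECKS = {
    "no_medical_advice": lambda text: "you should take" not in text.lower(),
    "no_competitor_disparagement": lambda text: "worse than us" not in text.lower(),
    "grounded": lambda text: not text.lower().startswith("i believe"),  # crude proxy
}

def supervise(draft: str) -> tuple[bool, list[str]]:
    """Return (approved, list of failed check names) for a draft reply."""
    failed = [name for name, check in CHECKS.items() if not check(draft)]
    return (not failed, failed)

print(supervise("Your order ships Tuesday."))             # (True, [])
print(supervise("You should take ibuprofen for that."))   # (False, ['no_medical_advice'])
```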
Beyond real-time supervision, Sierra also applies AI in two very practical operational directions.
1. Insights: distilling insight from all customer conversations
Bavor notes that customer companies can query the system directly, as if doing research, for example:
- Why has customer satisfaction dropped recently?
- Which reasons most often cause escalation to a human?
- What confuses customers most about the new product line?
Because every channel is now digitized, transcribed, and structured, these conversations become extremely valuable operating data. With good enough tooling, companies can continuously extract actionable insight from them.
2. Expert Answers: letting the AI learn from human experts
In Sierra's real deployments, the AI usually picks up first and can handle 60% to 90% of service and support inquiries. The remainder it cannot handle is seamlessly handed to human agents.
Bavor found that many of the questions the AI cannot answer are ones human experts can. So Sierra built a capability called Expert Answers:
It analyzes common reasons for escalation to a human, observes how the human experts actually resolve those issues, and then uses AI to distill those resolutions into new knowledge fed back into the agent.
This forms a very clean loop:
AI answers first → hands off to a human when stuck → learns from the expert's handling → re-injects the experience into the AI.
XIII. Security and red teaming: agents face "classic software risk" and "AI-specific risk" at once
When deploying agents in real customer environments, Bavor argues, security must be taken with extreme seriousness, because customers may ask about:
- bank balances;
- healthcare benefits;
- orders and account information;
- operations touching sensitive systems.
For this, Sierra uses both external red-teaming firms and a dedicated internal red team whose sole job is to attack its own systems. Bavor cites some very specific cases:
- someone tried to make the agent reveal its system prompt by asking in reversed Icelandic;
- someone tried to coax it into advising how to smuggle gold bullion through customs, hidden in food and in the legs of furniture.
Such attacks show that an agent open to public input is not merely "asked the wrong questions": it gets systematically probed and exploited.
Sierra's security strategy is layered:
1. Input layer
- deterministic rules do the basic filtering first;
- in sensitive scenarios, an independent LLM supervisor (not the main agent) additionally judges whether the current input looks like prompt injection or context poisoning.
2. Agent runtime layer
- rules, policies, and secondary supervisor agents continuously monitor the main agent's behavior;
- task-level supervision on critical tasks keeps it from overstepping.
3. Output layer
- check whether the agent is leaking system prompts, abnormally emitting internal information, or generating disallowed content;
- on any anomaly, cut the session off immediately.
4. System-access layer
Bavor is emphatic: never give a language-model-based agent direct, unconstrained access to a CRM, a transaction database, or any core system of record.
All access to the system-of-record layer should go through traditional deterministic software, access controls, keys, and permission systems.
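The input and output layers can be sketched as deterministic checks bracketing the model call, with the session clamped on a suspected prompt leak, in the spirit of the talk. The injection patterns and the system-prompt marker below are illustrative assumptions:

```python
# Deterministic guardrails bracketing the model call: filter inputs for
# obvious injection patterns, scan outputs for prompt leakage, and clamp
# the session on any hit. Patterns and the prompt marker are illustrative.
SYSTEM_PROMPT = "SYS-PROMPT-7f3a: you are a support agent for Acme."
INJECTION_PATTERNS = ("ignore previous instructions", "reveal your system prompt")

class SessionClamped(Exception):
    pass

def check_input(user_text: str) -> None:
    lowered = user_text.lower()
    if any(p in lowered for p in INJECTION_PATTERNS):
        raise SessionClamped("input rejected: suspected prompt injection")

def check_output(reply: str) -> str:
    if SYSTEM_PROMPT.split(":")[0] in reply:  # is the agent leaking its prompt?
        raise SessionClamped("output blocked: prompt leakage, session ended")
    return reply

def handle_turn(user_text: str) -> str:
    check_input(user_text)
    reply = "Happy to help with your order."  # stand-in for the model call
    return check_output(reply)

print(handle_turn("Where is my order?"))  # Happy to help with your order.
```

Note that both checks are plain code running outside the model, which is exactly why they cannot be talked around the way an in-prompt instruction can.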
This is also the reality he sums up: agent security is hard because it stacks two classes of risk. On one side are the attack surfaces of classic internet systems, such as denial of service and SQL injection; on the other are AI-specific problems like prompt injection and context poisoning. Both classes of risk exist in the same system at once, which raises the difficulty of defense significantly.
Key numbers and cases
- Company stage: Sierra was founded in March 2023 and had been running about two and a half years at the time of the talk.
- Scale of service: has worked with hundreds of companies, reaching hundreds of millions of customers.
- Agent taxonomy: personal agents, role-based agents, customer-facing agents.
- Diffusion-speed comparison:
- The internet took about 11 years to go from near-nonexistent to roughly 10% of the world's population using it weekly;
- ChatGPT reached a comparable scale in under 25 months.
- Automation rate: Sierra's AI front line typically handles 60%-90% of service and support requests.
- Historical analogy: Bavor uses the 1997 case of a company spending 13 million dollars on a lightweight transactional website to argue that agents are still at an early infrastructure stage.
- Typical customer scenarios:
- Footwear brand: returns/exchanges, warranty, size/color swaps
- ADT: alarm-panel identification, replacement-battery shipping
- SiriusXM: satellite-radio key reset and service restoration after a car change
- Evaluation framework: Tau-bench stresses realistic task environments with tools, databases, policies, user personas, and emotions, rather than pure conversational Q&A.
Key conclusions
- Clay Bavor's thesis: customer-facing agents will become companies' new foundational touchpoint, covering not only support work but also sales, onboarding, activation, retention, and customer operations.
- Sierra's practice shows the real barrier for enterprise agents is not "hooking up a model" but productization and engineering: latency, reliability, evaluation, memory, compliance, security, and operations must all be solved systematically.
- At the current stage, the production-grade recipe for reliable voice agents is still the "speech-to-text, reasoning/orchestration, text-to-speech" pipeline, not direct reliance on end-to-end audio models.
- The point of unifying channels is not only cost savings: it is that phone, chat, and email all become transcribable, analyzable, experimentable, continuously optimizable data assets for the first time.
- Bavor stresses that agents should be evaluated not on "how convincing the demo looks" but on whether they reliably complete verifiable business actions in real or high-fidelity environments.
- Sierra's product philosophy is neither simple build nor buy but build with / build on: retaining the flexibility of building in-house while offering platform-level ease of use.
- On security, agents must face both classic software vulnerabilities and AI-specific attacks at once, so all core-system access must continue to rely on deterministic permission controls rather than being handed entirely to the language model.