2025-05-30 | Y Combinator | State-Of-The-Art Prompting For AI Agents
Frontier Practices and Challenges in Prompt Engineering for AI Agents
Tags
Media details
- Upload date
- 2025-06-06 20:12
- Source
- https://www.youtube.com/watch?v=DL82mGde6wo
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: Metaprompting is turning out to be a very, very powerful tool that everyone's using now. It kind of actually feels like coding in, you know, 1995, like the tools are not all the way there. We're in this new frontier. But personally, it also kind of feels like learning how to manage a person, where it's like, how do I actually communicate the things that they need to know in order to make a good decision? Welcome back to another episode of The Light Cone. Today, we're pulling back the curtain on what is actually happening inside the best AI startups when it comes to prompt engineering. We surveyed more than a dozen companies and got their take right from the frontier of building this stuff: the practical tips. Jared, why don't we start with an example from one of your best AI startups?

speaker 2: I managed to get an example from a company called Parahelp. Parahelp does AI customer support. There are a bunch of companies doing this, but Parahelp is doing it really, really well. They're actually powering the customer support for Perplexity and Replit and Bolt and a bunch of other top AI companies now. So if you email a customer support ticket into Perplexity, what's actually responding is their AI agent. The cool thing is that the Parahelp guys very graciously agreed to show us the actual prompt that is powering this agent and to put it on screen on YouTube for the entire world to see. It's relatively hard to get these prompts for vertical AI agents because they're kind of the crown jewels of the IP of these companies. So we're very grateful to the Parahelp guys for agreeing to basically open source this prompt.

speaker 1: Diana, can you walk us through this very detailed prompt? It's super interesting, and it's very rare to get a chance to see this in action.

speaker 3: So the interesting thing about this prompt, first, is that it's really long and very detailed; in this document you can see it's like six pages long just scrolling through it. The big thing that a lot of the best prompts start with is this concept of setting up the role of the LLM: you're a manager of a customer service agent, and it breaks down into bullet points what it needs to do. Then the big thing is telling it the task, which is to approve or reject a tool call, because it's orchestrating agent calls from all these other agents. And then it gives it a bit of the high-level plan. It breaks it down step by step: you see steps one, two, three, four, five. Then it gives some of the important things to keep in mind, like that it should not go off and start calling different kinds of tools. It tells it how to structure the output, because a big part of working with agents is that you need them to integrate with other agents, so it's almost like gluing API calls together. So it's important to specify that it's going to give its output as accepting or rejecting, and in this exact format. That's the high-level section. One thing the best prompts do is break all of this down in this markdown style of formatting: you have the heading here, and then later on it goes into more detail on how to do the planning, and you see this is a sub-bullet part of it. As part of the plan, there are actually three big sections: how to plan, how to create each of the steps in the plan, and then a high-level example of the plan. One big thing about the best prompts is that they outline how to reason about the task. And then a big thing is giving it an example, and this is what it does. One thing that's interesting about this is that it looks more like programming than writing English, because it has this XML-tag kind of format to specify the plan. We found that it makes it a lot easier for LLMs to follow, because a lot of LLMs were post-trained with RLHF on XML-style input, and it turns out to produce better results. Yeah.
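To make the structure described above concrete, here is a minimal, hypothetical sketch of a manager-style agent prompt, written as a Python constant so it could be dropped into a pipeline. The section names, rules, and XML tags are illustrative assumptions, not Parahelp's actual prompt.

```python
# Illustrative skeleton of a "manager" prompt: role, task, step-by-step plan,
# constraints, a strict output format, and an XML-tagged example plan.
MANAGER_PROMPT = """\
# Role
You are a manager of a customer service agent. You review every action the
agent proposes and decide whether it may proceed.

# Task
Approve or reject the tool call proposed by the agent.

# Plan
1. Read the customer's ticket and the proposed tool call.
2. Check the call against the list of allowed tools.
3. Check the call against the policy rules below.
4. Decide whether to approve or reject.
5. Output your decision in exactly the format specified below.

# Important
- Never approve tools that are not on the allowed list.
- If you are not sure, reject and explain why.

# Output format
<decision>accept</decision> or <decision>reject</decision>, followed by
<reason>one short sentence</reason>.

# Example plan
<plan>
  <step id="1">Verify that the order ID exists</step>
  <step id="2">Check the refund policy for this order type</step>
  <step id="3">Approve the refund tool call only if both checks pass</step>
</plan>
"""

if __name__ == "__main__":
    print(MANAGER_PROMPT)
```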
speaker 1: One thing I'm surprised isn't in here, or maybe this is just the version that they released: what I almost expect is a section where it describes a particular scenario and actually gives example output for that scenario.

speaker 2: That's in the next stage of the pipeline.

speaker 1: Oh, really? Okay.

speaker 2: Yeah, because it's customer specific, right? Every customer has their own flavor of how to respond to these support tickets. And so their challenge, like a lot of these agent companies, is how do you build a general-purpose product when every customer has slightly different workflows and preferences? That's a really interesting thing that I see the vertical AI agent companies talking about a lot: how do you have enough flexibility to support special-purpose logic without turning into a consulting company where you're building a new prompt for every customer? I actually think this concept of forking and merging prompts across customers, and which part of the prompt is customer specific versus company wide, is a really interesting thing that the world is only just beginning to explore.

speaker 3: Yeah, that's a very good point, Jared. So there's this concept of defining the prompt across a system prompt, then a developer prompt, and then a user prompt. What this means is that the system prompt is basically defining the high-level API of how your company operates. In this case, the Parahelp example is very much a system prompt; there's nothing specific about the customer. Then, as they add specific instances of that API and call it, they stuff all of that context into the developer prompt, which is not shown here. That's where all the context of, say, working with Perplexity lives: there are certain ways of handling RAG questions that are very different from working with Bolt, right? And then I don't think Parahelp has a user prompt, because their product is not consumed directly by end users, but the user prompt matters more for something like Replit or v0, where users type "generate me a site that has these buttons"; all of that goes in the user prompt. So that's the architecture that's emerging.
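A minimal sketch of how the three layers Diana describes might be assembled into a single request, assuming an OpenAI-style chat message list. The role names follow the system / developer / user convention mentioned above; the message contents and the omitted client call are placeholders.

```python
# Three prompt layers as chat messages: company-wide, per-customer, end-user.
SYSTEM_PROMPT = "You are the manager of a customer support agent. ..."       # company-wide "API"
DEVELOPER_PROMPT = "For this customer, RAG questions are answered from ..."  # per-customer context
USER_PROMPT = "Generate me a site that has these buttons ..."                # end-user input

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "developer", "content": DEVELOPER_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# `messages` would then be passed to whichever chat completion client the
# product uses; the actual call is deliberately omitted here.
```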
speaker 4: And to your point about avoiding becoming a consulting company, I think there are so many startup opportunities in building the tooling around all of this stuff. For example, anyone who's done prompt engineering knows that worked examples are really important to improving the quality of the output. So if you take Parahelp as an example, they really want good worked examples that are specific to each company. And you can imagine that as they scale, you almost want that done automatically. In your dream world, what you want is an agent itself that can pluck out the best examples from the customer data set, and software that ingests them straight into wherever they belong in the pipeline, without you having to manually go and plug it all in yourself.

speaker 2: That's really great. So let's go into metaprompting, which is one of the things we wanted to talk about, because it's a consistent theme that keeps coming up when we talk to our AI startups.

speaker 1: Yeah, Tropir is one of the startups I'm working with in the current YC batch. They've really helped people like the YC company Ducky do really in-depth understanding and debugging of the prompts and the return values from multi-stage workflows. And one of the things they figured out is prompt folding: basically, one prompt can dynamically generate better versions of itself. A good example of that is a classifier prompt that generates a specialized prompt based on the previous query. So you can take the existing prompt you have and feed it more examples where maybe the prompt failed, where it didn't quite do what you wanted. And instead of you having to go and rewrite the prompt, you just put it into the raw LLM and say, help me make this prompt better. And because it knows itself so well, strangely, metaprompting is turning out to be a very, very powerful tool that everyone's using now.

speaker 3: And the next step, after you do prompt folding, if the task is very complex, is this concept of using examples, and this is what Jazzberry does, one of the companies I'm working with in this batch. They build automatic bug finding for code, which is a lot harder. The way they do it is they feed in a bunch of really hard examples that only expert programmers could handle. Say you want to find an N+1 query: it's actually hard today for even the best LLMs to find those. So they find the relevant parts of the code and add them into the prompt, and the metaprompt is like, hey, this is an example of an N+1 type of error, and then it works it out. I think this pattern, where sometimes it's too hard to even write prose around the task so you just give it an example, turns out to work really well, because it helps the LLM reason about complicated tasks and steers it better when you can't quite pin down exact parameters. It's almost like unit testing in programming; it's sort of the LLM version of test-driven development.
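Here is a rough sketch of the prompt folding / metaprompting loop just described: feed the current prompt plus examples where it failed back into a model and ask it to rewrite itself. `call_llm` is a stand-in for whatever model client you use, not a real API.

```python
from typing import Callable, List

def improve_prompt(
    call_llm: Callable[[str], str],   # placeholder: takes a prompt, returns model text
    current_prompt: str,
    failing_examples: List[str],
) -> str:
    """Ask the model to rewrite its own prompt given inputs where it failed."""
    metaprompt = (
        "You are an expert prompt engineer. Below is a prompt, plus inputs where "
        "it produced the wrong result. Rewrite the prompt so it handles these "
        "cases while keeping everything that already works.\n\n"
        f"<prompt>\n{current_prompt}\n</prompt>\n\n"
        "<failing_examples>\n" + "\n---\n".join(failing_examples) + "\n</failing_examples>"
    )
    return call_llm(metaprompt)
```

A typical loop is to collect new failure cases, run this, review the suggested prompt by hand, and then ship it.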
speaker 1: Yeah. Another thing that Tropir talks about is that the model wants to help you so much that if you just tell it, give me back output in this particular format, even if it doesn't quite have the information it needs, it will just tell you what it thinks you want to hear. It's literally a hallucination. So one thing they discovered is that you actually have to give the LLM a real escape hatch. You need to tell it: if you do not have enough information to say yes or no or make a determination, don't just make it up, stop and ask me. And that's a very different way to think about it.

speaker 4: That's actually something we learned in some of the internal work we've done with agents at YC, where Jared came up with a really inventive way to give the LLM an escape hatch. Do you want to talk about that?

speaker 2: Yeah. So the Tropir approach is one way to give the LLM an escape hatch. We came up with a different way, which is, in the response format, to give it the ability to have part of the response be essentially a complaint to you, the developer, that you have given it confusing or under-specified information and it doesn't know what to do. The nice thing about that is that when you run your LLM in production with real user data, you can go back and look at the outputs it has given you in that output parameter. We call it debug info internally. So we have this debug info parameter where it's basically reporting to us things that we need to fix, and it literally ends up being a to-do list for you, the agent developer. It's really kind of mind-blowing stuff.
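A sketch of the kind of escape-hatch response format Jared describes, assuming the agent is asked to answer in JSON with a debug_info field it can use to flag confusing or under-specified instructions. The field names are illustrative, not the actual format used internally at YC.

```python
import json

RESPONSE_FORMAT = """\
Always answer with a single JSON object:
{
  "decision": "accept" | "reject" | "needs_clarification",
  "reason": "<one short sentence>",
  "debug_info": "<anything ambiguous, missing, or contradictory in your instructions; empty if nothing>"
}
If you do not have enough information to decide, use "needs_clarification" instead of guessing.
"""

def collect_todos(raw_outputs):
    """Turn non-empty debug_info fields from production outputs into a to-do list."""
    todos = []
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output is worth logging separately
        if parsed.get("debug_info"):
            todos.append(parsed["debug_info"])
    return todos
```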
speaker 4: Yeah. I mean, even for hobbyists, people who are interested in playing around with this for a personal project, a very simple way to get started with metaprompting is to follow the same structure: give the prompt a role, and make the role something like, you're an expert prompt engineer who gives really detailed, great critiques and advice on how to improve prompts. Then give it the prompt you had in mind, and it will spit back a much more expanded, better prompt. You can just keep running that loop for a while, and it works surprisingly well.

speaker 3: There's a common pattern for companies when they need to get responses from LLMs in their product a lot quicker. They do the metaprompting with a bigger, beefier model, one of the hundreds-of-billions-of-parameters-plus models like Claude 3.7 or o3, and once they have a prompt that works very well, they use it with a distilled model. So they run it on, for example, 4o, and it ends up working pretty well. This comes up especially for voice AI agent companies, because latency is very important to get this whole Turing test to pass: if there's too much of a pause before the agent responds, humans can detect that something is off. So they use a faster model, but with a better prompt that was refined by the bigger models. That's a common pattern as well.

speaker 4: Another, again maybe less sophisticated, tip: as the prompt gets longer and longer, it becomes a large working doc. One thing I found useful is, as you're using it, to note down in a Google Doc the things you're seeing, the outputs not being how you want, or ways you can think of to improve it. You can write those in note form, then give Gemini Pro your notes plus the original prompt and ask it to suggest a bunch of edits to incorporate them well, and it does that quite well.

speaker 3: The other trick is, in Gemini 2.5 Pro, if you look at the thinking traces as it's parsing through an evaluation, you can actually learn a lot about all those misses as well. We've done that internally too, right?

speaker 2: And this is critical, because if you're just using Gemini via the API, until recently you did not get the thinking traces. And the thinking traces are the critical debug information for understanding what's wrong with your prompt. They just added them to the API, so you can now actually pipe that back into your developer tools and workflows.

speaker 4: Yeah. I think it's an underrated consequence of Gemini Pro having such long context windows that you can effectively use it like a REPL: go through examples one by one, put in your prompt on one example, and literally watch the reasoning trace in real time to figure out how you can steer it in the direction you want.

speaker 1: And the software team at YC has actually built various forms of workbenches that let us debug and do things like that. But to your point, sometimes it's better just to use gemini.google.com directly and literally drag and drop JSON files; you don't have to do it in some sort of special container. It seems to be something that works even directly in ChatGPT itself.

speaker 4: Yeah, for all of this I would give a shout-out to YC's head of data, Eric Bacon, who's helped us all a lot with this metaprompting and with using Gemini 2.5 Pro effectively as a REPL.

speaker 1: What about evals? I mean, we've talked about evals for going on a year now. What are some of the things that founders are discovering?

speaker 2: Even though we've been saying this for a year or more now, Garry, I think it's still the case that evals are the crown jewel data asset for all of these companies. One reason Parahelp was willing to open source the prompt, they told me, is that they actually don't consider the prompts to be the crown jewels; the evals are the crown jewels, because without the evals you don't know why the prompt was written the way it was, and it's very hard to improve it.

speaker 1: Yeah, and in the abstract, you can think about it like this: YC funds a lot of companies, especially in vertical AI and SaaS, and you can't get the evals unless you are sitting literally side by side with the people doing that specific knowledge work. You need to sit next to the regional tractor sales manager and understand: this is what this person cares about, this is how they get promoted, this is that person's reward function. And then what you're doing is taking those in-person interactions, sitting next to someone in Nebraska, and going back to your computer and codifying them into very specific evals. This particular user wants this outcome: after this invoice comes in, we have to decide whether we're going to honor the warranty on this tractor, just to take one example. That's the value, right? Everyone's really worried about, are we just wrappers, and what is going to happen to startups? And I think this is literally where the rubber meets the road: being out there in particular places, understanding that user better than anyone else, and having the software actually work for those people.

speaker 2: That is just such a perfect depiction of what the core competency required of founders today is. Literally, the thing you just said, that's your job as a founder of a company: to be really good at that and maniacally obsessed with the details of the regional tractor sales manager's workflow.

speaker 1: Yeah. And the wild thing is, it's very hard to do. Like, have you even been to Nebraska? The classic view is that the best founders in the world are really great engineers and technologists and just really brilliant.
And at the same time, they have to understand some part of the world that very few people understand. And there's this little sliver of overlap that is the founder of a multi-billion-dollar startup. I think of Ryan Petersen from Flexport: a really, really great person who understands how software is built, but who also, I think, was the third biggest importer of medical hot tubs for an entire year, like a decade ago. So the weirder that is, the more of the world you've seen that nobody else who's a technologist has seen, the greater the opportunity.

speaker 4: Actually, I think you've put this in a really interesting way before, where you were sort of saying that every founder has to become a forward deployed engineer. That's a term that traces back to Palantir. And since you were early at Palantir, maybe tell us a little bit about how forward deployed engineer became a thing at Palantir, and what founders can learn from it now.

speaker 1: I mean, I think the whole thesis of Palantir at some level was this: if you look at Meta back then, when it was still called Facebook, or Google, or any of the top software startups everyone knew back then, one of the key recognitions that Peter Thiel, Alex Karp, Stephen Cohen, Joe Lonsdale, and Nathan Gettings, the original founders of Palantir, had was that you could go into anywhere in the Fortune 500, go into any government agency in the world, including the United States, and nobody who understands computer science and technology at the highest possible level would ever even be in that room. And so Palantir's really, really big idea, which they discovered very early, was that the problems those places face are actually multi-billion-dollar, sometimes trillion-dollar problems. And this was well before AI became a thing. People were sort of talking about machine learning, but back then they called it data mining. The world is awash in data, these giant databases of people and things and transactions, and we have no idea what to do with it. That's what Palantir was, is, and still is: you can go and find the world's best technologists who know how to write software to actually make sense of the world. You have these petabytes of data and you don't know how to find the needle in the haystack. And the wild thing is, something like 20 or 22 years later, it's only become more true: we have more and more data and less and less of an understanding of what's going on. And it's no mistake that now that we have LLMs, it's actually becoming much more tractable. The forward deployed engineer title was specifically about: how do you sit next to, literally, the FBI agent who's investigating domestic terrorism? How do you sit right next to them in their actual office and see what a case coming in looks like? What are all the steps when you actually need to go to the federal prosecutor? What are the things they're sending? What's funny is that literally it's Word documents and Excel spreadsheets, right? And what you do as a forward deployed engineer is take these file-cabinet-and-fax-machine things that people have to do and convert them into really clean software. The classic view is that it should be as easy to do an investigation at a three-letter agency as it is to take a photo of your lunch on Instagram and post it to all your friends.
That's kind of the funniest part of it. And so I think it's no mistake today that forward deployed engineers who came up through that system at Palantir are now turning out to be some of the best founders at YC, actually.

speaker 2: Yeah, I mean, Palantir has produced this incredible, incredible number of startup founders, because the training to be a forward deployed engineer is exactly the right training to be a founder of these companies. Now, the other interesting thing about Palantir is that other companies would send a salesperson to go and sit with the FBI agent, and Palantir sent engineers to do that. I think Palantir is probably the first company to really institutionalize that and scale it as a process.

speaker 1: Right. Yeah. I mean, what happened there? The reason they were able to get these seven-, eight-, and now nine-figure contracts very consistently is that, instead of sending someone who's all hair and teeth, who goes in there and it's, let's go to the steakhouse, it's all relationship, and you have one meeting, they really like the salesperson, and then through sheer force of personality you try to get them to give you a seven-figure contract, and the timescales on this would be six weeks, ten weeks, twelve weeks, five years, I don't know, and the software would never work. Whereas if you put an engineer in there and you give them Palantir Foundry, which is what they now call their core data viz and data mining suite, then instead of the next meeting being a review of 50 pages of sales documentation or a contract or a spec, it's literally, okay, we built it, and you're getting real live feedback within days. And honestly, that's the biggest opportunity for startup founders. If startup founders can do that, and that's what forward deployed engineers are used to doing, that's how you can beat a Salesforce or an Oracle or a Booz Allen or literally any company out there that has a big office and a big fence. They have fancy salespeople with big strong handshakes. And it's like, how does a really good engineer with a weak handshake go in there and beat them? Well, you show them something they've never seen before and make them feel super heard. You have to be super empathetic about it. You actually have to be a great designer and product person, and then you come back and you can just blow them away, because the software is so powerful. The second you see something that makes you feel seen, you want to buy it on the spot.

speaker 2: Is that a good way of thinking about it, that founders should think of themselves as the forward deployed engineers of their own company?

speaker 1: Absolutely. Yeah. You definitely can't farm this out. Literally, the founders themselves, who are technical, have to be the great product people. They have to be the ethnographer. They have to be the designer. You want the person, in the second meeting, to see the demo you put together based on the stuff you heard, and you want them to say, wow, I've never seen anything like that, take my money.

speaker 3: I think the incredible thing about this model, and this is why we're seeing a lot of the vertical AI agents take off, is precisely this: they can have these meetings with the buyer and champion at these big enterprises.
They take that context, stuff it basically into the prompt, and then they can quickly come back, maybe in a meeting just the next day. With Palantir it would have taken a team of engineers a bit longer; here it can be just the two founders who go in. And then they close six- and seven-figure deals with large enterprises, which has never been done before. It's only possible with this new model of forward deployed engineer plus AI, and it's just accelerating.

speaker 4: It reminds me of a company I mentioned before on the podcast, Giga ML, who also do customer support, especially a lot of voice support. It's a classic case of two extremely talented software engineers, not natural salespeople, but they forced themselves to be essentially forward deployed engineers. And they closed a huge deal with Zepto and then a couple of other companies they can't announce yet.

speaker 2: Do they physically go on site, like the Palantir model?

speaker 4: Yes, they do. So they did all of that: once they close the deal, they go on site and sit there with all the customer support people, figuring out how to keep tuning and getting the software, or the LLMs, to work even better. But before that, even to win the deal, what they found is that they can win by just having the most impressive demo. And in their case, they've innovated a bit on the RAG pipeline so that their voice responses can be both accurate and very low latency, which is a technically challenging thing to do. I just feel like, before the current LLM rise, you couldn't necessarily differentiate enough in the demo phase of sales to beat out an incumbent; you couldn't really beat Salesforce by having a slightly better CRM with a better UI. But now, because the technology evolves so fast and it's so hard to get the last five to ten percent right, if you're a forward deployed engineer you can go in, do the first meeting, tweak it so that it works really well for that customer, go back with the demo, and get that "oh, wow, we've not seen anyone else pull this off before" experience and close huge deals.

speaker 3: And that was the exact same case with HappyRobot, which has sold seven-figure contracts to the top three largest logistics brokers in the world. They built AI voice agents for that. They're the ones doing the forward deployed engineer model, talking to the CIOs of these companies and shipping a lot of product with very, very quick turnarounds. It's been incredible to see that take off right now. It started with six-figure deals, and now they're closing seven-figure deals, which is crazy, just a couple of months later.

speaker 1: So that's the kind of stuff you can do with, I mean, unbelievably smart prompt engineering, actually. Well, one of the things that's kind of interesting about the models is that they each seem to have their own personality, and one of the things founders are realizing is that you're going to go to different models for different things.

speaker 3: Actually, one thing that's widely known is that Claude is sort of the happier, more human-steerable model. And then Llama 4 is one that needs a lot more steering; it's almost like talking to a developer. Part of that could be an artifact of not having had as much RLHF done on top of it.
So it's a bit rougher to work with, but you can actually steer it very well if you're good at prompting and willing to do almost a bit of the RLHF-style work yourself; it's just a bit harder to work with.

speaker 1: Well, one of the things we've been using LLMs for internally is actually helping founders figure out who they should take money from. And in that case, sometimes you need a very straightforward rubric: zero to 100, zero being never, ever take their money, and 100 being take their money right away, because they help you so much that you'd be crazy not to. Harj, we've been working on some scoring rubrics around that using prompts. What are some of the things we've learned?

speaker 4: So it's certainly best practice to give LLMs rubrics, especially if you want a numerical score as the output. You want to give it a rubric to help it understand how it should think things through and what an 80 versus a 90 looks like. But these rubrics are never perfect.

speaker 3: There are often exceptions. And you tried it with o3 versus Gemini 2.5.

speaker 4: And what we found really interesting is that you can give the same rubric to two different models. In our specific case, we found that o3 was actually very rigid: it really sticks to the rubric, and it heavily penalizes anything that doesn't fit the rubric you've given it. Whereas Gemini 2.5 Pro was actually quite good at being flexible, in that it would apply the rubric but could also reason through why someone might be an exception, or why you might want to push something more positively or negatively than the rubric might suggest. Which I thought was really interesting, because it's just like when you're training a person: you give them a rubric, you want them to use it as a guide, but there are always edge cases where they need to think a little more deeply. And I thought it was interesting that the models themselves handle that differently, which means they sort of have different personalities, right? o3 felt a little more like a soldier: I'm definitely doing check, check, check, check, check against the rubric. And Gemini 2.5 felt a little more like a high-agency employee: okay, I think this makes sense, but this might be an exception in this case. Which was really interesting to see.
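A small sketch of the kind of 0-100 rubric prompt discussed here, run unchanged against two different models to compare how strictly each applies it. The rubric wording and the `call_llm` client are placeholders, not YC's internal tooling.

```python
RUBRIC_PROMPT = """\
Score this investor from 0 to 100, where 0 means "never take their money" and
100 means "take their money right away".

Rubric (use it as a guide, and call out genuine exceptions explicitly):
- Responsiveness: replies within days, never ghosts founders.
- Process: clear steps, fast decisions.
- Track record: strong, verifiable outcomes.

Return JSON: {"score": <int 0-100>, "reasoning": "<short explanation>"}

Investor notes:
"""

def score_investor(call_llm, model_name: str, notes: str) -> str:
    # call_llm(model, prompt) is a placeholder for whatever model client you use.
    return call_llm(model_name, RUBRIC_PROMPT + notes)

# Diffing the scores and reasoning from a stricter model versus a more flexible
# one is a quick way to see the "personality" differences described above.
```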
speaker 1: Yeah. It's funny to see that for investors. Sometimes you have investors like a Benchmark or a Thrive, and it's like, yeah, take their money right away. Their process is immaculate. They never ghost anyone. They answer their emails faster than most founders. Very impressive. And then another example might be the plenty of investors who are just overwhelmed, and maybe not that good at managing their time. They might be really great investors, and their track record bears that out, but they're slow to get back, they seem overwhelmed all the time, and they accidentally, probably not intentionally, ghost people. And so this is legitimately exactly what an LLM is for. The debug info on some of these is very interesting to see: maybe it's a 91 instead of an 89. We'll see.

I guess one of the things that's been really surprising to me, as we ourselves are playing with this and spending maybe 80 to 90% of our time with founders who are all the way out on the edge, is that, on the one hand, the analogy we keep using is that it's kind of like coding. It actually feels like coding in 1995: the tools are not all the way there, a lot of stuff is unspecified, and we're in this new frontier. But personally, it also feels like learning how to manage a person: how do I actually communicate the things they need to know in order to make a good decision, and how do I make sure they know how I'm going to evaluate and score them? And not only that, there's this aspect of kaizen, the manufacturing principle that created really, really good cars for Japan in the nineties. That principle says that the people who are the absolute best at improving the process are the people actually doing it; that's literally why Japanese cars got so good in the 90s, and that's metaprompting to me. So, I don't know, it's a brave new world. We're in this new moment. With that, we're out of time, but we can't wait to see what kind of prompts you come up with, and we'll see you next time.
Latest summary (detailed)
Overview / Executive Summary
This discussion digs into the latest practices and core ideas in prompt engineering at top AI startups. The participants (Garry, Harj, Diana, Jared) agree that prompting has evolved from an ad hoc trick into the key interface for working with AI, and that its complexity and importance keep growing. The main conclusions: 1) High-quality prompts are the foundation of reliable AI agents, as shown by the six-page prompt Parahelp made public, whose structure, role setting, step-by-step planning, and output format definitions exemplify current best practice. 2) Metaprompting and prompt folding are becoming mainstream: using the LLM itself to iterate on and optimize prompts for continuous improvement, which the hosts compare to the kaizen principle from manufacturing. 3) Founders need to act as forward deployed engineers (FDEs), embedding themselves in customer workflows and turning concrete business scenarios into effective evals and prompts; this is how AI startups build a moat and beat incumbents. 4) Evals matter more than the prompts themselves; they are a company's real crown jewels because they encode a deep understanding of user needs. Finally, the discussion stresses that different LLMs have distinct "personalities" that call for different steering strategies, and that prompt engineering itself feels both like early-days programming and like managing a person, requiring careful communication and feedback loops.
Dissecting a State-of-the-Art Agent Prompt: The Parahelp Example
Parahelp provides AI customer support for top AI companies such as Perplexity and Replit. The agent prompt they made public illustrates current state-of-the-art prompt design.
- Core features:
- Detailed and structured: the prompt runs to about six pages, is very detailed, and uses markdown-style formatting for clear structure.
- Role setting: the prompt starts by giving the LLM an explicit role, e.g. "You are a manager of a customer service agent," and lists its responsibilities as bullet points.
- Task definition and planning: it states the core task explicitly (e.g. "approve or reject a tool call") and provides a high-level step-by-step plan (steps 1 through 5).
- Output format specification: the output structure is strictly defined so it integrates cleanly with other agents or APIs, typically by specifying a JSON format or a specific accept/reject output.
- Reasoning guidance and examples: the prompt not only tells the LLM what to do but also how to reason, and it provides concrete examples.
- Programming-like syntax: it uses XML-tag-style formatting to organize content, because LLMs post-trained on this kind of format follow instructions better and produce more reliable results.
- Discussion points:
- Garry's question: Garry notes that the published prompt lacks concrete example outputs for specific scenarios.
- Jared's explanation: those customer-specific examples live in the next stage of the pipeline, i.e. the developer prompt. This raises a core challenge: how to offer customized logic without becoming a "consulting company" that rewrites the prompt for every customer.
Emerging Prompt Architectures and Advanced Techniques
To resolve the tension between generality and customization, a three-layer prompt architecture is emerging.
- The three-layer prompt architecture:
- System prompt: defines the high-level API and company-wide logic of how the product operates, as in the Parahelp example; it contains nothing customer-specific.
- Developer prompt: fills in the context and logic for a specific customer. For example, handling Perplexity's RAG questions is very different from handling Bolt's ticket flow; those differences live in this layer.
- User prompt: the content typed directly by the end user, e.g. "generate me a site with these buttons" in Replit.
- Advanced techniques and strategies:
- Provide an escape hatch: to keep the LLM from hallucinating when it lacks information, give it an explicit way out.
- Approach 1 (Tropir): instruct the model directly: "If you do not have enough information, do not make something up; stop and ask me."
- Approach 2 (YC internal practice): add a debug_info field to the output format where the LLM can "complain" that instructions are unclear or information is missing. This gives developers a continuously updated to-do list for improvement.
- Use high-quality examples: for complex tasks (such as finding N+1 query bugs in code), prose descriptions alone fall short. Supplying an expert-level worked example (akin to test-driven development in software) steers the LLM through complex reasoning far more effectively.
- Use thinking traces: model APIs such as Gemini 2.5 Pro now expose "thinking traces," letting developers see the model's reasoning process, which is key information for debugging and optimizing prompts.
Metaprompting: AI Improving Itself
Metaprompting is one of the most powerful tools available today; the core idea is to use the LLM to improve its own prompt.
- Core concepts:
- Prompt folding: described by Tropir, a YC-backed company; one prompt dynamically generates a more optimized, specialized version of itself for a given query.
- In practice: when a prompt fails, feed the failing cases together with the original prompt back into the LLM and ask it to "make this prompt better."
- Easy way to start: give the LLM the role of an "expert prompt engineer" and have it critique and improve your existing prompt.
- Application strategies:
- Tiered model optimization: do the metaprompting with a bigger, more capable model (the transcript mentions Claude 3.7 and o3) to produce a high-quality refined prompt, then run that refined prompt on a smaller, faster, cheaper production model (e.g. 4o). This matters especially for latency-sensitive applications such as voice AI agents; see the sketch below.
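Below is a minimal sketch of this "refine with a big model, serve with a small model" pattern. The model names and the `call_llm` client are placeholders chosen for illustration, not specific vendor APIs.

```python
def refine_prompt_offline(call_llm, draft_prompt: str) -> str:
    """One-off, latency-insensitive step: a large model polishes the prompt."""
    return call_llm(
        model="large-frontier-model",   # placeholder, e.g. a big reasoning model
        prompt="Improve this prompt. Keep the output format unchanged:\n\n" + draft_prompt,
    )

def answer_in_production(call_llm, refined_prompt: str, user_turn: str) -> str:
    """Hot path: a small, fast model runs the refined prompt to keep latency low."""
    return call_llm(
        model="small-fast-model",       # placeholder for a distilled / cheaper model
        prompt=refined_prompt + "\n\nUser: " + user_turn,
    )
```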
Evals: The Real Crown Jewels
The hosts agree that evals are an AI startup's most important data asset, worth even more than the prompts themselves.
- Jared's view: "The evals are the crown jewels. Without the evals, you don't know why the prompt was written the way it was, and it's very hard to improve it."
- Garry's view: evals are the process of encoding real-world user needs and workflows into software. It requires founders to get into the field, e.g. sitting next to the regional tractor sales manager in Nebraska, understanding their pain points and reward function, and then turning those insights into concrete evaluation criteria (sketched below). This is how an AI startup builds a moat and avoids being a thin "model wrapper."
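A minimal sketch of how this kind of domain knowledge might be codified into evals: each case pairs a concrete scenario with the outcome the domain expert expects, and a prompt is scored by how many cases it gets right. The scenario, the crude checker, and the `call_llm` client are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    name: str
    input_text: str   # e.g. an incoming invoice or support ticket
    expected: str     # the outcome the domain expert actually wants

def run_evals(call_llm: Callable[[str], str], prompt: str, cases: List[EvalCase]) -> float:
    """Return the fraction of cases where the model output matches expectations."""
    passed = 0
    for case in cases:
        output = call_llm(prompt + "\n\n" + case.input_text)
        if case.expected.lower() in output.lower():   # crude check; real evals use graders
            passed += 1
    return passed / len(cases)

CASES = [
    EvalCase(
        name="tractor_warranty_invoice",
        input_text="Invoice #1042: hydraulic pump failed 13 months after purchase ...",
        expected="honor the warranty",
    ),
]
```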
The Founder as Forward Deployed Engineer
This model originated at Palantir and has become key to the success of vertical AI agent startups.
- Core of the model: founders are no longer traditional salespeople or product managers but forward deployed engineers who combine engineering, product, design, and ethnography. They meet customers (e.g. the CIOs of large enterprises) directly and deeply understand their complex workflows.
- How it works:
- Get into the field: work side by side with the customer, observing and understanding their existing processes (often built on Word documents and Excel spreadsheets).
- Rapid prototyping: use LLMs to turn those insights quickly into a demo that solves the customer's core pain point.
- The "wow" moment: in the second meeting, show a solution the customer has never seen before, one that makes them feel understood, and close large (six- to seven-figure) contracts quickly.
- Success stories:
- Giga ML: closed large deals with Zepto and others using this model.
- HappyRobot: sold seven-figure AI voice agent contracts to the three largest logistics brokers in the world.
LLM "Personalities" and Using Scoring Rubrics
Different LLMs show different "personalities" and behavior patterns on the same task.
- Model personalities compared:
- Claude: seen as the "happier," more human-steerable model.
- Llama 4: behaves more like a "developer"; it needs more explicit steering, but is quite malleable when prompted well.
- Lessons from using rubrics:
- When asking a model for a numerical score, a clear rubric is essential.
- YC's internal experience using LLMs to evaluate investors:
- o3: very rigid, like a "soldier," sticking strictly to the rubric.
- Gemini 2.5 Pro: much more flexible, like a "high-agency employee"; it uses the rubric as a guide but can also recognize and handle exceptions.
Conclusion: Prompt Engineering as Coding, Management, and Continuous Improvement
Garry wraps up by noting that today's prompt engineering is a unique blend of skills.
- Like programming in 1995: a new frontier where the tools are not fully built yet.
- Like managing a person: the core is communicating effectively and setting clear goals and evaluation criteria.
- Embodying kaizen: the Japanese manufacturing principle of continuous improvement, where the people doing the work drive process optimization, mirrors how metaprompting lets the model improve its own prompts.