2024-08-17 | AI Engineer | Building with Anthropic Claude: Prompt Workshop with Zack Witten
AI Prompt Engineering Hands-On Workshop: Optimization Techniques and Case Studies
Tags
Media Details
- Upload Date
- 2025-06-10 12:39
- Source
- https://www.youtube.com/watch?app=desktop&v=hkhDdcM5V94
- Processing Status
- Completed
- Transcription Status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: Thank you. speaker 2: All right, good afternoon, everybody. Thank you all so much for joining us. We have the enviable position of being after lunch; I'm seeing some cookies on the table still. But thankfully you're not here to listen to me: you're going to be riveted by the prompt doctor, who's going to come up in a second. So I'm expecting no sleeping on the table, but you never know. Excited to be here. I'm Jamie Nei, and I lead our startup team at Anthropic. I just want to say a couple of quick things before we get going with, again, the reason you're here: the prompt doctor. We've had a lot of really exciting releases just in the last couple of days, some in the last couple of hours, and I wanted to put these up to highlight some of the cool things we're doing, but also to share how a lot of folks, not only in this room but some of your peers, maybe folks back at the office, can work with Anthropic, not only on the prompting work we're going to do here, but on helping you grow your business and the really cool things you're building with Claude and with LLMs. My team is here specifically to help your company grow and scale, whether that's getting access to higher rate limits, getting access to folks like Zack, who's going to be up here in a moment, or learning more about what we're doing from an early-access perspective. We want to work with you and empower the next wave of really exciting AI companies built on top of Claude. And we're helping from a product perspective with a couple of the releases you see here. Has anyone tried 3.5 Sonnet yet? Love it. Very cool. Thankfully, Zack has tried it as well, so we're in a good place; we're really excited about what we were able to release there. Also, just today, Artifacts came out for our Claude Teams and Claude.ai plans. Artifacts is this really cool tool that some folks have been playing around with, making video games from a couple lines of text that turn into code. But there are actually a lot of really cool business use cases too, from diagrams onward; I've seen some cool pharma companies using it and thinking, how can I take what's in my head from a gene-sequencing kind of discussion, put it into something like an artifact, and share that with my peers, all the way to prompting cookbooks, helping, again, folks like yourselves. I would imagine a lot of you will hopefully be featured on the website, coming up with what you're able to prompt for these use cases in your companies. So those are some things we've come out with over the last couple of days, a couple of hours, when you think of Claude Teams and Projects. For ways to connect, feel free to reach out at sales@anthropic.com. We have our head of DevRel, Alex, here as well. We really love this community, love staying engaged, and we're really excited about what we've released over the last couple of days to help you and what you're building. So I'm going to leave it to the prompt doctor. Zack, why don't you come on up here and show us what we can do? Thank you all so much for joining us today. We're really excited. speaker 1: Thanks, Jamie, for the intro, and thank you all for coming. This is really awesome; I had no idea this many people were going to be here. So thanks. Okay, there's not going to be much talk; it's mostly just going to be straight-up prompting from beginning to end.
I did make a couple slides, mostly about what to bring. So I set up a Slack channel in the AI Engineer Slack; it's called prompt-eng-live-workshop-anthropic. That's where you can upload prompts. And what we're going to do on stage is: I'm just going to look at them, I'm going to read them, we're going to test them in the Anthropic console, we're going to see if we can get better results, and we'll just try to learn as we go. This is something I do internally in our team Slack quite a bit, but I've never done it in front of this many people. It'll be exciting, it'll be fun. There might be some hiccups along the way, but hopefully you'll have a good time too, and maybe learn something. I know I'll definitely learn something. So what kind of things should you put in this Slack channel? You can put a prompt template. A prompt template is kind of like a prompt. Actually, I just realized I don't even need this mug. Okay? You put a prompt template, which is like a prompt but with spaces where the variables are going to go, and the variables are denoted with these double brackets. So in this case, it's this document part. If you don't have it in this format, that's fine; we can figure it out. This is just the ideal. So this is the prompt template; this is the kind of thing you'd put there. Then you can also include a couple examples of cases where it doesn't do what you want, and that will give us direction as far as where we want to go with it. I might also ask you questions out loud if I have questions about what kind of output is good or not, or I might ask questions in Slack, whichever is easier; we'll figure that out as we go. Okay. That being said, we're going to use the console for iterating mostly, although I might use a couple other tools like Claude for Sheets, which is a spreadsheet where you can call Claude. So yeah, let's see what we've got in the Slack already. Okay, we have something here. Thank you, Gordy. So: "You're an expert patient..." Let's put this into the console and take a look. And I'm just going to go through as many of these as we can get through in the session. So first of all, we can probably capitalize all the sentences. Does that matter? Thank you. Yeah, yeah. Perfect. A lot of things in prompt engineering are very new, right? We don't know for sure; somebody out there might have done a study where they conclusively show that using capital letters and better grammar, fixing grammar mistakes, helps. I have anecdotally found this in a few cases. I've also read some light quantitative stuff showing that typos do hurt performance. But I'm also just pretty obsessive about this stuff, so I just fix it, and I think it definitely doesn't hurt. Okay, can I zoom in? Great question. Is that any better? No? A little more. Okay. Is that any better? Okay. So, first thing: let's put the information in XML tags, so we can go like this. Why XML? Why not Markdown? Another great question. Claude was trained with a lot of XML in its training data, so it's seen more of that than it's seen of other formats, and it just works a little bit better. So this looks like all the information here; we have the medication review. Okay. Yeah, great call. Okay.
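A minimal sketch of the kind of prompt template Zack describes here, with double-bracket variables and XML-tag sections; the tag and variable names below are illustrative, not taken from the workshop:

```python
# Prompt template sketch: {{NAME}} placeholders mark where inputs go,
# and XML tags clearly separate the parts of the prompt.
PROMPT_TEMPLATE = """You are an expert patient-support assistant.

<information>
{{DOCUMENT}}
</information>

<instructions>
1. Answer using only the information above.
2. Be concise: one to two sentences, never more than three.
</instructions>

<user_input>
{{USER_INPUT}}
</user_input>"""

def render(template: str, **variables: str) -> str:
    """Substitute {{NAME}} placeholders with the supplied values."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template
```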
Actually, let me undo everything that I've done so far. Okay, so we can run it here. And now the console is asking us for the user input. Do we have a user input? Okay, perfect. So: "Who are you?" Let's do this one. "Why do I need to...?" Yeah, we can do them both. "Who are you?" Okay. So, Gordy, what do you think of this? It's way too long; this is conversational. Okay, it's too long; we can probably fix that. We can also use this Evaluate tab, so let's add all the test cases here. This is the Evaluate tab of the console. I'm also going to be showing off some of the console features, because I think it's a cool tool for prompt iteration. There are also some secret new features that I might show; we'll see about that. Okay, so then we also have "Why do I need to do this experiment?" Looks like it added a bunch of new rows; let's definitely get rid of those, and then we can get this next one: "Can I schedule it tomorrow?" Cool. So I hit Run Remaining, and this is all running through Claude. Okay, so we have "Why do I need to do this appointment?" This looks pretty long as well. And here we have: "I apologize, but I don't have information about scheduling availability." So the issue is that it doesn't have information about 24-hour availability. All right, so this is version one. Now let's make some changes. First, I'll try to do things roughly in order of importance, so maybe I won't make you all sit through the capitalization, even though I would definitely do that. I'm also going to add a new line here, just because I think that's more normal; in a written document you'd have a new line. We'll close the information tag. What's that? Yeah, I could. Actually, that's not a bad idea. The one thing I wouldn't feel completely confident about is that it would exactly transcribe the rest of everything word for word. I think it probably would. What I actually might do is just have Claude write code to capitalize the first word of every sentence; then I'd be worried about edge cases, like what if there are ellipses. But that kind of thing is definitely useful, and I definitely use Claude a lot in writing prompts. For instance, we have a Claude tool that helps complete code, basically, and I do a lot of prompting in that IDE, because especially with very nested XML tags it helps a lot, just suggesting the closures of them, which is pretty obvious but still takes a long time to type. So yeah, if you have any sort of Copilot-type thing, that's definitely a good environment for writing prompts. Okay, now let's do the same thing with these instructions. It looks like this one didn't get a number, so let's fix that. The key thing with XML, I think, is that XML itself isn't even that important; the most important thing is just clearly separating the different parts of the prompt. Yeah, exactly: here's this stuff, here's this other stuff. If we wanted to, we could do something like... I wouldn't do this, but I think it would probably work fine. Okay. So this is all fine. Let's also do the same thing with the user input. Now we can go back to the Evaluate tab and hit Rerun All, and it's using our nice new prompt. Still looks pretty long. We can also see how it does on the last case. Okay.
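As an aside on the "have Claude write code to capitalize sentences" idea: a quick sketch of doing that fix in code, deliberately naive so the ellipsis edge case Zack worries about is visible (illustrative, not from the session):

```python
import re

def capitalize_sentences(text: str) -> str:
    """Capitalize the first letter after sentence-ending punctuation.

    Naive on purpose: edge cases like ellipses ("...") will also
    trigger capitalization, which may not be what you want.
    """
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

print(capitalize_sentences("you are an expert. be concise... always."))
# -> "You are an expert. Be concise... Always."
```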
So here it still said, "I don't have access to the specific scheduling information." Let's try to fix these two things. First, we can make it shorter. Do we have anything here about making it shorter? What's that rule? Rule seven. Okay: "Be concise and offer only relevant information." Oops, let's actually do this; I don't want a missing number here. "Be concise and offer only relevant information." Each response should be... let's be a little less prescriptive, to give Claude a little more room. If we say every response should be exactly three sentences, that might be a little too constraining. I'm just guessing. So we could just say... speaker 3: Better than telling it to be concise? speaker 1: Yeah, "concise" could mean a lot of different things to different people. In some cases, concise might mean literally only one word; in other cases, if you ask for a concise book review, we might be looking at a single-page doc, and that would be concise in the context of a book review. So yeah, Claude is trying to guess what you mean by concise. Sorry, one sec. Go ahead. I think that's right. What he was saying, if people couldn't hear, is that the prompt is long, so the response might also be long. I don't think that's definitive: you can have long prompts that give short responses, or short prompts that give long responses. But it's more that, if you don't say anything, it might pick up on some of those context clues. You were saying something over here? Yeah, keeping it short. So let me get to that after we do this. So, "two to four sentences." It looks like it's still pretty long; I think maybe that's longer than necessary, so maybe we should make it one to two sentences. Let's try that. Never more than three. Okay, now we can try that here. Okay, that looks better, right? This is definitely shorter. And it also seems to be giving variable numbers of sentences: these were both two sentences, and then this one is three. So one of the questions over here was, can the LLM figure out that it should give longer responses in certain situations and shorter in others? It seems like it did that here. Okay. The next point was that, in this case, it shouldn't say "I don't have access to the scheduling system or specific appointment times." What should it say instead? It should... what? But intentionally, in the prompt, I left out that it's a 24-hour service. So this is a case where we're asking a question whose answer isn't present in the information it has. Okay, I mean, we could add something like "you're open for 24-hour service," but you're saying you want to test its ability to figure out how to handle it without that? Oh, okay. speaker 3: No, which is good. Okay. speaker 1: Well, then we're doing great. All right. Is there anything else you wanted to get out of this example, Gordy? Structure. So does the order of the rules, or putting information then rules versus rules then information, does any of that matter? Yeah. So, does it matter what order we have these components in? I think it's better to put the information above the instructions. We've found that instructions are more tightly followed the closer they are to the bottom of the prompt. As a rule, this doesn't necessarily apply in all situations, so definitely test it out.
That actually is a blanket statement that applies to everything that I say, but particularly to that. Okay, yeah. No, no, go ahead. Exclamation marks? Yes. Oh, I don't think I added that; it looks like you added that, Gordy. Exclamation marks: just as they emphasize things for humans, they also emphasize things to the model. Do you think that has more of an effect than the numbered rules? Oh, I don't know at that level of detail, and I think it's dependent on context as well. But yeah, if you want to emphasize things, capitalizing them, putting in exclamation marks, or just saying "this is extremely important" all does do something. speaker 2: [inaudible question about the tokenizer] speaker 1: I don't know, but it does something; that's all I can say. Just anecdotally, if you put exclamation marks in, the output is different. Okay. Let's go to the next one. How many do we have? We've got six already; that's pretty good. Okay, this is just a general question; I'll answer it really quick: "In general, for translations or multilingual output, is it better to instruct in English or in the native language?" I think it's better to instruct in the native language, if you speak the native language. If you only speak English and you're choosing between a better prompt written in English versus a worse prompt written in a language you don't understand, I'd probably default to writing it in the language I knew super well. But ideally, for the ideal prompt, you would find a native speaker of the language, explain your use case to them, and have them write the prompt. Yeah. So... is that not the same question I just answered? Is it different? I think it's better to have the prompt in that language in general, if you can write a really good prompt in it. Let's go to this next one. "You'll be acting as a test reviewer..." Let me pump up the size here too. Okay. Not sure if there's a way I can hide this. Okay, great. "Responsible for improving unit tests based on a set of requirements. Below is the project directory: project path. Don't include any other explanation, conversation, or output besides the JSON." Okay, this is great, because it's going to let me show off prefills. Let's make a new prompt here and paste this in. We have to use double brackets instead of single brackets to get variables in the console. And then I think there's another... this JSON string. Who gave this prompt? This is from Dan. The JSON string, Dan: is that a variable, or is that an example in this prompt template? Yeah. So your agent does not always return JSON, right? Yeah, almost all of this is just workarounds for the fact that it doesn't always speak JSON; you can see how many times we said that in the prompt. And do you have an example input here? So that first comment, if you go up a little bit, sorry, that first comment is from the unit test writer; that is the input. The unit test writer writes a bunch of unit tests, and then the reviewer reviews them and makes them better. So everything here is what I should put in? Yes. And then you can see that second set; this is a good result, where it writes JSON that basically says: cool, update this file with these unit tests, and here are the modifications I made, that sort of thing.
Okay, so in this template here, where would the thing that I just copied go? Well, essentially we don't provide it inline with the prompt; we just provide the conversation, and then this thing jumps in as a separate agent. So the context window is going to have the unit tests in it, but we're saying: respond in this format, given the unit tests that are earlier in the conversation that you're picking up. So the thing you just pasted is like step three of a multi-turn conversation — not multi-shot, multi-turn. Yeah. The thing I just pasted was two turns, right? A unit test writer and then a unit test reviewer, and the reviewer is the one that's having the problem; it comes second. Okay, so it would be something like this: "Here are some unit tests written by a unit test writer bot," right? Okay. And then we have this... Okay, so this is basically how it works. Does this look right? No, it does; it's just that we do this in this larger conversation, not as a standalone prompt. Yeah. Okay. So then here I put the unit tests in, right? Yep. And then for the project path, what sort of thing should I put there? Anything? I mean, this is just the local directory that's going to be modified, so the agent actually has access to the files in that directory, and it'll fill in its own sense of which files to modify. Okay, so then I'll just put... that's fine, something like that. Yeah. Okay, so let's see if it comes out with JSON or not. Bated breath here. Yeah, we did do most of our tests on Claude 3 and not 3.5, so 3.5 is probably a little better. Okay. Yeah, if it makes it more realistic, we could also switch the model version to use Haiku. No? I mean, we're going to upgrade, so I'd rather see it with this. Okay. speaker 2: What was that? speaker 1: Is the temperature being set to zero intentional? Yeah, for knowledge work I usually have the temperature set to zero. I think using temperature zero, you'll probably get marginally fewer hallucinations. Okay. Oh, here we go. It looks like in this case it did output JSON. Yeah, that looks plausible. Okay, it's very long JSON; I guess that explains why it was taking so long. It looks like it actually even ran into the max output tokens, because it didn't finish its JSON. Aha. Since this one is going kind of slow, I'll test it with Haiku, and let's also increase the max tokens to sample so it doesn't run into that issue. What I'm really hoping for is a case that doesn't output JSON, so that I can fix it and then it will output JSON. If not, I can still explain how I would fix it. Yeah, that would be great. Honestly, just any comments you have on how we structured things. Okay. So this is definitely a big request from people: how do I make sure the model outputs JSON? The most reliable way to do that, I feel, is using the assistant prefill. Maybe some of you have used this feature before; maybe some of you have only used other models, such as GPT, that don't offer it. Something you can do in the Claude API is partially prefill an assistant message. What you're doing there is putting some words in Claude's mouth, as we call it. And then when Claude continues from there, it assumes that it's already said whatever you told it that it had said, and that can help you get it on the right path.
So for instance, in this case, if we want to make sure... The classic bad response from Claude, when people give it prompts like this where they want JSON, is that Claude says something like... I'll just add another message and type it. It might be like, "Here is the JSON:", right? Have people seen stuff like this? This part right here is very annoying and difficult to get rid of. Okay, so I have two strategies. Let me give the simplest one first, though they both require a tiny bit of post-processing at the end. Let's take out all this stuff about "make sure to only respond in JSON"; that could be one way to get it to misbehave. So we could just go like this; let's try to make it not do JSON. Let's get rid of all this stuff. Okay, so now maybe it will do the preamble thing that we don't want it to do. Perfect. So an easy way to get it to not do that is to take this preamble and put it in the prefill, so it thinks that it already said it, like this. What we're doing here... you could think of Claude almost like a child who's misbehaving: it wants to do something, you say don't do the thing, but it keeps doing it, because it just loves preambles so much; it has this innate desire to do them. One way is to argue with it a lot. But if you have a kid, sometimes you just have to let them do the thing they want to do, and then they'll get over it. In this case, that's basically what we did: we gave Claude this prefill where we let it do the thing. As far as it's concerned, it already did the thing, and from there, what it's outputting is JSON. Now, if you want to make this even more reliable, you can put this nice little bracket at the end of the prefill, and then it's like, "Oh dang, I'm really in JSON mode now; my JSON has actually already begun." At that point, it's definitely not going to do the preamble. The only thing here is, if you sample with this prefill, you will need to add the bracket back before you try to do your json.loads or what have you; since you told Claude that it had already said the opening bracket, it's not going to give you another opening bracket. Okay. Another thing you can do is return the JSON in JSON tags. Let's try it without the prefill. Capitalize "JSON"... I'm not a software engineer, okay, I'm a prompt engineer; I don't even know if that's capitalized, I know it's just an English word. But yeah, good, thank you. Okay, so here we see it did the thing: it gave its preamble, right? Then it gave the JSON tag; everything within the JSON tags is JSON, and at the end it closed the JSON tag. Again, this requires the tiniest bit of post-processing; it's just a regex: take everything within the JSON tags and use that. You can even combine these two techniques. You could say "here is the updated JSON," and now you give it the JSON tag, and we can even put a bracket there. And now what we'll see is it will give the JSON minus the opening bracket, then close the bracket, and then close with the closing JSON tag. There we go.
And you can see it did it. At first I was a little bit panicked, because I didn't see the closing JSON tag at the very bottom, but then I saw that it actually did include the tag up here, and then it gave this little explanation afterwards. So here is another useful thing; this will save you some time and tokens and trouble. It costs you money to get Claude to output all this stuff, and in most cases you don't need the explanation; you just need the JSON. So one thing we could do is say, "Do not include any explanation after the JSON," right? I mean, probably. Honestly, I don't yell that much; this is actually meant to be my parody of what a frustrated prompt engineer would write if they couldn't get rid of this. In practice, you might not need to do that. There's a simpler way, and we're getting outside the realm of prompt engineering for a second and into the world of API parameters, but that's okay. There's a parameter called stop sequences. We told it to return the JSON in JSON tags, right? There's no way to set this in the console, so I can't show it off at this exact moment, but in the API, if you add this closing JSON tag that I've highlighted with my mouse to the stop sequences, it will just hard-stop after it outputs those tokens. You won't even have to worry about telling it not to continue from there, because it just won't sample from there, and you won't be charged. It's all good. So one of the things I'm hoping to impart with this talk is that a lot of times it's cheaper and easier to do a little bit of work outside of the LLM call and not even worry about prompting, because prompting can feel non-deterministic; you don't know what the model is going to do. When you can offload stuff to code, especially if the code is really easy to write, just do that. Don't put a bunch of stuff in the prompt about "you must output JSON"; just use the prefill and parse it out with a regex. Don't add a bunch of stuff about how it has to stop after a certain word; just add it to the stop sequences. Simple is better, and falling back on code is better than relying on prompts. Yeah. Yes, the prefill is available through the API. What you do is include an assistant message as the last message in the messages list, and when I say an assistant message, I just mean a message where the role is set to assistant. speaker 3: And what would have happened if, instead of the prefill, you just put that text into the last line of the instructions? speaker 1: So in other words, if I said... I'm actually not sure; that's a good question, so let's try it. I genuinely do not know how Claude will respond to this. Let's see. It looks like what it did, without reading this JSON too closely, is that it included an additional open bracket. Because it's supposed to have already started with an open bracket, but here it started with an additional one. So it almost worked, but not quite. Anyways, I don't recommend doing this, but that was fun for curiosity's sake. Yeah.
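Putting those pieces together: a minimal sketch of the prefill-plus-stop-sequences pattern demonstrated above, using the Anthropic Python SDK. The model name and prompt text are illustrative, not from the session:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PREFILL = "Here is the JSON:\n<json>\n{"  # put words in Claude's mouth

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=2048,
    temperature=0,
    stop_sequences=["</json>"],  # hard-stop before any trailing explanation
    messages=[
        {"role": "user", "content": "Review these unit tests and return "
                                    "your verdict as JSON in <json> tags."},
        # Assistant prefill: Claude continues from here, so it skips the
        # preamble and goes straight into the JSON body.
        {"role": "assistant", "content": PREFILL},
    ],
)

# Claude's continuation does not repeat the opening brace from the prefill,
# so add it back before parsing.
data = json.loads("{" + response.content[0].text)
```

Because the prefill ends mid-JSON and the stop sequence cuts the sample off at the closing tag, you pay for neither a preamble nor a trailing explanation.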
speaker 3: And after the sentence... yeah, after the sentence, we can just write it like you wrote, right? "Write the response in the JSON format," and then on the next line just write "JSON:" and leave it. I think it will work. speaker 1: Oh, so if I said something like this... yeah, I could see this working. Yeah, it looks like it worked pretty well. I think there are a lot of ways to accomplish this; I think the ways I showed are the most reliable, so that's what I would officially recommend. But definitely experiment. speaker 4: Say we're going to use this in production, these exact kinds of things you're playing around with right now. How would you think about testing that at some sort of scale? speaker 1: How do we test it at some sort of scale? speaker 4: So, more than the one-shot test we did. speaker 1: Yeah. To test it at scale, you need a bunch of test cases. And if you don't have test cases... okay, this is maybe a good time to show off this thing, although I'm actually not sure if it will work. speaker 4: So I guess my question is more pointed. I think test cases are useful when I'm writing a prompt, to deduce whether asking it to think step by step makes it more accurate. But in this case, for formatting, what I'm wondering is: could you take this prompt, feed in the output and the prompt, and then ask the model itself to evaluate how good these various outputs are at following the instructions? speaker 1: So, can we model-grade the outputs? For formatting-related things, I would not model-grade the outputs, because formatting is something I can check in code. If I can do anything in code without having to call the LLM... the LLM is this crazy black box, right? If I don't need to make the pilgrimage to the oracle and ask it, I'd rather just do it in code. So for formatting specifically, we're in luck; it's easy to check. For something like the previous prompt we looked at, where the outputs are a lot more squishy, possibly model grading could work, or possibly we'd need a human to evaluate the answers. speaker 4: Just to lightly push back on that: I actually put an example in the Slack channel; we don't need to get to it, because we're talking through it now. Imagine I'm asking for a summary or something, and I want to detect whether there's additional chat-like content before or after it. In that case, my mental model would have been to use an LLM as a grader. But it sounds like maybe you'd encourage instead using summary tags and hard-coding a check for additional text around them? speaker 1: Yeah, I think that would be pretty quick and easy to do. Also, just having the summary be in summary tags is generally good practice; I generally have all my outputs inside tags, to make them really easy to extract, and I don't think there's any downside to doing that. It might even be that by doing that, you effectively fix your entire issue and don't even need the test anymore: you just put the closing summary tag in the stop sequences and you're good to go. Cool. But that also does sound like a problem an LLM could grade.
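A sketch of the check-formatting-in-code idea from this exchange: validating a tag-wrapped summary with a regex instead of asking a model to grade it. The tag name is illustrative:

```python
import re

def extract_summary(output: str) -> str | None:
    """Return the summary if the output is exactly one well-formed
    <summary>...</summary> block with no chatter around it, else None."""
    match = re.fullmatch(
        r"\s*<summary>(.*?)</summary>\s*", output, flags=re.DOTALL
    )
    return match.group(1).strip() if match else None

assert extract_summary("<summary>All good.</summary>") == "All good."
assert extract_summary("Sure! Here you go: <summary>Hi</summary>") is None
```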
Okay, let's go to the next prompt here, to answer your question. "Here's a very poorly formatted Excel spreadsheet..." I've got a question real quick: this seems like a ridiculously powerful attack vector, so can we test the prompt real quick? I don't want to get into too much jailbreaking stuff here, sorry. Okay, apologies; that's kind of my specialty. Okay, yeah, I'm going to go to the next prompt. So what do we have here? "Here's a poorly formatted Excel spreadsheet / CSV. Please extract all the data into JSON." Okay. Is it Jan or Yan? What is the actual text that I should... Can you paste the text here? Because otherwise I don't know how to get the CSV into the console. Oh, hey, thanks. I've just been copying the entire CSV and putting it into the prompt. Again, I've been trying to use Claude to extract information from spreadsheets, and it's always been very, very hard; it hallucinates a lot, or it misses a lot of stuff. I was wondering, maybe more generally: how do you have Claude analyze really poorly formatted spreadsheets, where sometimes there are different clusters or multiple datasets in the same sheet, and things like that? Okay, I'll try to answer the general question of analyzing poorly formatted spreadsheets. The first thing that comes to mind, especially since you say the spreadsheets are very big, is breaking the problem down: give it fewer spreadsheets at a time, give it fewer columns of the spreadsheet at a time, only give it the columns it needs to work with, and make the questions smaller and more bite-sized, and tackle it that way. At that level of generality, that would be my answer. I'd also be curious to look at this one more specifically; right now, I'm just struggling with how to copy the text and put it into the tool. That's what I did, but it keeps downloading it. I guess I can... sorry, the next tab; this is that other one. I don't want to click open my Downloads; I'm scared I'm going to reveal some private information. This is my work computer, so I want to do it all in the browser. So let's go to the next one. Okay: "You are a social media ghostwriter. Given the below long-form article..." I like this one; it's short, so we can do some quick hits here. Generally, we would recommend putting the instructions after the document. That's similar to the earlier question about whether to put the information first or the instructions first: particularly with long documents, it's a little better to give the instructions at the end. Let's also put in some XML tags, and let's clean up the grammar a little: "Given the above long-form article, create five to ten tweets. Don't use hashtags, don't be hyperbolic, and don't be cringe. Return a JSON array of post content." It's probably good to give an example of the format. So we could... I guess we can just return a list. What did you originally have? Is there a special reason you wanted it to be a JSON array, or is it just to make it parsable? Okay. Yeah. So let's say "return..." And I'm just a huge fan of these tags, so let's do it like this. Okay. So that's some stuff without adding examples. The other thing I would want to do is give some illustrative examples of what it means to not be cringe. So how long are these documents? Perfect. Yeah, actually, let's run this as is. Oh, yep, thank you.
Now we can take this. What if they write cringe tweets about our product? I'm going to be embarrassed. Okay, this doesn't include any hashtags and doesn't seem very hyperbolic. Is this cringe? What do we think? "New feature alert"... that could be a little bit cringe. Okay, your name was Charlie; what do you think of this, Charlie? Not engaging. Okay. So yeah, we can try to make it more engaging without making it cringe. Let's say: don't use hashtags, don't be cringe, but try to make the tweets engaging. Are these meant to be tweeted from the Anthropic Twitter account or from an AI-influencer Twitter account? Okay. We'll see how this goes. I'm going to switch back to 3.5 Sonnet. "Exciting news..." Is it better to break things up into small sentences in the prompt, or to use big sentences? I think generally in English writing, it's better to use small sentences and small words, so I think it's probably also better to do that in a prompt. It's fine to use big words if you're really sure you know what you're doing and you know it's the exact right word for the situation; sometimes I'll find myself using more academic language if I want the output to seem more academic. Generally, though, simple, small sentences are better. Okay, so these are maybe a little more engaging; they have these questions, like "Want to try it?" What do we think? It's got exclamation points and question marks. Is it better? Do we want it even more engaging? Let's see. Okay. I honestly think temperature is a bit overrated; maybe we can see how that differs, though. I'm not sure exactly how to distinguish these from the previous ones; the temperature-one outputs look kind of similar to the temperature-zero ones to me. That's right. Yeah. So what I was going to say is: I think this is roughly as far as you can take this prompt without examples. The best thing to improve it would be either examples of the sort of tweets you want, or even an entire other document with example tweets that go with that document, and maybe multiple of those. If you're cost-limited, maybe you don't want to put in all those input tokens every time, but I don't know; the models are pretty cheap now, and we don't need to generate that many tweets. So if the tweets have any economic value to you at all, it's probably pretty cost-effective. It is more work on your part, though, because what you're doing then is... okay. The way I would actually do this is: I would start with some document. I would have Claude write a bunch of tweets. I would take the ones I liked, and maybe write some more, or get a friend to write some more, or maybe have Claude generate a hundred tweets and take the seven I liked best. Then I would put those in as an example, and from there I would sample: okay, now here's another document; write a bunch more tweets based on this. And I would iteratively build up this list of documents plus example tweets, then put them all into the prompt, and it would look something like this. So let's actually do that. Let's imagine we had done this. It could be, say, a system prompt: "You are an AI influencer who writes engaging social media content about new models and releases."
It could be: "Here are some examples of documents along with the tweets you wrote about them." Here, I'm just going to write a placeholder, but you would actually put the literal text of the document, and here again you'd put a literal tweet. That could be something you wrote, or something Claude wrote, or something Claude wrote that you then edited; a lot of times, Claude will give you an example that's not perfect but close enough, and you change it a little to make it perfect. I have honestly given multi-shot examples pretty short shrift in this talk so far, relative to their importance. In reality, I think most of the effort and most of the gains of writing a good prompt come from literally just picking the perfect document that goes here, picking the perfect set of tweets that go here, and altering them to modulate the tone. In some ways, that's more important than everything else I've said combined. Another way to do the whole JSON thing would just be with examples of Claude giving the output without a preamble; the JSON case is maybe an exception, because the prefill approach works so well, along with the tags, but for anything else, the examples are really huge. Anyways... One question: one response like this, or do you find more success with an exchange of messages between the agent and the user, where you're putting your few-shot examples in there? Really good question, and something I would dearly love to know the answer to, but I don't. I don't need to repeat the questions, I think people can hear them, but I'll repeat this one anyway. Do we want to put all the examples in one big giant examples block like this? Or do we want to structure the examples as a dialogue, where the human says something and the assistant says something back, and we're literally putting a large number of messages into the messages list? I typically do it this way, with a big examples block, but mostly because it's less work for me, and I don't have any evidence that it works better or worse. I did test this at one point on a few datasets and found it didn't make much of a difference for my particular case, but there are a lot of little particulars in my testing that make me not very confident in the result. So, sorry for a bit of an unsatisfying answer. I'll just say: if it is wrong to do one giant examples block, I don't think it's very wrong. And negative examples... so, would you give it a thing and say, this would be bad, because of this? Yes, I think that's good. I think it's good to include negative examples, particularly around the cringe thing, where Claude might mess up. Negative examples on their own don't usually get you there; you want some positive examples too. But it's especially great to have contrasting pairs: here's a document, here's a cringe tweet about this document, here's an excellent tweet about the same document, set side by side. I think that's pretty powerful; I do that, and I think it helps Claude.
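A sketch of the one-big-examples-block structure with the contrasting bad/good pair described here; the document text and tweets below are placeholders, not from the session:

```python
# Few-shot examples block with a contrasting (cringe vs. good) pair.
# Document text and tweets are illustrative placeholders.
EXAMPLES_BLOCK = """<examples>
<example>
<document>
[literal text of a past article goes here]
</document>
<bad_tweet reason="cringe">
New feature alert!!! You WON'T BELIEVE what our model can do now #AI #wow
</bad_tweet>
<good_tweet>
We taught the model to read messy spreadsheets. Here's what surprised us.
</good_tweet>
</example>
</examples>"""
```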
speaker 2: And if you also include the reasoning for it, right? So if it was a cringe tweet, there's a little bit of reasoning about why. Do you trust that reasoning from the model? If you ask it, hey, tell me what you were thinking when you were writing this tweet, and then it writes the tweet — when you're reading through your examples to choose the best ones, how much do you trust that reasoning, and how much do you rely on it versus just caring about the input and output? speaker 1: I don't trust the reasoning very much, and if it comes after something the model already said, then I really don't trust it. But then, humans are not very good at explaining why we do the things we do, either. We're really good at rationalizing and coming up with fake reasons, but a lot of the time we don't even know why we did something, let alone being able to coherently explain it to someone else. There's a subtlety here, though. Something that does work pretty well is having the model think about its reasoning in advance: go through different reasons or rationales for why it might choose one option or another, or think about what should go into a good response. If I had the model do some thinking in advance before it gave the response, then I might trust or assume the response would be better. A bunch of explanation for why it did the thing, given after the fact, I would probably not trust. So, you've had a question for a while. Do I give reasoning to explain the examples? Yes, I do a lot of giving reasoning to explain the examples. For instance, in this case, one thing we could do is add something like... I was going to say "tweet planning," but maybe it's "key points of document." And here we'd have some key points, like "the document presents the launch of..." So you would have this... sorry, this is part of the example, right here. In this particular example block, I gave it a document, now I'm doing this key-points business, and then I would have the tweets. The key points could be something I wrote myself, or something Claude wrote that I'm editing, or if Claude did a perfect job, maybe I just include what Claude wrote. But now, in order to get Claude to do this, we would also say something like: "Return in this format: key points, a list of the key points from the document." So this is a lightweight chain of thought, where we're having the model do some thinking in advance, and we also gave it examples of doing that thinking in advance, like this. Yeah. speaker 4: So let's imagine we really want to give examples like this, but we have a problem: our documents are super long, and I'm greedy and want to save on input tokens. Would you err on the side of one document but a really good example, or truncated versions of more documents? speaker 1: That's a good question. I would err on the side of one extremely good example, not truncated versions of more documents. But I would also want to look at the outputs and test that assumption, because it's possible that with only one example, Claude would fixate on aspects of that exact document and start trying to transfer them to your document. It would be case by case, but I would want to start with one extremely good example. Generally, I think fewer but higher quality is a better way to go than more but lower quality. Cool. Thank you. Okay, we have a lot of prompts here.
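A sketch of the lightweight chain-of-thought format described above, where the model lists key points before writing the tweets; the tag names are illustrative:

```python
# Lightweight chain of thought: ask for key points first, then tweets.
# Tag names are illustrative, matching the example blocks in the prompt.
OUTPUT_FORMAT = """Return your answer in this format:

<key_points>
- a short list of the key points from the document
</key_points>

<tweets>
["tweet 1", "tweet 2", ...]
</tweets>"""
```

Because the few-shot examples also contain `<key_points>` sections, the model sees the thinking-in-advance step demonstrated, not just requested.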
Let's go to the... okay, this is good. I was hoping we would get some persona ones. Okay, so this looks like something where we're trying to get Claude to role-play different personas. Let's try this out and see how it works. This looks like it's meant to be a multi-turn prompt, right? So this is a conversation: "You are talking to assistant..." It said execute, greater-than, p, assist... where's that? speaker 3: Oh, thanks so much. So you see three roles at the top. And then if you do that ">p assist" you see down there, highlighted in yellow, you can do that little arrow-p and pick a different persona, and then you can have them talk between themselves, or you can just switch. We use this for designers in our shop to do synthetic interviews with synthetic users; basically, it allows us to switch back and forth. speaker 1: And then what issues or troubles have you been having with it? speaker 3: I'm guessing you have seen a lot of role-playing prompts out there, so I was wondering if you see anything that's perhaps not as optimized as it could be, or any other best practices for role-playing, particularly with multiple synthetic personas within the same session. speaker 1: Yeah, okay. For single personas, there's one answer I would give. This multiple-personas thing I actually haven't worked that much with, but off the top of my head, here's how I would think about it. I would write a separate prompt for each persona, and then I would have the user's command trigger some coding logic that decides which bot, which prompt, to send that reply to. This is getting back to the thing I said before: don't do it in a prompt if you don't have to. I mean, there's a lot of thought that went into this prompt, which probably makes it work a lot better than it would have otherwise, but I think it's going to be easier if you just dynamically route the query based on what the user said. Does that make sense? Okay. speaker 3: You're talking about if you were to use the API. speaker 1: Yeah, exactly. But you're doing this just in the chat? speaker 3: This was just in the chat, but I definitely appreciate the note. So, maybe related to that: how much have you dealt with having a second thread with the API that acts as the entity capturing inputs from multiple personas into a single thread? You know what I mean? Say I build an app and have the user interact with these different synthetic personas, but then I have a second interaction with the API that ties these things together into a cohesive whole. I don't know if you've explored some of that; I'd be curious. speaker 1: Yeah, I don't have a great answer for that one, sorry. I do want to test this prompt out, though, just to see how it goes. So maybe here I would say... speaker 3: How would you switch it? Do the right-arrow p, then type Sam, and then say hey... Yeah, okay. You could say: hey, what's your process for finding the best medication pricing whenever you get sick, or something like that? And then in this particular case, if you switch to Joe: Joe is optimized more for convenience versus cost savings. So you have two different types of users we can learn from. Yeah.
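A minimal sketch of the separate-prompt-per-persona routing Zack suggests a moment earlier, with code, not the prompt, choosing which persona replies; the persona names, prompts, and model name are hypothetical:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical per-persona system prompts; in practice each would be a
# carefully written character description.
PERSONA_PROMPTS = {
    "sam": "You are Sam, a cost-conscious patient who compares medication prices.",
    "joe": "You are Joe, a patient who optimizes for convenience over cost.",
}

def reply_as(persona: str, conversation: list[dict]) -> str:
    """Route the conversation to the selected persona's own prompt."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=512,
        system=PERSONA_PROMPTS[persona],
        messages=conversation,
    )
    return response.content[0].text

# A user command like ">p sam" selects the persona outside the prompt.
print(reply_as("sam", [{"role": "user", "content":
    "What's your process for finding the best medication prices?"}]))
```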
speaker 1: Okay. So Claude did the thing here that I want to show you all how to get rid of. "As Sam..." That's not something Sam would actually say, right? So I don't know for sure this is going to work; I feel like a magician about to do a trick he hasn't practiced. But generally, something that's pretty useful here is to say: "Prepend each response with the name of the current persona in brackets." One thing I'm going to do is change this multi-shot a little as well, because if Claude sees itself not doing the thing I told it to do... actually, let's just redo the whole conversation, or take this out. Let's run that back. "You are talking to assistant..." nice. And now we can ask the same thing: what's your process for finding the best prices for medication? Oh, okay, so I guess we need to change it in a separate call. Okay, great. Now it's going to work, totally. It's going to totally work, okay? It's a little better, right? It didn't say "as Sam"; this is something a human might plausibly say, like "as someone who's..." Okay, I don't know; it's better than it was before, right? Maybe we could add something like: "You don't need to say too much about your persona in your responses; just stay in character." Hey, quick question: what are your thoughts on phrasing things in the negative sense? Yo, check it out, it worked a lot better. Sorry to interrupt. Oh yeah, very nice. So yeah, what are your thoughts on negative phrasing, like "you don't...", versus the positive sense? I think positive is a little bit better. In this case, I don't really have a good answer for why I phrased it negatively; I guess I did a combination: "you don't need to say too much, just stay in character." I think it's better to use a light touch if you're doing negative prompting. There's a little bit of reverse psychology going on: if you tell the model, don't talk about elephants, definitely no elephants, definitely don't say anything about elephants, it might become more likely to talk about an elephant. So if you do use negative prompting, have a light touch: say it once, but don't dwell on it. It's similar with parenting. If you don't want your kid to eat prunes, you just say, oh, we're not having prunes today, and change the subject. But if you really emphasize that there are no prunes to be had, you might get more pushback. Hi. I noticed you're not using the system prompt much; is there a reason for that? What do you think the biggest value of a system prompt is? Yeah, the system prompt. Personally, the only thing I ever put in the system prompt is a role. So I might say: you are this, you are that. I think Claude generally follows instructions a little bit better if they're in the human prompt and not in the system prompt.
The exception is things like tool use, where maybe there's been some explicit fine-tuning on certain system prompts. For general prompts like the ones we've been going over here, though, I don't really think you need to use the system prompt very much. Yeah. Thank you. One thing we've found when using the user prompt is that sometimes it makes the model more prone to hallucinations, because it thinks the user is saying it, and so we migrated things to the system prompt. I don't know if you have any experience with that. Yeah, I've actually heard that before, so it's possible I'm missing something; I've heard this from enough people that I could just be wrong, so I'm unusually likely to be wrong when I say this. I think that if you put in a bunch of context and say, "here's the message from the user," open a user tag, put the message in, and close the user tag, it will work, and you won't have that issue anymore. That said, maybe it does fall over sometimes. But that would be my default: just specify even more clearly. If you're having this issue, say, here's the message from the user, and here's the stuff I want you to do, and I think it probably won't get confused. I have a question about the counter-examples. Before, in order to get it to not say cringey things, you said to provide it with a counter-example. But here, in the case of this character bot, you haven't provided any counter-examples. So this is a generic question: given that the model is trained with preference optimization, with examples and counter-examples, do you get a better result by prompting that way? Well, I don't know that the details of the RLHF have that much bearing, because when the model is trained, it doesn't usually see both of those in the same window; it's more some stuff that happens with the RL algorithms. So I don't think that's necessarily the right way to think about it. With counter-examples, I don't feel I have to include them in every prompt; they're just a tool in my toolbox that I use sometimes. In regards to negative prompting: do you think it would be better to do negative prompting using control vectors, like what you talked about in your Scaling Monosemanticity paper, maybe having a negative version of the vector as your kind of negative prompt, instead of mentioning it in the prompt outright? Yeah. Steering is still super new; we don't know how well it works relative to prompting. I'm a die-hard prompter till the end. I've played around with it a little, and I haven't found it to work as well as prompting in my experience so far. That said, there are a lot of research improvements I won't get into in detail, but there's a lot that could make it work better. Right now, it's about finding these features and then steering according to them, and the features are sort of abstractions on top of the underlying vector space. There are other possibilities for how you could steer; there are academic papers where you steer according to just the differences in the activations, rather than trying to pull it out into a feature first. So maybe that would work a bit better. The control-vectors thing I haven't played with enough to know for sure, but I think something along those lines
will work eventually. I can't say whether in the long term it'll work better or worse than prompting; right now, I still think prompting works a lot better. From my experience with smaller models and trying to work with control vectors, I've seen that it's better for style than for actual deterministic behavior. Yeah, pretty interesting. Sometimes stuff from smaller models transfers; sometimes it doesn't. I don't have a great intuition for what does and doesn't transfer between small and large models. But yeah, all good points. Thank you. Okay, I think we've gone over this role-play stuff enough. Let's go to the next one. "I'm going to upload a few screenshots and my dating profile..." Okay, this is our first image one. Are there any screenshots? No screenshots. Okay, actually somebody responded in the very first thread reply, so since we're doing images, maybe I'll start there. That was you? Let me find my message here. It's at the bottom. Riveting to watch me scroll through this channel, I'm sure. Oh, there are a lot of messages here. Okay, here we go. So, can I copy the image? Copy image. Okay, cool. I actually don't know if I can paste it into the console, so I might fall back on using claude.ai for this. Pasting images is not enabled right now. Okay, let's try claude.ai then. So the prompt here was: "high-performing validated AI model..." Sorry, I lost all your formatting here. Okay, so in this image, if we zoom in, it's supposed to get "Maddie White" and 86... I have a hard time reading this. 86, 87. That's the hard one that it messes up on: 86, 87. And then 86, 87 down here. And then in this one, it's just 867. Yeah, that's a typo, a typo by the human, and I'm hoping it can correct for that. I'm just trying to pull out the average data, all three combined. Okay. And it said... it looks like it said "middle mark," so it's misreading it. Okay. I don't know too many good tips for images, but I'll tell you what I have. One of the things you said was that it works better with zooming in and cropping the image. That's definitely the easiest win you can have: giving a higher-quality image and taking out the unnecessary details. That might be hard to do programmatically, because you don't know which details are necessary and unnecessary. But for the same reason that including extraneous information in text form means you probably won't get as good results, if you include extraneous information in image form, the results probably won't be as good either. The more you can narrow in on the exact information you need, the better. Does the model downsample large images? I don't know, slash can't talk about that. But definitely, higher-quality, bigger images are better. "I did just read on your website that it downsamples to a thousand pixels by a thousand pixels." Okay, great; that can be found with Google. Okay. So then, any general tips on how to discuss images with Sonnet? My number one tip for images is to start by having the model describe everything it sees in the image. I don't know if that will work here; this example is hard enough for me to even read that I kind of doubt Claude will do well on it regardless of what we say. But we can give it another shot.
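A sketch of the describe-first tip using the Anthropic Python SDK's image input; the file name and model name are illustrative:

```python
import base64
import anthropic

client = anthropic.Anthropic()

# Illustrative: a pre-cropped, high-quality scan. Cropping and zooming in
# is the easiest win, per the discussion above.
with open("cropped_form.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_data,
            }},
            # Tip from the session: have the model describe everything it
            # sees before asking the actual extraction question.
            {"type": "text", "text": "First, describe everything you see in "
                                     "this image. Then extract the name and "
                                     "the numeric scores."},
        ],
    }],
)
print(response.content[0].text)
```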
One thing I've noticed when I attempted that: if I asked it to go tube by tube in an image like that, and it came to the wrong conclusion on the first tube, it would carry that same wrong conclusion across multiple tubes. There was some kind of directionality in its thinking: if it got the answer wrong at first, it would project that onto the rest of the image it was analyzing. Yes, I think that's right: if the model starts off on the wrong path, it will probably just continue down the wrong path. It's trying really hard to be self-consistent. And this is why self-correction is such a big frontier for LLMs in general right now. I've found that the OCR models I've played with, like Textract, don't do as well with some of this handwriting. If I zoom in on this image, Claude actually performs pretty well on some fairly messy handwriting that's hard even for a human. Okay. Yeah. Unfortunately, I don't know how to get better results out of this other than maybe cropping it better and upsampling. Cool, thank you. Oops. Here we are. Let's scroll up to where we were before. Okay, yeah, I still see no screenshots there. "How do I enable Artifacts?" Oh, okay, it's just a setting; it's in the bottom left of the UI. You have to enable it. You can find it online; our people will help you. "I often dump full traceback errors directly into the prompt box. In the API, it seems exceptional at not running into traceback loops. I don't know if that's intentional. Literally, I'll just take the entire traceback, zero context, and dump the entire thing in, and it'll give me the fix." Okay, great. "But can you elaborate on how that might work? This only really appeared very recently; you used to have to explain a lot explicitly." The models got better, man. They get better every release. "I understand, but this is prompt engineering we're talking about, right? So I'm just wondering: is this a form of prompt engineering, or is this the model being good?" Sounds more like the model being good, if you're just dumping it in, right? I'm going to move to this prompt here. Okay. To the person who uploaded this, and generally to anyone who's uploaded more examples: if you could put some stuff in the thread about what the issue is that you're having, or why it's not working, that would be great. speaker 3: This is actually a follow-up to what I was doing with the translation. Basically, what I'm trying to get it to do is analyze the text. In this case, there's original English and there's a bad Japanese translation, and I'm trying to get it to score how good the translation is, between 1 and 5. What I've been doing is adding a lot of stuff to try to get it to do chain-of-thought, because it does notice errors, but it generally does a very bad job at scoring. speaker 1: Yeah. Okay, so this is great. I'm really glad you asked this, because model grading is something that would be incredibly useful if it worked, and right now it's in a place where it sometimes kind of works and sometimes doesn't. So let's paste this into the Console. Okay, so we have some English text, and we have some Japanese?
Yep, yep, yep. Then we've got the Japanese text. So is this a good translation or a bad translation? A terrible translation. Okay. Now, Claude is actually supposed to be good at Japanese too. Oh, this is somebody else. But I'm saying, if Claude is good at Japanese, it should be good at judging other people's Japanese, in theory, in theory. Okay, so now that we've got the answer here... I guess it stalled out for whatever reason. So what are we doing in this prompt? We're scoring between 1 and 5, as follows: one is "many grammatical errors a native speaker would never make," then "contains multiple grammatical errors," then "an average-quality translation with some errors." Okay, that looks pretty good. "Look for specific clues or indicators." I don't know why it stalled; I think that's just a bug. I think it will conclude here. Okay, so it gave it a three, but we wanted it to give a one. Now, have you found that it's generally too forgiving, too strict, or just all over the place? Well, it's all over the place. speaker 3: Also, it sometimes seems to confuse content with quality. For example, this is from, I think, the RLHF set. The content is fine, so it might not rate the translation low even though it's a terrible translation, because it thinks the content is okay; even though, if you ask it to list all the errors, there are a dozen errors in that single piece of text. So I'm also trying to see whether there's a way to get it to separate out the grammatical, the actual translation, errors versus the content. speaker 1: Yeah, okay, a few thoughts. Generally, if you ask the model whether some text is good or bad, and the text is about a nice subject, it's more likely to say it's good: "it's a good translation, it's well written, it flows very logically." And if it's about a negative subject, it's more likely to criticize it and say it doesn't flow well. I think you can get at those issues by adding language about it in the prompt. So for instance, here you might say something like... hmm. speaker 3: One is the worst and five should be the best. speaker 1: That's sort of implicit in this rubric up here, but it might be good to say it anyway. speaker 3: How do you get around the fact that you may not have a Japanese tokenizer? Do you have a Japanese tokenizer? speaker 1: The API will tokenize anything, but how it was trained is pretraining detail and off topic. Also, I don't think there's actually a published tokenizer for Claude. If you upload some text, it will be tokenized, but the tokenizer isn't public, so you're not going to get a really precise answer about how it handles Japanese text. speaker 3: I've tested this. Claude speaks the best Japanese of any model available. speaker 1: We don't need to debate Claude's Japanese skill; that's just... speaker 3: But the tokenizer isn't available, and so it would be interesting. speaker 1: Yeah. "I have a question about programming." Can we actually cut the questions off while I type this prompt here? Okay. So what we're trying to do here is get it to distinguish between the ethical nature of the text and the quality of the translation. speaker 3: So is it useful to tell it to be extra critical? Say, "you're grading a graduate course," something like that.
speaker 1: You know, I'm a little bit all over the place; I need to type this out before I can clear the queue and respond to other questions. So, it's important to distinguish between the... what's a good word here? The riskiness of the topic versus the quality of the translation. Yeah, so something like this could help the model not pay so much attention to the subject matter. The main thing I would want to do for this prompt is add a bunch of examples. For each category, I would add at least one example of that category. So I'd have a really bad translation and say why it's a one, and I'd have an example of a two-level translation, a three-level translation, and so on. In each case, before you get to the answer, you'd have the explanation for why it's good or bad. I can't type all that here, among other reasons because I don't speak Japanese, but I think that's the most valuable thing you could do here. Otherwise, the formatting looks really good, and the fact that you're doing the chain of thought in advance looks good. I think maybe with this "specific clues or indicators" part, you could go into a little more detail about what those are. speaker 3: Okay, so you'd anchor every single grade with an example, basically? speaker 1: Yeah. I mean, it's tedious to write out all these examples, but Claude can help, and then, a, you only have the problem of editing Claude's response versus writing it all yourself, and, b, it really does lead to better performance versus almost anything else you could do. In terms of the scale on these rubrics, I think either a number scale or good/bad is fine. The thing I'd be careful about with a number scale is that I don't think it's very well calibrated; if you tell it to choose a number from one through 100, it's not going to be an unbiased estimator, necessarily. So I'd probably limit the granularity to maybe five different classes. Yeah. speaker 4: This is a case where you might be able to utilize log probs, and I'm wondering if that's something you ever use in any of your work or other prompts. speaker 1: Yeah, I think I agree: this could be a case where log probs would be useful, if you could get the probability of each grade. But here's the thing with log probs: you really want the chain of thought beforehand, and the chain of thought, I think, is going to get you more of a win than using the log probs. And I don't think there's a way, in any model, where you can say "sample all the way through, and then, after you output this closing chain-of-thought tag, give me the log probs of whatever comes next." So if you are going to use the log probs, you're looking at a multi-turn setup: first sample all the chain of thought, all the pre-cogitation the model wants to do, cut it off there, re-upload that chain of thought as a prefill message, and then get the log probs. But for that, you need a model that has both the prefill capability and the log-prob capability, and I'm not sure which models have both. speaker 4: Walk me through why it wouldn't just be sufficient to, in this case, ask for a score from one to five, only return that, and then look at the log probs of what it returns. speaker 1: Yeah.
So what you're losing there is whatever intelligence boost you got from having the model do the chain of thought. My sense is that chain of thought, plus having the model say one, two, three, four, or five, is going to be more accurate than the additional nuance you'd get by having it give you the log probs, because it's actually doing a lot of its thinking in that chain of thought. You're leveraging more computation; you're getting more forward passes, for all the same reasons that chain of thought is usually a good idea. speaker 4: Are you talking about chain of thought as in it's actually writing out a chain of thought before the answer? speaker 1: Exactly. Which is what we see in this prompt right here, right? We have this analysis section. So if we cut out this whole analysis section, we're really tanking the number of forward passes the model can do. speaker 4: Before it gave you the answer. Okay, cool. That's good to know. Thank you. speaker 1: What's up? Okay. I'm being told that I should answer a couple more questions and then get off stage. Before I came here, I was honestly really worried that no one would have questions and I'd be supplying my own, but you all have had amazing questions so far, and amazing examples, so I really appreciate that; it's made this go well. Sorry, I'm supposed to say that at the end, but I'm giving a pre-thank-you; then we can do the encore. Yeah, I wanted to add that having a numbered list may give more weight to number one, two, three, four, five, versus an unstructured list, so it may weight the score in the output. If you just change it from one-two-three-four-five to little dashes for the criteria... just my experience of what I've been seeing. Okay, cool. And just FYI, I've replied back with the fixed prompt, my own improvement of what I think is a better prompt for this. Yeah. Okay, awesome. Nisha, should I do another prompt, or should I just answer a couple of questions and then head off? All right, let's do one last prompt. Which one should we choose? Okay, let's do this one: good old mitigating hallucinations, because we haven't really done that. Okay: "Please provide a summary of the text provided as input." The first thing I'll do is just move these instructions down. Now, Matt, did you have an example of a document where it hallucinates with this prompt? Yeah? Can you just put that in the thread? Okay: "Your summary should be concise while maintaining all important information that would assist in helping someone understand the content. If it mentions any dates... Don't start or end with anything like 'I have summarized the journey for you' or 'Here is the summary you asked for.'" Yeah. So this one we can fix with a prefill. Okay, we could do something like this. Now, how about the hallucination part? The best trick I know for getting around hallucinations in a case like this is to have the model extract relevant quotes first. So what I would do here is say something like... And now, of course, in this prefill, since we're having it produce relevant quotes first, we wouldn't want to start with "Summary:"; that would just be confusing slash wrong. So we could say "Here are the..." something like this. Okay. Did you get the doc yet? Okay. And then, of course, I'd put the document here. Yep. Okay, I think I should get off stage, so let me just call it here.
And Matt, we can talk after. Once again, I really appreciate you all coming out. It's been amazing to have such a great, engaged audience. I've had fun, I learned some things, and I hope you all did too. I'm planning to stick around this event for the rest of the afternoon. I don't know exactly where I'll be, but just DM me if you want to come find me, or find me in chat. I'm always happy to talk prompt engineering; it's my truest passion at this point in the world. So find me, hit me up, and we'll talk. Yeah, this has been great. Thank you so much.
Latest Summary (Detailed)
Overview / Executive Summary
This session is a Claude prompt workshop held on August 17, 2024, led by Anthropic engineer Zack Witten. The workshop was primarily hands-on: prompts submitted by attendees were written, tested, and edited live to improve their performance. Jamie Neuwirth, Anthropic's startup sales lead, opened by introducing recent major releases, including the Claude 3.5 Sonnet model and the Artifacts feature, and emphasized Anthropic's commitment to helping startups grow.
Zack Witten then laid out the core principles and practical techniques of prompt engineering. He returned repeatedly to two core ideas: 1. Prefer code for deterministic problems: when a task (such as formatting or post-processing) can be done reliably in code, do it in code rather than leaning on less predictable prompting. 2. High-quality examples matter most: a small number of carefully chosen few-shot examples, especially ones that include chain-of-thought, are the key lever for improving model performance and controlling output.
On specific techniques, Zack stressed clearly separating the parts of a prompt (XML tags are recommended, since Claude's training data contains a great deal of XML), placing instructions after the information (especially with long documents), and writing multilingual prompts in the target language (with a native speaker's help if possible). For common problems he demonstrated several strategies:
1. JSON output: assistant prefill plus the API's stop-sequences parameter reliably produce clean JSON.
2. Conciseness: explicit length limits (such as "1-2 sentences, never more than 3") work better than a vague "be concise."
3. Role play: improve it with explicit instructions and gentle negative prompting.
4. Model grading/evaluation: give a clear rubric and, for each grade level, positive and/or negative examples with chain-of-thought explanations, and explicitly separate the content itself from the thing being evaluated.
5. Reducing hallucinations (summarization): have the model extract relevant quotes first, then summarize from the quotes.
6. Messy or complex data: decompose the problem and work through it step by step.
7. Image understanding: provide high-quality, cropped images, and consider having the model describe the image first.
Zack also discussed use of the system prompt (he personally puts only the role definition there), the "light touch" principle for negative prompting, and setting temperature to 0 for knowledge work to slightly reduce hallucinations. The workshop was full of live demonstrations and audience Q&A, and stressed the importance of iterative testing.
Anthropic's Recent Releases and Startup Support (Jamie Neuwirth)
Jamie Neuwirth, Anthropic's startup sales lead, gave a brief introduction before the workshop.
- Recent major releases:
  - Claude 3.5 Sonnet: a new model that has received enthusiastic user feedback. Zack Witten has tried it and uses it in the workshop.
  - Artifacts: a new feature launched alongside the Claude Teams and Claude AI plans.
    - Capability: turns a few lines of text into code (such as simple video games), diagrams, and more.
    - Use cases: business diagramming, pharma companies visualizing and sharing gene-sequencing discussions, prompting cookbooks, and more.
  - Claude Teams and Projects: features for team collaboration and project management.
- Startup support:
  - Goal: empower the next wave of AI companies built on Claude.
  - Concretely:
    - Higher API rate limits.
    - Access to experts like Zack.
    - Early access to products.
- Contact: sales@anthropic.com. Anthropic's head of DevRel, Alex, also attended the event.
Zack Witten's Prompt Workshop: Core Principles and Practice
Zack Witten (the "prompt doctor") led the core of the workshop, sharing techniques for improving Claude's performance by editing and testing prompts submitted by attendees live.
Workshop Format and Setup
- Format: live and interactive. Attendees submitted underperforming prompts and failure cases via the Slack channel (prompt-live-workshop-anthropic); Zack analyzed and edited them on the spot and tested them in the Anthropic Console.
- What to submit:
  - The prompt template, with variables written as {{double curly braces}}.
  - Concrete examples of poor performance.
- Tools used:
  - The Anthropic Console (including its Evaluate tab, possibly with some "secret new features" on display).
  - Claude for Sheets (calling Claude from a spreadsheet).
General Prompt Engineering Best Practices
- High-quality examples are central: a small set of carefully chosen few-shot examples is one of the most effective ways to improve performance and control output precisely. Examples should clearly show the desired input/output format and content.
- Clarity and structure (see the sketch after this list):
  - XML tags: strongly recommended (e.g., <information>...</information>, <instructions>...</instructions>) for separating the parts of a prompt (background information, instructions, user input). Claude's training data contains a huge amount of XML, so it typically understands and follows this format better than Markdown. The essential point is "clearly separating the different parts of the prompt."
  - Instruction placement: put the information above the instructions. In general, "the closer an instruction is to the bottom of the prompt, the more faithfully it is followed."
  - Grammar and spelling: keep the prompt grammatical and typo-free. Zack cites "anecdotal evidence that this helps" and says "typos do hurt performance"; he is careful about this personally and notes "it definitely doesn't hurt."
- Emphasis: capital letters, exclamation marks, or an explicit "this is extremely important" are effective ways to emphasize an instruction.
- Conciseness and specificity: for output length, a concrete numeric range (such as "each reply should be 1-2 sentences, never more than 3") beats a vague "be concise."
- Multilingual prompts: for translation or multilingual output tasks, write the prompt in the target language when possible; ideally, have a native speaker help write and refine it.
- Temperature: for "knowledge work," Zack usually sets temperature to 0, which he believes "probably reduces hallucinations slightly."
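As a rough illustration of these structural habits, here is a minimal sketch using the anthropic Python SDK. The tag names, model string, and placeholder document are illustrative choices, not taken from the workshop.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

document = "..."  # the long reference text goes here

# Information first, instructions last, each part fenced in XML tags.
prompt = f"""<information>
{document}
</information>

<instructions>
Answer the user's question using only the information above.
Reply in 1-2 sentences, never more than 3.
</instructions>"""

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    temperature=0,  # Zack's default for knowledge work
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```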
Prompt Techniques for Specific Problems
1. Ensuring consistent JSON output
- Problem: the model sometimes prepends a lead-in ("Here is the JSON:"), appends an explanation afterward, or drifts from strict JSON.
- Core solutions (see the sketch after this list):
  - Assistant prefill: an Anthropic API feature that lets you pre-seed the beginning of the model's reply.
    - Method 1: put the unwanted lead-in (such as "Here is the JSON:") into the prefill; the model treats it as "already said."
    - Method 2: prefill the lead-in plus the opening brace (such as "Here is the JSON: {"), which pushes the model straight into JSON mode. Note: when post-processing, prepend the prefilled brace back onto the actual JSON output.
  - Wrap the JSON in tags: instruct the model to wrap its JSON in a dedicated tag (such as <json_output>...</json_output>).
  - Combine prefill and tags: for example, prefill with "Here is the JSON: {".
  - Stop sequences (API feature): pass the JSON closing tag (such as </json_output>) or the closing brace } as a stop sequence. The model halts as soon as it emits it, avoiding trailing commentary and saving tokens.
- Principle: Zack's refrain: "If code can solve it (formatting, extracting a particular section), don't over-rely on the prompt." Code is more reliable and cheaper.
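A minimal sketch of the prefill-plus-stop-sequence combination, assuming the anthropic Python SDK; the <json_output> tag follows the pattern above, and the extraction task itself is invented for illustration.

```python
import json
import anthropic

client = anthropic.Anthropic()

# Prefill the lead-in, the wrapper tag, and the opening brace.
PREFILL = "Here is the JSON:\n<json_output>\n{"

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    stop_sequences=["</json_output>"],  # halt as soon as the closing tag appears
    messages=[
        {"role": "user",
         "content": "Extract the name and email from this text as JSON:\n..."},
        # Assistant prefill: generation continues from here, so the lead-in
        # counts as "already said" and output starts inside the JSON object.
        {"role": "assistant", "content": PREFILL},
    ],
)

# The prefilled opening brace is not part of the returned text, so add it back.
data = json.loads("{" + message.content[0].text)
```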
2. Extracting data from messy spreadsheets/CSVs
- Problem: when extracting information from large, poorly formatted spreadsheets containing multiple datasets into JSON, the model is prone to hallucinating or dropping data.
- Advice (a sketch follows this list):
  - Decompose the problem:
    - Process a smaller number of rows or sheets at a time.
    - Attend to only the necessary columns at a time.
    - Break the big extraction task into many small, bite-sized subproblems.
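A sketch of the decomposition idea, assuming the anthropic Python SDK; the file name, column list, and chunk size are all illustrative choices.

```python
import csv
import anthropic

client = anthropic.Anthropic()

NEEDED_COLUMNS = ["name", "amount", "date"]  # only the columns you actually need
CHUNK_SIZE = 20  # small, bite-sized batches of rows

def extract_chunk(rows: list) -> str:
    """Extract one small batch of rows instead of the whole sheet at once."""
    trimmed = [{col: row.get(col, "") for col in NEEDED_COLUMNS} for row in rows]
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Extract these rows as a JSON list of objects:\n{trimmed}"}],
    )
    return message.content[0].text

with open("messy.csv", newline="") as f:
    rows = list(csv.DictReader(f))

results = [extract_chunk(rows[i:i + CHUNK_SIZE])
           for i in range(0, len(rows), CHUNK_SIZE)]
```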
3. Improving content-creation quality (e.g., generating tweets)
- Problem: generated tweets can be unengaging or come across as "cringe."
- Core solution:
  - High-quality few-shot examples are the key lever:
    - One or two excellent, complete examples (input document plus desired tweet output) beat several truncated or mediocre ones.
    - Examples should embody the desired tone and style ("not cringe," engaging).
    - Contrastive examples help: a "cringe" tweet and an "excellent" tweet for the same document, side by side (sketched after this list).
    - Explain within the example why a given output is good or bad.
    - Bake chain-of-thought (CoT) into the examples: show the model's "thinking," for instance first extracting key points from the document and then generating tweets from those points, and instruct the model to follow the same steps on new inputs:

```xml
<example>
  <document_text>...</document_text>
  <key_points>
    <point>...</point>
  </key_points>
  <tweets>...</tweets>
</example>
```

  - Build the example set iteratively: start from a basic prompt, have the model generate outputs, keep the good ones, edit the bad ones or write some yourself, and gradually accumulate high-quality pairs.
- Organizing examples: Zack personally prefers putting all examples inside one big <examples> block rather than simulating a multi-turn dialogue, mainly because it is easier to work with and he has seen no significant performance difference.
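A sketch of what a contrastive pair inside a single <examples> block might look like; the document text, tweets, tag names, and final instruction are invented for illustration.

```python
# All examples live in one big <examples> block; a contrastive pair shows
# a bad and a good tweet for the same document, with an explanation.
EXAMPLES = """<examples>
<example>
<document_text>We shipped incremental backups in v2.1.</document_text>
<cringe_tweet>HUGE news fam, backups just got INSANE!!</cringe_tweet>
<good_tweet>v2.1 is out: incremental backups, so nightly jobs finish in
minutes instead of hours.</good_tweet>
<explanation>The good tweet states a concrete benefit; the cringe tweet
relies on hype instead of information.</explanation>
</example>
</examples>"""

prompt = (EXAMPLES
          + "\n\nNow write a tweet for this document:\n"
          + "<document_text>{{DOCUMENT}}</document_text>")
```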
4. Role play and managing multiple characters
- Problem: the model can break character during role play (e.g., replying "As Sam, I would...") and juggling multiple characters in one prompt gets complicated.
- Advice:
  - Improving single-character role play:
    - Explicitly instruct the model to prefix each reply with the current character's name in brackets (e.g., "[Sam] My process is...").
    - Use gentle negative prompting without over-emphasis, e.g., "You don't need to mention your character identity much in your replies; just stay in character."
  - Managing multiple characters:
    - At the API layer: write a standalone prompt per character and use external code to route each request to the right character's prompt based on the user input. This is usually simpler and more effective than managing several characters inside one prompt (the "code first" principle again); see the sketch after this list.
- Negative prompting:
  - In general, positive phrasing (telling the model what to do) beats negative phrasing (telling it what not to do).
  - If you do use negative prompting, keep a "light touch": say it once, don't repeat it, or you risk a "don't think about elephants" effect.
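A minimal sketch of routing at the API layer, assuming the anthropic Python SDK; the character names and prompts are invented for illustration.

```python
import anthropic

client = anthropic.Anthropic()

# One standalone prompt per character; ordinary code decides which to use.
CHARACTER_PROMPTS = {
    "sam": "You are Sam, a gruff but kind ship mechanic. Stay in character.",
    "ava": "You are Ava, the ship's overly formal navigation AI. Stay in character.",
}

def reply_as(character: str, user_message: str) -> str:
    """Route the request to the selected character's own prompt."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        system=CHARACTER_PROMPTS[character],
        messages=[{"role": "user", "content": user_message}],
    )
    return message.content[0].text

print(reply_as("sam", "The engine is making a clicking noise."))
```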
5. Image understanding and data extraction
- Problem: accurately extracting text (especially handwriting) from images.
- Advice (a sketch follows this list):
  - Image quality: supply the highest-quality, clearest image possible. Crop to the region containing the target information and remove irrelevant background. Claude downsamples images to 1000x1000 pixels.
  - Describe first: consider asking the model to "describe everything it sees in the image" before asking questions about it.
  - Error propagation: once the model misreads part of an image early on, it tends to carry that error through the rest of its analysis to stay "self-consistent."
  - Zack noted that for especially hard-to-read handwriting, challenging even for humans, there is no instant prompting fix beyond improving image quality.
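A sketch of the describe-first approach, assuming the anthropic Python SDK's image content blocks; the file name and questions are illustrative.

```python
import base64
import anthropic

client = anthropic.Anthropic()

# Crop to the region of interest before encoding; extraneous background
# hurts image results the same way extraneous text hurts text results.
with open("cropped_label.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "First, describe everything you see in this image. "
                     "Then transcribe the handwritten name and numbers."},
        ],
    }],
)
print(message.content[0].text)
```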
6. Model grading and evaluation (e.g., translation quality)
- Problem: when asked to grade text (such as a translation) against a rubric, the model's scores can be inaccurate and unstable, and it may conflate the quality of the content's subject matter with the quality of the thing being evaluated (such as the translation's grammar).
- Core solutions (see the sketch after this list):
  - Separate the concerns explicitly: instruct the model to distinguish "the topic/sentiment of the text" from "the quality of the object being evaluated." For example: "Your score should reflect only the quality of the translation (grammar, fluency, accuracy), not the nature of the source content."
  - High-quality few-shot examples:
    - Provide at least one example for every level of the rubric.
    - Each example should include the input text, its score, and a detailed chain-of-thought explanation of why it earned that score.
  - Require chain-of-thought: have the model analyze first (strengths, weaknesses, which criteria are or are not met) before committing to a final score.
  - Scale: a numeric scale (such as 1-5) is workable, but don't expect fine granularity (such as 1-100); the model is unlikely to be well calibrated. Around five levels is reasonable.
  - Log probs: an audience member asked about using log probabilities to get the model's confidence over the scores. Zack's view: you would want the chain-of-thought to happen first and then read the log probs, and the accuracy gain from "chain-of-thought plus a discrete score" likely outweighs the extra nuance from log probs alone, since the chain-of-thought itself buys more computation (more forward passes).
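A sketch of a rubric-plus-chain-of-thought grading prompt, assuming the anthropic Python SDK; the rubric wording, tag names, and stop sequence are illustrative, and a real prompt would also include one worked example per grade level, as discussed above.

```python
import anthropic

client = anthropic.Anthropic()

GRADING_PROMPT = """<original>{original}</original>
<translation>{translation}</translation>

Score the translation from 1 (worst) to 5 (best):
1: many grammatical errors a native speaker would never make
3: average quality with some errors
5: fluent, accurate, and natural

Your score must reflect only the quality of the translation (grammar,
fluency, accuracy), NOT the subject matter of the text.

First list strengths and weaknesses inside <analysis> tags, then give
the final score inside <score> tags."""

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    temperature=0,
    stop_sequences=["</score>"],  # stop immediately after the grade
    messages=[{"role": "user",
               "content": GRADING_PROMPT.format(original="...", translation="...")}],
)
# Everything after the opening <score> tag is the grade itself.
score = message.content[0].text.rsplit("<score>", 1)[-1].strip()
```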
7. Reducing hallucinations in summarization
- Problem: the model can fabricate facts when summarizing text.
- Advice (one effective trick, sketched below):
  - Step 1: instruct the model to first extract all relevant quotes from the source text.
  - Step 2: instruct it to write the summary based only on those extracted quotes.
  - Assistant prefill can kick off the two-step flow, for example by prefilling "Relevant quotes:".
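A sketch of the quotes-then-summary flow with an assistant prefill, assuming the anthropic Python SDK; the tag names and instructions are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

prompt = """<document>
{document}
</document>

First, extract every quote from the document that is relevant to a summary,
inside <relevant_quotes> tags. Then write a concise summary based ONLY on
those quotes, inside <summary> tags. Do not start or end with any preamble."""

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt.format(document="...")},
        # The prefill forces the quote-extraction step to come first and
        # suppresses openers like "Here is the summary you asked for."
        {"role": "assistant", "content": "<relevant_quotes>"},
    ],
)
print("<relevant_quotes>" + message.content[0].text)
```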
Other Notable Points and Q&A
- System prompt vs. user prompt:
  - Zack personally puts only the role definition (such as "You are a helpful assistant") in the system prompt. He finds Claude generally follows instructions in the user prompt at least as well.
  - He acknowledged that some users report that moving complex instructions into the system prompt keeps the model from mistaking them for user input, reducing hallucinations. His suggestion for that problem is instead to mark the user input more explicitly inside the user prompt (e.g., <user_input>...</user_input>); a sketch follows.
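A sketch of that tagging suggestion; the helper function and tag name are hypothetical.

```python
def build_prompt(instructions: str, user_message: str) -> str:
    """Label the user's text explicitly so the model doesn't read the
    surrounding instructions as something the user said."""
    return (
        f"{instructions}\n\n"
        "Here is the message from the user:\n"
        f"<user_input>\n{user_message}\n</user_input>"
    )
```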
- The code/prompt boundary: once more, tasks that code can do deterministically (format checks, extracting tag contents, stopping output) should be done in code rather than through elaborate prompting, which is less stable and more expensive.
- Iterate and test: prompt engineering is a continuous cycle of testing and iteration. The Console's Evaluate feature and a good set of test cases are essential.
- Tracebacks: an audience member noted that pasting a full traceback error into Claude with zero context reliably produces a fix. Zack commented that this reflects "models getting better" more than any particular prompt-engineering technique.
Conclusion
Zack Witten's prompt workshop highlighted practical strategies for improving Claude's output through structured prompts, high-quality examples, and API features such as prefill and stop sequences. The core ideas are to be clear and explicit and to decompose complex problems. Above all, he repeatedly advocated the "code first" principle: between the nondeterminism of a prompt and the determinism of code, prefer code for anything that can be precisely specified; it is both more reliable and cheaper. The session gave AI engineers hands-on experience and directly applicable techniques, and encouraged iterative testing as the path to better prompts.