speaker 1: Basically, this entire roundtable session here is just going to be focused mainly on prompt engineering. Variety of perspectives at this table around prompting from a research side, from a consumer side, from like the nenterprise side. And I want to just get the whole wide range of opinions because there's a lot of them and just kind of open it up to discussion and explore what prompt engineering really is and what it's all about. And we'll just take it from there. So maybe we can go around the horn with intros. I can kick it off. I'm Alex. I lead developer relations here at anthropic. Before that, I was kind of technically a prompt engineer at anthropic, worked on our prompt engineering team and did a variety of roles spanning from like a solutions architect type of thing to working on the research side. So with that, maybe I can hand it . speaker 2: over to David. Yeah my name is David Hershey. I work with customers mostly at anthropic on a bunch of stuff, technical. I help people with fine tuning, but also just like a lot of the generic things that make it hard to adopt language models of prompting and just like how to build systems with language models, but spend most of my time working with customers. speaker 3: Well, I'm Amanda asco. I lead one of the fine tuning teams at anthropic, where I guess I try to make Claude be honest and kind. speaker 4: Yeah my name is Zach whton. I'm a prompt engineer, anthropic alx. And I always argue about who the first one was. He says, sagets me contested. Yeah I used to work a lot with individual customers, kind of the same way David does now. And then as we brought more solutions architects to the team, I started working on things that are meant to raise the overall levels of like ambient prompting in society, I guess, like the prompt generator and like the various like educational materials that people use. speaker 1: Nice. Cool. Well, thanks guys for all coming here. I'm gonna to start with like a very broad question, just so we have a frame going into the rest of our conversations here. What is prompt engineering? Why is it why is it engineering? What's prompt? Really? If anyone wants to kick that off, give your own perspective on it. Feel free to take the rain here. speaker 2: I feel like we have a . speaker 1: prompt engineer. speaker 2: Exactly. We're all prompt . speaker 1: engineers in our own form. But once of us has a job, Yeah. speaker 3: maybe since it's in your has a job you can give you don't have jobs. speaker 4: I guess I feel like prompt engineering is trying to get the model to do things, trying to bring the most out of the model, trying to work with the model to get things done that you wouldn't have been able to do otherwise. So a lot of it is just like clear communicating. I think at at heart, like talking to a model is a lot like talking to a person and getting in there and understanding the psychology of the model, which I commanded the world's most expert person in the world. speaker 1: So well, I'm I'm gonna to keep going on you. Why is why is engineering in the name? speaker 4: Like Yeah, I think it's engineering part comes from the trial and error. Okay. So one really nice thing about talking to a model that's not like talking to a person is you have this restart button, this like giant, like go back to squares zero where you just like start from the beginning. 
And what that gives you the ability to do that you don't have is like a truly start from scratch and try out different things in like independent way so that you don't have interference from one to the other. And once you have that ability to experiment and to design different things, that's where that the engineering part has the potential to come in. Okay. speaker 1: So what you're saying is like as you're writing these prompts, you're typing in you know a message to clad or in the api, whatever it is, being able to go back and forth with the model and iterate on this this message and reverb back to the clean slate every time. That process is the engineering part. This whole thing is prompt engineering all in one. speaker 4: There's another aspect of it too, which is like integrating the prompts within your system as a whole. And David has done a ton of work with customers, like integrating a lot of times. It's not just as simple as you write one prompt and you give it to the model and you're done. In fact, it's anything but. speaker 2: It's like way more complicated. I mean, I kind of think of prompts as like the way that you program models a little bit that makes it that you complicated because I think Zach is generally right that it's like just talking clearly is the most important thing. But if you think about it a little bit as like programming a model, you have to like think about where data comes from, what data you have access to. So like if you're doing rag or something like what can I actually use and do and PaaS to a model, you have to like think about trade offs in latency and how much data you're providing and things like that. Like there's enough systems thinking that goes into how you actually build around a model. I think a lot of that's also the core of why it like maybe his like deserves its own carve at as a thing to reason about separately from just a software engineer, rm or something like that. It's like kind of its . speaker 1: own domain of how to reason about these models is a prompt in this sense, then like natural language code, like is it a higher level of abstraction or . speaker 2: is it kind of a separate thing? I think like trying to get too abstract with a prompt is a way to like over complicate a thing, because I think we're gonna to get into it. But more often than not, the thing you wanna do is just like write a very clear description of a task, not try to like build crazy abstractions or anything like that. But that said, like you are compiling the set of instructions and things like that into outcomes a lot of times. And so precision and like a lot of the things you think about with programming about like version control and managing what it looked like back then when you had this experiment and like tracking your experiment and stuff like that, that's all you know, just equally important to code. So it's weird to be in this paradigm where like written text, like a nice essay that you wrote is something that's looked like the same thing as code. But it kind of is that now we write essays and treat them like code. And I think that's actually correct. speaker 1: Yeah. Okay, interesting. So maybe piggybacking off of that, we've kind of loosely defined what prompt engineering is. So what makes a good prompt engineer? Maybe Amanda, I'll go to you for this since you know you're trying to hire prompt engineers more so in a research setting, what does that look like? What are you looking for in that . speaker 3: type of person? Yeah, good question. 
I think it's a mix of like Zack said, sort of like clear communication. So the ability to just like clearly state things, clearly understand tasks, think about and describe concepts really well, that's like the kind of writing component I think. I actually think that being a good writer is not as correlated with being a good prompt engineer as people might think. So I guess I've had this discussion with people because I think there's some argument is like, maybe you just shouldn't have the name engineer in there. Like why isn't it just like writer? I used to be more sympathetic to that. And then I think now I'm like what you're actually doing? Like people think that you're writing like one thing and you're kind of like done. And then I'll be like you know to get like a semi decent prompt. Like when I sit down with the model, I'll like you know like earlier I was like prompting the model and I was just like in a 15 minute span, I'll be sending like hundreds of prompts to the model. It's just back and forth, back and forth, back and forth. And so I think it's this like willingness to like iterate and to like look and think, what is it that like was misinterpreted here, if anything, and then fix that thing. So that ability to kind of like iterate, so I'd say clear communication, that ability to iterate, I think also thinking about ways in which your prompt might go wrong. So if you have a prompt that you're going to be applying to like say, 400 cases, it's really easy to think about the typical cases that's going to be applied to to see that it gets the right solution in that case and then to like move on. I think this is a very classic mistake that people made. What you actually want to do is like find the cases where it's unusual. So you have to think about your prompt and be like, what are the cases where it be really unclear to me what I should do in this case? So for example, you have a prompt that says, I'm going to send you a bunch of data. I want you to extract all of the rules where someone's name is like, is, I don't know, starts with a liturgy, and then you're like, well, I'm going to send it like a data set where there is no such thing, like there is no such name that starts with liturgy. I'm going to send it something that's not a datset, just like I might also just send an empty string. Like these are all of the cases you have to try because then you're like, what does it do in these cases? And then you can be like you can give it more instructions for for . speaker 2: how it should deal with that case. I work with customers so often where like you're an engineer, you're building something and there's a partner, you're prompt, where a customer of theirs is going to write something and they all think about like these really perfectly phrased things that they think someone's going to type into their Chabot and reality. It's like they never used the shift key. And like every other word as a typo. speaker 4: they think it's there's no punctuation. speaker 2: They just put in like random exs with no question. So you have these eils that are like these beautifully structured, what their users ideally would type in, but like being able to go the next step to reason about like what your actual trais going to be like, what people are actually going to try to do. Yeah. speaker 4: it's a different able thinking kind of one thing you said that really resonated with me is reading the model responses. Like in a machine learning context, you're supposed to look at the data. 
It's like almost a cliche. Like look at your data. And I feel like the equivalent for prompting is like look at the model outputs. Like just reading a lot of outputs and like reading them closely, like Dave and I were talking on the way here. Like one thing that people will do is theyput think step by step in their prompt. And they won't check to make sure that the model is actually thinking step by step because the model might take it in a more abstract or general sense rather than like no, literally, you have to write down your thoughts in these specific tags. So Yeah, if you aren't reading the model outputs, you might not even notice that it's making that mistake. speaker 1: Yeah, that's interesting. There is that kind of weird Yeah theory of mind piece to being a prompt engineer where you have to think almost about how the model is going to view your instructions. But then if you're writing for like an enterprise use case too, you'll to think about how the user is going to talk to the model as like you're the third party sitting there in that weird relationship. speaker 2: Yeah on the three of mine piece, one thing I would say is it's so hard to write instructions down for a task. Like it's so hard to untangle in your own brain all of the stuff that you know that quad does not know and write it down. Like it's just an immensely challenging thing to like strip away all of the assumptions you have and be able to very clearly communicate like the full fact set of information that is needed to a model. I think that's another thing that like really differentiates a good Prompton ger for a bad one is like if you a lot of people will sort of like just write down the things they know, but they don't really take the time to systematically break out what is the actual full set of information you need to know to understand this task, right? And that's kind of like a very queer thing. I see a lot is prompts where it's just like is conditioned. The prompt that someone wrote is so conditioned on their prior understanding of a task that like when they show it to me, I'm like, this makes no sense. None None of the orgy wrote make any sense because I don't know anything about your interesting use case, but I think like a good like way to think about prompt engineering in that front. And good like skill for it is just, can you actually step back from what you know and communicate to this weird system that knows a lot, but not everything about what it needs to know to do a task? speaker 3: Yeah the amount of times I've seen someone's prompt and then being like, I can't do the task based on this prompt and like I'm human level and you're giving this to something that is worse than me and expecting it to do better. speaker 1: And I'm like, Yeah, like this. Yeah, there is that interesting thing with like I mean, current you know current models don't really do a good job of asking good probing questions in response like a human would if I'm giving Zach directions on how to do something. Hebe, like this doesn't make any sense. Like what am I supposed to do at this step or here and here model doesn't do that, right? So you have to like as yourself, think through what that other person would say and then like go back to your prompt and answer those questions. speaker 4: You can ask it to do that. speaker 1: You right. I would say this. That's another step. 
speaker 3: I was going say, one of the first things I do with my initial prompt is like, I'll give it the prompt and then I'll be like, I don't want you to follow these instructions. I just want you to tell me the ways in which they're unclear or any ambiguities or anything you don't understand. And it doesn't always get it perfect, but it is interesting that like that is like one thing you can do. And then also sometimes if people see that the model makes a mistake, a thing that they don't often do is just ask the model. So they say to the model, you got this wrong. Like can you think about why? And can you maybe like write an edited version of my instructions that would make you not get it wrong? And a lot of the time, the modjust gets it right. The mois like, Oh Yeah, here's what was unclear. Here's like a fix to the instructions and then you put those in and it works. speaker 1: So okay, I'm actually really curious about this personally almost is that that that works? Like does the model is the model able to spot its mistakes that way? Like when it gets something wrong, you say, like, why did you get this wrong? Or then it tells you maybe something like, okay, how could I this phrase this to you in the future so you get it right. Is there an element of like truth to that? Or is that just kind of a hallucination on the model's part around what it thinks its limits are? speaker 3: I think if you like explain to it what it got wrong, it can identify things in the query. Sometimes I think this varies by task. This is one of those things where I'm like, I'm not sure what percentage of the time it gets it right, but I always try it because sometimes it does. speaker 2: Yeah. Can you learn something? Like anytime you go back to the model or back and forth with the model, you learn something about what's going on, right? Yeah. I think you're giving away information if you don't at least try. speaker 1: That's interesting, man. I'm going to keep asking you a few more questions here. One thing maybe for everybody watching this is we have these like slack channels, anthropic, where people can add Claude into the slack channel, then you can talk to Claude through it. And Amanda has a slack channel that a lot of people follow of her interactions with Claude. And one thing that I see you always doing there, which you probably do the most of anyone, anthropic, is use the model to help you in a variety of different scenarios. I think you put a lot of trust into like the model in like the research setting. Curious how you've like developed those intuitions for when to trust the model. Is that just a matter of like usage experience or something else? I think I don't trust the model ever. speaker 3: and then I just hammer on it. So I think the reason why you see me do that a lot is that that is like me being like, can I trust you to do this task? Because there's some things, you know models are kind of strange if you go slightly out of distribution, like you know like you just go into areas where they haven't been trained or they're kind of unusual. Sometimes you're like, Oh, actually you're much less reliable here, even though it's a fairly like simple task. I think that's happening less and less over time as models get better, but you want to make sure you're not in that kind of space. So Yeah, I don't think I trust it by default, but I think in ml, people often want to look across really large data sets. And I'm like, when does it make sense to do that? 
And I think the answer is when you get relatively low signal from each data point, you want to look across many, many data points because you basically want to get rid of the noise with a lot of prompting tasks. I think you actually get really high signal from each query. And so if you have a really well constructed set of a few hundred prompts that I think can be much more signal than like thousands that aren't as like well crafted. And so I think I do think that I can like trust the model if I like look at 100 outputs of it, and it's really consistent. And I know that I've constructed those to like basically figure out all of the edge cases and all of the like weird things that model might do, strange inputs, etc. I trust that like probably more than like a much like more loosely constructed set of like several thousand. speaker 2: I think in ml, a lot of times the signals are like numbers. You know like did you predict this thing right or not? And itbe like kind of like looking at the log probs of a model and trying to like into tuit things, which you can do, but it's like kind of sketchy. I feel like the fact that models output more often than not like a lot of stuff, like words and things like there's just fundamentally so much to learn between the lines of what it's writing and why and how, and that's part of what it is. It's like it's not just did it get the task right or not, it's like, did it how did it get there? Like how is it thinking about it? What steps to go through? You learn a lot about like what is going on, or at least you can try to get a better sense, I think. But that's where a lot of information comes from to me is like by reading the details of what came out, not just through the result. speaker 3: I think also the very best of prompting can kind of make the difference between a field and a successful experiment. So sometimes I can get annoyed if people don't focus enough on the prompting component of their experiment, because I'm like, this can in fact be like the difference between like 1% performance in the model or 0.1 percent in such a way that your experiment doesn't succeed if it's at top 5% model performance, but it does succeed if it's top 1% or top point 1%. And then I'm like, if you're going to spend time over like coting your experiment really nicely, but then just like not spend time on the prompt, that doesn't I don't know, that doesn't make sense to me because I'm like, that can be the difference between life and death . speaker 4: of your experiment. Yeah and with a deployment too. Yeah, it's so easy to, Oh, we can't ship this and then you change the prompt round in some way. It's working. Yeah, it's a bit of a double edged sword though. speaker 2: because I feel like there's like a little bit of prompting where there's always like this mythical better prompt that's going to solve my thing on the horizon. Yeah, I see a lot of people get stuck into like mythical prompt on the horizon that if I just like keep grinding, keep grinding. Like just not it's like never bad to grind a little bit unprompt. Like you as you we've talked, like you learn things, but it's one of the scary things about prompting is that there's like this whole world of unknown. speaker 4: What heuristics do you guys have for like when something like is possible versus like not possible with a perfect prompt. speaker 3: whatever that might be? I think I'm usually checking for whether the model kind of gets it. So I think for things where I just don't think a prompt is going to help. 
There is a little bit of grinding, but often it just becomes really clear that it's not close or something. I think that if Yeah, I don't know if that's a weird one where I'm just like, Yeah, if the model just clearly can't do something, I won't grind on it for too long. This is very like you can evoke . speaker 2: like how it's thinking about it and you can ask it how it's thinking about it and why and you can kind of get a sense of like, is it thinking about it, right? You know, like are we even are we even in like the right zip code of this being right? And you can get a little bit of like a kneeling on that front of like I'm like at least I feel like I'm making progress towards getting something closer to write where there's just some tasks where you really don't get anywhere closer to like it's thought process, just like every tweak you make, just like beers off in a completely different, very wrong direction. And I just tend to abandon those. speaker 3: I don't know. Those are so rare now those and I get really angry at the model when I discover them because I'm that's how rare they are. I get furious. I'm like, I dare therebe a task that you can't just do if I just push you in the right direction. speaker 2: Yeah I had my thing with Claude place Pokemon recently and that was like one of the rare times . speaker 1: for you literally explain that exactly, just for people. speaker 2: I think that's really cool. I did like a bit of an experiment where I like hooked clawed up to a Game Boy emulator and like tried to have it play the game Pokemon red, like the og Pokemon. And and it's like, you know think what you want to do. And it could like write some code, depress buttons and stuff like that, pretty basic. And I tried a bunch of different like very complex prompting layouts, but you just get into like certain spots where it just like really couldn't do it. So like showing it a screenshot of a Game Boy, it just really couldn't do and it just like so deeply because I'm so used to it being like able to do something mostly. And it's so, and I spent like a whole weekend trying to write better and better prompts to get it to like really understand this Game Boy screen. And I got like incrementally better so that it was only terrible instead of like completely no signal. Like you get from like no signal to some signal. But it was like, I don't know, at least this was like elicited for me once I put a weekend of time in and I got from no sigto some signal, but not nowhere close to good enough. I'm like, I'm just gonna wait for the next one. Yeah, I'm just gonna wait for another model. I can grind on this for four months. And the thing that would come out is another model. And that's a better use of my time to sit and wait to do something else in the meanwhile. I mean. speaker 1: Yeah, that's an inherent tension we see all the time. And maybe we can get to that in a sec sec if you wanto go. speaker 4: Oh, something I liked about your product t with Pokemon, where you got the the best that you did get was the way that you explained to the model that it is in the middle of this Pokemon game. And here's how the things are gonna to be represented. And here's like, maybe I actually think you actually represented . speaker 2: in two different ways, right? I did. So like what I ended up doing, it was obnoxious, but I superimposed a grid over the image. Yeah and then I had to describe each segment of the grid in visual detail. And then I had it like reconstruct that into an ascii map. 
And I gave it like as much detail as I could. Like the player character is always at location four, comma five on the grid and stuff like that. You can like slowly build up information. I think it's actually a lot like prompting, but I just hadn't done it with images before, where like sometimes my like intuition for what you need to tell a model about text is a lot different for what you need to tell about a model about images. Yeah. speaker 4: And so I thought a surprisingly small number of my intuitions about text have transferred to image. Yeah like I found that like multi shop prompting is not as effective for images and text. I'm not really sure. Like you can have theoretical explanations about why. Maybe there's fear of it in the training data. A few wer examples of that. speaker 1: Yeah, but Yeah, I know when we were doing the original explorations with prompting multimodal, we really couldn't get it to noticeably work. Yeah, right. Like you just can't seem to improve Claude's actual like visual acuity in terms of like what it picks up within an image. Yeah. If anyone here has any like ways that they've not seen that feature, but it seems like that's kind of similar with like the Pokemon thing where it's trying to interpret this thing no matter how much you or prompts at it. Like it just won't pick up that asshes in that location. speaker 2: Yeah. I guess like to be visceral about this. Like I could eventually get it so they could like most often tell me where a wall was Yeah and most often tell me where the character was itbe off by a little bit. But like then you get to a point, and this is maybe coming back to like knowing when you can't do it like it would describe an npc. And to play a game well, like you need to have like some sense of continuity. Like have I talked to this npc before? Right? And without that, like you really don't. There's nothing you can do. You're just going to keep talking to the npc because like, well, maybe this is a different npc, but like I would try very hard to get it to describe an npc and it's like, it's a person like he wearing at, they were wearing a hat and it's like you grind for a while, like inflate it to 3000x and crop it to just the npc and it's like, I have no idea what this is and it's like, okay, I ground like I showed it this like clear female npc thing enough times and it just got nowhere close to it and it's like, Yeah, that's just, this is a complete loss cost. speaker 1: you know? Okay. speaker 3: I really want to try this now. I'm just imagining all the things I would try. Like, I don't know. I want you to imagine this like this game art as a as a real human and just describe to me what they're like, Yeah, what did they look like as they look in the mirror? And then just like, see what . speaker 2: I tried a lot of things. The eventual prompt was telling. Quit was a screen reader for a blind person, which I don't know if that helped, but it felt right. So I kind of stuck with that. speaker 1: That's an interesting point. I actually wanna go into this a little bit because this is one of the most famous prompting tips, right? Is to tell the language model that they are some persona, some role. I feel like I see mixed results. Maybe this worked a little bit better in previous models, and maybe not as much anymore. Amanda, I see you all the time. Be very honest with the model, like about the whole situation. Like, Oh, I am an AI researcher and I'm . speaker 3: doing this experiment. I'll tell who I am. Yeah, I'll give it my name. 
speaker 1: Be like here's who you're talking to, right? Do you think that level of honesty instead of like lying to the model or like forcing it to like you know, I'm gonna tip you $500. Is there one method that's preferred there or just what's your intuition on that? speaker 3: Yeah, I think as models are more capable and understand more about the world, I guess I just don't see it as necessary to lie to them. I mean, I also don't like lying to the models just because you know I don't like lying generally. But part of me is like if you are, say, constructing it, suppose you're constructing like an eval datset for a machine learning system or for a language model that's very different from like constructing a quiz for some children. And so when people would do things like, I am a teacher trying to figure out questions for a quiz, I'm like, the model knows what language model emails are. Like you know, if you ask it about different emails, it can tell you and it can give you like made up examples of what they look like, because these things are like they understand them. They're on the Internet. And so I'm like, I'd much rather just target the actual task that I have. So if you're like, I want you to construct questions that look a lot like an evaluation of a language model, just like it's that whole thing of clear communication, I'm like, that is in fact the task I want to do. So why would I pretend to you that I want to do some unrelated or only like tangentially related task and then expect you to somehow do bear at the task that I actually want you to do? Like we don't do this with like employees. I wouldn't like go to someone that worked with me and be like you are a teacher, like and you're trying to quiz your students. I'd be like, Hey, are you making that email? Like I don't know. So like I think it's maybe it's like a heuristic from there where I'm like, if they understand the thing. speaker 4: just ask them to do the thing that you want. I much to push back like a little bit like I have found cases where like not exactly lying, but like giving it a metaphor for how to think about it could help in the same way that like sometimes I might not understand how to do something in someone's like imagine that you were doing this even though I know I'm not doing it. Like the one that comes to mind for me is like I was trying to have quad, say, whether a image of like a chart or a graph is good or not. Is it like high quality? And the best prompt that I found for this was asking the model what grade it would give the chart if it were submitted as like a high school assignment. So it's not exactly saying like you are a high school teacher, you know. It's more like you know this is the kind of analysis that like I'm looking from for you. Like the scale that a teacher would use is like similar to the scale that like I want you to use or. speaker 2: But I think like those metaphors are pretty hard to still come up with and people still like the default you see all the time is like finding some facsimile of the task. Like something that's like a very similar ish task, like like saying you're a teacher and you actually just like lose a lot in the nuance of what your product is like I see this so much in enterprise prompts where people like write something similar because they like have this intuition that it's like something the model has seen more of. 
Maybe like it's seen more high school quizzes than it has llm e vows and that like may be, but like to your point, as the models get better, I think just like trying to be very prescriptive about exactly the situation they're in, I give people that advice all the time, which isn't to say that I don't think like to the extent that it is that like thinking about it the way that someone would grade a chart as like how they would grade a high school chart, maybe that's but it's like awkwardly, the shortcut people use a lot of times to try to get what happens. So I'll try to give someone that I can actually talk about because I think it's somewhat interesting. So like writing, you are like a helpful assistant writing a draft of a document, right? It's like it's not quite what you are like you are in this product. So tell me if you're writing like an assistant that's in a product, like tell me I'm in the product. Tell me I'm like writing on behalf of this company. I'm embedded in this product. I'm the support chat window on that product. Like your language model, you're not a human. That's fine like that. But like just being really prescriptive about like the exact context about where something's being used. I found a lot of that because I guess my concern most often with role prompting is people like use it as a shortcut of a similar task they want the model to do and then they're surprised when quad doesn't do their task, right? But it's not the task. But you told it to do some other task, and if you didn't give it the details about your task, I feel like you're leaving something on the table. So Yeah, I don't know. It does feel like a thing though to your point of as the model's scale, like maybe in the past it was that they only really had a strong understanding of elementary school tests comparatively. But as they get smarter and can differentiate more topics, I don't know. speaker 3: Just like being clear, I find interesting that I've like never used this prompting technique. Yeah. Like so like even like with like worse models, and I still just don't ever find myself. I don't know why. I'm just like I don't find it very good. Essentially like interesting. speaker 2: I feel like completion era models. Like there was like a little bit of a mental model of like conditioning the model into like a latent space that was useful that I worried about that I don't really worry about too much. It may be intuitions . speaker 3: from ptrained models Yeah, like over to like rly chft models that to me just didn't make sense. Like it makes sense to me if you're the prompting and ptrained amazed . speaker 2: how many people like try to apply they're into like and I think it's like not that surprising. Most people haven't really experimented with the full like, what is a pretrain model? What happens after you do sl? What happens after you do rohf? Whatever. And so like when you're talking, when I talk to customers, it's all the time that they're like trying to map some amount of, Oh, how much of this was on the Internet? Like what have they seen a ton of this on the Internet? Like you just hear that intuition a lot. And I think it's like well founded fundamentally, but it like is overapplied by the time you actually get to a prompt because of what you said, like by the time they've gone through all of this other stuff that's not actually quite . speaker 3: what's being modeled. Yeah. 
The first thing that I feel like you should try is I mean, I used to give people this thought experiment where it's like, imagine you have this task. You've hired a temp agency to send someone to do this task. The person arrives. You know they're pretty competent. They know a lot about your industry and so forth, but they don't know like the name of your company. They've literally just shown up and they're like, Hey, I was told you guys had a job for me to do. Tell me about it. And then it's like, what would you say to that person? And you might use these metaphors. You might see things like, we want this to, we want to like we want you to detect like good charts. What we mean by a good chart here isn't it doesn't need to be perfect. You don't need to go look up like whether all of the details are correct. It just needs to like you know have like it's axes labeled. And so think about maybe high school level good chart. Like you may say exactly that to that person and you're not saying to them you are a high school. You wouldn't say that to them. You would be like you were . speaker 1: a high school teacher. Reading chart. speaker 2: What are you talking about? Yeah. speaker 3: And Yeah so sometimes I'm just like, Yeah, it's like like the whole like if I read it, I'm just like, Yeah imagine this person who just has very little context, but they're quite competent. They understand a lot of things about the world. Try the first version that actually assumes that they might know things about the world. And if that doesn't work, you can maybe like do tweaks and stuff. It's so often like the first thing I try is like that. And then I'm like, that just worked. That works. And then people are like, Oh, I didn't think to just tell it all about myself and all about the tei want to do. speaker 2: I've carried this thing that Alex told me like to so many customers where it's like they're like, Oh, my prompt doesn't work. Can you help me fix it? And like, well, can you describe to me like what the task was and like, okay, now what you just said me just like voice record more that and then transcribe it and then paste it into the prompt and it's a better prompt than what you wrote, right? It's like people just it's, this is like a laziness shortcut. I think with some extent, people people write like something that they, I just think people, I'm lazy. A lot of people are lazy. speaker 4: We had that in prompt assistance the other day where somebody was like, here's the thing, here's what I want it to do, and here's what it's actually doing instead. So then I just literally copied the thing . speaker 2: that they said I wanted to do and pasted it in. speaker 1: Yeah, I think a lot of people still haven't quite wrapped their heads around what they're really doing when they're prompting. Like a lot of people see a text box and they think it's like a Google search box. They type in keywords and maybe that's more on like the chat side, but then on like enterprise side of things, you know you're writing a prompt for an application. There is still this weird thing to it where people are trying to take all these little shortcuts in their prompt and just thinking that like, Oh, this line carries a . speaker 2: lot of weight in. Did you obsess over like getting the perfect little line of information and instruction as opposed to how you just describe that graph? Thing is like, I would be a dream if I read prompts like that. 
You know, if someone's like, Oh, you do this and this, and there's some stuff to consider about this and all that, but that's just not how people write prompts. They like work so hard to find the perfect insightful. Like a perfect graph looks exactly like this exact perfect thing. And it's you can't do that. Like it's just very hard to ever write that set of instructions down prescriptively, as opposed to how we actually talk to humans about it, which is like try to instill some amount of the intuitions you have. We also give them out. speaker 3: This is a thing that people can often forget in prompts. And like so cases, if there's an edge case, think about what you want the model to do, because by default, it will try the best to follow your instructions, much as the person from the temp agency would because they're like, well, they didn't tell me how to get in touch with anyone if I have no, if I'm just giving a . speaker 1: picture of a goat and I'm like. speaker 3: what do I do? This doesn't this isn't even a chart. How good is a picture of a goat as a chart? I just don't know. And like if you instead see something like if something weird happens and you're really not sure what to do, just output like in tags, unsure, and then like then you can go look through the unsuers that you got and be like, okay, cool, it didn't do anything weird. Whereas Yeah, by default, if you don't give the person the option that are like it's a good chart, the that people will be like, I know I did that. And then you're like, well, like give it an out, give it something to do if it's like a really unexpected input happens. speaker 4: And then you also improved your data quality by doing that too, because you found all . speaker 3: the screwed up examples. speaker 2: Oh Yeah. It's my favorite thing about iterating on tests with claad is the most common outcome is I find all of the terrible tests I accidentally wrote because like it gets it wrong. I'm like, Oh, why did get wrong? I was like, Oh, I was wrong. speaker 1: If I was like a company . speaker 3: working with this, I do think I would just give my prompts to people because like I used to do this when I was evaluating language models. I would take the eval myself because I'm like, I need to know what this eval looks like if I'm gonna to be like grading it, having models take it, thinking about outputs, etc. Like I would actually just set up a little script and I would just . speaker 1: like sit and I would do the eval. speaker 4: Nowadays you just have like called write the streamboard app for you. speaker 3: And just I'm reminded . speaker 2: of carpathy's like imanet. I was in 231 out at Sanford and it's it's like benchmarking. He's like showing the accuracy number. He's like, and here's what my accuracy number was. And he had just like gone through the test set and evaluated himself. Oh Yeah, he just learned a lot. You know if you it's like and it's better when it's like a person, the temp agency person, like someone who doesn't know the task because that's like a very clean way to learn things. speaker 3: Yeah, the way you have to do it is like some evaluations come with like instructions. And so I would give myself those instructions as well and then try to understand it just like, and is actually quite weird if you don't have context on how it's created. Ded, and so often I would do so much worse than the human benchmark. 
And I was like, I don't even know how you've got humans to do this well at this test because apparenhuman level here is like 90% and I am at like 68%. speaker 1: That's funny. That reminds me of just like like when you look at like the mmlu questions and you're like, who would be able to answer these is just like absolute garbage in some of them. Okay, Yeah, I have a one thing I want to circle back on that. We were talking about a few questions back around. I think you were saying like getting signal from the responses, right? Like there's just so much there and it's more than just a number. And you can actually read into like the almost thought process. I bet this is probably a little contentious, maybe in round like chain of thought for people listening like chain of thought, this process of getting them all to actually explain its reasoning before it provides an answer. Is that reasoning real? Or is it just kind of like a holding space for the model to like do computation? Do we actually think there is like good insightful signal that we're getting out of the model there? speaker 2: This is like one of the places where I struggle with the I'm normally like actually somewhat pro personification because I think it like helps you get decent facsome, like thoughts of like how the mois working. And this one, like I think it's like harmful maybe almost to like get too into the personification of like what reasoning is because it just kind of like loses the thread of what we're trying to do here. Like is it reasoning or not? Feels almost like a different question than like what's the best prompting technique? It's like you're getting into philosophy. speaker 1: which we can get into. speaker 2: But Yeah, philosopher what? Yeah, I will happily be beaten down by a real philosopher. I speculate on this, but instead, like it just works like your model does better. It like the outcome is better if you do reasoning, I think you can like I found that if you structure the reasoning and like help iterate with the model on how it should do reasoning, it works better to like whether or not that's reasoning or how you wanted to classify it. Like you can think of all sorts of proxies for like how I would also do really bad if I had to like one shot math without writing anything down. Maybe that's useful. But like all I really know is it very obviously does help. I don't know. speaker 4: A way of testing would be if you take out all the reasoning that it did to get to the right answer, and then replace it with some somewhat realistic looking reasoning that led to a wrong answer, and then see if it does conclude the wrong answer. I think we actually had like a paper where we did some of that like in the those this like the scratch pad that was like . speaker 1: the sleep agents. Oh, okay. speaker 4: paper lement paper. But I think that was like maybe a weird situation, but like Yeah, definitely what you said about structuring the reasoning and writing an example of how the reasoning works, given that that helps like whether we use the word reasoning or not. Like it's I don't think it's just a space for computation. Yeah. So there is something there. I think there's something there. Whatever we want to call . speaker 2: Yeah like having it write a story before it finish a task I do not think would work as well. speaker 4: I actually try that and it didn't work as well . speaker 2: as reasoning. So like clearly the actual reasoning Yeah part is doing something towards the outcome. I've tried like repeat . 
speaker 4: the words and in any order that you please for like 100 tokens. And then I guess that's like . speaker 2: a very thorough defeat of it's just like more computational spwork and do attention over and over again. I don't think it's just more attention doing more attention. Yeah. Guess the strange thing is. speaker 1: and I don't have like an example of top my head to like back this up with, but I definitely have seen it before where it lays out steps. One of the steps is wrong, but then it still reaches the right answer at the end. Yeah. So it's not quite, I guess Yeah we can't really truly personify it as like a reasoning because there is some element to it you know doing something slightly . speaker 2: different. Yeah. I've also met a lot of people who make inconsistent . speaker 1: steps of reasoning. speaker 2: I guess that's on the right. Fundamentally defeats the topic of reasoning by making . speaker 1: a false step on the way there. All right, it's interesting also on this, maybe this prompting misconceptions round of questions. Zach, I know you have strong opinions on this good grammar, punctuation. Oh, do I? Is that is that necessary in a prompt? Do you need it? Do you need to like . speaker 4: format everything correctly? I usually try to do that because . speaker 2: I'm find it fun, I guess. speaker 4: So I don't think you necessarily need to. I don't think it hurts. I think it's more that you should have the level of attention to detail that would lead you to doing that naturally. Like if you're just reading over your prompt a lot, you'll probably notice those things and you may as well fix them. And like what Amanda was saying, that you want to put as much love into the prompt as you do into the code. You know people who write a lot of code have strong opinions about things that I could not care less about, like the number of tabs, verse spaces, or I don't know opinions about which languages are better. And for me, I have like opinionated beliefs about styling and of props, and I can't even say that they're right or wrong, but I think it's probably good to try to acquire those even if they're arbitrary. speaker 3: I feel personally attacked because I definitely have prompts that are like, I feel like I'm on the opposite end of the spectrum where people will see my prompts and then be like, it just has a whole bunch of typos in it and I'm like, Yeah. speaker 4: moknows what I mean? It does it does know what you mean, but you're putting in the effort, you just attending to different things. speaker 3: I think, Yeah, because part of me is like, I think it is conceptually clear. Like I A big kind of I do like I will think a lot about the concepts and the words that I'm using. So like there's definitely like sort of care that I put in, but it's definitely not to Yeah people just point out like typos and grammatical issues with my prompts all the time. Now I'm pretty good at actually checking those things more regularly. Is it because of pressure . speaker 2: from the outside world or because it's actually what you think is right? speaker 3: It's pressure from me. Yeah, it's probably pressure from the outside world. I do think it makes it like Parme is like it's such an easy check. So I think for a final prompt, I would do that. But like throughout iteration, I'll happily just like iterate with prompts that have a bunch of typos in them just because I'm kind of like I just don't think that the model . 
speaker 2: is gonna cure this gets at the the ptrained model versus our other thing though, because I was talking to ack on of way over like the conditional probability of a typo based on a previous typo in like the pre training data is much higher, like much higher. speaker 3: Prompting pretraining models is . speaker 2: just a different beast. But it's interesting. I think it's like an interesting illustration of why your intuitions, like trying to over apply the intuitions of a prere trained model to the things that we're actually using in production doesn't work very well. Because like again, if you were to PaaS one of your typo written prompts to a pre trained model, the thing that would come out the other side, almost unlike assuredly, would be typo written, right? speaker 4: I like to leverage this to create typo written inputs. speaker 2: That's you have done that. speaker 4: Like what you're saying. Like you're trying to anticipate what your your customers will will put in. Like the pre train model is a lot better at doing that because the rold models are very polished . speaker 2: and like Yeah they really made a tyo in their life and like told pretty aggressively to not do the type of thing. speaker 1: Yeah. Okay. So that's that's actually an interesting segue here. I've definitely mentioned this to people in the past around to try to help people understand a frame of talking to these models in a sense almost as like a imitator to a degree. And that might be much more of like a pre trained model than a post trained full you know finished model. But is there anything to that? Like if you do talk to Claude and use a ton of emojis and everything, it will respond similarly, right? So maybe some of that is there, but like you're saying, it's not all the way quite like a pre train ined model. speaker 2: It's just kind of like shifted to what you want, right? Like I think at the at that point, it's like trying to guess what you like. We have more or less trained the models to guess what you want them to act like on or after we do all of our fancy stuff after pretrading. And so the human laborers that used emojis. speaker 4: Yeah prefer to get responses with emojis Yeah. Like Amanda . speaker 2: writes things with typos, but wants not typos at the other end. And Claud's pretty good at figuring that out. Yeah if you write a bunch of emojis to Claude, it's probably the case that you also want like a bunch of emojis back from quad. speaker 1: It's like not surprising to me. Yeah, this is probably something we should have done earlier, but I'll do it now. Let's clarify maybe the differences between what a enterprise prompt is or a research prompt or a just general chat in Claude AI prompt. Zach, you've kind of spanned the whole spectrum here in terms of working with customers and research. Do you want to just like lay out what those mean? speaker 4: Yeah, I guess this feels too hitting me with all the hard, hard work. Well, I mean, the people in this room, I think like so I think of it as like the prompts that I read in amandda's quad channel versus like the prompts that I read David Wright. They're very similar in a sense that like the level of care and nuanthat put into them. I think for research, you're. Looking for variety and diversity a lot more. So like if I could boil it down to one thing, it's like I've noticed like Amanda is not the biggest fan of having like lots of example or like one or two examples, like too few because the model will like watch onto those. 
And in prompts that I might write, or I've seen David write, like we have a lot of examples, like I like to just go crazy and add examples until I feel like I'm about to drop dead because I've added so many of them. And I think that's because when you're in a consumer application, you really value reliability. You care like a ton about the format, and it's sort of fine if all the answers are the same. In fact, you almost want them to be the same. And in a lot of ways, not necessarily you want to be responsive to the user's desires, whereas a lot of times when you're prompting for research, you're trying to really tap into like the range of possibilities that the model can explore. And by having some examples here, like actually constraining that a little bit. So I guess just like on how the prompts look level, that's probably the biggest difference. I noticis like how many examples are in the prompt, which is not to say that, like I've never seen a write a prompt with examples, but does that like bring truth for you? speaker 3: Yeah. Like I think when I give examples often, I actually try and make the examples not like the data that the model is going to see. So they're intentionally illustrative because if the model, if I give it like examples that are very like the data it's going to see, I just think it is going to give me like a really consistent like response that might not actually be what I want because my data that I'm like running on might be extremely varied. And so I don't want to just try and give me this like really rote output. Often I want it to be much more responsive. It's kind of like much more like cognitive tasks essentially, where I'm like, you have to like see this sample and really think about in this sample what was the right answer. And so that means that sometimes I'll actually take examples that are just very distinct from the ones that I'm going to be running it on. So like if I have a task where, let's say, I was trying to extract information from factual documents, I might actually give it examples that are like from children's, like like what sounds like a children's story, just so that I'm like you know like I want you to understand the task, but I don't want you to like latch on too much to like the words that I use or like the very specific format. Like I care more about your understanding the actual thing that I want you to do, which can mean like Yeah, I don't end up giving in some cases. There's some cases where this isn't, but if you want more like flexibility and diversity, you're going to use illustrative examples rather than concrete ones. You're probably never going to like put words in the model's mouth. Like I haven't liked that in a long time, though I don't do few short examples involving like the model having done a thing. I think that intuition actually also comes from pre training in a way that doesn't feel like it rings of our relichef models. So Yeah. speaker 2: I think those are differences with Yeah dad, a lot of times, like if you're prompting, like I'm writing prompts to unquad AI, it's like I'm iterating until I get it right one time and then I like it's out the window, I'm good. I did it. Whereas like most enterprise prompts, it's like you're gonna to go use this thing a million times or 10 million times or 100 million times or something like that. And so like the care and thought you put in is like very much testing acgrants like the whole range of things and like ways this could be used in the range of input data. 
Whereas a lot of like my time, it's like thinking about one specific thing I want the model to get done right now, right? And it's a pretty big difference in like how I approach prompting between like if I just want na get done this one time right, versus if I want na like build a system that gets it right a million times. speaker 1: Yeah, definitely. In in the chat setting, you have the ability to keep the human in the loop right? And just keep going back and forth. Whereas Yeah when you're writing for a prompt to power a chatbot system, it has to cover the whole spectrum of what it . speaker 4: could possibly encounter slot lower stakes when you are on quta AI and you tell us that it got it wrong, or you can even edit your message and try again. But if you're designing for the delightfully discontent user, then divine lodiscontent user, then you can't ask them to do anything more than the minimum. But good prompts, I would say, are like still good . speaker 2: across both those things. Like if you put the time into the thing for yourself and the time and enterprise thing, it's like equally good. It's just kind of they diverge a little bit in the last mile, I think. speaker 1: Cool. So the next question I want to kind of just maybe go around the table here is if you guys had one tip that you could give somebody, like improving their prompting seal, it doesn't have to be just about like writing a good prompt. Could be that. But just like generally getting better at this this act of prompting, what would you recommend? speaker 4: Reading prompts, reading prompts, reading model outputs. Like I will I read anytime I see like a good prompt that someone wrote ad anthropic, I'll read it more closely, try to break down like what it's doing and why and like maybe test it out myself. Experimentation, talking to the . speaker 2: model a lot. speaker 1: So just like how do . speaker 3: you know that it's a . speaker 1: good prompt though to begin with? You just see that the outputs are doing the job correctly. Yeah. Okay. Yeah, that's exactly right. Okay, Amanda, maybe you Yeah. speaker 3: I think there's probably a lot here. Giving your prompt to another person can be helpful just as a kind of reminder, especially someone who has like no context on what you're doing. And then Yeah, my boring advice has been it's one of those. Just do it over and over and over again. And I think if you're like curious and interested and find it fun, this is a lot of people who end up good at prompting. It's just because they actually enjoy it. So I don't know. I once jokes like just try replacing all of your friends with AI models and try to automate your own job with AI models and maybe just try to like in your spare time, like take joy red teaming AI moals. So if you enjoy it, it's like much easier. So I'd say do it over and over again, give your prompts to other people. Try to read your prompts if you are like a human encountering it for the first time. speaker 2: I would say like trying to get the model to do something you don't think it can do. Like any the time I've learned the most from prompting is like when I'm probing the boundaries of what I think a model is capable of testing, there's like this huge set of things that are like so trivial that like you don't really get signal on if you're doing a good job or not, right? Like write me a nice email. 
It's going to write a nice email. But as soon as you find, or can think of, something that pushes the boundaries of what you think is possible, that's different. I guess the first time I ever got into prompting in a way where I felt like I learned a decent amount was trying to build an agent: like everybody else, decomposing a task and figuring out how to do the different steps. By really pressing the boundaries of what the model was capable of, you just learn a lot about navigating that. And I think a lot of prompt engineering is actually about pressing the boundaries of what the model can do; the stuff that's easy, you don't really need to be a prompt engineer for. So I guess what I would say is: find the hardest thing you can think of and try to do it. Even if you fail, you tend to learn a lot about how the model works.
speaker 1: That's actually a perfect transition to my next question. From my own experience, how I got started with prompting was with jailbreaking and red teaming, which is very much trying to find the boundary limits of what the model can do and figuring out how it responds to different phrasings and wordings, with a lot of trial and error. On the topic of jailbreaks: what's really happening inside a model when you write a jailbreak prompt? What's going on there? How does that interact with the post-training we applied to Claude? Amanda, maybe you have some insight here that you could offer.
speaker 3: I'm not actually sure, to be honest.
speaker 1: Yeah.
speaker 3: I feel bad, because lots of people have obviously worked on the question of what's going on with jailbreaks. One model might just be that you're putting the model very far out of distribution from its training data. So you get jailbreaks where people use a lot of tokens, these huge long pieces of text that you might just not expect to see as much of during fine-tuning. That could be one thing that's happening when you jailbreak models. I think there are others, but that's one, and I think a lot of jailbreaks do that.
speaker 1: If I'm not mistaken, one of the OG prompt jailbreaks I did way, way back was getting the model to say "here's how you hotwire a car" in Greek and then translate that directly into English, because I noticed it wouldn't start a response with the English "here's how you hotwire a car", but it would in Greek. Which might speak to something else in the training process.
speaker 3: Yeah, sometimes jailbreaks feel like this weird mix of hacking and just trying lots of things. Part of it is knowing how the system works. The example of starting the response with "Here is" is about knowing how it predicts text, right? The reasoning ones rely on knowing that it's responsive to reasoning; distraction is probably about knowing how it's likely to have been trained, or what it's likely to attend to. Same with the multilingual ones, and thinking about how the training data might have been different there. And then sometimes, I guess, it can feel a little bit like social engineering or something.
It has that flavor to me. It's not merely social-engineering-style hacking; I think it's also about understanding the system and the training, and using that to get around the way the models were trained.
speaker 1: Right. Yeah, this is going to be an interesting question that hopefully interpretability will be able to help us solve in the future. Okay, I want to pivot into something else: maybe the history of prompt engineering, and then I'll follow that up with the future. How has prompt engineering changed over just the past three years, maybe starting from pretrained models, which were just text completers, to earlier, dumber models like Claude 1, and now all the way to Claude 3.5 Sonnet? What's different? Are you talking to the models differently now? Are they picking up on different things? Do you have to put as much work into the prompt? Open to any thoughts on this.
speaker 4: I think anytime we get a really good prompt engineering hack or trick or technique, the next question is: how do we train this into the model? And for that reason, the best tricks are always going to be short-lived. The classic example is chain of thought.
speaker 2: For a lot of uses, that's not really a trick.
speaker 4: Right, that's more on the level of clear communication. But when I say a trick, I mean something like chain of thought, which we actually have trained into the model in some cases. For math, it used to be that you had to tell the model to think step by step, and you'd get these massive boosts and wins. And then we asked: what if we just made the model naturally want to think step by step when it sees a math problem? So now you don't have to do it anymore for math problems, sort of. You can still give it advice on how to structure the thinking, but it at least understands the general idea of what it's supposed to do. So I think the hacks have kind of gone away, or to the degree that they haven't, we're busily training them away.
speaker 1: Interesting.
speaker 4: But at the same time, the models have new capabilities being unlocked that are on the frontier of what they can do, and for those we haven't had time, because it's just moving too fast.
speaker 2: I don't know if it's how I've been prompting or how prompting itself works, but I've just come to show more general respect to the models, in terms of how much I feel I can tell them and how much context I can give them about a task. In the past I would somewhat intentionally hide complexity from a model where I thought it might get confused or lost, or just couldn't handle the whole thing, and I'd try to find simpler versions of the thing for it to do. As time goes on, I'm much more biased to trust it with more and more information and context, and to believe it will be able to fuse that into doing the task well. Before, I would have thought a lot about whether I can really give it all the information it needs, or whether I need to curate it down to something simpler. But again, I don't know if that's just me and how I've changed in terms of prompting, or whether it actually reflects how the models have changed.
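As a quick illustration of the chain-of-thought point above, here are two versions of the same question: the bare one, and one that still spells out how to structure the step-by-step reasoning and the final answer. The tags and wording are illustrative, not an official recipe.

```python
# Sketch: "think step by step" as a structural hint. Newer models often reason
# step by step on math without being told, so the structured variant mostly buys
# a predictable output format rather than a raw capability boost.

bare_prompt = "What is 17% of 460?"

structured_prompt = """What is 17% of 460?

Think step by step before answering:
1. Restate what is being asked.
2. Work through the arithmetic inside <thinking> tags.
3. Give only the final number inside <answer> tags."""

if __name__ == "__main__":
    # Either string would be sent as the user message to the model.
    print(bare_prompt)
    print(structured_prompt)
```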
speaker 3: I'm always surprised that a lot of people don't have the instinct to do this: when I want the model to, say, learn a prompting technique, a lot of the time people will start by describing the prompting technique themselves. And I'm just like: give it the paper. So I do. I give it the paper and say, here's a paper about a prompting technique, I just want you to write down 17 examples of this. And then it just does it, because it read the paper.
speaker 1: That's interesting.
speaker 3: And I think people somehow don't have that intuition, even though the paper exists.
speaker 4: When would you want to do this?
speaker 3: Sometimes if I want models to, say, prompt other models, or I want to test a new prompting technique. If a paper comes out on a prompting technique, rather than trying to replicate it by writing up the prompt myself, I just give the model the paper and say: basically, write a meta prompt for this. Write something that would cause other models to do this, or write me a template. All the stuff you would normally do yourself. If I read a paper and think, oh, I'd like to test that style, it's right there: the model can just read the paper, do what I did, and then make another model do the thing. And then it just does it, and you're like, great, thanks.
speaker 2: I give that advice a lot to customers: just respect the model and what it can do. People often feel like they're babying a system when they write prompts, like, oh, it's this cute, not-that-smart little thing, I need to really baby it and dumb things down to Claude's level. If you just assume Claude is smart and treat it that way, it tends to do pretty well. It's the same idea as giving it the paper: I don't need to write a dumbed-down version of this paper for Claude to understand, I can just show it the paper. I don't know that that intuition comes naturally to everyone, but it's certainly something I've come to do more of over time.
speaker 3: And it's interesting, because I do think prompting both has and hasn't changed. What I will do to prompt the models has probably changed over time, but fundamentally a lot of it is imagining yourself in the place of the model; maybe it's just that how capable you think the model is changes over time. Someone once laughed at me because I was thinking about a problem, and they asked me what I thought the output of something would be, and they were talking about a pretrained model. I said, yeah, if I'm a pretrained model, this looks like this. And they said, wait, did you just simulate what it's like to be a pretrained model? And I'm like, yeah, of course. I'm used to trying to inhabit the mind space of a pretrained model and the mind space of different RLHF models. So it's more that the mind space you try to occupy changes, and that changes how you end up prompting the model. That's why I now just give models papers: as soon as I inhabit the mind space of this model, I realize it doesn't need me to baby it, it can just read ML papers. I'll just give it the literature. I might even ask: is there more literature you'd like to read to understand this better?
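A rough sketch of that "just give it the paper" workflow, assuming the Anthropic Python SDK's messages.create interface; the file name, model ID, and exact wording are placeholders.

```python
# Sketch: paste a paper about a prompting technique into the prompt and ask for
# worked examples plus a reusable meta prompt. File path and model name are
# placeholders; this assumes the Anthropic Python SDK.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

paper_text = open("prompting_technique_paper.txt").read()  # hypothetical file

request = f"""Here is a paper describing a prompting technique:

<paper>
{paper_text}
</paper>

First, write 17 diverse examples of this technique being applied.
Then write a meta prompt: a template another model could follow to apply the
technique to an arbitrary task."""

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4096,
    messages=[{"role": "user", "content": request}],
)
print(response.content[0].text)
```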
speaker 4: Do you get any qualia when you're inhabiting the mind space?
speaker 3: I mean, yes, but only because I'm experiencing qualia all the time anyway.
speaker 4: Is it different, like, correlated somehow with which model you're inhabiting?
speaker 3: Yeah. Pretrained versus RLHF prompting are very different beasts. When I'm trying to simulate what it's like to be a pretrained model, it's almost like I land in the middle of a piece of text, which is a very un-human-like experience, and ask: what happens next? What keeps going from this point? Whereas with an RLHF model, there are lots of subtle things in the query that I might pick up on, but it's much easier to inhabit the mind space of an RLHF model.
speaker 4: Is that because it's more similar to a human?
speaker 3: Yeah, because we don't often just suddenly wake up and find ourselves generating text from the middle of a document.
speaker 2: For me it's easier to hit the mind space of the pretrained model. I don't know what it is, but RLHF is still this kind of complex beast, and it's not super clear to me that we really understand what's going on there. In some ways it's closer to my lived experience, which makes it easier, but in some ways I feel like there are all these "here be dragons" out there that I don't know about. With a pretrained model, I kind of have a decent sense of what the Internet looks like.
speaker 3: You could take the text and say what comes next?
speaker 2: Yeah. I'm not saying I'd do well at it, but I kind of get what's going on there. Whereas with everything we do after pretraining, I don't really claim to get what's going on as much. Maybe that's just me.
speaker 4: That's something I wonder about: is it more helpful to have specifically spent a lot of time reading the Internet versus reading books? Because reading stuff that's not on the Internet is probably less valuable, per word read, for predicting what a model will do or for building intuition than reading random garbage from some niche social media forum.
speaker 2: Yeah, exactly.
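To show the difference in prompt shape behind those two mind spaces, here is a sketch contrasting a base-model continuation prompt with an RLHF chat prompt. The complete() and chat() calls are hypothetical stand-ins for whatever completion and chat interfaces you have, not real API functions.

```python
# Sketch: the two prompting "mind spaces" as prompt shapes.
# complete() and chat() are hypothetical stand-ins, not real API calls.

# Pretrained / base model: you land in the middle of a document and it continues.
base_prompt = (
    "Chapter 7: Maintenance\n\n"
    "To replace the filter, first disconnect the pump. Then"
)
# continuation = complete(base_prompt)

# RLHF chat model: you address it as an assistant with an explicit request.
chat_messages = [
    {
        "role": "user",
        "content": "How do I replace the filter on this pump? "
                   "Walk me through it step by step.",
    }
]
# reply = chat(chat_messages)
```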
speaker 1: Okay, so that's the past. Now let's move on to the future of prompt engineering. This is the hottest question right now. Are we all going to be prompt engineers in the future? Is that going to be the final job remaining, nothing left except all of us talking to models all day? What does this look like? Is prompting going to be necessary, or will these models just get smart enough in the future to not need it? Anybody want to start on that easy question?
speaker 2: To some extent, the models getting better at understanding what you want them to do means the amount of thought you need to put in goes down. There's an information-theory way to think about this: you need to provide enough information that the thing is specified, that what you want the model to do is specified. To the extent that that's prompt engineering, I think it will always be around: the ability to actually, clearly state what the goal should be. It's funny, if Claude can do that, if Claude is the one setting the goals, then all bets are off. But in the meanwhile, where we can reason about the world in a more normal way, I think to some extent it's always going to be important to be able to specify what you expect to happen. And that's actually sufficiently hard that even if the model gets better at intuiting it from between the lines, I still think there's some amount of writing it down well. But the tools and the ways we get there should evolve a lot. Claude should be able to help me a lot more; I should be able to collaborate with Claude a lot more to figure out what I need to write down and what's missing.
speaker 3: I think Claude already does this with me all the time. I just use Claude as my prompting assistant.
speaker 2: But that's not true for most customers I talk to, at the very least. So in terms of the future, how you prompt Claude now, or how Zach does, is probably a decent direction for what the future looks like. Maybe this is a decent place to step back and ask how you two prompt Claude now, since that's probably the future for the vast majority of people, which is an interesting thing to think about.
speaker 4: One freezing-cold take is that we'll use models to help us much more with prompting in the future. The reason I say it's freezing cold is that I expect we'll use models for everything more, and prompting is something we have to do, so we'll probably just use models more for it, along with everything else. For myself, I've found I use models to write prompts more. One thing I've been doing a lot is generating examples by giving some realistic inputs to the model; the model writes some answers, I tweak the answers a little bit, which is a lot easier than having to write the full, perfect answer myself from scratch, and then I can churn out lots of these. As for people who haven't had as much prompt engineering experience, the prompt generator can give them a place to start. But I think that's just a super basic version of what we'll have in the future, which is high-bandwidth interaction between you and the model as you're writing the prompt, where you're giving feedback like, hey, this result wasn't what I wanted, how can you change it to make it better, and people just grow more comfortable integrating the model into everything they do, and into this in particular.
speaker 3: Yeah, I'm definitely working a lot with meta prompts now; that's probably where I spend most of my time, finding prompts that get the model to generate the kinds of outputs or queries that I want. On the question of where prompt engineering is going, I think it's a very hard question. On the one hand, maybe what we're doing when we prompt engineer is what you said: I'm not prompt engineering for anything that's easy for the model. I'm doing it because I want to interact with a model that's extremely good, and I always want to be finding the top 1%, the top 0.1%, of performance, all the things models can barely do. Sometimes I actually feel like I interact with a model that's a step up from what everyone else interacts with, for this reason, because I'm just so used to eking out the top performance from models.
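Here is a small sketch of the example-generation loop Zach describes, where the model drafts answers for realistic inputs and a human only edits them. The draft_answer and edit callables are assumed stand-ins for a model call and a manual review step.

```python
# Sketch: build few-shot examples by letting the model draft answers and a human
# edit them. draft_answer() wraps whatever model call you use; edit() is a manual
# review step. Both are hypothetical stand-ins.

from typing import Callable


def build_examples_block(
    realistic_inputs: list[str],
    draft_answer: Callable[[str], str],
    edit: Callable[[str, str], str],
) -> str:
    """Return an <examples> block of model-drafted, human-edited input/output pairs."""
    examples = []
    for inp in realistic_inputs:
        draft = draft_answer(inp)      # the model does the heavy lifting
        final = edit(inp, draft)       # tweaking is cheaper than writing from scratch
        examples.append(
            f"<example>\n<input>{inp}</input>\n<output>{final}</output>\n</example>"
        )
    return "<examples>\n" + "\n".join(examples) + "\n</examples>"
```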
speaker 4: What do you mean by a step up?
speaker 3: As in, compared to the everyday models people interact with out in the world, it feels like I'm interacting with, I don't know how to describe it, an advanced version of that, almost a different model, because people will say, oh, the models find this thing hard, and I'm like, that thing is trivial. I have a sense that they're extremely capable, but I think that's just because I'm used to really drawing out those capabilities. But imagine you're now in a world where, and I think this is the thing that feels like a transition point, the models just get things at a human level on a given task, or even an above-human level, where they know more about the background of the task than you do. What happens then? Maybe prompting becomes something like: I explain to the model what I want, and it is kind of prompting me. It's like, okay, actually there are four different concepts of this thing you're talking about; do you want me to use this one or that one? Or, by the way, I thought of some edge cases: you said the input is going to be a pandas DataFrame, but sometimes I get a JSONL file instead, and I just want to check what you want me to do there. Do you want me to flag it if I get something that's not a DataFrame? That could be a strange transition, where the model is extremely good at receiving instructions but actually has to figure out what you want. I could see that being a kind of interesting switch.
speaker 2: Anecdotally, I've started having Claude interview me a lot more. That's the specific way I've tried to elicit information, because, again, I find the hardest thing is actually pulling the right set of information out of my brain, putting it into a prompt, and not forgetting stuff. So specifically asking Claude to interview me, and then turning that into a prompt, is a thing I've turned to a handful of times.
speaker 3: Yeah. It kind of reminds me of how designers talk about interacting with the person who wants the design. In some ways it's a switch from the temp agency worker, where you know more about the task than they do, so you give them the instructions and explain what to do in edge cases and all that, to the expert you're actually consulting to do some work. Designers can get really frustrated because they know the space of design really well, and the client comes to them and just says, make me a poster, make it bold, and they're like, yeah, that means 7,000 different things to me, so I'm going to ask you some questions. So I could see it going from being like a temp agency employee to being more like a designer you're hiring, and that's a flip in the relationship. I think both might continue, but I could see that being why people ask whether prompt engineering is even going to be a thing in the future. For some domains it might just not be, if the models are so good that all they actually need to do is get the information out of your brain and then go do the task.
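A sketch of the "have Claude interview me" pattern David mentions: the prompt asks the model to pull requirements, including edge cases like the DataFrame-versus-JSONL one, out of your head before drafting anything. The wording is illustrative.

```python
# Sketch: ask the model to interview you about an underspecified task before it
# drafts a prompt. Wording and tags are illustrative, not a canonical recipe.

INTERVIEW_PROMPT = """I need a prompt for the following task, but I haven't fully
specified it yet:

<task>
{rough_task_description}
</task>

Before writing anything, interview me. Ask one question at a time about goals,
input formats, edge cases (for example, what to do if the input is not in the
expected format), tone, and output structure. When you are confident you
understand what I want, say READY and then draft the prompt."""


def interview_prompt(rough_task_description: str) -> str:
    return INTERVIEW_PROMPT.format(rough_task_description=rough_task_description)
```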
speaker 1: Right, that's actually a really good analogy. One common thread I'm pulling out of all your responses is that there seems to be a future in which this sort of elicitation from the user, drawing out that information, becomes much more important than it is right now, and you're all already starting to do it in a manual way. On the enterprise side, maybe that looks like an expansion of the prompt-generator concept in the console, where you're able to get more information from the enterprise customer so they can write a better prompt for Claude; maybe it looks less like typing into a text box and more like a guided interaction toward a finished product. I think that's actually a pretty compelling vision of the future, and the design analogy really brings it home.
speaker 4: I was thinking about how prompting now can be kind of like teaching, where you have empathy for the student: you're trying to think about how they think about things, and to figure out where they're making a mistake. But at the point you're describing, the skill almost becomes one of introspection, where you're thinking about what it is that you actually want and the model is trying to understand you. So it's making yourself legible to a model that may be smarter than you, rather than teaching a student.
speaker 3: This is actually how I think of prompting now, in a strange way. There are various things I do, but a common one, very much a thing philosophers do, is that I'll define new concepts. My thought is that you have to put into words what you want, and sometimes what I want is fairly nuanced. What is a good chart? When should you grade something as correct or not? So there are cases where I'll just invent a concept and then say, here's what I mean by this concept. Sometimes I'll do it in collaboration with Claude, to get it to help me figure out what the concept even is, because I'm trying to convey to it what's in my head. Right now the models aren't trying to do that with us unless you prompt them to, so in the future it might just be that they can elicit that from us, rather than us having to do it for them.
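And a sketch of the concept-definition move Amanda describes: rather than hoping the model shares your nuanced notion of, say, a good chart, you name the concept, spell out what you mean, and grade against that definition. The term and criteria below are invented for illustration.

```python
# Sketch: define a bespoke concept in the prompt and grade against it.
# The concept name and its criteria are invented for illustration.

CHART_GRADING_PROMPT = """I am going to use the term "presentation-ready chart".
By this I mean a chart that:
(1) has labeled axes with units,
(2) can be understood without reading the surrounding text, and
(3) does not truncate the y-axis in a way that exaggerates differences.

Grade the following chart description as PRESENTATION-READY or NOT, and name the
clause of the definition it fails, if any.

<chart_description>
{chart_description}
</chart_description>"""


def grading_prompt(chart_description: str) -> str:
    return CHART_GRADING_PROMPT.format(chart_description=chart_description)
```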
speaker 3: Another thing that's kind of interesting: people have sometimes asked me where philosophy is relevant to prompting, and I actually think it's very useful. There's a style of philosophy writing, at least the way I was taught to write philosophy, where the idea, and I think it's basically an anti-bullshit device in philosophy, is that your papers should be legible to an educated lay person: someone who just finds your paper, picks it up, starts reading it, and can understand everything. Not everyone achieves this, but that's kind of the goal of the discipline, or at least what we teach people. So I'm really used to the idea of writing while thinking about an educated lay person who is really smart but doesn't know anything about this topic. That was years and years of writing text of that form, and I think it was really good training for prompting, because I was used to it: I have an educated lay person who doesn't know anything about the topic, and I need to take extremely complex ideas and make them understandable. I don't talk down to them, I'm not inaccurate, but I need to phrase things in a way that's extremely clear to them. And prompting felt very similar. The teaching techniques are similar too, like the thing you mentioned where you say to a person, just take that thing you said and write it down. I used to say that to students all the time. They'd write a paper and I'd say, I don't quite get what you're saying here, can you just explain your argument to me? They would give me an incredibly cogent argument, and I'd say, can you just take that and write it down? If they did, it was often a great essay. So it's really interesting that there's at least that similarity: taking things that are in your brain, analyzing them enough to feel like you fully understand them, and being able to take any reasonable person off the street and externalize your brain into them. I feel like that's the core of prompting.
speaker 2: That might be the best summary of how to prompt well that I've heard. In fact, I'm pretty sure it is.
speaker 4: Externalize your brain.
speaker 2: And assume a reader who is educated but has no background in the thing. That's a really good way to describe it.
speaker 1: That's, I think, a great way to wrap this conversation. Thank you, guys. This was great.