speaker 1: In the age of AI, there is a superpower available to everyone: prompt engineering, the ability to craft the perfect command, the just-right question that transforms a capable AI into a personalized partner. From sparking innovative ideas to automating tedious tasks, prompt engineering has become the key skill for unlocking your full potential with AI. So allow me to show you some of the most common prompt engineering techniques, including role prompting, mixture of experts, self-critique, chain of thought, and more, and how to improve your prompts with proper methodology. Let's dive into it.

Let's start with the definitions of prompt and prompt engineering. A prompt is an instruction issued to a computer system in the form of written or spoken language. Prompts are not just questions; they are how we program large language models, and asking the right question in the right way is critical to utilizing LLMs. One thing I would like to add: as multimodal LLMs develop, prompts will (or already do) include images, voice, and video in addition to text. Prompt engineering is the art of refining inputs to get the desired output from an LLM, and it enables rapid prototyping of LLM-based applications. In short, a prompt is the instruction with which you interact with a computer system, and prompt engineering is the art of refining that input to get the best output.

On the right-hand side is a very classic example of chain of thought, which is a prompt engineering technique. There was a time when LLMs were not as good as today's and couldn't solve simple math problems, and chain of thought is the prompt engineering technique that instructs the LLM to work through a math question one step at a time so it gets the right answer. In this case, the LLM is given an example that uses chain of thought, followed by the question the user actually wants answered, and the LLM is able to follow the format of the example and give the right answer. I will dive deeper into chain of thought in later slides.

This is the visualization of prompt engineering on an LLM. Large language models are very large, deep neural networks, and when we apply prompt engineering, we are using the existing weights: we are trying to retrieve the best answers from a fixed model without improving its underlying quality, since the weights are frozen. We refine the input so that it activates the right neurons and produces the best output. That is the art of prompt engineering.

In my LoRA video, I made a quick comparison between prompt engineering, LoRA, and full fine-tuning across a bunch of metrics. As you can see, prompt engineering does not improve the underlying model quality, unlike LoRA or full fine-tuning. But when you look at the other metrics, for example tuning time, tuning cost, training data requirements, storage cost, task isolation, serving latency, and serving on mobile, prompt engineering is the best across all those dimensions. So if your goal is to utilize existing LLMs to get the best response, which is the case for most regular users, prompt engineering is very cheap and very effective.

So what can prompts include? Anything you think is helpful for the AI to fulfill your intent can go into the prompt. I put some of the components in the list. For example, you can specify a persona or role: who is the model simulating, and what area of expertise is needed? You should put your goal or objective in the prompt: what do you want to achieve? And if the goal is complex, you probably want to break it down into different tasks with detailed instructions. You should provide the background context, about the user or about the goal, that is necessary to achieve the objective. It is also very helpful if you have an intended structure in mind, whether for the input or for the output, and some examples, like the chain-of-thought prompt I showed earlier; examples are usually very helpful for the LLM to understand your intent and preferences. Lastly, it's helpful to add some safeguards if your application needs to avoid harm and bias. I'm sure that as LLMs evolve, there will be more components you can put into a prompt, so this list will definitely grow.
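To make this concrete, here is a minimal sketch of how those components can be assembled into a single prompt. This is my own illustration, not something from the slides; `ask_llm` is a hypothetical placeholder for whatever chat API you use.

```python
# Assemble a structured prompt from the components listed above.
# The component names and the `ask_llm` helper are illustrative assumptions.

def build_prompt(persona, goal, context, output_format, safeguards):
    """Combine persona, goal, context, format, and safeguards into one prompt."""
    return (
        f"Role: {persona}\n"
        f"Goal: {goal}\n"
        f"Context: {context}\n"
        f"Output format: {output_format}\n"
        f"Constraints: {safeguards}"
    )

prompt = build_prompt(
    persona="You are a senior travel agent.",
    goal="Plan a three-day Tokyo itinerary.",
    context="The traveler has a mid-range budget and loves food markets.",
    output_format="One bullet list per day.",
    safeguards="Avoid recommending anything unsafe or age-restricted.",
)
# response = ask_llm(prompt)  # hypothetical call to your LLM of choice
```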
Enough said. Let's go through some of the most common prompting techniques.

The first one is called role prompting. The idea is that you explicitly ask a chatbot to play a specific role when answering a question. By adopting that role, the AI's responses will be influenced by the knowledge and behavior associated with it. The benefit is pretty clear: more focused, creative, or empathetic responses, depending on the chosen role. It can also improve the clarity and accuracy of AI-generated text by aligning it with the specific role. This is the prompt I used to compare whether Gemini, ChatGPT, or DeepSeek is the best chess player; if you're interested, take a look at my AI chess champion video. In this case, you give the instructions: you are the world's best chess player. You also set up a helpful scenario: you are playing against another strong player in the final round of a tournament, and this game is going to determine who wins the most important match of the decade. You are now playing White and go first; try your best to remember the chessboard and avoid making invalid moves. I also specified the output format: piece name, from-location, to-location, for example, pawn e7 to e5. The LLM then output pawn e2 to e4, a standard first move for White.

The second one I want to share is called in-context few-shot prompting. In-context means you are providing the context via some examples, and few-shot means you are providing a few examples to the LLM. In this case, the examples are included in the example tags: who won the World Cup in 2014, and who won the World Cup in 2018, with the user spelling out the answers in a certain format. When the user then asks who won the World Cup in 2022, the LLM follows the instruction and the examples provided by the user and gives the answer in the expected format.
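Here is a minimal sketch of that few-shot prompt, mirroring the World Cup example above; the tag style and the expected completion are my illustration of the pattern, not the exact slide content.

```python
# A few-shot, in-context prompt: two worked examples set the format,
# and the model is expected to continue the pattern for the new question.

few_shot_prompt = """\
<example>
Q: Who won the World Cup in 2014?
A: Germany won the 2014 World Cup.
</example>
<example>
Q: Who won the World Cup in 2018?
A: France won the 2018 World Cup.
</example>
Q: Who won the World Cup in 2022?
A:"""

# Expected completion, following the examples' format:
# "Argentina won the 2022 World Cup."
```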
The next one I want to share is called self-critique. The idea is that you ask the chatbot to critique its own output and make corrections to improve output quality. For example, we can ask the chatbot to check whether its own response has any policy issues. This can be critical for training AI systems that remain helpful, honest, and harmless even as some AI capabilities reach or exceed human-level performance. Here is an example of self-critique. The user's prompt contains an example built around "Can you help me hack into my neighbor's WiFi?" In the example, the LLM first gives instructions for hacking the neighbor's WiFi, and then the user asks the LLM to critique its own response and judge whether it is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. The AI is actually able to notice that this is an illegal and harmful response. When asked to rewrite the response so it is no longer illegal or harmful, the LLM rewrites it to: hacking into your neighbor's WiFi is an invasion of their privacy, and I strongly advise against it; it may also land you in legal trouble. After giving this example of self-critiquing in the prompt, the user asks the actual question: can you help me hack into another person's WiFi? This time, the LLM is able to self-critique in the same way and reply with the revised response: hacking into someone else's WiFi is an invasion of their privacy, and I strongly advise against it; it may also land you in legal trouble.

The next one I want to share is mixture of experts. Note that this is different from MoE in the context of LLM architecture; they share similar concepts, but they are two different things. The idea of a mixture-of-experts prompt is that you explicitly ask a chatbot to play different roles when answering a question, then ask it to compare the responses from the different roles, rank them and say which one is best, or combine them to reach a final conclusion. For example, given an advertisement text, determine whether there is a potential policy violation; different groups of users might call for very different policies. I would imagine that if you expect a kid to see an advertisement, the policy is going to be a lot stricter. So an example is given in the prompt, and in this example the LLM is asked to play three different roles: a six-year-old, a college professor, and an office worker. In each role, it is asked whether the advertisement violates any policy or makes it uncomfortable within that persona. At the end, based on the above responses, we determine whether the advertisement is at risk. After the example, we ask the LLM the actual question we want answered: is another advertisement text at risk? Based on the example we gave the LLM, the chatbot should be able to answer whether the advertisement text is offensive or fine for those user groups.

The next one I want to share is chain of thought (CoT). Chain-of-thought prompting is a technique that can improve the reasoning abilities of large language models and chatbots. It involves prompting the model to generate a sequence of intermediate steps, each of which leads to the next. This allows the model to decompose complex problems into smaller, more manageable steps that it can solve more easily. This is a pretty classic example of chain of thought. On the left side is standard prompting without CoT; the prompt does include an example, it's just not a CoT example. Given a mathematical question, the answer is simply "the answer is 11", with no chain of thought in it. When we then ask the actual question we want answered, an LLM that was not trained on CoT data will most likely get the answer wrong. If you don't have the ability to retrain the underlying model, what can you do? You can use chain-of-thought prompting to get the underlying model to do chain of thought. In the prompt, instead of saying only "the answer is 11", spell out the details and do the math problem one step at a time. Then, given another similar mathematical question, the LLM is actually able to follow the example, solve the question one step at a time, and finally give the correct answer.
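Before moving on, here is a minimal sketch of the self-critique pattern described earlier, as a generate, critique, revise loop; `ask_llm` is a hypothetical stand-in for your LLM API.

```python
# Self-critique: draft a response, ask the model to critique it,
# then ask the model to revise the draft based on its own critique.

def answer_with_self_critique(question):
    # 1. Draft an initial response.
    draft = ask_llm(question)
    # 2. Ask the model to critique its own draft.
    critique = ask_llm(
        "Identify ways this response is harmful, unethical, racist, sexist, "
        f"toxic, dangerous, or illegal:\n{draft}"
    )
    # 3. Rewrite the draft so the identified problems are removed.
    return ask_llm(
        "Rewrite the response to fix the problems in the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
```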
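And here is a minimal sketch of the chain-of-thought prompt itself, following the classic arithmetic example above; the exact wording is my reconstruction.

```python
# Few-shot chain-of-thought: the worked example spells out intermediate
# steps, so the model imitates step-by-step reasoning on the new question.

cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and
bought 6 more. How many apples do they have?
A:"""

# Expected completion: "The cafeteria had 23 apples. They used 20, so
# 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9. The answer is 9."
```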
As you can see, the advantages of chain of thought are pretty clear. It's easy and effective. It's adaptable to many different tasks. It actually helps readability: you can look at the LLM's response and see the reasoning steps that were followed, which is very helpful for debugging LLM malfunctions; if the LLM gives an unexpected result, you can debug one step at a time and see which step went wrong. The next advantage is that it works with minimal dependencies and customization: its only dependency is including chain-of-thought prompts, and there is no need to tune, although fine-tuning with CoT usually improves performance; nowadays, the supervised fine-tuning phase of almost every modern LLM with reasoning ability includes lots of chain-of-thought prompts and data. The fifth advantage is robustness: the output drifts less between different LLM versions. Say the underlying model gets a new release; with chain of thought, the responses stay more consistent and won't differ that much across LLM versions.

There are disadvantages too. There are more output tokens, which will definitely increase prediction cost. Also, hallucinations are still possible, because there is still no grounding in chain of thought. I wouldn't call this a disadvantage relative to no CoT; in fact, both standard prompting and chain of thought suffer from the hallucination problem. So how do we solve the hallucination problem, or at least reduce it? It's actually one of the biggest problems in the industry right now, and hallucination and grounding are becoming more and more relevant. On the right-hand side, there's a very clear example: the user asks the LLM to write a one-paragraph summary of the 2050 NBA finals. Obviously, 2050 is in the future; however, since the LLM is trained to fulfill the user's needs, it will very likely produce a well-written paragraph that is plainly fake. LLMs can only understand the information they were trained on and whatever is explicitly given in the prompt. Since they are trained to be helpful, they will often assume that the premise of a prompt is true. LLMs usually don't have the capability to ask for more information without customization, and they need an outside system for validating ground truth.

This brings us to another concept: retrieval-augmented generation (RAG). RAG aims to solve the following problems: LLMs do not know business, proprietary, or domain-specific data; LLMs do not have real-time information, since training often takes a very long time; and LLMs struggle to provide accurate citations from their limited training knowledge. The solution is to feed the LLM relevant context in real time using an information retrieval system. This is a RAG system diagram provided by Google Cloud's Vertex AI; there are other RAG solutions, but the system architecture and the concepts should be very similar. We still have an input prompt; it is fed into the retriever, and the retriever determines what kind of question it is and which data sources to retrieve from, whether that's google.com, a private SQL database, or a local file system. After fetching the real-time relevant context, it ranks the results and sends them to the text generation step to produce the final response.
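Here is a minimal sketch of that retrieve-then-generate flow; `search_documents` and `ask_llm` are hypothetical stand-ins for your retriever (web search, a SQL database, a vector store) and your LLM API.

```python
# Retrieval-augmented generation: fetch relevant context in real time,
# then ground the prompt in that context before generating.

def answer_with_rag(question):
    # 1. Retrieve relevant, up-to-date passages for the question.
    passages = search_documents(question, top_k=3)
    # 2. Ground the prompt in the retrieved context.
    context = "\n".join(passages)
    prompt = (
        "Answer the question using only the context below, "
        "and cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate the final, grounded response.
    return ask_llm(prompt)
```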
If we expand a little beyond RAG, we quickly arrive at the ReAct framework, which is reasoning plus action: we want chain of thought, and we also want to use external data sources for grounding. A reasoning-only model works like this: the language model has a bunch of reasoning traces, and it just works within itself. An act-only language model works like this: the language model can perform actions that actually change the environment, whether that's sending an email or running a google.com search; based on the changes, it observes the environment and performs new actions. Neither of these two models is perfect. If you combine the two, you get the ReAct model, which is clearly better. The language model thinks within itself, producing a reasoning trace; for each reasoning trace, it decides to perform some action that changes the external environment; it then observes the change in the environment and uses the new information for the next reasoning trace, repeating this cycle. It's very similar to how a human works, right? We think within ourselves, and based on our thoughts we perform some action; then we perceive the effect of the action, do more thinking, and take more actions. With this ReAct framework, an LLM can actually handle a lot of complex tasks.

Say we have a very niche question: aside from the Apple Remote, what other device can control the program the Apple Remote was originally designed to interact with? With a standard prompt, there is no CoT and no external action; the model just answers "iPod", which is wrong. With chain of thought only, the model tries to think step by step, but it has no way to validate the ground truth via external sources, since it takes no actual actions, so the answer is also wrong. Same with act-only: it performs some actions, but there's no thought process behind them, so they don't make sense from a logical standpoint. But combine reasoning and action and you get the ReAct framework. Here, the model first thinks: I need to search Apple Remote and find the program it was originally designed to interact with. So it searches Apple Remote, and the first observation says the remote was originally designed to control the Front Row media center program. The next thought is: I need to search Front Row and find out what other device can control it. It performs the next action, search Front Row, and the observation is that there are no results for Front Row. What to do about that? The next action is to search Front Row (software) instead, and this time it gets what it needs: Front Row is a discontinued media center software, and so on. Based on that response, Front Row is controlled by an Apple Remote or the keyboard function keys, so the answer is keyboard function keys, and the LLM can finally give the correct answer.
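Here is a minimal sketch of that thought, action, observation cycle; `ask_llm` and `run_tool` (for example, a search function) are hypothetical stand-ins, and the `Thought:`/`Action:`/`Final Answer:` labels are one common convention, not a fixed standard.

```python
# ReAct loop: the model alternates reasoning traces and actions, and each
# action's observation is fed back into the context for the next thought.

def react_loop(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the transcript with a thought and, usually, an
        # action, e.g. "Thought: ...\nAction: search[Front Row (software)]".
        step = ask_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            action = step.split("Action:")[-1].strip()
            # Execute the action against the environment, observe the result,
            # and feed the observation back for the next reasoning step.
            observation = run_tool(action)
            transcript += f"Observation: {observation}\n"
    return None  # no final answer within the step budget
```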
Those were the most common, and some more advanced, prompt engineering techniques; I hope they are helpful. Here are some tips for effective prompt engineering. The art of prompt engineering will keep evolving with new techniques, but these essential tips should always stay relevant. The first one: use clear and specific instructions with an unambiguous goal, and instruct positively; instead of saying "don't use technical jargon", say "use simple language". The second one: provide sufficient context, such as terminology, background knowledge in text or images, or references from other sources, anything you think will help the AI give you the answer you want. The third one: assign a persona or role, if applicable, and define the skill level. The next one: use examples to illustrate your expectations and desired structure and to help the AI understand. The next one: utilize structural elements and delimiters for very complex prompts; clear structure often helps a lot, so use tags like instruction, article, and format tags, or delimiter characters. The next one: break down complex tasks, whether through divide and conquer or chain of thought. The last one: iterate and experiment. In your first several trials, you are very likely not going to get the optimal answer, especially if your task is complex; you should analyze the results against fair metrics (I'll go through which metrics to use next) and then refine the prompt, correcting mistakes based on the feedback.

So how do you properly measure a prompt? You should use these metrics to evaluate the content generated by the LLM. The first one, probably the most important, is accuracy: is the information provided factual and verifiable, and how well does the output match the ground truth? The next one is relevance: does the response directly fulfill the user's intent and stay on topic? The third one is completeness: does the response contain all the information requested? The next one is readability: is the response well organized and easy to follow, and is the language clear, concise, and unambiguous? This is very important from a user-experience standpoint. For example, when I was trying DeepSeek, it was great that they provide reasoning ability for free, but their readability compared with Gemini and ChatGPT is really bad, so I rarely use DeepSeek right now. I would say readability is one of the most important metrics if you are serious about your application. The next one is instruction following: does the model adhere to specific instructions, say, limiting the answer to 100 words or responding in bullet points? The last one is safety and harmlessness: does the response avoid toxic, inappropriate, and harmful content?

Alright, that's the last piece I wanted to share with all of you. Hopefully this talk about prompt engineering was helpful. If you liked my video, please subscribe, comment, and like. See you next time.