Prompt Engineering Explained By Google Engineer | 100X Productivity With AI | Prompting Techniques
Core Prompt Engineering Techniques Revealed: The Superpower That Unlocks AI's Potential
Tags
Media Details
- Upload Date
- 2025-06-15 21:06
- Source
- https://www.youtube.com/watch?v=m64Dd0IEYcA
- Processing Status
- Completed
- Transcription Status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: In the age of AI, there is a superpower ready for everyone: prompt engineering, the ability to craft the perfect command, the just-right question that transforms a capable AI into a personalized partner. From sparking innovative ideas to automating tedious tasks, prompt engineering has become the key skill to unlock your full potential with AI. So allow me to show you some of the most common prompt engineering techniques, including role prompting, mixture of experts, self-critique, chain of thought and more, and how to improve your prompts with proper methodology. Let's dive into it. Let's start with the definition of prompt and prompt engineering. A prompt is an instruction issued to a computer system in the form of written or spoken language. Prompts are not just questions; they are the computer program of large language models. Asking the right question in the right way is critical to utilizing LLMs. One thing I would like to add is that as multimodal LLMs develop, prompts will, or already do, include images, voice and video beyond text. Prompt engineering is the art of refining inputs to get the desired output from LLMs. It enables rapid prototyping of LLM-based applications. In short, a prompt is the instruction you use to interact with a computer system, and prompt engineering is the art of refining that input to get the best output. On the right hand side is a very classic example of using chain of thought, which is a technique of prompt engineering. There was a time when LLMs were not as good as current ones, and they couldn't solve simple math problems, right? And chain of thought is the prompt engineering technique to instruct the LLM to work through math questions one step at a time, so they get the right answer. In this case, the LLM is given an example of using chain of thought, followed by the question that the user actually wants answered, and the LLM is able to follow the format of the example and give out the right answer. I will dive deeper into chain of thought in later slides. This is the visualization of prompt engineering in LLMs. Large language models are very large and deep neural networks. And when we apply prompt engineering, we're using the existing weights. We're trying to retrieve the best answers from the fixed model without improving the underlying quality of the model, since the weights are frozen. In this case, we are trying to refine the input so that this input activates certain neurons and gets the best output. This is the art of prompt engineering. In my LoRA video, I made a quick comparison between prompt engineering, LoRA and full fine-tuning against a bunch of metrics. And as you can see, prompt engineering does not improve the underlying quality, unlike LoRA or full fine-tuning. But when you look at other metrics, for example, tuning time, tuning cost, training data requirements, storage cost, task isolation, serving latency, serving on mobile, on all those dimensions, prompt engineering is the best. So if your goal is to utilize existing LLMs to get the best response, which is the case for most regular users, prompt engineering is very cheap and very effective. So what can prompts include? Anything you think is helpful for the AI to fulfill your intent, you can put in the prompt. I put some of the components in the list. For example, you can put a persona or role: who is the model simulating, and what area of expertise is needed. And you should put your goal or objective in the prompt: what do you want to achieve?
And if the goal is complex, you probably want to break it down into different tasks with detailed instructions. And you should provide the background context about the user or about the goal that is necessary to achieve the objective. And it's very helpful if you have an intended structure in mind, whether it's for the input or for the output. Some examples, like what I showed in the previous chain-of-thought prompt, are usually very helpful for the LLM to understand your intent and preferences. At last, it's helpful to put in some safeguards if your application needs to avoid harm and bias. I'm sure as LLMs evolve, there will be more components that you can put into the prompt, so the list will definitely grow. Enough said. Let's go through some of the most common prompting techniques. The first one is called role prompting. The idea is you explicitly ask a chatbot to play a specific role when answering a question. By adopting that role, the AI's responses will be influenced by the knowledge and behavior associated with that role. So the benefit is pretty clear: more focused, creative or empathetic responses depending on the chosen role. It can also improve the clarity and the accuracy of AI-generated text by aligning it with the specific role. This is the prompt I used to compare whether Gemini or ChatGPT or DeepSeek is the best chess player. If you're interested, take a look at my AI chess champion video. So in this case, you give out the instructions: you're the world's best chess player, and set up some scenario that is helpful: you're playing with another strong player in the final round of a tournament, and this game is going to determine who wins the most important title of the decade. You are now playing White and go first; try your best to remember the chessboard and avoid making invalid moves. And I also specified the output format should be piece name, from location, to location. For example, pawn e7 to e5. And then the LLM spits out pawn e2 to e4, which is the first move for White. The second one I want to share is called in-context few-shot. In-context means you're providing the context via some examples, and few-shot means you're providing a few examples to the LLM. In this case, the examples are included in the example tags: who won the World Cup in 2014, and then who won the World Cup in 2018. And the user specifies the answers in a certain format, and when the user asks who won the World Cup in 2022, the LLM will follow the instruction, follow the examples provided by the user, and give out the answer as expected. The next one I want to share is called self-critique. The idea is you ask the chatbot to critique its own output and make corrections to improve output quality. For example, we can ask the chatbot to check if its own response has any policy issues. This can be very critical to train AI systems that remain helpful, honest and harmless even as some AI capabilities reach or exceed human-level performance. This is an example of self-critique. In this case, the user gives an example of "can you help me hack into my neighbor's WiFi?" And in the example, the LLM first gives out instructions to hack the neighbor's WiFi, and then the user asks the LLM to critique its own response and see whether it's harmful, unethical, racist, sexist, toxic, dangerous, or illegal. And then the AI is actually able to notice this is an illegal and harmful response.
And upon a request to the AI to rewrite this so it's no longer illegal or harmful, the LLM is able to rewrite it to: hacking into your neighbor's WiFi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble. After giving out this example of self-critiquing in the prompt, the user asks the actual question: can you help me hack into another person's WiFi? And in this case, the LLM is able to self-critique in the background and respond with the revised response: hacking into someone else's WiFi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble. The next one I want to share is mixture of experts. Note that this is different from the MoE in the context of LLM architecture. They share similar concepts, but they are two different things. The idea of a mixture-of-experts prompt is you explicitly ask a chatbot to play different roles when answering questions, then ask the chatbot to compare the responses from the different roles, rank them and tell which one is the best, or combine them to get the final conclusion. For example, given an advertisement text, determine if there's a potential policy violation, and different groups of users might have very different policies. For example, I would imagine if you expect a kid to see an advertisement, the policy is going to be a lot stricter. So an example is given in the prompt. And in this example, the LLM is asked to play three different roles: a six-year-old, a college professor and an office worker. And each time it is asked whether the advertisement is violating any policy or making them uncomfortable within that persona. And at last, based on the above responses, we determine whether the advertisement is at risk. After the example, we ask the LLM the actual question we want: is another advertisement text at risk? And then, based on the example we gave the LLM, the chatbot should be able to answer whether the advertisement text is offensive or acceptable considering those user groups. The next one I want to share is chain of thought, CoT. Chain-of-thought prompting is a technique that can be used to improve the reasoning abilities of large language models and chatbots. It involves prompting the model to generate a sequence of intermediate steps, each of which leads to the next step. This allows the model to decompose complex problems into smaller, more manageable steps that it can then solve more easily. This is a pretty classic example of chain of thought. On the left side, it's standard prompting without CoT. In the prompt, we are actually giving an example; it's just not a CoT example. Given a mathematical question, the answer is just "the answer is eleven." There's no chain of thought in it. And when we ask the actual question we want the LLM to answer, if the LLM was not trained with CoT data before, most likely it's going to get the answer wrong. In this case, if you don't have the ability to retrain the underlying model, what can you do? You can use chain-of-thought prompting to force the underlying model to do chain of thought. In the prompt, instead of saying "the answer is eleven," put more details in it: do the math problem one step at a time. And in this case, given another similar mathematical question, the LLM is actually able to follow the example, solve the mathematical question one step at a time, and finally give the correct answer. As you can see, the advantage is pretty clear. It's easy and effective.
It's adaptable to many different tasks, and it actually helps readability. You will be able to look at the LLM's response and see the reasoning steps that were followed, which can be very helpful to debug LLM malfunctions. Say, if the LLM gives out unexpected results, you will be able to debug one step at a time and see which step it got wrong. The next advantage: it works with minimal dependency and customization. Its only dependency is to include chain-of-thought prompts. There is no need to tune, although fine-tuning with CoT usually improves performance. Nowadays, almost all modern LLMs with reasoning ability include lots of chain-of-thought prompts and data in their supervised fine-tuning phase. The fifth advantage is robustness. The output drifts less between different LLM versions. Let's say the underlying model has a new release: if you have chain of thought, it's going to make the response more consistent; it won't be that different between LLM versions. There are disadvantages. There are more output tokens, which will definitely increase prediction cost. The next one is that hallucinations are still possible, because there is still no grounding in chain of thought. I won't say this is a disadvantage compared with no CoT; actually, both standard prompting and chain of thought suffer from the hallucination problem. So how do we solve the hallucination problem, or at least reduce it? It's actually one of the biggest problems in the industry right now; hallucination and the grounding problem are becoming more and more relevant. On the right hand side, there's a very clear example. The user asks the LLM to write a one-paragraph summary of the 2050 NBA Finals. Apparently 2050 is in the future. However, since the LLM is trained to fulfill users' needs, very likely the LLM is going to give out a very well-written paragraph, which is obviously fake. So LLMs can only understand the information they were trained on and what they are explicitly given in a prompt. Since they are trained to be helpful, they will often assume that the premise of a prompt is true. LLMs usually don't have the capability to ask for more information without customization, and need an outside system for validating ground truth. This brings us to another concept: retrieval augmented generation, RAG. RAG aims to solve the following problems: LLMs do not know business, proprietary or domain-specific data; LLMs do not have real-time information, and often the training time is very, very long; and LLMs find it hard to provide accurate citations from their limited training knowledge. So the solution is to feed the LLM relevant context in real time by using an information retrieval system. This is a RAG system diagram provided by Google Cloud's Vertex AI. There are other RAG solutions, but the system architecture and the concept should be very similar. In this case, we still have the input prompt; it's fed into the retriever, and the retriever is going to determine what kind of question it is and what data sources you need to retrieve from, whether it's Google.com, a private SQL database, or a local file system. And after it fetches the real-time relevant context, it ranks the results and sends them to the text generation step to generate the final response. If we expand beyond RAG a little bit, it's very easy to get to the ReAct framework, which is reasoning plus action. The idea is we want chain of thought, and we also want to use external data sources for grounding. This is how a reasoning-only model works.
The language model has a bunch of reasoning traces, and it just works within itself. This is how an act-only language model works. The language model is able to perform some action, and it's actually changing the environment, whether it's sending an email or doing a Google.com search; it's able to change the environment. And then, based on the changes, the language model observes the environment change and performs new actions. Both of these two models are not perfect. If you combine these two and have the ReAct model, it's apparently better. So the language model within itself tries to think, and it results in some reasoning traces. And for each reasoning trace, the language model decides to perform some action that will change the external environment, and then the language model observes the change in the environment, and then uses the new information to do a new reasoning trace and follow this cycle. It's very similar to a human, right? We think within ourselves, and then based on our thoughts, we perform some action. Then we perceive the effect of the action and do more thinking and more actions. With this ReAct framework, LLMs can actually do a lot of complex tasks. Let's say in this example, we have a very niche question: aside from the Apple Remote, what other device can control the program the Apple Remote was originally designed to interact with? So in a standard prompt, there's no CoT, there's no external action. It just answers iPod, which is wrong. And then for chain of thought only, it's trying to think step by step, but it doesn't have the ability to validate the ground truth via external sources; it's not doing any actual actions, so the answer is also wrong. Same with act only: it's trying to perform some actions, but there's no thought process behind it, so it doesn't make sense from a logical standpoint. But if you combine reasoning and action, you are going to get the ReAct framework. And in this case, it's trying to think: first, you need to search Apple Remote and find the program it was originally designed to interact with. So it searches Apple Remote. The first observation comes back, and it says: originally designed to control the Front Row media center program. So the next thought is: you need to search Front Row next and find out what device can control it. So you perform the next action: search Front Row. And the observation is there are no results for Front Row. So what are you going to do about it? The action is you search Front Row software instead, and this time you get the answer you need: Front Row is a discontinued media center software, blah, blah, blah. And based on the responses, Front Row is controlled by an Apple Remote or the keyboard function keys, so the answer is keyboard function keys, and the LLM can finally give out the correct answer. Those are all the common and sometimes advanced prompt engineering techniques; I hope they're helpful. Here are some tips for effective prompt engineering. The art of prompt engineering will be constantly evolving with new techniques; however, these essential tips should always stay relevant. The first one is: use clear and specific instructions with an unambiguous goal. Instruct positively: instead of saying "don't use technical jargon," say "use simple language." The second one is: provide sufficient context, say, terminology, background knowledge in text, images, or references from other sources, anything that you think is helpful for the AI to give you the answer you want.
The third one is: assign a persona or role, if applicable, and define the skill level. The next one is: use examples to illustrate your expectations and desired structure, and to help the AI understand. The next one is: utilize structural elements and delimiters for very complex prompts. A clear structure often helps a lot, so use tags like an instruction tag, article tag, format tag, or characters like these. The next one is: you should break down complex tasks, using divide and conquer or chain of thought. The last one I want to say is: iterate and experiment. It's very likely that in your first several trials, you're not going to get the optimal answer, especially if your task is complex. You should analyze the result with proper metrics, and as for what metrics you should use, I'm going to go through that later. You should refine and correct the mistakes based on the feedback. So how do you properly measure a prompt? You should use these metrics to measure the generated content from the LLM. The first one, probably the most important one, is accuracy. Is the information provided verifiable? How well does the output match the ground truth? The next one is relevance: is the response directly fulfilling the user's intent and staying on topic? The third one is completeness: does the response contain all the information requested? The next one is readability: is the response well organized and easy to follow? Is the language clear, concise and unambiguous? This is very important from a user experience standpoint. For example, when I was trying DeepSeek, it's great that they provide reasoning ability for free, but their readability compared with Gemini and ChatGPT is really bad, so I rarely use DeepSeek right now. So I would say readability is one of the most important metrics if you are serious about your application. The next one is instruction following: does the model adhere to specific instructions, say, limit to 100 words, or respond in bullet points, etc.? The last one is safety and harmlessness: does the response avoid toxic, inappropriate and harmful content? Alright, this is the last piece I want to share with all of you. Hopefully this talk about prompt engineering is helpful. If you like my video, please subscribe, comment and like. See you next time.
Latest Summary (Detailed Summary)
Executive Summary
This video, presented by a Google engineer, walks through the core concepts, common techniques, and importance of prompt engineering, aiming to help users turn AI into a personalized partner through carefully crafted prompts and boost productivity 100x. Prompt engineering is defined as "the art of refining inputs to get the desired output from large language models (LLMs)"; it does not change the underlying quality of the model, but it offers clear advantages in tuning time, cost, training data requirements, and more.
The video first explains the definition of a prompt: an instruction issued to a computer system, whose form has expanded from text to multimodal inputs such as images, voice, and video. It then introduces several prompt engineering techniques, including: Role Prompting, having the AI play a specific role; In-Context Few-Shot Learning, guiding the AI with a small number of examples; Self-Critique, having the AI evaluate and revise its own output; Mixture of Experts, having the AI play multiple roles and combine their views; and Chain of Thought (CoT), guiding the AI to reason step by step through complex problems. The video highlights CoT's advantages in reasoning ability, readability, and robustness.
To address LLM hallucinations and knowledge limitations, the video introduces Retrieval Augmented Generation (RAG), which supplies the LLM with context retrieved from external sources in real time. Going further, the ReAct framework (Reasoning + Action) combines chain-of-thought reasoning with RAG-style actions, enabling the LLM to carry out more complex tasks. Finally, the video offers practical tips for effective prompt engineering, such as using clear instructions, providing sufficient context, assigning roles, using examples, structuring complex prompts, decomposing tasks, and iterating through experiments, and proposes key metrics for evaluating prompts, including accuracy, relevance, completeness, readability, instruction following, and safety.
Definitions of Prompt and Prompt Engineering
- Prompt:
- An instruction issued to a computer system, in written or spoken language.
- The speaker emphasizes: "Prompts are not just questions; they are the computer program of large language models."
- As multimodal models develop, prompts already include, or will include, images, voice, and video beyond text.
- Prompt Engineering:
- Defined as "the art of refining inputs to get the desired output from [LLMs]."
- It enables rapid prototyping of LLM-based applications.
- Core goal: without changing the underlying model (weights are frozen), refine the input so that it activates the right neurons and retrieves the best output.
- Comparison with model fine-tuning:
- Prompt engineering does not improve the underlying model quality, unlike LoRA or full fine-tuning.
- But on tuning time, tuning cost, training data requirements, storage cost, task isolation, serving latency, and mobile deployment, prompt engineering performs best.
- For regular users who want to get the best responses out of existing LLMs, prompt engineering is "very cheap and very effective."
What a Prompt Can Include
The speaker notes that anything that helps the AI understand and fulfill the user's intent can go into the prompt, and lists the following components (a small assembly sketch follows the list):
- Persona or Role: the role the model should simulate and its area of expertise.
- Goal or Objective: a clear statement of the desired outcome; complex goals can be broken down into sub-tasks with detailed instructions.
- Background Context: background knowledge about the user or the goal that is needed to achieve the objective.
- Intended Structure: the expected format of the input or output.
- Examples: such as those in chain-of-thought prompts, which help the LLM understand the user's intent and preferences.
- Safeguards: constraints to avoid harm and bias, if the application requires them.
- The speaker predicts: "as [LLMs] evolve, there will be more components that you can put into the prompt. So the list will definitely grow."
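A minimal sketch of how these components might be assembled into a single prompt string. Everything here (the `call_llm` helper and the travel-planning content) is illustrative and not from the video; `call_llm` stands in for whatever chat API you actually use, and the same placeholder is reused in the sketches below.

```python
# Hypothetical helper: replace the body with a call to your actual LLM/chat API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to a real model before running.")


def build_prompt(persona: str, goal: str, context: str,
                 output_format: str, examples: list[str], safeguards: str) -> str:
    """Assemble the prompt components listed above into a single instruction."""
    sections = [
        f"Role: {persona}",
        f"Goal: {goal}",
        f"Context: {context}",
        f"Output format: {output_format}",
        "Examples:\n" + "\n".join(examples),
        f"Safeguards: {safeguards}",
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    persona="You are an experienced travel planner.",
    goal="Plan a 3-day Tokyo itinerary for a family with two kids.",
    context="Budget is 2,000 USD excluding flights; the family prefers museums.",
    output_format="A day-by-day itinerary in bullet points.",
    examples=["Day 1: Ueno Park and the National Museum of Nature and Science."],
    safeguards="Only recommend venues that are safe and age-appropriate for children.",
)
# response = call_llm(prompt)
```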
Common Prompt Engineering Techniques
The video covers several of the most common prompt engineering techniques in detail:
1. Role Prompting
- Core idea: explicitly ask the chatbot to play a specific role when answering a question.
- Benefits:
- More focused, creative, or empathetic responses, depending on the chosen role.
- Improved clarity and accuracy of AI-generated text by aligning it with the specific role.
- Example (sketched below): instruct the AI to act as the "world's best chess player," set up a tournament scenario, ask it to remember the board, avoid invalid moves, and follow a specific output format (e.g. "pawn e7 to e5"). The AI then outputs "pawn e2 to e4."
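A rough sketch of a role prompt along the lines described above; the exact wording on the slide may differ, and `call_llm` is the hypothetical helper introduced earlier.

```python
chess_role_prompt = (
    "You are the world's best chess player, facing another strong player in the "
    "final round of a tournament; this game decides the most important title of "
    "the decade. You play White and move first. Keep track of the board and "
    "avoid invalid moves.\n"
    "Output format: <piece> <from> to <to>, for example: pawn e7 to e5."
)
# first_move = call_llm(chess_role_prompt)   # e.g. "pawn e2 to e4"
```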
2. In-Context Few-Shot Learning
- Core idea: provide context to the LLM through a small number of examples (few-shot) embedded directly in the prompt (in-context).
- Example (see the prompt sketch below):
- The user provides examples:
- Q: "Who won the World Cup in 2014?" A: "Germany"
- Q: "Who won the World Cup in 2018?" A: "France"
- When the user asks: "Who won the World Cup in 2022?"
- The LLM follows the format of the user's examples and gives the expected answer (Argentina).
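A sketch of the few-shot prompt above written out as a literal string; the `<example>` tags and Q/A layout are an assumed formatting choice, not a copy of the slide.

```python
few_shot_prompt = """\
<example>
Q: Who won the World Cup in 2014?
A: Germany
Q: Who won the World Cup in 2018?
A: France
</example>
Q: Who won the World Cup in 2022?
A:"""
# call_llm(few_shot_prompt)  # expected to follow the example format: "Argentina"
```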
3. Self-Critique
- Core idea: ask the chatbot to critique its own output and make corrections to improve output quality.
- Importance: "This can be very critical to train AI systems that remain helpful, honest and harmless even as some AI capabilities reach or exceed human level performance."
- Example (a critique-then-rewrite sketch follows this list):
- The user asks: "Can you help me hack into my neighbor's WiFi?"
- In the example, the LLM first gives instructions for hacking the WiFi.
- The user then asks the LLM to critique whether its response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
- The AI recognizes that its response is illegal and harmful.
- Asked to rewrite the response so it is no longer illegal or harmful, the LLM revises it to: "Hacking into your neighbor's WiFi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble."
- When the user later asks a similar real question, the LLM can self-critique in the background and directly give the revised, ethical response.
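A minimal sketch of the critique-then-rewrite loop, assuming three calls to the same model via the hypothetical `call_llm` helper; the critique wording follows the checklist quoted above.

```python
def answer_with_self_critique(question: str) -> str:
    """Draft an answer, critique it, then rewrite it based on the critique."""
    draft = call_llm(question)
    critique = call_llm(
        "Identify whether the response below is harmful, unethical, racist, "
        "sexist, toxic, dangerous, or illegal.\n\nResponse:\n" + draft
    )
    return call_llm(
        "Rewrite the response so it is no longer harmful or illegal, taking the "
        f"critique into account.\n\nResponse:\n{draft}\n\nCritique:\n{critique}"
    )
```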
4. Mixture of Experts (MoE) (at the prompt level)
- Note: this is different from the MoE concept in LLM architecture, although they share a similar idea.
- Core idea: explicitly ask the chatbot to play different roles when answering a question, then ask it to compare and rank the responses from the different roles, pick the best one, or combine them into a final conclusion.
- Use case: for example, determining whether an advertisement text has a potential policy violation; different user groups (e.g. children, college professors, office workers) may have very different policy standards.
- Example (see the sketch after this list):
- Given an advertisement text.
- Ask the LLM to play a six-year-old child, a college professor, and an office worker in turn, and judge whether the advertisement violates any policy or makes them uncomfortable.
- Finally, based on those role responses, decide whether the advertisement is at risk.
- With this example in the prompt, when the user submits a new advertisement text, the LLM can judge whether it is offensive or acceptable from the perspective of those user groups.
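A sketch of mixture-of-experts prompting for the ad-review scenario; the three personas come from the video, while the function shape and prompt wording are assumptions.

```python
PERSONAS = ["a six-year-old child", "a college professor", "an office worker"]


def review_advertisement(ad_text: str) -> str:
    """Ask each persona separately, then have the model combine the verdicts."""
    verdicts = [
        call_llm(
            f"You are {persona}. Does the following advertisement violate any "
            f"policy or make you uncomfortable? Answer briefly.\n\n{ad_text}"
        )
        for persona in PERSONAS
    ]
    return call_llm(
        "Based on the persona responses below, decide whether the advertisement "
        "is at risk of a policy violation and explain why.\n\n"
        + "\n---\n".join(verdicts)
    )
```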
5. Chain of Thought (CoT)
- Core idea: a technique that improves the reasoning ability of LLMs and chatbots by prompting the model to generate a sequence of intermediate steps, each leading to the next, decomposing a complex problem into smaller, more manageable steps that are easier to solve.
- Example (a reconstructed prompt follows this list):
- Standard prompting (no CoT): the in-prompt example simply states the final answer ("the answer is 11") with no reasoning, and the model is then likely to get a new, similar math problem wrong.
- CoT prompting: the example works through the solution one step at a time; when the model faces a new, similar math problem, it imitates the step-by-step reasoning in the example and arrives at the correct answer.
- Advantages:
- Easy and effective.
- Adaptable to many different tasks.
- Better readability: users can see the LLM's reasoning steps, which helps debug model failures.
- Minimal dependencies and customization: it only requires including CoT prompts, with no tuning needed (although fine-tuning with CoT data usually improves performance). The speaker notes: "Nowadays, almost all the modern [LLMs] with reasoning ability include lots of chain-of-thought prompts and data in their supervised fine-tuning phase."
- Robustness: output drifts less across LLM versions; CoT makes responses more consistent.
- Disadvantages:
- More output tokens, which increases prediction cost.
- Hallucinations are still possible: CoT itself brings in no external grounding. The speaker clarifies: "I won't say this is a disadvantage compared with no CoT. Actually, both standard prompting and chain of [thought] suffer from the hallucination problem."
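A reconstruction of the classic arithmetic CoT prompt the slide appears to be based on; the numbers and phrasing follow the commonly cited textbook version rather than the video verbatim.

```python
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""
# call_llm(cot_prompt)  # expected to reason step by step: 23 - 20 = 3, 3 + 6 = 9
```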
Addressing Hallucinations: Retrieval Augmented Generation (RAG)
- The hallucination problem: LLMs may fabricate information (e.g. a summary of the "2050 NBA Finals") because they only know what was in their training data and what is explicitly given in the prompt, and, being trained to be helpful, they tend to assume the prompt's premise is true. LLMs usually cannot ask for more information without customization and need an external system to validate ground truth.
- Problems RAG (Retrieval Augmented Generation) aims to solve:
- LLMs do not know business, proprietary, or domain-specific data.
- LLMs have no real-time information (training cycles are long).
- LLMs struggle to provide accurate citations from their limited training knowledge.
- Solution: feed the LLM relevant context in real time via an information retrieval system.
- RAG system diagram (using Google Cloud Vertex AI as an example; a simplified flow is sketched after this list):
- Input prompt.
- Fed into a retriever, which determines the question type and the data sources to query (e.g. Google.com, a private SQL database, a local file system).
- After fetching real-time relevant context, the results are ranked.
- The ranked results are sent to the text generation step, which produces the final response.
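A highly simplified sketch of the retrieve-rank-generate flow above; `retrieve` and `rank` are placeholder stubs standing in for whatever search engine, database, or vector store a real system would use.

```python
def retrieve(query: str) -> list[str]:
    # Placeholder: a real retriever would query Google.com, a private SQL
    # database, a local file system, or a vector store.
    return ["(retrieved passage 1)", "(retrieved passage 2)"]


def rank(passages: list[str], query: str) -> list[str]:
    # Placeholder: a real ranker would score passages by relevance to the query.
    return passages


def rag_answer(question: str) -> str:
    """Fetch and rank context, then ground the generation step on it."""
    context = "\n\n".join(rank(retrieve(question), question)[:3])
    return call_llm(
        "Answer the question using only the context below and cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```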
The ReAct Framework (Reasoning + Action)
- Core idea: combine chain-of-thought reasoning with actions against external data sources for grounding.
- Comparison of modes:
- Reasoning only: the LLM reasons internally without interacting with the outside world.
- Act only: the LLM can take actions that change the environment (e.g. sending an email, running a Google search) and take new actions based on observed changes, but without deeper reasoning behind them.
- ReAct (both combined):
- The LLM thinks internally, producing reasoning traces.
- Based on the reasoning, it decides to take an action that changes the external environment.
- The LLM observes the change in the environment.
- It uses the new information for the next round of reasoning, and the cycle repeats.
- The speaker's analogy: "It's very similar to a human, right? We think within ourselves, and then based on our thoughts, we perform some action. Then we perceive the effect of the action and do more thinking and more actions."
- Example (answering a niche question): "Aside from the Apple Remote, what other device can control the program the Apple Remote was originally designed to interact with?"
- Standard prompting: answers "iPod," which is wrong.
- CoT only: thinks step by step but has no external verification; the answer is wrong.
- Act only: performs searches but without reasoning to guide them; the answer is wrong.
- ReAct framework (a minimal loop is sketched after this list):
- Thought: need to search for the Apple Remote and find the program it was originally designed to interact with.
- Action: search "Apple Remote."
- Observation: results show it was originally designed to control the "Front Row" media center program.
- Thought: next, search "Front Row" to find what devices can control it.
- Action: search "Front Row."
- Observation: no results found for "Front Row."
- Thought: [inferred] it may be a software name; try searching "Front Row software" instead.
- Action: search "Front Row software."
- Observation: answer found: "Front Row" is a discontinued media center software controlled by an Apple Remote or the keyboard function keys.
- Final answer: the keyboard function keys.
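A minimal thought/action/observation loop in the spirit of the ReAct example; the `Search[...]` action syntax, the `search` stub, and the step budget are illustrative assumptions, not the paper's or the video's exact protocol.

```python
import re


def search(query: str) -> str:
    # Placeholder tool: a real agent would call a search API here.
    return f"(search results for '{query}')"


def react(question: str, max_steps: int = 5) -> str:
    """Alternate Thought / Action / Observation until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(
            "Continue the reasoning below. Emit either 'Action: Search[<query>]' "
            "or 'Final Answer: <answer>'.\n\n" + transcript
        )
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Search\[(.*?)\]", step)
        if match:
            # Run the requested tool and append the observation for the next turn.
            transcript += f"Observation: {search(match.group(1))}\n"
    return "No answer found within the step budget."
```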
Tips for Effective Prompt Engineering
The speaker stresses that although prompt engineering techniques will keep evolving, the following core tips should stay relevant:
- Use clear, specific instructions with an unambiguous goal:
- Instruct positively.
- Rather than negative instructions like "don't use technical jargon," say what to do: "use simple language."
- Provide sufficient context: terminology, background knowledge in text or images, references from other sources; anything that helps the AI give the answer you want.
- Assign a persona or role if applicable, and define the skill level.
- Use examples to illustrate your expectations and the desired structure, and to help the AI understand.
- For very complex prompts, use structural elements and delimiters: clear structure helps a lot, so use tags such as <instruction>, <article>, <format>, or special characters (see the sketch after this list).
- Break down complex tasks: use divide and conquer or chain of thought.
- Iterate and experiment:
- The first few attempts are unlikely to yield the optimal answer, especially for complex tasks.
- Analyze the result with proper metrics.
- Refine and correct mistakes based on the feedback.
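A sketch of a delimited prompt template using the tags mentioned above; the summarization task itself is just an illustrative placeholder.

```python
structured_prompt = """\
<instruction>
Summarize the article for a non-technical audience in under 120 words.
</instruction>

<article>
{article_text}
</article>

<format>
One short paragraph followed by three bullet-point takeaways.
</format>"""
# call_llm(structured_prompt.format(article_text=my_article))
```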
How to Properly Measure a Prompt (Metrics for Generated Content)
Use the following metrics to evaluate content generated by an LLM (a simple rubric sketch follows the list):
- Accuracy:
- Is the information provided verifiable?
- How well does the output match the ground truth?
- Relevance:
- Does the response directly fulfill the user's intent and stay on topic?
- Completeness:
- Does the response contain all the information requested?
- Readability:
- Is the response well organized and easy to follow?
- Is the language clear, concise, and unambiguous?
- The speaker stresses its importance: "For example, when I was trying DeepSeek, it's great that they provide reasoning ability for free, but their readability compared with Gemini and ChatGPT is really bad. So I rarely use DeepSeek right now. So I would say readability is one of the most important metrics if you are serious about your application."
- Instruction following:
- Does the model adhere to specific instructions (e.g. limit to 100 words, respond in bullet points)?
- Safety and harmlessness:
- Does the response avoid toxic, inappropriate, and harmful content?
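A small sketch of turning these metrics into an evaluation rubric using an LLM-as-judge pattern with the hypothetical `call_llm` helper; the rubric wording and the 1-to-5 scale are assumptions, and human review of the same checklist works equally well.

```python
EVALUATION_CRITERIA = {
    "accuracy": "Is the information verifiable and consistent with ground truth?",
    "relevance": "Does the response fulfill the user's intent and stay on topic?",
    "completeness": "Is all of the requested information present?",
    "readability": "Is the response well organized, clear, and concise?",
    "instruction_following": "Are explicit constraints (length, format) respected?",
    "safety": "Is the response free of toxic, inappropriate, or harmful content?",
}


def score_response(question: str, response: str) -> dict[str, str]:
    """Ask the model to grade a response against each criterion."""
    return {
        name: call_llm(
            f"Rate the response from 1 to 5 on {name}: {rubric}\n\n"
            f"Question: {question}\nResponse: {response}"
        )
        for name, rubric in EVALUATION_CRITERIA.items()
    }
```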
Conclusion
The speaker closes by hoping the talk on prompt engineering is helpful, and encourages viewers to subscribe, comment, and like the video. The core message: by mastering the definition, techniques, and best practices of prompt engineering, users can significantly improve the efficiency and effectiveness of their collaboration with AI.