Stanford CS25: V5 I Large Language Model Reasoning, Denny Zhou of Google DeepMind

Large Language Model Reasoning Explained: From Chain of Thought to Decoding Strategies

Media Details

Upload date
2025-05-27 21:30
Source
https://www.youtube.com/watch?v=ebnX5Ur1hBk
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: All right, so hi everyone, we're going to get started. For today's CS25 lecture, I'm very pleased to have Denny Zhou from Google DeepMind here to give a talk on large language model reasoning. Denny founded the reasoning team at Google Brain, which is now part of Google DeepMind. His group is renowned for pioneering chain-of-thought prompting and self-consistency, as well as developing the mathematical foundations of in-context learning and chain-of-thought reasoning. His team also created core capabilities that power Gemini's reasoning. Further, Denny co-founded the Conference on Language Modeling, or COLM, and served as general chair for the 2024 conference. So yeah, I'll let Denny take it from here.
speaker 2: Yeah, I'm glad to see that many of you already believe LLMs can kind of reason. You may wonder what my own answer to that question is. To me, honestly, I don't know; it really depends on the definition of reasoning. For my talk today we use a very specific definition. I know there are many debates about whether LLMs can reason. I never join those debates, because without a definition of reasoning I have no idea what they are about. By LLM reasoning, we specifically mean the intermediate tokens between input and output, call them reasoning or intermediate steps. This idea is not new. Even in 2017, DeepMind had already published a paper on using intermediate tokens to solve math problems. At that time the community was mostly excited about AlphaGo and AlphaZero, but that paper is a really groundbreaking paper; if you haven't read it, I strongly encourage you to look at it. They introduced natural language to solve math problems, whereas the literature at that time mostly used symbolic approaches or search. The idea is also very common in the neurosymbolic literature, where it's standard to use an intermediate process to solve reasoning problems. Here's an example of LLM reasoning. When I founded the reasoning team at Google Brain, I created a task called last-letter concatenation and used it as a motivating example. At that time, one could use transformer models to solve this task. So: what's the output when concatenating the last letter of each word of "artificial intelligence"? If there is no reasoning process, the model just outputs "the answer is le." If there is a reasoning process, the model outputs something like: the last letter of "artificial" is "l", the last letter of "intelligence" is "e", concatenating "l" and "e" leads to "le". The highlighted text here is what I mean by reasoning. If you are familiar with program synthesis or neurosymbolic reasoning, you won't be surprised by this task design. Of course, you can imagine that I tried other options. For example, I didn't use the first letter. The reason is that I tried first letters and all large models could solve that quite well, because there are so many initialisms on the web and the model has already learned how to concatenate first letters. Then I switched to last letters, and all models failed. I know many people say, oh yeah, this is so natural, right? We need intermediate steps, just like humans. And these days you may hear that LLMs are very similar to humans. But for us as researchers, we should always keep in mind that LLMs are just probability models; they are not humans. If you keep this in mind, it will be easier to understand a lot of new techniques. So why do intermediate tokens for reasoning matter? We have theoretical work on this, in collaboration with Professor Tengyu Ma at Stanford and his students: for any problem solvable by Boolean circuits of size T, a constant-size transformer can solve it by generating O(T) intermediate tokens. That is a very powerful result. The size here means the number of logic gates, which can be enormous in practice.
If we directly generate final answers, the model either requires a huge depth or cannot solve the problem at all. That's how we understand reasoning from a theoretical perspective; later in this lecture we'll come back to this theoretical argument. There is a common belief about LLM reasoning: that pretrained LLMs cannot reason without further prompt engineering, like chain-of-thought prompting, or fine-tuning. Does anyone agree with that? Okay. I believe that's wrong. Pretrained LLMs are ready to reason, and all we need is decoding; it's just about the decoding process, no matter how fancy those techniques look these days. Here's an example: I have three apples, my dad has two more apples than me, how many apples do we have in total? If you have any pretrained model, like Llama, DeepSeek, or Qwen (I didn't try those particular models), you can type this question into it and see what happens. Very likely you'll see an answer like "five apples." Of course the answer is wrong. This is greedy decoding. You might say, okay, you're right: for pretrained models there's no reasoning. But the problem is really about decoding, because we use greedy decoding by default. The vocabulary is big, so you can look at the second candidate for the first token, start from "I", and continue the decoding process. We see: "I have three apples and my dad has two more apples than me. So he has five apples, and three plus five is eight." Perfect, right? We just need to look at more candidates. That's amazing. And there's another choice: the third candidate for the first token is "We", and we get "We have eight apples in total." Probably the fourth candidate is "You"; we continue decoding, and again you can clearly see a chain of thought in this response, and the final answer is correct. The fifth candidate for the first token gives the answer five, which is wrong. So you can see that the reasoning paths are already in the output space; in particular, the second and the fourth responses are based on chain-of-thought style reasoning. The problem is how to select the best response. If we just look at the examples here, you might say we can select by output length: if the model does some thinking, the output will be longer because it contains reasoning tokens. But actually we have a better way to select responses: by answer confidence. Since the model is just a probability model, we can look at the probability of the answer token in the prediction. What's really interesting is that for the responses with chain-of-thought reasoning, the answer token has much higher confidence. For this example, for the answer token "eight", the model confidence is nearly 98%. That's huge, because the vocabulary is large, so usually the probability of any particular token is nearly zero. This process is called chain-of-thought decoding. Basically, it consists of two steps. Step one: go beyond greedy decoding by checking more generation candidates. Step two: choose the candidate with the highest confidence on the final answer.
And chain-of-thought decoding is a very simple approach, but it still needs some programming work, and I hear that these days people just want to use natural language; no one wants to write code. Of course, you guys are the exception. So we have to ask: can we reshape the model's output distribution so that thoughtful responses naturally rank first? If the chain-of-thought response is ranked first, then greedy decoding can naturally find it, right? Now we can look at chain-of-thought prompting. If you know chain-of-thought prompting, now you can see why it works. Few-shot chain-of-thought prompting is a very simple approach: given a problem, you use a similar problem with a worked solution as an example, put that before your question, and then the model will magically follow that style and generate a step-by-step solution. Now you can see why chain-of-thought prompting works: it changes the output distribution to push the chain-of-thought solutions that were already in the output space to the top position. There's an even simpler prompting approach, "Let's think step by step." That was another amazing piece of work; when it came out I thought it was a joke. How could that be possible? At that time the Google Brain team had built a model called PaLM, and I tried "Let's think step by step" on our PaLM model, because of course I knew how PaLM was built and it's definitely not related to this magic trick. And then I found it works on PaLM. I was so shocked. That paper really inspired my reasoning research: those prompting approaches are really simple, and prompting really works. But we can also see some pitfalls of these prompting approaches. Few-shot chain-of-thought prompting needs task-specific examples, and I don't feel comfortable about that: if I have a question to ask someone, and I already know similar problems, then I can solve it by myself, right? Why should I ask other people? The other approach, "Let's think step by step," is generic: you don't have to find similar examples, you just say "Let's think step by step" and the magic comes out. Unfortunately, it performs much worse than few-shot prompting. And, as I just mentioned, both approaches look a bit weird, right? Even "Let's think step by step" is weird: if I ask somebody a question, do I have to follow it with "Let's think step by step," otherwise they couldn't think anymore? That's not what we expect. So how do we fix it? There's a popular approach called supervised fine-tuning. The idea is very simple: we collect a set of problems and step-by-step solutions from human annotators, and then we maximize the likelihood of the human solutions. Maximum likelihood: LLM training, next-token prediction, is just maximizing likelihood. After that, we can apply the model everywhere. I listed the DeepMind paper from 2017 that I mentioned at the very beginning; they did exactly something like that. They collected a set of math word problems along with human-annotated step-by-step solutions, and then trained a sequence-to-sequence model to solve math problems. In 2021, OpenAI further extended that approach, built a much larger dataset called GSM8K, grade-school math problems, and then used that dataset to fine-tune GPT-3 models. Let me give you an example of how it works. You can just put problems here; for example, at the beginning I said we can do last-letter concatenation, and you can put that example here, the problem and the answer.
And the other one is the math problem, how many apples; you can put it there, and then use that as training data to fine-tune your model. Then you can test the model with a new question, like how many "r"s in "strawberry". You probably know why I chose this particular problem: on social media, many people believe that's a good question to test whether AGI has arrived or not. And SFT is a really generic approach: once you train the model, you can apply it anywhere. If that could solve reasoning, my talk would be done here, right? We wouldn't have to talk more; just collect more examples from the brilliant minds at Stanford and train the model on them. But actually it doesn't generalize well. We realized this issue in the summer of 2021. We found it didn't work well for reasoning, and what we could do was scaling: scaling to get more data to train the model and see how it works. The lesson here is, don't scale blindly. Once the paradigm is wrong, no matter how you scale, it doesn't work. So how do we fix the generalization issue of SFT? Let's look at the SFT procedure here: just two steps. Where's the mistake? The mistakes actually come from humans. If you didn't know that before, you'd be surprised, right? If human annotations are wrong, how can Scale AI make money? Actually, one of my team members invented RL fine-tuning. When he told me that responses generated by machines could be even better for training than human data, I was really surprised at the very beginning. At that time it was called self-improvement, and that's exactly the change: instead of collecting data from humans, we just let the model generate data. Collect a set of problems, let your model generate step-by-step solutions, and then again maximize the likelihood of the correct solutions. For math problems, you may know the final answer, the ground-truth answer, but you don't have step-by-step solutions. So let the model generate step-by-step solutions, and then use the answer to decide which responses to keep: if the correct answer can be extracted from the solution, keep it; otherwise reject it. This is called rejection sampling. Then you can use this dataset to fine-tune your model, exactly as you would in SFT. The only difference is that the data is from your model, not from humans. This approach was proposed by Eric, Tony, and Noah; the paper is called STaR. STaR is a really amazing paper. When they proposed the approach, they considered using it to save labeling costs, because human labeling is really expensive. But these days we understand this approach from a different perspective: once the responses, the training data, are generated by the model, the model can be self-improved, right? And after the model improves, we can collect data again, which means we can repeat this process. This approach is then essentially the same as the RL fine-tuning approach of today. I put a paper here by researchers at ByteDance, published in January 2024; I think it's the earliest academic publication I have noticed about RL fine-tuning for reasoning, and even the paper title is "Reasoning with Reinforced Fine-Tuning." After OpenAI's o1 got popular, everyone began to realize the importance of RL fine-tuning. I believe multiple institutions independently discovered this idea.
This is such a simple idea, but it works really well. Of course, after seeing this RL fine-tuning process, you can see that we need a verifier in the training loop. The verifier can tell us which response is correct, because we know the final answer; we just need to use it to select the step-by-step reasoning paths. So a reliable verifier is the most crucial part of RL fine-tuning, not the particular algorithm. I know these days so many people talk about different algorithms, tons of variants of PPO or REINFORCE; if anyone has found that some of those algorithms are significantly better than the others, please let me know, because probably I missed something. I really like what Rich Sutton said here: "Verification, the Key to AI" is an article by Rich Sutton from 2001. Okay, now a very interesting question: why generate data from the model instead of from humans? That's a really interesting question. It's not about saving cost; it's about performance. Does anyone have an idea here? Yeah.
speaker 3: Is it about consistency in the generated thought structure versus human-written ones?
speaker 2: About consistency, you say.
speaker 3: Okay, yeah, the distribution. It's easier...
speaker 2: It's closer to...
speaker 3: ...what you do in training, so it's easier...
speaker 2: ...to train on, yeah. Yeah, so this is related to a first principle in machine learning: directly optimize what we want. I don't know if anyone still remembers machine learning 101; of course you guys should. If we want to build a model for reasoning, or in general for generating something interesting, we need to optimize the metric that measures generation quality. The metric can differ by task: for solving math problems we care about correctness, whether the answer is correct or not; for machine translation you would optimize BLEU score; in any case, some metric that measures the quality of the generations. Once you have a metric, all we need is to compute gradients of the metric and do backpropagation. Mathematically, we can write this as a formula: we need a function R to measure the response quality given the problem and the model parameters theta. Of course, R is a reward; R could be your answer accuracy or your BLEU score. You can define any R you want, that's your target. Then compute the gradient. Since the model is a probability model, we need to maximize the expected value of the metric. How do we do that? We use sampling to compute the expectation, and that's how you get the policy gradient. That's how it works. If you understand the underlying mathematical principle, there's no magic. I know some people like to talk about this in a more magical way, for example, how to incentivize your model to reason. I don't use those words; I just use standard machine learning words: define your metric, compute the gradient, and do backpropagation, that's all. Of course, once your paradigm works well, you need to scale the approach, and the question is what to scale. The answer is that for this RL fine-tuning approach, we scale the output length, the length of the chain of thought, rather than the model depth. From our theoretical analysis, as long as the CoT is long enough, the model can solve nearly every computable problem. That's amazing: you don't have to scale your model size; a minimal, constant-size transformer is fine. Actually, if you look at the literature, people realized that RL fine-tuning is better than SFT in the very early days, but it was harder to notice that we need to scale the CoT length; that's even more nontrivial to realize. I'd also like to mention the beauty of LLM reasoning: a human-like reasoning process emerges from token-by-token prediction rather than relying on explicit search as in classical AI. I have a fun quote here from Kasparov after losing to Deep Blue in 1997: Deep Blue is only intelligent the way your programmable alarm clock is intelligent. Actually I agree with him, but LLM reasoning is different: we don't do any explicit search, and search is irrelevant here. Before my talk, in the hallway, someone quoted my tweet that search is irrelevant and noted that search can still be useful as a tool. I want to give an example of why LLM reasoning is so different from classical AI. In December 2024, Google released a model called Gemini 2.0 Flash Thinking; of course, 2.5 Pro is much more powerful now. I used that model for a particular reason.
So in December 2024, after the model was released, I tried a math problem and made sure it was not in the training set, because I used the number 2025, the next year at the time and now this year: using the numbers from 1 to 10, each number used once, and only addition and multiplication, make 2025. Of course, one could write a Python program, do exhaustive search, and get the result. But let's look at the thinking process on the right panel, generated by the model. For Gemini models you can check the raw thinking process, and it's very interesting to look at. Let's see how the model did it; the thinking is not by search. You can see that at the very beginning the model said: this is a relatively large number, suggesting multiplication will be heavily involved. That's just like human thinking, right? Then it said: it's worth noting that 2025 is 45 squared, 45 times 45. Actually, when I made this question, even I didn't realize that. That's the key here. Then it said: the target is large, so think about how to get large intermediate products using multiplication, and aim for products that get us closer to the square root of 2025, which is 45. After that (I made a cutoff here; the thinking is very, very long, which is why we need long CoT in RL fine-tuning) you can find the answer: after thinking, the model showed the final answer, and it exactly followed the thinking process. Let's break it down: the first part is 10 times 4 plus 5, which is 40 plus 5, equals 45; the second part is again 45; and 45 times 45 is 2025. That's amazing, right? We don't need any search. I don't know if anyone has read another paper related to chain-of-thought prompting, called tree-of-thought prompting. Anyone read that paper? Great. In that paper there's a very interesting example, the game of 24. This problem is way harder than the game of 24. In tree-of-thought prompting, they combine search with prompting to solve the game of 24. But now you don't need that at all: the model can solve it just with natural language. That's how powerful chain of thought is. It's amazing. And again, I'd like to cite Richard Sutton here, the bitter lesson. The core idea: building in our discoveries only makes it harder to see how the discovering process can be done. I think Sutton wrote the bitter lesson after he joined DeepMind and saw the success of AlphaGo and AlphaZero, and he said only two things really scale: one is learning, the other is search. But here I would only emphasize that learning is scalable; we don't even need search. For RL fine-tuning, the big advantage is that it generalizes so well, but only for automatically verifiable tasks, because we need a verifier in the loop; there's no way to put a human in the loop there. And of course, not all tasks are automatically verifiable. Can anyone give examples of non-verifiable tasks? Yeah.
speaker 3: Right. Yes, creative writing.
speaker 2: Yeah, great example. So yeah, that's a big restriction for RL fine-tuning, and it's a bit disappointing that so many people are mainly interested in creating RL algorithms to improve the approach. I really want to see us spend more time thinking about how to solve those non-verifiable tasks, because most real problems are non-verifiable, like creative writing, and even coding, right? People say the coding problem will be solved by AI in a few years; I think it will be very challenging to solve. When they talk about programming, they usually only talk about competitive programming, and competitive programming is not like our daily programming work. When we write code, we care about the design, readability, maintainability, how to collaborate with other people, not just producing a final answer. Okay, now let me talk about some other ideas. At the very beginning I talked about chain-of-thought decoding: the reasoning path is already in the output space, and all we need to do is decode, reshaping the output distribution so that greedy decoding can find it. Then I talked about chain-of-thought prompting and "Let's think step by step," which reshape the output distribution, and then SFT, and then RL fine-tuning, which is so powerful. But we still have a chance to improve this process. Basically, I want to talk about two ideas: one is aggregation, the other is retrieval. We have seen that LLM reasoning is really powerful, right? But is there any issue in this decoding paradigm of generating reasoning tokens and then the final answer? It seems so natural: given the problem, generate intermediate tokens, then the final answer. Does anyone see any problem in this process? Any problem? Yeah.
speaker 3: It's the design of the model. The model is designed to predict the next token, and the way it predicts the next token is what creates this situation, this outcome. Yeah.
speaker 2: The model is originally designed just for predicting the next token, yeah. Thanks. So we need to always keep in mind that LLMs are probability models; they are not humans. What does that mean mathematically? Let's think about what an LLM does in decoding: given the problem, it generates reasoning and then a final answer, and the response is found by greedy decoding. What does greedy decoding mean? It argmaxes the probability of the whole response. However, what we really want is to argmax the final answer: choose the answer with the maximum probability, the most confident answer. So the two are not aligned. This is just simple high-school conditional probability, but it's really useful for understanding the decoding process, and it tells us how to fix things. If there are reasoning paths, we should sum over the reasoning paths to find the probability of the final answer; in machine-learning terms this is called marginalization. We just sum them out, because the reasoning paths are essentially just latent variables. And if you've studied machine learning, you know this sum can be computed by sampling. Once you have this idea, you can see that it's exactly the motivation underlying another popular approach called self-consistency: generate multiple responses by random sampling, and then choose the answer that appears most frequently. Let me show an example. For this math problem, you can sample the response many times: for the first response you get the answer 18, for the second one you get 26, and again you get 18, and so on. Then we look at the final answers and choose the most frequent one. That's exactly the process that implements marginalization of the probability. We don't look at the reasoning paths; we only choose the most frequent answer, not the most frequent reasoning path. That's the trick: marginalization, done empirically. If you apply this approach, you see a huge improvement, which really surprised people. You might think that to get a huge improvement you'd need to spend a lot of time building a sophisticated mathematical formulation; we don't have to. For GSM8K problems, we saw earlier that fine-tuned GPT-3 models got about 33%, then OpenAI used verifiers and got about 55%, and with the PaLM model plus CoT we got 58% accuracy. That was seen as magic performance from the verifier. However, the most surprising thing is that after applying self-consistency, the accuracy jumped to 75%; the relative improvement is nearly 50%. And with PaLM 2, we even got 92% accuracy. Of course, one could say, okay, that's for PaLM, a model from years ago. It feels like ten years ago; these days every year is like a decade, the whole field is moving so fast. Actually, if you look at the o1 model (I forget when OpenAI announced it, probably October last year), they also showed results with aggregation, consensus over 64 samples, and we still see a great improvement from aggregation, or self-consistency. Yes, great point, of course: self-consistency uses more samples, which is more expensive, more tokens, and people see that as a kind of inference-time scaling. There are so many ways of inference-time scaling; if you use a longer CoT, that also increases inference time.
So actually, when people talk to me about inference-time scaling, I don't know what exactly they mean unless they say precisely what is being scaled. Self-consistency is definitely one way to scale up. Also, self-consistency is naturally self-calibrated: higher consistency indicates higher accuracy. For the GSM8K benchmark, when the self-consistency is more than 80%, the accuracy is nearly 100%. So if you care about uncertainty or confidence in predictions, you can just sample multiple times. I have a couple of short questions here to make sure everyone got the key ideas in self-consistency; I hope you have a lot of fun using this simple idea. The first question: when the LLM outputs a direct answer without intermediate steps, should we still sample several times and then choose the most common answer? Does anyone have an answer here? If the model just directly generates the final answer, what do we do? Yeah, go ahead.
speaker 3: Like, you can just take it directly.
speaker 2: Exactly. Yes, exactly. Just like what we do in classical machine learning: when we have a logistic regression giving p(y | x), we just take the maximum-probability answer. That's why you don't see self-consistency in the old machine learning literature; it's unnecessary there. It's only useful for LLM reasoning: once we have reasoning, then we need self-consistency. The second question: change self-consistency by letting the LLM generate multiple responses in one go, instead of sampling multiple times, and then choosing the most common answer. Does this make sense? So you let the model generate five answers instead of sampling five times. Actually, we can try that, and again, for everything we just need to follow the machine learning principle. This principle is called maximum marginal inference: choose the final answer with the maximum marginal probability. That's all we need to know; you don't have to think about fancy stories about LLMs or compare them with humans. The math is all we need here. Of course, this assumes the problem has a unique answer, so you can check the frequency of each unique answer. For general problems, it's hard to expect the answer to be a single token; for example, for this problem all the answers are different. In that case we have an extension of self-consistency, called universal self-consistency. For this problem here, you can see the second response is the most common one, Japan, China, and India, because all three countries appear in all the other answers, right? And we just let the LLM choose the most consistent response. Okay, I've talked about how to use aggregation to improve reasoning. The other way is retrieval. I know there's a lot of debate about LLM reasoning; people say LLMs may just be doing retrieval instead of reasoning. I've seen those debates on social media, and to me it's always hard to differentiate retrieval and reasoning. When I'm an area chair or senior area chair for conferences, almost every year we have discussions about the novelty of each paper, and that's actually similar to the retrieval-versus-reasoning debate, right?
speaker 3: Kind of a related concept: instead of sampling one model, could you run several different models, say Gemini 2.5 and a few other models, on the same question, and then at the end pick the most consistent answer across them?
speaker 2: Yes. If you generate responses from different models, that would be more like a model-ensembling approach: many models whose results get combined, like a random forest. The mathematical principle is not exactly the same as self-consistency, but the implementation is the same. Yes, great point. Again, I'm not interested in the retrieval-versus-reasoning debate; I work in industry, and I really just care about performance. To me, if it's retrieval plus reasoning and it solves the problem, why should I join the debate, right? In 2024 we had a paper on analogical reasoning, and I can use a small example from it to show why retrieval is important. For this problem, what's the area of the square with these four vertices and so on, the highlighted text was added by me as the prompt: recall a related problem, and then solve this one. At that moment I tried GPT-3.5 and also our own model, and they failed to solve the problem; after adding the prompt to recall related problems, the model could solve it. Let's see what happened: after being told to recall related problems, the model did find a related problem, not the same problem, genuinely just a related one. The related problem here is finding the distance between two points on a coordinate plane, along with the formula. Then the model said: now I know how to compute the distance, and then how to compute the area. It's a small case showing how retrieval matters in reasoning. Here's another example, called step-back prompting, for physics problems: with a few-shot example we show the model that, before solving the problem, it can take a step back to consider a more abstract problem, get the principle, and then solve the original one. That's how retrieval works for reasoning. And by now everyone knows deep research; deep research is exactly the same idea. There's Gemini deep research and also OpenAI deep research, and one of OpenAI's deep research leads was my intern; after his PhD he joined OpenAI and invented deep research. You can see why deep research works: it finds similar problems or knowledge to solve the problem. The basic ideas are very simple. Now let me give a summary. Forget the debates over whether LLMs reason or not: for LLMs, reasoning is always better than no reasoning; RL fine-tuning is better than SFT; aggregating multiple answers is better than one answer, although it's more costly; and retrieval plus reasoning is better than reasoning only. That's the end of my talk. For the next breakthroughs, I really want to see how to solve tasks beyond uniquely verifiable answers, and I also want to see people build real applications instead of just solving benchmarks; I think all benchmarks will be saturated soon. I know you guys are very passionate about AGI or building AI. I'd like to quote Richard Feynman here: the truth always turns out to be simpler than you thought. I think that's particularly true for AI research; I've seen so many academic papers that try complicated methods. That's why I kept my talk as simple as possible, and actually, it is indeed simple. That's it. Yeah, thank you.
speaker 1: Thanks, Denny, for the very insightful and interesting talk. So now we'll be taking questions. We have some questions online from Slido and Zoom, but also in person, so maybe we can start with some in-person questions.
speaker 3: Hi, thank you for the talk. Earlier on in the lecture, you talked about confidence, and a common way to measure it is just taking the average log probabilities of output token sequences. So my question is, do you think there are better ways to do this? And also, is this a good indicator for hallucinations?
speaker 2: Oh, for the first part, I wasn't talking about any sophisticated notion of confidence, just the conditional probability from next-token prediction for the generation. You can just look at the log probs from the model and read off the probability.
speaker 3: Yeah. And do you think this is a good indicator for hallucinations?
speaker 2: Yeah, from our empirical observation, we can see that after the reasoning path, there's a huge jump in confidence for the final answer. Yeah.
speaker 3: Earlier you mentioned that, for example, Richard Sutton said it's about scaling learning and search, and your opinion is more that scaling learning is all you need. I'd just like you to expand on that: why do you believe search is not as necessary?
speaker 2: That's why I used that example. Let me make it more concrete: when you build models, you don't have to keep search in mind, but after the model is built, you can use search as a tool. There are special cases of tool use, like tree-of-thought prompting, which integrates symbolic search with the model. For reasoning research, though, I just care about the fundamental abilities. For example, if you want to solve this problem, the model could be motivated to write a Python program that solves it by search. But for the reasoning process itself, we don't need search. Of course, you can always add search on top of everything; that's why, if you use search to solve problems, you can get higher accuracy. It really depends on what you want: intelligence, or just solving it by search. Yeah.
speaker 3: Hi, thank you for the talk. You mentioned that in the case where there's no reasoning, it's not necessary to sample because you can simply look at the logits. But wouldn't sampling converge on a different distribution in the case where, for example, the most likely next token leads to a diffuse distribution for the following token and the different paths spread out, whereas if you were to sample, a less likely token might lead to a sharper distribution, so you could actually have a more likely path of tokens there? Wouldn't these two methods fundamentally differ?
speaker 2: Good question. The problem is that we still don't know how the distribution gets shaped during the training stage; it's very unclear. So to me it's very hard to answer this question. We still don't have a good explanation of how the final-answer distribution is shaped.
speaker 3: Yeah, thank you. And thank you for the talk. So how do you differentiate reasoning and answer, like, how do you extract that number from the tokens, from the final output string? And what if the answer is a program?
speaker 2: Then how to differentiate .
speaker 3: the reasoning and the answer?
speaker 2: Yeah, great question. If the answer is a program, it will be harder to extract. That's why, when people use RL fine-tuning, you mostly see math problems or competitive programming problems. For the general case, for your case, you'd have to write a very careful parser for the final answer. Yeah.
speaker 3: I see. And also, what if the problem is very challenging, such that the lower-confidence answer might actually be the correct answer?
speaker 2: Yeah, that's possible.
speaker 3: Then how can I use self-consistency better?
speaker 2: Self-consistency is not perfect. Nothing is perfect every time, right? Yeah, not perfect.
speaker 3: All right, okay, thank you. So, considering the conversations that AGI is coming, you know, two to five years from now: if, let's say, 90% of jobs get automated, what skills should we develop in kids to give them a shot at surviving in the future that's coming?
speaker 2: That's a big question. Who said AGI would come in five years? I know there's AI 2027, right, by Daniel Kokotajlo, and lots of conversations in the AI community around a timeline of two to five years. I was at ICLR last year; there was a workshop, and I remember an audience member asking the panelists: AI is moving so fast, what would be the most scary thing in the next few years? Some people talked about the risks of AI, but my answer was: to me, the scariest thing is that the AI winter comes back and then I lose my job. Actually, I see many restrictions in the current approaches. I know many people like the chatbot sort of thing, but I really want to see real killer applications come out of current AI research. I don't know if anyone really needs those AI products or whether it's just for fun; I'm not quite sure about it. I do know the AI models are really good for programming; they can be a good assistant for coding, and that's all I know. Yeah, we should be fine.
speaker 1: Okay, I think we're out of time, but thanks everybody for the great questions, and thanks again to Denny for the great talk.

Latest Summary (Detailed Summary)

Generated on 2025-05-27 21:46

Overview / Executive Summary

In this Stanford CS25 lecture, Denny Zhou (Google DeepMind) takes a deep look at the reasoning capabilities of large language models (LLMs). He first defines LLM reasoning as the intermediate tokens, or intermediate steps, generated between input and output, and underlines their importance by citing theoretical work showing that a constant-size Transformer can solve any problem solvable by Boolean circuits of size T by generating O(T) intermediate tokens. Denny argues that pre-trained LLMs already have the potential to reason; the key lies in the decoding process. He reviews several methods for eliciting and strengthening LLM reasoning, including chain-of-thought (CoT) decoding (selecting the candidate path whose final answer has the highest confidence) and CoT prompting (few-shot examples and "Let's think step by step").

The lecture then turns to supervised fine-tuning (SFT) and its limited generalization, which motivates the more powerful iterative fine-tuning (IFT), also described as self-improvement: the model generates reasoning paths, a verifier keeps only the correct ones, and those are used to further fine-tune the model itself. The approach works because it directly optimizes the metric that measures generation quality. Denny stresses that LLM reasoning emerges from token-to-token prediction rather than the explicit search of classical AI, and illustrates this with an example of Gemini 2.0 [transcribed as "gm 92.0"; presumably a Gemini 2.0 thinking variant, given the "thinking mode" mentioned in the talk] solving a hard math puzzle.

To push reasoning performance further, Denny introduces aggregation techniques, in particular self-consistency: sample multiple outputs and vote for the most frequent final answer, which raises accuracy substantially. Retrieval-augmented reasoning is also covered, where the model recalls related problems or knowledge to help solve the current one. Denny concludes that reasoning beats no reasoning, IFT beats SFT, aggregation beats a single answer, and retrieval plus reasoning beats reasoning alone. Looking ahead, he hopes to see tasks beyond automatically verifiable answers tackled and real AI applications built.

Definition and Importance of LLM Reasoning

Denny Zhou (Speaker 2) opens by making his definition of LLM reasoning explicit:
* Reasoning: specifically, the "intermediate tokens between input and output, called reasoning or intermediate steps."
* The concept is not entirely new: as early as 2017, DeepMind [transcribed as "Duman"; presumably DeepMind] published a paper on using intermediate tokens to solve math problems.
* When founding the reasoning team at Google Brain, Denny Zhou designed the "last letter concatenation" task as a motivating example: for "artificial intelligence", a model with a reasoning process outputs "the last letter of 'artificial' is 'l', the last letter of 'intelligence' is 'e', concatenating 'l' and 'e' gives 'le'."
* He stresses a core point: "LMs are just probability models. They are not humans." Keeping this in mind makes many new techniques easier to understand.

Theoretical Significance of Intermediate Tokens

  • Denny cites joint work with Professor Tengyu Ma at Stanford and his students: "For any problem solvable by Boolean circuits [transcribed as "pucircus"] of size T, a constant-size Transformer can solve it by generating O(T) intermediate tokens."
    • "Size" here means the number of logic gates.
    • Generating the final answer directly would either require enormous depth or be impossible.
  • This is how reasoning is understood from a theoretical perspective.

Latent Reasoning in Pre-trained LLMs and Decoding

Denny Zhou pushes back on the common belief that "pre-trained LLMs cannot reason without further prompt engineering (such as CoT prompting) or fine-tuning."
* His view: "Pre-trained LMs are ready to reason, and all we need is decoding, just about the decoding process."
* Example: the apple problem
  * Question: "I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?"
  * Greedy decoding may directly output a wrong answer such as "5 apples."
  * But inspecting the second or fourth candidate for the first token can reveal outputs that contain a correct reasoning process:
    * Second candidate: "I have three apples and my dad has two more apples than me. So he has five apples and three plus five is eight." (correct answer)
    * The fourth candidate (starting with "You") also shows chain-of-thought reasoning and reaches the correct answer.
* This shows that "the reasoning path is already in the output space." The question is how to select the best response.

Chain-of-Thought (CoT) Decoding

  • One way to select the best response is by answer confidence.
  • For responses that contain chain-of-thought reasoning, the confidence (probability) of the final answer token is very high; in the apple example, the model's confidence on the answer token "eight" (8) is nearly 98%.
  • CoT decoding procedure:
    1. Go beyond greedy decoding by checking more generation candidates.
    2. Choose the candidate whose final answer has the highest confidence.
  • Denny notes this is a simple approach, but one that requires some programming work (a sketch of that work follows).
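
A minimal sketch of what that programming work could look like, assuming `model` and `tokenizer` are a Hugging Face-style causal LM and tokenizer; the branch width `k` and the crude "score the tokens of the last number in the output" heuristic are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cot_decode(prompt, model, tokenizer, k=5, max_new_tokens=128):
    """Illustrative CoT decoding: branch on the top-k first tokens, then pick
    the continuation whose final-answer tokens have the highest confidence."""
    inputs = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        first_logits = model(inputs).logits[0, -1]        # distribution over the first new token
    candidates = []
    for tok in torch.topk(first_logits, k).indices:       # step 1: go beyond greedy decoding
        branch = torch.cat([inputs, tok.view(1, 1)], dim=-1)
        out = model.generate(branch, do_sample=False, max_new_tokens=max_new_tokens,
                             output_scores=True, return_dict_in_generate=True)
        gen_ids = out.sequences[0, branch.shape[-1]:]
        probs = [torch.softmax(s[0], dim=-1) for s in out.scores]
        pieces = [tokenizer.decode([t]) for t in gen_ids]
        # Step 2: heuristic answer confidence = mean probability of the tokens
        # near the last number in the output (a stand-in for a real answer parser).
        digit_pos = [i for i, p in enumerate(pieces) if any(c.isdigit() for c in p)]
        conf = 0.0
        if digit_pos:
            span = [i for i in digit_pos if i >= digit_pos[-1] - 2]
            conf = float(torch.stack([probs[i][gen_ids[i]] for i in span]).mean())
        candidates.append((conf, tokenizer.decode(gen_ids, skip_special_tokens=True)))
    return max(candidates, key=lambda c: c[0])             # highest answer confidence wins
```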

Eliciting Reasoning via Prompt Engineering and Fine-Tuning

Prompt Engineering

The goal is to "reshape the model's output distribution so that thoughtful responses naturally rank first."
* Few-shot Chain-of-Thought Prompting
  * Provide one or more similar problems with step-by-step worked solutions before the actual question (a small construction example follows this list).
  * The model then "magically" follows that style and generates a step-by-step solution.
  * Why it works: "it changes the output distribution to push the original chain-of-thought solutions in the output space to the top position."
* The "Let's think step by step" prompt
  * A generic prompt; no similar examples are needed.
  * Denny recalls being shocked that this "magic trick" actually worked when he tried it on Google's PaLM model.
* Pitfalls of prompting
  * Few-shot CoT prompting needs task-specific examples, which Denny finds unnatural ("If I have questions to ask someone, if I know similar problems, I can then solve it by myself, right?").
  * "Let's think step by step" is generic but "performs much worse than few-shot prompting."
  * Both approaches feel somewhat "weird" and fall short of the ideal way to interact.
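
As a small illustration of the few-shot construction described above (the exemplar wording is paraphrased from the lecture's last-letter-concatenation example, not quoted from any published prompt):

```python
# One worked example with explicit intermediate steps, placed before the real question.
exemplar = (
    'Q: Concatenate the last letter of each word of "machine learning".\n'
    'A: The last letter of "machine" is "e". The last letter of "learning" is "g". '
    'Concatenating "e" and "g" gives "eg". The answer is eg.\n\n'
)
question = 'Q: Concatenate the last letter of each word of "artificial intelligence".\nA:'
prompt = exemplar + question  # the model tends to imitate the step-by-step style
```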

Supervised Fine-Tuning (SFT)

  • Idea: collect a set of problems with step-by-step solutions written by human annotators, then maximize the likelihood of those human solutions (the objective is written out after this list).
  • History:
    • DeepMind's 2017 paper already used a similar recipe, collecting math word problems with human-annotated step-by-step solutions.
    • OpenAI extended the approach in 2021 with the much larger GSM8K dataset (grade-school math problems) and used it to fine-tune GPT-3 models.
  • Example: fine-tune the model on "last letter concatenation" problems and answers, or the apple problem and its answer, then test it on a new question.
  • Limitation: on reasoning tasks, SFT "doesn't generalize well."
    • Denny's team realized this in the summer of 2021.
    • His takeaway: don't scale blindly; once the paradigm is wrong, no matter how you scale, it doesn't work.

Iterative Fine-Tuning (IFT) / Self-Improvement

This addresses SFT's generalization problem. A member of Denny's team invented the method, originally called "self-improvement."
* Core idea: train on model-generated responses rather than human data.
  1. Collect a set of problems.
  2. Let the model generate step-by-step solutions.
  3. Use a verifier (e.g., for math problems, check whether the final answer matches the ground truth) to decide which model-generated solutions are correct.
  4. Fine-tune the model only on the solutions verified as correct (maximizing the likelihood of correct solutions); see the rejection-sampling sketch after this list.
* Key difference from SFT: the training data comes from the model itself, not from humans.
* Reference: the "STaR" (Self-Taught Reasoner) paper by Eric Zelikman, Yuhuai (Tony) Wu, and Noah Goodman, originally motivated by saving expensive human labeling costs.
* Iteration: once the model has improved, it can be used again to generate higher-quality training data, forming an iterative loop; this is essentially the same idea as today's RL fine-tuning.
* Denny mentions an early academic publication on this by ByteDance researchers in January 2024, titled "Reasoning with Reinforced Fine-Tuning."
* The verifier is critical: "A reliable verifier is the most crucial [part of] IFT."
* Why is model-generated data better than human data?
  * Not to save cost, but for performance.
  * Denny invokes a first principle of machine learning: "Directly optimize what we want."
  * Optimize the metric that measures generation quality (e.g., correctness for math problems, BLEU for machine translation).
  * Compute the gradient of that metric and backpropagate (policy gradient); see the formula after this list.
  * Denny avoids phrases like "incentivize the model to reason," preferring standard machine-learning language: "define your metric, compute gradients, do backpropagation, that's all."
* Scaling: for IFT, what gets scaled is the output length, i.e., the length of the CoT, rather than model depth or size; in theory, as long as the CoT is long enough, the model can solve nearly every computable problem.
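
The "maximize the expected metric, estimate it by sampling" step can be written as a standard REINFORCE-style policy gradient (stated here in the notation of the talk, not copied from the slides): with problem $x$, sampled response $y$, and reward $R(x, y)$ (e.g., answer correctness),

$$
\nabla_{\theta}\, \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[R(x, y)\big]
\;=\; \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[R(x, y)\, \nabla_{\theta} \log p_{\theta}(y \mid x)\big]
\;\approx\; \frac{1}{N}\sum_{i=1}^{N} R(x, y_i)\, \nabla_{\theta} \log p_{\theta}(y_i \mid x).
$$

And one round of the rejection-sampling self-improvement loop might look like the sketch below (illustrative only; `sample_solutions`, `extract_answer`, and `fine_tune` are hypothetical stand-ins for a real sampling, parsing, and training pipeline):

```python
def self_improve_round(model, problems, ground_truth, num_samples=8):
    """One STaR-style round: generate, verify against known answers, retrain."""
    accepted = []
    for problem in problems:
        # Let the current model generate several step-by-step solutions.
        for solution in sample_solutions(model, problem, n=num_samples):
            # Verifier: keep a solution only if its extracted final answer
            # matches the known ground-truth answer (automatic verification).
            if extract_answer(solution) == ground_truth[problem]:
                accepted.append((problem, solution))
    # Maximize the likelihood of the verified, model-generated solutions,
    # exactly as in SFT -- only the data source differs.
    return fine_tune(model, accepted)

# Repeating this round lets the improved model produce better data,
# which in turn trains a better model.
```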

The Nature of LLM Reasoning: Emergence, Not Search

Denny highlights what he finds beautiful about LLM reasoning: "Human-like reasoning process emerges from token-to-token prediction rather than relying on explicit search as in classical AI."
* He quotes Kasparov's comment after losing to Deep Blue in 1997: Deep Blue is "only intelligent the way your programmable alarm clock is intelligent." Denny agrees with the comment, but argues LLM reasoning is different: there is no explicit search.
* Example: a Gemini 2.0 thinking model [transcribed as "gm 92.0"; presumably a Gemini 2.0 thinking variant] solving a hard math puzzle
  * Problem (tested in December 2024 to make sure it was not in the training set): "Using the numbers 1 through 10, each exactly once, and only addition and multiplication, make 2025."
  * The model's generated thinking process (not search):
    * "This is a relatively large number, suggesting multiplication will be heavily involved."
    * "It's worth noting that 2025 is 45 squared (45 * 45)." (Denny had not realized this himself when posing the problem.)
    * The model then reasons about how to build large intermediate products, aiming at values close to the square root of 2025, i.e., 45.
  * The model finally gives a correct solution of the form (10 * 4 + 5) * (a combination of the remaining numbers equal to 45) = 45 * 45 = 2025.
  * This is harder than the Game of 24 example in the tree-of-thoughts prompting paper (which combined search with prompting), yet the model solves it purely through natural-language reasoning.
* Denny again cites Richard Sutton's "The Bitter Lesson": building in our discoveries only makes it harder to see how the discovering process can be done. Sutton holds that learning and search are the two things that scale; here Denny emphasizes only that "learning is scalable."

Limitations of IFT

  • A major strength of IFT is that it generalizes well, but its main limitation is that it works "for automatically verifiable tasks," since the loop requires a verifier.
  • For tasks that are not automatically verifiable (e.g., creative writing, or the design and maintainability concerns of everyday programming work), IFT is hard to apply directly.

Advanced Techniques for Further Improving Reasoning

Denny discusses two further ideas for improving the reasoning process: aggregation and retrieval.

Aggregation: Self-Consistency

  • Background: LLMs are probability models. The standard decoding procedure (e.g., greedy decoding), which generates reasoning tokens and then a final answer, maximizes P(reasoning path, answer | problem). What we actually care about is maximizing P(answer | problem).
  • Mathematical principle: marginalization
    • Sum over all possible reasoning paths to obtain the probability of each final answer.
    • The reasoning paths are essentially latent variables.
  • Implementation: self-consistency (a small sketch follows this section)
    1. Generate multiple responses (reasoning plus final answer) by random sampling.
    2. Choose the most frequent final answer.
    3. "We don't look at the reasoning path, we only choose the most frequent answer, not the most frequent reasoning path. That's the trick."
  • Large performance gains
    • On GSM8K:
      • Fine-tuned GPT-3: about 33% accuracy.
      • OpenAI with verifiers: 55% accuracy.
      • PaLM + CoT: 58% accuracy.
      • With self-consistency, PaLM + CoT jumps to 75% (a relative improvement of nearly 50%).
      • PaLM 2 even reaches 92% accuracy.
    • Even newer models (such as OpenAI's o1, which Denny recalls being announced around October of the previous year) still show clear gains from aggregation (consensus over samples).
  • Properties
    • More samples cost more (longer inference time).
    • Self-consistency is self-calibrated: higher consistency generally means higher accuracy. On GSM8K, when self-consistency exceeds 80%, accuracy is close to 100%.
  • Important distinctions
    • If the LLM outputs an answer directly with no intermediate steps, there is no need to sample repeatedly and take the most common answer; simply take the maximum-probability answer (the classical machine-learning recipe). Self-consistency matters only when there is a reasoning process.
    • Letting the LLM generate several answers in one pass, versus sampling independently several times and taking the most common answer, rests on the same principle (maximum marginal inference).
  • Universal self-consistency: when answers are not single tokens or are more complex (e.g., listing Asian countries), let the LLM itself judge which generated response is most consistent with the others.
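
A minimal sketch of the self-consistency vote (illustrative; `sample_response` and `extract_answer` are hypothetical stand-ins for sampling a reasoning-plus-answer response from the model and parsing out its final answer):

```python
from collections import Counter

def self_consistency(problem, num_samples=40, temperature=0.7):
    """Approximate marginalization over reasoning paths by a majority vote on answers."""
    answers = []
    for _ in range(num_samples):
        response = sample_response(problem, temperature=temperature)  # random sampling
        answers.append(extract_answer(response))                      # keep only the final answer
    (best_answer, count), = Counter(answers).most_common(1)
    # count / num_samples is the consistency -- empirically a useful confidence signal.
    return best_answer, count / num_samples
```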

Retrieval

  • Denny says he is not interested in the "retrieval vs. reasoning" debate and only cares about performance: "To me, [if it's] retrieval plus reasoning, why should I do the debate, right?"
  • Example 1: analogical reasoning (a 2024 paper)
    • Problem: "What is the area of the square with the four vertices at ...?"
    • Without the extra prompt, GPT-3.5 and Google's own model both failed.
    • Adding the prompt: "Recall a related problem and then solve this one."
    • The model recalls a related problem, finding the distance between two points on a coordinate plane, together with the formula.
    • It then computes the distance and works out the area successfully.
  • Example 2: "step back" prompting
    • For physics problems, prompt the model, before solving, to take a step back, consider a more abstract problem, get the principle, and then solve the original one.
  • Connection to "deep research": Denny notes that Gemini deep research and OpenAI deep research follow the same core idea of finding similar problems or knowledge to solve the problem at hand. (A tiny prompt-construction illustration follows.)

Summary and Outlook

Denny Zhou's core takeaways:

  1. Reasoning is always better than no reasoning.
  2. Iterative fine-tuning (IFT) is better than supervised fine-tuning (SFT).
  3. Aggregating multiple answers is better than one answer, though more costly.
  4. Retrieval plus reasoning is better than reasoning only.

Directions for the next breakthroughs:

  • Solving tasks beyond uniquely verifiable answers.
  • Building real applications instead of just solving benchmarks; Denny expects all benchmarks to saturate soon.

Denny closes by quoting Richard Feynman: "The truth always turned out to be simpler than you thought." He believes this is especially true in AI research: many academic papers favor complicated methods, whereas he tried to keep the talk as simple as possible, because the underlying ideas really are simple.

Q&A Session

  • On confidence and hallucinations (question from Speaker 3):
    • Denny (Speaker 2) replies that by confidence he simply means the conditional probability (log probabilities) of the next-token prediction.
    • Empirically, after a reasoning path the confidence of the final answer jumps sharply, which may help in spotting hallucinations.
  • On the necessity of search (question from Speaker 3):
    • Denny (Speaker 2) clarifies that search need not be considered when building the model, but once the model is built, search can be used as a tool (e.g., tree-of-thoughts prompting combined with symbolic search). In reasoning research he personally focuses on fundamental abilities: the model can be motivated to write code that solves a problem by search, but the reasoning process itself does not need search.
  • Sampling vs. reading logits when there is no reasoning (question from Speaker 3):
    • Denny (Speaker 2) concedes that how the output distribution is shaped during training is not yet well understood, so he cannot give a definite answer; there is currently no good explanation.
  • Separating reasoning from the answer, especially when the answer is a program (question from Speaker 3):
    • Denny (Speaker 2) notes that a program is harder to extract as the final answer, which is why IFT is usually applied to math or competitive-programming problems; the general case needs a very careful parser for the final answer.
  • When a low-confidence answer is actually the correct one, how to use self-consistency (question from Speaker 3):
    • Denny (Speaker 2) acknowledges that self-consistency is not perfect.
  • On AGI timelines and what skills kids should learn (question from Speaker 3):
    • Denny (Speaker 2) is skeptical of claims that AGI will arrive within five years. He recalls a workshop at a conference last year [transcribed as "I clear red"; presumably ICLR] where, asked what would be the scariest thing in the next few years, his answer was not AI risk but that "the AI winter comes back and then I lose my job."
    • He sees many limitations in current approaches and wants to see real killer applications emerge from current AI research, not just chatbots for fun.
    • He agrees AI models are genuinely good at programming and can serve as coding assistants, but is unsure beyond that. He concludes, "We should be fine."