speaker 1: All right, so hi everyone, we're going to get started. So for today's CS 25 lecture, I'm very pleased to have Denny Zhou from Google DeepMind here to give a talk on large language model reasoning. Denny founded the reasoning team at Google Brain, which is now part of Google DeepMind. His group is renowned for pioneering chain-of-thought prompting and self-consistency, as well as developing the mathematical foundations of in-context learning and chain-of-thought reasoning. His team also created core capabilities that power Gemini's reasoning. Further, Denny co-founded the Conference on Language Modeling, or COLM, and served as general chair for the 2024 conference. So yeah, I'll let Denny take it from here.
speaker 2: Yeah, I'm glad to see many of you already believe LLMs can reason. You may wonder what my own answer to this question is. To me, actually, I don't know; it really depends on the definition of reasoning. For my talk today, we have a very specific definition of reasoning. I know there are many debates about whether LLMs can reason. I never joined those debates, because without a definition of reasoning, I have no idea what those debates are about. For LLM reasoning, we specifically mean the intermediate tokens between input and output; those are called reasoning, or intermediate steps. This idea is actually not new. Even in 2017, DeepMind had already published a paper on using intermediate tokens to solve math word problems. At that time, I think the community was quite busy with AlphaGo and AlphaZero, but this is a really groundbreaking paper. If you haven't read it before, I strongly encourage you to look at it. They introduced natural language to solve math problems, while the literature at that time just used symbolic approaches or search. This idea is also very common in the neurosymbolic literature, where it's standard to use an intermediate process to solve reasoning problems. Here's an example of how to use LLM reasoning. When I founded the reasoning team at Google Brain, I created this task, called last-letter concatenation, and used it as a motivating example; at that time, no one could get pretrained models to solve it. What's the output when concatenating the last letter of each word of "artificial intelligence"? If there's no reasoning process, you just see: the answer is "le". If there's a reasoning process, the model outputs something like: the last letter of "artificial" is "l", the last letter of "intelligence" is "e", and concatenating "l" and "e" leads to "le". The highlighted text here is the reasoning; that's what I mean by reasoning here. If you're familiar with program synthesis or neurosymbolic reasoning, you won't be surprised by this task design. Of course, you can imagine that I tried other options. For example, I didn't choose first letters. The reason is that I tried first letters, and all large models could solve that problem quite well, because there are so many initials on the web, and the models have already learned how to concatenate first letters. Then I switched to last letters, and all models failed. I know many people say, oh yeah, this is so natural, right? We need intermediate steps, just like humans. And these days you may feel LLMs are very similar to humans.
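To make the task concrete, here is a minimal sketch of its ground truth in Python (my illustration, not code from the lecture); the mapping is trivial to compute symbolically, which is exactly why it is a clean probe of whether a model can reason its way to the answer:

```python
def last_letter_concat(phrase: str) -> str:
    """Ground truth for the last-letter concatenation task."""
    return "".join(word[-1] for word in phrase.split())

print(last_letter_concat("artificial intelligence"))  # -> "le"
```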
But for us as researchers, we should always keep in mind that LLMs are just probabilistic models; they are not humans. If you always keep this in mind, it will be much easier to understand a lot of the new techniques. So why do intermediate tokens, reasoning, matter? We have a theoretical work, in collaboration with Professor Tengyu Ma at Stanford and his students: for any problem solvable by a Boolean circuit of size T, a constant-size transformer can solve it by generating O(T) intermediate tokens. It's a very powerful result. The size here means the number of logic gates; think of how many logic gates a GPU cluster contains. If we instead directly generate final answers, the model either requires huge depth or cannot solve the problem at all. That's how we understand reasoning from a theoretical perspective. Later in this lecture, we'll come back to this theoretical argument. There is a common belief about LLM reasoning: that pretrained LLMs cannot reason without further prompt engineering, like chain-of-thought prompting, or without fine-tuning, especially these days when everyone talks about RL fine-tuning, right? Do you agree with that? Okay. I believe that's wrong. Pretrained LLMs are already ready to reason, and all we need is the decoding process, no matter how fancy those techniques look these days. Here's an example: "I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?" If you have any pretrained model, like Llama, DeepSeek, or Qwen, and I didn't try those particular models, you can type this question into the pretrained model and see what happens. Very likely you'll see an answer like "5 apples". Of course, the answer is wrong here. This is called greedy decoding. And then people will say: okay, see, for pretrained models there's no reasoning, right? But the problem is about decoding, because we use greedy decoding by default. Now look at the second candidate for the first token; you have a big vocabulary size, right, so there is a second candidate. Suppose the second candidate for the first token is "I"; we start from "I" and just continue the greedy decoding process from there. We see: "I have 3 apples, my dad has 2 more apples than me, so he has 5 apples, and 3 plus 5 is 8." It's perfect, right? We just need to look at more candidates. That's amazing. And there's another choice: the third candidate for the first token is "We", and we see what happens here: "We have 8 apples in total." Somehow it's right. And probably the fourth candidate is "You"; we continue decoding and see what happens again: you can clearly see a chain of thought in this response, and the final answer is correct. And for the fifth candidate for the first token, the answer is 5, which is wrong. Okay? So you can see that the reasoning paths are already in the output space; in particular, the second and the fourth responses here are based on chain-of-thought reasoning. The problem is how to select the best response, right? If we just look at the examples here, you might say, okay, we can select by output length: if the model does some thinking, the output will be longer, because it contains reasoning tokens. But actually, we have a better idea for selecting the response.
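Here is a minimal sketch of that branching step using Hugging Face Transformers (the "gpt2" checkpoint is just a placeholder; the confidence-based selection the talk turns to next is only indicated in a comment):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Q: I have 3 apples. My dad has 2 more apples than me. "
          "How many apples do we have in total?\nA:")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]          # distribution over the 1st token
branch_tokens = torch.topk(logits, k=5).indices

for first in branch_tokens:                    # branch on the top-5 first tokens
    branch = torch.cat([ids, first.view(1, 1)], dim=-1)
    out = model.generate(branch, max_new_tokens=60, do_sample=False,
                         pad_token_id=tok.eos_token_id)  # greedy continuation
    print(tok.decode(out[0, ids.shape[-1]:]))
# Step 2 (described next in the talk): score each branch by the model's
# probability on its final-answer tokens and keep the most confident one.
```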
That better idea is to select by the answer's confidence. Confidence means that, since the model is just a probabilistic model, we can look at the probability of the tokens being predicted. What's really interesting is that for responses with chain-of-thought reasoning, the answer token has way higher confidence. For this example, for the answer token "8", the model's confidence is nearly 98%. That's huge, because we have a huge vocabulary size, so usually the probability of each token is nearly zero. This process is called chain-of-thought decoding. Basically, it consists of two steps. Step one: go beyond greedy decoding by checking more generation candidates. Step two: choose the candidate which has the highest confidence on the final answer. Chain-of-thought decoding is a very simple approach, but it still needs some programming work, and I heard that these days people just want to use natural language; no one wants to read code. Of course, you guys are the exception. So we have to ask: can we reshape the model's output distribution so that thoughtful responses naturally rank first? If the chain-of-thought response is ranked first, then greedy decoding can naturally find it, right? Now we can look at chain-of-thought prompting. If you know chain-of-thought prompting, now you can see why it works. Few-shot chain-of-thought prompting is a very simple approach: given a problem, you use a similar problem with a step-by-step solution as an example, put that before your question, and the model will magically follow the demonstrated style and generate a step-by-step solution. Now you can see why chain-of-thought prompting works: it changes the output distribution, pushing the chain-of-thought solutions that already exist in the output space to the top position. There's an even simpler prompting approach, called "let's think step by step". That's another amazing work. When that paper came out, I thought it was a joke; how could that be possible? At that time, the Google Brain team had built a model called PaLM, and I tried "let's think step by step" on our PaLM model, because, of course, I knew how PaLM was built; it's definitely not tuned for this magic trick. And I found it works on PaLM. I was so shocked. That paper really inspired me a lot in my reasoning research. These prompting approaches are really simple, and prompting really works. But we can also see some pitfalls. Few-shot chain-of-thought prompting needs task-specific examples. For me, I don't feel comfortable about that: if I have a question to ask someone, and I already know similar problems, then I can solve it by myself, right? Why should I ask other people? The other approach, "let's think step by step", is generic. You don't have to find similar examples; you just say "let's think step by step" and then the magic comes out. Unfortunately, it performs much worse than few-shot prompting. And both approaches look weird, right? Even "let's think step by step" is weird: if I ask somebody a question and then have to follow up with "let's think step by step", otherwise they couldn't think anymore, right? That's not what we'd expect.
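As a sketch, the two prompting styles just contrasted look roughly like this (the wording is my illustration, not the exact prompts from the papers):

```python
# Few-shot chain-of-thought prompting: a worked example precedes the question.
few_shot_cot = """\
Q: What is the concatenation of the last letters of "artificial intelligence"?
A: The last letter of "artificial" is "l". The last letter of "intelligence"
is "e". Concatenating "l" and "e" leads to "le". The answer is le.

Q: I have 3 apples. My dad has 2 more apples than me. How many in total?
A:"""

# Zero-shot chain-of-thought prompting: a generic trigger phrase, no examples.
zero_shot_cot = """\
Q: I have 3 apples. My dad has 2 more apples than me. How many in total?
A: Let's think step by step."""
```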
So how do we fix it? There's a popular approach called supervised fine-tuning, SFT. The idea is very simple: we collect a set of problems with step-by-step solutions from human annotators, and then we maximize the likelihood of the human solutions. Maximum likelihood: actually, for LLM training, next-token prediction just maximizes likelihood anyway. After that, we can apply the model everywhere. I listed the DeepMind paper from 2017 that I mentioned at the very beginning; they did exactly something like that. They collected a set of math word problems along with human-annotated step-by-step solutions, and then they trained a sequence-to-sequence model to solve math problems. In 2021, OpenAI further extended that approach and built a much larger dataset called GSM8K, grade-school math problems, and then used that dataset to fine-tune GPT-3 models. Let me give you an example of how it works. You can just put problems here: for example, at the beginning I said we can do last-letter concatenation, and you can put that example here, the problem and the answer. The other is the math problem, "how many apples"; you can put it there, then use all that as training data to fine-tune your model, and then test the model with a new question: how many "r"s in "strawberry"? You probably know why I chose this particular problem, because on social media many people believe it's a good question for testing whether AGI has arrived. And SFT is a really generic approach: once you've trained the model, you can apply it anywhere. If that solved reasoning, my talk would be done here, right? We wouldn't have to talk any more; we'd just collect more examples from the brilliant minds at Stanford and train the model on them. But actually, it doesn't generalize well. We realized this issue in the summer of 2021. We found it didn't work well, and we did what anyone in this paradigm would do: scaling, scaling, scaling, getting more data to train the model and seeing how it goes. The lesson here is: don't scale blindly. Once the paradigm is wrong, no matter how you scale, it doesn't work. So how do we fix the generalization issue of SFT? Let's look at the SFT procedure here: just two steps. Where's the mistake? The mistake, actually, is "from humans". If you didn't know that before, you'd be surprised, right? If human annotations were the problem, how could Scale AI make money? Actually, one of my team members invented RL fine-tuning, and when he told me that responses generated by machines could be even better than human data for training, I was really surprised at the very beginning. At that time, it was first called self-improvement. Exactly, just change that one word: instead of collecting data from humans, we just let the model generate the data. Collect a set of problems, and let your model generate step-by-step solutions; then, again, maximize the likelihood, but only on correct answers. For math problems, you may know the final answer, the ground-truth answer, but you don't have step-by-step solutions. So the model generates step-by-step solutions, and you use the answer to decide which responses to keep: if the final answer extracted from a solution is correct, choose it; otherwise reject it. This is called rejection sampling. Then you use this dataset to fine-tune your model, exactly as you would in SFT. The only difference is that the data comes from your model, not from humans. This approach was proposed by Eric Zelikman, Tony Wu, and also Noah Goodman; the paper is called STaR. STaR is a really amazing paper. Actually, in the STaR paper, when they proposed the approach, they considered using it to save labeling costs, because human labeling is really expensive. But these days, we understand the approach from a different perspective.
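A minimal sketch of that rejection-sampling loop (my illustration; `sample_solution` and `extract_answer` are hypothetical helpers standing in for sampling from the model and parsing its final answer):

```python
def build_self_training_data(problems, sample_solution, extract_answer,
                             n_samples=8):
    """STaR-style data generation: keep only model-generated solutions
    whose final answer matches the known ground truth."""
    data = []
    for question, gold_answer in problems:
        for _ in range(n_samples):
            solution = sample_solution(question)        # sampled, not greedy
            if extract_answer(solution) == gold_answer:
                data.append((question, solution))       # accept
            # otherwise: reject the sampled solution
    return data  # then fine-tune on this dataset, exactly as in SFT
```

Fine-tuning on the accepted data, generating again, and repeating is what turns this into the iterative loop described next.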
Okay, once the responses, the training data, are generated by the model, the model can be self-improved, right? And after the model has improved, we can collect data again. That means we can repeat this process, again and again. And this approach is then basically the same as the RL fine-tuning approach used these days. I put a paper here, by researchers at ByteDance, published in January 2024; I think this is the earliest academic publication I've noticed about RL fine-tuning for reasoning. Even the paper title is "reasoning with reinforced fine-tuning". After OpenAI's o1 got so popular, everyone began to appreciate RL fine-tuning. I believe multiple institutions independently discovered this idea. It's such a simple idea, but it works really well. Of course, after seeing this RL fine-tuning process, you notice we need a verifier in the training loop. The verifier can tell us which response is correct, because we know the final answer; we just use that to select the step-by-step reasoning paths. A reliable verifier is the most crucial part of RL fine-tuning, not the RL algorithms. I know in the community so many people talk about different algorithms, tons of variants of PPO or REINFORCE. If anyone finds that some algorithm is significantly better than the others, please let me know; probably I missed something. I really like what Rich Sutton said here: "Verification, the key to AI"; that's the title of an article by Rich Sutton in 2001. Okay, now a very interesting question: why generate the data from the model instead of from humans? That's a really interesting question. It's not about saving cost; it's about performance. Does anyone have an idea here? Yeah.
speaker 3: Is it consistency in the generated chain-of-thought structure, versus human solutions?
speaker 2: About consistency; say more?
speaker 3: Okay, yeah, the distribution is closer, so it's easier...
speaker 2: Closer to...
speaker 3: ...to what the model would generate itself, so it's easier to train.
speaker 2: Yeah, yeah. So this is related to the first principle in machine learning: directly optimize what you want. I don't know if anyone still remembers their machine learning classes here; of course you guys should remember. If we want to build a model for reasoning, or just in general for generating something interesting, we need to optimize the metric that measures generation quality. The metrics can be different. For example, for solving math problems, we care about correctness, whether the final answer is correct or not; for machine translation, you would optimize the BLEU score. It's all about a metric to measure the quality of generations. Once you have a metric, all we need is to compute gradients of the metric and do backpropagation. Mathematically, we can write this formula: we need a function R to measure the response quality, given the problem and your model parameters theta. Of course, you can say R is a reward, or R is your answer accuracy, or R is your BLEU score; whatever, you can define any R you want. That's your target. Then compute the gradient. Since the model is a probabilistic model, we need to maximize the expected value of the metric. How do we do that? We need sampling to compute the expectation. That's how you get the policy gradient. That's how it works. If you understand the underlying mathematical principle here, there's no magic.
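In symbols, this is the standard policy-gradient (REINFORCE) estimator; a sketch of the formula just described, with $x$ the problem, $y$ a sampled response, and $R$ the chosen quality metric:

$$
\max_{\theta}\ \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[R(x, y)\big],
\qquad
\nabla_\theta\, \mathbb{E}\big[R(x, y)\big]
= \mathbb{E}\big[R(x, y)\,\nabla_\theta \log p_\theta(y \mid x)\big]
\approx \frac{1}{N}\sum_{i=1}^{N} R(x, y_i)\,\nabla_\theta \log p_\theta(y_i \mid x).
$$

Sampling the responses $y_i$ to estimate the expectation is exactly where sampling enters RL fine-tuning.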
I know some people like to talk about it in a more magical way: for example, how to "incentivize" your model to reason. I don't use those words. I just use standard machine learning words: define your metric, compute gradients, and do backpropagation. That's all. Of course, once your paradigm works well, we need to scale the approach. Now the problem is what to scale. The answer is that for this RL fine-tuning approach, we scale the output length, the length of the chain of thought, and you can probably also scale the model depth. Because from our theoretical analysis, as long as your chain of thought is long enough, the model can solve nearly every computable problem. That's amazing. You don't have to scale your model size; you just need a minimal, constant-size transformer, and that's fine. Actually, if you look at the literature, people realized RL fine-tuning is better than SFT in the very early days, but it was harder to notice that we need to scale the CoT length; that's even more nontrivial to realize. I would like to mention the beauty of LLM reasoning: a human-like reasoning process emerges from next-token prediction, rather than relying on explicit search as in classical AI. I also have a fun quote here, by Kasparov after losing to Deep Blue in 1997: Deep Blue is "only intelligent the way your programmable alarm clock is intelligent". Actually, I agree with him, but LLM reasoning is different: we don't do any explicit search; search is irrelevant here. Before my talk, someone in the hallway quoted my tweet that "search is irrelevant" back to me, saying they use it when teaching that search is still useful. I want to give an example here of why LLM reasoning is so different from classical AI. In December 2024, Google released a model called Gemini 2.0 Flash Thinking; of course, 2.5 Pro is much more powerful now, but I used that model for a particular reason. In December 2024, right after the model's release, I tried a math problem, just to ensure the problem was not in the training set, because I used the number 2025 for the then-coming year; now it's for this year. The problem: using the numbers from 1 to 10 to make 2025, using each number once, with only the operations plus and multiplication. Of course, one could write a Python program, do an exhaustive search, and get results, right? Let's look at the thinking process generated by the model, on the right panel. For Gemini models you can check the full thinking process; it's very interesting to look at. Let's see how the model did it; the striking thing is that it's not by search. You can see that at the very beginning the model said: this is a relatively large number, suggesting multiplication will be heavily involved. That's just like human thinking, right? And wow, it even said: it's worth noting that 2025 is 45 squared, 45 times 45. Actually, when I made this question, even I didn't realize that; that's a huge hint. Then it said: the target is large, so let's think about how to get large intermediate products using multiplication, and so on, aiming for products that get us closer to the square root of 2025, which is 45. You see that? I made a cutoff here; the thinking is very, very long. That's why we need long CoT in RL fine-tuning. And after thinking, the model showed the final answer, exactly following its thinking process. Let's break it down: the first part, 10 times 4 plus 5, equals 40 plus 5, equals 45; the second part is again 45; and then 45 times 45 makes 2025.
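The transcript doesn't preserve the model's exact expression for the second factor, but the decomposition is easy to check; here is one valid split I verified myself (an illustration, not necessarily the expression the model produced):

```python
# 2025 = 45 * 45, using each of the numbers 1..10 exactly once,
# with only + and * as operations.
part_a = 10 * 4 + 5                  # uses {10, 4, 5}            -> 45
part_b = 3 * 6 + 1 + 2 + 7 + 8 + 9   # uses {1, 2, 3, 6, 7, 8, 9} -> 45
assert part_a == 45 and part_b == 45
assert part_a * part_b == 2025       # holds: the puzzle is solvable
```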
That's amazing, right? We don't need any search. I don't know if anyone has read another paper related to chain-of-thought prompting, called tree-of-thought prompting. Anyone read that paper? Great. In that paper, there's a very interesting example: the game of 24. This 2025 problem is way harder than the game of 24. In tree of thought, they combine search with prompting to solve the game of 24. But now you don't need that at all, right? The model can solve the game of 24 just by natural language; you can see how powerful chain of thought is. It's amazing. And again, I'd like to cite Richard Sutton here. In The Bitter Lesson, the core idea is: building in our discoveries only makes it harder to see how the discovering process can be done. I think Rich Sutton wrote The Bitter Lesson after he joined DeepMind, after the success of AlphaGo and AlphaZero. And he said only two processes really scale: one is learning, the other is search. But here, I would emphasize only learning as scalable; we didn't even need search for RL fine-tuning. The big advantage of RL fine-tuning is that it generalizes so well, but only for automatically verifiable tasks, because we need a verifier in the loop; there's no way to put a human in the loop there. And of course, not all tasks are automatically verifiable. Can anyone give examples of non-verifiable tasks? Yeah.
speaker 3: Right. Yes, creative...
speaker 2: Writing. Yeah, great example. So that's a big restriction for RL fine-tuning, and it's disappointing that so many people are mainly interested in creating RL algorithms to improve the approach. I really want to see us spend more time thinking about how to solve those non-verifiable tasks, because real problems are mostly non-verifiable, like creative writing, and even coding, right? People say coding problems will be solved by AI in a few years; I think that will be very challenging. When they talk about programming, they only talk about competitive programming. Competitive programming is not like our daily programming work, right? We read code; we care about your design, readability, how you collaborate with other people, not just a final answer. Okay, so far I've talked about several ideas. At the very beginning, I talked about chain-of-thought decoding: the reasoning paths are already in the output space, and all we need is a decoding procedure to reveal them. Then I talked about chain-of-thought prompting and "let's think step by step", which reshape the output distribution so greedy decoding can find the thoughtful response; then SFT; and then RL fine-tuning, which is so powerful. But we still have a chance to improve this process, and basically I want to talk about two ideas: one is aggregation, the other is retrieval. We have seen that LLM reasoning is really powerful, right? But is there any decoding issue in this paradigm of generating reasoning tokens and then the final answer? It seems so natural, right: given the problem, generate intermediate tokens, then the final answer. Does anyone see any problem in this process? Any problem? Yeah.
speaker 3: Is it the design of the model? The model is designed to predict the next token, so the challenge is in the way it predicts the next token; that's what creates this situation.
speaker 2: Right, the model was originally designed just to predict the next token. Yeah, thanks. So we always need to keep in mind here that LLMs are probabilistic models; they are not humans. What does that mean mathematically? Let's think about what an LLM does in decoding: given the problem, it generates reasoning and then a final answer, and the response is found by greedy decoding. What does greedy decoding mean? It means argmax over the probability of the whole response, right? However, what we actually want is to argmax the final answer: choose the answer with the maximum probability, the most confident answer. So the two are not aligned, right? This is just simple high-school conditional probability, but it's really useful for understanding the decoding process. So let's fix it. Given the reasoning paths, we should sum over the reasoning paths to find the probability of the final answer; in machine learning terms, this is called marginalization. We just sum them out, because the reasoning paths are essentially just latent variables. And of course, if you've studied machine learning, you'll know this sum can be computed by sampling. Once you get this idea, you see that's exactly the motivation underlying another popular approach, called self-consistency: generate multiple responses by random sampling, and then choose the answer that appears most frequently. Let me show an example with this math problem. You sample responses many times: for the first response, you get the answer 18; for the second one, you get 26; and again, you get 18. Then we look at the final answers and choose the most frequent one. That's exactly the process implementing marginalization in probability. We don't look at the reasoning paths; we only choose the most frequent answer, not the most frequent reasoning path. That's the trick; that's marginalization, done empirically. If you apply this approach, you see a huge improvement; it's really surprising. I know nowadays you might think, okay, to get a huge improvement you'd probably need to spend a lot of time building a sophisticated mathematical formulation. We don't have to. For GSM8K problems, we saw earlier that fine-tuned GPT-3 models got accuracy around 33%, and then OpenAI used verifiers and got 55%. Using a PaLM model plus chain-of-thought prompting, we got accuracy 58%, matching the magic performance from the verifier. However, the most surprising thing is that after applying self-consistency, the accuracy jumped to 75%. The relative improvement is nearly 50%. And with PaLM 2, we even got accuracy 92%. Of course, one could say, okay, that's for pretrained models; those models are from many years ago. It sounds like ten years ago, but between then and now, every year feels like a decade; the whole field is moving so fast. Actually, if you look at the o1 model, I forget when OpenAI released it, probably around October last year, they also showed results from aggregation, self-consistency: consensus at 64 samples, and you still see a great improvement from aggregation, or self-consistency.
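A minimal sketch of self-consistency as just described: sample several reasoning paths and take a majority vote over the extracted final answers, which empirically marginalizes out the reasoning tokens (`sample_response` and `extract_answer` are hypothetical helpers, not a specific API):

```python
from collections import Counter

def self_consistency(question, sample_response, extract_answer, n=16):
    """Sample n reasoning paths, then return the most frequent final answer.
    Voting over answers (not paths) approximates argmax_a sum_r p(a, r | x)."""
    answers = [extract_answer(sample_response(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```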
Yes, great point, of course: using self-consistency with more samples is more expensive; it uses more tokens, and people see that as a kind of inference-time scaling. There are so many ways to do inference-time scaling; using longer chains of thought also increases inference time. So actually, when some people talk to me about inference-time scaling, I don't know what they exactly mean, unless they can say precisely what is being scaled. Self-consistency is definitely one way to scale up. Also, self-consistency is naturally self-calibrated: higher consistency indicates higher accuracy. On the GSM8K benchmark, when the self-consistency is more than 80%, the accuracy is nearly 100%. So, I know some people care about uncertainty or confidence in prediction; you can just simply sample multiple times. I have two short questions here, to make sure everyone gets the real key ideas of self-consistency; I hope you'll have a lot of fun using this simple idea. The first question: when the LLM outputs a direct answer, without intermediate steps, would you still sample several times and then choose the most common answer? Does anyone have an answer here? If the model just directly generates the final answer, what do we do? Yeah, go ahead.
speaker 3: You can just take the greedy answer.
speaker 2: Exactly. Yes, exactly. Just like what we do in classical machine learning, right? When we have a logistic regression that gives p(y given x), we just take the maximum-probability prediction. That's why you don't see self-consistency in the old machine learning literature: it's unnecessary there. It's only useful for LLM reasoning. That's why we see it here: once we have reasoning, then we need self-consistency. The second question: change self-consistency by letting the LLM generate multiple responses in one pass, instead of sampling multiple times, and then choosing the most common answer. Does this make sense? So you let the model generate, say, five answers, instead of sampling five times. Actually, you can try that. And again, for everything, we just need to follow the machine learning principle; the principle here is called maximum marginal inference. You just need to choose the final answer with the maximum probability. That's all we need to know. You don't have to think about any fancy stories about LLMs, or compare them with humans; math is all we need here. Of course, for math problems you naturally see unique answers, and you check the frequency of each unique answer. For general problems, it's hard to expect the answer to be a single token or number; for example, for this problem, all the sampled answers are different. In this case, we have an extension of self-consistency, called universal self-consistency. For the problem here, you can see the second response is the most consistent one, "Japan, China, and India", because all three countries appear in all the other answers, right? And we just let the LLM choose the most consistent response.
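A sketch of that universal self-consistency step, where the model itself picks the most consistent of its sampled responses (`llm` is a hypothetical single-turn completion helper, not a specific API):

```python
def universal_self_consistency(llm, question, n=8):
    """Sample n free-form responses, then ask the model which one is most
    consistent with the others -- useful when answers aren't single tokens."""
    responses = [llm(question) for _ in range(n)]
    listing = "\n\n".join(f"Response {i + 1}:\n{r}"
                          for i, r in enumerate(responses))
    return llm(f"{question}\n\n{listing}\n\n"
               "Which response above is most consistent with the others? "
               "Answer with the response number only.")
```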
Okay, I've talked about how to use aggregation to improve reasoning. The other way is retrieval. I know there's a lot of debate about LLM reasoning; people say LLMs may just be doing retrieval instead of reasoning. I saw those debates on social media. To me, it's always hard to differentiate retrieval and reasoning. I'm an area chair or senior area chair for the ML conferences almost every year, and we always talk about the novelty of each paper; actually, that's similar to the debate about retrieval versus reasoning, right?
speaker 3: On that concept: what if you ran, say, four different models, like Gemini 2.5 and others, on the same question, compared their responses, and then at the end chose the answer they most consistently agree on?
speaker 2: Yes, yes. If you generate responses from different models, that would be more like a model ensembling approach: use many models and combine the results, like a random forest. The mathematical principle is not exactly the same as self-consistency, but the implementation is the same. Yes, great point. Again, I'm not interested in the retrieval-versus-reasoning debate; I work in industry, so I really just care about performance. To me, if retrieval plus reasoning works well, why should I join the debate, right? In 2024, we had a paper about analogical reasoning, and I can use a small example from it to show why retrieval is important. Analogical reasoning: for this problem, what's the area of the square with the four vertices at..., and so on. The highlighted text was added by me; that's the prompt: recall a related problem, and then solve this one. At that moment, I tried GPT-3.5 and also our own model, and they failed to solve this problem. After adding the prompt to recall related problems, the model could solve it. Let's see what happened: after being told to recall related problems, the model did find a related problem. "Related" doesn't mean the same problem; it is indeed just a related problem. You see, the related problem here is finding the distance between two points on a coordinate plane, and there's a formula for that. And then the model says: now I know how to compute the distance, and then how to compute the area. It's just a small case showing how retrieval matters in reasoning. Here's another example, called step-back prompting, for physics problems. We just give the model a few-shot example showing: before solving this problem, you can take a step back to consider a more abstract problem, get the principle, and then solve the problem with it. That's how retrieval works for reasoning. And by now everyone knows deep research; deep research is exactly the same idea here, right? We have Gemini deep research and also OpenAI deep research. One of OpenAI's deep research leads was my intern; after his PhD he joined OpenAI and invented deep research. You can see how deep research works: it can find similar problems or knowledge to use in solving the given problem. The basic ideas are very simple.
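As a sketch, the two prompting tricks just described look roughly like this (the wording and the concrete problems are my illustrations, not the papers' exact prompts):

```python
# Analogical prompting: ask the model to recall a related problem first.
analogical_prompt = (
    "Recall a related problem, and then solve this one:\n"
    "What is the area of the square whose four vertices are given?"
)

# Step-back prompting: ask for the abstract principle before solving.
step_back_prompt = (
    "First, step back and state the general physics principle behind this "
    "problem. Then use that principle to solve it.\n"
    "Problem: What happens to the pressure of an ideal gas if the "
    "temperature is doubled while the volume stays constant?"
)
```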
Now I can give a summary here. Actually, forget the other debates about whether LLMs reason or not. For LLMs, reasoning is always better than no reasoning, and RL fine-tuning is better than SFT. Aggregating multiple answers is better than one answer, though of course far more costly. And retrieval plus reasoning is better than reasoning only. And that's the end of my talk. For the next breakthroughs, I really want to see how to solve tasks beyond those with unique, verifiable answers. I also want to see people build real applications instead of just solving benchmarks; I think all benchmarks will be saturated soon. I know all of you are very passionate about AGI or building AI, so I'd like to quote Richard Feynman here: the truth always turns out to be simpler than you thought. And I think that's particularly true for AI research. I've seen so many academic papers that always try complicated things; that's why I made my talk as simple as possible. Actually, it's indeed simple. That's it. Yeah, thank you.
speaker 1: Thanks, Denny, for the very insightful as well as interesting talk. So now we'll be taking questions. We have some questions online, from Slido and Zoom, but also in person. So maybe we can start with some in-person questions.
speaker 3: Hi, thank you for the talk. So earlier on in the lecture, you talked about confidence, and a common way to measure this is just taking the average log probabilities of output token sequences. My question is, do you think there are better ways to do this? And also, is this a good indicator for hallucinations?
speaker 2: Oh, for the first part, I wasn't talking about calibrated confidence, just, for next-token prediction, the conditional probability of the generation. You can just look at the log probs from the model and see the probability.
speaker 3: Yeah. And do you think this is a good indicator for hallucinations?
speaker 2: Yeah, from our empirical observation, we can see that after a reasoning path, there's a huge jump in confidence on the final answer.
speaker 3: Earlier you mentioned that, for example, Richard Sutton said it's about scaling learning and search, and your opinion is more that scaling learning is all you need. I'd just like you to expand on that: why do you believe search is not as necessary?
speaker 2: That's why I used that example. Okay, let me make it more concrete. When you build models, you don't have to build search in; after the model is built, it can use search as a tool. There are special cases of tool use, like tree-of-thought prompting, which integrates symbolic search with the model. But for reasoning research, I just care about the fundamental abilities. For example, to solve this problem, the model could be motivated to write a Python program and solve it by search. But for the reasoning process itself, we don't need search. How should I say it: of course, we can always search over everything, and that's why, if you use a search tool to solve any problem, you can get higher accuracy. I don't know; it really depends on what you want: intelligence, or just search.
speaker 3: Hi, thank you for the talk. You mentioned, in the case where there's no reasoning, that it's not necessary to sample because you can simply look at the logits. But wouldn't sampling converge on a different distribution, in the case, for example, where the most likely next token leads to a diffuse distribution for the following token and the different paths spread out? Whereas if you were to sample, and a less likely token were to lead to a sharper distribution, you could actually have a more likely path of tokens there. So wouldn't these two methods fundamentally differ?
speaker 2: Great question. The problem is that we still don't know how the distribution gets shaped during the training stage; that's very unclear. So to me it's very hard to answer this question; we still don't have a good explanation of how the distribution over final answers is shaped.
speaker 3: Yeah, thank you.
speaker 3: Thank you for the talk. How do you differentiate reasoning and answer? That is, how do you extract the answer from the tokens, from the final output string? And what if the answer can be a program? Then how do you differentiate the reasoning and the answer?
speaker 2: Yeah, great question. If the answer is a program, it will be harder to extract. That's why, when people use RL fine-tuning, you mostly see those kinds of math problems or competitive programming problems. For the general case, I think you have to create a very careful parser for the final answer.
speaker 3: I see. And also, what if the problem is very challenging, such that the lower-confidence answer might actually be the correct one? That's possible.
speaker 2: Yeah, yes.
speaker 3: Then how can I use self-consistency better?
speaker 2: Self-consistency is not perfect. Is anything perfect every time? Right. Yeah, not perfect.
speaker 3: All right. Okay, thank you.
speaker 3: Considering the conversations that AGI is coming, you know, two to five years from now: if, let's say, 90% of jobs get automated, what skills do you develop in kids to give them a shot at surviving in the future that is coming?
speaker 2: That's a big question. Who said AGI would come in five years? But there's AI 2027, right, by Daniel Kokotajlo, and lots of conversations in the AI community about a timeline of two to five years. I was at ICLR last year; there was a workshop, and I remember one audience member asked the panelists a question. He said, okay, AI is moving so fast; what would be the most scary thing in the next few years? I remember some people did talk about the risks of AI, but my answer was: to me, the scariest thing is that an AI winter comes back, and then I lose my job. Actually, I see many restrictions of the current approaches. I know many people like chatbots and that sort of thing, but I really want to see real killer applications come out of current AI research. I don't know if anyone really needs those AI products or not, or if they're just for fun; I'm not quite sure about it. I do know the AI models are really good at programming; they can be good assistants for coding, and that's all I know. Yeah, we should be fine.
speaker 1: Okay, I think we're out of time, but thanks, everybody, for the great questions. And thanks again to Denny for the great talk.