2024-04-11 | Stanford CS25: V4 Intuitions on Language Models
In this talk, Jason explores the fundamental reasons why language models work so well. He argues that by predicting the next word, a language model is effectively performing massively multi-task learning, simultaneously picking up grammar, lexical semantics, world knowledge, sentiment analysis, translation, spatial reasoning, math, and more. The talk notes that as model size, data, and compute continue to scale, overall loss keeps improving, while certain individual tasks can show sudden jumps in ability, a phenomenon known as emergence. Jason also draws on his experience building a lung cancer classifier to show that carefully inspecting and analyzing data builds intuition for complex tasks, which matters a great deal for advancing language models.
Media details
- Upload date
- 2025-05-18 15:58
- Source
- https://www.youtube.com/watch?v=3gb-ZkVRemQ
- Processing status
- Completed
- Transcription status
- Completed
- LLM provider/model
- openai/gemini-2.5-pro-preview-06-05
Transcript
speaker 1: So again, I'm very happy to have Jason here. He's an AI researcher based in San Francisco, currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in LLMs such as chain-of-thought prompting, instruction tuning, as well as emergent phenomena. He's also a good friend of mine, and he's been here before to give some talks. So we're very happy to have you back, Jason. Take it away. Yeah, thanks for the intro. A bit about the structure: I'll talk for around 30 minutes and then take a few questions, and then Hyung Won will talk for 30 minutes, and we'll both take questions at the end. Great. So I want to talk about a few very basic things, and the fundamental question that I hope to get at is: why do language models work so well? One thing I'd encourage everyone to do, which I've found extremely helpful in trying to answer this question, is to use a simple tool: manually inspect your data. I'll give a short anecdote; I've been doing this for a long time. In 2019, I was trying to build one of the first lung cancer classifiers. There would be an image, and you had to say what type of lung cancer it showed. My first thought was: if I want to train a neural network to do this, I should be able to at least do the task myself. So I went to my advisor and said, I want to learn to do this task first. And he said, Jason, you need a medical degree and three years of pathology experience to even do this task. I found that a bit discouraging, but I did it anyway. I looked at the specific type of lung cancer I was working on, read all the papers on how to classify the different subtypes, and went to pathologists and said, okay, I tried to classify these; what did I do wrong? And what do you think of that?
And in the end, I learned how to do this task of classifying lung cancer. The result was that I gained intuitions about the task that led to many papers. Okay, so first I'll do a quick review of language models. Language models are trained with the next-word prediction task. Let's say you have a sentence: "Dartmouth students like to ___". The goal of next-word prediction is: given the words that come before, predict the next word. What the language model does is output a probability for every single word in the vocabulary. So the vocabulary would be "a", "aardvark", ..., "drink", "study", and all the way to "zucchini". The language model puts a probability on every single word here: the probability of "a" being the next word is something really small, "aardvark" is really small, maybe "drink" is 0.6, "study" is 0.3, and "zucchini" is again really small. The way you train the language model is to say: if "drink" is the correct next word, I want this 0.6 to be as close as possible to 1. So your loss is basically how close the probability of the actual next word is to 1, and you want this loss to be as low as possible. Okay. So the first intuition I'd encourage everyone to use is: next-word prediction is massively multi-task learning. Here's what I mean, with a few examples. When you train a language model on a large enough dataset with next-word prediction, there are a lot of sentences you can learn from. For example, there might be a sentence "In my free time I like to ___", and the language model has to learn that "code" should have higher probability than the word "banana". So it learns some grammar. It will learn lexical semantics.
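The next-word training objective described here can be sketched in a few lines of Python. The toy vocabulary and probabilities are made up to match the "Dartmouth students like to ___" example; this is an illustration, not code from the talk.

```python
import math

# Made-up model output for the prompt "Dartmouth students like to ___".
vocab_probs = {
    "a": 0.001,
    "aardvark": 0.0001,
    "drink": 0.6,
    "study": 0.3,
    "zucchini": 0.0001,
}

# The per-position training loss is the cross-entropy:
# -log p(actual next word). Pushing p("drink") toward 1 drives it to 0.
actual_next_word = "drink"
loss = -math.log(vocab_probs[actual_next_word])  # -log(0.6) ≈ 0.511
```

A model that assigned "drink" probability 1.0 would get loss 0 on this position; assigning it 0.3 (like "study") would roughly double the loss.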
So somewhere in your dataset there might be a sentence, "I went to the store to buy papaya, dragon fruit and ___", and the language model should know that the probability of "durian" should be higher than "squirrel". The language model will learn world knowledge. There will be some sentence on the Internet that says "The capital of Azerbaijan is ___", and the language model should learn that it should be Baku instead of London. It can learn traditional NLP tasks like sentiment analysis. There might be some sentence, "I was engaged, on the edge of my seat the whole time. The movie was ___". The language model looks at the prior words and learns that the next word should probably be "good" and not "bad". And then another example is translation. Here you might see a sentence, "The word for pretty in Spanish is ___", and the language model should weight "bonita" higher than "hola". Spatial reasoning: you might even have a sentence like, "Iroh went to the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___", and "kitchen" should be higher probability than "store". And finally, even math questions: you might have some arithmetic exam answer key somewhere on the Internet, and the language model looks at it and learns that the next word should probably be 15 and not 11. You can have basically millions of tasks like this when you have a huge dataset, and you can think of this as extreme multi-task learning. These are very clean examples of tasks, but I'll give an example of how arbitrary some of these tasks can be. Here's a sentence from Wikipedia: "Biden married Neilia ___". Now pretend you're the language model: what's the next word here? The next word is "Hunter"; Neilia Hunter was Biden's first wife. So what is the language model learning from predicting this word? I guess world knowledge.
And then what's the next word after this? Turns out the next word is a comma, so here the model is learning, basically, comma prediction. And what's the next word after that? I think it's kind of hard to know, but the answer is "a", and I guess this is maybe grammar, but somewhat arbitrary. And the next word after that? Turns out it's "student". I don't know what task this is; it could have been "woman", it could have been something else. So this is a pretty arbitrary task. The point I'm trying to make is that the next-word prediction task is really challenging. If you do this over an entire dataset, you're going to learn a lot of tasks. Okay. The next intuition I want to talk about is scaling: let's say scaling compute, where compute is equal to how much data you have times the size of the language model, reliably improves loss. This idea was basically pioneered by Kaplan et al. in 2020; I'd encourage you to read the paper. What it basically says is you can draw a plot, and we'll see many plots like this, where the x axis is compute and the y axis is loss. You train one language model and get some loss, and obviously you want loss to be lower. You train the next one, it'll have a lower loss; the one after that, lower still; and the one after that, lower again. So you can basically predict the loss of a language model from how much compute you're going to use to train it. The reason this is called a law is that in the paper, the x axis spans seven orders of magnitude, so it would be surprising if the trend broke as you continued. And the important thing is that the line does not bend and flatten out, because if it did, it would saturate.
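The scaling-law shape can be sketched numerically. The power-law form L(C) = a · C^(-alpha) is the standard way to write this kind of trend, but the constants below are invented for illustration, not fitted values from Kaplan et al.

```python
# Illustrative power-law scaling curve: loss falls smoothly with
# compute over many orders of magnitude instead of saturating.
# a and alpha are made-up constants, not from the paper.
a, alpha = 10.0, 0.05

def predicted_loss(compute):
    return a * compute ** (-alpha)

# Evaluate across several orders of magnitude of compute (in FLOPs).
losses = [predicted_loss(10 ** k) for k in range(18, 25, 2)]
# Each step up in compute still lowers the predicted loss.
```

The practical point is the extrapolation: having fit a and alpha on small runs, you can predict the loss of a much larger run before training it.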
Because then putting in more compute, or training a larger language model, wouldn't actually lead to lower loss. Okay. So here's a question we don't have a good answer to as a field, but I'll give a hand-wavy answer: why does scaling up the size of your language model improve the loss? I'll give two basically hand-wavy answers. Say here's a small LM and here's a large one. One thing that matters is how good your language model is at memorizing facts. Imagine you're a small language model and you see a bunch of facts on the Internet. You have to be pretty choosy about which facts you memorize, right? Because if you don't have that many parameters, you're like, oh, I can only memorize a million facts; is this one of the facts I want to memorize to get the lowest loss? So you have to be very selective. Whereas if you're a large language model, you can memorize a lot of tail knowledge. Every fact you see, you don't have to ask whether it's worth memorizing; you can just memorize it. The other hand-wavy answer is that small language models tend to learn first-order heuristics. If you're a small language model, you're already struggling to get the grammar correct; you're not going to try your best to get the math problem exactly right. Whereas if you're a large language model, you have a lot of parameters in your forward pass, and you can try to do really complicated things to get the next token correct and the loss as low as possible. Okay. So the third intuition I'll talk about is: while overall loss improves smoothly, individual tasks can improve suddenly. Here's what I mean. You can write out your overall loss like this.
If you take some corpus of data and compute the overall loss on every word in that dataset, then, since we know next-word prediction is massively multi-task learning, you can decompose this overall loss into the loss on every single individual task. So you have some small weight times the loss on, say, grammar, plus some small weight times the loss on sentiment analysis, plus some small weight times the loss on world knowledge, and so on, all the way to some weight times the loss on math. So you can basically write your overall loss as a weighted sum of the individual tasks in the dataset. Now the question is: say I improve my loss from 4 to 3; does each of these individual tasks improve at the same rate? Well, I'd say probably not. If you have a good enough language model, it's already doing basically perfectly on grammar and sentiment analysis, so those might not improve; they might be saturated. But maybe the loss on math is not saturated; the model is not that good at math yet, so that could improve suddenly. I'll redraw the diagram to show this. Again, compute is here and loss is here, and your overall loss looks like this. What I'm saying is there might be some part of that overall loss that scales like this; say this is grammar. For example, if GPT-3.5 is here and GPT-4 is there, you haven't actually improved the grammar that much. On the other hand, you might have a curve like this for, say, math or harder tasks, where the difference between GPT-3.5 and GPT-4 is much larger. And it turns out you can look at a big set of tasks, which I did, and ask: what's the shape of these scaling curves? I looked at 202 tasks; there's a corpus called BIG-Bench, which has about 200 tasks. I looked at all of them, and here's the distribution.
29% of tasks were smooth: if I draw the scaling plot, compute is on the x axis, and here we have accuracy instead of loss, so higher is better, you get something like this. I believe 22% were flat: the scaling curve just stays at zero; the task was too hard. 2% were something called inverse scaling; I'll talk about this in a second, but it means the accuracy actually gets worse as you increase the size of the language model. I think 13% were not correlated, so you get something like that, I don't know. And finally, a pretty big portion, 33%, were emergent abilities. What I mean is: if you plot compute against accuracy, up to a certain point the accuracy is zero, and then the accuracy suddenly starts to improve. So you can define an emergent ability basically as: for small models, the performance is zero, so the ability is not present in those models; and for large models, the performance is much better than random. The interesting thing is: say you had only trained the small language models up to that point; you would have predicted it was impossible for a language model to ever perform the task. But when you train the larger model, the language model does learn to perform the task. So in a sense, it's pretty unpredictable. I'll talk about one final thing, called inverse scaling, slash, U-shaped scaling. Okay, I'll give a tricky prompt to illustrate this. The prompt is: "Repeat after me: all that glisters is not glib. All that glisters is not ___." That's the prompt I give to the language model, and the goal is to predict the next word. Obviously the correct answer is "glib", because I asked it to repeat after me. And what you see is: say you have an extra-small language model, a small language model, and a large language model.
The extra-small language model does well; say here's 100%. The small language model is actually worse at this task, somewhere down here. And the large language model again learns to do the task. So how do we explain behavior like this for a prompt like this? The answer is you can decompose the prompt into three subtasks that are basically being done. The first subtask is: can you repeat some text? If you draw the plot again, extra-small, small, large, with 100 here, this is a super easy task, so all the language models have perfect performance. That's one hidden task. The second task is: can you fix a quote? The quote is supposed to be "all that glisters is not gold". So plot again, extra-small, small, large: what's the ability to fix that quote? Well, the extra-small model doesn't know the quote, so it gets zero. The small model can do it, and the large model obviously can too. That's what that scaling curve looks like. And finally, you have the task of following an instruction; the instruction here is "repeat after me". What's the performance of these models on this task? The extra-small model can't do it, the small model also can't do it, but the large model can. So you get a curve like this. Now why does this explain the behavior here? Well, the extra-small model can repeat, but it can't fix the quote and it can't follow the instruction, so it actually gets it correct: it just repeats and says "glib". The small model can repeat and can fix the quote, but it doesn't follow the instruction, so it decides to fix the quote and gets it wrong. And the large model can do all three; it follows the instruction, so it just repeats.
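The subtask decomposition of the "repeat after me" prompt can be written down directly. The 0/1 capability values below are schematic, not measured; they just encode which model size can do which hidden subtask, as in the talk.

```python
# Hidden subtasks of the tricky prompt, per model size [XS, S, L].
# 1 = that size can do the subtask, 0 = it cannot (schematic values).
can_repeat = [1, 1, 1]  # trivially easy at every size
can_fix    = [0, 1, 1]  # only S and L know the real quote
can_follow = [0, 0, 1]  # only L obeys "repeat after me"

def answers_correctly(i):
    # A model that can fix the quote but does not follow the
    # instruction "fixes" the quote and so gets the task wrong.
    if can_fix[i] and not can_follow[i]:
        return False
    return bool(can_repeat[i])

results = [answers_correctly(i) for i in range(3)]
# U-shaped scaling: XS correct, S wrong, L correct again.
```

Composing three monotonically improving subtask curves like this is exactly what produces the non-monotonic U-shape on the full prompt.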
And so that's how, by looking at the individual subtasks, you can explain some of these weird scaling behaviors. I'll conclude with one general takeaway, which is applicable if you do research: just plot scaling curves. I'll give a really simple example. Say for my research project I fine-tune a model on some number of examples, and I get some performance there. And here's the baseline of not doing whatever my research project is, with its performance. The reason you want to plot a scaling curve is this: say you take half the data and find out that the performance is actually here, so your curve looks like this. That tells you you didn't have to collect all the data to do your thing, and if you collect more, you probably won't see an improvement in performance. Another scenario: if you had plotted that point and it was there, your curve would look like this, and if you kept doing more of whatever your research project is, you'd likely see an improvement in performance. And finally, maybe your point is there, so your curve looks like this, and in this case you'd expect an even larger jump in performance as you continue. So yeah, that's the end of my talk; happy to take a few questions before Hyung Won's talk. Yeah, go ahead. [Audience question, partly inaudible, about distinguishing data quality during pre-training.] Oh yeah, thanks, good question. The question is: during pre-training, how do you differentiate between good data and bad data? The answer is you don't, really, but you should try to train only on good data. So maybe you should look at your data source and filter out some data if it's not from a reliable source.
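The "plot scaling curves" takeaway can be sketched as a loop over data fractions. Here `train_and_eval` is a hypothetical stand-in for "fine-tune on n examples and measure accuracy", and all the numbers are invented.

```python
# Evaluate a hypothetical fine-tuning project at several dataset sizes
# instead of only at the full dataset (accuracies are made up).
def train_and_eval(n_examples):
    lookup = {250: 0.71, 500: 0.72, 1000: 0.72}
    return lookup[n_examples]

baseline = 0.55  # accuracy without the fine-tuning project
curve = [(n, train_and_eval(n)) for n in (250, 500, 1000)]

# If the last two points are nearly equal, the method has saturated and
# collecting more data probably won't help; if the curve is still
# rising, more data probably will.
saturated = curve[-1][1] - curve[-2][1] < 0.005
```

With these numbers the curve is flat near the full-data point, which is exactly the scenario where collecting more examples would have been wasted effort.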
Do you want to give us maybe the intuition behind the intuition for one or two of the examples, like emergence, or why tail knowledge starts to develop? What's behind that? What do you mean, the intuition behind the intuition? Intuitively, for these concepts you're seeing in the graphs, from your experience and expertise, what in the model itself is really causing that emergence? What do you mean, in the model itself? Is it more nodes, more attention, in an intuitive sense? Oh yeah, okay. So the question is: what in the model makes the language model better at memorizing tail knowledge or doing math problems? Yeah, I think it's definitely related to the size of the language model. If you have more layers, you can probably encode a more complex function within that, and if you have more width, you can probably encode more facts about the world; and then if you want to retrieve a fact, it's probably easier. We'll take one more in person and then I'll move on. When you were studying the 200-ish problems in BIG-Bench, you noticed that 22% were flat. But there's a possibility that if you increased the compute even further, those might have turned out to be emergent. So my question to you is: when you were looking at the 33% that turned out to be emergent, did you notice anything about the loss in the flat portion that suggested they would eventually become emergent? Oh yeah, I didn't notice anything. Oh, sorry, let me repeat the question. The question is: when I looked at all the emergent tasks, was there anything I noticed before the emergence point in the loss that would have hinted it would become emergent later? To me it's kind of tough. We have a few plots of this: you can look at the loss and it kind of gets better and then suddenly it spikes, and there's basically no way to predict it.
But also you don't have perfect data, because you might not have all the intermediate points for a given model size. Yeah, great question. Okay, we just have a few questions from people joining on Zoom. The first one: what do you think are the biggest bottlenecks for current large language models? Is it the quality of data, the amount of compute, or something else? Yeah, great question. If you go back to the scaling-laws paradigm, what it says is that if you increase the size of the data and the size of the model, you'd expect much better performance. And yeah, we'll probably keep trying to increase those things. Gotcha. And the last one: what are your thoughts on the paper, if you've read it, "Are Emergent Abilities of Large Language Models a Mirage?" Oh yeah, I always get this question. I'd encourage you to read the paper and decide for yourself. What the paper says is that if you change the metric a bit, it looks different. But I'd say at the end of the day, the language model abilities are real, and I don't think they're a mirage. All right. So thanks, Jason, for the very insightful talk. Now we'll have Hyung Won give a talk. He's currently a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of large language models: pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and so forth. Some of his notable works include the scaling Flan papers, such as Flan-T5 and Flan-PaLM, as well as T5X, the training framework used to train the PaLM language model. Before OpenAI he was at Google Brain, and he received his PhD from MIT. So give a hand for Hyung Won. speaker 2: All right, my name is Hyung Won, and I'm really happy to be here today. By the way, is the mic working fine? Yeah, yeah.
So this week I thought: okay, I'm giving a lecture on Transformers at Stanford; what should I talk about? And I thought: some of you in this room and on Zoom will actually go on to shape the future of AI, so maybe I should talk about that. It's a really important and ambitious goal, and we really have to get it right. So that could be a good topic to think about. And when we talk about the future, the best place to get advice is to look into the history. In particular, we'll look at the early history of the Transformer and try to learn lessons from it. The goal will be to develop a unified perspective from which we can look at many seemingly disjoint events, and from that we can hope to project what might be coming. So that will be the goal of this lecture, and we'll look at some of the Transformer architectures. Let's get started. Everyone, I'm sure, feels that AI is advancing so fast that it's hard to keep up, and it doesn't matter if you have years of experience; so many things come out every week that it's just hard to keep up. I see many people spend a lot of time and energy catching up with the latest developments, the cutting edge, the newest thing, and not enough attention goes to old things, because they seem deprecated and no longer relevant. But I think it's actually important to look at them, because when things are moving too fast for us to catch up, what we need to do is study the change itself. That means we can look back at previous things, look at the current thing, and map how we got here, from which we can see where we're heading. So what does it mean to study the change itself? First, we need to identify the dominant driving forces behind the change. Here, "dominant" is an important word, because a change typically has many, many driving forces.
And we only care about the dominant one, because we're not trying to be really accurate; we just want a sense of directionality. Second, we need to understand that driving force really well. And then after that, we can predict the future trajectory by rolling out the driving force. And you heard that right: I mentioned predicting the future. This is a computer science class, not astrology. But I think it's actually not impossible to predict some future trajectories of a very narrow scientific domain, and that endeavor is really useful. Say you do all this and raise your prediction accuracy from 1% to 10%, and then you make, say, 100 predictions: ten of them will be correct, and maybe one of them will be really, really correct, meaning it will have an outsized impact that outweighs everything else. And that's a very general thing I've seen in life: you really only have to be right a few times. So why is predicting the future difficult? Let's even think about the extreme case where we can predict with almost perfect accuracy. Here I'm going to do a very simple experiment of dropping this pen, following the same three-step process. First, we identify the dominant driving force. What are the forces acting on this pen? Gravity, downwards. Is that all? We also have air friction if I drop it, which causes what's called a drag force acting upwards. Actually, depending on the orientation in which I drop it, the aerodynamic interaction is so complicated that we don't currently have any analytical way of modeling it; we can do it with CFD, computational fluid dynamics, but it would be nontrivial. But the pen is heavy enough that gravity is probably the only dominant force, so we can neglect the drag and simplify the problem.
Second, do we understand this dominant driving force, which is gravity? We do, because Newtonian mechanics provides a reasonably good model. And with that, we can predict the future trajectory of this pen. If you remember from your dynamics class: the initial velocity is zero, since I'm not going to throw it; say the initial position is zero here; then one half g t squared gives the precise trajectory of this pen as I drop it. So if there is a single driving force that we really understand, it's actually possible to predict what's going to happen. Then why do we have such fear about predicting the future in the most general sense? I argue that, among many reasons, it's the sheer number of dominant driving forces acting on the general prediction; their interaction creates a complexity we cannot handle in the most general case. Here's my cartoon way of thinking about predicting the future. On the x axis, we have the number of dominant driving forces; on the y axis, prediction difficulty. On the left-hand side we have the dropping of a pen: a very simple case, the difficulty is very small, you just need to learn physics. And as you add more stuff, it just becomes impossible. So how does this map onto AI research? You might think: I see new things coming in all the time; we're bombarded by them. Someone comes up with a new agent, a new modality, a new MMLU score, whatever. We see so many things that I'm not even able to catch up with the latest thing; how can I hope to predict the future of AI research? But I argue it's actually simpler, because there is a dominant driving force governing a lot, if not all, of AI research. And because of that, AI research sits a lot closer to the left of this plot than we might perceive.
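The pen-drop prediction from a moment ago fits in a few lines. The drop height below is my assumption for illustration; the formula is just the one-half-g-t-squared kinematics from the talk.

```python
# With zero initial velocity and gravity as the only force, the
# distance fallen after t seconds is (1/2) * g * t**2.
g = 9.81  # m/s^2, standard gravity

def fall_distance(t):
    return 0.5 * g * t ** 2

# Time to fall from a height of 1.2 m (assumed drop height):
# solve (1/2) * g * t**2 = h for t.
h = 1.2
t_hit = (2 * h / g) ** 0.5  # just under half a second
```

This is the "one dominant driving force, well understood" regime: a closed-form prediction with essentially perfect accuracy.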
What is that driving force? Oh, maybe before that, a caveat: in this kind of talk, I'd like not to focus too much on technical stuff, which you can probably study better on your own time; rather, I want to share how I think, and for that I have to share my opinions. So it will be strongly opinionated. I'm by no means saying it's all correct; I just want to share my perspective. So coming back to the driving force for AI: what is that dominant driving force? Here's a plot from Rich Sutton. On the y axis, we have calculations, FLOPs: how much computing power you get if you pay $100. It's on a log scale. On the x axis, we have time, spanning more than 100 years. So this is actually more than exponential, and I don't know of any trend as strong and as long-lasting as this one. Whenever I see something like this, I say: okay, I should not compete with this; rather, I should try to leverage it as much as possible. What this means is you get 10x more compute every five years for the same amount of dollars; in other words, the cost of compute is going down exponentially. And this, and the associated scaling, is really dominating AI research. That is somewhat hard to accept, but I think it's really important to think about. So coming back to AI research: how does this exponentially cheaper compute drive AI research? Think about the job of AI researchers: it is to teach machines how to think, in a very general sense. And one, somewhat unfortunately, common approach is to think about how we think, and teach the machine that: we model how we think and then try to incorporate that into some kind of mathematical model, and teach that. And now the question is: do we understand how we think at the very low level? I don't think we do. I have no idea what's going on.
So it's fundamentally flawed, in the sense that we're trying to model something we have no idea about. What happens with this kind of approach is that it imposes a structure that serves as a shortcut in the short term, so you can maybe get a paper or something, but then it becomes a bottleneck, because we don't know how this will limit further scaling up. More fundamentally, what this is doing is limiting the degree of freedom we give to the machines, and that will backfire at some point. This has been going on for decades. And "The Bitter Lesson" is, I think, the single most important piece of writing in AI. It says, in my wording: the past 70 years of AI research can be summarized as developing progressively more general methods with weaker modeling assumptions or inductive biases, and adding more data and compute; in other words, scaling up. That has been the recipe of all of AI research, not the fancy things. And if you think about it, the methods of the 2000s were a lot more difficult than what we use now, so from a technical perspective it's much easier to get into AI nowadays. So this is, I think, really the key information: the cost of compute is going down exponentially, and it's getting cheaper faster than we're becoming better researchers. Don't compete with that; just try to leverage it as much as possible. And that is the driving force I wanted to identify. I'm not saying it's the only driving force, but it is the dominant one, so we can probably neglect the others. Here's the graphical version of that. On the x axis, we have compute; on the y axis, performance of some kind; let's say some measure of general intelligence. And let's look at two different methods: one with more structure, more modeling assumptions, fancier math, whatever; the other with less structure.
What you typically see is that the more structured method starts with better performance in the low-compute regime, but then it plateaus, because at some point the structure backfires. And with less structure, because we give the model a lot more freedom, it doesn't work at the beginning; but as we add more compute, it starts working and then it gets better. We call this a more scalable method. So does that mean we should just go with the least structure, the most freedom to the model, from the get-go? The answer is obviously no. Think about an even-less-structure case: this red line here picks up a lot later and requires a lot more compute. So it really depends on where we are; we cannot wait indefinitely for the most general case. Say our compute situation is at this dotted line: if we're here, we should choose the less-structure method over the even-less-structure one, because the latter doesn't really work yet and the former does. But crucially, we need to remember that we are adding some structure because we don't have enough compute, so we need to remove it later. And the difference between these methods is the additional inductive biases or structure we impose, and structure that someone imposes typically doesn't get removed. What this means is that, at a given level of compute, data, algorithmic development, and architecture, there is some optimal inductive bias or structure we can add to the problem to make progress. And that is really how we have made so much progress. But these are shortcuts that hinder further scaling later on, so we have to remove them later, when we have more compute, better algorithms, or whatever. As a community, we do the adding of structure very well, because there's an incentive structure with papers: you add a nice structure, you get a paper. But removing structure doesn't really get you much.
So we don't really do that, and I think we should do a lot more of it. Maybe another implication of the Bitter Lesson is that, because of this, what is better in the long term almost necessarily looks worse now. This is quite unique to AI research, because the current paradigm is learning-based methods, meaning we give the models freedom: the machines choose how they learn. Because we need to give more freedom, things are more chaotic at the beginning, so they don't work; but once they start working, we can put in more compute and they get better. It's really important to keep this in mind. To summarize: we have identified the dominant driving force behind AI research, and that is exponentially cheaper compute and the associated scaling up. Now that we have identified it, if you remember my initial slides, the next step is to understand this driving force better, and we'll spend most of the remaining time doing that. For that, we need to go back to some history of the Transformer, because this is a Transformers class, and analyze the key structures and decisions that were made by the researchers at the time: why they made them, whether each was an optimal structure to add at the time, why it might be irrelevant now, and whether we should remove it. We'll go through some exercises of this kind, and hopefully it will give you some flavor of what scaling research looks like. Now we'll get into the slightly more technical material: the Transformer architecture. There are several variants; I'll talk about three of them. First is the encoder-decoder, the original Transformer, which has a bit more structure. Second is the encoder-only, popularized by BERT. Third is the decoder-only, which you can think of as current language models like GPT-3. This has a lot less structure than the encoder-decoder.
These are the three types we'll go into in detail. The encoder-only is actually not that useful in the most general sense. It still has its place, but we'll just briefly go over it and then spend most of the time comparing the first and the third: one has more structure, so what's the implication of that, and so on. First of all, let's think about what a Transformer is at a very high level, from first principles: what is the Transformer as a sequence model? A sequence model takes a sequence as input. The sequence elements can be words or images or whatever; it's a very general concept. In this particular example I'll use words: a sentence is a sequence of words. The first step is to tokenize it, because we have to represent words in computers, which requires some kind of encoding scheme; we just do it with a fixed set of integers, so we now have a sequence of integers. Then the dominant paradigm nowadays is to represent each sequence element as a dense vector, because we know how to multiply vectors well. So we have a sequence of vectors. Finally, the sequence model does the following: we want to model the interaction between sequence elements, and we do that by letting them take dot products with each other. If the dot product is high, we can say the elements are semantically more related than elements whose dot product is low. That's roughly what a sequence model is, and the Transformer is a particular type of sequence model that uses what's called attention to model this interaction. So let's get into the details of the encoder-decoder, the original Transformer. It has many pieces, so let's go through it one piece at a time, starting with the encoder. Here I'll use the example of machine translation, which used to be a very cool thing: you have an English sentence, "That is good," and we're going to translate it into German.
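The pipeline just described (tokenize, embed each element as a vector, let elements interact via dot products weighted through a softmax) can be sketched in a few lines of NumPy. This is a toy illustration under assumed shapes, not the speaker's code; the sequence length and dimension are made up.

```python
import numpy as np

def dot_product_attention(x):
    """Toy self-attention over a sequence of d-dimensional vectors.

    x: (seq_len, d) array. Every pair of elements interacts via a dot
    product; a high score means the pair is treated as more related,
    and each output vector is a relevance-weighted mix of the inputs.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])           # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ x                                # mix by relevance

seq = np.random.default_rng(0).normal(size=(4, 8))    # 4 tokens, dim 8
out = dot_product_attention(seq)
assert out.shape == (4, 8)                            # one vector per token
```

The output has the same shape as the input, which is what lets the layer be stacked N times, as the encoder description below does.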
The first thing is to encode this into dense vectors; here I'm representing each word with a vector of size three or so. Then we let them take dot products. These lines represent which elements can attend to which other elements, and here, because it's the input, we use what is called bidirectional attention: any token can talk to any other token. Then we have the MLP, or feed-forward layer, which is per-token; it doesn't involve any interaction between positions, we just do some multiplication because we can. That's one layer, and we repeat it N times; that's the Transformer encoder. At the end, what you get is a sequence of vectors, each representing a sequence element, in this case a word. That's the output of the encoder. Now let's look at the decoder, which is a similarly shaped stack of layers. Here we feed in as input what the answer should be. "BOS" is the beginning-of-sequence token, and then "Das ist gut" (I don't know quite how to pronounce it, but that's the German translation of "That is good"). We go through a similar process, except here we have causal self-attention, meaning that a token at time step t can only attend to positions t and before, because when we start generating we don't have the future tokens, so during training we have to enforce that limit. This is done by masking, and it's the key difference from the encoder. After again N layers, you get the sequence output, so overall this is a sequence-to-sequence mapping. This is the general encoder-decoder architecture, and when the model emits the end-of-sequence token, it stops generating. So that's the overall picture. Now I'll point out some important attention patterns. We are translating into German whatever is input to the encoder, so there has to be some connection between the decoder and the encoder.
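The bidirectional-versus-causal distinction just described comes down to a boolean mask over the attention scores. A minimal sketch (illustrative only; real implementations add the mask as large negative values before the softmax):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Return a boolean matrix where entry (t, s) says whether position t
    may attend to position s. The encoder uses bidirectional attention
    (everything allowed); the decoder uses a causal mask so token t only
    sees tokens at positions <= t."""
    if causal:
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.ones((seq_len, seq_len), dtype=bool)

print(attention_mask(3, causal=True).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

Row t of the causal mask shows exactly the "attend to t and before" rule; during training the masked positions get a score of minus infinity so the softmax gives them zero weight.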
That connection is the cross-attention mechanism, shown here in red: each vector representation in the decoder's sequence attends to some of the encoder's. One interesting design feature in particular is that all the layers in the decoder attend to the final layer output of the encoder. I'll come back to the implications of this design. Now on to the second architecture type, encoder-only; we'll spend only a little time here. Again we have the same input and go through a similar structure, but in this case the final output is a single vector, regardless of the length of the sequence. That single vector is a dense representation of the input sequence. Then, say we do some kind of sentiment analysis: we run it through a task-specific linear layer to map it to classification labels, positive or negative probabilities here, and that linear layer is required for every such task. This was popularized by BERT. At the time, 2018, when BERT came out, we had the benchmark called GLUE, a language understanding benchmark: for most tasks, a sequence goes in and a classification label comes out. This was how the field really advanced at the time. So when we care about tasks like that, there's an incentive to simplify the problem, to add structure to the problem, so that we can make progress. The additional structure put into this particular architecture is that we give up on generation. If we do that, the problem becomes a lot simpler: instead of sequence-to-sequence, we're talking about sequence-to-classification-labels, and that's just so much easier. So around 2018 and 2019, a lot of papers and research (we sometimes call it BERT engineering) amounted to a little change of this or that.
You get 4-5% better on GLUE and you get a paper; things like that. It was a very chaotic era. But from this perspective, by giving up on generating sequences, we bought a lot of performance, and in the long term that structure is not really useful. So we won't look at the encoder-only architecture going forward. The third architecture is the decoder-only. This one is my personal favorite. It looks kind of daunting because of its attention pattern, but it's actually very simple: there is only a single stack, and it can actually generate things. There's a misconception that, because the decoder-only architecture is used for language modeling, next-token prediction, it cannot be used for supervised learning. But it can. The trick is to take the input, "That is good," and concatenate it with the target. If you do that, it just becomes sequence in, sequence out. What happens is that the self-attention mechanism handles both the cross-attention between target and input and the self-attention within each, all with causal attention. As I mentioned, the output is a sequence. The key design features are: self-attention serving both roles, and, in some sense, sharing parameters between input and target, so the same set of parameters is applied to both the input and the target sequences. That's the decoder-only. Now we'll go into the comparison. The architectures look very different, at least in the schematics, so how different are they actually? I argue they're actually quite similar.
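The concatenation trick for doing supervised learning with a decoder-only model can be sketched concretely. The token ids, separator choice, and helper name below are hypothetical; real systems vary in how they delimit input and target, but the idea of one causal sequence with loss computed only on target positions is standard.

```python
# Hypothetical token ids, for illustration only.
BOS, EOS = 0, 1

def make_decoder_only_example(input_ids, target_ids):
    """Turn an (input, target) pair into one decoder-only training sequence.

    Input and target are simply concatenated; causal self-attention then
    plays both roles: "cross-attention" from target positions back to the
    input, and self-attention within each part. Loss is typically computed
    only on the target positions, so we also return a mask marking them.
    """
    tokens = input_ids + [BOS] + target_ids + [EOS]
    loss_mask = [0] * (len(input_ids) + 1) + [1] * (len(target_ids) + 1)
    return tokens, loss_mask

tokens, mask = make_decoder_only_example([5, 6, 7], [8, 9])
assert tokens == [5, 6, 7, 0, 8, 9, 1]
assert mask == [0, 0, 0, 0, 1, 1, 1]
```

Because the same stack processes both halves, the same parameters serve the input and the target, which is exactly the parameter-sharing point made above.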
To illustrate that, we're going to transform the encoder-decoder, which has more structure built in, into the decoder-only architecture, see what the differences are, and then interpret those differences, those additional structures: are they still relevant now that we have more compute, better algorithms, and so on? Let's set up a table of four differences; we'll go through each of them and populate the table as we go. First, the additional cross-attention. On the left is the encoder-decoder, which has this additional red block, the cross-attention, compared to the simpler architecture that doesn't have it. We want to make the left closer to the right, which means we need to either get rid of it or unify it with something else. An attention mechanism has four projection matrices, and self-attention and cross-attention actually have the same number of parameters, with the same shapes, so we can just share them. That's the first step: share both of these, and they become essentially the same mechanism. So that's the first difference: a separate cross-attention versus self-attention serving both roles. The second difference is parameter sharing. The encoder-decoder architecture uses separate parameters for the input and the target, while the decoder-only has a single stack, so it uses shared parameters. If we want to make the left closer to the right, we share the encoder and decoder parameters; let's do that and color them the same. Now they share parameters. The third difference is the target-to-input attention pattern: we need to connect the target to the input, and how is that done? In the encoder-decoder case we had the cross-attention, and in the decoder-only it's the self-attention doing everything.
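The claim that self-attention and cross-attention have the same parameter shapes, so their four projection matrices can be shared, can be checked directly. This is a single-head sketch with made-up dimensions, not the talk's code:

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
# Four projection matrices per attention block: query, key, value, output.
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))

def attention(queries_from, keys_values_from):
    """One attention head. Self-attention passes the same sequence for both
    arguments; cross-attention passes decoder states as queries and encoder
    states as keys/values. The parameter shapes are identical either way,
    so the same four matrices can serve both roles, as in decoder-only."""
    q = queries_from @ W_q
    k = keys_values_from @ W_k
    v = keys_values_from @ W_v
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v) @ W_o

dec = rng.normal(size=(3, d))   # 3 decoder positions
enc = rng.normal(size=(5, d))   # 5 encoder positions
self_out = attention(dec, dec)   # self-attention
cross_out = attention(dec, enc)  # cross-attention, same parameters
assert self_out.shape == cross_out.shape == (3, d)
```

The only difference between the two calls is where the keys and values come from; nothing about the parameters changes.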
The difference is that in the encoder-decoder, every layer of the decoder attends to the final layer output of the encoder, whereas in the decoder-only it happens per layer, within the layer: when we are decoding, say, the word "ist," we look at the same-layer representation of the input. I think that's a notable design feature. So if we want to make one close to the other, we have to bring this attention back to each layer, so that layer one attends to layer one. Finally, the last difference is the input attention. I mentioned bidirectional attention, and because the decoder-only typically uses unidirectional attention, we need to make them match; we can just get rid of the bidirectionality, so I've removed some of the arrows. At this point, the two architectures are almost identical, with a little difference in the cross-attention but the same number of parameters. And in deep learning, if you train these two architectures on the same task with the same data, I think the results will be well within the noise, probably closer than if you train the same model twice. So I would say they are identical. These are the main differences; now let's look at what the additional structures mean. Here is the populated table. We can say that the encoder-decoder, compared to the decoder-only architecture, has these additional structures, these inductive biases, built in. Let's go through each of them. The first assumption the encoder-decoder builds in as structure is that the input and target sequences are sufficiently different that it is useful to use separate parameters for them. So why would that be useful, and when can that assumption hold? One example is machine translation.
Back when the Transformer was introduced in 2017, translation was a really popular task. It was considered difficult, it's sequence-to-sequence, and you have the BLEU score, a heuristic-based metric that gives you a single number people can optimize. In that task, the input and target are in completely different languages, so if the goal is to learn translation only, it kind of makes sense to say: these parameters in the encoder will take care of English, and these parameters in the decoder will take care of German. That seems natural. What about now? Modern language modeling is about learning knowledge; it's not just about translation, or even only about language. Language comes out as a byproduct of doing next-token prediction, and translation does as well. So does it make sense to have separate parameters in this situation? We have some knowledge in German and some knowledge in English, and if anything we want to combine them; representing them in separate parameters doesn't seem natural to me. So with these much more general, larger models that can do many things, this assumption seems very unnatural. The second example is a bit more modern. Two years ago, when I was at Google, Jason and I did this instruction fine-tuning work. You take the pre-trained model and fine-tune it on academic datasets so that it can understand natural-language instructions. The details don't matter; let's just think about the performance gain from this fine-tuning on the two different architectures we tried. The first is Flan-T5, based on T5, which is an encoder-decoder architecture; the other is a Flan decoder-only model based on PaLM. We spent 99% of the time optimizing on PaLM, and at the end we spent maybe three days on T5.
But the performance gain was a lot higher on T5, and I was really confused by this, in a very good way. After the paper was published, I wanted to dig a little deeper into why this might be the case. My hypothesis is that it's about length. The academic datasets we used, 1,836 tasks, have a very distinctive characteristic: the inputs are long, in order to make the tasks difficult, but the targets cannot be long, because if they were, there would be no way to grade them; that's a fundamental challenge. So what happens is you have long input text and a short target. This is the length distribution of what went into the Flan fine-tuning, and you see one very distinct type of sequence going into the encoder as input and a very different type going into the target. Now, the encoder-decoder architecture assumes the two will be very different, so that structure really shines here. It was kind of an accident, but I think that's why this architecture was so well suited for fine-tuning on academic datasets. What about now; do we still care about this assumption? If you think about general use cases of language models nowadays, if anything, the more interesting cases involve longer generation, longer targets. Just because we cannot grade them doesn't mean we're not interested in them; if anything, we're more interested. So in this longer-target situation, separate parameters per sequence role don't seem to make much sense. Moreover, think about a chat application like ChatGPT: we have multi-turn conversations, and the target of this turn becomes the input of the next turn. My question is: does it even make sense to think about different parameters if, in the next turn, it's going to be the same text?
That was the first inductive bias. The second structure is that target elements can only attend to the fully encoded input, the final output of the encoder. Let's look at what that additional structure means. As I mentioned, every decoder layer attends to the very top encoder layer. In deep neural nets, we typically see that bottom layers and top layers encode information at very different levels; for example, in computer vision, bottom layers encode things like edges, and top layers combine features into higher-level concepts like a cat face. That's why we call deep learning a hierarchical representation learning method. So the question is: if decoder layer one attends to the encoder's final layer, which probably carries a very different level of information, is that an information bottleneck? Bottlenecks like this are actually what motivated the original attention mechanism. In practice, in my experience, it doesn't really make a difference, but my experience was limited to, say, the 24 encoder layers of T5. Layer one attending to layer 24 is probably fine, but what if we have 10x or 1000x more layers? Would that be problematic? I'm not really comfortable with that. So I think this is also an unnecessary design we may need to revisit. The final structure we'll talk about is the bidirectionality in the encoder-decoder. Is bidirectional input attention really necessary? When we had BERT (the B in BERT stands for bidirectional) in 2018, when we were solving question answering on SQuAD, which was actually a very difficult task, any additional trick could make a huge difference. Bidirectionality was really useful, maybe boosting the SQuAD score by something like 20 points; it was a huge thing. But at scale, I don't think it matters that much. This is my highly anecdotal experience.
In Flan 2 we tried both bidirectional and unidirectional fine-tuning and didn't see much difference. But I want to point out that bidirectionality actually brings an engineering challenge for modern multi-turn chat applications: at every turn, the new input has to be encoded again, whereas unidirectional attention handles this much, much better. Here's what I mean. Think of a conversation between user and assistant: "How are you?" "Bad." "Why?" In the bidirectional case, when we generate "Bad," we encode the input bidirectionally, which is fine. But after "Bad" is generated, when we try to answer "Why?", we need to encode "How are you?" again, because those tokens can now attend to "Bad." We have to redo everything from scratch. In contrast, with unidirectional attention we can do much better: when we are generating the next turn, we don't have to re-encode the earlier ones, because tokens cannot attend to future tokens, so nothing about them changes. The earlier part can be cached, and only the new part has to be encoded. This makes a big difference as the turns accumulate. So I would say bidirectional attention did well in 2018, but its advantage was mostly dissolved by scale, and now, because of this engineering challenge, we don't really need it. To conclude, we have looked into the dominant driving force behind AI research: exponentially cheaper compute and the associated scaling effort. To understand this driving force, we analyzed some of the additional structures in the encoder-decoder compared to the decoder-only, and considered what they mean from the perspective of scaling. And I want to conclude with this remark.
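The caching argument above is the standard key-value (KV) cache idea. A minimal sketch (illustrative class and dimensions are made up; real caches are per-layer, per-head tensors):

```python
import numpy as np

class KVCache:
    """Why unidirectional attention makes multi-turn chat cheap.

    With causal attention, earlier tokens can never attend to later ones,
    so their key/value vectors never change: each new turn only appends.
    With bidirectional encoding, a new token could change every earlier
    representation, forcing a full re-encode of the history each turn.
    """
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Called once per new token; old entries are never recomputed.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, query):
        # A new token attends over the entire cached history.
        K = np.stack(self.keys)
        V = np.stack(self.values)
        scores = K @ query / np.sqrt(len(query))
        w = np.exp(scores - scores.max())
        w = w / w.sum()
        return w @ V

cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(5):                    # five tokens of conversation so far
    cache.append(rng.normal(size=8), rng.normal(size=8))
out = cache.attend(rng.normal(size=8))
assert out.shape == (8,)
```

Each generated token costs one append plus one attention over the cache, rather than a re-encode of the whole conversation.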
One could say these analyses are just historical artifacts and don't matter. But if you do many of them, then when you look at current developments you can hopefully think about them in a more unified way: which assumptions in my problem do I need to revisit, are they still relevant, and if not, why not? And can we do it with a more general method and scale up? I hope you'll go back and really think about these problems, and together we can shape the future of AI in a really nice way. That's it. Thanks.
speaker 1: Hi, thank you for the talk. About the mixture-of-experts structure: if what you're saying is correct, how long do you think mixture of experts is going to stay around in new language models?
speaker 2: One thing I have to apologize for: architecture is a topic I'm not really comfortable sharing much about, so I'm limiting myself a bit regarding the future. I'll probably just skip that, but I would say it seems quite general.
speaker 1: Some of the changes you describe between encoder-decoder and decoder-only, the parameter sharing and the bidirectionality, can they not be interpreted as more structure, or less freedom for the model to learn?
speaker 2: Yeah, one can argue that, and it's somewhat subjective. But I think the decoder-only is the simpler structure: we're just saying that input and target are both sequences, and if we have enough capacity, we can handle both. Actually, maybe I should repeat the question: can we think of the parameter sharing and the other decoder-only properties as actually being more structure? My take is that the encoder-decoder is the more complicated model.
And those complications carry more assumptions, right? "Input and target are different" is a stronger assumption than "everything is a sequence, and we deal with sequences in a unified way." That would be my take.
speaker 1: Do you have any thoughts on recent state-space models like Mamba, and how they fit into the paradigm of less structure versus more structure?
speaker 2: Yeah, it's hard to think about on the spot. I talked about architectures, but I don't think architecture changes things too much. Multimodality might bring new challenges, where the Transformer structure could become a bottleneck. Transformers have done a good job so far, but maybe that's worth rethinking, especially with multimodality.
speaker 1: For cross-attention and causal attention: causal attention removes permutation invariance, and in computer vision there's a lot of learned structure for invariances, for example in self-supervised learning. What do you think about those in terms of structure?
speaker 2: So the question is about causal versus bidirectional attention: they're probably fine in the text domain, but in computer vision, being able to attend to the "future" part of the input is really important. Is that the question?
speaker 1: Causal attention removes the invariance to permutation.
speaker 2: Right.
speaker 1: So what do you think about, for computer vision, learning invariances through augmentation, as a form of structure?
speaker 2: I don't really like these invariances. They reflect how humans think we perceive vision. CNNs, for example, build in translation invariance, which we thought was very important.
But I don't think it's actually that important; if anything, it may now be hurting the model's ability to learn something more general. The machines might learn vision in a completely different way from how humans do, and I don't think that's problematic. Those invariances can be a good guiding principle, but I'm not too worried about not having such structures. I would just try it out: if, by some metric, not having the invariance structure is actually better and more scalable, that's probably fine, and actually even better.
speaker 1: I actually have two questions. First, clearly you've been thinking about how inductive biases and structure limit our potential. So I'm curious: which current inductive biases do you think are big blockers that we should release or let go of?
speaker 2: The current structures we should get rid of?
speaker 1: Current inductive biases. You've clearly been thinking about this, so when you look at the state of research, you must be thinking: this is a pretty big inductive bias; it would be really cool if we could let it go.
speaker 2: Let me think. When I think about architecture, I don't believe the architectures are the current bottleneck. Partly that's because I did a lot of architecture research, and in the end we published a paper saying, in effect: we tried around 60 different Transformer modifications, and they were pretty much all the same; none of them made a huge difference. The caveat is that the conclusion might be different now, and I admit I have a strong bias against doing architecture research because of that experience. But one message could be that architecture is not the bottleneck to further scaling. So what is the bottleneck now?
I think it's the learning objective, especially in the supervised learning paradigm, and even in self-supervised pre-training. What we're doing with maximum likelihood estimation is saying: given this input, this is the only correct target, and everything else is incorrect, because the probability mass is finite. Is that really a comfortable teaching signal to give the model? In the old days, we could formalize the correct behavior for a given input very well, and maybe treating one answer as the single correct answer was fine. But now, for very general applications, especially chat ("write me a poem"), if you say this is the only correct answer, the implications could be really severe. That's something I'm not comfortable with, and it's partly why I'm interested in RLHF: it's one instantiation of not using maximum likelihood and instead using a reward model as a learned objective function, which has a lot less structure, so we can scale further. RLHF itself is not all that scalable, I would say, but it shows that we can use supervised deep learning to train a model that serves as an objective function, and that works in a really cool way. I think that's a great paradigm.
speaker 1: Thank you, great answer. My second question: at the beginning of the talk, you said the big driving force is exponentially cheaper compute. But some of what I've been reading says Moore's law is ending and we're moving toward performance gains from architecture. For the past 50 years we had transistor counts doubling, but that's ending. So can we still rely on that?
speaker 1: So when you talk about the compute trends that have structured our history, we're also uncertain about how they will project into the future. What are your thoughts on that?
speaker 2: Yeah, I think Moore's law is really a red herring here, because it's about the number of transistors, and that doesn't really matter. What matters is compute availability, and the GPU, for example, is a very different kind of architecture that enabled the continuation of this trend. Right now, in 2023-2024, we're taking shortcuts with low-precision arithmetic, which I think is still cool, but there are many other GPU-level things. Also, if we become fairly sure about the architecture, we can hard-code it into the chips, and that can provide a lot of benefit. For training, I don't think that's really been done; the GPU, if you think about it, is quite general, so maybe that's something to revisit. So I'm not losing hope, and I don't see the trend ending, but maybe other things will become the bottleneck, like energy.
speaker 1: So physics, probably.
speaker 2: That is something we may need to study again.
speaker 1: If you don't mind me continuing: the problem is, we're talking about exponential driving forces, right? You can tell me you want to hard-code chips, but that's not the same as telling me there's going to be exponential growth we can ride.
speaker 2: Yeah, here's my very boring answer: I think we just need to do a little bit better, and at some point the machines will be better than us at thinking about chip design. That's half joking, but if we look back at this video two years from now, I think it will seem less of a joke. Let's just get there first.
speaker 1: All right, thanks to Hyung Won for an amazing talk.
Subtitle: This lecture dives into intuitions about the inner workings of large language models and looks ahead to the future of AI through the evolution of the Transformer architecture.
Summary
This lecture was given by Jason Wei and Hyung Won Chung of OpenAI.
Jason Wei first shared intuition-level insights into how large language models (LLMs) work, emphasizing that manually inspecting data is essential for understanding model behavior, a process akin to training the researcher's own "biological neural network." He pointed out that the core training task of LLMs, next-word prediction, is essentially massively multi-task learning, through which the model implicitly acquires abilities spanning grammar, world knowledge, and even mathematical reasoning. Following scaling laws, increasing compute (the product of model size and data volume) reliably reduces the model's loss and improves performance. Jason also clarified that although overall performance improves smoothly with scale, the ability to perform specific tasks may appear suddenly, in an "emergent" fashion, only once the model reaches a certain scale. He further explained "U-shaped scaling," where performance on some tasks first declines and then improves as models grow, usually as a result of several underlying abilities interacting.
Hyung Won Chung then explored the future direction of AI through the lens of the Transformer architecture's evolution. He argued that the field's core driving force is the exponential growth of compute, in particular compute available per dollar, and the scaling it enables. Following Rich Sutton's "Bitter Lesson," Chung stressed that more general, less-structured methods win in the long run given enough compute and data, citing a core insight: "what is better in the long term almost always looks worse now." Chung compared the three main Transformer architectures in detail: encoder-decoder (E-D), encoder-only, and decoder-only (D-O), and argued that the D-O architecture, thanks to its simplicity and thoroughgoing parameter sharing, is better suited to current scaling trends. He analyzed how the specific structures introduced in the early E-D architecture (such as separate encoder and decoder parameters and particular attention patterns) were reasonable under the compute and task constraints of the time, and how they become potential bottlenecks in today's pursuit of large-scale general capability.
Together, the two talks underscore the central role of scaling in LLM development, highlight the emergence of complex intelligence from a simple objective, and encourage researchers to prioritize generality and to carefully identify and remove unnecessary structural constraints in preparation for AI's continued scaling.
Intuitions on Language Models (Jason Wei)
Core Question and Approach
The core question Jason Wei addresses is: why do large language models work so well? He argues that the key to understanding this is developing intuition by manually inspecting data. Using his early research on lung cancer image classification as an example, he illustrated how deeply understanding the task itself helps a model researcher gain insight, a process akin to training one's own "biological neural network."
A Quick Review of Language Models
Language models are trained on the task of next-word prediction. Given a piece of text such as "Dartmouth students like to", the model outputs a probability for every word in the vocabulary being the next word (for example, P(drink) = 0.6, P(study) = 0.3). The training objective is to push the probability assigned to the actual next word as close to 1 as possible, minimizing the loss function (typically the negative log-likelihood).
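The loss computation just described can be shown with the example probabilities quoted above. The probability table is taken from the text; everything else is a minimal sketch, not an actual model.

```python
import math

# Hypothetical model output for "Dartmouth students like to ___",
# using the probabilities quoted in the text.
probs = {"drink": 0.6, "study": 0.3, "ski": 0.08, "run": 0.02}

# Training minimizes the negative log-likelihood of the actual next word.
# If the observed next word is "drink":
loss = -math.log(probs["drink"])
print(round(loss, 3))  # 0.511

# The loss reaches 0 only when the model assigns probability 1 to the truth:
assert -math.log(1.0) == 0.0
```

Pushing P(drink) toward 1 drives this loss toward 0, which is exactly the training objective stated above.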
Core intuitions
Intuition 1: Next-word prediction is massively multi-task learning
Jason pointed out that next-word prediction is, in essence, massively multi-task learning: in the course of predicting the next word, the model implicitly learns a huge number of distinct "tasks."
Here are some example tasks that next-word prediction can teach:
| Task | Example sentence from pretraining that could teach this task |
|---|---|
| Grammar | In my free time, I like to code, |
| Lexical semantics | I went to the store to buy papaya, dragon fruit, and |
| World knowledge | The capital of Azerbaijan is |
| Sentiment analysis | Movie review: I was engaged and on the edge of my seat the whole time. The movie was |
| Translation | The word for "pretty" in Spanish is |
| Spatial reasoning | Iroh went into the kitchen to make tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the kitchen, |
| Math question | Arithmetic exam answer key: 3 + 8 + 4 = |
| ...and millions more | |
The scale of this multi-task learning is enormous, spanning millions of potential tasks. The tasks are not always well defined and are sometimes quite arbitrary. For example, from the Wikipedia page on Joe Biden, we can extract the following input/target pairs:
| Input | Target | Task type |
|---|---|---|
| Biden married Neilia | Hunter | world knowledge |
| Biden married Neilia Hunter | , | comma prediction |
| Biden married Neilia Hunter, | a | grammar |
| Biden married Neilia Hunter, a | student | possibly an impossible task? |
This illustrates the complexity and difficulty of the next-word-prediction task.
Intuition 2: Scaling compute reliably improves loss
Following the scaling laws, increasing compute (defined as data volume × model size) reliably and smoothly reduces the model's loss. Kaplan et al. (2020) showed that this trend holds across seven orders of magnitude of compute, with no sign of saturation, meaning that more compute generally yields a better model.
Why does scaling work? Jason offered some speculative explanations:
* Small language models: with limited parameters, memorizing facts is costly, so they must be selective; they tend to learn first-order heuristics and struggle with complex patterns.
* Large language models: with ample parameters, they are happier to memorize long-tail knowledge, and they have the capacity to develop more sophisticated heuristics for predicting the next token precisely, lowering the loss.
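The scaling-law relationship described above is typically modeled as a power law, L(C) = a · C^(−b), which is a straight line in log-log space. A minimal sketch of recovering the exponent from (compute, loss) points; the data here is synthetic, generated from an assumed power law rather than real measurements:

```python
import math

# Hypothetical (compute, loss) pairs following a clean power law
# L(C) = a * C^(-b); scaling-law fits use the same log-log linear form.
a_true, b_true = 10.0, 0.05
compute = [10 ** k for k in range(3, 10)]        # seven orders of magnitude
loss = [a_true * c ** (-b_true) for c in compute]

# Fit log L = log a - b * log C with ordinary least squares.
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
b_fit = -slope                                   # recovered exponent
a_fit = math.exp(y_mean - slope * x_mean)        # recovered coefficient
```

Because the synthetic points follow the law exactly, the fit recovers a and b; with real training runs, the same fit summarizes how steadily loss falls with compute.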
Intuition 3: Overall loss improves smoothly; individual tasks can show "emergence"
Although overall loss falls smoothly as compute grows, not all downstream tasks improve in lockstep. The overall loss can be viewed as a weighted sum of per-task losses (grammar, world knowledge, math ability, and so on).
* Some easily saturated tasks (e.g., basic grammar) hit their floor early, leaving little room for further improvement.
* Hard tasks (e.g., complex mathematical reasoning) may only improve noticeably, or appear suddenly, after the model crosses some scale threshold; this phenomenon is known as "emergent abilities."
Jason analyzed 202 downstream tasks in BIG-Bench and found varied patterns of performance versus scale: about 29% of tasks improve smoothly; about 33% show emergence, with small models near random and large models far above random, in ways that are hard to predict; about 22% stay flat (perhaps too hard); about 13% show no clear relationship with scale; and about 2.5% even show "inverse scaling," with performance degrading as scale grows.
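One toy intuition for why exact-match metrics can look emergent (an illustrative model, not an argument from the lecture): if per-token accuracy improves smoothly with scale, the probability of getting an entire n-token answer exactly right is roughly p^n, which stays near zero until p is high and then rises sharply:

```python
def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """Probability an answer of `answer_len` tokens is entirely correct,
    assuming token-level correctness is independent with probability
    `per_token_acc` (a deliberately crude toy model)."""
    return per_token_acc ** answer_len

# Per-token accuracy improving smoothly across four model scales...
smooth = [0.5, 0.7, 0.9, 0.99]
# ...can still look like a sudden jump under a 10-token exact-match metric.
jumpy = [exact_match_rate(p, 10) for p in smooth]
```

This is one lens on the debate over whether emergence is partly a property of the metric; as noted in the Q&A below, Jason's view is that the underlying abilities are nonetheless real.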
Intuition 4: Certain "clever" tasks show inverse or U-shaped scaling
For some carefully constructed "clever" tasks, performance versus model scale can be U-shaped (worse, then better) or inverse (steadily worse).
For example, given the instruction "Repeat after me: All that glisters is not glib", the expected output is "glib".
* Extra Small model: may simply repeat, outputting "glib" (correct).
* Small model: may have learned the proverb "All that glitters is not gold" and misapply it, outputting "gold" (incorrect).
* Large model: with stronger instruction-following ability, again correctly outputs "glib".
This U-shaped behavior can be explained by decomposing the task into sub-abilities: 1) repeating text (all sizes can); 2) correcting misquotations (small and large models can); 3) following instructions (only large models can). Different combinations of these abilities at different scales produce the different outputs.
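The sub-ability decomposition above can be sketched as a toy model; the capability names and the priority logic are illustrative assumptions, not details from the talk:

```python
# Toy model of the "Repeat after me: All that glisters is not glib" task.
# Each model size is represented as a set of capabilities, and the output
# depends on which capabilities dominate.
def answer(capabilities: set) -> str:
    if "follow_instructions" in capabilities:
        return "glib"    # large model: obeys the instruction and repeats
    if "fix_quotes" in capabilities:
        return "gold"    # small model: "corrects" the quote instead
    return "glib"        # tiny model: can only repeat, which happens to be right

extra_small = {"repeat"}
small = {"repeat", "fix_quotes"}
large = {"repeat", "fix_quotes", "follow_instructions"}

# Accuracy goes right, wrong, right as scale grows: a U-shaped curve.
outputs = [answer(m) for m in (extra_small, small, large)]
```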
Research advice and summary
Jason concluded that scaling model size and data should continue to improve overall loss, but researchers need to watch how individual tasks behave, especially emergence. He strongly recommends plotting scaling curves: by evaluating a method at several data or compute budgets, one can judge whether a research direction works, has hit a plateau, or deserves further investment. Likewise, to make sense of aggregate metrics, break them down into finer-grained categories for analysis.
Selected Q&A with Jason Wei
- Pretraining data quality: ideally use only high-quality data; practice is imperfect, but unreliable sources should be filtered as much as possible.
- Emergence and memorization: directly tied to model scale (depth, width); larger models can encode more complex functions and more facts.
- Predicting emergence: before the emergence point, it is hard to predict from the loss when a task will emerge.
- LLM bottlenecks: data quality and compute remain key; per the scaling laws, increasing both should continue to improve performance.
- On the claim that "emergent abilities are a mirage": Jason personally believes LLM abilities are real, though the choice of evaluation metric can affect what is observed.
Shaping the Future of AI from the History of Transformer (Hyung Won Chung)
Study the change itself to see the future
Hyung Won Chung argued that in a fast-moving field like AI, rather than exhausting oneself chasing every new result, it is better to study the change itself, in three steps:
1. Identify the dominant driving forces behind the change.
2. Understand those driving forces deeply.
3. Use that understanding to predict the future trajectory.
He used a "dropping a pen" analogy: gravity is the dominant force, Newtonian mechanics lets us understand it, and so we can predict the trajectory. AI research looks highly complex, but because one powerful driving force dominates, its direction may be easier to anticipate than it seems.
AI's dominant driving force: the "Bitter Lesson" and scaling
- AI's dominant driving force: exponential growth in compute and the accompanying drop in cost. The chart Rich Sutton shows indicates that compute per dollar has grown roughly 10x every 5 years, a trend that has held for a long time.
- The AI researcher's job: teach machines how to "think." The common approach has been to "teach machines how we think we think," but that builds the limits of human cognition into the model as structural constraints, which can become bottlenecks at scale.
- "The Bitter Lesson" (Rich Sutton): most AI progress over the past decades has come from two things:
  - Developing progressively more general methods with weaker modeling assumptions.
  - Adding more data and compute (i.e., scaling up).
The more structure (inductive bias) a method has, the less well it scales. When compute is scarce, adding specific structure (shortcuts) can win temporarily, but as compute grows those structures can block further scaling and need to be removed. The community is usually good at adding structure and bad at removing it. A key corollary: "The methods that are better in the long run almost always look worse now."
The evolution of the Transformer architecture and its lessons
Chung walked through the early history of the Transformer, analyzing the key structures researchers initially added and why, and how, as compute and algorithms advanced, those structures gradually became less important or even constraining.
* The three main architecture variants:
  1. Encoder-Decoder (E-D): the original Transformer, used e.g. for machine translation; relatively complex.
  2. Encoder-Only: e.g. BERT; mainly for understanding tasks, outputs a fixed representation and cannot directly generate sequences, so its generality is limited.
  3. Decoder-Only (D-O): e.g. the GPT series; the simplest structure, and the basis of many current large language models.
* Data flow: text is first tokenized, each token is then embedded as a vector, and the sequence of vectors is processed by the Transformer's sequence-model layers.
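A minimal sketch of this tokenize → embed → process pipeline; the vocabulary, embedding dimension, and whitespace tokenizer are stand-ins for real components (production tokenizers use subword schemes such as BPE):

```python
import random

# Hypothetical word-level vocabulary and embedding table.
vocab = {"the": 0, "capital": 1, "of": 2, "azerbaijan": 3, "is": 4}
d_model = 8
random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)]
                   for _ in range(len(vocab))]

def tokenize(text: str) -> list[int]:
    # Whitespace split as a stand-in for a real subword tokenizer.
    return [vocab[w] for w in text.lower().split()]

def embed(token_ids: list[int]) -> list[list[float]]:
    # Look up one d_model-dimensional vector per token.
    return [embedding_table[i] for i in token_ids]

# `hidden` is what the Transformer's sequence-model layers would consume.
hidden = embed(tokenize("The capital of Azerbaijan is"))
```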
Encoder-Decoder (E-D) vs. Decoder-Only (D-O)
Through a thought experiment, Chung converts the E-D architecture step by step into D-O, exposing the core differences between the two; these differences are precisely the extra structural assumptions baked into E-D.
Summary of differences:
| Property | Encoder-Decoder | Decoder-Only |
|---|---|---|
| Extra cross-attention | Separate cross-attention module | Self-attention doubles as cross-attention |
| Parameter sharing | Input and target parameters are typically separate | Input and target share parameters |
| Target-to-input attention pattern | Attends only to the encoder's final-layer output | Within-layer attention (e.g., layer 1 attends to the input portion of layer 1) |
| Input attention | Bidirectional | Unidirectional (though the input portion can be made bidirectional) |
Converting Encoder-Decoder to Decoder-Only, step by step:
1. Share the cross-attention and self-attention parameters.
2. Share the encoder and decoder parameters.
3. Have decoder layer 1 attend to encoder layer 1's output (rather than the final layer).
4. Make the encoder's self-attention causal.
After these transformations, the Encoder-Decoder architecture is very close to Decoder-Only.
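The input-attention difference (undone by step 4 above) can be made concrete with boolean masks over a concatenated [input; target] sequence. This prefix-LM-style construction is an illustrative sketch, not code from the talk:

```python
def decoder_only_mask(n_input: int, n_target: int) -> list[list[bool]]:
    """Causal mask over the concatenated [input; target] sequence:
    position i may attend only to positions 0..i."""
    n = n_input + n_target
    return [[j <= i for j in range(n)] for i in range(n)]

def encoder_decoder_mask(n_input: int, n_target: int) -> list[list[bool]]:
    """Same sequence, but input positions attend bidirectionally to all
    input positions, as in an E-D (or prefix-LM) encoder."""
    n = n_input + n_target
    mask = decoder_only_mask(n_input, n_target)
    for i in range(n_input):
        for j in range(n_input):
            mask[i][j] = True   # all-to-all within the input block
    return mask

# With 3 input and 2 target tokens, the two masks differ only in the
# upper-left input block; the target rows are identical.
do, ed = decoder_only_mask(3, 2), encoder_decoder_mask(3, 2)
```

This also illustrates the caching point made later: under the unidirectional mask, past positions never attend to future ones, so their states can be cached across turns instead of being re-encoded.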
Do E-D's extra structural assumptions still hold today?
Chung analyzed the assumptions implicit in E-D's extra structure and their limitations under the current trend toward large-scale, general-purpose AI:
1. Assumption: input and target sequences differ substantially, so separate parameters are more effective.
   * Early machine translation: input and output languages differ, so separate parameters made sense. Modern LLMs, however, aim to learn general world knowledge, and separating parameters merely because the language differs may no longer be optimal.
   * Early instruction tuning: on academic datasets (often long inputs, short outputs), E-D models (e.g., T5, Flan-T5) could fit this length mismatch better thanks to separate parameters and outperformed D-O models (e.g., PaLM, Flan-PaLM). But today's LLM workloads (long-form generation; multi-turn chat, where one turn's output becomes the next turn's input) weaken the generality of this assumption.
2. Assumption: it is ideal for target tokens to attend to the encoder's final (fully encoded) representation.
   * Different layers of a deep network encode information at different granularities. If the encoder is very deep and the decoder sees only its top layer, that can create an information bottleneck. The effect was not obvious in models like T5 (24 encoder layers), but it could matter for much deeper future models.
3. Assumption: all-to-all interaction (bidirectional attention) among input tokens is necessary when encoding.
   * In the BERT era, bidirectional attention gave clear gains on some NLU tasks.
   * Current view: at large scale, the advantage of bidirectionality seems to fade. More importantly, for applications like multi-turn chat, bidirectional attention creates engineering problems (the entire history must be re-encoded every turn), whereas unidirectional attention allows caching of past states for efficient inference.
Conclusions and outlook
Chung concluded that the main driving force of AI research is exponentially cheaper compute and the scaling it enables. Analyzing the extra structure of early architectures like E-D relative to more general ones like D-O, and viewing the evolution of that structure through the lens of scaling, helps explain what is changing in AI today and offers a way to predict its trajectory. He encouraged researchers to re-examine the implicit assumptions and structures in their own work, ask whether they will survive continued scaling, and dare to explore more general methods.
Selected Q&A with Hyung Won Chung
- On the durability of Mixture-of-Experts (MoE): Chung said MoE looks "fairly general," but did not go into detail.
- Aren't D-O's parameter sharing and unidirectionality also "structural constraints"? He argued that E-D carries stronger assumptions and a more complex model because it distinguishes inputs from outputs; D-O treats all sequences uniformly and is the simpler structure.
- On recent state-space models such as Mamba: the architecture itself may not be what changes the game right now; new challenges such as multimodality may put the current Transformer design to the test.
- On invariance learning in computer vision: he is skeptical of forcing human perceptual "invariances" (e.g., translation invariance) onto models; machines may learn in ways different from humans. If removing such structure yields better scaling, removal is preferable.
- Which inductive biases in current LLMs should be removed:
  - The architecture itself may not be the biggest bottleneck (based on his team's extensive experiments across many Transformer variants, which showed little performance difference).
  - A more promising area is the learning objective. For example, classical maximum-likelihood estimation (MLE) assumes a single correct answer per input, which is problematic for open-ended generation such as poetry. RLHF (reinforcement learning from human feedback), which learns a reward model to serve as the objective function, is an attempt with weaker structure and points toward new learning paradigms, though RLHF's own scalability still needs improvement.
- The end of Moore's law and compute growth: Chung argued that what matters is actual compute availability, not raw transistor counts. Advances in GPUs, low-precision computation, and perhaps future specialized chips (if architectures stabilize) could all extend the growth trend. Energy consumption may become a future bottleneck. Half-jokingly, he suggested that machines may eventually help humans design more efficient chips.