speaker 1: So again, I'm very happy to have Jason here. He's an AI researcher based in San Francisco, currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in LLMs such as chain-of-thought prompting, instruction tuning, as well as emergent phenomena. He's also a good friend of mine, and he's been here before to give some talks. So we're very happy to have you back, Jason. Take it away.
Yeah, thanks for the intro. A bit about the structure: I'll talk for around 30 minutes and then take a few questions, then Hyung Won will talk for 30 minutes, and then we'll both take questions at the end. Great. So I want to talk about a few very basic things, and the fundamental question I hope to get at is: why do language models work so well? One thing I'd encourage everyone to do, which I've found extremely helpful in trying to answer this question, is to use a simple tool: manually inspect data. I'll give a short anecdote. I've been doing this for a long time. In 2019, I was trying to build one of the first lung cancer classifiers. There'd be an image and you have to say, okay, what type of lung cancer is this? My first thought was: if I want to train a neural network to do this, I should be able to at least do the task myself. So I went to my advisor and said, oh, I want to learn to do this task first. And he said, Jason, you need a medical degree and like three years of pathology experience to even do this task. I found that a bit discouraging, but I did it anyways. I looked at the specific type of lung cancer I was working on, I read all the papers on how to classify the different types, and I went to pathologists and said, okay, try to classify these; what do I do wrong, and what do you think of that? In the end, I learned how to do this task of classifying lung cancer, and the result was that I gained intuitions about the task that led to many papers. Okay, so first I'll do a quick review of language models. Language models are trained with the next-word prediction task. Let's say you have a sentence, "Dartmouth students like to ___." The goal of next-word prediction is: you have some words that come before, and you want to predict the next word. What the language model does is output a probability for every single word in the vocabulary. So the vocabulary would be "a", "aardvark", ..., "drink", "study", all the way to "zucchini", and the language model puts a probability on every single word. The probability of "a" being the next word is something really small, "aardvark" is something really small, maybe "drink" is, say, 0.6, "study" is 0.3, and "zucchini" is again really small. The way you train the language model is you say: let's say "drink" is the correct next word; I want this number, 0.6, to be as close as possible to one. So your loss is basically how close the probability of the actual next word is to one, and you want this loss to be as low as possible.
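To make that concrete, here is a minimal sketch of the next-word-prediction loss, assuming a toy five-word vocabulary and made-up probabilities (the numbers are illustrative, not from the talk):

```python
import math

# Toy vocabulary and the model's predicted distribution over the next word.
vocab = ["a", "aardvark", "drink", "study", "zucchini"]
probs = [0.02, 0.02, 0.60, 0.30, 0.06]  # made-up model outputs, sum to 1

# Suppose the actual next word in the training text is "drink".
target = "drink"
p_target = probs[vocab.index(target)]

# Cross-entropy loss for this single prediction: -log p(correct next word).
# Training pushes p_target toward 1, which pushes the loss toward 0.
loss = -math.log(p_target)
print(f"p(target) = {p_target:.2f}, loss = {loss:.3f}")  # loss ≈ 0.511
```

In training, this per-token loss is averaged over every position in a huge corpus, which is what ties the objective to all the "tasks" described next.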
Okay, so the first intuition I would encourage everyone to use is that next-word prediction is massively multi-task learning. What I mean by this is the following; I'll give you a few examples. When you train a language model on a large enough dataset with this next-word prediction task, you have a lot of sentences you can learn from. For example, there might be some sentence, "In my free time I like to ___," and the language model has to learn that "code" should get higher probability than the word "banana." So it learns some grammar. It will learn lexical semantics: somewhere in your dataset there might be a sentence, "I went to the store to buy papaya, dragon fruit and ___," and the language model should know that the probability of "durian" should be higher than "squirrel." The language model will learn world knowledge: there will be some sentence on the Internet that says, "The capital of Azerbaijan is ___," and the language model should learn that it should be "Baku" instead of "London." It can learn traditional NLP tasks like sentiment analysis: there'll be some sentence, "I was engaged, on the edge of my seat the whole time. The movie was ___," and the language model looks at the prior words and learns the next word should probably be "good" and not "bad." Another example is translation: you might see some sentence, "The word for 'pretty' in Spanish is ___," and the language model should weight "bonita" more than "hola." Spatial reasoning: you might even have some sentence like, "Iroh went to the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___," and then "kitchen" should be higher probability than "store." And finally, even some math: you might have an arithmetic exam answer key somewhere on the Internet, and the language model looks at it and learns that the next word should probably be "15" and not "11." You can have basically millions of tasks like this when you have a huge dataset, and you can think of this as extreme multi-task learning. Those are very clean examples of tasks, but I'll give an example of how arbitrary some of these tasks can be. Here's a sentence from Wikipedia: "Biden married Neilia ___." Now pretend you're the language model: what's the next word? The next word is "Hunter," Biden's first wife. So what is the language model learning by predicting this word? I guess world knowledge. What's the next word after that? It turns out to be a comma, so here the model is basically learning comma prediction. And the next word after that? It's kind of hard to know, but the answer is "a," and I guess this is maybe grammar, but somewhat arbitrary. And the next word after that? It turns out it's "student," and I don't know what task this is; it could have been "woman," it could have been something else. So this is a pretty arbitrary task. The point I'm trying to make is that the next-word prediction task is really challenging, and if you do it over an entire corpus, you're going to learn a lot of tasks. Okay, the next intuition I want to talk about is scaling: scaling compute (and compute is equal to how much data you have times the size of the language model) reliably improves loss. This idea was basically pioneered by Kaplan et al. in 2020; I would encourage you guys to read the paper.
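For reference, the compute trend in Kaplan et al. is usually written as a power law of roughly the following form; the exponent below is approximate and only meant to indicate the scale of the fits reported in that paper, not an exact value:

```latex
% Loss as a function of training compute C (Kaplan et al., 2020), power-law form:
\[
  L(C) \;\approx\; \left(\frac{C_c}{C}\right)^{\alpha_C},
  \qquad \alpha_C \approx 0.05
\]
% On a log-log plot of loss versus compute this is a straight line,
% which is the picture described next.
```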
What this basically says is that you can draw a plot, and we'll see many plots like this, where the x axis is compute and the y axis is loss. You can train one language model and it'll have some loss, and obviously you want loss to be lower. You train the next one and it'll have that loss, the one after that will have that loss, and so on, and you can basically predict the loss of a language model based on how much compute you're going to use to train it. The reason this is called a law is that in the paper they showed the x axis spans about seven orders of magnitude, so it would be surprising if the trend broke as you continued. The important thing is that the line does not bend over, because if it did, it would saturate, and then putting in more compute or training a larger language model wouldn't actually lead to lower loss. Okay. A question we don't have a good answer to as a field, but I'll give you a hand-wavy answer, is: why does scaling up the size of your language model improve the loss? I'll give two basically hand-wavy answers. Here's a small LM and here's a large one. One thing that matters is how good your language model is at memorizing facts. Imagine you're a small language model and you see a bunch of facts on the Internet: you have to be pretty choosy about which facts you memorize, right? Because if you don't have that many parameters, you can only memorize, say, a million facts, so for each one you have to ask, is this one of the facts I want to memorize to get the lowest loss? You have to be very selective. Whereas if you're a large language model, you can memorize a lot of tail knowledge: for every fact you see, you don't have to ask whether it's worth memorizing, you can just memorize it. The other hand-wavy answer is that small language models tend to learn first-order heuristics. If you're a small language model, you're already struggling to get the grammar correct, so you're not going to try your best to get the math problem exactly right. Whereas if you're a large language model, you have a lot of parameters in your forward pass, and you can try to do really complicated things to get the next token correct and get the loss as low as possible. Okay. The third intuition I'll talk about is: while overall loss improves smoothly, individual tasks can improve suddenly. Here's what I mean. You can write your overall loss like this: if you take some corpus of data and compute the loss on every word in that dataset, you get the overall loss, and because we know that next-word prediction is massively multi-task learning, you can decompose this overall loss into the loss on every individual task. So you have some small number times the loss on, say, grammar, plus some small number times the loss on sentiment analysis, plus some small number times the loss on world knowledge, and so on, all the way to, say, something times the loss on math. You can basically write your overall loss as a weighted sum of the losses on the individual tasks in the dataset. Now the question is: let's say I improve my loss from 4 to 3; does each of these individual tasks improve at the same rate? Well, I would say probably not.
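Written out, the decomposition just described looks like this; the particular task list and the weights are illustrative:

```latex
% Overall next-word-prediction loss as a weighted sum of per-task losses:
\[
  L_{\text{overall}}
  \;=\; w_{\text{grammar}}\, L_{\text{grammar}}
  \;+\; w_{\text{sentiment}}\, L_{\text{sentiment}}
  \;+\; w_{\text{world knowledge}}\, L_{\text{world knowledge}}
  \;+\; \cdots
  \;+\; w_{\text{math}}\, L_{\text{math}}
\]
% Each weight w_i is roughly the fraction of the corpus belonging to that "task";
% the claim is that the individual L_i need not improve at the same rate as L_overall.
```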
So if you have a good enough language model, it's already doing basically perfectly on grammar and sentiment analysis, so those might not improve; they might be saturated. But maybe the loss on math is not saturated, the model is not that good at math yet, so that could improve suddenly. I'll redraw the diagram to show this. Again, compute is here and loss is here, and your overall loss looks like this. What I'm saying is there might be some part of that overall loss that scales like this; say this is grammar. For example, if GPT-3.5 is there and GPT-4 is there, you haven't actually improved the grammar that much. On the other hand, you might have something like this for, say, doing math or harder tasks, where the difference between GPT-3.5 and GPT-4 is much larger. And it turns out you can look at a big set of tasks, which I did, and ask: what's the shape of these scaling curves? I looked at 202 tasks; there's this corpus called BIG-Bench, which has around 200 tasks, and I looked at all of them. Here's the distribution. 29% of tasks were smooth: if I draw the scaling plot with compute on the x axis and accuracy instead of loss on the y axis (so higher is better), you get something like this. I believe 22% were flat: the scaling curve just stays at zero because the task was too hard. 2% showed something called inverse scaling; I'll talk about this in a second, but it means accuracy actually gets worse as you increase the size of the language model. I think 13% were not correlated, so you get something like that, I don't know. And finally a pretty big portion, 33%, were emergent abilities. What I mean by that is that if you plot compute versus accuracy, up to a certain point your accuracy is zero, and then the accuracy suddenly starts to improve. So you can define an emergent ability basically as: for small models the performance is zero, so the ability is not present, and for large models the performance is much better than random. The interesting thing is that if you had only trained the small language models up to that point, you would have predicted it was impossible for a language model to ever perform the task; but when you train the larger model, it does learn to perform the task. So in a sense it's pretty unpredictable. I'll talk about one final thing, which is called inverse scaling, or U-shaped scaling. I'll give a tricky prompt to illustrate this. The prompt is: "Repeat after me: all that glisters is not glib. All that glisters is not ___." That's the prompt I give to the language model, and the goal is to predict the next word. Obviously the correct answer is "glib," because you were asked to repeat after me. What you see is, say you have an extra-small language model, a small language model, and a large language model: the performance of the extra-small model is, say, here at 100%; the small language model is actually worse at this task, so it's somewhere here; and the large language model again learns to do the task. So how do we explain behavior like this for a prompt like this? The answer is that you can decompose this prompt into three subtasks that are basically being done.
The first subtask is: can you repeat text? If you draw the plot again, extra-small, small, large, with 100% here, this is a super easy task, so all the language models have perfect performance. That's one hidden task. The second task is: can you fix a quote? The quote is supposed to be "all that glisters is not gold." So you can plot again, extra-small, small, large: what's the ability to fix that quote? The extra-small model doesn't know the quote, so it gets zero; the small model can do it, and the large model can obviously do it. So that's what that scaling curve looks like. And finally you have the task of following an instruction; obviously, "repeat after me" is the instruction here. What's the performance of these models on that task? The extra-small model can't do it, the small model also can't do it, but the large model can do it, so you get a curve like this. And why does this explain the behavior here? Well, the extra-small model can repeat, but it can't fix the quote and it can't follow the instruction, so it actually gets it correct: it just repeats and says "glib." The small model can repeat and it can fix the quote, but it doesn't follow the instruction, so it decides to fix the quote. And the large model can do all three; it follows the instruction, so it just repeats. That's how, by looking at the individual subtasks, you can explain the behavior of some of these weird scaling curves. I'll conclude with one general takeaway, which is applicable if you do research, and the takeaway is to just plot scaling curves. I'll give a really simple example. Let's say I do something for my research project: I fine-tune a model on some number of examples, that's my thing, and I get some performance there. And here's the baseline of not doing whatever my research project is, and there's its performance. The reason you want to plot a scaling curve is this: let's say you take half the data and find out the performance is actually here, so your curve looks like this. That tells you that you didn't have to collect all the data to do your thing, and if you collect more, you probably won't see an improvement in performance. Another scenario is that you plotted that point and it was there, so your curve looks like this; then potentially, if you kept doing more of whatever your research project is, you'd see an improvement in performance. And finally, maybe your point is there, so your curve looks like this, and in that case you'd expect to see an even larger jump in performance as you continue doing your thing. So yeah, that's my talk, and I'm happy to take a few questions before Hyung Won's talk. Yeah, go ahead.
[Audience question, partially inaudible, about differentiating data sources during pretraining.] Oh yeah, thanks, good question. The question is: during pretraining, how do you differentiate between good data and bad data? The answer is you don't really, but you can try to, by only training on good data. So maybe you should look at your data sources and filter out some data if it's not from a reliable source.
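To make the "plot scaling curves" takeaway above concrete, here is a minimal sketch of the kind of plot described; the data fractions and accuracy numbers are entirely made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical results: fine-tuning accuracy at 25%, 50%, and 100% of the data,
# plus the no-fine-tuning baseline. All numbers are made up.
fractions = np.array([0.25, 0.5, 1.0])
accuracy  = np.array([0.62, 0.70, 0.74])
baseline  = 0.55

plt.plot(fractions, accuracy, marker="o", label="fine-tuned")
plt.axhline(baseline, linestyle="--", label="baseline (no fine-tuning)")
plt.xscale("log")
plt.xlabel("fraction of fine-tuning data (log scale)")
plt.ylabel("accuracy")
plt.legend()
plt.title("Toy scaling curve: is more data still helping?")
plt.show()
# If the curve is flattening by 100% of the data, collecting more probably
# won't help; if it is still rising steeply, more data likely will.
```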
Do you want to give us maybe the intuition behind the intuition for one or two of the examples, like emergence, or why tail knowledge starts to develop? What's behind that? What do you mean, the intuition behind the intuition? Intuitively, for these concepts you're seeing in the graphs, from your experience and expertise, what in the model itself is really causing that emergence? What do you mean by "in the model itself"? Is it more nodes, more attention, in an intuitive sense? Oh yeah, okay. So the question is: what in the model makes the language model better at memorizing tail knowledge or at doing math problems? Yeah, I think it's definitely related to the size of the language model. If you have more layers, you can probably encode a more complex function, and if you have more width, you can probably encode more facts about the world, and then if you want to retrieve a fact, it's probably easier. We'll take one more in person and then I'll finish. So when you were studying the 200-ish problems in BIG-Bench, you noticed that 22% were flat, but there's a possibility that if you increased the compute even further, those might have turned out to be emergent. So my question is: when you were looking at the 33% that turned out to be emergent, did you notice anything about the loss in the flat portion that suggested they would eventually become emergent? Oh yeah, I didn't notice anything. Oh, sorry, let me repeat the question. The question is: when I looked at all the emergent tasks, was there anything I noticed before the emergence point in the loss that would have hinted it would emerge later? To me it's kind of tough. We have a few plots of this: you can look at the loss, and it kind of gets better and then suddenly it spikes, and there's no way to predict it. But you also don't have perfect data, because you might not have all the intermediate points for a given model size. Yeah, great question. Okay, we just have a few questions from people joining on Zoom. The first one is: what do you think are the biggest bottlenecks for current large language models? Is it the quality of data, the amount of compute, or something else? Yeah, great question. I guess if you go back to the scaling-laws paradigm, what it says is that if you increase the size of the data and the size of the model, you'd expect to get a lot better performance, and I think we'll probably try to keep increasing those things. Gotcha. And then the last one: what are your thoughts on the paper, if you've read it, "Are Emergent Abilities of Large Language Models a Mirage?" Oh yeah, I always get this question. I would encourage you to read the paper and decide for yourself. But, you know, what the paper says is that if you change the metric a bit, it looks different. I would say at the end of the day, I think the language model abilities are real, so I don't think it's a mirage. All right. So thanks, Jason, for the very insightful talk. And now we have Hyung Won give a talk. He's currently a research scientist on the OpenAI ChatGPT team, and he has worked on various aspects of large language models: things like pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and so forth.
And some of his notable works include the Flan scaling papers, such as Flan-T5 and Flan-PaLM, as well as T5X, the training framework used to train the PaLM language model. Before OpenAI he was at Google Brain, and he received his PhD from MIT. So give a hand for Hyung Won.
speaker 2: All right, my name is Hyung Won, and I'm really happy to be here today. By the way, is my mic working fine? Yeah? So this week I thought about: okay, I'm giving a lecture on transformers at Stanford; what should I talk about? And I thought, okay, some of you in this room and on Zoom will actually go on to shape the future of AI, so maybe I should talk about that. It's a really important and ambitious goal, and we really have to get it right, so that seemed like a good topic to think about. And when we talk about something in the future, the best place to get advice is to look into the history, in particular the early history of the transformer, and try to learn many lessons from there. The goal will be to develop a unified perspective from which we can look at many seemingly disjoint events, and from that we can hopefully project what might be coming in the future. So that will be the goal of this lecture, and we'll look at some of the architectures of the transformers. So let's get started. Everyone is saying AI is advancing so fast that it's hard to keep up, and it doesn't matter if you have years of experience: there are so many things coming out every week that it's just hard to keep up. I do see many people spend a lot of time and energy catching up with the latest developments, the cutting edge, the newest thing, and not enough attention goes into old things, because they become deprecated and no longer relevant. But I think it's actually important to look into those, because when things are moving so fast, beyond our ability to catch up, what we need to do is study the change itself. That means we can look back at the previous things, look at the current thing, and try to map how we got here, and from that, where we are heading. So what does it mean to study the change itself? First, we need to identify the dominant driving forces behind the change. Here, "dominant" is an important word, because a change typically has many driving forces, and we only care about the dominant ones; we're not trying to be really accurate, we just want a sense of directionality. Second, we need to understand the driving force really well. And then we can predict the future trajectory by rolling out that driving force, and so on. And you heard that right, I mentioned predicting the future. This is a computer science class, not astrology or something, but I think it's actually not that impossible to predict some future trajectories of a very narrow scientific domain, and that endeavor is really useful. Let's say you do all this and raise your prediction accuracy from 1% to 10%, and then you make, say, 100 predictions: ten of them will be correct, and say one of them will be really, really correct, meaning it will have an outsized impact that outweighs everything else. And I think that's a very general thing in life: you really only have to be right a few times.
So why is predicting the future difficult? Maybe think about the opposite extreme, a case where we can do the prediction with almost perfect accuracy. Here I'm going to do a very simple experiment of dropping this pen, and follow the same three-step process. First, we identify the dominant driving force. What are the driving forces acting on this pen? Gravity, downwards. Is that all? We also have air friction if I drop it, which causes what's called a drag force acting upwards. Actually, depending on how I drop it, the orientation, the aerodynamic interaction will be so complicated that we don't currently have any analytical way of modeling it; we could do it with CFD, computational fluid dynamics, but it would be non-trivial. So we neglect that: this pen is heavy enough that gravity is probably the only dominant force. We simplify the problem. Second, do we understand this dominant driving force, gravity? We do, because we have Newtonian mechanics, which provides a reasonably good model. And with that, we can predict the future trajectory of this pen. If you remember from your dynamics class, if the initial velocity is zero (I'm not going to give it any velocity) and the initial position is zero here, then one half g t squared gives the precise trajectory of this pen as I drop it. So if there is a single driving force that we really understand, it's actually possible to predict what's going to happen. Then why do we fear predicting the future in the most general sense? I argue that, among many reasons, it's the sheer number of dominant driving forces acting on the general prediction: their interaction creates a complexity that we cannot predict in the most general sense. So here's my cartoon way of thinking about the difficulty of predicting the future. On the x axis we have the number of dominant driving forces; on the y axis we have prediction difficulty. On the left-hand side we have dropping a pen: a very simple case, the difficulty is very small, you just need to learn physics. And as you add more stuff, it just becomes impossible. So how does this fit into AI research? You might think, okay, I see new things coming in all the time, we're bombarded by new things; some people come up with a new agent, a new modality, a new MMLU score, whatever. We see so many things, I'm not even able to catch up with the latest thing, so how can I even hope to predict the future of AI research? But I argue that it's actually simpler, because there is a dominant driving force governing a lot, if not all, of AI research, and because of that, it's actually a lot closer to the left of this plot than we may perceive. What is that driving force? Oh, maybe before that, I'd like to caveat that when I do this kind of talk, I prefer not to focus too much on the technical stuff, which you can probably study better on your own time; rather, I want to share how I think, and for that I want to share my opinions. So it will be very strongly opinionated; I'm by no means saying this is correct, I just want to share my perspective. So coming back to this driving force for AI: what is that dominant driving force? Here's a plot from Rich Sutton.
On the y axis we have the amount of calculation, in FLOPs, that you get if you pay $100, and it's in log scale. On the x axis we have time, spanning more than 100 years. So this is actually more than exponential, and I don't know of any trend that is as strong and as long-lasting as this one. Whenever I see this kind of thing, I say: okay, I should not compete with this; instead I should try to leverage it as much as possible. What this means is you get 10x more compute every five years if you spend the same amount of dollars; in other words, the cost of compute is going down exponentially. And this, and the associated scaling, is really dominating AI research. That is somewhat hard to accept, but I think it's really important to think about. So how does this exponentially cheaper compute drive AI research? Let's think about the job of AI researchers: it is to teach machines how to think, in a very general sense. One somewhat unfortunate but common approach is to teach machines how we think we think: we model how we think, incorporate that into some kind of mathematical model, and teach that. Now the question is: do we understand how we think at a very low level? I don't think we do; I have no idea what's going on. So it's fundamentally flawed, in the sense that we're trying to model something we have no idea about. What happens if we go with this kind of approach is that it imposes a structure that serves as a shortcut in the short term, so you can maybe get a paper or something, but then it becomes a bottleneck, because we don't know how this structure will limit further scaling up. More fundamentally, what it's doing is limiting the degree of freedom we give to the machines, and that will backfire at some point. This has been going on for decades, and the bitter lesson is, I think, the single most important piece of writing in AI. It says (this is my wording, by the way): the past 70 years of AI research can be summarized as developing progressively more general methods with weaker modeling assumptions, or inductive biases, and adding more data and compute; in other words, scaling up. That has been the recipe of the entire history of AI research, not fancy things. And if you think about it, the models of the 2000s were a lot more difficult to work with than what we use now, so it's much easier to get into AI nowadays from a technical perspective. So this is, I think, really the key piece of information: the cost of compute is going down exponentially, and it's getting cheaper faster than we're becoming better researchers. Don't compete with that; just try to leverage it as much as possible. That is the driving force I wanted to identify. I'm not saying it's the only driving force, but it is the dominant one, so we can probably neglect the others. Here's the graphical version of that. On the x axis we have compute; on the y axis we have performance of some kind, let's say some measure of general intelligence. Let's look at two different methods: one with more structure, more modeling assumptions, fancier math, whatever, and the other with less structure. What you typically see is that the more-structured method starts with better performance in the low-compute regime, but then it plateaus because the structure backfires at some point. With the less-structured method, because we give a lot more freedom to the model, it doesn't work in the beginning.
But then, as we add more compute, it starts working and then gets better. We call this the more scalable method. So does that mean we should just go with the least structure and the most freedom for the model from the get-go? The answer is obviously no. Think about an even-less-structured case, this red line here: it will pick up a lot later and requires a lot more compute. So it really depends on where we are; we cannot indefinitely wait for the most general case. Let's think about the case where our compute situation is at this dotted line. If we're here, we should choose the less-structured method over the even-less-structured one, because the latter doesn't really work yet and the former does. But, crucially, we need to remember that we are adding some structure because we don't have enough compute, so we need to remove it later. The difference between these two methods is the additional inductive biases or structure we impose, and what someone imposes typically doesn't get removed. So what this means is that at the given level of compute, data, algorithmic development, and architecture that we have, there is something like an optimal amount of inductive bias, or structure, that we can add to the problem to make progress, and that is really how we have made so much progress. But these are shortcuts that hinder further scaling later on, so we have to remove them when we have more compute, better algorithms, or whatever. And as a community, we do adding structure very well, because there's an incentive structure with papers: you add a nice structure, you get a paper. But removing structure doesn't really get you much, so we don't really do that, and I think we should do a lot more of it. Maybe another implication of this bitter lesson is that, because of this, what is better in the long term almost necessarily looks worse now. This is quite unique to AI research, because the current paradigm of AI research is learning-based methods, meaning we are giving the models freedom: the machines choose how they learn. Because we need to give them more freedom, it's more chaotic at the beginning, so it doesn't work; but when it starts working, we can put in more compute and then it gets better. It's really important to have this in mind. So, to summarize: we have identified the dominant driving force behind AI research, and it is exponentially cheaper compute and the associated scaling up. Now that we have identified it, if you remember from my initial slides, the next step is to understand this driving force better, so we're going to spend most of our time doing that. For that we need to go back to some history of the transformer, because this is a transformers class, and analyze the key structures and decisions that were made by the researchers at the time: why they made them, whether those were optimal structures to add at the time, why they might be irrelevant now, and whether we should remove them. We'll go through some exercises of this, and hopefully it will give you some flavor of what scaling research looks like. So now we'll go into a little bit of the technical stuff: the transformer architecture. There are some variants; I'll talk about three of them. First is the encoder-decoder, which is the original transformer, and which has a bit more structure. Second is the encoder-only, which was popularized by BERT.
And the third is the decoder-only, which you can think of as a current language model like GPT-3. This has a lot less structure than the encoder-decoder. So these are the three types we'll go into. The encoder-only is actually not that useful in the most general sense (it still has its place), so we'll just briefly go over it and then spend most of the time comparing the first and the third: one has more structure, so what's the implication of that, and so on. First of all, let's think about what a transformer is at a very high level, from first principles. The transformer is a sequence model, and a sequence model takes a sequence as input. The sequence elements can be words or images or whatever; it's a very general concept. In this particular example I'll use words: a sentence is a sequence of words. The first step is to tokenize it, because we have to represent words in computers, which requires some kind of encoding scheme; we just do it with a fixed set of integers, so we now have a sequence of integers. Then the dominant paradigm nowadays is to represent each sequence element as a vector, a dense vector, because we know how to multiply them well, so we have a sequence of vectors. And finally, the sequence model does the following: we want to model the interaction between sequence elements, and we do that by letting them take dot products with each other. If the dot product is high, we can say they are semantically more related than if the dot product is low. That's roughly what a sequence model is, and the transformer is a particular kind of sequence model that uses what's called attention to model this interaction. So let's get into the details of the encoder-decoder, which was the original transformer. It has many pieces, so let's go through it a piece at a time, starting with the encoder. Here I'm going to show you an example of machine translation, which used to be a very cool thing. You have an English sentence, "That is good," and we're going to translate it into German. The first step is to encode it into dense vectors; here I'm representing each word with a vector of size three or something. Then we let them take dot products with each other; these lines represent which element can attend to which other elements. Because this is the input, we use what's called bidirectional attention: any token can attend to any other token. Then we have an MLP, or feed-forward, layer, which is per-token: it doesn't have any interaction across tokens, we just do some multiplication because we can. That's one layer, and we repeat it N times, and that's the transformer encoder. At the end, what you get is a sequence of vectors, each representing a sequence element, in this case a word. That's the output of the encoder. Now let's look at the decoder, which is a similarly shaped stack of layers. Here we put in as input what the answer should be. BOS is the beginning-of-sequence token, and then "Das ist gut" (I don't know how to pronounce it) is the German translation of "That is good." We go through a similar process, but here we have causal self-attention, meaning the token at time step t can only attend to time step t and before, because when we start generating we don't have the future tokens, so when we train we should limit attention that way.
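Here is a minimal NumPy sketch of how that restriction is typically enforced with a mask; the sequence length and scores are toy values, just to show the "attend only to t and before" rule:

```python
import numpy as np

T = 4  # toy sequence length
np.random.seed(0)
scores = np.random.randn(T, T)  # raw attention scores, query x key

# Causal mask: position t may attend only to positions <= t.
# Upper-triangular entries (future keys) are set to -inf before the softmax.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax over keys; future positions get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular matrix of attention weights
```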
That's done with masking, and it's the difference from the encoder. After, again, N layers, you get a sequence output, so the whole thing is a sequence-to-sequence mapping; that's the general encoder-decoder architecture. And when you get the end-of-sequence token, you stop generating. So this is the overall picture. Now I'll point out some important attention patterns. We are translating into German whatever was input to the encoder, so there has to be some connection between the decoder and the encoder. That is done by the cross-attention mechanism, shown here in red, which says that each vector representation at each sequence position in the decoder should attend to some positions in the encoder. The interesting design feature is that all the layers of the decoder attend to the final layer output of the encoder; I'll come back to the implication of this design. So that's that. Now let's move on to the second type of architecture, the encoder-only. We'll spend only a little time here. Again we have the same input and go through a similar structure, but in this case the final output is a single vector: regardless of the length of the sequence, we get a single vector, and it represents the input sequence; it's a dense vector representation. Then, say we do some kind of sentiment analysis: we run it through a task-specific linear layer to map it to classification labels, positive or negative probabilities here, and that's required for each of these task-specific cases. This was popularized by BERT. At the time BERT came out, 2018, we had this benchmark called GLUE, a language-understanding benchmark: a sequence in, classification labels out. For most cases, this was how the field measured progress at the time. So when we care about such tasks, there's an incentive to simplify the problem, to add structure to the problem so that we can make progress. The additional structure put into this particular architecture is that we give up on generation. If we do that, the problem becomes a lot simpler: instead of sequence-to-sequence, we're talking about sequence-to-classification-labels, and that's just so much easier. So at some point, 2018, 2019, a lot of the papers and research were what we sometimes call BERT engineering: you change a little something, you get 4-5% better on GLUE, and you get a paper, and so on. It was a very chaotic era. But if we look at it from this perspective, we are adding the structure of not generating a sequence. That gives a lot of performance win, but in the long term it's not really useful, so we're not going to look at the encoder-only architecture going forward. The third architecture is the decoder-only. This one is personally my favorite. It looks kind of daunting because of this attention pattern, but it's actually very simple: we only have a single stack, and it can actually generate stuff. There's a misconception that, because this decoder-only architecture is used for language modeling, next-token prediction, it cannot be used for supervised learning. But we can actually do it: the trick is to take the input, "That is good," and concatenate it with the target. If you do that, it just becomes sequence in, sequence out.
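Here is a minimal sketch of that concatenation trick for training a decoder-only model on a supervised input/target pair; the tokenization and the loss-mask convention are simplified assumptions, not the exact recipe from the talk:

```python
# Toy example: turn a supervised (input, target) pair into a single
# decoder-only training sequence, with loss computed only on target tokens.
input_tokens  = ["<bos>", "That", "is", "good"]
target_tokens = ["Das", "ist", "gut", "<eos>"]

sequence = input_tokens + target_tokens

# The model is trained with next-token prediction over the whole sequence,
# but the loss on the input portion is typically masked out, so the model
# is only graded on producing the target given the input.
loss_mask = [0] * len(input_tokens) + [1] * len(target_tokens)

for position, (token, use_loss) in enumerate(zip(sequence, loss_mask)):
    print(position, token, "loss" if use_loss else "no loss")
```

Causal attention over this single concatenated sequence then plays both roles (target-to-input attention and within-sequence self-attention), which is exactly the point made next.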
So the self-attention mechanism here is handling both the cross-attention between target and input, and the self-attention within each of them; that's the causal attention. And, as I mentioned, the output is a sequence. The key design features are: self-attention serving both roles, and, in some sense, sharing the parameters between input and target, so the same set of parameters is applied to both the input and the target sequences. So that is the decoder-only. Now we'll go into the comparison. They look very different, at least in the schematics, so how different are they actually? I argue that they're actually quite similar. To illustrate that, we're going to transform the encoder-decoder, which has more structure built in, into the decoder-only architecture, see what the differences are, and then interpret those differences, those additional structures: are they still relevant nowadays, now that we have more compute, better algorithms, and so on? So let's have this table of four differences; we'll go through each of them, and as we go, we'll populate the table. First, let's look at the additional cross-attention. On the left is the encoder-decoder, which has this additional red block, the cross-attention, compared to the simpler one that doesn't have it. We want to make the left closer to the right, so we need to either get rid of it or do something. An attention mechanism has four projection matrices, and self-attention and cross-attention actually have the same number of parameters with the same shapes, so we can just share them. That's the first step: share both of these, and then it becomes mostly the same mechanism. So that's the first difference: a separate cross-attention, versus self-attention serving both roles. The second difference is parameter sharing: between the input and the target, the encoder-decoder architecture uses separate parameters, while the decoder-only has a single stack, so it uses shared parameters. If we want to make the left closer to the right, we want to share the encoder and decoder parameters, so let's do that; I'll just color them the same. Now they share the parameters. The third difference is the target-to-input attention pattern. We need to connect the target to the input; how is that done? In the encoder-decoder case we had cross-attention, and in the decoder-only it's the self-attention doing everything. The difference is that in the encoder-decoder, every layer of the decoder attends to the final layer output of the encoder, whereas the decoder-only does it per layer, within the layer: when we are decoding, say, the word "das," we are looking at the same layer's representation of the input. I think that's a notable design feature. So if we want to make the encoder-decoder closer to the decoder-only, we have to bring back the attention to each layer: now layer one attends to layer one of the encoder. And finally, the last difference is the input attention. I mentioned this bidirectional attention, and because the decoder-only typically uses unidirectional attention, we need to make them match, so we just get rid of the bidirectionality; I just removed some of the arrows. At this point, these two architectures are almost identical.
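Here is a minimal NumPy sketch of the "share the projections" step from this walkthrough: a single set of the four attention projection matrices (query, key, value, output) can serve as self-attention or as cross-attention, depending only on where the keys and values come from. The single-head setup and the shapes are simplifying assumptions:

```python
import numpy as np

d = 8  # toy model dimension, single attention head
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

def attention(queries_from, keys_values_from):
    """One attention call using the same four projection matrices.
    Self-attention:  attention(x, x)
    Cross-attention: attention(decoder_states, encoder_states)"""
    q = queries_from @ W_q
    k = keys_values_from @ W_k
    v = keys_values_from @ W_v
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v @ W_o

decoder_states = rng.standard_normal((3, d))  # 3 target positions
encoder_states = rng.standard_normal((5, d))  # 5 input positions

self_attn  = attention(decoder_states, decoder_states)  # (3, d)
cross_attn = attention(decoder_states, encoder_states)  # (3, d)
print(self_attn.shape, cross_attn.shape)
```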
There's a little bit of difference in the cross-attention, but it's the same number of parameters, and in deep learning, if you train these two architectures on the same task with the same data, I think you'll get results within noise, probably closer than if you trained the same thing twice. So I would say they are essentially identical. Those are the main differences. Now let's look at what these additional structures mean. Here's the populated table. We can say that the encoder-decoder, compared to the decoder-only architecture, has these additional structures, these inductive biases, built in, so let's go through each of them. The first assumption the encoder-decoder builds in as structure is that the input and target sequences are sufficiently different that it's useful to use separate parameters. When can that assumption be useful? One example is machine translation. Back when the transformer was introduced in 2017, translation was a really popular task, and it was considered difficult: it's sequence-to-sequence, and you have a BLEU score, a heuristic-based metric that gives you a single number that people can optimize. In that task, the input and the target are in completely different languages, so if the goal is to learn translation only, it kind of makes sense to say: these parameters in the encoder will take care of the English, and these parameters in the decoder will take care of the German. That seems natural. What about now? Modern language modeling is about learning knowledge; it's not just about translation, or even about language. Language comes out as a byproduct of doing this next-token prediction, and translation as well. So does it make sense to have separate parameters in this situation, where we have some knowledge in German and some knowledge in English, and if anything, we want to combine them? If we represent them with separate parameters, I don't think that's natural. So with these much more general, larger models that can do a lot of things, this assumption seems very unnatural to me. The second example is a bit more modern. Two years ago, when I was at Google, Jason and I did this instruction fine-tuning work. What this is: you take the pre-trained model and fine-tune it on academic datasets so that it can understand natural-language instructions. The details don't matter; here, let's think about the performance gain from this fine-tuning on the two different architectures we tried. The first five models are Flan-T5, which is T5-based, an encoder-decoder architecture; the latter five are decoder-only architectures based on PaLM. We spent 99% of our time on PaLM, optimizing a lot of things, and at the end we spent like three days on T5, but the performance gain was a lot higher on T5. I was really confused by this, in a good way, and after the paper was published I wanted to dig a little deeper into why this might be the case. My hypothesis is that it's about length. The academic datasets we used, roughly 1,800 tasks, have a very distinctive characteristic: the input is long, in order to make the task more difficult, but the target cannot be long, because if it were, there would be no way to grade it. That's a fundamental challenge of academic datasets.
So what happens is you have a long input text and a short target text. This is roughly the length distribution of what went into the Flan fine-tuning: a very different kind of sequence going into the encoder as input, and a very different kind of sequence going in as the target. Now, the encoder-decoder architecture has the built-in assumption that input and target will be very different, and that structure really shines here. It was kind of an accident, but that is, I think, why this architecture was so well suited to fine-tuning on academic datasets. What about now? Do we still care about this assumption? If you think about the general use cases of language models nowadays, if anything, the more interesting cases involve longer generation, longer targets. Just because we cannot grade them doesn't mean we are not interested in them; if anything, we are more interested. So now we have this longer-target situation, and separate parameters keyed to very different input and target sequences don't seem to make much sense. Moreover, think about a chat application like ChatGPT: we do multi-turn conversation, and the target of this turn becomes the input of the next turn. Then my question is: does it even make sense to think about different parameters, if in the next turn the same text switches roles? So that was the first inductive bias we just mentioned. The second structure is that a target element can only attend to the fully encoded input, the final output of the encoder. Let's look at what that additional structure means. As I mentioned, only the very top layer of the encoder is attended to. In deep neural nets, we typically see that the bottom layers and the top layers encode information at very different levels: for example, in computer vision, bottom layers encode something like edges, and top layers combine those features into higher-level concepts, something like a cat face. We call deep learning a hierarchical representation learning method. So the question is: if decoder layer one attends to the encoder's final layer, which probably has a very different level of information, is that some kind of information bottleneck (which is, in fact, what motivated the original attention mechanism)? In practice, in my experience, it doesn't really make any difference, but my experience is limited to, say, the 24 encoder layers of T5: layer one attending to layer 24 is probably fine, but what if we have 10x or 1000x more layers? Would that be problematic? I'm not really comfortable with that, so I think this is also an unnecessary design that we may need to revisit. The final structure we'll talk about is the bidirectionality in the encoder-decoder. Bidirectional input attention: is that really necessary? When we had BERT (the B in BERT stands for bidirectional) in 2018, and we were solving question answering on SQuAD, which was actually a very difficult task, any additional trick could make a huge difference. Bidirectionality was really useful, boosting the SQuAD score by maybe 20 points, so it was a really big thing. But at scale, I don't think it matters that much. This is my highly anecdotal experience: in Flan 2 we tried both bidirectional and unidirectional fine-tuning, and it didn't make much difference.
But I want to point out that this bidirectionality actually brings in an engineering challenge for modern multi-turn chat applications: at every turn, the new input has to be encoded again, whereas with unidirectional attention it's much, much better. Here's what I mean. Think about a modern conversation between a user and an assistant: "How are you?" "Bad." "Why?" In the bidirectional case, when we generate "bad," we need to encode the input bidirectionally, which is fine; but after "bad" is generated, when we try to generate the answer to "why," we need to encode "how" again, because "how" can now attend to "bad." So we need to do everything from scratch again. In contrast, with the unidirectional case we can do much better, because when we are trying to generate the next turn, we don't have to redo "how": it cannot attend to future tokens, so nothing about it changes. If you see the difference, this part can be cached, and this part is the only thing that has to be encoded again. That makes a big difference when we think about many turns of conversation going on. So I would say bidirectional attention did well in 2018, on problems that were mostly solved by scale, and now, because of this engineering challenge, we don't really need it. So, to conclude: we have looked into the dominant driving force governing AI research, which is exponentially cheaper compute and the associated scaling effort. To understand this driving force, we analyzed some of the additional structures added to the encoder-decoder compared to the decoder-only, and thought about what they mean from the perspective of scaling. I want to conclude with this remark: one can say these analyses are just historical artifacts and don't matter, but if you do many of them, then when you look at current events you can hopefully think about them in a more unified manner and ask: what assumptions in my problem do I need to revisit, are they still relevant, and if not, can we do it with a more general method and scale up? I hope you can go back and really think about these problems, and together we can shape the future of AI in a really nice way. So that's it. Thanks.
speaker 1: Hi, thank you for the talk. About the mixture-of-experts structure: if what you're saying is correct, how long do you think mixture of experts is going to stay around for new language models?
speaker 2: So, one thing I have to apologize for: architecture is something I'm not really comfortable sharing a lot about, which is why I'm limiting myself a bit on the future. So I'll probably just skip that, but I would say it seems quite general.
speaker 1: Some of the changes that you described between encoder-decoder and decoder-only, the parameter sharing and the bidirectionality, can they not be interpreted as, sorry, more structure, or less freedom for the model to learn?
speaker 2: Yeah, I think one can see it that way; it's somewhat subjective. But I think the decoder-only is the simpler structure: we're just saying input and target are just sequences, and if we have enough capacity, we can handle both. And there are other cases where, yeah, I can totally see it. Oh, actually, maybe I should repeat the question.
The question is: can we think about the parameter sharing and the other structures, relative to the encoder-decoder, as actually less structure? But I think the encoder-decoder is a bit more complicated a model, and that complication carries a stronger assumption: that the input and target are different. I think that's a stronger assumption than saying it's all a sequence and we deal with the sequence in a unified way. So that would just be my take.
speaker 1: Do you have any thoughts on recent state space models like Mamba, and how they fit into the paradigm of less structure versus more structure?
speaker 2: Yeah, okay, it's hard to think about on the spot. But to me, I talked about architectures, and architecture doesn't change things too much. Maybe multimodality will bring in new challenges, where this transformer structure might become a bottleneck. But yeah, I think transformers have done a good job, so maybe we should think about it especially with multimodality in mind.
speaker 1: About cross-attention and causal attention: causal attention removes the permutation invariance, in a way. And for computer vision, there's a lot of learned structure around invariances, for example from augmentations in self-supervised learning. What do you think about those in terms of complexity?
speaker 2: So the question is: causal versus bidirectional attention were probably fine in the text domain, but in the computer vision case, being able to attend to the "future" part of the input is really important, and causal attention removes the invariance to permutation; and what do I think about invariances learned through augmentation as a way to add structure? So, I don't really like these invariances. They reflect how humans think we perceive vision. A CNN, for example, has translation invariance, which we thought was very important, but I don't think it's actually that important; if anything, it's now hurting the model's ability to learn something more general. The machines might be learning vision in a completely different way from how humans do, and I don't think that's problematic. Those invariances could be a good guiding principle, but I'm not too worried about not having such structures. I would just try it out based on some metric: if not having that invariance structure is actually better and more scalable, that's probably fine, and actually even better if we can do it without the structure.
speaker 1: So I actually have two questions. One: clearly you've been thinking about how inductive biases and structure limit our potential. So I'm curious, what are the inductive biases currently that you think are big blockers that we should release or let go of?
speaker 2: The current structures that we should get rid of?
speaker 1: Current inductive biases, yes, because clearly you've been thinking about this, right? So when you look at the state of research, you must be thinking, man, this is a pretty big inductive bias, it'd be really cool if we could let it go.
speaker 2: So I'm just trying to see what you're... yeah.
So when I think about architecture, I don't think the architectures are the current bottleneck, in my view. That's partly because I did a lot of architecture research, and at the end we published a paper saying, okay, we tried like 60 different transformer modifications, and pretty much none of them made a huge difference; it's pretty much the same thing. The caveat is that now the conclusion could be different, but from that I have a very strong bias against architecture research. So one message could be that the architecture is actually not the bottleneck for further scaling. What's the bottleneck now? I think it's the learning objective, especially in the supervised learning paradigm, or even in self-supervised pre-training. What we're doing with maximum likelihood estimation is saying: given this input, this is the only correct target, and everything else is incorrect, because the probability measure is finite. Is that really a comfortable teaching signal to give the model? In the old days, we could formalize the correct behavior for a given input very well, and maybe a single correct answer was fine. But now, if you're thinking about very general things, especially chat applications like "write a poem," and you say this is the only correct answer, I think the implications could be really severe. So that's something I'm not comfortable with, and partly why I'm interested in RLHF, as one instantiation of not using maximum likelihood and instead using a reward model as a learned objective function, which is a lot less structure, so we can scale further. RLHF itself is not really that scalable, I would say, but it shows that we can use supervised deep learning to train a model that serves as an objective function, and that works in a really cool way. I think that's a great paradigm.
speaker 1: Thank you, great answer. Not that you're being judged for anything. A second question: at the beginning of the talk you said the big driving force is exponentially cheaper compute, right? But some of the stuff I've been reading says Moore's law is ending and we're going toward more specialized architectures. So can we rely on that? For the past 50 years we had transistors doubling or whatever, but that's ending. So when you talk about the compute trends we've been looking at, and how they structured our history, we're also uncertain about how that's going to project into the future. So what are some of your thoughts on that?
speaker 2: Yeah, I think Moore's law is really a red herring in this case, because it's about the number of transistors, and that doesn't really matter; what matters is compute availability. GPUs, for example, are a very different kind of architecture, and they enabled the continuation of this trend. Right now, 2023, 2024, we're kind of taking shortcuts with low-precision arithmetic, which I still think is cool, but there are many other GPU-level things. Also, if we become sure about the architecture, we can hard-code it into the chips, and that can provide a lot of benefits; for training, I don't think that's really been done. GPUs, if you think about it, are maybe too general, so that's something we could revisit.
So I'm not losing hope, although I don't see any trend of doing that yet. But maybe other things will come up as bottlenecks, like energy or something.
speaker 1: So physics, probably.
speaker 2: That is something we need to study, again.
speaker 1: If you don't mind me continuing: the problem is that we're talking about exponential driving forces, right? You can tell me that you want to hard-code chips, but that's not the same as telling me there's going to be exponential growth that we can ride right into.
speaker 2: Yeah, here's my very boring answer: I think we just need to do a little bit better, and at some point the machines will be better than us at thinking about chip design. That's half joking, but if we look back at this video two years from now, I think it will be a less serious statement. Let's just get there first.
speaker 1: All right. So thanks to Hyung Won for an amazing talk.