Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining

Lecture 9 of Stanford's CS224N focuses on model pretraining in natural language processing. It first introduces subword modeling, which breaks words into subword units (characters and character combinations) to address the limitations of a fixed vocabulary when handling unseen words, novel coinages, misspellings, and morphologically rich languages (e.g., Swahili verbs with hundreds of conjugations). This improves vocabulary coverage and generalization and avoids collapsing all unknown words into a single "UNK" token, which loses information. The lecture then covers the motivation for pretraining (starting from word embeddings), three main pretraining approaches (decoder, encoder, and encoder-decoder architectures), what pretraining teaches models, and related topics such as large models and in-context learning. It opens with logistics: assignment 5 (covering pretraining and Transformers, among other topics) and the project proposal deadline.

Media details

Upload date
2025-05-15 21:37
Source
https://www.youtube.com/watch?v=DGfCRXuNA2w

Transcript

speaker 1: Hello, welcome to CS224N. Today we'll be talking about pretraining, which is another exciting topic on the road to modern natural language processing. Okay, how is everyone doing? Thumbs up, thumbs sideways, thumbs down. Wow, no response bias there. All thumbs up. Oh, sideways. Nice, I like the honesty. That's good. Well, okay, so we're now in, what is this, week five? Yes, it's week five, and this lecture, the Transformers lecture, and to a lesser extent Thursday's lecture on natural language generation will be the sum of lectures for the assignments you have to do. So assignment five is coming out on Thursday, and the topics covered in this lecture, in the self-attention and Transformers lecture, and a little bit of natural language generation will be tested in assignment five. The rest of the course will go through some really fascinating topics in modern natural language processing that should be useful for your final projects and future jobs and interviews and intellectual curiosity. But today's lecture is significantly less technical in detail than last Thursday's on self-attention and Transformers; it should give you an idea of this world of pretraining and how it helps define natural language processing today. So, a reminder about assignment five. Your project proposals are also due next Tuesday. Please do get those in on time so that we can give you prompt feedback about your project proposals. And yeah, let's jump into it. Okay. So what we're going to start with today is a bit of a technical detail on word structure and how we model the input sequence of words that we get. When we were teaching word2vec, and really all the methods we've talked about so far, we assumed a finite vocabulary. So you had a vocabulary V that you defined by looking at some data and deciding what the words in that data are. So you have some words like "hat" and "learn", and you have an embedding for each; it's in red because you've learned it properly. Actually, let's replace "hat" and "learn" with "pizza" and "tasty", those are better. And that's all well and good: you see these words in your model and you have an embedding that's been learned on your data, so you know what to do when you see those words. But then you see some variations, maybe something like "taaaaasty", or maybe a typo like "laern", or maybe novel items, a word that you as a human can understand as a combination. This is called derivational morphology: a word like "Transformerify", where "-ify" means take this noun and give me back a verb that means to make more like that noun. To "Transformerify" NLP might mean to make NLP more like, you know, using Transformers and such. And each of these maybe didn't show up in your training corpus, and language is always doing this, right? People are always coming up with new words, there are new domains, and young people are always making new words. It's great. But it's a problem for your model, because you've defined this finite vocabulary and there's no mapping in that vocabulary for any of these things, even though their meanings should be relatively well defined based on the data you've seen so far.
It's just that the string of characters that defines them isn't quite what you've seen. So what do you do? Well, maybe you map them all to a single universal unknown token, UNK: oh, I see something I've never seen before, I'm going to say it's always represented by the same token, UNK. That's been done in the past, and it's bad, right? It's totally losing tons of information, but you need to map it to something. So this is a clear problem, and while it matters in English, in many of the world's languages it's a substantially larger problem. English has relatively simple word structure: there are a couple of conjugations for each verb, like "eats", "ate", "eaten". But in a language with much more complex morphology, or word structure, you'll have a considerably more complex set of forms that you can see in the world. So here is a conjugation table for a Swahili verb, and it has over 300 conjugations. If I defined a vocabulary where every unique string of characters maps to its own word, then every one of the 300 conjugations would get an independent vector under my model, which makes no sense, because the 300 conjugations obviously have a lot in common and differ in meaningful ways. So you don't want to do this; I'd need a huge vocabulary if I wanted all the conjugations to show up, and that's a mistake for efficiency reasons and for learning reasons. Any questions so far? Cool. Okay. So what we end up doing is looking at subword structure, subword modeling. What we're going to do is say: I'm not going to even try to define what the set of all words is; I'm going to define my vocabulary to include parts of words. So I'm going to split words into sequences of known subwords. There's a simple algorithm for this where you start with all characters. If I only had a vocabulary of all characters, plus maybe an end-of-word symbol, for a finite dataset, then no matter what word I saw in the future, as long as I had seen all possible characters, I could take the word and say: I don't know what this word is, I'm going to split it into all of its individual characters. So you won't have this UNK problem; you can represent any word. Then you find common adjacent characters and say, okay, "a" and "b" co-occur next to each other quite a bit, so I'm going to add a new item to my vocabulary. Now it's all characters plus this new word "ab", which is a subword. Likewise, I replace the character pair with the new subword and repeat, until you've added a lot of vocabulary items through this process of finding what tends to co-occur next to what. What you end up with is a vocabulary of very commonly co-occurring substrings, out of which you can build up words (a minimal sketch of this merge loop is below). This was originally developed for machine translation, but it has since been used in pretty much all modern language models.
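As a rough illustration of the merge procedure just described, here is a minimal, simplified byte-pair-encoding-style sketch in Python. It is not the exact tokenizer used by any particular model (real implementations add end-of-word markers, byte-level handling, and tie-breaking rules); the function name and toy corpus are just for illustration.

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    """Minimal BPE-style vocabulary learning: repeatedly merge the most
    frequent pair of adjacent symbols into a new subword."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(w) for w in corpus)
    vocab = set(ch for w in words for ch in w)           # start: all characters
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq                    # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]              # most frequent adjacent pair
        vocab.add(a + b)                                 # new subword, e.g. "a"+"b" -> "ab"
        merged = {}
        for word, freq in words.items():                 # replace that pair everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return vocab

# Toy usage: common words end up as whole entries, rare ones stay in pieces.
vocab = learn_bpe_vocab(["hat", "hat", "learn", "learn", "tasty"], num_merges=10)
```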
So now we have "hat" and "learn": in our subword vocabulary, "hat" and "learn" showed up enough that they're their own individual entries. That's good: simple, common words show up as a single word in your vocabulary, just like you'd like them to. But now "taaaaasty" maybe gets split into "taa##", "aaa##", "sty", where in some conventions this "##" means something like "don't add a space next". So I've taken one thing that looks like a word, and in my vocabulary it's now split into three subword tokens. So when I pass this to my Transformer or to my recurrent neural network, the recurrent neural network would take "taa##" as just a single element, do the RNN update, then take "aaa##", do the RNN update, then "sty". So it could learn to process constructions like this, and maybe I could even add more "aaa"s in the middle and have it do something similar, instead of just seeing the entire word "taaaaasty" and not knowing what it means. Is that feedback? Yeah. How loud is that feedback? Loud. Are we good? Okay, I think we're fixed. Great. And the same with a Transformer: maybe "transformer" is its own word, and then "##ify", and so you can see that you have three learned embeddings instead of one useless UNK embedding. So this is just wildly useful, and variants of this algorithm are used pretty much everywhere in modern NLP. Yes? If we have three embeddings for "taaaaasty", do we just add them together? So the question is, if we have three embeddings for "taaaaasty", do we just add them together if we want to represent it? When we're actually processing the sequence, I'd see something like "I learned about the taa## aaa## sty", and they'd actually be totally separate tokens. But if I wanted to then say, what's my representation of this thing, it depends on what you want to do. Sometimes you average the contextual representations of the three, or you look at the last one. It's unclear what the right thing to do is at that point, but everything sort of works. How do you know where to split? You know where to split based on the algorithm I specified earlier for learning the vocabulary. You learned this vocabulary by combining commonly co-occurring adjacent strings of letters, right? "a" and "b" co-occurred a lot, so now I've got a new word that's "ab". And then when I'm actually walking through and tokenizing, I try to split as little as possible: I split words into the maximal subwords, the ones that take up the most characters. There are algorithms for this (a greedy longest-match sketch is below): you consider the ways you could split it up and find approximately the best way to split it into the fewest subwords. Does it make sense to include punctuation? So the question is, do people use punctuation in the character set? Yes, absolutely. From this point on, just assume that the text given to these models is as unprocessed as possible. You take it, you try to make it clean-looking text where you've removed HTML tags if it's from the internet or whatever, but beyond that you process it as little as possible, so that it reflects as well as possible what people might actually be using this for. Earlier in the course, when we were looking at word2vec, we might have thought, oh, we don't want word2vec vectors for punctuation or something like that. Now everything is kept as close as possible to the text you'd get from people actually trying to use your system. So yes, in practice punctuation is in there, and "..." might be its own word, and maybe a sequence of hyphens, because people make big bars across tables.
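Here is a small sketch of the "split as little as possible" idea mentioned above: greedy longest-prefix matching against a learned subword vocabulary, similar in spirit to WordPiece-style tokenizers. The "##" continuation markers are omitted for simplicity, and the tiny vocabulary is made up for the example.

```python
def tokenize(word, vocab):
    """Greedy longest-match splitting: at each position, take the longest
    vocabulary entry that matches, so the word is split into as few pieces
    as possible. Assumes every single character is in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                    # truly unseen character: keep it as its own piece
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"t", "a", "s", "y", "taa", "aaa", "sty"}
print(tokenize("taaaaasty", vocab))              # ['taa', 'aaa', 'sty']
```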
Yeah? How does it impact things now that what used to be one word could be multiple embeddings versus a single embedding: does the system treat those any differently? The question is, does the system treat words that are really themselves a whole word any differently from words that are pieces of a word? No, the system has no idea; they're all just indices into your embedding vocabulary matrix, so they're all treated equally. What about really long words that are, I guess, relatively common? Because if you're building up from characters all the way up, what happens then? Yeah, the question is what happens to very long words if you're building up from character pairs and portions of words. In practice, the statistics speak really well for themselves: if a long word is very common, it will end up in the vocabulary, and if it's not very common, it won't. There are other algorithms than this one that do slightly better in various ways, but the intuition that you figure out what the commonly co-occurring substrings are, almost independent of length, is the right intuition to have. And you can actually look at the learned vocabularies of a lot of these models, and you see some long words in there just because they showed up a lot. I'm curious how it weighs frequency: say "##ify", or on your next slide the very last piece. How does it weigh the frequency of a subword versus its length? It tries to split into the smallest number of pieces, but what if it could split into three and one of those was super common? So the question is, if "transformer" is a subword in my vocabulary, and "if" is a subword, and "y" is a subword, and "ify" as a three-letter chunk is also a subword, how does it choose to take "ify", which maybe is not very common, as opposed to splitting it into more subwords? It's just a choice. We choose to take the smallest number of subwords, because sequence length tends to be more of the bottleneck, as opposed to having a bunch of very common, very short subwords. Sequence length is a big problem in Transformers, and this seems to be what works, although trying multiple candidate splits of a sequence and running the Transformer on all of them to see which works better is something people have done. But yeah, having fewer, bigger subwords tends to be the best idea. I'm going to start moving on, though; feel free to ask me more questions about this afterward. Okay, so let's talk about pretraining in the context of the course so far. At the very beginning of the course, we gave you this quote: "You shall know a word by the company it keeps." This was the thesis of the distributional hypothesis: that the meaning of a word is defined by, or at least reflected by, the words it tends to co-occur with. And we implemented this via word2vec. The same person who made that quote had a separate quote, actually earlier, that continues this notion of meaning as defined by context, which goes something like: since the word shows up in context when we actually use it, when we speak to each other, the meaning of the word should be defined by the context it actually shows up in. The complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously.
So the big difference here is: at word2vec training time, if I have the word "record", r-e-c-o-r-d, when I'm training word2vec I get one vector for that string, and it has to learn from the contexts it shows up in that sometimes it's "record" the verb and sometimes "record" the noun, but I only have one vector to represent it. So when I use the word2vec embedding of "record", it has this mixture meaning of both of its senses; it doesn't get to specialize and say, oh, this part means the verb and this part means the noun. So word2vec is just going to fail there. I can build better representations of language through contextual representations, using things like the recurrent neural networks or Transformers that we used before to build up contextual meaning. So what we had before were pretrained word embeddings, and then a big box on top of them, like a Transformer or an LSTM, that was not pretrained. You learn your word embeddings via context, and then you have a task like sentiment analysis or machine translation or parsing or whatever, you initialize all the parameters of that box randomly, and you train it to predict your label. The big difference in today's work is that we're going to try to pretrain all the parameters. So I have my big Transformer, and instead of just pretraining my word embeddings with word2vec, I'm going to train all of the parameters of the network, trying to teach it much more about language that I can use in my downstream tasks. So now the labeled data that I have for, say, machine translation might be able to be smaller; I might not need as much of it, because I've already trained much more of the network than I would have if I had just used word2vec embeddings. Okay. So here I've pretrained this entire structure, the word embeddings and the Transformer on top; everything's been trained via methods that we'll talk about today. What does this give you? It gives you very strong representations of language: the meanings of "record" the noun and "record" the verb will be different in these contextual representations, which know where the word is in the sequence and what words co-occur with it in the specific input, unlike word2vec, which has only one representation for "record" independent of where it shows up. It can also be used as a strong parameter initialization for NLP models. In all of your homework so far, you've built natural language processing systems more or less from scratch: how do I initialize this weight matrix? And we always say, oh, small, normally distributed noise, little values close to zero. Here we're going to say: just like we used word2vec embeddings because they encode structure, I'm going to start, say, my machine translation system from a parameter initialization that's given to me via pretraining. And it's also going to give us probability distributions over language that we can use to generate from, and otherwise; we'll talk about this. Okay? So whole models are going to be pretrained. All of pretraining is effectively centered around this idea of reconstructing the input. You have an input; it's a sequence of text that some human has generated.
And the hypothesis is that by masking out part of it and tasking a neural network with reconstructing the original input, that neural network has to learn a lot about language, and about the world, in order to do a good job of reconstructing the input. So this is now a supervised learning problem, just like machine translation: I take this sentence that just existed, "Stanford University is located in ___, California" (Palo Alto, or Stanford, California, I guess), and by removing this part of the sentence, I've made a label for myself. The input is this broken, masked sentence, and the label is "Stanford" or "Palo Alto". So if I give this example to a network and ask it to predict the missing part, then as it does its gradient step on this input, it's going to encode information about the co-occurrence between this context, "Stanford University is located in", and "Palo Alto". So by tasking it with this, it might learn, say, where Stanford is. What else might it learn? Well, it can learn things about syntax. "I put ___ fork down on the table": there's only a certain set of words that could go here. "I put the fork down on the table", "I put a fork down on the table". These are syntactic constraints: the context shows me what kinds of words can appear in what kinds of contexts. "The woman walked across the street, checking for traffic over ___ shoulder": the idea of what could go here involves coreference between this entity being discussed in the world, this woman, and her shoulder. This is a linguistic concept: the word "her" here corefers with "woman", referring to the same entity in the discourse. So the network might be able to learn things about which entities are doing what, and where. It can learn things about semantics. If I have "I went to the ocean to see the fish, turtles, seals, and ___", then the word in the blank should be a member of the class that I, the person writing this sentence, am thinking of: stuff that I see when I go to the ocean along with these other things. So in order to do this prediction task, maybe I learn something about the semantics of aquatic creatures. Okay, what else could I learn? "Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___." What kind of task could I be learning from doing this prediction problem? Sentiment, exactly. This is just naturalistic text that someone wrote, but by saying "the movie was bad", I'm learning about the latent sentiment of the person who wrote it, what they were feeling about the movie at the time. So maybe if I see a new review later on, I can just paste in the review, say "the movie was ___", and if the model generates "bad" or "good", that could be implicitly solving the task of sentiment analysis. Here's another one: "Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___." Okay, so in this scenario, we've got a world that's implicitly been designed by the person creating this text. I've got physical locations in the discourse, like the kitchen; Iroh's in the kitchen; Zuko's next to Iroh, so Zuko must be in the kitchen. So what could Zuko leave but the kitchen, right?
And so in terms of latent notions of embodiment and physical location, the way that people talk about being next to something and then leaving something could tell you a little bit about how the world works, even. Here's one more: I was thinking about the sequence that goes "1, 1, 2, 3, 5, 8, 13, 21, ___". This is a pretty tough one, right? This is the Fibonacci sequence. Could a model, by looking at a bunch of numbers from the Fibonacci sequence, learn in general to predict the next one? That's a question you should be thinking about throughout the lecture. Okay, any questions on these examples of what you might learn from predicting the context? Okay, cool. So a very simple way to think about pretraining is: pretraining is language modeling. We saw language modeling earlier in the course, and now we're just going to say, instead of using my language model just to provide probabilities over the next word, I'm going to train it on that task. I'm going to actually model the distribution p_theta of word t given all the previous words. And there's a ton of data for this, an amazing amount of data in a lot of languages, especially English. There's very little data for this in most of the world's languages, which is a separate problem. But you can pretrain just through language modeling: I do the teacher-forcing thing, so I have "Iroh", I predict "goes"; I have "goes", I predict "to". I train my language model, my Transformer, to do this task, and then I just keep all the weights: I save all the network parameters. Once I have these parameters, instead of generating from my language model, I'm just going to use them as an initialization for my parameters (a minimal sketch of this training signal is below). So we have this pretraining-finetuning paradigm, two steps. A large portion of you this year, in your final projects, will be using the pretraining-finetuning paradigm, where someone has done the pretraining for you. In step one you have a ton of text, and you learn very general things about the distribution of words and the latent things that tells you about the world and about language. And then in step two, you've got some task, maybe sentiment analysis, and you have maybe not very many labels, a little bit of labeled data, and you adapt the pretrained model to the task you care about by doing further gradient steps on that task. So you give it "the movie was ___", you predict happy or sad, and then you continue to update the parameters from the initialization given by pretraining. And this just works exceptionally well, unbelievably well compared to training from scratch, intuitively because you've taken a lot of the burden of learning about language and learning about the world off of the data you've labeled for sentiment analysis, and you've handed the job of learning all that very general stuff to the much more general task of language modeling.
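Here is a minimal sketch of the next-word-prediction training signal just described, assuming PyTorch. An embedding plus a linear head stand in for the full pretrained network (a real model would put a Transformer decoder in between); the point is only the teacher-forcing shift and the fact that gradients flow to all parameters, which are then kept as the initialization for fine-tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)       # stand-in for the full pretrained network:
lm_head = nn.Linear(d_model, vocab_size)        # a real model puts a Transformer between these

# One training sequence of token ids, e.g. "Iroh goes to make tasty tea ..."
tokens = torch.randint(0, vocab_size, (1, 8))
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # teacher forcing: predict the *next* token

logits = lm_head(embed(inputs))                  # (1, 7, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # after many such steps, *all* parameters are saved and reused
                 # as the starting point for fine-tuning
```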
Yes, a question: for other languages, what do you mean by data? Is it just text in that language, labeled in some way? The question is: you said we have a lot of data in English but not in other languages; what do you mean by the data we don't have a lot of in other languages; is it just text? It's literally just text, no annotations, because you don't need annotations to do language model pretraining. The existence of that sequence of words that someone has written provides you with all these pairs of input and output: input "Iroh", output "goes"; input "Iroh goes", output "to". Those are all labels that you've constructed just from the input existing. But in most languages, even on the entire internet, and there are about 7,000-ish languages on Earth, most of them don't have the billions of words that you might want to train these systems on. Yeah, anyone? Are you still only learning one vector per word? The question is, if you're pretraining the entire thing, do you still learn one vector representation per word? You learn one vector representation that is the non-contextual input vector. You have your embedding matrix, which is vocabulary size by model dimensionality, so "Iroh" has one vector and "goes" has one vector. But then the Transformer that you're learning on top of it takes in the sequence so far and gives a vector to each of them that is dependent on the context. Still, at the input, you only have one embedding per word. What sort of metrics would you use to evaluate the pretrained model? It's supposed to be general, but those are application-specific metrics, so which ones do you use? The question is, what metric do you use to evaluate pretrained models, since they're supposed to be so general, but there are lots of very specific evaluations you could use. We'll get into a lot of that in the rest of the lecture. While you're training, you can use simple metrics that correlate with what you want but aren't exactly what you want, like the probability quality: you can evaluate the perplexity of your language model, just as you would when you cared about language modeling itself. And it turns out that better perplexity correlates with all the stuff that's much harder to evaluate, like performance on lots and lots of different tasks. The natural language processing community has also built very large benchmark suites of varying tasks to try to get at some notion of generality, although that's very difficult and even somewhat ill-defined. So when you develop new pretraining methods, what you often do is pick a whole bunch of evaluations and show that you do better on all of them; that's your argument for generality. Okay, so why should this pretraining-finetuning, two-part paradigm help? This is still an open area of research, but the intuitions are what you should take from this course. Pretraining provides some starting parameters, theta-hat, from trying to minimize the pretraining loss over all possible settings of the parameters. And then the finetuning process takes your data for finetuning, where you've got some labels, and tries to approximate the minimum of the finetuning loss through gradient descent, but you start gradient descent at theta-hat, which your pretraining process gave you.
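Written out, the two steps just described look roughly like this (notation mine, matching the theta-hat the lecture refers to):

```latex
% Step 1, pretraining: find a good starting point from unlabeled text
\hat{\theta} \;\approx\; \arg\min_{\theta}\ \mathcal{L}_{\text{pretrain}}(\theta)

% Step 2, finetuning: approximately minimize the task loss by gradient descent,
% with the descent initialized at the pretrained parameters
\theta^{*} \;\approx\; \arg\min_{\theta}\ \mathcal{L}_{\text{finetune}}(\theta),
\qquad \theta_{0} = \hat{\theta}
```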
Now, if you could actually solve this min exactly and wanted to, it sort of feels like the starting point shouldn't matter. But it really, really does. We'll talk a bit more about this later, but the process of gradient descent maybe sticks relatively close to theta-hat during finetuning: you start at theta-hat and then walk downhill with gradient descent until you hit a valley, and that valley ends up being really good because it's close to the pretraining parameters, which were really good for a lot of things. This is a cool place where practice and theory are sort of meeting: optimization people want to understand why this is so useful, and NLP people want to build better systems. So yeah, maybe the stuff around theta-hat tends to generalize well; if you want to work on this kind of thing, you should talk about it. You said stochastic gradient descent sticks relatively close; what if we were to use a different optimizer? How would that change the results? The question is, if stochastic gradient descent sticks relatively close, what if we use a different optimizer? If we use any common variant of gradient descent, any first-order method like Adam, which we use in this course, or Adagrad, they all have very similar properties; other types of optimization we just tend not to use. So who knows? Why does pretraining plus finetuning work better than just finetuning but making the model more powerful, adding more layers, more data? The question is, why does the pretrain-finetune paradigm work better than just making the model more powerful, adding more layers, adding more data, and only finetuning? The simple answer is that you have orders of magnitude more unlabeled data, just text that you found, than you do carefully labeled data for the tasks you care about, because labeled data is expensive to get; it has to be examples of your movie reviews or whatever that someone has labeled carefully. So you have something like, on the internet, at least 5 trillion, maybe 10 trillion words of unlabeled text, and maybe a million words of your labeled data over here. The scale is just way off. But there's also an intuition that learning to do a very simple thing like sentiment analysis is not going to get you an agent that's generally capable in a wide range of settings, compared to language modeling. Even if you have a lot of labeled data of the kind of movie reviews people are writing today, maybe tomorrow they start writing slightly different kinds of movie reviews and your system doesn't perform as well. Whereas if you pretrained on a really diverse set of texts from a wide range of sources and people, it might be more adaptable to seeing stuff that doesn't quite look like the training data you showed it, even if you showed it a ton of training data. So one of the big takeaways of pretraining is that you get this huge variety of text from the internet. You do have to be very careful about what kind of text you're showing it and what kind you're not, because the internet is full of awful text as well. But some of that generality just comes from how hard this problem is and how much data you can show it.
If you train on so much data, how do you then train it so that it considers the stuff you're finetuning it with as more important, more salient to the task it's trying to do, rather than just one in a billion examples? Yeah, it's a good question. The question is, given that the amount of data on the pretraining side is orders of magnitude more than the amount on the finetuning side, how do you get across to the model that the finetuning task is what I actually care about, so focus on that? It's about the fact that I do the pretraining first and the finetuning second. I've gotten my parameter initialization from pretraining, I've set it somewhere, and then I finetune: I move to where the parameters do well on this task afterward. And in doing so, the model might well forget a lot about how to do the pretraining task, because at that point I'm only asking it to do the finetuning task. I should move on, I think, but we're going to keep talking about this in much more detail with more concrete examples. Okay, so let's talk about model pretraining — oh wait, that did not advance the slides. Nice. Okay, let's talk about model pretraining three ways. In our Transformers lecture, we talked about encoders, encoder-decoders, and decoders. We'll do decoders last, because many of the largest models being used today are decoders, so we'll have a bit more to say about them. Let's recall these three. Encoders get bidirectional context: you have a single sequence and you're able to see the whole thing, kind of like the encoder in machine translation. Encoder-decoders have one portion of the network that gets bidirectional context — that's like the source sentence of my machine translation system — paired with a decoder that gets unidirectional context, so that I have this informational masking where I can't see the future, and I can do things like language modeling: I can generate the next token of my translation, whatever. You can think of it as: I've got my source sentence here and my partial translation here, and I'm decoding out the translation. And decoder-only models are things like language models; we've seen a lot of those so far. There's pretraining for all three of these large classes of models, and how you pretrain them, and then how you use them, depends on the properties and the proclivities of the specific architecture. So let's look at encoders first. We've looked at language modeling quite a bit, but we can't do language modeling with an encoder, because it gets bidirectional context. If I'm down here at "I" and I want to predict the next word, it's a trivial task at this level up here, because in the middle of the network I was able to look at the next word. There's nothing hard about learning to predict the next word, because I could just look at it, see what it is, and copy it over. So when I'm pretraining an encoder, I have to be a little bit more clever. What I do in practice is something like this: I take the input and modify it somewhat, masking out words, sort of like I did in the examples at the beginning of class — "I ___ to the ___" — and then I have the network build contextual representations and predict the missing words.
So now this vector representation of the blank sees the entire context around it, and I predict the word "went" here and the word "store" here. Any questions? Okay. You can see how this is doing something quite a bit like language modeling, but with bidirectional context. I've removed the network's information about the words that go in the blanks, and I'm training it to reconstruct them. I only have loss terms — I only ask it to actually do the prediction, compute the loss, backpropagate the gradients — for the words that I've masked out. You can think of this as: instead of learning the probability of x, where x is a sentence or a document, this is learning the probability of x, the real document, given x-tilde, which is the corrupted document with some of the information missing. So I get a sequence of vectors, one per word, which is the output of my encoder, in blue. And then I say that for the words I want to predict, y_i, the probability is proportional to a linear transformation of my encoder's representation — my embedding matrix times that last vector, the A h_i + b in this red portion here — and I do the prediction and train the entire network to do this. Yes? The words that we mask out: do we just select them randomly, or is there something to it? The question is, do we just choose words randomly to mask out, or is there a scheme? Mostly randomly. We'll talk about a slightly smarter scheme in a couple of slides, but yeah, mostly randomly. What was that last part on the bottom, the x-tilde, the masked version? So I'm defining x-tilde to be the input where I've got the masked version of the sentence with words missing, and then I'm defining a probability distribution over sequences conditioned on the input being that corrupted, masked sequence. Okay. So this brings us to a very popular NLP model that you need to know about. It's called BERT, and it was the first to popularize this masked language modeling objective. They released the weights of this pretrained Transformer, which they pretrained via something that looks a lot like masked language modeling, so you can download them and use them via code released by the company Hugging Face, which we keep bringing up. Many of you will use a model like BERT in your final project, because it's such a useful builder of representations of language in context. So let's talk a little bit about the details of masked language modeling in BERT. First, we take 15% of the subword tokens — remember, all of our inputs are now subword tokens; I've made them all look like words here, but just as we saw at the very beginning of class, each of these tokens could be some portion of a word, some subword — and we do a couple of things with them. Sometimes I just mask out the word and predict it. Sometimes I replace the word with a random sample of another word from the vocabulary and predict the real word that was supposed to go there. And sometimes I don't change the word at all and still predict it (a small sketch of this corruption step is below).
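A sketch of that corruption step, assuming PyTorch. The 15%/80%/10%/10% split below is the one reported in the BERT paper; the token ids and the `mask_id` value are placeholders rather than any particular tokenizer's actual ids.

```python
import torch

def bert_style_corrupt(tokens, vocab_size, mask_id, p_select=0.15):
    """Return (corrupted tokens, loss mask). Of the ~15% of positions selected,
    80% become [MASK], 10% become a random token, 10% are left unchanged.
    The MLM loss is computed only where the returned mask is True."""
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape) < p_select            # positions we will predict
    roll = torch.rand(tokens.shape)
    to_mask   = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)
    tokens[to_mask] = mask_id
    tokens[to_random] = torch.randint(0, vocab_size, tokens.shape)[to_random]
    # the remaining selected positions keep the original token but are still predicted
    return tokens, selected

ids = torch.randint(0, 30000, (1, 16))
corrupted, loss_mask = bert_style_corrupt(ids, vocab_size=30000, mask_id=0)
```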
The intuition of this is the following: if I only ever had to build good representations, in the middle of the network, for words that are masked out, then when I actually use the model at test time on some real review to do sentiment analysis, there are never going to be any mask tokens, and maybe the model won't do a very good job, because it's like, oh, I have no job to do here; I only need to deal with the masked tokens. By giving it sequences where sometimes it's the real word that needs to be predicted, and sometimes it has to detect whether the word is wrong, the idea is that when I give it a sentence that doesn't have any masks, it still does a good job of representing all the words in context, because any token could be the one it's asked to predict at any time. Okay. The folks at Google who defined this had a separate, additional task that's interesting to think about. This was their BERT model from their paper. They had position embeddings, just like we saw in the Transformers lecture, and token embeddings, just like we saw in the Transformers lecture. But they also had this thing called a segment embedding, where there were two possible segments, segment A and segment B. They had an additional task where they would take a big chunk of text for segment A and a big chunk of text for segment B, and then ask the model: is segment B a real continuation of segment A? Was it the text that actually came next, or did I just pick this big segment randomly from somewhere else? The idea was that this should teach the network some notion of long-distance coherence, of the connection between a bunch of text over here and a bunch of text over there. It turns out it's not really necessary, but it's an interesting idea, and similar things have continued to have some influence since then. Again, you should get the intuition that we're trying to come up with hard problems for the network to solve, such that by solving them it has to learn a lot about language, and we're defining those problems by making simple transformations of, or removing information from, text that just happened to occur. Questions? For the plus signs, do we concatenate the vectors or do element-wise addition? The question is, for these plus signs, do we concatenate the vectors or do element-wise addition? We do element-wise addition. You could have concatenated them; however, one of the big conventions of all these networks is that you have exactly the same number of dimensions everywhere, at every layer of the network. It just makes everything very simple, so keeping everything the same dimension and doing addition ends up being simpler. Why was the next-sentence prediction not necessary? Great question: why was next-sentence prediction not necessary? One thing it does that's a negative is that the effective context length for a lot of your examples is halved. One of the things that's useful about pretraining, seemingly, is that you get to build representations of very long sequences of text. This example is very short, but in practice segment A was going to be something like 250 words and segment B was going to be 250 words, and in the paper that let us know this wasn't necessary, they always had one long segment of 500 words.
And it seemed to be useful to always have this very long context, because longer contexts give you more information about the role each word is playing in that specific context. If I see one word — if I just see "record" — it's hard to know what it's supposed to mean; but if I see a thousand words around it, it's much clearer what its role is in that context. So cutting the effective context size is one answer. Another thing is that the next-sentence task is actually much more difficult: there's a much more recent paper, which I don't have in the slides, showing that these models are really, really bad at the next-sentence prediction task. So it could be that it was just too hard at the time, and it wasn't useful because the model was failing to do it at all. I'll give the link for that paper later. Can you explain again why we need to do next-sentence prediction? Why not just mask and predict? The question is, why do we need to do next-sentence prediction; why not just do the masking we saw before? And that's the thing: you seem not to need next-sentence prediction. But as a bit of research history, it was thought to be useful. The idea was that it required you to develop this pairwise notion: do these two segments of text interact, how do they interact, are they related, this longer-distance notion. Many NLP tasks are defined on pairs of things, and they thought that might be useful, so they published it with this. And then someone else came through, published a new model that didn't do it, and it did better. So there are intuitions as to why it could work; it just didn't. So was BERT doing just this, or both? It was doing both: BERT was doing both the next-sentence prediction training and the masking training, all at the same time. And so you had to have a separate predictor head on top of BERT, a separate classification head. One detail there is that there's a special token at the beginning of every sequence in BERT, called CLS, and you can define a predictor on top of that sort of fake word's embedding that says whether the next sentence is real or not. Yeah, okay, I'm going to move on. So this gets at the question we had earlier about how you evaluate these things. There are a lot of different NLP tasks out there, gosh. When people were writing these papers, they would look at a ton of different evaluations that had been compiled as a set of things that were still hard for the systems of the day. Are you detecting paraphrases between questions — are two Quora questions actually the same question? That turns out to be hard. Can you do sentiment analysis on this hard dataset? Can you tell whether sentences are linguistically acceptable — are they grammatical or not? Are two sequences semantically similar — do they mean vaguely the same thing? And we'll talk a bit about natural language inference later, but that's the task of deciding things like: if I say "I saw the dog", that does not necessarily mean "I saw the little dog", but saying "I saw the little dog" does mean "I saw the dog". So that's the natural language inference task.
The difference between the pre-pretraining days — this row here, before you had substantial amounts of pretraining — and BERT was such that the field was taken aback in a way that's hard to describe. There were very carefully crafted architectures for each individual task; everyone was designing their own neural network and doing things they thought were clever about how to define all the connections and the weights to do their task independently. Everyone was doing a different thing for each one of these tasks. Roughly all of that was blown out of the water by: just build a big Transformer, teach it to predict missing words a whole bunch, and then finetune it on each of these tasks. So this was a sea change in the field. People were, I mean, amazed. It's a little bit less flashy than ChatGPT, I'll admit, but it's really part of the story that gets us there. Okay, questions? During the encoder pretraining stage, the encoder outputs some sort of hidden values; how do we correlate those with the words we're trying to test against? So the question is: the encoder output is a bunch of hidden values; how do we actually turn those values into the stuff we want to predict? I'm going to go on to the next slide to bring up this example. The encoder gives us, for each input token, a vector that represents that token in context, and the question is how we take these representations and turn them into answers for the tasks we care about. Something like this. When we were doing pretraining, we had the Transformer giving us our representations, and we had this little last layer, this little affine transformation, that moved us from the encoder's hidden state size to the vocabulary to do our prediction. We just remove that last prediction layer. And let's say we want to classify the sentiment of the sentence: we pick, maybe arbitrarily, the last word in the sentence, stick a linear classifier on top, map it to positive or negative, and then finetune the whole thing (a small sketch of this setup follows below). Okay. The BERT release had two models: one was 110 million parameters, one was 340 million. Keep that in the back of your head, percolating, as we talk about models with many, many more parameters later on. It was trained on 800 million words plus — maybe 2.5 billion more, something like that — on the order of a few billion words of text; quite a bit, still. And it was trained on what was considered at the time to be a whole lot of compute. It was Google doing this, and when they released it we were like, oh, who has that kind of compute but Google? Nowadays it's not considered to be very much. But finetuning is practical and common on a single GPU: you can take the BERT model that they spent a lot of compute training and finetune it yourself on your task, even on a very small GPU.
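A sketch of swapping the pretraining head for a task head, assuming PyTorch. Here `encoder` is a stand-in for any pretrained module mapping token ids to per-token vectors (not a specific library's API), and the toy embedding in the usage example just makes the snippet runnable.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Pretrained encoder + a new linear head; all parameters are fine-tuned."""
    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder                          # initialized from pretraining
        self.head = nn.Linear(hidden_size, num_labels)  # randomly initialized; replaces the
                                                        # vocabulary-prediction layer
    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        h = self.encoder(token_ids)                     # (batch, seq_len, hidden_size)
        return self.head(h[:, -1])                      # classify from one position, e.g. the last token

# Toy usage: an embedding layer stands in for the pretrained encoder.
toy_encoder = nn.Embedding(30000, 128)
model = SentimentClassifier(toy_encoder, hidden_size=128)
logits = model(torch.randint(0, 30000, (4, 12)))        # (4, 2); fine-tune with cross-entropy
```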
Okay, so one question is: this seems really great, why don't we just use this for everything? And the answer is, well, no — what is the pretraining objective, the structure of the pretrained model, actually good for? BERT is really good for filling in the blanks, but it's much less naturally used for actually generating text. I wouldn't want to use BERT to generate a summary of something, because it's not really built for it: it doesn't have a natural notion of predicting the next word given all the words that came before it. So I might want to use it when I want a good representation of, say, a document, to classify it, give it one of a set of topic labels, or say whether it's toxic or non-toxic or whatever; but I wouldn't want to use it to generate a whole sequence. Okay, some extensions of BERT. We had a question earlier about whether you just mask things out randomly. One thing that seems to work better is to mask out whole contiguous spans. If you mask a single subword, the problem can be much easier than it would otherwise be — this piece is part of "irresistibly", and you can tell that very easily from the subwords that came before it — whereas if I mask a much longer span, it's a harder problem, and it ends up being better to do this span-based masking than random masking. That may be because subwords make for very simple prediction problems when you mask out just one subword of a word versus all the subwords of a word. So this ends up doing much better. There's also a paper called the RoBERTa paper, which showed that the next-sentence prediction wasn't necessary. They also showed that BERT really should have been trained on a lot more text. RoBERTa is a drop-in replacement for BERT, so if you're thinking of using BERT, just use RoBERTa; it's better. And it gave us the intuition that we really don't know a whole lot about the best practices for training these things: you train for as long as you're willing to, and things do good stuff. But it's very difficult to iterate on these models, because they're big and expensive to train. Another thing you should know for your final projects, and the world ahead, is this notion of finetuning all the parameters of the network versus just a few of them. What we've talked about so far is: you pretrain all the parameters and then you finetune all of them as well, so all the parameter values change. An alternative, called parameter-efficient or lightweight finetuning, is to choose, in a smart way, to keep most of the parameters fixed and only finetune a small set of them. The intuition is that the pretrained parameters were really good, and you want to make the minimal change from the pretrained model to the model that does what you want, so that you keep some of the generality, some of the goodness, of the pretraining. One way this is done is called prefix tuning — prompt tuning is very similar — where you actually freeze all the parameters of the network: I've pretrained my network, and I never change any of those parameter values. Instead, I make a bunch of fake pseudo-word vectors that I prepend to the very beginning of the sequence, and I train just those. It's sort of unintuitive: these would have been inputs to the network, but I'm specifying them as parameters, and I'm training everything to do my sentiment analysis task just by changing the values of these fake words. This is nice because I keep all the good pretrained parameters and just specify this diff from them, which ends up generalizing better (a minimal sketch is below).
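A minimal sketch of that freeze-everything, train-only-the-prepended-vectors idea, assuming PyTorch. This is closer to prompt tuning than to full prefix tuning (which inserts learned vectors at every layer); `embed` and `backbone` are stand-ins for the pretrained embedding table and the pretrained model body that consumes embedding sequences.

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Freeze the pretrained model; learn only k prepended pseudo-word vectors."""
    def __init__(self, embed: nn.Embedding, backbone: nn.Module, d_model: int, k: int = 10):
        super().__init__()
        self.embed, self.backbone = embed, backbone
        for p in list(embed.parameters()) + list(backbone.parameters()):
            p.requires_grad = False                      # pretrained weights never change
        self.prefix = nn.Parameter(0.02 * torch.randn(k, d_model))  # the only trained parameters

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                        # (batch, seq_len, d_model)
        prefix = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([prefix, x], dim=1))

# Only the prefix shows up in the optimizer, which is also what makes this cheap:
# optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```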
This is a very open field of research. It's also cheaper, because I don't have to compute or store the gradients and all the optimizer state for all of those parameters; I'm only training a very small number of them. Does it make any difference whether you put these fake parameters at the beginning or the end? Here it doesn't make a difference whether they're at the beginning or the end, but in a decoder you have to put them at the beginning, because otherwise you don't see them until you've processed the whole sequence. Could we attach new layers on top and only train the new layers? The question is, can we just attach new layers on top of this and only train those? Absolutely, that works. Another thing that works well — sorry, we're running out of time — is to take each weight matrix in my Transformer, freeze it, and learn a very low-rank little diff: I set the weight matrix's value to be the original value plus my very low-rank diff from the original one. This ends up being a similarly useful technique. The overall idea, again, is that I'm learning far fewer parameters than I did via pretraining and freezing most of the pretrained parameters. Okay, encoder-decoders. For encoder-decoders, we could do something like language modeling: I've got my input sequence here for the encoder and my output sequence here, and I could say this part is my prefix, which gets bidirectional context, and then predict all the words in the latter half of the sequence, just like a language model. That would work fine: you take a long text, split it in two, give half of it to the encoder, and generate the second half with the decoder. But in practice, what works much better is this notion of span corruption. Span corruption is going to show up in your assignment five. The idea is a lot like BERT, but in a generative sense: I mask out a bunch of spans in the input — "Thank you [MASK1] me to your party [MASK2] week" — and at the output I generate each mask token followed by what was supposed to be there in its place: "[MASK1] for inviting [MASK2] last". What this does is let you have bidirectional context — I get to see the whole input sequence — while generating the missing parts as a sequence, like you would in language modeling (a small sketch of this format is below). So this might be good for something like machine translation, where I have an input that I want bidirectional context on, but then I want to generate an output, and I want to pretrain the whole thing. This was shown to work better than language modeling at the scales these folks at Google were able to test back in 2018, and it's still quite popular. There are a lot of numbers here; it works better than the other stuff; I'm not going to worry about it.
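A sketch of constructing one (input, target) pair in the span-corruption format just described. The sentinel names and the hand-picked span positions are placeholders chosen to reproduce the lecture's example.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index ranges to cut out.
    Returns (source with sentinel tokens, target that spells out the cut spans)."""
    source, target, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<MASK{i}>"
        source += tokens[prev:s] + [sentinel]
        target += [sentinel] + tokens[s:e]
        prev = e
    source += tokens[prev:]
    return source, target

words = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(words, spans=[(2, 4), (8, 9)])
# src: Thank you <MASK0> me to your party <MASK1> week
# tgt: <MASK0> for inviting <MASK1> last
```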
There's a fascinating property of these models, too. T5 was the model originally introduced with this, and with salient span masking: you can think of it as, at pretraining time, you saw a bunch of things like "Franklin D. Roosevelt was born in ___", and you generated the blank. And there's this task called open-domain question answering, which has a bunch of trivia questions, like "When was Franklin D. Roosevelt born?", and you're supposed to generate the answer as a string, just from your parameters. So you did a bunch of pretraining, you saw a bunch of text, and then you're supposed to generate these answers. What's fascinating is that this salient span masking method let you pretrain, then finetune on some example trivia questions, and then, when you tested on new trivia questions, the model would somehow implicitly extract from its pretraining data the answer to a new question it had never seen explicitly at finetuning time. So it learned this sort of implicit retrieval — sometimes; less than 50% of the time or whatever, but much more than random chance. And that's fascinating, right? You've learned to access latent knowledge that you stored up by pretraining. You just pass it the text "When was Roosevelt born?" and it passes out an answer. One thing to know is that the answers always look very fluent and very reasonable, but they're frequently wrong — and that's still true of things like ChatGPT. Okay, so that's encoder-decoder models. Next up we've got decoders, and we'll spend a long time on decoders. This is just our normal language model: I get a sequence of hidden states from my decoder, where the words can only look at themselves and the past, not the future, and then I predict the next word in the sentence. And here again, to do sentiment analysis, I can take the state for the last word and predict happy or sad from that last embedding, backpropagate the gradient to the whole network and train the whole thing, or do some kind of lightweight or parameter-efficient finetuning like we mentioned earlier. So this is pretraining a decoder, and I can pretrain it just on language modeling. You might want to do this if you want to generate things; you can use this much like you'd use an encoder-decoder. But in practice, as we'll see, a lot of the biggest, most powerful pretrained models tend to be decoder-only. It's not really clear exactly why, except that they seem a little bit simpler than encoder-decoders, and you get to share all the parameters in one big network for the decoder, whereas in an encoder-decoder you have to split them, some into the encoder and some into the decoder. So for the rest of this lecture we'll talk only about decoders; even today, the biggest networks do tend to be decoders. So, coming all the way back to 2018: the GPT model from OpenAI was a big success. It had 117 million parameters, 768-dimensional hidden states, and a vocabulary of 40,000-ish subwords defined via a method like the one we showed at the beginning of class, and it was trained on BooksCorpus. Actually, the name "GPT" never showed up in the original paper; it's unclear exactly what it's supposed to refer to. But this model was a precursor to all the things you're hearing about nowadays. If we move forward — oh yeah.
So, as this slide shows, if we wanted to do something like natural language inference, right, which says, you know, take these pairs of sentences, "the man is in the doorway", "the person is near the door", and say that one entails the other, that the premise entails the hypothesis, that I can believe the hypothesis if I believe the premise, I just concatenate them together, right? So give it maybe a start token, pass in one sentence, pass in some delimiter token, pass in the other, and then predict yes or no, entailment or not entailment. Fine-tuning GPT on this worked really well. And then, you know, BERT came after GPT. BERT did a bit better, had bidirectional context, but, you know, GPT did sort of an excellent job. And then came GPT-2, where they focused more on the generative abilities of the network. So we're looking now at a much larger network: we've gone from 117 million to 1.5 billion parameters. And given some sort of prompt, it could generate, at the time, a quite surprisingly coherent continuation to the prompt. So it's telling this sort of story about scientists and unicorns here. And this size of model is still small enough that you can use it on a small GPU and fine-tune it and whatever. And its capability of generating long, coherent text was just exceptional at the time, and it was also trained on more data, although I don't remember exactly, something like 9 billion words of text. And then after GPT-2, we come to GPT-3, sort of walking through these models, and we come to a different way of interacting with the models. So we've interacted with pretrained models in two ways so far: we've sampled from the distribution that they define, generating text via, like, a machine translation system or whatever, or we've fine-tuned them on a task that we care about and then taken their predictions. But GPT-3 seems to have an interesting new ability. It's much larger, and it can do some tasks without any sort of fine-tuning whatsoever. GPT-3 is much larger than GPT-2, right? So we went from GPT, 100-ish million parameters; GPT-2, 1.5 billion; to GPT-3, 175 billion, much larger, trained on 300 billion words of text. And this notion that it can figure out patterns in the example that it's currently seeing and continue the pattern is called in-context learning. So you've got the word "thanks", and I pass in this little arrow and say, okay, "thanks" goes to "merci". And then "hello" goes to "bonjour". And then, you know, I give it all of these examples and ask it what "otter" should go to. And it's learned to sort of continue the pattern and say what the translation of "otter" is. So now remember, this is a single input that I've given to my model. And I haven't said, oh, do translation, or fine-tuned it on translation or whatever. I've just passed in the input, given it some examples, and then it is able, to some extent, to do this seemingly complex task. That's in-context learning. And here are more examples: you know, maybe you give it examples of addition, and then it can do some simple addition afterward. You give it, in this case, examples of rewriting typos, and it can figure out how to rewrite typos; in-context learning for machine translation. And this was the start of this idea that there were these emergent properties that showed up in much larger models.
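A tiny sketch of what such an in-context learning prompt looks like as a single input string, using the translation example from the lecture; the actual model call is omitted.

```python
# Build a few-shot prompt: demonstrations followed by a query. No parameters are
# updated; a large decoder-only model is simply asked to continue this one string.
examples = [("thanks", "merci"), ("hello", "bonjour")]
query = "otter"

prompt = "\n".join(f"{src} -> {tgt}" for src, tgt in examples) + f"\n{query} ->"
print(prompt)
# thanks -> merci
# hello -> bonjour
# otter ->
```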
And it wasn't clear when looking at the smaller models that you'd get this sort of qualitatively new behavior out of them. Like, it's not obvious from just the language modeling signal, right? GPT-3 is just trained decoder-only, just predict the next word, and yet as a result of that training it learns to perform seemingly quite complex things as a function of its context. Yeah. Okay. One or two questions about that. This should be quite surprising, I think, right? Like, so far we've talked about good representations, contextual representations, meanings of words in context. This is some very, very high-level pattern matching, right? It's coming up with patterns in just the input data, that one sequence of text that you've passed it so far, and it's able to identify how to complete the pattern. And you think, what kinds of things can this solve? What are its capabilities? What are its limitations? It ends up being an open area of research, sort of, what are the kinds of problems that you maybe saw in the training data? Like, maybe GPT-3 saw a ton of pairs of words, saw a bunch of bilingual dictionaries, in its training data, so it learned to do something like this. Or is it doing something much more general, where it's really learning the task in context? You know, the actual story, we're not totally sure; something in the middle. It seems like it has to be tied to your training data in ways that we don't quite understand, but there's also a non-trivial ability to learn new, at least, types of patterns just from the context. So this is a very interesting thing to work on. Now, we've talked a lot about the size of these models so far. And as models have gotten larger, they've always gotten better, and we've trained them on more data, right? So GPT-3 was trained on 300 billion words of text, and it was 175 billion parameters. And, you know, at that scale it costs a lot of money to build these things, and it's very unclear whether you're getting the best use out of your money. Like, is bigger really what you should have been doing, in terms of the number of parameters? So, you know, the cost of training one of these is roughly: you take the number of parameters and you multiply it by the number of tokens that you're going to train it on, the number of words. And some folks at DeepMind (the citation is on the slide) realized through some experimentation that GPT-3 was actually just comically oversized, right? So Chinchilla, the model they trained, is less than half the size and works better, but they just trained it on way more data. And this is an interesting trade-off about, you know, how do you best spend your compute? I mean, you can't do this more than a handful of times even if you're, you know, Google, so there are open questions there as well. Another way of interacting with these networks that has come out recently is called chain of thought. So we saw in the in-context learning slide that the prefix can help specify what task you're trying to solve right now, and it can do even more. So here's standard prompting: we have a prefix of examples of questions and answers, so you have a question and then an example answer. So that's your prompt; that's specifying the task. And then you have a new question, you have the model generate an answer, and it generates it wrong.
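As a rough worked example of the "parameters times tokens" cost heuristic just mentioned: GPT-3's numbers are the ones from the lecture, while Chinchilla's (roughly 70B parameters and 1.4T tokens) are the commonly reported figures and should be treated as approximate.

```python
# Rough compute heuristic: training cost scales with (parameters * training tokens).
gpt3_cost = 175e9 * 300e9        # ~5.3e22 parameter-token products
chinchilla_cost = 70e9 * 1.4e12  # ~9.8e22, the same order of magnitude
print(f"GPT-3: {gpt3_cost:.1e}   Chinchilla: {chinchilla_cost:.1e}")
# The Chinchilla point: for a comparable budget of this product, a smaller model
# trained on many more tokens can outperform a larger, under-trained one.
```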
And chain-of-thought prompting says, well, how about in the example, in the demonstration, we give the question and then we give this sort of decomposition of steps toward how to get an answer, right? So I'm actually writing this out as part of the input; I'm giving annotations, as a human, to say, oh, you know, to solve this sort of word problem, here's how you could think it through, ish. And then I give it a new question, and the model says, oh, I know what I'm supposed to do: I'm supposed to first generate a sequence of intermediate steps, then say "the answer is", and then say what the answer is. And it turns out, and this should again be very surprising, that the model tends to generate plausible sequences of steps, and then much more frequently generates the correct answer after doing so, relative to trying to generate the answer by itself. So you can think of this as a scratch pad. You can think of this as increasing the amount of computation that you're putting into trying to solve the problem, sort of writing out your thoughts, right? As I generate each word of this continuation here, I'm able to condition on all the past words so far. And so maybe, yeah, it just allows the network to decompose the problem into smaller, simpler problems, each of which it's more able to solve. No one's really sure why this works exactly, either, at this point. With networks that are this large, their emergent properties are both very powerful and exceptionally hard to understand, and very hard, you should think, to trust. It's unclear what their capabilities are and what their limitations are, where they will fail. So what do we think pretraining is teaching? Gosh, a wide range of things, even beyond what I've written on this slide, which I mostly wrote two years ago. So it can teach you trivia, and syntax, and coreference, and maybe some lexical semantics, and sentiment, and some reasoning, like way more reasoning than we would have thought even three years ago. And yet they also learn and exacerbate racism and sexism, all manner of biases. More on this later. But the generality of this is really, I think, what's taken many people aback. And so increasingly, these objects are not just studied for the sake of using them, but studied for the sake of understanding anything about how they work and how they fail. Yeah. Any questions? Has anyone tried, like, benchmarking GPT on programming tasks, like how accurate it is, etcetera? Yeah, the question is, has anyone tried benchmarking GPT for programming tasks, anyone seen how well it does? Yes. So there are definitely examples of people using GPT-3 for simple programming things. And, you know, the modern state-of-the-art competitive programming bots are all based on ideas from language modeling, and I think they're all also based on pretrained language models themselves. Like, if you just take all of these ideas and apply them to, like, GitHub, then you get some very interesting emergent behaviors relating to code. And so, yeah, I think all of the best systems use this, more or less. There's lots of benchmarking there for sure. Is that the basis for GitHub Copilot? The question is, is what we just mentioned the basis for the GitHub Copilot system? Yes, absolutely. We don't know exactly what it is in terms of details, but it's all these ideas.
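A minimal sketch contrasting a standard few-shot prompt with a chain-of-thought prompt. The word problems here are invented for illustration and are not from the lecture slides.

```python
# Standard prompting: demonstration gives only question and final answer.
standard_demo = (
    "Q: Ali had 3 apples and bought 2 more bags of 4 apples each. How many apples?\n"
    "A: 11\n"
)
# Chain-of-thought prompting: demonstration also spells out the intermediate steps.
cot_demo = (
    "Q: Ali had 3 apples and bought 2 more bags of 4 apples each. How many apples?\n"
    "A: Ali starts with 3 apples. 2 bags of 4 apples is 8 apples. 3 + 8 = 11. "
    "The answer is 11.\n"
)
new_question = "Q: A shelf holds 5 books; someone adds 3 boxes of 6 books. How many books?\nA:"
# With the cot_demo prefix, the model tends to first write out steps and then
# "The answer is 23", which empirically is right far more often than answering directly.
```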
What if you have a situation where you have a large amount of general data for pretraining and you also have a large amount of data for your fine-tuning task? At what point is it better to train a new model for just that task versus, you know, using data from both? So, yeah, the question is, if you have a large amount of data for pretraining and a large amount of data for fine-tuning, when is it better to do a separate training on just the fine-tuning data? Almost never. If you have a bunch of data for the task that you care about, what's frequently done instead is three-part training, where you pretrain on a very broad corpus, then you continue to pretrain using something like language modeling on an unlabeled version of the labeled data that you have. You just strip the labels off, treat it all as text, do language modeling on that, adapt the parameters a little bit, and then do the final stage of fine-tuning with the labels that you want, and that works even better. There's an interesting paper called "Don't Stop Pretraining". Final question. That's a lot of questions. Someone needs to ask a question. Yeah, I was wondering, do you know if there are a lot of instances where a pretrained model can do some task it's not seen before, even without fine-tuning? Yeah. So, are there instances where a pretrained model can do a task that it hasn't seen before without fine-tuning? The question is, what does "hasn't seen before" mean, right? Like, these models, especially GPT-3 and similar very large models, during pretraining, did they ever see something exactly like this sort of word-problem arithmetic? Maybe, maybe not. It's actually sort of unclear. It's clearly able to recombine bits and pieces of tasks that it saw implicitly during pretraining. We saw the same thing with trivia, right? Like, language modeling looks a lot like trivia sometimes, where you just read the first paragraph of a Wikipedia page, and it's kind of like answering a bunch of little trivia questions about where someone was born and when. But it's never seen something quite like this, and it's actually still kind of astounding how much it's able to do things that don't seem like they should have shown up all that directly in the pretraining data. Quantifying that extent is an open research problem. Okay, that's it. Let's call it.

Latest Summary (Detailed Summary)

Generated 2025-05-15 22:40

Overview / Executive Summary

This lecture takes a deep look at the core concepts and methods of "pretraining" in natural language processing (NLP) and its transformative impact on the modern field. It first motivates subword modeling, which addresses the limitations of traditional fixed vocabularies when handling out-of-vocabulary words (UNK), morphological variation, and novel words, especially in morphologically rich languages. By splitting words into smaller, meaningful subword units, models generalize better and cope with lexical variety.

The heart of the lecture is full-model pretraining. Its motivation grows out of the success of word embeddings (e.g., Word2Vec), but the goal is to extend pretraining from word embeddings to all model parameters. The core idea is to learn deep linguistic structure and knowledge by reconstructing the input. The main approaches are: 1) decoder-only architectures, which typically use standard language modeling (predicting the next word), as in the GPT series; 2) encoder-only architectures, which use bidirectional context and predict masked-out words via masked language modeling (MLM), as in BERT; and 3) encoder-decoder architectures, which combine the strengths of both and suit sequence-to-sequence tasks, as in the span corruption objective used by T5.

The lecture emphasizes that pretrained models (e.g., BERT, the GPT series, T5), by learning from vast amounts of unlabeled text, acquire syntax, semantics, commonsense knowledge, and even some degree of reasoning. These pretrained models can then be fine-tuned on specific downstream tasks (such as sentiment analysis or machine translation) with little labeled data and reach excellent performance, far better than training from scratch. In recent years, very large pretrained models (e.g., GPT-3) have shown striking emergent abilities such as in-context learning and chain-of-thought prompting, performing complex tasks without any fine-tuning. Despite this enormous progress, challenges remain: models can learn and amplify biases present in their training data, and a full understanding of how they work is still an open research question.

Subword Modeling

  • Problem background:
    • Traditional models rely on a fixed, finite vocabulary; words that never appeared in training (e.g., typos, or a coinage like "transformerify") are typically mapped to a single unknown-word symbol (UNK token), losing information.
    • For morphologically rich languages (e.g., Swahili verbs have over 300 conjugated forms), creating a separate word vector for every form is neither efficient nor sensible, since the forms share a core meaning.
  • Solution: rather than trying to enumerate every "word", subword modeling decomposes words into sequences of known subword units.
    • Algorithm sketch (a runnable toy version follows at the end of this section):
      1. Start with all individual characters as the initial vocabulary.
      2. Iteratively find adjacent character or subword pairs that co-occur frequently in the corpus and merge each into a new subword unit added to the vocabulary.
      3. Repeat until the vocabulary reaches a preset size.
    • Effects:
      • Common words (e.g., "hat", "learn") end up as single subword units.
      • Rarer or more complex words (e.g., variants of "tasty") may be split into several subwords (e.g., "taa", "##sty", where "##" marks a piece joined to the previous subword without a space).
      • Novel or derived words (e.g., "transformerify") can be split into "transformer" and "##ify".
    • Advantages:
      • Effectively handles out-of-vocabulary words; almost any word can be represented.
      • Better captures word-internal structure and morphological information.
      • The model input becomes a sequence of subwords, each with its own embedding.
    • Applications: the technique was originally developed for machine translation and is now used in essentially every modern language model.
    • Handling multi-subword words: each subword is fed to the model as its own position in the sequence (one step of an RNN or Transformer). If a single representation of the whole word is needed, one can average the contextual representations of its subwords or take the last subword's representation.
    • Punctuation: punctuation marks are usually included in the character set and may form their own subword units (e.g., "..."). The text is processed as close to its raw form as possible, with minimal preprocessing.
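A runnable toy version of the merge-learning loop sketched above, in the spirit of byte-pair encoding; real tokenizers add many details this omits (word-boundary handling, "##" continuation markers, byte fallback, and so on).

```python
# Learn subword merges: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_merges(corpus_words, num_merges):
    # Represent each word as a tuple of its current symbols, starting from characters.
    vocab = Counter()
    for w in corpus_words:
        vocab[tuple(w)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, count in vocab.items():      # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += count
        vocab = merged_vocab
    return merges

print(learn_merges(["tasty", "taste", "tastier", "hat", "hats"], num_merges=5))
```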

Motivation and Core Ideas of Pretraining

  • From word embeddings to full-model pretraining:
    • Earlier methods (e.g., Word2Vec) pretrain word embeddings based on the distributional hypothesis: "You shall know a word by the company it keeps."
    • However, Word2Vec gives each word (e.g., "record") a single static vector and cannot distinguish its senses in different contexts (the noun "record" vs. the verb "to record").
    • The key shift: pretrain the entire model (the embedding layer plus the large network above it, such as a Transformer or LSTM), not just the word embeddings.
    • Goal: have the model learn richer linguistic knowledge that transfers to downstream tasks, reducing the dependence on labeled downstream data.
  • The core mechanism of pretraining: reconstructing the input
    • Working assumption: if a neural network must reconstruct the original input after parts of it have been hidden, it is forced to learn a great deal about language and the world in order to do the task well.
    • This is self-supervised learning: (input, label) pairs are created automatically from unlabeled text (a tiny sketch follows after the examples below).
    • Examples and what they can teach:
      • "Stanford University is located in [MASK]." -> predict "Palo Alto"; can teach geographic facts.
      • "I put [MASK] fork down on the table." -> predict "the" or "a"; can teach syntax (article usage).
      • "The woman walked across the street checking for traffic over [MASK] shoulder." -> predict "her"; can teach coreference.
      • "I went to the ocean to see the fish, turtles, seals and [MASK]." -> predict other sea creatures of the same kind; can teach semantic categories.
      • "Overall, the value I got from the two hours watching it was the sum total of the popcorn and drink. The movie was [MASK]." -> predict "bad"; can teach sentiment.
      • "Iroh went to the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the [MASK]." -> predict "kitchen"; can teach simple tracking of locations and physical situations.
      • Predicting the next Fibonacci number: "1, 1, 2, 3, 5, 8, 13, 21, [MASK]" -> predict "34"; can teach sequence patterns.
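In the spirit of the examples above, a tiny sketch of how self-supervised (input, label) pairs can be manufactured from raw text by hiding a single word; the helper name is ours.

```python
# Turn raw text into a cloze-style training example: no human labels required.
import random

def make_cloze_example(sentence: str):
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    label = tokens[i]          # the original word is the supervision signal
    tokens[i] = "[MASK]"
    return " ".join(tokens), label

corrupted, target = make_cloze_example("Stanford University is located in Palo Alto")
# e.g. ("Stanford University is located in [MASK] Alto", "Palo")
```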

The Pretraining-Finetuning Paradigm

  • A two-stage process (formalized briefly after this section):
    1. Pretraining:
      • Use large amounts of unlabeled text.
      • Train a large neural network with a self-supervised objective (e.g., language modeling).
      • The goal is to learn general-purpose linguistic representations and world knowledge.
      • The resulting parameters (θ_hat) serve as the initialization for the next stage.
    2. Finetuning:
      • Use a modest amount of labeled data for a specific downstream task (e.g., sentiment analysis, machine translation).
      • Starting from the pretrained parameters θ_hat, continue training the model by gradient descent on the task data.
      • This adapts the model to the needs of the specific task.
  • Why it works:
    • Pretraining provides a good parameter initialization, so fine-tuning converges faster and to better solutions.
    • It shifts the burden of learning general linguistic knowledge from limited labeled data onto massive unlabeled text.
    • Unlabeled text vastly outnumbers labeled data (on the order of trillions of words versus millions).
    • Pretraining on diverse text gives the model better generalization to patterns not seen in the task's training data.
    • Even with plenty of task-specific labeled data, pretraining usually still helps. A further refinement is three-stage training: generic pretraining -> domain-adaptive pretraining (continued pretraining on the task's unlabeled text) -> task fine-tuning.
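A compact way to write the two stages described above (the notation is ours, matching the θ_hat used in the list):

```latex
\hat{\theta} \;\approx\; \arg\min_{\theta}\; \mathcal{L}_{\text{pretrain}}(\theta)
  \quad\text{on unlabeled text,}
\qquad\text{then}\qquad
\theta^{*} \;\approx\; \arg\min_{\theta}\; \mathcal{L}_{\text{finetune}}(\theta)
  \quad\text{on labeled task data, with } \theta \text{ initialized at } \hat{\theta}.
```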

Three Main Architectures for Model Pretraining

The lecture walks through pretraining methods for each of the three main Transformer configurations: encoders, decoders, and encoder-decoders.

1. Encoders - e.g., BERT

  • Characteristics: process the input with bidirectional context.
  • Pretraining challenge: standard left-to-right language modeling cannot be used directly, because bidirectionality lets the model "see" future words and the task becomes trivial.
  • Solution: masked language modeling (MLM)
    • Core idea: randomly mask out some of the words in the input sequence and have the model predict them.
    • The model learns P(original document X | corrupted document X_tilde).
    • The loss is computed only over the masked positions.
  • BERT (Bidirectional Encoder Representations from Transformers):
    • Its input representation is the sum of token embeddings, position embeddings, and segment embeddings.
    • MLM details (see the masking sketch at the end of this section):
      • Randomly select 15% of the subword tokens.
      • For each selected token:
        • With 80% probability, replace it with the special "[MASK]" token.
        • With 10% probability, replace it with a random token from the vocabulary.
        • With 10% probability, keep it unchanged.
      • Purpose: force the model to learn good contextual representations of every token, not just the masked ones, since it cannot know which tokens it will be asked to predict or which may have been corrupted.
    • Next Sentence Prediction (NSP) [used in the original paper, though later work showed it is not necessary]:
      • Given a sentence pair (A, B), decide whether B is the sentence that follows A in the original text.
      • A special [CLS] token is prepended to the input and its final representation is used for the binary classification.
      • Intent: teach the model inter-sentence relationships, believed to help downstream tasks such as question answering and natural language inference.
      • Later findings: work such as RoBERTa showed that NSP may be unnecessary or even harmful (e.g., it halves the effective context length), and models did not perform especially well on the task anyway.
    • Impact: BERT was a "sea change" for NLP; the pretrain-then-finetune recipe far surpassed previous task-specific, hand-engineered models across many NLP tasks.
    • BERT model sizes:
      • BERT-Base: 110M parameters
      • BERT-Large: 340M parameters
      • Training data size: the lecturer was unsure of the exact figure, noting that the initial estimate of several hundred million words (e.g., 800M) might be off, and ultimately describing it as "well under a billion words but still substantial."
    • BERT's limitation: excellent at understanding and filling in blanks, but not naturally suited to text generation.
  • Improvements and extensions of BERT:
    • RoBERTa: showed NSP is unnecessary and that BERT was undertrained (more data and longer training help). A drop-in replacement for BERT.
    • Span masking (e.g., SpanBERT): mask contiguous spans of subwords rather than isolated random subwords; this harder task yields better performance, since predicting a single subword can be too easy.
  • Parameter-efficient fine-tuning / lightweight fine-tuning:
    • Motivation: fine-tuning all parameters can pull the model away from the well-generalizing region found by pretraining, and it is expensive.
    • Approach: freeze most pretrained parameters and tune only a small subset, or add a small number of new parameters.
      • Prefix tuning / prompt tuning: freeze the whole pretrained model and prepend a few trainable "virtual token embeddings" to the input; only these are trained.
      • LoRA (Low-Rank Adaptation): freeze the pretrained weight matrix W and learn a low-rank difference ΔW (expressed as a product of two small matrices A*B); the effective weight is W + ΔW, and only A and B are trained.
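Returning to the MLM details above, here is a minimal sketch of the 15% selection and 80/10/10 corruption recipe, assuming integer token ids; the use of -100 as the "ignore this position" label is a common convention, and the function name is ours.

```python
# BERT-style masking: corrupt ~15% of positions and record labels only there.
import random

def bert_style_mask(token_ids, vocab_size, mask_id, select_prob=0.15):
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:
            labels.append(tok)                            # loss is computed only here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token unchanged
        else:
            labels.append(-100)                           # ignored position
    return inputs, labels

corrupted, targets = bert_style_mask([12, 845, 9, 3021, 7], vocab_size=30000, mask_id=103)
```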

2. Encoder-Decoders - e.g., T5

  • Characteristics: the encoder processes the input with bidirectional context; the decoder generates the output with unidirectional context. Naturally suited to sequence-to-sequence tasks.
  • Pretraining methods:
    • Language-modeling style: split a long text in half, feed the first half to the encoder, and have the decoder generate the second half.
    • Span corruption, used by T5 (a small pair-construction sketch follows after this section):
      • Core idea: randomly choose contiguous spans of the input text and replace each with a distinct sentinel mask token (e.g., <MASK_TOKEN_1>, <MASK_TOKEN_2>).
      • The decoder's job is to emit each sentinel token in turn, followed by the original text it replaced.
      • Example input: "Thank you [MASK_TOKEN_1] me to your party [MASK_TOKEN_2] week."
      • Example output: "[MASK_TOKEN_1] for inviting [MASK_TOKEN_2] last."
      • Advantage: combines the spirit of BERT-style masking (bidirectional understanding of the input) with a generative training signal.
  • T5 (Text-to-Text Transfer Transformer):
    • Casts every NLP task in a unified "text-to-text" format.
    • Pretrained with span corruption.
    • An intriguing finding: after pretraining and fine-tuning on a small set of question-answer examples, T5 can answer factual questions it never saw during fine-tuning but may have encountered indirectly in its huge pretraining corpus (open-domain question answering), implicitly "retrieving" knowledge from its parameters. The generated answers are fluent but frequently wrong.
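A minimal sketch of constructing a span-corruption (input, target) pair as described above. The sentinel naming follows T5's <extra_id_k> convention; the span boundaries are hard-coded for clarity rather than sampled.

```python
# Replace chosen spans in the input with sentinels; the target re-emits each
# sentinel followed by the text that was removed.
def span_corrupt(tokens, spans):
    # spans: non-overlapping, sorted list of (start, end) index ranges to hide
    inp, tgt, cursor = [], [], 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp.extend(tokens[cursor:s]); inp.append(sentinel)
        tgt.append(sentinel); tgt.extend(tokens[s:e])
        cursor = e
    inp.extend(tokens[cursor:])
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens, [(2, 4), (8, 9)]))
# ('Thank you <extra_id_0> me to your party <extra_id_1> week',
#  '<extra_id_0> for inviting <extra_id_1> last')
```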

3. Decoders - e.g., the GPT series

  • Characteristics: only left-to-right (unidirectional) context is available.
  • Pretraining method: standard language modeling, usually trained with teacher forcing (a short loss sketch follows after this section)
    • Given the preceding words, predict the next word.
    • The training objective maximizes the conditional probabilities P(w_t | w_1, ..., w_{t-1}) across the sequence.
  • The GPT (Generative Pre-trained Transformer) series:
    • GPT (the original): 117M parameters. Showed that a large decoder pretrained with language modeling, then fine-tuned, performs well across downstream tasks (natural language inference, question answering, text similarity, classification).
    • GPT-2: 1.5B parameters. Focused more on generation, producing long texts that were remarkably coherent for the time. Trained on more data (roughly 9 billion words).
    • GPT-3: 175B parameters, trained on 300 billion words.
      • In-context learning / few-shot learning: GPT-3's signature ability. Without updating any weights (i.e., no fine-tuning), the model can pick up a task pattern from a handful of demonstrations placed in the prompt and apply it to new inputs.
        • For example, given "thanks -> merci" and "hello -> bonjour", then asked "water -> ?", the model outputs "eau".
        • This suggests the model has learned a highly abstract kind of pattern matching and task generalization from massive data. The exact mechanism is an active research question; it may hinge on similar patterns appearing in the training data, or it may reflect a more general in-context learning ability.
      • Emergent abilities: capabilities that are absent or negligible in smaller models but appear once the model reaches a certain scale; in-context learning is the canonical example.
  • Uses of decoder models: naturally suited to text generation. Many of today's largest and strongest models are decoder-only. The reasons are not fully understood; they may simply be structurally simpler, with all parameters concentrated in one large network.
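A minimal sketch of the teacher-forced next-token objective mentioned above, assuming the decoder has already produced a logit for every position; shapes and names are illustrative.

```python
# Next-token loss with teacher forcing: the target at position t is token t+1.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size) from a decoder that only attends leftward
    shifted_logits = logits[:, :-1, :]   # predictions made at positions 0..T-2
    targets = token_ids[:, 1:]           # the "next word" each position should predict
    return F.cross_entropy(
        shifted_logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

# Toy check with random scores standing in for real decoder output:
loss = next_token_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)))
```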

Very Large Models and Advanced Concepts

  • Model scale and performance:
    • The general trend: bigger models trained on more data perform better.
    • Chinchilla scaling laws: DeepMind's work (the Chinchilla model) showed that for a given compute budget, some earlier models (e.g., GPT-3) had too many parameters relative to the amount of data they were trained on. Chinchilla has fewer parameters than GPT-3 (about 70B vs. 175B) but, trained on far more data, outperforms it. This reveals an optimal trade-off between model size and training-data volume.
  • Chain-of-thought prompting (CoT):
    • Standard prompting: pose the question directly and expect the model to output the answer.
    • CoT prompting: in the demonstrations, show not just the question and final answer but also the step-by-step reasoning that leads to the answer.
    • When the model then handles a new question, it first generates similar reasoning steps before giving its final answer.
    • Effect: large language models improve markedly on complex reasoning tasks (e.g., math word problems, commonsense reasoning).
    • Hypothesized reason: generating intermediate steps acts as a "scratch pad", buying the model more computation and decomposing a hard problem into simpler subproblems. This too is an emergent ability that is far weaker in smaller models.

What Does Pretraining Teach a Model?

  • Many kinds of knowledge and ability:
    • Factual knowledge / trivia
    • Syntax
    • Coreference
    • Lexical semantics
    • Sentiment
    • Some degree of reasoning, especially when aided by techniques such as CoT
    • Pattern matching and generalization, as demonstrated by in-context learning
    • Programming, by pretraining on code as another kind of language (e.g., GitHub Copilot)
  • Open problems and risks:
    • Bias amplification: models learn and can amplify the social biases present in their training data (e.g., racism, sexism).
    • Reliability and trust: models may generate plausible-sounding but wrong or meaningless content ("hallucinations"), and it is hard to delineate where their abilities end and where they fail.
    • Interpretability: very large models behave like black boxes, and their internal decision processes are not fully understood.
  • Research significance: these pretrained models are not only tools but also objects of study for probing the nature of learning, intelligence, and language.

Conclusion and Outlook

Pretraining, especially with large Transformer-based language models, has profoundly reshaped natural language processing. From early word embeddings to whole-model pretraining (BERT, the GPT series, T5) and on to the striking in-context learning and chain-of-thought abilities of very large models, the boundaries of what NLP can do keep expanding. Despite this success, important directions remain open: using compute effectively, mitigating bias, improving interpretability and reliability, and exploring more general learning mechanisms. How far pretrained models truly "understand" language and the world, and how they generalize to unseen tasks and patterns, are the central questions for continued research.