Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
This lecture covers the shift in natural language processing from recurrent neural networks (RNNs) to models built on self-attention and the Transformer architecture.
It begins by reviewing the earlier NLP recipe of encoding with a bidirectional LSTM and decoding with a unidirectional LSTM plus attention, and points out its limitations for long-distance dependencies and parallel computation. RNNs suffer from a "linear interaction distance" problem: because information must be passed along word by word, it is hard for the model to capture dependencies between distant words, and gradients struggle to propagate across them. In addition, RNN computation is inherently sequential, so it cannot fully exploit GPU parallelism, and efficiency degrades as sequences get longer.
The lecture then argues that although attention had previously been combined with RNNs to relieve the information bottleneck, the new paradigm adopts attention much more thoroughly, using self-attention to replace recurrence entirely. Self-attention lets every word in a sentence attend directly to every other word, which captures long-distance dependencies better and allows highly parallel computation. The lecture goes on to examine how self-attention works and how it is used in the Transformer.
It also includes course announcements: new, more detailed lecture notes have been released; Assignment 4 is due in a week, and because of Azure GPU access issues students are advised to train their models on Colab; feedback on final project proposals is about to be released.
Tags
Media details
- Upload date
- 2025-05-15 21:31
- Source
- https://www.youtube.com/watch?v=LWMzyfvuehA
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hi everyone. Welcome to CS224N. We're about two minutes in, so let's get started. Today we've got what I think is quite an exciting lecture topic: self-attention and Transformers. These ideas are the foundation of most of the modern advances in natural language processing, and really of AI systems across a broad range of fields, so it's a very fun topic. Before we get into that, a couple of reminders. There are brand new lecture notes - thank you, yes, I'm very excited about them. They pretty much follow what I'll talk about today, but go into considerably more detail. Assignment 4 is due a week from today. The issues with Azure continue; thankfully, our TAs have tested that the assignment works on Colab, and the amount of training is such that a Colab session will let you train your machine translation system. So if you don't have a GPU, use Colab. We're continuing to work on getting access to more GPUs for Assignment 5 and the final project, and we'll keep updating you as we're able to, but the usual arrangements this year are no longer holding because companies are changing their minds about things. For the final project proposal: you propose what you want to work on, and we give you feedback on whether we think it's a feasible idea or how to change it. This is very important, because we want you to work on something that has a good chance of success for the rest of the quarter. That's going out tonight - we'll post an Ed announcement when it's out - and we want to get you feedback quickly, because after Assignment 5 is done, the major core component of the course is the final project. Any questions? Cool. So let's look back at what we've done so far in this course and what our strategy was in natural language processing. If you had an NLP problem and wanted to take your best attempt at it without doing anything too fancy, you would have said: I'm going to use a bidirectional LSTM, rather than a simple RNN, to encode my sentences, so I get bidirectional context. And if I have an output to generate, I'll use a unidirectional LSTM to generate it one step at a time - a translation, a parse, whatever. So maybe I've encoded the source sentence with a bidirectional LSTM, and I'm decoding the target one token at a time with my unidirectional LSTM. And then I'd use something like attention to give flexible access to memory, whenever I need to look back and see where to translate from. This worked exceptionally well. We motivated attention through machine translation, where there's a bottleneck: you don't want to have to encode the whole source sentence into a single vector. In this lecture we have the same goal. We'll look at many of the same problems as before, but with different building blocks.
We're going to say: from roughly 2014 to 2017, we used recurrence; in the years since, through a lot of trial and error, we got brand new building blocks that can be plugged in as a direct replacement for LSTMs, and they allow a huge range of much more successful applications. So what are the issues with the recurrent neural networks we used to use, and what are the new systems we'll use from this point forward? One issue with a recurrent neural network is what we'll call linear interaction distance. As we know, RNNs are unrolled left to right or right to left, depending on the language and the direction. That encodes a notion of linear locality, which is useful, because words that occur right next to each other are often actually related: "tasty pizza" - they're nearby. In the recurrent network you encode "tasty", then you take one step and encode "pizza", so nearby words do affect each other's meanings. But very long-distance dependencies can take a very long time to interact. If I have a sentence like "the chef who went to the stores and picked up the ingredients and loves garlic and ... was", then between "chef" and "was" I apply an RNN step - the recurrent weight matrix and some element-wise nonlinearities - once, twice, three times, potentially as many times as the length of the sequence between them. And it is the chef who was: that's a long-distance dependency, which should feel related to what we did with dependency syntax. But with that many steps in between, it's quite difficult to learn that these words should be related. We've talked about the gradient problems; LSTMs do much better at carrying gradients across long distances than simple recurrent networks, but they're not perfect. And we already know that linear order isn't really the right way to think about sentences. So if I want to learn that it's the chef who was, I might have a hard time, because the gradients have to propagate from "was" all the way back to "chef". Really, I'd like a more direct connection between words that might be related in a sentence or a document, even as those get much longer. That's the linear interaction distance problem: we'd like related words to be able to interact in the network's computation graph more easily than by being linearly far apart, so we can learn long-distance dependencies better. There's a related problem, which again comes from the recurrent network's dependence on the index into the sequence - often called a dependence on time. In a recurrent neural network, the forward and backward passes have O(sequence length) unparallelizable operations. We know GPUs are great.
They can do a lot of operations at once, as long as there's no dependency in time between the operations - where you have to compute one before the other. But in a recurrent neural network, you can't compute the hidden state for time step five before you've computed the hidden state for time step four, or time step three. So you get a picture where, for the first word, there are zero operations I need to do before I can compute its state; one operation before the next state; and by the time I'm at state number three, there are three unparallelizable operations I need first, because I need each of the previous states - where I'm lumping all the matrix multiplies into a single operation. And of course this grows with the sequence length: as the sequence gets longer, I can't parallelize; I can't just have a big GPU crunch one big matrix multiply to compute a state, because I need all the previous states beforehand. Any questions about that? So these are two related problems, both tied to the dependence on time.
speaker 2: I have a question on the linear interaction issue. I thought that was the whole point of the attention mechanism - if you want words to depend on each other more directly during training, can't we do something like attention?
speaker 1: So the question is: with the linear interaction distance, wasn't this exactly the point of attention - can't we use attention to get around it? Yes, though on its own it won't solve the parallelizability problem as long as we keep the recurrence. In fact, everything we do in the rest of the lecture will be attention-based, but we'll get rid of the recurrence and just use attention, more or less. It's a great intuition. Any other questions? Okay, cool. So, if not recurrence, what about attention? We'll get deep into attention today, but for the moment: attention treats each word's representation as a query to access and incorporate information from a set of values. Previously, we were in a decoder, decoding out a translation, and we attended to the encoder so we didn't have to store the entire representation of the source sentence in a single vector. Today we'll think about attention within a single sentence. I've got a sentence written out here, word 1 through word T, and in these boxes I'm writing the number of unparallelizable operations you need before you can compute each representation. For each word, you can compute its embedding independently, without doing anything else first, because the embedding depends only on the word identity. And then with attention, if I want to build an attention-based representation of a word by looking at all the other words in the sequence, that's one big operation, and I can do it in parallel for every word - I don't need to walk left to right as I did for an RNN. Again, we'll get much deeper into this.
But you should already have the intuition that this solves both the linear interaction distance problem and the unparallelizability problem. Now, no matter how far apart two words are, one can potentially attend directly to the other, independent of distance; and I don't need to walk along the sequence linearly - I treat the whole sequence at once. So the intuition is that attention lets you look very far away at once, and it doesn't have the dependence on the sequence index that keeps us from parallelizing operations. The rest of the lecture goes into attention in much more depth, so let's move on. One way to think about attention is that it performs a fuzzy lookup in a key-value store. In an actual lookup table - a dictionary in Python, say - you have a table of keys, each key maps to a value, you give it a query, the query matches exactly one key, and you return that value. Simple. In attention, the query matches all keys softly; there's no exact match. You compute a similarity between the query and every key, normalize those similarities between zero and one with a softmax, and then average the values with those weights - a weighted sum - and that's your output. So it really is a lot like a lookup table, but in a soft, vector-space sense: you're accessing the information stored in the key-value store, but softly looking at all of the entries at once. Any questions there? Cool. So what might this look like? If I'm trying to represent the sentence "I went to Stanford CS224N and learned", and I'm building a representation of "learned", I have a key for each word - this is the self-attention idea we'll get into - a value for each word, and the query for "learned". These teal bars up top indicate how much I access each word: maybe "224N" isn't that important, "CS" maybe determines what I learned, "Stanford" matters, and "learned" itself may be important for representing itself. So you look across the whole sentence and build up this soft access of information in order to represent "learned" in context. That's just a toy diagram, so let's get into the math. We're going to look at a sequence of words w_1, ..., w_n, each drawn from a vocabulary V - something like "Zuko made his uncle tea".
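Before the full key-query-value math, here is a minimal NumPy sketch of the fuzzy-lookup picture above: a single query softly matches all keys and returns a weighted average of the values. The function and variable names here are illustrative, not from the course code.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Attention as a soft key-value lookup: score every key against the query,
    turn the scores into weights with a softmax, and return the weighted
    average of the values (instead of returning exactly one matched value)."""
    scores = keys @ query                     # one similarity score per key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ values                   # weighted sum of the values

# Toy example: 4 key/value pairs and one query, all of dimensionality 3.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
values = rng.normal(size=(4, 3))
query = rng.normal(size=3)
print(soft_lookup(query, keys, values))       # a single 3-dimensional output
```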
For each word, we embed it with an embedding matrix, just as we've been doing in this class: E is a matrix in R^{d x |V|}, mapping from the vocabulary to dimensionality d, so each word gets a non-contextual embedding x_i = E w_i that depends only on the word itself. Now I transform each word with one of three different weight matrices; this is often called key-query-value self-attention. I have a matrix Q in R^{d x d}, which maps x_i, a vector of dimensionality d, to another vector of dimensionality d: that's the query vector q_i = Q x_i. It takes x_i and rotates it, stretches it, squishes it - makes it different; now it's a query. With a different learnable matrix K I get my keys k_i = K x_i, and with a different learnable matrix V, my values v_i = V x_i. So I take each non-contextual word embedding and turn it into a query, a key, and a value for that word - every word plays all three roles. Next, I compute all pairs of similarities between keys and queries. In the toy example we just saw, I computed the similarity between a single query, for the word "learned", and all of the keys for the sentence. Here I compute all pairs of similarities between all queries and all keys, because I want to build these representations for every word. I just take the dot product: e_ij = q_i · k_j, the query for word i dotted with the key for word j. That score is a real number - large and negative, zero, or large and positive - and it says how much word i should look at word j in this lookup. Then I take the softmax: alpha_ij = softmax_j(e_ij), the affinity between i and j normalized over all the possible j' in the sequence. And my output for word i is just the weighted sum of values: output_i = sum_j alpha_ij v_j. So maybe i is 1, for "Zuko", and I represent it as the sum, over all j - "Zuko", "made", "his", "uncle", "tea" - of the weight alpha_ij times the value vector for word j: I look from i to j exactly as much as alpha_ij says. Oh, and w_i here: you can think of it either as a symbol in the vocabulary V or as a one-hot vector of dimensionality |V|. In the matrix E, you can see it's R^{d x |V|}, where |V| is the vocabulary size, so E times w_i takes a d-by-|V| matrix times a |V|-dimensional vector and returns a vector of dimensionality d. And w_{1:n} - is that a matrix with an entry of length |V| for every word in the sentence? Yes; since I'm putting the sequence-length index first, you'd usually think of it as having a row for each word, so the first dimension is n, the sequence length,
and the second dimension is |V|, the vocabulary size; that then gets mapped to something of shape sequence length by d.
speaker 2: Why do we learn two different matrices Q and K, when q_i^T k_j = x_i^T Q^T K x_j is really just one matrix in the middle?
speaker 1: That's a great question. It ends up being because this will be a low-rank approximation to that matrix - it's done for computational efficiency, although I also think it reads nicely in the presentation. What we end up with is a very low-rank approximation to Q^T K, and yes, you really do implement it this way. It's a good question.
speaker 2: What does the query of a word dotted with its own key look like - does it look like an identity, or anything in particular?
speaker 1: Let me remember to repeat questions. So: does e_ii - the score of a word with itself - look like anything in particular, like the identity? It's actually unclear whether you should look at yourself when representing yourself, and it gets decided by the matrices Q and K. If Q and K were both the identity, this would be a dot product of a vector with itself, which is high on average - you point in the same direction as yourself. But Q x_i and K x_i can be arbitrarily different from each other: Q could be the identity while K maps you to your own negative, for example, so that you don't look at yourself. This is all learned in practice, so the model can decide whether to look at itself or not, and that's some of the flexibility that parameterizing things with Q and K gives you, which wouldn't be there if I just used the x_i's everywhere in this equation. I'm going to try to move on, I'm afraid, because there's a lot to get through, but we'll keep talking about self-attention, and as more questions come up I can come back to this. Okay. So this is our basic building block, but there are several barriers to using it as a replacement for our LSTMs. For this portion of the lecture we'll talk about the minimal components we need in order to use self-attention as that fundamental building block. We can't use it exactly as I've presented it; there are a few things to fix. One is that there's no notion of sequence order in self-attention. What does that mean? Let me move over to the whiteboard briefly, and hopefully write quite large. If I have the sentence "Zuko made his uncle", and also "his uncle made Zuko": when I embed each of these words with the embedding matrix, the embedding doesn't depend on the index of the word - whether "his" is at position three or position one. So when I compute self-attention - and the lecture notes go through a full example of this - the self-attention operation gives you exactly the same representations for the sequence "Zuko made his uncle" as for the sequence "his uncle made Zuko".
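A small NumPy sketch of these per-word equations, which also demonstrates the order problem just described: permuting the input words only permutes the outputs, so both orderings give each word the same representation. The random matrices and all names here are illustrative, not the course's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Q, K, V):
    """Key-query-value self-attention over stacked embeddings X (n x d):
    e_ij = (Q x_i) . (K x_j), alpha = softmax_j(e_ij), out_i = sum_j alpha_ij (V x_j)."""
    queries, keys, values = X @ Q.T, X @ K.T, X @ V.T
    scores = queries @ keys.T            # (n, n) all-pairs scores e_ij
    alpha = softmax(scores)              # normalize over j
    return alpha @ values                # weighted sums of the value vectors

n, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))              # embeddings for "Zuko made his uncle"
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, Q, K, V)
perm = [3, 2, 1, 0]                      # "uncle his made Zuko"
out_perm = self_attention(X[perm], Q, K, V)
# Each word gets the same representation regardless of word order:
print(np.allclose(out[perm], out_perm))  # True
```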
And that's bad, because those are sentences that mean different things. Self-attention is really an operation on sets: you have a set of vectors you perform self-attention on, and the exact positions of the words never come into play directly. So we're going to encode the position of words through the keys, queries, and values. Consider representing each sequence index - our sequences go from 1 to n - as a vector p_i of dimensionality d, just like our keys, queries, and values. Don't worry yet about how these are made; they are position vectors. If you want to incorporate the information they carry into self-attention, you can just add these p_i vectors to the inputs: if x_i is the embedding of the word at position i, which by itself only says "the word Zuko is here", now I can say it's the word Zuko and it's at position five, because this vector represents position five. And we may only have to do this once, at the very input to the network; that turns out to be sufficient - we don't have to do it at every layer, because the network knows it from the input. One way people have done this is with sinusoidal position representations. The vector p_i, of dimensionality d, is built by taking the index i, modifying it by a constant, and passing it through sine and cosine functions, so that each dimension varies with a different period. If you plot this as a matrix with d on the vertical axis and n on the horizontal axis, you can see the period of the sine going up and down as you walk along the positions, with a different period in each dimension; together they can represent many different position indices. This also suggests the intuition that maybe the absolute position of a word isn't what matters - you've got the periodicity of the sines and cosines, and maybe that lets you extrapolate to longer sequences - though in practice that doesn't really work. Still, it's an early approach that is sometimes used for representing position in Transformers and self-attention networks in general. You might find it a bit complicated and unintuitive; here's something that feels more like standard deep learning. We just say: I have a maximum sequence length n, and I'm going to learn a matrix of dimensionality d by n, as a parameter like every other parameter, and it represents my positions. What do the entries mean? No idea - but it represents position. You add this matrix to the x's, your input embeddings, and it learns to fit the data: whatever linear, index-based representation of position is useful, it can learn.
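As a sketch, here are both position schemes just described, assuming an even model dimensionality d; the 10000 constant in the sinusoidal version follows the original Transformer paper, and the helper names are mine, not the lecture's.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal position vectors p_1..p_n in R^d: even dimensions use sin,
    odd dimensions use cos, each pair with a different period."""
    pos = np.arange(n)[:, None]                    # (n, 1) position indices
    dim = np.arange(d // 2)[None, :]               # (1, d/2) dimension pairs
    angles = pos / (10000.0 ** (2 * dim / d))
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

n, d = 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                        # word embeddings x_1..x_n
X_in = X + sinusoidal_positions(n, d)              # add position info once, at the input

# The learned alternative: treat the position matrix itself as a parameter,
# trained like any other weight; it cannot handle sequences longer than n.
P_learned = 0.01 * rng.normal(size=(n, d))
X_in_learned = X + P_learned
```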
The cons of this learned approach are that you now definitely can't represent anything longer than n words; no sequence longer than n can be handled, because you only learned a matrix with that many positions. In practice, if you pass a self-attention model something longer than length n, you'll just get a model error - it will fail and say it can't do it. And this is what most systems nowadays use. There are more flexible representations of position, including a couple in the lecture notes you might want to look at, such as relative linear position - how far words are before or after each other - rather than absolute position, and representations that harken back to our dependency syntax, on the idea that words that are close in the dependency parse tree should maybe be the ones that are close in the self-attention operation. Questions?
speaker 2: In practice, do we typically just make n large enough that we never run into an input longer than that?
speaker 1: The question is: in practice, do we just make n long enough that we never see a text longer than n? No - in practice it's actually quite a problem even today, even in the largest language models. "Can I fit this prompt into ChatGPT?" is something you might see on Twitter; these continue to be issues. Part of it is that the self-attention operation - and we'll get into this later in the lecture - is quadratic in the sequence length, so you spend O(n^2) memory to make sequence lengths longer. In practice, on a large model, n might be around 4,000, so you can fit roughly 4,000 words, which feels like a lot, but it won't fit a novel or a Wikipedia page. There are models that handle longer sequences, and we'll talk a bit about that, but yes, this is a real issue.
speaker 2: How do you know that the P you learn actually represents position, as opposed to anything else?
speaker 1: How do you know this learned matrix represents position rather than something else? The reason is that position is the only thing it correlates with. I'm adding this P matrix to my X matrix, the word embeddings. The words that show up at each index vary from example to example, but the P matrix never differs - it's exactly the same at every index for every example, for every gradient update - and nothing else co-occurs like that, so you learn it implicitly. Exactly what it ends up encoding is unclear, but it definitely lets the model know which index a word is at.
speaker 2: Just quickly, when you say the computation is quadratic in the sequence - is that a sequence of words, or sentences, or what?
speaker 1: The question is: when this is quadratic in the sequence, is that a sequence of words? Yes, think of it as a sequence of words - sometimes there will be pieces smaller than words, which we'll go into in the next lecture.
But yes, think of it as a sequence of words - and not necessarily just one sentence; it might be an entire paragraph or an entire document.
speaker 2: And the attention is from words to words?
speaker 1: Yes. Okay, cool, I'm going to move on. So we have another problem: in the presentation of self-attention so far, there are no nonlinearities for the usual deep learning magic - we're just computing weighted averages of things. If I apply self-attention, and then apply it again, and again and again - look at the lecture notes if you're interested, it's actually quite cool - what I end up doing is just re-averaging value vectors: I compute averages of averages of value vectors, and it ends up looking like one big self-attention. But there's an easy fix if you want the traditional deep learning magic: add a feed-forward network to post-process each output vector. I take a word vector that's the output of self-attention and pass it through a multi-layer perceptron (MLP): it takes a vector in R^d as input, returns a vector in R^d, and does the usual thing - multiply by a matrix, pass through a nonlinearity, multiply by another matrix. What this looks like: I've got a sentence ("the chef who ..."), I've got my embeddings for it, I pass them through the whole self-attention block, which looks at the entire sequence and incorporates context, and then I pass each output individually through a feed-forward layer. So the vector that comes out of self-attention for the word "the" is passed independently through a multi-layer perceptron, and you can think of that as combining together, or processing, the result of attention. There are several reasons we do this; one is that you can stack a ton of computation into these feed-forward networks very efficiently - very parallelizable, very good for GPUs. This is what's done in practice: self-attention, then a position-wise feed-forward layer, where every word is processed independently by the same feed-forward network. So that adds the classical deep learning nonlinearity, and it's an easy fix for the no-nonlinearities problem in self-attention.
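A sketch of that position-wise feed-forward step. The hidden size d_ff and the choice of ReLU are assumptions for illustration; the lecture only specifies "matrix, nonlinearity, matrix".

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every position independently:
    H is (n, d) self-attention outputs; W1 is (d, d_ff); W2 is (d_ff, d)."""
    hidden = np.maximum(0.0, H @ W1 + b1)   # matrix multiply, then a ReLU nonlinearity
    return hidden @ W2 + b2                 # second matrix multiply, back to dimensionality d

n, d, d_ff = 5, 8, 32
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))                 # one output vector per word from self-attention
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(position_wise_ffn(H, W1, b1, W2, b2).shape)   # (5, 8): each word processed independently
```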
Then we have one last issue before we have our final minimal self-attention building block that can replace RNNs. In all the examples of self-attention I've written out, you can look at the entire sequence. In practice, for tasks like machine translation or language modeling - whenever you want to define a probability distribution over a sequence - you can't cheat and look at the future. At every time step I could define the set of keys, queries, and values to include only past words, but that's inefficient - bear with me - because you can't parallelize it well. So instead we compute the entire n-by-n matrix of scores, just as in the slide where we discussed self-attention, and then we mask out words in the future. The score e_ij - computed for all n-by-n pairs of words - is equal to whatever it was before if the word you're looking at, index j, is at an index less than or equal to where you are, index i; and it's negative infinity otherwise, if j is in the future. When you softmax the e_ij, negative infinity gets mapped to zero, so the attention weight on the future is zero: I can't look at it. What does this look like? To encode the words "The chef who", plus a start symbol, I could look at all pairs of words, and then I gray out - set to negative infinity - the words I can't look at. Encoding the start symbol, I can only look at the start symbol. Encoding "The", I can look at the start symbol and "The". Encoding "chef", I can look at the start symbol, "The", and "chef", but I can't look at "who". And with that representation of "chef" - one that has only looked at the start symbol, "The", and "chef" - I can define a probability distribution that predicts "who", without having cheated by looking ahead and seeing that "who" is the next word. Questions?
speaker 2: Since this is for decoders - do we do this for both the encoder and the decoder, or do we allow ourselves to look forward in the encoder?
speaker 1: The question is: it says here we use this in a decoder - do we also use it in the encoder? This is the distinction between a bidirectional LSTM and a unidirectional LSTM. Wherever you don't need the constraint, you probably don't use it. If you're encoding the source sentence of your machine translation problem, you probably don't do this masking, because it's good to let everything look at everything. Whenever you do need it - because you have this autoregressive factorization, the probability of word one, then word two given word one, then word three given words one and two - you use it. So traditionally: yes in decoders, no in encoders.
speaker 2: My question is a bit philosophical: don't humans generate sentences by having some notion of the probable future words before they choose the words they're currently speaking or writing?
speaker 1: Good question. The question is: isn't looking ahead a bit - predicting or getting an idea of the words you might say in the future - how humans generate language, rather than this strict constraint of never seeing the future? Trying to plan ahead is definitely an interesting idea. But when I'm training the network to predict the next word, if I give it the answer, it won't learn anything useful. In practice, when generating text, it might be a good idea to make guesses far into the future or to have a high-level plan; but in training, I can't encode that intuition by directly giving the model the answer about the future, because then it's too easy - there's nothing to learn. There might be interesting ideas about giving the network a hint about what kind of thing could come next, but that's out of scope for this lecture.
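Here is the masking idea as a small NumPy sketch, using uniform scores so the effect of the mask is easy to see; the function names are illustrative.

```python
import numpy as np

def causal_mask(E):
    """E: (n, n) raw scores e_ij. Set every future position (j > i) to -inf
    so that the softmax assigns it weight exactly zero."""
    n = E.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    return np.where(future, -np.inf, E)

def softmax_rows(E):
    E = E - E.max(axis=-1, keepdims=True)
    A = np.exp(E)
    return A / A.sum(axis=-1, keepdims=True)

E = np.ones((4, 4))                   # e.g. scores for [START, The, chef, who]
A = softmax_rows(causal_mask(E))
print(np.round(A, 2))                 # row i spreads its weight only over positions j <= i
```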
Yeah - question up here.
speaker 2: I understand why we want to mask the future for things like language models, but how does it apply to machine translation? Why would we use it there?
speaker 1: So in machine translation - let me come over to this board and hopefully find a better marker - I have a sentence like "I like pizza" and I want to translate it, say to "J'aime la pizza". When I'm looking at "I like pizza" as the input, I want self-attention without masking: I want "I" to look at "like", "I" to look at "pizza", "like" to look at "pizza" - all of it. Then when I'm generating the output tokens "J'aime la pizza", in encoding a given output word I want to be able to look only at itself and the previous output words - none of the future ones - plus all of the input. We'll talk about encoder-decoder architectures later in the lecture. What I'm describing right now, in this masking case, is masking out with negative infinity all of the future output words, so the attention scores to them are negative infinity. Does that answer your question? Great. Okay, let's move ahead. That was our last big building-block issue with self-attention. So this is what I would call - and this is my personal opinion - a minimal self-attention building block. You have self-attention, the basis of the method - that's here in red. You have the inputs to the sequence, you embed them with the embedding matrix E, and you add position embeddings. The three arrows in the diagram represent the key, the value, and the query - that's how these diagrams are usually stylized. So you pass the input, with its position representation, to self-attention; the position representation specifies the sequence order, because otherwise you'd have no idea what order the words showed up in. You have the nonlinearities - the teal feed-forward network - to provide that deep learning expressivity. And you have masking, so that you get parallelizable operations that don't look at the future. That's our minimal architecture. Then, above that, you repeat this self-attention plus feed-forward pair many times - self-attention, feed-forward; self-attention, feed-forward - that's what I'm calling a block. And maybe at the end you predict something - we haven't really talked about that yet, but you take these representations and predict the next word, or the sentiment, or whatever. So that's a self-attention architecture. We'll move on to the Transformer next, if there are no further questions.
speaker 2: And when do we use masking - just for decoding?
speaker 1: We use masking for decoders, where I'm decoding out a sequence and I have an informational constraint: to represent a word properly, I cannot have information about the future. Where you don't have that constraint, you don't mask. Okay, great.
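Putting the pieces so far together, here is a toy sketch of this minimal architecture: embeddings plus position vectors, then repeated masked self-attention and feed-forward blocks. The parameter layout and names are my own packaging for illustration; real implementations differ, and the Transformer adds more components, as discussed next.

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=-1, keepdims=True)
    A = np.exp(E)
    return A / A.sum(axis=-1, keepdims=True)

def minimal_decoder(X, params):
    """Embeddings X (n, d) -> add position vectors once at the input, then repeat
    [masked self-attention -> position-wise feed-forward] for each block."""
    n, d = X.shape
    mask = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, 0.0)
    H = X + params["positions"][:n]
    for blk in params["blocks"]:
        scores = (H @ blk["Q"]) @ (H @ blk["K"]).T + mask   # masked pairwise scores
        H = softmax_rows(scores) @ (H @ blk["V"])           # weighted average of values
        H = np.maximum(0.0, H @ blk["W1"]) @ blk["W2"]      # feed-forward, per position
    return H                                                # one contextual vector per word

n, d, d_ff = 5, 8, 16
rng = np.random.default_rng(0)
params = {
    "positions": rng.normal(size=(32, d)),                  # max sequence length 32 here
    "blocks": [{name: 0.1 * rng.normal(size=shape)
                for name, shape in [("Q", (d, d)), ("K", (d, d)), ("V", (d, d)),
                                    ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
               for _ in range(2)],
}
X = rng.normal(size=(n, d))
print(minimal_decoder(X, params).shape)                     # (5, 8)
```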
speaker 1: So now let's talk about the Transformer. What I've pitched to you is what I'd call a minimal self-attention architecture, and I quite like pitching it that way, but really no one uses exactly the architecture on the previous slide - it doesn't work quite as well as it could. There are a number of important details, which we'll talk about now, that go into the Transformer. What I'd like you to take away, though, is that the Transformer architecture, as I'll present it, is not necessarily the end point of our search for better and better ways of representing language, even though it's now ubiquitous and has been for a couple of years. So keep thinking about the problems with self-attention and ways you might fix some of the issues with Transformers. Okay. A Transformer decoder is how we'll build systems like language models. It's like our decoder-only minimal architecture, but with a few extra components, some of which I've grayed out here; we'll go over them one by one. The first real difference is that we replace self-attention with masking by masked multi-head self-attention. This ends up being crucial - probably the most important distinction between the Transformer and the minimal architecture I presented. Let's come back to our toy example of attention, where we were representing the word "learned" in the context of the sequence "I went to Stanford CS224N and learned", with teal bars saying how much you intuitively look at each word. Really, there are different ways I might want to look at the sequence - different aspects of information I want to incorporate into my representation. Maybe I want to look at "Stanford CS224N" because those are entities - you learn different things at Stanford CS224N than at other courses or universities. And in another sense, maybe I want to look at "learned" and "I went and learned" - the syntactically relevant words. There are very different reasons to look at different parts of the sequence, and trying to average all of that out with a single application of self-attention ends up being too difficult, in a way we'll make precise in Assignment 5. Any questions about this intuition?
speaker 2: Is each head just another application of attention?
speaker 1: It should be an application of attention exactly as I've presented it: you independently define the keys, the queries, and the values for each head. I'll define it more precisely in a moment, but think of it as: I do attention once, and then I do it again with different parameters, able to look at different things, and so on.
speaker 2: If we have two separate sets of weights, one to do this and one to do that, how do we ensure they learn different things?
speaker 1: The question is: with two separate sets of weights, how do we ensure they learn different things? We do not ensure it - we hope that they learn different things.
And in practice they do, although not perfectly. It ends up being the case that there's some redundancy and you can cut some heads out, but that's out of scope here. We hope - just as we hope that different dimensions in our feed-forward layers learn different things because of the lack of symmetry - that the heads will start to specialize, and then specialize even more. Okay. In order to discuss multi-head self-attention, we really need to talk about the matrices - how we implement this efficiently on GPUs - so let's look at the sequence-stacked form of attention. We've been treating each word individually as a vector of dimensionality d, but really we work on big stacked matrices: I take all my word embeddings x_1 to x_n and stack them into a matrix X in R^{n x d}. Now with my matrices K, Q, and V I can just multiply on that side of X: X is n by d, K is d by d, so XK is n by d again. One big matrix multiply transforms every word in the sequence by my key, query, and value matrices very efficiently. This is the vectorization idea: I don't want to for-loop over the sequence; I represent the sequence as a big matrix and do one big matrix multiply. The output is then defined by this somewhat inscrutable bit of math, which I'll go over visually. First, we take all the key-query dot products in one matrix: XQ is n by d, (XK)^T is d by n, so (XQ)(XK)^T is n by n - that's all of the e_ij scores for self-attention, all pairs of attention scores computed in one big matrix multiply. Next I take the softmax over the second dimension to get the normalized scores, and then I multiply by XV: an n-by-n matrix times an n-by-d matrix. And what is that computing? Just the weighted average - one big weighted-average computation over the whole matrix, giving the entire self-attention output in R^{n x d}. So I've restated the self-attention operations identically, but in terms of matrices, so that you can run them efficiently on a GPU.
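The same computation as a NumPy sketch: the whole single-headed self-attention output in a few matrix multiplies, with no loop over words. Names are illustrative.

```python
import numpy as np

def self_attention_matrix(X, Q, K, V):
    """Vectorized single-head self-attention: softmax((XQ)(XK)^T) (XV)."""
    XQ, XK, XV = X @ Q, X @ K, X @ V                # each (n, d)
    E = XQ @ XK.T                                   # (n, n): all pairwise scores e_ij
    A = np.exp(E - E.max(axis=-1, keepdims=True))   # softmax over the second axis
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ XV                                   # (n, d): weighted averages of the values

n, d = 6, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Q, K, V = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
print(self_attention_matrix(X, Q, K, V).shape)      # (6, 16)
```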
Now, multi-head attention - and it matters that we've written everything in terms of matrices - gives us the ability to look in multiple places at once, for different reasons. Self-attention looks where the dot product q_i · k_j is high, but maybe we want to look in different places for different reasons, so we define multiple query, key, and value matrices. We'll have h self-attention heads, and for each head l an independent query, key, and value matrix Q_l, K_l, V_l, each mapping from the model dimensionality d down to d/h. So each head projects down to a lower-dimensional space - that's for computational efficiency. Then I apply self-attention independently for each head. The equation is identical to the one for single-headed self-attention, just with l indices everywhere: I map to the lower-dimensional space, my value vectors are lower-dimensional too, so each head's output is in R^{d/h}. You're doing exactly the same operation, just h different times, and then you combine the outputs: each head looks in different places with its own key, query, and value matrices, you take each head's output, concatenate them - each is d/h-dimensional, so concatenated they're d-dimensional - and mix them together with a final linear transformation. So each head gets to look at different things and construct its value vectors differently, and then the results are combined. Let's go through this visually, because it helps me at least. It's actually not really more costly than computing a single head of self-attention, and we'll see why. In single-headed self-attention we computed XQ; in multi-head self-attention we also compute XQ the same way, as an n-by-d matrix, and then we reshape it to n by h by d/h - sequence length, times the number of heads, times the model dimensionality divided by the number of heads. So now it's a three-axis tensor: the first axis is the sequence length, the second is the number of heads, the third is the reduced model dimensionality - and that reshape costs nothing. We do the same for XK and XV, then transpose so the head axis comes first, and now all the remaining operations treat the head axis like a batch axis. What does this look like in practice? Instead of one big XQ matrix of model dimensionality d, I've got - in this case - three XQ slices of dimensionality d/3 each, and the same for the key matrix. Everything looks almost identical; it's just a reshaping of the tensors. At the output I get three sets of attention scores just from this reshape, and the cost is that each attention head only has a d/h-dimensional vector to work with instead of a d-dimensional one. I get the three sets of pairwise scores, compute the softmax independently for each, multiply by the three lower-dimensional value matrices, get three output vectors, and apply a final linear transformation to mix them together into one output. In summary, this lets you do exactly what the toy example suggested: each head can look at different parts of the sequence for different reasons.
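A sketch of multi-head attention done exactly this way: reshape so the head axis acts like a batch axis, attend per head, concatenate, and mix with a final linear map. The output-projection matrix O and the other names are my own labels for illustration.

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=-1, keepdims=True)
    A = np.exp(E)
    return A / A.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Q, K, V, O, h):
    """X: (n, d); Q, K, V, O: (d, d); h: number of heads. Each head attends in a
    d/h-dimensional slice; the head axis is handled like a batch axis."""
    n, d = X.shape
    split = lambda M: (X @ M).reshape(n, h, d // h).transpose(1, 0, 2)   # (h, n, d/h)
    q, k, v = split(Q), split(K), split(V)
    A = softmax_rows(q @ k.transpose(0, 2, 1))       # (h, n, n): one score matrix per head
    heads = A @ v                                    # (h, n, d/h): one output per head
    concat = heads.transpose(1, 0, 2).reshape(n, d)  # concatenate the h outputs
    return concat @ O                                # mix them with a final linear map

n, d, h = 6, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Q, K, V, O = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Q, K, V, O, h).shape)   # (6, 16)
```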
speaker 2: Just a question - this is at a given block, right? All of these attention heads are for one Transformer block, and the next block could also have three attention heads?
speaker 1: The question is whether all of these are for a given block. We'll come back to what a block is, but a block is this pair of self-attention and feed-forward: self-attention then feed-forward is one block; another block is another self-attention and another feed-forward. And are the parameters shared between the blocks? Generally they are not - you have independent parameters at every block, although there are some exceptions.
speaker 2: Is it typically the case that you have the same number of heads at each block, or do you vary the number of heads across blocks?
speaker 1: You definitely could vary it; people haven't found a reason to. The question is whether you have different numbers of heads across blocks or the same number everywhere. The simplest thing is to have it be the same everywhere, which is what people do. I haven't seen a good reason to vary it, though it could be interesting. It is the case that after training these networks you can zero out - remove - some of the attention heads entirely, and I'd be curious whether you could remove more or fewer depending on the layer index, which might suggest some layers should just have fewer heads. But again, it's not actually more expensive to have a bunch. People tend to set the number of heads so that you have a reasonable number of dimensions per head given the total model dimensionality d: for example, I might want at least 64 dimensions per head, and if d is, say, 128, that roughly tells me how many heads to have. So the number of heads tends to scale up with the model dimensionality.
speaker 2: By slicing XQ and XK into columns like that, you're reducing the rank of the final matrix, right? Does that not have any effect on the results?
speaker 1: The question is: with the reduced XQ and XK matrices - this little sliver times that little sliver defining the whole big score matrix - it's a very low-rank approximation; is that not bad in practice? No. Again, this is why we limit the number of heads based on the model dimensionality: intuitively you want each head to have at least some number of dimensions - 64 is common, or 128, something like that. If you don't give each head too much to do - it has a simple job, and you have a lot of heads - it ends up being fine. All we really know empirically is that it's way better to have many heads than just one. Yes?
speaker 2: Have there been studies of whether the information learned by the different heads is consistent, or how the heads relate to each other?
speaker 1: The question is whether there have been studies of what consistent information the attention heads encode. Yes - there's been quite a lot of interpretability and analysis work on these models, trying to figure out what mechanistic role each head takes on, and there are exciting results: some attention heads learn to pick out syntactic dependencies, for example, or do a kind of global averaging of context. The question is quite nuanced, though, because in a deep network it's unclear, when you look at a word ten layers in, what you're really looking at - it has already incorporated context from everything else. We should talk about this more offline, but it's a little bit unclear.
It's an active area of research, but I should move on now to keep discussing Transformers - happy to talk more about it later. Okay. Another addition I'll toss in here - maybe I wouldn't call it a hack; it's a nice little method that improves things and is important to know - is scaled dot-product attention. One issue with key-query-value self-attention is that as the model dimensionality becomes large, the dot products between vectors - even random vectors - tend to become large, and when that happens the inputs to the softmax function can be very large, making the gradients small. Intuitively, if you take two random vectors of dimensionality d and dot them together, the dot product grows in expectation as d grows. You'd like attention to start out fairly uniform and flat - look everywhere - but if some dot products are very large, learning is inhibited. So for each of your heads, you simply divide all the scores by a constant determined by the dimensionality - the square root of d/h - so that as the vectors get large, the dot products don't, at least at initialization. That's scaled dot-product attention; from here on we'll assume we always do this. It's easy to implement: just a little division in your computations.
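A quick numerical illustration of why the scaling is needed, and of the fix: divide the raw dot product by the square root of the (per-head) dimensionality.

```python
import numpy as np

# Dot products of random d-dimensional vectors grow with d (their standard
# deviation is sqrt(d)), so the softmax inputs blow up for large models.
rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    q, k = rng.normal(size=d), rng.normal(size=d)
    raw = q @ k
    scaled = raw / np.sqrt(d)        # the scaled dot-product fix: divide by sqrt(d)
    print(d, round(float(raw), 1), round(float(scaled), 2))
# With h heads, each head works in d/h dimensions, so the divisor is sqrt(d/h).
```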
And you know, it also, at least at initialization, makes everything look a little bit like the identity function, right? Because if the contribution of the layer is somewhat small, because all of your weights are small, and I have the addition from the input, maybe the whole thing looks a little bit like the identity, which might be a good place to start. And there are really nice visualizations; I just love this visualization. So this is your loss landscape, right? You're doing gradient descent and you're trying to traverse the mountains of the loss landscape. This is the parameter space, and down is better in your loss function, and it's really hard, so you get stuck in some local optimum and you can't find your way out. And then, with residual connections, I mean, come on, you just sort of walk down. I mean, it's not actually, I guess, really how it works all the time, but I really love this. It's great. Okay, so yeah, we've seen residual connections. We should move on to layer normalization. So layer norm is another thing to help your model train faster. And you know, the intuitions around layer normalization and the empiricism of it working very well maybe aren't perfectly connected, let's say. But you should imagine, I suppose, that there's variation within each layer: things can get very big, things can get very small, and that's not actually informative, because of variations between, maybe, the gradients, or weird things going on in my layers that I can't totally control. I haven't been able to make everything behave nicely, where everything stays roughly the same norm; maybe some things explode, maybe some things shrink. And I want to cut down on uninformative variation between layers. So I'm going to let x in R^d be an individual word vector in the model. So this is at a single index: one vector. And what I'm going to try to do is just normalize it, in the sense that it's got a bunch of variation and I'm going to cut that out: I'm going to normalize it to zero mean and unit standard deviation. So I'm going to estimate the mean across all of the dimensions in the vector, so j equals one to the model dimensionality d. I've got this one big word vector, I sum up all the values, with a division by d there, right? That's the mean. I'm going to have my estimate of the standard deviation; again, these should say estimates, this is my simple estimate of the standard deviation of the values within this one vector. And then, possibly, I can have learned parameters to scale back out, multiplicatively and additively; that's optional. We're going to compute this standardization, where I take my vector x, subtract out the mean, and divide by the standard deviation plus this epsilon constant. If there's not a lot of variation, I don't want things to explode, so I've got this epsilon there that's close to zero.
So this part here, x minus mu over sigma plus epsilon, is saying: take all the variation and normalize it to zero mean and unit standard deviation, and then maybe I want to scale it, stretch it back out, and then maybe add an offset beta that I've learned. Although in practice, and we discuss this in the lecture notes, this last part maybe isn't actually that important. But so, layer normalization: you can think of it as, when I get the output of layer normalization, it's going to look nice and look similar to the next layer independent of what's gone on, because it's going to have zero mean and unit standard deviation, so maybe that makes for a better thing for the next layer to learn from. Okay. Any questions on residual connections or layer norm? speaker 2: Yes. speaker 1: Yeah, it's a good question. When I subtract the scalar mu from the vector x, I broadcast mu to dimensionality d and remove mu from all d dimensions. Yeah, good point, thank you, that was unclear. speaker 2: In the fourth bullet point, I'm confused: should it be divided by d or not? speaker 1: So, here in the fourth bullet point, when you're calculating the mean, is it divided by d? Or is it... maybe I'm just... speaker 2: I think it is divided by d. speaker 1: These are... so this is the average deviation from the mean of all of the... yeah, yes. speaker 2: If you have a five-word sentence, do you normalize based on the statistics of all five words together, or word by word? speaker 1: So the question is, if I have five words in the sequence, do I normalize by aggregating the statistics, estimating mu and sigma across all five words so they share their statistics, or do I do it independently for each word? This is a great question, which I think is under-specified in all the papers that discuss transformers. You do not share across the five words, which is somewhat confusing to me. Each of the five words is done completely independently. You could have shared across the five words and said that your estimate of the statistics is based on all five, but you do not. I can't pretend I totally understand why. speaker 2: For example, per batch, per element from the same position? speaker 1: So, a similar question: if you have a batch of sequences, just like when we're doing batch-based training, then for a single word we don't share the statistics across the sequence index, but do you share across the batch? And the answer is no, you also do not share across the batch. In fact, layer normalization was invented as a replacement for batch normalization, which did just that. And the issue with batch normalization is that your forward pass then depends, in a way that you don't like, on examples that should be unrelated to your example. And so, yeah, you don't share statistics across the batch. Okay, cool. So now we have our full transformer decoder and we have our blocks. So in this slightly grayed-out thing here that says repeat for the number of encoder, sorry, decoder blocks, each block consists of: I pass it through self-attention, then my add and norm, right, so I've got this residual connection here that goes around, the add, and I've got the layer normalization there, and then a feed-forward layer, and then another add and norm. And that set of four operations I apply some number of times, the number of blocks. So that whole thing is called a single block. And that's it.
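As a compact recap of the attention inside each block, here is the scaled dot-product attention from a few paragraphs back, with optional masking, written out in PyTorch. This is a minimal sketch with illustrative tensor names and shapes, not the assignment's starter code:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (n, d_k) tensors for one head; mask: (n, n) bool, True = allowed to attend."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # divide so scores stay moderate as d_k grows
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # blocked positions get zero weight after softmax
    weights = torch.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                     # weighted average of the value vectors

n, d_k = 5, 64
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))   # lower-triangular "don't look ahead" mask
print(scaled_dot_product_attention(Q, K, V, causal).shape)  # torch.Size([5, 64])
```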
That's the transformer decoder as it is. Cool. So that's a whole architecture right there. We've solved things like needing to represent position, we've solved things like not being able to look into the future, and we've solved a lot of different optimization problems. You had a question? speaker 2: Yes, for multi-head attention, with the dot product scaled by the square root, do you use d over h as well? speaker 1: Yeah. So the next question is, how do these models handle variable-length inputs? The input to the GPU forward pass is going to be a constant length, so you're going to maybe pad to a constant length, and in order to not look at the stuff sitting out in the padding, you can mask out the pad tokens, just like the masking that we showed for not looking at the future in general: you can just set all of the attention weights to zero, or the scores to negative infinity, for all of the pad tokens. Yeah, exactly. So you can set everything to this maximum length. Now, in practice, the question was, do you set this length so that everything is that maximum length? Yes, often, although you can save computation by setting it to something smaller, and the math all still works out; you just have to code it properly so it can handle that. Instead of setting everything to n, you set it all to five: if everything is shorter than, like, five, you save a lot of computation, and all of the self-attention operations just work. And yes, there's one hidden layer in the feed-forward network. Okay, I should move on; I've got a couple more things and not very much time, but I'll be here after the class as well. So, the encoder. The transformer encoder is almost identical, but again, we want bidirectional context, and so we just don't do the masking. So in my multi-head attention here, I've got no masking, and it's that easy to make the model bidirectional. Okay, so that's easy; that's called the transformer encoder. It's almost identical, but no masking. And then finally, we've got the transformer encoder-decoder, which is actually how the transformer was originally presented, in the paper Attention Is All You Need. And this is when we want to have a bidirectional network over the source. Here's the encoder: it takes in, say, my source sentence for machine translation, and its multi-head attention is not masked. And I have a decoder to decode out my sentence, but you'll see that this is slightly more complicated: I have my masked multi-head self-attention, just like I had before in my decoder, but now I have an extra operation, which is called cross attention, where I am going to use my decoder vectors as my queries, but take the output of the encoder as my keys and values. So now, for every word in the decoder, I'm looking at all the possible words in the output of the encoder. Yes? speaker 2: For the keys and the values, how do we get the keys and values separated from the output? Because didn't we collapse those into a single output? speaker 1: Sorry, how do we get the keys and values out? speaker 2: When we have the output, didn't we collapse the keys and values into a single output? speaker 1: Yeah. The question is, how do you get the keys and values and queries out of this single collapsed output? Now remember, the output for each word is just this weighted average of the value vectors for the previous words, right?
And then from that output, for the next layer, we apply a new key, query, and value transformation to each of them, for the next layer of self-attention. speaker 2: So it's not the output itself that you attend to? speaker 1: Yeah, you apply the key matrix and the query matrix to the output of whatever came before it. And so, in a little bit of math: we have these vectors h1 through hn, I'm going to call them, that are the output of the encoder, and then I've got vectors that are the output of the decoder, so I've got these z's, I'm calling them, the output of the decoder. And then I simply define my keys and my values from the encoder vectors, these h's: I take the h's and apply a key matrix and a value matrix. And I define the queries from my decoder. So this is why two of the arrows come from the encoder and one of the arrows comes from the decoder: I've got my z's here, I get my queries from those, and my keys and values from the encoder. Okay, so that is it.
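That cross-attention wiring, with the h vectors from the encoder supplying keys and values and the z vectors from the decoder supplying queries, can be sketched in a few lines. The weight matrices and sizes below are illustrative placeholders, not anything from the assignment:

```python
import math
import torch

d = 512
W_k, W_v, W_q = (torch.randn(d, d) for _ in range(3))   # illustrative key/value/query matrices

H = torch.randn(7, d)            # encoder outputs h_1..h_n (source length 7)
Z = torch.randn(4, d)            # decoder states z_1..z_t (target length 4)

K, V = H @ W_k, H @ W_v          # keys and values come from the encoder
Q = Z @ W_q                      # queries come from the decoder
scores = Q @ K.T / math.sqrt(d)  # (4, 7): each target position scores every source position
alpha = torch.softmax(scores, dim=-1)
output = alpha @ V               # (4, d): weighted average of encoder value vectors
print(output.shape)
```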
I've got a couple of minutes and I want to discuss some of the results of transformers, and I'm happy to answer more questions about transformers after class. So, really, the original results of transformers: they had this big pitch of, look, you can do way more computation because of parallelization, and they got great results in machine translation. So you had transformers doing quite well, although not astoundingly better than existing machine translation systems, but they were significantly more efficient to train, because you don't have that lack of parallelizability: you could compute on much more data, much faster, and make much better use of fast GPUs. After that, there were things like document generation, where you had the old standard of sequence-to-sequence models with LSTMs, and eventually everything became transformers all the way down. Transformers also enabled this revolution in pre-training, which we'll go over next class. The efficiency, the parallelizability, allows you to compute on tons and tons of data, and so after a certain point, on standard large benchmarks, everything became transformer-based. This ability to make use of lots and lots of data and lots and lots of compute put transformers head and shoulders above LSTMs in, let's say, almost every modern advancement in natural language processing. There are many drawbacks of, and variants to, transformers. The clearest one that people have tried to work on is the quadratic compute problem: this all-pairs interaction means that our total computation for each block grows quadratically with the sequence length. And in a student's question we heard that, as the sequence length becomes long, if I want to process a whole Wikipedia article or a whole novel, that becomes quite infeasible. And actually, that's a step backwards in some sense, because for recurrent neural networks the computation only grew linearly with the sequence length. Other things people have tried to work on are better position representations, because the absolute index of a word is maybe not the best way to represent its position in a sequence. And just to give you an intuition for the quadratic cost in sequence length: remember that we had this big matrix multiply that resulted in an n-by-n matrix. Computing and storing that matrix is a big cost; it costs a lot of memory. And so, if you think of the model dimensionality as, like, a thousand, although today it gets much larger, then for a short sequence n is roughly 30. If you're computing n squared times d, 30 isn't so bad; but if you had something like 50,000, then n squared becomes huge, and it's sort of totally infeasible. So people have tried to map things down to a lower-dimensional space to get rid of the quadratic computation. But in practice, as people have gone to things like GPT-3 and ChatGPT, most of the computation doesn't show up in the self-attention, so people are wondering whether it's even necessary to get rid of the self-attention operation's quadratic cost; it's an open area of research whether this is necessary. And then finally, there have been a ton of modifications to the transformer over the last four or five years, and it turns out that the original transformer, plus maybe a couple of modifications, is pretty much the best thing there is. Still, there have been a couple of things that end up being important: changing out the nonlinearities in the feed-forward network ends up being important. It's had lasting power so far, but I think it's ripe for people to come through and think about how to improve it in various ways. So: pre-training is on Tuesday. Good luck on assignment four. And yeah, we'll have the project proposal documents out tonight.
Latest Summary (Detailed Summary)
Title: Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
Description: 90,539 views, Sep 20, 2023, #deeplearning #naturallanguageprocessing
For more information about Stanford's Artificial Intelligence professional and graduate programs visit: https://stanford.io/ai
This lecture covers:
1. From recurrence (RNN) to attention-based NLP models
2. The Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers
Subtitle: This lecture covers the shift in natural language processing from recurrent neural networks (RNNs) to models built on self-attention and the Transformer.
Overview / Executive Summary
This lecture describes the major shift in natural language processing from recurrent neural networks (RNNs) to self-attention-based Transformer models. The core content:
1. Limitations of RNNs: long-distance dependencies are hard to capture because information must be passed word by word (the linear interaction distance problem), gradients are hard to propagate over long spans, and the sequential dependence prevents full GPU parallelism, so compute efficiency degrades as sequences grow. These issues motivated new architectures.
2. The self-attention breakthrough: self-attention lets every word in a sequence attend directly to every other word, capturing arbitrary pairwise relationships and enabling highly parallel computation, which overcomes the RNN bottlenecks.
3. Core Transformer components: multi-head attention (attending to information in different representation subspaces in parallel), positional encodings (supplying the order information that self-attention lacks), position-wise feed-forward networks (adding nonlinearity), residual connections and layer normalization (helping deep networks train), and masking for tasks such as decoding.
4. Architectures and achievements: the Transformer encoder, decoder, and full encoder-decoder architecture; strong results on machine translation; and the foundation for large pretrained models such as BERT and GPT.
5. Challenges and outlook: chiefly the quadratic compute cost in sequence length, plus related research directions and model variants.
Course Logistics and Announcements
- New lecture notes released: they largely follow this lecture but go into considerably more detail.
- Assignment 4: due one week from today.
- The issues with Azure are continuing.
- The TAs have verified that the assignment trains on Colab, and the amount of training fits within a Colab session; students without a GPU are advised to use Colab.
- The staff are working on getting more GPU access for Assignment 5 and the final project.
- Final Project Proposal:
  - Goes out tonight; there will be an Ed announcement when it is posted.
  - The staff will give feedback on feasibility and suggest changes, so that each project has a good chance of success for the rest of the quarter.
  - Feedback will be returned quickly, since the final project is the main focus after Assignment 5 is done.
From Recurrent Neural Networks (RNNs) to Attention
The lecture reviews how NLP problems were handled earlier in the course: typically a bidirectional LSTM (BiLSTM) encodes the sentence, and a unidirectional LSTM with attention generates the output (a translation, a parse, and so on) one token at a time. Attention provides flexible access to memory; in machine translation, for example, it avoids squeezing the entire source sentence into a single vector (the information bottleneck).
This lecture has a similar goal but uses different building blocks. Roughly between 2014 and 2017, RNNs were the mainstream approach; afterwards, new building blocks (the Transformer) emerged that can directly replace LSTMs and have led to broader and more successful applications.
Limitations of RNNs
The lecture identifies two main problems with RNNs:
1. Linear interaction distance:
   - RNNs process information in order (left to right or right to left), so adjacent words interact easily.
   - For long-distance dependencies, however (for example, the relation between "chef" and "was" in "The chef who ... was..."), information must be passed step by step through the RNN. As the lecturer puts it, "very long-distance dependencies can take a very long time to interact." The farther apart the words, the harder the dependency is to learn and the harder gradient propagation becomes.
   - LSTMs improve on simple RNNs for long-distance gradients, but they are not perfect.
   - Ideally, related words should be able to interact in the network's computation graph without being limited by their linear distance.
2. Lack of parallelizability (dependence on time):
   - The forward and backward passes of an RNN contain O(sequence length) unparallelizable operations.
   - To compute the hidden state at time step t, you must first compute the hidden state at time step t-1. As the lecturer illustrates, "you can't compute the RNN hidden state for time step five before you've computed it for time steps four and three."
   - This limits how effectively parallel hardware such as GPUs can be used: the longer the sequence, the longer the chain of operations that cannot be parallelized, so compute efficiency drops as sequence length grows.
A student asked whether attention already solves the linear-interaction problem. The lecturer responded that attention does help, and that the rest of the lecture adopts attention, specifically self-attention, much more thoroughly, replacing recurrence entirely.
Understanding Attention
If RNNs are not the best choice, attention is a viable alternative. The core idea:
- Treat each word's representation as a query.
- The query is used to access and combine information from a set of values, each associated with a key.
- Attention lets the model attend directly to words arbitrarily far away in the sequence, solving the linear-interaction problem.
- Because the whole sequence (or at least a large chunk of it, depending on the type of attention) can be processed at once, it also solves the parallelism problem.
Attention as a Fuzzy Lookup
The lecture compares attention to a kind of "fuzzy lookup" in a key-value store:
- Standard lookup table: like a Python dictionary, the query matches exactly one key and returns the corresponding value.
- Attention:
  - The query is "softly" matched against all keys, producing a similarity for each.
  - A softmax turns the similarities into weights between 0 and 1.
  - The output is the weighted average of all the values under those weights.
  - The lecturer describes this as a lookup-table-like operation carried out in a "soft, fuzzy vector space."
Example: in the sentence "I went to Stanford cs224n and learned", to build a context-aware representation of "learned", the word "learned" acts as the query, is scored against every word in the sentence (as keys), and the values for those words are combined in a weighted sum.
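The lookup analogy can be made concrete with a toy example: an exact dictionary lookup returns one value, while attention returns a softmax-weighted blend of every value. The numbers here are made up purely for illustration:

```python
import torch

# Exact lookup: the query matches one key and you get exactly that value back.
table = {"learned": 3.0, "Stanford": 7.0, "cs224n": 5.0}
print(table["learned"])                         # 3.0

# Soft lookup: the query scores *every* key, then the values are averaged by softmax weight.
scores = torch.tensor([4.0, 1.0, 2.0])          # query-key similarities (illustrative)
values = torch.tensor([3.0, 7.0, 5.0])
weights = torch.softmax(scores, dim=0)          # sums to 1; the best-matching key dominates
print((weights * values).sum())                 # ~3.4, a blend leaning toward the best match
```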
Self-Attention in Detail
Self-attention means that the elements of a single sequence attend to one another, so each word in a sentence can look directly at every other word. The computation proceeds as follows (a short code sketch follows this section):
1. Input: a sequence of words W_1, ..., W_n.
2. Word embeddings: each word W_i is mapped by an embedding matrix E to a word vector x_i of dimensionality d_model. These are context-free word embeddings.
3. Generate query, key, and value vectors: each word vector x_i is transformed by three different learnable weight matrices W_q, W_k, W_v (each of shape d_model x d_model, mapping a d_model-dimensional vector back to d_model dimensions), giving that word's query, key, and value:
   - q_i = W_q * x_i
   - k_i = W_k * x_i
   - v_i = W_v * x_i
   Every word in the sequence therefore plays all three roles at once.
4. Compute attention scores: for any pair of words i and j (word i as the query, word j as the key), compute the dot-product similarity e_ij = q_i^T * k_j. This score says how much word i should attend to word j.
5. Compute attention weights: for each word i, normalize its scores over all words j with a softmax, giving alpha_ij = softmax_j(e_ij) = exp(e_ij) / sum_k(exp(e_ik)). The weight alpha_ij is the contribution of word j's information to word i's representation.
6. Compute output vectors: word i's output o_i is the weighted sum of all value vectors, o_i = sum_j (alpha_ij * v_j).
In response to a student question about why separate Q and K matrices are needed, the lecturer explained that this is related to computational efficiency and amounts to a low-rank factorization of QK^T. On whether a word attends to its own position, this depends on the learned W_q and W_k matrices: the model can learn whether a word should attend to itself.
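A minimal sketch of steps 1 through 6 for a single attention head, using random placeholder inputs and the notation above (the 1/sqrt(d) scaling introduced later is deliberately omitted here):

```python
import torch

n, d = 6, 512                          # sequence length, model dimensionality
x = torch.randn(n, d)                  # context-free word vectors x_1..x_n
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v    # each word plays all three roles
e = q @ k.T                            # e_ij = q_i . k_j, all pairs at once  -> (n, n)
alpha = torch.softmax(e, dim=-1)       # alpha_ij = softmax over j
o = alpha @ v                          # o_i = sum_j alpha_ij * v_j           -> (n, d)
print(o.shape)                         # torch.Size([6, 512])
```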
Key Components for a Self-Attention Building Block
Using self-attention exactly as defined above as a drop-in replacement for RNNs still has several problems; extra components are needed to build a "minimal viable" self-attention block.
Problem 1: No notion of sequence order
- Problem: self-attention is an operation on sets; it does not care about the original order of the words. For example, "Zuko made his uncle" and "His uncle made Zuko" could end up with the same representations, because neither the word embeddings nor the attention computation depends on position indices.
- Solution: positional encoding. Add a vector p_i representing position information to the corresponding word embedding x_i (both approaches below are sketched in code after this section).
  - Sinusoidal positional encoding:
    - Sine and cosine functions of different frequencies produce a fixed-dimensional vector for each position; dimension k of P_i is computed roughly as sin(pos / 10000^(2k/d_model)) or cos(pos / 10000^(2k/d_model)).
    - Pros: in principle it can handle sequences longer than those seen in training, although in practice extrapolation does not work well. This is an early method that is still sometimes used.
  - Learned positional encoding:
    - Create a learnable embedding matrix of shape (maximum sequence length N_max, embedding dimensionality d_model).
    - At input time, add the vector for each position to the word embedding.
    - Pros: the model can learn whatever position representation best fits the data.
    - Cons: it cannot handle sequences longer than N_max. In practice, feeding the model a sequence longer than the preset maximum N_max typically makes it fail or crash.
    - Most current systems use this approach. In practice N_max might be around 4,000, which is still not enough for a novel or a long Wikipedia page; this remains a practical problem, partly because of the quadratic cost of the self-attention operation.
  - Other approaches: relative position encodings, dependency-tree-aware position encodings, and more (the lecture notes give further detail).
In response to a student question about how we can be sure the learned P matrix represents position, the lecturer explained that P is fixed per position while the words vary, so the only thing P can correlate with is position; the model picks up this association implicitly during learning.
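Both flavors of positional encoding are easy to sketch. The snippet below builds a sinusoidal table following the formula quoted above and, as an alternative, a learned position embedding; the sizes are illustrative:

```python
import torch
import torch.nn as nn

def sinusoidal_positions(n_max, d_model):
    """Fixed sin/cos table of shape (n_max, d_model); row i encodes position i."""
    pos = torch.arange(n_max, dtype=torch.float32).unsqueeze(1)   # (n_max, 1)
    k = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions
    freq = pos / (10000 ** (k / d_model))                         # (n_max, d_model/2)
    P = torch.zeros(n_max, d_model)
    P[:, 0::2] = torch.sin(freq)
    P[:, 1::2] = torch.cos(freq)
    return P

n_max, d_model = 16, 512
x = torch.randn(8, d_model)                                   # embeddings for an 8-word sentence

x_sin = x + sinusoidal_positions(n_max, d_model)[:8]          # option 1: fixed sinusoids
learned_pos = nn.Embedding(n_max, d_model)                    # option 2: learned, fails past n_max
x_learned = x + learned_pos(torch.arange(8))
print(x_sin.shape, x_learned.shape)
```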
Problem 2: No nonlinearities for deep learning magic
- Problem: stacking self-attention layers alone just re-averages the value vectors over and over; there is no nonlinear transformation of the kind deep models need for expressive power.
- Solution: a position-wise feed-forward network (FFN), sketched below.
  - After the self-attention output, apply a small feed-forward network independently at every position (usually a two-layer MLP with a nonlinearity such as ReLU).
  - For example, FFN(x) = max(0, xW_1 + b_1)W_2 + b_2.
  - This adds nonlinearity and is trivially parallelizable across positions.
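A sketch of the position-wise feed-forward network; the hidden width d_ff = 2048 matches the original paper's common choice and is used here only as an example:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each position independently."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                         # x: (batch, n, d_model)
        return self.w2(torch.relu(self.w1(x)))    # same MLP at every position, fully parallel

x = torch.randn(2, 6, 512)
print(PositionwiseFFN()(x).shape)                 # torch.Size([2, 6, 512])
```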
Problem 3: "Seeing the future" during decoding
- Problem: in some tasks (language modeling, machine-translation decoding), the model must not use information about future words when predicting the current one. Plain self-attention lets every word attend to every other word, including future ones.
- Solution: masking (see the short sketch below).
  - Before computing the attention weights, modify the score matrix e_ij.
  - For word i, if word j comes after it (that is, j > i), set e_ij to a very large negative number (negative infinity).
  - After the softmax, the attention weights alpha_ij for those future positions go to zero. As the lecturer explains, "when you softmax scores that contain negative infinity, the negative infinities map to zero, so the weights on future words are zero."
  - This lets the model compute attention for the whole sequence in parallel while guaranteeing that each position's representation depends only on the current and earlier positions.
  - Masking is normally used in the decoder; the encoder usually does not need it (bidirectional information flow is allowed).
Summary: a minimal self-attention building block
1. Input embeddings + positional encodings
2. Self-attention (compute Q, K, V, scores, weights, outputs)
3. (Optional) masking, for decoding-style tasks
4. Position-wise feed-forward network
This block can be stacked repeatedly to form a deep network.
The Transformer in Detail
The "minimal self-attention block" above is not actually the best-performing recipe in practice; the Transformer adds several further key details on top of it.
The Transformer Decoder
The Transformer decoder is used for tasks such as language modeling. Its core components:
Masked multi-head self-attention
- Motivation: a single attention head may struggle to capture several different kinds of dependency at once. For example, a word might need to attend to some words for semantic similarity while attending to others for syntactic structure.
- The multi-head mechanism:
  - Project the queries, keys, and values into H different lower-dimensional subspaces (each of dimensionality d_model / H).
  - Run (masked) self-attention independently within each subspace, that is, within each "head."
  - Concatenate the H heads' outputs.
  - Mix the concatenated result with a final linear transformation W_o.
- As the lecturer emphasizes, "each head can attend to different things and construct its value vectors differently."
- Implementation: done efficiently by reshaping the Q, K, V matrices (see the sketch below); the cost is not significantly higher than single-head attention.
- Effect: the model can attend to information from different angles and in different representation subspaces. In practice, different heads tend to learn to attend to different patterns; although nothing enforces this explicitly, they tend to specialize on their own.
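A sketch of the reshape-based multi-head computation described above, with H heads of size d_model / H; the class and variable names are my own, not from the lecture:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # all heads' Q, K, V in one matmul
        self.out = nn.Linear(d_model, d_model)       # final mixing matrix W_o

    def forward(self, x, mask=None):                 # x: (batch, n, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape so each head attends in its own d_head-dimensional subspace
        q, k, v = (t.view(b, n, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)       # (b, h, n, n)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))           # mask future/pad positions
        out = torch.softmax(scores, dim=-1) @ v                         # (b, h, n, d_head)
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_head)   # concatenate the heads
        return self.out(out)

x = torch.randn(2, 6, 512)
print(MultiHeadSelfAttention()(x).shape)             # torch.Size([2, 6, 512])
```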
Scaled dot-product attention
- Problem: when the key dimensionality d_k (that is, d_model / H) is large, the dot products q^T * k can become large, pushing the softmax into its saturated region where gradients are very small, which hurts learning.
- Solution: divide the dot products by sqrt(d_k) before the softmax:
  Attention(Q, K, V) = softmax( (QK^T) / sqrt(d_k) ) * V
- This helps keep the attention weights smooth and close to uniform at initialization.
Residual connections and layer normalization (together, "Add & Norm")
- Applied at the output of every sublayer (the multi-head attention layer and the feed-forward layer).
- Residual connections: add the sublayer's input x to its output, that is, x + SubLayer(x).
  - This greatly helps deep networks train, mitigates vanishing gradients, and makes it easy for the network to stay close to the identity. As the lecturer explains, gradients flow straight through the residual connection, "because it's the identity, the gradient is one," which is very good for training.
- Layer normalization: normalize each word's representation vector (over its d_model dimensions) to mean 0 and standard deviation 1, then optionally rescale and shift with learned parameters gamma and beta (see the code sketch below):
  LayerNorm(x) = gamma * ( (x - mean(x)) / sqrt(std(x)^2 + epsilon) ) + beta
  - This helps stabilize training and reduce uninformative covariate shift between layers.
- Note: layer normalization is applied per word vector; statistics are not shared across words in the sequence, nor across examples in the batch.
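For concreteness, here is a from-scratch layer normalization matching the description above, together with how it combines with a residual connection into an "Add & Norm" step. This is a sketch that adds epsilon to the standard deviation, as in the lecture; the learned gamma and beta are the optional gain and offset:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))    # learned gain (optional in practice)
        self.beta = nn.Parameter(torch.zeros(d_model))    # learned offset (optional in practice)
        self.eps = eps

    def forward(self, x):                                 # x: (..., d_model), one vector per word
        mu = x.mean(dim=-1, keepdim=True)                 # statistics are per word vector,
        sigma = x.std(dim=-1, keepdim=True, unbiased=False)  # not shared across words or batch
        return self.gamma * (x - mu) / (sigma + self.eps) + self.beta

d_model = 512
x = torch.randn(2, 5, d_model)             # (batch, words, d_model)
sublayer = nn.Linear(d_model, d_model)     # stand-in for the attention or FFN sublayer
out = LayerNorm(d_model)(x + sublayer(x))  # "Add & Norm": residual connection, then layer norm
print(out.shape)
```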
Structure of a Transformer decoder block
A decoder block applies, in order (a minimal sketch follows below):
1. Masked multi-head self-attention
2. Add & Norm (residual connection + layer normalization)
3. Position-wise feed-forward network (FFN)
4. Add & Norm (residual connection + layer normalization)
This block is stacked N times.
On the question of how the model handles variable-length inputs: inputs are usually padded to a fixed maximum length, and the attention computation ignores the pad tokens via masking, by setting the corresponding attention scores to negative infinity.
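A minimal decoder block following the ordering above, using torch.nn.MultiheadAttention for brevity (it fuses the per-head projections and the output matrix W_o internally; requires PyTorch 1.9+ for batch_first). The hyperparameters are illustrative, not the course's reference implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention -> Add & Norm -> feed-forward -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # In nn.MultiheadAttention, attn_mask entries that are True are *blocked*.
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)           # residual connection, then layer norm
        x = self.norm2(x + self.ffn(x))        # residual connection, then layer norm
        return x

n, d_model = 6, 512
x = torch.randn(2, n, d_model)                                        # (batch, length, d_model)
causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)   # True above the diagonal = future
for block in [DecoderBlock() for _ in range(4)]:                      # repeat for N blocks
    x = block(x, causal)
print(x.shape)                                                        # torch.Size([2, 6, 512])
```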
The Transformer Encoder
- The structure is nearly identical to a decoder block.
- Key difference: the encoder's multi-head self-attention is unmasked, so every position can attend to every other position (bidirectional context).
The Transformer Encoder-Decoder
This is the full Transformer architecture proposed in the original "Attention Is All You Need" paper, commonly used for sequence-to-sequence tasks such as machine translation.
1. Encoder:
   - A stack of N encoder blocks.
   - Processes the input source sequence (for example, the sentence to be translated).
   - Its final output is a sequence of context-aware word representations.
2. Decoder:
   - A stack of N decoder blocks.
   - Generates the target sequence (for example, the translation).
   - Each decoder block differs slightly from the stand-alone decoder described earlier: it contains two multi-head attention sublayers:
     - Masked multi-head self-attention over the decoder's own partially generated sequence, exactly as before.
     - Encoder-decoder attention (cross-attention):
       - Queries come from the output of the previous decoder sublayer (the masked self-attention).
       - Keys and values come from the encoder's final output.
       - This layer lets the decoder attend to the relevant parts of the source sequence while generating each target word.
       - It is unmasked, since the decoder may look at any part of the source.
     - This is followed by the position-wise feed-forward network (FFN).
   - Every sublayer (both attention layers and the FFN) is followed by an "Add & Norm" step.
Great Results with Transformers
- Machine translation: the original Transformer paper achieved excellent machine-translation results. They were not astoundingly better than the best existing systems, but training was far more efficient because the model parallelizes well, can consume much more data, and makes full use of GPUs.
- Foundation for pretraining: the Transformer's parallelism and efficiency make it possible to pretrain on enormous amounts of data, which gave rise to powerful pretrained language models such as BERT and GPT.
- Broad adoption: Transformer-based models quickly became the default on standard NLP benchmarks and reached state-of-the-art results on document generation, summarization, question answering, and many other tasks. As the lecturer summarizes, "this ability to use huge amounts of data and compute put Transformers head and shoulders above LSTMs in almost every modern advance in NLP."
Drawbacks and Variants of Transformers
1. Quadratic compute (see the short calculation after this list):
   - Self-attention interacts all pairs of words, so compute and memory grow as O(N^2 * d_model) in the sequence length N (with d_model the model dimensionality).
   - This limits the Transformer's ability to process very long sequences such as whole books or long documents. As the lecturer notes, at a sequence length of around 50,000 the N^2 term becomes enormous and the computation is completely infeasible. By comparison, an RNN's cost is usually taken to be O(N * d_model^2).
   - A lot of work tries to reduce this cost (for example, sparse or linearized attention), but the lecture notes that for models like GPT-3 most of the compute no longer sits in self-attention, so whether removing the quadratic term is even necessary remains an open question.
2. Position representations: absolute position encodings may not be optimal; variants such as relative position encodings have been proposed.
3. Model variants: a large number of Transformer variants have appeared over the past few years, but the lecture's view is that the original Transformer, plus a few small modifications (such as changing the nonlinearity in the FFN), remains very strong and has had lasting power.
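To make the quadratic growth concrete, here is a quick back-of-the-envelope count of the multiply-adds needed just to form the n x n score matrix, using the lecture's rough figure of d = 1,000; the sequence lengths are illustrative:

```python
d = 1_000                      # model dimensionality (the lecture's rough figure)
for n in (30, 512, 50_000):    # short sentence, typical training window, very long document
    ops = n * n * d            # cost of forming the n x n score matrix QK^T
    print(f"n = {n:>6}: ~{ops:.2e} multiply-adds per attention layer")
```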
Conclusion and Outlook
- The Transformer architecture, and self-attention in particular, has become the foundation of modern natural language processing.
- Despite issues such as compute cost, its parallelism and representational power have made it hugely successful across tasks.
- The next lecture covers pretraining.
- Students are reminded of the Assignment 4 deadline and the release of the final project proposal.