2025-03-04 | Transformer Deep Dive with Google Engineer | Foundation of LLMs and Modern AI

Transformer Architecture Deep Dive: From the Attention Mechanism to Positional Encoding

Media Details

Upload Date
2025-06-15 20:50
Source
https://www.youtube.com/watch?v=TcKJMBZySj0
Processing Status
Completed
Transcription Status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: Hello everyone, welcome to this video. This is going to be the single video you need to understand the basics of the classic transformer architecture that later changed the world of AI. I'm going to break the architecture down into several key components and go through them one by one. Let's dive in. I want to start with something that seems unrelated: how to smooth a noisy time series. The first graph shows a noisy time series with six values, x0 to x5, which form a vector x. Now we introduce a reweighing function for position 3, call it W3. It looks like a normal distribution: it filters out data that is far away from x3 and amplifies data that is close to it. If we multiply the reweighing function element-wise with the time series data (x0 times its weight, x1 times its weight, and so on) and add all the values together, we get a new value, y3. The mathematical formula is y3 = W3 · x, a dot product. Similarly, y4 = W4 · x. W3 is a vector and x is a vector, so y3 is a single number, and the same goes for y4. Collecting y0 through y5 gives a new vector y, which is supposed to be better than x. The advantage is that each point in y now has context from the other points, so it is less noisy: y3 paid attention to the other points, and the same is true for y0 through y5. I also want to say that this is not necessarily a smoothing operation; it depends entirely on how you design the reweighing function. If your reweighing function is flat, the output y will be flat. If it is extremely pointy, say only W33 has a value and all the others are zero, it does essentially nothing and y looks identical to x. The important thing is to understand that, with some reweighing function, you get a new vector y from x, and this new vector has the advantage of context. Okay, let's move to the next topic. Before I talk more about transformers and natural language, I want to talk a little about word embeddings. A word embedding is the representation of a word, the smallest unit of natural language. The definition: a word embedding is a representation of a word used in text analysis, a real-valued vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning. That's a very long sentence, so let me show an example. Let's simplify the embedding: say the vector has five values and each is zero or one (in reality they are usually real-valued scores). The five dimensions are: is it human, is it female, is it male, does it bark, does it talk. The word "father" would be represented as [1, 0, 1, 0, 1]: human, not female, male, doesn't bark, talks. The representation of a dog would be [0, 0, 1, 1, 0]: not human, not female, male, barks, doesn't talk. So this is a very simple example of an embedding. Why is it useful? There are two big advantages, I would say.
The first advantage is that the distance between vectors indicates the similarity between words. If you look at these examples, the vector from "king" to "queen" is very similar to the vector from "man" to "woman"; the embeddings capture real relationships. Likewise for "walking" and "walked", "swimming" and "swam", and country-capital pairs: the vector between two word embeddings indicates some kind of relationship. The second advantage is that words with similar meanings are represented by similar vectors, so similar entities end up aggregated into tight clusters. The figure is a 2D projection (in reality the space is very high-dimensional), but you can see that "train", "bus", "car" form one cluster and "college", "school", "work" form another. Now that we know about word embeddings, let's talk about attention for natural language. Here is a sentence: "Date is my favorite fruit." "Date" is a word we use very commonly, and I would say 90% of the time it refers to time rather than an edible fruit. How do we humans know that "date" means a fruit here, something to eat? We read the sentence, see the word "fruit", and realize it refers to this "date"; that is why "date" in this sentence is something to eat rather than a point in time. Similarly, when you want an AI system to recognize this other meaning, you can use an attention-like mechanism. Natural language needs something similar to the time-series example: each word, token, or data point needs context from the others in the same sentence or paragraph. For example, "date" in this sentence means an edible fruit because of the context provided by "fruit". However, this kind of context is different from the previous example. There, "context" basically meant proximity via our filtering function; here we can't just use proximity. We need a more sophisticated reweighing method, and it needs to carry semantic meaning. So let's say there is a black box that does the reweighing and linear processing. We first split "Date is my favorite fruit" into word embeddings so we can process the natural language mathematically: v1 is the embedding of "date", v2 is the embedding of "is", et cetera. We feed these embeddings into the big box, which is supposed to be the attention mechanism, and the outputs are y1, y2, y3, y4, y5. What we want is for y1 to be a better representation than v1 because it has context from the other words; the y representations should have more context than the v's. The usual procedure is this: take the dot product of v1 with each of v1 through v5 to get the raw scores w11* through w15*, each a real number. Then normalize them so they add up to one; these are the weights we need. Then we use these weights to reweigh: w11 times v1 (a scalar times a vector), plus the same for v2 through v5, gives a new vector y1. This is a linear reweighing of all the word vectors towards v1, so y1 has the context of the other words. We do the same for the other positions to get y2 through y5, and we have a new representation of "Date is my favorite fruit" where the y vectors have context. A small sketch of this simplified reweighing is shown below.
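Here is a minimal NumPy sketch of the weight-free reweighing just described, assuming a toy 5-word sentence with 4-dimensional embeddings; the function name `simple_attention` and all values are made up for illustration, and the softmax, scaling and learned weight matrices come later in the video.

```python
import numpy as np

def simple_attention(embeddings: np.ndarray) -> np.ndarray:
    """Simplified (weight-free) self-attention: each output vector is a weighted
    average of all input vectors, with weights taken from normalized dot products."""
    scores = embeddings @ embeddings.T                      # pairwise dot products
    weights = scores / scores.sum(axis=1, keepdims=True)    # each row sums to 1
    return weights @ embeddings                             # reweighed vectors y

# toy 5-word sentence, each word as a 4-dimensional embedding
v = np.random.rand(5, 4)
y = simple_attention(v)   # same shape as v, but each row now carries context
```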
That is the basis of the attention mechanism in natural language. Before the transformer, the most popular model family for natural language was the LSTM, so I want to spend some time on it. An LSTM is a recurrent neural network. A traditional RNN has a single hidden state that is passed through time, which makes it difficult for the network to learn long-term dependencies. The LSTM solves this problem by introducing a memory cell, a container that can hold information for an extended period, which makes it capable of learning long-term dependencies in sequential data. At the time it worked great for language translation, speech recognition and time-series forecasting. However, there is a problem: LSTMs are slow and less efficient than transformers due to their sequential nature, meaning that during the training phase you need to pass in the sequential data one step at a time. On the right-hand side is the transformer architecture, and you can see it is very different from the LSTM. Transformers can capture global dependencies in long sequences more effectively than LSTMs and GRUs, and as a result the transformer is much better at parallelizing computation. The intuition, as in episode one, is that when we have a sentence we don't have to pass in one word at a time; we feed everything in as input embeddings at once, which makes the computation a lot more efficient, and because we can parallelize it, it scales much better than an LSTM. Of course, there are other ingredients that make the transformer work, and positional encoding is one of them. As you can see on the right-hand side: input, then input embedding, and the next step is to add positional encoding. So what is positional encoding? As we discussed before, the self-attention mechanism captures dependencies and context: "date" normally refers to time, but when we know the sentence also contains the word "fruit", the model captures that context and "date" most likely refers to the edible fruit. However, in language, order matters. Here is the most classic example, one you will see everywhere: "Even though she did not win the award, she was satisfied" versus the almost identical sentence "Even though she did win the award, she was not satisfied." They have the same characters and the same words, but the order is different and the meanings are completely different. So besides the mechanism that captures context, we obviously need a way to know position. Previously we knew the relative position from the index of the word embedding, but that is not enough; we need something similar to an embedding for position. That is the positional encoding. A naive approach is to simply use the position of the word as the encoding. For example, e0 is the word embedding of the first word and p0 is its positional encoding; because it is the first word, the index is zero, so p0 is all zeros. Similarly p1 is all ones, and pn is all n's. This does not work, because the values grow without bound: imagine the sentence is 100 words long; then en + pn, the new input for the nth position, will be a lot larger than e0 + p0.
The result would be completely skewed. So we need bounded min and max values, while still being able to differentiate the position information. Let's try something else and include normalization: p0 is all zeros, p1 is 1/n, and pn is 1, so all values are between zero and one. But this doesn't work perfectly either. The caveat is that the encoding for the same word combination will differ when the sentence length differs. Here is an example with two sentences: "I am a dad", which is four words long, and "I am Martin", which is three words long. In these two sentences the positional encodings of the common sub-sequence "I am" will be different, and we don't want that; ideally they should be the same. So we need another property: the encoding ideally should not depend on the length of the sentence. What is the actual solution, then? The proposed solution is very clever. These are the two formulas of the positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Here pos is the position of the word, so for the first word it is 0, for the second word 1, and so on. d is the dimension of the word embedding; in this example our word embeddings have five dimensions. 2i refers to the even indices of the positional encoding within those d dimensions (0, 2, 4), and 2i+1 to the odd indices. For the even indices you use the sine formula, and for the odd indices the cosine formula. Just to clarify, all the values I put on the slide are fake, just random numbers for illustration. Now we combine the semantic embeddings e0, e1, e2, e3 with the positional encodings p0, p1, p2, p3, and we have positionally aware semantic embeddings, which is exactly what we want: the perfect input for the transformer model. Then we are ready to dive deeper into what actually happens inside the transformer and how the attention mechanism uses these positionally aware semantic embeddings. Here is the intuition. The graph on the right-hand side shows, for 2i = 0, 2i = 2 and 2i = 4, similar sine curves with different frequencies, and they have the properties we want. They have bounded min and max values, 1 and -1. They can differentiate encoded position information: even if two word positions happen to have the same value at 2i = 0, their values at 2i = 2 and 2i = 4 will be different, so the overall positional encodings differ. And they do not depend on the length of the sentence: for "I am Martin" and "I am a dad", the common prefix "I am" gets the same positional encoding. A small sketch of this encoding is shown below.
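A minimal sketch of the sinusoidal scheme, assuming a toy model dimension of 8; the function name and the assertion at the end are mine, added only to illustrate the length-independence property.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000**(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))"""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]                 # word positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dims: sine
    pe[:, 1::2] = np.cos(angle[:, : d_model // 2])    # odd dims: cosine
    return pe

# the encoding of the common prefix "I am" is identical regardless of sentence length
pe_short = sinusoidal_position_encoding(3, 8)   # "I am Martin"
pe_long = sinusoidal_position_encoding(4, 8)    # "I am a dad"
assert np.allclose(pe_short[:2], pe_long[:2])
```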
Let's do a quick recap. In the first episode I went through a simplified attention mechanism using some examples. I started by smoothing a time series with a normal-distribution-like filter, and then I moved from time series to natural language: I want each word in the sentence to get some context from the other words, just as in the time-series case. I used the example "Date is my favorite fruit", where the v vectors are the word embeddings, and after we go through the attention mechanism we get the y vectors, which take more context into account than v, including the positional encoding that we covered in the second episode. All of this is achieved by the attention mechanism, and here is the attention formula from the original paper. Now let's dive deeper into what actually happens in this formula. Still recapping: when we transform v into y, we take dot products to get weights, normalize them so they sum to one, and then apply the weights to reweigh the vectors and get y1; we do the same for y2 through y5. Let's modularize this a little, and this time I'll use a shorter example, "I am Martin". Each of these three words has a word embedding, v1, v2 and v3, and let's assume each embedding is 1x4 in dimension. Say we want the third word, "Martin", to get context from the whole sentence. Here I want to use a database analogy: queries, keys and values. I want a context-aware attention representation of the third word, so v3 is the query. To get what I want for v3, I go through a set of key-value lookups, as in the database analogy: the first key maps to the first value, the second key to the second value, and so on. I understand this might not mean a lot yet, but it helps explain why we call these queries, keys and values. So with v3 as the query, we do what we did in the earlier calculation: first the dot products, v1·v3 gives r1, v2·v3 gives r2, v3·v3 gives r3, which are three real numbers. Then we normalize those three numbers so the weights sum to one. Then we apply the weights to reweigh the whole set of v vectors and get y3. We can go through the same operation to get y1 and y2, the contextualized attention representations of "I" and "am". That is the modularized view of the previous calculation. Now let's add more to it. As you can see, so far there are no learnable weights at all, and if you have worked in machine learning, whether classical or deep learning, you know that everything is about learning the correct weights. So let's add some: we introduce three weight matrices, MK, MV and MQ, one for the query, one for the key and one for the value. These weights give the model the capacity to learn and generalize. Just for simplicity, let's say each matrix is 4x4, so when we multiply the original 1x4 vector by a 4x4 matrix it stays 1x4 and the dimensions don't change. This is only a simplification; in reality this is a linear matrix multiplication.
In an actual production implementation this is called a linear layer. When we go through a linear layer, whether here or later after the concatenation (which I will cover shortly), the dimension usually changes, either increasing or decreasing; but for simplicity let's say it is 4x4 here so the dimension stays the same. This doesn't change the picture much, it just turns the lowercase v into a capital V, and now we have weights, so the model can generalize and the weights carry the learned information. Let's simplify the view a little more. Now we have this graph: the word embeddings go through three linear layers, which are the matrix multiplications with MK, MQ and MV, and we get keys, queries and values. For the keys and queries we do a matrix multiplication and then a normalization. In the last slide, for simplification, we just normalized so the weights sum to one, but in reality we do a softmax and a scaling, and we also apply a mask during inference. After we get the attention scores, we do another matrix multiplication with the values matrix and get the y vectors. As you can see, this is basically the attention formula: QK^T is this part, the softmax and scaling is this part, and then you do another multiplication with V. Here is a more detailed view of the whole attention block from the transformer paper. There are multiple attention blocks, and don't worry, I will go through each of them later. Looking inside the attention block, you can see linear layers for V, K and Q as we discussed, then the scaled dot-product attention, which is the matrix multiplication, then the scaling, then the mask used for inference (which we haven't discussed yet), then the softmax, and then a matrix multiplication with V. Afterwards we go through concatenation, because it is multi-head, and a final linear layer to bring the dimension back down. So now we have gone through the mathematical details of how we get the attention output with K, Q and V. There are still several questions. First: why do we scale, i.e. why divide by this number? The answer is that the key and query vectors can be very large, so their dot products can be disproportionately large, and during training such large numbers can lead to vanishing or exploding gradients that ruin the training, making it learn too fast or too slow. If you want to learn more about those two concepts, you can look them up on your own. Second: why this specific number, sqrt(dk), as the scaling factor? Mean zero and variance one is the standard normal distribution, and if we assume q and k are dk-dimensional vectors whose components are independent random variables following the standard normal distribution, their dot product has mean zero and variance dk. We want the distribution to stay standard, so the variance should equal one and the values won't be too scattered; to get that variance we divide by sqrt(dk). That is the mathematical reasoning behind it. Last: why softmax? This one is pretty standard: after the calculation we get a bunch of attention scores, and softmax transforms them into a probability distribution that sums to one, which is very convenient for us. A small sketch of this scaled dot-product attention is below.
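Here is a hedged sketch of the scaled dot-product attention just described, with made-up weight matrices standing in for the learned linear layers MQ, MK and MV; the function names and toy shapes are my own, not from the video.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    with Q, K, V obtained from the input through learned linear layers."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot-product similarities, scaled
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V                   # reweigh the values

# toy example: 3 tokens ("I am Martin"), embedding size 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = scaled_dot_product_attention(x, Wq, Wk, Wv)   # shape (3, 4)
```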
We've talked about the attention mechanism and we know it can capture context and relationships between tokens. However, natural language is a very complex problem and usually has multiple aspects of context and relationship. For example, in the sentence "Date is my favorite fruit", we know date is a fruit, we know it is my favorite fruit, and we know the concept of "favorite" applies to "date". There is a lot more: for example, the sentence probably has a neutral or even neutral-to-positive mood, since we are talking about our favorite thing. So there are multiple layers of context and relationship in even one simple sentence. Are we only going to use one type of attention to capture all of it? The answer is no. We usually use different sets of parameters to capture different types of context and relationship, and sometimes we don't even know what type of relationship it is; we just use a different set of parameters. That was an example for natural language; next I'll show one for computer vision. The transformer is a very powerful tool, and it is not only useful for natural language; we also use it for computer vision. Here is an example of different attention mechanisms as intuition: we have an image with a full background, and there are three attention filters. The first attention filter figures out who the person is (this is Mulan), the second focuses on figuring out what the sky is, and the third focuses on the mountains. That is another intuition for attention and why we need multiple different kinds. This is the attention block I went through before: with the input word embedding plus position embedding, we go through linear layers; for keys and queries, after the linear layers, we do a matrix multiplication, then softmax and scale (ignore the mask for now, I'll talk about it in the next section), and then another matrix multiplication with the values to get the attention output. The question is: how do we introduce multi-headedness into this? This is the multi-head attention block. We have the same input and the input dimension doesn't change, but it now goes through multiple heads: the first is head 1 and the last is head H, where H is the number of heads. Each head focuses on capturing one specific kind of relationship. Each head still goes through the linear layers to get keys, queries and values, still does the matrix multiplication, the normalization, and the matrix multiplication with the values. But now that there are H heads, we have H y vectors, each with the same dimension as V. As I said before, we want the output dimension to be the same as the input dimension so that we can stack hundreds of attention blocks, yet this output dimension is H times the input. How do we solve it? We first concatenate all the results, appending them one after another to get y vectors of dimension n times H, and then we go through a linear layer again, whose only job this time is to remap the dimension back to n, giving y1 through yn. A small sketch of the whole multi-head computation is below.
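A rough sketch of the multi-head version, assuming 2 heads over an 8-dimensional toy embedding; splitting the model dimension evenly across heads and all variable names are my own assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Run H independent scaled dot-product attention heads,
    concatenate their outputs, and project back to the model dimension."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]   # each (seq_len, d_head)
        scores = Q @ K.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V)            # one head's contextualized output
    concat = np.concatenate(heads, axis=-1)          # (seq_len, n_heads * d_head)
    return concat @ Wo                               # final linear layer back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 3, 8, 2
x = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(n_heads, d_model, d_model // n_heads))
Wk = rng.normal(size=(n_heads, d_model, d_model // n_heads))
Wv = rng.normal(size=(n_heads, d_model, d_model // n_heads))
Wo = rng.normal(size=(d_model, d_model))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)   # shape (3, 8)
```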
Note that this y1 is actually different from the per-head y1 earlier; it is a different vector with the same dimension as V, but it comes from the multi-head combination. So how do we use a linear layer to drop the dimension from n times H back to n? Before showing that, a reminder: previously we went through the scaled dot-product attention and the linear layers at its input; this time we focus on the concatenation and the linear layer after it. The way a linear layer works is this. A linear layer goes by many names; I've heard dense layer and fully connected layer, and they all refer to the same concept. Say y is the input and b is the output of the layer; then every input unit is connected to every output unit (and in between there can be multiple fully connected layers). In this way we can map the input to the output and also change the input dimension to whatever output dimension we want; in this example we map a 5x1 vector to a 3x1 vector. So this is how a linear (dense, fully connected) layer brings the concatenated dimension back to the original dimension we want. On the right-hand side is the transformer encoder-decoder architecture, and one important thing to know is that training and inference behave a little differently, at least for this encoder-decoder architecture. The main difference is in the decoder, the right-hand part. During training we know the entire output sentence, so we can pass it to the decoder all at once. For example, if we are training a translation service on "I am Martin", we can pass "I am Martin" to the encoder inputs and the target translation to the decoder outputs, and we can pass a whole batch of training translations through those inputs and outputs during the training phase. During inference, however, the decoder must generate one word at a time in an autoregressive manner: the decoder uses the previously predicted words to help predict the next word, which means you output one word at a time. This is just the nature of the application: whether the decoder is doing translation or text generation, whatever has already been generated is already out there and you are not going to change it. The way to achieve this is masked attention. The masked attention module makes sure each token in a sequence only attends to previous tokens and itself, not to future tokens: each generated token depends only on itself and the tokens before it, never on anything in the future. Here is an example. Obviously this is an oversimplified version of the token embedding details; there is a start-of-sentence token and, when generation ends, an end-of-sentence token, but this is the simplified version. Say we want to translate "I am Martin" into Chinese. The encoder input is "I am Martin"; we add the positional encoding, then go through multi-head attention, add & norm, feed forward, and another add & norm, which I will cover in later episodes. The encoder part is pretty straightforward and we have covered it in previous episodes. The decoder, as I mentioned, does not receive the whole output sequence, because this is not training data; we are trying to translate an unknown sequence, which is "I am Martin" here.
So the first decoder input is a start-of-sentence token; it goes through the decoder and the output is the first translated word, the Chinese "wo" ("I"). Then we pass this token back into the decoder input and get the next generated Chinese character, "shi" ("am"). Similarly, we pass the combined "wo shi" into the decoder input and get the next generated token, the Chinese rendering of "Martin". This is the oversimplified version of how the decoder works at inference time. Now we can get into a little of the mathematics, which leads to a very important concept: the KV cache. During autoregressive inference, at each generation step we recalculate the attention for the same previous tokens. First assume we have no cache and are not storing anything in memory. Hopefully you are familiar with the math by now; I have gone through it in the previous episodes. The first token comes in and we compute QK^T times V (ignoring the scaling and softmax for simplicity): QK^T is a matrix multiplication, then another matrix multiplication with V, and you get the output. The size of this calculation is 1 x embedding_size for Q1, embedding_size x 1 for K1 transposed, giving a 1 x 1 score, and then 1 x embedding_size for the output. When the second token arrives, because you are not storing anything, you have to pass the whole sequence into the decoder: Q1 and Q2, and to get the whole attention matrix you need K1, K2 and V1, V2. When this passes through the masked attention module, instead of exposing all four attention scores, a mask hides all the future tokens. In this case the entry for Q1·K2 is masked out, because from the first token's point of view K2 is in the future and you should not have access to it. The mask guarantees that output o1 is generated only from token 1, and o2 only from token 2 and the previous tokens. Then the third query token comes, you have to pass in the whole sequence again, redo the multiplications, and apply the mask, which blanks out the upper-right part of the matrix, and you get three outputs. As you can see, as time goes on, the size of the calculation grows quadratically: at t = 1, t = 2, t = 3 the attention matrix has dimension t squared. Now let's do inference with a KV cache. As you can imagine, the KV cache simply caches the key and value embeddings. The first step with Q1 is the same. When Q2 comes, all you need for o2 is its row of QK^T, and for that calculation you only need K1 and V1 from before; if they are already cached (say green means cached), you can focus on Q2 alone and you don't need to recompute anything for Q1. A small sketch of this incremental decoding is shown below.
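Here is a minimal sketch of autoregressive decoding with a KV cache, under the same simplifications as the video (single head, made-up shapes, no extra masking needed because only past keys exist in the cache); the point is only that each step computes one new query and reuses the cached keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_with_kv_cache(token_embeddings, Wq, Wk, Wv):
    """Autoregressive decoding with a KV cache: for each new token we compute
    only its own query/key/value, append K and V to the cache, and attend over
    the cached keys/values instead of recomputing the whole prefix."""
    d_k = Wk.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for x_t in token_embeddings:            # one new token per step
        q_t = x_t @ Wq                      # query for the new token only
        k_cache.append(x_t @ Wk)            # cache this token's key ...
        v_cache.append(x_t @ Wv)            # ... and value
        K = np.stack(k_cache)               # (t, d_k) keys seen so far
        V = np.stack(v_cache)
        scores = q_t @ K.T / np.sqrt(d_k)   # causal by construction: only past keys exist
        outputs.append(softmax(scores) @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))                           # three decoding steps
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = decode_with_kv_cache(emb, Wq, Wk, Wv)             # shape (3, 4)
```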
So you store K1 and V1 in memory, you compute QK^T, which is now 1 x 2, multiply it by the value matrix, and you get the next output. Similarly with Q3: when a new query token comes, since you have already stored the previous keys and values, you can compute its QK^T row and o3 directly. Intuitively, the complexity of the matrix calculation is no longer quadratic; it is linear in t (t = 1, t = 2, t = 3, and so on). This drops the cost of the matrix computation a lot, which is what matters in machine learning: FLOPs, the number of floating-point operations. So here is the KV cache trade-off. The plus is that the number of floating-point operations drops from O(t^2) to O(t) in the length of the sequence. The downside is that we need extra memory to store the K and V embeddings. The amount of memory we need is 2, multiplied by the number of tokens, multiplied by the number of layers, multiplied by the number of KV heads, multiplied by the KV dimension. The 2 is because you have to store both keys and values. The number of tokens should be easy to understand, and KV heads and KV dimensions we went through in a previous episode. The number of layers is something people usually ignore: the networks I show here are all simplified, but in reality each attention module and feed-forward module is multi-layered, so you have to account for the number of layers to store all the Ks and Vs. Next, KV cache eligibility: when can you use a KV cache? I have already partly covered this when introducing it: you need a causal relationship, meaning a new token's attention calculation depends only on itself and previous tokens. If it depends on future tokens, you cannot use a KV cache. For example, there is BERT, the famous (at least it used to be famous) encoder transformer architecture; if you use BERT for natural language processing, you will see the output change as it processes more tokens, because it is not generative and the representation of a token is not fixed; it changes as the model processes the whole input. In that case we cannot use a KV cache. Most of today's generative AI models, for example GPT and DeepSeek, are decoder-style generative architectures and they all have a KV cache component; DeepSeek's is slightly different, but it uses the same mechanism, which I am going to cover in another series. Now, let's go through the different attentions in the classic transformer encoder-decoder architecture. As I covered before, the encoder has a self-attention module, the decoder has a masked multi-head attention, and there is another cross-attention multi-head attention. Yes, I think I have already said "attention" at least ten times, so hopefully this is not confusing; I will go through each of them one by one in the later slides. First of all, I want to build intuition about the attention mechanism, using this graph I have used a lot in my previous slides. This is a modularized view of what the attention module does: it is basically a matrix multiplication, a dot product between keys and queries after the linear transformations, followed by a normalization, which is basically softmax and scale.
For masked attention there is also a mask, so it is basically softmax plus scale plus mask, and then we do another matrix multiplication with the values after their linear transformation and get the output. The intuition: matrix multiplication number one is the dot-product (cosine-similarity-like) score between key and query; the normalization is the softmax that maps the similarities into a probability space, and the scale prevents gradient vanishing or explosion; and the second matrix multiplication applies that similarity distribution to reweigh the values. So think of it as: first, cosine-similarity-style dot products; second, a set of normalizations that map them into a probability space; third, a matrix multiplication that applies those probabilities to reweigh the values. That is the intuition of attention. Let's start with the first, basic attention. Let me remind you that the transformer is based on the concept of self-attention, but before self-attention there was just attention; attention is a concept that has been around for a long time. Say we have two sentences. The first, "I have a white cat", is the query, and the second is "I love little animals". We want to know the similarity scores, the attention scores, between these two sequences; we can calculate them from their word embeddings with dot products and we get this matrix. That is basic attention: getting similarity or attention scores between two sequences using dot-product (cosine-like) similarity. Then we have self-attention. What is the difference between basic attention and self-attention? In self-attention the query equals the key; that is why it is called self-attention. Then we have cross-attention in the decoder. The cross-attention layer computes similarity scores between the input and the output, so the keys and values come from the encoder. Say the input is "I am Martin" and we are translating it into Mandarin: the encoder takes "I am Martin" and, after a linear layer, produces the key and value matrices. The decoder, say it has already translated "I am" into "wo shi", uses the decoder input and a linear transformation to get the query matrix. Then we do the same dot-product similarity between the output sequence and the input sequence, and after normalization and reweighing we get the next translation, which is "Martin" in Chinese. Cross-attention establishes the relationship between the input sequence and the generated output: the similarity is computed between the decoder query and the encoder key, and that similarity is applied as weights to the encoder values, since you want the attention score between encoder and decoder. In this example the encoder side is "I am Martin" and the decoder side is "wo shi", the translation of "I am". Okay, the next concept is masked attention. A causal mask is applied to the attention mechanism to ensure the model predicts each token in a sequence based solely on the previous tokens, not on any future tokens. For generative models, say GPT or DeepSeek, the token being generated depends only on previous tokens, not future ones, which means that in the attention matrix we need a mask that masks out future keys and values. The sketch below shows how such a causal mask is built and applied.
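A minimal sketch of applying a causal mask before the softmax, with made-up shapes and names; masked positions are set to negative infinity so they receive zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Self-attention with a causal mask: score entries for future positions are
    set to -inf before the softmax, so each token attends only to itself and
    the tokens before it."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    t, d_k = K.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)   # upper triangle = future tokens
    scores = np.where(mask, -np.inf, scores)           # hide future keys
    return softmax(scores) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                            # 3 tokens, embedding size 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = causal_self_attention(x, Wq, Wk, Wv)
# row i of the attention weights puts zero weight on every position j > i
```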
In this case, if the query is the first token, we don't take anything from the second or third; for the second query we don't take anything from the third; and for the third query we can use everything. So we basically apply a mask. This is what enables the KV cache to speed up the transformer: we can cache keys and values at the cost of memory, and for the pros and cons you can refer to my previous KV cache episode. So now we use the KV cache and masked attention to optimize transformer performance, and then we hit another bottleneck: the KV cache memory size, especially as sequences get longer and longer, because the memory grows linearly with the sequence length. Even with the KV cache and masked attention, engineers and scientists keep thinking about ways to address this bottleneck: can we reduce the KV cache without hurting quality? That is where grouped-query attention and multi-query attention come in. Let's start with multi-query attention, which is the extreme case: it shares a single key head and a single value head across all query heads, which significantly reduces memory usage but also impacts the accuracy of attention. In classic multi-head attention, all the key and value heads are cached in memory; in multi-query attention we reduce them to one key head and one value head, which obviously shrinks the memory but also hurts quality. Grouped-query attention sits somewhere between classic multi-head attention and multi-query attention: it is an interpolation between MHA and MQA where a single pair of key and value heads is shared by a group of query heads rather than by all of them, so the quality impact is smaller while it still saves a lot of memory. You can merge two heads into one, or three into one, or group query heads based on some similarity score; many methods can be used, and if you want the details you can refer to my DeepSeek technical report episode, which contains a lot of information about GQA and MQA. The last concrete example I have is MLA, which is one of the reasons DeepSeek can achieve great quality at very low cost. They use what they call MLA, an attention mechanism that compresses and projects the keys and values into a lower-dimensional space; they use some very smart tricks, like a customized operation with RoPE, and then concatenate compressed keys with non-compressed keys to get the actual attention. It is a very heavily customized attention mechanism, and it is just one example: there are many variants of the classic attention mechanism, each with different customizations to solve different performance or quality problems. So don't be surprised if you see more and more transformer variants. Just keep the original concept in mind: a matrix multiplication gives you dot-product (cosine-like) similarity scores, a normalization maps them into something you can use later, and then you apply that similarity distribution to reweigh the values. That is the basic intuition of attention. A rough sketch of the key/value head sharing behind MQA and GQA is shown below.
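As a rough illustration of the head-sharing idea behind MQA and GQA (not any specific model's implementation), here is a sketch where several query heads share one key/value head; all names, shapes and the grouping rule are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Head-sharing sketch: n_q_heads query heads share n_kv_heads key/value heads.
    n_kv_heads == 1 is multi-query attention; n_kv_heads == n_q_heads is classic MHA."""
    d_head = Wq.shape[-1]
    group = n_q_heads // n_kv_heads          # query heads per shared KV head
    outputs = []
    for qh in range(n_q_heads):
        kvh = qh // group                    # which KV head this query head maps to
        Q = x @ Wq[qh]
        K = x @ Wk[kvh]                      # shared key head
        V = x @ Wv[kvh]                      # shared value head (only these get cached)
        scores = Q @ K.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
Wq = rng.normal(size=(4, 8, 2))              # 4 query heads
Wk = rng.normal(size=(2, 8, 2))              # only 2 KV heads to cache
Wv = rng.normal(size=(2, 8, 2))
y = grouped_query_attention(x, Wq, Wk, Wv, n_q_heads=4, n_kv_heads=2)   # (3, 8)
```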
Now let's move on to the feed-forward network. First of all, why is it named "feed forward"? The name is a contrast with another concept, the recurrent neural network. Before the transformer, recurrent networks were mainly used for machine translation, text recognition and so on, and there is a reason for that: a recurrent neural network can process information in loops, which allows information from previous steps, from the past, to be fed back into the network. That is great for sequential data because it holds some memory, so for text recognition and machine translation this kind of model worked well, and you must have heard of the LSTM, which is a variant of the RNN. In contrast, in a feed-forward neural network the information flows in a single direction with no cycles, which is good for tasks that don't require memory of past data, say image recognition. Of course, this was before the transformer; as we have already seen, the transformer is quite creative in using self-attention plus positional encoding so that we keep the whole context but still compute in parallel instead of sequentially. This is the feed-forward architecture: it is the simplest deep neural network you will see. It has an input layer, an output layer, and one or more hidden layers; the first layer is the input, the last layer is the output, and everything in between is hidden layers. Let's look at the feed-forward network inside the transformer. Inside the transformer you can see two blue blocks that are feed-forward networks, each followed by an add & norm, and today we will also talk briefly about add & norm. First, the basics. The formula of the feed-forward network is FFN(x) = max(0, xW1 + b1)W2 + b2. This is basically two linear transformations: the first is xW1 + b1, and the second multiplies the whole thing by W2 and adds b2, with a ReLU activation function in between. What is the ReLU activation function? It is a non-linear activation function used in neural networks: ReLU(x) = max(0, x), the larger of zero and x, and it is applied between the two linear transformations. The intuition behind this formula (thanks to Gemini for making this part easy): first, non-linearity. The first fully connected layer projects the input into a higher-dimensional space, which allows the network to learn more complex relationships, and then it goes through the activation function, which is where the non-linear, complex relationships get applied. Then comes the dimensionality reduction in the second fully connected layer, which projects the data back to its original dimensionality for compatibility with the next layer. After these two steps the network has learned a lot more, not only because we expanded the dimension in the first layer but, more importantly, because ReLU ensures it learns non-linear relationships; remember, in our attention block everything is linear, so this is the place where non-linear relationships are learned. It also has the advantage of parallel processing: since each position in the sequence is processed separately through the FFN, the computation can be parallelized efficiently. A small sketch of this position-wise feed-forward block is below.
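A minimal sketch of the position-wise feed-forward block, with a made-up expansion factor of 4 for the hidden layer and toy shapes.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: FFN(x) = max(0, x W1 + b1) W2 + b2.
    Two linear transformations with a ReLU in between; each position in the
    sequence is processed independently, so the whole batch runs in parallel."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand the dimension + ReLU non-linearity
    return hidden @ W2 + b2                 # project back to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # hidden layer is wider than the model dim
x = rng.normal(size=(3, d_model))           # 3 token positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)         # shape (3, 8), same as the input
```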
One thing that is not in the graph but is a pretty standard practice is dropout. Dropout is a regularization technique used to prevent overfitting, which is very common in machine learning, by randomly dropping out, that is, setting to zero, a certain percentage of neurons during training. This forces the network to learn more robust features and not rely too heavily on any single neuron, which also improves its ability to generalize to new data. In summary, it reduces overfitting by preventing neurons from co-adapting too much, and it improves generalization by forcing the network to learn features that are not dependent on any specific neuron. So after the feed-forward block we usually apply dropout, and then comes add & norm. Add & norm is not part of the blue block, but it follows it. The "add" part means a residual connection, which is also a common practice in deep learning. Take a look at the graph on the right-hand side: x is some input, part of it goes through the deep neural network, including the ReLU, and produces F(x). Instead of passing F(x) directly to the next layer, we also add the identity function, which is just a fancy way of saying x itself. This helps a lot in mitigating the vanishing gradient problem for deep neural networks; the idea originally comes from ResNet and is now widely used in deep networks such as the transformer. It enables deeper architectures by resolving the vanishing gradient problem, which gets more serious as the number of layers increases, and it does so by allowing gradients to bypass intermediate layers, adding the identity to the transformed values. So in the diagram, part of x goes directly into the add & norm and part goes through the feed-forward block, and then we add them. That was the "add" part; now let's talk about "norm". In the transformer, and in sequential models and LLMs in general, "norm" refers to layer normalization. Layer normalization is a technique that helps stabilize and accelerate training by normalizing the inputs to each layer, ensuring the model processes information consistently regardless of the input scale or distribution. I know that doesn't mean much yet, so let's use the next part to talk about layer normalization and compare it with other normalizations. What is normalization? From Wikipedia, normalization means adjusting values measured on different scales to a common scale. If you search for normalization, you will most likely see two common methods. The first is min-max normalization: the numerator is x minus the minimum of x, and the denominator is the maximum of x minus the minimum of x, which rescales the values of x. The second is z-score normalization, which not only rescales but also re-centers: it normalizes your distribution towards a standard distribution by subtracting the mean and dividing by the standard deviation. If you had a statistics course at university, you have most likely seen these concepts; if you have forgotten them, look them up, they are relatively simple to pick up. The two are sketched below.
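A quick sketch of the two formulas, with a made-up four-value example.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Min-max normalization: rescale values into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x: np.ndarray) -> np.ndarray:
    """Z-score normalization: rescale AND re-center to mean 0, std 1."""
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
print(min_max_normalize(x))   # approximately [0, 0.33, 0.67, 1], bounded to [0, 1]
print(z_score_normalize(x))   # mean 0, standard deviation 1
```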
So why normalization? In deep learning it has many benefits: it helps stabilize the training process by ensuring features are on a similar scale, preventing features with large values from dominating the learning, which leads to faster convergence and improved model performance. First, faster convergence: by scaling features into a similar range, gradient-descent optimization can update the weights more effectively, leading to quicker training. Second, improved generalization: normalization can help the model generalize better to unseen data by reducing sensitivity to feature scales. Third, stability in gradient calculations: when features have vastly different scales, backpropagation can produce very unstable gradients, and normalization helps mitigate both vanishing and exploding gradients. Fourth, and very important, it reduces internal covariate shift; this is especially relevant for batch normalization, where normalizing the activations within each layer helps maintain a stable distribution throughout the network, which is crucial for deep architectures. I'll go through more details with an example. Here is how internal covariate shift can happen: say your training data consists of photos of actual cats and dogs, which follow some distribution, but at test time it turns out all the test data are cartoons, whose distribution is different from your training data. Without normalization it is very likely the model won't perform well on the test data because of this internal covariate shift; it can confuse the model a lot. I took this example from a website on where to apply normalization in deep learning. Usually you can normalize the input data, i.e. the data you feed into the neural network can be normalized before entering the network; this is basically preprocessing. The other place, more relevant to what we are talking about today, is the activations: we can normalize the outputs of the hidden-layer neurons, which is often done to stabilize and accelerate training, especially in deep networks. For that, let's go through batch normalization. For simplicity, I pulled out a single layer with four neurons from a deep neural network, so it has four neurons representing four features, x1, x2, x3, x4, and a mini-batch with three samples; the sample values are just random numbers I came up with. Batch normalization means we compute the mean and standard deviation for each row, i.e. for each feature across the mini-batch: for x1 the mean is 5.3 and the standard deviation is 4.9, and we do the same calculation for x2, x3 and x4. So the normalization happens across the mini-batch, independently for each feature. After we get the mean and standard deviation for all features, we apply them to the activations and get normalized values: the normalized value is the z-score, the original value minus the mean divided by the standard deviation. After that we apply additional learnable scaling and shifting parameters, as sketched below.
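A minimal batch-normalization sketch over a toy 3-sample, 4-feature mini-batch; the numbers are random placeholders, not the ones from the slide.

```python
import numpy as np

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of shape (n_samples, n_features):
    mean/std are computed per FEATURE across the batch, each feature is
    z-scored, then scaled (gamma) and shifted (beta) by learnable parameters."""
    mean = batch.mean(axis=0)               # one mean per feature (column)
    std = batch.std(axis=0)
    normalized = (batch - mean) / (std + eps)
    return gamma * normalized + beta

batch = np.array([[1.0, 2.0, 0.5, 3.0],     # 3 samples, 4 features (x1..x4)
                  [4.0, 0.0, 1.5, 2.0],
                  [11.0, 6.0, 2.5, 7.0]])
gamma = np.ones(4)                          # learnable scale, initialized to 1
beta = np.zeros(4)                          # learnable shift, initialized to 0
print(batch_norm(batch, gamma, beta))       # each column now has mean ~0, std ~1
```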
Each node, or neuron, has learnable parameters gamma and beta for the scaling and shifting: gamma is the scaling and beta is the shifting, and both are learned. So why is batch normalization not used for transformers? Batch normalization has its own limitations. In batch normalization, the mean and standard deviation are calculated within the current mini-batch, but when the batch size is small, the sample mean and sample standard deviation are not representative enough of the actual distribution. Sequential models tend to use smaller batch sizes because the sequences are very long; this is usually done to let the model update its parameters more frequently on smaller subsets of data. The transformer is very popular among sequential models and often runs with small batch sizes, so batch normalization doesn't work very well. To be more specific, we usually do padding, adding zeros to make the input sizes equal for self-attention; the padding is not part of the original data and can mislead the model a lot. Here is an example: we have two inputs, the first is "Hello!" and the second is "My name is Martin". Because the dimension is fixed for the self-attention matrix, we have to add two paddings, all zeros, to the first input, and if we normalize across the batch these unnecessary zero values get included and confuse the model. So what is the solution? The solution is layer normalization, which is very popular for sequential data. Again I'll use the example of a layer with four neurons representing four features, and random sample outputs within the mini-batch. This time, the mean and standard deviation are computed across the features, independently for each sample: we get the mean and standard deviation of all the features of the first sample, then of the second sample, then of the third. In this way the statistics no longer depend on the rest of the batch or on its padding. We apply the same kind of z-scoring to the activations to get normalized values, and then, just as in batch normalization, we apply the learnable scaling and shifting. That is layer normalization. Next, I want to talk a little about RMS normalization, a relatively new normalization method that is catching on in popularity. RMSNorm normalizes the activations by dividing them by the root mean square of the activations in each layer. Unlike layer norm, RMSNorm typically does not center the activations by subtracting the mean before normalizing; it does include scaling by a learnable parameter, but there is usually no shifting parameter. So RMSNorm only rescales, it does not re-center at all. Compared with batch norm and layer norm, its normalization process is simpler, without the overhead of calculating the mean and variance, and the RMSNorm authors argue that the scaling factor matters more for stabilizing training than the shifting factor, which is why they drop the shift. As a result, RMSNorm is preferred in many scenarios where computational efficiency matters. The sketch below contrasts layer normalization with RMS normalization.
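A small sketch contrasting the two, with made-up inputs; note that layer norm re-centers and rescales per sample, while RMS norm only rescales.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: mean/std per SAMPLE across its features,
    then learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMS normalization: rescale by the root mean square only;
    no mean subtraction and no shift parameter."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[2.0, 4.0, 6.0, 8.0],    # 2 samples, 4 features each
              [1.0, 1.0, 1.0, 1.0]])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))      # each row re-centered and rescaled
print(rms_norm(x, gamma))              # each row only rescaled, mean not removed
```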
Many recent large language models, DeepSeek included, use RMSNorm, so it is definitely catching on. This is the formula for RMSNorm, and you can see it is a lot simpler; there are no extra steps. a_i is the activation of the i-th neuron. You first compute RMS(a), which is the square root of the sum of all the a_i squared divided by n, where n is the total number of features, i.e. the total number of neurons in that layer. Then you divide each activation in that layer by it, and apply the scaling factor g_i, the learnable parameter I mentioned. So this is RMS normalization, and it already looks a lot simpler than layer or batch normalization. I hope this helps. Thank you everyone. If you like my video, please subscribe, like or comment.

最新摘要 (详细摘要)

生成于 2025-06-15 21:30

概览/核心摘要 (Executive Summary)

本视频由一位谷歌工程师主讲,旨在深入解析作为大型语言模型(LLM)和现代AI基石的经典Transformer架构。内容首先通过时间序列平滑的例子引入“注意力”概念,阐释其如何通过加权赋予数据点上下文感知能力。接着,介绍了词嵌入技术,即将词语转换为能反映语义相似性的向量。为解决词序问题,引入了基于正弦/余弦函数的位置编码。

视频核心部分详细拆解了注意力机制,包括其关键组件——查询(Query)、键(Key)和值(Value),以及多头注意力机制如何捕捉多维度信息。针对Transformer解码器的自回归特性,讲解了确保因果依赖的掩码注意力,以及通过缓存键值对大幅提升推理速度的KV缓存技术。此外,还探讨了不同注意力变体(如自注意力、交叉注意力及MQA/GQA等优化)的应用。最后,阐述了前馈神经网络(FFN)在引入非线性和并行处理中的作用,以及残差连接与层归一化(如Layer Norm, RMS Norm)对于模型训练稳定性和性能的重要性。

注意力基础概念 (Attention Basic Concepts)

Speaker 1首先通过一个看似无关的例子——平滑噪声时间序列——来引入注意力的核心思想。

  • 时间序列平滑:

    • 一个含噪声的时间序列向量 x (包含 x0 到 x5 六个数据点)。
    • 引入一个“重加权函数 (reweighing function)” W3(类似正态分布函数),它会过滤掉远离 x3 的数据,并放大临近的数据。
    • 通过将 W3 与 x 的对应元素相乘再求和,得到新的值 y3。公式为:y3 = W3 · x。
    • 类似地,可以计算出 y0 到 y5,形成新的向量 y。
    • 核心观点: 新向量 y 中的每个点(如 y3)都“关注”了原始向量 x 中的其他点,从而获得了上下文信息,使得 y 比 x 更平滑、噪声更少。
    • 输出 y 的特性取决于重加权函数的设计:若函数扁平,则 y 扁平;若函数极度尖锐(如仅 W33 有值),则 y 与 x 几乎相同。
  • 词嵌入 (Word Embedding):

    • 定义:词嵌入是词语的表示,通常是一个实值向量,它编码了词的意义,使得向量空间中相近的词在意义上也相似。
    • 简化示例:假设向量有5个值(0或1),分别代表:是否人类、是否女性、是否男性、是否吠叫、是否说话。
      • "father" 的词嵌入: [1, 0, 1, 0, 1] (是人类, 非女性, 是男性, 不吠叫, 说话)
      • "dog" (原文为stud,但根据上下文和属性更像dog) 的词嵌入: [0, 0, 1, 1, 0] (非人类, 非女性, 是男性, 吠叫, 不说话) [原文此处表述为 "not female is no, the barks",推测为转录错误,应理解为“非女性,是男性,吠叫”]
    • 词嵌入的优势:
      1. 向量间距离指示词语相似性: 例如,“从‘king’到‘queen’的向量关系与从‘man’到‘woman’的向量关系非常相似”,能够捕捉真实世界的关系,如“walking”与“walked”,“swimming”与“swam”,以及国家与首都的关系。
      2. 意义相近的词由相似向量表示: 这使得相似实体在向量空间中聚集,形成簇。例如,“train, bus, car”会聚成一簇,“college, school, work”会聚成另一簇。
  • 自然语言中的注意力:

    • 例子:“Date is my favorite fruit.”
      • 通常“date”指时间,但在此句中,由于“fruit”的存在,人类能理解“date”指一种可食用的水果。
    • 需求: 自然语言处理需要类似机制,让每个词(token)获取句子或段落中其他词的上下文。
    • 与时间序列不同,这里的上下文不能简单依赖“邻近性”,需要更复杂的、基于语义的重加权方法。
    • 过程:
      1. 将句子“Date is my favorite fruit”中的每个词转换为词嵌入向量 (v1 到 v5)。
      2. 输入到注意力机制(一个“黑箱”)。
      3. 输出新的向量表示 y1 到 y5。
      4. 期望: y1 是比 v1 更好的表示,因为它包含了其他词的上下文。
      5. 计算 y1 的方式(简化版):
        • v1v1v5 分别进行点积,得到原始权重 w_11_starw_15_star
        • 对这些原始权重进行归一化,使它们的和为1,得到最终权重 w_11w_15
        • y1 = w_11 * v1 + w_12 * v2 + ... + w_15 * v5 (注意:这里的乘法是标量乘以向量)。
      6. 对所有其他词向量(v2 到 v5)进行类似处理,得到 y2 到 y5。最终,y 向量包含了上下文信息(这一简化计算的示意代码见本节末尾)。
  • 与LSTM的对比:

    • LSTM (Long Short-Term Memory):
      • 一种循环神经网络(RNN),通过引入记忆单元(memory cell)解决了传统RNN难以学习长期依赖的问题。
      • 在Transformer出现前,广泛用于语言翻译、语音识别和时间序列预测。
      • 缺点: 由于其顺序处理数据的特性,训练缓慢且效率低于Transformer。
    • Transformer:
      • 能更有效地捕捉长序列中的全局依赖。
      • 核心优势: 能够并行化计算(例如,句子中的所有词可以一次性作为输入嵌入进行处理),因此扩展性远胜于LSTM。
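
下面是按本节“简化版注意力”描述写的一段最小示意代码(仅为草图:假设使用 NumPy、用随机向量代替真实词嵌入,并用 softmax 作为“归一化使权重和为 1”的具体实现;simple_context 等函数名为本文所拟,并非视频中的原始实现):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # 数值稳定
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def simple_context(V):
    """简化版注意力:v_i 与所有 v_j 做点积得到原始权重,
    归一化为和为 1 的权重,再对 v_j 加权求和得到 y_i。"""
    raw = V @ V.T                  # raw[i, j] = v_i · v_j,原始权重
    W = softmax(raw, axis=-1)      # 每行权重之和为 1
    return W @ V                   # y_i = sum_j w_ij * v_j

# 玩具示例:假设句子有 5 个词、每个词嵌入 4 维(随机数仅作演示)
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
Y = simple_context(V)
print(Y.shape)   # (5, 4):每个 y_i 都融合了其他词的上下文
```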

位置编码 (Position Encoding)

Speaker 1 指出,自注意力机制能捕捉依赖和上下文,但语言中词的顺序至关重要。

  • 词序的重要性:
    • 经典例子:
      • "Even though she did not win the award, she was satisfied."
      • "Even though she did win the award, she was not satisfied."
      • 两个句子词语相同,但顺序不同,导致意义完全相反。
  • 需求: 需要一种方法来表示词语的位置信息。
  • 不理想的尝试:
    1. 直接使用索引作为编码:
      • p0 (第一个词的位置编码) 为全0向量,p1 为全1向量,pn 为全n向量。
      • 问题: 对于长句子,位置编码的值会变得非常大,从而使 词嵌入 + 位置编码 的结果偏斜。
    2. 归一化索引:
      • p0 为全0,p1 为全 1/n,pn 为全1(所有值在0和1之间)。
      • 问题:
        • 对于不同长度的句子,相同词语组合(如子句 "I am")的位置编码会不同。例如,在 "I am a dad" (长度4) 和 "I am Martin" (长度3) 中,"I am" 的位置编码会因句子总长度 n 的不同而不同。
        • 期望属性: 位置编码理想情况下不应依赖于句子的总长度。
  • 实际解决方案 (Sinusoidal Position Encoding):
    • 使用正弦和余弦函数来生成位置编码(示意实现见本节末尾)。
    • 公式:
      • PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
      • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
      • 其中 pos 是词在序列中的位置,i 是编码向量中的维度索引,d_model 是词嵌入的维度。
    • 特性:
      1. 有界的最值: 值在-1和1之间。
      2. 能够区分编码后的位置信息: 即使某个维度 2i 的值对于不同位置相同,其他维度 2i+1 和 2(i+1) 的值也会不同,从而使整体编码唯一。
      3. 不依赖于句子长度: 对于公共前缀如 "I am",其位置编码在不同句子中是相同的。
  • 最终输入: 将语义嵌入(词嵌入 E)与位置编码 (P) 相加,得到“位置感知的语义嵌入 (positionally aware semantic embeddings)”,作为Transformer模型的输入。
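
下面是按本节公式写的正弦/余弦位置编码示意实现(草图:假设 d_model 为偶数、用 NumPy 实现,seq_len、d_model 等数值仅为演示):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """按本节公式生成位置编码矩阵,形状 [seq_len, d_model]:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]             # 词的位置 pos
    two_i = np.arange(0, d_model, 2)[None, :]     # 偶数维度索引 2i
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # 偶数维用 sin
    pe[:, 1::2] = np.cos(angle)                   # 奇数维用 cos
    return pe

# 假设词嵌入矩阵 E 形状为 [seq_len, d_model],模型输入为 E + P
seq_len, d_model = 6, 8
E = np.random.default_rng(1).normal(size=(seq_len, d_model))
P = sinusoidal_position_encoding(seq_len, d_model)
X = E + P                      # “位置感知的语义嵌入”
print(P.min(), P.max())        # 取值有界,在 [-1, 1] 之间
```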

键、值、查询 (Keys, Values, Queries - KVQ)

Speaker 1 进一步解释了注意力机制的数学细节。

  • 回顾注意力计算:
    • 将输入向量 v 转换为上下文感知的向量 y
    • 通过点积得到权重,归一化权重,然后用权重对 v 进行加权求和。
  • 模块化视角与数据库类比:
    • 例子:“I am Martin”(v1, v2, v3)。假设要获取 "Martin" (v3) 的上下文表示 y3
    • v3 作为 查询 (Query)
    • 为了得到 v3 的信息,需要在“数据库”中进行键值查找。
      • v1, v2, v3 同时作为 键 (Keys)值 (Values)
    • 计算过程(以 y3 为例):
      1. 点积:r1 = v1 · v3, r2 = v2 · v3, r3 = v3 · v3
      2. 归一化 r1, r2, r3 得到权重 w1, w2, w3 (和为1)。
      3. 加权求和:y3 = w1*v1 + w2*v2 + w3*v3
  • 引入可学习的权重矩阵:
    • 为了使模型具有学习和泛化能力,引入三个权重矩阵:M_K (用于键),M_V (用于值),M_Q (用于查询)。
    • 原始词嵌入向量(或上一层输出)X 分别与这些矩阵相乘,得到真正的键、查询和值矩阵:
      • K = X * M_K
      • Q = X * M_Q
      • V_actual = X * M_V (注意,之前例子中 v 同时充当了键和值的角色,这里 V_actual 是经过 M_V 变换后的值)
    • 这些操作在实际中是线性层 (Linear Layer)。通常维度会发生变化,但为简化,假设维度不变。
  • 注意力公式详解 (Attention is All You Need 论文):
    • Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V_actual (完整计算的示意代码见本节末尾)
    • Q * K^T: 查询和键的矩阵乘法(点积),计算查询与所有键的相似度分数。
    • sqrt(d_k) (缩放因子):
      • d_k 是键向量的维度。
      • 为何缩放?d_k 较大时,点积结果的绝对值可能非常大,导致Softmax函数梯度极小(梯度消失)或极大(梯度爆炸),破坏训练。
      • 为何是 sqrt(d_k)?假设Q和K的元素是均值为0、方差为1的独立随机变量,则它们的点积结果均值为0,方差为 d_k。除以 sqrt(d_k) 可以使点积结果的方差近似为1,保持分布稳定。
    • softmax 将相似度分数转换为概率分布(所有分数和为1),表示每个值向量的权重。
    • * V_actual 将计算出的注意力权重(概率分布)应用于值向量,进行加权求和,得到最终的输出。
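
下面把本节的 K、Q、V 投影与缩放点积注意力串成一段示意代码(草图:投影矩阵 M_Q、M_K、M_V 用随机数代替可学习参数,维度保持不变以贴合正文的简化假设):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, M_Q, M_K, M_V):
    """按本节公式:Q = X·M_Q, K = X·M_K, V = X·M_V,
    Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V"""
    Q, K, V = X @ M_Q, X @ M_K, X @ M_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # 相似度分数,除以 sqrt(d_k) 防止数值过大
    weights = softmax(scores, axis=-1)  # 每行是一个查询对所有键的概率分布
    return weights @ V                  # 加权求和得到上下文感知输出

# 玩具示例:“I am Martin” 3 个词、嵌入维度 4,投影矩阵随机初始化(实际中是可学习参数)
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
M_Q, M_K, M_V = (rng.normal(size=(4, 4)) for _ in range(3))
Y = scaled_dot_product_attention(X, M_Q, M_K, M_V)
print(Y.shape)  # (3, 4)
```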

多头注意力 (Multi-head Attention)

Speaker 1 解释了为何需要以及如何实现多头注意力。

  • 动机:
    • 自然语言非常复杂,单一的注意力机制可能无法捕捉到文本中所有层面或类型的上下文和关系。
    • 例子:“Date is my favorite fruit.”
      • 关系1: "date" 是一种 "fruit"。
      • 关系2: "date" 是 "my favorite" 的对象。
      • 可能还有情感色彩(中性到积极)。
    • 目标: 使用多组不同的参数(权重矩阵)来捕捉不同类型的上下文和关系。
  • 计算机视觉类比:
    • 一张包含人物、天空和山脉的图片。
    • 不同的注意力“头”可以分别关注:
      • 头1: 识别出人物 (花木兰)。
      • 头2: 识别出天空。
      • 头3: 识别出山脉。
  • 多头注意力机制流程:
    1. 输入: 词嵌入 + 位置编码。
    2. 并行处理: 输入同时进入 H 个不同的“注意力头 (attention heads)”。
    3. 每个头的操作:
      • 使用各自独立的线性层(权重矩阵 M_Q_h, M_K_h, M_V_h)将输入映射为该头的查询、键和值。
      • 执行标准的缩放点积注意力计算,得到该头的输出 Y_h
    4. 拼接 (Concatenation): 将所有 H 个头的输出 Y_1, Y_2, ..., Y_H 拼接起来。如果每个 Y_h 的维度与输入维度相同(为 d_model),则拼接后的维度变为 H * d_model
    5. 最终线性层: 将拼接后的高维向量通过一个额外的线性层,将其维度重新映射回原始的输入维度 d_model。这个线性层也是可学习的(整体流程的示意代码见本节末尾)。
      • Speaker 1 提到,线性层(也称作Dense层或Fully Connected层)可以将输入维度映射到任意期望的输出维度。
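
下面是多头注意力整体流程的示意代码(草图:H、d_head 等超参数为假设值,各头的投影矩阵与最终线性层 M_O 均用随机数代替可学习参数):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, heads, M_O):
    """heads 是 H 组 (M_Q_h, M_K_h, M_V_h);每个头独立计算注意力,
    再把 H 个输出拼接,最后用线性层 M_O 映射回 d_model 维。"""
    outputs = []
    for M_Q_h, M_K_h, M_V_h in heads:
        Y_h = attention(X @ M_Q_h, X @ M_K_h, X @ M_V_h)
        outputs.append(Y_h)
    concat = np.concatenate(outputs, axis=-1)   # 形状 [seq_len, H * d_head]
    return concat @ M_O                         # 映射回 [seq_len, d_model]

# 玩具示例:seq_len=5, d_model=8, H=2 个头,每头维度 d_head=4(均为假设的超参数)
rng = np.random.default_rng(3)
seq_len, d_model, H, d_head = 5, 8, 2, 4
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(H)]
M_O = rng.normal(size=(H * d_head, d_model))
print(multi_head_attention(X, heads, M_O).shape)  # (5, 8)
```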

掩码注意力 (Masked Attention) 和 KV缓存 (KV Cache)

Speaker 1 讨论了Transformer在训练和推理阶段的行为差异,以及由此引出的掩码注意力和KV缓存。

  • 训练 vs. 推理 (Encoder-Decoder架构):
    • 训练阶段:
      • 对于翻译任务(如 "I am Martin" -> "我是马丁"),编码器接收 "I am Martin",解码器同时接收目标序列 "我是马丁"(通常带有起始符,并进行移位作为输入)。模型可以一次性看到整个目标序列(或其一部分)来学习。
    • 推理阶段:
      • 解码器必须以自回归 (autoregressive) 的方式逐个生成词。
      • 解码器使用先前已生成的词来预测下一个词。
      • 例如,翻译 "I am Martin":
        1. 解码器输入:<start_of_sentence> (起始符)。输出第一个词,如 "我"。
        2. 解码器输入:<start_of_sentence> 我。输出第二个词,如 "是"。
        3. 解码器输入:<start_of_sentence> 我 是。输出第三个词,如 "马丁"。
  • 掩码注意力 (Masked Attention):
    • 目的: 在解码器自注意力层中,确保在预测当前词时,模型只能关注到序列中该词之前的词(包括自身),而不能“看到”未来的词。
    • 实现: 在计算注意力分数后、进行Softmax之前,将对应未来位置的分数设置为一个非常小的负数(如负无穷),这样Softmax后这些位置的权重将趋近于0。
    • 这对应于注意力分数矩阵的上三角部分(不包括对角线,取决于具体实现)被“掩盖”。
  • KV缓存 (KV Cache):
    • 问题: 在自回归生成过程中,每生成一个新词,解码器需要重新计算所有先前词与当前词的注意力关系。
    • 解决方案: 缓存 (cache) 之前时间步计算出的键 (K) 和 值 (V) 向量(示意代码见本节末尾)。
      • 当生成第 t 个词时,只需要计算当前词的查询向量 Q_t
      • 然后 Q_t 与所有已缓存的 K_1, ..., K_{t-1} 和自身 K_t 计算注意力分数,并应用于已缓存的 V_1, ..., V_{t-1} 和自身 V_t
    • 效果:
      • 优点: 将每一步生成新词的注意力计算复杂度从与当前已生成序列长度 T_current 相关的较高复杂度(如朴素实现下的 O(T_current^2))降低到 O(T_current)(新的查询与所有已缓存的键交互),从而大幅提升整体推理速度。尽管每步计算量减少,但生成整个长度为 T 的序列的总计算量级仍与 T^2 相关。
      • 缺点: 需要额外的内存来存储K和V向量。所需内存大小约为 2 * num_tokens * num_layers * num_kv_heads * kv_dimension
        • 2: 存储K和V。
        • num_tokens: 序列长度。
        • num_layers: Transformer的层数。
        • num_kv_heads: 键值对的头的数量。
        • kv_dimension: 键值向量的维度。
    • 适用条件: KV缓存适用于具有因果关系 (causal relationships) 的模型,即新生成的词只依赖于自身和之前的词。不适用于像BERT这样的模型,其输出会根据整个输入序列的变化而变化。大多数生成式AI模型(如GPT、DeepSeek)都使用KV缓存。
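
下面用一段示意代码串起因果掩码与 KV 缓存的核心思路(草图:单头、无 batch 维,嵌入与投影矩阵均为随机占位,decode_step 等函数名为本文所拟,仅用于说明):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# 训练时的因果掩码:把未来位置的注意力分数置为极小值,softmax 后这些位置的权重趋近 0
T = 4
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # 上三角(不含对角线)为 True
scores = np.zeros((T, T))
scores[causal_mask] = -1e9

def decode_step(x_t, M_Q, M_K, M_V, k_cache, v_cache):
    """推理时的单步自回归解码:只为当前词计算 q_t/k_t/v_t,
    把 k_t、v_t 追加进缓存,再与全部历史做注意力;因果性由“只缓存过去”保证。"""
    q_t, k_t, v_t = x_t @ M_Q, x_t @ M_K, x_t @ M_V
    k_cache.append(k_t)                           # 缓存键
    v_cache.append(v_t)                           # 缓存值
    K, V = np.stack(k_cache), np.stack(v_cache)   # [t, d]
    s = K @ q_t / np.sqrt(K.shape[-1])            # 当前查询对所有已缓存键的分数,O(t)
    return softmax(s) @ V                         # 当前步的注意力输出

rng = np.random.default_rng(4)
d = 8
M_Q, M_K, M_V = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []
for step in range(4):                             # 依次生成 4 个词
    x_t = rng.normal(size=(d,))                   # 当前词的(位置感知)嵌入,随机占位
    y_t = decode_step(x_t, M_Q, M_K, M_V, k_cache, v_cache)
print(len(k_cache), y_t.shape)                    # 4 (8,)
```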

不同类型的注意力机制 (Different Attentions)

Speaker 1 总结并扩展了不同注意力类型的概念。

  • 注意力机制直觉回顾:
    1. 矩阵乘法1 (QK^T): 计算键和查询之间的点积/余弦相似度。
    2. 归一化 (Softmax + Scale): 将相似度分数映射到概率空间,并进行缩放以防止梯度问题。
    3. 矩阵乘法2 (...V): 将相似度(概率)分布应用于值向量,进行加权求和。
  • Transformer架构中的注意力类型:
    • 编码器 (Encoder) 中: 自注意力 (Self-Attention) 模块。
    • 解码器 (Decoder) 中:
      1. 掩码多头自注意力 (Masked Multi-head Self-Attention)。
      2. 多头交叉注意力 (Multi-head Cross-Attention)。
  • 具体注意力类型解释:
    1. 基础注意力 (Basic Attention):
      • 计算两个不同句子(一个作为查询源,一个作为键/值源)中词语间的相似度/注意力分数。
    2. 自注意力 (Self-Attention):
      • 查询、键和值均来自同一个输入序列。即 Query = Key = Value (在初始输入层面,经过各自的线性变换后得到Q, K, V)。
      • 模型关注输入序列内部不同位置之间的关系。
    3. 交叉注意力 (Cross-Attention):
      • 通常在解码器中使用。
      • 查询 (Query): 来自解码器的当前状态(基于已生成的输出序列)。
      • 键 (Key) 和 值 (Value): 来自编码器的最终输出。
      • 目的: 建立输入序列(源语言)和已生成的部分输出序列(目标语言)之间的关系,帮助解码器决定下一个词应该关注编码器输出的哪些部分。
      • 例子:翻译 "I am Martin" (编码器输入) 为 "我是马丁"。当解码器已生成 "我是" 时,交叉注意力会帮助模型在编码器的 "I am Martin" 输出中找到与 "马丁" 最相关的信息。
    4. 掩码注意力 (Masked Attention): (再次强调)
      • 应用于生成模型(如GPT, DeepSeek)的自注意力层。
      • 确保模型在预测序列中的每个词时,仅基于之前的词,不依赖任何未来的词。
      • 通过在注意力矩阵中掩盖(设置为极小值)未来位置的键和值来实现。
      • 这是实现KV缓存以加速Transformer推理的前提。
  • 解决KV缓存内存瓶颈的注意力变体:
    • 当序列越来越长时,KV缓存所需的内存成为瓶颈。
    • 多查询注意力 (Multi-Query Attention, MQA):
      • 所有查询头共享单个键 (K) 和值 (V) 头。
      • 显著减少内存使用,但可能影响注意力计算的准确性。
    • 分组查询注意力 (Grouped-Query Attention, GQA):
      • 介于标准多头注意力 (MHA) 和MQA之间。
      • 一组查询头共享一对键和值头。
      • 在内存节省和模型质量之间取得平衡。合并方式可以是2对1,3对1,或基于相似度分数等(K/V 头共享方式的示意代码见本节末尾)。
    • MLA (Multi-head Latent Attention,多头潜在注意力;原文未给出全称,指DeepSeek采用的技术):
      • Speaker 1 提到这是DeepSeek能够以低成本实现高质量的原因之一。
      • 将键和值投影到较低维空间(压缩)。
      • 可能涉及对RoPE(旋转位置编码)的定制操作。
      • 将压缩后的键与未压缩的键拼接起来进行注意力计算。
      • 这是一种高度定制化的创新注意力机制,旨在提升效率。
  • 核心总结: 尽管存在多种注意力变体,其基本直觉——通过矩阵乘法计算相似度,归一化,再应用于值向量——保持不变。
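
下面用一段示意代码说明 MQA/GQA 中“多个查询头共享 K/V 头”的核心思想(草图:group_kv_heads 为本文所拟的函数名,头数与维度均为假设值;这不是 DeepSeek MLA 的实现):

```python
import numpy as np

def group_kv_heads(K_kv, V_kv, num_q_heads):
    """GQA 的核心:num_kv_heads < num_q_heads,每组查询头共享同一对 K/V 头。
    这里用复制(repeat)把 K/V 头扩展到与查询头数量对齐;MQA 是 num_kv_heads=1 的特例。"""
    num_kv_heads = K_kv.shape[0]
    group = num_q_heads // num_kv_heads          # 每个 K/V 头服务多少个查询头
    K = np.repeat(K_kv, group, axis=0)           # [num_q_heads, seq_len, d_head]
    V = np.repeat(V_kv, group, axis=0)
    return K, V

# 假设 8 个查询头、2 个 K/V 头(数字仅为演示):KV 缓存只需存 2 份 K/V 而不是 8 份
rng = np.random.default_rng(5)
num_q_heads, num_kv_heads, seq_len, d_head = 8, 2, 16, 32
K_kv = rng.normal(size=(num_kv_heads, seq_len, d_head))
V_kv = rng.normal(size=(num_kv_heads, seq_len, d_head))
K, V = group_kv_heads(K_kv, V_kv, num_q_heads)
print(K.shape)   # (8, 16, 32):计算时对齐查询头,但需要缓存的只有 2 个 K/V 头
```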

前馈网络 (Feed Forward Network, FFN)

Speaker 1 解释了Transformer中前馈网络的作用。

  • 命名对比:
    • 循环网络 (Recurrent Network, 如RNN, LSTM): 信息在网络中循环流动,允许先前步骤的信息反馈回网络,适合处理序列数据并保持“记忆”。
    • 前馈网络 (Feed Forward Network, FFN): 信息单向流动,无循环。适合不需要过去数据记忆的任务(如传统图像识别)。
    • Transformer通过自注意力和位置编码创造性地处理序列依赖,同时保持并行性。
  • FFN在Transformer中的位置:
    • 在每个编码器和解码器层中,位于多头注意力模块之后,通常后接一个“Add & Norm”操作。
  • FFN的结构和公式:
    • 通常由两个线性变换和一个非线性激活函数(如ReLU)组成。
    • 公式:FFN(x) = max(0, xW_1 + b_1)W_2 + b_2(示意实现见本节末尾)
      • xW_1 + b_1: 第一个线性变换,通常将输入 x 投影到更高维度。
      • max(0, ...): ReLU (Rectified Linear Unit) 激活函数,ReLU(z) = max(0, z)
      • (...)W_2 + b_2: 第二个线性变换,将激活后的结果投影回原始维度(或下一层期望的维度)。
  • FFN的作用和直觉:
    1. 引入非线性: 注意力机制本身主要是线性操作(矩阵乘法和加权和)。FFN中的ReLU激活函数为模型引入了非线性,使其能够学习更复杂的关系。
    2. 维度变换与特征学习:
      • 第一个线性层通常将输入投影到更高维空间,这允许网络学习更复杂的特征表示。
      • 第二个线性层将数据投影回其原始维度,以便与下一层兼容。
    3. 并行处理: 序列中的每个位置都独立地通过FFN进行处理,因此计算可以高效并行化。
  • Dropout:
    • 一种正则化技术,用于防止模型过拟合。
    • 在训练期间,以一定概率随机地将一部分神经元的输出设置为零。
    • 迫使网络学习更鲁棒的特征,不过分依赖任何单个神经元。
    • 提升模型对新数据的泛化能力。
  • Add & Norm (残差连接和层归一化):
    • FFN模块之后通常跟着一个“Add & Norm”层。
    • Add (Residual Connection / 残差连接):
      • 将FFN(或注意力模块)的输入 x 直接加到其输出 F(x) 上,即 output = F(x) + x
      • 目的: 缓解深度神经网络中的梯度消失问题,使得训练更深的网络成为可能。梯度可以直接通过“跳跃连接”反向传播。
    • Norm (Layer Normalization / 层归一化): (将在下一节详细讨论)
      • 对每一层的输入进行归一化,有助于稳定和加速训练过程。
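
下面把 FFN、Dropout 与 Add & Norm 串成一段示意代码(草图:ffn_block 为本文所拟的函数名,权重用随机数代替可学习参数,d_ff=32 等取值仅为假设;Dropout 只在训练时生效):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer_norm(x, gamma, beta, eps=1e-5):
    """对每个位置的特征向量做归一化(沿特征维),再做可学习的缩放和平移。"""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def ffn_block(x, W1, b1, W2, b2, gamma, beta, dropout_p=0.1, training=False):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2,随后做残差连接 + LayerNorm。"""
    h = relu(x @ W1 + b1)                     # 先升维并引入非线性
    if training:                              # Dropout:训练时随机置零部分激活
        mask = np.random.default_rng(0).random(h.shape) >= dropout_p
        h = h * mask / (1.0 - dropout_p)
    out = h @ W2 + b2                         # 再映射回 d_model 维
    return layer_norm(out + x, gamma, beta)   # Add & Norm:残差连接后做层归一化

# 玩具示例:d_model=8,FFN 内部维度 d_ff=32(常见做法是 4 倍,这里仅为假设)
rng = np.random.default_rng(6)
d_model, d_ff, seq_len = 8, 32, 5
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)
print(ffn_block(x, W1, b1, W2, b2, gamma, beta).shape)  # (5, 8)
```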

归一化 (Normalization - Batch, Layer, RMS)

Speaker 1 详细讨论了不同类型的归一化及其在Transformer中的应用。

  • 什么是归一化:
    • 将不同尺度下测量的值调整到一个共同的尺度。
    • 常见方法:
      1. 最小-最大归一化 (Min-Max Normalization): x_norm = (x - min(x_all)) / (max(x_all) - min(x_all)),将值缩放到 [0, 1] 区间。
      2. Z-score归一化 (Standardization): x_norm = (x - mean(x_all)) / std_dev(x_all),将数据转换为均值为0,标准差为1的标准正态分布。
  • 为何在深度学习中使用归一化:
    1. 加速收敛: 使特征处于相似尺度,梯度下降算法能更有效地更新权重。
    2. 改善泛化: 降低模型对特征尺度的敏感性。
    3. 稳定梯度计算: 缓解因特征尺度差异巨大导致的梯度不稳定(消失或爆炸)。
    4. 减少内部协变量偏移 (Internal Covariate Shift): (尤其对Batch Norm)
      • 指在训练过程中,由于前一层参数的变化,导致后续网络层输入分布发生变化的现象。
      • 归一化有助于维持网络各层激活值分布的稳定性。
      • 例子:训练数据是真实猫狗图片,测试数据是卡通猫狗图片,分布差异可能导致模型性能下降。
  • 在深度学习中应用归一化的位置:
    1. 输入数据: 对馈入网络的原始数据进行预处理。
    2. 激活值: 对隐藏层神经元的输出进行归一化(更相关于本次讨论)。
  • 批归一化 (Batch Normalization, BN):
    • 计算方式:
      • 在一个mini-batch内,对每个特征 (feature/neuron) 单独计算均值和标准差(跨批次中的样本)。
      • 使用这些统计量对该特征在该批次内的所有样本值进行Z-score归一化。
      • 之后,应用两个可学习的参数:缩放因子 γ (gamma) 和平移因子 β (beta),即 y = γ * x_norm + β
    • 为何不常用于Transformer:
      1. 对小批量大小敏感: 当批量较小时,样本均值和标准差可能无法准确代表整体分布。Transformer等序列模型常因序列过长而使用较小批量。
      2. Padding问题: Transformer中常用padding(填充0)使序列长度一致。这些0值会干扰BN的统计量计算,误导模型。
  • 层归一化 (Layer Normalization, LN):
    • 计算方式:
      • 每个样本 (sample) 单独计算其所有特征的均值和标准差(跨单个样本内的所有特征/神经元)。
      • 使用这些统计量对该样本的所有特征值进行Z-score归一化。
      • 同样应用可学习的缩放因子 γ 和平移因子 β
    • 优势: 不受批量大小影响,且由于是对单个样本内部进行归一化,padding问题的影响较小。因此在Transformer等序列模型中非常流行。
  • RMS归一化 (RMSNorm / Root Mean Square Normalization):
    • 一种较新的、更简洁的归一化方法,正逐渐流行。
    • 计算方式:
      • 通过将激活值除以该层激活值的均方根 (Root Mean Square) 来进行归一化。
      • 公式:a_norm_i = a_i / sqrt( (1/N) * sum(a_j^2 for j in 1..N) + ε ),其中 a_i 是第 i 个神经元的激活值,N 是该层神经元总数,ε 是一个小的稳定常数(防止除以零)。
      • 通常不进行中心化 (即不减去均值)。
      • 通常只进行缩放,通过一个可学习的参数 g (gamma) (原文为 g,与BN/LN的 γ 类似),即 output = g * a_norm。通常没有平移参数 β
    • 声称: 缩放因子比平移因子对稳定归一化更重要。
    • 优势: 计算更简单高效,无需计算均值和方差的开销(与 BatchNorm、LayerNorm 的对比示意代码见本节末尾)。
    • 应用: 许多近期模型(如LLaMA、DeepSeek)采用RMSNorm。
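
下面用一段示意代码对比三种归一化的统计轴与公式差异(草图:BatchNorm 沿 batch 维统计、LayerNorm 沿特征维统计、RMSNorm 只做缩放不做中心化;数据与参数均为演示用随机值):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """沿 batch 维(axis=0)对每个特征统计均值/方差——对小批量和 padding 敏感。"""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    """沿特征维(axis=-1)对每个样本单独统计——与 batch 大小无关。"""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, g, eps=1e-5):
    """RMSNorm:只除以均方根,不减均值;只有缩放参数 g,没有平移参数 β。"""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return g * x / rms

# 玩具示例:batch 内 3 个样本,每个 4 个特征(随机数仅作演示)
rng = np.random.default_rng(7)
x = rng.normal(size=(3, 4))
gamma, beta, g = np.ones(4), np.zeros(4), np.ones(4)
print(batch_norm(x, gamma, beta).mean(axis=0).round(6))   # 每个特征在 batch 内均值≈0
print(layer_norm(x, gamma, beta).mean(axis=-1).round(6))  # 每个样本的特征均值≈0
print(rms_norm(x, g)[0])                                  # 只做缩放,不做中心化
```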

Speaker 1最后感谢观众,并鼓励点赞、订阅或评论。

核心观点总结

该视频系统地拆解了Transformer架构的各个关键组成部分:
1. 注意力机制是核心,通过计算查询、键、值的交互来动态赋权,使模型能够关注输入序列中的相关部分,从而捕捉上下文信息。
2. 位置编码通过引入周期函数(正弦/余弦)为模型提供了词序信息,解决了自注意力机制本身不感知顺序的问题。
3. 多头注意力允许模型从不同表示子空间并行学习不同方面的信息,增强了模型的表达能力。
4. 掩码注意力KV缓存是针对Transformer解码器在自回归生成任务中的优化,前者保证了因果依赖,后者大幅提升了推理效率。
5. 前馈神经网络 (FFN)在每个Transformer块中引入非线性变换,进一步处理注意力层的输出。
6. 残差连接归一化层 (特别是LayerNorm和RMSNorm) 对于稳定训练过程、加速收敛以及构建更深的网络至关重要。

这些组件协同工作,使得Transformer架构在处理序列数据,尤其是自然语言处理任务上取得了巨大成功。