speaker 1: Hello everyone. Welcome to this video. This is going to be the single video you need to understand the basics of the classic transformer architecture that later changed the world of AI. In this video, I'm going to break the architecture down into several key components and go through each component one by one. Let's dive into it. I want to start with something that seems unrelated: how to smooth a noisy time series. The first graph shows a noisy time series with values x0 to x5, and they form a vector x. Now we have a reweighing function centered at x3 — call it W3 — which is a normal-distribution-like function. What it does is filter out data that is far away from x3 and amplify the data that is close. If we multiply the reweighing function with the time series data — x0 times w3,0, x1 times w3,1, and so on — and add all the values together, we get a new value y3. Here is the mathematical formula: y3 = W3 · x, and similarly y4 = W4 · x. W3 is a vector and x is a vector, so y3 is a number, and likewise y4 is a number. Then y0 through y5 form a new vector y, and this vector is supposed to be better than x. The advantage is that each point of y now has context from the other points, so it's less noisy: you can see y3 paid attention to the other points, and similarly y0 through y5 all paid attention to the other points. I also want to say it's not necessarily a smoothing function — it really depends on how you design the reweighing function. If your reweighing function is just flat, your output y is going to be flat. If your reweighing function is extremely pointy — say only w3,3 has a value and all the others are zeroed — then it basically does nothing, and your y will look identical to x. So it really depends on how you design the reweighing function. The important thing is to understand that with some reweighing function, you get a new vector y from x, and this new vector has the advantage of context. Okay, let's move to the next one. Before I talk more about transformers and natural language, I want to talk a little about word embeddings. A word embedding is the representation of a word, which is the smallest unit of representation in natural language. The definition is: a word embedding is a representation of a word used in text analysis. The representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning. That's a very long sentence, so let me show an example. Let's simplify the word embedding: say the vector has five values and each of them is zero or one (in reality each is usually a real-valued score). The first dimension is "is human", the second "is female", the third "is male", the fourth "barks or not", and the fifth "talks or not". Then for the word "father", the word embedding is [1, 0, 1, 0, 1]: human, not female, male, doesn't bark, talks. The representation of "dog" is [0, 0, 1, 1, 0]: not human, not female, male, barks, and doesn't talk. So this is a pretty simple example of an embedding.
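To make this concrete, here is a minimal NumPy sketch comparing the toy binary embeddings above with a simple similarity measure ("mother" is a hypothetical extra word I've added for comparison; real embeddings are learned, not hand-written):

```python
import numpy as np

# Toy binary feature embeddings: [human, female, male, barks, talks]
father = np.array([1, 0, 1, 0, 1])
mother = np.array([1, 1, 0, 0, 1])   # hypothetical extra word for comparison
dog    = np.array([0, 0, 1, 1, 0])

def cosine_similarity(a, b):
    # Dot product divided by the vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(father, mother))  # higher: shares "human" and "talks"
print(cosine_similarity(father, dog))     # lower: only "male" overlaps
```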
So why is it useful? There are two big advantages, I would say. The first one is that the distance between vectors indicates the similarity between words. You can look at these examples: the vector from king to queen is very similar to the vector from man to woman — it captures a real relationship. Similarly for walking to walked, swimming to swam, and country-to-capital relationships: the distance between word embeddings is itself a vector, and it indicates some kind of relationship. The second one is that words with similar meanings are represented by similar vectors. What this does, in the end, is aggregate similar entities into close clusters. This is a 2D representation, but in reality the space is very high-dimensional; it's just for you to see that train, bus, and car form one cluster, and similarly college, school, and work form another cluster. Words with similar meanings are represented by similar vectors. So now we know about word embeddings; let's talk about attention for natural language. Here is a sentence: "Date is my favorite fruit." "Date" is a word we use very commonly, and I would say 90% of the time it refers to a time rather than an edible fruit. But how do we humans know "date" means a fruit, something to eat, in this sentence? As humans, we read the sentence and see, okay, there's "fruit", and "fruit" is referring to this "date" — that's why "date" here is something to eat rather than a time. Similarly, when you try to tell an AI system that this word has another meaning, you can use an attention-like mechanism. Natural language needs a similar mechanism: each word, token, or data point needs context from the others in the same sentence or paragraph. For example, "date" in this sentence means an edible fruit rather than a time, with the context provided by "fruit". However, this context is different from the previous example. Before, the context was basically proximity — our filtering function. We can't use proximity like we did before; we need a more sophisticated reweighing method, and it needs to have semantic meaning. So let's move on. Say we have a black box of reweighing and linear processing. We first partition "Date is my favorite fruit" into word embeddings, so we can apply mathematical processing to this natural language: v1 is the word embedding of "date", v2 is the word embedding of "is", et cetera. Then we input these word embeddings into this big box, which is supposed to be the attention mechanism, and the output is going to be y1, y2, y3, y4, y5. What we want is for y1 to be a better representation than v1, because it has more context from the other words; each y should have more context than the corresponding v. What we usually do is this: v1 is a vector, so v1 dot v1 is a number, a real value — call it w11* — and similarly we get w12*, w13*, w14*, w15*. Then we normalize them so they add up to one; these are the weights we need. Then we use these weights to reweigh: w11 times v1 (a real value times a vector), plus the same through w15 times v5, and we have a new vector y1. This is a linear reweighing of all the vectors toward v1, so it has context from the other words. Then we do the same process for the other vectors to get y2, y3, y4, y5, and we have a new representation of "Date is my favorite fruit". And now the y vectors have context.
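Here is a minimal NumPy sketch of that reweighing process, with random non-negative vectors standing in for real word embeddings and plain rescaling of the dot products (the softmax and learned projections used in a real transformer come later):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in word embeddings for "date is my favorite fruit": 5 words, 4 dimensions each.
V = rng.uniform(size=(5, 4))

# Raw scores: dot product of every word vector with every other word vector.
scores = V @ V.T                          # shape (5, 5), all positive here

# Normalize each row so the weights add up to one, as in the toy example.
weights = scores / scores.sum(axis=1, keepdims=True)

# Each output y_i is a weighted sum of all the word vectors: context for word i.
Y = weights @ V                           # shape (5, 4)
print(Y.shape)
```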
And this is the basis of the attention mechanism in natural language. Before the transformer, the most popular model for natural language was the LSTM, so I want to spend some time on it. The LSTM is a recurrent neural network. A traditional RNN has a single hidden state that is passed through time, and that can make it difficult for the network to learn long-term dependencies. The LSTM model solves this problem by introducing a memory cell, a container that can hold information for an extended period, which makes the LSTM capable of learning long-term dependencies in sequential data. It worked great at the time for language translation, speech recognition, and time series forecasting. So that is the LSTM. However, there is a problem with it: LSTMs are slow and less efficient than transformers due to their sequential nature, which means during the training phase you need to pass in the sequential data one step at a time. On the right-hand side is the transformer architecture, and you can see it is very different from the LSTM. Transformers can capture global dependencies in long sequences more effectively than LSTMs and other models like GRUs. As a result, the transformer is better at parallelizing computation. The intuition is, as in episode one, when we have a sentence we don't have to pass in one word at a time — we actually pass everything in as the input embeddings at once. That makes the calculation a lot more efficient, and if we can parallelize the computation, it scales up a lot better than an LSTM. Of course, there is other stuff that makes the transformer work, and positional encoding is one of them. As you can see on the right-hand side, there is the input, then the input embedding, and the next step is that we need to add a positional encoding. So what is positional encoding? As we discussed before, the self-attention mechanism captures dependencies and context. In short, "date" normally refers to a time, but when we know there is another word, "fruit", in the sentence, it captures the context from "fruit", and in that case "date" most likely refers to the edible fruit instead of the time. However, in language, order matters. This is the most classic example that you're going to see everywhere of why order matters: "Even though she did not win the award, she was satisfied." That is one sentence. The other, almost identical sentence is: "Even though she did win the award, she was not satisfied." They have the same characters and the same words, but the order is different and the meaning is completely different. So besides the previous mechanism that captures context, we obviously need a way to know the position. Previously we knew the relative position based on the index of the word embedding, but that's not enough — we need something similar to an embedding. That is the positional encoding. A naive way is to add another embedding for position, and the most naive approach is to just put the position of the word in as the embedding. For example, for the first word, e0 stands for the word embedding of the first word and p0 stands for the positional encoding of the first word; because it's the first word, the index is zero, so p0 will be all zeros. Similarly, p1 will be all ones, and pn will be all n's. Well, this is not going to work, since a large n skews all the embeddings: you can imagine that if the sentence is 100 words long, then e_n + p_n, the new input for the nth position, will be a lot larger than e_0 + p_0.
So the result will all be skewed. We need to bound the min and max values, and we still want to be able to differentiate the position info from the encoded values. So let's try something else and include normalization: p0 is all zeros, p1 is 1/n, and so on, so that pn is 1. In this case all the values are between zero and one. But this won't work perfectly either; there are caveats, such as that the encoding for the same word combination will be different when the sentence length is different. Here's an example with two sentences: "I am a dad", which is four words long, and "I am Martin", which is three words long. In these two sentences, the positional encodings of the sub-sentence "I am" will be different, and we don't want that — ideally we want them to be the same. So we need another property: the encoding ideally should not depend on the length of the sentence. So what's the actual solution here? The proposed solution is actually very smart. As you can see, these are the two formulas of the positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Here pos is the position of the word — for the first word it's p0, for the second it's p1 (there's a typo on the slide), then p2, p3 — and d is the dimension of the word embedding; in this case our word embedding has five dimensions. 2i refers to the even indices of the positional encoding within those d dimensions, so 0, 2, 4, and 2i+1 is the odd indices. For the even indices you use the first formula, the sine; for the odd indices you use the second one, the cosine. Just to clarify, all the values I put on the slide are fake — just random numbers. Now we combine the semantic embeddings, which are e0, e1, e2, e3, with the positional encodings, which are p0, p1, p2, p3, and we have positionally aware semantic embeddings, which is what we want. So now we have the perfect input for the transformer model, and we are ready to dive deeper into what actually happens inside it and how the attention mechanism actually uses these positionally aware semantic embeddings. Here is the intuition. The graph on the right-hand side shows the even indices 2i = 0, 2i = 2, and 2i = 4. They are similar sine curves, but with different frequencies, and they have the properties we want: they have bounded min and max values, 1 and -1, and they are able to differentiate the encoded position info. For example, say two word positions happen to have the same value at 2i = 0; if you look at 2i = 2 and 2i = 4, the values are going to be different, so the overall positional encoding is still different. Even if they are the same at 2i = 0, they will have different values at the other indices, so this algorithm is able to differentiate the encoded positions, and we know they don't depend on the length of the sentence. So for "I am Martin" and "I am a dad", the common prefix "I am" will have the same positional encoding.
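Here is a minimal NumPy sketch of that sinusoidal encoding (the dimension sizes are arbitrary toy choices; I use an even embedding dimension to keep the slicing simple):

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = np.zeros((num_positions, d_model))
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices 0, 2, ...
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                      # even indices get the sine
    pe[:, 1::2] = np.cos(angle)                      # odd indices get the cosine
    return pe

pe = positional_encoding(num_positions=3, d_model=4)  # e.g. "I am Martin"
print(pe)   # the rows for "I am" stay the same no matter how long the sentence is
```

Adding these rows to the word embeddings gives the positionally aware semantic embeddings described above.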
Let's do a quick recap. In the first episode, I went through a simplified attention mechanism using some examples. I started with smoothing a time series using a normal-distribution-like filter, and then I moved from time series to natural language: I want each word in the sentence to get some context from the other words, just like what we did in the time series. I used the example "Date is my favorite fruit", where the v vectors are the word embeddings. After we go through the attention mechanism, we get the y vectors, and y has more context taken into consideration compared with v, including the positional encoding that we went through in the second episode. So this is achieved by the attention mechanism, and here is the attention formula from the original paper. Now let's dive deeper into what actually happens in this formula. This is still the recap: when we transform v into y, we go through the dot products to get some weights, normalize them so they sum to one, and then apply the weights to reweigh the vectors and get y1. We do similar things for y2 through y5. So let's modularize this a little bit. This is a modularized view of the previous mathematical calculation, and this time I'm going to use a shorter example, "I am Martin". These three words each have a word embedding: v1, v2, and v3. Let's assume a word embedding is 1x4 in dimension. Say we want to get the third word — Martin's — context over the whole sentence, so v3 is right here. I want to use a database analogy: queries, keys, and values. I want the context, an attention representation, of the third word, v3, so v3 is the query here. In order to get what I want for v3, I need to go through a bunch of key-value lookups, as in the database analogy: the first key is v1 with its corresponding value, here is the second key with the second value, and so on. I understand this might not mean a lot yet, but it's just to help you understand why we call these queries, keys, and values. So let's say v3 is the query. What we do, as in the previous calculation, is first go through the dot products: v1 · v3 gives r1, v2 · v3 gives r2, v3 · v3 gives r3, and those are three real values, three numbers. Then we normalize those three numbers so the weights sum to one — these are the weights. Then we apply the weights to reweigh the whole set of v vectors and get y3. We can go through a similar operation to get y1 and y2; those are the contextualized attention representations of "I" and "am". So that is the modularized view of the previous calculation. Now let's add more to it. As you can see, right now there are no learned weights at all, and if you have worked in machine learning — whether classical or deep learning — you know everything is about learning the correct weights. So let's add some: now we have three weight matrices, M_Q, M_K, and M_V, one for the query, one for the key, and one for the value. These weights give the model its learning capability. Just for simplification, let's say their dimension is 4x4, so when we multiply the original 1x4 vector by a 4x4 matrix, the dimension doesn't change — it's still 1x4. This is just a simplification. In reality, when we go through this, it is a linear matrix multiplication.
In actual production, this is called a linear layer. When we go through a linear layer — whether here, or later after the concatenation, which I'll get to — the dimension usually changes, either increasing or decreasing. But just for simplification, let's say right now it's 4x4, so the dimension won't change. This doesn't change the graph much; it just turns the small v into a capital V. And now we have weights, so we have learning capability — the weights carry more information. Let's simplify this view a little more. Now we have this graph: the word embeddings go through three linear layers, which are the matrix multiplications with M_Q, M_K, and M_V, and we get queries, keys, and values. Then for the keys and queries we do a matrix multiplication and go through a normalization. In the last slide, for simplification, we just normalized the weights to sum to one, but in reality we should do a softmax and a scale, and we should also mask if we're doing inference. Then, after we get the attention scores, we do a matrix multiplication with the values matrix and we get the y vectors. As you can see, this is basically the attention formula: QK^T is this part, the softmax and the scale are this part, and then you do another multiplication with V. This is a more detailed view of the attention block in the transformer paper. As you can see, there are multiple attention blocks — don't worry, I'm going to go through all of them one by one later. Looking into the attention block details, you can see there are linear layers for V, K, and Q as we discussed; then they all go through scaled dot-product attention, which is the matrix multiplication, then the scale, then the mask for inference (which we haven't discussed yet), then the softmax, and then the matrix multiplication with V. Afterwards we go through a concatenation, because it's multi-head, and then a linear layer to bring the dimension back down. So now we have gone through the mathematical details of how we get the attention with Q, K, and V. There are still several questions. The first one is: why do we scale — why are we dividing by this number here? The answer is that the key and query vectors could be very large, so when we do the dot products, the results could be disproportionately large, and during actual training those large numbers can lead to vanishing gradients or exploding gradients that ruin the training — learning too slow or too fast. If you want to learn more about those two concepts, you can Google them on your own. The next question is: why this specific number as the scaling factor? A distribution with mean zero and variance one is the standard normal distribution. If we assume q and k are d_k-dimensional vectors whose components are independent random variables following the standard normal distribution, their dot product will have a mean of zero and a variance of d_k. We want the distribution to stay standard, so the variance should be equal to one and the values won't be too scattered; to get that variance we divide by sqrt(d_k). That is the math reasoning behind it. The last question is: why do we do a softmax? This is pretty standard, actually: after the calculation we get a bunch of attention scores, and we use softmax to transform the attention scores into a probability distribution that sums to one, which is very convenient for us.
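Putting those pieces together, here is a minimal NumPy sketch of scaled dot-product attention with projection matrices (randomly initialized here as stand-ins for learned weights; no mask):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model = 4
V_in = rng.normal(size=(3, d_model))          # embeddings for "I am Martin"

# Projections M_Q, M_K, M_V (random stand-ins, 4x4 so the shapes stay the same).
M_Q, M_K, M_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = V_in @ M_Q, V_in @ M_K, V_in @ M_V

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_model)           # scale to keep the variance near 1
weights = softmax(scores, axis=-1)            # each row sums to one
Y = weights @ V                               # contextualized outputs
print(Y.shape)                                # (3, 4)
```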
We've talked about the attention mechanism, and we know it can capture context and relationships between tokens. However, natural language is a very complex problem and usually has multiple aspects of context and relationship. For example, in the sentence "Date is my favorite fruit", we know date is a fruit, we know date is my favorite fruit, and we know the concept of "favorite" ties to "date". And there's a lot more, actually — for example, we know this sentence probably has a neutral, or even neutral-to-positive, mood, since we're talking about our favorite stuff. So there are many complex layers of context and relationship in just one simple sentence. Are we only going to use one type of attention to capture them? The answer is no. We usually use different sets of parameters to capture different types of context and relationship, and sometimes we don't even know what the type of relationship is — we just use a different set of parameters. This is an example for natural language; next, I'm also going to show an example for computer vision. The transformer is a very powerful tool — don't get me wrong, it's not only useful for natural language, we also use it for computer vision. Here is one example of the intuition behind different attention mechanisms: we have an image with a full background, and there are three attention filters. The first attention filter just figures out who the person is, the second attention focuses on figuring out what the sky is, and the third one focuses on figuring out what the mountains are. So that's another intuition for the attention mechanism and why we need multiple different types of it. This was the attention block I went through previously: we already know the input word embedding plus the position embedding goes through linear layers; then for the keys and queries, after the linear layers, we go through a matrix multiplication, then the softmax and scale (ignore the mask — I'm going to talk about it in the next section), and then we do another matrix multiplication with the values and get the attention output. So the question is: how do we introduce the multi-headedness of attention? This is the multi-head attention block. We basically have the same input, and the input dimension doesn't change, but now it goes through multiple heads — let's call the first one head 1 and the last one head H, where H is the number of heads. Each of them focuses on capturing one specific relationship. Each head still goes through the linear layers to get keys, queries, and values, still goes through the matrix multiplication, the normalization, and the matrix multiplication with the values. But since there are H heads, we now have H y vectors, each with the same dimension as v. As I said before, we want the output dimension to be the same as the input dimension so we can stack hundreds of attention blocks, but this way the output dimension is H times the input. So how do we solve this? We solve it by first concatenating all of the results: we append them one after another and get y vectors of dimension n times H. Then we go through a linear layer again, and this time the linear layer's job is just to remap the dimension back to n, and we get another set of y1 through yn. (There's a typo on the slide: this y1 is actually a different vector from the y1 earlier — it has the same dimension as v, but it comes from the multi-head computation.)
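Here is a minimal NumPy sketch of that multi-head pattern, reusing the scaled dot-product attention from before (head count and dimensions are arbitrary toy choices, and the projection matrices are random stand-ins for learned weights; like the slide, each head keeps the full dimension, so the concatenation is H times wider before the final linear layer maps it back):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 3, 8, 2
X = rng.normal(size=(n_tokens, d_model))

head_outputs = []
for _ in range(n_heads):
    # Each head has its own projections and can capture its own relationship.
    M_Q, M_K, M_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    head_outputs.append(attention(X @ M_Q, X @ M_K, X @ M_V))

concat = np.concatenate(head_outputs, axis=-1)        # (n_tokens, d_model * n_heads)
M_O = rng.normal(size=(d_model * n_heads, d_model))   # final linear layer
Y = concat @ M_O                                      # back to (n_tokens, d_model)
print(concat.shape, Y.shape)
```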
So how do we achieve this — how can we use a linear layer to drop the dimension from n times H back to n? Before I show that, note that previously we went through the scaled dot-product attention and the linear layers here; this time we're focusing on the concat and the linear layer here. The way a linear layer works is like this. I've heard all the other names people call it — dense layer, fully connected layer — and they all refer to the same concept. Let's say y is the input and b is the output of the layer; then everything in between is fully connected. This is just one view of it, and in between there could be multiple fully connected layers. In this way we are able to map the input to the output, and we can also change the input dimension to whatever output dimension we want; in this example we change a 5x1 dimension into a 3x1 dimension. So that is how a linear (or dense, or fully connected) layer drops the concatenated dimension back to the original dimension that we want. On the right-hand side is the transformer encoder-decoder architecture, and one important thing to know is that the training and inference phases are actually a little different — the transformer behaves differently, at least this encoder-decoder architecture does. The main difference occurs in the decoder, which is the right-hand part. During training, we know the entire output sentence and can pass it to the decoder all at once. For example, if we're building a translation service for "I am Martin", we can pass the source sentence into the inputs and the translated sentence into the outputs, and we can pass a bunch of training translation pairs into the inputs and outputs during the training phase. However, during inference, the decoder must generate one word at a time in an autoregressive manner: the decoder uses the previously predicted words to help predict the next word, which means you have to output one word at a time. That's just the nature of the applications a decoder is used for, like translation or text generation — what's already generated is already out there, and you're not going to change it. The way we achieve this is with masked attention. The masked attention module makes sure each token in a sequence only attends to previous tokens and itself, not future tokens, which means each generated token depends only on itself and its previous tokens — it's not going to depend on anything in the future. Here is an example. Obviously this is an oversimplified version of the token embedding story: there is a start-of-sentence token and, when it ends, an end-of-sentence token, but this is a simplified version. Let's say we want to translate "I am Martin" into Chinese. The encoder input is going to be "I am Martin", we add the positional encoding, and then we do the multi-head attention, add & norm, feed-forward, and another add & norm, which I'm going to cover in later episodes. The encoder part is pretty straightforward, and we've covered it in previous episodes. The decoder, as I mentioned, is not going to take the whole output sequence, because this is not training data — we're trying to translate an unknown sequence, which is "I am Martin" here.
So the first decoder input is a start-of-sentence token; it goes through the decoder, and the output will be the first translated word, which is the Chinese word for "I" ("wo"). Then we take that token, pass it back into the decoder input, and we get the next generated Chinese character, "shi", which is Chinese for "am". Similarly, we pass the combined "wo shi" into the decoder input, and then we get the next generated token, which is the Chinese name for Martin. So this is an oversimplified version of how the decoder works in the inference phase. Now we can get into a little more of the mathematics, and then you're going to learn one very important concept: the KV cache, used during inference autoregression. At each generation step we are recalculating the same attention for the previous tokens — this is assuming we don't have any cache and are not storing anything in memory. Hopefully you're familiar with this mathematical step; I've gone through it in previous episodes. Say the first token comes in, and we go through QK^T·V — I'm ignoring the scale and the softmax and just laying out QK^T·V for simplicity. QK^T is one matrix multiplication, then there's another matrix multiplication with V, and you get the output. As you can see, q1 is the embedding of the first token, so the size of q is 1 x embedding_size; k^T is embedding_size x 1, so QK^T is 1 x 1; then with V at 1 x embedding_size, the output is 1 x embedding_size. Then, when the second token pops up, because you're not storing anything, you have to pass the whole sequence into the decoder: you pass q1 and q2, and in order to get the whole attention matrix you need k1, k2 and v1, v2. When it passes through the masked attention module, instead of using all four attention scores, it uses a mask to mask out all the future tokens. For example, one of the entries is q1·k2 — for the first token, k2 is something in the future, so you're not going to have access to it, and that entry is masked. That guarantees o1 is generated only from token one or earlier, and o2 is generated from token two or earlier tokens. Then the third token comes, and you have to pass in the whole sequence, do the multiplications, and apply the mask again — masking out all the future tokens, which is basically the upper-right part of the matrix — and you get the three outputs. As you can see, as time goes on, the size of the calculation keeps growing: at t = 1, then t = 2, then t = 3, the dimension of this matrix is t squared — quadratic. Now let's do inference with a KV cache. As you can imagine, the KV cache just caches the key and value embeddings. Q1 is the same as before. When q2 comes, all you need is this row of QK^T: in order to get o2, you just need this calculation, and for this calculation you only need k1 and v1. So if we have already cached them — say green means cached — you can focus on just q2 itself; you don't have to recompute anything for q1.
So you store k1 and v1 in memory, and then you are able to get this QK^T row, which is 1 x 2, multiply it by the V matrix, and get the next output. Similarly with q3: when a new query token comes in, since you are already storing the previous keys and values, you can compute the QK^T needed to calculate o3 directly. Intuitively, the complexity of the matrix multiplication is now no longer quadratic — it's linear in t: t = 1, then t = 2, then t = 3. So it drops the complexity of the matrix calculation a lot, which is what matters in machine learning — FLOPs, floating point operations. So this is the KV cache tradeoff. The plus is that the number of floating point operations drops, from O(t²) to O(t), where t is the length of the sequence. The downside is that we need extra memory: you need to store the key and value embeddings. The amount of memory we need is 2, multiplied by the number of tokens, multiplied by the number of layers, multiplied by the number of KV heads, multiplied by the KV dimension. The 2 is because you have to store both the key and the value. The number of tokens should be easy to understand; KV heads we went through in a previous episode, and similarly the KV dimension. The number of layers is something people usually ignore: the neural networks I show here are all simplified versions, and in reality each attention module and feed-forward module has multiple layers, so you have to go through a bunch of hidden layers and take the number of layers into account to store all the Ks and Vs. Then there's KV cache eligibility — when can you use a KV cache? I kind of already went through this when I introduced the KV cache: you need causal relationships, meaning a new token's attention calculation depends only on itself and previous tokens. If it depends on future tokens, you can't use a KV cache. For example, there is BERT, the famous — at least it used to be famous — encoder transformer architecture. If you use BERT to do natural language processing, you'll see the output change as it processes the tokens. That's because it's not generative: the generated token is not fixed, and it can change its outputs as it processes the whole input, so in that case we cannot use a KV cache. Most of the generative AI models right now — GPT, for example — are decoder-style architectures; DeepSeek is also generative and decoder-based, and it also has a KV cache component. It's slightly different, but it still uses the same mechanism, which I'm going to go through in another series.
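Here is a minimal NumPy sketch of autoregressive decoding with a KV cache (single head, random projections standing in for learned weights — just to show what gets cached and reused at each step):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M_Q, M_K, M_V = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def decode_step(x_t):
    """Attend from the newest token only, reusing the cached keys and values."""
    q_t = x_t @ M_Q                       # query for the new token
    k_cache.append(x_t @ M_K)             # cache this token's key...
    v_cache.append(x_t @ M_V)             # ...and its value
    K = np.stack(k_cache)                 # (t, d): old keys/values reused, not recomputed
    V = np.stack(v_cache)
    scores = K @ q_t / np.sqrt(d)         # one row of QK^T: O(t) work instead of O(t^2)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the t visible tokens
    return weights @ V                    # output for the newest token

for t in range(3):                        # three toy generation steps
    out = decode_step(rng.normal(size=d))
    print(t, out.shape)
```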
First of all, let's start with the different attentions in the classic transformer encoder-decoder architecture. As I covered before, in the encoder there is a self-attention module, and in the decoder there is a masked multi-head attention and another cross-attention multi-head attention. Yes, I think I've already said "attention" at least ten times, so hopefully this is not confusing — I'm going to go through each of them one by one in the later slides. First, I want to build intuition for the attention mechanism. This graph is one I used a lot in my previous slides: a modularized view of what the attention module is doing. Basically, it's a matrix multiplication — a dot product between keys and queries after their linear transformations — then it goes through a normalization, which is basically the softmax and the scale, and for masked attention there is also a mask on top of the softmax and scale. Then we do another matrix multiplication with the values, after their own linear transformation, and we get the output. So the intuition: matrix multiplication number one is the dot-product cosine similarity between key and query; the normalization is the softmax, which maps the similarities into a probability space, and the scale prevents gradient vanishing or explosion; and the second matrix multiplication applies that similarity distribution to reweigh the values. You can think of the first part as cosine similarity via dot products, the second as a bunch of normalization to map it into probability space, and the third as applying those probabilities to reweigh the values. That is the intuition of attention. Let's start with basic attention. Let me remind you that the transformer is based on the concept of self-attention, but before self-attention there was just attention — attention is a very long-lived concept already. In this case, say we have two sentences. The first one is "I have a white cat" — that is, say, the query. The second sentence is "I love little animals". We want to know the similarity scores, the attention scores, between these two sentences. We can calculate them from their word embeddings with dot products, and we get this matrix. So this is basic attention: getting similarity or attention scores between two sequences using dot-product cosine similarity. Then we have self-attention. What's the difference between basic attention and self-attention? The query equals the key — that's why it's called self-attention. And then we have the cross-attention in the decoder. The cross-attention layer is trying to get the similarity scores between the input and the output, so the keys and values come from the encoder. Say the input is "I am Martin" and we're trying to translate it into Mandarin. The encoder takes "I am Martin", and after a linear layer it produces the keys and values matrices. Then the decoder — let's say it has already translated "I am" into "wo shi" — uses the decoder input and a linear transformation to get the query matrix. We do the same dot-product similarity to see the similarity between the output sequence and the input sequence; after normalization we get the matrix, and after reweighing we get the next translation, which is the Chinese for "Martin". Cross-attention establishes the relationship between the input sequence and the generated output: the attention cosine similarity happens between the decoder query and the encoder key, and the similarity weights are applied to the values, which are the encoder values — you want the attention score between the encoder and the decoder. In this example, the encoder side is "I am Martin" and the decoder side is "wo shi", which is basically the translation of "I am". Okay, so the next concept is masked attention. The causal mask is applied to the attention mechanism to ensure the model predicts each token in a sequence based solely on the previous tokens, not on any future tokens. What this means is that for generative models — say GPT or DeepSeek, those are all generative models — the generated token only depends on previous tokens, not future tokens, which means in the attention similarity matrix we need a mask to mask out future keys and values. In this case, if the query is from the first token, we don't need anything from the second and third; for the second query, we don't need anything from the third; and for the third query, we can have everything. So basically, we apply a mask.
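Here is a minimal NumPy sketch of that causal mask: future positions are set to minus infinity before the softmax, so their weights come out as zero (three tokens, made-up scores):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

t = 3
scores = np.arange(t * t, dtype=float).reshape(t, t)   # stand-in attention scores

# Causal mask: True above the diagonal marks future tokens, which must be hidden.
future = np.triu(np.ones((t, t), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(weights)   # row 1 attends only to token 1, row 2 to tokens 1-2, row 3 to all
```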
This actually enables the KV cache to speed up the transformer: we can cache keys and values to speed up the transformer at the cost of memory. For the pros and cons, you can refer to my previous KV cache episode. So now we use the KV cache and masked attention to optimize the transformer's performance — and then we hit another bottleneck. The KV cache memory size becomes the bottleneck, especially when the sequences get longer and longer, because the memory is linear in the sequence length. Even with the KV cache and masked attention, engineers and scientists kept thinking about different ways to address the bottleneck, so people started asking: can we reduce the KV cache without hurting quality? That's where grouped-query attention and multi-query attention come in. Let's start with multi-query attention, which is a bit of an extreme: it shares a single key head and a single value head across all query heads, which can significantly reduce memory usage, but also impacts the accuracy of attention. As you can see, this is the classic multi-head attention, where the keys and values will be cached in memory; in multi-query attention, we reduce those to one key head and one value head. Obviously that drops the memory size, but it will also impact quality. Grouped-query attention is somewhere in between the classic attention and multi-query attention: it's an interpolation between MHA and MQA, where a single pair of key and value heads is shared only by a group of query heads, not by all of them. So the quality impact is smaller, and it still saves tons of memory. You can merge two heads into one, or three into one, or merge different query heads based on some similarity score — a lot of methods can be used. If you want to know the details, you can refer to my DeepSeek technical report episode; it contains a bunch of information about GQA and MQA. Now, this is the last concrete example I have, which is MLA. Basically, this is one of the reasons DeepSeek can achieve great quality at a very cheap cost: they use what they call the MLA attention mechanism, which compresses — projects — the keys and values into a lower-dimensional space. They use some really smart techniques, like doing customized operations with RoPE and then concatenating compressed keys with non-compressed keys to get the actual attention. So they use a very innovative mechanism to speed up the classic attention mechanism. This is a very heavily customized attention mechanism, and it's just one example. There are a lot of variants of the classic attention mechanism, and each of them made different customizations to solve different performance or quality problems, so don't be surprised if you see more and more variants of the transformer. Just keep in mind the original concept: it's matrix multiplication — you get the cosine similarity scores from the dot product, you use normalization to normalize them so they can be used later, and then you apply the similarity distribution to reweigh the values. This is the basic intuition of attention.
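Here is a minimal NumPy sketch of the key/value head sharing idea behind MQA and GQA described above (toy dimensions, random projections; multi-query attention is the special case with a single KV head):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_head = 3, 4
n_q_heads, n_kv_heads = 4, 2            # two query heads share each KV head (GQA)
X = rng.normal(size=(n_tokens, 8))

# One projection per query head, but only n_kv_heads key/value projections to cache.
Q = [X @ rng.normal(size=(8, d_head)) for _ in range(n_q_heads)]
K = [X @ rng.normal(size=(8, d_head)) for _ in range(n_kv_heads)]
V = [X @ rng.normal(size=(8, d_head)) for _ in range(n_kv_heads)]

outputs = []
group = n_q_heads // n_kv_heads
for h in range(n_q_heads):
    kv = h // group                      # which shared KV head this query head uses
    w = softmax(Q[h] @ K[kv].T / np.sqrt(d_head))
    outputs.append(w @ V[kv])

Y = np.concatenate(outputs, axis=-1)     # concatenate heads as before
print(Y.shape)                           # the KV cache only holds n_kv_heads K/V sets
```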
So first of all, why is it named feed-forward? This concept is in contrast to another concept, the recurrent network. Before the transformer, recurrent networks were mainly used in machine translation, text recognition, and so on, and there's a reason for that: a recurrent neural network can process information in loops, which allows information from previous steps — from the past — to be fed back into the network. That's great for sequential data because it holds some memory, so for text recognition and machine translation this kind of model worked perfectly. You must have heard of the LSTM, which is a variant of the RNN. In contrast, we have the feed-forward neural network, where the information flows in one single direction with no cycles. This is good for tasks that don't require memory of past data, say image recognition. Of course, this was before the transformer — as we've already gone through, the transformer is pretty creative in using self-attention and positional embeddings to make sure we have the whole context while still doing everything in parallel instead of sequentially. So this is the feed-forward architecture, the simplest deep neural network you're going to see: it has an input layer, an output layer, and one or multiple hidden layers — the first layer is the input, the last layer is the output, and everything in between is hidden layers. Let's take a look at the feed-forward network inside the transformer. Inside the transformer you can see two blue blocks that are feed-forward, and each is followed by an Add & Norm; today we're also going to talk briefly about Add & Norm. First, the basics. The formula of the feed-forward network is FFN(x) = max(0, xW1 + b1)W2 + b2. This is basically two linear transformations: the first one is xW1 + b1, and the next one multiplies that whole block by W2 and adds b2 — two linear transformations with a ReLU activation function in between. So what is the ReLU activation function? It is a nonlinear activation function used in neural networks: ReLU(x) = max(0, x), the larger of zero and x. That's the ReLU part, and it happens in between the two linear transformations. The intuition within this formula — again, thanks to Gemini for making this part really easy — is, first, nonlinearity: the first fully connected layer projects the input into a higher-dimensional space, which allows the network to learn more complex relationships, and then it goes through the activation function, which is where the nonlinear, complex relationships come in. Then we go through dimensionality reduction in the second fully connected layer, which projects the data back to its original dimensionality for compatibility with the next layer. After these two steps it actually learns a lot more — not only because we expanded the dimension in the first fully connected layer, but more importantly because the ReLU makes sure it learns nonlinear relationships. Remember, in our attention block everything is linear, so this is the place where we learn nonlinear relationships. It also has the advantage of parallel processing: since each position in the sequence is processed separately through the FFN, the computation can be parallelized efficiently.
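Here is a minimal NumPy sketch of that position-wise feed-forward block (toy dimensions, random weights standing in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 16                  # expand to a higher dimension, then project back

W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def relu(x):
    return np.maximum(0.0, x)              # ReLU(x) = max(0, x)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return relu(x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(3, d_model))          # three token positions
print(ffn(X).shape)                        # (3, 4): same dimension as the input
```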
This next part is not shown in the diagram, but it's pretty standard practice: applying dropout. Dropout is a regularization technique used to prevent overfitting, which is very common in machine learning, by randomly dropping out — that is, setting to zero — a certain percentage of neurons during training. This forces the network to learn more robust features and not rely too heavily on any single neuron, and it also improves the ability to generalize to new data. The advantages are basically covered in the summary: it reduces overfitting by preventing neurons from co-adapting too much, and it improves generalization by forcing the network to learn features that don't depend on any specific neurons. So after the feed-forward block, we usually have dropout. And then comes Add & Norm. This is not in the blue block, but it follows the blue block. In "Add & Norm", the first part, "Add", means a residual connection, which is also a common practice in deep learning. Before I talk about it, take a look at the graph on the right-hand side: x is some input, part of it goes through the deep neural network, including the ReLU, and that produces F(x), the output of those layers. Then, instead of passing F(x) alone to the next layer, we also add the identity function — "identity function" is just a fancy way of saying the input itself. This actually helps a lot in mitigating the vanishing gradient problem; the goal, as I said, is to mitigate the vanishing gradient problem in deep neural networks. This originally came from ResNet, and now it's used widely in deep neural networks, like the transformer. It enables deeper architectures by resolving the vanishing gradient problem, which gets more serious as the number of layers increases. The way it does this is by allowing gradients to bypass the intermediate layers — bypassing them and adding the identity to the transformed values. So here is the feed-forward and here is the Add & Norm: the identity goes directly into the Add & Norm, part of the signal goes through the feed-forward, and then we add them together. So we've gone through the Add part; now let's talk about the Norm. The Norm, in transformers or sequential models like LLMs, usually refers to layer normalization. Layer normalization is a technique that helps stabilize and accelerate training by normalizing the inputs to each layer, ensuring the model processes information consistently regardless of the input scale or distribution. I know that doesn't mean a lot yet, so let's use the next episode to talk about layer normalization, and I also want to compare it with other normalizations. So what is normalization? From Wikipedia, normalization means adjusting values measured on different scales to a common scale. If you Google normalization, you're most likely going to see two common ways of doing it. The first one is called min-max normalization: the numerator is x minus min(x), and the denominator is max(x) minus min(x), so it rescales the values of x. The next one is called z-score normalization, which does not only a rescale but also a recenter: it normalizes your distribution toward a standard distribution by subtracting the mean and then dividing by the standard deviation. If you had a statistics background at university, you most likely went through similar concepts; if you've forgotten, just Google it and it should be relatively simple to pick up.
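Here is a minimal NumPy sketch of those two normalizations on a made-up array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: rescale into the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: rescale and recenter to mean 0, standard deviation 1.
z_score = (x - x.mean()) / x.std()

print(min_max)   # [0.   0.25 0.5  1.  ]
print(z_score)   # mean ~0, standard deviation ~1
```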
So why normalization? In deep learning it has many benefits: it helps stabilize the training process by ensuring features are on a similar scale, preventing features with large values from dominating the learning process, which leads to faster convergence and improved model performance. The first benefit is pretty obvious: faster convergence. By scaling features to a similar range, the gradient descent optimization algorithm can update weights more effectively, leading to quicker training times. The second is improved generalization: normalization can help the model generalize better to unseen data by reducing its sensitivity to feature scales. The third is stability in gradient calculations: when features have vastly different scales, it can lead to very unstable gradients in backpropagation, and normalization helps mitigate this issue, whether it's vanishing or exploding gradients. The fourth one is very important: it reduces internal covariate shift. This is especially the case for batch normalization — normalizing activations within each layer helps maintain a stable distribution throughout the network, which is crucial for deep architectures. I'm going to go through more details with an example later. Here is an example of how internal covariate shift can happen: say your training data comes from photos of actual cats and dogs, which obviously follow some kind of distribution, and when you actually do testing, it turns out all the testing data is cartoons, whose distribution is a bit different from your training data. If you don't do any batch normalization, it's very likely the model won't perform well on the testing data because of the internal covariate shift — this can confuse the model a lot. I took this example from a website. So where do we apply normalization in deep learning? Usually you can normalize the input data: the data that you feed into the neural network can be normalized before entering the network, which is basically preprocessing. The next place is the activations, which is more relevant to what we're talking about today: we can also normalize the activations, the outputs of the hidden layers' neurons in the neural network. This is often done to stabilize and accelerate training, especially in deep networks. For this, let's go through batch normalization. For simplicity, I just pull out a layer with four neurons from a deep neural network, so it has four neurons representing four features: x1, x2, x3, x4. And then there is a mini-batch with three samples; the sample values are just random numbers I came up with. Batch normalization means we do the normalization by calculating the mean and standard deviation for each row, that is, for each feature. For x1, the mean is 5.3 and the standard deviation is 4.9; similarly for the x2 row we get a mean and standard deviation, and then for x3 and x4 we do another round of calculation. So the normalization happens across the mini-batch, independently for each feature. After we get the mean and standard deviation for all features, we apply the following to the activations and get the normalized values: the normalized value is the z-score normalization, the original value minus the mean, divided by the standard deviation.
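Here is a minimal NumPy sketch of that batch normalization step on a toy mini-batch (the numbers are made up and don't match the slide; the learnable gamma and beta described next are shown with their typical initial values):

```python
import numpy as np

# Mini-batch of 3 samples x 4 features (rows = samples, columns = features x1..x4).
batch = np.array([[1.0,  2.0,  3.0,  4.0],
                  [4.0,  0.5, 10.0,  2.0],
                  [11.0, 6.0,  1.0,  9.0]])

# Batch norm: mean and std per feature, computed across the mini-batch (axis=0).
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-5)      # small epsilon avoids divide-by-zero

# Learnable scale (gamma) and shift (beta), one per feature; typically start at 1 and 0.
gamma, beta = np.ones(4), np.zeros(4)
out = gamma * normalized + beta
print(out.mean(axis=0).round(3))                # each feature is now roughly zero-mean
```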
After that, we apply additional scaling and shifting parameters: each node or neuron has learnable parameters gamma and beta to represent the scaling and the shifting — gamma is obviously the scaling and beta is the shifting, and these two values are learnable. So why is batch normalization not used for transformers? Batch normalization has its own limitations. In batch normalization, the mean and standard deviation are calculated within the current mini-batch; however, when the batch size is small, the sample mean and sample standard deviation are not representative enough of the actual distribution. For sequential models we tend to see smaller batch sizes, because the sequences are very long; this is usually done to allow the model to update its parameters more frequently on smaller subsets of data. The transformer is very popular among sequential models, so it very often has small batch sizes, and batch normalization doesn't work very well. To be more specific, we usually do padding — adding zeros to keep the data size consistent, so the input sizes are equal during self-attention — and the padding is not part of our original data and can mislead the model a lot. Here's an example. Say we have two inputs: the first one is "Hello!" and the second is "My name is Martin". In order to go through the self-attention matrix, because the dimension is fixed, we actually have to add two paddings to the first input, and the padding will be all zeros. When we do normalization with this included, it's going to confuse the model a lot, because we have to include these unnecessary zero values. So what's the solution then? The solution is layer normalization, which is very popular for sequential datasets. Again, I'm going to use the example of a four-neuron layer, which again represents four features, and the sample outputs within the mini-batch are all random numbers. This time, when we do the mean and standard deviation calculation, it happens across the features, independently for each sample: we get the mean and standard deviation of all the features of the first sample, then of all the features of the second sample, and of all the features of the third sample. In this way, we don't have to worry about the padding values at all. For all the features, we apply the following to the activations and get the normalized values, similar to batch normalization, and then again we apply the scaling and shifting, similar to batch normalization. So this is layer normalization.
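Here is a minimal NumPy sketch contrasting the two: the same kind of toy mini-batch, but with the statistics computed per sample across the features, as layer normalization does:

```python
import numpy as np

batch = np.array([[1.0,  2.0,  3.0,  4.0],
                  [4.0,  0.5, 10.0,  2.0],
                  [11.0, 6.0,  1.0,  9.0]])

# Layer norm: mean and std per sample, computed across the features (axis=1).
mean = batch.mean(axis=1, keepdims=True)
std = batch.std(axis=1, keepdims=True)
normalized = (batch - mean) / (std + 1e-5)

# Learnable per-feature scale (gamma) and shift (beta), just like in batch norm.
gamma, beta = np.ones(4), np.zeros(4)
out = gamma * normalized + beta
print(out.mean(axis=1).round(3))   # each sample (row) is now roughly zero-mean
```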
Next, I want to talk a little about RMS normalization, which is a relatively new normalization method that's catching up in popularity. RMSNorm normalizes the activations by dividing them by the root mean square of the activations of each layer. Unlike layer norm, RMSNorm typically does not center the activations by subtracting the mean before normalization; it does include scaling by a learnable parameter, but there is usually no shifting parameter either. So RMSNorm only does rescaling — it doesn't do any recentering at all. Compared with batch norm and layer norm, RMSNorm's normalization process is simpler, without the overhead of calculating the mean and variance. Additionally, the RMSNorm authors claim that the scaling factor is more important for stabilization than the shifting factor, which is why they don't do any shifting. RMSNorm is therefore preferred in many scenarios where computational efficiency matters. You can see that a lot of recent models — LLaMA-style GPT models, and DeepSeek as well — are using RMSNorm, so it's definitely catching up. This is the formula for RMSNorm, and you can see it's a lot simpler, with no additional steps: RMSNorm(a_i) = (a_i / RMS(a)) * g_i, where a_i is the activation of the i-th neuron. You first calculate RMS(a), which is basically adding up all the a_i squared, dividing by n — n is the total number of features, which is the total number of neurons — and taking the square root; then you divide each activation in this specific layer by it. And then there is the scaling factor g_i — again, the learnable parameter I mentioned — which is applied here. So this is RMS normalization, and it already looks a lot simpler than layer or batch normalization. I hope this helps. Thank you everyone. If you like my video, please subscribe, like, or comment.