speaker 1: Hello everyone. Welcome. In this video, we are going to cover DeepSeek V3's secrets of success, including mixture of experts, multi-head latent attention, multi-token prediction, and other hardware optimizations like FP8 mixed-precision training. It's been several months since DeepSeek V3's release, but there is a recent release on March 24th with several cool improvements, and it looks awesome. There are some highlights to it, for example a major boost in reasoning performance: you can see the performance is better than the original V3, Alibaba's Qwen, GPT-4.5, and Sonnet 3.7 on multiple benchmarks. It also has stronger front-end development skills, for example the ball-in-a-hexagon test, and it has smarter tool-use capabilities. This great update reminds me that I promised to do a collection version of my DeepSeek V3 deep dive, so here I am. In this collection version, I've made some updates to my old content, including a complete redo of MoE. Hope you enjoy it. Enough talking, let's dive into it.

There are many reasons why DeepSeek V3 is efficient, good, and cheap, and the mixture-of-experts architecture is definitely one of the most important ones. Mixture of experts is an architecture where the computation for each input token is handled by a combination of multiple specialized subnetworks called experts. Instead of using a single large, dense network for all inputs, MoE selectively activates only a subset of these experts based on the input. This architecture actually came out earlier than the famous 2017 transformer paper, and it's also from Google; this is the architecture they gave in the paper. As you can see, there are several key components. The first is the experts: individual neural networks, often smaller feed-forward networks, that are trained to specialize in different tasks. Then there is a router, also known as a gating network: a separate neural network that analyzes the incoming input and decides which experts are most relevant to process it. It essentially acts as a dispatcher. There is also the combination mechanism: the outputs from the selected experts are combined to produce the final output of the MoE layer.

This is the typical flow for how mixture of experts works. First, the input is fed into the LLM with an MoE layer. The input reaches the MoE layer, which usually replaces the transformer's feed-forward network layer, and the router, which is the gating network, processes the input. Based on the input, the router assigns a weight or score to each expert, indicating its relevance to the current input. In this case, let's say we pick expert 2 and expert n−1; the other experts are not selected. Typically only the top-k experts, where k is a predefined number, often a small value like one or two, with the highest scores are activated for that specific input. This is known as sparse MoE, which is more efficient. There are also dense MoE models where all experts contribute to the output but with varying weights; I'm not touching that topic today. The activated experts then independently process the input, and their outputs are later combined according to the weights assigned by the router to produce the final output of the MoE layer.
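To make this routing-and-combining flow concrete, here is a minimal sketch of a sparse MoE layer in PyTorch. The expert count, hidden sizes, and top-k value are made-up illustration numbers, not DeepSeek's configuration, and a production implementation would batch the expert computation instead of looping over experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sparse MoE layer: a router picks top-k experts per token,
    and the expert outputs are combined with the router's gating scores."""

    def __init__(self, d_model=512, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router (gating network) scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                                  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # affinity per expert
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 4 tokens through the layer.
tokens = torch.randn(4, 512)
print(SparseMoELayer()(tokens).shape)                      # torch.Size([4, 512])
```

The key point is that only top_k experts run for any given token, and the router's scores decide both which experts run and how their outputs are weighted.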
So the benefits of using mixture of experts are pretty clear. The first is increased model capacity: it allows for creating models with a significantly larger total number of parameters compared to dense models with a similar computational cost. Each expert contributes its own set of parameters, leading to a higher overall capacity to learn complex patterns. The second is improved efficiency during inference: only a small fraction of the total parameters are actually used for each input, which leads to faster inference and lower computational cost compared to running a dense model with the same number of parameters. The third one is specialization and better performance: by having experts specialize in different aspects of the data, the model can potentially achieve better performance on a wider range of tasks and handle diverse types of input more effectively. Each expert can become highly proficient in its area of specialization. And the last one is faster pretraining: due to the sparse activation of experts, MoE models can be pretrained more efficiently than dense models with the same capacity.

This is the architecture detail of mixture of experts. We have gone through the transformer architecture in my previous videos; if you're interested, you can take a look. For an MoE transformer, we simply replace the feed-forward network layer of the transformer model with an MoE layer. An MoE layer, as we discussed, is composed of a gating network and a certain number of experts. So this is the replaced MoE layer, and if you expand it, this is how it looks: the input goes through self-attention and an add & normalize, then through the FFN and another add & normalize.

Let's go through an example. Say an input sequence containing two tokens runs through a transformer encoder. First it goes through the positional embedding, then through the self-attention layer to get the attention scores, and then through the add & normalization with its own identity connection. Where it would previously reach the FFN in a transformer encoder, it now reaches the MoE layer, and it hits the router network first. The router, also known as the gating network, determines which subset of the experts gets activated. In the first token's case, the second expert (the second FFN) is activated, and the router also produces a gating score right here. The final output for this token is a combination of the router's gating score and the output of the expert. After the MoE layer produces its output, it also goes through an add & normalization. It's similar for the second token: it goes through positional encoding, self-attention, add & normalization, and then the gating network. In this case, it's the first expert that gets activated, and the gating score is 0.8. Combining the gating score with the output of the first expert, we have the final output of the MoE layer, and then it goes through the add & normalization again.

This is the DeepSeekMoE that was introduced in DeepSeek V2, and this is the graph of DeepSeekMoE. As you can see, there are several differences from conventional MoE. First, with a small number of experts, different tokens can easily get routed to the same few experts, creating load imbalance. To resolve this problem, instead of having a smaller number of large experts, DeepSeekMoE finely segments the expert layer into a larger number of smaller, more granular experts. In this case, they have 256 experts, and they pick the top eight.
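As a rough intuition for why finer segmentation helps, here is a back-of-the-envelope comparison. Apart from the 256-experts / top-8 figure mentioned above, the sizes are made-up, chosen so that both configurations activate the same hidden width per token:

```python
from math import comb

# Finer expert segmentation (hypothetical sizes except the 256-expert / top-8
# figure) keeps the activated hidden width per token the same, but vastly
# increases the number of possible expert combinations.
configs = {
    "coarse (8 experts, top-2)":   {"experts": 8,   "top_k": 2, "d_hidden": 16384},
    "fine (256 experts, top-8)":   {"experts": 256, "top_k": 8, "d_hidden": 4096},
}
for name, c in configs.items():
    active_width = c["top_k"] * c["d_hidden"]   # hidden units actually used per token
    combos = comb(c["experts"], c["top_k"])     # distinct expert subsets available
    print(f"{name}: active width {active_width}, possible combinations {combos:.3e}")
```

With the same activated width, the fine-grained setup has enormously more ways to combine experts, which is what lets individual experts specialize more narrowly.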
The second problem is that tokens assigned to different experts could require some common knowledge. To solve this and avoid redundancy, V2 isolates a set of experts specifically as shared experts. These shared experts are designed to capture common, general knowledge that is relevant across various inputs. In this case, we pick one shared expert for DeepSeek V2. There are several benefits of DeepSeekMoE compared to a normal MoE. The first one is a higher degree of specialization, by using more, smaller experts and encouraging them to focus on narrower domains of knowledge. There is also improved parameter efficiency: by promoting specialization, the model can potentially achieve better performance with a smaller number of active parameters compared to regular MoEs. It also reduces redundancy: by having dedicated shared experts handle common knowledge, the routed experts can focus on unique specializations.

Previously, I mentioned that load imbalance was a problem. It actually is critical for MoE to work properly. Without a proper load-balancing mechanism, we run into routing collapse, which means a small subset of experts receives a large number of input tokens while the others remain underutilized. This hurts the learning of the underloaded experts, leading to poor overall model performance and generalization. A good balancing strategy also ensures all experts contribute meaningfully to the model's output by dynamically distributing tokens, which prevents any single expert from being overwhelmed, something that could lead to computational bottlenecks and reduced throughput. A good load-balancing strategy will also improve training stability. Load imbalance can introduce instability during the training process: when some experts are heavily loaded and others are underloaded, the learning signals and updates to the model parameters can become skewed, making convergence very difficult. There is also enhanced generalization: if some experts are consistently underutilized, they don't receive enough training data to learn meaningful specializations, and this lack of specialization across the expert pool can lead to a model that doesn't generalize well to diverse inputs. And the last one is optimized parallel training: in distributed training scenarios where experts reside on different devices, for example GPUs (which, by the way, is the case for DeepSeek), load imbalance can severely hinder parallel efficiency. If some devices are overloaded while others are underloaded, the overall training time increases.

There are some common load-balancing solutions. The most common one is to add an auxiliary load-balancing loss to the original training objective; this is a mechanism similar to regularization. An example loss function looks like the one reconstructed below. In it, N is the number of experts, T is the number of tokens, K is the number of activated experts, and s_{i,t} is the gating function output, normalized to between zero and one by a softmax; it is the probability of the t-th token choosing the i-th expert, and you can consider it an affinity score between expert i and token t. P_i is the average probability of choosing the i-th expert over the entire input sequence, i.e. the average gating score, f_i represents the fraction of tokens routed to the i-th expert, and alpha is a hyperparameter controlling the strength of the auxiliary loss.
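Putting those quantities together, the auxiliary balancing loss referenced on the slide has the standard form used in the DeepSeek papers (my reconstruction from the definitions above):

```latex
\mathcal{L}_{\text{balance}} = \alpha \sum_{i=1}^{N} f_i \, P_i,
\qquad
f_i = \frac{N}{K T} \sum_{t=1}^{T} \mathbb{1}\left(\text{token } t \text{ is routed to expert } i\right),
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}.
```

Minimizing this term pushes both the routed-token fractions f_i and the average gating scores P_i toward a uniform distribution over the experts.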
If we include this loss function in the original training objective and try to minimize it, it forces the model to spread the load evenly between different experts. DeepSeek V2 uses a similar approach, and it introduced different levels of load balancing: they have an expert-level balance loss, which encourages uniform token distribution across all experts, and a device-level balance loss, which encourages balanced computation across devices (GPUs). It also drops tokens that go beyond an expert's capacity. However, this auxiliary-loss load-balancing solution is not free. Since it adds a regularization term to the original loss function, its gradient can interfere with the LLM's quality objective. Just like all other regularizations, it can lead to worse model performance: a small auxiliary loss coefficient causes insufficient load balancing, and a larger alpha value impairs the model's performance. This is a study that was done in this paper. You don't have to know what MaxVio is; you just need to know that the higher the value is, the more imbalanced the load is. Ideally, we want to be at this point: no imbalance, and the quality is great. However, with the auxiliary loss approach, when there's no load balancing we have a very imbalanced load, and when the load-balancing hyperparameter is large, the model's quality is bad. (Perplexity measures how well an LLM predicts the next token; lower is better.)

What DeepSeek V3 introduces is a new strategy for load balancing: auxiliary-loss-free balancing. Conceptually, it's pretty straightforward. They add an additional bias for every expert in the MoE router on top of the original affinity score s_{i,t} to determine the top-k routing, and this bias is only used for routing, not for other things like the gating scores. This is the math behind it: when we calculate the gating scores, we still use the original s_{i,t}; however, when we determine which experts to pick, we consider s_{i,t} + b_i and check whether it falls within the top k. The bias values are updated at the end of each training step: increase the bias for experts that receive fewer than the average number of tokens, and decrease it for experts that receive more tokens. For experts that receive fewer than the average number of tokens, we increase the probability of them getting picked by the top-k function here, and for experts that receive more tokens, we decrease the probability of them getting picked. And no tokens are dropped, thanks to this effective load-balancing strategy. So with DeepSeekMoE and this auxiliary-loss-free strategy, DeepSeek V3 is doing a lot better. This is a diagram of how the auxiliary-loss-free strategy works: we're not adding additional gradients to the original loss function; instead, we have a bias-updating process here at the end of each training step, and it updates the routing score here with the expert bias. Since loss-free balancing does not produce any interfering gradients, it also elevates the upper bound of model performance gained from MoE training.
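Here is a minimal sketch of that selection-plus-bias-update loop, assuming PyTorch. The expert count, top-k, and the update step size (gamma) are made-up, and the sign-based update follows the description above rather than DeepSeek's exact code:

```python
import torch

def route_with_bias(affinity, bias, top_k=8):
    """Pick experts using biased scores, but gate with the original affinity.

    affinity: (num_tokens, num_experts) scores s_{i,t}
    bias:     (num_experts,) per-expert bias b_i, used ONLY for expert selection
    """
    _, topk_idx = (affinity + bias).topk(top_k, dim=-1)   # selection uses s + b
    gates = torch.gather(affinity, 1, topk_idx)           # gating uses original s
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk_idx, gates

def update_bias(bias, topk_idx, num_experts, gamma=0.001):
    """End-of-step update: raise the bias of under-loaded experts,
    lower it for over-loaded ones."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    avg = load.mean()
    return bias - gamma * torch.sign(load - avg)

# Toy step: 16 tokens, 32 experts (made-up sizes).
affinity = torch.rand(16, 32)
bias = torch.zeros(32)
topk_idx, gates = route_with_bias(affinity, bias, top_k=4)
bias = update_bias(bias, topk_idx, num_experts=32)
```

Note that the bias only changes which experts get selected; the gating weights that scale the expert outputs still come from the original affinity scores.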
So that is all about DeepSeek V3's MoE and its load-balancing strategy. Now, let me start with what KV cache is. I have another transformer series talking about KV cache in detail; please take a look if you want to know more, but I will try to summarize it a bit. KV cache is applicable when a new token's attention calculation only depends on itself and previous tokens, which is very normal for generative models nowadays, for example the decoder-only GPT family. The diagram I put here is encoder-decoder based, not decoder-only. The masked attention module here is critical for KV cache: it makes sure a token in a sequence only attends to previous tokens and itself, not future tokens. So what is the benefit of the KV cache? It drops the number of floating-point operations for each new token from O(T²) to O(T), where T is the length of the sequence. And what is the cost? The cost is extra memory. That memory amount is roughly 2 (one copy for keys, one for values), multiplied by the number of tokens, multiplied by the number of layers, multiplied by the number of KV heads, multiplied by the KV head dimension, times the bytes per element. It's using extra memory to speed up the calculation. So the KV cache is very important in speeding up transformer-based models, but it consumes a lot of memory, and it often becomes the bottleneck, especially as sequences get longer and longer. So people started thinking: can we do something about it? That's where MQA and GQA come in. MQA is multi-query attention, and GQA is grouped-query attention.

This is a graph of how multi-head attention works: each query head corresponds to its own key and value matrices, they do matrix multiplications, and we cache the keys and values, which consumes memory. If we want to cut that memory, the intuition is: can we group some of them? Can we merge two into one, or three into one, or, in the extreme, merge everything into one? That's exactly what MQA does. MQA is the extreme: all query heads correspond to one key head and one value head, and grouped-query attention is in the middle. I remember Llama is probably using GQA. MQA shares a single key head and a single value head across all query heads. Obviously it can significantly reduce memory usage, but as you can imagine, if previously you were using eight matrices to represent attention and now you only use one, there is going to be a lot of information loss, so it will impact the quality and negatively affect the accuracy of attention. GQA is in between: it shares keys and values only within a group of query heads, not across all queries. The quality impact is less, and you still get a memory gain. I'm not going to talk much about MQA; I think GQA is probably more useful. GQA gets quality closer to MHA by finding a better balance, and it maintains a speed comparable to MQA, which is faster than classical multi-head attention because of the reduced KV cache memory. If you reduce the KV cache memory, you also reduce the computational complexity, since there are fewer matrix multiplications, and you have lower memory use, so it's more suitable for large-scale models with memory constraints, which is exactly DeepSeek's situation. When you compare GQA with classical MHA, obviously there will still be quality loss; the balance between speed and quality is still a challenge, and you need to be really smart about the grouping strategy to minimize the quality loss. Then there's group division: the division and management of groups adds to the complexity of the GQA implementation. And because you introduce extra complexity, you introduce more hyperparameters, which have a huge impact on the model's performance and efficiency.
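To put the memory formula above into numbers and show where GQA's savings come from, here is a rough estimate. The layer count, head counts, context length, and FP16 storage here are made-up assumptions, not any specific model:

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, times tokens, layers, KV heads, and head dimension.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical 32-layer model, head_dim 128, 16k-token context, FP16 cache.
mha = kv_cache_bytes(16_384, 32, num_kv_heads=32, head_dim=128)  # every query head has its own K/V
gqa = kv_cache_bytes(16_384, 32, num_kv_heads=8,  head_dim=128)  # 4 query heads share one K/V head
print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
```

With these made-up sizes, grouping 4 query heads per KV head cuts the cache to a quarter of the full multi-head version.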
Before I dive into multi-head latent attention, I want to talk a little bit about another concept: rotary position encoding, or RoPE. Previously, in my transformer series, I talked a little about absolute position encoding, which is here. In today's modern architectures, RoPE, rotary position encoding, is more popular. The idea behind it is that you apply a rotation matrix which transforms a vector (x, y) into a new (x', y'), so you preserve the length of the original vector but change its angle. You can calculate this efficiently by introducing these 2×2 rotation blocks; this is the RoPE formula at the i-th position.

With that, let's take a look at DeepSeek's MLA implementation. On the left, this is a decoder-only architecture. In the feed-forward network, as I already talked about in the first episode, they use MoE to replace the feed-forward network, and in DeepSeek's case they have shared experts and routed experts. This time I want to talk about the attention part, which is also very important. Let's zoom into this MLA module. DeepSeek uses a low-rank joint compression mechanism called MLA, which projects the keys and values into a shared low-rank space. This equates to reducing the dimensions of the query, key, and value vectors: it compresses the original keys, values, and queries to a lower dimension, similar in spirit to what GQA does. As you can see, this is the original input, and after a compression matrix, the input is compressed into a much lower-dimensional matrix. One thing is that RoPE isn't simply preserved by the low-rank compression, so DeepSeek uses a pretty smart approach: they have a separate query-key pair that doesn't go through the compression matrix, and these queries and keys carry the position encoding, while the compressed queries and keys don't. So, as you can see, this one has RoPE and this one has RoPE, but some matrices don't have RoPE; they don't have any position encoding. We concatenate the parts with RoPE and without RoPE and do a dot product to get multi-head attention. This is how they solve the RoPE problem. In fact, with some additional calculation, RoPE could be combined with MLA directly: say W is the compression matrix; you can find a new matrix that satisfies this condition, and by using this new W matrix you ensure that the RoPE matrices commute with the low-rank compression matrix, since the RoPE matrices are composed of 2×2 blocks on the diagonal. So there are other ways to do MLA, but this is how DeepSeek decided to do it, and it has been working pretty well. They found a way to reduce the KV cache by compressing the dimension of the cached elements, and they found a smart way to include position encoding without hurting the ability to do KV caching. That's why they made their attention mechanism so cheap without hurting the quality much; the quality is still pretty good.
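Here is a minimal sketch of that compress-then-expand idea plus the separate RoPE-carrying key, loosely following the MLA equations in the DeepSeek-V2 paper. The dimensions are made-up, the RoPE function is a placeholder, and real MLA has more pieces (per-head splits, query-side compression, absorbed matrices):

```python
import torch
import torch.nn as nn

d_model, d_latent, d_rope = 1024, 128, 64          # made-up sizes; d_latent << d_model

W_dkv = nn.Linear(d_model, d_latent, bias=False)   # down-projection: this is what gets cached
W_uk  = nn.Linear(d_latent, d_model, bias=False)   # up-projection to keys
W_uv  = nn.Linear(d_latent, d_model, bias=False)   # up-projection to values
W_kr  = nn.Linear(d_model, d_rope,  bias=False)    # small decoupled key that carries RoPE

def apply_rope(x):
    # Placeholder for the rotary transform; a real implementation rotates
    # pairs of dimensions by a position-dependent angle.
    return x

h = torch.randn(16, d_model)                       # hidden states for 16 tokens

c_kv   = W_dkv(h)                                  # (16, 128) low-rank latent -> cached
k_rope = apply_rope(W_kr(h))                       # (16, 64) positional part -> also cached

# At attention time, full keys/values are reconstructed from the latent:
k = torch.cat([W_uk(c_kv), k_rope], dim=-1)        # compressed part (no RoPE) + RoPE part
v = W_uv(c_kv)

cached = c_kv.numel() + k_rope.numel()
full   = 2 * h.numel()                             # what a vanilla K+V cache would store
print(f"cached elements per layer: {cached} vs {full} for plain K/V")
```

The point is that only the small latent c_kv and the small RoPE key need to be cached; the full keys and values are reconstructed on the fly.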
First, I want to talk about what single-token prediction, or next-token prediction, is. On the right-hand side is the classic encoder-decoder transformer architecture, and on the left-hand side is an example I used in my transformer series: a machine translation task, translating "I am Martin" into Chinese. During inference, I send the English "I am Martin" to the encoder input. In the first round, the decoder input is a start token, and the output is the Chinese translation of "I", which is "wo". Then we take the output and send it to the decoder input again, because this is autoregression. Then we get the next token, which is the Chinese for "am", so now we have a new sequence, "wo shi", and we send that to the decoder input. It then outputs the next token again, the "Ma" in Martin. We combine everything and send it to the decoder input, and at last it outputs the "Ding" in Martin. This is a typical inference loop for next-token prediction.

In the next slide, let's take a look at what happens in next-token prediction training. In machine learning training, we always rely on minimizing a loss function and using backpropagation to update the weights. So what is the loss function? In this case, we use cross-entropy loss. It's a metric we use in machine learning to measure how well a model performs, by comparing the probability distribution predicted by the model to the actual distribution of the target labels. In math, the loss is L = −Σ_t log P(x_{t+1} | x_1, …, x_t). In other words, given a sequence of tokens x_1 to x_t (let's use an example like "I am Martin": three tokens forming an incomplete sentence), the probability of the model predicting the next token to be x_{t+1} is this term inside the formula, and we are trying to maximize this probability. In training, we already know what x_{t+1} is for each t, so we compare the model output with the actual target and try to make the model behave as closely as possible to the actual output. In this case, we maximize the probability, and maximizing the probability means minimizing this loss function, which is what we actually do in training: we minimize the loss, and that's how we make sure the probability is maximized. So that is how machine learning training works for next-token prediction.

But is there a more efficient way to use the same training data? The answer is yes. Let's take a look at multi-token prediction. This is a paper from FAIR, a research lab at Meta. They try to predict multiple tokens at a time, so the model predicts several future words simultaneously instead of just one at each position in a sentence. The model uses multiple heads to forecast the next several words all at once, working collaboratively to improve efficiency and coherence. In this graph, they have four heads, which means it tries to predict the next four tokens at a time. The loss formula in this case is: given the tokens from x_1 to x_t, instead of predicting x_{t+1}, we want to predict x_{t+1} through x_{t+n}; in this graph, n is four. The advantages of multi-token prediction are pretty clear. It's more efficient learning: multi-token prediction enables the model to learn more effectively from the same dataset; it uses the data in a more efficient way. It also has better quality in many cases: the model captures and predicts long-term patterns instead of just the next single token, so it has better quality on tasks requiring content generation, such as code generation for software engineers, joke or not. It also enables faster inference, though I don't see that used too much nowadays, probably because of the extra model complexity, or because there's actually some quality loss. I want to call this out because DeepSeek is not doing MTP for faster inference.
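To make the multi-head idea concrete, here is a rough sketch of a parallel multi-token prediction loss, assuming PyTorch. The shared trunk here is just an embedding plus one linear layer standing in for the transformer body, and the vocabulary size, dimensions, and number of heads are made-up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_future = 1000, 256, 4          # made-up sizes; 4 prediction heads

trunk = nn.Sequential(nn.Embedding(vocab, d_model),
                      nn.Linear(d_model, d_model), nn.GELU())
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

def mtp_loss(tokens):
    """Average cross-entropy over n_future heads: head k predicts x_{t+k+1}."""
    hidden = trunk(tokens)                        # (batch, seq, d_model)
    losses = []
    for k, head in enumerate(heads):
        offset = k + 1
        logits = head(hidden[:, :-offset])        # positions with a target k+1 steps ahead
        target = tokens[:, offset:]
        losses.append(F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1)))
    return torch.stack(losses).mean()

batch = torch.randint(0, vocab, (2, 32))          # 2 sequences of 32 tokens
print(mtp_loss(batch))
```

Head k is trained to predict the token k+1 positions ahead, so one pass over the data produces n_future training signals per position instead of one.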
So let's take a look at DeepSeek's variant. They do MTP in a slightly different way, but with a similar concept. The difference is that they do it sequentially instead of in parallel. They do this because they want to maintain the full causal chain, meaning the current token only depends on all the previous tokens; doing it in parallel would break that promise. For example, if we have t1, t2, t3 and we predict t4 and t5 simultaneously, then t5 is generated without the context of t4. As I said, they only use MTP to optimize training. As you can see, they have the main model here and the MTP module here, and the MTP module shares the output head and the embedding layer with the main model. During training, they want to speed up training convergence, so they use multi-token prediction; after the main model is trained, at inference time they actually drop this MTP module. The MTP modules are discarded and not used in inference; only the main model is used.

So let's take a look at what the MTP module consists of. It basically has multiple components. The first one is a shared embedding layer, which just turns tokens into word vectors. Then there's a shared output head, which linearly transforms the transformer block outputs and then applies softmax to compute the prediction probabilities; this is also shared. Then there's the transformer block, which is for attention. Then there's the linear projection matrix: because there's a concatenation here, it does a linear transformation to keep the dimensions consistent. And then there's the RMSNorm, which stabilizes the training process; if you want to learn more about normalization, please refer to my transformer series. This is only used in training, as I said, so the MTP module makes the training process more efficient, and once training is done, only the main model is used for actual inference.

Before I start with FP8, I want to talk about FP16, which is more popular in the AI world. FP16 refers to a 16-bit floating-point format. Normally, programs run with 32- or 64-bit numbers; if you're a programmer, you should be familiar with that. By using FP16 to represent model weights and activations, you get faster computation and significantly reduced memory usage. In summary, the advantages are: a smaller memory footprint, since FP16 uses half the memory compared to FP32, which enables larger batch sizes and models to fit into GPU memory, and is the main reason why it's so popular in GPUs and AI training; faster computation, since many GPUs are optimized for FP16 and the smaller memory usage allows accelerated matrix multiplication without significant degradation in accuracy; and acceptable precision, since although it halves the bits compared with 32 bits, it still retains enough dynamic range and numerical stability with proper handling of quantization. Quantization is the process of mapping continuous, infinite values to a smaller set of discrete, finite values. You're going to see quantization quite a bit later, so I want to explain it a little. Usually when you convert higher-bit values to lower-bit values, you can run into quantization underflows and overflows, and those are bad things; I'm going to go through them later in the following slides.
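To see what underflow and overflow look like, here is a tiny demonstration using FP16, simply because it's easy to reproduce in PyTorch; FP8 has the same failure modes with even fewer bits to spare:

```python
import torch

# Underflow: values below the smallest representable magnitude round to zero.
tiny = torch.tensor([1e-8, 1e-6, 1e-4])
print(tiny.to(torch.float16))      # the 1e-8 entry becomes 0.0; the others survive

# Overflow: values above the maximum representable magnitude become inf.
huge = torch.tensor([1e4, 1e5, 1e6])
print(huge.to(torch.float16))      # 1e4 is fine; 1e5 and 1e6 overflow to inf

print(torch.finfo(torch.float16))  # smallest/largest representable values for FP16
```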
Now we have FP16 down from FP32, and we get all those advantages. So can we do better? Let's try FP8. That's what a lot of other people also think. FP8's advantages: even smaller memory usage, another 50% off, and higher throughput, because training in FP8 enables faster computation; smaller data types reduce data transfer time and allow for more FLOPS, and FLOPS are what matter in training a model. It also comes with lower energy consumption, and energy consumption is actually one of the major concerns when training large models right now. However, the transition from FP16 to FP8 is tricky. FP8 has several unique challenges compared with FP16. One of the main reasons is that FP16 is an established standard in both software and hardware; it comes with a pretty stable ecosystem for most models and hardware, and FP8 doesn't have that yet, though it's starting to catch up.

I want to talk about the FP8 challenges. The first one is numerical instability. During backpropagation, which is one of the essential steps of machine learning, gradients can become extremely small, and FP8's reduced range can't represent these values accurately. This can lead to unstable or non-converging training. This is gradient underflow: a value during the quantization process becomes smaller than the smallest representable value allowed by the data type, FP8, essentially resulting in the value being rounded to zero due to the limitations of precision. This is a very bad thing, because small values are critical for fine-grained updates, and there are usually a lot of sensitive layers in large models where small values are very important; rounding them to zero is not acceptable. The next one is gradient overflow. It's similar to underflow, except that gradients can become extremely large, and the problem is they exceed the maximum representable value of the chosen data type, essentially overflowing the allowed range and resulting in an inaccurate representation. It's a type of error that can occur when a value becomes too large, and it's also a pretty serious problem.

How does DeepSeek solve this? They use block-wise quantization scaling. First of all, to prevent overflow, we usually scale the higher-precision weight or activation matrix down to FP8's representable range before quantizing. How do we do it? The easy way is just to divide by the maximum element; the maximum element is kept separately and used as a scaling factor in each matrix multiplication, and dequantization is basically the reverse of the same thing. However, this makes the quantization process very sensitive to outliers: the presence of a single very large value could force all the other values to be quantized to zero. If one value in the matrix is very large, say a million, but all the other values are single digits, then your maximum is a million, you have to divide everything by a million, and only that one entry keeps a meaningful value; everything else becomes basically zero, which is very bad. So what DeepSeek does is introduce block-wise and tile-wise scaling, in which each sub-matrix or sub-vector is scaled and quantized separately.
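Here is a simplified sketch of that idea: fake-quantization with one global scale versus one scale per block of 128 values. The symmetric integer grid used here is just an illustrative stand-in for FP8, and the block size is an assumption, not DeepSeek's exact recipe:

```python
import torch

def quantize(x, levels=127):
    """Fake-quantize to a symmetric grid and back, using one global scale."""
    scale = x.abs().max().clamp(min=1e-12) / levels
    return torch.round(x / scale) * scale

def quantize_blockwise(x, block=128, levels=127):
    """Same thing, but with one scale per block of 128 elements."""
    blocks = x.reshape(-1, block)
    scales = blocks.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-12) / levels
    return (torch.round(blocks / scales) * scales).reshape(x.shape)

x = torch.randn(1024)          # ordinary values...
x[0] = 1e6                     # ...plus one huge outlier

err_global = (quantize(x)[1:] - x[1:]).abs().mean()           # ignore the outlier itself
err_block  = (quantize_blockwise(x)[1:] - x[1:]).abs().mean()
print(f"mean error, one global scale: {err_global:.4f}")
print(f"mean error, per-block scales: {err_block:.4f}")
```

With a single global scale, the outlier forces every ordinary value to round to zero; with per-block scales, only the block containing the outlier is affected.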
The next FP8 challenge is error accumulation. This is related to the first challenge: errors introduced by the lower precision propagate and accumulate through the layers of the model, compounding over time and seriously hurting the quality of the model. Modern AI models, like transformers, are usually large, with a lot of hidden layers, so the errors accumulate very fast across those layers and become pretty big, and the quality impact is pretty bad. DeepSeek solves this with mixed-precision training. This is their mixed-precision training pipeline. You probably don't have to know all the details behind it, but it's pretty obvious that they have all kinds of precisions here: you see FP8, FP16, FP32, right? In summary, they have FP32 accumulation: matrix multiplications (matmul, or GEMM) are performed in FP8, but accumulation of the results is done in FP32 or higher precision. And they have a hybrid-precision model: weights are stored in FP8, while activations and gradients are stored in BF16, which helps preserve important information during training. FP32 is also used in some internal computations.

The last FP8 challenge I want to go through is hardware limitation. Hardware like the NVIDIA Hopper architecture, which is what DeepSeek uses, is starting to support more and more FP8, but there are still a lot of gaps for optimal performance. Low-precision matrix multiplications (GEMMs) often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is usually FP32. However, DeepSeek observed that the accumulation precision of FP8 matrix multiplication on the NVIDIA H800, which is what they use, is limited to retaining around 14 bits, significantly lower than FP32 accumulation. This gets even worse when the inner dimension is large. Basically, they found a hardware limitation that makes the precision even worse. Their solution is a hybrid approach: they use Tensor Cores plus CUDA cores. Usually everyone just uses Tensor Cores, and the CUDA cores are one layer down. To overcome the hardware limitation, DeepSeek moves some of the accumulation outside the Tensor Cores. The matrix multiplication kernel performs a series of four consecutive WGMMA (asynchronous warpgroup-level matrix multiply-accumulate) operations. You don't have to know the details; just know this is one of NVIDIA's hardware primitives for matrix multiplication and accumulation. The operations run inside the Tensor Cores, accumulating in the lower-precision format, but then the results are added into a separate register-backed accumulator tensor in FP32. This happens on the CUDA cores, and it mitigates the loss of accuracy. On the right is a graph of how they achieve this. This is how DeepSeek overcomes the hardware limitations so that they need fewer resources and can train fast within their budget. Hope this is helpful.