DeepSeek-V3 Explained by Google Engineer | Mixture of Experts | Multi-head Latent Attention | CUDA
DeepSeek-V3 Technical Analysis: MoE Architecture and Performance Optimization
Tags
Media Details
- Upload date
- 2025-05-31 20:15
- Source
- https://www.youtube.com/watch?v=jjckEViwt_o
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hello everyone, welcome. In this video, we are going to cover DeepSeek-V3's secrets of success, including mixture of experts, multi-head latent attention, multi-token prediction, and other hardware optimizations like FP8 mixed-precision training. It's been several months since DeepSeek-V3's release, but there was a recent release on March 24th with several cool improvements, and it looks awesome. There are some highlights to it. For example, a major boost in reasoning performance: you can see the performance is better than the original V3, Llama, Qwen, GPT-4.5, and Sonnet 3.7 on multiple benchmarks. It also has stronger front-end development skills, for example this boost in the hexagon test, and it has smarter tool-use capabilities. This great update reminds me that I promised to do a collection version of my DeepSeek-V3 deep dive, so here I am. In this collection version I've done some updates to my old content, including a complete redo of MoE. Hope you enjoy it. Enough talking, let's dive into it. There are many reasons why DeepSeek-V3 is efficient, good, and cheap, and the mixture-of-experts architecture is definitely one of the most important ones. Mixture of experts is an architecture where the computation for each input token is handled by a combination of multiple specialized subnetworks called experts. Instead of using a single large, dense network for all inputs, MoE selectively activates only a subset of these experts based on the input. This architecture originally came out earlier than the famous 2017 transformer paper, also from Google. This is the architecture they gave in the paper. As you can see, there are several key components. The first one is the experts: individual neural networks, often smaller LMs or feed-forward networks, that are trained to specialize in different tasks. Then there is a router, also known as a gating network: a separate neural network that analyzes the incoming input and decides which experts are most relevant to process it. It essentially acts as a dispatcher. There's also the combination mechanism: the outputs from the selected experts are combined to produce the final output of the MoE layer. This is the typical flow for how mixture of experts works. First, the input is fed into the LLM with an MoE layer. The input gets into the MoE layer, which usually replaces the transformer's feed-forward network layer, and the router, which is the gating network, processes the input. Based on the input, the router assigns a weight or score to each expert, indicating its relevance to the current input. In this case, let's say we pick expert two and expert N minus one; the other experts are not selected. Typically only the top-k experts with the highest scores, where k is a predefined number, often a small value like one or two, are activated for that specific input. This is known as sparse MoE, which is more efficient. There are also dense MoE models where all experts contribute to the output, but with varying weights; I'm not touching that topic today. Then the activated experts independently process the input, and their outputs are later combined according to the weights assigned by the router to produce the final output of the MoE layer. So the benefits of using mixture of experts are pretty clear. It increases model capacity: MoE allows for creating models with a significantly larger total number of parameters compared to dense models with a similar computational cost.
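To make the routing flow just described concrete, here is a minimal PyTorch sketch of a sparse MoE layer. Sizes, class names, and the simple loop-based dispatch are purely illustrative assumptions, not DeepSeek's code: a linear router scores the experts, only the top-k run per token, and their outputs are blended with the renormalized gating scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE layer: a router scores every expert per token, only the
    top-k experts run, and their outputs are combined using the router's
    (renormalized) gating scores."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)   # the gating network
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)     # affinity of each token to each expert
        gate, idx = torch.topk(scores, self.top_k, dim=-1)
        gate = gate / gate.sum(dim=-1, keepdim=True)   # renormalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # combine the selected experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gate[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)                            # 4 tokens, hidden size 64
print(SparseMoELayer()(tokens).shape)                  # torch.Size([4, 64])
```

A production kernel would dispatch tokens to experts in parallel rather than loop, but the routing logic is the same.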
Each expert contributes its own set of parameters, leading to a high overall capacity to learn complex patterns. It improves efficiency during inference: only a small fraction of the total parameters is actually used for each input, which leads to faster inference speeds and lower computational cost compared to running a dense model with the same number of parameters. The third one is specialization and better performance. By having experts specialize in different aspects of the data, the model can potentially achieve better performance on a wider range of tasks and handle diverse types of input more effectively; each expert can become highly proficient in its area of specialization. And the last one is faster pretraining: due to the sparse activation of experts, MoE models can be pretrained more efficiently than dense models with the same capacity. This is the architecture detail of mixture of experts. We have gone through the transformer architecture in my previous videos; if you're interested, you can take a look. For an MoE transformer, we simply replace the feed-forward network layer of the transformer model with an MoE layer. An MoE layer, as we discussed, is composed of a gating network and a certain number of experts. So this is the replaced MoE layer, and if you expand it, this is how it looks: the input goes through self-attention and add & norm, then through the FFN and another add & norm. Let's go through an example. Say an input sequence containing two tokens runs through a transformer encoder. First it goes through the position embedding, then through the self-attention layer to get the attention scores, and then through add & norm with its own identity function. Where it previously would reach the FFN in the transformer encoder, it now reaches the MoE layer, and it's going to hit the router network first. The router, also known as the gating network, determines which subset of the experts gets activated. In the first token's case, the second expert is activated, and the router also produces a gating score right here. The final output for this token is a combination of the router's gating score and the output of the expert. After the MoE layer produces its output, it also goes through add & norm. It's similar for the second token: it goes through the position encoder, self-attention, add & norm, and then the gating network. In this case, the first expert gets activated, and the gating score is 0.8. With the combination of the gating score and the output of the first expert, we have the final output of the MoE layer, and then it goes through add & norm again. Next is the DeepSeekMoE that was introduced in DeepSeek-V2. This is the diagram of DeepSeekMoE, and as you can see, there are several differences from conventional MoE. First, with a small number of experts, different tokens can easily be routed to a single expert, creating load imbalance. To resolve this problem, instead of having a smaller number of large experts, DeepSeekMoE finely segments the expert layer into a larger number of smaller, more granular experts. In this case, they have 256 experts, and they pick the top eight.
The second problem is that tokens assigned to different experts could require common knowledge. To solve this and avoid redundancy, V2 isolates a set of experts specifically as shared experts. These shared experts are designed to capture common, general knowledge that is relevant across various inputs; in this case, one shared expert is used. There are several benefits of DeepSeekMoE compared to normal MoE. The first one is a higher degree of specialization, by using more, smaller experts and encouraging them to focus on narrower domains of knowledge. There is also improved parameter efficiency: by promoting specialization, the model can potentially achieve better performance with a smaller number of active parameters compared to regular MoE. It also reduces redundancy: by having dedicated shared experts handle common knowledge, the routed experts can focus on unique specializations. Previously I mentioned that load imbalance was a problem. It actually is critical for MoE to work properly. Without a proper load-balancing mechanism, we're going to run into route collapse, which means a small subset of experts receives a large number of input tokens while others remain underutilized. This hurts the learning of the underloaded experts, leading to poor overall model performance and generalization. A good balancing strategy also ensures all experts contribute meaningfully to the model's output by dynamically distributing tokens; that prevents any single expert from being overwhelmed, which could lead to computational bottlenecks and reduced throughput. A good load-balancing strategy will also improve training stability. Load imbalance can introduce instability during the training process: when some experts are heavily loaded and others are underloaded, the learning signals and updates to the model parameters can become skewed, making convergence very difficult. There is also enhanced generalization: if some experts are consistently underutilized, they don't receive enough training data to learn meaningful specializations. This lack of specialization across the expert pool can lead to a model that doesn't generalize well to diverse inputs. And the last one is optimized parallel training in distributed training scenarios where experts reside on different devices, for example GPUs (by the way, this is the case for DeepSeek). Load imbalance can severely hinder parallel efficiency: if some devices are overloaded while others are underloaded, the overall training time increases. There are some common load-balancing solutions. The most common one is to add an auxiliary loss for load balancing to the original training objective. This is a similar mechanism to regularization. The example loss function can be something like this: this is the loss for load balance; N is the number of experts, T is the number of tokens, and K is the number of activated experts. s_{i,t} is the gating function output, normalized to between zero and one with a softmax function; it is the probability of the t-th token choosing the i-th expert, which you can consider the affinity score between i and t. P_i is the average probability of choosing the i-th expert over the entire input sequence, i.e. the average gating score. f_i represents the fraction of tokens routed to the i-th expert. Alpha is a hyperparameter controlling the strength of the auxiliary loss.
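Assuming the loss takes the common α·N·Σ f_i·P_i form built from exactly these quantities (the precise coefficients DeepSeek-V2 uses may differ), here is a short sketch of how such an auxiliary balance loss can be computed from the router's outputs:

```python
import torch

def balance_loss(router_probs, topk_idx, alpha=0.01):
    """Auxiliary load-balancing loss in the common f_i * P_i form.

    router_probs: (T, N) softmax affinities s_{i,t} for T tokens, N experts.
    topk_idx:     (T, K) indices of the experts each token was routed to.
    Returns alpha * N * sum_i f_i * P_i, which is smallest when the routing
    fractions f_i and the average gate probabilities P_i are uniform."""
    T, N = router_probs.shape
    K = topk_idx.shape[1]
    P = router_probs.mean(dim=0)                       # P_i: average gate probability per expert
    routed = torch.zeros(N).scatter_add_(              # count routing slots per expert
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    f = routed / (T * K)                               # f_i: fraction of routing slots
    return alpha * N * torch.sum(f * P)

probs = torch.softmax(torch.randn(16, 8), dim=-1)      # 16 tokens, 8 experts
_, idx = torch.topk(probs, k=2, dim=-1)
print(balance_loss(probs, idx))
```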
If we include this loss function in the original training objective and try to minimize it, it forces the model to spread tokens evenly between the different experts. DeepSeek-V2 uses a similar approach. It introduces different levels of load balancing: an expert-level balance loss, which encourages uniform token distribution across all experts, and a device-level balance loss, which encourages balanced computation across devices (GPUs). It also drops tokens that exceed expert capacity. However, this load-balancing auxiliary loss is not free. Since it adds a regularization term to the original loss function, its gradient can interfere with the LM's quality objective. Just like other regularizations, it can lead to worse model performance: a small auxiliary loss coefficient causes insufficient load balancing, while a larger alpha value impairs model performance. This is a study that was done in a paper. You don't have to know what MaxVio is; you just need to know that the higher the value, the more imbalanced the load. Ideally we want to be at this point: no imbalance and great quality. However, with the auxiliary loss approach, when there is no load balancing we have a very imbalanced load, and when the load-balancing hyperparameter is larger, the model's quality gets worse (perplexity measures the model's ability to predict the next token). What DeepSeek-V3 introduces is a new load-balancing strategy: auxiliary-loss-free balancing. Conceptually, it's pretty straightforward. They add an additional bias for every expert in the MoE router on top of the original affinity scores s_{i,t} to determine the top-k routing, and this bias is only used for routing, not for anything else like the gating scores. This is the math behind it: when we calculate the gating scores, we still use the original s_{i,t}; however, when we determine which experts to pick, we consider s_{i,t} plus b_i, which has to be within the top k. The bias values are updated at the end of each training step: increase the bias for experts that receive fewer than the average number of tokens, and decrease it for experts that receive more. For experts that receive fewer than the average number of tokens, we increase the probability of them getting picked by the top-k function, and for experts that receive more tokens, we decrease it. And no tokens are dropped, thanks to this effective load-balancing strategy. So with DeepSeekMoE and this auxiliary-loss-free strategy, DeepSeek-V3 is doing a lot better. This is a diagram of how the auxiliary-loss-free strategy works: we are not adding additional gradients to the original loss function; instead, there is a bias-updating process at the end of each training step, and it updates the routing score with the expert bias. Since loss-free balancing does not produce any interfering gradients, it also raises the upper bound of model performance attainable from MoE training. So this is all about DeepSeek-V3's MoE and its load-balancing strategy. Now let me start with what the KV cache is. I have another transformer series covering the KV cache in detail; please take a look if you want to know more, but I will try to summarize it a bit. The KV cache is applicable when a new token's attention calculation only depends on itself and previous tokens, which is very normal for generative models nowadays.
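Before moving on to the KV cache, here is a rough sketch of the bias-only steering just described. The function names and the update speed `gamma` are illustrative assumptions; the point is that the bias shifts which experts land in the top-k, while the gating weights still come from the original affinities.

```python
import torch

def route_with_bias(affinity, bias, top_k):
    """Pick the top-k experts by the biased score s_it + b_i, but return the
    ORIGINAL affinities as gating weights: the bias only steers routing."""
    _, idx = torch.topk(affinity + bias, top_k, dim=-1)
    gate = torch.gather(affinity, -1, idx)
    return idx, gate

def update_bias(bias, tokens_per_expert, gamma=0.001):
    """End-of-step update: raise the bias of under-loaded experts and lower it
    for over-loaded ones. gamma (the update speed) is a hyperparameter."""
    avg = tokens_per_expert.float().mean()
    return bias + gamma * torch.sign(avg - tokens_per_expert.float())

n_experts = 16
bias = torch.zeros(n_experts)
affinity = torch.softmax(torch.randn(32, n_experts), dim=-1)   # 32 tokens
idx, gate = route_with_bias(affinity, bias, top_k=4)
load = torch.bincount(idx.reshape(-1), minlength=n_experts)    # tokens per expert this step
bias = update_bias(bias, load)                                  # used by the NEXT step's routing
```

Because the bias never enters the loss, no extra gradient flows back into the model, which is the whole point of the strategy.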
For example, the decoder-only GPT families. The diagram I put here is encoder-decoder based, not decoder-only. The masked attention module here is critical for the KV cache: it makes sure a token in a sequence only attends to previous tokens and itself, not future tokens. So what is the benefit of the KV cache? It reduces the number of floating-point operations, from O(T²) down to O(T), where T is the length of the sequence. And what is the cost? The cost is extra memory. That memory amount is 2 (for keys and values), multiplied by the number of tokens, multiplied by the number of layers, multiplied by the number of KV heads, and then by the KV dimension. It's using extra memory to speed up the calculation. So the KV cache is very important in speeding up transformer-based models, but it consumes a lot of memory, and it often becomes the bottleneck, especially as sequences get longer and longer. So people started thinking: can we do something about it? That's where MQA and GQA come in. MQA is multi-query attention, and GQA is grouped-query attention. This is a graph of how multi-head attention works: each query head corresponds to its own key and value matrices, they do matrix multiplications, and we cache the keys and values, which consumes memory. If we want to cut that memory, the intuition is: can we group some of them? Can we merge two into one, or three into one, or, in the extreme, merge everything into one? That's exactly what GQA and MQA do. MQA is the extreme: all query heads correspond to one key and one value head, while grouped query is in the middle; I remember Llama is probably doing something like two-to-one grouping. MQA shares a single key and single value head across all query heads, which obviously can significantly reduce memory usage. But as you can imagine, if previously you were using eight matrices to represent attention and now you only use one, there's going to be a lot of information loss, so it's going to impact the quality and negatively affect accuracy. GQA is in between: it shares keys and values only within a group of query heads, not across all queries. The quality impact is less, and you still get a memory gain. I'm not going to talk much about MQA; I think GQA is probably more useful. GQA's quality is closer to MHA by finding a better balance, and it maintains a speed comparable to MQA, which is faster than classical multi-head attention, by reducing the KV cache memory. If you reduce the KV cache memory, you also reduce the computational complexity (there are fewer matrix multiplications), and you get lower memory I/O. So it's more suitable for large-scale models with memory constraints, which is what DeepSeek deals with. When you compare GQA with classical multi-head attention, obviously there will still be quality loss, and the balance between speed and quality is still a challenge: you need to be really smart about the grouping strategy to minimize the quality loss. Then there's group division: the division and management of groups adds to the complexity of the GQA implementation. And because you introduce extra complexity, you introduce more hyperparameters, which have a huge impact on model performance and efficiency. Before I dive into multi-head latent attention, I want to talk a little bit about another concept: rotary position encoding, RoPE.
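Before that, the memory cost just described is easy to put into numbers. The sketch below uses the formula from the video with hypothetical model sizes (32 layers, 128-dim heads, fp16 cache), only to show how shrinking the number of KV heads, which is what GQA and MQA do, shrinks the cache:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    """KV cache memory: 2 (K and V) x tokens x layers x KV heads x head dim x bytes."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_el

# Illustrative numbers only (hypothetical 32-layer model, fp16 cache, 32k context):
cfg = dict(n_tokens=32_768, n_layers=32, head_dim=128, bytes_per_el=2)
for name, kv_heads in [("MHA (32 KV heads)", 32),
                       ("GQA  (8 KV heads)", 8),
                       ("MQA  (1 KV head) ", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name}: {gib:.1f} GiB per sequence")
```

With these toy numbers the cache goes from 16 GiB (MHA) to 4 GiB (GQA) to 0.5 GiB (MQA), which is exactly the trade-off against quality the video is describing.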
In my transformer series, I talked a little about absolute position encoding, which is here. In today's modern architectures, RoPE, rotary position encoding, is more popular. The idea behind it is that you apply a rotation matrix which transforms (x, y) into a new (x', y'), so you preserve the length of the original vector but change its angle, and this can be calculated efficiently using these 2x2 matrices. This is the RoPE formula, with the position baked in. With that, let's take a look at DeepSeek's MLA implementation. On the left, this is a decoder-only architecture; in the feed-forward network, as I already talked about in the first episode, they use MoE to replace the feed-forward network, and in DeepSeek's case they have shared experts and routed experts. This time I want to talk about the attention part, which is also very important. Let's zoom into this MLA module. DeepSeek uses a low-rank compression mechanism called MLA, which projects the keys and values into a shared low-rank space. This amounts to reducing the dimensions of the key and value vectors: it compresses the original keys, values, and queries to a lower dimension, similar to what GQA achieves. As you can see, this is the original input, and after a compression matrix it is compressed into a much lower-dimensional latent. One problem is that RoPE isn't simply preserved by the low-rank compression. That's why DeepSeek uses a pretty smart approach: they have a separate query/key pair that doesn't go through the compression matrices, and these queries and keys carry the rotary position encoding, while the decompressed queries and keys carry no positional encoding. As you can see, this one has RoPE and this one has RoPE, but some of the matrices don't have RoPE at all. We concatenate the RoPE part and the non-RoPE part and take the dot product to get multi-head attention. This is how they solve the RoPE problem. In fact, with some additional calculation, RoPE could be combined with MLA directly: say W is a compression matrix; you can find a new matrix satisfying a commutation condition, and using this new W matrix ensures the RoPE matrices commute with the low-rank compression matrix (the RoPE matrices are composed of 2x2 blocks on the diagonal). So there are actually other ways to do MLA, but this is how DeepSeek decided to do it, and it has been working pretty well. They found a way to reduce the KV cache by compressing the dimension of the cached elements, and they found a smart way to include position encoding without hurting the ability to use the KV cache. That's why their attention mechanism is so cheap, and it doesn't hurt quality that much; the quality is still pretty good. Next, I want to talk about what single-token prediction, or next-token prediction, is. On the right-hand side is the classic encoder-decoder transformer architecture, and on the left-hand side is an example that I used in my transformer series. It's doing a machine translation task: translating "I am Martin" into Chinese. While doing inference, I send the English "I am Martin" to the encoder input. In the first round, the decoder input is a start token, and the output is the Chinese translation of "I", which is "wo". Then we take the output and send it to the decoder input again, because this is autoregressive.
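Before going further into next-token prediction, here is a very rough sketch of the MLA idea just described: cache a small shared latent plus a small decoupled RoPE key, and rebuild full-dimension K and V on the fly. The dimensions, the fixed-angle `rope` helper, and the module layout are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

def rope(x):
    """Toy rotary encoding: rotate (even, odd) channel pairs by a fixed angle.
    Real RoPE uses position-dependent angles; a constant keeps the sketch short."""
    theta = torch.tensor(0.1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * torch.cos(theta) - x2 * torch.sin(theta),
                      x1 * torch.sin(theta) + x2 * torch.cos(theta)], dim=-1)

class LatentKV(nn.Module):
    """K/V are cached only as a small shared latent c_kv plus a small RoPE'd key
    part; full-dimension K and V are reconstructed at attention time."""
    def __init__(self, d_model=512, d_latent=64, d_rope=32):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)    # compression matrix: its output is cached
        self.k_rope = nn.Linear(d_model, d_rope)    # decoupled positional key path (bypasses compression)
        self.up_k = nn.Linear(d_latent, d_model)    # decompression for keys
        self.up_v = nn.Linear(d_latent, d_model)    # decompression for values

    def forward(self, h):
        c_kv = self.down(h)                         # cached: the shared latent
        k_pos = rope(self.k_rope(h))                # cached: small decoupled RoPE key
        k = torch.cat([self.up_k(c_kv), k_pos], dim=-1)   # full key, rebuilt on the fly
        v = self.up_v(c_kv)                                # full value, rebuilt on the fly
        return (c_kv, k_pos), (k, v)

h = torch.randn(10, 512)                            # 10 tokens
(c_kv, k_pos), (k, v) = LatentKV()(h)
print(c_kv.shape[-1] + k_pos.shape[-1], "cached dims per token vs", 2 * 512, "for full K+V")
```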
Then we get the next token, which is the Chinese for "am", giving us a new sequence, "wo shi", which we send to the decoder input. Then it outputs the next token again, the "Ma" in Martin. We combine everything, send it to the decoder input, and at last it outputs the "tin" in Martin. This is a typical inference flow for next-token prediction. In the next slide, let's take a look at what happens in next-token prediction training. In machine learning training, we always rely on minimizing a loss function and using backpropagation to update the weights. So what is the loss function? In this case, it is the cross-entropy loss, a metric we use in machine learning to measure how well a model performs, by comparing the probability distribution predicted by the model to the actual distribution of the target labels. In math, the loss is a sum over the sequence of the probability of token x_{t+1} given x_1 through x_t. In other words, given a sequence from x_1 to x_t (these are all tokens; let's use the example "I am ruthy", three tokens forming an incomplete sentence), this part of the formula is the probability of the model predicting the next token to be x_{t+1}, and we are trying to maximize that probability. In training, we already know what x_{t+1} is for each t, so we compare the model output with the actual target and try to make the model behave as close as possible to the actual output. In this case we maximize the probability, and maximizing the probability means minimizing this loss function, which is what we actually do in training: we minimize this loss function, and that's how we make sure the probability is maximized. So this is how machine learning training works for next-token prediction. But is there a more efficient way to use the same training data? The answer is yes, so let's take a look at multi-token prediction. This is a paper from FAIR, a research lab at Meta. They try to predict multiple tokens at a time: the model predicts several future words simultaneously, instead of just one at each position in a sentence. The model uses multiple heads to forecast the next several words all at once, working collaboratively to improve efficiency and coherence. In this graph, they have four heads, meaning it tries to predict the next four tokens at a time. The loss formula in this case is: given the tokens from x_1 to x_t, instead of predicting x_{t+1}, we want to predict x_{t+1} to x_{t+n}; in this graph, n equals three. The advantages of multi-token prediction are pretty clear. It's more efficient learning: multi-token prediction enables the model to learn more effectively from the same dataset; it uses the data in a more efficient way. It also gives better quality in many cases: the model captures and predicts long-term patterns instead of just the next single token, so it has better quality in tasks requiring content generation, such as code generation for software engineers, and that's a joke... or not. It can also give faster inference. I don't see this used much nowadays, probably because of the extra model complexity or because there's actually some quality loss. I want to call this out because DeepSeek is not doing MTP for faster inference.
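To contrast the two objectives just discussed, here is a small sketch of next-token cross-entropy versus a parallel multi-head MTP loss. Sizes are toy values and three extra heads are assumed purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d, T, n_future = 100, 32, 12, 3
hidden = torch.randn(T, d)                              # trunk outputs for positions 1..T
targets = torch.randint(0, vocab, (T + n_future,))      # ground-truth token ids
heads = [nn.Linear(d, vocab) for _ in range(n_future)]  # one output head per offset

# Standard next-token loss: head 0 predicts x_{t+1} from the prefix x_1..x_t.
loss_next = F.cross_entropy(heads[0](hidden), targets[1:T + 1])

# Multi-token prediction: head i predicts x_{t+1+i}; per-offset losses are averaged,
# so the same batch supervises several future positions at once.
loss_mtp = sum(F.cross_entropy(heads[i](hidden), targets[1 + i:T + 1 + i])
               for i in range(n_future)) / n_future
print(loss_next.item(), loss_mtp.item())
```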
So let's take a look at DeepSeek's variant. They do MTP in a slightly different way, but with a similar concept. The difference is that they do it sequentially instead of in parallel. They do this because they want to maintain the full causal chain; that means the current token only depends on all the previous tokens, and doing it in parallel would break this promise. For example, if we have t1, t2, t3 and we predict t4 and t5 simultaneously, then t5 is generated without the context of t4. As I said, they only use MTP to optimize training. As you can see, they have the main model here and the MTP module here, and the MTP module shares the output head and the embedding layer with the main model. During training, they want to speed up training convergence, so they use multi-token prediction; once the main model is trained, at inference time they actually drop the MTP module. The MTP modules are discarded and not used in inference; only the main model is used. So let's take a look at what the MTP module is doing. It basically consists of multiple components. The first one is a shared embedding layer, which just turns tokens into word vectors. Then there's a shared output head, which applies a linear transform to the transformer block outputs and then a softmax to compute the prediction probabilities; this is also shared. Then there's the transformer block, for attention. Then there's the linear projection matrix, which does a linear transformation because there is a concatenation here and we want to keep the dimensions consistent. And then there's RMSNorm, which stabilizes the training process; if you want to learn more about normalization, please refer to my transformer series. As I said, this is only used in training, so the MTP module makes the training process more efficient, and once training is done, only the main model is used for actual inference. Before I start with FP8, I want to talk about FP16, which is more popular in the AI world. FP16 refers to a 16-bit floating-point format. Normally, programs run with 32-bit or 64-bit numbers; if you're a programmer, you should be familiar with this. Using FP16 to represent model weights and activations allows faster computation and significantly reduced memory usage. In summary, the advantages are: a smaller memory footprint (FP16 uses half the memory of FP32), which enables larger batch sizes and models to fit into GPU memory and is the main reason it's so popular for GPU-based AI training; faster computation, since many GPUs are optimized for FP16 and the smaller memory usage allows accelerated matrix multiplication without significant degradation in accuracy; and acceptable precision: although it halves the bits compared with 32 bits, it still retains enough dynamic range and numerical stability with proper handling of quantization. Quantization is the process of mapping continuous, infinite values to a smaller set of discrete, finite values. You're going to see quantization quite a bit later, so I want to explain it a little: usually, when you translate higher-bit values to lower-bit values, you can run into quantization underflows and overflows, and those are bad things; I'll go through them in the following slides. Now we have FP16 instead of FP32, and we have all those advantages. So can we do better? Let's try FP8.
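Before getting into FP8, here is a sketch of what one such sequential MTP step could look like, under explicit assumptions: LayerNorm stands in for RMSNorm, a stock transformer encoder layer stands in for the real block, and the shared embedding and output head are simply passed in from the "main model". This is an illustration of the component list above, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """Sketch of one sequential MTP block: it reuses the main model's embedding
    and output head, concatenates the previous hidden state with the embedding
    of the next ground-truth token, projects back to d_model, and runs one extra
    transformer block. The whole module is dropped at inference time."""
    def __init__(self, d_model, shared_embed, shared_head):
        super().__init__()
        self.embed = shared_embed                      # shared with the main model
        self.head = shared_head                        # shared with the main model
        self.norm_h = nn.LayerNorm(d_model)            # stand-in for RMSNorm
        self.norm_e = nn.LayerNorm(d_model)            # stand-in for RMSNorm
        self.proj = nn.Linear(2 * d_model, d_model)    # linear projection after the concat
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, prev_hidden, next_tokens):
        x = torch.cat([self.norm_h(prev_hidden),
                       self.norm_e(self.embed(next_tokens))], dim=-1)
        h = self.block(self.proj(x))
        return self.head(h), h                         # logits one extra step ahead

d, vocab = 64, 1000
embed, head = nn.Embedding(vocab, d), nn.Linear(d, vocab)
mtp = MTPModule(d, embed, head)
logits, h = mtp(torch.randn(2, 16, d), torch.randint(0, vocab, (2, 16)))
print(logits.shape)                                    # (2, 16, vocab)
```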
That's what a lot of other people also think. FP8's advantages: even smaller memory usage, another 50% off; higher throughput, because training in FP8 enables faster computation, smaller data types reduce data transfer time and allow for more FLOPs, and FLOPs are what matter in training a model; and lower energy consumption, which is actually one of the major concerns of training large models right now. However, the transition from FP16 to FP8 is tricky; FP8 has several unique challenges compared with FP16. One of the main reasons is that FP16 is an established format in both software and hardware, with a pretty stable ecosystem for most models and hardware, and FP8 doesn't have that yet, though it's starting to catch up. So let me talk about the FP8 challenges. The first one is numerical instability. During backpropagation, one of the essential steps of machine learning, gradients can become extremely small, and FP8's reduced range can't represent these values accurately. This can lead to unstable or non-converging training. Underflow happens when a value, during quantization, becomes smaller than the smallest representable value allowed by the data type (FP8 here), essentially resulting in the value being rounded to zero due to the limitations of precision. This is a very bad thing, because small values are critical for fine-grained updates, and there are usually a lot of sensitive layers in large models where small values matter; rounding them to zero is not acceptable. The next one is gradient overflow. It's similar to underflow, except the gradients become extremely large: they exceed the maximum representable value of the chosen data type, essentially overflowing the allowed range and resulting in an inaccurate representation. It's the type of error that occurs when a value becomes too large, and it is also a pretty serious problem. How does DeepSeek solve this? They use blockwise quantization scaling. First of all, to prevent overflow, we usually scale the higher-precision weight or activation matrix down to FP8's representable range when quantizing. How? The easy way is to divide by the maximum element, keep that maximum separately, and use it as a scaling factor in each matrix multiplication. However, this makes the quantization process very sensitive to outliers: the presence of one very large weight could force other weights to be quantized to zero. If one value is very large, say one million, but all the other values are single digits, then your maximum is a million, you have to divide everything by a million, and only that one entry retains a meaningful value; everything else becomes essentially zero, which is very bad. So what DeepSeek does is introduce blockwise and tilewise scaling, in which each sub-matrix or sub-vector is scaled and quantized separately. The next FP8 challenge is error accumulation, which is related to the first challenge: errors introduced by the lower precision propagate and accumulate through the layers of the model, compounding over time and seriously hurting model quality. Modern AI models are usually large; transformers have a lot of hidden layers, so the errors accumulate very fast across those layers and become pretty big, and the quality impact is significant.
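Before moving on: a toy numpy sketch of the blockwise scaling described a moment ago. It assumes the FP8 E4M3 maximum of 448 and a tiny block size of 4 so the effect is visible in a few numbers; rounding stands in for the actual FP8 cast.

```python
import numpy as np

FP8_MAX = 448.0   # assumed max normal value of the FP8 E4M3 format

def quantize_blockwise(x, block=4):
    """Scale each block by its own max, so one huge outlier only distorts its
    own block instead of flushing the whole tensor toward zero."""
    out, scales = np.empty_like(x), []
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        scale = max(np.abs(chunk).max() / FP8_MAX, 1e-12)
        out[i:i + block] = np.round(chunk / scale)     # stand-in for the FP8 cast
        scales.append(scale)                           # scales are kept for the GEMM
    return out, scales

x = np.array([0.5, -1.2, 2.0, 0.7, 1e6, 3.0, -2.5, 0.1])
per_tensor = np.round(x / (np.abs(x).max() / FP8_MAX))
per_block, scales = quantize_blockwise(x)
print(per_tensor)   # one outlier: everything else rounds to ~0
print(per_block)    # values in the outlier-free block keep their resolution
```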
DeepSeek solves this with mixed-precision training. This is their mixed-precision training pipeline. You probably don't have to know all the details behind it, but obviously they have all kinds of precisions here: you can see FP8, BF16, and FP32. In summary, they have FP32 accumulation: matrix multiplications (matmul, or GEMM) are performed in FP8, but the accumulation of results is done in FP32 or higher precision. And they have a hybrid-precision model: weights are stored in FP8, while activations and gradients are stored in BF16; this helps preserve important information during training. FP32 is also used in some internal computations. The last FP8 challenge I want to go through is hardware limitations. Hardware like the NVIDIA Hopper architecture, which is what DeepSeek uses, is starting to support more and more FP8, but there are still a lot of gaps to optimal performance. Low-precision matrix multiplications, or GEMMs, often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is usually FP32. However, DeepSeek observed that the accumulation precision of FP8 matrix multiplication on the NVIDIA H800, which is what they use, is limited to retaining around 14 bits, significantly lower than FP32 accumulation. This is even worse when the inner dimension is large. Basically, they found a hardware limitation that makes the precision even worse. Their solution is a hybrid approach: Tensor Cores plus CUDA cores. Usually everyone just uses Tensor Cores, and CUDA cores are one layer down. To overcome the hardware limitation, DeepSeek moves some of the accumulation outside the Tensor Cores. The matrix multiplication kernel performs each series of four consecutive WGMMA (asynchronous warpgroup-level matrix multiply-accumulate) operations; you don't have to know the details, just know this is one of the NVIDIA hardware primitives for matrix multiplication and accumulation. The operations run inside the Tensor Cores, accumulating in the lower-precision format, but then the results are added into a separate register-backed accumulator tensor in FP32, and this happens on the CUDA cores, which mitigates the loss of accuracy. On the right is a graph of how they achieve this. This is how DeepSeek overcomes the limitation, so that they need fewer resources and stay fast within their budget. Hope this is helpful.
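As a closing illustration of the split accumulation, here is a numpy simulation of the idea. Float16 stands in for the Tensor Core's limited-precision accumulation path, and the block and group sizes are illustrative: partial sums are periodically promoted into a float32 accumulator, which is roughly the role of the CUDA-core register step described above.

```python
import numpy as np

def gemm_with_promotion(a, b, k_block=128, group=4):
    """Simulate split accumulation: partial dot products accumulate in low
    precision (float16 here), and every `group` blocks the partial sum is
    flushed into a full-precision float32 accumulator."""
    K = a.shape[1]
    acc32 = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    partial = np.zeros_like(acc32, dtype=np.float16)
    for n, k in enumerate(range(0, K, k_block)):
        tile = a[:, k:k + k_block] @ b[k:k + k_block, :]   # one WGMMA-sized chunk
        partial += tile.astype(np.float16)
        if (n + 1) % group == 0 or k + k_block >= K:       # every few chunks: promote
            acc32 += partial.astype(np.float32)
            partial[:] = 0
    return acc32

a = np.random.randn(8, 1024).astype(np.float16)
b = np.random.randn(1024, 8).astype(np.float16)
ref = a.astype(np.float32) @ b.astype(np.float32)
print(np.abs(gemm_with_promotion(a, b) - ref).max())       # small, despite fp16 partials
```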
Latest Summary (Detailed Summary)
Overview / Executive Summary
A Google engineer (who describes himself as a normal, regular software engineer) walks through the key technologies behind the DeepSeek-V3 model's success. The video argues that DeepSeek-V3's strong performance, high efficiency, and cost-effectiveness come mainly from innovative architecture design and hardware optimization. The core techniques are: Mixture of Experts (MoE), which raises model capacity and specialization through many fine-grained experts plus shared experts, and uses an auxiliary-loss-free load-balancing strategy to resolve expert load imbalance without the interference that a conventional auxiliary loss introduces into model quality; Multi-head Latent Attention (MLA), which projects keys (K) and values (V) into a shared low-rank space, significantly reducing KV-cache memory, while cleverly integrating Rotary Position Encoding (RoPE) to preserve quality; Multi-Token Prediction (MTP), used to optimize training by sequentially predicting several future tokens to speed up convergence, with the module discarded at inference time; and hardware optimizations around FP8 precision, including blockwise quantization to handle FP8's numerical instability (overflow and underflow), mixed-precision training to balance compute speed against accuracy, and CUDA-core optimization that combines Tensor Cores and CUDA cores to work around the limited FP8 accumulation precision of NVIDIA H800 hardware. The engineer stresses that together these techniques give DeepSeek-V3 significant gains in reasoning performance, development capability, and tool use.
DeepSeek-V3 Technical Analysis
Introduction and Recent Updates
- Presenter: someone who describes himself as a "normal regular SWE".
- Goal: share the secrets behind DeepSeek-V3's success, including MoE, MLA, MTP, and hardware optimizations such as FP8 mixed-precision training.
- Recent release (March 24, 2025): a major DeepSeek-V3 update with significant improvements.
- Big boost in reasoning performance: outperforms the original V3, Llama, Qwen, GPT-4.5, and Sonnet 3.7 on multiple benchmarks [model names inferred from pronunciation, not fully certain].
- Stronger front-end development skills: for example, a standout result on the "hexagon test".
- Smarter tool-use capabilities.
- Nature of the video: a "collection version" of the presenter's earlier DeepSeek-V3 deep dive, with the MoE part completely redone.
Mixture of Experts (MoE)
- Core idea: each input token is processed by a combination of specialized subnetworks (experts) rather than by one large dense network; only a subset of experts is activated per input.
- Origin: proposed by Google, predating the 2017 Transformer paper.
- Key components:
- Experts: independent neural networks (typically small LMs or feed-forward networks), each specializing in different tasks.
- Router / gating network: analyzes the input and decides which experts are most relevant, acting as a dispatcher.
- Combination mechanism: the outputs of the selected experts are combined into the final output of the MoE layer.
- Workflow:
- The input enters the MoE layer (which usually replaces the feed-forward network layer in a transformer).
- The router processes the input and assigns a weight or score to each expert.
- Typically only the top-K highest-scoring experts are activated (K is usually small, e.g. 1 or 2); this is sparse MoE, which is more efficient. Dense MoE, where all experts contribute, also exists but is not covered in the video.
- The activated experts process the input independently.
- Their outputs are combined according to the router's weights to produce the MoE layer's final output.
- Advantages of MoE:
- Higher model capacity: far more total parameters than a dense model at a similar computational cost.
- Better inference efficiency: only a small fraction of parameters is used per input, giving faster inference and lower compute cost.
- Specialization and better performance: experts focus on specific aspects of the data, handling diverse tasks and inputs better.
- Faster pretraining: thanks to sparse activation, MoE models pretrain more efficiently than dense models of the same capacity.
- MoE Transformer architecture: the transformer's feed-forward network layer is replaced by an MoE layer.
- An MoE layer consists of a gating network and a number of experts.
- Example: an input sequence goes through token embedding, self-attention, and Add & Norm, then enters the MoE layer. The router selects which expert to activate for each token (e.g. token 1 activates expert 2, token 2 activates expert 1) and produces a gating score. The final output is the combination of the gating score and the expert output, followed by another Add & Norm.
- DeepSeek MoE (introduced in DeepSeek-V2):
- Problem 1: with few experts, tokens easily pile up on a single expert, causing load imbalance.
- Solution: finely segment the expert layer into many more, smaller, more granular experts. In the example given, 256 experts with the top 8 selected.
- Problem 2: different experts may need the same common knowledge, causing redundancy.
- Solution: isolate some experts as shared experts that capture general knowledge relevant across inputs. In the example given, 1 shared expert is used.
- Advantages of DeepSeek MoE:
- Higher degree of specialization.
- Improved parameter efficiency.
- Reduced redundancy: shared experts handle common knowledge, so routed experts can focus on unique specializations.
Auxiliary-Loss-Free Load Balancing
- Why load balancing matters:
- Prevents route collapse: a few experts receive most tokens while others stay underutilized, hurting learning.
- Ensures all experts contribute meaningfully, avoiding compute bottlenecks and improving throughput.
- Improves training stability.
- Enhances generalization: underutilized experts never learn useful specializations.
- Optimizes parallel training: with experts spread across devices (GPUs, as in DeepSeek's case), load imbalance severely hurts parallel efficiency.
- Common solution: auxiliary loss
- Similar to regularization: an auxiliary loss term is added to the original training objective.
- Example loss function (one common form): L_balance = α * N * Σ_i f_i * P_i, where s_{i,t} is the gating (softmax) output, P_i is the average probability of choosing expert i, f_i is the fraction of tokens routed to expert i, and α controls the strength of the loss.
- DeepSeek-V2 takes a similar approach: an expert-level balance loss and a device-level (GPU) balance loss, plus dropping tokens that exceed expert capacity.
- Problems with the auxiliary loss: its gradient can interfere with the language model's own quality objective. A small α gives insufficient load balancing, while a large α hurts model performance. The video cites a study showing that with no load balancing the load is highly imbalanced, while increasing α degrades model quality (measured by perplexity).
- DeepSeek-V3's strategy: auxiliary-loss-free
- Core concept: add an extra bias term B_i to each expert's original affinity score S_it in the MoE router; the bias is used only for the top-K routing decision and does not affect the gating-score computation.
- The routing decision is based on S_it + B_i.
- The gating score still uses the original S_it.
- Bias update: at the end of each training step, increase the bias of experts that received fewer tokens than average, and decrease the bias of experts that received more.
- Effect: effective load balancing without dropping any tokens.
- Advantage: since no interfering gradients are produced, it raises the upper bound of model performance achievable from MoE training.
Multi-head Latent Attention (MLA)
- KV cache:
- Applicable when a new token's attention depends only on itself and previous tokens (typical of autoregressive models such as decoder-only GPT). The masked attention module is the key enabler.
- Benefit: reduces compute from O(T²) to O(T) (T = sequence length).
- Cost: extra memory, roughly 2 * num_tokens * num_layers * num_kv_heads * kv_dimensions.
- Importance: speeds up transformer models, but memory becomes the bottleneck for long sequences.
- MQA (Multi-Query Attention) and GQA (Grouped-Query Attention):
- MQA: all query heads share a single key (K) and value (V) head. Saves a lot of memory but causes "a lot of information loss", hurting quality.
- GQA: query heads share K/V heads per group, a better trade-off between quality and memory. The video mentions "Llama probably does roughly 2-to-1 GQA" [uncertain].
- GQA advantages: quality close to MHA, speed close to MQA, a smaller KV cache, lower compute complexity, and less memory I/O. Well suited to memory-constrained large models (like DeepSeek).
- GQA challenges: some quality loss remains, the grouping strategy and group management add implementation complexity, and more hyperparameters are introduced.
- Rotary Position Encoding (RoPE):
- More popular than absolute position encoding in modern architectures.
- Idea: apply a rotation matrix that preserves the vector's length while changing its angle; it can be computed efficiently with 2x2 block matrices (the rotation is written out explicitly after this list).
- DeepSeek's MLA implementation:
- Architectural context: a decoder-only architecture with MoE replacing the FFN layers.
- MLA mechanism: a low-rank compression mechanism that projects K and V into a shared low-rank space, reducing the dimensions of the query (Q), K, and V vectors, with an effect similar to GQA.
- The challenge: "RoPE is not simply preserved by the low-rank compression."
- DeepSeek's solution:
- A separate Q/K pair (carrying RoPE) bypasses the compression matrices.
- These RoPE-carrying Q/K components are combined with the decompressed, position-free Q/K. The video describes it as: "we concatenate the part with RoPE and the part without RoPE, then take the dot product to get multi-head attention." [The original wording is somewhat vague here, but it emphasizes the special handling of RoPE.]
- (An alternative approach mentioned, not what DeepSeek explicitly does): find a new compression matrix W' that commutes with the RoPE matrices.
- Result: the KV cache shrinks because the cached elements are compressed in dimension, position encoding is integrated cleverly, and the attention mechanism becomes cheap with little quality loss.
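For reference, the 2x2 rotation mentioned in the RoPE bullet above, written out for one channel pair at position $m$ (this is the standard RoPE form from the RoFormer paper; DeepSeek's exact base and dimension split are not specified in the video):

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d}
$$

Each pair of channels is rotated by a position-dependent angle, so vector lengths are preserved and relative positions show up as angle differences in the dot product.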
Multi-Token Prediction (MTP) Training
- Next-token prediction:
- Inference: autoregressive, generating one token at a time.
- Training: minimize cross-entropy loss, i.e. maximize the probability the model assigns to the next ground-truth token X_t+1 given X_1...X_t.
- Multi-token prediction (MTP):
- Concept (from a FAIR / Meta paper): the model predicts several future tokens at once, using multiple heads to forecast the next N tokens simultaneously.
- Loss function: maximize the probability the model assigns to the future token sequence X_t+1...X_t+N given X_1...X_t.
- MTP advantages:
- More efficient learning: makes better use of the same dataset.
- Often better quality: captures and predicts longer-range patterns, useful for content generation (e.g. code generation).
- Faster inference: the speaker is skeptical here, noting it is "not common nowadays, probably because of extra model complexity or actual quality loss. DeepSeek does not use MTP for faster inference."
- DeepSeek's MTP variant:
- Characteristic: predictions are made sequentially rather than in parallel.
- Reason: to "maintain the full causal chain". Parallel prediction would, for example, generate T_5 without the context of T_4.
- Purpose: used only to optimize training and speed up convergence.
- Architecture: a main model plus an MTP module; the MTP module shares the output head and embedding layer with the main model.
- Workflow: the MTP module accelerates convergence during training; once training is complete, the MTP module is discarded and only the main model is used for inference.
- MTP module components: shared embedding layer, shared output head, a transformer block, a linear projection matrix (needed because of a concatenation, to keep dimensions consistent), and RMSNorm (to stabilize training).
- Effect: the MTP module makes the training process more efficient.
Hardware Optimization: FP8 Precision, Quantization, and Mixed-Precision Training
- FP16 (16-bit floating point):
- Advantages: small memory footprint (half of FP32), fast computation (GPU-optimized), acceptable precision loss.
- Quantization: the process of mapping continuous, infinite values to a smaller set of discrete, finite values; it can cause underflow and overflow.
- FP8 (8-bit floating point):
- Advantages: even smaller memory footprint (half of FP16), higher throughput (faster computation, less data-transfer time, more FLOPs), lower energy consumption.
- FP8 challenges:
- Numerical instability / underflow: gradients in backpropagation can become extremely small; FP8's limited range cannot represent them accurately, so values get rounded to zero. This is fatal for fine-grained updates and sensitive layers.
- Gradient overflow: gradients become too large and exceed FP8's maximum representable value.
- Error accumulation: low-precision errors propagate and compound across many layers, seriously degrading model quality.
- Hardware limitations: covered below.
- DeepSeek's answer to overflow/underflow: blockwise quantization
- The usual approach: to prevent overflow, scale the higher-precision weight or activation matrix into FP8's representable range. Simply dividing by the maximum element makes quantization sensitive to outliers (one very large weight can force everything else to be quantized to zero).
- DeepSeek's approach: introduce blockwise and tilewise scaling, so each sub-matrix or sub-vector is scaled and quantized separately.
- DeepSeek's answer to error accumulation: mixed-precision training
- Pipeline:
- FP32 accumulation: matrix multiplications (MatMul/GEMM) run in FP8, but results are accumulated in FP32 or higher precision.
- Mixed-precision model: weights are stored in FP8, while activations and gradients are stored in bfloat16 (BF16) [the transcript says "bfc", inferred to be bfloat16]. Some internal computations also use FP32.
Optimization with CUDA Cores
- The FP8 hardware-limitation challenge:
- Although hardware such as NVIDIA Hopper (which DeepSeek uses) is beginning to support FP8, performance gaps remain.
- Low-precision matrix multiplication (GEMM) is prone to underflow, and its accuracy depends heavily on high-precision accumulation (usually FP32).
- DeepSeek's observation: on the NVIDIA H800 (the hardware DeepSeek uses), FP8 matrix-multiplication accumulation retains only about 14 bits of precision, far below FP32 accumulation, and the problem worsens as the inner dimension grows.
- DeepSeek's solution: combine Tensor Cores and CUDA cores
- Strategy: "move part of the accumulation out of the Tensor Cores".
- Concretely: the matrix-multiplication kernel executes groups of four consecutive WGMMA (asynchronous warpgroup-level matrix multiply-accumulate) operations, accumulating inside the Tensor Cores in the lower-precision format.
- The partial results are then added into a separate register-backed FP32 accumulator tensor, a step that runs on the CUDA cores.
- Effect: the precision loss is mitigated.
Key Takeaways
DeepSeek-V3's success is the result of several advanced techniques working together. At its core, a carefully designed Mixture of Experts (MoE) combined with an innovative auxiliary-loss-free load-balancing strategy balances model capacity against efficiency. Multi-head Latent Attention (MLA) optimizes the KV cache and cuts memory consumption. Multi-Token Prediction (MTP) accelerates training convergence. At the hardware level, blockwise quantization, mixed-precision training, and the coordinated use of Tensor Cores and CUDA cores overcome the challenges posed by FP8 precision. Together, these deliver significant gains in model performance, efficiency, and cost-effectiveness within a limited resource budget.