Stanford CS336 Language Modeling from Scratch | Spring 2025 | 04 Mixture of experts
This lecture introduces the mixture-of-experts (MoE) architecture for language models. It notes that MoE has become a key technique for building high-performance large language models in 2025 (e.g., Grok, DeepSeek, Llama 4), achieving better performance than dense models at a similar compute (FLOPs) budget.
The core idea of MoE is to replace the feed-forward network (FFN) in a standard Transformer with multiple "experts" (copies of the FFN) plus a "router". On each forward pass, the router activates only a small subset of experts, which greatly increases the model's total parameter count without significantly increasing the actual compute. This sparse activation lets the model reach lower training loss and better metrics (e.g., perplexity) for the same training FLOPs.
The lecture highlights several advantages of MoE: 1) activating more parameters with less compute, improving model capacity and the ability to memorize knowledge; 2) outperforming dense models at the same training FLOPs; 3) providing "expert parallelism" as a new axis of model parallelism, making it easier to scale large models across many devices.
Although MoE adds systems complexity, such as storing expert weights and routing data, the performance gains and parallelization benefits have led to wide adoption. The lecture also notes that MoE was initially developed by closed labs such as Google, that Chinese teams (e.g., Qwen, DeepSeek) made important contributions to open-source MoE exploration and benchmarking, and that Western open-source groups have recently begun adopting the architecture as well. The lecture later walks through concrete cases such as DeepSeek V3.
Tags
Media details
- Upload date
- 2025-05-13 16:59
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: So we'll get started today. We're going to cover mixture of experts. Last year, this was kind of a fun bonus lecture that I threw together, but this year, thanks to lots of people doing MoEs, this has become a much more critical lecture, so I've added a lot of the recent developments. And at the end, we'll try to walk through DeepSeek V3 and try to understand what all the components are that make up a state-of-the-art open-source system, or at least what that looks like on the architecture side. So mixture of experts is how a lot of the most modern, high-performance systems today are built and deployed. There was the funny NVIDIA leak of GPT-4 potentially being revealed as GPT-MoE-1.8T. But more broadly, others like Grok and DeepSeek and Llama 4 have all now adopted a mixture-of-experts architecture. And it seems like at this point in 2025 the advantage of mixtures of experts over dense architectures is very much clear, right? At almost all compute scales, training a mixture-of-experts model, if you do it well, is going to give you benefits over a dense model. And so everyone seems to be doing it, in both the East and the West. So this will be an important thing to understand if you're trying to build the best model that you can for the flops that you have. So mixture of experts is very simple. It's a terribly named concept. I think you hear "mixture of experts" and you think, oh, there must be experts specialized for different domains doing different things, like there's a coding expert and an English expert and an other-languages expert. It is very far from that mental model. A mixture of experts is a type of fancy architecture that has several subcomponents called experts that are activated sparsely. And in particular, when you think about mixture of experts, you should be thinking about the MLPs. This is where all the action is, right? So an MoE architecture and a non-MoE architecture are going to be similar in almost all of their components except for one. If you look at this slide over here, these are the components of a standard transformer: you've got your self-attention, you've got your FFN. If you zoom in, in a dense model the feed-forward component is just there, one big block. In a sparse model, what you would do is take this FFN and split it up, or copy it, depending on how you're going to be setting up your MoE. You're going to have multiple copies, let's say, of your FFN, your fully connected network, and you're going to have a router that picks some smaller number of those in each forward pass, right, each inference. So this is the basic idea behind the MoE: we're going to replace this one big feed-forward on the left side with a selector layer and many smaller ones. And what's the advantage of this thing? Well, if it's sparsely activated, that is, let's say it only picks one expert, and an expert is the same size as your dense FFN, then the flops between the left side and the right side, the dense model and the MoE model, are the same, right? They're doing the same matrix multiplies as you do your forward pass. So you have more parameters without affecting your flops. And if you're a believer that what matters is having more parameters to, for example, memorize facts about the world, well, this is a great architecture. So you can kind of see the intuition behind MoEs.
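To make the flops-versus-parameters point concrete, here is a small back-of-the-envelope sketch (not from the lecture; all dimensions are made up) comparing a dense FFN with a top-1 MoE whose experts are each the same size as the dense FFN.

```python
# Hypothetical dimensions, purely for illustration.
d_model, d_ff, n_experts, top_k = 1024, 4096, 64, 1

# Parameters in one FFN block (ignoring biases): up-projection + down-projection.
ffn_params = 2 * d_model * d_ff

dense_params = ffn_params
moe_params = n_experts * ffn_params            # total parameters grow with the expert count

# Rough per-token FLOPs (~2 * parameters touched, counting multiply-adds).
dense_flops = 2 * ffn_params
moe_flops = 2 * top_k * ffn_params             # only the top_k routed experts run

print(f"params      dense={dense_params:,}  moe={moe_params:,}")
print(f"flops/token dense={dense_flops:,}    moe={moe_flops:,}")
# With top_k = 1, the MoE holds 64x the FFN parameters at the same per-token FLOPs.
```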
Hopefully that's all very clear, and you might wonder, okay, it makes sense that you can get more parameters per flop, but does that translate to actually better performance for the models that you're training? And there have been, at this point, many, many papers showing that at the same flop count, the same amount of training flops, you get better performance out of a mixture of experts than out of a dense model. So this is a nice paper. Today I'm going to go over a couple of the classic Google papers that put this field together, and this is one of them, by Fedus et al. in 2022, where they show that if you flop-match your training, so the same amount of compute used for training, then as you increase the number of experts, the training loss of your language model just keeps going down and down and down, right? So more experts, better. Of course, the experts aren't free. You need to store the memory for these experts, and when you do parallelism, you're going to have to think about routing your data into 256 separate experts, so there are going to be systems complexities. But if you're only thinking about flops, this is a great chart to see, because you have the same flops but you've gotten lower test loss for free. And you see the same thing reflected on the right side: as you train for longer and longer, the model with more experts, the switch-base with 128 experts, gets better perplexity faster. So hopefully that is quite clear. You might say, well, this is a 2022 paper; does this hold on modern architectures, at modern scales? It continues to very much be the case. AI2 had a very nice paper, OLMoE, which did a whole bunch of ablations and carefully controlled comparisons of dense versus MoE and other architectures, and they see exactly the same thing. So here on the left side, this is still from Fedus et al., you see the 7x speedup from having many experts. On the right side, this is the OLMoE comparison: the pink line is the MoE and the teal one is dense, and the training loss for the dense model goes down much more slowly than the MoE, right? So hopefully I have in some sense sold you on the value of MoEs and on learning this slightly new architecture. We're going to pay a price for all of this, but at least at the flops level this looks very compelling, right? So yes, question. speaker 2: [largely inaudible question about whether MoEs have non-flop costs that dominate wall-clock time]. speaker 1: So the question was: in the last lecture I was saying that even small non-flop operations, negligible flops, can be really big in wall-clock time; is anything in the MoE world going to look like that? And I think one of the drawbacks of MoEs, and why this isn't the standard thing that's being taught, let's say in 224N, is that there are significant systems complexities to making this thing efficient. So I'll get to that. It's possible to make these things very efficient, especially if each expert lives on a separate device so that you're routing data to different places. You can be very efficient when you do that, but it's not easy, right? There are a lot of infrastructural concerns, and you're going to see a lot of complexities to get this thing to work. But when it does work, you're putting all of your flops to use. Okay.
And then the last thing that I wanted to show: a lot of the companies really love MoEs because you get to present plots that look very compelling, like this one. This was from the DeepSeek V2 paper. On the x-axis, and this is a little bit of sleight of hand, this is only activated parameters, right? So this is only the parameters that are used for computation; you ignore all the deactivated experts. And the y-axis is MMLU performance. And we see DeepSeek V2: wow, very few activated parameters, really good MMLU performance, right? And so if you're mostly interested in training and inference flops, activated parameters is the name of the game, and you get really good performance here. And this is not just an ablation; this is a real system that someone spent a lot of money to train and deploy out in the wild. And we'll see this pattern recur in other examples as well. Oh, was there a question? All right. And then the systems thing that is also a benefit is that MoEs allow us to have another axis of parallelism. I'm going to get into parallelism in much, much more detail in the systems lectures, where I'll talk about how you take your model, cut it up into many small pieces, and lay them out across many different devices. But at a very high level: when you have experts, there's a very natural way to parallelize at the expert level. You have multiple different feed-forward blocks; you can take each of these experts and put them on a different device. And because experts are sparsely activated, all you have to do is take your token and route it to the appropriate device, and the computation will happen on that device. So it's a natural cutting point to be able to shard your model onto different devices. And so this is called expert parallelism, and this is another reason why MoEs are very popular: if you really want to parallelize really big models, this is a thing that you're going to have to do. And interestingly enough, I think MoEs were developed at Google, and many of the frontier labs, the closed labs, were doing it. But the open results actually came from China very frequently. Qwen and DeepSeek were doing a lot of MoE work last year, and it's only really recently that Western open-source groups have started to do more MoE work. So Mixtral, Grok, I guess Grok's not open, and now Llama is an MoE architecture, right? Llama 4 just got released, the latest and greatest; this is also a sparse MoE, and I'll talk about Llama 4 as well as I go through the lecture. As I said before, one of the starting points for this is that some of the Chinese groups, Qwen and DeepSeek, have actually done some really nice work benchmarking, understanding, and evaluating some of these MoE results. Qwen 1.5 was one of the first models that I knew of to have this kind of large-scale, well-tested, well-documented MoE. And what they did was take a Qwen 1.5 dense model and use a nice trick to upcycle it into a mixture of experts. That's a clever trick to take a dense model and then turn it into an MoE. And they showed significant gains, at least in terms of compute efficiency, while decreasing the total number of parameters relative to their 7B model.
DeepSeek, which is now famous but originally, when these papers were coming out, was not quite as famous, did some of the really foundational MoE work in the open-source world. A big part of this lecture is actually going to be tracing the trajectory of the DeepSeek MoE architecture. If you look at their original DeepSeekMoE paper, you'll see very nice comparisons, showing things like: what happens when you train a dense model with a particular amount of flops? What happens when you train a really naive MoE that doesn't do very smart routing? And then what happens if you use smarter routing, like the Switch-style MoE? And so you'll see all these very carefully controlled comparisons. And as you go from dense to sparse, so that's the leftmost column to the rightmost column, you see all these benchmark metrics very consistently improve for a fixed amount of flops. So this is very consistent. And one thing that I think almost everyone at this point has probably heard of is DeepSeek V3, which is in some sense a culmination of this line of work. But if you had been following MoEs and you were excited about this branch of neural networks and language modeling, you would have actually known about DeepSeek long before V3 got popular. And we'll see at the very end of this lecture that DeepSeek V3 is actually not very different from the very earliest DeepSeek MoEs. Architecturally, they had kind of nailed it way back when they were training the much smaller 2-billion-parameter models; they really just got the engineering right to get something that is actually remarkably good, which is their V3 model. Okay. So now I think I've spent quite a few minutes trying to really hype you up on MoEs, and they really are, I think, worth hyping up. They're very good. But there's a question of why haven't they been more popular? Why isn't this the standard thing we teach in NLP and language modeling classes? It's just that they're very complex and they're very messy. And I'm hoping that they get simplified over the next few years, but they still remain pretty nasty. So one of the things is that the infrastructure is very complex. And the biggest advantages of MoEs really happen when you're doing multi-node training, when you have to split up your model anyway; then it starts to make sense to shard experts across different devices. That's a very natural thing to do, but until you get to that point, maybe MoEs are not quite as good, right? Some of the earlier Google papers really talk about this tradeoff, where they say that when you get to these really big models that you have to split up, experts become uniquely good. There are also other things that are really tricky if you think about them carefully, right? This decision of which expert you route tokens to is a very difficult thing to learn. In deep learning, we really like differentiable objectives, very smooth things that we can take gradients of. Routing decisions are not differentiable, because we have to pick and commit to a particular expert. So if we're doing that, we're going to have a very tricky optimization problem, and the training objectives that make it work are either heuristic and/or unstable, right? And so we're going to have to really carefully engineer those to get them to work. So those are two reasons why you might not want to do this normally.
So what do MoEs look like? As I said at the start of this lecture, in the classic MoEs you take the densely connected layers, the FFNs, and you split them up or copy them, and you have sparse routing decisions among them. Of course, you could apply the same idea elsewhere. You could have a sparsely routed attention layer, and some people have done this; there have been a couple of papers and a couple of releases that have taken this approach, but it is actually quite rare to see it in the major model releases. I've seen people on the Internet saying that this approach is actually much more unstable and very difficult to train consistently. I haven't really seen the ablations to back that up, but certainly there haven't been many people training those kinds of models with MoE attention. So now I've told you about the basic architecture, right? It's really simple: you have a router of some kind, you route, and then you have different MLPs. So what are the things that might vary across different MoE choices? You might ask: how do we route? The routing function is an obviously important choice. How many experts, and how big should the experts be? That's another choice. And the final one is: how do we train this router, this non-differentiable objective that seems very difficult to train? Those are very important design questions, and we're going to go through each one, hopefully covering the design space of all these MoE variations. Okay, any questions before I get into each of these subcomponents? Good. Okay. So if you're interested in a broad overview of MoEs, at least circa 2022, there's a really nice survey or review paper by Fedus et al. in 2022 that covers a lot of this, and many of my figures are credited to that paper. Now, thinking about how we're going to route, or essentially match tokens to experts: this is the core component of an MoE. What an MoE does is this: tokens are coming in, you have your sequence that you're processing, and those tokens are going to be assigned to experts. Not all experts will process every token; that's the whole point of a sparsely routed MoE. So you can ask, how are these routing decisions made? You can have three different kinds of choices. You can have token choice, where each token has a routing preference over experts and we choose the top-k experts for each token. Or you can have expert choice, where each expert has a ranked preference over tokens and we choose the top-k tokens for each expert; this has the really nice benefit of being balanced over experts. And the last one is that you could solve some complicated optimization problem to make sure the mapping between experts and tokens is somehow balanced; this is global assignment. And just to give you a bit of a teaser here, almost all the MoEs do token-choice top-k. In the early days of MoEs, people tried many, many different things, spanning this whole spectrum of the design space of token routers. If you look at the big releases, they have all converged to basically one class of routing mechanisms, which is token-choice top-k.
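As a rough illustration of the difference between these routing schemes, here is a toy sketch (my own example, not the lecture's code) of token-choice versus expert-choice top-k over a single score matrix.

```python
import torch

torch.manual_seed(0)
num_tokens, num_experts, k = 6, 4, 2
# Router affinities for one batch: one row per token, one column per expert.
scores = torch.randn(num_tokens, num_experts).softmax(dim=-1)

# Token choice: each token picks its top-k experts.
# Expert loads can end up arbitrarily unbalanced.
_, experts_per_token = scores.topk(k, dim=-1)      # shape [num_tokens, k]

# Expert choice: each expert picks its top-k tokens,
# so every expert processes exactly k tokens (perfectly balanced).
_, tokens_per_expert = scores.topk(k, dim=0)       # shape [k, num_experts]

print("token choice  -> experts chosen by each token:\n", experts_per_token)
print("expert choice -> tokens chosen by each expert:\n", tokens_per_expert.t())
```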
So each token is going to rank-order experts by affinity, and then there's going to be a top-k choice for each one. And OLMoE, which I'll keep referring to throughout this lecture because they have a really nice series of ablations that are really nice to teach off of, has exactly this ablation. They compare token-choice routing versus expert-choice routing, and they show, if you look at validation loss, that token choice is much, much nicer behaved, much faster in loss decay. Yes. speaker 2: Is the routing a function of the token itself or of its position? speaker 1: It's a function of the hidden state, right? So the token is going to get processed with all the position embeddings and so on, and then the hidden state will come in and then it will be processed by the MLP. speaker 2: And for the other two, where the experts choose the tokens, can you also explain what you mean when you say it's more balanced across the experts? Is it still the current token sequence, but it's forcing them to... speaker 1: It's still going to be the same set of tokens, but really it's about the ranking selector function, right? In token choice, I'm just going to take the top-k amongst the columns; maybe the scores are even identical, I'm just taking the top-k amongst the columns. In expert choice, I'm going to take the top-k amongst the rows. And top-k amongst the columns is kind of nice because you might be able to say, oh, I can define a scoring function such that the score is how well each token gets processed by each expert, and token choice will route me to the best expert for that token. So that makes sense from a processing standpoint. But expert choice has the benefit that each expert gets exactly the same number of tokens, so if you're putting different experts on different devices, you've got balanced utilization. So there are different tradeoffs at play as you think about routing. Yes. So the question was, how does each token know which expert is good? That is exactly the role of the router, and I'll give you the router equation. But to give you a bit of a, not really a spoiler: the routers are much more lightweight than you might think. Your token, let's say, is represented by a vector x; that's your hidden, your residual stream coming in. Now x is going to get multiplied by a router weight matrix W, then you'll just take a sigmoid or something, and that's the score. So it's really just a vector-vector inner product, almost like an attention operation in a way. speaker 2: Is it top three, or top one, each time? speaker 1: So the question was, is k equal to one here? k is actually a hyperparameter, and different MoEs will choose different things. I'll talk about this again, but to give you the high-level intuition, the argument that the earliest MoE papers made was that k should be two or greater, because that way you get some exploration. If you're doing k equals one, maybe you're just always exploiting the best arm and you'll never know about the potential other things you could do. But if k is two, then maybe that second arm can give you a little bit of exploration information. So k equals two is the canonical choice, and k equals two actually continues to be very popular. That's right, that's right. So that would double the flops. And so when people talk about MoEs, they usually say things like x number of activated parameters.
And that would account for the fact that you're putting tokens into multiple MLPs. Yes. speaker 2: So when k is more than one, do you combine the outputs of the different experts? speaker 1: The question was, when k is more than one, do the outputs get combined? That's right. If you look at the diagram over there, you've got the router, it's routed to two MLPs up top, and then they get combined together right after, right? So that's exactly right. speaker 2: In that case, is it just a simple average, or a weighted average? speaker 1: So the question was, how does the aggregation happen? I'm going to go over the very common variants that people do. And really, in some ways, all you need to know is top-k in order to implement a high-performance MoE, but I'll give you the other variants because they're natural things you might think of. Top-k routing is what is used in most MoEs: token-choice top-k routing. How that works is that you have your residual-stream inputs x that go into a router. And as I said, a router is really kind of like the attention operation: there's a linear inner product and then a softmax, and then you pick the top-k most highly activated experts, and those outputs are gated. Depending on the implementation, you might weight the outputs based on this router weight or you might not, and then you output the weighted average or just a straight sum, depending on how your MoE implementation works. A lot of the MoE papers and methods use top-k: Switch Transformer, GShard, Grok, Mixtral, and all the DeepSeek variants use different top-k variants. Now here's a maybe very surprising fact, and this should really make you think about what's going on with MoEs: there are a lot of results showing that you don't even need a smart router at all. You can actually just use a hashing function at the very bottom to map these x's onto your experts. And even if you're doing hashing, so no semantic information at all, you will still get gains from a hashing-based MoE, which is pretty wild. Some of the earliest work on MoEs, I think, had the very smart idea, and in many ways the right idea if you're thinking about this top-down, of using RL to learn the routing behavior. The choice of where to route is a discrete decision, and RL is great for learning discrete decisions, so why not use RL to learn routing? It was used in some of the earliest work on mixture of experts. As far as I know, basically no one does this now: the compute cost is too prohibitive, and you already have stability issues, so you might not want to do that. There have also been a couple of papers that have explored things like solving linear assignment problems or optimal-transport-style problems. They're very elegant, but once again, the cost of doing this is much higher than the benefits it gives you in practice, I think, and it hasn't really been adopted. But there are a lot of really interesting things that people are doing like this to try to improve the routing.
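A tiny illustration of the hash-routing idea mentioned above: the expert is a fixed function of the token, with no learned router at all. This is just a sketch; the token ids and hash function are made up.

```python
num_experts = 8

def hash_route(token_id: int) -> int:
    # Any fixed hash of the token works; a modulus over the vocabulary id is the simplest.
    return token_id % num_experts

token_ids = [37, 4052, 9, 37, 128]
print([hash_route(t) for t in token_ids])   # the same token id always lands on the same expert
```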
So now I can point at this slide and really talk through how routing works in detail. This is the kind of top-k routing that almost everyone has converged to now. This is the router that's used in DeepSeek V1 and V2, and Qwen and Grok do almost exactly this, with a softmax directly at the bottom. DeepSeek V3, Mixtral, and DBRX don't have the softmax at the bottom; instead they softmax the g_i after the top-k. This is a very minor difference. So let's walk through what's going on here and try to reason about the behavior. At the very bottom, we've got our inputs; this is our u_t. And I would like to take this residual-stream input and process it through my MoE. The first thing I have to do is figure out which experts are going to be activated. How am I going to do that? Very similarly to attention: I take my u_t, which is my residual-stream input, and I take inner products with the e_i's. These are learned vectors, one per expert, that say, "I'm an expert that points in this direction," right? So I'm computing this inner product, the expert-input affinity, and I'm computing a softmax to determine, for each token, what the best experts are; once I normalize, this is s_{i,t}. Now I take the s_{i,t} and go through a top-k function. I only select the k best weights, and I use this as my gate, so I zero out everything else. Then I take the weighted average of the experts' outputs, add that to my original residual stream, and return it, right? So this is hopefully very familiar in terms of how a transformer works, with the only difference being this top-k routing piece. Is that clear to everyone how this thing works? Good. Excellent. So in some sense, the mechanics of the forward process of the routing are very simple. What is kind of mystifying is the fact that you can learn this very well, right? This is, in some sense, a fairly complicated set of things for a model to have to learn to do well.
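Here is a minimal PyTorch sketch of the top-k routing just described: scores from an inner product of the residual-stream input with per-expert vectors, a softmax, a top-k gate that zeros everything else, and a gated sum of expert outputs added back to the residual stream. The dimensions, module names, and the simple two-layer MLP experts are my own simplifications, not the lecture's or DeepSeek's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        # e_i: one learned routing vector per expert (this matrix *is* the router).
        self.expert_embed = nn.Parameter(torch.randn(n_experts, d_model) * d_model**-0.5)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, u):                                   # u: [num_tokens, d_model]
        s = F.softmax(u @ self.expert_embed.t(), dim=-1)    # s_{i,t}: token-expert affinities
        vals, idx = s.topk(self.k, dim=-1)                  # keep the k best experts per token
        g = torch.zeros_like(s).scatter(-1, idx, vals)      # gates; zero for unselected experts
        out = u.clone()                                     # residual connection
        for i, expert in enumerate(self.experts):
            mask = g[:, i] > 0
            if mask.any():                                  # run each expert only on its tokens
                out[mask] = out[mask] + g[mask][:, i:i+1] * expert(u[mask])
        return out

y = TopKMoE()(torch.randn(16, 512))                         # [16, 512] output
```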
Yes. speaker 2: So we're using softmax here. Previously we talked about one of the properties of softmax being that it pushes you toward pretty extreme weightings, toward a single max. It's not a hard max, but it leans that way. I'm having trouble with the intuition of applying softmax on top of, combining it with, the top-k, where you're getting multiple experts and then using something that pushes you toward choosing just one thing. speaker 1: Yeah. I think one way of thinking about the softmax is that its whole purpose here is just to make it so that when I average my experts later, it roughly sums to one. Don't think of the softmax as a max operation, even though that's literally the name; really, the softmax here is a normalize-to-one operation, and that normalization is going to make that a weighted average up top. The other thing that's very important: you might think, why can't I just get rid of the top-k? Why not just use the softmax and gate all the experts? Well, then you immediately lose the sparsity and efficiency aspect of this, right? You have to have top-k during training, otherwise you pay the training cost of all capital-N of your experts. This is the key thing about MoEs: we have to do all of this gymnastics to make sure that at both training time and inference time we have a sparse number of activated experts. That's why we go through the top-k. Okay, yes, from the back. speaker 2: Because you're doing the softmax first and then the top-k, the weights you get no longer have a guarantee of summing to one. speaker 1: So the question was, if you softmax first, you no longer sum to one. And yes, that's absolutely right: you no longer sum to one. And in some ways, there's no requirement that it has to sum to one, because the next layer can magnify it back up; there are layer norms everywhere. It's not as if it has to sum to one. But I think that is the reason why some of the other architectures basically move the location of the softmax. There's a kind of aesthetic choice about whether you really want that weight to be normalized to one or not. Yes. speaker 2: I'm wondering how the e vectors here relate to the weights of the feed-forward network. speaker 1: Okay, so the question was whether and how the e vectors relate to the feed-forward. They're not really tied in any way. The e vectors are just learned vectors; think of the e's as parameters for the router, right? They're separate objects from the FFN. speaker 2: Yeah, I was just wondering. speaker 1: Great. The next question was about how this compares to sampling from the softmax. You can sample from the softmax, and some methods actually do a kind of soft sampling from the softmax. Specifically, one of the Google papers has a procedure where they take the top element of the softmax and then randomly sample the second element proportional to the remainder of the softmax. That gives you more exploration, which is good. But the drawback is that if you don't sample at test time, you've now got a train-test mismatch. Okay, yes. speaker 2: Why not just renormalize up top, after the top-k? speaker 1: Was that the question? Right. And some models do that; some models do renormalize after the top-k, but that's a choice: some architectures do, some don't. It doesn't actually matter, because the scale can basically be adjusted post hoc, right? So there's no reason why it has to sum to one after the gating operation. Cool. Oh, sorry, yes, up there. speaker 2: So the first term in the sum, if g is approximately a probability vector, could that be seen as an expectation of the FFN? speaker 1: Actually, this is not an expectation of the FFN, because each FFN is a different FFN, and the gates are sparse. So it's more like a weighted selection operation over k, or actually capital-N, different FFNs. And then the u_t at the very end there: if you remember the transformer, that's the residual stream, right? I'm adding back the inputs because I want an identity connection through it. Okay. Oh, there's another. speaker 2: Why does the router have such a basic parametrization? What happens if you put more weights into your router? speaker 1: The question was, why is the router so basic? It seems like if you're going to have experts, it's important to route to the right experts, so why not invest more in that? There have been some ablations in some of the earlier Google papers on having MLP routers and more sophisticated things. I think the fuller answer is that the systems concerns weigh heavily: if you're using a lot of flops to make routing decisions, you have to pay for those flops, and so you'd have to get corresponding performance improvements just from better routing. And one other thing to appreciate is that there are really big limits to how well you can route, because the learning process for this routing is actually pretty dicey, right?
Because how are you going to get gradients for which routes are good or bad? Well, the only thing you have is that, if you have top-two, you can compare the two experts that you did evaluate, and you can push gradients into the s_{i,t}, because your g is a weight, and then the s_{i,t} can inform your inner products. But that's a very indirect way to be learning your affinities. So even if you make the router more complex, there's no guarantee that you're going to learn the optimal route, right? Great. Okay. So I think one of the great innovations of the DeepSeek MoE, which was very quickly adopted by all the other Chinese MoE releases, is this idea of both a shared expert and fine-grained experts. The basic MoE structure that was originally proposed is to take your dense architecture and kind of copy the experts over. In this case, if you have top-two routing, you're going to have twice the activated parameters of your original dense model: you take your FFN, copy it over, and activate k equals two. So this is what you might think of as the vanilla or basic MoE that you might start with. People realized fairly quickly that having lots of experts is good. And the logical next step beyond "lots of experts is good" is: I want lots of experts, but I don't want to pay the parameter cost for having lots of experts. And so DeepSeek basically argued that the right thing to do is to cut each expert up into smaller pieces, right? Remember, last lecture I was telling you that the golden rule, in some sense, is to take your hidden dimension and multiply it by four, and that gives you your projection dimension. So now, instead of multiplying by four, you might multiply by two. Now you have smaller matrices, you have more fine-grained experts, you can have twice as many of them, and you can take that logic much further to the extreme: you can slice by four or by eight and keep decreasing the size of your projection dimension. That's fine-grained experts, and there are drawbacks I'll talk about later; it doesn't come for free, so you have to be careful about how you structure these things. And then the other thing that has been studied and noted is that maybe it's helpful to have at least some MLP that can capture shared structure. Maybe there's just processing that always needs to happen no matter which token you're processing. In that case, it seems like a waste to do all this routing work and have all these parameters spread out everywhere, when we can just have one shared expert, or a few shared experts, whose job it is to handle all of this shared processing. So that's shared experts. And this setup of using fine-grained experts plus shared experts originally came out in DeepSeekMoE, although I think the original inspiration came from DeepSpeed-MoE and Qwen and others. Almost all of the open MoE releases since DeepSeek have adopted some set of these innovations, because it's quite clear that fine-grained experts, especially, are just really, really useful; that's kind of a no-brainer at this point.
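To make the fine-grained and shared-expert accounting concrete, here is a back-of-the-envelope sketch loosely echoing the DeepSeekMoE-style setup described above (quarter-sized experts, a couple of shared ones); the exact numbers are illustrative, not any model's real configuration.

```python
d_model = 2048
dense_d_ff = 4 * d_model                 # the usual ~4x rule of thumb from last lecture

slice_factor = 4                         # cut each expert's hidden dim by 4 -> "fine-grained"
expert_d_ff = dense_d_ff // slice_factor

n_routed, k_routed, n_shared = 64, 6, 2  # 64 routed experts, top-6, plus 2 shared experts

per_expert = 2 * d_model * expert_d_ff                 # up + down projection parameters
total_params = (n_routed + n_shared) * per_expert
active_params = (k_routed + n_shared) * per_expert     # shared experts always run

dense_params = 2 * d_model * dense_d_ff
print(f"dense FFN params:       {dense_params:,}")
print(f"MoE total FFN params:   {total_params:,}")
print(f"MoE active FFN params:  {active_params:,}  (~{active_params / dense_params:.1f}x dense)")
# 8 active quarter-sized experts ~= 2x the dense FFN's per-token compute.
```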
One of the things I really like about reading DeepSeek papers is that they do ablations; it's not like a lot of the more sales-y tech reports. They actually care about whether or not their methods work. And so they have this lovely ablation in the DeepSeekMoE paper, where the blue bar over here is GShard, a very basic vanilla implementation of an MoE. You can add one shared expert, that's the orange bar, and that gives you a big boost on some tasks and no boost on others. You can have fine-grained experts, that's the green and orange bars, and you get further boosts from that. And if you compare the blue to the orange, composing all of these differences gives you quite a big boost. So we can see that more experts and shared experts generally seem to help. Okay, yes. speaker 2: Quick question: when it says seven out of something, does that mean it's doing top seven? speaker 1: Yes, sorry, I should have explained: x out of y means x activated out of y total routed experts. That's right. And you can see the pattern here as well: as you increase the number of experts, you also often increase the number of activated experts, especially if you're doing fine-grained experts. Flops-wise it's free, right, because each expert is now smaller. Good. Okay. So OLMoE has basically corroborating evidence that shows really nicely that these things work. The bottom plot, which I'll start with because it's more decisive, shows fine-grained experts going from 8 to 32 to 64, mirroring in some sense the DeepSeek ablations. And you see very clear trends in the losses and other metrics: improvements going from 8 to 32 to 64 experts, right? Fine-grained experts are great. Shared experts, which is purple versus teal at the very top: you actually don't really see any gains there, at least in the OLMoE setup. So they actually end up going with no shared experts, even though the DeepSeek paper seemed to show more gains. So that one is maybe more mixed, given this follow-up, this third-party replication of these ideas. So at this point, you might be wondering what common configurations look like. I'm going to take a page out of last lecture's playbook of looking at a lot of the recent releases, looking at what people do, and trying to talk a little bit about the patterns that have arisen. Some of the early Google papers, GShard, Switch Transformer, ST-MoE, had really large numbers of routed experts, and there was a lot of really interesting stuff going on in those papers; I'd encourage you to read them. Some of them were on LSTMs and other kinds of architectures. Regardless, very quickly there was a period of eight to 16 experts, like Mixtral, DBRX, Grok, with two active experts. Those worked reasonably well. But then DeepSeekMoE, or DeepSeek MoE V1, comes out with the prototypical configuration I told you about: fine-grained experts, 64 of them, six actively routed, two shared experts, and each expert about one-fourth the size of a normally sized expert. Take that last column with a grain of salt, because I had to back those numbers out from config files and things like that; I'm not 100% sure about the exact ratios. So then we've got essentially Qwen 1.5, DeepSeek V3, MiniMax; these are Chinese MoEs. They follow essentially in the same footsteps as DeepSeek V1.
The specific numbers are different, but in the sense that they use fine-grained experts, and they often have shared experts, they're very similar to this original DeepSeekMoE configuration. OLMoE, MiniMax, and Llama are very recent MoEs; they definitely do all this fine-grained expert stuff, and Llama 4 also uses a shared expert. You see variations in configuration, but you see what's basically shared, which is this fine-grained experts idea; and especially for the big models, like Llama 4 and DeepSeek, very, very large numbers of routed experts, or sorry, not routed, total experts. Yes, the ratio is representing roughly how much each expert is sliced relative to the standard dense configuration. In terms of hyperparameters, you know that if you're following the rule of thumb, the ratio of your hidden dimension to your projection dimension in the MLP should be about one to four, or around one to 2.6 if you're doing a gated network, right? And so by looking at the hidden layers of these architectures, you can see how many times they sliced up that original feed-forward size. speaker 2: So one of the rows shows one-fourth, but then they have 64 of those experts, so that's still increasing their total parameters, right? speaker 1: Yeah, so you can think of this as roughly: they have 64 quarter-sized experts, so 16 normally sized experts' worth of parameters, and of course they have more parameters than the dense equivalent. They have six routed plus two shared, so they have eight total active experts at any time, each of which is quarter-sized. So you should think of them as roughly double the flops of a dense equivalent. Some arithmetic, but hopefully the math is clear and consistent. speaker 2: [inaudible question about some of the more unusual ratios]. speaker 1: So for some of the exotic ratios, I'm not quite sure why they're that way, but they are very precisely whole numbers when you take the ratios between the FFNs and the implied hyperparameters. So I think those are exactly the split counts of how much they were sliced, but I'm not sure why they have, say, one over 14. speaker 2: Does a smaller dimension mean they actually down-project in the MLP? speaker 1: Yeah, so you're asking whether they down-project. That's right: in some of them, the expert hidden dimension is actually smaller. I don't remember which models in particular, but in some of them, I do remember they're actually down-projections. speaker 2: What is the intuition for wanting more than one shared expert? speaker 1: Yeah. It does kind of seem like there was a period where some of the Chinese LLM companies tried many shared experts, and then people have come back to zero or one. And if you look at the OLMoE ablations, it's not quite clear that even one shared expert is decisively useful. I think the original motivation was that then you have equally sized experts: these are both quarter-sized experts, and now you have eight active experts total, so you can keep the sizes consistent. Otherwise, I don't really see a particular justification for why it should be two smaller ones versus one larger one. Okay, cool. So hopefully you now get a sense of how the routing works for a lot of these MoEs and how it's all set up; the forward pass, hopefully, you fully understand. Now we need to think about training. And training is pretty gnarly, right?
And the major challenge, as I foreshadowed earlier: when we train, we cannot turn on all the experts, because if we do that, then we pay the full flops cost of all the experts, right? Having a model that's, I don't know, 256 times more expensive to train is a total no-go. So we need train-time sparsity, but sparse gating decisions are obviously not differentiable. We now have a kind of annoying RL-ish problem. And so we could do any of these things: RL to optimize the gating policies; bandit-inspired things that use randomization to do exploration; or we can just have some heuristics that try to balance things out, like putting some loss terms in there and hoping things work out. Having gone through deep learning classes of many kinds, you can probably guess which one people use in practice. I'll talk about each of these three in turn. Okay, so RL, I think, is one of the earliest things people tried. It's probably the most principled thing you can do in this space, right? You have a non-differentiable routing decision; well, think of it as a policy, throw RL at it, and solve the problem. Unfortunately, it's not better than a lot of the other things you can do. There's a paper by Clark et al. in 2022 exploring various scaling-related questions in MoEs, and they do have an RL baseline that I was able to dig up, but unfortunately, it's not really that much better than, say, using hashing for the routing decisions. And they were really interested in benchmarking the thing on the left called S-BASE, which is a linear-assignment kind of method, and that handily beats doing RL. I think in practice the gradient variance and complexity mean that it's pretty finicky to use, and as far as I know, no one at scale has really used an RL-based approach to optimize these gating decisions. A thing that has been done much more at scale is stochastic approximation of various kinds, where they might add a bit of perturbation. Here's an example of one from Shazeer et al. in 2017; this is one of the early MoE papers. They're still going to do top-k routing: they keep the top-k elements of this H(x) quantity, and they softmax that to get the gate. But the way we get this H(x) is the following. We have our original linear affinity; this is identical to what we were doing before, basically computing our input x against a learned weight for each gate. So this part's the same, but now I'm going to jitter it a little bit: I'm going to add Gaussian noise scaled by a learned noise weight W_noise, and this thing is going to control how much noise to inject into this process. You can think of this as a stochastic exploration policy, and by manipulating W_noise in particular ways, like annealing it down or doing various things, I can control the exploration-exploitation tradeoff that this MoE is going to have. And so this gives you one solution to the explore-exploit dilemma. And especially if you're noising things up, each expert might randomly get some tokens that it wasn't expecting to get, so it'll lead to experts that are less specialized but maybe a little bit more robust. And so that seems generally quite nice.
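Here is a small sketch of the noisy top-k gating idea from Shazeer et al. (2017) as just described: the router logits get a learned, input-dependent amount of Gaussian jitter before the top-k, which acts as an exploration mechanism. The shapes and names are my own.

```python
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k=2):
    """x: [tokens, d_model]; w_gate, w_noise: [d_model, n_experts]."""
    clean_logits = x @ w_gate
    noise_scale = F.softplus(x @ w_noise)                 # learned per-(token, expert) noise scale
    h = clean_logits + torch.randn_like(clean_logits) * noise_scale
    vals, idx = h.topk(k, dim=-1)                         # route on the jittered logits
    gates = torch.zeros_like(h).scatter(-1, idx, F.softmax(vals, dim=-1))
    return gates, idx

d_model, n_experts = 64, 8
gates, idx = noisy_topk_gate(torch.randn(4, d_model),
                             torch.randn(d_model, n_experts),
                             torch.randn(d_model, n_experts))
```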
Of course, the stochasticity also means that you don't get as much specialization, and that leads to a loss of efficiency. And there's another approach people have tried where they apply a multiplicative perturbation to the router logits, with the goal of getting less brittle experts. But this jitter process was removed in some of the later papers, because they found it just didn't work as well as some of the heuristic, loss-based approaches. So these stochastic routing tricks were tried in a couple of the early Google papers, but I think they have generally been abandoned by a lot of the people training these MoEs. Okay. Yes, for the stochastic one. speaker 2: What problem does that solve? Because we're still taking the top-k, so we still can't backpropagate through it. speaker 1: So the question was that we still can't differentiate because we're taking the top-k. But if you change your interpretation of the problem a little bit, if you think about a bandit problem, it has the same structure as this: you pull a bandit arm and you don't see any of the other arms, so you can't really allocate your resources efficiently unless you pull some of the other ones at random. Then you've got enough data to be able to do some optimization. And so this jittering is very similar in spirit to an epsilon-greedy-style exploration scheme, where you're randomly pulling some of the other arms with some probability, and the probability itself depends on how confident you are about the routing decision. So that's the intuition, and that's going to give you some way of getting some signal back. Okay. So the thing that, in practice, people have ended up with is: we don't do any of that. We don't do RL, we don't do stochastic exploration; we rely on another mechanism to keep things reasonable. If we're doing top-two routing, then technically speaking we do get some signal in the gradient descent process, because we can compare the top two experts that we did evaluate, and so it's possible to do some optimization. But if we drop all the other constraints, the big issue that arises is that you just end up picking one expert all the time. That expert is good at everything, and all the other experts are terrible, right? You end up in this local minimum where you've routed all of your tokens to one expert all the time. So the key game becomes: how do we get out of that local minimum? Balancing losses, or load-balancing losses, are really the key trick to get out of this. And this is important to understand, because this is the loss that basically everyone actually uses to train their MoEs. So if you were zoning out earlier, you should probably make sure to pay attention to this particular set of equations. This is originally from the Switch Transformer paper, Fedus et al. 2022, and they add this particular loss, where they loop over each of the experts and take what you could think of as an inner product between the vector f and the vector P. So what are these vectors? Well, f_i is, for each expert, the fraction of the tokens that were allocated to expert i.
So you can think of f as a probability vector telling me what fraction of the tokens in my batch, or whatever the unit is here, got routed to expert i. Now, P_i is the fraction of the router probability that was allocated to expert i. The router probability is the original softmax routing decision, so P_i measures what the router intended to send, while f_i measures the actual routing decision made by the top-k mechanism. And one thing that's kind of interesting to look at: say we take the derivative of this loss with respect to P_i. This is a linear function of P_i, and you'll see that the strongest down-weighting action happens on the biggest experts, the ones with the biggest allocations; it's in fact proportional to the number of tokens each expert got. So you're going to be pushed downwards more strongly if you got more tokens. That's the basic behavior of this loss, and almost everybody uses this kind of f-dot-P trick to try to balance tokens across different units. The basic unit that you might want to balance over initially is the batch: you might want each batch to be allocated evenly across experts. But you might actually have other kinds of balancing that you want to do, and DeepSeek does exactly this kind of thing; I'll talk about the variants they've thrown in. The first one is per-expert balancing per batch: for each batch, they want to make sure experts get an even number of tokens. This is from the DeepSeek paper, and hopefully it looks very familiar; it's exactly the same f-dot-P inner-product structure as you saw before. P_i is defined a little bit differently, via s_{i,t}, but that should be familiar from earlier as well; that's the softmax score pre-top-k, right? So hopefully this all looks pretty reasonable to you. The other thing you might want, though, is to balance across devices. Balancing across experts is all well and good, but you might also want to think about the systems concerns, because you're going to shard your experts onto different devices, and you might want to balance per device. So you might have another loss with essentially the same structure, but instead of counting which tokens go to which experts, you measure which tokens go to which devices; that's a different f, measured over the device groups rather than over each expert. So now you can set up a different loss to balance over devices. If you optimize this, you're naturally going to learn routing functions that make sure each GPU gets an even number of tokens, leading to even utilization, and that would be great from a systems perspective. So basically everyone does this kind of thing.
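A minimal sketch of this Switch-Transformer-style balancing loss: f_i is the fraction of tokens actually dispatched to expert i, P_i is the mean router probability assigned to expert i, and the loss is alpha * N * sum_i f_i * P_i. The alpha value and the top-1 dispatch used here are illustrative choices.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts, alpha=0.01):
    """router_probs: [tokens, n_experts] softmax scores; expert_index: [tokens] dispatched expert."""
    # f_i: empirical fraction of tokens sent to each expert (not differentiable).
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # P_i: mean router probability for each expert (differentiable; this is where gradients flow).
    p = router_probs.mean(dim=0)
    return alpha * n_experts * torch.dot(f, p)

probs = torch.softmax(torch.randn(32, 8), dim=-1)
loss = load_balancing_loss(probs, probs.argmax(dim=-1), n_experts=8)
```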
And DeepSeek V3 actually innovates a little bit here, which is kind of cool, and I don't think I've seen this before; it's one of the first things in the MoE world that doesn't actually come from Google. They have gotten rid of this expert-balancing loss term entirely. Instead, what they now do is basically take their softmax scores and add a little fudge factor b_i, where b_i is a per-expert offset, so expert i might get up-weighted or down-weighted. If an expert isn't getting enough tokens, it's going to be given a higher b_i, and that's going to allow it to grab more tokens. And the way this works is that they learn b_i through a really simple online update scheme, online learning. At each batch, they measure what each of the experts is getting: are they getting an even number of tokens? If an expert isn't getting enough tokens, they add gamma, a sort of step size, to b_i, making it higher; if it's getting too many tokens, they subtract gamma, making that expert slightly less attractive. So they're just learning little offsets for each of the s_i's. And notice that you're only using the b_i's to make the routing decisions; you're not actually sending them along as part of your gating weights. That's a somewhat important detail. So they call this auxiliary-loss-free balancing. If you go and read the DeepSeek V3 paper, which all of you should because it's a really nice paper, they make a big deal about how this makes training so stable, so great, so wonderful. And then, of course, you keep reading the section, and they say, actually, for each sequence, maybe we still want to be balanced, and this doesn't work well enough, so we've added the heuristic loss back. They do have something called the complementary sequence-wise auxiliary loss, which is basically exactly the auxiliary loss, that they decided they needed because they wanted to load-balance the experts at a per-sequence level rather than a per-batch level. I'm not sure why they do this particular thing rather than some other b_i-style trick, but that's just what they do in DeepSeek V3. So it's not fully auxiliary-loss-free, as they'd like you to believe.
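A small sketch of the auxiliary-loss-free idea just described: a per-expert bias b_i is nudged up or down by a step size gamma after each batch, depending on whether the expert was under- or over-loaded, and the bias only affects which experts get selected, not the gate values used in the output. The update rule and numbers here are a simplified illustration, not DeepSeek's exact procedure.

```python
import torch

def select_with_bias(scores, bias, k):
    """scores: [tokens, n_experts]; bias: [n_experts]."""
    _, idx = (scores + bias).topk(k, dim=-1)    # the bias shifts the *selection* only
    gates = torch.gather(scores, -1, idx)       # ...while the gate uses the raw score
    return idx, gates

def update_bias(bias, idx, n_experts, gamma=0.001):
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    # Under-loaded experts get a boost, over-loaded experts get penalized.
    return bias + gamma * torch.sign(load.mean() - load)

scores = torch.softmax(torch.randn(32, 8), dim=-1)
bias = torch.zeros(8)
idx, gates = select_with_bias(scores, bias, k=2)
bias = update_bias(bias, idx, n_experts=8)      # applied once per batch during training
```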
Oh, yes, question. speaker 2: This is a bit of an unfair question, but if we did not have to worry about systems optimizations, do you think the performance of this model would be better, or would it stay roughly the same? speaker 1: If we did not think about systems optimization, would the performance of this model be better or stay the same? When you say "this model," what do you mean, DeepSeek V3? speaker 2: Or this in general, like this modern architecture. speaker 1: So are you asking: if we ignore the systems concerns, do we think MoEs are still good? Is that one way of asking the question? speaker 2: Like, would the performance of the downstream transformer be better than what we have? speaker 1: Yeah. speaker 2: If we didn't have to balance this and could route roughly to whichever expert. speaker 1: Yeah, yeah, that's right. Well, I think per-expert balancing, this term, is actually not a systems concern, so you still want to do it. Because if you don't do it well, you'll find, and I'm going to keep referring to the OLMoE paper because they have so many ablations, they have a really nice ablation where they get rid of exactly this, and what they find is that basically early on in training, the model just picks one or two experts and all the other experts are dead: the router never sends anything to them. So you're just wasting memory at that point; you've lost performance for free, you've effectively gotten a smaller model. And so even if you ignore all the other device-balancing and parallelism concerns, you've just gotten a worse model, because you didn't properly allocate your experts, right? It's the same way as you want to effectively use all your parameters: you want to do expert balancing. Sorry, say again? What does "device" refer to? Yeah, so normally this would refer to a GPU or TPU. There is a subtlety; I'll talk about this maybe in the very last or second-to-last slide. There are more sophisticated and cool versions of this where you try to balance things to minimize communication costs as well, and so there are broader notions of "device," like one rack or whatever else. But here it usually refers to a GPU. Yes. speaker 2: Going back to the fact that hashing as a router still improves performance, is there an intuition for that? Because that's effectively just choosing one of a few feed-forward networks to send each token through, right? So why does having multiple copies, each of which gets less data, help? speaker 1: Yes, the question was, why does hashing do anything at all? I don't have a really precise intuition for this, but you can make arguments either way. One is that even if you're hashing, the same tokens, or the same kinds of sequences, are going to go to the same expert every time, right? So each expert will still get some deterministic subset of the inputs, and so some specialization can still occur; it's just non-semantic, not learned. And if your distribution is Zipfian, the word "the" might dominate one expert, and so you might still get effectively semantic specialization, where one expert is dominated by very frequent things. speaker 2: What about a purely random routing that's not dependent on the input? speaker 1: I would bet that that would be really terrible. I have never run or seen that, but yes, I think that would be horrible. Good. Yes. speaker 2: You have many layers in the transformer, and each layer has all these experts, so with, say, 32 layers and 64 experts, wouldn't that take a huge number of GPUs? Or do a couple of experts get bundled together onto a single GPU? speaker 1: So the question was, wouldn't you need lots of GPUs if you have lots of layers and lots of experts? Yeah, if you exclusively gave a GPU to a single expert, that would be kind of crazy. But you would design the sharding so that each GPU holds enough of these units to effectively use its memory, right? The name of the game in parallelism is that you always want to use up all of your memory, because that's one of your resources; you don't want to parallelize more than you have to. Cool. Okay. Excellent. Oh, okay, I did put the ablation in here. So this is exactly what happens, to the question of what happens if you don't use an expert-balancing loss. I think the great picture to look at is this bottom-left one. If you don't do load balancing, what tokens get assigned to which expert? You see the pink and the yellow expert just kind of take over; they take up about 50% of the tokens, and all the other experts are dead. They do nothing, right?
And so you've wasted the majority of your experts at that point—six out of eight of them—and you've unintentionally created a two-expert MoE, and that gives you worse losses, as seen up at the top, the teal lines. Of course, maybe that's still better than the dense model, because at least you've got two experts going, but you could have done better, counterfactually speaking. Okay. So I won't go quite as deep as I could into the systems side, because I haven't yet covered the core systems concepts you'd need to deeply appreciate a lot of the parallelism concerns, like the hierarchy of communication speeds in a data center and so on. But, as I said before, one thing to keep in mind is just how nicely MoEs can be fit onto devices. The thing people call expert parallelism involves putting one or a few experts onto each device. What happens when you process a token? You hit the router, and after the router you've picked a few experts. So now you have a collective communication call—an all-to-all dispatch—that sends the tokens to the relevant devices; the feedforwards compute their outputs; and then you return the tokens to where they belong, combining the outputs of the multiple experts, which requires another collective communication call. So if your feedforward computations are big and beefy enough, you can pay for the cost of doing this expert parallelism. One of the nice things about this is that it's another form of parallelism in your toolkit. On the right side you've got data parallelism and model parallelism of two or three different kinds, and then you've got expert parallelism, and you can combine all of them to trade off the resources you have: communication speed, the amount of data you have, your batch size, your number of experts, and your memory. I'm not going to go into too much detail about how specifically this helps, but keep in mind that it gives you another tool in your toolkit. Another useful thing: let's say you have multiple experts on a single device. You might hope that, because the computations are sparse—say the first token gets multiplied by expert zero, the second by expert one, and the third by expert two, so this is really three matrix multiplies that are small and sparse—modern GPUs can take advantage of these kinds of sparse matrix multiplications. And that's exactly right. If you lay out your experts correctly and the weights are fused in the right way, then modern block-sparse matrix multiply engines can effectively make sure that you're not wasting any FLOPs when doing this as one big matrix multiply. Modern libraries like MegaBlocks can take advantage of this device-level sparsity support to do multiple expert computations all at once. So this is yet another advantage that you get with MoEs.
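To make the dispatch/compute/combine pattern concrete, here is a minimal single-device sketch that groups tokens by their assigned expert, runs one dense matmul per group, and scatters the results back. It only illustrates the data movement; libraries such as MegaBlocks implement the same idea with block-sparse kernels rather than a Python loop, and the names here (TinyExpert, moe_forward_grouped) are made up for the example.

```python
import torch
import torch.nn as nn

class TinyExpert(nn.Module):
    """A small FFN standing in for one expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

def moe_forward_grouped(x, expert_ids, experts):
    """Dispatch tokens to their experts, run each expert on its own contiguous
    group of tokens, then scatter results back into the original order.

    x:          [num_tokens, d_model]
    expert_ids: [num_tokens] chosen expert per token (top-1 for simplicity)
    """
    order = torch.argsort(expert_ids)                       # group tokens by expert
    x_sorted = x[order]
    counts = torch.bincount(expert_ids, minlength=len(experts)).tolist()
    out_sorted = torch.empty_like(x_sorted)
    start = 0
    for e, n in enumerate(counts):                          # one dense matmul per expert group
        if n:
            out_sorted[start:start + n] = experts[e](x_sorted[start:start + n])
        start += n
    out = torch.empty_like(x)
    out[order] = out_sorted                                 # undo the permutation ("combine")
    return out

experts = nn.ModuleList([TinyExpert(16, 32) for _ in range(4)])
x = torch.randn(10, 16)
expert_ids = torch.randint(0, 4, (10,))
y = moe_forward_grouped(x, expert_ids, experts)
```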
So, one fun side thing, which maybe isn't mysterious to you anymore because you've grown up in the era of GPT-4: when the GPT-4 API first came out, it was kind of mysterious to me that when you set the temperature to zero, you could still get different responses, even though it was supposed to be deterministic, and lots of people speculated about why that would be. I'm not saying this is the answer, but there is an interesting source of randomness in MoEs. Think about what happens: you route your tokens to experts, and experts live on different devices. You are, of course, going to batch your queries when you process them, and once you've batched your queries, those tokens get routed to different experts. So imagine you've got a batch to process and a bunch of experts, but for whatever reason this batch really loves expert number three—all the tokens go to expert number three. Now what happens? The device for expert number three doesn't have enough memory to process all of those tokens, and you get what people call token dropping. This happens at training time as well: you often have what's called a load factor, or capacity factor, which controls the maximum number of tokens an expert is allowed, and if the router allocates too many tokens to an expert, you just drop those tokens, either for systems reasons or because you're worried that that expert will take over, at least at training time. So that token gets dropped and it doesn't get anything at all: the MLP does zero computation, the residual connection just passes things straight through, and then you return an output. And so if your token got dropped, you get a different result than if it hadn't been dropped. Depending on who else is in your batch, MoEs can induce stochasticity both at training time and at inference time, which is kind of an interesting thing you don't normally think about, because you almost never think about cross-batch effects when doing inference. Okay, so those are the main basic components of building an MoE.
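Here is a minimal sketch of capacity-based token dropping as just described. The capacity formula (capacity_factor * tokens / num_experts) follows the common Switch-Transformer-style convention, and keeping tokens in arrival order is a simplification; real systems may prioritize by router score instead.

```python
import torch

def apply_capacity(expert_ids: torch.Tensor, num_experts: int,
                   capacity_factor: float = 1.25) -> torch.Tensor:
    """Return a boolean mask of tokens that fit under each expert's capacity.

    Tokens beyond an expert's capacity are 'dropped': their expert output is zero
    and only the residual connection carries them forward.
    """
    num_tokens = expert_ids.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        # First `capacity` tokens routed to expert e are kept, the rest are dropped.
        slots = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True
    return keep

expert_ids = torch.tensor([3, 3, 3, 3, 3, 1, 0, 3])   # expert 3 is oversubscribed
keep = apply_capacity(expert_ids, num_experts=4, capacity_factor=1.0)
# In the layer itself:  out = x + keep[:, None] * expert_out
# so dropped tokens simply pass through on the residual stream.
```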
As a fun side note, if you were to actually go out tomorrow and try to train an MoE, I think the systems side would make you a little bit sad, but the other thing that would make you sad is probably the stability side of things. MoEs have this property that they sometimes just blow up on you, and if you try to fine-tune them, they're very difficult to fine-tune and they sometimes blow up on you there too. Barret Zoph and others really studied this: there's a whole paper, the one I'm referencing here, whose entire purpose is to stabilize MoE training, and there are a couple of tricks from it that I'll mention because they're relevant and people actually use them. The first one concerns the router softmax. This goes back to last lecture on stability—what did I say about stability? The thing to be afraid of is the softmaxes; the softmax is always where you want to be afraid. So for MoEs, they do the router computations in float32, just to be safe, and sometimes they also add an auxiliary z-loss. Hopefully you remember that from last lecture: you take the log of the sum of the exponentiated values in the softmax, you square it, and you add that as an extra loss. This keeps the normalizer values near one, which is nice for stability. This was actually one of the places where the z-loss was used before it got more popular for training models in general. You can see the effects here: if you look at the losses, the second plot is the one to look at. If you remove the z-loss from your routing function, you see these giant spikes in your validation loss, where the model just goes a little bit crazy for a couple of iterations and then gets pulled back. Of course it still trains okay, but you are better off having a z-loss than not: there's a pretty noticeable gap in the validation loss by the end. Other things that can happen: of course you want to fine-tune your MoE, and you'd like to RLHF your MoE before you ship and release it, but this turns out to be kind of problematic. Some of the earlier work, back when people were starting to do MoEs in the BERT and T5 era, involved a lot of fine-tuning, and one of the things people saw was that a lot of overfitting happens. With sparse models you see a big gap between train and validation—the blue and orange lines—whereas the dense model, the green and red lines, has a smaller train-test gap. So there were a lot of worries about overfitting, because you have these gigantic-parameter models being fine-tuned on small data. One solution proposed at the time, which I don't think is very popular anymore as far as I understand, is to architect your MoE so that not every layer is an MoE layer—you alternate dense layers and MoE layers, say—and then you can fine-tune just the dense layers, which behaves just like a dense model. Another solution, the one we saw in the DeepSeek MoE paper, is to just use a lot of data: if overfitting is the problem and we have access to lots and lots of SFT data, just shovel all of it in. In the case of DeepSeek MoE they used 1.4 million training examples, and then maybe you're not quite as worried about these overfitting concerns. The last thing I'll end with, a trick in the toolkit that people have used, is upcycling. The idea is to take a dense model, like the one over here, take its MLP, make a bunch of copies of it, maybe perturb them, add a router that's initialized from scratch, and then just pretend this is an MoE and train it from that point on—you initialize the MoE from a dense model. This trick is called upcycling, and people have shown that if you can get it to work, it's a very cost-effective way of getting an MoE. And the MoE is great for inference, because not every MLP is active at inference time, so you can effectively get a much larger-parameter model without paying for the training of a much larger-parameter model. And several people have succeeded at this.
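A minimal sketch of upcycling as just described: copy the trained dense FFN into each expert, perturb the copies slightly, and pair them with a freshly initialized router. The function name and the noise scale are illustrative, not the recipe from any particular paper.

```python
import copy
import torch
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, num_experts: int,
                       noise_std: float = 1e-2):
    """Turn one trained dense FFN into an MoE layer: N perturbed copies + a new router."""
    experts = nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(dense_ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))   # break symmetry between copies
        experts.append(expert)
    router = nn.Linear(d_model, num_experts, bias=False)  # trained from scratch
    return experts, router

dense_ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
experts, router = upcycle_ffn_to_moe(dense_ffn, d_model=16, num_experts=8)
```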
MiniCPM, which I'll mention again in the scaling-laws lecture, is a Chinese open LLM effort that basically tried to build really good small language models, and they succeeded at taking a dense model and upcycling it into an MoE. You can see that their numbers get significantly better in the last two rows: going from the dense model to the MoE, they get a pretty nontrivial bump in performance. Qwen, which I mentioned at the start of this lecture, made one of their earliest MoE attempts by taking one of their dense models and building an upcycled MoE from it, and they got fairly significant gains relative to smaller models at the time—they got a model on par with their 7B dense model using only 2.7 billion active parameters. So, to wrap up, I want to walk through the DeepSeek MoE architecture at the very end here. First, I want you to understand the DeepSeek V3 architecture setup and all the changes they made, because it's an example of a modern, high-performance open-source system. I also want you to appreciate that architectures don't change that much. DeepSeek MoE, the V1, is not that new—it's maybe a year and a half, maybe two years old—and they basically nailed the architecture at that point. So I want you to see what they changed between the very earliest attempt and their big training run. This is the very first starting point. I'm calling it V1, but probably the right way to refer to it is just DeepSeek MoE. It's a 16-billion-parameter model with 2.8 billion of those parameters active. You've already seen this diagram over here: this is the two shared experts plus 64 fine-grained experts, of which four—well, actually six, sorry—are active at a time. And the routing you've already seen; I presented it in the middle of the lecture. It's the very standard top-k routing, where the softmax sits at the bottom, before the top-k selection. And for balancing at training time, all they do is add the auxiliary balancing losses, both the expert-level and the device-level terms. Hopefully you remember those from earlier. So that's the V1.
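To tie the routing pieces together, here is a simplified, single-device sketch of a DeepSeek-MoE-style block as just described: a softmax over expert affinities, top-k selection of fine-grained experts, always-on shared experts, and a residual connection. The class and parameter names are invented for the example, the loops are written for clarity rather than efficiency, and the balancing losses are omitted.

```python
import torch
import torch.nn as nn

class DeepSeekStyleMoELayer(nn.Module):
    """Simplified MoE block: shared experts + top-k routed fine-grained experts."""
    def __init__(self, d_model=16, d_expert=32, n_routed=8, n_shared=2, k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                                    nn.Linear(d_expert, d_model))
        self.routed = nn.ModuleList([ffn() for _ in range(n_routed)])
        self.shared = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.router = nn.Linear(d_model, n_routed, bias=False)  # the expert vectors e_i
        self.k = k

    def forward(self, u):                                  # u: [num_tokens, d_model]
        s = torch.softmax(self.router(u), dim=-1)          # softmax *before* top-k
        gate, idx = torch.topk(s, self.k, dim=-1)          # g_i(t); other experts get zero
        out = torch.zeros_like(u)
        for slot in range(self.k):                         # loop for clarity, not efficiency
            for e in range(len(self.routed)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gate[mask, slot].unsqueeze(-1) * self.routed[e](u[mask])
        for expert in self.shared:                         # shared experts see every token
            out += expert(u)
        return u + out                                     # residual connection

layer = DeepSeekStyleMoELayer()
y = layer(torch.randn(10, 16))
```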
And then they saw how effective their MoE model was. To add some context: DeepSeek originally had a dense model, and then they had this MoE model, and the MoE model was remarkably good. So when they went to V2, they went straight to the MoE. This is now a 236-billion-parameter model, of which 21 billion parameters are active. So you need a lot of memory, but your FLOPs consumption for inference on this model is not so bad. The architecture is identical—I copied literally the same figure because the architecture is literally the same, minus changes to the number of experts and how many are active. There are some new things happening, but not too many. The top-k selector is the same: the equation from before is identical; this is still how they do things. But they add one very clever trick on top, and this is something I was going to bring up at the very beginning: what's the drawback of having fine-grained experts? Why can't I have, I don't know, 1,024 or 2,048 fine-grained experts? Well, the problem is that when you shard your experts very finely and you have a lot of active experts, you have to route to all of those experts, so your communication costs potentially grow. If you're very fragmented, you might have to send a lot of tokens to a lot of devices. The clever thing they come up with is to say: I'm not going to naively route each token to the top-k experts, which might force me to send my tokens to lots of devices. Instead, I'm going to first pick the top M devices. I do my normal scoring calculation, but I first restrict the set of allowed devices to the top M, and once I've picked my devices, I pick the top k experts for each token within those devices. Restricting the devices controls the communication cost, and that gives you more efficient training when you're scaling up to these gigantic sizes—you really need to start engaging with the systems aspects when you're training a 236-billion-parameter model. The other thing that reflects the systems concerns at this scale is that they add a communication balancing loss. One way to think about it: for an expert there are inputs and outputs. The inputs are the tokens that come in when you route to the expert, and for the outputs you have to bring the tokens back to where they belong—if a token belongs on this device, it has to go back to its original device. So you have to think about both the input communication cost and the output communication cost, and they add a balancing loss to balance the output communication as well, not just the input side. That's a minor note, but you can see their attention to detail in making sure all the different systems aspects are properly taken care of. Now, finally, we get to the big DeepSeek V3—sorry, that should just say V3, not V2, up there. It's 671 billion parameters, of which 37 billion are active. Once again, it's exactly the same figure, because the MoE architecture itself doesn't change; it has stayed the same since DeepSeek MoE—if it works, don't change it. They do change a couple of things. Maybe they were hearing you all ask "why don't you normalize to one?"—they've normalized the gates to sum to one, moving the normalization up there, but they're not actually exponentiating the gating scores; they take sigmoids, which is a softer, more nicely behaved operation than the softmax. So there are some changes here, but conceptually this is still the same top-k routing decision; hopefully you see very similar things happening. And then, in terms of losses, they've gone to this auxiliary-loss-free trick with the b_i being incremented or decremented based on expert load, and they have a sequence-wise auxiliary loss. To add some context on why you would want to balance experts within a single sequence: the thing they're very concerned about is that at training time it's fine not to have a sequence-wise balancing loss, but at inference time someone might send you very out-of-distribution sequences that overwhelm certain experts—at inference time you can't control which sequences you get. So you might want stronger balancing that operates at the level of a single sequence rather than the overall batch.
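A minimal sketch of the device-limited (top-M) routing idea described above, under the assumption that experts are laid out contiguously across devices and that a device is scored by its best expert affinity; the exact device-scoring rule in DeepSeek V2 may differ, so treat this as an illustration of the mechanism rather than their implementation.

```python
import torch
import torch.nn.functional as F

def device_limited_topk(scores: torch.Tensor, experts_per_device: int,
                        m_devices: int, k: int):
    """scores: [num_tokens, num_experts]; experts 0..E-1 are grouped per device."""
    t, e = scores.shape
    n_dev = e // experts_per_device
    per_dev = scores.view(t, n_dev, experts_per_device)
    # Score each device by its best expert affinity, keep the top-M devices per token.
    dev_score = per_dev.max(dim=-1).values                       # [t, n_dev]
    top_dev = torch.topk(dev_score, m_devices, dim=-1).indices   # [t, m_devices]
    allowed = F.one_hot(top_dev, n_dev).sum(dim=1).bool()        # [t, n_dev]
    allowed = allowed[:, :, None].expand(-1, -1, experts_per_device).reshape(t, e)
    # Mask out experts on disallowed devices, then do ordinary top-k.
    masked = scores.masked_fill(~allowed, float("-inf"))
    gate, idx = torch.topk(masked, k, dim=-1)
    return idx, gate

idx, gate = device_limited_topk(torch.rand(4, 16), experts_per_device=4,
                                m_devices=2, k=3)
```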
Okay. And then—whoops—yes, question? speaker 2: The top-M devices trick—do they keep that in V3? speaker 1: Yeah, they keep the top-M improvement. They do not keep, for example, the communication loss. So they've jettisoned some things—it's not like they only ever add things; they have removed some—but top-M seems like a pretty clever idea, and they keep it. And so, in the last two or so minutes of class, I'm going to go over the non-MoE parts of DeepSeek V3, because we're already at the point where I've explained most of DeepSeek V3; I might as well explain the rest, so you all know how it works. They have a clever optimization for the attention piece called MLA, multi-head latent attention. You actually already know all the ingredients you need to understand it, because at the end of last lecture I talked about GQA and MQA—inference optimizations whose goal is to reduce the size of the KV cache. The DeepSeek folks take a different tack at optimizing this: instead of reducing the number of heads, they project the heads into a lower-dimensional space. You have your inputs h_t, and instead of generating the k's and v's directly from these h_t's, you first generate a low-dimensional c_t—think of it as a compressed version of h_t—which is smaller and easier to cache, so you just cache these c's. Whenever you need the k's and v's, you up-project from this compressed KV representation, conceptually speaking, and then you can take the inner products with the q's. So you can see how this would be a KV-cache savings: you only have to save the c instead of the higher-dimensional h_t. And that's exactly the idea: you take your h_t, project it down into a lower-dimensional c_t, and then up-project that back into the k's and v's. If the c's are small, you've compressed the KV cache; that's good. In terms of the computation—if you're thinking about FLOPs—you might say this is not good, because I have to multiply by an extra matrix, the up-projection W_K that I didn't have before; that's an extra matrix multiply I have to pay for. But the clever thing is to remember that on the other side, I'm going to take the inner product of q and k in the attention operation, and q itself has its own projection matrix W_Q. So the trick is that you can merge this W_K and this W_Q together into one matrix: I haven't added any extra matrix multiplies, I've just merged the new one into the other—it's just associativity. They also compress the queries for memory savings during training, but that one is not quite as necessary, because it doesn't interact with the KV cache at all.
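Here is a tiny single-head numerical sketch of the latent-KV idea and the weight-absorption trick just described. It ignores RoPE, multiple heads, and the separate query compression; the matrix names and dimensions are arbitrary and chosen only for the illustration.

```python
import torch

torch.manual_seed(0)
d_model, d_latent, d_head = 64, 16, 32

W_dkv = torch.randn(d_model, d_latent)   # down-projection: h_t -> c_t (what gets cached)
W_uk  = torch.randn(d_latent, d_head)    # up-projection:   c_t -> k_t
W_uv  = torch.randn(d_latent, d_head)    # up-projection:   c_t -> v_t
W_q   = torch.randn(d_model, d_head)     # query projection

h = torch.randn(10, d_model)             # hidden states for 10 cached positions
q = torch.randn(1, d_model) @ W_q        # projected query for the current position

c = h @ W_dkv                            # [10, d_latent]: the only thing stored in the KV cache
k = c @ W_uk                             # reconstruct keys on the fly
attn_logits = q @ k.T                    # [1, 10]

# "Absorb" W_uk into the query side:  q (c W_uk)^T == (q W_uk^T) c^T,
# so you never have to materialize the full keys at all.
q_absorbed = q @ W_uk.T                  # [1, d_latent]
attn_logits_absorbed = q_absorbed @ c.T
assert torch.allclose(attn_logits, attn_logits_absorbed, atol=1e-3)
```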
I'm only going to mention this last piece in passing, because it's a subtlety, but it's a clever subtlety to realize: the original trick I just described is not compatible with RoPE. The reason is that with RoPE you rotate the q's and the k's by multiplying with position-dependent rotation matrices R_q and R_k, and those rotations end up sitting between the query projection and the latent up-projection matrix. Because the rotations depend on position, you can no longer simply merge those matrix multiplies, so RoPE gets in the way. They do have a solution, which is basically to apply RoPE only on a separate set of non-compressed dimensions. That's a side point—I think it's not quite as important, and you can look at the paper if you're super interested. The other thing they do, and this is the last thing, I promise, is a minor change to the loss function called MTP, multi-token prediction, where they predict multiple tokens in parallel. Normally you take your inputs, shift them left by one so that you're predicting one token into the future, and your transformer predicts all of those tokens—that's your normal transformer loss. But right before you make those predictions, you can take the hidden state, pass it to a very lightweight one-layer transformer, and have that module predict one more token into the future. So the model is not just predicting the next token; it's predicting two tokens into the future—hopefully that all makes sense—and it's just a small, lightweight model that does it. You can see the architecture right here. The one thing that's kind of disappointing, which I learned as I was researching for this lecture, is that they only do MTP one token ahead: even though they have this very elaborate diagram of how you could do it for many tokens, it turns out it's only done for one extra token. Okay, so now I'm all done. MoEs are now at the core of how you build and deploy a really high-performance, large-scale system. They take advantage of the sparsity idea that you don't need all of the parameters all the time. Discrete routing is the real big challenge, and I think that's one of the big reasons why MoEs didn't immediately catch on—it's very scary to have to optimize these top-k routing decisions—but the heuristics somehow just seem to work. And there's now a lot of empirical evidence that MoEs, at least in FLOPs-constrained settings, are simply a good idea: they're cost-effective, and you should use them. So, definitely worth learning. Thanks a lot for listening.
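To make the MTP objective concrete, here is a compact sketch of the idea described above: keep the usual next-token loss, and add a second loss in which a small extra module refines the same hidden states to predict one additional token ahead. The extra block, the 0.3 weight, and the shapes are made up for the example (DeepSeek's actual MTP head is wired somewhat differently), and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, seq = 100, 32, 12
backbone_h = torch.randn(1, seq, d_model)          # hidden states from the main model
tokens = torch.randint(0, vocab, (1, seq + 2))     # need targets up to position t+2

lm_head = nn.Linear(d_model, vocab)
mtp_block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # "lightweight one-layer" module

# Standard next-token loss: position t predicts token t+1.
main_logits = lm_head(backbone_h)
loss_main = F.cross_entropy(main_logits.view(-1, vocab), tokens[:, 1:seq + 1].reshape(-1))

# MTP loss: the extra module predicts token t+2 from the same hidden states.
mtp_logits = lm_head(mtp_block(backbone_h))
loss_mtp = F.cross_entropy(mtp_logits.view(-1, vocab), tokens[:, 2:seq + 2].reshape(-1))

loss = loss_main + 0.3 * loss_mtp                  # weighting coefficient is illustrative
```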
最新摘要 (详细摘要)
概览/核心摘要 (Executive Summary)
本讲座(Stanford CS336, Spring 2025, Lecture 04)深入探讨了“专家混合”(Mixture of Experts, MoE)架构,指出其已成为构建和部署现代高性能语言模型(如GPT-4传闻、Grok、DeepSeek、Llama 4)的关键技术。MoE的核心优势在于,通过稀疏激活机制,模型可以在不显著增加计算量(FLOPs)的情况下拥有更多参数,从而提升性能,尤其是在记忆世界知识方面。讲座强调,在相同的训练FLOPs下,MoE模型通常比密集模型表现更好,训练损失下降更快。
MoE架构主要对Transformer模型中的前馈网络(FFN/MLP)层进行修改,将其替换为一个路由器(router)和多个较小的“专家”(即多个FFN副本)。路由器负责为每个输入token选择激活少数几个专家。关键设计挑战包括路由机制的选择(“token choice top-k”已成主流)、专家数量与大小的配置(细粒度专家和共享专家是重要创新)、以及如何训练非微分的路由决策(启发式平衡损失函数是当前主流方案)。
讲座还讨论了MoE带来的系统复杂性、训练不稳定性、并行化优势(专家并行),并回顾了早期研究(如Google的GShard、Switch Transformer)和近期开源进展,特别强调了中国团队(如Qwen、DeepSeek)在MoE领域的早期开源贡献。最后,详细剖析了DeepSeek系列MoE架构的演进(V1至V3),包括其在路由、平衡损失、以及如多头潜在注意力(MLA)等非MoE组件上的创新。尽管MoE带来了复杂性,但其在FLOPs受限场景下的成本效益已得到广泛证实。
引言与MoE核心价值
Speaker 1指出,专家混合(MoE)已从去年的一个“有趣的附加讲座”转变为今年“更为关键的讲座”,因为大量研究者和模型(如Grok、DeepSeek、Llama 4,甚至传闻中的GPT-4)已采用MoE架构。
-
MoE的普及:
- Nvidia泄露信息暗示GPT-4可能是某种MoE变体(如传闻中的“GPT moe one bt”)。
- Grok、DeepSeek、以及最新的Llama 4均采用MoE架构。
- Speaker 1认为:“在2025年,MoE相对于密集架构的优势已非常明显。”
- “几乎所有计算规模下,良好实现的MoE模型训练都将优于密集模型。”
-
MoE的基本理念:
- MoE并非指领域专家(如编码专家、语言专家),而是一种具有多个稀疏激活子组件(专家)的特定网络架构。
- 核心改动在于Transformer模型中的前馈网络(FFN)/多层感知机(MLP)部分。
- 标准Transformer的单个大型FFN被替换为一个选择器层(路由器)和多个较小的FFN副本(专家)。
- 路由器在每个前向传播中为每个token选择激活一小部分专家(例如,只选一个)。
-
MoE的核心优势:更多参数,相同FLOPs
- 如果一个专家的大小与原密集模型的FFN相同,且只激活一个专家,则MoE模型与密集模型的前向传播FLOPs相同。
- “你拥有了更多参数,而没有影响你的FLOPs。”
- 这对于需要大量参数记忆世界知识的模型而言是巨大优势。
-
性能提升证据:
- 大量研究表明,在相同训练FLOPs下,MoE模型性能优于密集模型。
- 引用Fedus et al. (2022) 的经典Google论文:
- 在FLOPs匹配的训练下,增加专家数量,训练损失持续下降。
- Switch Base模型(128个专家)比密集模型更快达到更优的困惑度(perplexity)。
- 引用OLMo (AI2) 论文的消融实验,同样证实MoE(粉线)训练损失下降速度远快于密集模型(青线)。
- DeepSeek V2论文展示图:x轴为激活参数量,y轴为MMLU性能,DeepSeek V2以较少激活参数实现了高MMLU性能。
- “如果你只关心训练和推理的FLOPs,激活参数是关键。”
MoE的系统复杂性与并行化
-
系统复杂性 (Speaker 2 提问关于非FLOPs开销):
- Speaker 1承认MoE存在显著的系统复杂性,这是其未成为标准教学内容的原因之一。
- “高效实现MoE,尤其是在每个专家位于不同设备上时,需要大量基础设施投入和复杂工程。”
- 但如果实现得当,所有FLOPs都能得到有效利用。
-
专家并行 (Expert Parallelism):
- MoE提供了新的并行维度。
- 可以将不同专家放置在不同设备上。
- 由于专家稀疏激活,只需将token路由到相应设备进行计算。
- “这是一种自然的模型切分方式,便于将大模型分片到不同设备。”
- 这是MoE流行的另一个原因,尤其对于超大模型。
-
开源MoE的演进:
- MoE最初在Google等闭源实验室发展。
- 开源领域的MoE成果早期多来自中国团队,如Qwen (原文为Quan/Quen) 和 DeepSeek。
- 近期西方开源社区(如Mixtral、Llama)也开始更多采用MoE。
- Llama 4 最新发布,也是稀疏MoE架构。
早期开源MoE探索:Qwen与DeepSeek
-
Qwen 1.5 (讲者口述为Quan/Quen 1.5):
- 是最早一批大规模、经过良好测试和文档记录的MoE模型之一。
- 采用一种“upcycle”技巧,将Qwen 1.5密集模型转换为MoE模型。
- 在计算效率方面展现显著增益。
-
DeepSeek (早期工作):
- 在DeepSeek V3成名前,其早期MoE论文已奠定了开源MoE研究的基础。
- 其论文包含详细的对比实验:
- 密集模型 vs. 朴素MoE vs. 智能路由MoE (Switch MoE)。
- 结果一致显示,在固定FLOPs下,从密集模型到稀疏MoE,各项基准指标均有提升。
- Speaker 1指出:“DeepSeek V3是这一系列工作的顶峰... 他们很早就掌握了核心架构,只是在工程上不断优化。”
MoE的挑战与设计考量
尽管MoE效果显著,但其复杂性和“混乱性”使其推广受阻。
-
主要挑战:
- 基础设施复杂:尤其在多节点训练时,跨模型分片专家才显现最大优势。
- 路由学习困难:将token路由到特定专家的决策是离散的,难以通过梯度下降直接优化。
- “路由决策是不可微的,因为我们必须选择并提交给特定的专家。”
- 训练目标要么是启发式的,要么不稳定。
-
MoE主要作用于FFN层:
- 经典MoE是将FFN层拆分/复制,并进行稀疏路由。
- 理论上也可以对注意力层进行稀疏路由,但实践中罕见,因其“更不稳定,难以持续训练”。
-
核心设计问题:
- 如何路由 (Routing Function)?
- 专家数量与大小 (Number and Size of Experts)?
- 如何训练路由器 (Training the Router)?
路由机制 (Routing Mechanisms)
参考Fedus et al. (2022) 的综述,路由机制主要有三种:
-
Token Choice (令牌选择):每个token为不同专家打分,选择Top-K个专家。
- 这是目前绝大多数大型MoE模型(如DeepSeek V1/V2, Qwen, Grok, Mixtral, DBRX, Llama 4)采用的机制。
- OLMo论文的消融实验表明,Token Choice在验证损失上表现更优,收敛更快。
- 路由器的实现通常很简单:token的隐藏状态向量 x 与每个专家的学习向量 E_i 做内积,然后通过Softmax(或其他归一化)得到分数。
- K的选择:通常 K=2 是经典选择。早期MoE论文认为 K>=2 有助于探索(exploration),避免模型只利用(exploit)最优专家。K=2 会使激活参数和FLOPs翻倍(相较于K=1且专家大小与密集FFN相同时)。
- 输出合并:当 K>1 时,多个被选中专家的输出会通过加权平均(权重来自路由器的打分)或直接求和的方式合并。
-
Expert Choice (专家选择):每个专家对所有token打分,选择Top-K个token进行处理。
- 优点:可以确保每个专家的负载均衡。
-
Global Assignment (全局分配):通过解决优化问题,实现专家和token之间的平衡映射。
-
令人惊讶的发现:随机哈希路由也有效
- “即使使用哈希函数(完全没有语义信息)将token映射到专家,基于哈希的MoE仍然能带来收益,这相当疯狂。”
- 这表明MoE的部分收益可能来自参数量的增加本身,而非完全依赖智能路由。
-
被放弃的路由方法:
- 强化学习 (RL):早期工作尝试用RL学习路由策略,但计算成本过高,且稳定性问题使其未被广泛采用。
- 线性分配/最优传输:虽然理论优雅,但实际成本远超收益。
-
主流Top-K路由详解 (以DeepSeek V1/V2, Qwen, Grok为例):
- 输入:残差流输入 U_t。
- 计算亲和度:U_t 与每个专家的可学习向量 E_i 做内积。
- Softmax归一化:得到每个token对各专家的概率分布 S_i(t)。
- Top-K选择:选出得分最高的K个专家,其余置零,形成门控信号 G_i(t)。
- 加权输出:Output = Σ_i G_i(t) * Expert_i(U_t)。
- 残差连接:Final_Output = U_t + Output。
- Speaker 1强调,Softmax在此处主要起归一化作用,使后续加权平均的权重和为1。Top-K选择是保证稀疏性的关键,否则训练时会激活所有专家,失去效率。
- 关于Softmax后Top-K导致权重不再和为1的问题:一些架构会重新归一化,一些则不会,因为后续层(如LayerNorm)可以调整尺度。
- 输入:残差流输入
专家配置:细粒度专家与共享专家
这是DeepSeek MoE引入并被广泛采用的重要创新,灵感可能来自DeepSpeed MoE和Qwen。讲座中提及了多个模型的MoE配置,据讲者口述及提及的论文信息,总结如下:
- 标准MoE:将密集模型的FFN复制多份作为专家。若Top-2路由,则激活参数量约为原密集模型的两倍。
- 细粒度专家 (Fine-grained Experts):
- 动机:希望拥有大量专家,但不想承担过高的参数成本。
- 做法:将每个专家的FFN中间层维度减小。例如,标准FFN的隐藏层维度是输入维度的4倍,细粒度专家可能只用2倍,从而可以用相同参数量得到更多(但更“瘦”)的专家。
- DeepSeek论文的消融实验及OLMo的复现均表明,增加细粒度专家的数量(如从8到32到64)能持续改进模型性能。
-
共享专家 (Shared Experts):
- 动机:某些计算可能对所有token都是通用的,无需通过复杂路由分配给特定专家。
- 做法:设置一个或多个所有token都会经过的FFN专家。
- DeepSeek MoE论文显示共享专家有益。
- OLMo的复现实验中,共享专家带来的收益不明显,最终未使用。因此,共享专家的有效性可能存在争议或依赖特定配置。
-
常见配置表解读 (X out of Y 表示 Y个总路由专家中激活X个):
- 早期Google模型 (GShard, Switch Transformer, ST-MoE):专家数量可以非常大。
- 中期 (Mixtral, DBRX, Grok):通常8-16个专家,激活2个。
- DeepSeek MoE V1 (原型):64个细粒度专家(每个约为标准大小的1/4),激活6个,外加2个共享专家。
- 后续模型 (Qwen 1.5, DeepSeek V3, Minimax, OLMo, Llama 4):普遍采用细粒度专家。Llama 4据提及有非常大量的总专家并使用1个共享专家。DeepSeek V3核心MoE单元设计(如路由从64个细粒度专家中选X个,辅以2个共享专家,细粒度专家大小为标准1/4)与V1类似,但总专家数量和激活专家数量均大幅增加以匹配其庞大的模型规模(讲者提到DeepSeek有非常大量的总专家)。
- 专家大小比例:指专家FFN的中间层维度相对于标准配置(如4倍输入维度)的比例。例如1/4意味着中间层维度是标准配置的1/4。有些模型甚至会小于输入维度(down-projection)。
- 关于为何需要多个共享专家(如DeepSeek V1的2个):Speaker 1推测可能是为了保持所有激活专家(路由的+共享的)大小一致,便于系统处理,但无明确理论依据。
训练MoE模型
训练MoE的主要挑战在于训练时也需要保持稀疏性,以避免巨大的FLOPs开销,同时路由决策不可微。
-
被放弃的训练策略:
- 强化学习 (RL):Clark et al. (2020) 的研究表明RL效果不比哈希路由好,且梯度方差大、实现复杂。
- 随机探索 (Stochastic Exploration):
- Shazeer et al. (2017) 提出在路由打分时加入高斯噪声
Normal(0, W_noise),W_noise可学习,以控制探索-利用平衡。 - 目标是让专家不那么特化,更鲁棒。
- 缺点:随机性导致特化不足,效率降低。后续论文发现不如启发式损失方法,逐渐被弃用。
- Shazeer et al. (2017) 提出在路由打分时加入高斯噪声
-
主流训练策略:平衡损失 (Balancing Losses)
- 核心问题:若无约束,模型倾向于将所有token路由到少数几个“超级专家”,导致其他专家“死亡”,浪费参数。
- Switch Transformer (Fedus et al., 2022) 提出的辅助损失:Loss_balance = α * Σ_i (f_i * p_i),遍历所有专家 i。其中 f_i 为实际分配给专家 i 的token比例,p_i 为路由器分配给专家 i 的概率总和(Softmax输出,Top-K选择前),α 是超参数。
- 该损失会惩罚获得过多token的专家(p_i 越大,负梯度越大),促使负载均衡。
- DeepSeek的平衡损失应用:
- 逐专家、逐批次平衡:确保每个批次内,token在各专家间均匀分配。公式与Switch Transformer类似。
- 逐设备平衡:确保分配到不同设备(GPU)的token数量均衡,优化系统利用率。计算方式类似,但 f_i 统计的是分配到各设备的token比例。
-
DeepSeek V3的创新:辅助损失重平衡 (Auxiliary Loss Rebalancing)
- 摒弃传统平衡损失项:不再直接将 f_i * p_i 加入总损失。
- 引入偏置项 b_i:在计算路由分数 S_i 时加入一个可学习的偏置 b_i,即 S_i_new = S_i + b_i。
- b_i 的在线学习:
  - 若某专家 i 获得的token过少,则 b_i 增加 (b_i += γ)。
  - 若获得的token过多,则 b_i 减少 (b_i -= γ)。
  - γ 是学习率。b_i 仅用于路由决策,不参与门控权重的计算。
- 实际情况:DeepSeek V3论文声称此方法使训练更稳定,但随后又补充道,为了实现序列级别的专家负载均衡(应对推理时可能出现的分布外序列),他们还是加回了一个启发式的辅助损失(称为“互补序列级辅助损失”)。
- Speaker 1评论:“所以它并不像他们声称的那样完全没有辅助损失。”
- 为何需要专家平衡(非系统层面):若不进行专家平衡,模型会倾向于只使用少数几个专家,其他专家“死亡”,导致模型实际变小,性能下降。OLMo论文的消融实验证实了这一点:无负载均衡时,少数专家占据了大部分token。
- 摒弃传统平衡损失项:不再直接将
系统层面考量与优化
-
专家并行 (Expert Parallelism) 的通信:
- token经过路由器后,通过集体通信操作(如all-to-all)分发到持有对应专家的设备。
- 专家计算完成后,结果再通过集体通信收集并合并。
- 如果FFN计算足够密集,可以摊销通信成本。
- 这是工具箱中除数据并行、模型并行外的又一种并行方式。
-
设备级稀疏性 (Device-level Sparsity):
- 当单个设备上承载多个专家时,由于计算是稀疏的(不同token激活不同专家),现代GPU的稀疏矩阵乘法引擎(如NVIDIA的cuSPARSELt)和库(如Megablocks)可以高效执行这些操作,避免浪费FLOPs。
-
Token丢弃与随机性 (Token Dropping and Stochasticity):
- 在推理或训练时,如果一个批次内的许多token都涌向同一个专家,可能超出该专家所在设备的处理容量(内存或设定的最大token数,即
load factor)。 - 此时,多余的token会被“丢弃”,即不经过该专家MLP计算,直接通过残差连接传递。
- 这会导致即使在
temperature=0(确定性采样)时,由于批次内其他查询的不同,同一查询也可能得到不同结果,引入了非预期的随机性。 - 这是GPT-4 API早期
temperature=0仍有随机性的一种可能解释。
- 在推理或训练时,如果一个批次内的许多token都涌向同一个专家,可能超出该专家所在设备的处理容量(内存或设定的最大token数,即
MoE的稳定性与微调
-
训练不稳定性:MoE模型有时会“爆炸”(梯度爆炸/损失突增)。
- Barret Zoph等人的研究专门探讨如何稳定MoE训练。
- 技巧1:路由器计算使用FP32精度:Softmax操作是数值不稳定的常见来源,使用更高精度有助于缓解。
- 技巧2:Z-loss (Log-Sum-Exp Squared Loss):对路由器Softmax的logit应用Z-loss,可以使归一化后的值接近1,增强稳定性。OLMo论文图表显示,移除Z-loss会导致验证损失出现巨大尖峰。
-
微调挑战:
- 过拟合:MoE参数量巨大,在小规模微调数据上容易过拟合,导致训练集和验证集之间差距过大。
- 解决方案1 (早期提出,现不常用):采用交替的MoE层和密集层,微调时只调整密集层。
- 解决方案2 (DeepSeek MoE采用):使用海量SFT数据进行微调(如DeepSeek MoE使用了140万样本)。
模型上采样 (Upcycling)
- 思想:将一个预训练好的密集模型转换为MoE模型。
- 取密集模型的MLP层,复制多份作为专家。
- 可以对副本进行微小扰动。
- 从头初始化路由器。
- 以此为起点继续训练MoE模型。
- 优势:
- 成本效益高,能以较低训练成本获得一个参数量更大的MoE模型。
- 推理时仍享受MoE的稀疏激活带来的效率。
- 成功案例:
- MiniCPM:成功将密集模型上采样为MoE,性能显著提升。
- Qwen (讲者口述为Quan/Quen):早期MoE尝试即采用此方法,用27亿激活参数的MoE达到了70亿参数密集模型的性能水平。
案例研究:DeepSeek MoE架构演进
Speaker 1强调,DeepSeek的MoE架构从V1开始就基本定型,后续主要是工程优化和规模扩展。
-
DeepSeek MoE (V1) (约1.5-2年前)
- 参数量:160亿总参数,28亿激活参数。
- 架构:2个共享专家 + 64个细粒度专家(每个约为标准大小的1/4),激活6个。
- 路由:采用主流的Top-K路由机制(详见前述“路由机制”部分),Softmax在Top-K选择之前。
- 平衡:辅助损失平衡项(专家级和设备级)。
-
DeepSeek V2
- 参数量:2360亿总参数,210亿激活参数。
- MoE架构:与V1相同,仅专家数量和激活数量调整。
- 路由:Top-K选择器与V1相同。
- 新增技巧:Top-M设备选择
- 背景:细粒度专家数量多、激活专家多时,通信成本可能很高(token需发往大量设备)。
- 做法:在选择Top-K专家前,先选择Top-M个“候选设备”。即,先基于路由分数选出最相关的M个设备,然后在这些设备上的专家中再为每个token选择Top-K个。
- 目的:控制通信开销,提高大规模训练效率。
- 新增平衡损失:通信平衡损失
- 考虑输出通信成本(token计算完后需返回原设备),增加平衡损失项以均衡输出通信。
-
DeepSeek V3
- 参数量:6710亿总参数,370亿激活参数。
- MoE架构:与V1/V2核心部分相同,即延续了V1的“2个共享专家 + 64个细粒度专家(选6激活)”这类MoE单元的核心设计模式,并通过复制扩展此类单元以达到非常庞大的总专家数量。
- 路由机制调整:
- 门控权重归一化到1(Softmax移到Top-K之后,或类似操作)。
- 使用Sigmoid而非Softmax的指数函数作为激活,更平滑。
- 平衡损失调整:
- 采用前述的“辅助损失重平衡”技巧(通过
b_i偏置项在线调整)。 - 但仍保留了“序列级辅助损失”以确保单个序列内的专家负载均衡。
- 采用前述的“辅助损失重平衡”技巧(通过
- 保留Top-M设备选择技巧,但移除了V2中的通信平衡损失。
-
DeepSeek V3 非MoE组件创新:
- 多头潜在注意力 (Multi-Head Latent Attention, MLA)
- 目的:优化KV Cache大小。
- 做法:
- 输入
H_t先投影到一个低维的“潜在表示”C。 - 缓存这个较小的
C。 - 需要K, V时,从
C上采样回K, V。 K = C * W_k,V = C * W_v。- 计算注意力时
Q * K^T = Q * (C * W_k)^T = Q * W_k^T * C^T。 - 可以将
Q的投影矩阵W_q与这里的W_k(或其转置)合并,从而不增加额外的矩阵乘法FLOPs。
- 输入
- 与RoPE的兼容性:原始MLA与RoPE不直接兼容,DeepSeek有特定解决方案(在非压缩维度上应用RoPE)。
- 多令牌预测 (Multi-Token Prediction, MTP)
- 思想:模型不仅预测下一个token,还预测未来多个token。
- 做法:在主Transformer输出最终隐状态后,接一个非常轻量级的一层Transformer,用它来预测更远未来的token。
- 实际应用:尽管图示复杂,DeepSeek V3的MTP仅预测未来1个额外token(即总共预测未来2个token)。
- 多头潜在注意力 (Multi-Head Latent Attention, MLA)
结论
- MoE已成为构建和部署大规模、高性能语言模型的核心技术。
- MoE利用稀疏激活思想,在不显著增加FLOPs的情况下扩展模型参数量。
- 主要挑战在于离散路由的优化和系统复杂性,但启发式方法(如平衡损失)已被证明有效。
- 大量经验证据表明,在FLOPs受限的场景下,MoE是成本效益高的选择,值得学习和应用。