speaker 1: So we'll get started today. We're going to cover mixture of experts. Last year, this was kind of a fun bonus lecture that I threw together. But this year, thanks to lots of people doing MoEs, this has become a much more critical lecture, so I've added a lot of the recent developments. And at the end, we'll try to walk through DeepSeek V3 and try to understand what all the components are that make up a state-of-the-art open-source system, or at least what that looks like on the architecture side. So mixture of experts is how a lot of the most modern high-performance systems today are built and deployed. There was the funny NVIDIA leak of GPT-4 potentially being revealed as GPT-MoE-1.8T. But more broadly, others like Grok and DeepSeek and Llama 4 have all adopted a mixture-of-experts architecture. And it seems at this point in 2025 that the advantage of mixtures of experts over dense architectures is very much clear, right? At almost all compute scales, training a mixture-of-experts model, if you do it well, is going to give you benefits over a dense model. And so everyone seems to be doing it, in both the East and the West. So this will be an important thing to understand if you're trying to build the best model that you can for the flops that you have. So mixture of experts is very simple. It's a terribly named concept. You hear "mixture of experts" and you think, oh, there must be experts specialized for different domains doing different things, like there's a coding expert and an English expert and an other-languages expert. It is very far from that mental model. A mixture of experts is a type of fancy architecture that has several subcomponents called experts that are activated sparsely. And in particular, when you think about mixture of experts, you should be thinking about the MLPs. This is where all the action is, right? An MoE architecture and a non-MoE architecture are going to be similar in almost all of their components except for one. If you look at this slide over here, these are the components of a standard transformer. You've got your self-attention, you've got your FFN. If you zoom in, in a dense model the feed-forward component is just there; it's one big block. In a sparse model, what you would do is take this FFN and split it up, or copy it, depending on how you're going to be setting up your MoE. You're going to have multiple copies, let's say, of your FFN, your fully connected networks, and you're going to have a router that picks some smaller number of those in each forward pass, each inference. So this is the basic idea behind the MoE: we're going to replace this one big feed-forward on the left side with a selector layer and many smaller ones. And what's the advantage of this thing? Well, if it's sparsely activated, that is, let's say it only picks one expert, and an expert is the same size as your dense FFN, then the flops between the left side and the right side, the dense model and the MoE model, are the same, right? They're doing the same matrix multiplies as you do your forward pass. So you have more parameters without affecting your flops. And if you're a believer that what matters is having more parameters to, for example, memorize facts about the world, well, this is a great architecture. So you can kind of see the intuition behind MoEs.
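To make that concrete, here is a minimal sketch in PyTorch (purely illustrative, not any particular model's implementation) of a dense FFN block next to a toy MoE block where a learned router picks a single, same-sized expert per token. Per-token flops are roughly the same in both; the MoE just has n_experts times the FFN parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Standard transformer feed-forward block: d_model -> d_ff -> d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w_out(F.relu(self.w_in(x)))

class ToyTop1MoE(nn.Module):
    """N copies of the FFN plus a linear router that picks one expert per token.
    With top-1 routing and full-sized experts, each token still runs through a single
    d_model -> d_ff -> d_model block, so per-token flops match the dense FFN,
    but total parameters are N times larger."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([DenseFFN(d_model, d_ff) for _ in range(n_experts)])

    def forward(self, x):                               # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                              # each token is processed by exactly one expert
                out[mask] = expert(x[mask])
        return out
```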
Hopefully, that's all very clear, and you might wonder, okay, so it makes sense that you can get more parameters per flop, but does that translate to actually better performance for the models that you're training? And there have been, I think, at this point many, many papers showing that at the same flop count, at the same amount of training flops, you get better performance out of a mixture of experts than out of a dense model. So this is a nice paper. Today I'm going to go over a couple of the classic Google papers that put this field together, and this is one of them, by Fedus et al. 2022, where they show that if you flop-match your training, so that's the same amount of compute used for training, then as you increase the number of experts, the training loss of your language model just keeps going down and down and down, right? So more experts, better. Of course, the experts aren't free. You need to store the memory for these experts, and when you do parallelism, you're going to have to think about routing your data into 256 separate experts, so there are going to be systems complexities. But if you're only thinking about flops, this is a great chart to see, because you have the same flops but you've gotten better test loss essentially for free. And you see the same thing reflected on the right side: as you train for longer and longer, the model with more experts, the Switch-Base model with 128 experts, gets better perplexity faster. So hopefully that is quite clear. You might say, well, this is a 2022 paper. Does this hold on modern architectures, at modern scales? It continues to very much hold. AI2 had a very nice paper, OLMoE, which did a whole bunch of ablations and carefully controlled comparisons of dense versus MoE and other architectures, and they see exactly the same thing. So here on the left side, this is still from Fedus et al., you see the seven-x speedup from having many experts. On the right side, this is the OLMoE comparison. You see the pink one is the MoE and the teal one is dense, and the training loss for the dense model goes down much more slowly than the MoE, right? So hopefully I have in some sense sold you on the value of MoEs and on learning this slightly new architecture. We're going to pay a price for all of this, but at least at the flops level, this looks very compelling, right? So yes, question. speaker 2: In the last lecture you mentioned that operations with negligible flops can still be really expensive in wall-clock time, for example because of loading things in and out. Is there anything like that here? speaker 1: So the question was, in the last lecture I was saying that even small non-flops, negligible-flops operations can be really big in wall clock; is anything in the MoE world going to look like that? And I think one of the drawbacks of MoEs, and why this is not the standard thing that's being taught, let's say, in CS224N, is that there are significant systems complexities to making this thing efficient. So I'll get to that. It's possible to make these things very efficient, especially if each expert lives on a separate device so that you're routing data to different places. You can be very efficient when you do that, but it's not easy, right? So there are a lot of infrastructural concerns, and you're going to see a lot of complexities to get this thing to work. But when it does work, you're putting all of your flops to use. Okay.
And then the last one that I wanted to show: a lot of the companies really love MoEs because you get to present plots that look very compelling, like this one, right? This was from the DeepSeek-V2 paper. On the x-axis, and this is a little bit of sleight of hand, this is only activated parameters, right? So this is only the parameters that are used for computation; you ignore all the deactivated experts. And the y-axis is MMLU performance. And we see DeepSeek-V2: wow, look, very few activated parameters, really good MMLU performance, right? And so if you're interested in both training and inference flops, activated parameters are the name of the game, and you get really good performance here. And this is not just an ablation. This is a real system that someone spent a lot of money to train and deployed out in the wild. And we'll see this sort of pattern recur in other examples as well. Oh, was there a question? All right. And so the systems thing that is also a benefit is that MoEs allow us to have another axis of parallelism. I'm going to get into parallelism in much, much more detail in the systems lectures, where I'll talk about how you're going to take your model, cut it up into many small pieces, and lay them out across many different devices. But at a very high level, when you have experts, there's a very natural way to parallelize at the expert level. You have multiple different feed-forward blocks; you can take each of these experts and put them on a different device. And because experts are sparsely activated, all you have to do is take your token and route it to the appropriate device, and the computation will happen on that device. So it's a natural cutting point to be able to shard your model across different devices. And so this is called expert parallelism, and this is another reason why MoEs are very popular, right? If you really want to parallelize really big models, this is a thing that you're going to have to do. And interestingly enough, I think MoEs were developed at Google, and many of the frontier labs, the closed labs, were doing it, but the open results actually came from China very frequently. Qwen and DeepSeek were doing a lot of MoE work last year, and it's only really recently that Western open-source groups have started to do more MoE work. So Mixtral, Grok, I guess Grok's not open, and now Llama is an MoE architecture, right? And so it's here: Llama 4 just got released, latest and greatest, and this is also a sparse MoE. I'll talk about Llama 4 as well as I go through the lecture. As I said before, one of the starting points for this is that some of the Chinese groups, Qwen and DeepSeek, have done some really nice work benchmarking, understanding, and evaluating some of these MoE results. Qwen 1.5 was one of the first models that I knew of to have this large-scale, well-tested, well-documented MoE. And what they did was take a Qwen 1.5 dense model, and they had a nice trick to upcycle it into a mixture of experts. That's a clever trick to take a dense model and then turn it into an MoE. And they showed significant gains, at least in terms of compute efficiency, while decreasing the total number of parameters relative to their 7B model.
DeepSeek, which is now famous but originally, when these papers were coming out, was not quite as famous, did some of the, I think, really foundational MoE work in the open-source world. A big part of this lecture is actually going to be tracing the trajectory of the DeepSeek MoE architecture. But if you look at their original DeepSeek MoE paper, you'll see very nice comparisons showing things like: what happens when you train a dense model with a particular amount of flops? What happens when you train a really naive MoE that doesn't do very smart routing? And then if you use a smarter routing, the Switch-style MoE, what happens? And so you'll see all these very carefully controlled comparisons, and you see, as you go from dense to sparse, so that's the leftmost column to the rightmost column, all these benchmark metrics very consistently improve for a fixed amount of flops. So this is very consistent. And one thing that I think almost everyone at this point has probably heard of, right, is DeepSeek V3, and that's in some sense a culmination of this line of work. But if you had been following MoEs and you were excited about this branch of neural networks and language modeling, you would have actually known about DeepSeek long before V3 got popular. And we'll see at the very end of this lecture that DeepSeek V3 is actually not very different from the very earliest DeepSeek MoEs. Architecturally, they had kind of nailed it way back when they were training the much smaller 2-billion-parameter models. They really just got the engineering right to get something that is actually quite remarkably good, which is their V3 model. Okay. So now I think I've spent quite a few minutes trying to really hype you up on MoEs, and they really are, I think, worth hyping up. They're very good. But then there's a question of why haven't they been more popular? Why isn't this the standard thing we teach in NLP and language modeling classes? It's just that they're very complex and they're very messy. And I'm hoping that they get simplified over the next few years, but they still remain pretty nasty. So one of the things is the infrastructure is very complex. And the biggest advantages of MoEs really happen when you're doing multi-node training, like when you have to split up your model anyway; then it starts to make sense to shard experts across different devices. That's a very natural thing to do, but until you get to that point, maybe MoEs are not quite as good, right? Some of the earlier Google papers really talk about this tradeoff, where they say, actually, when you get these really big models that you have to split up, then experts become uniquely good. There are also other things that are really tricky if you think about them carefully. This decision of which expert you route tokens to is a very difficult thing to learn. In deep learning, we really like differentiable objectives, very smooth things that we can take gradients of. Routing decisions are not differentiable, because we have to pick and commit to a particular expert. So if we're doing that, we're going to have a very tricky optimization problem, and the training objectives to make that work are either heuristic and/or unstable, right? And so we're going to have to really carefully engineer those to get them to work. So those are two reasons why you maybe don't want to do this normally.
So what do MoEs look like? As I started this lecture with, the classic MoE is one where you take the densely connected layers, the FFNs, and you split them up or you copy them, and you have sparse routing decisions among them. Of course, you could apply the same kind of idea elsewhere. You could have a sparsely routed attention layer, and some people have done this. There have been a couple of papers and a couple of releases that have taken this approach, but it is actually quite rare to see it in the major model releases. I've seen people talking on the Internet saying that this approach is actually even more unstable and very difficult to train consistently. I haven't really seen the ablations to back that up, but certainly there haven't been many people training those kinds of models with MoE attention. So now I've told you about the basic architecture, right? It's really simple: you have a router of some kind, you route, and then you have different MLPs. So what are the things that might vary across different MoE choices? You might ask, how do we route? The routing function is an obviously important choice. How many experts, and how big should the experts be? That's another choice. And the final one is, how do we train this router, this non-differentiable objective that seems very difficult to train? Those are very important design questions, and we're going to go through each one, hopefully covering the design space of all these MoE things. Okay. Any questions before I get into each one of these different subcomponents? Good. Okay. So if you're interested in a broad overview of MoEs, at least circa 2022, there's a really nice survey or review paper by Fedus et al. in 2022 that covers a lot of these, and many of my figures are credited to that paper. If we're thinking about how we're going to route, or essentially match tokens to experts: this is the core component of an MoE, because what an MoE does is, tokens are going to be coming in, right? You have your sequence that you're processing, and those tokens are going to be assigned to experts. Not all experts will process every token; that's the whole point of a sparsely routed MoE. And so you can ask, how are these routing decisions made? You can have three different kinds of choices. You can have token choice, where each token is going to have a routing preference over different experts, and I will choose the top-k experts for each token. Or I can have expert choice, where each expert is going to have a ranked preference over tokens, and then I'm going to choose the top-k tokens for each expert. This has a really nice benefit of being balanced over experts. And the last one is, you could solve some complicated optimization problem to make sure that the mapping between experts and tokens is somehow balanced. This is global assignment. And just to give you a bit of a teaser here, almost all the MoEs do token choice top-k. In the early days of MoEs, people tried many, many different things, spanning this whole spectrum of the design space of token routers. If you look at the big releases, they have all converged to basically one class of routing mechanisms, which is token choice top-k.
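Before going through these one by one, here's a tiny illustration (random numbers, PyTorch) of the difference between the first two options: token choice takes a top-k along the expert dimension of the token-expert score matrix, while expert choice takes a top-k along the token dimension.

```python
import torch

# Illustrative only: a random token-expert score matrix standing in for router outputs.
scores = torch.randn(8, 4)   # 8 tokens (rows) x 4 experts (columns)
k = 2

# Token choice: each token (row) picks its top-k experts.
# Expert loads can end up arbitrarily unbalanced.
per_token_experts = scores.topk(k, dim=-1).indices   # shape (8, 2): expert ids for each token

# Expert choice: each expert (column) picks its top-k tokens.
# Every expert processes exactly k tokens, but a token may be chosen by zero or many experts.
per_expert_tokens = scores.topk(k, dim=0).indices    # shape (2, 4): token ids for each expert
```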
So each token is going to rank-order experts by affinity, and then there's going to be a top-k choice for each one of these. And OLMoE, which I'll keep referring to throughout this lecture because they have a really nice series of ablations, so it's really nice to teach off of, has exactly this ablation. They compare token choice routing versus expert choice routing, and they show, if you look at validation loss, token choice is much, much nicer behaved, much faster in loss decay. Yes. speaker 2: Is the routing a function of the token itself or of its position? speaker 1: It's a function of the hidden state, right? So the token is going to get processed with all the position embeddings and so on, and then the hidden state will come in and then it will be processed by the MLP. speaker 2: And for the other two, where the experts choose the tokens, can you explain what you mean when you say it's more balanced across the experts? Is it still for the current token sequence? speaker 1: It's still going to be the same set of tokens, but really it's about the ranking selector function, right? In token choice, I'm just going to take the top-k amongst the columns; maybe the scores are even identical, I'm just taking the top-k amongst the columns. In expert choice, I'm going to take the top-k amongst the rows, right? And top-k amongst the columns is kind of nice because you might be able to say, oh, I can define a scoring function such that the score is how well each token gets processed by each expert, and token choice will route me to the best expert for that token. So that makes sense from a processing standpoint. But expert choice has the benefit that each expert gets exactly the same number of tokens, and so if you're putting different experts on different devices, you've got balanced utilization. So there are different tradeoffs at play as you think about routing. Yes? speaker 2: How does each token know which expert is best? speaker 1: Good, yes. So the question was, how does each token know which expert is good? That is exactly the role of the router, and I'll give you the router equation. But to give you a bit of a spoiler, or not really a spoiler: the routers are much more lightweight than you might think. Your token, let's say, is represented by a vector x; that's your hidden, your residual stream coming in. Now x is going to get multiplied by a router weight matrix, then you'll just take a sigmoid or something, and that's the score. So it's really just a vector-vector inner product, almost like an attention operation in a way. speaker 2: Is the top-k here top three, top one each time? speaker 1: So the question was, is k one here? K is actually a hyperparameter, and different MoEs will choose different things. I will talk about this again, but to give you the high-level intuition, the initial argument that the earliest MoE papers made was that k should be greater than one, because that way you get some exploration. If you're doing k equals one, maybe you're just always exploiting the best arm, and you'll never know about the potential other things you could do. But if k is two, then maybe that second arm can give you a little bit of exploration information. So k equals two is the canonical choice, and k equals two actually continues to be very popular. That's right, that's right. So that would double the flops. And so when people talk about MoEs, they usually say things like x number of activated parameters.
And that would account for the fact that you're putting tokens into multiple MLPs. Yes. speaker 2: So when k is more than one, do you combine the outputs of the different experts? speaker 1: The question was, when k is more than one, do the outputs get combined? That's right. If you look at the diagram over there, you've got the router, it's routed to two MLPs up top, and then they get combined together right after, right? So that's exactly right. speaker 2: In that case, is it just a simple average, or a weighted average? speaker 1: So the question was, how does the aggregation happen? I'm going to go over the variants, the very common variants that people do. And really, in some ways, all you need to know is top-k in order to actually implement a high-performance MoE, but I'll give you the other variants because they're natural things you might think of. Top-k routing is what is used in most MoEs: token choice top-k routing. How that works is, you have your residual stream inputs x that go into a router. And as I said, a router is really kind of like the attention operation: there's a linear inner product and then a softmax, and then you pick the top-k most highly activated experts, and then those outputs are gated. Depending on the implementation, you might weight the outputs based on this router weight, or you might not, and then you will just output the weighted average, or just a straight sum, depending on how your MoE implementation works. A lot of the MoE papers and methods use top-k: the Switch Transformer, GShard, Grok, Mixtral, Qwen; all the DeepSeek variants use different top-k variants. Maybe a very surprising fact, and this should really make you think about what's going on with MoEs: there are a lot of results showing that you don't even need a smart router at all. You can actually just use a hashing function at the very bottom to map these x's onto your experts. And even if you're doing hashing, so no semantic information at all, you will still get gains from a hashing-based MoE, which is pretty wild. Some of the earliest work on MoEs, I think, had the very smart idea, and in many ways the right idea if you're thinking about this top-down, of using RL to learn the routing behavior. Of course, the choice of where to route is a discrete decision, and RL is great for learning discrete decisions, so why don't you use RL to learn routing? That was used in some of the earliest work on mixture of experts. As far as I know, basically no one does this now. The compute cost of doing it is too prohibitive, and you already have stability issues, so you might not want to add to that. There have been a couple of papers that have explored things like solving linear assignment problems or optimal-transport-style problems. They're very elegant, but once again, the cost of doing this is much higher than the benefit it gives you, I think, in practice, and it hasn't really been adopted. But there are a lot of really interesting things people are doing like this to try to improve the routing. So now I can point at this slide and really talk through how routing works in detail. This is the kind of top-k routing that almost everyone has converged to now. This is the router that's used in DeepSeek V1 and V2; Qwen and Grok do almost exactly this. The only difference elsewhere is that, instead of having a softmax directly at the bottom here, DeepSeek V3, Mixtral, and
DBRX don't have a softmax at the bottom, but they softmax the g_i's instead. This is a very minor difference. So let's walk through what's going on here and try to reason about the behavior of this. What's happening here is, at the very bottom, we've got our inputs. This is our u_t, the residual-stream input for token t. And I would like to take this residual stream input and process it through my MoE. So the first thing I have to do is figure out which experts are going to be activated. How am I going to do that? Very similarly to attention: I'm going to take my u, my residual stream input, and take inner products with the e_i's. These are learned vectors, one per expert, that tell the expert, "I'm an expert that points in this direction," right? So with this inner product I'm computing expert-input affinity, and I'm computing a softmax to determine, for each token, which are the best experts; once I normalize, this is the s_{i,t}. Now I take the s_{i,t} and I go through a top-k function: I only select the k best weights and use those as my gate, so I zero out everything else. Then I take the weighted average of the experts' outputs, add that to my original residual stream, and return that, right? So this is hopefully very familiar from how a transformer works, with the only difference being this top-k routing piece. Is that clear to everyone how this thing works? Good. Excellent. So in some sense, the mechanics of the forward process of the routing are very simple. What is kind of mystifying is the fact that you can learn this very well, right? This is, in some sense, a fairly complicated set of things for a model to have to learn to do well.
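As a recap, here is that forward pass written as a short sketch. The variable names mirror the slide's notation (u for the residual-stream inputs, e_i for the expert vectors, s and g for the affinities and gates); this is a simplified illustration, not DeepSeek's actual implementation.

```python
import torch

def moe_forward(u, expert_vecs, experts, k):
    """u: (num_tokens, d_model) residual-stream inputs
    expert_vecs: (n_experts, d_model), the learned e_i vectors of the router
    experts: list of n_experts FFN modules
    """
    s = torch.softmax(u @ expert_vecs.T, dim=-1)               # s_{i,t}: token-expert affinities
    topk_val, topk_idx = s.topk(k, dim=-1)                     # keep only the k best experts per token
    g = torch.zeros_like(s).scatter_(-1, topk_idx, topk_val)   # gates: everything else zeroed out
    h = u.clone()                                              # residual connection
    for i, ffn in enumerate(experts):
        mask = g[:, i] > 0
        if mask.any():
            h[mask] += g[mask, i:i+1] * ffn(u[mask])           # weighted sum of the selected experts
    return h
```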
Yes. speaker 2: We're using a softmax here, and previously we talked about one of the properties of softmax being that it pushes you pretty strongly toward a single max; it's not a hard max, but it's close. I'm having trouble with the intuition of applying a softmax and then combining it with the top-k, where you're getting multiple experts, but you're also using something that pushes you toward choosing just one thing. speaker 1: Yeah. I mean, I think maybe one way of thinking about the softmax is that the whole purpose of it is just to make it so that when I average my experts later, the weights roughly sum to one. Don't think of the softmax as a soft "max" operation, even though that's literally the name. Really, the softmax here is a normalize-to-one operation, and that normalize-to-one operation is what makes the thing up top a weighted average. The other thing that's very important is, you might think, why can't I just get rid of the top-k? Why don't I just use the softmax here and gate all the experts? Well, then you immediately lose the sparsity and efficiency aspect of this, right? You have to have top-k during training, otherwise you pay the training cost of all capital-N of your experts. This is the key thing about MoEs: we have to do all of this gymnastics to make sure that, both at training time and at inference time, we have a sparse number of activated experts. That's why we go through the top-k. Okay. Yes, from the back. speaker 2: Because you're doing the softmax first and the top-k after, the weights you get no longer have the guarantee of summing to one. speaker 1: So the question was, if you softmax first, you no longer sum to one. And yes, that's absolutely right, you no longer sum to one. And in some ways, there's no requirement that it has to sum to one, because the next layer can magnify it back up; there are layer norms everywhere. It's not as if it has to sum to one. But I think that is the reason why some of the other architectures basically move the location of the softmax. There's a kind of aesthetic choice about whether you really want that weight to be normalized to one or not. Yes. speaker 2: Yeah, so I'm wondering, how does the e vector here relate to the weights of the feed-forward network? speaker 1: Okay, so the question was whether and how the e vectors relate to the feed-forward networks. They're not really tied in any way. The e vectors are just learned vectors; just think of the e's as parameters for the router, right? They're separate objects from the FFNs. speaker 2: Yeah, I was just wondering. speaker 1: Great. The question was about how this compares to sampling from the softmax. You can sample from the softmax, and some methods actually do a kind of soft sampling from the softmax. Specifically, one of the Google papers has a procedure where they take the top element of the softmax and then randomly sample the second element proportional to the remainder of the softmax. And that gives you more exploration, which is good. But the drawback is that if you don't sample at test time, now you've got a train-test mismatch. Okay, yes. speaker 2: Why not just renormalize up top after the top-k? speaker 1: Was the question why not renormalize after the top-k, is that right? And some models do that. Some models do renormalize after the top-k, but that's kind of a choice; some architectures don't do that, some architectures do. It doesn't actually matter, because the scale can basically be adjusted post hoc, right? So there's no reason why it has to sum to one after the gating operation. Cool. Oh, sorry, yes, up there. speaker 2: So the first term in the sum, if g is approximately a probability vector, could be seen as an expectation of the FFN outputs? speaker 1: Actually, this is not an expectation of the FFN, because each FFN_i is a different FFN. So this is not actually an expectation, and the gates are sparse. So this is like a weighted selection operation over k different, or actually capital-N different, FFNs. And then the u_t at the very end there, if you remember the transformer, that's the residual stream, right? I'm adding back the inputs because I want an identity connection through it. Okay. Oh, there's another question. speaker 2: Why does the router have such a basic parametrization? What happens if you put more weights into your router? speaker 1: The question was, why is the router so basic? It seems like if you're going to have experts, it seems important to route to the right experts, so why don't you do more there? I think there have been some ablations in some of the earlier Google papers on having MLP routers and more sophisticated things. I think the answer here is that the systems concerns weigh heavily. If you're using a lot of flops to make routing decisions, you have to pay for those flops, and so you have to get corresponding performance improvements just from the routing. And I think one other thing to appreciate here is that there are really big limits to how well you can route, because the learning process for this routing thing is actually pretty dicey, right?
Because how are you going to get gradients for which routings are good or bad? Well, the only thing you have is, if you have top-two, then you can compare the two experts that you did evaluate, and you can push gradients into the s_{i,t}'s: because your g is a weight, the s_{i,t} might inform your inner products. But that's a very indirect way to be learning your affinities. So even if you make the router more complex, there's no guarantee that you're going to learn the optimal routing, right? Great. Okay. So I think one of the great innovations of the DeepSeek MoE, which was very quickly adopted by all the other Chinese MoE releases, is this idea of both a shared expert and fine-grained experts. The basic MoE structure that was originally proposed is to take your dense architecture and kind of copy the experts over. So in this case, if you have top-two routing, you're going to have twice the activated parameters of your original dense model: you take your MLP, copy it over, and activate k equals two. This is what you might think of as the vanilla or basic MoE that you might start with. People realized fairly quickly that having lots of experts is good. And the logical next step beyond "having lots of experts is good" is: I want lots of experts, but I don't want to pay the parameter cost of having lots of experts. And so DeepSeek basically argued that the right thing to do is to cut each expert up into smaller pieces. Remember, last lecture I was telling you that the golden rule, in some sense, is to take your hidden dimension and multiply that by four, and that gives you your projection dimension, right? So now what you would do is, instead of multiplying by, let's say, four, you might multiply by two. Now you have smaller matrices, you have more fine-grained experts, you can have twice as many of them, and you can take that logic much further to the extreme: you can quadruple, or multiply by eight, and keep decreasing the size of your projection dimension. That's fine-grained experts, and there are drawbacks I'll talk about later; it doesn't come for free, so you have to be careful about how you structure these things. And then the other thing that has been studied and noted is that maybe it's helpful to have at least some MLP that can capture shared structure. Maybe there's just some processing that always needs to happen no matter which token you're processing. In that case, it seems like kind of a waste to do all this routing work and to have all these parameters spread out everywhere, when we can just have one shared expert, or a few shared experts, whose job it is to handle all of this shared processing. So that's shared experts. And this setup of using fine-grained experts plus shared experts originally came out in DeepSeek MoE, although I think the original inspiration came from DeepSpeed-MoE and Qwen and others. Almost all of the open MoE releases since DeepSeek have adopted some set of these innovations, because it's quite clear that fine-grained experts, especially, are just really, really useful. That's kind of a no-brainer at this point.
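Here's what that layout looks like as a sketch: a couple of always-on shared experts plus many small routed experts, each with a sliced-down hidden size. All the sizes and the simple double loop are made up for clarity, in the spirit of the design just described; real implementations fuse this into batched kernels.

```python
import torch
import torch.nn as nn

def small_ffn(d_model, d_hidden):
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))

class FineGrainedSharedMoE(nn.Module):
    """Fine-grained + shared experts, loosely in the spirit of the DeepSeekMoE design above.
    Each expert's hidden size is the dense d_ff divided by `slice_factor`."""
    def __init__(self, d_model=1024, d_ff=4096, slice_factor=4,
                 n_routed=64, k=6, n_shared=2):
        super().__init__()
        d_small = d_ff // slice_factor
        self.shared = nn.ModuleList([small_ffn(d_model, d_small) for _ in range(n_shared)])
        self.routed = nn.ModuleList([small_ffn(d_model, d_small) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.k = k

    def forward(self, u):                                   # u: (num_tokens, d_model)
        out = u + sum(e(u) for e in self.shared)            # shared experts see every token
        s = self.router(u).softmax(dim=-1)
        val, idx = s.topk(self.k, dim=-1)                   # each token also picks k small routed experts
        for j in range(self.k):
            for i in idx[:, j].unique():
                m = idx[:, j] == i
                out[m] = out[m] + val[m, j:j+1] * self.routed[int(i)](u[m])
        return out
```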
One of the things I really like about reading DeepSeek papers is that they do ablations; it's not like a lot of the salesy tech reports. They actually care about whether or not their methods work. And so they have this lovely ablation in the DeepSeek MoE paper where they show: the blue bar over here, this is GShard, a very basic vanilla implementation of an MoE. You can have one shared expert, that's the orange bar, and that gives you a big boost on some tasks and no boost on others. You can have fine-grained experts, that's the green bars, and you get further boosts from that. And if you compare the blue to the orange, composing all of these differences gives you quite a big boost overall. And so we can see that more experts and shared experts generally seem to help. Okay, yes. speaker 2: Question about the notation: when it says seven out of something, does that mean it's doing top seven? speaker 1: Yes, sorry, I should have said. "x out of y" means x activated out of y total routed experts. That's right. And so you can kind of see the pattern here as well: as you increase the number of experts, you also often increase the number of activated experts, especially if you're doing fine-grained experts. Flops-wise that's free, right, because each expert is now smaller. Good. Okay. So OLMoE has basically corroborating evidence that shows really nicely that these things work. The bottom plot, which I'll start with because it's more decisive, shows fine-grained experts going from 8 to 32 to 64, mirroring in some sense the DeepSeek ablations. And you see very clear trends in losses and other kinds of metrics, improvements going from 8 to 32 to 64, right? Fine-grained experts are great. Shared experts, which is purple versus teal at the very top, you actually don't really see any gains from, at least in the OLMoE setup. So they actually end up going with no shared experts, even though the DeepSeek paper seemed to show more gains. So that one is maybe more mixed, given this follow-up or third-party replication of these kinds of ideas. So at this point you might be wondering, what are common configurations? I'm going to take a page out of last lecture's playbook of looking at a lot of the recent releases, looking at what people do, and trying to talk a little bit about the patterns that have arisen. Some of the early Google papers, so GShard, Switch Transformer, ST-MoE, some of them had really large numbers of routed experts, and there was a lot of really interesting stuff going on in those papers; I'd encourage you to read them. Some of that work happened in LSTMs and other kinds of architectures. Regardless, very quickly there was a period of eight-to-sixteen-expert models, like Mixtral, DBRX, Grok, with two active experts. Those worked reasonably well. But then DeepSeek MoE, or DeepSeek MoE V1, comes out with the prototypical configuration I told you about: fine-grained experts, 64 of them, six actively routed, two shared experts, and each expert is about one fourth the size of a normally sized expert. Take that last column with a grain of salt, because I had to back these numbers out from config files and things like that; I'm not 100% sure about the exact ratios here. Then we've got essentially Qwen 1.5, DeepSeek V3, MiniMax. These are Chinese MoEs; they follow essentially in the same footsteps as DeepSeek V1.
The specific numbers are different, but in the sense that they use fine-grained experts, and they often have shared experts, they're very similar to this original DeepSeek MoE configuration. OLMoE, MiniMax, and Llama 4 are very recent MoEs; they definitely do all this fine-grained expert stuff, and Llama 4 also uses a shared expert. You see variations in configuration, but you see what's basically shared, which is this fine-grained experts idea. And especially for the big models like Llama 4 and DeepSeek, very, very large numbers of routed experts, or sorry, not routed, total experts. Yes, the ratio there is representing roughly how much each expert is sliced relative to having just the standard dense configuration. So in terms of hyperparameters, if you're following the rule of thumb, the ratio of your hidden dimension to your projection dimension in the MLP should be about one to four, or about one to 2.6 if you're doing a gated network, right? And so by looking at the hidden layers of these architectures, you can kind of see how many times they sliced up that original feed-forward size. speaker 2: So, like, one of the rows shows one quarter, but then they have 64 of those experts, so that's still increasing their total parameters, right? speaker 1: Yeah, so you can think of this as roughly, they have the equivalent of 16 normally sized experts, and so they of course have more parameters than the dense equivalent. They have six routed plus the shared ones, so they have eight total active experts at any time, each of which is quarter-sized. And so you should think of them as roughly double the flops of a dense equivalent. So, some arithmetic, but hopefully the math is clear and consistent. speaker 2: Yeah, but some of the ratios are really weird, like the one over fourteen. speaker 1: Yeah, so for some of the exotic ratios, I'm not quite sure why they're that way, but they are very precisely whole numbers when you take the ratios between the FFNs and the implied hyperparameters. And so I think those are exactly the split counts of how much they were sliced, but I'm not sure why they have one over fourteen. speaker 2: Does that mean a smaller dimension than the model dimension in the MLP? speaker 1: Yeah, so you're asking whether they actually down-project. That's right: in some of them the experts are actually smaller. I don't remember which models in particular, but in some of them, I do remember, they're actually down-projecting. speaker 2: What is the intuition for wanting more than one shared expert? speaker 1: Yeah. I mean, it does seem like there was a period where some of the Chinese LM companies tried many shared experts, and then people have come back to zero or one. And if you look at the OLMoE ablations, it's not quite clear that even one shared expert is decisively useful. I think the original motivation was that then you have equally sized experts, like these are both one-quarter-sized experts, and now you have eight active experts total, and so you can keep the sizes consistent. Otherwise, I don't really see a particular justification for why it should be two smaller ones versus one larger one. Okay, cool.
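To make the arithmetic from a couple of questions ago concrete, here's a back-of-the-envelope check using a DeepSeekMoE-V1-style configuration (the numbers are approximate and only for illustration):

```python
# Everything measured in units of "one full-sized dense FFN".
expert_size = 0.25              # each fine-grained expert is ~1/4 of a dense FFN
n_routed, k = 64, 6             # 64 routed experts in total, 6 active per token
n_shared    = 2                 # 2 always-on shared experts

total_expert_params = (n_routed + n_shared) * expert_size   # ~16.5x the dense FFN's parameters
active_per_token    = (k + n_shared) * expert_size          # 8 quarter-sized experts active

print(total_expert_params, active_per_token)                # 16.5, 2.0 -> roughly 2x the dense flops
```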
So then, hopefully, you get a sense of how the routing works for a lot of these MoEs and how it's all set up; the forward pass, hopefully, you fully understand. Now we need to think about training, and training is pretty gnarly, right? The major challenge I foreshadowed earlier: when we train, we cannot turn on all the experts, because if we do, we pay the full flops cost of all the experts, right? Having a model that's, I don't know, 256 times more expensive to train is a total no-go. So we need train-time sparsity, but sparse gating decisions are obviously not differentiable. We now have a kind of annoying RL-ish problem. And so we could do any of these things: RL to optimize gating policies; bandit-inspired things that use randomization to get exploration; or we can just have some heuristics that try to balance things out, right, like putting some loss terms in there and hoping things work out. Having gone through deep learning classes of many kinds, you can probably guess internally which one people use in practice. I'll talk about each one of these three in turn. Okay, so RL, I think, is one of the earliest things that people tried. It's probably the most principled thing you can do in this space, right? You have a non-differentiable routing decision; well, think of that as a policy, throw RL at it, and solve the problem. Unfortunately, it's not better than a lot of the other things that you can do. There is a paper by Clark et al. in 2022, who were exploring various scaling-related questions in MoEs, and they do have an RL baseline that I was able to dig up, but unfortunately it's not really that much better than, say, using hashing for decisions. And they were really interested in benchmarking the thing on the left called S-BASE, which is a linear-assignment kind of method, and that thing handily beats doing RL. And I think in practice, the gradient variance and complexity mean that it's pretty finicky to use, and no one at scale has really used an RL-based approach to optimize these gating decisions, as far as I know. A thing that has been done much more at scale is stochastic approximations of various kinds. What they might do is add a bit of perturbation. Here's an example of one from Shazeer et al. in 2017; this is one of the early MoE papers. They're still going to do top-k routing: they're going to keep the top-k elements of this H(x) quantity, and they're going to softmax that to get the gate. But how we get this H(x) is the following. We're going to have our original linear affinity; this is identical to what we were doing before, we're basically just taking the inner product of our input x with a learned weight for each gate. So this part's the same, but now I'm going to jitter it a little bit: I'm going to add a standard normal, scaled by a learned noise term, W_noise, and this thing is going to control how much noise to inject into this process. You can think of this as a stochastic exploration policy, and by manipulating W_noise in particular ways, like annealing it down or doing various things, I can control the exploration-exploitation tradeoff that this MoE is going to have, right? So this is going to give you one solution to the explore-exploit dilemma. And especially if you're noising things up, each expert might randomly get some tokens that it wasn't expecting to get, so it will lead to experts that are less specialized but maybe a little bit more robust. And so that seems generally quite nice.
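Roughly, this is the noisy top-k gating of Shazeer et al. (2017): a clean linear affinity plus input-dependent Gaussian noise, then top-k, then a softmax over only the kept entries. The sketch below is one way to write that recipe in PyTorch; the softplus on the noise scale follows the paper's formulation, while the train/eval split is my own assumption rather than a claim about any specific codebase.

```python
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k, training=True):
    """Sketch of noisy top-k gating.
    x: (num_tokens, d_model); w_gate, w_noise: (d_model, n_experts)."""
    clean_logits = x @ w_gate                          # the plain linear affinity
    if training:
        noise_scale = F.softplus(x @ w_noise)          # learned, input-dependent noise scale
        logits = clean_logits + torch.randn_like(clean_logits) * noise_scale
    else:
        logits = clean_logits                          # no jitter at inference time (assumption)
    topk_val, topk_idx = logits.topk(k, dim=-1)
    gated = torch.full_like(logits, float('-inf')).scatter(-1, topk_idx, topk_val)
    return F.softmax(gated, dim=-1)                    # softmax over the kept top-k only
```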
Of course, the stochasticity also means that you don't get as much specialization, and that leads to a loss of efficiency. And there's another approach that people have tried where they multiply the router logits, sorry, they apply a multiplicative perturbation to the router logits, with the goal of getting less brittle experts. But this sort of jitter process was removed in some of the later papers because they found it just didn't work as well as some of the heuristic loss-based approaches. So these kinds of stochastic routing tricks were tried in a couple of the early Google papers, but I think they have generally been abandoned by a lot of the people training these MoEs. Okay. Yes. speaker 2: For the stochastic approach, what problem does that solve? Because we're still taking the top-k, so we still can't backpropagate through it. speaker 1: Well, the question was, we still can't differentiate because we're taking the top-k. But if you change your interpretation of the problem a little bit, if you think about a bandit problem, it has the same structure as this: you pull a bandit arm and you don't see any of the other arms, so you can't really allocate your resources efficiently unless you pull some of the other ones at random; then you've got enough data to be able to do some optimization. And so this jittering is very similar in spirit to an epsilon-greedy-style exploration, where you're randomly pulling some of the other arms with some probability, and the probability itself depends on how confident you are about the routing decision. So that's kind of the intuition, and of course that's going to give you some way of getting some signal back. Okay. So the thing that, in practice, people have ended up with is: we don't do any of that. We don't do RL, we don't do stochastic exploration, but we rely on another mechanism to keep things reasonable. If we're doing top-two routing, technically speaking we do get some signal in the gradient descent process, because we can compare the two experts that we did evaluate, and so it's possible to do some optimization. But if we drop all the other constraints, the big issue that arises is that you just end up picking one expert all the time. That expert is good at everything, and all the other experts are terrible, right? You end up in this local minimum where you've routed all of your tokens to one expert all the time. So the key game becomes, how do we get out of that local minimum? Load balancing, or balancing losses, is really the key trick to get out of this. And this is important to understand, because this is the loss that basically everyone actually uses to train their MoEs. So if you were zoning out earlier, you should probably make sure to pay attention to this particular set of equations. This is originally from the Switch Transformer, from Fedus et al. 2022, and they add this particular loss where they loop over each of the experts and take what you can think of as an inner product between a vector f and a vector p. So what are these vectors? Well, f_i is, for each of the experts, the fraction of the tokens that were allocated to expert i.
So you can think of f as a probability vector that's telling me what fraction of the tokens in my batch, or whatever the unit is here, did route to expert i. Now, p_i is the fraction of the router probability that was allocated to expert i. The router probability is the original softmax routing score, so p_i measures what the router intended to send, and f_i is the actual routing decision made by the top-k method. And one thing that's kind of interesting to look at is, let's say we take the derivative of that loss with respect to p_i. This is a linear function with respect to p_i, and you'll see that the strongest down-weighting action happens on the biggest experts with the biggest allocations, right? It's in fact proportional to the fraction of tokens that the expert got. So you're going to be pushed downwards more strongly if you got more tokens. That's the basic behavior of this loss, and almost everybody uses this kind of f-dot-p trick to try to balance tokens across different units. So the basic unit that you might want to balance over initially is batches: you might want each batch to get allocated evenly to experts. But you might actually have other kinds of balancing that you want to do, and DeepSeek does exactly this kind of thing. I'll talk about all the variants that they've thrown in, but the first thing is per-expert balancing per batch: for each batch, they want to make sure experts get an even number of tokens. This is from the DeepSeek paper, and hopefully this looks very familiar to you; it's exactly the same f-dot-p inner product structure as you saw before. P_i is defined a little bit differently, in terms of the s_{i,t}, but that should be familiar from earlier as well; that's the softmax score pre-top-k, right? So hopefully this all looks pretty good to you. The other thing you might want, though, is this: balancing across experts is all well and good, but you might also want to think about the systems concerns, because you're going to shard your experts onto different devices, and you might want to balance per device. So you might have another loss with essentially the same structure, but instead of counting which tokens go to which experts, you measure which tokens go to which devices, and that's a different f that's measured over the device groups rather than over each expert. And so now you can set up a different loss to balance over devices. If you optimize this, you're naturally going to learn routing functions that make sure each GPU gets an even number of tokens, leading to even utilization, and that would be great from a systems perspective. So basically everyone does this kind of thing.
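In code, the f-dot-p loss described above looks roughly like this. It's a sketch of the Switch-Transformer-style auxiliary loss; the weight alpha and the top-1 dispatch counting are illustrative choices, not pulled from any particular config.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts, alpha=0.01):
    """router_probs: (num_tokens, n_experts) softmax scores pre-top-k (the p vector).
    expert_index: (num_tokens,) the expert each token was actually routed to
                  (for top-k > 1 you would count every selected expert).
    alpha: small auxiliary-loss weight; 0.01 here is just a placeholder."""
    # f_i: fraction of tokens actually dispatched to expert i (hard counts, not differentiable).
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # p_i: average router probability assigned to expert i (this is where the gradient flows).
    p = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)
```

Note that the gradient with respect to p_i is proportional to f_i, which is exactly the "push the biggest experts down hardest" behavior described above.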
And DeepSeek V3 actually innovates a little bit here, which is kind of cool, and I don't think I've seen this before. It's one of the first things in the MoE world that doesn't actually come from Google, really: they have gotten rid of this expert balancing term. They've gotten rid of it entirely. Instead, what they now do is basically take their affinity scores and add a little fudge factor b_i, where b_i is a little per-expert offset, right? So expert i might get upweighted or downweighted. If an expert isn't getting enough tokens, it's going to be given a higher b_i, and that's going to allow it to grab more tokens. And the way this works is that they learn the b_i through a really simple online update scheme, online learning. They measure, at each batch, what each of the experts is getting: are they getting an even number of tokens? If an expert is not getting enough tokens, they add gamma, some learning rate, to its b_i, making it higher; if it's getting too many tokens, they subtract gamma, making that expert slightly less attractive, right? So they're just learning little offsets for each of the s_{i,t}'s. And notice that you're only using the b_i to make the routing decisions; you're not actually sending it along as part of your gating weights, right? That's a somewhat important detail. So they call this auxiliary-loss-free balancing. If you go and read the DeepSeek V3 paper, which all of you should because it's a really nice paper, they make a big deal about how this makes training so stable, so great, so wonderful. And then, of course, you keep reading the section and they say, actually, we decided that for each sequence, maybe we still want to be balanced, and this doesn't work well enough, so we've added the heuristic loss back. So they do have something called the complementary sequence-wise auxiliary loss, which is basically exactly the auxiliary loss, that they decided they needed because they wanted to load-balance the experts at a per-sequence level rather than a per-batch level. I'm not sure why they do that particular thing rather than some other b_i-style trick, but that's just what they do in DeepSeek V3. So it's not fully auxiliary-loss-free, as they'd like you to believe.
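A rough sketch of that bias mechanism is below. The exact update rule (comparing each expert's load to the mean) and the value of gamma are my own illustrative choices; the point is just that b_i nudges the top-k selection but never enters the gate weights.

```python
import torch

def update_router_bias(bias, tokens_per_expert, gamma=0.001):
    """bias: (n_experts,) the per-expert offsets b_i.
    tokens_per_expert: (n_experts,) token counts from the last batch.
    gamma: the bias update speed (placeholder value)."""
    avg_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > avg_load
    # Overloaded experts become slightly less attractive, underloaded ones more attractive.
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()

# During routing (per the description above):
#   topk_idx = (s + bias).topk(k, dim=-1).indices   # bias influences which experts are selected
#   gates    = s.gather(-1, topk_idx)               # gate weights still use the plain affinities s
```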
Oh, yes, question. speaker 2: This is a bit of an unfair question, but if we did not have to worry about systems optimizations, do you think the performance of this kind of model would be better, or would it stay roughly the same? speaker 1: If we did not think about systems optimization, would the performance of this model be better or stay the same? When you say this model, what do you mean, DeepSeek V3, or... speaker 2: Like this in general, this modern setup. speaker 1: So are you saying, if we ignore the systems concerns, do we think MoEs are still good? Is that one way of asking the question? speaker 2: Like, would the downstream performance be better than what we have right now? speaker 1: Yeah, so I think... speaker 2: If I didn't have to balance this, I could just route to whichever expert is best. speaker 1: Yeah, yeah, that's right. Well, I think actually per-expert balancing, this term, is not a systems concern. You still want to do this, because if you don't do it well, you'll find, and actually, I'm going to keep referring to the OLMoE paper because they have so many ablations, they have a really nice ablation where they get rid of exactly this. And what they find is that basically early on in training, the model just picks one or two experts, and all the other experts are dead; the router never sends anything to them. So you're just wasting memory at that point, and you've lost performance for free: you've effectively gotten a smaller model. And so even if you ignore all the other device-balancing, parallelism concerns, you've just gotten a worse model because you didn't properly allocate your experts, right? It's the same as wanting to use all your parameters: you would like to effectively use your parameters, so you want to do expert balancing. Sorry, say again? speaker 2: What does device refer to? speaker 1: Yeah, so normally this would refer to a GPU. There is a subtlety, which I'll talk about maybe in the very last or second-to-last slide: there are more sophisticated and cool versions of this where you try to balance things to minimize communication costs as well, and so there are broader notions of device, like one node or one rack or whatever else. But here it usually refers to a GPU. Yes. speaker 2: Going back to the fact that hashing as a router still improves performance, is there an intuition for that? Because that's effectively just choosing one of a few feed-forward members to send things through, right? So why does having multiple copies of that, each of which gets less data, help? speaker 1: Yes, the question was, why does hashing do anything at all? I don't have a really precise intuition for this, but you can make arguments either way. One is, even if you're hashing, the same tokens, or the same kinds of sequences, are going to go to the same expert every time, right? And so each expert will still get some deterministic subset of the inputs, and so some specialization can still occur; it's just non-semantic, or non-learned. And if your distribution is Zipfian, the word "the" might dominate one expert, and so you might still effectively get semantic specialization, where one expert is dominated by very frequent things. speaker 2: What about something random, like a pure random routing that's not dependent on the input? speaker 1: I would bet that that would be really terrible. I have never run or seen that, but yes, I think that would be horrible. Good. Yes. speaker 2: You have many layers in the transformer, and each layer has its own experts, so with, say, 32 layers and 64 experts, wouldn't that need a ton of GPUs? Or do a couple of experts get bumped together onto a single GPU? speaker 1: So the question was, wouldn't you need lots of GPUs if you have lots of layers and lots of experts? Yeah, if you exclusively gave a GPU to a single expert, that would be kind of crazy. But you would assign and shard things so that each GPU holds enough of these units to effectively use its memory, right? The name of the game in parallelism is that you always want to use up all of your memory, because that's one of your resources; you don't want to parallelize more than you have to. Cool. Okay. Excellent. Oh, okay, I did put the ablation in here. Yeah. So this is exactly the answer to the question of what happens if you don't have the expert balancing loss. I think the great picture to see is this bottom-left one: if you don't do load balancing, which tokens get assigned to which expert? You see the pink and the yellow experts just kind of take over; they take up about 50% of the tokens, and all the other experts are dead. They do nothing, right?
And so you've wasted the majority of your experts at this point, six out of eight of them, and you've unintentionally created a two-expert MoE, and that gives you worse losses, as seen up on the top, the teal lines. Of course, maybe that's still better than the dense model, because at least you've got two experts going, but you could have done better, counterfactually speaking. Okay. So I won't go quite as deep as I could into the systems side, because I haven't yet covered the core systems concepts you'd need to deeply appreciate a lot of the parallelism concerns, like the hierarchy of communication speeds in a data center and so on. But, as I said before, one thing to keep in mind is just how nicely MoEs can be fit onto devices. What people call expert parallelism involves putting one or a few experts onto each device. And what happens when you're processing a token? Well, you hit the router, and after the router you've picked a few experts. So now you have a collective communication call, an all-to-all dispatch, that sends the tokens to the relevant devices; the feedforwards compute their outputs; and then you return the tokens to where they belong, or combine the outputs of multiple experts, so you need another collective communication call. If your feedforward computations are big and beefy enough, you can pay for the cost of doing this expert parallelism. One of the things that's nice about this is that it's another form of parallelism in your toolkit. So you've got, on the right side, data parallelism, model parallelism of two or three different kinds, and then expert parallelism, and you can combine all of them to trade off all the resources you have: the communication speed, the amount of data you have, your batch size, your number of experts, and your memory. I'm not going to go into too much detail about how specifically this helps, but keep in mind that it gives you another tool in your toolkit. Another thing that's useful: let's say you have multiple experts on a single device. You might hope that because the computations are sparse, say the first token gets multiplied by expert zero, the second by expert one, and the third by expert two, so this is really three matrix multiplies that are small and sparse, modern GPUs could take advantage of these kinds of sparse matrix multiplications. And that's exactly right. If you lay out your experts correctly and the weights are fused in the right way, then modern block-sparse matrix multiply engines can make sure that you're not wasting any flops while doing this as one big matrix multiply. So libraries like MegaBlocks can take advantage of this device-level sparsity support to do multiple expert computations all at once. This is yet another advantage that you get with MoEs.
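To spell out what "multiple expert computations all at once" means logically, here is a minimal PyTorch sketch written as an explicit loop over experts. Libraries like MegaBlocks replace this loop with a single fused block-sparse kernel; the function name and shapes below are just illustrative.

```python
import torch

def grouped_expert_ffn(x, expert_ids, expert_weights):
    # x:              (num_tokens, d_model) tokens resident on this device
    # expert_ids:     (num_tokens,) which local expert each token was routed to
    # expert_weights: list of (d_model, d_ff) weight matrices, one per local expert
    out = torch.zeros(x.shape[0], expert_weights[0].shape[1],
                      dtype=x.dtype, device=x.device)
    for e, w in enumerate(expert_weights):
        mask = expert_ids == e            # tokens assigned to expert e
        out[mask] = x[mask] @ w           # small dense matmul per expert
    return out
```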
So, one fun side thing, which maybe isn't mysterious to you anymore because you've grown up in the era of GPT-4: when the GPT-4 API first came out, it was kind of mysterious to me, because when you set the temperature to zero, you still got different responses, even though it was supposed to be deterministic, and lots of people speculated about why that would be. I'm not saying this is the answer, but there is actually an interesting source of randomness in MoEs. So in MoEs, think about what happens. You're going to route your tokens to experts, and experts live on different devices. You are, of course, going to batch your queries when you're processing them, and once you've batched your queries, these tokens are going to get routed to different experts. So imagine you've got this batch to process and you've got a bunch of experts, but for whatever reason this batch really loves expert number three, like all the tokens go to expert number three. So now what happens? Well, the device for expert number three doesn't have enough memory to handle all of those tokens, and then you get what people call token dropping. This happens at training time as well. You often have what's called a capacity factor, which controls the maximum number of tokens allowed per expert, and if the router allocates too many tokens to an expert, you just drop those tokens, either for systems reasons or because you're worried that that expert is going to take over, at least at training time. So now this token has gotten dropped and it's not going to get anything at all: the MLP just does a zero computation, the residual connection passes things straight through, and then you return an output. And so if your token got dropped, you're going to get a different result than if your token hadn't been dropped. So, based on who else is in your batch, MoEs can induce stochasticity, both at training time and at inference time, which is kind of an interesting thing that you don't normally think about, because you almost never think about cross-batch effects when doing inference.
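Here is a minimal sketch of capacity-based token dropping as just described. The capacity formula, the "keep the first tokens that arrive" tie-breaking, and the function name are illustrative assumptions, not any particular library's implementation.

```python
import torch

def drop_over_capacity(expert_ids, num_experts, num_tokens,
                       capacity_factor=1.25, top_k=1):
    # Each expert accepts at most `capacity` tokens per batch; tokens routed
    # beyond that are dropped and only pass through the residual connection.
    capacity = int(capacity_factor * num_tokens * top_k / num_experts)
    keep = torch.zeros_like(expert_ids, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # first `capacity` tokens kept, rest dropped
    return keep                             # dropped tokens get only the residual path
```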
Okay. So those are the main basic components of building an MoE. One fun side note: if you were to actually go out tomorrow and try to train an MoE, I think the systems side would make you a little bit sad, but the other thing that would make you sad is probably the stability side of things. MoEs kind of have this property that sometimes they just blow up on you. If you try to fine-tune them, they're very difficult to fine-tune, and they sometimes blow up on you. And so Barret Zoph and others really studied this; the paper I'm referencing here has the entire purpose of stabilizing MoE training. There are a couple of tricks that I'll mention that I think are relevant and that people actually use. The first one concerns the router softmax. This goes back to last lecture about stability, right? What did I say about stability? The thing to be afraid of is the softmaxes; the softmax is always where you want to be afraid. And so for MoEs, they do the router computations in float32, just to be safe, and sometimes they also add an auxiliary z-loss. Hopefully you remember it from just last lecture: you take the log of the sum of the exponentiated values in the softmax, you square that, and you add it as an extra loss. This keeps the normalizer values near one, which is nice for stability. This is actually one of the places where the z-loss was used early on, before it became more popular for training models in general. You can see the effects here. If you look at the losses, I think the second plot is maybe the best one. If you remove the z-loss from your routing function, you see these giant spikes in your validation loss, where the model just goes a little bit crazy for a couple of iterations and then gets pulled back. Of course, it still trains okay, but you're better off having a z-loss than not having one; there's a pretty noticeable gap in the validation loss by the end here, right? Other things that can happen: of course, you want to fine-tune your MoE, and you'd like to RLHF your MoE if you're going to ship and release it. But this turns out to be kind of problematic. Some of the earlier work, when people were just starting to do MoEs, was back in the BERT and T5 era, so there was a lot of fine-tuning going on, and one of the things people saw was that a lot of overfitting happens. If you were doing sparse models, you see this big gap between train and validation, the blue and orange lines, whereas the dense model, the green and red lines, has a smaller train-test gap. So there were a lot of worries about overfitting, because you have these gigantic-parameter models that you're fine-tuning on small data. One solution proposed at the time, which I don't think is very popular now as far as I understand, is to architect your MoE so that not every layer is an MoE layer; you, say, alternate dense layers and MoE layers, and then you can just fine-tune the dense layers, and that will be fine, because it behaves just like a dense model. Another solution, the one we saw in the DeepSeekMoE paper, is to just use a lot of data. If overfitting is a problem, well, we have access to lots and lots of SFT data, so just shovel all of those examples in. In the case of DeepSeekMoE, they use 1.4 million training examples, and then maybe you're not quite as worried about these overfitting concerns. The last thing I'll end with, a trick in the toolkit that people have used, is upcycling. The idea is to take a dense model, like the one over here, take your MLP and make a bunch of copies of it, maybe perturb them, add a router that's initialized from scratch, and then just pretend this is an MoE and train it from that point on. You've initialized the MoE from a dense model, and this trick is called upcycling. People have shown that, if you can get it to work, it's a very cost-effective way of getting an MoE, and the MoE is great for inference because not every MLP is active at inference time. So you might effectively get a much larger parameter model without doing the training of a much larger parameter model. And several people have succeeded at this.
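As a rough picture of what upcycling does at initialization, here is a minimal PyTorch sketch. The perturbation size, the plain linear router, and the names are illustrative assumptions rather than any specific paper's recipe.

```python
import copy
import torch

def upcycle_dense_mlp(dense_mlp, num_experts, d_model, noise_std=0.01):
    # Each expert starts as a (slightly perturbed) copy of the trained dense MLP,
    # the router is initialized from scratch, and training then continues as an MoE.
    experts = torch.nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(dense_mlp)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))   # small per-copy perturbation
        experts.append(expert)
    router = torch.nn.Linear(d_model, num_experts, bias=False)  # fresh router
    return experts, router
```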
MiniCPM, which I'll mention again in the scaling-law lecture, is a Chinese open LLM effort that basically tried to build really good small language models, and they succeeded at taking a dense model and upcycling it into an MoE. You can see that their numbers get significantly better in the last two rows: going from the dense model to the MoE gives a pretty nontrivial bump in performance. And Qwen, which I mentioned at the start of this lecture, one of their earliest MoE attempts was taking one of their dense models and building an upcycled MoE, and they got fairly significant performance gains relative to smaller models at the time; they got a model on par with their 7B models using only 2.7 billion active parameters. So, to wrap up, I want to walk through the DeepSeek MoE architecture at the very end here. First, I want you to understand the DeepSeek V3 architecture setup and all the changes that they made, because that's an example of a modern, high-performance open-source system. I also want you to appreciate that architectures don't change that much. DeepSeekMoE, the V1 of this line, is not that new; it's maybe a year and a half, maybe two years old, and they basically nailed the architecture at that point, right? So I want you to see what they changed from the very earliest attempt to their big training run. This is the very first starting point. This is DeepSeekMoE; I'm calling it V1, but probably the right way to refer to it is just DeepSeekMoE. It's a 16-billion-parameter model with 2.8 billion of those parameters active. You've already seen this diagram over here: this is the two shared plus 64 fine-grained experts, of which four are active at a time... well, six are active at a time, sorry. And the routing you've already seen as well; I presented it in the middle of the lecture. This is the very standard top-k routing, where the softmax is at the bottom, before the top-k selection. And for balancing at training time, all they do is add the auxiliary balancing losses, both the expert-level and device-level balancing terms; hopefully you remember those from earlier. So that's the V1.
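As a rough sketch of that routing pattern, shared experts that always fire plus a top-k over fine-grained routed experts, with the softmax applied before the top-k selection, here is some illustrative PyTorch. The module structure and the per-token loop are simplifying assumptions for clarity, not DeepSeek's implementation.

```python
import torch

def moe_layer(x, router_w, shared_experts, routed_experts, k=6):
    # x:              (num_tokens, d_model)
    # router_w:       (d_model, num_routed_experts) router weights
    # shared_experts: modules applied to every token
    # routed_experts: modules of which only the per-token top-k are applied
    probs = torch.softmax(x @ router_w, dim=-1)        # softmax before the top-k
    gate, idx = torch.topk(probs, k, dim=-1)           # per-token expert choice
    shared_out = sum(e(x) for e in shared_experts)     # shared experts always fire
    routed_rows = []
    for t in range(x.shape[0]):                        # routed experts fire sparsely
        routed_rows.append(sum(g * routed_experts[e](x[t])
                               for g, e in zip(gate[t], idx[t].tolist())))
    return shared_out + torch.stack(routed_rows)
```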
Then they saw how effective their MoE model was. To add some more context: DeepSeek originally had a dense model, and then they had an MoE model, and the MoE model was remarkably good. So when they went to V2, they went straight to the MoE. And now this is a 236-billion-parameter model, of which 21 billion parameters are active, right? So you need a lot of memory, but your flops consumption for inferencing this model is not so bad. Now, the architecture is identical; I copied literally the same figure, because the architecture is literally the same, minus changes to the number of experts that are active. And we've got some new things happening, but not too many. The top-k selector is the same; the equation from before, this previous equation, is identical. This is still how they do things. But they add a very clever trick on top. And this is what I was alluding to at the very beginning: what's the drawback of having fine-grained experts? Why can't I have, I don't know, 1,024 fine-grained experts, or 2,048 fine-grained experts? Well, the problem is that when you shard your experts very finely and you have a lot of active experts, you're going to have to route to all of those experts, so your communication costs potentially grow. If you're very fragmented, you might have to send a lot of tokens to a lot of devices. And the clever thing they come up with is to say: I'm not going to route each token to its top-k experts naively, which might force me to send tokens to lots of devices. Instead, I'm going to first pick the top-M devices. I do my normal scoring calculation, but I first restrict the set of allowed devices to the top M, and once I've picked my devices, then I pick the top-k experts for each token within those devices. So now I've restricted the devices, and this really controls the communication cost, which gives you more efficient training when you're scaling up to these gigantic sizes. You need to start really engaging with the systems aspects when you're training a 236-billion-parameter model. The other thing, which reflects the systems concerns that show up at this scale, is that they add a communication balancing loss. One way of thinking about it: for an expert, there are inputs and outputs, right? The inputs are the tokens that come in when you route to the expert, and for the outputs, you have to bring the tokens back to where they belong; if a batch belongs on some device, the results have to go back to the original device. So you have to think about both the input communication cost and the output communication cost, and they add a balancing loss to try to balance the output communication as well, not just the input side. That's a minor note, but you can see their attention to detail in trying to make sure all of the different systems aspects are properly taken care of.
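Here is a minimal sketch of the device-limited routing idea: pick the top-M devices first, then take the top-k experts within them. Scoring a device by the best expert it hosts is an illustrative assumption on my part; the real system's device-scoring rule and all the surrounding communication machinery are not shown.

```python
import torch

def device_limited_topk(scores, expert_to_device, num_devices, m=3, k=6):
    # scores:           (num_tokens, num_experts) routing scores
    # expert_to_device: (num_experts,) tensor mapping each expert to its hosting device
    num_tokens, num_experts = scores.shape
    # Score each device, per token, by the best expert it hosts (assumed rule).
    device_scores = torch.full((num_tokens, num_devices), float("-inf"))
    for e in range(num_experts):
        d = int(expert_to_device[e])
        device_scores[:, d] = torch.maximum(device_scores[:, d], scores[:, e])
    top_devices = torch.topk(device_scores, m, dim=-1).indices        # (num_tokens, m)
    # Mask out experts on non-selected devices, then take top-k as usual.
    allowed = (expert_to_device.unsqueeze(0) == top_devices.unsqueeze(-1)).any(dim=1)
    masked = scores.masked_fill(~allowed, float("-inf"))
    return torch.topk(masked, k, dim=-1).indices                      # chosen experts
```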
Now, finally, we get to the big DeepSeek V3. Sorry, that should just say V3, not V2, up there. It's 671 billion parameters, of which 37 billion are active. Once again, it's exactly the same figure, because the MoE architecture itself doesn't change; that's stayed the same since DeepSeekMoE, right? If it works, don't change it. They do change a couple of things. Maybe they were hearing you all say, why don't you normalize to one? So they've normalized the gate to one; they've moved the normalization operation up there. But they're not actually exponentiating the gating scores; they're taking sigmoids, which is a softer, more nicely behaved operation than the softmax. So they've got some changes here, but conceptually this is still the same top-k routing decision; you hopefully see very, very similar things happening. And then, in terms of losses, they've gone to this auxiliary-loss-free trick of the b_i being incremented or decremented based on the expert load, and then they have a sequence-wise auxiliary loss. Just to add some context on why you would want to balance different experts within a single sequence: the thing they're very concerned about is that at training time it's fine not to have a sequence-wise balancing loss, but at inference time, someone might send you very out-of-distribution sequences, and that might overwhelm certain experts. At inference time, you can't control which sequences you get, so you might want stronger balancing that operates at the single-sequence level rather than the overall batch level. Okay. And then in the... whoops, yes. speaker 2: The top-M devices, do they keep that in V3? speaker 1: Yeah, they keep the top-M improvement. They do not keep, for example, the communication loss. So they've jettisoned some things. But top-M, I mean, it seems like a pretty clever idea, so they keep it. Yeah. But it's not like they only ever add things; they have removed some things too. And in the last two or so minutes of class, I'm going to go over the non-MoE parts of DeepSeek V3, because we're already at the point where I've explained most of DeepSeek V3; I might as well explain the rest of it, so you all know how the whole thing works. They have a clever optimization for the attention piece called MLA, or multi-head latent attention. You actually already know all the ingredients you need to understand this, because at the end of last lecture I talked about GQA and MQA; those are inference optimizations aimed at reducing the size of the KV cache. The DeepSeek folks take a different tack, a different approach to optimizing this. Instead of reducing the number of heads, they project the heads into a lower-dimensional space. So you have your inputs, h_t. Instead of generating the k's and v's directly from these h_t's, what they do is first generate a low-dimensional c. You can think of this as a compressed version of h; the c is smaller and easier to cache, so they just cache these c's. And whenever they need the k's and v's, they can up-project from this c, conceptually speaking, and then take the inner products with the q's, right? So you can see how this would be a KV cache saving, if you only have to save the c instead of the higher-dimensional h_t. And that's exactly the idea: you take your h_t, project it into a lower-dimensional c, and then up-project this back into the k's and v's. If the c's are small, you've compressed the KV cache. That's good. In terms of the computation, if you're thinking about flops, you might think, well, this is not good, because I have to multiply by an extra matrix, the up-projection for k; I didn't have this matrix before, and that's an extra matrix multiply I have to pay for. But the clever thing is to remember that on the other side of k, I'm going to take q dot k, an inner product in the attention operation, and q itself has its own projection matrix. So the trick is that you can merge the k up-projection and the q projection matrix together into one matrix. I haven't added any extra matrix multiplies; I've just merged the new matrix multiply into my other one. This is just associativity; I can merge the two. They also compress the queries for memory savings during training, but that one is not quite as necessary, because it doesn't interact at all with the KV cache.
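Here is a minimal single-head sketch of the caching idea, ignoring RoPE and the multi-head details. All the names and shapes are illustrative, and in practice the k up-projection gets absorbed into the query projection rather than materialized the way it is here.

```python
import torch

def mla_kv_cache_step(h_t, W_dkv, W_uk, W_uv, cache):
    # h_t:    (d_model,) hidden state for the current token
    # W_dkv:  (d_model, d_c) down-projection to the compressed latent, d_c << d_model
    # W_uk:   (d_c, d_head)  up-projection to keys
    # W_uv:   (d_c, d_head)  up-projection to values
    # cache:  list of cached latents c_t; this is all that gets stored
    c_t = h_t @ W_dkv              # compressed latent, the only thing we cache
    cache.append(c_t)
    C = torch.stack(cache)         # (t, d_c)
    K = C @ W_uk                   # keys reconstructed on the fly
    V = C @ W_uv                   # values reconstructed on the fly
    return K, V
```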
I'll only mention this last subtlety in passing, but it's a clever one to notice: the trick I just described at the top is not compatible with RoPE. And the reason is that with RoPE, you take the q's and the k's and rotate each of them by multiplying with position-dependent rotation matrices R_q and R_k. But if you do that, then these rotation matrices sit in between the query projection and the latent up-projection matrix, and since the merging trick relied on being able to reorder those matrix multiplies, RoPE gets in the way. They do have a solution, which is basically to apply RoPE on separate, non-compressed dimensions. That's a side point, and I think it's not quite as important; you can look at the paper if you're super interested. The other thing they do, and this is the last thing, I promise, is a minor change to the loss function called MTP, multi-token prediction, where they predict multiple tokens in parallel. Normally you have your inputs, you shift them to the left by one, so you're predicting one token into the future, and your transformer predicts all those tokens; that's your normal transformer loss. But what you can also do is, right before you make those predictions, take the hidden state and pass it to a very lightweight one-layer transformer, and that module predicts one more token into the future. So now the model is not just predicting the next token, it's predicting two tokens into the future. Hopefully that all makes sense; it's just a small, lightweight module that does this, and you can see the architecture right here. The one thing that is kind of disappointing, which I learned as I was researching this lecture, is that they actually only do MTP one token ahead. So even though they have this very elaborate diagram of how they could do it for many tokens, it turns out it's only done for one extra token. Okay, so now I'm all done. MoEs are now at the core of how you would build and deploy a really high-performance, large-scale system, and they take advantage of the sparsity idea that you don't need all of the parameters all the time. Discrete routing is the real big challenge, and this is, I think, one of the big reasons why MoEs didn't immediately catch on: it's very scary to have to optimize these top-k routing decisions. But heuristics somehow seem to work, right? They just do. And so there's a lot of empirical evidence now that MoEs, at least in flop-constrained settings, are just a good idea. They're cost effective. You should do it. So definitely worth learning. Thanks a lot for listening.