speaker 1: Hello everyone. Today our speaker is Albert Jiang from Mistral AI and the University of Cambridge. Nice to meet you, and thank you for coming today and for being our speaker. Albert is an AI scientist at Mistral AI and a final-year PhD student in the Computer Science Department of Cambridge University. He works on language model pretraining and reasoning at Mistral AI, and on language models for mathematics at Cambridge. Today Albert is going to talk about Mixtral, a sparse mixture-of-experts language model. I'll now leave it to Albert. speaker 2: Hi Emily, thanks for the really nice introduction. Let me share my screen here. So, thanks again for the really nice introduction, Emily. Today I'll be talking about demystifying mixture of experts. My name is Albert; I'm a scientist at Mistral AI and a PhD student at the University of Cambridge. As for the content you'll be hearing today: first I'll talk about architecture, starting with a review of the dense transformer architecture and then the sparse mixture of experts, or SMoE. After that I'll talk about interpreting sparse mixture-of-experts models. The majority of these slides are based on the Mixtral of Experts paper that we put out earlier this year, in January. Throughout this talk, if you have any questions, feel free to raise your hand and I'll be really happy to answer. I'll also be posing some open research questions, because these things are still very much open research, and we hope more work can be done on them so the open source community can really enjoy the benefits. Okay, let's get started. First, let's talk about architecture. We can start with the dense transformer architecture that a lot of people have already covered, but here I want to focus on the Mistral 7B model, which is a dense transformer. The main difference between Mistral 7B and, I guess, the rest of the dense transformer models is that it uses grouped-query attention, which sits somewhere between multi-head and multi-query attention. In multi-head attention you have the same number of key, query, and value heads. In multi-query attention you have many query heads and just one key head and one value head. Grouped-query attention is in between: many query heads, and fewer key and value heads. There's another difference in the attention we use for Mistral 7B: sliding window attention. The idea is that in the lower layers of the transformer, each token position attends to a relatively short span of previous tokens, and as you go deeper in the transformer, the later token positions still get information transmitted from the earlier token positions. Neither of these architectural designs is new; they were already present in prior works from 2023, 2019 and 2020. So these are really nothing new, but we decided to adopt these architectural changes in Mistral 7B to make a better model.
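To make the sliding-window idea concrete, here is a minimal sketch of the kind of attention mask it implies. This is an illustration rather than the actual Mistral implementation; the function name is made up, and the real model uses a window of 4096 tokens rather than the toy sizes below.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where a query position may attend to a key position."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    # Causal (no attending to the future) and restricted to the last `window` tokens.
    return (k <= q) & (k > q - window)

# Toy example: 8 positions with a window of 3.
print(sliding_window_causal_mask(8, 3).int())
```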
Okay. After writing these slides, I realized there's actually a huge number of design choices you have to make when designing a transformer architecture, because there are so many details, so just presenting this doesn't feel sufficient. So first, these are the sort of standard configurations for Mistral 7B, and I also wanted to have very clean PyTorch code that just tells you how to write a single transformer layer. I'll be adopting the Noam Shazeer notation of annotating the dimensions of every tensor as a suffix in the variable name. Let's say our sequence length is 8192, and we have 32 query heads and 8 KV heads, which is quite standard for grouped-query attention; our head dimension is 128, and the feed-forward hidden dimension is about 14k. To write a single transformer layer, you first initialize your query, key and value matrices. I've written down their dimensions here, and notice that we don't use bias, so there's no bias in any of these matrices. We also write down our output matrix; note that the output matrix has the same dimensions as the transpose of the query matrix. To write the attention forward pass, assume the input x has shape l times d, where l is the sequence length and d is the model dimension. You first use these matrices to compute your queries, keys and values; their shapes are annotated with the sequence length, the number of heads, and the head dimension. Then you apply rotary embeddings to your queries and keys; you don't apply them to the values. Then you do a very standard attention computation with the queries, keys and values. Note that we actually repeat the keys and values a few times to make their dimensions match the queries. Then we just return the output multiplied by the output matrix. So that's your standard attention forward pass. To get a full transformer layer, you first do a normalization over the input, here we use RMSNorm, then the attention forward pass, then a residual connection, followed by another normalization, plus an MLP, plus another residual connection. That's what's present in a transformer layer. Okay. With these architecture designs, we managed to make a model that's quite good. It was released in September last year, and it was able to beat a lot of the Llama 2 models back then despite being very, very small.
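For reference, here is a minimal PyTorch sketch of the kind of layer just described, using the grouped-query head counts above. It is a simplification rather than the actual Mistral code: rotary embeddings and the sliding-window/causal mask are omitted, and the dimension-suffix naming only loosely follows the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed Mistral-7B-like sizes (public config): model dim 4096, 32 query heads,
# 8 KV heads, head dim 128, feed-forward hidden dim 14336.
D, N_Q, N_KV, H, FF = 4096, 32, 8, 128, 14336

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GQAttention(nn.Module):
    """Grouped-query attention, simplified (no rotary embeddings, no attention mask)."""
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(D, N_Q * H, bias=False)   # no biases anywhere
        self.wk = nn.Linear(D, N_KV * H, bias=False)
        self.wv = nn.Linear(D, N_KV * H, bias=False)
        self.wo = nn.Linear(N_Q * H, D, bias=False)

    def forward(self, x_ld):                          # x: (seq_len, D)
        l = x_ld.shape[0]
        q_lnh = self.wq(x_ld).view(l, N_Q, H)
        k_lmh = self.wk(x_ld).view(l, N_KV, H)
        v_lmh = self.wv(x_ld).view(l, N_KV, H)
        # Repeat the KV heads so their head count matches the query heads.
        k_lnh = k_lmh.repeat_interleave(N_Q // N_KV, dim=1)
        v_lnh = v_lmh.repeat_interleave(N_Q // N_KV, dim=1)
        # Standard scaled dot-product attention, one score matrix per head.
        s_nll = torch.einsum("inh,jnh->nij", q_lnh, k_lnh) / H ** 0.5
        s_nll = s_nll.softmax(dim=-1)
        o_lnh = torch.einsum("nij,jnh->inh", s_nll, v_lnh)
        return self.wo(o_lnh.reshape(l, N_Q * H))

class TransformerBlock(nn.Module):
    """Norm -> attention -> residual, then norm -> SwiGLU MLP -> residual."""
    def __init__(self):
        super().__init__()
        self.attn_norm, self.mlp_norm = RMSNorm(D), RMSNorm(D)
        self.attn = GQAttention()
        self.w1 = nn.Linear(D, FF, bias=False)   # gate projection
        self.w3 = nn.Linear(D, FF, bias=False)   # up projection
        self.w2 = nn.Linear(FF, D, bias=False)   # down projection

    def forward(self, x_ld):
        x_ld = x_ld + self.attn(self.attn_norm(x_ld))
        h_ld = self.mlp_norm(x_ld)
        x_ld = x_ld + self.w2(F.silu(self.w1(h_ld)) * self.w3(h_ld))
        return x_ld

block = TransformerBlock()
x = torch.randn(16, D)        # a toy sequence of 16 token embeddings
print(block(x).shape)         # torch.Size([16, 4096])
```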
Okay, let's now go to the interesting bit: mixture of experts. Mixture of experts is not a new idea; it has been around for a long time, and there are a couple of fairly recent papers that really highlighted its importance and the benefits you get from these mixture-of-experts models. Probably the most well-known one is Switch Transformers, the Google paper that scales transformers to trillion-parameter models with simple and efficient sparsity. They discussed a lot of the parallelism decisions you can make around data parallelism, model parallelism and expert parallelism. The sparsely gated mixture-of-experts layer itself is an even older idea, from 2017, also by some quite famous people at Google. At the time they were still using recurrent language models, and the idea is to have a gating network that decides which experts you should route each input to. In our implementation of the mixture-of-experts layer, we do something very similar. The inputs get sent to a router, and the router assigns the gating weights and picks the top two experts to send the input to. After these experts individually process the input, the gating weights are applied to the experts' outputs, which are summed to give the result. If you write it out mathematically: for an input x, the router multiplies it by a matrix W_g; you take the top-k entries of x·W_g and apply a softmax to them to get your gating weights, and the output of the MoE layer is y = Σ_i softmax(Top2(x·W_g))_i · SwiGLU_i(x), where the sum runs over the two selected experts i. A sparse mixture-of-experts model is quite near the cost-performance Pareto frontier, because each token doesn't have to go through the entire network's parameters; it only goes through the active parameters for that token. This mixture-of-experts model can outperform Llama 2 70B with about five times faster inference, so we kind of see it as a drop-in replacement for GPT-3.5. It also masters quite a few European languages in addition to English, and it can gracefully handle a context of 32k tokens. The model was released under Apache 2.0, so you're free to use it for commercial purposes. I've also attached the configuration for the mixture-of-experts model here, so you can compare it back to Mistral 7B to see the differences. Okay. Now I want to talk about what MoE-fying, which is a verb I just created, the MLP layers actually gives you. The conventional wisdom is that the MLPs in a transformer store knowledge, and the attention implements algorithms and reasoning. So by MoE-fying the MLP layers, we're supposed to get a boost in knowledge, and indeed, when we benchmark Mixtral 8x7B against Mistral 7B and a couple of the Llama 2 models across a few different categories, you can definitely see that on the knowledge-heavy tasks, such as MMLU and the knowledge category, the Mixtral 8x7B model, shaded in yellow here, does a lot better than Mistral 7B. On the reasoning and comprehension tasks it also seems to do a little better, but not by as much as it does on knowledge and MMLU. Okay. Here is a different graph, where the horizontal axis is the number of active parameters and the vertical axis is the performance of the models on various tasks. The round dots here are the series of Llama 2 models, and the orange dots are the Mistral models, from Mistral 7B to Mixtral 8x7B. Mixtral 8x7B uses only 12.9 billion active parameters, so close to 13 billion, yet it performs quite a bit better, especially in the knowledge category; you can see the increase from Mistral 7B to Mixtral 8x7B is quite large here. There are some improvements in the other categories, but the knowledge improvement is really striking.
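To make the top-2 routing just described concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward block with SwiGLU experts. It mirrors the formula above rather than the actual Mixtral code, and the default sizes are deliberately tiny so it runs anywhere; Mixtral itself uses a model dimension of 4096, a hidden dimension of 14336, and 8 experts with top-2 routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, dim=128, hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)   # router W_g
        self.w1 = nn.ModuleList([nn.Linear(dim, hidden, bias=False) for _ in range(n_experts)])
        self.w3 = nn.ModuleList([nn.Linear(dim, hidden, bias=False) for _ in range(n_experts)])
        self.w2 = nn.ModuleList([nn.Linear(hidden, dim, bias=False) for _ in range(n_experts)])

    def expert(self, i, x):
        # SwiGLU feed-forward for expert i
        return self.w2[i](F.silu(self.w1[i](x)) * self.w3[i](x))

    def forward(self, x):                                    # x: (n_tokens, dim)
        logits = self.gate(x)                                # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # keep only the top-2 experts
        weights = top_vals.softmax(dim=-1)                   # softmax over the selected logits
        out = torch.zeros_like(x)
        for i in range(self.n_experts):
            token_pos, slot = (top_idx == i).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue                                     # expert i selected by no token
            w = weights[token_pos, slot].unsqueeze(-1)       # gating weight for expert i, per token
            out[token_pos] += w * self.expert(i, x[token_pos])
        return out

moe = SparseMoEBlock()
tokens = torch.randn(10, 128)
print(moe(tokens).shape)   # torch.Size([10, 128])
```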
Okay, so there is a research question: if MoE-fying the MLP layers of your language model gives you a huge boost in knowledge, how about MoE-fying the attention layers? This is also not a new idea; it appears in 2022 in the Switch Transformers paper. The idea is that you can replace the trainable QKV matrices with switch layers: instead of doing a standard matrix multiplication, you do an MoE, so you have a gating layer and then some dense layers for the QKV matrices. The problem at the time was stability. If you use bf16 precision to train these models, they can sometimes diverge, but if you use fp32 they don't, and you get some performance boost, which is nice. At present, though, we're mostly training with bf16, so we have to find a way to make it work. Maybe you can try better stability techniques, maybe some normalization, or maybe a different way to MoE-fy the attention layers. This is an open research question, and I would really like to see some research done here. After this, I want to talk about several myths that seem to be floating around mixture-of-experts models. Myth one is that there are eight experts in Mixtral 8x7B. This is probably our fault for naming the model Mixtral 8x7B. The truth is that every transformer layer has eight experts, and these experts are permutation-equivalent: the gating network decides how much weight to give to each of the experts and only picks the top two, so it doesn't matter how you permute them; they will give you the same result. That means instead of eight experts, you actually have 32 times eight experts in total, and they are relatively independent across layers. Myth two is that there are 56 billion parameters in Mixtral 8x7B. But the gating and the attention layers are shared, so they're not in the "8x" part, and in fact there are only 46.7 billion parameters in total. Each token actually only sees 12.9 billion active parameters, not 14 billion. Okay. Another thing that seems quite popular these days is to compare everything related to cost with the number of active parameters as if they were proportional, and that's not quite right. Mixtral 8x7B has fewer active parameters than Llama 2 13B, but having this expert routing means you need to send tokens around to different experts quite dynamically, which means you have more communication cost; you can't just pre-program which token will be sent to which expert. So while you gain much more in performance divided by cost, the absolute cost is not proportional to the active parameter count. Usually, for a mixture-of-experts model, the actual cost of serving is a bit more than an equivalent dense model with the same number of active parameters. Okay. Here is also what I think is a really interesting research question: how do you balance loads at inference time? Your gating layer might still decide to do unbalanced routing at inference time, which makes inference slower. Ideally, you want your eight experts, or however many experts you have, to handle the same number of tokens, so that you don't always have to wait for the slowest expert. Some nice ideas are around, like mixture-of-depths and load balancing based on the gating scores. For example, if you saturate one of the experts, maybe you can find a neighbor of that expert that is not saturated to give the token to, and then slowly fill up the total expert-times-token budget you have.
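As a rough sanity check on the figures in Myth 2, here is a back-of-the-envelope parameter count using the published Mixtral 8x7B configuration. It ignores the norm layers and other tiny terms, so the totals are approximate, but it lands on the 46.7 billion total and 12.9 billion active numbers mentioned above.

```python
# Published Mixtral 8x7B configuration values.
dim, n_layers, ff, n_experts, top_k = 4096, 32, 14336, 8, 2
head_dim, n_q_heads, n_kv_heads, vocab = 128, 32, 8, 32000

attn = dim * head_dim * (2 * n_q_heads + 2 * n_kv_heads)  # Wq, Wk, Wv, Wo per layer (shared)
expert = 3 * dim * ff                                      # one SwiGLU expert: w1, w2, w3
gate = dim * n_experts                                     # router per layer (shared, tiny)
embed = 2 * vocab * dim                                    # input embeddings + output head

total = n_layers * (attn + n_experts * expert + gate) + embed
active = n_layers * (attn + top_k * expert + gate) + embed

print(f"total  ~ {total / 1e9:.1f}B")    # ~46.7B, not 8 x 7 = 56B
print(f"active ~ {active / 1e9:.1f}B")   # ~12.9B per token, not 2 x 7 = 14B
```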
Okay, another quite open research question is: how can you compress SMoEs? This comes from a screenshot someone took of Tim Dettmers, because I really trust this man's model compression skills with my life. Tim Dettmers said he thinks we can compress a mixture-of-experts model to smaller than four gigabytes. MoE compression is quite different from dense transformer compression: regular transformers are really, really difficult to sparsify, especially the feed-forward MLP layers, but the MoE layers are quite different, so maybe there's something you can do with MoE layers such that your compression is much more efficient. By compression here I mean sparsification, not in the sense of choosing only the top two experts out of eight, but in the sense that a lot of the parameters might not be doing much work and you could shave them off. This was said by Tim Dettmers in December last year, and I don't think I've seen much convincing work on sparsifying MoEs to an extreme, but I would really like to see more work going on here. Okay, now I want to talk about how to interpret mixture-of-experts models. A mixture-of-experts model with a sparse gating layer actually provides you with an incredible opportunity by giving you discrete gating signals. Deep neural networks are traditionally very hard to interpret: all the weights and activations live in very high-dimensional spaces. The attention in transformers potentially offers some interpretation opportunity, because you can look at how much attention weight gets assigned to which tokens, so you're essentially peeking at what the model is looking at through the attention scores. But this also quickly gets very messy, because you have a lot of attention heads; it's hard to visualize and hard to interpret. The gating layer in SMoEs potentially tells you a bit more, because it tells you which expert is looking at which tokens. So can we make some sense out of this? In our paper, we tried to find out whether there is some sort of domain specialization in these experts. We took the validation splits of the Pile dataset, represented here in different colors: the red is ArXiv, the green one is GitHub, the blue one is PhilPapers, the purple one Stack Exchange. For layers 0, 15 and 31, which correspond to the shallowest layer, a middle layer, and the deepest layer (layer 31 is the layer just before decoding), we plot how much each expert is selected. Since we have eight experts, the random selection chance is 12.5%. You can see that at layer 0, the layer closest to the raw tokens, the distribution is quite uniform. It's not perfectly uniform, but quite close, and this might be telling us that this layer is too shallow: it's still doing a lot of syntactic work, and you don't get a lot of meaningful domain specialization going on. Then there's the middle layer, which is potentially where you get a lot of semantic information.
You can notice that expert three here is doing something quite interesting: it doesn't get selected very often for any of the other categories, but it gets selected a lot for DM Mathematics, with GitHub in second place. So maybe this expert, expert three in layer 15, is doing something math- and code-heavy, but this is quite speculative and we can't really conclude much from it. For layer 31, the distribution goes back to close to uniform again, although you get some spikes here and there that seem to specialize in mathematics; keep in mind that DM Mathematics is more like arithmetic. Okay, we also did some analysis on consecutive tokens. The question to ask here is whether two consecutive tokens get assigned to the same expert. First we have the first choice: if a token is given to expert i, does the next token also get assigned to expert i? We're still doing this analysis on the three layers, layer 0, layer 15 and layer 31, corresponding to three different places in the network, and the validation set is still the Pile validation split. Random assignment would give you 12.5%: the chance that the token after one assigned to expert i also gets assigned to expert i. We can see that at layer 0 this is slightly higher than random, and at layer 15 it is significantly higher than random: it's almost double the random chance that the gating layer's first choice for the next token is the same expert as for this token. When you go to the last layer, you'll find that this trend regresses a little bit; it's still significantly higher than random, but a bit less than at layer 15. We also look at the first or second choice: if this expert is not chosen as the best expert to route the next token to, could it be the second best? Here the random-chance probability is around 46%. You can see the same pattern: for layer 0, the rate at which the next token gets assigned to the first or second choice of this expert is just slightly higher than random; for layer 15 it is significantly higher; and for layer 31 it regresses a little bit. So I think there's certainly more detailed analysis to be done here on what you can conclude from these mixture-of-experts models. We also tried to visualize, in three examples, which experts actually selected which tokens. The first example is actually from the GitHub code for the MoE layer, the second is simple arithmetic questions, I think from DM Mathematics, and the third is a simple multiple-choice question, shown for layer 0, layer 15 and layer 31. You can see that there doesn't seem to be much specialization. The colors represent different experts, and here all these digits get assigned to the same expert, but that seems to be about it; we don't see a very clear distinction between which experts handle which tokens.
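For anyone who wants to reproduce this kind of routing analysis on an open-weight SMoE, here is a minimal sketch. It assumes you have already extracted, for one layer, the router's top-2 expert indices for each token; the random tensor below is just a stand-in for that data, and the variable names are purely illustrative.

```python
import torch

n_experts = 8

# Hypothetical routing data for one layer: top-2 expert indices per token,
# shape (n_tokens, 2). Replace with indices hooked out of a real model.
top2 = torch.randint(0, n_experts, (1000, 2))

# 1) How often is each expert the first choice?
#    (uniform random routing would give 12.5% per expert)
first = top2[:, 0]
freq = torch.bincount(first, minlength=n_experts).float() / len(first)
print("first-choice frequency per expert:", freq)

# 2) How often do consecutive tokens share the same first-choice expert?
#    (random chance would be 1/8 = 12.5%)
repeat_rate = (first[1:] == first[:-1]).float().mean().item()
print("consecutive first-choice agreement:", repeat_rate)
```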
That relates closely to myth four, which is that you want experts that specialize in domains. Let's say you have two experts specialized in coding. That means that if you have some code, they would just handle all the tokens. Then what do the other experts do while you're coding? When you want to generate a lot of code, it seems you would only route all your tokens to these two experts, and the other experts just stand there and do nothing. You actually want all the experts to be fully engaged all the time to maximize your inference efficiency. Language is also so complex that manually specifying the domains that experts should specialize over seems like a simplification to me, because there might be a lot of underlying features that the experts are actually specializing over that are just not visible at the high level, the very high domain level. Then there's a treasure hunt. This is why we love open source so much. Around 24 hours after our Mixtral 8x7B model release, someone on a Chinese website found out that there's one expert that's particularly crucial. What they did was remove the i-th expert from all of the layers, and here you can see the effect of removing each expert, with the MMLU score on the vertical axis. It seems that if you remove the third expert, everything just collapses: the MMLU score is 0.63, and that's 0.63%, not 63%. For the other experts, if you remove them, the MMLU score drops a little, but not too much. And they made this meme about it: expert three seems to be doing all the work, while all the other experts are sort of standing around doing not much. Okay. So I think another really important research question here is: how do we interpret the mixture-of-experts routing decisions, and what are the features that the experts are learning? The experts might capture features that are very different from what we perceive as concepts, and it might be more efficient to represent linear combinations of concepts, as long as they span the same subspace. How do we recover that subspace, or recover something that we can actually understand, to see what underlying concepts they're actually learning? Okay, so I just want to conclude here: sparse mixture-of-experts models leverage sparsity to gain a lot more knowledge, you can train very good mixture-of-experts models that are quite efficient at inference, and expert specialization is not as straightforward as one might think. There are tons of architecture and interpretability research left to do. I also just want to plug Mistral AI, co-founded by Arthur, Timothée and Guillaume. We have offices in Paris, London, and the San Francisco Bay Area (actually in Palo Alto), and around 500 million in funding. If you enjoy solving the research questions I just mentioned, doing open source AI, or just generally empowering people with language models, you're very welcome to join us. Thank you very much. speaker 1: If any of you have any questions, feel free to come up and ask. speaker 3: I recently read a paper from Meta; they are using a dense model, not this one, on page 28, I think. Which feels like the other direction, the opposite way. Why did you choose the sparse mixture of experts rather than what other people are doing? I now see most of them using dense models, to do something like an edge deployment with a local database, in a confined scope.
You can have more knowledge in a confined scope; I see the trend going that way. For example, you go to the Egyptian museum here in Sunnyvale or somewhere, and you can have ChatGPT with local knowledge that you cannot get from the Internet, that you can only get by cooperating with the museum. I think the dense model allows this kind of deeper domain knowledge. But here you are using the sparse model with mixture of experts, which to me is the opposite. Can you explain why? speaker 2: Yeah, sure. So if I understand the question correctly, the question is asking: why are we choosing sparse mixture-of-experts models instead of dense models, where there's no sparsity involved? I think for edge devices, dense models have a lot of potential, because for sparse mixture-of-experts models, although you have sparsity at inference time, meaning a token does not have to see all the parameters but rather just a small fraction of them, you still have to load all of the experts into memory. That might pose a huge challenge for edge devices: your phone might not have, like, 200 gigabytes of memory available to run inference very efficiently. But for data centers, or for people who are serving these models, it can be an advantage in terms of inference cost, because you can get a very good performance-divided-by-cost ratio. speaker 3: I see. So your point is that because there's not enough memory at the edge, for the edge it's better to use a sparse model rather than a dense model? speaker 2: Oh, sorry, no. I mean, for edge devices I think it's better to use dense models, because for sparse models you still have to load... speaker 3: ...everything into memory. Yeah, yeah. So your use case is more for, like, closer to the center, the cloud, is what you mean. speaker 2: Yeah. Yeah. Okay. speaker 3: Got it. Thank you. speaker 1: Thank you. Does anyone else have a question they'd like to ask? Yes, do you want to go ahead? speaker 5: I have a question in regards to fine-tuning. We've been working with multiple models: Gemini, ChatGPT, Llama. And there is a big problem with fine-tuning these models, especially on tensors, on images. They're having issues with understanding. What can you say about that? speaker 2: You mean, in general, fine-tuning language models on visual tasks, I guess? Okay. I guess that's a very broad question. What I can say about that is, for ChatGPT and Gemini, you don't really have the model weights in your hands, so you are basically relying on OpenAI or Google to do LoRA for you, and you have less control over your fine-tuning. Whereas for an open source model, or I guess an open-weight model, you actually have the weights in your hands, so you can do LoRA, or you can do full fine-tuning; you get more control. That's in general what I can say about fine-tuning open source models versus very closed source models. But exactly what happens when you're doing fine-tuning really depends on your use case and your data. speaker 5: Yeah, I mean, I'm sorry, it was a little too general.
When we start transferring them onto a large dataset, let's say 10,000 images or 5,000 images, you have to convert them into a textual format, because none of these models are actually designed to be fine-tuned on images; they don't understand them. So is this something you see in development in open source models? speaker 2: By translating these images into text, do you mean something like OCR, or converting some digital format into a textual format? speaker 5: They do understand them digitally, as an image, if you upload them in the chat interface, like in ChatGPT, but when you start feeding them through APIs, you need to convert them, and then it becomes a problem. Is there any way to overcome this issue with open source models? speaker 2: I guess I'm not entirely sure what the root cause of this problem is, but I am fairly sure that with open source models you can actually see what the weights are doing; you have more control of the weights, basically. So I think open-weight models definitely pose an advantage here. speaker 1: Thank you. Thank you for the question. Does anyone else have any other questions in person? speaker 2: Yeah. I guess I was curious where you think the improvements of mixture of experts are coming from. I've heard that you can put different experts, or different feed-forward networks, on different GPUs, and that allows it to be more parallel and speeds things up from a compute perspective. Or is it more the sparsity itself that is giving these improvements? Is there any insight on that? I guess there are two dimensions to this. In terms of improvement, one dimension is performance. As I said, the mixture-of-experts models have these MoE MLP layers, so imagine taking your original MLP layers and making them eight times wider: you will be able to store a lot more knowledge in these layers, because you have a higher parameter count. That's the performance dimension: you can actually store a lot more knowledge in these models. The other dimension is the inference efficiency dimension. Definitely go read the Switch Transformers paper where they discuss this, okay, I guess that's training efficiency, but they really discuss in quite a lot of detail data parallelism, model parallelism and expert parallelism, how they affect efficiency, what the communication cost is, and what the best way to train your models is. And for these sparse mixture-of-experts models, since you only select about 13 billion parameters for each token at inference time, you are actually trying to select the most relevant parameters for each token, and that can make inference quite a lot more efficient. I hope that answers your question. Yeah, no, that definitely did. I guess I was also wondering if you've seen the mixture-of-depths paper, and that as a way of doing sparsity; do you have any thoughts on that? Yeah, I think sparsity helps when you can do some sort of adaptive computation. The mixture-of-depths model is maybe the best example of adaptive computation, which means that for predicting different tokens, you want different parameters to engage.
In the mixture-of-depths paper, they select a different number of parameters to engage in computing each token. In our paper, we select the same number of parameters, but different experts. So the difference is, I guess, how many parameters versus which parameters, but those are two different dimensions you can optimize over. You definitely want to select the most relevant parameters, and as few parameters as possible, for decoding each token. Okay. Yeah, I think those were all my questions, perfect. speaker 1: Thank you. We'll take one more in-person question if any of you have anything. If not, we can turn to the online questions. Yes. speaker 2: Hey, Albert, thanks for the talk. Just one question: earlier, at the beginning of the presentation, you mentioned routing and communication cost, and I was wondering if you could talk about how that scales relative to the number of experts and parameters you have. Oh, yeah, sure. So that depends first on the number of experts you have, and also on how large each expert is, and that can change the way we do parallelism. We know that communication is really expensive when you need to go from one GPU to another, and it's even more expensive if you need to go from one node to another. If you have a huge number of experts that will not fit into just one node, then you incur quite a big communication cost, and how to scale beyond that is, I guess, a very open scientific question. But essentially, the communication cost you incur is roughly proportional to the number of tokens routed, first between GPUs and then between nodes. Thank you. speaker 1: Thank you for all the questions. We'll have Stephen introduce some of the questions that were submitted by the people joining us on Zoom. speaker 2: Hey. speaker 4: Can you hear me? Yeah. speaker 2: Thanks for the great talk. speaker 4: Thanks. All right, so we have some questions through Zoom and Slido; I'm going to pick and choose some of them, starting with some Zoom questions. We have a question here: any comments on why Llama 3 is potentially not using a mixture of experts? A lot of people suspected they might after Mistral's success. I don't know if you would know anything about that. speaker 2: Yeah, I don't know. Just ask the Llama people. speaker 4: Right. Let me see. Here's a question: do you foresee mixture-of-experts techniques being incorporated into the other large foundation models, or will it remain a subset of models that are best for certain use cases? speaker 2: Oh, certainly. As my answer to the first in-person question suggested, for edge devices you probably want to stick with dense models because of the memory constraints. For mixture-of-experts models, you're really gaining a lot in efficiency if you can serve them at scale: if you have high batch sizes, you will potentially do better in terms of throughput than dense models. So it definitely depends on the use case. Usually, the larger the scale, the more the benefits of the mixture-of-experts model shine. And it has been widely speculated that GPT-4 is a mixture-of-experts model, if you haven't heard already. speaker 4: Can mixture-of-experts models outperform domain-specific models at their respective tasks?
speaker 2: Well, first, models that have been trained for particular domains are usually really hard to beat, and I think that's also where the purpose of continued pretraining and fine-tuning comes in. Let's say you first pretrain a general text model and then you want to adapt it to the medical domain, so you grab a lot of medical data and continue pretraining or fine-tune on it; that model is going to be really hard to beat. And in the mixture-of-experts model, at least in our case here, the experts don't really focus on traditional domains as we know them. We don't have a medical expert, we don't have a coding expert; rather, they have concepts encoded in a very non-interpretable way. So I wouldn't say that just taking one mixture-of-experts model will outperform all the other domain-focused models. speaker 4: Right, that makes sense. Here's a question: has there been any study on whether the MoE layers should be at both the early layers and the deep layers? For old neural network ensemble methods, late fusion tends to work better than early fusion. speaker 2: Okay, I think that's a really great question, because since the inception of neural networks we've sort of had this tradition of having layers that look exactly the same as each other, so that you can focus on designing this one layer to be as good as possible and then just copy-paste it a few times. But I've recently seen a paper, sorry, I don't remember the name, where they just put layers in random orders: sometimes it's attention first, sometimes the MoE first, and it can also have some other wacky layers. And I think that performs quite well. It's unclear to me why that is; it could just be chance. But if you want to design a generally really well-performing model, it's really safe to have the same architecture for every layer; I guess that's the safest choice. You can also do some sort of neural architecture search to optimize your architecture, which is a much more principled approach than guessing that this or that layer might work better. speaker 4: All right, thanks for the answer. Someone is asking: could you talk a bit more about how the eight different experts are built? Are they just fine-tuned on different datasets? speaker 2: They are trained on roughly the same datasets, but I'm not sure I can say more than that. Sorry. speaker 4: And here's a question: some work has suggested that learned routing doesn't perform any better than simply using, for example, a random mapping of inputs to experts. Do you have any thoughts? speaker 2: I would definitely need to see the paper first, but I wouldn't be too surprised if that's the case. I would still be surprised, just not extremely surprised, because if you think about MoE-fying the MLP layers as augmenting the knowledge capacity, then it seems you could potentially do that in a very brute-force way and just map tokens to random experts. I would still think that the gating has the advantage of being able to choose experts more intelligently than just randomly. So yeah, I don't know that paper, but I would love to see it.
speaker 4: Someone was asking: could you speak a bit about the general development process, from your perspective, for this model? What aspects, design choices, hyperparameters and so forth did you try that did not work before arriving at this specific architecture? speaker 2: For this particular architecture, the mixture-of-experts architecture, we kind of knew that it would work, because there have been a lot of successful papers beforehand covering the sorts of things you need to consider. One really good practice, for companies as well as for a single developer, is that you always want to think about inference needs before you design the architecture of a model. For example, you don't want a model that just slightly exceeds what a single 80-gigabyte A100 can hold, because then you basically lose a lot of efficiency just from the overhead of being a little bit bigger than 80 GB. So in terms of design, before training the model you definitely want to consider how you will run inference with it. And in terms of hyperparameters, you definitely want to do some sort of scaling-law search and analysis before you train the model, to make sure the model can be served at the best performance-to-cost ratio. So yeah, that's it. speaker 4: Great. Speaking of GPU memory, someone is asking: what is the inference runtime GPU memory footprint for the 7B model versus the 8x7B? It would be great to understand this for specific applications, especially if Mistral has the potential to run on edge devices with sufficient GPU memory. speaker 2: Sorry, I don't think I got the last part of the question. speaker 4: Edge devices and GPUs; they're just asking what the approximate inference-time GPU memory footprint is for the 7B model versus the 8x7B. speaker 2: Okay, I see. If you do it very straightforwardly, where the 8x7B loads all the experts in and the experts always stay in GPU memory, then the difference between the 8x7B and the 7B will be the same as the ratio of 46.7 to 7. That's the memory requirement. But at inference time, since the mixture-of-experts model only has about 13B active parameters, each token gets to see far fewer parameters than are held in memory. And you can do some interesting things like CPU offloading, I kind of forgot the name for it, where you keep some of the experts in CPU memory and only load them onto the GPU when you need them. That's something you can do as well, but then you also lose some efficiency, because you need to constantly transfer parameters from CPU to GPU and vice versa. speaker 4: Right, that makes sense. Speaking of parameters, someone asked: do you have a rough rule of thumb for approximating how capable an MoE model would be in terms of equivalent dense parameters? For example, I've heard that calculating the geometric mean of active parameters and total parameters is a good way to do this, which would mean that 13 billion active parameters with 47 billion total parameters would be equivalent to roughly a 22 billion parameter dense model. I'm wondering how accurate that would be.
speaker 2: I think that's a good rule of thumb, although it definitely depends on how well you train these models, how many tokens you put through them, and how good those tokens are. All other things being equal, I think that's a very good rule of thumb. speaker 4: Great. And I'm just looking at some of the upvoted Slido questions. Someone asked, for the treasure-hunted expert: was there a reverse experiment of removing all experts except number three? speaker 2: Oh, that's a great question. If you can figure that out, I'm all ears; I'd be really happy to hear about it. Then you're essentially trying to trim the 8x7B back into a 7B, which might be an interesting experiment to do. speaker 4: Right. Someone asks: what is your intuition about why the 8x7B is significantly better at reasoning compared to the 7B? Are they learning better internal algorithms? speaker 2: That could be the case, although I think that's very speculative. For a lot of these benchmarks, even for the math benchmarks, what counts as knowledge and what counts as reasoning is quite ambiguous. For example, if you're doing a math task, you can reason your way to an answer, or you can just recall: oh, here's a lemma, I can just use that. So there's high ambiguity between what counts as knowledge and what counts as reasoning. For this particular example, the 8x7B definitely gains a lot more in knowledge, but does gaining knowledge also induce some sort of change in its reasoning capability? I don't know. Whatever I can say about it is highly speculative, but I would say the benchmarks themselves are also very ambiguous, so I think it's worth finding out exactly what's going on. speaker 4: Great. Another question about GPU memory. Someone says that in production they're constantly bounded by GPU memory instead of compute. Will mixture of experts make it more challenging to serve? Any extra cost besides communication overhead? speaker 2: In terms of cost, when you say you're bounded by GPU memory, I assume it's like comparing the 8x7B to a 7B. Then yes, if you need to load all the experts into GPU memory, you're going to use a lot more of it. The benefit, or the advantage, of the 8x7B over the 7B, or let's say over equivalent dense models with a similar active parameter count, is that at high batch sizes you'll get more throughput, because you'll have different experts handling different tokens at the same time. So I'd say yes, a mixture-of-experts model definitely gives you a little more trouble in terms of serving, but if you have high volume, it's definitely worth it, because you get much more efficient at processing these tokens. speaker 4: Okay, great. I'll ask a few more, since we do have some extra time, but do let me know when you might need to leave. Someone is asking specifically about the architecture; they want you to clarify: does each layer include attention, routing and experts, or are there layers of attention, then routing, then experts? speaker 2: Okay, let me try to find the slide.
Yes, so the architecture of Mixtral 8x7B is exactly the same as Mistral 7B, as I've shown here. In Mistral 7B, for each input you first do a normalization, then attention, a residual, then a norm, the MLP, and a residual. So it's attention followed by MLP. The only difference is in the MLP layer: after the attention layer and the attention residual, instead of doing an MLP, you do an MoE MLP. So here, instead of just using one set of matrices to process the hidden representation, you use eight experts; well, you choose two out of the eight experts to process it. speaker 4: Right, that makes sense. All right, let's see, give me a moment. MoE introduces extra load-balancing losses and loss-function discontinuities. Did you run into any difficulties while training due to these complications? speaker 2: Yeah, that's a really great question. To try to clarify the question: my understanding is that when you're training mixture-of-experts models, you definitely want each of your experts to be quite balanced, in the sense that they handle a similar number of tokens, so that you're not always waiting for the slowest expert, the expert that handles the most. So yes, you definitely need to do something to make the load balance a bit better in training, but we didn't run into any big trouble while training this model. speaker 4: And here's a question about RAG. They'd like to hear when an MoE approach would be preferred to a RAG approach in a given domain, pros and cons, etcetera. speaker 2: I think these are orthogonal. You can use an MoE model to do RAG, so I don't see why there should be a conflict between the two. You can do a dense model without RAG, a dense model with RAG, an MoE without RAG, or an MoE with RAG. speaker 4: Right. And someone asked: can you potentially swap out one expert and insert a domain-specific expert that wasn't trained on the same training dataset, like a customizable or modular MoE? speaker 2: I see. I think that's possible, but after you swap out one of the experts and replace it with a domain-specific expert, you definitely need to train the model a bit more, so that the gating layers know how to route. Let's say you have a medical expert swapped in: you need to train your gating layers a little so that they know how to handle medical inputs and basically treat this replaced expert differently. So I think it's definitely possible and a very exciting research direction. There has been quite interesting research on merging and swapping and Franken-merging these models, so that's certainly something very exciting; I'm looking forward to it. speaker 4: Okay, great. I'll ask two more Zoom questions and that will be it. Someone asked: during training, is mixture of experts less computationally intensive, since the gradients would always be 7B-sized? And how do the gradients backpropagate through the routing? speaker 2: If I go to the routing layer, you will see that all the operations here are differentiable; there are no discrete operations where the gradient just stops, so this thing is entirely auto-differentiable. Sorry, what was the first part of the question again? speaker 4: They're asking if training MoEs is less computationally intensive.
speaker 2: The cost of training MoEs is roughly proportional to the number of active parameters you have, so it's roughly equivalent to training a 13B model, but you incur some additional communication cost. speaker 4: Okay, great. Last question: let's talk about even bigger MoE models, even if we're moving away from the Pareto frontier, like those 8x22, 8x30, or even 8x100-plus billion parameter models. Do you see any further serving challenges when one GPU cannot even hold a full expert after heavy quantization? speaker 2: I see. Yes. I think if you have one GPU, I would go for a dense model or a heavily quantized mixture-of-experts model. And about what you said about having more modular experts, like having 128 experts, I think that's something super exciting, because it basically allows your experts to specialize a bit more, and it gives your gating layers more power to choose which experts are best for each token. For serving purposes, that will definitely make things very, very hard: even after heavy quantization, if you have 128 experts, and you're potentially talking about having the model across multiple nodes, that will make both the implementation harder and the communication cost higher. So that's something you probably want to leave to a model provider through an API, rather than serving it yourself. But Mixtral 8x7B after quantization is definitely something very doable, almost on a single GPU. speaker 4: All right, great. Thanks for answering a bunch of questions from very curious folks. So thanks again, Albert, for the amazing talk and your time today.