2024-05-01 | Stanford CS25 V4 I Demystifying Mixtral of Experts
Mixtral 8x7B: A Look at the Sparse Mixture-of-Experts Model
Tags
Media Details
- Upload date
- 2025-05-20 13:31
- Source
- https://www.youtube.com/watch?v=RcJ1YXHLv5o
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hello everyone. Today our speaker is Albert Jiang from Mistral AI and the University of Cambridge. Nice to meet you, and thank you for being our speaker today. Albert is an AI scientist at Mistral AI and a final-year PhD student in the Computer Science Department of Cambridge University. He works on language model pre-training and reasoning at Mistral AI, and on language models for mathematics at Cambridge. Today Albert is going to talk about Mixtral, a sparse mixture-of-experts language model. I'll now leave it to Albert. speaker 2: Hi Emily, thanks for the really nice introduction. Let me share my screen here. So today I'll be talking about demystifying Mixtral of Experts. My name is Albert; I'm a scientist at Mistral AI and a PhD student at the University of Cambridge. Here is what you'll be hearing today: first I'll talk about architecture, starting with a review of the dense transformer architecture and then the sparse mixture of experts, or SMoE. After that, I'll talk about interpreting sparse mixture-of-experts models. The majority of these slides are based on the Mixtral of Experts paper that we put out earlier this year, in January. Throughout this talk, if you have any questions, feel free to raise your hand and I'll be happy to answer. I'll also be posing some open research questions, because these things are very much at the research frontier, and we hope more research can be done on them so the open-source community can really enjoy the benefits. Okay, let's get started. First, architecture. Let's start with the dense transformer architecture that a lot of people have already talked about, focusing on the Mistral 7B model, which is a dense transformer. The main difference between Mistral 7B and the rest of the dense transformer models is that it uses grouped-query attention, which sits somewhere between multi-head and multi-query attention. For multi-head attention you have the same number of keys, queries, and values. For multi-query attention you have a lot of query heads and just one key head and one value head. Grouped-query attention is in between: you have a lot of query heads and a smaller number of key and value heads. There is another difference in the attention we use for Mistral 7B: sliding window attention. The idea is that in the lower layers of the transformer, each token position attends to a relatively short span of previous tokens, and as you go deeper in the transformer, later token positions get information transmitted from earlier token positions. Neither of these architectural designs is new; they are already present in prior work from 2019, 2020, and 2023. So these are really nothing new, but we decided to adopt these architectural changes in Mistral 7B to make a better model. Okay. After writing these slides, I realized there is actually a huge number of design choices you have to make in a transformer architecture, because there are so many details, so just presenting this doesn't feel sufficient.
So first, these are the standard configurations for Mistral 7B. I also want to show a very clean piece of PyTorch code that tells you how to write a single transformer layer, and I'll adopt the Noam Shazeer notation of annotating the dimensions of every tensor as a suffix in the variable name. Let's say our sequence length is around 8k, we have 32 query heads and 8 KV heads, which is quite standard for grouped-query attention, our head dimension is 128, and the model's latent dimension is 4096 with an MLP hidden dimension of about 14k. To write a single transformer layer, you first initialize your query, key, and value matrices; I wrote down their dimensions here, and notice that we don't use biases in any of these matrices. We also write down the output matrix, and note that it has the same shape as the transpose of the query matrix. For the attention forward pass, assume the input x has shape L by D, where L is the sequence length and D is the latent dimension. You first use these matrices to compute your queries, keys, and values; their shapes become (L, n_query_heads, head_dim), (L, n_kv_heads, head_dim), and (L, n_kv_heads, head_dim) respectively. Then you apply rotary embeddings to the queries and keys (not to the values), and you do the very standard attention mechanism with the queries, keys, and values; note that we repeat the keys and values a few times so that their head count matches the queries. Finally you multiply the result by the output matrix and return it. That's your standard attention forward pass. To build a full transformer layer, you first apply a normalization (we use RMSNorm) to the input, run attention, and add a residual connection; then you apply another normalization, an MLP, and another residual connection. That is what a transformer layer consists of.
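Before moving on to the mixture-of-experts implementation, the grouped-query attention block and pre-norm layer just described can be sketched in PyTorch roughly as follows. This is a minimal sketch using the dimensions from the talk; rotary embeddings, the sliding-window mask, and KV caching are omitted, and it is not Mistral's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_Q_HEADS, N_KV_HEADS, HEAD_DIM, FFN_DIM = 4096, 32, 8, 128, 14336

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GQAttention(nn.Module):
    """Grouped-query attention: 32 query heads share 8 key/value heads."""
    def __init__(self):
        super().__init__()
        # No biases on any projection, as noted in the talk.
        self.wq = nn.Linear(D_MODEL, N_Q_HEADS * HEAD_DIM, bias=False)
        self.wk = nn.Linear(D_MODEL, N_KV_HEADS * HEAD_DIM, bias=False)
        self.wv = nn.Linear(D_MODEL, N_KV_HEADS * HEAD_DIM, bias=False)
        self.wo = nn.Linear(N_Q_HEADS * HEAD_DIM, D_MODEL, bias=False)

    def forward(self, x_ld: torch.Tensor) -> torch.Tensor:  # x_ld: (seq_len, d_model)
        L = x_ld.shape[0]
        q_lhk = self.wq(x_ld).view(L, N_Q_HEADS, HEAD_DIM)
        k_lhk = self.wk(x_ld).view(L, N_KV_HEADS, HEAD_DIM)
        v_lhk = self.wv(x_ld).view(L, N_KV_HEADS, HEAD_DIM)
        # Repeat K/V so that each group of query heads shares one KV head.
        reps = N_Q_HEADS // N_KV_HEADS
        k_lhk = k_lhk.repeat_interleave(reps, dim=1)
        v_lhk = v_lhk.repeat_interleave(reps, dim=1)
        out_hlk = F.scaled_dot_product_attention(
            q_lhk.transpose(0, 1), k_lhk.transpose(0, 1), v_lhk.transpose(0, 1),
            is_causal=True,
        )
        return self.wo(out_hlk.transpose(0, 1).reshape(L, N_Q_HEADS * HEAD_DIM))

class TransformerLayer(nn.Module):
    """Pre-norm layer: RMSNorm -> attention -> residual, RMSNorm -> MLP -> residual."""
    def __init__(self):
        super().__init__()
        self.attn_norm, self.mlp_norm = RMSNorm(D_MODEL), RMSNorm(D_MODEL)
        self.attn = GQAttention()
        self.mlp = nn.Sequential(  # simple stand-in for the gated SwiGLU MLP
            nn.Linear(D_MODEL, FFN_DIM, bias=False), nn.SiLU(),
            nn.Linear(FFN_DIM, D_MODEL, bias=False),
        )

    def forward(self, x_ld: torch.Tensor) -> torch.Tensor:
        x_ld = x_ld + self.attn(self.attn_norm(x_ld))
        return x_ld + self.mlp(self.mlp_norm(x_ld))
```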
With these architectural designs we managed to make a model that is quite good. It was released in September last year, and it was able to beat a lot of the Llama 2 models at the time despite being very small. Okay, now let's go to the interesting bit: mixture of experts. Mixture of experts is not a new idea; it has been around forever, and a couple of fairly recent papers really highlighted its importance and the benefits you get from mixture-of-experts models. Probably the best known is Switch Transformers, the Google paper that scales transformers to trillion-parameter models with simple and efficient sparsity; it discusses at length the parallelism decisions you can make around data parallelism, model parallelism, and expert parallelism. The mixture-of-experts layer itself is an even older idea, from a 2017 paper, also by some quite famous people at Google, back when recurrent language models were still in use. The idea is that you have a gating network that decides which experts each input should be routed to. In our implementation of the mixture-of-experts layer, we're doing something very similar. The inputs get sent to a router, the router computes gating weights and picks the top two experts to give the inputs to, and after those experts individually process the inputs, the gating weights are applied to the experts' outputs, which are summed to give the result. Written mathematically: the input is x, the router multiplies it by a gating matrix, you take the top-k entries of that (top two here) and apply a softmax over them, and these are your gating weights. The output y of the mixture-of-experts layer is the sum, over the selected experts, of the softmaxed gating weight times that expert's SwiGLU feedforward applied to x.
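A minimal sketch of this top-2 routing is below. Names such as `MoELayer` and the per-expert loop are illustrative, not Mistral's released implementation; a production version would batch tokens by expert rather than loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: the same SwiGLU feedforward block used in the dense model."""
    def __init__(self, dim: int = 4096, hidden: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # up projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoELayer(nn.Module):
    """Top-2 sparse MoE: a linear router scores the experts, a softmax over the
    top-2 scores gives gating weights, and the two selected experts' outputs are summed."""
    def __init__(self, dim: int = 4096, hidden: int = 14336, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(n_experts))

    def forward(self, x_ld: torch.Tensor) -> torch.Tensor:  # x_ld: (n_tokens, dim)
        scores_le = self.router(x_ld)                        # (n_tokens, n_experts)
        top_w, top_idx = scores_le.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        gates_lk = F.softmax(top_w, dim=-1)                  # softmax over the top-k only
        y_ld = torch.zeros_like(x_ld)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    y_ld[mask] += gates_lk[mask, k:k+1] * expert(x_ld[mask])
        return y_ld
```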
Okay. A sparse mixture-of-experts model sits quite near the cost/performance Pareto frontier, because each token doesn't have to go through all of the network's parameters; it only goes through the active parameters for that token. This mixture-of-experts model can outperform Llama 2 70B with about five times faster inference. We see it as something of a drop-in replacement for GPT-3.5; it also masters quite a few European languages in addition to English, and it gracefully handles contexts of 32K tokens. The model was released under Apache 2.0, so you're free to use it for commercial purposes. I've also attached the configuration for the Mixtral model here, which you can compare back to Mistral 7B to see the differences. Okay. So what does MoE-ifying the MLP layers actually give you? The conventional wisdom is that the MLPs in a transformer store knowledge, while the attention implements algorithms and reasoning. So by MoE-ifying the MLP layers we should get a boost in knowledge, and indeed, when we benchmark Mixtral 8x7B against Mistral 7B and a couple of the Llama 2 models across different task categories, you can clearly see that on knowledge-heavy tasks such as MMLU, the Mixtral 8x7B model, shaded in yellow here, does a lot better than Mistral 7B. On reasoning and comprehension tasks it also does a bit better, but not by as much as on knowledge and MMLU. Here is a different plot: the horizontal axis is the number of active parameters and the vertical axis is performance on various tasks. The round dots are the series of Llama 2 models and the orange dots are the Mistral models, from Mistral 7B to Mixtral 8x7B. Mixtral 8x7B uses only 12.9 billion active parameters, close to 13 billion, yet it does quite a bit better, especially in the knowledge category; the jump from Mistral 7B to Mixtral 8x7B there is huge. There are improvements in the other categories too, but the knowledge improvement is really striking. Okay, so here is a research question: if MoE-ifying the MLP layers of your language model gives you a huge boost in knowledge, how about MoE-ifying the attention layers? This is not a new idea either; it appears in 2022 in the Switch Transformers work, where the idea is to replace the trainable QKV matrices with switch layers. So instead of doing a standard matrix multiplication, you do an MoE: you have a gating layer and then some dense layers for the QKV matrices. The problem at the time was stability: if you use BF16 precision to train these models, they can sometimes diverge, whereas with FP32 they don't, and you get a nice performance boost. But at present we're mostly training with BF16, so we have to find a way to make it work: maybe better stability techniques, maybe some normalization, or maybe a different way of MoE-ifying the attention layers. This is an open research question, and I'd really like to see research done here. Next I want to address several myths that seem to circulate around mixture-of-experts models. Myth one is that there are eight experts in Mixtral 8x7B. This is probably our fault for naming the model 8x7B. The truth is that every transformer layer has eight experts, and these experts are permutation-equivalent: the gating network decides how much weight to give to each expert and picks only the top two, so it doesn't matter how you permute them, they give you the same result. That means instead of eight experts, you actually have 32 layers times eight experts in total, and they are relatively independent across layers. Myth two is that there are 56 billion parameters in Mixtral 8x7B. But the gating and attention layers are shared; they are not part of the "times eight". In fact there are only 46.7 billion parameters in total, and each token actually only sees 12.9 billion active parameters, not 14. Okay. Another thing that seems quite popular these days is to compare everything cost-related using the number of active parameters, as if cost were proportional to them. That's not quite right: Mixtral 8x7B has fewer active parameters than Llama 2 13B, but expert routing means you need to send tokens around to different experts quite dynamically, so you have more communication cost; you can't just pre-program which token will be sent to which expert. While you gain much more in performance divided by cost, the absolute cost is not proportional to the active parameter count. Usually the actual cost of serving a mixture-of-experts model is a bit more than an equivalent dense model with the same number of active parameters. Okay. Here's another really interesting research question: how do you balance loads at inference time? Your gating layer might still produce unbalanced loads at inference time, which makes inference slower. Ideally you want your eight experts, or however many experts you have, to handle the same number of tokens, so that you don't always have to wait for the slowest expert. Some nice ideas around this include mixture-of-depths and loading based on routing scores.
Some of the ideas are: if you saturate one of the experts, maybe you can find a neighbouring expert that is not saturated and give the token to it instead, and in that way slowly fill up the total expert-times-token budget you have. Okay, another quite open research question is how you can compress SMoEs. This comes from a screenshot someone took of Tim Dettmers, because I trust this man's model compression skills with my life. Tim Dettmers said he thinks we can compress a mixture-of-experts model to smaller than 4 gigabytes. MoE compression is quite different from dense transformer compression: regular transformers are really, really difficult to sparsify, and that is especially true for the feedforward MLP layers, but MoE layers are quite different, so maybe there's something you can do with MoE layers such that compression is much more efficient. By compression here I mean sparsification, not in the sense of choosing only the top two experts out of eight, but in the sense that a lot of the parameters might not be doing much work and you could share or prune them. This was said by Tim Dettmers in December last year, and I don't think I've seen much convincing work on sparsifying MoEs to an extreme, but I'd really like to see more work going on here. Okay, now I want to talk about how to interpret mixture-of-experts models. A mixture-of-experts model with a sparse gating layer actually provides an incredible opportunity, because it gives you discrete gating signals. Deep neural networks are traditionally very hard to interpret: all the weights and activations live in very high-dimensional spaces. The attention in transformers potentially offers some interpretation opportunity, because you can look at how much attention weight gets assigned to which tokens; you're essentially peeking at what the model is looking at through the attention scores. But this also quickly gets messy, because you have a lot of attention heads, which are hard to visualize and hard to interpret. The gating layer in SMoEs potentially tells you a bit more, because it tells you which expert is looking at which tokens. So can we make some sense out of this? In our paper, we tried to find out whether there is some sort of domain specialization in these experts. We took the validation splits of the Pile dataset; these are shown in different colors: the red is arXiv, the green one here is GitHub, the blue one is PhilPapers, and the purple one is Stack Exchange. For layers 0, 15, and 31, which correspond to the shallowest layer, a middle layer, and the deepest layer (layer 31 is the layer just before decoding), we plot how often each expert is selected. Since we have eight experts, the random selection chance is 12.5%. At layer 0, the layer closest to the raw tokens, the distribution is quite uniform; not perfectly uniform, but quite. This might be telling us that this layer is too shallow, still doing a lot of syntactic work, so you don't get a lot of meaningful domain specialization. The middle layer is potentially where you get a lot of semantic information.
You can notice that expert 3 here is doing something quite interesting: it doesn't get selected very often for any of the other categories, but it gets selected quite a lot for DM Mathematics, with GitHub in second place. So maybe this expert, expert 3 in layer 15, is doing something math- and code-heavy, but this is quite speculative and we cannot really conclude much from it. For layer 31, the distribution goes back to close to uniform again, although you get some spikes here and there that seem to specialize in mathematics; keep in mind that, given the dataset, DM Mathematics is more like arithmetic. Okay, we also did some analysis on consecutive tokens. The question here is whether two consecutive tokens get assigned to the same expert. First we look at the first choice: if a token is given to expert i, does the next token also get assigned to expert i? We do this analysis on the same three layers, layer 0, layer 15, and layer 31, corresponding to three different places in the network, and the validation set is still the Pile validation split. Random assignment would give you 12.5% in terms of how many of the tokens assigned to expert i also have their next token assigned to expert i. We can see that at layer 0 this is slightly higher than random, and at layer 15 it is significantly higher than random: almost double the random chance that the gating layer's first choice assigns the next token to the same expert as the current token. When you go to the last layer, this trend regresses a little; it is still significantly higher than random, but a bit less than at layer 15. We also looked at the first or second choice: if this expert is not chosen as the best expert to route the next token to, could it be the second best? Here the random-chance probability is around 46%. You see the same pattern: at layer 0, the rate at which the next token is assigned to this expert as the first or second choice is just slightly higher than random, at layer 15 it is significantly higher, and at layer 31 it regresses a little. So I think there is certainly more detailed analysis to be done here about what can be concluded from these mixture-of-experts models. We also tried to visualize, on three examples, which experts actually selected which tokens. The first example is from the GitHub code for the MoE layer itself, the second is simple arithmetic questions, I think from DM Mathematics, and the third is a simple multiple-choice question. This is for layer 0, layer 15, and layer 31, and you can see that there doesn't seem to be much specialization. The colors represent different experts. Here all these digits get assigned to the same expert, but that seems to be about it; we don't see a very clear distinction between which experts handle which tokens. And that relates closely to myth four, which is that you want experts specializing in domains.
Let's say you have two experts specialized in coding; that means whenever you have code, they handle all the tokens. Then what do the other experts do when you're coding? When you want to generate a lot of code, you'd only have the chance to route all your tokens to these two experts, and the other experts would just stand there and do nothing. You actually want all the experts to be fully engaged all the time to maximize your inference efficiency. Language is also so complex that manually specifying the domains the experts should specialize over seems like a simplification to me, because there might be a lot of underlying features that the experts actually specialize over that just aren't visible at the very high domain level. Then there's a treasure hunt, and this is why we love open source so much. Around 24 hours after our Mixtral 8x7B release, someone on a Chinese website found out that there's one expert in one of the layers that is particularly crucial. What they did was remove the i-th expert from the layers and look at the effect; the vertical axis here is the MMLU score. It turns out that if you remove the third expert, everything just collapses: the MMLU score is 0.63, and that's 0.63%, not 63%. For the other experts, if you remove the i-th expert, the MMLU score drops a little but not too much. They made a meme about it: expert three seems to be doing all the work while all the other experts stand around doing not much. Okay. Another really important research question here is how we interpret the mixture-of-experts decisions and what features the experts are learning. The experts might capture features that are very different from what we perceive as concepts, and it might be more efficient to represent linear combinations of concepts as long as they span the same subspace. How do we recover that subspace, or recover some set of concepts we can actually understand, so we know what the underlying concepts they are learning really are? Okay, just to conclude: sparse mixture-of-experts models leverage sparsity to gain a lot more knowledge; you can train very good mixture-of-experts models to be quite efficient at inference; and expert specialization is not as straightforward as one might think. There's a ton of architecture and interpretability research left to do. I also just want to plug Mistral AI, co-founded by Arthur, Timothée, and Guillaume. We have offices in Paris, London, and the San Francisco Bay Area (it's actually in Palo Alto), and 500 million in funding. If you enjoy solving the research questions I just mentioned, doing open-source AI, or just generally empowering people with language models, you're very welcome to join us. Thank you very much. speaker 1: If any of you have questions, feel free to come up and ask. speaker 3: I recently read the paper from Meta; they are using a dense model, not this one. Page 28, I think. So it feels like the other direction, the opposite way. Why did you choose the sparse mixture of experts rather than what other people do? I see most of them using dense models to do things like an edge deployment with a local database, in a confined scope.
You can have more knowledges in a confined scope. That's I see the train goes to that way. For example, you go to the Egypta museum here in I in Sunny where somewhere, right? So you can have the chagpt, have the local knowledges, which you cannot get it from the Internet, but you can only cooperate with the museum and get it. I think the intensive model allows this kind of more domain knowledge, like going deeper. But now here you are using the spas model with mixture of experts. To me, it's the opposite. Can you explain why? Yeah, sure. speaker 2: So if I understand the question correctly, the question is asking, why are we choosing sparse mature experts models instead of dense models? Are there's no sort of sparse sity involved. So I think for ash devices, dense models are have a lot of potential because for sparse mixture of experts models, although you have a sparsity in inference time, so that means a token does not have to see all the parameters, but rather just a small fraction of it, you still have to load all of the experts into memory. And that might pose a huge challenge for adices. Like your phone might not have like 200 gabytes of memory in order to inference. You order to run inference very efficiently. But for like I guess for data centers or for people who are serving these models, it might pose a advantage in terms of inference cost because you can have where you can have very good performance sladivided by cost ratio. speaker 3: I see. So your point is because there's not enough memory at the edge. So for the edge, it's better to use sparse model rather than densive model? speaker 2: Oh, sorry, no. I mean, for AI, I think it's good to use dense models because because for sparse models, you still have to load . speaker 3: everything into each memory. Yeah, Yeah, Yeah. So so your your use cases is more for like closer to the center of the cloud is swiming. speaker 2: Yeah. Yeah. Okay. speaker 3: Got it. Thank you. speaker 1: Thank you. Does anyone else have a question theylike to ask? Yes. Do . speaker 2: you . speaker 1: want . speaker 4: to. speaker 5: I have a question in regards to fine tuning. We've been working with multiple models, Gemini, chargpt, llama. And there is a big problem with fine tuning these models, especially on tensor, on images. They're having issues with understanding. What can you say about that? speaker 2: You mean in general fine tuning language models on visual tasks? I guess. Okay. I guess that's a very broad question. I can what I can say about that is there's I guess for ChatGPT and Gemini, you don't really have the model Wein your hand. So you are basically relying on OpenAI or Google to do lower for you, right? So you get I guess you have less control over your fine tuning. Whereas for an open source model or guess open waste model, you actually have waights in your hands. So you can do you can do laa, but you can do full fine tuning. You guess you get more control. And Yeah, I guess that's in general like what I can say about like fine tuning open source models versus versus very closed source models. But I guess that really highly depends like exactly what happens when you're doing fine tuning really depends . speaker 5: on your use case and your data. Yeah, I mean, I'm sorry, it was a little groso. 
When we try start transferring them into a large a large datset, let's say like 10000 images or 5000 images, so and you have to transfer them into textual format because none of these models are actually tuned, designed to be fine tuned on images. They don't understand it. So is this something you see in development in open source models? speaker 2: So by translating these images into text, do you mean like ocr or is is that something digital format into the textual format? speaker 5: So they do understand them in digital as an image. If you upload them into like text text line, right? Like you go, you charge GPT, but when you start feeding them through apis, you need to transfer them. So then becomes a problem. Like, is there any way to overcome this issue with open source models? speaker 2: I guess I'm not I guess I'm not entirely sure what the root cause of this problem is, but I think I'm sure that with open source models you can actually, well, you can see like what the waare doing, right? So you can actually tell what are like. You have more control of the waste basically. So I think open weight, open waste models definitely post an advantage here. speaker 1: Thank you. Thank you for the question. Does anyone else have any other questions in person? speaker 2: Yeah. I guess I was curious like where you think the improvements of mixture of experts are like coming from? Like I've I've heard that you can like put different experts or different like fefour networks on different GPU's and that that like allows it to be more parallel and like speed things up from like a compute perspective? Or is it like more of the sparsity itself that is like giving these improvements? I guess is there any insight on that? I guess there are there are two dimensions to this, right? So in terms of improvement, the one dimension is performance. So as I said, the mixture of experts models have these moe molp layers. So imagine your original molp layers and you make it eight times wider, you will be able to store a lot more knowledge into these layers because you have more parameter counts. So that's, I guess the on ner performance dimension that actually you can actually store a lot more knowledge into these models. And on the performance, sorry, and the other dimension is the inference efficiency dimension, right? So so definitely go read the switch transformer paper where they discuss the okay, I guess that's training efficiency, but they really discussed in quite a lot of details about data parallelism, model parallelism and experort parallelism and how they affect the efficiency. What's the communication cost there? What's the best way to triyour models? And for these sparse mixture of experts models, since you only so you only select 13 billion parameters for each token at in princetime, you are actually trying to select the most relevant parameters for each token. So that can make inference quite a lot quite a lot more efficient. I hope that answer your question. Yeah, no, that definitely did. I guess I was also wondering like if you've seen like the mixture of depth paper and like that as like a way of sparsity is, do you have any thoughts on that? Yeah, I think sparsity sparsity helps when you can do like sort adaptive computation. So the mixture of depth model is like the best example of the adaptive computation, which means for different tokens, for predicting different tokens, you want first different parameters ters to engage. 
So in the mixture of depth paper, they selected different number of parameters to engage in calculating each token. Any hour paper, we selected different like the same number, the same number of parameters but like different experts. So it's the the difference is I guess quality versus quality. But there are like two different dimensions you can optimize over. You definitely want have to select the most relevant parameters and as few parameters as possible for decoding on Yeah. Okay. Yeah. I think those all my questions perfect. speaker 1: Thank you. We'll take one more in person question if any of you all have anything. If not, we can turn it to the online questions. Yes. speaker 2: Hey, Albert, thanks for the talk. Just one in the earlier you like the beginning of the presentation you mentioned like routing and communication cost. And I was wondering if you could talk about how that scales relative to the number of experts you have in parameters. Oh, Yeah, sure. So so that depends on like first number, the number of experts you have and also depends on like how large each expert is. So that that sort of can change your the way we do parallelism. So we know that communication is really expensive where you need to go from one GPU to another, and it's more expensive if you want to go from one note de to another. And if you have a huge ton of experts that will not fit into just one node, then you really incur a huge quite a big communication cost. And like how to scale beyond that, I guess, is a very open cenfic questions. But essentially, the communication cost you incur is roughly proportional to the number of token routing, first between GPU's and then between nodes. Thank you. speaker 1: Thank you for all the questions. We'll have Stephen introduce some of the questions that were submitted by the people joining us on zoom. speaker 2: Hey. speaker 4: can you hear me? Yeah. speaker 2: thanks for the great talk. speaker 4: Thanks. All right, so we have some questions through zoom and slidle. I'm gonna pick and choose some of them starting with some zoom questions maybe. So we have a question here about any comments on why potentially lama three is not using a mixture of experts. A lot of people suspected they might after mistress success. I don't know if you would know anything about that. speaker 2: but Yeah, I don't know. Just ask . speaker 4: the lot. My people. Sorry. Right. Let me see. Here's A D question. Do you foresee mixture of experts techniques being incorporated into the other large foundation models? Or will it remain a subset of models that are best for certain use cases? speaker 2: Oh, certainly. So I think as my answer to the first like in person question was for as devices, probably you probably want to stick with dens models because of the memory constraints. And for a mixture for experts models, you're really getting a lot in efficiency if you can serve it as scale. So if you have high batch sizes, you will so you will potentially be doing better in terms of throughput than the denmodels. So Yeah, it definitely depends on like depends on a use case. Usually the larger the scale, the better. Well, I guess the more the benefit of the mixture paraverse model shine. And while speculated that you before is a mixture backverse model, if you haven't heard already. speaker 4: can make sure experts models outperform domain specific models at their respective tasks with experts. 
speaker 2: So well, first, usually models that have been trained for particular domains are really hard to beat. And I think that's where also where the purpose of continuous prtraining and fine tuning comes in. So if you let's say you first portrying a general tax model and then you want to adapt it to the medical domain, so you grab a lot of medical data and then you continue portraying or you're funon that, that model is going to be really hard to beat. And I think the mixture of experts model, the experts don't really, at least in our case here, the experts don't really focus on like traditional like domains as we know it. We don't have a medical expert. We don't have a coding expert, rather they just try to that they try to they have concepts encoded in a very non intertable way. So I wouldn't say that just taking one mixture practice first model, which is outperform all the other like domain focus models. speaker 4: right? That makes sense. Here's a question. Has there been any study on whether the moe layers should be both at the early layers and deep layers for old neural network ensemble methods? Late fusion tends to work better than early fusion. speaker 2: So okay, I think that's a really great question because I think from the inception from the neural network, we are trying to adopt this, trying we sort of have this tradition of having layers that are that look exactly the same as at each other you have. And so then you can try to focus on just designing this layer to be as appropriate as possible and then you can just copterpaste that a couple of times. But I think I've recently seen a paper, sorry, I don't remember the name, but I think they're just trying to put layers in random orders. So sometimes it can't be attention first, sometimes it can be moe first, and it can have also some other wacky layers. But I think that forms quite well. It's unclear to me like why that is. It could be just chance. But I guess I guess if you wanted to design a generally really, really performing model, it's really safe to have the same architecture for every layer. So Yeah, so I guess that's I guess the safest choice. You can also do some sort of newer architecture search in order to your sorry, optimizer architecture, but that's which is a much more principled approach that's guessing that this layer might work better. speaker 4: All right, thanks for the answer. Asking could you talk a bit more about how different the eight experts are built? Are they just fine tuned on different data sets . speaker 2: so they are trained on roughly the same data sets? But I'm not sure if I can say more of that. But Yeah, sorry. speaker 4: And here's a question. Some work has suggested that learning to doesn't perform any better than simply using, for example, a random mapping of inputs to experts. Do you have any thoughts? speaker 2: I think that I would definitely need to see the paper first, but I wouldn't be too surprised if that's the case. I would still be surprised by not like extremely surprised because if you if you think about mofying the molp layers as augmenting the knowledge capacity, then it seems that you can potentially just do do that in a very rote force way and just readily mapping the token sorry to random experts. I would still think that the gating has the advantage of you know of being able to choose experts more intelligently than just randomly. So Yeah, I don't know that paper, but I would love to see it. 
speaker 4: Someone was asking, could you speak a bit about the general development process from your perspective for this model? What aspects, design choices, hyper parameters and so forth. Did you try that did not work before arriving at this specific architecture? speaker 2: So we know so for this particular architecture, the mixture of experts architecture, we kind of know that it will work because there having a lot of successful papers beforehand in terms of like what sort of things you need to consider. I think one really good practice I guess for companies that of as like a single developer is to always you always want to take inference needs. You you you want to think about inference needs before you design the architecture for a model. You don't want to so for example, you don't want to have an have a model that's what just just slightly exceeds like what a single 888 giabyeight 100 can contain. And then you basically lose a lot of efficiency from just like being like overhead being just a little bit more than 80 gb. So I think in terms of design, before training the model, you definitely want to consider how to infer with this model. And in terms of the hyperparameters, you definitely want to do some sort of scaling before you some sort of scaling law, I guess, search and analysis before a train model to make sure that model is served at the best performance cost ratio. So Yeah, that's it. Great. speaker 4: Speaking of a GPU memory, someone is asking what is the inference runtime GPU memory footprint for the seven b model versus the eight by seven b? It would be great to understand this for specific applications, especially if mistrual has the potential to run on edge devices with sufficient GPU of your m. speaker 2: Sorry, I don't think I got the last part of the question. speaker 4: So ads devices and GPU, they're just asking what is the approximate inference runtime in terms of GPU memory footprint for the seven . speaker 2: b model versus the a by seven b? Okay, I see. So if you do like if you do very naive. If you do it very straightforwardly, that should load for the eight times seven p to load all the experts in, and then the experts always stay in GPU memory, then the difference between the a times seven p and the seven p will be the same as the ratio between 46.7 to seven. That's the memory requirement. But at any printime, since the mixof experts model only has 13b active parameters, so each gets to see much fewer parameters than there are in the memory. And you can do some interesting things like like I think cpu kind of forgot the name for it, but you can keep some of the experts in cpu and then only load them when you need them. So I think that's something you can do as well. But then you also lose some efficiency because you need to constantly transfer prompters from cpu to GPU and vice versa. Right? speaker 4: That makes sense. A speaking parameters. Someone asked, do you have a rough rule of thumb for how to approximate how capable an moe model would be in terms of equivalent dense parameters? For example, I've heard that calculating the geometric mean of active parameters to total parameters is a good way to do this, which would mean that 13 billion active parameters with 47 billion total parameters would be equivalent to a 22 billion parameters dense model. I'm wondering how accurate that would be. 
speaker 2: So I think I think that's a good rule of thumb, although that definitely depends on like how well you train these models, like and how many tokens do you put through them and how good are those tokens if all other things being equal, I think that's a very good rule of thumb. speaker 4: All great. And I'm just looking at some of the upvoted slide. Do questions. Someone asked for the treasure hunted expert. Was there a reverse experiment of removing all experts except number three? speaker 2: Oh, that's a that's a great question. I think Yeah, if you can figure that out, I'll be really like I'll be all years. I'll be really happy to hear that. So so then then you are just trying to I guess you're trying to trim the a times seven b back into a seven b, which might be an interesting experiment to do. speaker 4: Right. Someone asks what is your intuition about why the eight by seven b is significantly better and recently compared to the seven b? Are they learning better internal algorithms? speaker 2: That could be the case, although I think that's very speculative. I think. Okay. So I think for a lot of these benchmarks, for example, the even for the math benchmark, the the like what you do that like what sort things count as knowledge and what sort of things count as reasoning is quite ambiguous. For example, if you're are doing a math task and you can reason to get something out, or you can just recall, Oh, here's a land, I can just use that. And so I think there's like high ambiguity between what sort of things count as knowledge and what sort of things counts as reasoning. So for this particular example, the a times seven b definitely gains a lot more in knowledge. But does the gaining knowledge also induce some sort of change in its reasoning capability? I don't know. But I think that that's just like whatever I can say about is like highly to speculative, but I would say the benchmarks themselves are also very ambiguous. So I think it's worth finding out what exactly great. Another question about . speaker 4: GPU memory. Someone sitting in production, we're constantly bounded by GPU memory instead of compute. Will mixture of experts make it more challenging to serve any extra cost besides communication overhead? speaker 2: So in purpose of cost, I think communication like Yeah so so when you say you founded by GPU of memory, I assume that it's like it's like comparing the a times seven b to a seven b, then yes, if you need to load all the experts into the GPU memory, then Yeah, you're gonna to consume, you're going you're going to use a lot more. So the benefit or the advantage of the a times seven b over the seven b is that or let's say over equivalent dense models with similar active parameter count is that at high batch size, you'll get more therpoints because you'll get different experts handling the tokens set at the same time. So I'd say yes, mixer of experts model definitely gives you a little bit more trouble in terms of serving, but if you have high volume, it's definitely worth it because you get much more efficient in processing these tokens. speaker 4: Okay, great. Here I'll ask a few more since we do have some extra time, but do let me know when you might need to leave. So someone is asking specifically about the architecture they want you to clarify, does each layer include attention, routing and experts or layers of attention, then routing, then experts? speaker 2: Okay, let me try to find out. 
Yes, so so the architecture of mixtrel a times seven b is exactly the same as mistreal seven b as I've shown here. So in mistreal seven b, you first do for free each of the input. You do normalization, you do attention residual norm mlp residual. So it's attention followed by mlp. Okay? And the only difference is in the mlp layer here. So instead of so after the attention layer and after the the attention residual, instead of doing molp, you doing moe molp here. So here, instead of just using one or one matrix to process these hidden disnating representation, you use eight experts. Well, you choose two out of eight experts to process that. speaker 4: Right, that makes sense. All right, let's see. Give . speaker 2: me . speaker 4: a moe. Introduces extra load balancing losses and loss function discontinuing discontinudid you run into any difficulties while training due to these complications? speaker 2: Yeah, that's a really great question. So I think to try to clarify with the question, my understanding is that when you're training mixture of access models, you definitely want each of your experts to be quite balanced in the sense that they handle a similar amount of tokens such that you not like I guess I was waiting for the slowest expert or the expert that handles the most. So yes, you definitely need to do something to make the load balance a bit better in training, but we didn't run into any big troubles during training this model. speaker 4: And here's a question about reg. Here's to hear when an moe approach would be preferred to a rag approach in a given domain, pros and cons, etcetera. speaker 2: I think these are a sog mso. You can use an moe model to do rc. Like so I don't see why like I don't see why there should be a conflict between the two. Yeah. So so like you can do desmodel no rac desmodel rag moe no rag moe rag. speaker 4: right. And someone asked, can you potentially swap out one expert and insert a domain specific expert that wasn't trained on the same training data set like a customizable or modular moe? speaker 2: I see. I think that's possible given that you also you also need to so after you swap out one of the one of the experts and replace it with a domain specific expert, but you after after you do this swapping, you definitely need to train this model a bit more such that the gaating layers know how to, such the gleaders yers know how to route the let's you have a medical expert swapped in. You need to train your glayers a little bit such that knows how how to handle medical situations and to basically treat this replaced expert quite differently. So I think it's only Olly possible in very exciting research direction. They're they're being quite interesting research ched on like merging and swapping and Frank emerge these models. So I think that's certainly something very exciting. I'm looking forward to it. speaker 4: Okay, great. I'll ask two more zoom questions and thatbe it. So someone asked during training is make sure of experts less computationally intensive since the gradients would always be seven b, how do how do the gradients back propagate through the routing that . speaker 2: so if I go to the routing layer, you will see that the pub so all the operations here are differentiable. So no, there's no, I guess like discrete operations such that the gradient just stops there. So this this thing is entirely like Anto differentiable. Sorry, what's the first part of the question? Again. speaker 4: asking if training moes is less computationally intensive? 
speaker 2: So the cost of training moes is roughly proportional to the number of active parameters you have. And so it's roughly equivalent to training a 13b, but you incurr some actual communication cost. speaker 4: Okay, great. Last question. Let's talk about even bigger moe models. Even if we're moving away from the piiretal frontier like those eight by 22, eight by 30 or even eight by 100 plus billing parameter models. Do you see any further serving challenges when one GPU cannot even hold a full expert after heavy quantization? speaker 2: I see. Yes. So I think if you have one GPU, I think I would go for a dance model or a heavily quantized mixture of experts model and about like what you said about like having more module experts, like having 128 experts, I think that's something super exciting because that basically allows you to specialize your basically allows your experts to specialize a bit more. And that's how it's exciting. You can always reduce the number of you can always make the experts specialize better and you can pick, but basically allow giving your gating layers more power to choose like which experts are the best for this token. I think for serving purposes, that will definitely make things very, very hard, even like even after heavy quantization, if you have 128 experts, then and potentially you were talking about ti like having the model model in multiple notes, that will make, I guess, both the make the implementation harder and also the communication cost higher. So that's something you probably want to leave to a model provider through api, such like serving it yourself. But I think for like misure backstormodel, make sure a times seven b is definitely after quantization is something very redoable, almost single GPU. speaker 4: All right, great. Thanks for answering a bunch of questions from very curious folks. So thanks again, Albert, for the amazing talk and the time today.
Latest Summary (Detailed)
Executive Summary
Albert Jiang (Mistral AI / University of Cambridge; main research areas: language model pre-training, reasoning, and language models for mathematics) gave a detailed introduction to Mixtral 8x7B, a sparse mixture-of-experts (SMoE) language model. Built on the Mistral 7B architecture, the model replaces the MLP (feedforward) block in every layer with 8 independent expert networks. During inference, at each layer a router network selects the two most relevant experts for each token and combines their outputs with gating weights. Although only two experts (about 13B parameters) are active per token, different experts can be selected at different positions, so at inference time the model draws on a subset of its roughly 46.7B total parameters. This preserves high inference efficiency while significantly improving performance, especially on knowledge-intensive tasks. Mixtral 8x7B runs inference about five times faster than Llama 2 70B, handles a 32K context length, and is released under the Apache 2.0 license.
The talk cleared up several misconceptions about Mixtral: there are not just 8 experts (there are 8 per layer across 32 layers), the total parameter count is 46.7B rather than 56B, and inference cost is not simply proportional to the active parameter count because of additional communication overhead. Albert Jiang also discussed the interpretability of MoE models, pointing out that experts do not split work along human-understandable domains (such as a "coding expert"); their specialization is more complex and lower-level. Preliminary analysis shows that expert activation patterns differ across layers and data types, for example one expert in a middle layer appears more strongly activated on mathematics and code, but overall the division of labor among experts remains unclear. The talk also raised several open research questions, including MoE-ifying attention layers, load balancing at inference time, compressing SMoE models, and understanding expert routing decisions in depth, and emphasized the importance of the open-source community in driving this research.
Talk Overview
Speaker: Albert Jiang (Mistral AI / University of Cambridge); main research areas: language model pre-training, reasoning, and language models for mathematics.
Topic: Demystifying Mixtral of Experts
Key content: the architecture, performance, design philosophy, common misconceptions, interpretability analysis, and future research directions of the Mixtral 8x7B model.
Model Architecture
Review of the Dense Transformer Architecture (Mistral 7B)
Albert Jiang first reviewed dense transformer models, using Mistral 7B as the example:
* Key features:
* Grouped Query Attention (GQA): many more query heads than key/value heads, sitting between multi-head attention and multi-query attention.
* Sliding Window Attention: tokens in the lower transformer layers attend only to a short span of preceding tokens; information propagates further as depth increases.
* Design choices: none of these designs are new; they appeared in work from 2019-2023.
* Transformer layer implementation details (PyTorch-style pseudocode):
* Example inputs: sequence length L = 8008, query heads nh = 32, key/value heads nkv_heads = 8, head dimension d_head = 128.
* Model dimension d_model = 4096 (from nh = 32 and d_head = 128).
* MLP intermediate dimension ffn_hidden_dim ≈ 14k (the speaker mentions "the latent dimension is 14k", which corresponds to intermediate_size = 14336 in the Mistral 7B configuration).
* Core components:
1. Initialize the query (Q), key (K), and value (V) matrices (no bias).
2. Output matrix (O).
3. Attention forward pass:
* Compute Q, K, V.
* Apply Rotary Positional Embedding (RoPE) to Q and K.
* Standard attention computation (repeating K and V to match the Q head count).
* Multiply the result by the output matrix O.
4. Transformer layer structure:
* input -> RMSNorm -> attention -> residual connection
* -> RMSNorm -> MLP -> residual connection
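Written as equations, the pre-norm residual structure above amounts to (a compact restatement of the two steps, nothing new):

$$h = x + \mathrm{Attention}(\mathrm{RMSNorm}(x)), \qquad y = h + \mathrm{MLP}(\mathrm{RMSNorm}(h))$$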
Sparse Mixture of Experts (SMoE) - Mixtral 8x7B
- Core idea: MoE is not a new concept; work such as Switch Transformers has demonstrated its potential. The key idea is to route inputs to specific expert networks via a gating network.
- Mixtral 8x7B implementation:
- Architecture: identical to Mistral 7B, except that the MLP (feedforward) block in each layer is replaced by 8 independent expert feedforward blocks.
- Routing mechanism:
- The input token x goes into a router network.
- The router computes gating weights and selects the top 2 experts.
- The selected experts process the input independently.
- The expert outputs are summed, weighted by the softmax-normalized gating weights.
- Mathematical form (spelled out more fully just below): y = sum(softmax(top2_gating_weights) * SwiGLU(x, expert_i_params))
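Written out more explicitly, with $W_g$ denoting the router's gating matrix (the notation is assumed here; the talk describes the router only as a matrix followed by top-2 selection and a softmax):

$$y \;=\; \sum_{i \,\in\, \mathrm{Top2}(x W_g)} \mathrm{softmax}\big(\mathrm{Top2}(x W_g)\big)_i \cdot \mathrm{SwiGLU}_i(x)$$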
- Parameters and performance:
- Total parameters: 46.7B (not simply 7B × 8 = 56B, because the attention and gating layers are shared).
- Active parameters: each token uses only about 12.9B (close to 13B) active parameters at inference time.
- Performance:
- Outperforms Llama 2 70B with roughly 5x faster inference.
- Positioned as a drop-in replacement for GPT-3.5.
- Masters several European languages in addition to English.
- Handles a context length of 32K tokens.
- License: Apache 2.0, free for commercial use.
Advantages and Performance of MoE
- Conventional wisdom: the MLP layers in a transformer store knowledge, while the attention layers implement algorithms and reasoning.
- Effect of MoE-ifying the MLP layers: intended to increase the model's capacity to store knowledge.
- Benchmark results:
- Mixtral 8x7B (highlighted in yellow in the plot) significantly outperforms Mistral 7B and the Llama 2 family on knowledge-heavy tasks such as MMLU.
- It also improves on reasoning and comprehension tasks, though less markedly than on knowledge tasks.
- Overall: Mixtral 8x7B uses 12.9B active parameters, with its largest gains on knowledge tasks.
Common Misconceptions about Mixtral 8x7B
Albert Jiang addressed several common misconceptions about Mixtral 8x7B:
- Myth 1: Mixtral 8x7B has only 8 experts.
- Reality: every transformer layer has 8 experts. With 32 layers, that makes 32 × 8 = 256 experts in total. Experts within a layer are permutation-equivalent, since the gating network decides how much weight each one receives.
- Myth 2: Mixtral 8x7B has 56B (8 × 7B) parameters.
- Reality: the gating and attention parameters are shared and are not multiplied by 8. The total is 46.7B parameters, and each token actually "sees" only 12.9B active parameters (not 14B).
- Myth 3: model cost is proportional to the number of active parameters.
- Reality: Mixtral 8x7B has fewer active parameters than Llama 2 13B, but dynamic expert routing adds token-dispatch and communication overhead. The performance/cost ratio is better, yet the absolute cost is usually somewhat higher than an equivalent dense model with the same active parameter count.
- Myth 4: experts should specialize in human-understandable domains (coding, math, etc.).
- Reality: to maximize inference efficiency, all experts should ideally stay fully engaged at all times. Language is too complex for simple domain partitioning; experts likely specialize over lower-level features that are not visible at the domain level.
Exploring the Interpretability of MoE Models
- Motivation: the discrete gating signals from the sparse gating layer offer a new opportunity for interpretability.
- Domain specialization analysis:
- Data: different subsets of The Pile validation split (arXiv, GitHub, PhilPapers, StackExchange, etc.).
- Layers examined: layer 0 (shallowest), layer 15 (middle), layer 31 (deepest, just before decoding).
- Findings:
- Layer 0: expert selection is fairly uniform (random chance is 12.5%); this layer likely handles mostly syntactic, shallow information, with little visible domain specialization.
- Layer 15: some interesting patterns, e.g. expert 3 is activated far more often on DM Mathematics, with GitHub (code) in second place. Albert Jiang speculated this expert may handle math- and code-heavy content, while stressing that this is speculative.
- Layer 31: the distribution moves back toward uniform, though some experts still activate more on mathematics-like data.
- Consecutive-token analysis:
- Question: are two consecutive tokens routed to the same expert?
- Findings (first-choice expert):
- Layer 0: slightly above the random baseline (12.5%).
- Layer 15: significantly above random, nearly double the random chance.
- Layer 31: drops somewhat, but still clearly above random.
- Findings (first- or second-choice expert): the random baseline is about 46%. The pattern mirrors the first-choice results, with the strongest association at layer 15 (a computational sketch of these repetition rates follows below).
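A small sketch of how such repetition rates could be computed from per-token expert assignments. Array names and shapes are illustrative (not the paper's code), and the second metric uses one plausible reading of "first or second choice", namely that the top-2 sets of consecutive tokens overlap, which matches the ~46% random baseline quoted in the talk.

```python
import numpy as np

def repetition_rates(top2_idx: np.ndarray) -> tuple[float, float]:
    """top2_idx: (num_tokens, 2) expert indices in [0, 8) chosen at one layer."""
    prev, nxt = top2_idx[:-1], top2_idx[1:]
    # Consecutive tokens share the same first-choice expert (random baseline: 1/8 = 12.5%).
    same_first = float(np.mean(prev[:, 0] == nxt[:, 0]))
    # Consecutive tokens' top-2 sets share at least one expert
    # (random baseline: 1 - C(6,2)/C(8,2) = 13/28, roughly 46%).
    overlap = float(np.mean((prev[:, :, None] == nxt[:, None, :]).any(axis=(1, 2))))
    return same_first, overlap

# Sanity check against random routing over 8 experts: expect roughly (0.125, 0.46).
rng = np.random.default_rng(0)
random_top2 = np.stack([rng.permutation(8)[:2] for _ in range(50_000)])
print(repetition_rates(random_top2))
```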
- Visualization case studies:
- Expert selections were visualized token by token on three examples: GitHub code (the MoE layer implementation itself), simple arithmetic questions, and a multiple-choice question.
- No very clear pattern of expert specialization emerged. For instance, digits may all be routed to the same expert, but overall it is hard to tell which experts prefer which kinds of tokens.
- The "treasure hunt" expert:
- About 24 hours after the Mixtral 8x7B release, a user on a Chinese website discovered that expert 3 is particularly crucial.
- In the experiment, removing the i-th expert and re-evaluating showed that removing expert 3 collapses the MMLU score to 0.63% (not 63%), while removing any other expert only lowers the score slightly. This sparked jokes about whether expert 3 "does all the work" (a sketch of such an ablation follows below).
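One simple way such an ablation could be implemented, sketched against the hypothetical `MoELayer` above; this mirrors the spirit of the community experiment, not its actual code.

```python
import torch

def ablate_expert(moe: "MoELayer", expert_id: int) -> None:
    """Force the router to never pick `expert_id` by masking its logit to -inf.
    Intended for evaluation only (run under torch.no_grad())."""
    original_forward = moe.router.forward

    def masked_forward(x: torch.Tensor) -> torch.Tensor:
        scores = original_forward(x)
        scores[..., expert_id] = float("-inf")  # top-2 can no longer select this expert
        return scores

    moe.router.forward = masked_forward  # monkey-patch the router of this layer

# Usage idea: apply ablate_expert(layer.moe, expert_id=3) to the layers of interest,
# then re-run an MMLU evaluation and compare scores.
```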
Open Research Questions
Albert Jiang raised several open questions worth further study:
- MoE-ifying the attention layers:
- Prior work (e.g., in the Switch Transformers line) replaced the QKV matrices with MoE layers, but training was unstable in BF16 (it worked in FP32).
- How can this be made stable under BF16? (e.g., better stability techniques, normalization, or a different way of MoE-ifying attention.)
- Load balancing at inference time:
- The gating layer can produce unbalanced expert loads, slowing inference.
- Possible directions: mixture-of-depths, and score-based dynamic loading (e.g., when an expert is saturated, route the token to a nearby non-saturated expert).
- Compressing SMoEs:
- Citing Tim Dettmers's view that a mixture-of-experts model could be compressed to under 4 GB.
- Compressing MoE layers differs from compressing dense transformers (whose MLPs are hard to sparsify); there may be much more effective compression (sparsification) methods, e.g. many parameters may contribute little and could be removed or shared.
- Understanding MoE routing decisions and the features experts learn:
- Experts may capture low-level features that differ from human-perceived concepts.
- How can we recover the latent subspaces or human-understandable concept sets that the experts learn?
Q&A Highlights
- Choosing between sparse MoE and dense models (especially for edge devices):
- Albert Jiang explained that dense models may be preferable on memory-constrained edge devices: although an SMoE activates only part of its parameters per token, all experts still have to be loaded into memory. SMoE models such as Mixtral are better suited to data centers / the cloud, where high-concurrency serving exploits their throughput and performance/cost advantages.
- Challenges of fine-tuning models on image tasks:
- Albert Jiang noted that open-weight models (such as Mistral's) give more control and transparency for fine-tuning than closed models (ChatGPT, Gemini), which helps diagnose and solve problems; specifics (such as image-to-text conversion) depend on the use case.
- Where MoE performance gains come from:
- Increased knowledge capacity: MoE-ifying the MLP layers effectively widens them, so they can store more knowledge.
- Inference efficiency: each token only computes the most relevant parameters. The Switch Transformers paper discusses the relevant parallelism strategies in detail (data, model, and expert parallelism).
- On Mixture-of-Depths (MoD):
- Albert Jiang sees MoD as a good example of adaptive computation: it selects a different number of parameters per token, whereas Mixtral selects a different combination of experts (always two). Both aim to pick the most relevant, and as few, parameters as possible per token.
- Scaling of routing and communication cost:
- Communication cost scales roughly with the number and size of experts and with the amount of token routing across GPUs and nodes. When experts are too many or too large to fit on one node, communication cost rises sharply.
- Why Llama 3 doesn't use MoE:
- Albert Jiang said he could not answer and suggested asking the Meta team.
- Will MoE become mainstream or remain limited to certain use cases:
- Albert Jiang reiterated that edge devices may favor dense models, while large-scale, high-throughput serving best exploits MoE's advantages. He noted GPT-4 is widely speculated to be an MoE model.
- Can MoE models beat domain-specific models:
- Models trained or fine-tuned on domain data are usually hard to beat. Mixtral's experts are not partitioned by traditional domains (medicine, coding); their specialization is lower-level and more abstract, so an MoE is unlikely to beat specifically optimized models across every niche domain.
- Placement of MoE layers (early vs. deep):
- The convention is to use the same structure in every layer. Some research tries different structures per layer, but keeping layers identical is the safer design choice; neural architecture search would be the more principled approach.
- Mixtral's MoE sits in the MLP part of each transformer block, i.e. after the attention layer.
- How the 8 experts are built (are they fine-tuned on different datasets?):
- Albert Jiang said they are trained on roughly the same data and could not disclose more.
- On the claim that learned routing is no better than random routing:
- He would need to see the paper, but if the goal is brute-force knowledge capacity, random routing might work to some extent; he still believes intelligent gating has an advantage.
- Development process and design choices that did not work:
- Albert Jiang stressed considering inference needs (e.g., GPU memory limits) before designing the architecture, and running scaling-law analysis before training to optimize the performance/cost ratio.
- Inference memory footprint: Mistral 7B vs. Mixtral 8x7B:
- Roughly, 8x7B needs memory for all 46.7B parameters while 7B needs only 7B; even though only about 13B parameters are active per token, everything is loaded by default. Techniques like CPU offloading can manage memory at some cost in efficiency.
- Rule of thumb for the dense-equivalent size of an MoE (e.g., the geometric mean of active and total parameters):
- Albert Jiang considers it a decent rule of thumb, all else (training quality, data volume) being equal.
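For concreteness, with the rounded 13B-active / 47B-total figures the geometric-mean heuristic gives

$$\sqrt{N_{\text{active}} \cdot N_{\text{total}}} = \sqrt{13 \times 47}\ \text{B} \approx 24.7\ \text{B},$$

so the questioner's figure of roughly 22B presumably comes from slightly different parameter counts; either way it is only a rough rule of thumb, as the answer notes.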
- Reverse experiment for the "treasure hunt" expert (keep only expert 3, remove the rest):
- Albert Jiang thought this is a great question and would be an interesting experiment, effectively trimming the 8x7B back toward a 7B-like model.
- Why 8x7B is so much better than 7B (is it learning better internal algorithms?):
- Largely speculative. Benchmarks blur the line between knowledge and reasoning; 8x7B clearly gains knowledge, but whether that also improves reasoning is unclear.
- GPU memory challenges of MoE in production:
- If all experts must be loaded, MoE demands more memory; the payoff is higher throughput at high batch sizes. Serving an MoE is indeed more trouble, but worth it for high-volume workloads.
- Load-balancing losses and discontinuities during MoE training:
- Training does require keeping expert loads balanced to avoid waiting on the slowest expert; Mistral AI did not hit major problems while training Mixtral.
- Relationship between MoE and RAG (Retrieval-Augmented Generation):
- The two are orthogonal and can be combined (MoE model + RAG).
- Swapping in a domain-specific expert (modular MoE):
- Theoretically possible, but after swapping, the model (especially the gating layers) needs additional training to recognize and route to the new expert. An exciting research direction, related to model merging and editing.
- Computational intensity of MoE training (do gradients only touch the active parameters?):
- The routing operations are differentiable. Training cost is roughly proportional to the active parameter count (about 13B for Mixtral 8x7B), plus some communication overhead.
- Serving challenges for much larger MoE models (8x22B, 8x100B+):
- If a single GPU cannot hold even one expert, serving becomes very hard even with heavy quantization. More experts (e.g., 128) help specialization but pose a serving challenge, possibly requiring multi-node deployment and high communication cost. Mixtral 8x7B, after quantization, is nearly runnable on a single GPU.
Key Takeaways
Albert Jiang concluded:
* Sparse mixture-of-experts (SMoE) models leverage sparsity to gain much more knowledge.
* With careful training, SMoE models can be very efficient at inference.
* Expert specialization is not as straightforward as one might initially think.
* A great deal of work remains in architecture design and interpretability research.
* Finally, Albert Jiang introduced Mistral AI and welcomed interested people to join.