Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 12 - Efficient Training, Shikhar Murty

This lecture focuses on methods for efficiently training large-scale neural networks. The instructor opens with announcements that project proposal grades will be released shortly and that the project milestone requirements are being posted.

The core content starts with how numbers, floating-point numbers in particular, are represented in computers. FP32 (32-bit floating point) occupies 4 bytes of memory and offers a wide representable range with high precision. When training large models, however, FP32 can exhaust GPU memory (OOM).

To save memory, FP16 (16-bit floating point) can be used instead; it halves the memory requirement at the cost of a smaller representable range and lower precision. Very small values underflow to zero, very large values become NaN, and rounding errors appear. Gradients are hit especially hard: many small gradients are flushed to zero by the limited range, which hurts training.

To address this, mixed precision training is introduced. One scheme uses FP32 and FP16 together: keep a master copy of the weights in FP32, cast the weights to FP16 for the forward and backward passes, obtain FP16 gradients, then cast the gradients back to FP32 to update the master weights. This alone is not enough, because FP16 gradients may already have lost information (underflowed to zero) before they are cast back to FP32.

A further fix is loss scaling: after the forward pass, multiply the loss by a large scaling factor, which scales the gradients up proportionally so that gradients that would have underflowed to zero in FP16 are preserved. After computing the FP16 gradients, cast them back to FP32, divide by the scaling factor to undo the scaling, and update the FP32 master weights. In PyTorch this is implemented with GradScaler and autocast. The downside of loss scaling is that the scaling factor must be tuned carefully to avoid NaNs and to track the network's dynamics.

Finally, the lecture introduces another 16-bit floating-point format, BFloat16 (Brain Float 16). BFloat16 keeps the same number of exponent bits as FP32 by giving up mantissa bits, so it has the same dynamic range as FP32 but lower precision than FP16. In practice this loss of precision is usually acceptable for neural network training, and with BFloat16 the complicated gradient-scaling machinery can typically be dropped.

Media Details

Upload date
2025-05-16 20:37
Source
https://www.youtube.com/watch?v=UVX7SYGCKkA
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: Okay, cool, let's just get started. Welcome everyone to lecture twelve. So you know, so far we've learned a lot about how we convert words into vectors, how we convert sentences into vectors, and how we basically take actions in the real world using that, so like classify documents. We learned about transformers, we learned about pretraining. This lecture is going to be a little bit different. I'm going to be talking about how you can train large models on GPUs and a few basics about how these ML systems work. It has nothing to do with natural language at all, but hopefully it's going to be useful for final projects. So I'm going to spend some time on mixed precision training, some time on multi-GPU training with DDP and FSDP, and hopefully by the end of the lecture these terms will make sense, and some time on parameter-efficient fine-tuning. But before we get into the lecture, just some announcements. Proposal grades are going to be coming out shortly, hopefully by the end of the day. Thank you so much for all the hard work. I know it's kind of getting a little bit crammed with a lot of deadlines for assignment four and the project proposals, so thank you so much for all your hard work. The other thing is the project milestone details should be out shortly, if not already out on the website. It's worth 5% of the overall grade, it's due twelve days from now, and it's a maximum of two pages. And really, the way to think about the milestone is to use it as a forcing function to get work done for your final project. And yeah, with that out of the way, let's just jump into the material. So I'm going to start by thinking about how parameters and gradients and generally numbers are represented in computers, and I promise it's going to be relevant to deep learning pretty soon. So let's start with floating point. How many people here are familiar with this cartoon depiction of FP32? Okay, so some of you. So yeah, let's kind of recap how floating-point numbers are represented in computers. So firstly, FP32, that's 32 bits, so the memory requirement is four bytes. And so if you're thinking about neural networks, for every single neural net parameter you need four bytes of GPU memory. And the way to convert this cartoon into a real number is something like this: the first bit there is the sign, then the stuff in green represents the range, and the stuff in blue represents precision. And so with FP32 you can represent a pretty large range and it's fairly precise, right? The larger the stuff in green is, the more numbers you can represent, which means smaller numbers and also larger numbers, and the more stuff in blue we have, the greater the precision in representing actual numbers. So another popular data type that takes half the memory of FP32 is FP16. And the way we reduce memory is we're going to reduce the stuff in green, so there's going to be less dynamic range, and also the stuff in blue, which means there's going to be less precision. But the good thing is that we save memory, so we slash memory requirements in half. So let's think of a scenario where you're trying to train a big neural network and your model parameters and gradients are represented in FP32. You start training and suddenly you get an out-of-memory error. Okay.
And so just based on what you've seen so far, one possible solution is you cast everything into FP16. And if you do that, you reduce memory usage by half. So let's kind of work through what some possible problems with doing that are. Like I said, because there's less stuff in green, there's going to be less range, and so that means a lot of very small numbers will get converted to zero and a lot of really large numbers will get converted into NaNs. And there's also less precision because you have fewer bits in blue, which means you're going to get rounding errors. For example, 1.0001 gets converted to 1 in half precision. And I have a little screenshot of how you can test various properties of data types. So basically, the things to look at are the epsilon and the smallest normal. The epsilon is the smallest number such that if you add it to one, you don't lose any precision; if you add a number that's smaller than the epsilon to one, it just gets rounded down to one. And the smallest normal is the smallest number that can be represented in FP16; anything smaller than that goes straight to zero. And for neural network training, if a lot of small numbers get rounded down to zero, that's actually not good. So here is a diagram that I took from an NVIDIA blog post that's showing some gradients during the course of training, and more than half of these gradients will literally just get set to zero in FP16, which is kind of a problem, and that has to do with the range of FP16. And the second problem is with precision: we have less precision, and so our updates are not going to be precise. Okay, so here's one possible solution. We are going to use FP16, but we are also going to use FP32. That's the high-level idea. What we're going to do is maintain a copy of the model in FP32, and let's call those the master weights. Then you get a little bit of data, you run a forward pass, and you run it by converting from FP32 into FP16. Then you run a backward pass and get your gradient in FP16. So everything so far has happened in FP16. Then you take your gradients, upcast them into FP32, and update your master weights. And once you've updated your master weights, you copy them into the FP16 version of the neural network. So this seems like a reasonable scheme: I'm using FP16 on my GPU, but I have the full 32-bit precision also lying around somewhere so I can have more precise updates. Okay, can someone tell me why this is still problematic? Any guesses? (Student: Wouldn't it at least be slow, because you have to copy the 32-bit versions back and forth?) Yeah, so that's a good point. You can often overlap IO with the forward and backward passes, so practically this is not a problem. But yeah, that's a good point; if your network is very, very small, this could potentially be a problem.
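The data-type properties the speaker refers to (epsilon, smallest normal) can be checked directly in PyTorch; a minimal sketch along those lines, not the exact code on the slide:

```python
import torch

# Inspect the numerical properties of each floating-point format.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # eps: smallest x with 1 + x != 1; tiny: smallest normal number for the format
    print(f"{str(dtype):15s} eps={info.eps:.3e}  smallest normal={info.tiny:.3e}  max={info.max:.3e}")

# Underflow in FP16: values below the smallest (sub)normal flush to zero.
print(torch.tensor(1e-8, dtype=torch.float16))    # tensor(0., dtype=torch.float16)
# Rounding error in FP16: 1.0001 is not representable and rounds to 1.
print(torch.tensor(1.0001, dtype=torch.float16))  # tensor(1., dtype=torch.float16)
```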
speaker 2: Gradients are usually fairly small, and individual gradients are usually fairly small. And when you copy the FP16-computed gradients onto FP32, you may be setting values to zero somewhere in your network where you don't want them to be.
speaker 1: So yeah, that's pretty much the right answer. So let's go back to this diagram that we had. This shows gradients in the backward pass, and I said that we're going to compute all our gradients in FP16. What's going to happen? Most of them will just get converted to zero, which is something that we really would like to avoid. Okay, so here's a possible solution. You get your batch of data, you compute a forward pass in FP16, you get your loss. You scale the loss by some large value, let's say 100, let's say 1000, and then you compute gradients, and now you've effectively scaled your gradients by a large number. And so everything that we had on the left-hand side of this red line just gets shifted to the right, and hopefully there's less stuff that will get rounded down to zero. Then you compute your gradient in FP16, copy it into FP32, divide it by the scaling factor, and then you update your master weights. Okay? So this will solve both the problems that we talked about, and this is basically what we call mixed precision training. And it's relatively simple to implement this in PyTorch. All you have to do is instantiate this GradScaler object, and then within the context of this autocast, you run your forward and backward passes, and then scale down your gradient and update your model parameters. But this seems a little complex. We have to deal with scaling the loss and then scaling it back down. What if you multiply it by 10000 and that leads to NaNs? Then you have to update your scaler, you have to multiply by 1000 in the next iteration, and you have to kind of adjust to the network dynamics. Okay, so we'd like to not do gradient scaling. So can we do something better? The reason why we have to do the scaling is, just recall the role of the bits in green: that tells you what the dynamic range of the data type is. And we needed scaling because FP16 has a much smaller range compared to FP32, and because of that, FP16 cannot represent very small numbers. So how do we solve this? Any ideas? So here's the problem: in FP16, because you have fewer bits for the exponent, you can't represent very small numbers. If you have something that's smaller than, I don't know, 6e-5, it gets rounded down to zero, and that's because of the dynamic range of FP16. So how do you solve that?
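The PyTorch pattern being described looks roughly like the following; a minimal sketch with a stand-in model, optimizer, and synthetic data rather than the slide's exact code:

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(128, 10).to(device)            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()             # tracks and adapts the loss-scaling factor

for step in range(10):
    x = torch.randn(32, 128, device=device)      # synthetic batch
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    # Forward pass (and loss) run in FP16 where it is safe to do so.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # backprop the scaled loss -> scaled gradients
    scaler.step(optimizer)          # unscales gradients, skips the step if inf/NaN appears
    scaler.update()                 # grows or shrinks the scale factor for the next iteration
```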
speaker 2: Sacrifice some of the precision?
speaker 1: So that's the right answer. What we're going to do is sacrifice precision. That's the idea behind bfloat16, which stands for brain float 16. You're going to have exactly the same number of bits for representing the range, so that's going to be eight bits, so it has the same dynamic range as FP32, but a lot less precision. And it turns out that this is okay for neural network training. And now if you use bfloat16, you don't need to use grad scalers anymore. It's as simple as wrapping your model's forward pass and backward pass within the right context. The one caveat about bfloat16 is that it's not available on all GPUs. You need to have the newer Ampere NVIDIA architectures, which the H100s, the A100s, the A6000s have. But if you have an older GPU, then you might not be able to utilize bfloat16.
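With bfloat16 the same loop gets simpler, since no GradScaler is needed; a minimal sketch under the same stand-in setup, including the hardware check mentioned later in the lecture:

```python
import torch
from torch import nn

assert torch.cuda.is_bf16_supported(), "bfloat16 needs an Ampere-or-newer GPU"

device = "cuda"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(10):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    # Same autocast context, but targeting bf16: full FP32 range, so no loss scaling.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```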
speaker 2: So less precision, but the same number of exponent bits? Yeah.
speaker 1: So here are some results. Someone fine-tuned a model for sentiment classification on a single A100. At the very top is float64, which is a really, really rich 64-bit representation of floating points. It takes about 25 minutes and you get a pretty high accuracy, but it also takes a lot more memory. And all the way down, we're using mixed precision training with bfloat16, and now we have reduced training time by roughly a third, with more or less the same accuracy, a little bit better actually because there's some regularizing effect from the half-precision representation, and then a lot less memory. And the reason we see speedups for training is because matrix multiplies tend to be faster when you are multiplying in half precision. Okay. So before we move on, are there any questions about this? Okay, cool. So let's keep going and let's change the setting. Now we have more than one GPU. We have multiple GPUs, and we want to train a network over all of the multiple GPUs that we have. Okay, so let's start with some basics. Here's a cartoon showing basically a model and an optimizer receiving some data from a dataset. And let's work through what's stored in GPU VRAM. This is going to be somewhat of a lie, and I will point out what my lie is soon, but just to keep things simple: we have the neural net parameters. Let's say we're doing mixed precision training, so they're stored in FP16. And then we have an optimizer. When I first saw this a few years back, I was very surprised to see that optimizers also need memory. But if you're using something like Adam, then you need to store the Adam momentum term and the Adam variance, and every time you get a gradient, you have to update the Adam momentum and variance, and that's what you use for updating your parameters. And because you're using mixed precision training, these have to be represented in FP32. Okay, so that's what the picture looks like if you have a single GPU. Now let's say we have multiple GPUs, and what we'd like to do is first divide our dataset. Let's say we have four GPUs; we'll divide our dataset into four parts, and we'll maintain a synchronized copy of the model, and every model receives its own slice of the dataset. In the beginning, we have a synchronized model and everyone has their own copy. We run a forward pass. This forward pass receives different data points, and so every model is going to have different activations and, correspondingly, every model is going to have different gradients. You run a backward pass; every model has a different gradient because there are different data points. And then we're going to run a synchronization step, and what synchronization is going to do is communicate gradients between the different workers. And so I'm going to introduce the first MPI primitive in this lecture, and that primitive is called the all-reduce operation. What all-reduce does is it takes four pieces of information, in this example on four different GPUs, it merges everything together and then distributes it to all of the GPUs. And the communication overhead of doing that is two bytes per parameter, because, remember, we have FP16 gradients, so two bytes per gradient, and this needs to be communicated. So the overhead is two bytes per parameter. Okay? So that's the all-reduce operation.
And then once gradients have been communicated, so they're gathered and the cumulative gradient is distributed, at that point every optimizer has the full gradient and the optimizer can update the model so that you maintain synchronization. Okay, so that's the basic setup, known as distributed data parallel, DDP. That's good, but it turns out that it has really poor memory scaling. So let's go through the math for how much memory is needed. We have the model parameters; that's FP16 because we're doing mixed precision training. Then for the gradient, we also have the gradient in FP16, so two bytes for the gradient. And then we have the stuff in green. The stuff in green is, let's say we're doing Adam, so we need to store the master weights regardless of whether we're doing Adam or not, and then we need to store the momentum and the variance. So that's twelve extra bytes per parameter, and this needs to be stored on every single GPU. And so the question is, can we do better than this? And now things are going to get a little bit more tricky, so if you have questions, just stop me and we can go from there. The way we're going to improve our memory scaling is with a set of techniques that are together known as ZeRO, which stands for zero redundancy optimizer. This was a set of techniques released by Microsoft as part of the DeepSpeed project. And the idea is going to be that instead of having every GPU maintain all of the state, and by the state I mean the stuff in blue, the stuff in orange and the stuff in green, we're going to shard it. So it's going to be sharded so that not every GPU has all of the parameters or all of the gradient, but by communication they can synchronize. That's pretty much what the sketch for this is going to look like. So let's look at stage one. ZeRO has multiple stages: stage one, two, three. In stage one, we are going to shard the stuff in green, which was the optimizer state. And the way we're going to shard it and still maintain synchronization is something like this: every GPU has the full set of parameters in FP16, and every GPU has its gradient for its data, but it only has a sharded copy of the full optimizer state. And the other requirement is that every GPU is responsible for updating the parameters corresponding to its own shard. So if you go step by step, this is what it looks like. Every GPU has its own data. Every GPU gets a gradient on its subset of the data. Then we perform a reduce-scatter. This is the second MPI operation of the lecture; we've done all-reduce, this is the second one, called reduce-scatter. What a reduce-scatter does is: every GPU has the full gradient on its data, and what you want to do is take the chunk corresponding to, let's say, GPU one. So let's say you're GPU zero, and you've computed the full gradient for all the parameters, and you want to communicate the chunk for GPU one to GPU one, and the same for GPUs two and three. So from the full gradient, you just communicate the bits that a different worker is responsible for to that worker, and every GPU has to do that. That's called a reduce-scatter. And then once every worker gets the gradient corresponding to its shard, it's going to update its parameters.
And then once they have updated their shard, they're going to perform an all-gather. So what that means is, let's say you have a neural network with just eight parameters, two parameters on each GPU. At the end of this, each GPU has updated its subset of parameters, and then they're going to do an all-gather to maintain synchronization, so every GPU gets the full set of updated parameters.
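For reference, the collectives used here map onto torch.distributed calls; a minimal sketch, assuming a torchrun launch (one process per GPU), a recent PyTorch with the tensor-based collective APIs, and a world size that divides the toy tensor evenly:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")                    # torchrun sets up one process per GPU
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank)

# DDP-style sync: every rank contributes its local gradient, everyone receives the sum.
grad = torch.full((8,), float(rank), device=device)
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

# ZeRO-stage-1-style sync: a reduce-scatter leaves each rank with only its own shard ...
shard = torch.empty(8 // world, device=device)     # assumes world size divides 8
dist.reduce_scatter_tensor(shard, grad, op=dist.ReduceOp.SUM)
# ... the rank updates its shard, then an all-gather rebuilds the full updated vector.
full = torch.empty(8, device=device)
dist.all_gather_into_tensor(full, shard)
```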
speaker 2: Each GPU is still maintaining all of this, and you're not merging it the other way. What makes this more efficient?
speaker 1: Sorry, let me repeat the question for the video: why is this better than the previous approach, right?
speaker 1: So what we're going to do is shard the optimizer state. In our running example we have a neural network with eight parameters. Earlier, every GPU needed the optimizer state for all eight parameters. Now every GPU has to maintain optimizer state for only two parameters. So after the reduce-scatters are done, each GPU has the full gradient corresponding to just its two parameters, and its optimizer state covers just those two parameters. The model is going to update only those two parameters using that partial optimizer state.
speaker 2: So you'll eventually get the rest of the parameters back?
speaker 1: So you have the entire set of parameters, you have all the stuff in blue, and you have the full gradient for your subset, but you don't have the full optimizer state. So what you can do is only update the parameters for the bits of optimizer state you have. In the running example that I just made up, GPU zero updates two parameters, GPU one updates two parameters, and so on, and then they communicate updated parameters to maintain synchronization. Okay, more questions about this? Okay, so let's keep going. So far we have looked at three MPI operations: we looked at all-gather, we looked at reduce-scatter and we looked at all-reduce. It turns out that an all-reduce is actually equivalent to running a reduce-scatter followed by an all-gather operation. And just recall that for DDP, all we had to do was this all-reduce operation, and we computed the communication overhead of that. It turns out that when you're doing this optimizer state sharding, you have exactly the same communication overhead, just because an all-reduce is equivalent to a reduce-scatter followed by an all-gather. And so we basically saved memory for free. So I mean, you should just always use this, because you're going to get memory savings and you don't have any additional communication overhead. Okay, so we're happy, we saved memory, and now we want to shard even more things. So let's start doing ZeRO stage two. Now, along with sharding the stuff in green, which was my optimizer state, I'm also going to shard gradients. And this is going to be a little bit more complex, because we kind of still need the full gradient for the worker's data slice, but each GPU only has enough memory for instantiating the gradient for a small subset of parameters. So how are we going to deal with that? We are actually never going to instantiate the full gradient vector. Whenever a GPU gets a gradient in the backward pass, you instantiate a vector temporarily for the parameters for which you just received a gradient, compute the gradient, send it to the right worker, and then you free the memory that you just created. Okay, that's the sketch, and let's go through it step by step. So we have four workers. Each worker performs a backward pass, and the backward pass happens layer by layer, right? Recall the lecture on backpropagation: you have the loss, and then you have this backward pass where layer by layer you compute gradients. So now let's say you're at layer j. You take the upstream gradient and you compute the gradient for the parameters at layer j. Immediately, the moment you compute those gradients, you send them to the right worker. There exists some worker that is responsible for layer j, and what's going to happen is every GPU that's just computed the gradient for layer j for its data slice sends it to the right worker. And the moment you've done that, you deallocate the memory that you just created. And so this is kind of a fourth MPI operation, but really not very different from a reduce-scatter; this is just a reduce: there are four GPUs that have a gradient, and they just have to communicate it to whoever is responsible for maintaining the gradient for that layer. Okay. And then, yeah, there exists some worker that is responsible for a given layer.
It's going to update its parameter shard using the full gradient that it received via this communication, along with the optimizer state. And then at the end, to synchronize everything, you have to perform an all-gather as before. Okay. Any questions about this high-level sketch? Okay, so let's keep moving. So recall that ZeRO stage one was basically free, because it turns out that an all-reduce is equivalent to a reduce-scatter plus an all-gather. And we're kind of doing the same thing here: we have a reduce followed by an all-gather, so this is practically also free. So we've gotten away with saving memory without any communication overhead compared to DDP so far. So let's keep going, and let's see if we can shard even more things. And I think someone alluded to this in the audience early on: what happens if you shard even your model parameters? So let's say you run into a situation where, forget about the optimizer state, even your model wouldn't fit on a single GPU. In that case, what you do is split up your model across all the different GPUs, so you shard your model parameters, which is the stuff in blue. But the caveat is that now we're not going to get this for free. We're not going to get memory savings for free; there's going to be some communication overhead. And this is ZeRO stage three. This is the final stage of ZeRO. This is also known as FSDP, fully sharded data parallel, for anyone who's heard that term before. And here's the high-level sketch, and I feel like this is kind of the easiest to understand compared to ZeRO stages one and two, just because there needs to be communication at every step of the way; you can't get away without communicating. So the first thing we're going to do is take our model and convert the entire model into FSDP units. Here's a sketch: a simple deep neural network, and I'm going to convert it into multiple FSDP units, three FSDP units here. That's just a data structure, an FSDP unit; we've not done anything so far. And then I take this FSDP unit and convert it into another data structure called a flat parameter. And then I'm going to assign a subset of these parameters to every single GPU. So here we have 16 GPUs and a flat parameter consisting of 14 parameters plus some extra padding so that things divide properly, and I'm going to assign each parameter to a distinct GPU. And so that's basically just a complex way of saying that we created some data structures and we divided up the model parameters across the GPUs, so every GPU gets a subset of model parameters. Okay, now let's start thinking about what my forward pass will look like. There's no GPU that has the full set of parameters. So you're running a forward pass; let's say you're at layer four now, and there's no GPU that has all of layer four. So you have to communicate. We need to do an all-gather operation; that's the operation we used to accumulate multiple things that are on multiple GPUs so that every GPU has the full thing. So you perform an all-gather, so you have all the pieces of layer four. You run a forward pass, and now you don't need layer four, so you discard the parameter shards you gathered. And now you have to run your backward pass, right? You computed your loss, and now you have to do a backward pass. Again, let's say you are back at layer four.
You have your upstream gradient, you don't have layer four, so you need to do another all-gather. You get all the parameters of layer four, and then you run a backward pass for layer four. You compute the gradient for your subset of parameters. Recall that every GPU has different data points, so there are going to be different gradients on every GPU. So for layer four you do an all-gather, get all the parameters, compute the gradient; every GPU has different gradients, and then you have to do a reduce-scatter so that you can send the full gradient to the GPU that's responsible for whatever part of layer four it owns. So yeah, that's basically full FSDP. And then once you've run the forward and backward pass, each GPU will update its own parameter shard using the full gradient that it received just now, and then you do a synchronization. So let's do a quick review of everything we've looked at so far. There's DDP, where you don't shard anything: you have the full model, the full gradient, the full optimizer state on every single GPU, and all you divide up is the dataset. So you have a big dataset of a thousand examples; every GPU gets 250 examples. Then you compute a forward pass and a backward pass; every GPU has a different gradient; you need to communicate that gradient, and then you synchronize. And that was called an all-reduce operation in MPI terms. And then we looked at ZeRO, where now we want to save some memory; we don't want the full memory requirements of model, gradients and optimizer state on every single GPU. In ZeRO stage one, we shard the optimizer state so that you don't have to maintain the full optimizer state on every GPU; you break that down between all the different GPUs that you have. And we saw that the communication overhead of maintaining synchronization in ZeRO stage one boiled down to basically just doing an all-reduce, through this identity that says an all-reduce is a reduce-scatter plus an all-gather. And we save memory for free with ZeRO stages one and two, so you should just do it. And then with ZeRO stage three, things got a little bit more complex, because you have to divide up your model parameters, the optimizer state and the gradient. And so while you're running your forward pass, you have to do some communication to get the full parameters for any layer, layer four in our example, and then you also have to do an all-gather in the backward pass to get the full parameters again, and then you have to do a reduce-scatter so that you can send the full gradient for whatever chunk of the parameters to the right GPU. Overall, that's two all-gathers plus a reduce-scatter, so that's a lot more overhead than stages one and two. But if you don't have enough GPU VRAM to even load your model onto a GPU, then this is what you have to do. Any questions about MPI primitives or stages of ZeRO or FSDP? Okay, cool. So I'm going to fix the lie I told earlier about the GPU VRAM calculation. I said that there's just the model parameters and the gradients and the optimizer state, but there is this final thing: the model activations. We've all seen that as you keep increasing the batch size, there's a point where the GPU says it can't fit more stuff, and that's because you also need to store model activations for the backward pass, right?
And that scales linearly with the batch size. The larger the batch size, the more model activations need to be stored. And by the way, if you're doing mixed precision, these are in FP16 or BF16, but they scale with the batch size. So that's the other thing you have to think about, and none of the techniques that we've looked at so far help with sharding model activations. Okay. So we looked at a bunch of the basics of multi-GPU training and floating point, but it kind of boils down to this very simple flow chart, which you can use for your final projects when you're fine-tuning models. The first thing is: always use mixed precision training. You barely ever see a hit in performance; by performance, I mean generalization, or F1, or accuracy. And if you're using the newer Ampere architectures, the H100s or the A100s or the A6000s, always use bfloat16; it's just better, and you can check that with that torch command. Okay, so always use mixed precision training. Now ask yourself this question: does batch size one fit on a single GPU? If it fits, try a larger batch size; batch size one is too small. Use a larger batch size and/or use ZeRO stage two. ZeRO stage two is free, so just use ZeRO stage two and increase your batch size. If you can't fit even batch size one, then you have to see if ZeRO stage three fixes your out-of-memory issues, because now you're going to shard your model parameters. And all of this is in the context of full fine-tuning, right? So I'm fine-tuning all of my model parameters. Okay?
speaker 1: Sometimes the answer to that question is also no. So you can't full fine-tune your model on four, whatever, A100s or A6000s. You've tried ZeRO stage three, you've tried mixed precision training, you have a batch size of one, maybe you did gradient checkpointing, activation checkpointing. Nothing works. So now, basically, you can't do full fine-tuning, and the thing to do is to try parameter-efficient fine-tuning, and that's going to give you a lot more memory savings. Okay, so let's talk about parameter-efficient fine-tuning. Why is it called parameter-efficient fine-tuning? In full fine-tuning, you run a forward pass and a backward pass and you update every single model parameter. In parameter-efficient fine-tuning, you're only going to update a small subset of the full set of parameters. And why would you want to do that? Maybe you're in a setting where you cannot full fine-tune, even with batch size one; you tried all the tricks possible and it just wouldn't fit, and so maybe you have to do parameter-efficient fine-tuning. The other possible reason you might want to do it is slightly more scientific: these models these days are heavily overparameterized, and you have a small dataset, and you believe that if you do parameter-efficient fine-tuning, then you can get better generalization, or you believe that it's going to match full fine-tuning. Okay. There's a second set of reasons for wanting to do efficient adaptation. The plot on the right here shows, in red, the estimated growth in training compute for training the largest AI models, and the line in blue is the global compute capacity. So very soon we are going to overshoot the global compute capacity and need a lot more compute than the global capacity, and this is kind of not sustainable. And there are arguments to be made that if we keep going down this route, then AI development becomes concentrated in the hands of only a few well-funded organizations, and as students we can't do it, and so that's a problem. And also, if there's only a small number of players that are training and fine-tuning models, then they may bias the models in specific ways that reflect their value systems and not the broader public's. So that's another reason to think about efficient adaptation. And there's this paradigm in machine learning in general, and in NLP specifically, of focusing a lot on accuracy rather than efficiency. The plot on the right here shows the percentage of papers where the main contribution is a method that produces just more accurate models versus methods that produce the same accuracy with more efficiency. And we can see that for most conferences, the vast majority of papers are about accuracy, and there are very few papers about efficiency. So maybe this is leading to this kind of monoculture, and maybe that's why we want to focus on efficiency. A second, maybe bigger concern is that there's this huge hidden environmental cost of training and fine-tuning large language models. I was just reading some report where they said that the cost of training GPT-3 was equivalent to 1.1 million tons of carbon emissions, or some such number, and they estimated that that's the cost of running a coal power plant for ten hours straight. All right. And for an example closer to home: in the reinforcement learning class, there was a homework assignment.
A lot of students implemented one or two common algorithms that outperformed everything else but used a lot more power. And someone did this calculation that if everyone had used the more efficient algorithm, that would have reduced the power consumption of the class by about 880 kilowatt hours, which is what an American household uses in a month. Okay, so these are all reasons to think about efficiency and how you can fine-tune models with fewer resources. So let's jump back into parameter-efficient fine-tuning. Any questions so far about any of this? Okay, so let's recap full fine-tuning. Let's say we have some large pretrained autoregressive language model, let's say it's a GPT, and maybe we want to use it for summarization, maybe we want it for semantic parsing, so converting natural language to SQL commands, or maybe we want it to answer questions about paragraphs. What do we do? We collect a dataset of (x, y) pairs, and then we do full fine-tuning. In full fine-tuning, we are going to update all of the model parameters based on the gradient of some loss function. And maybe that's not feasible: GPT-3 has 175 billion parameters, so there are just a lot of parameters to learn. And even once you have done full fine-tuning, you have to store all of the parameters, and if you're doing several tasks, you have to store parameters for every task. So can we do better? The main idea is that instead of updating all of the parameters, I am going to update a much smaller number of parameters. So instead of finding a delta theta which is the same size as the entire set of parameters, I have to search over a much smaller space. The added benefit is that I can store this much smaller delta pretty easily on disk, hopefully it's going to require less compute, and hopefully it's going to generalize almost as well as full fine-tuning. There are many different ways of operationalizing this high-level idea of parameter-efficient fine-tuning. The one I'm going to talk about today is LoRA, which stands for low-rank adaptation. It basically comes from the observation that when you fine-tune big language models, oftentimes when you look at the geometric structure of the gradients, they tend to have a low intrinsic rank. Do people remember rank and SVD? All right. Okay. So the gradients tend to have a low intrinsic rank. And so what the authors realized is that instead of fine-tuning the entire set of parameters, you could instead fine-tune a much smaller, let's say rank-r, matrix for every full-rank matrix that exists in the model. So let's say we have some pretrained weight matrix W0. What I'm going to do is, instead of applying some kind of arbitrary update, I'm going to make sure that the update has the following form: it's going to be the product of two low-rank matrices, B and A, where A is an r x k matrix and B is a d x r matrix. And r is the rank, much, much smaller than both the incoming dimension and the outgoing dimension. And then there's the term alpha.
You can think of alpha as some kind of trade-off between the knowledge that's already stored in the pretrained model and the additional knowledge that you want to add into the model. So if alpha is zero, then you're not doing anything. If alpha is something really, really small, then you don't really want to change your model parameters all that much and you just want to add some small task-specific knowledge. Additionally, the only trainable parameters here are going to be A and B. The other thing to note is that since I'm representing updates as this product B times A, as I increase r, that's going to converge towards full fine-tuning. So you have a slider that you can use to control how much fine-tuning you want to do, essentially. And then the other important thing is inference latency. What you can do is just store these learned matrices for every task, and whenever you switch to a different task, you can remove the extra term that you've added to every weight matrix for the old task and add in the task-specific terms for the new task that you want to run inference on. And the cost of storing these much smaller matrices is also way lower than storing the full delta. And we'll see where you should apply LoRA, but generally you want to apply it to the weight matrices in self-attention. In code, it actually looks fairly simple. When you're running a regular forward pass, you just compute the hidden state as, let's say, the product of the weight matrix and the incoming feature vector. Now with LoRA, what you're going to do is freeze the model parameters, compute h as before, and then add this additional offset term, and that's the only thing that's going to be trainable. That's pretty much all you have to do; you have to do it for every single weight matrix, for every single layer. But yeah.
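The code being described amounts to something like the following; a minimal sketch of a LoRA-augmented linear layer written from this description, not copied from the slide:

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """h = W0 x + alpha * B A x, with W0 frozen and only A, B trainable."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)       # pretrained weight W0
        self.base.weight.requires_grad_(False)                # freeze the pretrained path
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: r x k
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B: d x r, zero-init so the delta starts at 0
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)                                  # frozen pretrained output
        return h + self.alpha * (x @ self.A.T @ self.B.T) # low-rank trainable offset

layer = LoRALinear(768, 768, r=8, alpha=1.0)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```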
speaker 2: There's an alpha term in the second code line. Where do you define alpha? In the first one? Or do you just put it in there?
speaker 1: So yes, you define it somewhere. If you set it to one, that's like saying I want an equal trade-off between the pretrained knowledge and the new task-specific knowledge. Typically people set it to one. You could set it to something larger than one if you believe your task is something the pretrained model has no idea about, or something smaller than one if you don't want to change the model too much. So that's basically LoRA in practice. So like I said, there are a bunch of different parameter-efficient fine-tuning methods. I'm not even going to name all of these. There are adapters, which some of you might have heard about. There is BitFit, which is not shown here. There are lots of different ones, like p-tuning. But it turns out that compared to a lot of these different methods, LoRA is pretty high performing on a bunch of different tasks for these relatively smaller models. And then if we try to fine-tune some of the bigger models like GPT-3 and compare with other parameter-efficient fine-tuning methods, so full fine-tuning is at the top, then we have BitFit, where you only fine-tune the bias terms, and adapters; compared to those, firstly, LoRA requires a lot fewer additional parameters that you need to store, and it gives you a good trade-off in accuracy, comparable to full fine-tuning, and sometimes there's a regularizing effect from fine-tuning only a small subset of your model parameters. Okay. So the question is, you could apply LoRA to every matrix, and I said that you want to apply it to the various learned weight matrices inside self-attention: which parameters do you want to apply LoRA to? Generally, the rule of thumb is that if you apply it to the matrix that takes your hidden state and converts it into queries, and the matrix that converts your hidden state into values, that's pretty much going to give you the best performance overall. The other hyperparameter for LoRA is the optimal rank. Recall there are these two matrices, B and A, that are both low rank. And it turns out that already with a really small rank, you can get pretty high performance, and this is much, much smaller than the hidden state dimensions of the matrices for most models these days. Okay, all right. So we covered a bunch of things: we talked about floating points and mixed precision training, multi-GPU training, DDP, FSDP, LoRA. It kind of boils down to a very simple flow chart that you can just use for your project. So if you were sleeping through the entire lecture, maybe now is the time to wake up and just look at this flow chart. So: always use mixed precision training. If you have the newer Ampere architectures, use bfloat16. Try with a batch size of one. If batch size one fits, try a larger batch size, and then always just use ZeRO stage two. If batch size one doesn't fit, try ZeRO stage three. Maybe try gradient checkpointing, activation checkpointing.
speaker 2: Does this assume more than one GPU? Because otherwise ZeRO stage two doesn't really help us.
speaker 1: Oh yeah. So all of this applies only if you have more than one GPU. If you have a single GPU, you have to do other things; maybe you have to heavily quantize the model, and even then, I don't think you can fine-tune some of the bigger models. So assuming you have multiple GPUs, you can try ZeRO stage three if you have out-of-memory errors with a batch size of one. And if that doesn't work, you can try LoRA. The main hyperparameters in LoRA are the alpha, the rank, and which weight matrices to apply LoRA to: apply it to the query matrix, apply it to the value matrix, set the rank to eight, that's a good starting point, and set alpha to one. Just do that and you should be good to go; you can fine-tune models and things should be reasonably good. Okay, so I'm going to end now unless there are questions. Oh, there is one question, in the back. (Student: I was wondering if you could go back and walk through it a little bit step by step, starting with slide 48.) Yeah, this diagram on the left, right? Okay, so let's go through this diagram. So basically what this diagram shows is how the communication overhead is really not that bad: if you have a fairly big model, then in the time it takes to do a forward pass you can already prefetch all of the parameters for the next layer. That's pretty much the idea, and it's kind of a standard idea; PyTorch does this by default, by the way. You want to make sure that you fully saturate your GPU and that you overlap communication with any additional compute you're doing. That's pretty much what's going on here, but let's go through it step by step. The starting point here is FSDP units, so zero, one and two are different FSDP units. What you start by doing is you want to run a forward pass on the first layer. You don't have the first layer; let's say you are GPU k, you don't have the first layer, so you have to do an all-gather to get all of the parameters for the first layer. That's AG0. At the end of AG0, every GPU has the full set of parameters for the layers corresponding to FSDP unit zero; let's just say that's layer zero. So you have the full parameters for layer zero, you run a forward pass, that's the stuff in blue. And while you're running the forward pass through the first layer, you're going to be smart about communication overheads, and you're going to prefetch the parameters for the next FSDP unit. So let's say layer two is a different FSDP unit; that's AG1. And so you can see that there is a little bit of overlap between forward zero and AG1. After getting all of the parameters for layer one, you're going to do a forward pass, and so on, and then you're going to do AG2. And at the same time, let's say you now have way too many parameters on your GPU, so you're going to free up some memory; that's the stuff in yellow, and that's how that goes. So you basically overlap all-gather operations with the forward pass, and that's how you run the forward pass. So the communication overhead is really not that bad if you have a really big deep neural network, assuming that you have sharded everything properly. Okay.
And then you start the backward pass. In the backward pass, I guess it's a little bit tricky, because you want to do these all-gather operations to get the parameters again. So let's say it's a ten-layer neural network and you want to compute the full gradient at layer ten. You need to do an all-gather operation to get all of the parameters at layer ten, and then you have to do a reduce-scatter. So you have four GPUs, everyone has the full set of parameters at layer ten, and they have different gradients, so they have to merge their gradients and then split them up to the right GPU; that's the reduce-scatter. But that's not too bad, because you can still overlap reduce-scatter operations with the backward pass, and that's what you see happening in the backward pass there. And then along with these forward and backward passes, at regular intervals you have to make sure that you free up GPU memory. So for example, once you have run a forward pass through layer one and you move on to layer two, you don't need anything in layer one, so you just free up the memory for layer one. That's pretty much the idea behind this diagram. There are a few details here. One of them is that in FSDP, unit zero is treated differently: you'll see that unit zero is never freed up. That's just an implementation detail in FSDP. I'll just quickly say one more thing about FSDP and then take a question. So the presentation here makes it seem like it's so simple and that it can be applied to any neural network, but it turns out that that's not the full picture. You need to divide up your neural network into FSDP units, and depending on what policy you use for dividing up your parameters into FSDP units, you get different communication overheads. So for example, it makes sense to have multiple consecutive layers in the same FSDP unit, and so on. And this is very architecture-specific. When you start to use this in PyTorch, you'll see that the FSDP wrapper requires a wrapping, or sharding, policy, and that is very architecture-specific. Because everyone uses transformers now, there are very handcrafted, fine-tuned policies for creating FSDP units and sharding strategies for transformers. But let's say for your final project you came up with a new architecture, subquadratic attention, whatever; maybe it's not going to be as efficient, just because you don't have the right sharding policy. So that's one detail about FSDP that you may want to keep in mind. Okay, you had a question?
speaker 2: Just a clarification. You mentioned you can throw away the weights that you don't need after your forward pass, but then you need them back for the backward pass. Do you stream them back in each time? Or do you cache some, cache the recent ones, is there any caching going on? Or do you throw them all away and stream them all back?
speaker 1: So there might be some caching in the system, but the idea is that you just throw them away, or at least to the user it seems like you've thrown it all away in terms of GPU RAM utilization.
speaker 2: So we stream them in each layer again?
speaker 1: And so that's why it's important to shard properly, right? So for example, if every consecutive layer is sharded such that it's spread across multiple GPUs, then you're kind of always communicating, as opposed to doing one all-gather and then having all of the next three layers loaded in. So that's why this sharding policy becomes important. Okay. So if there are no more questions, let's end early. Thank you so much.

Latest Summary (Detailed Summary)

Generated on 2025-06-06 14:48

Title: Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 12 - Efficient Training
Description: This lecture is a purely practical, engineering-focused session in Stanford's CS224N course, aimed at showing how to train models with billions of parameters on limited GPU resources. It covers a set of core engineering techniques, from mixed precision and distributed training to parameter-efficient fine-tuning (such as LoRA), making it a valuable hands-on guide to saving money and GPU memory for anyone who wants to work with large models. The instructor also opens with course project announcements.

Overview / Executive Summary

This lecture (Stanford CS224N Lecture 12, Spring 2024) dives into key techniques for efficiently training large-scale neural networks on GPUs, aimed at helping students cope with the compute and memory limits they may hit in their final projects. The content is not about NLP theory per se, but about the practical side of machine learning systems.

The lecture first introduces mixed precision training: by using low-precision floating-point formats such as FP16 or BFloat16, memory usage drops sharply and computation speeds up while model quality is preserved. The discussion focuses on how to avoid training instability, either via gradient (loss) scaling or by using BFloat16 directly.

Next, the lecture breaks down strategies for multi-GPU training. Using a "team co-writing a book" analogy, it moves from basic distributed data parallel (DDP), where everyone holds a complete copy of the manuscript, reviews their own portion, and then meets to synchronize all edits, to the more efficient ZeRO techniques. ZeRO shards the optimizer state, then the gradients, and finally the model parameters themselves in stages, like keeping a full manuscript per person but splitting the reference material by chapter, and ultimately evolves into FSDP, where the manuscript itself is split, each person keeps only their own chapters, and the rest is borrowed temporarily when needed, so that the total GPU memory across devices is used as fully as possible.

Finally, the lecture introduces parameter-efficient fine-tuning (PEFT), focusing on LoRA. The analogy is adding a small tuning knob (the LoRA module) to a powerful Swiss Army knife (the pretrained model) rather than redesigning the whole tool. By freezing most of the parameters and training only small adapters, LoRA achieves strong downstream performance at very low resource cost; the lecture also touches on its broader benefits for democratizing AI and reducing environmental impact. It closes with a practical decision flow chart for combining these efficient-training techniques.

Number Representation and Floating-Point Basics

The lecture begins by reviewing how floating-point numbers are represented in computers, which is the foundation for the training techniques that follow.

  • FP32 (single-precision floating point):
    • Occupies 32 bits (4 bytes) of memory; the standard floating-point format.
    • Wide representable range and high precision, like a ruler with very fine markings.
  • FP16 (half-precision floating point):
    • Occupies 16 bits (2 bytes), halving the memory requirement.
    • The cost is reduced range and precision, like a shorter ruler with coarser markings.
      • Reduced range: values that are too small flush to 0, and values that are too large become NaN.
      • Reduced precision: a number like 1.0001 may be rounded to 1.0.
    • This matters for training because many gradient values are tiny; in FP16 they simply vanish (become 0) and the model cannot learn.

Mixed Precision Training

The goal of mixed precision training is to get the best of both worlds: the low memory use and speed of FP16 without its numerical-instability pitfalls.

  • Background:

    • Training a large model in FP32 quickly runs into out-of-memory (OOM) errors.
    • Training directly in FP16 fails because many gradients underflow to zero; the lecture cites a chart showing that "more than half of the gradients would simply become 0 in FP16".
  • Solution 1: Gradient scaling (loss scaling)
    This is a clever "magnifying glass" trick:

    1. Run the forward pass as usual and obtain the loss.
    2. Multiply the loss by a large scaling factor S (say 1000).
    3. Compute gradients from the scaled loss. The originally tiny gradients are scaled up by the same factor, large enough to "survive" within FP16's representable range.
    4. Before updating the weights, divide the scaled gradients by S to restore their true magnitude.
    5. To keep the updates precise, an FP32 copy of the "master weights" is maintained and updated with the restored gradients.
    6. PyTorch implementation: via GradScaler and the autocast context manager.
    7. Drawback: the scaling factor S must be tuned carefully, like walking a tightrope; too large or too small and things go wrong.
  • Solution 2: BFloat16 (Brain Float 16)
    BFloat16 is the more modern, simpler solution; its design philosophy is that "seeing far matters more than seeing sharply".

    • Like FP16 it uses only 16 bits, but the bits are allocated differently: it keeps as many exponent bits as FP32 (which set the range) and sacrifices more mantissa bits (which set the precision).
    • As a result, BFloat16 has essentially the same dynamic range as FP32 and can represent very small and very large numbers easily, so the fiddly gradient-scaling step is mostly unnecessary.
    • Limitation: it requires newer GPU hardware, such as the NVIDIA Ampere architecture (A100, H100, etc.).
    • Effect: experiments show that mixed precision training with BFloat16 cuts training time by roughly a third and greatly reduces memory use, with essentially no loss in accuracy; sometimes accuracy even improves slightly thanks to the mild regularizing effect of the lower-precision representation.

Multi-GPU Training

When one GPU is not enough, multiple GPUs have to work together.

  • Distributed Data Parallel (DDP)

    • How it works: this is the most basic multi-GPU training method. Imagine a writing team trying to speed up a review (a minimal code sketch follows this item):
      1. Distribute the work: split the big book (the dataset) into parts, one per reviewer (GPU).
      2. Work independently: every reviewer (GPU) keeps a complete copy of the manuscript (a model replica) and proposes edits (computes gradients) based on their own chapters (data shard).
      3. Merge the feedback: all reviewers meet, pool and average their edits, and agree on one unified set of changes (gradient synchronization). This is done with an AllReduce operation, which can be read as: "everyone's results are combined and averaged, and every participant receives the final averaged result."
    • Memory bottleneck: the downside of DDP is that every reviewer (GPU) must store a complete manuscript (model), a complete set of reference material (optimizer state), and their own edits (gradients), which wastes a great deal of storage (GPU memory).
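A minimal PyTorch DDP sketch of this pattern (toy model and data, assuming a torchrun launch with one process per GPU):

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(nn.Linear(128, 10).cuda(), device_ids=[rank])   # full replica on every GPU
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # full optimizer state on every GPU

dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))
sampler = DistributedSampler(dataset)                        # each rank sees its own slice of the data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    loss.backward()          # DDP all-reduces gradients here, behind the scenes
    optimizer.step()         # every replica applies the same averaged update
```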
  • Zero Redundancy Optimizer (ZeRO)
    ZeRO is a family of techniques designed to fix DDP's memory waste by dividing the bookkeeping so that each GPU carries less.

    • ZeRO Stage 1: shard the optimizer state

      • How it works: the manuscript (model) is still one copy per person, but the thick reference material (optimizer state) is split up, and each reviewer keeps only part of it.
      • Communication flow: after reviewing, everyone exchanges feedback so that whoever is responsible for updating each page receives all the relevant suggestions. This uses ReduceScatter ("everyone's feedback is combined, but each person takes away only the final result for their own small portion") and AllGather ("each person contributes their updated piece, the complete new manuscript is reassembled, and everyone ends up with a copy").
      • Effect: communication volume is almost the same as DDP, but memory is saved "for free", since no single GPU needs to hold the full optimizer state.
    • ZeRO Stage 2: shard the gradients and the optimizer state

      • How it works: going further, no one even writes out a complete draft of their edits (gradients) anymore. As soon as a page has been reviewed, the edits are sent to the colleague responsible for that page and the local draft is thrown away.
      • Effect: again, communication does not grow significantly, yet the memory for storing full gradients is saved as well. This is "effectively also free".
    • ZeRO Stage 3 / FSDP: shard everything

      • How it works: this is the most extreme mode, for when even the manuscript (model) is too large for one person's desk (a single GPU). Now the manuscript itself is split: each reviewer holds only a few chapters (a shard of the model parameters).
      • Communication flow: communication now happens at every step.
        • Forward pass: to review chapter 4, you first borrow the complete chapter 4 from your colleagues via AllGather, and after reviewing it you hand it back to free up desk space.
        • Backward pass: to propose edits, you again borrow the complete chapter 4, compute the edits, then use ReduceScatter to route all feedback about chapter 4 to the colleague responsible for it.
      • Cost: this mode drastically reduces per-GPU memory, but communication overhead grows significantly; it is no longer a free lunch.
  • The overlooked memory consumer: model activations

    • During training, a large number of intermediate results (activations) must also be stored temporarily so that gradients can be computed in the backward pass. This memory grows in proportion to the batch size.
    • ZeRO and related techniques mainly optimize the storage of the model, gradients and optimizer state; they do not directly relieve the activation-memory pressure. The lecture notes that the common remedy here is gradient checkpointing, which skips saving some activations in the forward pass and recomputes them during the backward pass, trading extra compute for scarce memory (a minimal sketch follows below).
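A minimal sketch of gradient (activation) checkpointing in PyTorch; the block structure here is illustrative:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, width: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside each block are not stored; they are recomputed
            # during the backward pass, trading extra compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(32, 1024, requires_grad=True))
out.sum().backward()
```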

Parameter-Efficient Fine-Tuning (PEFT)

When hardware is extremely limited and even full fine-tuning is out of reach, PEFT is the way out.

  • Core idea: rather than updating all of the model's billions of parameters, fine-tune only a small fraction of them and let that small lever move the whole model.

  • LoRA (Low-Rank Adaptation)

    • How it works: LoRA is one of the most popular PEFT methods. It is like adding a precise tuning knob (the LoRA module) to a powerful Swiss Army knife (the pretrained model), rather than redesigning the whole tool.
    • Technical principle: LoRA builds on the observation that the parameter change during fine-tuning (the update matrix) is usually low-rank, meaning its essential information can be expressed with far fewer parameters. LoRA therefore freezes all of the original model's parameters and attaches two small, trainable low-rank matrices (A and B) next to specific layers (such as the Transformer's Q and V matrices); only these small matrices are updated during training.
    • Advantages:
      • Extremely resource-efficient: the number of trainable parameters can be on the order of 0.01% of full fine-tuning, greatly reducing memory and compute needs.
      • Strong performance: on many tasks LoRA matches or even beats full fine-tuning, since tuning only a small subset of parameters can act as a regularizer and reduce overfitting.
      • Flexible deployment: only a small pair of A and B matrices needs to be stored per downstream task, so switching the model between tasks is easy and there is no need to keep multiple huge full models.
    • Recommended configuration (see the sketch after this list):
      • Where to apply: the Transformer's query (Q) and value (V) matrices.
      • Rank (r): start small, e.g. r = 8, which usually works well.
      • Scaling factor (alpha): usually just set to 1.
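In practice this configuration is often expressed through the Hugging Face peft library rather than written by hand; a minimal sketch under that assumption (target module names such as c_attn or q_proj/v_proj depend on the model family):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

config = LoraConfig(
    r=8,                        # rank of the A/B update matrices
    lora_alpha=1,               # alpha scaling factor
    target_modules=["c_attn"],  # GPT-2 fuses Q/K/V here; Llama-style models use ["q_proj", "v_proj"]
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the LoRA A/B matrices are trainable
```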

Practical Training Workflow Summary

The lecture ends with a clear decision flow chart for choosing an efficient-training strategy in a project:

  1. Step 1: always use mixed precision training.

    • If your GPU supports it (e.g. an A100), prefer BFloat16; it is simpler and more robust.
  2. Step 2: check whether the smallest batch (batch size = 1) fits on a single GPU.

    • If it fits: great; try a larger batch size, and (in a multi-GPU setup) use ZeRO Stage 2 directly to save further memory and get more out of the GPUs.
    • If it does not fit (OOM) and you have multiple GPUs:
      • Try ZeRO Stage 3 (FSDP), which resolves OOM by sharding the model itself.
      • If that is still not enough, consider gradient checkpointing, trading compute for memory.
  3. Step 3: if full fine-tuning is simply not viable:

    • Switch to parameter-efficient fine-tuning (PEFT), with LoRA as the first choice.
    • Recommended LoRA starting point: apply LoRA to the Q and V matrices, with rank = 8 and alpha = 1.

Note: the ZeRO strategies above apply mainly to multi-GPU setups. On a single GPU with insufficient memory, besides mixed precision and gradient checkpointing, PEFT (e.g. LoRA) is the key direction.