speaker 1: Okay, cool, let's just get started. Welcome everyone to lecture twelve. So far we've learned a lot about how we convert words into vectors, how we convert sentences into vectors, and how we take actions in the real world using that, like classifying documents; we learned about transformers, we learned about pretraining. Today is going to be a little bit different. I'm going to be talking about how you can train large models on GPUs and a few basics about how these ML systems work. It has nothing to do with natural language at all, but hopefully it's going to be useful for final projects. So I'm going to spend some time on mixed precision training, some time on multi-GPU training with DDP and FSDP, and hopefully by the end of the lecture these terms will make sense, and some time on parameter-efficient fine-tuning. But before we get into the lecture, just some announcements. Proposal grades are going to be coming out shortly, hopefully by the end of the day. Thank you so much for all the hard work; I know it's getting a little crammed with deadlines for assignment four and the project proposals. The other thing is that the project milestone details should be out shortly, if not already out on the website. It's worth 5% of the overall grade, it's due twelve days from now, and it's a maximum of two pages. Really, the way to think about the milestone is as a forcing function to get work done for your final project. And with that out of the way, let's jump into the material. So I'm going to start by thinking about how parameters and gradients, and numbers in general, are represented in computers, and I promise it's going to be relevant to deep learning pretty soon. Let's start with floating point. How many people here are familiar with this cartoon depiction of fp32? Okay, so some of you. So let's recap how floating point numbers are represented in computers. First, fp32: that's 32 bits, so the memory requirement is four bytes. So if you're thinking about neural networks, for every single neural net parameter you need four bytes of GPU memory. And the way to convert this cartoon into a real number is something like this: the first bit there is the sign, then the stuff in green represents the range, and the stuff in blue represents the precision. For fp32, you can represent a pretty large range and it's fairly precise. The more bits we have in green, the more numbers you can represent, meaning both smaller numbers and larger numbers; and the more bits we have in blue, the greater the precision in representing actual numbers. Another popular data type that takes half the memory of fp32 is fp16. The way we reduce memory is that we reduce the stuff in green, so there's less dynamic range, and also the stuff in blue, so there's less precision. But the good thing is that we save memory: we slash memory requirements in half.
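To make that range versus precision trade-off concrete, here's a minimal sketch of how you can inspect these data-type properties yourself in PyTorch; the screenshot I'll show in a minute does essentially this:

```python
import torch

for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    # eps:  smallest number you can add to 1.0 and still see a change (precision)
    # tiny: smallest normal number; much smaller values underflow toward zero (range)
    # max:  largest representable value before overflowing to inf (range)
    print(f"{dtype}: eps={info.eps}, smallest normal={info.tiny}, max={info.max}")
```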
So let's think of a scenario where you're trying to train a big neural network, your model parameters and gradients are represented in fp32, you start training, and suddenly you get an out-of-memory CUDA error. Based on what you've seen so far, one possible solution is to cast everything into fp16, and if you do that, you reduce memory usage by half. So let's work through some possible problems with doing that. Like I said, because there's less stuff in green, there's less range, which means a lot of very small numbers will get flushed to zero and a lot of really large numbers will overflow into infs and NaNs. And there's also less precision, because you have fewer bits in blue, which means you get rounding errors; for example, 1.0001 gets rounded to 1 in half precision. And I have a little screenshot of how you can test various properties of data types. The things to look at are the epsilon and the smallest normal. The epsilon is the smallest number such that if you add it to one, you don't lose any precision; if you add a number smaller than the epsilon to one, it just gets rounded back down to one. And the smallest normal is the smallest number that can be represented in fp16: anything smaller than that goes straight to zero. And for neural network training, if a lot of small numbers get rounded down to zero, that's actually not good. So here is a diagram that I took from an NVIDIA blog post showing a histogram of gradients during the course of training, and more than half of these gradients would literally just get set to zero in fp16, which is a problem, and that has to do with the range of fp16. The second problem is with precision: we have less precision, so our updates are not going to be precise. Okay, so here's one possible solution: we're going to use fp16, but we're also going to use fp32. That's the high-level idea. What we're going to do is maintain a copy of the model in fp32, and let's call those the master weights. Then you get a little bit of data and run a forward pass, but you run it by converting from fp32 into fp16. Then you run a backward pass and get your gradients in fp16, so everything so far has happened in fp16. Then you take your gradients, upcast them into fp32, and update your master weights. And once you've updated your master weights, you copy them back into the fp16 version of the neural network. So this seems like a reasonable scheme: I'm using fp16 on my GPU, but I have the full 32-bit precision lying around somewhere so I can have more precise updates. Okay, can someone tell me why this is still problematic? Any guesses? speaker 2: Wouldn't it at least be slow, because you have to copy the 32-bit version back and forth? speaker 1: Yeah, so that's a good point. In practice you can often overlap that I/O with the forward and backward passes, so this is usually not a problem, but yes, if your network is very, very small, it could be. speaker 2: Individual gradients are usually fairly small, and when you compute them in fp16, they may already have been flushed to zero somewhere before you ever copy them into fp32. speaker 1: Yeah, so that's pretty much the right answer. Let's go back to the diagram we had. This shows gradients in the backward pass, and I said that we're going to compute all our gradients in fp16.
What's going to happen? Most of them will just get flushed to zero, which is something we really would like to avoid. Okay, so here's a possible solution. You get your batch of data, you run your forward pass in fp16, and you get your loss. Then you scale the loss by some large value, say 100 or 1,000, and then you compute gradients; because the loss was scaled, all of your gradients get scaled by that same large number. So everything that was on the left-hand side of this red line gets shifted to the right, and hopefully there's less stuff that gets rounded down to zero. Then you take your gradients in fp16, copy them into fp32, divide by the scaling factor, and update your master weights. This solves both of the problems we talked about, and this is basically what we call mixed precision training. It's relatively simple to implement in PyTorch: you instantiate this GradScaler object, run your forward and backward passes within this autocast context, scale the gradients back down, and update your model parameters. But this seems a little complex. We have to deal with scaling the loss and then scaling it back down. What if you multiply by 10,000 and that leads to NaNs? Then you have to update your scaler and, in the next iteration, multiply by 1,000 instead; you have to keep adjusting to the training dynamics. So we'd like to not do gradient scaling. Can we do something better? Recall the role of the bits in green: they tell you the dynamic range of the data type. We needed scaling because fp16 has a much smaller range than fp32, and because of that, fp16 cannot represent very small numbers. If you have something smaller than about 6e-5, it gets rounded down to zero, and that's because of the dynamic range of fp16. So how do you solve that? speaker 2: Sacrifice precision for more range? speaker 1: Absolutely, yeah, that's the right answer. We're going to sacrifice precision. That's the idea behind bfloat16, which stands for brain float 16. You have exactly the same number of bits for representing the range as fp32, eight exponent bits, so it has the same dynamic range as fp32, but a lot less precision. And it turns out that this is okay for neural network training. Now if you use bfloat16, you don't need grad scalers anymore; it's as simple as wrapping your model's forward and backward passes in the right context. The one caveat about bfloat16 is that it's not available on all GPUs. You need an Ampere or newer NVIDIA architecture, which the A100s and A6000s have, as well as the newer H100s, but if you have an older GPU, you might not be able to use bfloat16. speaker 2: So it gives up precision, but keeps the same number of exponent bits? speaker 1: Yeah, exactly.
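Here's roughly what those two recipes look like in PyTorch, as a minimal sketch on a toy model (the model, data, and hyperparameters are just placeholders):

```python
import torch

model = torch.nn.Linear(512, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(32, 512), torch.randint(0, 2, (32,))) for _ in range(10)]

# fp16 mixed precision: needs a GradScaler to avoid gradient underflow
scaler = torch.cuda.amp.GradScaler()
for x, y in data:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()   # scale the loss, backward pass in fp16
    scaler.step(optimizer)          # unscales gradients; skips the step if it sees infs/NaNs
    scaler.update()                 # adjusts the scale factor for the next iteration

# bf16 mixed precision: same dynamic range as fp32, so no scaler needed
if torch.cuda.is_bf16_supported():  # the "check with a torch command" mentioned later
    for x, y in data:
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()
```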
So here are some results from fine-tuning a model for sentiment classification on a single A100. At the very top is float64, which is a really, really rich 64-bit representation of floating point numbers: it takes about 25 minutes, you get a pretty high accuracy, but it also takes a lot more memory. All the way at the bottom we're using mixed precision training with bfloat16, and now we've reduced training time by roughly a third, with more or less the same accuracy (a little bit better, actually, because there's some regularizing effect from the half-precision representation), and a lot less memory. And the reason we see speedups in training is that matrix multiplies tend to be faster when you're multiplying in half precision. Okay, before we move on, are there any questions about this? Okay, cool. So let's keep going and change the setting: now we have more than one GPU, and we want to train a network over all of the GPUs that we have. Let's start with some basics. Here's a cartoon showing a model and an optimizer receiving some data from a dataset, and let's work through what's stored in GPU VRAM. This is going to be somewhat of a lie, and I'll point out what the lie is soon, but just to keep things simple: we have the neural net parameters, and let's say we're doing mixed precision training, so they're stored in fp16. Then we have an optimizer, and when I first saw this a few years back, I was very surprised to see that optimizers also need memory. But if you're using something like Adam, you need to store the Adam momentum term and the Adam variance, and every time you get a gradient, you update the momentum and variance; that's what you use to update your parameters. And because you're doing mixed precision training, these have to be kept in fp32. So that's what the picture looks like if you have a single GPU. Now let's say we have multiple GPUs. What we'd like to do is divide our dataset, so if we have four GPUs, we divide the dataset into four parts, and we maintain a synchronized copy of the model, so every replica receives its own slice of the dataset. In the beginning we have a synchronized model and everyone has their own copy. We run a forward pass; this forward pass sees different data points on each GPU, so every replica has different activations and, correspondingly, different gradients. You run a backward pass, and every replica has a different gradient because it saw different data points. Then we run a synchronization step, and what synchronization does is communicate gradients between the different workers. So I'm going to introduce the first MPI primitive of this lecture, and that primitive is called the all-reduce operation. What an all-reduce does is take four pieces of information, in this example on four different GPUs, merge everything together, and then distribute the result to all of the GPUs. The communication overhead of doing that is two bytes per parameter, because, remember, we have fp16 gradients, so two bytes per gradient, and that's what needs to be communicated. So that's the all-reduce operation. Once gradients have been communicated that way, say by gathering them on one worker and distributing the cumulative gradient, every optimizer has the full gradient, and the optimizer can update the model so that you maintain synchronization. And that's the basic setup known as distributed data parallel, or DDP.
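In PyTorch this whole recipe is the DistributedDataParallel wrapper, which runs the gradient all-reduce for you inside backward(). Here's a minimal sketch on a toy model (names and sizes are made up); you'd launch it with something like `torchrun --nproc_per_node=4 train.py`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")       # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(512, 2).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(1000, 512), torch.randint(0, 2, (1000,)))
    sampler = DistributedSampler(dataset)         # each rank gets its own slice of the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
        loss.backward()                            # gradients are all-reduced across ranks here
        optimizer.step()                           # every rank applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```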
Okay, so that's good, but it turns out that DDP has really poor memory scaling. Let's go through the math for how much memory is needed. We have the model parameters in fp16, because we're doing mixed precision training, so that's two bytes per parameter. We also have the gradient in fp16, so two bytes for the gradient. And then we have the stuff in green: say we're doing Adam, so we need to store the master weights (which we need regardless of whether we're doing Adam), plus the momentum and the variance. That's twelve extra bytes per parameter, and all of it has to be stored on every single GPU. So the question is, can we do better? Now things are going to get a little more tricky, so if you have questions, just stop me and we can go from there. The way we're going to improve our memory scaling is with a set of techniques collectively known as ZeRO, which stands for zero redundancy optimizer. This is a set of techniques released by Microsoft as part of the DeepSpeed project. The idea is that instead of having every GPU maintain all of the state, and by state I mean the stuff in blue, the stuff in orange, and the stuff in green, we're going to shard it, so that no single GPU has all of the parameters or all of the gradients or all of the optimizer state, but through communication they can stay synchronized. That's pretty much the sketch of what this is going to look like. So let's look at stage one; ZeRO has multiple stages, one, two, and three. In stage one, we shard the stuff in green, which was the optimizer state. The way we shard it and still maintain synchronization looks like this: every GPU has the full set of parameters in fp16, and every GPU has the gradient for its own data, but it only has a shard of the full optimizer state. The other requirement is that every GPU is responsible for updating the parameters corresponding to its own shard. Step by step, it looks like this. Every GPU has its own data, and every GPU computes a gradient on its subset of the data. Then we perform a reduce-scatter; this is the second MPI operation of the lecture, after the all-reduce. What a reduce-scatter does is this: every GPU has the full gradient on its own data, and you want to take the chunk corresponding to, say, GPU 1 and communicate it to GPU 1. So say you're GPU 0 and you've computed the full gradient for all the parameters; you want to send the chunk for GPU 1 to GPU 1, and the same for GPUs 2 and 3. From your full gradient, you communicate to each worker just the chunk that worker is responsible for, and every GPU does that. That's a reduce-scatter. And then, once every worker has the reduced gradient corresponding to its shard, it updates those parameters.
Once they have updated their shard, they perform an all-gather. What that means is this: say you have a neural network with just eight parameters, two parameters assigned to each GPU. At the end of the update, each GPU has updated its own subset of parameters, and then they do an all-gather to maintain synchronization, so every GPU ends up with the full set of updated parameters. speaker 2: Each GPU is maintaining its own piece and you're not merging things the other way, so what makes this more efficient? speaker 1: Sorry, I couldn't quite hear; the question is why this is better than the previous approach, right? So what we're doing is sharding the optimizer state. In the running example, we have a neural network with eight parameters. Earlier, every GPU needed the optimizer state for all eight parameters; now every GPU maintains optimizer state for only two parameters. After the reduce-scatter is done, you have the reduced gradient for just those two parameters, so the optimizer state covers just those two parameters, and the model updates only those two parameters using that partial optimizer state. speaker 2: So you'll eventually get the rest of the parameters back? speaker 1: Right. You have the entire set of parameters, all the stuff in blue, and you have the full gradient for your own data, but you don't have the full optimizer state. So you can only update the parameters for the bits of optimizer state you have. In the running example I just made up, GPU 0 updates two parameters, GPU 1 updates two parameters, and so on, and then they communicate the updated parameters to maintain synchronization. More questions about this? Okay, so let's step back. So far we've looked at three MPI operations: all-gather, reduce-scatter, and all-reduce. It turns out that an all-reduce is actually equivalent to running a reduce-scatter followed by an all-gather. And recall that for DDP, all we had to do was an all-reduce, and we computed its communication overhead. It turns out that when you do this optimizer state sharding, you have exactly the same communication overhead, precisely because an all-reduce is equivalent to a reduce-scatter followed by an all-gather. So we basically saved memory for free, and you should just always use this: you get memory savings without any additional communication overhead.
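Here's a small sketch of that identity using raw torch.distributed collectives (assuming it's launched on four GPUs with torchrun; the eight-parameter gradient is the running example from the lecture):

```python
# torchrun --nproc_per_node=4 collectives_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

# pretend this is one rank's gradient for an 8-parameter model
grad = torch.arange(8, dtype=torch.float32, device="cuda") * (rank + 1)

# option 1: all-reduce, so every rank ends up with the summed gradient
summed = grad.clone()
dist.all_reduce(summed, op=dist.ReduceOp.SUM)

# option 2: reduce-scatter then all-gather; same result, but in between
# each rank only holds the summed gradient for its own 2-parameter shard
shard = torch.empty(grad.numel() // world, device="cuda")
dist.reduce_scatter_tensor(shard, grad, op=dist.ReduceOp.SUM)
# (in ZeRO stage 1, each rank's optimizer step for its shard would happen here)
full = torch.empty_like(grad)
dist.all_gather_into_tensor(full, shard)

assert torch.allclose(summed, full)   # all-reduce == reduce-scatter + all-gather
dist.destroy_process_group()
```

PyTorch also ships this stage-one idea directly as torch.distributed.optim.ZeroRedundancyOptimizer, which you can drop in as the optimizer under DDP to shard optimizer state across ranks.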
Okay, so we're happy, we saved memory, and now we want to shard even more things. Let's do ZeRO stage two. Along with sharding the stuff in green, the optimizer state, I'm also going to shard gradients. This is going to be a little more complex, because we still need the full gradient for each worker's data slice, but each GPU only has enough memory to instantiate the gradient for a small subset of parameters. So how do we deal with that? We're actually never going to instantiate the full gradient vector. Instead, whenever a GPU computes a gradient in the backward pass, it temporarily instantiates a buffer for the parameters it just got gradients for, computes the gradient, sends it to the right worker, and then frees the memory it just created. That's the sketch; let's go through it step by step. We have four workers. Each worker performs a backward pass, and the backward pass happens layer by layer; recall the lecture on backpropagation: you have the loss, and then layer by layer you compute gradients. So say you're at layer j. You take the upstream gradient and compute the gradient for the parameters at layer j. Immediately, the moment you compute those gradients, you send them to the right worker: there exists some worker that is responsible for layer j, and every GPU that has just computed the gradient for layer j on its data slice sends it to that worker. The moment you've done that, you deallocate the memory you just created. This is kind of a fourth MPI operation, but it's really not very different from a reduce-scatter; it's just a reduce: four GPUs have a gradient, and they communicate it to whoever is responsible for maintaining the gradient for that layer. Then the worker responsible for a given layer updates its parameter shard using the full reduced gradient it received through this communication, along with its optimizer state. And at the end, to synchronize everything, you perform an all-gather as before. Any questions about this high-level sketch? Okay, let's keep moving. Recall that ZeRO stage one was basically free, because an all-reduce is equivalent to a reduce-scatter plus an all-gather. We're doing essentially the same thing here, a reduce followed by an all-gather, so this is practically also free. So we've managed to save memory without any communication overhead compared to DDP so far. Let's keep going and see if we can shard even more things; I think someone in the audience alluded to this early on. What happens if you shard even your model parameters? Say you run into a situation where, forget the optimizer state, even your model wouldn't fit on a single GPU. In that case, you split up your model across all the GPUs: you shard your model parameters, the stuff in blue. The caveat is that now we're not going to get the memory savings for free; there's going to be some communication overhead. This is ZeRO stage three, the final stage of ZeRO, also known as FSDP, fully sharded data parallel, for anyone who's heard that term before. Here's the high-level sketch, and I feel like this is actually the easiest one to understand compared to stages one and two, just because there has to be communication at every step of the way; you can't get away without communicating. The first thing we do is take our model and convert the entire model into FSDP units. So here's a sketch of a simple deep neural network, and I'm going to convert it into multiple FSDP units.
Three FSDP units here. An FSDP unit is just a data structure; we haven't done anything yet. Then I take each FSDP unit and convert it into another data structure called a flat parameter, and I assign a subset of that flat parameter to every single GPU. Here we have 16 GPUs and a flat parameter consisting of 14 parameters, plus some extra padding so that things divide evenly, and I assign each shard to a distinct GPU. That's basically a complicated way of saying that we created some data structures and divided the model parameters up across the GPUs, so every GPU gets a subset of the model parameters. Now let's think about what the forward pass looks like. No single GPU has the full set of parameters. So you're running a forward pass, and say you're at layer four: no GPU has all of layer four, so you have to communicate. You do an all-gather, the operation we used to collect pieces that live on multiple GPUs so that every GPU has the full thing. You perform the all-gather, so now you have all the pieces of layer four; you run the forward pass through it; and since you no longer need layer four, you discard the parameter shards you gathered. Then you have to run your backward pass: you've computed your loss, and now, say you're back at layer four. You have your upstream gradient, but you don't have layer four anymore, so you need another all-gather to get all the parameters of layer four, and then you run the backward pass for layer four and compute the gradient on your own data. Recall that every GPU has different data points, so every GPU ends up with different gradients. So for layer four you do an all-gather, get all the parameters, compute the gradient, and then, because every GPU has a different gradient, you do a reduce-scatter so that the full reduced gradient for each chunk of layer four ends up on the GPU responsible for that chunk. So that's basically full FSDP. Once you've run the forward and backward pass, each GPU updates its own parameter shard using the full gradient it just received, and then you synchronize.
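In PyTorch this is the FullyShardedDataParallel wrapper. Here's a minimal sketch on a toy transformer, launched with torchrun like the DDP example; the wrap policy shown here (each TransformerEncoderLayer becomes its own FSDP unit) is just one reasonable choice, and we'll come back to why that policy matters at the end of the lecture:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.nn import TransformerEncoderLayer

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# stand-in model; in practice this would be your big pretrained transformer
model = torch.nn.TransformerEncoder(
    TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6
).cuda()

# the wrap policy decides how layers are grouped into FSDP units
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={TransformerEncoderLayer}
)
model = FSDP(model, auto_wrap_policy=wrap_policy)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 128, 512).cuda()
loss = model(x).mean()   # each unit's parameters are all-gathered inside forward, then freed
loss.backward()          # all-gather + reduce-scatter happen per unit inside backward
optimizer.step()         # each rank updates only the parameter shard it owns
dist.destroy_process_group()
```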
So let's do a quick review of everything we've looked at so far. There's DDP, where you don't shard anything: you have the full model, the full gradient, and the full optimizer state on every single GPU, and all you divide up is the dataset. So if you have a dataset of a thousand examples and four GPUs, every GPU gets 250 examples. You compute a forward pass and a backward pass, every GPU has a different gradient, you communicate that gradient, and then you synchronize; that's the all-reduce operation in MPI terms. Then we looked at ZeRO, where we want to save memory: we don't want the full footprint of the model, gradients, and optimizer state on every single GPU. In ZeRO stage one, we shard the optimizer state, so you don't maintain the full optimizer state on every GPU; you break it up across all the GPUs you have. And we saw that the communication overhead of maintaining synchronization in ZeRO stage one boils down to just an all-reduce, through the identity that says an all-reduce is a reduce-scatter plus an all-gather. So with ZeRO stages one and two, we save memory for free, and you should just do it. With ZeRO stage three, things get a little more complex, because you divide up your model parameters as well as the optimizer state and the gradients. While you're running your forward pass, you have to communicate to get the full parameters for each layer, layer four in our example, and you also have to do an all-gather in the backward pass, and then a reduce-scatter so that the full gradient for each chunk of parameters goes to the right GPU. Overall that's two all-gathers plus a reduce-scatter, which is a lot more overhead than stages one and two, but if you don't have enough GPU VRAM to even load your model onto a GPU, that's what you have to do. Any questions about MPI primitives or the stages of ZeRO or FSDP? Okay, cool. Now I'm going to fix the lie I told earlier about the GPU VRAM calculation. I said there are just the model parameters, the gradients, and the optimizer state, but there's one final thing: the model activations. We've all seen that as you keep increasing the batch size, there's a point where the GPU says it can't fit any more, and that's because you also need to store the model activations for the backward pass, and that scales linearly with the batch size: the larger the batch size, the more activations need to be stored. By the way, if you're doing mixed precision, these are in fp16 or bf16, but they still scale with the batch size. So that's the other thing you have to think about, and none of the techniques we've looked at so far help with sharding model activations. Okay. So we've looked at a bunch of basics of floating point and multi-GPU training, and it boils down to a very simple flowchart that you can use for your final projects when you're fine-tuning models. First, always use mixed precision training; you barely ever see a hit in performance, and by performance I mean generalization, F1, accuracy. And if you're using the newer Ampere-or-later architectures, the A100s, the A6000s, or the H100s, always use bfloat16; it's just better, and you can check for support with that torch command. Then ask yourself: does batch size one fit on a single GPU? If it fits, try a larger batch size, since batch size one is too small, and use ZeRO stage two; ZeRO stage two is free, so just use it and increase your batch size. If you can't fit even batch size one, then see whether ZeRO stage three fixes your out-of-memory issues, because now you're sharding your model parameters too. And all of this is in the context of full fine-tuning, where I'm fine-tuning all of my model parameters. Sometimes the answer to that question is also no: you can't fully fine-tune your model on four A100s or A6000s or whatever you have. You've tried ZeRO stage three, you've tried mixed precision training, you have a batch size of one, maybe you even tried gradient checkpointing, that is, activation checkpointing, and nothing works.
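As an aside, activation checkpointing, which comes up again in the flowchart, is the one lever that targets activation memory: instead of storing a block's activations for the backward pass, you recompute them. A minimal sketch with torch.utils.checkpoint (the two toy blocks are made up for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
block2 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

x = torch.randn(64, 512, requires_grad=True)

# block1's intermediate activations are not stored; they are recomputed
# during backward, trading extra compute for lower activation memory
h = checkpoint(block1, x, use_reentrant=False)
loss = block2(h).sum()
loss.backward()
```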
So at that point you basically can't do full fine-tuning, and the thing to do is to try parameter-efficient fine-tuning, which gives you a lot more memory savings. Okay, so let's talk about parameter-efficient fine-tuning. Why is it called parameter-efficient fine-tuning? In full fine-tuning, you run a forward pass and a backward pass and you update every single model parameter. In parameter-efficient fine-tuning, you only update a small subset of the full set of parameters. And why would you want to do that? Maybe you're in a setting where you cannot fully fine-tune, even with batch size one; you've tried every trick possible and it just won't fit, so you have to do parameter-efficient fine-tuning. The other possible reason is slightly more scientific: these models are heavily over-parameterized, you have a small dataset, and you believe that with parameter-efficient fine-tuning you'll get better generalization, or at least match full fine-tuning. There's a second reason for wanting efficient adaptation. The plot on the right shows, in red, the estimated growth in training compute for the largest AI models, and the line in blue is the global compute capacity. Very soon the red line overshoots the global compute capacity, meaning we'd need more compute than is globally available, which is not sustainable. And there are arguments to be made that if we keep going down this route, AI development becomes concentrated in the hands of only a few well-funded organizations, and as students we can't participate, which is a problem. Also, if only a small number of players are training and fine-tuning models, they may bias the models in ways that reflect their own value systems rather than the broader public's. So that's another reason to think about efficient adaptation. There's also a paradigm in machine learning in general, and in NLP specifically, of focusing heavily on accuracy rather than efficiency. The plot on the right here shows the percentage of papers whose main contribution is a method that produces more accurate models, versus methods that get the same accuracy more efficiently, and at most conferences the vast majority of papers are about accuracy, with very few about efficiency. Maybe this is leading to a kind of monoculture, and maybe that's another reason to focus on efficiency. A second, maybe bigger concern is the huge hidden environmental cost of training and fine-tuning large language models. I was reading a report which said that the cost of training GPT-3 was equivalent to 1.1 million tons of carbon emissions, or some such number, and they estimated that that's like running a coal power plant for ten hours straight. And for an example closer to home: in the reinforcement learning class there was a homework assignment, not the final project, where a lot of students implemented one or two common algorithms that outperformed everything else but used a lot more power.
And someone did the calculation that if everyone had used the more efficient algorithm, it would have reduced the power consumption of the class by about 880 kilowatt-hours, which is roughly what an American household uses in a month. So these are all reasons to think about efficiency and about how you can fine-tune models with fewer resources. So let's jump back into parameter-efficient fine-tuning, and let's start by recapping what full fine-tuning is. Any questions so far about any of this? Okay, so let's recap full fine-tuning. Say we have some large pretrained autoregressive language model, say a GPT, and maybe we want to use it for summarization, or for semantic parsing, like converting natural language into SQL commands, or to answer questions about paragraphs. What do we do? We collect a dataset of (x, y) pairs and then we do full fine-tuning, where we update all of the model parameters based on the gradient of some loss function. And maybe that's not feasible: GPT-3 has 175 billion parameters, so there are a lot of parameters to learn. And even once you've done full fine-tuning, you have to store all of those parameters, and if you're doing several tasks, you have to store a full set of parameters for every task. So can we do better? The main idea is that instead of updating all of the parameters, I'm going to update a much smaller number of parameters. Instead of finding a delta theta that's the same size as the entire set of parameters, I search over a much smaller space. The added benefit is that I can store this much smaller delta easily on disk; hopefully it requires less compute, and hopefully it generalizes almost as well as full fine-tuning. There are many ways of operationalizing this high-level idea of parameter-efficient fine-tuning, and the one I'm going to talk about today is LoRA, which stands for low-rank adaptation. It comes from the observation that when you fine-tune big language models and look at the structure of the weight updates, they tend to have a low intrinsic rank. Do people remember rank and SVD? All right. So the updates tend to have a low intrinsic rank, and what the authors realized is that instead of fine-tuning the entire weight matrix, you could instead learn a much smaller, say rank-r, update for every full-rank matrix in the model. So say we have some pretrained weight matrix W0. Instead of applying an arbitrary update to it, I'm going to require that the update has the following form: it's the product of two low-rank matrices, B and A, where A is an r-by-k matrix and B is a d-by-r matrix, and the rank r is much, much smaller than both the incoming dimension and the outgoing dimension. And then there's the term alpha, which you can think of as a kind of trade-off between the knowledge that's already stored in the pretrained model and the additional knowledge you want to add to the model.
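Written out, the LoRA forward pass looks like this (a sketch of the standard parameterization; folding alpha in as a scale of alpha/r is how the original LoRA paper implements the trade-off knob described above):

```latex
h = W_0 x + \Delta W\, x
  = W_0 x + \frac{\alpha}{r}\, B A\, x,
\qquad A \in \mathbb{R}^{r \times k},\quad
        B \in \mathbb{R}^{d \times r},\quad
        r \ll \min(d, k)
```

Only A and B are trained; W0 stays frozen, and in the paper B is initialized to zero so that the update is zero at the start of fine-tuning.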
So if alpha is zero, you're not doing anything; if alpha is something really, really small, then you don't want to change your model parameters very much and you just want to add a small amount of task-specific knowledge. Additionally, the only trainable parameters here are A and B. The other thing to note is that since I'm representing the update as this product B times A, as I increase r this converges towards full fine-tuning, so you have a slider you can use to control how much fine-tuning you want to do. The other important thing is inference latency. You can store these learned low-rank matrices for every task, and whenever you switch to a different task, you just remove the extra term you added to each weight matrix for the old task and add in the task-specific term for the new task you want to run inference on. And the cost of storing these much smaller matrices is also way lower than storing the full delta. We'll see in a moment where you should apply LoRA, but generally you want to apply it to the weight matrices in self-attention. In code, it actually looks fairly simple. In a regular forward pass, you compute the hidden state as, say, the product of the weight matrix and the incoming feature vector. With LoRA, you freeze the model parameters, compute h as before, and then add this additional offset term, and that offset is the only thing that's trainable. That's pretty much all you have to do; you just do it for every weight matrix in every layer. speaker 2: There's an alpha term in the second line of code; where do you define alpha in the first one, or do you just leave it out? speaker 1: So yes, you define the alpha somewhere. If you set it to one, that's like saying you want an equal trade-off between the pretrained knowledge and the new task-specific knowledge. Typically people set it to one. You could set it to something larger than one if you believe your task is something the pretrained model has no idea about, or something smaller than one if you don't want to change the model too much. So that's basically LoRA in practice.
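Here's a minimal sketch of that frozen-weights-plus-trainable-offset forward pass (the class and variable names are made up, and the alpha/r scaling and zero-initialized B follow the LoRA paper rather than anything specific shown on the slide):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 1.0):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad_(False)                                 # freeze W0 (and its bias)
        d_out, d_in = pretrained.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))  # r x k
        self.B = nn.Parameter(torch.zeros(d_out, r))                # d x r, zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        h = self.pretrained(x)                                      # W0 x, the regular forward pass
        return h + self.scale * ((x @ self.A.T) @ self.B.T)         # + (alpha / r) * B A x

# e.g. wrap the query and value projections of each attention layer with this
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=1.0)
out = layer(torch.randn(4, 768))
```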
So I said there are a bunch of different parameter-efficient fine-tuning methods. I'm not even going to name all of them: there are adapters, which some of you may have heard of; there's BitFit, which isn't on this slide; there's p-tuning; there are lots of them. But it turns out that compared to a lot of these methods, LoRA is pretty high-performing on a bunch of different tasks for these relatively smaller models. And if we look at fine-tuning some of the bigger models like GPT-3 and compare against other parameter-efficient methods, with full fine-tuning at the top, then BitFit, where you only fine-tune the bias terms, and then adapters, LoRA requires a lot fewer additional parameters to store and gives you a good accuracy trade-off, comparable to full fine-tuning; sometimes there's even a regularizing effect from fine-tuning only a small subset of your model parameters. Okay. So the question is which parameters you should apply LoRA to, and I said you want to apply it to the learned weight matrices inside self-attention. Generally, the rule of thumb is to apply it to the matrix that takes your hidden state and converts it into queries, and the matrix that converts your hidden state into values; apply LoRA to those and that's pretty much going to give you the best performance overall. The other hyperparameter for LoRA is the rank. Recall there are these two matrices, B and A, that are both low rank, and it turns out that already with a really small rank you get pretty high performance, and that rank is much, much smaller than the hidden-state dimensions of the matrices in most models these days. All right. So we've covered a bunch of things: floating point and mixed precision training, multi-GPU training with DDP and FSDP, and LoRA, and it boils down to a very simple flowchart that you can use for your project. So if you were sleeping through the entire lecture, now is the time to wake up and look at this flowchart. Always use mixed precision training. If you have the newer Ampere architectures, use bfloat16. Try with a batch size of one; if batch size one fits, try a larger batch size, and always just use ZeRO stage two. If batch size one doesn't fit, try ZeRO stage three, and maybe try gradient checkpointing, activation checkpointing. speaker 2: Quick question: this assumes more than one GPU, right? Because otherwise ZeRO stage two doesn't really help us. speaker 1: Oh yeah, all of this applies only if you have more than one GPU. If you have a single GPU, you have to do other things; maybe you have to heavily quantize the model, and even then I don't think you can fine-tune some of the bigger models. So, assuming you have multiple GPUs, you can try ZeRO stage three if you get out-of-memory errors with a batch size of one, and if that doesn't work, you can try LoRA. The main hyperparameters in LoRA are alpha, the rank, and which weight matrices to apply LoRA to: apply it to the query matrix and the value matrix, set the rank to eight, that's a good starting point, and set alpha to one. Just do that and you should be good to go; you can fine-tune models and things should be reasonably good. Okay, so I'm going to end now unless there are questions. Oh, there is one question in the back. speaker 2: I was wondering if you could go back and walk through the diagram step by step, starting with slide 48. speaker 1: Yeah, this diagram on the left, right? Okay, so let's go through this diagram. Basically, what this diagram shows is how the communication overhead is really not that bad: if you have a fairly big model, then in the time it takes to run the forward pass through one layer, you can already prefetch all of the parameters for the next layer. That's pretty much the idea, and it's kind of a standard idea; PyTorch does this by default, by the way. You want to make sure you fully saturate your GPU and overlap communication with whatever additional compute you're doing. That's pretty much what's going on here, but let's go through it step by step. The starting point here is the FSDP units.
So 0, 1, and 2 here are different FSDP units. You start by wanting to run a forward pass on the first layer, but you don't have the first layer. Say you're GPU k: you don't have the first layer, so you do an all-gather to get all of its parameters; that's AG0. At the end of AG0, every GPU has the full set of parameters for the layers in FSDP unit zero. Let's just say that's layer one; or actually, let's just say it's layer zero. So you have the full parameters for layer zero and you run the forward pass, which is the stuff in blue. And while you're running the forward pass through this first unit, you're smart about communication overhead: at the same time, you prefetch the parameters for the next FSDP unit. Say layer two is in a different FSDP unit; that's AG1, and you can see there's a little bit of overlap between forward zero and AG1. Once you've gathered all the parameters for that next unit, you run its forward pass, and so on, and then you do AG2. At the same time, say you now have way too many parameters sitting on your GPU, so you do a bit of memory freeing; that's the stuff in yellow. So you basically overlap all-gather operations with the forward pass, and that's how the forward pass runs: the communication overhead is really not that bad if you have a really big, deep neural network and you've sharded everything properly. Then you start the backward pass. The backward pass is a little trickier, because you need those all-gather operations again. Say it's a ten-layer neural network and you want to compute the full gradient at layer ten: you need an all-gather to get all of the parameters at layer ten, and then you have to do a reduce-scatter. You have four GPUs, everyone now has the full set of parameters at layer ten, but they have different gradients, so they merge their gradients and split them up to the right GPUs; that's the reduce-scatter. But that's not too bad either, because you can still overlap the reduce-scatter operations with the backward pass, and that's what you see happening on the backward side there. And along with these forward and backward passes, at regular intervals you have to free up GPU memory. For example, once you've run the forward pass through layer one and moved on to layer two, you don't need anything from layer one, so you free that memory. That's pretty much the idea behind this diagram. There are a few details here; one of them is that in FSDP, unit zero is treated differently, and you'll see that unit zero is never freed. That's just an implementation detail in FSDP.
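For what it's worth, these overlap behaviors are exposed as knobs on the FSDP wrapper. A sketch, using the same toy setup as the earlier FSDP example (the flag choices here are just illustrative, not something from the slide):

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.nn import TransformerEncoderLayer

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.TransformerEncoder(
    TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6
).cuda()

model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={TransformerEncoderLayer}
    ),
    forward_prefetch=True,                            # start the next unit's all-gather during the current forward
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch the next unit's parameters before its backward
    limit_all_gathers=True,                           # throttle all-gathers so gathered shards don't pile up
)
```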
I'll just quickly say one more thing about FSDP and then take a question. The presentation here makes it seem very simple, like it can be applied to any neural network, but it turns out that's not the full picture. You need to divide your neural network up into FSDP units, and depending on what policy you use for dividing your parameters into FSDP units, you get different communication overheads. For example, it makes sense to put multiple consecutive layers in the same FSDP unit, and so on. So this is very architecture-specific. If you start to use this in PyTorch, you'll see that the FSDP wrapper requires a sharding policy, and that policy is very architecture-specific. Because everyone uses transformers now, there are very handcrafted, well-tuned policies for creating FSDP units and sharding strategies for transformers. But let's say for your final project you came up with a new architecture, sub-quadratic attention, whatever: maybe it's not going to be as efficient, just because you don't have the right sharding policy. So that's one detail about FSDP that you may want to keep in mind. Okay, you had a question. speaker 2: Just a clarification: you mentioned you can throw away the weights you don't need after your forward pass, but then you need them again for the backward pass. Do you stream them back in each time? Or do you cache some of them, or cache the most recent ones, or do you throw them all away and stream them all back? speaker 1: There might be some caching in the system, but the idea is that you just throw them away, or at least, to the user, it looks like you've thrown them all away in terms of GPU RAM utilization. speaker 2: So we stream them in again at each layer. speaker 1: Yes, and that's why it's important to shard properly. For example, if every consecutive layer is sharded so that it sits across multiple GPUs, then you're basically always communicating, as opposed to doing one all-gather and then having the next few layers already loaded in. So that's why the sharding policy becomes important. Okay, if there are no more questions, let's end early. Thank you so much.