speaker 1: I'm going to take you through a series of fine-tuning optimizations. A very common way to fine-tune language models is a technique called LoRA. It works very well in terms of efficiency and performance, except that sometimes it doesn't quite match the performance of a full fine-tune. I'll be covering some of the newest techniques that have emerged: DoRA, NEFTune, LoRA+, and also Unsloth. So let's take a look at the agenda. I'll start with a quick recap of how LoRA works. Then I'll describe DoRA, which is a slight modification of LoRA, a directional form. Then I'll cover LoRA+, which uses different learning rates for the two LoRA matrices. Next I'll cover Unsloth, which is quite different to LoRA, although it supports LoRA; it's a combination of a number of clever speed-ups that usually gets you at least a 2x speed-up on fine-tuning. Then I'll cover NEFTune, which involves adding noise during fine-tuning; that reduces overfitting and generally gives better performance. I'll be showing you all of these in Jupyter notebooks, which you can get from the advanced fine-tuning repository. You can purchase that over on Trelis.com, but as usual, I'll try to give you enough detail so you can make the modifications yourself.

The idea of LoRA is to avoid fine-tuning all of the weights of the language model. Language models have different modules, and these modules contain matrices, so there are many weight matrices within a language model. For example, you might have a matrix that's about 1,000 by 1,000 in size. The idea with LoRA is, for each of these matrices, rather than tuning what would here be about a million parameters, to instead tune an adapter. And this is exactly how LoRA is set up: we fine-tune two smaller matrices, A and B. A has the same width of 1,000 as the original matrix but is smaller in height; here I've shown that as 8, and the same for B. So these are both long, thin matrices, and when they're multiplied together, taking B transpose times A, you get back to a matrix that's about 1,000 by 1,000. The idea with LoRA is to freeze all of the original weights and instead train just this small set of adapter weights. By doing that, we only have to train about 16,000 parameters for this specific matrix, compared to a million, so we have far fewer parameters to update. The other benefit is that training an adapter gives you a form of smoothing: it tends to make the updates more even than if you tried to individually optimize the full million parameters.

Here are the equations for LoRA applied to one matrix, which we usually call W. Instead of training W, we freeze it and train B transpose times A. We represent the new W as the original matrix plus this adapter; we train the adapter and, at the end of training, merge it on top of the original W. So there you have it: the original matrix plus the matrices that we're going to train. More specifically, when we initialize the trainable matrices, we initialize B to zeros and A to random values.
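To make that concrete, here is a minimal PyTorch sketch of my own of a LoRA-wrapped linear layer; this is just an illustration of the idea, not the peft implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + B^T A."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze the original ~1M weights
        out_f, in_f = base.weight.shape           # e.g. 1000 x 1000
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # A: random init, 8 x 1000
        self.B = nn.Parameter(torch.zeros(r, out_f))        # B: zero init, 8 x 1000
        # trainable params: r*in_f + r*out_f = 16,000 versus 1,000,000 in the base layer

    def forward(self, x):
        # B is all zeros at the start, so the update is zero and the model matches the original
        return self.base(x) + x @ self.A.T @ self.B
```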
The reason for this initialization is that at the start of training, because B is all zeros, B times A evaluates to zero. So right at the start of training, we're training a model that's exactly the same as the original model. Over time, as B and A get updated, B takes on non-zero values, and B times A contributes the fine-tuning update. So that's LoRA, and it works very well; if you're happy using LoRA, in most cases I would stick with it.

But as an improvement, and a method that gets a little bit closer to a full fine-tune, we have DoRA. The idea behind DoRA is to take the original weight matrix and split it into a magnitude and a direction. You can think of this in simple terms as a scalar magnitude times a matrix that represents a direction. We decompose the original weight matrix like this, and now, instead of applying LoRA to the full weight matrix W, we apply it only to the directional matrix. Let me make that concrete. With DoRA, we represent the weight matrix not simply as the weight matrix plus B times A, but as a magnitude vector times (the directional matrix plus B times A). So basically, DoRA just does LoRA on the directional matrix, and additionally makes the magnitude m trainable. You can think of DoRA as taking the original weight matrix, allowing the magnitude of that matrix to be trainable, and then using LoRA to train the direction of that matrix. So DoRA just adds this extra parameter, which lets us train the magnitude of the original matrix. Here it is in graphical form: quite simply, you're training m, the magnitude, times the directional form of the original matrix, plus LoRA adapters applied specifically to that directional form. DoRA should generally be at least as good as LoRA, except it gives this extra fine-tuning degree of freedom around the magnitude, and it turns out that just being able to adjust the magnitude of the original weight matrices can typically get you performance closer to a full fine-tune.

Now, how do we apply DoRA? I'll show you in a Jupyter notebook, but very simply, when setting up the LoRA configuration for the Hugging Face trainer, we pass in use_dora=True, and in addition to selecting the modules, the attention and linear layers, which I'll show you, we also make the lora_magnitude_vector, which is this m here, trainable. And that's the summary of how DoRA works.
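As a rough illustration of that decomposition, here is a simplified sketch of my own, not the peft code; I normalize per output row and omit the bias for brevity:

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Frozen weight split into a trainable magnitude and a LoRA-updated direction."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.register_buffer("W0", base.weight.detach().clone())  # frozen, shape (out, in)
        out_f, in_f = self.W0.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # LoRA A: random init
        self.B = nn.Parameter(torch.zeros(out_f, r))         # LoRA B: zero init
        # m starts as the per-row norm of W0, so the layer initially matches the base model
        self.m = nn.Parameter(self.W0.norm(dim=1, keepdim=True))

    def forward(self, x):
        directed = self.W0 + self.B @ self.A                  # W0 + B A
        unit = directed / directed.norm(dim=1, keepdim=True)  # pure direction
        return x @ (self.m * unit).T                          # rescale by trainable magnitude
```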
So let's move to the next approach, which is LoRA+. LoRA+ is another modification of LoRA, and it's quite simple: it takes the original form of LoRA and applies different learning rates to the matrix B and the matrix A. Normally when we optimize, we use the same learning rate, call it lr, for all of the matrices, including the parameters in matrix A. In LoRA+, they realized that you can actually use a relatively higher learning rate for the matrix B, the one initialized to zeros. The intuition around this is a bit difficult. Maybe one slightly inexact way to think about it is that because B starts at zero, you can afford a higher learning rate, since you need to bring it up to some steady state of non-zero values. But anyway, empirically you can increase the learning rate of matrix B above that of matrix A, up to about 16 times higher according to the paper, and this gives you faster convergence in your LoRA training. To use LoRA+, you don't change anything about LoRA itself; you just change the optimizer. I'll show you some code that I've put together with the help of the LoRA+ project on GitHub, which adjusts the optimizer so that the learning rate for the LoRA B matrices is some multiple of the learning rate for LoRA A.

The next method I'll talk about is NEFTune, which involves adding noise to your model during fine-tuning. Specifically, it adds noise to the embedding layers. This, again, is not a LoRA-specific technique. It involves taking the embedding layers, which convert tokens into vector representations of those tokens, and applying some Gaussian noise to just that layer. It turns out that adding some noise lets the language model better appreciate the higher-level features of the training dataset, rather than focusing too much on the one-off granularity that would result in overfitting. Simply by adding a little noise to the embeddings, you can improve the performance of the fine-tuning, and I'll show you that in the code.

The last class of optimizations I'll talk about is Unsloth. Unsloth is a very smart project where they've dug into the nitty-gritty of how the fine-tuning process works and found a large number of small speed-ups that together allow for a large overall speed-up. On their blog, you can see, step by step, all of the different small improvements that together generally get you a 2x or faster speed-up. Basically, there are different ways the matrices can be multiplied, and Unsloth has found ways to reduce and combine calculations so you get overall speed-ups during fine-tuning. Hugging Face provides documentation on how to integrate Unsloth, and it's supported by transformers. There are a few differences that I'll go through in the notebook: when you load the model, you need to use FastLanguageModel instead of AutoModelForCausalLM, and you also need to use FastLanguageModel when applying the LoRA adapters, that is, when getting the parameter-efficient fine-tuned (PEFT) model. There are some limitations to Unsloth. Generally, it supports only Llama-type models. That does include Mistral models and llamafied versions of models, like a llamafied version of the Yi model, but it does constrain a bit which models you can use. There are some other technical differences; for example, you can't apply LoRA dropout, which means masking certain weights during fine-tuning to avoid overfitting. But broadly speaking, the Unsloth approach can be used with tools like SFTTrainer without any changes, which means you can even overlay optimizations like adding noise with NEFTune, or using a slightly different optimizer if you want LoRA+.
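For reference, the Unsloth loading pattern looks roughly like this; the model name and LoRA settings here are illustrative choices of mine:

```python
from unsloth import FastLanguageModel

# Load with FastLanguageModel instead of AutoModelForCausalLM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mistralai/Mistral-7B-v0.1",
    max_seq_length=2048,
    dtype=None,              # auto-detects bfloat16 on Ampere and newer GPUs
    load_in_4bit=False,
)

# LoRA adapters are also applied through FastLanguageModel, not peft's get_peft_model
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,          # Unsloth does not support LoRA dropout
)
```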
Before I dive into the notebook comparison, I want to give a quick overview of DoRA, LoRA+, NEFTune, and Unsloth. As I mentioned, you can use DoRA, LoRA+, and NEFTune on pretty much any model that's supported by Hugging Face; with Unsloth, you're a bit more limited, to Llama-type models. In terms of quality, I think you get some boost using DoRA, and I'll show you that, and the same with LoRA+ and NEFTune; Unsloth is more about a speed boost, and you can expect about 2x. As for setup difficulty, you'll see when I go through the notebooks. In general, using DoRA is easy: it's just adding the use_dora parameter and then setting the lora_magnitude_vector to trainable. NEFTune is very easy as well; you're just adding one flag to the SFTTrainer. LoRA+ is a bit more difficult because right now you need some custom code to modify the optimizer, but I'll show you that, and it's not too long. Unsloth, I would say, is relatively easy to run, and if you look at their GitHub, they provide very good scripts and notebooks for doing full fine-tunes. There can be some intricacies in getting the installation right at the start, depending on the CUDA drivers you have; that's probably the trickiest part. Also, because Unsloth requires different model loading, and different LoRA adapter loading through PEFT, you'll have to make some changes to your code if you're coming from the original transformers approach.

It's time now to run through notebooks with each of these approaches, and I'm going to be using the chat-fine-tuning branch of the advanced fine-tuning repo. You can check that out and make a purchase, if you like, on Trelis.com. As a reminder, the advanced fine-tuning repo has a wide variety of scripts, all the way from DPO, function calling, long-context fine-tuning, and quantization to supervised and unsupervised fine-tuning. To get started, I'm going to use RunPod with a one-click template for a CUDA 12.1 environment. I like using this template because it gives me a consistent setup, with consistent CUDA drivers, every time I run fine-tuning. Typically, I select an A6000. So I'll click deploy, and I'll often increase the size of the volume a little; I usually make it much bigger than I need, so I don't have to go back and increase it later on. Now that pod is going to load up, and next I'll upload the Jupyter notebooks from the chat-fine-tuning branch; I'm uploading a few versions that I've saved so we can look at the different fine-tuning methods. I've just uploaded the four files we're going to compare: chat fine-tuning with LoRA; the same with DoRA; the DoRA fine-tuning with noise added, that's NEFTune; and a script for Unsloth with LoRA+. In that last script, we'll see the speed benefit of Unsloth, and then we can see the perplexity, or performance, benefit of using LoRA+. I'll go through the LoRA script first, and then more quickly through the other scripts, just highlighting what's different.

Here in the chat fine-tuning script, I start off by connecting to Hugging Face so I can push models. I'll also install and connect to Weights & Biases so I can track the run and its performance. Next, I do the installations. You can see that I've pinned specific versions, so that when people run this script in the future, they won't run into bugs caused by future library upgrades.
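As an illustration, a pinned install cell might look like the following; these exact version numbers are placeholders of mine, not necessarily the pins used in the repo:

```python
# Illustrative pinned installs; pinning guards against breakage from future upgrades.
!pip install -q transformers==4.38.2 trl==0.7.11 peft==0.9.0 accelerate==0.27.2 \
    datasets==2.17.1 bitsandbytes==0.42.0 wandb hf_transfer
```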
Once the installations are done, I enable the environment variable HF_HUB_ENABLE_HF_TRANSFER. This lets you download and upload weights much, much faster; it's part of the hf_transfer package, and I highly recommend using it. Next up, we load the model. The model I'm going to chat fine-tune is the base Mistral 7B model. Now, there is an instruct version of this available, so what I'm doing is a bit redundant, but it's a good exercise to show you how to take a base model that's not chat fine-tuned and fine-tune it; I'll show you the dataset a little later. I load the model using AutoModelForCausalLM with Flash Attention 2 for some speed-up, and in bfloat16, which is possible because I'm using an Ampere GPU, an A6000. If you're using an A40, an A100, or an H100, those also support the brain float 16 format, which allows for improved quality. Note that I'm not loading in quantized format: when I fine-tune, I usually try to do it in 16-bit, because it gives the best performance and then lets me make quantizations from that high-quality fine-tuned model. Then I load the tokenizer. Next, I like to run some loading checks. I check that there are no parameters on meta, which would mean parameters on the CPU; everything should be on the single GPU, which is what I want.

Now I prepare for LoRA fine-tuning. I create a function that shows me the trainable parameters in the model, and I enable gradient checkpointing, which saves some VRAM during training. I print the model, so you can see what modules, basically lists of matrices, are contained within it. The Mistral model has 32 layers, from 0 to 31. It has the attention matrices, self-attention Q, K, V, and O, and it has the multi-layer perceptron layers: gate, up, and down. There are also the input layernorms and post-attention layernorms, and a root-mean-square norm at the end. When I create the LoRA config, I create adapters only for specific matrices, and a common approach is to target the attention layers, Q, K, V, O, and also the multi-layer perceptrons. Notice that I've also listed lora_magnitude_vector here, which would also get a set of LoRA adapters if I turned on DoRA. Listing it just makes the config check whether there's any module named lora_magnitude_vector, and there won't be, because I'm not turning on use_dora; so you can leave it there and it won't make a difference, but it will make a difference if you have DoRA turned on. As you can see, I've got use_dora turned off; I would simply set it to True if I wanted to use DoRA, and that would mean there's a lora_magnitude_vector matrix in each layer. Next, I apply that LoRA configuration and load the model.

The next step is to set up the tokenizer and padding. I like to print the tokenizer and inspect it: check the vocab size, check the special tokens. I'll then often inspect the chat template, so I can set up the same chat template in my dataset for fine-tuning. I also like to check the pad token. Usually I'll use the pad token if it's already defined; otherwise I'll use the unk token, and if there's no unk token, I might manually define a pad token, using option B here.
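Here is a sketch of that fallback logic; the "[PAD]" token string in option B is an arbitrary choice of mine:

```python
# Pad-token fallback: existing pad token, else unk token, else define one manually
if tokenizer.pad_token is not None:
    pass                                          # option A: keep the existing pad token
elif tokenizer.unk_token is not None:
    tokenizer.pad_token = tokenizer.unk_token     # fall back to the unk token
else:
    # option B: manually define a pad token and resize the embeddings to match
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))
```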
Okay, I'm moving quickly on here. There's one more thing that's necessary for chat fine-tuning, and that's to set the embed and norm layers to trainable. It's usually not enough to just fine-tune the attention and multi-layer perceptron layers for a chat fine-tune, especially because of the tokens at the start and end of conversations, or between the roles in the conversations; I find you need to set the embed and norm layers to trainable. These layers are not very large matrices, so it doesn't add much extra training time, but I find it's a very important step to get good performance. So I set some trainable parameter names: embed tokens, input layernorm, post-attention layernorm. That means these modules are also set as trainable; they're trained fully, with no LoRA adapters applied to them. Once that's applied, we have set trainable the LoRA parameters and also the embed and norm layers, and we're ready to set up evaluation.

I have a function here that creates a streaming output, given some questions, and I like to run it on some test questions. You can see here that I've just run the evaluation. The first question asks what planets are in our solar system, and the model gives the correct planets but keeps blabbing on; that's because it hasn't been fine-tuned. Likewise, when I ask for the first five Fibonacci numbers, it gives the correct answer but keeps blabbing on, and the same for the last question about writing a Python snippet. Running this evaluation is how we'll test whether the model has been chat fine-tuned.

Next, I load the dataset, and the dataset I'm using is the Open Assistant Llama-style dataset. We can have a quick look at it on Hugging Face: it's a filtered version of the Open Assistant dataset, filtered by Tim Dettmers and further adapted to the chat format I need for Llama, or for Mistral or Mixtral, actually. You can see it includes end-of-sequence tokens and also these beginning-of-instruction and end-of-instruction tokens here. It's publicly available if you want to make use of it. After loading it, I often inspect the data, check everything is correct, and maybe test out some tokenization before moving on to the training step.

For training, I would typically train for one epoch, although here I'm just going to train for 20 steps; I won't bother running the full training in this case because I just want to compare the different optimizations. I run with a batch size of 4, which fits within my GPU VRAM, and gradient accumulation of 8. I usually like the product of batch size and gradient accumulation to be 32, which means that in every step I'm processing 32 rows of data before I back-propagate. Next, I have a custom logging callback function, which just adds some extra logging to the training process, and I have the LoRA+ optimizer code here, which we'll use in a later iteration. For now, I'm using a standard optimizer: you can see, down in the training args, optim="adamw_torch", which uses the same learning rate of 1e-4 for all of the parameters in the model. As I said, we're just running for 20 steps, so you'd want to comment that out if you actually wanted to run one full epoch. As per here, I'm using the SFTTrainer.
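Roughly, the setup looks like this; the output directory, dataset split names, text field, and sequence length are assumptions of mine rather than the repo's exact values, and argument names shift a little between trl versions:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

# Batch size 4 x grad accum 8 = 32 rows per optimizer step; 20 steps for the comparison
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=20,                    # comment out and set num_train_epochs=1 for a full run
    learning_rate=1e-4,
    optim="adamw_torch",             # standard optimizer for the LoRA baseline
    bf16=True,
    evaluation_strategy="steps",
    eval_steps=4,                    # five evaluations across the 20 steps
    logging_steps=1,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",       # hypothetical column name for the formatted chats
    max_seq_length=512,
    # optimizers=(optimizer, None),  # commented-out hook for the custom LoRA+ optimizer
)
trainer.train()
```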
The SFTTrainer is nice because I'll be able to use it as well for Unsloth, and it also lets me add noise if I like, which we'll see down here: we'll add one extra parameter. There's also the commented-out parameter for passing in a custom optimizer, which, as you'll see later, is how we'll add LoRA+. So I run the training. The training here takes about twelve minutes using LoRA, and you can see that my validation loss goes down to about 1.12. Keep that in mind, because we're going to compare how low we can get the validation loss with the other methods a little later on. And that's a quick overview of LoRA; I just wanted to give you that baseline because we'll now see how to apply the optimizations on top.

Next up, I'll show you the tweaks to make for using DoRA. There are three different things we need to change within the script. The very first is a small installation update, because DoRA has not yet been merged into the peft package; I assume it will be merged quite soon. This little cell needs to be run so that we uninstall peft and reinstall it from Benjamin Bossan's branch, which gives us DoRA integrated within this code base. The second change, as we saw a little earlier, is to go down to the use_dora flag and make sure it's commented in. It's also important that we set the lora_magnitude_vector to be trainable, because we need to train the magnitudes of the original matrices.

Here we are with the results of DoRA, and a few things stand out. The first is the time: it took about 27 minutes, whereas original LoRA took about twelve. This is because DoRA is not yet fully optimized, so unfortunately it actually provides a slowdown until optimizations bring it back on par with LoRA. Second, the validation loss is not significantly improved; in fact, it's a little worse: 1.127, versus 1.121 with original LoRA. Now, that isn't a very big difference, and I do find that when I do some manual comparisons, I sometimes get improved performance with a DoRA fine-tune. For example, I've done some function-calling fine-tuning here: a LoRA example scores 0.86, whereas my DoRA example scores 0.856. So in this case, with function-calling fine-tuning, I've gotten a slight improvement, but really not very different using DoRA; although when I inspect the function-calling results, I find the percentage of correct answers is very slightly higher using DoRA. So I believe it's possible I'm getting an improvement here. But in short, until DoRA is optimized a bit more, and it'll be nice when it's merged into peft, I don't think using it is justified, because of the slowdown and because there's no very significant improvement, at least for these two specific cases of chat fine-tuning and function-calling fine-tuning.

Next up, I'll show you how to add noise to the embeddings to improve performance. I'm going to run the same DoRA notebook but make one change: I'll just search the notebook for "neft", and quite simply add neftune_noise_alpha=5 within the SFTTrainer, and that is it; it's a very easy addition.
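Here is what that one-line change looks like, reusing the same illustrative arguments as before:

```python
# The single change for NEFTune: one extra argument on the SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,   # scale of the Gaussian noise added to the embedding outputs
)
```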
Going down to the results: after step 20, the validation loss is 1.116, or 1.117 rounding up, versus the original 1.122, so you can see a small improvement in the validation loss from adding the noise, and it's quite a simple addition. You can also see, comparing DoRA at 27 minutes with DoRA plus noise at 27 minutes, that there's no slowdown from adding the noise that I can measure here. So I think in a lot of cases, the NEFTune noise parameter on the SFTTrainer is worth adding.

Next, I'm going to take a look at LoRA+, which is where we have a different learning rate for the LoRA B parameters. Everything stays the same here except the optimizer. So I just search the notebook for "optimizer", and I have a piece of code dedicated to setting up a custom one. We're creating a LoRA+ optimizer, using code from the LoRA+ repo on GitHub. You can see that this code splits the modules into group A and group B: group A gets the base learning rate, and group B gets some multiple of that learning rate. Once the optimizer has been defined, I instantiate it with a LoRA+ ratio. Now, I actually tried a ratio of 20 first, so training LoRA B 20 times faster, and what I found was that my validation loss was very high. So I came back and reduced it to a value of 2, which worked reasonably well. I think possibly you could increase it further and maybe get more benefit, but for now I'm just going to look at the case where the learning rate for the LoRA B matrices is twice the value for LoRA A. Then I need to load that custom optimizer into my trainer: you'll see there's this line, optimizers=(optimizer, None). You could pass in a scheduler, but I'm using a constant learning rate, so I'm not going to pass one in here. I'm also commenting out the learning rate and the optimizer in the training args; I believe they'd be overwritten in any case, but for clarity I comment them out because I'm loading the custom optimizer.

So we move down to the results, and remember, this run uses both Unsloth and the LoRA+ optimizer. First off, the validation loss after 20 steps is 1.10, versus the original 1.12. So there's some slight improvement, but nothing very material, I would say. To determine whether that's down to Unsloth or to the choice of optimizer, I can run a script that uses only Unsloth. So here I've run everything using Unsloth only, without the custom optimizer; you can see that's turned off, and I'm using the original default optimizer. In this case, we get down to a final loss of 1.105, compared to 1.1038. So basically, I think the conclusion here is that the performance isn't really changing all that much whether I use Unsloth alone or the optimizer with LoRA B trained at a faster rate. Now, I think that with more steps and more work, I could probably figure out the optimal ratio: instead of using a ratio of 2, since I previously said a ratio of 20 becomes unstable, maybe a value between 2 and 20 would lead to a better improvement from LoRA+. But for now, I'm not seeing a big one.
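For reference, here is a condensed sketch of the grouped-optimizer idea, adapted from my reading of the LoRA+ repo rather than copied verbatim:

```python
import torch

def create_loraplus_optimizer(model, lr=1e-4, loraplus_lr_ratio=2.0):
    """Split trainable parameters into group A (base lr) and group B (boosted lr)."""
    group_a, group_b = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # lora_B matrices (initialized to zero) get the boosted learning rate
        (group_b if "lora_B" in name else group_a).append(param)
    return torch.optim.AdamW([
        {"params": group_a, "lr": lr},
        {"params": group_b, "lr": lr * loraplus_lr_ratio},
    ])

optimizer = create_loraplus_optimizer(model, lr=1e-4, loraplus_lr_ratio=2.0)
# passed to the trainer via: optimizers=(optimizer, None)  # no scheduler, constant lr
```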
And unfortunately, because you need to tune what this ratio is, LoRA+ is quite a bit less practical when you just want to get a fine-tuning done with a reasonable guess at the hyperparameters. Just to confirm, when you run Unsloth on its own, you get a similar time, so we can tell that the optimizer itself isn't really changing the total training time. Now, Unsloth generally does provide a 2x or greater improvement in speed. That's not the case in what I've shown here: I've got twelve minutes with standard LoRA and just under eleven minutes with Unsloth. So there is some speed-up, but I think the reason there isn't such a big difference is that I'm spending a lot of the time evaluating: the training run itself is very, very short, and I'm doing five evaluations within a small number of 20 steps. So even if Unsloth is really speeding up the training, because a lot of my total twelve minutes is taken up by evaluation, I'm not seeing a very big improvement; it's probably not showing Unsloth in its most favorable light. Certainly, Unsloth is bringing a speed-up, though, and generally I would say that if your model is supported by Unsloth, it's a no-brainer to use it for training.

Okay, folks. I've explained to you at a high level how each of these optimizations works, and I've shown you how to implement them within your scripts, but you can see that the improvements are not always worth the effort required, and it doesn't necessarily bring clear benefits to employ all of these techniques. I'll give a big caveat, though: the results will depend on your specific fine-tuning task. Here I've only run 20 steps on a chat fine-tuning task, and I did mention function calling as well. But perhaps for more complex tasks, where you're really moving the model away from its base training set, you'll see larger improvements from these optimizations; that's something specifically mentioned in the LoRA+ paper, and I believe it would hold for DoRA too. Taking a practical approach, I would recommend: if the model is supported by Unsloth, it will give you speed-ups, so use the Unsloth fine-tuning if you can. Adding noise is a very simple parameter change and doesn't seem to slow anything down, so that's probably a smart optimization to add. When it comes to DoRA, right now, because it slows down the fine-tuning, I can't recommend using it until it's been further optimized and, probably, merged into the peft library. Lastly, I think LoRA+ could potentially provide speed-ups, but you need to be willing to spend some time finding the right ratio between the learning rates for the LoRA B and LoRA A matrices. So unless you're doing a significantly long training task, where you'd run some short tests first, it's probably not worth the effort to add in LoRA+. That's it for these optimizations. You can check out the scripts in the advanced fine-tuning repository; buying access to that repository also gets you any future scripts on this topic that I upload to the repo. In the meantime, let me know your questions below. Cheers.