Trelis Research | Fine tuning Optimizations - DoRA, NEFT, LoRA+, Unsloth

Fine-Tuning Optimizations Explained: DoRA, NEFT, LoRA+, and Unsloth

Media details

Upload date
2025-05-31 19:40
Source
https://www.youtube.com/watch?v=ae2lbmtTY5A

Transcript

speaker 1: I'm going to take you through a series of fine-tuning optimizations. A very common way to fine-tune language models is a technique called LoRA. It works very well in terms of efficiency and performance, except that sometimes it doesn't quite match the performance of a full fine-tune. I'll be covering some of the newest techniques that have emerged: DoRA, NEFT, LoRA+, and also Unsloth.

So let's take a look at the agenda. I'll start by covering how LoRA works, just as a quick recap. Then I'll describe DoRA, which is a slight modification of LoRA; it's a directional form. Then I'll cover LoRA+, which uses different learning rates for the LoRA matrices. Next I'll cover Unsloth, which is quite different to LoRA, although it supports LoRA; it's a combination of a number of clever speed-ups that usually gets you at least a 2x speed-up on fine-tuning. Then I'll cover NEFT, which involves adding noise during fine-tuning, and that gives you reduced overfitting and generally better performance. I'll be showing you all of these in a Jupyter notebook script, which you can get from the advanced fine-tuning repository. You can purchase that over on Trelis.com, but as usual, I'll try to give you enough detail so you can make the modifications yourself.

The idea of LoRA is to avoid fine-tuning all of the parameters in the weights of the language model. Language models have different modules, and these modules have matrices, so there are many matrices of weights within a language model. For example, you might have a matrix that's about 1000 by 1000 in size. The idea with LoRA is that, for each of the matrices within the language model, rather than tuning what here would be about a million parameters, we instead tune an adapter. We fine-tune two smaller matrices, A and B. A will have the same width, 1000, as the original matrix, but it will be much smaller in height; here I've shown that as 8, and the same for B. So these are both long, thin matrices, and when they're multiplied together, taking B transpose times A, you get back a matrix that's about 1000 by 1000 in size. The idea with LoRA is to freeze all of the original weights and instead train just this small set of weights. By doing that, we only have to train about 16,000 parameters for this specific matrix, compared to a million. So we have far fewer parameters to update. The other benefit is that, because we train this adapter, you get some form of smoothing; it tends to make the updates in a more even way than if you tried to individually optimize the full million parameters.

So here are the equations for LoRA applied to one matrix, which we usually call W. Instead of training W, we train B transpose times A, and we freeze W. We represent the new W as a combination of the original matrix plus this adapter. We train the adapter, and at the end of training we just merge that adapter on top of the original W. So there you have it: the original matrix plus the matrices we're going to train. More specifically, as I said, we freeze the original weights, but when we initialize the trainable matrices, we initialize the B matrix to zeros and the A matrix to random values.
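To make that recap concrete, here is a minimal sketch (not the video's code) of a LoRA-style linear layer in PyTorch, with a frozen base weight, a randomly initialized A, and a zero-initialized B, using an illustrative rank of 8:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear weight W (sketch only)."""
    def __init__(self, in_features=1000, out_features=1000, rank=8):
        super().__init__()
        # Frozen original weight W (~1000x1000, ~1M parameters)
        self.weight = nn.Parameter(torch.empty(out_features, in_features),
                                   requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Trainable low-rank factors: A is random, B starts at zero,
        # so B^T @ A contributes nothing at the start of training.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        # Effective weight is W + B^T A; only ~16k adapter params are trained.
        delta = self.lora_B.t() @ self.lora_A   # (out, in), zero at init
        return x @ (self.weight + delta).t()

x = torch.randn(2, 1000)
print(LoRALinear()(x).shape)  # torch.Size([2, 1000])
```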
The reason for this is that, at the start of training, when we multiply B times A, because B is all zeros, B times A evaluates to zero. So right at the start of training, we're training a model that's exactly the same as the original model. Over time, as B and A get updated, B takes on non-zero values, and B times A contributes a fine-tuning update.

So that's LoRA, and it works very well; it still works very well. If you're happy using LoRA, in most cases I would stick with it. But as an improvement, and a method that gets a little closer to a full fine-tune, we have DoRA. The idea behind DoRA is to take the original weight matrix and split it into a magnitude, which you can think of in simple terms as a scalar, times a matrix that represents a direction. So you have a scalar magnitude times a direction, and we decompose the original weight matrix like this. Now, instead of doing LoRA on the full weight matrix W, we only apply the A and B matrices to the directional matrix.

Let me make that concrete. With DoRA, we represent the weight matrix not just as the weight matrix plus B times A, but as a magnitude vector times the directional matrix plus B times A. Basically, DoRA just does LoRA on the directional matrix, and additionally DoRA makes the magnitude m trainable. So you can think of DoRA as taking the original weight matrix, allowing the magnitude of that matrix to be trainable, and then using LoRA to train the direction of that matrix. DoRA just adds in this extra parameter, which lets us train the magnitude of the original matrix. Here it is in graphical form: quite simply, you're training m, the magnitude, times the directional form of the original matrix, plus LoRA adapters that are applied specifically to that directional form. So DoRA should generally be at least as good as LoRA, except it gives this extra fine-tuning degree of freedom around the magnitude. And it turns out that just being able to adjust the magnitude of the original weight matrices can typically get you better performance, closer to a full fine-tune.

Now, how do we apply DoRA? I'll show you within a Jupyter notebook, but very simply, when we're using the Hugging Face trainer and setting up the LoRA configuration, we pass in use_dora=True. And in addition to selecting the modules, the attention and linear layers, which I'll show you, we also make trainable the lora magnitude vector, which is this m here. That's the summary of how DoRA works.
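As a minimal sketch of what that configuration might look like with PEFT (rank and alpha values are assumptions; the video uses a pre-release PEFT branch, and current PEFT handles the magnitude automatically once `use_dora=True`):

```python
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,  # decompose W into magnitude * direction and apply LoRA to the direction
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
        # the video's PEFT branch also lists "lora_magnitude_vector" among the
        # modules so that the magnitude m is trainable
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)  # `model` loaded earlier in the notebook
```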
So let's move to the next approach, which is LoRA+. LoRA+ is another modification of LoRA, and it's quite simple: it takes the original form of LoRA and applies different learning rates to matrix B and matrix A. Normally, when we optimize, we use the same learning rate for all of the matrices; let's call that learning rate lr and say we apply lr to the parameters in matrix A. Well, in LoRA+, they've realized that you can actually use a relatively higher learning rate for matrix B, the one that's initialized to zeros. Now, the intuition around this is a bit difficult. Maybe one slightly inexact way to think about it is that, because B is initialized at zeros, you can afford a higher learning rate, since you need to bring it up to some steady state of non-zero values. But anyway, empirically, you're able to set the learning rate of matrix B quite a bit higher than that of matrix A, up to about 16 times higher according to the paper, and overall this gives you faster convergence in your LoRA training. To use LoRA+, you don't change anything about LoRA itself; you just change the optimizer. I'll show you some code that I've put together with the help of the LoRA+ project on GitHub, where I adjust the optimizer so that the learning rate for the LoRA B matrices is some multiple of the learning rate for LoRA A.

The next method I'll talk about is NEFT, which involves adding noise to your model during fine-tuning. Very specifically, it adds noise to the embedding layers. This, again, is not a LoRA-specific technique. It involves taking the embedding layers, which convert tokens into vector representations of those tokens, and applying some Gaussian noise to just that specific layer. It turns out that when you add some noise, the language model better appreciates the higher-level features of the training dataset, rather than focusing too much on the one-off granularity that would result in overfitting. Simply by adding a little bit of noise to the embeddings, you can improve the performance of the fine-tune, and I'll show you that in the code.

The last class of optimizations I'll talk about is Unsloth. Unsloth is a very smart project where they've dug into the nitty-gritty of how the fine-tuning process works and found a large number of small speed-ups that together allow for a large speed-up in the fine-tuning process. On their blog, you can see step by step all of the different small improvements that together generally get you a 2x or faster speed-up. Basically, there are different ways the matrices can be multiplied, and Unsloth has found ways to reduce and combine calculations so that you get overall speed-ups during fine-tuning. Hugging Face provides documentation on how to integrate Unsloth, and it's supported by transformers. There are a few differences that I'll go through in the notebook: when you load the model, you need to use FastLanguageModel instead of AutoModelForCausalLM, and you also need to use FastLanguageModel when applying the LoRA adapters, that is, when getting the parameter-efficient fine-tuned model. There are some limitations to Unsloth. Generally, it supports only Llama-type models. That does include Mistral models and llamafied versions of models, like a llamafied version of the Yi model, but it does constrain a bit the models you can use. There are some other technical differences; for example, you can't apply LoRA dropout, which masks certain weights during fine-tuning in order to avoid overfitting. But broadly speaking, the Unsloth approach can be used with tools like the SFT trainer without any changes, and that means you can even overlay optimizations like adding noise with NEFT, or use a slightly different optimizer if you want LoRA+.
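As an illustration of those two differences, here is a minimal sketch of loading a model and attaching LoRA adapters with Unsloth (the model name, sequence length, and LoRA settings are assumptions, not the video's exact values):

```python
from unsloth import FastLanguageModel

# Load the base model through Unsloth instead of AutoModelForCausalLM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mistralai/Mistral-7B-v0.1",  # assumed; any Llama/Mistral-type model
    max_seq_length=2048,
    dtype=None,          # let Unsloth pick bfloat16/float16 for the GPU
    load_in_4bit=False,  # the video fine-tunes in 16-bit
)

# Attach LoRA adapters through Unsloth as well, instead of peft.get_peft_model
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    lora_dropout=0,  # the video notes LoRA dropout isn't applied with Unsloth
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```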
Before I dive into the notebook comparison, I just want to give a quick overview of DoRA, LoRA+, NEFT and Unsloth. As I mentioned, you can use DoRA, LoRA+ and NEFT on pretty much any model that's supported by Hugging Face. With Unsloth, you are a bit more limited to Llama-type models. In terms of quality, I think you get some boost using DoRA, and I'll show you that, and the same with LoRA+ and NEFT; with Unsloth, it's more about a speed boost, and you can expect about 2x. You'll see how difficult the setup is when I go through the notebooks. In general, using DoRA is easy: it's just adding the use_dora parameter and then setting the lora magnitude vector to trainable. NEFT is very easy as well: you're just adding one flag to the SFT trainer. LoRA+ is a bit more difficult, because right now you need some custom code in order to modify the optimizer, but I'll show you that and it's not too long. And Unsloth, I would say, is relatively easy to run, and if you look at the GitHub, they provide very good scripts and notebooks for doing full fine-tunes. There can be some intricacies in getting the installation right at the start, depending on the CUDA drivers you have, so that's probably the most tricky part. Also, because Unsloth requires different model loading and different LoRA adapter (PEFT) loading, you'll have to make some changes to your code if you're coming from the original transformers approach.

It's time now for me to run through notebooks with each of these approaches, and I'm going to be using the chat fine-tuning branch of the advanced fine-tuning repo. You can check that out and make a purchase if you like on Trelis.com. As a reminder, the advanced fine-tuning repo has a wide variety of scripts, all the way from DPO, function calling, long-context fine-tuning and quantization to supervised and unsupervised fine-tuning. To get started, I'm going to use RunPod with a one-click template for a CUDA 12.1 environment. I like using this template because it means I have a consistent setup and consistent CUDA drivers every time I run fine-tuning. Typically I select an A6000, so I'll click deploy. Often I'll increase the size of the volume a little; I usually make it much bigger than I need, so I don't have to go back and increase it later on. Now that pod is going to load up, and next I'm going to upload the Jupyter notebooks from the chat fine-tuning branch; I've saved a few versions so we can look at the different fine-tuning methods. I've just uploaded the four files that we're going to compare: chat fine-tuning with LoRA, with DoRA, then the same DoRA fine-tuning with the addition of noise, that's NEFT, and then a script for Unsloth with LoRA+. In that last script we'll see the speed benefit of Unsloth, and then we can see the perplexity, or performance, benefit of using LoRA+.

I'll go through the LoRA script first, and then more quickly go through the other scripts and just highlight what's different. Here in the chat fine-tuning script, I start off by connecting to Hugging Face so I can push models. I'll also install and connect to Weights & Biases so I can track my run and its performance. Next I do the installations; you can see that I have pinned specific versions, so that when people run this script in the future they won't run into bugs caused by library upgrades.
Once the installations are done, I enable the environment variable HF_HUB_ENABLE_HF_TRANSFER. This lets you download and upload weights much, much faster; it's part of the hf_transfer package, and I highly recommend using it. Next up, we load the model, and the model I'm going to chat fine-tune is the base Mistral 7B model. There is an instruct version of this available, so what I'm doing is a bit redundant, but it's a good exercise to show you how to take a base model that's not chat fine-tuned and fine-tune it. I'll show you the dataset a little later. I load the model using AutoModelForCausalLM, with Flash Attention 2 for some speed-up, and in bfloat16, which is possible because I'm using an Ampere GPU, an A6000. If you're using an A40, an A100 or an H100, those also support the brain float 16 format, which allows for improved quality. Note that I'm not loading in a quantized format; usually when I fine-tune, I try to do it in 16-bit, because that gives the best performance and then lets me make quantizations from that high-quality fine-tuned model afterwards. Here, I'm just loading the tokenizer.

Next, I like to run some loading checks. I check that there are no parameters on the meta device, meaning nothing has been left off the GPU; everything is on the single GPU, which is what I want. Now I prepare for LoRA fine-tuning. I create a function that shows me the trainable parameters in the model, and I enable gradient checkpointing, which saves some VRAM during training. I print the model, so you can see which modules, basically lists of matrices, are contained within it. The Mistral model has 32 layers, from 0 to 31. It has attention matrices, the self-attention q, k, v and o projections, and multi-layer perceptrons, the gate, up and down projections. There are also the input layer norms and post-attention layer norms, and a root-mean-square norm at the end. When I create the LoRA config, I'm going to create adapters only for specific matrices, and a common approach is to target the attention layers, so q, k, v and o, and also the multi-layer perceptrons. Notice that I've also listed the lora magnitude vector here; that would also get adapters if I turned on DoRA. Having it there just means the code checks whether there's any module named lora_magnitude_vector, and there won't be, because I'm not turning on use_dora. So you can just leave it there and it won't make a difference, but it will make a difference if you have DoRA turned on. As you can see, I've got use_dora turned off; quite simply, I would set that to True if I wanted to use DoRA, and that would mean there's a matrix in each layer for the lora magnitude vector. Next, I apply that LoRA configuration and load the model.

The next step is to set up the tokenizer and padding. Usually I like to print the tokenizer and inspect it: check the vocab size, check the special tokens. I'll then often inspect the chat template so I can set up the same chat template in my dataset for fine-tuning. I also check the pad token: usually I'll use the pad token if it's already defined; otherwise I'll use the unk token, and if there's no unk token, I might manually define a pad token using option B here.
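As a rough sketch of the loading and tokenizer steps just described (model name and exact arguments are assumptions, not copied from the notebook; Flash Attention 2 needs the flash-attn package installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed: the base (non-instruct) Mistral 7B

# 16-bit load with Flash Attention 2 on an Ampere-or-newer GPU; no quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
model.gradient_checkpointing_enable()  # saves VRAM during training

# Loading check: nothing should be left on the meta device
assert not any(p.device.type == "meta" for p in model.parameters())

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Pad-token handling, roughly as described: fall back to the unk token if no pad
# token is defined (the video's "option B" instead defines a pad token manually)
if tokenizer.pad_token is None and tokenizer.unk_token is not None:
    tokenizer.pad_token = tokenizer.unk_token
```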
Okay, I'm moving quickly on here. I'm going to do one more thing which is necessary for chat fine-tuning, and that's to set the embed and norm layers to trainable; I'll show a sketch of this in a moment. It's usually not enough to just fine-tune the attention and multi-layer perceptron matrices if you want a good chat fine-tune, especially because of the tokens at the start and end of conversations, or between the roles in the conversations. I find that you need to set the embed and norm layers to trainable. These layers are not very large matrices, so it doesn't add much extra training time, but I find it's a very important step to get good performance. So I set some trainable parameter names: embed_tokens, input_layernorm, post_attention_layernorm. That means these modules are also set as trainable; they're trained fully, with no LoRA adapters applied to them. Once that's applied, we have set trainable the LoRA parameters and also the embed and norm layers, and we're ready to set up evaluation.

I have a function here that creates a streaming output for some test questions, and I like to run that before training. You can see here I've just run the evaluation. The first question asks what planets are in our solar system, and the model gives the correct planets but keeps on blabbing; that's because it hasn't been fine-tuned. Likewise, when I ask for the first five Fibonacci numbers, it gives the correct answer but again keeps blabbing on, and the same on the last question about writing a Python snippet. Running this evaluation is the way to test whether the model has been chat fine-tuned yet.

Next I load the dataset, and the dataset I'm going to use is the Open Assistant Llama-style dataset. We can have a quick look at it on Hugging Face. It's a filtered version of the Open Assistant dataset, filtered by Tim Dettmers, and further adapted to the chat format I need for Llama, or Mistral or Mixtral actually. You can see it includes end-of-sequence tokens and also these beginning-of-instruction and end-of-instruction tokens. It's publicly available if you want to make use of it. After loading it, I often inspect the data and check everything is correct, maybe test out some tokenization, before moving on to the training step.

For training, I would typically train for one epoch, although here I'm just going to train for 20 steps. In this case I'm not going to bother running the full training, because I just want to run a comparison between the different optimizations. I run with a batch size of 4, which fits within my GPU VRAM, and gradient accumulation of 8. Usually I like the product of batch size times gradient accumulation to be 32; it means that in every step I'm processing 32 rows of data before I back-propagate. Next, I have a custom logging callback function, which just adds some extra logging to the training process. I also have the LoRA+ optimizer code here, which will be used in a later iteration; for now, I'm just using a standard optimizer. You can see down here in the training args, optim equals adamw_torch, so that will use the same learning rate of 1e-4 for all of the parameters in the model. As I said, we're just going to run for 20 steps, so comment that out if you actually want to run for one full epoch. As per here, I'm using the SFT trainer.
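Going back to the embed-and-norm step above, here is a small sketch of what setting those extra layers to fully trainable might look like (the substring-matching rule is an assumption about the notebook, not its exact code):

```python
# Make embeddings and norm layers fully trainable on top of the LoRA adapters.
# These substrings match Mistral/Llama module names; adjust for other architectures.
trainable_param_names = ["embed_tokens", "input_layernorm",
                         "post_attention_layernorm", "norm"]

for name, param in model.named_parameters():
    if any(key in name for key in trainable_param_names):
        param.requires_grad = True  # trained fully; no LoRA adapter on these

# Sanity check: count trainable vs. total parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")
```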
This is nice because I'll be able to use it as well for Unsloth, and it also lets me add noise if I like, which we'll see a bit further down; we just add an extra parameter. This here is the commented-out parameter for passing in a custom optimizer, and we can also, as you'll see later, pass in a parameter for adding noise. I then run the training. The training here takes about twelve minutes using LoRA, and you can see that my validation loss goes down to about 1.12. Keep that in mind, because we're going to compare how low we can get the validation loss with the other methods a little later on. That's a quick overview of LoRA; I just wanted to give you that baseline, because we'll now see how to apply the optimizations on top.

Next up, I'll show you the tweaks to make for using DoRA. There are three different things we need to change within the script. The very first is a small installation update, because DoRA has not yet been merged into the PEFT package; I assume it will be merged quite soon. So this little cell needs to be run to uninstall PEFT and reinstall it from Benjamin Bossan's branch, and that gives us DoRA within this code base. The second change, as we saw a little earlier, is to go down to the use_dora flag and make sure it's commented in. It's also important that we set the lora magnitude vector to be trainable, because we need to train the length of those original matrices.

Here we are with the results of DoRA, and a few things stand out. The first is the time: it took about 27 minutes, whereas the original LoRA run took about twelve. This is because DoRA is not yet fully optimized, so unfortunately it's actually a slowdown until optimizations bring it back on par with LoRA. Second, the validation loss is not significantly improved; in fact, it's a little bit worse: 1.127, whereas with the original LoRA I got 1.121. Now, that isn't a very big difference, and I do find that when I do some manual comparison, I sometimes get improved performance from the DoRA fine-tune. For example, I've done some function calling fine-tuning: this LoRA example scores 0.86, whereas my DoRA example scores 0.856. So in this case, with function calling fine-tuning, the scores are really not very different, although when I inspect the function calling results, I find that the percentage of correct answers is very slightly higher using DoRA. So I believe it's possible I'm getting an improvement here. But in short, until DoRA is optimized a bit more, and it'll be nice when it's merged into transformers, I don't think using it is justified, because of the slowdown and because there isn't a very significant improvement, at least for these two specific cases of chat fine-tuning and function calling fine-tuning.

Next up, I'll show you how to add noise to the embeddings to improve performance. I'm just going to run the same DoRA notebook, but make one change: I'll search the notebook for NEFT, and quite simply, I add neftune_noise_alpha=5 within the SFT trainer, and that is it. It's a very easy addition, and we can go down and take a look at the results.
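For reference, here is roughly what that one-flag change looks like with TRL's SFTTrainer (the other arguments are placeholders standing in for objects defined earlier in the notebook, not its exact values):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # placeholder: prepared earlier
    eval_dataset=eval_dataset,     # placeholder: prepared earlier
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        optim="adamw_torch",
        max_steps=20,
    ),
    neftune_noise_alpha=5,  # NEFT: add noise to the embedding layer during training
    # (in newer TRL releases this setting lives on SFTConfig instead)
)
trainer.train()
```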
So after step 20, the validation loss is 1.116, or 1.117 rounding up, versus the original 1.122, so you can see a small improvement in the validation loss from adding the noise, and it's quite a simple addition. Also, comparing DoRA at 27 minutes to DoRA with noise at 27 minutes, there isn't any slowdown from adding the noise that I can measure here. So I think in a lot of cases the NEFT noise parameter in the SFT trainer is worth adding.

Next, I'm going to take a look at using LoRA+, which is where we have a different learning rate for the LoRA B parameters. Everything stays the same here except the optimizer needs to be different, so I'm just going to search the notebook for "optimizer". I have a piece of code here dedicated to setting up a custom optimizer. We're creating a LoRA+ optimizer, and I'm using the code from the LoRA+ repo on GitHub. You can see that this code splits the modules into group A and group B; for group A we apply the learning rate, and for group B we apply some multiple of that learning rate. Once the optimizer has been defined, I instantiate it with a LoRA+ ratio. Now, I actually tried using a ratio of 20, so a 20-times higher training rate for LoRA B, and what I found was that my validation loss was very high. So I came back and reduced it to a value of two, and that worked reasonably well. I think possibly you could increase it further and maybe get more benefit, but for now I'm just looking at the case where the learning rate for the LoRA B matrices is twice the value for LoRA A. I then need to load that custom optimizer within my trainer, so you'll see this line where optimizers is set to the custom optimizer. You could pass in a scheduler too, but I'm using a constant learning rate, so I'm not going to pass one in. I'm also just going to comment out the learning rate and the optimizer arguments here; I believe they'd be overwritten in any case, but for clarity I'm commenting them out because I'm loading the custom optimizer.

So we move on down to the results, and remember, this run is using both Unsloth and the LoRA+ optimizer. You'll see in the results, first off, that the validation loss after 20 steps is 1.10 versus the original 1.12. So there's some slight improvement, but nothing very material, I would say. To determine whether that's due to Unsloth or to the choice of optimizer, I can run a script that uses only Unsloth. So here I have run everything with Unsloth only, without the custom optimizer; you can see that's turned off and I'm using the default optimizer. In this case, we get down to a final loss of 1.105 compared to 1.1038. So basically, I think the conclusion is that the performance isn't really changing all that much whether I use Unsloth alone or the optimizer with LoRA B trained at a faster rate. Probably, with more steps and more work, I could figure out what the optimal ratio would be: I previously said that a ratio of 20 becomes unstable, so maybe a value between two and 20 would lead to a better improvement from LoRA+. But for now, I'm not seeing a big one.
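The custom optimizer described above comes from the LoRA+ repo on GitHub; as a simplified, hand-rolled sketch of the same grouping idea (not that repo's code), you can build two parameter groups and give the lora_B group a multiple of the base learning rate:

```python
import torch

base_lr = 1e-4
loraplus_lr_ratio = 2  # the video settles on 2x after finding 20x unstable

group_a, group_b = [], []
for name, param in model.named_parameters():   # `model` is the PEFT-wrapped model
    if not param.requires_grad:
        continue
    # LoRA B matrices get the higher learning rate; everything else gets the base rate
    (group_b if "lora_B" in name else group_a).append(param)

optimizer = torch.optim.AdamW([
    {"params": group_a, "lr": base_lr},
    {"params": group_b, "lr": base_lr * loraplus_lr_ratio},
])

# Passed to the trainer in place of the default optimizer, e.g.:
# trainer = SFTTrainer(..., optimizers=(optimizer, None))  # no scheduler: constant LR
```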
And unfortunately, because you need to tune what this ratio is, that makes it quite a bit less practical when you just want to get a fine-tune done with a reasonable guess at the hyperparameters. Just to confirm, when you run Unsloth on its own, you get a similar time, so we can tell that the optimizer itself isn't really changing the total training time. Now, Unsloth generally does provide a 2x or greater improvement in speed. That's not what I've shown here: I've got twelve minutes with standard LoRA and just under eleven minutes using Unsloth. So there is some speed-up, but I think the reason there isn't a bigger difference is that in this case I'm spending a lot of my time on evaluation; the training run itself is very, very short, and I'm doing five evaluations within only 20 steps. So even if Unsloth is really speeding up the training, because a lot of my total twelve minutes is taken up by evaluation, I'm not seeing a very big improvement, and that's probably not showing Unsloth in the most favorable light. Certainly, Unsloth is bringing a speed-up, and generally, if the model is supported by Unsloth, it's pretty much a no-brainer to use it for training.

Okay, folks. I've explained at a high level how each of these optimizations works and shown you how to implement them within your scripts, but you can see that the improvements, versus the effort required to apply the optimizations, are not always worth it, and it doesn't necessarily bring clear benefits to employ all of these techniques. I'll give a big caveat, though: the results will depend on your specific fine-tuning task. Here I've just run 20 steps on a chat fine-tuning task, and I did mention function calling as well, but perhaps for more complex tasks, where you're really moving the model away from its base training set, you'll see larger improvements from these optimizations. That's something specifically mentioned in the LoRA+ paper, and I believe it would also hold for DoRA.

Taking a practical approach: if the model is supported by Unsloth, it will give you speed-ups, so I'd recommend using the Unsloth fine-tuning. I also think adding noise is a very simple parameter change that doesn't seem to slow anything down, so that's probably a smart optimization to add. When it comes to DoRA, right now it slows down the fine-tuning, so I can't recommend using it until it's been further optimized and probably integrated into the transformers library through a merge. Lastly, I think LoRA+ could potentially provide speed-ups, but you need to be willing to spend some time optimizing the right ratio between the learning rates of the LoRA B and LoRA A matrices. So unless you're doing a significantly long task, where you're going to run some short tests up front, it's probably not worth the effort to add LoRA+. That's it for these optimizations. You can check out the scripts in the advanced fine-tuning repository; you can buy access to that repository and get access to any future scripts on this topic that I upload. In the meantime, let me know your questions below. Cheers.

Latest Summary (Detailed Summary)

Generated 2025-05-31 19:44

Overview / Executive Summary

This content gives a detailed introduction to LoRA (Low-Rank Adaptation) and several fine-tuning optimizations around it: DoRA (Directional LoRA), NEFT (Noisy Embeddings Fine-Tuning), LoRA+ (LoRA Plus), and Unsloth. LoRA fine-tunes large language models efficiently by training small adapter matrices, but sometimes falls short of full fine-tuning. DoRA tries to close that gap by decomposing each weight matrix into a magnitude and a direction, applying LoRA-style updates to the direction while also training the magnitude, but the current implementation is slow. LoRA+ sets different learning rates for the LoRA A and B matrices, giving B a higher rate to accelerate convergence, though the learning-rate ratio needs tuning. NEFT adds noise to the embedding layer to reduce overfitting and help the model pick up higher-level features of the data; it is simple and effective. Unsloth is a set of low-level optimizations aimed at significantly speeding up fine-tuning (a claimed 2x or more); it mainly supports Llama-family models and requires some code changes.

The speaker demonstrates the implementation and preliminary results of these techniques in a Jupyter Notebook. The experiments show that adding NEFT noise slightly lowers validation loss with no increase in training time. Unsloth does deliver a speed-up, though it is less visible in these short, evaluation-heavy runs. The current DoRA implementation is unoptimized, so training time increases substantially with no clear quality gain. LoRA+ shows no standout benefit in the preliminary experiments and requires extra tuning of the learning-rate ratio. The final recommendation: when the model is supported, use Unsloth first for speed and consider adding NEFT for quality; wait for DoRA to be optimized and officially integrated; and reserve LoRA+ for longer tasks where you are willing to invest time in hyperparameter search.

LoRA Recap

Speaker 1 first recaps how LoRA works; it is currently a common way to fine-tune language models.

  • How it works
    • Avoids fine-tuning all parameters of the language model. A language model contains many modules, each with weight matrices.
    • For each large weight matrix W in the model (e.g. 1000x1000, about 1 million parameters), LoRA freezes the original weights W.
    • Two smaller trainable matrices are introduced, A (e.g. 8x1000) and B (e.g. 8x1000), whose product B_transpose * A forms the update to the original matrix W.
    • Only the parameters in A and B need training (about 16,000 in this example), far fewer than in the original matrix.
    • This not only reduces the number of trained parameters; training an adapter also has a smoothing effect, making the updates more even.
  • Key equation and initialization (see the sketch after this section)
    • Updated weight: W_new = W_original + B_transpose * A
    • At the end of training, the adapter weights are merged into the original weights W.
    • Initialization: matrix B is initialized to zeros, matrix A to random values.
      • Reason: at the start of training, because B is zero, B * A is also zero and the model is identical to the original. As training proceeds, A and B are updated and B * A produces a non-zero update.
  • Assessment: Speaker 1 notes that "LoRA works very well, and it still works very well. So if you're happy using LoRA, in most cases, I would stick with that."
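In LaTeX, consistent with the dimensions used above (both factors are r x d with r = 8, d = 1000; a notational sketch, not taken from the video):

```latex
W_{\text{new}} \;=\; W_{\text{original}} + B^{\top} A,
\qquad
A \in \mathbb{R}^{r \times d}\ (\text{random init}),\quad
B \in \mathbb{R}^{r \times d}\ (\text{zero init}),\quad
r \ll d
```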

DoRA (Directional LoRA)

DoRA is a modification of LoRA intended to bring performance closer to a full fine-tune.

  • Core idea and mechanism
    • Decompose the original weight matrix W into a magnitude (which can be viewed as a scalar or vector) and a direction matrix.
    • W_original ≈ m * Directional_Matrix
    • The LoRA update (the A and B matrices) is no longer applied to W directly but to the decomposed directional matrix.
    • At the same time, the magnitude m itself becomes a trainable parameter.
    • DoRA can therefore be viewed as: let the magnitude of the original weight matrix be trainable, and use LoRA to train its direction.
    • In equation form (see the sketch after this section): W_new ≈ trainable_m * (Directional_Matrix_original + B*A)
  • Implementation (Hugging Face Trainer):
    • Set use_dora=True in the LoRA config.
    • Make the LoRA magnitude vector (i.e. m) trainable.
  • Experimental results and assessment (Speaker 1)
    • Quality: in the speaker's chat fine-tuning experiment, validation loss was slightly worse than plain LoRA (1.127 vs 1.121). In a separate function-calling fine-tune, DoRA scored 0.856 vs LoRA's 0.86, with the percentage of correct answers slightly higher under DoRA.
    • Speed: training time increased substantially (about 27 minutes vs LoRA's 12 minutes), because "Dora is not yet fully optimized".
    • Conclusion: Speaker 1 states, "until Dora is optimized a bit more and it'd be nice when it's merged into transformers, I don't think that using it is justified because of the slowdown. And also, there's not a very significant improvement".
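Written in the same notation as the LoRA equation above (a sketch of the decomposition as described in the video, omitting the column-norm normalization used in the DoRA paper):

```latex
W_{\text{original}} \;=\; m \cdot V
\qquad\Longrightarrow\qquad
W_{\text{new}} \;\approx\; \underbrace{m}_{\text{trainable magnitude}} \cdot \bigl( V + B^{\top} A \bigr)
```

Here V is the frozen directional form of the original matrix, and A, B are the LoRA factors.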

LoRA+ (LoRA Plus)

LoRA+ is another modification of LoRA, focused on the optimizer.

  • Core idea and mechanism
    • Use different learning rates for the two LoRA matrices A and B.
    • Normally, all matrices are optimized with the same learning rate.
    • LoRA+ proposes a relatively higher learning rate for matrix B, the one initialized to zero.
    • Intuition (which the speaker calls "slightly inexact"): because B starts at zero and needs to be brought up to a non-zero steady state, it can tolerate a higher learning rate.
    • The paper suggests B's learning rate can be up to 16 times A's.
    • Goal: faster convergence of LoRA training.
  • Implementation
    • LoRA's own configuration is unchanged; only the optimizer is modified.
    • The speaker shows custom optimizer code that splits the parameters into an A group and a B group and applies a learning-rate multiplier to the B group.
  • Experimental results and assessment (Speaker 1)
    • The speaker first tried a learning-rate ratio of 20, which was unstable with very high validation loss, and then reduced it to 2.
    • In the combined Unsloth + LoRA+ experiment (ratio of 2), the validation loss was 1.10, better than the LoRA baseline of 1.121.
    • However, compared with Unsloth alone (validation loss 1.105), the improvement is marginal.
    • Conclusion: Speaker 1 says "the performance is not really changing all that much whether I use Unsloth or whether I use the optimizer with LoRA B being trained at a faster rate." More steps or finer ratio tuning might be needed to see a clear effect. Because the learning-rate ratio needs tuning, "that makes it quite a bit less practical for when you just want to get a fine tuning done".

NEFT (Noisy Embeddings Fine-Tuning)

NEFT improves fine-tuning by adding noise to the embedding layer.

  • Core idea and mechanism
    • During fine-tuning, add noise (typically Gaussian) to the model's embedding layers.
    • The embedding layers convert tokens into vector representations.
    • Not a LoRA-specific technique; it can be combined with other fine-tuning methods.
  • Benefits
    • Helps the language model pick up the higher-level features of the training data.
    • Reduces over-focus on the "one-off granularity" that can lead to overfitting.
    • A small amount of noise on the embeddings improves fine-tuning performance.
  • Implementation (SFTTrainer):
    • Add the parameter neftune_noise_alpha (e.g. set to 5) to the SFTTrainer.
  • Experimental results and assessment (Speaker 1)
    • Adding NEFT on top of DoRA reduced validation loss from DoRA's 1.127 (or the LoRA baseline of 1.121) to about 1.116/1.117.
    • Training time did not noticeably increase.
    • Conclusion: Speaker 1 considers it "a parameter that's worth adding in... It's quite a simple add."

Unsloth

Unsloth is a project focused on speeding up fine-tuning.

  • Core idea and mechanism
    • Not a direct modification of LoRA, but it supports being used together with LoRA.
    • Digs into the details of the fine-tuning process and applies many small low-level optimizations (matrix-multiplication tricks, combined computations) for an overall speed-up.
    • The goal is "at least usually a two x speed up".
  • Integration and limitations (Hugging Face Transformers):
    • Model loading: use FastLanguageModel instead of AutoModelForCausalLM.
    • Applying LoRA adapters (PEFT): also done through FastLanguageModel.
    • Model support: mainly Llama-type models (including Mistral and llamafied versions of other models).
    • Technical differences: for example, LoRA dropout (a technique for preventing overfitting) cannot be applied.
    • Compatibility: works with tools such as SFTTrainer, and optimizations like NEFT or LoRA+ can be layered on top.
  • Experimental results and assessment (Speaker 1)
    • In the speaker's short (20-step) experiment, standard LoRA took about 12 minutes and Unsloth slightly under 11 minutes.
    • Why the speed-up fell short of 2x: the run was short and included several evaluations, so evaluation time dominated and masked the training speed-up.
    • Conclusion: Speaker 1 affirms that "Unsloth is bringing a speed up though. And so generally, I would recommend if the model is supported by Unsloth, it generally is a no-brainer to use that as a form of training."

Comparison of the Optimization Techniques (Speaker 1's View)

| Attribute | DoRA | LoRA+ | NEFT | Unsloth |
| --- | --- | --- | --- | --- |
| Model support | Nearly any model supported by Hugging Face | Nearly any model supported by Hugging Face | Nearly any model supported by Hugging Face | Limited, mainly Llama-type models |
| Quality gain | Some potential improvement | Some potential improvement | Some improvement | Mainly a speed boost (~2x) |
| Setup difficulty | Easy (one parameter + make the magnitude vector trainable) | Harder (custom optimizer code, learning-rate ratio needs tuning) | Very easy (a single flag in SFTTrainer) | Relatively easy to run, but installation can hit CUDA driver issues and model-loading code must change |

Notebook Walkthrough and Key Configurations

Speaker 1 ran the chat fine-tuning experiments on the Mistral 7B base model, using a Jupyter Notebook on a RunPod A6000 GPU (CUDA 12.1) environment. All experiments were run for only 20 training steps to allow a quick comparison.

  • Experiment environment
    • GPU: NVIDIA A6000
    • Platform: RunPod (CUDA 12.1 one-click template)
    • Base model: Mistral 7B base model
    • Data type: bfloat16
    • Key libraries: Transformers, PEFT, Accelerate, TRL (SFTTrainer)
    • Dataset: openassistant-guanaco (adapted to the Llama/Mistral chat format)
  • Baseline LoRA configuration and results (see the configuration sketch after this list)
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • use_dora=False
    • Important addition: embed_tokens and the norm layers (e.g. input_layernorm, post_attention_layernorm) are set to fully trainable (not via LoRA); Speaker 1 stresses this is very important for good chat fine-tuning performance.
    • Training parameters: batch_size=4, gradient_accumulation_steps=8 (effective batch 32), AdamW optimizer, learning rate 1e-4
    • Result (20 steps): training took about 12 minutes, validation loss about 1.121
  • DoRA configuration and results
    • Install a specific PEFT branch to enable DoRA.
    • use_dora=True, and make sure lora_magnitude_vector is trainable.
    • Result (20 steps): training took about 27 minutes, validation loss 1.127
  • NEFT (on top of DoRA) configuration and results
    • Same as the DoRA configuration, with neftune_noise_alpha=5 added to the SFTTrainer.
    • Result (20 steps): training took about 27 minutes, validation loss 1.116 / 1.117
  • LoRA+ (combined with Unsloth) configuration and results
    • Model loaded with Unsloth.
    • Custom optimizer with the LoRA B learning rate set to 2x that of the A matrices (loraplus_lr_ratio=2).
    • Result (20 steps): validation loss 1.10
  • Unsloth on its own
    • Model loaded with Unsloth, standard AdamW optimizer.
    • Result (20 steps): training took slightly under 11 minutes, validation loss 1.105
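For orientation, a sketch of what the baseline training arguments above might look like in code (standard transformers argument names; the exact notebook code may differ):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mistral-7b-chat-lora",  # assumed output path
    per_device_train_batch_size=4,      # batch size 4
    gradient_accumulation_steps=8,      # effective batch of 32 rows per step
    learning_rate=1e-4,
    optim="adamw_torch",
    max_steps=20,                       # short comparison run; use num_train_epochs=1 for a full run
    bf16=True,                          # bfloat16 on Ampere or newer GPUs
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    eval_steps=4,                       # several evaluations within the 20 steps
    logging_steps=1,
)
```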

Data and Statistics

The main figures the speaker observed in the 20-step chat fine-tuning experiments:

  • LoRA (baseline):
    • Training time: ~12 minutes
    • Validation loss: ~1.121
  • DoRA:
    • Training time: ~27 minutes
    • Validation loss: 1.127
  • DoRA + NEFT (noise_alpha=5):
    • Training time: ~27 minutes
    • Validation loss: ~1.116 - 1.117
  • Unsloth + LoRA+ (lr_ratio=2):
    • Training time: [not explicitly compared against Unsloth alone in the video; Unsloth itself took about 11 minutes]
    • Validation loss: 1.10
  • Unsloth (alone):
    • Training time: < 11 minutes
    • Validation loss: 1.105

Key Takeaways and Final Recommendations (Speaker 1)

  • General observations
    • The benefit and effort of each optimization differ; not every technique brings clear gains.
    • Results depend heavily on the specific fine-tuning task. For more complex tasks, these optimizations (especially LoRA+ and DoRA) may yield larger improvements.
  • Specific recommendations
    1. Unsloth: if the model is supported, use it for the speed-up. "it generally is a no-brainer to use that".
    2. NEFT: recommended. It is a simple parameter change, does not appear to hurt performance, and may improve results.
    3. DoRA: not recommended for now, because it significantly slows down fine-tuning; revisit once it has been further optimized and officially integrated into the Transformers library.
    4. LoRA+: may offer faster convergence or performance gains, but requires tuning the B-to-A learning-rate ratio. Unless you are running a long training task and are willing to do an up-front hyperparameter search, it is probably not worth the effort.