LoRA (Low-Rank Adaptation) Intro By Google Engineer | LLM Parameter-Efficient Fine-Tuning
LoRA Explained: An Innovative Approach to Efficient Fine-Tuning of Large Models
Tags
Media Details
- Upload date
- 2025-05-31 20:13
- Source
- https://www.youtube.com/watch?v=ptebgoTJle4
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hello, everyone. Ever felt like cutting-edge AI was out of reach? Low-Rank Adaptation, or LoRA, is breaking down those barriers by dramatically reducing the computational resources and time needed for fine-tuning. LoRA is making advanced AI customization accessible to a wider audience, from individual researchers with limited resources to smaller startups with big ideas. This democratization of AI power promises a surge of innovation from unexpected places. Want to be part of it? Let's take a look together. Let's start with a recap of the modern LLM training flow. This is an image from my LLM training video. The training process is often split into two phases: the first phase is called pre-training, and the second one is called post-training, or fine-tuning. The industry and research trend is to focus more and more on the post-training stage; for more details you can refer to my LLM training videos. Modern models are very large, with huge numbers of parameters. For example, GPT-4 is reported to have around 1.8 trillion parameters, and DeepSeek-V3 has 671 billion parameters. So updating the model weights requires a lot of computation, and storing the model weights requires a lot of storage. For modern LLMs, here come the million-dollar questions. The primary purpose of pre-training an LLM is to give it a vast, general understanding of language and the world from massive amounts of unlabeled data, whether text, images, code, or videos. In this phase it makes sense to touch all parameters in the model, since we are trying to get a generalized foundation model from scratch, starting from either zero or random weights depending on the initialization. However, the primary purpose of post-training is to refine and align the pre-trained model's capabilities and behavior. So do we still need to change all parameters, at full rank and dimension? The second question is: what if we need to make task-specific modifications to the foundation model? Do we need to do a full retrain every time? Is this even scalable for the wealthiest companies in the world, like Apple, Google, Microsoft, and Amazon? For example, what if Verily needs a model with extensive medical data on top of Gemini, or the YouTube Kids team needs additional safety checks on top of the Gemini model? From my personal experience, I can tell you the answers are no and no. We don't need to change all parameters at full rank and dimension in a lot of post-training cases, and it's not scalable to do a full retrain every time, even if you're working at one of the wealthiest companies in the world. Many of my personal projects and my team's projects are limited by GPU resources, and GPU resources are very, very tight for everyone in the industry. So what can we do instead? Before we answer that question, let's dive deeper by visualizing the training process. This is a visualization of the pre-training process: in this phase, the model weights are empty, and we pass in tons and tons of unlabeled data to fill in all the weights of the model. This is the visualization of full fine-tuning: all the weights are filled, but in this process they are all tweakable, and we are tweaking all the weights of the model. For more details, you can take a look at my LLM training video. I include prompt engineering here because prompt engineering was considered a pretty promising way for different teams to build on the same foundation model. It's been proven that it's not the best way, but I still want to include it here. In prompt engineering, we use all the existing weights; the weights of the model are all frozen.
We're just trying to retrieve the best response from the model. For more details, you can take a look at my prompt engineering video. With that said, although prompt engineering is not a good way to make quality improvements to an existing model, it's a pretty good way to make use of existing applications like Gemini and ChatGPT, so feel free to keep using prompt engineering with those apps. Given the previous cases, the obvious question is: what if we can do something in the middle? Let's say most of the weights of the model are frozen, but we can still improve and tweak the parameters that matter, the ones that are more important than the others. Is that possible? This is the motivation for parameter-efficient fine-tuning (PEFT), which is a big family of techniques, and today our focus is LoRA. LoRA stands for Low-Rank Adaptation. The original paper was published in 2021, and it has been very popular ever since because of its effectiveness and simplicity. LoRA approximates the model update, delta W, whose dimension is d by k, with two low-rank matrices, A and B. Matrix A has dimension d by r, and matrix B has dimension k by r, where r is usually a lot smaller than the minimum of d and k. TL;DR: we're trying to approximate delta W with the product of the A and B matrices. In this way, we can significantly reduce the fine-tuning time and also significantly reduce the checkpoint size. The intuition is that not all weights are equally important; some are a lot more important than others, say the attention layers. W stands for the weights of the model, and we can break it down into two parts: the first part is the frozen weight from the pre-training process, and we use A and B to approximate delta W, the delta that comes out of the post-training process. In this way, we change the problem of fine-tuning delta W, which has the same dimension as the original weight W, into fine-tuning A and B, which are a lot smaller than the original matrix W. r is a hyperparameter: a small value will shorten the training time and storage, but if the value is too small, it might cause information loss and hurt model quality. Empirically, r can range from 8 to 256, which is a lot smaller than a typical d and k. This is the comparison of the weight-update process between full fine-tuning and LoRA. With full fine-tuning, you can consider the input going through the pre-trained weights W and the weight update delta W; both of them have the same dimensions d and k, and after matrix multiplication with the input we get the outputs. With LoRA, the input still goes through the frozen pre-trained weights; however, instead of going through a weight update delta W with dimension d by k, it goes through matrices A and B, where r is the low-rank inner dimension, a predefined hyperparameter that is a lot smaller than d and k. We still do the same add operation after the input goes through the pre-trained weights and the A, B matrices. And here is an example of why we save a lot of parameters. Let's say d is 100,000, k is 200,000, and r is 16; by the way, those values are pretty typical, so it's not something I made up out of nowhere. Without LoRA, we have 20 billion parameters, and with LoRA, we only have 4.8 million parameters. That's more than 4,000 times fewer parameters with LoRA, so the difference is huge. These are the training details of LoRA. We need to initialize A and B first. The most common initialization strategy is: for matrix A, we initialize randomly using a Gaussian distribution with a small standard deviation; for matrix B, we initialize it with all zeros.
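A quick sanity check of the parameter arithmetic quoted above (this snippet is not from the video; it simply reproduces the numbers under the stated assumptions d = 100,000, k = 200,000, r = 16):

```python
# Back-of-the-envelope check of the LoRA parameter savings described above.
d, k, r = 100_000, 200_000, 16

full_update_params = d * k        # full-rank delta W: 20,000,000,000 (20 billion)
lora_params = r * (d + k)         # A (d x r) plus B (k x r): 4,800,000 (4.8 million)

print(f"full fine-tuning update: {full_update_params:,}")
print(f"LoRA (A and B):          {lora_params:,}")
print(f"reduction factor:        {full_update_params / lora_params:,.0f}x")  # ~4,167x
```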
The intuition of this initialization is that, at the beginning of fine-tuning, delta W should be zero to preserve the original pre-trained weights, and the random initialization of A allows for different initial directions of adaptation during training. B is zero, A follows a zero-mean Gaussian, and once A and B are initialized, they are the trainable parameters in the model. During the fine-tuning process with LoRA, the forward pass of a layer with LoRA can be defined by this formula, where h is the output and x is the input: the frozen weights and the low-rank update are both applied to x and the results are added. In back-propagation, the gradients are calculated only for the parameters in matrices A and B; the weights W_0 remain frozen. These gradients are then used to update A and B with an optimization algorithm, say Adam or SGD, to do gradient descent. It's all pretty standard and surprisingly simple. LoRA comes with a lot of advantages. The first one, apparently, is that it significantly reduces the trainable parameters, which results in lower computation costs. It uses less memory during training, allowing fine-tuning on less powerful hardware, say your own personal computer. It also enables faster training times: with fewer parameters to update, the training process converges much faster, so you can quickly iterate on a lot of prototypes. It also has a much smaller storage footprint: the LoRA adapters are very small compared to the full model, usually between 0.001% and 1% of its size, making them easier to store and share, even for personal users like us. Also, because the base model parameters are frozen and the low-rank A and B matrices effectively act as a regularizer, it's really hard to overfit. LoRA usually has better or comparable quality when the training dataset is limited, compared with full fine-tuning. The intuition is that for full fine-tuning to work, you need a lot more data to propagate through all the parameters in the model, but for LoRA, since there are a lot fewer parameters, you can do it with less data. The next one is task isolation. This is something really strong, I would say: multiple small task-specific adapters can be attached to a single base LLM. These adapters can be easily loaded and swapped depending on the task, without needing to store or load multiple fully fine-tuned models. This enables efficient multi-task learning or serving different applications with the same base model. Basically, different teams can work on a single base LLM and focus on different tasks without blocking each other, and you can combine your task-specific deltas for collaboration later. Let's say one team is working on a doc adapter and another one is working on a toy adapter; after they are done, we can combine these two adapters and get a toy-doc adapter. Pretty cool, right? As for LoRA's cons, there is the possible serving latency increase. After merging with the base model's weights, inference latency should not increase, since it's mathematically the same process. However, serving multiple checkpoints, the base LLM plus one or more LoRA adapter checkpoints, could result in a serving latency increase depending on the infrastructure, like RPC or in-memory; usually RPC adds more latency. This is a comparison between full fine-tuning, prompt engineering, and LoRA; let's take a quick look. For quality improvements, full fine-tuning usually has the best quality, and LoRA has close quality
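To make the mechanics described above concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer: the base weight is frozen, A starts from a small Gaussian, B starts at zero, and only A and B receive gradients. The class name, the `alpha` scaling factor, and the exact shapes are illustrative assumptions rather than anything shown in the video.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # W0 stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        d_out, d_in = base.weight.shape
        # A: small zero-mean Gaussian, B: zeros, so delta W = B @ A is zero at the start
        # and the model initially behaves exactly like the pre-trained one.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```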
or even better when the training data is limited to several thousand examples. Prompt engineering does not improve the model; it's just a way to get better responses. For tuning time, full fine-tuning takes a long time, LoRA is a lot shorter, within hours, and prompt engineering takes very little time, just the time to run the prompts. For tuning cost, full fine-tuning uses a lot more memory and chips compared with LoRA; LoRA has a lower cost, and prompt engineering has no tuning cost. Training-data-wise, full fine-tuning requires a large amount of data, LoRA requires a smaller amount of data, and prompt engineering requires no additional data. Model-storage-cost-wise, full fine-tuning requires large storage to save the full weights; LoRA only needs to save the adapter weights, so it's a lot smaller; and prompt engineering has no additional storage. Task-isolation-wise, full fine-tuning makes this really hard, since it requires a separate model for each task; task-specific LoRA adapters can be easily combined, swapped, and removed; and for prompt engineering, you can just use different prompts for different tasks. Serving-latency-wise, full fine-tuning and prompt engineering don't add any serving latency, while LoRA could add some serving latency depending on the infrastructure you're using. As for serving on a mobile device, with full fine-tuning it's almost impossible, because the base model is too big for mobile unless you are using a distilled version, which is a popular choice nowadays; with LoRA you can easily put the adapter weights on device; and with prompt engineering, of course you can use it on device. So, in short, if you're not looking to improve the quality of the model, prompt engineering is the way to go. Like I said, if you're just a casual user trying to get the best response from ChatGPT or Gemini, just focus on prompt engineering. However, if you are an entrepreneur or someone really interested in fine-tuning open-source LLMs, LoRA is something that can 10x or 100x your efficiency. It's really, really powerful and simple. LoRA adoption across the industry has been rapid: since the original publication in 2021, it has been adopted across essentially all kinds of services, open-source LLM fine-tuning applications, cloud tuning, and on-device tuning. As you can see from the Google Scholar results for the search term LoRA, it's exploding. I don't have the data for 2025 because it's just April. LoRA is not only used for fine-tuning; it's also used in reinforcement learning. For example, reinforcement learning with human feedback can use LoRA for both reward modeling and policy optimization, achieving comparable performance to full fine-tuning with significantly reduced computation cost. More commonly, this is referred to as PERL, parameter-efficient reinforcement learning. Different LoRA adapters can be trained for different RL tasks or environments using the same pre-trained backbone. I also want to briefly talk about QLoRA, which is quantized low-rank adaptation. Quantization, which I've talked about in my DeepSeek V3 video, reduces the number of bits used to represent model weights, and it significantly decreases memory usage. Since LoRA keeps the base model weights frozen, an intuitive optimization for LoRA is to do quantization on those frozen weights. So the pre-trained LLM's weights are frozen after being quantized to 4-bit NormalFloat; in this state, the model is very memory efficient. And then we do the LoRA integration: low-rank adapters are added to the chosen layers of the frozen, quantized model.
These adapters introduce a small number of trainable, high-precision parameters, the same concept as the LoRA we have gone through. Next we go to fine-tuning. Same as with LoRA, only the weights of the low-rank adapters are updated; the quantized base model remains frozen. During inference, we can either use the A and B matrices directly and add them on top of the original quantized weights, or we can de-quantize the original weights W_0 back to a higher precision and then add the adapter updates. This is the comparison between LoRA and QLoRA. As you can see, QLoRA uses quantization, it has roughly 75% lower GPU memory usage, and it can support 10x larger batch sizes due to the lower memory footprint. However, this comes with a price: training-speed-wise, LoRA is generally faster because it doesn't have the quantization and de-quantization steps, and it's also simpler to implement, because QLoRA needs to implement quantization techniques. All right, that's all I want to say about LoRA. The last thing before I say goodbye is that I want to briefly cover some of the other popular PEFT techniques. For example, adapter tuning. TL;DR: adapter tuning introduces small new neural-network modules, called adapters, into the existing architecture of the LLM. The weights of the original pre-trained LLM are frozen; only the parameters within these newly added adapters are trained on the task-specific data. And this is the architecture: basically, we are adding adapter neural networks into the existing transformer architecture. As you can see, this is a very similar concept to LoRA; it's trying to freeze the base LLM and only update a small set of parameters. The only difference is that adapter tuning introduces new neural networks into the architecture, whereas LoRA is a lot simpler; that's probably why LoRA is more popular right now. The adapter architecture is very similar to an autoencoder, and if you want to know more about autoencoders, take a look at my autoencoder video. It aims to limit the number of trainable parameters, and like an autoencoder, it has a down projection, which reduces the high-dimensional input into a low-dimensional space. It also has a non-linearity, applying a non-linear activation function like ReLU, and an up projection, projecting the lower-dimensional representation back to the original higher-dimensional space. It also has a residual connection, which improves gradient flow and addresses vanishing gradients by adding the output directly to the original input of that layer. Another PEFT technique built on top of LoRA is called DoRA, weight-decomposed low-rank adaptation. The motivation is that researchers found full fine-tuning and LoRA often show different patterns of weight updates, particularly in terms of magnitude and direction. DoRA aims to bridge that gap by allowing more nuanced updates to both aspects of the weights, that is, magnitude and direction. The essence is that it first does weight decomposition; this is the key innovation of DoRA. It decomposes the pre-trained weight matrices into two components, magnitude and direction. DoRA then fine-tunes both of these components, and it applies LoRA only to the directional component. This is because the directional component has a larger number of parameters, making low-rank adaptation efficient; the magnitude component has fewer, so we update it directly. This is a picture I got from the paper. So DoRA is a very interesting improvement built on top of LoRA.
There are so many other PEFT techniques that I'm not going to cover today, but I think we have covered the most popular one, which is LoRA. I hope you've found my video helpful and useful, and if you like it, please subscribe, comment, and like. See you next time. Bye.
Latest Summary (Detailed)
Overview / Executive Summary
In this video, a Google engineer (who describes himself as an ordinary software engineer) introduces LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) technique that lowers the barrier to fine-tuning large language models (LLMs). The core argument: modern LLMs have enormous parameter counts (e.g., GPT-4 is reported at roughly 1.8 trillion parameters and DeepSeek-V3 at 671 billion), so full fine-tuning is expensive in compute and storage, and in many post-training scenarios it is not necessary. LoRA approximates the weight update (ΔW) as the product of two low-rank matrices A and B (ΔW ≈ A * B), which reduces the number of trainable parameters dramatically (by more than 4,000x in the video's example) and therefore cuts the time, compute, and storage needed for fine-tuning.
LoRA's main advantages include: far fewer trainable parameters, lower compute cost, faster training, a very small storage footprint (adapters are typically 0.001% to 1% of the full model), resistance to overfitting (it can match or beat full fine-tuning when training data is limited), and strong task isolation (multiple small task-specific adapters can be trained on one base model and easily swapped or combined). Although inference latency should not increase once an adapter is merged into the base model, serving multiple adapter checkpoints can add latency depending on the infrastructure (e.g., RPC calls). The video also compares LoRA with full fine-tuning and prompt engineering, and notes that LoRA has spread rapidly since the 2021 paper, across open-source LLM fine-tuning, cloud tuning, on-device deployment, and reinforcement learning (e.g., PERL in RLHF). Finally, it briefly introduces QLoRA (which adds quantization to further reduce GPU memory) and other PEFT techniques such as Adapter Tuning and DoRA (which builds on LoRA by decomposing weights into magnitude and direction for fine-tuning).
LoRA in Detail
Motivation and Background
- Recap of the LLM training flow:
  - Pre-training: learn a broad, general understanding of language and the world from massive amounts of unlabeled data (text, images, code, video); this phase typically adjusts all model parameters.
  - Post-training / fine-tuning: refine and align the pre-trained model's capabilities and behavior; industry and research attention is increasingly focused on this stage.
- Challenges of modern LLMs:
  - Enormous parameter counts: for example, GPT-4 is reported to have about 1.8 trillion parameters, and DeepSeek-V3 has 671 billion.
  - Updating all weights is computationally expensive, and storing full model weights takes a lot of space.
- Key questions:
  - In the post-training stage, do we still need to change all parameters (full rank and dimension)?
  - If we need task-specific modifications to a foundation model, do we need a full retrain every time? Is that scalable even for the wealthiest companies (Apple, Google, Microsoft, Amazon)?
  - Speaker's view: from personal experience, the answers are no and no. Many projects are limited by GPU resources, and GPU capacity is tight across the industry.
- Visualizing the training process:
  - Pre-training: fill the model's empty weights with large amounts of unlabeled data.
  - Full fine-tuning: all weights are already filled, and all of them are tweaked during fine-tuning.
  - Prompt engineering:
    - Was once considered a promising way for different teams to build on the same foundation model.
    - All model weights are frozen; only the prompt is used to elicit the best response.
    - Speaker's view: "it has been proven not to be the best way" (for quality improvements), but it remains a good way to make use of existing applications such as Gemini and ChatGPT.
- The case for parameter-efficient fine-tuning (PEFT):
  - Look for a middle ground: freeze most of the model's weights and only tweak the parameters that matter most.
  - LoRA is one technique in the PEFT family.
How LoRA (Low-Rank Adaptation) Works
- Definition: the LoRA paper was published in 2021 and has been widely adopted because of its effectiveness and simplicity.
- Core idea: LoRA approximates the weight update ΔW (of dimension d x k) with two low-rank matrices A (d x r) and B (k x r), i.e., ΔW ≈ A * B, where the rank r is usually much smaller than the minimum of d and k. "We're trying to approximate ΔW with the product of the A and B matrices."
- Intuition: "not all weights are equally important; some matter much more than others (e.g., the attention layers)."
- Parameter reduction: fine-tuning A and B, which are much smaller than the original weight matrix W, greatly shortens fine-tuning time and shrinks checkpoint size. r is a hyperparameter with empirical values from 8 to 256; smaller values reduce training time and storage, but too small a value may lose information and hurt model quality.
- Weight-update process compared with full fine-tuning:
  - Full fine-tuning: the input goes through the pre-trained weights W and the weight update ΔW, both of dimension d x k.
  - LoRA: the input still goes through the frozen pre-trained weights W₀ and, in parallel, through the low-rank matrices A and B (inner dimension r); the final output is the sum of the two results.
- Parameter-savings example:
  - Assume d = 100,000, k = 200,000, r = 16 (described in the video as fairly typical values).
  - Without LoRA: d * k = 20 billion parameters.
  - With LoRA: r * (d + k) = 4.8 million parameters, i.e., "more than 4,000 times fewer parameters."
- Training details (a minimal training-step sketch follows this section):
  - Initialization:
    - Matrix A: random initialization from a Gaussian distribution with a small standard deviation.
    - Matrix B: initialized to all zeros.
    - Intuition: at the start of fine-tuning, ΔW should be zero so the original pre-trained weights are preserved; the random initialization of A allows adaptation in different directions during training.
  - Forward pass: h_output = W₀*x + B*A*x (optionally scaled, α * B*A*x).
  - Back-propagation: gradients are computed only for the parameters in A and B; W₀ stays frozen. A standard optimizer (e.g., Adam or SGD) performs gradient descent. "It's all pretty standard and surprisingly simple."
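As a small illustration of the back-propagation point above, a training step only needs an optimizer over the parameters that still require gradients (A and B). This sketch assumes the hypothetical `LoRALinear` module from the transcript section above; the data and loss are synthetic placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical setup: one frozen base layer wrapped with the LoRALinear sketch above.
model = LoRALinear(nn.Linear(64, 64), r=4)
x, y = torch.randn(8, 64), torch.randn(8, 64)

# Only A and B require gradients, so only they go into the optimizer.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# W0 receives no gradient; only the adapter parameters do.
print([name for name, p in model.named_parameters() if p.grad is not None])  # ['A', 'B']
optimizer.step()
optimizer.zero_grad()
```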
Pros and Cons of LoRA
- Pros:
  - Far fewer trainable parameters:
    - Lower compute cost.
    - Less memory during training, allowing fine-tuning on weaker hardware (e.g., a personal computer).
  - Faster training:
    - With fewer parameters to update, training converges faster, enabling quick prototype iteration.
  - Much smaller storage footprint:
    - LoRA adapters are tiny compared with the full model, typically 0.001% to 1% of its size, making them easy to store and share.
  - Hard to overfit:
    - The frozen base-model parameters and the low-rank A, B matrices effectively act as a regularizer. "When the training dataset is limited, LoRA usually reaches better or comparable quality" (versus full fine-tuning). Intuition: full fine-tuning needs much more data to propagate signal through all parameters, whereas LoRA has far fewer parameters and therefore needs less data.
  - Task isolation: "this is a really strong property."
    - Multiple small task-specific adapters can be attached to a single base LLM.
    - Adapters can be loaded and swapped per task without storing or loading multiple fully fine-tuned models.
    - Enables efficient multi-task learning or serving different applications from the same base model.
    - Different teams can build on one base LLM for different tasks without blocking each other.
    - Task-specific ΔWs can be combined for later collaboration, e.g., "one team builds a doc adapter and another builds a toy adapter; once both are done, we can combine the two adapters and get a toy-doc adapter."
- Cons (a small merge/swap sketch follows this list):
  - Possible serving latency increase:
    - "After merging into the base model's weights, inference latency should not increase, because it is mathematically the same process."
    - However, serving multiple checkpoints (the base model plus one or more LoRA adapter checkpoints) can add latency depending on the infrastructure (e.g., RPC calls vs. in-memory; RPC usually adds more).
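To illustrate the "mathematically the same process" point in the cons above, here is a small sketch of folding a trained adapter into the base weight so that inference is a single matrix multiply again, plus a note on swapping adapters instead. Names and the `alpha/r` scaling follow the hypothetical `LoRALinear` sketch from the transcript, not the video.

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> None:
    """Fold the low-rank update into the frozen base weight: W <- W0 + (alpha/r) * B @ A."""
    layer.base.weight += layer.scaling * (layer.B @ layer.A)
    # Zero out the adapter afterwards so the forward pass stays mathematically identical.
    layer.B.zero_()

# Task swapping instead of merging: keep one frozen base model in memory and load or replace
# only the tiny per-task (A, B) tensors, e.g. from separate adapter checkpoints.
```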
Comparing LoRA with Other Approaches
| Aspect | Full fine-tuning | Prompt engineering | LoRA |
|---|---|---|---|
| Quality improvement | Usually the best | Does not improve the model, only elicits better responses | Close to the best; can be better when data is limited (a few thousand samples) |
| Tuning time | Long | Minimal (just running prompts) | Much shorter (hours) |
| Tuning cost | Heavy memory and compute usage | No tuning cost | Lower |
| Training data | Large amounts required | No additional data | Smaller amounts |
| Model storage cost | Large (full weights) | No additional storage | Small (adapter weights only) |
| Task isolation | Hard (separate model per task) | Different prompts per task | Task-specific adapters easy to combine, swap, remove |
| Serving latency | No additional increase | No additional increase | May increase depending on infrastructure |
| On-device deployment | Nearly impossible (unless using a distilled model) | Usable | Adapter weights easily deployed on device |
- Speaker's recommendation:
  - If you only want the best responses from ChatGPT or Gemini and are not trying to improve the model itself, focus on prompt engineering.
  - "However, if you are an entrepreneur or someone genuinely interested in fine-tuning open-source LLMs, LoRA can 10x or 100x your efficiency. It is really powerful and simple."
Industry Applications and Extensions of LoRA
Industry Adoption
- Since the original paper in 2021, LoRA has been adopted rapidly across the industry.
- Use cases: essentially all kinds of services, fine-tuning applications for open-source LLMs, cloud tuning, and on-device deployment.
- "Google Scholar results for the search term LoRA are exploding" (no 2025 data yet, as the video was recorded in April 2025).
- Use in reinforcement learning:
  - For example, in RLHF (Reinforcement Learning with Human Feedback), LoRA can be used for both reward modeling and policy optimization, reaching performance comparable to full fine-tuning at significantly lower compute cost.
  - This is commonly referred to as PERL (Parameter-Efficient Reinforcement Learning).
  - Different LoRA adapters can be trained for different RL tasks or environments on the same pre-trained backbone.
QLoRA (Quantized Low-Rank Adaptation)
- Quantization: reduce the number of bits used to represent model weights (e.g., 4-bit NormalFloat), which greatly lowers memory usage.
- Core idea of QLoRA: since LoRA keeps the base-model weights frozen, an intuitive optimization is to quantize those frozen weights.
- Workflow:
  - The pre-trained LLM's weights are quantized (e.g., to 4 bits) and then frozen, putting the model in a memory-efficient state.
  - LoRA adapters are added to selected layers of the frozen, quantized model; they introduce a small number of trainable, high-precision parameters.
  - Fine-tuning: only the low-rank adapter weights are updated; the quantized base model stays frozen.
  - Inference: either apply the A, B matrices directly on top of the original quantized weights, or de-quantize the original weights W₀ back to higher precision and then add the adapter update.
- LoRA vs. QLoRA (a hedged setup sketch follows this list):
  - QLoRA:
    - "About 75% lower GPU memory usage."
    - Supports "10x larger batch sizes" thanks to the smaller memory footprint.
  - LoRA:
    - Usually faster to train (no quantization/de-quantization overhead).
    - Simpler to implement (QLoRA requires quantization machinery).
Other PEFT Techniques in Brief
Adapter Tuning
- Core idea: insert small new neural-network modules, called adapters, into the existing LLM architecture.
- The original pre-trained LLM weights are frozen; only the parameters of the newly added adapters are trained on task-specific data.
- Architecture:
  - Adapter networks are added to the existing Transformer architecture.
  - "This is a very similar concept to LoRA: freeze the base LLM and update only a small set of parameters."
  - Speaker's view: "The only difference is that adapter tuning introduces new neural networks into the architecture, whereas LoRA is much simpler. That is probably why LoRA is more popular right now."
- The adapter architecture resembles an autoencoder and limits the number of trainable parameters (a minimal module sketch follows this section):
  - Down projection: reduce the high-dimensional input to a lower-dimensional space.
  - Non-linearity: apply a non-linear activation function such as ReLU.
  - Up projection: map the low-dimensional representation back to the original higher-dimensional space.
  - Residual connection: add the output directly to the layer's original input, improving gradient flow and mitigating vanishing gradients.
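A minimal sketch of the adapter bottleneck described above (down projection, non-linearity, up projection, residual connection); the class name and bottleneck size are illustrative choices, not from the video.

```python
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Bottleneck adapter inserted into a frozen Transformer layer (illustrative sketch)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # down projection
        self.act = nn.ReLU()                                # non-linearity
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: add the adapter output back onto the original input,
        # improving gradient flow and keeping the frozen layer's behavior recoverable.
        return x + self.up(self.act(self.down(x)))
```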
DoRA (Weight-Decomposed Low-Rank Adaptation)
- Motivation: researchers observed that full fine-tuning and LoRA often show different patterns of weight updates, particularly in magnitude and direction. DoRA aims to close this gap by allowing more nuanced updates to both aspects of the weights.
- Key innovation (a conceptual sketch follows this section):
  - Weight decomposition: decompose each pre-trained weight matrix into two components, magnitude and direction.
  - DoRA fine-tunes both components.
  - LoRA is applied only to the directional component, because it holds the larger number of parameters and benefits most from low-rank adaptation; the magnitude component has few parameters and is updated directly.
- The speaker regards DoRA as a very interesting improvement built on top of LoRA.
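Here is a conceptual sketch of the decomposition described above, as I read it: the pre-trained weight is split into a per-column magnitude and a direction, LoRA updates the direction, and the magnitude is trained directly. The naming and the column-norm convention are my assumptions, so treat this as an approximation of the idea rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DoRALinearSketch(nn.Module):
    """Weight-decomposed low-rank adaptation, simplified for illustration."""

    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        W0 = base.weight.detach()                          # frozen pre-trained weight (out x in)
        self.register_buffer("W0", W0)
        # Magnitude: per-column norm of W0, trained directly (few parameters).
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))
        # Direction receives the LoRA update: V = W0 + B @ A.
        self.A = nn.Parameter(torch.randn(r, W0.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(W0.shape[0], r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W0 + self.B @ self.A                      # updated direction
        W = self.m * (V / V.norm(dim=0, keepdim=True))     # rescale columns by trained magnitude
        return x @ W.T
```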
Conclusion
The speaker recaps LoRA's core advantages and use cases as a parameter-efficient fine-tuning technique, emphasizing its potential to lower the barrier to LLM fine-tuning and to boost efficiency, and briefly introduces related techniques such as QLoRA, Adapter Tuning, and DoRA, aiming to give viewers interested in AI and LLM fine-tuning a useful overview.