AI Bites | LoRA (Low-Rank Adaptation of AI Large Language Models) for fine-tuning LLM models

LoRA: An Efficient Low-Rank Adaptation Method for Fine-Tuning Large Language Models

Media Details

Upload Date
2025-06-01 22:24
Source
https://www.youtube.com/watch?v=X4VvO3G6_vw
Processing Status
Completed
Transcription Status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: To build a custom model for our application, we start with a pretrained language model and fine-tune it on our own dataset. This used to be fine until we reached the large language model regime and started working with models such as GPT, LLaMA, and so on. These LLMs are quite bulky, so fine-tuning a model for different applications such as summarization or reading comprehension means deploying the model separately for each application. And the size of these models is only increasing, almost on a weekly or monthly basis, so the deployment of these bulky LLMs is getting increasingly challenging.

Now, one solution proposed for this problem is adapters. Adapters are trainable additional modules plugged into the neural network, mostly transformers. During fine-tuning, only the parameters of these adapter modules are updated, with the pretrained model frozen. Because adapters are additional parameters, they introduce latency during inference: for a batch size of 32, a sequence length of 512, and a small model of about half a million parameters, the fine-tuned model takes 149 milliseconds for inference, but with adapters it's two or three percent higher. So how does LoRA avoid this? Let's find out in this video.

Before that, I would like to give a quick shout-out to our X account, where we share high-impact papers and research news from top AI labs, both in academia and in industry. If you wish to keep up to date with AI every single day, just hit the follow button on X.

LoRA stands for low-rank adaptation. So what does that mean? For any neural network architecture, let's not forget that the weights of the network are just large matrices of numbers. All matrices come with a property called the rank. The rank of a matrix is the number of linearly independent rows or columns of that matrix. To understand it, let's take a simple 3x3 matrix. The rank of the simple 3x3 matrix at the top is one. Why? Because the first and the second columns are redundant: they are just multiples of the third column. In other words, the two columns are linearly dependent and don't bring any meaningful information. Now, if we simply change one of the values to, say, 70, the rank becomes two, as we now have two linearly independent columns.

Knowing the rank of a matrix, we can do a rank decomposition of that matrix into two matrices. Going back to our 3x3 example, it can simply be written as the product of two matrices, one with dimension 3x1 and the other with dimension 1x3. Notice that we only have to store six numbers after decomposition, instead of the nine numbers in the 3x3 matrix. This may not sound like much, but in reality neural network weights have very high dimensions of, say, 1024 by 1024, and so using a rank of two boils down to a really small number of values that we need to store, and hence to multiply when we actually want to do some computation, which is a big reduction in computation.

So wouldn't it be nice if these weights actually had a low rank, so that we could work with the rank decomposition instead of the entire weight matrix? It turns out that's indeed the case with pretrained models, as shown by this earlier work. They empirically show that common pretrained models have a very low intrinsic dimension; in other words, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space. Let's say we are starting with a pretrained model with weights W zero.
After fine-tuning, let the weights be updated to W zero plus delta W. If the pretrained model has low-rank weights, it is a fair hypothesis to assume that the fine-tuned weight update is also low rank. LoRA goes with this assumption: because delta W is low rank, we can decompose that matrix into two low-rank matrices, A and B, whose product BA gives delta W. Fine-tuning then becomes updating W zero plus BA instead of W zero plus delta W, as it's one and the same.

With that perspective, we start training the model with input x. The input passes through both the pretrained weights and the rank-decomposition matrices A and B. The weights of the pretrained model remain frozen, but we still use the output of the frozen model during training. The outputs of both the frozen model and the low-rank path are summed to obtain the output latent representation h. Mathematically, it's represented by this one-line equation, where the input x is multiplied by both the W zero and BA matrices and summed to obtain the hidden representation h.

Now, you may ask, what about latency during inference? If we slightly modify the above equation, we notice that we can merge, or add, the weights BA into the pretrained weights W zero. So for inference, it is this merged weight that is deployed, thereby overcoming the latency bottleneck. One of the other concerns is the deployment of LLMs, as they are quite bulky, say about 50 GB or 70 GB. Let's say we have to fine-tune for two tasks, namely summarization and translation. We don't have to deploy the entire model every time we fine-tune. We can simply fine-tune the LoRA layers specific to the task, for example summarization, and deploy the model for summarization. Similarly, we can deploy LoRA layers specific to translation. Thus LoRA overcomes both the deployment and latency bottlenecks faced by modern-day large language models.

In terms of applying it to transformers, we all know that transformers have two main modules: multi-headed self-attention and multilayer perceptrons, or MLPs. The self-attention modules are composed of query, key, value, and output weights. In this paper, they have limited their study to adapting only the attention weights for downstream tasks, with the MLP modules frozen so they are not trained on downstream tasks, which means that LoRA is applied just to the self-attention module.

Now, we have been talking about using LoRA for adaptation. One of the key parameters in LoRA is the rank, which is something we have to choose. So what is the optimal rank for LoRA? It turns out, to everyone's surprise, that a rank as small as one is sufficient for adapting both the query and the value matrices. However, when adapting the query alone, it needs a larger rank of, say, four or eight or even 64.

Moving on to how we can practically use LoRA: there is the official implementation from Microsoft, which is released as loralib and is available under the MIT license. Another option is the Hugging Face repo called PEFT, which stands for parameter-efficient fine-tuning; PEFT is available under the Apache 2.0 license. PEFT also has a few other implementations, such as prefix tuning and prompt tuning, and LoRA is one of the earliest implementations in the library. I think that pretty much covers the important bits about LoRA. I hope this video was useful in understanding how LoRA works. I hope to see you in my next video; until then, take care.

Latest Summary (Detailed Summary)

Generated at 2025-06-01 22:26

Overview / Executive Summary

The video gives a detailed introduction to LoRA (Low-Rank Adaptation), a parameter-efficient method for fine-tuning large language models (LLMs). As LLMs (such as GPT and LLaMA) keep growing in size (up to 50-70 GB), fine-tuning and deploying a full model separately for each application (e.g., summarization, reading comprehension) becomes extremely challenging. Traditional adapters reduce the number of trainable parameters but introduce inference latency. LoRA assumes that the weight-update matrix delta_W of a pretrained model is low rank and decomposes it into two smaller low-rank matrices A and B (delta_W = B * A). During fine-tuning, only these two small matrices are trained while the original model weights W_zero remain frozen, which dramatically reduces the number of trainable parameters and the storage requirements. Crucially, at inference time the product B*A can be merged into W_zero (W_merged = W_zero + B*A), eliminating the extra computation and inference latency and thereby resolving the main pain point of adapters. LoRA is applied in particular to the attention weights of Transformer models. The study shows that even a very small rank (such as 1) is sufficient to adapt the query and value matrices effectively. The technique not only lowers compute and storage costs but also simplifies multi-task deployment, offering an efficient way to fine-tune large models on a limited budget.

Challenges of Fine-Tuning Large Language Models

  • Huge model size: Large language models (such as GPT, LLaMA, and others) are enormous, e.g., around 50 GB or 70 GB.
  • Deployment difficulty: Fine-tuning and deploying a separate model instance for every application (e.g., text summarization, reading comprehension) becomes increasingly challenging as model sizes keep growing, almost weekly or monthly.
  • Traditional fine-tuning: One usually starts from a pretrained language model and fine-tunes it on a custom dataset; in the LLM era this approach runs into the challenges above.

Adapters and Their Limitations

  • Definition: Adapters are trainable add-on modules, typically plugged into a neural network (mostly Transformers).
  • How they work: During fine-tuning, only the parameters of the adapter modules are updated, while the parameters of the pretrained model stay frozen.
  • Limitations
    • Added inference latency: Because adapters are extra parameters, they introduce latency during inference.
    • Example figures: The speaker mentions that for a batch size of 32, a sequence length of 512, and a small model of roughly half a million parameters, the fine-tuned base model takes 149 ms per inference, while adapters add another 2-3% on top.

LoRA (Low-Rank Adaptation) Overview

  • Full name: Low-Rank Adaptation.
  • Core idea: Exploit the rank property of weight matrices. The weights of a neural network are essentially large matrices of numbers.

The Principle of Rank Decomposition

  • Rank of a matrix: The number of linearly independent rows or columns of the matrix.
    • Example: For a 3x3 matrix whose first and second columns are multiples of the third column (i.e., linearly dependent), the rank is 1. Changing one value so that two columns become linearly independent raises the rank to 2.
  • Rank decomposition: A matrix of known rank can be factored into the product of two matrices (see the sketch after this list).
    • Example: A 3x3 matrix (9 numbers) can be written as the product of a 3x1 matrix and a 1x3 matrix, so only 3+3=6 numbers need to be stored after decomposition.
  • Benefits
    • Less storage: For a high-dimensional matrix (e.g., 1024x1024), a low-rank decomposition (e.g., rank 2) drastically reduces the number of values that must be stored.
    • Less computation: With far fewer stored values, the actual computation (e.g., the multiplications) is also much cheaper.
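A minimal NumPy sketch of the idea (the specific matrix values are illustrative, not taken from the video):

```python
import numpy as np

# A rank-1 3x3 matrix: the first two columns are just multiples of the third.
W = np.array([[2.,  4., 2.],
              [4.,  8., 4.],
              [6., 12., 6.]])
print(np.linalg.matrix_rank(W))    # 1

# Rank decomposition: store a 3x1 and a 1x3 factor (6 numbers) instead of 9.
B = np.array([[2.], [4.], [6.]])   # 3x1
A = np.array([[1., 2., 1.]])       # 1x3
assert np.allclose(B @ A, W)

# For a 1024x1024 weight with rank r = 2, the savings are far larger:
d, r = 1024, 2
print(d * d)          # 1048576 values for the full matrix
print(d * r + r * d)  # 4096 values for the two factors
```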

LoRA: Core Idea and Motivation

  • Low intrinsic dimension of pretrained models: An earlier work (the motivation paper mentioned in the video) showed empirically that common pretrained models have a very low "intrinsic dimension"; that is, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space.
  • LoRA's hypothesis
    • If the pretrained weights W_zero have a low rank (or a low intrinsic dimension),
    • then the weight update delta_W produced by fine-tuning (so that the new weights are W_zero + delta_W) can also reasonably be assumed to be low rank.
  • What LoRA does: Based on the assumption that delta_W is low rank, LoRA decomposes it into the product of two low-rank matrices A and B, i.e., delta_W = B * A (in the video: "the product BA leads to delta W").
  • Fine-tuning objective: Fine-tuning therefore changes from updating W_zero + delta_W to updating W_zero + B * A (a brief check of this parameterization follows below).
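A quick NumPy check of why this parameterization enforces the low-rank assumption: the product of a d x r matrix and an r x d matrix can never have rank greater than r (the dimensions below are illustrative):

```python
import numpy as np

d, r = 1024, 4                  # illustrative dimensions
B = np.random.randn(d, r)       # d x r
A = np.random.randn(r, d)       # r x d
delta_W = B @ A                 # a full d x d update, but with rank <= r

print(np.linalg.matrix_rank(delta_W))  # 4
print(B.size + A.size)                 # 8192 trainable values instead of d*d = 1048576
```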

The LoRA Training Procedure

  1. Input: The input x is passed through both the pretrained weights W_zero and the rank-decomposition matrices A and B.
  2. Frozen weights: The pretrained weights W_zero stay frozen and receive no gradient updates.
  3. Parameter updates: Only the parameters of the low-rank matrices A and B are trained.
  4. Output computation (see the sketch below)
    • The output of the frozen path W_zero and the output of the low-rank path B*A are summed to give the hidden representation h.
    • Mathematically: h = W_zero * x + (B * A) * x
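A minimal PyTorch sketch of this forward pass. The class name LoRALinear and the initialization details are illustrative assumptions, not the official loralib or PEFT implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, pretrained: nn.Linear, r: int = 4):
        super().__init__()
        d_out, d_in = pretrained.weight.shape
        self.W0 = pretrained
        self.W0.weight.requires_grad_(False)   # freeze the pretrained weights
        if self.W0.bias is not None:
            self.W0.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in, trainable
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, trainable

    def forward(self, x):
        # h = W0 x + (B A) x : frozen path plus low-rank path, summed
        return self.W0(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024), r=2)
h = layer(torch.randn(8, 1024))   # only A and B receive gradients
```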

LoRA Inference and Its Advantages

  • Inference optimization: Rewriting the training equation h = W_zero * x + (B * A) * x gives h = (W_zero + B * A) * x.
  • Weight merging: Before deployment, the trained product B * A can be merged into the original pretrained weights to form a single weight matrix W_merged = W_zero + B * A.
  • No extra inference latency: Because the weights are merged, inference uses only W_merged, which has the same structure as the original model, so no additional latency is introduced; this removes the latency bottleneck of adapters.
  • Solving the deployment bottleneck (see the sketch after this list)
    • When an LLM has to be fine-tuned for several tasks (e.g., summarization, translation), there is no need to deploy a full copy of the huge model for each task.
    • A single shared base model can be deployed, and the lightweight LoRA layers (the matrices A and B) for each specific task are loaded and used as needed, greatly reducing storage and deployment costs.
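Continuing the illustrative LoRALinear sketch above (again, not the official API), merging and per-task swapping could look roughly like this:

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> torch.nn.Linear:
    """Fold the trained update B @ A into W0 so inference uses a single matmul."""
    merged = torch.nn.Linear(layer.W0.in_features, layer.W0.out_features,
                             bias=layer.W0.bias is not None)
    merged.weight.copy_(layer.W0.weight + layer.B @ layer.A)  # W_merged = W0 + B*A
    if layer.W0.bias is not None:
        merged.bias.copy_(layer.W0.bias)
    return merged

# One frozen base model plus one small (A, B) pair per task:
# load the summarization pair (or the translation pair), merge, and serve --
# there is no need to store or deploy a full 50-70 GB model per task.
```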

Applying LoRA to Transformer Models

  • Main Transformer modules: multi-headed self-attention (MHSA) and the multilayer perceptron (MLP).
  • Attention weights: The self-attention module contains the query (Q), key (K), value (V), and output (O) weight matrices.
  • Scope in the LoRA paper: In the cited paper (LoRA: Low-Rank Adaptation of Large Language Models), the study is limited to adapting the attention weights for downstream tasks, while the MLP modules are kept frozen and not trained on downstream tasks.
    • In other words, LoRA is applied only to the self-attention module (a hedged wiring sketch follows below).
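Sticking with the illustrative LoRALinear sketch, restricting LoRA to the attention projections might look like the following; the attribute names q_proj and v_proj are assumptions about a model's layout, not something specified in the video:

```python
import torch.nn as nn

def add_lora_to_attention(model: nn.Module, r: int = 2) -> nn.Module:
    """Freeze everything, then wrap only the attention query/value projections."""
    for p in model.parameters():
        p.requires_grad_(False)                  # MLP and all other weights stay frozen
    for module in list(model.modules()):
        for name in ("q_proj", "v_proj"):        # assumed projection attribute names
            child = getattr(module, name, None)
            if isinstance(child, nn.Linear):
                setattr(module, name, LoRALinear(child, r=r))  # only A and B will train
    return model
```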

Choosing the LoRA Rank

  • The rank r is the key hyperparameter: choosing a suitable rank is crucial for LoRA's performance.
  • Empirical findings on the rank
    • Surprisingly, a rank as small as r=1 is sufficient for adapting the query (Q) and value (V) matrices together.
    • However, when adapting the query (Q) matrix alone, a larger rank may be needed, e.g., r=4, r=8, or even r=64.

LoRA Implementation Libraries

  • Microsoft LoRA library
    • Released officially by Microsoft as loralib.
    • Available under the MIT license.
    • Official code: https://github.com/microsoft/LoRA
  • Hugging Face PEFT library (usage sketch after this list)
    • The Hugging Face repository is called PEFT (Parameter-Efficient Fine-Tuning).
    • Available under the Apache 2.0 license.
    • PEFT implements several parameter-efficient fine-tuning techniques, such as prefix tuning and prompt tuning; LoRA was one of the earliest methods implemented in the library.
    • PEFT library: https://github.com/huggingface/peft
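A short sketch of how LoRA is commonly applied through the PEFT library. The base model name and the target_modules value below are illustrative and depend on the architecture being fine-tuned; consult the PEFT documentation for the exact module names:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

config = LoraConfig(
    r=4,                        # the rank; the video notes even r=1 can suffice for Q and V
    lora_alpha=16,              # scaling factor applied to the low-rank update
    target_modules=["c_attn"],  # attention projection layer(s); name is model-dependent
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
# ...train as usual; afterwards the LoRA weights can be merged into the base model...
```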

Key Takeaways

LoRA achieves parameter-efficient fine-tuning by applying a low-rank decomposition to the weight updates of large language models. It greatly reduces the number of trainable parameters and lowers the storage and compute cost of deploying models for multiple tasks. Most importantly, by merging the weights before inference, LoRA avoids the extra inference latency of traditional adapter methods, making it a practical and efficient technique for fine-tuning and deploying large models under resource constraints.