AI Bites | QLoRA paper explained (Efficient Finetuning of Quantized LLMs)

QLoRA Explained: Three Key Innovations for Efficient Fine-Tuning of Quantized Large Language Models

Media Details

Upload date
2025-06-01 22:22
Source
https://www.youtube.com/watch?v=6l8GZDPbFn8
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: In my previous video we looked at LoRA, or low-rank adaptation. LoRA is quite effective for deploying large models and is also fast for inference, thereby solving the inference problem for fine-tuned LLMs. However, when it comes to training, LoRA doesn't do the trick. For example, to fine-tune a LLaMA 65-billion-parameter model, LoRA needs 780 GB of GPU memory. That's about 16 A40 GPUs. The answer to this problem lies with QLoRA, where Q stands for quantization. The main motivation for QLoRA is to achieve fine-tuning on a single GPU. It does this with just three innovations, namely: 4-bit NormalFloat, a new data type that is information-theoretically optimal for normally distributed weights; double quantization, to reduce the average memory footprint by quantizing the quantization constants; and paged optimizers, to manage memory spikes. In this video, let's look at all three of these novelties and understand QLoRA. Without further ado, let's get started. Let's start with quantization, which is fundamental to QLoRA. Simply put, quantization works by rounding and truncating in order to simplify the input values. For the sake of simplicity, consider that we are quantizing from Float16 to Int4. Now, Int4 has a range of minus eight to seven. As we only have four bits to work with, we can only have 2^4, which is 16 bins, to quantize into. So any input float value needs to be mapped to the center of one of these 16 bins. Getting into neural networks, the inputs are tensors, which are large matrices, and they are usually normalized between -1 and 1 or zero and one. Let's consider the case of a simple tensor with three values, say -0.96, 0.187, and 0.886. We are lucky with this example, as the values are distributed fairly evenly across the normalized range, which means that when we quantize to Int4, each of these three numbers takes a unique bin. Let's take a slightly different example where the input values are no longer evenly distributed in the input range. Let two inputs be close together, with one far apart. If we now try to quantize to Int4, the first two numbers fall in the same bin, while the third one is fine. We don't want this. Why? Because if you ever want to dequantize and convert back to Float16, the two numbers no longer convert back to unique values. In other words, we lost valuable information through quantization error. One way to overcome this problem is to divide the input range into separate blocks. In this example, we have three blocks and we quantize each block separately, with each having its own range. So now the two values which are pretty close together find different bins within a block, and the third one never had a problem, so it's fine. By dividing into blocks, we quantize each block independently, and so each block comes with its own quantization parameters, usually a quantization constant c. In this example, they are c1, c2 and c3. What we just saw is block-wise quantization, which we illustrated with three blocks. But practically, QLoRA uses a block size of 64 for the weights, for high quantization precision. Talking of the weights, one of the interesting properties of pretrained neural network weights is that they are normally distributed, centered around zero, which means there is a very high probability of input values occurring close to zero rather than around minus one or plus one.
But our standard quantization to Int4 is not aware of this fact, and so it goes by the assumption that each of the 16 bins has an equal probability of receiving values. To address this problem with standard quantization, we can develop a slightly specialized type of quantization which considers the normal distribution of the neural network weights. That is exactly what QLoRA does, and it names it NormalFloat. In NormalFloat, the bins are weighted by the normal distribution, and hence the spacing between two quantization values is far apart near the extremes of -1 and 1, but close together as you get closer to zero. To throw some additional light on this, the green dots show the 4-bit NormalFloat quantization versus the standard 4-bit quantization shown in blue dots. Let's now move on to the next contribution of the paper, which is double quantization. Because the intention of QLoRA is to train on a single GPU, it's essential to squeeze out every bit of memory possible. Recall block-wise quantization: we saw that we use a block size of 64 for the weights, and each of these blocks has a quantization constant c. So double quantization is the process of quantizing the quantization constants c for additional memory savings. And through double quantization, we gain half a bit per parameter on average. So the last and third piece of the puzzle is paged optimizers. Paged optimizers prevent memory spikes when we abruptly get a really long input, especially when we are working with a single GPU. Let's say we are working with documents, and suddenly we have a really long document while using a single GPU for training. This spike in sequence length generally breaks the training because of memory issues. So to overcome this, the state of the optimizer, say Adam, is moved from the GPU memory to the CPU memory until the long sequence is read. Then, when the GPU memory is free, the optimizer state is moved back to the GPU. At a high level, that's what happens if we leverage paged optimizers. Now, in terms of implementation, paged optimizers are part of the bitsandbytes library, and you can enable or disable them during your QLoRA training by simply setting the paging flag on or off. Putting together the above-mentioned three components, QLoRA efficiently uses one low-precision storage data type, in our case usually 4-bit, and one computation data type, which is usually BFloat16. Now what does that mean? Going back to LoRA, it means that in order to optimize for memory, the weights of the model are stored in 4-bit. This enables us to load the weights onto a single GPU, and the loaded weights are converted into BFloat16 for the purpose of computing gradients during backpropagation. To link it to LoRA, let's go back and look at this equation from LoRA, where x is the input, W_0 is our pretrained model weight, and A and B are the low-rank matrix decompositions. With QLoRA, our input x is BFloat16 and our weights are stored as 4-bit. During computation of gradients, the weights and quantization constants go through a double dequantization, which is the reverse of quantization. It happens by first dequantizing the quantization constants c1 and c2. Then, using the constants, we once again dequantize the weights to BFloat16, which is used to compute the gradients and hence train, or fine-tune, the model.
If you're wondering how good all this NormalFloat stuff and double quantization is, the authors of QLoRA experimented with four datasets and show that in all four cases, using NormalFloat and double quantization improves the mean zero-shot accuracy of training compared to simply using float. In terms of the GLUE score, QLoRA is able to replicate the accuracy of 16-bit LoRA and full fine-tuning. The authors concluded that 4-bit QLoRA with the NormalFloat data type matches 16-bit full fine-tuning and 16-bit LoRA fine-tuning performance on academic benchmarks with well-established evaluation setups. So if you're someone who's interested in fine-tuning on a single GPU and would like to fine-tune a model to match the performance of standard fine-tuning on multiple GPUs, then QLoRA is the way to go. I hope that was a useful insight into QLoRA. I will see you in my next video. Till then, take care.

Latest Summary (Detailed Summary)

Generated 2025-06-01 22:26

Overview / Executive Summary

This video gives a detailed explanation of QLoRA, a technique designed to make efficient fine-tuning of large language models (LLMs) possible on a single GPU. Conventional LoRA works well for inference, but training large models runs into a GPU-memory bottleneck (a LLaMA 65B-parameter model, for example, requires about 780 GB of GPU memory). QLoRA addresses this with three core innovations: the 4-bit NormalFloat (NF4) data type, Double Quantization (DQ), and Paged Optimizers.

NF4 is a new 4-bit data type that is information-theoretically optimal for normally distributed weights: it places quantization bins densely near zero and sparsely toward the extremes, representing neural network weights more precisely. Double quantization re-quantizes the quantization constants themselves, compressing the model further and saving roughly 0.5 bits per parameter on average. Paged optimizers manage memory spikes by paging the optimizer state (e.g., Adam) from GPU memory to CPU memory when GPU memory cannot accommodate a long input sequence, preventing training from being interrupted.

During fine-tuning, QLoRA stores the model weights in the 4-bit NF4 format to save memory; during backpropagation, the weights and quantization constants go through a double dequantization back to BFloat16 for gradient computation. Experiments show that QLoRA with NF4 and double quantization improves mean zero-shot accuracy on several datasets, and on academic benchmarks such as GLUE, 4-bit QLoRA matches the performance of 16-bit full fine-tuning and 16-bit LoRA fine-tuning. QLoRA therefore offers an effective path to fine-tuning large language models in resource-constrained settings, especially on a single GPU.

QLoRA: Introduction and Motivation

  • Limitations of LoRA (Low-Rank Adaptation):
    • LoRA is very effective for deploying large models and for fast inference, solving the inference problem for fine-tuned LLMs.
    • For training, however, LoRA still requires a large amount of GPU memory. For example, fine-tuning a LLaMA 65-billion-parameter model with LoRA takes 780 GB of GPU memory, roughly 16 A40 GPUs.
  • The goal of QLoRA (Quantized LoRA):
    • The "Q" in QLoRA stands for Quantization.
    • Its main motivation is to enable fine-tuning of large language models on a single GPU.

The Three Core Innovations of QLoRA

QLoRA achieves its goal through the following three main innovations:

  1. 4-bit NormalFloat (NF4): a new data type that is information-theoretically optimal for normally distributed weights.
  2. Double Quantization: reduces the average memory footprint by quantizing the quantization constants.
  3. Paged Optimizers: manage memory spikes.

Fundamentals of Quantization

  • Basic principle:
    • Quantization simplifies input values by rounding and truncation.
    • For example, quantizing from Float16 to Int4: Int4 covers the range -8 to 7, because 4 bits can only represent 2^4 = 16 distinct values (or "bins").
    • Any input floating-point value must be mapped to the center of one of these 16 bins.
  • Quantization in neural networks:
    • The inputs are tensors (large matrices), usually normalized to between -1 and 1 or 0 and 1.
  • The problem with quantization: information loss
    • If the input values are unevenly distributed across the input range, several nearby values may be quantized into the same bin, and on dequantization they can no longer be recovered as their original, distinct values.
    • This results in quantization error, i.e., a loss of valuable information.
  • Block-wise quantization (a minimal sketch follows this list):
    • Principle: divide the input range into several independent blocks and quantize each block separately. Each block keeps its own quantization parameters, typically a quantization constant c.
    • Effect: after blocking, even two values that are very close in the original input can end up in different quantization bins within their block, reducing the information loss.
    • In practice: QLoRA uses a block size of 64 for the weights to achieve high quantization precision.
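To make block-wise quantization concrete, here is a minimal NumPy sketch of symmetric absmax quantization onto a 4-bit-style integer grid, applied per block of 64 values with one quantization constant kept per block. The function names and the simple rounding scheme are illustrative assumptions, not the exact kernels used by QLoRA/bitsandbytes.

```python
import numpy as np

def absmax_quantize(x, n_levels=16):
    """Symmetric absmax quantization of a block to integer codes (illustrative)."""
    c = np.abs(x).max() + 1e-12                 # quantization constant for this block
    q = np.round(x / c * (n_levels // 2 - 1))   # map values into roughly -7..7
    return q.astype(np.int8), c

def absmax_dequantize(q, c, n_levels=16):
    """Inverse mapping back to float using the stored constant c."""
    return q.astype(np.float32) / (n_levels // 2 - 1) * c

def blockwise_quantize(x, block_size=64):
    """Quantize each block of 64 weights independently, one constant c per block."""
    blocks = x.reshape(-1, block_size)
    codes, consts = zip(*(absmax_quantize(b) for b in blocks))
    return np.stack(codes), np.array(consts)

weights = (np.random.randn(256) * 0.1).astype(np.float32)   # toy "weight" tensor
codes, consts = blockwise_quantize(weights)
recon = np.concatenate([absmax_dequantize(q, c) for q, c in zip(codes, consts)])
print("max round-trip error:", np.abs(weights - recon).max())
```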

4-bit NormalFloat (NF4)

  • Limitation of standard quantization:
    • An interesting property of pretrained neural network weights is that they are approximately normally distributed around zero, so values close to zero are far more likely than values near -1 or +1.
    • Standard Int4 quantization ignores this property and assumes that every quantization bin is equally likely to receive values.
  • How NormalFloat works (an illustrative sketch follows this list):
    • NF4 is a quantization data type designed specifically for the normal distribution of neural network weights.
    • In NF4, the quantization bins are weighted by the normal distribution.
    • This means that quantization levels near the extremes (-1 and 1) are spaced farther apart, while the levels become closer and denser toward zero.
    • The video illustrates this by contrasting the 4-bit NormalFloat levels (green dots) with the standard 4-bit levels (blue dots); NF4 is denser near zero.
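To see why normal-aware levels crowd around zero, the illustrative sketch below places 16 quantization levels at equally spaced quantiles of a standard normal distribution, rescales them to [-1, 1], and compares their spacing with a uniform 16-level grid. This only conveys the idea; the actual NF4 code values in the paper are derived differently (they are information-theoretically optimal and include an exact zero).

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_levels(n_levels=16):
    """Illustrative 'normal-float-like' levels: equally spaced quantiles of
    N(0, 1), rescaled into [-1, 1]. Not the exact NF4 code values."""
    nd = NormalDist()
    # Keep probabilities away from 0 and 1 so the extreme quantiles stay finite.
    probs = np.linspace(0.02, 0.98, n_levels)
    levels = np.array([nd.inv_cdf(p) for p in probs])
    return levels / np.abs(levels).max()

uniform_levels = np.linspace(-1, 1, 16)        # standard 4-bit grid: equal spacing
normal_levels = normal_quantile_levels(16)     # denser near 0, sparser near ±1

print("uniform gaps    :", np.round(np.diff(uniform_levels), 3))
print("normal-like gaps:", np.round(np.diff(normal_levels), 3))
```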

Double Quantization (DQ)

  • Purpose and effect (a back-of-the-envelope sketch follows this list):
    • Because QLoRA aims to train on a single GPU, every last bit of memory must be squeezed out.
    • Recall block-wise quantization: each block (QLoRA uses a block size of 64) has its own quantization constant c.
    • Double quantization quantizes these quantization constants c themselves, yielding additional memory savings.
    • On average, double quantization saves about half a bit per parameter.
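As a back-of-the-envelope sketch of where the saving comes from: assume one FP32 quantization constant per 64-weight block, and (as an assumption about the scheme) that double quantization stores those constants in 8 bits with one FP32 second-level constant per 256 first-level constants. Eliminating most of the resulting 0.5-bit overhead is what the video rounds to roughly half a bit per parameter.

```python
# Per-parameter overhead of the quantization constants (illustrative arithmetic).
block_size = 64      # first-level block size used for the 4-bit weights
fp32_bits = 32

# Without double quantization: one FP32 constant per 64 weights.
overhead_plain = fp32_bits / block_size  # = 0.5 bits per parameter

# With double quantization (assumed scheme): constants stored in 8 bits, plus
# one FP32 second-level constant per 256 first-level constants.
dq_block_size = 256
overhead_dq = 8 / block_size + fp32_bits / (block_size * dq_block_size)

print(f"constants overhead without DQ: {overhead_plain:.3f} bits/param")
print(f"constants overhead with DQ:    {overhead_dq:.3f} bits/param")
print(f"saving:                        {overhead_plain - overhead_dq:.3f} bits/param")
```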

Paged Optimizers

  • Problem it solves:
    • When training on a single GPU, an unexpectedly long input sequence (for example, a very long document) usually causes an out-of-memory failure that interrupts training. Paged optimizers prevent such memory spikes.
  • How it works (high level):
    • When GPU memory cannot accommodate a long sequence, the optimizer state (e.g., the Adam optimizer state) is moved from GPU memory to CPU memory until the long sequence has been processed.
    • Once GPU memory is freed, the optimizer state is moved back to the GPU.
  • Implementation (a usage sketch follows this list):
    • Paged optimizers are part of the bitsandbytes library.
    • They can be enabled or disabled during QLoRA training by setting a paging flag (the exact name is not given in the video; presumably is_paged) on or off.
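As a usage sketch (not taken from the video), paged optimizers are typically picked up either directly from bitsandbytes or via the Hugging Face Trainer; the class and argument names below reflect recent bitsandbytes/transformers releases and should be verified against the installed versions.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128)  # stand-in for the model being fine-tuned

# Paged 32-bit AdamW: its optimizer state lives in paged memory that can spill
# from GPU to CPU when an unusually long sequence causes a memory spike.
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)

# With the Hugging Face Trainer, the same behaviour is usually requested via
# TrainingArguments(optim="paged_adamw_32bit", ...).
```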

The QLoRA Finetuning Process

  • Data types and computation:
    • QLoRA effectively uses one low-precision storage data type (usually 4-bit, such as NF4) and one computation data type (usually BFloat16).
    • The model weights are stored in 4-bit format so that they can be loaded onto a single GPU.
    • When gradients are computed during backpropagation, the loaded weights are converted to BFloat16 for the computation.
  • Combination with LoRA:
    • Recall the LoRA equation h = W_0 x + \Delta W x = W_0 x + BAx, where W_0 is the pretrained model weight and A and B are the low-rank decomposition matrices.
    • In QLoRA:
      • The input x is in BFloat16.
      • The pretrained weights W_0 are stored in 4-bit format.
  • Double dequantization process (a configuration sketch follows this list):
    • When gradients are computed, the 4-bit weights and the (possibly doubly quantized) quantization constants go through double dequantization, the inverse of quantization.
    • Step 1: first dequantize the quantization constants (e.g., c1, c2).
    • Step 2: then use these dequantized constants to dequantize the weights back to BFloat16, which is used to compute gradients and train (fine-tune) the model.
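Putting these pieces together, the sketch below shows how such a setup is commonly configured with the Hugging Face stack (transformers + bitsandbytes + peft): the base weights are stored as 4-bit NF4 with double quantization, computation runs in BFloat16, and trainable LoRA adapters (the A and B matrices) sit on top of the frozen quantized base. The model id, LoRA hyperparameters, and target module names are placeholders, and argument names should be checked against the installed library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 storage, double quantization of the constants, BFloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat storage dtype
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used after dequantization
)

# Placeholder model id; any causal LM supported by bitsandbytes works similarly.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters (the trainable A and B matrices) on top of the frozen 4-bit base.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```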

Experimental Results and Performance

  • Effectiveness of NormalFloat and double quantization:
    • The QLoRA authors ran experiments on four datasets.
    • In all four cases, using NormalFloat and double quantization improved the mean zero-shot accuracy of training compared to simply using float (i.e., the standard floating-point/quantization baseline).
  • Comparison with LoRA and full fine-tuning:
    • In terms of the GLUE score, QLoRA is able to replicate the accuracy of 16-bit LoRA and 16-bit full fine-tuning.
    • The authors conclude that "4-bit QLoRA with the NormalFloat data type matches 16-bit full fine-tuning and 16-bit LoRA fine-tuning performance on academic benchmarks with well-established evaluation setups."

Conclusion

If you want to fine-tune on a single GPU while matching the performance of standard fine-tuning on multiple GPUs, QLoRA is a method well worth considering. Through 4-bit NormalFloat, double quantization, and paged optimizers, it significantly lowers the hardware barrier for fine-tuning large language models while maintaining a high level of performance.