speaker 1: In my previous video we looked at LoRA, or low-rank adaptation. LoRA is quite effective for deploying large models and is also fast at inference, thereby solving the inference problem for fine-tuned LLMs. However, when it comes to training, LoRA alone doesn't do the trick. For example, to fine-tune a LLaMA 65-billion-parameter model, LoRA needs around 780 GB of GPU memory; that's about sixteen A40 GPUs. The answer to this problem lies with QLoRA, where Q stands for quantization. The main motivation of QLoRA is to achieve fine-tuning on a single GPU. It does this with three innovations: 4-bit NormalFloat, a new data type that is information-theoretically optimal for normally distributed weights; double quantization, which reduces the average memory footprint by quantizing the quantization constants; and paged optimizers, which manage memory spikes. In this video, let's look at all three of these novelties and understand QLoRA. Without further ado, let's get started.

Let's start with quantization, which is fundamental to QLoRA. Simply put, quantization works by rounding and truncating in order to simplify the input values. For the sake of simplicity, consider that we are quantizing from float16 to int4. Int4 has a range of -8 to 7: since we only have four bits to work with, we only have 2^4 = 16 bins to quantize into. So any input float value needs to be mapped to the center of one of these 16 bins.

Getting into neural networks, the inputs are tensors, which are large matrices, and they are usually normalized to lie between -1 and 1, or 0 and 1. Let's consider the case of a simple tensor with three values, say -0.96, 0.187, and 0.886. We are lucky with this example, as the values are spread evenly across the normalized range, which means that when we quantize to int4, each of these three numbers takes a unique bin. Let's take a slightly different example where the input values are no longer evenly spread across the input range: let two inputs be close together, with one far apart. If we now try to quantize to int4, the first two numbers fall into the same bin, while the third one is fine. We don't want this. Why? Because if you ever want to dequantize and convert back to float16, those two numbers no longer convert back to unique values. In other words, we have lost valuable information to quantization error.

One way to overcome this problem is to divide the input range into separate blocks. In this example, we have three blocks and we quantize each block separately, with each block having its own range. Now the two values that are close together land in different bins within a block, and the third one never had a problem, so it is fine. By dividing into blocks, we quantize each block independently, and so each block comes with its own quantization parameters, chiefly the quantization constant c. In this example, they are c1, c2, and c3. What we just saw is blockwise quantization, which we illustrated with three blocks; in practice, QLoRA uses a block size of 64 for the weights to get high quantization precision. Speaking of the weights, one of the interesting properties of pretrained neural network weights is that they are normally distributed and centered around zero, which means there is a very high probability of values occurring close to zero rather than near -1 or +1.
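To make blockwise quantization concrete, here is a minimal sketch in Python of absmax-style 4-bit integer quantization. The function names and toy values are my own illustration (not code from the QLoRA paper or the bitsandbytes library), and the one-value-per-block split only mirrors the three-block picture above; real QLoRA uses blocks of 64 weights.

```python
import numpy as np

def quantize_block(block, n_bits=4):
    """Absmax-quantize one block of floats to signed n-bit integers."""
    c = np.max(np.abs(block))              # quantization constant for this block
    q_max = 2 ** (n_bits - 1) - 1          # 7 for int4 (range -8..7, used symmetrically)
    q = np.round(block / c * q_max).astype(np.int8)
    return q, c

def dequantize_block(q, c, n_bits=4):
    """Map the integers back to approximate float values."""
    q_max = 2 ** (n_bits - 1) - 1
    return q.astype(np.float32) * c / q_max

weights = np.array([0.30, 0.31, -0.95], dtype=np.float32)

# One block for the whole tensor: 0.30 and 0.31 collapse into the same bin.
q, c = quantize_block(weights)
print(q, dequantize_block(q, c))           # [ 2  2 -7] -> [0.271, 0.271, -0.95]

# One block per value (purely illustrative): each block gets its own constant,
# so the two nearby values stay distinguishable after dequantization.
per_block = [quantize_block(b) for b in np.split(weights, 3)]
print(np.concatenate([dequantize_block(q, c) for q, c in per_block]))
```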
But our standard quantization to int4 is not aware of this fact, and so it works on the assumption that each of the 16 bins has an equal probability of receiving values. To address this problem with standard quantization, we can use a slightly specialized type of quantization that takes the normal distribution of the neural network weights into account. That is exactly what QLoRA does, naming the result NormalFloat. In NormalFloat the bins are weighted by the normal distribution, and hence the quantization values are spaced far apart near the extremes of -1 and +1 but close together as you get closer to zero. To throw some additional light on this, the green dots show the 4-bit NormalFloat quantization values versus the standard 4-bit quantization values shown as blue dots.

Let's now move on to the next contribution of the paper, which is double quantization. Because the intention of QLoRA is to train on a single GPU, it's essential to squeeze out every bit of memory possible. Recall blockwise quantization: we quantize the weights in blocks of 64, and each of these blocks has a quantization constant c. Double quantization is the process of quantizing the quantization constants c themselves for additional memory savings. Without it, the 32-bit constants cost 0.5 bits per parameter on average; double quantization brings that down to roughly 0.127 bits, saving about 0.37 bits per parameter.

The last and third piece of the puzzle is paged optimizers. Paged optimizers prevent memory spikes when we suddenly get a really long input, especially when we are working with a single GPU. Let's say we are working with documents and suddenly hit a really long one while training on a single GPU. This spike in sequence length can break training because of memory issues. To overcome this, the state of the optimizer, say Adam, is moved from GPU memory to CPU memory until the long sequence has been processed. Then, when GPU memory is free again, the optimizer state is moved back to the GPU. At a high level, that's what happens when we leverage paged optimizers. In terms of implementation, paged optimizers are part of the bitsandbytes library, and you can enable or disable them during your QLoRA training by simply turning the is_paged flag on or off.

Putting the above three components together, QLoRA efficiently uses one low-precision storage data type, in our case usually 4-bit, and one computation data type, usually bfloat16. What does that mean? Going back to LoRA, it means that, to optimize for memory, the weights of the model are stored in 4-bit. This enables us to load the weights onto a single GPU, and the loaded weights are converted to bfloat16 for computing gradients during backpropagation. To link this to LoRA, let's go back and look at the equation from LoRA, y = W0·x + B·A·x, where x is the input, W0 is our pretrained model weight, and A and B are the low-rank decomposition matrices. With QLoRA, our input x is bfloat16 and our weights are stored as 4-bit. During the computation of gradients, the weights and quantization constants go through a double dequantization, which is the reverse of double quantization: first the quantization constants are dequantized using c1 and c2, and then, using the recovered constants, the weights are dequantized to bfloat16, which is used to compute the gradients and hence train, or fine-tune, the model.
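Here is a brief sketch of how these pieces are commonly wired together with the Hugging Face transformers, peft, and bitsandbytes libraries. The model id, LoRA hyperparameters, and output directory below are illustrative placeholders; the parts that matter are the NF4 storage type, double quantization, the bfloat16 compute type, and the paged AdamW optimizer.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# Storage in 4-bit NormalFloat with double quantization; compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat for the frozen base weights
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters (trained in bfloat16) on top of the frozen 4-bit base weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Paged AdamW keeps optimizer state in pageable memory, so a long-sequence
# spike pushes it to CPU RAM instead of crashing the run.
training_args = TrainingArguments(
    output_dir="qlora-out",               # placeholder
    optim="paged_adamw_32bit",
    bf16=True,
    per_device_train_batch_size=1,
)
```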
If you're wondering how good all this NormalFloat and double-quantization business really is, the authors of QLoRA experimented with four datasets and show that in all four cases, using NormalFloat with double quantization improves the mean zero-shot accuracy compared to simply using standard 4-bit floats. In terms of the GLUE score, QLoRA is able to replicate the accuracy of 16-bit LoRA and of full fine-tuning. The authors conclude that 4-bit QLoRA with the NormalFloat data type matches 16-bit full fine-tuning and 16-bit LoRA fine-tuning performance on academic benchmarks with well-established evaluation setups. So if you're someone who is interested in fine-tuning on a single GPU and would like the fine-tuned model to match the performance of standard fine-tuning on multiple GPUs, then QLoRA is the way to go. I hope that was a useful insight into QLoRA. I will see you in my next video. Till then, take care.