speaker 1: Hello, everyone. Ever felt like cutting-edge AI was out of reach? Low-rank adaptation, or LoRA, is breaking down those barriers by dramatically reducing the computational resources and time needed for fine-tuning. LoRA is making advanced AI customization accessible to a wider audience, from individual researchers with limited resources to smaller startups with big ideas. This democratization of AI power promises a surge of innovation from unexpected places. Want to be part of it? Let's take a look together. Let's start with a recap of the modern LLM training flow. This is an image from my LLM training video. The training process is often split into two phases: the first phase is called pre-training, and the second one is called post-training, or fine-tuning. The industry and research trend is to focus more and more on the post-training stage; for more details, you can refer to my LLM training videos. Modern models are very large, with huge numbers of parameters. For example, GPT-4 is reported to have around 1.8 trillion parameters, and DeepSeek-V3 has 671 billion parameters. So updating the model weights requires a lot of computation, and storing the model weights requires a lot of storage. Here come the million-dollar questions. The primary purpose of pre-training an LLM is to give it a vast, general understanding of language and the world from massive amounts of unlabeled data, whether that's text, images, or videos. In this phase it makes sense to touch all parameters in the model, since we're trying to get a generalized foundation model from scratch, starting from either zero or random weights depending on the initialization. However, the primary purpose of post-training is to refine and align the pre-trained model's capabilities and behavior. So do we still need to change all parameters at full rank and dimension? The second question is: what if we need to make task-specific modifications to the foundation model? Do we need to do a full retrain every time? Is that even scalable for the wealthiest companies in the world, like Apple, Google, Microsoft, and Amazon? For example, Verily might need a model with extensive medical data on top of Gemini, or the YouTube Kids team might need additional safety checks on top of the Gemini model. From my personal experience, I can tell you the answers: no and no. We don't need to change all parameters at full rank and dimension in a lot of post-training cases, and it's not scalable to do a full retrain every time, even if you're working at one of the wealthiest companies in the world. Many of my personal projects and my team's projects are limited by GPU resources, and GPU resources are very, very tight for everyone in the industry. So what can we do instead? Before we answer that question, let's dive deeper by visualizing the training process. This is a visualization of the pre-training process: in this phase the model's weights are empty, and we pass in tons and tons of unlabeled data to fill in all the weights of the model. This is the visualization of full fine-tuning: all the weights are already filled, but in this process everything is tweakable, and we are tweaking all the weights of the model. For more details, you can take a look at my LLM training video. I include prompt engineering here because it was considered a pretty promising way for different teams to build on the same foundation model. It's been proven that it's not the best way, but I still want to include it here. In prompt engineering, we use all the existing weights; the weights of the model are all frozen.
We're just trying to retrieve the best response from the model. For more details, you can take a look at my prompt engineering video. With that said, although prompt engineering is not a good way to make quality improvements to an existing model, it's a pretty good way to utilize existing applications like Gemini and ChatGPT, so feel free to keep using prompt engineering with those apps. After the previous cases, the apparent question is: what if we could do something in the middle? Let's say most of the weights of the model are frozen, but we can still improve and tweak the parameters that matter, the ones more important than the others. Is that possible? This is the motivation behind parameter-efficient fine-tuning, which is a big family of techniques, and today our focus is on LoRA. LoRA stands for low-rank adaptation. The original paper was published in 2021, and it's been very popular ever since because of its effectiveness and simplicity. Low-rank adaptation, LoRA, proxies the model update ΔW, whose dimension is d by k, with two low-rank matrices A and B: the A matrix has dimension d by r, and the B matrix has dimension r by k, where r is usually a lot smaller than the minimum of d and k. TL;DR: we're trying to approximate ΔW with the product of the A and B matrices. In this way, we can significantly reduce the fine-tuning time and also significantly reduce the checkpoint size. The intuition is that not all weights are equally important; some are a lot more important than others, say the attention layers. W stands for the weights of the model, and we can break it down into two parts: the first is the frozen weight W0 from the pre-training process, and we use A and B to approximate ΔW, the delta that comes out of the post-training process. In this way, we change the problem from fine-tuning ΔW, which has the same dimension as the original weight W, to fine-tuning A and B, which are a lot smaller than the original matrix W. r is a hyperparameter: a small value shortens the training time and storage, but if the value is too small it might cause information loss and hurt model quality. Empirically, r can range from 8 to 256, a lot smaller than a typical d or k. This is the comparison of the weight-update process between full fine-tuning and LoRA. In full fine-tuning, you can think of the input going through the pre-trained weights W and the weight update ΔW; both of them have the same dimension d by k, and after matrix multiplication with the input we get the output. With LoRA, the input still goes through the frozen pre-trained weights; however, instead of going through a weight update ΔW with dimension d by k, it goes through the matrices A and B, where r is the low-rank inner dimension, a predefined hyperparameter that is a lot smaller than d and k. We still do the same add operation after the input goes through the pre-trained weights and the A and B matrices. And here is an example of why we save a lot of parameters. Let's say d is 100,000, k is 200,000, and r is 16; by the way, those values are pretty typical, so it's not something I made up out of nowhere. Without LoRA, ΔW has 20 billion parameters, and with LoRA, A and B together have only 4.8 million parameters. That's more than 4,000 times fewer parameters with LoRA, so the difference is huge. Now for the training details of LoRA. We need to initialize A and B first. The most common initialization strategy is: for matrix A, we initialize randomly using a Gaussian distribution with a small standard deviation; for matrix B, we initialize it with all zeros.
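To make the shapes and the parameter math concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is my own illustration rather than code from the LoRA paper, and the class name LoRALinear and the dimension names d, k, r are just placeholders I picked to match the numbers above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a low-rank update A @ B."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # Frozen pre-trained weight W0 with shape (d, k).
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A: (d, r), random Gaussian with a small standard deviation.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        # B: (r, k), all zeros, so delta_W = A @ B is zero at the start
        # and the original pre-trained behavior is preserved.
        self.B = nn.Parameter(torch.zeros(r, k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + A B x; x has shape (batch, k), h has shape (batch, d).
        return x @ self.W0.T + x @ (self.A @ self.B).T

# Tiny smoke test with small dimensions.
layer = LoRALinear(d=16, k=32, r=4)
print(layer(torch.randn(2, 32)).shape)  # torch.Size([2, 16])

# Parameter count for the example above: d = 100,000, k = 200,000, r = 16.
d, k, r = 100_000, 200_000, 16
print(d * k)          # 20,000,000,000 parameters in a full delta_W
print(d * r + r * k)  # 4,800,000 parameters in A and B (roughly 4,000x fewer)
```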
The intuition behind this initialization is that at the beginning of fine-tuning, ΔW should be zero to preserve the original pre-trained weights, while the random initialization of A allows for different initial directions for adaptation during training. So B starts at zero, A starts from a Gaussian, and once A and B are initialized, they are the trainable parameters in the model. During the fine-tuning process with LoRA, the forward pass of a layer with LoRA can be defined as h = W0·x + A·B·x, where h is the output and x is the input. In backpropagation, the gradients are calculated only for the parameters in matrices A and B; the weights in W0 remain frozen. These gradients are then used to update A and B with an optimization algorithm, say Adam or SGD, to do gradient descent. It's all pretty standard and surprisingly simple. LoRA comes with a lot of advantages. The first one, apparently, is that it significantly reduces the number of trainable parameters, which results in lower computation costs: it uses less memory during training, allowing fine-tuning on less powerful hardware, say your own personal computer. It also enables faster training times; with fewer parameters to update, the training process converges much faster, so you can iterate quickly on a lot of prototypes. It also has a much smaller storage footprint: the LoRA adapters are very small compared to the full model, usually between 0.001% and 1% of its size, making them easier to store and share, even for personal users like us. Also, because the base model parameters are frozen and the low-rank A and B matrices effectively act as a regularizer, it's really hard to overfit. LoRA usually has better or comparable quality when the training dataset is limited, compared with full fine-tuning. The intuition is that for full fine-tuning to work, you need a lot more data to propagate through all the parameters in the model, but for LoRA, since there are far fewer parameters, you can do it with less data. The next one is task isolation, and this is something really strong, I would say. Multiple small task-specific adapters can be attached to a single base LLM, and these adapters can be easily loaded and swapped depending on the task, without needing to store or load multiple full fine-tuned models. This enables efficient multi-task learning, or serving different applications with the same base model. Basically, different teams can work on a single base LLM and focus on different tasks without blocking each other, and you can combine your task-specific deltas for collaboration later. Let's say one team is working on a doc adapter and another is working on a toy adapter; after they are done, we can combine these two adapters and get a toy-doc adapter. Pretty cool, right? As for LoRA's cons, there's the possible serving-latency increase. After merging with the base model's weights, inference latency should not increase, since it's mathematically the same process. However, serving multiple checkpoints, the base LLM plus one or more LoRA adapter checkpoints, can result in a serving-latency increase depending on the infrastructure, like RPC or in-memory; usually RPC has more of a latency increase.
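Before moving to the comparison, here are two small sketches. The first shows the LoRA update loop described earlier; it assumes the hypothetical LoRALinear layer from the sketch above and uses dummy data, so it's an illustration of the idea rather than a real fine-tuning script.

```python
import torch
import torch.nn.functional as F

# Assumes the LoRALinear class from the earlier sketch.
layer = LoRALinear(d=16, k=32, r=4)

# Only A and B are passed to the optimizer; W0 has requires_grad=False
# and stays frozen throughout training.
optimizer = torch.optim.Adam([layer.A, layer.B], lr=1e-3)

x = torch.randn(8, 32)       # dummy inputs
target = torch.randn(8, 16)  # dummy targets

for step in range(100):
    optimizer.zero_grad()
    loss = F.mse_loss(layer(x), target)
    loss.backward()          # gradients flow only into A and B
    optimizer.step()
```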
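The second sketch shows the two serving options just mentioned, again using the hypothetical layer from above: merging the low-rank delta into the base weights (mathematically the same model, so no extra inference latency), or keeping small per-task adapters around so they can be swapped, which is the task-isolation story.

```python
import torch

# Option 1: merge delta_W = A @ B into the frozen base weight for serving,
# so inference is a single matmul with no extra latency.
with torch.no_grad():
    W_merged = layer.W0 + layer.A @ layer.B

# Option 2: keep the base model frozen and store one small (A, B) pair per
# task, swapping them depending on which adapter a request needs.
adapters = {
    "task_a": (layer.A.detach().clone(), layer.B.detach().clone()),
    # "task_b": (A_b, B_b),  # a second adapter trained by another team
}

def forward_with_adapter(x: torch.Tensor, task: str) -> torch.Tensor:
    A, B = adapters[task]
    return x @ layer.W0.T + x @ (A @ B).T
```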
This is a comparison between full fine-tuning, prompt engineering, and LoRA; let's take a quick look. For quality improvements, full fine-tuning usually has the best quality, and LoRA has close or even better quality when the training data is limited to several thousand examples; prompt engineering does not improve the model at all, it's just a way to get better responses. For tuning time, full fine-tuning is long, LoRA is a lot shorter, within hours, and prompt engineering takes very little time to run. For tuning cost, full fine-tuning uses a lot more memory and chips compared with LoRA, LoRA has lower cost, and prompt engineering has no tuning cost. Training-data-wise, full fine-tuning requires a large amount of data, LoRA requires a smaller amount, and prompt engineering requires no additional data. Model-storage-cost-wise, full fine-tuning requires large storage to save the full weights, LoRA only needs to save the adapter weights, so it's a lot smaller, and prompt engineering has no additional storage. Task-isolation-wise, full fine-tuning makes this really hard, since it requires a separate model for each task; task-specific LoRA adapters can be easily combined, swapped, and removed; and for prompt engineering, you can just use a different prompt for each task. Serving-latency-wise, full fine-tuning and prompt engineering don't add any serving latency, while LoRA could add some serving latency depending on the infrastructure you're using. And for serving within a mobile device, full fine-tuning is almost impossible because the base model is too big for mobile, unless you're using a distilled version, which is a popular choice nowadays; you can easily put LoRA adapter weights on device; and of course you can use prompt engineering on device. So in short, if you're not looking to improve the quality of the model, prompt engineering is the way to go. Like I said, if you're just a casual user trying to get the best response from ChatGPT or Gemini, just focus on prompt engineering. However, if you are an entrepreneur or someone really interested in fine-tuning open-source LLMs, LoRA is something that can 10x or 100x your efficiency. It's really, really powerful and simple. LoRA adoption across the industry has been rapid: since the original publication in 2021, it has been adopted across essentially all common scenarios, like open-source LLM fine-tuning applications, cloud tuning, and on-device tuning. As you can see from the Google Scholar results for the search term LoRA, it's exploding; I don't have the data for 2025 because it's just April. LoRA is not only used for fine-tuning, it's also used in reinforcement learning. For example, reinforcement learning with human feedback can use LoRA for both reward modeling and policy optimization, achieving comparable performance to full fine-tuning with significantly reduced computation cost. More commonly, this is referred to as PERL, parameter-efficient reinforcement learning. Different LoRA adapters can be trained for different RL tasks or environments using the same pre-trained backbone. I also want to briefly talk about QLoRA, which is quantized low-rank adaptation. Quantization, which I've talked about in my DeepSeek-V3 video, reduces the number of bits used to represent model weights, and it significantly decreases memory usage. Since LoRA keeps the base model's weights frozen, an intuitive optimization for LoRA is to quantize those frozen weights. So the pre-trained LLM's weights are frozen after being quantized to 4-bit NormalFloat; in this state, the model is very memory efficient. Then we do the LoRA integration: low-rank adapters are added to the chosen layers of the frozen, quantized model.
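Here is a sketch of what that setup can look like with the Hugging Face transformers, peft, and bitsandbytes stack, which is one common way to do QLoRA; the model name is a placeholder, the target modules are just a typical choice, and exact argument names can vary between library versions, so treat this as an outline rather than a recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model quantized to 4-bit NormalFloat (NF4).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "some-open-source-llm",          # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Attach high-precision LoRA adapters to chosen layers of the quantized model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```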
These adapters introduce a small number of trainable, high-precision parameters, the same concept as the LoRA I've gone through. Next we go to fine-tuning: same as with LoRA, only the weights of the low-rank adapters are updated, and the quantized base model remains frozen. During inference, we can either use the A and B matrices directly and add them against the original quantized weights, or we can dequantize the original weights W0 back to a higher precision and then add the adapter updates. This is the comparison between LoRA and QLoRA: as you can see, QLoRA uses quantization, has about 75% smaller GPU memory usage, and can support 10x larger batch sizes due to the lower memory footprint. However, this comes with a price: training-speed-wise, LoRA is generally faster because it doesn't have the quantization and dequantization steps, and it's also simpler to implement, because QLoRA needs to implement quantization techniques. All right, that's all I want to talk about for LoRA. The last thing before I say goodbye is that I want to briefly cover some of the other popular PEFT techniques. For example, adapter tuning. TL;DR: adapter tuning introduces small new neural network modules, called adapters, into the existing architecture of the LLM. The weights of the original pre-trained LLM are frozen; only the parameters within these newly added adapters are trained on the task-specific data. And this is the architecture: basically, we are adding adapter neural networks into the existing transformer architecture. As you can see, this is a very similar concept to LoRA; it's trying to freeze the base LLM and only update a small set of parameters. The only difference is that adapter tuning introduces new neural networks into the architecture, while LoRA is a lot simpler, which is probably why LoRA is more popular right now. The architecture is very similar to an autoencoder, and if you want to know more about autoencoders, take a look at my autoencoder video. It aims to limit the number of trainable parameters, and like an autoencoder, it has a down-projection that reduces the high-dimensional input into a low-dimensional space. It also has nonlinearity, applying a nonlinear activation function like ReLU. It also has an up-projection, projecting the lower-dimensional representation back to the original higher-dimensional space. And it has a residual connection, which improves gradient flow and addresses vanishing gradients by adding the output directly to the original input of that layer (there's a small sketch of such an adapter block after the DoRA discussion below). Another PEFT technique built on top of LoRA is called DoRA, weight-decomposed low-rank adaptation. The motivation is that researchers found full fine-tuning and LoRA often show different patterns of weight updates, particularly in terms of magnitude and direction. DoRA aims to bridge that gap by allowing more nuanced updates to both aspects of the weights, that is, magnitude and direction. The essence is that it first does weight decomposition, which is the key innovation of DoRA: it decomposes the pre-trained weight matrices into two components, magnitude and direction. DoRA then fine-tunes both of these components, and it applies LoRA only to the directional component. This is because the directional component has a larger number of parameters, making low-rank adaptation efficient there, while the magnitude component is much smaller, so we update it directly. This is a picture I took from the paper. So DoRA is a very interesting improvement built on top of LoRA.
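To make the decomposition concrete, here is a rough PyTorch sketch of the DoRA idea; it's my own simplified illustration, not code from the DoRA paper. The magnitude is the per-column norm of the pre-trained weight and is trained directly, while LoRA-style A and B matrices update only the direction.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Sketch of DoRA: W' = m * (W0 + A @ B) / ||W0 + A @ B||_col."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        W0 = torch.randn(d, k)
        self.W0 = nn.Parameter(W0, requires_grad=False)      # frozen base weight
        # Magnitude: per-column norm of W0, trained directly (only k values).
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))  # shape (1, k)
        # LoRA factors applied only to the directional component.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W0 + self.A @ self.B                # updated direction, (d, k)
        V_unit = V / V.norm(dim=0, keepdim=True)     # normalize each column
        W = self.m * V_unit                          # rescale by learned magnitude
        return x @ W.T                               # x: (batch, k) -> (batch, d)
```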
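Going back to adapter tuning for a moment, here is the small sketch of a bottleneck adapter block I promised above; the names are placeholders, and in practice such a block sits inside each frozen transformer layer, with only the adapter parameters trained on the task-specific data.

```python
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Bottleneck adapter: down-project, ReLU, up-project, plus residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection improves gradient flow and preserves the input.
        return x + self.up(torch.relu(self.down(x)))
```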
There are so many other PEFT techniques that I'm not going to cover today, but I think we have covered the most popular one, which is LoRA. Hope you found my video helpful and useful. If you liked it, please subscribe, comment, and like. See you next time. Bye.