speaker 1: To build a custom model for our application, we start with a pretrained language model and fine-tune it on our own dataset. This used to be fine until we reached the large language model regime and started working with models such as GPT, LLaMA, and others. These LLMs are quite bulky, so fine-tuning a model for different applications such as summarization or reading comprehension means deploying a separate copy of the model for each application. And the size of these models is only increasing, almost on a weekly or monthly basis, so deploying these bulky LLMs is getting increasingly challenging. One solution proposed for this problem is adapters. Adapters are additional trainable modules plugged into the neural network, mostly transformers, and during fine-tuning only the parameters of these adapter modules are updated, with the pretrained model frozen. But because adapters are additional parameters, they introduce latency during inference: for a batch size of 32, a sequence length of 512, and about half a million adapter parameters, the fine-tuned model takes 149 milliseconds for inference, but with adapters it is two or three percent higher. So how does LoRA avoid this? Let's find out in this video. Before that, I would like to give a quick shout-out to our X account, where we share high-impact papers and research news from top AI labs in both academia and industry. If you wish to keep up to date with AI every single day, just hit the follow button on X. LoRA stands for low-rank adaptation. So what does that mean? For any neural network architecture, let's not forget that the weights of the network are just large matrices of numbers. Every matrix comes with a property called the rank. The rank of a matrix is the number of linearly independent rows or columns of that matrix. To understand it, let's take a simple three-by-three matrix. The rank of the simple three-by-three matrix at the top is one. Why? Because the first and second columns are redundant: they are just multiples of the third column. In other words, those two columns are linearly dependent and don't bring any meaningful information. Now, if we simply change one of the values to, say, 70, the rank becomes two, as we now have two linearly independent columns. Knowing the rank of a matrix, we can do a rank decomposition of that matrix into two matrices. Going back to our three-by-three example, it can simply be written as the product of two matrices, one with dimension three by one and the other with dimension one by three. Notice that we only have to store six numbers after decomposition, instead of the nine numbers in the three-by-three matrix. This may sound like a small saving, but in reality neural network weights have very high dimensions, say 1024 by 1024, and with a rank of two the decomposition boils down to a really small number of values that we need to store, and hence to multiply when we actually do the computation, which is a big reduction. So wouldn't it be nice if these weights actually had a low rank, so that we could work with the rank decomposition instead of the entire weight matrix? It turns out that is indeed the case with pretrained models, as shown by earlier work: they empirically show that common pretrained models have a very low intrinsic dimension; in other words, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space.
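To make the rank and rank-decomposition idea concrete, here is a minimal NumPy sketch. The three-by-three values are illustrative choices consistent with the description above, not numbers taken from the video itself:

```python
import numpy as np

# A 3x3 matrix whose first two columns are multiples of the third,
# so it carries only rank-1 information.
W = np.array([[2.0,  4.0, 1.0],
              [6.0, 12.0, 3.0],
              [10.0, 20.0, 5.0]])
print(np.linalg.matrix_rank(W))    # 1

# Changing a single entry to 70 breaks the dependence: rank becomes 2.
W2 = W.copy()
W2[0, 1] = 70.0
print(np.linalg.matrix_rank(W2))   # 2

# Rank-1 decomposition: a 3x1 column times a 1x3 row reproduces W,
# storing only 6 numbers instead of 9.
col = W[:, 2:3]                    # shape (3, 1)
row = np.array([[2.0, 4.0, 1.0]])  # shape (1, 3)
print(np.allclose(col @ row, W))   # True
```

The same bookkeeping is what makes LoRA cheap at scale: a 1024-by-1024 update factored at rank 2 needs only 1024 x 2 + 2 x 1024 = 4096 stored numbers instead of roughly a million.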
Let's say we are starting with a pretrained model with weights W0. After fine-tuning, let the weights be updated to W0 + ΔW. If the pretrained model has low-rank weights, it is a fair hypothesis that the fine-tuned update ΔW is also low rank. LoRA goes with this assumption: because ΔW is low rank, we can decompose that matrix into two low-rank matrices, A and B, whose product BA equals ΔW. Fine-tuning then becomes learning W0 + BA instead of W0 + ΔW, as the two are one and the same. With that perspective, we start training the model with input x. The input passes through both the pretrained weights and the rank decomposition matrices A and B. The weights of the pretrained model remain frozen, but we still use the output of the frozen model during training: the outputs of the frozen model and the low-rank branch are summed to obtain the output latent representation h. Mathematically, it is captured by the one-line equation h = W0 x + BA x, where the input x is multiplied by both W0 and the BA matrix and the results are summed to obtain the hidden representation h. Now you may ask, what about latency during inference? If we slightly rearrange the equation above, we notice that we can merge, that is add, the weights BA into the pretrained weights W0. So for inference it is this merged weight that is deployed, thereby overcoming the latency bottleneck. The other concern is the deployment of LLMs, as they are quite bulky, say about 50 or 70 GB. Let's say we have to fine-tune for two tasks, namely summarization and translation. We don't have to deploy the entire model for each fine-tune: we can simply fine-tune the LoRA layers specific to one task, for example summarization, and deploy those for summarization, and similarly deploy the LoRA layers specific to translation. Thus LoRA overcomes both the deployment and latency problems faced by modern large language models. In terms of applying this to transformers, we know that transformers have two main modules: multi-headed self-attention and the multilayer perceptron, or MLP. The self-attention module is composed of query, key, value, and output weights. In the paper, the authors limit their study to adapting only the attention weights for downstream tasks and freeze the MLP modules, so those are not trained on downstream tasks; in other words, LoRA is applied just to the self-attention module. Now, we have been talking about using LoRA for adaptation, and one of the key hyperparameters in LoRA is the rank, which is something we have to choose. So what is the optimal rank for LoRA? It turns out, to everyone's surprise, that a rank as small as one is sufficient for adapting both the query and the value matrices. However, when adapting the query alone, it needs a larger rank of, say, four, eight, or even 64. Moving on to how we can practically use LoRA: there is the official implementation from Microsoft, released as loralib and available under the MIT license. Another option is the Hugging Face library called PEFT, which stands for parameter-efficient fine-tuning and is available under the Apache 2.0 license. PEFT also includes a few other methods, such as prefix tuning and prompt tuning, and LoRA is one of the earliest implementations in the library. I think that pretty much covers the important bits about LoRA. I hope this video was useful in understanding how LoRA works. I hope to see you in my next video. Until then, take care.
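To tie the pieces together, here is a minimal PyTorch sketch of a LoRA-adapted linear layer with the h = W0 x + BA x forward pass and the weight merge used at inference time. This is an illustrative reimplementation, not the official loralib or PEFT code; the class name, the random initialization, and the alpha/r scaling default are assumptions made for the example (the scaling factor itself follows the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: h = W0 x + (alpha/r) * B A x."""

    def __init__(self, in_features: int, out_features: int, r: int = 1, alpha: float = 1.0):
        super().__init__()
        # Frozen pretrained weight W0 (random here as a stand-in for real weights).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors. B starts at zero, so training begins exactly at W0.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = x @ self.weight.T                   # W0 x (frozen branch)
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T   # B A x, without forming BA explicitly
        return frozen_out + self.scaling * lora_out      # h = W0 x + BA x

    @torch.no_grad()
    def merge(self) -> None:
        # Fold BA into W0 for deployment, so inference adds no extra latency.
        # (A real implementation would then skip the LoRA branch in forward.)
        self.weight += self.scaling * (self.lora_B @ self.lora_A)

# Rank-2 adaptation of a 1024x1024 weight: only 2 * 2 * 1024 = 4096 trainable numbers.
layer = LoRALinear(1024, 1024, r=2)
h = layer(torch.randn(32, 1024))   # a batch of 32 inputs
layer.merge()                      # merge before deploying for inference
```

In practice you would normally rely on the official loralib package or Hugging Face PEFT (for example, wrapping a model with a LoraConfig via get_peft_model) rather than hand-rolling the layer, since those libraries also handle targeting the attention projections and merging the weights for deployment.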