speaker 1: I'm going to explain how to fine tune the latest open source models, all the way from Gemma 3, Qwen 3, Llama 4 and Phi-4 to Mistral Small. I'll explain the pros and cons of using Unsloth versus using Transformers. I'll also explain how to do fast evaluations using vLLM, separate to the fine tuning, and then I'll go through in detail some of the techniques around how to set the hyperparameters to get the best results. For a video overview: I'll briefly describe why you might want to fine tune. Hopefully you've got a sense already if you're watching this video, but I'll recap on that. I'll very briefly describe how to prepare data. You probably should be spending 90% of your time on data preparation, and I have a lot of videos covering that. I will link them here, but it's not going to be the focus of this video. Then I'll talk about Unsloth, which is a wrapper on Transformers that brings significant improvements, but I'll talk about the pros and cons of using Unsloth versus Transformers for fine tuning. Then I'll talk a little about running fast evaluations. You should be evaluating your performance before you fine tune, afterwards, and during as well, and I'll show you how you can speed that up a lot by using vLLM in the same notebook as you're doing the fine tuning. Then I'll talk about which model to fine tune out of the available open source ones, considering factors like the licenses and performance. I'll give a few general fine tuning tips before then going to a live demo of fine tuning using the Unsloth and Transformers notebooks that I've put together in the advanced fine tuning repo. So why should you fine tune? Generally, this is ideally a last resort. You have tried prompt engineering, you have included retrieval in your techniques, and you still need to improve performance. Often that means one of a few things. You need to improve your answer structure and format. For example, you want the model to take a certain approach, maybe recapping on some of the information, processing it, maybe reasoning over it, and then giving a final structured answer. Sometimes that means you have tool calling, so you want a structured response that is going to call an assistive tool. These are classic cases where fine tuning can make sense, because it gets the model to very consistently respond in a certain format. Now there are two others you might consider as well. One is if you want to improve accuracy beyond just using a retrieval method. Back in one of my earlier videos, I show how you can combine retrieval with fine tuning and get the best possible approach, so you can improve performance beyond a retrieval-only approach. And last of all, this is more modern, late 2024: you can try to fine tune for specific reasoning within a domain. That's GRPO, group relative policy optimization, and other related techniques. I have a series of videos from earlier in 2025 that covered that, and I'll let you try them out there. For today, though, we're going to focus not so much on reasoning, although I think I will make another reasoning video soon. We're going to focus on general fine tuning to enhance knowledge, but a lot of the same principles apply if you wanted to fine tune on structure, although I'll point you to this video right here for more details on fine tuning for JSON responses or function calling.
I'm not going to talk too much about data preparation, but I will give you a few tips and point you to the key videos. Basically, there are two types of training and, correspondingly, two types of data sets. At a very simple level, there is what's called continued pre-training with raw data. This is where you take, say, magazine content, newsletter articles or books, and you feed that in without much pre-processing. I mean, you'll clean it up, but you're not going to change it into Q&A pairs. Continued pre-training is typically difficult to do on top of an existing model, because it can tend to undo the instruction-type training that model has. So often, unless you have a very large amount of data and you're willing to do continued pre-training followed by post-training, it's probably not recommended to try continued pre-training. Instead, what's generally recommended, particularly with smaller amounts of data, say up to maybe 100,000 words or even a million, is to do post-training on question and answer type data sets. These are often data sets you synthetically create, using documents and using LLMs to create questions and answers from those documents. Now, to prepare a synthetic data set, it's recommended to use a large language model. You want to generate not just questions and answers, but ideally questions, evaluation criteria, so what the criteria are for correct answers, and then high quality answers. I have a video that came out recently on how you can prepare data this way. If you further want the answers to involve reasoning or chain of thought, you probably need to augment them further, and I do have a video showing how, with these augmented answers, you can get to very high levels of accuracy with fine-tuned models. That's whether you're fine tuning open source or using APIs for fine tuning models, like the OpenAI API. Now, one caveat, and I don't say this with 100% confidence, maybe 75% confidence: my sense is that as models get stronger, it's harder to fine tune. That's because the model starts off from a very good point, and in fine tuning you are risking damaging the model in some ways. I'll give you a simple example. If you take a reasoning model, it might perform quite well, and if you train it without reasoning data, you could just drag the performance below what the reasoning performance was, even though you're adding the right content to it. So I think in some ways fine tuning is getting a bit trickier. And if you are going to try and fine tune for reasoning, and I mean reasoning not in, say, a technical application where you can use GRPO or even SFT like I described in one of my previous videos, let's see, this one here, but if you're trying to use reasoning in kind of a verbal domain, that's going to be tricky. It's something I want to cover maybe in a later video, but just watch out, because if you don't have proper reasoning data sets developed, you do risk regressing performance. Okay. So I'm going to talk about the libraries. There are quite a few libraries out there. Some of the most common, at least at the smaller scale, are Unsloth and Transformers. There are also libraries like Axolotl and Torchtune; those are two other examples. Maybe I'll go through those at some point. But the ones I've been using most often on this channel are Transformers and Unsloth. Unsloth is effectively a wrapper, and I don't mean that in a negative way.
It's brought a lot of improvements to the Transformers library in terms of speed and just ease of use, so I want to recap those here. I'm going to show you notebooks that do Unsloth only and then Transformers only, but it's worth appreciating the differences. So Unsloth generally is two times faster than Transformers, and that's because of a variety of tricks. It's not just one thing; it's the accumulation of about five or six tricks that results in faster fine tuning. Also, Unsloth provides a unified function for loading multimodal models. So if you have a model like Gemma 3, which is multimodal in sizes larger than 4B, you need to change the function that you use to import that model if you're going to use Transformers, whereas Unsloth has unified how you load that model. That makes it a lot easier to have one script that supports a lot of different models. Unsloth, for now, is single GPU only. So if you have a very large model, you might need to use Transformers instead, which does support multi-GPU. By default, Transformers will use model parallel, so it will chunk up the model by layers, which means your GPUs are not all active when you're fine tuning, because you will use one GPU to process these layers, then the next GPU, then the next. That might sound inefficient, and it is, but it's quite simple and robust, so I wouldn't rule it out. If you need to tune a larger model, just use that simple approach. Now, you can take an improved approach called fully sharded data parallel. That's where you split the model across GPUs, but you split the matrices essentially, so that all GPUs are being used more or less at once. You can find a video on fully sharded data parallel on this channel if you just look up FSDP. Something else about Unsloth: because it's a wrapper, sometimes there are issues where Transformers will move ahead and maybe Unsloth does not quite support that yet. Also, if you're trying to use AI to help you code, there will tend to be more documentation on the advanced features in Transformers, and because everything is being wrapped by Unsloth, it's sometimes harder to access those features by working through Unsloth. So basically, what I'm saying is: if you're trying to use some more obscure functionality in Transformers, it may be easier to use Transformers directly than to figure out how Unsloth has wrapped it, for now. Also, Gemma 3 is still broken in the sense that the configuration file won't allow you to run inference on vLLM if you fine tune a Gemma 3 model with Unsloth. I'll point you to an issue where there's potentially a fix. I expect it will be fixed at some point, but it's something to keep a note on. And also, you should note that GPUs are getting very big, like the B200, which you can rent on RunPod now, I think, for eight dollars an hour. It's 192 GB of VRAM, so you can probably fit a model that's up to 150 billion parameters in eight bits. Unsloth allows you to fine tune in eight bits now, not just in four bits. I don't recommend four bit. I know QLoRA is very popular, but I have found, particularly because of merging back adapters, you can see small differences in performance that are hard to predict. So I generally recommend fine tuning in 16 bits. Eight bits is probably quite good as well.
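As a rough illustration of the Unsloth loading path I use in these notebooks, here's a minimal sketch. The parameter names follow Unsloth's FastLanguageModel API as I understand it; the model repo name is just an example, and the 8-bit option is newer, so check your installed version.

```python
from unsloth import FastLanguageModel

# Minimal sketch: load a model for LoRA fine tuning with Unsloth.
# load_in_4bit=False keeps the weights in 16 bits (my usual recommendation);
# recent Unsloth versions also expose an 8-bit loading flag if you want to halve memory.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4-mini-instruct",  # example repo name for illustration
    max_seq_length=8192,
    load_in_4bit=False,
    dtype=None,  # let Unsloth pick bfloat16/float16 based on the GPU
)
```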
So yeah, if you have a model in 16 bits, you can probably train up to 70 or 80 billion parameters on a B200 with Unsloth, or in eight bits probably something even larger. So this is not even that much of a limitation. You're not going to be able to fine tune DeepSeek using Unsloth, but DeepSeek is probably hard to fine tune even with Transformers anyway. But yeah, this is a limitation. If you wanted to fine tune, let's say, Llama 4 in the Maverick version, you're not going to fit that on a single GPU, but you could fit the Scout version on a single GPU in eight bits, because it's roughly 100 billion parameters and it's about one byte per parameter in eight bits, so that would fit onto that GPU. Just a note on running evaluations. You want to run evaluations before you fine tune, after, and ideally during, to see if your fine tuning is working. You can just run inference using Transformers or Unsloth, but neither of these libraries is designed for inference, so they don't do what's called continuous batching. You send in a batch of tokens, so you have to size the batch so you don't run out of memory. You can slightly automate that in Transformers by having a test for the right batch size, and it will reduce it if it's too large. But the inference is just not optimal in terms of the back end, and it will be significantly slower than using something like vLLM or SGLang. So, and that's why the scripts I'll show you today have two parts: they've got an inference part, or an evaluation part, that's run with vLLM, and then they've got a fine tuning part that uses either Transformers or Unsloth. It's much faster to use vLLM. The drawback is you have to reload the model in vLLM after you've fine tuned in Unsloth or Transformers, so there's a trade-off. If you have a very big model, it can take a bit of time to load. But again, if you're only on one GPU, this is probably not going to be a big constraint, and reloading the model should be fairly fast given it's already going to be on your disk from the fine tuning. Now for some thoughts on which model to fine tune, and I've listed them in a tentative order of preference. My top pick, I think, is Mistral Small. It's less than 30 billion parameters, it's an Apache 2.0 license, and it tends to be strong in evaluations, which is also what I've heard from customers when I see results across the different models they've tried. So it would be one of my top recommendations. Gemma 3 is a very strong model as well, but the license is custom. So if you're at a bigger company where there's sign-off on the general open source licenses, but there needs to be a review of custom licenses, this maybe adds a little bit more friction. And I've added some notes here; you can just click on these links to get more info on the licenses. Phi-4 from Microsoft is a permissive license. It also allows for reasoning; these top two models don't. So if you want to fine tune for reasoning, maybe Phi-4 is a good option. Llama 4 is a custom license, and it's also very large. The Scout model is 100 billion parameters, and I think it's unnecessarily big for the quality it provides. The quality is probably not much better than Gemma 3 27B, or potentially even Mistral. So I recommend probably just using Mistral or Gemma 3 over Llama 4. Qwen 3 is a very strong model. I think it is probably stronger than all of these models here, and it's Apache 2.0.
But you do have the issues that come with using Qwen or DeepSeek models: there is strong censorship of the models, and there is also a backdoor risk with any language model here, which you have to weigh in the context of where and how it was developed. These models are increasingly being used to control agents, and that provides an extra angle for danger if the model is controlling your agent in a malicious way, making tool calls that you don't want to be made. So Qwen 3: very strong, perhaps a very good choice if you want to fine tune for reasoning, but you do have to be careful of the censorship and the backdoor issues. Now, just a few general fine tuning tips before I move to a demo. Spend 80 or 90% plus of your time on data preparation. That probably means watching some of the other videos. Second of all, define two evaluation data sets. One is a representative data set that's not in your training set. I explained in my data prep video how you can rephrase certain questions to make sure they're not verbatim in both your training and your eval set. But I do recommend including a verbatim copy of some of your training data set, because by including a verbatim copy and also a version of the data set that's not in your training set, you can start to measure the difference in performance between the two and assess whether you are overfitting. Measuring overfitting is another reason to use the eval set during training, so you should be calculating training and eval losses. Make sure to evaluate before and after fine tuning. And then one kind of random tip here: do inspect the chat template being used when you're fine tuning. Some chat templates have the date included in them, and if they have the date, you probably don't want to be fine tuning on today's date for all of your examples, so you may want to remove that when you're doing the fine tuning. Other things that can appear unexpectedly may also adversely affect the quality. Okay, so that's it for the theory portion. I'm going to move on now and show you how to fine tune. If you want to find the scripts I'm going to show you, they're available at Trellis.com under advanced fine tuning. And I've actually refreshed the repository. Let me show you here. I've just cloned it over using Windsurf. Historically, the fine tuning repo was organized according to branches, so different branches would have different content, and those branches are still there. For example, if you want scripts for making synthetic data, there's a synthetic branch; there's distillation, low VRAM, full fine tuning, retrieval RAG, using Wikipedia data for fine tuning. There's a large variety of scripts, and they are still there in different branches. But going forward, I'm going to leave all of the scripts within the main branch here. I've started to create a clean folder for data prep, that was the most recent video, and now I've got a clean folder here for fine tuning. This is in a vllm-unsloth branch for now, but I'm going to merge it into main, so you'll be able to find it there after the video. Now, there are three fine tuning scripts that I have prepared. One is vllm-unsloth: it uses vLLM for evaluation and then Unsloth for fine tuning. This one here is vllm-transformers, so it uses vLLM for eval and Transformers for fine tuning. And then this one here is pure transformers.
So it will just use Transformers both for evaluation and for fine tuning, and it will automatically set the right batch size for evaluation, but the evaluation is quite a bit slower than if you're using vLLM. So I'm going to show you a fine tune using the Unsloth script here. The Transformers script more or less follows this, and when I go through it, I'll highlight a few points where there's a difference between using Unsloth and Transformers for the fine tuning. Now, to get going on a GPU, I'm going to use RunPod and this one-click template affiliate link I have here. If you want to exactly replicate the environment I'm using, you can use this and pick a GPU. Now, earlier I said 192; that was wrong, sorry. What I should have said was 180. So if you run a B200 for, looks like, eight dollars an hour, maybe I said seven, you get 180 GB of VRAM. We're just going to run with an H100, which is 80 GB; we won't be running quite that big a model. So I'll just say "fine tuning with unsloth" as the name, and you can see here that we've got a CUDA 12.1 and PyTorch 2.2 template. So I got this going. Okay, once the pod is started, we're going to connect and open up Jupyter, and then I'm going to upload my notebook, the Unsloth one. Now, just while I'm uploading the Unsloth notebook here, notice that I can also upload a requirements file. So you can either install the latest version of the dependencies, or you can install from the requirements file if you want to ensure reproducibility. Okay. So I apologize for hurting the eyes of those who are sensitive to white light; I have swapped this over to dark mode. We're going to start off with an evaluation. Throughout this video, we're going to use a Hugging Face dataset, a Trellis dataset called touch rugby comprehensive. It's the Q&A dataset I generated in a recent video, and it consists of a series of questions and answers, plus evaluation criteria for marking a given answer correct. These were all generated using, I think, the Gemini 2.5 Pro model, based on a document with the touch rugby rules. So the first thing I'm going to do is run some installations here. Sometimes, if you've run the fine tuning first with Unsloth, you might need to uninstall Unsloth. But since we're starting fresh here, we don't need to uninstall Unsloth. We'll just run ahead and do the installs, the most important of which is vLLM here. Now, I'm just going to train one model. I'm actually going to train, I think, the Phi-4 mini instruct model. It's a new model I haven't trained, so I'm kind of curious what the results will be and whether we hit any issues. But I will give some commentary as we go about some of the other models. My first comment here is: if you're using the Qwen model and you want to fine tune and disable reasoning using vLLM, you do need the latest install from source of vLLM. It does take quite a bit of time to install. So just a heads up: if you're going to use Qwen 3 and you want to disable reasoning, you need to install vLLM from source for now. All right. While that's installing, we'll move on here. Do restart the kernel after the installs are done. And this is where I was saving my requirements to a txt file. You could, if you wish, rather than running the installs above, have done uv pip install -r with the requirements file, like this vllm-unsloth one, and then --system. Now, --system means that we're installing onto the system; we're not in a virtual environment.
And this makes sense, because when we started the Docker image on this GPU, we already had CUDA and PyTorch installed, and we want to make use of those; we don't want everything to be packaged in a venv. Okay, so this is still installing. Installs are done. So I'm going to restart the kernel, and now I'm going to log into Hugging Face. I won't add it as a git credential; I will get a token. I've got a token, we're logged in now, and I can save that, or at least close it. And the model that we're going to evaluate: we'll evaluate the base model, then we'll do fine tuning, and then we'll evaluate again. So the model I want to test is going to be a base model, and I'm going to set the model slug equal to the Phi model. I've got the Phi model up over here, and yeah, the dark mode is pretty poor on Safari, so I'll paste it in. The dataset is the touch rugby dataset. The training split is called train, the eval split is called eval. There is also a mirror eval split; this is literally a subset of the training set, so we can test overfitting. I talked about that in the dataset prep video. Within this dataset there's a column for question, a column for evaluation criteria, which we use for grading, and a column for answer, which we use for fine tuning. So the answer is for fine tuning, the evaluation criteria is for grading, and for setting up the judge we're going to use Gemini Flash, Gemini 2.0. We're going to provide a long context length; it can be shorter, but if you're using reasoning, you need more length. And we'll use a low temperature for the grading here. This allows for fast weight downloads, and this makes sure that we download the weights onto the disk. When we're on RunPod, we want the weights to go on the volume; we don't want them on the container. The volume here is about 500 GB; the container is only, I think, maybe ten. So I'll run this; that just sets those variables. And now we're going to set up the judge. It's asking me for an API key for Gemini, which makes sense because we're using the Flash model. So I'm just going to go over to AI Studio, I'll do this off screen, and I'm going to create an API key. I paste in that API key and it's set now. And just to briefly show you the judge: this is not yet the judge, it's just setting the API key. It's actually saving it to a local environment variables file, so that if I rerun this cell, it won't force me to put in the API key again, which is nice. Here we're setting up an OpenAI client, so we're hitting Gemini using an OpenAI client. And here we've just set up this function called chat that allows us to send a single message into Gemini. Okay, next we're going to prepare the dataset. And when we prepare it, we can turn a test mode on. You'll notice back here, earlier, when I defined the dataset, there's a flag to set test equal to true. If you just want to look at a small sample of the dataset, you can set test to true, and it will also print a lot more debug logs down below in the script. So I've got that set to false, and we're going to load the dataset with the dataset name, check that there are train and eval splits, and yeah, if we're just inspecting, we're going to slice the data so that we only take a few rows. So here we're potentially sampling if test is true; otherwise, we're loading the full dataset, and we're just printing out to make sure everything is in order here.
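As a rough sketch of that dataset loading step, assuming the split names above; the dataset slug and the five-row sample size are illustrative, not the exact values in the notebook.

```python
from datasets import load_dataset

DATASET_NAME = "Trelis/touch-rugby-comprehensive"  # illustrative slug; use the real dataset name
TEST_MODE = False  # when True, only keep a few rows and print extra debug info

train_ds = load_dataset(DATASET_NAME, split="train")
eval_ds = load_dataset(DATASET_NAME, split="eval")

if TEST_MODE:
    # Slice to a handful of rows so a quick end-to-end run is cheap.
    train_ds = train_ds.select(range(5))
    eval_ds = eval_ds.select(range(5))

print(train_ds)  # expect columns for question, evaluation criteria, and answer
print(eval_ds)
```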
So when I print the train and eval sets, there should be around, yeah, 244 training rows and then 32 in the eval split. Okay, now we're going to load the model for inference. This is how you load with vLLM: you pass in the model slug, the GPU maximum memory utilization, the data type, and the max sequence length, which we set up earlier, plus our sampling parameters. So here, when we're generating, we're going to use a temperature defined by this temperature here, which I think I've set to 0.7, a top k of 40, and a top p of 0.95. These are important because they make sure that very low probability tokens are not accepted. You don't want very low probability tokens, because they can drag your completion off in an unexpected direction. At this point, the model is going to be loaded. It'll be downloaded from Hugging Face first, and then the shards will be loaded onto the GPU. This is typically where you hit problems, if you're going to hit problems, when you're trying to load a given model. So here you can see the smaller files have been loaded quickly, and now we're going to load the safetensors; you can see there's about 7 GB. Notice here that we are using FlashInfer. FlashInfer is a relatively recent library; it's faster than Flash Attention, and if you have it installed, it will be used by vLLM. If you don't install it, I think it won't be used by default. So this model, Phi-4 mini, is relatively small, about three to four billion parameters, and pretty fast to load the weights. It's then going to create CUDA graphs, so it's going to calculate some forward paths for optimally doing the computations. This takes a bit of time, but then it makes inference faster because you've precomputed what's called the graph. Okay, so when the model is loaded, we can run evaluation. To run evaluation, we need to get answers, which is straightforward: we just pass the question to the vLLM model. But then we need to evaluate that answer, and to evaluate it, we're going to need a prompt. Here we have: you are an expert evaluator tasked with determining if the answer satisfies the specified evaluation criteria. We're telling Gemini that it will receive a question, the evaluation criteria, and then the model answer that it needs to evaluate, and it's just going to score a one or a zero. So it's either right or wrong. And the prompt template passes in the question, the eval criteria and the model answer. Okay. This is pretty much it. We're going to define a helper; this is the evaluation result. We want the evaluator to first give a reason and then indicate whether the model is correct or not. We'll then extract, using a regular expression, the score, whether it's one or zero, to determine whether it's marked correct or not. We also have this regular expression to extract any thinking. If you're going to evaluate the answer, the shorter the answer, the easier it is to evaluate, and if you include all of the thinking along with the answer, it's going to be harder to evaluate. For that reason, if there is any thinking, we're going to strip it out here. Then we have a function to evaluate. This evaluate answer function takes in a generated answer, a ground truth, an evaluation criterion and the question. It's going to strip the reasoning and then pass it for judging. Then we'll call the judge LLM, and we'll get the response and parse that.
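Here's a minimal sketch of those two helpers. I'm assuming the judge is asked to wrap its verdict in a tag like <score>1</score> and that any reasoning comes wrapped in <think>...</think>; both tag names are assumptions for illustration, not necessarily what the notebook uses.

```python
import re

def strip_thinking(text: str) -> str:
    # Remove any <think>...</think> block so the judge only sees the final answer.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def parse_score(judge_response: str) -> int:
    # Expect the judge to end with something like <score>1</score> or <score>0</score>.
    match = re.search(r"<score>\s*([01])\s*</score>", judge_response)
    return int(match.group(1)) if match else 0
```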
And when we want to evaluate a model, here's where we need to create the messages with the question, generate an answer, and then pass that answer in for evaluation. So that's the full loop. Just a note here: if you want to disable thinking, you can do that for Qwen 3 models by passing in this parameter here, but you do need to install vLLM from source, at least for now. So we've defined that evaluation function, and we're just going to run it on a row. This is the fifth row of evaluation data, and it looks like my API key is invalid. So I'm going to go back up and reset that API key, and to do that I think I just need to search for the word reset. Okay, by the way, you can also select OpenAI for grading if you wish, but I'm going to use Gemini. Let's now put in my API key; I'm just going to get a new key here. I must have pasted in the wrong key. Okay, let's try that, and we'll continue on down and see if that evaluates the model for us. And here the LLM has failed, and that's because I need to reload the model. I shouldn't have rerun that cell; I should have gone straight to the evaluation because the model was already loaded. So I hit a CUDA out of memory error. See, vLLM will often use up the full memory when it loads a model; it will preallocate that memory, so it's recommended not to rerun the model loading. Okay, so it is going to take a second now, because we need to reload the model with vLLM. So yeah, before I restart the kernel, I'm just going to set that reset flag back to false, because I should have fixed my API key, and now I can run all the cells. The model is reloading, and I think sometimes the graph can be cached, so that improves the loading speed as well. And you can see, for this maximum sequence length, vLLM with this GPU is capable of a concurrency of 56, so it will automate batching for us like that. And here we go, it's now evaluated the question: what's the regulation about touch rugby participants covering the ball with their clothing? Here is the generated answer, here is the evaluation criteria, and the judge has marked it correct, because it says that intentionally covering the ball is a foul. So the final evaluation is one out of one, correct. So this was just evaluating one question, but what we want to do is evaluate a batch of questions. So we're going to use batching: we're going to have vLLM answer multiple questions in parallel, and then we're going to use threading to make parallel calls to Gemini's API. That's what's happening here. We're batch evaluating the model by building a list of conversations, that's a list of all the different questions, and we're passing that into vLLM; we're passing it into model.chat here. Again, if you want to disable thinking, that's the line to include. Then we're going to judge using a ThreadPoolExecutor that makes multiple parallel requests, and we'll get back the final score here. So you can now run a short test. You can set test to true and it'll just run a few rows of the dataset, and we can see how that does. And actually, sorry, it's running five rows; that's why it's generated five responses, and then it's going to judge those. So here's the sample generation. In fact, it's just giving us the five generations first, and then the judging of those five. You can see the results coming out here, and it looks like we've gotten 40% correct.
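Here's a minimal sketch of that batched pattern, reusing the helpers above. The model slug, the column names, and the judge_answer function are assumptions for illustration; the real notebook's names may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from vllm import LLM, SamplingParams

# Load once; vLLM preallocates GPU memory, so avoid re-running this cell.
llm = LLM(model="microsoft/Phi-4-mini-instruct", gpu_memory_utilization=0.9,
          dtype="bfloat16", max_model_len=8192)
sampling = SamplingParams(temperature=0.7, top_k=40, top_p=0.95, max_tokens=1024)

# One conversation per eval row; vLLM batches these internally (continuous batching).
conversations = [[{"role": "user", "content": row["question"]}] for row in eval_ds]
outputs = llm.chat(conversations, sampling)
answers = [strip_thinking(out.outputs[0].text) for out in outputs]

# Judge the answers in parallel threads, since each judge call is just a network request.
def judge_row(args):
    answer, row = args
    response = judge_answer(row["question"], row["criteria"], answer)  # assumed helper and column name
    return parse_score(response)

with ThreadPoolExecutor(max_workers=32) as pool:
    scores = list(pool.map(judge_row, zip(answers, eval_ds)))

print(f"{sum(scores)}/{len(scores)} correct")
```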
Now we're going to evaluate the full evaluation dataset, but because we have a non-zero temperature, this is not deterministic. So actually, when you evaluate, you want to evaluate multiple times on that same eval dataset. Maybe if you had a very large eval dataset you wouldn't have to do this, but because my dataset is 32 rows, you'll find some variance if you just run it once. So I recommend running it at least three times, and for that I've got a function here that allows me to run it m number of times. So this is for running evaluations m number of times. I'm going to just copy this here, because I've run it previously on Gemma 3. Let's just create a new cell, and it's going to print out the dataset name and the eval split name. So we're running the comprehensive dataset, the eval split, and we're running the Phi model. We should be ready to go. Now, one thing I don't like here is that I'm still in test mode, so I'm getting all of the logs here. So actually what I need to do is set test equals false and run it again, and that'll suppress all of the detailed logging you can see here. We're going to run on all of the prompts, and then it's going to evaluate those using the judge. Now actually, you could increase the number of threads here; Gemini is able to take many more. You could probably even increase it up to 128 if you wanted. And I think I have an issue with the kernel, because I can see my GPU memory is not working. That may be because I just stopped it during processing; sometimes if you stop vLLM midway, you run into these issues. Maybe I should have let it run out, but I didn't feel like it, because it was printing too much debugging. So we run it again here. And just while we were waiting for that to run: I'm actually running the comprehensive dataset, as opposed to a manual dataset that I curated, so this should actually be moved down. You can see evaluation is starting there, by the way, and I can probably just move it down here by clicking the down button. We can now run that full evaluation. Now, this is a section for a manual dataset I curated, but we're running the comprehensive dataset. You can see here that I ran it previously on the Mistral model, and if I just paste in a copy of this, I'm adding in test equals false to make sure we don't have too many debug logs. You can see now we're processing the prompts, so 32 prompts. And I think I actually should have put that with a lowercase test. Either way, we can probably wait for it to complete and then rerun it so it doesn't print out all of the logs here. So it's going to print out 32 of the answers, and here you can see it making those parallel calls to Gemini. We're running basically 96 different prompts here, because we're running this three times. So we're running the evaluation three different times, and you can see here the dataset, the eval split, and then the model name that we're running. Now I'm just going to run this again so it doesn't print the verbose logs, by setting test equals false, and in the meantime we can take a look at some of the results. So here, with Mistral Small, when I ran this three times, I got an average of 13 answers correct, somewhere around 40% overall. I can show you also, here are some archived results on the Qwen 1.7B. The Qwen 1.7B scored five, so about half the amount, five instead of 13. And I can show you also, let's see, do I have any other results down here? Gemma 3 4B: the Gemma 3 4B score is nine.
So yeah, the small Qwen model scoring about four, sorry, five, and that's including reasoning, by the way. Gemma 3 4B scoring about nine, Mistral scoring about 13. And now let's go up and see how we score with the Phi model. I'd expect, I don't know, maybe something like the Gemma 3 4B, somewhere around nine. Okay, we got six. Then the next one, we got four. So that just shows you the variance, and that's why it's valuable to run multiple times. And then the last one here, we've scored six. So on average, we're scoring about five. So actually this model, the Phi-4 mini, is not much better than the Qwen 1.7B. What this does is give us a baseline: we've got 5.3 correct. We're now going to run fine tuning, and we'll come back and run this again, and we're going to see if we get an improvement in performance. Now, improving performance is not trivial. It's not obvious that we will improve just by doing this fine tuning. I have not done augmentation on this dataset; it's a raw set of answers that were generated by Gemini Pro, which may not match, probably doesn't match, the kind of logits or probability pattern of the model that we're training here. So I could definitely do a better job of improving this data, and I'm not sure we're going to improve performance here by fine tuning, but at least I'll be able to show you how the scripts all work. So we move on down here. In fact, I'm going to minimize this section on evaluation, and we'll move to the fine tuning section. Now, for running fine tuning, we're going to need to use Unsloth, and we're going to uninstall vLLM to make sure we don't have conflicts. If you want to speed up the fine tuning, you can install Flash Attention, but it does give issues with Qwen, so that's just a little warning, and if you want to use it, when you load the model, you need to add this attention implementation argument here. So I'm just going to run this cell; I'm not going to install Flash Attention for now. If I'd restarted the kernel, it would have gotten rid of these warnings. That's okay, though; it's going to still correctly install, and I'm going to now restart my kernel, and we should have Unsloth installed. Now, just two troubleshooting things. If you find that there's a conflict with torchvision: you do not need torchvision anymore, I think, with the latest version of Transformers or Unsloth, so you could just uninstall it. Also, if you have issues with Unsloth's cut cross entropy, Unsloth has a custom cross entropy that saves on memory, and I think maybe on compute; you can disable it, if there are issues, by running this line here. Okay, so I've restarted the kernel, and I should still be logged in. Actually, it looks like maybe I'm not fully logged in, so I'm going to go across and get a token. The reason I want to be logged in here is so that I can push models up to the Hub or access private models. Now, the model that we want to train is going to be the Phi model again, so I need to populate that. And yeah, we can set the max sequence length to 8000. We're going to fine tune in 16 bits, which I recommend, but eight bit, I think, is actually not bad, so if you wanted to save memory, you could do that. We're going to use this dataset, and we're going to set the names of the question column, the criteria column and the answer column. So we've done that right here. And if you wanted, you could use a different dataset for eval by setting it here in the load dataset call right here. Okay, so we're going to set these variables. I've got this little helper function.
This is just a function to clear CUDA memory. If you've loaded a model and you want to reload one, you can clear out the model that's there already. So we just create this little helper function. And now we're going to load the model. I'll run this here, and I'm going to print out the padding side. Typically, for fine tuning, you will want to pad on the right hand side, or you want to use whatever the model's default is; but for inference using vLLM, you typically want left padding. So just a note there. We're going to print this out and see what happens for Phi. Now, Unsloth here is downloading the model, which I do not want, because the model should actually already be downloaded. So what I need to do is potentially set this cache directory and retry. And again, it's downloading the model here. So I'm going to restart the kernel, and let's just check that we have the name of the model correct: Phi-4. Yeah, it could be that Unsloth is downloading it from its own repo, because Unsloth has got a version of all these models. I'm not entirely sure, but I suspect that may be what's happening. If I look at the Phi-4 mini model here, this is the original model, but if I copy this, there's probably an Unsloth version. Yeah, Unsloth has got this version here, and it may just be defaulting to it. So that's why it's actually downloading the model even though I've downloaded it already for vLLM. That's okay, though. We can keep going, allow it to download, and then we'll see what padding side is the default, and we'll also print out the model. Okay, so pretty fast. Yeah, it's not using a padding token, so Unsloth is automatically setting it to the end of text token, and it's using the left hand side for padding, which should be fine. And you can see the model architecture here: 31 layers, the MLPs, the attention, and actually the attention is fused, so the Q, K and V are fused together. Unsloth might decide to unfuse those; I'm not sure, we'll see what happens. So this here is a function just for me to inspect the size of these different modules. This is relevant because larger matrices should be trained more slowly, and this affects how we put on the adapters. And yeah, we actually need to adjust this code to work for Phi, because the layout is not the same. So if I go to ChatGPT and create a new conversation, I can say: update this code to support the Phi architecture. And sorry, my screen is smaller; I'm just going to paste in the Phi architecture now. So paste this in and we'll print. Now, why am I going to this trouble to see the dimensions? I'm doing it because I want to know what I should set my LoRA alpha to. The LoRA alpha should be the square root of the smallest matrix dimension. Basically, LoRA alpha sets a bar for how you think about the relative training rate of the adapters, the LoRA adapters, versus the main matrix. We're using LoRA, low-rank adapters. That means we're not going to fully fine tune all the way; we're just going to put these little adapters that clip onto the main model, and we're going to fine tune those instead. But because they're smaller, you need to train them faster. The size of the adapters is determined by the rank: the larger the rank, the slower you want to train it. And that's actually scaled automatically when we set use rsLoRA, but you do need to set this alpha parameter, which effectively references the matrix size, because it's all about the relative size of the adapters compared to that original matrix that we're going to freeze.
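To make that rule of thumb concrete, here's a rough sketch, assuming Unsloth's get_peft_model interface. Treat the hidden dimension, the target module names, and the sqrt-of-smallest-dimension heuristic as this video's guidance rather than a hard rule.

```python
import math
from unsloth import FastLanguageModel

# Rule of thumb from above: set LoRA alpha near the square root of the smallest
# matrix dimension you are adapting (e.g. sqrt(3072) is roughly 55).
smallest_dim = 3072  # read this off the model's module shapes; illustrative value
lora_alpha = round(math.sqrt(smallest_dim))

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                                     # adapter rank
    lora_alpha=lora_alpha,
    target_modules=["o_proj", "gate_up_proj", "down_proj"],   # adjust per architecture
    use_gradient_checkpointing=True,                          # recompute activations on the backward pass
    use_rslora=True,                                          # scale the adapter update with the rank
)
```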
Okay, so it's given us this unwrap function, and we'll see if it works. Let's see here: is it giving us both functions? Yes. So I think I can just paste that in. And if I run it, yeah, now we can see here we've got this fused layer, which has a smaller dimension of 3000, and we've got the MLP, which has 3000. So what we want here is the square root of 3000, and I think that works out: 32 is about the square root of 1000, so it should be about 1.7 times that, so something like 50 is going to be fine. And now we're going to get the parameter efficient fine-tuned model. This is where we create the adapters. We're going to take the model and set a rank of 32; that should be fine. If you want more granularity, you can increase it. And we are not going to fine tune any vision layers. In fact, I don't know if Phi supports vision; in any case, I don't think it does, I think it might just be text. But if you're loading something like Gemma 3, you want to set this to false so you're not tuning it. Then we're going to decide whether to train the attention or the MLP modules. If you're training an MoE like Llama 4, you typically do not train the MLP; you just train attention, because the MLP is going to be sparse, because it's a mixture of experts. So yeah, that's basically your guidance. Additionally, if you want to train the embeddings: that's often relevant if you're trying to change, for example, where we've created a pad token or we've used an EOS, end of sequence, token. It's probably not necessary to train the embeddings, but if you redefine some new tokens, or the purpose of those tokens, then you do need to train the embeddings. LoRA alpha is passed here. We're going to use gradient checkpointing; that means we're not going to store everything on the forward pass, we're only going to recalculate when we do the backward pass, and that saves memory. We're not going to do full fine tuning, although Unsloth does support that. And we are going to automatically scale the learning rate of the adapters based on the size of the rank. So we're creating these adapters now, and then we're going to see how many trainable parameters we have: about 0.46%. Now, these names here, the modules to save, should match what we have in the model up here, so it should match lm_head and embed_tokens. I think this is not actually setting them to trainable for Phi, because the embeddings are usually large and the trainable parameter count would be much higher if we were actually training these. So I don't think we're actually training the embeddings here; we're just training the LoRA adapters. I also want to see what modules are set as trainable. So let's just ask here: give me a function to see what modules are set to trainable in the model. I suspect that, because the QKV are fused, either Unsloth has to unfuse them or it's not going to set them trainable. But I'm not sure about that; maybe it will. So let's add this here and run, and we do need to pass the model. Okay. So yeah, it looks like the MLP layers are being trained, and that's pretty much it. Oh, and the o projection, yeah, that makes sense. But the QKV is not, because it's fused. So basically Unsloth is not unfusing here, which means we're only training one of the modules in the attention; we're not training all of it. Okay, so we've got the model, we've got the adapters. We're now going to load the dataset; we're just loading the fine tuning dataset here.
We can print out a sample question, and a sample question from the eval dataset. You can see here the training data; it's got a lot of columns. The ones we're using are the question and the answer for fine tuning. And here's where we set that up into a prompt. So the prompt is going to have a user message with the user content, and then an assistant message with the assistant content. The user content is the question; the assistant content is the answer. And here we're just going to format that with the template. So this is what the Phi templating looks like: we've got user, and then we've got the assistant. And notice here, we'll need this, because we actually want to focus the training on the assistant response, not on the user question. So later on, we're going to want to mark this here as the token indicating the user turn, and then mark this as the tokens, or the string, for the start of the assistant response. Okay, now we're going to start to set up the trainer. We need to set a learning rate. We're going to try a batch size of four; this model probably can fit a larger batch size because it's just 3.8B. I'm going to use four gradient accumulation steps. My virtual batch size is 32. We train for two epochs: one constant and one decaying, with a cosine for some annealing. For a 3B model, we want about 1e-4 as a learning rate, so I'll put that in here. And we're going to define a current timestamp, just for naming the model, and set up a run name based on the model name, the fine tuning dataset name, the number of epochs and a timestamp. We calculate the number of training steps, which is the total number of rows of data divided by the virtual batch size (which is the batch size times gradient accumulation), multiplied by the number of epochs. We'll warm up for 1% of the steps, and we'll anneal for the last 50%. We'll print the virtual batch size and the total steps, and we're going to set this custom training scheduler. Basically, it's going to be constant when we're below the start of annealing, and then it's going to follow, I think, either a linear or cosine drop from there. Yeah, it looks like a linear drop in learning rate down towards the end, which should hopefully smoothly bring us down towards a local minimum. Okay. So: virtual batch size 32, 14 steps, zero warm-up steps, because we don't have enough steps for 1% to be meaningful, and the annealing will start at step five. Now, you could warm up maybe a little bit more. For example, where do we have the warm-up? Yeah, 0.01. I mean, we could make it 0.05, 5% of total steps. And now we've still got zero; it looks like I would need to make it larger, maybe even 0.1. Yeah, so now we've got one warm-up step. I don't like that, though, because I think 10% is generally too much for warm-up steps, so I'm just going to leave it, and we're left with no warm-up. That's fine. Okay, so now we're going to set up some of the arguments. We'll pass the training batch size, and we're going to use that same batch size for the evaluation batch size, then gradient accumulation steps and epochs. We're going to log every so many steps; in fact, we're going to log every 5% of steps, capped so it's no less often than every ten steps.
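Before going on with the rest of the trainer arguments, here's a minimal sketch of that constant-then-linear-anneal schedule, assuming a plain PyTorch AdamW optimizer on the model from above. With vanilla Transformers you can hand the pair straight to the Trainer via optimizers=(optimizer, scheduler); with Unsloth, as I explain in a moment, you have to attach it to the trainer after it's created.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

total_steps = 14          # rows * epochs / virtual batch size, as computed above
anneal_start = total_steps // 2  # hold the learning rate flat for the first half

def lr_lambda(step: int) -> float:
    # Constant learning rate until anneal_start, then a linear decay to zero.
    if step < anneal_start:
        return 1.0
    remaining = total_steps - anneal_start
    return max(0.0, (total_steps - step) / remaining)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = LambdaLR(optimizer, lr_lambda)
```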
We're going to evaluate based on steps, every ten steps, but no more often than every single step. And what else here? Yeah, we're going to use gradient checkpointing, and we'll use reentrant checkpointing; this speeds up the calculations. It's a bit more complicated, so it can sometimes give errors, sometimes on Qwen models, although it worked for me with Qwen 3, so we're going to leave that set to true. And yeah, we pass in the max sequence length, and now we'll pass all of these parameters into the trainer. They're going to be passed in here, along with the training and eval datasets, the model and tokenizer, and the formatting function that we defined earlier. Now, here's an example of where Transformers is different. I should have shown you above, but I will in a second. Normally, what you would do to set the optimizer is pass in the optimizer and the scheduler here. But that is not possible to do in Unsloth, so you have to retrospectively set the optimizer here, and also make sure that it doesn't get overwritten by later steps of Unsloth. This is necessary because I'm using a custom scheduler, so I have to make sure that the scheduler is actually being applied here. I can show you very quickly: if I go to my Windsurf and look at the Transformers code, when I go to the trainer and the optimizer, you can see it's just being passed in here, optimizers equals the optimizer and the scheduler. That won't work with Unsloth. While we're at it, just one other difference here as well: when we're loading the model with Transformers, we load it like this, with AutoModelForCausalLM. But this will not work for loading a multimodal model, so it won't work for Gemma 3. It might work for Gemma 3 4B, because that's text only, but for the larger ones it won't work, and it won't work, I think, for Mistral either; you need to use a different and specific way to load it. Whereas Unsloth has wrapped things in a way that you can use the same loading for every model. So if we look at Unsloth here, when we load the model, pretty much you can pass any model and it's going to load correctly, just by using this FastLanguageModel, so it actually supports multimodal models. That's quite a nice feature. One other difference is when we do the PEFT, the parameter efficient fine-tuned model: there's a slight difference in setting it up. Here you use FastLanguageModel.get_peft_model for Unsloth, and you have these kinds of wrappers that allow you, at a high level, to control vision versus language, and attention versus MLP. Whereas if you're looking at Transformers, when you get the PEFT model, it's a little bit more raw, which is kind of beneficial because you can target specific modules to turn on. This is get_peft_model; it's not FastLanguageModel. And I think this might also work if you're using Transformers. Okay. So we have loaded, have we loaded? Yeah, we've loaded the model, and we've defined all of our training. I need to make sure I run these cells, because I'm going through explaining things to you without checking everything. So we've now defined the trainer, and we have one more thing to do, which is to define how we're going to train on completions only, and we need to define this for Phi. So I'm going to create this, and I want to pick Phi here and put in the correct start and end of the chat.
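Unsloth ships a helper for exactly this completions-only masking. Here's a minimal sketch; the Phi marker strings below are placeholders that would need to be verified against the printed chat template, which is what I do next.

```python
from unsloth.chat_templates import train_on_responses_only

# Mask everything except the assistant's reply, so loss is only computed on completions.
# The instruction/response markers must match the model's chat template exactly;
# these Phi-style strings are placeholders to illustrate the idea.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|user|>",
    response_part="<|assistant|>",
)
```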
And to help me here, I can just print out the chat template, though I can see it's actually a bit hard to read exactly what I need to include there. It's probably more helpful if I inspect some of the templated text, which says user. So I'm going to copy all of this here and paste it, and this here is going to be my instruction part. Give it a second; I think I've lost connection to RunPod. I may just need to rerun some of the cells. And then the end portion is this here. Okay, so I'm just going to save this, and I'll download a copy just in case; I don't want to lose my work here, but I think everything will be fine. I might just need to rerun a few of the cells. I can actually just restart the kernel, go back to the start of the fine tuning here, and run all of these cells. And the reason I'm doing this chat completion setup is because I just want to train on the completion part. It's very nicely illustrated when you run these two cells. It shows me an example from the training set, example zero: it'll show the full training row as it's going to be passed into the trainer, and then this is just going to show which part we're going to train on. So the loss will only be calculated for the assistant tokens, and this is typically the recommended way to train. So yeah, this is the full text being passed in, and you can see that all of this part here is masked when we're training, so we're only going to compute the loss on this last portion. And it looks like everything is printing fine here. It doesn't print out the end of text token, which I think is fine; it's important that it does have at least one end of sequence token being generated here. Okay. So next we're going to start the training. We print out the stats, we make sure we have the datasets, and start the training. And it looks like we have an error. So what's happening here? I'm not entirely sure what the issue is. I'll check with ChatGPT; my inclination is that we may need to turn off the reentrant checkpointing, possibly. Well, let's see. And possibly using o3, by the way, would be a better idea. Yeah, it wants us to disable torch compile. I could just try disabling torch compile, but it's not obvious where exactly I would disable that, because I'd have to get into the Unsloth code if I want to do it. So yeah, we might have an issue here. Can I somehow disable compiling like this? I'm not entirely sure; it may be hallucinating here, but let's go back. And when the model is imported, we need to make sure we do this right at the start of the script. So when we import os, we're going to disable compile if we're trying to tune Phi-4 mini. I'll restart the kernel and try to run this. I'll comment that out so it doesn't spam everyone, and we'll see if this works. Maybe it won't, and we'll just go to another model. But this is an example of where you might want to use the Transformers script to get things to work. Okay, so that did work; we managed to disable compile. Now we're training, and the loss is looking good. You can see the training loss is falling, and the validation loss is falling as well. So everything is looking pretty good here, and we're going to save this. Let me set the model name that I want to save with; I want a better run name than what we have. So it's going to save it like this, judge rugby, and let's just put that in as a name here. Okay, so we've got Phi-4; we'll name it. And yeah, we could push it to the Hub. Why don't we do that? See if this works and run this. Org is not defined.
I don't know why I commented that out there. There we go, and print the run name. And in the meantime, let's check out the logging; let's check the logs. To check the logs, it's easiest to connect via SSH. I'll just copy this here, go over to Windsurf, and then in my terminal I'm going to SSH. I actually need, this is my SSH file, which is in the .ssh directory on your computer; you should find one there. But you need to create an SSH key and put the public key into RunPod, and then you should be able to connect. Once you've connected, you can start up TensorBoard. So yeah, uv pip install tensorboard, move to the workspace and then run it in order to see the logs. And if this works and it's up and running, we should then be able to access it via the RunPod URL. Yeah, so it's open and running. I think I could click this, but I don't think it's going to bring me to the right page, because this will just bring me to localhost, and that's not going to be accessible. I need instead to go to my RunPod pod ID here. And actually, the address I need to go to depends on the pod ID, which is going to be this. So paste here, copy, and now I can check TensorBoard. Basically I'm porting in, because RunPod allows me to port into this. And the run we just did: two of these failed, so I'll just show the ones that passed. The eval loss looked beautiful; take off smoothing, it's falling, and you can see it's kind of asymptoting, so we're getting down to the best point there. The gradient norm is a bit higher at the start, but it's good, it's below one. Then the learning rate is flat, then declining; this looks good. And the training loss is flat and declining. So this is all excellent; everything looks great in terms of these curves. And now, if we go back, we should by now have pushed the model. So the model has been pushed, that's excellent, and it's now time to run inference on this model. So I'm going to go all the way back up in the script, restart the kernel, close down this fine tuning section and reopen the eval section. And I'm going to run: these installs should be fast, but actually we should uninstall Unsloth this time, because it can cause some conflicts. So let's uninstall that, make sure we're logged into Hugging Face, and this time we're going to use the fine-tuned model as the model slug. Set up the judge, and I won't have to re-enter my key, because it's saved to the environment variables file; load the model, set up evaluation, run that eval, we'll inspect that later, run batching. And yeah, I'm just going to put test equals false here so that we don't accidentally leave testing on. And it looks like we have an error. So something must have failed earlier here. Yeah, that actually needs to be trellis; my org ID, when I put the model in, should be this. The other thing I maybe should have done is save the model locally so I wouldn't have to re-download it, but that's okay. It's going to probably have to download the model, which is a bit of a duplication, but that's all right. And when you do install, you need to restart the kernel; I forgot about that. So yeah, you rerun the install, then you restart the kernel, and that's why it's a bit tedious swapping between vLLM and Unsloth or Transformers. But on the other hand, the eval is going to be really fast. And yeah, we're running with vLLM here, but we have an issue. So I'm just going to copy that code, go back to the old trusty here, and see what it says.
Back to the error: again, I'm asking 4o here — I should probably be using o3 — but let's see what happens. Yeah, so I'm not going to be able to override it like this. And this is probably the same issue that was happening with Gemma 3: basically, the configuration of the model is not matching what vLLM expects. So what we can do is go back and find the model name, which is this one here, find it on Hugging Face, and then also go to the Phi-4 Mini Instruct page — and yeah, that's wrong, this one here is the one I want — and check the configuration file. So that's the configuration file, and let's look at the configuration file in the original Phi. Does this look the same? Okay, let me make sure I'm comparing the right models, which I am. So everything here looks very similar. The tokenizer section is a bit different, and the auto model configuration is a little bit different, so the auto_map is maybe a little bit different. And the architecture is different as well. So I wonder, if I take this here and paste it in, whether that's going to help. So I'll copy this and edit it. Just checking in case anything else is much different down here... yeah, there's also this — okay, that looks similar, and this looks similar as well. So let's just take this, edit the file, and replace from here (I'll sketch this diff-and-replace approach below). I'm going to copy this over and save it in my notes, just in case I want to reinject it later, but for now let's just match what the original is. This is the original we want to match, and we'll commit those changes. Okay, so let's see if that does anything — it may or may not get to the bottom of this. Restart the kernel and let's try and run it. And as I said, it's not a guarantee at all that this is going to work. If it doesn't work, I'll show you the eval with another model. In fact, you've already seen evals, so this is really just a question of whether we can get it working. It doesn't look good here... yeah, it doesn't look good. So this is an example of where you may decide it's worth running with transformers. What I'll do is just save this here. I'm going to rename it as Phi-4 Mini, and I'll download it and put it up for those who have access to the repo, so you're able to take a look — I'll put it into the fine-tuning folder and push it up. But for now, I'll just quickly show you the transformers script, because transformers should work here, given it's a text-only model. So if I go to the fine-tuning folder and upload the transformers script here, we can probably run a pretty fast fine-tune. Starting off with the fine-tune, we won't even run the eval first; we'll just run the eval afterwards. So let's run through the transformers script very quickly, and we'll make use of variables where we have to, like this. We use this as the base model — no, we're not going to use the fine-tuned one, we're going to use this one here, the Phi-4 Mini Instruct. Everything else is pretty much the same. We're going to load the model. I don't actually think I want to set the padding side to right; I'm just going to print tokenizer.padding_side. And we may actually need to manually set the pad token here, because unsloth normally does this — set the pad token equal to the tokenizer's EOS token — and we print tokenizer.pad_token. So I'm actually going to clear the GPU and reload it. And that's interesting: you can see unsloth seems to be setting the padding side to left.
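If you want to try the same config surgery yourself, it amounts to something like the sketch below: diff the pushed config.json against the base model's and, if you decide to, overwrite it. The fine-tuned repo ID is a placeholder, and, as you'll see, in my case this didn't end up fixing the vLLM loading issue.

```python
# Sketch: compare the fine-tuned repo's config.json with the base model's, then optionally overwrite it.
import json
from huggingface_hub import hf_hub_download, upload_file

BASE = "microsoft/Phi-4-mini-instruct"              # original model
TUNED = "your-org/phi-4-mini-touch-rugby-finetune"  # placeholder fine-tuned repo

base_cfg = json.load(open(hf_hub_download(BASE, "config.json")))
tuned_cfg = json.load(open(hf_hub_download(TUNED, "config.json")))

# Print the keys that differ (architectures, auto_map, etc.) so you know what you'd be changing.
for key in sorted(set(base_cfg) | set(tuned_cfg)):
    if base_cfg.get(key) != tuned_cfg.get(key):
        print(key, ":", tuned_cfg.get(key), "->", base_cfg.get(key))

# To replace the config wholesale with the base one (requires write access to the repo):
with open("config.json", "w") as f:
    json.dump(base_cfg, f, indent=2)
upload_file(path_or_fileobj="config.json", path_in_repo="config.json", repo_id=TUNED)
```

One caution: copying the base config wholesale assumes the fine-tune didn't legitimately change anything — for example the vocab size after resizing embeddings — so it's worth reading the diff before committing.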
We can check that padding side when we run the eval here, but let's just check in the fine-tuning... yeah, unsloth is setting it to left, whereas the default is actually right. So I'm going to leave it as right and print the padding token. I have set the pad token here, but let's just print the pad token before setting it to EOS — and I'll reload it once more. So the pad token, before setting it to EOS, already seems to be the EOS token, and that implies we don't need to do this. I'm not sure what approach unsloth was taking, but it seems like there already is a pad token, so we don't need to set it. Okay, fine. We'll print the model here. Unwrap to base — we actually need to update that, because we want to be able to unwrap the Phi model. And yep, we've got a fused qkv projection, and we're going to increase the LoRA alpha here a little bit because these are fairly large matrices. We're going to target the q — actually, we're not going to target qkv; we're just going to target o, and we're going to try and target gate_up_proj and down_proj. So we want to add in gate_up_proj, and we want down_proj as well — something like this (I'll sketch the corresponding LoRA config just after this section). Now we should check just to make sure that we are training everything possible. Yeah, gate_up_proj is probably also a fused combination, which I don't love. Let's see what we fine-tuned with unsloth here: o_proj, down_proj... it looks like unsloth is separating out the down projection, which is good, but it's not training gate_up, so there we were actually only training a limited number of layers. So when I select what to train here, the gate_up projection might not be trainable because it's fused — well, it might be, but that's not what I'm going to do. So I'm just going to train these ones for Phi, and let's see... yeah, you can see the difference. This is actually working here in transformers: it is training the embeddings, and that's why we're training 24% of the parameters. In unsloth, that doesn't seem to be working for now. Okay. We load the datasets and the formatting function. And yes, there is a difference here from unsloth: you'll remember that there, after we set up the trainer, we then adjusted the masks so that we only train on completion tokens. With transformers, the way I have it set up, I'm actually doing that beforehand. You can see I need to copy over this code here so that we can select Phi. And now I have a mistake here, so I may need to reload my training data. Yeah. So this is equivalent: you can see I'm masking everything up to the end of the user turn, and then I keep and train on the assistant response (sketched below as well). So everything looks good; it's just that I'm tokenizing and masking my dataset before I pass it into the trainer, whereas with unsloth you can tokenize afterwards because unsloth has a built-in function for it. So this is essentially equivalent: we're just going to train on this portion here. Now I need to make a few adjustments: a training batch size of eight for two epochs, a learning rate of 1e-4, and we will run with the evaluation. Everything else here is the same. Technically, this notebook supports distillation as well, although I haven't run it recently — you can check out the distillation video if you want. And we're going to now move to training, which will be interesting... yeah, we don't have any issue with torch compile.
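To make the target-module discussion concrete, here's a rough sketch of the corresponding peft config. The module names follow Phi-4 Mini's fused naming (qkv_proj, gate_up_proj), and the modules_to_save line is my guess at where the "training the embeddings, 24% of parameters" figure comes from — treat the exact values as illustrative rather than the notebook's settings.

```python
# Sketch of a LoRA config using Phi-4 Mini's fused projection names.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                  # bumped a little, since the fused matrices are fairly large
    target_modules=[
        "o_proj",                   # attention output projection
        "gate_up_proj",             # fused gate + up projection in the MLP
        "down_proj",                # MLP down projection
        # "qkv_proj",               # the fused query/key/value projection could be targeted too
    ],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    modules_to_save=["embed_tokens", "lm_head"],  # assumption: fully training embeddings/head is what drives ~24% trainable params
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # always confirm what is actually trainable before launching the run
```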
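And here's a minimal sketch of what "tokenize and mask before passing it to the trainer" means in practice: labels of -100 are ignored by the loss, so only the assistant tokens contribute to training. The function and field names are mine for illustration, not the exact helper from the script.

```python
# Sketch: mask everything up to the end of the user turn so the loss only
# covers the assistant's response (label -100 is ignored by the loss).
def tokenize_and_mask(example, tokenizer, max_len=2048):
    # Full conversation rendered with the chat template.
    full_text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    # Everything up to (and including) the assistant header, i.e. the part we do NOT train on.
    prompt_text = tokenizer.apply_chat_template(
        example["messages"][:-1], tokenize=False, add_generation_prompt=True
    )

    tokens = tokenizer(full_text, truncation=True, max_length=max_len)
    prompt_len = len(tokenizer(prompt_text)["input_ids"])

    labels = list(tokens["input_ids"])
    masked = min(prompt_len, len(labels))
    labels[:masked] = [-100] * masked   # mask the prompt portion
    tokens["labels"] = labels
    return tokens
```

This is the equivalence I'm pointing at: unsloth patches the labels after the trainer is set up, whereas here the dataset already carries the mask before it ever reaches the trainer.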
On that compile point — I don't know if it's because transformers doesn't use torch compile, but everything looks to be training fine here. We should probably be logging more frequently instead of just every five steps, but that's okay; it won't make a difference to the results (I'll sketch the training arguments below). And we are going to push this model up to the hub and hope for some better results than what we got with unsloth. So I'm going to copy this run name here, paste it in here, and comment. We are going to merge the model, save it, and push it up — and by the way, this time we are saving it locally, so I should be able to just run the model locally. While that's working, I'll get ready to run evaluation here by putting in my model name. So the dataset we're going to run is this — or rather, the model we're going to run is this one — and I need to be careful not to run that until it's actually pushed. Okay, I'll just give it a moment to push that model to the hub. While we're waiting for that, we can check out TensorBoard and refresh, and you can see the two runs here. And it's interesting: the training loss and learning rate are the same, the grad norm is a little bit higher for transformers, and for the eval loss we're not logging as frequently, so it's hard to compare exactly, but it does look like the eval loss is a little bit higher when using transformers. That's not something I would expect, because I would think they're basically doing the same thing, and I think we have the same batch size. Possibly... oh yeah, we're training the embeddings. So actually this makes sense: because we're training the embeddings, we're tuning more parameters, and you would expect that to maybe help give a lower loss, but I guess it isn't helping here. It's hard to say too much because this is just one run, but that is the main difference between the two runs — we're training the embeddings in one — and maybe it's better not to train the embeddings. That's maybe a tentative conclusion here. So the model is pushed. We're going to restart the kernel, close down the fine-tuning section, scroll to the top of the evals, make sure we run the installs, and then run all of the evaluations. I'll restart the kernel after doing the installs and then proceed to evaluate. We're hoping to have more luck this time: prepare the dataset and let's see if the model will load. If it does load, we should get out some answers. Okay, so this time everything is loading correctly, so we don't have the same issue as we did with unsloth. Unfortunately, I wasn't able to fix the configuration file to make the unsloth-trained model work with vLLM, but this eval is going to work, so that's good news. We'll continue running these cells, set up the batch evaluation, and then evaluate on this comprehensive touch rugby set. Let's see what happens here — just make sure that test is equal to false. We're trying to beat our baseline score, which was about five or six, if I remember correctly. Let me just open up and see what we got. So that's the Mistral Small result, and we want the Phi result, which is this one. So we've got to beat 5.33. As I said, I'm not sure we will, just because I haven't spent a lot of time on data prep and augmenting, so we may or may not do better. And... seven. Okay, so we're doing better. That's just one run, but let's see if we run it again. Eight, seven. Okay, so the fine-tuning has definitely improved things.
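For reference, the handful of hyperparameters I mentioned map onto transformers' TrainingArguments roughly like this — names, paths, and the logging cadence are illustrative, not the notebook's exact cell.

```python
# Sketch of the training arguments discussed: batch size 8, 2 epochs, lr 1e-4,
# with more frequent logging than every 5 steps so the curves are easier to compare.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-mini-touch-rugby",   # placeholder run name / local save directory
    per_device_train_batch_size=8,
    num_train_epochs=2,
    learning_rate=1e-4,
    warmup_ratio=0.03,
    logging_steps=1,                       # log every step rather than every 5
    eval_strategy="steps",                 # `evaluation_strategy` on older transformers versions
    eval_steps=10,
    save_strategy="epoch",
    bf16=True,
    report_to="tensorboard",               # so the run shows up alongside the unsloth one
)
```

After training, merging the LoRA weights back into the base model and pushing the merged model to the hub is what makes it loadable in vLLM later.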
So to recap the results: we've gone up from 5.3 to 7.3, with a variance of 0.44, so we are seeing a positive effect from the fine-tuning. We're, of course, a long way off getting all of the answers correct, so there's quite a bit of work to do to bring this model's performance up to where we would want it. But you can see how we ran the eval, ran the fine-tuning, and then ran the eval again, and managed to get everything working. So I'm going to save the latest copy of this file here and download it, and it will be uploaded to the fine-tuning folder for those of you who want to run with the Phi model. All right, folks, that's it. I ended up doing quite a detailed and realistic run-through of the problems you run into when you're trying to fine-tune. Hopefully you have more appreciation for transformers versus unsloth, and also for the benefits of the vLLM approach to evaluation — it really is a lot faster than waiting for results if you run inference with unsloth or transformers. The scripts are in the advanced fine-tuning repo; you can purchase access by going to Trelis.com/advanced-fine-tuning. I do plan on building on this further to help fine-tune for reasoning, particularly in verbal, non-quantitative applications, where it's more difficult to do reasoning-type fine-tuning. As usual, let me know if you have any questions below in the comments. Cheers, folks.