speaker 1: Hey, everyone. Thank you. Thank you for being here. In this talk, we're going to talk about LLM post-training. We're going to see all the different steps in the LLM post-training world: the dataset generation, the training algorithms, and finally the evaluation, which is very important. On top of that, I also want to talk a bit about future trends and about test-time compute scaling. A bit about me: I do post-training at Liquid AI. I'm also the author of the LLM Engineer's Handbook, if you're interested, and I've made a few contributions on the open source side with blog posts, models, fine-tunes, and some tools that can be useful if you want to do things like model merging or evaluation. First of all, let's define post-training. Post-training is what happens after pre-training, so it's quite easy. During pre-training, you have a huge volume of raw text, and the goal is to predict the next token in a sequence. You do this to create a base model. This is cool, but this base model is only able to predict the next token in a sequence; it's not able to answer questions or follow instructions. This is the goal of post-training: turning a base model into a useful assistant. To do that, we have two main steps. The first one is supervised fine-tuning. During supervised fine-tuning, we give instructions and answers to the model as input, and we teach the model to answer the question. We do it a lot, and we teach the model a structure, a chat template, that we're going to talk about later. This is what the model learns during this stage. Afterwards, it's a model that is able to follow instructions and answer questions. But we can do a bit better, because if you use it, it might not give you the kind of answers that you want. So the next step is called preference alignment. During preference alignment, we give preferences to the model. We give not only one question and one answer; we give two answers. One is the chosen answer, which is how we want the model to behave, and the other one is the rejected answer, a contrastive example that tells the model: do not output this kind of text. So this is what we do in terms of training during post-training, and we will see the algorithms a bit later. In terms of datasets, as you might have noticed, during the pre-training phase we have a lot of samples, trillions of tokens, and the datasets are really, really large. This is not the case with post-training. The datasets are a lot smaller, and we focus more on quality. Here I wanted to give you the different types of fine-tuning that you might encounter. The first type, in green, is general purpose. With general-purpose fine-tuning, the goal is to create a general assistant, a chatbot that is able to answer any type of question. This is typically what you have if you use ChatGPT: it's a very general-purpose assistant, capable of doing pretty much everything. But you might want to take a model, for example an open source model, and fine-tune it for a specific use case. You might want to do it for a specific domain, like the medical, legal, or financial domain, or a language-specific fine-tune.
So here the idea is that we're going to include more knowledge in the model so it's able to perform better in this specific domain. If you do that, you don't need one million or more samples; you probably need a dataset that is not as big. And there is a third type, which is task-specific fine-tuning. With task-specific fine-tuning, the goal is to really learn one function. It's a very, very narrow type of fine-tuning where you want the model to learn, for example, how to become better at summarization, or how to become a better spell checker. In this case, you need even fewer samples, so it's a lot easier to do. Note that all of these are rough estimates, because it depends on multiple dimensions. For example, the model size is very important: the bigger the model, the fewer samples you need. It also depends on the complexity of the task that you want to achieve in a given domain. For example, if the base model does not know a language at all, if it hasn't been trained on that language, fine-tuning will probably not help you a lot, because you don't have enough tokens to properly train the model in this language. Okay. So when should we fine-tune a model? Here I'm talking about domain-specific and task-specific fine-tuning. There are four main areas, four main points, where it might make sense for you to fine-tune a model. First of all, I would say always start with in-context learning and RAG pipelines, because they're just a lot easier to test. If that doesn't provide enough value, if it's not good enough for you in terms of answer quality, but it can also be in terms of latency or inference speed, there are a lot of different dimensions you can consider, then fine-tuning can be an option. The first case is when you want to change the tone or the format of the answers. For example, if you are trying to make an assistant that writes emails, maybe you do not like the tone that the model has. In fact, I really dislike the tone that ChatGPT has when you try to write an email with it. So this is something you might want to change through fine-tuning. Then, you might want to add knowledge, for example with domain-specific fine-tunes, which is what we talked about here. I'm saying that it's superficial because you cannot learn an entire new language with it, but you can, for example, teach the model facts about yourself or about your company. So there's still potential in adding knowledge during the fine-tuning process. You can also try to distill a very large model, like GPT-4, into a much smaller one, to reduce the cost and the latency, and to increase the inference speed. And finally, in terms of output quality, you might want a very specific, narrow model, for example to make diagrams. When you have such a super specific task, it's possible to really outperform frontier models, because they haven't been trained specifically on your task. When this is done, I'm talking here about evaluation: it's very important to always evaluate your fine-tunes, to make sure that what you're trying to do actually works, and also to be able to iterate, because the first model that you fine-tune will probably not be the last one. You might want to enter a kind of training loop where you evaluate, get information about what works and what doesn't, fix your data, and train again. There are other reasons why you might want to fine-tune a model.
Besides this technical stuff, if you look at why companies care about open source and fine-tuning, you see that it's mostly about control and customizability. Control, because they might not want to just send data to APIs in the cloud; and customizability, because they might want to change the tone and the format of the answers and create their own model. So there are a lot of different reasons why you might want to fine-tune, even though maybe a frontier model is all you need, or maybe you already have Llama and it works well for you. There are also more political reasons why you might want to fine-tune an open model. Okay, let's talk about datasets. The most important question when we talk about datasets is: what is good data? I think this is the main question throughout the entire post-training process. Good data can be summarized into three dimensions. First, you have accuracy. Accuracy is quite easy: is my answer relevant to the question? Am I giving the correct answer? Most of the time it's quite easy to check, but with very complex questions it can be a lot harder. For example, if you are dealing with code, you might want to run unit tests to make sure that your answer is correct. If you are answering complex math questions, you might want to use a solver to double-check that the answer is correct. So there are all these different ways of ensuring the accuracy of the data when you generate it. The second dimension is diversity. Diversity is not as simple to pin down, because it's really about having samples that are very different from each other and that try to cover the entire space of possible interactions that users will have with the model. You really want to cover as much as possible. The problem that was mentioned with synthetic data, for example, is that when you use too much of it, your diversity will collapse. This is why it's good to have a bit of synthetic data, because just a bit can actually improve the diversity, but with too much of it you collapse in terms of diversity, and the data becomes lower quality. Finally, you have complexity. Complexity is really about challenging the model to answer tough questions. It's about giving answers that are long and detailed, with chains of thought, step-by-step reasoning, and that kind of thing. So that summarizes what a good dataset is, and this might be the most important thing to remember, because these three dimensions can be applied throughout the entire pipeline. For example, with diversity, you see here real-life conversations. Real-life conversations are really varied, because people ask questions about anything. You can see here, with a 2D representation of the embeddings of the datasets, that the three datasets on top are really varied and cover pretty much the entire spectrum. But this is not the case if you zoom in on the math dataset or the two code datasets here; you see that they sit in a single region. So if you want to create a math LLM, then okay, maybe math is all you need. But if you want to create a general-purpose LLM, you cannot just have math and code; you will need more to increase the diversity. (A small sketch of how to draw this kind of diversity map follows below.)
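To make that diversity check concrete, here is a minimal sketch, not from the talk, of how you could produce this kind of 2D embedding map yourself. It assumes the sentence-transformers, scikit-learn, and matplotlib packages; the embedding model name and the toy prompts are illustrative.

```python
# A minimal sketch: embed prompts and project them to 2D to eyeball diversity.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

datasets = {
    "chat": ["How do I write a cover letter?", "What should I cook tonight?"],
    "math": ["Solve x^2 - 5x + 6 = 0.", "What is the derivative of sin(x)?"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

all_prompts = [p for prompts in datasets.values() for p in prompts]
labels = [name for name, prompts in datasets.items() for _ in prompts]
embeddings = model.encode(all_prompts)

# t-SNE squeezes the high-dimensional embeddings down to 2D for plotting.
coords = TSNE(n_components=2, perplexity=2).fit_transform(embeddings)

# A broad cloud suggests good coverage; tight clusters suggest narrow data.
for name in datasets:
    points = [c for c, label in zip(coords, labels) if label == name]
    plt.scatter([p[0] for p in points], [p[1] for p in points], label=name)
plt.legend()
plt.savefig("diversity_map.png")
```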
Then there is complexity. Here's a simple example I made. A low-complexity prompt: "How tall is the Eiffel Tower?" A very simple question; you can just write the answer in a few words, so the complexity is really, really low. But then I evolved it, and this is actually something pretty neat, because I evolved it using a framework called AutoEvol. Basically, you just ask an LLM to rephrase your prompt to increase its complexity, and this is what it produced. Here the complexity is not crazy high either, but it's definitely higher, because you see that it has different layers, a lot more detail, and we're asking for several lenses and not just a single one. So this is a good way to increase the complexity of your samples. In terms of data formats, I mentioned earlier that with instruction datasets you have questions and answers, instructions and outputs. You can also have a system prompt to guide the model into adopting the style that you want, for example: "You are a mathematician from MIT." With preference data, we have a chosen and a rejected answer, and the goal is really to teach the model to maximize the margin between the token probabilities of the chosen answer and the rejected answer. We really want to maximize this gap; this is what the model is trained on during preference alignment. Here's an example of a simple data generation pipeline. You can see that we use seed data as input. Seed data can be pretty much anything: raw text, instructions and answers, or just questions from users. Then you can refine it by generating instructions, generating answers, or generating both. There are a lot of different techniques to do that. For example, if you have raw text, the problem is that you do not have questions to go with it; you just have the answers. So you can use back-translation to ask a language model: please write the question that corresponds to this raw text. Then you have scoring and filtering. Here you can use different techniques, like heuristics, but you can also use other LLMs as judges to measure the quality of your answers and filter out the bad samples. On top of that, you have classic data deduplication, decontamination to make sure that you're not training on the test set, and data filtering, because, for example, you may want to exclude some keywords. An example of that is instruction following. Instruction following is the category that is focused on actually following constraints. You have an example here where my constraints are: your entire response should be in English, and all lowercase. So we have two constraints, and we want the model to really follow these constraints in its answer. To create a dataset that teaches the model this kind of behavior, we start with some prompts and some constraints, and we can just append the constraints after the prompt, like here. Then we can query an LLM and ask it for an answer. But we're not sure that the LLM will actually follow the constraints very well, so we can run some tests. It's very easy to make sure that the answer is in English: you can use a library like langdetect to check that it's in the language that you want. And you can also very easily check that it's all lowercase. So you can run these unit tests to make sure that the constraints are really followed (a small sketch follows below). Then you can do decontamination: there's an evaluation set called IFEval, and you want to make sure that you're not generating samples that are too close to IFEval by mistake. And finally, you can do some keyword exclusion to remove certain samples.
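As a sketch of those checks: the talk mentions langdetect for the language test; the helper function and sample data below are otherwise my own illustration.

```python
# A sketch of the constraint checks described above.
from langdetect import detect

def follows_constraints(answer: str) -> bool:
    """Check 'entire response in English' and 'all lowercase'."""
    is_english = detect(answer) == "en"      # language check via langdetect
    is_lowercase = answer == answer.lower()  # fails if any uppercase character
    return is_english and is_lowercase

# Keep only the generated samples whose answers pass every constraint.
samples = [{"prompt": "Describe Paris.", "answer": "paris is the capital of france."}]
clean = [s for s in samples if follows_constraints(s["answer"])]
```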
Another example is UltraFeedback. The first pipeline was for instruction datasets; this one is for preference datasets. With preference datasets, we want to generate not one answer but two. To do that, we have prompts once again, but instead of querying a single LLM, we query a lot of different LLMs, and we use a judge LLM to rate them. We score each answer, take the top one as the chosen answer, and take the worst one as the rejected answer. Then you can remove duplicates and short answers, but this is the idea behind it, and it's a really good way of creating preference datasets. As a case study, here's Open Perfect Blend. This is an open source version of the Perfect Blend dataset, something I made by combining different datasets that are available on Hugging Face. I think the most interesting part is the categories, to give you an idea of the mixture that you want. Here you have a lot of math, a lot of code, and also a bit of chat data, plus the instruction-following datasets that we talked about. You can also see on the right a breakdown with all the individual datasets that compose this Open Perfect Blend dataset. Here we're talking about general-purpose fine-tuning: we want a model that is able to do a lot of stuff, and this is the kind of mixture, the kind of proportions, that you're targeting in this case. Once this is all done, you have your dataset, you've done your mixture, and you're quite happy with it. The final step is to apply a chat template to it. On the left, you can see a storage format; here it's called Alpaca. It doesn't really matter, but it's a way of storing the data, the instructions and the outputs. What we are going to do is map a structure onto it so the model knows who's speaking. Here it's very simple: you have a special token called im_start that marks the start of a message. The role of the speaker is "system", so it's going to be a system instruction, and it stops at im_end. Then we start again with the user, who gives the user prompt, and finally with the assistant, which gives the response from the LLM. If you're doing an entire conversation, you can repeat user, assistant, user, assistant, and so on. This is exactly what the model is going to be trained on, and we do that especially during supervised fine-tuning to make sure that it learns this template. The structure is really, really important, and this is what makes the difference between a base model that is only able to autocomplete text and a post-trained model that is able to actually follow instructions and have entire conversations. (The sketch below shows how a chat template is applied in practice.)
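Here is a minimal sketch of applying a chat template with the transformers tokenizer API; the model name is an illustrative choice of a model with a ChatML-style template like the one described above.

```python
# A minimal sketch: render a conversation with a chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a mathematician from MIT."},
    {"role": "user", "content": "How tall is the Eiffel Tower?"},
    {"role": "assistant", "content": "It is about 330 meters tall."},
]

# Produce the exact string the model is trained on, with the special
# tokens that mark who is speaking.
print(tokenizer.apply_chat_template(messages, tokenize=False))
# -> <|im_start|>system ... <|im_end|> <|im_start|>user ... and so on
```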
Okay, that's it for data. Now we can talk about the training algorithms that we use during post-training. First of all, I want to recommend three libraries for fine-tuning. The first one is TRL, from Hugging Face, built on top of Transformers. It's a very good library that gives you a lot of different algorithms, and it's always very up to date with the research, so you can find very fancy algorithms in there; a lot of them are published every week or month, and they keep it up to date. Then you have Axolotl, which is built on top of TRL and offers some nice features. For example, it can preprocess your datasets for you, and it abstracts a lot of different things, so it's more user friendly. What I really like about it is that every training run is a YAML configuration file, so it's very easy to share your configuration files with other people, and also to look at what other people do to get the right parameters and get inspired by their work. So it's a really cool one. And finally, you have Unsloth. You might be familiar with it; it's really, really good if you want to do fine-tuning with only one GPU. It's super efficient, and it also has a lot of nice utilities. For example, it can handle quantization for you, which is really nice. In terms of techniques, we have three main techniques for SFT. The first one is full fine-tuning. Full fine-tuning is basically like pre-training: we load the model in full precision, and we retrain every parameter on the fine-tuning dataset, the instruction dataset. This is really good if you want to maximize the quality of the model, but it requires a lot of VRAM, because you load the entire model and then you train the entire model. If you add up all the optimizer states and all the activations, plus the size of the model, it will probably require an entire node to do it. So this is nice, but also very costly. A more approachable solution is to use parameter-efficient techniques like LoRA. LoRA is really nice because, instead of retraining the entire model, you just load the model and add adapters, the matrices A and B, and instead of training the model, you only train these matrices. What's really nice about it is that instead of retraining 100% of the parameters, you only train something like 0.5% of them. So it's a lot faster, and it's a lot less costly in terms of VRAM and hardware in general. The only issue is that you still need to load the entire model in memory, and if it's a big model, that's going to be very costly. So if you are really hardware-constrained and you don't have enough GPUs to do the LoRA fine-tuning, you can use QLoRA. QLoRA is a variant of LoRA where you don't load the model in full precision; you load a quantized version of it, in 4-bit precision. Then you apply the adapters and the LoRA fine-tuning. The idea is that loading the model is now not as costly, so it's really good if you cannot do LoRA. The downside is that it's also slower, and it will degrade performance a little bit, by a few percent. So if you can afford LoRA fine-tuning, I would recommend LoRA fine-tuning. And most of the time, full fine-tuning is actually a bit too much; you probably don't need it unless you are really training models because you're an LLM company. (A sketch of a LoRA setup follows below.)
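Here is a sketch of attaching LoRA adapters with the peft library; the model name, rank, and target modules are illustrative choices, not values from the talk.

```python
# A sketch: attach LoRA adapters so only the A/B matrices are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    r=16,           # rank of the A and B adapter matrices
    lora_alpha=32,  # scaling factor applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

For QLoRA, the only change in this sketch would be loading the base model in 4-bit precision (for example via a quantization config) before attaching the adapters.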
Then you have preference alignment techniques. This is the second stage of post-training, and there are a lot of them, really a lot: over 100 different techniques under this umbrella. So I want you to remember only two of them. The first one is PPO, proximal policy optimization. This is kind of the traditional, original algorithm for preference alignment. The problem with PPO, as you might see on the diagram here, is that you need three different models. We complained earlier about loading one model, but here you need to load three: the frozen model, the trained model, and the reward model. The idea is that you feed some text, a question, into the trained model, and it generates an answer. The answer is scored by the reward model, and to make sure that your trained model is not deviating too much from the initial policy, the model before it started training, you also have a KL divergence penalty. So this is good in terms of quality, it will maximize quality, but it's very complex and very costly to run, and so it's not really recommended. Instead, what I would recommend is direct preference optimization, or DPO. DPO is less costly because you only have two models to load, and actually, if you use LoRA, you can load only one model and just adapters for each of them. So it's faster and cheaper to use, but the quality is slightly lower. In practice, this is not really a problem, and I would recommend using DPO over PPO, unless you are OpenAI or in a very particular configuration where you have unlimited resources. Plus, DPO is a lot more user friendly; it's a lot easier to tune the parameters. Speaking of training parameters, here is a list of the main ones that you might want to tweak during fine-tuning. The first, and really the most important, is the learning rate. The learning rate corresponds to how much the parameters are updated during training, and I give you some common values as an estimate. The next two parameters are closely connected: the batch size and the max length. They directly impact the VRAM that you are going to use, because they determine the size of the input that you feed to the model. The batch size is the number of examples that you feed in one pass; the more you feed, obviously, the more VRAM you consume. The max length is the length of each of these samples, so once again, if the length is high, you're going to consume more VRAM. These parameters are really hardware-constrained. With the batch size, you can do something a bit tricky by having a loop: instead of updating the parameters after one pass, you accumulate gradients over several forward passes, and then you update your weights. This is why we make the difference between the effective batch size and the real batch size. After that, we have epochs, which correspond to the number of passes through the entire training set. This is quite easy to set, probably between 3 and 5, and you can monitor it; we're going to see how to monitor the experiments afterwards. The two final ones are also very easy to select. The optimizer is the algorithm used to update the parameters, and I would just recommend AdamW. You can explore a bit more, but AdamW is really good, a very strong baseline. And regarding the attention mechanism, FlashAttention is pretty much all you need. FlashAttention is a very efficient implementation of the attention mechanism in the transformer architecture that makes it a lot faster, especially for processing long sequence lengths. So this is the one I would always recommend, if you can use it. (A sketch of how these parameters map onto a training configuration follows below.)
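As a sketch of how those knobs map onto an actual configuration, here is a hedged example using TRL's SFTConfig; all values are illustrative, and argument names can vary slightly across TRL versions.

```python
# A sketch of the hyperparameters above in a TRL SFT configuration.
from transformers import AutoModelForCausalLM
from trl import SFTConfig

# FlashAttention is enabled when loading the model, not in the trainer config.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", attn_implementation="flash_attention_2"
)

args = SFTConfig(
    output_dir="./sft-model",
    learning_rate=2e-5,             # the most important knob
    per_device_train_batch_size=4,  # real batch size, bounded by VRAM
    gradient_accumulation_steps=4,  # effective batch size = 4 x 4 = 16
    max_seq_length=2048,            # max sample length, also bounded by VRAM
    num_train_epochs=3,             # typically 3 to 5 passes over the data
    optim="adamw_torch",            # AdamW as a strong default
)
```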
To monitor experiments, here's an example with the learning rate. You can see a bad learning rate and a good learning rate. The bad learning rate is the one with the loss spike: we start with a smooth descent, and then immediately after that, we see the loss spike. This is probably a problem with the learning rate; I would say it's too high. If you reduce it a bit, it's probably going to give you something more like the curve on the right, which is a lot smoother. You see that there's a little bump at the beginning, but it's not too important, and I wouldn't mind it too much. There are other metrics that you can monitor during training. On top there's the train loss, but there's also the validation loss. If you have a validation dataset, you can monitor that too and make sure that you're not overfitting. And there's also the gradient norm. The gradient norm is really the magnitude of the updates that the model receives. You don't want to see too many spikes there, because that might be a problem with your data, and you don't want it to rise too much either. Here my gradient norm is okay; I don't think it's a problem, because the eval loss and the training loss look good. But if you see that kind of behavior, it might be a sign that the training is not going too well. What you do is take your training dataset and save a small part of it just for validation purposes. You can split it as you would in traditional machine learning, although we're not going to do an 80/20 split like in traditional machine learning, because the training set is so large that it wouldn't make sense. But you can save something like 1% or 2% of it for validation. Now that we have trained our model, you can be in different scenarios. For example, you've trained a lot of different checkpoints, and now you have ten models; or you have one model, but it's a Llama model and there are lots of Llama fine-tunes available in open source on Hugging Face. What can you do with that? One answer to this question is model merging. Model merging is the idea that you can simply average, fuse, combine the parameters of different models together to make an even stronger model. This was a bit of a joke at the beginning, and now everybody uses it; every LLM company applies model merging. So I guess I shouldn't make fun of it now, but it's still a funny idea that you can take kind of random models and get a stronger one in the end. I really like model merging; it's a really neat technique that is maybe still a bit underrated, even though, as I say, everybody applies it now. Yes? [audience question] Yes, exactly, we're going to see that in this slide here: you can just add more models and average them all together. But there are techniques that are a bit more advanced in model merging. One is SLERP, which stands for spherical linear interpolation. Instead of doing a normal linear interpolation, you interpolate by moving along a sphere instead of a straight line. We see that it's generally better; it produces better merges, it's very popular in the open source community, and it's quite reliable. The issue is that you can only merge two models together. You can see on the right a model that I've made using this technique. The second family of techniques, which I find more interesting, is DARE-TIES or DARE linear. DARE is a technique that randomly prunes and rescales the parameters of your source models. TIES is a technique that keeps the most significant parameters in your source models and adds a sign consensus. It can happen that if you take a parameter from model one and the same parameter from model two, one is positive and the other one is negative, so if you combine them, you get zero. And that's actually not good; that's not the best way of combining them. The sign consensus addresses this issue. (A toy illustration of this sign election follows below.)
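To build intuition for that sign consensus, here is a toy numpy sketch, my own illustration rather than mergekit's actual implementation; the delta values are made up.

```python
# Toy illustration of TIES-style sign consensus (not mergekit's real code).
# Each task vector is a fine-tune's parameters minus the base model's.
import numpy as np

task_vectors = np.array([
    [0.8, -0.3,  0.5],   # model 1 deltas for three parameters
    [0.6,  0.2, -0.4],   # model 2 deltas
])

# Elect a sign per parameter by summing the deltas (majority by magnitude).
elected_sign = np.sign(task_vectors.sum(axis=0))

# Keep only deltas that agree with the elected sign, then average them,
# so opposite-signed deltas no longer cancel each other out to zero.
agree = np.sign(task_vectors) == elected_sign
merged = np.where(agree, task_vectors, 0.0).sum(axis=0) / np.maximum(agree.sum(axis=0), 1)

print(merged)  # [0.7, -0.3, 0.5] instead of a naive average like [0.7, -0.05, 0.05]
```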
Yes? [audience question] So the question is: do we need models of the same size, or can we merge models with different sizes? In theory, you could try to merge models with different sizes, but in practice it doesn't work. All the merging techniques I'm talking about are for models of the same size, and the same architecture too; this is very important. For example, we only merge Llamas together, and not Llama 2 with Llama 3. It would only be Llama 2 with Llama 2, and Llama 3 with Llama 3. Yes? [audience question] So the question is: what happens if we merge a full-precision model with a quantized model? I would not recommend doing that, because of course you will get more quality if you only merge models in full precision. I don't think I've tried doing that, to be honest, but obviously the quantized model has some performance degradation, so the merged model will also suffer from it. It's better to only do it in full precision. All right, yes? [audience question] We're going to see that with an example, but yes, indeed: if you have different models, you can assign different weights to them. A common way of doing it is, for example, you have one model that you know is really good, because you ran some evals and you know the scores, and you have other models that might not be as good. So you want to prioritize the first one compared to the others. This is something that I've done to create this second model, Daredevil. If you look at the family tree of the model: I was talking about merging models together, but you can also merge merges together, and you can iteratively obtain crazy family trees of models merged with each other. Here you see the family tree of Daredevil-8B. To make it, you can see the configuration that I used on the left, with the different models. It's a configuration written in YAML with mergekit, and you have parameters like density and weight. Density tunes the fraction of parameters that you retain, because with TIES you don't want to keep all of them, and weight is really about the weight of each model in the combination. This is a bit of a science, or more like an alchemy kind of thing, choosing the right weights and the right parameters to obtain the best models. It's something that you learn by doing model merging quite a lot, to get ideas about what is going to work, but there's no real science behind it. There are published papers about how to do it automatically that are a bit more scientific, but if you use these manual methods, I would recommend trying different parameters, testing a few different merges, and running evaluations on them to see if it works. On the right, you can see that we have a few benchmarks, and what's really funny is that Daredevil, the merged model, is better than the source models on these benchmarks. So the merge was really successful; it's genuinely better than its source models. And this is what model merging is about, right? (A sketch of what such a mergekit configuration looks like follows below.)
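For reference, a mergekit configuration of the kind shown on the slide looks roughly like this; it is a hedged sketch following mergekit's documented YAML format, with placeholder model names and values, not the actual Daredevil-8B recipe.

```yaml
# Sketch of a mergekit TIES configuration; names and values are illustrative.
models:
  - model: some-org/fine-tune-a      # placeholder source model
    parameters:
      density: 0.6                   # fraction of parameters retained
      weight: 0.5                    # weight in the final combination
  - model: some-org/fine-tune-b      # placeholder source model
    parameters:
      density: 0.6
      weight: 0.3
merge_method: ties
base_model: meta-llama/Meta-Llama-3-8B
dtype: bfloat16
```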
To apply that to a more concrete example, let's say you want to build a language-specific model. By language-specific, I mean a language that is not too close to English; German, for example, is too close, and models are already good in German, so it doesn't make sense. But take something like Finnish: models are not really good in Finnish in general. So let's say you want to make this Finnish model. A recipe that you can use is to do continued (or continual) pre-training on Finnish raw text. You do it for billions of tokens, somewhere between 5 and 100 billion, which really depends on how much your base model already knows about Finnish. Then you do supervised fine-tuning and preference alignment, as we talked about, once again in Finnish. If you run the evals, what you're going to see is that your model is now really good in Finnish. Good job! But it's really bad at everything else. This is a common problem: when people try to make language-specific models, they end up with a model that is only good in that language. To fix that, model merging can really help, because you can take the model that you have trained in Finnish and merge it with the instruct, general-purpose model. For example, if you took Llama 3 as a base model, you can take Llama 3 Instruct as the instruct model. By merging them together, you're going to get a model that is not only good in Finnish but also good at everything else. This is one of the powers of model merging: it allows you to add these different skills together without compromising the rest. So this is a case study of why it really works. Yes? [audience question about whether this also works with other machine learning techniques, for example object detection] I don't think you can use a large language model for object detection, because that's another modality, and here we're talking about the same architecture and the same size of models, so this is really only for text. You can do the same thing with VLMs, for example, and it will provide better performance for vision-language models, but you would still need the core skills. It's not a magical technique; it will not give you completely new abilities. It's about combining abilities from different models with the same architecture. All right, let's talk about evaluation. Evaluation is the main problem in the LLM world: it doesn't work very well, and we don't really know what we're doing. But it's really important. I want to convince you that despite all the flaws, it's really, really important, because all this training, all this post-training, is about optimizing something. It's about optimizing models to become better, and evaluation is how we measure whether they're better or not. If we don't have the right evaluation, we are optimizing for the wrong thing. Maybe we don't have perfect evaluations, for sure we don't, but we have something, and we will see how to combine approaches to get a good idea of how good models really are. The first type of evaluation is automated benchmarks. If you know about MMLU, this is the classic example; if you know about the Open LLM Leaderboard, it also uses automated benchmarks. The idea is that you have a dataset, with samples, and a metric. For example, with MMLU, the samples are questions about anything: biology, English, math. You have four different options, A, B, C, and D, and the metric is just accuracy. If the right answer was A and the model says B, well, that's a zero, and we measure the performance of the model as the accuracy over the entire dataset. (A minimal scoring loop is sketched below.)
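Here is a minimal sketch of that kind of scoring; `ask_model` is a hypothetical function that returns the model's letter choice for a multiple-choice question, and the sample is a toy example.

```python
# A minimal sketch of automated-benchmark scoring (MMLU-style accuracy).
def evaluate(ask_model, samples):
    correct = sum(
        ask_model(s["question"], s["options"]) == s["answer"] for s in samples
    )
    return correct / len(samples)  # accuracy over the whole dataset

samples = [
    {"question": "What is 2 + 2?",
     "options": {"A": "3", "B": "4", "C": "5", "D": "6"},
     "answer": "B"},
]
```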
This approach has really good properties. It's really scalable, it's really cost-effective, and you can design benchmarks for very precise tasks: it can be math, it can be as narrow as algebra. It's also reproducible, which is really nice, because people can verify that your evaluations were correct. But the main issue I have is that this is not how models are used at all. We chat with models; we do not check whether they output the right letter when we ask them a question. And it can be difficult to evaluate questions that are more free-form, more complex than just four options. Because of that, automated benchmarks give you an idea of how the model performs in math or in code, but this is really not connected to how people use it in the real world. You can also use more focused benchmarks: I talked about MMLU and very generic, general-purpose benchmarks, but you also have task-specific benchmarks, like enterprise scenarios, and domain-specific benchmarks. For example, we could create our own benchmark evaluation suite for Finnish, for code, for the medical domain, all that kind of stuff. Another way of evaluating models is just to ask humans to rate the answers. A popular way is to use an arena. If you know the Chatbot Arena: you type your prompt, then you get two answers from two anonymous models, and you just decide which one was better. So you really rate the answers of the two models. We keep doing that with a ton of models, and you can use these comparisons to calculate an Elo score; this is how the Chatbot Arena works (a toy version of the update is sketched below). This is nice because you can be very precise in how you want humans to rate the models. You can ask them to focus only on toxicity, for example; this is a popular way of doing red teaming. It has a lot of flexibility because of that. There's also a lot less risk of data contamination, which exists with automated benchmarks. And on top of that, it's directly human preferences, so if you use these scores, you're really optimizing for human preferences, which is kind of the goal of post-training. The problem is that it's really difficult to scale. It's obviously very costly, it's very time-consuming, and humans are incredibly biased too. We like to think that we are the ultimate evaluators, but we're really not. It's very easy to make humans prefer one model over another: you just need longer answers, for example, and they like it better, or models that are very confident in what they say. The answer can be completely wrong, but because it's said so confidently, people will like it better. So it's definitely not the ultimate evaluation, but it's still a very important tool in general. And this is the main thing about evaluation: human preferences are not correlated with automated benchmarks. Here you have a comparison with a ton of different benchmarks. The first one on top is the Chatbot Arena, and all the others are more or less automated benchmarks: you have MMLU, for example, GSM8K, which is a math evaluation, MATH, another math evaluation, and HumanEval, which is a code evaluation. You can see that the correlation is pretty poor in general.
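As a toy illustration of turning pairwise votes into a ranking, here is a classic Elo update; this is my own sketch of the intuition, not the arena's actual computation, which fits a more involved statistical model over all votes.

```python
# Toy Elo update from one pairwise vote between two models.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    r_winner += k * (1.0 - expected_win)  # winner gains less if it was favored
    r_loser -= k * (1.0 - expected_win)   # loser drops by the same amount
    return r_winner, r_loser

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# A human preferred model_a's answer over model_b's in one comparison.
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
```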
And this is something that you observe if you really try models: you can be really good on MMLU and really bad in terms of human preferences, and vice versa. This is why these two approaches are complementary. You really need both if you want to produce the best model; you cannot just focus on automated benchmarks or just on human preferences. A way to scale this a bit more is to get rid of the humans and replace them with LLMs. So now we have LLMs generating answers and LLMs judging other LLMs. This is much better to scale, and you can get a lot more samples. You can also use not one LLM as a judge but multiple LLMs as a jury, and this tends to produce more reliable and more robust evaluations. What's good is that this is how models are used: we rate the answers. The models can now also handle very complex tasks, so you can really automate the free-form stuff. It provides direct feedback. Okay, it's not human preferences, but it's very close; it's actually very correlated. Unfortunately, these judge LLMs also have their own biases, and they're actually very close to human biases, so you kind of run into the same issues, and some quality validation is needed on top of everything. You might still want some human evaluation to make sure that your LLM judges are well correlated with how humans would grade. Still, it's a very nice way to scale this, even though it's not as scalable as automated benchmarks, because you still need to run a ton of different comparisons to get meaningful results. And finally, you can create your own evaluation. I would say the main thing is to start early, even before fine-tuning; it's a bit like test-driven development. You need to know what you want to optimize, so you create the dataset that will measure exactly that. And based on that, you will probably iterate a lot. Don't assume that your dataset is the final version: it will probably evolve, because after seeing answers from your model, you will realize, oh, I forgot this part, or the model is hacking this question a bit. So you will probably need to iterate and have several versions of the evaluation dataset. You can combine different types of evaluation, as I said: automated benchmarks plus human or LLM judges together is a nice way of doing it. And finally, don't forget to compare your models with others, not just your own fine-tunes but also other models and other architectures, to get a good picture of how you score, how you compare with other models. Okay, future trends. The biggest trend right now, in early 2025, is called test-time compute scaling. It's a very simple idea: during training, we can train on a ton of data and throw a lot of compute at the problem to get better at it, and it works very well. But what if we throw compute at the problem during inference? A very easy way of doing it is: I have a question, for example a math question, and I ask my LLM for not one solution but multiple solutions, and I take the one that is the most frequent. If you do that, it works: you actually get better results on this math question. This technique is called majority voting (a small sketch follows below), and it works, but it's very naive, so there are better versions of it.
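Here is a minimal sketch of majority voting; `generate` and `extract_answer` are hypothetical helpers (sample one completion, pull out the final answer), not APIs from the talk.

```python
# A minimal sketch of majority voting over n sampled solutions.
from collections import Counter

def majority_vote(prompt, generate, extract_answer, n=16, temperature=0.8):
    # Sample n diverse solutions, then keep the most frequent final answer.
    answers = [extract_answer(generate(prompt, temperature=temperature)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```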
One of them is best-of-n. In best-of-n, you still generate several solutions, but instead of just picking the most frequent one, you use a reward model or a judge LLM to score every answer. Here you can see different scores, and we take the one with the highest score. Good job: in most cases, this is even better than majority voting. So that also works, but as you can see, we now have two models that need to run together to produce an answer; we're throwing more compute at the problem. You can do even better, and this variant uses process reward models, or PRMs. Process reward models do not just score the final solution, the entire answer: they score the steps in the answer. If you have a math question, you probably need to solve it in multiple steps, and process reward models allow you to score each step independently. The score that you get is, roughly, the probability that this step will lead you to the right answer in the end. So you can use an LLM to generate partial solutions, score the steps in those partial solutions with a process reward model, select the best steps, and expand them. You can do that iteratively, and you get better and better scores. I'm not going to detail all the variants; there are more complex versions of this. But you get the idea: we can use more large language models and more steps to produce better answers. An example of that is this question: what is the product of 6 times 7 when both numbers are in base 8? Give your answer in base 8. Here you see that the first step in attempt one was correct: let's multiply them together. The second step is wrong, because it works in base 10 and not in base 8, so the attempt is rejected. We skip to attempt two: now it has an extra correct step, but it has the same issue, so it's also rejected by the process reward model. And finally we have attempt four (skipping the third one), where everything is correct and the answer is good. So you see the mechanism: we iteratively improve the quality and get better and better steps. Yes? [audience question about the reward model] Yes. The reward model is something that comes from preference alignment. If you remember the diagram with PPO, you see that during training we actually have a reward model, and this reward model is specifically trained to take text as input and produce a number between zero and one to give you a score. So you can reuse the reward model that you've built for preference alignment in your test-time compute scaling. This is one way of doing it; there are other ways. You can just use a judge LLM, you can simply prompt a model to give a measure of the quality, but this is a nice way of reusing something that you've trained before. So thank you for this question. (A minimal best-of-n sketch follows below.)
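Here is a minimal sketch of best-of-n; `generate` and `score` are hypothetical helpers, with `score` standing in for a trained reward model that returns a value between 0 and 1.

```python
# A minimal sketch of best-of-n sampling with a reward model.
def best_of_n(prompt, generate, score, n=16, temperature=0.8):
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
    # The reward model, not answer frequency, picks the winner this time.
    return max(candidates, key=lambda answer: score(prompt, answer))
```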
And finally... [audience question about sampling: with a temperature of zero, wouldn't you get the same, correct answer every time?] The thing is, when you do majority voting, you're probably going to use different sampling parameters. It's not greedy decoding; it's something more like top-p decoding, and with that, you will see that the model can sometimes fail. This is also something you see with hallucinations: sometimes, because of non-greedy decoding, because there's some randomness introduced in your decoding, the model will just pick the wrong token, and the entire reasoning gets derailed because it picked the wrong token. This is kind of the intuition behind majority voting: we want to recover the performance that is lost through this decoding process. If you use a temperature of zero, you will have the same answer every time; we do non-greedy decoding because it actually improves the performance of the model. In general, for math and reasoning tasks, you want to use a temperature that is very low, which is closer to what you're describing. But for majority voting, I would set a higher temperature, because I want diverse answers; and because I have diverse answers, it makes sense that if an outcome is frequent, it's probably the right one. It's not foolproof, for sure; this is not the best method, but it gives the intuition of how we can do it. All right, some results. This is work done by people at Hugging Face. You can see that they used Llama 3.2 1B and 3B, and by using more and more generations per problem, they managed to outperform much bigger models like Llama 3.1 8B and Llama 3.1 70B. This really shows that you can exchange inference speed for output quality. This test-time compute scaling idea is really about that: if I have a lot of inference capacity, I can generate a lot of different answers, and I can directly transform that into output quality. Okay, that's it for me; this is the conclusion of this talk. The post-training loop is really about creating a dataset, so really spending a lot of time on data; it's a third of your time. Then training models using all the techniques that we've seen: supervised fine-tuning, preference alignment techniques, maybe a bit of merging if you can. And finally evaluating, which is another third of your time. Evaluation, as you've seen, is a bit messy; there are lots of different ways of doing it, but you can combine different techniques, different families, to get a good representation, a good overview of the performance of your model. Based on that, it will probably give you feedback about areas of improvement. You might see that the model is bad at math on the automated benchmarks, which means that you should probably generate more math data and train on it. So you see that it's a loop: every time you evaluate the model, you do not only evaluate the fine-tuning process. What you evaluate the most is actually the data: the quality of the data that you have, the mixture, the diversity, the complexity, and the accuracy of your tuning data. So that's the training loop, and it's an iterative process until you get better and better models. Thank you. Thank you, everyone. moderator: Awesome. Thank you, Maxime. Are there any questions? I can bring up microphones so we can go through them. audience: From the profitability perspective of LLM providers, which one has a better chance: B2C generalized models or B2B fine-tuned models? Thank you. speaker 1: Okay, so what's the metric, profitability? Okay, so it's a business question: is it better to have general B2C models versus fine-tuned B2B models? I think the B2C business model is really saturated right now, so it's going to be very difficult to compete against Google, against OpenAI, against Anthropic. I would say that it's probably better to do some fine-tuning and target business customers.
And in the best-case scenario, you can use that as leverage to then become a B2C company. audience: One of your earlier slides had RAG on top, in-context learning and RAG, then four branches, then eval. I had a question about that second option, adding knowledge. This is the intuition I'm trying to convey: if we add knowledge in the fine-tuning process, the model gets frozen at that point in time, right? And the knowledge, as we know, keeps evolving. So how do RAG and adding knowledge during fine-tuning compare? Or, after adding the knowledge, should we still keep doing RAG for future refreshes of the knowledge? speaker 1: Yeah, exactly. So the question is that the knowledge we add here during fine-tuning is frozen in time, while RAG allows you to dynamically retrieve up-to-date context. I think it's never fine-tuning versus RAG; it's RAG, or fine-tuning plus RAG, basically. So I would always recommend using both: if retrieval augmentation makes sense, it's better to do it with the fine-tuned model. audience: Hi. Typically when you do parameter-efficient fine-tuning, or fine-tune a large language model, you use input-output pairs. But if you don't have that data and you want to do reinforcement learning on the weights of the model, which type of algorithm do you use first? speaker 1: I didn't get the beginning of your question, could you repeat it? audience: If you don't have input-output pairs and you have to do reinforcement learning, which type of algorithm do you use, like proximal policy optimization or others, and which type of library? For example, maybe you want your language model to have a certain type of personality, and you can use, say, a BERT model or another model for classification, so you have a policy. Which type of algorithm would you use? speaker 1: It depends on the use case, but if it's really about something like changing the tone of the model, I would definitely recommend doing preference alignment, and within preference alignment, direct preference optimization. DPO is the algorithm that will work out of the box for you. Then the question is how to get the data to do that; this is the data pipeline that we talked about, and UltraFeedback is a good example of how to generate this kind of data. audience: Okay, thank you. It seems to me that the techniques you presented focus on single-turn interactions: we have an instruction and then we have an answer. Can we also do preference alignment with multi-turn conversations? speaker 1: Yes, absolutely. This can be completely extended to multi-turn conversations, and it's actually a good way of doing it, because we often see that what works with a single turn, for example instruction following, doesn't work with multi-turn. There's an entire evaluation set dedicated to multi-turn instruction following, and it shows that models that are good with a single turn might not be good with multi-turn. So I would recommend, when possible, always trying to get some multi-turn data instead of only single-turn. moderator: A number of people want to know if you'll do autographs at the end of today. speaker 1: Thank you. Yes.
audience: I have a question about the model merging example where you average the Finnish model and the English model. I'm just trying to get a sense of it: when you have these two different languages, the input and output spaces are very different. Do you train a separate tokenizer per language, or is the trunk shared? How does it usually work when you do this cross-language training? speaker 1: That's a good question. Tokenizers are created on calibration datasets, and those calibration datasets most of the time contain a ton of English. So the tokenizer of a model like Llama is mostly dedicated to English. You can have special tokens, and you can also have some Chinese, Japanese, Korean, Arabic, or Finnish, to cover other character sets. But in general, and in this example, I would say this is a bit out of scope: we are going to use the same model with the same tokenizer in the base and the instruct versions. Changing the tokenizer at this point is too late; you need to do it before pre-training. If you try to change the tokenizer during post-training, performance is going to drop dramatically, because the model has to relearn the entire mapping between the token IDs and the embeddings. So that's kind of a costly solution. But I agree that in general it's a lot better if your tokenizer is designed for the specific language, and you're going to really struggle if you have a small vocabulary size, a tokenizer with something like 20,000 different tokens. It's going to be very difficult to learn languages with completely different characters, like Chinese. moderator: Great. Let's do one more, Alexander. audience: Thank you. With techniques like best-of-n and majority voting, can we give a sense of uncertainty to the end user, for security- or safety-critical applications? If you're using majority voting or best-of-n with the scores, there can be a wide spread or a small spread, and that could give a sense of how certain the model is. Could we then feed that to the user when, say, a regulatory agency is asking about safety- or security-critical issues? speaker 1: Yeah, definitely. You can use it as a measure of uncertainty, but for your particular use case, you could also use a reward model that is dedicated to this task. It doesn't have to be general purpose; it can be dedicated to measuring toxicity, for example, and provide scores only on this metric. So this is a nice way of customizing the best-of-n technique to tailor it for specific use cases. moderator: Excellent. Okay, let's all thank Maxime one more time.