MIT | Liquid AI | Introduction to LLM Post-Training

This transcript covers the post-training process that large language models go through after pre-training. The speaker explains that pre-training only gives a model next-token prediction ability, while post-training turns the base model into a practical assistant that can understand instructions and answer questions through two steps: supervised fine-tuning and preference alignment. The talk distinguishes three kinds of fine-tuning (general-purpose, domain-specific, and task-specific) and explains how they differ in data volume and quality requirements. It also discusses when fine-tuning is appropriate, such as changing the tone of answers, injecting domain knowledge, model distillation, and optimizing for specific tasks, while stressing the importance of continuous evaluation and iteration. Finally, it summarizes the three pillars of building a high-quality dataset: accuracy, diversity, and complexity.

Media Details

Upload date
2025-05-18 16:19
Source
https://www.youtube.com/watch?v=_HfdncCbMOE
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: Hey, everyone. Thank you for being here. In this talk, we're going to talk about LLM post-training. We're going to see all the different steps in the LLM training world: we're going to talk about dataset generation, about the training algorithms, and finally about evaluation, which is very important. On top of that, I also want to talk a bit about future trends and about test-time compute scaling. A bit about me: I do post-training at Liquid AI. I'm also the author of the LLM Engineer's Handbook, if you're interested, and I've made a few contributions on the open-source side with blog posts, models, fine-tunes, and some tools that can be useful if you want to do things like model merging or evaluation. So first of all, let's define post-training. Post-training is what happens after pre-training, so it's quite easy. During pre-training, you have a huge volume of raw text, and the goal is to predict the next token in a sequence. You do this to create a base model. This is cool, but this base model is only able to predict the next token in a sequence; it's not able to answer questions or follow instructions, which is why we have post-training. That is the goal of post-training: turning a base model into a useful assistant. To do that, we have two main steps. The first one is supervised fine-tuning. During supervised fine-tuning, we give instructions and answers to the model as input: we ask it a question and we teach the model to answer that question, and we do it a lot. We teach the model a structure, a chat template, that we're going to talk about later. This is what the model learns during this stage, and the result is a model that is able to follow instructions and answer questions. But we can do a bit better, because if you use it, it might not give you the kind of answers that you want. So the next step is called preference alignment. During preference alignment, we give preferences to the model: not only one question and one answer, but two answers. One is the chosen answer, which is how we want the model to behave, and the other one is the rejected answer, a contrastive example that tells the model not to output this kind of text. So this is what we do in terms of training during post-training, and we will see the algorithms a bit later. In terms of datasets, as you might have noticed, during the pre-training phase we have a lot of samples, trillions of tokens, and the datasets are really, really large. This is not the case with post-training: the datasets are a lot smaller, and we focus more on quality. Here I wanted to give you the different types of fine-tuning that you might encounter. The first type, in green, is general purpose. With general-purpose fine-tuning, the goal is to create a general assistant, a chatbot that is able to answer any type of question. This is typically what you have if you use ChatGPT: it's a very general-purpose assistant, capable of doing pretty much everything. But you might want to take a model, for example an open-source model, and fine-tune it for a specific use case. You might want to do it for a specific domain, like the medical, legal, or financial domain, or do a language-specific fine-tune.
Here the idea is that we're going to embed more knowledge into the model so it performs better in this specific domain. If you do that, you don't need a million or more samples; you probably need a dataset that is not as big. And there's a third type, which is task-specific fine-tuning. With task-specific fine-tuning, the goal is to really learn one function. It's a very, very narrow type of fine-tuning where you want the model to learn, for example, how to become better at summarization or at being a spell checker, and in this case you need even fewer samples, so it's a lot easier to do. Note that all of these are rough estimates, because it depends on multiple dimensions. For example, the model size is very important: the bigger the model, the fewer samples you need. It also depends on the complexity of the task you want to achieve in a given domain. For example, if the base model does not know a language at all, if it hasn't been trained on that language, fine-tuning will probably not help you a lot, because you don't have enough tokens to properly train the model in this language. Okay, so when should we fine-tune a model? Here I'm talking about domain-specific and task-specific fine-tuning. There are four main areas, four main points, where it might make sense for you to fine-tune a model. First of all, I would say always start with in-context learning and RAG pipelines, because it's just a lot easier to test. If that doesn't provide enough value, if it's not good enough for you in terms of quality of the answers, but also possibly in terms of latency or inference speed (there are a lot of different dimensions you can consider), then fine-tuning can be an option. The first case is when you want to change the tone or the format of the answers. For example, if you are trying to make an assistant to write emails, maybe you do not like the tone that the model has. In fact, I really dislike the tone that ChatGPT has when you try to write an email with it. This is something that you might want to change through fine-tuning. Then, you might want to add knowledge, for example with domain-specific fine-tunes. This is what we talked about here. I'm saying that it's superficial because you cannot learn a new language with it, but you can, for example, teach the model facts about yourself or about your company. So there's still potential for adding knowledge during the fine-tuning process. You can also try to distill a very large model, like GPT-4, into a much smaller one to reduce the cost and the latency and to increase the inference speed. And finally, in terms of output quality, you might want a very specific, narrow model, for example to make diagrams. When you have such a super-specific task, it's possible to really outperform frontier models, because they haven't been trained specifically on your task. When this is done, and I'm talking here about evaluation, it's very important to always evaluate your fine-tunes, to make sure that what you are trying to do actually works, and also to be able to iterate, because the first model that you fine-tune will probably not be the last one. You might enter a kind of training loop where you evaluate, get information about what works and what doesn't, fix your training data, and go again. There are other reasons why you might want to fine-tune a model.
Besides these technical reasons, if you look at why companies care about open source and fine-tuning, you see that it's mostly about control and customizability. Control, because they might not want to just send their data to APIs, to the cloud; and customizability, because they might want to change the tone and the format of the answers and create their own model. So there are a lot of different reasons why you might want to fine-tune, even though maybe a frontier model is all you need, or maybe you already have Llama and it works well for you. There are also more political reasons why you might want to fine-tune your own model. Okay, let's talk about datasets. The most important question when we talk about datasets is: what is good data? I think this is the main question throughout the entire post-training process. Good data can be summarized in three dimensions. First, you have accuracy. Accuracy is quite easy: is my answer relevant to the question? Am I giving the correct answer? Most of the time it's quite easy to check that the answer is correct, but with very complex questions it can be a lot harder. For example, if you are dealing with code, you might want to run unit tests to make sure that your answer is correct. If you are answering complex math questions, you might want to use a solver to double-check the answer. So there are all these different ways of ensuring the accuracy of the data that you can use when you generate it. The second dimension is diversity. Diversity is not as simple to pin down, because it's really about having samples that are very different from each other and that try to cover the entire range of possible interactions that users will have with the model. You really want to cover as much as possible. The problem that was mentioned with synthetic data, for example, is that when you use too much of it, your diversity collapses. This is why it's good to have a bit of synthetic data, because just a bit can actually improve the diversity, but with too much of it the diversity collapses and the data becomes lower quality. Finally, you have complexity. Complexity is really about challenging the model to answer tough questions. It's about giving answers that are long and detailed, with chains of thought, step-by-step reasoning, and that kind of thing. So this summarizes what a good dataset is, and it might be the most important thing to remember: these three dimensions can be applied throughout the entire pipeline. For example, with diversity, you see here real-life conversations. Real-life conversations are really varied, because people ask questions about anything. You see here, with a 2D representation of the embeddings of the datasets, that the three datasets on top are really varied and cover pretty much the entire spectrum. But this is not the case if you zoom in on the math dataset or the two code datasets here: you see that they really sit in a single region. So if you want to create a math LLM, then okay, maybe math is all you need. But if you want to create a general-purpose LLM, you cannot just have math and code; you will need more to increase the diversity. Then there's complexity. Here's a simple example I made. For low complexity, my prompt was: "How tall is the Eiffel Tower?" A very simple question.
You can just write the answer in a few words, so the complexity is really, really low. But then I evolved it, and this is actually something pretty neat, because I evolved it using a framework called AutoEvol. Basically, you just ask an LLM to rephrase your prompt to increase its complexity, and this is what it produced. Here the complexity is not crazy high either, but it's definitely higher, because you see that it has different layers, it has a lot more details, and we're asking for several lenses and not just a single one. So this is a good way to increase the complexity of your samples. In terms of data formats, I mentioned earlier that with instruction datasets you have questions and answers, instructions and outputs. You can also have a system prompt to guide the model into adopting the style that you want, for example "You are a mathematician from MIT." With preference data, we have a chosen and a rejected answer, and the goal is really to teach the model to maximize the margin in token probabilities between the chosen answer and the rejected answer. We really want to maximize this gap; this is what is trained during preference alignment. Here's an example of a simple data generation pipeline. You can see that we use seed data as input. Seed data can be pretty much anything: it can be raw text, it can be instructions and answers, it can be just questions from users. Then you can refine it by generating instructions, generating answers, or generating both of them. There are a lot of different techniques to do that. For example, if you have raw text, the problem is that you do not have questions to go with it; you just have the answers. So you can use back-translation to ask a language model to write the question that corresponds to this raw text. Then you have scoring and filtering. Here you can use different techniques, like heuristics, but you can also use other LLMs as judges to measure the quality of your answers and filter out the bad samples. On top of that, you have classic data deduplication and decontamination, to make sure that you're not training on the test set, and data filtering, because, for example, you might want to exclude some keywords. An example of this is instruction following. Instruction following is the category that is focused on actually following constraints. You have an example here where my constraint is: "Your entire response should be in English and all lowercase." So we have two constraints, and we want the model to really follow these constraints in its answer. To create a dataset that teaches the model this kind of behavior, we start with some prompts and some constraints; we can just append the constraints after the prompt, like here, and then query an LLM and ask it for an answer. But we're not sure that the LLM will actually follow the constraints very well, so we can run some tests (a small sketch follows below). It's very easy to check that the answer is in English: you can use a library like langdetect to make sure that it's in the language that you want, and you can also easily check that it's all lowercase. So you can run these unit tests to make sure that the constraints are really followed. Then you can do decontamination: there's an evaluation set called IFEval, and you want to make sure that you're not generating samples that are too close to IFEval by mistake. Finally, you can do some keyword exclusion to remove some of the samples.
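As a rough illustration of the constraint checks described above (the helper name and the toy samples are made up for the example; langdetect is a real library, but any language detector would do):

```python
from langdetect import detect

def satisfies_constraints(answer: str) -> bool:
    """Check the two constraints: the answer is in English and all lowercase."""
    try:
        is_english = detect(answer) == "en"
    except Exception:  # langdetect raises on empty or ambiguous input
        is_english = False
    return is_english and answer == answer.lower()

# Toy generated samples; in a real pipeline these come from the LLM you queried.
samples = [
    {"prompt": "Describe a sunset. Your entire response should be in English and all lowercase.",
     "answer": "the sky slowly turns orange and pink as the sun sinks below the horizon."},
    {"prompt": "Describe a sunset. Your entire response should be in English and all lowercase.",
     "answer": "The sky turns ORANGE."},  # violates the lowercase constraint
]

filtered = [s for s in samples if satisfies_constraints(s["answer"])]
print(len(filtered))  # 1
```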
Another example is UltraFeedback. The previous pipeline was for instruction datasets; this one is for preference datasets. With preference datasets, we want to generate not one answer but two. To do that, we have prompts once again, but instead of querying a single LLM, we query a lot of different LLMs, and we use a judge LLM to rate them. We score each answer, take the top one as the chosen answer, and take the worst one as the rejected answer. Then you can remove the duplicates and remove short answers, but that's the idea behind it, and it's a really good way of creating preference datasets. As a case study, here's Open PerfectBlend. This is an open-source version of the Perfect Blend dataset, something I've made by combining different datasets that are available on Hugging Face. I think the most interesting part is the categories, to give you an idea of the mixture that you want: you have a lot of math, a lot of code, and also a bit of chat data, plus the instruction-following datasets that we talked about. You can also see on the right a breakdown with all the individual datasets that compose Open PerfectBlend. Here we're talking about general-purpose fine-tuning: we want a model that is able to do a lot of things, and this is the kind of mixture, the kind of proportions, that you're targeting. Once this is all done, you have your dataset, you've built your mixture, and you're quite happy with it. The final step is to apply a chat template to it. On the left, you can see a storage format; here it's called Alpaca. It doesn't really matter, but it's a way of storing the data, the instructions and the outputs. What we're going to do is map a structure onto it so the model knows who's speaking. Here it's very simple: you have a special token, <|im_start|>, that marks the start of a message; the role of the speaker is "system", so it's a system instruction; and then it stops at <|im_end|>. We start again with the user, who gives the user prompt, and finally with the assistant, who gives the response from the LLM. If you're doing an entire conversation, then you can repeat user, assistant, user, assistant, and so on. This is exactly what the model is going to be trained on, and we do that especially during supervised fine-tuning to make sure that it learns this template. The structure is really, really important; it is what makes the difference between a base model that is only able to autocomplete text and a post-trained model that is able to actually follow instructions and have entire conversations.
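To make the template concrete, here is a minimal sketch using the Hugging Face tokenizer API; the model name is just an example of a model whose chat template uses <|im_start|> / <|im_end|> markers, and any chat model would work the same way:

```python
from transformers import AutoTokenizer

# Example model with a ChatML-style template; swap in whichever chat model you use.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a mathematician from MIT."},
    {"role": "user", "content": "How tall is the Eiffel Tower?"},
    {"role": "assistant", "content": "It is about 330 meters tall, including its antennas."},
]

# Render the conversation exactly as the model will see it during SFT,
# with the special tokens marking who is speaking.
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
```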
Okay, that's it for data. Now we can talk about the training algorithms that we use during post-training. First of all, I want to recommend three libraries for fine-tuning. The first one is TRL, from Hugging Face. It's built on top of Transformers and it's a very good library: it gives you a lot of different algorithms and it's always very up to date with the research, so you get the fancy new algorithms that are published every week or month. Then you have Axolotl. It's built on top of TRL and it offers some nice features: for example, it can preprocess your datasets for you and it abstracts a lot of things away, so it's more user friendly. What I really like about it is that every training run is a YAML configuration file, so it's very easy to share your configuration files with other people, and also to spy on what other people do to get the right parameters and get inspired by their work. So it's a really cool one. And finally, you have Unsloth. You might be familiar with it; this one is really, really good if you want to do fine-tuning with only one GPU. It's super efficient, and it also has a lot of nice utilities: for example, it can handle quantization for you, which is really nice. In terms of techniques, we have three main techniques for SFT. The first one is full fine-tuning. Full fine-tuning is basically like pre-training: we load the model in full precision and we retrain every parameter on the fine-tuning dataset, the instruction dataset. This is really good if you want to maximize the quality of the model, but it requires a lot of VRAM, because you are loading the entire model and then training the entire model. If you add up all the optimizer states and all the activations, plus the size of the model itself, it will probably require an entire node in a cluster. So this is nice, but also very costly. A more approachable solution is to use parameter-efficient techniques like LoRA. LoRA is really nice because instead of retraining the entire model, you just load the model and add adapters, the matrices A and B, and instead of training the model, you only train these matrices. What's really nice about it is that instead of retraining 100% of the parameters, you only train something like 0.5% of them. So it's a lot faster and a lot less costly in terms of VRAM and hardware in general. The only issue is that you still need to load the entire model in memory, and if it's a big model, that's going to be very costly. So if you are really hardware constrained and you don't have enough GPUs to do the LoRA fine-tune, you can use QLoRA. QLoRA is a variant of LoRA where you don't load the model in full precision, but a quantized version of it in 4-bit precision. Then you apply the adapters and do the LoRA fine-tuning. The idea is that loading the model is now not as costly, so it's really good if you cannot do LoRA. The downside is that it's also slower and it will degrade performance a little bit, a few percent. So if you can afford LoRA fine-tuning, I would recommend LoRA fine-tuning, and most of the time full fine-tuning is actually a bit too much: you probably don't need it unless you are really training models because you're an LLM company.
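As a sketch of what a QLoRA-style SFT run looks like with TRL and PEFT (the model and dataset names are examples, and the exact argument names vary a bit between TRL versions):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.1-8B"  # example base model

# QLoRA: load the base model quantized to 4 bits to save VRAM...
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ...and train only small low-rank adapters instead of the full model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Any instruction dataset in a chat format works; this one is just an example.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1%]")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="llama-3.1-8b-qlora-sft"),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```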
Then you have preference alignment techniques. This is the second stage of post-training, and there are a lot of them, really a lot: over 100 different techniques under this umbrella. So I want you to only remember two of them. The first one is PPO, proximal policy optimization. This is the traditional, original algorithm for preference alignment. The problem with PPO, as you can see on the diagram here, is that you need three different models. We complained earlier about loading one model, but here you need to load three: the frozen reference model, the trained model, and the reward model. The idea is that you feed some text, a question, into the trained model, and it generates an answer. The answer is scored by the reward model, and to make sure that your trained model is not deviating too much from the initial policy, from the model before it started training, you also have a KL-divergence penalty to keep it from diverging too much. This is good in terms of quality, it will maximize quality, but it's very complex and very costly to run, so it's not really recommended. Instead, what I would recommend is direct preference optimization, or DPO. Here it's less costly because you only have two models to load, and actually, if you use LoRA, you can load only one model with adapters for each of them. So it's faster and cheaper to use, but the quality is slightly lower. In practice this is not really a problem, and I would recommend using DPO over PPO, unless you are OpenAI or you are in a very particular situation with unlimited resources. Plus, DPO is a lot more user friendly; it's a lot easier to tune the parameters. Speaking of training parameters, here is a list of the main ones that you might want to tweak during fine-tuning (a sketch putting them together follows below). The first and most important one is the learning rate. The learning rate corresponds to how much the parameters are updated during training, and I give some common values to give you an estimate. The next two parameters are really connected: the batch size and the max length. They directly impact the VRAM that you are going to use, because they determine the size of the input that you feed to the model. The batch size is the number of examples that you feed in one pass; the more you feed, obviously, the more VRAM you consume. The max length is the length of each of these samples, so once again, if the length is high, you're going to consume more VRAM. Those parameters are really hardware constrained. With the batch size, you can do something a bit tricky: instead of updating the parameters after one pass, you accumulate gradients over several forward passes and then update your weights. This is why we distinguish the effective batch size from the real batch size. After that, we have epochs, which correspond to the number of passes through the entire training set. This is quite easy to set, probably between 3 and 5, and you can monitor it; we're going to see how to monitor the experiments afterwards. The two final ones are also very easy to select. The optimizer is the algorithm used to update the parameters, and I would just recommend AdamW. You can explore a bit more, but AdamW is really good, a very strong baseline. And about the attention mechanism, Flash Attention is pretty much all you need. Flash Attention is a very efficient implementation of the attention mechanism in the transformer architecture that makes it a lot faster, especially for processing long sequence lengths, so it's the one I would always recommend if you can use it.
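Putting the DPO recommendation and these hyperparameters together, a minimal TRL-style sketch might look like this (checkpoint and dataset names are placeholders, the values are illustrative, and argument names differ slightly across TRL versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "my-org/llama-3.1-8b-sft"  # hypothetical SFT checkpoint from the previous stage
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A preference dataset with "prompt", "chosen" and "rejected" columns (placeholder name).
dataset = load_dataset("my-org/preference-pairs", split="train")

args = DPOConfig(
    output_dir="llama-3.1-8b-dpo",
    learning_rate=5e-6,              # small learning rate for preference alignment
    per_device_train_batch_size=2,   # real batch size, limited by VRAM
    gradient_accumulation_steps=8,   # effective batch size = 2 * 8 = 16
    num_train_epochs=1,
    optim="adamw_torch",             # AdamW as a strong baseline
    beta=0.1,                        # how strongly DPO penalizes drifting from the reference
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```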
To monitor experiments, here's an example with the learning rate. You can see a bad learning rate and a good learning rate. The bad learning rate is the one with the loss spike: we start with a smooth descent, and then immediately after that we see the loss spike. This is probably a problem with the learning rate; I would say it's too high. If you reduce it a bit, it will probably give you something more like the curve on the right, which is a lot smoother. You see that there's a little bump at the beginning, but it's not too important and I wouldn't mind it too much. There are other metrics that you can also monitor during training. On top there's the train loss, but at the bottom there's also the validation loss: if you have a validation dataset, you can monitor that too and make sure that you're not overfitting. And there's also the gradient norm. The gradient norm is really the magnitude of the updates that you apply, and you don't want to see too many spikes, because that might be a problem with your data; you also don't want it to rise too much. Here my gradient norm is okay, I don't think it's a problem, because the eval loss and the training loss look good. But if you see that kind of behavior, it might be a sign that the training is not going too well. What is a validation set? You have a training dataset that is just for training, and you save a small part, a small dataset, just for validation purposes. You can split it as you would in traditional machine learning, except that here we're not going to do an 80/20 split like we would traditionally, because the training set is so large that it wouldn't make sense; you can save something like 1% or 2% of it for validation. Now that we have trained our model, you can have different scenarios. For example, you've trained a lot of different checkpoints, and now you have ten models that you've trained. Or you have one model, but it's a Llama model, and there are lots of Llama fine-tunes available in open source on Hugging Face. What can you do with that? An answer to this question is model merging. Model merging is the idea that you can simply average, fuse, combine the parameters of different models together to make an even stronger model. This was a bit of a joke at the beginning, and now everybody uses it; every LLM company applies model merging. So I guess I shouldn't make fun of it now, but it's still a funny idea that you can take kind of random models and get a stronger one in the end. I really like model merging, and it's a really neat technique that is maybe still a bit underrated, but, as I say, everybody applies it now. Yes? [Audience question, inaudible.] Yes, exactly, we're going to see that on this slide: you can just add more models and average them all together. But there are techniques that are a bit more advanced in model merging. One is SLERP. SLERP stands for spherical linear interpolation: instead of doing a normal linear interpolation, you interpolate along a sphere instead of a straight line. We see that it's generally better, it produces better merges, it's very popular in the open-source community, and it's quite reliable. The issue is that you can only merge two models together. You can see on the right one model that I've made using this technique. The second family of techniques, which I find more interesting, is DARE-TIES or DARE-linear. DARE is a technique that randomly prunes and rescales the parameters of your source models. TIES is a technique that keeps the most significant parameters in your source models and adds a sign consensus, because it can happen that a parameter in model one is positive and the same parameter in model two is negative, so if you combine them you get zero, and that's not the best way of combining them.
So this sign consensus addresses that issue. I think there was another question, yes? The question is: do we need models with the same size, or can we merge models with different sizes? In theory you could try to merge models with different sizes, but in practice it doesn't work. All the merging techniques I'm talking about are for models with the same size and the same architecture too; this is very important. So, for example, we only merge Llamas together, and not Llama 2 with Llama 3: it would only be Llama 2 with Llama 2 and Llama 3 with Llama 3. Right, yes? The next question is: what happens if we merge a model in full precision with a quantized model? I would not recommend doing that, because of course you will get more quality if you only merge models in full precision. I don't think I've tried doing that, to be honest, but obviously the quantized model has some performance degradation, so the merged model will also suffer from it. It's better to only do it in full precision. All right, yes? We're going to see that with an example, but yes, indeed: if you have different models, you can assign different weights to them. A common way of doing it is, for example, you have one model that is really good, and you know it's really good because you ran some evals and you know the scores, and you have other models that might not be as good, so you want to prioritize the first one compared to the others. This is something that I've done to create the second model, Daredevil. If you look at the family tree of the model: I was talking about merging models together, but you can also merge merges together, and you can iteratively obtain crazy family trees of models merged with each other. Here you see the family tree of Daredevil-8B. To make it, you can see the configuration that I've used on the left, with the different models. It's a configuration written in YAML with MergeKit, and you have parameters like density and weight (a sketch of such a configuration follows below). Density tunes the proportion of parameters that you retain, because with TIES you don't want to keep all of them, and weight is really the weight of each model in the combination. This is a bit of a science, or more like an alchemy kind of thing: choosing the right weights and the right parameters to obtain the best models. It's something you learn if you do model merging quite a lot, to get ideas about what is going to work, but there's no real science behind it. There are published papers about how to do it automatically that are a bit more scientific, but if you use these methods, I would recommend trying different parameters, testing a few different merges, and running evaluations on them to see if it works. On the right, you can see a few benchmarks, and what's really funny is that Daredevil, the merged model, is better than the source models on these benchmarks. So the merge was really successful; it's really better than the source models, and this is what model merging is about.
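As an illustration of what such a merge configuration looks like, here is a sketch that writes a weighted DARE-TIES MergeKit config from Python; the model names are placeholders, the values are illustrative, and the exact schema and CLI may differ between MergeKit versions:

```python
# Sketch of a weighted DARE-TIES merge with MergeKit, driven from Python.
# density controls how many parameters are kept; weight controls each model's contribution.
config = """\
models:
  - model: org-a/llama-3-8b-finetune-1
    parameters:
      density: 0.6
      weight: 0.6   # prioritize the model that scored best on your evals
  - model: org-b/llama-3-8b-finetune-2
    parameters:
      density: 0.6
      weight: 0.4
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
dtype: bfloat16
"""

with open("merge.yaml", "w", encoding="utf-8") as f:
    f.write(config)

# The merge itself is then run with MergeKit's CLI, for example:
#   mergekit-yaml merge.yaml ./merged-model
```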
And to apply that in a more concrete example, let's say you want to build a language-specific model. By language-specific here, I mean something that is not too close to English. German, for example, is too close, and models are already good in German, so it doesn't make sense. But something like Finnish: models are not really good in Finnish in general. So let's say you want to build this Finnish model. A recipe you can use is to do continued pre-training on Finnish raw text. You do it for billions of tokens, somewhere between 5 and 100 billion, which really depends on how much your base model already knows about Finnish. Then you do supervised fine-tuning and preference alignment, as we talked about, once again in Finnish. And if you run the evals, what you're going to see is that your model is now really good in Finnish. Good job. But it's really bad at everything else. This is a common problem: when people try to build language-specific models, they end up with a model that is only good in that language. To fix that, model merging can really help, because you can take the model that you have trained in Finnish and merge it with the general-purpose instruct model. For example, if you took Llama 3 as the base model, you can take Llama 3 Instruct as the instruct model, and by merging them together, you get a model that is not only good in Finnish but also good at everything else. This is one of the powers of model merging: it lets you add these different skills together without compromising the rest. So this is a case study of why it really works. Yes? [Audience question, partly inaudible, about whether the same idea applies to classical machine learning or other modalities, for example object detection with CNNs.] Okay, I don't think you can use a large language model for object detection, because that is another modality, and here we're talking about the same architecture and the same size of models. So this is really only for text. You can do the same thing with VLMs, for example, and it will provide better performance for vision-language models, but you would still need to have the core skills. It's not a magical technique; it will not give you completely new abilities. It's about combining abilities from different models with the same architecture. All right, let's talk about evaluation. Evaluation is the main problem in the LLM world: it doesn't work very well, and we don't really know what we're doing. But it's really important. I want to convince you that despite all its flaws it's really, really important, because all this training, all this post-training, is about optimizing something: it's about optimizing models to become better, and evaluation is how we measure whether they are better or not. If we don't have the right evaluation, we are optimizing for the wrong thing. Maybe we don't have perfect evaluations, for sure we don't, but we have something, and we will see how to combine different evaluations to get a good idea of how good models really are. The first type of evaluation is automated benchmarks. If you know about MMLU, that's the classic example, and the Open LLM Leaderboard also uses automated benchmarks. The idea is that you have a dataset, samples, and a metric. For example, with MMLU, the samples are questions about anything: it can be biology, English, or math. You have four different options, A, B, C, and D, and the metric is just accuracy. So if the right answer was A and the model says B, that's a zero, and we measure the performance of the model based on the accuracy over the entire dataset.
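A minimal sketch of this scoring logic (real harnesses such as lm-evaluation-harness add prompt formatting and log-likelihood scoring of each option, but the metric itself is just accuracy; the letters below are made up):

```python
# Toy MMLU-style scoring: compare the letter the model picked with the gold letter.
predictions = ["A", "C", "B", "D", "A"]   # letters produced by the model
references  = ["A", "B", "B", "D", "C"]   # gold answers from the benchmark

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"accuracy = {accuracy:.1%}")  # 60.0%
```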
This approach has really good properties. It's really scalable and cost-effective, and you can design benchmarks for very precise tasks, like math or even just algebra; it can be very, very precise. It's also reproducible, which is really nice, because people can check that your evaluations were correct. But the main issue I have is that this is not how models are used at all. We chat with models; we do not check whether they output the right letter when we ask them a question. And it can be difficult to evaluate questions that are more free-form, more complex than just four options. Because of that, automated benchmarks give you an idea of how the model performs in math or in code, but this is really not connected with how people use it in the real world. You can also use more focused benchmarks. I talked about MMLU and very generic, general-purpose benchmarks, but you also have task-specific benchmarks, like enterprise scenarios, and domain-specific benchmarks. For Finnish, for example, we could create our own benchmark evaluation suite, and the same for code, for the medical domain, and so on. Another way of evaluating models is simply to ask humans to rate the answers. A popular way is to use an arena. If you know the Chatbot Arena: you type your prompt, then you get two answers from two anonymous models, and you just decide which one was better. You really rate the answers of the two models. Keep doing that with a ton of models, and you can use these comparisons to calculate an Elo score; this is how the Chatbot Arena works. This is nice because you can be very precise in how you want humans to rate the models. You can ask them to focus only on toxicity, for example, which is a popular way of doing red-teaming. It has a lot of flexibility because of that. There's also a lot less risk of data contamination, which exists with automated benchmarks. And on top of that, it measures human preferences directly, so if you use these scores, you're really optimizing for human preferences, which is kind of the goal of post-training. The problem is that it's really difficult to scale: it's obviously very costly, it's very time-consuming, and humans are incredibly biased too. We like to think that we are the ultimate evaluators, but we're really not. It's very easy to make humans prefer one model over another: you just need longer answers, for example, and they like them better, or you need models that are very confident in what they say. The answer can be completely wrong, but because it's said so confidently, people like it better. So it's definitely not the ultimate evaluation, but it's still a very important tool in general. And this is the main thing about evaluation: human preferences are not well correlated with automated benchmarks. Here you have a comparison with a ton of different benchmarks. The first one on top is the Chatbot Arena, and all the others are more or less automated benchmarks: MMLU, for example, GSM8K, which is a math evaluation, MATH, also a math evaluation, and HumanEval, which is a code evaluation. You see that the correlation is pretty poor in general, and this is something that you observe.
If you really try models, you can be really good on MMLU and really bad in terms of human preferences, and vice versa. This is why these two approaches are complementary: you really need both if you want to produce the best model, and you cannot just focus on automated benchmarks or just on human preferences. A way to scale this a bit more is to get rid of the humans and replace them with LLMs. So now we have LLMs generating answers and LLMs judging other LLMs. This is much easier to scale, and you can get a lot more samples. You can also use not one LLM as a judge but multiple LLMs as a jury, and this tends to produce more reliable and more robust evaluations (a small sketch follows at the end of this passage). What's good is that this is how models are actually used: we rate the answers. These judge models can also handle very complex tasks, so you can really automate the free-form stuff. It provides direct feedback. Okay, it's not human preferences, but it's very close; it's actually very correlated. Unfortunately, these judge LLMs also have their own biases, and they're actually very close to human biases, so you kind of run into the same issues, and some quality validation is needed on top of everything. You might still want some human evaluation to make sure that your LLM judges remain well correlated with how humans would grade. Still, it's a very nice way to scale this, even though it's not as scalable as automated benchmarks, because you still need to run a ton of different comparisons to get meaningful results. And finally, you can create your own evaluation. I would say the main thing is to start early, even before fine-tuning; it's a bit like test-driven development. You need to know what you want to optimize, so you create the dataset that will evaluate this thing. Based on that, you will probably iterate a lot. Don't think that your dataset is the final version: it will probably evolve, because after seeing answers from your model you will realize, oh, I forgot this part, or the model is hacking this question a bit. So you will probably need to iterate and have several versions of the evaluation dataset. You can combine different types of evaluation, as I said: automated benchmarks and human or LLM judges together; this is a nice way of doing it. And finally, don't forget to compare your models with others, not just your own fine-tunes but also other models and other architectures; it's really nice to get a good picture of how you score and how you compare with other models.
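A hedged sketch of the LLM-as-a-judge and jury ideas (the prompt, the helper names, and the model names are illustrative; any chat-completions-style client works the same way):

```python
from statistics import mean
from openai import OpenAI  # example client; any LLM API with chat completions works

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator. Rate the answer below from 1 to 10 for
helpfulness and correctness. Reply with a single integer and nothing else.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Single LLM judge: ask one model for a score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

def jury(question: str, answer: str, models=("gpt-4o-mini", "gpt-4o")) -> float:
    """LLM jury: average the scores of several judge models for a more robust rating."""
    return mean(judge(question, answer, m) for m in models)
```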
Okay, future trends. The biggest trend right now, in early 2025, is called test-time compute scaling, and it's a very simple idea. During training, we can train on a ton of data and throw a lot of compute at the problem to get better at it, and it works very well. But what if we try to throw compute at the problem during inference? A very easy way of doing it is: I have a question, for example a math question, and I ask my LLM for not one solution but multiple solutions, and I take the one that is the most frequent. If you do that, it works: you actually get better results on this math question. This technique is called majority voting, and it works, but it's very naive, so there are better versions of it. One of them is best-of-N. In best-of-N, you still generate different solutions, but instead of just picking the most frequent one, you use a reward model or a judge LLM to score every answer (a minimal sketch follows below). Here you can see different scores, and we take the one with the highest score. In most cases this is even better than majority voting, so that also works. But as you can see, now we have two models that need to run together to produce an answer; we're throwing more compute at the problem. You can do even better, and this version uses process reward models, or PRMs. Process reward models do not just score the final solution, the entire answer; they score the steps in the answer. If you have a math question, you probably need to solve it in multiple steps. Process reward models allow you to score each step independently, and the score that you get is pretty much the probability that this step will lead to the right answer in the end. So you can use an LLM to generate partial solutions, score the steps of your partial solutions with a process reward model, select the best steps, and expand them. You can do that iteratively, and you get better and better scores. I'm not going to detail all the variants; there are more complex versions. But you get the idea: we can just use more large language models and do more steps to produce better answers. An example is this question: what is the product of 6 times 7 when both numbers are in base 8? Give your answer in base 8. Here you see that the first step of attempt one was correct: let's multiply them together. The second step is wrong, because it's in base 10 and not in base 8, so the attempt is rejected. We skip to attempt two: it has an extra correct step, but it has the same issue and it's also rejected by the process reward model. And finally we have attempt four, skipping the third one, where all the steps are correct and the answer is good. So you see the mechanism: we iteratively improve the quality, and we iteratively get better and better steps. Yes? [Audience question: what is a reward model?] A reward model is something that comes from preference alignment. If you remember the diagram with PPO, you see that during training we actually have a reward model, and this reward model is specifically trained to take text as input and produce a number between zero and one, a score. So you can reuse the reward model that you've built for preference alignment in your test-time compute scaling. That's one way of doing it; there are other ways. You can just use a judge LLM, you can just prompt a model to give a measure of the quality, but this is a nice way of reusing something that you've trained before. So thank you for this question.
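A minimal sketch of majority voting and best-of-N over already-sampled answers (the reward function here is a trivial stand-in for a real reward model or judge LLM):

```python
from collections import Counter
from typing import Callable

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer among the sampled generations."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], reward_fn: Callable[[str], float]) -> str:
    """Pick the answer with the highest reward-model (or judge) score."""
    return max(answers, key=reward_fn)

# Five sampled answers to "6 x 7 in base 8" (the correct one is 52 in base 8).
samples = ["52", "42", "52", "52", "44"]
print(majority_vote(samples))                                   # 52
print(best_of_n(samples, lambda a: 1.0 if a == "52" else 0.0))  # 52
```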
Yes? [Audience question, partly inaudible, about sampling: if you always decode the same way, for example greedily, wouldn't you get the same answer every time?] The thing is, when you do majority voting, you're probably going to use different sampling parameters, so it's not greedy decoding; it's something more like top-p decoding. You will see that the model can sometimes fail. This is also something that you see with hallucinations: sometimes, because of the non-greedy decoding, because there's some randomness introduced in your decoding, the model will just pick the wrong token and the entire reasoning will be derailed because of it. And this is kind of the intuition behind majority voting: we want to recover the performance that is lost through this decoding process. If you use a temperature of zero, you will have the same answer every time. We do non-greedy decoding because it actually improves the performance of the model. In general, for math and reasoning tasks, you want to use a temperature that is very low, so that's closer to what you're describing. But for majority voting, I would set a higher temperature, because I want diverse answers, and because I have diverse answers, it makes sense that if an outcome is frequent, it's probably the right one. It's not foolproof, for sure; this is not the best method, but it gives the intuition of how we can do it. All right, the results. This is some work by people at Hugging Face, and you can see that they used Llama 3.2 1B and 3B, and by generating more and more solutions per problem, they managed to outperform much bigger models like Llama 3.1 8B and Llama 3.1 70B. This really shows that you can exchange inference speed for output quality, and the test-time compute scaling idea is really about that: if I have a lot of inference capacity, I can generate a lot of different answers, and I can turn that directly into output quality. Okay, that's it for me; this is the conclusion of the talk. The post-training loop is really about creating a dataset, so really spending a lot of time on data (it's a third of your time); then training models using all the techniques we've seen, supervised fine-tuning, preference alignment, maybe a bit of merging if you can; and finally evaluating them. Evaluation is another third of your time, and as you've seen, it's a bit messy: there are lots of different ways of doing it, but you can combine different techniques, different families, to get a good overview of the performance of your model. Based on that, it will give you feedback about areas of improvement. You might see that the model is bad at math on the automated benchmarks, which means that you should probably generate more math data and train on it. And you see that it's a loop: every time you evaluate the model, you do not only evaluate the fine-tuning process; what you evaluate the most is actually the data and its quality, the mixture, the diversity, the complexity, and the accuracy of your tuning data. So that's the training loop, and it's an iterative process until you get better and better models. Thank you, everyone. [Moderator] Awesome, thank you, Maxime. Are there any questions? I can bring microphones over so we can go through them. [Audience] From the profitability perspective, for LLM providers, which one has a better chance: B2C generalized models or B2B fine-tuned models? Thank you. Okay, so what's the metric? Which has a better chance in terms of profitability? Okay, so it's a business question: is it better to have general B2C models versus fine-tuned B2B models? I think that the B2C business model is really saturated right now, so it's going to be very difficult to compete against Google, against OpenAI, against Anthropic. I would say that it's probably better to do some fine-tuning and target business customers.
And in the best-case scenario, you can use that as leverage to then become a B2C company. [Audience] One of your earlier slides had RAG on top, in-context learning and RAG, then four branches, then eval. I had a question about the second option, which was adding knowledge. This is the intuition I'm trying to get at: if we add knowledge in the fine-tuning process, the model gets frozen at that point in time, right? And the knowledge, as we know, will keep evolving. So how do RAG and adding knowledge during fine-tuning compare? Or, after adding the knowledge, should we still keep doing RAG for future refreshes of the knowledge? Yeah, exactly. So the question is: the knowledge that we add during fine-tuning is frozen in time, while RAG allows you to dynamically retrieve up-to-date context. And I think it is never fine-tuning versus RAG; it's RAG, or fine-tuning plus RAG, basically. So I would always recommend using both: if retrieval augmentation makes sense, it's better to do it with the fine-tuned model. [Audience] Hi. Typically when you fine-tune a large language model, you use input-output pairs. But if you don't have that data and you want to do reinforcement learning to update the weights of the model, which type of algorithm would you use? Sorry, I didn't get the beginning of your question, could you repeat it? [Audience] If you don't have input-output pairs and you have to do reinforcement learning, which type of algorithm would you use, like proximal policy optimization or something else, and which library? For example, maybe you want your large language model to have a certain type of personality, and you could use a BERT or another model as a classifier, so you have a signal for the policy. Which algorithm would you use? Yeah, it depends on the use case, but if it's really about something like changing the tone of the model, I would definitely recommend doing preference alignment, and within preference alignment, direct preference optimization. DPO is the algorithm that will work out of the box for you. Then the question is how to get the data to do that; this is the data pipeline we talked about, and UltraFeedback is a good example of how to generate this kind of data. Okay, thank you. [Audience] It seems to me that the techniques you presented focus on single-turn interactions: we have an instruction and then an answer. Can we also do some multi-turn optimization or preference alignment? Preference alignment with multi-turn? Yes, absolutely. This can be completely extended to multi-turn conversations, and it's actually a good way of doing it, because we often see that what works with a single turn, for example instruction following, doesn't work with multi-turn. There's an entire evaluation set dedicated to multi-turn instruction following, and it shows that models that are good with a single turn might not be with multi-turn. So I would recommend, when possible, always trying to get some multi-turn data instead of only single-turn data. [Moderator] A number of people want to know if you'll do autographs at the end of today. Thank you. Yes.
[Audience] I have a question about the model merging example where you average the Finnish model and the English model. I'm just trying to get a sense of: when you have these two different languages, the input and output spaces are very different. Do you train a separate tokenizer per language, or is the tokenizer shared? How does it usually work when you do this cross-language training? Yeah, it's a good question. Tokenizers are created on calibration datasets, and those datasets most of the time contain a ton of English. So the tokenizer of a model like Llama is mostly dedicated to English. You can have special tokens, and you can also include some Chinese, Japanese, Korean, Arabic, or Finnish to cover other special characters. But in general, and in this example, I would say this is a bit out of scope: we are going to use the same model with the same tokenizer in the base and the instruct versions, and changing the tokenizer at this point is too late. You need to do it before pre-training, and if you try to change the tokenizer after pre-training, performance is going to drop dramatically, because the model has to relearn the entire mapping between the token IDs and the embeddings. So that's a costly solution. But I agree that, in general, it's a lot better if your tokenizer is designed for the specific language, and you're going to really struggle if you have a small vocabulary; it's going to be very difficult to learn languages with completely different characters, like Chinese. [Moderator] Great, let's do one more. [Audience] Thank you. With techniques like best-of-N and majority voting, can we give a sense of uncertainty to the end user for security- or safety-critical applications? If you're using majority voting or best-of-N with the scores, there can be a wide spread or a small spread, and that could give a sense of how certain the model is. Could we then surface that to the user, for example when a regulatory agency is asking about safety- or security-critical issues? Yeah, definitely. You can use it as a measure of uncertainty, but for your precise use case, you could also use a reward model that is dedicated to this task. It doesn't have to be general purpose; it can be dedicated to measuring toxicity, for example, and provide scores only on that metric. So this is a nice way of customizing the best-of-N technique to tailor it to specific use cases. [Moderator] Excellent. Okay, let's all thank Maxime one more time.

Latest Summary (Detailed Summary)

Generated at 2025-05-18 16:34

Overview / Executive Summary

This talk gives an in-depth look at the core ideas, key techniques, and future directions of large language model (LLM) post-training. Post-training aims to turn a pre-trained base model into a practical AI assistant that follows instructions, answers questions, and aligns with human preferences. The process has two core stages: supervised fine-tuning (SFT), which teaches the model how to respond through instructions paired with answers, and preference alignment, which steers the model's behavior toward what is desired by contrasting "chosen" and "rejected" answers. High-quality data is the key to successful post-training, and its quality hinges on accuracy, diversity, and complexity rather than sheer volume. The talk details SFT techniques such as full fine-tuning, LoRA, and QLoRA, preference-alignment algorithms such as PPO and DPO, and introduces model merging (e.g., SLERP, DARE TIES) as an effective strategy for strengthening models. Evaluation is an indispensable, if challenging, part of post-training; the key is to combine automated benchmarks, human evaluation, and LLM-assisted evaluation. Finally, the talk looks ahead to the emerging trend of test-time compute scaling, for example Best-of-N sampling and process reward models (PRMs), which show the potential of investing more compute at inference time to improve model performance.

Overview of LLM Post-Training

Post-training happens after pre-training and aims to turn a base model that can only predict the next token into a useful AI assistant.

  • Goal and definition of post-training
    • Maxime Labonne's point: "The goal of post-training is to turn a base model into a useful assistant."
    • A base model cannot answer questions or follow instructions on its own; post-training gives it these key capabilities.
  • The two main stages of post-training
    1. Supervised fine-tuning (SFT): by feeding the model instructions and desired answers, it learns how to respond and learns a specific interaction structure (the chat template), so it can follow instructions and answer questions.
    2. Preference alignment: on top of SFT, the model is shown a question together with a "chosen" answer (desired behavior) and a "rejected" answer (undesired behavior), and learns contrastively to produce outputs that better match user preferences and avoid unwanted ones (a common formalization of this objective is sketched just below this list).
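As a hedged illustration of stage 2, the widely used DPO objective expresses this contrastive idea directly; it is only one of many preference-alignment losses, and the notation follows the original DPO paper:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $x$ is the prompt, $y_w$ the chosen answer, $y_l$ the rejected answer, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\sigma$ the sigmoid function, and $\beta$ a coefficient that controls how far the trained model may drift from the reference.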

Dataset Construction for Post-Training

Unlike pre-training, which uses massive amounts of raw text, post-training datasets are much smaller and put far more emphasis on data quality.

  • The heart of the dataset: quality first
    • Maxime Labonne's point: "What is good data? I think this is the central question of the entire post-training process."
    • Good data has three dimensions:
      1. Accuracy: answers must be relevant and correct. Complex cases (e.g., code, math) may require unit tests or solvers for verification.
      2. Diversity: samples should differ significantly from one another and cover a wide range of user interactions. A moderate amount of synthetic data can improve diversity; too much harms it.
      3. Complexity: challenge the model with hard questions and encourage long, detailed answers with step-by-step reasoning. Frameworks such as AutoEvol (an Evol-Instruct-style approach) can be used to increase sample complexity.
  • Types of fine-tuning and their data requirements (rough estimates, affected by model size, task complexity, etc.):
    1. General-purpose fine-tuning: builds a general assistant that can answer all kinds of questions (e.g., ChatGPT). May require on the order of a million samples.
    2. Domain-specific fine-tuning: targets a specific domain (medical, legal, financial) or language, aiming to embed more domain knowledge. Requires comparatively less data.
      • Note: if the base model is completely unfamiliar with a language, fine-tuning may help little, since there are not enough tokens to train the model effectively in that language.
    3. Task-specific fine-tuning: focuses on a single function (e.g., summarization, spell checking). Needs even fewer samples.
  • 何时进行微调(主要针对领域特定型和任务特定型):
    • Maxime Labonne建议:“始终首先尝试上下文学习(In-context Learning)和RAG(Retrieval Augmented Generation)流程,因其测试更为便捷。”
    • 若RAG在答案质量、延迟或推理速度等方面未达预期,可考虑微调。
    • 微调的适用情境包括:
      1. 调整语气或格式:例如,优化邮件写作助手的表达风格。
      2. 补充领域知识:让模型学习特定事实(如关于个人或公司信息),但难以借此掌握全新语言。
      3. 模型蒸馏:将大型模型(如GPT-4)的能力迁移至小型模型,以降低成本、延迟并提升推理速度。
      4. 提升特定窄任务性能:在高度细分的任务上(如生成图表),微调模型有望超越通用前沿模型。
    • 关键点:务必评估微调效果,并进行迭代优化。
    • 企业进行开源与微调的动机:主要源于对数据控制(Control)(避免数据上传至云端API)和模型可定制性(Customizability)(调整输出风格、创建自有模型)的需求。
  • 数据生成流程与示例
    • 通用数据生成流程
      1. 种子数据 (Seed Data):来源多样,如原始文本、指令答案对、用户问题等。
      2. 数据精炼 (Refine):生成指令、答案或两者。例如,可运用反向翻译(back translation)为无指令的文本生成对应问题。
      3. 评分与过滤 (Scoring and Filtering):采用启发式方法或LLM作为裁判来评估答案质量,剔除低质样本。
      4. 标准数据处理:包括数据去重、去污染(确保训练数据不含测试集内容)、数据过滤(如排除特定关键词)。
    • 指令遵循 (Instruction Following) 数据集创建示例
      1. 起始于提示(prompts)与约束(constraints)(例如,“回答须全为英文小写”)。
      2. 查询LLM以获取初步答案。
      3. 执行单元测试以验证约束是否被严格遵守(如使用langdetect库检测语言,校验大小写)。
      4. 进行去污染处理(如避免生成与IFEval等评估集过于相似的样本)。
      5. 实施关键词排除。
    • UltraFeedback (偏好数据集创建示例)
      1. 从一系列提示开始。
      2. 查询多个不同LLM针对同一提示生成多样化答案。
      3. 利用一个裁判LLM(judge LLM)对每个答案进行评分。
      4. 选取评分最高的作为“选定答案”,评分最低的作为“舍弃答案”。
      5. 移除重复项及过短的答案。
    • 案例研究:OpenHermes风格的偏好数据集(演讲中提及为Open Perfect Plan,推测指Maxime Labonne参与的OpenHermes项目或类似概念):通过整合Hugging Face上多个公开数据集构建的开源偏好数据集。其数据类别配比(如大量数学、代码,辅以聊天和指令遵循数据)为通用型微调提供了有益参考。
  • 聊天模板 (Chat Template) 的重要性
    • 将结构化数据(如Alpaca格式)映射为包含角色(系统、用户、助手)和特殊标记(如IM_START, IM_END,演讲中提及为I start, I end)的对话格式。
    • 模型在SFT阶段重点学习此模板,理解对话流程。
    • Maxime Labonne强调:“这种结构至关重要,它区分了仅能补全文本的基础模型与能真正遵循指令、进行连贯对话的后训练模型。”
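To make step 3 of the instruction-following pipeline concrete, here is a minimal sketch of a constraint check using the langdetect library mentioned in the talk. The specific constraint and helper names are illustrative assumptions.

```python
from langdetect import detect  # pip install langdetect

def passes_constraints(answer: str) -> bool:
    """Illustrative unit test for the constraint:
    'the answer must be entirely in lowercase English'."""
    try:
        is_english = detect(answer) == "en"   # language check via langdetect
    except Exception:
        return False                          # undetectable text fails the check
    is_lowercase = answer == answer.lower()   # casing check
    return is_english and is_lowercase

# Keep only the generated samples that satisfy their constraints.
samples = [{"prompt": "...", "answer": "this answer is all lowercase english."}]
filtered = [s for s in samples if passes_constraints(s["answer"])]
```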
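Steps 3-4 of the UltraFeedback-style pipeline reduce to picking the best- and worst-scored completion per prompt. A minimal sketch, assuming the judge scores have already been collected into (answer, score) pairs:

```python
def build_preference_pair(prompt: str, scored_answers: list[tuple[str, float]]) -> dict:
    """Turn judge scores into a chosen/rejected pair (UltraFeedback-style)."""
    ranked = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    chosen, rejected = ranked[0][0], ranked[-1][0]
    # Basic hygiene from the talk: drop degenerate pairs (duplicates, too-short answers).
    if chosen == rejected or len(rejected.split()) < 3:
        return {}
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = build_preference_pair(
    "Explain the difference between SFT and preference alignment.",
    [("SFT teaches the answer format; alignment teaches preferences.", 8.5),
     ("They are exactly the same thing, no difference at all.", 2.0),
     ("SFT uses instruction/answer pairs, alignment uses chosen/rejected pairs.", 9.0)],
)
```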
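The chat template itself is easiest to see by rendering a conversation with a tokenizer. A minimal sketch using Hugging Face Transformers' `apply_chat_template`; the checkpoint name is only an example, and the exact special tokens depend on the model.

```python
from transformers import AutoTokenizer

# Any chat model works here; this ChatML-style checkpoint is only an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What happens during supervised fine-tuning?"},
]

# Render the conversation into the model's chat template: roles plus special
# tokens such as <|im_start|> / <|im_end|> for ChatML-style models.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```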

Training Algorithms for Post-Training

  • Recommended fine-tuning libraries
    1. TRL (Transformer Reinforcement Learning): developed by Hugging Face on top of Transformers; full-featured, with new algorithms added quickly.
    2. Axolotl: built on top of TRL with a better user experience; preprocesses data automatically, and its YAML configuration files are easy to share and reuse.
    3. Unsloth: extremely efficient for fine-tuning on a single GPU, and makes operations such as quantization straightforward.
  • Supervised fine-tuning (SFT) techniques (a LoRA sketch follows this list)
    1. Full fine-tuning: load the model in full precision and retrain every parameter. Best results, but very large VRAM requirements.
    2. Parameter-efficient fine-tuning (PEFT) with LoRA (Low-Rank Adaptation): freeze the model and train only small added low-rank adapter matrices (roughly 0.5% of the parameters). Faster and cheaper, though the full model still has to be loaded.
    3. QLoRA (Quantized LoRA): load a 4-bit quantized version of the model and apply LoRA on top. Greatly reduces the cost of loading the model and suits constrained hardware, but training is slower and performance drops slightly (a few percentage points).
      • Maxime Labonne's advice: prefer LoRA over QLoRA when hardware allows; full fine-tuning is usually unnecessary unless you are an LLM company training at the base-model level.
  • Preference alignment techniques (more than a hundred variants exist; a DPO sketch follows this list):
    1. PPO (Proximal Policy Optimization): the classic, original preference-alignment algorithm. Requires keeping three models in memory (the policy being trained, a frozen reference model, and a reward model) and uses a KL-divergence penalty to keep the policy from drifting too far from its starting point. Works well, but is complex to implement and resource-hungry.
    2. DPO (Direct Preference Optimization): much lighter; typically only two models need to be loaded (and with LoRA this can shrink to one model plus an adapter). Faster, cheaper, and easier to tune. Although in theory it can be slightly weaker than PPO, in practice the difference is small.
      • Maxime Labonne's advice: "Unless you are OpenAI or have unlimited resources, I recommend DPO over PPO." DPO is friendlier to users and easier to tune.
  • Key training hyperparameters
    • Learning rate: the core parameter controlling the size of each update step; the talk gives typical ranges.
    • Batch size and max sequence length: directly determine VRAM usage. Gradient accumulation (distinguishing effective batch size from per-device batch size) helps work around hardware limits.
    • Epochs: the number of full passes over the training set, usually 3-5; monitor the run as it progresses.
    • Optimizer: AdamW is recommended as a strong baseline.
    • Attention implementation: Flash Attention is recommended for efficient handling of long sequences.
  • What to monitor during experiments
    • Learning-rate issues: loss spikes usually indicate the learning rate is too high; a smooth downward curve is the goal.
    • Other metrics: training loss, validation loss (hold out 1-2% of the data to watch for overfitting), and gradient norm (frequent spikes or persistently high values may point to data problems).
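As a rough illustration of the LoRA approach described above, here is a minimal sketch using TRL's `SFTTrainer` together with a PEFT `LoraConfig`. The dataset and model names are placeholders, the hyperparameters are only indicative, and the exact API surface varies somewhat across TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Example instruction dataset in conversational format (placeholder choice).
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# LoRA adapter: only a small set of low-rank matrices is trained.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="sft-lora-demo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size = 2 * 8 = 16 per device
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # placeholder base model
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```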
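And a matching sketch for DPO with TRL's `DPOTrainer`. Again, model and dataset names are placeholders, the hyperparameters are indicative only, and argument names (e.g., how the tokenizer is passed) differ slightly between TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-org/sft-model"  # placeholder: the SFT checkpoint to align
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns (example choice).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="dpo-demo",
    beta=0.1,                        # strength of the implicit KL penalty
    learning_rate=5e-6,              # DPO typically uses a much lower LR than SFT
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,      # named `tokenizer=` in older TRL releases
)
trainer.train()
```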

Model Merging

Model merging combines the parameters of different models (by averaging, interpolation, and similar operations) with the goal of producing a stronger model.
  • The idea and value of model merging
    • Maxime Labonne: "It sounded like a joke at first, but now everybody uses it. Every LLM company applies model merging."
    • The technique can combine seemingly "random" models into a stronger new one; it is arguably underrated yet already widely used.
  • Main merging techniques (a SLERP sketch follows this list)
    • SLERP (Spherical Linear Interpolation): interpolates between parameters along a sphere rather than a straight line. Usually produces better merges and is popular and reliable in the open-source community, but is limited to merging two models at a time.
    • DARE / TIES (rendered in the talk as "their ties / their linear"; based on the description and Maxime Labonne's work, this likely refers to combining DARE with TIES or a similar technique):
      • DARE (Drop And REscale): randomly prunes and rescales the parameters of the source models.
      • TIES (Trim, Elect Sign & Merge): keeps the most important parameters from each source model and resolves sign conflicts (e.g., a positive and a negative parameter that would cancel if simply added; TIES handles such cases more intelligently).
  • Model merging in practice
    • Prerequisites: the merged models normally must share the same size and architecture (e.g., merge only within the Llama 2 family, not Llama 2 with Llama 3).
    • Precision: avoid mixing full-precision and quantized models, because the quality loss of the quantized model carries over into the merge. Prefer full-precision models.
    • Weighting: different source models can be given different weights, e.g., a higher weight for a model known to be stronger.
    • Iterative merging: merged models can themselves be merged again, forming a complex "model family tree". The speaker used his NeuralDaredevil-8B model (rendered in the talk as "the devil, eight b") as an example of multi-level merging.
    • Parameter tuning (e.g., the density and weight parameters in mergekit): finding the best combination of weights and parameters is more alchemy than science and requires extensive experimentation and evaluation.
    • Expected outcome: a successful merge can beat every one of its source models on benchmarks.
  • Worked example: building a language-specific model (e.g., Finnish)
    1. First, run continued pre-training on raw text in the target language.
    2. Then run supervised fine-tuning and preference alignment in that language.
    3. At this point the model is strong in the target language but may have regressed badly on general capabilities (catastrophic forgetting).
    4. Remedy: merge this language-specific model with a general instruction-tuned model (e.g., Llama 3 Instruct).
    5. Result: a model that is both fluent in the target language and retains good general capabilities.
    • Maxime Labonne's summary: "One of the powers of model merging is that it lets you combine these different skills without damaging the rest of the model."
  • Scope: model merging is mostly applied to text models and can also be used for vision-language models (VLMs), but at its core it combines the existing abilities of models with the same architecture; it does not create entirely new capabilities.
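A minimal numerical sketch of SLERP applied to two weight tensors, to make the "interpolate along a sphere" idea concrete. This illustrates the formula itself, not mergekit's actual implementation.

```python
import numpy as np

def slerp(t: float, v0: np.ndarray, v1: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two flattened weight tensors."""
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)
    omega = np.arccos(dot)               # angle between the two parameter directions
    if omega < eps:                      # nearly parallel: fall back to linear interpolation
        return (1.0 - t) * v0 + t * v1
    sin_omega = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / sin_omega) * v0 \
         + (np.sin(t * omega) / sin_omega) * v1

# Merge two (toy) parameter tensors with equal weight (t = 0.5).
layer_a = np.random.randn(4096)
layer_b = np.random.randn(4096)
merged = slerp(0.5, layer_a, layer_b)
```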

LLM Evaluation

Evaluation is one of the central challenges in the LLM field; it is imperfect, yet essential for improving models.
  • Why evaluation matters and why it is hard
    • Maxime Labonne: "Evaluation is the main problem in the LLM field. It doesn't work very well and we don't entirely know what we're doing. But it really matters... because all of training and post-training is optimization... without correct evaluation, we are optimizing for the wrong target."
  • Automated benchmarks
    • Typical examples: MMLU, the Open LLM Leaderboard.
    • How they work: score models on a fixed dataset of samples with a specific metric (e.g., accuracy for MMLU).
    • Strengths: scalable, cost-effective, reproducible, and can target specific tasks such as math or coding.
    • Limitations: differ from how models are actually used (e.g., conversation) and struggle to judge open-ended, complex answers.
    • More focused benchmarks can be built: for enterprise use cases or specific domains (dedicated suites for Finnish, code, medical, and so on).
  • Human evaluation and the Chatbot Arena
    • How it works: human raters compare and score the outputs of anonymized models (e.g., which answer is better). Many such comparisons are aggregated into Elo scores to produce a leaderboard.
    • Strengths: evaluation criteria can be defined precisely (e.g., focusing on harmlessness), highly flexible, low risk of data contamination, and directly reflects human preference.
    • Limitations: hard to scale, expensive, slow, and human judges have their own biases (e.g., favoring longer or more confident answers even when they are wrong).
    • Maxime Labonne: "This is by no means the ultimate evaluation, but it is still a very important tool."
  • Low correlation between automated benchmarks and human preference
    • Key observation: rankings from the Chatbot Arena (a proxy for human preference) often correlate poorly with rankings on automated benchmarks such as MMLU, GSM8K (math), and HumanEval (code).
    • Takeaway: a model can score highly on automated benchmarks yet feel mediocre to human users, and vice versa. The two kinds of evaluation should therefore be used together, as complements.
  • LLM-as-a-Judge (a judging sketch follows this list)
    • How it works: use one or several LLMs (an "LLM jury") to grade the answers produced by other LLMs.
    • Strengths: far more scalable than pure human evaluation, can handle complex tasks, and its feedback correlates reasonably well with human preference.
    • Limitations: the judge LLM has its own biases (often similar to human biases), and its verdicts still need quality control (e.g., human spot checks to confirm the judge agrees with human standards).
  • Advice for building a custom evaluation
    1. Start early: define evaluation criteria before fine-tuning begins (in the spirit of test-driven development).
    2. Iterate continuously: evaluation datasets and criteria will evolve as you understand the model's behavior better.
    3. Combine methods: pair automated benchmarks with human (or LLM-assisted) evaluation.
    4. Compare broadly: compare not only successive versions of your own fine-tune, but also other public models and models with different architectures.
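A minimal sketch of pairwise LLM-as-a-judge scoring using an OpenAI-compatible client. The judge model name, rubric, and verdict format are illustrative assumptions, not the setup from the talk.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, reply with "
    "'A' if answer A is better, 'B' if answer B is better, or 'TIE'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

def judge(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o-mini") -> str:
    """Ask a judge model to pick the better of two answers (pairwise comparison)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # deterministic verdicts make comparisons more stable
    )
    return response.choices[0].message.content.strip()

verdict = judge(
    "What does SFT teach a base model?",
    "It teaches the model to follow instructions using a chat template.",
    "It makes the model bigger.",
)
```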

Future Trend: Test-Time Compute Scaling

The core idea of this trend is to spend more compute at inference time to improve the quality of the final output.
  • Core idea
    • Maxime Labonne: "The biggest trend right now, and I expect it to become even more prominent in early 2025, is test-time compute scaling... The core question is: can we improve model performance by spending more compute at inference time?"
  • Main approaches (a Best-of-N / majority-voting sketch follows this list)
    1. Majority voting: have the LLM generate several candidate answers to the same question and pick the most frequent one.
      • Rationale: repeated sampling (usually with a higher temperature for diverse outputs) averages out the occasional errors introduced by the randomness of non-greedy decoding.
    2. Best-of-N: generate N different candidate answers, score each with a reward model or a judge LLM, and return the highest-scoring one.
    3. Process Reward Models (PRMs): instead of scoring only the final answer, a PRM scores every step of the generation process.
      • Each step's score represents the probability that the step leads to a correct final answer.
      • This enables iterative refinement: the LLM produces a partial solution -> the PRM scores each step -> the best step is kept and extended -> repeat.
      • For example, on a math problem a PRM can detect and reject a wrong intermediate step, steering the model toward a correct solution path.
  • Evidence
    • Hugging Face's experiments show that small models such as Llama 3.2 1B and 3B (rendered in the transcript as "lamath point 21B" and "M three b") can, by generating more candidate answers per question at test time (i.e., spending more compute), outperform much larger models such as Llama 3.1 8B and Llama 3.1 70B.
    • Bottom line: "This clearly shows that you can trade some inference speed for higher-quality outputs."
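A minimal sketch of majority voting and Best-of-N selection. The `generate` and `reward` callables are stand-ins for an actual LLM sampling call and an actual reward model; only the selection logic is shown.

```python
from collections import Counter
from typing import Callable

def majority_vote(question: str, generate: Callable[[str], str], n: int = 16) -> str:
    """Sample n answers (ideally at a higher temperature) and return the most frequent one."""
    answers = [generate(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(question: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n answers and return the one the reward model scores highest."""
    answers = [generate(question) for _ in range(n)]
    return max(answers, key=lambda a: reward(question, a))
```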

Wrap-Up and Q&A Highlights

  • The post-training loop: an iterative optimization cycle.
    1. Build a high-quality dataset (roughly a third of the time): focus on accuracy, diversity, and complexity.
    2. Train the model (roughly a third of the time): combine SFT, preference alignment, and model merging as needed.
    3. Evaluate the model (roughly a third of the time): use multiple evaluation methods to get a full picture of performance.
    4. Evaluation results feed back into improving the dataset, closing the loop. In the end, what is being evaluated is not just the model but the quality and composition of the training data.
  • Key insights from the Q&A
    • Business model: general-purpose B2C vs. fine-tuned B2B
      • Question: from an LLM provider's profitability standpoint, which has more commercial potential, a general-purpose B2C model or customized B2B fine-tunes?
      • Maxime Labonne's view: the general-purpose B2C market is fiercely competitive and close to saturated; newcomers will struggle against Google, OpenAI, and the other giants. Focusing on B2B with customized fine-tuning is likely more promising, and can later serve as a springboard into B2C.
    • Keeping fine-tuned knowledge fresh, and the role of RAG
      • Question: knowledge injected by fine-tuning is static while real-world knowledge keeps evolving; how does RAG complement fine-tuning here?
      • Maxime Labonne's view: the choice is not fine-tuning versus RAG, but "RAG alone" versus "fine-tuning plus RAG". If RAG adds value to the application, pair it with the fine-tuned model.
    • Reinforcement learning without explicit input-output pairs
      • Question: without input-output sample pairs, which algorithm should be used to adjust a model's tone or personality?
      • Maxime Labonne's advice: for changing tone and similar goals, use preference alignment, in particular DPO. The required preference data can be generated with an UltraFeedback-style pipeline.
    • Preference alignment for multi-turn conversations
      • Question: the techniques presented mostly target single-turn interactions; can they be applied to multi-turn preference alignment?
      • Maxime Labonne's view: yes, they extend naturally to multi-turn settings, and doing so is recommended, because strategies that work in single-turn interactions (such as instruction following) can break down over multiple turns and need targeted optimization.
    • Handling different languages in model merging (tokenizer questions)
      • Question: when merging models for different languages (e.g., Finnish and English), the input and output spaces differ substantially; how should the tokenizer be handled?
      • Maxime Labonne's explanation: tokenizers are built before pre-training on large calibration corpora that are mostly English. Changing the tokenizer at the post-training stage causes a severe performance drop, because the model has to relearn the mapping between token IDs and embeddings. The merging example in the talk assumes both models share the same base tokenizer. Ideally, the tokenizer should be designed for the target language from the start.
    • Using Best-of-N and similar techniques as an uncertainty signal (a score-spread sketch follows this item)
      • Question: can the multiple answers and score distributions produced by methods like Best-of-N be used to convey the model's uncertainty to end users, especially in safety- or compliance-critical applications?
      • Maxime Labonne's view: yes, those score distributions can serve as a measure of uncertainty. In addition, for specific scenarios you can train a reward model dedicated to a particular metric (such as harmlessness or toxicity) and let its scores drive the Best-of-N selection, tailoring both the uncertainty estimate and the decision to the use case.
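As a rough illustration of that last point, here is a minimal sketch that turns Best-of-N reward scores into a simple spread-based uncertainty signal. The `reward` callable and the review threshold are illustrative assumptions.

```python
import statistics
from typing import Callable

def best_of_n_with_uncertainty(question: str,
                               candidates: list[str],
                               reward: Callable[[str, str], float],
                               spread_threshold: float = 0.2) -> dict:
    """Pick the best-scored candidate and report the score spread as an uncertainty signal."""
    scores = [reward(question, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    spread = statistics.pstdev(scores)  # spread of reward scores across the candidates
    return {
        "answer": candidates[best_idx],
        "score": scores[best_idx],
        "uncertainty": spread,
        "needs_review": spread > spread_threshold,  # e.g. route to a human in safety-critical flows
    }
```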