2025-05-23 | Stanford CS25 V4 I Behind the Scenes of LLM Pre-training: StarCoder Use Case

In this Stanford CS25 lecture, Loubna Ben Allal shares the details behind pre-training large language models (LLMs), using StarCoder as a case study to discuss the data, model architecture, and training strategies needed to train a high-quality LLM, and analyzes the trends of open versus closed models and the trade-offs involved in training.

Media Details

Upload Date
2025-05-20 13:14
Source
https://www.youtube.com/watch?v=jm2hyJLFfN8
Processing Status
Completed
Transcription Status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: Hello, thank you for joining CS25, Transformers United's last class. Today we have Loubna, who is a machine learning engineer on the science team at Hugging Face, working on large language models for code and synthetic data generation. She's part of the core team of the BigCode project and has co-authored the Stack dataset and the StarCoder models for code generation. Thank you so much for coming to our talk today. And as always, the attendance link and the Slido questions are on our website, and we'll be taking questions after the talk. Thank you. And you can take it off now. Hi. Thank you for the introduction. Cool. So I'm Loubna, I'm a machine learning engineer at Hugging Face on the science team, and today I'll tell you about the behind the scenes of training large language models. And I will use the StarCoder models that our team has trained as a use case. So today's plan is very simple: we're going to try to answer this question, what does it take to train a good LLM? It's one question, but it's very loaded and it has a lot of follow-ups, and as you will see, my slides will be a series of questions and answers. So a few years ago, a lot of people thought that there was some secret sauce to the strong closed models like GPT-4, and that it would take the open-source community a lot of time to catch up, because the open-source models that we had back then were much smaller and less performant. But now it seems that the community has kind of figured out most of the pieces for getting strong LLMs, and this was predicted in the Google memo that was leaked and published by SemiAnalysis. For example, today we have Llama 3 70B Instruct, which has almost the same performance as GPT-4, but it unlocks so many use cases because the model weights are open. The model can be quantized and can even run on a consumer desktop. It also allows the community to build very cool use cases on top through fine-tuning. So we've made a lot of progress in the open field, and this is not the only model that's out there. We're now observing a kind of rise of open LLMs, and more and more companies are embracing releasing models. That was the case, for example, with DeepMind's Gemma models and with Mistral's models, and also models from other companies. Here I put a plot from the LMSYS arena, which is kind of the go-to leaderboard for comparing instruct models nowadays; it uses human evaluation. And you can see in this plot that as we went from 2023 to May 2024, the gap in performance between the closed models and the open models is shrinking and becoming smaller, which is very promising. So we're on a very good path, but there are still a lot of limitations, and this is mainly due to releases missing out important details about how the data was processed and how the models were trained. And this is usually the case for two main reasons. The first one is to avoid legal scrutiny, because when companies publicly disclose the training data, if the training was not done properly and the copyrights were not respected, they risk facing a legal investigation. The other reason for not disclosing the details can be to maintain a competitive edge: some companies want to be the best at training LLMs, so they don't want to give all the details of their training. Nevertheless, because we have a lot of releases, I think we can still answer this question and put a lot of pieces together. So what do we need to train a good LLM? The first thing is probably the model.
You need to have a good architecture. And I think now transformers are kind of the default, but there are also other interesting architectures like Mamba, the state space model, or you can use a mixture of experts, which can be multiple transformer models. But I'm not going to spend a lot of time in this lecture on models, because I think it's a topic that's already thoroughly explored, and there are other aspects that maybe deserve a little bit more attention. So that's all I'll say for models. Then for GPUs, I don't think there's much I can tell you about that, except maybe ask Jensen. But the part that I'm the most interested in is data, which I think is the backbone of LLMs, because now almost everyone is using the same architectures and the same training techniques, and for a given budget, data is what makes some models better than others. So it's really worth spending time exploring this data and understanding how to get higher-quality samples. So now we're going to try to answer our previous question of how to train a good LLM by asking: how do we get good training data? And I think the answer to this is threefold. First, we need to understand how much data we need. Then, once we've figured out the size of the data that we need, where can we get this data? And to clean it, which filtering techniques make more sense and will give us the best performance? To answer the first one, the answer is the scaling laws: you want to know how much data you should train your model on, but also what the optimal size of the model is, and the scaling laws study the allocation of a compute budget between data size and model size. This means: should you take a smaller model and train it longer, or take a larger model and train it on less data? And I'm going to present a brief history of the scaling laws, because I think it's really interesting to see how the sizes of the models progressed through time, and also how the sizes of the datasets and the number of tokens they were trained on have changed, because there were really some drastic changes there. I think the first to establish the scaling laws were Kaplan et al. from OpenAI, and they tried to fit the loss as a function of the data size and the model size. And they found that if you have a ten times increase in your compute, you should increase your parameter count by 5.5 times, but your training tokens you should only increase by 1.8 times. This means that if you have more resources to train your models, you should make the model much larger, but the data is fine, you shouldn't increase it that much. And this is what led to models like GPT-3, which has 175 billion parameters but was only trained on 300 billion tokens, which, if we think about it now, is really small. Other models also followed this: for example, OPT, which was the same size as GPT-3 and trained on a similar amount of data, and there was also BLOOM. So all these models are actually very undertrained. Then the Chinchilla scaling laws came after, and they kind of revisited the scaling laws. And they found that the reason Kaplan et al. concluded that data should not be scaled as much as model size is that they used a fixed cosine scheduler for all their experiments. So although they were changing the data size, the cosine scheduler was fixed. This meant that some models were underestimated, because they were not using the correct cosine schedule that corresponded to their data size, and this led to kind of false conclusions.
And the Chinchilla paper gave us new scaling laws that say that you should scale your data and your model size equally. And in their paper, they trained a 65 billion parameter model on 1.6 trillion tokens, which is the Chinchilla-optimal point, and it also outperformed much larger models like GPT-3 and Gopher, which was over 200 billion parameters. So here, for example, I have a plot which shows what the scaling laws try to do. You have IsoFLOP curves, where each curve uses a fixed compute budget, and then you try to find the sweet spot, which is the optimum for your budget allocation, and it tells you what your model size should be and what your data size should be. And as you can see here, if we try to fit the loss, there's a linear increase for data and also for model size. In this slide, I tried to show how we've moved from the Chinchilla scaling laws to today's models. And you can see that, for example, the Chinchilla model, which is around 60 billion parameters, was trained on less than 2 trillion tokens. But then after that we have Llama, which was released last year, and it was just a 7B model, and it was trained on roughly as much data as the Chinchilla model. So it was trained way past the Chinchilla-optimal point. And you might rightly be wondering, why is that the case? Did Meta not use their compute budget in an optimal way? And the answer to that is that compute-optimal is not always optimal, because when you train a model, you don't only care about what you're going to spend on training; you also care about your inference. The model is trained one time, but the inference is forever: the model is going to be served, so you want to save some cost there. This is why people prefer training smaller models for longer rather than using much larger models that are trained on less data. This was the case for Llama 1, for other models like Mistral, but also for Llama 3, which went even further and was trained not on 1 trillion tokens, but on 15 trillion tokens. And if you check the paper, the loss kept going down, and the downstream evaluations kept improving as the model kept training. I think this is really interesting, because some people misunderstood the Chinchilla scaling laws as saying that compute-optimal is optimal, but that's not the case, because the inference cost is not considered. So for example, this is the cost of training GPT-4: it's estimated at $100 million, but the inference is also very expensive, and the larger the model becomes, the more time it takes to process tokens. So in short, the scaling laws don't take the inference cost into consideration. And if we do take the inference cost, which is the case for most people because they want to use these models for inference, you might prefer using smaller models and training them longer. And when we do that, we're not respecting the Chinchilla scaling laws; we're choosing to pay what we call a compute overhead. It's kind of a sacrifice that you make during the training: you choose to pay more, but this has a benefit during inference, because you will save a lot of cost and money. And there's a very interesting blog post about Harm's law, which tries to measure the compute overhead that you will be paying when you choose to train a small model. For example, there's this space on Hugging Face where you can input the model size and what dataset you want to train on, and it will show you where you are relative to the Chinchilla-optimal point.
So for example, if we take a 7B model and train it on 1 trillion tokens, you can see that we are here, it's the red dot, and it's before the Chinchilla-optimal model. And this gives approximately, I think, a 40% overhead. But then during inference, as it shows here in the table, sorry, it was a 13% overhead, but there are almost 50% savings in cost. So that's something that almost everyone is doing now, which is why we see models that are much, much smaller than one or two years ago. For further reading, there are some very interesting papers about scaling laws. For example, there's this paper called Scaling Data-Constrained Language Models, which shows that if you are limited in your data size, let's say, for example, you want to train a 7B on 10 trillion tokens, but you don't have these 10 trillion tokens, this paper says that you can basically repeat your data up to four times and you will get similar performance as if you used unique tokens. So for example, instead of using 8 trillion unique tokens, you could use just 2 trillion and repeat them four times, and you get almost the same performance as if these tokens were unique. And this is especially useful for some domains where we have almost exhausted all the data that's publicly available. As I will show you later, the Stack v2, which is a code dataset that we released, has almost all the code that is available publicly, so it's going to be very hard to scrape and get more code. And if we want to train models longer, the only option is to actually repeat the data during training. And this is good news, because repeating the data up to four times actually makes a significant difference. Another paper that I think is interesting when it comes to scaling laws is the DeepSeek LLM one; they tried to establish new scaling laws that are suited to their data, because they found that the scaling behavior is highly dependent on the data quality. So they tried different data subsets, different filterings, and they found that the scaling laws were changing. This is very important, because up until now we were using Chinchilla, but Chinchilla used fixed datasets, and they are not necessarily the ones that we are using now. So it's really important to be aware of that, and this is why DeepSeek tried to come up with their own scaling laws that work for their datasets. And they also conclude that when you have higher-quality datasets, maybe more compute should be allocated to the model size and not the data size. So these are interesting things to keep in mind when it comes to scaling laws. So we have answered the first question, I hope: how much data to train LLMs on. Let's say now you have your compute budget, a fixed number of GPUs for a certain amount of days, and you also know approximately how much data you want to use. The question is: where do we find this type of data? For example, Llama 3 was trained on 15 trillion tokens, but where do you get 15 trillion tokens? That's a huge amount of data to get. The two main sources where you can actually get a very large volume of data are the web and then source code. There are some other curated sources that are of high quality but are much smaller, like Wikipedia, books, arXiv or Stack Exchange. You can also get data of a new type that's been very trendy recently, which is synthetic data. But let's first start with the sources where you can get very large volumes. The first one is web data, so that's basically web pages. And usually, this is how people create these datasets.
They start from Common Crawl, which is a public repository of crawled web pages. Common Crawl crawls pages regularly and publishes dumps every few months. If you start from there, you will need to do some heavy filtering at a very large scale. For example, just the latest dump has over 400 terabytes, and they have almost 95 dumps. So that's not a very easy task, and you will need to have a lot of resources and a team to be able to do that. The other option is to use an existing filtered web dataset: other researchers have already filtered Common Crawl and released the results, and luckily we do have datasets that are very large and well filtered. One of them is FineWeb, which was recently released by Hugging Face, and it has 15 trillion tokens of web data. It's not just a large dataset, it also has the best performance among the publicly available datasets. And here, for example, the plot shows the performance, which is an aggregation over multiple popular NLP benchmarks like HellaSwag, MMLU, PIQA and others; it averages them and compares to other datasets like C4, RefinedWeb, SlimPajama and the Pile. So that was for the web; you can get 15 trillion tokens there. And then for code data, we have released the Stack dataset, which is the largest dataset of open-source code. This dataset comes in two versions. Version one consisted of 6 TB of permissively licensed code. How we built this dataset is that we first cloned all the public repositories on GitHub. This gave us over 130 million repositories and about 100 TB of data. But we don't want all of that data, because a lot of it can be configs or extensions that we don't need, or languages that are no longer maintained. So we did some file extension filtering, and we ended up with almost 90 TB of data. After that, we filtered repositories based on their licenses. You can have permissive licenses like Apache 2.0 or MIT, and you can have more restrictive licenses like GPL. So we filtered out all the repositories that did not have a permissive license, and after that we did a deduplication to remove files that are similar. So we ended up with almost 3 TB of deduplicated data. The Stack also comes with a very cool tool for opt-out. This tool is basically a space where you can go, type your GitHub username, and it tells you if any of your GitHub repositories are in the dataset. And if that's the case, there's also an option to fill a form and request to be removed from all the future trainings of BigCode. So we did that for the Stack v1, but also for the Stack v2. The v2 is a much larger and enhanced dataset compared to the v1. This time, instead of cloning GitHub repositories, we went through Software Heritage, which is an archive of code. They had already done the scraping, and we just extracted the data from their archive. And we ended up, after all the filtering, with almost 1 trillion tokens, which is a lot compared to the v1, where we got around 200 billion tokens at the end. We also added some high-quality resources like GitHub issues, math and code datasets, and pull requests. So these datasets, the Stack v1 and the Stack v2, can be used to train LLMs on code, or to train general LLMs and include code as a subset of the general web data. This shows how the Stack v2 compares to the v1, and you can see that before filtering it's almost ten times larger, and after filtering it's four or five times larger. So I talked about how to get web data and how to get code data.
And then I also mentioned synthetic data. It's this year and last year that synthetic data became very important for LLM pretraining, and I think that in the next few years it will become even more important. I think this was mainly sparked by the Phi series of models by Microsoft. The first paper was called Textbooks Are All You Need, and they basically generated synthetic textbooks using GPT-3.5 and GPT-4, and they tried to build a new pretraining corpus that is synthetic. And they were able to match or outperform models that are trained on web datasets. So this model was trained on almost entirely synthetic data. But now some of the very popular LLMs are using synthetic data as part of their pretraining mix. For example, for Claude 3, in the model card they say that they generate data internally and include it in the pretraining. This is also the case for Llama 3, where they used LLMs to build classifiers that would score samples and only keep the high-quality ones, but they also generated synthetic content to improve performance on coding, reasoning and long contexts. So synthetic data is a very new topic, but it seems really interesting, and I'm personally also working on that at Hugging Face. We recently released a dataset called Cosmopedia, which was the largest dataset of synthetic text, and it had almost 25 billion tokens. And instead of using closed models like GPT-4, it used an open-source model, which is Mixtral 8x7B. We also released a blog post that explains how we created this dataset. It can be very tricky to get very diverse samples, so we used an approach where 80% of the data comes from the web, and then we try to use these web samples to build new prompts that ask models to generate textbooks that are related to these web samples, while giving them more context so we can constrain the generations. For example, we can have a topic that is mathematics, and then we have web samples that are related to mathematics, and each time we give the model a prompt: generate a textbook in the field of mathematics that is related to this web sample. And the more web samples we add, the more diversity we add. We also used some curated sources like Stanford courses and WikiHow, where we used extracts from these pages to ask the models to generate content that is related to them. You can find more details in the Cosmopedia blog post. So I guess now we also have the answer to our second question, which was where to find the data, if you're following. And we have one question left, which is: how can we filter this data? Because, for example, if you use Common Crawl, you need to filter it, and even with the Stack, we did not train our models on the Stack directly; we did a lot of filtering to get a dataset that is smaller but has a higher quality. And for this part, I will cite a slide from Thomas Wolf's presentation, which is very interesting, by the way; you can find it here. And this quote is from the Yi paper, where they state that a high-quality dataset might exhibit very advanced capabilities for a standard architecture. And this is actually the focus of many recent papers, and we can see that in model releases the sections about datasets are becoming smaller and smaller, because people are realizing that the dataset is actually the backbone and is what makes some models much better than others.
So it's really important to spend a lot of time creating these datasets and trying to remove all the outliers and samples that can hurt the model during training. This is the pipeline from the Yi paper for filtering their pretraining web dataset. First they do language filtering; in Yi's case they kept English and some Asian languages. Then they apply some filtering techniques to remove low-quality samples. For example, there are some metrics where you look for files that have a lot of repeated lines and then remove them; there's also rule-based correction. You can also use perplexity filtering, where you compute something like a loss and remove samples that have a very high one. After that, they also did a very important step, deduplication, because there are a lot of papers that study the effect of duplicates on training, and they find that keeping duplicates in the training data can cause models to memorize, leaving them less room to be creative, so this hurts the performance of models. And it's always advised to remove duplicates, using exact deduplication to remove files that are exactly identical, but also near-deduplication to remove files that are similar, and this uses techniques like MinHash deduplication, for example. After that, they also did more filtering on top, like semantic and topic filtering. But usually you do the classic filtering and deduplication and then get more creative with the other filters. That was also the case for FineWeb: the reason it is better than other datasets is that they spent a lot of time trying to come up with better filters and also deduplicating the dataset. Well, now the question is: okay, we can do the deduplication, I think we have established methods to do that, and we can also do language filtering; but if you want to filter the data to remove garbage and low-quality files, how do you come up with good filters? You can for sure find some filters in the literature, but if you want to really build a dataset that is better than what exists, you need to invest some time trying to find more techniques that work better for your case. This can be done with manual inspection: it's always a good idea to look at the data and see what it actually looks like, and you can come up with filters that could help during the training. But that is usually not enough, because you might have an intuition for a filter that should work better for your model, but then when you train, this filtering actually doesn't help. For example, for us, when we were developing the StarCoder series of models, we were thinking: okay, what are the best ways for us to filter code? So we used some standard filters, for example to remove auto-generated content, but we tried to come up with slightly more complex filterings that could help us, like looking for files that have a lot of comments, because code that is well documented is probably of a higher quality than another code file that doesn't have any comments. So we implemented this filter that looks for files that have almost no comments and removes them, and we trained a model on that. It turns out the performance improvement was really negligible; it was not as much as we thought. We also tried to use another filter, which uses the stars of a repository as an indicator of quality. So we tried removing all the files from repositories that have fewer than five stars, and this ended up removing over 70% of the dataset.
And then when we trained on it, the model was the worst model that we trained in all our ablation experiments, simply because this filter removed too much data. It was not worth using this filtering technique. This is why it's very important that when you have a filter, you should run what we call an ablation model. An ablation is basically: you take a subset of your dataset after you've applied the filtering, and you train a small model on it and see how it behaves with and without the filtering. And you might be wondering: okay, if I use a small model, does it really extrapolate to larger models? I think that's a good question, but generally, from our experience, we found that it does extrapolate for most data ablations. When you're doing these ablations, you should select a set of high-signal benchmarks that can give you some conclusions about the effect of your filtering early in the training. This can be some of the popular NLP benchmarks for LLMs, for example HellaSwag or MMLU. You should also, sorry, try to train with different seeds to reduce the noise, because sometimes you can have filtering techniques that don't give you a very big difference, and if you train with just one seed, you might draw conclusions that are actually just noise. So if you can and you have the compute, it's always better to run the same experiment with two or three different seeds and then do something like averaging, so that you reduce the noise and you have more robust conclusions about the effects of your filtering. For example, for the FineWeb dataset, the authors ran over 200 ablations. These were 1 billion parameter models trained on, I think, 30 billion tokens. And this is how they were able to find filterings that worked better for their datasets. Now let's go back to our StarCoder use case, and I will tell you how we filtered the Stack dataset. For version one, if you remember, we had 6 TB of source code, but when we trained StarCoder, we only used 800 GB of these 6 TB. So a lot of this data was filtered out by our filtering and curation. The same happened for the Stack v2, where this time we started from around 32 TB in over 600 programming languages, and after the filtering we ended up with only 6.3 TB of code. For filtering code, the approach is a bit similar to filtering web data, but the filtering techniques are a bit different. First, we wanted to include a lot of programming languages, and we looked at them and didn't keep all of them: we only kept the popular ones and excluded, for example, configs and languages that are no longer maintained. This was for v1. For StarCoder 2, we included more languages, over 600, and then we added some other sources that could be interesting for a code model to learn from, which are GitHub issues, GitHub commits and Jupyter notebooks. For the v2 we also added Kaggle notebooks and pull requests. The second step, after we selected the languages we wanted to train on, was data quality inspection. So basically, as I told you, we had some filters to remove low-quality files and auto-generated content. An example is the average line length: if you have an average line length that is too high, there's probably something wrong with this file, or it's probably auto-generated. But since we had almost 100 programming languages, we should not use the same threshold for all the languages for this filter, because some programming languages just have longer lines.
So it's important to do some inspection and look at some samples from these languages. In our case, we had the BigCode community, which helped us look at 100 samples per extension and derive the appropriate thresholds and filtering heuristics. The third, sorry, filtering step was near-deduplication. We found that near-deduplication was the filtering that gave us the biggest performance boost, and it's also very easy to apply because it's language-agnostic: even though we have 86 programming languages, we don't need to change the deduplication for each language, we can just apply it to the whole dataset. And here I show you some results on the effects of deduplication. For example, you can see this model trained on the Python all-licenses subset: if there's no filtering, you get a pass@1, which is our code metric, of 13, but if you apply near-deduplication, you go from 13 to 17. That's a very big performance bump. The same goes for other subsets, like the permissive-license one. So we decided to use deduplication for our datasets, and to use strong deduplication to really remove all the files that could be similar. Another step in our pipeline was to remove personal identifiable information. This could be names, emails, keys or passwords, because we scraped code from GitHub, and although GitHub has some tools to detect secrets and prompt users to remove them, that's not always enough, and we found that there were still a lot of secrets in the dataset. When we train our model, we don't want it to be trained on that, because at inference it might generate sensitive or personal data. So our approach to removing it was to first annotate a dataset for PII: we collaborated with an annotation company to annotate some samples. The annotators were tasked with labeling the PII when they find it; for example, if they find a name, they give it the class name, and if they find an email, they label it as an email. So it was a named entity recognition task. And then we trained StarPII, which is our NER model, to detect this PII, and then we ran it on the whole StarCoder training data. This took almost 800 A100 GPU hours, because it's a neural network that runs on GPUs. The next step in our filtering was data decontamination, because you should make sure to remove the benchmarks and test sets from your training data, otherwise your evaluation numbers will just be inflated. So we made sure to remove the benchmarks that we use for evaluation from our training sets. The last step in the curation of the Stack was to format the data. Now that the data is filtered, and because code is different from text, we can allow ourselves to apply some nice formatting that could help us during inference. For example, for StarCoder, we had the code file, but before the code we added some tokens that indicated the repository name, another token for the file name, and another one for the stars. And this is interesting because these models, for example StarCoder and other code models, their main use case is to be plugged into an IDE, for example VS Code. And when you're using them, it could be interesting to prepend the code file with the name of the file, for example, I don't know, file.py, so that the model would know this is a Python file. If it's in another language, when you add the file name and you have the extension, the model can detect the language that it should generate code in.
We also added a GitHub stars token, and we tried to play with it, for example saying this file has 100 stars, to see if the model would generate higher-quality code than if it were told to generate for zero stars. We didn't find any real differences during inference, but it was fun to add all this formatting. For StarCoder 2, one of the improvements was that it was repository-aware. When we have a GitHub repository, we have some files in the same repository that are related to each other, but when we built the Stack v1, we just shuffled files, so we didn't keep this repository structure when we trained the model: we just shuffled them, and the model did not know if two files belonged to the same repository. When we did StarCoder 2, we tried to keep files that are in the same repository next to each other. And how we did that is by concatenating them with some special tokens, like a file separator token, which basically separates files. This way the model can kind of know which files are in the same repository and try to find links between them. On the tooling side, for training there's Nanotron, which implements 3D parallelism, and then you also have LightEval for doing the evaluation. So this is kind of a stack that lets you run your full trainings, but also your ablation models: you can apply a filter with datatrove, then train with Nanotron, and evaluate with LightEval. They're well integrated together and they make one ecosystem. So that's for general LLMs. For code LLMs, we also released the code we used for both the Stack and the StarCoder models under our BigCode repository on GitHub. And I think we just answered our third question, which was how to filter the data. So now you know, first, how much data you need, then where you can get this data, both web and code and synthetic and curated, and you also know how you can properly filter the data and test the filtering techniques that you have in mind. So now let me tell you a little bit more about code LLMs, because that's kind of what I'm working on, and I'm trying to give you a bit of an overview of these models, so that you know how to train good LLMs but also how to build very cool code assistants and completion models. How all of this started was when GitHub Copilot was released. It was very interesting because it was so much better than all the other code completion models before it, which were very small and much less performant. GitHub Copilot was using the Codex model by OpenAI, and they just showed that you can train a code LLM in the same way that you train an LLM for English. You can just take a large transformer model and give it a lot of code data, and it will learn this code. Before that, a lot of people were trying to treat code very differently, for example by using abstract syntax trees, but what the Codex models showed is that you can treat code like text: if you want to predict the next line, you just do your next token prediction and you get your code. It works very well, much better compared to the more feature-engineering-heavy techniques. That was over two years ago, and we didn't have any good open code models then. But today, if you go to the Hub, you can find that we have over 1,700 models that are trained on code. These are models that are either trained only on code or LLMs that included code as part of their training. So you can see that we've made a lot of progress in this code generation field, which is amazing.
And this is the result of the community's work to build very good instruction-tuned models and base models. For example, as you can see in the leaderboard, we have some very strong models that score almost 80% on the code evaluation benchmark HumanEval, which means they get almost 80% of the problems right, which is a very large number. Talking about the landscape of open code LLMs: in BigCode, we have released the Stack dataset, which is now the default dataset for training on code, and also StarCoder 1 and StarCoder 2, a family of models, and other instruction-tuned models with the H4 team, like StarChat 2. Meta also released some very good code models, which are the Code Llama series of models that go from 7B to 70B. There are also the DeepSeek models, which are also very strong, and other models like the recent Granite models from IBM, CodeQwen, CodeGen and StableCode. So there are different providers for code LLMs and also for code datasets. The main reason we started the BigCode collaboration and trained the StarCoder models was to have a collaboration with full data transparency. We released all the details about the training, but the data is also public so that people can inspect it and use it, and we also released the code for the processing and the model weights. The collaboration was open: we had over a thousand researchers joining us and following the journey with us. This kind of created a BigCode ecosystem, where the Stack was used in the training of a lot of prominent code models like CodeGen and StableCode, and the StarCoder models were used as the basis for a lot of community fine-tunes. And I think it's very important to be aware of what makes a release of an LLM, whether it is a code LLM or a general LLM, open and responsible. First, it's really good for the community and for research in AI in general if you can make open-access datasets. This means having data inspection tools, but also opt-out tools to respect people's wishes regarding their data; for example, if they don't want to be included in the trainings, they should be able to opt out. It's also important to remove personal identifiable information. So an open release does not mean just releasing model weights and stopping there, but also making your work reproducible by releasing the pipeline for training these models, and also releasing tools for evaluation and technical reports that document the whole pipeline. And for us in BigCode, we went from SantaCoder, which was part of our ablations to understand how to filter the Stack dataset, to StarCoder, which was released last year, a 15 billion parameter code generation model, and then this year we released StarCoder 2, which was trained on many more programming languages and had a much higher evaluation score. StarCoder was also rated as the most transparent model by the Stanford Foundation Model Transparency Index, which was a nice reward given the effort that we put into data governance and into making the model release as transparent as possible. Regarding evaluation, for example, StarCoder 15B, when it was released, was the state-of-the-art code model, and this was also the case for StarCoder 2 15B among other 15B models, and it was even close to or better than larger models.
I think I don't have the plot here, but it was matching Code Llama 34B and it was close to DeepSeek 33B on some benchmarks. And here, for example, you can see the results on different benchmarks, because when releasing a model, it's really important that you don't just evaluate on one benchmark; you should add as many benchmarks as you can, in case you had contamination on one of them. Although we try to avoid this, there's a much lower chance that you also had contamination on all the other benchmarks, and it also allows you to better understand how your model behaves if you add more evaluation benchmarks. I think that's just a good practice that everyone should be doing with their releases. With the StarCoder models, we also released some tooling, like a VS Code extension, which also has a membership test that tries to see if the generated code was in the training data and highlights that to the user. That's part of our code attribution efforts for these code models. Maybe you're interested in using these models to build your own personal Copilot and fine-tune StarCoder or Code Llama or other models on your personal codebases. To do that, there's a very nice blog post by Sourab and Sayak, where they take a code model, train it on the Hugging Face internal libraries, and then deploy it with Ollama to have a local code assistant. The pipeline is very similar to what we did in pretraining: first you take your dataset, you try to filter out the things you don't want to keep, then you do the deduplication and you train your model. In this case it will be just a fine-tuning, so it will be much quicker. You can use libraries like PEFT, which do parameter-efficient fine-tuning, where you don't need to train all the parameters of your model but only inject a few trainable parameters. This makes the training much faster; for example, a 7B model can be fine-tuned in a Google Colab. Now let's go back to evaluation. For LLMs, there's the Open LLM Leaderboard that evaluates models; there's also the LMSYS arena, which compares instruct models and uses human evaluation. For code models, one of the most popular benchmarks is HumanEval. It's basically a benchmark where there's a function that the model has to auto-complete, and when the function is completed, you take this solution and run it against multiple unit tests, and you count how many solutions pass and how many fail, and from that you compute a metric that we call pass@1. For example, that's the one that's reported in this leaderboard, and this gives you the HumanEval score. There's also a translation of this benchmark to 18 other languages; here I show Java, JavaScript and C++, and this benchmark is called MultiPL-E. It kind of allows you to see how well each model does on each programming language and choose the one that's the most interesting for you. But these benchmarks usually have an issue of contamination and overfitting, especially for instruction-tuned models. I don't know if you've already checked what these instruction datasets look like, but usually for code there's an instruction that asks the model to generate an exercise, and often, if you look at them, they look really similar to HumanEval, which is function implementations. So there's a very high chance of having contamination, which means having some files that look like HumanEval exercises in your instruction-tuning datasets.
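For context on the pass@1 numbers mentioned above, here is a minimal sketch of how such a metric is aggregated from unit-test results, using the unbiased pass@k estimator from the HumanEval paper; the sandboxed test harness itself is omitted, and the per-problem counts are made up.
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n completions sampled per problem, c of them passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem outcomes: (completions sampled, completions that passed).
results = [(20, 4), (20, 0), (20, 20), (20, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1: {score:.1%}")
```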
So here, for example, this plot is from the LiveCodeBench leaderboard, and they kind of find that models may be overfitting on HumanEval. Their solution was to have a leaderboard called LiveCodeBench, where they regularly scrape new problems from coding contest platforms like LeetCode, and they evaluate the models only on the problems that were released after the model's release date. This way they are sure that there is no contamination. For example, that was the case here: they evaluated these models on all the data they have, and then they compared that performance to the performance on the data that was released only after the model release, and they found that some models were not consistent in their results. So that's one interesting thing to keep in mind. And this is also another leaderboard that is interesting because it compares not just open models but also closed models like GPT-4, so you can see where the open-source community stands compared to these models. So that was my presentation. Thank you very much for your attention, and if you have any questions, I can answer them.
speaker 2: Yes. Thank you very much for the great, insightful talk. So we have some questions here on Slido. I'm not sure if there are any in-person questions, or else I will get started with the Slido questions.
speaker 1: Sure. Okay.
speaker 2: I guess not. So I'll ask some of the questions online. I think I had submitted some of these as well. It seems like there are some questions about synthetic data. Let me see. I was also wondering about this. So someone's asking: what are the consequences of training AI models on AI-generated synthetic data? Do you foresee any problems with this? And there's a related question: does synthetic data closely represent the natural distribution of language? I assume some low-quality data from humans is necessary for things like learning robustness and so forth.
speaker 1: Yeah, sure. These are very great questions. So about the consequences of training models on AI-generated data, I can think of two main ones. The first is reinforcing some biases, because models already have some biases, and if we train on data that is generated by them, we might be reinforcing them even more. The other thing is, for example, contamination: these models might generate content that looks like the evaluation benchmarks, and when you train on that, you will have contamination in your data. For example, one of the criticisms of the Phi models is that, because people did not see the synthetic data and the models were very good on the benchmarks, they were very skeptical: are these models really good, or are they just overfitting on the benchmarks? So I think contamination and reinforcing biases are the main things to keep in mind. And regarding synthetic data not being the same as the web distribution, I think that's a very good point. For example, when we were developing Cosmopedia, at first we found that it was worse than the web, which was surprising, because we had spent a lot of time trying to curate this dataset, which looks so much cleaner than the web. Then adding some web data and trying to add more topics helped us compensate for some of the gaps. But adding some web data always gives you a performance boost. So yes, there's some noise and some specific patterns in web data that will probably need to be included in the training mix to keep full coverage of what natural distributions look like.
speaker 2: So it sounds like you're saying a good training set would have a mix, potentially, of synthetic and natural data. Is that correct?
speaker 1: Yeah, I think so. Some experiments seem to show that that's the case, because you can try to spend some time carefully curating the topics, but you'll probably be missing out on some things, and the human intuition that we have is not always what works for trained models. It seems that keeping some filtered web data helps. And also, if you see the Phi technical reports, for example Phi-3, they insist a lot on filtering the web and including it in the pretraining. And I think that now seems like maybe the best way to go, if that makes sense.
speaker 2: Great. Another question is: is RLHF-type preference data more important than unsupervised pretraining data? Should we spend more resources on RLHF and data?
speaker 1: Yeah, that's a good question. So the unsupervised pretraining is mainly to get base models, but then if you want to use these base models as chat assistants, you need to do another step. You can do RLHF, but nowadays people are often just doing instruction tuning, without needing to go through RL, where you just train the model on pairs of instructions and solutions, and that seems to work very well. And there are now some methods that don't use reinforcement learning but work as well, for example DPO or ORPO. So I think if you want such a chat assistant, you definitely need to run a supervised training on top of the unsupervised one, but it doesn't necessarily have to be RLHF; there are some other algorithms now.
speaker 2: Great, great. And here's a multimodal question: does multimodal grounding, for example including images and videos along with the text, reduce the need for so much text-only data?
speaker 1: Yeah. What do you mean?
speaker 2: The question is asking, does multimodal grounding help? Basically, if you have images and videos along with the text, does this reduce the amount of text-only data required to train models?
speaker 1: I can't really answer that, because I haven't tried. But I guess, for example, in the multimodal models, for example Idefics, which we recently released, there's always a significant text portion. That seems to be the case for most vision-language models, but yeah, I don't really know about the percentages for each.
speaker 2: Right. Okay. A more general question. You probably touched upon some of this, but are there any major differences between training text versus code models, other than the training data being different?
speaker 1: Yes, that's a good question. So the training data is different. Regarding the training itself, we use a similar architecture; for example, StarCoder was like a Llama or Mistral architecture. I think one thing that you probably want is long context, because if you want to use these models, for example, in VS Code, then you want to add all the neighboring files in the context, so you should be able to fit a very large context. So we tried to do some long-context extension, but again, people also do this for LLMs. We also care a lot about inference: we used GQA, first MQA and then GQA, to have faster inference, but these are also techniques that are implemented for LLMs. So I'd say overall it's very similar, but maybe you should prioritize some things, like having a smaller model that can be used, for example, in an IDE and is faster, rather than a much larger model that would need more deployment resources.
speaker 2: Yeah, great. Here's also a general question; I guess they're asking for advice. So if you have a very tiny compute budget, for example a single GPU, what would you recommend prioritizing? Let's assume you are fine-tuning a model.
speaker 1: Yeah. So I think, for example, now there are some great solutions for on-device deployment and fine-tuning. For example, you can run quantized models with llama.cpp or other frameworks, and with techniques like PEFT you don't need to do full model fine-tuning, so you should be able to run this on one GPU, even with a 7B model. So I think you should just find a very well-curated dataset, because quality is more important than quantity, and then use one of these techniques for efficient fine-tuning, and it should work.
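A hedged sketch of the kind of single-GPU setup described in this answer, combining 4-bit quantization with a LoRA adapter via the Hugging Face transformers and peft libraries; the checkpoint name and hyperparameters are just examples, and a training loop would still be needed on top.
```python
# Example only: load a code model in 4-bit and attach a small LoRA adapter so
# that fine-tuning fits on a single GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoder2-3b"  # example checkpoint
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights get updated
```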
speaker 2: Right. Great. Here's a question asking, I guess different from pretraining, but they're saying: I'm guessing the optimal amount of training data depends heavily on the domain as well as the task at hand, right?
speaker 1: Yes, probably. Now we're following the Chinchilla scaling laws; I think they tried to compare English to code, and they found that the findings still hold, but maybe if you go to another domain, I don't know, like medical, things could change. That's why I mentioned the DeepSeek paper, where they mention that it's really heavily dependent on the data, and for them it was the same domain: they just changed datasets, going from one generic dataset to another, well-curated one, and things started changing. So I think this is probably the case, but it's underexplored how these scaling laws change depending on the domain, so it's good to be aware of that when developing models for domains that are not explored by these scaling laws.
speaker 2: Picking up on different domains, code versus text: someone's asking, what are some of the interesting differences between tokenizing for general-purpose text versus for code generation?
speaker 1: Yeah. So when we were training the tokenizer, I think one thing that was important was to keep numbers split into digits, and we used standard BPE and trained it on the dataset that we were using for the training data, so our code mixture. We did some analysis to see if there were any outliers, tokens that were underrepresented or overrepresented, as sanity checks. But overall, it's very close to training a tokenizer for text. And now most LLMs have a significant code portion in their tokenizer training, so they're also trained on a lot of code, and in the end you can use either a tokenizer for LLMs on code or the other way around, because even in code you have a lot of Markdown, so there's a lot of English, and you end up representing all the English tokens, for example, in your code tokenizer.
speaker 2: Great. And here's a question about fine-tuning, I guess compared to pretraining. So they're asking: do the same principles apply for fine-tuning, or would you make different or additional recommendations?
speaker 1: So yeah, for fine-tuning, I think when you're preparing the data it's probably a different thing. You're not going to train on all of the Stack; you probably want to continue training on a specific language. So maybe you'd invest more time to filter even more heavily, because for fine-tuning you don't need as much data as for pretraining. For example, for us, the filters we tried, like the stars one, didn't work, because they removed a lot of data and we did not have enough for our pretraining. But for fine-tuning, for example for instruction tuning, there was the LIMA paper, where they instruction-tuned on only a thousand instructions and they had a model that was much better than ones trained on millions of samples. So I think that data curation is even more important when it comes to fine-tuning.
speaker 2: Great, great. One last question. So you might have also touched upon this briefly, but what are some considerations to make when publishing very large datasets, and more nuanced or less-known things to be aware of?
speaker 1: Yeah. So maybe on the technical side, release tools for filtering, and documentation; that's what we tried to do with the Stack. And maybe more on the governance side, make sure the licenses are respected and the copyrights are respected, have an opt-out tool for your dataset, and maybe try to release it on the Hub to make it easily accessible for people. If there are some concerns, you could try to add gating. For example, for us, we released the dataset that we used for PII detection, but we had a gating mechanism because it contained sensitive information. So it's good to think of these kinds of things in advance before releasing a dataset. But yeah, in general, that's my advice.
speaker 2: Right. Great. Do we have any in-person questions? If not, then we can probably conclude.

Latest Summary (Detailed Summary)

Generated 2025-05-20 13:34

Executive Summary

This talk by Loubna Ben Allal, a machine learning engineer at Hugging Face, takes a deep dive into the process of pre-training large language models (LLMs) from scratch, using the code LLM StarCoder as the running example. The core message is that high-quality, carefully curated data is the foundation of a good LLM, mattering even more than model architecture and training tricks. The talk reviews the evolution of scaling laws, from the early Kaplan laws (which favored growing the parameter count) to the Chinchilla laws (model and data matter equally), to the current trend, driven by inference cost, of training smaller models on more data (e.g., Llama 3 trained on 15 trillion tokens).

Getting high-quality data involves three questions: how much data is needed, where to get it, and how to filter it. Data sources include web data (e.g., Common Crawl; the FineWeb dataset performs especially well), code data (e.g., the BigCode project's The Stack v1 and v2, the latter containing nearly 1 trillion tokens of code), and increasingly important synthetic data (e.g., the Phi series of models and the Cosmopedia dataset). Filtering is the critical step and covers language filtering, removal of low-quality samples, deduplication (which gave StarCoder a significant performance boost), personal identifiable information (PII) removal (via the StarPII model), and benchmark decontamination. The talk stresses validating filtering strategies with small "ablation model" experiments.

The StarCoder project serves as a case study of data governance (opt-out tools, PII removal) and transparency, with models such as StarCoder 2 15B performing strongly on code generation tasks. Finally, the talk discusses evaluation challenges for code LLMs, such as benchmark contamination, and mentions dynamic evaluation approaches like LiveCodeBench. The Q&A further explores synthetic data usage, the importance of RLHF, the impact of multimodal data, and fine-tuning strategies.

Introduction: The Quest to Train a Good LLM

The speaker, Loubna Ben Allal, opens with the core question: "What does it take to train a good LLM?" She notes that although the strong closed models (such as GPT-4) were once thought to hold some hard-to-reach "secret sauce", the open-source community has made huge progress and gradually uncovered the key ingredients for building strong LLMs.

Progress and Challenges of Open-Source LLMs

  • The rise of open models
    • Today, open models such as Llama 3 70B Instruct approach GPT-4 in performance, and because the weights are open they unlock many more use cases; the model can be quantized and even run on a consumer desktop.
    • More and more companies are embracing open releases (e.g., DeepMind's Gemma and Mistral AI's models).
    • LMSYS Arena evaluations show the performance gap between closed and open models shrinking from 2023 to May 2024.
  • Challenges facing open source
    • Many releases omit key details about how the data was processed and how the models were trained.
    • There are two main reasons:
      1. Avoiding legal scrutiny: publicly disclosing training data can expose a company to legal risk if the training was not done properly (e.g., copyrights were not respected).
      2. Maintaining a competitive edge: some companies want to keep the core details of their training methods.
    • Even so, by piecing together the many releases that do exist, the ingredients of a good LLM can still be worked out.

Key Ingredients for Training a Good LLM

Training a good LLM mainly depends on the following:

  1. Model architecture (Model)
    • Transformers have become the default choice.
    • Other interesting architectures include Mamba (a state space model) and mixture-of-experts (MoE) models.
    • The speaker notes that architectures are already thoroughly explored, so the lecture focuses on other aspects.
  2. GPUs
    • Not covered in depth.
  3. Data
    • Considered the "backbone of LLMs".
    • For a given budget, data is what makes some models better than others, so it is worth spending time exploring it and understanding how to obtain high-quality samples.

优质训练数据的获取策略

获取优质训练数据主要围绕三个问题展开:

数据量:缩放法则的演进与应用 (Data Volume: Evolution and Application of Scaling Laws)

缩放法则研究如何在计算预算、数据大小和模型大小之间进行最优分配。

  • Early scaling laws (Kaplan et al., OpenAI)
    • Found that with 10x more compute, model size should grow by about 5.5x while training tokens grow by only about 1.8x.
    • Conclusion: prioritize scaling the model rather than the data.
    • This led to relatively undertrained models such as GPT-3 (175B parameters trained on only 300B tokens); OPT and BLOOM followed the same pattern.
  • Chinchilla scaling laws (Hoffmann et al., DeepMind)
    • Revisited the scaling laws and found that Kaplan et al.'s conclusion was partly an artifact of using a fixed cosine learning-rate schedule across different data scales, which underestimated some models.
    • New conclusion: scale data and model size equally.
    • The Chinchilla model (70B parameters), trained on about 1.4 trillion tokens, outperformed the much larger Gopher (over 200B parameters) and GPT-3.
  • The post-Chinchilla era and inference cost
    • Llama 1's 7B model was trained on roughly as much data as the far larger Chinchilla (on the order of a trillion tokens), far beyond its own Chinchilla-optimal point.
    • The reason: "compute optimal is not always optimal", because training cost is paid once while inference cost is paid continuously.
    • Training a smaller model for longer adds training cost (a "compute overhead") but saves a great deal at inference time; Llama 3 went as far as 15 trillion tokens.
    • A blog post by Harm de Vries [uncertain; the transcript reads "Harden's Law"] explores this compute overhead, and Hugging Face provides a tool that, given a model size and dataset size, estimates the overhead and the inference savings relative to the Chinchilla-optimal point. For example, a 7B model trained on 1 trillion tokens may incur roughly a 13% training overhead but yield close to 50% inference savings (a back-of-the-envelope sketch of this trade-off follows after this list).
  • Strategies under data constraints
    • The paper "Scaling Data-Constrained Language Models" shows that when high-quality data is limited, repeating it for up to about 4 epochs still yields performance similar to training on fresh data. This matters for domains, such as code, where the available public data is nearly exhausted.
  • Data quality changes the scaling laws
    • The DeepSeek LLM work found that scaling behavior depends heavily on data quality, and fitted new scaling laws for its own dataset.
    • Conclusion: with a higher-quality dataset, it may be better to allocate more compute to model size rather than data size.

Data Sources: Web, Code, and Synthetic Data

  • Large-scale sources
    1. Web data
      • Usually starts from Common Crawl (a public repository of web crawls) and requires heavy filtering at scale. The latest dump alone is over 400 TB, and there are close to 95 dumps in total.
      • Pre-filtered datasets can be used instead, such as Hugging Face's recently released FineWeb (about 15 trillion tokens), which performs best among public web datasets (compared against C4, RefinedWeb, SlimPajama, The Pile).
    2. Code data
      • The Stack v1: 6 TB of permissively licensed code. Over 130 million repositories (about 100 TB of data) were cloned from GitHub, then filtered by file extension and license (keeping only permissive licenses such as Apache 2.0 and MIT) and deduplicated, yielding about 3 TB (roughly 200B tokens). An opt-out tool is provided.
      • The Stack v2: larger and better. Sourced from Software Heritage (a code archive) and filtered down to close to one trillion tokens. Adds high-quality sources such as GitHub issues, pull requests, and StackOverflow [uncertain; the transcript reads "matphone code datsets", possibly math-and-code datasets]. Roughly 4-5x larger than v1.
  • Other high-quality but smaller sources
    • Wikipedia, books, ArXiv, Stack Exchange.
  • Synthetic data
    • Has become very important in recent years and is expected to matter even more.
    • Microsoft's Phi models (e.g. the paper "Textbooks Are All You Need") used GPT-3.5 and GPT-4 to generate synthetic textbooks as a pre-training corpus, outperforming models trained on web datasets.
    • Many popular LLMs (e.g. Claude 3, Llama 3) already include synthetic data in pre-training. Llama 3 used models to build classifiers for filtering samples and generated synthetic content to improve coding, reasoning, and long-context abilities.
    • Hugging Face's Cosmopedia is one of the largest synthetic text datasets, with about 25B tokens generated with the open model Mixtral 8x7B. The generation strategy: about 80% of the data is seeded from web samples that prompt the model to write a related textbook, with extra context used to constrain the scope; the rest is seeded from curated sources such as Stanford courses and WikiHow (a minimal sketch of this seeding strategy follows below).
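
Below is a minimal sketch of the Cosmopedia-style recipe: take a web snippet as a seed and ask an open instruct model to expand it into a textbook-style passage. The model id, prompt wording, and audience framing are illustrative rather than the exact Cosmopedia prompts, and it assumes access to the Hugging Face Inference API.

```python
from huggingface_hub import InferenceClient

# Illustrative choice; Cosmopedia used Mixtral-8x7B-Instruct for generation.
client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

web_seed = (
    "Binary search cuts the search range in half at every step, "
    "so it finds an element in a sorted list in O(log n) comparisons."
)

# Seed the generation with the web extract and constrain the topic,
# mirroring the "web sample -> related textbook" strategy described above.
prompt = (
    "[INST] Here is an extract from a web page:\n"
    f"{web_seed}\n\n"
    "Write a clear, self-contained textbook section for college students "
    "that covers the topic above in depth, with a worked example. "
    "Stay close to the topic of the extract. [/INST]"
)

textbook_page = client.text_generation(prompt, max_new_tokens=800, temperature=0.6)
print(textbook_page)
```

Run at scale over millions of seeds (and with varied audiences and formats), this kind of loop is how a multi-billion-token synthetic corpus is assembled.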

Data Filtering and Cleaning: The Key to Data Quality

"A high-quality dataset might exhibit very advanced capabilities for a standard architecture." (quoted from the Yi paper)

  • Example of a generic filtering pipeline (Yi paper)
    1. Language filtering.
    2. Removal of low-quality samples (e.g. too many repeated lines, rule-based fixes, perplexity filtering).
    3. Deduplication: very important to avoid memorization and performance degradation; includes exact deduplication and near-deduplication (e.g. MinHash; a minimal MinHash sketch follows after this list).
    4. Further filtering (e.g. semantic filtering, topic filtering).
  • FineWeb's success is also attributed to its careful filtering and deduplication.
  • How to find good filters
    1. Inspect the data manually to understand what it actually looks like.
    2. Ablation studies
      • Train a small model on a subset with a given filter applied and compare against the same subset without it.
      • Experience shows that data ablation results on small models mostly transfer to larger models.
      • Choose high-signal benchmarks (e.g. HellaSwag, MMLU) so the effect shows up early.
      • Train with several random seeds to reduce noise and reach more robust conclusions.
      • The FineWeb authors ran more than 200 ablations (1B-parameter models trained on 30B tokens).
    3. Case study: the StarCoder team tried filtering on comment density and repository stars.
      • Removing files with almost no comments gave only a marginal improvement.
      • Removing files from repositories with fewer than 5 stars discarded 70% of the data and produced the worst model of all the ablations.
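
The sketch below shows near-deduplication with MinHash signatures and locality-sensitive hashing, in the spirit of the pipelines above. Real pipelines (for example the BigCode tooling) shard this across many workers and tune the shingle size and similarity threshold per dataset; the values here are illustrative only.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, ngram: int = 5) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m

func = '''def moving_average(values, window):
    """Return the simple moving average of a list of numbers."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
'''

documents = {
    "doc_a": func,
    # Near-duplicate: only one docstring word differs from doc_a.
    "doc_b": func.replace("simple moving average", "simple rolling average"),
    "doc_c": "class Stack:\n    def __init__(self):\n        self.items = []\n",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard cut-off
kept = []
for doc_id, text in documents.items():
    sig = minhash(text)
    if lsh.query(sig):        # a similar document was already kept -> drop this one
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print("kept:", kept)  # doc_b is dropped as a near-duplicate of doc_a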

StarCoder Case Study: Behind the Scenes of a Code LLM

The Filtering Pipeline for The Stack

  • StarCoder (v1): 800 GB of training data was selected from the 6 TB of source code in The Stack v1.
  • StarCoder 2: started from 3.2 TB of The Stack v2 (600+ programming languages), then filtered and augmented (adding GitHub issues, commits, Jupyter notebooks, Kaggle notebooks [uncertain; the summary reads "Cargo Notebooks"], and pull requests) to reach 6.3 TB of code.
  • Code filtering steps
    1. Language selection: keep popular languages and drop ones that are no longer maintained. StarCoder 2 covers more than 600 languages.
    2. Data quality checks: remove low-quality files and auto-generated content. For example, a very high average line length can signal a problematic file. Thresholds differ per language (BigCode community members helped review samples and set them).
    3. Near-deduplication: the filtering step with the largest performance gain, and it is language-agnostic. On StarCoder's Python subset, near-deduplication raised Pass@1 from 13% to 17%.
    4. Personally identifiable information (PII) removal
      • Code can contain names, emails, keys, passwords, and other sensitive information.
      • Approach: work with an annotation company to label PII samples -> train an NER model (StarPII) to detect PII -> run StarPII over the entire StarCoder training set (about 800 A100 GPU-hours). A redaction sketch follows after this list.
    5. Data decontamination: make sure evaluation benchmarks and test sets are removed from the training data so evaluation scores are not inflated.
  • Data formatting
    • Special tokens such as the repository name, file name, and star count are prepended to each code file. Among other things, the file-name extension helps the model infer the language inside an IDE (e.g. VS Code).
    • The team experimented with steering generation quality via the star-count token but found no significant difference.
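
The following is a minimal sketch of PII redaction with an NER model, following the approach described above. It assumes the StarPII checkpoint is available on the Hub as "bigcode/starpii" (access may be gated); the score threshold and placeholder format are illustrative, not the exact BigCode redaction pipeline.

```python
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",        # assumed checkpoint id; may require gated access
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

def redact(code: str, min_score: float = 0.5) -> str:
    """Replace detected PII spans with a placeholder tag such as <EMAIL>."""
    entities = pii_detector(code)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= min_score:
            code = code[:ent["start"]] + f"<{ent['entity_group']}>" + code[ent["end"]:]
    return code

snippet = 'SMTP_USER = "jane.doe@example.com"  # contact Jane Doe for access'
print(redact(snippet))
```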

StarCoder 2 Improvement: Repository Awareness

  • StarCoder v1 shuffled files during training, so the model had no notion of which files belonged to the same repository.
  • StarCoder 2: files from the same repository are kept adjacent in a training sample, separated by special tokens (e.g. a file-separator token). This lets the model see cross-file context and potentially learn relationships between files within a repository (see the sketch below).
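
Here is a minimal sketch of repository-aware sample packing as described above: files from one repository are concatenated into a single training sample, separated by a special token, with repository metadata prepended. The exact special-token names used by StarCoder / StarCoder 2 may differ; these are placeholders to show the idea.

```python
# Placeholder token names; real checkpoints define their own special tokens.
REPO_NAME_TOKEN = "<repo_name>"
FILE_SEP_TOKEN = "<file_sep>"
EOS_TOKEN = "<|endoftext|>"

def pack_repository(repo_name: str, files: dict[str, str]) -> str:
    """Turn {path: content} for one repository into a single training sample."""
    parts = [f"{REPO_NAME_TOKEN}{repo_name}"]
    for path, content in files.items():
        # Each file is prefixed by its path so the model can condition on it.
        parts.append(f"{FILE_SEP_TOKEN}{path}\n{content}")
    return "".join(parts) + EOS_TOKEN

sample = pack_repository(
    "acme/calculator",
    {
        "calculator/core.py": "def add(a, b):\n    return a + b\n",
        "tests/test_core.py": "from calculator.core import add\n\nassert add(1, 2) == 3\n",
    },
)
print(sample)
```

Packing this way is what lets the model learn, for instance, that a test file can reference symbols defined in a neighboring source file.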

Relevant Tools and Resources

  • General LLM training: Hugging Face's Nanotron, a framework for training LLMs with 3D parallelism (tensor, pipeline, and data parallelism), plus LightEval for evaluation.
  • Code LLM training: the BigCode project has open-sourced the code used to process The Stack and to train the StarCoder models on GitHub.

Overview of Code LLMs

History and Current Landscape

  • The release of GitHub Copilot (built on OpenAI's Codex model) was a milestone: it showed that code can be treated as text and learned by a large Transformer trained on lots of code, far outperforming earlier approaches (e.g. AST-based methods).
  • Today there are more than 1,700 code models on the Hugging Face Hub (pure code models or general models trained with code data).
  • On the HumanEval code benchmark, some strong models now score close to 80%.

Contributions and Principles of the BigCode Project

  • Main contributors
    • BigCode: released The Stack (now the default dataset for code training), the StarCoder 1 and StarCoder 2 model families, and StarChat 2 (an instruction-tuned model built with the Hugging Face H4 team).
    • Meta: the Code Llama family (7B to 70B).
    • DeepSeek: the DeepSeek Coder family.
    • Others: IBM's Granite models, CodeQwen [uncertain; the transcript reads "code quen"], CodeGen, StableCode.
  • BigCode's collaboration principles
    • Full data transparency: training details and data are public.
    • Reproducibility: processing code and model weights are provided.
    • Open collaboration: over a thousand researchers have taken part.
    • Responsible releases
      • Open dataset access (including data inspection tools).
      • An opt-out tool.
      • PII removal.
      • Evaluation tooling and detailed technical reports.

Evolution and Evaluation of the StarCoder Models

  • Evolution: SantaCoder (used for ablations) -> StarCoder (released last year, 15B) -> StarCoder 2 (released this year, more languages, higher evaluation scores).
  • Transparency: StarCoder was rated the most transparent model by the Stanford Foundation Model Transparency Index.
  • Performance
    • StarCoder 15B was the state-of-the-art code model at release.
    • StarCoder 2 15B is state of the art in its size class and approaches or matches much larger models (matching Code Llama 34B and coming close to DeepSeek Coder 33B on some benchmarks).
    • The speaker stresses evaluating on multiple benchmarks to understand model behavior broadly and to reduce the risk of contamination on any single benchmark.
  • Tooling
    • A VS Code integration includes a membership test that tries to detect whether generated code appears in the training data and highlights it, as part of the code-attribution effort.

Building a Personal Code Assistant

  • Models such as StarCoder or Code Llama can be fine-tuned on a personal codebase.
  • Workflow: data preparation and filtering -> deduplication -> training with PEFT (parameter-efficient fine-tuning, e.g. LoRA; a 7B model fits on Google Colab) -> deployment (e.g. with Ollama). A minimal sketch follows below.
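
Below is a minimal QLoRA-style sketch of the parameter-efficient fine-tuning step in that workflow. The model id, target modules, and hyper-parameters are illustrative assumptions; the right choices depend on the base model and the dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "bigcode/starcoder2-7b"   # any causal code LLM would work here

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # 4-bit weights so a 7B model fits on one GPU
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # typically well under 1% of the base model

# From here, train with the usual transformers Trainer (or trl's SFTTrainer)
# on the deduplicated, filtered personal codebase, then merge or serve the adapter.
```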

Evaluation Challenges for Code LLMs

  • Common benchmarks
    • HumanEval: the model completes a function and is scored with unit tests, reported as Pass@1 (see the estimator sketch after this list).
    • MultiPL-E: a multilingual version of HumanEval (18 languages, e.g. Java, JavaScript, C++).
  • Problems
    • Contamination and overfitting, especially for instruction-tuned models, whose instruction data can closely resemble HumanEval-style exercises.
  • Directions being explored
    • The LiveCodeBench leaderboard regularly scrapes fresh problems from platforms such as CodeContests and LeetCode and evaluates each model only on problems published after its release date, ensuring no contamination.
    • Some models perform inconsistently between the full problem set and the "uncontaminated" subset.
    • Such dynamic benchmarks also help measure the true gap between open models and closed models such as GPT-4.
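
HumanEval-style scoring reports pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below uses the standard unbiased estimator from the original Codex/HumanEval paper, given n samples per problem of which c pass; the example numbers are made up.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled completions with c correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 completions for one problem, 30 of them pass the tests.
print(round(pass_at_k(n=200, c=30, k=1), 3))   # 0.15 -> this problem's Pass@1
```

The benchmark score is the average of this quantity over all problems.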

Selected Q&A

  1. What are the consequences of AI-generated synthetic data, and how does it differ from the natural language distribution?
    • Consequences: it can reinforce model biases, and it can produce content resembling evaluation benchmarks, causing contamination (a criticism raised against the Phi models).
    • Distribution: models trained purely on synthetic data may underperform models trained on a mix that includes real web data. Cosmopedia experiments found that adding some web data improved performance. For now, a mix of well-filtered web data and synthetic data seems to work best; the Phi-3 technical report also emphasizes filtering web data and keeping it in pre-training.
  2. Is RLHF preference data more important than the unsupervised pre-training data?
    • Unsupervised pre-training produces the base model. Turning it into a chat assistant requires extra steps such as RLHF or, more commonly, supervised instruction fine-tuning (SFT); non-RL methods such as DPO also work well. So if the goal is a chat assistant, some form of supervised training on top of the pre-trained model is necessary.
  3. Can multimodal data (images, video) reduce the need for pure text data?
    • The speaker is unsure, but notes that even in multimodal models (e.g. Hugging Face's Idefics) text still makes up a significant share of the training data.
  4. What are the main differences between training a text LLM and a code LLM, aside from the data?
    • The training architecture is similar (StarCoder uses a Llama/Mistral-like architecture).
    • Code LLMs may put more weight on long-context ability (to handle several related files in an IDE) and fast inference (e.g. GQA), but these techniques are also used in general LLMs. Overall very similar, though code LLMs tend to favor smaller models that can be deployed quickly.
  5. What should be prioritized with a tiny compute budget (e.g. a single GPU), assuming fine-tuning?
    • Fine-tune a quantized model (e.g. via llama.cpp) with PEFT techniques such as LoRA; a 7B model fits on a single GPU.
    • The key is a small, carefully curated, high-quality dataset, since quality matters more than quantity.
  6. Does the optimal amount of training data also depend on the domain and task?
    • Yes. The Chinchilla work compared English and code and found its conclusions held, but other domains (e.g. medical) may differ. The DeepSeek paper shows that scaling laws are highly data-dependent (even within one domain, datasets of different quality lead to different scaling laws). This matters for domains not covered by existing scaling-law studies.
  7. Any interesting differences between tokenizers for general text and for code?
    • When training the tokenizer, make sure numbers are split properly. StarCoder's tokenizer was trained on its code data mix (a small inspection sketch follows after this list).
    • Otherwise it is similar to training a text tokenizer. Most LLM tokenizers today are trained on large amounts of code, so they contain many code tokens; conversely, code data contains plenty of natural language (e.g. Markdown), so code tokenizers also represent English tokens well.
  8. How do the principles of fine-tuning differ from pre-training, and any extra advice?
    • The data preparation stage differs the most. Fine-tuning usually targets a specific language or task and needs far less data than pre-training.
    • Data curation matters even more for fine-tuning. For example, the star-count filter that failed in StarCoder pre-training (because it removed too much data) could work for instruction tuning, where data requirements are much smaller (similar to the LIMA paper [uncertain; the transcript reads "dilemma paper"], which achieved strong results with only about a thousand high-quality instruction examples).
  9. What should one watch out for when publishing a very large dataset (especially the subtle or lesser-known aspects)?
    • Technical side: release the filtering tools and documentation (as was done for The Stack).
    • Governance side: make sure licenses and copyrights are respected, provide an opt-out tool, and publish on a platform such as the Hugging Face Hub for easy access.
    • Sensitive data: for datasets containing sensitive information (such as StarCoder's PII-detection dataset), consider gated access.
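
A quick way to inspect the tokenizer difference raised in Q&A item 7 is to tokenize the same snippet with a code-model tokenizer and a generic text tokenizer and compare how digits are split. The checkpoints below are examples only; the exact splits depend on how each tokenizer was trained, so run it rather than relying on any hard-coded expectation.

```python
from transformers import AutoTokenizer

code_snippet = "timeout = 12345  # milliseconds"

# Compare an open code-model tokenizer with a generic text tokenizer.
for name in ["bigcode/starcoder2-15b", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.tokenize(code_snippet))
```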

Summary of Core Viewpoints

The talk's core message is that training a strong LLM, especially for a domain such as code, comes down to data: understanding how much data is needed (the evolution of scaling laws, now weighing inference cost), where to get it (web, code, synthetic data), and, most importantly, how to ensure its quality through careful filtering, deduplication, and decontamination. The StarCoder project demonstrates a transparent, responsible data-processing and model-development pipeline. The open-source community is catching up quickly, and data strategy is the key differentiator of model quality.