speaker 2: Hello, thank you for joining CS25 Transformers United for the last class. Today we have Loubna, who is a machine learning engineer on the science team at Hugging Face, working on large language models for code and synthetic data generation. She's part of the core team of the BigCode project and has co-authored the Stack dataset and the StarCoder models for code generation. Thank you so much for coming to our talk today. As always, the attendance link and the Slido questions are on our website, and we'll be taking questions after the talk. Thank you, and you can take it away now.
speaker 1: Hi, thank you for the introduction. Cool. So I'm Loubna, I'm a machine learning engineer at Hugging Face on the science team, and today I'll tell you about the behind the scenes of training large language models, using the StarCoder model that our team has trained as a use case. Today's plan is very simple: we're going to try to answer one question, what does it take to train a good LLM? It's one question, but it's very loaded and it has a lot of follow-ups, and as you will see, my slides will be a series of questions and answers. A few years ago, a lot of people thought that there was some magic secret sauce behind the strong closed models like GPT-4, and that it would take the open source community a lot of time to catch up, because the open source models that we had back then were much smaller and less performant. But now it seems that the community has figured out most of the pieces for getting strong LLMs, as was predicted in the Google memo that was leaked and published on SemiAnalysis. For example, today we have Llama 3 70B Instruct, which has almost the same performance as GPT-4, but it unlocks so many use cases because the model weights are open: the model can be quantized and can even run on a consumer desktop, and it allows the community to build very cool use cases on top through fine-tuning. So we've made a lot of progress in the open field, and this is not the only model that's out there. We're now observing kind of a rise of open LLMs, and more and more companies are embracing releasing models. That was the case, for example, with DeepMind's Gemma models, with Mistral's models, and also other models from Databricks and Cohere. Here I put a plot from the LMSYS Arena, which is kind of the go-to leaderboard for comparing instruct models nowadays; it uses human evaluation. You can see in this plot that as we went from 2023 to May 2024, the gap in performance between the closed models and the open models has been shrinking, which is very promising. So we're on a very good path, but there are still a lot of limitations, and this is mainly due to releases missing important details about how the data was processed and how the models were trained. This is usually the case for two main reasons. The first one is to avoid legal scrutiny, because when companies publicly disclose the training data, if the training was not done properly and copyrights were not respected, they risk facing a legal investigation. The other reason for not disclosing the details can be to maintain a competitive edge: some companies want to be the best at training LLMs, so they don't want to give away all the details of their training. Nevertheless, because we have a lot of releases, I think we can still answer this question and put a lot of the pieces together. So what do we need to train a good LLM? The first thing is probably the model.
You need to have a good architecture. I think transformers are now kind of the default, but there are also other interesting architectures like Mamba, which is a state space model, or you can use a mixture of experts, which is kind of multiple transformer models. But I'm not going to spend a lot of time in this lecture on models, because I think it's a topic that's already thoroughly explored, and there are other aspects that maybe deserve a little bit more attention. So that's all I'll say about models. Then for GPUs, I don't think there's much I can tell you about that, except maybe go ask Jensen. The part that I'm the most interested in is data, which I think is the backbone of LLMs, because now almost everyone is using the same architectures and the same training techniques, and for a given budget, data is what makes some models better than others. So it's really worth spending time exploring this data and understanding how to get higher quality samples. So now we're going to try to answer our previous question of how to train a good LLM by asking: how do we get good training data? I think the answer to this is threefold. First, we need to understand how much data we need. Then, once we've figured out the size of the data, where can we get it? And to clean it, which filtering techniques make the most sense and will give us the best performance? The answer to the first one is the scaling laws: you want to know how much data to train your model on, but also what the optimal size of the model is, and the scaling laws study the allocation of a compute budget between data size and model size. This means: should you take a smaller model and train it longer, or take a larger model and train it on less data? I'm going to present a brief history of the scaling laws, because I think it's really interesting to see how the sizes of the models have progressed through time, and also how the sizes of the datasets and the number of tokens the models were trained on have changed, because there were some really drastic changes. I think the first to establish the scaling laws were Kaplan et al. at OpenAI, who tried to fit the loss as a function of the data size and the model size. They found that if you have a ten times increase in your compute, you should increase your parameter count by about 5.5 times, but your training tokens only by about 1.8 times. This means that if you have more resources to train your models, you should make the model much larger, while the data is fine and you shouldn't increase it that much. This is what led to models like GPT-3, which has 175 billion parameters but was only trained on 300 billion tokens, which, if we think about it now, is really small. Other models also followed this, for example OPT, which was the same size as GPT-3 and trained on a similar amount of data, and there was also BLOOM. So all these models are actually very undertrained. Then the Chinchilla scaling laws came after and revisited this. They found that the reason Kaplan et al. concluded that data should not be scaled as much as model size is that they used a fixed cosine scheduler for all their experiments. So although they were changing the data size, the cosine scheduler was fixed. This meant that some models' performance was underestimated, because they were not using the cosine schedule length that corresponded to their data size, and this led to kind of false conclusions.
Chinchilla gave us new scaling laws that say you should scale your data and your model size equally. In their paper, they trained a 70 billion parameter model on 1.4 trillion tokens, which is the Chinchilla optimal point, and it outperformed much larger models like GPT-3 and Gopher, which was over 200 billion parameters. Here, for example, I have a plot which shows what the scaling laws try to do: you have IsoFLOP curves, where each curve uses a fixed FLOPs budget, and you try to find the sweet spot, which is the optimum for your budget allocation. It tells you what your model size should be and what your data size should be, and as you can see, if we fit these, there's a roughly linear increase for both data and model size. In this slide I try to show how we've moved from the Chinchilla scaling laws to today's models. You can see that, for example, the Chinchilla model, which is 70 billion parameters, was trained on less than 2 trillion tokens. But then after that we have Llama, which was released last year; it was just a 7B model, and it was trained on about as much data as the Chinchilla model, so it was trained way past the Chinchilla optimal point. We might rightly wonder why that is the case: did Meta not use their compute budget in an optimal way? The answer is that compute optimal is not always optimal, because when you train a model you don't only care about what you're going to spend on training, you also care about inference. The model is trained one time, but the inference cost is paid for as long as the model is being served, so you want to save cost there. This is why people prefer training smaller models for longer rather than using much larger models trained on less data. This was the case for Llama 1 and for other models like Mistral, but also for Llama 3, which went even further and trained not on 1 trillion tokens but on 15 trillion tokens. And if you check the paper, the loss kept going down, and the downstream evaluations kept improving as the model kept training. I think this is really interesting, because some people misunderstood the Chinchilla scaling laws as saying that compute optimal is optimal, but that's not the case, because the inference cost is not considered. For example, the cost of training GPT-4 is estimated at $100 million, but the inference is also very expensive, and the larger the model becomes, the more time it takes to process tokens. In short, the scaling laws don't take the inference cost into consideration, and if we do take the inference cost into account, which is the case for most people because they want to use these models in inference, we might prefer using smaller models and training them longer. When we do that, we're not respecting the Chinchilla scaling laws; we're choosing to pay what we call a compute overhead. It's kind of a sacrifice you make during training: you choose to pay more, but this has a benefit during inference, because you will save a lot of cost and money. There's a very interesting blog post by Harm de Vries about this, which tries to measure the compute overhead that you will be paying when you choose to train a smaller model for longer. For example, there's a Space on Hugging Face where you can input the model size and the dataset you want to train on, and it will show you where you are relative to the Chinchilla optimal point.
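To make the numbers concrete, here is a small back-of-envelope sketch of that tradeoff. It relies on the common rule-of-thumb approximations (training compute C ≈ 6·N·D and roughly 20 tokens per parameter at the Chinchilla optimal point), not the exact fitted constants from the paper, so treat it as illustrative only.

```python
# Rough Chinchilla-style allocation, assuming C ≈ 6*N*D (training FLOPs)
# and D_opt ≈ 20*N_opt (tokens per parameter at the compute-optimal point).
# These constants are rules of thumb, not the fitted values from the paper.

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly balance a given compute budget."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / 120)
    n_opt = (compute_flops / 120) ** 0.5
    return n_opt, 20 * n_opt

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Example: a 7B model trained on 1T tokens (roughly the Llama 1 recipe)
n, d = 7e9, 1e12
c = training_flops(n, d)              # ~4.2e22 FLOPs
n_opt, d_opt = chinchilla_optimal(c)  # ~19B params, ~370B tokens
print(f"compute budget: {c:.2e} FLOPs")
print(f"compute-optimal split: {n_opt/1e9:.0f}B params, {d_opt/1e9:.0f}B tokens")
print(f"tokens per parameter actually used: {d/n:.0f} (vs ~20 at the optimum)")
```

Running it for a 7B model on 1 trillion tokens shows you are training far past the roughly 20 tokens-per-parameter point, which is exactly the "pay a compute overhead during training to save on inference" regime discussed above.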
So for example, if we take a 7B model and train it on 1 trillion tokens, you can see where we land: it's the red dot, before the Chinchilla optimal model. This gives approximately, I think, a 40% overhead, sorry, a 13% overhead, but, as the table shows, almost 50% savings in inference cost. So that's something almost everyone is doing now, which is why we see models that are much, much smaller than one or two years ago. For further reading, there are some very interesting papers about scaling laws. For example, the paper called Scaling Data-Constrained Language Models shows that if you are limited in your data size, let's say you want to train a 7B on 10 trillion tokens but you don't have those 10 trillion tokens, you can basically repeat your data up to four times and get similar performance as if you had used unique tokens. For example, instead of using 8 trillion unique tokens, you could use just 2 trillion and repeat them four times, and you get almost the same performance as if those tokens were unique. This is especially useful for domains where we have almost exhausted the publicly available data. As I will show you later, the Stack v2, which is a code dataset that we released, has almost all the code that is publicly available, so it's going to be very hard to scrape and get more code, and if you want to train models longer, the only option is to repeat the data during training. This is good news, because being able to repeat the data up to four times is significant. Another paper that I think is interesting when it comes to scaling laws is the DeepSeek LLM one. They tried to establish new scaling laws suited to their data, because they found that the scaling behavior is highly dependent on data quality: they tried different data subsets and different filterings, and they found that the scaling laws changed. This is very important, because up until now we were using Chinchilla, but Chinchilla used fixed datasets that are not necessarily the ones we are using now, so it's really important to be aware of that. This is why DeepSeek tried to come up with their own scaling laws that work for their datasets, and they also conclude that when you have higher quality datasets, maybe more compute should be allocated to the model size and not the data size. So these are interesting things to keep in mind when it comes to scaling laws. With that, we have answered the first question, I hope: how much data to train LLMs on. So let's say you now have your compute budget, a fixed number of GPUs for a certain number of days, and you also know approximately how much data you want to use. The question is, where do we find this data? For example, Llama 3 was trained on 15 trillion tokens, but where do you get 15 trillion tokens? That's a huge amount. The two main sources where you can get very large volumes of data are the web and then code. There are some other curated sources that are of high quality but much smaller, like Wikipedia, books, arXiv or Stack Exchange. You can also use a new type of data that's been very trendy recently, which is synthetic data. But let's first start with the sources where you can get very large volumes. The first one is web data, so basically web pages.
Usually, the people who create these datasets start from Common Crawl, which is a public repository of crawled web pages. Common Crawl crawls pages regularly and publishes dumps every few months. If you start from there, you will need to do some heavy filtering at a very large scale: just the latest dump is over 400 terabytes, and there are almost 95 dumps. So that's not an easy task, and you will need a lot of resources and a team to be able to do it. The other option is to use an existing filtered web dataset: other researchers have already filtered Common Crawl and released the result, and luckily we do have datasets that are very large and well filtered. One of them is FineWeb, which was recently released by Hugging Face and has 15 trillion tokens of web data. It's not just a large dataset; it also has the best performance among the publicly available datasets. Here, for example, the plot shows the performance, which is an aggregation over multiple popular NLP benchmarks like HellaSwag, MMLU, PIQA and others; it averages them and compares to other datasets like C4, RefinedWeb, SlimPajama and The Pile. So that was for the web: you can get 15 trillion tokens there. Then for code data, we released the Stack dataset, which is the largest dataset of open source code. It comes in two versions. Version one consisted of 6 TB of permissive code. The way we built it is that we first cloned all the public repositories on GitHub, which gave us over 130 million repositories and around 100 TB of data. But we don't want all of that data, because a lot of it can be configs or extensions that we don't need, or languages that are no longer maintained, so we did some file extension filtering and ended up with almost 90 TB of data. After that, we filtered repositories based on their licenses: there are permissive licenses like Apache 2.0 or MIT, and more restrictive licenses like GPL. We filtered out all the repositories that did not have a permissive license, and after that we did deduplication to remove files that are similar, ending up with almost 3 TB of deduplicated data. The Stack also comes with a very cool opt-out tool. It's basically a Space where you can type your GitHub username, and it tells you if any of your GitHub repositories are in the dataset. If that's the case, there's also an option to fill out a form and request to be removed from all future BigCode trainings. We did that for the Stack v1 and also for the Stack v2. The v2 is a much larger and enhanced dataset compared to the v1. This time, instead of cloning GitHub repositories, we went through Software Heritage, which is an archive of code: they had already done the scraping, and we just extracted the data from their archive. After all the filtering, we ended up with almost 1 trillion tokens, which is a lot compared to the v1, where we got around 200 billion tokens at the end. We also added some high quality resources like GitHub issues, math and code datasets, and pull requests. These datasets, the Stack v1 and the Stack v2, can be used to train LLMs on code, or to train general LLMs that include code as a subset of the general web data. This slide shows how the Stack v2 compares to the v1: before filtering it's almost ten times larger, and after filtering it's four or five times larger. So I've talked about how to get web data and how to get code data.
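If you want to look at this data yourself, here is a minimal sketch of streaming a slice of the Stack from the Hugging Face Hub. It assumes the bigcode/the-stack layout with one data_dir per language and a content column holding the raw file text; the dataset is gated, so you may need to log in and accept its terms first, and any other column names should be checked against the dataset card.

```python
# A minimal sketch for peeking at The Stack without downloading all of it.
# Assumes the "bigcode/the-stack" layout (one data_dir per language, a
# "content" column with the file text); check the dataset card for the
# exact schema, and note the dataset is gated behind a terms-of-use page.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # one subdirectory per programming language
    split="train",
    streaming=True,           # iterate lazily instead of downloading terabytes
)

# Drop very large files and print the first few remaining ones
small_files = ds.filter(lambda x: len(x["content"]) < 100_000)
for i, example in enumerate(small_files):
    print(example["content"][:200], "\n" + "-" * 40)
    if i == 2:
        break
```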
The third source I mentioned is synthetic data. It's only this year and last year that synthetic data became really important for LLM pretraining, and I think in the next few years it will become even more important. This was mainly sparked by the Phi series of models from Microsoft. The first paper was called Textbooks Are All You Need: they generated synthetic textbooks using GPT-3.5 and GPT-4 and tried to build a new pretraining corpus that is synthetic, and they were able to match or outperform models that were trained on web datasets, even though the model was trained almost entirely on synthetic data. Now some of the very popular LLMs use synthetic data as part of their pretraining mix. For example, the Claude 3 model card says they generate data internally and include it in the pretraining. This is also the case for Llama 3, where they used LLMs to build classifiers that would score samples and keep only the high quality ones, but they also generated synthetic content to improve performance on coding, reasoning and long context. So synthetic data is a fairly new topic, but it seems really interesting, and I'm personally also working on it at Hugging Face. We recently released a dataset called Cosmopedia, which was the largest dataset of synthetic text, with almost 25 billion tokens. Instead of using closed models like GPT-4, it used an open source model, Mixtral 8x7B. We also released a blog post that explains how we created this dataset. It can be very tricky to get diverse samples, so we used an approach where 80% of the prompts are seeded with web data: we take web samples and build new prompts that ask the model to generate textbooks related to those web samples, while giving it more context so we can steer the generations. For example, we can have a topic, say mathematics, and web samples that are related to mathematics, and each time we give the model a prompt: generate a textbook in the field of mathematics that is related to this web sample. The more web samples we add, the more diversity we get. We also used some curated sources like Stanford courses and WikiHow, where we used extracts from these pages to ask the model to generate content related to them. You can find more details in the Cosmopedia blog post. So I guess now we also have the answer to our second question, which was where to find the data. We have one question left: how can we filter this data? Because if you use Common Crawl, you need to filter it, and even with the Stack, we did not train our models on the Stack directly; we did a lot of filtering to get a dataset that is smaller but has higher quality. Here I will cite a slide from Thomas Wolf's presentation, which is very interesting by the way, you can find it online. It's from the Yi paper, where they state that a high quality dataset might exhibit very advanced capabilities for a standard architecture. This is actually the focus of many recent papers, and you can see that in model releases the sections about datasets are becoming smaller and smaller, because people are realizing that the dataset is actually the backbone, the thing that makes some models much better than others.
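Stepping back for a second to the Cosmopedia seeding approach mentioned above, here is a rough illustration of it before we move on to filtering. The prompt wording is made up for illustration (the real Cosmopedia prompts are in the released dataset and blog post), the Mixtral model ID is just one possible choice, and it assumes you can reach that model through the Hugging Face Inference API.

```python
# A rough sketch of Cosmopedia-style seeding: take a web extract on a topic
# and ask an open model to write a textbook section grounded in it.
# The prompt below is illustrative, not the actual Cosmopedia prompt.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def textbook_prompt(topic: str, web_extract: str) -> str:
    return (
        f"Write a long, self-contained textbook section in the field of {topic}. "
        f"It should be related to the following web extract, but written in a "
        f"clear, educational style for college students:\n\n{web_extract}"
    )

web_sample = "The derivative of a function measures how its output changes as its input changes..."
generation = client.text_generation(
    textbook_prompt("mathematics", web_sample),
    max_new_tokens=1024,
    temperature=0.7,
)
print(generation)
```

Varying the seed extracts, and the audience or style asked for in the prompt, is what provides the diversity the talk describes.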
So it's really important to spend a lot of time creating these datasets and trying to remove all the outliers and samples that can hurt the model during training. This is the pipeline from the Yi paper for filtering their pretraining web dataset. First they do language filtering; in Yi's case they kept English and Chinese. Then they apply some filtering techniques to remove low quality samples: for example, there are metrics that look for files with a lot of repeated lines and remove them, and there is also rule-based correction. You can also use perplexity filtering, where you compute something like a loss and remove samples where it is very high. After that they did a very important step, deduplication. There are a lot of papers that study the effect of duplicates on training, and they find that keeping duplicates in the training data can cause models to memorize and leaves them less room to be creative, so it hurts the performance of the models. It's always advised to remove duplicates, using exact deduplication to remove files that are exactly identical, but also near deduplication to remove files that are merely similar, which uses techniques like MinHash, for example. After that, they did more filtering on top, like semantic and topic filtering, but usually you do the classic filtering and deduplication first and then get more creative with the other filters. That was also the case for FineWeb: the reason it is better than other datasets is that the authors spent a lot of time coming up with better filters and also deduplicating the dataset. Now the question is: okay, we can do the deduplication, I think we have established methods for that, and we can also do language filtering. But if you want to filter the data to remove garbage and low quality files, how do you come up with good filters? You can certainly find some filters in the literature, but if you want to build a dataset that is better than what exists, you need to invest some time finding techniques that work better for your case. This can be done with manual inspection, which is always a good idea: look at the data and see what it actually looks like, and you can come up with filters that should help during training. But that is usually not enough, because you might have an intuition for a filter that you think will help your model, and then when you train, the filtering doesn't actually help. For example, when we were developing the StarCoder series of models, we were thinking about the best ways to filter code. We used some standard filters, for example to remove auto-generated content, but we also tried to come up with slightly more complex filters, like looking at how many comments a file has, because code that is well documented is probably of higher quality than a code file that doesn't have any comments. So we implemented a filter that looks for files with almost no comments and removes them. We trained a model on that, and it turned out the performance improvement was really negligible, not as much as we expected. We also tried another filter, which uses the stars of a repository as an indicator of quality: we tried removing all the files from repositories with fewer than five stars, and this ended up removing over 70% of the dataset.
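Since near deduplication comes up again below as the single most useful filter, here is a minimal sketch of MinHash-based near dedup using the datasketch library. The whitespace-token shingling and the similarity threshold are illustrative choices, not the exact settings of the BigCode or FineWeb pipelines.

```python
# Minimal MinHash near-deduplication sketch (illustrative settings only).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):   # crude shingling: unique whitespace tokens
        m.update(token.encode("utf8"))
    return m

def near_dedup(docs: dict[str, str], threshold: float = 0.7) -> dict[str, str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept: dict[str, str] = {}
    for key, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):            # similar to a document we already kept
            continue
        lsh.insert(key, sig)
        kept[key] = text
    return kept

files = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "def add(a, b):\n\n    return a + b\n",   # same tokens, different whitespace
    "c.py": "print('hello world')\n",
}
print(list(near_dedup(files)))        # b.py is dropped as a near duplicate of a.py
```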
And then, when we trained on that stars-filtered data, the model was the worst one we trained in all our ablation experiments, simply because the filter removed too much data; it was not worth using this filtering technique. This is why it's very important that when you have a filter, you run what we call an ablation model. An ablation is basically this: you take a subset of your dataset after applying the filter, you train a small model on it, and you see how it behaves with and without the filtering. You might be wondering: if I use a small model, does it really extrapolate to larger models? I think that's a good question, but generally, from our experience, we found that it does extrapolate for most data ablations. When you're doing these ablations, you should select a set of high-signal benchmarks that can give you conclusions about the effect of your filtering early in the training. These can be some of the popular NLP benchmarks for LLMs, for example HellaSwag or MMLU. You should also try to train with different seeds to reduce the noise, because sometimes a filtering technique doesn't give you a very big difference, and if you train with just one seed you might draw conclusions that are actually just noise. So if you can, and you have the compute, it's always better to run the same experiment with two or three different seeds and then average, so that you reduce the noise and get more robust conclusions about the effects of your filtering. For example, for the FineWeb dataset, the authors ran over 200 ablations; these were roughly 1 billion parameter models trained on, I think, 30 billion tokens, and that's how they were able to find the filters that worked best for their dataset. Now let's go back to our StarCoder use case, and I'll tell you how we filtered the Stack dataset. For version one, if you remember, we had 6 TB of source code, but when we trained StarCoder we only used about 800 GB of those 6 TB, so a lot of the data was filtered out by our filtering and curation. The same happened for the Stack v2, where this time we started from 32 TB in over 600 programming languages, and after the filtering we ended up with only 6.3 TB of code. For filtering code, the approach is similar to filtering web data, but the filtering techniques are a bit different. First, we wanted to include a lot of programming languages; we looked at them and didn't keep all of them, only the popular ones, and excluded, for example, configs and languages that are no longer maintained. That was for v1; for StarCoder2 we included more languages, over 600, and we added some other sources that could be interesting for a code model to learn from, namely GitHub issues, GitHub commits and Jupyter notebooks, and for the v2 also Kaggle notebooks and pull requests. The second step, after we selected the languages we wanted to train on, was data quality inspection. Basically, as I told you, we had some filters to remove low quality files and auto-generated content. An example is the average line length: if the average line length is too high, there's probably something wrong with the file, or it's probably auto-generated. But since we had almost 100 programming languages, we should not use the same threshold for every language for this filter, because some programming languages just have longer lines.
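To make these heuristics concrete, here is a small sketch of the kind of filters being described: average line length and comment density. The thresholds are made up for illustration; as the talk explains, the real ones were tuned per language by inspecting samples.

```python
# Illustrative code-quality heuristics: average line length and comment
# density. Thresholds here are placeholders; in practice they were tuned
# per language after manually inspecting samples.

def average_line_length(code: str) -> float:
    lines = code.splitlines() or [""]
    return sum(len(line) for line in lines) / len(lines)

def comment_ratio(code: str, markers=("#", "//", "/*", "*", "--")) -> float:
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for line in lines if line.startswith(markers))
    return commented / len(lines)

def keep_file(code: str, max_avg_line_length: float = 100.0,
              min_comment_ratio: float = 0.01) -> bool:
    # Very long average lines often mean minified or auto-generated code;
    # files with no comments at all were one (weak) signal of lower quality.
    return (average_line_length(code) <= max_avg_line_length
            and comment_ratio(code) >= min_comment_ratio)

print(keep_file("# add two numbers\ndef add(a, b):\n    return a + b\n"))  # True
```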
So it's important to do some inspection and look at samples from these languages. In our case, we had the BigCode community, which helped us look at around 100 samples per extension and derive the appropriate thresholds and filtering heuristics. The third filtering step was near deduplication. We found that near deduplication was the filter that gave us the biggest performance boost, and it's also very easy to apply because it's language agnostic: even though we have 86 programming languages, we don't need to adapt the deduplication to each language, we can just apply it to the whole dataset. Here I show some results on the effects of deduplication. For example, for this model trained on the Python all-licenses subset, with no filtering you get a pass@1, which is our code metric, of 13, but if you apply near deduplication you go from 13 to 17; that's a very big performance bump. The same goes for other subsets, like the permissive-license one. So we decided to use deduplication for our datasets, and strong deduplication at that, to really remove all the files that could be similar. Another step in our pipeline was removing personal identifiable information: this could be names, emails, keys or passwords, because we scraped code from GitHub, and although GitHub has some tools to detect secrets and prompt users to remove them, that's not always done, and we found that there were still a lot of secrets in the dataset. We don't want the model to be trained on that, because at inference it might generate sensitive or personal data. Our approach to removing it was to first annotate a dataset for PII: we collaborated with an annotation company to annotate some samples. The annotators were tasked with labeling the PII they found; for example, if they found a name, they labeled it with the class name, and if they found an email, they labeled it as an email. So it was a named entity recognition task. Then we trained StarPII, which is our NER model for detecting PII, and ran it on the whole StarCoder training data. This took almost 800 A100 GPU hours, because it's a neural network running on GPUs. The next step in our pipeline was data decontamination: you should make sure to remove the benchmarks and test sets from your training data, otherwise your evaluation numbers will just be inflated. So we made sure to remove the benchmarks we use for evaluation from our training datasets. The last step in the curation of the Stack was to format the data. Now that the data is filtered, and because code is different from plain text, we can allow ourselves to apply some nice formatting that can help during inference. For example, for StarCoder we had the code file, and before the code we added a token indicating the repository name, another token for the file name, and another one for the stars. This is interesting because for these models, for example StarCoder and other code models, the main use case is to be plugged into an IDE, for example VS Code. When you're using them there, it can be useful to prepend the code with the name of the file, for example file.py, so that the model knows this is a Python file; if it's another language, the file name with its extension lets the model detect which language it should generate code in.
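For illustration, here is roughly what that metadata formatting looks like as a preprocessing step. The special token names below are the ones I recall from the StarCoder tokenizer; treat them as an assumption and check the released tokenizer config before relying on them.

```python
# Sketch of the metadata formatting described above: prepend repository name,
# file name, and star count to each file before tokenization. Token names
# are assumed from memory of the StarCoder tokenizer; verify them against
# the actual tokenizer's special tokens.

def format_training_example(repo: str, path: str, stars: int, code: str) -> str:
    return f"<reponame>{repo}<filename>{path}<gh_stars>{stars}\n{code}<|endoftext|>"

example = format_training_example(
    repo="bigcode/the-stack-tools",   # hypothetical repository name
    path="utils/dedup.py",
    stars=100,
    code="def hello():\n    print('hi')\n",
)
print(example)
```

At inference time, an IDE plugin can fill in the same fields, such as the file name and extension, so the model knows which language to complete in.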
We also added a GitHub stars token, and we tried to play with it, for example saying this file has 100 stars and checking whether the model would generate higher quality code than when conditioned on zero stars. We didn't find any real differences during inference, but it was fun to add all this formatting. For StarCoder2, one of the improvements was that it was repository aware. In a GitHub repository, there are files that are related to each other, but when we built the Stack v1 we just shuffled files, so we didn't keep this repository structure when we trained the model, and the model did not know whether two files belonged to the same repository. For StarCoder2, we tried to keep files from the same repository next to each other by concatenating them with special tokens, like a file separator token, and this way the model can kind of know which files are in the same repository and try to find links between them. On the tooling side, at Hugging Face there are also libraries like datatrove for data processing, nanotron for training with 3D parallelism, and lighteval for doing the evaluation. So this is kind of a stack for running your full trainings but also your ablation models: you can apply a filter with datatrove, then train with nanotron, and evaluate it with lighteval. They're well integrated together and form one ecosystem. That's for general LLMs; for code LLMs, we also released the code we used for both the Stack and the StarCoder models in our BigCode repositories on GitHub. And I think we just answered our third question, which was how to filter the data. So now you know, first, how much data you need, then where you can get it, from web, code, synthetic and curated sources, and also how to properly filter it and test the filtering techniques you have in mind. Now let me tell you a little bit more about code LLMs, because that's what I work on, and I'll try to give you an overview of these models, so that you know not only how to train good LLMs but also how to build very cool code assistants and completion models. How all of this started was when GitHub Copilot was released, and it was very interesting because it was so much better than all the code completion models before it, which were very small and much less performant. GitHub Copilot was using the Codex model by OpenAI, and it showed that you can train a code LLM the same way you train an LLM for English: you can just take a large transformer model, give it a lot of code data, and it will learn the code. Before that, a lot of people were trying to treat code very differently, for example by using abstract syntax trees, but what the Codex models showed is that you can treat code like text: if you want to predict the next line, you just do next token prediction, and you get your code. It works very well, much better than the more feature-engineering-heavy approaches. That was over two years ago, and back then we didn't have any good open code models. But today, if you go to the Hub, you can find over 1,700 models trained on code: models that were either trained only on code, or LLMs that included code as part of their training. So you can see that we've made a lot of progress in this code generation field, which is amazing.
And this is the result of the community's work building very good instruction-tuned models and base models. For example, as you can see in the leaderboard, we have some very strong models that score almost 80% on the main code evaluation benchmark, HumanEval, which means they get almost 80% of the problems right; that's a very large number. Talking about the landscape of open code LLMs: in BigCode, we released the Stack dataset, which is now the default dataset for training on code, as well as the StarCoder and StarCoder2 families of models, and other instruction-tuned models with the Hugging Face H4 team, like StarChat2. Meta also released some very good code models, the Code Llama series, which goes from 7B to 70B. There are also the DeepSeek models, which are very strong, and other models like the recent Granite models from IBM, CodeQwen, CodeGen and StableCode. So there are different providers of code LLMs and also of code datasets. The main reason we started the BigCode collaboration and trained our StarCoder models was to have a collaboration with full data transparency: we released all the details about the training, the data is public so that people can inspect it and use it, and we also released the code for the processing and the model weights. The collaboration was open; we had over a thousand researchers joining us and following the journey with us. This created a BigCode ecosystem, where the Stack was used in the training of a lot of prominent code models like CodeGen and StableCode, and the StarCoder models were used as the basis for a lot of community fine-tunes. I think it's very important to be aware of what makes the release of an LLM, whether it is a code LLM or a general LLM, open and responsible. First, it's really good for the community and for research in AI in general if you can release open-access datasets. This means having data inspection tools, but also opt-out tools to respect people's wishes regarding their data: if they don't want to be included in the trainings, they should be able to opt out. It's also important to remove personal identifiable information. So an open release does not mean just releasing model weights and stopping there; it also means making your work reproducible by giving the community the full pipeline behind these models, and releasing tools for evaluation and technical reports that document the whole pipeline. For us in BigCode, we went from SantaCoder, which was part of our ablations to understand how to filter the Stack dataset, to StarCoder, released last year, a 15 billion parameter code generation model, and then this year we released StarCoder2, which was trained on many more programming languages and had much higher evaluation scores. StarCoder was also rated as the most transparent model by the Stanford Foundation Model Transparency Index, which is really rewarding given the effort we put into data governance and into making the model release as transparent as possible. Regarding evaluation, StarCoder 15B was the state-of-the-art code model when it was released, and that was also the case for StarCoder2-15B among other 15B models; it was even close to or better than larger models.
I don't think I have the plot here, but it was matching Code Llama 34B and it was close to DeepSeek 33B on some benchmarks. Here, for example, you can see the results on different benchmarks, because when releasing a model it's really important that you don't just evaluate on one benchmark; you should include as many benchmarks as you can, in case you had contamination on one of them. Although we try to avoid that, there's a very low chance that you also had contamination on all the other benchmarks, and adding more evaluation benchmarks also lets you understand more fully how your model behaves. I think that's just good practice that everyone should follow in their releases. With the StarCoder models, we also released some tooling, like a VS Code extension, which also has a membership test that tries to see if the generated code was in the training data and highlights that attribution; this is part of our code attribution efforts for these code models. Maybe you're interested in using these models to build your own personal Copilot by fine-tuning StarCoder or Code Llama or other models on your personal codebases. To do that, there's a very nice blog post by Sourab and Sayak where they take a code model, train it on the Hugging Face internal libraries, and then deploy it with Ollama to have a local code assistant. The pipeline is very similar to what we did in pretraining: first you take your dataset, filter out the things you don't want to keep, then deduplicate, and then you train your model. In this case it's just a fine-tuning, so it's much quicker. You can use libraries like PEFT, which do parameter-efficient fine-tuning, where you don't train all the parameters of the model but only inject a few trainable parameters. This makes the training much faster; for example, a 7B model can be fine-tuned in a Google Colab. Now let's go back to evaluation. For LLMs, there's the Open LLM Leaderboard, which evaluates models, and there's also the LMSYS Arena, which compares instruct models and uses human evaluation. For code models, one of the most popular benchmarks is HumanEval. It's basically a benchmark of functions that the model has to complete; when the function is completed, you take the solution and run it against multiple unit tests, you count how many solutions pass and how many fail, and from that you compute a metric called pass@1. That's the number reported in this leaderboard, and it gives you the HumanEval score. There's also a translation of this benchmark into 18 other languages; here I show Java, JavaScript and C++, and that benchmark is called MultiPL-E. It lets you see how well each model does on each programming language and choose the one that's most interesting for you. But these benchmarks usually have an issue of contamination and overfitting, especially for instruction-tuned models. I don't know if you've looked at what these instruction datasets look like, but usually for code there's an instruction that asks the model to generate an exercise, and often, if you look at them, they look really similar to HumanEval, which is also function implementations. So there's a very high chance of contamination, meaning having samples that look like HumanEval exercises in your instruction tuning datasets.
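As an aside, the pass@1 number mentioned above comes from the pass@k estimator introduced with HumanEval: you sample n completions per problem, count how many pass the unit tests, and compute an unbiased estimate of the chance that at least one of k samples passes. A minimal version looks like this:

```python
# pass@k estimator (as introduced with HumanEval / the Codex paper):
# n = number of sampled completions per problem,
# c = number of completions that pass the unit tests,
# k = sampling budget being scored (k=1 for the pass@1 in leaderboards).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer failing samples than k: a passing one is always picked
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations for one problem, 8 of them pass the tests
print(pass_at_k(n=20, c=8, k=1))   # 0.4
# The benchmark score is this value averaged over all problems.
```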
So here, for example, this plot is from the LiveCodeBench leaderboard, and they find that models may be overfitting on HumanEval. Their solution was a leaderboard, LiveCodeBench, where they regularly scrape new problems from contest platforms like LeetCode and evaluate the models only on problems that were released after each model's release date. This way they are sure there is no contamination. For example, that was the case here: they evaluate the models on all the problems they have, and then compare that to the performance on only the problems released after the model's release, and they found that some models were not consistent across the two. So that's one interesting thing to keep in mind. This leaderboard is also interesting for comparing not just open models but also closed models like GPT-4, and seeing where the open source community stands compared to them. So that was my presentation. Thank you very much for your attention, and if you have any questions, I'm happy to answer them.
speaker 2: Yes, thank you very much for the great, insightful talk. So we have some questions here on Slido. I'm not sure if there are any in-person questions; otherwise I will get started with the Slido questions.
speaker 1: Sure. Okay.
speaker 2: I guess not, so I'll ask some of the questions from online. I think I had submitted some of these as well. It seems like there are some questions about synthetic data. Let me see, I was also wondering about this. Someone's asking: what are the consequences of training AI models on AI-generated synthetic data? Do you foresee any problems with this? And there's a related question: does synthetic data closely represent the natural distribution of language? I assume some low quality data from humans is necessary for things like learning robustness and so forth.
speaker 1: Yeah, sure, these are very good questions. About the consequences of training models on AI-generated data, I can think of two main ones. The first is reinforcing biases: models already have some biases, and if we train on data generated by them, we might reinforce those even more. The other is contamination: these models might generate content that looks like the evaluation benchmarks, and when you train on that, you will have contamination in your data. For example, one of the critiques of the Phi models is that, because people could not see the synthetic data and the models were very good on the benchmarks, they were skeptical: are these models really good, or are they just overfitting the benchmarks? So I think contamination and reinforcing biases are the main things to keep in mind. And regarding synthetic data not matching the natural web distribution, that's a very good point. For example, when we were developing Cosmopedia, at first we found it was worse than the web, which was surprising, because we had spent a lot of time curating this dataset and it looks so much cleaner than the web. Then adding some web data, and trying to add more topics, helped us compensate for some of the gaps, but adding some web data always gives you a performance boost. So yes, there is some noise and there are specific patterns in web data that probably need to be included in the training mix to keep full coverage of what natural distributions look like.
speaker 2: So it sounds like you're saying a good training set would potentially have a mix of synthetic and natural data. Is that correct?
speaker 1: Yeah, I think so. The experiments we ran show that that's the case, because you can try to spend time carefully curating the topics, but you'll probably still miss some things, and the human intuition we have is not always what works best for the trained models. It seems that keeping some filtered web data helps. And if you look at the Phi technical reports, for example Phi-3, they insist a lot on filtering the web and including it in the pretraining. I think that now seems like the best way to go, if that makes sense.
speaker 2: Great. Another question is: is RLHF-type preference data more important than unsupervised pretraining data? Should we spend more resources on RLHF and that kind of data?
speaker 1: Yeah, that's a good question. The unsupervised pretraining is mainly there to get base models, but if you then want to use these base models as chat assistants, you need another step. You can do RLHF, but nowadays a lot of people just do instruction tuning, without going through RL, where you train the model on pairs of instructions and solutions, and that seems to work very well. There are also methods now that don't use reinforcement learning but work just as well, for example DPO or ORPO. So if you want such a chat assistant, you definitely need to run a supervised training on top of the unsupervised one, but it doesn't necessarily have to be RLHF; there are other algorithms now.
speaker 2: Great, great. And here's a multimodal question: does multimodal grounding, for example including images and videos along with the text, reduce the need for so much text-only data?
speaker 1: Yeah, what do you mean exactly?
speaker 2: The question is asking whether multimodal grounding helps. Basically, if you have images and videos along with the text, does this reduce the amount of text-only data required to train models?
speaker 1: I can't really answer that, because I haven't tried it. But for example, in the multimodal models, like the Idefics models we recently released, there's always a significant text portion, and that seems to be the case for most vision-language models. I don't really know the exact percentages for each, though.
speaker 2: Right, okay. A more general question. You probably touched upon some of this, but are there any major differences between training text versus code models, other than the training data being different?
speaker 1: Yes, that's a good question. So the training data is different, but regarding the training itself, we use a similar architecture; StarCoder, for example, used a Llama- or Mistral-like architecture. One thing that you probably do want is long context, because if you want to use these models in VS Code, for example, you want to add all the neighboring files in the context, so you should be able to fit a very large context. So we did some long context extension, but again, people also do this for general LLMs. We also care a lot about inference: we used MQA first and then GQA to get faster inference, but these are also techniques implemented for general LLMs. So I'd say overall it's very similar, but maybe you should prioritize some things, like having a smaller model that can serve IDE use cases quickly, rather than a much larger model that would need heavier deployment.
speaker 2: Yeah, great. Here's also a general question.
I guess they're asking for advice. So if you have a very tiny compute budget, for example a single GPU, what would you recommend prioritizing? Let's assume you are fine-tuning a model.
speaker 1: Yeah. So there are now some great solutions for on-device deployment and fine-tuning. For example, you can run quantized models with llama.cpp or other frameworks, and with techniques like PEFT you don't need to do full fine-tuning, so you should be able to run this on one GPU even with a 7B model. I think you should just find a very well curated dataset, because quality is more important than quantity, and then use one of these techniques for easy fine-tuning, and it should work.
speaker 2: Right, great. Here's a question asking, I guess, about something different from pretraining. They're saying: I'm guessing the optimal amount of training data depends heavily on the domain as well as the task at hand, right?
speaker 1: Yes, probably. Right now we're following the Chinchilla scaling laws; I think they compared English to code and found that the findings still hold, but if you go to another domain, like medical, things could change. That's why I mentioned the DeepSeek paper, where they found it's really heavily dependent on the data. For them it was even the same domain: they just changed datasets, going from one generic dataset to another, well curated one, and things started changing. So I think that's probably the case, but it's underexplored how these scaling laws change depending on the domain, and it's good to be aware of that when developing models for domains that these scaling laws haven't explored.
speaker 2: Picking up on different domains, code versus text: someone's asking, what are some of the interesting differences between tokenization for general purpose text versus for code generation?
speaker 1: Yeah. When we were training the tokenizer, one thing that was important was to keep numbers split into individual digits, and we used standard BPE, trained on the same data mixture we were using for training, so our code mixture. We did some analysis to check for outliers and for tokens that were underrepresented or overrepresented, as sanity checks. But overall it's very close to training a text tokenizer, and now most LLMs have a significant code portion in their tokenizer training data, so they're also trained on a lot of code. In the end, you can use either one, a tokenizer built for general LLMs or one built for code, because even in code there's a lot of markdown and therefore a lot of English, so you end up covering all the English tokens in a code tokenizer anyway.
speaker 2: Great. And here's a question about fine-tuning, I guess compared to pretraining. They're asking: do the same principles apply to fine-tuning, or would you make different or additional recommendations?
speaker 1: Yeah, for fine-tuning, I think the data preparation is a bit different. You're not going to train on all of the Stack; you probably want to fine-tune on a specific language, so maybe you'd invest more time in even heavier filtering, because for fine-tuning you don't need as much data as for pretraining. For us, some of the filters we tried, for example the stars filter, didn't work for pretraining because they removed a lot of data and we didn't have enough left.
Well, for fine-tuning, for example for instruction tuning, there was the LIMA paper, where they instruction-tuned on only a thousand instructions and got a model that was much better than ones trained on millions of samples. So I think data curation is even more important when it comes to fine-tuning.
speaker 2: Great, great. One last question. You might have touched upon this briefly already, but what are some considerations to make when publishing very large datasets, and more nuanced or less known things to be aware of?
speaker 1: Yeah. Maybe on the technical side, also release the tools for filtering and the documentation; that's what we tried to do with the Stack. And on the governance side, make sure the licenses are respected and the copyrights are respected, have an opt-out tool for your dataset, and maybe try to release it on the Hub to make it easily accessible for people. If there are some concerns, you can add gating. For example, we released the dataset that we used for PII detection, but we had a gating mechanism because it contained sensitive information. So it's good to think of these kinds of things in advance before releasing a dataset. But yeah, in general, that's my advice.
speaker 2: Right, great. Do we have any in-person questions? If not, then we can probably conclude.