2025-06-11 | Stanford CS336 | Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1
Data: The Core Ingredient of Language Model Training
Tags
Media details
- Upload date
- 2025-06-11 11:49
- Source
- https://www.youtube.com/watch?v=WePxmeXU1xg
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: So today's lecture is going to be on data. In the previous lectures up until now, we've discussed how you train a model given data: the architecture, the optimizer, tokenization, scaling laws, parallelism — all of that assumed a fixed dataset. Now we're going to talk about what data we train on. My hot take is that data is the most important thing in getting language models right. Tatsu might disagree with this; he thinks scaling laws are the most important thing. But here's my justification: look at what companies actually disclose in their papers. The open-weight models — Llama 3 and even DeepSeek — fully disclose their architecture, and their papers say a lot about how the training works, but they basically don't talk about the data. The Llama 3 paper, which has a lot of detail about a lot of things, says essentially this about its data: "We create our dataset from a variety of data sources containing knowledge until the end of 2023." To be fair, they do describe at a high level how they filtered the data, but it's not much information about the dataset itself. There are reasons for this secrecy: one is competitive dynamics, and the other is that they don't want to get sued more than they already are. Before foundation models, data was clearly recognized as important because you needed to annotate data to drive supervised learning. Even though there's less annotation now, the data work remains, and it involves a lot of curation and cleaning. Data is fundamentally a long tail of problems, and one reason people invest in it so much is that it scales: if you're building a model that does many different things, you can easily hire a team of several hundred people working on different aspects of data — multilingual data, code quality, different types of images if you're multimodal, and so on. Whereas for architecture, there's one architecture; a small team defines it, and that's it. Data is very parallelizable when you think about how to allocate resources on a language-modeling development team. There are multiple stages of training. There's pre-training, the focus of the majority of this class, where you train on raw data, usually from the web. There's mid-training, where you curate a smaller set of high-quality documents aimed at particular capabilities such as math, code, or long context. And there's post-training, where you fine-tune on instruction-following or chat data and do reinforcement learning to get something you can actually talk to; this is also typically where safety fits in, but in practice the lines are blurry, and more recent models have additional stages whose exact contents aren't known. The basic idea is clear, though: you start with large amounts of low-quality data and then train on smaller amounts of high-quality data towards the end. A bit of terminology you've seen: "base model" typically refers to the checkpoint after pre-training and mid-training, and "instruct" or chat models come after post-training.
Let's take an example of what this looks like. This is from AI2, which has been releasing open-source models, so we know exactly what's in the dataset. Here is a typical pre-training data mix, at least for open-source models: web pages from something called the DCLM baseline, which I'll talk about later; code; academic papers; math; and Wikipedia — about 3.9 trillion tokens in total. If you look at mid-training, you see many of the same sources, but filtered down: it's still DCLM baseline, but reduced from 3.7 trillion tokens, the majority of that dataset, to 700 billion. There are some FLAN datasets, which I'll mention later, still Wikipedia — we like Wikipedia, I guess — and then some new synthetically generated datasets, and they might as well toss in the GSM8K training set, why not? That's about 10 billion training tokens. Then there's a separate paper called Tulu, which does the actual post-training, and here's its data mix: chat data from various sources and a bunch of synthetically generated data that captures different aspects. So what are all these datasets? How are they chosen and processed? To set expectations so I don't disappoint you later: there isn't really a good formalism or principle for deciding these things, which maybe isn't that surprising given the nature of this class — even for architectures we didn't have a good principle. Data in particular is hard to teach, so what I'm going to do is talk through the different datasets people have used over time, where they come from, and some of their properties, in the hope that you can use your inductive powers to build some intuition for what makes good data and what doesn't. I'll start with pre-training and then talk about mid-training and post-training, but most of this will be on pre-training. Let's start way back in 2018 with the BERT model, which some of you might still remember — it was a big deal. BERT was trained on books and Wikipedia, so let's dive into what that actually means. The datasets often aren't discussed very much, because people focus on the model, its evals, and its capabilities. There's a website called Smashwords, which came about in 2008 and lets anyone publish an e-book; as of last year there were about 500,000 books on it. In 2015, a vision-language paper essentially scraped Smashwords and created BookCorpus, a collection of self-published books priced at zero — about 7,000 books. It has since been taken down because it violated the terms of service; back in 2015 it was the Wild West, and AI copyright wasn't nearly the issue it is now. So that's BookCorpus: a small dataset, but it represents the continuing importance of books. Then there's Wikipedia. Everyone knows Wikipedia. Just for fun, we can point at a random article — click again and you'll get a different random article. Here's a random building in Indonesia, I think.
Wikipedia has been around for over 20 years, and there are a lot of articles in different languages. It's worth saying explicitly what Wikipedia is: it contains no original thought — everything comes from citations of original primary sources — and there are supposed to be no opinions or personal web pages. Inclusion is based on notability, meaning multiple sources must have covered the topic. That already gives you a sense of what is and isn't in Wikipedia: there's a lot of valuable content in the tail that wouldn't be there, and a lot of useful opinion that's also absent — recipes aren't in Wikipedia, and so on. Anyone can edit, but in practice a small number of people contribute the majority of the content; one contributor has 5 million edits — he probably uses some tooling; you can read his website. Every once in a while a dump gets produced, and you can download a zip file with all the Wikipedia content. One aside: we think of Wikipedia as a very high-quality source — maybe more reliable than the average Internet article — but there's something everyone should know about that's relevant to data, which is data poisoning. Carlini has a series of wonderful results showing that everything is broken. They show that you can inject malicious edits right before these periodic dumps: you know when the dump is coming, so you inject an edit that lands in the dump before the edit gets rolled back. I thought that was very clever. And we know that if you can control the training data, you can get a model trained on it to do various things, for example ascribing negative sentiment to trigger phrases like "iPhone". So an adversary could leverage this process and inject whatever they want into something like Wikipedia, even with the rollback policy. I think this has since been patched, so you probably can't literally exploit it anymore, but in general it's important to realize that the data models are trained on comes from the broad Internet, where attackers and anyone with various incentives have quite a bit of control over the behavior of the language model, and it's very hard to have oversight over that process. OK, that was a bit of a digression — BERT was trained on books and Wikipedia, and back then people didn't care much about data poisoning for language models. BERT seems very old now, but it marked a big transition to training on documents rather than sentences, in contrast to the One Billion Word benchmark we talked about last week. Then in 2019, GPT-2 collected a dataset called WebText. The idea was: the web is large and mostly low quality — how can we quickly get a diverse, high-quality subset? The insight was that Reddit posts contain outbound links and posts accumulate karma, so why not take the links from posts with more than three karma? That resulted in a million pages, 40 GB of text, which is what they used to train GPT-2. They released the paper but not the dataset.
Since then there has been an open replication of WebText, OpenWebText, which is often used in language-model research. Now let's talk about Common Crawl. Hopefully by the end of this, whenever someone tells you "language models are trained on the Internet," you can call them out: that's just false — what would that even mean? Common Crawl is maybe the academic's approximation of the Internet. It was established in 2007, and every month they run a web crawl, so there have been about 100 crawls over the last 17 or so years. The crawl itself isn't that expensive compared to language-model training: you can rent some machines and get it done in under two weeks. The last crawl was last month, and to give a sense of its size, about 2.7 billion pages were added. Each of the 100 crawls contains slightly different web pages, with some overlap; the heuristics aren't public, but you can imagine that rapidly changing sites get crawled multiple times, sites that don't change much get crawled less, and there's an explicit attempt to diversify. To very briefly describe crawling: they use an open-source crawler. You start with a set of seed URLs — actually quite a large number, hundreds of millions, not a single website from which you somehow reach the whole web — and you maintain a queue, the crawl frontier, with a bunch of machines that pull from that frontier and crawl from there. Basically you're doing a BFS of the web, but there's a lot of systems work, and you have to be careful about how you crawl some sites: which pages you download, respecting robots.txt, not overloading servers, and deciding when to re-crawl a site you've already crawled. There's also the problem that URLs are dynamic: some URLs are very long, and multiple URLs can lead to the same content, which causes a lot of duplication. Common Crawl publishes data in two formats. WARC files are the raw HTTP responses, which are usually HTML. For HTML pages, that gets converted to text in a format called WET, and this is a lossy process — HTML-to-text conversion loses information. Note that this isn't the only way to use Common Crawl: you can use their WET files, which are text, or you can start with the raw HTML in the WARC files and do the extraction yourself. There are a few tools for this, and in your assignment you'll be doing the HTML-to-text conversion yourself. It does make a difference: in an ablation from the DataComp-LM paper, which I'll talk about later, using the raw WET files is a whole four points lower than using trafilatura, for example. So there are low-level details here that matter. One other thing about Common Crawl: it's deliberately not meant to be a comprehensive crawl of the entire Internet; part of their policy is to be gentle and polite — for example, not all Wikipedia articles are even in Common Crawl.
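To make the WARC-versus-WET distinction concrete, here is a minimal sketch of reading a Common Crawl WARC file and doing the HTML-to-text extraction yourself with trafilatura; the file name is a placeholder, and a real pipeline would stream many such files and add error handling.

```python
# Minimal sketch: iterate over response records in a WARC file and extract text
# with trafilatura instead of relying on Common Crawl's lossy WET conversion.
from warcio.archiveiterator import ArchiveIterator
import trafilatura

def iter_documents(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)   # returns None if nothing useful is found
            if text:
                yield url, text

# Usage (placeholder path):
# for url, text in iter_documents("CC-MAIN-example.warc.gz"):
#     print(url, len(text))
```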
Maybe I'll pause here in case people have questions about the data so far. [Question] The question is: does Common Crawl do any filtering of its own, say of sensitive or offensive content? I think by default they're very permissive, because deciding what's offensive is a fairly high-level semantic decision, so there's definitely a lot of offensive and harmful content in Common Crawl. There might be some light filtering — sites that are plainly illegal, maybe some block lists — but I'm not sure about the exact details. [Question] The question is: can a website be flagged when it doesn't want to be included? The answer is yes, there is a way: a website can include a robots.txt file, which has a bunch of rules saying which crawlers it allows, if any. For example, the New York Times robots.txt allows Googlebot to access a bunch of things, has different rules for different crawlers, and you can see all your favorite LLM providers in there. It turns out most of the frontier model developers have their own crawlers, because Common Crawl is actually quite sparse in coverage even though it's big — the Internet is a very big place. But there's no formal enforcement; robots.txt is guidance, and there might be folks who don't respect it. [Question] How does the crawler handle embedded media like images? Technically Common Crawl just takes a URL and gets the raw response, so sometimes the response is text and sometimes it's images. Most of Common Crawl is biased towards text, but occasionally you get other stuff, and of course you could build crawlers that explicitly go after media. [Question] What fraction of Common Crawl, or other sources, is copyrighted material? I'll talk about copyright later, but I would say most of it is copyrighted; that's a complex topic, so I'll touch on it briefly later. OK, let's move on. Common Crawl is big, and as I showed on the first day of lecture, random samples from Common Crawl are really not good, so there have been many attempts to filter it. One of the earliest is CCNet, from Meta. The idea was a generic procedure that could take Common Crawl and return high-quality datasets, and they were particularly interested in multilingual coverage. They used a bunch of heuristics: they removed duplicates, ran language identification — basically a linear classifier to keep only examples of a target language, whether English or German — and then, this is the key part, to filter for quality they look for documents that look like Wikipedia under a 5-gram model. They take Wikipedia text, train an n-gram model, and use it to score documents. The idea is that Wikipedia serves as a surrogate for high-quality data — as you'll see, it's been used that way repeatedly — and with this score you can find more things that look high quality. But as we discussed, Wikipedia doesn't cover everything, so this filter won't cover everything either. They trained a bunch of BERT models at the time and showed that this outperforms training only on Wikipedia. CCNet is a bit confusing because it refers both to the tool — the filtering pipeline — and to the dataset released with the paper.
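A minimal sketch of this style of filtering, assuming fastText for language identification and a KenLM n-gram model trained on Wikipedia as the quality score; the model paths and the perplexity threshold are placeholders, and CCNet itself buckets documents by perplexity rather than applying a hard cutoff.

```python
# CCNet-style filtering sketch: language ID plus "does this look like Wikipedia?",
# scored as perplexity under a Wikipedia-trained n-gram language model.
import fasttext   # pip install fasttext
import kenlm      # pip install kenlm

lid = fasttext.load_model("lid.176.bin")              # off-the-shelf fastText language-ID model
wiki_lm = kenlm.Model("wikipedia.en.5gram.arpa")      # placeholder: 5-gram LM trained on Wikipedia

def keep_document(text: str, lang: str = "en", ppl_threshold: float = 600.0) -> bool:
    # fastText's predict() expects a single line of text.
    labels, probs = lid.predict(text.replace("\n", " "))
    if labels[0] != f"__label__{lang}" or probs[0] < 0.5:
        return False
    # Lower perplexity = more Wikipedia-like; the threshold here is arbitrary.
    return wiki_lm.perplexity(text) < ppl_threshold
```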
Meanwhile, Google was doing some of this as well. They released C4, which stands for Colossal Clean Crawled Corpus, with the same underlying insight: you want to leverage the large amount of text in Common Crawl somehow. The paper, by Colin Raffel and colleagues, is more famous for introducing the T5 model, but it also introduced the C4 dataset as a main contribution — it's a long paper. The observation is that most of Common Crawl isn't useful natural language. Say you start with one snapshot, which is already about 1.4 trillion tokens. They decided to use pure heuristics: keep lines that end in punctuation, remove pages with fewer than three sentences, remove pages containing bad words (you can click through to the list; I'm not going to show it here), remove pages containing curly braces — which is interesting, since that clearly removes a lot of code; I guess Python might be kept — remove boilerplate text, and keep only English. They got a lot of tokens out of that. It's interesting to see the trade-off here: whereas CCNet used a model-based approach to make things look like Wikipedia, C4 is entirely rule-based. The advantage is that sentences that don't look like Wikipedia but are nonetheless well formed end up in C4; on the other hand, sentences that are very spammy but also well formed fall into C4 too. So there's a complementarity: if you filter with a model, it's only as good as your ability to curate positive examples that are representative of what you want, and when you want a very broad set of data, that coverage can be hard to get — which is the whole point, since you're trying to curate a diverse dataset in the first place. They also created a WebText-like dataset by taking pages linked from OpenWebText — remember, OpenWebText was the open reproduction of WebText, which was used to train GPT-2. They looked at links from Reddit posts with at least three karma, and even using twelve dumps they only got 17 GB of text; WebText was 40 GB. That gives you a sense that Common Crawl is quite incomplete: you take all of Common Crawl, apply the same filter, and get something about half as large as WebText, which basically did its own crawl. Nonetheless, this was useful for improving a bunch of NLP benchmarks at the time. And going back to C4, if you look at its composition, there's Wikipedia in there, a lot of patents, news, and so on. OK, so we've talked about Common Crawl and different ways to filter it.
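A minimal sketch of C4-style rule-based cleaning — not the exact Google implementation, and the bad-word list is left as a placeholder:

```python
import re

TERMINAL_PUNCT = (".", "!", "?", '"')
BAD_WORDS: set[str] = set()   # placeholder for the public "bad words" list C4 references

def clean_c4_style(page: str, min_sentences: int = 3) -> str | None:
    """Return the cleaned page, or None if the page should be dropped."""
    if "{" in page or "}" in page:                       # crude code filter
        return None
    if set(re.findall(r"[a-z']+", page.lower())) & BAD_WORDS:
        return None
    # Keep only lines that end in terminal punctuation.
    kept = [ln.strip() for ln in page.split("\n") if ln.strip().endswith(TERMINAL_PUNCT)]
    text = "\n".join(kept)
    if len(re.findall(r"[.!?]", text)) < min_sentences:  # rough sentence count
        return None
    return text
```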
Now we're entering the GPT-3 era, with a bunch of models and datasets that bring in some other ideas. The GPT-3 dataset was: Common Crawl, which was processed; WebText2, essentially the same idea as GPT-2's WebText; a mysterious pair of book corpora, Books1 and Books2; and Wikipedia. The result was about 400 billion tokens, which by modern standards is quite small but at the time was quite impressive. The Common Crawl processing was to train a quality classifier to distinguish WebText, Wikipedia, and books — high-quality data — from the rest. That's the basic idea of quality classification: you identify a set of positive examples and then look for more things like them in the larger pool. This is what they had determined to be high quality, and they wanted more of it. The Pile came shortly after. EleutherAI was an organization that sprang up in reaction to GPT-3 and how closed everything was; they were trying to reproduce language models as open source, in a largely decentralized, Discord-driven volunteer effort where everyone tossed in data they felt was high quality. They curated 22 high-quality domains: some Common Crawl and WebText, Stack Exchange, Wikipedia, arXiv, and so on — here are more statistics about the weights. It's quite diverse, and it's interesting to note that technically a Common Crawl-style crawler could capture most of this, assuming it can be crawled, but people often go and special-case different types of content: Wikipedia is handled differently, mathematics is handled differently. If you have prior knowledge about what good data is, you can just go get it directly. This was more data than GPT-3 was trained on. They also noticed that WARC was better than WET, so they used a different tool, jusText, to convert it. There's PubMed Central, a lot of papers, which is nice — there's a mandate that any NIH-funded work must have open-access papers. In AI we take it for granted that things show up on arXiv, but that's not true for many other fields. There's arXiv, of course, and then there are the Enron emails, an old dataset that came out of a subpoena after the whole Enron ship sank. Why is it in there? Well, it turns out it's really hard to come by email datasets, as you might imagine, because emails are private — this is really the best thing we have. So you might imagine some bias in the email knowledge of a language model trained on that; something to think about. Let's dive into some of these sources. Project Gutenberg was started a long time ago; it's mostly English books, about 75,000 of them now. Its biggest draw is that the books have copyright clearance, which mostly means they're in the public domain — I think 75 years have passed since publication, so anyone can use them freely. There are some books in there that are technically not in the public domain but are okay to use.
There's a dataset, PG-19, which is books from Project Gutenberg; it came from a paper benchmarking language models on long context. The appealing thing about books is that you get really long contexts compared to news articles or even papers. [Question] The question, roughly: is the rise of xAI due to having access to tweet data for training that other AI companies can't mine, giving them a better handle on natural spoken language — and in the long run, will only companies like Google and X, with access to tons of human-generated natural language, win? So the narrow question is whether X has an advantage because of access to tweets, and more generally whether any of the big platforms do — Google has YouTube, Meta has Facebook. There are restrictions on what data companies can use even if they have it; it's not that Google literally trains on all of your Gmail — I don't think that's the case. That said, companies do have distinct advantages and access to certain types of data that might be public. And public doesn't mean anyone can train on it: YouTube is public, but Google has special access to it. The broader question is whether companies with access to special data will win, and by default the answer is yes, because data is the name of the game. Interestingly, though, Anthropic has a really good model and doesn't have a particular secret data source that I know of, so it isn't everything. But over time — this is my opinion — I think you'll see more differentiation and specialization as companies leverage the resources they have. OK, so that's Project Gutenberg. Books3 was a project that produced a lot of books from a shadow library; it contains books from famous authors and has since been taken down due to copyright infringement. It was part of the Pile. A note about shadow libraries: these are libraries that basically disregard copyright and bypass paywalls. This is basically illegal, and there have been lots of takedown orders and lawsuits, but the controls are usually circumvented by putting servers in different countries. Proponents say it makes free what really should be free; the law thinks quite differently. In particular, LibGen has 4 million books, which is a lot compared to Project Gutenberg's 75,000, and it has been revealed that Meta trained models on LibGen, for example — there's a big lawsuit about it. Moving on: Stack Exchange is a collection of sites, most prominently Stack Overflow, which it started with, but it has grown to other areas like math and literature. There's a reputation and badge system to incentivize participation. All of you have probably used Stack Overflow, so I don't need to explain it. Just for fun, I like looking at random examples — here's a page that gives you a random Stack Exchange post. I don't know if these are any good, but here's a question and some answers. Pretty familiar stuff.
One thing to note is that this kind of data really looks like a QA dataset, which is what you want for instruction-following capabilities and real applications. The lesson here is that in web-scale pre-training, a lot of the data is documents that look nothing like what a user would type into a chatbot — but there are subsets of the pre-training data that look remarkably similar to what a user would type, coupled with the response. This is why the lines between pre-training and post-training are a bit blurry. The nice thing about Stack Exchange is that there's also metadata, like comments and votes, which can be used to filter. Data dumps are provided, although these days non-commercial use is fine but commercial entities have to pay for a license. Then there's GitHub, which everyone here knows about and which is, I think, the primary means of getting code for language-model training. Code is generally helpful for programming, of course, but it has also been thought to be helpful for reasoning and other capabilities, although I don't know of a paper that makes that rigorous. Here's a random GitHub repository — OK, maybe this doesn't work anymore; it used to give you a random repo. The reason I do this is that the GitHub repos you visit and the Wikipedia pages you visit are distinctly not a representative sample; sampling randomly gives you a sense of what's actually in the dataset. When you hear numbers like "GitHub has however many millions of repositories," remember that not all repositories are created equal — a random repo might disappoint. There are 28 million public repositories. And note that a repository is a directory: some of it is code, some of it isn't, there are also issues and commit history and all this other stuff, and there are a lot of duplicates. So to take GitHub, which is all this raw data, and turn it into trainable tokens, a lot of work has to go into deciding how best to do that. GitHub Archive is a snapshot of all GitHub events, which you can access with Google BigQuery. The Stack is a project that, fortunately, produces an open-source version of code based on GitHub: they took all the repository names from GitHub Archive, git-cloned 137 million repositories, kept only permissively licensed ones, and removed duplicates, resulting in 3.1 TB of code. One nice thing about code is that, often but not always, the license is made clearer than for web pages, where the license is almost never stated. So you can see the pattern: there's the live service — the GitHub website you go to every day — then the snapshot, which gives you a raw dump, and then the processing that turns it into an actual trainable dataset. So when someone tells you "I trained on GitHub," you have to ask: what exactly does that mean? What preprocessing steps were taken? OK, so that was the Pile, although I took some liberty and digressed into its different components. Now moving on.
In 2021, DeepMind also came onto the scene. The first large language model they trained was Gopher, on the MassiveText dataset. The Gopher model itself was not very good, but the paper does a great job describing the dataset. MassiveText contains MassiveWeb, which I'll talk about in a moment, plus C4, books, news, GitHub, and Wikipedia — with no details about how those were processed. As I mentioned before, that's not reproducible: it's fine to train on GitHub, but how exactly was the data processed? For MassiveWeb, they kept English and, like C4, used quality filters based on manual rules — rules you'll implement in your assignment, such as requiring that 80% of the words contain at least one alphabetic character (there's a sketch of this rule in code after this paragraph). They also used Google SafeSearch for toxicity filtering. In those days, one of the main arguments for manual rules was that you didn't want the filter to be biased by a model: the only models you could afford to run were very weak, they don't really understand the page, and they'd probably have pretty awful biases. There was also the concern that this type of filtering could remove data from marginalized groups that didn't look exactly like Wikipedia. But as you'll see later, this has flipped, and now everyone does model-based filtering. MassiveText was 10 TB of text — maybe something like 4 or 5 trillion tokens, just as an estimate — although Gopher was only trained on 300 billion tokens, which is not very many, roughly the same as GPT-3. In 2022 we have Llama. The dataset for Llama was Common Crawl processed with CCNet, plus a classifier — with a subtlety here: GPT-3 classified whether a page looked like a Wikipedia page, while Llama trained a classifier that predicted whether a page looks like a page referenced by Wikipedia. The idea is that Wikipedia cites high-quality pages, and most of those pages don't look like Wikipedia articles but are nonetheless high quality. So the link structure we saw in GPT-2's WebText shows up again here. They also included C4 — why not — GitHub (keeping permissive licenses, filtered with some manual rules), Wikipedia, Project Gutenberg and Books3 (which got them into a lot of trouble), arXiv, and Stack Exchange, for a total of 1.2 trillion tokens. They didn't release the dataset, but Together reproduced it as RedPajama, so now you have the data-processing code and the data itself. That reproduction was clearly not optimal, and Cerebras did further deduplication, ending up with a 627-billion-token subset. There's also RedPajama V2, which is a bit confusing because it's something else: it takes Common Crawl snapshots and produces 30 trillion tokens with all sorts of quality signals, as a resource for doing research on how to filter based on those precomputed signals.
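Going back to the MassiveWeb-style manual rules mentioned above (and in the assignment), here is a minimal sketch of a couple of Gopher-style checks; the thresholds follow the spirit of the Gopher paper but should be treated as assumptions.

```python
def passes_gopher_rules(text: str,
                        min_words: int = 50,
                        max_words: int = 100_000,
                        min_alpha_frac: float = 0.80) -> bool:
    """Keep documents of reasonable length where most 'words' contain a letter."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Rule from lecture: at least 80% of words must contain an alphabetic character.
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha_words / len(words) >= min_alpha_frac
```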
OK, so that was Llama. RefinedWeb was another paper, and its thesis goes back to what we saw with the Pile, where there was web data and then all this other stuff. Their point was: maybe if we do a good enough job filtering the web data, that's all you need, because technically the Internet has everything in some sense — if it's accessible via a computer connected to the Internet, maybe that's good enough. The RefinedWeb data is on Hugging Face and looks like this — OK, the resolution probably isn't large enough; scrap that. They used trafilatura for extracting content, because, as we noted, trafilatura is better than just using the WET files that Common Crawl provides. They used the Gopher rules, they made a point of avoiding ML-based filtering to avoid biases, and then they did some fuzzy deduplication. That gave a 5-trillion-token dataset, which is quite large, but they only released 600 billion of it. FineWeb, from Hugging Face, started as a replication of RefinedWeb, but they tried to improve it: they used all the Common Crawl dumps available at the time, did some filtering — still manual rules, no model-based filtering — some deduplication, and some basic anonymization, and got 15 trillion tokens out of it. I still think this is a really nice dataset, because dealing with Common Crawl directly is a pain; I'd consider FineWeb a lightly filtered dataset that you can further apply model-based filtering to. Jumping ahead: AI2 has a series of models called OLMo. Their initial model was trained on the Dolma dataset, and here's its composition: Common Crawl; the Stack, which we talked about, for code; C4; Reddit; Semantic Scholar, an AI2 project; Project Gutenberg; and Wikipedia. The Reddit data comes from a project that collected Reddit dumps, but submissions and comments are included separately, so you don't get the thread structure — and I don't think that project exists anymore. Around 2023, all these sites like Stack Exchange and Reddit realized that people were just taking their data, training on it, and making money off it, so that sort of came to a stop. There are 40 million academic papers from Semantic Scholar, which crawls a bunch of different sites, and then the usual suspects. The Common Crawl processing is fairly standard, I would say: language identification to keep only English, quality filtering — in Dolma, since they were training an initial model, they avoided model-based filtering — toxicity filtering, where they do use a classifier, and then deduplication. Three trillion tokens came out of that.
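Fuzzy deduplication of the kind mentioned for RefinedWeb, FineWeb, and Dolma is typically MinHash-based; here is a minimal sketch using the datasketch library, where the shingle size, number of permutations, and similarity threshold are assumptions.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, ngram: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - ngram + 1, 1)):
        m.update(" ".join(tokens[i:i + ngram]).encode("utf-8"))   # word shingles
    return m

def near_dedup(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the ids of documents kept after dropping near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        m = minhash(text)
        if lsh.query(m):        # something similar is already indexed: skip
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
```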
Then, in the same year — this was last year — there's DataComp-LM, a collaboration across multiple organizations. What they wanted to do, first and foremost, was define a competition for creating datasets, so they set up the basic infrastructure: a standard pool of data on which you can try out different data-processing algorithms. They processed all the Common Crawl dumps to produce DCLM-pool, which has 240 trillion tokens. That's a lot of tokens, but as you know, Common Crawl is not the highest quality on average, so it gets filtered down quite a bit. They had a particular recipe for filtering DCLM-pool down into DCLM-baseline, and here they used a quality filter very aggressively. It looks like this: some rule-based filtering, basic stuff, and then — the main interesting thing — a fastText filter that filtered DCLM-pool into DCLM-baseline, keeping only about 1.4% of the total data. So what do they do for this model-based filtering? Again, for quality filtering you define positive and negative examples and train a classifier. The positive examples come from two sources. One is the OpenHermes dataset, which is mostly GPT-4-generated instruction data — which is kind of interesting: they're using instruction data to curate pre-training data, so they're not explicitly training on instruction data, but they're looking for data that looks like instruction data. The other is ELI5, which is basically the subreddit "Explain Like I'm 5," and this is what that data looks like — "What's the point of wasting the first two plays with a rush?" — the kind of questions you might ask a chatbot. The negative examples are just sampled from RefinedWeb, which isn't low-quality data, but it's not as curated as the other two sources. They train a fastText classifier on these and run it on all of DCLM-pool, and the result is that DCLM-pool, 240 trillion tokens, gets reduced to 3.8 trillion tokens, which is still a sizable chunk. Here's one of the results tables. The "core" benchmark includes a bunch of standard language-modeling benchmarks like HellaSwag, and they show that using this classifier they outperform RefinedWeb by 3% and a bunch of other things by one or two percent. So that's the procedure used to create DCLM-baseline, from which you can train a pretty reasonable model. It's worth noting that afterwards, the second OLMo model started training on DCLM-baseline as well. So I think the era of "we're going to be unbiased and avoid using models in the loop" has largely gone away, because people realized that with models in the loop you can do a much better job of getting high-quality data — at least of increasing benchmark scores.
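A minimal sketch of a DCLM-style fastText quality classifier, assuming positives (OpenHermes / ELI5 text) and negatives (random RefinedWeb text) have been written into fastText's `__label__X <text>` training format; the file name and hyperparameters are placeholders.

```python
import fasttext

# quality_train.txt lines look like:
#   __label__hq <curated instruction-like document>
#   __label__lq <random web document>
model = fasttext.train_supervised(input="quality_train.txt", lr=0.1, epoch=5, wordNgrams=2)

def quality_score(text: str) -> float:
    """Probability that a document resembles the curated positives."""
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Filtering then amounts to scoring every document in the pool and keeping only
# the top-scoring fraction (DCLM keeps on the order of a few percent).
```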
The final pre-training dataset I'll talk about is Nemotron-CC. This came out of NVIDIA, which has been training the Nemotron models, although more recently they've focused on post-training and data curation. Their main thesis is that DCLM-baseline is a great dataset, but it filters very aggressively — remember, it went all the way from 240 trillion tokens down to 3.8 trillion — and if you want to train larger models for longer, you need more tokens; 3.8 trillion isn't really enough to sustain, say, a 400-billion-parameter training run. The first interesting thing they did was run ablations for HTML-to-text extraction not just on quality but on how many tokens were left — they really didn't want to throw away tokens — and it turned out that jusText kept more tokens than trafilatura, so they went with jusText. Then they used a bunch of different techniques for quality filtering. They prompted their gigantic Nemotron model to score documents by educational value, distilled that into a faster model, and used it as a filter — a filter based on what a language model thinks is educationally valuable. They also used the DCLM classifier. There's an interesting way they ensembled these: they ran all the classifiers, bucketed documents by their scores, and sampled from each bucket (sketched in code below), not just taking the top, because they wanted good coverage over the different "experts'" opinions of what quality means. They also did something interesting with a language model beyond filtering: rephrasing. For low-quality data, they use a language model to rewrite it into something that looks higher quality — obviously there could be mistakes, but in the grand scheme of things it's maybe no worse than training on low-quality Internet data. For high-quality data, they use the language model to generate things that look like tasks: take a Wikipedia article and ask a language model to create input-output pairs, where the input might be a question and the output an answer, or the input is "summarize this document" and the output is a summary, or the input is "extract the key information" and the output is that information. Again, this gets at the idea that eventually we want instruction following, so we might as well get a head start. They got 6.3 trillion tokens out of this, almost double the 3.8 trillion, which is pretty good given that all of it comes from Common Crawl. For reference, Llama 3 was trained on 15 trillion tokens and Qwen 3 on 36 trillion, which I think includes multimodal data, so 6.3 trillion isn't enormous at this point, but for an open dataset it's pretty decent — and for most of us, 6.3 trillion is more than enough to train even one epoch. This table shows that on average their Nemotron-CC data is better than the DCLM data, which in turn had been shown to be better than FineWeb, at least on benchmarks, and they also have a 1-trillion-token high-quality subset that's even better. Any questions before I move on? I know this was a lot of specific detail about different models and datasets, but hopefully it gives you a sense of the kinds of things people do, and you can see the patterns: whether to use models to filter or not, whether to use links out of high-quality pages or the pages themselves, and so on. [Question] Yes, this has all been English. Good point — the question is about multilingual datasets. I've focused on English because that's where most of the research is, but Common Crawl obviously has multilingual data, and multilingual datasets are produced as well.
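A rough sketch of the bucket-and-sample idea described above for Nemotron-CC, with the ensemble reduced to a simple average of whatever scorers you have (for example, a DCLM-style classifier score and a distilled educational-value score); the bucket edges and sample sizes are assumptions, and the actual Nemotron-CC recipe is more involved.

```python
import random
from typing import Callable

def bucket_and_sample(docs: list[str],
                      scorers: list[Callable[[str], float]],
                      samples_per_bucket: int = 1000,
                      edges: tuple[float, ...] = (0.2, 0.4, 0.6, 0.8)) -> list[str]:
    """Bucket documents by an ensemble score and sample from every bucket,
    rather than keeping only the top-scored documents."""
    buckets: dict[int, list[str]] = {i: [] for i in range(len(edges) + 1)}
    for doc in docs:
        score = sum(s(doc) for s in scorers) / len(scorers)   # simple average ensemble
        buckets[sum(score >= e for e in edges)].append(doc)   # index of the score's bucket
    sampled: list[str] = []
    for members in buckets.values():
        random.shuffle(members)
        sampled.extend(members[:samples_per_bucket])
    return sampled
```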
OK, in the interest of time, let me move on to copyright — an early question asked how much of the web is copyrighted. Let's understand what copyright is really about. Nowadays there are a lot of lawsuits around generative AI, mostly around copyright. Copyright law falls under intellectual property law, whose goal is to incentivize the creation of intellectual goods — that's why copyright law exists. There are many types of intellectual property: copyrights, patents, trademarks, trade secrets. Copyright is the one most relevant to training data. It goes back to the 1700s in England, but in the US, since 1976, the Copyright Act has established what copyright means. It applies formally to "original works of authorship fixed in a tangible medium of expression." It has to be original work: a mere collection isn't copyrightable — telephone directories aren't copyrightable unless there's some creativity in their selection or arrangement. Copyright also applies only to expression, not ideas: you can't copyright an algorithm, but you can copyright the code. One thing the Act changed is that copyright used to apply only to things that had been published; now it's the looser notion of being fixed. In general, copyright has increased in scope, and registration is not required. This is different from patents: if you invent something and don't register it, you have no claim to it, whereas if you put something up on your website, it's copyrighted even if you never write "copyright" on it. Registration is required before a creator can sue for infringement, but the bar is low — around $65, as opposed to patents, which can cost thousands. It now lasts 75 years, after which the copyright expires and the work enters the public domain — that's why the classics and most of Project Gutenberg are out of copyright. What people might not realize is that most things on the Internet are actually copyrighted, so the issue isn't whether something is copyrighted but whether you can use it. There are two ways: you get a license, or you appeal to the fair use clause. Going the license route, you sign a contract with the creator to use the data on some terms — this is, for example, the relationship Google and Reddit have — and effectively a license says "don't sue me." There's a special type of license called a Creative Commons license, which allows free distribution of copyrighted work. Creative Commons content is still copyrighted; you just have a license that lets it act as if it were in the public domain. Wikipedia, for example, is all Creative Commons licensed, and a lot of YouTube videos are too. Creative Commons was created almost 20 years ago to bridge the gap between the public domain and copyright: you want people to be able to use things without waiting 75 years, and there are cases where the creator is happy for people to use their content — most of the time it's just unclear, because they haven't said yes or no. Many model developers now license data for training foundation models: for example, Google and Reddit, OpenAI and Shutterstock, OpenAI and Stack Exchange, and so on.
If you have money, you go get a license; if you don't, then I guess you say you're a poor academic and maybe they'll let you use it. But the problem is that you can't get a license to the Internet — for a random website, who would you even go talk to? So the only way to use it legally is to appeal to fair use. Fair use says that even if something is copyrighted and you don't have a license, you can still use it under some conditions, determined by four factors. First, the purpose and character of the use: if you're using it for education rather than commercially, or you're transforming the work in some way rather than just copying it and hosting it on your website, that helps you. Second, the nature of the work: if it's factual rather than fictional, it's more likely to be fair use — you can't really copyright things that are close to facts, like the telephone book or the weather. Third, the amount used: if you use just a snippet, it's more likely to be fair use, although for language models that doesn't really help, because you probably want to train on all of it, not just a snippet. And fourth, the effect on the market: if you're using the work essentially to displace the creator, that's seen less favorably than using the work to do something completely different. So if you watch a movie and write a summary, that's fair use; if you reimplement an idea, that's fine. There was a big, decade-long fight over whether Google Books showing snippets was fair use, and the courts eventually ruled in favor of Google. It's also worth noting that copyright isn't just about verbatim memorization: plots and characters can be copyrightable, so even with very little n-gram overlap, if you take Harry Potter the character and develop it further, that can be a violation of copyright — though a parody might be fair use. These things are quite subtle; copyright is all about the semantics, the economics, and the kind of content, so it's a very complicated topic. So what about training? One thing is that the copy is right there in the name: even the first step of training, copying the data, is technically already a violation, even if you do nothing with it. You could argue, as many have, that training an ML model is transformative — it's definitely far from copy-and-pasting. This does make open-source models with open data a bit challenging: if you want to train a model and also show people your data by hosting it, that could itself be a violation of copyright. People have also argued that a machine-learning system is interested in the idea, not the expression: you train on all this data to extract how language works and general knowledge, not out of interest in any particular work. But of course models can memorize, and you can extract training data from them quite easily. And there's also the problem that language models can definitely affect the market, regardless of copyright.
One other thing to note: even if you have a license, or can appeal to fair use for a particular work, you still might not be able to legally get the data because of terms of use. For example, YouTube has a lot of Creative Commons videos, but writing a script that downloads videos from YouTube is against YouTube's terms of service. So there's another gate imposed by the platforms, and there's a bunch of work on this you can read about later. OK, let me quickly go on in the interest of time. This section is going to be a bit shorter, and I've collapsed mid-training and post-training together because the boundary often isn't clear. Here we're thinking less about quality in general and more about how to instill particular capabilities — although even in pre-training we were already thinking about quality classifiers and high quality, so again the line isn't sharp. One thing we haven't really talked about in this class is long context. The top models have quite a lot of context — Gemini still has a very long context window, I believe, and Llama 4 might advertise a 10-million-token context — but transformers scale quadratically with sequence length. As we saw in the inference lecture, you can get around some of that, but you still need some full attention to get the best results, and you clearly don't want to train on long context from the beginning. So people add it later, which is why long-context extension often shows up in mid-training: you don't want to waste cycles training on long context while your model isn't very good yet. There are multiple ways of doing this, but since this is the data lecture, I'll just note that books and math are two sources that have been used for context extension — basically, you need data with long-range dependencies, and some of it can also be synthesized. People also look at tasks. There's a line of work that converts traditional NLP benchmarks into a standard format that models can be fine-tuned on. Super-Natural Instructions is one such dataset: the community came together to create about 1,600 tasks and standardize them into prompts. FLAN came out around the same time, in 2022, although the follow-up paper was in 2023. 2022 was the year of "take all the NLP tasks and shove them into instruction-following format." One advantage is that you get a language model that can solve all your favorite NLP tasks and you benefit from transfer learning — very much in the spirit of T5. One problem is that the prompts are very templatized: if you look at Super-Natural Instructions, some of it is not so natural, because the examples all look kind of the same. That motivates the instruction-following datasets that came next. And since 2022 there's been the expectation that a language model should just be able to answer any one-off task you give it, so the notion of a "task" sort of disappears.
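To illustrate what "shoving an NLP task into instruction-following format" means, here is a toy example — these are my own made-up templates, not actual FLAN or Super-Natural Instructions templates — that turns an NLI example into a (prompt, target) pair.

```python
import random

TEMPLATES = [
    "Read the premise and hypothesis, then answer entailment, neutral, or contradiction.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:",
    'Does the premise "{premise}" entail the hypothesis "{hypothesis}"? '
    "Reply with entailment, neutral, or contradiction.",
]

def to_instruction(example: dict) -> dict:
    """Turn an NLI example {premise, hypothesis, label} into a (prompt, target) pair."""
    template = random.choice(TEMPLATES)   # template variety, as these collections use
    return {"prompt": template.format(**example), "target": example["label"]}

print(to_instruction({"premise": "A dog is running in the park.",
                      "hypothesis": "An animal is outside.",
                      "label": "entailment"}))
```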
A lot of the work in the open community has been based on synthetic data, starting with Alpaca, which used the Self-Instruct idea of prompting a language model to generate examples that you then use for fine-tuning. There's Vicuna, which used conversations shared by users on ShareGPT, now deprecated. You can get language models to chat with themselves, seeded with some questions, and that creates synthetic data. There are the Evol-Instruct methods, which take questions and make them more complicated. There are other approaches: one takes Common Crawl, identifies quiz sites, and extracts QA pairs using a language model. And then there's OpenHermes, which we saw earlier in the DCLM work — an agglomeration of different datasets. For Llama 2 Chat we don't know the exact dataset, but they used annotators to write high-quality instruction data, and they claim in the paper that this was better than using millions of examples from open datasets — though they could have saved even more money with less annotation and more RLHF, which we'll talk about later. Finally, the last dataset I'll mention came out pretty recently: the Llama-Nemotron post-training data. There aren't many details about it, but the dataset is released, so you can go look at it. They include public datasets like WildChat, plus data synthetically generated from all the models you're allowed to generate data from, and reasoning traces, thanks to R1. If you look at this data, you can divide it into a few buckets. A lot of the early work was just "there's GPT-4" — the easiest way to generate synthetic data. The problem is that, while that's fine for academic research, it's against OpenAI's terms to use GPT-4 to create a dataset and then train a competing model, whereas the open-weight models have more permissive licenses, which means you can distill from them and do what you want — there may be some restrictions on Llama, but broadly speaking they're more permissive than OpenAI. And finally, if you're really paranoid, you can hire annotators to create high-quality instructions, which is obviously more expensive and slower — and there's the worry that the annotators might themselves use GPT-4 to create your data, so you have to be careful there.
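A rough sketch of the Self-Instruct loop behind Alpaca-style data generation; `generate` stands in for whatever LLM call you have access to (hypothetical here), and the real pipeline adds ROUGE-based similarity filtering and other checks on the outputs.

```python
import json
import random
from typing import Callable

def self_instruct_round(pool: list[dict],
                        generate: Callable[[str], str],
                        n_new: int = 4) -> list[dict]:
    """One generation round: show a few seed tasks, ask for new ones, lightly filter."""
    demos = random.sample(pool, k=min(3, len(pool)))
    prompt = "Here are some example tasks:\n\n"
    for d in demos:
        prompt += f"Instruction: {d['instruction']}\nOutput: {d['output']}\n\n"
    prompt += (f"Write {n_new} new, diverse tasks as a JSON list of objects "
               'with "instruction" and "output" fields.')
    candidates = json.loads(generate(prompt))          # assumes the model returns valid JSON
    seen = {d["instruction"] for d in pool}
    return [c for c in candidates
            if c.get("instruction") and c["instruction"] not in seen]
```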
It's really the data that drives the quality there, assuming you can train on and fit the data. There are some legal and ethical issues here; we talked about copyright, but there's much more to be said. And finally, if you think that this whole field is a mess, you're right. It's very heuristic, which means there are many opportunities to hopefully improve it. Okay, that is it, and I'll see you on Thursday.
Latest Summary (Detailed Summary)
Overview / Executive Summary
The central claim of this lecture is that data is the most important ingredient in training language models. The speaker, Percy Liang, notes that while model architectures and training procedures are increasingly public (e.g., Llama 3), top companies remain secretive about the composition and processing of their training data, underscoring data's role both as a core competitive advantage and as a way to limit legal exposure. The lecture traces the evolution of language-model training data, from early sources such as BooksCorpus and Wikipedia to the large-scale web crawl Common Crawl and its many filtering and refinement pipelines, and lays out a clear technical progression: from heuristic rule-based filtering (e.g., C4), to filtering with weak models (e.g., CCNet), to filtering guided by stronger models (e.g., DCLM), and on to synthetic data augmentation (e.g., Nemotron-CC).
Training is divided into three main stages: pre-training (on massive, relatively low-quality raw data), mid-training (on curated, higher-quality data to strengthen specific capabilities such as code or long context), and post-training (fine-tuning on instruction/chat data). The lecture stresses that the data-processing pipeline itself (HTML-to-text conversion tools, deduplication, quality filtering) has an enormous impact on model performance. It also examines the legal and ethical issues around data, in particular the "fair use" doctrine in copyright law, which is the legal basis for training on large amounts of internet data without a license, though its boundaries are blurry and contested. Finally, the speaker concludes that current data-processing practice relies heavily on heuristics and lacks a unified scientific theory, which is both a challenge and a large opportunity for future research and innovation.
1. Introduction: Data Is the Core Moat of Model Training
- Core claim: Percy Liang argues that data is "the most important thing to get right when training language models."
- Industry evidence: Many companies (e.g., with Llama 3) publish the model architecture and training details but are extremely secretive about the training data.
- Reasons for secrecy:
- Competitive dynamics: A carefully curated dataset is a key differentiator of model performance.
- Copyright liability: Avoiding further lawsuits over the use of copyrighted material.
- The evolution of data work:
- Before foundation models: The focus was annotating data for supervised learning.
- Now: Less annotation, but data curation and cleaning remain a huge and highly scalable effort.
- Training stages: Training follows a low-quality-to-high-quality progression.
- Pre-training: Train on massive amounts of raw text (e.g., web pages) to build base capabilities.
- Mid-training: Continue training on smaller amounts of higher-quality data to strengthen specific capabilities (e.g., math, code).
- Post-training: Fine-tune on instruction or chat data so the model can follow instructions and hold conversations.
- Terminology:
- Base model: the model after pre-training and mid-training.
- Instruct/chat model: the model after post-training.
2. Evolution of Pre-training Data Sources and Processing Methods
Pre-training datasets have evolved from simple combinations of sources to complex, multi-stage filtering and generation pipelines; the core question is how to distill high-quality content from massive, noisy internet data.
| Dataset/Method | Origin/Associated Model | Core Processing Method | Notes |
|---|---|---|---|
| BooksCorpus & Wikipedia | BERT (2018) | Direct use of high-quality sources | The foundation of early models; BooksCorpus was later taken down over copyright issues. |
| WebText | GPT-2 (2019) | Select web links with high Reddit karma | Pioneered the use of social signals to filter web data. |
| Common Crawl | - | Large-scale monthly web crawl | The raw source of internet data, but quality varies widely and it is full of noise. |
| CCNet | Meta (RoBERTa) | Model-based filtering: an n-gram model keeps documents whose style resembles Wikipedia. | First large-scale use of a model to judge data quality. |
| C4 (Colossal Clean Crawled Corpus) | Google (T5) | Heuristic rule filtering: many manual rules based on sentence length, punctuation, blocked words, etc. | Rule-driven, avoiding the biases that model-based filtering can introduce. |
| The Pile | EleutherAI | Aggregation of many high-quality sources: 22 datasets across domains, e.g., PubMed, GitHub, Books3. | Driven by the open-source community, but includes copyright-disputed data from "shadow libraries" (e.g., Bibliotik). |
| RefinedWeb | Falcon | Web data only, strictly filtered: uses only Common Crawl, refined with strict rules and fuzzy deduplication. | Argues that well-filtered web data alone is enough to train strong models. |
| Dolma | AI2 (OLMo) | Multi-source mix with a standard pipeline: Common Crawl, code, Reddit, etc., processed with language identification, quality and toxicity filtering, and deduplication. | Representative of open-model data-processing practice. |
| DataComp-LM (DCLM) | Multi-institution collaboration | Standardized benchmark plus a strong quality classifier: builds a huge pool (DCLM-pool) and filters it with a classifier trained on GPT-4-generated data. | Marks the stage where data filtering is guided by stronger models (e.g., GPT-4). |
| Nemotron-CC | NVIDIA | Classifier ensembling plus synthetic rewriting: combines multiple classifiers and uses a large model to rewrite low-quality data or generate QA pairs from high-quality data. | Aims to fix the data shortage caused by DCLM's aggressive filtering by adding data augmentation. |
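To illustrate the model-based quality filtering that classifier-driven pipelines in the table rely on, here is a minimal sketch using fastText (the kind of lightweight classifier used by DCLM); the training-file name, label scheme, and threshold are illustrative assumptions, not the actual CCNet or DCLM configuration:

```python
# Minimal sketch of quality filtering with a fastText classifier: train on
# examples labeled high/low quality, then keep only the documents the
# classifier scores above a threshold.
# Assumptions: `quality_train.txt` contains lines like
#   "__label__hq <some high-quality text>" and "__label__lq <junk text>",
# and `documents` is an iterable of raw document strings.

import fasttext

def filter_by_quality(documents, train_path="quality_train.txt", threshold=0.9):
    model = fasttext.train_supervised(input=train_path)
    kept = []
    for doc in documents:
        text = doc.replace("\n", " ")  # fastText predict expects single-line input
        labels, probs = model.predict(text)
        if labels[0] == "__label__hq" and probs[0] >= threshold:
            kept.append(doc)
    return kept
```

The interesting design choice is what counts as the positive class: Wikipedia-style text for CCNet's n-gram model, versus instruction-style data generated with strong models for DCLM's classifier.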
3. Key Data Sources in Detail
- Wikipedia:
- Strengths: High quality, factual, with citations to reliable sources. Often used as a proxy for, or seed of, "high-quality data".
- Limitations: Contains no original ideas, opinions, or everyday content such as recipes; coverage is biased.
- Risk: Vulnerable to data poisoning. An attacker can inject malicious edits shortly before a dump is taken, briefly contaminating the dataset and thereby the models trained on it.
- Common Crawl:
- Mechanism: A crawl run roughly every month, performing breadth-first search (BFS) from seed URLs.
- Formats: Provides raw HTTP responses (WARC) and pre-extracted text (WET).
- Key detail: The HTML-to-text conversion tool (e.g., trafilatura) has a significant impact on final model performance and works far better than the lossily converted official WET files (see the sketch at the end of this section).
- GitHub:
- Value: Provides code for training programming ability, and its structured logic is also believed to help reasoning.
- Processing: Requires license filtering, deduplication, and extracting usable code from repositories. The Stack is an open code dataset built from GitHub.
- Stack Exchange:
- Value: Its question-answer format is naturally close to instruction tuning and real use cases, making it high-quality "quasi-instruction" data.
- Metadata: Upvotes, comments, and other metadata can be used to further select high-quality content.
- Shadow libraries:
- Sites such as LibGen, Z-Library, and Bibliotik offer large numbers of books with no regard for copyright.
- The Books3 component of The Pile comes from such sources, and Meta has been revealed to have used LibGen data, leading to serious lawsuits.
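As a concrete example of the HTML-to-text step highlighted under Common Crawl above, here is a minimal sketch using trafilatura; fetching a live URL (rather than reading WARC records) and the example URL itself are assumptions for illustration:

```python
# Minimal sketch: extract clean main text from HTML with trafilatura, one of
# the HTML-to-text tools mentioned for Common Crawl processing.
# The URL is only an illustrative placeholder; a real pipeline would read the
# HTML payloads out of WARC files and call extract() on each record.

import trafilatura

url = "https://en.wikipedia.org/wiki/Language_model"
html = trafilatura.fetch_url(url)                    # download the raw HTML
text = trafilatura.extract(html) if html else None   # strip boilerplate, keep main text
print(text[:500] if text else "extraction failed")
```

The lecture's point is that this extraction step is not a detail: the same Common Crawl snapshot run through a better extractor yields better downstream models than the officially provided WET text.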
4. Datasets for Specific Capabilities (Mid- and Post-training)
- Long context:
- Usually introduced in mid-training to save compute.
- Data sources: documents with long-range dependencies, such as books (Project Gutenberg), academic papers (arXiv), and some code.
- Task-oriented:
- Super-Natural Instructions / Flan: Convert large numbers of existing NLP benchmark datasets into a unified prompt (instruction) format and fine-tune on them in a multi-task fashion to improve generalization (see the sketch at the end of this section).
- Problem: The resulting prompts tend to be overly templatized and not very "natural".
- Instruction/chat:
- Synthetic data:
- Self-Instruct (Alpaca): Use a strong model (e.g., GPT-3) to generate instruction-response pairs.
- User-shared data (Vicuna): Use logs of real user conversations with models such as ChatGPT (ShareGPT).
- Human-annotated data:
- Llama 2-chat: Used roughly 27k high-quality instruction examples written by professional annotators.
- The paper claims this "worked better than using millions of examples from open datasets", emphasizing quality over quantity.
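As referenced in the task-oriented bullet above, here is a minimal sketch of converting classic NLP benchmark examples into an instruction-following format in the spirit of Flan / Super-Natural Instructions; the template wording and field names are illustrative assumptions, not the real templates:

```python
# Minimal sketch: render traditional NLP benchmark examples into a unified
# instruction format, as Flan / Super-Natural Instructions do. The templates
# and the example record below are illustrative only.

TEMPLATES = {
    "sentiment": ("Classify the sentiment of the following review as "
                  "positive or negative.\n\nReview: {text}\nAnswer:"),
    "nli": ("Premise: {premise}\nHypothesis: {hypothesis}\n"
            "Does the premise entail the hypothesis? Answer yes, no, or maybe.\nAnswer:"),
}

def to_instruction(task: str, example: dict) -> dict:
    """Render one benchmark example as an (input, target) pair for fine-tuning."""
    prompt = TEMPLATES[task].format(**example)  # unused keys like "label" are ignored
    return {"input": prompt, "target": example["label"]}

print(to_instruction("sentiment",
                     {"text": "The movie was a complete waste of time.",
                      "label": "negative"}))
```

This also makes the "templatized" problem visible: every sentiment example rendered this way looks nearly identical, which is exactly why later instruction datasets moved toward more natural, one-off user requests.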
5. Legal and Ethical Issues
- Copyright law:
- Core principle: Copyright protects expression, not ideas. The vast majority of internet content is automatically copyrighted.
- Paths to use: 1) obtain a license, or 2) appeal to fair use.
- Fair use:
- This is the main legal basis for training on copyrighted material without a license, but the test is complex and fuzzy.
- The four factors:
- Purpose and character of the use: Training a model is argued to be transformative, which favors fair use.
- Nature of the copyrighted work: Factual works fare better than creative works.
- Amount and substantiality of the portion used: Model training typically ingests entire works, which weighs against fair use.
- Effect on the market for the original: Language models may compete with writers and artists; this is the most contested factor.
- Terms of service:
- Even when the content itself might qualify as fair use (e.g., CC-licensed videos on YouTube), a platform's terms of service may prohibit scraping or downloading it, adding another legal barrier.
6. Key Conclusions and Outlook
- Getting data is hard work: Going from live services (e.g., GitHub) to raw snapshots to processed, trainable text takes substantial effort at every step.
- Data is the key differentiator: Architectures are converging, so the carefully curated dataset is what determines model quality and behavior.
- Legal and ethical issues loom large: Copyright and privacy are unavoidable challenges.
- The field is highly heuristic: Current data pipelines are full of manual rules and folklore, with little systematic science, which means there is huge room for improvement and innovation.