2025-06-17 | Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 14: Data 2

Language Model Data Filtering and Deduplication, Explained

Media details

Upload date: 2025-06-17 12:06
Source: https://www.youtube.com/watch?v=9Cd0THLS1t0
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: So this is the second lecture on data. In the last lecture, we talked about the different datasets that were used to train various language models. We did a historical overview, from the datasets used to train BERT all the way up to OLMo and everything in between. One of the things I wanted to emphasize is that data doesn't just fall from the sky. It often lives in live services and has to be explicitly crawled or dumped, and then a bunch of processing has to happen: converting the raw HTML into text, doing all sorts of quality, language, and toxicity filtering, deduplication, and so on. So in this lecture I'd like to take a deep dive into the mechanics of how quality filtering and deduplication work in particular. First we're going to look at different algorithms for filtering, which are mostly model based: you train a classifier, or some sort of model, to filter. Then we're going to show that this primitive can be used for many different types of filtering. And finally, we're going to look at some algorithms for deduplication. This lecture will be a little different in that it focuses a lot on classical big-data processing algorithms, and there will be some fun math in there for some of you.

Okay, so let's get started with filtering algorithms. The basic high-level picture is that you're given some target data T, which you don't have very much of (suppose it's data that's high quality), and you have a lot of raw data R. Imagine R is Common Crawl, and your goal is to find a subset of that raw data, call it T', which is similar to T. This is a pattern you should recognize in basically any filtering pipeline. What do you want from this filtering algorithm? You obviously want it to generalize from T: you already have T, so there's no point in getting back exactly the same data. And it has to be extremely fast, because you have to run it on all of the web. If filtering requires running a huge model, that might be as expensive as training, and you definitely don't want that; otherwise you might as well just train.

We're going to look at three different ways to implement this high-level primitive. We'll start with training an n-gram model, which came up in the last lecture. You can train an n-gram model with Kneser-Ney smoothing. You can use other forms of smoothing, but Kneser-Ney has been inherited from the n-gram era of statistical language processing. There's a particular implementation that people tend to use in this space called KenLM, which is open source, was originally developed for machine translation, and implements Kneser-Ney smoothing. You can go read about Kneser-Ney, it's pretty interesting, but we're not going to talk too much about the details here. KenLM happens to be very popular in data filtering; there's no particular reason it has to be this one, it's just what people use. The nice thing about n-gram modeling is that it's very simple: when I say fit a model, you should think of it as just counting n-grams and normalizing. So to go into a little bit of depth, the starting point is maximum likelihood estimation of language models.
So you have a corpus of text, and you want to estimate conditional probabilities like p(in | the cat), the probability of a word given the last n minus 1 words. What you do is count the number of n-grams and divide by the number of (n-1)-grams of the conditioning context. That's very easy in principle. Implementing it efficiently takes a bit of work, but the main problem is sparse counts: many n-grams show up exactly zero times even though they're perfectly reasonable, especially as n grows. This is the whole reason n-gram models couldn't really be scaled up; you hit the curse of dimensionality. To mitigate this, people devised Kneser-Ney smoothing to handle unseen n-grams. There are more details to it, but roughly it says: if I don't have enough data to support this count, I can back off to a lower-order n-gram. Remove one word from the context, count the number of times "in" appears given "cat", and then interpolate with or back off to that. Okay, that's all I'll say about how n-gram modeling works; even in this era of neural language models, it's good to know how n-gram models work.

So let's download an n-gram model. This one has been trained on Wikipedia; I'm just going to download it and punch some sentences through. The sentence "Stanford University was founded in 1885" is an excerpt taken from Wikipedia, so it should be reasonable. You compute the log probability, then normalize to get the perplexity, and you get 187. Okay, some number. Here's another example, taken from the CS336 website, so let's see how well it does. This one gets a higher perplexity, which means it's less likely. That kind of makes sense, because this content probably doesn't show up in Wikipedia, but it's not abnormally high, because it's still fairly well-formed English and many of the n-grams probably do show up in Wikipedia. What about "asdf asdf asdf"? That's just gibberish and gets assigned a much higher perplexity. And then there's "the the the ...", which should get assigned a pretty high perplexity, but for some reason this particular n-gram model assigns it a pretty low one. So it's an n-gram model; it's not particularly clever or good, but you don't need the best model. You're just doing data filtering, so it can be very crude and fast.

CCNet, a paper out of Facebook that we talked about last time, basically did this. They took paragraphs of text as the items, sorted the paragraphs by increasing perplexity, and kept the top one third. This is what was used to create the first LLaMA dataset. It's a heuristic, but if you had to write something down really quickly, it kind of makes sense. So Kneser-Ney language modeling is fast, but it's crude. I feel like there was some content here that I meant to go through that's missing; oh well, maybe it got deleted. Anyway, you can fit an n-gram model to your target data, use it to score the raw data, and keep the top-scoring documents.
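As a concrete reference, here is a minimal sketch of this scoring step, assuming the kenlm Python bindings and a pretrained ARPA model at the path used in the demo (var/en.arpa.bin); the tokenization, the load_paragraphs helper, and the keep rule are illustrative placeholders, not the CCNet implementation.

```python
import kenlm  # Python bindings for KenLM (assumed installed)

# A Kneser-Ney model trained on the target domain (the demo used var/en.arpa.bin).
model = kenlm.Model("var/en.arpa.bin")

def perplexity(text: str) -> float:
    # model.score returns log10 P(text); normalize by token count so that
    # short documents are not unfairly favored.
    log10_prob = model.score(text, bos=True, eos=True)
    num_tokens = len(text.split()) + 1  # +1 for the end-of-sentence token
    return 10 ** (-log10_prob / num_tokens)

# CCNet-style filtering: score paragraphs and keep the lowest-perplexity third.
paragraphs = load_paragraphs()  # hypothetical helper
ranked = sorted(paragraphs, key=perplexity)
kept = ranked[: len(ranked) // 3]
```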
Another thing you can do is fastText, which is in some ways more popular. This is also from Facebook; they released a lot of open source tools, although KenLM, I think, was created before that. fastText was developed for text classification. The paper was from 2016, when a lot of people were developing all sorts of fancy neural models, and they showed that this essentially linear classifier worked just as well and was much faster, which is why the software package became very popular. The motivation is that if you're doing, say, bag-of-words classification, you have a sentence of length L, your vocab size is V, and you want to classify into, say, 64 classes. Then immediately you have to define a V-by-K matrix, which, if V and K are both large, gets pretty big, and you run into sparsity issues. The way it would work is that you take your documents and just do linear classification, which is fairly straightforward; the problem is that the number of parameters is huge. So fastText says: let me just do a linear dimensionality reduction. Instead, I map my vocabulary into a hidden dimension H, which is much smaller, maybe 16, and then classify from H. Notice that when you look at the prediction, the forward pass, there's no nonlinearity; it's just a linear classifier. You can also think of it as a matrix factorization if you want. And the number of parameters is greatly reduced. The implementation is reasonably optimized: it's parallelized and uses asynchronous SGD.

Now, this is bag of words, and we want to look not just at words but at n-grams. The paper extends to n-grams in a fairly simple way. You take a sentence and chop it up into its n-grams, and the problem is that the number of bigrams can get quite large; it's effectively unbounded. Remember when we talked about tokenizers: if you just chop text up into words, you don't know how many distinct words there will be. The simple fix is hashing. You define some number of bins, maybe 10 million, but in this example eight, and you hash every bigram into one of them. So "the cat" maps to 2, "cat in" maps to 1, and so on. Of course there might be some collisions, but you just live with that; in some sense, when you minimize the loss, collisions are accounted for. If two n-grams that have nothing to do with each other collide, the weight will end up representing something like the average of the two. Also note that we often use fastText classification with K equal to 2 classes: is this document good or is it bad? In that case it's just normal binary linear classification.

Of course, it doesn't take much imagination to see that if you can do linear classification, you can also use fancier models such as BERT or even LLaMA; the only thing is that it's going to be slower. And then you run into the tradeoff where, if you use a giant model to do classification, maybe you should have used those FLOPs to just train the model. That's the main concern, because the web is very large, and remember, you're filtering the web down quite a bit, say down to 1% of the web.
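Here is a rough sketch of training such a quality classifier with the fasttext package; the file name, the __label__hq / __label__cc labels, and the hyperparameters are placeholders for illustration, not the settings used by any particular paper.

```python
import fasttext  # the fastText Python package (assumed installed)

# Training file in fastText's supervised format, one example per line, e.g.
#   __label__hq  Stanford University was founded in 1885 ...
#   __label__cc  <random Common Crawl text>
model = fasttext.train_supervised(
    input="quality_train.txt",  # hypothetical path
    dim=16,             # small hidden dimension, as described above
    wordNgrams=2,       # add bigram features on top of unigrams
    bucket=10_000_000,  # hash n-grams into 10M bins
    epoch=5,
    lr=0.1,
)

labels, probs = model.predict("Some candidate web document ...")
keep = labels[0] == "__label__hq" and probs[0] > 0.5
```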
So you have this classifier, and the amount of compute you spend on it has to be something like one hundredth of the cost of taking a forward pass through the dataset. Okay, the final method I'm going to talk about is called DSIR: data selection for language models via importance resampling. The basic idea is that you have your raw dataset R and your target data T, you build an importance weight estimator, which is the analog of the KenLM n-gram model or the fastText classifier, and then you use it to select a bunch of documents.

First, a quick refresher on importance resampling, which shows up in many contexts like particle filtering and Monte Carlo methods. You have a target distribution p (a distribution, not a dataset) that you would ideally like to draw samples from, but you can only draw samples from a proposal distribution q. For example, say your vocabulary is {0, 1, 2, 3}, this is your target distribution, and your proposal distribution is this other one. You sample from q, because that's the only thing you can sample from, and you get a bunch of samples. Notice that q puts a lot more mass on zero, so you get a lot more zeros. Then you compute weights over your samples using the importance ratio p over q: you're making up for the fact that you sampled from q when you really wanted p, so you divide that out. You normalize to get a distribution over your samples, and then you resample, and that rebalances things. Now you see more threes, because p had more mass on three. This is fairly elementary, but it's a building block.

Now we go to how we do data selection. We have a target dataset Dp, not a distribution, and it's small. We also have the raw dataset, which we'll call the proposal dataset Dq, which is large. You could just fit a distribution p to Dp, fit a distribution q to Dq, and do the importance resampling we talked about two minutes ago. The main problem is that Dp is really small. Remember, the target is high-quality data; you don't have much of it, which is the whole point, since you're trying to get more of it. It's too small to estimate a good model. So again the idea is to use hashed n-grams, which is something we already saw with fastText. You take whatever text you have, hash each unigram (we're doing unigrams for now), and estimate the probability of every hashed n-gram from the counts. In this case "the", "cat", "in", "hat" hash to these values; notice there are some hash collisions, two of them hash to the same value, but whatever, this is all crude. Then you just estimate the probability of each bin. To evaluate the probability of a new text, you hash the new text and multiply the bin probabilities together. I got a little unlucky here and the probability gets zeroed out; you can do some smoothing if you want. It also turns out that Python's hash is not deterministic across runs, so I'll get something different every time I run this. And the paper shows that this indeed helps a bit.
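A toy sketch of this recipe, with made-up documents; the bin count, the add-one smoothing, and the use of Python's built-in (per-process salted) hash are arbitrary illustrative choices, not the DSIR implementation.

```python
import numpy as np
from collections import Counter

NUM_BINS = 10_000

def hashed_unigram_dist(docs, num_bins=NUM_BINS, alpha=1.0):
    """Estimate a distribution over hashed unigrams, with add-alpha smoothing."""
    counts = Counter(hash(w) % num_bins for doc in docs for w in doc.split())
    total = sum(counts.values()) + alpha * num_bins
    return lambda b: (counts.get(b, 0) + alpha) / total

def log_prob(doc, dist):
    # Log probability of a document under a hashed-unigram model.
    return sum(np.log(dist(hash(w) % NUM_BINS)) for w in doc.split())

# target_docs is small and high quality; raw_docs stands in for the large raw pool.
target_docs = ["the cat in the hat", "stanford university was founded in 1885"]
raw_docs = ["buy cheap pills now", "the dog in the fog", "stanford is a university"]

p = hashed_unigram_dist(target_docs)   # target distribution
q = hashed_unigram_dist(raw_docs)      # proposal (raw) distribution

log_w = np.array([log_prob(d, p) - log_prob(d, q) for d in raw_docs])
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

# Resample raw documents with probability proportional to their importance weights.
rng = np.random.default_rng(0)
selected = rng.choice(raw_docs, size=2, replace=False, p=weights)
```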
The gains are not super massive, but compared to fastText-style heuristic classification, there is some lift on the GLUE benchmark using BERT-style models. The main idea is that modeling the whole distribution is in some sense a more principled approach: you want to sample data from your target distribution, you have a different proposal distribution, and you're explicitly compensating for the mismatch. This could be better for diversity, because you're trying to match the distribution, whereas with fastText you're essentially classifying whether something is in the distribution or not, with no guarantees about matching the distribution. But it's also fast, like fastText, and both can be improved with better models: instead of linear models over n-grams, you can increase n or use neural approaches.

So to wrap up this part: we looked at n-gram models, linear classifiers, and importance resampling. But the general framework is what's worth taking away. The setup is: given a target dataset and a raw dataset, find a subset of the raw data that's similar to the target. All these methods follow the same basic recipe: estimate some sort of model from the raw data and the target data, which gives you a scoring function, and then keep examples from the raw dataset based on that score, usually keeping high scores. Walking through it: in the KenLM case, the score is the probability, or the perplexity if you want, normalized by the number of tokens, under the target model. R is not used there, and you keep examples whose score passes a threshold, possibly randomizing a little around the threshold. With a discriminative classifier, the score is the probability that the example came from the target as opposed to the raw dataset, and you keep examples whose score is above some threshold. And importance resampling says the score is the ratio of two generative models, which could be n-gram models, the target distribution over the raw distribution, and you use that score as the importance weight and resample with probability proportional to it. At a high level, all of them are doing essentially the same thing: you fit some distribution that tells you what looks like target as opposed to raw, and then you apply it to the raw dataset. Maybe I'll pause here in case there are any questions.

speaker 2: A sort of high-level philosophical question: we have an idea of what good data is, but we haven't really talked about what a good document actually looks like. What this is doing is making sure the words fit next to each other, that one plausibly comes after the other, but it doesn't ensure that the document makes sense at a larger level. Or, if you cared about that, would you just increase the n in the n-gram?

speaker 1: Yeah, so the question, I think, is that if you're using an n-gram model, it's only looking at local context, and maybe that's not great for assessing quality, because if you shuffled everything around, it would still look good to the n-gram model. There's definitely the danger that it's very easy to game adversarially.
You can definitely construct examples that the n-gram model thinks are high probability; I showed you one, "the the the" happens to score pretty well. But somehow, on average, you're doing okay. In some sense, if a document looks like grammatical English, like news or a scientific paper, all you want is for the model to assign it high probability. If a few bad examples get through, it's not the end of the world.

speaker 2: So we should think of these as just filtering out the nonsense?

speaker 1: Yeah, I would say filtering out nonsense is a good way to look at it. As I go through some examples, I think it will become clearer that these classifiers are not just used for quality; for some of the other use cases it's maybe more compelling why this should work.

Okay, so the same data filtering machinery can be used for different tasks. I'm going to talk about language identification, quality filtering, and toxicity filtering. For language identification, the idea is that you want to find text in a specific language, maybe English, maybe Japanese. You could ask: why not just train on all the languages? One problem is that it can often be difficult to do the curation and processing of high-quality data in every language. There's also the problem that if you only care about one language but train on data from all the other languages, you're going to spend less compute and fewer tokens on the language you care about, and that can hurt your performance in that language. For example, BLOOM, which was trained around 2022, I believe, was only about 30% English, and as a result its English performance was maybe not as good as it could have been, though it was strong on other languages. So in a compute-limited regime, you're definitely at risk of compromising quality in the language you care about. That said, if you have enough capacity and you're training huge models, you can definitely train multilingual models, and there can even be positive transfer between languages. All the frontier models are trained heavily multilingually, sometimes on a hundred-plus languages.

One simple way to do language identification: fastText is a toolkit for training simple classifiers, but it also comes with some off-the-shelf models, and one that typically gets used is their language identification model. It's just an off-the-shelf classifier that supports a lot of different languages, and it was trained on multilingual sites, including Wikipedia, which covers many languages, translation sites, and, for some reason, Southeast European news. Dolma uses this: they run the fastText classifier and keep pages with English probability greater than 0.5. So let's try it out. There's this language ID model that you can just download, and let's see how well it does. "The quick brown fox jumps over the lazy dog": the prediction is the label English, and the probability of English is 0.71. I would have guessed it would be higher; it looks pretty English to me. If you duplicate the sentence, the probability doesn't change, which is good: saying the same thing twice doesn't make it more English.
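A sketch of that check using fastText's off-the-shelf language identification model (the commonly distributed lid.176.bin file); the 0.5 cutoff mirrors the Dolma rule described above, and the newline stripping is just because fastText's predict expects single-line input.

```python
import fasttext

# Off-the-shelf language ID model downloadable from fasttext.cc (lid.176.bin).
lid = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.5) -> bool:
    # predict returns labels like "__label__en" with their probabilities.
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

is_english("The quick brown fox jumps over the lazy dog.")  # True, p around 0.7
is_english("Bonjour tout le monde.")                        # False
```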
There's some informal English, which is kind of weird: it scores as more English than the quick brown fox sentence. German gets classified correctly; the label is German. Math gets classified pretty weakly as English, and code gets its highest score as, I think, Russian. A "hello" greeting is apparently Italian, and "Bonjour" is French, which is good. And here's a tricky one where I put in some Spanish and also some English, and I guess it favored the Spanish in this case. So this gives you a sense. It's always good to play with these classifiers, because just because it's a classifier you can download and everyone uses doesn't mean it works all the time. Obviously, for short sentences it's less reliable, because there's just less information about what's happening. Low-resource languages are tough. There's a worry that dialects of English could be classified as not English. Distinguishing between similar languages can also easily get confused. And for code-switching, I don't even know what the ground truth is, but hopefully it lands on one of the languages.

Let me talk through one case study. OpenWebMath was a paper from two years ago, and the goal was to curate a large corpus of mathematical text, where I'm treating math as if it were a language. First they used some rules to filter. Then they trained KenLM on a large dataset of proofs called ProofPile and kept a page if the perplexity was below some threshold. They also trained a fastText classifier, which we've talked about, to predict mathematical writing. This part is a bit hacky: they set the classifier threshold to be relatively low if the page was already identified as math by the rules, and higher if it wasn't. As a result, they produced about 15 billion tokens, and models trained on it do better than models trained on 20 times more data that wasn't specially curated. This is one of the nice examples of the utility of data filtering: you could train on all the data, but if you care about a particular domain like math, you can go out, target that data, and be much more data efficient.

Okay, so let's talk about quality filtering. Quality filtering is a bit of a catch-all, and what counts as quality is fuzzy. Remember that some papers explicitly don't use model-based quality filtering, but more recent papers have mostly given in and said, well, it just works a lot better. GPT-3, which we talked about a little last time, trained a quality classifier by taking positive examples from the high-quality, non-Common-Crawl sources they had, with negatives just sampled from Common Crawl. They trained a linear classifier and kept documents stochastically based on the score, using a rule that essentially sharpens the threshold. In the first LLaMA paper, they used pages referenced by Wikipedia (not Wikipedia itself) as positives, negatives sampled from Common Crawl, and then kept documents that were classified positive; I don't think there was actually any stochastic sampling there, which is interesting.
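For reference, the GPT-3 paper describes its stochastic keep rule as retaining a document when np.random.pareto(alpha) > 1 - document_score, with alpha = 9. A minimal sketch of that kind of rule, where score_fn is a stand-in for the trained linear classifier:

```python
import numpy as np

def keep_document(doc: str, score_fn, alpha: float = 9.0) -> bool:
    """Stochastic keep rule in the style of GPT-3's quality filtering:
    high-scoring documents are almost always kept, while low-scoring ones
    are occasionally kept so the tail of the web isn't cut off entirely."""
    score = score_fn(doc)  # probability the document looks like the target sources
    return np.random.pareto(alpha) > 1.0 - score
```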
phi-1 is another paper whose philosophy was kind of interesting. They really wanted high-quality data that looked like textbooks, and they wanted to train a very small model. So they trained on synthetic data, but also on filtered data. For the filtered data, they took the Python subset of the Stack (remember, the Stack is this preprocessed dataset of code from GitHub). They defined a prompt that effectively asked GPT-4 to determine the educational value for a student whose goal is to learn basic coding concepts, prompted GPT-4 to classify 100K documents from this Python subset, and those became the labeled examples. They trained a random forest classifier on these examples, where the features fed into the random forest were embeddings from a pretrained CodeGen model. Then they selected data from R that was classified positive: you run the classifier over all of R and keep what it accepts. This is interesting because of where T comes from. T doesn't really exist here; effectively, T comes from prompting GPT-4 with some prompt. This is something you'll see more and more as models get better: instead of saying, well, maybe Books1 is great or maybe Wikipedia is great, if you have a good model you can just ask it for the type of data you want. Maybe I want more chemistry data, or mathematical data with lots of proofs, or elementary math. You let a fairly sophisticated language model give you T, and once you have T, you turn the crank on the recipe we've been talking about.

The phi-1 results, again, look good. They had a baseline where they just took the Python subset of the Stack, which was the R here, and the performance on HumanEval, which is a coding evaluation, was only 12% after 96K steps of training. Then they trained the exact same model architecture on this new, fancier dataset, and it was already at 17% after 36K steps. So they declared success. These are some examples of quality filtering. Any questions about quality filtering? Yeah. So the question is why you can get away with using GPT-4 here. The point is that you're using it only to create 100K labeled examples, and then using those to distill a classifier which is much faster. And 100K is much less than the size of R, which is probably in the tens or hundreds of millions.

Okay, so let's talk a bit about toxicity filtering, and in particular how the Dolma paper does it. There's a dataset called Jigsaw Toxic Comments, which came out of a project that was trying to help people have better discussions online; they wanted to build a classifier that could identify problematic content. They took comments from the Wikipedia talk pages and had annotators label them as toxic, severely toxic, obscene, threat, insult, or identity hate. The Dolma folks trained two fastText classifiers on this, one roughly corresponding to hate and the other to NSFW content. Here are some examples from the dataset; let's download the model and try it out. The first example, "are you threatening me for disputing neutrality?", gets classified as safe for work, which seems fine. The second one gets flagged as NSFW, even though both of these sentences are fine. Okay, so that just gives you a sense.
I don't want to show too many of these examples, but you get the idea; you can play with it on your own time. Okay, that's all I'll say about filtering. Now let's talk about deduplication.

There are two types of duplicates. First, exact duplicates. The thing to realize is that the web inherently has a lot of exact duplicates because of mirroring. If you just look at Project Gutenberg, you'll see the exact same site mirrored on a bunch of different URLs. If you're just crawling, as Common Crawl does, there's no way of knowing these are mirrors; you just get the exact same content multiple times. Then there are near duplicates, which are basically the same text differing by a few tokens, and we're going to look at algorithms that deal with both. When do near duplicates happen? Well, you have terms of service and licenses, like the MIT license, which shows up I don't know how many times on the web; it's just copied and pasted everywhere, and maybe people change a few things, or they mess up the copy-paste and drop a comma. And if you look at these datasets, you see what near duplicates actually look like. Sometimes you have an article where someone seems to have paraphrased "best actor in a negative role" into "most impactful character" while everything else stays the same. This one is just missing a comma; not really sure where that comma went, but there can be annotator or processing errors. And this one is kind of interesting: there's clearly some ad-like text where they just replaced "Canada" with "USA" plus a few other slight changes, so it's probably generated from some template system where entities get slotted in. These near duplicates are different from each other, but once you have one of these documents, the other one doesn't give you much additional value. Let's just put it that way.

In an extreme case, look at C4. Remember, C4 was filtered with a set of rules that looked for sentences that looked like English. You'll find a sentence, which is indeed English, that shows up an astonishing 61,000 times. If you trace down where it comes from, it's not the MIT license; it's some random product page from Amazon with graffiti-style drawings (after clicking through the accept-cookies banner). There's this piece of text, and somehow there are 61,000 copies of it. The point is that this text isn't necessarily bad; it's perfectly good English, but you don't want to take 61,000 epochs over it. That's just pointless. So deduplication is complementary to quality filtering. Quality filtering says: this piece of data, I never want to train on it. Deduplication says: this might be fine, but I only want a small number of copies rather than 61,000.

There has been work showing that deduplication makes language models better. The first, most obvious benefit is that you can train more efficiently: deduplication reduces the number of tokens, and if you haven't thrown away information, you save a lot of effort. The second is that you can also avoid memorization.
We haven't really talked about this, but language models trained on data can memorize that data, which is bad for copyright or privacy reasons, because the model can just regurgitate the training set. Deduplication is one tool here; it's definitely not perfect, but it can help mitigate some of these risks.

Here's how to think about the design space of deduplication. First, what is the unit you're deduplicating; what is an item? Is it a sentence, a paragraph, a document? Second, how do you match; what counts as a duplicate? Is it an exact match, or some fraction of common sub-items? And finally, what action do you take? Do you remove all the instances, or all but one? The key algorithmic challenge is that deduplication is fundamentally about comparing items to other items. Quality classification takes one item and classifies it as high quality or not; deduplication is fundamentally pairwise. And remember, these datasets are large, so anything quadratic is a no-go. You need a linear-time algorithm that still does something pairwise.

The building block for all of this is hash functions. As a review, a hash function maps an item, which could be a sentence or a document, to a hash value, an integer or string that is much smaller than the item. The relevant issue with hash functions is hash collisions: two distinct items that map to the same hash value. If you've used dictionaries or hash tables, you know collisions are generally bad, but later we'll see they can actually be useful in limited doses. There are many types of hash functions that trade off efficiency against collision resistance. Cryptographic hash functions are used for security-sensitive applications, like in Bitcoin, where you really don't want collisions because someone might steal your money or compromise things; but these are slow. Then there are hash functions generally used for hash tables that are not collision resistant but are fast. We'll go with the latter category, because we want speed and we're not doing cryptography here. We're going to use MurmurHash, which takes strings and returns values, but you could use plenty of other hashes as well.

So let's start with exact deduplication. Here's a simple example: the item is just a string, I want exact match, and I want to remove all but one copy. Given a bunch of items, the first thing I do is compute a mapping from hash value to the list of items with that hash. Once I have that, I keep one item from each group, where each group corresponds to a hash value. In the case of exact match, assuming there are no hash collisions, you get exact deduplication: "hello" showed up twice, and now it shows up once, while "Hello!" with an uppercase H and an exclamation point is a completely distinct item, because we're doing exact matching. The nice thing is that this is very simple, it's clear what you're doing, and it's high precision: you're never going to throw away anything you needed to keep.
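A small sketch of that grouping step, assuming the mmh3 MurmurHash bindings; hashing whole strings here is just for illustration (C4, discussed next, uses three-sentence spans as the unit instead).

```python
from collections import defaultdict
import mmh3  # MurmurHash3 bindings (an assumption; any fast non-cryptographic hash works)

def exact_dedup(items):
    """Group items by hash value and keep one representative per group."""
    groups = defaultdict(list)
    for item in items:
        groups[mmh3.hash128(item)].append(item)
    # Keep the first item of each group (remove all but one).
    return [group[0] for group in groups.values()]

exact_dedup(["hello", "hello", "Hello!", "the cat"])
# -> ["hello", "Hello!", "the cat"]
```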
The con is that it doesn't work for near duplicates. The other nice thing is that it's very simple to implement; we wrote it in a MapReduce-like way, so you can scale and parallelize it fairly easily. This is what C4 actually uses to prepare their dataset: their unit of deduplication is three-sentence spans, they use exact match, and they remove all but one. One thing that has always kind of bothered me, though no one else seems to care, is that when you're deduplicating three-sentence spans, you have a document, and if a span inside it is marked as a duplicate, you just do surgery and cut those sentences out. The resulting document might not even be coherent. But people say, who cares, and move on.

Okay, so there's another way to do deduplication that is less MapReduce and more like a hash table: Bloom filters. Some of you might have seen these, but let's go through it because it's a cool method. A Bloom filter is efficient and approximate: it's more efficient, but it's not always doing exactly the right thing, and we're not being too picky here. The nice properties are that it's very memory efficient, you can update it (though you can't delete from it), and it's basically a representation of a set where, if you ask for set membership and it returns no, it's definitely a no; if it returns yes, then, if you set your hyperparameters right, it's most likely a yes, but there's a small probability of a false positive. You can set the parameters to drive the false positive rate down as far as you like.

Let's walk through what this looks like. Suppose you have five items, "the cat in the hat", and later we'll test with some items that don't show up in the set. First you define a hash function that maps into m bins; m is going to be 8 here. To build the Bloom filter, you do something fairly simple: you make a bit array with that number of bins, then go through every item and hash it. "the" hashes to 2, so I set bit 2; "cat" hashes to 7, so I set bit 7; and so on. I just populate this bit array, and I don't keep track of the items themselves, so I have no idea whether I'll get false positives. As a sanity check, every item I put in should test as a member: to query, you check whether the bit at the item's hash is set, and indeed everything passes. Now let's look at the non-items and see how they fare. Some of the non-items get a zero, which is what you want, but for some of them the Bloom filter says, oh, actually it's in the set. The number of mistakes here is four, and if you compute the false positive rate, the fraction of non-member queries that incorrectly come back positive, you get 0.44 in this case, which is pretty bad. Of course, this is with eight bins, and if the number of bins grows, this gets better; the collision probability scales like one over the number of bins.
Okay, but you look at this and say: that's not really good, because if I want the error probability to be, say, ten to the minus ten, then I need a lot of bins, and that's going to hurt on memory. There's a more clever solution, which is to use more hash functions. Say I use two hash functions. I still have the same bit array, but every item now gets hashed twice: "the" with seed 0 hashes to 2, so I set that bit, and "the" with seed 1 hashes to 5, so I set that one too. Every item gets hashed into k slots, not necessarily distinct. I go through "the cat in the hat" and fill up the table. Now when I query, I take an element and return 1 only if the bit was set for all k hash functions, because if I had inserted it, I would have set all k bits; so when testing, I check the k locations and require that they were all set. Indeed, all the inserted items pass. And for the non-items, the number of mistakes is now three, and the false positive rate decreased. That's great. Of course, this is a toy example, but you really can drive the false positive rate down with a modest amount of memory. Maybe I'll pause there before analyzing this a bit more formally.

Okay, so let's think about what happens more generally. Say we have a thousand bins, ten hash functions, and we're going to hash 100 items. The question is: what is the false positive rate? The way to think about it is to consider a test input that's not in the set, which hashes to some particular location i, and then consider inserting the n items into the Bloom filter and ask for the probability that anything hits i; if anything hits i, that's a bad sign. Let's warm up. Say n is 1, I'm inserting one element, and k is 1, one hash function. What's the chance of a hash collision between i and the element I'm inserting? That's just one over the number of bins, assuming everything is independent, so 0.001. Now keep n at 1, but use k hash functions. The probability of missing a given bin with one hash is 1 - 1/m, the probability of missing it with all k hashes is (1 - 1/m)^k, and one minus that is the probability that the bin gets hit. If I'm inserting n items, the bin has to survive k times n hashes, so the probability that the test bit ends up set is 1 - (1 - 1/m)^(kn). Finally, the test item itself is hashed with k hash functions, and a false positive requires all k of its bits to be set, so the false positive rate is f = (1 - (1 - 1/m)^(kn))^k. That final exponent k is what really helps: with a thousand bins, a hundred items, and ten hash functions, the probability that any single tested bit is set is about 0.63, but requiring all k bits to be set brings the false positive rate down to about 0.01.
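Here is a compact sketch of a Bloom filter with k seeded hash functions, again using mmh3 for the hashing; the bin count and seeds are arbitrary, and the last few lines just check the false-positive formula from the analysis above.

```python
import mmh3  # MurmurHash3 bindings (an assumption; any seeded hash works)

class BloomFilter:
    def __init__(self, num_bins: int = 8, num_hashes: int = 2):
        self.m, self.k = num_bins, num_hashes
        self.bits = [0] * num_bins

    def _positions(self, item: str):
        # One position per hash function, obtained by varying the seed.
        return [mmh3.hash(item, seed) % self.m for seed in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # "Maybe in the set" only if all k bits are set; a 0 anywhere means definitely not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bins=8, num_hashes=2)
for word in "the cat in the hat".split():
    bf.add(word)
assert "cat" in bf    # inserted items always pass
print("dog" in bf)    # may be a false positive

# False positive rate from the analysis: f = (1 - (1 - 1/m)^(k*n))^k
m, n, k = 1000, 100, 10
print((1 - (1 - 1 / m) ** (k * n)) ** k)  # about 0.01
```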
You can take the expression for f and compute the optimal value of k for a fixed ratio of bins to items; it's an asymptotic calculation, and you find that the optimal number of hash functions scales on the order of m over n, the number of bins over the number of items you're hashing (roughly (m/n) ln 2). If you use that, the probability that any given tested bit is set works out to 0.5, so the false positive rate is 0.5 to the k. You can see that in this particular case I wasn't optimal: the optimal value is actually around k equals 6, which gives a false positive rate of about 0.008 rather than 0.01. You can read more about this; there are lecture notes that let you trade off the compute, the memory, and the false positive rate. So the Bloom filter has hyperparameters that let you control the desired false positive rate, how much memory you use, and how much compute you're willing to spend.

speaker 2: You said the optimal k went down, so why does f go down?

speaker 1: Let's see. If k goes down... I think the relationship is not monotonic. If you use a very large value of k, that's actually pretty bad, because you just fill up the Bloom filter, and if you use too small a value, that's not very good either. So there's an optimum somewhere, and whether f goes down or up as you change k depends on which side of the optimum you're on.

In Dolma, they set the false positive rate to ten to the minus fifteen, and they used a Bloom filter to do their exact deduplication at the paragraph level. Okay, so that was exact deduplication. The problem with exact deduplication is that it doesn't capture near misses, where something is morally a duplicate but we're not able to detect it. So how do you do approximate set membership? To talk about that, you need a notion of similarity, and a common one is Jaccard similarity. The Jaccard similarity of two sets A and B is the size of their intersection over the size of their union. As an example, for {1, 2, 3, 4} and {1, 2, 3, 5}, the intersection has size three, the union has size five, and therefore the Jaccard similarity is 0.6. We'll say that two documents are near duplicates if their Jaccard similarity exceeds some threshold, and generally the threshold will be fairly high, like 0.99, if you only want to match things that differ by, say, a missing comma. Notice that there's nothing semantic about this: you could be missing a word like "not", which changes the meaning completely, or some content word; this is just trying to find documents that look superficially similar. The algorithmic challenge is: how do you find near duplicates in linear time?

To work up to that, we're going to try to rewrite the Jaccard similarity, which is a pairwise quantity (you can't compute it on one element, only on a pair), in terms of a hash function, which you can compute on a single element. There's a hash function called MinHash with the nice property that the probability of a hash collision is exactly Jaccard(A, B). Normally you think of hash collisions as something to be avoided at all costs, but here it's not that you want more collisions or fewer; you want exactly the right level of collisions, governed by the similarity. The probability of a collision equals the similarity, so the more similar two things are, the more they should collide. That's the intuition.
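For reference, the Jaccard definition from above in code, on the same example sets (the MinHash estimator that approximates this quantity appears a bit further below):

```python
def jaccard(a: set, b: set) -> float:
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

jaccard({1, 2, 3, 4}, {1, 2, 3, 5})  # 3 / 5 = 0.6
```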
Okay, so MinHash is defined as follows: you take a set, hash all of its elements (these hashes are numbers), and take the minimum. If you haven't seen this before, it's not obvious why just taking the min gives you the property that the collision probability equals the Jaccard similarity, but the argument is fairly simple. Think of it with the characteristic matrix over the five items and the two sets: A has 1, 2, 3, 4 and B has 1, 2, 3, 5. A random hash function induces a permutation over the items; the permutation might be 3, 2, 1, 4, 5, for example, for that particular hash function. Now look at which item comes first (has the smallest hash value) in A and which comes first in B; that item is the one that determines the min hash. Each item is equally likely to be first, because the hash function is random. If 1, 2, or 3 is first, then the min hash of A equals the min hash of B, because that item appears in both sets, so the minimum here equals the minimum there. But if 4 or 5 is first, then the first item in A is not the first item in B, and the min hashes will differ. So the probability that the min hashes are equal is exactly the probability that the first item lands in the intersection, which is the intersection over the union. If an item in the intersection comes first, you get a collision; if an item outside the intersection comes first, you get the min for one set and some other value for the other, so no collision. You can check this empirically if you don't believe me: generate a thousand hash functions, run this on A and B, and check how often you get a collision. The estimated Jaccard comes out to about 0.6; not exactly 0.6, there's some noise, but it's pretty close.
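A sketch of that empirical check, using seeded MurmurHash (mmh3) as the family of random hash functions; the number of hash functions is arbitrary.

```python
import mmh3  # seeded MurmurHash3 (an assumption; any seeded hash family works)

def minhash_signature(items, num_hashes=1000):
    # One min-hash per seed; each seed plays the role of a random hash function.
    return [min(mmh3.hash(str(x), seed) for x in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of hash functions on which the two sets collide.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A, B = {1, 2, 3, 4}, {1, 2, 3, 5}
print(estimated_jaccard(minhash_signature(A), minhash_signature(B)))  # close to 0.6
```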
So now we've taken the Jaccard similarity, which is a pairwise function, and reduced it to a probabilistic unary function of a single item. But we're far from done, because a single collision doesn't tell us that two items are similar; it's just that more similar things are more likely to be hashed together. We want something stronger, and this is where locality sensitive hashing (LSH) comes in. So far we have a hash function under which two sets collide with probability equal to their Jaccard similarity. What we really want is for A and B to collide if and only if, ideally, their Jaccard similarity exceeds some threshold. Somehow you have to sharpen the probabilities: if the Jaccard is above the threshold, they should collide with probability close to one, and if it's below, with probability close to zero or very small. How do we do this? It's basically the same trick again: use more hash functions. The construction is to break your n hash functions into b bands of r hash functions each. For example, with twelve hash functions and three bands, each band has four hash functions: h1 through h4, h5 through h8, h9 through h12. Then we simply declare that A and B collide if, for some band, all of its hash functions return the same value. So I hash A and B with all twelve hash functions using MinHash, and they collide if, say, the values for h1, h2, h3, h4 all agree, or all of the second band agrees, or all of the third. There's an AND-OR structure here, and that is the key to sharpening the probabilities around the threshold.

Now we just do some analysis to calculate the probability that A and B collide. I'll use sim to refer to the Jaccard similarity. Say the Jaccard similarity is 0.8, and I have five bands, each with ten hash functions, so fifty hash functions in total. The probability that a fixed band matches is 0.8 to the r, because the probability that a single hash function matches is just the Jaccard, 0.8. What's the probability that some band matches? This is the probability that a fixed band doesn't match, this is the probability that none of them match, and one minus that is the probability that at least one matches. So the probability of collision comes out to around 0.4. It reshapes the probability: with a single hash function, b and r both one, the collision probability would be 0.8, and now it's around 0.4.

Here's the picture you should have in your head. On the x axis is similarity, and we want the mapping from similarity to collision probability to look like a step function: if the desired threshold is 0.5, everything below it should be near zero and everything above it near one. Let's see what this looks like with b equal to 10 and r equal to 10 for various similarity values: it squashes down the low probabilities and sharpens the high ones. Maybe I want to make it even sharper, so I increase r from 10 to 20. That moves the curve to the right, making it harder to match, and sharpens the threshold: the low probabilities get much smaller. Before, a similarity of 0.7 had a collision probability of 0.24, which is kind of high and I really don't want that; after increasing r, it's down to 0.007, which is pretty good. But I've also squashed some of the probabilities I wanted to keep high, and the goal isn't just to shift everything to the right, otherwise I might as well do exact deduplication. So there's another trick: if I increase b, the number of bands, I can get the probabilities at, say, 0.9 similarity back up into the nineties while the low ones stay fairly low. So increasing r sharpens the curve and moves it to the right, and increasing b shifts the curve back to the left; with an appropriate setting of b and r, you can get a really sharp sigmoid.
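A sketch of that S-curve computation: from the band construction, the collision probability at similarity s is 1 - (1 - s^r)^b, and a commonly used approximation for the resulting threshold is (1/b)^(1/r).

```python
def collision_probability(sim: float, b: int, r: int) -> float:
    # P(at least one of b bands has all r min-hashes agree), given Jaccard = sim.
    return 1 - (1 - sim**r) ** b

def approx_threshold(b: int, r: int) -> float:
    # Similarity at which the S-curve rises most steeply.
    return (1 / b) ** (1 / r)

for s in [0.2, 0.5, 0.7, 0.8, 0.9, 0.99]:
    print(s, collision_probability(s, b=10, r=10), collision_probability(s, b=10, r=20))

print(approx_threshold(b=20, r=450))  # about 0.99, the setting discussed next
```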
Okay, so just to give an example of what this looks like in practice: there's a paper we saw last time that does near deduplication using MinHash LSH with b equal to 20 and r equal to 450. That r is large, which means they want a really sharp threshold. There's a formula you can compute which says the threshold is about 0.99, meaning that out of every 100 words, basically only one word is allowed to be different; this is fairly strict deduplication. The probability that a fixed band matches at the threshold is the threshold to the r, which is 0.05, and the probability of collision is 0.64. One thing that's interesting is that I'm computing the probability of collision exactly at the threshold, so in some sense it should be somewhere between zero and one; it turns out it converges to one minus one over e, which is around 0.63. Right around the threshold it could go either way, but as soon as you go above the threshold you have a very high probability of collision, and below it a very low probability.

Okay. So with that, let me summarize; actually, maybe I'll first ask if there are any questions about deduplication. Deduplication lets you train faster and avoids memorization. You can do exact deduplication with Bloom filters, or you can use MinHash LSH to do approximate deduplication.

speaker 2: Given that synthetic data is on the rise, how do you deduplicate in a way that's paraphrase sensitive?

speaker 1: So there are some papers, for example SemDeDup, which I didn't mention, that essentially look at embeddings, and that gives you a more semantic notion of similarity. It all fits into the same framework, because in some sense LSH was devised to find approximate nearest neighbors; all you have to do is embed all the documents, and then you can find nearest neighbors in that space. Obviously there's a higher cost if you're going to run an embedding model over all your documents. Also, we're working in a regime where the near duplicates are really near duplicates; if you're doing some fuzzier notion of duplication, you have to be careful, because if you're too permissive you can throw out a lot of your data.

speaker 2: Are there situations where you actually want duplicates, say if the data is high quality?

speaker 1: Yeah, so the question is, if the data is high quality, maybe you do want duplicates, and that's actually right. In a lot of language modeling papers there's a mid-training regime where you have a bunch of high-quality sources and you actually want to take multiple epochs over the high-quality data. Deduplication is more for pretraining, where you have all this stuff and 60,000 copies of some random thing, and you just want to clean that up. I don't know what the optimal thing to do is, but if a document occurs many times, that could also be a signal that it's more important.
And maybe the right thing to do is to take the count and, I don't know, take its square root or its log, so that the document doesn't show up proportionally in your training data, but you still acknowledge that it's important enough to take a few more epochs over.

Okay, let's summarize. This lecture was mostly about giving you algorithmic tools. We looked at n-gram models, linear classifiers, and importance resampling as ways to do filtering, and with filtering you can do language identification, quality filtering, toxicity filtering, anything where you have a target dataset and you want to get more data like it. Or, if you don't have a target dataset but you have strong language models, you can prompt a language model to synthesize the target or to filter down a lower-quality dataset to get T, then distill that into a fast classifier, and you're off to the races. Second, we looked at deduplication, which is important for reducing the amount of compute you're essentially wasting. The algorithmic tool there is hashing, because hashing is what lets you take these pairwise notions of similarity and collision and turn them into unary functions, which is what gives you the linear-time property. And we saw some clever constructions where using multiple hash functions and these AND-OR structures lets you shape your matching criterion, as in the LSH example. So now you have the tools. In some sense, if you ask how to teach data, this is only the beginning: you really have to spend time with the data, looking at it, filtering it, and training models, and that's how you build intuition about what works and what doesn't. Hopefully you'll get to do a bit of that in assignment 4. Okay, that's all for the data lectures. Next week we'll start on reinforcement learning and alignment, which will be our last unit.

Latest Summary (Detailed Summary)

Generated 2025-06-17 12:18

Executive Summary

This lecture takes a deep dive into the core algorithms and techniques for data filtering and deduplication in large-scale language model training. It first recaps the previous lecture's coverage of data sources (e.g., raw web crawls, GitHub dumps) and the basic processing pipeline (HTML-to-text conversion, filtering, deduplication). The lecture then focuses on three families of filtering algorithms: KenLM, an n-gram model used to score documents by perplexity, which is fast but relatively crude; fastText, a fast classifier built on word embeddings and the hashing trick, commonly used to judge data quality or language; and DSIR, which selects data via importance resampling by estimating the ratio between the target and raw data distributions, and in principle better preserves data diversity.

The lecture then shows how these filtering primitives are applied in concrete settings: language identification (e.g., using fastText to identify text in a specific language, along with its limitations), quality filtering (citing practices from GPT-3, LLaMA, and phi-1, emphasizing how high-quality data improves model performance, and how a strong model can even help generate high-quality annotations), and toxicity filtering (e.g., the Dolma project's use of fastText with the Jigsaw dataset to screen content).

Finally, the lecture discusses deduplication in detail. It distinguishes exact duplicates from near duplicates and stresses that deduplication improves training efficiency and reduces memorization. It covers exact deduplication (based on hash values), Bloom filters (a memory-efficient approximate set-membership structure with a controllable false positive rate), and, for near duplicates, Jaccard similarity, MinHash (which ties Jaccard similarity to hash collision probability), and locality sensitive hashing (LSH), whose bands-and-rows and-or construction maps similar items to the same bucket with high probability and thereby "sharpens" the decision around a similarity threshold. The lecture emphasizes that these algorithmic tools are only a foundation; real understanding comes from building intuition through practice.

Filtering Algorithms

Given a target dataset $T$ (typically small and high quality) and a large amount of raw data $R$, the goal of a filtering algorithm is to find a subset $T'$ of $R$ that is similar to $T$.
Desiderata for a filtering algorithm:
* Generalization: it should generalize from $T$, finding a $T'$ that is similar in nature to $T$ rather than simply reproducing it.
* Extreme efficiency: it must process the massive raw dataset $R$ quickly.

1. KenLM

KenLM is a fast implementation of an n-gram language model (with Kneser-Ney smoothing) that is commonly used for data filtering.
* Core concepts:
* Maximum likelihood estimation: e.g., the trigram $p(\text{in} | \text{the cat}) = \frac{\text{count}(\text{the cat in})}{\text{count}(\text{the cat})}$.
* Smoothing: Kneser-Ney smoothing backs off to lower-order n-grams to handle unseen n-grams and mitigate data sparsity.
* Code example: computing perplexity with KenLM (a sketch follows this section).
* model = kenlm.Model("var/en.arpa.bin")
* Perplexity is computed by normalizing the log probability by the number of tokens, so short documents are not favored.
* Example perplexities:
* "Stanford University..." (Wikipedia excerpt): 77.0 (the speaker said 187; this is the corrected value)
* "If you believe that the course staff..." (CS336 website text): 111.0
* "asdf asdf asdf asdf asdf" (nonsense text): 1100000.0
* "the the the the the the the the the the the the the the the the" (repeated word): 1.1 (the speaker notes the model assigns anomalously low perplexity to this kind of text)
* Application: CCNet (Wenzek et al. 2019)
* Sorts text passages by perplexity in ascending order and keeps the top third.
* This approach was used to build the original LLaMA dataset.
* Summary: KenLM is "fast, but it's crude."
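A minimal sketch of this perplexity-based scoring using the kenlm Python bindings; the model path matches the one quoted above, while the helper names and the cutoff value are illustrative rather than taken from the lecture.

```python
import kenlm  # Python bindings for KenLM (pip install kenlm)

# Pretrained Kneser-Ney n-gram model, path as quoted in the lecture demo.
model = kenlm.Model("var/en.arpa.bin")

def perplexity(text: str) -> float:
    """Length-normalized perplexity, so short documents are not favored."""
    log10_prob = model.score(text)      # total log10 probability (with BOS/EOS)
    num_tokens = len(text.split()) + 1  # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / num_tokens)

def keep(text: str, max_perplexity: float = 1000.0) -> bool:
    # CCNet-style rule: keep only sufficiently fluent text. The fixed cutoff
    # here is a placeholder; CCNet instead keeps the best-scoring third.
    return perplexity(text) < max_perplexity

print(perplexity("Stanford University is a private research university."))
print(perplexity("asdf asdf asdf asdf asdf"))
```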

2. fastText

fastText (Joulin et al. 2016) is a fast classifier designed for text classification, with accuracy comparable to much slower deep learning classifiers.
* Model architecture: bag of word embeddings (see the sketch after this section)
* Introducing a hidden layer $H$ reduces the parameter count from $V \times K$ (bag of words) to $H \times (V + K)$.
* Model: $y = \text{softmax}(U(W(x).\text{mean}(\text{dim=0})))$, where $W$ is the word embedding matrix and $U$ is the classification head. The speaker notes this is essentially a linear classifier, with no nonlinearity.
* Implementation tricks:
* Optimization: parallelization, asynchronous SGD.
* Learning rate: linear decay.
* Features: bag of n-grams (e.g., bigrams).
* Hashing trick: map the potentially huge set of n-grams into a fixed number of bins (e.g., $10^7$) to handle out-of-vocabulary words and the sheer number of n-grams.
* Application: for quality filtering (good vs. bad), $K = 2$, so fastText is effectively a binary linear classifier.
* Scaling up: more powerful classifiers (e.g., BERT, LLaMA) could be used, but they are slower, and the compute might be better spent training the main model directly.
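To make the bag-of-embeddings architecture concrete, here is a minimal PyTorch sketch of a fastText-style classifier with the hashing trick. The bucket count, dimensions, and tokenization are illustrative; the real library additionally uses asynchronous SGD, learning-rate decay, and other optimizations.

```python
import torch
import torch.nn as nn

class FastTextLike(nn.Module):
    """y = softmax(U(W(x).mean(dim=0))): averaged n-gram embeddings followed by
    a linear head, with no nonlinearity (so it is a linear classifier)."""

    def __init__(self, num_buckets: int = 2**20, hidden_dim: int = 16, num_classes: int = 2):
        super().__init__()
        self.num_buckets = num_buckets
        self.embed = nn.EmbeddingBag(num_buckets, hidden_dim, mode="mean")  # W, averaged
        self.head = nn.Linear(hidden_dim, num_classes)                      # U

    def featurize(self, text: str) -> torch.Tensor:
        """Hashing trick: map unigrams and bigrams into a fixed number of buckets."""
        tokens = text.lower().split()
        ngrams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
        return torch.tensor([hash(g) % self.num_buckets for g in ngrams], dtype=torch.long)

    def forward(self, text: str) -> torch.Tensor:
        ids = self.featurize(text).unsqueeze(0)  # a batch containing one document
        return self.head(self.embed(ids))        # logits; softmax lives in the loss

model = FastTextLike()
print(torch.softmax(model("the cat sat on the mat"), dim=-1))
```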

3. DSIR (Data Selection via Importance Resampling)

DSIR (Xie et al. 2023) selects data via importance resampling.
* Setup:
* Target dataset $D_p$: small, high quality.
* Proposal (raw) dataset $D_q$: large.
* Core idea:
1. Fit a target distribution $p$ on $D_p$ and a proposal distribution $q$ on $D_q$.
2. Resample examples from $D_q$ using importance weights $w(x) \propto p(x)/q(x)$.
* Challenge and fix: $D_p$ is usually too small to estimate a good model $p$. The solution is to estimate the distributions over hashed n-grams (similar to the fastText trick).
* Code example: a simplified DSIR pipeline that estimates text probabilities with unigrams and the hashing trick (a sketch follows this section).
* Results: on the GLUE benchmark with BERT-style models, DSIR slightly outperforms heuristic classification (fastText).
| Method | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | Avg |
| :------------------------- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Random selection | 82.3 | 88.9 | 89.3 | 67.3 | 90.0 | 87.4 | 49.4 | 88.6 | 80.2 |
| Heuristic classification | 82.6 | 89.7 | 88.5 | 68.9 | 88.9 | 86.0 | 48.1 | 88.6 | 79.8 |
| Top-K Heuristic | 83.3 | 88.6 | 89.0 | 70.0 | 91.1 | 86.3 | 50.2 | 89.3 | 81.4 |
| DSIR | 83.8 | 90.7 | 89.0 | 75.0 | 90.4 | 87.7 | 54.0 | 89.1 | 82.3 |
| Top-K DSIR | 83.3 | 89.6 | 89.4 | 72.4 | 91.0 | 86.1 | 49.9 | 89.5 | 81.3 |
* Comparison with fastText:
* DSIR models the whole distribution, a "more principled way" to capture data diversity.
* The two have similar computational cost.
* Both can be improved by upgrading the underlying model (e.g., richer n-grams or neural methods).
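A simplified sketch of DSIR-style scoring, assuming hashed unigram bag-of-words models for the target and proposal distributions (the paper also uses higher-order hashed n-grams and a different without-replacement sampler); the bucket count and example documents are illustrative.

```python
import numpy as np

NUM_BUCKETS = 10_000  # illustrative; real systems use far more hashed n-gram buckets

def hashed_unigram_dist(docs):
    """Estimate a hashed-unigram distribution with add-one smoothing."""
    counts = np.ones(NUM_BUCKETS)
    for doc in docs:
        for token in doc.lower().split():
            counts[hash(token) % NUM_BUCKETS] += 1
    return counts / counts.sum()

def log_weight(doc, p_dist, q_dist):
    """log p(x)/q(x) under the two hashed-unigram models."""
    idx = [hash(t) % NUM_BUCKETS for t in doc.lower().split()]
    return float(np.sum(np.log(p_dist[idx]) - np.log(q_dist[idx])))

target_docs = ["an encyclopedia style article about astronomy and telescopes"]
raw_docs = ["buy cheap watches now now now",
            "astronomy is the study of celestial objects and telescopes",
            "asdf asdf asdf asdf"]

p_dist = hashed_unigram_dist(target_docs)  # fit p on the small target set D_p
q_dist = hashed_unigram_dist(raw_docs)     # fit q on the large raw set D_q

log_w = np.array([log_weight(d, p_dist, q_dist) for d in raw_docs])
w = np.exp(log_w - log_w.max())
w /= w.sum()

rng = np.random.default_rng(0)
selected = rng.choice(len(raw_docs), size=2, replace=False, p=w)
print([raw_docs[i] for i in selected])
```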

Importance Sampling: A Brief Introduction

  • Setup: a target distribution $p$ (which we want to sample from) and a proposal distribution $q$ (which we can actually sample from).
  • Steps (illustrated in the sketch below):
    1. Draw samples from $q$.
    2. Compute the weight $w_i \propto p(x_i)/q(x_i)$ for each sample and normalize.
    3. Resample with replacement from the samples of step 1 according to the weights $w$.
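A tiny numerical illustration of the three steps, assuming a Gaussian target and a wider Gaussian proposal (the distributions are arbitrary examples, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):  # target: Normal(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # proposal: Normal(0, 2)
    return np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

# Step 1: sample from the proposal q.
samples = rng.normal(0.0, 2.0, size=10_000)

# Step 2: compute and normalize importance weights w_i ∝ p(x_i) / q(x_i).
weights = p_pdf(samples) / q_pdf(samples)
weights /= weights.sum()

# Step 3: resample with replacement according to the weights.
resampled = rng.choice(samples, size=10_000, replace=True, p=weights)

print(samples.std(), resampled.std())  # roughly 2.0 before, roughly 1.0 after
```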

Summary of the Filtering Framework

All methods follow the same pattern: estimate a model, score each example, and keep examples according to their scores.

| Method | Core idea | Steps |
| :--- | :--- | :--- |
| KenLM (generative model) | Estimate a generative model $p_T(x)$ of the target data $T$. | 1. $score(x) = p_T(x)$ (or its derived perplexity). 2. Keep examples whose $score(x)$ exceeds a threshold. |
| fastText (discriminative classifier) | Train a classifier to distinguish $T$ from $R$. | 1. $score(x) = p(x \in T)$, the classifier's predicted probability that $x$ belongs to $T$. 2. Keep examples whose $score(x)$ exceeds a threshold (possibly stochastically). |
| DSIR (importance resampling) | Estimate the ratio between the distributions of $T$ and $R$. | 1. $score(x) = p_T(x) / p_R(x)$. 2. Resample with probability proportional to $score(x)$. |

Discussion recap: in response to a question about n-gram models attending only to local context and thus not guaranteeing document-level coherence, the speaker acknowledged that such models are vulnerable to adversarial constructions (e.g., the low perplexity of "the the the"), but their main purpose is quickly "filtering out nonsense." Local fluency does not fully capture overall quality, yet at web scale these methods on average remove obviously bad text and lay the groundwork for more refined model training. For some applications (e.g., language identification, toxicity filtering), this kind of coarse first-pass filtering can be even more directly effective.

Filtering Applications

The same filtering machinery can be applied to different tasks.

1. Language Identification

  • Task: identify text in a specific language (e.g., English).
  • Why:
    • Curating high-quality data in a specific language is hard.
    • Under a fixed compute budget, multilingual training dilutes the compute/tokens available per language.
    • Model differences: BLOOM (2022) used only 30% English data, which may have hurt its English ability; current frontier models (GPT-4, Claude, Gemini, Llama) are heavily multilingual and well trained.
  • Tool: the fastText language identification model (see the sketch after this section).
    • Works out of the box, supports 176 languages, trained on multilingual sites such as Wikipedia and Tatoeba.
    • Example: the Dolma dataset keeps pages with $p(\text{English}) \ge 0.5$.
    • Code demo:
      • "The quick brown fox..." -> English (probability 0.71)
      • German sentence -> German
      • Mathematical formula -> weakly identified as English
      • Code snippet -> Russian (highest probability)
      • "hello" -> Italian
      • "bonjour" -> French
      • Mixed Spanish/English sentence -> leans Spanish
  • Caveats:
    • Short sequences are hard to identify.
    • Low-resource languages are hard to identify.
    • English dialects may be accidentally filtered out.
    • Closely related languages (e.g., Malay and Indonesian) are hard to distinguish.
    • How to define and handle code-switching is unclear.
  • Case study: OpenWebMath (a paper from around 2022)
    • Goal: curate a large corpus of mathematical text from Common Crawl.
    • Filtering rules:
      1. Rule-based prefiltering (e.g., presence of LaTeX commands).
      2. A KenLM model trained on ProofPile, keeping text with perplexity < 15000.
      3. A fastText classifier trained to predict whether text is mathematical writing, with thresholds adjusted by rules.
    • Result: 14.7B tokens; a 1.4B-parameter model trained on them outperforms a model trained on 20x more generic data. This demonstrates the "utility of data filtering."
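For reference, a minimal sketch of the Dolma-style keep rule using the off-the-shelf fastText language-identification model; the file name lid.176.bin refers to the published 176-language model, and the helper function here is illustrative.

```python
import fasttext

# Published 176-language identification model:
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.5) -> bool:
    """Dolma-style rule: keep the page only if p(English) >= 0.5."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)  # predict() rejects newlines
    return labels[0] == "__label__en" and float(probs[0]) >= threshold

print(keep_english("The quick brown fox jumps over the lazy dog."))
print(keep_english("Der schnelle braune Fuchs springt über den faulen Hund."))
```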

2. Quality Filtering

  • Some projects (C4, Gopher, RefinedWeb) do not use model-based filtering.
  • Others (GPT-3, LLaMA) do, and this is becoming the norm.
  • Case studies:
    • GPT-3:
      • Positives: Wikipedia, WebText2, Books1, Books2 (non-Common-Crawl sources).
      • Negatives: Common Crawl.
      • Trained a linear classifier over word features and kept documents stochastically according to their scores (a sketch of such a keep rule follows this section).
    • LLaMA (original):
      • Positives: pages cited by Wikipedia (not Wikipedia itself).
      • Negatives: randomly sampled Common Crawl.
      • Kept documents classified as positive (apparently without stochastic sampling).
    • phi-1:
      • Idea: train a small model (1.3B) on extremely high-quality, "textbook-like" data.
      • Data: synthetic data from GPT-3.5/4 plus filtered real data.
      • Pipeline:
        1. Use GPT-4 to annotate 100k documents from the Python subset of The Stack (judging their educational value for learning basic programming concepts), yielding a high-quality positive set $T$.
        2. Train a random forest classifier on these annotations, using embeddings from a pretrained "code embedding model" (the transcript says "cojam model", which likely refers to a code embedding model).
        3. Use that classifier to filter the entire Python subset $R$.
      • Result: on HumanEval, the model trained on the newly filtered subset reaches 17.68% at 36k steps, versus 12.19% at 96k steps for the model trained on the original subset. This shows that a strong model (e.g., GPT-4) can be used to create a small high-quality labeled set $T$, which is then distilled into a fast classifier that scales to the large raw dataset $R$.
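The GPT-3 paper's appendix describes the stochastic keep rule as keeping a document when np.random.pareto(9) > 1 - document_score, so high-scoring documents are almost always kept while low-scoring ones occasionally survive, preserving diversity. Below is a sketch of that style of rule; the scoring function is a purely hypothetical stand-in for a trained quality classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def quality_score(text: str) -> float:
    """Hypothetical stand-in for a fast quality classifier's probability that a
    document resembles the positive set (e.g. Wikipedia / books)."""
    words = text.split()
    return min(1.0, sum(len(w) for w in words) / (6.0 * max(len(words), 1)))

def keep_document(text: str, alpha: float = 9.0) -> bool:
    # Pareto-threshold keep rule in the style described in the GPT-3 appendix:
    # the higher the score, the more likely the document is retained.
    return rng.pareto(alpha) > 1.0 - quality_score(text)

docs = ["A well written encyclopedia style paragraph about photosynthesis.",
        "click here click here click here buy now"]
print([keep_document(d) for d in docs])
```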

3. Toxicity Filtering

  • Case study: toxicity filtering in Dolma
    • Dataset: the Jigsaw Toxic Comments dataset (comments from Wikipedia talk pages, labeled toxic, severe_toxic, obscene, threat, insult, identity_hate).
    • Models: two fastText classifiers, one for hate speech and one for NSFW (Not Safe For Work) content.
    • Code demo:
      • Example 1 ("Are you threatening me...") -> safe
      • Example 2 (content not shown, flagged as NSFW) -> nsfw
      • The speaker remarks that both example sentences are themselves "fine" (i.e., harmless), suggesting the classifiers can be miscalibrated or overly sensitive.

Deduplication

Two Types of Duplicates

  1. Exact duplicates: e.g., mirror sites, GitHub forks.
  2. Near duplicates: texts differing by only a few tokens, such as terms-of-service pages, templated writing (e.g., product descriptions with entities swapped out), or minor edits.
    • Example: a particular English sentence in the C4 dataset (a variant of "This is a perfectly good English sentence, but you don't want to like take 61,000 epochs over this.") appears 61,000 times, originating from an Amazon product description.

Why Deduplicate?

  • Training efficiency: fewer tokens to train on.
  • Avoiding memorization: mitigates copyright and privacy issues and prevents the model from regurgitating training data.
  • The speaker views deduplication as complementary to quality filtering: quality filtering removes bad data, while deduplication ensures good data appears an appropriate number of times.

Design Space

  1. Item: sentence, paragraph, or document?
  2. How to match: exact match, existence of a common sub-item, or fraction of common sub-items?
  3. Action: remove all copies, or keep exactly one?

Core challenge: deduplication is inherently a pairwise comparison between items ($O(n^2)$); scaling to large datasets requires algorithms with roughly linear time complexity, and hash functions are the basic building block.

1. Hash Functions

Map an item to a smaller hash value (an integer or a string).
* Cryptographic hash functions (e.g., SHA-256): strongly collision resistant but slow.
* Non-cryptographic hash functions (e.g., MurmurHash, CityHash): fast, without strong collision-resistance guarantees, commonly used in hash tables. The lecture favors the latter.

2. Exact Deduplication

  • Procedure (sketched after this section):
    1. Item: a string.
    2. Match: exact match.
    3. Action: keep exactly one copy.
  • Compute a hash of each item, group items with the same hash, and keep one item per group. Easy to parallelize with MapReduce.
  • Case study: the C4 dataset
    • Item: a span of 3 sentences.
    • Match: exact match.
    • Action: keep exactly one copy.
    • The speaker's concern: removing a span from the middle of a document may break its coherence, but "people don't seem to care."
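A minimal sketch of hash-based exact deduplication; the example items are made up, and a non-cryptographic hash such as MurmurHash would be faster than the standard-library SHA-256 used here.

```python
import hashlib

def exact_dedup(items):
    """Keep one copy of each exactly-repeated item by hashing every item."""
    seen = set()
    kept = []
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if digest not in seen:  # first time we see this exact string
            seen.add(digest)
            kept.append(item)
    return kept

docs = ["standard terms of service text",
        "a unique document",
        "standard terms of service text"]
print(exact_dedup(docs))  # the second copy of the terms of service is dropped
```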

3. Bloom Filter

  • Goal: an efficient, approximate data structure for testing set membership.
  • Properties:
    • Memory efficient.
    • Supports insertions, but usually not deletions.
    • If it answers "no", the element is definitely not in the set.
    • If it answers "yes", the element is very likely in the set, but with a small probability the true answer is "no" (a false positive).
  • How it works (sketched after this section):
    • Use a bit array of size $m$ and $k$ independent hash functions.
    • Insert: hash the item with all $k$ hash functions and set the corresponding $k$ positions in the bit array to 1.
    • Query: hash the item with all $k$ hash functions; report the item as (probably) present only if all $k$ positions are 1.
  • False positive analysis:
    • False positive rate $f \approx (1 - e^{-kn/m})^k$, where $n$ is the number of inserted items.
    • For a given ratio $m/n$, the optimal $k = \ln(2) \cdot (m/n)$, which gives $f \approx (0.5)^k$.
  • Case study: Dolma
    • Sets the false positive rate to $10^{-15}$.
    • Performs exact deduplication at the paragraph level.
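A minimal sketch of a Bloom filter sized from a target false positive rate (not Dolma's actual implementation); it assumes the mmh3 package for MurmurHash, with the seed acting as the index of the hash function.

```python
import math
import mmh3  # non-cryptographic MurmurHash (pip install mmh3)

class BloomFilter:
    """Bit array of size m with k hash functions: 'no' is always correct,
    'yes' is wrong with probability roughly (1 - e^{-kn/m})^k."""

    def __init__(self, num_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln(f) / (ln 2)^2 and k = ln(2) * m / n.
        self.m = math.ceil(-num_items * math.log(false_positive_rate) / math.log(2) ** 2)
        self.k = max(1, round(math.log(2) * self.m / num_items))
        self.bits = bytearray(self.m)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        return [mmh3.hash(item, seed) % self.m for seed in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_items=1_000_000, false_positive_rate=1e-6)
bf.add("this paragraph has been seen before")
print("this paragraph has been seen before" in bf)  # True
print("a brand new paragraph" in bf)                # almost certainly False
```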

4. Jaccard Similarity and MinHash

  • Jaccard similarity: measures the similarity of two sets $A, B$:
    $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
    Two documents are considered near duplicates if their Jaccard similarity exceeds some threshold (e.g., 0.99).
  • MinHash (sketched after this section):
    • A special hash function $h_{min}$ satisfying $P[h_{min}(A) = h_{min}(B)] = J(A, B)$.
    • Idea: under a random permutation (or a random hash function into the integers), a set's MinHash is the minimum value over its elements. The probability that two sets have the same MinHash equals their Jaccard similarity.
    • Argument: consider the union of the two sets; the MinHash values agree exactly when the first element of the union under the random permutation lies in the intersection.
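A small sketch that checks the MinHash property empirically: the fraction of seeds on which the two MinHash values agree should approximate the Jaccard similarity (word-level shingles and mmh3 seeds are illustrative choices, not prescribed by the lecture).

```python
import mmh3

def minhash(shingles: set, seed: int) -> int:
    """MinHash under one random hash function: the minimum hash over the set."""
    return min(mmh3.hash(s, seed) for s in shingles)

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox jumped over a lazy dog".split())

num_seeds = 2000
matches = sum(minhash(a, s) == minhash(b, s) for s in range(num_seeds))
print(jaccard(a, b), matches / num_seeds)  # both around 0.7 for these two sets
```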

5. Locality Sensitive Hashing (LSH)

  • Goal: hash similar items into the same bucket with high probability and dissimilar items into the same bucket with low probability.
  • Problem: for a single MinHash, the collision probability is $s$ (the Jaccard similarity), a linear function that is not "sharp" enough. We want the collision probability to be close to 1 when $s > t_{threshold}$ and close to 0 when $s < t_{threshold}$.
  • Solution: bands and rows (sketched after this section)
    1. Use $n$ independent MinHash functions with $n = b \times r$ ($b$ bands, $r$ hash functions/rows per band).
    2. Split the $n$-value signature into $b$ bands of $r$ hash values each.
    3. Collision rule: two documents are a candidate pair if all $r$ hash values match in at least one band.
  • Probability analysis:
    • Probability that all $r$ hashes in one band match: $s^r$ (AND logic).
    • Probability that a band does not match: $1 - s^r$.
    • Probability that none of the $b$ bands match: $(1 - s^r)^b$.
    • Probability of matching in at least one band (the LSH collision probability): $P(\text{collision}) = 1 - (1 - s^r)^b$ (OR logic).
    • This and-or construction produces an S-shaped curve.
    • Effect of the parameters:
      • Increasing $r$ (hashes per band) makes the S-curve steeper and shifts it right (matching is harder, higher threshold).
      • Increasing $b$ (number of bands) shifts the S-curve left (matching is easier, lower threshold).
    • Threshold approximation: the inflection point (threshold) of the S-curve is roughly $t \approx (\frac{1}{b})^{1/r}$.
  • Practical example: one paper uses $b = 20, r = 450$, giving an approximate threshold $t \approx 0.99$. The collision probability at the threshold is about $0.64$ (in theory it approaches $1 - 1/e \approx 0.632$).
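A minimal sketch of the bands-and-rows construction with small illustrative parameters (b = 20 bands, r = 5 rows, so the threshold is roughly $(1/20)^{1/5} \approx 0.55$, far looser than the paper setting above); the document shingles and names are made up for the example.

```python
from collections import defaultdict
import mmh3

B, R = 20, 5  # bands and rows per band; threshold ~ (1/B)**(1/R) ≈ 0.55

def signature(shingles: set) -> list:
    """n = B*R MinHash values for one document."""
    return [min(mmh3.hash(s, seed) for s in shingles) for seed in range(B * R)]

def candidate_pairs(docs: dict) -> set:
    """Two documents become a candidate pair if all R MinHash values agree
    in at least one band (AND within a band, OR across bands)."""
    sigs = {name: signature(sh) for name, sh in docs.items()}
    pairs = set()
    for band in range(B):
        buckets = defaultdict(list)
        for name, sig in sigs.items():
            buckets[tuple(sig[band * R:(band + 1) * R])].append(name)
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    pairs.add(tuple(sorted((names[i], names[j]))))
    return pairs

docs = {
    "a": set("this is a perfectly good english sentence about a product".split()),
    "b": set("this is a perfectly good english sentence about some product".split()),
    "c": set("completely unrelated text on a different topic entirely".split()),
}
print(candidate_pairs(docs))  # ('a', 'b') almost surely; pairs involving 'c' very unlikely
```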

Key Takeaways and Conclusions

The lecture systematically covered filtering and deduplication algorithms for language model data preprocessing.
* Filtering: tools such as KenLM, fastText, and DSIR filter data effectively via model-based scoring, with applications to language identification, quality improvement, and toxicity removal. High-quality, targeted data is critical to model performance, and strong existing models can even help generate or select the target data.
* Deduplication: exact deduplication (e.g., with Bloom filters) and approximate deduplication (MinHash and LSH) are the key techniques for handling redundancy in massive datasets, improving training efficiency and reducing memorization. LSH's bands-and-rows design finds near-duplicate documents efficiently.
* Practice: the speaker closes by stressing that the algorithmic tools are only a starting point; mastering data work requires hands-on experimentation, observation, and intuition building.

Action items / future directions mentioned by the speaker:
* The next lecture begins reinforcement learning and alignment.

Points that were unclear or uncertain in the transcript:
* The embedding model used in phi-1's quality filtering is transcribed as a "cojam model"; the exact reference is unclear and is likely a transcription error or shorthand for a code embedding model.
* While discussing near deduplication, the speaker mentions a paper transcribed as "70 duportrait"; the exact title is unclear and may be a transcription error, possibly referring to work on semantic deduplication.
* In the opening recap, the speaker mentions models from BERT up to "homo"; the reference is unclear and may be a transcription error or shorthand for a model name (e.g., Gopher).
* While introducing KenLM, the speaker notes some content may be missing: "I feel like there was some content that was I meant to go through that's missing. Oh, well, maybe got deleted."