speaker 1: So this is the second lecture on data. In the last lecture, we talked about the different datasets that were used to train various language models. We did a historical overview from the datasets used to train BERT all the way up to OLMo and everything in between. One of the things I wanted to emphasize is that data doesn't just fall from the sky. It often lives in live services, and it has to be explicitly crawled or dumped, and then a bunch of processing has to happen: converting the raw HTML into text, doing all sorts of quality, language, and toxicity filtering, deduplication, and so on. So in this lecture, I'd like to take a deep dive into the mechanics of how quality filtering and deduplication work in particular. First we're going to look at different algorithms for filtering, which are mostly model based: you train a classifier, or some sort of model, to filter. Then we'll show that this primitive can be used for all sorts of different types of filtering. And finally we're going to look at some algorithms for deduplication. So this lecture will be a little different in that it focuses a lot on classical big data processing algorithms, and there will be some fun math in there for some of you. Okay, so let's get started with filtering algorithms. The basic high-level picture is that you're given some target data T, which you don't have very much of; suppose it's data that's high quality. And you have a lot of raw data R; imagine this is Common Crawl. Your goal is to find a subset of that raw data, call it T prime, which is similar to T. This is a pattern you should recognize in basically any filtering pipeline. What do you want from this filtering algorithm? You obviously want it to generalize from T: you already have T, so there's no point in recovering exactly the same data; there has to be some generalization. It also has to be extremely fast, because you have to run it on all of the web, for example. If filtering requires running a huge model, that might be as expensive as training, and you definitely don't want that; otherwise you might as well just train. So we're going to look at three different ways you can implement this high-level primitive. We'll start with training an n-gram model, which is something that came up in the last lecture. You can train an n-gram model with Kneser-Ney smoothing. You can use other forms of smoothing, but Kneser-Ney has been inherited from the n-gram modeling, statistical NLP era. There's a particular implementation that people tend to use in this space called KenLM, which is open source, originally developed for machine translation, and implements Kneser-Ney smoothing. You can go and read about Kneser-Ney; it's pretty interesting, but we're not going to talk too much about the details here. This happens to be the popular choice for data filtering; there's no particular reason it has to be this one, it's just what people use. And the nice thing about KenLM and n-gram modeling is that it's very simple. When I say fit a model, you should think of this as just counting the number of n-grams and then normalizing. Okay, so to go into a little bit of depth here, the starting point is maximum likelihood estimation of language models.
So you have a corpus of text and you want to estimate conditional probabilities like p(in | the cat), the probability of the next word given the last n minus one words. What you do is count the number of n-grams and divide by the number of (n minus 1)-grams of the conditioning context. This is very easy in principle. Implementing it efficiently takes a bit of work, but the main problem is sparse counts: many n-grams show up exactly zero times even though they're perfectly reasonable, especially as n grows. This was the whole reason n-gram models couldn't really be scaled; you hit the curse of dimensionality. To mitigate this, people devised Kneser-Ney smoothing to handle unseen n-grams. There are more details to it, but roughly it says: if I don't have enough data to support this count, I can fall back to a lower-order n-gram. So remove one word from the context, count the number of times "in" appears given "cat", and then either interpolate with or back off to that. Okay, that's all I'll say about how n-gram modeling works; even in this era of neural language models, I think it's good to know how n-gram models work. So let's download an n-gram model. This one has been trained on Wikipedia; I'm just going to download it and then punch some sentences through. The first sentence is "Stanford University was founded in 1885." This is actually an excerpt taken from Wikipedia, so it should be reasonable. You compute the log probability, then you get the perplexity by normalizing by the number of tokens, and you get 187. Okay, so it's some number. Here's another example, taken from the CS336 website, so let's see how well it does. This one gets a higher perplexity, meaning it's less likely. That kind of makes sense, because this content probably doesn't show up much in Wikipedia, but it's not abnormally high, because it's still fairly well-formed English and many of the n-grams, I think, show up in Wikipedia as well. What about this next one? It's just gibberish, and it gets assigned a much higher perplexity. And then there's "the the the ...", which you'd think should get a pretty high perplexity, but for some reason this particular n-gram model assigns it a pretty low perplexity. So you know, it's an n-gram model; it's not particularly clever or good. But you don't need the best model: you're just doing data filtering to decide what to train on, so it can be very crude as long as it's fast. CCNet, which is a paper out of Facebook (as it was called back then) that we talked about last time, basically did this. They looked at paragraphs of text as the items, sorted the paragraphs by increasing perplexity, and just kept the top one third. This is what was used to create the first LLaMA dataset. So it's a heuristic, but it's the kind of thing you might write down if you had to do something quickly, and it kind of makes sense. Okay, so Kneser-Ney language modeling is fast, but it's crude. All right, I feel like there was some content I meant to go through that's missing. Oh well, maybe it got deleted. So that's the first method: you fit an n-gram model to your target data, use it to score the raw data, and choose the top-scoring documents.
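To make that concrete, here is a minimal sketch of a KenLM-based perplexity filter in the CCNet style. The `kenlm` Python bindings are real, but the model path, the example paragraphs, and the keep fraction are placeholders.

```python
import kenlm  # Python bindings for KenLM (pip install kenlm)

# Placeholder path: any KenLM .arpa or .binary model trained on the target corpus.
model = kenlm.Model("wikipedia.5gram.binary")

def perplexity(text: str) -> float:
    # KenLM reports log10 probabilities; perplexity() normalizes by the number of tokens.
    return model.perplexity(text)

def keep_top_fraction(paragraphs: list[str], fraction: float = 1 / 3) -> list[str]:
    # CCNet-style rule: sort paragraphs by increasing perplexity and keep the top fraction.
    ranked = sorted(paragraphs, key=perplexity)
    return ranked[: max(1, int(len(ranked) * fraction))]

paragraphs = [
    "Stanford University was founded in 1885.",
    "asdf qwer zxcv uiop",
]
print([round(perplexity(p), 1) for p in paragraphs])
print(keep_top_fraction(paragraphs, fraction=0.5))
```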
Another thing you can do is fastText, which is in some ways more popular. This is also from Facebook; they've released a lot of open source tools, although KenLM, I think, predates that. fastText was developed for text classification, and the paper was from 2016, when a lot of people were developing all sorts of fancy neural models; they showed that this essentially linear classifier worked about as well and was much faster, which is why the software package became very popular. The motivation is that if you're doing, say, bag-of-words classification, where you have a sentence of length L, a vocabulary of size V, and you want to classify into k classes (64 in their example), then immediately you have to define a V-by-k matrix, which gets pretty big if V and k are both large, and you run into sparsity issues. The way it would work is that you take your documents and just do linear classification, which is fairly straightforward; the problem is that the number of parameters is huge. So fastText says: let me just do a linear dimensionality reduction. Instead, I'm going to map my vocabulary into a hidden dimension H, which is small, maybe 16, and then classify from H. Notice that in the forward pass there's no nonlinearity; it's just a linear classifier. You can also think of it as a matrix factorization if you want, and the number of parameters is greatly reduced. The implementation, I think, is reasonably optimized; it's parallelized and uses asynchronous SGD. Now, this is bag of words, and we want to look not just at individual words but at n-grams, so the paper also extends to n-grams in a fairly simple way. You take a sentence and chop it up into n-grams, and here the problem is that the number of bigrams can get quite large; in fact it's unbounded. Remember when we talked about tokenizers: if you just chop text up into words, you don't know in advance how many distinct words there will be. The simple fix is hashing. You define some number of bins, maybe 10 million, though in this example eight, and you hash every bigram into a bin. So "the cat" maps to 2, "cat in" maps to 1, and so on. Of course there might be some collisions, but you just live with that; in some sense, when you optimize the loss, collisions are accounted for. If two bigrams that have nothing to do with each other collide, the weight ends up representing something like the average of the two. Also note that we often use fastText for classification with k equals two classes: is this document good or is it bad? In that case it's just normal binary linear classification. Okay, so of course it doesn't take much imagination to see that if you can do linear classification, you could also use fancier models such as BERT or even LLaMA; the only issue is that they're slower. And then you run into this tradeoff: if you use a giant model to do classification, maybe you should have spent those FLOPs just training your model. That's the main concern, because the web is very large and you're filtering it down quite a bit. Let's say you're filtering down to 1% of the web; then the compute you spend on the classifier has to be something like one hundredth of the cost of taking a forward pass over the dataset, otherwise the filtering starts to rival the training itself.
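As a rough illustration of this kind of classifier, here is a minimal sketch using the `fasttext` package; the training file name, label names, and hyperparameters are made up for the example, and `wordNgrams` plus `bucket` are what give you the hashed bigram features described above.

```python
import fasttext  # pip install fasttext

# Placeholder training file: one example per line, e.g.
#   __label__hq  Stanford University was founded in 1885 ...
#   __label__lq  click here for FREE prizes !!! ...
model = fasttext.train_supervised(
    input="quality_train.txt",
    dim=16,            # small hidden dimension, as in the fastText paper
    wordNgrams=2,      # add bigram features ...
    bucket=2_000_000,  # ... hashed into this many bins
    epoch=5,
    lr=0.5,
)

def keep(document: str, threshold: float = 0.5) -> bool:
    # predict() returns ((label,), (probability,)); keep documents scored as high quality.
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

print(keep("Stanford University was founded in 1885."))
```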
Okay, so the final method I'm going to talk about is called data selection for language models via importance resampling (DSIR). The basic idea is that you have your raw dataset R and your target data T, and you build an importance weight estimator, which is the analog of the KenLM n-gram model or the fastText classifier, and then use it to select a bunch of documents. First, a quick refresher on importance resampling, which shows up in many contexts like particle filtering and Monte Carlo methods. You have a target distribution p (a distribution now, not a dataset) that you would ideally like to draw samples from, but you can only draw samples from a proposal distribution q. For example, say your vocabulary is {0, 1, 2, 3}, this is your target distribution, and this is your proposal distribution. You sample from q, because that's the only thing you can sample from, and you get a bunch of samples. Notice that q puts a lot more mass on zeros, so you get a lot more zeros. Then you compute weights over your samples using the importance ratio p over q; you're trying to make up for the fact that you sampled from q when you really wanted p, so you divide that out. You normalize the weights to get a distribution over your samples, and then you resample, which balances things back out. Now you see there are more threes, because p had more mass on three. This is fairly elementary, but it's a building block. Now let's go to data selection. We have a target dataset, not a distribution, and it's small. We also have the raw dataset, which we'll call the proposal dataset, and it's large. So you could just fit a distribution p to the target data, fit a distribution q to the raw data, and do the importance resampling we talked about two minutes ago. The main problem is that the target dataset is really small. Remember, the target is high-quality data: you don't have much of it, which is the whole point, since you're trying to get more of it, and it's too small to estimate a good model. Again, the idea is to use hashed n-grams, which is something we already saw with fastText. You take whatever text you have, hash each n-gram (unigrams for now) into some number of bins, and estimate the probability of every hashed n-gram by counting. In this example, "the cat in the hat" hashes to these values; notice there are some hash collisions, a couple of words land in the same bin, but whatever, this is all crude. Then you just estimate the probability of each bin. To evaluate the probability of a new text, you hash the new text and multiply the bin probabilities together. I got a little unlucky here and the probability comes out to zero; you can do some smoothing if you want. It also turns out that Python's hash function is not deterministic across runs, so I'll get something different every time I run this.
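Here is a minimal sketch of the hashed n-gram importance weighting just described (unigrams only, tiny toy corpora, and Python's built-in `hash` for brevity even though, as noted, it isn't stable across runs). The real DSIR implementation uses far more buckets and samples without replacement.

```python
import math
import random
from collections import Counter

B = 10_000  # number of hash buckets (placeholder; the paper uses many more)

def hashed_ngrams(text: str) -> list[int]:
    # Hash each unigram into one of B buckets. A real implementation would use
    # a fixed hash like MurmurHash rather than Python's per-process hash().
    return [hash(word) % B for word in text.lower().split()]

def bucket_distribution(documents: list[str]) -> list[float]:
    # Smoothed bucket probabilities estimated from a corpus (add-one smoothing).
    counts = Counter(b for doc in documents for b in hashed_ngrams(doc))
    total = sum(counts.values()) + B
    return [(counts[b] + 1) / total for b in range(B)]

def log_importance_weight(doc: str, p: list[float], q: list[float]) -> float:
    # log p_target(doc) - log q_raw(doc) under the hashed-unigram models.
    return sum(math.log(p[b]) - math.log(q[b]) for b in hashed_ngrams(doc))

target_docs = ["the cat in the hat", "the quick brown fox"]               # small target set T
raw_docs = ["the cat sat", "buy cheap pills now", "a brown fox ran"]      # large raw set R

p = bucket_distribution(target_docs)
q = bucket_distribution(raw_docs)
weights = [math.exp(log_importance_weight(d, p, q)) for d in raw_docs]
# Resample (with replacement here, for simplicity) proportional to the importance weights.
print(random.choices(raw_docs, weights=weights, k=2))
```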
The paper shows that this indeed helps a bit. The gains are not super massive, but compared to doing fastText-style heuristic classification, there is some lift on the GLUE benchmark using BERT-style models. The main idea is that modeling a whole distribution is in some sense a more principled approach: you want to sample data from your target distribution, you only have a different distribution q, and you're trying to compensate for that. This could be better for diversity, because you're trying to match the distribution, whereas with fastText you're essentially classifying whether something is in-distribution or not, with no guarantee about matching the distribution. But it's also fast, like fastText, and both can be improved with better models: instead of linear models over n-grams, you can increase n, or you can use neural approaches. Okay, so to wrap up this part: we looked at n-gram models, linear classifiers, and importance resampling, but the general framework is what's worth taking away. The setup is: given a target dataset and a raw dataset, find a subset of the raw data that's similar to the target. All these methods follow the same basic recipe: estimate some sort of model from the raw data and the target data, which gives you a scoring function, and then keep examples from the raw dataset based on that score, usually the high-scoring ones. To walk through it: in the KenLM case, the score is simply the probability, or rather the perplexity if we want something normalized by the number of tokens, under the target model; R isn't used at all, and you keep examples whose score clears a threshold, possibly with a bit of randomization around the threshold. You can also use a discriminative classifier, where the score is the probability that the example came from the target as opposed to the raw dataset, and you keep examples whose score is greater than some threshold. And importance resampling says the score is the ratio of two generative models, which in this case could be hashed n-gram models, the target distribution over the raw distribution; these are the importance weights, and you resample with probability proportional to them. At a high level, all of them are doing essentially the same thing: you fit something that tells you what looks like target as opposed to raw, and you apply it to the raw dataset. Okay, maybe I'll pause here in case there are any questions. Yeah. Sort of a high-level philosophical question: we have an idea of what good data is, but we haven't really talked about what a good document actually looks like. What this is doing is making sure the words fit next to each other, that one plausibly comes after the other, but it doesn't ensure that the document makes sense at a larger level. Or do you address that by just increasing n in the n-gram model? Yeah, so the question, I think, is that if you're using an n-gram model, it's only looking at local context, and maybe that's not great for assessing quality, because if you shuffled everything around it would still look fine to the n-gram model. There's definitely that danger; it's very easy to adversarially game, right?
You can definitely construct examples that the n-gram model thinks are great. I mean, I showed you one: the "the the the" example happens to score pretty well. But somehow, on average, you're doing okay. And in some sense, if a document looks like grammatical English, like news or a scientific paper, all you want is for the model to assign high probability to it; if a few bad examples slip through, it's not the end of the world. Should we think of these as just filtering out the nonsense? Yeah, I would say filtering out nonsense is a good way to look at it. As I go through some examples, I think it will become a bit clearer that these classifiers are not just used for quality, and for some of the other use cases it's maybe more compelling why this should work. Okay, so the same data filtering machinery can be used for different tasks. I'm going to talk about language identification, quality filtering, and toxicity filtering. For language identification, the idea is that you want to find text in a specific language, maybe English, maybe Japanese. You could ask: why not just train on all the languages? One problem is that it can be difficult to do the curation and processing of high-quality data in every language. There's also the problem that if you only care about one language, then having data from all the other languages means you're spending less compute and fewer tokens on the language you care about, which can hurt performance in that language. For example, BLOOM, which was trained around 2022, I believe, was only about 30% English, and as a result its English performance was maybe not as good as it could have been, though it was strong on other languages. So in a compute-limited regime, you're definitely at risk of compromising quality in the language you care about. That said, if you have enough capacity and you're training huge models, you can definitely train multilingual models, and there can even be positive transfer between languages; all the frontier models are trained heavily multilingually, sometimes on 100-plus languages. One simple way to do language identification: fastText is a toolkit for training simple classifiers, and it also comes with some off-the-shelf models. One that typically gets used is their language identification model. It's an off-the-shelf classifier that supports a lot of different languages, and it was trained on multilingual sites, including Wikipedia, which covers many languages, translation sites, and, for some reason, Southeast European news. Dolma uses this: they run the fastText classifier and keep pages with English probability greater than 0.5. So let's try it out. There's this language ID model you can just download; let's see how well it does. "The quick brown fox jumps over the lazy dog." Let's see, where are the predictions... okay, there we go. This says English; the label is English, and the probability of English is 0.71. I would have guessed it would be higher; it looks pretty English to me. If you duplicate the sentence, the probability doesn't really change, which is good: saying the same thing twice doesn't make it more English.
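A minimal sketch of the Dolma-style keep rule, assuming you've downloaded the off-the-shelf `lid.176.bin` language identification model from the fastText website; the 0.5 threshold follows the rule described above.

```python
import fasttext

# Off-the-shelf language ID model from fasttext.cc (downloaded separately).
lid = fasttext.load_model("lid.176.bin")

def is_english(page_text: str, threshold: float = 0.5) -> bool:
    # Keep the page if the top predicted language is English with probability above the threshold.
    labels, probs = lid.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] > threshold

print(is_english("The quick brown fox jumps over the lazy dog."))
print(is_english("Der schnelle braune Fuchs springt über den faulen Hund."))
```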
There's some informal English that gets classified as, I guess, more English than the quick brown fox, which is kind of weird. This next one, oh sorry, this one is German, and now the label is German, so that's fine. Math gets classified pretty weakly as English. For code, the highest-scoring label is, like, Russian. This greeting is apparently Italian. This one is French, "Bonjour", that's good. And this one is, I guess, a tricky case where I put in some Spanish and also some English, and it favored the Spanish. Okay, so this gives you a sense. It's always good to play with these classifiers, because just because it's a classifier you can download and everyone uses doesn't mean it works all the time. Obviously for short sentences it's less reliable, because there's just less information to go on. Low-resource languages are tough. There's a worry that dialects of English could get classified as not English. Similar languages can easily be confused with each other. And for code switching, I don't even know what the ground truth should be, but hopefully you get one of the two languages. Let me talk through one case study: OpenWebMath, a paper from a couple of years ago. The goal is to curate a large corpus of mathematical text, where we're treating math as if it were a language. First they used some rules to filter. Then they trained KenLM on a large dataset of mathematical text called ProofPile and kept a document if its perplexity was below some threshold. They also trained a fastText classifier, which we've talked about, to predict mathematical writing. This part is kind of hacky, but they set the classifier threshold to be relatively permissive if the page was already identified as math by the rules, and stricter if it wasn't. The result was 15 billion tokens, and models trained on it did better than models trained on 20 times more data that wasn't specially curated. This is one of the nice examples of the utility of data filtering: you could train on all the data, but if you care about a particular domain like math, you can go out, target that domain, get more of it, and be much more data efficient. Okay, so let's talk about quality filtering. Quality filtering is a bit of a catch-all, and it's not obvious what "quality" even means. Remember that some papers explicitly don't use model-based quality filtering, but more recent papers have mostly given in and said, well, it just works a lot better. GPT-3, which we talked about a little last time, trained a quality classifier by taking positive examples from the high-quality, or rather the non-Common-Crawl, sources they had, with negatives just sampled from Common Crawl. They trained a linear classifier and kept documents stochastically based on the score, using a rule that basically sharpens the threshold while still letting some lower-scoring documents through. The first LLaMA paper used pages referenced by Wikipedia, so not Wikipedia itself, as positives; negatives are again sampled from Common Crawl, and they kept documents that were classified as positive. I don't think there was actually any stochastic sampling there, which is interesting.
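For the GPT-3-style stochastic keep rule, the paper describes keeping a document when a heavy-tailed Pareto draw exceeds one minus the classifier score. Here is a minimal sketch of that idea; the classifier score itself is assumed to come from a linear classifier like the ones above, and the shape parameter of 9 is the value I recall the paper reporting.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_keep(score: float, alpha: float = 9.0) -> bool:
    # score is the classifier's probability that the document looks like the target.
    # The heavy-tailed Pareto draw sharpens the cutoff but still lets a few
    # low-scoring documents through, which helps preserve diversity.
    return rng.pareto(alpha) > 1.0 - score

# Higher-scoring documents are kept far more often than low-scoring ones.
print(sum(stochastic_keep(0.9) for _ in range(10_000)) / 10_000)  # roughly 0.4
print(sum(stochastic_keep(0.1) for _ in range(10_000)) / 10_000)  # well under 0.01
```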
Phi-1 is another paper whose philosophy was kind of interesting. They really wanted high-quality data that looked like textbooks, and they wanted to train a very small model. So they trained on synthetic data, but also on filtered data. For the filtered data, they took the Python subset of the Stack; remember, the Stack is this preprocessed dataset of code from GitHub. They defined a prompt which effectively asked GPT-4 to determine the educational value for a student whose goal is to learn basic coding concepts, prompted GPT-4, and classified 100k documents from this Python subset. Those became the positive examples. Then they trained a random forest classifier on these examples, where the features fed into the random forest were embeddings from a pretrained CodeGen model, and they selected the data from R that this classifier marked as positive; that is, they ran the classifier over all of R and kept what it liked. This is kind of interesting because, where does T come from? Well, T doesn't really exist; effectively, T comes from prompting GPT-4 with some prompt. This is something you'll see increasingly as models get better: instead of saying, well, maybe Books1 is great, or maybe Wikipedia is great, if you have a good model you can just ask it for the type of data you want. Maybe I want more chemistry data, or mathematical data with lots of proofs, or elementary math. You let a fairly sophisticated language model give you T, and once you have T, you turn the crank on the recipe we've been talking about. The phi-1 results, again, look good. They had a baseline where they just took the Python subset of the Stack, which is R here, and the performance on HumanEval, which is a coding benchmark, was only 12% after 96k steps of training. They trained the exact same model architecture on the new, fancier dataset and it was already at 17% after 36k steps, so they declared success. Okay, these are some examples of quality filtering. Any questions about quality filtering? Yeah. So the question is why we can get away with using GPT-4 here. It's because you're only using it to create 100k labeled examples, which you then distill into a classifier that's much faster, and 100k is much less than the size of R, which is probably tens or hundreds of millions of documents. Okay, so let's talk a bit about toxicity filtering, and specifically how the Dolma paper does it. There's this dataset called Jigsaw Toxic Comments, which came out of a project aimed at helping people have better discussions online; they wanted a classifier that could identify problematic content. They took comments from Wikipedia talk pages and had annotators label them as toxic, severe toxic, obscene, threat, insult, or identity hate. The Dolma folks trained two fastText classifiers on this, one corresponding roughly to hate speech and the other to NSFW content. Here are some examples from the dataset, so let's download the model and try it out. The first example, "are you threatening me for disputing neutrality?", gets classified as safe for work, which seems fine. The second one is flagged as NSFW, even though, arguably, both of these sentences are fine. Okay, so that just gives you a sense. I don't want to show too many of these examples, but you get the idea; you can play with it on your own time.
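A minimal sketch of how a Dolma-style toxicity filter could be wired up; the model file names and label strings here are placeholders rather than the actual released artifacts, but the structure (two fastText classifiers and a drop rule) follows the description above.

```python
import fasttext

# Placeholder paths for the two Jigsaw-trained classifiers (hate and NSFW).
hate_model = fasttext.load_model("jigsaw_hate.bin")
nsfw_model = fasttext.load_model("jigsaw_nsfw.bin")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    # Flag the text if either classifier assigns its positive label above the threshold.
    # The label names below are assumptions; check the labels of the models you load.
    text = text.replace("\n", " ")
    for model, positive_label in [(hate_model, "__label__hate"),
                                  (nsfw_model, "__label__nsfw")]:
        labels, probs = model.predict(text)
        if labels[0] == positive_label and probs[0] >= threshold:
            return True
    return False
```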
Okay, so that's all I'll say about filtering. Now let's talk about deduplication. There are two types of duplicates. First, exact duplicates. The thing to realize is that the web inherently has a lot of exact duplicates because of mirroring. If you just look at Project Gutenberg, you'll see that the exact same site gets mirrored at a bunch of different URLs. If you're crawling, as Common Crawl does, you have no way of knowing these are mirrors; you just end up with the exact same content multiple times. Then there are near duplicates, which are basically the same text differing by a few tokens, and we'll look at algorithms that deal with both. So near duplicates: when do these happen? Well, you have terms of service and licenses like the MIT license. That text shows up I don't know how many times on the web; it's just copied and pasted everywhere. Maybe people take it and change a few things, or they mess up the copy and paste and drop a comma or something; that's all possible. And if you look at actual datasets, here are some examples that give you an idea of what near duplicates look like. Sometimes you have an article where someone seems to have paraphrased "best actor in a negative role" as "most impactful character", but everything else is the same. This one is just missing a comma; I'm not really sure where that comma went, but there can be annotator or processing errors. And this one is kind of interesting: there's clearly this ad-like text where they just replaced "Canada" with "USA" and a few other fields, so it was probably generated from some template system where they slotted in entities. These near duplicates are technically different documents, but once you have one of them, the other doesn't give you much additional value; let's just put it that way. In an extreme case, if you look at C4 (remember, C4 was filtered by a set of rules looking for sentences that look like English), you'll see one sentence, which is indeed English, that shows up an astonishing 61,000 times. If you actually trace down where it comes from, it's not the MIT license. Oh, what? Okay, do I have to accept these cookies? Okay, fine. It's some random product page from Amazon with graffiti-style drawings, and there's this text, and for whatever reason, I don't even know how, there are 61,000 copies of it, but that's what it is. The idea is that this text isn't necessarily bad; it's perfectly good English. But you don't want to take 61,000 epochs over it; that's kind of pointless. So deduplication is complementary to quality filtering. Quality filtering says: this piece of data, I never want to train on it. Deduplication says: this might be fine, but I only want a small number of copies rather than 61,000. There's been work showing that deduplication makes language models better. The first, most obvious benefit is that you can train more efficiently: deduplication reduces the number of tokens, and if you haven't thrown away information, you save a lot of effort. The second is that you can also avoid memorization.
We haven't really talked about this much, but language models trained on data can memorize that data, which is bad for copyright or privacy reasons, because the model can just regurgitate the training set. Deduplication is one tool here; it's definitely not perfect, but it can help mitigate some of these risks. Okay, so here's how to think about deduplication in terms of the design space. First, what is the unit you're deduplicating; what is an item? Is it a sentence, a paragraph, a document? Second, how do you match; what counts as a duplicate? Is it an exact match, or some fraction of common sub-items? And finally, what action do you take: do you remove all the instances, or all but one? The key algorithmic challenge is that deduplication is fundamentally about comparing items to other items. Quality classification takes a single item and classifies it as high quality or not; deduplication is fundamentally a pairwise thing. And remember, these datasets are large, so anything quadratic is a no-go; you need some sort of linear-time algorithm that still captures something pairwise. The building block for all of this is hash functions. As a review, a hash function maps an item, which could be a sentence or a document, to a hash value, an integer or a string, where the value is much smaller than the item. The relevant issue with hash functions is that you can get hash collisions: two distinct items that map to the same hash value. I'm sure all of you have used dictionaries or hash tables, where collisions are generally bad, but later we'll see that they can actually be useful in limited doses. There are many types of hash functions, which trade off efficiency against collision resistance. There are cryptographic hash functions for security-sensitive applications, like in Bitcoin, where you really don't want collisions because someone could steal your money or otherwise compromise things; but these are slow. And then there are hash functions generally used for hash tables, which are not collision resistant but are fast. We'll generally go with the latter category, because we want things to be fast and we're not doing cryptography here. We're going to use something called MurmurHash, which takes strings and returns values, but you could use a bunch of other hashes as well. Okay, so let's start with exact deduplication. Here's a simple example: the item is just a string, I want exact match, and I want to remove all but one copy. Given a bunch of items, I first compute a mapping from each hash value to the list of items with that hash. Once I have that, I keep one item from each group, where each group corresponds to a hash value. For exact match, assuming there are no hash collisions, this gives you exact deduplication. So "hello" shows up twice here, and now it shows up once; and "Hello!" with an uppercase H and an exclamation point is a completely distinct item, because we're doing exact match. The nice thing is that this is very simple, it's clear what you're doing, and it's high precision: you're never going to throw away anything you actually needed.
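A minimal sketch of this hash-and-group approach, assuming the `mmh3` MurmurHash bindings; keeping the first item per hash value is the "remove all but one" action.

```python
import mmh3  # MurmurHash bindings (pip install mmh3); any fast non-cryptographic hash works

def exact_dedup(items: list[str]) -> list[str]:
    # Group items by hash value and keep the first item from each group.
    # Assuming no hash collisions, this keeps one copy of each distinct string.
    kept: dict[int, str] = {}
    for item in items:
        h = mmh3.hash(item)
        if h not in kept:
            kept[h] = item
    return list(kept.values())

print(exact_dedup(["Hello!", "hello", "hello", "the cat"]))
# ['Hello!', 'hello', 'the cat'] -- exact match only, so "Hello!" survives separately
```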
The con is that it doesn't work for near duplicates. The other nice thing is that it's very simple to implement; we wrote it in a map-reduce style, so you can scale and parallelize it fairly easily. This is what C4 actually uses to prepare their dataset: their unit of deduplication is three-sentence spans, they use exact match, and they remove all but one copy. One thing that has always kind of bothered me, though no one else seems to care, is that when you're deduplicating three-sentence spans, you have a document, a three-sentence span inside it gets marked as a duplicate, and you just do surgery and take those sentences out; the resulting document might not even be coherent. But people say "who cares" and move on. Okay, so there's another way to do exact deduplication that is less map-reduce and more like maintaining a hash-table-style data structure: Bloom filters. Some of you may have seen these, but let me go through it because I think it's a cool method. It's approximate, so it's more efficient, but it's obviously not always doing exactly the right thing; we're not being too picky here, though. The nice thing is that it's very memory efficient. You can add items, but you can't delete them. It's basically a representation of a set where, if you query set membership and it returns no, the answer is definitely no; if it returns yes, then most likely the answer is yes (if you set your hyperparameters right), but there's a small probability of a false positive. You can set the parameters to drive the false positive rate down as far as you like. So let's walk through what this looks like. Suppose you have five items, "the cat in the hat", and later we'll test some items that don't show up in the set. First, you define a hash function that maps into m bins; m is going to be 8 here. To build the Bloom filter, you do something fairly simple: you make a bit array with m bins, then go through every item and hash it. "the" hashes to 2, so I set position 2; "cat" hashes to 7, so I set position 7; and so on. I just populate the bit array; I don't keep track of the items themselves, so I have no idea whether I'll get false positives later. Let's sanity check that every item I put in is reported as being in the set: to query, you check whether the bit at that item's hash is set, and indeed everything passes. Now let's look at items that are not in the set and see how they fare. For some of them you get a zero, which is what you want, but for some of them the Bloom filter says: oh, actually, that's in the set. The number of mistakes here is four, and if you compute the false positive rate, the fraction of non-members that incorrectly come back positive, it's 0.44 in this case, which is pretty bad. Of course, this is with eight bins, and as the number of bins grows this gets better; the collision probability falls off as one over the number of bins.
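Here is a minimal single-hash Bloom filter matching the walkthrough above (a small bit array, MurmurHash via `mmh3`); the multi-hash version described next just repeats the same update and check with k different seeds.

```python
import mmh3

class BloomFilter:
    # Minimal single-hash Bloom filter: a bit array plus one hash function.
    def __init__(self, num_bins: int = 8):
        self.num_bins = num_bins
        self.bits = [0] * num_bins

    def add(self, item: str) -> None:
        self.bits[mmh3.hash(item) % self.num_bins] = 1

    def __contains__(self, item: str) -> bool:
        # 0 means "definitely not in the set"; 1 means "probably in the set".
        return self.bits[mmh3.hash(item) % self.num_bins] == 1

bf = BloomFilter(num_bins=8)
for word in ["the", "cat", "in", "the", "hat"]:
    bf.add(word)
print("cat" in bf)  # True: no false negatives
print("dog" in bf)  # may be True or False, depending on collisions
```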
Okay, but you look at this and say: that's not great, because if I want the error probability to be, say, ten to the minus ten, then I need a huge number of bins, and that's going to hurt on memory. So there's a more clever solution, which is to use more hash functions. Let's say I use two hash functions. I still have the same bit array, but now I hash every item twice. "the" under seed 0 hashes to 2, so I set that bit; "the" under seed 1 hashes to 5, so I set that one too. Every item gets hashed into k slots, not necessarily distinct. Then I go through "cat" and the rest and fill up the table. Now when I query, I take an element and return one only if the bit was set for all k hash functions; because inserting an item set k bits, testing means checking all k locations and confirming they're all set. Indeed, all the true members pass. And for the non-members, the number of mistakes is now three and the false positive rate has decreased, which is great. This is a toy example, of course, but you can really drive down the false positive rate with a modest memory budget. Okay, maybe I'll pause there before analyzing this a bit more formally. So let's think about what happens more generally. Say we have a thousand bins, ten hash functions, and we're going to hash 100 items; what is the false positive rate? The way to think about it is: consider a test input that's not in the set, which hashes to some particular location i. Now consider inserting the n items into the Bloom filter and ask: what's the probability that anything hits i? If something hits i, that's a bad sign. Let's warm up. Say n is one, I'm inserting just one element, and k is also one, a single hash function. What's the chance of a hash collision between i and the element I'm inserting? It's just one over the number of bins, assuming everything is independent, so 0.001 here. Now keep n at one but use k hash functions. The probability that a single hash misses i is 1 minus 1/m; raised to the k, that's the probability that all k hashes miss; and one minus that is the probability that the test bin gets hit, which comes out to roughly k/m. Now if I'm inserting n items, I'm asking for the probability that the test bin is set to one after all n insertions. Instead of missing k times, I have to miss k times n times, so it's the same expression with k times n in the exponent: 1 minus (1 minus 1/m) to the k times n. And finally, the test item itself is also hashed k times, and a false positive requires all k of its bits to have been set, so you raise that whole expression to the k: f equals (1 minus (1 minus 1/m)^(k·n))^k. This is where k really helps, because it lets you drive down the false positive rate; f goes from 0.63 down to about 0.01 here. You can take this expression for f and compute the optimal value of k for a fixed ratio m over n; it's a sort of asymptotic calculation, and you find that the optimal k scales as order m over n. Specifically, the number of hash functions you should use is about (m/n) times ln 2; if you do that, each bit ends up set with probability 0.5 and the false positive rate is 0.5 to the k. So you can see that in this particular example k equals 10 wasn't optimal: the optimum is around k equals 6 or 7, giving a false positive rate of about 0.008 rather than 0.01.
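A quick numerical check of that formula for the m = 1000, n = 100 example; the search over k is just brute force.

```python
import math

m, n = 1000, 100  # bins and inserted items, as in the example above

def false_positive_rate(k: int) -> float:
    # Probability that a given test bit is set after n insertions with k hashes each,
    # raised to the k because a false positive needs all k of the test bits to be set.
    return (1 - (1 - 1 / m) ** (k * n)) ** k

print(round(false_positive_rate(10), 4))              # about 0.01 with k = 10
print(round((m / n) * math.log(2), 2))                # asymptotically optimal k, about 6.9
best_k = min(range(1, 21), key=false_positive_rate)   # brute-force integer optimum
print(best_k, round(false_positive_rate(best_k), 4))  # about 0.008
```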
You can read more about this; there are some lecture notes on it. The point is that the Bloom filter has hyperparameters that let you trade off the false positive rate you want, how much memory you use, and how much compute you're willing to spend. Yeah? So you're asking: the optimal k went down, so why does f go down? Let's see. I think it's not monotonic: if you use a very large value of k, that's actually pretty bad because you just fill up the Bloom filter, and if you use too small a value, that's not great either. So there's an optimum, and whether f goes down or up as you change k depends on which side of the optimum you're on. Okay. In Dolma, they set the false positive rate to ten to the minus fifteen, and they used a Bloom filter to do exact deduplication at the paragraph level. So that was exact deduplication, and the problem with it is that it doesn't capture the near misses, where morally something is a duplicate but we can't detect it. So how do you do approximate set membership? To talk about that, you need a notion of a similarity measure, and a common one is Jaccard similarity. The Jaccard similarity of two sets A and B is the size of their intersection over the size of their union. As an example, if A is {1, 2, 3, 4} and B is {1, 2, 3, 5}, the intersection has size three and the union has size five, so the Jaccard similarity is 0.6. We'll say that two documents are near duplicates if their Jaccard similarity exceeds some threshold, and generally the threshold will be fairly high, like 0.99, if you only want to match things that differ by, say, a missing comma. Notice there's nothing semantic about this: you could be missing a word like "not", which changes the meaning completely, or some other content word. This is just trying to find documents that look superficially similar. Now the algorithmic challenge is: how do you find near duplicates in linear time? To work up to that, we're going to rewrite the Jaccard similarity, which again is pairwise (you can't compute the Jaccard of one element, only of a pair), in terms of a hash function, which you can compute on a single element. There's a hash function called MinHash with the nice property that the probability of a hash collision is exactly Jaccard(A, B). Normally, remember, you think of hash collisions as something to be avoided at all costs. Here it's not that you want more hash collisions, it's that you want to control the collision rate: you want exactly the right level of collisions, governed by the similarity. So again: the similarity is equal to the probability of collision. The more similar two things are, the more they should collide. That's the intuition.
Okay, so MinHash is defined as follows. You take a set, hash every element (remember, hashes are just numbers), and return the minimum. If you haven't seen this before, it's not at all obvious why taking the min gives you this property that the collision probability equals the Jaccard similarity, but the argument is actually fairly simple. The way to think about it is that you have your five items and your two sets, and you look at the characteristic-matrix representation: A contains 1, 2, 3, 4 and B contains 1, 2, 3, 5. A random hash function induces a random permutation over the items; the permutation might be 3, 2, 1, 4, 5, for example. Now look at which item comes first in A and which comes first in B under that permutation; that item is the one with the minimum hash. Each item has the same probability of being first, because the hash function is assumed to be random. If 1, 2, or 3 comes first, then the min hash of A equals the min hash of B, because that element is in both sets. But if 4 or 5 has the smallest hash, then the first element of A is not the first element of B, and the min hashes will differ. So the probability of the min hashes being equal is exactly the probability that 1, 2, or 3 comes first among the five items in the union, which is 3 out of 5: exactly the intersection over the union. If an item in the intersection comes first, you get a collision; if an item outside the intersection comes first, you don't, because one set attains the minimum there and the other set gets some other value. You can check this empirically if you don't believe me: generate a thousand hash functions, run MinHash on A and B, and measure how often they collide. The estimated Jaccard comes out to about 0.6; not exactly 0.6, there's some sampling noise, but pretty close.
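A minimal sketch of that empirical check, assuming the `mmh3` package and using the hash seed to simulate independent hash functions:

```python
import mmh3

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def minhash(items: set, seed: int) -> int:
    # MinHash: hash every element with the same seeded hash function and keep the minimum.
    return min(mmh3.hash(str(x), seed) for x in items)

A, B = {1, 2, 3, 4}, {1, 2, 3, 5}
trials = 1000
collisions = sum(minhash(A, seed) == minhash(B, seed) for seed in range(trials))
print(jaccard(A, B))        # exact Jaccard similarity: 0.6
print(collisions / trials)  # MinHash collision rate: close to 0.6
```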
Okay, so now we've taken the Jaccard similarity, which is a pairwise function, and reduced it to a probabilistic unary function of a single item. But we're far from done, because a single collision doesn't tell us that two items are similar; it's just that more similar items are more likely to collide. We want something stronger, and this is where locality sensitive hashing comes in. So far we have a hash function on sets such that two sets collide with probability equal to their Jaccard similarity. What we really want is for A and B to collide if and only if, ideally, their Jaccard similarity exceeds some threshold. So we have to sharpen the probabilities: if the Jaccard is above the threshold, they should hash together with probability close to one, and if it's below, with probability close to zero. How do we do this? It's basically the same trick again: use more hash functions. The construction is to break your n hash functions into b bands of r hash functions each. For example, with twelve hash functions and three bands, each band has four hash functions: h1 through h4, h5 through h8, h9 through h12. Then we declare that A and B collide if, for some band, all of that band's hash functions return the same value. So given A and B, I hash both with all twelve min hash functions, and if, say, the values for h1, h2, h3, h4 all agree, they collide; or if this next band agrees; or this one. There's an AND-OR structure here, and that's the key to sharpening the probabilities around the threshold. Now we just have to do a little analysis to calculate the probability that A and B collide. I'll use sim to refer to the Jaccard similarity. Say the similarity is 0.8, and I have five bands of ten hash functions each, so 50 hash functions total. What's the probability that a fixed band matches? It's simply 0.8 to the r, because each individual min hash collides with probability equal to the Jaccard similarity, 0.8. And what's the probability that some band matches? Well, one minus 0.8^r is the probability that a fixed band fails to match, that raised to the b is the probability that all the bands fail, and one minus that is the probability that at least one band matches. So the probability of collision comes out to around 0.4. It's reshaping the probability: with a single hash function, b and r both one, the collision probability would just be 0.8, and now it's been pushed down to around 0.4. Here's the picture to have in your head: on the x-axis is similarity, and what we want is for the mapping from similarity to collision probability to look like a step function. If the desired threshold is 0.5, I want everything below it to map to near zero and everything above it to map to near one. Let's see what this looks like: for various similarity values I'll set b equals ten and r equals ten, and you can see it squashes down the low probabilities and sharpens the high ones. But maybe I want to make this even sharper, so I increase r from ten to 20. That moves the curve to the right, making it harder to match, and also sharpens the threshold. Before, a similarity of 0.7 had a collision probability of 0.24, which is kind of high and I really don't want that; after increasing r I can get it down to about 0.007, which is pretty good. But I've also squashed the probabilities for high similarities, and the goal isn't to shift everything to the right, otherwise I might as well do exact deduplication. So there's another trick: if I increase b, the number of bands, I can get the probabilities for high similarities like 0.9 up into the nineties while the low-similarity probabilities stay fairly small. So this picture shows that increasing b shifts the curve to the left, while increasing r sharpens the curve and moves it to the right, and with an appropriate setting of b and r you can get a really sharp sigmoid.
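A minimal sketch of that collision probability as a function of similarity; the specific b and r values are just for illustration, and the last line previews the threshold formula used in the example below.

```python
def collision_probability(sim: float, b: int, r: int) -> float:
    # At least one of the b bands has all r min-hashes agree, where each
    # individual min-hash collides with probability equal to the similarity.
    return 1 - (1 - sim ** r) ** b

for sim in [0.5, 0.7, 0.8, 0.9, 0.99]:
    print(sim,
          round(collision_probability(sim, b=10, r=10), 4),
          round(collision_probability(sim, b=50, r=20), 4))

# The similarity at which the curve crosses roughly 1 - 1/e is about (1/b) ** (1/r).
print(round((1 / 20) ** (1 / 450), 4))  # ~0.99 for the b=20, r=450 setting discussed next
```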
Okay, so here's an example of what this looks like in practice. There's a paper we saw last time that does near deduplication using MinHash LSH, and their values are b equals 20 and r equals 450. That r is large, which means they want a really strict threshold; there's a formula, roughly (1/b) to the power 1/r, which works out to a threshold of about 0.99. That means for every 100 words, you can basically only allow one word to be different, so this is fairly strict deduplication. The probability that a fixed band matches at the threshold is the threshold to the r, which is 0.05, and the probability of collision is then 0.64. One thing that's interesting is that I'm computing the probability of collision exactly at the threshold, so in some sense it should be somewhere between zero and one, and it turns out it converges to one minus one over e, which is around 0.63. So right at the threshold it could go either way, but as soon as you go above the threshold the collision probability becomes very high, and below the threshold it becomes very low. Okay, so with that, let me summarize; actually, maybe I'll first ask if there are any questions about deduplication. Deduplication makes training faster and helps avoid memorization; you can do exact deduplication with Bloom filters, or you can use MinHash LSH to do approximate deduplication. Yeah? So these hash functions are based on exact tokens; since synthetic data is on the rise, I was wondering how you would deduplicate in a way that's paraphrase sensitive. Yeah, so the question is: given that synthetic data is on the rise, how do you deduplicate using something that's paraphrase sensitive? There are some papers I didn't mention, on semantic deduplication, that essentially look at embeddings, which gives you a more semantic notion of similarity. And this all fits into the same framework, because in some sense LSH was devised to find approximate nearest neighbors; all you have to do is embed your documents, and then you can find nearest neighbors in embedding space. Obviously there's a higher cost if you have to run an embedding model over all your documents. Also, we're working in a regime where the near duplicates are really near duplicates; I think you have to be very careful with fuzzier notions of duplication, because if you're too permissive you can throw out a lot of your data. Another question: is there any situation where you have duplicates but you actually want to keep them? Yeah, so the question is, if the data is high quality, maybe you do want duplicates, and that's actually right. In a lot of language model papers there's a sort of mid-training regime where you have a bunch of high-quality sources and you actually want to take multiple epochs over that high-quality data. Deduplication is more for the pre-training data, where you have all this stuff and 61,000 copies of some random thing and you just want to clean that up. I don't know what the optimal thing to do is, but if a document occurs a lot of times, that could also be a signal that it's more important.
And maybe the right thing to do is to take the count and, I don't know, take the square root or the log of it, so that the document doesn't show up proportionally in your training data, but you still acknowledge that it's important enough to take a few more epochs over. Okay, let's summarize. This lecture was mostly about giving you algorithmic tools. We looked at n-gram models, linear classifiers, and importance resampling as ways to do filtering. With filtering you can do language identification, quality filtering, toxicity filtering, anything where you have a target dataset and you want to get more data like it; you can apply these tools. Or, if you don't have a target dataset but you have strong language models, you can prompt a language model to synthesize T, or to filter a lower-quality dataset down to a T, and then you train a fast classifier on that and you're off to the races. Second, we looked at deduplication, which is important for reducing the amount of compute you're essentially wasting. The algorithmic tool here is hashing, because hashing is what lets you take these pairwise notions of similarity and collision and turn them into unary functions, which is what gives you the linear-time property. And we saw some clever methods where using multiple hash functions and these AND-OR constructions lets you shape your matching criterion, as in the LSH example. So now you have the tools. In some sense, if you ask how you teach data, this is only the beginning: you really have to spend time with the data, looking at it, filtering it, and training models, and that's how you build intuition about what works and what doesn't. Hopefully you'll get to do a bit of that in assignment 4. Okay, that's all for the data lecture. Next week we'll start on reinforcement learning and alignment, which will be our last unit.