speaker 1: Okay, let me get started for today. For today, first of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including a couple of the concepts that turn up in assignment two. The bulk of today is then going to be moving on to introducing what language models are. And after introducing language models, we're going to introduce a new kind of neural network, which is one way to build language models: recurrent neural networks. They're an important thing to know about, and we use them in assignment three, but they're certainly not the only way to build language models. In fact, probably a lot of you already know that there's this other kind of neural network called transformers, and we'll get on to those after we've done recurrent neural nets. I'll talk a bit about problems with recurrent neural networks, and, if I have time, I'll get onto the recap. Before getting into the content of the class, I thought I could just spend a minute giving you the stats of who is in CS224N. Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess. So the four big groups: the computer science undergrads, the computer science grads, the undeclared undergrads, and the NDO grads, which is a large portion of the SCPD students, some of whom are also computer science grads. That makes up about 60% of the audience. If you're not in one of those four big groups, you're in the other 40%, and everybody is somewhere. There are lots of other interesting groups down here. The bright orange down here, that's where the math and physics PhDs are. Interestingly, we now have more statistics grad students than SymSys undergrads, which didn't use to be the way around in NLP classes. And one of my favorite groups, the little magenta group down here, these are the humanities undergrads. In terms of years, it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave frosh. Are there any brave frosh here today? Yeah. Okay, welcome. Modern neural networks, especially language models, are enormous. This chart is sort of out of date because it only goes up to 2022, and it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate, they're clearly huge models with over 100 billion parameters. So large, and deep in terms of very many layers, neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models. But to start off with something simpler, I did just want to key you in for a few minutes to a little bit of history. The last time neural nets were popular was in the eighties and nineties, and that was when people worked out the backpropagation algorithm. Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at, and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer: you had the input, the hidden layer and the output, and that's all there was.
And the reason for that was that, for a very, very long time, people couldn't really get things to work with more hidden layers. That only started to change with the resurgence of what often got called deep learning, but was at any rate a return to neural nets, starting around 2006. This was one of the influential papers at the time, Greedy Layer-Wise Training of Deep Networks, by Yoshua Bengio and colleagues. Right at the beginning of that paper, they observe the problem: "However, until recently it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers." Gerald Tesauro, who's cited there, was someone who worked very early on with neural networks. "As this is a negative result, it has not been much reported in the machine learning literature." So really, although people had neural networks and backpropagation, and the recurrent neural networks we're going to talk about today, for a very long period of time, 15 years or so, things seemed completely stuck. Although in theory it seemed like deep neural networks should be promising, in practice they didn't work. And so it then took some new developments in the late 2000s, and then more profoundly in the 2010s, to actually figure out how we could have deep neural networks that actually worked, working far better than the shallow neural networks, and leading into the networks that we have today. We're going to be starting to talk about some of those things in this class and in coming classes. And I think the tendency, when you see the things that got neural networks to work much better, is to sort of shrug and be underwhelmed and think, oh, is this all there is to it? This doesn't exactly seem like difficult science. And in some sense that's right: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks turned things around, from a field that was stuck for 15 years, going nowhere, and which nearly everyone had abandoned because of that, to suddenly having the ability to train these deeper neural networks, which then behaved amazingly better as machine learning systems than the things that had preceded them and dominated in the intervening time. So what are these things? One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. Regularization is the idea that beyond just having a loss that we want to minimize in terms of describing the data, we want to, in some other ways, manipulate what parameters we learn so that our models work better. So normally we have some more complex loss function that does some regularization. The most common way of doing this is what's called L2 regularization, where you add on this sum-of-parameters-squared term at the end. And this regularization says, you know, it'd be kind of good to find a model with small parameter weights, so you should be finding the smallest parameter weights that will explain your data well. There's a lot you can say about regularization; these kinds of losses get talked about a lot more in other classes like CS229, machine learning. So I'm not going to say very much about it.
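Just to make that concrete, here's a minimal sketch of what adding an L2 penalty to a loss can look like in PyTorch; the function name and the lambda value are made up for illustration, and in practice you'd often just use an optimizer's weight decay setting instead.

```python
import torch

# Minimal sketch of an L2-regularized loss (the function name and the
# lambda value here are hypothetical).
def regularized_loss(model, data_loss, lam=1e-4):
    # J(theta) = data loss + lambda * sum_i theta_i^2
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + lam * l2_penalty
```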
This isn't a machine learning theory class, but I do just want to put in one note that's very relevant to what's happened in recent neural network work. The classic view of regularization was that we needed this kind of regularization to prevent our networks from overfitting, meaning that they would do a very good job of modeling the training data, but then generalize badly to new data. And so the picture that you got shown was this: as you train on some training data, your training error necessarily goes down. However, after some point, you start learning specific properties of things that happen to turn up in those training examples, things that are only good for the training examples, and so they won't generalize well to the data you see at test time. So if you have a separate validation set or a final test set and you trace out the error or loss on it, after some point it starts to go up again. There's a quirk in my bad PowerPoint; it's just meant to go up. And once it goes up, you have overfit your training data. Making the parameters numerically small is meant to lessen the extent to which you overfit on your training data. This is not a picture that modern neural network people believe at all. Instead, the picture has changed like this. We don't believe that overfitting, in that old sense, is the thing to fear anymore; what we are concerned about is whether models generalize well to new data. In classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate those parameters well and you'd just have all of this noisy mess. But what's actually been found is that, yes, you can't estimate the numbers well, but what you get is a kind of interesting averaging over all these myriad numbers. And if you do it right, what happens is that as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only will your training loss continue to go down, if only infinitesimally, but your validation loss will go down as well. So on huge networks these days, we train our models so that they fit the training data almost completely. If you train a huge network now on a training set, you can essentially train it to get zero loss; maybe it's 0.0000007 loss or something, but you can train it to essentially zero loss, because you've got such rich models that you can perfectly fit, memorize, the entire training set. Classically, that would have been seen as a disaster, because you've overfit the training data. With modern large neural networks, it's not seen as a disaster, because provided you've done regularization well, your model will also generalize well to new data. However, the flip side is that normally this kind of L2 regularization, or similar ones like L1 regularization, isn't strong enough to achieve that effect. And so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout. This is one of the things that's on the assignment. And at this point I should apologize a little, because the way dropout is presented here is the original formulation.
The way dropout is presented on the assignment is the way it's now normally done in deep learning packages, so there are a couple of details that vary a bit. Let me just present the main idea here and not worry too much about the details of the math. The idea of dropout is that at training time, every time you are doing a piece of training with an example, inside the middle layers of the neural network you're just going to throw away some of the inputs. Technically, the way you do this is you sample a random mask of zeros and ones each time, and you take the Hadamard product of that mask with the data. So some of the data items go to zero, and you have a different mask each time; for the next example, something different is masked out. You're just randomly throwing away inputs. The effect of this is that you're training the model so that it has to be robust, work well, and make as much use of every input as it can. It can't decide to be extremely reliant on component 17 of the vector, because sometimes that component is just going to randomly disappear. So if there are other features you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time you randomly delete things; at test time, partly for efficiency but also for quality of the answer, you don't delete anything. You keep all of your weights, but you rescale things to make up for the fact that you used to be dropping things. Okay. There are several ways you can think of explaining this. One motivation that's often given is that this prevents feature co-adaptation. Rather than a model being able to learn that complex functions of features seven, eight and eleven can help it predict something, it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is that there's been a lot of work on model ensembles, where you can mix together different models and improve your results. If you're training with dropout, it's kind of like you're training a huge model ensemble, because you're training the ensemble over the power set, the exponential number of every possible subset of dropped features, all at once. And that gives you a very good model. If you've seen naive Bayes and logistic regression models before, I kind of think a nice way to see it is that it gives us a middle ground between the two. For naive Bayes models, you're weighting each feature independently, just based on the data statistics, no matter what other features are there; in a logistic regression, weights are set in the context of all the other features; and with dropout, you're somewhere in between. You're setting the weights in the context of some of the other features, but different ones disappear at different times. But following work that was done at Stanford by Stefan Wager and others, generally these days people regard dropout as a form of feature-dependent regularization, and that work shows some theoretical results as to why to think of it that way. Okay, I think we've implicitly seen this one, but vectorization is the idea: no for loops; always use vectors, matrices and tensors. The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices and tensors.
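Here's a minimal sketch of that mask idea, in the inverted-dropout form that packages such as PyTorch use (torch.nn.Dropout does this for you); notice it's a vector operation on the whole tensor, with no for loop.

```python
import torch

def dropout(x, p_drop=0.5, training=True):
    # At test time keep everything; the rescaling below means no extra
    # correction is needed here ("inverted" dropout).
    if not training or p_drop == 0.0:
        return x
    # Sample a fresh 0/1 mask the same shape as x (1 = keep this unit).
    mask = (torch.rand_like(x) > p_drop).float()
    # Hadamard product with the mask, rescaled so the expected value of
    # each unit matches what the test-time network sees.
    return x * mask / (1.0 - p_drop)
```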
So if you're writing for loops in any language, but especially in Python, things run really slowly. If you can do things with vectors and matrices, even on a CPU, things run at least an order of magnitude faster. And what everyone really wants to do in deep learning is run things on GPUs, or sometimes now neural processing units, and then you're getting two or three orders of magnitude of speedup. So do always think: I should be doing things with vectors and matrices, and if I'm writing a for loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake, and I should be working out how to do things with vectors and matrices. That applies to things like dropout: you don't want to write a for loop that goes through all the positions and sets some of them to zero; you want to use a vector operation with your mask. Two more, I think. Parameter initialization. This one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to small random numbers. The reason is that if we just start with our matrices all zero, or some other constant, normally we have symmetry. It's like in this picture, when you're starting on this saddle point: it's symmetric forward and backwards and left and right, so you don't know which way to go, and you might be stuck and stay in the one place. Another way to think about it is that the operations you're doing to all the elements in the matrix are the same, so rather than having a whole vector of features, if all of them have the same value initially, it's often as if you only have one feature and you've just got a lot of copies of it. So to initialize learning and have things work well, we almost always want to set all the weights to small random numbers. And when I say small, we want to make them in a range so that they don't disappear to zero, and they don't start blowing up into huge numbers when we multiply them by things. Doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods, based on thinking about what happens once you do matrix multiplies, that people worked out and often used. One of these is Xavier initialization, which works out what you want the range of your uniform distribution to be, based on the number of inputs and outputs of a layer and things like that. The specifics of that, I think we still use to initialize things in assignment two, but we'll see later that they matter less, because people have come up with cleverer methods, in particular layer normalization, which obviates the need to be so careful about the initialization. But you still need to initialize things to something. Okay, then the final one, which also appears in the assignment and that I just wanted to say a word about, is optimizers. We talked in class about stochastic gradient descent and did the basic equations for it. And to a first approximation, there's nothing wrong with stochastic gradient descent.
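To pin down the initialization point before going further with optimizers, here's a minimal sketch of the Xavier (Glorot) uniform scheme on a hypothetical layer; nn.init.xavier_uniform_ is the built-in equivalent.

```python
import math
import torch.nn as nn

layer = nn.Linear(100, 50)   # hypothetical layer: 100 inputs, 50 outputs

# Xavier / Glorot uniform: draw weights from U(-a, a) with
# a = sqrt(6 / (fan_in + fan_out)), so activations neither shrink toward
# zero nor blow up as you stack matrix multiplies.
a = math.sqrt(6.0 / (layer.in_features + layer.out_features))
nn.init.uniform_(layer.weight, -a, a)
nn.init.zeros_(layer.bias)

# The built-in equivalent:
# nn.init.xavier_uniform_(layer.weight)
```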
And if you fiddle around enough, you can usually get stochastic gradient descent to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, on having the right step size, and often you have to have a learning rate schedule with decreasing step sizes and various other complications. So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem necessary to get them to learn well. At any rate, they give you a lot of margin of safety, since they're much less dependent on you setting the hyperparameters just right. The idea of all the methods I mention here, the most commonly used methods, is that for each parameter they accumulate a measure of what the gradient has been in the past. They've got some idea of the scale of the gradient, the slope for a particular parameter, and then they use that to decide how big a step to take for that parameter at each time step. The simplest such method was this one called Adagrad; John Duchi, whom some of you may know, was one of the co-inventors of it. It's simple and nice enough, but it tends to stop learning too early. Then people came up with different methods. Adam, the one that's on assignment two, is a really good, safe place to start. In a way, our word vectors have a special property because of their sparseness: you're updating them very sparsely, because particular words only turn up occasionally. So people have also come up with particular optimizers that have special properties for things like word vectors. And these ones with a W at the end, like AdamW, can sometimes be good to try. And again, there's a whole family of extra ideas that people have used to improve optimizers, and if you want to learn about that, you can go off and do an optimization class like convex optimization. There are ideas like momentum and Nesterov acceleration, and all of those things people also variously try to use. But Adam is a good name to remember, if you remember nothing else. Okay, that took longer than I hoped, but I'll get on now to language models. Okay, language models. In some sense, "language model" is just two English words, but when in NLP we say language model, it's a technical term with a particular meaning. The idea of a language model is something that can predict well what word is going to come next, or, more precisely, that puts a probability distribution over what words come next. "The students opened their": what words are likely to come next? Bags, laptops, notebooks. Notebooks, yes, I have some of those at least. Okay. Right, so these are kind of likely words, and if on top of those we put a probability on each one, then we have a language model. So formally, we've got a context of preceding items, and we're putting a probability distribution over the next item, which means that the sum of the estimates over items in the vocabulary will sum to one. If we've defined a P like this that predicts probabilities of next words, that is called a language model, as it says here. An alternative way you can think of a language model is that it's a system that assigns a probability to a piece of text; we can say that a language model can take any piece of text and give it a probability.
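One loose end from the optimizer discussion a moment ago: here's a minimal sketch of what swapping plain SGD for Adam or AdamW looks like in PyTorch. The model and the settings are hypothetical; the point is just the optimizer choice.

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)   # stand-in for whatever network you're training

# Plain SGD: works, but you have to get the learning rate (and usually a
# decay schedule) right yourself.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Adam keeps running per-parameter estimates of recent gradients, so it is
# far less sensitive to the exact learning rate. AdamW is the variant with
# decoupled weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# The usual step then looks like:
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Okay, back to language models.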
And the reason a language model can assign a probability to any piece of text is the chain rule. If I want the probability of any stretch of text, then given my previous definition of a language model, it's easy: it's the probability of x1 with a null preceding context, times the probability of x2 given x1, and so on. I can do this chain rule decomposition, and the terms of that decomposition are precisely what the language model, as I defined it previously, provides. So language models are an essential technology for NLP. Just about everywhere, from the simplest places onward, where people do things with human language and computers, people use language models. They weren't something that got invented in 2022 with ChatGPT: language models have been central to NLP at least since the eighties, and the idea of them goes back to at least the fifties. Any time you're typing on your phone and it's making suggestions of next words, regardless of whether you like those suggestions or not, those suggestions are being generated by a language model, traditionally a compact, not very good language model, so it can run quickly and in very little memory in your keyboard application. If you go on Google and you start typing some stuff and it suggests completions of your query, again, that's being generated by a language model. So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling. This is how language models were built from about 1975 until effectively around 2012. We want to put probabilities on these sequences, and the way we're going to do it is by building what's called an n-gram language model. This means we're going to look at short word subsequences and use them to predict, where n is a variable describing how short the word sequences are that we use to predict. If we just look at the probabilities of individual words, we have a unigram language model. Probabilities of pairs of words give a bigram language model; probabilities of three words, trigram language models; and for more than three words they get called four-gram, five-gram, six-gram language models. For people with a classics education, this is horrific; in particular, not even these names are correct, because "gram" is a Greek root, so it should really have Greek numbers in front: you should have monograms and digrams. Actually, the first person who introduced the idea of n-gram models was Claude Shannon when he was working out information theory, the same guy who did cross entropy and all of that, and if you look at his 1951 paper, he uses "digrams". But that idea died about there, and this is what everyone says in practice. It's kind of cute, I like it, a nice practical notation. To build these models, the idea is that we're just going to count how often different n-grams appear in text and use those counts to build our probability estimates. In particular, our trick is that we make a Markov assumption, so that if we're predicting the next word based on a long context, we say: tell you what, we're not going to use all of it, we're only going to use the most recent n minus one words. So we have this big context and we throw most of it away.
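To write down what was just said, here are the chain-rule decomposition and the n-gram Markov approximation in a standard notation (a sketch of the usual formulation, not a quote from the slides):

```latex
% Chain rule: a model of next-word probabilities also scores whole texts
P(x_1, \dots, x_T) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_T \mid x_1, \dots, x_{T-1})
                   = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})

% n-gram (Markov) assumption: condition only on the preceding n-1 words
P(x_{t+1} \mid x_1, \dots, x_t) \approx P(x_{t+1} \mid x_{t-n+2}, \dots, x_t)
```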
And so if we're predicting word x t plus one based simply on the preceding n minus one words, we can make the prediction using n-gram counts. If we use n equals three, we'd have a trigram count in the numerator, normalized by a bigram count in the denominator, and that gives us relative frequencies of the different continuations. We can do that simply by counting how often n-grams occur in a large amount of text and dividing through by the counts, and that gives us a relative frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it. Okay. So suppose we're learning a four-gram language model, and we've got a piece of text: "as the proctor started the clock, the students opened their". To estimate things, we throw away all but the preceding three words, so we estimate based on "students opened their", and we work out the probabilities by looking at counts of "students opened their w" and counts of "students opened their". So we might have, in a corpus, that "students opened their" occurred a thousand times, and "students opened their books" occurred 400 times, so we'd say the probability estimate is simply 0.4 for "books". If "exams" occurred 100 times, the probability estimate is 0.1 for "exams". And while you can see that this is crude, it's not terrible, because if you are going to try to predict the next word in a simple way, the immediately prior words are the most helpful words to look at. But it's clearly primitive, because if you'd known that the prior text was "as the proctor started the clock", that makes it sound likely that the next word should be "exams", whereas estimating just from "students opened their", you'd be more likely to choose "books" because it's more common. So it's a crude estimate, but it's a decent enough place to start. It's a crude estimate that could be problematic in other ways too. Why else might we get into trouble by using this as our probability estimate? Yeah, there are two. There are a lot of n-grams: yes, there are a lot of words, and therefore a lot of n-grams. That's a problem; we'll come to it later. And maybe up the back: the word w might not even show up in the training data, so you might just have a count of zero for that. Yes. So if we're counting over any reasonable-size corpus, there are lots of words that we just are not going to have seen; they never happened to occur in the text that we counted over. If you start thinking about "students opened their", there are lots of things you could put there: students opened their accounts, or if the students are doing dissections in a biology class, maybe students opened their frogs, I don't know. There are lots of words that in some context would actually be possible, and lots of them we won't have seen, so we'd give them a probability estimate of zero. And that tends to be an especially bad thing to do with probabilities, because once we have a probability estimate of zero, any computation that involves it instantly goes to zero. So we have to deal with some of these problems.
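Here's a minimal sketch of that count-and-divide estimate for the four-gram case; the tiny corpus is hypothetical and absurdly small, just to show the mechanics and the zero-probability problem.

```python
from collections import Counter

# Toy relative-frequency estimates for a 4-gram model; a real corpus would
# be millions of words, not one sentence.
corpus = "as the proctor started the clock the students opened their books".split()

four_grams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
trigrams   = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def p_next(w, context):
    """P(w | context) = count(context + w) / count(context), context = 3 words."""
    if trigrams[context] == 0:
        return 0.0   # never saw the context at all: the raw estimate is undefined
    return four_grams[context + (w,)] / trigrams[context]

print(p_next("books", ("students", "opened", "their")))   # 1.0 in this tiny corpus
print(p_next("exams", ("students", "opened", "their")))   # 0.0: the sparsity problem
```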
For that sparsity problem, where the word never occurred in the numerator and so we simply get a probability estimate of zero, the way it was dealt with was that people just hacked the counts a little to make them non-zero. There are lots of ways that were explored, but the easiest is that you just add a little delta, like 0.25, to the counts. So things that you never saw get a count of 0.25 in total, things you saw once get a count of 1.25, and then there are no zeros anymore; everything is possible. Then there's a second problem: wait, you might never have seen "students opened their" before at all, so your denominator is just undefined, and you don't have any counts in the numerator either. You need to do something different there, and the standard trick was back-off. If you couldn't estimate words coming after "students opened their", you just worked out the estimates for words coming after "opened their", and if you couldn't estimate that, you used the estimate of words coming after "their". You use less and less context until you can get an estimate you can use. But something to note is that we've got conflicting pressures now. On the one hand, if you want to come up with a better estimate, you would like to use more context, a larger n-gram. On the other hand, as you use more and more conditioning words, the storage problem someone mentioned gets worse and worse, because the number of n-grams you have to know about goes up exponentially with the size of the context, and your sparseness problems get way, way worse; you're almost necessarily going to end up seeing zeros. Because of that, in practice things tended to max out around five. Occasionally people used six-grams and seven-grams, but most of the time, between the sparseness and the cost of storage, five-grams were the largest thing people dealt with. A famous resource from back in the 2000s that Google released was Google n-grams, which was built on a trillion-word web corpus and had counts of n-grams up to n equals five, and that is where they stopped. Okay. We've sort of already said the storage problem: to do this, you need to store these counts, and the number of counts goes up exponentially in the context size. But you know what's good about n-gram language models? They're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend. All you have to do is start storing these counts for n-grams, and you can use them to predict things. At least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Well, you have to write the software, okay, a few minutes to write the software, but building the model takes seconds, because there's no training of a neural network; all you do is count how often n-grams occur. Once you've done that, you can run an n-gram language model to generate text. We could do text generation before ChatGPT, right? So if I have a trigram language model, I can start off with some words, "today the", and I can look at my stored n-grams and get a probability distribution over next words, and here they are.
Note the strong patterning of these probabilities, because remember, they're all derived from counts that are being normalized. So really, these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context. They're in some sense crude when you look at them more carefully. But what we can do at this point is roll a die, get a random number from zero to one, and use that to sample from this distribution. So we sample from this distribution, and if we generate, as our random number, something like 0.35, going down from the top we'd say, okay, we've sampled the word "price": "today the price". Then we repeat: we condition on that, get a probability distribution over the next word, generate a random number and use it to sample from the distribution, say we generate 0.2, and so we choose "of". We condition on that, get a probability distribution, generate a random number, 0.5 or something, and we get "gold" coming out. So we have "today the price of gold", and we can keep on doing this and generate some text. Here's some text generated from 2 million words of training data using a trigram language model: "today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks, sept 30 end primary 76 cts a share." Now, okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad. It's surprisingly grammatical. In particular, I lowercased everything, so this "imf" that should be capitalized is the IMF, the International Monetary Fund. There are big pieces of this that even make sense: "the bank intervened just after it considered and rejected an imf demand" is pretty much making sense as a piece of text. So it's mostly grammatical; it looks like English text. But it makes no sense overall; it's really incoherent. So there's work to do. But what you could already see, even with these simple n-gram models, is that from a very low level you could approach how text and human language work from below. And I could easily make this better even with an n-gram language model: rather than 2 million words of text, if I trained on 10 million words it'd be better; if, rather than a trigram model, I went to a four-gram model, it'd be better; and you'd start getting better and better approximations of text. And this is essentially what people did until about 2012. Really, the same story people tell today, that scale will solve everything, is exactly the story people used to tell in the early 2010s with these n-gram language models. If you weren't getting good enough results with your 10 million words of text and a trigram language model, the answer was that with 100 million words of text and a four-gram language model you'd do better, and then with a trillion words of text and a five-gram language model you'd do better, and gee, wouldn't it be good if we could collect 10 trillion words of text so we could train an even better n-gram language model? Same strategy.
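To make the sampling loop concrete, here's a minimal sketch, assuming a hypothetical helper p_next_distribution that returns the normalized trigram counts for a given two-word context, as in the model sketched earlier. Each pass through the loop rolls the die once and appends the sampled word, exactly the roll-forward just described.

```python
import random

# Sketch of the roll-out for a trigram model; p_next_distribution is a
# hypothetical helper returning {word: probability} for a two-word context.
def generate(p_next_distribution, seed=("today", "the"), max_words=20):
    words = list(seed)
    for _ in range(max_words):
        dist = p_next_distribution(tuple(words[-2:]))      # condition on the last 2 words
        candidates, probs = zip(*dist.items())
        words.append(random.choices(candidates, weights=probs, k=1)[0])  # roll the die
    return " ".join(words)
```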
But it turns out that sometimes you can do better with better models as well as simply with scale. And so things got reinvented, and people started building neural language models. So how can we build a neural language model? We've got the same task: we have a sequence of words, and we want to put a probability estimate over what word comes next. The simplest way you could do that, which hopefully you'll all have thought of because it connects to what we did in the earlier classes, is this: we already had the idea that we could represent a context by the concatenation of some word vectors, put that into a neural network, and use it to predict something. In the example I did in the last couple of classes, what we used it to predict was whether the center word was a location or not, just a binary choice. But that's not the only thing we could predict. We could have predicted lots of things with this neural network: whether the piece of text was positive or negative, whether it was written in English or Japanese, lots of things. So one thing we could choose to predict is what word is going to come next after this window of text. We'd have a model just like this one, except that up the top, instead of doing a binary classification, we'd do a many, many way classification over what the next word in the text is going to be. And that would give us a neural language model, in particular a fixed-window neural language model, where we do the same Markov-assumption trick of throwing away the further-back context. So for the fixed window, we use word embeddings, which we concatenate, we put that through a hidden layer, and then we take the output of that hidden layer, multiply it by another matrix, put that through a softmax, and get an output distribution. So this gives us a fixed-window neural language model. Apart from the fact that we're now doing a classification over many, many, many classes, this is exactly like what we did last week, so it should look kind of familiar. It's also kind of like what you're doing for assignment two. This is essentially the first kind of neural language model that was proposed: Yoshua Bengio, right at the beginning of the twenty-first century, suggested that rather than using an n-gram language model, you could use a fixed-window neural language model. And even at that point, he and colleagues were able to get some positive results from this model. But at the time it wasn't widely noticed; it didn't really take off that much, for a combination of reasons. When it was only a fixed window, it was not that different from n-grams in some sense, even though the neural network could give better generalization than using counts. In practice, neural nets were still hard to run without GPUs, and people felt, and I think in general this was the case, that you could get more by doing the scale story and collecting your n-gram counts over hundreds of billions of words of text, rather than trying to make a neural network out of it. And so it didn't especially take off at that time. But in principle, it seemed a nice thing.
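As a rough sketch of that architecture, not Bengio's exact model but the shape of it, with hypothetical sizes:

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Sketch of a fixed-window neural language model (hypothetical sizes)."""
    def __init__(self, vocab_size, d_embed=100, window=4, d_hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.hidden = nn.Linear(window * d_embed, d_hidden)  # acts on the concatenated window
        self.out = nn.Linear(d_hidden, vocab_size)           # scores over the whole vocabulary

    def forward(self, window_ids):                    # (batch, window) word indices
        e = self.embed(window_ids).flatten(1)         # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))                # h = tanh(W e + b1)
        return torch.log_softmax(self.out(h), dim=-1) # log P(next word | window)
```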
It got rid of the sparsity problem. It got rid of the storage costs: you no longer have to store all observed n-grams, you just have to store the parameters of your neural network. But it didn't solve all the problems we'd like to solve. In particular, we still have the Markov assumption that we're just using a small fixed context to predict from, and there are disadvantages to enlarging that window; there's no fixed window that's ever big enough. There's another thing that, if you look technically at this model, might make you suspicious of it: when we have words in different positions, those words are treated by completely different sub-parts of this matrix W. You might think that, for predicting that "books" comes next, the fact that there is a student here is important, but it doesn't matter so much exactly where the word "students" occurs. The context could have been "the students slowly opened their", and it's still the same students; we've just got a slightly different linguistic structure. But this W matrix would be using completely separate parameters to learn stuff about "students" here versus "students" in this other position. That seems inefficient and wrong. And so that suggested that we need a different kind of neural architecture, one that can process any length of input and can use the same parameters to say: hey, I saw the word "students", that's evidence that things like books, exams, homework will be turning up, regardless of where it occurs. And that led to the exploration of a different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do: is everyone basically okay with what a language model is? Yeah, no questions. Okay, recurrent neural networks. Recurrent neural networks are a different family of neural networks. Effectively, in this class, we see several neural network architectures. In some sense, the first architecture we saw was word2vec, which is a very simple encoder-decoder architecture. The second family we saw was feed-forward networks, classic fully connected neural networks. The third family we're going to see is recurrent neural networks, which come in different kinds. And then we'll go on to transformer models. Okay. The idea of a recurrent neural network is that you've got one set of weights that is applied at successive moments in time, at successive positions in the text, and as you do that, you update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it. We've got "the students opened their", and we want to predict with that. I've still got four words in my example so I can fit stuff down the left side of the slide, but there could have been 24 words, because recurrent neural networks can deal with any length of context. As before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before. But now, to compute probabilities for the next word, we're going to do something different. Our hidden layer is going to be recurrent, and by recurrent, I mean we're going to change a hidden state at each time step as we proceed through the text from left to right.
We're going to start off with an h0, the initial hidden state, which can actually just be all zeros. Then at each time step, we multiply the previous hidden state by a weight matrix, we take the word embedding and multiply it by another weight matrix, and we sum the results of those two things, and that gives us a new hidden state. So the hidden state stores a memory of everything that's been seen so far. Then we continue along: we multiply the next word vector by the same weight matrix We, we multiply the previous hidden state by the same weight matrix Wh, we add them together and get a new representation. I've left a bit out: commonly there are two other things you do. You add on a bias term, because we usually separate out a bias term, and you put things through a nonlinearity. And I should make sure I mention that for recurrent neural networks, most commonly this nonlinearity has actually been the tanh function, so it's balanced on the positive and negative sides. You keep on doing that at each step. The idea is that once we've gotten here, this h4 hidden state has in some sense read the text up until now; it's seen all of "the students opened their". And if the word "students" had occurred in any of these positions, it would have been multiplied by the same We matrix and added into the hidden state. So it's got a cleaner, low-parameter way of incorporating the information that it's seen. So now I want to predict the next word, and to predict the next word I do, based on the final hidden state, the same kind of thing I did before: I multiply that hidden state by a matrix, add another bias, stick that through a softmax, and use that. The softmax gives me a language model, a probability distribution over all next words, and I can sample from it to generate a next word that makes sense. Okay? Recurrent neural networks. So with recurrent neural networks, we can now process any length of preceding context, and we just put more and more stuff into our hidden state. Our computation can use information from many steps back. Our model size doesn't increase for a longer context: we have to do more computation for a long context, but our representation of that long context remains this fixed-size vector h of whatever dimension it is. So there's no exponential blowup anymore. And the same weights are applied at every time step, so there's a symmetry in how inputs are processed. There are some catches. The biggest catch in practice is that recurrent computation is slow. For the feed-forward layer, we just had our input vector, we multiplied it by a matrix, we multiplied it by a matrix however many times, and at the end we were done. Whereas here we're stuck with sequentiality: you have to compute one hidden vector at a time. In fact, this goes against what I said at the beginning of class, because essentially here you're doing a for loop: you're going through for t equals one to T, generating each hidden vector in turn. And that's one of the big problems with RNNs that has led them to fall out of favor.
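Here's a minimal sketch of that forward computation, with hypothetical sizes. It follows the updates just described, h_t = tanh(Wh h_{t-1} + We e_t + b) and output scores U h_t + b2, and you can see the sequential for loop that makes RNNs slow.

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    """Sketch of the simple RNN language model just described (hypothetical sizes)."""
    def __init__(self, vocab_size, d_embed=100, d_hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.W_e = nn.Linear(d_embed, d_hidden, bias=False)  # applied to the word embedding
        self.W_h = nn.Linear(d_hidden, d_hidden)             # applied to the previous hidden state (+ bias)
        self.U = nn.Linear(d_hidden, vocab_size)              # output layer

    def forward(self, ids):                                   # ids: (batch, seq_len)
        e = self.embed(ids)                                   # (batch, seq_len, d_embed)
        h = torch.zeros(ids.size(0), self.W_h.out_features)   # h_0 = all zeros
        scores = []
        for t in range(ids.size(1)):                          # the sequential loop that makes RNNs slow
            h = torch.tanh(self.W_h(h) + self.W_e(e[:, t]))   # h_t = tanh(W_h h_{t-1} + W_e e_t + b)
            scores.append(self.U(h))                          # unnormalized scores for the next word
        return torch.stack(scores, dim=1)                     # (batch, seq_len, vocab); softmax applied later
```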
There's another problem, which we'll look at more: in theory this is perfect, you're just incorporating all of the past context into your hidden vector. In practice it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words you saw recently that dominate the hidden state. In some sense that's right, because the recent stuff is the most important stuff, freshest in your mind; it's the same with human beings, who tend to forget stuff from further back as well. But RNNs, especially in the simple form I've just explained, forget stuff from further back rather too quickly. We'll come back to that in Thursday's class. Okay. So for training an RNN language model, the starting point is that we get a big corpus of text again, and then we compute, for each time step, a predicted probability distribution over next words, and then there's an actual next word, and we use that as the basis of our loss. So our loss function is the cross entropy between the predicted probability distribution and the actual next word that we saw, which, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally, you'd like to predict the actual next word with probability one, which means the negative log of one would be zero and there'd be no loss. In practice, if you give it an estimate of 0.5, there's only a little bit of loss, and so on. To get our overall objective function, we work out the average loss, the average negative log likelihood of predicting each word in turn. Showing that as pictures: if our corpus is "the students opened their exams", we first try to predict what comes after "the", and we predict some words with different probabilities. Then we say, oh, the actual next word is "students"; okay, you gave that a probability of 0.05, say, because all we knew was that the first word was "the". There's a loss for that: the negative log of the probability given to "students". We then go on and generate the probability estimate over the next words, and we say, well, the actual word is "opened", what probability estimate did you give to that? We get a negative log probability loss. We keep running this along, then we sum all of those losses and average them per word, and that's our average per-word loss, which we want to make as small as possible. And so that's our training mechanism. It's important to notice that for generating this loss, we're not doing free generation. We're not just saying to the model, go off and generate a sentence. What we're actually doing at each step is effectively saying: okay, the prefix is "the students opened", what probability distribution do you put on next words after that? We generate it with our recurrent neural network, and then ask: for the actual next word, what probability estimate did you give to "their"? And that's our loss. But then what we do is stick "their", the right answer, into our recurrent neural network. So we always go back to the right answer, generate a probability distribution for next words, and then ask: okay, what probability did you give to the actual next word, "exams"? And then again, we use the actual next word.
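A minimal sketch of that per-word loss with teacher forcing, assuming a model shaped like the RNN sketch above; F.cross_entropy applies the softmax and the negative log likelihood in one go and averages over positions.

```python
import torch.nn.functional as F

# Per-word cross-entropy with teacher forcing, assuming `ids` is a
# (batch, seq_len) tensor of token indices from the training corpus.
def lm_loss(model, ids):
    inputs, targets = ids[:, :-1], ids[:, 1:]     # predict each actual next word from the true prefix
    scores = model(inputs)                        # (batch, seq_len - 1, vocab)
    # cross_entropy = softmax + negative log likelihood of the actual next
    # word, averaged over every position in the batch.
    return F.cross_entropy(scores.reshape(-1, scores.size(-1)), targets.reshape(-1))

# One update on a segment is then roughly:
#   loss = lm_loss(model, segment); loss.backward(); optimizer.step(); optimizer.zero_grad()
```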
So we do one step of prediction, then we pull it back to what was actually in the text, then we ask for guesses over the next word, and repeat. The fact that we don't do free generation, but pull it back to the actual piece of text each time, makes things simple, because we know what the actual author used for the next word. That process is called teacher forcing, and the most common way to train language models is using this kind of teacher forcing method. It's not perfect in all respects, because we're not exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" on some human-generated piece of text. Okay, so that's how we get losses. And then, as before, we want to use these losses to update the parameters of the neural network. How do we do that? Well, in principle, we just have all the text we've collected, which you could think of as one really long sequence: okay, we've got a billion words of text, here it is. So in theory, you could just run your recurrent neural network over your billion words of text, updating the hidden state as you go. But that would make it very difficult to train a model, because you'd be accumulating losses for a billion steps, and you'd have to store them and store the hidden states so you could update parameters, and it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, run our recurrent neural network over those segments, compute a loss for each segment, and update the parameters of the recurrent neural network based on the losses we found for that segment. I describe it here as the segments being sentences or documents, which seems a linguistically nice thing. It turns out that in recent practice, when you're wanting to scale most efficiently on GPUs, people don't bother with those linguistic niceties; they just say a segment is 100 words, and cut every 100 words. The reason that's really convenient is that you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, do vectorized training more efficiently, and things go great for you. Okay, but there are still a few more things we need to know to get things to work great for you. I'll try to get a bit more through this before today ends. We need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network. The interesting case here is these Wh parameters, which are used everywhere through the network, at each step, as are the We ones; they appear in many places in the network. So how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? And the answer is: oh, it's really simple. You can just pretend that those Wh's at each position are different and work out the partials with respect to them at one position, and then to get the partials with respect to Wh, you sum whatever you found over the different positions. So the gradient with respect to a repeated weight is the sum of the gradients with respect to each time it appears.
And the reason why that is follows from what I talked about in lecture three; or you can also think about it in terms of what you might remember about multivariable chain rules. The way I introduced it in lecture three is that gradients sum at outward branches. What you can think of in a case like this is that you've got a Wh matrix which is being copied, by identity, to W1, W2, W3, W4, etcetera, at each time step. Since those are identity copies, they have a partial derivative with respect to each other of one. And so we apply the multivariable chain rule to these copies: we've got an outward-branching node, and you just sum the gradients from each time step to get the total gradient for the matrix. Okay. There's one other trick that's perhaps worth knowing. If you've got segments that are 100 words long, a common speed-up is to say: maybe we don't actually have to run backpropagation for 100 time steps; maybe we could just run it for 20 time steps and stop. That's referred to as truncated backpropagation through time, and in practice it tends to be sufficient. Note in particular that on the forward pass you're still updating your hidden state using your full context; it's just the backpropagation that you cut short to speed up training. Okay. So just as I did before with an n-gram language model, we can use an RNN language model to generate text. It's pretty much the same idea, except that now, rather than using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. So I can start with the initial hidden state, and I can use the start-of-sequence symbol. In the example I had before, I started immediately with "the", hoping to be less confusing the first time, but what you should have asked is: wait a minute, where did "the" come from? So normally, what we actually do is use a special start-of-sequence symbol, this angle-bracketed <s>, and we feed it in as a pseudo-word which has a word embedding. And then, based on this, we generate the first word of the text: we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we do is take that word we generated and copy it down as the next input, then run the next stage of the neural network, sample the next word from the probability distribution, copy that down as the next word of the input, and keep on generating. This is referred to as a rollout: you keep rolling the dice and generating forward, and generate a piece of text. Normally you want to stop at some point, and the way we can stop is to have a second special symbol, the angle-bracketed </s>, which marks the end of the sequence. So when we generate an end-of-sequence symbol, we stop. Using this, we can generate pieces of text. And essentially, this is exactly what's happening if you use something like ChatGPT: the model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process of generating a word at a time, treating it as an input, and generating the next word, generating this sort of rollout.
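Here's a minimal sketch of that rollout, assuming a model shaped like the sketches above and hypothetical token ids for the <s> and </s> symbols. It naively re-runs the whole prefix at every step, which is inefficient but fine for a sketch.

```python
import torch

# Rollout sketch: sample a word, feed it back in as input, repeat until </s>.
@torch.no_grad()
def rollout(model, bos_id, eos_id, max_len=50):
    ids = torch.tensor([[bos_id]])                          # start with the <s> pseudo-word
    for _ in range(max_len):
        probs = torch.softmax(model(ids)[:, -1], dim=-1)    # distribution over the next word
        next_id = torch.multinomial(probs, num_samples=1)   # sample, so repeated runs differ
        ids = torch.cat([ids, next_id], dim=1)              # feed the sampled word back in as input
        if next_id.item() == eos_id:                        # stop once </s> is generated
            break
    return ids[0].tolist()
```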
And it's done probabilistically, so if you do it multiple times, you can get different answers. We haven't yet gotten to ChatGPT, but we can have a little bit of fun. You can take this simple recurrent neural network that we've just built, train it on any piece of text, and get it to generate stuff. So, for example, I can train it on Barack Obama's speeches. That's a small corpus; he didn't talk that much, so I've only got a few hundred thousand words of text, not a huge corpus. I'll just show this and then I can answer the question. I can generate from it, and I get something like: "The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done." Yeah, well, maybe that's slightly better than my n-gram language model. Still not perfect, you might say, but somewhat better, maybe. Did you have a question? Yeah: for the truncated setting of the corpus, does that impose some kind of limitation on how much we can produce and still have some coherency? Yeah, so I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context that we're going to use. That's a fair amount; 100 words is typically several sentences. But to the extent that you wanted to know even more about the further-back context, you wouldn't be able to. And certainly that's one of the ways in which modern large language models differ: they use far bigger contexts than that, thousands of words of prior context. Yeah, absolutely, it's a limit on how much further-back context you can use. So in some sense, even though in theory a recurrent neural network can feed in an arbitrary-length context, as soon as I say, oh, practically we cut it into segments, that actually means we are making a Markov assumption again, and we're saying the further-back context doesn't matter. Okay, a couple more examples. Instead of Barack Obama, I can feed in Harry Potter, which is a somewhat bigger corpus of text actually, and generate from that. And I get: "'Sorry,' Harry shouted, panicking. 'I'll leave those brooms in London, are they?' 'No idea,' said Nearly Headless Nick, casting low close by Cedric, carrying the last bit of treacle charms from Harry's shoulder, and to answer him the common room perched upon it, four arms held a shining knob from when the spider hadn't felt it seemed. He reached the teams too." Well, there you go. You can do other things as well, so you can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try to cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate ranch barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings. Two tablespoons of parmesan cheese, chopped; one cup of coconut milk; three eggs, beaten. Place each pasture over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot in bodied fresh mustard, orange and cheese. Combine the cheese and salt together, the dough in a large skillet and the ingredients, and stir in the chocolate and pepper."
Yeah, it's not exactly a very consistent recipe, but when it comes down to it, it has the language of a recipe. Maybe if I had scaled it more and had a bigger corpus it would have done a bit better, but it's definitely not keeping track of what the ingredients are. Let's see, it's almost time for today, so maybe about all I can do is one more fun example, and then after that, yeah, I'll probably pick things up at the start next time. As a variant of building RNN language models: so far we've been building them over words, so the token time steps over which we build them are words. Actually, you can use the idea of recurrent neural networks over any other kind of unit, and people have used them for other things. People have used them in bioinformatics, for things like DNA, for gene sequencing or protein sequencing and things like that. But even staying with language, instead of building them over words, you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. That can sometimes be useful, because it lets us generate things that look like words and perhaps have the structure of English words. Similarly, there are other things you can do. Before, when I initialized the hidden state, I said you just have an initial hidden state, and you can make it zeros if you want. Well, sometimes we build a contextual RNN, where we initialize the hidden state with something else. In particular, I can initialize the hidden state with the RGB values of a color, and then generate, a character at a time, the names of paint colors. I can train a model on a paint company's catalog of color names and their RGB values, and then I can give it different paint colors and it'll come up with names for them. And it actually does an excellent job; this one worked really well. Look at this: this one here is "ghasty pink", "power gray", "navel tan", "bock coe white", "horble gray", "homestar brown". Now, couldn't you just imagine finding all of these in a paint catalog? There are some really good ones over here in the bottom right: this color here is "dope", and then "stoner blue", "burple", "simp", "stanky bean" and "turdly". Now, I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today and do more of the science of neural networks next time.