Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
Lecture 5 of Stanford's CS224N course first covers a few additional neural network concepts and then introduces the language modeling task in NLP. It then explains recurrent neural networks (RNNs) in detail as one way to build language models, notes their use in an upcoming assignment, and points out that RNNs are not the only way to build language models, previewing the Transformer models to come. The lecture also discusses problems with RNNs.
Before diving into the technical content, the instructor briefly reviews the composition of the class and emphasizes how enormous modern neural networks, especially language models, have become, with parameter counts now exceeding 100 billion.
The lecture then reviews the history of neural networks: in the 1980s-90s the backpropagation algorithm existed, but networks typically had only a single hidden layer, because training deeper networks was very difficult at the time, and the field stagnated for about 15 years. The deep learning revival began in the late 2000s and the 2010s, when a handful of key technical improvements (such as better regularization methods) made training deep neural networks feasible and showed that they far outperform shallow networks.
Regularization is one important ingredient. The lecture contrasts the classic view of regularization with the modern view for large neural networks: classically, regularization was meant to prevent overfitting (the model fits the training data well but generalizes poorly to new data, with validation error starting to rise after a certain amount of training). The modern view is that for networks with a huge number of parameters, provided regularization is done well, validation error can keep decreasing even as the model reaches near-zero training error (essentially memorizing the entire training set), meaning the model still generalizes well. This challenges the traditional notion that overfitting is a disaster.
Tags
Media details
- Upload date
- 2025-05-15 21:12
- Source
- https://www.youtube.com/watch?v=fyc0Jzr74y4
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-05-06
Transcript
speaker 1: Okay, let me get started for today. So for today, first of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including actually a couple of the concepts that turn up in assignment two. Then the bulk of today is going to be moving on to introducing what language models are. And then after introducing language models, we're going to introduce a new kind of neural network, which is one way to build language models, which is recurrent neural networks. They're an important thing to know about, and we use them in assignment three, but they're certainly not the only way to build language models. In fact, probably a lot of you already know that there's this other kind of neural network called transformers, and we'll get on to those after we've done recurrent neural nets. We'll talk a bit about problems with recurrent neural networks. And, well, if I have time, I'll get on to the recap. Before getting into the content of the class, I thought I could just spend a minute on giving you the stats of who is in CS224N. Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess. So the four big groups: the computer science undergrads, the computer science grads, the undeclared undergrads, and the NDO grads — so this is a large portion of the SCPD students, some of whom are in effect computer science grads. So that makes up about 60% of the audience. And if you're not in one of those four big groups, you're in the other 40%. And everybody is somewhere. So there are lots of other interesting groups down here. So you know, the bright orange down here — that's where the math and physics PhDs are, up here. I mean, interestingly, we now have more statistics grad students than there are SymSys undergrads, which didn't used to be the way around in NLP classes. And you know, one of my favorite groups, the little magenta group down here — these are the humanities undergrads. Humanities undergrads. In terms of years, it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave frosh. Are any brave frosh here today? Yeah. Okay, welcome. Yeah, modern neural networks, especially language models, are enormous. This chart is sort of out of date because it only goes up to 2022, and it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate, they're clearly huge models which have over 100 billion parameters. And so large — and then deep, in terms of very many layers — neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models. But just for starting off with something simpler, I did want to kind of key you in for a few minutes into a little bit of history, right? So the last time neural nets were popular was in the eighties and nineties, and that was when people worked out the backpropagation algorithm. Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at, and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer. You had the input, the hidden layer, and the output, and that's all there was.
And the reason for that was, for a very, very long time, people couldn't really get things to work with more hidden layers. So that only started to change in the resurgence of what often got called deep learning — but anyway, the return to neural nets — which started around 2006. And this was one of the influential papers at the time, Greedy Layer-Wise Training of Deep Networks by Yoshua Bengio and colleagues. So right at the beginning of that paper, they observed the problem: "However, until recently it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better and often worse than neural networks with one or two hidden layers." Gerry Tesauro, who's cited there, was actually a faculty member who worked very early on autonomous driving with neural networks. "As this is a negative result, it has not been much reported in the machine learning literature." So really, you know, although people had neural networks and backpropagation and the recurrent neural networks we're going to talk about today, for a very long period of time — you know, 15 years or so — things seemed completely stuck. Although in theory it seemed like deep neural networks should be promising, in practice they didn't work. And so it really then took some new developments that happened in the late 2000s decade, and then more profoundly in the 2010s decade, to actually figure out how we could have deep neural networks that actually worked, working far better than the shallow neural networks, and leading into the networks that we have today. And you know, we're going to be starting to talk about some of those things in this class and in the coming classes. And I mean, I think, you know, the tendency when you see the things that got neural networks to work much better — the natural reaction is to sort of shrug and be underwhelmed and think, oh, is this all there is to it? This doesn't exactly seem like difficult science. And in some sense, that's right: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks of things turned things around from a field that was sort of stuck for 15 years, going nowhere, and which nearly everyone had abandoned because of that, to suddenly turning around, and there being the ability to train these deeper neural networks, which then behaved amazingly better as machine learning systems than other things that had preceded them and dominated for the intervening time. So that took a lot of time. So what are these things? One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. So regularization is the idea that beyond just having a loss that we want to minimize in terms of describing the data, we want to, in some other way, manipulate what parameters we learn so that our models work better. And so normally we have some more complex loss function that does some regularization. The most common way of doing this is what's called L2 regularization, where you add on this parameters-squared term at the end. And this regularization says, you know, it'd be kind of good to find a model with small parameter weights. So you should be finding the smallest parameter weights that will explain your data well. And there's a lot you can say about regularization. These kinds of losses get talked about a lot more in other classes like CS229, Machine Learning. And so I'm not going to say very much about it.
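For reference, the L2-regularized loss being described can be written as follows, where $J(\theta)$ is the data loss and $\lambda$ (a symbol not named in the lecture, introduced here for illustration) controls the regularization strength:

$$J_{\text{reg}}(\theta) \;=\; J(\theta) \;+\; \lambda \sum_{k} \theta_k^2$$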
This isn't a machine learning theory class, but I do just want to put in one note that's very relevant to what's happened in recent neural networks work. So the classic view of regularization was that we needed this kind of regularization to prevent our networks from overfitting, meaning that they would do a very good job at modeling the training data, but then they would generalize badly to new data they were shown. And so the picture that you got shown was this: as you train on some training data, your error necessarily goes down. However, after some point, you start learning specific properties of things that happen to turn up in those training examples, and you're learning things that are only good for the training examples. And so they won't generalize well to different pieces of data you see at test time. So if you have a separate validation set or a final test set, and you trace out the error or loss on that validation or test set, after some point it would start to go up again. This is a quirk in my bad PowerPoint — it's just meant to go up. And the fact that it goes up means you have overfit your training data. And making the parameters numerically small is meant to lessen the extent to which you overfit on your training data. This is not a picture that modern neural network people believe at all. Instead, the picture has changed like this. We don't really believe that overfitting, in that sense, exists anymore. But what we are concerned about is models that will generalize well to different data. So in classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate those parameters well, and so you'd just have all of this noisy mess. But what's actually been found is that, yeah, you can't estimate the numbers well, but what you get is a kind of interesting averaging function from all these myriad numbers. And if you do it right, what happens is that as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only will your training loss continue to go down, very infinitesimally, but your validation loss will go down as well. And so on huge networks these days, we train our models so that they overfit the training data almost completely, so that if you train a huge network now on a training set, you can essentially train it to get zero loss — you know, maybe it's 0.000007 loss or something, but you can train it to get essentially zero loss, because you've got such rich models that you can perfectly fit, memorize, the entire training set. Now, classically, that would have been seen as a disaster because you've overfit the training data. With modern large neural networks, it's not seen as a disaster, because providing you've done regularization well, your model will also generalize well to different data. However, the flip side of that is that normally this kind of L2 regularization, or similar ones like L1 regularization, isn't strong enough regularization to achieve that effect. And so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout. So this is one of the things that's on the assignment. And at this point I should apologize or something, because the way dropout is presented here is sort of the original formulation.
The way dropout is presented on the assignment is the way it's now normally done in deep learning packages. So there are a couple of details that vary a bit. Let me just present the main idea here and not worry too much about the details of the math. So the idea of dropout is, at training time, every time you are doing a piece of training with an example, what you're going to do, inside the middle layers of the neural network, is just throw away some of the inputs. And technically, the way you do this is you have a random mask of zeros and ones that you sample each time, and you do a Hadamard product of that with the data. So some of the data items go to zero, and you have different masks each time. So for the next example, I've now masked out something different. And so you're just randomly throwing away the inputs. And the effect of this is that you're training the model so that it has to be robust and work well and make as much use of every input as it can. It can't decide that it can be extremely reliant on, say, component 17 of the vector, because sometimes that's just going to randomly disappear. So if there are other features that you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time, you randomly delete things. At test time, sort of for efficiency but also for quality of the answer, you don't delete anything. You keep all of your weights, but you just rescale things to make up for the fact that you used to be dropping things. Okay. So there are several ways that you can think of explaining this. One motivation that's often given is that this prevents feature co-adaptation. So rather than a model being able to learn complex functions — features seven, eight, and eleven together can help me predict this — it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is that there's been a lot of work on model ensembles, where you can mix together different models and improve your results. If you're training with dropout, it's kind of like you're training with a huge model ensemble, because you're training with the ensemble of the power set — the exponential number of every possible dropout of features — all at once. And that gives you a very good model. So there are different ways of thinking about it. I mean, if you've seen naive Bayes and logistic regression models before, I kind of think a nice way to think of it is that it gives us sort of a middle ground between the two, because for naive Bayes models you're weighting each feature independently, just based on the data statistics — it doesn't matter what other features are there — whereas in logistic regression, weights are set in the context of all the other features. And with dropout, you're somewhere in between: you're setting the weights in the context of some of the other features, but different ones will disappear at different times. But you know, following work that was done at Stanford by Stefan Wager and others, generally these days people regard dropout as a form of feature-dependent regularization, and they show some theoretical results as to why you'd think of it that way. Okay, I think we've implicitly seen this one, but vectorization is the idea: no for loops — always use vectors, matrices, and tensors, right? The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices, and tensors.
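As a concrete illustration, here is a minimal NumPy sketch of the "inverted dropout" variant that most deep learning packages (and the assignment) use, where the surviving units are rescaled at training time so nothing needs to change at test time; the function name, shapes, and drop probability are made up for illustration.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    """Apply (inverted) dropout to a batch of hidden activations h.

    At train time, each unit is zeroed with probability p_drop (via a random
    0/1 mask and an elementwise product) and the survivors are rescaled by
    1 / (1 - p_drop); at test time the input is returned unchanged.
    """
    if not train or p_drop == 0.0:
        return h
    mask = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

# Example: a batch of 4 hidden vectors of size 6
h = np.random.randn(4, 6)
h_train = dropout_forward(h, p_drop=0.5, train=True)   # roughly half the entries zeroed
h_test = dropout_forward(h, p_drop=0.5, train=False)   # identical to h
```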
So you know, if you're writing for loops in any language, but especially in Python, things run really slowly. If you can do things with vectors and matrices, even on CPU, things run at least an order of magnitude faster. And what everyone really wants to move to doing in deep learning is running things on GPUs, or sometimes now neural processing units, and then you're getting, you know, two, three orders of magnitude of speedup. So do always think: I should be doing things with vectors and matrices. If I'm writing a for loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake, and I should be working out how to do things with vectors and matrices. And that applies to things like dropout: you don't want to write a for loop that goes through all the positions and sets some of them to zero; you want to be using a vector operation with your mask. Two more, I think. Parameter initialization. I mean, this one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to some random numbers. The reason for this is that if we just start with our matrices all zero, or some other constant, normally the case is that we have symmetry. So it's sort of like in this picture, when you're starting on this saddle point, that it's symmetric to the left and the right, and forward and backwards, and so you sort of don't know where to go and you might be stuck and stay in the one place. I mean, normally a way to think about it is that the operations that you're doing to all the elements in the matrix are sort of the same. So rather than having a whole vector of features, if all of them have the same value initially, it's often sort of like you only have one feature and you've just got a lot of copies of it. So to initialize learning and have things work well, we almost always want to set all the weights to very small random numbers. And when I say very small, we sort of want to make them in a range so that they don't disappear to zero if we make them a bit smaller, and they don't start blowing up into huge numbers when we multiply them by things. And doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods, with a basis in thinking about what happens once you do matrix multiplies, that people had worked out and often used. One of these was Xavier initialization, which was sort of working out what variance of your uniform distribution to be using, based on the number of inputs and outputs of a layer and things like that. The specifics of that, you know, I think we still use to initialize things in assignment two, but we'll see later that they go away, because people have come up with cleverer methods, in particular doing layer normalization, which sort of obviates the need to be so careful with the initialization. But you still need to initialize things to something. Okay, then the final one, which is also something that appears in the second assignment, that I just wanted to say a word about, was optimizers. So we talked about stochastic gradient descent in class and did the basic equations for stochastic gradient descent. And you know, to a first approximation, there's nothing wrong with stochastic gradient descent.
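For reference, here's a minimal NumPy sketch of the Xavier-style uniform initialization being described; the exact constant and whether a uniform or normal distribution is used vary by implementation, so treat this as illustrative rather than the assignment's exact recipe.

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Sample a weight matrix of shape (n_in, n_out) uniformly from [-r, r],
    with r = sqrt(6 / (n_in + n_out)), so that activations neither shrink
    toward zero nor blow up as they pass through the layer."""
    r = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-r, r, size=(n_in, n_out))

W = xavier_uniform(256, 128)   # e.g. a hidden layer mapping 256 -> 128 dims
b = np.zeros(128)              # biases can safely start at zero
```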
And if you fiddle around enough, you can usually get stochastic gradient descent to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, on having the right step size. And often you have to have a learning rate schedule with decreasing step sizes and various other complications. So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem kind of necessary to get them to learn well. And at any rate, they give you lots of margin of safety, since they're much less dependent on you setting the different hyperparameters right. And the idea of, well, all the methods I mention — the most commonly used methods — is that for each parameter, they're accumulating a measure of what the gradient has been in the past. So they've got some idea of the scale of the gradient, the slope, for a particular parameter, and then they're using that to decide how big a step to take for that parameter at each time step. So the simplest method that was come up with was this one called Adagrad. If you know John Duchi — he was one of the co-inventors of this. It's simple and nice enough, but it tends to stall too early. Then people came up with different methods. Adam is the one that's on assignment two; it's a really good, safe place to start. But in a way, our word vectors have a special property because of their sparseness: you're very sparsely updating them, because particular words only turn up occasionally. So people have actually come up with particular optimizers that have special properties for things like word vectors, and so these ones with a W at the end can sometimes be good to try. And then, again, there's a whole family of extra ideas that people have used to improve optimizers, and if you want to learn about that, you can go off and do an optimization class like Convex Optimization. There are ideas like momentum and Nesterov acceleration and things like that, and all of those things people also variously try to use. But Adam is a good name to remember, if you remember nothing else. Okay, that took longer than I hoped, but I'll get on now to language models. Okay, language models. So, in some sense, "language model" is just two English words, but when in NLP we say language model, it's a technical term that has a particular meaning. So the idea of a language model is something that can predict well what word is going to come next — or, more precisely, it's going to put a probability distribution over what words come next. So "the students opened their ..." — what words are likely to come next? Bags, laptops, notebooks? Notebooks — yes, I have some of those at least. Okay. Yeah. I mean, so, right. So these are kind of likely words, and if on top of those we put a probability on each one, then we have a language model. So formally, we've got a context of preceding items, and we're putting a probability distribution over the next item, which means that the sum of these estimates over items in the vocabulary will sum to one. And if we've defined a P like this that predicts probabilities of next words, that is called a language model, as it says here. An alternative way that you can think of a language model is that a language model is a system that assigns a probability to a piece of text. And so we can say that a language model can take any piece of text and give it a probability.
And the reason we can do that is that we can use the chain rule. So I want to know the probability of any stretch of text. I say, given my previous definition of a language model, easy: I can do that as the probability of x1 with a null preceding context, times the probability of x2 given x1, et cetera along — I can do this chain rule decomposition. And the terms of that decomposition are precisely what the language model, as I defined it previously, provides. So language models are this essential technology for NLP. Just about everywhere, from the simplest places forward, where people do things with human language and computers, people use language models. You know, they weren't something that got invented in 2022 with ChatGPT. Language models have been central to NLP at least since the eighties; the idea of them goes back to at least the fifties. So anytime you're typing on your phone and it's making suggestions of next words — regardless of whether you like those suggestions or not — those suggestions are being generated by a language model, traditionally a compact, not-very-good language model, so it can run quickly and with very little memory in your keyboard application. If you go on Google and you start typing some stuff and it's telling you stuff that could come after it to complete your query, well, again, that's being generated by a language model. So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling. So this is how language models were built from about 1975 until effectively around about 2012. We want to put probabilities on these sequences, and the way we're going to do it is we're going to build what's called an n-gram language model. And so this means we're going to look at short word subsequences and use them to predict. So n is a variable describing how short the word sequences are that we're going to use to predict. So if we just look at the probabilities of individual words, we have a unigram language model. If we look at probabilities of pairs of words, a bigram language model; probabilities of three words, trigram language models; probabilities of more than three words, those get called four-gram language models, five-gram language models, six-gram language models. So for people with a classics education, this is horrific. In particular, not even these ones are correct, because "gram" is a Greek root, so it should really have Greek numbers in front here — you should have monograms and digrams. And actually, the first person who introduced the idea of n-gram models was Claude Shannon when he was working out information theory — the same guy that did cross entropy and all of that. And if you look at his 1951 paper, he uses "digrams", but that usage died about there, and for everyone else this is what people say in practice. It's kind of cute. I like it. A nice, you know, practical notation. So to build these models, the idea is, look, we're just going to count how often different n-grams appear in text and use those to build our probability estimates. And in particular, our trick here is that we make a Markov assumption, so that if we're predicting the next word based on a long context, we say: tell you what, we're not going to use all of it; we're only going to use the most recent n minus one words. So we have this big context and we throw most of it away.
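Written out, in notation consistent with the lecture's description (where $x^{(1)},\dots,x^{(T)}$ is the word sequence), the chain rule decomposition and the n-gram Markov assumption with its count-based estimate are:

$$P\big(x^{(1)},\dots,x^{(T)}\big) \;=\; \prod_{t=1}^{T} P\big(x^{(t)} \,\big|\, x^{(t-1)},\dots,x^{(1)}\big)$$

$$P\big(x^{(t+1)} \,\big|\, x^{(t)},\dots,x^{(1)}\big) \;\approx\; P\big(x^{(t+1)} \,\big|\, x^{(t)},\dots,x^{(t-n+2)}\big) \;\approx\; \frac{\operatorname{count}\big(x^{(t-n+2)},\dots,x^{(t)},x^{(t+1)}\big)}{\operatorname{count}\big(x^{(t-n+2)},\dots,x^{(t)}\big)}$$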
And so if we're predicting word x(t+1) based simply on the preceding n minus one words, well, then we can make the prediction using n-grams. Why? Whatever it is — if we use n equals three, we'd have a trigram count here, normalized by a bigram count down here. And that would give us relative frequencies of the different terms. So we can do that simply by counting how often n-grams occur in a large amount of text and simply dividing through by the counts. And that gives us a relative frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it. Okay. So suppose we're learning a four-gram language model, right? And we've got a piece of text: "as the proctor started the clock, the students opened their ...". So, to estimate things, we are going to throw away all but the preceding three words. So we're going to estimate based on "students opened their". And so we're going to work out the probabilities by looking for counts of "students opened their w" and counts of "students opened their". So we might have, in a corpus, that "students opened their" occurred a thousand times, and "students opened their books" occurred 400 times. So we'd say the probability estimate is simply 0.4 for "books". If "exams" occurred 100 times, the probability estimate is 0.1 for "exams". And while you can sort of see that this is bad, it's not terrible, because if you are going to try and predict the next word in a simple way, the immediately prior words are the most helpful words to look at. But it's clearly sort of primitive, because if you knew that the prior text was "as the proctor started the clock", that makes it sound likely that the word should have been "exams", whereas since you're estimating just based on "students opened their", you'd be more likely to choose "books" because it's more common. So it's a crude estimate, but it's a decent enough place to start. It's a crude estimate that could be problematic in other ways. I mean, why else might we get into trouble by using this as our probability estimate? Yeah — there are a lot of n-grams. Yeah, so there are a lot of words, and therefore there are a lot of n-grams. Yeah, so that's a problem; we'll come to it later. And maybe up the back — the word w might not even show up in the training data, so you might just have a count of zero for that. Yes, yeah. So if we're counting over any reasonable-size corpus, there are lots of words that we just are not going to have seen, right — they never happened to occur in the text that we counted over. So if you start thinking "students opened their ...", there are lots of things that you could put there: students opened their accounts; or if the students are doing dissections in a biology class, maybe students opened their frogs, I don't know. There are lots of words that in some context would actually be possible, and lots of them that we won't have seen. And so it gives them a probability estimate of zero. And that tends to be an especially bad thing to do with probabilities, because once we have a probability estimate of zero, any computations that we do that involve it will instantly go to zero. So we have to deal with some of these problems. The first is that sparsity problem: a word might never have occurred in the numerator, and so we simply get a probability estimate of zero.
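Concretely, with the counts from the lecture's example:

$$P(\text{books} \mid \text{students opened their}) \;=\; \frac{\operatorname{count}(\text{students opened their books})}{\operatorname{count}(\text{students opened their})} \;=\; \frac{400}{1000} \;=\; 0.4$$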
The way that sparsity problem was dealt with was that people just hacked the counts a little to make them non-zero. So there are lots of ways that were explored, but the easiest way is you just add a little delta, like 0.25, to the counts. So things that you never saw got a count of 0.25 in total, and things you saw once got a count of 1.25, and then there are no zeros anymore — everything is possible, you could think. Then there's a second problem: wait, you might never have seen "students opened their" itself before. And so that means your denominator is just undefined, and you don't have any counts in the numerator either. So you need to do something different there. And the standard trick that was used then was that you did backoff. So if you couldn't estimate words coming after "students opened their", you just worked out the estimates for words coming after "opened their". And if you couldn't estimate that, you just used the estimate of words coming after "their". So you use less and less context until you can get an estimate that you can use. But something to note is that we've got these conflicting pressures now. On the one hand, if you want to come up with a better estimate, you would like to use more context — have a larger n-gram. But on the other hand, as you use more and more conditioning words, the storage size problem someone mentioned gets worse and worse, because the number of n-grams that you have to know about is going up exponentially with the size of the context. But also your sparseness problems are getting way, way worse, and you're almost necessarily going to end up seeing zeros. And so because of that, in practice where things tended to max out was five. Occasionally people used six-grams and seven-grams, but most of the time, between the sparseness and the cost of storage, five-grams were the largest thing people dealt with. A famous resource from back in the two-thousands decade that Google released was Google n-grams, which was built on a trillion-word web corpus and had counts of n-grams. And it gave counts of n-grams up to n equals five, and that is where they stopped. Okay. Well, we sort of already said the storage problem: to do this, you need to store these counts, and the number of counts is going up exponentially in the amount of context size. Okay, but you know what's good about n-gram language models? They're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend. All you have to do is start storing these counts for n-grams, and you can use them to predict things. So, at least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Well, you have to write the software — okay, a few minutes to write the software — but building the model takes seconds, because there's no training of a neural network; all you do is count how often n-grams occur. And once you've done that, you can then run an n-gram language model to generate text. You know, we could do text generation before ChatGPT, right? So if I have a trigram language model, I can start off with some words, "today the", and I can look at my stored n-grams and get a probability distribution over next words. And here they are.
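To make this concrete, here's a minimal Python sketch of a count-based trigram model with the add-delta smoothing just described (backoff is omitted for brevity); the tiny corpus, function names, and delta value are made up for illustration.

```python
from collections import Counter

def train_trigram_lm(tokens, delta=0.25):
    """Count trigrams and their bigram prefixes over a token list."""
    tri, bi = Counter(), Counter()
    for i in range(len(tokens) - 2):
        tri[tuple(tokens[i:i + 3])] += 1
        bi[tuple(tokens[i:i + 2])] += 1
    vocab = sorted(set(tokens))
    return tri, bi, vocab, delta

def prob(w, context, model):
    """P(w | context) with add-delta smoothing; context is the last 2 words."""
    tri, bi, vocab, delta = model
    num = tri[(context[0], context[1], w)] + delta
    den = bi[(context[0], context[1])] + delta * len(vocab)
    return num / den

corpus = "the students opened their books the students opened their exams".split()
model = train_trigram_lm(corpus)
print(prob("books", ("opened", "their"), model))   # a smoothed relative frequency
```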
You know, note the strong patterning of these probabilities, because, remember, they're all derived from counts that are being normalized. So really, these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context, right? So they're in some sense crude when you look at them more carefully. But what we can do at this point is, you know, we roll a die and get a random number from zero to one, and we can use that to sample from this distribution. Yeah. So we sample from this distribution, and if we generate as our random number something like 0.35, going down from the top we'd say, okay, we've sampled the word "price": "today the price". And then we repeat over: we condition on that, we get a probability distribution over the next word, and we generate a random number and use it to sample from the distribution. Say we generate 0.2, and so we choose "of". We now condition on that, we get a probability distribution, we generate a random number, which is 0.5 or something, and so we get "gold" coming out. And we can say "today the price of gold", and we can keep on doing this and generate some text. And so here's some text generated from 2 million words of training data using a trigram language model: "today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks, sept. 30th end primary 76 cents a share." Now, okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad, right? It's sort of surprisingly grammatical. I mean, in particular, I lowercased everything, so this "imf" should be capitalized — it's the International Monetary Fund, right? There are big pieces of this that even make sense, right? "The bank intervened just after it considered and rejected an IMF demand" — that's pretty much making sense as a piece of text. So it's mostly grammatical; it looks like English text. I mean, it makes no sense, right? It's really incoherent. So there's work to do. But what you could already see is that even with these simple n-gram models, you could, from a very low level, kind of approach how text and human language work from below. And I could easily make this better even with the n-gram language model, because rather than 2 million words of text, if I trained on 10 million words of text, it'd be better; if then, rather than a trigram model, I could go to a four-gram model, it'd get better; and you'd start getting better and better approximations of text. And so this is essentially what people did until about 2012. And really, the same story that people tell today — that scale will solve everything — is exactly the same story that people used to tell in the early 2010s with these n-gram language models. If you weren't getting good enough results with your 10 million words of text and a trigram language model, the answer was that if you had 100 million words of text and a four-gram language model, you'd do better. And then if you had a trillion words of text and a five-gram language model, you'd do better. And gee, wouldn't it be good if we could collect 10 trillion words of text so we could train an even better n-gram language model? Same strategy.
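The "roll a die" step is just sampling from a categorical distribution; here's a tiny illustrative sketch (the candidate words and probabilities are invented, not the slide's actual numbers).

```python
import numpy as np

words = ["price", "company", "bank", "gold"]   # illustrative candidate next words
probs = np.array([0.35, 0.30, 0.20, 0.15])     # their LM probabilities (sum to 1)

# Draw u ~ Uniform[0, 1) and pick the word whose cumulative-probability
# interval contains u; np.random.choice does exactly this.
next_word = np.random.choice(words, p=probs)
print(next_word)
```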
But it turns out that sometimes you can do better with better models as well as with simple scale. And so things got reinvented and started again with building neural language models. So how can we build a neural language model? We've got the same task of having a sequence of words, and we want to put a probability estimate over what word comes next. And the simplest way you could do that, which hopefully you'll all have thought of, because it connects to what we did in the earlier classes: look, we already had this idea that we could represent context by the concatenation of some word vectors, and we could put that into a neural network and use it to predict something. And in the example I did in the last couple of classes, what we used it to predict was: is the center word a location or not a location — just a binary choice. But that's not the only thing we could predict. We could have predicted lots of things with this neural network. We could have predicted whether the piece of text was positive or negative. We could have predicted whether it was written in English or Japanese. We could predict lots of things. So one thing we could choose to predict is what word is going to come next after this window of text. We'd have a model just like this one, apart from up the top: instead of doing this binary classification, we'd do a many, many-way classification over what is the next word that is going to appear in the piece of text. And that would then give us a neural language model. In particular, it gives us a fixed-window neural language model, so we do the same Markov-assumption trick of throwing away the further-back context. And so for the fixed window, we'll use word embeddings, which we concatenate; we'll put that through a hidden layer; and then we'll take the output of that hidden layer, multiply it by another layer, say, and then put that through a softmax and get an output distribution. And so this gives us a fixed-window neural language model. And apart from the fact that we're now doing a classification over many, many, many classes, this is exactly like what we did last week, so it should look kind of familiar. It's also kind of like what you're doing for assignment two. And this is essentially the first kind of neural language model that was proposed. So in particular, Yoshua Bengio, right at the beginning of the twenty-first century, suggested that you could do this — that rather than using an n-gram language model, you could use a fixed-window neural language model. And even at that point, he and colleagues were able to get some positive results from this model. But at the time, it wasn't widely noticed; it didn't really take off that much. And it was for a combination of reasons. When it was only a fixed window, it was sort of not that different to n-grams in some sense, even though it could be argued that the neural network gives better generalization than using counts. I mean, in practice, neural nets were still hard to run without GPUs. And people felt — and I think in general this was the case — that you could get more out of doing the scale story and collecting your n-gram counts on hundreds of billions of words of text rather than trying to make a neural network out of it. And so it didn't really especially take off at that time. But in principle, it seemed a nice thing.
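A minimal PyTorch sketch in the spirit of this fixed-window neural language model — concatenate the window's embeddings, pass them through one hidden layer, and classify over the whole vocabulary. The class name, layer sizes, and vocabulary size are made up for illustration; this is not the exact architecture from the slides.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Predict the next word from a fixed window of preceding words."""

    def __init__(self, vocab_size, embed_dim=100, window=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):             # (batch, window) word indices
        e = self.embed(window_ids)             # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)             # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))         # hidden layer
        return self.out(h)                     # logits; softmax gives P(next word)

model = FixedWindowLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 4)))   # batch of 8 windows of 4 words
probs = torch.softmax(logits, dim=-1)              # output distribution over the vocabulary
```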
It got rid of the sparsity problem. It got rid of the storage costs: you no longer have to store all observed n-grams; you just have to store the parameters of your neural network. But it didn't solve all the problems that we'd like to solve. In particular, we still have this problem of the Markov assumption — that we're just using a small fixed context beforehand to predict from. And there are some disadvantages to enlarging that window, and there's no fixed window that's ever big enough. There's another thing that, if you look technically at this model, might make you suspicious of it, which is that when we have words in different positions, those words in different positions will be treated by completely different subparts of this matrix W. So you might think that, okay, for predicting that "books" comes next, the fact that this is a student is important, but it doesn't matter so much exactly where the word "students" occurs, right? The context could have been "the students slowly opened their", and it's still the same "students"; we've just got a bit different linguistic structure. Whereas this W matrix would be using completely separate parameters to learn stuff about "students" here versus "students" in this position. So that seems kind of inefficient and wrong. And so that suggested that we need a different kind of neural architecture that can process any length of input and can use the same parameters to say: hey, I saw the word "students" — that's evidence that things like books, exams, homework will be turning up — regardless of where it occurs. And so that then led to exploration of this different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do, is everyone basically okay with what a language model is? Yeah, no questions. Okay, recurrent neural networks. So recurrent neural networks are a different family of neural networks. Effectively, in this class, we see several neural network architectures. In some sense, the first architecture we saw was word2vec — a very simple encoder-decoder architecture. The second family we saw was feed-forward networks, or fully connected layer, classic neural networks. And the third family we're going to see is recurrent neural networks, which come in different kinds. And then we'll go on to transformer models. Okay. So the idea of a recurrent neural network is that you've got one set of weights that is going to be applied through successive moments in time — at successive positions in the text — and as you do that, you're going to update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it. So we've got "the students opened their" and we want to predict with that, and here's the way that we're going to do it. Okay, I've still got four words in my example so I can put stuff down the left side of the slide, but there could have been 24 words, because recurrent neural networks can deal with any length of context. Okay? So as before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before. Okay? But now, to compute probabilities for the next word, we're going to do something different. Our hidden layer is going to be recurrent, and by recurrent, we mean we're going to change a hidden state at each time step as we proceed through the text from left to right.
So we're going to start off with an h0, which is the initial hidden state, which can actually just be all zeros. And then at each time step, what we're going to do is multiply the previous hidden state by a weight matrix, take a word embedding and multiply it by a weight matrix, and then sum the results of those two things. And that's going to give us a new hidden state. So that hidden state will then store a memory of everything that's been seen so far. So we'll do that and then we'll continue along: we'll multiply the next word vector by the same weight matrix W_e, we'll multiply the previous hidden state by the same weight matrix W_h, and we add them together and get a new representation. I've only sort of said this bit, so I've left out a bit: commonly, there are two other things you're doing. You are adding on a bias term, because we usually separate out a bias term, and you're putting things through a nonlinearity. And I should make sure I mention that for recurrent neural networks, most commonly this nonlinearity has actually been the tanh function, so it's sort of balanced on the positive and negative side. And so you keep on doing that through each step. And so the idea is, once we've gotten to here, this h4 hidden state is a hidden state that in some sense has read the text up until now. It's seen all of "the students opened their", and if the word "students" occurred in any of these positions, it will have been multiplied by the same W_e matrix and added into the hidden state. So it's got a cleaner, low-parameter way of incorporating the information that it's seen. So now I want to predict the next word. And to predict the next word, I'm going to do, based on the final hidden state, the same kind of thing I did before: I'm going to multiply that hidden state by a matrix, add another bias, and stick that through a softmax. Well, the softmax will give me a language model — a probability over all next words — and I can sample from it to generate the next word. That makes sense. Okay? Recurrent neural networks. Okay. So with recurrent neural networks, we can now process any length of preceding context, and we'll just put more and more stuff in our hidden state. Our computation is using information from many steps back. Our model size doesn't increase for having a long context, right? We have to do more computation for a long context, but our representation of that long context just remains this fixed-size vector h, of whatever dimension it is. So there's no exponential blowout anymore. The same weights are applied at every time step, so there's a symmetry in how inputs are processed. There are some catches. The biggest catch in practice is that recurrent computation is slow. So for the feed-forward layer, we just had our input vector, we multiply it by a matrix, we multiply it by a matrix however many times, and then at the end we're done. Whereas here we're stuck with this sequentiality — you have to be doing one hidden vector at a time. In fact, this is going against what I said at the beginning of class, because essentially here you're doing a for loop: you're going through, for time equals one to t, and you're generating in turn each hidden vector. And that's one of the big problems with RNNs that has led them to fall out of favor.
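Written out, with the bias terms and tanh nonlinearity just mentioned, the recurrence and the output distribution are as below. The symbol names ($W_h$, $W_e$, $b_1$, $U$, $b_2$, embedding $e^{(t)}$, hidden state $h^{(t)}$) are the ones typically used on the course slides; the lecture only names them verbally.

$$h^{(t)} = \tanh\!\big(W_h\, h^{(t-1)} + W_e\, e^{(t)} + b_1\big), \qquad \hat{y}^{(t)} = \operatorname{softmax}\!\big(U\, h^{(t)} + b_2\big), \qquad h^{(0)} = \mathbf{0}$$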
There's another problem that we'll look at more: in theory, this is perfect — you're just incorporating all of the past context in your hidden vector. In practice, it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words that you saw recently that dominate the hidden state. Now in some sense that's right, because the recent stuff is the most important stuff that's freshest in your mind — it's the same with human beings; they tend to forget stuff from further back as well. But RNNs, especially in the simple form that I've just explained, forget stuff from further back rather too quickly. And we'll come back to that again in Thursday's class. Okay. So for training an RNN language model, the starting-off point is we get a big corpus of text again, and then we're going to compute for each time step a prediction of the probability of next words, and then there's going to be an actual next word, and we're going to use that as the basis of our loss. So our loss function is the cross entropy between the predicted probability and what the actual next word that we saw is, which again, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally, you'd like to predict the actual next word with probability one, which means the negative log of one would be zero and there'd be no loss. But in practice, if you give it an estimate of 0.5, there's only a little bit of loss, and so on. And so to get our overall objective function, we work out the average loss — the average negative log likelihood of predicting each word in turn. So showing that as pictures: if our corpus is "the students opened their exams", we're first of all going to be trying to predict what comes after "the". And we will predict some words with different probabilities, and then we'll say, oh, the actual next word is "students" — okay, you gave that a probability of 0.05, say, because all we knew was that the first word was "the". Okay, there's a loss for that: the negative log probability given to "students". We then go on and generate the probability estimate over the next words, and then we say, well, the actual word is "opened" — what probability estimate did you give to that? We get a negative log probability loss. We keep on running this along, and then we sum all of those losses and average them per word. And that's our average per-word loss, and we want to make it as small as possible. And so that's our training mechanism. It's important to notice that for generating this loss, we're not just doing free generation. We're not just saying to the model: go off and generate a sentence. What we're actually doing is that at each step we're effectively saying: okay, the prefix is "the students opened" — what probability distribution do you put on next words after that? We generate it with our recurrent neural network and then ask, for the actual next word, what probability estimate did you give to "their"? And that's our loss. But then what we do is stick "their" — the right answer — into our recurrent neural network. So we always go back to the right answer, generate a probability distribution for next words, and then ask: okay, what probability did you give to the actual next word, "exams"? And then again, we use the actual next word.
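Written out, the per-step loss and the averaged objective being described are as follows, where $\hat{y}^{(t)}$ is the predicted distribution at step $t$ and $x^{(t+1)}$ is the actual next word (notation chosen here to match the formulas above):

$$J^{(t)}(\theta) \;=\; -\log \hat{y}^{(t)}_{x^{(t+1)}}, \qquad J(\theta) \;=\; \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta)$$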
So we do one step of generation, then we pull it back to what was actually in the text, and then we ask it for guesses over the next word, and we repeat forever. And so the fact that we don't do free generation, but we pull it back to the actual piece of text each time, makes things simple, because we sort of know what the actual author used for the next word. And that process is called teacher forcing. And so the most common way to train language models is using this kind of teacher forcing method. I mean, it's not perfect in all respects, because we're not actually exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" from some human-generated piece of text. Okay, so that's how we get losses. And then after that, we want to, as before, use these losses to update the parameters of the neural network. Okay? And how do we do that? Well, in principle, we just have all of the text that we've collected, which you could think of as just a really long sequence — okay, we've got a billion words of text, here it is, right? So in theory, you could just run your recurrent neural network over your billion words of text, updating the context as you go. But that would make it very difficult to train a model, because you'd be accumulating these losses for a billion steps and you'd have to store them, and then you'd have to store the hidden states so you could update parameters, and it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, and then we run our recurrent neural network on those segments, compute the loss for each segment, and then update the parameters of the recurrent neural network based on the losses that we found for that segment. I describe it here as the segments being sentences or documents, which seems a linguistically nice thing. It turns out that in recent practice, when you're wanting to scale most efficiently on GPUs, people don't bother with those linguistic niceties. They just say a segment is 100 words — just cut every 100 words. And the reason why that's really convenient is you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, and do vectorized training more efficiently, and things go great for you. Okay? But there are still a few more things that we need to know to get things to work great for you. I'll try and get a bit more through this before today ends. So we need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network. And the interesting case here is that these W_h parameters are being used everywhere through the neural network at each stage, as are the W_e ones. So they appear at many places in the network. So how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? And the answer to that is: oh, it's really simple. You can just pretend that those W_h's in each position are different and work out the partials with respect to them at one position. And then to get the partials with respect to W_h, you just sum whatever you found in the different positions. So that's it: the gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears.
And the reason for that sort of follows from what I talked about in lecture three. Or you can also think about it in terms of what you might remember from multivariable chain rules. But the way I introduced it in lecture three is that gradients sum at outward branches. And so the way you can think about it in a case like this is that you've got a W_h matrix which is being copied by identity to W_1, W_2, W_3, W_4, et cetera, at each time step. So since those are identity copies, they have a partial derivative with respect to each other of one. And so then we apply the multivariable chain rule to these copies, and so we've got an outward-branching node, and you're just summing the gradients from each time step to get the total gradient for the matrix. Okay, yeah. I mean, there's one other trick that's perhaps worth knowing. If you've got segments that are 100 long, a common speed-up is to say: oh, maybe we don't actually have to run backpropagation for 100 time steps; maybe we could just run it for 20 time steps and stop — which is referred to as truncated backpropagation through time. In practice, that tends to be sufficient. Note in particular that you're still, on the forward pass, updating your hidden state using your full context, but in the backpropagation, you're just cutting it short to speed up training. Okay. So just as I did before with an n-gram language model, we can use an RNN language model to generate text. And it's pretty much the same idea, except now, rather than just using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. So I can start with the initial hidden state, and I can use the start-of-sentence symbol. I mean, in the example I had before, I started immediately with "the", hoping that was less confusing the first time. But what you should have asked is: wait a minute, where did the "the" come from? So normally, what we actually do is use a special start-of-sequence symbol, like this angle-bracketed <s>, and we feed it in as a pseudo-word which has a word embedding. And then, based on this, we'll be generating the first word of the text. So we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we're going to do is take that word that we generated and copy it down as the next input. And then we're going to run the next stage of the neural network, sample a next word from the probability distribution, copy it down as the next word of the input, and keep on generating. And so this is referred to as a rollout: you're kind of continuing to roll the dice and generate forward and generate a piece of text. And normally you want to stop at some point, and the way we can stop at some point is we can have a second special symbol, the angle-bracketed </s>, which says end of sequence. So we can generate an end-of-sequence symbol, and then we can stop. And so using this, we can generate pieces of text. And essentially, this is exactly what's happening if you use something like ChatGPT, right? The model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process of generating a word at a time, treating it as an input, generating the next word, and generating this sort of rollout.
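A minimal PyTorch sketch of that rollout loop for the vanilla RNN described above; the weight names follow the equations given earlier, and the toy sizes and random weights at the bottom exist only to make the snippet runnable — a real model's weights would come from training.

```python
import torch

def generate(embed, W_h, W_e, b1, U, b2, bos_id, eos_id, max_len=30):
    """Sample a token sequence from a trained vanilla RNN language model."""
    h = torch.zeros(W_h.shape[0])                 # initial hidden state h^(0) = 0
    x, out = bos_id, []
    for _ in range(max_len):
        e = embed[x]                              # embedding of the current token
        h = torch.tanh(W_h @ h + W_e @ e + b1)    # recurrent hidden-state update
        probs = torch.softmax(U @ h + b2, dim=0)  # P(next token | context so far)
        x = torch.multinomial(probs, 1).item()    # sample, rather than take the argmax
        if x == eos_id:                           # stop at the end-of-sequence symbol
            break
        out.append(x)                             # sampled token becomes the next input
    return out

# Toy sizes just to show the call; weights here are random, not trained.
V, d, dh = 50, 16, 32
params = dict(embed=torch.randn(V, d), W_h=torch.randn(dh, dh) * 0.1,
              W_e=torch.randn(dh, d) * 0.1, b1=torch.zeros(dh),
              U=torch.randn(V, dh) * 0.1, b2=torch.zeros(V))
print(generate(**params, bos_id=0, eos_id=1))
```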
And it's done probabilistically, so if you do it multiple times, you can get different answers. We haven't yet gotten to ChatGPT, but we can have a little bit of fun. So you can take this simple recurrent neural network that we've just built here, and you can train it on any piece of text and get it to generate stuff. So for example, I can train it on Barack Obama's speeches. So that's a small corpus, right? You know, he didn't talk that much, right? I've only got a few hundred thousand words of text; it's not a huge corpus. I'll just show this and then I can answer the question. But I can generate from it, and I get something like: "The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done." Yeah, well, maybe that's slightly better than my n-gram language model. Still not perfect, you might say, but somewhat better, maybe. Did you have a question? Yeah — so, like, the truncated segmenting of the corpus, does that impose some kind of limitation on how much we can produce and still have some coherency in it? So, yeah — I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context that we're going to use. I mean, that's a fair amount — 100 words, that's typically several sentences — but to the extent that you wanted to know even more about the further-back context, you wouldn't be able to. And certainly that's one of the ways in which modern large language models are using far bigger contexts than that: they're now using thousands of words of prior context. Yeah, absolutely, it's a limit on how much further-back context you can use. So in some sense, actually, even though in theory a recurrent neural network can feed in an arbitrary-length context, as soon as I say, oh, practically we cut it into segments, that actually means we are making a Markov assumption again, and we're saying the further-back context doesn't matter. Yeah. Okay, a couple more examples. So instead of Barack Obama, I can feed in Harry Potter, which is a somewhat bigger corpus of text actually, and generate from that. And so I can get: "'Sorry,' Harry shouted, panicking. 'I'll leave those brooms in London. Are they?' 'No idea,' said Nearly Headless Nick, casting low close by Cedric, carrying the last bit of treacle Charms from Harry's shoulder. And to answer him, the common room perched upon it, forearms held a shining knob from when the spider hadn't felt. It seemed he reached the teams too." Well, there you go. You can do other things as well, so you can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try and cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate ranch barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings. Two tablespoons of Parmesan cheese, chopped; one cup of coconut milk; and three eggs, beaten. Place each pasta over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot and bodied, fresh mustard, orange and cheese. Combine the cheese and salt together, the dough in a large skillet and the ingredients, and stir in the chocolate and pepper."
Yeah, it's not exactly a very consistent recipe, but when it comes down to it, it sort of has the language of a recipe. Maybe if I had scaled it more and had a bigger corpus it would have done a bit better, but it's definitely not using the ingredients that are there. Let's see, it's almost time for today. So maybe about all I can do is one more fun example, and then after that, yeah, I probably should stop and pick things up at the start next time. So as a variant of building RNN language models, I mean, so far we've been building them over words, so the tokens, the time steps over which you build it, are words. Actually, you can use the idea of recurrent neural networks over any other size unit, and people have used them for other things. So people have used them in bioinformatics for things like DNA, for sort of gene sequencing or protein sequencing or anything like that. But even staying with language, instead of building them over words, you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. And that can sometimes be useful because it allows us to sort of generate things that look like words and perhaps have the structure of English words. And similarly, there are other things that you can do. So before, when I initialized the hidden state, I said you just have an initial hidden state, and you can make it zeros if you want. Well, sometimes we're going to build a contextual RNN where we can initialize the hidden state with something else. So in particular, I can initialize the hidden state with the RGB values of a color. And so I can initialize the hidden state with the color and generate, a character at a time, the name of paint colors. And I can train a model based on a paint company's catalog of color names and the RGB values of those colors. And then I can give it different paint colors and it'll come up with names for them. And it actually does an excellent job. This one worked really well. Look at this. This one here is Gsty Pink, Power Gray, Navel Tan, Bokco White, Horble Gray, Homstar Brown. Now, couldn't you just imagine finding all of these in a paint catalog? I mean, there are some really good ones over here in the bottom right. This color here is Dope, and then this Stone of Blue, Purple Simp, Stinky Bean and Turdly. Now I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today and do more of the science of neural networks next time.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture (Stanford CS224N, Lecture 5) first reviews and supplements several key neural network concepts: early neural networks (1980s-90s) mostly used a single hidden layer, and the deep learning revival (from roughly 2006) overcame the difficulty of training deep networks through a series of seemingly small but crucial improvements. These include more effective regularization (such as L2 regularization and Dropout, the latter randomly dropping neurons to make the model more robust), vectorized computation for efficiency, careful parameter initialization (such as Xavier initialization) to break symmetry and stabilize learning, and more advanced optimizers (such as Adam) that improve on plain gradient descent.
The core of the lecture then turns to language models (LMs). A language model's central task is to predict a probability distribution over the next word given some context, or equivalently to assign an overall probability to a piece of text. The traditional approach is the n-gram language model, which relies on the Markov assumption and estimates probabilities from the frequencies of short word sequences (n-grams) in a corpus. N-gram models are simple to implement, but they suffer from data sparsity (many n-grams never occur, producing zero probabilities) and storage blow-up (larger n requires far more storage). Smoothing and backoff mitigate these problems, but the context window remains limited (usually at most 5-grams).
The lecture then introduces neural language models, starting with the fixed-window neural language model, which concatenates the context word vectors and feeds them into a feed-forward network to predict the next word. This removes much of the n-gram sparsity problem, but it is still limited to a fixed window and uses different parameters for words in different positions. To address these limitations, recurrent neural networks (RNNs) were proposed. An RNN shares weights across time steps and maintains an internal hidden state (memory), so it can process sequences of arbitrary length and, in principle, capture long-distance dependencies. An RNN language model predicts the next word at every time step from the current hidden state and the input word, and is trained with a cross-entropy loss using "teacher forcing". RNNs can generate text one word at a time, and character-level RNNs enable creative applications (such as generating paint color names). Despite their advantages for sequence data, RNNs are slow to compute and suffer from vanishing/exploding gradients, which makes long-distance dependencies hard to learn.
Review and Supplement of Neural Network Concepts
Early Neural Networks and the Rise of Deep Learning
- Historical background:
  - Neural networks were popular in the 1980s-90s, largely thanks to the backpropagation algorithm (Geoff Hinton and colleagues).
  - Networks at the time typically had just one hidden layer (input layer - hidden layer - output layer).
  - For a long time (about 15 years), training deeper networks (with multiple hidden layers) was considered very difficult and often performed worse than shallow networks. Quoting a point from a paper by Yoshua Bengio and others: "Empirically, deep networks were generally found to be not better and often worse than neural networks with one or two hidden layers."
- The deep learning revival (from around 2006):
  - A series of seemingly small but crucial improvements and new ideas made it possible to train deep neural networks and made them perform better than shallow ones.
  - These improvements took models from "going nowhere" to "working surprisingly well".
Key Technical Points
- Regularization:
  - Purpose: prevent overfitting and improve generalization.
  - Classical view: training error keeps falling, but validation/test error starts rising after some point (overfitting). L2 regularization (adding the sum of squared parameters to the loss) pushes parameter values toward zero to reduce overfitting.
  - Modern view (for large neural networks):
    - "We don't believe that overfitting exists anymore" (in the traditional sense).
    - Large networks can be trained to nearly zero loss on the training set (perfectly fitting/memorizing the training data).
    - With effective regularization, a model that fits the training data essentially perfectly can still generalize well to new data, and the validation loss keeps falling.
  - Dropout (a minimal sketch follows after this list):
    - A very popular regularization method, used in the course assignments.
    - At training time: in intermediate layers, randomly "drop" (switch off) a fraction of the neurons' outputs (a Hadamard product with a random 0-1 mask); a different mask is drawn at each training iteration.
    - Effect: forces the network to learn more robust features, since it cannot rely too heavily on a few input features that might "disappear" at any moment.
    - At test time: nothing is dropped; all weights are kept but scaled appropriately to compensate for the units dropped during training.
    - Motivations:
      - Prevents "co-adaptation" between features.
      - Can be viewed as training an enormous ensemble of exponentially many different networks with shared weights.
      - Can be regarded as a feature-dependent form of regularization.
- Vectorization:
  - Core idea: avoid explicit for-loops; use vector, matrix, and tensor operations instead.
  - Advantages:
    - At least an order of magnitude faster on CPUs.
    - Two to three orders of magnitude faster on GPUs or TPUs.
  - In practice: "If I'm writing a for loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake."
- Parameter initialization:
  - Importance: weights must be initialized to small random numbers.
  - Reason: initializing everything to zero (or to the same constant) creates a symmetry problem, where all neurons in a layer learn the same feature, as if there were many copies of a single feature.
  - Methods:
    - Initial values should be small enough to avoid exploding gradients, but not so small that gradients vanish.
    - Xavier initialization: a once widely used scheme that sets the variance of the initialization distribution based on the number of input and output units of the layer.
    - Modern developments: techniques such as layer normalization have reduced the dependence on delicate initialization schemes, but random initialization is still necessary.
- Optimizers:
  - Stochastic gradient descent (SGD): the basic method; it works in principle but usually requires careful tuning of the learning rate, its decay schedule, and other hyperparameters.
  - More advanced optimizers:
    - Accumulate past gradient information per parameter and adapt the learning rate dynamically.
    - Less sensitive to hyperparameter settings and more stable.
    - Adagrad: an early method, simple but [the transcript reads "tends to store worldly"; this most likely refers to the accumulated squared gradients causing the effective learning rate to decay over time].
    - Adam: "a really good, safe place to start"; used in the course assignments.
    - Specialized optimizers for sparse data such as word vectors (their names often end in "AW").
    - Other concepts: momentum, Nesterov acceleration, etc.
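To make Dropout's training-time mask concrete, here is a minimal NumPy sketch of the common "inverted" dropout variant, which rescales the surviving units during training instead of rescaling the weights at test time as described above. The function name and the use of NumPy are our choices, not details given in the lecture.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted dropout applied to a hidden-layer activation h.

    During training, each unit is zeroed out with probability p_drop and the
    survivors are scaled by 1/(1 - p_drop), so at test time h can simply pass
    through unchanged (no separate weight rescaling is needed).
    """
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask  # Hadamard product with the random (scaled) 0/1 mask
```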
Language Modeling (LM)
Definition and Core Task
- Technical term: in NLP, a "language model" specifically means a system that predicts the next word, or, more precisely, gives a probability distribution over the possible next words.
- Example: "The students opened their ____" (possible words: books, laptops, minds, etc.)
- Formalization: given a sequence of words (the context), a language model computes, for every word in the vocabulary, the probability that it is the next word, P(word_next | context), with the probabilities of all possible words summing to 1.
- Equivalent view: a language model is a system that can assign an overall probability to any piece of text, via the chain rule (see the sketch below):
  P(x1, x2, ..., xk) = P(x1) * P(x2|x1) * P(x3|x1,x2) * ... * P(xk|x1,...,xk-1)
  where every term P(xi | x1,...,xi-1) is supplied by the language model.
- Importance: language modeling is a foundational NLP technology and has been central since the 1980s (the underlying idea goes back to the 1950s).
- Application examples: keyboard suggestions on phones, query autocompletion in search engines.
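To make the chain-rule view concrete, here is a tiny Python sketch that scores a whole text by summing per-word log probabilities. `next_word_prob(context, word)` is a hypothetical callable standing in for whatever model supplies the conditional distributions; it is not something defined in the lecture.

```python
import math

def sequence_log_prob(tokens, next_word_prob):
    """log P(x1..xk) = sum_i log P(xi | x1..x{i-1})  (the chain rule).

    `tokens` is a list of words; `next_word_prob(context, word)` returns the
    model's conditional probability of `word` given the list `context`.
    """
    total = 0.0
    for i, word in enumerate(tokens):
        total += math.log(next_word_prob(tokens[:i], word))
    return total
```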
N-gram Language Models
Principle
- Construction (from 1975 until roughly 2012): based on counting how often short word subsequences (n-grams) occur.
  - N-gram types:
    - Unigram (1-gram): probability of a single word.
    - Bigram (2-gram): probability of a word pair.
    - Trigram (3-gram): probability of a three-word sequence.
    - And so on (4-gram, 5-gram).
    - The lecturer notes that, etymologically, Greek number prefixes would be more consistent (e.g., monogram, digram); Claude Shannon used "digram" in his 1951 paper, but unigram, bigram, etc. became the standard terms.
  - Markov assumption: to predict the next word, only the previous n-1 words of context are considered and earlier words are ignored: P(xt+1 | x1, ..., xt) ≈ P(xt+1 | xt-n+2, ..., xt)
  - Probability estimation: by counting n-grams in a large text corpus (a count-based sketch follows this section):
    P(wk | w1, ..., wk-1) ≈ Count(w1, ..., wk-1, wk) / Count(w1, ..., wk-1)
  - Example (4-gram): predicting the word after "students open their".
    - Suppose "students open their" occurs 1000 times.
    - "students open their books" occurs 400 times => P(books | students open their) = 400/1000 = 0.4.
    - "students open their exams" occurs 100 times => P(exams | students open their) = 100/1000 = 0.1.
Problems
- Limited context: only a fixed, short prefix is used, so long-distance dependencies cannot be captured. For example, if the earlier context contains "As the proctor started the clock", then "exams" should be more likely than "books", but the n-gram model may wrongly prefer "books" simply because "students open their books" is more frequent.
- Data sparsity:
  - Zero numerator: some n-gram (w1, ..., wn) never appears in the training corpus, so Count(w1, ..., wn) = 0 and therefore P(wn | w1, ..., wn-1) = 0.
    - Remedy: smoothing techniques such as add-delta smoothing, which add a small positive value (e.g., 0.25) to every count so that no probability is exactly zero.
  - Zero denominator: the context (w1, ..., wn-1) never appears in the training corpus.
    - Remedy: backoff. If the n-gram estimate is unavailable, back off to an (n-1)-gram estimate, and so on down to unigrams.
- Storage:
  - As n grows, the number of n-grams that must be stored grows explosively.
  - In practice n is usually at most 5 (5-grams); the Google N-gram corpus (built from roughly a trillion words) also only goes up to 5-grams.
- Advantage:
  - Simple and fast to build; it only requires counting.
Text Generation with N-gram Models
- Procedure (a sampling sketch follows this section):
  1. Start from an initial context (e.g., "today the").
  2. Use the n-gram model to compute the probability distribution over the next word given that context.
  3. Randomly sample a word from that distribution.
  4. Append the sampled word to the context and repeat steps 2-3.
- Example (a trigram model trained on about 2 million words): "Today the price of gold per ton while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted European stocks..."
- Assessment:
  - Most of the generated text is grammatical and locally somewhat coherent.
  - But overall it lacks consistency and meaning; it is very incoherent.
  - More training data and a larger n improve it slightly; in the early 2010s the mainstream view was that "scale solves everything".
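Reusing `ngram_prob` from the sketch above, the rollout might look like the following; `vocab` is assumed to be a list of word strings and `seed` must contain at least n-1 tokens. This is a sketch of the idea, not the lecture's actual code.

```python
import random

def generate_ngram_text(seed, ngrams, contexts, vocab, n=3, steps=20, delta=0.25):
    """Repeatedly build P(next word | last n-1 words) and sample from it."""
    out = list(seed)
    for _ in range(steps):
        context = tuple(out[-(n - 1):])
        weights = [ngram_prob(context, w, ngrams, contexts, len(vocab), delta)
                   for w in vocab]
        out.append(random.choices(vocab, weights=weights, k=1)[0])
    return " ".join(out)
```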
Neural Network Language Models
Fixed-Window Neural Language Model
- Proposed by: Yoshua Bengio and colleagues (early 2000s).
- Idea (a sketch of this architecture follows below):
  - Choose a fixed-size context window (e.g., the previous 4 words).
  - Concatenate the word embeddings of the words in the window.
  - Feed the concatenated vector into a standard feed-forward network (with a hidden layer).
  - The output layer applies a softmax to produce a probability distribution over the vocabulary for the next word.
- Advantages:
  - Addresses sparsity: thanks to the distributed representations of word vectors, the model can generalize to unseen word combinations as long as the word vectors are similar.
  - No n-grams to store: only the network's parameters need to be stored.
- Disadvantages:
  - Fixed window size: still a Markov assumption, so arbitrary-length context cannot be handled, and enlarging the window sharply increases the number of input-layer parameters.
  - No weight sharing: words in different positions of the window are processed by different parts of the weight matrix W, which makes parameter learning inefficient. For example, "student" in position 1 and in position 2 is handled by different parameters.
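A PyTorch-style sketch of this architecture; the framework, the layer sizes, and the tanh nonlinearity are our choices for illustration, not details given in the lecture.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of context words, concatenate the embeddings,
    pass them through one hidden layer, and project to vocabulary logits
    (the softmax is applied by the loss or at sampling time)."""
    def __init__(self, vocab_size, d_emb=100, window=4, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.hidden = nn.Linear(window * d_emb, d_hidden)
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, context_ids):          # context_ids: (batch, window)
        e = self.embed(context_ids)           # (batch, window, d_emb)
        e = e.flatten(start_dim=1)            # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                    # (batch, vocab_size) logits
```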
Recurrent Neural Networks (RNNs)
Core Idea and Architecture
- Goals:
  - Process input sequences of arbitrary length.
  - Share parameters across time steps, so that patterns are recognized no matter where they occur.
- Mechanism (an explicit implementation of this recurrence follows below):
  - At each time step t, the RNN receives the current input xt (usually a word vector) and the previous hidden state ht-1.
  - A recurrent unit computes the current hidden state ht:
    ht = f(W_hh * ht-1 + W_xh * xt + b_h)
    where:
    - ht-1 is the previous hidden state (h0 is usually initialized to a zero vector);
    - xt is the input word vector at the current time step;
    - W_hh (W_h in the lecture) is the weight matrix connecting the previous hidden state to the current one;
    - W_xh (W_e in the lecture) is the weight matrix connecting the current input to the hidden state;
    - b_h is a bias term;
    - f is a nonlinear activation function, commonly tanh in RNNs.
  - Weight sharing: W_hh, W_xh, and b_h are the same at every time step.
  - The hidden state ht encodes the information in the sequence up to the current time step (the "memory").
  - From the current hidden state ht, an output yt can be predicted (e.g., the probability distribution over the next word):
    yt = softmax(W_hy * ht + b_y)
    where W_hy (U in the lecture) is the weight matrix from the hidden state to the output and b_y is the output bias.
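A sketch that writes the recurrence out explicitly so the shared weights W_hh, W_xh, and W_hy are visible. PyTorch is assumed and the sizes are illustrative; in practice one would usually use a built-in recurrent layer instead.

```python
import torch
import torch.nn as nn

class VanillaRNNLM(nn.Module):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h);  y_t = softmax(W_hy h_t + b_y).
    The same weight matrices are reused at every time step."""
    def __init__(self, vocab_size, d_emb=100, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.W_xh = nn.Linear(d_emb, d_hidden, bias=False)
        self.W_hh = nn.Linear(d_hidden, d_hidden)      # its bias plays the role of b_h
        self.W_hy = nn.Linear(d_hidden, vocab_size)    # its bias plays the role of b_y

    def forward(self, token_ids, h=None):              # token_ids: (batch, T)
        batch, T = token_ids.shape
        if h is None:                                   # h_0 is a zero vector
            h = token_ids.new_zeros(batch, self.W_hh.in_features, dtype=torch.float)
        logits = []
        for t in range(T):                              # sequential: the RNN "for loop"
            x_t = self.embed(token_ids[:, t])
            h = torch.tanh(self.W_xh(x_t) + self.W_hh(h))
            logits.append(self.W_hy(h))                 # pre-softmax scores y_t
        return torch.stack(logits, dim=1), h            # (batch, T, vocab), final h
```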
Strengths and Weaknesses of RNNs
- Strengths:
  - Can process input sequences of any length.
  - Model size does not grow with the length of the input sequence (the hidden state h has a fixed dimension).
  - Weights are shared across all time steps, so inputs are processed symmetrically and parameters are used efficiently.
- Weaknesses:
  - Slow computation: the hidden states must be computed sequentially, which is hard to parallelize ("essentially here you're doing a for loop").
  - Vanishing/exploding gradients: in theory ht contains all of the history, but in practice information from early time steps becomes very weak (vanishing gradients) or overly strong (exploding gradients) after being propagated over many steps, so the model struggles to learn long-distance dependencies and recent words dominate the hidden state. (This problem is discussed in detail in a later lecture.)
Training an RNN Language Model
Loss Function and Objective
- Data: a large text corpus.
- Procedure:
  - For every word in the corpus (conditioned on its preceding context), the RNN predicts a probability distribution over the next word.
  - Loss: the cross-entropy loss between the predicted distribution and the actual next word.
    - Equivalently, the negative log-likelihood of the actual next word: -log P(actual_next_word | context).
- Overall objective: the average cross-entropy loss (average negative log-likelihood) over all words in the corpus; training minimizes this average loss.
Teacher Forcing
- Training strategy: at each training time step t (a sketch of one training step follows below):
  1. The model predicts yt from ht-1 and the actual input xt (a word from the training data).
  2. The loss Jt is computed from yt and the actual next word xt+1.
  3. At the next time step t+1, the actual word xt+1 from the training data is fed in as input, not the word the model generated itself at time t.
- Reason: this keeps training stable and prevents early mispredictions from pushing later inputs too far away from the real data.
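A sketch of one teacher-forced training step, assuming a model with the interface of the `VanillaRNNLM` sketch above and an optimizer such as Adam: the inputs are the gold tokens and the targets are the same tokens shifted by one position, so the model always conditions on real text rather than on its own samples.

```python
import torch.nn.functional as F

def training_step(model, batch_ids, optimizer):
    """One step of teacher-forced language-model training on a token batch."""
    inputs, targets = batch_ids[:, :-1], batch_ids[:, 1:]   # shift targets by one
    logits, _ = model(inputs)                                # (batch, T-1, vocab)
    # average per-token cross-entropy == average negative log-likelihood
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # backpropagation through time
    optimizer.step()
    return loss.item()
```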
Training in Practice
- Splitting the data: long corpora are cut into shorter segments (e.g., sentences, or fixed-length chunks of around 100 words).
  - Fixed-length chunks make batched, vectorized training on GPUs convenient.
- Backpropagation Through Time (BPTT):
  - Compute the gradients of the loss with respect to the RNN parameters (in particular the shared weights W_hh and W_xh).
  - Because the weights are shared across time steps, their total gradient is the sum of the gradients from every time step where they appear.
  - Truncated Backpropagation Through Time (Truncated BPTT): for computational efficiency and lower memory use, gradients are propagated back only a limited number of steps (e.g., 20), even though the forward pass still updates the hidden state using the full segment's context. (A sketch follows below.)
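A sketch of truncated BPTT under the same assumptions as the sketches above: the hidden state is carried forward across chunks so the forward pass still sees the full left context, but it is detached between chunks so gradients flow back at most k steps.

```python
import torch.nn.functional as F

def truncated_bptt(model, long_ids, optimizer, k=20):
    """Train over one long token sequence (shape (batch, L)) in k-step chunks."""
    h = None
    for start in range(0, long_ids.size(1) - 1, k):
        chunk = long_ids[:, start:start + k + 1]              # k inputs + k targets
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits, h = model(inputs, h)                           # reuse the carried hidden state
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                        # gradients stop within this chunk
        optimizer.step()
        h = h.detach()                                         # cut the graph between chunks
```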
Text Generation with an RNN
The Generation Process ("Rollout")
1. Initialize the hidden state h0 (usually a zero vector).
2. Feed a special start-of-sequence symbol (e.g., <s>) as the first input x1.
3. The RNN computes h1 and the output probability distribution y1.
4. Sample a word w1 from y1 as the first generated word.
5. Feed the generated word w1 as the input x2 at the next time step.
6. From x2 and h1, the RNN computes h2 and y2, and samples w2 from it.
7. Repeat this process until a special end-of-sequence symbol (e.g., </s>) is generated or a preset maximum length is reached.
- Because sampling is random, multiple runs produce different text sequences. (A sampling loop is sketched below.)
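A sampling loop matching the steps above, again assuming the `VanillaRNNLM` interface; `bos_id` and `eos_id` are the indices of the <s> and </s> symbols, and `id_to_word` maps indices back to strings (all of these names are ours).

```python
import torch

def rollout(model, id_to_word, bos_id, eos_id, max_len=50):
    """Generate text one word at a time by sampling and feeding samples back in."""
    h = None
    token = torch.tensor([[bos_id]])                     # batch of 1, starting with <s>
    words = []
    for _ in range(max_len):
        logits, h = model(token, h)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample, not argmax
        if next_id.item() == eos_id:                         # stop at </s>
            break
        words.append(id_to_word[next_id.item()])
        token = next_id                                      # the sample becomes the next input
    return " ".join(words)
```

Because `torch.multinomial` samples rather than taking the argmax, repeated calls produce different texts, which matches the behavior described above.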
Examples of RNN-Generated Text
- Trained on Obama speeches: "The United States will step up to the cost of unnew challenges of the American people that will share the fact that we created the problem, they were attacked, and so that they have to say that all the task of the final days of war that I will not be able to get this done."
  - Assessment: perhaps slightly better than the n-gram model, but still far from perfect.
- Trained on Harry Potter: "Sorry, Harry shouted, panicking. I'll leave those brooms in London. Are they? No idea, said nearly headless neck casting loclius by Cedric, carrying the last bit of trecle charms from Harry's shoulder..."
- Trained on recipes: "Chocolate ranch barbecue categories game casseroles, cookies cookies yield six servings, two tablespoons of parmesan cheese, chopped, one cup of coconut milk and three eggs, beaten..."
  - Assessment: the text is structured like a recipe, but the content is incoherent and the ingredients are used haphazardly.
Character-Level RNNs and Creative Applications
Character-Level RNN Language Models
- Unit: the RNN operates and predicts at the character level rather than the word level.
- Advantages:
  - Can generate strings that look like real words and have the spelling structure of English words.
  - The vocabulary is fixed and small (e.g., the English letters plus punctuation).
  - Can handle out-of-vocabulary (OOV) words and spelling errors.
- Applications: DNA sequences, protein sequence analysis, and similar data.
Creative Application Example: Naming Paint Colors
- Model: a character-level RNN.
- Task: generate a creative paint name for a given RGB color value.
- Training:
  - Train on a paint company's color catalog (RGB values and their corresponding names).
  - Context initialization: the RNN's initial hidden state h0 can be initialized from the color's RGB values.
- Generation: given a new RGB value, the model generates a color name character by character. (A sketch of this conditioning follows below.)
- Results: the generated names are quite creative and sound plausible.
  - Examples: "Gsty Pink", "Power Gray", "Navel Tan", "Bokco White", "Horble Gray", "Homstar Brown", "Dope", "Stone of Blue", "Purple Simp", "Stinky Bean", "Turdly".
- The lecturer joked that this might be a real business opportunity.
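A sketch of how the RGB conditioning might be wired up; the lecture does not give the actual architecture, so the linear projection, the sizes, and the use of PyTorch's built-in nn.RNN here are assumptions.

```python
import torch
import torch.nn as nn

class ColorNameRNN(nn.Module):
    """Character-level RNN whose initial hidden state is a projection of the
    (r, g, b) values, so that generation is conditioned on the color."""
    def __init__(self, n_chars, d_emb=32, d_hidden=128):
        super().__init__()
        self.init_h = nn.Linear(3, d_hidden)            # rgb -> h_0
        self.embed = nn.Embedding(n_chars, d_emb)
        self.rnn = nn.RNN(d_emb, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_chars)

    def forward(self, char_ids, rgb):                    # rgb: (batch, 3), scaled to [0, 1]
        h0 = torch.tanh(self.init_h(rgb)).unsqueeze(0)   # (1, batch, d_hidden)
        e = self.embed(char_ids)                         # (batch, T, d_emb)
        outputs, h = self.rnn(e, h0)
        return self.out(outputs), h                      # per-character logits
```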
Summary and Outlook
The lecture ends after the paint-color-naming example, with a preview that the next session will look more deeply at the science of neural networks (RNNs in particular) and their problems. Although RNNs are a significant step forward over fixed-window models for sequence data, their inherent issues with computational efficiency and with learning long-distance dependencies set the stage for more advanced architectures (such as the Transformer) later in the course.