Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 6 - Sequence to Sequence Models

This lecture is the sixth in Stanford's CS224N course on natural language processing with deep learning, and mainly continues the previous discussion of language models and recurrent neural networks (RNNs).

Core topics include:
1. Review and preview: The lecture first reviews language models (systems that predict the next word) and RNNs (neural architectures that can take sequential input of any length, share the same weights at each step, and optionally produce an output at each step). It then previews a more advanced RNN, the long short-term memory network (LSTM), and later turns to neural machine translation.
2. Evaluating language models: The standard metric, perplexity, is explained in detail: how it is computed (essentially the exponential of the cross-entropy; written out below the list), its history (introduced by Fred Jelinek to express a model's predictive uncertainty as a more intuitive number, equivalent to choosing uniformly among that many options), and its interpretation (lower perplexity means a better model, one that more accurately predicts human-written text). The lecture also notes that when comparing perplexities you must pay attention to the logarithm base used (e.g., base 2 versus natural log e).
3. Progress in model performance: Perplexity numbers illustrate the evolution of language models: from traditional n-gram models (e.g., with Kneser-Ney smoothing, perplexity around 67), to early RNNs combined with other models (around 51), to LSTMs that lowered perplexity substantially (e.g., to 43 or 30, roughly one bit less cross-entropy). The lecture notes that today's best language models reach single-digit perplexities.
4. RNN limitations and the motivation for LSTMs: The lecture focuses on the vanishing and exploding gradient problems that standard RNNs face during training. These arise because, during backpropagation, the gradient of the loss with respect to the parameters involves a product of many Jacobians over the length of the sequence (in a simplified setting, essentially powers of the weight matrix Wh; see the formulas after this list). If the norms of these matrices stay below 1, the gradient shrinks toward zero (vanishing gradients), making long-distance dependencies hard to learn; if they stay above 1, the gradient grows exponentially (exploding gradients). This shortcoming is the main motivation for more complex RNN architectures such as the LSTM.
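For reference, a compact statement of the two formulas summarized in items 2 and 4. The notation here is an assumption made for this summary: tokens x_1..x_T, per-step loss J^(t), hidden states h_t, recurrent weight matrix W_h.

```latex
% Perplexity as the exponential of the average per-token cross-entropy
\mathrm{PP} \;=\; \left(\prod_{t=1}^{T} \frac{1}{P(x_t \mid x_{<t})}\right)^{1/T}
          \;=\; \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)

% Backpropagated gradient as a product of Jacobians; with the nonlinearity
% removed it reduces to a power of W_h, whose eigenvalues drive vanishing/exploding
\frac{\partial J^{(t)}}{\partial h_k}
  \;=\; \frac{\partial J^{(t)}}{\partial h_t}\prod_{i=k+1}^{t}\frac{\partial h_i}{\partial h_{i-1}}
  \;\approx\; \frac{\partial J^{(t)}}{\partial h_t}\, W_h^{\,t-k}
  \qquad (\sigma = \text{identity})
```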

Media details

Upload date
2025-05-15 22:01
Source
https://www.youtube.com/watch?v=Ba6Fn1-Jsfw
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-exp-03-25

Transcript

speaker 1: Okay, hi everyone, back for more CS224N. For today, the plan is essentially a continuation of what we started on Tuesday. I'm going to say more about language models and more about RNNs, in particular introducing a more advanced form of recurrent neural network which was, for a while, very dominant: LSTMs. We'll talk about those, and then in the latter part, as something to be done with recurrent neural networks, we'll start looking at neural machine translation. Okay, so on Tuesday, what we did was introduce language models, systems that predict the next word, and then I introduced recurrent neural networks. That was this new neural architecture that can take sequential input of any length, applies the same weights at each step, and can optionally produce output at each step. So these are two distinct notions, though they tend to go together. A recurrent neural network can be used for other purposes on any kind of sequence, and I'll mention a few of those later today. And language modeling is a traditional component of many NLP tasks, anything to do with generating text or estimating likelihoods of pieces of text. And indeed, in the modern instantiation of large language models, essentially everything we do in NLP is being done by language models. So one way to build a language model is with a recurrent neural network. It's certainly not the only way. We also talked last time about n-gram language models, which are language models. And then starting next week, we'll start to talk about transformers, which are now the most widespread way of building language models. So, to finish off a little bit that I didn't get to last time on evaluating language models: one way to evaluate language models is what I did in class last time, generate some text and say, hey, doesn't this text look good? But often we want something more rigorous than that. The standard way to evaluate language models is to say, well, a language model scores a piece of text and says how likely it is. Our standard for text in the language is stuff produced by human beings. So we take a new piece of text, which wasn't text the model was trained on (we want some fresh evaluation data), we show it to the language model, and we ask it to predict the successive words of this text. The better it is at doing that, the better a language model it is, because it's more accurately able to predict a human-written piece of text. The standard way that is measured is with a measure called perplexity. For perplexity, we take the probability of a prediction from the language model and invert it, so instead of it being 0.002 or something, we turn it into 500 or something like that. Then we take the product of those numbers at each position in the text, and we find the geometric average of them. That's the measure that's normally used. But in this class we've been tending to look at negative log-likelihoods and the idea of cross-entropy, and perplexity is just the exponential of the cross-entropy. So if you're already familiar with per-word negative log-likelihoods, if you just exponentiate that, you get the perplexity. Now there's one other little trick as to what base you use for your logarithms and exponentials.
I mean, traditionally, thinking in terms of bits, a lot of the time people used base two for measuring perplexity. That's largely gone out now; a lot of the time people are using natural logs. But if you're comparing perplexity numbers, they're going to be different depending on what base you're using, so you need to be aware of this. So from a modern perspective, it kind of makes no sense why perplexity is used. The story of why perplexity was used is that, in the bad old days of symbolic artificial intelligence, when all of those famous people like John McCarthy and Ed Feigenbaum were around doing logic-based systems, some people, essentially at IBM, including Fred Jelinek, started exploring probabilistic methods for speech recognition and other similar problems. And the story Fred Jelinek used to tell was that at that time, in the late seventies or early eighties, none of the AI people he was trying to talk to understood how to do any real math and didn't understand information-theoretic notions like cross-entropy or cross-entropy rate. So he had to come up with something simpler they could understand. And what he came up with was that, by doing this exponentiation, you can think of a perplexity number as being equivalent to how many uniform choices you're choosing between. So if the perplexity of something is 64, that's like having a 64-sided die that you're rolling at each time step, and your chance of rolling a one is your chance of guessing the right word. So that was why perplexity got introduced, but it's kind of stuck, and when you see scores for language models, you generally still see perplexities. A lower perplexity is better. So here are the kinds of numbers and where progress was made with neural language models. Before that, people used n-gram language models, and people used clever ways to smooth them, using methods I vaguely alluded to last time, like add-k smoothing and backoff. Actually, people used cleverer methods: around the 2000s decade, the cleverest known method of smoothing n-gram language models was interpolated Kneser-Ney smoothing. For a big language model using that, the perplexity was about 67, which in some sense means you weren't very good at predicting the next word. But that had actually been enormous progress. When I was a young person doing NLP, perplexities were three-figure numbers; you commonly saw perplexities of 150 or so. So progress was made. When RNNs were first introduced, people weren't really able to do better with a pure RNN, but they could do better by combining an RNN with something else, such as a symbolic maximum entropy model, which I'm not going to explain. That gives numbers like 51. But where progress really started to be made was when LSTMs started to be used as an improved RNN, which is what I'm going to come to next. So here are some LSTM models, and now you're getting numbers like 43 and 30. At 30, you've roughly halved the perplexity, which in cross-entropy terms means you've reduced the cross-entropy by about one bit. And so you've made real progress in your language modeling. Now, by modern standards, these numbers are still really high: for the best language models that we have now, you're getting perplexities in the single digits.
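A minimal sketch of the perplexity computation described above, assuming you already have per-token probabilities from some language model; the numbers are made up for illustration.

```python
import math

# Hypothetical per-token probabilities P(x_t | x_<t) assigned by a language model
# to each word of a held-out sentence (illustrative values only).
token_probs = [0.20, 0.05, 0.30, 0.10, 0.02]

# Per-token cross-entropy (average negative log-likelihood). Natural log is used
# here; with base-2 logs the perplexity would be 2 ** cross_entropy instead,
# which is why the log base matters when comparing numbers.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the cross-entropy, i.e. the inverse
# geometric mean of the per-token probabilities.
perplexity = math.exp(cross_entropy)

print(cross_entropy, perplexity)  # roughly 2.40 nats and perplexity about 11.1
```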
You're getting models that are very often able to guess exactly the right word, though of course not always, because in a lot of circumstances no one can predict what word someone is going to say next. Okay, so to motivate LSTMs, I wanted to say a bit about the problems with RNNs and why that motivated fixing things. These are the problems of vanishing and exploding gradients. So what we want to do is say, okay, we've tried to predict the word at position four, and often we're not going to predict it with 100% probability, so we have a loss, a negative log-likelihood, that we give to that word, and we're going to want to backpropagate that loss through the sequence and work out our gradients as we always do. Now, just one note about something someone asked me after class last time: I showed backpropagating through the whole sequence, but we're doing this at every time step, right? We backpropagate a loss from time step two, backpropagate the loss from time steps three, four, five, six, seven; we do it for each one. And in one of the slides last time, we then discussed how we sum all of those losses, or work out the average loss. But for this one loss, when we backpropagate it, what happens? What happens is we're going to do the same kind of chain rule where we're multiplying these partial derivatives at every time step. Here we've only got a few of them, but maybe we're going to have a sequence 30 long, and so we're going to be multiplying, at each step, the partial of h_k with respect to h_{k-1}. And so what kind of effect is that going to have? In particular, we might ask what happens if these are small, or what happens if these are large. Well, if they're small, the gradient will gradually get smaller and smaller and disappear as we backpropagate it along the sequence.
speaker 2: Yeah, Chris. So why are we taking the partial of h with respect to the previous h? Shouldn't we be taking the partial with respect to W?
speaker 1: Sure. I mean, we're doing that as well. But in general, we have to walk the partials along, and then we have a W at the next step. If we're thinking of the computation graph that we're doing the chain rule backwards along, we're going to be going through a W at each step and then arrive at another h, right? Yeah.
speaker 1: So I mean, at this point, you can do some math and think about things, and there are a couple of papers mentioned at the bottom here which I'm rushing ahead of and not going to go through very carefully. But the point is that if you're taking the partial of h_t with respect to h_{t-1}, and if you make a simplifying assumption and say, suppose there isn't a nonlinearity, suppose sigma is just the identity, then that partial will be the matrix W_h. And so if you keep on backpropagating along the recurrent neural network, what you end up with is powers of the matrix W_h. And then there's the question of what happens when you raise that matrix to higher and higher powers. At that point, you can represent the matrix in terms of its eigenvectors and eigenvalues, and then there are two possibilities. Either all the eigenvalues are less than one, which means the result will keep getting smaller and smaller as you raise the matrix to higher powers, or it can have eigenvalues that are larger than one, and then things will get bigger and bigger as you go further back. So essentially, as you backpropagate the gradients backwards, unless things correspond precisely to an eigenvector whose eigenvalue is approximately one, you're either going to get vanishing or explosion, and both of those are bad. So why is a vanishing gradient a problem? In a sense, you could think it's not a problem, it's what should be happening, because all else being equal, the closest words are the most relevant ones, and so that's where you should be updating your parameters the most. And to some extent that's true, but nevertheless, the vanishing gradient in this model happens much too severely, so that if you're looking at the loss from a later position and comparing it to the loss from an earlier position and seeing how things are updating, the update is primarily being determined by the very nearby loss and not by the far-away loss; the gradient signal from far away is much, much smaller. And that's bad, because for language modeling there are lots of cases where we want to be able to transmit signal over a long distance. So here's my piece of text: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ___." Yeah. So to a human being, it's obvious; we can predict this with pretty much probability one, really low perplexity for this decision. But that depends on getting back to "tickets", which is about twenty-odd words back, right? If you're just seeing "installing the toner into the printer, she finally printed her", it could be anything: her paper, her invitation, her novel, lots of things. You're certainly not going to guess "tickets". So we want to have these really long-distance dependencies, but we're only going to be able to learn them if we're actually getting sufficient signal between that position and where the word "tickets" appears near the beginning, so that we can learn that having "tickets" 20 words back is the good predictive cue for predicting "tickets" here.
And what we find is that when the gradient becomes very small, the RNN doesn't learn these kinds of long-distance dependencies, and so it's unable to make these predictions well at test time. This is a very rough back-of-the-envelope estimate, but what people actually found is that with the kind of simple RNN we've introduced up until now, the amount of effective conditioning you could get was about seven tokens back; if things were further back than that, the model just never learned to condition on them. So compared to when we were talking about n-grams, where I said usually the maximum people did was five-grams, occasionally a bit bigger, because of the exponential blowout: although in theory we now have a much better solution, in practice, because of vanishing gradients, we're only getting the equivalent of about eight-grams. So it feels like we haven't made that much progress. There's a reverse problem, which can also happen, of exploding gradients. If the gradient becomes very large because the eigenvalues of that matrix are large, then for the parameter update we've got a learning rate, but essentially, if the gradient is very large, we're going to make a very, very large parameter update. And that can cause very bad updates, because we're assuming that we're taking a step in the direction of the gradient; we might overshoot a little, but we'll be roughly in the right zone. But if we had an enormously exploded gradient, we could be walking off anywhere. We think we're heading to the Sierras and we end up in Iowa or something like that. We could go arbitrarily far, and where we end up might not be making any progress whatsoever. So exploding gradients are a problem. They can also cause infinities and NaNs, and they're always a problem when you're training models. Now, for dealing with exploding gradients, here is the accepted wisdom. This unfortunately isn't high-falutin math; what people actually use for exploding gradients is a crude hack: they clip gradients. But it works really well, and you really want to know about this, because clipping gradients is often essential to neural networks not having problems. So what we do for gradient clipping is we work out the norm of the gradient, and if it seems too large (that varies, but normally something like five, ten, or twenty is seen as the limit of what's okay for a gradient norm), you just scale it down in every direction and apply a smaller gradient update. It works. Yeah. So that problem is solvable, but fixing the vanishing gradient seemed a more difficult problem: this was the problem that our RNNs effectively couldn't preserve information over many time steps. And the problem there seems to be really that we've got an architecture that makes it very hard to preserve information. If we look at the hidden state from one time step to the next, it's completely being rewritten, right? We take the previous time step's hidden vector, we multiply it by a matrix, which in general completely changes it, and we add in other stuff from the input.
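A minimal sketch of the norm-based gradient clipping just described, assuming PyTorch; the threshold of 5.0 and the hypothetical model in the comment are illustrative. In practice PyTorch's built-in torch.nn.utils.clip_grad_norm_ does the same rescaling.

```python
import torch

def clip_gradients(parameters, max_norm=5.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm:
    if the norm is too large, scale it down and take a smaller step in the
    same direction, as described in the lecture."""
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((p.grad.detach() ** 2).sum() for p in params))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for p in params:
            p.grad.detach().mul_(scale)
    return total_norm

# In a training loop this sits between loss.backward() and optimizer.step();
# the built-in equivalent (model is a hypothetical nn.Module here) is:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```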
So if we'd like to carry forward information, if there's useful stuff in h_{t-1}, can you just keep it around for a while? It's not actually very easy to do in this formulation, because trying to learn W matrices that mostly preserve what was there before isn't at all an obvious thing to do. So the question was, could we design an RNN which had a sort of memory where it was easy to preserve information? Yes.
speaker 2: In that analysis you removed the nonlinearity. So doesn't having the nonlinearity, with its own sort of squashing, help to prevent the vanishing or exploding gradients, or does it actually not?
speaker 1: I mean, you can make an argument that it should help, because if you've got something like tanh, you've got a flattening function, so it should help somewhat. But it doesn't solve it, even if you're using a tanh nonlinearity. Well, I guess it should help with exploding, though actually even that still happens, but it definitely doesn't help with the vanishing.
speaker 2: That is one way to saturate the value, so you're always pushing the value so it's not going up or going down, it's staying between zero and one. But then, if you have a really small value, it becomes one minus a really small value times one minus another small value, and so on. Okay. Okay.
speaker 1: Yes. So can we have a different architecture, so that we have a memory that you can add to? And that led to this new kind of neural network, the LSTM. This is going back a few years, but anyway: this was trying to improve Siri's suggestions, and the big breakthrough being described was, oh, we're now using an LSTM in the keyboard prediction. And the whole advantage of that was being able to predict from context further back, so you could differentiate between "the children are playing in the park" versus "the Orioles are playing in the playoffs". Okay. So the big thing that was seen as very successful was these LSTMs, long short-term memories. Just to say a little bit of the history here, and how to parse this name, which I think people often don't really understand: what you wanted to do was model short-term memory. For humans, people normally distinguish between the short-term memory of stuff that you heard recently versus things that you've permanently stored away. And the suggestion was, well, in short-term memory, humans can remember stuff for quite a while: if you're having a conversation, you can still remember the thing the person said a few turns ago and bring it back up, oh, didn't you say they took last weekend off, or something. The problem was that for simple RNNs, their short-term memory was only about seven tokens, and we'd like to make it better than that. So we wanted a long short-term memory, and that's how the name came about. This was the type of recurrent neural network proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem. There's actually a second relevant piece of work that came a few years later: that first paper is the one everybody cites, but there's a second paper, by Gers and Schmidhuber in 2000, which actually introduces a crucial part of the LSTM as we've used it in the twenty-first century that wasn't in the original paper. And there's an interesting story to all of this: Jürgen Schmidhuber and his students did a lot of really crucial foundational work in neural networks in those years, the late nineties, when just about everybody else had given up on neural networks. So unlike these days, where doing pioneering work in neural networks is a really good way to get yourself a hugely compensated job at Google, Meta, or OpenAI, it really wasn't back then. If you ask, gee, what happened to these students, Hochreiter and Gers: both of them are still in academia, but Gers seems to have given up on AI and neural networks altogether and works in the area of multimedia. And Sepp Hochreiter is still in machine learning, but for quite a long time he basically gave up on more general neural network work and went into bioinformatics. If you look at his publications from about 2000 to 2015, they're mostly in bioinformatics, and most of them weren't using neural networks at all. Kind of nicely, he has actually gone back into neural networks more recently and is publishing in neural networks again. Yeah. So really not much attention was paid to this work at the time, and it only gradually seeped out. Schmidhuber had a later student in the mid-2000s decade, Alex Graves.
And Alex Graves did more work with LSTMs; for people who've seen speech recognition, where people commonly do CTC loss and decoding, Alex Graves invented that. But most crucially, Alex Graves then went to Toronto to be a postdoc for Geoff Hinton, and that brought more attention to the fact that LSTMs were a good model. Then Geoff Hinton went to Google in 2013, and the use of LSTMs at Google in the 2014 to 2016 period was when they really hit the world and became, for a while, a completely dominant framework for neural networks. In the world of, I guess, startups, this is what you call being too early. Yeah, okay. Long short-term memories: back to the science. So let's see, there's a slide here that talks about long short-term memories, but maybe I'll just skip straight ahead and start showing the pictures. We've still got a sequence of inputs x_t. The difference now is that inside our neural network, we're going to have two hidden things: one that's still called the hidden state, and another that's referred to as the cell state. And what we're going to do is modulate how these things get updated by introducing the idea of gates. Gates are calculated quantities, vectors whose values are probabilities between zero and one, and they're things that we're going to use to turn things on or shut them off in a probabilistic way. So we're going to control the movement of information by gating, and we're going to calculate three gating vectors. These vectors are the same length as our hidden states, and the way we calculate them is with an equation that looks basically exactly like what we were using for recurrent neural networks, except that the sigma there is definitely going to be the logistic function that goes between zero and one, so we get probabilities. The three gates we're going to calculate are: a forget gate, which says how much we remember of the previous time step's state (I think the forget gate was actually wrongly named; it makes more sense to think of it as a remember gate, because it's calculating how much you're remembering); an input gate, which says how much you're going to pay attention to the next input, the next x_t, and put it into your hidden state; and an output gate, which controls how much of what's in the cell, which is your primary memory, you're going to transfer over to the hidden state of the network. Once we have those gates, we have these equations for how we update things. The first thing we do is work out a potential new cell content. The new cell content is calculated using exactly the same kind of equation we saw last time for a recurrent neural network: we have these two matrices, the cell W and the cell U, and we multiply one by the last time step's hidden state and the other by the new input, and add on a bias. That's a potential update to the cell. But how we actually update the cell is by making use of our gates. The new cell content is going to be the old cell content Hadamard-producted with the forget gate; that's how much to remember of the previous cell content.
Plus this calculated update Hadamard-producted with the input gate: how much to pay attention to this new potential update that we've computed. And then for calculating the new hidden state, that's going to be the Hadamard product between the output gate and our c_t after it's been put through a tanh. One idea here is that we're thinking about how much to keep remembering what we've had in the past, but for thinking about only sending some information to the hidden state, a way to start thinking about that is that the hidden state of a recurrent neural network is doing multiple duty, right? One part of it is that we're going to feed it into the output to predict the next token. But another thing it does is store information about the past that might come in useful later, which we'd like to have carried through the sequence. And so really, only some of what's in the hidden state do we want to be using to predict the current word. Some of it isn't relevant to predicting the current word, but would be good to know for the future. So if the previous words were "set in", for predicting the next word we basically just need to know we're in a "set in" context, where "the" or "a" will come next. But if earlier on the sentence had been saying "the king of Prussia", somewhere in the hidden state we want to keep the information that there's a king of Prussia, because that might be relevant for predicting future words. So it makes sense that we only want some of what's in our memory being used to predict the next word in the current context. So the cell is our long short-term memory, and then we move over to the hidden state the things that are going to be relevant for generation. Yeah, I've sort of said that. Okay, all of these are vectors of the same length n: both the gates and the new values for the cell and hidden state are all vectors of length n. And part of how things get convenient when you're actually running these is that up until here, all of these things have exactly the same shape, so you can put them all together into a big matrix and do the computations of all four of them as one big matrix multiply. Yeah, you had a question?
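To make the equations concrete, here is a minimal NumPy sketch of one LSTM step as described above. The weight names, shapes, and the tiny usage example are assumptions made for illustration, not the exact notation from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts with keys 'f', 'i', 'o', 'c'
    holding the hidden-side matrices, input-side matrices, and biases for the
    forget gate, input gate, output gate, and candidate cell content."""
    # Gates: logistic outputs between 0 and 1, same length as the hidden state.
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])  # how much old cell content to keep
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])  # how much new content to write
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])  # how much of the cell to expose

    # Candidate new cell content: same form as a simple RNN update.
    c_tilde = np.tanh(W['c'] @ h_prev + U['c'] @ x_t + b['c'])

    # Additive cell update: elementwise (Hadamard) gating of old and new content.
    c_t = f_t * c_prev + i_t * c_tilde
    # Hidden state: gated, squashed view of the cell, used for predicting the next word.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example with hidden size 4 and input size 3 (random weights).
rng = np.random.default_rng(0)
n, d = 4, 3
W = {k: rng.normal(size=(n, n)) * 0.1 for k in 'fioc'}
U = {k: rng.normal(size=(n, d)) * 0.1 for k in 'fioc'}
b = {k: np.zeros(n) for k in 'fioc'}
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```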
speaker 2: If there were no tanh activation in the hidden-state update, couldn't the output gate have been expressed by the input gate and the forget gate?
speaker 1: If this bit wasn't here, then wow.
speaker 2: Because then it would have been able to express it, account for it in some sense. My question is, how much does having it help?
speaker 1: Right. Well, to the extent that you want to mask out part of what's in the cell so it's not visible when you're generating the next token, isn't it still useful to have an output gate?
speaker 2: You could essentially have h_t equal to c_t, right, if you don't have it.
speaker 1: But you don't want h_t equal to c_t; you want some of the contents of c_t to be masked out, so you're not seeing it when generating the output.
speaker 2: Couldn't that be accounted for by the forget gate?
speaker 1: No, because you want to keep it in c_t. There's information you want to keep in c_t for the future, but that you don't want visible when generating the current next word. Yeah, in some sense, a bit. The part I have the hardest time explaining is why it's necessarily better to have a tanh here. You can sort of argue that it's a way of letting the cell hold unbounded real numbers, and then this squashes it back into a range between minus one and one, which is good for the hidden state. But it's a little bit... I guess they did it that way and it seemed to work well. Okay, here's another way of looking at it, which may or may not be more helpful, as a picture. At each time step we've got, as before, an input and a hidden state, and then we calculate an output from that hidden state. But we've got this more complex computational unit, and these pictures of it are diagrams that were made by Chris Olah, who now works at Anthropic. If you blow that up, it's showing the computation. You're feeding the cell c along as the primary recurrent unit, but you've also got h carried along, because h is being used to calculate stuff at the next time step, and then a new h is being generated. So you compute the forget gate; you forget some of the cell content. You compute an input gate; you use that, along with a potential new cell content, and you write some of that into the cell depending on the input gate. Then you compute an output gate, and some of the cell goes into the computation of h, depending on the output gate. And then, just like for the previous recurrent neural network, to work out the predicted next word, you compute an output layer by taking the h, multiplying by another matrix and adding a bias, and then putting a softmax on that to actually predict the next word. Okay. So this all seems very complex. And, do you have a question? Yeah.
speaker 2: So how are we deciding what to remember and forget? I imagine there's just some sort of threshold on the probability of what we're remembering and what we're forgetting.
speaker 1: So it's not that there's a single threshold, right, because we're actually calculating a whole vector of forgetting and remembering. So it can choose to say, okay, dimensions one to 17, keep all of that, and throw away dimensions 18 to 22, or really, probabilistically, to different extents. And so it's sort of unspecified; it's up to the model what it learns. But we're hoping it will learn that certain kinds of information are useful to keep carrying forward for at least a while. And then it can use both the contents of the hidden state and the cell, sorry, and the next input, to decide to throw away certain information. So we might think there are certain cues: for example, if it sees the word "next", it might think, okay, change of topic, now would be a good time to forget more stuff and reset. But it's learning which dimensions of this vector to hold onto in an unconstrained way, whatever's useful to do better at language modeling. Okay. Yeah. So this all looks like a very complex and cantankerous design. And quite honestly, when teaching this around 2016 or 2017, when this was the best kind of neural network we had for language modeling, we literally spent hours of class time going through LSTMs and variants of LSTMs with different properties, because there are different ways you can do the gating; you can have fewer gates or more gates and do different things. And it seemed the most important thing to know. In 2024, it's probably not the most important thing to know, but LSTMs are a thing to be aware of. We are going to use them for assignment three. But you can just ask PyTorch for an LSTM and it'll give you one that does all of this. There is one thing, though, that I really want to focus on: why, what is the good thing that an LSTM achieves? And really the secret of why you get this fundamentally different behavior in an LSTM is that you have that plus sign right there. For the simple recurrent neural network, at each time step the next hidden state was the result of multiplicative operations, and therefore it was very hard just to preserve information. Whereas the essence of the LSTM is to say, well, look, you've got this past memory of stuff you've already seen, and what we want to do is add some new information to it, which fundamentally seems kind of right for human memories, that they're basically additive. And when I said that it was actually the second, Gers, paper that introduced a crucial part of the LSTM: the first version of the LSTM didn't have the forget gate, so it was a purely additive mechanism where you were deciding what to add to your memory as you went along. But that proved to be not quite perfect, because if you keep adding more and more stuff over a long sequence, that tends to be dysfunctional after a certain point. And so the big improvement was to add this forget gate, so that some of it could go away. But nevertheless, having things basically additive fixes the problem of gradient flow. You no longer have vanishing gradients, and it makes the thing seem much more memory-like: you're adding to the things you know. Okay? So the LSTM architecture allows you to preserve information over many time steps in the cell: if you set the forget gate to one and the input gate to zero, you just linearly pass the same information along in the cell indefinitely, okay?
It's not the only way that you can do long-distance information flow, and we're going to look increasingly in future lectures at other ways of doing long-distance information flow. Just to give a bit of a peek at those now and to think about other architectures... but there's a question? No, no question. Yes.
speaker 2: So since you mentioned that the plus handled the vanishing gradient, does it also help with exploding gradients, or does it make it worse? Is there no difference?
speaker 1: It also helps with exploding gradients, because of the fact that you're not doing this sequence of multiplies all the time; you have this addition operator instead. So one thing you could wonder is, is the vanishing and exploding gradient just a recurrent neural network problem? And it's not. I mean, it occurs earlier and worse when you've got long sequences, but if you start building a very deep neural network, surely the same thing is happening. The parameters aren't the same, so it's not quite just raising one matrix to a power, but surely, depending on your matrices, you tend to have the same problem: either your gradients are disappearing or else they're exploding. And that's what people found. That was part of the reason why, in the early days, people weren't very successful at building deep neural networks: they suffered from problems of this sort. If you had vanishing gradients in a deep neural network, you got very little gradient signal in the lower layers, therefore their parameters didn't really update, therefore your model didn't learn anything in the lower layers, therefore the network didn't work well. And that was part of why things were stuck, in the days around the early 2000s, and deep networks didn't work. So there are other ways you can think about fixing that. One common way is to add more direct connections. The problem when we went through our recurrent step was that we had this in-between stuff of doing a matrix multiply and so on, and that caused indirectness and the possibility for things to either explode or vanish. Now, this network is drawn upside down, because I stole the picture from the paper, so we'll just have to deal with that: we're going downwards from here to the next layer. So rather than going through weight layer after weight layer, which starts to produce the same kind of problems, what you can do is apply the same trick in a vertical network and say, well, look, I can also just carry the input around with an identity function and add it on here. And so then I've got this direct carrying of information. That led to the residual network, which was what completely transformed computer vision models and made them much more learnable than plain networks that lack these residual connections. If you start heading down that path, you can think, well, why only provide these residual loops that take you one step? Maybe I could directly connect each layer to all the successive layers. People played with that idea, and that led to the so-called DenseNet, where you have these skip connections linking to every other layer. And a variant of the residual network, which was actually again introduced by Schmidhuber and students, was to say, rather than just directly adding the input to the output of the neural network layer, maybe we'd be better off having gating, so that you decide by gates how much of the input to skip around. That led to the highway network, a sort of gated residual network. So there are various ideas for doing that; I'm not going to say more about them right now. I want to skip ahead, do the rest of RNNs, and get on to machine translation. Okay?
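A minimal PyTorch-style sketch of the residual ("skip") connection idea just described; the layer sizes and the two-linear-layer body are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A layer whose output is input + F(input): the identity path gives the
    gradient a direct route backwards, analogous to how the LSTM's additive
    cell update preserves gradient flow along the time dimension."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # residual connection: add the input back in

x = torch.randn(8, 64)
y = ResidualBlock(64)(x)       # same shape as x, with a direct identity path
```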
So once you have RNNs, where RNNs includes LSTMs, and normally in practice that means LSTMs, you can use them for anything else where you're dealing with sequences, and there are lots of places they're used in NLP. If you want to assign words parts of speech, like nouns and verbs, that would commonly be done with an LSTM part-of-speech tagger. If you want to assign named entity labels, like location: I did a toy version where we assigned a label to the middle of a window, but if you want to assign a label at each position, you can use an LSTM for named entity recognition. You can use an RNN as an encoder model for a whole sentence. If we want to do sentiment classification, to see whether a piece of text is positive or negative, we can run an LSTM over it and then use that as a representation of the sentence to work out whether it's a positive or negative piece of text. The simplest way of doing that is to use the final hidden state, because after all, that final hidden state is the one you've got from having seen the entire sentence, and then have a classification layer, a logistic regression, on top of it to give you positive or negative. In practice, though, people have found it's often better to use every hidden state and take some kind of mean or element-wise max and feed that in as the sentence encoding. You can also use RNNs for lots of other purposes where you're generating text based on other information. If you want to do speech recognition or summarization or machine translation, which we'll come to later, you can have an input source which you use to condition your network, and then you generate the transcription or the translation, as we'll see later. We refer to those as conditional language models, because rather than just generating text starting from nothing, from a start token, we're generating it conditioned on some source of information. One other idea about what normally happens when people use these: I suggested that we could do this averaging at each position. If you think about these hidden state representations, a representation isn't only about the word "terribly"; it has some information about what came before it, "the movie was terribly", but it has no information about what comes after it. And you might think you'd like to have a representation of "terribly" that knows what came before it, but also what came after it. So people came up with the next obvious idea to deal with that, which was to build a bidirectional LSTM. You run a forward LSTM, and then you start another LSTM, shown in that sort of greenish teal, and you run it backwards. Then you have a forward and a backward vector at each position, and you just concatenate them, and you have a two-sided context for a representation of word meaning. These networks were pretty widely used. So we were running the forward RNN and the backward RNN and concatenating the states together, and those were commonly written like this, to suggest compactly that you're running a bidirectional RNN. These were very popular for language analysis. They weren't workable if you wanted to generate text, but they were used in a lot of places as a representation. More recently, though, transformer models have normally taken over from that.
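A minimal sketch of the sentence-encoding use just described, assuming PyTorch's built-in nn.LSTM; the vocabulary size, dimensions, class names, and mean pooling are illustrative choices, not anything prescribed in the lecture.

```python
import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    """Run a bidirectional LSTM over a sentence and pool the hidden states
    into a single vector, e.g. for sentiment classification."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        h = self.embed(token_ids)              # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(h)              # (batch, seq_len, 2*hidden_dim), fwd+bwd concatenated
        sentence_vec = outputs.mean(dim=1)     # mean over positions (element-wise max is also common)
        return self.classifier(sentence_vec)   # logits for positive / negative

logits = BiLSTMSentenceEncoder()(torch.randint(0, 10000, (4, 12)))
```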
One more idea, which we'll see for machine translation: RNNs are deep in the sense that they unroll over many time steps, but up until now they've been shallow in the sense that we had just one hidden state. You can also make them deep by having multiple layers of hidden states; these are commonly called stacked RNNs, so you have several layers of RNNs built on top of each other. You might wonder, does this really do anything? Aren't they just big vectors above the words? But precisely because you have this extra neural network layer between here and here, you get exactly the same power advantage you get otherwise with neural networks: you can do successive layers of feature extraction, and so you get more power out of your neural network. To some extent, what people found with RNNs in those days is that having multiple layers definitely helps, but unlike what was happening at the time with other kinds of neural networks for vision and so on, people still used relatively shallow RNNs. You always got a lot of gains by having two layers rather than one, but commonly it was more iffy whether you got extra value from three or four layers. So commonly people ran two- or three-layer LSTMs, and that's what people were using. That's completely changed in the world of transformers, where nowadays people build very deep transformer networks for language understanding. Okay. But I should skip ahead and say a few words, before time runs out, about machine translation. Machine translation is one of the key natural language processing tasks, where we're translating sentences in one language into sentences in another language. We start off with a sentence in some language, here French, and what we want to do is output it in a different language, here English. So machine translation
was actually where NLP started, right? In the early fifties, there wasn't artificial intelligence yet, there wasn't a field of NLP yet, but people started to work on machine translation. The story of why people started to work on machine translation was essentially that computers were first developed during the Second World War, and during the Second World War, computers were used for two things. One was calculating artillery tables, to work out what angle to set your gun at to get the shell to land in the right place. Not very relevant to what we're doing. But the other thing computers were used for was code breaking. After the Second World War, things moved very quickly into the Cold War, and there were concerns on both sides about keeping up with the science being developed on the other side. People had the idea of, gee, maybe we could think of translation between languages as being like code breaking. That thought occurred to important, relevant people in science funding agencies, and lots and lots of funding was poured into this idea of, can we use computers to do machine translation between languages? And at the time, in the fifties, after some initial very impressive-looking cooked demos, it was basically a complete flop. There are lots of reasons why it was a complete flop. One was that people knew almost nothing about the structure of human languages. In particular, when I mentioned the Chomsky hierarchy the other day, and knowing about context-free languages: the Chomsky hierarchy hadn't even been invented yet; the formal properties of languages hadn't really been explored. But also, the computers that people had in the 1950s: the amount of computing power or memory those computers had was laughable. These days, the little power brick for your laptop has more computing power inside it than the big mainframe computers they were using in those days. So basically, people were only able to build very simple lexicons and rule-based substitution rules, nothing like the complexity of human languages, which people only gradually began to understand. But machine translation started to become more alive in the 1990s and 2000s decades, once people started to build empirical models over lots of data. The approach then was called statistical machine translation. So when Google Translate was first introduced, it was the big unveiling to the world of statistical phrase-based machine translation systems, where what you were doing was collecting a large amount of parallel data, sentences that have been translated from one language to another. Not for all languages, but for quite a few languages, there are quite a few sources of parallel data: the European Union generates a huge amount of parallel data among European languages; there are places like Hong Kong where you get English-Chinese parallel data, for a certain dialect of Chinese; the UN generates a lot of parallel data. So you get sources of parallel data and try to build models. And the way it was done was that, based on that data, we try to learn a probability model for translation: the probability of a translation given a source sentence.
And the way it was done at that time was to break it down, using Bayes' rule, into two sub-problems. The probability of the translation given the source is proportional to the inverted probability, the probability of the source given the translation, times the probability of the translation. You could think that this makes it no simpler, because you've just reversed the order of x and y. But the reason why it made things simpler, and people were able to make progress, was that the translation model was treated as a very simple model of how words tended to get translated into words in the other language, and it didn't need to know anything about word order, grammar, or the structure of the other language. All of that was handled by this probability of y, which was a pure language model, as we've talked about before. So you could have a simple translation model which just said, if you see the word "homme" in French, you might want to translate it as "man" or "person", and put some probabilities on that. Most of the cleverness was in the language model, which was telling you what would be a good sentence in the target language. Okay?
And that was important because translations get pretty complicated, right? You not only have to know how to translate words, and those translations vary in context, but you get a lot of reordering of words in sentences. I'm not going to be able to spend a lot of time on this, but here, for a while, was my favorite example machine translation sentence. This is actually a translated sentence: the original comes from the book Guns, Germs, and Steel, if you're familiar with that, the book by Jared Diamond. The book was translated into Chinese, so here's a sentence from the book in Chinese. In the 2000s decade I was involved in building statistical machine translation systems, and there was an MT evaluation we did where our system did terribly on this sentence. I tried it out on Google Translate, and it also did terribly on this sentence. What the sentence should say is: "In 1519, 600 Spaniards landed in Mexico to conquer the Aztec empire with a population of a few million. They lost two thirds of their soldiers in the initial clash." Here's what Google Translate said: "In 2009, 1519, 600 Spaniards landed Mexico, millions of people to conquer the Aztec empire, the first two thirds of soldiers against their loss." Now, it's partly bad because the word choices in the translations aren't very good, but it's especially bad because it's not actually able to capture and use the modification relationships of the sentence. Here's the part of the Chinese that says "the Aztec empire", and over there in orange is "the few million people". In Chinese there's this explicit little character, 的 (de), which says that the stuff in orange modifies the stuff in green, which is what gives you, in the correct translation, "the Aztec empire with a population of a few million". But Google Translate completely fails on that, and suddenly it's the millions of people who are going to be conquering the Aztec empire, and that's sort of the worst thing happening here, though the "2009, 1519, 600" isn't exactly a very good translation either, and "the first two thirds of soldiers against their loss" isn't very good either. So for a while, I used to update this and see what happened. In 2013, it almost seemed like progress had been made, but by 2015 it had gone back downhill to how it was before. So it seemed like they got lucky in 2013 rather than the systems working any better. And indeed, that seemed to be the problem: although some kind of progress had been made in machine translation, these systems just never really worked all that well. And so that led to this amazing breakthrough in 2014, when we moved to neural machine translation, and neural machine translation was much better. So what did we do in neural machine translation? We built the neural machine translation system as a single end-to-end neural network. That's been a powerful idea in neural network systems in general, including in NLP: if we can have a single big system and put a loss function at the end of it, and then backpropagate errors right back down through the system, it means we're aligning all of our learning with the final task we want to do. That's been very effective, whereas earlier models couldn't do that. So we built it with a sequence-to-sequence model.
So that sounds like our LSTMs, but it means we're going to have two of them: one to encode the source sentence, and one to produce the target sentence. That's what we're building. For the source sentence, the slide says RNN, but let's just think LSTM, because that's what we use in practice; it's much better. We chunk through it, encoding what we've read using an RNN. This RNN isn't going to output anything; we're just building up a hidden state that knows what's in the source sentence. So we get an encoding of the source sentence, and we use that final hidden state to condition the decoder RNN, which is then going to generate the translation. The decoder RNN is also an LSTM, but it's an LSTM with different parameters. So we're learning one LSTM with source-encoding parameters, and then for the other language, we're learning a different LSTM that knows all about the target language. We give it a start token, and we feed in what was encoded by the encoder RNN as its starting point, counting as the previous hidden state fed into the LSTM. Then we generate the first word of the translation, copy that translated word down, using this as a generative model, as I did last time, and we translate through: "he hit me with a pie". Okay. So that sort of makes sense of the model. Yeah, okay. There are some notes, sorry. What I was going to say: the little pink note here. What I was showing you was the picture of using it at run time. At run time, we encode the source and then generate the words of the translation. At training time, we have parallel text, sentences and their translations. We run the same architecture, but as before, for the decoder network we try to predict each word and then ask, what probability did you assign to the actual next word? That gives us a loss. We calculate the losses at each position, work out the average loss, work out the gradients, backpropagate them through the entire network, both the decoder RNN and the encoder RNN, and update all the parameters of our model. That's the sense in which it's being trained end to end. Okay. So this is our general notion of an encoder-decoder model, which is a very general thing we use in all kinds of places: we have one network that encodes something, producing a representation, which then feeds into another network that we use to decode something. Even when we go on to do other things, like use transformers rather than LSTMs, we still commonly use these kinds of encoder-decoder models, because if we want to do not only machine translation but other tasks like summarization or text-to-speech, we're going to be in this space of using encoder-decoder networks. Yeah?
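A minimal sketch of the sequence-to-sequence setup just described, assuming PyTorch and the teacher-forced training described in the pink note; the vocabulary sizes, dimensions, and class name are made-up for illustration, and real systems add batching with padding, multi-layer LSTMs, and beam search.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder translation model: one LSTM (with its own parameters)
    encodes the source sentence; its final state conditions a second LSTM
    that scores / generates the target sentence."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # Encode the source; keep only the final (h, c) state as the sentence encoding.
        _, state = self.encoder(self.src_embed(src_ids))
        # Decode conditioned on that state; at training time tgt_in_ids is the
        # gold target shifted right (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in_ids), state)
        return self.out(dec_out)              # (batch, tgt_len, tgt_vocab) logits

# Training-step sketch: average cross-entropy over target positions,
# backpropagated end to end through both decoder and encoder.
model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 7))
tgt_in = torch.randint(0, 8000, (2, 6))       # <s> w1 ... w5
tgt_out = torch.randint(0, 8000, (2, 6))      # w1 ... w5 </s>
loss = nn.CrossEntropyLoss()(model(src, tgt_in).reshape(-1, 8000), tgt_out.reshape(-1))
loss.backward()
```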
speaker 2: Could you just use a deeper neural network with more layers?
speaker 1: Well, you're meaning, why don't you just build on top of the source, right? People have tried that occasionally; it's never been very successful. And I think part of the reason is all of what I was trying to show before, that the word order changes around a lot between languages, and if you're just trying to build stuff on top of the source sentence, it's very hard to cope with that. In particular, it's not even the case that the length stays the same. One of the big ways in which languages vary is what little words they have: in English you put in a lot of these auxiliary verbs and articles, whereas in Chinese you don't have any of those. So, depending on the direction, you need to add a lot of words or subtract a lot of words, which is very hard to do if you're building directly on top of the source. Was there a quick question?
speaker 2: Yeah, the left side, the encoder, is that bidirectional or just unidirectional?
speaker 1: You could totally do that, and it could be that the encoder is bidirectional, and that might be better. For the famous original instantiation of this that was done at Google, they actually didn't make it bidirectional, so it was simply taking the final hidden state. But that's absolutely an alternative you could do. Okay. Yeah.
speaker 1: of said it was okay, usable for lots of things. Okay. Yeah. So this is our conditional language model. So we're now kind of directly calculating the probability of y given x, right? That the decoder model is generating language expression as a language model directly conditioned on x. And so we train it with a big parallel corpus. And that's the only case I'm going to talk about today. Recently, there's been sort of some interesting work on unsupervised machine translation, meaning that you're got only a little bit of information about how the languages relate. You don't really have a lot of parallel text, but I'm not going na cover that today. Yeah. So for training it, we have paired sentences. We work out our losses in the predictions at each position and then we're working out our average loss and back propagating it through in a single system end to end as described. Yeah. So in practice, when people built big machine translation systems, this was one of the places where absolutely it gave value to have multi layer stacked lstms. And so typically, people were building a model and you'll be building a model something like this that's a mullayer lstm that's being used to encode de and decode. In my two minutes remaining, I just want to sort of quickly say, so building these neural machine translation systems was really the first big success of natural language processing, deep learning. Now, in this sense, now it depends on how you define what parts of language. If you look at the sort of the history of the Renaissance of deep learning, the first place where deep learning was highly successful was in speech recognition systems. The second place in which it was highly successful was then object recognition and vision. And then the third place that was highly successful was then building machine translation systems. So, you know, Google had a big statistical machine translation system. And it was it was only in 2:14 that people first built this sort of lstm deep learning machine translation system. But it was just sort of obviously super good. And it was so super good that in only two years, it was then deployed as the live system that was being used at Google. But it wasn't only used in Google. That neural machine translation was just so much better than what had come before that by a couple of years after that, you know, absolutely everybody, both us companies and Chinese companies, Microsoft, Facebook, ten cent baido, everybody was using nemachine translation systems because there are just much better systems. And so this was an amazing success, right? Because statistical machine translation systems, like the Google system, that this is something that had been worked on for about a decade, hundreds of people had worked on it. There are millions of lines of code, lots of hacks built in for particular languages and language pairs. But really, a simple, small neural machine translation system was able to work much better. There was an article published about it when it went live in the New York Times that you can find in that link. It's a little bit of a praising piece where you could be a little bit critical, but you know basically it's sort of talking about how just the difference in quality was so obvious that everyone immediately noticed, even before Google had announced it, of, wow, suddenly machine translations has gone so much better. Okay, so that's basically today. So for today, you know, we've learned t, that lstms are powerful. 
If you're doing something with a recurrent neural network, you probably want to use an LSTM. You should know about the idea of clipping your gradients. Bidirectional LSTMs are good when you've got an encoder, but you can't use them to generate new text. And encoder-decoder neural machine translation systems were a great new technology that advanced the field. Thank you.

Latest Summary (Detailed Summary)

Generated 2025-05-15 22:17

Executive Summary

This lecture reviews and deepens the material on language models and recurrent neural networks (RNNs), focusing on Long Short-Term Memory networks (LSTMs) as an advanced form of RNN designed to address the vanishing and exploding gradient problems of simple RNNs. The lecture details the internal structure of LSTMs, including the forget, input, and output gates, as well as the central cell state and its additive update mechanism, which lets LSTMs learn long-distance dependencies more effectively. It also covers applications of RNNs (particularly LSTMs) to a range of NLP tasks such as part-of-speech tagging, named entity recognition, and sentence encoding, and introduces the structure and advantages of bidirectional and multi-layer RNNs. Finally, the lecture turns to machine translation (MT), tracing the evolution from early rule-based and statistical machine translation (SMT) approaches (such as the Bayes-rule decomposition into a translation model and a language model) to neural machine translation (NMT). NMT uses an end-to-end sequence-to-sequence model, typically composed of an encoder and a decoder (both can be LSTMs); this architecture dramatically improved translation quality and became one of the early major successes of deep learning in NLP.

Language Model Evaluation and the Challenges of RNNs

Language Model Evaluation: Perplexity

  • Overview: beyond eyeballing generated text, the more rigorous way to evaluate a language model is perplexity.
  • How it works
    • Measures how well the language model predicts the word sequence of real text (held-out evaluation data not used for training).
    • Computed as the geometric mean, over the whole text, of the inverses of the model's predicted probabilities.
    • Equivalent to the exponential of the cross-entropy: Perplexity = exp(cross-entropy) (see the sketch after this list).
    • Note: the log base used (e.g. base 2 or natural log e) affects the raw numbers, so comparisons must use a consistent base.
  • Historical origin: Fred Jelinek (IBM) introduced perplexity so that AI researchers unfamiliar with information theory could interpret model performance more intuitively, as the equivalent number of equally likely choices the model faces at each step.
  • Trend: lower perplexity means better model performance.
  • Reported numbers
    • Traditional n-gram models (e.g. interpolated Kneser-Ney smoothing): about 67
    • Early RNNs combined with other models: about 51
    • With the arrival of LSTMs: down to 43, even 30
    • Modern large language models: single-digit perplexities
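As a concrete illustration (a minimal sketch, not from the lecture; the helper name perplexity and the toy numbers are assumptions), perplexity can be computed directly from per-token log-probabilities of held-out text:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of held-out text.

    Equivalent to exp(cross-entropy): the geometric mean of the inverse
    predicted probabilities. If the model reports log base 2, use 2 ** avg_nll
    instead of math.exp so the numbers stay comparable.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 1/67 to every token has perplexity ~67,
# i.e. it is as uncertain as a uniform choice among 67 options at each step.
print(perplexity([math.log(1 / 67)] * 10))  # ≈ 67.0
```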

RNN Gradient Problems: Vanishing and Exploding Gradients

  • Background: in an RNN, the loss is backpropagated through time (Backpropagation Through Time, BPTT) to update the weights.
  • Vanishing gradients
    • Cause: during backpropagation the gradient is a product of many Jacobian matrices (∂h_k/∂h_{k-1}). If their eigenvalues (in the simplified analysis, the eigenvalues of powers of the weight matrix W_h) stay below 1, the gradient decays exponentially, so the gradient signal from time steps far in the past becomes vanishingly small.
    • Effect: the model struggles to learn long-distance dependencies. For example, predicting a word at the end of a sentence may depend on information 20 words back, but because of vanishing gradients the model may only make effective use of roughly 7 words of context.
    • Example from the lecture: "When she tried to print her tickets, she found that the printer was out of toner... After installing the toner into the printer, she finally printed her tickets." A human predicts the final word easily, but an RNN may fail to connect back to the distant "tickets" because of vanishing gradients.
  • Exploding gradients
    • Cause: if the eigenvalues of the weight matrix are greater than 1, the gradient grows exponentially during backpropagation.
    • Effect: parameter updates become too large, destabilizing training and potentially producing NaN or infinity values.
    • Solution: gradient clipping (see the sketch after this list)
      • Compute the norm of the gradient.
      • If the norm exceeds a preset threshold (e.g. 5, 10, 20), scale the gradient down proportionally.
      • The lecture calls this a direct but effective trick (a "crude hack").
  • Effect of nonlinear activations: for nonlinearities such as tanh, the lecture notes that saturation can help somewhat with exploding gradients, but it does not solve vanishing gradients.
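A minimal sketch of the clipping step described above (assuming PyTorch tensors; the function name clip_gradients and the threshold of 5.0 are illustrative):

```python
import torch

def clip_gradients(parameters, max_norm=5.0):
    """Rescale all gradients so their global L2 norm is at most max_norm."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.norm(torch.stack([torch.norm(g) for g in grads]))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)          # shrink every gradient by the same factor
    return total_norm

# PyTorch ships an equivalent utility:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
```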

Long Short-Term Memory Networks (LSTMs)

Motivation and History of LSTMs

  • Motivation: simple RNNs have trouble preserving information over long spans; they completely rewrite the hidden state at every time step, making it hard to learn to "keep" information.
  • Goal: design an RNN architecture in which long-term memory is easier to preserve and access.
  • Name: "Long Short-Term Memory" refers to extending the limited "short-term memory" of simple RNNs so that it can span longer dependencies.
  • History
    • Hochreiter & Schmidhuber (1997): first proposed the LSTM.
    • Gers & Schmidhuber (2000): introduced the forget gate, a key component of the modern LSTM.
    • The early work received little attention, since neural networks were out of favor at the time.
    • Work by Alex Graves (a student of Schmidhuber), and his subsequent collaboration with Jeff Hinton, brought LSTMs into the mainstream.
    • Google (roughly 2014-2016): the use of LSTMs at Google made them the dominant model in NLP at the time.

Core Structure of LSTMs

  • Core components
    • Cell state (c_t): the heart of the LSTM, an information highway through which information can flow with only a few linear operations. This is the key to addressing vanishing gradients.
    • Hidden state (h_t): analogous to the ordinary RNN hidden state; it is the output at the current time step and feeds into the next step's computation.
    • Gates: control how information flows into and out of the cell state and how it is exposed to the hidden state. Gate units use a sigmoid activation, producing values between 0 and 1.
      1. Forget gate (f_t): based on the previous hidden state h_{t-1} and the current input x_t, decides how much of the previous cell state c_{t-1} to keep or forget.
        • f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
        • The lecture notes it might better be called a "remember gate", since it determines how much is kept.
      2. Input gate (i_t): decides how much of the new candidate information ~c_t is added to the cell state.
        • i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
        • Candidate cell state: ~c_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
      3. Output gate (o_t): decides how much of the cell state c_t is exposed to the hidden state h_t.
        • o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
  • State update equations (a code sketch follows this list)
    • Cell state update: c_t = f_t * c_{t-1} + i_t * ~c_t
      • Key point: the addition (+) here is what lets the LSTM pass gradients through effectively and avoid vanishing gradients; information can be carried forward almost linearly.
    • Hidden state update: h_t = o_t * tanh(c_t)
  • Advantages
    • Through its gates, the LSTM can learn to selectively read, write, and forget information at different time steps.
    • The additive cell-state update substantially alleviates vanishing gradients, so the model can learn longer dependencies.
    • If the forget gate is set to 1 (keep everything) and the input gate to 0 (write nothing new), information is carried through the cell state unchanged.
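The gate and update equations above translate almost line for line into code. A minimal single-time-step sketch (assuming PyTorch; the params dictionary layout is an illustrative assumption, not a library API):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    params is an illustrative dict with weight dicts "W" (hidden x hidden),
    "U" (hidden x input) and biases "b" (hidden), each keyed by 'f', 'i', 'o', 'c'.
    """
    W, U, b = params["W"], params["U"], params["b"]
    f_t = torch.sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])   # forget gate
    i_t = torch.sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])   # input gate
    o_t = torch.sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])   # output gate
    c_tilde = torch.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # additive update: the key to gradient flow
    h_t = o_t * torch.tanh(c_t)           # expose a gated view of the cell state
    return h_t, c_t
```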

Other Architectures that Address Gradient Problems

  • The lecture briefly notes that vanishing/exploding gradients also occur in very deep non-recurrent networks.
  • Residual networks (ResNets): "skip connections" add the input directly to the output of later layers, making gradients easier to propagate (see the sketch after this list).
  • Dense networks (DenseNets): every layer is directly connected to all subsequent layers.
  • Highway networks: a gated form of residual network that lets the network learn how much information to pass through the skip connection.
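A minimal sketch of the skip-connection idea (assuming PyTorch; layer and gate stand for arbitrary sub-networks such as nn.Linear modules and are illustrative):

```python
import torch

def residual_block(x, layer):
    """Skip connection: the layer only has to learn a correction to x."""
    return x + layer(x)

def highway_block(x, layer, gate):
    """Gated residual: t in (0, 1) mixes the transformed and the carried input."""
    t = torch.sigmoid(gate(x))
    return t * layer(x) + (1 - t) * x
```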

Other Applications of RNNs (including LSTMs)

  • Part-of-speech tagging
  • Named entity recognition (NER)
  • Sentence encoding
    • Used for tasks such as sentiment classification.
    • An RNN (e.g. an LSTM) can process the whole sentence and use its final hidden state as the sentence representation.
    • In practice, averaging the hidden states over all time steps, or taking an element-wise max, may work better (see the sketch after this list).
  • Conditional language models: generate text conditioned on some input.
    • Speech recognition
    • Text summarization
    • Machine translation
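For the sentence-encoding options mentioned above, a minimal sketch (assuming a PyTorch matrix of per-time-step LSTM hidden states; names are illustrative):

```python
import torch

def sentence_encodings(hidden):
    """hidden: (seq_len, d_hid) matrix of LSTM hidden states for one sentence."""
    final = hidden[-1]                    # final hidden state
    mean = hidden.mean(dim=0)             # average over all time steps
    max_pool = hidden.max(dim=0).values   # element-wise max, often stronger in practice
    return final, mean, max_pool
```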

Bidirectional and Multi-layer RNNs

Bidirectional RNNs

  • Motivation: a standard RNN hidden state h_t encodes only the past up to time t, but for many tasks understanding a word requires both its left and right context.
  • Structure (see the sketch after this list)
    • A forward RNN processes the sequence left to right and a backward RNN processes it right to left.
    • At each time step t, the forward and backward hidden states are concatenated to form the final representation for that position.
  • Advantage: provides richer contextual information.
  • Limitation: not usable for language models that must generate text, since future context is unavailable during generation.
  • Status: once very popular for analysis tasks, now largely superseded by Transformer models.
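A minimal sketch of the concatenation (assuming PyTorch's nn.LSTM, which supports bidirectional=True; the toy sizes are illustrative):

```python
import torch
import torch.nn as nn

# bidirectional=True runs a forward and a backward LSTM and concatenates
# their hidden states at every position (output width 2 * d_hid).
d_emb, d_hid = 128, 256
bilstm = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)

x = torch.randn(1, 10, d_emb)   # one sentence of 10 word vectors (toy input)
out, _ = bilstm(x)
print(out.shape)                # torch.Size([1, 10, 512]): forward ++ backward per position
```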

Multi-layer (Stacked) RNNs

  • Motivation: add depth so the network can learn more complex feature representations.
  • Structure: stack multiple RNN layers, with each layer's outputs serving as the next layer's inputs (see the sketch after this list).
  • Advantages
    • Multi-level feature extraction increases model capacity.
    • Two or three LSTM layers usually give a clear improvement, with diminishing returns beyond that (unlike later Transformer models, which can be built very deep).
  • Applications: multi-layer LSTMs proved effective for tasks such as machine translation.
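A minimal sketch of stacking (assuming PyTorch's nn.LSTM; the sizes and dropout value are illustrative):

```python
import torch.nn as nn

# Stacking is a constructor argument: each layer's hidden states become the
# next layer's inputs. Two to three layers is typically where LSTMs gain most.
stacked = nn.LSTM(input_size=256, hidden_size=512, num_layers=3,
                  batch_first=True, dropout=0.2)  # dropout applied between layers
```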

Machine Translation (MT)

Traditional Machine Translation: Statistical MT (SMT)

  • Historical background
    • MT is one of the oldest areas of NLP research (starting in the 1950s), originally inspired by Cold War code-breaking.
    • Early rule-based approaches failed for lack of linguistic knowledge and computing power.
    • From the 1990s through the 2000s, statistical machine translation (SMT) rose to prominence, relying on large parallel corpora.
    • Early versions of Google Translate were representative SMT systems.
  • Core idea of SMT
    • Goal: find the most probable translation, P(Y|X), where X is the source-language sentence and Y the target-language sentence.
    • Decompose with Bayes' rule: P(Y|X) ∝ P(X|Y) * P(Y) (written out after this list).
      • Translation model P(X|Y): captures correspondences between words and phrases; relatively simple.
      • Language model P(Y): judges the fluency and plausibility of the target sentence; carries most of the complexity.
  • Challenges
    • Large differences in word order.
    • Long-distance dependencies and complex structures are hard to handle.
    • The lecture uses a Chinese-to-English example (a sentence from Jared Diamond's Guns, Germs, and Steel) to show how SMT systems (including early Google Translate) mishandled complex modification, for instance misreading "the Aztec empire, with a population of millions" as "millions of people went to conquer the Aztec empire".
    • SMT systems typically contained large numbers of complex, language-pair-specific rules and hacks.
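Written out, the Bayes-rule decomposition above is (with X the source sentence and Y a candidate translation):

```latex
\hat{Y} \;=\; \arg\max_{Y} P(Y \mid X)
        \;=\; \arg\max_{Y} \frac{P(X \mid Y)\,P(Y)}{P(X)}
        \;=\; \arg\max_{Y} \underbrace{P(X \mid Y)}_{\text{translation model}}\;\underbrace{P(Y)}_{\text{language model}}
```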

Neural Machine Translation (NMT)

  • Breakthrough (around 2014): NMT dramatically improved translation quality.
  • Core architecture: the sequence-to-sequence (Seq2Seq) model
    • Encoder: an RNN (usually an LSTM) reads the source sentence and encodes it into a fixed-length context vector (typically the encoder's final hidden state).
    • Decoder: another RNN (usually an LSTM with parameters separate from the encoder's) takes the context vector as its initial state (or input) and generates the target sentence word by word; each step is conditioned on the context vector and the partial translation generated so far.
    • End-to-end training: the whole model (encoder and decoder) is trained jointly as one large neural network, with the loss defined on the decoder outputs and gradients backpropagated through the entire network (see the training-step sketch after this list).
  • Advantages
    • Models P(Y|X) directly.
    • Handles word-order differences and long-distance dependencies better.
    • A comparatively simple architecture, without large amounts of hand-engineered features and rules.
    • The lecture notes that NMT was deep learning's major NLP success, following speech recognition and computer vision.
  • Impact
    • Google switched its translation system to NMT in 2016; the quality jump was so noticeable that users spotted it before the official announcement.
    • Quickly adopted by major companies (Microsoft, Facebook, Tencent, Baidu, and others).
    • A relatively simple NMT system could outperform complex SMT systems built over many years with enormous codebases.
  • In practice: large NMT systems typically use multi-layer LSTMs for both the encoder and the decoder.
  • Training: requires large-scale parallel corpora. The lecture mentions research on unsupervised NMT but does not discuss it further.
  • Generality: the encoder-decoder architecture is used not only for NMT but also for tasks such as summarization and speech-to-text.
  • On bidirectional encoders: the lecture notes that the encoder can in principle be bidirectional, and that might be better, but the famous original Google NMT implementation did not use a bidirectional encoder.
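A minimal sketch of one end-to-end training step for the Seq2Seq sketch shown earlier in the transcript (assuming PyTorch; teacher forcing with the reference translation, a cross-entropy loss at each decoder position averaged and backpropagated through decoder and encoder alike; function and argument names are illustrative):

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, src_ids, tgt_ids):
    """One training step for the illustrative Seq2Seq model sketched earlier."""
    state = model.encode(src_ids)                          # encoder RNN
    dec_in, dec_target = tgt_ids[:, :-1], tgt_ids[:, 1:]   # predict each next word
    out, _ = model.decoder(model.tgt_emb(dec_in), state)   # decoder RNN, conditioned on the source
    logits = model.out(out)                                # (batch, tgt_len - 1, vocab)
    loss = nn.functional.cross_entropy(                    # average loss over all positions
        logits.reshape(-1, logits.size(-1)), dec_target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                        # gradients flow into the encoder too
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    optimizer.step()
    return loss.item()
```

A real system would also mask padding tokens in the loss; this sketch omits that for brevity.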

Conclusions and Recap

  • With their gating mechanism and additive cell-state updates, LSTMs are a powerful tool for sequence data that overcomes the gradient problems of simple RNNs.
  • Gradient clipping is a practical trick for preventing exploding gradients when training RNNs (including LSTMs).
  • Bidirectional LSTMs provide richer contextual information and suit encoding tasks, but not text generation.
  • Multi-layer LSTMs increase model capacity.
  • Neural machine translation built on the encoder-decoder architecture was a revolutionary NLP technology that dramatically improved translation quality and demonstrated the potential of deep learning for complex NLP tasks.