2025 MIT | MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention
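In this lecture, instructor Ava gives a detailed introduction to the basic concepts and practical applications of deep sequence modeling. She begins with the example of predicting a ball's trajectory in a 2D plane to show how important historical information is to a prediction when the data has temporal dependence. She then reviews the previous lecture's material on perceptrons and feedforward neural networks and explains how to extend these basic models to sequential data, namely by passing and updating a hidden state in a recurrent neural network (RNN) to capture temporal relationships in the data. The lecture also points out how widespread sequential data is in speech, text, medical signals, financial data, and other domains, laying the theoretical groundwork for the later discussion of more advanced attention-based sequence models.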
Media details
- Upload date
- 2025-05-18 16:28
- Source
- https://www.youtube.com/watch?v=GvezxUdLrEk
- Processing status
- Completed
- Transcription status
- Completed
- LLM provider/model
- openai/gemini-2.5-pro-exp-03-25
Transcript
speaker 1: All right, well, it's fantastic to have everyone here. My name is Ava. I'm also an instructor for the course. And this is now lecture two of 6.S191, on deep sequence modeling. And as Alexander alluded to at the end of his lecture, deep sequence modeling is a very, very powerful concept that, especially as of late, has garnered a lot of interest and excitement due to the advent of large language models. And actually, excitingly, in the course this year, we're going to have two and a half guest lectures on large language models on Thursday and Friday. And our intention with this lecture is to really give you the foundations of what sequence modeling is all about, so that you're ready when we go into these lectures on cutting-edge, frontier topics to really grasp them from the fundamentals and foundations up. Okay, so with that in mind, in the first lecture, Alexander introduced the very basics, the essentials of neural networks: how we train them using backpropagation, and what it means to define a feedforward model. And so in this next talk, we're going to turn our attention to applying these neural networks to problems that involve sequential processing or sequential modeling of data. And what we'll do is walk through this problem formulation and the models step by step, side by side, building up our intuition about how we can build these networks up, starting right where we left off at the end of the last lecture. So to first set the stage, what I always like to do is to motivate this notion of sequence modeling and sequential data with a very, very intuitive example, a super simple one. Let's say we have this image of a ball in 2D space, and our goal is to predict where the ball is going to travel to next. Now, if I didn't give you any prior information about the ball's history, its motion in this 2D space, any guess you place on the next position of the ball will be a random guess.
Now instead, suppose that in addition to the current location of the ball, I gave you some information about its history, say its prior locations. Now your problem becomes much easier, and effectively the task is reduced to: given this past history of the ball, predict where it's going to go next. And I think we can all agree that in this example, the ball appears to be moving from left to right. So this is what we mean when we think about sequence modeling. And beyond this simple example, as we'll see throughout the course, sequential data and sequence modeling are really, really all around us. For example, the audio from my voice talking can be split up into a sequence of sound waves, effectively chunked and processed in sequence. We can do this similarly with words and text and natural language, where we can chunk or split up text into a sequence of individual characters or individual words and process it in that manner. Beyond this, there are many, many more cases in which sequential processing becomes apparent, from medical signals like ECGs, to stock prices in financial markets, to biological sequences like nucleic acids or protein sequences, to weather, to motion, video, and more. And so really, this paradigm of sequence modeling unlocks a great range of applications and real-world use cases for deep learning in the wild. So what are some concrete problem formulations or tasks we can think about when we start to think about sequence modeling? In the first lecture, we learned about the very basic prediction problem, a classification problem like "will I pass this class?", yes or no, binary. Now, instead of going from a single input to a single output like a class label, with sequence modeling we want to process a sequence of inputs, such as the words in an individual sentence, and produce an output at the end, which may similarly be a classification label.
Like, does a sentence have a positive or negative sentiment or feeling associated with it? It could also be a sequential output, like taking an image and generating text that captions that image. Or similarly, we can think about many-to-many generative or predictive tasks, where maybe we have sequences in one language, in English, and we want to translate them into sequences in another language. So these are the fundamental problem definitions or problem setups that we can start to think about when we have the capability to process sequential data. So now, with that in hand, the question becomes: okay, how do we actually build up a neural network model to be able to process and handle this unique type of sequential data? And today, we'll spend about half the time in the lecture talking about where the history of the field got started, with a new type of neural network architecture that kicked off our real abilities to model sequences. And then in the second half, we'll talk about a powerful mechanism that is being used today in the most state-of-the-art sequence models that we see all around us. In both cases, my goal is really to convey the fundamentals of how we actually do this. And so we're going to build up from the perceptron that Alexander introduced and go step by step to develop an understanding of these models. Okay, so let's do exactly that. Let's go back to the perceptron, which we studied in lecture one, right? We saw a diagram like this, where we have a set of inputs, x one through x m. And these are numbers. They're represented numerically. And what we do in a perceptron is we take these inputs, we multiply each by a corresponding weight value, add them all together to get this state z, and take this z and pass it through a function that's nonlinear, also called an activation function. And that leads us to our predicted output, y hat. We can have multiple inputs coming in.
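As a minimal sketch of the perceptron recalled above, the following Python computes the weighted sum z and passes it through a nonlinearity. The sigmoid activation and the optional bias term are assumptions made for illustration; the transcript only says "a function that's nonlinear". The function name `perceptron` is likewise hypothetical.

```python
import math

def perceptron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a nonlinear activation."""
    # z is the "state": multiply each input by its weight and add them all up.
    # The bias term is an assumption of this sketch (defaults to zero).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid is an assumed choice of activation function for this sketch.
    return 1.0 / (1.0 + math.exp(-z))

# Three numeric inputs x_1..x_3 and their corresponding weights.
y_hat = perceptron([1.0, 2.0, -1.0], [0.5, -0.25, 0.1])
print(y_hat)
```

Note that nothing in this function refers to time: it maps one fixed set of inputs to one output, which is the limitation the lecture goes on to address.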
Here you can think of these inputs as a single slice, a single time step in a sequence. We also saw that we are able to stack these individual perceptrons together to make up a layer of a neural network. And here we can also extend from a set of inputs to a set of multiple outputs, in purple. But still, right, in this example we're operating on a static single step. There's no notion of sequence, no notion of time. These are all individual inputs at just one slice, one time step. We can simplify our diagram further. And all I'm doing here in the middle is collapsing that visual of the hidden layer down to just this abstracted green box. So we have the same operation of mapping an input to an output, from input vectors of length m to output vectors of length n. What can we do now to try to apply a network like this to time series data or sequential data? Well, maybe let's start off with something perhaps a bit naive, as a first pass. Let's take this same model. All I've done here is rotate it, so instead of left to right, it's bottom to top, input to output. And what if we repeat this multiple times? We have our input vector at some time step t, here t zero. We feed it into our neural network and we generate an output prediction. And maybe we do this over multiple time steps, because in sequential data, we don't have just a single time step. We have multiple individual time steps, from t zero to t one, t two, and so on in the sequence. We could do this in isolation, right? We take the inputs at these individual time steps, pass them through the model, and generate an output. And here again, it's a simple function, where we're applying this function defined by our neural network to the input at a time step t to generate a prediction y hat of t. But what's the problem here, right? We know that our output vector y hat at a particular time step is just going to be a function of the input at that time step.
And if this is inherently sequential data, it's probably in sequence for a reason, because there's some sort of inherent dependence between the time steps within that sequence. And in this way, we're completely ignoring the inputs at the previous time steps when making a prediction at a later time step. These are all treated in isolation. To predict an output at, say, time step two, we could probably benefit a lot from taking the information from time step zero and time step one. How do we get around this? How could we possibly relate these time steps to each other? Well, fundamentally, what we're trying to do is to relate the network's internal computations at these individual time steps to the prior history of computations from the previously observed time steps. And we want to pass this information forward through time, to maintain that information as we move through the sequence. So maybe we can try that out. What if we were able to define a way to link the neuron's internal state from a previous time step to the computation at a later time step? Maybe we can make this concrete. We can define a variable, we'll call it an internal state h of t, that's maintained by the neurons in this network. And we pass the state on from time step to time step, across time. And our goal is to imbue this internal state with some notion of memory, some record of the prior computations that the network has made. And what this means, a little more concretely, is that now our prediction y hat of t at a particular time step should depend not only on the input at that time step, but also on this state that's been carried on from the prior time step: this past history and the current input. And so what's really powerful here is that now we're starting to define more formally a relation between these output predictions that depends not only on the input observed, but also on this past memory.
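The contrast described in words above can be written compactly. The notation here is an assumption chosen to match the spoken symbols: f is the function defined by the network, x_t the input at time step t, and h_{t-1} the internal state carried over from the previous time step.

```latex
\hat{y}_t = f(x_t)
\quad \text{(each time step treated in isolation)}
\qquad \longrightarrow \qquad
\hat{y}_t = f(x_t,\, h_{t-1})
\quad \text{(current input plus carried-over state)}
```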
And as you can see, and hopefully start to get intuition for, that past memory is dependent on the prior inputs. So we have this idea of some sort of recurrence, something that's repeated and maintained over time as the network makes these computations. And because the output that we're producing is a function of not only the current input but also this past state, we can describe this type of network via what we call a recurrence relation. And we can visualize this in a couple of different ways. Here on the right, I'm showing these time steps unrolled in slices over time, where we've made these links visible through this variable h. But we can also represent this, on the left, via this cycle, this loop, which conveys the concept of the state maintaining a recurrent log, effectively, of the network's internal computations that gets updated with each time step. And so this is the notion of a time series model that has a sense of recurrence. And really, this is a very, very foundational concept in all of sequence modeling, this notion of what we call recurrent neural networks. And hopefully this example, and this buildup from where we started with the perceptron, gives you the intuition about how we can start to model sequential data using these time series models called recurrent neural networks. So let's keep going, right? Let's continue to build from this foundation and now start to get a little bit more formal about the operations that go into the internals of these recurrent neural networks, or RNNs. As we started to see, right, the key idea is that we have this internal state, h of t, and that state is going to be updated at each time step as we process the sequence. We can do this by defining what we call this recurrence relation, right? It's a way to process a sequence time step by time step.
And what we want is for our internal cell state h to be a function of not only the input, but also the prior state, so that there's some notion of memory that's carried through. To formalize this, the cell state h of t is defined by some function, parameterized by a set of weights W, that depends not only on the input at that time step, but also on the prior state from the previous time step. And we can recurrently apply the same function, with the same set of weights, time step by time step, so that we get this recurrent update to our cell state as the sequence is processed. An additional way to build up this intuition about RNNs is to think about this in pseudocode, right? So here we're going to walk through a Python pseudocode example that helps build up this intuition. So let's say we want to build up the RNN from scratch. We're going to define some RNN, my_rnn, and we're going to initialize it with a hidden state set to zero. And now our input is this little fragment of a sentence: "I love recurrent neural". And our task is to use the individual words in the sentence to predict the word that comes next in that sequence. What we do is loop through each word in this sentence: we use our RNN to take that word and the last hidden state, predict some output, and update the hidden state, and we do this iteratively, such that at the end we can generate a prediction for the next word that comes after we've processed these inputs. This is the fundamental notion of a state update and an output that is the core of the RNN. So again, walking through this from the bottom up, given our input vector, we define a function to update the hidden state. And now we've just introduced a little bit more formality that makes these weight definitions and functions explicit.
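The pseudocode walked through above can be sketched as runnable Python. This is only an illustration of the loop structure for the recurrence h_t = f_W(x_t, h_{t-1}): the body of `my_rnn` below is a hypothetical toy stand-in, not a trained model. Its "state" is simply the list of words seen so far, and its "prediction" is a placeholder string.

```python
def my_rnn(word, hidden_state):
    """Toy stand-in for one RNN step (hypothetical, not a real model).

    Takes the current word and the previous hidden state, and returns
    a prediction together with the updated hidden state.
    """
    # Update the state: here it just accumulates the words seen so far.
    new_hidden_state = hidden_state + [word]
    # Produce an output from the updated state (a placeholder string).
    prediction = f"<guess after {len(new_hidden_state)} words>"
    return prediction, new_hidden_state

hidden_state = []  # stands in for the all-zeros initial hidden state
sentence = ["I", "love", "recurrent", "neural"]

# Process the sequence one word at a time, carrying the state forward.
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# The prediction after the last word is the model's guess for the next word.
next_word_prediction = prediction
print(next_word_prediction)
```

The point to notice is that the same function is called at every step, and the only thing linking one step to the next is the hidden state that is returned and passed back in.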
But again, the update to the hidden state is a function of both the input at that time step and the prior hidden state from the previous time step. And now we can actually generate a prediction, an output vector, by taking that hidden state and applying a weighted transformation to it, right? So this is the core of both how the RNN can update its internal memory, its hidden state, and how it produces a prediction, an output, at each individual time step. So besides visualizing this RNN with this recurrent, cyclic visual here, I mentioned that we can also represent it by taking these individual time slices and separating them out on a time axis, unrolling them across time. And if we do that, I think the intuition becomes perhaps even clearer, where, again, starting from the first time step, we can unroll step by step, t one, t two, all the way to the final time step t, and you can see this internal state being passed on from time step to time step as we generate the predictions. We can also bring back that formal math to this diagram, where now you can see we have one weight matrix that transforms the input for the computation of the hidden state, one weight matrix that defines how the hidden state gets updated, and finally, one that transforms the hidden state into the output prediction. Importantly, right, these are the same weight matrices at each individual time step, right? There's this, this, and this, and they're reused at each time step in the sequence. So now this tells us how we actually make these updates and how we predict the outputs. But what we need to know to actually train this network is: how do we define a loss? As Alexander introduced, in order to train a neural network, you need to define some sort of function that tells you how close the predictions are getting to the desired behavior that you seek.
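A minimal sketch of the three weight matrices just described, in plain Python with no external libraries. The names `W_xh`, `W_hh`, and `W_hy`, the class name `SimpleRNNCell`, the random initialization, and the choice of tanh as the nonlinearity are all assumptions for illustration; the transcript names none of them.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class SimpleRNNCell:
    """One recurrent unit whose weights are reused at every time step."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights; an assumed initialization scheme.
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_xh = init(hidden_dim, input_dim)   # input -> hidden
        self.W_hh = init(hidden_dim, hidden_dim)  # previous hidden -> hidden
        self.W_hy = init(output_dim, hidden_dim)  # hidden -> output
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return [0.0] * self.hidden_dim

    def step(self, x, h_prev):
        # Hidden state update from the current input and the prior state.
        from_input = matvec(self.W_xh, x)
        from_state = matvec(self.W_hh, h_prev)
        h = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        # Output prediction as a weighted transformation of the hidden state.
        y = matvec(self.W_hy, h)
        return y, h

# Unroll the same cell over a three-step sequence of 3-dimensional inputs.
cell = SimpleRNNCell(input_dim=3, hidden_dim=4, output_dim=2)
h = cell.initial_state()
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outputs = []
for x in sequence:
    y, h = cell.step(x, h)
    outputs.append(y)
print(outputs)
```

Only one `SimpleRNNCell` object is created, so the loop makes the weight sharing explicit: every time step calls `step` on the same three matrices.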
And with RNNs, that exact same concept still applies. The only difference is that we can compute a loss at each slice, at each of these individual time steps in the sequence, and get a total loss by summing over all the time steps in our sequence. So in total, right, this defines what we call our forward pass: how we make predictions time step by time step, and how we use those predictions to compute a loss. We can walk through an example of how you can actually implement an RNN yourself, bottom up, from scratch, in a library like TensorFlow, where we define an RNN as a layer, first initializing the weight matrices as attributes of our RNN class. And, importantly, we define in this call function how we do that exact forward operation, that forward pass that I introduced in the prior slide. And as you can see, right, it consists of these two key lines, the first being the update of the hidden state according to the input and the prior hidden state, and the second being the prediction of the output, which is a transformation of the hidden state at that time step. And we return both of these, the prediction and that hidden state value. And as you'll see, right, this is a really good way to move, now thinking in code, from that pseudocode intuition we showed earlier to how you can define an RNN class yourself from scratch. And moving one more step forward, you can build from this intuition to learn and understand how to operationalize an RNN through layers and modules that are already implemented in common machine learning frameworks like TensorFlow and PyTorch. And in the first lab of this course, using RNNs for sequence modeling, you'll get hands-on experience working with these functions and classes in either of these libraries. Okay. So hopefully now, right, that gives you this intuition of how we've built up from this first example of static input.
Static output to these diverse types of tasks and new sorts of problems we can tackle when we start to be able to process and handle sequential data, whether that's taking a sequence and producing a class label, or actually being able to do something like next-word prediction or next-character prediction to generate and produce a sequence as output. And on this latter example, looking ahead a bit: not only is this what you'll get hands-on experience with in our software lab, but really, this notion of many-to-many sequence modeling is the backbone of how language models actually work. And you've just understood some of that intuition, starting today and now building forward. All right. So let's now think about how we operationalize this in the real world. Go ahead. speaker 2: ...in the first hidden layer there, connected by a... speaker 1: Yeah. So the question is what defines the connection between the hidden layers? The important thing is that one of these individual green blocks can have many layers itself. It doesn't necessarily have to be one. It could be many. The important thing is that you can think of that as a unit, and that unit contains some number of layers within it, but that operation is applied for all the individual time steps in that sequence, using the same set of layers and weights in that unit. You compute the loss at each of those steps. And then, as we'll see, when you actually update the weights, that is done by taking the loss from all the individual time steps and then updating after you've processed all the time steps individually. Just in the same way as with feedforward networks, where you can have weights that connect layer to layer, in an RNN that has multiple layers within one unit, there are weights that define those connections. Any additional questions before we transition? Okay. Okay.
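The idea of computing a loss at every time step and summing them into one total loss before updating the weights can be sketched as follows. The per-step loss here is cross-entropy, which is an assumption (the transcript does not name a loss function), and the predicted probabilities and target indices are made-up illustrative values rather than the output of a real model.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Loss for one time step: negative log-probability of the true class."""
    return -math.log(predicted_probs[target_index])

# Hypothetical predicted distributions over a 3-word vocabulary,
# one per time step, with the index of the true next word at each step.
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
targets = [0, 1, 2]

# One loss value per time step ...
per_step_losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
# ... and a single total loss obtained by summing over all time steps.
total_loss = sum(per_step_losses)
print(per_step_losses, total_loss)
```

A weight update would then use the gradient of `total_loss`, computed once after the whole sequence has been processed, rather than updating after each individual step.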
So now let's think about how we actually operationalize and bring this notion of sequence modeling to a real-world example. Sequences are rich, and sequences are interesting, for a few reasons. One is that they're tremendously variable, right? It's not like an image, where you have a height and a width, and for all the images in your data set, that height and width are fixed. In something like language, sequences can be short, they can be long, they can be in between. So we need to be able to handle this variability. The other richness and nuance is that inherently, with things that depend on time, you could have instances where there's a really short-range interaction or dependency, or you could have instances in a sequence where something at the very beginning dictates the meaning or the value of something at the very end. So there's this notion of spread, or long-term dependency. And fundamentally, right, there's this notion of order. And our model needs to be able to do a good job of representing and reasoning about this order through the approach that we take. So we'll use these criteria to look at how RNNs can do well at this, but also at what some shortcomings of RNNs are when it comes to these real-world operational criteria that our sequence model needs to meet. So to do that, we're going to walk through a very, very concrete example, which has maybe now become the quintessential sequence modeling problem out there. And that's this idea of predicting the next word: given a sequence of words, predict the word that comes next. And to underscore this, this is a very important task, because not only is it beautiful and simple and intuitive, but it turns out to be incredibly powerful: the very, very powerful language models that we see today are trained on this exact task of predicting the next word. So let's say we have this example sentence. This morning, I took my cat for a walk.
Our task is, given this set of words, to predict the next word. And let's say we want to build a sequence model like an RNN to do this task. What is our very first step? Any ideas? "Breaking the words into chunks." Yeah, let's say now we've broken the sentence into chunks, or words. How can we actually build a neural network model to do this? "Vectorize them." Exactly. So the core consideration after we've broken this text up into words is that we need a way to actually represent it to a model, right? Because remember, all neural networks are is function approximators that operate on numbers, right, vectors and matrices and ways of representing numbers. So that's exactly what we need to do if we want to have a model that takes a word in and predicts the next word out. We can't just pass that in as words. We need an actual way to represent these words as numbers, in a numerical representation, to be able to operate on them using a neural network. And this is the notion of vectorizing the input, or encoding the language, for a neural network to operate on. And this is really, really a core concept in language modeling and in neural network and machine learning modeling in general. And so the solution that we're going to introduce right now is this notion of embedding, which is the idea of taking an input that can be in some form, like words, and transforming it first into an index that can then map to a vector of some fixed size. So to break this down step by step, how do we actually do this? Let's say we have our body of all possible words we could see in all sentences we could encounter. We call this a vocabulary, right? A corpus of words that covers all the possible words that we could encounter. And this vocabulary has to have some fixed size, right? What we can then do is take the individual words in this vocabulary and map them to a number, an index: let's say "a" maps to one, "cat" maps to two, and so on and so forth.
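The word-to-index step described above can be sketched in a few lines of Python. The particular numbering is arbitrary: this sketch assigns indices from zero in order of first appearance, which is an assumption and differs from the "a maps to one, cat maps to two" numbering mentioned in the lecture. Lowercasing and splitting on whitespace are also simplifying assumptions.

```python
# The example sentence from the lecture, lowercased and without punctuation.
sentence = "this morning i took my cat for a walk"
words = sentence.split()

# Build a vocabulary: each distinct word gets a unique integer index.
vocabulary = {}
for word in words:
    if word not in vocabulary:
        vocabulary[word] = len(vocabulary)

# Represent the sentence as a sequence of indices instead of strings.
indices = [vocabulary[word] for word in words]
print(vocabulary)
print(indices)
```

In practice the vocabulary would be built from a whole corpus rather than one sentence, so that it has a fixed size covering every word the model may encounter.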
Then these indices give us a way to go from that index, that slot, to a vector, to look up a vector that represents that word based on the index. And I'll show you exactly what I mean by this. The last step is to do this embedding operation, which means mapping an index to a vector of fixed size. One way we could do this is what we call a one-hot, or binary, encoding or embedding. And what I've done here is define a sparse vector of a fixed size, and all it has is zeros and a one at the index that corresponds to that word. And so based on that index value, I can effectively encode the identity of the word, and look up and backtrack what the word was based on that index. The other thing I could do is to actually use a neural network layer to learn an embedding of those words in some fixed-length, lower-dimensional space. And this is very similar. We've just done this operation of mapping that index to an encoding such that similar words end up in a similar region of this embedding space. But we are still able to retrieve our vector representation of the word using that index. So I think a really good way of thinking about these vectorization or embedding operations is to think about indexing: looking up a fixed representation, a numerical representation, for these words. And this is a really, really important concept. So now we can do that for our sequences. We can transform our words into this vector representation. And now, to think about why sequence modeling is difficult and complex, we can see some examples of how that really comes to life, right? We can have the complexities that arise when we think about variable sequence lengths. So maybe we could have short sequences, medium-length sequences, longer sequences. And in all cases, we want our neural network model to be able to track these dependencies consistently, to still be able to do a good job of predicting the next word at the end.
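Both embedding options described above, the one-hot encoding and the lookup into a lower-dimensional table, can be sketched as follows. The table here is filled with random numbers purely as a placeholder; in a real model those values would be learned during training. The names `one_hot`, `embedding_table`, and `embed`, and the sizes chosen, are hypothetical.

```python
import random

vocab_size = 5      # number of words in the (tiny, hypothetical) vocabulary
embedding_dim = 3   # size of the dense embedding vectors

def one_hot(index, size):
    """Sparse vector of zeros with a single one at the word's index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

# A table with one row per vocabulary index. Random values stand in for
# what would be learned parameters in a trained model.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(index):
    """Dense embedding: look up the table row for this index."""
    return embedding_table[index]

word_index = 2
sparse = one_hot(word_index, vocab_size)
dense = embed(word_index)

# Multiplying the one-hot vector by the table selects the same row,
# which is why an embedding can be thought of as an indexed lookup.
selected = [sum(sparse[i] * embedding_table[i][j] for i in range(vocab_size))
            for j in range(embedding_dim)]
print(sparse, dense, selected)
```

The one-hot vector grows with the vocabulary size, while the dense vector has a fixed, smaller length chosen by the modeler.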
And that relates as well to the ability to track and store these long-term dependencies in these sequences, where in many cases we may need information from the very beginning of the sentence to predict the next word at the very end of the sentence. That's important because, depending on how we shuffle the words around, the sequence can convey very different semantic meanings in our prediction task. And so hopefully, this example of language modeling and next-word prediction gives you a sense of why sequence modeling can be very rich and very complex as a deep learning task. So, okay, that gives us, concretely and in the real world, some understanding of why sequence modeling is rich and challenging. To actually train a sequence model like an RNN, though, we need some special considerations. We're still going to use that fundamental algorithm that Alexander introduced, backpropagation, but now we need to introduce something to be able to handle that time dependence. So, like we've been doing throughout this lecture, let's build up from the first principles that we started with. Let's go back to how we train our feedforward models, right? We first take our inputs and make a forward pass through the network, going from input to output. To actually train the model, we compute the loss and backpropagate gradients. We do this backwards through the backpropagation algorithm, by taking the derivative of the loss with respect to each of the parameters, each of the model weights, in our network. And then we adjust and move those parameters in the direction that will minimize the loss. We take a step. In RNNs, we saw this preview of how the loss is actually computed time step by time step. Now, when we want to train RNNs, we need to consider not only an individual loss, but the aggregate loss across all these individual time steps.
What that means is that instead of backpropagating these loss values through a single feedforward network, we need to backpropagate errors across these individual time steps, such that we can carry errors from late time steps all the way back to the beginning. And the algorithm for doing this is what we call backpropagation through time, in which errors flow from late in the sequence all the way back to the very beginning. What this means practically is that you can have this chain of repeated computations, multiplying by a weight matrix many times over, as well as repeated use of the derivative of the activation functions in these networks. That can pose some very practical challenges, which, for the sake of time, we're not going to go too deep into. But the important thing to keep in mind is that these standard RNNs can be somewhat difficult to train stably, because you could get many values that are larger than one. You multiply them together and then the gradient explodes, right? Conversely, you could have many values that are very small. You multiply them together and then the gradient vanishes, and it goes down to very, very close to zero. This has real practical implications, because it makes it really, really hard to track these long-term dependencies that we care about in our sequence. If our gradients are unstable, either blowing up or shrinking down to nothing, we can't effectively take those gradients from late time steps and pass them through to the earlier time steps to get our model to retain that information. And so in the literature and in the sequence modeling community, there's been a lot of active research to come up with improved architectures based on the RNN to try to solve this problem.
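The exploding and vanishing behavior described above comes from multiplying many factors together. As a simplified sketch, the code below uses a single scalar factor as a stand-in for the repeated weight-matrix and activation-derivative terms in backpropagation through time. This scalar simplification, the helper name `repeated_product`, and the specific values 1.5, 0.5, and 50 steps are assumptions chosen only to make the effect visible.

```python
def repeated_product(factor, num_steps):
    """Multiply the same factor together num_steps times."""
    result = 1.0
    for _ in range(num_steps):
        result *= factor
    return result

num_steps = 50  # length of the chain, i.e. how far back in time we propagate

# Factors larger than one: the product blows up (exploding gradient).
exploding = repeated_product(1.5, num_steps)
# Factors smaller than one: the product shrinks toward zero (vanishing gradient).
vanishing = repeated_product(0.5, num_steps)

print(exploding, vanishing)
```

With a vanishing product, the contribution of early time steps to the gradient becomes negligible, which is why long-term dependencies are hard for a standard RNN to learn.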
And the real core concept is that they add some complexity to that RNN unit itself, effectively adding additional functions to try to selectively control the amount of information that's passed to the update of the hidden state. One very, very prominent way of doing this is a network called an LSTM, or long short-term memory network. And this was introduced quite some time ago now, but it has been very foundational to a lot of the sequence modeling work that's gone on since. So to give you one quick look at some of the applications of RNNs that you'll actually get hands-on with today, I want to highlight this particular example of music generation. And this is a problem that really lends itself naturally to sequence modeling and to a recurrent neural network architecture. Because what we're doing is, let's say, we want to try to predict and generate a new piece of music. One way we could do this is by taking the individual notes in a piece of music and, very similar to what we saw with the task of predicting the next word, building a model that, given the past history of musical notes, learns to predict the most likely next musical note in the sequence. And this is exactly what you will do today in our software labs, where you'll be able to train an RNN model to generate brand-new music that's never existed before. And in fact, you're not the only ones who have had a go at this. This is an example from a few years ago of a startup that was working on music generation. They trained a neural network model on classical music and tested it by having it finish a work by the composer Franz Schubert, the famous Unfinished Symphony, where they gave the model two movements of the symphony and tasked it with generating the music corresponding to the third movement. So let's see if we can play this. So it's pretty good.
Maybe there are some classical music aficionados in the audience who are familiar with this work, but I always appreciate it because it gets at some of these themes that we're talking about in terms of the capabilities of these sequence models. So, so far, we've talked exclusively about this sequence modeling architecture called recurrent neural networks, and I want to take a moment to appreciate that. It's very, very remarkable that we're able to understand the fundamentals of sequence modeling and build up to some of these capabilities using RNNs. But like any technology and any method, RNNs have some core limitations. And in fact, those limitations have motivated the development of new architectures and approaches in sequence modeling, as well as improved versions of RNNs that have done better and addressed some of these limitations. A couple of things are important to keep in mind when thinking about RNNs. One is that the core concept, as we talked about, is this notion of the state h of t. Remember, in all that we're talking about, these neural networks operate on vectors and matrices of numbers. Just like that, the state of an RNN is a fixed-length vector, right? There's only so much information that can be encapsulated in something of fixed size. And so that presents what we think of as a bottleneck on the amount of information that the RNN state can hold. Additionally, because RNNs process information time step by time step, they can be very difficult to parallelize so that things are processed simultaneously. We have this inherent sequence dependence. And finally, related to both of these points, that encoding bottleneck of the state can limit the long-term memory capacity of some of these types of recurrent architectures.
So to think about now how we can try to overcome this, let's go back to our fundamental goal of sequence modeling, which is to take a sequence of inputs, use a neural network to compute some features or states representing those inputs, and now be able to generate predictions according to that sequence. With RNNs, we said we are going to process this time step by time step, use recurrence. But we also saw that inherently, this notion of sequential, time-step-by-time-step processing places some real limitations on the capabilities of RNNs. Ultimately and ideally, we want to be able to process our sequence very efficiently in parallel, perhaps so that we can generalize to long sequences and do so efficiently, and also have this desired attribute of being able to track these important dependencies in the sequence effectively. And so a question that was posed a few years ago is, what if we could try to tackle the sequence modeling problem without having to deal with the data time step by time step? Maybe we can eliminate the need for recurrence. Maybe we could do this by squashing everything together, ignore this notion of these individual time steps, and let's say, concatenate all our inputs together into one vector, and we feed it into something like a feed forward model and generate an output at the end and hope that it makes sense. Well, if we do this, in this naive first approach, yes, we've eliminated the need for recurrence. So we don't have to process our data step by step, that's good. But this doesn't seem really scalable because now let's say we're just trying to use a dense network, that's not going to be very efficient. Also, we've destroyed all information about order. We've squashed the inputs into a concatenated vector. And by doing so, we've destroyed any hope to remember things that appeared earlier or later and relate them to each other.
So this has motivated a different way of thinking about sequence modeling, which is this idea of trying to take a representation of sequential data and define a mechanism that can, on its own, pick out and look at the parts of that information that are important relative to other parts of it. Thinking about this, in other words, can we define a way to, given a sequence, identify and attend to the important parts of that sequence, and moreover, model the dependencies in that sequence that relate to each other? And this is the core idea of a very powerful mechanism called attention. And in 2017, there was a paper called "Attention Is All You Need" that introduced this mechanism. And so if you've heard of a model like ChatGPT or GPT, the T in that acronym stands for transformer. And a transformer is a type of neural network architecture that can be applied not only to language data, but other types of sequential data. And the foundational mechanism of a transformer, what makes it different, is this operation of attention. And so we're going to, in this lecture, talk about that attention mechanism. And in later lectures, you'll learn more about how these transformers are actually being applied as language models and in other applications as well. So we're going to break down the core concept of attention really step by step. All right, let's do that. Attention itself is a very informative word, right? What it means is we have this inherent ability as humans to think about an input and automatically zoom in and pick out the things that are salient, the important features. So let's build up our intuition, starting with an image. How do we figure out what's important in this image? One way we could do it naively is we could go pixel by pixel, left to right, back and forth, and scan this to try to compute some value about how important these individual pixels are. But of course, our brains don't operate like that.
We're able to automatically look at this and pick out and attend to the important parts. And the first part of this problem is this notion of being able to identify which parts are important in some input. And then ultimately, we want to use that identification to extract the features, the components of the input that correspond to these high attention values, right? And to think about this more concretely, this notion of identifying the parts to attend to is really similar to search, right? So when we do something like a search, we present a question, and we're trying to seek an answer. Let's say you came to this class, right, and you had the question, how can I learn more about neural networks and deep learning and AI? Maybe one thing you could do besides coming to this class would be to go to the Internet, have all the videos, all the materials available on the Internet, and try to do a search to find something that matches your query. Let's say you go to a giant video database like YouTube, and you put in your query, your ask, deep learning, that's your search topic. And now what search does is, let's say, we go through every video in this database and we extract some informative nugget of information, a key that represents the core element of that video, a descriptor of that video. And now we have our query, and we have a set of keys. Now our task is, okay, to actually do the search. How could I go about this? Well, I want to see how close the match is between my query, what I searched for, and those keys, those key indicators in the database. And I want to see how similar they are. So I'll do this step by step. Ask how similar is my query to these keys? First instance, a beautiful video about elegant sea turtles. Not similar. A video from our past lecture on deep learning. Yes, similar. A key related to Kobe Bryant's fadeaway, not similar. So now we've identified the key that's important. We want to attend to this.
Our last task is to actually extract the value that's associated with this similar match that we found. We want to extract the features that we want to pay attention to the video itself, and we'll call this the value. And because our search was implemented with a good attention mechanism, we've identified the best deep learning course out there for you and your query. This concept is really the core intuition behind attention, and that's very, very closely related to how this attention operation works in neural networks like transformers. So now let's go back to our sequence modeling problem, where we have a series of words, and we want to predict the next word with this sentence. If we break this down step by step, first, remember, we don't want to process this information time step by time step. We've broken down that need for recurrence. We're going to feed it in the data all at once. And still we need a way to encode some notion about order. So what we're going to do is we're going to put in an embedding called a positional embedding, that effectively gives us a way to encapsulate that relative position of these elements in that sequence. We're not going to go into great detail about positional embeddings, but you can think of it as an encoding that gives us some representation of position in the sequence. Now we take those position aware encodings, and we do that search operation, and we operationalize that search operation to extract three sets of matrices that we call the query, the key and the value. Just like I introduced before, the way we do that is we take the positional embedding and think about that as representing our sequence in a positionally aware way. And we use a neural network layer to produce each of these individual matrices, the query, the key and the value. The important thing to keep in mind is that it's that same positional embedding that's repeated in this mechanism of self attention. 
But these are different neural network layers that yield different values for each of the query, key and value matrices. What that means is that they can effectively capture different information, as we'll see. Now, again, our next step in the search operation was to figure out, okay, how similar is the query to the key? And that's exactly what we're going to do next with attention. To do that, right, remember, these are numeric vectors or matrices. And so we need a mathematical way to compute that similarity between two sets of numerical features. So let's say we have two vectors, our query vector, our key vector. And what we can do mathematically using linear algebra is measure the similarity of those vectors in space using the dot product that tells us how close they are to each other. We can scale it. And this gives us a very concrete similarity metric that captures that similarity between the query and the key. That same principle can apply now to matrices as well. Dot product and scaling. That gives us a similarity metric. Now let's think about what this similarity computation actually means. Remember, right? We're trying to understand how these components of the input relate to each other, what parts of the sentence are important to each other and important to convey the semantic meaning of the sentence as a whole. So if we have this example, "He tossed the tennis ball to serve", let's say we've computed our query and key matrices and applied the scaling. We can then apply a function called a softmax to basically squish these values between zero and one. And now this gives us a matrix that gives us a relative weighting of how those individual components in the sequence relate to one another. Intuitively, you can think about things that are more similar as having higher attention weights, more related as having higher attention weights, things that are less related as having lower attention weights.
So here, right, in this example, tossed and ball have a high score, tennis and ball have a high score, and so on. This gives us our attention weighting or our attention matrix. The final step is to now use that relative weighting to actually pull out those important features that we care about. What we do here is we take our value matrix, multiply it by our attention weight, and this gives us an output of a feature set over that input space that reflects the relative elements of the sequence that are interrelated relative to each other. That's really the core of how attention works. And it's really, really beautiful and striking to me because this mechanism gives us a very natural way to pull out and attend to features that are quite important relative to each other in an input. And architecturally, now, how do we actually build this out into something like a transformer? Well, one second, please. We can go again by taking our input, computing these positional encodings, we define these neural network layers that compute the query, key and value. Then we can compute this relative weighting between the query and the key. That's a matrix multiply representing the dot product, a scaling and a softmax, and then use the value matrix to extract features that have high attention scores. And these are the core operations that now define these attention heads, which are really the core component of architectures like the transformer. So as we've mentioned and realized, right, attention is really the foundational building block of transformers. And I think some questions kind of got to this as well, right? A transformer architecture does not need to be defined by just a single attention block. You can actually stack multiple attention heads together to now be able to basically increase the capacity of your network and pull out different sets of features and more complex sets of features.
So again, in this very intuitive example, maybe you have a network that has three attention heads. And if you go in and inspect, again, this is an intuitive example, let's say you were to go in and inspect the values of each of these attention heads, maybe you could get some interpretability out with respect to the different features or parts of the input that the network was attending to. So what are some real world use cases? And what has attention really enabled and transformed over the recent years? It's not only in language processing that transformers and self attention have really led to tremendous advances. While that's the case, right, the mechanism and the architecture behind attention and transformers is very generalizable. And so as you'll see, natural language is one tremendous area where transformers have really taken off. And you'll get hands on experience with this, not only in the lectures, but in a brand new software lab on LLMs. We'll also see a bit about how attention and sequence modeling have been extended to biological sequences in one of our guest lectures. And in fact, in something that may not appear sequential, like images or computer vision, there is a class of models called vision transformers that have now also become very, very powerful in processing image data as well. So to summarize and close, hopefully you've gotten the sense of how rich sequence modeling is as a set of problems and things we can consider. We saw how RNNs work. We saw how we can build up intuition for RNNs through this notion of recurrence, how they can be trained through back propagation. You'll be able to get hands on experience building RNNs for music generation. And finally, we closed with talking about self attention and transformers as a way to model sequences without needing to handle time steps individually. And stay tuned for more on LLMs, both hands on and in more lectures. So with that, that closes the lecture portion for today.
We now can use the remaining time for open office hours and discussion about any lingering questions you have or comments related to the discussion. We also want to draw your attention to the software labs, which are now available on the GitHub linked on the course website. The instructions for completing the software labs are all there. We have options in both TensorFlow and PyTorch, so hopefully you'll get a fun chance to go through those and work with those. And finally, I think our gracious reception host, John, stepped out, but immediately after this, there will be an in-person reception to kick off the course just down the street at One Kendall Square with food provided. And many special thanks to John Werner and Link Ventures. He's still there, back at the top. Thank you, John, for graciously hosting.
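Overview / Key Summary
This lecture (the second lecture of MIT 6.S191, Introduction to Deep Learning), given by instructor Ava Amini, explores the core concepts, methods, and applications of deep sequence modeling. It first emphasizes how common sequential data is in the real world (audio, text, biological sequences) and introduces the different types of sequence modeling tasks (sentiment classification, image captioning, machine translation). The core content first focuses on recurrent neural networks (RNNs), explaining how they process sequential information by introducing a "memory", or hidden state ($h_t$): the current output $\hat{y}_t$ depends not only on the current input $x_t$ but also on the hidden state from the previous time step, $h_{t-1}$. The lecture explains the RNN state update ($h_t = \tanh(W_{hh}^T h_{t-1} + W_{xh}^T x_t)$) and output computation ($\hat{y}_t = W_{hy}^T h_t$), along with the fact that weights are shared across time steps. It then discusses the backpropagation through time (BPTT) algorithm used to train RNNs and the exploding and vanishing gradient problems it faces, which particularly hurt a model's ability to learn long-term dependencies. Gating mechanisms (such as LSTM and GRU) are introduced to address this.
The lecture then points out the limitations of RNNs: the encoding bottleneck, the difficulty of parallelizing sequential processing, and the lack of truly long memory. In response to these limitations, the lecture focuses on the self-attention mechanism, which is the core of Transformer models (such as GPT). Self-attention lets a model process a sequence without recurrence: it attends directly to the important parts of the input by computing relationships between queries (Query), keys (Key), and values (Value), which enables parallel processing and captures long-range dependencies more effectively. The lecture breaks the self-attention computation down in detail, covering positional encoding, extraction of Q, K, and V, the attention score computation ($\text{softmax}(\frac{Q \cdot K^T}{\text{scaling}})$), and the aggregation of the weighted values. Finally, it surveys the broad use of self-attention in language processing, biological sequence analysis, and computer vision, and previews the upcoming course content on large language models.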
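Introduction to Deep Sequence Modeling
Instructor Ava Amini opens the lecture (6.S191, Lecture 2), which aims to lay the sequence modeling foundations for the later, cutting-edge material on large language models (LLMs).
- Sequences in the Wild
  - Sequential data is everywhere, for example:
    - Audio: speech can be broken down into a sequence of sound waves.
    - Text: natural language can be broken down into a sequence of characters or words.
    - Other: medical signals (ECGs), stock prices, biological sequences (nucleic acids, proteins), weather, videos of motion, and so on.
- Sequence Modeling Applications
  - One to One: for example, binary classification, such as deciding whether a student will pass the course (a conventional neural network).
  - Many to One: for example, sentiment classification, which infers the sentiment of a piece of text.
  - One to Many: for example, image captioning, which generates a text description from a single image.
  - Many to Many: for example, machine translation, which maps a sequence in one language to a sequence in another.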
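Recurrent Neural Networks (RNNs)
Starting from the perceptron and feed-forward networks, the lecture motivates the need to handle sequential data.
- Neurons with Recurrence
  - In a conventional fully connected (feed-forward) network, each input $x_t$ produces its output $\hat{y}_t$ independently: $\hat{y}_t = f(x_t)$.
  - For sequential data, the current output may depend on past inputs. A recurrent neuron therefore carries information about the past: $\hat{y}_t = f(x_t, h_{t-1})$, where $h_{t-1}$ represents the memory of the past, or internal state.
- The Core Idea of RNNs
  - An RNN processes a sequence by applying a recurrence relation at every time step: $h_t = f_W(x_t, h_{t-1})$.
    - $h_t$ is the hidden state (cell state) at the current time step.
    - $x_t$ is the input vector at the current time step.
    - $h_{t-1}$ is the hidden state from the previous time step.
    - $f_W$ is a function parameterized by weights $W$.
    - Key point: the same function and the same set of parameters are used at every time step. An RNN has a state $h_t$ that is updated at each time step as the sequence is processed.
- RNN Intuition:
  - An RNN unit takes the current input (for example, a word) and the previous hidden state, and outputs a prediction and an updated hidden state. This is repeated for every element of the sequence.
  - Example (pseudocode): process the sentence "I love recurrent neural" by feeding it to the RNN word by word, updating the hidden state, and predicting the next word.
- RNN State Update and Output:
  - Update Hidden State: $h_t = \tanh(W_{hh}^T h_{t-1} + W_{xh}^T x_t)$
  - Output Vector: $\hat{y}_t = W_{hy}^T h_t$
- Computational Graph Across Time:
  - An RNN can be viewed as a computational graph unrolled across time.
  - The same weight matrices ($W_{xh}, W_{hh}, W_{hy}$) are reused at every time step.
- RNN Implementation: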
  - RNNs from Scratch in TensorFlow: define a custom RNN cell layer, initialize the weight matrices and the hidden state, and implement the forward pass (update the hidden state, compute the output) in the `call` method.
  - Framework implementations (TensorFlow & PyTorch):
    - TensorFlow: `tf.keras.layers.SimpleRNN(rnn_units)`
    - PyTorch: `torch.nn.RNN(input_size, rnn_units)`
  - Students will get hands-on practice with this in the lab.
- Loss computation: a loss is computed at each time step, and the total loss is the sum of the losses over all time steps.
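The state update, output, weight sharing, and summed loss described above can be sketched in a few lines. This is a minimal illustration, not the lab's TensorFlow implementation: it uses plain Python lists to stay dependency-free, the helper names (`matvec`, `rnn_step`, `rnn_forward`, `random_matrix`) are hypothetical, the matrices are stored already transposed so that `matvec(W, v)` plays the role of $W^T v$, and the squared-error loss is an assumption chosen only to make the per-step sum concrete.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(W_hy, h_t)
    return y_t, h_t

def rnn_forward(xs, targets, W_xh, W_hh, W_hy):
    """Unroll the RNN over a sequence, reusing the same weights at every step."""
    h = [0.0] * len(W_hh)            # initial hidden state h_0
    total_loss = 0.0
    outputs = []
    for x_t, target in zip(xs, targets):
        y_t, h = rnn_step(x_t, h, W_xh, W_hh, W_hy)
        # Assumed per-step loss: squared error; the total is the sum over steps.
        total_loss += sum((y - t) ** 2 for y, t in zip(y_t, target))
        outputs.append(y_t)
    return outputs, h, total_loss

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the scale is an arbitrary illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_dim, hidden_dim, output_dim = 3, 4, 2
W_xh = random_matrix(hidden_dim, input_dim, rng)
W_hh = random_matrix(hidden_dim, hidden_dim, rng)
W_hy = random_matrix(output_dim, hidden_dim, rng)

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # a 3-step toy sequence
targets = [[0.0, 0.0]] * len(xs)
outputs, h_final, loss = rnn_forward(xs, targets, W_xh, W_hh, W_hy)
```

Because `rnn_forward` only loops over whatever sequence it is given, the same three weight matrices handle sequences of any length, which is the weight-sharing property noted above.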
Design Criteria and Challenges for Sequence Modeling
Sequence modeling requires a model that can:
1. Handle variable-length sequences.
2. Track long-term dependencies.
3. Maintain information about order.
4. Share parameters across the sequence.
The instructor notes that RNNs meet these design criteria.
- A Sequence Modeling Problem: Predict the Next Word
  - For example, given the words "This morning I took my cat for a", predict the next word ("walk").
  - This is the core task in training large language models.
- Representing Language to a Neural Network:
  - Neural networks require numerical inputs.
  - Encoding Language for a Neural Network:
    - Vocabulary: build a vocabulary containing all the words in the corpus.
    - Indexing: map each word to a unique integer index.
    - Embedding: convert each index into a fixed-size vector.
      - One-hot embedding: a high-dimensional sparse vector in which exactly one dimension is 1.
      - Learned embedding: a low-dimensional dense vector representation learned by a neural network, in which similar words lie closer together in the vector space.
- Complexities of Sequence Modeling:
  - Handle Variable Sequence Lengths: sentences differ in length.
  - Model Long-Term Dependencies: for example, in "France is where I grew up, but I now live in Boston. I speak fluent __.", predicting "French" correctly requires looking back to "France" at the start of the sentence.
  - Capture Differences in Sequence Order: changing word order can drastically change meaning, as in "The food was good, not bad at all." vs "The food was bad, not good at all."
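The vocabulary, indexing, and embedding steps listed above can be sketched as follows. This is an illustrative sketch only: the function names (`build_vocab`, `one_hot`, `encode_one_hot`) are hypothetical, tokenization is a naive lowercase whitespace split, and the "learned" embedding table is just randomly initialized here, standing in for weights that a network would actually train.

```python
import random

def build_vocab(sentences):
    """Vocabulary + indexing: assign each distinct word a unique integer index."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(index, size):
    """One-hot embedding: a sparse vector with a single 1 at the word's index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def encode_one_hot(sentence, vocab):
    """Turn a sentence into a sequence of one-hot vectors."""
    return [one_hot(vocab[word], len(vocab)) for word in sentence.lower().split()]

corpus = ["this morning I took my cat for a walk"]
vocab = build_vocab(corpus)
encoded = encode_one_hot("I took my cat", vocab)

# Learned embedding (assumption: random, untrained values for illustration).
# Looking up row vocab[word] is equivalent to multiplying the one-hot vector
# by the embedding table, but yields a short dense vector instead.
embed_dim = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in vocab]
dense = [embedding_table[vocab[word]] for word in "i took my cat".split()]
```

The one-hot vectors grow with the vocabulary size, while the dense vectors keep a fixed, much smaller dimension, which is the practical motivation for learned embeddings.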
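Backpropagation Through Time (BPTT) and Its Challenges
Training RNNs requires a special form of backpropagation.
- Recall: Backpropagation in Feed Forward Models: parameters are adjusted by computing the gradient of the loss with respect to the parameters.
- RNNs: Backpropagation Through Time:
  - The loss is computed at each time step, and the total loss aggregates the losses over all time steps.
  - Gradients must be propagated backward along the time steps. Because the weights are shared across all time steps, the gradient computation includes contributions from every time step.
- Standard RNN Gradient Flow:
  - Computing the gradient with respect to an early hidden state (such as $h_0$) involves repeated multiplication by $W_{hh}$ and repeated multiplication by the derivative of the activation function.
  - Exploding Gradients: if some values of $W_{hh}$ or the activation derivatives are large, the gradient grows exponentially during backpropagation.
    - Remedy: gradient clipping, which rescales the gradient when it exceeds a threshold.
  - Vanishing Gradients: if some values of $W_{hh}$ or the activation derivatives are small (for example, the derivative of tanh is close to 0 in its saturated regions), the gradient shrinks exponentially toward zero.
    - This makes it hard for the model to learn long-term dependencies, because error signals from early time steps contribute almost nothing to the parameter updates.
    - The model becomes biased toward short-term dependencies. For example, "The clouds are in the sky" is easy to predict, while "I grew up in France, ... and I speak fluent French" is hard.
    - Remedies:
      - Choose a suitable activation function.
      - Initialize the weights appropriately.
      - Improve the network architecture, for example by using an LSTM or GRU.
- Gating Mechanisms in Neurons:
  - Idea: use gates to selectively add or remove information within each recurrent unit, controlling the flow of information.
  - Gates are implemented with pointwise multiplication and a sigmoid activation.
  - Networks such as Long Short Term Memory (LSTM) and GRU rely on gated units to track long-term information better.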
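The gradient behavior and the gating idea above can be illustrated with a toy calculation. This sketch deliberately simplifies: `backprop_factor` replaces the repeated matrix products in BPTT with a single scalar raised to a power, `clip_by_norm` shows one common form of gradient clipping (rescaling by the L2 norm; the lecture does not specify which variant), and `apply_gate` shows only the sigmoid-times-value pattern, not a full LSTM cell. All function names are hypothetical.

```python
import math

def backprop_factor(w_hh, steps, activation_grad=1.0):
    """Scalar stand-in for the repeated product in BPTT: (w_hh * f')^steps."""
    return (w_hh * activation_grad) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(values, gate_logits):
    """Gate = sigmoid(logit) in (0, 1), applied by pointwise multiplication."""
    return [v * sigmoid(g) for v, g in zip(values, gate_logits)]

vanishing = backprop_factor(0.5, steps=50)   # factor < 1: shrinks toward zero
exploding = backprop_factor(1.5, steps=50)   # factor > 1: grows very large

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is rescaled to norm 1

# A strongly positive logit lets the value through; a strongly negative one blocks it.
gated = apply_gate([2.0, 2.0], gate_logits=[10.0, -10.0])
```

Clipping changes only the magnitude of the update, not its direction, which is why it helps with exploding gradients but does nothing for vanishing ones; those call for the architectural remedies listed above.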
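RNN Applications and Limitations
- Example Task: Music Generation:
  - Given part of a score, predict and generate the next note. Students will train an RNN model to generate new music in the lab.
  - The lecture mentions a startup that used a neural network to complete Schubert's Unfinished Symphony.
- Example Task: Sentiment Classification:
  - Given a text sequence (such as a tweet), output its sentiment (for example, the probability of positive sentiment).
- Limitations of Recurrent Models:
  - Encoding bottleneck: all prior information must be compressed into a fixed-size hidden state vector $h_t$.
  - Slow, no parallelization: computation is sequential ($h_t$ depends on $h_{t-1}$), which makes it hard to exploit parallel hardware.
  - Not truly long memory: despite improvements such as the LSTM, capturing very long dependencies remains difficult.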
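Beyond Recurrence: Self-Attention
The lecture explores how to overcome the limitations of RNNs, which leads to the self-attention mechanism.
- Goal of Sequence Modeling
  - Desired capabilities: process continuous streams of data efficiently, parallelize computation, and have long memory.
  - Problems with simply concatenating all inputs and feeding them into a dense network: it does not scale, it loses order information, and it has no long memory.
  - New idea: identify and attend to the important parts of the input sequence.
- Attention Is All You Need
  - A paper published in 2017 that introduced the Transformer model, whose core is the self-attention mechanism.
  - The "T" in GPT stands for Transformer.
- Intuition Behind Self-Attention:
  - Core idea: attend to the most important parts of the input.
  - Steps: 1. Identify which parts to attend to. 2. Extract the features with high attention weights.
  - Analogy to a search problem:
    - Query (Q): the user's search term (for example, "deep learning").
    - Key (K): a descriptor of each item in the database (for example, YouTube video titles or metadata).
    - Compute the similarity between the query and each key to obtain the attention weights (attention mask).
    - Use the attention weights to extract the value (Value, V) of the item most relevant to the query (for example, the video itself).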
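- Learning Self-Attention with Neural Networks:
  - Goal: identify and attend to the most important features in the input sequence.
  - Running example: the sentence "He tossed the tennis ball to serve".
  - Steps:
    - Encode position information: because the data is fed in all at once (not time step by time step), position must be encoded explicitly for the model to understand order. Adding the word embedding and the positional encoding ($p_i$) gives a position-aware encoding.
    - Extract the query, key, and value:
      - Pass the position-aware encoding through separate linear layers to obtain the Q, K, and V vectors (or matrices). Each linear layer has its own independent weights.
    - Compute attention weighting:
      - Attention score: compute the pairwise similarity between each query and all keys. A common choice is dot-product similarity.
        - Formula: $\text{Score} = Q \cdot K^T$
      - Scaling: to keep the dot products from growing so large that gradients become very small, the score is usually divided by a scaling factor, the square root of the key dimension ($\sqrt{d_k}$). Formula: $\frac{Q \cdot K^T}{\sqrt{d_k}}$
      - Attention weighting: normalize the scaled scores with a softmax to obtain, for each query, a distribution of weights over the keys.
        - Formula: $\text{Weights} = \text{softmax}(\frac{Q \cdot K^T}{\sqrt{d_k}})$
    - Extract features with high attention:
      - Multiply the attention weights by the corresponding value vectors (or matrix) and sum them (a weighted average).
      - Formula: $A(Q,K,V) = \text{softmax}(\frac{Q \cdot K^T}{\sqrt{d_k}}) \cdot V$
- Self-Attention Head: the Q, K, V computation above makes up one self-attention head.
- Multi-Head Attention: several self-attention heads can be used in parallel. Each head can learn to attend to a different aspect of the input sequence or a different representation subspace, and their outputs are then concatenated or combined to increase the model's expressive power.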
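The steps above (position-aware encoding, Q/K/V projection, scaled dot-product scores, softmax, weighted sum of values) can be put together in a single-head sketch. Several things here are assumptions rather than details from the lecture: the positional encoding uses the sinusoidal scheme from the "Attention Is All You Need" paper (the lecture does not go into which scheme is used), the linear layers are bias-free random matrices, the token embeddings are random toy values, and all helper names are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def self_attention(X, W_q, W_k, W_v):
    """Single head: softmax(Q K^T / sqrt(d_k)) V, with Q, K, V projected from X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

rng = random.Random(0)
seq_len, d_model, d_k, d_v = 3, 4, 2, 2

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

embeddings = rand_matrix(seq_len, d_model)           # toy word embeddings
pe = positional_encoding(seq_len, d_model)
X = [[e + p for e, p in zip(e_row, p_row)]            # position-aware encoding
     for e_row, p_row in zip(embeddings, pe)]

W_q, W_k, W_v = rand_matrix(d_model, d_k), rand_matrix(d_model, d_k), rand_matrix(d_model, d_v)
output, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is one token's attention distribution over all tokens, and every row is computed from the same matrix products, with no dependence on a previous time step. A multi-head variant would run several such heads with separate `W_q`, `W_k`, `W_v` and concatenate their outputs.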
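Applications of Self-Attention
Self-attention, and the Transformer models built from it, have been highly successful in several areas:
* Language Processing: large language models such as BERT and GPT.
* Biological Sequences: for example, protein structure prediction models (AlphaFold).
* Computer Vision: for example, Vision Transformers (ViT), which apply the Transformer architecture to image data.
Course Summary and Next Steps
The instructor summarizes the main points of the lecture:
1. RNNs are well suited to sequence modeling tasks and model sequences through a recurrence relation.
2. RNNs are trained with backpropagation through time (BPTT), which suffers from gradient problems.
3. RNNs can be used for music generation, classification, machine translation, and more.
4. Self-attention models sequences without recurrence and is the foundation of many large language models (LLMs).
Next steps:
* There will be guest lectures on large language models on Thursday and Friday.
* The software labs are available on GitHub (in TensorFlow and PyTorch versions). Students will implement an RNN for music generation and get hands-on experience with LLMs.
* A reception to kick off the course will be held at Kendall Square after the lecture.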