speaker 1: All right, well, it's fantastic to have everyone here. My name is Ava. I'm also an instructor for the course, and this is now lecture two of 6.S191, on deep sequence modeling. As Alexander alluded to at the end of his lecture, deep sequence modeling is a very powerful concept that, especially as of late, has gotten a lot of interest and excitement around it due to the advent of large language models. And excitingly, in the course this year, we're going to have two and a half guest lectures on large language models on Thursday and Friday. Our intention with this lecture is to really give you the foundations of what sequence modeling is all about, so that when we go into those lectures on cutting-edge, new-frontier topics, you can grasp them from the fundamentals and foundations up. Okay, so with that in mind: in the first lecture, Alexander introduced the very basics, the essentials of neural networks, how we train them using backpropagation, and what it means to define a feedforward model. In this next talk, we're going to turn our attention to applying these neural networks to problems that involve sequential processing or sequential modeling of data. We'll walk through the problem formulation and the models step by step, side by side, building up our intuition about how we can build these networks, starting right where we left off at the end of the last lecture. To first set the stage, what I always like to do is motivate this notion of sequence modeling and sequential data with a very intuitive, super simple example. Let's say we have this image of a ball in 2D space, and our goal is to predict where the ball is going to travel to next. Now, if I didn't give you any prior information about the ball's history, its motion in this 2D space, any guess you place on the next position of the ball will be a random guess. If instead, in addition to the current location of the ball, I gave you some information about its history, say its prior locations, your problem becomes much easier, and effectively the task is reduced to: given this past history of the ball, predict where it's going to go next. And I think we can all agree that in this example, the ball appears to be moving from left to right. So this is what we mean when we think about sequence modeling. And beyond this simple example, as we'll see throughout the course, sequential data and sequence modeling are really all around us. For example, the audio from my voice talking can be split up into a sequence of sound waves, effectively chunked and processed as a sequence. We can do this similarly with words and text in natural language, where we can chunk or split up text into a sequence of individual characters or individual words, and process it in that manner. Beyond this, there are many more cases in which sequential processing becomes apparent, from medical signals like ECGs, to stock prices in financial markets, to biological sequences like nucleic acids or protein sequences, to weather, to motion and video, and more. And so really, this paradigm of sequence modeling unlocks a great range of applications and real-world use cases for deep learning in the wild. So what are some concrete problem formulations or tasks we can think about when we start to think about sequence modeling?
In the first lecture, we learned about the very basic prediction problem, like a binary classification problem of: will I pass this class, yes or no? Now, instead of going from a single input to a single output, like a class label, with sequence modeling we want to process a sequence of inputs, such as the words in an individual sentence, and produce an output at the end, which may similarly be a classification label, like: does a sentence have a positive or negative sentiment or feeling associated with it? It could also be a sequential output, like taking an image and now producing or generating text that captions that image. Or similarly, we can think about many-to-many generative or predictive tasks, where maybe we have sequences in one language, say English, and we want to translate them into sequences of another language. So these are the fundamental problem definitions or problem setups we can start to think about when we have the capability to process sequential data. Now, with that in hand, the question becomes: how do we actually build up a neural network model to be able to process and handle this unique type of sequential data? Today we'll spend about half the time in the lecture talking about where the history of the field got started, with a new type of neural network architecture that kicked off our real ability to model sequences. And then in the second half, we'll talk about a powerful mechanism that is being used today in the most state-of-the-art sequence models we see all around us. In both cases, my goal is really to convey the fundamentals of how we actually do this. And so we're going to build up from the perceptron that Alexander introduced and go step by step to develop an understanding of these models. Okay, so let's do exactly that. Let's go back to the perceptron, which we studied in lecture one. We saw a diagram like this, where we have a set of inputs, x one through x m, and these are numbers; they're represented numerically. What we do in a perceptron is take these inputs, multiply each by a corresponding weight value, add them all together to get this state z, and pass z through a function that's nonlinear, also called an activation function. And that leads us to our predicted output, y hat. While we can have multiple inputs coming in, here you can think of these inputs as a single slice, a single time step, in a sequence. We also saw that we're able to stack these individual perceptrons together to comprise a layer of a neural network. And here we can also extend to go from a set of inputs to a set of multiple outputs, shown in purple. But still, in this example, we're operating on a static single step. There's no notion of sequence, no notion of time. These inputs are all individual inputs at just one slice, one time step. We can simplify our diagram further. All I'm doing here in the middle is that I've collapsed that visual of the hidden layer down to just this abstracted green box. So we have the same operation of mapping an input, a vector of length m, to an output vector. What can we do now to try to apply a network like this to time series data or sequential data? Well, maybe let's start off with something perhaps a bit naive, as a first pass. Let's take this same model. All I've done here is rotate it, so that instead of going left to right, it goes bottom to top, input to output.
And what if we repeat this multiple times? We have our input vector at some time step t, here t zero. We feed it into our neural network and we generate an output prediction. And maybe we do this over multiple time steps, because in sequential data we don't have just a single time step; we have multiple individual time steps, from t zero to t one, t two and so on in the sequence. We could do this in isolation, right? We take the inputs at these individual time steps, pass them through the model, and generate an output. And here again, it's a simple function, where we're applying this function defined by our neural network to the input at a time step t to generate a prediction y hat of t. But what's the problem here? We know that our output vector, y hat, at a particular time step is just going to be a function of the input at that time step. And if this is inherently sequential data, it's probably in a sequence for a reason, because there's some inherent dependence between time steps within that sequence. And in this way, we're completely ignoring the inputs at the previous time steps when making a prediction at a later time step. These are all treated in isolation. To predict an output at, say, time step two, we could probably benefit a lot from taking in the information from time step zero and time step one. How do we circumvent this? How could we possibly relate these time steps to each other? Well, fundamentally, what we're trying to do is relate the network's internal computations at these individual time steps to the history of computations from previously observed time steps. And we want to pass this information forward through time, to maintain that information as we move through the sequence. So maybe we can try that out. What if we were able to define a way to link the neurons' internal state from a previous time step to the computation at a later time step? Let's make this concrete. We can define a variable, we'll call it an internal state h of t, that's maintained by the neurons in this network, and we pass that state on, time step to time step, across time. And our goal is to imbue this internal state with some notion of memory, some record of the prior computations that the network has made. What this means, a little more concretely, is that now our prediction, y hat of t, at a particular time step should depend not only on the input at that time step, but also on this state that's been carried on from the prior time step: this past history and the current input. And so what's really powerful here is that we're now starting to define more formally a relation where these output predictions depend not only on the observed input, but also on this past memory. And as you can hopefully start to get intuition for, that past memory is itself dependent on the prior inputs. So we have this idea of some sort of recurrence, something that's repeated and maintained over time as the network makes these computations. And because the output we're producing is a function of not only the current input but also this past state, we can describe this type of network via what we call a recurrence relation. We can visualize this in a couple of different ways. Here on the right, I'm showing these time steps unrolled in slices over time, where we've made these links visualized by this variable h, but we can also represent this on the left via this cycle, this loop.
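Written out in symbols (my own shorthand, matching the description above: h of t is the state, x of t the input, f sub W the function computed by the network's weights W, and g the transformation that produces the output), the recurrence relation is:

```latex
h_t = f_W\left(x_t,\; h_{t-1}\right), \qquad \hat{y}_t = g\left(h_t\right)
```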
This cyclic representation captures the concept of the state maintaining a running record, effectively, of the network's internal computations, which gets updated with each time step. And so this is the notion of a time series model that has a sense of recurrence. And really, this is a very foundational concept in all of sequence modeling: this notion of what we call recurrent neural networks. Hopefully this example, and this buildup from where we started with the perceptron, gives you the intuition for how we can start to model sequential data using these time series models called recurrent neural networks. So let's keep going. Let's continue to build from this foundation and now get a little more formal about the operations that go into the internals of these recurrent neural networks, or RNNs. As we started to see, the key idea is that we have this internal state, h of t, and that state is going to be updated at each time step as we process the sequence. We can do this by defining what we call a recurrence relation: a way to process a sequence time step by time step. And what we want is for our internal state h to be a function of not only the input, but also the prior state, so that there's some notion of memory carried through. To formalize this, the cell state h of t is defined by some function, parameterized by a set of weights W, that depends not only on the input at that time step but also on the prior state from the previous time step. And we recurrently apply that same function, with the same weights, time step by time step, so that we get this recurrent update to our cell state as the sequence is processed. An additional way to build up this intuition about RNNs is to think about it in pseudocode. So here we're going to walk through a Python pseudocode example that helps build up this intuition. Let's say we want to build up the RNN from scratch. We're going to define some RNN, my_rnn, and we're going to initialize it with a hidden state, say, set to zero. And now our input is this little fragment of a sentence: "I love recurrent neural". Our task is to use the individual words in the sentence to predict the word that comes next in that sequence. What we do is loop through: for each word in this sentence, we use our RNN to take that word and the last hidden state, predict some output, and update the hidden state, and we do this iteratively, such that at the end we can generate a prediction for the next word after we've processed these inputs (see the runnable sketch below). This is the fundamental notion of a state update and output that is the core of the RNN. So again, walking through this from the bottom up: given our input vector, we define a function to update the hidden state. And now we've just introduced a little more formality that makes these weight definitions and functions explicit. But again, the update to the hidden state is a function of both the input at that time step and the prior hidden state from the previous time step, and that updates our hidden state. And now we can generate a prediction, an output vector, by taking that hidden state and applying a weighted transformation to it. So this is the core of how the RNN both updates its internal memory, its hidden state, and produces a prediction, an output, at each individual time step.
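Here is a runnable toy version of that pseudocode loop, just to make the state update concrete. The vocabulary, the one-hot inputs, the weight names (W_xh, W_hh, W_hy), and the random, untrained weights are all illustrative assumptions of mine, not the lecture's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "love", "recurrent", "neural", "networks"]
word_to_index = {w: i for i, w in enumerate(vocab)}

hidden_dim, vocab_size = 8, len(vocab)
W_xh = 0.1 * rng.normal(size=(hidden_dim, vocab_size))   # input  -> hidden
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden
W_hy = 0.1 * rng.normal(size=(vocab_size, hidden_dim))   # hidden -> output

def one_hot(word):
    v = np.zeros((vocab_size, 1))
    v[word_to_index[word]] = 1.0
    return v

def my_rnn(x, h):
    # One RNN step: update the hidden state from (input, previous state), then predict.
    h_new = np.tanh(W_hh @ h + W_xh @ x)
    y = W_hy @ h_new                      # unnormalized scores over the vocabulary
    return y, h_new

hidden_state = np.zeros((hidden_dim, 1))
sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(one_hot(word), hidden_state)

next_word = vocab[int(np.argmax(prediction))]   # untrained, so effectively a random guess
print(next_word)
```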
So by seeing this and visualizing this RNN as this recurrent, cyclic diagram, I mentioned that we can also represent it by taking these individual time slices and separating them out on a time axis, unrolling them across time. And if we do that, I think the intuition becomes perhaps even more clear, where, again, starting from the first time step, we can unroll step by step, t one, t two, all the way to time t, and you can see this internal state being passed on, time step to time step, as we generate the predictions. We can also bring back the formal math to this diagram, where now you can see we have one weight matrix that transforms the input into the computation of the hidden state, a weight matrix that defines how the hidden state gets updated, and finally one that transforms the hidden state into the output prediction. Importantly, these are the same weight matrices at each individual time step; they're reused at every time step in the sequence. So this tells us how we actually make these updates and how we predict the outputs. But what we need in order to actually train this network is to define a loss. As Alexander introduced, in order to train a neural network, you need to define some function that tells you how close the predictions are getting to the desired behavior you seek. With RNNs, that exact same concept still applies. The only difference is that we can compute a loss at each slice, at each of these individual time steps in the sequence, and get a total loss by summing over all the time steps in our sequence. So in total, this defines what we call our forward pass: how we make predictions time step by time step, and how we use those predictions to compute a loss. We can walk through an example of how you can implement an RNN yourself, bottom up, from scratch, in a library like TensorFlow, where we define the RNN as a layer, initializing the weight matrices as attributes of our RNN class. And, importantly, we define in its call function how we do that exact forward operation, that forward pass I introduced on the prior slide (a sketch along these lines appears below). As you can see, it consists of two key lines: the first being the update of the hidden state according to the input and the prior hidden state, and the second being the prediction of the output, which is a transformation of the hidden state at that time step. And we return both of these, the prediction and that hidden state value. As you'll see, this is a really good way to move, now thinking in code, from that pseudocode intuition we showed earlier to how you can define an RNN class yourself from scratch. Moving one more step forward, you can build from this intuition to learn how to operationalize an RNN through layers and modules that are already implemented in common machine learning frameworks like TensorFlow and PyTorch. And in the first lab of this course, on using RNNs for sequence modeling, you'll get hands-on experience working with these functions and classes in either of these libraries. Okay. So hopefully that gives you the intuition for how we've built up from that first example: a static input mapped to a static output.
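And here is a minimal sketch of that from-scratch RNN layer in TensorFlow, in the spirit of what the slide describes. The class name, constructor arguments, and the choice of tanh are my own assumptions; in practice you would usually reach for the built-in tf.keras.layers.SimpleRNN or LSTM layers rather than writing the cell yourself:

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()
        # Weight matrices: input -> hidden, hidden -> hidden, hidden -> output
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])
        # Hidden state, initialized to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Line 1: update the hidden state from the current input and the prior state
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))
        # Line 2: compute the output as a transformation of the new hidden state
        output = tf.matmul(self.W_hy, self.h)
        # Return both the prediction and the hidden state value
        return output, self.h
```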
From there, we've moved to these diverse types of tasks and new sorts of problems we can tackle when we're able to process and handle sequential data, whether that's taking a sequence and producing a class label, or actually doing something like next-word or next-character prediction to generate and produce a sequence as output. And to foreshadow this latter example a bit: not only is this what you'll get hands-on experience with in our software lab, but this notion of many-to-many sequence modeling is really the backbone of how language models work, and you've just started building some of that intuition today. All right. So let's now think about how we operationalize this in the real world. Go ahead. speaker 2: [inaudible question about how the hidden layers are connected] speaker 1: So the question is what defines the connection between the hidden layers. The important thing is that each of these individual green blocks can itself have many layers; it doesn't necessarily have to be one, it could be many. You can think of that as a unit, and that unit contains some number of layers, but that operation is applied at every individual time step in the sequence, using the same set of layers and weights in that unit. You compute the loss at each of those steps, and then, as we'll see, when you actually update the weights, that's done by taking the loss from all the individual time steps and updating after you've processed all the time steps. speaker 2: Just in the same way as with feedforward networks? speaker 1: Yes; in the same way that in feedforward networks you have weights that connect layer to layer, in an RNN that has multiple layers within one unit, there are weights that define those connections. Any additional questions before we transition? Okay. So now let's think about how we actually operationalize and bring this notion of sequence modeling to a real-world example. Sequences are rich and interesting for a few reasons. One is that they're tremendously variable. It's not like an image, where you have a height and a width, and for all the images in your data set that height and width are fixed. In something like language, sequences can be short, they can be long, they can be anywhere in between, so we need to be able to handle this variability. The other richness and nuance is that, inherently, with things that depend on time, you can have instances where there's a really short-range interaction or dependency, or instances where something at the very beginning of a sequence dictates the meaning or value of something at the very end. So there's this notion of spread, of long-term dependencies. And fundamentally, there's this notion of order: our model needs to do a good job of representing and reasoning about this order through the approach we take. So we'll use these criteria to motivate how RNNs can do well here, but also to look at some shortcomings of RNNs when it comes to these real-world operational criteria that our sequence model needs to meet. To do that, we're going to walk through a very concrete example, which has perhaps become the quintessential sequence modeling problem out there: given a sequence of words, predict the word that comes next.
To underscore this: it's a very important task, because not only is it beautiful, simple, and intuitive, it turns out to be incredibly powerful. The very powerful language models we see today are trained on this exact task of predicting the next word. So let's say we have this example sentence: "This morning, I took my cat for a walk." Our task is, given this set of words, to predict the next word. And let's say we want to build a sequence model like an RNN to do this. What is our very first step? Any ideas? Breaking the text into chunks, into words? Yes. So let's say now we've broken the sentence into chunks, into words. How can we actually build a neural network model to do this? Vectorize them, exactly. The core consideration, after we've broken this text up into words, is that we need a way to actually represent it to a model. Remember, all neural networks are function approximators that operate on numbers: vectors and matrices, numerical representations. So that's exactly what we need to do if we want a model that takes a word in and predicts the next word out. We can't just pass it in as words; we need a way to represent these words numerically, in a representation a neural network can operate on. This is the notion of vectorizing the input, or encoding the language for a neural network to operate on, and it's a really core concept in language modeling, and in neural network and machine learning modeling in general. The solution we're going to introduce right now is this notion of embedding, which is the idea of taking an input that can be in some form, like words, first transforming it into an index, and then mapping that index to a vector of some fixed size. To break this down step by step: let's say we have our body of all possible words we could see in all the sentences we could encounter. We call this a vocabulary, a corpus of words that covers all the possible words we could encounter, and this vocabulary has to have some fixed size. What we can then do is take the individual words in this vocabulary and map each to a number, an index: let's say "a" maps to one, "cat" maps to two, and so on. These indices then give us a way to take that index, that slot, and look up a vector that represents that word, and I'll show you exactly what I mean by this. The last step is this embedding operation, which means mapping an index to a vector of fixed size. One way we could do this is what we call a one-hot, or binary, encoding or embedding. What I've done here is define a sparse vector of a fixed size, all zeros except a one at the index that corresponds to that word. Based on that index value, I can effectively encode the identity of the word, and look up and backtrack to what the word was from that index. The other thing I could do is use a neural network layer to learn an embedding of those words in some fixed-length, lower-dimensional space. This is very similar: we've just done this operation of mapping that index to an encoding, such that similar words end up in a similar region of this embedding space, but we're still able to retrieve our vector representation of the word using that index (a small sketch of both options follows below).
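As a small illustration of those two options (the one-hot encoding and a learned, lower-dimensional embedding looked up by index), here is a sketch with a toy vocabulary and dimensions of my own choosing; in a real model the embedding matrix would be a trainable layer, for example tf.keras.layers.Embedding, rather than a fixed random matrix:

```python
import numpy as np

vocab = ["a", "cat", "for", "i", "morning", "my", "this", "took", "walk"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vocab_size = len(vocab)

def one_hot(word):
    # Sparse binary vector: all zeros except a one at the word's index.
    v = np.zeros(vocab_size)
    v[word_to_index[word]] = 1.0
    return v

# A learned embedding is just a (vocab_size x embedding_dim) matrix of weights;
# embedding a word means looking up the row at that word's index.
embedding_dim = 4
embedding_matrix = np.random.default_rng(0).normal(size=(vocab_size, embedding_dim))

def embed(word):
    return embedding_matrix[word_to_index[word]]

print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(embed("cat"))     # a dense 4-dimensional vector
```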
So I think a really good way of thinking about these vectorization or embedding operations is as indexing to look up a fixed numerical representation for these words, and this is a really important concept. So now we can do that for our sequences: we can transform our words into this vector representation. And to see why sequence modeling is difficult and complex, we can look at some examples of how that comes to life. We have the complexities that arise when we think about variable sequence lengths: we could have short sequences, medium-length sequences, longer sequences, and in all cases we want our neural network model to track these dependencies consistently, to still do a good job of predicting the next word at the end. That also relates to the ability to track and store long-term dependencies in these sequences, where in many cases we may need information from the very beginning of the sentence to predict the next word at the very end. That matters because, depending on how we shuffle the words around, the order of a sequence can convey very different semantic meanings for our prediction task. So hopefully this example of language modeling and next-word prediction gives you a sense of why sequence modeling can be very rich and very complex as a deep learning task. Okay, so that gives us some concrete, real-world understanding of why sequence modeling is rich and challenging. To actually train a sequence model like an RNN, though, we need some special considerations. We're still going to use the fundamental algorithm that Alexander introduced, backpropagation, but now we need to introduce something to handle the time dependence. So, like we've been doing throughout this lecture, let's build up from the first principles we started with. Let's go back to how we train our feedforward models. We first take our inputs and make a forward pass through the network, going from input to output. To actually train the model, we compute the loss and backpropagate gradients: we go backwards through the backpropagation algorithm, taking the derivative of the loss with respect to each of the parameters, each of the model weights in our network, and then we adjust and move those parameters in the direction that will minimize the loss; we take a step. With RNNs, we saw a preview of how the loss is actually computed time step by time step. Now, when we want to train RNNs, we not only need to consider an individual loss, but the aggregate loss across all these individual time steps. What that means is that instead of backpropagating loss values through a single feedforward network, we need to backpropagate errors across these individual time steps, such that we can carry errors from late time steps all the way back to the beginning. The algorithm for doing this is what we call backpropagation through time: errors flow from late in the sequence all the way back to the very beginning. What this means practically is that you have a chain of repeated computations, repeatedly multiplying by the same weight matrix many times over, as well as repeated use of the derivative of the activation functions in these networks.
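In symbols (the standard write-up of backpropagation through time, using the state update written earlier), the total loss is the sum of the per-time-step losses, and the gradient with respect to the recurrent weights involves repeated factors linking each state to the previous one:

```latex
L = \sum_{t} L_t,
\qquad
\frac{\partial L_t}{\partial W_{hh}}
  = \sum_{k=1}^{t}
      \frac{\partial L_t}{\partial \hat{y}_t}\,
      \frac{\partial \hat{y}_t}{\partial h_t}
      \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
      \frac{\partial h_k}{\partial W_{hh}}
```

Each factor in that product contains the recurrent weight matrix and the derivative of the activation function, which is exactly the chain of repeated multiplications described above.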
That can pose some very practical challenges, which, for the sake of time, we're not going to go too deep into. But the important thing to keep in mind is that these standard RNNs can be somewhat difficult to train stably, because you can get many values that are larger than one; you multiply them together and the gradient explodes. Conversely, you can have many values that are very small; you multiply them together and the gradient vanishes, shrinking down very close to zero. This has real practical implications, because it makes it really hard to track the long-term dependencies we care about in our sequence. If our gradients are unstable, either blowing up or shrinking down to nothing, we can't effectively take the gradients from late time steps and pass them back to the earlier time steps to encourage our model to retain that information. And so, in the literature and in the sequence modeling community, there's been a lot of active research on improved architectures based on the RNN to try to solve this problem. The core concept is that they add some complexity to the RNN unit itself, effectively adding additional functions that selectively control the amount of information passed on in the update of the hidden state. One very prominent way of doing this is a network called an LSTM, or long short-term memory network. This was introduced quite some time ago now, but it has been foundational to a lot of the sequence modeling work that's come since. To give you a quick look at one of the applications of RNNs that you'll actually get hands-on with today, I want to highlight this example of music generation. This is a problem that lends itself naturally to sequence modeling and to a recurrent neural network architecture, because, let's say we want to predict and generate a new piece of music: one way we could do this is by taking the individual notes in a piece of music and, very similar to the task of predicting the next word, building a model that, given the past history of musical notes, learns to predict the most likely next musical note in the sequence. And this is exactly what you will do today in our software labs, where you'll train an RNN model to generate brand new music that's never existed before (a sketch of this kind of model follows below). In fact, you're not the only ones who have had a go at this. This is an example from a few years ago of a startup working on music generation. They trained a neural network model on classical music and tested it by finishing a work by the composer Franz Schubert, the famous Unfinished Symphony: they gave the model the two existing movements of the symphony and tasked it with generating music corresponding to the third movement. So let's see if we can play this. So, it's pretty good. Maybe there are some classical music aficionados in the audience who are familiar with this work, but I always appreciate it, because it gets at some of the themes we're talking about in terms of the capabilities of these sequence models. So far, we've talked exclusively about this sequence modeling architecture called recurrent neural networks, and I want to take a moment to appreciate that. It's very remarkable that we're able to understand the fundamentals of sequence modeling and build up to some of these capabilities using RNNs.
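As a concrete illustration of the next-note setup just described, here is a sketch of the kind of model you might use: musical symbols tokenized as integers, an embedding, an LSTM, and a dense layer scoring the next symbol. The vocabulary size and layer sizes are placeholders of my own, not necessarily the lab's configuration:

```python
import tensorflow as tf

vocab_size = 83        # number of distinct musical symbols in the corpus (assumed)
embedding_dim = 256
rnn_units = 1024

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(rnn_units, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),   # logits over the next symbol at each time step
])

# Trained with a per-time-step cross-entropy between predictions and the sequence shifted by one.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```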
But like any technology and any method, RNNs have some core limitations, and in fact those limitations have motivated the development of new architectures and approaches in sequence modeling, as well as improved versions of RNNs that have done better and solved some of these limitations. A couple of things are important to keep in mind when thinking about RNNs. One is that the core concept, as we talked about, is this notion of the state h of t. Remember, in everything we're discussing, these neural networks operate on vectors and matrices of numbers, and the state of an RNN is a fixed-length vector. There's only so much information that can be encapsulated in something of fixed size, and so that presents what we think of as a bottleneck on the amount of information the RNN state can hold. Additionally, because RNNs process information time step by time step, they can be very difficult to parallelize, to process things simultaneously; we have this inherent sequential dependence. And finally, related to both of these points, that encoding bottleneck of the state can limit the long-term memory capacity of these kinds of recurrent architectures. So, to think about how we can try to overcome this, let's go back to our fundamental goal of sequence modeling, which is to take a sequence of inputs, use a neural network to compute some features or states representing those inputs, and then generate predictions according to that sequence. With RNNs, we said we're going to process this time step by time step, using recurrence. But we also saw that, inherently, this time step by time step processing places some real limitations on the capabilities of RNNs. Ultimately and ideally, we want to process our sequence very efficiently, perhaps in parallel, so that we can generalize to long sequences and do so efficiently, and also have this desired attribute of being able to track the important dependencies in the sequence effectively. And so a question that was posed a few years ago is: what if we could tackle the sequence modeling problem without having to deal with the data time step by time step? Maybe we can eliminate the need for recurrence. Maybe we could do this by squashing everything together: ignore the notion of individual time steps, concatenate all our inputs together into one vector, feed it into something like a feedforward model, generate an output at the end, and hope that it makes sense. Well, if we do this naive first approach, yes, we've eliminated the need for recurrence, so we don't have to process our data step by step, and that's good. But this doesn't seem very scalable, because if we're just trying to use a dense network, that's not going to be very efficient. Also, we've destroyed all information about order: we've squashed the inputs into a concatenated vector, and by doing so we've destroyed any hope of remembering things that appeared earlier or later and relating them to each other. So this has motivated a different way of thinking about sequence modeling, which is the idea of taking a representation of sequential data and defining a mechanism that can, on its own, pick out and look at the parts of that information that are important relative to other parts.
Thinking about this in other words: can we define a way to take in a sequence, identify and attend to the important parts of that sequence, and, moreover, model the dependencies within that sequence that relate to each other? This is the core idea of a very powerful mechanism called attention. In 2017, there was a paper published, called "Attention Is All You Need," that introduced this mechanism. And so if you've heard of a model like ChatGPT or GPT, the T in that acronym stands for transformer, and a transformer is a type of neural network architecture that can be applied not only to language data but to other types of sequential data as well. The foundational mechanism of a transformer, what makes it different, is this operation of attention. So in this lecture we're going to talk about that attention mechanism, and in later lectures you'll learn more about how these transformers are actually applied as language models and in other applications as well. So we're going to break down the core concept of attention step by step. All right, let's do that. Attention itself is a very informative word. What it means is that we have this inherent ability, as humans, to look at an input and automatically zoom in and pick out the things that are salient, the important features. So let's build up our intuition, starting with an image. How do we figure out what's important in this image? One way we could do it, naively, is to go pixel by pixel, left to right, back and forth, and scan the image to compute some value of how important each individual pixel is. But of course our brains don't operate like that. We're able to automatically look at the image and pick out and attend to the important parts. The first part of this problem is this notion of identifying which parts of some input are important, and then, ultimately, we want to use that identification to extract the features, the components of the input, that correspond to these high attention values. To think about this more concretely, this notion of identifying the parts to attend to is really similar to search. When we do something like a search, we present a question and we're trying to seek an answer. Let's say you came to this class and you had the question: how can I learn more about neural networks, deep learning, and AI? Maybe one thing you could do, besides coming to this class, would be to go to the Internet, with all the videos and all the materials available there, and do a search to find something that matches your query. Let's say you go to a giant video database like YouTube and you put in your query, your ask: deep learning. That's your search topic. Now, what search does is, let's say, go through every video in this database and extract some informative nugget of information, a key, that represents the core element of that video, a descriptor of that video. Now we have our query and we have a set of keys, and our task is to actually do the search. How could I go about this? Well, I want to see how closely my query, what I searched for, matches those keys, those key indicators in the database; I want to see how similar they are. So I'll do this step by step, asking: how similar is my query to these keys? First instance, a beautiful video about elegant sea turtles: not similar. A video from our past lecture on deep learning: yes, similar. A key related to Kobe Bryant's fadeaway: not similar.
So now we've identified the key that's important, the one we want to attend to. Our last task is to actually extract the value associated with this similar match that we found. We want to extract the features we want to pay attention to, the video itself, and we'll call this the value. And because our search was implemented with a good attention mechanism, we've identified the best deep learning course out there for you and your query. This concept is really the core intuition behind attention, and it's very closely related to how the attention operation works in neural networks like transformers. So now let's go back to our sequence modeling problem, where we have a series of words and we want to predict the next word in this sentence. If we break this down step by step: first, remember, we don't want to process this information time step by time step; we've eliminated that need for recurrence, and we're going to feed in the data all at once. But we still need a way to encode some notion of order. So what we're going to do is put in an embedding called a positional embedding, which effectively gives us a way to encapsulate the relative position of the elements in that sequence. We're not going to go into great detail about positional embeddings, but you can think of them as an encoding that gives us some representation of position in the sequence. Now we take those position-aware encodings, and we operationalize that search operation to extract three sets of matrices that we call the query, the key, and the value, just like I introduced before. The way we do that is we take the positional embedding, which represents our sequence in a position-aware way, and use a neural network layer to produce each of these individual matrices: the query, the key, and the value. The important thing to keep in mind is that it's the same positional embedding that's reused in this mechanism of self-attention, but these are different neural network layers that yield different values for each of the query, key, and value matrices. What that means is that they can effectively capture different information, as we'll see. Now, our next step in the search operation was to figure out how similar the query is to the key, and that's exactly what we're going to do next with attention. To do that, remember, these are numeric vectors or matrices, so we need a mathematical way to compute the similarity between two sets of numerical features. Let's say we have two vectors, our query vector and our key vector. What we can do mathematically, using linear algebra, is measure the similarity of those vectors in space using the dot product, which tells us how close they are to each other. We can then scale it, and this gives us a very concrete similarity metric that captures the similarity between the query and the key. The same principle applies to matrices as well: dot product and scaling give us a similarity metric. Now let's think about what this similarity computation actually means. Remember, we're trying to understand how the components of the input relate to each other, which parts of the sentence are important to each other and important to conveying the semantic meaning of the sentence as a whole. So if we have this example, "He tossed the tennis ball to serve," let's say we've computed our query and key matrices and applied the scaling.
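In the notation of the transformer literature, with Q and K as the query and key matrices and d_k the dimensionality of the keys, this scaled dot-product similarity is:

```latex
\text{scores} = \frac{Q K^{\top}}{\sqrt{d_k}}
```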
We can then apply a function called a softmax to squash these values between zero and one, and this gives us a matrix that provides a relative weighting of how the individual components in the sequence relate to one another. Intuitively, you can think of things that are more related as having higher attention weights, and things that are less related as having lower attention weights. So here, in this example, "tossed" and "ball" have a high score, "tennis" and "ball" have a high score, and so on. This gives us our attention weighting, our attention matrix. The final step is to use that relative weighting to actually pull out the important features we care about. What we do here is take our value matrix and multiply it by our attention weights, and this gives us an output: a set of features over the input that reflects which elements of the sequence are related to one another. That's really the core of how attention works, and it's really beautiful and striking to me, because this mechanism gives us a very natural way to pull out and attend to features that are important relative to each other in an input. Architecturally, now, how do we actually build this out into something like a transformer? Next slide, please. We go again: taking our input, computing these positional encodings, defining the neural network layers that compute the query, key, and value, then computing the relative weighting between the query and the key, which is a matrix multiply representing the dot product, a scaling, and a softmax, and finally using the value matrix to extract the features that have high attention scores. These are the core operations that define an attention head, which is really the core component of architectures like the transformer (we'll tie these steps together in a short code sketch below). So, as we've mentioned and realized, attention is the foundational building block of transformers. And as some of the questions got at as well, a transformer architecture does not need to be defined by just a single attention head. You can stack multiple attention heads together to increase the capacity of your network and pull out different, and more complex, sets of features. So again, in a very intuitive example, maybe you have a network that has three attention heads, and if you were to go in and inspect the values of each of these attention heads, maybe you could get some interpretability out with respect to the different features or parts of the input that the network was attending to. So what are some real-world use cases, and what has attention really enabled and transformed over recent years? Natural language processing is where transformers and self-attention have led to tremendous advances, but the mechanism and architecture behind attention and transformers are very generalizable. Natural language is one tremendous area where transformers have taken off, and you'll get hands-on experience with this, not only in the lectures but in a brand new software lab on LLMs. We'll also see a bit about how attention and sequence modeling have been extended to biological sequences in one of our guest lectures.
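Pulling together the steps just described, here is a minimal single-head self-attention sketch in NumPy. The random projection matrices stand in for the learned layers that produce the query, key, and value, and the dimensions are toy values of my own:

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) position-aware embeddings of the sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # query, key, value projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # attention weighting, each row sums to one
    return weights @ V, weights                # attended features and the attention matrix

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 7, 16, 8            # e.g. the seven words of "He tossed the tennis ball to serve"
X = rng.normal(size=(seq_len, d_model))        # stand-in for positional embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, attn = self_attention(X, W_q, W_k, W_v)
print(output.shape, attn.shape)                # (7, 8) (7, 7)
```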
And in fact, for something that may not appear sequential, like images and computer vision, there is a class of models called vision transformers that have also become very powerful at processing image data. So, to summarize and close: hopefully you've gotten a sense of how rich sequence modeling is as a set of problems and things we can consider. We saw how RNNs work, how we can build up intuition for RNNs through this notion of recurrence, and how they can be trained through backpropagation through time. You'll get hands-on experience building RNNs for music generation. And finally, we closed by talking about self-attention and transformers as a way to model sequences without needing to handle time steps individually. Stay tuned for more on LLMs, both hands-on and in further lectures. So with that, that closes the lecture portion for today. We can now use the remaining time for open office hours and discussion about any lingering questions or comments related to the material. We also want to draw your attention to the software labs, which are now available on the GitHub linked on the course website; the instructions for completing them are all there, and we have options in both TensorFlow and PyTorch, so hopefully you'll get a fun chance to go through and work with those. And finally, I think our gracious reception host, John, stepped out, but immediately after this there will be an in-person reception to kick off the course just down the street at One Kendall Square, with food provided. Many special thanks to John Werner and Link Ventures; he's still there at the back. Thank you, John, for graciously hosting.