speaker 1: Okay, welcome everyone to week four. So for today, what I want to do is, first of all, we'll do a couple more bits on machine translation, especially talking a little bit about evaluating machine translation, and then I want to spend a while on attention. So attention is a very fundamental concept in neural networks, which was originally developed in the context of machine translation, but it's also then a very central concept when we're talking about transformers, which we start talking about on Thursday. Okay. So getting straight into it, this is the picture that we saw towards the end of last time, of how we were building a machine translation system, where we were using a multilayer LSTM, where we were feeding in the source sentence, and then flipping over to using the model as a decoder, with different parameters, which would generate one word at a time to produce the translated sentence. So here I've got a German sentence and it's produced an English translation that looks like a pretty good one. But you know, we're going to want to have a way of deciding, well, are we producing good translations or not? And so we need some way to evaluate machine translation. Now, this is a complex area, because if you start poking around in the literature, people have proposed literally hundreds of different measures that could be used to evaluate machine translation systems. I'm guilty of writing a couple of papers on it myself, so I've contributed to the problem. But by far the most common measure that you see to this day was essentially the first measure proposed to automatically evaluate machine translation, which was the BLEU measure. BLEU is supposed to stand for bilingual evaluation understudy, though the name going along with the fact that it was proposed by IBM is probably not a coincidence. So until this point, the only way that people had really used for evaluating translations was getting human beings to look at them and say how good a translation it is. And you know, that's still a gold standard measure that is widely used for evaluating translations, because many of the automatic measures have various kinds of biases and problems that make human evaluation useful. But on the other hand, a lot of the time we'd like to iterate quickly on evaluations, we'd like to use evaluations in training loops and things like that. And the IBM people with the BLEU paper suggested, well, maybe we can come up with a halfway decent automatic method of evaluating translations. And the idea of what they proposed was this: we're going to have one or more reference translations for a piece of text. So these are human-written translations. And then we can score any automatic translation mainly on how often it has overlapping one-, two-, three-, and four-grams with one of the reference translations. The number four isn't special, you could have gone up only to three, or to five, but four was seen as a reasonable length for overlapping n-grams. And the more overlap you have, the better. And there's discussion of this evaluation in the assignment, so you can think about it a bit more, and I won't actually go through all the formulas right now, but you know, that's most of it. And so here's a picture of how that looks. So the original idea was that what we should do is, you know, have several reference translations, and then we get a machine translation, and then we look at this machine translation and try and find pieces of it in the reference translations.
So we can certainly find some unigrams. We can't find 'American' at all, but we can find 'International Airport' and it's in the second reference translation, so we're going to get a four-gram match for that. We can find 'that', again, that's easy. Then there's a stretch of the machine translation that isn't a very good translation at all, right? So that all misses. But then you start to find other pieces that do overlap, and you use those to work out a score. The original idea was you should always have multiple reference translations, so that you can sample the space of possible translations and have reasonable coverage. In practice, for what's been done more recently, it's not so uncommon that people do this with only one reference translation. And the argument then is still on a kind of probabilistic basis: the more often you have a good translation, the more often you'll get matches, and therefore your score will be better. Yeah, so why did people come up with this, and why is it still imperfect? Well, the problem with translation is that there isn't one right answer. It's not like the kind of classification things you see in machine learning where you show people a picture, and the right answer is to say the class of this object is whatever, a Labrador or some dog breed or something, right? For any sentence, there are many different ways to translate it. And you know, translators sit around and argue that, oh, this phrasing is a little bit nicer than this phrasing, blah, blah, blah. But to a first approximation, you can translate the sentence in lots of ways. And those different translations can involve different word orders. So you can't really check the words off as you go down the sentence. And that's what motivated this idea of matching n-grams anywhere, so you can get reasonable credit for having the right matches. But nevertheless, it's a pretty crude measure, right? You can still get a poor BLEU score for a good translation, just because the words you chose didn't happen to match a reference translation. And also, you can get points for things without really having a good translation at all, right? If you just have words that match, even if they have completely the wrong role in the sentence, you will get some points. But it's harder to get n-gram matches for larger n unless you're using words the right way. There's one other trick in the BLEU measure, which is that there's a penalty for too-short system translations, because otherwise you could leave out everything difficult and only translate the easy parts of the sentence, and then, for the bits you have translated, you could be getting a high score for the precision of those pieces. Okay, so when you're developing MT systems for assignment three, we'll evaluate them with BLEU. So now we have an evaluation measure, we can start looking at how well systems do on a BLEU score. BLEU scores are theoretically between zero and 100, but you're never going to get to 100 because of the variation in how you can translate things. And so typically, if you can start to get into the twenties, the translations are such that you can sort of understand what the source document was about. Once you get into the thirties and forties, the translations are getting much, much better. Yeah. So statistical phrase-based translation was pioneered by IBM in the late nineties, actually, and was sort of redeveloped in the two thousands decade.
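To make the n-gram overlap and brevity penalty ideas concrete, here is a rough sketch of a simplified BLEU-style score in Python. This is not the exact corpus-level BLEU from the IBM paper (real BLEU, and libraries like sacrebleu, handle corpus statistics, multiple references, and smoothing more carefully), and the example sentences are just toy inputs.

```python
# A rough sketch of a simplified, sentence-level BLEU-style score.
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, references, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by the most it appears in any reference
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        overlap = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # avoid log(0)
    # Geometric mean of the modified n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the closest reference length
    ref_len = min((len(r) for r in references), key=lambda l: abs(l - len(candidate)))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * geo_mean

cand = "the airport was shut by the guards".split()          # toy machine translation
refs = ["the airport was closed by security guards".split()]  # toy reference
print(simple_bleu(cand, refs))
```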
And it was what Google launched as Google Translate in the two thousands decade. And it continued to be worked on for sort of the following decade. But there was basically a strong sense that progress in translation using statistical phrase-based systems had basically stalled; it got a little bit better each year, as people could build traditional n-gram language models with more data every year and things like that, but the numbers were barely going upwards. So in the years from about 2005 to 2014 or 2015, the dominant idea in the machine translation community was that the way we were going to get better machine translation was doing syntax-based machine translation. If we actually knew the structure of sentences and we'd parsed them up, then we'd know what the role of words was in sentences, and then we'd be able to translate much better. And this was particularly invoked by looking at languages where translation worked terribly. So in those days, translation worked sort of okay for languages like French to English or Spanish to English, which are kind of similar European languages. But the results were way worse for Chinese to English or German to English. And even though English is a Germanic language, German has a very different word order to English, with commonly verbs at the end of a clause and different elements being fronted. So people tried to work on grammar-based, syntax-based methods of statistical machine translation, and I was one of those who worked on those in the late 2000s. But you know, the truth is it sort of didn't really work, right? Even if the rate of progress in syntax-based machine translation had slightly more slope than phrase-based machine translation over those years, the amount of slope wasn't very much. So things were completely thrown on their head when neural machine translation got invented. Because, as I explained, the first attempts were in 2014. The first cases in which it was evaluated in bakeoff evaluations were in 2015. And in 2015, it wasn't as good as the best other machine translation methods, but by 2016 it was, and it was just on this much, much steeper slope of getting way, way better. This graph only goes up to 2019, but it's continued to go up. And so it's not that uncommon these days that you see BLEU numbers in the fifties and sixties for neural machine translation systems. So that's a good news story. So after this, I want to go on and introduce this idea of attention, which is now a very fundamental, important idea in neural systems. It's also interesting because it's actually something novel that was invented kind of recently. So for everything that we've done in neural networks up until now, really it had all been invented before the turn of the millennium, right? So basic feed-forward neural networks, recurrent neural networks, LSTMs, other things that we haven't yet talked about, like convolutional neural networks, they were all invented last millennium. It was really a waiting game at that point until there was sufficient data and computational power for them really to show how good they were. But attention was something that actually got invented in 2014, in the origins of neural machine translation, and it proved to be a very transformative idea for making neural networks more powerful. So the idea of what motivated attention was looking at exactly this kind of machine translation problem.
So we were running our LSTM over the source sentence, and then we were using this hidden state as the previous hidden state that we were feeding into the generator LSTM for the target sentence. And what that means is everything useful about this sentence has to be stuffed into that one vector. Well, that's maybe not so hard if you've got a four-word sentence, but maybe you've got a 40-word sentence out here. And it seems kind of implausible that it'd be a good idea to be trying to fit everything about that sentence into this one hidden state. And well, obviously there are crude solutions to this. You make the hidden states bigger, and then you've got more representational space. You use a multilayer LSTM, you've got more representational space. But it still seems a very questionable thing to do. And it's certainly not like what a human being does, right? If a human being is translating a sentence, they read the sentence and they've got some idea of its meaning, but as they start to translate, they look back at the earlier parts of the sentence and make use of that in their translation. And so this doesn't seem like a very plausible model. So the idea should be that our neural nets should be able to attend to different things in the source, so that they can get information as needed, looking back in the sentence. And so this is the idea of attention. And so on each step of the decoder, we're going to insert direct connections to the encoder, so we can look at particular words in the sentence. So I've got a bunch of diagrams that go through what we do, and then after that, I'll present the equations that go along with this. Okay, so once we're starting to translate, we've got a hidden state at the start of our generator, and then we're going to use this hidden state as our key to look back into the encoder to try and find useful stuff. So we're going to compare, in a way I'll make precise later, this hidden state with the hidden state at every position in the source sentence. Based on our comparisons, we're going to work out an attention score: where should we be looking in the source sentence while generating, here, the first word of the translation? And so based on these attention scores, we'll stick them into a softmax, as we commonly do, and we'll then get a probability distribution, or weighting, over the different positions in the sentence. Then we will use this weighting to compute a representation based on the encoder, which is going to be a weighted average of the encoder states. So in this particular case, it'd be nearly entirely the representation above the first word, 'il', which means 'he' in French. So then we'll take that attention output, and we'll combine it with the hidden state of our decoder, and we'll use both of them together to generate an output vector, which we stick through our softmax and generate a word as the first word of the translation, y1. And so then at that point, we just repeat this over. So we then go on to generating the second word. We copy down the first word we generated, start to generate the second word, we work out attention at every position, it gives us... Oh, sorry, there's a little note there, which is a little fine point which maybe I won't deal with. But it points out that sometimes you also do things like stick the previous time step's attention output into the next step as an extra input. And we actually do that in, it should say, assignment three there; that's buggy on the slide.
So there are other ways to use things, but I'll sort of gloss over that. So we generate another word, and we repeat over, and at each time step, we're looking at different words in the source, and they will help us to translate the sentence. Yeah, quick question? Say again, you mean why the green part? Okay, so the green vector, the hidden vector of the decoder, is going to be used together with the hidden vectors of the encoder, one at a time, to calculate the attention scores. So the attention score at a position is going to be a function of the hidden state of the encoder at that position and the current hidden state of the decoder. And I'll explain exactly how in a moment. Any other questions? Okay, well, so here it is in math. Okay, so we have encoder hidden states, which we're going to call h1 through hN, and we have decoder hidden states, which we're going to call s, so they're something different. And at each point we're at some particular time step t, so we'll be dealing with st. So to calculate the attention scores for generating the word for time step t, we're going to calculate an attention score for each position in the encoder. Okay, I'll discuss alternatives for this in a moment, but the very easiest way to calculate an attention score, which is shown here, is to take a dot product between the hidden state of the encoder and the current hidden state of the decoder. And so that's what we're showing here. So that will give us some dot product score, which is just any number at all. Then the next thing we do is we stick those e_t scores into a softmax, and that gives us our probability distribution as to how much weight to put on each position in the encoder. And so then we calculate the weighted average of the encoder hidden states, which we're just doing with the obvious equation: we're taking the weighted sum of the hidden states of the encoder based on the attention weights. And then what we want to do is concatenate our attention output and the hidden state of the decoder, which gives us a double-length vector, and then we're going to feed that into producing the next word from the decoder. So typically, that means we're multiplying that vector by another matrix and then putting it through a softmax to get a probability distribution over words to output, and choosing the highest probability word. Okay, that makes sense, I hope. Yeah. Okay, so attention is great. Inventing this idea was completely transformative. So the very first modern neural machine translation system was done at Google in 2014, and they used a pure, but very large, very deep LSTM, an eight-layer deep LSTM with a very large hidden state for the time, and they were able to get good results. But very shortly thereafter, people at the University of Montreal, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, did a second version of neural machine translation using attention. And with a much more modest compute budget, of the kind that you can afford in universities, they were able to get better results, because attention was their secret ingredient. So attention significantly improved NMT performance. Essentially every neural machine translation system since has used attention like we've just seen. You know, it's more human-like, as I was indicating, because it's sort of what a human would do: you look back in the sentence to see what you need to translate.
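Here is a minimal sketch of one decoder step of dot-product attention in PyTorch, just to connect those equations to code. The dimensions and matrices are made up for illustration, and a real system would batch this and learn all the weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d, vocab = 5, 6, 10                 # source length, hidden size, toy vocab (made-up sizes)
enc_hiddens = torch.randn(N, d)        # h_1 ... h_N from the encoder
s_t = torch.randn(d)                   # decoder hidden state at step t

# 1. Attention scores: dot product of s_t with each encoder hidden state
e_t = enc_hiddens @ s_t                # shape (N,)

# 2. Attention distribution: softmax over source positions
alpha_t = F.softmax(e_t, dim=0)        # shape (N,), sums to 1

# 3. Attention output: weighted average of encoder hidden states
a_t = alpha_t @ enc_hiddens            # shape (d,)

# 4. Concatenate with the decoder state and project to vocabulary scores
W_out = torch.randn(vocab, 2 * d)      # stand-in for a learned output matrix
logits = W_out @ torch.cat([a_t, s_t])
p_words = F.softmax(logits, dim=0)     # distribution over next words
print(p_words.argmax())                # index of the most probable next word
```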
And it solves this bottleneck problem. You no longer have to stuff all the information about the source sentence into one hidden state. You can have the whole of your representational space from your entire encoding and use it as you need it. It also helps with the vanishing gradient problem. This is connected to what I was saying last time when talking about residual connections, that a way out of the vanishing gradient problem is to directly connect things, and this provides shortcut connections to all of the hidden states of the encoder. Another nice thing that attention does is it gives you some interpretability. So by looking at where the model is attending, you can basically see what it's translating at different time steps. And so that can be really useful. It's kind of like we can see what we're translating where, without explicitly having trained a system that does that. So for my little toy sentence here, 'il a m'entarté', 'he hit me with a pie': at the first position it was looking at the first word, 'il', 'he', which it translates. Then in French there's this sort of verb, 'entarter', to sort of pie somebody; I guess in English as well you can use 'pie' as a verb, right? So the 'a' is a sort of perfect past auxiliary. So it's sort of like 'he has me pied' is what the French words are, one at a time. And so the 'hit' is already looking at the 'entarté', then the 'me' is attending to the 'm'', which means 'me', and then all of 'with the pie' is attending still to 'entarté', which is basically the right kind of alignment that you want for the words of a sentence. So that's pretty cool too. Okay. So up until this point I've just said, oh, we could do a dot product, but in general there's more to it than that. So what we have is we have some values, h1 to hN, and we have a query vector, and we want to work out how to do attention based on these things. So attention always involves computing some attention scores, taking the softmax to get an attention distribution, and then getting an attention output. But the part where there's variation is how you compute those attention scores, and a number of different ways have been done for that, and I just want to go through that a little bit. So the simplest way, that I just presented, is this dot-product attention. We just take the hidden states and dot-product the whole of them. That sort of works, but it doesn't actually work great. And I sort of discussed this a bit when talking about LSTMs last time, right? The hidden state of an LSTM is its complete memory, right? So it has to variously store lots of things in that memory. It's got to be storing information that'll help it output the right word. It has to be storing information about the future, about other things that you'll want to say given the sentence context, the grammar, and the previous words you've said, right? It's sort of got all kinds of memory. And so it sort of makes sense that some of it would be useful for linking up, for looking back, and some of it would be less useful. You sort of want to find the parts that are related to what you want to say immediately, not all the parts that deal with all of the rest of the future. So that suggested maybe you could do a more general form of attention. And so Thang Luong and me in 2015 suggested maybe we could introduce what we called bilinear attention, which I still think is a better name.
But the rest of the world came to call it multiplicative attention, where what we're doing is, between these two vectors, we're sticking a matrix. And so we're then learning the parameters of this matrix, just like everything else in our neural network. And so effectively, this matrix can learn which parts of the generator hidden state you should be using to look for things, and where to look in the hidden states of the encoder. In particular, it no longer requires that things have to match up dimension by dimension. It could be the case that the encoder is storing information about word meaning here, and the decoder is storing information about word meaning over there, and by learning appropriate parameters in this matrix, we can match those together and work out the right place to pay attention. So that seemed kind of a cool approach to us. Yeah? Could you take this idea further and even build like a little neural network that takes the hidden states as input and gives an output? You can do that; I was going to get to that on the next slide. Actually, that's in a way sort of going backwards, but I will get to it on the next slide. But before I do that, I will show you these other versions. So the one thing you might wonder about doing it this way is, you know, there are a lot of parameters that you have to learn in the matrix W. There aren't that many in my example, because there are only 36, but that's because my hidden states are only of length six, right? And if your hidden states are of length 1000, say, then you've got a million parameters in that W matrix. And that seems like it might be kind of problematic. And so the way to get beyond that, which was fairly quickly suggested thereafter, is, well, maybe rather than having that whole big matrix in the middle, instead what we could do is form it as a low-rank matrix. And the easy way to make a low-rank matrix is you take two skinny matrices like this, where this dimension is the rank of the pieces, and multiply them together, which would give us the big matrix that I showed on the last slide. And so this gives you a low-parameter version of the bilinear attention matrix from the last slide. But at that point, if you just do a teeny bit of linear algebra, this computation is exactly the same as saying, well, what I'm going to do is take each of these two vectors and project them to a lower-dimensional space using these low-rank transformation matrices, and then take the dot product in this low-dimensional space. And on Thursday, when we get to transformers, what you will see is that this is what transformers do: they're taking the big vector and projecting it to a low-dimensional space, and then taking dot-product attention in that low-dimensional space. Okay, back to the question. Yeah, you're totally right. And you know, at this point I'm sort of going in an ahistorical manner, because, yeah, actually the first form of attention that was proposed, in the Bahdanau et al. paper, was, hey, let's just stick a little neural net there to calculate attention scores. So we take the s and the h, we multiply them both by a matrix, add them, put them through a tanh, multiply that by a vector, and we get a number. This looks just like the kind of computations you see everywhere else in an LSTM. So there's a little neural net that's calculating the attention scores, and then they go into a softmax as usual. In most of the literature, this is called additive attention, which also seems to me a really weird name.
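To line up the scoring functions just discussed, here is a small hedged sketch in PyTorch of the different ways of computing a single attention score, with made-up toy dimensions: plain dot product, multiplicative (bilinear) attention with a learned W, the low-rank version that projects both vectors down before the dot product, and additive (Bahdanau-style) attention with a little neural net.

```python
import torch

d1, d2, k = 6, 6, 3                    # decoder size, encoder size, low rank (toy numbers)
s = torch.randn(d1)                    # decoder hidden state
h = torch.randn(d2)                    # one encoder hidden state

# Dot-product attention: only works directly when d1 == d2
score_dot = s @ h

# Multiplicative / bilinear attention: learn a full d1 x d2 matrix W
W = torch.randn(d1, d2)
score_mult = s @ W @ h

# Low-rank version: W = U^T V with skinny k x d matrices, which is the same as
# projecting both vectors down to k dimensions and taking the dot product there
U, V = torch.randn(k, d1), torch.randn(k, d2)
score_lowrank = (U @ s) @ (V @ h)

# Additive (Bahdanau-style) attention: a little neural net with a tanh
W1, W2, v = torch.randn(k, d1), torch.randn(k, d2), torch.randn(k)
score_add = v @ torch.tanh(W1 @ s + W2 @ h)

print(score_dot, score_mult, score_lowrank, score_add)
```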
I mean, I think kind of saying you've got a little neural net makes more sense as a name for that one. But anyway, this is what they proposed and used. And you know, at this point it's a little bit complex, to be honest. So when we wrote our paper the next year, we had found that the bilinear attention worked better for us. But there was subsequent work, especially this massive exploration of neural machine translation architectures, that argued that actually, with the right kind of good hyperparameter optimization, this is better than the bilinear attention. But this is a lot more complex and a lot slower than doing what you're doing in the upper part of the chart. So regardless of whether it's better or not in practice, what's completely won is doing this. And this is what transformers use, and just about all other neural nets that are used these days. Okay, questions on attention? It will be found in assignment three. Yeah. So I won't say much more about this now, and we'll see more of it just next lecture. But attention is a very general technique, right? It was a great way to improve machine translation, and that was how it was first invented. But for all kinds of neural architectures, for all kinds of purposes, you can stick attention into them. And the general finding was that it always improved results. So in general, anywhere where you have a vector of values and a query vector, you can use attention to get a weighted average of the values, which finds relevant information that you can use to improve your performance. And so maybe I won't try and give examples of that now, but you'll see another example of attention immediately when we do things on Thursday, where we start doing self-attention inside transformers. Yes? speaker 2: [inaudible] speaker 1: Great. Not yet, no, we did not. speaker 2: [inaudible] speaker 1: I mean, it didn't seem especially necessary. I don't know. But no, we do not. Okay. Well, this is the end of the part with attention. Are there any other questions? Yes? speaker 2: For the RNN attention stuff, is there a need for positional information, or is that not required? speaker 1: Is there a need for positional information? So there was none, and it seemed like it wasn't very required. I mean, you could make some argument that maybe position information might have been useful, but there's also a good argument that it wasn't necessary. And the sort of now-everywhere usage of positional information only becomes necessary when you get to a transformer. And the reason for that is, going back to the pictures, these encoder states are being calculated with respect to the previous encoder state, right? Because it's a recurrent neural network, and therefore the representation here knows something about the past. So it kind of knows what position it's in, basically. And so that's giving a lot of that information. Or another way to think about it is, this final representation will give a certain overall sense of the semantics of the sentence, so to the extent that you're looking backwards, the sort of associative matching of similar semantic content that's needed seems sufficient, and you don't really need additional positional information. Okay, I will go on. Okay, so that's the neural network content for today.
And so for the remaining 39 minutes, I want to talk about final projects, but also a bit about data, experiments, and things like that. Okay, so this is a reminder on the class. So we've got the four assignments, which are 48%, and then the big other part of what you need to do is the final project, which is 49%, almost completing things out except for the participation. And let me just give one note about collaboration and the honor code. For final projects, it's quite usual that people use all sorts of stuff that was written by other people. That's completely fine. We don't expect you to implement everything from scratch, but you must document what you're using. You give references or URLs if you're using other people's code rather than writing your own. We do want to know what code you wrote yourself and what things you downloaded from PyPI. And in particular, in thinking about final projects, the question of interest for us is what value add did you provide, right? You haven't done something great if you've downloaded a really good neural network and run it on some data and it produces really good results; that's not much value add. So if you want to have value add in that context, you at least want to be doing something interesting, like understanding why it works so well, or what kind of examples it doesn't work well on, doing some thorough experimental analysis. Yeah, a couple of other points there. Okay, so for the final project for this class, there's a binary choice. You can either do our default final project, which I'll talk about more a bit later, or you can come up with your own final project, and I'll talk about that a bit too. So we allow team sizes of one to three. The complicated thing that comes up... oh, actually, sorry, I should say the other point first. Yeah. So we generally encourage people to form teams. That means that you can do something more interesting, that's more motivational, you can make friends, whatever. So teams are good. On expectations for teams: our expectation is that a bigger team should be able to do proportionally more work, and so when we're grading things, we expect to see more work from larger teams. Now, how this works out is, I will admit, a little bit complicated, because there's sort of a quality issue that's separate from the amount of work. The reality is that it's just always the case that several of the very best projects are one-person efforts, because they're just somebody who has a good idea and knows what they want to do and does it by themselves, and it is great. But there are also great multi-person projects as well. But the point I'm making is, well, it kind of doesn't work if you're a one-person project and you try and attempt a huge amount of stuff and you can only get one third of the way through it; that's not a good recipe for doing well on the final project. For any project, you really need to be completing something and showing something. But nevertheless, if you're one person and you can show something kind of interesting, even if our reaction is, well, this would have been much better if they'd shown it was better than this other kind of model, or it would have been really nice if they'd run ablations to work things out, well, if you're one person we'll give you a bye and say, oh, but there's only one person.
Whereas if you're a three-person team, and it seems like you obviously should have compared it to some other models and you obviously could have run it on some other data sets, then we'll feel like, well, as a three-person team they obviously should have done that, and therefore we should give them a less good score. And that's how that is worked out. The complication comes with other things people are doing at the same time. We allow people to do final projects that are shared with multiple classes, but the expectation is again that you'll do more work. So if there are two of you who are using one project for both this class and CS 231N, say, then it's sort of like a four-person project, and you should be doing a lot of work for it. There are other cases: sometimes people have RAships, or they're PhD rotation students, or other things. If you're doing it for other things, we'd like you to tell us, and we expect you to be doing more work for it. Okay. I'm very happy to talk to people about final projects, and I have been talking to people about final projects, but unfortunately there's only one of me, so I definitely can't talk to 500 people about final projects. So I do also encourage you to talk to all of the TAs about final projects. On the office hours page, under all of the TAs, there's some information about things that they know about. So if you know what your project is about, you could at least try and find one of the most useful TAs, or just find a TA with a friendly face. Whatever mechanism you use, talk to TAs about final projects. Yeah. So, default final project. What it's going to be is: BERT was a famous early transformer, and we're going to be sort of building and experimenting with a minimal BERT implementation. So if you do this, there's part of an implementation of BERT, and you're meant to finish it off, fine-tune it, and get some results for doing sentiment analysis. And then, basically, we want even the default final project to be an open-ended project where people can do different things. And so then there are lots of other ideas, or you can come up with your own, of ways you could extend this system and make it better, which might be with paraphrasing, contrastive learning, low-rank adaptation, something. And you can do something, and that is your final project. So why choose the default final project? Well, if you haven't had much experience with research, or you don't have any real idea of what you want to do for a final project, or you'd like something with clear guidance and a goal and a leaderboard, because we provide a leaderboard for people doing the default final project of how good your performance is on the tasks we provide, then you can do the default final project. And I mean, honestly, I think for many people, the best option is to do the default final project. Going by past performance, typically about half the students do the default final project, including some people who start off thinking, I'll do a custom final project, and then after a couple of weeks they decide, huh, this makes no sense, what I suggested isn't working at all, I'm just going to abandon it and flip to the default final project. Okay? But we also allow custom final projects, and there are good reasons to do custom final projects.
So if you have some topic or research idea that you're excited about, maybe you're already even working on it, or you want to try something different on your own, or you'd just like to have more of the experience of trying to come up with a research goal, finding your own data and tools and starting from scratch, which is actually very educational, if considerably harder, well, then the custom final project is fine for you. Restriction on topics: I think we already sort of signaled this on Ed. We insist for CS 224N final projects that they have to substantively involve both human language and neural networks, because this is the NLP class, so we'd like people to know and learn something about human language. I'm totally aware of the fact that you can use these same models for bioinformatics sequences or music, radar, whatever, but we'd like you to do something with human language for this class. That doesn't mean it has to be only about human language. People have done things like visual language models or music and language, so it can have a combination of modalities, but it has to substantively, not completely trivially, involve human language. If you've got any questions about that, ask. And it also has to substantively involve neural networks. So again, it doesn't have to be wholly about neural networks. If you've got some idea like, oh, I think I could show using kernel machines that they work just as well as having multilayer neural networks, or something like that, that's of course fine to do as well. Gamesmanship, yeah. The default final project is more guided, but it's not meant to be a complete slacker's ride. We're hoping that people do the same amount of work for either kind of project. But on the other hand, it does kind of give you a clearer focus and course of things to do, though it is still an open-ended project. So for both default final projects and custom final projects, there are great projects and there are not so great projects. If anything, there's a bit more variance in the custom final projects. So the path to success is not to try and do something for the custom final project that just looks really weak compared to people's default final projects. Okay? You can get good grades either way; we give best project awards to both kinds of projects. So yeah, it's really not that there's some secret one you have to pick. Computing. Yeah. So, to be honest, with the confessions right at the beginning, we're actually in a less good position for computing than we've been in recent years. And it's all OpenAI's fault. No, well, partly. But you know, up until and including last year, we had invariably managed to get very generous cloud computing giveaways from one or another cloud computing provider, which really provided a lot of computing support. But there's the great GPU shortage at the moment, due to the great success of large language models, and it turns out that cloud compute providers just aren't being as generous as they used to be. And gee, I guess the AWS rep was pointing out that my course was their single largest grant of free GPUs last year, so it's getting harder to do. So really, people will have to patch things together more in many cases, and we'll be relying on the ingenuity of students to be able to find free and cheap stuff. So Google is giving $50 of credit per person on GCP, which can be used for assignments three and four and the final project. On all the clouds,
if you haven't used a cloud with an account before, you can usually get some free starter credits, which can be a useful thing. There are the sort of Jupyter notebooks in the cloud; the most used one is Google Colab, which allows limited GPU use. It often tends to get tighter later in the quarter, so you might find it a good investment to skip a couple of lattes and pay ten bucks a month to get Colab Pro, which gives you much better access to GPUs. But there are alternatives to that which you might also want to look at. So AWS provides a Jupyter notebook environment, SageMaker Studio Lab. And, now also owned by Google, Kaggle separately provides Kaggle notebooks, which actually commonly give you better GPU access than Google Colab provides, even though they're otherwise not as nice: Kaggle notebooks are sort of just bare-bones Jupyter notebooks, whereas Colab has some fancier UI stuff grafted onto it. Other possibilities: Modal is a low-priced GPU provider and allows a certain amount of free GPU usage a month, so that could be handy. There are other lower-cost GPU providers, like Vast.ai, which could be of relevance. And then the other thing that I'll say more about in a minute is, the way things have changed with large language models, there are lots of projects that you might want to do where you're not actually building models at all yourself, but you're wanting to do experiments on large language models, or you're wanting to do in-context learning with large language models, or other things of that sort. And then what you want is to have access to large language models, and in particular, you probably want to have API access so you can automate things. So another thing that we have been able to get, through the generosity of Together AI, is that Together AI is providing $50 of API access to large language models, which can actually be a lot. How much of a lot it is depends on how big a model you're using. So something you should think about is how big a model you really need to use to show something. Because if you can run a 7 billion parameter language model on Together, you can put a huge number of tokens through it for 50 bucks, whereas if you want to run a much bigger model, then the number of tokens you can get through goes down by orders of magnitude. So that's good. And I mentioned some other ones. So we've already put a whole bunch of documents up on Ed that talk about these different GPU options, so do look at those. Okay, jumping ahead. So the first thing you have to do is a project proposal, and it's one per team. So I guess the first step is to work out who your team is. And for the project proposal, part of it is actually giving us the details of your project. But there's another major part of it, which is writing a review of a key research paper for your topic. So for the default final project, we provide some suggestions, but you can find something else if you've got another idea for how to extend the project; for your custom project, you're finding your own. But what we want you to do is get some practice at looking at a research paper, understanding what it's doing, understanding what's convincing, what it didn't consider, what it failed to do. And so we want you to write a two-page summary of a research paper.
And the goal is for you to be thinking critically about this research paper: what did it do that was exciting, versus what did it claim was exciting but was really obvious or perhaps even wrong, etc. Okay. And right, so after that, we want you to say what you're planning to do. That may be very straightforward for a default final project, but it's really important for a custom final project. In particular, tell us about the literature you're going to build on, if any, and the kind of models you're going to explore. But it turns out that when we're unhappy with custom final projects, the two commonest complaints about what you tell us are, first, that you don't make clear what data you're going to use, because we're sort of worried already if you haven't worked out by the project proposal deadline what data you can use for your final project, and second, that you don't tell us how you're going to evaluate your system; we want to know how you're going to measure whether you're getting any success. As a new thing this year, we'd like you to include an ethical considerations paragraph outlining potential ethical challenges of your work, if it were deployed in the real world, and how they might be mitigated. This is something that a lot of conferences are now requiring and a lot of grants are requiring, so I want to give you a little bit of practice on that by having you write a paragraph about it. How much there is to talk about varies somewhat with what you're trying to do, and whether it has a lot of ethical problems or whether it's a fairly straightforward question answering system. But in all cases, you might think about what the possible ethical considerations of this piece of work are. Okay, the whole thing is a maximum of four pages. Okay, so for the research paper summary, yeah, do think critically, right? The worst summaries are essentially ones where people just paraphrase what's in the abstract and introduction of the paper, and we want you to think a bit harder about this. What were the novel contributions of the paper? Is it something that you could use for different kinds of problems in different ways, or was it really exploiting a trick of one data set? Are there things that it seemed like they missed, or could have done differently, or you weren't convinced were done properly? How is it similar to or distinct from other papers that are dealing with the same topic? Does it suggest perhaps something that you could try that extends beyond the paper? Okay. And for grading these final project proposals, most of the points are on that paper review, so do pay attention to it. There are some points on the project plan, but really we're mainly wanting to give you formative feedback on the project plan, and comments as to whether we think it's realistic or unrealistic. But nevertheless, we're expecting you to have an idea, to have thought through how you can investigate it, thought through how you can evaluate it, data sets, baselines, things like that. Oh yeah, I should emphasize this: do you have an appropriate baseline? For anything that you're doing, you should have something you can compare it against. Sometimes that's a previous system that does exactly the same thing. But if you're doing something more novel and interesting, you should be thinking of some seat-of-the-pants, obvious way to do things, and proving that you can do better than that. And what that is depends a lot on what your project is.
But you know, if you're building some complex neural net that's going to be used to work out textual similarity between two pieces of text, well, a simple way of working out textual similarity between two pieces of text is to look up the word vectors for every word in each text, average them together, and work out the dot product between those average vectors (there's a minimal sketch of that kind of baseline a bit further down). And unless your complex neural network is significantly better than that, it doesn't seem like it's a very good system. So you always want to have some baselines. After the project proposal, we also have a project milestone stuck in the middle to make sure everybody is making some progress. This is just to help make sure people do get through things and keep working on them, so we'll have good final projects. For most final projects, and I'll say more about this in a minute, the crucial thing we expect for the milestone is that you've kind of got set up and you can run something. It might just be your baseline of looking up the word vectors, but it means you've kind of got the data and the framework and something that you can run and produce a number from. And then there's the final project. We have people submit their code for the final projects, but final projects are evaluated almost entirely, unless there are some major worries or concerns, based on your project report. So make sure you put time into the project report, which is essentially a research paper, like a conference paper. They can be up to eight pages, and it varies with what you're doing, but this is the kind of picture, typically, of what they look like: they have an abstract, an introduction, they talk about other related work, they present the model you're using, the data you're using, and your experiments and their results, and have some insightful comments in their analysis and a conclusion at the end. Okay? Finding research topics for custom projects: there are all kinds of things you can do. As basic philosophy of science, you're normally either starting off with, here's some problem I want to make some progress on, or, here's this cool idea for a theoretical technique or a change in something, and I want to show it's better than other ways of doing it, and you're working from that. We allow different kinds of projects. One common type of project is you've got some task of interest and you're going to try and solve it or make progress on it somehow: say you want to get information out of State Department documents, and you're going to see how well you can do it with neural NLP. A second kind is you've got some idea of doing something different with neural networks, and then you're going to see how well it works. Or maybe, given there are large language models these days, you're going to see how, using large language models, you can do something interesting by in-context learning or by building a larger program around a language model. So nearly all 224N projects are in those first three types, where at the end of the day you've got some kind of system and you've got some kind of data and you're going to evaluate it, but that's not a 100% requirement. There are different kinds of projects you can do, and a few people do. So you can do an analysis or interpretability project. You could be interested in something like, how could these transformer models possibly understand what I say to them and give the right answers to my statements? Let me try and look inside the neural networks and see what they're computing.
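Here is the minimal sketch of that averaged-word-vector baseline mentioned above. The vectors dictionary is just a random stand-in for whatever embeddings you would actually load (e.g. GloVe or word2vec), and the similarity is normalized to a cosine, which is a common variant of the plain dot product.

```python
import numpy as np

# Stand-in word vectors; a real baseline would load GloVe or word2vec vectors here.
np.random.seed(0)
vecs = {w: np.random.randn(50) for w in "the cat sat on a mat dog slept rug".split()}

def avg_vector(text):
    # Average the vectors of the words we have embeddings for
    words = [w for w in text.lower().split() if w in vecs]
    return np.mean([vecs[w] for w in words], axis=0)

def similarity(text_a, text_b):
    a, b = avg_vector(text_a), avg_vector(text_b)
    # Cosine similarity between the two averaged vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity("the cat sat on a mat", "a dog slept on the rug"))
```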
Recently there's been a lot of work on this topic, often under titles like mechanistic interpretability, circuits, and things like that. So you can do some kind of analysis or interpretability project, or you could even just look at the behavior of models on some task. So you could take some linguistic task, like metaphor interpretation, and see which neural networks can interpret metaphors correctly, or which kinds of ones they can interpret correctly or not, and do things like that. Another kind is a theoretical project. Occasionally people have done things looking at, well, here's a good example, something that's in the math. So an example that was actually done a few years ago and turned into a conference paper was looking, in the estimation of word vectors, at the stability of the word vectors that were computed by different algorithms, word2vec versus GloVe, and deriving results with proofs about the stability of the vectors that were calculated. So that's allowed; we just don't see many of those. Here, very quickly, are sort of just some random things. A lot of past projects you can find on the 224N web page; you can find past years' reports and look at them to get ideas as you wish. So Deep Poetry was a gated LSTM, where the idea was a language model that generated successive words, but they had extra stuff in it to make it rhyme in a poetry-like pattern; that was kind of fun. You can do a reimplementation of a paper that has been done previously. This is actually kind of an old one, but I remember it well. Back in the days before transformers, DeepMind did these kind of interesting papers on neural Turing machines and differentiable neural computers, but they didn't release implementations of them. And so Carol set about writing her own implementation of a differentiable neural computer, which in a way was a little bit crazy, and a few days before the deadline she still hadn't got it working, so it could have been a complete disaster. But she did get it working before the deadline and got it to run, producing some interesting results. So that was kind of cool. So if it's something interesting, it doesn't have to be original; it can be reimplementing something interesting. Okay? Sometimes these papers do get published later as interesting ones. This was a paper that was, again, from the early days and was fairly simple, but it was a novel thing that gave progress. So the way we've presented these RNNs, you have word vectors at the bottom, and then you compute the softmax at the top. But if you think about multiplying by the output matrix and then putting that into the softmax, that output matrix is also like a set of word vectors, because you have a column for each word, you get a score for each output word, and then you're putting a softmax over that. And so their idea was, well, maybe you could share those two sets of vectors, and you'd be able to get improvements from that, and you could (there's a small sketch of that weight-tying idea a bit further down). Okay, maybe I won't talk about that one. Sometimes people have worked on quantized models; that's more of a general neural network technique, but providing you show you can do useful things with it, like getting good language modeling results even with quantized vectors, we'll count that as using language.
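Here is the small sketch of that weight-sharing (tied embeddings) idea in PyTorch. The model and dimensions are made up for illustration; the key line is pointing the output projection's weight at the input embedding matrix, which is the sharing the lecture describes.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    def __init__(self, vocab_size=1000, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)      # input word vectors
        self.rnn = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size, bias=False)
        self.out.weight = self.embed.weight           # tie the output matrix to the input embeddings

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                            # scores over the vocabulary at each position

model = TinyTiedLM()
logits = model(torch.randint(0, 1000, (2, 7)))        # batch of 2 toy sequences, length 7
print(logits.shape)                                   # torch.Size([2, 7, 1000])
```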
So in recent times, these last two are from 2024, a lot of the time people are doing projects with pre-trained large language models, which we will be talking about in the next three lectures, and then doing things with them. And so you can do lightweight, parameter-efficient fine-tuning methods, you can do in-context learning methods, and things like this. I suspect that probably quite a few of you will do projects of this kind. So here's an example. Lots of work has been done on producing code language models, and these people decided to improve the generation of Fortran. Maybe they're physicists, I don't know. And so they were able to show that they could use parameter-efficient fine-tuning to improve Code Llama for producing Fortran. Now, where was the natural language? Code has natural language comments in it, and the comments can be useful for explaining what you want the code to do. And so it was effectively doing translation from a human language explanation of what the code was meant to do into pieces of code. Here was another one which was doing AI fashion-driven cataloging, transforming images into textual descriptions, which again was starting off with an existing visual language model and looking at how to fine-tune it. Okay, other places to look for stuff. You can get lots of ideas of areas and things people do by looking at past papers. You're also welcome to have your own original ideas, thinking about anything you know or work on in the world. So for NLP papers, there's a site called the ACL Anthology that's good for them. There are also lots of papers on language that appear in machine learning conferences, so you can look at the NeurIPS or ICLR proceedings. You can look at past 224N projects, and then the arXiv preprint server has got tons of papers on everything, including NLP, and you can look there. But I do actually think some of the funnest, best projects are actually by people who find their own problem, which is an interesting problem in their world. If there's a cool website that has text on it, and you think you could get information out of it automatically by using a language model or something, there's probably something interesting and different you can do there. Another place to look is that there are various leaderboards for the state of the art on different problems, and you can start looking through leaderboards for stuff and see what you find there. But on the other hand, the disadvantage of looking at things like leaderboards and past conferences is you sort of tend to be trying to do a bit better on a problem someone else has done. And that's part of why, really often in research, it's a clever thing to think of something different, perhaps not too far from things that other people have done, but somehow different, so you'll be able to do something a bit more original and different in what you're doing. Yeah. I do just want to go through this a bit quickly: over the years that I've been doing natural language processing with deep learning, there's sort of been a sea change in what's possible. So in the early days of the deep learning revival, most of the work in people's papers was trying to find better deep learning architectures. So that would be, here is some question answering system, I've got an idea of how I could add attention in some new place, or I could add a new layer into the neural network.
And the numbers would go up. And there were lots of papers like that, and it was a lot of fun, and that's what a lot of good CS 224N projects did too. And people were often able to build systems from scratch that were close to the state of the art. But in the last five years, your chances of doing this have become pretty slim, frankly. You can if you've really got a good idea, something different and original, by all means, but it's kind of hard. So most work these days, even for people who are professional researchers, is making use of existing large pre-trained models in some way. And once you're doing that, that actually fixes a lot of your architectural choices, because your large pre-trained neural network has a certain architecture and you kind of have to live with that. You might be able to do interesting things by adapting it with something like low-rank adaptation around the side, but nevertheless, there are constraints on what you can do. So for just about any practical project, like you've got some data set and you want to understand it and get facts out of it or something like that, essentially the only sensible choice is to say, I am going to use Hugging Face transformers, which we have a tutorial on coming up ahead, and I will load some pre-trained model and I will be running it over the text, and then I'll be working out some other stuff I can do on top of and around that. Building your own architecture is really only a sensible choice if you can do something in the small, which is more a sort of exploring-architectures project. If you've got an idea like, hey, I've got an idea for a different nonlinearity that I think will work better than using a ReLU, let me investigate, that kind of thing, because then you can do small experiments. Yeah, maybe I won't read out all of this list, but there are lists here of some of the ideas of what's more interesting now. But do be cognizant of the world we're in in terms of scale. One of the problems we now have is that people have seen the latest paper that was being pushed by DeepMind or whoever, doing some cool graph-structured reasoning search to do things, and they turn up and say, I want to do this for my project. But a lot of the time, if you read further into the paper, you'll find that they were doing it on 32 A100s for a month. And that's not the scale of compute that you're going to have available to you in almost all circumstances. Maybe there are one or two industry students who can do that; if so, go for it. But for the vast majority of people, not likely. So you do have to do something that is practical. But that practicality holds for the vast majority of the people in the world, and if you look around in blogs and so on, you find lots of people doing stuff in lightweight ways and describing how to do that. And that's why methods like parameter-efficient fine-tuning are really popular, because you can do them in lightweight ways. A question related to that, and I'll end on this: I just want to mention again, if you want to, you're welcome to use GPT-4 or Gemini Pro or Claude Opus or any of these models in your project. But it has to be API usage then; you can't possibly train your own big models.
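Going back to the "use Hugging Face transformers, load a pre-trained model, and maybe adapt it with something like low-rank adaptation around the side" workflow from a moment ago, here is a minimal hedged sketch. The model name, target modules, and hyperparameters are just placeholders you would swap for whatever your project actually uses, and it assumes the transformers and peft libraries are installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "gpt2"  # placeholder; a real project might pick a larger open model that fits its GPU
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Run the pre-trained model over some text as-is
inputs = tokenizer("The translation of this sentence is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))

# Low-rank adaptation "around the side": freeze the base model and train only small adapters
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],  # attention weights in GPT-2; this is model-specific
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a tiny fraction of the weights are trainable
# From here you'd fine-tune on your own data, e.g. with the Trainer API.
```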
I mean, even for the models that are available open source, for the big ones you can't even load them into the kind of GPUs you have. So you probably can load a Llama 7B model, but you can't just load a Llama 70B model into your GPU. You have to be realistic about that size. But there are actually now lots of interesting things you can do with API access, doing things like in-context learning and prompting and exploring that, or building larger language-model programs around these language model components, and you're certainly encouraged to do that. There are lots of other things you can do, such as analysis projects, which look at: are these models still sexist and racist, or do they have a good understanding of analogies, or can they interpret love letters, or whatever is your topic of interest? Lots of things you can do, and that's totally allowed. But again, remember that we'll be trying to evaluate this on what interesting stuff you did. So your project shouldn't be: I ran this stuff through GPT-4 and it produced great summaries of the documents, I am done. The question is, what did you do in addition to that to have an interesting research project? Okay, I'll stop there. Thanks a lot.