Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 7 - Attention, Final Projects and LLM Intro
The lecture begins by reviewing machine translation with multi-layer LSTMs and stresses the importance of evaluating machine translation systems. The focus is on evaluation methods, in particular the BLEU (Bilingual Evaluation Understudy) score. BLEU scores a system output by comparing its n-gram overlap (typically 1- to 4-grams) against one or more human reference translations; more overlap means a higher score. It has clear limitations: a good translation may get a low score simply because its word choices differ from the references, and a poor translation can pick up points from superficial word matches.
The lecture then reviews the history of machine translation: from the statistical phrase-based systems pioneered by IBM in the late 1990s (and popularized by Google and others in the 2000s), to the syntax-based systems explored between roughly 2005 and 2014 (intended to improve translation for language pairs with very different word order, such as German and Chinese, but with little payoff), and finally to neural machine translation (NMT). NMT appeared around 2014, entered shared evaluations in 2015, had surpassed the other approaches by 2016, and has kept improving sharply since; NMT systems now often reach BLEU scores of 50 to 60.
Finally, the lecture previews the attention mechanism, a comparatively recent idea (unlike most neural network concepts, which predate 2000) that is now essential to modern neural networks. It was first proposed in the context of machine translation and is a core component of the Transformer models discussed next.
Tags
Media Details
- Upload date
- 2025-05-15 21:29
- Source
- https://www.youtube.com/watch?v=J7ruSOIzhrE
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Okay, welcome everyone to week four. So for today, what I want to do is, first of all, do a couple more bits on machine translation, especially talking a little bit about evaluating machine translation, and then I want to spend a while on attention. So attention is a very fundamental concept of neural networks, which was originally developed in the context of machine translation, but it's also then a very central concept when we're talking about transformers, which we start talking about on Thursday. Okay. So getting straight into it, this is the picture that we saw towards the end of last time: this is how we were building a machine translation system, where we were using a multi-layer LSTM, where we were feeding in the source sentence, and then we were flipping to turn the model into a decoder with different parameters, which would generate one word at a time to produce the translated sentence. So here I've got a German sentence and it's produced an English translation that looks like a pretty good one. But you know, we're going to want to have a way of deciding, well, are we producing good translations or not? And so we need some way to evaluate machine translation. Now, this is a complex area, because if you start poking around in the literature, people have proposed literally hundreds of different measures that could be used to evaluate machine translation systems. I'm guilty of writing a couple of papers on it myself, so I've contributed to the problem. But you know, by far the most common measure that you see to this day was essentially the first measure proposed to automatically evaluate machine translation, which was the BLEU measure, which was proposed to stand for Bilingual Evaluation Understudy, though that went along with the fact that it was proposed by IBM, probably not a coincidence. So until this point, the only way that people had really used for evaluating translations was getting human beings to look at them and say how good a translation this is. And you know, that's still a gold-standard measure that is widely used for evaluating translations, because many of the automatic measures have various kinds of biases and problems that make human evaluation useful. But on the other hand, a lot of the time we'd like to iterate quickly on evaluations, we'd like to use evaluations in training loops and things like that. And the IBM people with the BLEU paper suggested, well, maybe we can come up with a halfway decent automatic method of evaluating translations. And the idea of what they proposed was this: we're going to have one or more reference translations for a piece of text. So these are human-written translations. And then we can score any automatic translation mainly on how often it has overlapping one-, two-, three-, and four-grams. The number four isn't special; you could have only gone up to three, or to five, but four was seen as a reasonable length of overlapping n-grams to match against one of the reference translations. And the more overlap you have, the better. And there's discussion of this evaluation in the assignment, so you can think about it a bit more. And I won't go through all the formulas right now, but you know, that's most of it. And so here's a picture of how that looks. So the original idea was that what we should do is have several reference translations, and then we get a machine translation, and then we look at this machine translation and try and find pieces of it in the reference translations.
So we can certainly find the unigram. We can't find 'American' at all, but we can find 'International Airport', and it's in the second reference translation, so we're going to get a four-gram match for that. We can find 'that' again, that's easy. Other parts of the output aren't a very good translation, right? So those all miss. But then you start to find other pieces that do overlap, and you use those to work out a score. The original idea was that you should always have multiple reference translations so that you can sample the space of possible translations and have reasonable coverage. In practice, for what's been done more recently, it's not so uncommon that people do this with only one reference translation. And the argument then is still on a kind of probabilistic basis: the more often you have a good translation, the more often you'll get matches, and therefore your score will be better. Yeah, so why did people come up with this, and why is it still imperfect? Well, the problem with translation is that there isn't one right answer. It's not like the kind of classification things you see in machine learning, where you show people a picture and the right answer is to say the class of this object is, whatever, a Labrador dog breed or something. For any sentence, there are many different ways to translate it. And you know, translators can sit around and argue that, oh, this phrasing is a little bit nicer than this phrasing, blah, blah, blah. But to a first approximation, you can translate the sentence in lots of ways. And those different ways of translating can involve different word orders. So you can't really check the words off as you go along in the sentence, and that's what motivated this idea of matching n-grams anywhere, so you can get reasonable credit for having the right matches. But nevertheless, it's a pretty crude measure, right? You can still get a poor BLEU score for a good translation just because the words you chose didn't happen to match a reference translation. And also you can get points for things without really having a good translation at all, right? If you just have words that match, even if they're playing completely the wrong role in the sentence, you will get some points. But it's harder to get n-gram matches for larger n unless you're using words the right way. There's one other trick in the BLEU measure: there's a penalty for too-short system translations, because otherwise you could leave out everything difficult and only translate the easy part of the sentence, and then for the bits you have translated, you could be getting a high score for the precision of those pieces. Okay, so when you're developing MT systems for assignment three, we'll evaluate them with BLEU. So now that we have an evaluation measure, we can start looking at how well systems do on a BLEU score. BLEU scores are theoretically between zero and 100, but you're never going to get to 100 because of the variations in how you can translate things. And so typically, if you can start to get into the twenties, you can sort of understand what the source document was about. Once you get into the thirties and forties, the translations are getting much, much better. Yeah. So statistical phrase-based translation was pioneered by IBM in the late nineties, actually, and was sort of redeveloped in the 2000s decade.
And it was what Google launched as Google Translate in the 2000s decade. And it continued to be worked on for sort of the following decade. But there was basically a strong sense that progress in translation using statistical phrase-based systems had basically stalled: it got a little bit better each year as people could build traditional n-gram language models with more data every year and things like that, but the numbers were barely going upwards. So in the years from about 2005 to 2014 or 2015, the dominant idea in the machine translation community was that the way we were going to get better machine translation was doing syntax-based machine translation. If we actually knew the structure of sentences and we parsed them up, then we'd know what the role of words was in sentences, and then we'd be able to translate much better. And this was particularly invoked by looking at languages where translation worked terribly. So in those days, translation worked sort of okay for languages like French to English or Spanish to English, which are kind of similar European languages. But the results were way worse for Chinese to English or German to English. And even though English is a Germanic language, German has a very different word order to English, with commonly verbs at the end of a clause and different elements being fronted. So people tried to work on grammar-based, syntax-based methods of statistical machine translation, and I was one of those who worked on those in the late 2000s decade. But you know, the truth is it sort of didn't really work, right? If the rate of progress in syntax-based machine translation had slightly more slope than phrase-based machine translation over these years, the amount of slope wasn't very much. So things were completely thrown on their head when neural machine translation got invented. Because, as I explained, the first attempts were in 2014. The first cases in which it was evaluated in bake-off evaluations were in 2015. And in 2015, it wasn't as good as the best other machine translation methods, but by 2016 it was, and it was just on this much, much steeper slope of getting way, way better. And this graph only goes up to 2019, but it's continued to go up. And so it's not that uncommon these days that you see BLEU numbers in the fifties and sixties for neural machine translation systems. So that's a good news story. So after this, I want to go on and introduce this idea of attention, which is now a very fundamental, important idea in neural systems. It's also interesting because it's actually something novel that was invented kind of recently. For everything that we've done in neural networks up until now, really it had all been invented before the turn of the millennium, right? Basic feed-forward neural networks, recurrent neural networks, LSTMs, other things that we haven't yet talked about, like convolutional neural networks: they were all invented last millennium. It was really a waiting game at that point until there was sufficient data and computational power for them really to show how good they were. But attention was something that actually got invented in 2014 in the origins of neural machine translation, and it proved to be a very transformative idea for making neural networks more powerful. So the idea that motivated attention was looking at exactly this kind of machine translation problem.
So we were running our LSTM over the source sentence, and then we were using this hidden state as the previous hidden state that we were feeding into the generator LSTM for the target sentence. And what that means is that everything useful about this sentence has to be stuffed into that one vector. Well, that's maybe not so hard if you've got a four-word sentence, but maybe you've got a 40-word sentence out here, and it seems kind of implausible that it would be a good idea to be trying to fit everything about that sentence into this one hidden state. And well, obviously there are crude solutions to this: you make the hidden states bigger, and then you've got more representational space; you use a multi-layer LSTM, and you've got more representational space. But it still seems a very questionable thing to do. And it's certainly not like what a human being does, right? If a human being is translating a sentence, they read the sentence and they've got some idea of its meaning, but as they start to translate, they look back at the earlier parts of the sentence and make use of that in their translation. So this doesn't seem like a very plausible model. The idea should be that our neural nets should be able to attend to different things in the source, so that they can get information as needed by looking back in the sentence. And so this is the idea of attention. On each step of the decoder, we're going to insert direct connections to the encoder so we can look at particular words in the sentence. So I've got a bunch of diagrams that go through what we do, and then after that, I'll present the equations that go along with this. Okay, so once we're starting to translate, we've got a hidden state at the start of our generator, and then we're going to use this hidden state as our query to look back into the encoder to try and find useful stuff. So we're going to compare, in a way I'll make precise later, this hidden state with the hidden state at every position in the source sentence. Based on our comparisons, we're going to work out an attention score: where should we be looking in the source sentence while generating, here, the first word of the translation? And so based on these attention scores, we'll stick them into a softmax, as we commonly do, and we'll then get a probability distribution over the different positions in the sentence. Then we will use this weighting to compute a representation based on the encoder, which is going to be a weighted average of the encoder states. So in this particular case, it'll be nearly entirely the representation above the first word, 'il', which means 'he' in French. So then we'll take that attention output, and we'll combine it with the hidden state of our decoder, and we'll use both of them together to generate an output vector, which we stick through our softmax to generate a word as the first word of the translation, y1. And so then at that point, we just repeat this over. We then go on to generating the second word: we copy down the first word we generated, start to generate the second word, and work out attention at every position, which gives us... oh, sorry, there's a little note there, which is a fine point that maybe I won't deal with. But it points out that sometimes you also do things like stick the previous time step's attention output into the next step as an extra input. And we actually do that in, it should say, assignment three there; that's buggy.
So there are other ways to use things, but I'll sort of gloss over that. So we generate another word, and we repeat over, and at each time step, we're looking at different words in the source, and they will help us to translate the sentence. Yeah, a quick question: why does it point into the...? Say again? You mean why the green part? Okay, so the green vector, the hidden vector of the decoder, is going to be used together with the hidden states, the hidden vectors of the encoder, one at a time, to calculate the attention scores. So the attention score at a position is going to be a function of the hidden state of the encoder at that position and the current hidden state of the decoder. And I'll explain exactly how in a moment. Any other questions? Okay, well, so here it is in math. So we have encoder hidden states, which we're going to call h, and we have decoder hidden states, which we're going to call s, so they're something different. And at each point we'll be at some particular time step t, so we'll be dealing with s_t. To calculate the attention scores for generating the word for time step t, we're going to calculate an attention score for each position in the encoder. Okay, I'll discuss alternatives for this in a moment, but the very easiest way to calculate an attention score, which is shown here, is to take a dot product between the hidden state of the encoder and the current hidden state of the decoder. And so that's what we're showing here. So that will give us some dot product score, which is just any number at all. Then the next thing we do is we stick those e_t scores into our softmax, and then that gives us our probability distribution as to how much weight to put on each position in the encoder. And so then we calculate the weighted average of the encoder hidden states, which we're just doing with the obvious equation: we take the weighted sum of the hidden states of the encoder based on the attention weights. And then what we want to do is concatenate our attention output and the hidden state of the decoder, which gives us a double-length vector, and then we're going to feed that into producing the next word from the decoder. So typically, that means we're multiplying that vector by another matrix and then putting it through a softmax to get a probability distribution over words to output, and choosing the highest-probability word. Okay, that makes sense, I hope. Yeah. Okay, so attention is great. Inventing this idea was completely transformative. The very first modern neural machine translation system was done at Google in 2014, and they used a pure, but very large, very deep LSTM, an eight-layer deep LSTM with a very large hidden state for the time, and they were able to get good results. But very shortly thereafter, people at the University of Montreal, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, did a second version of neural machine translation using attention. And with a much more modest compute budget, of the kind that you can afford in universities, they were able to get better results, because attention was their secret weapon. So attention significantly improved NMT performance. Essentially every neural machine translation system since has used attention like we've just seen. It's also more human-like, as I was indicating, because it's sort of what a human would do: you look back in the sentence to see what you need to translate.
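Written out compactly, the scoring, softmax, and weighted-average steps described above are as follows, with encoder hidden states h_1, ..., h_N and decoder hidden state s_t (the output projection W_o is an illustrative name for "another matrix" here, not necessarily the slide's notation):

```latex
\begin{aligned}
e_{t,i} &= s_t^{\top} h_i, \quad i = 1,\dots,N        && \text{(dot-product attention scores)} \\
\alpha_t &= \operatorname{softmax}(e_t)               && \text{(attention distribution over source positions)} \\
a_t &= \sum_{i=1}^{N} \alpha_{t,i}\, h_i              && \text{(attention output: weighted average of encoder states)} \\
\hat{y}_t &= \operatorname{softmax}\!\bigl(W_o\,[a_t;\, s_t]\bigr) && \text{(predict the next word from the concatenation)}
\end{aligned}
```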
And it solves this bottleneck problem. You no longer have to stuff all the information about the source sentence into one hidden state. You can have the whole of your representational space from your entire encoding and use it as you need it. It also helps with the vanishing gradient problem. This is connected to what I was saying last time when talking about residual connections: a way out of the vanishing gradient problem is to directly connect things, and this provides shortcut connections to all of the hidden states of the encoder. Another nice thing that attention does is it gives you some interpretability. By looking at where the model is attending, you can basically see what it's translating at different time steps, and that can be really useful. It's kind of like we can see what we're translating where, without explicitly having trained a system that does that. So for my little toy sentence here, 'he hit me with a pie': at the first position, it was looking at the first word, 'il', 'he', which it translates. Then in French there's this verb 'entarter', to sort of pie somebody; I guess in English as well you can use 'pie' as a verb, right? The 'a' is a sort of perfect past auxiliary, so it's sort of like 'he has me pied' is what the French words are, one at a time. And so the 'hit' is already looking at the 'entarté', then the 'me' is attending to the 'm'', which means 'me', and then all of 'with the pie' is attending still to 'entarté', which is basically the right kind of alignment that you want for the words of a sentence. So that's pretty cool too. Okay. So up until this point I've just said, oh, we could do a dot product, but in general there's more to it than that. What we have is some values, h1 to hN, and we have a query vector, and we want to work out how to do attention based on these things. So attention always involves computing some attention scores, taking the softmax to get an attention distribution, and then getting an attention output. But the part where there's variation is how you compute these attention scores, and a number of different ways have been proposed for that. I just want to go through that a little bit. So the simplest way, which I just presented, is this dot-product attention: we just take the hidden states and dot product the whole of them. That sort of works, but it doesn't actually work great. And I discussed this a bit when talking about LSTMs last time, right? The hidden state of an LSTM is its complete memory, so it has to variously store lots of things in that memory. It's got to be storing information that'll help it output the right word. It has to be storing information about the future, about other things that you'll want to say given the kind of sentence context, grammar, and previous words you've said. It's got all kinds of memory. And so it sort of makes sense that some of it would be useful for linking up, for looking back, and some of it would be less useful. You sort of want to find the parts that are related to what you want to say immediately, not all the parts that are to do with the rest of the future. So that suggested maybe you could do a more general form of attention. And so Thang Luong and me in 2015 suggested maybe we could introduce what we called bilinear attention, which I still think is a better name.
But the rest of the world came to call it multiplicative attention, where what we're doing is, between these two vectors, we're sticking a matrix. And so we're then learning the parameters of this matrix, just like everything else in our neural network. And so effectively, this matrix can learn which parts of the generator hidden state you should be using to look for things, and where in the hidden states of the encoder. In particular, it no longer requires that things have to match up dimension by dimension. It could be the case that the encoder is storing information about word meaning here, and the decoder is storing information about word meaning there, and by learning appropriate parameters in this matrix, we can match those together and work out the right place to pay attention. So that seemed a kind of cool approach to us. Yeah, could you even build, like, a little neural network that takes them as input and output? You can do that. I was going to get to that on the next slide. Actually, that's in a way sort of going backwards, but I will get to it on the next slide. But before I do that, I will show you these other versions. So the one thing you might wonder about doing it this way is, you know, there are a lot of parameters that you have to learn in the matrix W. There aren't that many in my example, because there are only 36, but that's because my hidden states are only of length six, right? And if your hidden states are of length 1,000, say, then you've got a million parameters in that W matrix. And that seems like it might be kind of problematic. And so the way to get beyond that, which was fairly quickly suggested thereafter, is: well, maybe rather than having that whole big matrix in the middle, what we could instead do is form it as a low-rank matrix. And the easy way to make a low-rank matrix is to take two skinny matrices like this, where this is the rank of the pieces, and multiply them together, which would give us the big matrix that I showed on the last slide. And so this gives you a low-parameter version of the bilinear attention matrix from the last slide. But at that point, if you just do a teeny bit of linear algebra, this computation is exactly the same as saying: well, what I'm going to do is take each of these two vectors and project them to a lower-dimensional space using these low-rank transformation matrices, and then I'm going to take the dot product in this low-dimensional space. And on Thursday, when you get to transformers, what you will see is that transformers do exactly this: they take the big vectors, project them to a low-dimensional space, and then take dot-product attention in that low-dimensional space. Okay, back to the question. Yeah, you're totally right. And at this point, I'm going in an ahistorical manner, because actually the first form of attention that was proposed, in the Bahdanau et al. paper, was: hey, let's just stick a little neural net there to calculate attention scores. So we take the s and the h, we multiply them both by a matrix, add them, put them through a tanh, multiply that by a vector, and we get a number. This looks just like the kind of computations you see everywhere else in an LSTM. So there's a little neural net that's calculating the attention scores, and then they go into a softmax, as usual. In most of the literature, this is called additive attention, which also seems to me a really weird name.
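As a rough sketch of the multiplicative attention and its low-rank equivalence described above (plain NumPy, with made-up dimensions; the variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 1000, 64, 12            # hidden size, low-rank dim, source length
s = rng.standard_normal(d)        # decoder hidden state s_t
H = rng.standard_normal((N, d))   # encoder hidden states h_1..h_N (one per row)

# Multiplicative (bilinear) attention: e_i = s^T W h_i, with W built as a
# low-rank product of two skinny matrices U and V.
U = rng.standard_normal((d, k))
V = rng.standard_normal((d, k))
W = U @ V.T                       # d x d matrix of rank (at most) k
scores_full = H @ W.T @ s         # e_i for every source position i

# Equivalent view: project both vectors down to k dimensions, then dot product.
scores_lowrank = (H @ V) @ (U.T @ s)
assert np.allclose(scores_full, scores_lowrank)

# The rest of attention is the same as before: softmax, then weighted average.
alpha = np.exp(scores_full - scores_full.max())
alpha /= alpha.sum()
attention_output = alpha @ H      # weighted average of the encoder states
```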
I mean, I think kind of saying you've got a little neural net makes more sense for that one. But anyway, this is what they proposed and used. And at this point, it's a little bit complex, to be honest. So when we wrote our paper the next year, we had found that the bilinear attention worked better for us. But there was subsequent work, especially this massive exploration of neural machine translation architectures, that argued that actually, with the right kinds of good hyperparameter optimization, this is better than the bilinear attention. But you know, this is a lot more complex and a lot slower than doing what you're doing in the upper part of the chart. So regardless of whether it's better or not, in practice what has completely won is doing this. And this is what transformers use, and just about all other neural nets that are used these days. Okay, questions on attention? It will be found in assignment three. Yeah. So I won't say much more about this now, and we'll see more of it just next lecture. But attention is a very general technique, right? It was a great way to improve machine translation, and that was how it was first invented. But for all kinds of neural architectures, for all kinds of purposes, you can stick attention into them. And the general finding was that it always improved results. So in general, anywhere you have a set of values and a query vector, you can use attention to get a weighted average of the values, which finds relevant information that you can use to improve your performance. And so maybe I won't try to give examples of that now, but you'll see another example of attention immediately when we do things on Thursday, when we start doing self-attention inside transformers. Yes? speaker 2: [inaudible] speaker 1: Great. Not yet. No, we did not. speaker 2: [inaudible] speaker 1: I mean, it didn't seem especially necessary. I don't know. But no, we do not. Okay. Well, this is the end of the part with attention. Are there any other questions? Yes. speaker 2: For the RNN attention stuff, is there a need for positional information, or is that not required? speaker 1: Is there a need for positional information? So there was none, and it seemed like it wasn't very required. I mean, you could make some argument that maybe position information might have been useful, but there's also a good argument that it wasn't necessary. Positional information only becomes necessary when you get to a transformer. And the reason for that is, going back to the pictures: these encoder states are being calculated with respect to the previous encoder state, right? Because it's a recurrent neural network, and therefore the representation here knows something about the past. So it kind of knows what position it's in, basically, and that's giving a lot of that information. Or another way to think about it is that this final representation will give a certain overall sense of the semantics of the sentence. So to the extent that you're looking backwards, the sort of associative matching of similar semantic content that's needed seems sufficient, and you don't really need additional positional information. Okay, I will go on. Okay, so that's the neural network content for today.
And so for the remaining 39 minutes, I want to talk about final projects, but also a bit about data, experiments, and things like that. Okay, so this is a reminder on the class. We've got the four assignments, which are 48%, and then the big other part of what you need to do is the final project, which is 49%, almost completing things out except for the participation. And let me just give one note about collaboration and the honor code. For final projects, it's quite usual that people use all sorts of stuff that was written by other people. That's completely fine. We don't expect you to implement everything from scratch, but you must document what you're using. You give references or URLs if you're using other people's code rather than writing your own. We do want to know what code you wrote yourself and what things you downloaded from PyPI. And in particular, in thinking about final projects, the question of interest for us is: what value-add did you provide? You haven't done something great if you've downloaded a really good neural network and run it on some data and it produces really good results; that's not much value-add. So if you want to have value-add in that context, you at least want to be doing something interesting of your own: understanding why it works so well, what kind of examples it doesn't work well on, doing some thorough experimental analysis. Yeah, a couple of other points there. Okay, so for the final project for this class, there's a binary choice. You can either do our default final project, which I'll talk about more a bit later, or you can come up with your own final project, and I'll talk about that a bit too. We allow team sizes of one to three. The complicated thing that comes up... oh, actually, sorry, I should say the other point first. Yeah. So we generally encourage people to form teams. That means that you can do something more interesting, that's more motivational, you can make friends, whatever. So teams are good. On expectations for teams: our expectation is that a bigger team should be able to do proportionally more work, and so when we're grading things, we expect to see more work from larger teams. Now, how this works out is, I will admit, a little bit complicated, because there's a quality issue that's separate from the amount of work. The reality is that it's just always the case that several of the very best projects are one-person efforts, because they're just somebody who has a good idea and knows what they want to do and does it by themselves, and it is great. But there are also great multi-person projects as well. But the point I'm making is, well, it kind of doesn't work if you're a one-person project and you try to attempt a huge amount of stuff and you can only get one third of the way through it; that's not a good recipe for doing well on the final project. For any project, you really need to be completing something and showing something. But nevertheless, if you're one person and you can show something kind of interesting, even if our reaction is, well, this would have been much better if they'd shown it was better than this other kind of model, or it would have been really nice if they'd run ablations to work things out, well, if you're one person, we'll give you a bye and say, oh, but there's only one person.
Whereas if you're a three-person team and it seems like you obviously should have compared it to some other models and you obviously could have run it on some other datasets, then we'll feel like, well, as a three-person team, they obviously should have done that, and therefore we should give them a less good score. And that's how that works out. The complication comes with other things people are doing at the same time. We allow people to do final projects that are shared with multiple classes, but the expectation is, again, that you'll do more work. So if there are two of you who are using one project for both this class and CS 231N, say, then it's sort of like a four-person project, and you should be doing a lot of work for it. There are other cases: sometimes people have RAships, or they're PhD rotation students, or other things. If you're doing it for other things, we'd like you to tell us, and we expect you to be doing more work for it. Okay. I'm very happy to talk to people about final projects, and I have been talking to people about final projects, but unfortunately, there's only one of me, so I definitely can't talk to 500 people about final projects. So I do also encourage you to talk to all of the TAs about final projects. On the office hours page, under all of the TAs, there's some information about things that they know about. So if you know what your project is about, you could at least try and find one of the most useful TAs, or just find a TA with a friendly face. Whatever mechanism you use, talk to TAs about final projects. Yeah. So, the default final project. What it's going to be is this: BERT was a famous early transformer, and we're going to be building and experimenting with a minimal BERT implementation. So if you do this, there's part of an implementation of BERT, and you're meant to finish it off, fine-tune it, and get some results for doing sentiment analysis. And then basically, we want even the default final project to be an open-ended project where people can do different things. So then there are lots of other ideas, or you can come up with your own, of ways you could extend this system and make it better, which might be with paraphrasing, contrastive learning, low-rank adaptation, something. And you can do something, and that is your final project. So why choose the default final project? If you haven't had much experience with research, you don't have any real idea of what you want to do for a final project, or you'd like something with clear guidance and a goal and a leaderboard (we provide a leaderboard for people doing the default final project, showing how good your performance is on the tasks we provide), then you can do the default final project. And I mean, honestly, I think for many people, the best option is to do the default final project. Going by past performance, typically about half the students do the default final project, including some people who start off thinking, I'll do a custom final project, and then after a couple of weeks they decide, huh, this makes no sense, what I suggested isn't working at all, I'm just going to abandon it and flip to the default final project. Okay? But we also allow custom final projects, and there are good reasons to do custom final projects.
So if you have some topic or research idea that you're excited about (maybe you're even already working on it), or you want to try something different on your own, or you'd just like to have more of the experience of trying to come up with a research goal, finding the data and tools, and starting from scratch, which is actually very educational, if considerably harder, well, then the custom final project is fine for you. Restriction on topics: I think we already sort of signaled this on Ed. We insist for CS 224N final projects that they have to substantively involve both human language and neural networks, because this is the NLP class, so we'd like people to know and learn something about human language. I'm totally aware of the fact that you can use these same models for bioinformatics sequences or music, radar, whatever, but we'd like you to do something with human language for this class. That doesn't mean it has to be only about human language. People have done things like visual language models or music and language, so it can have a combination of modalities, but it has to substantively, not completely trivially, involve human language. If you've got any questions about that, ask. And it also has to substantively involve neural networks. Again, it doesn't have to be wholly about neural networks. If you've got some idea like, oh, I think I could show using kernel machines that they work just as well as multi-layer neural networks, or something like that, that's of course fine to do as well. Gamesmanship: yeah, the default final project is more guided, but it's not meant to be a complete slacker's ride. We're hoping that people do the same amount of work for either kind of project. On the other hand, it does give you a clearer focus and course of things to do, but it is still an open-ended project. For both default final projects and custom final projects, there are great projects and there are not-so-great projects. If anything, there's a bit more variance in the custom final projects. So the path to success is not to try and do something for the custom final project that just looks really weak compared to people's default final projects. Okay? You can get good grades either way; we give best project awards to both kinds of projects. So yeah, it's really not that there's some secret one you have to pick. Computing: yeah, so to be honest, with the confessions right at the beginning, we're actually in a less good position for computing than we've been in recent years. And it's all OpenAI's fault. No, well, partly. But up until and including last year, we had invariably managed to get very generous cloud computing giveaways from one or another cloud computing provider, which really provided a lot of computing support. But there's the great GPU shortage at the moment, due to the great success of large language models, and it turns out that cloud compute providers just aren't being as generous as they used to be. And gee, I guess the AWS rep was pointing out that my course was their single largest grant of free GPU last year, so it's getting harder to do. So really, people will have to patch things together more in many cases, and we'll be relying on the ingenuity of students to be able to find free and cheap stuff. So Google is giving $50 of credit per person on GCP, which can be used for assignments three and four and the final project.
On all the clouds, if you haven't used an account before, you can usually get some free starter credits, which can be a useful thing. There are the sort of Jupyter notebooks in the cloud; the most used one is Google Colab, which allows limited GPU use. It often tends to get tighter later in the quarter, so you might find it a good investment to not have a couple of lattes and pay ten bucks a month for Colab Pro, which gives you much better access to GPUs. But there are alternatives to that which you might also want to look at. So AWS provides a Jupyter notebook environment, SageMaker Studio Lab. And, also owned by Google, Kaggle separately provides Kaggle notebooks, which actually commonly give you better GPU access than Google Colab provides, even though they're otherwise not as nice: Kaggle notebooks are sort of just bare-bones Jupyter notebooks, whereas Colab has some fancier UI stuff grafted onto it. Other possibilities: Modal is a low-priced GPU provider and allows a certain amount of free GPU usage a month, so that could be handy. There are other lower-cost GPU providers, like Vast.ai, which could be of relevance. And then the other thing that I'll say more about in a minute is, the way things have changed with large language models, there are lots of projects you might want to do where you're not actually building models at all yourself, but you want to do experiments on large language models, or in-context learning with large language models, or other things of that sort. And then what you want is to have access to large language models, and in particular, you probably want API access so you can automate things. So another thing that we have been able to get, through the generosity of Together AI, is that Together AI is providing $50 of API access to large language models, which can actually be a lot. How much of a lot it is depends on how big a model you're using. So something you should think about is: how big a model do you really need to use to show something? Because if you can run a 7-billion-parameter language model on Together, you can put a huge number of tokens through it for 50 bucks, whereas if you want to run a much bigger model, then the number of tokens you can get through goes down by orders of magnitude. So that's good. And I mentioned some other ones. We've already put a whole bunch of documents up on Ed that talk about these different GPU options, so do look at those. Okay, jumping ahead. So the first thing you have to do is a project proposal, and it's one per team. So I guess the first step is to work out who your team is. For the project proposal, part of it is actually giving us the details of your project, but there's another major part of it, which is writing a review of a key research paper for your topic. For the default final project, we provide some suggestions, but you can find something else if you've got another idea for how to extend the project; for your custom project, you're finding your own. But what we want you to do is get some practice at looking at a research paper, understanding what it's doing, understanding what's convincing, what it didn't consider, what it failed to do. And so we want you to write a two-page summary of a research paper.
And the goal is for you to be thinking critically about this research paper: what did it do that was exciting, versus what did it claim was exciting but was really obvious or perhaps even wrong, etc. Okay. Right, so then after that, we want you to say what you're planning to do. That may be very straightforward for a default final project, but it's really important for a custom final project. In particular, tell us about the literature you're going to use, if any, and the kind of models you're going to explore. But it turns out that when we're unhappy with custom final projects, the two commonest complaints about what you tell us are these: you don't make clear what data you're going to use (and we're sort of worried already if you haven't worked out by the project proposal deadline what data you can use for your final project), and you don't tell us how you're going to evaluate your system; we want to know how you're going to measure whether you're getting any success. As a new thing this year, we'd like you to include an ethical considerations paragraph outlining potential ethical challenges of your work, if it were deployed in the real world, and how that might be mitigated. This is something that a lot of conferences and a lot of grants are now requiring, so I want to give you a little bit of practice on that by writing a paragraph of it. How much there is to talk about varies somewhat on what you're trying to do and whether it has a lot of ethical problems or whether it's a fairly straightforward question-answering system. But in all cases, you might think about what the possible ethical considerations of this piece of work are. Okay, the whole thing is a maximum of four pages. Okay, so for the research paper summary, yeah, do think critically, right? The worst summaries are essentially people just paraphrasing what's in the abstract and introduction of the paper, and we want you to think a bit harder than that. What were the novel contributions of the paper? Is it something that you could use for different kinds of problems in different ways, or was it really exploiting a trick of one dataset? Are there things that it seemed like they missed, or could have done differently, or that you weren't convinced were done properly? Is it similar to or distinct from other papers that are dealing with the same topic? Does it suggest perhaps something that you could try that extends beyond the paper? Okay. And for grading these final project proposals, most of the points are on that paper review, so do pay attention to it. There are some points on the project plan, but really we're mainly wanting to give you formative feedback on the project plan, and comments as to how we think it's realistic or unrealistic. Nevertheless, we're expecting you to have an idea, to have thought through how you can investigate it and how you can evaluate it, datasets, baselines, things like that. Oh yeah, I should emphasize this: do you have an appropriate baseline? For anything that you're doing, you should have something you can compare it against. Sometimes that's a previous system that does exactly the same thing. But if you're doing something more novel and interesting, you should be thinking of some seat-of-the-pants, obvious way to do things, and proving that you can do better than that. And what that is depends a lot on what your project is.
But if you're building some complex neural net that's going to be used to work out textual similarity between two pieces of text, well, a simple way of working out textual similarity between two pieces of text is to look up the word vectors for every word in the text, average them together, and work out the dot product between those average vectors. And unless your complex neural network is significantly better than that, it doesn't seem like it's a very good system. So you should always attempt to have some baselines. After the project proposal, we also have a project milestone stuck in the middle, to make sure everybody is making some progress. This is just to help make sure people do get through things and keep working on it, so that we'll have good final projects. For most final projects (I'll say more about this in a minute), the crucial thing we expect for the milestone is that you've got set up and you can run something. It might just be your baseline of looking up the word vectors, but it means you've got the data and the framework and something that you can run and produce a number from. And then there's the final project. We have people submit their code for the final projects, but final projects are evaluated almost entirely, unless there are some major worries or concerns, based on your project report. So make sure you put time into the project report, which is essentially a research paper, like a conference paper. They can be up to eight pages, and it varies on what you're doing, but this is the typical picture of what they look like: have an abstract, an introduction; it'll talk about other related work; it'll present the model you're using, the data you're using, and your experiments and their results; and it'll have some insightful comments in its analysis and conclusion at the end. Okay? Finding research topics for custom projects: there are all kinds of things you can do. Basic philosophy of science: you're normally either starting off with "here's some problem I want to make some progress on", or "here's this cool idea for a theoretical technique or a change in something, and I want to show it's better than other ways of doing it", and you're working from that. We allow different kinds of projects. One common type of project is that you've got some task of interest and you're going to try and solve it or make progress on it somehow: say, you want to get information out of State Department documents, and you're going to see how well you can do it with neural NLP. A second kind is that you've got some ideas for doing something different with neural networks, and you're going to see how well it works. Or maybe, given there are large language models these days, you're going to see how, using large language models, you can do something interesting by in-context learning or by building a larger language model program. So nearly all 224N projects are in those first three types, where at the end of the day you've got some kind of system and some kind of data and you're going to evaluate it, but that's not a 100% requirement. There are different kinds of projects you can do, and a few people do. So you can do an analysis or interpretability project. You could be interested in something like: how could these transformer models possibly understand what I say to them and give the right answers to my statements? Let me try and look inside the neural networks and see what they're computing.
Recently there's been a lot of work on this topic, often under titles like mechanistic interpretability, circuits, and things like that. So you can do some kind of analysis or interpretability project, or you could even just look at the behavior of models on some task. You could take some linguistic task, like metaphor interpretation, and see which neural networks can interpret them correctly, or which kinds of them they can interpret correctly or not, and do things like that. Another kind is a theoretical project. Occasionally people have done things looking at, well, here's a good example, something that's in the math. An example that was actually done a few years ago and turned into a conference paper was looking at, in the estimation of word vectors, the stability of the word vectors that were computed by different algorithms, word2vec versus GloVe, and deriving results with proofs about the stability of the vectors that were calculated. So that's allowed; we don't see many of those. Here, very quickly, are sort of just some random things. A lot of past projects you can find on the 224N web page: you can find past years' reports and look at them to get ideas as you wish. So Deep Poetry was a gated LSTM, a language model that generated successive words; they had extra stuff in it to make it rhyme in a poetry-like pattern, and that was kind of fun. You can do a reimplementation of a paper that has been done previously. This is actually kind of an old one, but I remember it well. Back in the days before transformers, DeepMind did these kind of interesting papers on neural Turing machines and differentiable neural computers, but they didn't release implementations of them. And so Carol set about writing her own implementation of a differentiable neural computer, which in a way was a little bit crazy. A few days before the deadline, she still hadn't got it working, so it could have been a complete disaster. But she did get it working before the deadline and got it to run, producing some interesting results. So that was kind of cool. So if it's something interesting, it doesn't have to be original; it can be reimplementing something interesting. Okay? Sometimes such papers do get published later as interesting ones. This was a paper that was, again, from the early days and was fairly simple, but it was a novel thing that gave progress. So the way we've presented these RNNs, you have word vectors at the bottom, and then you compute the softmax at the top. But if you think about multiplying by the output matrix and then putting that into the softmax, that output matrix is also like a set of word vectors, because you have a column for each word, and you get a score for each output word, and then you're putting a softmax over that. And so their idea was, well, maybe you could share those two sets of vectors, and you'd be able to get improvements from that, and you could. Okay, maybe I won't talk about that one further. Sometimes people have worked on quantized models; that's more of a general neural network technique, but providing you show you can do useful things with it, like getting good language modeling results even with quantized vectors, we'll count that as using language.
So in recent times (these last two are from 2024), a lot of the time people are doing projects with pre-trained large language models, which we will be talking about in the next three lectures, and then doing things with them. So you can do lightweight, parameter-efficient fine-tuning methods, you can do in-context learning methods, and things like this. I suspect that probably quite a few of you will do projects of this kind. So here's an example. Lots of work has been done on producing code language models, and these people decided to improve the generation of Fortran. Maybe they're physicists, I don't know. And they were able to show that they could use parameter-efficient fine-tuning to improve Code Llama for producing Fortran. Now, where was the natural language? Code has natural language comments in it, and the comments can be useful for explaining what you want the code to do. So it was effectively doing translation from a human-language explanation of what the code was meant to do into pieces of code. Here was another one, which was doing AI fashion-driven cataloging, transforming images into textual descriptions, which again was starting off with an existing visual language model and looking at how to fine-tune it. Okay, other places to look for stuff. You can get lots of ideas of areas and things people do by looking at past papers. You're also welcome to have your own original ideas, thinking about anything you know or work on in the world. For NLP papers, there's a site called the ACL Anthology that's good for them. There are lots of papers on language that also appear in machine learning conferences, so you can look at the NeurIPS or ICLR proceedings. You can look at past 224N projects. And then the arXiv preprint server has tons of papers on everything, including NLP, and you can look there. But I do actually think some of the most fun, best projects are by people who find their own problem, an interesting problem in their world. If you know about a cool website that has text on it, and you think you could get information out of it automatically by using a language model or something, there's probably something interesting and different you can do there. Another place to look is the various leaderboards for the state of the art on different problems; you can start looking through leaderboards for stuff and see what you find there. But on the other hand, the disadvantage of looking at things like leaderboards and past conferences is that you tend to end up trying to do a bit better on a problem someone else has done. And that's part of why, really often in research, it's a clever thing to think of something different, perhaps not too far from things that other people have done, but somehow different, so that you'll be able to do something a bit more original and different in what you're doing. Yeah. I do just want to go through this a bit quickly: over the years that I've been doing natural language processing with deep learning, there's been a sea change in what's possible. In the early days of the deep learning revival, most of the work in people's papers was trying to find better deep learning architectures. So that would be: here is some question answering system; I've got an idea of how I could add attention in some new place, or I could add a new layer into the neural network.
And the numbers would go up. And there were lots of papers like that, and it was a lot of fun, and that's what a lot of good CS 224N projects did too. And people were often able to build systems from scratch that were close to the state of the art. But in the last five years, your chances of doing this have become pretty slim, frankly. You can, if you've really got a good idea that's something different and original, by all means try, but it's kind of hard. So most work these days, even for people who are professional researchers, is making use of existing large pre-trained models in some way. And once you're doing that, that actually fixes a lot of your architectural choices, because your large pre-trained neural network has a certain architecture, and you kind of have to live with that. You might be able to do interesting things by adapting it with something like low-rank adaptation around the side or something, but nevertheless, there are constraints on what you want to do. So for just about any practical project, like you've got some dataset and you want to understand it and get facts out of it or something like that, essentially the only sensible choice is to say: I am going to use Hugging Face Transformers, which we have a tutorial on coming up ahead, and I will load some pre-trained model, and I will be running it over the text, and then I'll be working out some other stuff I can do on top of and around that. So building your own architecture is really only a sensible choice if you can do something in the small, which is more of an exploring-architectures project. If you've got an idea like, hey, I've got an idea for a different nonlinearity that I think will work better than using a ReLU, let me investigate, that kind of thing, because then you can do small experiments. Yeah, maybe I won't read out all of this list, but there are lists of some of the ideas of what's more interesting now. But do be cognizant of the world we're in in terms of scale. One of the problems we now have is that people have seen the latest paper being pushed by DeepMind or whoever, doing some cool graph-structured reasoning search to do things, and they turn up and say, I want to do this for my project. But a lot of the time, if you read further into the paper, you'll find that they were doing it on 32 A100s for a month. And that's not the scale of compute that you're going to have available to you in almost all circumstances. Maybe there are one or two industry students for whom that's possible; if so, go for it. But for the vast majority of people, it's not likely. So you do have to do something that is practical, but that practicality holds for the vast majority of people in the world, and if you look around in blogs and so on, you find lots of people doing stuff in lightweight ways and describing how to do that. And that's why methods like parameter-efficient fine-tuning are really popular: because you can do them in lightweight ways. A question related to that, and I'll end on this: I just want to mention again that, if you want to, you're welcome to use GPT-4 or Gemini Pro or Claude Opus or any of these models in your project. But it has to then be API usage. You can't possibly train your own big models.
I mean, even for the models that are available open source, for the big ones you can't even load them into the kind of GPUs you have. So you probably can load a Llama 7B model, but you can't just load a Llama 70B model into your GPU. You have to be realistic about that size. But there are actually now lots of interesting things you can do with API access: doing things like in-context learning and prompting and exploring that, or building larger language model programs around these language model components, and you're certainly encouraged to do that. There are lots of other things you can do, such as analysis projects, which look at: are these models still sexist and racist, or do they have a good understanding of analogies, or can they interpret love letters, or whatever your topic of interest is? Lots of things you can do, and that's totally allowed. But again, remember that we'll be trying to evaluate this based on what interesting stuff you did. So your project shouldn't be: I ran this stuff through GPT-4 and it produced great summaries of the documents, I am done. The question is: what did you do in addition to that to have an interesting research project? Okay, I'll stop there. Thanks a lot.
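As a hedged illustration of the in-context learning and prompting style of project mentioned above, here is a tiny sketch using the Hugging Face text-generation pipeline with a small local model; the model name and prompt are placeholders, and a real project might instead call a hosted API such as GPT-4 or a Together AI endpoint.

```python
# Illustrative sketch only: few-shot prompting with a small local model.
# A real project might instead call a hosted LLM API (GPT-4, Claude, etc.).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

# A few in-context examples followed by the query we want the model to complete.
prompt = (
    "Review: The food was cold and the service was slow. Sentiment: negative\n"
    "Review: Absolutely loved the atmosphere and the staff. Sentiment: positive\n"
    "Review: The movie dragged on but had a great ending. Sentiment:"
)

out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```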
Latest Summary (Detailed Summary)
Overview / Executive Summary
The lecture first reviews evaluation methods for machine translation (MT), focusing on the BLEU (Bilingual Evaluation Understudy) score as the most widely used automatic metric. BLEU scores a system translation by computing n-gram overlap (typically 1- to 4-grams) with one or more human reference translations, with a penalty for translations that are too short. Although BLEU has limitations (e.g., it cannot fully capture semantic equivalence), it remains widely used because it supports rapid iteration. Progress in statistical MT had largely stalled until neural machine translation (NMT) appeared around 2014; after 2016, with the introduction of the attention mechanism, performance improved dramatically, and current NMT systems often reach BLEU scores of 50-60.
The attention mechanism is the central technical topic of this lecture and a key part of the course assignments (e.g., Assignment 3). It addresses the information bottleneck of traditional encoder-decoder models by letting the decoder attend to different parts of the source sentence when generating each word. The lecture walks through the principle of attention, its computation steps (computing attention scores, softmax normalization, and a weighted sum that yields a context vector), and several variants (dot-product attention, multiplicative/bilinear attention, additive attention, etc.), noting that scaled dot-product attention is the dominant form in current models such as the Transformer. Attention not only improved NMT performance but also added interpretability, and it has become a cornerstone of modern neural networks.
Finally, the lecture covers the final project requirements in detail and, along the way, introduces core concepts of large language models (LLMs) and their role in modern NLP. Students may choose the default project (implementing and experimenting with a minimal BERT model for sentiment analysis) or a custom project. Teams have 1-3 members, and larger teams are expected to do more work. The lecture stresses "value add": even when using existing code or models, students must clearly state their own contribution. Proposals must include a critical review of one key research paper plus a new ethical-considerations section. On compute, limited support is available through GCP, Together AI, and others, and students are encouraged to use tools such as Colab Pro and Kaggle Notebooks. The lecture also discusses where to find research topics and notes that current NLP research has shifted toward fine-tuning or in-context learning with large pre-trained models, recommending lightweight approaches such as parameter-efficient fine-tuning (PEFT).
Machine Translation (MT) Evaluation
Why Evaluation Is Needed, and Its Challenges
- An effective way is needed to judge how good a machine translation system is.
- Human evaluation is the "gold standard," but it is slow and expensive and unsuitable for rapid iteration or use inside training loops.
- Automatic evaluation methods were proposed, of which BLEU (Bilingual Evaluation Understudy) is the most common.
The BLEU Score in Detail
- Origin: proposed by IBM; one of the earliest automatic MT evaluation methods.
- Core idea: compare the n-gram overlap (typically 1- to 4-grams) between the machine translation output and one or more human reference translations.
"the more overlap you have, the better."
- The original idea was to use several reference translations to cover translation variability, but in practice a single reference is often used.
- Computation: involves matching n-grams, plus a penalty for translations that are too short (so a system cannot inflate precision by translating only the easy parts).
- Score range: 0-100 in principle.
- Low-to-mid 20s: you can roughly understand the meaning of the source.
- 30s to 40s: noticeably better translation quality.
- Neural MT systems now often reach 50-60.
- Limitations:
- Translation variability: a sentence can be translated correctly in many ways, with different word choices and word order, so BLEU may give a good translation a low score simply because its wording differs from the reference.
- Surface matching: a translation may pick up points for matching words even when those words play entirely the wrong role in the sentence (higher-order n-gram matching only partially mitigates this).
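To make the n-gram overlap idea concrete, here is a minimal sketch of a BLEU-style score for a single hypothesis against a single reference; it is illustrative only, and in practice one would use a standard implementation such as sacrebleu.

```python
# Simplified BLEU-style score for one hypothesis against one reference.
# Real BLEU (e.g. sacrebleu) handles corpora, multiple references, smoothing, etc.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        # "Clipped" matches: a hypothesis n-gram only counts as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        precisions.append(overlap / total if total > 0 else 0.0)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * geo_mean

print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```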
MT History and BLEU Performance Over Time
- Statistical phrase-based translation:
- Pioneered by IBM in the late 1990s; adopted by Google Translate and others in the 2000s.
- Progress had largely stalled by the mid-2010s.
- Syntax-based statistical MT:
- Roughly 2005-2014; believed to be the key to better translation quality, especially for language pairs with very different word order (e.g., Chinese-English, German-English).
- But progress was slow in practice,
"the truth is it sort of didn't really work."
- Neural machine translation (NMT):
- First attempts appeared in 2014.
- In the 2015 evaluations it still trailed the best traditional systems.
- By 2016, NMT had overtaken other approaches and was improving on a much steeper curve.
"it was just on this much, much steeper slope of getting way, way better."
- Current NMT systems score very well on BLEU, commonly 50-60.
The Attention Mechanism
Motivation
- Traditional encoder-decoder models (e.g., LSTM-based NMT) compress all information about the source sentence into a single fixed-length hidden state vector.
"everything useful about this sentence has to be stuffed into that one vector."
- For long sentences (say, 40 words) this is a severe information bottleneck.
- It also does not match how human translators work, who look back at specific parts of the source sentence.
- The goal is to let the network attend to different parts of the source sentence at each decoding step, retrieving information as needed.
Core Idea and Procedure
- At each decoder step, build direct connections to all encoder hidden states.
- Procedure:
- Compare and score: compare the decoder's current hidden state (the query) against the encoder hidden states at every position (the keys/values) to compute attention scores.
- Normalize: pass the attention scores through a softmax to obtain a probability distribution (the attention weights).
- Weighted sum: use the attention weights to take a weighted average of the encoder hidden states, giving the context vector (the attention output).
- Combine and output: combine the context vector with the decoder's current hidden state to generate the target word.
- Repeat until the translation is complete.
- The lecture notes that the attention output from the previous time step is sometimes also fed in as an extra input at the current step (Assignment 3 does this).
Mathematical Formulation (Dot-Product Attention as an Example)
- Encoder hidden states: h_1, h_2, ..., h_N
- Decoder hidden state at time step t: s_t
- Attention scores: e_ti = score(s_t, h_i); the simplest choice is the dot product, e_ti = s_t^T h_i
- Attention weights: α_ti = softmax(e_ti), with the softmax taken over all i
- Context vector (attention output): a_t = Σ_i α_ti h_i
- Final output: concatenate a_t with s_t ([a_t; s_t]), then pass the result through a linear layer and a softmax to produce the probability distribution over the next word.
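A minimal NumPy sketch of one decoder step of dot-product attention, following the formulas above; the dimensions and random values are purely illustrative.

```python
# One decoder step of dot-product attention over RNN encoder states (NumPy sketch).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 5, 8                       # source length and hidden size (illustrative)
H = rng.normal(size=(N, d))       # encoder hidden states h_1..h_N, one per row
s_t = rng.normal(size=(d,))       # decoder hidden state at step t

e_t = H @ s_t                     # attention scores e_ti = s_t^T h_i
alpha_t = softmax(e_t)            # attention weights (sum to 1 over source positions)
a_t = alpha_t @ H                 # context vector a_t = sum_i alpha_ti * h_i

output = np.concatenate([a_t, s_t])    # [a_t; s_t]; a linear layer + softmax
print(alpha_t.round(3), output.shape)  # would follow to predict the next word
```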
Benefits of Attention
- Substantially improved NMT performance: the attention-based NMT system of Bahdanau et al. (2014) achieved better or comparable results with far less compute than Google's contemporaneous pure-LSTM models.
"Attention significantly improved NMT performance. Essentially, every neural machine translation system since has used attention."
- More consistent with human intuition: it mimics how human translators look back at the source text.
- Removes the information bottleneck: the whole source sentence no longer has to be compressed into a single vector.
- Mitigates vanishing gradients: the direct connections to all encoder hidden states act as shortcuts.
- Provides interpretability: visualizing the attention weights shows which parts of the source sentence the model attends to when translating each word.
"it gives you some interpretability... you can basically see what it's translating at different time steps."
- Key course content: the mechanism is central to later assignments (e.g., Assignment 3).
Variants of the Attention Score
Besides basic dot-product attention, there are other ways to compute score(s_t, h_i):
1. Dot-product attention: s_t^T h_i
* Simple, but assumes the dimensions of s_t and h_i line up semantically so they can be matched directly.
* An LSTM hidden state carries many kinds of information (current output, predictions about the future, etc.), so a raw dot product may be suboptimal.
2. Multiplicative / bilinear attention (Luong et al., 2015): s_t^T W h_i
* Introduces a learnable weight matrix W, letting the model learn interactions between different parts of s_t and h_i.
* The parameter count is dim(s_t) * dim(h_i), which is large for high-dimensional states.
3. Scaled dot-product attention:
* Project the two vectors involved (e.g., the decoder state and the encoder state) into a lower-dimensional space via linear transformations, then take the dot product in that space.
* This is closely related to bilinear attention with a low-rank matrix, and is the core attention mechanism used in the Transformer.
4. Additive attention (Bahdanau et al., 2014): v^T tanh(W_1 s_t + W_2 h_i)
* Computes the score with a small feed-forward network; W_1, W_2, and v are all learnable.
* One of the earliest forms of attention.
* Computationally more complex and slower than dot-product or multiplicative attention.
* The lecturer finds the name "additive attention" odd; it is really a small neural network.
* Choice in practice: although there was debate over whether additive or bilinear attention is better, scaled dot-product attention has won out because of its efficiency and is used throughout models such as the Transformer.
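For comparison, here is a small sketch of the four score functions listed above, each returning a scalar e_ti = score(s_t, h_i); the dimensions and random parameters are illustrative only.

```python
# Sketch of the attention score variants discussed above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4                        # hidden size and projection / low-rank dim
s = rng.normal(size=(d,))          # decoder state s_t
h = rng.normal(size=(d,))          # one encoder state h_i

# 1. Dot-product attention: s_t^T h_i
dot_score = s @ h

# 2. Multiplicative / bilinear attention (Luong et al.): s_t^T W h_i
W = rng.normal(size=(d, d))
bilinear_score = s @ W @ h

# 3. Low-rank / scaled dot-product style: project both states to k dims, then dot.
U, V = rng.normal(size=(k, d)), rng.normal(size=(k, d))
low_rank_score = (U @ s) @ (V @ h)   # equivalent to s_t^T (U^T V) h_i

# 4. Additive attention (Bahdanau et al.): v^T tanh(W1 s_t + W2 h_i)
W1, W2, v = rng.normal(size=(k, d)), rng.normal(size=(k, d)), rng.normal(size=(k,))
additive_score = v @ np.tanh(W1 @ s + W2 @ h)

print(dot_score, bilinear_score, low_rank_score, additive_score)
```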
Generality of Attention
- Attention was invented for NMT, but it is a general technique.
- It applies anywhere you need to selectively extract information from a collection of vector values given a query vector.
"anywhere where you have a vector of values, a vector query, and you can use attention to then sort of get a weighted average of the values which finds relevant information that you can use to improve your performance."
- Later lectures cover self-attention in the Transformer.
On Positional Information
- When attention is used in an RNN-based encoder-decoder, explicit positional encodings are usually unnecessary.
- The RNN processes the input in order, so its hidden states already encode position and context.
- Positional information only becomes critical in the Transformer, which has no inherent notion of sequence order.
Final Projects
Basics
- Weight: 49% of the course grade.
- Collaboration and the Honor Code:
- You may use code written by others, but you must clearly cite the source (give references or URLs).
- Grading focuses on your own "value add." Downloading and running a good model does not count as a high-value contribution; you need deeper analysis, an understanding of why it works, or an investigation of where it fails.
- A new ethical-considerations section in the project proposal asks you to think about the potential societal impact of your work and possible mitigations.
- Team size: 1 to 3 people.
- Teams are encouraged; they make more interesting projects possible.
- Expectation: larger teams should get more done. Team size is taken into account in grading.
- The very best projects are sometimes excellent solo efforts, but a multi-person team that does not deliver work commensurate with its size (model comparisons, dataset extensions, ablations, etc.) will be graded accordingly.
- Project sharing: a project may be shared with another course or with RA work, but you must disclose this, and more work is expected.
- Getting advice: students are encouraged to talk with the instructor and the TAs; each TA's areas of expertise are listed on the office hours page.
Choosing a Project Type
Students have two options:
1. Default final project:
* Content: build and experiment with a minimal BERT (Bidirectional Encoder Representations from Transformers) implementation.
* Complete parts of the BERT implementation.
* Fine-tune it for sentiment analysis and obtain results on the data.
* Open-ended extensions: students are encouraged to build on it, e.g., with paraphrasing, contrastive learning, or low-rank adaptation (LoRA).
* Who it suits: students without research experience, those unsure of a research direction, or those who prefer clear guidance and a concrete target (a leaderboard is provided).
* The lecturer estimates roughly half of students will do the default project, including some who initially plan a custom project and change their minds.
2. Custom final project:
* Who it suits: students passionate about a particular topic or research idea, already working on something related, wanting to try something new independently, or who enjoy the full research experience of defining a goal, finding data, and choosing tools from scratch.
* Topic constraint: the project must substantially involve human language and neural networks.
* Other modalities may be combined (e.g., vision-language models, music and language), but human language must be a central part.
* Non-neural methods may be compared against neural ones, but neural networks must be part of the investigation.
Grading and Compute Resources
- Grading: both project types can earn top grades and best-project awards.
- Compute:
- Cloud sponsorship is less generous this year than in the past, partly because the success of large language models has caused a GPU shortage.
- Google Cloud Platform (GCP): $50 credit per person, usable for Assignments 3 and 4 and the final project.
- Free starter credits: new users can usually get them on the major cloud platforms.
- Cloud notebooks:
- Google Colab: limited free GPUs. Paying for Colab Pro (about $10/month) is recommended for better GPU access.
- AWS SageMaker Studio Lab.
- Kaggle Notebooks: usually better GPU access than free Colab.
- Low-cost GPU providers:
- Modal: a certain amount of free GPU usage per month.
- Vaai [sic; as transcribed].
- LLM API access:
- Together AI: $50 of API credit per person for large language models. How far the credit goes depends on model size (a 7B model can process a lot of tokens; larger models burn through it quickly).
- Detailed documentation on the GPU options is posted on Ed.
Project Proposal
- One per team.
- Main components:
- Review of a key research paper:
- Choose one key research paper relevant to your project and write a 2-page review.
- Goal: think critically about the paper's contributions, novelty, and limitations, what it does not consider, whether the methodology is convincing, how it relates to other work, and whether it suggests new research ideas.
"The goal is for you to be thinking critically about this research paper."
- Avoid simply restating the abstract and introduction.
- Project plan:
- Describe the project goal, the intended methods, models, and datasets, and how the work will be evaluated.
- Crucial: be explicit about the data and the evaluation metrics you will use.
- Ethical considerations:
- A new requirement: a paragraph on the ethical challenges the project might face if deployed in the real world, and possible mitigations.
- Total length: at most 4 pages.
- Grading emphasis: most of the credit is for the paper review; the project plan mainly receives formative feedback.
- Baseline: you must set up a suitable baseline to compare against, to show that your proposed method actually helps. For example, a simple baseline for text similarity is the dot product of averaged word vectors; see the sketch below.
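As an illustration of what such a simple baseline might look like, here is a hedged sketch: average word vectors for each text and compare them with cosine similarity. The random vectors stand in for real pre-trained embeddings such as GloVe.

```python
# Sketch of a trivial text-similarity baseline: average word vectors, then
# compare with cosine similarity. Random vectors stand in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "dog", "ran", "in", "the", "park", "cat", "slept"]
emb = {w: rng.normal(size=50) for w in vocab}   # placeholder for GloVe/word2vec

def sentence_vector(text):
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = sentence_vector("a dog ran in the park")
s2 = sentence_vector("the cat slept in the park")
print(f"baseline similarity: {cosine(s1, s2):.3f}")
```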
Project Milestone
- A mid-project progress check.
- Core requirement: have the basic setup done and be able to run something (e.g., a baseline model), showing that your data and framework are in place and can produce preliminary results.
Final Report
- Evaluation basis: primarily the project report (similar to a conference paper), unless there are serious problems with the code.
- Length: at most 8 pages.
- Typical structure: Abstract, Introduction, Related Work, Model, Data, Experiments and Results, Analysis, Conclusion.
Finding Research Topics and Data (Custom Projects)
- Topic sources:
- Personal interest: start from a problem you have encountered or a technique you care about.
- Project types:
- Task-driven: improve on or propose a solution for a specific task (e.g., extracting information from State Department documents).
- Method-driven: propose a new neural architecture or technique and validate that it works.
- LLM-based projects: use LLMs for in-context learning, build more complex LLM programs, etc.
- Analysis / interpretability projects: probe model internals (e.g., mechanistic interpretability) or model behavior on specific linguistic phenomena (e.g., metaphor understanding).
- Theoretical projects (rarer): e.g., analyzing the stability of word-vector estimation algorithms.
- Literature and resources:
- Past CS224N project reports.
- The ACL Anthology (the archive of NLP papers).
- Papers at the major ML conferences (NeurIPS, ICML, ICLR).
- The arXiv preprint server.
- Originality: students are encouraged to find a distinctive problem from their own field or from interesting text data they have noticed.
- Leaderboards: a source of ideas, but they tend to lead to incremental work. Finding an angle slightly different from the mainstream often produces more original research.
Current NLP Research Trends and Project Advice
- A shift in research paradigm:
- In the early days of the deep learning revival (roughly the early 2010s), much of the work was on better model architectures, and student projects could often build systems from scratch that were close to the state of the art.
- In the last five years, with the rise of large pre-trained models, building something from scratch that beats them has become extremely difficult.
- Current mainstream: most research, including by professional researchers, builds on existing large pre-trained models.
- Implications for projects:
- For most practical projects, the recommendation is to load a pre-trained model with the Hugging Face Transformers library and build on top of it.
- Building your own architecture from scratch only makes sense for small-scale, exploratory architecture studies (e.g., testing a new nonlinearity).
- Things to keep in mind:
- Scale and feasibility: account for compute limits; avoid trying to reproduce work that required enormous compute (e.g., "32 A100s for a month").
- Lightweight methods: look at parameter-efficient fine-tuning (PEFT) and similar approaches.
- Using closed-source LLM APIs (e.g., GPT-4, Gemini, Claude):
- Experiments via API are allowed, but you cannot train models of this scale yourself.
- Even open-weight large models (e.g., Llama 70B) are hard to load on typical student GPUs (Llama 7B may be feasible).
- Project ideas: in-context learning, prompt engineering, building LLM applications, analyzing model behavior (e.g., bias, analogical reasoning).
- "Value add" again: a project should not just be "I ran the data through GPT-4 and got good results"; it must clearly show the student's own contribution in research design, analysis, or methodological innovation.