Stanford CS224N NLP with Deep Learning | Spring 2024 | Lecture 2 - Word Vectors and Language Models
The lecture opens with course logistics: the due date for the first assignment, the time and location of the Python review session, how to join TA office hours and help sessions, and the appointment system for the instructor's office hours, with a reminder to use appointment slots considerately.
The core content reviews optimization basics, in particular gradient descent: compute the gradient of the loss function and update the parameters in the opposite direction of the gradient with a small learning rate (alpha) to progressively minimize the loss. Because plain gradient descent is inefficient on large datasets, stochastic gradient descent (SGD) is widely used in practice. SGD computes gradients on small mini-batches of data, which is not only much faster but also introduces noise that can actually help neural network optimization.
The lecture focuses on the Word2Vec model. It works as follows: word vectors are initialized with small random numbers (not all zeros, to break symmetry); the algorithm then iterates through the corpus, predicting the probability of context words given a center word, and updates the word vectors based on the prediction error and its gradient so that they predict surrounding words better. Despite its simplicity, Word2Vec effectively learns word semantics and relationships between words. The model's only parameters are the center-word and context-word vectors; probabilities come from their dot products, making it essentially a bag-of-words model that ignores word order.
The instructor demonstrates, via a Jupyter notebook, the Gensim package and GloVe word vectors (a Stanford-built model that behaves much like Word2Vec).
Finally, the lecture previews upcoming topics: classification, neural classification, and neural networks.
Tags
Media details
- Upload date
- 2025-05-15 13:21
- Source
- https://www.youtube.com/watch?v=nBor4jfWetQ
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Okay, I should try and get started. Okay? So what we're going to do today is try to cover everything else you need to know about word vectors and start to learn a teeny bit about neural nets, and then we'll get much further into the math of neural nets next week. So this is the general plan. I'm going to finish up from where I was last time with optimization basics, then look a little more at Word2Vec and word vectors and some of the variants of Word2Vec. Then I'm going to briefly consider alternatives — what can you get from just counting words in different ways? Then we'll talk a little about the evaluation of word vectors, the topic of word senses that already came up a couple of times last time when people were asking questions, and then towards the end start to introduce the idea of classification, doing neural classification, and what neural networks are about, which is something we'll then expand on more in the second week. Now, before I get into that, just notes on course organization. Remember, the first assignment is already out and it's due before class next Tuesday. Our Python review session is going to be taught this Friday, 3:30 to 4:20. It's not going to be taught here; it's going to be taught in Gates B01, in the Gates basement. And I encourage everyone again to come to office hours and help sessions. They've already started, they're listed on the website — we have these office-hour help sessions in classrooms with multiple TAs. So just turn up if you're on campus and you can be helped. If you are on campus, we'd like you to just turn up, though we also have a Zoom option for Stanford Online students. Finally, I have office hours, which I have not yet opened, but I will open sometime tonight. They're going to be on Monday afternoons. Now, obviously, given the number of people, not everyone can make it into my office hours, and I'm going to do these by appointment — 15-minute appointments on Calendly. I'm very happy to talk to some people, but I put a little note at the end saying: don't hog the slots. Some people think it'd be a really good idea to work out how to sign up every week for an office-hour session with me, and that's a little bit antisocial. So think about that. Okay. So at the end of last time, I did a sort of bad job of trying to write out on slides the derivatives of Word2Vec. Hopefully you can read it much more clearly in the version that appears on the website, where I did it at home more carefully. So that was saying that we have this loss function, and our job is to work out its derivatives, which tell us which direction to go to walk downhill. And I didn't really quite close the loop there. So we have some cost function that we want to minimize, and we work out the gradient of that function to figure out which direction is downhill. The simplest algorithm is then that we work out the direction downhill, we walk a little bit in that direction, and then we repeat: we work out the gradient at this point, we walk downhill a little bit, and we keep on going and we'll get to the minimum. And with a one-dimensional function like this, it's very simple.
We're just walking downhill. But when we have a function of many, many dimensions, when we calculate the gradient at different points, we might be starting to walk in different directions, and that's why we need to do calculus and have gradients. So this gives us the basic algorithm of gradient descent. Under the gradient descent algorithm, we've got a loss function J, we work out its gradient, and then we take a little multiple of the gradient. That alpha is our step size or learning rate. Normally alpha is a very small number, something like ten to the minus three or ten to the minus four, maybe even ten to the minus five. So we take a really little bit of the gradient and subtract it from our parameters to get new parameters, and as we do that, we walk downhill. The reason why we want a small learning rate is that we don't want to walk too far. If from here we worked out the gradient, said it's in this direction, and just kept on walking, we might end up way over here; or if we had a really big step size, we might even end up at a worse point than we started at. So we want to take little steps to walk downhill. And that's the very basic gradient descent algorithm. Now, the very basic gradient descent algorithm we never use. What we actually use is the next thing up, which is called stochastic gradient descent. The problem with basic gradient descent is that we'd be working out, for the entire set of data, what the objective function is and what the slope at the point of evaluation is. In general we've got a lot of data on which we're computing models, so simply calculating our objective function over all of the training data would take a very, very long time. It's very expensive to compute, and we'd wait a very long time before making even a single step of gradient update. So for neural nets, what you're always doing is using the variant called stochastic gradient descent. For stochastic gradient descent, we pick a very small subset of our data — maybe 16 or 32 data items — and we pretend that's all of our data. We evaluate the function J based on that small subset and work out the gradient based on that small subset. It's a noisy, inaccurate estimate of the gradient, and we use that as the direction in which we walk. That's normally referred to as using mini-batches, or mini-batch gradient descent. In theory, working out the gradient based on this small subset is an approximation. But one of the interesting things, the way things have emerged in neural network land, is that it turns out neural networks often actually work better when you throw some noise into the system — having this noise gives you jiggle and moves things around. So stochastic gradient descent not only is way, way faster, it actually works better as a system for optimizing neural networks. Okay, so if you remember from last time, for Word2Vec the idea was we started by just saying each word has a random vector representing it. So we literally just get random small numbers and fill up the vectors with those random small numbers. There's an important point there, which is that you do have to initialize your vectors with random small numbers.
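Purely as an editorial illustration of the two points just made — initializing parameters with small random numbers and taking mini-batch SGD steps — here is a toy NumPy sketch. The loss here is a stand-in quadratic, not the actual Word2Vec objective, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's parameters. In Word2Vec, theta would be the
# concatenated center-word and outside-word vectors; here it is just a small
# random matrix, initialized with small random numbers (not zeros) to break symmetry.
theta = (rng.random((5, 3)) - 0.5) / 3

# Hypothetical data and loss, purely for illustration: 0.5 * squared distance
# of each parameter row from the mini-batch mean, standing in for the real loss.
data = rng.normal(size=(1000, 3))

def grad_loss(theta, batch):
    # Gradient of 0.5 * mean ||theta_row - x||^2 over the mini-batch.
    return theta - batch.mean(axis=0)

alpha = 1e-3      # learning rate (step size), typically around 1e-3 to 1e-5
batch_size = 32

for step in range(10_000):
    idx = rng.choice(len(data), size=batch_size, replace=False)  # sample a mini-batch
    theta = theta - alpha * grad_loss(theta, data[idx])           # SGD update

print(theta[0], data.mean(axis=0))  # rows drift toward the data mean
```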
If you just leave all the vectors at zero, then nothing works, because if everything starts off the same, you get these false symmetries, which means you can't learn. So you always want to initialize your vectors with random numbers. And then we go through each position in the corpus and, using our current estimates, we try to predict the probability of the words in the context, as we talked about last time. That gives us an objective function, from which we can look at our errors, look at our gradient, and update the vectors so that they learn to predict surrounding words better. And the incredible thing is that we can do no more than that, and we end up learning word vectors which actually capture quite a lot of the semantics — the meaning and relationships between different words. When this was first discovered for these algorithms, it really felt like magic that you can just do this simple math over a lot of text and actually learn about the meanings of words; it's just surprising that something so simple could work. But as time has gone on, this same recipe has been used for all kinds of learning about the behavior of language with neural networks. So let's just go through a sense of how that works. But before we do that, let me just mention: for our Word2Vec algorithms, the only parameters of the model are these word vectors. There are the outside word vectors and the center word vectors, which we actually treat as disjoint, as I mentioned last time. When we do the computations, we consider the dot product between the various possible outside words and our center word, and we use those to get a probability distribution over how likely the model thinks different outside words are. Then we compare that to the actual outside word in the context, and that gives us our source of error. As such, this is what's referred to in NLP as a bag-of-words model: it doesn't actually know about the structure of sentences, or even what's to the left and what's to the right. It's predicting exactly the same probabilities at each position to the left or right; it only wants to know what kind of words appear in the context of the center word. So I just wanted to stop this for a minute and — let's see, not that one — give you some kind of a sense that this really does work. This is a little Jupyter notebook that I've got for this, okay? And here I'm using a package, Gensim, which we don't really continue to use after this, but it's one package that lets you load and play with word vectors. And the word vectors I'm going to use here are GloVe word vectors. GloVe was actually a model we built at Stanford, and I'm going to talk about it a little bit later. So strictly speaking, these aren't exactly Word2Vec word vectors, but they behave in exactly the same way. So now it's loaded up my word vectors — the word vectors are a big data file. And as we've discussed, the representation of any word — here it's "bread" — is just a vector of real numbers. I'm using 100-dimensional word vectors to keep things quicker for my class demo. So this is the word "bread". And then I can say, well, what's the representation for "croissant"? And this is "croissant". And we can sort of get a visual sense of: oh, they're at least a little bit similar, right?
So the first components are both negative, the second components are both positive, the third components are both negative and large, the fourth components are both positive, right? They seem like they're kind of similar vectors. So that seems kind of hopeful, because it means that it knows that bread and croissant are a bit similar to each other. This package has a nice simple function where, rather than doing that by hand, you can just ask it about all the word vectors and say which ones are most similar. So I can ask it what words in its vocabulary are most similar to "usa" — and in this model everything's been lowercased, I should mention. If I do that, it has canada, america, u.s.a., then united states, australia. Well, that seems a fairly reasonable list of most similar words, though you might think it's a little strange that canada wins out over the u.s.a. with the dots in it. Similarly, I can ask what's most similar to "banana", and I get coconut, mango, bananas, potato, pineapple, fruit, etc. Again, pretty sensible — a little bit of a bias toward more tropical fruits. Or I can go to "croissant" and ask what's most similar to croissant. The most similar things to croissant aren't "bread", but things like brioche, baguette, focaccia, which basically makes sense. And — oh, sorry, yeah, I remember what this is, right? So with this most_similar, you've got a positive word vector and you're asking what other words are most similar in position to that. There's something else you can do, which is to say: let me take the negative of that word vector and ask what's most similar to the negative of it. And you could possibly think, oh, that might be useful to find antonyms or something like that. The truth is, it isn't. If you ask for the things that are most similar to the negative of the banana vector — and for most other vectors it's the same — you get out these weirdo things that you're not really sure are words at all, or maybe they're in some other language, or some of them are names, like one that looks like a Japanese name — not very useful stuff. They don't really feel like a negative of banana. But it turns out that from there we get this powerful ability that was observed for Word2Vec, which is that we can isolate semantic components and then put them together in interesting ways. So looking at this picture, what we can do is start with a positive vector for king, from the origin to king. Then we can use the negation to subtract out the vector for man, and then we can have another positive vector: add on the vector for woman. And then we can ask the model: if you're over here in the space, what is the nearest word to you over there? And that's what this next thing does. It says: positive vector for king, negative for man, also positive for woman — where does that get you to? And that gets you to queen. So this was the most celebrated property that was discovered with these word vectors: they weren't only good for meaning similarity, they were good for doing these kinds of meaning components. These got referred to as analogies, because you can think of them as a is to b as c is to what. So it's sort of like — man is to king, as I'm saying this the wrong way around — man is to king as woman is to what, in the analogies.
And so here I've defined a little function that just automates that and will compute analogies. So now I can ask it in this analogy format: man is to king as woman is to — queen. That one was the canonical example. But you can actually have fun with this. And, I mean, this is pretty old-fashioned stuff — I feel like I'm maybe, at this point, an old guy talking about how much fun we used to have sitting around the radio listening to radio plays, because basically no one uses this stuff anymore, and there are much, much better and fancier things like ChatGPT. But back in the day, it was really stunning just how this very simple model built on very simple data could have quite good semantic understanding and do quite good analogies. So you can actually play with this quite a bit and have a bit of fun. You can do something like analogy('australia', 'beer', 'france') — what do people think the answer will be? Close — the answer it gives is champagne, and that seems a pretty good answer. I could then put in russia — what do people think? Vodka, yeah — let's see, you get back vodka. This actually works kind of interestingly. I can test something different; I can do something like pencil is to sketching as camera is to — photographing. That works quite well. Now, we built this model in 2014, so it's a little bit out of date in politics — we can't do the last decade of politics, which is maybe fortunate, but we can try out older politics questions. So we could try obama is to clinton as reagan is to — if you remember your US history class, any guesses what it's going to say? Bush One? Any other ideas? Some people have different opinions of Bill Clinton. What it answers is nixon, which I think is actually kind of fair. You can also get it to do some syntactic facts about language, so you can do something like tall is to tallest as long is to — this one's easy, yeah, longest. So with this simple method of learning, with this simple bag-of-words model, it's enough to learn a lot about the semantics of words, and stuff that's beyond conventional semantics, right? Like our example with australia is to beer as russia is to vodka — that's sort of cultural world knowledge, which goes a little bit beyond what people normally think of as word-meaning semantics. But also in there — does the distance from, let's say, man to king also capture a concept of relationship between words? Like, would that give you back "ruler" or something like that, where taking the difference between two vectors captures some concept — man compared to king should be a "ruler" concept? — But isn't that what I'm using? I'm taking the difference between man and king, which is what I'm adding on to woman to get to queen, right? Yeah. So depending on which thing you think of as the analogy, you've got both a difference vector between words that gives you a gender analogy and one that gives you a ruler analogy. Yeah, absolutely. Any other questions? Yeah.
speaker 2: In the Word2Vec algorithm we get two vectors for each word, a U and a V, but here you only have one vector. So how do you go from two to one? speaker 1: Yeah, good question. The commonest thing in practice is you just average the two of them, and really you find that they end up very close. Because if you think about it, as you go along, every position of the text will come up both ways: if the text is, you know, "the octopus has legs", you're going to have octopus in the center with legs in the context, and then a couple of time steps later it's going to be legs in the center with octopus in the context. So although they vary a bit, for the reasons neural nets vary, they basically end up very similar, and people normally just average them. — Yeah, can you nest this process? So use the answer of one analogy and place it into the analogy function of another, and see how far you can go before it starts to break down? — I think you can. So wait, you're wanting to — given a relation between two words, how far can you go before it starts breaking down? Are you wanting to make two steps from somewhere, or...? It doesn't always work; there are plenty of examples that fail. I'm a bit shy to try that now because I don't have a predefined function that does it and it might take me too long, but you could play with it at home and see how it works for you. — I'm curious, as a bigger-picture thing, why is it that we use two separate sets of vectors for Word2Vec? Is it just to get more parameters, or is there a...? — I'll get back to that. Maybe I should go on at this point. Let me move on and just get through some more details of the Word2Vec algorithm. So, a technical point on this class, so you don't make any big mistakes and waste your weekend: for most offerings of CS224N, we've actually had people implement Word2Vec from scratch as Assignment 2. But for this quarter — we're doing it in spring quarter, and as you probably know, spring quarter is actually a little shorter than the other two quarters — we decided to skip having people implement Word2Vec. So don't look at the old Assignment 2 that says implement Word2Vec, or else you'll be misspending your time. Wait for the new Assignment 2 to come out. Despite that, let me say a little bit more about some of the details. So, why two vectors? The two vectors just make the math a little bit easier. If you think about the math, if you have the same vectors for the center word and for the outside words, then whatever the center word is — let's say it's octopus — when you're going through trying every possible context word for the normalization, at some point you'll hit octopus again. And at that point you'll have a quadratic term: the x-squared of the octopus vector. And that kind of messes things up — you're clever people, you could work out the math of it, but it makes the math more of a mess, because every other term is something different, it's just like a·x, and then at one position you've got an x-squared. So it just makes the math messier, and they kept it really simple by having them be disjoint vectors. But it doesn't make it better — actually, it turns out it works a fraction better if you do it right. But in practice, people have usually just estimated them separately and then averaged them at the end.
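As a hedged recap of the kind of Gensim/GloVe queries used in the notebook demo above: the downloader dataset name `glove-wiki-gigaword-100` is an assumption, and the class demo may have loaded a different GloVe file, but the query functions shown are standard Gensim `KeyedVectors` calls.

```python
import gensim.downloader as api

# Load 100-dimensional GloVe vectors (assumed dataset name).
wv = api.load("glove-wiki-gigaword-100")

print(wv["bread"][:8])                   # a word is just a vector of real numbers
print(wv.similarity("bread", "croissant"))

# Nearest neighbours by cosine similarity.
print(wv.most_similar("usa", topn=5))
print(wv.most_similar("banana", topn=5))

# The celebrated analogy: king - man + woman ≈ queen.
def analogy(a, b, c):
    """a is to b as c is to ?  (e.g. man : king :: woman : ?)"""
    return wv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]

print(analogy("man", "king", "woman"))         # 'queen'
print(analogy("australia", "beer", "france"))  # 'champagne' in the lecture demo
```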
If you actually look at the paper — the Mikolov et al. 2013 paper, you can find it — there's actually a family of methods that they describe. They describe two models. One of them has an inside word predicting the words around it, and the other one tries to predict the center word from all the words in the context, which was called continuous bag of words in their paper. The one that I've described is skip-gram, which is simpler and works just great. But then the other part of it is working out what loss function to use for training. What I've presented so far is the naive softmax, where we just consider every possible choice of a context word and run all the math. That's totally doable, and with our modern super-fast computers it's not even that unreasonable — we do things like this all the time. But at least at the time they wrote their paper, this seemed kind of expensive, and they considered other alternatives like a hierarchical softmax, which I'm not going to explain right now, but I do just want to explain negative sampling. Okay. So this is just to see a bit of a different way of doing things. For what we did last time, we had this straightforward softmax equation, and in the denominator you're summing over every word in the vocabulary. You might have 400,000 words in your vocabulary — there are a lot of words in human languages — so that's kind of a big sum, especially when, for each element of the sum, you're taking a dot product between 100-dimensional or 300-dimensional vectors and then exponentiating. A lot of math going on in there. So maybe we could short-circuit that. The idea of negative sampling is to say: rather than evaluating it for every single possible word, maybe we could just train some simpler logistic regressions which say, you should like the word that's in the context, and if we randomly pick a few other words, you shouldn't like them very much. That's skip-gram negative sampling. So this is what it looks like as an equation. We've got our center word and our actual context word, and we work out the term for the actual context word — we'd like this to be high probability, and since we're minimizing, we negate it so it goes down. Then we sample some other words, and we'd like those to be the opposite. But the other thing that's changed here is we're not using the softmax anymore; we're using this sigma, which stands for the logistic function, often called the sigmoid. Sigmoid just means s-shaped, and you could actually have an infinity of s-shaped functions; the one we actually use is the logistic function. The logistic function has this form and maps any real number to a probability between zero and one. So what we're saying at that point is: for the real outside word, we're hoping that this dot product is large, so its probability is near one, and that will then help with the minimization; and for the other words, we'd like their probability to be small — we'd like them to appear over here. That's what this is calculating. As written, it's sticking the minus sign on the inside there, which works because the logistic function is symmetric: if you want to be over here, then if you negate the argument, you'll be on this side, which will be large. Okay.
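As an editorial sketch of the negative-sampling objective just described (and of the unigram-to-the-¾-power sampling discussed next), here is toy NumPy code; the vectors and counts are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, K = 10_000, 100, 10          # vocab size, vector dimension, number of negatives
center_vecs  = (rng.random((V, d)) - 0.5) / d   # v vectors (center words)
outside_vecs = (rng.random((V, d)) - 0.5) / d   # u vectors (outside/context words)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling distribution: unigram counts raised to the 3/4 power
# (illustrative counts; in practice these come from the training corpus).
unigram_counts = rng.integers(1, 1000, size=V)
neg_probs = unigram_counts ** 0.75
neg_probs = neg_probs / neg_probs.sum()

def sgns_loss(center, outside):
    """J = -log sigma(u_o . v_c) - sum over K sampled negatives of log sigma(-u_k . v_c)."""
    v_c = center_vecs[center]
    u_o = outside_vecs[outside]
    negs = rng.choice(V, size=K, p=neg_probs)           # K negative samples
    pos_term = -np.log(sigmoid(u_o @ v_c))              # real context word: want prob near 1
    neg_term = -np.log(sigmoid(-outside_vecs[negs] @ v_c)).sum()  # negatives: want prob near 0
    return pos_term + neg_term

print(sgns_loss(center=42, outside=7))
```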
And then the final bit of this, the asterisk: we're going to pick a few words — it might only be five or ten — to be our negative samples. But for picking those words, what works well is not just to pick uniformly at random from all the 400,000 words in our vocab. What you basically want to do is pay attention to how common the words are. Something like "the" is a really common word. So we refer to that as the unigram distribution — you're just taking individual words independently and asking how common they are. So about 10% of the time you'd be choosing "the". That's roughly what you want to do for sampling, but people have found that you can actually do a bit better than that. The standard thing presented for Word2Vec is that you take the unigram probability of the word and raise it to the power three-quarters. What does that end up doing? Question for the audience: if I take probabilities and raise them to the three-quarters power... Some less common words get sampled more — correct, yeah. Raising to the three-quarters means you're somewhat upping the probability of the less frequent words. So you're in between having every word uniform and exactly using their relative frequencies in the text — you're moving a little bit in the direction of uniform. You get better results by going somewhat in the direction of sampling more uniformly, but you don't want to go all the way there, which would correspond to, I guess, putting a zero in there instead of three-quarters. Okay, yeah. Okay, let's see — I have another slide here, but time rushes along, so let's not bother with it; it's not that important. Okay, so that's the Word2Vec algorithm: we've now seen all of it in its different forms. A reasonable wonder you could have at this point is that this seems kind of a weird way of doing what we're wanting to do, right? The idea is, look, we have this text, we have words, and we have words in the context of words. It seems like an obvious thing to do would be to say, well, let's just count some statistics. We have words, and there are other words that occur in their context. So let's just see how often the word swim occurs next to octopus and how often the word fish occurs next to octopus — let's get some counts and see how often words occur in the context of other words, and maybe we could use that to calculate some form of word vectors. And that's something people have also considered. If we use the same kind of idea of a context window, we can just make a matrix of how often words occur in the context of other words. So here's a baby example. My corpus is: "I like deep learning. I like NLP. I enjoy flying." And the context window I'm using is just one word to the left and the right. Then I can make this kind of co-occurrence count matrix where I put in the counts of different words in every context. Because my corpus is so small, everything in the matrix is a zero or one, except right here where I've got the twos, because I have "I like" twice. But in principle, I've got a matrix of counts for all the different words here. So maybe this gives me a word vector, right? Here's a word vector for "deep": it's this long vector here. And I could just say that is my word vector.
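An editorial sketch of that toy co-occurrence matrix, plus the SVD-based dimensionality reduction that comes up next; the window size and the choice of k = 2 follow the lecture's example, and the tokenization is illustrative.

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1  # one word of context on each side

# Build the vocabulary and the word-word co-occurrence count matrix.
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X[idx["deep"]])   # the raw-count "word vector" for 'deep'

# Reduce dimensionality with a truncated SVD: keep only the k largest
# singular values, giving k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]      # low-dimensional representations
print(word_vectors[idx["deep"]])
```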
And indeed, sometimes people have done that, but they're kind of ungainly word vectors, because if we have 400,000 words in our vocabulary, the size of this matrix is 400,000 by 400,000, which is a lot worse than our Word2Vec word vectors: if we make those only 100-dimensional, we've only got 400,000 by 100, which is still a big number but a lot smaller than 400,000 by 400,000. So that's inconvenient. So when people have started with these kinds of co-occurrence matrices, the general thing they've done is to say, well, somehow we want to reduce the dimensionality of that matrix so that we have a smaller matrix to deal with. And how can we reduce the dimensionality of the matrix? At this point, if you remember your linear algebra, you should be thinking of things like PCA, and in particular, if you want something that works for any matrix of any shape, there's the singular value decomposition. So there's the classic singular value decomposition: for any matrix, you can rewrite it as a product of three matrices, a U and a V, which are both orthonormal — which means you get independent vectors that are orthogonal to each other — and then in the middle we have the singular values, which are ordered by size, with the most important one first; these are weighting terms on the different dimensions. So that's the full SVD decomposition. But part of it is irrelevant, because in this picture, nothing is happening on the part that's shown in yellow there. At the moment this is just a full decomposition, but if you want smaller, low-dimensional vectors, the next trick we pull is to say, well, we know where the smaller singular values are, so we can just set them to zero. And if we do that, then more of this goes away and we end up with, say, two-dimensional representations of our words. So that gives us another way of forming low-dimensional word representations. This had actually been explored before modern neural word vectors, with algorithms such as latent semantic analysis, and it sort of half worked, but it never worked very well. But some people, especially in psychology, kept on working on it, and among other people in the early two-thousands there was this grad student, Doug Rohde, who kept on working on it. He came up with an algorithm that he called COALS. He knew, as other people before him had known, that just doing an SVD on raw counts didn't seem to give you word vectors that worked very well, but he had some ideas to do better than that. One thing that helps a lot is to log the frequencies, so you put log frequencies in the cells; but he also used some other ideas, some of which were also picked up in Word2Vec — one of which is ramping the windows so that you count closer words more than further-away words, using Pearson correlations instead of counts, etc. He ended up coming up with a low-dimensional version of word vectors that's still ultimately based on an SVD, and he got out these word vectors. And interestingly, no one really noticed at the time that Doug Rohde, in his dissertation, effectively discovered this same property of having linear semantic components. So look, here we go — here's one. This is actually a picture from his dissertation.
And look here: we've got this meaning component, which is "doer of an event". He's essentially shown, with the way he's processed his word vectors, that the doer of an event is a linear meaning component that you can use to move between a verb and the doer of the verb. Kind of cool. But he didn't become famous, because no one was paying attention to what he had come up with. So once Word2Vec became popular, that was something that I was kind of interested in. And so, working together with a postdoc, Jeffrey Pennington, we thought that there was interest in this space of doing things with matrices of counts — and how do you then get them to work well as word vectors, in the same way that Word2Vec works well as word vectors? That's what led to the GloVe algorithm, which is what I was actually showing you. What we wanted was a model in which linear components — adding or subtracting a vector in the vector space — correspond to a meaning difference. How can we do that? Jeffrey did good thinking and math and thought about that for a bit, and his insight was that ratios of co-occurrence probabilities can encode meaning components. So if we can make a ratio of co-occurrence probabilities into something linear in the vector space, we'll get the kind of result that Word2Vec or Doug Rohde got. What does that mean? Well, if you start thinking about words occurring in the context of "ice", you might think that solid and water are likely to occur near ice, and that gas, or a random word like "random", isn't likely to occur near ice. Similarly for "steam", you'd expect that gas and water are likely to occur near steam, but probably not solid or random. If you just look at one of these, you don't really get meaning components, because you get something that's large here or large there. But if you look at the ratio of two of these co-occurrence probabilities, then what you get is that for solid it's going to be large and for gas it's going to be small, and so you're getting a direction in the space which corresponds to the solid-liquid-gas dimension of physics, whereas for the other words it'll be about one. That's just waving your hands — that was the conception of the idea. If you actually do the counts, it works out: using real data, this is what you get for co-occurrence, and indeed you get these factors of about ten in both directions for those two, and the numbers over there are approximately one. So Jeffrey's idea was: we start with a co-occurrence count matrix, and we want to turn this into a linear component. How do you do that? Well, first of all, it makes sense immediately that you should be putting a log in, because once you put a log in, this ratio gets turned into a subtraction. And so, simply, all you have to do is have a log-bilinear model where the dot product of two word vectors models this log co-occurrence probability, and then the difference between two vectors will correspond to the log of the ratio of their co-occurrence probabilities. That was basically the GloVe model. So you want to model this dot product so that it's close to the log of the co-occurrence probability, but you do a little bit of extra work with some bias terms and a frequency weighting.
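For reference, the bias terms and frequency weighting being alluded to appear in the objective from the 2014 GloVe paper, which has roughly this form:

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2
$$

Here $X_{ij}$ is the co-occurrence count, $w_i$ and $\tilde{w}_j$ are the two sets of word vectors, $b_i$ and $\tilde{b}_j$ are the bias terms, and $f$ is a weighting function that caps the influence of very frequent pairs.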
Those details aren't very important, so I'm going to skip past them, but I think that basic intuition — what's the important thing to get linear meaning components — is a good one to know about. Okay. Is everyone good to there? Cool. Yes? — Oh, I noticed the original text matrix you showed was like three by five or something. Shouldn't it be square? — Yeah, sorry, I maybe should have just shown you a square one. If you're just doing vocabulary against vocabulary, yes, it should be square. There was a bit in the slides I didn't mention: there's another way you could do it where you do words versus documents, and then it'd be non-square. But yeah, you're right, we can just consider the square case. So, hey, I showed you that demo of the GloVe vectors and they work great, didn't they? These are good vectors. But in general in NLP we'd like to have things we can evaluate and know whether they're really good. Everywhere through the course, we're going to want to evaluate things and work out how good they are, and what's better and what's worse. One of the fundamental notions of evaluation, which will come up again and again, is intrinsic versus extrinsic evaluation. An intrinsic evaluation is where you take a very specific internal subtask and just try to score whether it's good or bad. Normally intrinsic evaluations are faster to compute and help you understand the component you're building, but they're sort of distant from your downstream task, and improving the numbers internally may or may not help you. That contrasts with an extrinsic evaluation, which is where you've got some real task you want to do — question answering or document summarization or machine translation — and you want to know whether some clever bit of internal modeling will help you on that task. Then you have to run an entire system and work out downstream accuracies to find out whether it actually helps you at the end of the day. But that often means it's kind of indirect, so it's harder to see exactly what's happening in your component. So for something like word vectors, if we just measure whether they model word similarity well, that's an intrinsic evaluation. But we'd probably like to know whether they model word similarity well for some downstream task, which might be doing web search — we'd like it that when you say "cell phone" or "mobile phone", it comes out about the same. So web search might be our extrinsic evaluation. Okay. So for word vectors, two intrinsic evaluations are the ones we've already seen. There are the word vector analogies. I cheated when I showed you the GloVe demo — I only showed you ones that work. But if you play with it yourself, you can find some that don't work. So what we can do is take a set of word analogies and find out which ones work. Now, in general, GloVe does work: here's a set of word vectors showing the male-female distinction, which is kind of good and linear. But for different analogies it's going to work or not work, and you're going to be able to score what percentage of the time it works. Or we can do word similarity. The way we do word similarity is we actually use human judgments of similarity. So psychologists ask undergrads and say: here are the words "plane" and "car" — how similar are they on a scale of one to ten, or zero to ten, maybe?
Actually, I think it's zero to ten here. So on a scale of zero to ten, the person says, mm, seven, and then they ask another person, and they average what the undergrads say, and they come out with these numbers. So tiger-tiger gets 10, book and paper got an average of 7.46, plane and car got 5.77, stock and phone got 1.62, and stock and jaguar got 0.92. It's a noisy process, but you roughly get to see how similar people think words are. And then we ask our models to also score how similar they think words are, and we measure how well the scores are correlated between the human judgments and our model's judgments. Here's a big table of numbers we don't need to go through all of, but it shows that a plain SVD works terribly; simply doing SVD over log counts already starts to work reasonably; and then here are the two Word2Vec algorithms, CBOW and skip-gram, and here are numbers from our GloVe vectors. So you get these kinds of scores with which you can rank different models by how good they are. And then you can also — oh, sorry, yeah, that's the only thing I have there. But what can you do for downstream evaluation? Well, you want to pick some downstream task, and a simple downstream task that's been used a lot in NLP is what's called named entity recognition. That's recognizing names of things and what type they are. So if the sentence says "Chris Manning lives in Palo Alto", you want to say Chris and Manning, that's the name of a person, and Palo and Alto, that's the name of a place. So that can be the task, and it's the kind of task which you might think word vectors would help with, and that's indeed the case. What's labeled "discrete" was a baseline symbolic probabilistic named entity recognition system, and by putting word vectors into it, you can make the numbers go up. These numbers for GloVe are higher than the ones on the first line, so I'm getting substantial improvements from adding word vectors to my system. Yay. Okay, I'll pile ahead into the next thing. This next one I think is interesting and worth spending a minute on, and it came up in your questions last time. Words have lots of meanings — most words have a whole bunch of meanings. Which words don't have a lot of different meanings? Only some very specialized scientific words. The example of a word with multiple meanings I'll use is probably not the first one you'd think of. The most famous example of a word with a lot of meanings is "bank", which already came up last time, and I used "star", which is another one. Here's a word that you probably don't use that often, but it still has lots of meanings: the word "pike". What are some things that the word pike can mean? A fish — yes, it's a kind of fish, okay, we've got one. What else can a pike be? Yeah — for the Dungeons and Dragons crowd, a long spear, right? Yep, that's another one. Yeah, a road, right? So pike is used as a shorthand for a turnpike. Why is it called a turnpike? Because originally you had the spear-looking thing at the start of it to sort of stop people. Okay, we've got three. Other things pike can mean? Yes, it's also a frat, like a fraternity — I'll believe you, I can't say no. Other pikes? It can mean sharp, like a needle is sharp, maybe — I mean, I think that's really the pike as the weapon. Any others? There's one that I think a lot of you will have seen in diving and swimming.
You can do a pike in the Olympics — if you've seen Olympic diving, they have pikes. Anyone seen those? Trust me, that's a pike. Okay. So we've been doing the noun uses, but you can also use pike as a verb, right? Once you've got your medieval weapon, you can pike somebody — that's a usage of pike — and there are other ones. So here we go, here are ones I got from a dictionary; we got most of those. There are weirder usages, like "coming down the pike" — that's kind of a metaphorical use that comes from the road sense, but it ends up meaning the future. In Australia, we also use pike to mean sort of chickening out of doing something, but I don't think that usage is really used in the US. Anyway, words have lots of meanings, so how can you deal with that? Well, one way you could deal with it is to say: okay, words have several meanings, so we're just going to say words have several senses. Then we take instances of words in text, cluster them based on the similarity of their occurrences to decide which sense of the word to regard each token as, and then learn word vectors for those token clusters, which are our senses. And you can do that — we did it in 2012, before Word2Vec came out. So you see here we have bank one, and somewhere over here we have bank two, and here we have jaguar one, jaguar two, jaguar three, jaguar four. And this really works out great, right? Jaguar one picks out the sense of the kind of car — it's close to luxury and convertible. Jaguar two comes right close to software and Microsoft — this one's a bit of a historical one, but when most of you were five or whatever, you might remember Apple used to use large cats for versions of Mac OS, so Mac OS X 10.3 or something like that, a long time ago, was called Jaguar. So it's software, close to Microsoft. Jaguar three, okay: string, keyboard, solo, musical, drum, bass — that's because there's a Jaguar keyboard. And then finally, the sense we think of as the basic one, which turns out to turn up rather less in text corpora: jaguar next to hunter, as the animal. So it's done a good job at learning the different senses, but that's not what's actually usually done these days. Instead, what's usually done is you only have one vector for jaguar, and when you do that — or for pike here — the one vector you learn is a weighted average of the vectors you would have learned for the senses. It's often referred to as a superposition, because somehow neural-net math people like to use physics terms, so they call it a superposition, but it's a weighted average: you take the relative frequencies of the different senses and weight the vectors you would have learned if you'd had sense vectors, and that's what you get as the representation of the word as a whole. And I can make a sort of linguistic argument as to why you might want to do that, which is that although this model of words having senses is very long-standing and common — I mean, it's essentially the way dictionaries are built, right? You look up a word in the dictionary and it says sense one, sense two, sense three, and you get them for things like bank or jaguar that we're talking about — it's sort of really a broken model, right? Word meanings have a lot of nuance; they're used in a lot of different contexts.
There are extreme examples, like bank, where we have the finance bank and the bank of a river way over here, where it seems like the senses are that far apart. But most words have somewhat different meanings that aren't actually that far apart, and trying to cut them into senses seems very artificial. If you look at five different dictionaries and ask how many senses a word has, pretty much every one will give you a different answer. The kind of situation you have is a word like "field". Well, a field can be a place where you grow a crop. It can be used for natural things like a rock field or an ice field. It can be a sporting field. There's the mathematical sense of field. Now, all of these things have something to do with each other — the math one's further away, but the physical ones are all sort of flat spaces. Yet the sense of it being a sporting field is kind of different from the sense of it being an ice field. Are the ice field and the rock field different senses, or am I just modifying the same sense? So really, you sort of have what a math person would call a probability density distribution over things that can be meant by the word. So it maybe makes more sense to use this model where you just say we have a vector that's an average over all the contexts, and we'll see more of that when we get to contextual word vectors later on. But one more surprising result on this: since the vector for pike overall is a sum of these different sense vectors, standard math would tell you that if you just have the single vector, there's no way you can recover the individual sense vectors. But higher math tells you that these vector spaces are so high-dimensional and sparse that you can use ideas from sparse coding theory to reconstruct the sense vectors out of the whole vector. If you actually want to understand this, some of the people in statistics — David Donoho, I think, is one of them — teach courses on sparse coding theory, but I'm not going to try to teach that. Here's an example from this paper, the Sanjeev Arora et al. paper, where one of the et al.s is Tengyu Ma, who's now faculty in computer science here, where they start from a single word vector and use sparse coding to divide out sense vectors from that one vector. And they work pretty well, right? So here's one sense of "tie", which is a piece of clothing; another sense of tie, which is ties in a game; this one is sort of similar to that one, I'll admit; but this sense of tie here is the tie you put on your electrical cables; and then you have the musical sense of tie, right? At least four out of five — they've done a pretty good job of getting senses out of this single word vector by sparse coding. So sparse coding must be cool, if you want to go off and learn more about it. Okay. Okay, so that's everything I was going to say about word vectors and word senses. Is everyone good to there? Any questions? I'll rush ahead for the last two pieces. Okay, so I just wanted to start introducing, in the last 15 minutes, the ideas of how we can build neural classifiers and how we start to build neural networks in general.
I mean, in a sense, we've already built a very simple neural classifier, because our Word2Vec model is predicting what words are likely to occur in the context of another word, and you can think of that as a classifier. But let's look at a simple classifier like our named entity recognizer, as I mentioned before. For the named entity recognizer, we want to label words with their class. So we want to say these two words are a person, but the same words, "Paris Hilton", are then locations in this second sentence. So words can be ambiguous as to what their class is. And the other state is that they're not a named entity at all — they're just some other word. This is something that's used in lots of places as a bit of understanding. If you've seen any of those web pages where they tag company names with a stock ticker, or there are links on a Wikipedia page to another Wikipedia page or something like that, you've got named entities, where commonly, after finding the named entities, you do a second stage of entity linking, where you link the named entity to some canonical form of it, like a Wikipedia page. But we're not going to talk about that second part for the rest of today. So, building with our word vectors, we've got this simple task: we're going to look at a word in context, because sometimes Paris is the name of a person and sometimes it's a location, and we want to look at this word in its context and say, aha, this is the name of a location in this instance. And the way we're going to do it is we're going to form a window classifier. We take a word with a couple of words of context on each side, and for the words in our context window, we use our word vectors, because we want to show they're useful for something. Then we feed this into something that is a classifier. And our classifier is actually going to be a really simple classifier — we're only going to do location or not-a-location. So for this window here, we want to say: yes, it's a location. Whereas if it had been "I love Paris Hilton greatly", then we'd be saying no, because Paris, the word in the middle of the context, isn't a location there. So that's the idea of the classification, or classifier, we're making: assigning some set of classes to things. In general, for classifiers, we do supervised learning, which means we have some labeled examples, our training data set. So we have input items x_i, and for each one we've got a class y_i. So for my example training examples, I had ones like "I love Paris Hilton greatly" — that was negative, not a location — and "I visit Paris every spring" — that's positive, that is a location — where I'm actually classifying the middle word. Okay? So: inputs, labels. And in general, the labels are a set of classes. My set here is simply location or not-a-location, but I could get fancier and say I've got five classes: location, person name, company name, drug name, whatever other ones there are, or "other", not a name — a bunch of different classes. But I'm going to be doing it with only two, because I'm using this example in next Tuesday's lecture as well and I want to keep it simple. So that's what we're going to do. And what we're going to be using in this class is neural classifiers.
And so I just wanted to go through quickly, as food for thought as we go into it: for a typical statistical machine learning classifier, you can build classifiers like logistic regression or softmax classifiers, or other ones like support vector machines or naive Bayes or whatever else you might have seen. The vast majority of these classifiers are linear classifiers, meaning they have a linear decision boundary. When we're learning these classifiers, we're learning parameters W, but our inputs are fixed — our inputs are represented by symbols or quantities. So we have fixed inputs, we learn parameters as weights that multiply the inputs, and then we use a linear decision boundary. When we have a neural classifier, we're getting some more power. First of all, we're not only learning weights W for our classifier, we're also learning distributed representations for our words. So our word vectors re-represent the actual words-as-symbols and can move them around in the space, so that in terms of the original space, we've got a non-linear classifier that can represent much more complex functions. But we will then use the word vectors to re-represent those words to do a final classification. So at the end of our deep network, which we're about to build, we will have a linear classifier in terms of our re-represented vectors, but not in terms of our original space. Let me try to be concrete about that. Okay, so here's what I'm going to use, and we'll use it again next Tuesday, as my little neural network. I start with some words: "museums in Paris are amazing". First of all, I come up with the word embedding of those using my word vectors. So now I've got this high-dimensional vector, which is just a concatenation of five word vectors — if I have 100-dimensional word vectors, this is 500-dimensional. Then I put it through a neural network layer, which is simply multiplying that vector by a matrix and adding on a bias vector, and then putting it through some nonlinearity, which might be, for example, the logistic function that we've already seen. That'll give me a new representation. In particular, if the W is, say, 8 by 500, I'll be reducing it to a much smaller vector. Then after that, I can multiply the hidden representation in the middle of my neural network by another vector, and that will give me a score, and I'm going to put the score into the logistic function we saw earlier to say: what's the probability this is a location? So at this point, my classifier is a linear classifier in terms of this internal representation used right at the end, but it's a non-linear classifier in terms of my word vectors. Okay, right. Here's one other thing — this is just a note looking ahead, since you'll want to know this when we start doing the next assignments. Up until now, I've presented everything as doing log-likelihood and negative log-likelihood for building our models. Very soon now, in Assignment 2, we're going to be starting to do things with PyTorch, and when you start working out your losses with PyTorch, what you're going to want to use is cross-entropy loss. So let me quickly say what cross-entropy loss is. Cross entropy comes from information theory.
If you have a true probability distribution p and you're computing a probability distribution q, your cross-entropy loss looks like this: it's the expectation, under the true distribution p, of the negative log of your model's probability q. But there's a special case: if you have ground-truth, gold, or target data where things are labeled one — like for the examples from before, I'm just labeling it one for location: probability one it's a location, probability zero it's not a location — then if you label the right class with probability one, every other term in the summation goes to zero, and the only thing you're left with is: what log probability is my model giving to the right class? And so that is your log-likelihood, which we can use for the negative log-likelihood. A little bit of a complication here — just remember that you want to use cross-entropy loss in PyTorch when building the model. Okay, before we end today, here is my obligatory one picture of human neurons. Don't miss it, because I'm not going to show any more of these. These are human neurons. Human neurons were the inspiration for neural networks. So human neurons have a single output, which comes down this axon, and these outputs then feed into other neurons — I guess I don't really have an example here, but in general, one output can feed into multiple different neurons; you can see the different things hanging onto it. So you have the output connecting to the input, and where you make this connection, those are the synapses that people talk about. One neuron will normally have many, many inputs, where it picks things up from other neurons, and they all go into the nucleus of the cell, and the cell combines together all those inputs. Roughly what happens is: if there's enough positive activation from all of these inputs, it then sends signals down its output. Now, strictly, the way neurons work is that they send spikes, so the level of activation of a neuron is its rate of spiking, but that immediately got turned, in artificial neural networks, into just a real value for its level of activation. So this was kind of the genuine inspiration for all of our neural networks. A binary logistic regression is kind of a bit similar to a neuron, right? It has multiple inputs; you work out your total level of excitation, where in particular you can have inputs that are exciting — positive inputs — and inputs that are negative, which are inhibitory inputs. You combine them all together and you get an output that's your level of excitation, and you then convert that through some nonlinearity. So this was proposed as a very simple model of human neurons. Now, human neurons are way more complex than this, and some people, like neuroscientists, think we should maybe be doing a better model of actual human neurons. But in terms of what's being done in the current neural-networks-eat-the-world revolution, everyone's forgotten about that, and it's just sticking with this very, very simple model, which conveniently turns into linear algebra in a very simple way. So this gives us a single neuron. And this single neuron, if you use the logistic function, is identical to logistic regression, which you've probably seen in some stats class or something.
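As a hedged sketch of the window classifier described above — five concatenated 100-dimensional word vectors, an 8-by-500 hidden layer with a logistic nonlinearity, a score, and a cross-entropy-style loss — here is illustrative PyTorch code. The dimensions follow the lecture's example, but this is not the course's actual assignment code.

```python
import torch
import torch.nn as nn

class WindowClassifier(nn.Module):
    """Window classifier: is the center word of a 5-word window a LOCATION?"""
    def __init__(self, vocab_size, dim=100, window=5, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)      # word vectors (learned)
        self.W = nn.Linear(window * dim, hidden)        # the 8 x 500 layer plus bias
        self.u = nn.Linear(hidden, 1)                   # score from the hidden layer

    def forward(self, window_ids):                      # window_ids: (batch, 5)
        x = self.embed(window_ids).flatten(1)           # concatenate 5 word vectors -> (batch, 500)
        h = torch.sigmoid(self.W(x))                    # nonlinearity (logistic here)
        return self.u(h).squeeze(-1)                    # raw score (logit) for "location"

model = WindowClassifier(vocab_size=10_000)
windows = torch.randint(0, 10_000, (32, 5))             # a fake mini-batch of 5-word windows
labels = torch.randint(0, 2, (32,)).float()             # 1 = location, 0 = not a location

# Binary cross-entropy on the logit corresponds to the negative log-likelihood
# discussed above; with more than two classes you would use nn.CrossEntropyLoss instead.
loss = nn.BCEWithLogitsLoss()(model(windows), labels)
loss.backward()
print(loss.item())
```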
But the difference is that for neural networks we don't just have one logistic regression — we have a bunch of logistic regressions at once. And, well, it would be tricky if we had to define what each of these logistic regressions was calculating, but we don't: what we do is just feed them into another logistic regression. So we have some eventual output that we want to be something like, you know, this is or isn't a location. And then what happens is that, through our machine learning, these intermediate logistic regressions will figure out all by themselves something useful to do. That's the magic, right — you get this self-learning property where the model has a lot of parameters and internally works out useful things to do. In general, we can get more magic by having more layers in the neural network, and that way we can build up more complex functions. So effectively, these intermediate layers let us learn a model that re-represents the input data in ways that make it easier to classify, or easier to interpret and do things with, downstream in our neural network. And it's time, so I should stop there. Thank you.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture (Stanford CS224N NLP with Deep Learning, Spring 2024, Lecture 2) reviews and deepens the concept of word vectors, in particular the optimization of the Word2Vec algorithm, its variants, and the mathematics behind it. It first revisits optimization basics, moving from gradient descent to the stochastic gradient descent (SGD) used in practice, emphasizing SGD's advantages for large datasets and for training neural networks. It then works through Word2Vec implementation details, including parameter initialization, why two sets of vectors (center and context) are used, and negative sampling as an efficient alternative to the naive softmax.
A Jupyter Notebook demo shows GloVe word vectors working on word-similarity queries (e.g., "USA" → "Canada", "America") and analogy tasks (e.g., "king - man + woman = queen"), illustrating how word vectors capture semantic and even cultural knowledge. The lecture then presents count-based methods as another route to word vectors: from simple word-word co-occurrence matrices and SVD dimensionality reduction to the more advanced GloVe model, a log-bilinear model designed so that word vectors reflect ratios of co-occurrence probabilities and thereby capture linear components of meaning.
Evaluation of word vectors is divided into intrinsic evaluation (word-similarity and word-analogy tasks) and extrinsic evaluation (performance on downstream tasks such as named entity recognition). The lecture also discusses polysemy (word senses), noting that a standard word vector is typically a weighted average, or "superposition", of a word's different sense vectors, and that sparse-coding techniques can reconstruct the separate senses from a single vector. Finally, it introduces neural network classifiers, using named entity recognition to show how word vectors feed a simple window classifier, and sketches the basic ingredients of neural networks (the neuron model, multi-layer networks, activation functions, and loss functions such as cross-entropy), laying the groundwork for later lectures.
Course Logistics and Reminders
- Assignment 1: already released; due before class next Tuesday.
- Python review session: this Friday, 3:30-4:20 PM, in Gates B01 (the Gates basement).
- Office hours and help sessions: already running, with times and locations listed on the website; students are encouraged to attend. In-person sessions are staffed by multiple TAs in classrooms, and a Zoom option is available for Stanford Online students.
- Instructor office hours: Monday afternoons, by 15-minute Calendly appointment; booking opens tonight. Students are asked not to hog the slots.
- Important note on Assignment 2: in this (shorter, spring) quarter students will not be asked to implement Word2Vec from scratch; do not work from previous years' Assignment 2, and wait for the new Assignment 2 to be released.
Optimization Basics
- Goal: minimize the loss function J(θ).
- Gradient descent:
  - Compute the gradient ∇J(θ) of the loss with respect to the parameters θ to find the direction in which the loss decreases fastest.
  - Update rule (a minimal sketch follows this list): θ_new = θ_old - α * ∇J(θ)
    - α is the learning rate (step size), typically a small value (e.g., 10⁻³, 10⁻⁴, 10⁻⁵).
    - A small learning rate avoids overshooting the minimum, which could even leave the loss higher than at the starting point.
- Stochastic gradient descent (SGD):
  - Problem: computing the gradient over the whole dataset is very expensive, so updates are slow.
  - Method: compute the gradient on a small mini-batch of data (e.g., 16 or 32 examples) as a noisy estimate of the full gradient.
  - Advantages:
    - Fast: far cheaper than full-batch gradient descent.
    - Effective: the added noise can even help neural networks escape poor optima and train better.
  - SGD is the standard method for training neural networks.
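A minimal NumPy sketch of the mini-batch update described above (the toy quadratic loss, dataset, and hyperparameters are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(theta, batch):
    """Gradient of a toy squared-error loss ||x - theta||^2, averaged over the batch."""
    return 2.0 * np.mean(theta - batch, axis=0)

data = rng.standard_normal((10_000, 50)) + 3.0   # made-up dataset centered near 3.0
theta = np.zeros(50)                             # parameters to learn
alpha = 1e-2                                     # small learning rate
batch_size = 32

for step in range(1_000):
    idx = rng.integers(0, len(data), size=batch_size)      # sample a mini-batch
    theta = theta - alpha * grad_loss(theta, data[idx])    # SGD update: θ ← θ − α ∇J(θ)

print(theta[:3])   # should approach the data mean (≈ 3.0)
```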
Word2Vec in Detail
- Core idea: learn word vectors by predicting the words in a word's context.
- Initialization: word vectors must be initialized with small random values; initializing them to zero creates a symmetry that prevents effective learning.
- Procedure:
  - Iterate over every position in the corpus.
  - Use the current word vectors to estimate the probabilities of the context words.
  - Compute gradients from the prediction errors.
  - Update the word vectors so that they predict the surrounding words better.
- The "magic": this simple arithmetic, run over a lot of text, learns word vectors that capture word semantics, meaning, and relationships.
- Parameters: the model's only parameters are the word vectors themselves — center word vectors and outside (context) word vectors — which are treated as independent in the computation (a sketch of the resulting probability follows this list).
- Model type: a "bag of words" model — it ignores sentence structure and word order, caring only about which words occur in the center word's context.
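A small NumPy sketch of the skip-gram probability built from these two sets of vectors, P(o | c) = exp(u_o·v_c) / Σ_w exp(u_w·v_c); the vocabulary size, dimension, and random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100                         # assumed vocabulary size and embedding dimension

U = rng.normal(scale=0.01, size=(V, d))    # outside (context) word vectors
Vc = rng.normal(scale=0.01, size=(V, d))   # center word vectors

def p_outside_given_center(o, c):
    """Naive-softmax probability P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ Vc[c]                     # dot product with every outside vector
    scores -= scores.max()                 # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(p_outside_given_center(o=42, c=7))
```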
Word2Vec (GloVe) Demo (Jupyter Notebook)
- Pre-trained GloVe word vectors are loaded with the gensim package (the instructor notes that GloVe is a Stanford-built model, covered later in the course, whose behavior is similar to Word2Vec).
- Word vector representation: a word (e.g., "bread", "croissant") is represented as a vector of real numbers (100-dimensional in the demo); similar words have numerically similar vectors.
- Word similarity queries (a gensim sketch follows this list):
  - most_similar("usa") → "canada", "america", "u.s.a.", "united states", "australia"
  - most_similar("banana") → "coconut", "mango", "bananas", "potato", "pineapple", "fruit"
  - most_similar("croissant") → "brioche", "baguette", "focaccia", "ciabatta"
- Negative-vector similarity: querying for the words closest to the negation of a vector (e.g., -banana) mostly returns meaningless "weirdo things"; it does not find antonyms.
- Word analogies, computed by adding and subtracting vectors:
  - king - man + woman = queen (the classic example)
  - australia : beer :: france : champagne, and russia : vodka (in the instructor's demo)
  - pencil : sketching :: camera : photographing
  - obama : clinton :: reagan : nixon (from a model trained on data from around 2014)
  - tall : tallest :: long : longest (a syntactic relation)
- Takeaway: even this simple bag-of-words model learns rich lexical semantics, including cultural and world knowledge that goes beyond traditional semantics; differences (offsets) between vectors capture conceptual relations, such as the notion of a country's ruler.
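A minimal gensim sketch reproducing the style of queries above. The exact vector file used in the class notebook is not specified here; `glove-wiki-gigaword-100` is an assumed stand-in (a 100-dimensional GloVe model available through gensim's downloader):

```python
import gensim.downloader as api

# Downloads/loads 100-dimensional GloVe vectors (an assumption standing in for the class notebook's file).
model = api.load("glove-wiki-gigaword-100")

# Nearest neighbours by cosine similarity.
print(model.most_similar("usa", topn=5))
print(model.most_similar("croissant", topn=5))

# Analogy: king - man + woman ≈ ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Analogy in the demo's style: australia : beer :: france : ?
print(model.most_similar(positive=["beer", "france"], negative=["australia"], topn=1))
```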
Word2Vec Details: Questions and Answers
- Why does Word2Vec use two sets of vectors (u_o and v_c), and how do you end up with one vector per word?
  - In practice the two sets (center-word vectors and context-word vectors) are simply averaged. They end up very similar anyway, since every word alternates between being a center word and a context word as you move through the text.
- Why design it with two sets of vectors at all?
  - Mainly to keep the math simple. If center and context words shared a single set of vectors, then in the normalization term the case where the context word is the center word itself would contribute a squared term (x²), while every other term is a dot product of two different vectors (ax). That inconsistency makes the gradients messier. Two independent sets avoid the problem and keep the math clean. It is not done for performance; implemented correctly, a single set of vectors might even work slightly better.
Word2Vec Variants and Details
- The Word2Vec family (Mikolov et al., 2013):
  - Skip-gram (SG): predict the context words from the center word (the model presented in this course; simple and effective).
  - Continuous Bag of Words (CBOW): predict the center word from the context words.
- Loss functions and training methods:
  - Naive softmax:
    - Computes a probability for every word in the vocabulary; the denominator sums over the whole vocabulary.
    - Expensive: with a 400k-word vocabulary and 100-300-dimensional vectors, the dot products and exponentials add up to a huge amount of computation.
  - Negative sampling: an efficient approximation to the softmax (a sketch follows this list).
    - Idea: replace the multi-class problem with a set of binary classifications (logistic regression / sigmoid). The model should assign high probability to the true context word and low probability to randomly sampled "negative" words.
    - Form of the objective:
      - J(θ) = -log σ(u_o^T v_c) - Σ_{k=1..K} log σ(-u_k^T v_c) (a simplified form; it combines a term for "liking" the true word and terms for "disliking" the negative samples)
      - σ is the logistic/sigmoid function, σ(x) = 1 / (1 + e^(-x)), which maps real numbers into (0, 1).
      - For the true context word o, we want u_o^T v_c to be large, so its probability approaches 1.
      - For each negative sample k (K samples, e.g., 5-10), we want u_k^T v_c to be small (equivalently, -u_k^T v_c to be large), so its probability approaches 0.
    - Sampling strategy: negatives are not drawn uniformly from the vocabulary.
      - They are usually sampled according to the unigram distribution P(w), i.e., word frequency.
      - The Word2Vec paper samples from P(w)^(3/4), which raises the probability of sampling rare words; it sits between the uniform and raw-frequency distributions and yields better word vectors.
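A hedged NumPy sketch of the negative-sampling objective above; the shapes, K, and the made-up count table are illustrative, and a real implementation (such as the original word2vec code) differs in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V, d, K = 10_000, 100, 5                       # assumed vocab size, dimension, negatives per example
U = rng.normal(scale=0.01, size=(V, d))        # outside vectors
Vc = rng.normal(scale=0.01, size=(V, d))       # center vectors

word_counts = rng.integers(1, 1_000, size=V)   # made-up corpus counts
probs = word_counts ** 0.75                    # the 3/4-power unigram distribution
probs = probs / probs.sum()

def neg_sampling_loss(center, outside):
    negatives = rng.choice(V, size=K, p=probs)                   # K sampled negative words
    pos = -np.log(sigmoid(U[outside] @ Vc[center]))              # "like" the true context word
    neg = -np.sum(np.log(sigmoid(-U[negatives] @ Vc[center])))   # "dislike" the sampled negatives
    return pos + neg

print(neg_sampling_loss(center=7, outside=42))
```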
Count-Based Methods: From Co-occurrence Matrices to GloVe
- Basic idea: directly count how often words occur together within a context window.
- Word-word co-occurrence matrix:
  - Build a matrix whose rows and columns correspond to vocabulary words; entry X_ij counts how many times words i and j co-occur within the chosen context window.
  - Problems:
    - Extremely high-dimensional: with vocabulary size V the matrix is V × V (e.g., 400k × 400k), very sparse, and takes a lot of space.
    - Using the raw high-dimensional sparse vectors directly works poorly.
- Dimensionality reduction (a toy sketch follows this list):
  - Singular Value Decomposition (SVD):
    - Factor the co-occurrence matrix as X = UΣV^T.
    - Keeping the k largest singular values in Σ (and the corresponding parts of U and V) yields low-dimensional (k-dimensional) word vectors.
  - Early attempts (Latent Semantic Analysis, LSA): running SVD directly on raw counts does not work well.
  - Improvements (Doug Rohde, COALS): transform the co-occurrence counts before the SVD, e.g.:
    - Use log frequencies.
    - Weight the context window (closer words count more).
    - Use Pearson correlations, and so on.
  - Rohde's work also found linear semantic components in the resulting vectors (such as "performer of an action"), but it attracted little attention at the time.
- GloVe (Global Vectors for Word Representation) (Pennington, Socher, Manning, 2014):
  - Motivation: combine the statistical strength of count-based methods with the linear semantic structure learned by models like Word2Vec.
  - Key insight (Jeffrey Pennington): ratios of co-occurrence probabilities encode components of meaning.
    - For example, consider the words ice and steam:
      - P(solid | ice) / P(solid | steam) is large.
      - P(gas | ice) / P(gas | steam) is small.
      - P(water | ice) / P(water | steam) is roughly 1.
      - P(fashion | ice) / P(fashion | steam) is roughly 1.
    - Such ratios reveal semantic dimensions like solid vs. gas.
  - Model idea: build a model in which dot products of word vectors directly model log co-occurrence counts:
    - w_i^T w_j + b_i + b_j = log(X_ij)
    - Then a vector difference w_i - w_k corresponds to a ratio of co-occurrence probabilities, log(X_ij / X_kj), which captures linear components of meaning.
  - The full model also includes bias terms and frequency weighting.
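A toy NumPy sketch of the count-plus-SVD pipeline above; the tiny corpus, window size, and dimensions are illustrative assumptions:

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]   # toy corpus
window = 1                                                          # context window size

# Build the vocabulary and the word-word co-occurrence matrix X.
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                X[idx[w], idx[words[j]]] += 1

# Reduce to k-dimensional word vectors with a truncated SVD: X ≈ U_k Σ_k V_k^T.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]      # each row is a k-dimensional word vector
print(dict(zip(vocab, word_vectors.round(2))))
```

For reference, the published GloVe objective that formalizes the log-bilinear relation above is a weighted least-squares loss, J = Σ_{i,j} f(X_ij) · (w_i^T w̃_j + b_i + b̃_j - log X_ij)², where the weighting function f caps the influence of very frequent pairs.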
Evaluating Word Vectors
- Why evaluation matters: NLP needs quantitative ways to judge whether a model is good.
- Two kinds of evaluation:
  - Intrinsic evaluation:
    - Evaluate the word vectors themselves on specific subtasks.
    - Pros: fast to compute; helps you understand individual components of the system.
    - Cons: may not correlate directly with performance on the final downstream task.
    - Examples (a small sketch follows this list):
      - Word analogies: e.g., man : king :: woman : queen. Measure accuracy on a predefined set of analogies. The GloVe demo showed this working, though not every analogy holds.
      - Word similarity: compare the model's similarity scores for word pairs (e.g., cosine similarity of the vectors) against human similarity ratings (datasets such as WordSim-353 and SimLex-999), for instance via a correlation coefficient.
      - The instructor showed a table of similarity-task scores for different methods (SVD, log-count SVD, CBOW, Skip-gram, GloVe), with GloVe and related models coming out ahead.
  - Extrinsic evaluation:
    - Evaluate the word vectors inside a real downstream NLP task.
    - Pros: directly reflects the vectors' value in practical applications.
    - Cons: expensive to run, and it can be hard to isolate exactly what the word vectors contributed.
    - Example:
      - Named Entity Recognition (NER): identify entities in text (person, location, organization names).
        - E.g., "Chris Manning lives in Palo Alto" → (Chris Manning, PERSON), (Palo Alto, LOCATION).
      - Experiments show that adding word vectors (e.g., GloVe) to an NER system significantly improves performance over a baseline that uses only discrete features.
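A small sketch of the intrinsic similarity evaluation described above, reusing the gensim model loaded in the earlier sketch and a tiny made-up stand-in for a human-rated dataset (real evaluations use WordSim-353 or SimLex-999):

```python
import gensim.downloader as api
from scipy.stats import spearmanr

model = api.load("glove-wiki-gigaword-100")   # assumed model, as in the demo sketch above

# Toy stand-in for a human-rated word-similarity dataset: (word1, word2, rating on a 0-10 scale).
human_rated = [
    ("car", "automobile", 9.5),
    ("coast", "shore", 8.5),
    ("professor", "cucumber", 0.5),
]

model_scores = [model.similarity(w1, w2) for w1, w2, _ in human_rated]   # cosine similarities
human_scores = [r for _, _, r in human_rated]

# Intrinsic score: rank correlation between model similarities and human judgments.
rho, _ = spearmanr(model_scores, human_scores)
print(rho)
```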
Word Senses and Polysemy
- The problem: most words have several meanings (e.g., "pike" can be a fish, a weapon, a road, or a diving position; "bank" can be a financial institution or a riverbank).
- How standard word vectors handle this:
  - Standard Word2Vec/GloVe learns a single vector per word.
  - That single vector is effectively a weighted average, or superposition, of the vectors for the word's different senses, with weights determined by how often each sense occurs (a formula sketch follows this list).
- An alternative: explicit sense vectors:
  - Cluster the occurrences of a word in text and learn one vector per cluster, i.e., per sense.
  - For example, "jaguar" can get separate vectors for the car, the operating-system release, and the animal.
  - This does separate senses effectively, but it is not the mainstream approach.
- In defense of single-vector models:
  - Word senses are often subtle and continuous rather than the discrete entries a dictionary lists; different dictionaries also carve up the same word (e.g., the many uses of "field") inconsistently.
  - A single vector may capture this continuity and fuzziness of meaning better.
- Recovering senses from a single vector (sparse coding):
  - A surprising result: even though a single word vector is a mixture of several senses, the high dimensionality and sparsity of the vector space make it possible, using ideas from sparse coding, to reconstruct the individual sense vectors from the mixed vector.
  - This draws on Arora et al. (TACL 2018, including Stanford's Tengyu Ma), who show, for example, that the single vector for "tie" can be decomposed into sense vectors for the article of clothing, the drawn game, the cable fastener, and the musical notation.
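To spell out the "superposition" idea with illustrative notation (the symbols below are not taken verbatim from the lecture): if a word such as pike has senses pike₁, pike₂, pike₃ occurring with corpus frequencies f₁, f₂, f₃, its single learned vector behaves approximately like the frequency-weighted average

v_pike ≈ α₁ · v_pike₁ + α₂ · v_pike₂ + α₃ · v_pike₃, where αᵢ = fᵢ / (f₁ + f₂ + f₃).

The sparse-coding result cited above says that, because sense vectors occupy sparse directions in a high-dimensional space, the individual v_pikeᵢ can often be recovered from v_pike.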
Introduction to Neural Network Classifiers
- Goal: build neural network models for classification tasks such as named entity recognition.
- NER example:
  - Given a sentence, decide whether a particular word belongs to a class (e.g., is "Paris" a place name or a person name in this context?).
  - "Paris Hilton" (person) vs. "Paris, France" (location).
- Window classifier:
  - Take the target word together with a few context words on either side (forming a window).
  - Concatenate the word vectors of all words in the window and use that as the classifier's input.
  - Here simplified to a binary decision: LOCATION vs. NOT A LOCATION.
- Supervised learning setup:
  - Training data: inputs x_i (e.g., the window's concatenated word vectors) with corresponding labels y_i (e.g., LOCATION / NOT A LOCATION).
- Traditional classifiers vs. neural classifiers:
  - Traditional classifiers (logistic regression, SVMs, naive Bayes):
    - Usually linear classifiers: they learn weights W while the input features x stay fixed.
  - Neural classifiers:
    - Learn not only the weights W but also distributed representations of the inputs (the word vectors themselves are learned or adjusted).
    - Can therefore realize a non-linear decision boundary in the original input space.
- A simple neural network for NER (the network walked through in the transcript above):
  - Input layer: the words in the window (e.g., "Museums in Paris are amazing", with center word "Paris").
  - Embedding layer: look up each word's vector and concatenate the window into one long vector x (e.g., five 100-dimensional vectors give a 500-dimensional x).
  - Hidden layer:
    - h = f(Wx + b), where W is a weight matrix, b is a bias vector, and f is a non-linear activation such as the sigmoid (logistic) function.
    - The hidden layer maps the input x to a new, usually lower-dimensional, representation h (e.g., 8-dimensional).
  - Output layer:
    - score = U^T h (U is a weight vector); P(LOCATION | window) = σ(score), using the sigmoid to turn the score into a probability.
  - The network is linear in the hidden representation h but non-linear in the original word-vector input x.
- Loss function:
  - Cross-entropy loss: the standard choice in frameworks such as PyTorch.
    - H(p, q) = -Σ_x p(x) log q(x), where p is the true distribution and q is the model's predicted distribution.
    - With one-hot ground-truth labels (probability 1 on the correct class, 0 elsewhere), cross-entropy reduces to the negative log likelihood: -log q(correct_class).
- Biological inspiration of neural networks:
  - Inspired by neurons in the human brain: a nucleus, dendrites that receive inputs, an axon that carries the output, and synapses that make the connections.
  - A neuron's level of activation shows up as its spiking rate.
- The artificial neuron:
  - A simplified model that resembles binary logistic regression:
    - Multiple inputs x_i.
    - A weighted sum Σ w_i x_i + b.
    - A non-linear activation f gives the output y = f(Σ w_i x_i + b).
  - Real neurons are far more complex, but this simplified model has driven the current neural network revolution and maps straightforwardly onto linear algebra.
- Multi-layer neural networks:
  - Organize many such "neurons" into layers.
  - The outputs of one layer become the inputs of the next.
  - Key advantage: the intermediate (hidden) layers automatically learn useful representations of the input that make the downstream classification or task easier; with many parameters, the model discovers these feature transformations on its own.
  - More layers can, in principle, represent more complex functions.
Summary of Key Points
The lecture systematically covers how word vectors are produced (Word2Vec, GloVe), how they are evaluated, and how they are used, and gives a first introduction to neural network classifiers. The core points are how word vectors capture lexical semantics by predicting context or by analyzing co-occurrence statistics, and how these vectors serve as input to more complex neural network models. Stochastic gradient descent is the key optimization algorithm for training such models. Polysemy is a challenge for word-vector representations, but current models cope by learning averaged/superposed representations or by applying sparse-coding techniques. Finally, neural networks, through layered structure and non-linear transformations, can automatically learn useful feature representations from data, providing a powerful tool for complex NLP tasks.