Stanford CS224N NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors
Lecture 1 of Stanford's CS224N ("Natural Language Processing with Deep Learning," Spring 2024) is given by Christopher Manning. He first notes how popular the course is, then outlines this lecture's content: course logistics, human language and word meaning, and the main technical topic, the Word2Vec algorithm introduced in 2013 (including its objective function, gradients, optimization, and a demo).
The course team includes the instructor, the head TA (absent due to a health issue), a course manager, and many TAs. Course information is published mainly on the website; discussion happens on Ed rather than email. The first assignment is already out and due next Tuesday; office hours begin the next day, and a Python/NumPy tutorial runs on Friday.
The course learning goals are: 1) master foundational and current deep learning methods for NLP (from word vectors to large language models, pre-training, fine-tuning, interpretability, agents, and more); 2) understand the nature of human language and why it is hard for computers to process; 3) develop the ability to build practical NLP systems.
Grading consists of four assignments (nearly 50%), a final project (default or custom, about 50%), and participation, with six late days allowed. Assignments must be done individually; the final project may be done in teams. AI tools may be used to assist learning (for example, with coding), but may not be used to directly answer assignment questions.
Assignment plan: A1 is an introductory Jupyter Notebook; A2 emphasizes the math, understanding neural networks, an introduction to PyTorch, and building a dependency parser; A3 and A4 are larger PyTorch/GPU projects (using Google Cloud) covering machine translation and Transformer-based information extraction. For the final project, students choose between a scaffolded default project and a fully custom project; a TA mentor is assigned, or students may find their own mentor.
Finally, the lecture turns to human language and word meaning in the context of artificial intelligence.
Media Details
- Upload date: 2025-05-15 13:21
- Source: https://www.youtube.com/watch?v=DzpHeXVSC5I
Transcript
speaker 1: So the thing that seems kind of amazing to me and us is the fact that, well, actually this course was taught just last quarter. And here we are with the enormous number of people again taking this class. I guess that says something; maybe approximately what it says is ChatGPT. But anyway, it's great to have you all. Lots of exciting content to come, and I hope you'll all enjoy it. So let me get started and start telling you a bit about the course before diving straight into today's content. For people still coming in, you know, there are oodles of seats still right on either side, especially down the other front. There are tons of seats, so do feel empowered to go out and seek those seats. If people on the corridors are really nice, they could even move towards the edges to make it easier for people. But one way or another, feel free to find a seat. Okay, so this is the plan for what I want to get through today. So first of all, I'm going to tell you about the course for a few minutes. Then have a few remarks about human language and word meaning. Then the main technical thing we want to get into today is to start learning about the word2vec algorithm. The word2vec algorithm is slightly over a decade old now. It was introduced in 2013, but it was a wildly successful, simple way of learning vector representations of words. So I wanna show you that as a sort of first easy baby system for the kind of neural representations that we're going to talk about in class. We're then going to get more concrete with that, looking at its objective function, gradients and optimization. And then hopefully, if all goes to schedule, spend a few minutes just playing around in an IPython notebook. I'm going to have to change computers for that. Then sort of seeing some of the things you can do with this. Okay, so this is the course logistics in brief. I'm Christopher Manning. Hi again, everyone. The head TA unfortunately has a bit of a health problem, so he's not actually here today. We've got a course manager for the course, who is up the back there. And then we've got a whole lot of TAs. If you're a TA who's here, you could stand up and wave or something like that, so people can see a few of the TAs and see some friendly faces. Okay? We've got some TAs and some other ones. And so you can look at them on the website. If you're here, you know what time the class is. There's an email list, but preferably don't use it; use the Ed site that you can find on the course website. So the main place to go and look for information is the course website, which we've got up here. And that then links in to Ed, which is what we're going to use as the main discussion board. Please use that rather than sending emails. The first assignment for this class, it's a sort of an easy one, it's the warm-up assignment, but we want to get people busy and doing stuff straight away. So the first assignment is already live on the web page, and it's due next Tuesday before class. So you have slightly less than seven days left to do it. So do get started on that. And to help with that, we're going to be immediately starting office hours tomorrow. And they're also described on the website. We also do a few tutorials on Fridays. The first of these tutorials is a tutorial on Python and NumPy. Many people don't need that because they've done other classes and done this. But for some people; we try and make this class accessible to everybody.
So if you'd like to brush up a bit on Python or how to use NumPy, it's a great thing to go along to. And the TA who's right over there is going to be teaching it on Friday. Okay, what do we hope to teach? You know, at the end of the quarter when you get the eval, you'll be asked to rate whether this class met its learning goals. These are my learning goals. What are they? So the first one is to teach you about the foundations and current methods for using deep learning applied to natural language processing. So this class tries to sort of build up from the bottom up. So we start off doing simple things like word vectors and feed-forward neural networks, recurrent networks and attention. We then fairly quickly move into the kind of key methods that are used for NLP in 2024. I wrote down here transformers and encoder-decoder models. I probably should have written large language models somewhere in this list as well. But then pre-training and post-training of large language models, adaptation, model interpretability, agents, etcetera. But that's not the only thing that we want to do. So there are a couple of other things that we crucially want to achieve. The second is to give you some understanding of human languages and the difficulties in understanding and producing them on computers. Now, there are a few of you in this class who are linguistics majors, or perhaps symbolic systems majors. Yay to the symbolic systems majors. But for quite a few of the rest of you, you'll never see any linguistics, in the sense of understanding how language works, apart from this class. So we do want to try and convey a little bit of a sense of what some of the issues are in language structure and why it's proven to be quite difficult to get computers to understand human languages, even though humans seem very good at learning to understand each other. And then the final thing that we want to make it onto is actually concretely building systems, so that this isn't just a theory class. We actually want you to leave this class thinking, oh yeah, in my first job, wherever you go, whether it's at a startup or big tech or some nonprofit, oh, there's something they want to do; it would be useful if we had a text classification system, or we did information extraction to get some kind of facts out of documents. I know how to build that. I can build that system because I did CS224N. Okay, here's how you get graded. So we have four assignments, mainly one and a half weeks long, apart from the first one. They make up almost half the grade. The other half of the grade is made up of a final project, of which there are two variants, a custom or default final project, which we'll get on to in a minute. And then there's a few percent that go for participation. Six late days. Collaboration policy: like all other CS classes, we've had issues with people not doing their own work. We really do want you to learn things in this class. And the way you do that is by doing your own work. So make sure you understand that. And so for the assignments, everyone is expected to do their own assignments. You can talk to your friends, but you're expected to do your own assignment. For the final project, you can do that as a group. Then we have the issue of AI tools. Now, of course, in this class we love large language models, but nevertheless, we don't want you to do your assignments by saying, hey, ChatGPT, could you answer question three for me?
That is not the way to learn things. If you want to make use of AI as a tool to assist you, such as for coding assistance, go for it. But we're wanting you to be working out how to answer assignment questions by yourself. Okay? So this is what the assignments look like. So assignment one is meant to be an easy on-ramp, and it's done as a Jupyter notebook. Assignment two then has people, you know, what can I say here? We are at this fine liberal arts and engineering institution. We're not at a coding boot camp. So we hope that people have some deep understanding of how things work. So in assignment two, we actually want you to do some math and understand how things work in neural networks. So for some people, assignment two is the scariest assignment in the whole class. But then it's also the place where we introduce PyTorch, which is the software package we use for building neural networks, and we build a dependency parser, which we'll get to later as something more linguistic. Then for assignments three and four, we move on to larger projects using PyTorch with GPUs, and we'll be making use of Google Cloud. And for those two assignments, we look at doing machine translation and getting information out with transformers. And then these are the two final project options. So essentially, you know, we have a default final project where we give you a lot of scaffolding and an outline of what to do, but it's still an open-ended project. There are lots of different things you can try to make this system work better, and we encourage you to explore. But nevertheless, you're given a leg up from quite a lot of the scaffolding. We'll talk about this more, but you can either do that option or you can just come up with totally your own project and do that. Okay, that's the course. Any questions on the course? Yes, for the final project, how are mentors assigned? So if you can find your own mentor, if you're interested in something and there's someone that's happy to mentor you, that person can be your mentor. Otherwise, one of the course TAs will be your mentor. And how that person is assigned is: one of the TAs, who is in charge of final projects, assigns people, and they do the best they can in terms of finding people with some expertise and having to divide all the students across the mentors roughly equally. Any other questions? Okay, I'll power ahead. Human language and word meaning. So let me just sort of say a little bit about the big picture here. So we're in the area of artificial intelligence, and we've got this idea that humans are intelligent. And then there's the question of, you know, how does language fit into that? And you know, this is something that there is some argument about. And if you want to, you can run off onto social media and read some of the arguments about these things and contribute to it if you wish to. But here is my perhaps biased take as a linguist. Well, you can compare human beings to some of our nearest neighbors, like chimpanzees, bonobos and things like that. And you know, well, one big distinguishing thing is we have language, and they don't. But you know, in most other respects, chimps are very similar to human beings, right? You know, they can use tools. They can plan how to solve things. They've got really good memory. Chimps have better short-term memory than human beings do, right? So that in most respects, it's hard to show an intelligence difference between chimps and people, except for the fact that we have language. But us having language has been this enormous differentiator, right?
That if you look around at what happened on the planet, you know that there are creatures that are stronger than us, faster than us, more venomous than us, have every possible advantage. But human beings took over the whole place. And how did that happen? We had language, so we could communicate. And that communication allowed us to have human ascendancy. But I'd like to mention, so one big role of language is the fact that it allows communication. But I'd like to suggest it's actually not the only role of language. That language has also allowed humans, I would argue, to achieve a higher level of thought. So there are various kinds of thoughts that you can have without any language involved. You know, you can think about a scene. You can move some bits of furniture around in your mind, and there's no language. And obviously emotional responses of feeling scared or excited, they happen, and there's no language involved. But I think most of the time when we're doing higher level cognition, if you're thinking to yourself, argh, my friend seemed upset about what I said last night, I should probably work out how to fix that, or maybe I could blah, blah, blah, blah, I think we think in language and plan out things, and so it's given us a scaffolding to do much more detailed thought and planning. Most recently of all, of course, human beings invented ways to write. And that led, so writing is really, really recent. I mean, no one really knows how old human languages are. You know, most people think a few hundred thousand years, not very long by evolutionary time scales. But writing, we do know writing is really, really recent. So writing is about 5,000 years old. And so, but you know, writing proved to be, again, this amazing cognitive tool that just gave humanity an enormous leg up. Because suddenly it's not only that you could share information and learn from the people that were standing within 50 feet of you. You could then share knowledge across time and space. So really, having writing was enough to take us from the Bronze Age, very simple metal working, to the kind of, you know, mobile phones and all the other technology that we walk around with today, in just a very short amount of time. So language is pretty cool, but, you know, one shouldn't only fixate on the sort of knowledge side of language and how that's made human beings great. I mean, there's this other side of language where language is this very flexible system, which is used as a social tool by human beings, so that we can speak with a lot of imprecision and nuance and emotion in language, and we can get people to understand. We can set up sort of new ways of thinking about things by using words for them. And languages aren't static. Languages change as human beings use them. Languages aren't something that were delivered down on tablets by God. Languages are things that humans constructed, and humans changed them with each successive generation. And indeed, most of the innovation in language happens among young people, you know, people that are a few years younger than you are; most of you are now in that era of later teens going into the twenties, right? That's a big period of linguistic innovation, where people think up cool new phrases and ways of saying things, and some of those get embedded and extended, and that then becomes the future of language. So Herb Clark used to be a psychologist at Stanford. He's now retired, but he had this rather nice quote.
The common misconception is that language use has primarily to do with words and what they mean. It doesn't. It has primarily to do with people and what they mean. Okay, so that's language in two slides for you. So now we'll skip ahead to deep learning. So in the last decade or so, we've been able to make fantastic progress in doing more with computers understanding human languages, using deep learning. We'll say a bit more about the history later on, but you know, work on trying to do things with human language started in the 1950s. So it's been sort of going for sixty years or so. And you know, there was some stuff, it's not that nobody could do anything, but you know, the ability to understand and produce language had always been kind of questionable, where it's really in the last decade, with neural networks, that just enormous strides of progress have been made, and that's led into the world that we have today. So one of the first big breakthroughs came in the area of using neural NLP systems for machine translation. So this started about 2014 and was already deployed live on services like Google by 2016. It was so good that it got a sort of really, really rapid commercial deployment. And I mean, overall, this kind of facility with machine translation just means that you're growing up in such a different world to people a few generations back, right? People a few generations back, unless they actually knew the different languages of different people, sort of had no chance to communicate with them. Whereas now we're very close to having something like the Babel fish from The Hitchhiker's Guide to the Galaxy for understanding all languages. It's just, it's not a Babel fish. It's a cell phone. But you know, you can have it out between two people and have it do simultaneous translation. And you know, it's not perfect, people keep on doing research on this, but you know, by and large, it means you can pick anything up from different areas of the world. As you can see, this example is from a couple of years ago, since it's still from the Covid pandemic era. But you know, I can see this Swahili from Kenya and say, oh gee, I wonder what that means? Stick it into Google Translate, and I can learn that Malawi lost two ministers due to Covid infections and they died, right? So you know, we're just in this different era of being able to understand stuff. And then there are lots of other things that we can do with modern NLP. So until a few years ago, we had web search engines, and you put in some text; you could write it as a sentence if you wanted to, but it didn't really matter whether you wrote a sentence or not, because what you got were some keywords that were then matched against the index, and you were shown some pages that might have the answers to your questions. These days, you can put an actual question into a modern search engine, like, when did Kendrick Lamar's first album come out? It can go and find documents that have relevant information. It can read those documents, and it can give you an answer, so that it actually can become an answer engine, rather than just something that finds documents that might be relevant to what you're interested in. The way that that's done is with big neural networks, so that you might commonly have, for your query, a retrieval neural network, which can find passages that are similar to the query. They might then be re-ranked by a second neural network, and then there'd be a third, reading neural network.
That will read those passages and synthesize information from them, which it then returns as the answer. Okay, that gets us to about 2018. But then things got more advanced again. So it was really around 2019 that people started to see the power of large language models. And so back in 2019, those of us in NLP were really excited about GPT-2. It didn't make much of an impact on the nightly news, but it was really exciting in NLP land, because GPT-2, already, for the first time, meant here was a large language model that could just generate fluent text. Really, until then, NLP systems had done a sort of decent job at understanding certain facts out of text, but we'd just never been able to generate fluent text that was at all good. Whereas here, what you could do with GPT-2 is you could write something like the start of a story: a train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. Then GPT-2 would just write a continuation: the incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. Dot, dot, dot. And so the way this is working is it's conditioning on all the past material. And as I show at the very bottom line down here, it's then generating one word at a time, as to what word it thinks would be likely to come next after that. And so from that simple method of generating words one after another, it's able to produce excellent text. And the thing to notice is, I mean, this text is not only kind of, you know, formally correct (you know, the spelling's correct, and the sentences are real sentences, not disconnected garbage), but you know, it actually understands a lot, right? So the prompt that was written said there were stolen nuclear materials in Cincinnati. But you know, GPT-2 knows a lot of stuff. It knows that Cincinnati is in Ohio. It knows that in the United States, it's the Department of Energy that regulates nuclear materials. It knows if something is stolen, it's a theft, and that it would make sense that people are getting involved with that. It talks about, you know, the train carriage, talking about the train line, where it goes. It really knows a lot and can write, you know, coherent discourse like a real story. So that's kind of amazing. But you know, things moved on from there. And so now we're in the world of ChatGPT and GPT-4. And one of the things that we'll talk about later is this was a huge user success, because now you could ask questions or give it commands and it would do what you wanted. And that was further amazing. So here I'm saying, hey, please draft a polite email to my boss, Jeremy, that I would not be able to come into the office for the next two days because my nine-year-old son (there's a misspelling for 'son', but the system works fine), the spider, Peter, is angry with me that I'm not giving him much time. And it writes a nice email; it corrects the spelling mistakes, because it knows people make spelling mistakes. It doesn't talk about songs, and everything works out beautifully. You can get it to do other things. So you can ask what is unusual about this image. So in thinking about meaning, one of the things that's interesting with these recent models is that they're multimodal and can operate across modes.
And so a favorite term that we coined at Stanford is the term foundation models, which we use as a generalization of large language models, to have the same kind of technology used across different modalities: images, sound, various kinds of bioinformatic things, DNA, RNA, things like that, seismic waves, any kind of signal, building these same kinds of large models. Another place that you can see that is going from text to images. So if I ask for a picture of a train going over the Golden Gate Bridge (this is now DALL-E 2), it gives me a picture of a train going over the Golden Gate Bridge. This is a perfect time to welcome anyone who's watching this on Stanford Online. If you're on Stanford Online and not in the Bay Area, the important thing to know is no trains go over the Golden Gate Bridge. But you might not be completely happy with this picture, because, you know, it shows the Golden Gate Bridge and a train going over it, but doesn't show the bay. So maybe I'd like to get it with the bay in the background. And if I ask for that, well, look, now I've got my train going over the Golden Gate Bridge and with the bay in the background. But you still might not be, this might not be exactly what you want. Like, maybe you'd prefer something that's a pencil drawing. So I can say, a train going over the Golden Gate Bridge, detailed pencil drawing, and I can get a pencil drawing. Or maybe it's unrealistic that the Golden Gate Bridge only has trains going over it now. So maybe it'd be good to have some cars as well. So I could ask for a train and cars, and we can get a train and cars going over it. Now, I actually made these ones all by myself, so you should be impressed with my generative AI artwork. But these examples are actually a bit old now, because they're done with DALL-E 2. And if you keep up with these things, that's a few years ago; there's now DALL-E 3 and so on. So we can now get much fancier things again, right? An illustration from a graphic novel, a bustling city street under the shine of a full moon, the sidewalks bustling with pedestrians enjoying the nightlife; at the corner stall, a young woman with fiery red hair, dressed in a signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man wearing a sharp suit, sports a noteworthy moustache and is animatedly conversing on his steampunk telephone. And pretty much, we're getting all of that. Okay, so let's now get on to starting to think more about meaning. So perhaps, what can we do for meaning, right? So if you think of words and their meaning, if you look up a dictionary and say, what does meaning mean? Meaning is defined as the idea that is represented by a word or phrase, the idea that a person wants to express by using words, the idea that is expressed. And in linguistics, you know, if you go into a semantics class or something, the commonest way of thinking of meaning is somewhat like what's presented up above there: that meaning is thought of as a pairing between what's sometimes called signifier and signified, but is perhaps easier to think of as a symbol, a word, and then an idea or thing. And so this notion is referred to as denotational semantics. So the idea or thing is the denotation of the symbol. And this same idea of denotational semantics has also been used for programming languages, because in programming languages you have symbols like while and if and variables, and they have a meaning, and that could be their denotation.
So we sort of would say that the meaning of tree is all the trees you can find out around the world. That's a sort of okay notion of meaning. It's a popular one. It's never been very obvious, or at least traditionally it wasn't very obvious, as to what we could do with that to get it into computers. So if you looked in the pre-neural world, when people tried to look at meanings inside computers, they sort of had to do something much more primitive of looking at words and their relationships. So a very common, traditional solution was to make use of WordNet. And WordNet was kind of a sort of fancy thesaurus that showed word relations, so it could tell you about synonyms and is-a-kind-of things. So a panda is a kind of carnivore, which is a placental, which is a mammal, and things like that. Good has various meanings: it's a trade good, or the sense of goodness. And you could explore with that. But systems like WordNet were never very good for computational meaning. They missed a lot of nuance. WordNet would tell you that proficient is a synonym for good. But if you think about all the things that you would say were good, you know, that was a good shot, would you say that was a proficient shot? Sounds kind of weird to me. You know, there's a lot of color and nuance in how words are used. WordNet is very incomplete. It's missing anything that's kind of cooler, more modern slang. These maybe aren't very modern slang now, but you won't find more modern slang in it either. It's sort of very human-made, etc. It's got a lot of issues. So this led into the idea of, can we represent meaning differently? And this leads us into word vectors. So when we have words, wicked, badass, nifty, wizard, what do they turn into when we have computers? Well, effectively, you know, words are these discrete symbols; they're just some kind of atomic symbol. And if we then turn those into something that's closer to math, how symbols are normally represented is you have a vocabulary, and your word is some item in that vocabulary. So motel is that word in the vocabulary, and hotel is this word in the vocabulary. And commonly, this is what computational systems do: you take all your strings and you index them to numbers, and that's the sort of position in a vector that they belong in. And we have huge numbers of words, so we might have a huge vocabulary, so we'll have very big and long vectors. And so these get referred to as one-hot vectors for representing the meaning of words. But representing words by one-hot vectors turns out to not be a very good way of computing with them. It was used for decades, but it turns out to be kind of problematic. And part of why it's problematic is it doesn't have any natural, inherent sense of the meanings of words. You just have different words. You have hotel and motel and house and chair. And so if you think about it in terms of these vector representations, if you have motel and hotel, there's no indication that they're kind of similar. They're just two different symbols, which have ones in different positions in the vector. Or, formally, in math terms, if you think about taking the dot product of these two vectors, it's zero. The two vectors are orthogonal. They have nothing to do with each other. Now, there are things that you can do with that. You can start saying, oh, let me start building up some other resource of word similarity, and I'll consult that resource of word similarity, and it'll tell me that motels and hotels are similar to each other. And people did things like that, right?
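To make that point about one-hot vectors concrete, here is a minimal sketch (the tiny four-word vocabulary and the code are only for illustration, not from the lecture): any two distinct one-hot vectors have a dot product of zero, so they encode no notion of similarity at all.

```python
import numpy as np

# Toy four-word vocabulary, invented purely for illustration; a real system
# would index hundreds of thousands of words.
vocab = ["chair", "hotel", "house", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Localist (one-hot) vector: a 1 in the word's own position, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")

# Distinct one-hot vectors are always orthogonal, so "motel" and "hotel"
# look no more related than any other pair of words.
print(np.dot(motel, hotel))  # 0.0
print(np.dot(motel, motel))  # 1.0
```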
In web search, it was referred to as query expansion techniques. But still, the point is that there's no natural notion of similarity in one-hot vectors. And so the idea was that maybe we could do better than that, that we could learn to include similarity in the vectors themselves. And so that leads into the idea of word vectors, but it also leads into a different way of thinking about semantics. I just realized I forgot to say one thing back two slides: these kinds of representations are referred to as localist representations, meaning that there's one point in which something is represented. So what you've got here is the representation of motel, and here is the representation of hotel. It's in one place in the vector that each word is represented, and they'll be different to what we do next. So there's an alternative idea of semantics, which goes back quite a long way. People commonly quote this quote of J.R. Firth, who was a British linguist, who said in 1957, you shall know a word by the company it keeps. But it also goes back to philosophical work by Wittgenstein and others: that what you should do is represent a word's meaning by the context in which it appears. So the words that appear around the word give information about its meaning. And so that's the idea of what's called distributional semantics, in contrast to denotational semantics. So if I want to know about the word banking, I say, give me some sentences that use the word banking. Here are some sentences using the word banking: government debt problems turning into banking crises, as happened in 2009, etc., etc. And knowing about that context, the words that occur around banking, those will become the meaning of banking. And so we are going to use those statistics about words and what other words appear around them in order to learn a new kind of representation of a word. So a new representation of words is we're going to represent them now as a dense, sort of shorter, dense vector that gives the meaning of the word. Now, my vectors are very short here; these are only eight-dimensional, if I counted right, so I could fit them on my slide. They're not that short in practice; they might be 200 to 2,000, but reasonably short. They're not going to be like the half a million of the half a million different words in our vocabulary. And the idea is, if words have stuff to do with each other, they'll have sort of similar vectors, which corresponds to their dot product being large. So for banking and monetary, in my example here, both of them are positive in the first dimension, positive in the second dimension, negative on the third. On the fourth, they've got opposite signs. So if we want to work out the dot product, we're taking the product of the corresponding terms, and it will get bigger to the extent that the corresponding components have the same sign, and bigger if they have large magnitude. Okay. So these are what we call word vectors, which are also known as embeddings, or neural word representations, or phrases like that. And so the first thing we want to do is learn good word vectors for different words. And our word vectors will be good word vectors if they give us a good sense of the meanings of words, if they know which words are similar to other words in meaning. We refer to them as embeddings because we can think of this as a vector in a high-dimensional space, so that we're embedding each word as a position in that high-dimensional space. And the dimensionality of the space will be the length of the vector.
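As a small illustration of the dot-product idea just described (the eight-dimensional numbers below are made up; they only mirror the sign pattern mentioned for banking and monetary, and skillet is added as an unrelated word):

```python
import numpy as np

# Made-up 8-dimensional vectors: banking and monetary agree in sign on the
# first three dimensions and disagree on the fourth, as described above.
# Real embeddings are learned and typically 100 to 2,000 dimensional.
banking  = np.array([ 0.29,  0.21, -0.41,  0.15, -0.02,  0.30,  0.12, -0.18])
monetary = np.array([ 0.41,  0.08, -0.22, -0.31,  0.05,  0.26,  0.19, -0.07])
skillet  = np.array([-0.35,  0.02,  0.44,  0.10, -0.27, -0.33,  0.06,  0.21])

def dot(u, v):
    """Sum of component-wise products: large and positive when the
    components tend to share signs and have large magnitudes."""
    return float(np.dot(u, v))

print(dot(banking, monetary))  # comparatively large: related words
print(dot(banking, skillet))   # smaller (negative here): unrelated words
```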
So it might be something like a 300-dimensional space. Now, that kind of gets problematic, because human beings can't look at 300-dimensional spaces and aren't very good at understanding or visualizing what goes on in them. So the only thing that I can show you is two-dimensional spaces. But a thing that is good to have somewhat in your head is that really high-dimensional spaces behave extremely differently to two-dimensional spaces. In a two-dimensional space, you're only near something else if you've got similar x and y coordinates; in a high-dimensional space, things can be very near to all sorts of things on different dimensions in the space. And so we can capture different senses of words and ways that words are similar to each other. But here's the kind of picture we end up with. So what we're going to do is learn a way to represent all words as vectors, based on the other words that they occur with in context, and we can embed them into this vector space. And of course, you can't read anything there, but, you know, we can zoom into this space further. And if we zoom into this space and just show a bit of it, well, here's a part of the space where it's showing country words and some other location words. So we've got sort of countries up the top there. We've got some nationality terms, British, Australian, American, European, further down. Or we can go to another piece of the space. And here's a bit of the space where we have verbs. And not only have we got verbs, but, you know, there's actually quite a lot of fine structure here of what's similar, that represents things about verbs. So you've got sort of verbs of, you know, communication (statements, saying, thinking, expecting) grouping together; come and go group together. Down the bottom, you've got forms of the verb have; then you've got forms of the verb to be. Above them you've got become and remain, which are actually sort of similar to the verb to be, because they take these sort of complements of state. So just as you can say I am angry, you can say he remained angry or he became angry, right? So those verbs are the verbs most similar to the verb to be. So we get these kind of interesting semantic spaces, where things that have similar meaning are close by to each other. And so the question is, how do we get to those things? And how we get to those things is, well, you know, there are various ways of doing it, but the one I wanna get through today is showing you word2vec. Okay, I'll pause for 30 seconds for breath. Anyone have a question or anything they want to know? Yes. [A student asks whether these word vectors, built from context information, really solve the problem that a word's meaning can depend on its context: taking the earlier example of proficient versus good, each word gets its own fixed values, but without the context of a particular use, a word embedding alone does not seem sufficient.] Yes, correct. So that's a good thought. Keep it for a few weeks, to some extent. Yeah. So for the first thing we're gonna do, we're just gonna learn one word vector for a string. So we're gonna have a word, let's say it's star, and we're gonna learn one word vector for it. So that absolutely doesn't capture the meaning of a word in context.
So it won't be saying whether it's meaning a Hollywood star or an astronomical star or something like that. So later on, we're going to get on to contextual meaning representation, so wait for that. But the thing I would note, going along with what I said about high-dimensional spaces being weird, is that the cool thing we will already find is our representation for star will be very close to the representations for astronomical words like nebula and whatever other astronomical words, and, you know, simultaneously it'll be very close to words meaning something like a Hollywood star. Help me out, know any words that mean something similar? Celebrity. That's a good one. Okay. Yeah. [A question about how the embeddings are visualized in a lower-dimensional space.] So the pictures I was showing you used a particular method called t-SNE, which is a non-linear dimensionality reduction that tends to work better for high-dimensional neural representations than PCA, which you might know, but I'm not going to go into that now. Yes. [A question about how the dimensionality of the space is chosen.] I mean, that's something that people have worked on. It depends on how much data you've got to make your representations over. You know, normally it's worked out either empirically for what works best, or practically based on how big vectors you want to work with. I mean, to give you some idea, you know, things start to work well when you get to a 100-dimensional space. For a long time, people used 300 dimensions, because that seemed to work pretty well. But as people have started building huger and huger models with way, way more data, it's now become increasingly common to use numbers like 1,000 or even 2,000-dimensional vectors. Okay. [Another question:] so, you mentioned that there's sort of hidden structure in small areas as well as large areas of the embedding, and with high dimensionality different structures would come up; but generally we seem to use distance as the single metric for closeness, so does it only use distance? We also use directions in the space as having semantic meanings; I'll show an example of that soon. Yeah. [Question:] so I was wondering, for the entries of the word vector, they seem to be between negative one and one; is there a reason for that, or do we have to bound them? So, good question. I mean, you know, they don't have to be. And the way we're going to learn them, they're not bounded, but, you know, you can bound things. Sometimes people length-normalize so that the vectors are of length one. But at any rate, normally in this work, we use a method called regularization that tries to kind of keep the coefficients small. So they're generally not getting huge. Yeah. [Question:] given a specific word, for example like bank, which we used before in the earlier slide, for the word representation is there a single embedding for each word, or do we have multiple embeddings for each word? Well, what we're doing at the moment is each word, each string of letters, has a single embedding. And you can think of that embedding as kind of an average over all its senses: the financial institution, or it can also mean, like, the riverbank. And then what I said before about star applies. The interesting thing is you'll find that we're able to come up with a representation where our learned representation, because it's kind of an average of those, will end up similar to words that are semantically evoked by both senses.
I think I'd probably better go on at this point. Okay, word2vec. Okay, so word2vec was this method of learning word vectors that was thought up by Tomas Mikolov and colleagues at Google in 2013. You know, it wasn't the first method; there are other people that did methods of learning word vectors that go back to about the turn of the millennium. It wasn't the last; there are ones that come after it as well. But it was a particularly simple one and a particularly, you know, fast-running one. And so it really caught people's attention. So the idea of it is that we start off with a large amount of text. So that can just be thought of as a long list of words. And in NLP, we refer to that as a corpus. Corpus, it's just Latin for body. So, you know, it's exactly the same as if you have a dead person on the floor, right? That's a corpus. No, yeah. So it's just a body, but we mean a body of text, not a live person, or sorry, a dead person. Yeah, if you want to know more about Latin, since there isn't very good classical education these days: corpus, despite the -us ending, is a fourth declension neuter noun. And that means the plural of corpus is not corpi. The plural of corpus is corpora. So I'm sure sometime later in this class, I will read a project or assignment where it refers to corpi, and I will know that that person was not paying attention in the first lecture, or else they should have said corpora as the correct form for that. Okay, I should move on. Okay, so we have our text. Then we know that we're going to represent each word (so this is each word type, you know, star or bank, etcetera, wherever it occurs) by a single vector. And so what we're going to do in this algorithm is we're going to go through each position in the text. And so at each position in the text, which is a list of words, we're going to have a center word and words outside it. And then what we're going to do is use the similarity of the word vectors for the center word and the outside words to calculate the probability that they should have occurred or not. And then we just keep fiddling, and we learn word vectors. Now, you know, at first sight... I'll show this more concretely. Maybe I'll just show it more concretely first. So here's the idea. We're going to have a vector for each word type. So a word type means, you know, the word problems wherever it occurs, which is differentiated from a word token, which is this instance of the word problems. So we're going to have a vector for each word type. And so I'm going to want to know: look, in this text, the word turning occurred before the word into. How likely should that have been to happen? What I'm going to do is calculate a probability of the word turning occurring close to the word into. And I'm going to do that for each word in a narrow context. In the example here, I'm saying I'm using two words to the left and two words to the right. And what I want to do is make those probability estimates as good as possible. So in particular, I want the probability of co-occurrence to be high for words that actually do occur within the nearby context of each other. And so then the question is how I'm going to do that. And once I've done it for that word, I'm going to go along and do exactly the same thing for the next word, and so I can continue through the text in that way. And so what we want to do is come up with vector representations of words that will let us predict these probabilities, quote unquote.
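A tiny sketch of that traversal, using the banking sentence fragment quoted earlier and the window size of two from the example (the snippet and variable names are just for illustration):

```python
# At each position t in the corpus we take a center word and pair it with
# every word within a window of size m on either side.
corpus = "government debt problems turning into banking crises".split()
m = 2  # window size: two words to the left and two to the right

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        pairs.append((center, corpus[t + j]))  # (center word, outside word)

print([p for p in pairs if p[0] == "into"])
# [('into', 'problems'), ('into', 'turning'), ('into', 'banking'), ('into', 'crises')]
```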
Well, now, you know, there's a huge limit to how well we can do it, because, you know, we've got a simple model. Obviously, when you see the word banking, I can't tell you that the word into is going to occur before banking, but, you know, I wanna do it as well as possible. So what I want my model to say is: after the word banking, crises is pretty likely, but the word skillet is not very likely. And if I can do that, I'm doing a good job. And so we turn that into a piece of math. Here's how we do it. So we're going to go through our corpus, every position in the corpus, and we're going to have a fixed window size m, which was two in my example. And then what we're going to want to do is have the probability of words in the context being as high as possible. So we want to maximize this likelihood, where we're going through every position in the text, and then we're going through every word in the context, and sort of wanting to make this big. Okay? So conceptually, that's what we're doing, but in practice, we never quite do that. We use two little tricks here. The first one is, you know, for completely arbitrary reasons that really make no difference, everyone got into minimizing things rather than maximizing things. And so the algorithms that we use get referred to as gradient descent, as you'll see in a moment. So the first thing we do is put a minus sign in front, so that we can minimize it rather than maximize it. That part's pretty trivial. But the second part is: here we have this enormous product, and working with enormous products is more difficult for the math. So the second thing that we do is introduce a logarithm. And once we take the log of the likelihood, then when we take logs of products, they turn into sums. And so now we can sum over each word position in the text, sum over each word in the context window, and then sum these log probabilities. And then we've still got the minus sign in front. So we want to minimize the sum of log probabilities. So what we're doing is then wanting to look at the negative log likelihood. And then the final thing that we do is, you know, since this will get bigger depending on the number of words in the corpus, we divide through by the number of words in the corpus. And so our objective function is the average negative log likelihood. So by minimizing this objective function, we're maximizing the probability of words in the context. Okay, we're almost there. That's what we want to do. We've got a couple more tricks that we want to get through. The next one is: well, I've said we want to maximize this probability. How do we maximize this probability? What is this probability? We haven't defined how we're going to calculate this probability. This is where the word vectors come in. So we're going to define this probability in terms of the word vectors. So we're going to say each word type is represented by a vector of real numbers, say 100 real numbers. And we're going to have a formula that works out the probability simply in terms of the vectors for each word; there are no other parameters in this model. So over here, I've shown this theta, which is the parameters of our model, and all and only the parameters of our model are these word vectors for each word in the vocabulary. That's a lot of parameters, because we have a lot of words and we've got fairly big word vectors, but they are the only parameters. Okay?
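Written out cleanly, the likelihood and the objective this passage walks through are (T is the number of positions in the corpus, m the window size, and theta the collection of word vectors):

```latex
% Likelihood: at every position t = 1, ..., T, predict each word within a
% window of size m around the center word w_t, given the parameters \theta.
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\!\left(w_{t+j} \mid w_t ; \theta\right)

% Objective: the average negative log-likelihood. Minimizing J(\theta)
% is the same as maximizing the likelihood L(\theta).
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\!\left(w_{t+j} \mid w_t ; \theta\right)
```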
And how we do that is by using this little trick here: we're going to say the probability of an outside word given a center word is going to be defined in terms of the dot product of the two word vectors. So if things have a high dot product, they'll be similar, and therefore they'll have a high probability of co-occurrence, where, I mean, similar in a kind of a weird sense, right? It is the case that we're gonna want to say hotel and motel are similar, but, you know, it's also the case that we're going to want to have the word 'the' able to appear easily before the word student. So in some weird sense, 'the' also has to be similar to student; it has to be similar to basically any noun, right? Okay. So we're going to work with dot products, and then we do this funky little bit of math here, and that will give us our probabilities. Okay, so let's just go through the funky bit of math. So here's our formula for the probabilities. So what we're doing here is we're starting off with this dot product, right? So the dot product is: you take the two vectors, you multiply each component together, and you sum them. So if they're both the same sign, that increases your dot product, and if they're both big, it increases it a lot. Okay, so that gives us a similarity between two vectors, and that's unbounded. That's just a real number. It can be either negative or positive, okay? But what we'd like to get out is a probability. So for our next tricks, we first of all exponentiate, because if we take e to the x, for any x we now get something positive out, right? That's what exponentiation does. And then, well, since it's meant to be a probability, we'd like it to be between zero and one. And so we turn it into numbers between zero and one in the dumbest way possible, which is we just normalize: we work out the quantity in the numerator for every possible context word, and so we get the total of all of those numbers and divide through by it, and then we're getting a probability distribution of how likely different words are in this context. Okay, yeah. So this little trick that we're doing here is referred to as the softmax function. So for the softmax function, you can take unbounded real numbers, put them through this little softmax trick that we just went through the steps of, and what you'll get out is a probability distribution. So I'm now getting, in this example, a probability distribution over context words. My probability estimates over all the context words in my vocabulary will sum up to one, by definition, by the way that I've constructed this. So it's called the softmax function because it amplifies the probabilities of the largest things; that's because of the exp function. But it's soft because it still assigns some probability to smaller items. You know, it's sort of a funny name, because, you know, when you think about max, I mean, max normally picks out just one thing, whereas the softmax is turning a bunch of real numbers into a probability distribution. So this softmax is used everywhere in deep learning. Anytime that we're wanting to turn things that are just vectors in R^n into probabilities, we shove them through a softmax function. Okay. In some sense, this part, I think, still seems very abstract. And I mean, the reason it seems very abstract is because I've sort of said we have vectors for each word, and using these vectors, we can then calculate probabilities. But where do the vectors come from?
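Here is a small sketch of that softmax computation (the vocabulary size, dimensionality, and random vectors are invented for the example; subtracting the maximum score is a standard numerical-stability trick, not something discussed in the lecture):

```python
import numpy as np

def prob_outside_given_center(o, c, U, V):
    """
    Softmax probability P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    U: outside-word vectors, one row per vocabulary word.
    V: center-word vectors, one row per vocabulary word.
    o, c: vocabulary indices of the outside word and the center word.
    """
    scores = U @ V[c]                        # dot product of v_c with every u_w
    scores -= scores.max()                   # stability trick; doesn't change the result
    exp_scores = np.exp(scores)              # exponentiate: everything becomes positive
    return exp_scores[o] / exp_scores.sum()  # normalize into a probability

# Tiny made-up example: vocabulary of 5 words, 4-dimensional vectors.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(prob_outside_given_center(o=2, c=0, U=U, V=V))
```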
And the answer to where the vectors are going to come from is that we're going to turn this into an optimization problem. We have a large amount of text, and so, therefore, we can hope to find word vectors that make the probability of the contexts of the words in our observed text as big as possible. So literally, what we're going to do is we're going to start off with random vectors for every word, and then we want to fiddle those vectors so that the calculated probabilities of words in the context go up. And we're going to keep fiddling until they stop going up anymore, and we're getting the highest probability estimates that we can. And the way that we do that fiddling is we use calculus. So, you know, what we're going to do is kind of conceptually exactly what you do if you're in something like a two-dimensional space, like the picture on the right, right? If you want to find the minimum in this two-dimensional space and you start off at the top left, what you can do is say, let me work out the derivatives of the function at the top left. And they sort of point down and a bit to the right. And so you can walk down and a bit to the right, and you can say, oh gee, given where I am now, let me work out the derivatives; what direction do they point? And they're still pointing down, but a bit more to the right. So you can walk a bit further that way, and you can keep on walking, and eventually you'll make it to the minimum of the space. In our case, we've got a lot more than two dimensions. So our parameters for our model are the concatenation of all the word vectors. And it's even slightly worse than I've explained up until now, because actually, for each word, we assume two vectors: one vector when it's the center word and one vector when it's the outside word. Doing that just makes the math a bit simpler, which I can explain later. So if we, say, have 100-dimensional vectors, we'll have 100 parameters for aardvark as an outside word, 100 parameters for a as an outside word, all the way through to 100 parameters for zebra as an outside word; then we'll have 100 parameters for aardvark as a center word, continuing down. So, you know, if we had a vocabulary of 400,000 words and 100-dimensional word vectors, that means we'd have 400,000 times two, which is 800,000, times 100; we'd have 80 million parameters. So that's a lot of parameters in our space to try and fiddle, to optimize things. But luckily, we have big computers, and that's the kind of thing that we do. So we simply say, gee, this is our optimization problem; we're going to compute the gradients of all of these parameters, and that will give us the answer of what we have. This feels like magic. I mean, it doesn't really seem like, you know, we could just start with nothing; we could start with random word vectors and a pile of text and say, do some math, and we will get something useful out. But the miracle of what happens in these deep learning spaces is we do get something useful out. We can just minimize over all of the parameters, and then we'll get something useful out. So what I wanted to... I guess I'm not going to quite get to the end of what I hoped to today, but what I wanted to do is sort of get through some of what we do here. But, you know, I wanted to take a few minutes to sort of go through concretely how we do the math of minimization. Now, lots of different people take CS224N, and some of you know way more math than I do.
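The update being described here is ordinary gradient descent on theta; written out, with alpha as the step size (a hyperparameter the lecture doesn't name at this point):

```latex
% \theta stacks every center vector v_w and every outside vector u_w: with a
% vocabulary of V words and d-dimensional vectors, \theta \in \mathbb{R}^{2Vd}
% (e.g. 2 x 400{,}000 x 100 = 80 million parameters, as in the example above).
\theta^{\text{new}} \;=\; \theta^{\text{old}} \;-\; \alpha \, \nabla_{\theta} J(\theta)
```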
And so this next ten minutes might be extremely boring, and if that's the case, you can either catch up on Discord or Instagram or something else, or you can leave. But it turns out there are other people that do CS224N that can't quite remember when they last did a math course, and we'd like everybody to be able to learn something about this. So I do actually like, in the first two weeks, to kind of go through it a bit concretely. So let's try to do this. So this was our likelihood. And then we already covered the fact that what we were going to do was have an objective function in terms of our parameters, which was the average negative log likelihood across all the words. Remember the notation for this, the sums in the loops. I'll probably have a hard time writing this: the sum over positions. I've got a more neatly written-out version of it that appears on the version of the slides that's on the web. And then we're going to be taking this log of the probability of the word at position t plus... sorry, position t plus j, given the word at position t. Okay. Trying to write this on my iPad is not working super well, I'll confess. We'll see how I get on. w_t, okay. And so then we had the form of what we wanted to use for the probability, and the probability of an outside word, given the center word, was then this softmaxed equation, where we were taking the exp of the dot product of the outside vector and the center vector, over the normalization term, where we sum over the vocabulary. Okay. So, to work out how to change our parameters: our parameters are all of these word vectors that we summarize inside theta. What we're then going to want to do is work out the partial derivative of this objective function with respect to all the parameters theta. But, you know, in particular, I'm going to just start doing here the partial derivatives with respect to the center word vector, and we can work through the outside words separately. Well, this partial derivative is a big sum, and it's a big sum of terms like this. And so when I have a partial derivative of a big sum of terms, I can work out the partial derivatives of each term independently and then sum them. So what I want to be doing is working out the partial derivative of the log of this probability, which equals the log of that, with respect to the center vector. And so at this point, I have a log of two things being divided, and so that means I can separate that out as the log of the numerator minus the log of the denominator. And so what I'll be doing is working out the partial derivative with respect to the center vector of the log of the numerator, the log of exp of u_o transpose v_c, minus the partial derivative with respect to the center vector of the log of the denominator, which is the log of the sum over w equals one to V of exp of u_w transpose v_c. Okay, I'm having real trouble here writing; I'd look at the slides where I wrote it neatly at home. Okay, so I want to work with these two terms. Now, at this point, part of it is easy, because here I just have a log of an exponential, and so those two functions just cancel out and go away. And so then I want to get the partial derivative of u_outside transpose v_center with respect to the center vector. And what you get for the answer to that is that it just comes out as u_o. And maybe you remember that. But if you don't remember that, the thing to think about is: okay, this is a whole vector, right? And so we've got a vector here and a vector here. So what this is going to look like is sort of u_1 v_1 plus u_2 v_2 plus u_3 v_3, etcetera, along the vector.
And so what we're going to want to do is work out the partial derivative with respect to each element v_i, right? And so if you just think of the derivative of a single element, with respect to v_1, well, it's going to be just u_1, because every other term would go to zero. And then if you worked it out with respect to v_2, then it'd be just u_2, and every other term goes to zero. And since you keep on doing that along the whole vector, what you're going to get out is the vector u_1, u_2, u_3, and so on: the whole vector u_o. Okay, so that part is easy, but then we also want to work out the partial derivative of the other term. And at that point, I maybe have to go to another slide. So we then want the partial derivative with respect to v_c of the log of the sum over w equals one to V of exp of u_w transpose v_c, right? So at this point, things aren't quite so easy, and we have to remember a little bit more calculus. So in particular, what we have to remember is the chain rule. So here we have this inside function, so that we've got a function: we've got a function g of v_c, and we might say the output of that is z. And then we've put outside that an extra function f. And so when we have something like that, what we get is that the derivative of f with respect to v_c is the derivative of f with respect to z, times the derivative of z with respect to v_c. That's the chain rule. So we are going to then apply that here. So first of all, we're going to take the derivative of log. And the derivative of log is one over x. You have to remember that, or look it up, or get Mathematica to do it for you or something like that. And so we're going to have one over the inside z part, the sum over w equals one to V of exp of u_w transpose v_c. And then that's going to be multiplied by the derivative of the inside part. So then we're going to have the derivative with respect to v_c of the sum over w equals one to V of exp of u_w transpose v_c. Okay, so that's made us a little bit of progress, but we've still got something to do here. And so, well, what we're going to do here is notice: oh wait, we're again in a position to run the chain rule again. So now we've got this function. Well, so first of all, we can move the sum to the outside, right? Because we've got a sum of terms, w equals one to V. And so we want to work out the derivatives of the inside piece with respect to it. Sorry, I'm doing this kind of informally, of just doing this piece now. Okay. So this again gives us an f of a function g. And so we're going to again want to split the pieces up, and so use the chain rule one more time. So you're going to have the sum over x equals one to V. And now we have to know the derivative of exp: the derivative of exp of x is exp of x. So that will be exp of u_x transpose v_c. And then we're taking the derivative of the inside part with respect to v_c, of u_x transpose v_c. Well, luckily, this was the bit that we already knew how to do, because we worked it out before. And so this is going to be the sum over x equals one to V of exp of u_x transpose v_c, times u_x. Okay. So then at this point, we want to combine these two forms together: we want to combine this part that we worked out and this piece here that we've worked out. And if we combine them together with what we worked out on the first slide for the numerator, since there we have the u_o, which was the derivative of the numerator, then for the derivative of the denominator we're going to have on top this part, and then on the bottom we're going to have that part.
And so we can rewrite that as the sum from W equals one to v of the x of ux. T V zero times ux over the sum, sorry, x equals one to v sum over W equals one to v of vx. This part here of uw. Okay, so we can rearrange things in that form. And then lo and behold, we find that we've recreated here this form of the softlex equation. So we end up with U zero minus the sum over x equals one to v of the probability of x given c times U of x. So what this is saying is we're wanting to have this quantity which takes the actual observed U vector, and it's comparing it to the weighted prediction. So we're taking the weited sum of our current ux vectors based on how likely they were to occur. And so this is a form that you see quite a bit in these kind of derivatives. You get observed minus expected, the weighted average. And so what youlike to have is your expectation the weighted average be the same as what was observed, because then you'll get a derivative of zero, which means that you've hit a maximum. And so that gives us the form of the derivative of the that we're having with respect to the center vector parameters to finish it off youhave to then work it out also for the outside vector parameters. But Hey, it's officially the end of class time, so I'd better wrap up quickly now. But you know so the deal is we're gonna to work out all of these derivatives for each parameter, and then these derivatives will give a direction to change numbers, which will let us find good word vectors automatically. I do want you to understand how this works, but fortunately you'll find out very quickly that computers will do this for you and on a regular basis, you don't actually have to do it yourself. More about that on Thursday. Okay, see you everyone.
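For readers following along, here is a cleaned-up write-up of the derivation spoken above, using the u_o / v_c notation defined in the summary below (this is a transcription aid, not a reproduction of the actual slide):

```latex
\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
  &= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\
  &= \frac{\partial}{\partial v_c}\, u_o^\top v_c
     \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \\
  &= u_o \;-\; \frac{\sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\
  &= u_o \;-\; \sum_{x=1}^{V} P(x \mid c)\, u_x
\end{aligned}
```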
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture is the first session of Stanford's CS224N "Natural Language Processing with Deep Learning" for Spring 2024, taught by Professor Christopher Manning. The lecture first covers the overall course logistics, teaching staff, learning goals, grading, and academic-integrity expectations, emphasizing that the course aims to give students command of both foundational and cutting-edge deep learning methods for NLP, an understanding of the complexity of human language, and the ability to build practical NLP systems. It then discusses the importance of human language and its multiple roles in cognition, communication, and social interaction, and reviews the revolutionary progress deep learning has brought to NLP, such as machine translation, question-answering search, and the striking capabilities of large language models (LLMs) like the GPT series and multimodal models such as DALL-E. The core technical content focuses on word vectors, tracing the shift from traditional symbolic representations (WordNet, one-hot vectors) to distributed representations (distributional semantics). The lecture then presents the Word2Vec algorithm (Mikolov et al., 2013) in detail: its core idea (predicting context words from a center word), its objective function (maximizing the probability of observed word co-occurrences, i.e., minimizing the negative log-likelihood), its parameters (center-word and context-word vectors), and the mathematical derivation of computing probabilities with the softmax function and optimizing with gradient descent. The lecture stresses that although Word2Vec learns a single vector per word, the properties of high-dimensional spaces let these vectors capture a word's multiple semantic associations.
Course Overview and Logistics
- Instructor: Professor Christopher Manning.
- Teaching team:
  - Head TA: [named in the lecture; absent that day due to a health issue].
  - Course Manager: [noted as present at the back of the room].
  - Many other TAs: some present, who waved to the class.
- Course popularity: At the start of the lecture, Professor Manning notes that even though the course was taught just last quarter, it has again drawn a huge enrollment this term, which he jokingly attributes to the impact of "ChatGPT".
- Where to find course information:
  - Course website: the primary source of information.
  - Ed Discussion: the main discussion platform, preferred over email.
- First assignment:
  - Assignment 1 (warm-up): already live on the website as a Jupyter Notebook, designed to get students working right away.
  - Due date: next Tuesday before class (slightly under 7 days remaining).
- Support sessions:
  - Office hours: start the next day.
  - Tutorials: held on Fridays.
  - First tutorial: an introduction to Python and NumPy, aimed at students with less programming background, run by a TA [named in the lecture and noted as present].
- Plan for today's lecture:
  - Course introduction.
  - Remarks on human language and word meaning.
  - The Word2Vec algorithm (the main technical content).
  - Word2Vec's objective function, gradients, and optimization.
  - (Time permitting) an iPython Notebook demo.
Core Course Goals
Professor Manning lays out three main learning goals for the course:
- Master foundational and cutting-edge deep learning methods for NLP:
  - Start from the basics: word vectors, feed-forward networks, recurrent networks, attention.
  - Move quickly to the core methods of NLP in 2024: Transformers and encoder-decoder models. Manning adds: "We should probably also put large language models on this list."
  - Advanced topics: pretraining and post-training of large language models, model adaptation, model interpretability, agents, and more.
- Understand human language and why it is hard for computers:
  - Convey the complexity of linguistic structure to students without a linguistics background.
  - Examine why computers find human language so difficult even though humans acquire it effortlessly.
- Be able to build practical NLP systems:
  - The goal is for students to be able to apply what they have learned in real work (startups, large companies, nonprofits, etc.), for example by building a text classification system or an information extraction system.
  - The emphasis is on practice: "I can build that system, because I took CS224N."
Grading and Academic Integrity
- Grade breakdown:
  - 4 assignments: roughly 1.5 weeks each (except the first), together nearly half of the total grade.
  - Final project: the other roughly half of the grade, with two options:
    - Default final project: substantial scaffolding and guidance, but still open-ended.
    - Custom final project: students choose their own topic.
  - Participation: a small percentage.
- Late policy: 6 late days in total.
- Collaboration policy:
  - Assignments: must be done individually. Discussing with classmates is allowed, but the submitted work must be your own.
  - Final project: may be done in teams.
- Use of AI tools:
  - Using AI tools to assist with tasks such as coding is encouraged.
  - Directly using tools like ChatGPT to answer assignment questions is prohibited. As Manning puts it: "That's not the way to learn."
Assignments and Project Overview
- Assignment 1: a simple introductory Jupyter Notebook.
- Assignment 2:
  - Involves mathematical derivations and understanding how neural networks work. Manning calls it "the scariest assignment of the whole course for some people."
  - Introduces PyTorch (the software package used to build neural networks).
  - Build a dependency parser.
- Assignments 3 & 4:
  - Larger projects built on PyTorch and GPUs.
  - Will use Google Cloud.
  - Cover machine translation and Transformer-based information extraction.
- Final project options:
  - Default project: plenty of scaffolding and guidance, but students are still encouraged to explore different ways of improving the system.
  - Custom project: students design and carry out a project entirely on their own.
- Final project mentor assignment (from the Q&A):
  - Students may find their own mentor.
  - Otherwise, a course TA serves as mentor, assigned by the TA in charge of final projects according to expertise and student numbers.
Human Language and Word Meaning
- Language and intelligence:
  - Manning argues that language is one of the key markers of intelligence that distinguishes humans from close relatives such as chimpanzees. Chimpanzees resemble humans in tool use, planning, and memory (their short-term memory is even better than ours), but the absence of language is the striking difference.
  - Language lets humans communicate, which enabled "human ascendancy."
- The dual role of language:
  - A communication tool: enables information sharing.
  - Scaffolding for higher-level thought: Manning suggests language lets humans carry out more complex thinking and planning, such as working through plans in an inner monologue.
- The importance of writing:
  - Writing is an invention of only the last roughly 5,000 years.
  - It vastly accelerated the spread of knowledge across time and space and was key to moving from the Bronze Age to today's technological society.
- Language as social and dynamic:
  - Language is a flexible social tool, full of imprecision, nuance, and emotion.
  - It is not fixed: it is constructed by people and keeps evolving from generation to generation.
  - Linguistic innovation happens mostly among young people (teens to twenties).
- A quote from Herb Clark: > "The common misconception is that language use has primarily to do with words and what they mean. It doesn't: it has primarily to do with people and what they mean."
Deep Learning Progress in NLP
- Historical background: NLP research began in the 1950s, but for the roughly 60 years before deep learning, computers' ability to understand and generate language was limited.
- The breakthroughs of the last decade: deep learning (neural networks) has brought enormous progress.
- Machine translation:
  - Breakthroughs began around 2014; by 2016 neural translation was live in services such as Google Translate.
  - Cross-language communication has become extremely easy, close to the "Babel Fish" of The Hitchhiker's Guide to the Galaxy, except that the carrier is a phone.
  - Example: using Google Translate to read a Swahili-language news story from Kenya (about two Malawian ministers who died of COVID-19).
- Modern search engines:
  - Have evolved from keyword matching into "answer engines" that understand a question and give an answer directly.
  - Example: asking "When did Kendrick Lamar release his first album?"
  - How it works: typically several neural networks (a retrieval network, a re-ranking network, and a reading-comprehension network).
- Large Language Models (LLMs):
  - GPT-2 (around 2019): a major milestone for NLP, the first system to produce high-quality fluent text.
    - Example: continuing a story that begins "A train carriage containing controlled nuclear materials was stolen in Cincinnati today...".
    - GPT-2 showed command of background knowledge (Cincinnati is in Ohio, the U.S. Department of Energy manages nuclear materials, the nature of a theft, etc.).
    - How it works: given the text generated so far, it predicts the next most likely word, one word at a time (see the sketch after this list).
  - ChatGPT / GPT-4: a huge success in user experience, able to understand and follow instructions.
    - Example: asking it to draft a polite email to a boss named Jeremy requesting time off because a 9-year-old son [the prompt says "song", which should be "son"] named Peter [as in Spider-Man] is sick. The model corrects the spelling errors and produces an appropriate email.
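The lecture only describes this next-word loop verbally and shows sample outputs. As an illustrative sketch (not something shown in class), similar behavior can be reproduced with the Hugging Face `transformers` library, assuming it is installed and can download the public "gpt2" checkpoint:

```python
# Illustrative sketch only: autoregressive next-word generation with GPT-2.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A train carriage containing controlled nuclear materials was stolen in Cincinnati today."
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# At every step the model scores all vocabulary items given the text so far,
# a likely next token is sampled, and it is appended to the input.
output = model.generate(input_ids, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```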
- Multimodal models:
  - Stanford's term "foundation models": extending LLM technology to other modalities (images, audio, biological data such as DNA/RNA, seismic waves, etc.).
  - Text-to-image:
    - DALL-E 2 examples: generating "a train crossing the Golden Gate Bridge" (while noting that the bridge has no train tracks), plus variants adding "the bay in the background," "in pencil-sketch style," and "with cars as well."
    - DALL-E 3 example: generating an image that closely matches a much more complex description (e.g., "a graphic-novel illustration of a bustling city street under a full moon...").
Representing Word Meaning
- The traditional linguistic view (denotational semantics):
  - Meaning is treated as a pairing between a symbol (the signifier, e.g., a word) and the idea or thing it stands for (the signified).
  - Example: the meaning of "tree" is the set of all trees in the world.
  - Programming languages have an analogous notion.
- Early computational representations:
  - WordNet: an elaborate thesaurus containing word relations such as the following (a short query sketch appears after this list):
    - Synonyms.
    - Hypernym/hyponym ("is-a kind of") relations: e.g., a panda is a kind of carnivore, and a carnivore is a kind of placental mammal.
  - Limitations of WordNet:
    - Missing nuance: e.g., WordNet may list "proficient" as a synonym of "good," which is inappropriate in many contexts ("a good shot" vs. "a proficient shot").
    - Incompleteness: slang and new words are missing.
    - Expensive to build by hand: hard to update and maintain.
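As a hedged illustration (the lecture shows WordNet entries on a slide but no code), queries like these can be reproduced with NLTK's WordNet interface, assuming `nltk` and its WordNet data are available:

```python
# Illustrative sketch: querying WordNet via NLTK (not code from the lecture).
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet corpus if missing
from nltk.corpus import wordnet as wn

# Synonym sets for "good" -- note how coarse and context-free the groupings are.
for synset in wn.synsets("good")[:5]:
    print(synset.name(), synset.lemma_names())

# "Is-a" (hypernym) chain for the giant panda: procyonid, carnivore, placental mammal, ...
panda = wn.synset("giant_panda.n.01")
print(list(panda.closure(lambda s: s.hypernyms())))
```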
- One-hot vector representations:
  - Treat every word as an independent symbol in the vocabulary.
  - Each word is a high-dimensional vector with a 1 in that word's dimension and 0 everywhere else.
  - Problems:
    - No notion of word similarity: the one-hot vectors of any two distinct words are orthogonal, so their dot product is 0. For example, "motel" and "hotel" look completely unrelated (see the sketch after this list).
    - Requires extra resources (e.g., word-similarity lists for query expansion) to compensate.
  - These are "localist representations": a word's meaning is concentrated at a single position in the vector.
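A minimal sketch of the orthogonality problem, using a toy vocabulary chosen for illustration (not from the lecture):

```python
# One-hot vectors: every word gets its own dimension, so any two distinct
# words have dot product 0 and therefore no similarity signal at all.
import numpy as np

vocab = ["motel", "hotel", "banking", "crisis"]          # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))   # 0.0 -- looks unrelated
print(one_hot("motel") @ one_hot("motel"))   # 1.0 -- only identical words match
```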
Word Vectors (Embeddings)
- Distributional semantics:
  - The core idea goes back to J.R. Firth (1957): > "You shall know a word by the company it keeps." It was also influenced by philosophers such as Wittgenstein.
  - A word's meaning is determined by the contexts (the surrounding words) in which it occurs.
  - Example: understanding "banking" by looking at the sentences it appears in (e.g., "Government debt problems turning into banking crises...").
- Word vectors / word embeddings:
  - Represent a word as a relatively low-dimensional dense vector (8-dimensional in the example; in practice perhaps 200 to 2000 dimensions).
  - Similarity: words with similar meanings get similar vectors (a large dot product); see the sketch after this list.
  - Embeddings: every word becomes a point in a high-dimensional space.
  - Properties of high-dimensional space: it behaves very differently from 2D space and can capture many dimensions of similarity at once.
  - Visualization: usually projected down to 2D with dimensionality reduction (e.g., t-SNE, a nonlinear method that works better than PCA here).
  - Example plots: country names and nationality words cluster together; different kinds of verbs (communication verbs, motion verbs, forms of be and have) form separate clusters.
- Points from the Q&A:
  - Polysemy: the word vectors introduced here learn a single vector per word (string), which can be seen as an average over all of its senses. For example, the vector for "star" is close both to astronomy words (such as "nebula") and to celebrity-related words. Context-dependent representations come later in the course.
  - Choosing the dimensionality: usually set empirically or by practical constraints (data size, model size), historically 100 or 300 dimensions and now up to 1000 or even 2000.
  - Similarity measure: not only distance matters; the direction of a vector also carries meaning.
  - Range of values: not strictly confined to [-1, 1], but regularization and similar techniques keep the values from getting too large.
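As a toy illustration of "similar words get similar vectors" (the 4-dimensional values below are made up for the example; real embeddings have hundreds of dimensions):

```python
# Dense word vectors: related words point in similar directions, so both the
# dot product and the cosine similarity are larger for related pairs.
import numpy as np

vecs = {
    "hotel":   np.array([ 0.9,  0.1,  0.3, -0.2]),
    "motel":   np.array([ 0.8,  0.2,  0.2, -0.1]),
    "banking": np.array([-0.3,  0.7, -0.5,  0.4]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs["hotel"], vecs["motel"]))    # high: near-synonyms
print(cosine(vecs["hotel"], vecs["banking"]))  # low: unrelated words
```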
The Word2Vec Algorithm in Detail
- Who and when: introduced by Tomáš Mikolov and his colleagues at Google in 2013.
- Core idea:
  - Start from a large corpus of text. Manning jokes about the Latin origin of "corpus" ("body") and stresses that the plural is "corpora," not "corpi."
  - Learn a vector representation for every word (word type) in the vocabulary.
  - Walk through every position in the text, identifying a center word (c) and its context/outside words (o).
  - The goal is to adjust the word vectors so that the context words can be predicted well from the center-word vector, or vice versa.
  - Use the similarity between the center-word vector and a context-word vector to compute the probability of their co-occurrence.
- Concrete procedure (the skip-gram variant, which the lecture uses as its main example; a small sketch follows this list):
  - For each center word w_t in the text:
    - Consider a fixed-size context window (window size m, e.g., 2 words on each side of the center word).
    - The goal is to maximize the probability of the actual context words appearing when that center word occurs.
  - Example: for the fragment "... government debt problems turning into banking crises ...", if the center word is "into", the context words include "turning" and "banking" (with a window of 1). The model's goal is to make P("turning" | "into") and P("banking" | "into") as high as possible.
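A minimal sketch of how these (center word, context word) training pairs are read off a corpus, using the fragment from the example above (not code from the lecture):

```python
# Enumerate skip-gram training pairs: each position t contributes a center word
# and up to 2*m context words from a window of size m on either side.
corpus = "government debt problems turning into banking crises".split()
m = 2  # window size

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:4])
# [('government', 'debt'), ('government', 'problems'),
#  ('debt', 'government'), ('debt', 'problems')]
```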
The Math of Word2Vec: Objective Function and Gradient Optimization
- Objective function:
  - Likelihood:
    L(θ) = Π_{t=1 to T} Π_{-m ≤ j ≤ m, j≠0} P(w_{t+j} | w_t; θ)
    where T is the corpus length, m is the context-window size, and θ stands for all model parameters (i.e., all the word vectors).
  - Negative log-likelihood: for optimization we instead minimize the negative log-likelihood,
    J(θ) = - (1/T) Σ_{t=1 to T} Σ_{-m ≤ j ≤ m, j≠0} log P(w_{t+j} | w_t; θ)
    which is equivalent to maximizing the average log-likelihood.
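A short sketch of computing J(θ) over a tokenized corpus. Here `prob(o, c)` is a hypothetical stand-in for P(o | c; θ) under the current word vectors; the lecture does not define such a function:

```python
# Average negative log-likelihood over the corpus, following J(theta) above.
import numpy as np

def objective_J(corpus, prob, m=2):
    total = 0.0
    T = len(corpus)
    for t, center in enumerate(corpus):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                total -= np.log(prob(corpus[t + j], center))
    return total / T   # the (1/T) average from the formula above
```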
- Parameters θ:
  - The only parameters in the model are the word vectors, one set per vocabulary word.
  - In practice each word w gets two vectors:
    - v_w: the vector used when w is the center word.
    - u_w: the vector used when w is a context word.
  - The parameter count is large: e.g., with a 400,000-word vocabulary and 100-dimensional vectors, there are 400,000 * 100 * 2 = 80,000,000 parameters.
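A small sketch of what those parameters look like in code: two V-by-d matrices of randomly initialized vectors (the initialization scale below is an assumption for illustration, not something specified in the lecture):

```python
# Two vectors per word: row w of center_vecs is v_w, row w of outside_vecs is u_w.
import numpy as np

V, d = 400_000, 100
rng = np.random.default_rng(0)
center_vecs  = rng.normal(scale=0.01, size=(V, d))
outside_vecs = rng.normal(scale=0.01, size=(V, d))

print(center_vecs.size + outside_vecs.size)  # 80,000,000 parameters, as above
```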
- Computing the probability P(o|c; θ) (the softmax function):
  - Dot product: u_o^T v_c measures the similarity between context word o's vector and center word c's vector; the result is an unbounded real number.
  - Exponentiation: exp(u_o^T v_c) maps the dot product to a positive number.
  - Normalization:
    P(o|c) = exp(u_o^T v_c) / Σ_{w'∈V} exp(u_{w'}^T v_c)
    where V is the whole vocabulary. The denominator sums the numerator term over every possible context word, so the output probabilities lie in [0, 1] and sum to 1.
  - This function, which turns a set of real numbers into a probability distribution, is called the softmax. It amplifies the probability of the largest inputs while still assigning some probability to smaller ones.
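A direct sketch of that computation, producing P(w | c) for every word in the vocabulary at once (subtracting the maximum score for numerical stability is standard practice, not a step discussed in the lecture):

```python
# Naive-softmax probabilities: dot products, exponentiation, normalization.
import numpy as np

def softmax_probs(v_c: np.ndarray, outside_vecs: np.ndarray) -> np.ndarray:
    scores = outside_vecs @ v_c              # u_w^T v_c for every w in V
    scores -= scores.max()                   # shift for numerical stability
    exp_scores = np.exp(scores)              # map to positive numbers
    return exp_scores / exp_scores.sum()     # probabilities in [0, 1], summing to 1
```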
- Optimization (gradient descent):
  - Initialization: all word vectors start out random.
  - Iterative optimization:
    - Compute the gradients (partial derivatives) of the objective J(θ) with respect to every parameter (every component of every word vector).
    - Update the parameters in the direction opposite the gradient so as to decrease J(θ):
      θ_new = θ_old - α ∇_θ J(θ) (where α is the learning rate)
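The update rule above, written as a one-step helper; here `grad_J` is a hypothetical function returning ∇_θ J(θ), not something named in the lecture:

```python
import numpy as np

def gradient_descent_step(theta: np.ndarray, grad_J, alpha: float = 0.025) -> np.ndarray:
    """Move the parameters a small step against the gradient to decrease J."""
    return theta - alpha * grad_J(theta)
```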
- Example gradient derivation (with respect to the center vector v_c):
  - Manning derives the partial derivative of log P(o|c) with respect to v_c in detail:
    ∂(log P(o|c)) / ∂v_c = u_o - Σ_{x∈V} P(x|c) u_x
  - The result has an intuitive reading: the "observed context-word vector (u_o)" minus the "expected context-word vector predicted by the current model (Σ P(x|c) u_x)".
  - When the model's predictions match what is actually observed, the gradient approaches zero and the model has been optimized.
  - The gradient with respect to a context vector u_o is derived analogously.
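A small numerical sanity check of the derived formula (added for illustration, not from the lecture): compare the analytic gradient u_o - Σ_x P(x|c) u_x against a finite-difference estimate on a tiny random example.

```python
# Verify d/dv_c log P(o|c) = u_o - sum_x P(x|c) u_x with finite differences.
import numpy as np

rng = np.random.default_rng(0)
V, d, o = 6, 4, 2                      # toy vocabulary size, vector dimension, outside word
U = rng.normal(size=(V, d))            # outside vectors u_w, one row per word
v_c = rng.normal(size=d)               # center vector

def log_prob(v):                       # log P(o | c) as a function of the center vector
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

probs = np.exp(U @ v_c)
probs /= probs.sum()
analytic = U[o] - probs @ U            # u_o minus the expected context vector

eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True: the two gradients agree
```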
- Core mechanism: by iteratively adjusting the word vectors, word pairs (center word and context word) that actually co-occur in real text are assigned higher predicted probability.
Summary and Outlook
- Manning emphasizes that although the derivation involves calculus, in practice these computations are carried out automatically by the computer.
- Starting from random vectors and optimizing mathematically over large amounts of text, one can learn word vectors that effectively capture word meanings and the relationships between words.
- The course continues with these topics on Thursday.