speaker 1: So the thing that seems kind of amazing to me, and to us, is the fact that this course was taught just last quarter, and here we are with an enormous number of people again taking this class. I guess that says something; maybe approximately what it says is ChatGPT. But anyway, it's great to have you all. Lots of exciting content, and I hope you'll all enjoy it. So let me get started and tell you a bit about the course before diving straight into today's content. For people still coming in, there are oodles of seats on either side, especially down the front. There are tons of seats, so do feel empowered to go and seek those out. If people on the aisles are really nice, they could even move towards the edges to make it easier for others. But one way or another, feel free to find a seat. Okay, so this is the plan for what I want to get through today. First of all, I'm going to tell you about the course for a few minutes, then make a few remarks about human language and word meaning. Then the main technical thing we want to get into today is to start learning about the word2vec algorithm. The word2vec algorithm is slightly over a decade old now. It was introduced in 2013, but it was a wildly successful, simple way of learning vector representations of words. So I want to show you that as a sort of first easy baby system for the kind of neural representations that we're going to talk about in class. We're then going to get more concrete with that, looking at its objective function, gradients, and optimization. And then hopefully, if all goes to schedule, we'll spend a few minutes just playing around in an IPython notebook; I'm going to have to change computers for that. Then we'll see some of the things you can do with this. Okay, so here are the course logistics in brief. I'm Christopher Manning. Hi again, everyone.
Our head TA unfortunately has a bit of a health problem, so he's not actually here today. We've got a course manager for the course, who is up the back there. And then we've got a whole lot of TAs. If you're a TA who's here, you could stand up and wave or something like that, so people can see a few of the TAs and see some friendly faces. Okay? We've got some TAs here and some others, and you can look at them all on the website. If you're here, you know what time the class is. There's an email list, but preferably don't use it; use the Ed site that you can find on the course website. So the main place to go and look for information is the course website, which we've got up here, and that then links to Ed, which is what we're going to use as the main discussion board. Please use that rather than sending emails. The first assignment for this class is a sort of easy one, the warm-up assignment, but we want to get people busy and doing stuff straight away. So the first assignment is already live on the web page, and it's due next Tuesday before class. So you have slightly less than seven days left to do it; do get started on that. And to help with that, we're going to be starting office hours immediately, tomorrow, and they're also described on the website. We also do a few tutorials on Fridays. The first of these is a tutorial on Python and NumPy. Many people don't need that because they've done other classes that cover this, but we try to make this class accessible to everybody. So if you'd like to brush up a bit on Python or how to use NumPy, it's a great thing to go along to, and the TA right over there is going to be teaching it on Friday. Okay, what do we hope to teach? You know, at the end of the quarter when you get the eval, you'll be asked to rate whether this class met its learning goals. These are my learning goals. What are they?
So the first one is to teach you about the foundations and current methods for using deep learning applied to natural language processing. This class tries to build up from the bottom. So we start off doing simple things like word vectors, feed-forward neural networks, recurrent networks, and attention. We then fairly quickly move into the kind of key methods used for NLP in 2024. I wrote down here transformers and encoder-decoder models; I probably should have written large language models somewhere in this list as well. But then pre-training and post-training of large language models, adaptation, model interpretability, agents, etcetera. But that's not the only thing that we want to do. There are a couple of other things that we crucially want to achieve. The second is to give you some understanding of human languages and the difficulties in understanding and producing them on computers. Now, there are a few of you in this class who are linguistics majors, or perhaps symbolic systems majors. Yay to the symbolic systems majors. But for quite a few of the rest of you, you'll never see any linguistics, in the sense of understanding how language works, apart from this class. So we do want to try to convey a little bit of a sense of what some of the issues are in language structure, and why it's proven to be quite difficult to get computers to understand human languages, even though humans seem very good at learning to understand each other. And then the final thing that we want to get onto is actually, concretely, building systems, so that this isn't just a theory class. We actually want you to leave this class thinking: oh yeah, in my first job, wherever you go, whether it's at a startup or big tech or some nonprofit, there's something they want to do; it would be useful if we had a text classification system, or if we did information extraction to get some kind of facts out of documents. I know how to build that.
I can build that system because I did CS224N. Okay, here's how you get graded. We have four assignments, mostly one and a half weeks long apart from the first one, and they make up almost half the grade. The other half of the grade is made up of a final project, of which there are two variants, a custom or default final project, which we'll get to in a minute. And then there's a few percent that go for participation. You get six late days. Collaboration policy: like all other CS classes, we've had issues with people not doing their own work. We really do want you to learn things in this class, and the way you do that is by doing your own work. So make sure you understand that. For the assignments, everyone is expected to do their own assignments. You can talk to your friends, but you're expected to do your own assignment. For the final project, you can work as a group. Then we have the issue of AI tools. Now, of course, in this class we love large language models, but nevertheless, we don't want you to do your assignments by saying, hey ChatGPT, could you answer question three for me? That is not the way to learn things. If you want to make use of AI as a tool to assist you, such as for coding assistance, go for it. But we want you to be working out how to answer assignment questions by yourself. Okay? So this is what the assignments look like. Assignment one is meant to be an easy on-ramp, and it's done as a Jupyter notebook. Assignment two then has people, you know, what can I say here? We are at this fine liberal arts and engineering institution; we're not at a coding boot camp. So we hope that people gain some deep understanding of how things work. So in assignment two, we actually want you to do some math and understand how things work in neural networks. For some people, assignment two is the scariest assignment in the whole class.
But then it's also the place where we introduce PyTorch, which is the software package we use for building neural networks, and we build a dependency parser, which we'll get to later as something more linguistic. Then for assignments three and four, we move on to larger projects using PyTorch with GPUs, and we'll be making use of Google Cloud. For those two assignments, we look at doing machine translation and getting information out with transformers. And then there are the two final project options. Essentially, we have a default final project where we give you a lot of scaffolding and an outline of what to do, but it's still an open-ended project. There are lots of different things you can try to make the system work better, and we encourage you to explore. But nevertheless, you're given a leg up, since quite a lot of it is scaffolding. We'll talk about this more, but you can either do that option, or you can come up with totally your own project and do that. Okay, that's the course. Any questions on the course? Yes: for the final project, how are mentors assigned? So if you can find your own mentor, because you're interested in something and there's someone who's happy to mentor you, that person can be your mentor. Otherwise, one of the course TAs will be your mentor. And how that person is assigned is that one of the TAs, who is in charge of final projects, assigns people, and they do the best they can in terms of finding people with some expertise while having to divide all the students across the mentors roughly equally. Any other questions? Okay, I'll power ahead. Human language and word meaning. So let me just say a little bit about the big picture here. We're in the area of artificial intelligence, and we've got this idea that humans are intelligent, and then there's the question of, you know, how does language fit into that? And this is something that there is some argument about.
And if you want to, you can run off onto social media and read some of the arguments about these things, and contribute if you wish. But here is my perhaps biased take as a linguist. Well, you can compare human beings to some of our nearest neighbors, like chimpanzees and bonobos. And one big distinguishing thing is that we have language and they don't. But in most other respects, chimps are very similar to human beings, right? You know, they can use tools. They can plan how to solve things. They've got really good memory; chimps have better short-term memory than human beings do. So in most respects, it's hard to show an intelligence difference between chimps and people, except for the fact that we have language. But us having language has been this enormous differentiator, right? If you look around at what happened on the planet, there are creatures that are stronger than us, faster than us, more venomous than us, with every possible advantage. But human beings took over the whole place. And how did that happen? We had language, so we could communicate, and that communication allowed us to have human ascendancy. So one big role of language is that it allows communication. But I'd like to suggest that it's actually not the only role of language. Language has also allowed humans, I would argue, to achieve a higher level of thought. There are various kinds of thoughts that you can have without any language involved. You can think about a scene; you can move some bits of furniture around in your mind, and there's no language. And obviously emotional responses, feeling scared or excited, happen with no language involved. But I think most of the time when we're doing higher-level cognition, if you're thinking to yourself, argh, my friend seemed upset about what I said last night.
I should probably work out how to fix that, or maybe I could blah, blah, blah. I think we think in language and plan out things, so it's given us a scaffolding to do much more detailed thought and planning. Most recently of all, of course, human beings invented ways to write. Writing is really, really recent. I mean, no one really knows how old human languages are; most people think a few hundred thousand years, not very long by evolutionary time scales. But we do know writing is really, really recent: writing is about 5,000 years old. And writing proved to be, again, this amazing cognitive tool that gave humanity an enormous leg up, because suddenly it's not only that you could share information and learn from the people standing within 50 feet of you. You could then share knowledge across time and space. So really, having writing was enough to take us from the Bronze Age, very simple metalworking, to the kind of, you know, mobile phones and all the other technology that we walk around with today, in just a very short amount of time. So language is pretty cool, but one shouldn't only fixate on the knowledge side of language and how that's made human beings great. There's this other side of language, where language is this very flexible system, which is used as a social tool by human beings, so that we can speak with a lot of imprecision and nuance and emotion in language, and we can get people to understand. We can set up new ways of thinking about things by using words for them. And languages aren't static. Languages change as human beings use them. Languages aren't something that was delivered down on tablets by God. Languages are things that humans constructed, and humans change them with each successive generation.
And indeed, most of the innovation in language happens among young people, people that are either a few years younger than you or about your age; most of you are now in that late-teens-going-into-twenties range, right? That's a big period of linguistic innovation, where people think up cool new phrases and ways of saying things, and some of those get embraced and extended, and that then becomes the future of language. Herb Clark used to be a psychologist at Stanford. He's now retired, but he had this rather nice quote: "The common misconception is that language use has primarily to do with words and what they mean. It doesn't. It has primarily to do with people and what they mean." Okay, so that's language in two slides for you. So now we'll skip ahead to deep learning. In the last decade or so, we've been able to make fantastic progress in doing more with computers understanding human languages, using deep learning. We'll say a bit more about the history later on, but work on trying to do things with human language started in the 1950s. So it's been going for 60 years or so. And there was some progress; it's not that nobody could do anything, but the ability to understand and produce language had always been kind of questionable, where it's really in the last decade, with neural networks, that just enormous strides of progress have been made, which has led into the world that we have today. So one of the first big breakthroughs came in the area of using neural NLP systems for machine translation. That started about 2014 and was already deployed live on services like Google by 2016. It was so good that it saw really, really rapid commercial deployment.
And I mean, overall, this kind of facility with machine translation just means that you're growing up in such a different world to people a few generations back, right? People a few generations back, unless they actually knew the different languages of different people, had no chance to communicate with them. Where now, we're very close to having something like the Babel fish from The Hitchhiker's Guide to the Galaxy for understanding all languages. It's just, it's not a Babel fish; it's a cell phone. You can hold it out between two people and have it do simultaneous translation. And it's not perfect; people keep on doing research on this, but by and large it means you can pick anything up from different areas of the world. As you can see, this example is from a couple of years ago, since it's still from the Covid pandemic era. But I can see this Swahili from Kenya and say, oh gee, I wonder what that means? Stick it into Google Translate, and I can learn that Malawi lost two ministers who died due to Covid infections, right? So we're just in this different era of being able to understand stuff. And then there are lots of other things that we can do with modern NLP. Until a few years ago, we had web search engines where you put in some text. You could write it as a sentence if you wanted to, but it didn't really matter whether you wrote a sentence or not, because what the engine got were some keywords that were then matched against the index, and you were shown some pages that might have the answers to your questions. These days, you can put an actual question into a modern search engine, like: when did Kendrick Lamar's first album come out? It can go and find documents that have relevant information, it can read those documents, and it can give you an answer, so that it actually becomes an answer engine, rather than just something that finds documents that might be relevant to what you're interested in.
The way that that's done is with big neural networks. So for your query, you might commonly have a retrieval neural network, which can find passages that are similar to the query. Those might then be re-ranked by a second neural network, and then there'd be a third, reading neural network that will read those passages and synthesize information from them, which it then returns as the answer. Okay, that gets us to about 2018. But then things got more advanced again. It was really around 2019 that people started to see the power of large language models. Back in 2019, those of us in NLP were really excited about GPT-2. It didn't make much of an impact on the nightly news, but it was really exciting in NLP land, because GPT-2, for the first time, meant that here was a large language model that could just generate fluent text. Until then, NLP systems had done a sort of decent job at getting certain facts out of text, but we'd just never been able to generate fluent text that was at all good. Whereas here, what you could do with GPT-2 is write something like the start of a story: "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown." Then GPT-2 would just write a continuation: "The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief." Dot, dot, dot. And the way this is working is that it's conditioning on all the past material and, as I show at the very bottom line down here, it's then generating one word at a time, whatever word it thinks would be likely to come next. And from that simple method of generating words one after another, it's able to produce excellent text.
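To make the generate-one-word-at-a-time idea concrete, here is a minimal sketch of that sampling loop. This is of course not GPT-2: it's a hypothetical toy bigram model with hand-made counts, purely to illustrate conditioning on the previous word and sampling a likely next word.

```python
import random

# Hand-made, hypothetical bigram counts (how often word B follows word A).
# A real language model conditions on the whole context, not just one word.
bigram_counts = {
    "the": {"train": 3, "incident": 2},
    "train": {"was": 4, "line": 1},
    "was": {"stolen": 5},
    "stolen": {"today": 2},
}

def generate(start, n_words, seed=0):
    """Repeatedly sample the next word given only the current word."""
    random.seed(seed)
    words = [start]
    for _ in range(n_words):
        dist = bigram_counts.get(words[-1])
        if dist is None:  # no known continuation: stop early
            break
        choices, weights = zip(*dist.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the", 4))
```

The loop is the whole trick: each new word is appended to the context and the model is asked again, which is exactly the "one word at a time" generation described above, just with a vastly simpler model.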
And the thing to notice is, this text is not only formally correct, you know, the spelling's correct and the sentences are real sentences, not disconnected garbage, but it actually understands a lot, right? The prompt that was written said there were stolen nuclear materials in Cincinnati. But GPT-2 knows a lot of stuff. It knows that Cincinnati is in Ohio. It knows that in the United States, it's the Department of Energy that regulates nuclear materials. It knows that if something is stolen, it's a theft, and that it would make sense for those people to get involved with that. It talks about the train carriage, talks about the train line and where it goes. It really knows a lot and can write coherent discourse like a real story. So that's kind of amazing. But things moved on from there. And so now we're in the world of ChatGPT and GPT-4. One of the things that we'll talk about later is that this was a huge user success, because now you could ask questions or give it commands, and it would do what you wanted. And that was further amazing. So here I'm saying: hey, please draft a polite email to my boss, Jeremy, saying that I would not be able to come into the office for the next two days because my nine-year-old "song" (the prompt has a misspelling of "son", but the system copes fine), the spider Peter, is angry with me that I'm not giving him much time. And it writes a nice email; it corrects the spelling mistakes, because it knows people make spelling mistakes. It doesn't talk about songs, and everything works out beautifully. You can get it to do other things. You can ask what is unusual about an image. So in thinking about meaning, one of the things that's interesting with these recent models is that they're multimodal and can operate across modes.
And so a favorite term that we coined at Stanford is the term foundation models, which we use as a generalization of large language models, covering the same kind of technology used across different modalities: images, sound, various kinds of bioinformatic things like DNA and RNA, seismic waves, any kind of signal, building these same kinds of large models. Another place that you can see that is going from text to images. So if I ask for a picture of a train going over the Golden Gate Bridge, this is now DALL-E 2, it gives me a picture of a train going over the Golden Gate Bridge. This is a perfect time to welcome anyone who's watching this on Stanford Online. If you're on Stanford Online and not in the Bay Area, the important thing to know is that no trains go over the Golden Gate Bridge. But you might not be completely happy with this picture, because, you know, it shows the Golden Gate Bridge and a train going over it, but doesn't show the bay. So maybe I'd like to get the bay in the background. And if I ask for that, well, look, now I've got my train going over the Golden Gate Bridge with the bay in the background. But this still might not be exactly what you want. Maybe you would prefer something that's a pencil drawing. So I can say: a train going over the Golden Gate Bridge, detailed pencil drawing, and I can get a pencil drawing. Or maybe it's unrealistic that the Golden Gate Bridge only has trains going over it now. Maybe it'd be good to have some cars as well. So I could ask for a train and cars, and we get a train and cars going over it. Now, I actually made these all by myself, so you should be impressed with my generative AI artwork. But these examples are actually a bit old now, because they're done with DALL-E 2, and if you keep up with these things, that's a few years ago; there is now DALL-E 3 and so on. So we can now get much fancier things again, right?
An illustration from a graphic novel: a bustling city street under the shine of a full moon, the sidewalks bustling with pedestrians enjoying the nightlife. At the corner stall, a young woman with fiery red hair, dressed in her signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man, is wearing a sharp suit, sports a noteworthy moustache, and is animatedly conversing on his steampunk telephone. And pretty much, we're getting all of that. Okay, so let's now get on to starting to think more about meaning. So what can we do for meaning, right? If you think of words and their meaning, if you look in a dictionary and ask what meaning means, meaning is defined as the idea that is represented by a word or phrase, the idea that a person wants to express by using words, the idea that is expressed. And in linguistics, if you go into a semantics class or something, the commonest way of thinking of meaning is somewhat like what's presented up above: meaning is thought of as a pairing between what's sometimes called signifier and signified, but it's perhaps easier to think of as a symbol, a word, and then an idea or thing. And this notion is referred to as denotational semantics: the idea or thing is the denotation of the symbol. This same idea of denotational semantics has also been used for programming languages, because in programming languages you have symbols like while and if and variables, and they have a meaning, and that could be their denotation. So we would sort of say that the meaning of tree is all the trees you can find out around the world. That's a sort of okay notion of meaning; it's a popular one. It's never been very obvious, or at least traditionally it wasn't very obvious, what we could do with that to get it into computers.
So in the pre-neural world, when people tried to represent meanings inside computers, they sort of had to do something much more primitive: looking at words and their relationships. A very common traditional solution was to make use of WordNet. WordNet was a sort of fancy thesaurus that showed word relations, so it could tell you about synonyms and is-a-kind-of relations. So a panda is a kind of carnivore, which is a placental, which is a mammal, and things like that. Good has various meanings: a trade good, or the sense of goodness. And you could explore with that. But systems like WordNet were never very good for computational meaning. They missed a lot of nuance. WordNet will tell you that proficient is a synonym for good. But if you think about all the things that you would say were good, you know, "that was a good shot", would you say "that was a proficient shot"? Sounds kind of weird to me. You know, there's a lot of color and nuance in how words are used. WordNet is very incomplete. It's missing anything that's kind of cooler, more modern slang; this maybe isn't very modern slang now, but you won't find more modern slang in it either. It's very human-made, etc. It's got a lot of issues. So this led to the idea of, can we represent meaning differently? And this leads us into word vectors. So when we have words, wicked, badass, nifty, wizard, what do they turn into when we have computers? Well, effectively, words are discrete symbols; they're just some kind of atomic symbol. And if we then turn those into something that's closer to math, the way symbols are normally represented is you have a vocabulary, and your word is some item in that vocabulary. So motel is this word in the vocabulary, and hotel is that word in the vocabulary. And commonly, this is what computational systems do: you take all your strings and you index them as numbers, and that's the position in a vector that they belong in.
And we have huge numbers of words, so we might have a huge vocabulary, and so we'll have very big, long vectors. These get referred to as one-hot vectors for representing words. But representing words by one-hot vectors turns out not to be a very good way of computing with them. It was used for decades, but it turns out to be kind of problematic. Part of why it's problematic is that it doesn't have any natural, inherent sense of the meanings of words. You just have different words: you have hotel and motel and house and chair. And if you think about it in terms of these vector representations, if you have motel and hotel, there's no indication that they're similar. They're just two different symbols, which have ones in different positions in the vector. Or, formally, in math terms, if you take the dot product of these two vectors, it's zero. The two vectors are orthogonal; they have nothing to do with each other. Now, there are things you can do with that. You can start saying, oh, let me build up some other resource of word similarity, and I'll consult that resource of word similarity, and it'll tell me that motels and hotels are similar to each other. And people did things like that, right? In web search, it was referred to as query expansion techniques. But still, the point is that there's no natural notion of similarity in one-hot vectors. And so the idea was that maybe we could do better than that: we could learn to include similarity in the vectors themselves. That leads into the idea of word vectors, but it also leads into a different way of thinking about semantics. I just realized I forgot to say one thing, back two slides: these kinds of representations are referred to as localist representations, meaning that there's one point at which something is represented. So what you've got here is the representation of motel, and here is the representation of hotel.
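As a quick sketch of the localist, one-hot idea just described, and of why it carries no similarity information, here is a tiny example (the four-word vocabulary is made up for illustration):

```python
import numpy as np

# A tiny illustrative vocabulary; real vocabularies have hundreds of
# thousands of entries, so real one-hot vectors are enormously long.
vocab = ["chair", "hotel", "house", "motel"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Localist representation: a single 1 at the word's vocabulary position."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
# Any two different one-hot vectors are orthogonal: their dot product is 0,
# so this encoding says nothing about hotel and motel being similar.
print(motel @ hotel)   # 0.0
print(motel @ motel)   # 1.0
```

The zero dot product is exactly the orthogonality point made above: under this encoding, every pair of distinct words is equally unrelated.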
Each word is represented in one place in the vector, and that'll be different from what we do next. So there's an alternative idea of semantics, which goes back quite a long way. People commonly quote J.R. Firth, a British linguist, who said in 1957, "You shall know a word by the company it keeps." But it also goes back to philosophical work by Wittgenstein and others: what you should do is represent a word's meaning by the context in which it appears. The words that appear around the word give information about its meaning. And that's the idea of what's called distributional semantics, in contrast to denotational semantics. So if I want to know about the word banking, I say, give me some sentences that use the word banking. Here are some sentences using the word banking: government debt problems turning into banking crises, as happened in 2009, etc., etc. And that context, the words that occur around banking, will become the meaning of banking. So we are going to use those statistics about words and what other words appear around them in order to learn a new kind of representation of a word. The new representation of words is that we're going to represent them as a dense, sort of shorter, dense vector that gives the meaning of the word. Now, my vectors here are very short; these are only eight-dimensional, if I counted right, so I could fit them on my slide. They're not that short in practice; they might be 200 to 2,000 dimensional, but reasonably short. They're not going to be like the half a million of the half a million different words in our vocabulary. And the idea is that if words have stuff to do with each other, they'll have sort of similar vectors, which corresponds to their dot product being large. So for banking and monetary, in my example here, both of them are positive in the first dimension, positive in the second dimension, negative in the third. In the fourth, they've got opposite signs.
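The distributional idea above, gathering statistics about which words appear around a target word, can be sketched very simply. The two-sentence corpus and the window size here are made up for illustration; real systems count over billions of words:

```python
from collections import Counter

# A made-up, tiny corpus; each sentence is a list of tokens.
corpus = [
    "government debt problems turning into banking crises".split(),
    "unified banking regulation replaced national banking rules".split(),
]

def context_counts(target, window=2):
    """Count the words appearing within `window` positions of each
    occurrence of `target` -- the raw distributional statistics."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(sent[lo:i] + sent[i + 1:hi])
    return counts

print(context_counts("banking").most_common(3))
```

Counts like these are the raw material; word2vec and related methods turn such co-occurrence information into the short, dense vectors described next.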
So when we work out the dot product, we're taking the product of corresponding terms, and it will get bigger to the extent that corresponding dimensions have the same signs, and bigger if they have large magnitudes. Okay. So these are what we call word vectors, which are also known as embeddings or neural word representations, or phrases like that. And the first thing we want to do is learn good word vectors for different words. Our word vectors will be good word vectors if they give us a good sense of the meanings of words: they know which words are similar to other words in meaning. We refer to them as embeddings because we can think of each one as a vector in a high-dimensional space, so that we're embedding each word as a position in that high-dimensional space. And the dimensionality of the space will be the length of the vector, so it might be something like a 300-dimensional space. Now, that kind of gets problematic, because human beings can't look at 300-dimensional spaces and aren't very good at understanding or visualizing what goes on in them. So the only thing that I can show you is two-dimensional spaces. But a thing that is good to have somewhat in your head is that really high-dimensional spaces behave extremely differently from two-dimensional spaces. In a two-dimensional space, you're only near something else if you've got similar x and y coordinates, whereas in a high-dimensional space, things can be very near to all sorts of things on different dimensions of the space. And so we can capture different senses of words and different ways that words are similar to each other. But here's the kind of picture we end up with. What we're going to do is learn a way to represent all words as vectors based on the other words that they occur with in context, and we can embed them into this vector space. And of course, you can't read anything there, but we can zoom into this space further.
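Before zooming in, the dot-product point above can be checked numerically. The eight-dimensional vectors below are illustrative hand-entered values (the banking and monetary rows follow the slide's example; the chair row is made up as an unrelated word), and the sketch also normalizes by magnitude, giving the cosine similarity that is commonly used:

```python
import numpy as np

# Illustrative 8-dimensional word vectors (hand-entered, not learned here).
banking  = np.array([ 0.286, 0.792, -0.177, -0.107,  0.109, -0.542,  0.349, 0.271])
monetary = np.array([ 0.413, 0.582, -0.007,  0.247,  0.216, -0.718,  0.147, 0.051])
chair    = np.array([-0.531, 0.021,  0.604, -0.330, -0.210,  0.050, -0.612, 0.088])

def cosine(u, v):
    """Dot product normalized by the vectors' magnitudes, so only
    direction in the space matters, not overall length."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(banking, monetary))  # large: mostly matching signs per dimension
print(cosine(banking, chair))     # small/negative: signs frequently disagree
```

Because banking and monetary agree in sign on most dimensions, their products of corresponding terms are mostly positive and the similarity comes out high, exactly the mechanism described above.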
And if we zoom into this space and just show a bit of it, well, here's a part of the space where it's showing country words and some other location words. So we've got sort of countries up the top there. We've got some nationality terms, British, Australian, American, European, further down. Or we can go to another piece of the space, and here's a bit of the space where we have verbs. And not only have we got verbs, but there's actually quite a lot of fine structure here of what's similar that represents things about verbs. So you've got sort of verbs of communication and statements, saying, thinking, expecting, grouping together; come and go group together. Down the bottom, you've got forms of the verb have, then you've got forms of the verb to be. Above them, you've got come and remain, which are actually sort of similar to the verb to be because they take these sort of complements of state. So just as you can say I am angry, you can say he remained angry or he became angry, right? So those are the verbs most similar to the verb to be. So we get these kind of interesting semantic spaces where things that have similar meaning are close by to each other. And so the question is, how do we get to those things? And how we get to those things, well, there are various ways of doing it, but the one I want to get through today is showing you word2vec. Okay, I'll pause for thirty seconds for breath. Anyone have a question or anything they want to know? Yes? [Student question, lightly reconstructed from garbled audio:] The vectors have the context information, but that doesn't seem to solve the problem where similar meaning might depend on context, right? Two words have their own fixed vector values, but if the same word appears in different contexts, a single embedding alone doesn't seem to capture that.
Yes, correct. So that's a good thought. You can keep it for a few weeks, to some extent. Yeah. So for the first thing we're going to do, we're just going to learn one word vector for a string. So we're going to have a word, let's say it's star, and we're going to learn one word vector for it. So that absolutely doesn't capture the meaning of a word in context. So it won't be saying whether it's meaning a Hollywood star or an astronomical star or something like that. Later on, we're going to get on to contextual meaning representation, so wait for that. But, going along with what I said about high-dimensional spaces being weird, the cool thing that we will already find is that our representation for star will be very close to the representations for astronomical words like nebula and whatever other astronomical words, and simultaneously it'll be very close to words meaning something like a Hollywood star. Help me out, know any words that mean something similar? Celebrity. That's a good one. Okay. Yeah. [Student question about how the embedding is visualized in a lower-dimensional space.] So the pictures I was showing you used a particular method called t-SNE, which is a non-linear dimensionality reduction that tends to work better for high-dimensional neural representations than PCA, which you might know, but I'm not going to go into that now. Yes? [Student question about how the dimensionality of the space is chosen.] I mean, that's something that people have worked on. It depends on how much data you've got to make your representations over. Normally it's worked out either empirically for what works best, or practically based on how big the vectors are that you want to work with. To give you some idea, things start to work well when you get to a 100-dimensional space. For a long time, people used 300 dimensions because that seemed to work pretty well.
But as people have started building huger and huger models with way, way more data, it's now become increasingly common to use numbers like 1000 or even 2000 dimensional vectors. Okay. [Student question, lightly reconstructed from garbled audio:] You mentioned that there's sort of hidden structure in small areas as well as large areas of the embedding, and in high dimensions different structures come up. But we seem to use distance as the single metric for closeness, and it doesn't seem like distance means the same thing everywhere in the space, right? So how does that work if we only use distance? Well, we don't only use distance. We also use directions in the space as having semantic meanings. I'll show an example of that soon. Yeah. [Student:] I was wondering, for the entries of the word vectors, they seem to be between negative 1 and 1. Is there a reason for that, or do we have to bound them? So, good question. I mean, they don't have to be, and the way we're going to learn them, they're not bounded. But you can bound things: sometimes people length-normalize so that the vectors are of length one. But at any rate, normally in this work we use a method called regularization that tries to keep the coefficients small, so they're generally not getting huge. Yeah? [Student:] Given a specific word, for example the bank we used before on the earlier slide, is there a single embedding for each word, or do we have multiple embeddings for each word? Well, what we're doing at the moment, each word, each string of letters, has a single embedding. And you can think of that embedding kind of as an average over all its senses: the financial institution, or it can also mean the riverbank. And then what I said before about star applies.
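The length-normalization mentioned in that answer is easy to sketch (a toy vector, assuming NumPy; this is an illustration, not anything from the lecture's own code):

```python
import numpy as np

def length_normalize(v):
    # Scale a vector to unit length. Its entries are then bounded in
    # [-1, 1], and dot products between normalized vectors become
    # cosine similarities.
    return v / np.linalg.norm(v)

v = np.array([3.0, -4.0])
u = length_normalize(v)
print(u)                   # [ 0.6 -0.8]
print(np.linalg.norm(u))   # 1.0
```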
The interesting thing is you'll find that we're able to come up with a representation where our learned representation, because it's kind of an average of those senses, will end up similar to words that are semantically evoked by both senses. I think I'd probably better go on at this point. Okay, word2vec. So word2vec was this method of learning word vectors that was thought up by Tomas Mikolov and colleagues at Google in 2013. It wasn't the first method; there are other people who did methods of learning word vectors going back to about the turn of the millennium. It wasn't the last; there are ones that come after it as well. But it was a particularly simple one, and a particularly fast-running one, and so it really caught people's attention. So the idea of it is that we start off with a large amount of text, which can just be thought of as a long list of words. And in NLP, we refer to that as a corpus. Corpus is just Latin for body. So it's exactly the same as if you have a dead person on the floor, right? That's a corpus... no, yeah, so it's just a body, but we mean a body of text, not a dead person. Yeah, if you want to know more about Latin, since there isn't very good classical education these days: corpus, despite the -us ending, is a third declension neuter noun. And that means the plural of corpus is not corpi; the plural of corpus is corpora. So I'm sure sometime later in this class I will read a project or assignment where it refers to corpi, and I will know that that person was not paying attention in the first lecture, or else they would have said corpora, the correct form. Okay, I should move on. So we have our text, and then we know that we're going to represent each word. So this is each word type, so star or bank, etcetera.
So wherever it occurs, it's represented by a single vector. And what we're going to do in this algorithm is go through each position in the text. So at each position of the text, which is a list of words, we're going to have a center word and the words outside it. And then what we're going to do is use the similarity of the word vectors for the center word and the outside words to calculate the probability that they should have occurred together. And then we just keep fiddling, and we learn word vectors. Now, at first sight... maybe I'll just show this more concretely first. So here's the idea. We're going to have a vector for each word type. A word type means the word problems wherever it occurs, which is differentiated from a word token, which is this particular instance of the word problems. So we're going to have a vector for each word type. And so I'm going to want to know: look, in this text, the word turning occurred before the word into. How likely should that have been to happen? What I'm going to do is calculate a probability of the word turning occurring close to the word into. And I'm going to do that for each word in a narrow context. In the example here, I'm using two words to the left and two words to the right. And what I want to do is make those probability estimates as good as possible. So in particular, I want the probability of co-occurrence to be high for words that actually do occur within the nearby context of each other. And once I've done it for that word, I'm going to go along and do exactly the same thing for the next word, and I'll continue through the text in that way. So what we want to do is come up with vector representations of words that will let us predict these probabilities, quote unquote. Well, now, there's a huge limit to how well we can do it, because we've got a simple model.
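The walk through the text with a center word and its outside words can be sketched like this. This is only an illustration of the windowing, not the actual word2vec training code:

```python
# Walk through a tiny corpus with window size m, collecting the
# (center, outside) pairs whose probability we will want to make high.
corpus = "problems turning into banking crises as happened in 2009".split()
m = 2  # two words to the left, two to the right

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        # Skip the center word itself and positions off the ends of the text.
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        pairs.append((center, corpus[t + j]))

print(pairs[:4])
# [('problems', 'turning'), ('problems', 'into'),
#  ('turning', 'problems'), ('turning', 'into')]
```

Training then amounts to adjusting the vectors so that the probability assigned to each of these observed pairs is as high as possible.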
Obviously, when you see the word banking, I can't tell you that the word into is going to occur before banking, but I want to do it as well as possible. So what I want my model to say is that after the word banking, crises is pretty likely, but the word skillet is not very likely. And if I can do that, I'm doing a good job. And so we turn that into a piece of math. Here's how we do it. So we're going to go through our corpus, every position in the corpus, and we're going to have a fixed window size m, which was two in my example. And then what we're going to want to do is have the probability of words in the context be as high as possible. So we want to maximize this likelihood, where we're going through every position in the text, and then we're going through every word in the context, and sort of wanting to make this big. Okay? So conceptually, that's what we're doing, but in practice, we never quite do that. We use two little tricks here. The first one is, for completely arbitrary reasons that really make no difference, everyone got into minimizing things rather than maximizing things, and so the algorithms that we use get referred to as gradient descent, as you'll see in a moment. So the first thing we do is put a minus sign in front so that we can minimize it rather than maximize it. That part's pretty trivial. But the second part is, here we have this enormous product, and working with enormous products is more difficult for the math. So the second thing that we do is introduce a logarithm. And once we take the log of the likelihood, the logs of products turn into sums. And so now we can sum over each word position in the text, sum over each word in the context window, and then sum these log probabilities. And then we've still got the minus sign in front, so we want to minimize the sum of log probabilities. So what we're doing is then wanting to look at the negative log likelihood.
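Written out cleanly, the likelihood and the objective being described are the standard word2vec ones (T is the length of the corpus, m the window size, theta all the word vectors):

```latex
% Likelihood: at each position t, predict the words within a window of m
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)

% Objective: average negative log-likelihood; minimizing J maximizes L
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t ; \theta)
```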
And then the final thing that we do is, since this will get bigger depending on the number of words in the corpus, we divide through by the number of words in the corpus. And so our objective function is the average negative log likelihood. So by minimizing this objective function, we're maximizing the probability of words in the context. Okay, we're almost there. That's what we want to do, but we've got a couple more tricks to get through. The next one is, well, I've said we want to maximize this probability. How do we calculate this probability? We haven't defined how we're going to calculate it. This is where the word vectors come in. So we're going to define this probability in terms of the word vectors. We're going to say each word type is represented by a vector of real numbers, say 100 real numbers. And we're going to have a formula that works out the probability simply in terms of the vectors for each word; there are no other parameters in this model. So over here, I've shown this theta, which is the parameters of our model, all and only the parameters of our model: these word vectors for each word in the vocabulary. That's a lot of parameters, because we have a lot of words and we've got fairly big word vectors, but they are the only parameters. Okay? And how we do that is by using this little trick here: we're going to say the probability of an outside word given a center word is defined in terms of the dot product of the two word vectors. So if things have a high dot product, they'll be similar, and therefore they'll have a high probability of co-occurrence. Where, I mean, similar in a kind of a weird sense, right? It is the case that we're going to want to say hotel and motel are similar, but it's also the case that we're going to want the word the to be able to appear easily before the word student.
So in some weird sense, the also has to be similar to student; it has to be similar to basically any noun, right? Okay. So we're going to work with dot products, and then we do this funky little bit of math here, and that will give us our probabilities. So let's just go through the funky bit of math. Here's our formula for the probabilities. What we're doing here is starting off with the dot product, right? So for the dot product, you take the two vectors, you multiply each component together, and you sum them. So if components have the same sign, that increases your dot product, and if they're both big, it increases it a lot. Okay, so that gives us a similarity between two vectors, and that's unbounded; it's just a real number that can be either negative or positive. But what we'd like to get out is a probability. So for our next tricks, we first of all exponentiate, because if we take e to the x, for any x, we get something positive out, right? That's what exponentiation does. And then, well, since it's meant to be a probability, we'd like it to be between zero and one. And so we turn it into numbers between zero and one in the dumbest way possible, which is: we just normalize. We work out the quantity in the numerator for every possible context word, get the total of all of those numbers, and divide through by it, and then we're getting a probability distribution over how likely different words are in this context. Okay, yeah. So this little trick that we're doing here is referred to as the softmax function. For the softmax function, you can take unbounded real numbers, put them through this little softmax trick that we just went through the steps of, and what you'll get out is a probability distribution. So I'm now getting, in this example, a probability distribution over context words.
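The exponentiate-then-normalize steps just described are a few lines of NumPy. This is a generic softmax sketch, not the lecture's own code:

```python
import numpy as np

def softmax(z):
    # Exponentiate (making every entry positive), then normalize so the
    # entries sum to 1. Subtracting max(z) first is a standard
    # numerical-stability trick; it doesn't change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])  # unbounded dot products
probs = softmax(scores)
print(probs)        # largest score gets the largest probability
print(probs.sum())  # 1.0
```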
My probability estimates over all the context words in my vocabulary will sum up to one, by definition, by the way that I've constructed this. So it's called the softmax function because it amplifies the probabilities of the largest things. That's because of the exp function. But it's soft because it still assigns some probability to smaller items. It's sort of a funny name, because when you think about max, max normally picks out just one thing, whereas the softmax is turning a bunch of real numbers into a probability distribution. This softmax is used everywhere in deep learning: anytime that we're wanting to turn things that are just vectors in R^n into probabilities, we shove them through a softmax function. Okay. In some sense, this part, I think, still seems very abstract. And the reason it seems very abstract is because I've sort of said we have vectors for each word, and using these vectors, we can then calculate probabilities. But where do the vectors come from? And the answer to where the vectors are going to come from is that we're going to turn this into an optimization problem. We have a large amount of text, and so therefore we can hope to find word vectors that make the probabilities of the words in the contexts of our observed text as big as possible. So literally, what we're going to do is start off with random vectors for every word, and then we want to fiddle those vectors so that the calculated probabilities of words in the context go up. And we're going to keep fiddling until they stop going up anymore and we're getting the highest probability estimates that we can. And the way that we do that fiddling is we use calculus. So what we're going to do is conceptually exactly what you do if you're in something like a two-dimensional space, like the picture on the right.
If you want to find the minimum in this two-dimensional space and you start off at the top left, what you can do is say: let me work out the derivatives of the function at the top left. And they point sort of down and a bit to the right. So you can walk down and a bit to the right, and you can say, oh gee, given where I am now, let me work out the derivatives. What direction do they point? And they're still pointing down, but a bit more to the right. So you can walk a bit further that way, and you can keep on walking, and eventually you'll make it to the minimum of the space. In our case, we've got a lot more than two dimensions. So our parameters for our model are the concatenation of all the word vectors. And it's even slightly worse than I've explained up until now, because actually for each word we assume two vectors: one vector when it's the center word, and one vector when it's the outside word. Doing that just makes the math a bit simpler, which I can explain later. So if we say we have 100-dimensional vectors, we'll have 100 parameters for aardvark as an outside word, 100 parameters for a as an outside word, all the way through to 100 parameters for zebra as an outside word; then we'll have 100 parameters for aardvark as a center word, continuing down. So if we had a vocabulary of 400,000 words and 100-dimensional word vectors, that means we'd have 400,000 times two, that's 800,000 vectors, times 100 dimensions: we'd have 80 million parameters. So that's a lot of parameters in our space to try and fiddle to optimize things. But luckily, we have big computers, and that's the kind of thing that we do. So we simply say, gee, this is our optimization problem; we're going to compute the gradients of all of these parameters, and that will give us the answer. This feels like magic. I mean, it doesn't really seem like we could just start with nothing.
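The walking-downhill picture can be sketched on a toy two-dimensional bowl instead of 80 million parameters. This is a generic gradient-descent illustration; the function and step size are made up for the example:

```python
import numpy as np

# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2 by repeatedly stepping in
# the direction opposite the gradient, as in the 2-D picture.
def grad(p):
    x, y = p
    return np.array([2 * (x - 1), 2 * (y + 2)])

p = np.array([5.0, 5.0])  # start somewhere up in the corner
lr = 0.1                  # step size
for _ in range(200):
    p -= lr * grad(p)     # walk a little way downhill, re-check the slope

print(p)  # converges toward the minimum at (1, -2)
```

Word2vec does the same thing, just with the gradient of the average negative log likelihood with respect to every word vector.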
We could start with random word vectors and a pile of text and say, do some math, and we will get something useful out. But the miracle of what happens in these deep learning spaces is that we do get something useful out. We can just optimize all of the parameters, and then we'll get something useful out. So, I guess I'm not going to quite get to the end of what I hoped to today, but what I wanted to do is get through some of what we do here. I wanted to take a few minutes to go through concretely how we do the math of minimization. Now, lots of different people take CS224N, and some of you know way more math than I do, so this next ten minutes might be extremely boring, and if that's the case, you can either catch up on Discord or Instagram or something else, or you can leave. But it turns out there are other people who do CS224N who can't quite remember when they last did a math course, and we'd like everybody to be able to learn something about this. So I do actually like, in the first two weeks, to kind of go through it a bit concretely. So let's try to do this. So this was our likelihood, and then we'd already covered the fact that what we were going to do was have an objective function, in terms of our parameters, that was the average negative log likelihood across all the words. Remember the notation for this, the sums in the loops; I'll probably have a hard time writing this, the sum over positions. I've got a more neatly written out version of it that appears in the version of the slides that's on the web. And then we're going to be taking the log of the probability of the word at position t plus... sorry, position t plus j, given the word at position t, w_t. Okay. Trying to write this on my iPad is not working super well, I'll confess. We'll see how I get on.
And so then we had the form of what we wanted to use for the probability: the probability of an outside word given the center word was this softmaxed equation, where we were taking exp of the outside vector dotted with the center vector, over the normalization term, where we sum over the vocabulary. Okay. So, to work out how to change our parameters: our parameters are all of these word vectors that we summarize inside theta. What we're then going to want to do is work out the partial derivative of this objective function with respect to all the parameters theta. But in particular, I'm going to just start doing here the partial derivatives with respect to the center word vector, and we can work through the outside words separately. Well, this partial derivative is a big sum of terms like this, and when I have a partial derivative of a big sum of terms, I can work out the partial derivatives of each term independently and then sum them. So what I want to be doing is working out the partial derivative of the log of this probability with respect to the center vector v_c. And at this point, I have a log of two things being divided, and that means I can separate it out as the log of the numerator minus the log of the denominator. And so what I'll be doing is working out the partial derivative with respect to the center vector of the log of the numerator, log exp(u_o^T v_c), minus the partial derivative with respect to the center vector of the log of the denominator, which is the log of the sum over w equals 1 to V of exp(u_w^T v_c). Okay, I'm having real trouble writing here; I'd look at the slides, where I wrote neatly at home. So I want to work with these two terms. Now, at this point, part of it is easy, because here I just have a log of an exponential, and those two functions just cancel out and go away. And so then I want the partial derivative of u_o^T v_c with respect to the center vector.
And what you get for the answer to that is that it just comes out as u_o. And maybe you remember that. But if you don't, the thing to think about is: this is a whole vector, right? We've got a vector here and a vector here. So what this is going to look like is sort of u_1 v_1 plus u_2 v_2 plus u_3 v_3, and so on. And what we're going to want to do is work out the partial derivative with respect to each element v_i. So if you just think of a single element, the derivative with respect to v_1, well, it's going to be just u_1, because every other term goes to zero. And if you worked it out with respect to v_2, then it'd be just u_2, and every other term goes to zero. And since you keep on doing that along the whole vector, what you're going to get out is the vector u_1, u_2, u_3, down the whole length of the vector, which is just u_o. Okay, so that part is easy, but then we also want to work out the partial derivative of the other term. And at that point, I maybe have to go to another slide. So we want the partial derivative with respect to v_c of the log of the sum over w equals 1 to V of exp(u_w^T v_c). At this point, things aren't quite so easy, and we have to remember a little bit more calculus. In particular, what we have to remember is the chain rule. So here we have an inside function: we've got a function g of v_c, whose output we might call z, and then we've put an extra function f outside it. And when we have something like that, the derivative of f with respect to v_c is the derivative of f with respect to z times the derivative of z with respect to v_c. That's the chain rule. So we are going to apply that here. First of all, we're going to take the derivative of log, and the derivative of log x is one over x.
You have to remember that, or look it up, or get Mathematica to do it for you or something like that. And so we're going to have one over the inside z part, the sum over w equals 1 to V of exp(u_w^T v_c). And that's going to be multiplied by the derivative of the inside part: the derivative with respect to v_c of the sum over w equals 1 to V of exp(u_w^T v_c). Okay, so that's made us a little bit of progress, but we've still got something to do here. And what we're going to do here is notice, oh wait, we're again in a position to run the chain rule. So, first of all, we can move the sum to the outside, right? Because we've got a sum of terms, and we can work out the derivative of each inside piece with respect to v_c. Sorry, I'm doing this kind of informally, just doing this piece now. Okay. So this again gives us an f of a function g, and so we're going to again split the pieces up and use the chain rule one more time. So we're going to have the sum over x equals 1 to V, and now we have to know the derivative of exp, and the derivative of exp is exp. So that will be exp(u_x^T v_c), and then we're taking the derivative of the inside part with respect to v_c, which is the derivative of u_x^T v_c. Well, luckily, this was the bit that we already knew how to do, because we worked it out before. And so this is going to be the sum over x equals 1 to V of exp(u_x^T v_c) times u_x. Okay. So then at this point, we want to combine these two pieces together, this part that we worked out and this piece here that we've worked out. And if we combine them together with what we worked out on the first slide for the numerator: the derivative of the numerator gives us u_o, and then for the derivative of the denominator, we're going to have this part on top and that part on the bottom.
And so we can rewrite that as the sum over x equals 1 to V of exp(u_x^T v_c) times u_x, over the sum over w equals 1 to V of exp(u_w^T v_c). Okay, so we can rearrange things in that form. And then, lo and behold, we find that we've recreated here the form of the softmax equation. So we end up with u_o minus the sum over x equals 1 to V of the probability of x given c, times u_x. So what this is saying is that we've got this quantity which takes the actual observed u vector and compares it to the weighted prediction: we're taking the weighted sum of our current u_x vectors based on how likely they were to occur. And this is a form that you see quite a bit in these kinds of derivatives: you get observed minus expected, the weighted average. And what you'd like is for your expectation, the weighted average, to be the same as what was observed, because then you'll get a derivative of zero, which means that you've hit a maximum. And so that gives us the form of the derivative with respect to the center vector parameters. To finish it off, you'd have to then work it out also for the outside vector parameters. But hey, it's officially the end of class time, so I'd better wrap up quickly now. But the deal is, we're going to work out all of these derivatives for each parameter, and then these derivatives will give a direction to change the numbers, which will let us find good word vectors automatically. I do want you to understand how this works, but fortunately you'll find out very quickly that computers will do this for you, and on a regular basis you don't actually have to do it yourself. More about that on Thursday. Okay, see you everyone.
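The observed-minus-expected result, the gradient of log p(o|c) with respect to v_c being u_o minus the sum over x of p(x|c) u_x, can be checked numerically on a tiny random example. This sketch (with made-up sizes) compares the analytic gradient against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                   # tiny vocabulary size and vector dimension
U = rng.normal(size=(V, d))   # outside vectors u_1 .. u_V (rows)
v_c = rng.normal(size=d)      # center vector
o = 2                         # index of the observed outside word

def log_prob(v):
    # log p(o | c) = u_o . v - log sum_w exp(u_w . v)   (log of the softmax)
    z = U @ v
    return z[o] - np.log(np.exp(z).sum())

# Analytic gradient from the derivation: observed minus expected.
p = np.exp(U @ v_c)
p /= p.sum()                  # softmax probabilities p(x | c)
analytic = U[o] - p @ U       # u_o - sum_x p(x|c) u_x

# Numerical gradient by central finite differences.
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

Getting the two to agree is exactly the gradient check you will use to debug these derivatives in the assignments' style of setup.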