Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 4 - Dependency Parsing
This lecture covers the syntactic structure of human language sentences and dependency parsing. It first reviews traditional phrase structure grammar (also known as context-free grammar), in which words are grouped by part of speech (nouns, adjectives, determiners, prepositions) into larger units (noun phrases, prepositional phrases) that progressively build up sentence structure. The lecture then shifts its focus to dependency grammar, which looks at modifier and argument relations between words, identifying the head word of a phrase and the words that modify or depend on it. Both representations can be used to analyze sentence structure and to understand how words combine and modify one another. The lecture also notes that the upcoming assignment asks students to build a neural dependency parser in PyTorch, and encourages students to attend the PyTorch tutorial.
Tags
Media details
- Upload date
- 2025-05-15 21:09
- Source
- https://www.youtube.com/watch?v=KVKvde-_MYc
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Okay, hi everyone. So for today we're going to, I guess, do a 180 from where we were on Tuesday. Today I'm going to talk about syntactic structure, the linguistic structure of human language sentences, dependency parsing, as well as dependency grammars, and then how you go about building dependency parsers. So we're solidly in the linguistics zone today. How many people in the audience have done a linguistics class? Yay. Okay, so some people have done a linguistics class. Okay, great. And for the rest of you, well, this is your chance to see a little bit of human language structure, and if you like it, you could enroll in a linguistics class later on. Yeah. Oops, so, you know, assignment two we handed out on Tuesday. So in the second half of assignment two, your job is to build a neural dependency parser using PyTorch, as we'll come to later on. Really, the bit that you have to build is just the machine learning bit of making decisions, and we give you most of the rest of the neural dependency parser. But this is also a chance to remind you that assignment two in that second half uses PyTorch, one of the leading deep learning frameworks. So if you're not familiar with that, it'd be a really good idea to also go along to the Friday PyTorch tutorial, though we have tried to make assignment two a fairly good place for learning PyTorch as you go along. We'll say more soon about final projects, but you're certainly already encouraged to come and meet with TAs or me about final projects, and we're putting up information about the TAs on the office hours page so you can know more about them. Okay, so let's get straight into it and start looking at linguistic structure. So in thinking about the linguistic structure of human languages, there are two primary ways that people have thought about it. One way uses the idea that linguists normally call phrase structure, which is then represented in terms of what computer scientists normally know as context-free grammars. So I'm going to spend a couple of minutes going over that view of things, but actually it's not the main one that I'm going to talk about in this class. I'm going to spend most of this class talking about an alternative way of thinking about things called dependency grammars. There are actually some correspondences you can make between the two ways of thinking about things, but I'm not going to go into those here today. So for the constituency grammar, or phrase structure, version of things, the way that you go about thinking about the structure of human languages is: well, there are words. Languages have lots of words, hundreds of thousands of words. But it seems like nearly all the words fall into a few basic classes that represent their nature and how they behave in sentences. So for words like the examples here, we have nouns. So cat is a noun, dog is a noun, but something like linguistics is also a noun. So we have nouns, and then we have other kinds of words. Something like cuddly is an adjective, a word that can modify nouns. And then we have the, for the cuddly cat; that is a slightly more complex one as to how to name it. Normally in modern linguistics we refer to words like that as a determiner. You might also see the name article.
And sometimes, when people are trying to shoehorn human language into eight part-of-speech categories, they say it's an adjective, but it doesn't really behave like regular adjectives. And then we have words like by or through or on and to and ones like that, and those are prepositions, right? So we have these classes, with lots of words fitting into each class, and they're referred to conventionally as parts of speech. But then, once we've got words, we start putting them into bigger units. So the cuddly cat is some kind of unit, and it seems like this is an elaboration of a noun, a cat, so it gets referred to as a noun phrase. And then by the door, well, this is a phrase, but actually it has inside it the door, and that's a noun phrase, and the bigger unit here of by the door is then a prepositional phrase. And we can continue to build bigger units. So inside, you know, we have this phrase that we've already looked at with a noun phrase and a prepositional phrase, but then we can have another noun phrase, the cuddly cat, and we can put them together and build a bigger noun phrase, the cuddly cat by the door. And so to represent this, you can start to write a phrase structure grammar, or context-free grammar, that represents what the possibilities are for building up sentences here in English; similar kinds of phrase structure grammars can be written for other languages. So this is starting to give you possible structures for a noun phrase. You can have a noun phrase that just goes to a determiner followed by a noun. But then, as well as the cat and a dog, you can have the large cat, so you might say that, rather than that, I want a better rule where a noun phrase goes to a determiner, an optional adjective, and then a noun. If you think about it, you can have multiple adjectives, so you can have the large green cat or something like that. You can really get multiple adjectives appearing there, and that star, the Kleene star, says you can have lots of them: the large cuddly green cat. But then you can stick things after the noun phrase, so you can put in these prepositional phrases like in a crate. So we might also want to say that a noun phrase can be rewritten as a noun phrase followed by a prepositional phrase, where a prepositional phrase can be represented by a preposition followed by a noun phrase. And somewhere we're also going to want to represent our parts-of-speech membership. So a determiner can go to words like a or the, an adjective can go to words like large or cuddly or many other words that I'm not going to write down, and a preposition can go to words like in, on, under, etc. Okay, so now I've got a little grammar here, and this little grammar could make everything I've got in these sentences. Well, I actually can do this one too; it can do the large barking one where there are multiple adjectives. But then if I start going beyond these noun phrases and think of sentences like talk to the cat or talk to the large cuddly dog by the door, well, now I've got a verb, talk, and I've again got a preposition. So I might then have more rules that say I can have a verb phrase, and the verb phrase can go to a verb and then a prepositional phrase, and then that could explain these two sentences as well. And in this kind of way, I can start to build up a grammar of the structure of English sentences as a context-free grammar. Makes sense? Yeah. Okay.
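As a small aside, here is a minimal sketch of that toy grammar written out with NLTK, just to make the rewrite rules concrete (this assumes you have nltk installed; NLTK's CFG notation has no Kleene star, so the optional-adjectives rule is unrolled into a recursive nominal rule).

```python
import nltk

# A toy phrase-structure grammar mirroring the rules just described.
# "Det Adj* N" is expressed with a recursive Nom(inal) rule instead of a star.
grammar = nltk.CFG.fromstring("""
    NP  -> Det Nom | NP PP
    Nom -> Adj Nom | N
    PP  -> P NP
    Det -> 'the' | 'a'
    Adj -> 'cuddly' | 'large'
    N   -> 'cat' | 'door' | 'crate'
    P   -> 'by' | 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat by the door".split()):
    print(tree)  # the NP [the cuddly cat] combined with the PP [by the door]
```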
And so that's what has been quite commonly done in linguistics and elsewhere. Okay. Yeah. So let me just do that once more with this one. So one thing I can do here is say, oh, I'm going to look at this with its phrase structure. And if I write it upside down to give myself some space for later, you know, I could start making a phrase structure for this sentence, and I'll start to run out of space, but I can start to make this phrase structure of the sentence. So that's phrase structure. But there's another form of representation that has been fairly widely used in linguistics and has been commonly used in NLP, and which we're going to use for the parsers we build, and that's dependency structure. So dependency structure represents things in a slightly different way. It thinks about which words are the main word, or head, and then which words they take as modifiers or arguments. So for look in the large crate in the kitchen by the door, well, this is describing a looking command. So the head of the whole thing is look, and then look is taking one or more arguments or modifiers. And what the look is saying here is: what you want to do is look in the large crate. So we are looking in something, and what we're looking in is a crate, and then the crate has some modifiers: it's a large crate, it's the large crate. And then the crate is also placed somewhere — it's placed in the kitchen — so in the kitchen is also modifying crate. And then we've got over here the by the door. Well, by the door is also modifying crate, so we've also got a link down over to here, and that gives us our piece of structure here, which, having filled that in, makes me realize I actually got it wrong in the constituency representation, so I should fix that. Because in the constituency representation, I made the kitchen by the door into a phrase, and that was actually wrong. Whoops. So what I should have actually had was another prepositional phrase that went to a noun phrase for in the kitchen, and then both of those were coming off a bigger noun phrase, like that. Whoops, see. Okay. I get it right most of the time. Okay. So this idea of dependency structure is that we're finding what the head word is and then saying which things modify the head word. And either of these representations we can use to work out what the structure of sentences is, in terms of which words go together and which words modify other words. And the basic idea is: when humans communicate, we communicate in a linear stream, right? If it's a conventional writing system, it's a linear stream of words that you're reading. If it's spoken language, like you're understanding me speaking right now, it's not even a linear stream of words; it's a linear sound stream. When people speak, there isn't white space between words. Occasionally people pause at the end of a clause or sentence or something, but in general I'm just speaking continuous words that run one into the other, so there's a linear sequence of sounds coming out of my mouth and you have to deal with all of it like that.
But if you're then thinking, oh gee, I can actually understand Chris talking, then somehow you're taking that linear stream and turning it into a meaning where certain words are modifying other words, and you have these bigger units, like constituents, that give you the meaning of the sentence. And so human listeners need to work out what modifies what in order to understand sentences correctly. Similarly, our models need to be able to understand sentence structure in order to interpret language correctly. And so for building dependency parsers, we're going to be explicitly building a neural network model that says, let's find the structure of these sentences. In a way we actually move away from that later on, because when we move on to transformer language models, they just take in the sequence of words, but actually inside the parameters of the neural network they are recognizing and building the same kind of structural units, and we'll talk about that later in the class. To give you more of a sense of how understanding what modifies what is important for interpretation, here are a few funny examples from newspaper headlines. They're funny because these sentences don't have just one way of interpreting them. When you have a sequence of words, commonly in human languages that sequence of words is ambiguous, and it relies on human interpretation of what makes sense and what goes together to work out how to read them. So here's a first example: Scientists count whales from space. Now that's ambiguous, and you can give this headline two possible readings. So how can you give it two possible readings? Yeah. One is that they are in space counting whales, and the other one is that they're whales from space. Yeah. So one possibility is: we've got this prepositional phrase here, and one possibility is that this prepositional phrase is modifying whales, so they're whales from space. And the other possibility is that it's the counting that's happening from space, so the scientists are counting from space. Okay, so that corresponds to my two pictures here. In one picture, it's the counting that is happening from space, which is actually the right interpretation of what the article is about. But in the other interpretation, we have space whales, and the scientists are counting the space whales that are arriving or something like that, so then we have the from space modifying the whales. Okay? So what we have here is a prepositional phrase which comes after a noun phrase — it's just a one-word noun phrase here, whales, that's fine — and before that is a verb. And so one place in English where you get a lot of ambiguities is from these prepositional phrases, because whenever you get prepositional phrases — and prepositional phrases are really common in English, if you think about it — it's always ambiguous as to what earlier thing in the sentence they're a dependent of. And you can put in another prepositional phrase, in the morning or something like that, and then the ambiguities just multiply. And so the important thing to notice here about human language is that human language is, in syntactic terms, globally ambiguous, right? In programming languages, you have local ambiguities in interpretation.
How many people have done a compilers class? I think very few these days. Anyone done a compilers class? Okay. It looks like fewer people have done a compilers class than a linguistics class. That's interesting. Okay. Well, I shouldn't make too many analogies to compilers classes then. You know, when I was young, that was still the old-days kind of CS curriculum, where writing interpreters and compilers was seen as the mainstay of a computer science education, but no more, I guess. Yeah. So in programming languages, you can have a local ambiguity, but ambiguities are always resolved, right? We have simple rules in programming languages, like an else is construed with the nearest if — it's a bit different in Python because there's indentation — but there are rules, so there's never global ambiguity in a programming language. But human languages just aren't like that. There's nothing that resolves which of these two readings is correct. If I made it a bigger sentence, it'd still be ambiguous; you're just meant to read it and use context and your intelligence to work out what's going on. So to take a bigger but real example, this is the kind of boring sentence that you can read in the Wall Street Journal most mornings: The board approved its acquisition by Royal Trustco Limited of Toronto for $27 a share at its monthly meeting. So what you can see in this sentence is we've got a verb, and then we've got a noun phrase, and then what? After that, we have four prepositional phrases in a row. Okay? So what do these prepositional phrases modify? What does by Royal Trustco Limited modify? The acquisition, right? So it's the acquisition by Royal Trustco. Then what does of Toronto modify? It's Royal Trustco Limited of Toronto. So yeah, later prepositional phrases can also modify earlier prepositional phrases, or at least the noun phrase inside them, Royal Trustco Limited. Okay. For $27 a share is back to modifying the acquisition. Okay, at its monthly meeting? Yeah, it's the approval. So it's gone way back up to there, right? So if you start having sentences with a whole bunch of prepositional phrases like this, you can start getting more and more ambiguities of attachment. I mean, you don't get the full free-choice factorial number of attachment points, because there is a restriction that these dependencies don't cross: once you've gone back further, you have to stay equally far back or go even further back again. But nevertheless, the number of readings you get follows the Catalan series, which is a series you see in a whole bunch of other places. If you've done any graph theory or anything like that — if you're doing triangulations, you get Catalan numbers, because you get the same property that things don't cross. So it's an exponentially growing sequence of possible readings, and it quickly gets very big. So I think when you've got four prepositional phrases, you get 13 readings, and if you have five, you get 27, and it grows up from there. So you get a lot of ambiguities. But the crucial thing to notice is that human beings read sentences like this every morning, or at least people who work in banking do, while having their corn flakes, and their brain doesn't explode trying to think about the 13 different readings and which one is correct. We just do this as we go along, and that's sort of obvious.
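As a quick illustration of how fast the Catalan series grows (a fact about the series itself; the particular counts quoted above depend on exactly how the attachment possibilities are tallied), here is a one-liner using the closed form C_n = (1/(n+1)) * binomial(2n, n):

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: C_n = binomial(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# The series grows exponentially: 1, 1, 2, 5, 14, 42, 132, ...
print([catalan(n) for n in range(7)])
```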
Okay, let's just do a couple more examples of where we get ambiguities in human language. A different one you get is coordination scope ambiguity: Shuttle veteran and longtime NASA executive Fred Gregory appointed to board. How is this sentence ambiguous? Yeah. So there can either be one person, Fred Gregory, who is both a shuttle veteran and a NASA executive, or there can be two people: there's a shuttle veteran, and there's a longtime NASA executive, Fred Gregory. Okay. Yeah. So we'd be kind of capturing those by having extra grammar rules where a noun phrase can go to a noun phrase, a conjunction, and a noun phrase. But then another thing that you get in English is apposition, where you can have a noun phrase that's a descriptive noun phrase of another noun phrase, like a name — you know, the author, Fred Gregory, or something like that. Saying the word English again, I meant to comment: I'm only going to give English examples here. In different languages, you don't get all the same ambiguities. So if you're familiar with, say, Chinese, you might have thought about the prepositional phrase example: wait a minute, we don't have that one, because the prepositional phrase modifying the verb would appear before the verb, and the object noun would be afterwards, so it would be completely unambiguous. But that doesn't mean that Chinese is unambiguous — Chinese has lots of ambiguities of its own. It's just that different languages have different syntactic structures. Okay. So sometimes in English, especially in a more written form, rather than having an explicit coordination word, you can just use juxtaposition with a comma to get the idea of coordination. So here's a fun example from the first Trump administration of how we can have a coordination scope ambiguity: Doctor: no heart, cognitive issues. Right? So again, this is the same kind of coordination scope ambiguity: it can either be heart and cognitive issues being conjoined together, with no scoping over both of them, or else it's no heart, and then separately, cognitive issues. You make the choice. Okay, let's see. Oh, this is my risqué one for a different kind of ambiguity, trigger warning: Students get firsthand job experience. So this one is also ambiguous as to whether firsthand and job are both modifying experience, or there's this other reading that, if you have a smutty mind, might come to you. Okay, one more fun one: Mutilated body washes up on Rio beach to be used for Olympics beach volleyball. Okay, so what are the two possible readings of this sentence? You know, these are real examples from actual newspapers. What are the two readings? Yeah. So here we have one of these infinitival verb phrases, to be used for Olympic beach volleyball. And these have kind of the same effect as prepositional phrases, in that they can modify different things. So it can either be the Rio beach that's going to be used for the Olympic beach volleyball, or it's going to be the mutilated body that gets used for the beach volleyball. Okay, yeah. So these are the kinds of ways in which we want to use the structure of a sentence to understand its meaning.
We also use it in lots of mundane practical ways when we're building various kinds of natural language processing systems. You know, a kind of thing that people often do in practical systems is that they want to get out facts of various kinds. So people who do stuff with bioinformatics commonly want to get out things like protein-protein interaction facts, and commonly you can get those kinds of facts out by looking for patterns. So, you know, you have a verb, interacts, and that's going to be indicating an interaction pattern. And it's going to be taking arguments: it takes a subject, and interacts takes a prepositional argument. And so that will be an interaction: KaiC, whatever that is, interacts with SasA. But in this case, the SasA is coordinated with the KaiA and the KaiB, so it's also going to end up interacting with those two other things as well. And so you can use this sort of sentence structure pattern over a dependency parse to get out the kinds of facts and events that you're interested in for something like an event understanding system. And people do these kinds of analyses over biomedical text to build up structured databases of known protein-protein interactions and things of that sort. Okay, so linguistic structure is useful, and language is syntactically very ambiguous. And so you can think of humans as active interpreters that are using their contextual knowledge — both of earlier stuff in the text and their knowledge of the world around them, how the world works — to work out the right structure. So now I want to go on and show you a bit more about dependency grammars, which is what we're going to be using. So dependency syntax postulates that you can capture the structure of a sentence with these asymmetric dependency relations, which we might just call arrows, going from heads to dependents. So here the sentence is: Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas. And we're picking out heads, and then we've got things that depend on them, that modify them. Yeah. So if you're in the video audience and you were educated in the United States and you're over the age of 50, or if you happened to go to one of those kinds of private schools where they also teach Latin, you might have seen sentence diagramming — Reed-Kellogg sentence diagramming. That was actually very widespread in American education, and it was really a somewhat quirky form of dependency grammar, where you had to write lines at different angles and stuff like that, but basically you were writing heads and their dependents underneath them with different funny-shaped lines. It also was dependency grammar. Okay, so this is the start of a dependency grammar, but just like the funny angled lines of sentence diagramming, normally people want to add some more information than that. And so most commonly the arrows are then typed by giving the name of some grammatical relation. So something can be the nominal subject, or an oblique, or an appositional modifier, or a case marker, things like that. And I'm just trying to give you the idea of dependency grammars; I'm not expecting you to master all of these names and ways of doing things. There are different systems for deciding what's a head and what's a dependent, and not all the details are important.
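Coming back to the protein-protein interaction pattern mentioned at the start of this passage, here is a hedged sketch of the kind of matching you might do over (head, relation, dependent) triples from a dependency parse. The relation names (nsubj, obl, conj) follow Universal Dependencies conventions, but the triples themselves are hand-written for illustration, not the output of any particular parser.

```python
# Hypothetical dependency triples for a sentence like
# "KaiC interacts rhythmically with SasA, KaiA and KaiB".
triples = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "obl", "SasA"),
    ("SasA", "conj", "KaiA"),
    ("SasA", "conj", "KaiB"),
]

def interaction_pairs(triples):
    """Extract (subject, partner) pairs for the verb 'interacts',
    expanding coordinated partners via conj links."""
    subjects = [d for h, r, d in triples if h == "interacts" and r == "nsubj"]
    partners = [d for h, r, d in triples if h == "interacts" and r == "obl"]
    # Anything conjoined with a partner is also a partner.
    partners += [d for h, r, d in triples if r == "conj" and h in partners]
    return [(s, p) for s in subjects for p in partners]

print(interaction_pairs(triples))
# [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```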
What you should get into your head is just the basic idea of what one of these does, and some sense that, at the phrase level, it should be representing what's modifying what. We do actually ask some questions on the assignment, and so for cases like the prepositional phrase — what is it modifying? — you should be able to give the right answer. Okay, yeah. Okay. So this is just a little bit more vocabulary. We have these arrows, or dependencies, and I'm going to say that they connect a head and a dependent, but sometimes people use other words like governor and modifier and things like that. And dependencies are generally taken — and we will be taking them — as forming a tree. So you've got something that's connected, acyclic, and has a single root. Our single root is the top of the sentence here. So, dependency grammar. Although what you see most often these days, either in a linguistics class or when you get taught CS 103 at Stanford or computer science, is normally context-free grammars or phrase structure grammars, really it is dependency grammars that have the long history. The predominant way of representing the structure of human languages throughout human history is dependency grammar. So the person linguists herald as the first dependency grammarian — really the first person who tried to write the grammar of a human language, period — was Pāṇini. So Pāṇini was working with Sanskrit. Pāṇini lived so long ago that people don't really know when he lived; he lived somewhere between about the fourth and eighth century before the common era, but really no one knows when. But he lived up in part of what's now Afghanistan, and, motivated largely by religious reasons, he set about developing a grammar of Sanskrit. And the way he represented the syntax of Sanskrit was using a dependency grammar. There was also a lot of work on grammars of Arabic in the first millennium, and they used dependency grammars. In contrast, the idea of having context-free grammars is really, really recent. The first work on phrase structure grammars dates to the forties and was then canonicalized by the work of Chomsky in the 1950s. Yeah. So as a fact for the computer science part of the audience — dear computer scientists, if you know about Chomsky, computer scientists normally know two things about Chomsky. One is they hate the Chomsky hierarchy that they're forced to learn in CS 103 or equivalent classes, and the second one is that he's very left-wing politically. I'll only deal with the first of the two. Now, the Chomsky hierarchy was not invented either to torture elementary computer scientists or to explain fundamental facts about formal language theory. The Chomsky hierarchy was actually invented in thinking about human languages, because at that time, and in stuff that came after, it was commonly the case that people were modeling human languages with regular, that is finite-state grammar equivalent, mechanisms. And Chomsky wanted to argue that that was a completely inadequate formalism to represent the complexity of human language. So it was in the context of arguments about human language that he developed the Chomsky hierarchy. Okay, yeah. So anyway, that's enough of the history of that. Here's my picture of part of Pāṇini's grammar — or a version of it. Actually, this is really, really misleading.
Because one of the astounding facts about Pāṇini's grammar — and part of why no one knows what century he lived in — is that Pāṇini's grammar was composed orally. So this kind of blows my mind. You know, some of the famous things in the West, like Homer's works, the Odyssey and the Iliad, were originally oral works that were passed down in oral form. That seems hard to do, but you can kind of believe, if you did plays in high school or something, that someone could memorize the Odyssey, perhaps. But the idea that people could memorize a grammar of a language, passing it down for hundreds of years, kind of blows my mind. But that's exactly what happened with Pāṇini and his grammar. So really, although this is an old birch-bark manuscript, it probably dates from about a millennium after Pāṇini composed his grammar. Okay, getting back to the modern day. Yeah. So, a few things to know. Yeah. I mean, we don't want you to fixate on the details of dependency grammar structure — you have the rough idea — but one thing that you could possibly be confused about is that people do things in different ways. One way in which they don't agree is even which way to draw the arrows. Some people draw arrows from the head pointing at the dependents, and other people draw the arrows starting at the dependent and pointing back at the head. Modern dependency grammar largely follows the work of Lucien Tesnière, a French linguist; he did the arrows pointing from the head to the dependent, and so that's what I'm doing today. But you'll see both. We sort of said that normally you assume you have a tree with a single root. It's kind of common, and it works out more easily for the parsing, if you add to a sentence a sort of fake root node, so you know that that's going to be the starting point; it's going to take one dependent, which is the word that's the head of the sentence, and then you work down from there. Okay. So before getting more into doing dependency parsing, I just wanted to take a little detour to tell you about the importance of the rise of annotated data in natural language processing. This is sort of an interesting flip-flop that's occurred: today we're going to go in one direction, and in a later class we'll go in the other direction. So in early natural language processing, people started to say, all human languages have structure, so what we should do is start writing rules for the structure of human languages. And you know, I started writing a few context-free grammar rules for the structure of English on that earlier slide, and you could also write dependency grammar structure rules. So people tried to do natural language processing by having rules — grammar rules, dictionaries of parts of speech, and things like that. And that gave you parsers that, in retrospect, worked out pretty badly. And it worked out pretty badly for a number of reasons. One reason is that although there are these very canonical, clear structures in human languages, there's a very long tail of messy stuff where all kinds of weird usages start to emerge, which means it's just really hard to get coverage with a handwritten grammar. And that's because people, humans, use language creatively, right?
So, you know, you can start thinking of some of the things that you've probably come across — I'm probably not very good at young people's slang usages of grammar these days, but the kind of ones that you might still be familiar with, right? Star Wars: you have Yoda talk, where you rearrange the sentences, but people still understand them. So that's changing the word order. And earlier on than that, there was a bit of a fad for putting not at the end of sentences: "That's a really great idea. Not." And well, people learned to understand that, but it's different from regular grammar. So it's really hard to write a full grammar, but the bigger reason actually is the problem of ambiguity I talked about. If you just write a grammar, well, my sentence with the prepositional phrases had 13 different parses, and you didn't have much reason to choose between them. But if you had information about how often words modify other words, then you could get some statistics and start to predict which things modify which other things. And so people wanted to be able to do that prediction, which underlies probabilistic or machine learning models. And to be able to do that — with the earliest beginnings in the sixties, but really starting in the late eighties and into the nineties — people decided the way to make progress in natural language processing and natural language understanding was to build annotated data resources. So all through the nineties and the two thousands, the name of the game for a lot of natural language processing was people building annotated data resources and then building machine learning systems on top of those resources. Now that's kind of gone into reverse and gone away again with large language models, which we'll get to in another week or so. But here's an example. This is the Universal Dependencies treebanks, which I've actually been heavily involved with. And it's a cool resource for all kinds of purposes, because it's a wide cross-linguistic database where there are over 100 different languages with sentences parsed with a uniform dependency formalism. So it's actually really good for things like cross-linguistic work and psycholinguistic work. But what these are is taking sentences — I think Miramar was a famous goat trainer or something — and putting a dependency structure on them. It's all written there, sort of very squished down, and human beings are producing these dependency structures. And this gives us data that we can learn things like dependency parsers from, and indeed, for what you do in homework two, this is precisely what you'll be using: data of this sort to build a dependency parser. And it's going to learn that you have goat trainers and you have famous trainers, and so it'll build up statistics and information to predict what kinds of things are likely. Yeah. So starting off building a treebank like that feels kind of like, oh, this is going to be slow, hard work. And it is actually slow, hard work, but it proved to be a very effective strategy, because it gave wonderful, reusable resources that, once people had made them once, all sorts of people could use to build parsers and part-of-speech taggers, to do psycholinguistic models and all kinds of things. You get the sort of distributional frequency information that's good for machine learning.
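For concreteness, Universal Dependencies treebanks are distributed as plain-text CoNLL-U files, with one token per line and ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Here is a minimal sketch of pulling (word, head, relation) triples out of such a file; the file name in the usage comment is just a placeholder.

```python
def read_conllu(path):
    """Yield one sentence at a time as a list of (form, head, deprel) triples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):                 # sentence-level comments
                continue
            if not line:                             # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:     # skip multiword / empty tokens
                continue
            form, head, deprel = cols[1], int(cols[6]), cols[7]
            sentence.append((form, head, deprel))
    if sentence:
        yield sentence

# for sent in read_conllu("some-ud-treebank.conllu"):  # placeholder file name
#     print(sent)
```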
It also provided one other thing that's crucial: it gave a method to evaluate systems, to say how good they are at producing parses. I mean, this may seem kind of comical to you in the modern era of machine learning, but the fact of the matter is, when people did natural language processing in the fifties, sixties, and seventies, nobody had evaluation methods. The way you showed people you had a good parser is you ran the program, said "type in a sentence, look at that, it worked, it's a really good parser." There was no systematic evaluation of NLP systems whatsoever. So actually saying, look, here's a thousand hand-parsed sentences, let's evaluate how well your parser does on those — that was a revolutionary new development that happened at the end of the eighties, but especially in the nineties. Okay. So now that we have all of this knowledge, we're going to want to start building dependency parsers. And I'm going to show a particular way of doing dependency parsing, which is the one you're going to use in the assignment. But first off, it's worth just thinking for a moment: what kind of information should a dependency parser have to make decisions? These are the four factors, the sort of obvious things that are useful for dependency parsing. The first one is thinking about the two words at the ends of the arrow and whether they are plausible. So for the discussion of the outstanding issues was completed, having discussion as a dependent of of issues — that's a plausible dependency. What's a silly one? Something like the being a dependent of completed makes no sense at all. So: what words are involved. The second one is dependency distance. You can have long-distance dependencies that go a long way, but most dependencies are short distance — a lot of words depend on their neighboring words — so that's a good preference to have. As well as the distance, it's somewhat informative to know what's in between: it's rare for dependencies to span verbs or punctuation. And then there's a final one, which is the valency of heads, that is, how many arguments they take. So if you have something like the verb broke, it probably has something to its left, because there's probably someone who did the breaking, and it probably has something to its right, because there might be the cup or something like that. But it doesn't have to be like that, because it could be the cup broke, so you can have something to the left but nothing to the right — though you pretty much have to have something to the left. And conversely, you can't have just any number of things; you can't say he broke the cup the saucer the dish, right? So it doesn't take lots of arguments on that side. So you've got a notion of valency like that. Yeah. There's one other tricky little notion in dependency parsing, which is that normally dependencies kind of nest like this, and nesting dependencies corresponds to a tree structure as you'd have in a context-free grammar. Yeah? [Audience question about why the root arrow points to the verb rather than to discussion, which seemed like the most important word.] Fair enough — I will assert that this is a sentence and discussion is the subject of the verb completed. And normally, for a sentence, we say the main thing in a sentence is its verb.
And so yeah, that's why the root is pointing to completed. And the subject of the verb is also an important thing. But the arguments of the verb — the subject of the verb, the object of the verb if there is one, perhaps other phrase modifiers — they're all taken as dependents of the verb. Yeah — sorry, following up on that: when is it not the verb that you start with? So if you have a sentence with a verb like this, that is always the answer. I mean, some of the details here depend on the language. There are languages in which you don't have to have a verb in a sentence, and you can do that in very restricted ways in English too, right? If you just say "easy as pie", there's no verb, and then you'd say easy, the adjective — the predicate adjective — is the head of the sentence. With a question like "what is the story?", would you still have the is as the head? That one is complicated. Some people would say it is and some people would say it isn't, and in particular, in Universal Dependencies, we don't actually say that it is the head of the sentence. But I don't want to get too far into this — if you want, you can look more at how things are done. I want to fully admit that dependency grammar isn't one uniquely defined theory. People have had different ideas of which things to take as the head in various circumstances, and they argue about it. Linguists argue about what the right structure is to put over all sorts of sentences, but the fact that people do things in different ways doesn't mean that everybody doesn't agree that there are units, there are phrases and modifiers and ambiguities and so on between them. Okay, yeah. So normally we get this sort of nesting, which corresponds to what you can build with context-free grammar structure. But sometimes in human languages you get dependencies that don't nest. You get sentences like I'll give a talk tomorrow on neural networks, where the on neural networks is modifying the talk, whereas the tomorrow is an argument of give. So you get these crossing dependencies, which are referred to as non-projective dependencies. You also get them when you form questions. So: Who did Bill buy the coffee from yesterday? The who is the object of the preposition from that's been moved out to the front, and so that again gives us non-projectivity, if you think about it. You can still say that you have a dependency tree, but it's got the words in different orders. And so one of the things that you have to cope with for full dependency parsing is dealing with this non-projectivity. But actually, we're not going to deal with it in our parsers; we're only going to do projective dependency parsing. Okay? So there are various ways that people do dependency parsing. People have done it by dynamic programming. People have done it using graph algorithms — if I have enough time at the end, I might mention that again. People have done it with constraint satisfaction methods, if you saw those in CS 221. But the most common way in practice that's emerged has been transition-based parsing, which is kind of interesting as well and gives us a very simple machine learning mechanism, so it makes it good for assignment two. And so that's what we're going to explore here. Okay?
So what we do in greedy transition-based parsing is — and this is where it's unfortunate that only a couple of people in the class have done a compilers class — a simple form of parsing that's also used in compilers classes, something called shift-reduce parsing, where you start bottom-up and put little units together to build bigger constituents. But since most people haven't seen it, that's not going to be very much help, so I'm going to give you a concrete example. So the things to know are: we have two data structures — well, more than two, I guess — for dealing with the sentence. We have a buffer, which has the words of our input sentence, and then we start building pieces of sentence structure, which we put on a stack. And the little trick to know is that for the buffer, the top is written to the left, and for the stack, the top is written to the right. We take actions which are like shift and reduce actions, and when we take arc-building actions, we build up a set of dependency arcs, which are going to be the dependency structure of our sentence. And that's all incredibly abstract, so I'm going to show an example, which hopefully gives a bit of the idea. So here's an example: the very simple example of parsing the sentence I ate fish. The way I do this is I have my stack, and I start by putting the root symbol on my stack, and then I have in my buffer all the words of the sentence. That's the start condition I've written in very small print there. Then for each step of processing, I have a choice of three operations. I can either shift, which moves the top word on the buffer onto the stack, or I can do left-arc or right-arc, and these are my two reduce operations that build a little bit of syntactic structure by saying that one word is a dependent of another word, in either a left or a right direction. So here's a sequence of operations I can take. Starting off, the first thing I can do is shift, so then I've moved I onto the stack. I can decide that I want to shift again, and so then I take ate and also move it onto the stack, and now I've got three things on my stack. At this point, I can do other things. In particular, a left-arc says: take the top two things on the stack, make the thing on top the head, and the thing one down on the stack a dependent of it. So if I do a left-arc operation, I'm effectively saying that the I is a dependent of ate, and then I pop the dependent off the stack, but I add on that I've built a dependency making I a dependent of ate. I could then do another shift operation, so I shift fish from the buffer onto the stack, and then I can do a right-arc, which says, okay, I'm going to have fish as a dependent of ate. So then fish disappears from the stack and I add in this new dependency saying fish is a dependent of ate. I then do right-arc again, which says that ate is a dependent of root. So I'm left with just root on my stack, and I've built a new dependency saying ate is a dependent of root. And at this point, I've reached the finishing condition: my buffer is empty and my stack contains just the root symbol. And so this gives me a little set of operations referred to as the transitions of transition-based parsing.
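Here is a minimal sketch of that stack-and-buffer bookkeeping, replaying the transition sequence just described for "I ate fish" with the choices hard-coded; in the setup described next, a classifier would choose each transition instead.

```python
def parse(words, transitions):
    """Apply a fixed sequence of SHIFT / LEFT-ARC / RIGHT-ARC transitions.
    Returns the list of (head, dependent) arcs that were built."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))      # move the next buffer word onto the stack
        elif t == "LEFT-ARC":
            dep = stack.pop(-2)              # second-from-top becomes a dependent of the top
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":
            dep = stack.pop()                # top becomes a dependent of second-from-top
            arcs.append((stack[-1], dep))
    return arcs

arcs = parse(
    ["I", "ate", "fish"],
    ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"],
)
print(arcs)  # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```

After the last transition the buffer is empty and the stack holds only ROOT, which is exactly the finishing condition described above.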
And by making a sequence of these different transitions, I can build sentence structure. I've got choices of when to shift and when to reduce, and whether to reduce left or reduce right — left-arc or right-arc. By making different choices, I could make any structure for the sentence that I wanted. So if I somehow thought this sentence should have a different structure, with I as the head, ate a dependent of it, and fish a dependent of that, I could achieve this by making different choices — I'd now be doing a right-arc operation so that ate would become a dependent of I rather than the other way around. So the choices of which operations I take determine the syntactic structure, the set of dependencies that I have built, which are my set of dependencies down here. Now, the set of dependencies I built was exactly the right one, because at each step I took the right operation. And the essential idea of transition-based parsing, and where it came to the fore, was: there was a particular guy — I've got a photo of him somewhere in a bit, I thought. So Joakim Nivre is a Swedish NLP person, and in the early two thousands he came up with the idea that, rather than doing the kind of dynamic programming and chart parsing that people commonly used to do with parsers, these days we have machine learning, so maybe we could build a fast, efficient parser by making a sequence of transitions, and it'd be the job of the machine learning to predict what the right transition is at each point in time. If you do that, at each point you're dealing with one thing, and so the number of operations you do to parse a sentence is linear. This gives a linear-time parsing algorithm. Whereas if you've seen context-free grammars and stuff like that in CS 103 and you want to do anything where you're fully considering the parses and structures of context-free grammars, you've got a cubic-time algorithm, which is much less pleasant to deal with. So for the simplest form of transition-based parsing, you do no search whatsoever. At each step you're just predicting the next transition, so you're doing this sequence of transition predictions as machine learning operations, and that sequence gives you the parse structure of the sentence. And the essential result that Nivre was able to show is that machine learning is good enough that you can do this and get a very accurate parser, despite the fact that it does no search whatsoever — it's just doing predictions in this way. Okay. How did he do it? So when he did this in 2005, that was before neural networks came to the fore, and so the way he was doing it was by using an older-style, symbolic feature-based machine learning system. He had a big classifier, which might have been a logistic regression classifier or something else, like a support vector machine. And to power that, he was using indicator features. The kind of features you'd use are: the word on top of the stack is the word good and its part of speech is adjective; or the word on top of the stack is good and the word that's second on the stack is the verb has. Right, you get these sorts of combinations of matching functions, and they would be used as features in a machine learning system to predict the parse.
But the problem is that once you started building these features, with conjunctions of multiple terms, you ended up with millions and millions of features, because you're putting particular words in features and then combining choices of multiple words. So you had to deal with millions and millions of features, and furthermore, individual features were exceedingly sparse — you barely ever saw them. You'd have a feature that only turned up ten times in a million sentences, because it was matching a very precise configuration. So on the one hand, by making these feature conjunctions, parsing got more accurate — and indeed, people produced pretty accurate parsers in those days — but the parsers had these sorts of unappealing characteristics. Yeah. So before going on further, I should just explain how we evaluate dependency parsers. To evaluate dependency parsers, we're basically assessing: are the dependency arcs, the arrows you're proposing, right? So here is someone's dependency parse — well, actually, sorry, that's the gold parse, the correct parse: she saw the video lecture. You can write out what the different dependencies are: word one's head is two, word two's head is zero, word three's head is five, word four's head is five, word five's head is two. So these pairs of numbers represent our dependencies. Then if someone proposes a parse of the sentence, you can literally say, okay, which of these did they get right? So they didn't get this one right; they got the rest of them right, so their accuracy is 80%. Sometimes people just assess the arcs unlabeled, and that's referred to as unlabeled dependency accuracy. But sometimes people also want to label them with subject, determiner, object, etc., and ask whether you're also getting the labels right. In this case, only two of the five labels are right, so the labeled accuracy of the dependency parse is 40%. Okay, so that was sort of what people did until the mid-2010s. And I've already started saying this: the problems with indicator features were that they were sparse — you didn't see them often — and they were incomplete, because there are some words and combinations you'd seen and some you just didn't see in the training data, so you're missing features. But the final problem is that just computing all those symbolic features was expensive. It turns out that if you did a runtime analysis, most of the time in parsing wasn't in making the machine learning decisions; it was simply in computing the features that you put into the dependency parser. So as neural nets started to show that they were successful for things, that suggested that maybe you could build a better dependency parser by using a neural transition-based dependency parser, which would benefit from the kind of dense and compact feature vector representations that we've already started to see. And so that's what started to be explored. In particular, Danqi Chen, who was then a PhD student of mine and was head TA of 224N twice, actually, in the earlier days — she built a neural transition-based dependency parser and showed the success of this method. So this was Nivre's transition-based dependency parser; people had also explored other methods of dependency parsing, and these were two graph-based dependency parsers.
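As a small sketch of those two metrics over lists of (head, label) pairs like the ones just read off the slide — the particular gold and predicted values below are illustrative, chosen only so they reproduce the 80% / 40% numbers from the lecture:

```python
def uas_las(gold, pred):
    """gold and pred are lists of (head_index, label), one entry per word.
    UAS counts correct heads; LAS additionally requires the correct label."""
    assert len(gold) == len(pred)
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# "She saw the video lecture" -- heads are 1-indexed, 0 means ROOT.
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "compound"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(uas_las(gold, pred))  # (0.8, 0.4) -> UAS 80%, LAS 40%
```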
And essentially, for the kind of symbolic-feature machine learning methods, Nivre's parser was really fast, because it was using this linear transition-based parsing idea, while the graph-based dependency parsers were way, way slower — about 50 times slower — but slightly more accurate; you can see here that you're getting a bit better numbers. So essentially, what Danqi was able to show was that you could build something basically as accurate as the best known graph-based dependency parsers, but fast like other transition-based parsers. Indeed, despite the fact that you might think, now I've got real numbers and matrices and stuff, surely that should be slowing me down, the reality was that the symbolic models spent so much time on feature computation that you could actually make it faster at the same time by using the neural network. Okay. So how did that work? Well, we've already seen word embeddings, so it's going to exploit word embeddings so it can use word representations. That has the advantage that even if you haven't seen particular words in particular configurations, you've seen similar words, and so it can exploit what's likely in terms of word similarity. But it went a bit further than that, because why only have distributed representations of words? We also have parts of speech, and although I just said noun, verb, adjective, most actual part-of-speech systems in NLP are much more fine-grained. They have different parts of speech for plural nouns versus singular nouns — different symbols, but very similar to each other — so we might give them distributed representations so they're also close to each other. And the same for the types of our labels for dependencies: some of them are pretty closely related as well. So all of these were given distributed representations. Then, to represent the state of the dependency parser for predicting transitions, you had the same kind of stack and buffer, and you were taking the key elements of the stack and the buffer — essentially the first thing on the buffer, the word that you would be shifting if you do a shift, and the two things at the top of the stack, which are the things you're considering combining if you do either a left-arc or a right-arc. For those, you take the distributed representations of the words and their parts of speech, and also, with a bit more complexity, for dependencies you've already constructed — if something on the stack is already involved in a dependency. For each of those, we take their distributed representations and just concatenate them together to produce a big vector, in the same way we were concatenating together the five words in the last class for predicting whether something was a location or not. And then we feed that into our neural network. So our input layer is our concatenated distributed representations. We put that through a hidden layer, which is, like we were talking about last time, Wx plus b, then put through a ReLU nonlinearity, and then above that we have the same kind of thing again — another multiply by a matrix, so we've got a second layer of the neural network, times a second weight matrix plus b2 — and we take the output of that.
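Here is a minimal PyTorch sketch of that kind of feed-forward transition classifier: embedding lookups for a handful of stack/buffer words, their tags, and arc labels, concatenation into one big vector, a hidden layer with a ReLU, and an output layer whose logits feed the softmax over SHIFT / LEFT-ARC / RIGHT-ARC described in the next paragraph. All sizes and feature counts are illustrative, not the exact Chen and Manning configuration.

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Predict the next transition from embedded features of the parser state."""
    def __init__(self, n_words, n_tags, n_labels, emb_dim=50,
                 n_features=18, hidden_dim=200, n_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.tag_emb = nn.Embedding(n_tags, emb_dim)
        self.label_emb = nn.Embedding(n_labels, emb_dim)
        # Three feature groups, each a fixed number of ids, concatenated together.
        self.hidden = nn.Linear(3 * n_features * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)

    def forward(self, word_ids, tag_ids, label_ids):
        # Each input: (batch, n_features) integer ids read off the stack/buffer.
        x = torch.cat([
            self.word_emb(word_ids).flatten(1),
            self.tag_emb(tag_ids).flatten(1),
            self.label_emb(label_ids).flatten(1),
        ], dim=1)
        h = torch.relu(self.hidden(x))   # Wx + b, then the nonlinearity
        return self.out(h)               # logits; softmax is applied in the loss

# Training would apply nn.CrossEntropyLoss to these logits against oracle transitions.
```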
And then we put that through a softmax that gives a probability distribution over whether to shift or do a left-arc or a right-arc operation. And the other way this crucially gave us more power is that other people's dependency parsers were still using linear classifiers, things like support vector machines or logistic regression, whereas we had a deep neural network that gave us a nonlinear classifier, and that's why we could be more accurate than previous transition-based parsers. So this essentially showed that you could build a very accurate neural dependency parser, one that outperformed symbolic probabilistic representations and was basically as good as any other dependency parser that was known. Back a decade ago, this was a big hit. People got very excited about it. The people at Google got very excited about it, because this gave a scalable way — remember, it's linear time — in which you could efficiently go off and parse the entire web. So they did some further work on taking that model and improving it: they made a deeper neural network version with bigger vectors and better-tuned hyperparameters, and they added beam search. I've just presented the greedy version, where you always immediately make the best choice, but you can improve these parsers by doing some amount of search — that does help. And so they put that up, and rather than our kind of 92 UAS here, they got it to 94, 94.6. And I mean, you're probably all too young to remember this, but really, at the time, in 2016, Google did their kind of typical big PR splash for a dependency parser, which kind of blew my mind, since I didn't ever think that anyone was going to be writing articles in Wired and VentureBeat and those kinds of tech blogs about parsers. But Google had it all over the place as the world's most accurate parser, and they gave it a silly name, Parsey McParseface, which really worked well for getting lots of media pickup. And so that was then a very successful parser. I've still got a couple of minutes left, so let me just do the last three slides to show you another way of doing things, which is also a powerful parsing method that is commonly used. So that was transition-based parsing, and that's what you will use in assignment two. Another way of doing things with dependencies and parsing, which can also be done neurally, is what's referred to as graph-based dependency parsing. In graph-based dependency parsing, what you do is, for each word, ask: what am I a dependent of? So if the sentence is the big cat sat, each word — for example, big — has to be a dependent of one of the other words in the sentence, including the possibility of root, so four options here. So we ask: am I a dependent of the? Am I a dependent of root? Am I a dependent of cat? Am I a dependent of sat? And we want to score each of those possibilities, and hopefully we decide the most likely one is that big is a dependent of cat. And then we do the same for every other word. So sat could be a dependent of any of these words, and we can ask which of these words it is most likely a dependent of — big to sat, cat to sat, the to sat? Sorry, that's unreadable now, but hopefully we decide that sat, as the verb, is most likely a dependent of root. So we're scoring the n-squared possible dependencies of the sentence, and each one is given a score.
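As a toy sketch of that scoring idea for the big cat sat: fill a table of head-dependent scores (the numbers below are invented for illustration; a neural graph-based parser would compute them from learned representations of each word pair) and pick the best-scoring head for each word. A real parser would instead run a spanning-tree algorithm such as Chu-Liu/Edmonds over these scores rather than the independent argmax shown, for the reason described next.

```python
import numpy as np

words = ["ROOT", "the", "big", "cat", "sat"]

# scores[h][d]: how plausible it is that words[h] is the head of words[d].
scores = np.array([
    #   ROOT  the   big   cat   sat
    [  -9.0, -3.0, -3.0, -2.0,  8.0],   # ROOT as head
    [  -9.0, -9.0, -1.0, -2.0, -3.0],   # "the" as head
    [  -9.0, -2.0, -9.0, -1.0, -3.0],   # "big" as head
    [  -9.0,  6.0,  5.0, -9.0,  0.0],   # "cat" as head
    [  -9.0,  1.0,  1.0,  7.0, -9.0],   # "sat" as head
])

# Greedy head choice per word (skip column 0, which is ROOT itself).
heads = scores[:, 1:].argmax(axis=0)
for d, h in enumerate(heads, start=1):
    print(f"{words[d]:>4} <- {words[h]}")
# the <- cat, big <- cat, cat <- sat, sat <- ROOT
```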
Let me go to this one, it's cleaner. Okay? We've decided the good one there. And we're going to do this using some of the same features we talked about: looking at the words at each end, looking at what occurs between them, looking at what occurs around them, thinking about those things. And then, once we've done that, the only other constraint is, well, we want the dependencies to form a tree, so we need to do something like a minimum spanning tree algorithm to find the minimum cost tree, because we don't want to find a solution where there are cycles or parts of the sentence end up disconnected from each other. And so that's graph-based dependency parsing. And so, just as in the older symbolic parsing days, where the graph-based dependency parsers were more accurate than the transition-based parsers, we then started doing some work on neural graph-based dependency parsing. And here's our neural graph-based dependency parser, which was then a bit over a percent more accurate than Parsey McParseface, the world's best dependency parser. So that got us to 2017. Obviously this is still a few years ago, but to get further into the latest parsing stories, we would then need to get into the era of large language models, which I'm not doing today. But it's this neural graph-based dependency parser that's in Stanza, our open-source parsing software that's available, and you can see it's using this algorithm as it's the more accurate one. Okay, so now you hopefully know everything about syntactic structure, constituency and dependency parsing, and are fully qualified to do assignment two. So good luck with that. Thanks.
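As a rough sketch of the graph-based idea just described: score every possible (head, dependent) pair, then decode a tree. The scorer below is a toy stand-in rather than any real model, and for brevity the decoder just takes the best head for each word independently; a real graph-based parser would run the Chu-Liu/Edmonds spanning-tree algorithm so the result is guaranteed to be a tree.

```python
import random

random.seed(0)

def score_arc(head, dep):
    # Toy stand-in for a learned scorer: a real graph-based parser would use
    # a neural network over representations of the two words and their context.
    return random.random()

words = ["the", "big", "cat", "sat"]
nodes = ["ROOT"] + words

# Score all of the roughly n^2 candidate (head, dependent) arcs.
scores = {(h, d): score_arc(h, d) for d in words for h in nodes if h != d}

# For brevity, decode greedily: pick the best-scoring head for each word.
# A real graph-based parser instead runs a spanning-tree algorithm
# (Chu-Liu/Edmonds) so the selected arcs are guaranteed to form a single
# tree with no cycles and no disconnected pieces.
for d in words:
    best_head = max((h for h in nodes if h != d), key=lambda h: scores[(h, d)])
    print(f"{d:4s} <- {best_head}  (score {scores[(best_head, d)]:.2f})")
```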
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture covers syntactic structure in natural language processing, with a focus on dependency parsing. It first distinguishes the two main ways of representing syntactic structure, phrase structure (constituency grammar) and dependency grammar, and explains why the latter is an intuitive way to capture modification relations between words. The lecture stresses that understanding syntactic structure is essential for resolving linguistic ambiguity, and walks through several ambiguity types (such as prepositional phrase attachment and coordination scope) and their effect on meaning.
The core of the lecture turns to how dependency parsing is actually done. It reviews the historical roots of dependency grammar (for example, Pāṇini in ancient India) and the key role that annotated data (treebanks such as Universal Dependencies) played in the rise of modern NLP. It then explains transition-based dependency parsing in detail: the basic machinery (stack, buffer, and the SHIFT / LEFT-ARC / RIGHT-ARC operations), the early symbolic-feature machine learning approaches (Nivre, 2000s), and their limitations (sparse features, expensive feature computation).
The centerpiece is neural dependency parsing. The lecture presents Chen and Manning (2014), which brought distributed representations (word, part-of-speech, and dependency-label embeddings) into a transition-based parser and used a neural network (a multilayer perceptron) to predict transitions, markedly improving both accuracy and efficiency and leading to systems such as Google's Parsey McParseface (2016). Finally, it briefly introduces the other major approach, graph-based dependency parsing, and the even higher accuracy its neural versions achieve. The material ties directly into the course assignment, in which students build their own neural dependency parser.
Syntactic Structure: Constituency and Dependency
Speaker 1 first introduces the two main ways linguists think about syntactic structure:
- Phrase Structure / Constituency Grammar:
  - Breaks a sentence down into nested phrases (constituents).
  - Words are first grouped into parts of speech, such as:
    - Noun: cat, linguistics
    - Adjective: cuddly
    - Determiner: the (also called an article)
    - Preposition: by, through, on, to
  - Words combine into larger units, such as:
    - Noun Phrase (NP): the cuddly cat
    - Prepositional Phrase (PP): by the door
  - Context-Free Grammars (CFGs) can express these structural rules, for example (see the sketch after this section):
    - NP -> Det (Adj)* N (a noun phrase can consist of a determiner, zero or more adjectives, and a noun)
    - NP -> NP PP (a noun phrase can be followed by a prepositional phrase)
    - PP -> P NP (a prepositional phrase consists of a preposition and a noun phrase)
- Dependency Grammar:
  - Core idea: sentence structure is expressed through dependency relations between words, in which one word (the head) governs another word (its dependent, also called a modifier or an argument).
  - Relations are usually drawn as arrows from the head to the dependent.
  - Example: "look in the large crate in the kitchen by the door"
    - "look" is the head of the whole sentence.
    - "in" (the large crate) modifies "look".
    - "crate" is the head of "in the large crate" and is modified by "large" and "the".
    - "in" (the kitchen) and "by" (the door) also modify "crate".
  - Speaker 1 points out that this representation focuses more directly on the modification relations between words.
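The CFG fragment above can be tried out directly. Here is a minimal sketch using NLTK (assumed to be installed); NLTK's grammar notation has no (Adj)* shorthand, so the optional adjective is spelled out as a second NP rule, and the tiny lexicon is an illustrative addition rather than anything from the lecture.

```python
import nltk

# The optional adjective is written out as a second NP rule, and the lexicon
# is purely illustrative.
grammar = nltk.CFG.fromstring("""
    NP  -> Det N | Det Adj N | NP PP
    PP  -> P NP
    Det -> 'the'
    Adj -> 'cuddly'
    N   -> 'cat' | 'door'
    P   -> 'by'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat by the door".split()):
    tree.pretty_print()  # shows the nested constituent (phrase) structure
```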
Why Syntactic Structure Matters: Ambiguity in Language
Understanding syntactic structure is essential if machines are to interpret language correctly, because human language is full of ambiguity.
- How humans understand language: people turn a linear stream of words or sounds into a hierarchical meaning representation, working out which words modify which.
- What models need: NLP models likewise have to work out syntactic structure to get the semantics right.
  - "When we move to Transformer language models, they take in the word sequence directly, but inside the network's parameters they effectively identify and build the same kinds of structural units."
- Sources and types of ambiguity:
  - Prepositional phrase (PP) attachment ambiguity:
    - Example: "Scientists count whales from space."
      - Reading 1: "from space" modifies "whales" (the scientists count whales that come from space).
      - Reading 2: "from space" modifies "count" (the scientists count whales from space; this is the intended meaning).
    - Prepositional phrases are extremely common in English, and what they attach to is often unclear.
    - Human language is globally ambiguous, unlike programming languages, where ambiguity is usually local and resolvable.
    - A harder example: "The board approved its acquisition by Royal Trustco Limited of Toronto for $27 a share at its monthly meeting." With several prepositional phrases in a row, the number of possible attachments grows exponentially (it follows the Catalan numbers; see the sketch after this section). Humans usually resolve this quickly using context and world knowledge.
  - Coordination scope ambiguity:
    - Example 1: "Shuttle veteran and longtime NASA executive Fred Gregory appointed to the board."
      - Reading 1: Fred Gregory is one person who is both a shuttle veteran and a longtime NASA executive.
      - Reading 2: two people were appointed, a shuttle veteran and the longtime NASA executive Fred Gregory.
    - Example 2: "Doctor: no heart, cognitive issues" (an example from the Trump administration era).
      - Reading 1: (no heart) and (cognitive issues), i.e., no heart problems but there are cognitive issues.
      - Reading 2: no (heart or cognitive issues), i.e., neither heart problems nor cognitive issues.
  - Other ambiguities:
    - Lexical/structural ambiguity: "Trigger warning: students get firsthand job experience." (One reading is simply that students gain hands-on work experience; the summary notes the original headline also invites a cruder reading.)
    - Infinitive (purpose clause) attachment ambiguity: "Mutilated body washes up on Rio beach to be used for Olympics beach volleyball."
      - Reading 1: the Rio beach will be used for Olympic beach volleyball.
      - Reading 2: the mutilated body will be used for Olympic beach volleyball.
- Practical uses of syntactic structure:
  - Information extraction, such as extracting protein interaction relations from biomedical text, for example by reading facts off dependency structures like "X interacts with Y".
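As a small worked illustration of the combinatorics mentioned under PP attachment: the number of ways to attach a run of prepositional phrases grows with the Catalan numbers. A minimal computation follows (how exactly the count maps onto a given sentence depends on how the problem is set up; the point is just the growth rate).

```python
from math import comb

def catalan(n: int) -> int:
    # C_n = comb(2n, n) // (n + 1): the number of distinct binary bracketings.
    return comb(2 * n, n) // (n + 1)

# Rough illustration: with k prepositional phrases competing for attachment
# sites, the number of possible structures grows on the order of Catalan(k).
for k in range(1, 7):
    print(k, catalan(k))   # 1, 2, 5, 14, 42, 132
```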
Dependency Grammar and Dependency Treebanks
- Properties of dependency syntax:
  - Asymmetric dependency relations (arrows) connect a head to its dependent.
  - The arrows usually carry a dependency type label, such as nsubj (nominal subject), obj (object), amod (adjectival modifier), or det (determiner).
  - The dependencies form a tree: connected, acyclic, with a single root.
  - A dummy ROOT node is usually added to the sentence as the starting point.
- Historical roots:
  - Dependency grammar has historically been the dominant way of representing syntax.
  - Pāṇini: the ancient Indian grammarian (roughly the 4th to 8th century BCE) regarded as the first dependency grammarian; his Sanskrit grammar uses dependency structure and was originally transmitted orally.
  - Arabic grammatical work in the first millennium CE also used dependency grammar.
  - By contrast, phrase-structure grammar is a recent development (1940s, with Chomsky in the 1950s).
  - Speaker 1 mentions that the Chomsky hierarchy was originally proposed to argue that human language is more complex than simple models such as finite-state automata.
- Dependency treebanks:
  - Early NLP systems were mostly rule-based and did not work well, because:
    - Language use is complex and varied, with a long tail of phenomena that rules cannot fully cover.
    - Ambiguity is hard to resolve, and rule systems have no effective way to choose the correct analysis.
  - Turning point (late 1980s to 1990s): building annotated datasets became a major direction in NLP research.
  - Universal Dependencies (UD) treebanks: a major cross-lingual project with treebanks for more than 100 languages under a common dependency annotation scheme. The course assignment uses this kind of data (a minimal CoNLL-U reading sketch follows this section).
  - Why treebanks matter:
    - Reusable, high-quality annotated data.
    - Statistical frequency information for machine learning models.
    - A standardized way of evaluating NLP systems (before treebanks, evaluation was often unsystematic).
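Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token line has ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Below is a minimal, self-contained reading sketch; the three-token example annotation is made up for illustration rather than copied from a treebank.

```python
# The sample annotation below is illustrative, not taken from a real treebank.
# Comment lines and multiword-token ranges are skipped for simplicity.
SAMPLE = """\
1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tate\teat\tVERB\t_\t_\t0\troot\t_\t_
3\tfish\tfish\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def read_conllu(block):
    """Yield (id, form, head, deprel) for each token line."""
    for line in block.splitlines():
        if not line or line.startswith("#") or "-" in line.split("\t")[0]:
            continue
        cols = line.split("\t")
        yield int(cols[0]), cols[1], int(cols[6]), cols[7]

tokens = list(read_conllu(SAMPLE))
for tok_id, form, head, deprel in tokens:
    parent = "ROOT" if head == 0 else tokens[head - 1][1]
    print(f"{form} <-{deprel}- {parent}")
# Prints:
# I <-nsubj- ate
# ate <-root- ROOT
# fish <-obj- ate
```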
Transition-based Dependency Parsing
This is one of the main approaches to dependency parsing.
- Information a parser uses to make its decisions:
  - Bilexical affinity between head and dependent (e.g., "discussion of issues" is plausible).
  - Dependency distance (most dependencies are short).
  - Intervening material (dependencies rarely span a verb or punctuation).
  - Valency of the head (how many arguments of which kinds a word tends to take).
- Projectivity:
  - Projective dependencies: the arcs do not cross, and the structure can be mapped onto a context-free tree.
  - Non-projective dependencies: the arcs cross. For example, in "I'll give a talk tomorrow [on neural networks]", "on neural networks" modifies "talk" while "tomorrow" modifies "give". Questions often have them as well.
  - The parser built in this course handles only projective dependencies.
- The transition-based parsing procedure (greedy, decision by decision):
  - Data structures:
    - Stack: holds words and partial structures that have been (or are being) processed, with the top of the stack on the right.
    - Buffer: holds the remaining unprocessed input words, with the front of the buffer on the left.
    - Set of arcs: stores the dependencies built so far.
  - Core transitions:
    - SHIFT: move the first word of the buffer onto the top of the stack.
    - LEFT-ARC (LA): add a dependency with the top-of-stack word as head and the second-from-top word as its dependent, then remove the dependent (the second-from-top word) from the stack.
    - RIGHT-ARC (RA): add a dependency with the second-from-top word as head and the top-of-stack word as its dependent, then remove the dependent (the former top word) from the stack.
  - Worked example: "I ate fish" (ROOT I ate fish); see the code sketch after this section.
    - Start: Stack: [ROOT], Buffer: [I, ate, fish]
    - SHIFT -> Stack: [ROOT, I], Buffer: [ate, fish]
    - SHIFT -> Stack: [ROOT, I, ate], Buffer: [fish]
    - LEFT-ARC (ate <- I) -> Stack: [ROOT, ate], Arcs: {(ate, I)}, Buffer: [fish]
    - SHIFT -> Stack: [ROOT, ate, fish], Buffer: []
    - RIGHT-ARC (ate -> fish) -> Stack: [ROOT, ate], Arcs: {(ate, I), (ate, fish)}, Buffer: []
    - RIGHT-ARC (ROOT -> ate) -> Stack: [ROOT], Arcs: {(ate, I), (ate, fish), (ROOT, ate)}, Buffer: []
    - Done: the buffer is empty and the stack contains only ROOT.
  - Role of machine learning: at every step, a classifier predicts which transition to take (SHIFT, LA, or RA).
  - Advantage: linear time complexity, O(n), which makes it very fast.
  - Early approach (Nivre, 2003): conventional machine learning models (such as SVMs or logistic regression) over large numbers of indicator features.
    - Feature examples: the top-of-stack word is "good" and its part of speech is adjective; the top-of-stack word is "good" and the second stack word is "has".
    - Problems: millions of features, extreme sparsity, and the feature computation itself is slow.
- Evaluation metrics:
  - Unlabeled Attachment Score (UAS): the proportion of words whose (head, dependent) arc is correct.
  - Labeled Attachment Score (LAS): the proportion of words whose (head, dependent, label) triple is correct.
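The worked example above can be replayed in a few lines of Python. This is only a sketch of the arc-standard bookkeeping with a hand-written gold transition sequence for "I ate fish"; in a real parser a trained classifier chooses each transition instead.

```python
def parse(words, transitions):
    """Replay a hand-written (gold) transition sequence and print each state."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":            # top of stack heads the second item
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":           # second item heads the top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        print(f"{t:9s} stack={stack} buffer={buffer} arcs={arcs}")
    return arcs

gold = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
parse(["I", "ate", "fish"], gold)
# Final arcs: [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```

Running it prints the same stack, buffer, and arc trace as the bullet list above.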
Neural Dependency Parsing
Using neural networks to improve dependency parsing.
- Motivation: overcome the sparsity, incompleteness, and high computational cost of traditional indicator features.
- The Chen and Manning (2014) neural dependency parser:
  - Core idea: use distributed representations (embeddings).
    - Learn low-dimensional dense vectors for words, part-of-speech tags, and dependency labels.
  - Model input: concatenate the embeddings of the words, POS tags, and already-built dependency labels at the key positions of the stack and buffer (such as the top two stack elements and the first buffer element) into one large feature vector.
  - Network architecture:
    - Input layer: the concatenated embeddings.
    - Hidden layer: h = ReLU(Wx + b).
    - Output layer: softmax(W_2 h + b_2), giving a probability distribution over the SHIFT, LEFT-ARC, and RIGHT-ARC transitions.
  - Advantages:
    - Dense representations overcome sparsity and capture similarity between words and between tags.
    - The neural network is a nonlinear classifier and therefore more expressive.
  - Result: accuracy on par with the best graph-based parsers of the time, but much faster (comparable in speed to earlier transition-based parsers).
    - "Even though you might think that now that there are real numbers and matrices involved it must surely slow things down, the reality was that the symbolic models spent so much time on feature computation that by using the neural network you could actually make it faster at the same time."
- Google's Parsey McParseface (2016):
  - Built on the Chen and Manning model with improvements: a deeper network, larger vectors, better hyperparameter tuning, and beam search instead of a purely greedy strategy (see the sketch after this section).
  - Achieved state-of-the-art results at the time (e.g., UAS around 94%) and was publicized widely.
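Beam search over transition sequences, as used in the Google follow-up work, keeps the k highest-scoring partial parses rather than committing greedily to one. The sketch below only illustrates the bookkeeping: score_transitions is a placeholder standing in for the neural network, and the legality checks are simplified assumptions, so this is not the SyntaxNet decoder.

```python
import math

def score_transitions(state):
    """Dummy log-scores for the transitions that are legal in this state.
    Pure placeholder: in a real parser this is the neural network."""
    stack, buffer = state["stack"], state["buffer"]
    scores = {}
    if buffer:
        scores["SHIFT"] = math.log(0.5)
    if len(stack) > 2:
        scores["LEFT-ARC"] = math.log(0.2)
    # Simplifying assumption: only attach to ROOT once the buffer is consumed.
    if len(stack) > 1 and not (len(stack) == 2 and buffer):
        scores["RIGHT-ARC"] = math.log(0.3)
    return scores

def apply_transition(state, t):
    stack = list(state["stack"])
    buffer = list(state["buffer"])
    arcs = list(state["arcs"])
    if t == "SHIFT":
        stack.append(buffer.pop(0))
    elif t == "LEFT-ARC":
        dep = stack.pop(-2)
        arcs.append((stack[-1], dep))
    else:  # RIGHT-ARC
        dep = stack.pop()
        arcs.append((stack[-1], dep))
    return {"stack": stack, "buffer": buffer, "arcs": arcs}

def beam_parse(words, beam_size=4):
    start = {"stack": ["ROOT"], "buffer": list(words), "arcs": []}
    beam = [(0.0, start)]                       # (cumulative log-score, state)
    while any(s["buffer"] or len(s["stack"]) > 1 for _, s in beam):
        candidates = []
        for logp, s in beam:
            if not s["buffer"] and len(s["stack"]) == 1:
                candidates.append((logp, s))    # finished parse: carry it over
                continue
            for t, t_logp in score_transitions(s).items():
                candidates.append((logp + t_logp, apply_transition(s, t)))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0]                              # best-scoring complete parse

best_score, best_state = beam_parse(["I", "ate", "fish"])
print(best_score, best_state["arcs"])
```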
Graph-based Dependency Parsing
This is the other major approach to dependency parsing.
- Core idea:
  - Score every word pair (w_i, w_j) for how plausible it is that w_j is the head of w_i (that is, that w_i is a dependent of w_j).
  - The goal is to find the set of dependency arcs with the highest total score that forms a valid dependency tree.
- Procedure:
  - Scoring: score all of the roughly n^2 candidate arcs, using features such as the words at either end of the arc, their parts of speech, the words in between, and the surrounding words.
  - Decoding: use a spanning-tree algorithm (such as Chu-Liu/Edmonds) to select the set of arcs that forms the best tree.
- Neural versions:
  - Just as with transition-based parsing, a neural network can be used to learn the arc-scoring function.
  - Speaker 1 mentions work from his group showing that a neural graph-based parser reaches higher accuracy than Parsey McParseface (a gain of a bit more than one percent).
  - This more accurate neural graph-based parser is what is used in Stanza, the group's open-source parsing toolkit (see the usage sketch after this list).
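For completeness, here is a minimal usage sketch for getting dependency parses out of Stanza, assuming the package and the English models are installed; the processor names follow Stanza's documented pipeline options.

```python
import stanza

# One-time model download (comment out after the first run):
# stanza.download("en")

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
doc = nlp("The big cat sat.")

for sent in doc.sentences:
    for word in sent.words:
        head = "ROOT" if word.head == 0 else sent.words[word.head - 1].text
        print(f"{word.text} <-{word.deprel}- {head}")
```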
Conclusion and Outlook
讲座系统介绍了句法结构,特别是依存句法及其分析方法,从传统符号方法演进到现代神经方法。这些知识对于完成课程作业二(构建神经依存分析器)至关重要。Speaker 1最后提及,更前沿的句法分析进展将涉及到大型语言模型 (LLMs),这将在后续课程中讨论。