Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 3 - Backpropagation, Neural Network

斯坦福CS224N课程第二周周二的讲座主要回顾了作业一的提交情况,并介绍了作业二的内容。作业二包含三个主要部分:一是通过数学计算理解神经网络的运作原理;二是学习依存句法分析,涉及语言结构和语言学知识;三是开始使用PyTorch深度学习框架。为此,课程将在周五提供PyTorch的入门教程。

讲座接着深入探讨了神经网络的数学基础,强调了神经网络通过层级结构学习中间表征的重要性。与传统机器学习模型不同,神经网络能够自我组织中间层的表征,以更好地服务于最终任务。讲座解释了神经网络中层的计算过程,包括输入向量与权重矩阵的乘法、加上偏置项,以及通过非线性激活函数得到下一层的输出。

最后,讲座重点讨论了非线性激活函数的作用和发展。从早期因无法提供梯度而难以学习的阈值函数,到后来广泛应用的具有平滑梯度的Sigmoid和Tanh函数。Sigmoid函数输出非负,而Tanh函数可以视为Sigmoid的缩放和平移。尽管这些函数有效,但指数运算较为耗时。因此,后续发展出计算更简便的Hard Tanh,并最终引出了目前常用的ReLU(Rectified Linear Unit)激活函数。ReLU在负数区输出为零(梯度为零),在正数区输出等于输入(梯度为1)。尽管ReLU在负数区存在“神经元死亡”问题,但其简洁的梯度和在实践中的有效性使其成为主流选择,因为它能促进梯度的反向传播并实现某种程度的神经元特化。

媒体详情

上传日期
2025-05-15 21:07
来源
https://www.youtube.com/watch?v=HnliVHU2g9U
处理状态
已完成
转录状态
已完成
Latest LLM Model
gemini-2.5-pro-exp-03-25

转录

下载为TXT
speaker 1: Okay, hi everyone. I'll be getting started. Okay, so it's Tuesday of week two, so hopefully that means everyone has done the signment one, everyone done assignment one. You know if I'm saying this, I'm probably saying it to the wrong people, but it seems like every year some people blow some of their late days on assignment one and it's really just the wrong place to use them. So Yeah, hopefully you've all done assignment one and did not that this is meant to be the easy on ramp. And then we go straight on from that. So that out today we have assignment two. So assignment two has two purposes. Purpose one is to make you do some math to some understanding of what neural networks really compute and how they computed. And that's what I'm going to talk about today is also going through that math. But then simultaneously, maybe it does three things in assignment two, so we're going to be learning something about dependency passing, which will be actually something about language structure and linguistics. But then thirdly, for assignment two, we're going to start using PyTorch. So PyTorch is one of the leading software frameworks for deep learning and the one that we're going to use for this class. So for I mean the assignment three, PyTorch is exceedingly scaffolded. So it's sort of, you know, here's this thing and you have to write these two lines, use these two functions, but nevertheless, for help people get up to speed and get started using PyTorch on Friday at 330 and Gatesby zero one or it will again be recorded. We have a tutorial on PyTorch. And so that's a great way to get more of a sense of PyTorch and how it works before doing assignment too. Yeah, the other things. Yeah. So for nearly all the lectures, we've got further reading of places that you can look of all the classes in the entire quarter. This for many people might be a really good one to look at the suggested readings. We have several readings which are sort of shorter tutorials and reviews of the kind of matrix calculus and linear algeva that we need for this class. So really encourage you to look at those. If you decide that one is your favorite, you can tell us on edge which one you think is the best one to choose between them. I kind of like the one that's first on the list, but maybe you'll feel differently. Yeah, conversely, Yeah. So today you'll be sort of all math and then Thursday will be kind of all language and linguistics. Some people find the language and linguistics hard as well. So I guess different kinds of people. Okay, so getting straight into it. So where we started last time, I'd sort of shown these baby neural networks and sort of said, well, we can think of each of those orange things as basically like a little logistic regression unit. The crucial difference from then the kind of statistics machine learning you see in the stats class 109, wherever is that? In those, you have one logistic regression and you're defining the input features to it, and you've got some decision variable that you want to have it at the output here, you're sort of building these cascades of little logistic regressions. And so the idea is right at the end, we're going to define what we want. We're going to capture that by our objective function, tional loss function. But for the stuff in the middle, that stuff in the middle is going to be a chance for the neural network to learn by itself what would be useful inputs to further downstream neurons. What kind of functions should I come up with in terms of my inputs? Thatwill help me provide useful outputs to help the final computation down the track. And if you haven't sort of seen and thought about this much before, I mean, I think, you know, it's worth sitting with that idea for a moment, because this is really a super powerful idea, which is what's made neural networks more powerful in most circumstances than other forms of machine learning. The fact that you have this self organization of intermediate levels of representation that you use to compute things that will be useful downstream for what you eventually want to do. The other reason I was bringing back up this picture is I've sort of wanted to go straight from here to matrices. So while you could sort of wire together neurons however you wanted to, and arguably, if you look at human brains, they look more like neurons wired together however you wanted to. But for what's done with neural networks, basically there's always this kind of regular structure of layers. So once we have this regular structure of layers, we are taking the output one of our neurons at one layer, and we're feeding them together with weights to produce the inputs to the next layer. So we're taking the x one, x two, x three outputs. We're multiplying them all by weights. We're adding bias term, and then we're going to put it through a nonlinearity, and that will give us the value at the next layer. So if we then kind of collapse that to a vector and this to a vector, that then collapses into a computation that first of all, we're doing a matrix multiplication, we're calculating wx of the imports, and then we're adding on the biases as a vector of biases, which gives us this intermediate value z. And then we have this nonlinearity, or activation function, which is applied to that which gives us the values in the next layer of the neural network. And the activation function is applied to a vector and produces a vector, but it's operating on each of the individual components of that vector one at a time. So we've got some scalar function that we're just applying to each element of the vector. And so that's the kind of picture we saw when I did this example, and I'm going to continue to use this example in today's class. Remember, we were going to decide whether the word in the middle of the input window was a location or not. And so we were doing the matrix multiplication, putting it through the nonlinearity within, then just doing a dot product here. And then that got stuck into sigmoid to predict yes or no. And the final thing I wanted to say a little bit about is these f's, the nonlinearity or the activation function, where do they come in? Well, the starting point of where they came in, in the history of neural networks, is when people came up with this idea that, well, you could represent the operation of a basic neuron by doing a matrix multiplication of the imports and then having a biased term, or here as represents a threshold term, to see whether the neurons should fire or not. That was actually in the very first implementation, which dates back to the 19 forties, as done as a threshold, right? So that if the activation was greater than theta, you output one, otherwise you output zero. And well, if you have a threshold, the two lines are flat, right? So there is no slope, there is no gradient. So that's actually makes learning much harder. So the whole secret of what we build with neural networks, and in particular an alternative name that's popular in some circles these days, is gradient based learning. And the entire idea of gradient based learning is if we actually have some slopes, then it's like going skiing during spring break. You can work out where it's steeper and you can head down where it's steeper. And that will allow us to optimize our function to learn much more quickly. So that's one reason that we don't just want to have threshold units. We want to have things with slopes. So we have gradients. So in subsequent work, people started using activation functions with slopes. And so the first popular one was this sigmoidal logistic that we've seen for mapping to probabilities. But you know it's sort of imperfect, it seemed, because you know the output was always non negative. So that sort of tends to push things towards bigger numbers. So there was quite a bit of use then of this tan H function. And you'll actually see tan H when we do assignment three, we'll be using tan hs in our current neural networks. And so I've written there the formula usually give for tan H in terms of exponentials. Yeah if your math is rusty, it's not obvious that tan H and logistic have much to do with each other. But if you want to treat this as a math problem that a ten age is literally just a rescale ale, logistic, you're stretching it by two and moving it down by one. It's the same function. Okay? But you know so that's nice. But if you calculate ten hs, you have to do all of these exponentials. And you know exponentials are kind of slow on your computer and things like that. You might wonder whether you couldn't get away with something much cheaper. And so people thought about that and thought, Oh, maybe we could just use a so called hard ten H where it has a slope of one in the middle and is then just flat outside that area. You know, that seemed to work in many cases. And so that then led to the popularity of the rectified linear unit. So the rectified linear unit is simply zero in the negative region. And then as y equals x and the positive region, now this seems kinda wonky and goes against what I was saying about gradient based learning, because once you're in the negative region, there's no gradient, you're just dead. But in the positive region, there is gradient. The gradient is particularly simple, right? The slope is always one. And so, you know this still feels slightly perverse to me, but you know this really became the norm of what people use for a number of years because people found that although for an individual neuron it was dead half the time, anytime it went negative that overall for your neural network, some things would be alive. So it kind of gave sort of a form of specialization. And the fact that the slope was always one meant that you got really easy, productive, backward flow of gradients in a way we'll talk about later. And so learning with Rause turned out to be very effective. And people started using the ranonlinearity everywhere, and it sort of became the default in the norm. And you'll see us using it in the assignments in particular, we're using an assignment too. And so you'll get to see that it works. But nevertheless, at some point, people sort of had second thoughts and decided, you know having a debt over half of its range maybe isn't such a good idea after all, even though it seemed to work great for a few years. And so a lot of what's happened since then is then to come up with other functions, which are in some sense real like, but not actually dead. So. Okay, I don't really Yeah here we go with enough. So 11 version of that is the so called leaky value. So for the leaky value, you make the negative half a straight line as well with a very minor slope, but still it's got a little bit of slope there. Then a variant of that called the parametric value where you have one extra parameter, which is actually what the slope of the negative part is. And people showed some positive results with that more recently. Again, and this is what you often see in recent transformer models, is you see nonlinarities like Swiss and jello. So both of these are sort of fancy functions, but kind of what they both look like is basically this is y equals x, to all intents and purposes, not quite, but approximately. And then you've got sort of some funky bit of curve down here, which again gives you a bit of slope. It's sort of the curve is going the opposite way. That's sort of a bit funny, but they seem to work well. Commonly used in recent transformer models. So you know that's a bit of a dump of all the non linearties people use. I mean, the details of that aren't super important right now, but the important thing to have in your head is why do we need nonlinarities? And the way to think about that is that what we're doing with neural networks is function approximation. There's some very complex function that we want to learn. You know like maybe we want na go from a piece of text towards meaning, or we want to be interpreting visual scenes or something like that. And so we want to build really good function approximators. And well, if you're just doing matrix multiplies, a matrix multiply of a vectors, a linear transform. So that doesn't let you multiply complex functions. I guess strictly, if you put a bias on the n, that's then an f ine transform. But nice to keep it simple. Linear transforms, right? So if you're doing multiple, if you're doing multiple matrix multiplies, you're doing multiple linear transforms, but they compose. So you could have just multiplied these two matrices together and youhave a single linear transform. So you get no power in terms of representation by having multilayer networks that are just matrix multiplies. You know, as of a little aside in terms of representational power, having multilayer matrix multiplies gives you no power. But if you think about it in terms of learning, actually it does give you some power. So in the theoretical community looking at neural networks, there are actually quite a few papers that look at linear neural networks, meaning that they're just sequences of the multiplies with no non linarities because they have interesting learning properties, even though they give you no representational power. Okay? But welike, to be able to learn functions like this, not only functions like this, and to be able to learn functions like this, we need more than linear transforms. And we achieve those by having something that makes us be calculating a nonlinear function. And it's these activation functions that give us nonlinear functions. Okay. Cool. Okay. So then getting on to today. So the whole thing we want to do now is gradient based learning, right? This is our sycastic gradient descent equation, where here, you know that upside down triangle symbol, right? That's our gradient. We're wanting to work out the slope of our objective function. And so this is how we're going to learn by calculating ingradient. So what we want to know is how do we calculate the gradients for an arbitrary function? So what I want to do today is, first of all, do this by hand for math and then discuss you. How do we do it computationally, which is effectively the famous thing that's taken as powerunderpowering all the neural nets, which is the back propagation algorithm. But the back propagation algorithm is just automating the math. And so for the math, it's matrix calculus. And at this point, then there's a huge spectrum between people who know much more math than me and people who barely ever learnt this. But you know, I hope to explain the essentials or remind people of them enough that you're least sted a starting point for reading some other stuff and doing homework too. So let's get into that. And so you're going to spend about half the time on those two halves. And you know, the hope is that after this you'll feel like, Oh, I actually understand how neural networks work under the hood. Fingers crossed. Okay, so here we go. So if you're a Stanford student, you maybe did math 51 or else you could have done math 51, which teaches linear algebra, multivariate calculus, and modern applications. Math 51 covers everything I'm going to talk about and way more stuff. So if you actually know that and remember it, you can look at Instagram for the next 35 minutes. But I think the problem is that you know quite a part of the fact a lot of people do it, as froyou know, this is a lot to get through in ten weeks. And I think that a lot of the people who do this class sort of by two years later, don't really have much ability to use any of it. But you know, if you actually looked at this book, really hardened for a very, really long time, you would have discovered that actually, right towards the end of the book in appendix g, there's actually an appendix on neural networks and the multivariable chain role, which is precisely what we're going to be using for doing our neural networks. But there are only two problems. One problem is that this is page 697 of the book, and I'm not sure anyone ever gets that far. And the other problem is, even if you do get that far, you know I know I find these pages that they're really dense, texsty pages. It's not even easy to understand them if you have gone there. So here's my attempt on that. So the mantra to have in your head is, gee, if I can remember basic single variable calculus, you know that I've got three x squared and the derivative of that is six x. That's all you sort of need to know, right? The mantra is multivvariable calculus is just like single variable calculus, except you're using matrices. Okay, so that's our article of faith, and we're going to do that. And so what we're wanting to do is to do matrix calculus, or in the generalization of that, tensor calculus, sort of using vectors, matrices and higher order tensors. Because if we can do things and what' S Referred to as vectorized gradients in the Neural Network World, that that will be sort of the fast, efficient way to do our operations. You know so if you want to think it all through, you can do it single variable at a time and check that you're doing the right thing. And I sort of tried to indicate that in the first lecture. But if we want to have our networks go room, we want to be doing matrix calculus. Okay? So let's work up to doing that. Okay? So this is the part that I trust everyone can remember, right? So we have f of x equals x cubed and we can do single variable derivative and the derivative is three x squared. Everyone remember that one. That's something we can all start from. And remember, this derivative is saying the slope of things, right? So the slope of things lets us work out where is something steep. So we'll be able to go skiing, right? That's our goal, right? And so you can think of the slope of things as how much the output will change if we change the input a bit, right? That's our measure of steepness, right? So so since the derivative is three x squared, if we're at x equals one, that means the slope is about three times one squared three. So if I work out the value of the function for 1.01, it's gone up by about three times zero point. I move the x by 0.01 and the output move by 0.03, where if I go to x equals four, the derivative is three times four squared is 48. And so if I work out the value of the function at 4.01, I get approximately 64.48 versus 64. That small difference from four to 4.01 has been magnified 48 times in the output. Okay, so now we just remember the mantra. It's going to be exactly the same single value calculus, but with more stuff. So if we have a function with n imports, we're then going to work out its gradient, which is its partial derivative with respect to each input. So its gradient will now be a vector of the same size as the number of inputs. And there's this funky symbol, which people pronounce various ways. I mean, you know, this kind of originated some kind of someone's weird way of drawing a calligraphic d, right? So it is really A D. So I think I'll mainly just call it d but sometimes people call it partial or funky d or some other name. So you have dfdx one, dfdx two for each of the variables. So if we go beyond that and then have a function with n inputs and m outputs, what we then get for the gradient is what' S Referred to as the Jacobian. Now actually the do de this is named after was a German Jew, so it should really be jacoi, but no one says that in this country. Jacobian, okay. So the Jacobin is then a matrix of partial derivatives where you're working out for each output and each input, the partial derivative between the component of the input and the output. So this looks like the kind of thing that we're going to have when we have a neural network layer, because we're going to have n inputs and m outputs for the two layers of our neural networks. So we'll be using these kind of Jacobians. Okay. So then you know the whole idea of neural networks is we've got these multi level computations and they're going to correspond to composition of functions. So we need to know how to compose things both for calculating functions and for calculating their gradients. So if we have a one variable function and we want to work out its derivative in terms of a composition of two functions, what we're doing is multiplying our computations, okay? So if you compose together z of y, that's the function that we did at the beginning that gives you, Oh, no, it's not. So it's different, okay? Z of y gives you three x squared, right? And so we know that the derivative of that is six x, okay? But if we do it in terms of the pieces, we can work out dz dy, which is just going to be three, and the ydx is two x, and we can work out the total derivative by multiplying these two pieces, and we get six x the same answer, right? So matrix calculus is exactly like single variable calculus, except we're using tensors of different. So the word tensor is used to mean, as you go up that spectrum in its size, so from sort of scale it to vector to matrix to then. You know what in computer science is normally still has multidimensional arrays. That spectrum is then tensors of different dimensions. Okay? So when we have multiple variable functions, we're going to multiply Jacobian. So here we have a function that x plus b, and then we compose the nonlinearity f get H. And so we're going to be able to compute that in the same way as a product of these partial derivatives, which are Jacobians. Okay. So let's start looking at a few examples of what we get. So let's count. Start with an element wise activation function. So when we have a vector that's being calculated as the activation function of a previously computed quantity, well, we're computing that component wise, as I explained before. So hi equals f of zi. And where this sort of f is our activation function, that actually applies to a scalar. But you know overall, this layer is a function with n outputs and n inputs. And so it's going to have an n by n Jacobian. And well, that's going to so this is our definition of the Jacobian. But in this case, this is sort of a special case because if I equals J, then we're going to have the output J, the hj, depending on zi. And otherwise it's going to be zero. Because for the off diagonal entries, it doesn't matter how you change the value. It's not changing the output because the output only depends on the corresponding index. And so what we're going to get for this Jacobian of activation functions is a matrix where everything is zero apart from the diagonal terms that correspond to where we're calculating the activation function. And for those ones, we're going to have to work out how to compute the derivative of our activation function. That was on assignment one. One of the questions on assignment one, I do believe always, always is on assignment two. No, no, it's assignment two. One of the questions on assignment two, I got that wrong. One of the ones on the new assignment is say, Hey, can you work out the derivative of a logistic function? Well, then webe able to plug that straight into f prime. So I'm not gonna to give that answer away today. Okay. So other things that we want to do with Jacobian is, well, we have this layer of our neural network where we're calculating wx plus b, and we can want to work out the partial derivative of that with respect to x. You know, this is the kind of place where it actually works to remember the mantra and say, matrix calculus is just like single value variable calculus, but with matrices. So if you just don't use your brain too hard and think, Oh, it's just like single variable calculus, so what should the answer be? It's obviously going to be W, right? And indeed, it is siarly. If we want to do the same thing for wxb and work out the Paal derivative with respect to b, well, that would be one in terms of single variable calculus. And so a matrix calculus that becomes an identity matrix. Okay, slightly different, the same idea. So that's reflecting the fact that b is actually a vector. So we need it to be coming out as an identity matrix. So higher up in my example picture, I did this sort of vector product of uth. And well, what happens if we work out the the Jacobin of that? What we end up with strictly is we come out with ht. And you know this is sort of like when you were working out, well, we did this in the first class, right, when we did a dot product calculation that you kind of get for each individual element, you get the opposite term and so you get the other vector coming out. These are sort of good ones who are compute it at home for practice to make sure you really do know the answers and why they work out the way they do. Okay, so let's go back to our little neural net. This was most of our neural net. Up above our neural net, there was the nonlinearity. Now I'm going to leave f that out of this time because, see, I got it wrong. It's on assignment two. But you know, normally yoube calculating the partials of the output, the loss function with respect to the inputs. But since the loss function tions on assignment two, I'm going to leave that out. And I'm just going to calculate derivatives with respect to this score that feeds into the loss function. So we've first of all got the neural network layer, the nonlinearity, and then we're doing this dot product to work out a score for each position, which feeds into the logistic function. So if we want to work out dsdb, so that's with respect to the bias first. So the way we do it is you know we break up our equations into our individual pieces that are composed together. And so that means whether we break this up, so we first calculate the z equals wx plus b, then we apply the activation function to the different components. Okay? Then after that, well, what we remember to do is ok, to work out our partial derivatives of b of s with respect to b, that what we're going to be doing is doing the product of the partial derivatives of the component pieces. So we're applying the matrix calculus version of the chain rule. So dsdb equals dsdh times dhdz times dz b, and which corresponds these three layers that are composed together. And so at that point, we remember our useful Jacobians from the previous slide and we can just apply them. So the top one dsdh is the huge transpose or maybe it's you. Let's come back to that. I've there's a fine point on that, but I will explain more about later. Okay. Then for the dhdz, that was the activation function where we got the diagonal of the derivative of f of z. And then for dz db, that's where we got the identity function. Okay? So we can simplify that down. And so what that's going to end up as is the U transpose that funny symbol there times the vector element wise derivative of f. This symbol, which doesn't normally turn up in your regular math course, but turns up all the time in neural networks, is referred to as the haddermaproduct. And the haddermaproduct is meaning element wise multiplication. So it's not like a cross product where you put two vectors together and you get out one number of scalar. You put two vectors together, you element wise, multiply them all and you left with another vector of the same type. Okay, so now this gave us a working out of the pastorals of dsdb. And for a neural network, we want to work out all the other partials as well. So overall here in the picture, right, we had the x, the W, the b and the U and welike to work out partials with respect to all of those variables, so we can change their values and learn so that our model predicts better. So suppose we now want to calculate dsdw. So again, we can split it up with the same chain rule and say dsdw equals the product of these three things. And the important thing to notice is that two of those three things were exactly the same ones that we calculated before. The only bit that's different is that at the end, we're now doing dz dw rather than dz db. And so the first central idea that we'll come back to when we do computation graphs is we really want to avoid doing repeated work. So we want to realize that those two parts of things are the same. And since we're just so multiplying these partial derivatives together, right, we can just compute what that part is and reuse it. And so if we want to wait, Yeah. Okay. So if we're wanting to calculate dsdw the part, that's the same. This part here we can refer to as delta. So delta is sort of the upstream gradient or the error signal. The part that you've got from sort of starting at the beginning, dsdhdhdz, this sort of shared upstream part. We can calculate that once and then we can use it to calculate both of these two things. And for dsdb, because the dzdb just comes out as the identity matrix, the answer is just delta. But for dsdw, we need to work out the dzzw before we're finished. Okay. So what do we get for that last piece? So one question you might start off with, and is normally a good thing to think about when you're doing assignment problems on this and other things, is the first thing to think about is, you know, what do things look like? Like am I should the answer be a vector, a matrix? What size should it be and things like that. So for dsdw, W is an n by m matrix and s is a scalar. So therefore, since we have one output and n times m imports, the answer, according to math, should be that we've got a one by n times m Jacobian. I have a big long row vector, but here's where things get a teeny bit tricky and there's sort of we end up with this weird myths of math and engineering convenience because you know immediately what we're wanting to do is we're wanting to take our old parameters, which will be stored in the form of matrices, vectors and so on. That we're using as coefficient ents, and we're going to want to subtract from them our fraction of our calculated gradient. So what welike to do is have our calculated gradients in the same shapes as our parameters, because then we can just do subtraction, whereas if they've turned into a God Almighty row vector, that's not quite so convenient. So it turns out that what we end up doing is using something that gets referred to as the shape convention, that we reshape our Jacobians so they fit into things that are of the same shape as the parameters that we are using. So we're going to represent dsdr W as an n ym matrix laid out as follows. And that's a place that one people can get confused. Okay? So that's what we want to calculate, that kind of matrix. And so that matrix is going to be delta times dz dw. So delta is going to be part of the answer. And then we want to know what dz dw is. And the answer is going to be, it's going to come out like this. So dsdw is going to be delta t times xt. So it's going to be the product of of the upstream gradient, which was the same thing we calculated before for the other two quantities. And then a local input symbol, which is input signal, which is here coming out to xt. Okay. And so we're taking the transposers of those two vectors, which it means that we end up calculating an outer product of those two vectors, which gives us our gradient. And so why is that the right answer? Well, you know, it kind of looks convenient because that's giving us something of the right shape for what I was arguing. We want to find out and we have the right number of terms. Now I'm going to rush through this. So I encourage you to read the lecture notes and do this more carefully. But let me at least a little bit explain why it makes sense, right? So if you think of one weight, and so all of these connections are our matrix, right? The matrix is being represented by all these lines and the neural network. So if you think of one number in the matrix, so here is W 23, so it's connecting from input three or it's multiplying input three to give part of the answer of H2, right? So it's this line here. So for this line here, this weight is being used only in the calculation of H2 and the only thing it's dependent on is x three. So that if you're then wanting to work out the partial of H2 or z two, sorry, z, the partial of z two with respect to x three, it's sort of depending on these two pieces only. And that's what you're achieving by working out the sort of outer product like that. Okay, Yeah. So let me just come back one more time to this sort of question of the shape of derivatives. You know, so I already sort of fushed it when I was sort of talking about, Oh, should I put the transpose there or should I nod and get a row vector versus a column vector? So there's sort of this disagreement between whether you kind of have the Jacobian form, which is what actually makes the chain rule work right in terms of doing multiplication, versus the shape convention, which is how we store everything for our computations and makes doing stochastic gradient descent, where you're subtracting whatever kind of tense you have, easy. You know this can be a source of confusion since we're doing a computer science course for the answers in the assignment, we expect you to follow the sheep convention. So you know if you're working out the derivatives with respect to some matrix, it should be shaped like a matrix with the same parameters. But you know you may well want to think about Jacobian forms and computing your answers. I mean, there are sort of two ways to go about doing this. One way of doing it is to sort of work out all the math using jacobiand Zala math 51, and at the end, just to reshape it so it fits into the same shape as the parameters according to our shape convention. I mean, the other way is to sort of do each stage following the shape convention, but then you sort of have to be gamed to sort of reshape things as needed by sort of doing transposing to have things work out at the different stages. Okay. That was my attempt to quickly review the math. Most people are still here. I will now go on to the second half and go on to the how we do the computation, right? So most of yes. So the famous thing that powers neural networks is the back propagation algorithm. So the back propagation algorithm is really only two things. You know, its invention made people famous because it gave an effective learning algorithm. But you know, at a fundamental level, the back propagation algorithm is only two things. Thing one is you use the chain rule, you do calculus, the complex functions. And thing two is you store intermediate results, so you never recompute the same stuff again. That's all there is to the back propagation algorithm. And so let's just go through that. So if we're computationally wanting to deal with know functions and doing back propagation, we can think of them as being represented as a graph. And in some way or another, this kind of graph as being used inside your neural network framework. So here is a re representation of my little neural network for finding whether the word at the center is a location. So I'm taking the x vector in ort. I'm multiply it by W, I'm adding b to it, I'm putting it through the nonlinearity. And then I'm doing the dot product with my vector U, right? So that was my computation. And so the source nodes, the inputs in this graph, the interior nodes, then the operations I do. And so then the edges that connect those together then PaaS along the result of each operation. So I PaaS along W, X to the addition function with b, then that gives me z, that I PaaS through the nonlinearity, which gives Me H, which I then dot product with the U to get s. Okay? So I do precisely this computation. And this is referred to as forward propagation, or the forward PaaS of a neural network. So the forward PaaS just calculates functions, okay? But then once we've done that, what we want to do is then work out gradients so we can do gradient based learning. And so that part is then referred to as back propagation or the backward PaaS. And then we run things backward. So for running things backward, we're going to use the same graph and we're going to backwards PaaS along at gradients. So we start at the right hand side and we have dsds. So dsds is just one, because know if you change s, you've changed s. And then what we want to do is sort of then work further back so we can work out dsdh dsdz dsdb dsdw dsdx as we work back. And so this is what we want to work out with gradients. And so how are we going to do that? Well, if we look at a single node, so for example, our nonlinearity node, but any node where H equals f of x, what we're going to have is an upstream gradient dsdh. And what we want to do is calculate the downstream gradient of the next variable will down the dsdz. And the way that we're going to do that is we're going to say, well, let's look at f, what is f's gradient? And that's going to be our local gradient. And then this is immediately what gives us the chain rule that dsdz is going to be the product of our upstream gradient. Dsdh times the dhdz, the local gradient that we calculate at that node. So downstream gradient equals upstream gradient times local gradient. Oh Yeah. That's what it says when I Press that again. Okay. So this is the sort of the single input single output case, though those inputs might be vectors or matrices or something like that. We then have sort of more complex graph cases. So I think I should have retitled this lot. Oh, Yeah. So still so so the next case is for our node, it might have multiple inputs. So this is where we're calculating wx. So in that case, we still have an we have a single upstream gradient. And then what we're going to do is we want to calculate the downstream gradient with respect to each input. The way we're going to do that is we're going to work out the local gradient with respect to each input. And then we're going to do the same kind of multiplication of upstream gradient times local gradient with respect to each input. Again, chain rule. Okay. So here's a little example of this. So this isn't really the kind of thing you normally see in a neural network, but it's an easy example. So f of x yz is going to be x plus Y Times the max of Y Z. And we've got current values of X, Y and z of one, two and zero, respectively. So here's our little computation graph. And so for forward propagation, you know, we're going to do this addition diwe're going to do this max function, and then we're going to multiply the two. And that gives us the value of f. So we can run that with the current values X, Y and z, and this is what we get. So the max of two and zero is two. Addition is three. The answer is six. So then after having done that, we run the backward propagation. And Yeah so this procedure, you know it's not actually special to your networks, right? You can use it for any piece of math if you want to just run your math on pii torture than working it out in your head or with Mathematica. Okay, so now we work out backwards. So we want to know the local gradient. So d adz is going to be one. So I say they're wrong. Dx is going to be one. So a equals x plus Y D 80y equals one for the max function. That's going to depend on which of the two is larger because it's going to have a slope of one for the one that's the biggest and zero for the one that's the smallest. And then for the product, that's like what we saw with vectors that dfda is going to be b and dfdb is going to be a. So those are all our local gradients. And so then we can use those to calculate out the derivatives. So the fdf is one. We then multiply that by the two local gradients that are calculated. We are a and b. So that gives us 23 where you're swapping over the numbers. Then for the max that we're having, the one that is biggest, we're taking the upstream times one, so it gets three, the other one gets zero. And then for the plus, we're just sending the gradient down in both directions. And so both of them come out as two. And so that gives us the fdx. So the final function value is two, the fdy. We're taking the three and adding the two. I'll mention that again in a minute, which gives us five. And then dfdz is zero. And we should be able to, again, be able to quickly check that we've got this right. Right? So if we consider, you know, the slope around z is you change z little, so z is zero. If we make z 0.1, that makes absolutely no difference to what the computer function value is. So the gradient there is zero. That's correct. So if I chaup the top, if I change X A little bit, right, if I change x to 11, then I'll be calculating 1.1 plus two is 3.1. And then I'll be taking the max, which is two, and I'll be calculating 5.1. And so wait, no, I said they're wrong. Oh, times two, wait, I didn't do the multiplication right? Sorry. Yeah. So we get the 3.1 that's multiplied by two. That gives us 6.2. So a change of 0.1 in the x has moved things up by 0.2. So that corresponds to the gradient being two. And so then the final case is, well, what if we change y? So y started off as a two and made it 2.1, then we're going to get 2.1 multiplied by one is 2.1. Right and then we've got the 2.1 here that Oh sorry, I keep doing this wrong. 2.1 plus one equals 3.1 and then we've got 2.1 is the max. So we've got 2.1 times 3.1 and that comes out to be 6.51. So it's approximately gone up by so our 0.1 difference has gone up to approximately 0.5. It's just an estimate. And so that corresponds to the gradient being five, right? We get this five times multiplication of our changes. Okay. And so that illustrates the fact that the right thing to do is when you have outward branches in your computation graph and you're running the back propagation, that what you do is you sum the gradients, right? So that for this case, we had y being the y is sort of going into these two different things in our previous chart. So once we've worked out the upstream gradients, we sum them to get the total gradient. And so that's what we did back here. We had two outward things, and we sort of took these calculated upstream gradients and 23, and we just sumed them to get five. And that gave the right answer. Okay. And so you can think about that for the sort of just generally how the sort of things to think about as sort of gradients move around in these pictures so that when we have a plus operation, that plus just sort of distributes gradient. So the same gradient to that's the upstream gradient goes to each input. When you have a max, it's kind of like a router of gradients. So the max is going to send the gradient to one of the inputs and send nothing at all to the other inputs. And when you have a multiplication, it's a little bit funky because you're sort of doing this sort of switching of the forward coefficient. So you're taking the upstream gradient multiplied by the opposite forward coefficient gives you your downstream gradient. Okay? So we kind of have this systematic way of being able to sort of forward, PaaS, calculate the values of functions, then run this backward to work out the ingredients heading down the network. And so the main other thing of the back propagation algorithm is just that we want to do this efficiently. So the wrong way to do it would be to say, well, gee, I want to calculate dsdb dsdw dsdx dsdu. So let me start doing those one at a time. And when I've done them all, I will stop. Because that means if you first calculate dsdb, you do all of the part that's in blue. But then if you went on to dsdw yoube calculating all the part in red. And well, just as we saw in the math part when we were doing it as math, these parts are exactly the same. You're doing exactly the same computations. So you only want to do those that part once and work out this upstream gradient or error signal that is being then sort of calculated and is then being shared. So the picture that we want to have is you're doing together the shared part and then you're only sort of doing separately the little bits that you need to do. Okay, boy, I seem to have been rushing through today and I'm going to actually end early unless anyone is going to slow me down, but I do have just a few more slides to go through. Yeah. So the sort of generalization of this as an algorithm is, you know in the general case, you know so we normally have these sort of neural network layers and matrices, which you represent as vectors and matrices. And you know it's sort of nice and clean, and it looks like doing that in calculus class. I mean, strictly speaking, that isn't necessary. So the algorithm for forward propagation and backward propagation that I've outlined that you can have it work in a completely arbitrary computation graph, providing it's a dag that doesn't have cycles in it. So the general algorithm is, well, you've got a whole bunch of variables that depend on other variables. There's some way in which we can sort them so that each variable only depends on variables to the left of it. So that' S Referred to as a topological sort of the outputs. And so that means there's a way we can do a forward PaaS where we're calculating variables in terms of ones that have already been calculated. But you know if you want to have some extra wonky arc, so it's not like nice matrix multiplies or anything. We're totally allowed to do that or we can have things not fully connected, right? So there's no connections across here, right? We can have an arbitrary computation graph. And so that gives us our forward propagation. And then once we've done the forward propagation, we can initialize the output gradient as one. And then we're going to visit the nodes in reverse order. And for each node, we're going to compute it a gradient by using the upstream gradient and the local gradient to compute the downstream gradient. And so then we can head back down the computation graph and work out all of the downstream gradients. And so the crucial thing to notice is that if you do it correctly, that working out the the gradient has the same big o complexity as working out the forward calculation, right? So that if you're doing more you know if in terms of big o terms, right, you might have different functions depending on what the derivatives are. But in big o terms, if you're doing more work in the backward path than you're doing in the forward path, that means that you're somehow failing to do this efficient computation and that you're recomputing some of your work. Okay. So because we have such a good algorithm here, you should be able to just work out the backward path automatically and that that gets referred to as automatic differentiation. So if you had the symbolic form of what you're calculating with your forward PaaS, you should just be able to say, your computer, can you work out the backward PaaS for me? And you know kind of mathematical like it could look at the symbolic form of all of your functions, work out their derivatives and do the entire thing for you. So early on, there was a pioneering deep learning framework, Fiano, principally from the University of	Montreal, which attempted to do precisely that, that you had the entire Ford PaaS computation started in symbolic form, and it just did the entire thing for you and worked out the backward PaaS automatically. But you know somehow that sort of proved to be too heavweight or hard to deal with different things or people just like to write their own Python or whatever it is. So that idea did not fully succeed. And so what in practice, all of the current main frameworks have fallen back on is something that's actually less automated than that. So it's sort of like we've gone backwards in time, but the software has gone a lot better really. It's a lot stabier and faster. So all of the modern deep learning frameworks sort of say, look, I will manage the computation graph for you, and I can run the forward propagation PaaS and the backward propagation path, but you're going to have to work out the local derivatives yourself. So if you're putting in a layer or putting in a function like an activation function in a neural network, your class, your Python class that represents that, you're going to have to tell me what the forward computation is and what the local gradient is. And I'm just going to call your local gradient and assume it's correct. So there's a bit more that has to be done manually. So the part that's automated then is that you know when you know not precisely this code obviously, but roughly you know inside the deep learning software, it's computing with a computation graph. It's got a forward and a backward, and it's doing what I presented on the pictures before. So for the forward PaaS, it's topologically sorting all the nodes of the graph, and then it's going through them. And for each node in the graph, it's calling its forward function, which will be able to compute its local value in terms of its inputs, which have already been calculated, because it's topologically sorted. Then it's running the backward PaaS, and then the backward PaaS, you're reversing your topological sort, and then you're working out the gradient, which is going to be the multiplication of the upstream error signal times your local gradient. And so what a human being has to implement is that for anything, whether it's a single gate, here's a multiply gate or a neural network layer, you have to implement a forward PaaS and a backward PaaS. So here for my baby example, since we're just doing multiplication, my forward past is that I just multiply the two numbers and return it. So I'm specifying that for the local node. And then the other part is that I have to work out those gradients. And well, we sort of know how to do that because that's the examples that we've doing in doing here. But notice that there's sort of a trick, right, for what I've got now you kind of can't write down what the gradients are. You know what these because you know backward is just taking as an input the upstream gradient and you can't work out what the downstream gradients are going to be unless you know what function values you're calculating it at. So the standard trick that all, which is how everyone writes this code, is you're relying on the fact that the Ford is being calculated before the backward. And so your forward method shoves into some local variables of the class what the values of the inputs are, and then you have them available. So when you get to the backward PaaS, you can do what we did before that the dx is going to be the upstream error signal times the opposite input and similarly for dy and that's going to give us the answer. Okay, just two last things then to mention. Yeah so doing this your to write you need to get the math right for what's the derivative of your function. So you get the right backward calculation. So the standard way to check that you've got the right backward calculation is to do manual gradient checking with numeric gradients. So the way you do that is you sort of like for the couple of examples I did when I said, Oh, let's check it by for going from one to 1.1, what should the slope be approximately? We're going to do that in an automated way. And so we're going to say at the value x, let's estimate what the gradients should be. And the way to do that is to pick a small H. There isn't a magical number because it depends on the function. But typically, if neural networks around ten to the minus four is good, a small H, and workout the function value either forward PaaS at x plus H and x minus H divided by the run, which is two H. And that should give you an estimate of the slope, what the backward PaaS is calculating, and you want those two numbers to be approximately equal, know, within some ten to the minus two of each other. And then probably you're calculating the gradient, right? And if they aren't equal, that you probably have made a mistake. Yeah. So note that this forfor, the version I did for my examples, I just compared to x with x plus H, right? I did a one sided estimate, which is normally what you get taught in a math class if you're doing this to check your gradients numerically, you're far, far better off doing this two sided estimate because it's much more accurate and stable when you're doing it equally around both sides of your H Yeah. So this looks easy to do. If this was just so good, why doesn't everyone do this all the time? Forget about calculus. You know, the reason you don't want to do this is that doing this is incredibly slow, right? Because you have to repeat this computation for every parameter of your model that you're not getting the kind of speed ups you're getting from the back propagation algorithm. But now it's useful for checking your implementation as correct. You know in the old days before frameworks like ptorch, you know we used to write everything by hand and people often got things wrong. But nowadays, you know it's less needed. But it's good to check that if you've implemented your own new layer, that it's doing the right thing. Okay. Yeah. So that's everything that we need to know about neural neback propagation. Is the chain rule applied efficiently? Forward PaaS is just function application backward, plus is chain rule applied inefficiently. So, you know, we're going to inflict pain on our students by making them do some math and calculate some of these things and do the homework. And I know thatbe harder for some of you than others. You know that in some sense, you don't actually need to know how to do this. The beauty of these mondeep learning frameworks as theydo it all for you. They predefine common layer types, and you can just plug them together like pieces of lego and theybe computed right? And this is precisely the reason that high school students across the country and the world can now do deep learning projects for their science fairs. Because you don't actually have to understand any of this math. You can just use what's given to you. But you know, we kind of want to hope that you actually do understand something about what's going on under the hood and how neural networks work. So therefore, you make you suffer a little bit. And of course, you know if you sort of wanting to look at and understand more complex things, you need to have some sense of what's going on. So later on, when we get on to a current neural networks, we'll talk a bit about things like exploding and vanished ingredients. And if you want to have some understanding about well, why things aren't working and things are going wrong, then you sort of want to know what it's actually calculating rather than just thinking it's all a black box magic. And so that's why we hope to have taught something about that. Okay? I think I'm done. If the audience is sufficiently stunned and we can stop for today. Okay, thank you.

最新摘要 (详细摘要)

生成于 2025-05-15 21:15

概览/核心摘要 (Executive Summary)

本篇内容总结了斯坦福大学CS224N课程(2025年春季)第三讲“反向传播与神经网络”的核心内容。课程首先回顾了神经网络的基本概念,强调其通过学习中间层表征来自组织和处理信息的能力,这是其超越传统机器学习方法优势的关键。接着,详细阐述了多种激活函数(如Sigmoid、Tanh、ReLU及其变体Swish、GeLU)的演进、特性及其在引入非线性以逼近复杂函数中的核心作用。课程的重点在于梯度下降学习和矩阵微积分,教授解释了多变量微积分的基本原则,引入雅可比矩阵和链式法则,并通过实例展示了如何计算神经网络中参数的梯度。核心内容围绕反向传播算法展开,该算法通过应用链式法则并高效存储中间结果,实现了对任意计算图的梯度计算。课程讨论了计算图的前向与反向传播过程,以及现代深度学习框架(如PyTorch)中自动微分的实现机制,即框架管理计算图,用户提供局部导数。最后,介绍了梯度检验作为验证反向传播正确性的数值方法,并强调了理解反向传播原理对于深入学习和解决复杂模型问题的重要性,即使在高度自动化的框架下也是如此。

课程回顾与作业安排

  • 回顾与作业一
    • 教授首先提醒学生作业一(Assignment 1)应已完成,并强调其为入门内容,不宜消耗过多“迟交日”(late days)。
  • 作业二:目标与内容
    • 作业二(Assignment 2)主要有三个目的:
      1. 数学计算:让学生通过数学推导理解神经网络的计算原理。
      2. 依存句法分析 (Dependency Parsing):涉及语言结构和语言学知识。
      3. PyTorch入门:开始使用PyTorch,一个主流的深度学习软件框架。作业三中PyTorch的使用将有大量脚手架代码辅助。
  • PyTorch入门辅导
    • 为帮助学生上手PyTorch,课程安排了PyTorch教程,时间为周五下午3:30,地点在Gates B01(原文为Gatesby zero one,推测为Gates B01),并会提供录像。
  • 推荐阅读材料
    • 针对本讲内容,特别是矩阵微积分和线性代数,教授强烈建议学生阅读推荐材料,这些材料多为简短教程和综述。

神经网络的核心思想

  • 中间表征的自组织学习
    • 神经网络的核心优势在于其能够自我组织中间层次的表征 (self organization of intermediate levels of representation)
    • 网络中间层学习到的特征可以作为后续神经元的有效输入,从而辅助最终的计算任务。这一思想使得神经网络在多数情况下比其他机器学习方法更为强大。
  • 从神经元到矩阵运算
    • 尽管人脑神经元连接看似随意,但神经网络通常采用规则的分层结构 (layers)
    • 层与层之间的计算可以简化为矩阵运算:首先进行矩阵乘法 Wx (权重矩阵 W 乘以输入 x),然后加上偏置向量 b,得到中间值 z
    • 之后,将 z 通过一个非线性激活函数 f,得到下一层神经元的输出 a。即 a = f(Wx + b)
    • 激活函数通常逐元素应用于向量。

激活函数 (非线性单元)

  • 引入非线性的必要性
    • 神经网络的目标是学习复杂的函数(如文本到意义的映射,视觉场景理解),即成为优秀的函数逼近器 (function approximators)
    • 若仅使用矩阵乘法(线性变换),多层网络等效于单层线性变换 (W₂ * (W₁x) = (W₂W₁)x),无法增强模型的表征能力以拟合非线性函数。
    • 激活函数引入了非线性,使得多层网络能够学习和表示复杂函数。
    • [内容不完整] 教授提到,即使没有非线性,仅包含多层矩阵乘法的线性神经网络在学习理论上仍有研究价值,因其具有有趣的“学习特性”,尽管“表征能力”没有增加。
  • 常用激活函数及其演变
    • 阈值函数 (Threshold):早期神经元模型使用,激活值大于阈值则输出1,否则输出0。缺点是斜率为0,不利于梯度学习。
    • Sigmoid / Logistic 函数:输出在 (0,1) 之间,曾用于映射到概率。缺点是输出非负,可能导致数值偏大。
    • Tanh 函数 (双曲正切):输出在 (-1,1) 之间。Tanh本质上是Sigmoid函数的缩放和平移 (tanh(x) = 2 * sigmoid(2x) - 1)。作业三中的循环神经网络会用到Tanh。计算指数函数相对较慢。
    • 硬Tanh (Hard Tanh):一种简化的Tanh,中间部分斜率为1,两侧为平坦区域。
    • ReLU (Rectified Linear Unit)f(x) = max(0, x)。负区为0,正区为 y=x
      • 特性
        • 当输入为负时,神经元“死亡”,梯度为0。
        • 当输入为正时,梯度为1,简化了梯度的反向传播。
        • 尽管单个神经元可能一半时间处于“死亡”状态,但整个网络层面仍有神经元激活,促进了某种形式的“特化”(specialization)。
        • 曾一度成为默认和标准的激活函数,因其学习效果显著。作业二中会使用。
    • 对ReLU的反思与改进:后续研究认为ReLU在负区完全“死亡”可能并非最优。
      • Leaky ReLU: 负区给予一个很小的固定斜率(如0.01),使其不会完全“死亡”。
      • Parametric ReLU (PReLU): 负区的斜率作为一个可学习的参数。
      • Swish 和 GeLU (Gaussian Error Linear Unit):近期Transformer模型中常用的激活函数,形状类似ReLU,但在负区有平滑的曲线并带有一定斜率。
  • 核心观点:激活函数的选择会影响网络的学习能力和效率,其发展趋势是在保持ReLU计算优势的同时,克服其“死亡神经元”等问题。

梯度下降与矩阵微积分

  • 梯度下降学习 (Gradient-Based Learning)
    • 神经网络通过梯度下降 (Stochastic Gradient Descent, SGD) 进行学习,其核心是计算目标函数(损失函数)关于参数的梯度 (gradient)
    • 梯度指明了函数最陡峭的下降方向,通过沿此方向更新参数来优化模型。
  • 矩阵微积分基础
    • 基本原则:教授强调一个核心理念——“多变量微积分就像单变量微积分一样,只不过你用的是矩阵。” (Multivariable calculus is just like single variable calculus, except you're using matrices.)
    • 梯度 (Gradient):对于有 n 个输入的函数 f(x),其梯度 ∇f(x) 是一个包含 n 个偏导数的向量 [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
    • 雅可比矩阵 (Jacobian):对于一个有 n 个输入和 m 个输出的函数,其梯度是一个 m x n 的雅可比矩阵,其中每个元素 Jᵢⱼ = ∂fᵢ/∂xⱼ 表示第 i 个输出相对于第 j 个输入的偏导数。
    • 链式法则 (Chain Rule):用于计算复合函数的导数。在多变量情况下,表现为雅可比矩阵的乘积。例如,若 h = f(z)z = g(x),则 ∂h/∂x = (∂h/∂z) * (∂z/∂x)
  • 常见雅可比矩阵示例
    • 元素级激活函数 h = f(z) (其中 f 逐元素作用):其雅可比矩阵 ∂h/∂z 是一个对角矩阵,对角线元素为 f'(zᵢ)
    • 线性层 z = Wx + b
      • ∂z/∂x = W (权重矩阵)
      • ∂z/∂b = I (单位矩阵)
    • 向量点积 s = uᵀh (其中 s 是标量):∂s/∂h = uᵀ (或 u,取决于维度约定,此处为行向量)。
  • 链式法则应用于神经网络计算 (以 s = uᵀf(Wx+b) 为例)
    • 目标是计算损失函数(或得分 s)关于各参数(W, b, u)和输入 x 的梯度。
    • 计算 ds/db:通过链式法则分解为 ds/db = (ds/dh) * (dh/dz) * (dz/db)
      • ds/dh = uᵀ
      • dh/dz 是激活函数导数构成的对角矩阵 diag(f'(z))
      • dz/db = I
      • 结果为 uᵀ * diag(f'(z)),可以表示为 uf'(z) 的元素级乘积 (Hadamard product) u ⊙ f'(z) [此处原文的 uᵀdiag(f'(z)) 相乘后维度需仔细核对,但核心是链式法则应用]。
    • 计算 ds/dW:同样应用链式法则 ds/dW = (ds/dh) * (dh/dz) * (dz/dW)
      • 前两项 (ds/dh) * (dh/dz) 与计算 ds/db 时相同,这部分被称为上游梯度 (upstream gradient) 或误差信号,通常用 δ 表示。
      • dz/dW 的计算结果为 δ 与输入 x 的外积形式,如 δxᵀ(具体形式取决于维度和定义)。
    • 共享上游梯度:计算不同参数的梯度时,可以复用共同的上游梯度部分,避免重复计算。
    • 梯度形状约定 (Shape Convention):在工程实践中,计算得到的梯度通常会调整形状,使其与对应参数的形状(向量、矩阵)一致,方便后续的参数更新操作 (如 θ = θ - α∇J(θ))。这可能与严格的数学雅可比矩阵形式略有不同。

反向传播算法 (Backpropagation)

  • 核心思想:反向传播算法本质上是:
    1. 应用链式法则来计算复杂函数的梯度。
    2. 存储中间结果,以避免重复计算相同的部分。
  • 计算图 (Computation Graphs)
    • 神经网络的计算过程可以表示为一个有向无环图 (DAG),其中节点代表操作(如矩阵乘法、激活函数),边代表数据的流动。
    • 前向传播 (Forward Propagation):按照图的顺序从输入到输出计算函数值。
      • 例如,对于 s = uᵀf(Wx+b)x → Wx → z = Wx+b → h = f(z) → s = uᵀh
    • 反向传播 (Backward Propagation):从最终输出开始,沿着图反向计算梯度。
      • 起始梯度 ds/ds = 1
      • 核心递推关系下游梯度 = 上游梯度 × 局部梯度 (Downstream gradient = Upstream gradient × Local gradient)
        • 上游梯度:指损失函数相对于当前节点输出的梯度 (如 ds/dh)。
        • 局部梯度:指当前节点输出相对于其直接输入的梯度 (如 dh/dz)。
        • 下游梯度:指损失函数相对于当前节点输入的梯度 (如 ds/dz)。
    • 多输入/输出节点的梯度处理
      • 如果一个变量流向多个后续计算节点(分叉),在反向传播时,其总梯度是来自这些后续节点梯度的总和
  • 计算图示例
    • 教授通过一个简单的数学表达式 f(x,y,z) = (x+y) * max(y,z) 演示了前向和反向传播过程,清晰展示了每个节点的局部梯度计算以及如何利用链式法则逐层回传梯度。
      • 加法节点:梯度被等量分配给其输入。
      • Max节点:梯度只流向值较大的那个输入。
      • 乘法节点:梯度会乘以另一个输入的在前向传播时的值。
  • 反向传播的效率
    • 正确实现的反向传播算法,其计算复杂度(大O表示法)与前向传播的计算复杂度相同。如果反向传播更慢,通常意味着存在重复计算。

自动微分 (Automatic Differentiation)

  • 早期尝试
    • 如Theano框架,尝试进行完全的符号微分 (symbolic differentiation),即根据函数的符号表达式自动推导出导数表达式并计算。
  • 现代深度学习框架的实践 (如PyTorch, TensorFlow)
    • 这些框架通常不进行完全的符号微分,而是采取一种更实用的方式:
      • 框架负责管理计算图,并执行前向和反向传播的流程。
      • 用户需要为自定义的层或操作提供其局部导数的计算方法。即,如果定义一个新的神经网络层,需要实现其 forward (前向计算) 和 backward (局部梯度计算) 两个函数。
      • 框架在反向传播时会调用用户提供的 backward 函数来获取局部梯度。
    • forward 传递计算所需值给 backward:通常,forward 方法在执行时会存储一些在计算局部梯度时需要用到的中间值(如输入值),以便 backward 方法后续使用。

梯度检验 (Gradient Checking)

  • 目的:验证手动推导或实现的反向传播(解析梯度)是否正确
  • 方法:使用数值方法(有限差分)来估计梯度。
    • 双边估计公式∇f(x) ≈ (f(x+ε) - f(x-ε)) / (2ε),其中 ε 是一个小值(如 10⁻⁴)。单边估计 (f(x+ε) - f(x)) / ε 也可以使用,但双边估计通常更精确。
    • 比较数值梯度与解析梯度,若两者非常接近(如差异小于 10⁻²),则认为解析梯度计算正确。
  • 特点
    • 速度极慢:因为需要对每个参数单独进行多次前向计算。
    • 主要用于调试自己实现的新的网络层或复杂的梯度计算,不用于模型训练。
    • 在现代框架下,如果仅使用预定义层,通常不需要手动进行梯度检验,但理解其原理有助于调试。

理解反向传播的重要性

  • 超越黑箱操作:尽管现代深度学习框架高度自动化,理解反向传播有助于用户不仅仅是将层像乐高积木一样堆叠,而是真正理解模型内部的工作机制。
  • 理解复杂模型和调试问题
    • 在遇到更复杂的模型或问题(如循环神经网络中的梯度消失/爆炸 (exploding/vanishing gradients))时,理解梯度是如何计算和传播的,对于分析问题原因和寻找解决方案至关重要。
    • 教授希望学生通过学习和作业,能够对神经网络的内部工作原理有所掌握。

结论

本讲详细介绍了神经网络中梯度的计算原理,从矩阵微积分基础到反向传播算法的具体实现和验证。核心在于通过链式法则高效计算梯度,这是驱动神经网络学习的关键。尽管现代工具简化了许多操作,但深刻理解这些底层机制对于高级应用和问题排查仍然非常重要。