speaker 1: This is week two of the systems lectures, where we try to get the most out of the hardware we have to make models train faster. Last week we talked about parallelism within a single GPU, and this week we're talking about parallelism across multiple GPUs. So this is the picture you should have in your head. We have a bunch of nodes; these are basically computers that each have a number of GPUs, usually eight. Within each GPU there's a bunch of streaming multiprocessors, or SMs, which actually do the work. In green here are essentially the memory and the communication links. Within each SM you have a very small L1 cache; on the GPU you have high-bandwidth memory, HBM, which is bigger; and then you have these links that connect the different GPUs. The way to think about it is that compute has to happen within the SM on the ALUs, and compute needs inputs and writes outputs. Generally, the inputs and outputs can be relatively far away: if you're lucky, they're in the L1 cache; if you're less lucky, they're in HBM. And this week we're talking about multi-GPU and multi-node training, where the data you need might be on another GPU. So the name of the game is: how do you structure all your computation to avoid data transfer bottlenecks? Because, remember, we want to keep the arithmetic intensity high; we want to saturate our GPUs and make them hum along, and data transfer is generally going to be a lot slower, so that's going to be the bottleneck. Last week we saw a bunch of techniques to do that within a GPU, including fusion and tiling. The idea is that instead of reading and writing to HBM, you load into L1 cache, or rather shared memory, which has the same speed, work there on your local scratchpad, and write out to HBM only judiciously. This week we're looking at communication across GPUs and nodes, where we have to replicate and shard our models, parameters, and optimizer state, and the way we do that will determine the cost. So here, taking a little liberty to put everything into one hierarchy, you can think from small and fast to big and slow. The smallest and fastest, on a single GPU, is the L1 cache, which is extremely fast but very small; then you have HBM on a single GPU; then between GPUs on the same node we have NVLink; and finally we have NVSwitch. And of course this is all in the NVIDIA ecosystem. The idea is that many of the core concepts of minimizing data transfer are really the same, but the mechanics are a bit different, because L1 behaves differently than these NVSwitches. So this lecture is going to be mostly about concretizing the concepts from the last lecture in code. There will be a few new things, but Tatsu did an excellent job of giving you an overview of all the different types of parallelism; I'm going to try to anchor it in the code so we can more deeply understand what's going on. I'm going to refer to the standard output file here, which is the output of running this lecture. There were some minor issues I'll spare you, where if you have multiprocessing, this framework doesn't quite work. Okay, so this lecture has two parts.
In part one, we're going to look at the building blocks: collective operations, which we discussed last time, how they're implemented in NCCL and PyTorch, and then some benchmarking. In part two, we're going to look at actual distributed training: data, tensor, and pipeline parallelism. Okay, so let's start with collective operations. Collective operations are primitives that are used generally for distributed programming, and collective means you have many nodes. They're actually quite old, from at least the 80s in the parallel programming literature, and they generally provide a better abstraction than trying to manage the point-to-point communication yourself. So these are tried-and-true primitives that have stood the test of time. A bit of terminology: world size refers essentially to the number of devices, for example four, and rank, somewhat confusingly if you're used to linear algebra, just refers to a device. So we have rank zero, rank one, rank two, and rank three if you have four devices. Okay, so the collective operations are as follows. Starting with broadcast: the idea is you have a tensor on one of the ranks and you just want to put it on all the other ranks. That's very straightforward. Scatter is similar, but you have four values and you want to put each value on a different rank, so each rank gets a different value, not the same value. Gather is the inverse of scatter: each rank has a different value, and you bring them all together on one rank. Reduce is the same as gather, except instead of concatenating, you add them all up. All-gather is the same as gather, except you do it for all destinations: gather went to rank zero, or rank one, or rank two, or any individual rank; all-gather does it for all of them. And then finally, reduce-scatter; I couldn't find a good picture of this, so I'm reusing the one from last time. It's like reduce, where you take a bunch of different values and add them, or perform another commutative operation on them, but like scatter, you put different pieces of the vector or tensor on different ranks. And remember that all-reduce is equivalent to reduce-scatter plus all-gather. The way to remember this terminology, because it can get confusing which one is all-gather and which one is reduce-scatter, is that reduce just means you're performing some associative and commutative operation, like sum or min or max or average; gather concatenates; scatter and broadcast are the inverses; and "all" just means the destination is all devices. Okay, so this is all review from last time, so actually, any questions before I move on? Since we're going to build on these primitives, it's useful if everyone understands. Okay, so now let's see how this is actually implemented, starting with the hardware. Here's what hardware for GPUs classically looks like. This is the kind of thing you'd have at home: you have a computer, you have your CPUs, and generally your GPUs on a node communicate via the PCIe bus. If you have to communicate between different nodes, then everything is connected over Ethernet. This is typically how machines were built; if you buy a GPU for gaming or something, this is probably what your setup looks like.
As we'll see, this is suboptimal, because there's a lot of overhead. When data needs to get shipped from GPU to GPU, it has to go through the kernel, get copied into buffers, and then go through a transport over Ethernet, and that introduces a lot of overhead. So what has happened in modern times with scientific computing and deep learning is that if you know you're just going to string a bunch of GPUs together and have them work together, then you hook the GPUs up directly. In the NVIDIA ecosystem, we have NVLink, which directly connects the GPUs, bypassing the CPU; you don't need to go through the kernel of the host machine. And even across nodes, we can connect the GPUs directly via NVSwitch, therefore bypassing Ethernet, because Ethernet was developed a long time ago, clearly not for these kinds of applications. NVLink and NVSwitch skip all of that and optimize directly for the type of workloads we're interested in. So if you look at an H100 node, each GPU has 18 fourth-generation NVLinks coming out, which gives you a total bandwidth of 900 GB/s. If you compare, that's certainly a lot faster than PCIe and way faster than Ethernet. In comparison, the bandwidth from the SM to high-bandwidth memory is still quite a bit faster, by a factor of four or so. And of course these numbers are constantly changing; with the new Blackwells, this number is two or three times higher, I believe. Okay, yeah?

speaker 2: Does the data go through the CPU and then to another GPU, or is it direct?

speaker 1: So the question is, for PCIe, how does the data get transferred? I think it still has to go through the CPU. Was there another question? And PCIe, I mean, it was developed for other things that are connected to it as well, like your sound card or your SSD hard drive, so it's really a general-purpose bus for communication between devices.

speaker 2: Does NVLink also have a connection to the CPU?

speaker 1: So the question is whether NVLink also connects to the CPU. We're going to see a bit later, maybe just in the slide, how things are connected. You still need to talk to your CPU.

speaker 2: Of course. Yeah.

speaker 1: Okay. So there's a command you can run that produces some output which lets you see how the GPUs are actually connected. I ran this on our cluster, which has eight GPUs. I guess you won't be able to get H100s, but if you could, this is what it would look like. You see that between every pair of GPUs there's NV18 connectivity, and there are also these network cards and other things. The network cards are basically what give you the PCIe connection and the CPUs. Okay, so that's the hardware. How do you use the hardware? NVIDIA has spent a lot of time developing really good software on top of their, I guess, really good hardware, and there's a collective communication library by NVIDIA called NCCL. This essentially translates the collective operations we looked at before, like all-reduce, into low-level packets that need to be sent between GPUs. This library actually does a lot of work, because it lets the programmer just operate at the level of "I need this tensor to appear on all the machines," and it just happens. Okay.
So a little bit of what happens: when you set up NCCL, you bring up a bunch of devices, and there's some communication that happens to figure out the topology of the hardware. It optimizes the paths between the GPUs, and then when you actually call these collective communication operations, it launches CUDA kernels to send and receive data. Okay, so that's NCCL. It's provided as a library, but NCCL is still a bit too low-level for us, because most of what we're doing is in Python. So PyTorch has this torch.distributed library, which essentially provides a clean interface for these collective operations. Now, from the comfort of your PyTorch program, you can just call all_gather_into_tensor on a tensor and it will appear on the different ranks. It also has the nice, useful feature that it supports multiple backends for different hardware. In particular, NCCL is for GPUs, but you can also run collective operations, which, remember, are not GPU-specific, they're for any set of devices, on CPU using a backend called Gloo. So if you're debugging on your laptop for your assignment, for example, you can use Gloo and still run things without even having a GPU. That's another advantage of these high-level primitives: they're much more portable than something very GPU-specific. Of course, the performance will really depend on the hardware, but at least logically, you can make sure your code runs. torch.distributed also supports other high-level things like FSDP, which Tatsu talked about last lecture, but we're not going to use that in this class, because in the spirit of developing things from scratch, that's just what we're going to do. Okay, so let's look at some examples of how torch.distributed collective operations work. There's a utility function I wrote, which you can take a look at in the code if you want, which takes a function and just runs it; basically it's a wrapper around Python multiprocessing that runs four processes, each executing this function. So when you're in this function, you should think of it as world_size processes running this identical function, where the rank indexes from zero all the way to world_size minus one. Right now I'm stepping through just one of the ranks, because lectures are not parallel. The first thing each process needs to do is initialize itself. They essentially need to find each other, right? You're running a lot of processes, and they need to connect to a single host so they know that each other exists. Note that this is not where all the data goes; the data goes through NCCL. This is just for coordination. And since we have a GPU, we can use NCCL; otherwise, you would use Gloo. Okay, so after the setup, we're going to do some stuff. There's a useful function called barrier, which waits for all the processes in your process group to reach this point. Remember, everything is running asynchronously, and in some cases you just want a synchronization point; barrier does that. The reason I put it here is actually somewhat trivial: I want all these print statements to be grouped together. But there are other reasons you might want to use barrier that we'll get to later.
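As a rough sketch of the kind of spawn-and-setup utility being described: the names here (`setup`, `cleanup`, `spawn_workers`) are illustrative rather than the actual course code, and it assumes a single machine with either GPUs (NCCL backend) or CPU only (Gloo backend).

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank: int, world_size: int):
    # All processes rendezvous at a single host/port for coordination;
    # the actual tensor data later flows through NCCL (or Gloo), not here.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def spawn_workers(fn, world_size: int, *args):
    # Run fn(rank, world_size, *args) in world_size separate processes.
    mp.spawn(fn, args=(world_size, *args), nprocs=world_size, join=True)
```

Each worker would call `setup(rank, world_size)` first and `cleanup()` at the end; `dist.barrier()` can be called anywhere in between to synchronize all ranks.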
So I'm going to construct a tensor for each of these ranks. The tensor is [0, 1, 2, 3] plus the rank, and I print out what it looks like on each rank before the all-reduce. Okay, so here's what it looks like. Can people read that in the back? Yes? Okay, good. On rank zero it's 0, 1, 2, 3; on rank one it's 1, 2, 3, 4; and so on. Notice that because it's async, the lines print in whatever order they happen to print. So each rank has a different tensor. And then you all-reduce: you pass in that tensor and say you want to sum. In this case I'm not going to do it async, but you can, which is useful for overlapping communication and computation. And afterwards, all-reduce works as advertised: for the first component you add them up and get 6, then 10, 14, and 18. So after the all-reduce, this tensor gets overwritten with the corresponding sums. It's very nice and simple to use. Okay, so let's do reduce-scatter. For reduce-scatter, I create an input which has dimension world_size, which here is four, and I allocate an output, because reduce-scatter does not operate in place; the output is just a scalar here. Before the reduce-scatter, I have my input as before; the output happens to be zeros, but it could be any value, since I didn't initialize it. After the reduce-scatter, where we pass the input and the output and sum, what happens is that the first components get summed and that sum lands on rank zero, the second components get summed and land on rank one, and so on. So as you notice, it computes the same thing as all-reduce, except the output is scattered across the different ranks. Okay, now let's do all-gather. I'm going to directly use the output of the reduce-scatter as the input, and I allocate an empty tensor for the output. Before the all-gather, the input is the reduce-scatter result and the output is just arbitrary values. After the all-gather, all these values show up on all the devices. So this is also an example that hopefully convinces you that reduce-scatter plus all-gather is just all-reduce, because I computed exactly the same quantity as I did for the all-reduce. Okay. Questions? Is that clear? Yeah?

speaker 2: In reduce-scatter, how do we keep track of which index goes to which GPU?

speaker 1: So the question is, in reduce-scatter, do you keep track of which index goes to which GPU? By convention, the dimensionality has to match: it can be a general tensor, but one of the dimensions has to be the world size, and it infers that the corresponding chunks go to each of the ranks in order. You have to be a bit careful to make sure the dimensionality lines up, so going through these small examples can be helpful. Is there another question? Okay. So finally, when this running process is done, you just clean up. Okay. So far, we've talked about these collective operations and a bit about how they're implemented: NCCL, and then PyTorch on top of it.
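Putting those pieces together, here is a minimal sketch of the collective-operations walkthrough above (all-reduce, reduce-scatter, all-gather), using the hypothetical `setup`/`cleanup` helpers from before and assuming one GPU per rank, since `reduce_scatter_tensor` and `all_gather_into_tensor` are NCCL operations.

```python
def collective_operations_demo(rank: int, world_size: int):
    setup(rank, world_size)
    device = f"cuda:{rank}"

    # All-reduce: every rank starts with a different tensor and ends up,
    # in place, with the elementwise sum across ranks.
    tensor = torch.arange(4, dtype=torch.float32, device=device) + rank
    dist.barrier()  # only so the prints group together
    print(f"rank {rank} before all-reduce: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=False)
    print(f"rank {rank} after all-reduce: {tensor}")   # [6, 10, 14, 18] everywhere

    # Reduce-scatter: the input has one chunk per rank; rank i ends up
    # with only the sum of everyone's chunk i.
    input = torch.arange(world_size, dtype=torch.float32, device=device) + rank
    output = torch.empty(1, device=device)
    dist.reduce_scatter_tensor(output, input, op=dist.ReduceOp.SUM)

    # All-gather: concatenate every rank's piece onto every rank, so
    # reduce-scatter followed by all-gather reproduces the all-reduce.
    gathered = torch.empty(world_size, device=device)
    dist.all_gather_into_tensor(gathered, output)
    print(f"rank {rank} after reduce-scatter + all-gather: {gathered}")

    cleanup()
```

You would launch it with something like `spawn_workers(collective_operations_demo, 4)`.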
Let's do a bit of benchmarking, in the spirit of what we did in the assignment and in the second lecture. We're going to focus on one node for now. Let's do all-reduce. I'm going to have a tensor of 100 million elements and a world size of four, so I just allocate a tensor. And generally, as I hope you can appreciate by now, when you benchmark you have to be really careful to clean your palate, in some sense. In this case, I warm up, basically run the operation once, then synchronize and do a barrier. Some of this is probably a bit defensive, but just to be safe, so that all the kernels get loaded and whatever needs to be computed is computed. Then I start the clock, all-reduce, synchronize again, and stop the clock. So now I can look at how long that took. If I scroll down here, I guess this is not that informative; I should probably have printed it in microseconds. It was very quick, some fraction of a second. And now let's measure the bandwidth, which is the number of gigabytes that were actually transferred in aggregate per second. To do that, we have to think about what actually gets transferred here. There's a tensor with that number of elements, and the size of each element is, I think, float32, so four bytes. That's the size in bytes. Now, this is a little bit subtle: how many bytes are actually sent or received? Each tensor sitting on a rank has size_bytes, and it needs to be sent to the world_size minus one other ranks, but there's also a factor of two. Why is there a factor of two? Because you're doing an all-reduce, remember. You need to send all the distinct elements to one place, they need to get summed up, and then the result needs to go back to everyone. So a rank needs to send its input out and then receive the output; that's why there's a factor of two. And the total duration is the world size times the actual measured duration; we're just assuming that if there are four processes, that's sort of four times as much wall-clock time. The bandwidth is then the bytes over the duration. So what do we get? About 277 GB per second. I think for H100s above I claimed something like 900 GB per second. Of course, as we know, your mileage varies depending on the size of the tensors, the exact number of devices, and the weather; no, not the weather, but various factors. So it's always good to benchmark to see how many gigabytes per second you're actually getting. Okay, so reduce-scatter is going to be very, very similar, so let's go through it quickly. We create an input, which is world_size times the number of elements, so each rank has that matrix. We warm up, start the clock, reduce-scatter, stop the clock, and see how long it took. Well, okay, that number by itself is not helpful, so let's look at the bandwidth. The number of sent bytes has no factor of two here, because in reduce-scatter, remember, all you're doing is sending your inputs to one place. If you just think about reduce, all the elements go to one place and that's it.
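Here is a rough sketch of the all-reduce benchmarking loop just described, with the same bandwidth accounting (a factor of two for send-plus-receive, times world_size minus one). The names are again illustrative, and it assumes GPUs with NCCL.

```python
import time

def benchmark_all_reduce(rank: int, world_size: int, num_elements: int = 100_000_000):
    setup(rank, world_size)
    tensor = torch.randn(num_elements, device=f"cuda:{rank}")  # float32: 4 bytes/element

    # Warm up and synchronize so kernel loading and NCCL setup
    # don't pollute the measurement.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    dist.barrier()

    start = time.time()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    duration = time.time() - start

    # Each rank sends its data out and receives the result back (x2),
    # to/from the other world_size - 1 ranks; the duration is aggregated
    # over ranks, following the lecture's convention.
    size_bytes = tensor.element_size() * tensor.numel()
    sent_bytes = size_bytes * 2 * (world_size - 1)
    total_duration = world_size * duration
    print(f"rank {rank}: all-reduce bandwidth ~ {sent_bytes / total_duration / 1e9:.0f} GB/s")

    cleanup()
```

For the reduce-scatter benchmark, the same loop applies with `reduce_scatter_tensor` and `sent_bytes = size_bytes * (world_size - 1)`, i.e. without the factor of two.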
And scatter just means that different components of your tensor are going to different places, but effectively it's like a reduce. So if you do the analogous calculation, I get about 70 GB per second in this case. I don't exactly know why it's 70 as opposed to some other number. One could speculate that all-reduce generally has more traffic and that all-reduces are potentially more optimized. I think NVIDIA hardware has this SHARP acceleration that actually does some of these reductions in the network itself, which shaves off a factor of two, but I don't know if that completely accounts for the difference here. There's a lot of stuff that happens in NCCL, and it's a little bit hard to reason about the performance exactly; hence benchmarking. Yeah, a question?

speaker 2: A question about the sent bytes: it looks like it accounts for the output, but what about the inputs of the reduction? I'm wondering how it gets the inputs.

speaker 1: So the question is, it seems like this is just the bytes for the output, and what about the input? To be clear, I'm assuming the inputs are already on the device, so I'm not counting that time; I'm just counting what needs to happen to do the reduce-scatter.

speaker 2: This is just a scatter, or...?

speaker 1: This is a reduce-scatter operation; this function does a reduce-scatter, so it's one operation.

speaker 2: Didn't we count it twice for all-reduce?

speaker 1: You're saying that for all-reduce there was a 2x because you needed to reduce and then spread the result out again. For reduce-scatter, it's just a name; it's called reduce-scatter, but it's really just a reduction. And you can also see from this that if you do a reduce-scatter and then an all-gather, each of those doesn't have the factor of two, so when you add them up you get the factor of two, which is another way to see that all-reduce costs twice as much. And there are some references you can go read about how to benchmark these collective operations. Okay, so let's now talk about the distributed training piece. Our general approach is that I'm going to walk through a bare-bones implementation of each strategy on deep MLPs. Recall that in transformers you're generally in the regime where the MLPs are the compute bottleneck, not the attention, so even though this is a very simple architecture, it's fairly representative of the type of workloads you'll see. Okay, so let's start with data parallelism. Actually, one note first: data, tensor, and pipeline parallelism can all be thought of as different ways of cutting up either your model or your data, which hopefully I'll depict visually here. In data parallelism, here's your model; assume it has four layers, and each layer of the MLP is just a matrix multiply with this hidden dimension. The data is also a matrix, with a batch dimension and a hidden dimension, and data parallelism just cuts along the batch dimension into smaller pieces. So each rank is going to get a different slice of the data. Let's do an example. I'm going to generate some sample data: say a batch size of 128 and a hidden dimension of 1024, and just generate some random data. Okay?
So I have batch_size by num_dim, and I'm going to run this data-parallel algorithm, DDP. So I get passed this data, with the batch size and dimension claimed before. Now I divide the batch size by the world size to get the local batch size; that's how big the batch is on a given rank. Then, based on the rank, I figure out the start and end indices of a slice of size local_batch_size and grab the corresponding data. So basically I'm just reaching in and grabbing a subset of the rows based on the rank. Okay, now I'm setting up the MLP here, and this is done in a very bare-bones way, you could say. I'm creating the MLP parameters: each layer is essentially a matrix which is num_dim by num_dim, and remember num_dim is 1024, and I create the optimizer. Remember, this function is running on each rank: each of the four ranks is running this with rank equal to 0, 1, 2, 3. And now I start training. For a number of steps, I do a forward pass through the layers: matrix multiply, nonlinearity, matrix multiply, nonlinearity; there are four layers here. I compute some loss; I don't care what the loss is, it's just something made up. And I do the backward pass. So far, this just looks like I'm implementing SGD, right? And that's kind of the point. The only difference, to implement DDP, is that you inject this one line, which synchronizes the gradients across workers: for each of the layers, you call all-reduce with averaging, and the thing you're averaging is param.grad. So it's like you've hijacked someone's SGD code and said, wait, I'm actually going to mix all the gradients together after the backward pass. And after you do that, you just update the parameters as usual. From the SGD perspective, it seems like nothing is happening; I'm just running SGD, but someone has mixed my gradients. Okay, and I print out some things. For data parallel, I'm printing the loss, and one thing to note is that the losses are different between the ranks, because they have different data, but after the all-reduce the parameters are all the same. So this is a prime, your-textbook application of all-reduce in an ML setup.

speaker 2: But how do the ranks stay in sync through these all-reduces?

speaker 1: So the question is, if all of these processes are just running asynchronously, how do you make sure that each of them is, for example, on the same step? This works because all-reduce is a synchronization point: everyone stops and does the all-reduce together. You do have to be careful, because if one of your ranks misses an all-reduce, it will just hang. Yeah?

speaker 2: Why does getting the initial parameters depend on the rank?

speaker 1: The question is why getting the initial parameters depends on the rank; they're the same. The reason is just that the code for this puts them on the appropriate GPU. Okay, any other questions? So DDP is something you implement in assignment two, which maybe some of you have looked at, or maybe not; there it's done in the context of a transformer, but this is sort of the most bare-bones version.
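Here is a minimal sketch of the data-parallel loop just described; the MLP, the loss, and the names are stand-ins for the actual course code, and `ReduceOp.AVG` assumes the NCCL backend.

```python
def data_parallel_main(rank: int, world_size: int, data: torch.Tensor,
                       num_layers: int = 4, num_steps: int = 5):
    setup(rank, world_size)
    device = f"cuda:{rank}"

    # Each rank grabs its own slice of the batch.
    batch_size, num_dim = data.shape
    local_batch_size = batch_size // world_size
    start = rank * local_batch_size
    local_data = data[start:start + local_batch_size].to(device)

    # Bare-bones MLP: one num_dim x num_dim matrix per layer, identically
    # initialized on every rank thanks to the fixed seed.
    torch.manual_seed(0)
    params = [torch.nn.Parameter(torch.randn(num_dim, num_dim, device=device) / num_dim**0.5)
              for _ in range(num_layers)]
    optimizer = torch.optim.AdamW(params, lr=1e-3)

    for step in range(num_steps):
        x = local_data
        for param in params:
            x = torch.relu(x @ param)
        loss = x.square().mean()   # a made-up loss
        loss.backward()

        # The one DDP-specific line: average gradients across ranks so
        # every rank takes the exact same optimizer step.
        for param in params:
            dist.all_reduce(param.grad, op=dist.ReduceOp.AVG)

        optimizer.step()
        optimizer.zero_grad()
        print(f"rank {rank} step {step}: loss = {loss.item():.4f}")

    cleanup()
```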
So you can see very clearly what's happening. Okay, so that's DDP: the losses are different across ranks, but the gradients are all-reduced to be the same, and therefore the parameters on all the ranks stay the same. So you're actually doing world_size separate SGD runs, but because they're synchronized, they're doing the same thing. You can think of this as sort of an analog of activation checkpointing, where sometimes you do extra compute because you don't want to store or move things. In this case, we could, for example, ship the optimizer state around, but that would be a bad idea, because it's much faster to just run the optimizer update redundantly than to actually move the optimizer state around. Okay, so last year I did try to do FSDP here, but that was a bit of a mess, so I'm going to skip that and do tensor parallelism. Here the picture is: we leave the data the same, and now we're going to cut the model along the hidden dimension. So each rank gets every layer, but only part of each layer, and what we'll end up doing is transferring the activations around. We generate the same sample data, and let's look at tensor parallel. I have the batch size and num_dim as before, and now, whereas before I was cutting the batch size, I'm cutting num_dim. So local_num_dim equals 1024 divided by the world size, which is 256. Each rank gets a part of the model, which is a one-over-world-size fraction of the parameters. And remember, the whole reason we're doing parallelism at all is that the model won't fit on a single GPU, so we shard it across multiple GPUs. The parameter matrices are now num_dim by local_num_dim, and each rank, I'm only going to implement the forward pass here, not the whole training loop, starts going through the layers. I compute the activations first. This looks pretty normal, except that the activations are actually batch_size by local_num_dim rather than num_dim, because each rank only has a fraction of the activations now. But once I have the activations, I need to communicate. What I do is allocate memory for all the activations: at this point, everyone has an x, but that x represents a different part of the activations. So I allocate world_size tensors of shape batch_size by local_num_dim, basically enough for everyone's piece, and then I do an all-gather. So I send all the activations around, and it's fairly simple: x, remember, is batch_size by local_num_dim, but x is different on every rank. When I do the all-gather, I put the results into activations, which is essentially world_size copies of the same shape as x. So now every rank has the activations of the whole model, and I just concatenate them together to get x. So x is again batch_size by num_dim, and then I repeat.
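A minimal sketch of the tensor-parallel forward pass just described, again with illustrative names: each rank holds a column slice of every weight matrix, and the full activations are reassembled with an all-gather after every layer.

```python
def tensor_parallel_main(rank: int, world_size: int, data: torch.Tensor,
                         num_layers: int = 4):
    setup(rank, world_size)
    device = f"cuda:{rank}"

    batch_size, num_dim = data.shape            # e.g. 128 x 1024
    local_num_dim = num_dim // world_size       # e.g. 1024 / 4 = 256
    x = data.to(device)

    # Each rank holds its own num_dim x local_num_dim slice of every layer
    # (randomly initialized here, so each rank's shard differs).
    params = [torch.randn(num_dim, local_num_dim, device=device) / num_dim**0.5
              for _ in range(num_layers)]

    for param in params:
        # Local compute: this rank's slice of the next activations.
        local_x = torch.relu(x @ param)         # batch_size x local_num_dim

        # Communicate: gather every rank's slice so that every rank has
        # the full batch_size x num_dim activations for the next layer.
        slices = [torch.empty_like(local_x) for _ in range(world_size)]
        dist.all_gather(slices, local_x)
        x = torch.cat(slices, dim=1)

    print(f"rank {rank}: final activations {tuple(x.shape)}")
    cleanup()
```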
So as you can see, there's quite a bit of communication happening, which is why, remember, Tatsu said that for tensor parallelism you need pretty fast interconnects; otherwise you'll spend all your time passing these activations around. And then you do it for the next layer and the next layer, and you get the idea. And I print out some output: for tensor parallel, the forward pass produces activations of the full size, and everyone has the same activations at the end. Okay. The backward pass I'm going to skip, because that's kind of annoying to do. All right, any questions about that? Yeah?

speaker 2: Why is it hard to do the backward pass?

speaker 1: I don't think it's necessarily hard, but in the constrained time and space, it just requires a bit more work. Okay, so now let's go to pipeline parallelism. In this case, we're cutting the model by layers. All the ranks get all the data, and each rank gets the whole of some layers, but different ranks get different layers. Okay, so sample the data and run this function on all the ranks. Here I figure out how many layers go on each rank, which is two here: I have a four-layer network and two ranks, so each rank gets two of the layers, just like this picture. And I allocate the parameters just for the layers that I need. Okay, so I do the forward pass. Remember, there's a further optimization: if you do this naively, you get the pipeline bubbles that Tatsu talked about before, and one way to mitigate that is to break the batch into micro-batches. So here I divide the batch into micro-batches of size 32: four micro-batches of size 32. The idea is that every rank waits for the previous rank to pass it the activations, applies its layers, and then forwards the result to the next rank. The base case is rank zero, which just takes the data, so I chunk the data into micro-batches and go through each micro-batch. First, I receive the tensor; I'm using point-to-point primitives now instead of the collective primitives, so I receive the tensor x, then I compute the layers assigned to this rank, in this case just two of them, and then I send it to the next rank; send is again a point-to-point operation. And then the next micro-batch does the same thing, so I'll skip that. Okay, so that's basically it. Pipeline parallelism, at least the very naive version of it, is conceptually relatively simple. But as mentioned last time, there are many things missing from this basic implementation. Overlapping communication and computation is something we're not doing at all here; for example, receive and send are synchronous, but you should really make them async. Also, this is just the forward pass; once you have the backward pass, you also have to figure out how to interleave the forward and backward steps.
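And here is a minimal sketch of that naive pipeline-parallel forward pass, using blocking point-to-point `send`/`recv` over micro-batches; names are illustrative, and it assumes the layer count and batch size divide evenly.

```python
def pipeline_parallel_main(rank: int, world_size: int, data: torch.Tensor,
                           num_layers: int = 4, num_micro_batches: int = 4):
    setup(rank, world_size)
    device = f"cuda:{rank}"
    batch_size, num_dim = data.shape

    # Each rank owns a contiguous block of layers
    # (e.g. 2 of 4 layers when world_size is 2).
    local_num_layers = num_layers // world_size
    params = [torch.randn(num_dim, num_dim, device=device) / num_dim**0.5
              for _ in range(local_num_layers)]

    micro_batch_size = batch_size // num_micro_batches
    for i in range(num_micro_batches):
        if rank == 0:
            # The first stage reads its micro-batch straight from the data.
            x = data[i * micro_batch_size:(i + 1) * micro_batch_size].to(device)
        else:
            # Later stages block until the previous rank sends activations.
            x = torch.empty(micro_batch_size, num_dim, device=device)
            dist.recv(x, src=rank - 1)

        for param in params:
            x = torch.relu(x @ param)

        if rank + 1 < world_size:
            # Blocking send; an asynchronous variant is sketched further below.
            dist.send(x, dst=rank + 1)

    cleanup()
```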
speaker 2: I was wondering, maybe related to what you just mentioned about the async part: is the GPU sort of listening for whether another one has sent something to it? It only starts processing once the layer before it passes something to it.

speaker 1: So the question is, is this like event-driven programming, where you're just waiting for things to happen? In event-driven programming, you write these handlers, and whenever stuff happens, maybe you get a mouse click, maybe a file-ready event, a piece of code runs. That's quite different, I think, from this style of coding, where everything works in lockstep. It's true that you're waiting for the previous rank to send you the information, but at least in this implementation, there's no flexibility about where it's coming from; it's not waiting for arbitrary data to arrive from anywhere. There are ways to do asynchronous training, which were quite popular more than ten years ago, that are more event-driven: you have a server, and whenever a worker's gradients are ready, it uploads them and they get accumulated, and if workers die, that's handled more robustly. But in modern training, despite scaling up quite a bit, everything seems to be in a synchronous paradigm. So when I say the workers and ranks are operating asynchronously, that's just because they're different processes; you're still imposing quite rigid synchronization on how everything works in lockstep. Yeah?

speaker 2: How would you change this program to overlap communication and computation?

speaker 1: So the question is how you would change this to overlap communication and computation. For example, when you send, there's no reason to wait for the data to actually be sent; you just fire off the send. Remember that the send actually happens on the GPU via a kernel launch, so that's independent, and the process can go work on another micro-batch right away. The way you would do this is with another function called isend, which is asynchronous and returns a handle. You do all the sends, and at the end you wait for all the sends to complete. And for overlapping once you actually have the backward step, you then have to schedule that in here as well. Yeah?

speaker 2: If there are multiple sends and multiple receives, how do you know which is which?

speaker 1: So the question is, if you have multiple sends and multiple receives, how do you know which is which? The tensor name doesn't matter; it's just whatever variable is there. What you specify is the source: if I'm at a rank and I'm receiving, then whatever the next message coming from that source rank is, I put it into this tensor and continue executing.

speaker 2: What if you do two sends from the same rank to the same destination?

speaker 1: I'm not quite sure about this, but I think if you have two sends, they're put in a stream, so the order of sends between a given pair of ranks is preserved; it's just that other stuff can happen at the same time. The order relative to some other rank sending to another rank can be anything.
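Along the lines of that answer, the send loop in the pipeline sketch above could be made asynchronous roughly like this: fire off `isend`, keep the handle, and wait on all of them at the end. This is again just a sketch with the same hypothetical names.

```python
def pipeline_stage_async_sends(rank, world_size, data, params,
                               micro_batch_size, num_dim, device, num_micro_batches):
    handles = []
    for i in range(num_micro_batches):
        if rank == 0:
            x = data[i * micro_batch_size:(i + 1) * micro_batch_size].to(device)
        else:
            x = torch.empty(micro_batch_size, num_dim, device=device)
            dist.recv(x, src=rank - 1)
        for param in params:
            x = torch.relu(x @ param)
        if rank + 1 < world_size:
            # isend returns immediately with a work handle; the transfer
            # proceeds while we move on to the next micro-batch.
            handles.append(dist.isend(x, dst=rank + 1))
    for handle in handles:
        handle.wait()  # make sure every send has completed before returning
```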
speaker 2: What happens if you send and no one ever receives it?

speaker 1: So what happens if you send and no one receives it? I think it would just wait, because the receiving process could just be running other code, so you don't know whether it's never going to get there or whether it's just a matter of time. Yeah?

speaker 2: What happens at the last rank?

speaker 1: So the question is, what happens at the last rank? At the end, the last rank has all the activations, so it has the result of a full forward pass. If you implemented the backward pass, you would then compute the gradient with respect to the loss and go back down, sending from each rank to rank minus one, and so on. Okay. I was afraid I was going to run out of time, but it looks like I actually have half the time left; maybe next year I should do the backward pass. So I'm going to finish quite early today, so if you have any other questions, you should ask. So far, we've gone through three simple examples: data, tensor, and pipeline parallelism. Of course, this was for simple MLPs; you would actually want to do this with your own fancier model, like a transformer. I did argue that the core ideas can be understood through the MLP, but of course when you want to train, you want to train a transformer, not a deep MLP, so you still have to implement the full complexity. What's also missing is the communication and computation overlap, which is not really handled carefully here. And there's generally a lot more complex bookkeeping code; I encourage you to check out Megatron-LM or PyTorch's FSDP, and it gets fairly hairy. One of the things that drives some of that bookkeeping, at least for, say, FSDP, and you'll be exposed to this in assignment two a bit, is that if you want something that handles arbitrary architectures, you have to discover the parameters, do a bunch of bookkeeping, and figure out where the layers are and so on, whereas in the MLP case I've simply made the decision to split the model in this particularly simple way. One other thing I'll mention as an aside is that everything we're doing in this course is in PyTorch, but it's useful to be aware of this whole other ecosystem around JAX and TPUs, which is actually kind of nice in some ways. The idea is that JAX lets you define the model and the sharding strategy, and then the JAX compiler handles the rest. There's a toolkit we developed called Levanter, based on JAX, and I'll show you a snippet: this is FSDP in ten lines of code. Basically you have your model and you just say to shard it with this particular mapping; I don't expect you to read this exactly, but you define which dimension you're going to shard by, and that's it. Similarly for tensor parallel, you just say you're going to shard the model along, say, the head dimension for attention, or along the model dimension. So in some sense, this gives you conceptual simplicity: you have this computation graph with named dimensions, the model dimension, the embedding dimension, the attention heads, the sequence dimension, and JAX lets you specify which dimensions to cut along and define a mapping from those onto the actual TPUs.
And then the JAX compiler magically figures out how to compile that down into the primitives that shuffle things around. So this is much higher level than operating with the collective communication directly. But we're sticking with PyTorch because it lets you see what's actually happening underneath the hood. If you're doing this in the real world, obviously you don't need to, and you probably shouldn't, implement all of this from scratch. Okay, so that's the end of the JAX digression. To summarize: we've seen many ways to parallelize so far, and each of them can be thought of as splitting either the model or the data along some dimension: the batch dimension, the width dimension, the depth dimension, or the context-length dimension. We also see this recurring theme of recomputation: you can recompute something from scratch, or you can store it in memory and pay the data-transfer cost, or, now in the multi-GPU, multi-node setting, you can store it in another GPU's memory and communicate, which is even slower. So there are trade-offs here, and often recomputation can actually be better, but obviously you can't recompute everything, and often you're either communication- or memory-limited. A final word: it is the case that hardware is getting better, so you might think that maybe none of this is really necessary, because in five years everything will fit in L1 or HBM. That's not going to happen: those might grow quite a bit, although there are still physical limits, but we will always end up with bigger models that are at the limit of what the hardware can do. This hierarchical structure has been with us ever since computer systems were a thing, and it will always be there. Okay, that's all I have for you today, so I can take any questions. Yeah?

speaker 2: In data parallelism, the batches on the different ranks are different, so if your normalization is a function of the batch, like batch norm, is there any impact?

speaker 1: So the question is, in data parallelism, even though the parameters are all synchronized, there could be other things that depend on the data, like in batch norm. I don't know exactly how you would handle that off the top of my head; at least in the LLM world it doesn't really show up, because layer norm is used, and as long as you initialize all the parameters with the same random seed, you'll be fine. I mean, there could be non-determinism issues on the GPU, but hopefully those are minor. Yeah?

speaker 2: Does PyTorch have niceties like what JAX offers?

speaker 1: So the question is whether PyTorch has some niceties as well, kind of like what JAX offers. PyTorch does have the FSDP library, which you should absolutely use if you're not taking this class; it's basically a wrapper: you define any model and it just does FSDP on it. If you're asking how well it lets you do more custom sharding, I think there are some things coming, but it's not as developed, I think. There's sort of a spectrum: on one end is the JAX world, where you declaratively define things, and the Google infrastructure, if you stay within the JAX/TPU system, is pretty well developed.
But then if you look at, say, DeepSeek, which is at kind of the opposite end, where you have GPUs with actually really bad interconnect, that means they have to go in and hack: they actually go down to the NCCL level and do a bunch of things, which I don't entirely understand, to eke out the performance. Whereas if you're writing JAX, you just declare your model from on high and stuff happens. So the way you leverage hardware really depends on what ecosystem you're in. Yeah?

speaker 2: Can we manage the amount of activations we keep and recompute some subset of the activations?

speaker 1: Yeah. So the question is about activation checkpointing. There's an API, in PyTorch and in JAX, that lets you specify which parts you want to recompute, because clearly you don't want to recompute everything or nothing; probably every few layers, probably right after the big matmuls. For example, if you have a matmul and then a pointwise nonlinearity, I don't think you need to store two copies: if you have two things where it's trivial to get from one to the other, you might as well just store one version. Yeah, over there?

speaker 2: On the hardware side, is hardware going to get more specialized?

speaker 1: So the question is, are GPUs ever going to be replaced by transformer-specific hardware? You're seeing this in the inference space quite a bit already: Groq and Cerebras have specialized hardware that can do inference and also, I guess, training, or could be used for training. Basically, that hardware gives you a lot more on-chip memory; that's basically the name of the game. I think Cerebras has essentially, effectively, a huge L1 cache, so you don't have to move things off the chip. And a lot of simplifications can happen, because GPUs carry a lot of baggage: they were designed in an era where you had to do a lot of branching and various ad hoc computations, which aren't really needed in the deep-learning regime. So I think there are quite a few opportunities to improve the hardware as well. I think there was a hand back there?

speaker 2: I'm not sure this is the right question in the context of the lecture, but so far we've been talking about one model that's being optimized step by step. Can these techniques be used to continue training a model, for example, without having to recalculate everything from scratch?

speaker 1: Yeah. So the question is, can these techniques be used to essentially do continued training? Yeah, absolutely. The unit we're working with is just gradient steps, so if you take a half-trained checkpoint, you can just continue; there's nothing specific about starting from scratch here. I think there was a question over there.

speaker 2: Presumably there's a physical, technical reason that nodes aren't much larger than they currently are. If you could just make GPU nodes as big as you wanted, people would do that, so presumably there's a hardware reason that's not possible. What's the actual advancement being done for specialized hardware?
speaker 1: So the question is about physical limits. There are physical limits for sure for a GPU: you can't make GPUs infinitely large or infinitely dense. There are also power issues, you have to get rid of all the heat, and there's only so much bandwidth that can fit. I don't know the exact details, but at least in the Cerebras case, they have a way of manufacturing the chips so that the memory is essentially on the chip, so I guess that's one way of putting it on there. There are obviously trade-offs, because it comes at the cost of not having as much flexibility. But in general, the way to think about this more broadly is that GPUs were still developed in the CPU era, which is much more control-focused: I have code I'm executing, that's the first-class citizen, and data needs to be moved around to execute the code. The big difference with deep-learning workloads is that it's all data flow. The computation graph is static from the beginning; you know exactly all the computations that are going to be done until essentially the end of training. Using that knowledge, you should be able to lay out your computation in a much smarter way than having to deal with the flexibility and uncertainty of ad hoc computation. Okay, maybe a few more questions. Yeah?

speaker 2: Where is the computation graph stored?

speaker 1: So the question is, where is the computation graph stored? Well, all this code is running on the CPU, but when you call a PyTorch function that needs to run on the GPU, it launches kernels under the hood, and the kernels are code that runs on the GPU. Yeah?

speaker 2: I'm not sure that answers it.

speaker 1: So I guess maybe another answer is that the computation graph is more of a conceptual thing; it's not like there's literally a graph being, well, I guess there sort of is, but it's not like the graph gets put on the GPU. Does that make sense? Okay.

speaker 2: So these communication operations, do they run on the CPU or the GPU?

speaker 1: So the question is whether the communication operations run on the CPU or the GPU. These collective operations are in some sense an abstract specification of what needs to happen, which can happen on either; remember, torch.distributed has different backends, so it could happen on CPU or on GPU.

speaker 2: When it's on the GPU, is it the CPU that schedules them, or what?

speaker 1: The CPU is still sort of the master that drives things. When you do a collective operation, it calls the NCCL library, which is still CPU code, and that launches kernels that move the data around. Yeah. Okay, maybe this is a good place to end. All right, I will see you next Monday.