2020-10-15 | DjangoCon 2020 | How To Break Django: With Async - Andrew Godwin
Pitfalls of async programming and Django's guardrails
Tags
Media details
- Upload date
- 2025-06-21 18:31
- Source
- https://www.youtube.com/watch?v=19Uh_PA_8Rc
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro
Transcript
speaker 1: Hello everyone, and I hope you're having a good DjangoCon so far. I am very excited to be here to talk to you about the wonderful myriad of ways you can break async with Django. So let's get going. First of all, a bit about me. I'm Andrew, if you've not met or seen me before. I'm a reasonably long-time Django developer; I've been working on Django since about 2008-ish. I've worked on things like South and Django migrations, and now the async stuff. I've been around a lot of places, I've been doing Django for quite a while, and I still love this community and everything it brings, so thank you all for being who you are. And crucially, I am unfortunately not in Europe; I'm doing this remotely from the wonderful world of Denver, Colorado. It is wonderfully beautiful here, and we've been blessed with a pretty nice set of autumn days. It snowed very recently, which is an odd thing in September, but that's the way things go. I am sad not to be there with all of you in person, but hopefully we will all meet again in person very soon, if not next year. With that done, let's talk about the good news. And the good news is async views are in Django right now. You can go use them. You can just put async def on your views, and it will all intermingle and work properly. I won't go into too much detail here — you can see more about this in the other talks I've given in various places — but the nice thing is you can write views that are async def alongside normal def views that are synchronous, thanks to some wonderful interior rewriting of Django's request flow. Both of those are serviced in both modes. If you're running in WSGI mode, it will serve synchronous views like normal with no performance impact, and it will serve asynchronous views in their own little one-shot event loop. That's not as good for performance — you can't do long-polling and stuff — but you can still do parallelization and things like that.
And if you're in ASGI mode, then of course you can run both synchronous and asynchronous views in the same way, and the synchronous ones get run in their own thread. So it's there, it's good to use, and I encourage you to have a go with it — and, as you'll see in the rest of the talk, to go about it in a certain way. So let's talk about that. First of all, I want to be up front here: synchronous programming is pretty safe — as safe as programming can be, anyway. It all runs in order; you can understand how instructions execute one after the other. It's a mental model we're all familiar with, at least if you started out doing procedural programming. If you did functional, then (a) thank you, and also I apologize, and (b) I hope you still know how procedural stuff works — I can't do monads very well. But the point is, we all understand how synchronous programming works. Asynchronous programming, on the other hand, is difficult. Not impossible, certainly, but there are a lot more subtleties to it. And this is not just asynchronous programming but all concurrent programming in general — be it the async model we have in Python, be it threading, multiple processes, or systems spread across a network. These are all concurrency problems, and you'll see versions of all the same problems in every single one of those kinds of system designs. It is harder than writing plain sequential code. If you've ever worked on, say, microservices or distributed systems, you know this all too well, and in many ways asynchronous programming invites those same problems into your code base, where normally things were safer. So I want to talk about the kinds of things you can see when writing async code — the things that I have run across personally. Now, I learn best by example.
So these are all examples of mistakes I have made at one point or another over the years. I do encourage you to try these at home, but do not try them in production — you want to try them away from actual users, where you can experiment and learn from them. That's my recommendation. With that done, let's go look at what we can do in Django with async. I led off here with "great, async views are in, we can put async def in, fantastic." So there's an async view: I've taken an existing view I had and put async in front of it. Let's hit run. This isn't going to work, and there are a couple of reasons why. The main problem is that it actually would have "worked", in some sense, if Django hadn't stopped you from doing this. This is perfectly valid Python code; it will compile and execute. What it will do is start the asynchronous view in its own coroutine, get to the .get() line where we're trying to fetch the book, and then synchronously go to the database, blocking the whole event loop while it does so — potentially for a couple of seconds — before coming back and returning. The way asynchronous programming works in Python is what's called cooperative multitasking: you have to cooperate and tell Python, "here is a point where you can pause my execution and go run other things." It's still single-threaded on a single core — the GIL is still there. So if you don't have an await or some other cooperative point in your code, control can't return. This would actually run, and it would block the event loop for 50, 100, 200, maybe 500 milliseconds. And that's really bad, because you would take this code, run it, it would pass its unit tests, it would work in your browser, you'd push it to production, and it would work — but it wouldn't be very efficient.
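The blocking behaviour described above can be sketched without Django at all. This is not the slide code: `time.sleep` stands in for the synchronous ORM call, and `heartbeat` stands in for every other request the event loop should be serving. The point is that an async def with no await point starves the whole loop:

```python
import asyncio
import time

async def blocking_view():
    # WRONG: time.sleep() stands in for a synchronous ORM call such as
    # Book.objects.get(...). It never yields to the event loop, so every
    # other coroutine stalls while it runs.
    time.sleep(0.2)
    return "book"

async def heartbeat(ticks):
    # A well-behaved coroutine that should tick roughly every 10 ms.
    for _ in range(5):
        await asyncio.sleep(0.01)
        ticks.append(time.monotonic())

async def main():
    ticks = []
    start = time.monotonic()
    await asyncio.gather(blocking_view(), heartbeat(ticks))
    # While blocking_view held the loop, heartbeat could not tick:
    # its first tick lands only after the 0.2 s block has finished.
    return ticks[0] - start

delay = asyncio.run(main())
print(f"first heartbeat tick after {delay:.3f}s")
```

The first tick arrives well after the 10 ms it should take, which is exactly the "silent slowdown" the talk warns about: nothing errors, everything just gets slower.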
It's probably less efficient than the old synchronous code, because you're blocking the event loop, and the event loop doesn't like that — it's not designed for that. So your alerting goes off and you're like, that's weird, why are we seeing strange slowdowns? The code looks fine, runs fine, passes tests. These are the worst kinds of failures — silent failures, in my personal vocabulary. I do everything I can to hunt them down and stop them, and this is no exception. Now, let's talk about how you're meant to do this, which is like this. Our friend the await keyword has turned up, right there — very important. It tells the coroutine that it can pause there and hand control back to Python. But crucially, we also have a different version of get here. That's the way Python works — and you can read more about this in my DjangoCon AU talk about the ORM and trying to make it async. This isn't real code yet — we don't have an async ORM — but we have to have different callables for synchronous and asynchronous modes. We can't just make .get() work with async; we have to have an async version of get. So those are the two differences you can see. But there's one more thing. This is fine, this will work — once there's an async ORM; again, it's not in yet, this is a demonstration of what it would look like. But what happens if you run the previous, broken version? Well, thankfully I kept doing exactly that, and it kept screwing up all my tests, and so Django has this exception. If you try to touch the database from async code right now — because all database access is still synchronous in Django — you will get a SynchronousOnlyOperation exception. It tells you: you have tried to call a piece of Django that is synchronous, probably the ORM, from an asynchronous context, and you shouldn't be doing this.
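The corrected shape — await plus an async-capable accessor — can be sketched with a toy manager. Everything here is invented for illustration: `FakeManager` stands in for `Book.objects`, and the method name `aget` is an assumption about what an async ORM call could look like (Django 3.1, which the talk covers, has no async ORM methods):

```python
import asyncio

class FakeManager:
    """A toy stand-in for Book.objects — no real ORM here."""
    _rows = {1: {"pk": 1, "name": "Async Django"}}

    def get(self, pk):
        # The synchronous access path, as in Django 3.1's real ORM.
        return self._rows[pk]

    async def aget(self, pk):
        # What an async access path could look like; `aget` is a
        # hypothetical name for this sketch.
        await asyncio.sleep(0)  # yield to the loop, as a real async driver would
        return self._rows[pk]

objects = FakeManager()

async def index(pk):
    # The await marks the point where the event loop may run other work.
    book = await objects.aget(pk)
    return book["name"]

print(asyncio.run(index(1)))  # Async Django
```

The key is that the synchronous and asynchronous paths are separate callables, which is exactly the design tension the talk describes.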
You should instead be wrapping that code in sync_to_async — we'll see that later — and that is a much safer way of doing it. Now, this is what I call a guardrail. There are many of these, as you'll see, that we can add to Django as we go along, and the point is that this turns a silent failure, or a silent performance slowdown, into an explicit failure. In my opinion, that's the way to go about things. There's no way you can write that first piece of code and have it pass tests — Django just says: no, what are you doing? This is a synchronous-only operation. Stop it. That's the idea. Now, there's another kind of mistake you can make. This is the same code we saw on the previous slide, but there's one thing missing: the await. What happens if you miss awaits? Two things. First of all, any side effects of the call don't happen. If this is a get, that's not too problematic; but imagine it was a create — it wouldn't actually run, it wouldn't make anything. That's the first problem. Secondly, asynchronous callables return a coroutine. Await is basically looking for a thing to consume, which is a coroutine; if you don't have an await there, you just have a coroutine object — because in Python, everything's an object. And that coroutine object, in this example, becomes book. You pass it to a template, and the template goes, "I can't do .name on a coroutine, what are you doing?" — and you get more exceptions. So thankfully, this isn't really a silent failure in this case. It does become a problem, as we'll see later, when the call is purely a side effect with no variable assignment, so you can't detect it. But Python has a guardrail for you there too; we'll look at that in a bit. So in this particular case — the most basic mistake, which I have made so many times — Django's got your back. And this is a principle we'll keep returning to.
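A minimal sketch of the missing-await mistake, with an invented `get_book` standing in for an async fetch. Without the await, `book` is a coroutine object, and anything that treats it like the fetched row blows up:

```python
import asyncio

async def get_book():
    # Stands in for an asynchronous ORM fetch.
    return {"name": "Async Django"}

async def view_missing_await():
    book = get_book()          # BUG: no await — `book` is a coroutine object
    try:
        return book["name"]    # roughly what the template layer would attempt
    except TypeError as exc:
        return str(exc)
    finally:
        book.close()           # silence the "never awaited" RuntimeWarning

error = asyncio.run(view_missing_await())
print(error)  # 'coroutine' object is not subscriptable
```

Because the error surfaces when the object is used, this variant is at least loud; the genuinely silent case is a pure side-effect call with no assignment, which the talk returns to later.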
Django's goal, in my opinion, is to be a safety net — to give you guardrails that, when you are comfortable, you can remove one by one, but which by default are in place, preventing you from doing bad stuff. And that's just one of the examples. I think it's the main one many people run into on their first try at async code: "oh yeah, let's try it — wait, what's this exception?" It's telling you that you need to do things a different way, and the reason is that the first way is dangerous and, in the worst case, you wouldn't even notice. So here's another example. I have often said that the key win of the async world is parallelization — running two things at once, or three, or four. And here's a perfect example: a view that makes a user account. It does two things. It writes a user row — that's a network call to Postgres, it takes time, and it can be async. It also sends an email — that's SMTP, and it could be async as well. Both of these things can happen asynchronously. Great: let's do them at once and have them run in parallel. Fantastic. What could go wrong? Well, again, this is one of those subtle things where it's not quite the same as normal synchronous code: there is no ordering guarantee. If you're not familiar with it, asyncio.wait is the primitive in asyncio that takes a set of awaitables, runs them in parallel, and returns when they're all done. You can pass it any number of things, but there is literally no ordering guarantee. It could run one thing first and then the other, both in parallel — or, crucially, it could run one thing, and then the other might never happen because the process crashed. So in this case there are two different failure modes.
I could have my create_user_account function run and write a row, and then the process crashes before it sends the email. Or I could have the send_email function run and send the email, and then it crashes before it writes the user row to the database. That's two different failure modes. Now, maybe you're okay with both of those, but I prefer having fewer failure modes to reason about. So in my opinion, when you're doing things like this — side effects with an implicit dependency between them — you want to do something more like this, where you have only one failure mode. And that failure mode is: we made the account but didn't send you an email. By explicitly ordering them, we get that: okay, account made; then, if it crashes there, that's bad, but hopefully rare; and then the email is sent. A much better way around. Of course, this is only a problem for things with side effects. If, in parallel, you're just querying, getting data, and returning the results, that's very well suited to running in parallel. But if you're doing side effects, be very cautious about parallelism. It's very tempting — you can get amazing speed-ups from doing your queries in parallel; if you had a view with 20 queries and ran all 20 in parallel, you might get it down to a tenth of the time. But you've got to be aware of the extra failure modes you're adding. Imagine there weren't just two of these things but five: the number of combinations of what has and hasn't run, if they're all parallelized, is enormous. So be very careful with that. And unfortunately, in this case there's no good way to guard against it — we can't detect it. This is called a race condition, a classic thing in concurrency; you can read a great deal about it, far more than I can cover here.
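The two shapes above can be sketched side by side. The function names and the zero-length sleeps are invented stand-ins for the database write and the SMTP call; the point is purely the ordering contract of each shape:

```python
import asyncio

events = []

async def create_user_account(name):
    await asyncio.sleep(0)          # stands in for the database write
    events.append(f"account:{name}")

async def send_welcome_email(name):
    await asyncio.sleep(0)          # stands in for the SMTP call
    events.append(f"email:{name}")

async def signup_parallel(name):
    # Parallel: no ordering guarantee. A crash mid-way can leave EITHER
    # side effect done without the other — two distinct failure modes.
    await asyncio.gather(create_user_account(name), send_welcome_email(name))

async def signup_sequential(name):
    # Sequential: the only partial-failure state is "account made,
    # email not yet sent" — one failure mode instead of two.
    await create_user_account(name)
    await send_welcome_email(name)

asyncio.run(signup_sequential("ada"))
print(events)  # ['account:ada', 'email:ada'] — the order is guaranteed
```

`asyncio.gather` is used here rather than `asyncio.wait` for brevity; both run the awaitables concurrently with no ordering guarantee between them.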
Essentially, these are just a fact of life with concurrent programming, asynchronous programming included. You can't really prevent them apart from structuring your code well — it's a path you have to tread yourself. There are things you can do to help avoid them, but in the end it's part of the territory: the direct trade-off for the performance you're asking for. It's what operating system kernels have to handle as well — they deal with plenty of race conditions. So let's change tracks a bit and talk about database access again. As I said, there is no async ORM in Django 3.1 — unfortunately, I'm sorry, it takes time. So if you want to talk to the database, you can't do it on the asynchronous thread; you need to do it in a synchronous thread. What you do is this: you write the database code in a synchronous function and wrap it in the sync_to_async decorator. What that does is take any synchronous callable and turn it into an asynchronous awaitable. With the code written here, I can actually await create_user_account, and it will correctly have a subthread run the code so it doesn't block the main thread; when the result comes back, it wakes up the coroutine and returns the result. Exception propagation and so on is all handled properly for you. We ship this as part of Django these days, and there's a reason we recommend you use it: it's much easier than anything else. However, it's not quite perfect. So imagine I've written code that, again, does two different things — the classic example of validating a username and then writing a user account row.
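The sync_to_async mechanism just described can be sketched in plain asyncio. This is a toy re-implementation, not asgiref's actual code (in Django you would `from asgiref.sync import sync_to_async`); it shows the core trick of running a synchronous callable in a worker thread and awaiting the result:

```python
import asyncio
import functools

def sync_to_async_sketch(fn):
    # A toy stand-in for asgiref's sync_to_async decorator: run the
    # synchronous callable in a worker thread and await the result, so
    # the event loop stays free while the "database" call runs.
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(fn, *args, **kwargs)
        )
    return wrapper

@sync_to_async_sketch
def create_user_account(username):
    # Stands in for synchronous ORM code such as User.objects.create(...).
    return f"created {username}"

async def view():
    # Awaiting the wrapped callable suspends this coroutine until the
    # worker thread finishes; exceptions would propagate back the same way.
    return await create_user_account("andrew")

print(asyncio.run(view()))  # created andrew
```

The real asgiref version also handles exception propagation, contextvars, and (as discussed below) which thread the work lands on — which is exactly where the next pitfall lives.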
Because I want to reuse my validation logic on, say, my signup form — maybe with a little live ticker — I've made the validation logic a separate function. And of course, I am a moderately okay developer: I know that if I don't do check-username-exists and then write-user inside a transaction, I can have another race condition, where two requests come in for the same username, both check that it doesn't exist, both go "yep, not in the database", and then both write it and clash. That's why we have transactions, right? So I go, okay, great, I know this, and I write this code right here. And to the naive Python runtime, this code is perfectly fine: transaction.atomic will find the connection on its thread, set the transaction up on it, and run the enclosed code. Now, here's the problem. I just told you that sync_to_async runs things in a different thread — and it does. And transactions and connections in Django are thread-bound. So all this code actually does, if those inner pieces are synchronous, is set up a transaction on the async thread, never use it, make a brand-new synchronous thread, and then, in that synchronous thread, run everything outside of a transaction. And again, this is a silent failure. It won't actually error. It will most likely pass all of your unit tests — until those two requests hit each other just right, slip around where the transaction was supposed to protect you, and bam, you've got data corruption. It's really annoying. Now, Django can detect that you shouldn't be using synchronous transactions in an asynchronous environment — that we can do. But this becomes more of a problem once we do have an async ORM, because what if you genuinely write this code and we support async transactions? What do we do? Do we error? Do we try to port the transaction over to the other thread? Do we silently fail to do that?
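The thread-binding problem can be demonstrated with a `threading.local` standing in for Django's per-thread connection. Everything here is a sketch — `begin_transaction` and `write_user_row` are invented stand-ins, not Django APIs — but the mechanism is the same: state set on the event-loop thread is invisible to the worker thread the query runs on:

```python
import asyncio
import threading

# Django binds connections to the current thread, conceptually like this:
connection = threading.local()

def begin_transaction():
    # Stands in for transaction.atomic() setting up the connection.
    connection.in_transaction = True

def write_user_row():
    # Sees the transaction only if it runs on the SAME thread that began it.
    return getattr(connection, "in_transaction", False)

async def run_in_new_thread(fn):
    # Each call may land on a different worker thread, as pre-fix
    # sync_to_async could do.
    return await asyncio.get_running_loop().run_in_executor(None, fn)

async def signup():
    begin_transaction()  # runs on the event-loop thread...
    # ...but the write runs on a worker thread with its own thread-locals.
    return await run_in_new_thread(write_user_row)

in_txn = asyncio.run(signup())
print(in_txn)  # False — the write ran outside the "transaction", silently
```

Nothing raises; the protection just quietly isn't there, which is what makes this failure mode so hard to test for.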
Obviously that last one is terrible, but these are the problems of asynchronous design — especially context managers not actually wrapping the code below them the way you think they should. And in my opinion, not having transactions is one of the scariest failure modes, because it's very hard to test for. Honestly, try writing a test for it: it's almost impossible, since the code runs perfectly when executed sequentially, and asserting that a block of code behaves as a single opaque, atomic unit is pretty difficult. At the same time, it doesn't happen a lot in production — but when it does, it's really bad: your foreign keys stop pointing to the right places, you get integrity errors, things get messed up. So I want to avoid this at all costs, and we're going to look deeply into how to solve it in the proper async ORM. If you want to learn more, I opine a bit on this particular topic in my DjangoCon AU talk this year, but it's still a hairy problem by itself. So let's move away from the database now and talk about fetching URLs in parallel — a classic example of what you do with async. Say I have 100 URLs, and I want to fetch all of them and see which ones are still alive, because, you know, cool URLs don't change, but many URLs die over time. Code like this is what I might write. It has a nice asyncio.wait — you've seen that before — and we're feeding it a list comprehension over all the URLs. It's going to take all those coroutines, say 100 of them, run them all in parallel efficiently at once, collect all the results, and return them to us. And as you see at the top, we have a dictionary counting the number of sites that are alive and dead. Now, if you look closely at the inside of our fetch_site function, you can see — look, there's a bug. And that bug is that we fetch the value of alive first.
Then we do the client HTTP get — and there's an await there, crucially — and then we write back the value of alive plus one. You can imagine that if there are many of these running at once, which of course there are by design, many of them can fetch alive while it's still zero and then all go do their requests. The first one fetches alive, starts its request, then suspends; the second one goes in, fetches alive, suspends; and so on. When they all come back, they've each kept their copy of alive in local memory — and it's wrong. They've all got the wrong number. Again, this is a classic race condition: we have a piece of code that isn't atomic where it should be. Now, here's the fun thing. In threading, this is very hard to solve; but in asyncio, we can just do this: swap the await of the fetch and the read of alive, essentially in place. What this means is that the read of alive is now right next to where it's written — it's a single block. And in asyncio, things are atomic between awaits: if there's one await and then a second await, everything between those two runs atomically, and nothing can barge in. Remember, it's cooperative multitasking — if you don't await, nothing can come in and stop you. So in asyncio, this is perfectly valid and concurrency-safe. In threading, however, this is not safe: threads can interrupt wherever they want — "hey, at this line, we're going to stop you" — and crucially, they can interrupt between the read and the write, and there's nothing you can do about it. In the threading world, you'd need a lock or something here. But in asyncio, thankfully, we have the nice property that between awaits, things are atomic. That's one of the trade-offs: there are good and bad things about all async mechanisms, and await-based ones are no exception.
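Both versions of the counter can be sketched and actually raced. `asyncio.sleep(0)` stands in for `await client.get(url)` — the real HTTP client and URL list are not reproduced here — but the read-before-await versus read-after-await distinction is exactly the one the talk describes:

```python
import asyncio

async def fetch_site_broken(counts):
    alive = counts["alive"]        # read BEFORE the await...
    await asyncio.sleep(0)         # ...stands in for `await client.get(url)`
    counts["alive"] = alive + 1    # ...so every coroutine writes back a stale value

async def fetch_site_fixed(counts):
    await asyncio.sleep(0)         # do the request first, then...
    # ...a read-modify-write with no await in between: atomic under
    # asyncio's cooperative multitasking, so no update is lost.
    counts["alive"] = counts["alive"] + 1

async def check(fetcher, n=100):
    counts = {"alive": 0}
    await asyncio.gather(*(fetcher(counts) for _ in range(n)))
    return counts["alive"]

broken = asyncio.run(check(fetch_site_broken))
fixed = asyncio.run(check(fetch_site_fixed))
print(broken, fixed)  # far fewer than 100 (the race clobbers updates) vs 100
```

Note that the fix needs no lock at all: moving the read to after the await is enough, because nothing can preempt the coroutine between awaits.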
But this is one of the good things. With threading, or even with gevent, you don't know whether a given function is going to context-switch — with threading, anything can; with gevent, who knows whether somewhere deep in a function there's a request that will context-switch you away. So in those two worlds, you have to be much more defensive about your atomic code. Whereas with an await-based language — not only Python, Node does this too, for example — you have that built-in atomicity: as long as I don't await, I can do stuff and assume I'm not interrupted. It's really nice. Now, here's another example of where that comes into play and where it's really important. Again, it's a contrived example, but the idea is: I'm going to do a long fetch, and I have a second coroutine that's going to wait for that to finish and then notify me. Now, this is interesting because — obviously you probably wouldn't write it like this, but let's say we did — this code will either run perfectly or be stuck in an infinite loop forever. And guess what: you can't tell which until you run it, because it's non-deterministic. Obviously, even a chance of infinite looping is bad. So what happens here? Crucially, remember I said you have to await to give up control. There is no await in this notify function, which means it never gives up control. Say we submit both functions, down here, to run in parallel, and the event loop chooses to run notify first. It enters notify, sees that ready is false, and enters the loop. Now, as with all good potentially long-running loops in Python, we haven't just used a pass — that would busy-loop very, very badly — we put a sleep in there so the iteration sleeps and there's time to do other things. In threading, this works perfectly.
It does not work in asyncio, because you're not awaiting. So this literally sits there and locks up the event loop: notify stays in that loop forever, nothing else can run, ready can never be set to true, and you're deadlocked waiting for something that can never happen. Bam — an infinite loop, even though it looks like you wrote a good loop. So how do you fix it? You might just do this: add an await. And this is better, but it's not fixed. Because, remember, await needs an asynchronous version of things. That's true for the ORM, and it's also true for sleep. The default time.sleep function, when you call it, does its sleep right then. Imagine it's a five-second sleep: this function would enter the loop, and — because of the way await evaluates, the expression to the right of the await is evaluated first — it's going to sleep for five seconds and then return None. Then it will await None, so you'll get an exception saying you can't await None — but only after you've blocked the event loop for five seconds. The real solution is this one, which uses the asyncio-compatible version of sleep. When you call asyncio.sleep, it doesn't do anything; it returns an awaitable, and when you await that awaitable, that's when it sleeps. So you can see: it's not hugely subtle, but it's very easy to look at code like the first version and miss the two steps you need to get to code like this. And this is partially just the way Python is designed. This is the bad side of the await mechanic; the good side is the atomicity. The bad side is (a) you have to remember to add await, and (b) you need a different version of everything, because await is a separate keyword taking an expression rather than a way of calling a function. There's no asynchronous call syntax in Python — you merely call something, and the fact that it returns a coroutine is what makes it an asynchronous function.
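The corrected notify loop can be sketched end to end. The names and durations are invented; the essential point is that `await asyncio.sleep(...)` yields control so the other coroutine can actually run, where `time.sleep(...)` would hold the loop and `await time.sleep(...)` would block and then raise (because `time.sleep` returns `None`, which is not awaitable):

```python
import asyncio

ready = False

async def long_fetch():
    global ready
    await asyncio.sleep(0.05)   # stands in for the slow network call
    ready = True

async def notify():
    # CORRECT: await asyncio.sleep() yields control on every iteration,
    # so long_fetch gets scheduled and can eventually flip `ready`.
    # `while not ready: time.sleep(0.01)` would deadlock the whole loop,
    # and `await time.sleep(0.01)` would block 10 ms then raise TypeError.
    while not ready:
        await asyncio.sleep(0.01)
    return "done"

async def main():
    result, _ = await asyncio.gather(notify(), long_fetch())
    return result

print(asyncio.run(main()))  # done
```

Run it and it completes promptly; replace the `asyncio.sleep` with `time.sleep` and the same program never terminates.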
So that's the subtlety there. Finally, I want to talk about one other thing that I screwed up on, and that is sync_to_async itself. As I said earlier in the talk, it's a very useful thing you can wrap around database code, and it runs that code in a thread — and as you saw earlier with the transaction, the synchronous thread didn't see the setup we intended. So suppose I'm doing stuff that relies on the same connection, like this code here, where I set up some transaction-level stuff on the connection first and then run a query on it. This might be how you'd write it. In current Django, this will not work, and there's a very subtle reason why. I said sync_to_async runs things in a subthread, and it does — but it doesn't run them in the same subthread each time. It can run them on different threads. And remember, transactions and connections are thread-bound in Django. So if these two functions here run on different threads, they're not talking to the same connection: the first one sets things up on a connection the second one isn't even going to use. Even worse, a third piece of code might come along and get put on that first thread — because there's just a big pool of threads being reused — and find a connection with the wrong setup on it, and things fail. Now, this was me making a bit of a mistake: I went for speed and flexibility over safety. So, guess what — it's fixed, or at least mostly fixed. In the most recent asgiref commits, we have changed this behavior so that things always run on the same thread. If you want, you can turn that off for performance reasons by passing thread_sensitive=False to run on different threads. But by default, everything will run on the same thread and share the same connection — middleware and view code alike.
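The same-thread behaviour can be sketched with a single-worker executor. This is not asgiref's implementation — `run_thread_sensitive` is an invented helper that mimics what `thread_sensitive=True` guarantees: every wrapped call lands on the one shared worker thread, so thread-bound state like connections is shared between calls:

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# A single-thread executor sketches thread_sensitive=True: every wrapped
# call is routed to the SAME worker thread.
sensitive_executor = ThreadPoolExecutor(max_workers=1)

async def run_thread_sensitive(fn):
    return await asyncio.get_running_loop().run_in_executor(
        sensitive_executor, fn
    )

async def main():
    # Both calls land on the same thread, so thread-bound state
    # (connections, transactions) set up by the first is visible
    # to the second.
    t1 = await run_thread_sensitive(threading.get_ident)
    t2 = await run_thread_sensitive(threading.get_ident)
    return t1 == t2

print(asyncio.run(main()))  # True
```

The trade-off is throughput: a single shared thread serializes the wrapped calls, which is exactly the "a little bit slower, but worth it for the safety" default the talk describes.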
All manner of weird bugs that I have seen, and have honestly been trying to fix, are fixed by that one small change. And yes, it runs a little bit slower, but not by much, and it's worth it for the safety. If you want to turn it off and you know what you're doing, go ahead — you can just turn it off. And that's just one example of how we defend against stuff like this. You've seen just a small subset of the ways you can break asynchronous code; there are so many more. So how do you mount a defense against something that's fundamentally rooted in the language paradigm you're using? As you've seen, you can miss a single await keyword and screw up your entire code base. If I have a function that makes users and I miss an await on the call to it, the view will run and return "yes, user made" — but I never awaited it, so it's not actually going to run. Silent failure is a really common thing, and it's really tricky to build around. The first thing I recommend is PYTHONASYNCIODEBUG. If Django has guardrails, this is Python's guardrail: turn it on, and a whole set of debugging features for async apps turns on in asyncio. There are quite a few, but the top two are the most crucial to me. First of all, any coroutine that runs for too long will be flagged: "hey, this coroutine ran for too long." Why is that important? It means it didn't give up control. A good coroutine runs for maybe a few milliseconds and then awaits whatever does the long work. If your coroutine runs for a long time — and by default, "a long time" is 100 milliseconds — it probably means you're calling a synchronous thing by mistake. Maybe you're using a library, and somewhere deep inside it is an HTTP call you didn't notice. This will detect that and go: "hey, your coroutine took one and a half seconds to yield control back for the next await — that's too long."
Something's going on. So this lets you find the cases where you're using blocking, synchronous code where you shouldn't be. The other thing is the inverse: it can detect unawaited coroutines. This catches the case where you have a side-effect function written in an asynchronous style that you want to run, but you forget to put the await in front of it. If you just write create_user with no await, what happens is Python makes the coroutine object, ready to run, and then it just vanishes into the local memory space, never gets run, and gets garbage collected. When this happens with this mode turned on, Python goes: "whoa, whoa, whoa — you made a coroutine and never awaited it. What are you doing? This is bad." Now, neither of these is perfect. They're not going to outline the exact area of your code that's wrong, but they give you hints that something's not right: I know that if a coroutine runs for too long, there's a blocking synchronous call somewhere, and if a coroutine goes unawaited, I'm missing an await somewhere. With that information, you can go do some research of your own and work out exactly what the problem is. Hopefully these will keep improving over time as we get better code-analysis tools, but it's a good start. Then there are things like SynchronousOnlyOperation — I call these guardrails. These are cases where, when we can definitely detect that you're not doing a thing right, we raise an exception. There are a very few cases where you do want to call the ORM from asynchronous code — especially for a one-off function or something like that — but generally you don't want to, and that's where a guardrail makes sense. Of course, all of these have options to turn them off. That's very important to us: it's totally up to you what you want to do.
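The unawaited-coroutine detection can be seen directly. A couple of caveats on this sketch: the "never awaited" RuntimeWarning fires even without debug mode when the dropped coroutine is garbage-collected; what asyncio's debug mode (`PYTHONASYNCIODEBUG=1`, or `debug=True` on `asyncio.run`) adds on top is the coroutine's creation traceback and the slow-coroutine warnings described above. `create_user` and `view` are invented names:

```python
import asyncio
import gc
import warnings

async def create_user():
    # A side-effect function written in an asynchronous style.
    return "made"

async def view():
    create_user()   # BUG: coroutine created but never awaited

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # debug=True enables the same checks as PYTHONASYNCIODEBUG=1.
    asyncio.run(view(), debug=True)
    gc.collect()    # make sure the dropped coroutine object is collected

messages = [str(w.message) for w in caught]
print(messages)   # includes a "coroutine ... was never awaited" warning
```

This is the guardrail for the otherwise fully silent case: a side-effect call with no assignment, which would simply never run.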
By default, they all come turned on, like all of Django's other safety features — the security middleware and so on and so forth; that comes by default. And remember, this is all because asynchronous programming is, by its nature, hard. It's not that Django or Python screwed up here; concurrent programming in general, and async programming in particular, are just difficult to think about. You can screw up in so many more ways than you can in a synchronous context that we simply have to defend against all of this. My personal way of doing it is to write code synchronously first. That lets me understand how it flows and build a mental model of it; then I can write a good test suite; and then, only when I've done that and I know the code is a performance bottleneck, will I go and refactor it to be async. This is what I recommend you do too, because you may find that the things you think will be performance bottlenecks aren't, and vice versa. If you try to write your whole application async from the beginning, it's going to be a massive nightmare and more work than you want. You want the ability to lift up just one piece — "this one view, we're going to make it async" — and that's one of the reasons Django supports a hybrid mode: you can make one or two views async and keep the rest synchronous, and Django will deal with it for you, with all the right safety around it. And there you go — Bob's your uncle — you have both worlds for the price of one. That's the point, and that's Django's contract. We're here to do our part where we can: Django's job is to give you a safe framework in which you can develop quickly and safely, and when it comes time to scale up, you have the room to carve out "in this section, I want to trade my safety for speed" or "in this section, I want to replace what Django does." That's what we're here for.
And that philosophy, as it were, continues into how we're trying to design Django's async support; it's really important to me personally that we continue this. What I come to Django for is that I can just hammer on a keyboard for 20 minutes and make a site, and not worry about having giant security holes in it, or giant performance holes. That continues here; that's kind of our job. And so I'm telling you, we will try to continue that and make async the best it can be. But of course, that sits in the framework of async as a whole. It's not just a Django problem; it's not just a Python problem; it's an everything problem. Node is looking at this, Rust is looking at this, both of them have an await pattern as well, and many other languages have tried this over the years. There's so much more to be done in terms of making good, safe concurrent programming possible. We're just at the foot of the mountain of what we can achieve. I hope that in five to ten years I'll be standing in a similar position saying, "It's great, you can just write async now and it's all perfectly safe; you can't screw up, and all the race conditions are detected by default." Unfortunately, it's going to be a fair bit of work to get anywhere near there, and we may never get there. But my hope is that one day async programming will be a lot harder to make mistakes in, and really worth writing stuff in from the get-go rather than doing synchronous first. Until then, I hope you have fun trying Django's new async stuff, and I hope you have enjoyed learning from the ways you can make things go wrong. I do encourage you to try this at home: download Django 3.1, try to break it, just don't do it in production, please. And if you're interested in helping out with the async stuff, please come to the Django forum. We have an async sub-forum where we discuss ideas and designs; you can see me talking about transactions and middleware there, for example.
But yeah, until then, I hope you have a lovely afternoon. I hope you've enjoyed DjangoCon Europe, and I'll hopefully see you somewhere around the world, in person, at some point. Until then, I'll see you next time.
Detailed Summary
Executive Summary
In his DjangoCon 2020 talk, Andrew Godwin dissects the risks of introducing asynchronous programming into Django and the strategies for coping with them. He makes clear that although Django 3.1 supports async views, allowing synchronous and asynchronous code to be mixed, the core ORM is still synchronous, so any database operation must be wrapped with sync_to_async. The heart of the talk is a series of first-hand mistakes showing how easily async programming introduces silent failures, the most dangerous class of error, because they do not crash the program immediately but instead cause hard-to-trace performance degradation, data corruption, race conditions, and deadlocks in production.
Godwin stresses that Django's core design philosophy is to provide guardrails: for example, by raising a SynchronousOnlyOperation exception, Django turns a hidden performance problem into an explicit program error, steering developers toward correct practice. To further reduce risk, he strongly recommends enabling Python's asyncio debug mode. Finally, he offers a crucial best practice: always write synchronous code first, locate bottlenecks through profiling, and only then refactor the specific hot paths to be asynchronous. This strategy lets developers enjoy the performance benefits of async while sidestepping as much of its inherent complexity as possible.
The State of Async in Django and Its Core Challenges
Current Status: Async Views Are Supported, but the ORM Is Not Yet Async
- Hybrid-mode support: Django 3.1 natively supports `async def` views and seamlessly handles mixed deployments alongside traditional synchronous views.
  - Under WSGI, async views run inside a mini event loop of their own.
  - Under ASGI, sync views are dispatched to separate threads so they do not block the main event loop.
- Core limitation: at the time of the talk, Django's ORM had not yet been made asynchronous. Database operations inside an async view must therefore go through the `sync_to_async` decorator or wrapper, which runs the synchronous database call on a separate thread.
The Fundamental Difficulty: The Inherent Complexity of Concurrent Programming
- Synchronous code is intuitive: it executes in order, the logic is clear, and it is easy to understand and debug.
- Async is hard: as a form of concurrency, async imports the problems familiar from multithreading and distributed systems, shared-state management and nondeterministic execution order among them, into a codebase that was previously comparatively safe.
A Breakdown of the Async Pitfalls That Break Django
1. The Most Dangerous Pitfall: Silent Failures
Scenario 1: Accidentally Using Synchronous Code in an Async View, Degrading Performance
- The mistake: inside an `async def` view, calling a synchronous ORM method (such as `Book.objects.get()`) directly, without `sync_to_async`.
- Root cause: the synchronous call blocks the entire event loop, which cannot serve any other request while it waits on I/O.
- The harm: this is a classic silent failure. The code runs and passes unit tests, but in production it degrades badly, performing even worse than purely synchronous code.
- Django's guardrail: to prevent this, Django actively detects synchronous ORM calls made from an async context and raises a `SynchronousOnlyOperation` exception, turning a hidden performance problem into an explicit error.
Scenario 2: A Transaction That Silently Does Nothing, Corrupting Data
- The mistake: inside a `transaction.atomic()` context manager, calling two separate database functions, each wrapped with `sync_to_async`.
- Root cause: Django's database connections and transactions are thread-bound. In older versions, `sync_to_async` could run the synchronous code on different worker threads, so the transaction was opened on one thread while the actual database operations ran on other threads, outside the transaction's protection.
- The harm: another extremely dangerous silent failure. The transaction simply never takes effect, which can lead to data races and corruption, and it is very hard to catch in tests.
- The fix: newer versions of Django default `sync_to_async`'s `thread_sensitive` parameter to `True`, guaranteeing that calls from the same coroutine run on the same thread and preserving transactional integrity.
2. The Hard-to-Debug Pitfall: Race Conditions
Scenario 1: Running Side-Effecting Operations in Parallel
- The mistake: using `asyncio.gather` to run two logically dependent operations at once, for example "create the user account" and "send the welcome email".
- Root cause: `asyncio.gather` makes no guarantee about completion order. The process could crash after the account is created but before the email is sent, or the other way around, introducing several distinct failure modes.
- Recommended practice: for dependent writes (side effects), `await` them sequentially to keep the logic atomic and the failure path predictable. Parallelism is better suited to independent, side-effect-free queries.
Scenario 2: Concurrent Reads and Writes of Shared State
- The mistake: in parallel tasks, several coroutines read and write the same shared variable. The classic broken pattern is: read the old value, `await` something slow, write the new value.
- Root cause: while an `await` has handed control back to the loop, other coroutines may read the same stale value, so the final writes overwrite one another and the count comes out far lower than expected.
- asyncio's advantage, and the fix: in asyncio, the code between two `await`s is atomic. Simply reorder the code so the read and the write sit in one block with no `await` between them, and the problem disappears, with no need for the locks that multithreaded code would require.
3. The Easy-to-Miss Pitfall: Deadlocks and Infinite Loops
- The mistake: a coroutine waits for a condition in a `while` loop, but the loop body uses the synchronous `time.sleep()`, or contains no `await` at all.
- Root cause: the loop never yields control via `await`, so the event loop is blocked forever, no other coroutine can run, and the program deadlocks.
- Correct approach: use the asynchronous sleep, `await asyncio.sleep()`, so that the event loop can schedule other tasks while waiting.
Defensive Strategies and Best Practices
1. Enable Python's asyncio Debug Mode
- How to enable: set the environment variable `PYTHONASYNCIODEBUG=1`.
- Key features:
  - Detects long-running coroutines: if a coroutine runs too long between two `await`s (more than 100 ms by default), a warning is emitted, which usually means an undiscovered blocking synchronous call is hiding inside.
  - Detects never-awaited coroutines: a coroutine object that is created but never `await`ed will never execute. Debug mode catches this and warns, preventing the silent failure caused by a forgotten `await`.
2. Follow a "Synchronous First" Workflow
Godwin strongly recommends this pragmatic path:
1. Write synchronous code first: get the business logic right and build a clear mental model.
2. Write a thorough test suite: cover the business scenarios and keep the code robust.
3. Profile: use profiling tools to identify the real performance bottlenecks.
4. Refactor to async last: convert only the confirmed bottlenecks, avoiding premature and unnecessary complexity.
Conclusion and Glossary
Django's async design continues its "safe by default" philosophy, protecting developers with guardrails while leaving room to selectively trade safety for performance where it matters. The complexity of async programming is a challenge shared by the whole field of concurrency; developers must stay careful and adopt the right strategies to manage it.
| Term | Description |
|---|---|
| Silent Failure | A failure mode in which the program neither raises nor crashes, yet produces wrong results or severe performance problems. |
| Guardrail | A protection mechanism built into the framework that prevents common, dangerous mistakes, for example by proactively raising an exception. |
| Race Condition | A situation where the execution order of concurrent tasks affects the final result and can cause unexpected behaviour or data corruption. |
| Cooperative Multitasking | asyncio's scheduling model: a task must voluntarily yield control with `await` before other tasks can run. |
| Thread-bound | Describes an object (such as a database connection) whose lifetime and state are tied to one specific thread and cannot be shared across threads. |