2024-12-07 | DjangoCon 2024 | Django & Celery: A love story of async proportions with Hugo Bessa

Deep Integration of Django and Celery: Asynchronous Practices for Better Application Performance

Media Details

Upload Date
2025-06-21 17:53
Source
https://www.youtube.com/watch?v=NCXtXDn4ppk
Processing Status
Completed
Transcription Status
Completed
Latest LLM Model
gemini-2.5-pro

Transcript

speaker 1: Hey folks, I'm really happy to speak at DjangoCon 2024. I'm going to talk a little bit about Django and Celery and their amazing love story, right? My name is Hugo. I'm a partner at Vinta Software, where I've been working for the past seven years. Some of you might have already crossed paths with Vinta before by using one of our open source packages for Django, like Django React Boilerplate, Django Role Permissions, DRF read-write serializers, Django Virtual Models, and the newest one, which is really fresh: Django AI Assistant. You should really have a look. We are a company that really cares about the open source world, and especially the Python community. We contribute a lot of open source software, and we've been present at most conferences all over the world, which is something we really love, and we've been around for a while already. About myself: I've been working with Django for the past ten years on very different kinds of projects, most of them using Celery, and that's what we're going to talk about today. To start this talk, I'd like to talk a little bit about Django. Django is almost 20 years old, and its philosophy has made many projects very successful so far. What I really love about Django is that it has batteries included. There's kind of a right way of doing things like authentication, authorization, interacting with data, security, things like that: a set of good opinions on all these topics. That basically allows us to work together and build a whole ecosystem in an integrated way on top of Django, using the same base. Because of that, we have tons of packages that are really useful, we have events, we have meetups, and, especially, there is active development on the main framework, which introduces exciting new features in every new version. But there's a known problem of Django that still couldn't be fixed, and that's Django's performance, right?
Django uses Python, which is not known for being a fast language compared to Rust or C or Go. On top of that, Django wasn't built to work with multithreading; its asynchronous features are still being actively developed, still maturing. The ORM itself can be misleading sometimes with regard to performance. But there are workarounds to make Django a bit faster. On the database side, you can avoid N+1 queries, for instance, by being more cautious about the way you write queries. You can use caching, you can add smart indexes, you can use data denormalization, and you can run operations in the background, right? All these techniques can make your application run smoothly and serve a ton of users simultaneously. That has already been done in the past by big companies, but I want to talk a little bit more about the last one: running operations in the background. Before that, let's have a look at how Django works under the hood. Basically, Django exposes a WSGI, or Web Server Gateway Interface. That's what a server like Gunicorn or uWSGI uses to run a Django process; they connect to that WSGI file in your Django project. These servers themselves have a request load balancer: this part of the server receives a lot of requests and maps them to the processes it has. Each Django process has the whole Django environment loaded in memory. Each of these processes picks up one request at a time and only gets the next one when the previous one has finished being processed. So long-running requests are kind of expensive, right? Each Django process loads the whole framework core, so it's also expensive to have too many processes, because all of them have a cost to be loaded, and a request holds the Django process while it's being processed. So there's no way to run multiple requests in the same process, right?
There's no multithreading in Django processes, usually. There might be workarounds, but we're focusing on the main case today. So why run operations in the background? Requests should be processed quickly, so we don't hold a process for too long. We want to be able to process the next request as quickly as possible, otherwise the queue gets too long. We also want to give feedback to the user as quickly as possible, so we can retain their attention, right? Attention is very valuable these days. We don't want the user to switch tabs while they wait for an operation they started on our app; we want to keep them on our tab and retain their attention on our platform. And Celery is a tool that can help with that. Celery is an asynchronous task queue, or job queue, which is based on distributed message passing. The most important word here is distributed: it separates the async execution from your main application, so it doesn't run in the same process that Django runs in. It's a completely separate process that just worries about these asynchronous tasks. It is also very fast, and it's very well integrated with many tools, including Django. So with Celery, you have access to your models, your services, your functions, your classes, and all that. And Celery loves Django, right? Celery has dedicated documentation for integrating with Django. There are ways to configure Celery using Django settings with no extra boilerplate, and Django is allowed to be accessed within tasks, so you have full access to your ORM, your managers, your querysets, your service classes, et cetera. And there's also an important community around Celery and Django building packages that enhance this integration, for example packages that use the Django database for managing some of Celery's pieces, things like that.
So there's a whole ecosystem that helps us integrate them both, right? And here's what it looks like to integrate Django and Celery. This is a very basic case; I'm just showing you how quickly it can be done. Basically, here I'm writing this process-order view. It receives a request, it creates an order object based on the user, and it calls this function called calculate_user_score. Suppose this function is very slow, but it isn't that important that you run it immediately after you create an order; it can tolerate some eventual lack of synchrony. That's the perfect use case for us to move this function to another process, to run in the background, because otherwise it's just going to slow down the request, right? So here we are moving it to a task function that receives the user ID, fetches the user from the database, and actually calculates the user's score. But before all that happens, we already give a response to the user. This delay method here just adds the task to the queue; it's not going to execute it immediately. It's going to be executed in the background. So this is kind of a hello world for Celery, right? And what can async tasks be used for? You can delegate long-lasting jobs like the one we've just seen. We can execute remote API calls: APIs can fail, so we can use Celery to wrap these API calls and have graceful retries running without blocking the request. We can also precompute and cache values to make queries a bit lighter. We can spread bulk database insertions over time, for instance, to avoid overloading our database. We can execute recurring jobs, things like that. There are many use cases for async tasks, right? And how does it work with Celery?
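The "hello world" just described can be sketched with a toy in-memory queue standing in for Celery. Names like `process_order` and `calculate_user_score` come from the talk's slide; in real code the task would be a `@shared_task`-decorated function and you would call its `.delay()` method, so this stub is only illustrative.

```python
# Toy re-creation of the slide's flow: enqueue the slow work, respond at once.
# (In real Celery, `delay` serializes the call onto the broker instead.)
task_queue = []  # stands in for the broker


def delay(func, *args):
    """Enqueue the call instead of executing it, like Celery's .delay()."""
    task_queue.append((func, args))


def calculate_user_score(user_id):
    # Imagine a slow computation here; it runs later, in a worker.
    return f"score computed for user {user_id}"


def process_order(user_id):
    order = {"user_id": user_id}           # create the order synchronously
    delay(calculate_user_score, user_id)   # schedule the slow part
    return {"order": order, "status": "processing"}  # respond immediately


response = process_order(42)
# Later, a "worker" drains the queue:
results = [func(*args) for func, args in task_queue]
```

The point of the pattern is that the view returns before the expensive work runs; the queue decouples "when it was requested" from "when it executes".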
Basically, Django uses a Celery client that gives you that delay function, for instance, and it basically queues a message on the broker. A broker could be one of many platforms, many kinds of queue managers: it could be RabbitMQ, SQS, or Redis itself. We can use many queue managers as a broker, but we choose one, and whenever we call this client, Django sends a message to this broker and puts it on the queue, and Celery is watching this queue. It will mark a task as started; that kind of depends on your settings, it may or may not mark it as started. And it marks it as succeeded after it finishes processing it. The second part is that after Celery finishes processing that message, that task, it will also deliver the result of that task to a kind of database we call the results backend, which basically stores task results. So you can query these results from other tasks or even from Django; you may want to wait for a task to run to get its result. In the case that we showed, that may not make a lot of sense, but there are use cases for it. If you're just sending an email with Celery, for instance, you might not even want to wait for the result, so the results backend will be kind of useless in that situation, but there are situations where it can be used. So when you configure Celery, you actually have to give it a results backend so it can store the results there; it's just a normal database. So it's a good love story, right, Django and Celery, but it's not always rainbows and butterflies. We can say it's a tough love. There are many things that can go wrong whenever you're introducing a distributed system into your application; it makes things a lot more complex. All connections may fail, and there are a bunch of connections.
Now you have Django to the broker, the broker to Celery, Celery to the results backend, the results backend to Django. So there are many connections that may fail, right? You also have a lot of concurrency being added there, which adds a lot of complexity, so you can have deadlocks, tasks that never finish, outdated-data issues that only happen in production. It introduces a lot of noise. Let's say Celery has a strong personality. Celery is a distributed system, which means there are many points of failure. Here, we are going to focus on these four problems that I've run into in my whole experience with Celery. These things have always caught me in the past, and I had to handle them correctly. The first one is outdated data. Sending complex data as parameters to Celery may result in unexpected stuff. In this example I'm giving here, we are passing the user model to a Celery task, and we are basically calculating the score and saving the user. And why is this tricky? It might not work as expected, because the user model may change between the Celery task being scheduled and it actually being executed. As we are passing a model object, the model may be updated; the user may even have been deleted if, let's suppose, the task queue is full and has a lot of tasks waiting to be executed. The user may have been deleted, and whenever you run this, the user may not exist anymore, so you'd calculate the score of a nonexistent user. In this other case here, the right way to do this is passing a reference instead: I'm passing the ID of the user and getting the user by ID. If the user doesn't exist, I just don't run the task anymore. But if I can find the user, I'll get the up-to-date object, and I will calculate the user's score accordingly.
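The stale-object pitfall can be simulated with a plain dict standing in for the database; `task_with_object` and `task_with_reference` are hypothetical names for the two styles the talk contrasts, not a real API.

```python
# Simulating the pitfall: a snapshot passed to a task goes stale,
# while a reference (ID) lets the task fetch fresh data at run time.
db = {1: {"id": 1, "name": "Ada", "score": 0}}


def task_with_object(user):
    # Anti-pattern: the object was serialized when the task was enqueued.
    return user["name"]


def task_with_reference(user_id):
    # Recommended: re-fetch inside the task; handle the deleted case.
    user = db.get(user_id)
    if user is None:
        return None  # user deleted between enqueue and run: skip quietly
    return user["name"]


snapshot = dict(db[1])   # "enqueued" with the full object
db[1]["name"] = "Grace"  # user updated before the worker runs
```

Here `task_with_object(snapshot)` still sees `"Ada"`, while `task_with_reference(1)` sees the up-to-date `"Grace"`, which is exactly the difference the talk warns about.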
So this is the first thing: you have to be very careful about what you pass to Celery tasks as parameters. Usually you should rely on references, not on complex objects. Serialization happens under the hood, and it can use pickle, which can actually pass complex objects, but that's far from ideal. You shouldn't rely on passing complex objects; you should pass references and re-fetch those objects in the task itself. Okay, the next problem is duplicate runs. Depending on your Celery setup and task configuration, it may not be guaranteed that your tasks are only going to run once. Multiple workers may pick up the same task at the same time, for instance: you have multiple workers looking at the same queue, and both can get the same task at the same time, depending on your configuration. A task may also be interrupted and re-queued: if a worker fails and the machine restarts, you may have a task that was interrupted and re-queued, and it will be executed again. So you have to ensure that your tasks are atomic and idempotent. In the example I'm showing here, we are storing a reference to the latest order we processed in the user model. Whenever we update the user score, right after it we store the latest order ID and save the user. That means if we call this task again with the same user and no new order, we look for orders after that last order ID, find none, and return early without doing anything. If there is a new order, we recalculate; but if the order is the same as the last one, we just return early.
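The early-return idempotency pattern just described can be sketched with plain dicts instead of Django models; all names here (`users`, `orders`, `calculate_user_score`) are hypothetical stand-ins for the slide's code.

```python
# Idempotent "recalculate score" sketch: a duplicate delivery is a no-op.
users = {1: {"score": 0, "last_order_id": None}}
orders = {1: [10, 11]}              # order IDs per user
calls = {"expensive": 0}            # counts how often the slow path runs


def calculate_user_score(user_id):
    user = users.get(user_id)
    if user is None:
        return  # user deleted since the task was enqueued: skip
    last = user["last_order_id"]
    new_orders = [o for o in orders[user_id] if last is None or o > last]
    if not new_orders:
        return  # nothing new: running the task again changes nothing
    calls["expensive"] += 1
    user["score"] += len(new_orders)          # the "expensive" recalculation
    user["last_order_id"] = max(orders[user_id])  # remember progress


calculate_user_score(1)
calculate_user_score(1)  # duplicate run: early return, no double-counting
```

Because progress (`last_order_id`) is persisted alongside the result, a re-queued duplicate finds no new work and exits, which is what makes the task safe to run more than once.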
This function also has the transaction.atomic decorator, so if something fails here, or if this function stops in the middle, it will just roll everything back and will not leave any data half-updated. The other problem we can see in Celery projects is a bit more complexity in error feedback. Whenever you hand work off to a Celery worker, if it fails there, you may not be waiting anymore in Django, so you may not be able to return an error result to your user. The user may receive a success message from the request, but the whole operation may still fail. So whenever we are developing a flow that relies on sync stuff and async stuff, we need to take this possibility into consideration and give appropriate feedback: not a success message immediately, but feedback saying that this is still being processed, with some sort of polling to check whether it completed successfully or not. You may also need to be able to undo operations: if part of it happened synchronously and the other part is happening asynchronously in the background, you may need to undo the first part that happened synchronously. So it adds a lot of complexity for error feedback and error handling in general. You need to take that into consideration when developing features that touch both sync and async at the same time, especially if it involves user feedback. Okay. Then there's this fourth problem, which is conflicting operations. While a task hasn't run yet, another operation is triggered by the user, for instance. This operation conflicts with the one that's still pending and creates a state that is not predictable. So for instance, let's consider these use cases: a user can add notes to our system, a user can delete a note, and a user can bulk-create copies of a note.
You can choose a note and create a bunch of notes based on that existing note, right? So let's look at the following flow I have on the right. Suppose the user starts by creating a new note, then triggers the creation of ten copies of the note. This runs in the background, because creating ten copies may take a little longer, and we don't want the user waiting on that. The task is queued: it goes to the queue and will be executed by a Celery worker. Then the user deletes the original note before the task to copy the ten notes runs. The task runs, but the original note isn't available anymore, so it cannot be copied; you cannot copy something that doesn't even exist in the database anymore. There are many solutions to this. You can implement, for instance, soft delete on the notes: whenever you delete a note, you don't actually delete it from the database, you just mark it as deleted, but you can still find it in the database, and you are still able to create the copies. You can also cancel all pending tasks before deleting: whenever I try to delete some note, I query for pending tasks around that note, tasks that will do something to it, and cancel all the pending ones. Another alternative is locking notes with pending operations: whenever we want to do something to an existing note, before creating the task we mark the note as locked, and whenever we try to delete a locked note, we get an error, because it's still locked; you have to wait for it to be unlocked before you can do other operations with it. These are alternatives for conflicting operations. This is just an example; it can be a lot more complex depending on your use case. But async may create a lot of issues.
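The soft-delete option from the note-copying scenario can be sketched compactly, again with a dict standing in for the notes table (all names are illustrative):

```python
# Soft delete: a "deleted" note is only flagged, so a pending copy task
# that was queued before the deletion can still complete.
notes = {}


def create_note(note_id, text):
    notes[note_id] = {"text": text, "deleted": False}


def soft_delete(note_id):
    notes[note_id]["deleted"] = True  # mark, don't remove the row


def copy_note_task(note_id, n):
    original = notes.get(note_id)
    if original is None:
        # This is what a hard delete would cause for the queued task.
        raise LookupError("original note is gone")
    # New copies start out live even if the original is flagged deleted.
    return [dict(original, deleted=False) for _ in range(n)]


create_note(1, "hello")
soft_delete(1)                  # user deletes before the queued task runs
copies = copy_note_task(1, 10)  # still succeeds: the row is only flagged
```

Regular queries would simply filter out `deleted=True` rows, so users never see the "deleted" note while background work can still reference it.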
You cannot just do a database rollback here; you don't have atomic operations across the boundary, so it's hard to have atomicity between the sync part and the async part. Basically, you have to handle it by storing state or even cancelling pending stuff. Now let's talk a little bit about how to make things smoother with Celery and Django. There are ways to work around some of these issues we talked about; there are some tips that can help, which we're calling couples therapy here: tips to help the relationship flow between Django and Celery. If you're going to retry tasks, for instance, you should use exponential backoff. You don't want a fixed number of seconds or milliseconds between the retries, because, for instance, if you are making a request to an API and the request fails, the API probably won't be back within a fixed number of seconds or milliseconds; you probably want to wait a bit longer between the later calls. So it's important to keep your backoff exponential between retries. Also, tasks shouldn't raise exceptions, the same way views are not supposed to raise exceptions. If an exception escapes a Django view, it raises a 500 error and the user won't know what happened. The same thing applies to Celery: if you raise an exception, it will just fail in an uncontrolled way. Ideally, you should handle all exceptions, and if there's nothing to be done in the code to adjust the state because of that exception, you can just send a report through email or through a monitoring tool, something like that. Exceptions should not go unhandled in Celery tasks, the same way they should not go unhandled in Django views. Monitoring is also essential; Celery adds a lot of complexity.
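A minimal exponential-backoff schedule looks like the helper below. In real Celery code, the task options `autoretry_for`, `retry_backoff`, `retry_backoff_max`, and `retry_jitter` implement the same idea declaratively, so this function is only an illustration of the math.

```python
import random


def backoff_delay(retries, base=1.0, cap=300.0, jitter=False):
    """Delay before retry number `retries`: base * 2**retries seconds, capped.

    Optional full jitter spreads retries out so many failed tasks don't
    all hammer a recovering service at the same instant.
    """
    delay = min(cap, base * (2 ** retries))
    if jitter:
        delay = random.uniform(0, delay)
    return delay


schedule = [backoff_delay(r) for r in range(5)]  # → [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap matters: without it, retry 20 would wait over 12 days instead of the 300-second ceiling.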
You need to be able to see this complexity, to see how it's going. There's Celery Flower, a package that can help: it lets you see the current state of your Celery setup, the running tasks, the tasks that have already run, so it can help you understand what's going on. There's also this flag called always_eager. You can set it per task, or globally for all Celery tasks, and it's very useful for development: whenever this flag is active, tasks run synchronously, so whenever you call a task it will run immediately instead of running in another process, which makes debugging a lot easier. If you actually do want the remote setup, with a different process for Celery, rdb could be your best friend. rdb stands for remote debugger: whenever you set a breakpoint, it stops the execution there and opens a connection, so you can attach a terminal to it with telnet and actually send debug commands. You can inspect variables and values, send a next or continue command, things like that. You can debug the whole thing even with remote workers that are not in the same process as your Django application. So rdb can be very helpful. It's built in, so you can import it from the Celery library; it's available within Celery, and you don't have to install anything external. It could be a really good friend for you; it can really help. For the second part of the couples therapy, I would like to mention a good tip for long tasks. Long tasks may not work exactly as expected with Celery, because, imagine you were the manager of this task: you cannot know whether the task has, for instance, frozen. It may have frozen; it will never give any result, it's just stuck in there.
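Assuming the usual `config_from_object(..., namespace="CELERY")` Django setup, the eager flag and rdb usage mentioned here look roughly like this. This is a sketch, not a drop-in configuration; the setting names below only apply with that namespace prefix.

```python
# settings.py -- development only: run tasks inline, in the calling process
CELERY_TASK_ALWAYS_EAGER = True       # .delay() executes synchronously
CELERY_TASK_EAGER_PROPAGATES = True   # task exceptions surface at the call site


# Inside a task module, when debugging a real remote worker:
from celery.contrib import rdb


def some_task():  # hypothetical task body
    rdb.set_trace()  # pauses the worker; attach with `telnet localhost 6899`
```

With eager mode on, the broker is bypassed entirely, so breakpoints, stack traces, and profilers all work in the main process; rdb covers the cases where you need the real worker topology.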
So in that case, you probably want to have a timeout. This is configurable, but you probably don't want tasks to run that long, like hours. So in that scenario, you probably want to split the task into smaller ones, to make sure you have a sense of progress, so you know that things are running and there's no timeout killing your task while it is still running. Another thing about Celery is that its monitoring tools are very limited. I talked before about Celery Flower, or Flower, I don't know how you pronounce it. So you may need to implement some monitoring yourself. For instance, in the past I had to implement queue heartbeats to know if my queues were well balanced. I had multiple queues for different sorts of tasks, and some queues were very full while other queues were very empty. If a queue is full, having a heartbeat may be helpful, because if you're scheduling a new task, it may take too long for it to run. So this is something you might have to implement yourself; I had to in the past. But there are some paid monitoring tools that might also help, things like New Relic or Datadog, which can help a lot in monitoring your Celery tasks, keeping useful logs, understanding the state and all that jazz. Another thing is that Celery is an excellent tool for simple async jobs, but for complex workflows it might not be very reliable: it has a lot of open issues and many complaints of lost tasks and unpredictable behaviors when you're running very complex stuff. So it may not be the best tool for that. But don't worry: the Celery and Django relationship is non-monogamous; other tools can live together with Celery.
For instance, you could use Celery for the simple stuff, because it has very little boilerplate, it's very easy to use, and the learning curve is small, but use Temporal.io to run the more complex stuff. It has a more robust infrastructure; it was created to run these complex workflows, and it even includes better monitoring tools built in, so you can rerun tasks and do things like that. So for very complex stuff, you may want to look into other tools. But if you're dealing with simple stuff that needs to run async, Celery could be your best friend. Other than that, we also have the DevChecklist site. Vinta has put that up, and it's an open source project as well. It has a bunch of dev checklists, and there is one specifically about Celery; my coworker Filipe Ximenes put this up, and it has a lot of best practices you should follow when integrating Django and Celery. You should have a look at that; it can be very useful if you're auditing an application or configuring a new integration. And that's it, folks. Thank you for hearing me. I hope you enjoyed the presentation. I'll probably be around at the event; if you want to talk about Celery or Django or anything related, just ping me. This is my email if you want to talk about it as well. And that's it. Thank you.

Latest Summary (Detailed Summary)

Generated 2025-06-21 18:10

Executive Summary

This talk, delivered by Hugo Bessa, partner at Vinta Software, takes a deep dive into integrating Django with Celery, framed as "a love story of async proportions". It first affirms Django's strengths as "the web framework for perfectionists with deadlines": batteries included, opinionated, and backed by a strong community. It then points out Django's performance bottlenecks, especially around long-running work, which make background processing necessary.

Celery, a distributed asynchronous task queue, is presented as the ideal answer. It integrates seamlessly with Django, letting developers write tasks inside the application with direct access to the Django ORM, and move slow operations (API calls, data processing) to the background with a simple .delay() call, so requests can respond quickly.

The core of the talk dissects four challenges of the integration, the "tough love": outdated data (pass references, not full objects), duplicate task runs (ensure atomicity and idempotency), complex error feedback (background failures are hard to surface to users in time), and conflicting operations (concurrency can produce unpredictable state). Against these, the talk offers a set of "couples therapy" best practices: exponential backoff for retries, proper exception handling in tasks, monitoring tools such as Celery Flower, synchronous (eager) execution while debugging in development, and splitting long tasks into smaller ones.

Finally, the talk concludes that while Celery is an excellent tool for simple asynchronous jobs, highly complex workflows may call for more specialized tools such as Temporal.io, forming a "non-monogamous" combination of technologies.

Django's Strengths and Performance Bottlenecks

  • Django's core strengths:

    • Batteries included: security, authentication, authorization, the admin site, and a mature ORM are all built in.
    • Opinionated: a clear best-practice path for developers, which has let a rich extension ecosystem flourish.
    • Strong community: a huge number of open source packages, plenty of community events, and continuously active core development. The speaker's company Vinta also maintains several well-known packages, such as Django React Boilerplate, Django Role Permissions, DRF-RW-Serializers, Django Virtual Models, and the newest, Django AI Assistant.
  • The acknowledged performance problem:

    • Django is built on Python, a language not known for speed.
    • The framework's support for async and multithreading is still developing and maturing.
    • The ORM can sometimes be misleading performance-wise.
    • Mitigations: the speaker lists a series of optimizations, such as avoiding N+1 queries, caching, database indexes, and data denormalization, with special emphasis on running operations in the background.

Why Background Tasks Are Necessary

  • How Django serves requests: Django runs behind a WSGI interface, and each request occupies one Django process for as long as it is being handled.
  • Processes are expensive:
    • Each Django process loads the entire framework core, so memory overhead is high.
    • Running too many processes is costly.
    • Long-running requests hold a process for a long time, forcing subsequent requests to queue and hurting overall throughput.
  • The core reasons:
    1. Free processes quickly: > "Requests should be processed quickly so we don't hold a process for too long."
    2. Immediate user feedback: responding quickly retains the user's attention and keeps them from switching tabs while waiting for an operation to finish.

Celery: Django's Asynchronous Partner

  • What is Celery?: An asynchronous task queue (or job queue) based on distributed message passing. Its most important property is being distributed: task execution is separated from the main application.

  • Why Celery?:

    • Distributed: decouples task execution from the application.
    • Fast: minimal boilerplate and very fast task execution.
    • Integrated: Celery tasks can be written inside the application with full access to Django's models, services, functions, and more.
  • The Celery-Django "love story":

    • Official documentation: dedicated docs for Django integration.
    • Easy configuration: Celery can be configured directly from Django's settings.py with no extra boilerplate.
    • ORM access: tasks can use the Django ORM and other tooling seamlessly.
    • Community ecosystem: many third-party packages enhance the integration.

How Celery Works: Architecture

A typical Celery workflow can be visualized in these steps:

  1. Task publishing (Django app)

    • The Django application uses the Celery client (for example, calling a task's .delay() method) to publish a message describing the task to the message broker.
    • Common brokers: RabbitMQ, Redis, SQS, etc.
  2. Task consumption (Celery worker)

    • One or more Celery worker processes continuously watch the task queue on the broker.
    • When a worker picks up a task message, it executes the logic the task defines.
  3. Result storage (results backend)

    • After a task finishes, the Celery worker stores the result (success, failure, return value, etc.) in the results backend.
    • The results backend is usually a database (such as Redis or the Django database) that stores task state and results for later querying.
    • For tasks whose results don't matter (such as sending an email), the results backend can be left unconfigured or unused.
Core Challenges and Pitfalls ("Tough Love")

Introducing a distributed system adds complexity. The speaker highlights four common pitfalls:

  • 1. Outdated Data

    • Problem: passing complex objects (such as whole Django model instances) as task parameters is dangerous: between the task being scheduled and actually executing, the object's state may have changed (or the object may have been deleted).
    • Solution: > "You should rely on references, not on complex objects."
      • The right approach: pass the object's ID or another unique identifier, and re-fetch the up-to-date object from the database inside the task.
  • 2. Duplicate Runs

    • Problem: under some configurations (multiple workers picking up the same task, or a task being re-queued after a worker failure), a task may run more than once.
    • Solution: > "You have to ensure that your tasks are atomic and idempotent."
      • Atomic: use database transactions (e.g., the @transaction.atomic decorator) so an operation either fully succeeds or fully rolls back.
      • Idempotent: design task logic so that running it multiple times has the same effect as running it once, for example by checking state before acting to avoid reprocessing.
  • 3. Complex Error Feedback

    • Problem: when a background task fails, the user may already have received a success response from the initial request, which complicates error handling and feedback.
    • Solutions:
      • The frontend should show a "processing" state rather than immediate "success", and check the final outcome via polling or similar.
      • Design compensation mechanisms so that if the async part fails, the already-completed synchronous part can be undone.
  • 4. Conflicting Operations

    • Problem: while a task is waiting to run, the user may trigger another operation that conflicts with it, leading to unpredictable system state.
    • Example: the user triggers a background task to create 10 copies of a note, then deletes the original note before the task runs.
    • Solutions:
      • Soft delete: mark data as deleted instead of physically removing it.
      • Cancel pending tasks: before a destructive operation such as delete, find and cancel pending tasks that reference the resource.
      • Locking: lock the resource while tasks on it are pending, blocking conflicting operations.

Best Practices and Solutions ("Couples Therapy")

  • Task design and execution:

    • Retries: retry with exponential backoff, to avoid hammering an external service while it is down.
    • Exception handling: tasks should catch every exception they can; > "Tasks shouldn't raise exceptions." Report errors via logs, monitoring tools, or email instead of letting the worker crash.
  • Debugging and monitoring:

    • Monitoring: use Celery Flower to visualize task state. For more advanced needs, custom monitoring (such as queue heartbeats) or paid tools (New Relic, Datadog) may be required.
    • Local debugging: set task_always_eager=True in development so tasks run synchronously at the call site, bypassing the message queue; this allows debugging in the main process and greatly simplifies development.
    • Remote debugging: use Celery's built-in remote debugger rdb (from celery.contrib import rdb). Set a breakpoint in task code; when a worker hits it, execution pauses and a port is opened that developers can connect to over telnet.
  • Long tasks and complex workflows:

    • Long tasks: avoid single tasks that run for hours; split them into smaller tasks for better progress tracking and to avoid timeouts.
    • Complex workflows: Celery is excellent for simple async jobs, but its reliability can be challenged by complex, stateful workflows.
      • A "non-monogamous" relationship: consider a hybrid approach.
| Aspect | Celery | Temporal.io |
| --- | --- | --- |
| Core positioning | Simple asynchronous task/job queue | Orchestration of complex, stateful, durable workflows |
| Typical use cases | Sending email, API calls, bulk data processing, scheduled jobs | Order-processing flows, Saga patterns, complex business logic needing retries/rollback |
| Reliability | Community reports of lost tasks and unpredictable behavior in complex scenarios | More robust by design, with strong durability and state management |
| Learning curve | Low; simple Django integration, quick to pick up | Higher; more complex concepts and architecture |
| Monitoring | Relies on external tools such as Celery Flower, or custom work | More complete built-in monitoring and visibility |

Conclusion and Recommended Resources

  • Key takeaway: Django and Celery are a powerful combination, but making the "relationship" work requires developers to understand the complexity of distributed systems and follow best practices to design robust, reliable asynchronous tasks.
  • Recommended resource: the speaker recommends DevChecklist (devchecklist.com), an open source project maintained by Vinta, which includes a checklist of Celery best practices useful for configuring a new integration or auditing an existing application.