2020-10-15 | DjangoCon 2020 | Understanding Celery to maintain your sanity - Ashwini Balnaves
A deep dive into how Celery works and how to troubleshoot it
Media details
- Upload date: 2025-06-21 18:07
- Source: https://www.youtube.com/watch?v=v1m-jbPrYfw
- Processing status: Completed
- Transcription status: Completed
- Latest LLM Model: gemini-2.5-pro
Transcript
speaker 1: My name is Ashwini, and my talk is called Understanding Celery to Maintain Your Sanity. I had the idea for this talk after we had a critical issue in production. I work as a full-time software engineer at a company called Kapiche, and I was fairly new to the company at the time, but I had read the docs on how to use Celery. I'd even written a few tasks. But now that things had gone really wrong and nothing was working, I needed to know exactly how everything worked so that I could jump in, see what was happening, and try to fix it quickly. That was a very stressful way to learn how Celery works, especially since a lot of the online resources that I found focused on which methods to call and how to get started. I would recommend to anyone who has Celery in their stack that they invest the time to understand Celery before it goes wrong, which is what I hope to help you achieve with this talk. So the first thing to know is that Celery is an implementation of a distributed asynchronous task queue. In my case, we have these very long batches of work that need to be done. Like I was saying before, I work at a startup called Kapiche, based in Brisbane, Australia. We process huge amounts of customer data in order to uncover customer insights. This means uploading and handling CSV files with hundreds of columns and millions of rows, and updating large natural language processing topic models. So we have a lot of work to do that takes quite a while. But why do we want to do these tasks in the background as opposed to in a Django view? After all, doesn't Django have everything I need? Well, doing things in the background means that you can respond back to your client very quickly. Otherwise, you've got to make your client, and ultimately your users, wait for pages to load or for actions to complete. This is going to consume a connection in your Django app and limit the responsiveness of your application. And if that doesn't convince you, I found out the hard way that browsers have timeouts, so there's a hard limit on how long a request handler can take to process a request anyway. If we have Celery doing all this work instead of doing it in our views, we also get another huge benefit, which is that we decouple that work from the Django app. Decoupling gives us two main benefits. The first is the ability to horizontally scale independently of the main Django application. For us, specifically, this means I can add more Celery services to our Kubernetes cluster without impacting the Django app at all. This also gives us the ability to restart the Django app, for example for a deployment, without having to wait for tasks to complete. For us, this is very useful considering some of our tasks can take several hours. There are many other reasons as well. When I was putting together this talk, I asked some of my other developer friends what they used Celery for, and there were many other reasons. Most of them were also using Celery for long-running, CPU-bound tasks, but other reasons included managing resource contention and parallelizing work as an optimization. So why Celery? I use Celery because it was part of the stack that I inherited. A lot of us as developers don't work on greenfield projects most of the time, and due to Celery's popularity, a lot of Django applications involve Celery. It is by far the most popular of all the task queues. I think its popularity comes from the fact that it's a very mature project. It has many integrations with different frameworks, message queues and databases.
It's included in many tutorials, and it markets itself as very simple and powerful. I would agree that it's very simple for simple use cases, and it also has very many powerful features, but with those comes quite a dramatic increase in complexity when using it in more involved situations. Specifically, when you start changing some of the default values for the configuration settings, there are so many levers that you can pull, and it's really important to understand what you're doing. But it's by far the most popular, although there are alternatives, such as the ones that I've listed here in no particular order. So what is it actually doing? Let's talk about some definitions. We're going to go over each part of the system. There's going to be a little bit of code, but our main focus is going to be how everything works together, so a bit more of a systems view. First, we have the client. This is the service that wants to submit work to be done, and in our case, this is the Django view. Then we have the work that actually needs to be done, and in Celery, these are called tasks. Finally, we have the workers, and they are ultimately what executes the work, or the tasks. You can see here on the right we have a little bit of code. We can see that tasks are defined as Python functions in a tasks file within our Django app using the shared_task decorator. The shared_task decorator is unique to Django apps, so when you see Celery tutorials that use frameworks other than Django, you might see a slightly different decorator, which is the app.task decorator. This is a little bit of a side note, but because Celery only has one instance, and we need to expose all the submodules, or the sub-apps, in Django to that single Celery instance, we need the special decorator. Then we have the client, and the client imports the tasks and calls them using the delay method. I've taken this code snippet from the official Django docs, one of their very first examples, and this is the simplest way that you can call a task. Celery has a lot of powerful features, especially when it comes to the ways that you can call tasks. There's the ability to link them together so they execute one after the other, or we can send them all off in a group. We can even attach callbacks to that group, but I'm not going to go too much into that. There are some really good resources online; search for what Celery calls its canvas design. And then we have workers. A worker spawns subprocesses in order to execute the tasks. The number of subprocesses defaults to the number of CPUs available on the machine on which the worker is running, but if you want, you can change that with the -c argument. In our example here, we have three workers. But one of the big values of what Celery has to offer is that it is distributed, so these workers can be, and in fact usually are, on different machines. Each one will have as many subprocesses as there are CPUs available on the box on which it's running. So there are three workers, three worker boxes, in this picture, but there could potentially be more subprocesses actually running. So there's a little bit more going on. How does the client actually get the task to the worker to execute? Well, we use a broker, or a message transport. This is the way that the messages get communicated between the clients and the workers. The client will submit a task to the broker, and the workers will subscribe to the broker and fetch tasks off of its queue.
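A minimal sketch of the pattern described above, loosely following the first-steps examples in the Celery docs; the `demoapp` module and the `add` task are placeholder names rather than the speaker's actual code.

```python
# demoapp/tasks.py -- placeholder app; the task is an ordinary Python function.
from celery import shared_task


@shared_task
def add(x, y):
    # Runs inside a worker subprocess, never in the Django request/response cycle.
    return x + y


# In a Django view (the "client"), the task is imported and submitted with .delay():
#
#   from demoapp.tasks import add
#   result = add.delay(2, 3)   # returns immediately with an AsyncResult
#   result.id                  # the task id the broker and result store will use
#
# Canvas primitives compose tasks: a chain runs them one after another (each result
# feeds the next), a group dispatches them in parallel.
#
#   from celery import chain, group
#   chain(add.s(2, 3), add.s(10)).delay()          # (2 + 3), then (5 + 10)
#   group(add.s(i, i) for i in range(5)).delay()   # five independent tasks at once
```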
Once a worker process has executed a task, it then writes the result of that task into the result store. So let's talk a little bit more about the broker. Just as a note, the documentation and a lot of tutorials use the words broker and message queue interchangeably, which confused me a little at first, but we're going to stick with broker to keep it consistent. There are several services that Celery can use as a broker. RabbitMQ is the service that the Celery docs put to you first, but Redis is also a popular option, and Redis is what I use. When a task is submitted to the broker, it's given an ID and it's stored in a serialized format. Celery supports many different serialization formats as well, but you might also be limited by what serialization formats your broker supports. In my case, we use both pickle and JSON. Then there's the result store. The term result store is also used interchangeably with result backend; we're going to be sticking with store. I use the Django ORM, which has a nice integration called django-celery-results, but as you can see, there are many options that you can choose from. django-celery-results even gives you an interface through the Django admin panel. You can see here that each row is a completed task that's been written into the result backend. So now it's time for a story of how things go wrong, and with a tool so configurable, it probably will. There are quite a few stories that I could have chosen to tell. For example, we had problems with growing memory requirements in Redis because I was passing the entire contents of a file through the task parameters. Because I was treating tasks as if they were regular functions, I didn't realize that the function parameters were being serialized and stored in Redis as well. We also had a lot of fun when we found out that workers will prefetch a certain number of tasks. That means if you have tasks that run for a long time, which we do, a worker could potentially fetch a batch of them, let's say three, and it will go through and execute each of those, while perhaps there are other workers that are available. So let's say it's fetched three, and the first one is a long-running task: tasks two and three are sitting there waiting to be run, even though there are other workers available. So if you do have quite long-running tasks, the suggestion is that you run with the flag -O fair. That way, there will be no prefetching of tasks, and workers only grab them one at a time. We've also had fun when tasks have failed to chain as expected, because there are immutable signature types that behave differently to mutable ones. Tasks created using immutable signatures don't receive the previous task's results, and if you use the linking method from the examples in the Celery docs, it doesn't actually wait, and they're not executed sequentially. A lot of these stories are a result of me not understanding how Celery actually works. Since everything was working when I got there, and I found the documentation a little overwhelming, I didn't really feel the need to dig in; I assumed that everything was set up optimally and it was running fine. But what I found out was that Celery's default configuration settings are set up for a high frequency of short-ish tasks. What I mean is a lot of tasks, each taking a short-ish amount of time.
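A sketch of how those pieces are typically wired up in Django settings, assuming the common pattern where the Celery app reads Django settings under a `CELERY_` namespace; the Redis URL is a placeholder, and `django-db` is the backend provided by django-celery-results.

```python
# settings.py -- broker, result store and serialization, assuming the Celery app is
# configured with app.config_from_object('django.conf:settings', namespace='CELERY'),
# so each key below maps to the lower-case Celery setting without the prefix.

CELERY_BROKER_URL = "redis://localhost:6379/0"   # Redis as the broker (placeholder URL)
CELERY_RESULT_BACKEND = "django-db"              # django-celery-results: results in the ORM

# Serialization of task messages and results; the broker just stores bytes, but the
# producer and the workers have to agree on the formats they will accept.
CELERY_TASK_SERIALIZER = "json"
CELERY_RESULT_SERIALIZER = "json"
CELERY_ACCEPT_CONTENT = ["json"]   # add "pickle" only if every producer is trusted

# For long-running tasks, limit prefetching so idle workers are not starved while a
# busy worker sits on a batch it has already reserved, e.g. start workers with:
#   celery -A proj worker -l info -O fair
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
```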
As our company began to grow, though, and gained more and more customers with larger and larger data sets, our tasks became longer and longer, and we needed to change the configuration settings away from their default values. However, because Celery has so many levers you can pull, we started to get ourselves into trouble as we changed settings without realizing the impact they would have on the whole system. We didn't have a comprehensive understanding of how all the levers impacted each other. And by the way, I would say I still don't have a comprehensive understanding of how all the different configuration works within Celery. I kind of glossed through those stories, and some of them probably didn't even make much sense if you're new to Celery, but we're going to go through one in a little more detail, the one I was referencing at the beginning of this talk that made me have to learn a lot about Celery. So how did it all begin? Well, the defaults for Celery are configured for lower reliability than what we really wanted as a team, so we turned on task_acks_late to increase reliability. What does that mean? I'll tell you. It means that a worker can acknowledge the message from the broker either before or after the worker actually executes the task. By default, the client submits the task to the broker, the worker pulls off the task and acknowledges it, and then that task is no longer in the broker's queue while it is executing on the worker. However, with the non-default behaviour which we turned on, when the worker pulls off the task, it waits until the execution of that task has finished before sending the ack. So the task is sitting in the broker's queue while it's executing and only gets acknowledged when it's done. This means that if something happened to the process while the task was executing and the task was never acknowledged, the broker would retry the task. The idea here is that we would be protected from data loss due to worker failure. Everything seemed to be going well, but then we kept growing, the tasks were getting longer, and our tasks started to time out. We could tell this because, well, none of our tasks were completing, but we could also see the timeout in the logs, and we could see that the default was set to 300 seconds. So we figured we could simply raise this time limit and our tasks would complete. We knew that most of our tasks took on the order of hours, so we set it to 24 hours. Even though it was a little overkill, we knew it would definitely be long enough for even our longest task to complete. All seemed well again, until we discovered through a customer report that nothing was working in our UI. New file uploads were being marked as queued and were not being processed. This is actually another pretty good tip: if you can surface the state of your tasks within your UI to reflect the state that they have in Celery, that can be very useful. In this case we could see that they were being queued but not being executed. This was not a good situation to be in. The pressure was on to fix the downtime in our system. We could see that the tasks were being sent by the client, but where could the blockage be? Was it in the broker? Were the workers not picking up the tasks? Were the workers themselves failing in some way? Well, in comes the Celery CLI. This is a great tool for being able to see what's happening within Celery.
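Before the CLI part of the story, a sketch of the two levers just described, using the same `CELERY_`-namespaced settings assumption; the 24-hour figure mirrors the value chosen in the talk, and because the talk does not name which setting produced the 300-second default, the time-limit names here are the usual candidates rather than a confirmed diagnosis.

```python
# settings.py -- reliability and timeout levers from the story (CELERY_ namespace as before).

# Acknowledge the message only after the task finishes. The default (False) acks as soon
# as the worker receives the message, which is faster but loses the task if the worker dies.
CELERY_TASK_ACKS_LATE = True

# Hard and soft execution time limits. The hard limit kills the task outright; the soft
# limit raises SoftTimeLimitExceeded inside the task a little earlier so it can clean up.
CELERY_TASK_TIME_LIMIT = 24 * 60 * 60        # 24 hours, as in the talk
CELERY_TASK_SOFT_TIME_LIMIT = 23 * 60 * 60   # optional, slightly shorter than the hard limit
```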
The Celery CLI has many great commands, but rather than reading the docs, I would recommend you have a play with it to see what it's capable of, in some sort of environment that's safe. We're going to have a look at just how useful it is in solving our problem. The first thing we needed to do was get some visibility of what was going on, and the prime command for that is inspect active. I've included a little example here of what the Celery workers would look like if they had no tasks. But back in our story, this is what the result of inspect active was. You can see that our Celery workers are very busy, and not only are they busy, they're all busy with the same task. This is actually a sanitized and simplified version of what the actual command's output was on that day. So they're all busy with the same task, which was very confusing, because not only was it strange to have them all executing the same task, but I could see that that task was marked as successfully completed in the result store. This must mean that the broker was keeping the task in the message queue and allowing workers to retry it and pull it off the broker again and again. So why weren't the messages on the broker being acked when we knew that the task had been successfully executed? Well, there's a little bit of a clue here, which is that acknowledged is marked false. So we had confirmation that the task was, in fact, sitting in the broker, and the new tasks that needed to get executed were just piling up behind it. I really wanted to know what was going on, but because it's our production system and it was currently not working, we had to look at fixing that before investigating what could possibly have caused it. The Celery CLI comes to the rescue again. Now, I wasn't sure at the time what the difference between revoke and terminate was, and it's still kind of hard to find this information. The monitoring and management guide has a lot about the ability to examine tasks, but not so much about controlling them, and even the command line interface docs don't really specify exactly what everything does. That's why I suggest, if you can, playing around and using the --help option; that's a great way to learn about the capabilities of the Celery CLI. So, revoking a task versus terminating a task. Revoking a task adds that task ID to a set of task IDs held in memory in the Celery workers. Whenever a Celery worker pulls a new task off the broker's message queue, it checks whether it's in that set, and if it is, it won't run it. As you can see, if the worker is already running the task, it's already way past that check, and so revoking a task is not going to affect a task that is already running. As a side note, because this set is in memory in the Celery worker, not in Redis, our broker, it doesn't survive restarts, meaning that the tasks will effectively be un-revoked. If you want your revokes to be persistent, you have to set up that set to exist on disk on the machine that your worker is running on. But that's a side note. So what we wanted to do was terminate. So we terminated our task that was running on all the workers. It killed all the processes that were executing our task, and it's important to note that it does then actually acknowledge the task, so it's no longer queued. This is dangerous, though, and the reason for that is that it kills the process, not just the task.
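A sketch of those inspection and control calls through Celery's Python API, with the rough CLI equivalents in comments; `proj.celery` and the task id are placeholders.

```python
# Inspecting and controlling workers from Python; roughly what the CLI commands do.
from proj.celery import app  # placeholder: the project's Celery application object

# Roughly `celery -A proj inspect active`: what is each worker executing right now?
active = app.control.inspect().active() or {}
for worker_name, tasks in active.items():
    for task in tasks:
        # The per-task dict includes the id and name; the acknowledged flag was the
        # clue in the story that the broker was still holding the message.
        print(worker_name, task["id"], task["name"], task.get("acknowledged"))

stuck_task_id = "replace-with-the-real-task-id"

# revoke: adds the id to each worker's in-memory "do not run" set, so it only stops
# future executions; a task that is already running is past that check.
app.control.revoke(stuck_task_id)

# terminate: also kills the child process currently running the task, after which the
# message is acknowledged and leaves the queue -- at the cost of anything else that
# process had prefetched.
app.control.revoke(stuck_task_id, terminate=True)
```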
Terminating kills the whole process, which means that if there is anything in memory on your Celery worker that is important, such as prefetched tasks, it could go missing. So what was happening? Well, after we managed to unblock the system, we found out that the reason for this problem was that the broker can't wait indefinitely for the acknowledgement to eventually come. As you can see, I've got the broker here thinking: how long should I wait for my acknowledgement before I decide that something has happened to the Celery worker, the acknowledgement is never going to come, and I need to retry the task? Well, it turns out there is a setting in Redis for this. It even turns out that there's a caveats section under the broker section of the Celery documentation, but none of us on our team had read the entire documentation, so we didn't know this. In the end, we set the visibility timeout to 24 hours as well. Under a time crunch, we got a lot of value out of the Celery CLI, but there is a wonderful graphical user interface that you can use called Flower, which gives you the same functionality as the Celery CLI with a much nicer interface, and with the benefit of foresight you can set it up in advance. If you're using a distributed system, adding tracing and having good logging is also incredibly valuable; I cannot recommend it highly enough. I love our tracing system, we use Honeycomb, and it has dramatically increased my ability to gain visibility over what on earth is going on within our system. But that's another talk. I hope that this talk will help you with your Celery issues and will give you a foundation to draw upon when you're reading the Celery docs. Thank you. Oh, just for the sake of the recording, I'll repeat the question, which is: the Django community is moving towards using Django Q, so what are they missing out on? I'm probably not the best person to ask, because I haven't had much experience with Django Q. I think one of the main focuses that I wanted for my talk was for people who are in a similar situation to me, who have inherited a project with Celery already in it, and maybe didn't feel it necessary to learn the underlying details of how it worked, because it was all just happily chugging along successfully, until something goes wrong in a way that requires you to jump in. So I personally haven't been able to look into the other task queues very much. Are there any other questions? speaker 2: What were some of the reasons that caused you to start changing more of the background settings that you were warning against? speaker 1: Yeah. So I guess important context for this conversation is that at the startup I work in, our workforce is quite small, but we're really at the scaling-up phase of our journey. Our engineering team at the moment is three people. What that means is that we are still finding product-market fit, and the kind of data we were testing with, and thought our customers would have, was much smaller. We thought we were going to be dealing with a couple hundred rows in a spreadsheet sort of thing, and the code was written with that in mind; that's all we were testing against. Then, as we started to work out that enterprise-level customers were much more our ideal customer profile, they came with way more data than we ever expected. So our tasks and everything were set up without that really in mind, and they were really long-running.
And ultimately, we had to support those sales as they came on, and that meant just tweaking things as we went to try and make the current system work, because we don't have six months to make sure that we're being really precise and careful with everything. It's kind of one of the joys of the startup lifestyle. Did that answer your question? speaker 2: It did. And I've got a quick follow-on to that, which is: do you have any insight into when you should be looking at your Django code, or whatever Celery is running, versus looking at Celery itself for the optimizations? speaker 1: Yeah, that's a great point. I think for us, a lot of it has to do with profiling and trying to work out what's the quickest way to get these gains at the moment. We're focusing a lot on performance work, and a lot of that isn't even at the Django level or the Celery level; it's more the shape of our data and making sure it's coming out of the database efficiently. So even now, we still have quite long-running tasks, and we are just making it work, I guess, until we have time to optimize those as well. speaker 2: Thank you. speaker 1: No worries. Tim asked: do you have any recommendations for tutorials on learning Celery? I actually do, and there are quite a few good common-gotcha articles. I'll link them in the Slack so everybody else can have a look at them; I came across quite a few while I was researching this talk as well. speaker 2: I would love that. speaker 1: No problem. speaker 2: Hey, thank you very much for your talk, great talk. I have a question regarding Flower. I didn't get to play around with it too much, but I noticed it only works if it's already been running for a while, like it's keeping its own log or something. Do you know where it stores it? And do I need to worry, if I keep it running a long time, that at some point it will just store too much? speaker 1: I'm not totally sure, and the reason for that is that we only have it set up in our local environment at the moment. We have it running as a service in our Docker Compose for our local environment, but we haven't managed to get it into our production system yet. I don't take my own advice, apparently. speaker 2: Cool. Thanks. speaker 1: I did see in the Slack that a lot of other people have had some experience with Flower, so it might be worth having a discussion there. speaker 2: Yeah, sure. Thank you. speaker 2: Hi, hello. First of all, thank you for the great talk. Perhaps this is kind of a newbie question, but when should we start considering using Celery tasks instead of cron jobs? Thank you. speaker 1: Oh, that's a good question. So Celery does have quite robust periodic task functionality as well. You run it as a separate service called celery beat. I have had some pretty good experiences with that; we use it pretty extensively in our environment, and I think there is a lot that the periodic tasks framework gives you. You can configure it in the Django admin as well: you can manually run tasks, and you can disable and enable them from the Django admin. That's one of the biggest pros that I've found with using it. speaker 2: Yeah, that sounds really cool, being able to configure it from the Django admin. Thank you. speaker 1: No problem.
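A sketch of the periodic-task setup being discussed: either a static schedule in settings, or, for the Django-admin workflow the speaker describes, django-celery-beat's database-backed scheduler. The task path and timing below are placeholders.

```python
# settings.py -- a statically defined schedule (CELERY_ namespace as before).
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "nightly-cleanup": {
        "task": "demoapp.tasks.cleanup",        # placeholder task path
        "schedule": crontab(hour=2, minute=0),  # every day at 02:00
    },
}

# With django-celery-beat installed, schedules live in the database instead and can be
# created, enabled/disabled and re-timed from the Django admin. Beat runs as its own
# service alongside the workers, e.g.:
#   celery -A proj beat --scheduler django_celery_beat.schedulers:DatabaseScheduler
```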
We did actually, just yesterday, have issues with celery beat, though, because we had set up one Kubernetes cluster and then an upgraded Kubernetes cluster, and they were sharing the database. We were planning on running them both at the same time and then switching over, but we forgot that our periodic tasks were reading from the database, and so we had two sets of Celery tasks going at once. It was only by pure coincidence that we had not set up the GCP permissions correctly on the second cluster's nodes, so all the uploads failed, thankfully, otherwise we would have been cleaning out duplicated data all day yesterday. speaker 2: Hmm. Hi, Ashwini. Yeah, adding on to the question about cron: I think cron is used where we can relate to the system level, while Celery is used where we can relate to the application. Maybe there's a task in a Python application that you want to run, and cron wouldn't be able to figure that out. Cron is basically for scheduled tasks at the system level, and Celery is mainly used for scheduled tasks at the application level. So I think that might help add to the question too. speaker 1: Yeah, that's a really good point. Thanks. speaker 2: No worries. Thanks for that.
Latest Summary (Detailed Summary)
Core Summary
This summary distils Ashwini Balnaves's DjangoCon 2020 talk. Its thesis is that developers cannot stop at knowing "how to use" Celery; they need to understand its internals, its configuration, and how the settings interact, so that they can respond calmly when something breaks in production. Using a real production incident as a case study, the talk dissects a system-wide failure caused by an incomplete understanding of Celery's configuration and shares practical techniques for diagnosing and resolving it.
At the heart of the incident, the team enabled task_acks_late=True (acknowledge after execution) to improve task reliability, but overlooked the broker's companion setting, the visibility timeout. As the business grew and task runtimes stretched, a task that had completed successfully but taken too long was never acknowledged within the broker's timeout window, so the broker redelivered it again and again until it occupied every worker and blocked the entire queue. The talk stresses that Celery's defaults are tuned for high-frequency, short-lived tasks; when the workload changes, configuration must be adjusted carefully, and tools such as the Celery CLI, Flower, and distributed tracing should be used to keep the system observable.
The Value of Celery and Its Use Cases
- Handling long-running background work: as a distributed asynchronous task queue, Celery's core value is offloading time-consuming operations, such as uploading and processing CSV files with millions of rows or updating large natural language processing topic models, so they do not block the main application.
- Better user experience and responsiveness: moving long tasks to the background lets the web app respond to clients quickly, avoids making users wait for pages to load or actions to complete, and sidesteps hard browser timeouts on request handling.
- Decoupling and independent scaling:
  - Decoupling: separating task execution from the main Django app allows the app to be restarted and deployed without waiting for long tasks to finish.
  - Independent scaling: Celery workers can be scaled horizontally on their own (for example by adding worker nodes in a Kubernetes cluster) to match load, without touching the main app.
- Other use cases: managing resource contention and parallelizing work as a performance optimization.
Celery Core Architecture
- Client: the initiator of work; in a web app this is usually a Django view. It dispatches tasks by calling `.delay()` or `.apply_async()` on the task function.
- Task: the unit of work to be executed; in Django this is usually a Python function defined with the `@shared_task` decorator.
- Worker: the background process that executes tasks. Each worker spawns child processes (defaulting to the number of CPU cores) to run tasks in parallel, and workers can be distributed across machines.
- Broker: the intermediary between the client and the workers, responsible for receiving, storing, and distributing task messages. The client puts task messages on the broker's queue; workers fetch them from it.
  - Common choices: RabbitMQ (put forward first by the docs) and Redis (widely used).
  - Serialization: tasks are serialized when stored in the broker. The speaker's project uses both pickle and JSON; the broker's supported formats constrain the choice.
- Result store (result backend): holds task state and return values, written by the worker once a task finishes. The speaker uses `django-celery-results`, which uses the Django ORM as the backend and surfaces completed tasks in the Django admin.
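To tie the pieces above together, here is the conventional Django integration module, close to what the Celery documentation's Django example shows; `proj` is a placeholder project name.

```python
# proj/celery.py -- the conventional Django/Celery wiring (project name is a placeholder).
import os

from celery import Celery

# Django settings must be importable before the Celery app is configured.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "proj.settings")

app = Celery("proj")

# Read every CELERY_-prefixed key from Django settings (broker URL, result backend, ...).
app.config_from_object("django.conf:settings", namespace="CELERY")

# Discover tasks.py modules in all installed apps, so @shared_task functions register
# against this single application instance.
app.autodiscover_tasks()

# A worker is then started per machine with something like:
#   celery -A proj worker -l info -c 4    # -c caps the number of child processes
```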
Case Study: A System Meltdown Caused by Misconfiguration
- Background and motivation: pursuing higher reliability
  - Celery defaults to task_acks_late=False (acknowledge on receipt): the worker acks the message as soon as it picks it up, so if the worker then crashes, the task is lost.
  - To prevent data loss, the team switched to task_acks_late=True (acknowledge after execution): the worker acks only after the task completes, so if it dies mid-task, the broker redelivers the task because no ack ever arrived.
- Chain reaction: from task timeouts to a blocked system
  - As the business grew, customer data volumes exploded and task runtimes stretched from minutes to hours.
  - Tasks began hitting the timeout (default 300 seconds); the team simply raised the limit to 24 hours.
  - Soon afterwards a customer reported that nothing was working: new uploads showed as "queued" in the UI but were never processed, and the whole task pipeline stalled.
- Diagnosis: locating the "zombie task"
  - `celery inspect active` showed every worker busy executing the same task.
  - Yet the result backend showed that very task as already marked "successfully completed".
  - Its `acknowledged` field was False, confirming that the task had finished but was never acked, so the broker kept redelivering it to idle workers, creating a loop that blocked every new task behind it.
- Emergency recovery and root cause analysis
  - Emergency measures: to restore service, the blocked processes had to be stopped. The speaker contrasted two commands:
    - `revoke`: only adds the task ID to an in-memory blacklist on the workers, preventing future executions; it has no effect on a task that is already running.
    - `terminate`: kills the child process executing the task and acknowledges the message to the broker. It is risky, because other tasks prefetched by that process can be lost.
    - The team used `terminate` to clear all the stuck processes and restore the system.
  - Root cause: the broker's (Redis) `visibility_timeout` setting. With acks_late enabled, the broker only waits that long for an acknowledgement. The team had extended Celery's task timeout but forgotten to extend the broker's visibility timeout, so while a long task was still executing normally, the broker concluded the worker had failed and requeued the task, triggering the cascade.
  - Final fix: set the broker's `visibility_timeout` to 24 hours as well, matching the task timeout.
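A sketch of the fix described above, assuming the `CELERY_`-namespaced Django settings pattern; the Redis transport reads `visibility_timeout` from the broker transport options, and the 24-hour value mirrors the figure chosen in the talk.

```python
# settings.py -- with task_acks_late enabled, the Redis broker redelivers any message
# that is not acknowledged within visibility_timeout seconds, so it must be at least
# as long as the longest-running task.

CELERY_TASK_ACKS_LATE = True
CELERY_TASK_TIME_LIMIT = 24 * 60 * 60            # the task timeout raised in the story

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "visibility_timeout": 24 * 60 * 60,          # keep in step with the task timeout
}
```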
Key Lessons and Recommended Tools
- Beware the default configuration: Celery's defaults are tuned for high-frequency, short-lived tasks. For low-frequency, long-running workloads you must understand and adjust the relevant settings.
- Avoid common performance pitfalls (a minimal sketch of the second point follows at the end of this section):
  - Prefetching: for long-running tasks, the default prefetch behaviour can leave work queued behind a busy worker while other workers sit idle. Start workers with `-O fair` so they take one task at a time.
  - Task argument serialization: never pass large payloads such as whole file contents as task arguments; the arguments are serialized and stored in the broker (e.g. Redis) and will quickly consume its memory. Pass references instead, such as a file path or a database ID.
- Make use of monitoring and debugging tools:
  - Celery CLI: a powerful command-line tool and the first choice for quick diagnosis.
  - Flower: a graphical web UI with functionality similar to the CLI and a friendlier interface; be mindful of how it stores and persists its own data.
  - Distributed tracing and logging: strongly recommended (the speaker uses Honeycomb); cross-service visibility dramatically improves troubleshooting of complex systems.
- Periodic tasks (Celery Beat): compared with cron, Celery Beat is integrated with the application and can be managed visually through the Django admin (configuration, enabling/disabling, manual triggering). Beware, though, of duplicate execution in more complex deployments, for example when multiple clusters share the schedule database.
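A minimal sketch of the "pass a reference, not the payload" advice from the pitfalls above; the `Upload` model and `run_processing` helper are hypothetical stand-ins for whatever holds the uploaded file and does the actual work.

```python
# demoapp/tasks.py -- only a small database id travels through the broker; the heavy
# file contents are loaded inside the worker, so broker (Redis) memory stays flat.
from celery import shared_task

from demoapp.models import Upload  # hypothetical model with a FileField


def run_processing(data: bytes) -> None:
    """Hypothetical placeholder for the real processing logic."""


@shared_task
def process_upload(upload_id: int) -> None:
    # The id was the only task argument serialized into the broker.
    upload = Upload.objects.get(pk=upload_id)
    with upload.file.open("rb") as fh:   # the file is opened lazily, inside the worker
        run_processing(fh.read())


# Client side (e.g. a Django view): enqueue by id, not by content.
#   process_upload.delay(upload.pk)
```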