speaker 1: Hello, welcome to DjangoCon 2021, everyone. My name is Schima Japan. I work for a company called Blackfire.io, and we specialize in performance-related observability tools for various languages. As you might guess, I work on the Python side of things, and today I will be talking about how, and with which tools, to debug performance issues in Python code.

Let me just say that this will not be a Python or Django performance tips-and-tricks talk. While I will briefly touch on some general rules, most of the time I will focus on what kind of tools we have to identify and debug performance issues and how they work under the hood. As you can see in the outline, I will first talk about why we need these tools in the first place, then how they work under the hood, then what kind of data they produce and some common formats and patterns used to visualize their output. And at the end, we will walk over some of the tools available. In my opinion, a better understanding of these observability tools helps more in diagnosing performance issues than strictly following a set of performance rules or tips and tricks.

So I would like to start with this very famous quote from Professor Knuth. I have heard this quote many times throughout my career, and it is so correct. Doing premature optimization not only doesn't solve your performance problem, it also introduces unnecessary complexity into the code. And in my book, if there is one thing all developers can agree on, it's this: avoiding complexity is better. I have lost count of the number of times I did various optimizations without doing any measurements, only to see at the end of the day that they didn't matter that much. Maybe some of you are better at this than me, but I have seen this enough times to believe that we humans are very bad at estimating performance bottlenecks. So I really believe in this quote from Professor Knuth.

But the thing is, this quote is not complete. It's probably the most-quoted part of it, but it's not complete. Here is the full quote. As you can see, Professor Knuth clearly says not to do premature optimization, but that you should optimize if you can identify that critical 3% well enough. I think this updates the perspective on the underlying message of the quote: if you can find that critical 3%, then you should try to optimize it. That is what I understand from the full quote. So, to chase this critical 3%, I think it's key to understand what kind of tools are available to us and how we can use them. And in order to use these tools better, I believe we first need to understand how they work, at least at a very high level, and then have a clear picture of what kind of information they can provide. That will hopefully be the main topic of my talk today.

Before going any further, I'd like to repeat something: you cannot improve what you cannot measure. Measuring is key, and the tools we use to measure code are called profilers. Okay, so let's start from the beginning and search for some open source profilers to see what kind of options we have in the Python ecosystem. If you do that, you will see that the answer might not be that easy. This screenshot here is from GitHub, searching for the keyword "Python profiler".
As you can see, there are over a thousand profilers available in the Python ecosystem, and you can clearly see that they are not all doing the same thing. Some of them are sampling profilers, some specialize only in memory usage or line profiling, and so on. I will try my best to explain the differences between them: what the basic types are, what kind of data they provide, and in which cases they could be useful to us, especially for web application development. And by the way, this knowledge is not only relevant to Python. For example, the Go language profilers share the same foundations that we will hopefully cover today.

Okay. The first thing I'd like to talk about is that there are three basic questions that any profiler answers, and the answers to these questions are what make a profiler unique in its feature set: what data is collected by the profiler, how is it collected, and how is it presented? For example, the answer to the second question categorizes profilers into two basic types, sampling and tracing profilers. Let's try to understand that a bit more.

I'd like to use the following analogy to explain the basic difference between them. Imagine there are two coaches and an athlete. The athlete is constantly running laps, and the coaches are watching from behind and taking measurements. The first coach, let's call him Bob, is constantly monitoring the athlete and taking measurements with his wristwatch. Whenever the athlete finishes a lap, he saves this information. The second coach, let's call him Smith, does the same thing, but he takes measurements whenever he likes. He constantly talks on the phone, does other stuff, and when he feels it's the right time, he looks at the athlete and saves the current position in a notebook he has. Over time he makes more and more of these measurements, and at the end of the day he can statistically calculate how fast his athlete was and what the average lap time was, right? So the first coach, Bob, was taking measurements constantly and is thus more accurate than the second coach, Smith. But Smith not only measures his athlete, he also finds time to talk with his wife on the phone.

Coming back to reality, we can say that Bob, our first coach, is a tracing profiler. A tracing profiler takes a measurement every time a certain event happens inside the language runtime, and these events are defined by the language runtime itself. This applies not only to Python, but also to PHP, Ruby, or any other language that supports profiler hooks. On the other hand, our second coach, Smith, is very similar to a sampling profiler, because a sampling profiler just samples the call stack of the application at specific intervals, saves this information somewhere, and aggregates it at the end. As can be seen from this analogy, there is a trade-off between the two. Bob, the first coach, can provide more measurements and more accurate data, whereas Smith will work less, which in computing terms means he will consume less CPU power.

So, as I mentioned, a tracing profiler is driven by profiler events, and for Python, for the time being, the runtime supports hooks for every function call, line, or opcode executed. This means you can take a measurement whenever a function call happens, a line has been executed, or even a single Python opcode has been executed.
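To make that concrete, here is a minimal sketch of a function-level tracing hook built only on the standard library's sys.setprofile (per-line and per-opcode hooks would go through sys.settrace instead); the bookkeeping is deliberately simplified and measures inclusive wall time:

```python
import sys
import time
from collections import defaultdict

# (filename, first line, function name) -> [accumulated wall time, call count]
stats = defaultdict(lambda: [0.0, 0])
_starts = {}

def profile_hook(frame, event, arg):
    # The runtime invokes this hook on function-level events: 'call', 'return',
    # 'c_call', 'c_return', 'c_exception'. Only Python-level calls are timed here.
    key = (frame.f_code.co_filename, frame.f_code.co_firstlineno, frame.f_code.co_name)
    if event == "call":
        _starts[id(frame)] = time.perf_counter()
    elif event == "return":
        start = _starts.pop(id(frame), None)
        if start is not None:
            stats[key][0] += time.perf_counter() - start
            stats[key][1] += 1

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(profile_hook)   # register the hook with the runtime
fib(20)
sys.setprofile(None)           # unregister it: zero overhead afterwards

for (_, _, name), (total, calls) in sorted(stats.items(), key=lambda item: -item[1][0]):
    print(f"{name:>8}  ncalls={calls:<6}  inclusive_wall={total:.6f}s")
```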
So the tools you use might change depending on what you would like to measure. Maybe you would like to optimize a single function that you know is heavily called, or you would like to see which functions are the most time consuming in your web application or your CLI script, and so on. And although I haven't seen an example myself, I'm informed that Instagram has used some form of opcode profiling in their production fleet, so there might be times when you'd like to see the opcode workload as well. So yeah, these are the hooks for the Python runtime.

Coming to sampling profilers: like I said, they take another approach in order to optimize for overhead. A sampling profiler directly looks at the call stacks of the threads of the application and accumulates this call stack information at specific intervals. Like Coach Smith's notebook from the previous analogy, it might seem inaccurate at first, but given enough time, statistically you will end up with an accurate view of what happened inside your application.

I also would like to talk a little bit about the difference between internal and external sampling profilers here, because I believe it's an important distinction. Internal means that the profiler runs in the context of your application, and external means that it runs as a separate process. This is an important distinction for me because when you run the profiler externally, it is safe to use: if the profiler itself has any bug in it, it is impossible for it to somehow crash your application. And you can add it to your container setup without changing a single line of code in your application logic, which is nice, since it gives you the benefit of separating the instrumentation tooling from the application logic, in a Dockerfile for example. So next time you use a sampling profiler, please pay attention to this distinction, as you might have different options there.

Okay. I hope we now have a high-level understanding of how these profilers work under the hood and what the trade-offs between the two are. But we haven't yet touched on what these profilers actually measure. They obviously measure time, right? But the thing is, there are different flavors of time when it comes to computing. The first one that comes to mind is wall time, which is like having a wristwatch; it's just like the analogy we had with the coaches. The second one is CPU time, which is the total amount of time your CPU actually spent on your code, including the kernel time. A CPU-intensive function, like a cryptographic function for example, will consume much, much more CPU than a simple function that reads a small text file. CPU time is usually calculated with the support of the CPU itself; it has some internal counters for this.

And then there is I/O time. Let's suppose you read a file in your code. In the simplest terms, your CPU will execute some instructions to request the read from the disk, and then it goes to sleep. It waits for the disk to send the requested data, and when the data is available, control returns to the application. I know this is a bit simplified. In this scenario, the whole time spent from requesting the read until the data is available is the wall time, and if you subtract the CPU time from the wall time, what you get is how much time your code actually waited, which is the I/O time.
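A rough way to see these flavors side by side using only the standard library: time.perf_counter() gives wall time, time.process_time() gives the CPU time of the current process, and their difference approximates how long the code was waiting (on I/O, sleep, or locks):

```python
import time
import urllib.request

def measure(label, fn):
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # Whatever the CPU did not spend on our code was mostly spent waiting.
    print(f"{label:<10} wall={wall:.3f}s  cpu={cpu:.3f}s  wait≈{wall - cpu:.3f}s")

# CPU-bound work: wall time and CPU time stay close together.
measure("cpu-bound", lambda: sum(i * i for i in range(2_000_000)))

# I/O-bound work: most of the wall time is spent waiting for the network.
measure("io-bound", lambda: urllib.request.urlopen("https://example.com", timeout=10).read())
```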
And I think I/O time is a very important metric for web applications, because a typical web application is usually I/O-bound and does lots of I/O. It would be very useful to see exactly where these I/O bottlenecks are and try to optimize them, maybe by caching in memory, using a faster disk medium, and so on.

There are also some more metrics that can be useful, and memory is one of them. Memory profilers aim to show you memory consumption and peaks, and where the memory is allocated. You can actually see the consumption and peak metrics with the tools of any operating system: you can open the process monitor on Windows, Activity Monitor on macOS, and on Linux you can use top. But those tools will not give you information on where the allocations are coming from; you cannot see inside your application, where in your code the memory is actually allocated. If you have ever debugged a memory leak before, you will know how frustrating it can be to identify the exact source of the leak, and these tools really shine on those occasions. tracemalloc, for example, comes in the standard library, from Python 3.4 and up if I remember correctly, and it traces all malloc, free, and realloc calls and saves a Python traceback along with each allocation. There is also the objgraph library, which uses the internal gc module to find out which objects are alive on the heap. You can easily see what kind of objects are alive before and after your request, maybe take a difference between them, and see what happened. So yeah, there are very different strategies for identifying memory leaks.

There are also application-specific metrics. An example of this type is the famous Django Debug Toolbar, which can profile Django-specific metrics, for example template render times and SQL queries. And Blackfire, the observability platform that I'm working on, can also provide lots of different application-specific metrics, like SQL queries, HTTP calls being made to different microservices, and many other metrics as well.

So we have touched on how profilers work under the hood and what kind of data we gather with them. Now comes the question of how these profilers present their information, and in short, there is no common, well-defined answer to this question. While different visualizations are used by different tools, I believe it's still an open problem, and I also believe that some visualizations might work better than others for some problems, so it really depends on personal preference and on the type of the problem itself. These are the common output formats that profilers usually use, and this list is not only relevant to Python. Sometimes they are called differently, or there are subtle differences between them, but the way they present information doesn't change much.

So let's start with the simplest one, the table. A table is just like its name implies: it shows you one caller/callee pair per row. For a function tracing profiler, for example, multiple calls get aggregated into a single row. Usually all profilers can output something similar to this: yappi, cProfile, py-spy, tracemalloc, Blackfire, all kinds of different profilers can support this kind of output.
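For instance, the standard library's cProfile and pstats produce exactly this kind of aggregated per-function table; in their output, the tottime column is the exclusive time and cumtime the inclusive time:

```python
import cProfile
import pstats

def child():
    return sum(i * i for i in range(200_000))

def parent():
    # parent's own (exclusive) time is tiny; its inclusive time contains child's work
    return [child() for _ in range(5)]

profiler = cProfile.Profile()
profiler.enable()
parent()
profiler.disable()

stats = pstats.Stats(profiler)
# 'tottime' = time spent in the function itself, 'cumtime' = function plus its callees
stats.strip_dirs().sort_stats("cumulative").print_stats(10)

# The dumped file can be converted to the callgrind format with a third-party tool
# such as pyprof2calltree and then opened in KCachegrind.
stats.dump_stats("parent.prof")
```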
And there are tools like KCachegrind, if you would like to view and filter this kind of output. Most profilers have some way to export to the callgrind format, maybe by using a separate tool, and here is a screenshot from KCachegrind. It's pretty easy to see the most called or most time-consuming functions with this visualization at hand, and it is probably the most used one, and probably the oldest. And here is a screenshot of a profile of a Django application made by Blackfire. All these table-view visualizations have some notion of exclusive and inclusive time: exclusive time is the time spent only in the function itself, whereas inclusive time is the time spent in that function and all of its children. In this view, for example, you can sort based on the wall-time metrics.

The second visualization I'd like to talk about is the call graph, which is probably the second most famous one. A call graph shows you the caller/callee pairs in a graph, so you can see your whole application's call relationships in one big graph. This visualization might help you dig deeper into a problem and identify the root cause, because you can navigate through the call relationships easily. And here is another example of a call graph from the Blackfire UI. Note that the call graph is not always very UI-friendly: there might be many, many calls, and when lots of functions are involved, it's hard to focus on the most interesting path. So I would suggest using it to debug performance issues further rather than to spot the issues themselves. At Blackfire, for example, we try to remove uninteresting calls, what we call noise, and we aid the call graph with a table view on the left, built from the profiler data, to highlight the critical path. But even in that case, I must admit there can be lots of noise, and the call graph might not be the perfect visualization for every case.

Another very common visualization was born out of this need to reduce the noise I mentioned, and it is, obviously, the flame graph. When you enter this area, you will see lots of tools using it. A flame graph is called a flame graph because it resembles flames, obviously, and there is also the icicle graph, which is the same data but upside down, because it resembles icicles. I have a friend who calls these icicle graphs "pit graphs" because they look like pits to him, so as you can see, there is a pattern of disagreement here. But even though we haven't agreed on a proper name for this visualization, the presentation and format are pretty consistent. If there is one format that is very close to being a standard in this area recently, it's the flame graph. It was invented by Netflix performance engineer Brendan Gregg, and it has been used extensively since then. You can see it in action in any modern browser for profiling JavaScript code, for example. While there are different flavors of flame graph, it is usually generated by aggregating call stack data over and over again, like we have seen with sampling profilers; we can even say it is the natural output of a sampling profiler. The x axis shows how many times a function was seen on the call stack, and the y axis shows the caller/callee relationships. With this visualization, it's very easy to spot the interesting functions at a single glance. Long, empty bars, for example, indicate lots of exclusive time for that function. And the colors, yeah, the colors are just random.
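As a toy illustration of how that aggregation happens, the sketch below samples the main thread's call stack with sys._current_frames() at a fixed interval and prints the "folded stacks" text format that flame graph tooling (Brendan Gregg's flamegraph.pl, or speedscope) can render; a real sampling profiler is of course far more careful about overhead and accuracy:

```python
import sys
import threading
import time
from collections import Counter

folded = Counter()  # "outer;inner" -> number of times that stack was observed

def sampler(thread_id, interval=0.005, duration=2.0):
    # Periodically look up the target thread's current frame and record its stack.
    deadline = time.time() + duration
    while time.time() < deadline:
        frame = sys._current_frames().get(thread_id)
        names = []
        while frame is not None:
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        if names:
            folded[";".join(reversed(names))] += 1
        time.sleep(interval)

def inner():
    return sum(i * i for i in range(100_000))

def outer():
    for _ in range(200):
        inner()

t = threading.Thread(target=sampler, args=(threading.main_thread().ident,))
t.start()
outer()
t.join()

# One "stack count" line per unique stack: the collapsed/folded flame graph input.
for stack, count in folded.most_common(10):
    print(stack, count)
```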
There is also one more thing I'd like to show you about flame graphs. In the previous flame graph example, what you saw was a final image, probably generated from thousands of call stacks sampled and merged together. But if you add the time dimension to the x axis while retaining that information, what you end up with is a view of which function your code was executing at which point in time, which is similar to the one we see here. This is heavily used in modern browsers' profiling toolsets. At Blackfire, we also use a flame graph with a time dimension, which we call the timeline; as you can see, it's pretty similar to the one in Chrome. And by the way, the blue background here shows the peak memory usage, which might help in situations where your memory peaks too high.

Okay, let's look over some examples in the ecosystem to understand what these tools provide for us. I'm not in any sense claiming these are the best tools; I just tried to show you most of the features available across these different types of profilers.

The first one I'd like to show you is line_profiler. line_profiler is a line tracing profiler: it measures per-line wall time information, and you can inspect its output from the command line. As far as I know, it's heavily used in the data science world, but you might know better than me.

Scalene is a newer sampling profiler that again profiles per-line information, but it measures per-line wall time, CPU time, and memory consumption simultaneously, and you can also inspect its output from the command line, as seen here.

Yappi is another tracing profiler, a function tracing profiler. It can measure per-function wall and CPU time at the same time, you can profile multithreaded and async applications, and it shows per-thread trace information. It currently supports profiling asyncio and gevent applications.

And py-spy is an external sampling profiler, which we talked about before. That means you can attach it to any running Python application and dump traces on the fly with a top-like command-line interface. And if your application is multithreaded, it can also show you how much time your application waited on the GIL, which is a unique feature. If you have ever wondered whether multiple threads are suffering from GIL contention, you can use py-spy to detect this easily. It can output to flame graph or speedscope formats. By the way, we haven't talked about speedscope: it is an open format that is very similar to, or we can just say it's the same as, the flame graph with a time dimension that we mentioned earlier. And it has a JavaScript application, which means it works in a browser. So you can profile with a profiler that supports speedscope, then throw the output into the speedscope application and see the traces in your web browser. Here is an example screenshot of the speedscope format.

And here is a flame graph generated by py-spy on the left side, while the right side is actually the interactive view, where you can directly see the code executing in the active thread. When you attach to a running process, you will see this get updated pretty quickly with different functions during execution. If possible, py-spy will also try to retrieve the C extension call stack as well. I said "if possible" because it depends on whether the extension was compiled with debug symbols or any kind of stack frame information.
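To make the tracing-profiler entries in this list concrete as well, here is roughly what driving yappi from code looks like, based on its documented API (clock type selection, per-function and per-thread stats, and a callgrind export for KCachegrind):

```python
import threading
import time
import yappi  # third-party: pip install yappi

def worker(n):
    total = 0
    for i in range(n):
        total += i * i
    time.sleep(0.1)  # waiting shows up in wall time but not in CPU time
    return total

yappi.set_clock_type("wall")   # or "cpu" to measure CPU time instead
yappi.start()

threads = [threading.Thread(target=worker, args=(500_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

yappi.stop()
yappi.get_func_stats().print_all()      # per-function table (ncall, tsub, ttot, tavg)
yappi.get_thread_stats().print_all()    # per-thread totals
yappi.get_func_stats().save("profile.callgrind", type="callgrind")
```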
I also would like to mention tracemalloc. tracemalloc is a tracing memory profiler. You might have heard of it, or maybe even used it, because it has been in the standard library for a while now. It basically traces all the low-level memory functions, malloc, realloc, calloc, free, those C runtime functions, and it saves a Python traceback object along with each allocation for later analysis. It can be a very valuable tool for analyzing which parts of your application are consuming, or maybe leaking, memory, because at the end of the day you will see how many bytes get allocated per line, which is very valuable information. However, it needs to be enabled explicitly in your code, and it will add some overhead to the application, since it basically hooks every memory operation. Even adding a key to a simple dictionary will probably call malloc and go through the tracemalloc hooks. So if you want to use it in production, you are warned: you should be cautious about the overhead and calculate it beforehand.

And Blackfire. Blackfire is an on-demand tracing profiler that can measure wall, I/O, CPU, and memory metrics simultaneously, which is a bit different from a normal tracing profiler. In an HTTP profiling scenario, for example, it is only enabled when a specific HTTP header is present, and when it is done, it unregisters the profiler hooks, so you have zero overhead for the other requests. So we can say it is safe to use in production, as it only affects the profiled request itself. You can also run it on standalone CLI scripts, but in that case it will be enabled during the whole execution. It is also able to profile some application-specific metrics, like SQL queries, cache calls to common cache frameworks, HTTP requests being made to different microservices, some template- and ORM-related metrics, and so on. We basically support most of the visualizations we have seen so far, so you can switch between different visualizations of the same profile and analyze the problem at hand from different angles.

And even though I said I would not give any general performance tips, I will make an exception for these. If there is one thing to take away from this presentation, it's this: always measure and, if possible, always monitor for performance from development to production. Having this throughout the whole development life cycle will also help you diagnose issues earlier.

The second thing I would say is to focus first on the architecture, design, and algorithms being used. When we first see a performance problem, there is a pattern I have identified among the people I have worked with, myself included: we tend to hurry to fix the issue right away without paying too much attention. However, maybe there is a fundamental design problem behind a seemingly subtle performance issue. I have seen this many times before: some wise refactoring decisions were made after analyzing the patterns behind certain performance problems. So, in my opinion, these issues might help you design better systems if you give them enough attention and listen to them.

The third one, to think more broadly, is that every application accesses some kind of data, and how an application accesses its data is very specific to that application. For example, you should know that searching for an item in a set or a dict is very fast; it happens in constant time. But if you are using lists internally and constantly removing items from and inserting items into them, it might not scale very well in the end. For Django code, this concept applies to models as well: for example, you should avoid referencing a foreign key in a loop without a prefetch, because it might cause an N+1 query problem, or you should avoid retrieving too many objects at a time from the database, and so on. You know it: always know your data and how you access it.
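As a schematic Django illustration, with hypothetical Book and Author models and not runnable on its own, the N+1 pattern and its prefetched counterparts look roughly like this:

```python
# Hypothetical models: Book has a ForeignKey to Author (no custom related_name).

# N+1 pattern: one query for the books, plus one extra query per book
# the moment book.author is touched inside the loop.
for book in Book.objects.all():
    print(book.title, book.author.name)

# select_related() follows the foreign key with a JOIN, so it is a single query.
for book in Book.objects.select_related("author"):
    print(book.title, book.author.name)

# For many-to-many or reverse relations, prefetch_related() fetches the related
# rows in one extra batched query instead of one query per object.
for author in Author.objects.prefetch_related("book_set"):
    print(author.name, [b.title for b in author.book_set.all()])
```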
And the fourth thing is: check whether the standard library already has a function for what you need. Again, it's so important. The Python standard library is very, very powerful; it probably has a function that does what you want. So if you have a function that seems to do lots of stuff that is not related to your business logic, it's highly probable that there is a built-in that does what you want. So just read the Python docs sometimes in your free time, and make sure you know most of them. They will probably be highly optimized; not probably, they are very highly optimized. There are real gems, modules like itertools, collections, inspect, optparse, glob, struct, difflib, and many more.

And I have said this before, but please just avoid micro-optimizations. A single dict access is pretty fast, usually. Python is a very high-level language, and most of the time you won't see any benefit from these kinds of small micro-optimizations.

And the next thing is, if you really need performance, then there are tools: Cython, Numba, or other third-party libraries, or you can write code directly in C. With Cython, you can easily write C extensions without dealing with C itself. Numba is also worth mentioning: with a simple function decorator, you can optimize your CPU-intensive functions. But again, you need to measure it. Numba JIT-compiles your code, and not every function can benefit from this kind of optimization. Usually heavy numeric and scientific calculations are a perfect fit, but a typical web application that accesses a database and a cache, or other I/O-bound applications, may not benefit from this kind of optimization.
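As a small sketch of the Numba case (assuming the third-party numba and numpy packages are installed), a tight numeric loop is exactly the kind of function the decorator helps with, while an I/O-bound view would gain nothing:

```python
import time
import numpy as np
from numba import njit  # third-party: pip install numba

@njit  # compiled to machine code on first call (nopython mode)
def sum_of_squares(values):
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total

data = np.random.rand(5_000_000)

sum_of_squares(data)  # warm-up call: includes the one-time compilation cost

start = time.perf_counter()
sum_of_squares(data)
print("numba  :", time.perf_counter() - start)

start = time.perf_counter()
sum(x * x for x in data)  # pure-Python equivalent, for comparison; measure, don't assume
print("python :", time.perf_counter() - start)
```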
And the last one is, you can always write a C extension. I have always believed that the C API of CPython is a bit underrated. It is a very well thought-out, consistent, and robust framework in itself, and the documentation is awesome. You can always fall back to this.

So that was all I wanted to tell you, actually. I hope you enjoyed it and learned something from it. If you would like to discuss anything performance-related, I would love to hear from you and discuss; I will be at the Blackfire booth. Thank you. Bye.
speaker 2: Onor, maybe you can...
speaker 3: ...answer. Michael, I'm sorry.
speaker 2: ...maybe you can answer...
speaker 1: ...Michael's question. Yes. Okay, I believe...
speaker 3: ...I understand which one you...
speaker 2: ...said in the chat box.
speaker 3: Yes, just a second.
speaker 2: The question is: is it possible to get a full trace while in production? I mean, it would affect your performance. And the first question was about what the difference is between monitoring and profiling in that context.
speaker 3: So it's a very good question. In Blackfire terms, they are pretty different. Profiling, for us, means that whenever you start the profiler, it instruments all your code and profiles everything happening inside a single HTTP request. Whereas when you enable the monitoring tool, the monitoring tool just monitors your HTTP requests. It's HTTP-specific: it monitors how much time is actually spent on a single request, then aggregates this data and shows it in a single dashboard. Whereas with profiling, you have much more data for a single request; it goes down to every single function call being executed. So that's the main difference between monitoring and profiling.
speaker 4: Sorry to interrupt. In my mind, profiling happens before deploying and monitoring after. For example, for my projects I use Django Debug Toolbar during development, because in order to do its work Django Debug Toolbar performs some operations that are not without cost. And in production I use monitoring tools, logs, and analytics to find the bottlenecks and try to find solutions. But I haven't heard of profiling in production. That's my question.
speaker 3: Yes, that's actually a very good question, and this is the traditional way of doing things. But there is a current trend in observability tools toward continuous profiling, including in production as well. In our case, we do the profiling, but we only do it for a specific HTTP request. In Blackfire, you just click a profile button; it sends a special HTTP header to the web server, and in that web server we have a special middleware automatically inserted. In that middleware, we enable the probe for that specific HTTP request and nothing more. So it adds close to zero overhead to the application, because it will only be profiling that specific request, not the other concurrent ones. So this is a newer approach, and there are some other approaches as well, like using sampling profilers at some level. But yeah, I believe we will see lots of stuff happening in this area, like continuous profiling in production as well.
speaker 4: Okay.
speaker 3: Thank you very much.
speaker 2: In addition, with Blackfire monitoring, it will take the top ten most impactful transactions recorded in the last day and will do a profile with real end-user traffic. So only for these top ten transactions, once a day, Blackfire monitoring will generate a real end-user profile, so that you can have real insight into what's going on with these very transactions. That complements what you're getting with the monitoring itself, but it will definitely not be on all transactions, all HTTP requests; that would cost way too much.
speaker 4: Okay. Okay.
speaker 2: Thank you very much.
speaker 3: Are there any other questions? So I believe that's it. Thank you very much for your attention, and thanks for the questions. I hope you enjoyed the slides and learned something. Thank you very much.
speaker 2: Bye.