Don’t guess — profile!

Shay Amram
Data Science at Microsoft
8 min read · Dec 5, 2023

Programming is the foundation of all software, ranging from basic conditional statements to sophisticated Artificial Intelligence (AI) models. As data scientists, we invest a lot in improving our coding skills so that our code is both elegant and coherent. But what about efficiency?

Photo by Riccardo Annandale on Unsplash.

In today’s data-driven world, data scientists play a crucial role in helping organizations make informed decisions, with algorithms and AI models that push the limits of what is possible. Data scientists not only write the algorithms and models but are also expected to have their code run in production environments. For this purpose, development must include efficiency, and the first step toward it is profiling.

In this article I explore methods and tools to enhance the performance of Python algorithms through profiling, including:

  • Runtime performance
  • Real-life examples and analysis
  • Tips and tricks for optimizing code

One of the most common programming languages for data scientists today is Python. It is easy to learn, quick to develop in, and object-oriented. Its popularity has produced an ever-growing ecosystem of Python libraries and wrappers, including OpenCV, PySpark, PyTorch, and many others.

Because of this ecosystem, we’re starting to see more data science code, written in Python especially, making its way into production “as is.” Yet Python, being an interpreted language, still struggles with performance when compared to compiled languages. This article provides some powerful tools that help improve your code’s runtime and easily reveal bottlenecks.

Don’t guess — profile!

It is notoriously difficult to pinpoint the cause of slow code at runtime. The algorithm may be non-optimal, it may use the wrong data structure for the problem, I/O usage may be inefficient; the list goes on. In my experience, I’ve seen many developers attempt to optimize a piece of code based on educated guesses and intuition and get it wrong. But there is a better way: Don’t guess — profile!

Python has many tools for runtime and memory profiling. These tools can help you pinpoint the actual inefficiencies to specific lines of code or functions, whether the problem lies in your code or in a third-party module. Profiling can even reveal unoptimized code that has become a bottleneck as input size has increased unexpectedly with the passage of time.

cProfile

A Python library called cProfile [1] allows you to inspect how much time each function consumes, how many times it is called, and many other statistics. It can run with no changes, or minimal changes, to the code. cProfile is free and comes built in with base Python. There is no need to wrap code with inconvenient timers that must be removed later; cProfile does it for you automatically. And, to top it all off, it is very easy to use. A single command is enough:

(my_venv) C:\profiling> python -m cProfile -o baseline_profiling.out hists.py

In this example:

  • -m cProfile tells the interpreter to run the cProfile module on this script.
  • -o baseline_profiling.out writes the profiler’s output to an .out file.
  • hists.py is the script being profiled.

The resulting .out file can be viewed using the snakeviz utility.

Install snakeviz and use it to view the results:

(my_venv) C:\profiling> pip install snakeviz
(my_venv) C:\profiling> snakeviz baseline_profiling.out
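
For readers who want to try this end to end: hists.py above is the specific script from our codebase, but any script can take its place. A minimal, purely hypothetical stand-in might look like this (the function and its parameters below are invented for illustration):

# hists.py -- hypothetical stand-in for the script being profiled
import numpy as np

def build_histograms(n_arrays=200, size=100_000):
    """Generate random arrays and compute a histogram for each one."""
    rng = np.random.default_rng(seed=42)
    return [np.histogram(rng.normal(size=size), bins=50) for _ in range(n_arrays)]

if __name__ == "__main__":
    build_histograms()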

The main event: Understanding cProfile results

Here’s a real example of the snakeviz output for a Python script that was used in a production environment. This script was part of a more complex system, and a data scientist on our team suspected the existence of a significant bottleneck, but hunting it down in this legacy code module would have been time consuming without profiling.

So instead of guessing, we profiled:

Figure 1: Interactive profiling output. Each box can be selected to become the new root of the analysis, and each node references the actual line of code.

Figure 1 shows the initial output of the profiler. Each box can be selected — and, when selected, it becomes the new root of the analysis, which presents all the calls that occurred downstream of it. The left pane shows statistics for the box currently being hovered over (in this example, insights_objects.py). We can see the cumulative time, both absolute and relative, that this process took: 15.5 seconds or 63.39 percent of the entire runtime, out of a total runtime at the root of 24.4 seconds.

Figure 2: Drill-down into insights_objects.py, where copy.deepcopy() emerges as the most time-consuming call.

Figure 2 is a drill-down — it shows a more detailed view into anything that happens downstream of insights_objects. As with the prior example, hovering shows the cumulative runtime of each call. Note that in this example copy.deepcopy() is highlighted across multiple sections of the code. This is because snakeviz shows the cumulative time for this function. It is apparent that deepcopy() is the most time-consuming call (at 21.7 seconds, or 88.9 percent of the total). The statistics on the left pane refer to the entire run, regardless of the node.

In hindsight, it may look obvious: deepcopy() is known to carry a large overhead, and calling it many times can consume significant runtime. But when looking at such a large program, many other suspects surfaced as well. In fact, before we started, as an exercise I asked my team members to “guesstimate” the location of the bottleneck. The guesses were incorrect, which is not uncommon: Our intuition often leads us astray, causing us to optimize where it isn’t necessary before finding (as in this case) the true source of the problem.

Consider the amount of work that might otherwise have been involved in digging down to this specific line of code. Add the fact that the calls came from various sections of the code, and this apparently “obvious” bottleneck could have been very difficult to trace. Once found, though, the solution was trivial: Instead of using deepcopy, we implemented a dedicated __copy__ method, which ended up reducing runtime by almost 90 percent, all for the price of one hour of analysis and refactoring. This is the power of profiling at work: It saves time and reduces guesswork.
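
To make the fix concrete, here is a hypothetical sketch of the pattern (the real class is internal, so the Insight class and its attributes below are invented for illustration). Defining a __copy__ method lets copy.copy() duplicate only what each instance must own, instead of paying deepcopy’s recursive cost:

import copy

class Insight:
    """Hypothetical stand-in for the object that was being deep-copied."""
    def __init__(self, name, values, metadata):
        self.name = name          # immutable string, safe to share
        self.values = values      # mutable list that each copy must own
        self.metadata = metadata  # dict treated as read-only downstream

    def __copy__(self):
        # Invoked by copy.copy(); duplicates only the per-instance state.
        new = type(self).__new__(type(self))
        new.name = self.name
        new.values = list(self.values)  # own the mutable container
        new.metadata = self.metadata    # share the read-only dict
        return new

obj = Insight("cpu_spike", [1, 2, 3], {"source": "sensor"})
clone = copy.copy(obj)  # call sites switch from copy.deepcopy(obj)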

cProfile statistics

In addition to the useful UI supplied by snakeviz, the raw statistics are readily available below the icicle chart.

Figure 3: The statistics shown by cProfile/snakeviz.

Let’s look at these columns, as some of them may not be immediately obvious:

  • ncalls: the total number of calls. For recursive functions, two numbers appear as n/m, where n is the total number of calls and m is the number of primitive (non-recursive) calls.
  • tottime: the total time spent in a given function, excluding time spent in calls to sub-functions.
  • percall (left): the quotient of tottime divided by ncalls.
  • cumtime: the cumulative time spent in this function and all sub-functions, from invocation until exit. This figure is accurate even for recursive functions.
  • percall (right): the quotient of cumtime divided by primitive calls.

These statistics can be very useful for locating functions that may look efficient on their own but are called a huge number of times, turning them into a bottleneck, as in our example.
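
You don’t even need snakeviz to read these columns: the same statistics can be pulled straight from the .out file using the pstats module from base Python. For example, to list the functions with the highest call counts and cumulative times (using the baseline_profiling.out file generated earlier):

import pstats

stats = pstats.Stats("baseline_profiling.out")
stats.sort_stats("ncalls").print_stats(10)   # top 10 by number of calls
stats.sort_stats("cumtime").print_stats(10)  # top 10 by cumulative time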

Practical tips and tricks when profiling

Here’s a summary of some of what I’ve found to be the most useful:

  • Smallest input: Use the smallest input that’s still significantly larger than the overhead of your script.
  • Fast iterations: Profiling is typically an iterative cycle of profiling, fixing, and profiling again, so you may need to run it many, many times. Large input means slower cycles of trials and improvements, so keep the input as small as possible while still producing significant results, but no smaller.
  • Significant results: To get significant and reproducible results, it is a good idea to have enough input that the baseline takes at least 30 to 60 seconds of runtime. Less than that may make it difficult to reliably estimate bottlenecks or measure the improvement.
  • Multiprocessing and multithreading: Neither is handled well by cProfile. If your process uses either, it may be useful to try running it as a single process, and it is probably worth reading more about profiling these kinds of processes in Python (this might even be the subject of my next article).
  • Profile what you need: It is also possible to profile specific functions or areas of the code. More on that in the appendix below.
  • Don’t guess — profile! This is really the most important tip. Guessing may lead you astray and waste countless hours optimizing what is not a bottleneck. Whatever you’re working with, programming languages or databases, seek out and use the appropriate profiling tools.

Conclusion

In this article I described the motivations for profiling in Python as well as how to use cProfile and snakeviz and interpret the results. I shared a real-life practical example where profiling easily revealed a bottleneck that would have been very difficult to detect otherwise (and though each specific call was computationally cheap, the sheer number of them was huge and ended up clogging the entire process). The advanced reader may take advantage of the more detailed cProfile statistics, as well as profiling specific functions (see the appendix below). In addition, I included some of what I personally consider good practices when profiling.

But the most important takeaway from this article is its title. In many years in the industry, I’ve seen how often even experienced developers rely on intuition and “guesstimation” and end up over-optimizing areas in the code that are simply not the problem. This extends well beyond this specific example, so please: Don’t guess — profile!

Let me know if you liked this article or have any requests for future ones — I’m planning several, covering additional topics around runtime efficiency in Python.

Shay Amram is on LinkedIn.

Appendix: Profiling functions and specific calls

Another way to apply profiling easily is to do it for specific functions. The cProfile module is part of base Python, and it provides a Profile object that can be instantiated inside the code. This capability is useful if you already have some good candidates for the bottleneck and want to understand them better: What’s the per-call time? What’s the cumulative time? How many calls are being made? cProfile provides an easy way to find the answers:

import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable() # start profiling
my_function()
profiler.disable() # end profiling
stats = pstats.Stats(profiler).sort_stats('ncalls')
stats.print_stats()
# Or output to a file using
# stats.dump_stats("/workspace/profiled_function.out")

In this way, only my_function() (and anything it calls) is profiled. Once done, cProfile prints to the console an output that’s very similar to what is shown in Figure 3.

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   768793    0.128    0.000    0.130    0.000  {built-in method builtins.max}
   202267    0.028    0.000    0.030    0.000  {built-in method builtins.isinstance}
   194429    0.118    0.000    0.118    0.000  {built-in method numpy.array}
   ...

Alternatively, dump the stats to a profiler output file (the commented line in the snippet above) so that it can be viewed with:

> snakeviz /workspace/profiled_function.out
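
One more convenience worth knowing: On Python 3.8 and later, cProfile.Profile can also be used as a context manager, which removes the need for the explicit enable() and disable() calls:

import cProfile, pstats

# The context manager enables profiling on entry and disables it on exit.
with cProfile.Profile() as profiler:  # Python 3.8+
    my_function()

pstats.Stats(profiler).sort_stats('cumtime').print_stats()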

Useful link

[1] cProfile, in The Python Profilers (Python documentation): https://docs.python.org/3/library/profile.html
