Simple Runtime Profiling of Batch Jobs in Production

David Ohana
Published in The Startup
Nov 2, 2020

So, you have a batch-processing / ETL task that receives some data in a loop or per request and crunches it. It might even be running in production. All is working well, but now you need to optimize it a bit in order to increase the rate of data you can handle.

In this post, I will introduce a simple yet effective approach for doing so, which you can even run in production to measure the performance of real-world workloads. I will supply a short (single file, no dependencies) implementation of the profiler in both Kotlin and Python.

So why not use widely available profiling tools like VisualVM / JProfiler / cProfile?

  • It is not easy to set those tools up in a production environment. You may need to add extra dependencies, different bootstrapping code, etc.
  • Some of the profilers add significant runtime overhead, since they measure every method and code line (even in sampling mode). You don't want to deploy such a profiled application to production, as it will affect your processing rate significantly.
  • The profiling output is far too verbose, again because every function or code line is measured.
  • Most of the profilers provide a summary output at the end of the run. What if you want to log/print profiling results periodically while still running?
  • How do you deal with warm-up times and changes in workloads while the application is running?
  • Some profilers use stack-trace sampling as a data source, and therefore measure run times only for whole functions. This may force you to refactor code blocks into sub-functions.
  • Some of the available profiling tools are commercial.

SimpleProfiler

The approach I suggest is very simple. Your task probably runs in a loop or on a per-call basis. Call the profiler’s start_section(section_name) before any code block you want to measure, and end_section() after the code block. The profiler will collect running-time statistics for this code block.
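To make this concrete, here is a minimal Python sketch of that flow. The method names follow the description above; the loop body and the module name simple_profiler are illustrative assumptions:

```python
from simple_profiler import SimpleProfiler  # module name is an assumption

profiler = SimpleProfiler()

for i in range(100_000):
    profiler.start_section("crunch")       # begin timing this code block
    total = sum(j * j for j in range(50))  # stand-in for the real per-item work
    profiler.end_section()                 # stop timing; stats accumulate under "crunch"

profiler.report()                          # print the collected statistics
```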

You may call report() to get a profiling summary like this:

We can see a line per section, measuring:
(1) The cumulative runtime of all code inside that section.
(2) The contribution (in %) of this section to the total runtime.
(3) The count of times this section was executed.
(4) The average runtime per 1000 executions of this section.
(5) Frequency/rate: the number of times this section can be executed in 1 second.

Note that sections can overlap or be nested. In the example output above, the parser section includes the split, drain and mask sections.

The profiler also supports an optional enclosing section. This section (total above) typically covers a full single iteration/request, and all other sections are sub-sections of it. It serves as the reference for what 100% of the runtime is; by summing the runtimes of all nested sections, you can determine whether there are other code blocks that take significant runtime but are not enclosed in any section.
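A sketch of how one iteration could be instrumented with an enclosing total section wrapping nested sub-sections; the input and the sub-steps are illustrative, and how the profiler is told which section is the enclosing one is not shown here:

```python
from simple_profiler import SimpleProfiler  # module name is an assumption

profiler = SimpleProfiler()
requests = ["a,b", "c,d"] * 50_000          # illustrative input

for request in requests:
    profiler.start_section("total")         # enclosing section: one full iteration

    profiler.start_section("parse")
    parts = request.split(",")              # illustrative sub-step
    profiler.end_section()

    profiler.start_section("process")
    joined = "-".join(parts)                # illustrative sub-step
    profiler.end_section()

    profiler.end_section()                  # closes "total"

profiler.report()
```

If parse and process together account for much less than 100% of total, some unmeasured code block is taking significant runtime.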

Dealing with warm-up time and changing workloads

Sometimes you want to exclude the first iterations of batch processing from profiling, until everything is cached properly. There are also cases, in a long-running application, where the workload/request content changes over time, for example when the work to do depends on a data structure that grows gradually.
To deal with such cases, the profiler supports an optional resetAfterSampleCount argument (0 by default), which additionally calculates rate statistics for up to the last n executions of each section. Those rates (runtime per 1000 executions and frequency) are displayed in the report in parentheses, next to the cumulative global rates.
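Assuming the argument is exposed through the constructor (the snake_case spelling below is my guess for the Python version), enabling it could look like this:

```python
from simple_profiler import SimpleProfiler  # module name is an assumption

# Keep windowed rate statistics for the last 10,000 executions of each
# section, in addition to the cumulative ones. The argument spelling here
# is an assumption; the post names it resetAfterSampleCount.
profiler = SimpleProfiler(reset_after_sample_count=10_000)
```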

Notes

The printer argument of the SimpleProfiler class allows you to supply a function that writes the report output to a destination other than the default stdout (for example, a log).
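For example, routing the report to the standard logging module might look like this (the printer keyword is taken from the description above; the rest is illustrative):

```python
import logging

from simple_profiler import SimpleProfiler  # module name is an assumption

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("profiling")

# Report output is handed to logger.info instead of being printed to stdout.
profiler = SimpleProfiler(printer=logger.info)
```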

SimpleProfiler also implements a general Profiler interface, so you can replace it with a NullProfiler in order to disable profiling without removing the profiling sections from your code, and without if statements and null checks.
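A sketch of that toggle, assuming both classes are importable from the same file; the config flag and the workload are illustrative:

```python
from simple_profiler import SimpleProfiler, NullProfiler  # module name is an assumption

ENABLE_PROFILING = False  # e.g. read from config or an environment variable

# Call sites stay identical either way; NullProfiler's methods are no-ops.
profiler = SimpleProfiler() if ENABLE_PROFILING else NullProfiler()

profiler.start_section("work")
result = sum(range(1000))  # stand-in workload
profiler.end_section()
```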

Overhead: the profiler's overhead depends mainly on the duration of a single workload cycle/iteration/request. The Kotlin profiler added about 0.8% overhead when a single cycle took 0.1 ms, but 28.2% overhead when a single cycle took 0.00036 ms. In Python, the overheads were 3.4% and 31.3% respectively. Conclusion: this profiling approach is less suitable for profiling micro-workloads.

The profiler is not (yet) thread-safe, so it is suitable only for profiling tasks that run in a single thread. I intend to add multithreading support soon; it should not be a big deal.

Demo Use-Case: comparing hashing algorithm performance, Python vs. Java

The following code snippets profile the calculation of md5, sha1 and sha256 hashes for a 10000-byte array, using the standard libraries java.security.MessageDigest and Python's hashlib. I also added the calculation of the square root of a random number.

In Kotlin:

In Python:
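A sketch of the Python side, reconstructed from the description above; the iteration count, the array contents, and the section names are my assumptions:

```python
import hashlib
import math
import random

from simple_profiler import SimpleProfiler  # module name is an assumption

profiler = SimpleProfiler()
data = bytes(10000)   # the 10000-byte array to hash (zero-filled here)

for _ in range(100_000):                   # iteration count is an assumption
    profiler.start_section("total")        # enclosing section for one iteration

    profiler.start_section("md5")
    hashlib.md5(data).digest()
    profiler.end_section()

    profiler.start_section("sha1")
    hashlib.sha1(data).digest()
    profiler.end_section()

    profiler.start_section("sha256")
    hashlib.sha256(data).digest()
    profiler.end_section()

    profiler.start_section("sqrt")
    math.sqrt(random.random())             # square root of a random number
    profiler.end_section()

    profiler.end_section()                 # closes "total"

profiler.report()
```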

Output (Kotlin):

Output (Python):

And a few insights from this little experiment:

  • Hashing in Python is about twice as fast as in Java (well, in Python it is actually a C implementation).
  • sha256 is the slowest in Python and sha1 is the slowest in Java.
  • Initialization of hashing algorithms in Java is very slow.
  • sqrt and random are much slower in Python.
  • Generally, pure Python code runs slower than JVM code (we can learn that from the 0.09ms profiling overhead in Kotlin and 0.32ms profiling overhead in Python).

Profiler Implementation

A single file with less than 150 lines and no 3rd-party dependencies.
It's under the Apache v2 license; just copy the file into your project to use it.

GitHub repository with full code and samples. Star it if you like it :-)

— Happy profiling!
