The Startup
Published in

The Startup

Simple Runtime Profiling of Batch Jobs in Production

So, you have a batch processing / ETL task that receives some data in a loop or per request and crunches it. It might even be running in production. All is working well but now you need to optimize it a bit in order to increase the rate of data you can handle.

In this post, I will introduce a simple yet effective approach to do so, which you can even run in production to measure the performance of real-world workloads. I will supply a short (single file, no dependencies) implementation for the profiler for both Kotlin and Python.

So why not use widely-available profiling tools like VisualVM / JProfiler / cProfile?

  • It is not easy to instrument those tools for running in a production environment. You may need to add extra dependencies, different bootstrapping code, etc..
  • Some of the profilers have a significant overhead on runtime since they measure every method and code line (even in sampling mode). You don't want to deploy such a profiled application to production as it will affect your processing rate significantly.
  • The profiling output is way too verbose, again, since every function or code-line is measured.
  • Most of the profilers provide a summary output at the end of the run. What if you want to log/print profiling results periodically while still running?
  • How do you deal with warm-up times and changes in workloads while the application Is running?
  • Some profilers use stack-trace sampling as a data-source, therefore they measure run times for whole functions only. This may force you to refactor code blocks to sub-functions.
  • Some of the available profiling tools are commercial.


The approach I suggest is very simple. Your task probably runs in a loop or on a per-call basis. Call the profiler’s start_section(section_name) before any code block you want to measure, and end_section() after the code block. The profiler will collect running-time statistics for this code block.

You may call report() to get a profiling summary like this:

We can see a line per section, measuring:
(1) The cumulative runtime of all code inside that section.
(2) The contribution (in %) of this section to the total runtime.
(3) Count of time this section was executed.
(4) Average runtime of 1000 executions of this section.
(5) Frequency/Rate — Count of times this section can be executed in 1 second.

Note that sections can overlap or be nested. In the example output above, parser section includes both split , drain and mask section.

The profiler also supports an optional enclosing section. This section (total above) typically covers a full single iteration/request and all other sections are sub-sections of it. It serves as a reference for what is 100% of run-time and by summing runtime of all nested sections, you may determine whether there are some other code-blocks which take significant run-time but are not enclosed in any section.

Dealing with warm-up time and changing workloads

Sometimes you want to exclude from profiling the first iterations of batch processing until everything is cached properly. There are also cases, in a long-running application, where workloads/request content changes over time, for example, if the work to do is dependent on a data-structure that is growing gradually.
In order to deal with such cases, the profiler supports an optional resetAfterSampleCount argument (0 by default), which will also calculate rates statistics for up to n last executions of each section. Those rates (runtime per 1000 executions and frequency), will be displayed in the report in parenthesis next to the cumulative global rates.


The printer argument of the SimpleProfiler class allows you to supply a function which may write report output to a destination other than the default stdout (for example, log)

SimpleProfiler also implements a general Profiler interface, so that you can replace it with NullProfiler, in order to disable profiling without removing profiling sections from your code, and avoiding if statements and null-checks.

Overhead — the profiler overhead is mainly dependant on the duration of a single workload cycle/iteration/request. The Kotlin profiler took about 0.8% overhead when a single cycle run took 0.1 ms, but 28.2% overhead when a single cycle run took 0.00036ms. In Python, overheads were 3.4% and 31.3% respectively. Conclusion — this profiling approach is less suitable for profiling micro workloads.

The profiler is not (yet) thread-safe. This means it is suitable for profiling tasks that run in a single thread only. I intend to add multithreading support soon, should not be a big deal...

Demo Use-Case: comparing hashing algorithms performance: Python vs. Java.

The following code snippets profiles calculation of md5, sha1 and sha256hashes for 10000-bytes array using the standard and Python’s hashlib. I also added calculation of square root for a random number.

In Kotlin:

In Python:

Output (Kotlin):

Output (Python):

And few insights from this little experiment:

  • Hashing in Python is about 2-times faster than in Java (well in Python it is actually C implementation)
  • sha256 is the slowest in Python and sha1 is the slowest in Java.
  • Initialization of hashing algorithms in Java is very slow.
  • sqrt and random are much slower in Python.
  • Generally, pure Python code runs slower than JVM code (we can learn that from the 0.09ms profiling overhead in Kotlin and 0.32ms profiling overhead in Python).

Profiler Implementation

Single file with less than 150 lines, no 3rd party dependencies.
It's an Apache v2 license, just copy the file to your project to use it.

GitHub repository with full code and samples. Star it if you like it :-)

— Happy profiling!




Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +756K followers.

Recommended from Medium

What Is The Difference Between Waterfall and Agile Project Management?

Raymarching Voxel Rendering

Mike Morris Completes AWS Cloud Practitioner Certification

Managing multiple Git users through Git hooks

GraphQL. Production experience

Discord Server for Business Owners/Entrepreneurs

Building a night light using an ESP8266 and an Ardafruit NeoPixel Ring


Reasons why it is time to upgrade to Magento 2

Reasons Why To Upgrade To Magento 2 (2021)

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
David Ohana

David Ohana

Software developer & research staff member @ IBM (Haifa Research Lab)

More from Medium

6 Java 17 Features I Didn’t Know About

JAVA 16 And 17: Evolutions and new features — Part 2

Toned-down Code Complexity Using Facade Design Pattern

Sportmode Facade DesignPattern CodeExample Java

Compound Pattern Series: Visitor and Interpreter (Part 1)