How to optimize your data analytics code for better performance

Mihail Yanchev · Published in Casual Inference · Feb 7, 2020

If you work as an analyst, data scientist, machine learning engineer or in a similar role, you have probably felt that you often reach a boundary where you are no longer just doing analytics: you have to dig into the inner workings of the scripting language or the methods you are using in order to achieve a certain goal. This usually means reading through endless package documentation, reading up on somewhat similar issues in other domains and researching alternative ways of doing things.

Performance is often one of those issues where you can allow yourself to be lazy while the data set at hand is small, but the moment the amount of data balloons you may need to rewrite everything from scratch. I recently had such an experience: the first phase of a project involved a very small dataset (several hundred observations) and it seemed fine to be lazy and not write the script with performance in mind from the get-go. A couple of months later I regretted it, when I had to rewrite almost everything in order to run the script on a much larger dataset (several tens of millions of observations) without waiting hours for some data transformation tasks to finish and without constantly putting my work on hold because of the lack of performance optimization.

While this is far from the first post on the topic, I will share some hard-learnt lessons from my own experience of tackling performance issues and of writing scripts with performance optimization in mind from the very start (unless you can afford to be lazy, which is never a good idea). I will also try to point you to some valuable resources for both R and Python on the topics covered here.

Data Filtering

This one is obvious, but I still feel obliged to mention it. If you are going to filter out some data somewhere along your data pipeline, do it at the earliest possible moment. That way every subsequent step runs on less data and you avoid running calculations on rows you are eventually going to throw away. This is probably the easiest way to squeeze a performance boost out of your script.
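As a minimal illustration in pandas (the file and column names here are invented), filtering first means every later step runs on the smaller frame:

import pandas as pd
# hypothetical example: keep only active customers before any heavy work
df = pd.read_csv("customers.csv")
df = df[df["status"] == "active"]
# every subsequent step now operates on the smaller, filtered frame
df["spend_per_visit"] = df["total_spend"] / df["n_visits"]
summary = df.groupby("region")["spend_per_visit"].mean()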

Vectorize!

Whether you work in R or Python, always try to vectorize your functions (it is no coincidence that Andrew Ng stresses this constantly in his Machine Learning class on Coursera). For example, avoid looping over rows unless it is absolutely necessary and unavoidable. Instead, try to write the function so that it processes a whole column of the data frame or data table at once.

In R and Python, adding two whole arrays or matrices element-wise with a single vectorized call takes barely longer than adding two scalars, because the per-element work happens in fast compiled code rather than in the interpreter. You can see how, depending on the size of the data set, not using vector operations might increase the calculation time enormously even for seemingly simple calculations.

A handy family of functions in R (which has its equivalents in Python's pandas as well) is the apply() family, which is designed exactly for applying a function over whole arrays instead of writing explicit loops. Often the data frames (or data tables) in R and Python themselves let you vectorize different calculations simply by subsetting the object in the right way. Have a look at this article dedicated to vectorization in Python for more information on the topic.
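As a minimal sketch in pandas/NumPy (with made-up columns), the vectorized version replaces the row-by-row loop with a single operation on whole columns:

import numpy as np
import pandas as pd
df = pd.DataFrame({"price": np.random.rand(1_000_000), "quantity": np.random.randint(1, 10, 1_000_000)})
# slow: explicit loop over rows
revenue_loop = [df.loc[i, "price"] * df.loc[i, "quantity"] for i in range(len(df))]
# fast: one vectorized operation over the whole columns
df["revenue"] = df["price"] * df["quantity"]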

One last takeaway for this section is the absolutely necessary vectorized conditional assignment:
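In Python this is np.where() (ifelse() does the same job in R); a minimal sketch with invented column names:

import numpy as np
import pandas as pd
df = pd.DataFrame({"income": [1200, 3400, 800, 5600]})
# vectorized if-else over the whole column, no explicit loop
df["segment"] = np.where(df["income"] > 2000, "high", "low")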

It lets you apply if-else logic to a whole array at once and is an absolute must in every data pipeline.

Batch Processing

This topic is closely related to the previous one. Batch processing can be defined as producing a "batch" of multiple items at once, one stage at a time. Now imagine that you need to run a transformation many times, and every time you need to search through the whole data set to determine the inputs. This can be very time consuming. One way to tackle it is to identify how often the inputs to the transformation are identical. For example, a group of observations may share identical input values (and the transformation returns identical output for all of them), so the group can be treated as one distinct batch. It may turn out that there are only a few unique combinations of input values, and therefore only a few distinct batches. That means you can vectorize the transformation and run it only as many times as there are distinct batches, which might suddenly reduce the number of times you need to search the data set from 100,000 to 100. You can see how this can be a huge performance boost; a sketch of the idea follows below.
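A minimal sketch of this idea in pandas (the inputs and the transformation are invented for illustration): run the expensive calculation once per distinct combination of inputs and merge the result back onto the full data set:

import pandas as pd
def expensive_transform(rate, category):
    # stand-in for a costly calculation that depends only on these two inputs
    return rate * (2.0 if category == "A" else 1.5)
df = pd.DataFrame({"rate": [0.1, 0.1, 0.2, 0.1], "category": ["A", "A", "B", "A"], "value": [10, 20, 30, 40]})
# one row per distinct batch of inputs instead of one call per row
batches = df[["rate", "category"]].drop_duplicates()
batches["multiplier"] = [expensive_transform(r, c) for r, c in zip(batches["rate"], batches["category"])]
# merge the batch-level result back and finish with a vectorized step
df = df.merge(batches, on=["rate", "category"], how="left")
df["adjusted_value"] = df["value"] * df["multiplier"]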

You will also realize that once you separate the task into batches you can even do the calculation in parallel. This is what the MapReduce programming model is designed to do: it splits the data into multiple batches, runs calculations on each batch in parallel and merges the output back together.

Use CPU parallel processing

This brings us to the next topic. Chances are that even your smartphone now has at least two cores, and most contemporary computers have more. Nowadays you can reap the benefits of parallel processing even on small-scale tasks and commodity hardware. The MapReduce model can be applied to many of the data transformations a data scientist does routinely. It is especially easy when the output of one process does not depend on the output of another (and even when it does, there are ways to tackle this). Imagine you want to create a new feature conditional on other features. You can split the dataset into batches, with the number of batches equal to the number of parallel processes you want to run, and then run the calculation on all batches simultaneously.

The multiprocessing package for Python can be easily used for such tasks.

See an example below with one of the ways you can use this powerful package for a simple task:

from multiprocessing import Pool
import numpy as np
import pandas as pd

def parallelize_dataframe(data_frame, func, n_workers, num_partitions):
    '''
    Split a pandas data frame into partitions and apply a function to each in parallel
    '''
    # split the data frame into num_partitions smaller frames
    data_frame_split = np.array_split(data_frame, num_partitions)
    # start a pool of n_workers worker processes
    pool = Pool(processes=int(n_workers))
    # apply func to every partition in parallel and glue the results back together
    data_frame = pd.concat(pool.map(func, data_frame_split))
    pool.close()
    pool.join()
    return data_frame

This function takes as arguments the data frame you want to split into batches, the function you want to apply to every batch, the number of workers (that is, concurrent processes) and the number of partitions you want to split the data frame into. It is used like this:

data_transformed = parallelize_dataframe(data_raw, transform_function, 2, 4)

This will split the data frame data_raw into 4 partitions and apply transform_function to them using two concurrent processes, or workers.
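One detail worth adding (not shown in the snippet above): on platforms that start worker processes with "spawn", such as Windows, the call needs to sit under an if __name__ == '__main__' guard, and the function you pass in has to be defined at module level so it can be pickled. A hypothetical transform_function and call could look like this:

def transform_function(chunk):
    # hypothetical per-batch transformation: add a derived column
    chunk["ratio"] = chunk["a"] / chunk["b"]
    return chunk
if __name__ == "__main__":
    data_raw = pd.DataFrame({"a": range(1, 101), "b": range(101, 201)})
    data_transformed = parallelize_dataframe(data_raw, transform_function, 2, 4)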

Check this article for an intro to parallel processing in R, if you want to perform similar tasks there.

CUDA

Why use CPU parallel processing when you can use GPU parallel processing? GPUs are very well suited to performing matrix algebra operations in parallel, and these underlie most of the data transformations and estimation methods in data science. Using GPU parallel processing is a bit trickier and less straightforward, but it can result in significant performance boosts. I have tried GPU parallel processing for estimating an XGBoost model; this post on NVIDIA's blog seems like a helpful starting point if you want to try it out. Training does indeed run several times faster on a high-performance GPU (e.g. the NVIDIA Tesla series), but keep in mind that an XGBoost model object estimated with the GPU build is not fully compatible with the CPU build.
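A minimal sketch of what switching XGBoost training to the GPU can look like; older releases use the tree_method="gpu_hist" parameter shown here, while 2.0+ replaces it with device="cuda", so check the documentation for your version (the training data is assumed to exist already):

import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)  # X_train / y_train assumed to exist
params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "tree_method": "gpu_hist",  # GPU-accelerated histogram algorithm (pre-2.0 API)
}
model = xgb.train(params, dtrain, num_boost_round=200)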

Use packages and functions meant to optimize performance

Sometimes packages are written precisely to replace older ways of doing things. The first example that comes to mind is the data.table package for R, which delivers very significant performance gains over the base data frame simply by being used. From importing the data (using the fread() function) to any data transformation you can think of, data.table is simply faster in every respect than the traditional data frame. You might occasionally run into older packages that are not meant to work with data.table, and in such cases you will be forced to work with data frames, but I strongly suggest relying on data.table as often as possible.

Another example in Python is the Dask package, built on top of pandas and designed to take advantage of parallel processing. I also wrote briefly about it in this post on data import in Python.
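As a minimal sketch (file name and columns are invented), Dask mirrors the pandas API but splits the data into partitions and evaluates lazily until you call compute():

import dask.dataframe as dd
# read the CSV into a partitioned, lazily evaluated data frame
ddf = dd.read_csv("big_file.csv")
# same syntax as pandas; the work runs in parallel across partitions
result = ddf.groupby("region")["sales"].mean().compute()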

Also, if you ever need to loop over the rows of a pandas data frame in Python, use iterrows() as described in this article (which is a very good resource on performance optimization for pandas) rather than writing an explicit for loop over row indices.
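For illustration (the column name is invented), iterrows() yields each row as an index/Series pair, which is cleaner and faster than indexing the frame row by row:

total = 0.0
for index, row in df.iterrows():
    # each row comes back as a Series indexed by column name
    total += row["price"]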

If you want to go deeper into Python, you can turn to Cython. It can again give you enormous performance boosts, as if you were running your code in a lower-level language like C++. For me this is still uncharted territory, mainly because I have not yet found a use case for it, but you can find some resources on the topic here.

There are many little hacks like these that you can find with sufficient research. Keep digging and you will keep finding small ways to make your code more performance-efficient.

Data import optimization

This is a topic I have written about specifically for Python; you can check the article here. There is a very similar article about R here. Since data import is usually the very first step of any data pipeline, it is important to be able to do it efficiently. If you are building an automated task which will run regularly, this becomes even more important.
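As a minimal sketch in pandas (file name, columns and dtypes are invented), reading only the columns you need and declaring their types up front already makes the import noticeably faster:

import pandas as pd
df = pd.read_csv(
    "transactions.csv",
    usecols=["customer_id", "amount", "date"],  # read only the columns you need
    dtype={"customer_id": "int32", "amount": "float32"},  # skip costly type inference
    parse_dates=["date"],
)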

Cache

The last item here is caching. Sometimes, especially while developing the data pipeline, you need to repeat computationally expensive steps many times. If you can store an intermediate result so that you do not have to recalculate it every time, it can save you quite some time. A typical case is a slow import: do not redo the import, but cache the data in a format that reads back faster and already records the data types. Here you can find a cool post by Ilia Zaitsev which compares different formats to consider for saving Python data. I believe some of these are available in R as well.
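A minimal sketch of this idea in pandas, assuming a parquet engine such as pyarrow is installed (the file names are placeholders):

import pandas as pd
# expensive first import from CSV, cached to a binary format that keeps the dtypes
df = pd.read_csv("huge_raw_extract.csv")
df.to_parquet("huge_raw_extract.parquet")
# later runs load the cached copy instead of re-parsing the CSV
df = pd.read_parquet("huge_raw_extract.parquet")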

With this I conclude the post. I hope it was helpful and, most importantly, that it will change the way you approach new tasks from now on.
