Processing Large Numeric Arrays in Python — Part I

Quantiacs · Published in Geek Culture · Mar 15, 2022 · 5 min read

In this article Dima explains how he worked with numpy, pandas, xarray, cython and numba to optimally implement operations on large numeric arrays on the Quantiacs platform.

Python is very popular among data scientists and it is widely used for processing data. As it is an interpreted language, it is not the best option for fast data processing: C, Java or any other compiled language is normally much faster.

If you want to reach acceptable performance with Python, you MUST use special libraries which give you access to the performance benefits of compiled languages.

In this article I will talk about numpy, pandas, xarray, cython and numba. I will show you how to use them properly and boost the performance by two orders of magnitude.

The source code of the example is available on github, where you can download the code and check the performance yourself.

The Task

The task is very simple. We start from daily market data for 2000 stocks covering the last 20 years. We want to compute an exponential moving average of the prices.

I will show the results here, so you can decide for yourself whether this article is interesting for you.

Does it sound interesting? I hope so. When I started this benchmark test, I used daily market data including prices (open, high, low, close), volume, dividends and splits:
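
The original table is not reproduced in this copy, but each stock is stored in its own CSV file, with one row per day. An illustrative layout (column names and values are made up purely to show the format):

    date,open,high,low,close,vol,divs,split
    2002-01-02,21.05,21.40,20.90,21.30,1250000,0.0,1.0
    2002-01-03,21.30,21.55,21.10,21.20,980000,0.0,1.0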

My goal was to compute an exponential moving average with the following constraints:

  • execution time has to be smaller than 10 seconds;
  • RAM memory consumption must not exceed 1.5 GB.

Benchmarking

Before starting, let me quickly describe how I measure execution time and RAM consumption.

For measuring the execution time and the peak memory I use:
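
The original snippet is not embedded in this copy; on Linux, GNU time reports both values (the script name is a placeholder):

    /usr/bin/time -v python load_data.py
    # "Elapsed (wall clock) time" is the execution time,
    # "Maximum resident set size" is the peak RAM in kilobytes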

In addition I use the timeout command to limit the execution time because sometimes processing can take hours:
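
For example, with an illustrative limit of 10 minutes:

    timeout 600 python load_data.py   # kill the process after 600 seconds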

For clearing the filesystem cache before loading the data I use:
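
This is the standard Linux incantation (it requires root privileges):

    sync                                          # flush dirty pages to disk
    echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page/dentry/inode caches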

After loading the data I measure the current memory consumption with memory_profiler:
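
A minimal sketch using memory_profiler's memory_usage function:

    from memory_profiler import memory_usage

    # sample the resident memory of the current process (in MiB)
    # for half a second and report the maximum
    mem = memory_usage(-1, interval=0.1, timeout=0.5)
    print("current memory: %.2f MiB" % max(mem))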

When I calculate the exponential moving average I use the time module to measure the execution time and I exclude the loading time:
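
A sketch of the pattern (load_data and calc_ema are placeholders for the actual functions):

    import time

    data = load_data()                 # loading time is excluded

    t0 = time.time()
    result = calc_ema(data, span=20)   # the computation under test
    print("calc time: %.2f s" % (time.time() - t0))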

I generate the data with a simple script producing 2000 time series; their total size is about 0.5 GB.
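
The generator itself is on github; a minimal sketch of the idea (random walks standing in for real prices, file names are illustrative):

    import numpy as np
    import pandas as pd

    dates = pd.date_range("2002-01-01", "2021-12-31", freq="B")
    rng = np.random.default_rng(0)

    for i in range(2000):
        # geometric random walk as a stand-in for real close prices
        close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates))))
        pd.DataFrame({
            "open": close * (1 + rng.normal(0, 0.002, len(dates))),
            "high": close * 1.01,
            "low": close * 0.99,
            "close": close,
            "vol": rng.integers(100_000, 10_000_000, len(dates)),
            "divs": 0.0,
            "split": 1.0,
        }, index=dates).to_csv("data/%04d.csv" % i, index_label="date")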

Loading the Data

First of all, we load the data. After trying different approaches, I will show you how reorganizing the data significantly reduces execution time and RAM consumption.

Load with pure Python (csv)

Loading the data with pure Python can be done with the following code:
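
The original embed is not shown in this copy; a rough equivalent using only the standard library:

    import csv
    import glob

    data = {}
    for path in glob.glob("data/*.csv"):
        with open(path) as f:
            reader = csv.reader(f)
            next(reader)  # skip the header
            # every value becomes a separate python object,
            # which is what makes this approach so expensive
            data[path] = [[row[0]] + [float(x) for x in row[1:]]
                          for row in reader]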

This implementation is very inefficient: the execution time is 1 m 38 s and the consumed memory is about 4 GB.

Load with Pandas (csv)

Let us use pandas as in the code which can be downloaded here:
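
In essence (paths as in the generator sketch above):

    import glob
    import pandas as pd

    data = {}
    for path in glob.glob("data/*.csv"):
        # numeric columns are stored as numpy arrays,
        # far more compact than lists of python floats
        data[path] = pd.read_csv(path, index_col="date", parse_dates=True)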

The execution time is 12 s and the consumed memory is about 1.4 GB. This is a huge improvement, but we can do better.

Pandas consumes much less memory than pure Python. The reason is that pandas internally uses numpy arrays, which are in turn based on C arrays. The latter are much more efficient for storing numbers than python lists. Note however that the data are still about 3 times larger in RAM than on the hard drive, because pandas creates a separate index for each file. The data can be reorganized to reduce the number of files.

Load with Pandas (csv, big files)

As each file contains the same columns, we can load all the data and save it in a reorganized form, grouped by column, with this script:
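
A sketch of the idea: build one wide table per column, with dates as rows and one column per stock, and write each to a single big CSV (paths are illustrative):

    import glob
    import pandas as pd

    frames = {path: pd.read_csv(path, index_col="date", parse_dates=True)
              for path in glob.glob("data/*.csv")}

    for col in ["open", "high", "low", "close", "vol", "divs", "split"]:
        # one file per column: a single shared date index
        # instead of 2000 separate ones
        wide = pd.DataFrame({path: df[col] for path, df in frames.items()})
        wide.to_csv("data_big/%s.csv" % col)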

The new files can be easily loaded:
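
For example:

    import pandas as pd

    data = {col: pd.read_csv("data_big/%s.csv" % col,
                             index_col=0, parse_dates=True)
            for col in ["open", "high", "low", "close",
                        "vol", "divs", "split"]}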

The execution time is now 8 s and the consumed memory is about 0.72 GB.

The performance can be considerably improved by switching from CSV (a text format) to a binary format (netcdf or pickle).

Load with xarray (netcdf, pickle)

Pandas works optimally with 2 dimensions. We use xarray, as it works natively with arbitrary dimensions, and join all data in a single file. xarray supports the netcdf binary file format (out of the box with scipy). We also test the pickle file format.

This script joins all the data to one file and saves it to netcdf and pickle:
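
A sketch of the idea, starting from the per-column DataFrames of the previous step (the dimension names here are my own choice):

    import pickle

    import pandas as pd
    import xarray as xr

    # stack the per-column 2d tables into one 3d array (field, date, stock)
    fields = xr.concat(
        [xr.DataArray(df.values, coords=[df.index, df.columns],
                      dims=["date", "stock"])
         for df in data.values()],
        dim=pd.Index(list(data.keys()), name="field"),
    )

    fields.to_netcdf("data.nc")          # netcdf, via scipy out of the box
    with open("data.pickle", "wb") as f:
        pickle.dump(fields, f)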

Then we can load the data with netcdf:
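
For example:

    import xarray as xr

    fields = xr.open_dataarray("data.nc")
    fields.load()   # pull everything into memory at once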

and pickle:
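
    import pickle

    with open("data.pickle", "rb") as f:
        fields = pickle.load(f)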

The results are similar. With netcdf the execution time is 1.7 s, the final RAM is 0.65 GB, the peak RAM is 1.2 GB.

Using pickle the execution time is 1.3 s, the final RAM is 0.65 GB, the peak RAM is 1.2 GB.

In both cases the execution time is smaller than 2 s. The peak memory is larger than with pandas (with a binary format, pandas would also be very efficient in terms of computing time). I prefer netcdf over pickle, since in my experience pickle files are more sensitive to the specific versions of other libraries.

Did you learn something new? Please feel free to leave a comment in the Quantiacs Forum! And do not miss Part II, where I will show you how to efficiently implement an Exponential Moving Average.


Quantiacs is building a crowdsourced quant fund and provides quants worldwide with free data, software and computational resources.