Photo by Joey Kyber on Unsplash

Vaex: Pandas on steroids

How I analyzed over a billion rows of data on my MacBook locally, within seconds.

Mithil Oswal
9 min read · Aug 22, 2023


Current challenges with Pandas

  • Memory Efficiency: One of the most significant challenges with pandas is its memory consumption. Pandas loads the entire dataset into memory, which limits its capability to handle large datasets. As the dataset size increases, pandas may struggle or even fail to process the data due to memory limitations.
  • Performance and Speed of Computation: Pandas’ performance tends to degrade as the dataset grows. Because of its in-memory processing model, complex operations can become noticeably slow on large datasets.
  • Handling Big Data: Pandas has limitations when it comes to handling big data, as it is primarily designed for datasets that fit comfortably into memory. When dealing with datasets beyond the memory capacity, users often have to resort to data sampling or downsizing, leading to potential data loss and compromising analysis accuracy.
  • Scalability: Pandas is not designed to scale to multi-core CPUs or distributed computing environments seamlessly.

Eh, so just another Python library for data analysis?

No.

So what’s Vaex?


Vaex is a Python library designed to handle large datasets efficiently and perform fast data manipulation, exploration, and visualization. The name “vaex” stands for “Visualization and eXploration of large datasets.”

The key feature of Vaex is its ability to work with massive datasets that are too large to fit into memory. It achieves this by using memory mapping and lazy evaluation, allowing it to process data in chunks without loading the entire dataset into RAM.

However, it’s important to note that Vaex has some limitations. Not all operations are supported due to its lazy evaluation strategy. Some complex calculations or custom functions might require specific implementations to work correctly with Vaex.

Here’s how Vaex works:

  • Memory mapping: Instead of reading the entire dataset into memory, Vaex maps the dataset to memory. This memory-mapping technique allows Vaex to efficiently access data on disk as if it were in RAM, reducing the memory footprint.
  • Lazy evaluation: Vaex adopts a lazy evaluation approach. It means that when you perform operations on the dataset, such as filtering or calculating new columns, Vaex doesn’t immediately execute these operations. Instead, it creates an execution plan. The actual computation is deferred until necessary, for example, when you explicitly request the results or when plotting the data.
  • Expression system: Vaex uses an expression system, which allows you to create complex calculations and filters using familiar Python syntax without executing them immediately. These expressions are efficiently compiled and evaluated only when needed (see the short sketch after this list).
  • Parallelism: Vaex takes advantage of multi-core CPUs and parallel processing to accelerate computations when possible, further optimizing performance.
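To make the memory-mapping, lazy-evaluation and expression ideas concrete, here is a minimal sketch (the file name and column names are illustrative, assuming a large dataset that has already been converted to HDF5):

import vaex

# Memory-map an (illustrative) HDF5 file; nothing is loaded into RAM yet
dv = vaex.open('big_dataset.hdf5')

# Building expressions: no computation happens here
total = dv.col_a + dv.col_b          # an expression, not an array
subset = dv[dv.col_a > 0.5]          # a lazy filter (selection)

# Work is deferred until a concrete result is requested
print(total.mean())                  # triggers a chunked, parallel evaluation
print(subset.col_b.sum())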

Let’s use Vaex!

Installing Vaex

pip install vaex

Importing Vaex

import vaex

Converting CSV file into HDF5 format for improved efficiency

Now what’s an HDF5 file? 🤔

HDF5, which stands for Hierarchical Data Format version 5, is a versatile file format and data model designed for efficiently storing and managing large and complex datasets.

Its hierarchical structure, support for various data types, and features like compression and parallel I/O make it a powerful choice for scientific computing, data analysis, and applications requiring large-scale data management.

✅ Vaex works best with HDF5 files

dv = vaex.from_csv('your_path_to_csv_file.csv', convert=True)

The above line of code first converts the CSV file into the required HDF5 format, saving it at the same location with a .hdf5 extension, and then loads that file as a Vaex dataframe into the variable dv.
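Once that conversion has been done, subsequent runs can skip it entirely and memory-map the HDF5 file directly. A minimal sketch (assuming convert=True wrote the file next to the CSV with a .hdf5 suffix, which is its usual behaviour):

import vaex

# Open the previously converted file; memory mapping makes this near-instant
dv = vaex.open('your_path_to_csv_file.csv.hdf5')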

Now let us see the power of Vaex in comparison to our good old Pandas.

To compare Vaex & Pandas, we will create a large dummy dataset, and see how it performs using the two libraries.

First, let’s create our random dataset using Numpy —

import numpy as np

# 10M rows with 2 columns
random_numbers = np.random.rand(10000000, 2)

# Save the numpy array as a CSV file (of ~500 MB)
np.savetxt('random_numbers.csv', random_numbers, delimiter = ',', header = 'col_a,col_b', comments = '')

The above code snippet creates a random Numpy array of 10 Million rows and 2 columns; and then saves it as a CSV file of ~500MB in size.

This is what the array looks like -

Random Numpy Array

Speedtest Vaex

To understand how fast Vaex really is, let us see it in action.

Let’s load the above dataset in both Pandas and Vaex, and time it using %%timeit.
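For reference, the timed cells look roughly like this (a sketch; the Vaex line assumes the CSV has already been converted once, e.g. by vaex.from_csv('random_numbers.csv', convert=True), which typically leaves a random_numbers.csv.hdf5 file next to it):

import pandas as pd
import vaex

# %%timeit  (Pandas: parses the whole CSV and loads it into RAM)
df = pd.read_csv('random_numbers.csv')

# %%timeit  (Vaex: memory-maps the already-converted HDF5 file)
dv = vaex.open('random_numbers.csv.hdf5')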

Pandas vs. Vaex

For a dataframe of 10M rows & 2 columns —

Pandas took 2.6 seconds while Vaex took 26 milliseconds.

While it may not look like a huge difference now, let’s put it in perspective.

1 second = 1000 milliseconds

And 26 ms is 1% of 2.6 sec.

That makes it 100x faster!

So, assuming roughly linear scaling, loading a similar dataset of 1 billion rows would take Pandas around 4–5 minutes (2.6 s × 100), while Vaex would do it in about 3 seconds (26 ms × 100).

This is what the Vaex dataframe looks like -

Sample Vaex Dataframe

If a simple load operation is so fast, imagine how efficient other operations on large datasets can be, with Vaex.

Points to note:

  1. A built-in Pandas function or method may not have an exact equivalent in Vaex, so you may have to look for similar alternatives or workarounds.
  2. The performance advantage of Vaex over Pandas might not visibly stand out every time; however, in most cases Vaex will be faster.
  3. Since Vaex is designed for large datasets, granular data processing or manipulation at the row level will be slower compared to Pandas.

Let us look at another example: sorting the above dataframe by a column.

In Pandas, we have the sort_values() function to achieve this; in Vaex, the closest equivalent is sort().

Again, Vaex outperforms Pandas here.
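The calls being compared look roughly like this (a sketch, reusing the df and dv dataframes loaded above):

# %%timeit  (Pandas: physically sorts the 10M in-memory rows)
df_sorted = df.sort_values(by='col_a')

# %%timeit  (Vaex equivalent)
dv_sorted = dv.sort(by='col_a')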

Sorting a Dataframe

To really drive home the fact that Vaex is efficient, let’s look at a couple more operations —

To concatenate two dataframes, we’ll use the concat() function and observe its speed. We will simply concatenate our test dataframe with itself, resulting in a dataframe of 20 million rows.

Note: Concatenation of dataframes combines the dataframes vertically on rows.
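Roughly, the two calls being timed are the following (note that in Vaex, concat is a module-level function rather than a dataframe method):

import pandas as pd
import vaex

# %%timeit  (Pandas: copies both frames into a new 20M-row frame in memory)
df_cat = pd.concat([df, df], ignore_index=True)

# %%timeit  (Vaex: lazily stitches the two dataframes together)
dv_cat = vaex.concat([dv, dv])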

Concatenating a Dataframe

Pandas took an average of ~200 milliseconds; while Vaex took ~200 microseconds. And since 1 ms = 1000 µs:

Vaex becomes 1000x faster than Pandas!

Concatenated Dataframe

For our final speed-test example, we will consider a popular use-case of joining dataframes.

To join two dataframes, we’ll use the join() function and evaluate the processing speed. For this illustration, we will just join the test dataframe with itself without using any joining “keys”.

Note: Joining of dataframes combines the dataframes horizontally on columns.
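A sketch of the join calls (each dataframe is joined with itself, with suffixes to disambiguate the duplicated column names):

# %%timeit  (Pandas)
df_joined = df.join(df, lsuffix='_left', rsuffix='_right')

# %%timeit  (Vaex)
dv_joined = dv.join(dv, lsuffix='_left', rsuffix='_right')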

Joining a Dataframe

Pandas executed in ~200 milliseconds; while Vaex took ~600 microseconds.

Again, Vaex turns out ~300x faster than Pandas!

Joined Dataframe

Let’s be real

All the examples we saw above (where Vaex was up to 1000x faster) were lazy operations in themselves: the operations were never actually “realized”. One way to “realize” an operation is to save the final transformed / modified dataframe locally to a CSV file.

Doing this will give us a more realistic picture about the speed of Vaex.

Below, you will notice that both Pandas and Vaex perform almost the same when exporting the dataframe as a CSV file.

In fact, Vaex is slightly slower. This is expected and makes sense: since none of the data is in memory, Vaex has to actually execute the deferred operations, compute the final dataframe and then save it to disk. Pandas, on the other hand, already has the data in memory, so saving to disk is faster.
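A sketch of the export step that forces the deferred work to actually happen:

# %%timeit  (Pandas: the data is already in RAM, only disk I/O remains)
df.to_csv('pandas_df.csv', index=False)

# %%timeit  (Vaex: must first evaluate the pending operations, then write the result)
dv.export_csv('vaex_df.csv')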

Save Dataframe to disk

But in a real use-case, we don’t really ever perform these operations individually, do we? 🤷‍♂

So let us put them together in a function and see how it performs overall.

import pandas as pd
import vaex

def pandas_speedtest():
    df = pd.read_csv('random_numbers.csv')
    df1 = pd.concat([df, df], ignore_index=True)
    df2 = df1.join(df1, lsuffix='_left', rsuffix='_right')
    df3 = df2.sort_values(by='col_a_left')
    df3.to_csv('pandas_df.csv')

def vaex_speedtest():
    dv = vaex.from_csv('random_numbers.csv', convert=True)
    dv1 = vaex.concat([dv, dv])
    dv2 = dv1.join(dv1, lsuffix='_left', rsuffix='_right')
    dv3 = dv2.sort(by='col_a_left')
    dv3.export_csv('vaex_df.csv')

pandas_speedtest()
vaex_speedtest()
Speedtest functions containing several operations

Not so fast now, is it?

On testing the above functions, you’ll see that Pandas and Vaex perform almost the same. And why is that so? Because this end-to-end workflow may simply not be the best example to illustrate Vaex’s use-case.

So then when do we use Vaex?

👉 Vaex is perfect when we need to analyze, pre-process, or perform computations on large amounts of data that may not fit into the memory, and then downsize or aggregate it into a smaller manageable subset of data for further operations and processing.

Where does Vaex fail?

As mentioned above, row-level operations or data slicing will be much slower since the dataframe is not in memory.

Allow me to illustrate this with a very simple example: accessing an element in a dataframe.

You will notice that accessing a value in a Vaex dataframe is not as straightforward, and takes much longer compared to Pandas.
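As a rough sketch (the Vaex call shown is just one of several ways to pull out a single value, and the row index is illustrative):

# Pandas: direct positional access on the in-memory frame
value_pd = df.iloc[500_000, 0]

# Vaex: evaluate only that slice of the column, then index into the result
value_vx = dv.evaluate('col_a', i1=500_000, i2=500_001)[0]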

Accessing an Element

Another example where Vaex might not be the best choice is creating calculated columns in a dataframe.
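Adding a derived column looks almost identical in both libraries, although it behaves very differently under the hood (a sketch; col_sum is an illustrative name):

# Pandas: the new column is computed immediately and stored in RAM
df['col_sum'] = df['col_a'] + df['col_b']

# Vaex: this defines a virtual column (an expression); values are computed when accessed
dv['col_sum'] = dv['col_a'] + dv['col_b']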

Creating new columns in a dataframe

As you can clearly see above, Vaex was much slower in such an operation since the data is not in memory.

Photo by SpaceX on Unsplash

Advantages of using Vaex:

  • Memory Efficiency: Vaex is designed to work with datasets that are much larger than the available RAM. It leverages memory mapping and lazy evaluation, which significantly reduces memory consumption. By processing data in chunks and only loading the required portions into memory, Vaex efficiently handles massive datasets without running into memory-related issues.
  • Blazing-Fast Performance: One of the most significant advantages of Vaex is its high-performance data processing capabilities. By employing parallelism and optimized algorithms, Vaex accelerates computations and operations on large datasets.
  • Seamless Integration with Pandas: Vaex provides a pandas-like API, making it easy for users already familiar with pandas to adopt Vaex seamlessly.
  • Lazy Evaluation and Expression System: The lazy evaluation approach of Vaex postpones actual computations until necessary, reducing unnecessary calculations and optimizing data exploration. Additionally, the expressive expression system allows users to create complex calculations and filters using familiar Python syntax without executing them immediately.
  • Improved Scalability: Vaex is designed to scale efficiently to multi-core CPUs and even distributed computing environments. This scalability enables it to handle large-scale data processing tasks and take advantage of modern hardware, making it suitable for high-performance computing and big data analysis.
  • Advanced Visualization: Vaex provides visualization capabilities that work seamlessly with large datasets. It integrates with popular plotting libraries like Matplotlib and Plotly, allowing users to create insightful visualizations directly from Vaex DataFrames without the need for downsampling or data reduction.

Conclusion

To wrap up, Vaex is a game-changer for analyzing big data. It tackles large datasets by using memory smartly and performing super-fast calculations. Its ability to handle big data without losing accuracy sets it apart, and its speed is impressive. Vaex is friendly too, with an easy way to write complex calculations and to visualize data. Transitioning from pandas is a breeze, making it accessible. In the era of data overload, Vaex is your go-to strategy, making big data analysis fun and efficient, all while outperforming pandas in the big leagues.
