Exploring Large Datasets at Scale, Part 1

Ziyad Mestour
Published in CodeShake
3 min read, Feb 14, 2023

Have you ever worked with Pandas and found that it struggles with large datasets? If so, you’re in the right place.

This is the first part of a series on exploring data at scale. Fasten your seatbelt and enjoy the read!

Vaex is a Python library for EDA (Exploratory Data Analysis) with a DataFrame API similar to Pandas. Like Modin, it targets datasets that Pandas handles poorly. I came across it in 2019 when a colleague shared it with me.

Why use Vaex?

Vaex does something interesting: memory mapping. Instead of loading a dataset into RAM, it maps the file that already sits on your storage (SSD/HDD) and reads only the parts a computation needs. This is where it gets interesting: you can analyze a 1 TB dataset on a computer with 8 GB of RAM.
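To make the idea of memory mapping concrete, here is a minimal sketch that uses only the Python standard library. It is not Vaex code and not how Vaex is implemented internally; the file layout (raw float64 values) and the helper names `write_column` and `sum_slice` are assumptions made up for this illustration.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, n_rows):
    """Write n_rows float64 values (0.0, 1.0, 2.0, ...) as raw bytes."""
    with open(path, "wb") as f:
        array("d", range(n_rows)).tofile(f)


def sum_slice(path, start, stop):
    """Sum rows [start, stop) of the on-disk column via a memory map."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped, \
            memoryview(mapped) as raw, \
            raw.cast("d") as column:
        # Indexing the view reads straight from the mapped file; the OS
        # pages in only the regions that are actually touched.
        return sum(column[i] for i in range(start, stop))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_column(path, 100_000)
        print(sum_slice(path, 10, 20))
```

The point of the sketch is that `sum_slice` never reads the whole file into a Python object: it asks for ten values and the operating system fetches only the pages that contain them.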

Aside from that, Vaex can read data from Google Cloud Storage and AWS S3. The advantage is that you can analyze large datasets without provisioning a large notebook instance.

Other benefits include parallel computation across your CPU cores (Vaex uses multithreading), lazy evaluation and great visualizations. Pandas, on the other hand, mostly runs on a single core.

Last but not least, you can apply some Machine Learning during your exploration, which is very powerful: you can assess how usable your dataset is and check whether it looks promising for training a model!
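Lazy evaluation is easier to grasp with a toy example. The sketch below is plain Python, not Vaex: the `LazyColumn` class is hypothetical and only illustrates the principle that defining a derived column stores a recipe, while the actual work happens chunk by chunk when a result is requested.

```python
from array import array


class LazyColumn:
    """A column defined by a recipe; values are computed only on demand."""

    def __init__(self, source, recipe=None):
        self.source = source
        self.recipe = recipe or (lambda value: value)

    def map(self, func):
        # Compose a new recipe. No data is touched at this point.
        inner = self.recipe
        return LazyColumn(self.source, lambda value: func(inner(value)))

    def sum(self, chunk_size=1_000):
        # The recipe finally runs here, one chunk at a time, so the
        # derived column never has to exist in memory as a whole.
        total = 0.0
        for start in range(0, len(self.source), chunk_size):
            chunk = self.source[start:start + chunk_size]
            total += sum(self.recipe(value) for value in chunk)
        return total


if __name__ == "__main__":
    miles = LazyColumn(array("d", [1.0, 2.5, 4.0]))
    km = miles.map(lambda m: m * 1.609344)  # nothing is computed yet
    print(km.sum())                         # the computation happens here
```

The chunked loop is also what makes parallelism natural: independent chunks could be handed to different workers, although this sketch processes them sequentially.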

Vaex features

Vaex has 4 major components:

vaex-core: DataFrame and core algorithms; takes NumPy arrays as input columns.

vaex-hdf5: provides memory-mapped NumPy arrays to a DataFrame.

vaex-jupyter: interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.

vaex-ml: machine learning.

Vaex in action

Below are examples of reading a large dataset (~11 million rows) on a personal computer and applying basic filters to it. This would be a nightmare in Pandas.

11 million rows opened in a matter of microseconds

Here again, Vaex only points to the dataset without loading it into memory, which explains the blazing-fast read speed.

Performing some filters on the whole dataset
Cool visualization of taxi pickups in NYC

In this first part I’ve only scratched the surface of what can be done with Vaex. In the next tutorial I’ll go into more detail.

Limitations

In my opinion, the main limitation is that Vaex doesn’t cover all of the Pandas API. I think of Vaex as an exploration tool rather than a production one.

Another limitation is the vaex-ml API: for the moment it supports only a limited number of models compared to scikit-learn.

Conclusion

That’s it, folks, for this first part of the series on analyzing datasets at scale. Next time I’ll perform EDA on a dataset and train a Machine Learning model with Vaex, with complete code samples :)

Please don’t hesitate to like or comment; it encourages me to keep writing.

Ziyad Mestour, Data Engineer @ SFEIR | Machine Learning | NLP | Python