
Data Science at Scale with Modin

The Intel Distribution of Modin in the Intel oneAPI AI Analytics Toolkit Enables Scalable Data Analytics

Areg Melik-Adamyan
Intel Analytics Software
3 min readMar 14, 2021


AI and data science are advancing rapidly, letting us handle ever more data and tackle increasingly complex workloads, like BERT. On the other hand, these advances are shifting the focus from value extraction to systems engineering. This trend pressures data scientists to become data or cloud systems engineers and to deal with infrastructure issues instead of focusing on the core of data science: generating insights. One cause of this shift is the lack of optimized data science and machine learning infrastructure for data scientists who are not necessarily software engineers. We know that data scientists are creatures of habit. They like the tools they're used to in the Python data stack, e.g., pandas, scikit-learn, NumPy, and PyTorch. However, these tools are often unsuited to parallel processing or terabytes of data. The Intel oneAPI AI Analytics Toolkit (AI Kit) aims to solve data scientists' most critical and central problem: how to make their familiar software stack and APIs scalable.

Today, we will talk about one of the main components of the AI Kit: the Intel Distribution of Modin. Modin is a performant library that is fully compatible with the pandas API. The only thing you need to do to accelerate your pandas workload is to change a single line of code: import modin.pandas as pd instead of import pandas as pd. Modin has three distinguishing characteristics that we are going to cover in a series of blogs:

  1. Parallelized pandas for high performance. As of v0.9, Modin supports 94% of the pandas API and is integrated with the Python ecosystem (e.g., NumPy, XGBoost, scikit-learn).
  2. The ability to run pandas workloads on different backends. Out of the box, Intel Distribution of Modin supports the OmniSci DB engine, a high-performance framework for end-to-end analytics that has been optimized for current and future Intel hardware, including GPUs.
  3. On-demand, practically infinite scalability to the cloud, right from your Jupyter notebook.

As an added bonus, Modin also has a rich frontend supporting SQL, a spreadsheet API, and Jupyter notebooks.
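Because Modin mirrors the pandas API, the import swap can even be hedged behind a try/except so the same script runs whether or not Modin is installed. A minimal sketch (the sample data is illustrative):

```python
# Use Modin when available; fall back to stock pandas otherwise.
# Everything after the import is identical either way.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

df = pd.DataFrame({"cab_type": ["green", "yellow", "green"],
                   "fare": [7.5, 12.0, 9.25]})

# Standard pandas operations work unchanged on a Modin dataframe.
mean_fares = df.groupby("cab_type")["fare"].mean()
print(mean_fares)
```

This fallback pattern is handy for sharing notebooks with collaborators who may not have Modin in their environment yet.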

Installation

The easiest way to get Modin is via conda, either as part of the AI Kit from Intel's Anaconda channel or standalone from the conda-forge channel.

Installing Modin from the AI Kit

The AI Kit provides a consolidated package of Intel’s latest deep and machine learning optimizations all in one place, with seamless interoperability and high performance. The toolkit includes Intel-optimized versions of machine learning frameworks and Python libraries along with Modin to streamline end-to-end data science and AI workflows on Intel architectures.

Intel Distribution of Modin is made available through the Conda package manager of the AI Kit:

conda create -n aikit-modin intel-aikit-modin -c intel -c conda-forge
conda activate aikit-modin

Installing Modin from the Stock conda-forge Channel

Alternatively, Modin can be installed from the conda-forge channel. If you use this method, you'll need to install OmniSci separately. (OmniSci is included when you install via the AI Kit.)

conda create -n stock-modin modin -c conda-forge

Modin Scalability

To showcase Modin’s scalability, perhaps its most important but least known capability, we’ll use the well-known NYC Taxi example. The data set can be downloaded here. This example is prepackaged and available here. The NYC Taxi benchmark consists of four workloads. We’ll use the first: a group-by query of the trips_data.csv data in a Modin dataframe.

import modin.pandas as pd
df = pd.read_csv('~/trips_data.csv')
df.groupby("cab_type").size()

This is equivalent to the SQL statement:

SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
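The equivalence is easy to check on a small in-memory sample. The sketch below uses stock pandas (Modin behaves identically here) and SQLite standing in for a SQL engine; the table name and data are illustrative:

```python
import sqlite3

import pandas as pd  # modin.pandas would work the same way here

trips = pd.DataFrame(
    {"cab_type": ["green", "yellow", "yellow", "green", "yellow"]}
)

# Dataframe side: group by cab type and count rows per group.
pandas_counts = trips.groupby("cab_type").size().to_dict()

# SQL side: run the equivalent query against an in-memory database.
con = sqlite3.connect(":memory:")
trips.to_sql("trips", con, index=False)
sql_counts = dict(con.execute(
    "SELECT cab_type, count(*) FROM trips GROUP BY cab_type"
))
con.close()

print(pandas_counts)  # {'green': 2, 'yellow': 3}
assert pandas_counts == sql_counts
```

Both paths produce the same per-group counts, which is exactly what the NYC Taxi group-by workload measures at scale.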

The Modin query is executed locally on your laptop, but what if more compute power is needed and the 1.5B records don’t fit into the local storage? In this case, Modin provides experimental remote cluster capabilities, e.g.:

import modin.pandas as pd
from modin.experimental.cloud import cluster
with cluster.create("aws", "aws_credentials"):
    df = pd.read_csv('s3://taxi_data/trips_data.csv')
    df.groupby("cab_type").size()

The with statement creates a remote execution context in the cloud, AWS in this case, with credentials provided by the user in aws_credentials.json. Modin automatically connects to AWS, spawns a cluster for distributed computation, provisions the Modin environment, and then remotely executes all the Modin statements within the with block. From the user's perspective, this all appears to be happening locally. Modin makes Calm Technology real.
