Spark-ifying Pandas: Databricks' Koalas with Google Colab

MA Raza, Ph.D.
Analytics Vidhya
1 min read · Apr 9, 2020

Working with big data has become the norm for data scientists. Apache Spark is one of the most widely used big data frameworks, while pandas is perhaps the most popular Python library among data scientists and is part of their everyday work. Although pandas is usually enough for typical analysis, its performance deteriorates when handling big data.

Modin is a framework that claims to make pandas faster; however, it does not solve the problem of handling big data. Databricks' Koalas is another alternative: it offers a pandas-like API while running on Spark infrastructure.

Running Koalas on Google Colab

In this post, you can learn how to get started using Koalas with Google Colab.
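
As a rough illustration of what "pandas-like" means in practice, here is a minimal sketch rather than the exact setup from the full walkthrough. It assumes the koalas package has been installed in the notebook (for example with pip install koalas) alongside a compatible PySpark and Java runtime. The summarize helper and the toy city/sales data are hypothetical and exist only for this example. The same function runs against plain pandas and, if the import succeeds, against Koalas.

```python
import pandas as pd


def summarize(df_lib):
    """Build a tiny frame with a pandas-like module and total sales per city.

    `df_lib` is either the pandas module or the Koalas module; both expose
    DataFrame and groupby with the same call shape.
    """
    df = df_lib.DataFrame(
        {
            "city": ["Lahore", "Karachi", "Lahore", "Karachi"],  # toy data
            "sales": [10, 20, 30, 40],
        }
    )
    totals = df.groupby("city")["sales"].sum()
    # Koalas objects live on Spark; to_pandas() collects them to the driver.
    # Plain pandas objects have no such method, so this branch is skipped.
    if hasattr(totals, "to_pandas"):
        totals = totals.to_pandas()
    return totals.sort_index()


# Baseline: ordinary single-machine pandas.
pandas_totals = summarize(pd)
print(pandas_totals)

# Same code on Spark via Koalas, only if the package is available.
try:
    import databricks.koalas as ks  # assumption: koalas, PySpark and Java are installed
except ImportError:
    ks = None

if ks is not None:
    koalas_totals = summarize(ks)
    print(koalas_totals)
```

The only Koalas-specific step in the sketch is the to_pandas() call, which brings the Spark-backed result back into local memory; the DataFrame construction and the groupby are written exactly as they would be for pandas.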

MA Raza, Ph.D.
Analytics Vidhya

Data Scientist | ML Engineer | Digital Business Transformation