Bye Pandas, Meet Koalas: Pandas APIs on Apache Spark (Ep. 4)

Keynote summary from the official announcement of Koalas by Reynold Xin at Spark + AI Summit

Disclaimer: This is my personal blog, so anything I post, share, or comment on does not reflect the views of my employer. This article is part of my Databricks series.

Hello everyone, I am delighted to hear from Databricks that they are making progress on Koalas: pandas APIs on Apache Spark. Koalas makes data scientists more productive when interacting with big data by augmenting Apache Spark's Python DataFrame API to be compatible with pandas. This is incredibly exciting news for Python developers and data scientists out there! You can find the full keynote in the video below.

Jump to 8:00 — Reynold's update on easy-to-use APIs, in keeping with Apache Spark's design principles.

Jump to 14:30 — Brooke’s Koalas demo on the sample data in Databricks.

Keynote Slides

Apache Spark 3.0 — What to expect later this year? #ApacheSpark3.0 #Databricks

  • Project Hydrogen: Accelerator-Aware Scheduling
  • Spark Graph
  • Data Source APIs
  • Adaptive Execution
  • Spark on Kubernetes
  • Vectorization in SparkR
  • Hadoop 3.x
  • Scala 2.12
  • ANSI-SQL Parser
  • Join Hints

In future posts in my Databricks series, I will have an opportunity to cover these topics in further detail. I would like to keep this one short and sweet.

In 2019, Spark can run on the following cluster managers (a minimal configuration sketch follows this list):

  • Apache Mesos — “Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments more efficiently. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.” (Open source datacenter computing with Apache Mesos, 2014)
  • Spark Standalone Mode — with the standalone deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MapReduce. The user can then run arbitrary Spark jobs on their HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
  • Hadoop YARN deployment — Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark into their Hadoop stack and take advantage of the full power of Spark, as well as of the other components running on top of it.
  • Kubernetes — the rise of Kubernetes, which is widely used for managing containerized environments, calls for community support for Kubernetes APIs within Spark 3.x.
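To make these options concrete, here is a minimal PySpark sketch (my own illustration, not from the keynote; the host names and ports are placeholders) showing that the cluster manager is selected simply by the master URL passed to the SparkSession builder:

```python
from pyspark.sql import SparkSession

# The cluster manager is chosen purely by the master URL.
# Host names and ports below are placeholders, not real endpoints.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    # Pick ONE master, depending on your deployment:
    # .master("spark://master-host:7077")          # Spark standalone mode
    # .master("mesos://mesos-host:5050")           # Apache Mesos
    # .master("yarn")                              # Hadoop YARN
    # .master("k8s://https://k8s-apiserver:443")   # Kubernetes
    .master("local[*]")  # local fallback so the sketch runs anywhere
    .getOrCreate()
)

print(spark.sparkContext.master)
spark.stop()
```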

My journey as a data scientist using Python as my main language is similar to the following slide, which lays out the split between pandas, taught in MOOCs, tutorials, books, and universities for analyzing small datasets, and Spark DataFrames for analyzing large datasets. The key problem with pandas is that it does not parallelize jobs for you; it runs on a single thread.

So, how awesome would it be if Apache Spark had a pandas API that allows for parallelization?

Behold, data scientists!

Yes! You heard that right. Koalas is coming to town!


Hands-On with Koalas

First of all, go to the Python Package Index (PyPI) and download the Koalas library file, which is distributed as a Python wheel; a short usage sketch follows the upload steps below.
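If you are working outside Databricks, the same package can also be installed directly from PyPI with `pip install koalas`, so you can experiment locally before attaching it to a cluster.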

Upload a Python Wheel

  1. In the Library Source button list, select Upload.
  2. Select Python Whl.
  3. Optionally enter a library name.
  4. Drag your Whl to the drop box or click the drop box and navigate to a file. The file is uploaded to dbfs:/FileStore/jars.
  5. Click Create. The library status screen displays.
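Once the library is attached to your cluster, you can give Koalas a try. Below is a minimal sketch of the pandas-style API (my own toy data, not the sample dataset from Brooke's demo, and assuming a recent Koalas release):

```python
import pandas as pd
import databricks.koalas as ks

# Create a Koalas DataFrame directly, exactly as you would in pandas...
kdf = ks.DataFrame({
    "animal": ["koala", "panda", "koala"],
    "weight_kg": [12.0, 100.0, 9.5],
})

# ...or convert an existing pandas DataFrame.
pdf = pd.DataFrame({"x": range(5)})
kdf2 = ks.from_pandas(pdf)

# Familiar pandas-style operations, executed by Spark under the hood.
print(kdf.groupby("animal").mean())
print(kdf2.head(3))
```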

Final Note

Databricks is building Koalas to fit the current stage of technological advancement, much of which is built on pandas. All we need is the library mapping from pandas to Koalas: the same code is then replicated in Spark, and the job becomes much more computationally efficient, as the sketch below illustrates. I am excited about what's to come for the future of Databricks!
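To make the "library mapping" point concrete, here is the kind of one-line change Koalas is aiming for (a sketch with a hypothetical CSV path and column name, not code from the announcement):

```python
# pandas version (single machine, single thread):
#     import pandas as pd
#     df = pd.read_csv("/data/events.csv")       # hypothetical path

# Koalas version: the same code with only the import swapped,
# now running distributed on Spark.
import databricks.koalas as pd  # drop-in stand-in for pandas

df = pd.read_csv("/data/events.csv")             # hypothetical path
print(df.groupby("user_id").count())             # hypothetical column
```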


Useful Documentation

Read the latest developments on Koalas, and see the documentation on running Spark with each cluster manager:

  • Spark x Mesos
  • Spark x Hadoop
  • Spark x Kubernetes

Written by Korkrid Kyle Akepanidtaworn, Cloud Solution Architect (Data & AI) at Microsoft and former Data Scientist at Accenture Applied Intelligence.
