[ABC] Koalas

A gentle introduction to the Koalas framework

tommy_agn
Data Reply IT | DataTech
6 min read · Dec 2, 2020


In this article, you will find a brief introduction to Koalas, a Python framework for data scientists and data engineers. It contains a general description of the project, a short summary of the core functionalities of the framework and some conclusions based on the author’s experience.

Project description

“The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.”

This project statement can be found in the very first line of the readme file of the Koalas GitHub repository and sends a very clear message: the project aims to give all Pandas users direct access to the power of a distributed framework such as Apache Spark. Differing from both Pandas and Spark, this new open source project benefits from:

  • A quicker development lifecycle: the first beta version was published on GitHub in March 2019, which led to version 1.4 in November 2020;
  • The inheritance of all the latest fixes and new functionalities of Spark releases.

The development cycle is focused on the Pandas API implementation, with the current function coverage at over 80% across the DataFrame, Series and plotting APIs, each keeping the same syntax and behaviour found in the original Pandas. As for the remaining 20%, if a function can be implemented in a distributed framework, it is probably already on the implementation roadmap and should become available to all developers soon.

Koalas API coverage as of Koalas 1.0

Partial compatibility with the PySpark API is guaranteed too:

  • a function with a behaviour clearly identical to its Pandas counterpart is coded as an alias;
  • other functions, which are required for managing the distributed behaviour of Koalas (like the cache function), are implemented;
  • data transformations that are available only in Spark may not be implemented.

If we are really unlucky, a NotImplementedError exception will be thrown when a given function has not been implemented yet.
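
When that happens, a common workaround is to fall back to Pandas or Spark for that single step. Below is a minimal, hedged sketch: to_feather is used only as an example of a Pandas method that was missing at the time of writing, and the exact set of unimplemented functions depends on your Koalas version.

import databricks.koalas as ks

kdf = ks.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

try:
    # Example of a Pandas method that Koalas may not implement (version-dependent).
    kdf.to_feather('/tmp/frame.feather')
except NotImplementedError:
    # Fall back to Pandas for this step (safe only for small data), or use PySpark.
    kdf.to_pandas().to_feather('/tmp/frame.feather')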

The Koalas DataFrame

Deep inside

The Koalas DataFrame is the main structure of the framework. Logically, it is a Pandas DataFrame, and therefore it can be accessed through indexes. An index separates data from metadata, allowing rows and columns of the DataFrame to be accessed through a meaningful tag. However, since the Koalas DataFrame is built on top of a Spark DataFrame, the index column has to be added afterwards.

It can be a sequence index, a distributed index or a compromise between the two.

  • A sequence index (the default one) guarantees a sequence that increases one by one per row, so it potentially requires the whole DataFrame to be loaded into a single node.
  • The distributed index doesn’t guarantee such a sequence, since each value is non-deterministic. It is the fastest of the three indexes.
  • The last index, distributed-sequence, is a compromise: it guarantees the sequence unless new data are inserted, because the index is created by ordering the partitions and then ordering the elements inside each partition. Computationally, it cannot be ignored, since sequencing a distributed DataFrame requires at least a shuffle. This index must be managed carefully.
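
The default index type can be switched through the Koalas options API. The following minimal sketch assumes the option values documented for Koalas 1.x ('sequence', 'distributed-sequence' and 'distributed'):

import databricks.koalas as ks

# Pick how Koalas attaches the index column to the underlying Spark DataFrame:
# 'sequence' (default), 'distributed-sequence' or 'distributed'.
ks.set_option('compute.default_index_type', 'distributed-sequence')

kdf = ks.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print(kdf.index)  # index generated with the configured strategy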

Just like the indexes, a few implementations of functions and features in the Koalas API may slightly differ from Pandas because of its distributed nature. Even though this complexity is hidden by the framework, it can still pop up while developing. An example is the transposition of a DataFrame: it is a straightforward operation on a single node, but in a distributed context you may need to generate a row for each value and then group twice.

Bear in mind that, when an operation is computationally too expensive, the system will throw an error if the number of rows exceeds the compute.max_rows threshold. To be sure about a function’s behaviour, it is better to refer to the official API documentation, or directly to the source code.
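
As a quick illustration of this guard, here is a minimal sketch based on the compute.max_rows option documented for Koalas 1.x (its default value is 1000 rows):

import databricks.koalas as ks

kdf = ks.DataFrame({'col1': list(range(5)), 'col2': list(range(5))})

# Expensive single-node operations such as transpose() are limited by
# 'compute.max_rows'; exceeding the threshold raises an error.
ks.set_option('compute.max_rows', 1000)
transposed = kdf.transpose()

# Setting the option to None removes the limit (at your own risk on big data).
# ks.set_option('compute.max_rows', None)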

Data sources and sinks

I/O operations are far more powerful in Koalas than in Pandas or even standalone Spark. As of now, a Koalas DataFrame can be natively built from and written to:

  • Another Pandas or Spark DataFrame, a NumPy array, or a plain Python dictionary;
  • Parquet, CSV, Excel, JSON and HTML files, as well as Delta Lake tables;
  • Special sources, like the Spark metastore, the system clipboard or a SQL table.

The following code snippets show how to obtain a Koalas DataFrame from a Pandas or Spark DataFrame. In all cases, you must first import the Koalas package.

import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession

For this example, a Pandas DataFrame and a Spark DataFrame are created.

pdf = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)

Then a Koalas DataFrame is created from the Spark one…

kdf = sdf.to_koalas()

… and a Koalas DataFrame is created from the Pandas one.

kdf = ks.from_pandas(pdf)

Finally, you switch back to the Pandas/Spark frameworks.

sdf_back = kdf.to_spark()
pdf_back = kdf.to_pandas()

Switching back from Koalas to Spark or to Pandas can be very useful, especially when you can’t find a function in the Koalas API. However, converting a Spark/Koalas DataFrame to Pandas means collecting all the data onto one node. This practice is highly discouraged for big tables, because a collect operation on the driver can exceed its resources and make it crash. In these cases, the traditional Spark framework is to be preferred.

Finally, it is interesting to note that the Spark behaviour should also be preferred over Pandas for output operations because, just like for the index, the size of the DataFrame may not allow a collect operation on one node. For example, the to_csv, to_parquet (…) methods write one file per partition, as Spark does, and the documentation itself warns against unrestricted usage of the to_clipboard function.
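
For instance, writing the small DataFrame from the snippets above to Parquet or CSV produces a directory with one file per partition, just like Spark would (the paths below are only examples):

# One file per partition is written under each target directory.
kdf.to_parquet('/tmp/koalas_output/parquet')
kdf.to_csv('/tmp/koalas_output/csv')

# Reading back mirrors the Pandas API, but it is executed by Spark underneath.
kdf2 = ks.read_parquet('/tmp/koalas_output/parquet')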

Transformations and advice

A DataFrame can be manipulated through its Series composition (as in Pandas), as well as with aggregation and window functions, which are all common Spark manipulations.
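
As a small sketch, the DataFrame from the previous snippets can be transformed with column arithmetic and a group-by aggregation, both expressed with the Pandas syntax and executed by Spark underneath:

kdf['col3'] = kdf['col1'] + kdf['col2']        # Series composition, as in Pandas
per_group = kdf.groupby('col1')['col3'].sum()  # aggregation run by Spark
print(per_group.to_pandas())                   # collect only the small result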

Still, there are a few recommendations, which are common to all Spark users:

  • Every operation that requires a shuffle is discouraged, because it requires data exchange between Spark nodes (e.g. sorting a dataset);
  • Avoid operations that require computation on a single partition, since they may fail if the dataset is too large.

Experience

Koalas proves itself to be a very useful tool, especially when you have to engineer the work of a data scientist in a short amount of time. In my experience, I had a whole set of data transformations and models written in Pandas, ready to scale. Moving all the code to Spark would have taken a lot of time, since there isn’t always a 1–1 relationship between Pandas and Spark APIs. Because of this, I decided to adapt the code using the Koalas framework.

This way, I managed to keep every transformation untouched, and I could really appreciate the ready-to-use experience, without having to delve into configuring any SparkSession. Still, since I was using version 0.8, I bumped into some functions that were not implemented (so I had to work my way around them), and I had to rework some operations that were more computationally expensive than they should have been. Anyway, I was able to achieve my results well within the expected performance requirements: the overall experience was positive.

Another user experience can be found in a Databricks blog post dating back to August 2019. There, Virgin Hyperloop explains its experience with Koalas and reports performance similar to PySpark, with very little overhead. Their overall experience was positive, too.

In the end, I think that Koalas can narrow the gap between data engineers and data scientists, leading the former to speed up engineering rework and the latter to start thinking about the data transformation issues that come with a distributed computational framework.

Photo by Skyler H on Unsplash

If you’d like to try the new framework without installing Spark locally, you can rely on Databricks: the Community Edition offers a ready-to-use notebook environment, which runs on a free cluster with the latest Spark and Koalas versions.

