Tackling Big Data with Pandas

Ahmedabdullah · Red Buffer · Jan 28, 2022

When working in data science or data analytics, pandas is the go-to library for exploring and handling data, and why not? Pandas has everything: intuitive data structures, rich APIs, and above all, efficient in-memory computation for small to mid-sized data. Smaller datasets, such as student grade prediction or property trends, tend to be in MBs or a couple of GBs. But suppose you have the data of every transaction on the Binance Smart Chain and you have to analyze it. Big data is where the problem actually begins.

If you have been following my stories, you know I shared how Bamboolib makes it easier to do EDA on a pandas DataFrame. Yeah mate, once the data gets this big, that stuff is gone too T_T.

Feeling down?

Well, don’t be, because today we are going to see how we can actually use pandas to deal with big data. As we are all well aware, for the majority of big data problems the go-to solution is Apache Spark.

Apache Spark is an open-source, distributed computing engine used for processing and analyzing large amounts of data by distributing it across a cluster and processing it in parallel.

While PySpark (the Python interface for Apache Spark) is great for heavy data workloads, learning the new PySpark syntax and refactoring code from pandas to PySpark can be tedious.

Fortunately, with the Spark 3.2 update, we can now run the pandas API on Spark. This allows us to leverage the power of distributed processing in Spark while using the familiar syntax of pandas. Let’s see how it works with the bank marketing dataset from the UCI repository.

Getting Started

In this article, we’ll start from the beginning: how to install PySpark and use pandas along with Spark.

Disclaimer: a lot of examples using pandas could be shown, but I’ll be covering only the basics to give you the gist. The rest is the same as it has always been with pandas.

To use pandas with big data, we first install PySpark:

pip install pyspark

In case you do not get the pandas extension, you can install it explicitly via

pip install pyspark-pandas

and that’s pretty much it; you’re good to go.
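
To confirm the installation picked up a version that includes the pandas API, a quick check of the PySpark version should report 3.2.0 or higher (a minimal sketch, assuming a working local Spark/Java environment):

import pyspark

# The pandas API on Spark ships with PySpark 3.2 and above
print(pyspark.__version__)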

Importing the library

To use native pandas, we would typically import it in the following manner:

import pandas as pd

To use the pandas API in PySpark, we simply need to change the import, and everything else stays the same.

import pyspark.pandas as ps

Reading a CSV file

If we need to read a CSV file, we can do it the same way we always have with pandas; checking the type at the end confirms that the resulting DataFrame is a PySpark Pandas DataFrame.

df = ps.read_csv('/FileStore/tables/bank_full.csv')
type(df)
>> pyspark.pandas.frame.DataFrame
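
Note that the path above is a Databricks-style /FileStore path. If you are running PySpark locally, an ordinary filesystem path works the same way (the filename below is just a placeholder for wherever you saved the dataset):

df = ps.read_csv('bank_full.csv')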

Inspect DataFrame

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y). Since the main point of this story is that, with this pandas extension in Spark, we don’t need to worry much about Spark’s syntax, I’ll show you that the syntax is exactly the same. We inspect the data frame the way we normally do in pandas, using

df.head()

and the result is the familiar tabular preview, just as you’d get from native pandas.

Checking Column Information

Normally, right after checking the data via df.head(), most of us would move on and inspect the columns to get a better understanding of the data.

df.info()

Group-by and Aggregates

For the above task, a fairly simple one chosen to show you how pandas works with Spark, the next step is to find the average age by the target variable (y).

df.groupby('y', as_index=False).agg({'age': 'mean'}).sort_values('age')
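
For comparison, this is roughly what the same aggregation would look like in plain PySpark syntax, the very thing the pandas API saves us from having to learn (a sketch, using the df.to_spark() conversion covered later in this article):

from pyspark.sql import functions as F

# Same average-age-by-target aggregation, expressed in native PySpark
df.to_spark().groupBy('y').agg(F.mean('age').alias('age')).orderBy('age').show()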

Applying Lambda Functions

Let’s create an indicator column that flags whether a customer’s age is above 40. We can do so by applying a lambda function to the age column using pandas’ .apply.

df['age_above_40'] = df['age'].apply(lambda x: 'yes' if x > 40 else 'no')
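
To quickly sanity-check the new column, the usual pandas-style follow-up works unchanged (a quick sketch):

df['age_above_40'].value_counts()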

Plotting

We can also plot charts with the pandas .plot function. The charts are rendered with Plotly, and we can use Plotly’s .update_layout to adjust the chart properties.

fig = df.groupby('marital').agg({'age':'mean'}).plot.bar()
fig.update_layout(xaxis_title = 'Marital Status', yaxis_title = 'Average Age', width = 500, height = 400)
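
By default the pandas API on Spark uses Plotly as its plotting backend; if you prefer matplotlib, you can switch it through the library’s option mechanism (a small sketch, assuming the 'plotting.backend' option available in Spark 3.2+):

# Switch the plotting backend from the default 'plotly' to matplotlib
ps.set_option('plotting.backend', 'matplotlib')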

Query DataFrame with SQL

A PySpark Pandas DataFrame can also be queried with SQL. This is an exciting feature for data science enthusiasts that was never available in the native pandas library. It makes life a lot easier: instead of filtering on pandas columns and then applying aggregate functions, we can always go the SQL way.

ps.sql("SELECT y, mean(age) AS average_age FROM {df} GROUP BY y")
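
Since this is ordinary Spark SQL under the hood, any SQL clause works. For example, restricting the same aggregate to clients over 40 (a quick sketch reusing the same {df} reference as above):

ps.sql("SELECT y, mean(age) AS average_age FROM {df} WHERE age > 40 GROUP BY y")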

Convert from PySpark Pandas DataFrame to PySpark DataFrame

Last but not least, we can also choose to work in PySpark by converting the PySpark Pandas DataFrame into a PySpark DataFrame.

spark_df = df.to_spark()
type(spark_df)
>> pyspark.sql.dataframe.DataFrame
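
Once converted, the full PySpark API is available on spark_df, for example for things the pandas API doesn’t cover, like writing the data out as Parquet (a sketch; the output path is just a placeholder):

spark_df.write.mode('overwrite').parquet('/FileStore/tables/bank_full_parquet')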

Convert from PySpark DataFrame to PySpark Pandas DataFrame

psdf = spark_df.to_pandas_on_spark()
type(psdf)
>> pyspark.pandas.frame.DataFrame

Conclusion

We looked at how to use the pandas API on Spark, which helps us process big datasets in a distributed fashion using the familiar pandas syntax. Apache Spark is just one of many alternatives to pandas for dealing with big datasets in Python. As we all know, big data is THE game changer: data keeps growing day by day, and so do our problems analyzing it. Using pandas with Spark can be thought of as the best of both worlds.
While writing this article I took the visuals and help from Edwin Tan and his article at https://towardsdatascience.com/how-to-use-pandas-for-big-data-50650945b5c6. Kindly check it out too.
