Using PySpark DataFrames like Python DataFrames

Zhiqiang Zhong
Aug 24, 2017 · 4 min read

(A note before starting: I’m a Data Science beginner and only share my personal understanding here. I don’t claim my opinions are all correct, and it would be a great pleasure to hear yours.)

In this blog, I will share how to use PySpark DataFrames the way you would use pandas DataFrames in Python.

Environment:

I currently use AWS (Amazon Web Services) as my work platform:

S3 for storage.

EC2 for compute resources.

EMR for running Spark.

Replacing elements that appear fewer times than a threshold

In Python:

We could use the built-in “replace” method of a pandas DataFrame.
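As a sketch of that idea (the column name “city”, the data, and the threshold of 2 are made up for illustration), we can combine “value_counts” with “replace”:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "SF", "NY", "LA", "TX"]})

# Count occurrences and collect the values below the threshold
counts = df["city"].value_counts()
rare = counts[counts < 2].index  # values that appear only once

# Replace the rare values with a placeholder
df["city"] = df["city"].replace(rare, "other")
```

After this, every value that appeared fewer than 2 times has become “other”.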

In PySpark:

The simplest way is as follows, but it contains a dangerous operation: “toPandas”. It converts a Spark DataFrame into a pandas DataFrame, which means collecting all the related data on the master before converting; with limited hardware, this leads to memory problems. For the same reason, use “collect” sparingly as well, because both are actions.

To avoid such actions, I choose the “join” function to solve this problem instead.
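A minimal sketch of the join-based approach (it assumes an existing SparkSession named `spark`; the column name “city”, the data, and the threshold of 2 are invented for illustration):

```python
from pyspark.sql import functions as F

# Illustrative data; in practice df comes from your own source
df = spark.createDataFrame(
    [("NY",), ("NY",), ("LA",), ("SF",)], ["city"]
)

# groupBy, filter, and join are all transformations, so nothing is
# collected to the driver until an action is eventually called
frequent = (
    df.groupBy("city").count()
      .filter(F.col("count") >= 2)
      .withColumnRenamed("count", "keep")
)

# Rows whose value is rare get a null "keep" after the left join,
# which we use to substitute the placeholder
df_replaced = (
    df.join(frequent, on="city", how="left")
      .withColumn(
          "city",
          F.when(F.col("keep").isNull(), F.lit("other")).otherwise(F.col("city")),
      )
      .drop("keep")
)
```

The whole pipeline stays lazy, so Spark can plan it without pulling data onto the master.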

The differences between joins

For “join”, we need to clearly understand the differences between an inner join, a left join, a right join, and a full outer join.

If we have two tables: A and B.

A:

B:

A join(inner) B:

A left join B:

A right join B:

A full outer join B:
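Since the tables above were shown as images, here is a small pandas sketch of the four join types (the tables A and B and their contents are made up; the same semantics apply to PySpark’s “join”):

```python
import pandas as pd

A = pd.DataFrame({"id": [1, 2, 3], "a": ["x", "y", "z"]})
B = pd.DataFrame({"id": [2, 3, 4], "b": ["p", "q", "r"]})

inner = A.merge(B, on="id", how="inner")  # keys in both: 2, 3
left  = A.merge(B, on="id", how="left")   # all keys of A: 1, 2, 3
right = A.merge(B, on="id", how="right")  # all keys of B: 2, 3, 4
outer = A.merge(B, on="id", how="outer")  # keys of either: 1, 2, 3, 4
```

Keys with no match on the other side get nulls in that side’s columns.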

One-hot encoding — “get_dummies”

While doing feature transformation, I usually use “get_dummies” of pandas. So, how does “get_dummies” work?

If we have a dataframe like follow:

There are two categorical columns, “Color” and “Size”. Many algorithms can’t work with categorical variables, so we have to convert them to numeric values. In this case, we could use the “One-Hot” method, but that function produces vectors.

That makes the next steps, such as dimensionality reduction and feature combination, harder, so I personally prefer “get_dummies”, which gives us a new table like the one below (the new columns were marked in red).

In Python:
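A small sketch of what “get_dummies” does (the example values are invented to match the “Color” and “Size” columns described above):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Red"],
                   "Size":  ["S", "M", "S"]})

# Each distinct value becomes its own 0/1 indicator column
dummies = pd.get_dummies(df, columns=["Color", "Size"])
# Columns: Color_Green, Color_Red, Size_M, Size_S
```

Each original column is expanded into one indicator column per distinct value, which keeps the result a flat table rather than a vector column.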

In PySpark:

There is no official function, so I tried to implement an efficient solution myself.

At first, I found this method:

But it uses the action “collect”, so I wanted to find a new one that doesn’t use any action.

I use “pivot” as a replacement.

Example:

The original dataframe:

After this operation, the dataframe looks like:
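The pivot-based idea can be sketched as follows (it assumes an existing SparkSession named `spark`; the “id” and “Color” columns and their values are invented for illustration):

```python
# Illustrative data; assumes a SparkSession `spark`
df = spark.createDataFrame(
    [(1, "Red"), (2, "Green"), (3, "Red")], ["id", "Color"]
)

# Passing the list of values to pivot() avoids the extra Spark job
# that would otherwise run just to discover the distinct values
dummies = (
    df.groupBy("id")
      .pivot("Color", ["Green", "Red"])
      .count()       # 1 where the (id, Color) pair exists, null otherwise
      .na.fill(0)    # turn the nulls into 0s
)
# dummies has columns: id, Green, Red, with 0/1 indicators
```

Everything here is a transformation, so no data is collected to the driver, which was the goal of avoiding “collect”.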

(To be updated…)

Zhiqiang Zhong

Written by

Ph.D. Student at University of Luxembourg | Data Scientist | Kaggle Master
