Using a PySpark Dataframe like a Python Dataframe

(A note before we begin: I’m a Data Science beginner and only share my personal understanding here. I don’t claim that my opinions are all correct, and I’d be very glad to hear yours.)
In this blog, I will share how to use a PySpark Dataframe like a Python (pandas) Dataframe.
Environment:
I use AWS (Amazon Web Services) as my work platform at the moment.
S3 for storage.
EC2 for hardware settings.
EMR for programming.
Replacing elements that appear less often than a threshold.
In Python:
We could use the built-in “replace” method of a Python Dataframe.
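As a minimal sketch (with a hypothetical “city” column and a threshold of 2), the pandas version could look like:

```python
import pandas as pd

# Hypothetical data: replace values of "city" that appear
# fewer than 2 times with the placeholder "other".
df = pd.DataFrame({"city": ["NY", "NY", "LA", "SF"]})

counts = df["city"].value_counts()
rare = counts[counts < 2].index.tolist()  # values below the threshold

df["city"] = df["city"].replace(rare, "other")
print(df["city"].tolist())  # ['NY', 'NY', 'other', 'other']
```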

In PySpark:
The simplest way is as follows, but it contains a dangerous operation: “toPandas”. It converts a Spark Dataframe into a Python Dataframe, which requires collecting all the related data onto the master node before the conversion; this leads to memory problems if your hardware is limited. For the same reason, use “collect” sparingly as well, because both are actions.

I chose to use the “join” function to solve this problem while avoiding action operations.

The differences between joins
For “join”, we need to clearly understand the differences between join (inner), left join, right join, and full outer join.
If we have two tables: A and B.
A:

B:

A join(inner) B:

A left join B:

A right join B:

A full outer join B:

One-hot encoding — “get_dummies”
While doing feature transformation, I usually use pandas’ “get_dummies”. So, how does “get_dummies” work?
If we have a dataframe like the following:

There are 2 categorical columns, “Color” and “Size”. Many algorithms can’t work with categorical variables, so we have to convert them to numeric values. In this case, we could use the “One-Hot” method, but that function produces vectors.

That’s not convenient for the next steps, such as dimensionality reduction and feature combination, so personally I prefer “get_dummies” because it gives us a new table like the following (the new columns are shown in red).

In Python:
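A minimal “get_dummies” example with the hypothetical “Color” and “Size” columns:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Red"],
                   "Size":  ["S", "M", "L"]})

# Each distinct value becomes its own 0/1 column,
# prefixed with the original column name.
dummies = pd.get_dummies(df, columns=["Color", "Size"])
print(list(dummies.columns))
# ['Color_Blue', 'Color_Red', 'Size_L', 'Size_M', 'Size_S']
```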

In PySpark:
There is no official function, so I tried to implement an efficient solution myself.
At first, I found this method:

But it uses an action, “collect”, so I wanted to find a new method that doesn’t use any actions.
I use “pivot” as a replacement.

Example:
The original dataframe:

After this operation, the dataframe looks like:

