Building a Feature Engineering Pipeline and ML Model Using PySpark

Gunank Sood · Published in Analytics Vidhya · 8 min read · Dec 31, 2020

We all build a lot of machine learning models these days, but what do you do if the dataset is huge, you can't process it on your local machine, or you need distributed computation to do feature engineering in less time? One thing that comes to mind is Spark.

Most data scientists are familiar with Python, but Spark is written in Scala. So do we have to learn a new language, or is there something we can do in Python alone? Here comes PySpark, a Python API for Spark that exposes Spark's functionality in Python with a syntax very similar to Pandas.

In this blog, I will cover the steps of building a machine learning model using PySpark. For this project, we are using event data from a music streaming company named Sparkify, provided by Udacity. We will create an ML model that predicts which customers are likely to churn.

Data Exploration

This dataset contains app events for 225 customers and covers all the activity they performed on the app. It looks small because it is a subset of the complete dataset; I am using this small subset because I am not going to use an EMR cluster for this process and will run it on the workspace provided by Udacity. So we will be working with roughly 300K event records.

Schema of the dataset provided.

The page column seems to be very important for us: it records all the user interactions with the app. The less a user interacts with the app, the more likely that customer is to leave. Below are a few of the pages.

pages data

With this, we can tell when a user upgrades or cancels the subscription.
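As a quick way to see those interactions, you can count the events per page directly. A minimal sketch, assuming the events are loaded into a DataFrame named df (the variable and file names here are mine, not from the original code):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session and load the event log; the file name is a placeholder.
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")

# Count how often each page appears to see which interactions dominate.
df.groupBy("page").count().orderBy(F.desc("count")).show(truncate=False)
```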

Defining Churn

After some analysis of the provided data, I defined churn by whether or not a customer visited the “Cancellation Confirmation” page.
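As a sketch of how that label could be built (column names such as userId and page are my assumptions about the schema):

```python
from pyspark.sql import functions as F

# Flag every Cancellation Confirmation event, then mark a user as churned
# if they ever hit that page.
df = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0),
)
user_churn = df.groupBy("userId").agg(F.max("churn_event").alias("churn"))
```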

See below the page interaction counts for users who did and did not visit the Cancellation Confirmation page.

The above graph clearly shows that users who downgraded have far less activity than the premium users. Based on this graph, the pages I chose for the model are:

Home

Add to Playlist

Add Friend

Thumbs Up

Next Song

The gender column also seems to be an important one, because there is a significant gap between users who churned and those who did not.

Data Preprocessing

In data preparation, I checked the data for null values as well as duplicates. Of all the columns available, I found only a few (mentioned in the next section) suitable as features for our model. I handled duplicates and null values wherever required.
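The cleaning code itself appears as an image in the original post; a minimal sketch of the kind of cleanup described, with the filtered columns being my assumption, might look like this:

```python
from pyspark.sql import functions as F

# Drop exact duplicate events and rows with a missing or empty userId,
# which correspond to logged-out or guest sessions.
df_clean = (
    df.dropDuplicates()
      .filter(F.col("userId").isNotNull() & (F.col("userId") != ""))
      .filter(F.col("sessionId").isNotNull())
)
```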

One challenge I faced during data preparation: initially I created hourly windows for the user aggregates, but the resulting data was highly imbalanced. I then tried a weekly window size for these aggregates, which worked quite well.

Here is how our data looks after some cleaning and processing.

Feature Engineering

Now we come to the features we will use to train the models:

  1. User gender.
  2. Whether the customer was on the paid tier at the time of their last activity on the app.
  3. Average session length of a user.
  4. User interaction count for each of the last 4 weeks of membership.

The “ts” column contains the timestamp of each interaction in milliseconds. First, I will compute the number of weeks since the user's last activity.
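The original code is shown as an image; here is one way that calculation could look, with the helper column names being mine:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000

# Find each user's most recent event, then express every event's age in whole
# weeks relative to that last activity (0 = most recent week, 1 = the week before, ...).
user_window = Window.partitionBy("userId")
df = df.withColumn("last_ts", F.max("ts").over(user_window))
df = df.withColumn(
    "weeks_before_last",
    F.floor((F.col("last_ts") - F.col("ts")) / F.lit(MS_PER_WEEK)).cast("int"),
)
```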

I keep an interaction count per week, using a window function to do that.

This window function is very similar to a group-by; it just computes the aggregates only over the window we defined. In the example above, we use the window for the last week only.
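The aggregation code is an image in the original; a sketch of weekly interaction counts for the last four weeks, building on the weeks_before_last column from the previous sketch, could be:

```python
from pyspark.sql import functions as F

# Count interactions per user for each of the last four weeks of membership
# and pivot them into one column per week (week 0 = most recent week).
weekly_counts = (
    df.filter(F.col("weeks_before_last") < 4)
      .groupBy("userId")
      .pivot("weeks_before_last", [0, 1, 2, 3])
      .count()
      .na.fill(0)
)

# Rename the pivoted columns to something more readable.
for week in range(4):
    weekly_counts = weekly_counts.withColumnRenamed(str(week), f"interactions_week_{week}")
```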

Also, I have used a lot of UDFs in the feature creation process. A UDF (user-defined function) lets you apply custom Python logic, often written as a lambda, to DataFrame columns. You can find a few in the image below.
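The UDFs from the post are only visible in that image; here is an illustrative one in the same spirit (my own example, not the author's code), extracting the hour of day from the millisecond timestamp:

```python
import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# A UDF wrapping a plain Python lambda: convert the ms timestamp to an hour of day.
get_hour = F.udf(lambda ts: datetime.datetime.fromtimestamp(ts / 1000.0).hour, IntegerType())

df = df.withColumn("hour", get_hour(F.col("ts")))
```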

For the average session length, I first grouped by user_id and session_id and kept the maximum session length recorded; after deleting duplicates, we are left with one row per user and session. Below is the method I wrote to create the average session length feature.
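That method is also an image in the original; under my reading of the description, a sketch of the average session length feature could be (I use the userId/sessionId spellings for consistency with the other sketches):

```python
from pyspark.sql import functions as F

# Session length = last event timestamp minus first, converted to minutes,
# then averaged per user across all of their sessions.
session_lengths = (
    df.groupBy("userId", "sessionId")
      .agg(((F.max("ts") - F.min("ts")) / (1000 * 60)).alias("session_minutes"))
)
avg_session_length = (
    session_lengths.groupBy("userId")
                   .agg(F.avg("session_minutes").alias("avg_session_length"))
)
```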

I have created another window function to get the membership status of a user at the time of the last interaction. The below image shows the code for the membership status feature.
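Again, the code is an image; one way to derive the level at the last interaction with a window function (my sketch, assuming a level column with free/paid values) is:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# Order each user's events latest-first and take the level (free/paid)
# from the most recent event.
w = Window.partitionBy("userId").orderBy(F.desc("ts"))
last_level = (
    df.withColumn("rank", F.row_number().over(w))
      .filter(F.col("rank") == 1)
      .select("userId", F.col("level").alias("last_level"))
)
```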

ML models require numeric inputs only, so I converted membership status and gender to numeric values using the code below.
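The conversion code is an image; one possible version using simple conditional expressions (a StringIndexer would also work, and user_features is a hypothetical DataFrame holding the joined per-user features):

```python
from pyspark.sql import functions as F

# Map the categorical columns to 0/1 integers so the ML estimators can consume them.
user_features = (
    user_features
    .withColumn("gender_num", F.when(F.col("gender") == "M", 1).otherwise(0))
    .withColumn("paid_num", F.when(F.col("last_level") == "paid", 1).otherwise(0))
)
```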

After feature generation, below is the schema of the dataset we have.

Final Dataset schema

Modeling

The machine learning libraries in Spark are the DataFrame-based spark.ml and the older RDD-based spark.mllib. I have used spark.ml for this project, as it is the more recent one.

Now that we have all the features we will use to train our model, I used StandardScaler to normalize the features and an 80:20 split for the training and test datasets.
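A sketch of that preparation step, assuming the per-user features and the churn label have been joined into a single DataFrame named user_features (the names and column list are mine):

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble the numeric feature columns into a single vector, then standardize it.
feature_cols = [
    "gender_num", "paid_num", "avg_session_length",
    "interactions_week_0", "interactions_week_1",
    "interactions_week_2", "interactions_week_3",
]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

assembled = assembler.transform(user_features)
scaled = scaler.fit(assembled).transform(assembled)

# 80:20 train/test split.
train, test = scaled.randomSplit([0.8, 0.2], seed=42)
```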

I have used Logistic Regression and a Random Forest Classifier in this project; you can try out other algorithms as well. I didn't use boosting algorithms here because the dataset was very small.

I used the ParamGridBuilder class from the ml.tuning library for hyper-parameter tuning. Below is the configuration I used for the Logistic Regression algorithm.

Logistic Regression Model

Configuration for Random Forest Model

Random forest Classifier Model

For Logistic Regression, the regularization parameters used were 0.1 and 0.2, the elastic-net parameters were 0.1 and 0.2, and the maximum number of iterations was 10. For the Random Forest algorithm, we used a minimum of 1 and 3 instances per node, Gini and entropy as the impurity measures, and 20 or 30 trees.
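Those configurations are screenshots in the original; a minimal sketch of how the two grids could be written with ParamGridBuilder is below. Pairing them with a CrossValidator tuned on F1, and the fold count, are my assumptions:

```python
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Logistic Regression grid: regParam 0.1/0.2, elasticNetParam 0.1/0.2, maxIter 10.
lr = LogisticRegression(featuresCol="features", labelCol="churn", maxIter=10)
lr_grid = (ParamGridBuilder()
           .addGrid(lr.regParam, [0.1, 0.2])
           .addGrid(lr.elasticNetParam, [0.1, 0.2])
           .build())

# Random Forest grid: minInstancesPerNode 1/3, impurity gini/entropy, numTrees 20/30.
rf = RandomForestClassifier(featuresCol="features", labelCol="churn")
rf_grid = (ParamGridBuilder()
           .addGrid(rf.minInstancesPerNode, [1, 3])
           .addGrid(rf.impurity, ["gini", "entropy"])
           .addGrid(rf.numTrees, [20, 30])
           .build())

# Tune both models on F1 (see the Metrics section below).
f1_eval = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1")
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=lr_grid, evaluator=f1_eval, numFolds=3)
rf_cv = CrossValidator(estimator=rf, estimatorParamMaps=rf_grid, evaluator=f1_eval, numFolds=3)
```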

Metrics

Results of Logistic Regression Model
Results of Random forest classifier on the test dataset

It's important to choose the right set of metrics to evaluate a machine learning model. Here I have chosen the F1 score and accuracy as the evaluation metrics. The F1 score is the harmonic mean of precision and recall, so it reflects how well the model balances the two, which matters when the classes are imbalanced. For these models, I did hyper-parameter tuning on the basis of the F1 score only to improve the results.
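A sketch of how those two metrics could be computed on the test predictions, reusing the tuned random forest from the earlier sketch:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Fit the tuned model, score the held-out test set, and report F1 and accuracy.
rf_model = rf_cv.fit(train).bestModel
predictions = rf_model.transform(test)

f1 = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1").evaluate(predictions)
accuracy = MulticlassClassificationEvaluator(labelCol="churn", metricName="accuracy").evaluate(predictions)
print(f"F1: {f1:.2f}, Accuracy: {accuracy:.2f}")
```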

Initially, I started with the Logistic Regression model and tried it with a few parameters; the screenshot above shows the best result I got from that. Then I thought Random Forest could give better results here, and after a few hyper-parameter tweaks it went ahead of Logistic Regression.

Based on the results above, we can say that the Random Forest Classifier gives better results on the test dataset than Logistic Regression, with an F1 score of 0.83 and an accuracy of 96%.

Random Forest worked better than Logistic Regression because the final feature set contains only the important features identified in the analysis above; with less noise in the data, Random Forest gave better results. During data preparation, we removed all the features that were not useful for predicting churn.

Feature Importance

Top features for Logistic regression model

I have used the built-in featureImportances attribute of the Random Forest model to get the most important features; it scores each feature by the impurity reduction it contributes across the trees.

Top features for Random Forest Classifier Model
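A minimal sketch of reading those importances back against the feature names, assuming rf_model and feature_cols from the earlier sketches:

```python
# Pair each input feature with its importance score from the trained forest
# and print them from most to least important.
importances = sorted(
    zip(feature_cols, rf_model.featureImportances.toArray()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances:
    print(f"{name}: {score:.3f}")
```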

Improvements

Some improvements I want to share, based on my knowledge of machine learning algorithms: this time I am running on a small dataset, but when we run on a larger dataset I think boosting algorithms will give much better results. I didn't use any boosting algorithm here because the dataset was small, which would not do justice to the result metrics of boosting algorithms. Also, on the large dataset we can use a per-day window instead of the weekly window used in this project.

Coding can be tough if you are using PySpark for the first time. I suggest you refer to the PySpark documentation, which helped me a lot, and for any issue you can search Stack Overflow; I found most of the solutions there.

Conclusion

If you have already built a machine learning model using scikit-learn, then building models with PySpark will be easy; there is not much difference, I feel. For first-timers, though, it might be a tough job. I think these ML models will also give pretty good results when we use them on the complete dataset, and we can try different hyper-parameters when running this on a cluster. For further details, check my GitHub repository.

I created this blog as part of a project for my Udacity Data Scientist Nanodegree.

Feel free to connect with me on LinkedIn. Thanks a lot for reading.
