Netflix Appetency: A beginner’s guide

Chirag Samal
4 min read · Jun 9, 2022


Image source: Netflix

Data Science plays an essential role in most online services, helping attract new customers and keep existing ones happy. With Data Science, you get a more realistic picture of your consumers’ tastes in the form of graphs and charts that take not just one metric but several as input. This crucial information helps you mold your products and services to look one-of-a-kind to your customers, attracting them to your platform.

With a company like Netflix brimming with data, it’s always a wise decision to put that pile of data to good use. By incorporating concepts like data analysis, machine learning, statistics, and deep learning, Data Science can help Netflix and any business grow exponentially regardless of sector.

Data Science plays a critical role in how Netflix operates and presents the company with new opportunities to grow.

In this article, we will classify consumers according to their appetite to subscribe to Netflix using the Netflix Appetency dataset from Kaggle. The training set consists of an id column, the customers’ features, and a target column: target.

Importing Libraries

Load Dataset

  • train.csv - the training set. It consists of an id column, the customers' features, and a target column: target.
  • test.csv - the test set. It contains everything in train.csv except the target column.
  • sample_submission.csv - a sample submission file in the correct format; target=1 means that the customer subscribes to Netflix.
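The three files can be loaded with pandas; a minimal sketch, where the default paths are placeholders pointing at the downloaded Kaggle files:

```python
import pandas as pd

def load_data(train_path="train.csv", test_path="test.csv"):
    # Paths are assumed defaults; adjust them to wherever the
    # competition files were downloaded.
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test
```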

Target type and count

Target distribution of the dataset
Target distribution. Image by the author
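The class balance behind that plot can be checked with a simple `value_counts`; a sketch, using the competition's binary `target` column:

```python
import pandas as pd

def target_distribution(train):
    # target = 1 means the customer subscribes to Netflix.
    counts = train["target"].value_counts().sort_index()
    # counts.plot(kind="bar") would reproduce the distribution plot.
    return counts
```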

Data Cleaning

Data cleaning is one of the most important tasks for a data science professional. Wrong or poor-quality data can be detrimental to downstream processing and analysis. Clean data ultimately enhances overall productivity and allows you to make the best decisions possible.

We will drop:

  • Features with more than 10% null values
  • Categorical features with only a single category
  • Features with only a single unique value
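These rules can be sketched in a single helper (the function name and the exact 10% threshold follow the list above; the implementation is mine, not necessarily the author's):

```python
def drop_uninformative(df, null_frac=0.10):
    # Keep a column only if it is neither too sparse nor constant:
    # - at most `null_frac` of its values are null
    # - it has more than one distinct non-null value
    keep = [
        col for col in df.columns
        if df[col].isna().mean() <= null_frac
        and df[col].nunique(dropna=True) > 1
    ]
    return df[keep]
```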

Top features distribution

Feature distribution
Top 9 features distribution. Image by the author.

Correlation Heatmap

The correlation heatmap is a graphical representation of the correlation matrix, showing the pairwise correlation between the top 9 variables.

Correlation Heatmap. Image by the Author
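A sketch of how such a heatmap can be produced with seaborn, assuming `cols` holds the nine selected feature names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

def corr_heatmap(df, cols):
    # Pairwise Pearson correlation of the selected columns,
    # drawn as an annotated heatmap.
    corr = df[cols].corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.tight_layout()
    return corr
```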

Preprocessing

Label Encoding

We will use Label Encoding to convert the categorical columns in the dataset into numeric columns.
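One common way to do this with scikit-learn's `LabelEncoder` is sketched below; fitting each encoder on the combined train and test values avoids unseen-label errors at transform time (a common approach, not necessarily the author's exact preprocessing):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(train, test):
    # Encode every object (string) column in place with integer codes.
    for col in train.select_dtypes(include="object").columns:
        le = LabelEncoder()
        le.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = le.transform(train[col].astype(str))
        test[col] = le.transform(test[col].astype(str))
    return train, test
```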

Machine Learning Model Training

XGBOOST

We will use the XGBoost machine learning model to train our dataset.

XGBoost is a scalable, enhanced implementation of the gradient boosting algorithm, optimized for efficiency, computational speed, and model performance. It is an open-source library maintained by the Distributed Machine Learning Community (DMLC). XGBoost combines software and hardware optimizations to improve on conventional boosting techniques in both accuracy and speed.

To avoid overfitting, we will use cross-validation.
In cross-validation, a test set is still held out for final evaluation, but a separate fixed validation set is no longer required. The basic technique, known as k-fold CV, divides the training set into k smaller sets, or “folds” (other approaches exist, but generally follow the same principles). For each of the k folds, the procedure is as follows:
A model is trained using the other k-1 folds as training data, and the model is then validated on the remaining fold (i.e., that fold is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is the average of the k values.

We will use StratifiedKFold which is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set. We will use a stratified 5-fold cross-validation.

The evaluation metric used in the competition is AUC (Area Under the Curve). The ROC-AUC score function computes the area beneath the receiver operating characteristic (ROC) curve, summarising the curve's information in a single number.

The AUC value ranges from 0 to 1. The AUC of a model whose predictions are 100 percent incorrect is 0; the AUC of a model whose predictions are 100 percent correct is 1.
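These two extremes are easy to sanity-check with scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# Every positive ranked above every negative -> AUC = 1.0
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))
# Every positive ranked below every negative -> AUC = 0.0
print(roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))
```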

The mean AUC which we receive after 5-fold Stratified K-fold cross-validation in the XGBoost Machine Learning model is 0.7853.

Let’s plot the top feature importance from the machine learning model.

Feature Importance
Feature Importance. Image by the author.
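The ranking behind that plot can be recovered from any fitted tree model that exposes `feature_importances_`; a sketch (the helper name is mine, and xgboost's own `plot_importance` would draw an equivalent chart directly):

```python
import pandas as pd

def top_feature_importances(model, feature_names, n=9):
    # Pair each importance score with its feature name,
    # then keep the n largest.
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(n)
```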

Summary

In this article, we discussed Netflix Appetency, a Kaggle competition to classify consumers according to their appetite to subscribe to Netflix. We walked through a beginner-level tutorial covering the problem statement and trained an XGBoost machine learning model on the dataset. The mean AUC we got after training our model is 0.7853.

You can access the full code here in this GitHub Repository: Netflix Appetency Tutorial.


Thanks for reading; I will be writing more similar posts soon. Let’s get involved in discussions, and suggestions are always welcome.


Chirag Samal

Computer Vision Engineer @ Zeiss | Former Intern @ Stanford University, IISc | Kaggle Master | IIIT-NR