Building Churn Predictor with Python, Flask, HTML and CSS

Puneet Gajwal · Published in Star Gazers · Mar 3, 2021 · 11 min read

This blog will take you through an end-to-end telecom churn prediction application. It covers concepts like feature importance, data visualization, loading data from MySQL Workbench, one-hot encoding, comparing the performance of different models, deploying the model on a local machine using Flask, and building a UI with HTML and CSS.

In my last blog I emphasized productionizing your model. This one takes you through an actual industry-level ML workflow: we'll pick up a business problem and solve it end to end, from fetching data out of a database to deploying the model on our local machine using Flask APIs. I highly recommend opening up my GitHub repo and following along for the best learning experience. Also, feel free to comment or reach out with doubts on my mail id. So, without wasting time, let's begin!

➤BUSINESS PROBLEM, DATA & APPROACH

  1. Business Problem: A telecom company wants to use their historical customer data to predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

So, we need to build a system which takes certain customer-related data and predicts the customer's behavior (whether the customer will stay in the network or leave it).

2. Data: Each row represents a customer, each column contains customer’s attributes described on the column Metadata. The data set includes information about:

  • Customers who left within the last month — the column is called Churn.
  • Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
  • Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
  • Demographic info about customers — gender, age range, and if they have partners and dependents.

3. Approach: The approach we'll be opting for to solve this problem is a standard one which can be applied to almost all supervised learning problems, and it includes the following steps :

⇨ Data Gathering and Loading : Collecting data from different sources and loading it into our notebook (or any other IDE).

⇨ Data Cleaning & Visualization : This step is my favorite one as in this step only you get to know a lot about your data. Also, data cleaning is simply pre-processing the data before you go and build actual model.

⇨ Data Modeling & Different Model’s Performance comparison: This step is general public’s favorite where we try to fit different set of models on our training data and test the same on testing data and compare performance of different model based on a chosen metric.

⇨ Model Deployment with Flask API: This step is what gives our model a real purpose where our model can now cater to real prediction, requested by user and give results in real time.

So, without wasting much time, we'll jump directly into these steps and discuss them in depth with actual code snippets.

➤Step 1 : DATA GATHERING AND LOADING

In our problem, we were given 2 tables in MySQL Workbench which needed to be loaded into our Jupyter Notebook as pandas dataframes so that we could carry out further processing.

In order to load data in your IDE, the first requirement is to set up a connection between Python and MySQL so that both can talk to each other. For this you need the MySQL connector for Python and a few libraries such as 'sqlalchemy' and 'pymysql', and you are good to go. Here's the code for setting up the connection, loading the data as pandas dataframes and joining the 2 tables to form a single dataframe.
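
The actual snippet is embedded from my repo; here's a minimal sketch of that step, assuming placeholder credentials, hypothetical table names (customer_account, customer_churn) and a customerID join key:

import pandas as pd
from sqlalchemy import create_engine

# Connection string: mysql+pymysql://<user>:<password>@<host>:<port>/<database>
# (credentials, database and table names below are placeholders)
engine = create_engine("mysql+pymysql://root:password@localhost:3306/telecom")

# Pull both tables into pandas dataframes.
df_account = pd.read_sql("SELECT * FROM customer_account", engine)
df_services = pd.read_sql("SELECT * FROM customer_churn", engine)

# Join the two tables on the shared customer identifier to get one dataframe.
df = df_account.merge(df_services, on="customerID", how="inner")
print(df.shape)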

This is how the data looks after loading it as a pandas dataframe.

➤Step 2: DATA CLEANING AND VISUALIZATION

After looking at the shape and info of our joined frame, one serious problem was clearly visible: almost all the features had the 'object' datatype, which certainly isn't acceptable for data modeling (though it does make the many categories easy to visualize). We'd need to convert them to some vectorized form, which will be discussed later.

Whenever I carry out the data cleaning step, I always start by checking for null values. Luckily, in our case there were no null values, but that was not the end of it. Sometimes data actually contains null values that are represented as something else. In our case, the missing values were represented as blanks, which brings me to an important point.

Never check for missing values with just the isna() method. Missing values can be represented in other forms as well. Always look for anomalous data.
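
The original snippet sits in the repo; a minimal sketch of the idea, assuming (as in the standard Telco dataset) that the blanks live in the TotalCharges column:

import numpy as np
import pandas as pd

# Blanks (empty strings / whitespace) become proper NaNs first.
df["TotalCharges"] = df["TotalCharges"].replace(r"^\s*$", np.nan, regex=True)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])

# Fill each NaN with the median of its Churn group ('Yes' vs 'No').
df["TotalCharges"] = df.groupby("Churn")["TotalCharges"].transform(
    lambda s: s.fillna(s.median())
)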

The above code looks for blank values in the data and fills them with a median-based replacement technique. We first replaced the blanks with NaN and then replaced the NaNs based on our dependent variable, which I believe is the best replacement technique to follow: simply calculate the median for both categories ('Churner' and 'Non-Churner') and fill each blank with the median of the matching category.

Next, since we had so many categorical variables, group-based analysis works best here, along with a few histograms to check each class's frequency. I won't go much into the code for this part, but a few conclusions I drew were as below :

  • Month-to-month contracts are the most commonly opted for by customers.
  • Tech Support is not taken by most customers, and customers who do not have Internet Service do not fall into either the Yes or No category.
  • Fiber optic is the most common internet service. There is also a large population which doesn't have internet service at all.
  • Senior Citizens churn slightly more than Non-Senior Citizens.
  • Non-Dependent customers churn more than Dependent customers.
  • Customers with Fiber optic Internet Service churn the most compared to DSL and no-Internet-Service customers!
  • Customers with no Online Security churn way more than customers who do have Online Security.
  • People with a month-to-month contract are the biggest churners.
  • Finally, our target class was imbalanced, with most of the customers being Non-Churners.

Well, if you closely observe the last point, we have an imbalanced Churn class. If we train a model directly on this, there's a very high chance that our model will be biased towards the majority class, which is 'Non-Churners'.

To overcome this problem, there are many techniques out there. You might've heard about 'up-sampling' and 'down-sampling' techniques for handling class imbalance. We will use one such technique called Synthetic Minority Oversampling Technique, or 'SMOTE', which introduces synthetic points for the minority class in its neighborhood in feature space. Consider the code below to implement SMOTE.

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=9)
# fit_resample (formerly fit_sample) oversamples the minority class in the training set only
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

I hope everyone is aware of the fact that you cannot give a model object or text-categorical data as it is, since it is non-interpretable for any model. Because we have so many categorical variables, converting them into one-hot-encoded vectors seemed wise.
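
The encoding snippet is embedded from the repo; a rough sketch of that step (the exact column list below is an assumption, check the repo for the real one):

import pandas as pd

# Map the target to 0/1 based on its Yes/No values.
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})

# Columns that need one-hot-encoded vectors (illustrative subset).
oneHotCols = ["gender", "Partner", "Dependents", "MultipleLines", "InternetService",
              "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
              "StreamingTV", "StreamingMovies", "Contract", "PaymentMethod"]

# get_dummies creates one binary column per category.
df = pd.get_dummies(df, columns=oneHotCols)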

The above code simply converts the 'Churn' variable to 0–1 based on its Yes/No values, and oneHotCols contains all the columns for which we require one-hot-encoded vectors. After this we have a separate column for each category, as shown below.

➤Step 3 : DATA MODELING AND DIFFERENT MODEL’S PERFORMANCE COMPARISON

After cleaning all the data, now is the right time to fit some base models on it. Before going into model building, always remember one thing:

Always standardize your data to make your model scale-invariant.

I personally use scikit-learn's StandardScaler all the time, but this time I decided to give MinMaxScaler a shot, which turned out fine for me. Before we standardize the data, we need to split it into train and test sets. We'll use the train set to train the model and the test set to compare model performance. Now, a lot of beginners have this doubt: should we standardize first or split first? The answer is:

We always do a train-test split first and only then standardize the train and test data separately. Since we treat the test set as future unseen data, we don't want it to influence our model in any manner.
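
As a concrete illustration of that order (the variable names X and y are assumed):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split first so the test set stays completely unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=9
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)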

One final step remains before we jump into the data modeling part: selecting a performance measure. As we have an imbalanced classification problem, AUC-ROC or the confusion matrix can be chosen. For the sake of simplicity, we'll go ahead with the confusion matrix.
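
For reference, evaluating any fitted classifier (hypothetically named model below) on the held-out test set takes just a couple of lines:

from sklearn.metrics import confusion_matrix

# 'model' is any fitted classifier (placeholder name).
# Rows are the actual classes, columns are the predicted classes.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))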

After all this hustle, we are finally at the stage where we can train our models. I won't go into the details of how I trained and tuned each one of them; instead, I'd like to use this space to discuss feature importance and how you can compute it.

When we one-hot encoded our categorical variables (if you are following along), the number of variables increased from 21 to 47. But notice that we haven't introduced any new information into the data; we just re-expressed what we already had. It would be very unwise to use all 47 of these variables for our modeling. This is where feature-importance-based feature selection comes into the picture.

In feature selection, you keep only those features which are actually contributing towards your model's predictions and drop the rest. But one may ask: how would we know if a feature is important or not? The answer is the Gini score, from which you can easily get feature importance. Features whose Gini score is high can be considered important, and the rest can be ignored.

All tree-based algorithms can compute feature importance. We chose to pick the top 30 features. You can use fewer or more based on your judgment, or even try building models with different numbers of features. Here's code to compute feature importance and show it pictorially as well.
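
That code lives in the repo; a minimal sketch of the idea, using a random forest's Gini-based importances (the repo may use a different tree-based model, and feature_names is an assumed list of the 47 column names):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Fit a tree-based model and read its Gini-based feature importances.
forest = RandomForestClassifier(n_estimators=200, random_state=9)
forest.fit(X_train, y_train)

# Rank the 47 features by importance and keep the top 30.
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1][:30]
top_features = [feature_names[i] for i in indices]  # feature_names: list of column names

# Quick horizontal bar chart of the selected features.
plt.figure(figsize=(8, 10))
plt.barh(top_features[::-1], importances[indices][::-1])
plt.xlabel("Gini importance")
plt.tight_layout()
plt.show()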

Feature importance based on indices values

With the top 30 features, we built our models with basic hyper-parameter tuning, and the final results are as follows :

Different model’s performance comparison

I would like you guys to think about which model would be the best fit for production. I saw 2 candidates for our final model:

  1. Logistic Regression: It has the fewest misclassified points for churners.
  2. AdaBoost: It has the lowest overall misclassification rate.

I chose AdaBoost as I wanted my model to generalize well on the data rather than only minimizing misclassified churners. Depending on the business requirement, the model selection could have gone the other way round.

Finally, I dumped the AdaBoost model so that it could be deployed with Flask on the local machine.
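
Dumping the trained model takes a couple of lines with pickle (the variable name adaboost_model is illustrative; the file name matches what main.py loads later):

import pickle

# Serialize the trained AdaBoost model so the Flask app can load it at startup.
with open("adaboost.pkl", "wb") as f:
    pickle.dump(adaboost_model, f)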

➤Step 4: MODEL DEPLOYMENT WITH FLASK API

With all the hard work done, we are ready to deploy our model. To begin, you need a basic UI structure which you can create using HTML forms and beautify using CSS. You can check my GitHub repo for the HTML+CSS code to replicate, or otherwise continue with only a basic HTML structure.

Homepage.html
results.html

If you've opened up my GitHub repo, you'll see a templates folder which contains 2 files (homepage.html and results.html) whose output is shown above, and you can clearly see the prediction for the above use-case is a non-churner. I'll assume you can easily create a similar GUI with HTML and CSS, and will only cover the Flask part of the application.

In Flask, the first thing to remember is the folder structure. You need to create one main file (main.py in our case) which acts as the central system of our application and links to all the other files. Secondly, all HTML and CSS files should be in a separate folder which you need to reference (templates in our case).
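
A minimal sketch of that layout and the corresponding Flask boilerplate (only the home() route is shown here; the other functions follow the same pattern):

# Project layout (simplified):
#   main.py           <- Flask entry point
#   adaboost.pkl      <- dumped model
#   templates/
#       homepage.html
#       results.html

from flask import Flask, render_template

app = Flask(__name__)  # Flask serves HTML from the templates/ folder by default

@app.route("/")
def home():
    # Loads the input form when the app is opened at 127.0.0.1:5000
    return render_template("homepage.html")

if __name__ == "__main__":
    app.run(debug=True)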

In our main.py file, we begin by loading the dumped AdaBoost model (I used PyCharm; use whatever IDE you prefer).

import pickle

# Load the serialized AdaBoost model once, when the application starts
adaboost = pickle.load(open('adaboost.pkl', 'rb'))

Next, there are just 5 functions responsible for rendering results on the web page, and we'll discuss them in depth.

  1. home(): It simply loads homepage.html when the application is launched at 127.0.0.1:5000.
  2. get_data(): This function is responsible for fetching data from homepage.html. We use request.form.get() to read the submitted values. We then initialize a dictionary with all 47 features set to 0 and later set entries to 1 if they have been selected from the drop-downs in homepage.html. This function finally returns a dataframe with 1 for the values selected from the drop-downs and 0 for the non-selected ones.
  3. feature_importance(model, data): As shown, the function takes 2 arguments. For model we pass the AdaBoost model we loaded, and for data we pass the output of get_data(), i.e. the above-mentioned dataframe. With this, we compute feature importance as discussed above and return the dataframe restricted to the top 30 features.
  4. min_max_scale(data): Takes data as an argument; we pass it the output of feature_importance(). This function is a simple implementation of the min-max scaler I discussed earlier, and it returns a dataframe with scaled values.
  5. show_data(): In this function we call all the functions created above and finally pass the outcome to the results.html page to show our model's prediction, as shown in the code below.
#main.py
return render_template('results.html', tables=[df.to_html(classes='data', header=True)], result=outcome)

#results.html
<h3><b>Values entered are:</b></h3><br>
{% for table in tables %}
<div class="showTable">{{ table|safe }}</div>
{% endfor %}

In the above code we used df.to_html() to show what the user actually entered and rendered it on the results.html page as the table you can see in the results.html screenshot. With this we mark the end of the project.

➤FINAL THOUGHTS

We could have opted for even more rigorous hyper-parameter tuning, as our best model's accuracy is not that high. With optimal hyper-parameters, the model's performance could have been better.

Also, the GUI is very basic, with no checks on whether the user enters invalid data in the text fields. This can lead to errors, but it can be handled with JavaScript on the frontend itself.

Stacking of models was not tried; I encourage you to go ahead with it and let me know how it performs and which models you used.

If you have any doubts, feel free to connect with me over LinkedIn or my mail id, and I would highly appreciate any kind of feedback. Find the code for this on my GitHub account.

A Data Science enthusiast who believes the world is about to get transformed by A.I.; we'd better keep up! Initiate contact on puneet.gajwal99@gmail.com!