Supervised Learning on Python — Predicting Customer Churn 1

Data Preparation and Feature Visualization


Classification Machine Learning

In supervised learning, the system learns from previously labeled examples. Because our target has only two outcomes (a customer either churns or stays), this is a classification problem, so we fit classifiers rather than a regression line of best fit. In this project we use several models to predict customer churn.

Before we start, let’s talk about what on earth churn rate is.

Churn quantifies the number of customers who have unsubscribed or canceled their service contract. Customers turning their backs on your service or products is no fun for any business; it is costly to win them back once lost. Not to mention that unsatisfied customers will not give you the best word-of-mouth marketing.

Reducing Customer Churn Is More Cost-Effective Than Chasing New Customers

Objective

In the competitive world of telecommunication carriers, customer retention is key. According to Harvard Business Review, acquiring a new customer can cost up to 25 times more than retaining an existing one.

Telecommunication carriers use big data analytics around demographics, usage, customer accounts, connectivity, network performance and reliability, customer support and service issues, and much more, to reduce customer churn rate. Yet the scale of this data is too vast for their existing analytics tools, which in turn keeps churn analysts from quickly running ad-hoc queries or visualizing and dashboarding customer data in its entirety. Hence, in this study we aim to analyze real telecom industry data to find customer patterns and predict future churn.

Toolkit

One of the most valuable assets a company has is data. As data is rarely shared publicly, we take an available dataset you can find on IBM's website. The raw dataset contains more than 7000 entries. All entries have several features and, of course, a column stating whether the customer has churned or not.

To predict if a customer will churn or not, we are working with Python 3 and its amazing open source libraries. First of all we import pandas and NumPy, which put our data in an easy-to-use structure for data analysis and data transformation. For visualization, we use seaborn and matplotlib to generate graphs that help us know our customers better, and plotly to visualise some of our insights.

Let’s Do It …

We start exploring the dataset by loading it in Python using the pandas library.
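As a minimal sketch of the loading step: the snippet below reads CSV data with pandas. The file location is not given in this article, so the example reads a tiny in-memory stand-in instead; the helper name load_telco and the sample values are assumptions made for illustration, while the column names follow the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV. Column names follow the Telco dataset,
# but the rows are invented for illustration only.
SAMPLE_CSV = """customerID,gender,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0001-AAAAA,Female,0,1,29.85,29.85,No
0002-BBBBB,Male,0,34,56.95,1889.5,No
0003-CCCCC,Male,1,2,53.85,108.15,Yes
"""


def load_telco(source):
    """Hypothetical helper: read churn data from a path or file-like object."""
    return pd.read_csv(source)


# With the real file you would pass its path instead of the StringIO buffer.
telco = load_telco(io.StringIO(SAMPLE_CSV))

print(telco.shape)   # (rows, columns)
print(telco.head())  # first rows as a quick glimpse
```

On the full dataset, telco.shape should report the 7043 rows and 21 columns mentioned in this article.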

Glimpse of the dataset — overall 7043 rows with 21 columns

By using the pandas attribute telco.dtypes and the method telco.nunique() we get a general overview of the dataset.

Then we can start to handle the dataset. First we need to deal with missing values. If missing values exist, it depends on the case whether it makes sense to fill them, for example with the mean, the median or the previous row's value, or, if there is enough training data already, to drop the entry directly.

Luckily, there are no null values in this dataset! We can check for missing values with the .isnull() function.

Tip: instead of checking the data types and missing values separately, we can use the dataframe.info() function. It shows the name of every column, its number of non-null entries and its data type.
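A small sketch of this overview step, using an invented four-row frame in place of the real data. Note that dtypes is an attribute (no parentheses) while nunique() is a method; the overview table that combines both is an addition for convenience, not something the original analysis necessarily did.

```python
import pandas as pd

# Invented sample rows; only the column names mirror the dataset.
telco = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "SeniorCitizen": [0, 0, 1, 0],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75"],  # stored as text
})

print(telco.dtypes)     # data type of each column
print(telco.nunique())  # number of distinct values in each column

# Optional: put both views side by side in one table.
overview = pd.DataFrame({"dtype": telco.dtypes, "n_unique": telco.nunique()})
print(overview)
```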
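The checks and remedies described above can be sketched as follows. The sample frame deliberately contains one missing value so that the calls have something to report; it is invented data, not a row from the dataset.

```python
import io

import numpy as np
import pandas as pd

# Invented sample with one deliberately missing MonthlyCharges value.
telco = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, np.nan, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})

# Count missing values per column.
missing_per_column = telco.isnull().sum()
print(missing_per_column)

# info() prints column names, non-null counts and dtypes in one go.
# Passing buf= captures the report as text instead of printing directly.
buffer = io.StringIO()
telco.info(buf=buffer)
print(buffer.getvalue())

# Two common remedies: fill with the median, or drop incomplete rows.
filled = telco.fillna({"MonthlyCharges": telco["MonthlyCharges"].median()})
dropped = telco.dropna()
```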

Data Preparation

Group the columns by datatype

Converting Features into appropriate datatype

From our data exploration we see that two variables have inappropriate data types. This needs to be dealt with before proceeding further. The column TotalCharges should be a numerical variable, but it is stored in the object format, so we change its datatype with the to_numeric function. The second variable is SeniorCitizen: it is encoded as “1 or 0”, whereas the other binary features use “Yes or No”. To keep it in sync with the other categorical variables, we use the function dataset.Columnname.replace() to transform this column.
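A minimal sketch of the two conversions on invented sample rows. Using errors="coerce" is an assumption on our side: it turns any text that cannot be parsed as a number (for example a blank string) into NaN instead of raising an error, so it is worth re-checking for missing values afterwards.

```python
import pandas as pd

# Invented sample; the blank TotalCharges entry is there to show coercion.
telco = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Text -> float. Unparseable entries become NaN because of errors="coerce".
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")

# 1/0 -> "Yes"/"No", matching the other binary features.
telco["SeniorCitizen"] = telco["SeniorCitizen"].replace({1: "Yes", 0: "No"})

print(telco.dtypes)
print(telco)
```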

Grouping Numerical and Categorical dataset

.select_dtypes('object') is a command that we can use to identify the categorical columns. Since customerID is our identifier variable, we can drop it now.

Similarly, .select_dtypes('number') gives us the numerical columns in our dataset.

Before proceeding further with the EDA, we replace Yes with 1 and No with 0 in our target variable (Churn).
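The grouping and target encoding steps can be sketched like this, again on invented rows. One deviation is labeled in the comments: the categorical columns are selected with exclude="number", which gives the same result as select_dtypes('object') when text columns are stored with the object dtype and does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Invented sample rows; column names mirror the dataset.
telco = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# The identifier carries no predictive information, so drop it first.
telco = telco.drop(columns="customerID")

# Assumption: exclude="number" stands in for select_dtypes("object").
categorical_cols = telco.select_dtypes(exclude="number").columns.tolist()
numerical_cols = telco.select_dtypes("number").columns.tolist()

# Encode the target: Yes -> 1, No -> 0.
telco["Churn"] = telco["Churn"].map({"Yes": 1, "No": 0})

print(categorical_cols)
print(numerical_cols)
```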

Exploratory Data Analysis

Visualizing the numerical features

tenure : Number of months the customer has stayed with the company
MonthlyCharges : The amount charged to the customer monthly
TotalCharges : The total amount charged to the customer

We have three numerical features and as mentioned above we will visualize these features with respect to our target variable to get a better understanding of how one affects the other.
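Before (or alongside) plotting, a quick numeric comparison of the two churn classes already hints at the relationships. The sketch below uses invented values chosen only to make the output readable; it is not a result from the real dataset. The same per-class comparison is what a density plot with churn as the hue would show visually.

```python
import pandas as pd

# Invented sample; Churn is already encoded as 1 (churned) / 0 (stayed).
telco = pd.DataFrame({
    "tenure": [1, 2, 5, 40, 60, 72],
    "MonthlyCharges": [95.0, 89.5, 70.0, 45.0, 25.0, 60.0],
    "TotalCharges": [95.0, 180.0, 350.0, 1800.0, 1500.0, 4300.0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# Median of each numerical feature, separately for churned and retained.
summary = telco.groupby("Churn")[numerical_cols].median()
print(summary)
```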

From the plots above we can draw three conclusions: 1. tenure and MonthlyCharges are important features that strongly affect the churn rate. 2. Clients with higher MonthlyCharges are more likely to churn. 3. The longer a customer’s tenure, the less likely they are to cancel.

Based on these insights, the next step is to bin tenure and MonthlyCharges. As the distribution chart above shows, most entries in tenure are below 20, and the 1-10 range in particular has an extremely high churn rate. The distribution of MonthlyCharges fluctuates, but the pattern suggests a natural split into 3 groups.

So our feature engineering starts by binning tenure and MonthlyCharges.
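One way to bin both features is pd.cut. The exact cut points used in the original analysis are not stated here, so the bin edges, labels and new column names below are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Invented sample values.
telco = pd.DataFrame({
    "tenure": [1, 8, 15, 30, 55, 72],
    "MonthlyCharges": [20.0, 35.0, 50.0, 70.0, 85.0, 110.0],
})

# Hypothetical edges: intervals are closed on the right, e.g. (10, 20].
# include_lowest=True also keeps a value equal to the first edge (0).
telco["tenure_group"] = pd.cut(
    telco["tenure"],
    bins=[0, 10, 20, float("inf")],
    labels=["0-10", "11-20", "21+"],
    include_lowest=True,
)

# Three hypothetical price bands for the monthly charges.
telco["charges_group"] = pd.cut(
    telco["MonthlyCharges"],
    bins=[0, 35, 70, float("inf")],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(telco)
```

If equally sized groups are preferred over fixed edges, pd.qcut is an alternative that bins by quantiles.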

Categorical Visualization

This dataset comprises 16 categorical features: 6 binary features (Yes/No), 9 features with three different classes and 1 feature with four unique classes.

Demographics — We start by looking into the features related with demographics such as Gender, SeniorCitizen, Partner and Dependents.

In the graph “Churn Distribution by Gender” it seems that churn is evenly distributed between genders. In the second graph we see that senior citizens make up only 16% of customers, but they have a much higher churn rate: 42% against 23% for non-senior customers. We then consider Gender and SeniorCitizen together, which leads us to the following conclusions:

  • The younger generation (SeniorCitizen = 0) accounts for most churned customers in absolute numbers, even though its churn rate is lower than that of senior citizens
  • Gender is not a decisive feature
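Because counts and rates can tell different stories, it helps to compute both. The sketch below does this with a groupby on invented data; the numbers are illustrative and are not the 42% and 23% figures quoted above.

```python
import pandas as pd

# Invented sample: six non-senior and two senior customers.
telco = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male",
               "Female", "Male", "Female", "Male"],
    "SeniorCitizen": ["No", "No", "No", "No", "No", "No", "Yes", "Yes"],
    "Churn": [1, 0, 0, 0, 0, 1, 1, 0],
})

# Customers, churned customers and churn rate per age group.
by_senior = telco.groupby("SeniorCitizen")["Churn"].agg(
    customers="count", churned="sum", churn_rate="mean"
)
print(by_senior)

# Churn rate for every gender / age-group combination.
rate_table = telco.pivot_table(
    index="gender", columns="SeniorCitizen", values="Churn", aggfunc="mean"
)
print(rate_table)
```

In this toy sample the non-senior group has more churned customers (2 versus 1) but a lower churn rate, which is the same kind of gap between counts and rates discussed above.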

In terms of Partner and Dependents we can see a pattern and draw the following conclusions: 1. Customers with a partner or dependents are much less likely to cancel their service. 2. Having a partner is highly correlated with having dependents (which seems really logical).

Since the counts differ greatly between categories, let us look at the percentages as well. We find that customers who have both a partner and dependents are more loyal to the company.

There is also an interesting trend in this graph: customers who have partners and dependents show the opposite pattern to customers who don't. Because of this, we created two new features for the training model, NoPartnerNoDep and YoungDep, to test whether they affect the churn rate.
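The exact definitions of the two features are not spelled out here, so the sketch below rests on assumptions: NoPartnerNoDep is taken to mean "no partner and no dependents", and YoungDep is taken to mean "non-senior customer with dependents". The sample rows are invented.

```python
import pandas as pd

# Invented sample rows.
telco = pd.DataFrame({
    "SeniorCitizen": ["No", "No", "Yes", "No"],
    "Partner": ["Yes", "No", "No", "No"],
    "Dependents": ["Yes", "No", "No", "Yes"],
})

no_partner = telco["Partner"].eq("No")
no_dependents = telco["Dependents"].eq("No")
young = telco["SeniorCitizen"].eq("No")

# Assumed definition: neither a partner nor dependents.
telco["NoPartnerNoDep"] = (no_partner & no_dependents).astype(int)

# Assumed definition: non-senior customer who has dependents.
telco["YoungDep"] = (young & ~no_dependents).astype(int)

print(telco)
```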

Payment — There are three features in terms of payment : Contract, PaperlessBilling, PaymentMethod

Let’s talk about Contract first. There are three kinds of contract: month-to-month (M2M), one-year (1Y) and two-year (2Y).

In the upper graph, we can tell that the majority of contracts are M2M, and the second graph shows that these short-term contracts have a higher churn rate. We also find that customers on yearly contracts are more loyal to the brand.

So here we create a new feature that groups all customers who are not on M2M contracts.

Remember we mentioned earlier that the younger generation (SeniorCitizen = 0) accounts for most of the churned customers, so we decided to create one more feature, YoungNotEngaged, to check whether younger customers on an M2M contract are more likely to churn.
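A sketch of both contract features on invented rows. The name Engaged for the "not M2M" flag is an assumption inferred from the name YoungNotEngaged; the contract labels follow the values used in the dataset.

```python
import pandas as pd

# Invented sample rows.
telco = pd.DataFrame({
    "SeniorCitizen": ["No", "No", "Yes", "No"],
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
})

# Assumed name: 1 for any contract longer than month-to-month.
telco["Engaged"] = telco["Contract"].ne("Month-to-month").astype(int)

# Non-senior customers who are on a month-to-month contract.
young = telco["SeniorCitizen"].eq("No")
telco["YoungNotEngaged"] = (young & telco["Engaged"].eq(0)).astype(int)

print(telco)
```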

A bit tired of histograms? Let’s try a different kind of chart to show the distribution of churn in PaperlessBilling.

Compared to paper billing, paperless billing has double the churn rate

Since customers with paperless billing are more likely to churn, we group the customers who have both paperless billing and a month-to-month contract.

We already know that customers with paperless billing are more likely to churn. Now let us zoom in on those customers and check how churn is distributed across payment methods for these low-loyalty paperless-billing customers.
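The original plotting code is not reproduced here. As a stand-in, the sketch below computes the shares that such a chart would display, using pd.crosstab with row-wise normalization; the sample rows are invented. The resulting rows could then be passed to any pie or bar chart function.

```python
import pandas as pd

# Invented sample: four paperless and four paper-billing customers.
telco = pd.DataFrame({
    "PaperlessBilling": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Churn": [1, 1, 0, 0, 1, 0, 0, 0],
})

# normalize="index" makes every row sum to 1, so each cell is the share
# of retained (0) or churned (1) customers within that billing type.
share = pd.crosstab(
    telco["PaperlessBilling"], telco["Churn"], normalize="index"
)
print(share)
```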

Wow! We can see that the preferred payment method is Electronic check, yet this method has a very high churn rate.

To back this up: customers who pay by electronic check have an almost three times higher churn rate than customers using other payment methods.

In the payment category, we decided to create two more features. The first groups customers who use paperless billing and pay by electronic check. Last but not least, the second combines electronic check and M2M, as both have a relatively high probability of churn.
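The payment-related combinations described in this section can be sketched as boolean masks joined with &. The three new column names are hypothetical, and the sample rows are invented; the category labels follow the values used in the dataset.

```python
import pandas as pd

# Invented sample rows.
telco = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "Yes"],
    "PaymentMethod": ["Electronic check", "Electronic check",
                      "Mailed check", "Electronic check"],
})

paperless = telco["PaperlessBilling"].eq("Yes")
m2m = telco["Contract"].eq("Month-to-month")
e_check = telco["PaymentMethod"].eq("Electronic check")

# Hypothetical feature names; each flag is 1 only when both conditions hold.
telco["PaperlessM2M"] = (paperless & m2m).astype(int)
telco["PaperlessECheck"] = (paperless & e_check).astype(int)
telco["ECheckM2M"] = (e_check & m2m).astype(int)

print(telco[["PaperlessM2M", "PaperlessECheck", "ECheckM2M"]])
```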

Service — PhoneService , InternetService

Now let us look at the services that customers are using. There are only two main services: phone and internet but the latter has many additional services that we will talk about later.

From the graph below we find that only a few customers do not have phone service. The number of churned customers is similar for multiple-line and single-line customers, but there are more non-churned customers with a single line. This makes sense: our assumption is that customers with multiple lines have a slightly higher churn rate because some of them realize they don’t need that many lines and end up canceling the service.

At first glance, the second graph shows that clients without internet have a very low churn rate. More importantly, customers with fiber are more likely to churn than those with a DSL connection. We therefore believe Fiber Optic might be an important feature and want to split it out as a feature of its own.
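Splitting fiber out as its own feature can be sketched as a simple equality test. The column name FiberOptic is hypothetical and the rows are invented, so the churn rates printed below illustrate the calculation only.

```python
import pandas as pd

# Invented sample rows; Churn is encoded as 1 (churned) / 0 (stayed).
telco = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "DSL", "No", "No"],
    "Churn": [1, 1, 0, 1, 0, 0],
})

# Hypothetical feature name: 1 for fiber customers, 0 for everyone else.
telco["FiberOptic"] = telco["InternetService"].eq("Fiber optic").astype(int)

# Churn rate per internet service type, for comparison.
rate_by_service = telco.groupby("InternetService")["Churn"].mean()
print(rate_by_service)
```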

Photo by Carl Heyerdahl on Unsplash

Congratulations! 30% of the work is done. In this chapter, we gained a basic understanding of churn and our data, and learned some methods for data preparation and visualization. In the next chapter, we will finish the visualisation and talk about dealing with skewness and creating dummy variables. Most importantly, we are going to focus on the machine learning process. STAY TUNED!
