Supervised Learning on Python — Predicting Customer Churn 1
Data Preparation and Feature Visualization
Classification Machine Learning
In supervised learning, the system learns from previously labeled examples. Because our target (churned or not) is categorical, this is a classification problem: the model learns to separate the classes rather than to fit a line of best fit through the data, as it would in regression. In this project we use several models to predict customer churn.
Before we start, what on earth is churn rate?
Churn quantifies the number of customers who have unsubscribed or canceled their service contract. Customers turning their backs on your service or products is no fun for any business: it is costly to win them back once lost, not to mention that unsatisfied customers will not give you the best word-of-mouth marketing.
Reducing Customer Churn Is More Cost-Effective Than Chasing New Customers
Objective
In the competitive world of telecommunication carriers, customer retention is key. According to Harvard Business Review, acquiring a new customer can cost up to 25 times more than retaining an existing one.
Telecommunication carriers use big data analytics around demographics, usage, customer accounts, connectivity, network performance and reliability, customer support and service issues, and much more, to reduce customer churn rate. Yet the scale of this data is often too vast for their existing analytics tools, which makes it hard to run ad-hoc queries quickly or to visualize and dashboard customer data in its entirety. Hence, in this study we analyze real telecom data to find customer patterns and predict future churn.
Toolkit
One of the most valuable assets a company has is data. As data is rarely shared publicly, we take an available dataset you can find on IBM’s website. The raw dataset contains more than 7000 entries. Each entry has several features and, of course, a column stating whether the customer has churned.
To predict whether a customer will churn, we work with Python 3 and its open source libraries. First we import pandas and NumPy, which put our data in an easy-to-use structure for analysis and transformation. For visualization we use seaborn and matplotlib to generate graphs that help us know our customers better, and plotly to visualise some of our insights.
Let’s Do It …
We start exploring the dataset by loading it into Python with the pandas library.
# Import the libraries and the data:
import numpy as np
import pandas as pd

telco = pd.read_csv("/~/telco.csv")
telco.head()
# review the size of the dataset
telco.shape
# always good to review the data types before we start analyzing
telco.dtypes
telco.nunique()
With the pandas attribute telco.dtypes and the method telco.nunique() we get a general overview of the dataset.
Then we can start to handle the dataset. First we need to deal with missing values. If there are any, it depends on the case whether it makes sense to fill them, for example with the mean, the median or the previous row’s value, or, if there is enough training data already, to drop the affected entries directly.
Luckily, in this dataset, there are no null values! We can identify missing values with the .isnull() function. Note that .isnull() does not flag blank strings, which is why TotalCharges gets separate treatment below.
telco.isnull().any()
Tip: instead of checking the data types and missing values separately, we can use the dataframe.info() function. It shows the name of every column, its number of non-null entries and its data type.
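The filling strategies described above can be sketched on a small made-up frame. The column values below are illustrative assumptions, not rows from the Telco data, and which strategy fits depends on the feature.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are invented for illustration only.
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 12.0, 30.0],
    "MonthlyCharges": [20.0, 55.0, np.nan, 90.0],
})

# Column names, non-null counts and dtypes in one call.
toy.info()

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Option 1: fill each column with its own median (use .mean() for the mean).
filled_median = toy.fillna(toy.median())

# Option 2: carry the previous row's value forward.
filled_ffill = toy.ffill()

# Option 3: drop every row that contains a missing value.
dropped = toy.dropna()

print(missing_per_column)
print(filled_median)
```

Dropping rows is the simplest option but shrinks the training set, so it is usually reserved for cases where only a few rows are affected.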
Data Preparation
# Separate data into categorical and numerical
telco.select_dtypes('object').head()
Group the columns by datatype
Converting Features into appropriate datatype
# transform the total charges: blank strings become 0, then cast to float
telco['TotalCharges'] = telco['TotalCharges'].replace(" ", 0).astype('float64')
telco['TotalCharges'] = pd.to_numeric(telco['TotalCharges'])
# sync with other categorical data (assign the result back instead of using inplace on a column)
telco['SeniorCitizen'] = telco['SeniorCitizen'].replace({1: 'Yes', 0: 'No'})
From our data exploration we see that two variables have the wrong data type, which needs to be dealt with before proceeding. The column TotalCharges should be numerical, but it is stored in object format because some entries are blank strings; we replace the blanks with 0 and convert the column to a numeric type. The second variable is SeniorCitizen: unlike the other binary features, which use “Yes” or “No”, it uses 1 or 0. To sync it with the other categorical variables, we use the replace() function to transform this column.
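As an alternative sketch, not the approach used above: pd.to_numeric with errors="coerce" turns unparseable strings into NaN, which makes the affected rows visible before we decide how to fill them. The toy values below are assumptions for illustration.

```python
import pandas as pd

# Toy column mimicking a numeric feature stored as text, with one blank entry.
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# errors="coerce" converts anything that cannot be parsed into NaN.
as_number = pd.to_numeric(raw, errors="coerce")

# Inspect how many entries were blank before choosing a fill value.
n_blank = int(as_number.isnull().sum())

# Fill with 0, matching the choice made in the code above.
cleaned = as_number.fillna(0)

print(n_blank)
print(cleaned.dtype)
```

Counting the coerced entries first is a cheap sanity check: if many rows turn out blank, filling with 0 may distort the feature more than expected.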
Grouping Numerical and Categorical dataset
Categorical
# check the categorical variables
telco.select_dtypes('object').head()
# group the categorical columns together
categorical = list(telco.select_dtypes('object').columns.drop('customerID'))
.select_dtypes('object') is a command we can use to identify the categorical columns. Since customerID is only an identifier, we drop it from the list.
# have a look at the numerical variables
telco.select_dtypes('number').head()
# group the numerical columns together
numerical = list(telco.select_dtypes('number').columns)
With the similar call .select_dtypes('number'), we can find the numerical columns in our dataset.
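A variant worth knowing, shown here on an invented toy frame: select_dtypes also accepts an exclude argument, so the categorical group can be defined as "everything that is not numeric". This is only a sketch; the column values are assumptions.

```python
import pandas as pd

# Toy frame with one identifier, one categorical and two numerical columns.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 24, 60],
    "MonthlyCharges": [29.85, 56.95, 99.65],
})

# Everything non-numeric, minus the identifier column.
categorical = list(toy.select_dtypes(exclude="number").columns.drop("customerID"))

# Everything numeric.
numerical = list(toy.select_dtypes("number").columns)

print(categorical)  # ['Contract']
print(numerical)    # ['tenure', 'MonthlyCharges']
```

Defining the two groups as complements guarantees that every column except the identifier ends up in exactly one list.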
Before proceeding further with the EDA, we replace Yes with 1 and No with 0 in our target variable (Churn):
# a dict maps each label to its code; the result is assigned back to the column
telco['Churn'] = telco['Churn'].map({'Yes': 1, 'No': 0})
Exploratory Data Analysis
Visualizing the numerical features
tenure: Number of months the customer has stayed with the company
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
We have three numerical features and as mentioned above we will visualize these features with respect to our target variable to get a better understanding of how one affects the other.
# plot the distribution of the numerical features
# (plot_distribution_num is a plotting helper whose definition is not shown here)
plot_distribution_num('tenure')
plot_distribution_num('MonthlyCharges')
plot_distribution_num('TotalCharges')
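The helper called above is not defined in this article. The following is a hypothetical stand-in, assuming matplotlib and a 0/1 Churn column; unlike the calls above it takes the DataFrame as an explicit second argument, and the original helper may well look different.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_distribution_num(feature, data):
    """Overlay histograms of `feature` for retained and churned customers."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for churn_value, label in [(0, "No Churn"), (1, "Churn")]:
        values = data.loc[data["Churn"] == churn_value, feature]
        # density=True makes the two groups comparable despite different sizes
        ax.hist(values, bins=20, alpha=0.6, density=True, label=label)
    ax.set_title(f"Distribution of {feature} by churn")
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    return ax


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 40, 55, 60, 72],
    "Churn": [1, 1, 1, 0, 0, 0, 0, 0],
})
ax = plot_distribution_num("tenure", toy)
```

Plotting densities rather than raw counts keeps the smaller churn group visible next to the larger no-churn group.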
From the plots above we can draw three conclusions:
1. tenure and MonthlyCharges are important features; they strongly affect the churn rate.
2. Clients with higher MonthlyCharges are more likely to churn.
3. The longer a customer’s tenure, the less likely they are to cancel.
Based on these insights, we bin tenure and the monthly charges. As the distribution chart above shows, most entries in tenure are below 20, and the 1-10 range in particular has an extremely high churn rate. The distribution of MonthlyCharges fluctuates, but the pattern is clear enough to bin it into 3 groups.
So we carry out the feature engineering by binning tenure and MonthlyCharges:
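# pd.cut uses right-closed intervals, so the tenure bins are (0, 10], (10, 20], (20, 80];
# a tenure of exactly 0 falls outside them and becomes NaN unless include_lowest=True is passed
cut_labels = ['tenure 1_10', 'tenure 11_20', 'tenure 21above']
cut_bins = [0, 10, 20, 80]
telco['tenure_amend'] = pd.cut(telco['tenure'], bins=cut_bins, labels=cut_labels)
# same idea for the monthly charges (labels renamed so they no longer say "tenure")
cut_labels = ['MonthlyCharges 1_40', 'MonthlyCharges 41_60', 'MonthlyCharges 61above']
cut_bins = [0, 40, 60, 200]
telco['MonthlyCharges_amend'] = pd.cut(telco['MonthlyCharges'], bins=cut_bins, labels=cut_labels)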
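To check whether the bins actually separate churners from non-churners, one option is to compute the churn rate per bin. The sketch below uses invented toy rows and assumes Churn is already coded as 0/1.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "tenure": [1, 5, 10, 11, 20, 21, 72],
    "Churn": [1, 1, 0, 1, 0, 0, 0],
})

cut_bins = [0, 10, 20, 80]
cut_labels = ["tenure 1_10", "tenure 11_20", "tenure 21above"]
toy["tenure_amend"] = pd.cut(toy["tenure"], bins=cut_bins, labels=cut_labels)

# The mean of a 0/1 target within each bin is that bin's churn rate.
churn_rate_by_bin = toy.groupby("tenure_amend", observed=False)["Churn"].mean()
print(churn_rate_by_bin)
```

If neighbouring bins show nearly the same rate, merging them loses little information; a sharp drop between bins suggests the cut points are well placed.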
Categorical Visualization
This dataset comprises 16 categorical features: 6 binary features (Yes/No), 9 features with three different classes and 1 feature with four unique classes.
Demographics — We start by looking into the features related with demographics such as Gender, SeniorCitizen, Partner and Dependents.
In the graph “Churn Distribution by Gender” churn appears evenly distributed between genders. In the second graph we see that senior citizens (SeniorCitizen) are only 16% of customers, but they have a much higher churn rate: 42% against 23% for non-senior customers. We further consider the features Gender and SeniorCitizen together, which leads us to the following conclusions:
- The younger generation (SeniorCitizen = 0) makes up most of the churned customers in absolute numbers, even though its churn rate is lower
- Gender is not a paramount feature
In terms of Partner and Dependents we can see a pattern and draw the following conclusions:
1. Customers with a partner or dependents are much less likely to cancel their service.
2. Having a partner is highly correlated with having dependents (which seems logical).
Since the counts differ greatly between categories, let us look at the percentages as well. We find that customers who have both a partner and dependents are the most loyal to the company.
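One way to get such percentages is pd.crosstab with normalize="index", which converts each row of counts into shares. The sketch below runs on invented toy rows and assumes Churn is coded as 0/1.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "Partner":    ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Dependents": ["Yes", "Yes", "No",  "No",  "No", "No", "No", "Yes"],
    "Churn":      [0,     0,     0,     1,     1,    1,    0,    0],
})

# Rows: (Partner, Dependents) combinations; columns: Churn; each row sums to 1.
churn_share = pd.crosstab(
    [toy["Partner"], toy["Dependents"]],
    toy["Churn"],
    normalize="index",
)
print(churn_share)
```

The column labelled 1 is the churn rate of each combination, which is comparable across groups of very different sizes.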
There is also an interesting trend in this graph: customers who have partners and dependents show the opposite pattern to customers who don't. We therefore created two new features for the training model, to test whether they affect the churn rate: NoPartnerNoDep and YoungDep.
# No Partner and No Dependent (both columns hold 'Yes'/'No' strings)
telco['NoPartnerNoDep'] = np.where((telco['Partner'] == 'No') & (telco['Dependents'] == 'No'), 1, 0)
telco['NoPartnerNoDep'] = telco['NoPartnerNoDep'].astype('object')
# Young and Have Dependent (SeniorCitizen was recoded to 'Yes'/'No' earlier)
telco['YoungDep'] = np.where((telco['SeniorCitizen'] == 'No') & (telco['Dependents'] == 'Yes'), 1, 0)
telco['YoungDep'] = telco['YoungDep'].astype('object')
Payment — There are three features in terms of payment : Contract, PaperlessBilling, PaymentMethod
Let’s talk about Contract first. There are three different kinds of contract: month-to-month (M2M), one-year (1Y) and two-year (2Y).
In the upper graph, we can tell that the majority of contracts are M2M, and the second graph shows that these short-term contracts have a higher churn rate. We also find that customers on yearly contracts are more loyal to the brand.
So, here we create a new feature that groups all customers who are not on M2M contracts.
# Long term contract
telco['longtermcontract'] = np.where(telco['Contract'] != 'Month-to-month', 1, 0)
Since customers who are not senior citizens (SeniorCitizen = 0) make up the bulk of our churned customers, we create one more feature, YoungNotEngaged, to check whether younger customers on an M2M contract are more likely to churn.
# Young and not engaged (no long-term contract)
telco['YoungNotEngaged'] = np.where((telco['SeniorCitizen'] == 'No') & (telco['longtermcontract'] == 0), 1, 0)
telco['YoungNotEngaged'] = telco['YoungNotEngaged'].astype('object')
A bit tired of the histogram? Let’s try something new to show the distribution of churn in PaperlessBilling by the code below:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 40))
# Data to plot
labels = 'No Churn', 'Churn'
color = ['#F3CD05', '#36688D']
explode = (0.1, 0)  # explode 1st slice
# Plot customers without paperless billing
ax1 = plt.subplot(121)
sizes = [2403, 469]
ax1.pie(sizes, explode=explode, labels=labels, colors=color, autopct='%1.1f%%', shadow=True, startangle=140)
ax1.set_title('Not Paperless Billing')
# Plot customers with paperless billing
ax2 = plt.subplot(122)
sizes = [2771, 1400]
ax2.pie(sizes, explode=explode, labels=labels, colors=color, autopct='%1.1f%%', shadow=True, startangle=140)
ax2.set_title('Paperless Billing')
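The slice sizes above are typed in by hand. A small sketch of how they could be derived from the data instead, assuming Churn is coded as 0/1; the helper name pie_sizes and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "PaperlessBilling": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
    "Churn": [1, 0, 1, 0, 0, 0],
})


def pie_sizes(data, paperless_value):
    """Return [no-churn count, churn count] for one PaperlessBilling value."""
    subset = data.loc[data["PaperlessBilling"] == paperless_value, "Churn"]
    # reindex guarantees both classes appear, in a fixed order, even if one is absent
    counts = subset.value_counts().reindex([0, 1], fill_value=0)
    return counts.tolist()


sizes_not_paperless = pie_sizes(toy, "No")
sizes_paperless = pie_sizes(toy, "Yes")
print(sizes_not_paperless, sizes_paperless)
```

Computing the sizes this way keeps the chart in sync with the data if rows are filtered or the dataset is refreshed.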
Since customers with paperless billing are more likely to churn, we group the customers who have both paperless billing and a month-to-month contract.
telco['M2MandPaperless'] = np.where((telco['Contract'] == "Month-to-month") & (telco['PaperlessBilling'] == "Yes"), 1, 0)
telco['M2MandPaperless'] = telco['M2MandPaperless'].astype('object')
We already know that customers with paperless billing are more likely to churn. Now let us zoom in on those customers and check how churn is distributed across payment methods for these low-loyalty paperless-billing customers.
Wow! We can see that the preferred payment method is Electronic check, yet this method has a very high churn rate.
Put in numbers, customers who pay by electronic check have an almost three times higher churn rate than customers using other payment methods.
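A ratio like this can be computed directly by comparing the churn rate of electronic-check customers with that of everyone else. The sketch below uses invented toy rows (so the resulting ratio is not the real one) and assumes Churn is coded as 0/1.

```python
import pandas as pd

# Toy rows, invented for illustration; four customers per payment method.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check"] * 4
                     + ["Mailed check"] * 4
                     + ["Credit card (automatic)"] * 4,
    "Churn": [1, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1],
})

is_echeck = toy["PaymentMethod"] == "Electronic check"

# Churn rate = mean of the 0/1 target within each group.
rate_echeck = toy.loc[is_echeck, "Churn"].mean()
rate_others = toy.loc[~is_echeck, "Churn"].mean()

ratio = rate_echeck / rate_others
print(rate_echeck, rate_others, ratio)
```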
In the payment category, we decided to create two more features. The first groups customers who use paperless billing and pay by electronic check. Last but not least, we combine electronic check and M2M, as both have a relatively high probability of churn.
# Paperless billing and Electronic check (the column name says credit card, but the filter is electronic check)
telco['Paperlessbycreditcard'] = np.where((telco['PaperlessBilling'] == "Yes") & (telco['PaymentMethod'] == "Electronic check"), 1, 0)
telco['Paperlessbycreditcard'] = telco['Paperlessbycreditcard'].astype('object')
# Electronic check on an M2M contract (longtermcontract == 0 means month-to-month)
telco['ElectCheck'] = np.where((telco['PaymentMethod'] == 'Electronic check') & (telco['longtermcontract'] == 0), 1, 0)
telco['ElectCheck'] = telco['ElectCheck'].astype('object')
Service — PhoneService , InternetService
Now let us look at the services that customers are using. There are only two main services, phone and internet, but the latter has many additional services that we will talk about later.
From the graph we find that only a few customers do not have phone service. The number of churned customers is similar between multiple-line and single-line customers, but there are more non-churned customers with a single line. This makes sense: our assumption is that customers with multiple lines have a slightly higher churn rate because some of them realize they don’t need that many lines and end up canceling the service.
At first glance, the second graph shows that clients without internet service have a very low churn rate. Secondly, and more importantly, customers with fiber optic are more likely to churn than those with a DSL connection. We therefore believe fiber optic might be an important feature and split it out into its own column.
telco['fiberopt'] = np.where((telco['InternetService'] == 'Fiber optic'), 1, 0)
telco['fiberopt'] = telco['fiberopt'].astype('object')
Congratulations! 30% of the work is done. In this chapter, we gained a basic understanding of churn and of our data, and learned some techniques for data preparation and visualization. In the next chapter, we will finish the visualisation and talk about dealing with skewness and creating dummy variables. Most importantly, we are going to focus on the machine learning process. STAY TUNED!