Customer Churn Data Analysis using Logistic Regression
Can you predict the customers likely to leave the company?
When choosing a telecommunication service provider, customers usually have many choices. They can choose any service provider and may move away from the current provider. When a customer chooses to move away from the current provider to a new provider, it results in loss of business and revenue to the current provider. The percentage of customers moving out and disconnecting the service is known as “churn”. A stable customer base is a key to the success of any business. Businesses try to keep the customers satisfied, to retain them as long as possible. However, in the real world, the customer churn can be as high as 25% annually in the telecommunication industry. Also, the cost of acquiring a new customer is 10 times more than the cost to retain an existing customer. This poses a serious challenge to business owners.
All data used for this analysis is available at:
Code available at:
Role of Data Analytics
Customer churn data analysis can help the company understand the underlying reasons why customers may choose to leave. By applying predictive analytics techniques to existing customer churn records, it is possible to estimate the likelihood of customers switching or discontinuing the service. The company can then work with customers who have a high probability of switching to make sure they remain with the current provider.
Data used for this analysis is available on Kaggle. A cleaned copy is available at GitHub link. It contains several different pieces of information about customers. Using this data, a predictive model for customer churn can be created. It can then be used to understand which customers should be targeted for retention.
What will we do?
The overall problem can be summarized:
1. Creation of a predictive model using the available customer churn data to predict and find customers likely to discontinue the service.
2. The final prediction outcome for any particular customer should be a “Yes” or a “No” (a binary output).
Based on the result of the predictions, the company can choose to take appropriate action in the form of various customer retention strategies and reduce the churn. In the next section, we will find out which method can be used to model the scenario to predict a binary outcome. We will also see how the predictive model works and under what assumptions it should be applied.
Based on the problem statement, we need a predictive model that can do binary classification, that is, predict a Yes/No or 1/0 output variable. One model commonly used for this is Logistic Regression. Logistic regression is a binary classification algorithm belonging to the family of generalized linear models. It can also be extended to problems with more than 2 classes. We can use logistic regression to build a model from the customer churn data and use it to predict whether a particular customer, or each of a set of customers, will discontinue the service.
For example, one of the variables in the data can be the “annual income”. Another variable is the “gender” of the customer. The fitted logistic regression function will tell us how income and/or gender affect the probability of service discontinuation by the customer.
Logistic Regression Equation and Sigmoid Function
Provided below is the Logistic Regression function:
P (Y=1|X)= 1 / (1+ e^(-1x(β0+β1X1+β2X2+β3X3+β4X4+⋯+βnXn)) )
β0 to βn are the coefficients (β0 is the intercept)
X1 to Xn are the independent variables impacting the dependent variable
And P (Y =1 | X) is the probability of a positive outcome.
Notice the exponent in the function. This is where linear regression comes in: (β0+β1X1+β2X2+β3X3+β4X4+⋯+βnXn).
The logistic regression function is a sigmoid function (graph above). As can be seen in the graph, the function has output values between 0 and 1 with a smooth transition between the two levels. This characteristic of the function helps with the prediction of binary outcomes. Depending on the values of the variables, the output moves toward 1 (the customer is likely to leave the company) or toward 0 (the customer is likely to continue with it).
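To make the function concrete, here is a minimal Python sketch of the sigmoid and of P(Y=1|X) for a single customer. The intercept, coefficients, and feature values are made up for illustration and do not come from the churn model.

```python
import math


def sigmoid(z: float) -> float:
    """Map a linear score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(intercept, coefs, features):
    """P(Y=1|X) for one customer: sigmoid of β0 + β1*X1 + ... + βn*Xn."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)


# Hypothetical two-variable model (illustrative numbers only)
p = churn_probability(-1.0, [2.0, -0.5], [0.5, 1.0])
print(round(p, 4))
```

A linear score of 0 gives a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.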
Assumptions for Logistic Regression
The following assumptions are made for logistic regression:
- Binary logistic regression requires the dependent variable to be binary and to follow a binomial distribution (e.g. will a customer discontinue service or not, Yes or No). For more than 2 outcomes (ordinal), logistic regression requires the dependent variable categories to be mutually exclusive and exhaustive.
- Observations should be independent of each other (e.g. data of one customer should not depend on data of another customer, or the same customer should not be used repeatedly in the data)
- Multicollinearity among the independent variables should not exist (e.g. avoid using variables from the customer data which depend on each other, say, City, State, County, and Zip Code are not all independent)
- The linearity of independent variables with respect to log odds of the dependent variable (e.g. log odds of the probability of customer discontinuing the service should be linearly related to various variables like gender, income, etc.)
- Large sample size (e.g. customer churn data has 10000 records)
Data preparation starts with an understanding of the available data for analysis. Customer churn data has 50 fields. One of the important tasks is to determine which fields can be used for regression analysis. There are categorical data fields like Marital, Gender, etc., and continuous numeric data fields like Tenure, Age, etc. Some of these fields are not important for analysis, such as customer ID, interaction, and UID (which are related to customer service interactions). Other fields that are not important for analysis are the Latitude and Longitude of the customer and case order (used as a serial number). We will also check the data for null values and, if any are found, handle them appropriately.
Dummy Variables and Normalization
Categorical data fields, such as gender, internet service, phone service, etc. will have to be converted to corresponding columns containing either 0 or 1. Categorical data like Yes/ No, good/bad/ugly can’t be used for mathematical operations. Hence, these columns containing categorical data will end up becoming columns with 0 or 1 entries.
For continuous numeric data, the scale of data for every column is different. For example, the age column has values ranging from 10 to 89, but the salary column has figures running in 10s or 100s of thousands. Before we can use this data for analysis, it needs to be normalized so that the data is centered around the mean and is measured in terms of its deviation from the mean. This is important for the numerical stability of the model.
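The two preparation steps can be sketched with pandas as follows. The tiny data frame is invented for illustration (the column names mirror the article, the values do not come from the data set), and dropping the first level of each category is my assumption about how the baseline category is chosen.

```python
import pandas as pd

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Two Year", "Month-to-month"],
    "Age": [25.0, 40.0, 61.0, 33.0],
    "Income": [30000.0, 85000.0, 52000.0, 120000.0],
})

# Dummy variables: one 0/1 column per category level.
# drop_first=True leaves the first level out as the baseline category.
encoded = pd.get_dummies(
    df, columns=["Gender", "Contract"], drop_first=True, dtype=int
)

# Normalization (z-score): subtract the mean, divide by the standard deviation
numeric_cols = ["Age", "Income"]
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std()

print(encoded)
```

After this step every categorical field is a set of 0/1 columns, and every numeric column has mean 0 and standard deviation 1, so a unit change in a numeric column corresponds to one standard deviation of the original values.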
As a first step, to check the impact, importance, and significance of various data columns w.r.t. churn analysis, an initial model containing all variables in the dataset will be created. We have prepared the data in the previous step, in a way that includes all the variables at this time.
The initial model will provide information on which variables are significant for predictive analysis. Based on the significance of variables we should be able to eliminate some variables from the data set to arrive at a reduced model. We will then check the accuracy score, confusion matrix, and AUC for the 2 models, to compare their performance.
Let’s start with importing the required libraries. We are using libraries such as sklearn and statsmodels to create the models and to check how well they work, that is, how accurately we are able to predict customer churn. As we have a large number of rows in the data set, I am choosing 70% of the data to be in the training set and 30% of the data to be in the test set.
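A minimal sketch of the 70/30 split and model fit with sklearn is shown below. It uses a synthetic data set from make_classification as a stand-in for the prepared churn data, so the numbers it prints are not the article's results; the random_state and max_iter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared churn data (not the real data set)
X, y = make_classification(n_samples=10_000, n_features=10, random_state=0)

# 70% of the rows go to the training set, 30% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Fit on the training set only, then score on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(len(X_train), len(X_test), round(test_accuracy, 3))
```

Keeping the test set out of the fitting step is what makes the reported accuracy an estimate of performance on unseen customers.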
Analysis of Initial Model
In the initial model summary generated by statsmodels.api logit model, we can see the p-value of various independent variables. Using these p-values we can determine which variables are significant for predictive modeling. This will help determine the list of variables that can be safely removed without any impact on the overall accuracy and confusion matrix.
The initial model (containing the full variable set) has an overall prediction accuracy rate of 89%. The confusion matrix shows that 92.7% of service continuations (specificity) and 79.1% of service discontinuations (sensitivity) are predicted correctly.
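For reference, accuracy, sensitivity, and specificity are all derived from the four cells of the confusion matrix. The sketch below uses made-up counts, not the counts from the churn model, with churn ("Yes") as the positive class.

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute summary rates from confusion matrix counts.

    tn: stayed, predicted to stay      fp: stayed, predicted to leave
    fn: left, predicted to stay        tp: left, predicted to leave
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,     # share of all predictions that are right
        "sensitivity": tp / (tp + fn),     # share of actual leavers that are caught
        "specificity": tn / (tn + fp),     # share of actual stayers that are caught
    }


# Hypothetical counts for a 100-row test set
metrics = confusion_metrics(tn=50, fp=10, fn=5, tp=35)
print(metrics)
```

For churn work, sensitivity is usually the number to watch, because a missed leaver is a customer the retention team never contacts.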
To arrive at the reduced model, we can remove the variables which have higher p-values and use only those variables which have lower p-values. Generally, a threshold of 0.25 can be used for p-value comparison. This gives us a list of independent variables to be used in the reduced model.
The reduced model will have the following variables (Churn is the dependent variable; the rest are independent variables). Other variables have been removed based on their low significance, i.e. high p-values.
‘Churn’, ‘Tenure’, ‘Contacts’, ‘MonthlyCharge’, ‘Bandwidth_GB_Year’, ‘State’, ‘Marital’, ‘Gender’, ‘Techie’, ‘Contract’, ‘Age’, ‘Children’, ‘Email’, ‘Port_modem’, ‘InternetService’, ‘Phone’, ‘Timely_response’, ‘Timely_replacements’, ‘StreamingMovies’, ‘PaperlessBilling’, ‘PaymentMethod’
With the knowledge about significant variables, we removed less significant variables. Now, we will create a data set with only the significant variables and run the analysis.
Reduced Model Performance Analysis
The reduced model has an overall prediction accuracy rate of 89.23%. The confusion matrix shows that 92.82% of service continuations (specificity) and 79.35% of service discontinuations (sensitivity) are predicted correctly. So, we see a pretty accurate model with the reduced data set. However, a few points are still worth noting, and they suggest the model can be reduced further.
With the current reduced model:
1. We are still using a model with 78 independent variables.
2. The data column “State” contributes a large number of dummy variables. Many of those dummy variables have a high p-value while some have a low p-value. For “State” as an entire category, it is still unclear whether the feature is significant for predictive modeling.
To determine the significance of “State” as a categorical variable in the logistic regression model, we need to compare the models with and without this variable. If the difference in model shows that the State column is not significant, it can be removed from the analysis. A similar analysis has to be performed on other categorical variables too.
To compare the 2 models, with and without the State column, we create both models in R using the glm function and run a likelihood ratio test on them. If the p-value turns out to be high, it means we can safely remove the variable, while if the p-value from the likelihood ratio test is low, it would mean that the variable or feature is significant and should not be removed from the predictive model.
Final Reduced Dataset
Based on the likelihood ratio test done in R code above, it was determined that the State column is not significant to the model. The p-value when comparing the reduced model with the same model with the State column removed was 0.4081. These values can be seen above, in the output of the R code. The null hypothesis for the likelihood ratio test states that the coefficients of all the State dummy variables are 0. As the p-value is high, we fail to reject the null hypothesis, so there is no evidence that the column is significant. A similar analysis was done on other columns: Timely_replacements, Timely_response, Multiple and Gender. They were also found to be insignificant. This brings us to our final reduced model. The final list of variables in the reduced data set is provided below:
StreamingMovies, PaperlessBilling, PaymentMethod
Initial and final reduced models are not very different when it comes to the overall performance of how accurately they predict the outcome. However, there is a big difference in the number of variables and consequently, the reduced model is lean, faster, and less resource-intensive. Provided below is a comparison of performance for the initial and final models.
The variables which were determined to be not significant and hence removed from the predictive analysis are:
‘State’, ‘Population’, ‘Area’, ‘Income’, ‘Gender’, ‘Outage_sec_perweek’, ‘Email’, ‘Yearly_equip_failure’, ‘Tablet’, ‘Multiple’, ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’, ‘StreamingTV’, ‘Timely_response’, ‘Timely_fixes’, ‘Timely_replacements’, ‘Reliability’, ‘Options’, ‘Respectful_response’, ‘Courteous_exchange’, ‘Evidence_of_active_listening’
We have created a pretty accurate predictive model, with approximately 89% accuracy in predicting the behavior of the customers. The model correctly identifies approximately 80% of the customers who go on to disconnect (sensitivity). The AUC score of the final model is 0.954, which is very close to a perfect 1. With this model, the company can now have great insight into which customers may discontinue their service. The model tells us the variables and their corresponding impact on churn.
AUC or Area under the curve
The area under the curve metric tells us how good our model is at separating the two outcomes. The curve in question is the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies. If the area under the curve is 0.5, our model is no better than a random guess. As AUC values approach 1, the model gets better at ranking customers who churn above those who do not. At a perfect 1, the model separates the two outcomes completely.
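One way to build intuition for AUC: it equals the fraction of (churner, non-churner) pairs in which the churner receives the higher predicted score, with ties counted as half. The sketch below computes it that way in plain Python on four made-up predictions. In practice a library function such as sklearn's roc_auc_score would normally be used instead.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # churner scored above non-churner
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churned) and predicted probabilities
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 pairs are ranked correctly -> 0.75
```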
Logistic Regression Equation
The logistic regression equation is:
P (Y=1|X) = 1 / (1 + exp(-y))
where y = β0 + β1X1 + β2X2 + β3X3 + … βnXn
where n is the number of variables in the model and βn is the value of the nth coefficient.
In the logistic regression model we created above, the value of n is 26, including all the dummy variables. To keep the overall equation readable, coefficients are rounded to 2 digits after the decimal.
Equation with necessary changes to make it readable is:
P (Y=1|X) = 1 / (1 + exp(-1x( -3.15 x Tenure +0.09 x Contacts +2.28 x MonthlyCharge +0.19 x Bandwidth_GB_Year -0.02 x Age +0.0 x Children +1.04 x Techie -3.24 x Contract_One_year -3.33 x Contract_Two_Year +0.05 x Marital_Married +0.08 x Marital_Never_Married +0.13 x Marital_Separated +0.22 x Marital_Widowed +0.19 x Port_modem -2.2 x InternetService_Fiber_Optic -0.62 x InternetService_None -0.37 x Phone +0.44 x StreamingMovies +0.15 x PaperlessBilling +0.25 x PaymentMethod_Credit_Card_automatic +0.62 x PaymentMethod_Electronic_Check +0.22 x PaymentMethod_Mailed_Check -1.09)))
Another way to write the equation is:
Odds = P / (1 - P) = exp(-3.15 x Tenure +0.09 x Contacts +2.28 x MonthlyCharge +0.19 x Bandwidth_GB_Year -0.02 x Age +0.0 x Children +1.04 x Techie -3.24 x Contract_One_year -3.33 x Contract_Two_Year +0.05 x Marital_Married +0.08 x Marital_Never_Married +0.13 x Marital_Separated +0.22 x Marital_Widowed +0.19 x Port_modem -2.2 x InternetService_Fiber_Optic -0.62 x InternetService_None -0.37 x Phone +0.44 x StreamingMovies +0.15 x PaperlessBilling +0.25 x PaymentMethod_Credit_Card_automatic +0.62 x PaymentMethod_Electronic_Check +0.22 x PaymentMethod_Mailed_Check -1.09)
Using this form of the equation we can interpret the impact of variables on the odds of churn, defined as the probability of a customer leaving the company divided by the probability of the customer not leaving the company. The factor by which the odds change when a variable changes is the odds ratio. In the rest of this article, a statement such as “the odds ratio goes down by 10%” means that the odds of churn are multiplied by 0.90. Reducing the odds is a favorable case for the company or the business, as it means the probability of discontinuing the service goes down (positive impact on churn as the company wants to keep the customer). An increase in the odds would be a negative impact as the customer would be more likely to leave the company than before. We will see how a unit change in different variables can impact the odds.
Interpretation of coefficients
Individual coefficients can be interpreted through the change in the odds ratio for every unit change in the variable: a coefficient β multiplies the odds by exp(β). Due to the normalization done on the continuous numerical variables, a unit change in such a variable is equal to one standard deviation of the original variable. For categorical variables, the change is a switch from 0 to 1.
I am calculating the impact of a unit increase or decrease in the value of the independent variables (unnormalized form) in terms of reduction in odds ratio.
Tenure: Odds Ratio Reduction with unit increase: 11.23%, Odds Ratio Reduction with unit decrease: -12.65%
MonthlyCharge: Odds Ratio Reduction with unit increase: -9.0%, Odds Ratio Reduction with unit decrease: 8.26%
Age: Odds Ratio Reduction with unit increase: 0.08%, Odds Ratio Reduction with unit decrease: -0.08%
InternetService_Fiber_Optic: Odds Ratio Reduction: 88.92%
InternetService_None: Odds Ratio Reduction: 46.21%
Contract_One_year: Odds Ratio Reduction: 96.08%
Contract_Two_Year: Odds Ratio Reduction: 96.42%
Phone: Odds Ratio Reduction: 30.93%
Analyzing the coefficients, we can see how a particular variable can impact the churn. Let’s take a few examples, and understand in detail:
- For variable Tenure, we see a negative coefficient (-3.15 x Tenure). If the customer has spent a long time with the company (more than the mean), they are less likely to discontinue, but if they have spent less time with the company, they are more likely to discontinue. The mean tenure is 34.53 months. Tenure has a standard deviation of 26.44 months. For a customer, with every extra month spent with the company, the odds ratio of leaving the company goes down by 11.2% (1 - exp(-3.15/26.44)).
- For the variable MonthlyCharge, the coefficient is 2.28. With every dollar increase in the monthly charge, the customer becomes more likely to leave the service. Conversely, with every dollar decrease in the monthly charge, the customer is more likely to stay. With a dollar reduction in the monthly charge, the odds ratio goes down by 8.26%. With a discount of $5 per month, the odds ratio goes down by 35%. With a $10 per month discount, the odds ratio goes down by 57.78%.
- Customers switching from DSL service (default) to fiber optic internet service reduce the odds ratio by 88.92%.
- Customers switching from a month-to-month contract (default) to a one-year contract will reduce the odds ratio by 96.08%. And those switching to a two-year contract reduce the odds ratio by 96.42%.
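The percentages for Tenure and for the contract dummies can be reproduced from the rounded coefficients with a few lines of Python. This is a sketch of the arithmetic only; the helper name is my own, and results can differ slightly from figures computed with unrounded coefficients.

```python
import math


def odds_reduction_pct(coef: float, std: float = 1.0, delta: float = 1.0) -> float:
    """Percent reduction in the odds of churn for a change of `delta` raw units.

    coef  : coefficient fitted on the normalized variable
    std   : standard deviation of the original variable (1.0 for 0/1 dummies)
    delta : change in the original variable (1.0 for a 0 -> 1 switch)
    """
    return (1.0 - math.exp(coef * delta / std)) * 100.0


# One extra month of tenure (coefficient -3.15, standard deviation 26.44 months)
tenure_up = odds_reduction_pct(-3.15, std=26.44, delta=1.0)

# Switching a dummy variable from 0 to 1
one_year = odds_reduction_pct(-3.24)   # month-to-month -> one-year contract
two_year = odds_reduction_pct(-3.33)   # month-to-month -> two-year contract

print(round(tenure_up, 2), round(one_year, 2), round(two_year, 2))
```

A negative result from this helper means the odds go up, which is how a positive coefficient shows up in the list of reductions.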
Recommendations: What actions can be taken to reduce customer churn?
With the availability of this model, it is possible to predict if a customer is likely to discontinue the service. However, the company also needs to make plans and implement them to reduce the churn rate. Let’s check a few facts shown by the model in detail and discuss what possible course of action the company can take to increase customer retention and reduce churn.
1. Customers paying high monthly payments are more likely to discontinue the service. The company can offer better deals and discounts to the customers who are found likely to discontinue the service using the predictive model. Based on the interpretation of coefficients giving discounts is a great way to reduce the probability of customer loss. A five-dollar monthly discount can reduce the odds ratio by 35%. The mean monthly payment is $172. Many customers are paying more than one std deviation above the mean (214+ dollars, 1800 such customers in the data set). And some customers are paying more than 2 std deviations above mean ($256+ per month, 280 such customers in the data set). These customers are at a high risk of leaving the service provider.
2. As per the analysis done in the interpretation of the coefficients section, if a customer signs up for a 2-year contract, the odds ratio goes down by 96.42%. If a customer signs up for a 1-year contract, then the odds ratio goes down by 96.08%. If the company can work on converting month-to-month customers to one or two-year contracts, the likelihood of customers discontinuing the service will be greatly reduced. This could have a huge positive impact on the churn rate, as there are 5456 customers on month-to-month contracts in the data set (which is more than 50% of the customers in the data set).
3. Internet Service consumers who have DSL connections are more likely to discontinue. There may be competitors providing better service than DSL internet and customers discontinue to opt for better quality service with another provider. The company can check if it can offer DSL service alternatives and retain those customers. It is likely that in certain areas the company offers only DSL internet service with no availability of alternatives. The company should plan for upgrading existing DSL customers to fiber options or another better internet technology. As seen in the interpretation of the coefficient section, upgrade to the fiber option reduces the odds ratio by 88.92%. Surprisingly, customers with no internet service are less likely to discontinue the service than customers with DSL internet connections. If someone switches from DSL to no internet, the odds ratio goes down by 46.21%.
4. Customers who are new and haven’t spent much time with the company are more likely to discontinue the service. It is possible that the company offers good deals during the first year of service and increases prices afterward. Data shows that the customers discontinuing the service have a mean tenure of around 13.5 months. The expiration of offers or discounted prices at the end of the first year, and the resulting higher cost of service, may be a reason for customers moving out. The company can proactively work on retaining customers who are completing their first year by sending them new offers or extending their discounts. As seen in the interpretation of the coefficient section, with every extra month passing in service, the odds ratio for discontinuation goes down by 11.23%. Keeping customers longer with the company can help reduce churn.
5. Adding a phone line can reduce the odds ratio by 30.93%. The company should try to get more customers to sign-up for phone service if they can, as it increases customer retention.
6. Streaming customers seem to have a high churn rate. It is important to understand what is causing the churn in those customers. It could be that they are not satisfied with the service quality. In general, streaming requires a high-speed connection with no drops in connectivity. Further analysis should be done to understand the issues with streaming customers. Knowing that streaming customers are more likely to discontinue, the company should proactively work with the customer to eliminate the cause of their dissatisfaction and resulting churn.
7. The churn rate with “Techie” customers is high. Further investigations can be performed to understand what those tech-savvy customers don’t like about the overall service. It could be due to hard-to-use technology or technology being poorly designed or possibly less freedom to control, tweak or change the service parameters reducing the tech appeal or poor website performance or any other possible reason which tech-savvy customers don’t like. With proactive steps and changes, the company can retain these tech-savvy customers better.
So, as you can see here, we did a comprehensive analysis of customer churn data using Logistic Regression to gain insights into customer behavior.