Unveiling the Drivers of Customer Churn: An Analytical Journey to Improve Retention- Part 1

Alidu Abubakari
16 min readMar 17, 2023

--

INTRODUCTION

As customer attrition can be one of the biggest expenditures for any organization, it is essential to understand what causes customer churn or customer turnover. Customer churn refers to the percentage of customers who stop using a company’s product or service within a specified timeframe. For example, if a company starts the year with 500 customers but ends with only 480 customers, the percentage of customers that left would be 4%. By identifying the reasons why customers leave and when they leave, organizations can develop effective retention strategies to prevent customer churn.

The goal of this project is to determine the likelihood of a customer leaving the organization and identify the key indicators of churn. Additionally, we will investigate the retention strategies that can be implemented to prevent customer churn. By analyzing customer data and identifying patterns, we can provide insights to help organizations retain their customers and improve their bottom line.

Data Loading

The data Telco-Customer-Churn is loaded using pandas.

# For CSV, use pandas.read_csv

churn_data = Telco_churn = pd.read_csv("Telco-Customer-Churn.csv")

Our data contains 7043 customers with 21 columns containing various customer information that can be summarized as customer’s demographic, account, and subscription information. The “churn” column which is the column we want to predict contains information on whether the customer churned or not. The following describes the columns present in the data.

  • Gender — Whether the customer is a male or a female
  • SeniorCitizen — Whether a customer is a senior citizen or not
  • Partner — Whether the customer has a partner or not (Yes, No)
  • Dependents — Whether the customer has dependents or not (Yes, No)
  • Tenure — Number of months the customer has stayed with the company
  • Phone Service — Whether the customer has a phone service or not (Yes, No)
  • MultipleLines — Whether the customer has multiple lines or not
  • InternetService — Customer’s internet service provider (DSL, Fiber Optic, No)
  • OnlineSecurity — Whether the customer has online security or not (Yes, No, No Internet)
  • OnlineBackup — Whether the customer has online backup or not (Yes, No, No Internet)
  • DeviceProtection — Whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport — Whether the customer has tech support or not (Yes, No, No internet)
  • StreamingTV — Whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies — Whether the customer has streaming movies or not (Yes, No, No Internet service)
  • Contract — The contract term of the customer (Month-to-Month, One year, Two year)
  • PaperlessBilling — Whether the customer has paperless billing or not (Yes, No)
  • Payment Method — The customer’s payment method (Electronic check, mailed check, Bank transfer(automatic), Credit card(automatic))
  • MonthlyCharges — The amount charged to the customer monthly
  • TotalCharges — The total amount charged to the customer
  • Churn — Whether the customer churned or not (Yes or No)

Exploratory Data Analysis: EDA

The goal of manual EDA is to gain a deeper understanding of the data and identify potential problems and opportunities for analysis. By combining basic statistics, visualization techniques, and common sense, you can get a good sense of the data and start developing insights and hypotheses for further analysis.

Univatiate Analysis

Univariate analysis is a statistical analysis technique that focuses on analyzing one variable at a time. In other words, univariate analysis involves examining the distribution, central tendency, variability, and other properties of a single variable in isolation from other variables.

Analyzing the churn attribute:

73.46 % of customers did not churn, and 26.54% of customers did churn.

Having an imbalanced dataset can lead to biased machine learning models. In this case, if we train a model on this imbalanced dataset, it may have a higher accuracy in predicting the majority class (customers who did not churn) but may perform poorly in predicting the minority class (customers who did churn). This is because the model may learn to always predict the majority class since it occurs more frequently in the dataset.

To address this issue, we can use techniques such as oversampling the minority class, undersampling the majority class, or using a combination of both. We can also use algorithms that are specifically designed to handle imbalanced datasets, such as random forest or gradient boosting. It’s important to keep in mind that choosing the right approach depends on the specific dataset and problem at hand.

Exploratory Data Analysis (EDA) of Numeric Columns in churn data using Histogram and Boxplot

I explored the univariate analysis of each attribute using Histogram and Boxplot. Skewness is a measure of the degree of asymmetry of a distribution. Kurtosis, on the other hand, is a measure of the degree of peakedness of a distribution relative to the normal distribution.

A skewness value of 1.83 and kurtosis value of 1.35 for the “SeniorCitizen” attribute suggest that the distribution is skewed to the right and is relatively flat compared to a normal distribution.

A skewness value of 0.24 and kurtosis value of -1.39 for the “Tenure” attribute suggest that the distribution is approximately symmetric with a slightly longer tail on the right side of the distribution. The relatively flat distribution suggests that there are fewer extreme values or outliers for the “Tenure” attribute

A skewness value of -0.22 and kurtosis value of -1.26 for the “Monthly Charges” attribute suggest that the distribution is approximately symmetric, with a slightly longer tail on the left side of the distribution. The relatively flat distribution suggests that there are fewer extreme values or outliers for the “Monthly Charges” attribute.

A skewness value of 0.96 and kurtosis value of -0.23 for the “Total Charges” attribute suggest that the distribution is moderately skewed to the right, with a longer tail on the right side of the distribution. The relatively flat distribution suggests that there are fewer extreme values or outliers for the “Total Charges” attribute.

Bivariate Analysis

Violin Plot for Comparing Churn (Yes/No) vs. Monthly charges.

The violin plot for comparing churn (Yes/No) vs. monthly charges shows the distribution of monthly charges for two groups: customers who churned (Churn=Yes) and customers who did not churn (Churn=No).

For customers who did not churn (Churn=No), the maximum monthly charge is 118.75, with a Q3 (third quartile) of 88.47, median of 64.45, Q1 (first quartile) of 25.10, and minimum monthly charge of 18.25. This suggests that most customers who did not churn have a monthly charge less than 88.47, with a large proportion of customers having monthly charges between 25.10 and 88.47.

For customers who churned (Churn=Yes), the maximum monthly charge is 118.35, with a Q3 of 94.20, median of 79.65, Q1 of 56.15, and minimum monthly charge of 18.85. This suggests that most customers who churned have a monthly charge less than 94.20, with a large proportion of customers having monthly charges between 56.15 and 94.20.

Overall, the results suggest that customers who churned tend to have higher monthly charges than customers who did not churn.

Monthly Charges vs. Total Charges (Correlation Analysis)

The correlation coefficient value of 0.65 indicates that the relationship between the two variables is moderately strong, which implies that changes in one variable (Monthly Charges) are associated with proportional changes in the other variable (Total Charges)

Tenure vs. Total Charges (Correlation Analysis)

The correlation coefficient value of 0.83 indicates that the relationship between the two variables is moderately strong, which implies that changes in one variable (Tenure) are associated with proportional changes in the other variable (Total Charges)

BoxPlot Analysis: Monthly Charges Vrs all Categorical Variables

The mean monthly charges for females are slightly higher than males, with a value of $65 compared to $64 for males.

The mean monthly charges for those without partners are slightly higher than those with partners, with a value of $68 compared to $62.

The mean monthly charges for those without dependents are slightly higher than those with dependents, with a value of $67 compared to $60 for males.

The mean monthly charges for customers with Phoneservives are higher than customers without, with a value of $67 compared to $42 for males as expected.

The mean monthly charges for customers with multiple lines are slightly higher than customers without multiple lines, with a value of $82 compared to $42 as expected.

The mean monthly charges for customers with fiber optic internet service is slightly higher than customers using DSL internet service, with a value of $92 compared to $58 as expected. Customers without any internet services are expected to pay even less with an average monthly charge of $21.

The mean monthly charges for customers without online security are slightly higher than customers with online security, with a value of $114 compared to $21 as expected.

The mean monthly charges for customers with multiple lines are slightly higher than customers without multiple lines, with a value of $82 compared to $42 as expected.

The mean monthly charges for customers without device protection are slightly higher than customers with device protection, with a value of $111 compared to $21 as expected.

The mean monthly charges for customers with No Tech support are slightly higher than customers with Tech support, with a value of $75 compared to $21.

The mean monthly charges for customers with streaming TV are slightly higher than customers without streaming TV, with a value of $65 compared to $21.

The mean monthly charges for customers with month-to-month contracts are slightly higher than customers one-year contracts, with a value of $66 compared to $65.

The mean monthly charges for customers who prefer paperless billing are slightly lower than customers who do not, with a value of $52 compared to $74.

The mean monthly charges for customers who prefer automated bank transfer are slightly higher than customers who use automatic credit card, with a value of $76 compared to $44.

BoxPlot Analysis: Tenure Vrs all Categorical Variables

As can be observed, mean tenure for males is almost similar to those of females with values of 32.29 and 32.55.

The mean tenure for those with partners is actually lower than those without with a value of 23.37 to 42.13.

As can be observed, mean tenure for customers with dependents is higher than those without dependents with values of 38.57 and 29.81.

The mean tenure for those with phone services is almost equal to than those without with a value of 32.49 to 31.83.

As can be observed, mean tenure for customers with multiple lines is higher than those without with values of 41.97 months and 31.83 months.

The mean tenure for customers who prefer fiber optic internet service is higher than those who prefer DSL or no internet service with a values of 32.92 to 32.89 and 30.67 respectively.

As can be observed, customers with onlinesecurity have a higher tenure than those with no online security with values of 30.67 and 25.86.

The mean tenure for those with No online backup is actually higher than those with with online backup by a value of 30.67 to 23.69.

As can be observed, customers with device protection have a higher tenure than those without with values of 30.67 and 23.7.

The mean tenure for those with Tech support is actually higher than those without by a value of 30.67 to 25.85.

As can be observed, customers with streamingTV have a higher tenure than those without with values of 30.67 and 24.99.

The mean tenure for customers with streaming Movies have a higher tenure than those without with values of 30.67 and 24.75.

Customers with two-year contracts tend to have the highest tenure as expected.

Customers who prefer paperless billing have almost the same tenure as those who don't with values of 32.28 and 32.52. This means paperless billing does not affect the decision to stay with the telco or not.

Finaly, we can observe that customers who use electronic check to make payment have a higher tenure than those who use other methods as can be observed.

BoxPlot Analysis: TotalCharges Vrs all Categorical Variables

Multivariate Analysis

Correlation between all the numerical attributes

From the results can confidently say that Monthly charges and total have a high correlation while tenure and Totalcharges also have a high correlation.

Monthly charges for each combination of internet service type and contract type

This result confirms our previous analysis and shows that for month-to-month contract type, customers who opt for fiber optic internet service have a higher mean monthly charge as compared to DSL. The same can be said for the one year and multi-year contract.

Comparison of the number of customers who churned and did not churn, grouped by internet service type and payment method

The results here indicate that for both churn and not churn, customers who pay by electronics check have the highest count.

HYPOTHESIS

The likelihood of a customer leaving the organization is influenced by their contract type.

NULL: Customers on 2-year contracts are less likely to churn

Alternate: Customers on 2-year contracts are more likely to churn

HYPOTHESIS TESTING

from scipy.stats import chi2_contingency

# Create a contingency table of contract type and churn
cont_table = pd.crosstab(churn_data['Contract'], churn_data['Churn'])

# Conduct the chi-squared test
chi2, p_val, dof, expected = chi2_contingency(cont_table)

# Print the test results
print(f"Chi-Squared: {chi2}")
print(f"P-Value: {p_val}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:\n", expected)

# Visualize the contingency table as a heatmap
sns.heatmap(cont_table, annot=True, cmap='YlGnBu')

# Add labels and title
plt.xlabel('Churn')
plt.ylabel('Contract Type')
plt.title('Contingency Table: Contract Type vs. Churn')

# Display the plot
plt.show()
Chi-Squared: 1179.5458287339445
P-Value: 7.326182186265472e-257

The null hypothesis is that there is no association between the ‘Contract’ and ‘Churn’ variables (i.e., customers on 2-year contracts are equally likely to churn as customers on other contract types). The alternate hypothesis is that there is an association between the two variables (i.e., customers on 2-year contracts are more or less likely to churn than customers on other contract types).

If the p-value is less than our significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the two variables. If the p-value is greater than our significance level, we fail to reject the null hypothesis.

The heatmap visualizes the contingency table by coloring each cell with a different color based on the value in the cell. The annotation in each cell shows the count of observations in that particular category.

BUSINESS QUESTIONS:

Demographic Factors:

In this section, I will want to let the resulting figures do the talking;

Enjoy !!

What is the gender distribution of customers?

What is the distribution of customers based on their senior citizen status?

What is the distribution of customers based on their partnership status?

What is the distribution of customers based on their dependent's status?

Services Availed:

What is the distribution of customers based on their phone service status?

What is the distribution of customers based on their multiple lines status?

What is the distribution of customers based on their internet service provider?

What is the distribution of customers based on their online security status?

What is the distribution of customers based on their online backup status?

What is the distribution of customers based on their device protection status?

What is the distribution of customers based on their tech support status?

What is the distribution of customers based on their streaming TV status?

Contract Terms:

What is the distribution of customers based on their contract term?

Is there a relationship between contract term and customer churn?

Yes here is a relationship between contract term and customer churn. Generally, customers with longer contract terms are less likely to churn compared to those with shorter contract terms. This is because longer-term contracts often come with benefits such as lower monthly fees or discounts on equipment, which can make it more costly for a customer to switch providers.

Payment Method:

What is the distribution of customers based on their payment method?

Is there a relationship between payment method and customer churn?

Monthly Charges:

Is there a relationship between monthly charges and customer churn?

The customers who did not churn had a monthly charge of less than or equal to 64.43, while the other half had a monthly charge greater than 64.43.

Also, the median monthly charges for customers who did churn is 79.65, which means that half of the customers who did churn had a monthly charge of less than or equal to 79.65, while the other half had a monthly charge greater than 79.65.

This suggests that customers who churned had a higher median monthly charge than those who did not churn.

Based on the boxplot, it appears that customers who churn tend to have higher monthly charges compared to those who do not churn

Churn rate:

Which gender has a higher churn rate?

Is there a difference in churn rate between senior citizens and non-senior citizens?

How does the churn rate vary by internet service provider?

Do customers with phone service only have a higher or lower churn rate compared to those with additional services like internet or TV streaming?

Is there a difference in churn rate based on payment method?

Do customers who have multiple lines have a higher or lower churn rate compared to those with single lines?

Thank you very much!

Check this link out for Part 2 where I build the models to predict churn.

https://medium.com/@alidu143/unveiling-the-drivers-of-customer-churn-an-analytical-journey-to-improve-retention-part-2-ad561a05c20f

You can also check out Part 3 where I develop retention strategies based on demographics and K-means segmentation;

https://medium.com/@alidu143/unveiling-the-drivers-of-customer-churn-an-analytical-journey-to-improve-retention-part-3-26b3665d1996

Lastly check out my github for the complete code.

https://github.com/aliduabubakari/Telco-Customer-Churn/blob/main/ML_LP3-Classifcation-main_WEEK4.ipynb

--

--