Fighting the Telco Customer Churn Problem: A Data-Driven Analysis

Aug 4, 2019

Banks, phone service companies, internet service providers and insurance firms often use the customer churn (or customer turnover) rate as one of their key business metrics. Customer retention is generally far less costly than customer acquisition. While loyal customers may drive down costs, new clients are worth acquiring only if their customer lifetime value (CLV) is high enough to cover their acquisition costs.

Businesses are mostly interested in voluntary churn, which occurs because of the company-customer relationship.
This relationship depends on factors companies can control, such as billing, pricing or quality of service.
In this post, I propose a solution to fight churn for a telephone service company, based on the Telco Customers data set available on Kaggle.

First, I run an exploratory data analysis (EDA) to understand what drives customers to stay with or leave the company, so that the retention program can focus on what matters most.
Then, I build a Logistic regression model to predict churning customers.
Finally, I suggest business strategies based on this analysis to target churning customers and evaluate, through a simple simulation, how much the Machine Learning model can help Telco.

The full source code, available in my GitHub repository, is implemented using Pandas, NumPy and Scikit-learn.

Outline:
1. What drives customers to stay and leave?
2. Detecting Churn
3. Solving the Churn Problem
4. Conclusion

1. What drives customers to stay and leave?

1.1 Data Description
The data set contains information about Telco customers: each row represents a unique customer and the columns describe the services that customer uses.
The column “Churn” indicates whether the customer left the company within the last month.
There are 7032 customers in the data set, of whom 1869 left within the last month.
With a churn rate that high (1869 / 7032 = 26.58%), Telco may run out of customers in the coming months if no action is taken.
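As a quick check, the churn rate can be recomputed from the two counts quoted above. This is only an arithmetic sketch; the variable names are illustrative and not taken from the original code.

```python
# Counts quoted in the text above
total_customers = 7032
churned_customers = 1869

churn_rate = churned_customers / total_customers
print(f"Churn rate: {churn_rate:.2%}")  # Churn rate: 26.58%
```

On the actual DataFrame, and assuming the “Churn” column still holds "Yes"/"No" strings, an expression such as (data["Churn"] == "Yes").mean() should give the same figure.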

1.2 Descriptive analysis
In order to determine which services and features distinguish loyal customers from the others, the data set is split into two groups: churning customers and loyal ones.

Comparison of charges and tenure between loyal (first row) and disloyal customers (second row)

Monthly Charges: A large proportion of loyal customers pay between $20 and $25 each month, while disloyal ones tend to pay higher charges, between $50 and $100.
These figures suggest that cheap packages are associated with customer retention.
Total Charges and Tenure: Both distributions of total charges are heavily skewed, with most customers having paid a fairly low total amount.
For loyal customers, this is explained by the fact that most of them are on low-cost packages.
For churning customers, it is explained by the tenure distribution, which shows a large proportion of short-term clients who left the company within a short time window of about 2 months.

Note: The histograms show a peak at max(tenure) = 72 months (6 years), which may suggest that tenure is capped, i.e. long-standing customers who stayed more than 72 months appear in the data set with tenure = 72 months. Hence, people with the maximum tenure can be considered as acquired customers, and retention efforts should not focus on them.

Comparison of contracts and phone services between loyal (first row) and disloyal customers (second row)

Contract: A huge proportion (88.6%) of churning clients chose month-to-month contracts, whereas contract types are fairly balanced among loyal customers.
Month-to-month contracts are non-binding and allow customers to leave the company whenever it suits them. They may attract two categories of customers:
tourists passing through, who only need the phone company's services for a limited amount of time, and customers in search of a new phone company, who try several for a short time each.
Internet Service: Similarly, internet service types are fairly balanced among loyal customers, while most disloyal ones prefer the Fiber optic option. It is probably the most expensive one, which would explain why disloyal customers tend to pay higher monthly charges. Furthermore, it is reasonable to think that tourists passing through are more interested in internet services than in other ones.
Phone Service: As expected, there is no significant difference between the two types of customers in whether they use a phone service. Telco is a phone company, so most customers come to it mainly for its phone services.
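Shares such as the 88.6% of month-to-month contracts among churners can be obtained with a normalized cross-tabulation. The sketch below uses a tiny made-up table, not the Telco data; only the column names "Churn" and "Contract" are borrowed from the data set.

```python
import pandas as pd

# Toy data for illustration only (not the Telco data set)
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "One year",
                 "Month-to-month", "One year", "Two year", "Two year"],
})

# normalize="index" turns each row (one per churn group) into proportions
shares = pd.crosstab(toy["Churn"], toy["Contract"], normalize="index")
print(shares)
```

Each row sums to 1, so the "Yes" row reads directly as the distribution of contract types among churning customers.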

Effect of having a partner and/or a phone service on monthly charges: Subscribing to a phone service increases monthly charges, as expected. Charges are even higher when a client has a partner. This may reflect discounted family packages rather than individual subscriptions, as half of the customers with a partner tend to have multiple lines.
However, we notice that the variance is higher for customers subscribing to a phone service. This is expected: the phone service (the company's main service) only serves as a base, and customers' monthly charges differ according to the extra services they choose.
These effects are similar in both groups of clients, with a slight difference in the variance of monthly charges.
Do low-cost and premium customers represent different generations? Most customers in the whole data set are senior citizens (83.76%). The proportion of senior citizens is lower among churning customers, which may be explained by younger people being more volatile than older ones. Still, the difference is not large enough to draw conclusions.

2. Detecting Churn

In this part, I develop a machine learning (ML) model to predict which customers are likely to churn, so that action can be taken to change this behavior.

2.1 One-hot encoding of categorical variables

Most features are categorical variables. Hence, we use dummy-variable (one-hot) encoding to obtain “machine-readable” data. We avoid the dummy variable “trap” by creating only k-1 dummy variables for k categorical levels.

import pandas as pd

binary_columns = ["Partner", "Dependents", "PhoneService", "MultipleLines", "PaperlessBilling", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]

# Map "Yes" to 1 and every other value to 0
for c in binary_columns:
    data[c] = data[c].map(lambda x: 1 if x == 'Yes' else 0)
# One-hot encode the payment method, then drop the original column
data = pd.concat([data, pd.get_dummies(data["PaymentMethod"], prefix="Pay")], axis=1)
data = data.drop(columns="PaymentMethod")
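To illustrate the k-1 rule mentioned above, here is a minimal sketch on a toy column. It relies on the drop_first option of pd.get_dummies, which is one common way to obtain k-1 indicators; this is an illustration and not necessarily how the original code handles it.

```python
import pandas as pd

# Toy column with k = 4 levels (illustrative, not the full data set)
toy = pd.DataFrame({"PaymentMethod": ["Electronic check", "Mailed check",
                                      "Bank transfer (automatic)", "Credit card (automatic)"]})

# drop_first=True keeps k-1 indicator columns; the dropped level becomes the baseline
dummies = pd.get_dummies(toy["PaymentMethod"], prefix="Pay", drop_first=True)
print(list(dummies.columns))
```

The dropped level is the first one in sorted order, so each remaining coefficient is read relative to that baseline.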

2.2 Data normalization

We normalize the features so that differences in scale between them do not distort the model weights.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data),columns=data.columns)
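For reference, MinMaxScaler with its default range rescales each column to [0, 1] using (x - min) / (max - min). The sketch below checks this on a small made-up array; the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy columns standing in for tenure and monthly charges (made-up values)
X = np.array([[1.0, 20.0], [12.0, 70.0], [72.0, 120.0]])

scaled = MinMaxScaler().fit_transform(X)
# Same transformation written out by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))  # True
```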

2.3 Train / Validation / Test

We split the data set into train, validation and test sets.
The train set represents 75% of the full data set, and the validation and test sets share the remaining 25%.
This remainder is in turn split into 75% for the validation set and 25% for the test set.

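from sklearn.model_selection import train_test_split

# Default test_size=0.25: 75% train, 25% held out
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns="Churn"), data["Churn"], stratify=data["Churn"], random_state=42)
# Split the held-out part again: 75% validation, 25% test
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, stratify=y_test, random_state=42)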

Note: Because the data set is imbalanced, we use stratified sampling to ensure that the train, validation and test sets have approximately the same percentage of samples of each target class as the complete set.
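The effect of the two successive splits can be checked on a synthetic label vector with the same size and class counts as quoted in section 1.1. The feature matrix here is a placeholder, so this is only a sketch of the split sizes and class shares, not a reproduction of the actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1869 churners out of 7032 customers (counts from section 1.1)
y = np.array([1] * 1869 + [0] * (7032 - 1869))
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_rest, y_train, y_rest = train_test_split(X, y, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, stratify=y_rest, random_state=42)

print(len(y_train), len(y_valid), len(y_test))  # 5274 1318 440
# Churn share in each subset stays close to the overall 26.58%
print(round(y_train.mean(), 3), round(y_valid.mean(), 3), round(y_test.mean(), 3))
```

The test set of 440 customers matches the size used in the confusion matrices further down.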

2.4 Naive Classifiers

Before building our machine learning model, it is important to understand why we actually need it.
Hence, we analyse two naive models: one considers that all customers are about to churn and takes action for every one of them, and the other keeps the “status quo”, i.e. does nothing and loses all the customers who were going to leave.

Note: These two classifiers are run directly on the test set since no training is involved.

We use a confusion matrix, which summarizes the test results of a supervised model.
For a test set where both the actual (true) and the predicted labels are known, the results are laid out along two axes.
One axis holds the true labels, as given by the test set, and the other holds the predictions given by the model.
This lets us see the correctly predicted values, i.e. the true positives (TP) and true negatives (TN), and the incorrectly predicted values, i.e. the false positives (FP) and false negatives (FN).
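The helper print_confusion_matrix used in the snippets of this section is not shown in the article. The function below is a hypothetical stand-in (named print_confusion_matrix_sketch to make that clear), built only on scikit-learn metrics and run on made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

def print_confusion_matrix_sketch(y_true, y_pred):
    """Hypothetical helper: prints the confusion matrix and three scores."""
    # Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    print(cm)
    print("Accuracy Score:", accuracy_score(y_true, y_pred))
    print("Recall Score:", recall_score(y_true, y_pred, zero_division=0))
    print("Precision Score:", precision_score(y_true, y_pred, zero_division=0))
    return cm

# Made-up labels: 3 churners (1) and 5 loyal customers (0)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

cm = print_confusion_matrix_sketch(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```

Here one churner is missed (FN = 1) and one loyal customer is flagged by mistake (FP = 1).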

everyone_churn = np.ones_like(y_test)
print_confusion_matrix(y_test, everyone_churn)

Accuracy Score: 0.26590909090909093
Recall Score: 1.0
Precision Score: 0.26590909090909093

Confusion matrix of the naive classifier which always outputs “Yes Churn”

As we can see, a naive classifier which considers that every customer will churn is not good at all, as we would have to spend extra money on a retention strategy for each customer in our database.
A lot of money is wasted: 323 out of 440 customers were not going to churn, but we spent money on them anyway.

nobody_churn = np.zeros_like(y_test)
print_confusion_matrix(y_test, nobody_churn)

Accuracy Score: 0.7340909090909091
Recall Score: 0.0
Precision Score: 0.0

Confusion matrix of the naive classifier which always outputs “No Churn”

Similarly, a classifier which always outputs “No” is not good enough either, since our starting point was that customer retention is less expensive than acquiring new customers.
With this strategy, we lose too many customers: 117 out of 440 were actually churning customers, but our classifier failed to spot them.
We would need to spend extra money to replace these lost customers.

2.5 Choice of metric

We have seen that we need to strike a balance between a classifier which takes no action and one which raises too many alarms.
In the following, we present different metrics and choose the one best suited to our business problem.

Accuracy measures how well our model predicts all the classes, regardless of class balance. It is the ratio of correctly predicted results to the entire sample: (TP + TN) / (TP + TN + FP + FN)
Precision is the fraction of positive predictions that are correct. Hence, it is the probability that a (randomly selected) customer predicted as churning is actually about to churn: TP / (TP + FP)
Recall measures the share of actual positives that have been correctly predicted. It is the probability that a (randomly selected) “about to churn” customer is detected by the model: TP / (TP + FN)
F1 is the harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall), and is usually a good measure of the balance between the two.
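The four definitions above can be written directly in terms of the confusion-matrix counts. The sketch below applies them to the “everyone churns” classifier from section 2.4 (117 churners and 323 loyal customers in the test set); the function name scores is illustrative.

```python
def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# "Everyone churns": all 117 churners are caught, all 323 loyal customers are false alarms
accuracy, precision, recall, f1 = scores(tp=117, fp=323, fn=0, tn=0)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.2659 0.2659 1.0 0.4201
```

The perfect recall comes with a precision of only about 27%, which is the imbalance the choice of metric has to address.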

In our scenario, we are more interested in recall: we want to capture as many churning customers as possible, even if this raises a lot of false alarms (false positives, i.e. people flagged as churning who do not churn), since retaining a customer is less expensive than acquiring a new one.

Note: we still keep in mind that we do not want a naive classifier that always says “Yes Churn”. Hence, we must pay special attention to choosing a threshold such that the precision score is not too low either.

2.6 Building our classifier : Logit model

We choose a Logit (logistic regression) classifier over other existing models for its simplicity and interpretability.
Its hyperparameters are tuned on the train set with a grid search combined with 5-fold cross-validation.
The validation set is then used to select the decision threshold from a precision-recall curve.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_recall_curve
param_grid = {'penalty': ['l1', 'l2'], 'C': np.logspace(-1, 1, 10),
              'solver': ['liblinear']}
clf = LogisticRegression()
gs = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, verbose=0, scoring="recall")
gs.fit(X_train, y_train)
# Decision scores on the validation set, used to pick the threshold
y_score = gs.decision_function(X_valid)
precision, recall, thresholds = precision_recall_curve(y_valid, y_score)
# precision and recall have one more entry than thresholds, so drop the last point
plt.plot(thresholds, precision[:-1], label='precision', ls='dashed')
plt.plot(thresholds, recall[:-1], label='recall', ls='dashed')
plt.legend()
plt.title('Precision and Recall scores as a function of the decision threshold')
plt.xlabel('Threshold')
plt.ylabel('Metrics value')
plt.grid()

We want the recall to be as high as possible while making sure the precision is not too low.
A threshold of -1.5 on the decision function gives a recall of more than 80% while keeping the precision at 40% or more.
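The threshold above was read off the curve. The same rule, highest recall subject to a minimum precision, can also be applied programmatically. The sketch below does so on synthetic scores, not on the model's actual outputs, with the 40% precision floor taken from the text; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic decision scores: churners (label 1) tend to score higher. Not the article's data.
y_valid = np.array([1] * 100 + [0] * 300)
y_score = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 300)])

precision, recall, thresholds = precision_recall_curve(y_valid, y_score)
# Align the three arrays: the last precision/recall point has no threshold
precision, recall = precision[:-1], recall[:-1]

min_precision = 0.40
ok = precision >= min_precision
# Index of the highest recall among thresholds that respect the precision floor
best = int(np.argmax(np.where(ok, recall, -1.0)))
chosen_threshold = thresholds[best]

print(chosen_threshold, precision[best], recall[best])
```

The threshold should be chosen on the validation set only, so that the test set remains an unbiased check of the final scores.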

# Compute a new set of predictions by applying a custom threshold to the decision function
y_pred_threshold = (gs.decision_function(X_test) >= -1.5).astype(bool)

print_confusion_matrix(y_test, y_pred_threshold)

Accuracy Score: 0.7113636363636363
Recall Score: 0.905982905982906 
Precision Score: 0.4774774774774775

Our classifier reaches a recall of 90.60% with a precision of 47.75% on the test set.
It spots 106 of the 117 churning customers.
Hence, we would need to spend extra money to replace only 11 customers, compared with 117 for the “No Churn” classifier.
Furthermore, 116 of the 323 non-churning customers are classified as churning, far fewer false alarms than the 323 raised by the “Yes Churn” classifier.

2.7 Feature importances

We use the Logit model coefficients as feature importances.
They give insight into the factors that separate churning customers from loyal ones.

estimator = gs.best_estimator_
class_labels = gs.classes_
weights = estimator.coef_[0]
weights_index = np.argsort(weights)[::-1]
...
Most important features used to predict churn, with their weights
--------------------------------------
TotalCharges : 2.41194855716856
InternetYes : 1.595887533016741
FiberOptic : 1.5259884507415054
StreamingTV : 0.5433665554921124
StreamingMovies : 0.4998150502828606
-----------
tenure : -3.8768426607650053
MonthlyCharges : -3.005716293889124
Contract_Two year : -1.0917173045943789
Pay_Bank transfer (automatic) : -0.4199272308407312
Pay_Credit card (automatic) : -0.4148894319496611

It seems that internet and streaming services, the most expensive ones, are factors pushing customers towards churn.
On the other hand, the longer a customer has been with the company, the more likely he is to stay.
Monthly charges and long-term contracts seem to play an important role too, as already observed during the descriptive analysis.
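The ranking step can be sketched on a handful of the coefficients printed above (rounded here). Since the model is a logistic regression, exponentiating a coefficient gives the multiplicative change in the odds of churning when the min-max scaled feature goes from 0 to 1. The arrays below are a hand-copied subset, not the full model.

```python
import numpy as np

# Subset of the coefficients listed above, rounded to three decimals
feature_names = np.array(["TotalCharges", "tenure", "StreamingTV",
                          "MonthlyCharges", "Contract_Two year"])
weights = np.array([2.412, -3.877, 0.543, -3.006, -1.092])

order = np.argsort(weights)[::-1]  # most churn-increasing first
ranked = list(zip(feature_names[order], weights[order]))
odds_ratios = np.exp(weights[order])  # > 1 raises the odds of churn, < 1 lowers them

for (name, w), ratio in zip(ranked, odds_ratios):
    print(f"{name}: weight={w:+.3f}, odds ratio={ratio:.3f}")
```

Comparing weights this way is only meaningful because all features were scaled to the same [0, 1] range beforehand.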

3. Solving the Churn Problem

In order to reduce the churn rate, the owners of Telco may take the following items into account:

Charges: Even though low charges go hand in hand with loyal customers, pricing may be one of the most sensitive factors to change, as it directly impacts Telco's income.
However, we noticed that customers with a partner pay more than the others.
Hence, it may be worth offering family packages with discounts or extras to justify the high prices.
Contracts: We observed that most churning customers were on short-term, i.e. month-to-month, contracts.
This is not good for Telco, as it leaves customers free to be tempted by new offers on the market.
A solution would be either to remove these short-term contracts and offer only long-term ones, or to promote long-term contracts with lower monthly charges than short-term ones.
Furthermore, it may be worth creating a loyalty program with bonus points: the longer you stay with the company, the higher the discount on new devices, smartphones and subscriptions.
Premium Services: Internet and video-streaming services seem to be the most costly for customers, hence the most valuable ones for the company.
Specifically, they appear to be the services most requested by tourists passing through or by customers looking for a new provider.
It may be worth adjusting the prices of these services in “exchange” for loyalty.
The owners may, for example, price Fiber optic very differently depending on the type of contract chosen.
This would offset the loss of these passing-through clients (whom it seems difficult to retain) by making a larger profit from them up front.

We design a simple simulation with the following assumptions:

For each lost customer, Telco would invest $200 in marketing, ads, targeted emails and campaigns to replace them (customer acquisition).
For each predicted churning customer, Telco would invest $50 in discounts and other advantages to retain them.

def cost(clients_to_retain, clients_to_obtain):
    # $50 per retention offer, $200 per lost customer to replace
    return 50*clients_to_retain + 200*clients_to_obtain
...
sns.catplot(x="Strategy", y="Cost ($)", kind="bar", data=all_costs, palette=["red", "yellow", "green"])
plt.title("Associated cost for each strategy taken")
# Draw the grid before saving so that it appears in the saved image
plt.grid()
plt.savefig("strategies_cost.png")

Cost associated with each customer retention strategy
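The simulation code is partly elided above, so the following is only a sketch of how the cost function could be applied to the test-set counts reported in sections 2.4 and 2.6. Two points are assumptions on my side: that every retention offer succeeds, and which customers are charged the $50. Counting only the correctly flagged churners gives a saving of $15,900 compared with doing nothing; also counting the 116 false alarms gives a smaller saving.

```python
def cost(clients_to_retain, clients_to_obtain):
    # $50 per retention offer, $200 per lost customer to replace
    return 50 * clients_to_retain + 200 * clients_to_obtain

# Test-set counts reported earlier in the article
n_test, n_churn = 440, 117
tp, fn, fp = 106, 11, 116

status_quo = cost(0, n_churn)          # "No Churn": every churner has to be replaced
everyone = cost(n_test, 0)             # "Yes Churn": offer sent to all 440 customers
model_tp_only = cost(tp, fn)           # assumption: only true churners cost $50
model_all_flagged = cost(tp + fp, fn)  # assumption: false alarms get the offer too

print(status_quo, everyone, model_tp_only, model_all_flagged)  # 23400 22000 7500 13300
print(status_quo - model_tp_only, status_quo - model_all_flagged)  # 15900 10100
```

Under either assumption the model is cheaper than both naive strategies on this test set.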

4. Conclusion

Through descriptive analysis, we identified key factors that lead customers to churn.
We proposed a Logistic regression model with a recall of 90.60% to minimize Telco's customer retention costs.
Compared with the current state of affairs, our model would allow Telco to save $15,900 in our simulation.