Telco Customer Churn Prediction
A telco company has two kinds of promotional costs: acquisition costs and retention costs. Acquisition cost is the cost of acquiring new customers, while retention cost is the cost of keeping existing ones.
For the telco company, it is therefore important to estimate which customers will stay and which will churn, and to direct its investments based on these estimates.

In this project, the data was obtained from this Kaggle dataset. It contains information about a fictional telco company that provides home phone and internet services. The full project can be found in my GitHub repository.
Let’s do some analysis!
Dataset Information: The telco customer dataset contains information about a fictional telco company that provides home phone and internet services to 7,043 customers in California in Q3. In addition to important demographic information, it includes a satisfaction score, a churn score, and a customer lifetime value (CLTV) index.
Let’s look at the types of our columns. As we can see from the following output, there are many categorical columns that need to be converted to numerical ones. For example, the “Total Charges” column is stored as an object even though it is originally a numerical column.
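The summary below was generated with pandas (assuming the data has already been loaded into a DataFrame named df):
# Column types and non-null counts
df.info()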
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 7043 non-null object
1 Count 7043 non-null int64
2 Country 7043 non-null object
3 State 7043 non-null object
4 City 7043 non-null object
5 Zip Code 7043 non-null int64
6 Lat Long 7043 non-null object
7 Latitude 7043 non-null float64
8 Longitude 7043 non-null float64
9 Gender 7043 non-null object
10 Senior Citizen 7043 non-null object
11 Partner 7043 non-null object
12 Dependents 7043 non-null object
13 Tenure Months 7043 non-null int64
14 Phone Service 7043 non-null object
15 Multiple Lines 7043 non-null object
16 Internet Service 7043 non-null object
17 Online Security 7043 non-null object
18 Online Backup 7043 non-null object
19 Device Protection 7043 non-null object
20 Tech Support 7043 non-null object
21 Streaming TV 7043 non-null object
22 Streaming Movies 7043 non-null object
23 Contract 7043 non-null object
24 Paperless Billing 7043 non-null object
25 Payment Method 7043 non-null object
26 Monthly Charges 7043 non-null float64
27 Total Charges 7043 non-null object
28 Churn Label 7043 non-null object
29 Churn Value 7043 non-null int64
30 Churn Score 7043 non-null int64
31 CLTV 7043 non-null int64
32 Churn Reason 1869 non-null object
dtypes: float64(3), int64(6), object(24)
Now, we will check whether there are any missing values. If there are, we will fill them in.
# Missing Values
df.isna().sum()
CustomerID 0
Count 0
Country 0
State 0
City 0
Zip_Code 0
Lat_Long 0
Latitude 0
Longitude 0
Gender 0
Senior_Citizen 0
Partner 0
Dependents 0
Tenure_Months 0
Phone_Service 0
Multiple_Lines 0
Internet_Service 0
Online_Security 0
Online_Backup 0
Device_Protection 0
Tech_Support 0
Streaming_TV 0
Streaming_Movies 0
Contract 0
Paperless_Billing 0
Payment_Method 0
Monthly_Charges 0
Total_Charges 11
Churn_Label 0
Churn_Value 0
Churn_Score 0
CLTV 0
Churn_Reason 5174
dtype: int64
As we can see from the output above, the “Total Charges” column has 11 missing values. Since this is a very small percentage, we can fill these values with the median of the “Total Charges” column. The “Churn Reason” column has 5,174 missing values, but we will not fill them in; we will only use “Churn Reason” for analysis.
# Fill missing values with the median (after converting 'Total_Charges' to numeric, since it is read in as an object column).
df['Total_Charges'] = pd.to_numeric(df['Total_Charges'], errors='coerce')
df['Total_Charges'].fillna(df['Total_Charges'].median(), inplace=True)
Exploratory Data Analysis (EDA)
First, we have to check the data for an imbalanced class distribution. We know that imbalanced data is an important problem for predictive ML models.
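A minimal check of the class balance (assuming the target column is 'Churn_Value', as listed above):
# Frequency of each churn value, as a proportion of all customers.
df['Churn_Value'].value_counts(normalize=True)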

According to the output of the churn value frequencies, our data is slightly imbalanced. Later in this project, we will deal with this by applying an oversampling or undersampling method.
Now, we can analyze the numerical and categorical columns.
As we can see from the following pie charts, the male and female distributions look similar. On the other hand, senior citizens are more prone to churn.


When we check the distributions of “Tenure_Months” and “Monthly_Charges”, there are some important points.


We can see from the distribution of “Monthly_Charges” that when the monthly charge is below about $30, the churn percentage decreases. In other words, customers with a monthly charge under $30 are more likely to stay. On the other hand, as “Tenure_Months” decreases, the churn percentage of customers increases.
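As a quick numerical check of the $30 threshold (a hypothetical snippet; the bin edges are my own choice, not from the original analysis):
# Churn rate by binned monthly charge.
df.groupby(pd.cut(df['Monthly_Charges'], bins=[0, 30, 60, 90, 120]))['Churn_Value'].mean()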
Now, let’s look at the distributions of a few important categorical columns.

We can see from the distributions above, split by “Churn_Label”, that some ratios differ noticeably between churned and retained customers (a quick numerical check is sketched after the list):
- Customers who choose paperless billing are more prone to churn.
- We can see from the “Payment_Method” distribution that customers who pay by electronic check are more prone to churn.
- Customers who choose fiber optic internet service are more prone to churn.
- Customers without a contract, i.e. on a month-to-month plan, are more prone to churn.
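These ratios can also be checked numerically; for example (assuming the column names shown earlier):
# Churn rate per contract type and per payment method.
df.groupby('Contract')['Churn_Value'].mean()
df.groupby('Payment_Method')['Churn_Value'].mean()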

Also, we can see from the outputs above that customers who do not subscribe to any online service or add-on package, such as device protection, are more prone to churn.
In addition, according to the churn reasons, competing telco companies have a considerable effect on the churn rate.
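The counts below are presumably the value counts of the churn reason column, e.g.:
# Frequency of each churn reason (only churned customers have a reason).
df['Churn_Reason'].value_counts()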
The attitude of support person 192
Competitor offered higher download speeds 189
Competitor offered more data 162
Don't know 154
Competitor made better offer 140
Attitude of service provider 135
Competitor had better devices 130
Network reliability 103
Product dissatisfaction 102
Price too high 98
Service dissatisfaction 89
Lack of self-service on Website 88
Extra data charges 57
Moved 53
Limited range of services 44
Lack of affordable download/upload speed 44
Long distance charges 44
Poor expertise of phone support 20
Poor expertise of online support 19
Deceased 6
Before we finish the EDA part, we plot the churn heatmap according to “Latitude” and “Longitude”. This code is available in my GitHub repository.
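A simplified stand-in for that plot (the actual heatmap code lives in the repository; this is just a basic scatter colored by churn):
# Plot customer locations, colored by churn value.
import matplotlib.pyplot as plt
plt.scatter(df['Longitude'], df['Latitude'], c=df['Churn_Value'], s=2, cmap='coolwarm')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()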

We have completed our analysis of our dataset. We touched on important points for the Telco company. Now, we can move on to the modeling phase.
Modeling
We will do some preprocessing; after that, we will build our models and evaluate them.
Now, we want to drop some unnecessary columns, but first we will create a new column from ‘Streaming_Movies’ and ‘Streaming_TV’.
# Combine the two streaming columns into a single 'Entertainment' feature.
df.loc[(df['Streaming_Movies'] == 'Yes') & (df['Streaming_TV'] == 'Yes'), 'Entertainment'] = 2
df.loc[(df['Streaming_Movies'] == 'No internet service') & (df['Streaming_TV'] == 'No internet service'), 'Entertainment'] = 0
df['Entertainment'].fillna(1, inplace=True)

# Drop unnecessary columns and columns directly related to the target.
data_ml = df.drop(['CustomerID', 'Count', 'Country', 'State', 'Streaming_TV',
                   'Streaming_Movies', 'Churn_Label', 'Churn_Score', 'CLTV',
                   'Churn_Reason'], axis=1)
We dropped ‘Churn_Reason’, ‘Churn_Score’, etc., because they are directly related to our target. In addition, based on the correlation map below, we also dropped columns such as ‘Lat_Long’, ‘Latitude’, and ‘Longitude’.

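A minimal sketch of such a correlation map (the original plotting code is only in the repository; seaborn is assumed here):
# Correlation heatmap over the numeric columns.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()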
Now, we will apply a label encoder and a one-hot encoder to our categorical columns. First, we divide the columns into lists according to their types; then we apply the label encoder to the binary list and the one-hot encoder to the categorical list.
col_count = pd.DataFrame({"col_name": data_ml.nunique().index,
                          "Unique_Val": data_ml.nunique()}).reset_index(drop=True)

# Categorize each column by its number of unique values.
def col_cat(col):
    x = []
    for i in col:
        if i == 2:
            x.append('Binary')
        elif (i > 2) & (i < 7):
            x.append('Categorical')
        else:
            x.append('Continuous')
    return x

col_count['Type'] = col_cat(col_count["Unique_Val"])
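The binary and categorical column lists used below are not shown in the post; one plausible way to build them from col_count (my own sketch — the X/y split against the ‘Churn_Value’ target and the ‘categorical_indexes’ positions come from the full notebook) is:
# Hypothetical derivation of the column groups used in the encoding step.
binary = list(col_count.loc[col_count['Type'] == 'Binary', 'col_name'])
categorical = list(col_count.loc[col_count['Type'] == 'Categorical', 'col_name'])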
# Label Encoding
le = LabelEncoder()
for i in binary:
    data_ml[i] = le.fit_transform(data_ml[i])

# One-Hot-Encoding (after dividing the data into X and y)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), list(categorical_indexes))],
                       remainder='passthrough')
X = ct.fit_transform(X)
The next step is to divide our data into train and test sets. After the train-test-split step, we will scale our data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
Now our data is ready for building models. We will train 5 different models and evaluate them. After the training and evaluation steps, we will use GridSearchCV to choose the best model and its best parameters.
In addition, we will use my custom library during the modeling step; it can be found in my GitHub repository.
For each model we tried, we share its ROC curve and confusion matrix below.
XGBoost


Also, for XGBoost we used the ‘scale_pos_weight’ parameter, which is intended for imbalanced datasets. ‘scale_pos_weight’ is typically set to a ratio obtained from the dataset, namely the ratio of negative to positive samples.
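A minimal sketch of how this could look (variable names assumed from the split above; not the exact code of the post):
# Ratio of negative to positive examples in the training labels.
import numpy as np
from xgboost import XGBClassifier

neg, pos = np.bincount(y_train)
xgb = XGBClassifier(scale_pos_weight=neg / pos, eval_metric='auc')
xgb.fit(X_train_scaled, y_train)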
Support Vector Classifier


Naive Bayes Classifier


Logistic Regression


Random Forest Classifier


As you can see from the model outputs above, our accuracies and ROC-AUC scores are generally good enough. However, the precision and recall scores are not. So we try one more model, and then we move on to the GridSearchCV step.


Now, we will try a neural network for the churn predictions, implementing our last model with ‘Sequential’ from the Keras library.
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(tf.keras.layers.Dense(256, input_shape=(X_train_scaled.shape[1],), activation='sigmoid'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy', 'Precision', 'Recall', 'AUC'])

# Train the model
model.fit(X_train_scaled, y_train, epochs=50, batch_size=100)
Unfortunately, the neural network did not bring an improvement. As you can see from the following output, its results are close to those of the other models.
y_pred = (model.predict(X_test_scaled) > 0.5).astype("int32")
accuracy_score(y_test, y_pred)
>> 0.795078088026502

Now, let’s move on to the GridSearchCV step.
GridSearchCV
In this step, we put all the models into a pipeline and then pass the pipeline to GridSearchCV to choose the best model for us.
First, we initialize the estimators and their hyperparameters.
# Initialize the estimators
clf1 = XGBClassifier()
clf2 = SVC()
clf3 = GaussianNB()
clf4 = LogisticRegression()
clf5 = RandomForestClassifier()

# Initialize the hyperparameter dictionary for each estimator
param1 = {}
param1['classifier__n_estimators'] = [100, 250]
param1['classifier__max_depth'] = [5, 10, 20]
param1['classifier__min_child_weight'] = [70, 140]
param1['classifier__subsample'] = [0.7, 0.9]
param1['classifier__colsample_bytree'] = [0.8, 0.6, 0.4]
param1['classifier__scale_pos_weight'] = [3]
param1['classifier__eval_metric'] = ['map', 'auc', 'error']
param1['classifier'] = [clf1]

param2 = {}
param2['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param2['classifier__class_weight'] = [None, {0:1, 1:5}, {0:1, 1:10}, {0:1, 1:25}]
param2['classifier'] = [clf2]

param3 = {}
param3['classifier__var_smoothing'] = np.logspace(0, -9, num=100)
param3['classifier'] = [clf3]

param4 = {}
param4['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param4['classifier__penalty'] = ['l1', 'l2']  # 'l1' requires a compatible solver such as 'liblinear'
param4['classifier__class_weight'] = [None, {0:1, 1:5}, {0:1, 1:10}, {0:1, 1:25}]
param4['classifier'] = [clf4]

param5 = {}
param5['classifier__n_estimators'] = [10, 50, 100, 250]
param5['classifier__max_depth'] = [5, 10, 20]
param5['classifier__class_weight'] = [None, {0:1, 1:5}, {0:1, 1:10}, {0:1, 1:25}]
param5['classifier'] = [clf5]

# Create the pipeline for the first estimator (the 'classifier' step is swapped out by each param dict)
pipeline = Pipeline([('classifier', clf1)])
params = [param1, param2, param3, param4, param5]

# Train the grid search model
gs = GridSearchCV(pipeline, params, cv=3, n_jobs=-1, scoring='roc_auc').fit(X_train_scaled, y_train)
We then applied GridSearchCV with the given hyperparameters to our models. As you can see from the following output, ‘RandomForestClassifier’ is our champion model.
# ROC-AUC score for the best model
gs.best_score_
>> 0.8588572289266095
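To see exactly which estimator and hyperparameter combination won, the standard GridSearchCV attributes can be inspected:
# Best hyperparameter combination and fitted pipeline found by the search.
print(gs.best_params_)
print(gs.best_estimator_)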

Conclusion
In conclusion, we covered an end-to-end churn prediction model. We analyzed our dataset and found important points for the fictional telco company. We also trained 6 different models and evaluated them. In the final step, we applied GridSearchCV to find the best model and its best parameters; according to GridSearchCV, Random Forest was selected as the best model with the given parameters. Our accuracy score is about 85%.
Thanks for reading!