Telco Customer Churn Prediction

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 7043 non-null object
1 Count 7043 non-null int64
2 Country 7043 non-null object
3 State 7043 non-null object
4 City 7043 non-null object
5 Zip Code 7043 non-null int64
6 Lat Long 7043 non-null object
7 Latitude 7043 non-null float64
8 Longitude 7043 non-null float64
9 Gender 7043 non-null object
10 Senior Citizen 7043 non-null object
11 Partner 7043 non-null object
12 Dependents 7043 non-null object
13 Tenure Months 7043 non-null int64
14 Phone Service 7043 non-null object
15 Multiple Lines 7043 non-null object
16 Internet Service 7043 non-null object
17 Online Security 7043 non-null object
18 Online Backup 7043 non-null object
19 Device Protection 7043 non-null object
20 Tech Support 7043 non-null object
21 Streaming TV 7043 non-null object
22 Streaming Movies 7043 non-null object
23 Contract 7043 non-null object
24 Paperless Billing 7043 non-null object
25 Payment Method 7043 non-null object
26 Monthly Charges 7043 non-null float64
27 Total Charges 7043 non-null object
28 Churn Label 7043 non-null object
29 Churn Value 7043 non-null int64
30 Churn Score 7043 non-null int64
31 CLTV 7043 non-null int64
32 Churn Reason 1869 non-null object
dtypes: float64(3), int64(6), object(24)
# Missing Values
df.isna().sum()
CustomerID 0
Count 0
Country 0
State 0
City 0
Zip_Code 0
Lat_Long 0
Latitude 0
Longitude 0
Gender 0
Senior_Citizen 0
Partner 0
Dependents 0
Tenure_Months 0
Phone_Service 0
Multiple_Lines 0
Internet_Service 0
Online_Security 0
Online_Backup 0
Device_Protection 0
Tech_Support 0
Streaming_TV 0
Streaming_Movies 0
Contract 0
Paperless_Billing 0
Payment_Method 0
Monthly_Charges 0
Total_Charges 11
Churn_Label 0
Churn_Value 0
Churn_Score 0
CLTV 0
Churn_Reason 5174
dtype: int64
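Note that `Total_Charges` shows as a non-null `object` column in `df.info()`, yet 11 missing values appear above. The blank entries only become `NaN` after a numeric conversion; the exact conversion isn't shown in the post, but it was presumably something like `pd.to_numeric` with `errors='coerce'` (a sketch on toy data):

```python
import pandas as pd

# Toy frame mimicking the Telco data: Total_Charges is stored as strings,
# with blanks for brand-new customers who have not been billed yet.
df = pd.DataFrame({'Total_Charges': ['29.85', ' ', '1889.5', ' ']})

# Coerce non-numeric entries (the blanks) to NaN.
df['Total_Charges'] = pd.to_numeric(df['Total_Charges'], errors='coerce')

print(df['Total_Charges'].isna().sum())  # the 2 blanks are now NaN
```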
# Fill missing values with the median.
df['Total_Charges'].fillna(df['Total_Charges'].median(), inplace=True)

Exploratory Data Analysis (EDA)

  • Customers on paperless billing are more prone to churn.
  • The “Payment_Method” distribution shows that customers who pay by electronic check are more prone to churn.
  • Customers with fiber-optic internet service are more prone to churn.
  • Customers without a long-term contract, i.e. on month-to-month terms, are more prone to churn.
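Findings like these come straight out of a group-by on the 0/1 churn flag. A minimal sketch on hypothetical mini-data (column names follow the post's schema):

```python
import pandas as pd

# Hypothetical mini-sample with the post's column names.
df = pd.DataFrame({
    'Contract':    ['Month-to-month', 'Month-to-month', 'One year', 'Two year'],
    'Churn_Value': [1, 0, 0, 0],
})

# Churn rate per contract type: the mean of a 0/1 flag is the churn rate.
rate = df.groupby('Contract')['Churn_Value'].mean()
print(rate)
```

On the full dataset the same one-liner, applied per feature, produces the comparisons behind each bullet above.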
df['Churn_Reason'].value_counts()
Attitude of support person                   192
Competitor offered higher download speeds    189
Competitor offered more data                 162
Don't know                                   154
Competitor made better offer                 140
Attitude of service provider                 135
Competitor had better devices                130
Network reliability                          103
Product dissatisfaction                      102
Price too high                                98
Service dissatisfaction                       89
Lack of self-service on Website               88
Extra data charges                            57
Moved                                         53
Limited range of services                     44
Lack of affordable download/upload speed      44
Long distance charges                         44
Poor expertise of phone support               20
Poor expertise of online support              19
Deceased                                       6
# Combine the two streaming columns into a single 'Entertainment' feature.
df.loc[(df['Streaming_Movies'] == 'Yes') & (df['Streaming_TV'] == 'Yes'), 'Entertainment'] = 2
df.loc[(df['Streaming_Movies'] == 'No internet service') & (df['Streaming_TV'] == 'No internet service'), 'Entertainment'] = 0
df['Entertainment'].fillna(1, inplace=True)

data_ml = df.drop(['CustomerID', 'Count', 'Country', 'State', 'Streaming_TV',
                   'Streaming_Movies', 'Churn_Label', 'Churn_Score', 'CLTV',
                   'Churn_Reason'], axis=1)
col_count = pd.DataFrame({"col_name": data_ml.nunique().index,
                          "Unique_Val": data_ml.nunique()}).reset_index(drop=True)

def col_cat(col):
    x = []
    for i in col:
        if i == 2:
            x.append('Binary')
        elif (i > 2) & (i < 7):
            x.append('Categorical')
        else:
            x.append('Continuous')
    return x

col_count['Type'] = col_cat(col_count["Unique_Val"])

# Label Encoding
le = LabelEncoder()
for i in binary:
    data_ml[i] = le.fit_transform(data_ml[i])
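The `binary` list iterated over above isn't defined in the post; presumably it was pulled from the `col_count` table, along with a matching `categorical` list. A hypothetical reconstruction:

```python
import pandas as pd

# Hypothetical reconstruction: extract column-name lists per type from the
# col_count table built earlier (three example rows shown here).
col_count = pd.DataFrame({
    'col_name':   ['Gender', 'Contract', 'Monthly_Charges'],
    'Unique_Val': [2, 3, 1585],
    'Type':       ['Binary', 'Categorical', 'Continuous'],
})

binary      = col_count.loc[col_count['Type'] == 'Binary', 'col_name'].tolist()
categorical = col_count.loc[col_count['Type'] == 'Categorical', 'col_name'].tolist()
print(binary, categorical)
```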
# One-Hot Encoding (after separating the columns by type)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), list(categorical_indexes))],
                       remainder='passthrough')
X = ct.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
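The fit/transform split matters: the scaler must learn its mean and standard deviation from the training set only, and merely apply them to the test set, otherwise test-set statistics leak into training. A tiny demonstration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature; train mean is 2.0, so a test value of 2.0 must scale to 0.
X_train = np.array([[0.0], [2.0], [4.0]])
X_test  = np.array([[2.0]])

sc = StandardScaler()
sc.fit(X_train)                 # statistics come from the training set only
print(sc.transform(X_test))     # scaled with the *train* mean/std -> [[0.]]
```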

XGBoost

Support Vector Classifier

Naive Bayes Classifier

Logistic Regression

Random Forest Classifier

Models’ score metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# Only the first layer needs the input shape; the redundant input_dim
# argument on the second layer has been dropped.
model.add(Dense(256, input_shape=(X_train_scaled.shape[1],), activation='sigmoid'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy', 'Precision', 'Recall', 'AUC'])

# Train the model
model.fit(X_train_scaled, y_train, epochs=50, batch_size=100)

y_pred = (model.predict(X_test_scaled) > 0.5).astype("int32")
accuracy_score(y_test, y_pred)
>> 0.795078088026502
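With roughly 26% churners, accuracy alone can flatter a model that misses many of the positives, so it is worth looking at the confusion matrix and per-class precision/recall as well. A sketch on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels/predictions on an imbalanced 0/1 churn target:
# the model catches 2 of the 3 churners and makes no false alarms.
y_test = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

cm = confusion_matrix(y_test, y_pred)
print(cm)                                            # rows: true class
print(classification_report(y_test, y_pred, digits=3))
```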

GridSearchCV

# Initialize the estimators
clf1 = XGBClassifier()
clf2 = SVC()
clf3 = GaussianNB()
clf4 = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
clf5 = RandomForestClassifier()
# Initialize the hyperparameter dictionary for each estimator
param1 = {}
param1['classifier__n_estimators'] = [100, 250]
param1['classifier__max_depth'] = [5, 10, 20]
param1['classifier__min_child_weight'] = [70,140]
param1['classifier__subsample'] = [0.7,0.9]
param1['classifier__colsample_bytree'] = [0.8,0.6,0.4]
param1['classifier__scale_pos_weight'] = [3]
param1['classifier__eval_metric'] = ['map','auc','error']
param1['classifier'] = [clf1]
param2 = {}
param2['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param2['classifier__class_weight'] = [None, {0:1,1:5}, {0:1,1:10}, {0:1,1:25}]
param2['classifier'] = [clf2]
param3 = {}
param3['classifier__var_smoothing'] = np.logspace(0, -9, num=100)
param3['classifier'] = [clf3]
param4 = {}
param4['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
param4['classifier__penalty'] = ['l1', 'l2']
param4['classifier__class_weight'] = [None, {0:1,1:5}, {0:1,1:10}, {0:1,1:25}]
param4['classifier'] = [clf4]
param5 = {}
param5['classifier__n_estimators'] = [10, 50, 100, 250]
param5['classifier__max_depth'] = [5, 10, 20]
param5['classifier__class_weight'] = [None, {0:1,1:5}, {0:1,1:10}, {0:1,1:25}]
param5['classifier'] = [clf5]
#Create pipeline for first estimator
pipeline = Pipeline([('classifier', clf1)])
params = [param1, param2, param3, param4, param5]
# Train the grid search model
gs = GridSearchCV(pipeline, params, cv=3, n_jobs=-1, scoring='roc_auc').fit(X_train_scaled,y_train)
# ROC-AUC score for the best model
gs.best_score_
>>0.8588572289266095
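Besides `best_score_`, the search exposes which estimator and hyperparameters won, and the winner should be checked on the held-out test set. A self-contained sketch of the same pipeline pattern on toy data (the names mirror, not reproduce, the search above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy imbalanced dataset standing in for the Telco features.
X, y = make_classification(n_samples=300, weights=[0.73], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

pipe = Pipeline([('classifier', RandomForestClassifier(random_state=42))])
params = {'classifier__n_estimators': [10, 50]}
gs = GridSearchCV(pipe, params, cv=3, scoring='roc_auc').fit(X_tr, y_tr)

print(gs.best_params_)                        # winning hyperparameters
test_auc = roc_auc_score(y_te, gs.predict_proba(X_te)[:, 1])
print(round(test_auc, 3))                     # ROC-AUC on the held-out set
```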
Random Forest Classifier’s score metrics with GridSearchCV

Conclusion

Cagatay Ciftci, Data Scientist @ NETAS