Published in Geek Culture

Classification in Machine Learning: A Guide for Beginners

A step-by-step guide on how to solve a classification problem with logistic regression using a real-world dataset.

Photo by Austin Distel on Unsplash

What is Machine Learning?

Loading the Dataset

import pandas as pd
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

Understanding The Dataset

df.shape

#Output:
(7043, 21)

df.dtypes

#Output:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object

Data Preprocessing

df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
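
TotalCharges is read as `object` because some of its entries cannot be parsed as numbers. As a minimal sketch of what `errors='coerce'` does, the toy series below (made-up values, with a blank string standing in for an unparseable entry) shows such entries turning into NaN, which is why missing values appear in that column afterwards.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are made up and the
# blank string represents an entry that cannot be parsed as a number.
raw = pd.Series(['29.85', ' ', '1889.5'])

# errors='coerce' converts unparseable entries to NaN instead of raising.
converted = pd.to_numeric(raw, errors='coerce')

print(converted.isnull().sum())       # number of entries that became NaN
print(converted.fillna(0).tolist())   # NaN replaced by 0, as done later on
```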

Handling Missing Data

df.isnull().sum()

#Output
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 11
Churn 0
dtype: int64
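
# Replace the 11 missing TotalCharges values with 0
df.TotalCharges = df.TotalCharges.fillna(0)
# Normalize column names and string values: lowercase, underscores for spaces
df.columns = df.columns.str.lower().str.replace(' ', '_')
string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')
# Encode the target as 1 (churned) or 0 (stayed)
df.churn = (df.churn == 'yes').astype(int)
df.head()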
Dataset after data preprocessing

Splitting the Dataset

from sklearn.model_selection import train_test_split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)
y_train = df_train.churn.values
y_val = df_val.churn.values
del df_train['churn']
del df_val['churn']
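
The two calls above produce a three-way split. The second `test_size` is 0.25 rather than 0.2 because it is applied to the 80% of the data that remains after the test set is held out. The arithmetic below is a small sketch (the helper name `split_fractions` is hypothetical) showing that this gives roughly 60% training, 20% validation, and 20% test data.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset for a
    two-step split: hold out test_size first, then val_size of the rest."""
    remaining = 1.0 - test_size
    val = remaining * val_size
    train = remaining - val
    return train, val, test_size

train, val, test = split_fractions(test_size=0.2, val_size=0.25)
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")

# Approximate row counts for the 7043-row dataset; the exact counts depend
# on how train_test_split rounds, so treat these as estimates.
n_rows = 7043
print([round(n_rows * f) for f in (train, val, test)])
```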

Feature Engineering

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
train_dict = df_train[categorical + numerical].to_dict(orient='records')
train_dict[:1]

#Output:
[{'gender': 'male',
'seniorcitizen': 0,
'partner': 'no',
'dependents': 'no',
'phoneservice': 'yes',
'multiplelines': 'no',
'internetservice': 'no',
'onlinesecurity': 'no_internet_service',
'onlinebackup': 'no_internet_service',
'deviceprotection': 'no_internet_service',
'techsupport': 'no_internet_service',
'streamingtv': 'no_internet_service',
'streamingmovies': 'no_internet_service',
'contract': 'month-to-month',
'paperlessbilling': 'no',
'paymentmethod': 'mailed_check',
'tenure': 3,
'monthlycharges': 19.85,
'totalcharges': 64.55}]

from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train = dv.transform(train_dict)
X_train[0]

#Output:
array([ 1. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 19.85, 1. , 0. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. , 1. , 0. , 1. , 0. ,
0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. , 1. , 0. , 3. , 64.55])

dv.get_feature_names_out()

#Output:
array(['contract=month-to-month', 'contract=one_year',
'contract=two_year', 'dependents=no', 'dependents=yes',
'deviceprotection=no', 'deviceprotection=no_internet_service',
'deviceprotection=yes', 'gender=female', 'gender=male',
'internetservice=dsl', 'internetservice=fiber_optic',
'internetservice=no', 'monthlycharges', 'multiplelines=no',
'multiplelines=no_phone_service', 'multiplelines=yes',
'onlinebackup=no', 'onlinebackup=no_internet_service',
'onlinebackup=yes', 'onlinesecurity=no',
'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
'paymentmethod=credit_card_(automatic)',
'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
'streamingmovies=no', 'streamingmovies=no_internet_service',
'streamingmovies=yes', 'streamingtv=no',
'streamingtv=no_internet_service', 'streamingtv=yes',
'techsupport=no', 'techsupport=no_internet_service',
'techsupport=yes', 'tenure', 'totalcharges'], dtype=object)
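
To make the encoding above easier to follow, here is a simplified pure-Python illustration of the idea behind it: each string value becomes its own `name=value` column, numeric values are passed through, and columns are ordered by name. This is only a sketch of the behaviour, not scikit-learn's implementation, and the helper names `fit_columns` and `encode` are hypothetical.

```python
def fit_columns(records):
    """Collect the sorted column names seen in a list of feature dicts."""
    columns = set()
    for record in records:
        for name, value in record.items():
            # Strings get one column per distinct value; numbers keep their name.
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    return sorted(columns)

def encode(record, columns):
    """Turn one feature dict into a list of floats aligned with columns."""
    row = dict.fromkeys(columns, 0.0)
    for name, value in record.items():
        key = f"{name}={value}" if isinstance(value, str) else name
        if key in row:  # values not seen while fitting are ignored
            row[key] = 1.0 if isinstance(value, str) else float(value)
    return [row[c] for c in columns]

# Two made-up records with one categorical and one numerical feature.
records = [
    {'contract': 'month-to-month', 'tenure': 3},
    {'contract': 'two_year', 'tenure': 41},
]
columns = fit_columns(records)
print(columns)
print(encode(records[0], columns))
# 'one_year' was not seen while fitting, so no contract column is set.
print(encode({'contract': 'one_year', 'tenure': 12}, columns))
```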

Logistic regression

Basic Sigmoid Function
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="liblinear", random_state=42)
model.fit(X_train, y_train)
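
Logistic regression turns a linear score into a probability with the sigmoid function named above. A minimal sketch using only the standard library:

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 sits exactly on the decision boundary.
print(sigmoid(0))
# Large negative scores approach 0, large positive scores approach 1.
print(sigmoid(-5), sigmoid(5))
```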

Model Evaluation

val_dict = df_val[categorical+numerical].to_dict(orient="records")
X_val = dv.transform(val_dict)
y_pred = model.predict_proba(X_val)
y_pred[:5]

#Output:
array([[0.83279817, 0.16720183],
[0.74686651, 0.25313349],
[0.5643406 , 0.4356594 ],
[0.43763388, 0.56236612],
[0.95025844, 0.04974156]])

print("The performance of the model on the validation dataset: ",
      model.score(X_val, y_val))
print("The performance of the model on the training dataset: ",
      model.score(X_train, y_train))

#Output:
The performance of the model on the validation dataset: 0.8034066713981547
The performance of the model on the training dataset: 0.8049704142011834
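
For a classifier, `model.score` returns accuracy: the share of rows whose predicted class matches the true label. The sketch below applies that idea by hand to the five probabilities printed above, assuming the usual 0.5 decision threshold. The true labels in `y_true` are made up for illustration; they are not taken from the dataset.

```python
# Churn probabilities (second column of predict_proba) for the first five
# validation rows, copied from the output above.
churn_proba = [0.16720183, 0.25313349, 0.4356594, 0.56236612, 0.04974156]

# Predict churn when the probability reaches 0.5.
y_hat = [int(p >= 0.5) for p in churn_proba]
print(y_hat)

# Hypothetical true labels, made up for this example only.
y_true = [0, 0, 1, 1, 0]

# Accuracy is the fraction of predictions that match the labels.
accuracy = sum(a == b for a, b in zip(y_hat, y_true)) / len(y_true)
print(accuracy)
```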

Model Interpretation

print("Bias: ",model.intercept_[0])
print(dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3))))

#Output:
Bias: -0.14501424313805428
{'contract=month-to-month': 0.63, 'contract=one_year': -0.16,
'contract=two_year': -0.615, 'dependents=no': -0.054,
'dependents=yes': -0.091, 'deviceprotection=no': 0.027,
'deviceprotection=no_internet_service': -0.132,
'deviceprotection=yes': -0.04, 'gender=female': 0.015,
'gender=male': -0.16, 'internetservice=dsl': -0.327,
'internetservice=fiber_optic': 0.314, 'internetservice=no': -0.132,
'monthlycharges': 0.003, 'multiplelines=no': -0.225,
'multiplelines=no_phone_service': 0.124, 'multiplelines=yes': -0.044,
'onlinebackup=no': 0.076, 'onlinebackup=no_internet_service': -0.132,
'onlinebackup=yes': -0.089, 'onlinesecurity=no': 0.205,
'onlinesecurity=no_internet_service': -0.132, 'onlinesecurity=yes': -0.217,
'paperlessbilling=no': -0.241, 'paperlessbilling=yes': 0.096,
'partner=no': -0.076, 'partner=yes': -0.069,
'paymentmethod=bank_transfer_(automatic)': -0.107,
'paymentmethod=credit_card_(automatic)': -0.186,
'paymentmethod=electronic_check': 0.211,
'paymentmethod=mailed_check': -0.064, 'phoneservice=no': 0.124,
'phoneservice=yes': -0.269, 'seniorcitizen': 0.163,
'streamingmovies=no': -0.139, 'streamingmovies=no_internet_service': -0.132,
'streamingmovies=yes': 0.126, 'streamingtv=no': -0.059,
'streamingtv=no_internet_service': -0.132, 'streamingtv=yes': 0.046,
'techsupport=no': 0.16, 'techsupport=no_internet_service': -0.132,
'techsupport=yes': -0.173, 'tenure': -0.055, 'totalcharges': 0.0}

Predicting New Data

customer = {
'customerid': '8879-zkjof',
'gender': 'male',
'seniorcitizen': 1,
'partner': 'no',
'dependents': 'no',
'tenure': 41,
'phoneservice': 'yes',
'multiplelines': 'no',
'internetservice': 'dsl',
'onlinesecurity': 'yes',
'onlinebackup': 'no',
'deviceprotection': 'yes',
'techsupport': 'yes',
'streamingtv': 'yes',
'streamingmovies': 'yes',
'contract': 'one_year',
'paperlessbilling': 'yes',
'paymentmethod': 'bank_transfer_(automatic)',
'monthlycharges': 79.85,
'totalcharges': 2990.75,
}
X_new = dv.transform([customer])
model.predict_proba(X_new)

#Output:
array([[0.93840227, 0.06159773]])

customer2 = {
'gender': 'female',
'seniorcitizen': 1,
'partner': 'no',
'dependents': 'no',
'phoneservice': 'yes',
'multiplelines': 'yes',
'internetservice': 'fiber_optic',
'onlinesecurity': 'no',
'onlinebackup': 'no',
'deviceprotection': 'no',
'techsupport': 'no',
'streamingtv': 'yes',
'streamingmovies': 'no',
'contract': 'month-to-month',
'paperlessbilling': 'yes',
'paymentmethod': 'electronic_check',
'tenure': 1,
'monthlycharges': 85.7,
'totalcharges': 85.7
}
X_new2= dv.transform([customer2])
model.predict_proba(X_new2)

#Output:
array([[0.19738604, 0.80261396]])
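
As a rough check on how the pieces fit together, the second prediction can be reproduced by hand from the bias and weights printed under Model Interpretation: add the bias to the weights of the active features (numeric features are multiplied by their values), then apply the sigmoid. This is a sketch that uses the rounded weights shown earlier, so the result only approximates the model's output.

```python
import math

bias = -0.14501424313805428

# Rounded weights copied from the Model Interpretation output, limited to
# the features that are active for customer2.
weights = {
    'contract=month-to-month': 0.63, 'dependents=no': -0.054,
    'deviceprotection=no': 0.027, 'gender=female': 0.015,
    'internetservice=fiber_optic': 0.314, 'monthlycharges': 0.003,
    'multiplelines=yes': -0.044, 'onlinebackup=no': 0.076,
    'onlinesecurity=no': 0.205, 'paperlessbilling=yes': 0.096,
    'partner=no': -0.076, 'paymentmethod=electronic_check': 0.211,
    'phoneservice=yes': -0.269, 'seniorcitizen': 0.163,
    'streamingmovies=no': -0.139, 'streamingtv=yes': 0.046,
    'techsupport=no': 0.16, 'tenure': -0.055, 'totalcharges': 0.0,
}

# Feature values: 1 for each active one-hot column (and for seniorcitizen,
# which is 1 for this customer), raw values for the other numeric features.
features = {name: 1.0 for name in weights}
features.update({'tenure': 1, 'monthlycharges': 85.7, 'totalcharges': 85.7})

# Linear score, then sigmoid to get the churn probability.
score = bias + sum(weights[name] * features[name] for name in weights)
churn_probability = 1.0 / (1.0 + math.exp(-score))
print(round(churn_probability, 3))
```

The result lands close to the 0.8026 reported by `predict_proba`; a small gap is expected because the weights here are rounded to three decimals.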

Conclusion

Resources

A new tech publication by Start it up (https://medium.com/swlh).
