Classification in Machine Learning: A Guide for Beginners

A step-by-step guide on how to solve a classification problem with logistic regression using a real-world dataset.

Tirendaz AI
Geek Culture
10 min read · Nov 18, 2022

Photo by Austin Distel on Unsplash

Data is the new oil of today. Many companies and governments are now trying to extract information from big data using machine learning techniques. Machine learning is growing very fast, and new tools and libraries are being developed in this field almost every day. These tools help you easily implement machine learning projects.

In this post, I’ll talk about the classification problem. Here are the topics I’ll cover:

  • What is machine learning?
  • What is the logistic regression algorithm?
  • How to solve a classification problem with logistic regression?
  • How to predict new data?

Let’s dive in!

What is Machine Learning?

Machine learning is a subfield of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. If the dataset has labels, you can use supervised learning algorithms. If there are no labels in the dataset, you can use unsupervised learning algorithms. In reinforcement learning, the system learns by interacting with the environment.

Let’s take a closer look at supervised learning. Supervised learning models are the most used algorithms in machine learning. Supervised learning is divided into regression and classification, and it is usually easy to tell which one a problem is. If the label is a continuous numeric value, it is a regression problem; if the label is a category, it is a classification problem.

In this post, I’m going to cover the classification problem with logistic regression. Note that logistic regression, despite its name, is a classification model rather than a regression model. To show how to solve the classification problem, I’m going to use the Telco Customer Churn dataset.

Loading the Dataset

First, let’s load our dataset with the read_csv method and then take a look at the first five rows of the dataset with the head method.

import pandas as pd
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

The dataset includes information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service.

If a company can predict whether a customer will leave the service, it can try to retain that customer. The last column, Churn, indicates whether the customer has canceled the contract: Yes if the customer has canceled, No if not. This variable is our target variable. The other variables are called features.

Understanding The Dataset

Let’s take a look at the shape of the dataset.

df.shape

#Output
(7043, 21)

As you can see, the dataset consists of 7043 rows and 21 columns. Let me show you the types of columns in the dataset.

df.dtypes

#Output:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object

Data Preprocessing

Data preprocessing is one of the most important steps in machine learning. This step often takes the most time for data scientists; a commonly quoted rule of thumb is that around 80 percent of a project is spent on data preprocessing.

Note that Pandas automatically determines the type of each column. However, the data types of the columns are sometimes determined incorrectly. It is an important task to check the type of columns before building the model.

For example, the SeniorCitizen column consists of 0 and 1, so Pandas stores it as a numeric type. You could convert this column into the object type. However, since the column only takes the values 0 and 1, it already works as a binary feature, and you don’t need to convert it.

TotalCharges refers to the total payment. The payment has to be a numeric value. Let’s convert this column to the numeric type with the to_numeric method.

df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')

Here, I used the errors='coerce' argument to convert non-numeric values to NaN.
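To see what errors='coerce' does in isolation, here is a minimal sketch on a toy Series rather than the real dataset. The values, and the assumption that a missing entry shows up as a blank string, are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (hypothetical values).
# The blank string plays the role of a missing entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # number of values that became NaN
```

Without errors='coerce', the same call would raise an error on the blank string, which is why the argument is needed here.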

Handling Missing Data

Let’s take a look at missing data in the dataset with the isnull().sum() method.

df.isnull().sum()

#Output
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 11
Churn 0
dtype: int64

As you can see, the TotalCharges column has missing data. Let’s assign 0 to the missing values with the fillna method.

df.TotalCharges = df.TotalCharges.fillna(0)

In the dataset, some column names start with an uppercase letter and others with a lowercase letter. Let’s convert all column names to lowercase and replace any spaces in them with underscores.

df.columns = df.columns.str.lower().str.replace(' ', '_')

Note that some column values of type Object contain spaces and are case-mismatched. Let’s standardize these values.

string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')
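The effect of these two standardization steps is easier to see on a tiny example. The sketch below uses a made-up two-column DataFrame. Its column names are hypothetical and contain a space on purpose, so that the column-name step is visible too.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Telco dataset.
toy = pd.DataFrame({
    "Payment Method": ["Bank transfer (automatic)", "Mailed check"],
    "Contract": ["Month-to-month", "One year"],
})

# Standardize the column names: lowercase, spaces replaced by underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Standardize the string values the same way (both toy columns hold strings).
for col in toy.columns:
    toy[col] = toy[col].str.lower().str.replace(" ", "_")

print(list(toy.columns))
print(toy["payment_method"].tolist())
```

This is why values such as bank_transfer_(automatic) and month-to-month appear in lowercase with underscores later in the post.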

Now, let’s handle the target variable and convert the values of this column into numeric.

df.churn = (df.churn == 'yes').astype(int)

Let’s take a look at the final version of the dataset.

df.head()
Dataset after data preprocessing

Splitting the Dataset

In machine learning, the dataset is split into training and testing. The model is built with the training data and the model is evaluated with the test data. Let’s split the dataset with the train_test_split method.

from sklearn.model_selection import train_test_split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)

In machine learning, validation data is used to measure the performance of the model. With this data, you can fine-tune the hyperparameters to find the best model. Let’s create validation data with the train_test_split method.

df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)
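The two calls above leave roughly 60 percent of the rows for training, 20 percent for validation, and 20 percent for testing, because 25 percent of the remaining 80 percent is 20 percent of the whole. A quick sketch on a toy DataFrame of 100 rows (made up for illustration) shows the resulting sizes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows, used only to check the split sizes.
toy = pd.DataFrame({"x": range(100)})

# First split: hold out 20% as the test set.
toy_train_full, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80 rows becomes the validation set.
toy_train, toy_val = train_test_split(toy_train_full, test_size=0.25, random_state=11)

print(len(toy_train), len(toy_val), len(toy_test))  # 60 20 20
```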

Now let’s create the target variables and remove the target columns from the training and validation sets.

y_train = df_train.churn.values
y_val = df_val.churn.values
del df_train['churn']
del df_val['churn']

Feature Engineering

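Machine learning algorithms work with numeric values, so categorical variables have to be converted into numbers first. One common way to do this is one-hot encoding, which creates a separate 0/1 column for each category. You can perform one-hot encoding with the get_dummies function in Pandas or the OneHotEncoder class in Scikit-learn.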
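As a quick illustration of the idea, here is a minimal sketch of one-hot encoding with get_dummies on a made-up three-row DataFrame. The data is hypothetical; only the category names mirror the contract column of this dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
toy = pd.DataFrame({
    "contract": ["month-to-month", "one_year", "two_year"],
    "tenure": [3, 12, 40],
})

# Each category of "contract" becomes its own indicator column;
# the numeric column "tenure" is left unchanged.
encoded = pd.get_dummies(toy, columns=["contract"])

print(sorted(encoded.columns))
```

Every row has exactly one active indicator among the contract columns, which is where the name "one-hot" comes from.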

In this tutorial, I’m going to use the DictVectorizer class for one-hot encoding. This class expects the data as a list of dictionaries, so we will convert the data into that structure. First, let’s create variables for categorical and numeric columns.

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

Now let’s convert the training set to the dictionary.

train_dict = df_train[categorical + numerical].to_dict(orient='records')

Let’s look at the first row of this variable.

train_dict[:1]

#Output:
[{'gender': 'male',
'seniorcitizen': 0,
'partner': 'no',
'dependents': 'no',
'phoneservice': 'yes',
'multiplelines': 'no',
'internetservice': 'no',
'onlinesecurity': 'no_internet_service',
'onlinebackup': 'no_internet_service',
'deviceprotection': 'no_internet_service',
'techsupport': 'no_internet_service',
'streamingtv': 'no_internet_service',
'streamingmovies': 'no_internet_service',
'contract': 'month-to-month',
'paperlessbilling': 'no',
'paymentmethod': 'mailed_check',
'tenure': 3,
'monthlycharges': 19.85,
'totalcharges': 64.55}]

Now let’s convert categorical values to one-hot encoding.

from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train = dv.transform(train_dict)
X_train[0]

#Output:
array([ 1. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 19.85, 1. , 0. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. , 1. , 0. , 1. , 0. ,
0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. , 1. , 0. , 3. , 64.55])

You can also see the names of the resulting columns with the get_feature_names_out() method.

dv.get_feature_names_out()

#Output:
array(['contract=month-to-month', 'contract=one_year',
'contract=two_year', 'dependents=no', 'dependents=yes',
'deviceprotection=no', 'deviceprotection=no_internet_service',
'deviceprotection=yes', 'gender=female', 'gender=male',
'internetservice=dsl', 'internetservice=fiber_optic',
'internetservice=no', 'monthlycharges', 'multiplelines=no',
'multiplelines=no_phone_service', 'multiplelines=yes',
'onlinebackup=no', 'onlinebackup=no_internet_service',
'onlinebackup=yes', 'onlinesecurity=no',
'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
'paymentmethod=credit_card_(automatic)',
'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
'streamingmovies=no', 'streamingmovies=no_internet_service',
'streamingmovies=yes', 'streamingtv=no',
'streamingtv=no_internet_service', 'streamingtv=yes',
'techsupport=no', 'techsupport=no_internet_service',
'techsupport=yes', 'tenure', 'totalcharges'], dtype=object)

Logistic regression

To solve a classification problem, you can use many algorithms, such as Naive Bayes, random forest, and artificial neural networks. As a general rule, start with the simplest model. If its performance is not good enough, you can then try more complex models.

As you know, regression models are used when the target variable is numerical. Logistic regression is a linear model, but it is used to solve classification problems. That is because it passes the output of the linear model through the sigmoid function.

Basic Sigmoid Function

The sigmoid function maps any real value to a value between 0 and 1, so the outcome of logistic regression can be read as a probability. Let’s now build a logistic regression model with Scikit-learn.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="liblinear", random_state=42)
model.fit(X_train, y_train)

Here I have used the solver="liblinear" parameter. The Scikit-learn documentation describes this solver as a good choice for small datasets.
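To make the sigmoid mapping described in this section concrete, here is a minimal sketch of the function in plain Python. It is only an illustration of the formula 1 / (1 + e^(-z)), not the code Scikit-learn runs internally.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 3))
```

In logistic regression, z is the linear combination of the features (the bias plus each coefficient times its feature value), and sigmoid(z) is the predicted probability of the positive class.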

Model Evaluation

We’re going to use validation data to understand the performance of the model. Let’s preprocess the validation data as we did the training data before.

val_dict = df_val[categorical+numerical].to_dict(orient="records")
X_val = dv.transform(val_dict)
y_pred = model.predict_proba(X_val)
y_pred[:5]

#Output:
array([[0.83279817, 0.16720183],
[0.74686651, 0.25313349],
[0.5643406 , 0.4356594 ],
[0.43763388, 0.56236612],
[0.95025844, 0.04974156]])

As you can see, the predictions are between 0 and 1. Each row holds two probabilities: the first column is the probability that the customer does not churn (class 0), and the second column is the probability that the customer churns (class 1). The model predicts 1 if the churn probability is greater than 0.5, and 0 otherwise. Now let’s see the performance of the model with the score method on the validation and training datasets.

print("The performance of the model on the validation dataset: ",
model.score(X_val, y_val))
print("The performance of the model on the training dataset: ",
model.score(X_train, y_train))

#Output:
The performance of the model on the validation dataset: 0.8034066713981547
The performance of the model on the training dataset: 0.8049704142011834

As you can see, the model’s accuracy on both the validation and training datasets is about 80 percent. We want both scores to be as close to 1 as possible and close to each other: a training score far above the validation score signals overfitting, while low scores on both signal underfitting. Here the two scores are almost identical, so the model we built is not bad.
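For a classifier, the score method reports accuracy: the fraction of predictions that match the true labels. The sketch below reproduces that calculation by hand on the five churn probabilities printed earlier. The true labels are invented for the example, since the post does not show them.

```python
# Churn probabilities (second column) of the first five validation rows
# shown in the predict_proba output above.
churn_probs = [0.16720183, 0.25313349, 0.4356594, 0.56236612, 0.04974156]

# Hypothetical true labels, made up only for this illustration.
y_true = [0, 0, 1, 1, 0]

# Apply the 0.5 threshold to turn probabilities into class predictions.
y_hat = [int(p > 0.5) for p in churn_probs]

# Accuracy is the share of predictions that match the true labels.
accuracy = sum(int(a == b) for a, b in zip(y_hat, y_true)) / len(y_true)

print(y_hat)     # [0, 0, 0, 1, 0]
print(accuracy)  # 0.8
```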

Model Interpretation

During training, the model learned a bias (intercept) and a coefficient for each feature. Building a good model means finding the best combination of coefficients. After training the model, we can see these values through the intercept_ and coef_ attributes in Scikit-learn as follows:

print("Bias: ",model.intercept_[0])
print(dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3))))

#Output:
Bias: -0.14501424313805428
{'contract=month-to-month': 0.63, 'contract=one_year': -0.16,
'contract=two_year': -0.615, 'dependents=no': -0.054,
'dependents=yes': -0.091, 'deviceprotection=no': 0.027,
'deviceprotection=no_internet_service': -0.132,
'deviceprotection=yes': -0.04, 'gender=female': 0.015,
'gender=male': -0.16, 'internetservice=dsl': -0.327,
'internetservice=fiber_optic': 0.314, 'internetservice=no': -0.132,
'monthlycharges': 0.003, 'multiplelines=no': -0.225,
'multiplelines=no_phone_service': 0.124, 'multiplelines=yes': -0.044,
'onlinebackup=no': 0.076, 'onlinebackup=no_internet_service': -0.132,
'onlinebackup=yes': -0.089, 'onlinesecurity=no': 0.205,
'onlinesecurity=no_internet_service': -0.132, 'onlinesecurity=yes': -0.217,
'paperlessbilling=no': -0.241, 'paperlessbilling=yes': 0.096,
'partner=no': -0.076, 'partner=yes': -0.069,
'paymentmethod=bank_transfer_(automatic)': -0.107,
'paymentmethod=credit_card_(automatic)': -0.186,
'paymentmethod=electronic_check': 0.211,
'paymentmethod=mailed_check': -0.064, 'phoneservice=no': 0.124,
'phoneservice=yes': -0.269, 'seniorcitizen': 0.163,
'streamingmovies=no': -0.139, 'streamingmovies=no_internet_service': -0.132,
'streamingmovies=yes': 0.126, 'streamingtv=no': -0.059,
'streamingtv=no_internet_service': -0.132, 'streamingtv=yes': 0.046,
'techsupport=no': 0.16, 'techsupport=no_internet_service': -0.132,
'techsupport=yes': -0.173, 'tenure': -0.055, 'totalcharges': 0.0}

Here, each coefficient is the change in the log odds of churn for a one-unit increase in that feature. Once you exponentiate the coefficients, they turn into odds ratios, which are easier to interpret. A negative coefficient gives an odds ratio below 1, meaning the feature lowers the odds of churn relative to the baseline; a positive coefficient gives an odds ratio above 1, meaning it raises the odds relative to the baseline.
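As a small worked example, the sketch below exponentiates three of the coefficients printed above. Picking these three is only for illustration; the same calculation applies to every coefficient.

```python
import math

# Three coefficients copied from the printed output above.
coefs = {
    "contract=month-to-month": 0.63,
    "contract=two_year": -0.615,
    "tenure": -0.055,
}

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of churn)")
```

A month-to-month contract has an odds ratio above 1, while a two-year contract and each additional month of tenure have odds ratios below 1, which matches the signs of the coefficients.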

Predicting New Data

After building the model, you can predict new data that the model has not seen before. To do this, I’m going to use the values of one customer as follows:

customer = {
'customerid': '8879-zkjof',
'gender': 'male',
'seniorcitizen': 1,
'partner': 'no',
'dependents': 'no',
'tenure': 41,
'phoneservice': 'yes',
'multiplelines': 'no',
'internetservice': 'dsl',
'onlinesecurity': 'yes',
'onlinebackup': 'no',
'deviceprotection': 'yes',
'techsupport': 'yes',
'streamingtv': 'yes',
'streamingmovies': 'yes',
'contract': 'one_year',
'paperlessbilling': 'yes',
'paymentmethod': 'bank_transfer_(automatic)',
'monthlycharges': 79.85,
'totalcharges': 2990.75,
}

First, let’s preprocess the data with the transform method and then predict the label of this data using our model.

x_new = dv.transform([customer])
model.predict_proba(x_new)

#Output:
array([[0.93840227, 0.06159773]])

As you can see, the model estimates the probability of this customer leaving the service at about 6 percent and the probability of staying at about 94 percent. As a data scientist, you can tell the company you work for that this customer is unlikely to churn, so there is no need to offer this customer a promotion. Now let’s take another customer’s data and predict the label.

customer2 = {
'gender': 'female',
'seniorcitizen': 1,
'partner': 'no',
'dependents': 'no',
'phoneservice': 'yes',
'multiplelines': 'yes',
'internetservice': 'fiber_optic',
'onlinesecurity': 'no',
'onlinebackup': 'no',
'deviceprotection': 'no',
'techsupport': 'no',
'streamingtv': 'yes',
'streamingmovies': 'no',
'contract': 'month-to-month',
'paperlessbilling': 'yes',
'paymentmethod': 'electronic_check',
'tenure': 1,
'monthlycharges': 85.7,
'totalcharges': 85.7
}

Now, let’s predict the label of data according to our model.

X_new2 = dv.transform([customer2])
model.predict_proba(X_new2)

#Output:
array([[0.19738604, 0.80261396]])

As you can see, the probability of this customer leaving the service is about 80 percent, so you can suggest that the company offer a promotion to this customer.

Conclusion

The most used models in machine learning are supervised learning models. Supervised learning is divided into regression and classification. If the data label is categorical, you can use classification algorithms.

In this post, I talked about how to solve a classification problem with logistic regression. The model we built estimates the probability that a customer will churn, which helps a company decide which customers to try to retain.

The notebook used in this article can be found here.

Thanks for reading. I hope you enjoyed this post.