Supervised Machine Learning Algorithm Demonstration: Logistic Regression

Sasani Perera
6 min read · Jul 8, 2023


Logistic regression is a popular algorithm used for binary classification tasks in Supervised Machine Learning. But what does that mean? Well, in binary classification, we’re trying to predict one of two possible outcomes or categories. For example, we might want to predict if an email is spam or not spam, or if a student will pass or fail an exam.

Unlike linear regression, which predicts numeric values, logistic regression predicts the probability of an event belonging to a particular category. It helps us answer questions like “What is the likelihood of an email being spam?” or “What is the probability of a student passing the exam?”

Logistic regression uses a mathematical function called the logistic function (or sigmoid function) to map a weighted combination of the independent variables to the likelihood of the binary outcome. The logistic function, and its inverse (the logit), are given by the following formulas:

p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))

logit(p) = ln(p / (1 - p)) = b0 + b1*x1 + ... + bn*xn

In these equations, p is the probability of the positive outcome (the dependent or response variable), the x's are the independent variables, and the b's are the coefficients the model learns from the data.

These independent variables can be either categorical (such as types of products, colours, etc.) or numeric (such as age, temperature, etc.).

However, the dependent variable, the one we want to predict, is always categorical and falls into one of the two binary categories. By considering the relationships and patterns between the independent variables and the binary outcome, logistic regression allows us to estimate the probability of an event occurring or not occurring. This estimation is based on a logistic function, which maps the input data to a probability value between 0 and 1.
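To make this concrete, here is a minimal Python sketch of the sigmoid function; the function name and example inputs are just for illustration:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# A large negative score maps close to 0, zero maps to 0.5, a large positive score maps close to 1
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]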

Let us start training a model with an example data set, creditcard.csv.

In this demonstration, we will train a model to detect fraudulent credit card transactions.

1. Understanding the data

Create a new notebook in Google Colab.

(Screenshots: getting a new notebook in Google Colab, and adding new code/text cells whenever wanted.)

Let’s start exploring our data set. First, we import the necessary libraries.
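The basic imports for this walkthrough might look like this:

import pandas as pd   # data loading and manipulation
import numpy as np    # numeric helpers, e.g. np.nan for missing values later on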

Now we need to upload the .csv file to Colab and copy the file’s path.

(Screenshots: adding the file to Colab and copying its path.)

Then we read the .csv file into a data frame named ‘data_df’ (pandas.read_csv) and print the first few rows to get an idea about the dataset (pandas.DataFrame.head).
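A sketch of this step, assuming the file was uploaded to Colab at /content/creditcard.csv (adjust the path to wherever your copy lives):

data_df = pd.read_csv('/content/creditcard.csv')  # load the CSV into a DataFrame (path is an assumption)
data_df.head()                                     # show the first five rows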

We can get a clear idea about the number of columns and the independent/dependent variables in the data frame. In this case, the ‘Class’ column is the dependent variable and all the other columns are independent variables.

To get a better understanding of the data, we can use the following commands.
pandas.DataFrame.shape, pandas.DataFrame.columns, pandas.DataFrame.describe
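For example:

print(data_df.shape)    # (number of rows, number of columns)
print(data_df.columns)  # column names
data_df.describe()      # summary statistics for each numeric column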

2. Detect any possible missing values

When we are using a dataset to train a model, we must provide a proper, complete dataset. If our dataset contains NULL values, the model would not be accurate. So, we must confirm that our dataset does not contain any NULL values.

pandas.DataFrame.isna marks each cell as True if it is null and False otherwise. Chaining .any() aggregates this per column: a column returns True if any of its cells are null, and False if none are.
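A quick check using these standard pandas calls:

data_df.isna().any()        # True/False per column
data_df.isna().any().any()  # single True/False for the whole DataFrame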

Luckily, in this dataset, we do not have any null values. But if you do have null values (for example, empty strings), you can replace them with np.nan and then drop the rows or columns containing NaN.

data_df = data_df.replace('', np.nan)  # replace empty strings with NaN
data_df = data_df.dropna(axis=0)       # drop rows that contain NaN
data_df = data_df.dropna(axis=1)       # drop columns that contain NaN

# Summarise how many null values each column contains, and what percentage of the rows that is
null_columns = pd.DataFrame({'Columns': data_df.isna().sum().index,
                             'No. Null Values': data_df.isna().sum().values,
                             'Percentage': data_df.isna().sum().values / data_df.shape[0]})
null_columns

data_df.isna().any()  # check each column to see if it still contains any null cells

3. Model training data preparation

Now we have a complete dataset. We divide this data set into two parts: training data and test data.
80% of the data set -> training data
20% of the data set -> test data
(the percentage can be customized accordingly)

Using train_test_split we can split the assigned x and y variables into training and testing data. Then we have four data sets: x-Train, x-Test, y-Train, and y-Test.

from sklearn.model_selection import train_test_split
x=data_df.drop(['Class'], axis = 1) #drop the dependent variable and assign all the independent variables to x
y=data_df['Class'] #assign the dependent variable to y

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)

We can have a look at the shape of these datasets.
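For example:

print(xtrain.shape, xtest.shape)  # roughly 80% vs 20% of the rows, all feature columns
print(ytrain.shape, ytest.shape)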

4. Applying Logistic Regression

Using LogisticRegression, we create a model named “logisticreg”. Then our training data are fitted to this model in order to train it.

Now we predict values for our x-Test data. “ypredicted” is the set of predictions produced by our model using predict(x).
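A sketch of these two steps using scikit-learn’s LogisticRegression; the variable names follow the ones used above, and the larger max_iter is an assumption to help the solver converge on this dataset:

from sklearn.linear_model import LogisticRegression

logisticreg = LogisticRegression(max_iter=1000)  # create the model (max_iter raised as an assumption)
logisticreg.fit(xtrain, ytrain)                  # train the model on the training data

ypredicted = logisticreg.predict(xtest)          # predict classes for the test data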

5. Accuracy

Now we must check the accuracy of our model using accuracy_score. For that, we are using our predicted dataset and the actual data set corresponding to the x-Test dataset.
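For example:

from sklearn.metrics import accuracy_score

accuracy_score(ytest, ypredicted)  # fraction of test transactions classified correctly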

Or we can look at the confusion matrix of our model. Our model works well when most predictions fall into the true positive and true negative cells of the confusion matrix.

Here we get a 2×2 confusion matrix, because there are two classes in our output: ‘1’ and ‘0’.

The confusion matrix consists of four basic numbers that are used to define the measurement metrics of the classifier. In our fraud-detection setting, these four numbers are:
1. TP (True Positive): the number of fraudulent transactions correctly classified as fraudulent.
2. TN (True Negative): the number of legitimate transactions correctly classified as legitimate.
3. FP (False Positive): the number of legitimate transactions incorrectly classified as fraudulent. FP is also known as a Type I error.
4. FN (False Negative): the number of fraudulent transactions incorrectly classified as legitimate. FN is also known as a Type II error.

Performance metrics of an algorithm are accuracy, precision, recall, and F1 score, which are calculated on the basis of the above-stated TP, TN, FP, and FN.

Accuracy of an algorithm is the ratio of correctly classified transactions (TP+TN) to the total number of transactions (TP+TN+FP+FN).

Precision of an algorithm is the ratio of correctly classified fraudulent transactions (TP) to the total transactions predicted to be fraudulent (TP+FP).

(Screenshots: the confusion matrix, then assigning values from the matrix to separate variables and computing the accuracy.)
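A sketch of that computation using scikit-learn’s confusion_matrix (the variable names are illustrative):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(ytest, ypredicted)
print(cm)

tn, fp, fn, tp = cm.ravel()                  # unpack the four cells of the 2x2 matrix
accuracy = (tp + tn) / (tp + tn + fp + fn)   # same definition as above
precision = tp / (tp + fp)                   # of the transactions flagged as fraud, how many really are
print(accuracy, precision)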

Now we have trained a model that predicts whether a transaction is fraudulent or not with an accuracy of 99.89%.

Complete code: Fraud_Detection.ipynb

In the next article, we will train a model with Naive Bayes.

Thank you and Happy Reading!

Follow For More.
