Build a Logistic Regression From Scratch with Python

hqtquynhtram
3 min read · Oct 26, 2021


This article illustrates how to build a logistic regression model from scratch.

Today, a Data Scientist's work is supported extremely well by ready-made packages. However, understanding how a model works under the hood, and being able to modify it, is sometimes required of a proficient Data Scientist, for two reasons:

  1. If you don't understand how it was built, you can't customize it.
  2. The better you understand how the code works, the better you can explain it to other team members as well as stakeholders.

Recall the 3 primary cores of the Logistic Regression model:

  1. Algorithm: the sigmoid function combined with a linear function z

Sigmoid Function

  2. Loss function: Cross Entropy

Loss Function

  3. Optimizer: Gradient Descent

Gradient Descent
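Written out under the standard formulation (with w the weight vector, b the bias, N the number of samples, and α the learning rate), the three pieces above are:

```latex
z = w^\top x + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
```

```latex
L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\right]
```

```latex
w \leftarrow w - \alpha \, \nabla_w L, \qquad \nabla_w L = \frac{1}{N} X^\top (\hat{y} - y)
```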

Dataset

Let's use the Iris dataset, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter two are not linearly separable from each other.

For the sake of simplicity, I will select only:

  • The first 2 features: sepal length and sepal width
  • The 2 classes that are linearly separable, to set up a binary classification problem for Logistic Regression
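That selection can be sketched as follows, assuming scikit-learn is available to load the dataset (the variable names are illustrative):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]   # first 2 features: sepal length, sepal width
y = iris.target
mask = y < 2           # keep classes 0 and 1, the linearly separable pair
X, y = X[mask], y[mask]
print(X.shape)         # (100, 2)
```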

Hypothesis

We hypothesize that we can find a hyperplane that linearly separates the 2 classes.

Intuitively, this looks feasible from the scatter plot below.

Scatter plot by 2 classes 0 and 1 with only 2 features

Coding the 3 primary cores

  1. Sigmoid combined with a linear function as the model's algorithm
Sigmoid function
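A minimal numpy sketch of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into (0, 1) so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))  # 0.5
```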

  2. Cross Entropy as the loss function

Cross Entropy Function
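The loss can be sketched like this; the `eps` clipping is a common numerical-stability trick (an assumption on my part, not necessarily in the author's code):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-15):
    # Clip predictions away from 0 and 1 so the logs stay finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```

For a fully uncertain prediction of 0.5 on every sample, the loss is log 2 ≈ 0.693, regardless of the labels.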

  3. Gradient Descent as the optimizer

To find the optimal weights, we run gradient descent on the model's weights to minimize the loss function.

After getting the optimal weights W, we make predictions by computing the sigmoid output and comparing it with a chosen threshold.
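The optimizer and prediction steps can be sketched as below (a numpy-only sketch; appending a bias column of ones so the intercept is learned as just another weight is my choice, and the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lr=0.1, n_iters=1000):
    # Append a bias column of ones so the intercept is just another weight.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(Xb @ w)
        grad = Xb.T @ (y_hat - y) / len(y)  # gradient of the cross-entropy loss
        w -= lr * grad
    return w

def predict(X, w, threshold=0.5):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (sigmoid(Xb @ w) >= threshold).astype(int)
```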

Now combine everything into a Logistic Regression class.
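One way the combined class could look (a sketch under the same assumptions: gradient descent on cross-entropy with an appended bias column; the class and method names are illustrative, not the author's):

```python
import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.1, n_iters=1000, threshold=0.5):
        self.lr = lr                # learning rate for gradient descent
        self.n_iters = n_iters      # number of gradient-descent steps
        self.threshold = threshold  # decision threshold on the sigmoid output
        self.w = None

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    @staticmethod
    def _add_bias(X):
        # Bias column of ones so the intercept is learned as a weight.
        return np.hstack([X, np.ones((len(X), 1))])

    def fit(self, X, y):
        Xb = self._add_bias(X)
        self.w = np.zeros(Xb.shape[1])
        for _ in range(self.n_iters):
            y_hat = self._sigmoid(Xb @ self.w)
            grad = Xb.T @ (y_hat - y) / len(y)  # cross-entropy gradient
            self.w -= self.lr * grad
        return self

    def predict_proba(self, X):
        return self._sigmoid(self._add_bias(X) @ self.w)

    def predict(self, X):
        return (self.predict_proba(X) >= self.threshold).astype(int)

    def accuracy(self, X, y):
        return (self.predict(X) == y).mean()
```

Usage would then be as simple as `model = LogisticRegression().fit(X, y)` followed by `model.accuracy(X, y)`.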

Let’s try to fit the data with the Logistic Regression model above.

1st result from the logistic regression model with default hyperparameters

Given the default hyperparameters, the accuracy of the 1st model is about 99.3%. Not bad. Let’s visualize the decision boundary!

Decision boundary of 1st Logistic Regression Model

However, I'm still not satisfied with the 1st model's result. What if we change the hyperparameters to a higher learning rate and more iterations?

Accuracy is 1. Perfect! It seems the 1st model was underfitting. Let's visualize the decision boundary one more time.

Decision boundary of 2nd Logistic Regression Model

We finished building a logistic regression model from scratch! I believe this is not difficult, and you can definitely try it on your own at home. Check out my detailed code at the Github link.

Happy learning!


Passionate about answering questions with data and building AI products. Feel free to contact me via linkedin.com/in/tramdata to share interests in data and product.