Logistic Regression From Scratch

Daniil Egorov · Published in Geek Culture · May 5, 2021 · 10 min read
Visualization for the Iris Dataset (Setosa vs rest)

Introduction

This article’s goal is to provide the reader with a solid understanding of the logistic regression technique. It assumes familiarity with the basics of probability theory, linear algebra, and differential calculus. The first part explores the rationale behind logistic regression alongside the necessary mathematical details. The second part focuses on the implementation, for which I am using Python.

Why Use Logistic Regression?

Linear model vs logistic model

It is tempting to resort to the old familiar linear regression even when the target variable is dichotomous (a.k.a. binary); however, such a model proves invalid for modeling probabilities and, therefore, for producing adequate classifications. The reason is that its output is unbounded, stretching all the way from -inf to +inf. The solution involves a function that performs great at modeling probabilities, namely the sigmoid function:

σ(x) = 1 / (1 + e^(-x))

You may also occasionally run into it in a different but equivalent form:

σ(x) = e^x / (1 + e^x)

Thus, the function value approaches 1 as x goes to +inf and 0 as x goes to -inf. Logistic regression takes advantage of this behavior when making its probabilistic assumptions.
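As a quick sanity check, here is a minimal NumPy sketch of the sigmoid (the function and variable names are mine, not part of the implementation below) that illustrates this limiting behavior:

import numpy as np

def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10., 0., 10.])))  # ~[0.0000454, 0.5, 0.9999546]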

Theory

Like in linear regression, the input of logistic regression is the linear combination of the features and the weights (coefficients) associated with them:

z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

which we are going to rewrite as:

z = w₀·1 + w₁x₁ + w₂x₂ + … + wₙxₙ = wᵀx

If you feel uncomfortable with multiplying matrices, see http://matrixmultiplication.xyz for an excellent visualization

The appearance of w-naught and the 1 might seem confusing; however, it is the same y-intercept that beta-naught represents. In the context of weights, w-naught is often called the bias term.
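To make the bias-term trick concrete, here is a small illustrative NumPy sketch (the arrays are made up for the example): prepending a column of ones to the feature matrix lets a single dot product handle the intercept together with the other weights.

import numpy as np

X = np.array([[2., 3.],
              [1., 5.]])          # two observations, two features
w = np.array([0.5, -1., 2.])      # w0 (the bias term), w1, w2

# Prepend the column of ones that multiplies w0.
X_with_bias = np.concatenate([np.ones((len(X), 1)), X], axis=1)

print(X_with_bias @ w)            # [4.5 9.5] = w0*1 + w1*x1 + w2*x2 for each row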

With this input plugged into the sigmoid function, we have the probabilistic form of logistic regression:

P(y = 1 | x) = 1 / (1 + e^(-wᵀx))

However, it is also common to see the logarithm of the odds, which is involved in the so-called logit form of logistic regression:

log(P / (1 - P)) = wᵀx

But why might we want to use this weird measure? The decision to express the linear combination of features and weights through the log of odds has a two-fold basis: first, it brings back the familiar “straight-line” notation and, second, it makes the coefficients somewhat more interpretable.

In the left plot, the classes (0 and 1) are at -inf and +inf because the logarithm is undefined at P ∈ {0, 1}
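The correspondence between the two forms is easy to verify numerically; the short sketch below (the helper names are mine, for illustration only) converts probabilities to log-odds and back with the sigmoid:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def log_odds(p):
    # Undefined at p = 0 and p = 1.
    return np.log(p / (1 - p))

p = np.array([0.1, 0.5, 0.9])
print(log_odds(p))            # [-2.197  0.     2.197]
print(sigmoid(log_odds(p)))   # [0.1 0.5 0.9] -- the round trip recovers the probabilities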

Thus, how we place the logit straight line determines what the probability curve is going to look like. The question arises: how do we pick the weights that produce the best-fitting curve?

Maximum Likelihood

The OLS (Ordinary Least Squares) method, with which we optimize the parameters of linear regression, is not going to work with the logit function: whatever the line, the distances to the data points are always going to be infinite. Instead, we introduce the concept of maximum likelihood, where the likelihood is the probability of observing our data given that the model is true. For a single observation:

P(y_j | x_j) = p_j^(y_j) · (1 − p_j)^(1 − y_j),   where p_j = σ(wᵀx_j)

When y_j = 1, the right factor becomes 1 and we get the probability of observing 1 when the sample belongs to class 1. When y_j = 0, the left factor becomes 1 and we get the probability of observing 0 when the sample belongs to class 0.

Note that MLE (Maximum Likelihood Estimation) requires independence of the observations, which is what allows multiplying the probabilities across the whole dataset:

L(w) = ∏_j p_j^(y_j) · (1 − p_j)^(1 − y_j)

However, due to the rounding issues of floating-point numbers, we are at risk of underflow when multiplying probabilities, which would almost certainly result in a likelihood of zero regardless of the weights we choose. That is why, when doing MLE, the likelihood function is log-transformed, which converts the product of probabilities into the sum of their logs:

LL(w) = Σ_j [ y_j·log(p_j) + (1 − y_j)·log(1 − p_j) ]
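Here is a small sketch of why the log-transform matters in practice (the array of probabilities is made up): the raw product underflows to zero, while the sum of logs stays perfectly representable.

import numpy as np

rng = np.random.RandomState(42)
probs = rng.uniform(0.1, 0.9, size=5000)   # 5,000 made-up per-observation probabilities

print(np.prod(probs))          # 0.0 -- the product underflows
print(np.sum(np.log(probs)))   # a finite negative number, roughly -4150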

Gradient Descent

Gradient descent is a numerical optimization technique that iteratively approaches the minimum of the cost function J(w). This is the outline of the algorithm:

  • The learning rate η (a number typically between .001 and .1) and the number of iterations (it is recommended to take at least 1000) are chosen;
  • Initial weights are generated, which are usually random numbers;
  • The gradient (derivative) is computed with respect to every parameter w_i involved;
  • New weights are computed with the following formula, and the last two steps are repeated until the gradient (approximately) vanishes or the iteration limit is reached:

w_i := w_i − η · ∂J/∂w_i
As the value of the gradient decreases, smaller and smaller steps are taken towards the minimum (source: https://www.kdnuggets.com/2018/06/intuitive-introduction-gradient-descent.html)

Note: take care with which learning rate you pick. Should you choose a relatively large value, your weights are going to bounce around indefinitely without ever converging. On the contrary, if η is too small, you may well grow old before the value of the gradient reaches the tolerance threshold (an arbitrarily chosen parameter close to zero at which the algorithm stops).

source: https://srdas.github.io/DLBook/GradientDescentTechniques.html
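To make the outline above concrete, here is a minimal, self-contained sketch of gradient descent on a toy one-dimensional cost J(w) = (w − 3)²; the values of η, the iteration limit, and the tolerance are made up for the example:

# Toy example: minimize J(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
eta = 0.1          # learning rate
n_iter = 1000      # iteration limit
tolerance = 1e-5   # stop once the gradient is (almost) zero

w = 10.0           # arbitrary initial weight
for _ in range(n_iter):
    grad = 2 * (w - 3)
    w -= eta * grad                   # the update rule: w := w - eta * dJ/dw
    if abs(grad) < tolerance:
        break

print(w)  # ~3.0, the minimum of J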

To abide by convention, we define our cost function J as the negative of the log-likelihood:

J(w) = −LL(w) = −Σ_j [ y_j·log(p_j) + (1 − y_j)·log(1 − p_j) ]

since otherwise we would be performing gradient ascent, which is mentioned far less commonly. Below is the full derivation of the gradient of J.

Let us decompose the problem and first derive the gradients of the two terms within the sum, using the fact that ∂p_j/∂w_i = p_j(1 − p_j)·x_ji:

∂/∂w_i [ −y_j·log(p_j) ] = −y_j·(1 − p_j)·x_ji

∂/∂w_i [ −(1 − y_j)·log(1 − p_j) ] = (1 − y_j)·p_j·x_ji

then tie everything together and cancel out the redundancies:

∂J/∂w_i = Σ_j (p_j − y_j)·x_ji

So, our objective is to derive the gradients for all the weights and drive them to zero. Although, theoretically, the task could be approached analytically with a system of equations, things quickly get out of hand as the number of features increases. Imagine teaching a computer to find an analytical solution with 10 000 features involved and, therefore, a system of 10 001 (remember the intercept) derivatives all set to zero. See? Numerical methods save the day.

Multiclass Logistic Regression

Although logistic regression is, by its nature, built to tell apart only two classes, it can be adapted for multiclass (n > 2) classification. This article covers the One-vs-Rest method, which implies fitting n binary logistic models on n differently preprocessed datasets. The concept is best understood with an example:

Suppose we need to distinguish between Setosa, Versicolour, and Virginica from the Iris dataset, which must be quite familiar to those acquainted with Machine Learning. We have to build 3 different logistic models:

  • Setosa (1) vs Versicolour and Virginica (0);
  • Versicolour (1) vs Setosa and Virginica (0);
  • Virginica (1) vs Setosa and Versicolour (0),

and then we aggregate the results, the final prediction being the class whose model yields the highest probability of observing 1. Say, for an observation, the probability of Setosa is .035, of Versicolour .23, and of Virginica .0006. Versicolour is then going to be the predicted class.
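Here is a minimal sketch of that aggregation step (the class names and probabilities come from the example above; the array names are mine):

import numpy as np

classes = np.array(['Setosa', 'Versicolour', 'Virginica'])
probas = np.array([.035, .23, .0006])   # one probability per "class vs rest" model

print(classes[np.argmax(probas)])       # Versicolour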

Logistic Regression in Python

Congratulations on grasping the theory and reaching the second part of the article. Here we are going to build logistic regression in Python. We will do it in an object-oriented fashion, so if your grasp on OOP feels loose, consider brushing up on it first. For the basics, I recommend Corey Schafer’s minicourse.

For this implementation, you also need numpy installed.

We will start off by declaring the class and specifying the parameters set on initialization:

import numpy as np


class LogisticRegression:
    def __init__(self, eta=.01, n_iter=100000, tolerance=1e-5, random_state=42):
        self._eta = eta
        self._n_iter = n_iter
        self._tolerance = tolerance
        self.__random_state = random_state

A couple of notes on the parameters:

  • eta = η = the learning rate;
  • n_iter is the number of iterations taken before gradient descent stops;
  • tolerance is the threshold which all the gradients need to reach for the algorithm to stop;
  • random_state is the random seed, on which the initial weights are generated. Specified for reproducibility.

We add the _sigmoid utility static method, which takes a matrix of features and an array of weights and produces an array of probabilities:

    @staticmethod
    def _sigmoid(X, w):
        denominator = 1 + np.exp(-(X * w).sum(axis=1))
        return 1 / denominator
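As a quick illustration of what _sigmoid computes (the arrays below are made up for the example): each row of X, whose first column is the bias term of ones, is multiplied element-wise by w, summed, and pushed through the sigmoid.

X = np.array([[1., 2.],
              [1., -2.]])
w = np.array([0.5, 1.0])

print(LogisticRegression._sigmoid(X, w))  # ~[0.924, 0.182]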

We now focus entirely on the fit method, which I am going to explain step-by-step via the integrated comments. I have included the normalize parameter that optionally scales the features, which makes gradient descent convergence much smoother and faster:

    def fit(self, X, y, normalize=False):
        # Convert the input into numpy arrays.
        # Make y a 1-d array in case a matrix is supplied.
        self._X = np.array(X, dtype=np.dtype('float64'))
        self._y = np.array(y, dtype=np.dtype('int64')).squeeze()
        # We save normalize as an instance attribute;
        # if set to True, test data in the predict method
        # will also be normalized.
        # We also save the data range as well as the minimum value.
        self.__normalize = normalize
        if self.__normalize:
            self.__Xmin = self._X.min(axis=0)
            self.__Xrange = self._X.max(axis=0) - self.__Xmin
            self._X = (self._X - self.__Xmin) / self.__Xrange
        # Check if the problem is multiclass.
        self._classes = np.unique(self._y)
        if len(self._classes) > 2:
            # If we have more than 2 classes,
            # we set the corresponding boolean to True
            # and prepare a container for n binary models.
            self.__multiclass = True
            self.models_ = []
            for class_ in self._classes:
                # Set 1 where an observation belongs to the
                # class and 0 where it does not.
                y = np.zeros(shape=self._y.shape)
                y[self._y == class_] = 1
                # Initialize and fit the model.
                lr = LogisticRegression(
                    eta=self._eta,
                    n_iter=self._n_iter,
                    tolerance=self._tolerance,
                    random_state=self.__random_state
                )
                lr.fit(self._X, y)
                # We leave normalize as False here regardless
                # of whether or not the main model has True
                # because, if it does, self._X is already normalized.
                # Instead, we set the necessary attributes after
                # the model is fit.
                if self.__normalize:
                    lr.__normalize = self.__normalize
                    lr.__Xmin = self.__Xmin
                    lr.__Xrange = self.__Xrange
                self.models_.append(lr)
            self.__fit = True
            return self
        else:
            self.__multiclass = False
            # We add the bias term to the data to fit the intercept.
            self._X = np.concatenate(
                [np.ones(shape=(len(X), 1)), self._X],
                axis=1
            )
            # Generate the initial weights.
            rs = np.random.RandomState(seed=self.__random_state)
            self.w_ = rs.normal(size=(self._X.shape[1], ))
            # Gradient descent.
            for _ in range(self._n_iter):
                grad = ((self._sigmoid(self._X, self.w_) - self._y)[:, np.newaxis] * self._X).sum(axis=0)
                self.w_ -= self._eta * grad
                if all(np.absolute(grad) < self._tolerance):
                    break
            self.intercept_ = self.w_[0]
            self.coef_ = self.w_[1:]
            self.__fit = True

            return self

Let us test out what we have so far:

from sklearn.model_selection import train_test_split
from sklearn import datasets
data = datasets.load_iris()
X, y = data['data'][:, 3:4], data['target'] # Use one feature for simplicity (petal width).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

lr = LogisticRegression(eta=.05)
lr.fit(X_train, y_train, normalize=True)
print(lr.models_[0].w_)
print(lr.models_[1].w_)
print(lr.models_[2].w_)
Setosa vs Rest. Output: [ 17.24983752 -61.98800334]
Versicolour vs Rest. Output: [-1.04412163 0.72087761]
Virginica vs Rest. Output: [-20.88894019 32.95409127]

We now introduce the predict method which accepts the following parameters:

  • X_test. Testing data. If the model has been fit using the normalized version of training data, testing data are also going to be normalized.
  • proba. Whether or not to return the probabilities. They are appended to the right of the array of predicted classes.
  • threshold. Decision rule: if P ≥ threshold, the observation is classified as 1 and 0 otherwise. Should be manipulated with care. For instance, if a patient’s life depends on discovering if they have the disease, the threshold should be lowered. Ignored when the model is multiclass.
    def predict(self, X_test, proba=False, threshold=.5):
        if not self.__fit:
            raise TypeError('The model has not been fit.')
        if self.__multiclass:
            # Extract the probabilities from each of the
            # fitted binary models.
            probas = map(lambda model: model.predict(X_test, proba=True)[:, [1]], self.models_)
            # Tie them together.
            self.proba_ = np.concatenate(list(probas), axis=1)
            # Make the prediction: the class with the highest
            # probability is chosen as the predicted class.
            self.y_hat_ = self._classes[np.argmax(self.proba_, axis=1)]
            if proba:
                return np.concatenate([self.y_hat_[:, np.newaxis], self.proba_], axis=1)
            else:
                return self.y_hat_
        else:
            self._X_test = np.array(X_test, dtype=np.dtype('float64'))
            # Normalize the testing data.
            if self.__normalize:
                self._X_test = (self._X_test - self.__Xmin) / self.__Xrange
            # Append the bias term.
            self._X_test = np.concatenate([np.ones(shape=(self._X_test.shape[0], 1)), self._X_test], axis=1)
            # Calculate the probabilities and apply the decision rule.
            self.proba_ = self._sigmoid(self._X_test, self.w_)
            self.y_hat_ = np.zeros(shape=(self.proba_.shape[0], ))
            self.y_hat_[self.proba_ >= threshold] = 1
            if proba:
                return np.concatenate([self.y_hat_[:, np.newaxis], self.proba_[:, np.newaxis]], axis=1)
            else:
                return self.y_hat_
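Since the multiclass wrapper ignores threshold, here is a hypothetical way to experiment with it on one of the underlying binary models, for instance the Setosa-vs-rest model stored at lr.models_[0]:

setosa_model = lr.models_[0]
print(setosa_model.predict(X_test))                 # default threshold of .5
print(setosa_model.predict(X_test, threshold=.3))   # a lower threshold can only turn 0s into 1s, never the reverse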

Let us see the method in action:

print(lr.predict(X_test, proba=True))

Output:

[[0.00000 1.00000 0.26618 0.00000]
[2.00000 0.00000 0.35582 0.42736]
[1.00000 0.00000 0.33544 0.01199]

[0.00000 0.99901 0.28415 0.00000]
[2.00000 0.00000 0.40533 0.99991]
[1.00000 0.00000 0.34896 0.15900]]

Each row contains the predicted class followed by the three probabilities (Setosa vs rest, Versicolour vs rest, Virginica vs rest). We see what the decision-making process looks like, but how well would such a model perform?

from sklearn.metrics import accuracy_score

print(accuracy_score(lr.predict(X_test), y_test))

Output:

0.8947368421052632

Well, it did its job quite well, but can we do better? I am not going to dedicate the rest of this article to feature selection, since that topic is out of scope; instead, we will simply plug in all the features we have.

X, y = data['data'], data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

lr.fit(X_train, y_train, normalize=True)
print(accuracy_score(lr.predict(X_test), y_test))

Output:

0.9473684210526315

Let us compare our result to that of the scikit-learn implementation:

from sklearn.linear_model import LogisticRegression as LogisticRegression_

lr_sklearn = LogisticRegression_(penalty='none', tol=1e-5, random_state=42, max_iter=100000)  # Same hyperparameters
lr_sklearn.fit(X_train, y_train)
print(accuracy_score(lr_sklearn.predict(X_test), y_test))

Output:

0.8947368421052632

I am not going to brag about this minor edge, though, since it is most likely down to chance.

Also, I suppose it is possible to do even better than 94.7% should we eliminate ‘noisy’ features.

Conclusion

In this article, I tried to be both succinct and comprehensive. I aimed to convey the material as clearly as possible, with a special focus on the points that, in my opinion, were covered either superficially or overly intricately in the many other sources I had to go through. I hope you enjoyed this guide; should any of your questions remain unanswered, reach out to me and I will answer them to the extent my expertise allows.
