Implementing Binary Logistic Regression in R

Peter Oliver Caya
Mar 21, 2017

Logistic regression attempts to solve a class of problems which sounds simpler than linear regression. In the latter, we want to find a line (or plane, or hyperplane) which best predicts an average numerical value for some data based on our observations. In logistic regression we are just trying to predict which category an observation falls into based on its independent variables. Male or female? Sick or not?

In this blog post I show the math which underlies logistic regression and how you can implement it in the programming language R. The files can be accessed on my GitHub repo here.

The Gist of Logistic Regression

When we use logistic regression we attempt to identify the probability that an observation will be in a particular class. I will start with the two class (K=2) case. Let’s define our variables for classes A and B.
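
For the rest of the post I will use the following notation: for each observation i, x_i is its vector of predictor values, y_i = 1 if the observation belongs to class A and y_i = 0 if it belongs to class B, and p_i is the probability that observation i belongs to class A given x_i.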

The “raw” model we begin with appears below. The question we are answering is: What are the odds of the data from observation i being in category A versus B given a set of parameters β?
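
In the notation above, this amounts to assuming the odds take an exponential-linear form:

p_i / (1 - p_i) = exp(x_i^T β)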

A quick note: if we try to model the odds ratio directly, we are predicting a quantity which blows up towards infinity as p → 1. This problem is circumvented by working with the natural logarithm of the odds ratio instead, which maps probabilities in (0, 1) onto the entire real line and is therefore a natural quantity to model with a linear function of the predictors.

The model we are trying to find the parameters β for is then:
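
log( p_i / (1 - p_i) ) = x_i^T β

That is, the log-odds is modeled as a linear function of the predictors.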

Our next step uses some fairly plain algebra to solve for p:
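
p_i = exp(x_i^T β) / (1 + exp(x_i^T β))

This is exactly the expression evaluated for each observation in the R code further below.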

Finding the Parameters

Unlike linear regression, where the model parameters can be fitted in closed form by inverting a matrix, logistic regression has to be approached as a numerical optimization problem.
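
For comparison, ordinary least squares has the closed-form solution β = (X^T X)^{-1} X^T Y; no such closed form exists for the logistic model, so we will need an iterative method.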

The Maximum Likelihood Function

The likelihood function is a measure of how probable the outcomes we observed are, given a particular set of parameters. The principle of maximum likelihood estimation states that we should define a function which measures this probability and then seek the parameters which maximize it.

For our binary classification problem, we will assume that our likelihood function is governed by the binomial probability distribution function:
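
For reference, the binomial distribution assigns probability

P(k successes in n trials) = (n choose k) p^k (1 - p)^(n - k)

to observing k successes in n independent trials, each with success probability p.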

Since we are only interested in maximizing this function with respect to the parameters, we don't need to worry about the number of possible orderings of the outcomes; the binomial coefficient does not depend on β. Leaving it out, our likelihood function becomes:
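
L(β) = Π_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}

Here each observation is treated as a single trial whose success probability p_i depends on β through the model above.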

We are nowhere near done yet! We need to take the derivative of the function above, which will prove difficult given its multiplicative form. To make things simpler, we can take the natural logarithm to produce an expression which is much easier to differentiate:
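
ℓ(β) = log L(β) = Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

Because the logarithm is a monotone transformation, maximizing ℓ(β) yields the same parameters as maximizing L(β).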

The Newton-Raphson Method

I mentioned above that we will need to approach this as a problem of numerical optimization. A very commonly used method is the Newton-Raphson method.

In the context of our logistic regression problem, we are seeking the critical point (which, for this concave log-likelihood, is the global maximum) of the likelihood function. This means we have to find the point at which the gradient is equal to zero, and Newton-Raphson gets there using both the gradient and the Hessian of the objective function.
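
Concretely, starting from a current estimate β_t, each Newton-Raphson iteration computes

β_{t+1} = β_t - [H(β_t)]^{-1} ∇ℓ(β_t)

where ∇ℓ is the gradient and H is the Hessian of the log-likelihood, and repeats until the gradient is (numerically) zero. This is the update performed inside the while loop of the implementation below.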

My implementation of the Newton-Raphson method in R:

# Simple implementation of Newton-Raphson.
# Inputs:
# [1] func - a function which, given a parameter vector, returns a list
#     containing the gradient (df) and Hessian (ddf) of the objective.
# [2] initial_values - an initial guess to be passed to "func".
# [3] tol - the stopping criterion; iteration stops once the 2-norm of the
#     gradient falls below this value.
# Outputs:
# [1] params - the parameter estimates at convergence.
newton <- function(func, initial_values, tol = 1e-16){
  params <- initial_values
  check <- 1
  while(check > tol){
    func_eval <- func(params)
    # Newton step: subtract the Hessian inverse times the gradient.
    params <- params - solve(func_eval$ddf) %*% func_eval$df
    # Convergence check: 2-norm of the gradient at the current evaluation.
    check <- sqrt(t(func_eval$df) %*% func_eval$df)
  }
  return(params)
}

The Gradient

To put this in matrix notation, let Y be a vector of N elements representing the class outcomes for the data and let P be an N-length vector composed of the fitted probabilities p_i. Then the formula for the gradient of the log-likelihood function is:
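
∇ℓ(β) = X^T (Y - P)

where X is the design matrix whose i-th row is x_i^T.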

The Hessian

Following the convention from [2], let W be an N by N diagonal matrix with nonzero entries W_{i,i} = p_i(1-p_i). With this in mind, define the Hessian matrix as:
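
H(β) = -X^T W X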

My R implementation of the log-likelihood function, together with its gradient and Hessian, is displayed below:

# Implementation of binary logistic regression:
# Inputs:
# [1] y - a column vector of N elements containing the actual classification results (0 or 1).
# [2] beta - a p-length column vector of coefficients.
# [3] X - an N by p design matrix (include a column of ones if an intercept is desired).
# Outputs:
# [1] f - the log-likelihood evaluated at the inputs.
# [2] df - a p-length column vector representing the log-likelihood's gradient.
# [3] ddf - the p by p Hessian matrix of the log-likelihood.
L_L <- function(y, beta, X){
  # Calculate the vector of fitted probabilities:
  p_hat <- exp(X %*% beta)/(1 + exp(X %*% beta))
  # Log-likelihood:
  f <- sum(y*log(p_hat) + (1 - y)*log(1 - p_hat))
  # First derivative (gradient): X^T (Y - P)
  df <- t(X) %*% (y - p_hat)
  # Second derivative (Hessian): -X^T W X
  W <- diag(as.vector(p_hat*(1 - p_hat)))
  ddf <- -t(X) %*% W %*% X
  # Output:
  output_list <- list(f = f, df = df, ddf = ddf)
  return(output_list)
}

Sanity Check

To ensure that my implementation, y’know, works, I tested it on a few different examples. I chose to publish the one below because it runs on a reasonably sized real-world data set.

I compared my implementation to the familiar glm function. Expecting too much in terms of driving the gradient to zero ran me into issues with a nearly singular Hessian matrix, so I kept the tolerance at 1e-13, which was good enough to produce results very similar to the glm function's:

Actual Data

# Peter Caya
# 2017-03-19
# Simple script demonstrating binary classification using the standard glm function,
# my simpler L_L function for binary classification, and my multi-class
# implementation.
rm(list = ls())
source("/home/peter/Dropbox/Data Science/ML Scripts/Log_Functions.R")
set.seed(13)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
admission <- as.matrix(mydata$admit)
beta <- as.matrix(rep(0, 3))
vars <- as.matrix(mydata[, 2:dim(mydata)[2]])
# Fit the model with my Newton-Raphson implementation:
my_coefs <- newton(function(input){L_L(y = admission, X = vars, beta = input)},
                   initial_values = beta, tol = 1e-13)
# Fit the same (intercept-free) model with glm for comparison:
fromGLM <- glm(formula = admission ~ vars - 1, family = binomial())
GLMcoefficients <- fromGLM$coefficients
# Evaluate the log-likelihood and gradient at both solutions:
my_results <- L_L(y = admission, beta = my_coefs, X = vars)
glm_results <- L_L(y = admission, beta = GLMcoefficients, X = vars)
L_L_Difference <- my_results$f - glm_results$f
DL_L_Difference <- t(my_results$df - glm_results$df) %*% (my_results$df - glm_results$df)
print(data.frame(glm = matrix(GLMcoefficients), newton = my_coefs))

Conclusion

We can see that while the mathematical background for logistic regression is a bit trickier than that of linear regression, once we understand it the method is very simple to implement as a problem in numerical optimization.

However, what if we want to classify objects into more than two classes? The basic ideas remain the same, but the finished product is a bit trickier. Originally I was going to add my implementation of multi-class logistic regression to this article, but it was getting a bit lengthy, so that will be documented in an upcoming blog post.

Sources

[1] Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning.

[2] Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation.

[3] Dohoo, I., Martin, W., and Stryhn, H., Logistic Regression.

Special thanks to my friend Tyler Wilcox who sent some sources to me which really helped make this post more complete.
