Multi-Class Logistic Regression

Peter Oliver Caya
6 min read · Apr 22, 2017

A few weeks ago I wrote this blog post where I tasked myself with implementing two-class logistic regression from scratch. Since I couldn't find any guides for implementing multi-class logistic regression online, I decided to implement the multi-class version as well and write about it. As I finished, though, it was clear that the post was going to be quite long, so I've decided to break it into two parts.

A lot of this material was learned and implemented using Jia Li's logistic regression presentation in addition to ESL. These two sources provided a well-rounded discussion of what logistic regression is and how to implement it.

What Changes When Using >2 Classes?

The principle underlying logistic regression doesn't change, but increasing the number of classes means that we must calculate odds ratios for each of the K classes. Consider the odds ratio for the binary case: we take the ratio of the probability of class A to the probability of the Kth class, which here is just the second class (B).
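On the log scale, that ratio is modeled with a single coefficient vector; taking class B as the reference class, this is presumably the familiar two-class log-odds:

\log \frac{P(G = A \mid x)}{P(G = B \mid x)} = \beta^{T} x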

With K classes we must now consider the calculated probability, p, of each class i. Each class has its own vector of coefficients, represented as β with a subscript signifying its class:
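Following Jia Li's notes and ESL, and taking the Kth class as the reference, each of the remaining classes gets its own log-odds equation, something like:

\log \frac{P(G = k \mid x)}{P(G = K \mid x)} = \beta_{k}^{T} x, \qquad k = 1, \dots, K-1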

Note that instead of just trying to fit one set of β parameters, we now have (K-1) sets of coefficients which we are trying to fit!

Going through the requisite algebra to solve for the probability values yields the equations shown below:
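In standard form, and this matches what find_pi_multi below computes with the Kth class as the reference, the class probabilities are:

P(G = k \mid x) = \frac{\exp(\beta_{k}^{T} x)}{1 + \sum_{m=1}^{K-1} \exp(\beta_{m}^{T} x)}, \qquad k = 1, \dots, K-1

P(G = K \mid x) = \frac{1}{1 + \sum_{m=1}^{K-1} \exp(\beta_{m}^{T} x)}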

I implemented the calculation of the class probabilities as its own separate function which I have copied below:

# I implemented the multi-class version of the probability function to produce a matrix of the class probabilities.  Let K be the number of classes.
# Inputs:
# [1] X - An N by M design matrix.
# [2] beta - An M*(K-1) length vector of coefficients, stacked by class.
# [3] classes - The number of classes, K, we are predicting.
# Outputs:
# An N by K matrix containing the class probabilities, computed from the equations above.
find_pi_multi <- function(X, beta, classes){
  vars <- ncol(X)
  prob_mat <- matrix(0, nrow = nrow(X), ncol = (classes - 1))
  # Numerators exp(X %*% beta_k) for each non-reference class k:
  for(k in 1:(classes - 1)){
    rel_indices <- 1:vars + (k - 1)*vars
    prob_mat[, k] <- exp(X %*% beta[rel_indices])
  }
  # The reference class contributes 1 to the denominator and a column of ones to the numerator:
  denom <- apply(X = prob_mat, MARGIN = 1, FUN = sum) + 1
  prob_mat2 <- cbind(prob_mat, as.matrix(rep(1, nrow(X))))/denom
  return(prob_mat2)
}
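As a quick sanity check (the toy inputs here are made up purely for illustration), every row of the returned matrix should sum to one:

set.seed(1)
X_toy <- cbind(1, rnorm(5))        # 5 observations: intercept plus one predictor
beta_toy <- c(0.5, -1, -0.2, 0.3)  # two stacked coefficient blocks for K = 3 classes
p_toy <- find_pi_multi(X_toy, beta_toy, classes = 3)
rowSums(p_toy)                     # each entry should be 1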

The Maximum Likelihood Function

Since we are now using more than two classes, the log of the likelihood function becomes:
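Writing y_ik = 1 when observation i belongs to class k and 0 otherwise, this should be the usual multinomial log-likelihood:

\ell(\beta) = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, \log p_{k}(x_{i}; \beta)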

The Gradient

Just for convenience, I'm copying the derivation of the gradient of the log-likelihood function below:
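Differentiating with respect to each class's coefficient vector gives the familiar residual-weighted sum of the inputs:

\frac{\partial \ell}{\partial \beta_{k}} = \sum_{i=1}^{N} x_{i} \bigl( y_{ik} - p_{k}(x_{i}; \beta) \bigr), \qquad k = 1, \dots, K-1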

Turning this into a matrix equation is more complicated than in the two-class example: we need to form an N(K-1) by (p+1)(K-1) block-diagonal matrix with a copy of X in each diagonal block. Let's denote this block matrix as X-tilde.
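Concretely, X-tilde stacks K-1 copies of the design matrix along the diagonal:

\tilde{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}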

Forming X-Tilde

# I'm sure R has a better way to form a block matrix.  Pardon my MATLABese here.
X_tilde <- function(X, classes){
  r <- nrow(X)
  N <- r*(classes - 1)
  X_tilde <- matrix(0, nrow = N, ncol = (classes - 1)*ncol(X))
  m <- 1
  n <- 1
  # Drop a copy of X into each diagonal block:
  for(i in 1:(classes - 1)){
    X_tilde[n:(n + r - 1), m:(m + ncol(X) - 1)] <- X
    n <- n + r
    m <- m + ncol(X)
  }
  return(X_tilde)
}
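For what it's worth, base R can build the same block-diagonal matrix in one line with a Kronecker product. A minimal sketch (the helper name X_tilde_kron is just for illustration; it should be interchangeable with X_tilde above):

X_tilde_kron <- function(X, classes){
  # kronecker(I, X) places a copy of X in each of the (classes - 1) diagonal blocks.
  kronecker(diag(classes - 1), X)
}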

We also need to form an N(K-1)-length vector with entries of 1 where class i is observed and zero otherwise. The function to construct this vector is displayed below:

# Generate a N(K-1) length vector of indicator functions based on class.
form_y <- function(classes, y, obs){
  y0 <- as.matrix(rep(0, obs))
  # Stack an indicator block for each class except the reference (last) class:
  for(i in 1:(length(classes) - 1)){
    if(i == 1){
      yp <- y0
      yp[which(y == classes[i])] <- 1
    }else{
      y_temp <- y0
      y_temp[which(y == classes[i])] <- 1
      yp <- rbind(yp, y_temp)
    }
  }
  return(yp)
}
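For example (toy labels, purely for illustration), with three classes and the third treated as the reference, form_y stacks one indicator block per non-reference class:

y_toy <- c("a", "b", "c", "a", "b")
form_y(classes = c("a", "b", "c"), y = y_toy, obs = length(y_toy))
# A 10 x 1 matrix: the first 5 entries indicate class "a", the next 5 indicate class "b".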

All this being completed, the gradient of the multi-class log-likelihood in matrix form becomes:
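With y the stacked indicator vector from form_y and p the stacked first K-1 columns of find_pi_multi's output, this is the quantity the code computes as t(XT) %*% (y - p2):

\nabla \ell(\beta) = \tilde{X}^{T} (y - p)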

The Hessian

The derivation of the Hessian matrix doesn’t change:
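Blockwise, the second derivatives should be the same as in the two-class derivation, just indexed by pairs of classes:

\frac{\partial^{2} \ell}{\partial \beta_{k} \, \partial \beta_{m}^{T}} = -\sum_{i=1}^{N} x_{i} x_{i}^{T} \, p_{k}(x_{i}) \bigl( \mathbf{1}\{k = m\} - p_{m}(x_{i}) \bigr)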

Again, our multi-class implementation makes producing the Hessian more involved: we have to form another block matrix, W, summarizing the class probabilities. This is covered more thoroughly in Jia Li's presentation, which you can find here, so I won't go through the full derivation in this blog post; my R implementation is shown under the next heading.
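For reference, the diagonal and off-diagonal blocks of W that the code below constructs are (with k and m indexing the non-reference classes, and each block an N by N diagonal matrix over the observations):

W_{kk} = \operatorname{diag}\bigl( p_{k}(x_{i}) (1 - p_{k}(x_{i})) \bigr), \qquad W_{km} = \operatorname{diag}\bigl( -p_{k}(x_{i}) \, p_{m}(x_{i}) \bigr), \quad k \neq m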

The W Matrix

form_W <- function(p){
  # I'm not too worried about speed with this seeing as I am just forming block matrices.
  # Form the diagonal blocks of the W matrix:
  N <- nrow(p)*(ncol(p) - 1)
  r <- nrow(p)
  W <- matrix(0, nrow = N, ncol = N)
  block_diags <- list()
  upper_diag <- list()
  m <- 1
  n <- 1
  for(i in 1:(ncol(p) - 1)){
    # Diagonal blocks: W_kk = diag(p_k*(1 - p_k))
    block_diags[[i]] <- diag((p[, i])*(1 - p[, i]))
    W[n:(n + r - 1), m:(m + r - 1)] <- block_diags[[i]]
    n <- n + r
    m <- m + r
  }

  # Form the off-diagonal blocks:
  # W_km = diag(-p_k*p_m) for k != m
  k <- 1
  if(ncol(p) > 2){
    for(i in 1:(ncol(p) - 2)){
      n <- r*(i - 1) + 1
      for(j in c((i + 1):(ncol(p) - 1))){
        m <- r*(j - 1) + 1
        upper_diag[[k]] <- diag(-p[, i]*p[, j])
        W[n:(n + r - 1), m:(m + r - 1)] <- upper_diag[[k]]
        k <- k + 1
      }
    }
    # The blocks are symmetric across the diagonal, so add the transpose of the upper
    # triangle to W and then correct the (doubled) main diagonal.
    W <- W + t(W)
    diag(W) <- diag(W)/2
  }

  return(W)
}

Thankfully, this is the end of our use of block matrices for this project. The matrix form of the Hessian of the log-likelihood is displayed below.
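With X-tilde and W in hand, this is the Hessian the code computes as -t(XT) %*% ws %*% XT:

H(\beta) = -\tilde{X}^{T} W \tilde{X}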

The rest of my implementation of the multi-class version of the log-likelihood function is displayed below:

# Multi-class Regression -----------------------------------------------------
# Implementation of multi-class logistic regression:
# Inputs:
# [1] Y - a column vector of N elements containing the actual classification results.
# [2] X - an N by (p+1) design matrix.
# [3] beta - a (p+1)(K-1) length column vector of coefficients.
# Outputs:
# [1] f - the value of the log-likelihood function.
# [2] df - a (p+1)(K-1) length column vector representing the log-likelihood's gradient.
# [3] ddf - a (p+1)(K-1) by (p+1)(K-1) Hessian matrix for the log-likelihood function.
LL_Multi <- function(Y, X, beta){
  class_types <- unique(Y)[order(-unique(Y))]
  classes <- length(class_types)
  class_probs <- as.matrix(rep(0, nrow(X)))
  p <- find_pi_multi(X, beta, classes)
  # Stack the first K-1 columns of the probability matrix into one long vector:
  p2 <- as.matrix(unlist(as.data.frame(p[, 1:(dim(p)[2] - 1)])))
  y <- form_y(classes = class_types, y = Y, obs = length(Y))

  # Get the estimated probability of the observed class for each observation:
  for(k in 1:length(class_types)){
    class_probs[which(Y == class_types[k])] <- p[which(Y == class_types[k]), k]
  }
  XT <- X_tilde(X, ncol(p))
  ws <- form_W(p = p)
  # The log-likelihood is the sum of the logs of the observed-class probabilities:
  f <- sum(log(class_probs))
  # Calculate the gradient:
  df <- t(XT)%*%(y - p2)
  # Calculate the Hessian:
  ddf <- -t(XT)%*%ws%*%XT
  return(list("f" = f, "df" = df, "ddf" = ddf))
}

Testing the Algorithm

I first compared my implementation with the glm function to check that it generates consistent results. The difference between my coefficients and glm's was at most on the order of 1e-16.

Comparison to the Two-Class Implementation:

set.seed(13)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
admission <- as.matrix(mydata$admit)
beta <- as.matrix(rep(0,3))
vars <- as.matrix(mydata[,2:dim(mydata)[2]])
# My two-class algorithm versus the multiple class algorithm:
L_L(y = admission,X = vars,beta = beta)
LL_Multi(Y = admission,X = vars,beta = beta)
# Using Newton's method:
fromGLM <- glm(formula = admission~vars-1,family = binomial())
binom_algo <- newton(function(input){L_L(y = admission,X = vars,beta = input)},initial_values = beta,tol = 1e-13)
multi_algo <- newton(function(input){LL_Multi(Y = admission,X = vars,beta = input)},initial_values = beta,tol = 1e-13)
data.frame(fromGLM$coefficients,multi_algo, fromGLM$coefficients-multi_algo)

Get the Code

My full code for implementing two-class and multi-class logistic regression can be found at my Github repository here.
