GLMs, CPUs, and GPUs: An introduction to machine learning through logistic regression, Python and OpenCL
Welcome! As a mentee in the ChiPy mentorship program I will be writing a few blog posts about my project — which was to learn how to implement a couple machine learning algorithms for execution on the graphics card. In this blog post, I’ll introduce a few concepts fundamental to machine learning using logistic regression as an example, as well as a code with simple implementation in Python and OpenCL interfaced with PyOpenCL. This post is intended for a broad audience; if you’re completely new to machine learning it may be worthwhile to read the beginning and mess around with the code on your own. If you’re here for the OpenCL, just jump right down to the bottom. And if you have any feedback, good or bad, let me know! If you would like to see the code I’m using, you can take a look at my github repository. Let’s get to it!
Contribute to opencl_fun development by creating an account on GitHub.github.com
The logistic regression model
There are many cases where it feels natural to attempt to model the outcome of an event as the probability that the event occurs, as we often record the outcome of some event as binary data. In addition, we may have some data that we hypothesize should correlate with the occurrence of said event. In medicine these scenarios arise frequently.
Consider the case where a surgeon is deciding whether to perform a procedure on a patient. The patient’s health may improve considerably if the procedure goes as planned and the patient recovers but for some patients doing the procedure could actually be worse — they may be more likely to have some other medical complication that can’t be determined a priori, they may not be able to recover in a short amount of time after the procedure or they could be at greater risk for some other complication such as post-operative infection, significant blood loss to bleeding and slow wound healing. If the risk that the patient has one of these complications is too high, it would be in the patient’s best interest to not undergo surgery and find some other therapy. Experienced surgeons with many years of practice are often good at deciding when doing a particular surgery is too risky. But how can we use data about patients, physicians and properties of the procedure at hand (this is a post about modeling and prediction after all) to determine the probability of a bad outcome occurring?
For logistic regression (which is a type of regression model within the class of generalized linear models or GLMs), there are a couple of things that we need. The first is a probability distribution that reflects the structure of the output data we are trying to estimate, and a way to link our input data to the mean of that distribution. I’m going to just give you those equations here, but I highly recommend reading the appendix at the end of this blog post for a more complete derivation of the model. For a system that generates binary outcomes with some probability between 0 and 1 that the “positive” event occurs, a natural choice of probability distribution is the Bernoulli distribution. It’s probability density function (pdf) looks like this:
Where pi is the probability that y=1. However, what we need to do is figure out a way to map an arbitrarily large (finite) value to the range (0.0, 1.0), such that it could reflect the value of pi, or our belief that y=1. The function that we will use here is the logistic function (or logit, or sigmoid). It’s form looks like this:
Think about what happens when beta is very large or very small. If beta is large, the fraction approaches 1 and if beta is small, the fraction approaches 0, making this a great way to map some input data (stored as the value beta) to the range (0.0, 1.0), which we can interpret as the probability that y=1. Let’s formalize that and put in what beta should actually be:
where X is our data and theta are the linear coefficients of each feature of the data. With this structure, we can start thinking about about how we would actually go about figuring out the values of theta, given a set of input and output data, as that will allow us to “learn” a map between our input features (X) and the output data (y).
Model fitting and optimization
In the last section, we derived the logistic regression model and the underlying logic for how we can correlate observations with binary events. In this section, we’ll take a look at the math needed to actually fit the model.
Before we get into specifics of logistic regression, I want to introduce a core concept in machine learning. Machine learning models are, in my opinion, mathematical structures that we use to think about how input and output data are related to each other; however the real power of these structures is that we don’t predefine everything about them a priori, we leave some internal parameters that we can tune like knobs on an analog radio to solve the problem at hand. However, this then raises the question of what should be the end result of this model building and knob tuning? In my understanding, that question has a clear answer: prediction. Our goal in any machine learning problem is to select or develop a model that allows us to “learn” a set of rules based on past observations for predicting future outcomes.
With that underlying objective in mind, let’s think about a practical way in which we could actually go about learning those rules or model parameters to make the best possible predictions. Well, how do we humans actually go about learning something? For me, the algorithm usually looks something like this:
Alright, I’ll stop being cheeky now and talk about some specifics for logistic regression, but in all seriousness, what I outlined above is the basic idea behind most (supervised) machine learning algorithms.
As you can see, in line 2 of the “Learn Something” algorithm, the first thing we have to do is take a stab at what the answer should be. In the case of logistic regression, we already have that from the first part of the blog post. What we need to define now, is how we are 1) going to measure error and 2) propose adjustments to the model to reduce error the next time around. The way we typically do this is set up a function that we want to optimize. In machine learning literature, this is often referred to as a loss function or a cost function. Ideally, this function is convex (or concave) which means it’s guaranteed to have a minimum (or maximum) value — this structure is particularly useful as it allows us to use the derivative of that function to find the maximum or the minimum. Recall that in calculus, the minimum of a convex function for some parameter can be found by setting the partial derivative of the function with respect to the parameter in question to zero and then solving for the parameter in question. So, what function should we use to measure our performance. A good place to start is the likelihood function:
Basically what we want to do here is maximize this function over a set of training samples (pairs of (x,y) from 1 to N), which means that we are adjusting the values of theta that would increase the overall likelihood measurement given our data. However, the first thing we need to do is write the right hand side of the equation in terms of X and theta; luckily we have that result from earlier so let’s just plug that in.
Alright, this is looking better but the product still makes it a difficult problem to compute and solve. However, we can simplify this by taking the logarithm of both sides, turning the product into a sum.
And for convenience, let’s turn this into a minimization problem by negating each side of the equation:
Great! Now let’s do some algebra and compute the derivative with respect to theta for this bad boy. Just kidding, I think I’ve tortured you enough today (and I don’t want to type all this out in latex :P) so I’ll provide the equation for the derivative:
One thing to note, which I haven’t mentioned yet is that I’m using matrix notation for X and theta, so to be clearer, if you had j features (columns) in X, we would represent theta as a vector of j+1 numbers (we add one as the intercept or “bias” term). Therefore the above equation for each of our j+1 theta parameters would look like this:
So all we have to do now is set this equal to zero and solve for theta, right? Unfortunately it’s not that simple. Given the nonlinearity present in the equation, we cannot analytically solve for theta. But we can use the properties of the derivative to “nudge” the values of theta in the right direction over many iterations. The formal name for this process is called gradient descent (as in we are descending down the gradient towards the optimum). The gradient descent algorithm in this case would look like this:
In the above pseudocode, alpha refers to the “learning rate”, or how large we adjust theta on each step. It can be tricky in practice to set and there are many schemes that use adaptive methods to keep the adjustment of theta big enough that we will decrease the cost but not so big that we may overshoot the global optimum. Overall the algorithm is pretty simple, and draws a striking resemblance to the silly “Learn Something” algorithm, right? In the next section, let’s see how this is implemented in Python.
Fitting data on the CPU
Before we do any fitting, we need to actually have some data. Since I want to make sure the code works I’ll be generating synthetic data that can definitely be modeled using logistic regression. The code for doing model fitting on the CPU is directly posted below, however it doesn’t look like Medium has any good solutions to get around its terrible code formatting. You can find a link to the jupyter notebook here.
import pyopencl as pcl
import numpy as np
import scipy.stats as ss
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
x0_1 = ss.norm(loc=10.0, scale=2.0)
x0_0 = ss.norm(loc=7.0, scale=2.0)
x1_1 = ss.norm(loc=5.0, scale=3.0)
x1_0 = ss.norm(loc=-5.0, scale=3.0)
X_1 = pd.DataFrame(index=range(nsamps),
X_0 = pd.DataFrame(index=range(nsamps),
X_1.loc[:, 'x0'] = x0_1.rvs(size=(nsamps,)).astype(np.float32)
X_1.loc[:, 'x1'] = x1_1.rvs(size=(nsamps,)).astype(np.float32)
X_1.loc[:, 'y'] = np.ones(shape=(nsamps,)).astype(np.float32)
X_0.loc[:, 'x0'] = x0_0.rvs(size=(nsamps,)).astype(np.float32)
X_0.loc[:, 'x1'] = x1_0.rvs(size=(nsamps,)).astype(np.float32)
X_0.loc[:, 'y'] = np.zeros(shape=(nsamps,)).astype(np.float32)
X_all = pd.concat((X_1, X_0), ignore_index=True)
X_all = X_all.reindex(np.random.permutation(X_all.index))
X = X_all.loc[:, ['x0', 'x1']]
y = X_all.loc[:,'y']
def sig(X, theta):
lin = X.dot(theta)
sig = 1.0 / (1.0 + np.exp(-lin))
def lr_cost(X, theta, y):
est = sig(X, theta)
log_est = np.log(est)
cost = y*log_est + (1-y) * (1-log_est)
cost *= -1
def cost_ss(est, actual):
cost = ((est - actual)**2.0)
cost /= 2.0
def grad(X, theta, y):
diff = y - sig(X, theta)
g = np.zeros(X.shape)
for i in range(X.shape):
for j in range(X.shape):
g[i,j] = diff[i] * X[i,j]
def fit_params(X, y, theta):
tol = 1e-5
learning_rate = 1e-2
costs = [np.inf]
for i in range(0, 1000):
my_cost = lr_cost(X, theta, y)
my_cost = my_cost.sum()
if abs(my_cost - costs[-2]) < tol:
my_gradient = grad(X, theta, y)
my_gradient = my_gradient.mean(axis=0)
theta = theta + learning_rate * my_gradient
return theta, costs
X_train, X_test, y_train, y_test = train_test_split(X,
new_t = np.random.normal(size=(X_train.shape, )).astype(np.float32)
fitted_theta, outcost = fit_params(X_train.values,
print('cost on test set: ', lr_cost(X_test.values, fitted_theta, y_test.values).sum())
print('average sum of square error on test set: ',cost_ss(sig(X_test.values, fitted_theta), y_test.values).sum()/X_test.shape)
plt.title('Cost During Fitting - CPU')
Fitting data on the GPU
Well folks we’re finally here, and this is the meat of my ChiPy mentorship program project. Adapting machine learning models to execute on the GPU has lead to significant increases in performance for a number of problems in AI, computer vision and natural language processing, and interest in this area is surging. Don’t believe me? Look at NVIDIA’s stock price over the last year. I’ve been writing code in Python for about 6 years now, but I’ve never integrated code from other languages with Python and I figured this would give me a way to learn about that process as well as GPU programming.
I have some well commented code to share for doing GPU computing with OpenCL and PyOpenCL for logistic regression but given Medium’s formatting I think it would be best to leave a link to that notebook here. However I’ll use this space to talk a bit about how OpenCL works.
OpenCL is short for “Open Computing Language” — it’s very similar to C99 and is deliberately designed to execute in a highly parallel fashion. This can greatly speed up a large number of computations, basically any algorithm where the bulk of the work doesn’t rely on previously synchronized results. For example, large matrix multiplications, doing some processing of individual items in a large list, pairwise comparisons of long lists of interactors and classically “N-body” problems in physics simulations such as molecular dynamics. By breaking up the workload among a large number of workers and allowing all of them to run in parallel, we can save a lot of time. So what does the structure of an OpenCL program actually look like?
For any OpenCL program, there are basically two pieces of code, one that runs on the “host” and a kernel that runs on the “device”. The host program is written in any language that can interface with the OpenCL library so that means it’s typically C, C++ or even a Python program. The kernel is then the specialized OpenCL code that is executed on the device (which in my case is the dedicated graphics card on my laptop but it could be another CPU). The host code controls the overall flow and logic of the program, which would be reading and writing data, sending code and data to the GPU, ensuring the GPU executes the given instructions and copying the results from the GPU memory back to the host memory. OpenCL also provides an extremely flexible interface for synchronizing data and instructions across many different devices. A good diagram for understanding how OpenCL is structured is shown below:
The hierarchy of the OpenCL workspace is as follows: context (a host program can launch multiple contexts), device (a context can hold multiple devices), command queue (a device can have multiple command queues) and kernel (multiple kernels can be sequentially added to a command queue).
Besides offering a flexible way to manage data and program execution over multiple devices, the real magic of OpenCL, in my opinion is how it changes the way a programmer solves a problem. Typically when we think of performing an operation on a list of objects or between tables of data, we often write programs to accomplish that task from a global perspective — what needs to be done on all elements of the array or table and how they should be treated as a whole entity. With OpenCL however, the idea is to write code that operates on each individual unit within the object. For example, let’s say we wanted to square each element of an array. The OpenCL kernel would look something like this:
__kernel void square(__global float * in, __global float * out)
int gid = get_global_id(0);
out[gid] = in[gid] * in[gid];
Notice how we never set up a for loop that does something like get the index of the array. When we set up the kernel, by specifying the global size of the workspace (which in this case could be the length of the array) OpenCL will launch a large number of these kernels on the GPU, which will all operate independently on their own subsection of the array.
I hope that this blog post had something to offer for everyone, regardless of their level of knowledge about machine learning, Python and OpenCL. The code I provide in the jupyter notebook works but is by no means optimized, but I will continue to improve it over the course of the python mentorship program. In my next blog posts I plan to show OpenCL code that is optimized and can churn through large datasets with high performance. Stay tuned!
Appendix: Logistic Regression Model Derivation
I’m glad you decided to come down here and take a look at why all the hand waving I did in the beginning is justified. The story here begins with a family of probability distributions called the exponential family. All distributions within the exponential family have a few things in common, namely that they all have a probability density function (pdf) in the form:
where beta is the localization parameter (~mean), phi is the scale parameter (~variance) and y is our data. Many of the most commonly used probability distributions follow this form, such as the Gaussian distribution, Poisson distribution, and the Binomial distribution. The exponential family of distributions has a couple of important properties regarding the mean and variance. I’m not going to derive them now, but we will need to use these later when we derive the logistic regression model:
The first equation is especially important — it provides a way to compute the expected value of our distribution using the structure present in the function b and we will exploit it later.
So, which distribution should we actually use in the scenario where we want to model the binary outcome of an event occurring (or not occurring)? A good choice for that would be the Bernoulli distribution, which has a pdf that looks like this:
where pi is our belief that the event y, which can take the value 1 or 0, will occur or be equal to 1. For example, in cases where we think the probability that our event of interest will occur is the same that the event doesn’t occur (eg tossing a fair coin) pi would be equal to 0.5. Ok, that’s great but this doesn’t look like it’s part of the family of exponential distributions! Not yet at least. Let’s do a bit of algebra.
In this form, things start to come to light. If we relate the last line back to the general form of the exponential family, we see that:
Therefore, solving this equation for pi gives us:
From here we can find a really important expression. Remember that the expected value of any member of the exponential family is equal to the first derivative of the function b (also called the cumulant function)? Right now it sort of looks like the second part of the term being exponentiated. We need to do a bit more algebra to find out.
Now it should be really easy to see that:
I know we just chugged through a lot of algebra, but let’s stop and think about the last line for a minute. We know that the expected value of the Bernoulli distribution should be equal to pi. Going back up a few lines, we had actually already solved for it, but we verified that in the last equation above. In addition, take a second look at the function that we ended up deriving:
Think about what happens as we vary the value of beta — when it’s really big the fraction approaches the value of 1 and when it’s very small, the fraction approaches the value of zero. The logistic (or logit, or sigmoid) function gives us exactly what we want, a way to map any arbitrarily large or small number to the range of (0,1). On a more technical note in the literature, you may will often find generalized linear models described as having a “link” function which is intrinsic to each distribution, often written as such:
The idea here is that the function g provides a way to describe a map between some value eta and the expected value of X. However in our case, what we really want is the inverse of g — that would allow us to input some value for eta and get the expected value of the distribution of X (which in our case is the Bernoulli distribution). But wait, we already have it for Bernoulli distribution! We have a function that allows us to express pi as a function of beta (the second to last equation). The last connection we need to make is how to use this with our data. The most straightforward way would be to construct a linear model that links features of our data and weights of those features to the expected value of the distribution. In our case it would look like this:
Take a look at equation 5. We’ve basically already solved it when we figured out what the derivative of the cumulant function was for the Bernoulli distribution. Replacing that gets us:
where X is our data and theta are the linear coefficients of each feature of the data. Finally, there is one last point I want to draw your attention to. Take a look at our new form of the link function:
The right hand side of this equation is the log-odds ratio of the event of interest occurring, which is great because it makes the interpretation of the values of theta straightforward; that is, the bigger the value of theta for a given feature in our data, the more that particular feature of the data contributes to predicting the event of interest occurring.
At this point, I hope I’ve convinced you of how we can use the properties and structure of the Bernoulli distribution to build a linear model for predicting outcomes of a binary event. However theory doesn’t get you too far in the real world so in the next section, let’s talk about how you would actually go about fitting this model and estimating what the values of theta should be given a set of data. You can also take this logic and apply it to other distribution such as the Poisson, Gaussian, Binomial etc to derive the associated link functions.