Parallel Machine Learning with Hogwild!

Srikrishna Sridhar
9 min read · May 6, 2015


Prior to joining Dato, I was very fortunate to have been actively involved in the University of Wisconsin’s numerical optimization group, which has been pushing the envelope in the field of machine learning. In collaboration with database systems researchers, our group’s research and software on asynchronous multi-core algorithms for machine learning and other big data applications have transformed the way big data analytics is done.

If you’ve ever wondered about the meaning of buzzwords like Hogwild! (Chris will rant if you omit the “!”), DimmWitted, or ASYCD, read on. In this blog post, I will explain what stochastic gradient descent (SGD) is and why thread locking has such a large effect on its performance. I will attempt to explain how parallel machine learning algorithms such as Hogwild! work, why they have transformed big data analytics, and how GraphLab Create not only adopts these techniques but also actively pushes the frontier of parallel machine learning algorithms.

Learning Machine Learning Models

Many machine learning problems such as logistic regression, support vector machines, and matrix factorization can be formulated as mathematical optimization problems. For these problems, given training data (depicted in blue in the figure below), the goal is to learn the parameters of a function (depicted in pink) that best fits the data. The fitted (or learned) function can then be used to make future predictions.

Figure 1: Learning a prediction function using training data.

A loss function is used to capture the quality of the model based on the error between the predictions and ground truth. Typically, the further away the prediction is from the ground truth, the larger the penalty we must pay. The goal of model training is to minimize the loss function, i.e. make sure that the total difference between the predictions and the ground truth, as measured by the loss function, is as low as possible. The process of model fitting can be mathematically stated as follows.
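Roughly speaking, with training examples $(x_i, y_i)$, a prediction function $f$ with parameters $\theta$, and a loss $L$ (this notation is mine, in one standard form), training amounts to solving $\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i; \theta), y_i)$.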

Gradient Descent

Gradient descent is a simple and popular algorithm for minimizing a very common class of loss functions. The following illustration explains the intuition behind the algorithm.

Figure 2: Performing a gradient step while minimizing a function.

We start from a random estimate of the solution (typically a random draw from a normal distribution) and make several iterations (or passes) over the data. In each pass, we update the current estimate of the solution by taking a finite step in the direction opposite to the gradient (the slope of the function) calculated at that point. We repeat this process until we reach the minimum, i.e. the trough of the function. If the steps taken during each iteration are small enough and the function being minimized is convex (shaped like a bowl), gradient descent is guaranteed to converge to the minimum of the loss function.
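In symbols (my notation), with $\theta_k$ denoting the estimate at iteration $k$, $\alpha$ the step length, and $\ell$ the loss being minimized, each step is $\theta_{k+1} = \theta_k - \alpha \nabla \ell(\theta_k)$, where the gradient $\nabla \ell(\theta_k)$ is computed over the entire training set.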

One of the main disadvantages of gradient descent is that it requires many iterations before reaching a reasonably good solution. Simple modifications to gradient descent result in algorithms such as L-BFGS and accelerated gradient methods, which accelerate model training by reaching the minimum in far fewer iterations. However, even these accelerated versions of gradient descent require several passes before they can get to a good solution.

Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm is a variation of gradient descent. It is widely used in applications like recommender systems and deep learning, as well as in more basic classification and regression tasks. The algorithm can be described as follows:

Algorithm 1: Model training using Stochastic Gradient Descent (SGD).
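A minimal serial sketch of the idea in Python (the function and argument names here are mine, for illustration, not GraphLab Create’s API):

```python
import numpy as np

def sgd(X, y, grad_fn, step_size=0.01, n_passes=5, seed=0):
    """Serial SGD sketch: one randomly drawn example per update.

    grad_fn(theta, x_i, y_i) returns the gradient of the loss on a
    single training example.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = rng.normal(size=d)            # random initial estimate
    for _ in range(n_passes):             # passes over the data
        for i in rng.permutation(n):      # examples in random order
            theta = theta - step_size * grad_fn(theta, X[i], y[i])
    return theta

# Example: gradient of the squared loss 0.5 * (x.theta - y)^2
squared_loss_grad = lambda theta, x, y: (x @ theta - y) * x
```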

During each iteration of SGD, we draw a random example from the training data and perform a gradient-descent-like update with respect to that single example. The algorithm was proposed way back in the 1960s. For several decades, it was dismissed in mainstream numerical optimization research for being too simple and requiring too many passes over the training data. In the past few years, however, there has been a resurgence in the use of SGD in big data and machine learning applications, for the following reasons:

  • Small memory footprint: Unlike many algorithms, SGD doesn’t need all of the data to be in memory; it only requires a single training example at a time. The downside is that the algorithm becomes I/O bound, but with some aggressive caching you can speed it up significantly.

Figure 3: Gradient Descent vs SGD.

  • Gets to a reasonable solution quickly: The figure above illustrates why SGD is such a popular algorithm for machine learning. SGD gets to a good solution very quickly but can take forever to get to the “best” solution. This turns out not to be a problem in machine learning applications, because getting to the “best” solution could in fact result in overfitting.

Although the “D” in SGD stands for descent, the algorithm isn’t really a descent algorithm: in theory, there is no guarantee that the loss function decreases as the algorithm makes more passes over the training data. But let us roll with it, because in practice it generally provides the “D”. There are, however, interesting theoretical results showing, with some caveats, that in expectation the algorithm converges to the minimum.

Too good to be true?

Well, yes and no. All the above statements about SGD are true (both in theory and in practice), but there are two key issues with the algorithm. First, it is inherently serial: the parameters $\theta$ must be updated after seeing every example, and the next example has to wait until the update is done before its gradient can be calculated. Hence, until recently, SGD was believed to be difficult to parallelize. Second, it is notoriously hard to tune. (The joke is that if SGD had mood swings, it would burst out in anger all the time.) Here at Dato, we have spent many months building an auto-tuned SGD that converges to good solutions on a wide range of datasets. We’ll write a full technical report describing how we got there. For now, here is a teaser.

SGD on Multiple Threads

As I described earlier, SGD is an inherently serial algorithm, and it was believed that each thread must wait for another thread to finish its update before making the next move. There had been a few proposals to make SGD parallel, but they weren’t very successful.

Updates with locking

For those who are familiar with parallel programming, an obvious way to make this algorithm parallel is to lock the update step: each thread acquires a lock on the current estimate of the solution before updating the parameters, and releases the lock once the update is done. This algorithm is described below:

Algorithm 2: Parallel stochastic gradient descent with locks.
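A minimal sketch of this locked scheme in Python, using the same illustrative `grad_fn` convention as the serial sketch above (these names are mine, and the sketch shows the structure rather than any production implementation):

```python
import threading
import numpy as np

def locked_parallel_sgd(X, y, grad_fn, step_size=0.01,
                        n_updates=10_000, n_threads=4, seed=0):
    """Parallel SGD where every update is protected by a single lock."""
    n, d = X.shape
    theta = np.random.default_rng(seed).normal(size=d)   # shared solution
    lock = threading.Lock()

    def worker(thread_seed):
        rng = np.random.default_rng(thread_seed)
        for _ in range(n_updates // n_threads):
            i = rng.integers(n)                       # draw a random example
            with lock:                                # serialize the whole update
                g = grad_fn(theta, X[i], y[i])
                np.subtract(theta, step_size * g, out=theta)  # in-place write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```

(In CPython the global interpreter lock already serializes most of this work, so treat the sketch as an illustration of the structure, not of real performance; the real implementations live in C/C++.)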

For many problems, an update step typically takes on the order of microseconds, but acquiring a lock can take a few milliseconds. In other words, it can take 1000x longer to lock than to make an update. Spin locks tend to work better with SGD, but they are still not fast enough. To give you a real-life analogy, such an algorithm is like a team that spends more time planning and synchronizing than doing actual work.

Comment out your locks with Hogwild!

SGD is simple enough to be explained on a post-it note, and the same can be said of Hogwild!. In a sentence, the main idea of Hogwild! is: “Remove all thread locks from parallel SGD code.” In Hogwild!, threads can overwrite each other’s work by writing at the same time, and they may compute gradients using a stale version of the “current solution.” The algorithm can be described as follows:

Algorithm 3: Asynchronous stochastic gradient descent (Hogwild!)
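Compared with the locked sketch above, the only change is removing the lock. A hedged Python illustration (note that the paper’s convergence analysis relies on per-example updates being sparse, a detail this dense sketch glosses over):

```python
import threading
import numpy as np

def hogwild_sgd(X, y, grad_fn, step_size=0.01,
                n_updates=10_000, n_threads=4, seed=0):
    """Lock-free parallel SGD: identical to the locked version, minus the lock."""
    n, d = X.shape
    theta = np.random.default_rng(seed).normal(size=d)   # shared, unprotected

    def worker(thread_seed):
        rng = np.random.default_rng(thread_seed)
        for _ in range(n_updates // n_threads):
            i = rng.integers(n)                       # draw a random example
            g = grad_fn(theta, X[i], y[i])            # may read a stale theta
            # No lock: concurrent, occasionally clobbering writes are allowed.
            np.subtract(theta, step_size * g, out=theta)

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```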

Wait… What? This works?

Yes, this works without any negative effect on the mathematical efficiency of the algorithm, and it provides all the benefits of having multiple threads. Here is a plot of how things work out (source: Niu et al.). Hogwild! has started a trend in asynchronous algorithms for model training.

Figure 4: Parallel performance of SGD (source: Niu et al.).

Funny (and true) story about Hogwild!

Feng Niu (the first author of the original Hogwild! paper) was trying to get SGD to go fast. He decided to comment out the locking mechanism in the code (maybe out of curiosity, or maybe he was just debugging). The algorithm not only worked, it also happened to be 100x faster. The fact that it worked was no coincidence: he and his co-authors were able to prove that this lock-free madness was not only fast, but also mathematically efficient.

Parallel machine learning trends

The ideas from Hogwild! have been extended to several machine learning algorithms. The same pattern of parallelism works in other algorithms such as stochastic coordinate descent (useful for solving SVMs) and the randomized Kaczmarz algorithm for solving systems of linear equations. There is even a paper on exploiting non-uniform memory access patterns on multi-core server machines. All of this work has pushed the limits of what a single machine can do, making it suitable for truly scalable machine learning.

At Dato, we strongly believe that asynchronous (i.e. lock-free) algorithms are a key ingredient in a scalable machine learning platform.

Pushing the Envelope with GraphLab Create

GraphLab Create has not only adopted these techniques but also actively pushes the frontier of parallel machine learning algorithms. Here are some advances we’ve made:

  • Hybrid-Locking: As explained above, removing locks in parallel SGD can speed up the algorithm by a lot. But there is one small catch: in practice, vanilla Hogwild! may not work well on problems where some of the variables are updated very frequently. This includes regularization terms for any model, intercept terms in linear models, bias terms in SVMs, and weights for popular items in a recommender system. For example, everyone has watched The Godfather, so the parameter for “The Godfather” will be updated much more frequently than the others. At Dato, we use a hybrid-locking scheme: we lock the variables that occur very frequently and update the remaining, infrequent variables lock-free (a sketch of the idea follows below). This lets us stay fast while ensuring convergence on problems and datasets with a wide range of characteristics.
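A hedged sketch of what such a scheme could look like, building on the Hogwild! worker above (the split between “hot” and other coordinates, and all the names here, are illustrative assumptions, not Dato’s actual implementation):

```python
import threading
import numpy as np

# Hypothetical hybrid-locking update. Coordinates listed in `hot_coords`
# (say, an intercept or the parameters of very popular items) are updated
# under a lock; everything else is updated Hogwild-style, lock-free.
hot_coords = {0}                      # illustrative: protect coordinate 0
hot_lock = threading.Lock()

def hybrid_update(theta, g, step_size):
    """Apply one SGD update to the shared vector `theta` given gradient `g`."""
    for j in np.flatnonzero(g):       # only coordinates the example touches
        if j in hot_coords:
            with hot_lock:            # frequent coordinate: locked update
                theta[j] -= step_size * g[j]
        else:
            theta[j] -= step_size * g[j]   # infrequent coordinate: lock-free
```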

Figure 5: Getting the best of both worlds with hybrid algorithms.

  • Hybrid-algorithms: Stochastic algorithms get to a good solution fast but may not get to the “best” solution. Deterministic algorithms like L-BFGS work well once they have a good starting point. At Dato, we are actively working on hybrid algorithms that start off with a stochastic algorithm like SGD and then switch over to a deterministic algorithm like L-BFGS. The illustration above depicts how one can get the best of both worlds using hybrid algorithms, and a rough sketch of such a hand-off follows below.
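A rough sketch of that hand-off, assuming the `sgd` helper from earlier in the post plus full-data `loss_fn`/`full_grad` functions (all of these names are mine; this is not GraphLab Create’s implementation):

```python
from scipy.optimize import minimize

def hybrid_fit(X, y, loss_fn, full_grad, single_grad, sgd_passes=1):
    """Warm-start L-BFGS with a few cheap SGD passes.

    loss_fn(theta) and full_grad(theta) evaluate the loss and gradient
    over the whole training set; single_grad is the per-example gradient
    used by the sgd() sketch from earlier in the post.
    """
    theta0 = sgd(X, y, single_grad, n_passes=sgd_passes)   # fast, rough start
    result = minimize(loss_fn, theta0, jac=full_grad,
                      method="L-BFGS-B")                    # polish the solution
    return result.x
```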

There you have it: hybrid locking and hybrid batch/stochastic algorithms are some of the secret sauce behind the scale and speed of GraphLab Create. We’re constantly improving the optimization engine behind our learning algorithms. Do you have a favorite parallel or distributed optimization algorithm? Leave us a comment!
