# Understanding and Implementing the Viola-Jones Image Classification Algorithm

Image classification has grown rapidly over the past decade, driven largely by Convolutional Neural Networks (CNNs) and other deep learning techniques. Before CNNs became mainstream, however, another technique was widely used and continues to be used: Viola-Jones.

Whereas a CNN is a single classifier that looks at a full image and applies matrix operations to arrive at a classification, Viola-Jones takes an ensemble approach: it uses many different classifiers, each looking at a different portion of the image. Each individual classifier is weaker (less accurate, produces more false positives, etc.) than the final classifier because it takes in less information. When the results from each classifier are combined, however, they produce a strong classifier.

Due to the nature of the algorithm, the Viola-Jones method is restricted to binary classification tasks (such as object detection) and has a very long training period. However, it classifies images quickly because each weak classifier requires only a small number of parameters, and with a sufficient number of weak classifiers, it has a low rate of false positives.

### Features and the Integral Image

One of the first key contributions of the paper introducing Viola-Jones was a set of simple features to use in image recognition. In most tasks, the pixel values themselves are the features fed into the algorithm. Viola and Jones instead introduced the rectangle features shown in the original paper's figure: A and B are two-rectangle features, C is a three-rectangle feature, and D is a four-rectangle feature.

The sum of the pixels in the unshaded rectangles is subtracted from the sum of the pixels in the shaded rectangles. Even for small images there are a lot of features (over 160,000 for a 24 x 24 image). Since the algorithm requires iterating over all features, they must be computed very efficiently. To do this, Viola and Jones introduced the integral image, defined by the following recursive relationship:

s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)

with s(x, −1) = 0 and ii(−1, y) = 0.

s(x, y) is the cumulative row sum at point (x, y), ii(x, y) is the integral image value at the same point, and i(x, y) is the pixel value at that point. What this relationship says is that the integral image at a point (x, y) is the sum of all pixels above and to the left of (x, y). This makes it easy to compute the sum of the pixels in any rectangular region, as shown below.

The sum of the pixels in region D is ii(4) + ii(1) − ii(2) − ii(3), which is just four array references. Let's start implementing the algorithm by creating a helper function that calculates the integral image of a given image (represented as a 2D numpy array).

```python
import numpy as np  # don't forget to import numpy

def integral_image(image):
    ii = np.zeros(image.shape)
    s = np.zeros(image.shape)
    for y in range(len(image)):
        for x in range(len(image[y])):
            s[y][x] = s[y-1][x] + image[y][x] if y-1 >= 0 else image[y][x]
            ii[y][x] = ii[y][x-1] + s[y][x] if x-1 >= 0 else s[y][x]
    return ii
```

This function simply implements the recursive relationship defined above. Since s(x, -1) = 0, when y-1 < 0 the row sum s(x, y) is just i(x, y), and likewise ii(x, y) is just s(x, y) when x-1 < 0. Next, let's define a helper class to store rectangle regions, which will make it easy to compute the value of features later.
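As a quick sanity check, the loop above should agree with NumPy's built-in cumulative sums: applying `cumsum` along each axis produces the same inclusive integral image. (The function is repeated here so the snippet is self-contained.)

```python
import numpy as np

def integral_image(image):
    # same recurrences as above: cumulative row sum s, then integral image ii
    ii = np.zeros(image.shape)
    s = np.zeros(image.shape)
    for y in range(len(image)):
        for x in range(len(image[y])):
            s[y][x] = s[y-1][x] + image[y][x] if y-1 >= 0 else image[y][x]
            ii[y][x] = ii[y][x-1] + s[y][x] if x-1 >= 0 else s[y][x]
    return ii

img = np.arange(35, dtype=float).reshape(5, 7)
ii = integral_image(img)
# ii[y][x] is the sum of all pixels at or above-and-left of (x, y)
assert np.array_equal(ii, img.cumsum(axis=0).cumsum(axis=1))
assert ii[4][6] == img.sum()  # bottom-right entry is the sum of the whole image
```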

```python
class RectangleRegion:
    def __init__(self, x, y, width, height):
        self.x = x
        self.y = y
        self.width = width
        self.height = height

    def compute_feature(self, ii):
        return ii[self.y+self.height][self.x+self.width] + ii[self.y][self.x] \
            - (ii[self.y+self.height][self.x] + ii[self.y][self.x+self.width])
```
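One subtlety worth noting: with an inclusive integral image, the four-corner formula above returns the sum of the width x height window whose top-left corner is at (x+1, y+1). Since every rectangle in a feature is shifted by the same amount, the features themselves are unaffected. A quick check against a brute-force sum (the class is repeated so the snippet runs standalone):

```python
import numpy as np

class RectangleRegion:
    def __init__(self, x, y, width, height):
        self.x, self.y = x, y
        self.width, self.height = width, height

    def compute_feature(self, ii):
        # four array references instead of summing every pixel in the rectangle
        return ii[self.y+self.height][self.x+self.width] + ii[self.y][self.x] \
            - (ii[self.y+self.height][self.x] + ii[self.y][self.x+self.width])

img = np.arange(36, dtype=float).reshape(6, 6)
ii = img.cumsum(axis=0).cumsum(axis=1)   # inclusive integral image

r = RectangleRegion(1, 2, 3, 2)
# matches the brute-force sum over the window shifted by (+1, +1)
assert r.compute_feature(ii) == img[3:5, 2:5].sum()
```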

### Viola-Jones

Now that we have set up our helper classes and functions, we can start creating the Viola-Jones algorithm. Let's start by defining a ViolaJones class. The only hyperparameter our algorithm will use is T, the number of weak classifiers.

```python
class ViolaJones:
    def __init__(self, T=10):
        self.T = T
```

For training, Viola-Jones uses a variant of AdaBoost. The general idea of boosting is for each subsequent weak classifier to correct the mistakes of the previous one. To do this, the algorithm assigns a weight to each training example, trains the weak classifiers, chooses the best one, and updates the weights according to that classifier's error. Incorrectly labeled examples get larger weights so that the next classifier chosen is more likely to classify them correctly.

This gives the following outline of the algorithm:

1. Initialize the weights
2. Normalize the weights
3. Select the best weak classifier (based on the weighted error)
4. Update the weights based on the chosen classifier's error
5. Repeat steps 2–4 T times where T is the desired number of weak classifiers

Add a method `train` to the ViolaJones class where we can implement the training.

```python
def train(self, training):
    training_data = []
    for x in range(len(training)):
        training_data.append((integral_image(training[x][0]), training[x][1]))
```

For now, `train` takes a single parameter, `training`, which is an array of tuples. The first element of each tuple is a 2D numpy array representing the image, and the second element is its classification (1 for positive, 0 for negative). The loop converts each image into its integral image format. We add the integral images to a new array to preserve the original dataset.

#### Initializing the weights

At the start of the algorithm, we have no error on which to base the weights, so each training example of the same class is weighted the same (i.e. all positive examples are equally weighted, as are all negative examples). This means for the ith image:

wᵢ = 1/(2p) if the example is positive, and wᵢ = 1/(2n) if it is negative

where p is the number of positive examples and n is the number of negative examples. Let's assume we know the number of positive and negative examples beforehand, so `train` will take them as parameters.
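A tiny numeric check of this initialization (the labels below are made up for illustration): the weights form a probability distribution, with half the total mass on each class.

```python
import numpy as np

labels = np.array([1, 1, 0, 0, 0])    # 2 positives, 3 negatives
p = int((labels == 1).sum())
n = int((labels == 0).sum())
weights = np.where(labels == 1, 1.0 / (2 * p), 1.0 / (2 * n))

assert abs(weights.sum() - 1.0) < 1e-9               # a proper distribution
assert abs(weights[labels == 1].sum() - 0.5) < 1e-9  # half the mass per class
```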

```python
def train(self, training, pos_num, neg_num):
    weights = np.zeros(len(training))
    training_data = []
    for x in range(len(training)):
        training_data.append((integral_image(training[x][0]), training[x][1]))
        if training[x][1] == 1:
            weights[x] = 1.0 / (2 * pos_num)
        else:
            weights[x] = 1.0 / (2 * neg_num)
```

#### Building the Features

The main loop of training requires choosing the best weak classifier, but there is one weak classifier for each possible feature. Because of this, we have to build all of the features before we start implementing the main loop of training. Recall the rectangle features described earlier.

Let’s implement `build_features` as part of the ViolaJones class to return an array of all the features.

```python
def build_features(self, image_shape):
    height, width = image_shape
    features = []
    for w in range(1, width+1):
        for h in range(1, height+1):
            i = 0
            while i + w < width:
                j = 0
                while j + h < height:
                    # 2-rectangle features
                    immediate = RectangleRegion(i, j, w, h)
                    right = RectangleRegion(i+w, j, w, h)
                    if i + 2 * w < width:  # horizontally adjacent
                        features.append(([right], [immediate]))

                    bottom = RectangleRegion(i, j+h, w, h)
                    if j + 2 * h < height:  # vertically adjacent
                        features.append(([immediate], [bottom]))

                    # 3-rectangle features
                    right_2 = RectangleRegion(i+2*w, j, w, h)
                    if i + 3 * w < width:  # horizontally adjacent
                        features.append(([right], [right_2, immediate]))

                    bottom_2 = RectangleRegion(i, j+2*h, w, h)
                    if j + 3 * h < height:  # vertically adjacent
                        features.append(([bottom], [bottom_2, immediate]))

                    # 4-rectangle features
                    bottom_right = RectangleRegion(i+w, j+h, w, h)
                    if i + 2 * w < width and j + 2 * h < height:
                        features.append(([right, bottom], [immediate, bottom_right]))
                    j += 1
                i += 1
    return features
```

The `image_shape` parameter is a tuple of the form `(height, width)`. The rest of the method loops over all rectangles in the image and checks whether a feature can be made with each one. Our representation of a feature is a tuple containing two arrays: the first holds the RectangleRegions that contribute positively to the feature, and the second holds the RectangleRegions that contribute negatively. This representation will allow us to easily save our classifier later on.
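To get a feel for how quickly the feature count grows, the loop structure above can be mirrored with plain counters, with no RectangleRegion objects needed. Note that this sketch reproduces the strict `<` bounds used above, so its counts are slightly smaller than the figure usually quoted for the canonical enumeration.

```python
def count_features(height, width):
    # mirrors build_features, counting features instead of constructing regions
    count = 0
    for w in range(1, width + 1):
        for h in range(1, height + 1):
            i = 0
            while i + w < width:
                j = 0
                while j + h < height:
                    if i + 2 * w < width:                          # 2-rect horizontal
                        count += 1
                    if j + 2 * h < height:                         # 2-rect vertical
                        count += 1
                    if i + 3 * w < width:                          # 3-rect horizontal
                        count += 1
                    if j + 3 * h < height:                         # 3-rect vertical
                        count += 1
                    if i + 2 * w < width and j + 2 * h < height:   # 4-rect
                        count += 1
                    j += 1
                i += 1
    return count

print(count_features(19, 19), count_features(24, 24))
```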

#### Applying the features

When we find the optimal weak classifiers later in the algorithm, we will need the value of every feature on every training example. To save computation, we'll evaluate them all once before training starts. Although each classifier has to be retrained every time a new one is selected (because the weights change), the value of each feature on each image never changes, so applying the features inside the training loop would repeat work. Applying the features up front will also allow us to pre-select features later on to speed up training. Keeping a separate array holding the label of each training example will simplify our code, so we will create that array during this step as well. Add the `apply_features` method to the ViolaJones class.

```python
def apply_features(self, features, training_data):
    X = np.zeros((len(features), len(training_data)))
    y = np.array(list(map(lambda data: data[1], training_data)))
    i = 0
    for positive_regions, negative_regions in features:
        feature = lambda ii: sum([pos.compute_feature(ii) for pos in positive_regions]) \
            - sum([neg.compute_feature(ii) for neg in negative_regions])
        X[i] = list(map(lambda data: feature(data[0]), training_data))
        i += 1
    return X, y
```

The `train` method should now look like this

```python
def train(self, training, pos_num, neg_num):
    weights = np.zeros(len(training))
    training_data = []
    for x in range(len(training)):
        training_data.append((integral_image(training[x][0]), training[x][1]))
        if training[x][1] == 1:
            weights[x] = 1.0 / (2 * pos_num)
        else:
            weights[x] = 1.0 / (2 * neg_num)

    features = self.build_features(training_data[0][0].shape)
    X, y = self.apply_features(features, training_data)
```

#### The Weak Classifiers

We finally have all of the pieces to start building and training the weak classifiers. Recall that Viola-Jones uses a series of weak classifiers and weighs their results together to produce the final classification. Each weak classifier is "weak" because, by itself, it cannot accurately fulfill the classification task. Each weak classifier looks at a single feature (f). It has both a threshold (θ) and a polarity (p) to determine the classification of a training example.

Polarity can be either -1 or 1. When p = 1, the weak classifier outputs a positive result when f(x) < θ, or the feature value is less than the threshold. When p = -1, the weak classifier outputs a positive result when f(x) > θ. For now, let’s define a class to encapsulate the weak classifier functionality.
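The polarity trick can be seen with plain numbers. This is a standalone sketch of just the decision rule, not the full class:

```python
def weak_decision(f_value, threshold, polarity):
    # output 1 (positive) when polarity * f(x) < polarity * threshold, else 0
    return 1 if polarity * f_value < polarity * threshold else 0

assert weak_decision(3.0, 5.0, 1) == 1    # p = 1: positive when f(x) < θ
assert weak_decision(7.0, 5.0, 1) == 0
assert weak_decision(7.0, 5.0, -1) == 1   # p = -1: positive when f(x) > θ
assert weak_decision(3.0, 5.0, -1) == 0
```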

```python
class WeakClassifier:
    def __init__(self, positive_regions, negative_regions, threshold, polarity):
        self.positive_regions = positive_regions
        self.negative_regions = negative_regions
        self.threshold = threshold
        self.polarity = polarity

    def classify(self, x):
        feature = lambda ii: sum([pos.compute_feature(ii) for pos in self.positive_regions]) \
            - sum([neg.compute_feature(ii) for neg in self.negative_regions])
        return 1 if self.polarity * feature(x) < self.polarity * self.threshold else 0
```

Note that each feature is a sum of the positive rectangle regions with the sum of the negative rectangle regions subtracted from them. Recall that the next step in the algorithm is to choose the best weak classifier. To do that, we need to find the optimal threshold and polarity for each one.

#### Training the Weak Classifiers

Training the weak classifiers is the most computationally expensive part of the algorithm, because each time a new weak classifier is selected as the best one, all of them have to be retrained (the training examples are weighted differently after each round). However, there is an efficient way to find the optimal threshold and polarity for a single weak classifier using the weights. First, sort the weights according to the feature value they correspond to. The possible thresholds are exactly the feature values of the training examples, so iterate through the sorted array, compute the error that would result from choosing each feature value as the threshold, and keep the threshold and polarity with the minimum error. The error can be measured by

e = min(S⁺ + (T⁻ − S⁻), S⁻ + (T⁺ − S⁺))

T⁺ and T⁻ are the total sums of the positive and negative weights, and S⁺ and S⁻ are the sums of the positive and negative weights among the examples seen so far. Conceptually, this error compares how many examples will be misclassified if all examples below the current location are labeled negative with how many will be misclassified if all examples below are labeled positive (taking each example's weight into account). In the source article's figure, the numbers indicate feature values and the bubble sizes indicate relative weights; there, the error is minimized when any feature with a value less than 9 is classified as blue, corresponding to a threshold of 9 with a polarity of 1.

This way, we can evaluate the error of each possible threshold in constant time (O(1)) and the error of all thresholds in linear time (O(n)). The threshold is set to the feature value at which the error is minimized. The polarity is determined by how many positive and negative examples lie to the left of the threshold (feature value less than θ) versus to the right (greater than θ): if there are more positive examples left of the threshold, p = 1; otherwise p = -1. Here is what I just described in code (part of the ViolaJones class). This method will train all of the weak classifiers and return them in an array.
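Before the full implementation, the scan can be illustrated on a toy example with unit weights, so the weight sums reduce to counts. The labels below are already sorted by feature value and are made up for illustration:

```python
# labels of six examples, sorted by their feature value; all weights are 1
labels = [1, 1, 0, 1, 0, 0]
total_pos = labels.count(1)
total_neg = labels.count(0)

pos_seen = neg_seen = 0
errors = []
for label in labels:
    # error if the threshold is placed just before this example:
    # min(label everything below as negative, label everything below as positive)
    errors.append(min(neg_seen + total_pos - pos_seen,
                      pos_seen + total_neg - neg_seen))
    if label == 1:
        pos_seen += 1
    else:
        neg_seen += 1

# the best split sits after the first two positives: only one example
# (the stray positive later in the list) is then misclassified
assert errors == [3, 2, 1, 2, 1, 2]
assert min(errors) == 1
```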

```python
def train_weak(self, X, y, features, weights):
    total_pos, total_neg = 0, 0
    for w, label in zip(weights, y):
        if label == 1:
            total_pos += w
        else:
            total_neg += w

    classifiers = []
    total_features = X.shape[0]
    for index, feature in enumerate(X):
        if len(classifiers) % 1000 == 0 and len(classifiers) != 0:
            print("Trained %d classifiers out of %d" % (len(classifiers), total_features))

        applied_feature = sorted(zip(weights, feature, y), key=lambda x: x[1])

        pos_seen, neg_seen = 0, 0
        pos_weights, neg_weights = 0, 0
        min_error, best_feature, best_threshold, best_polarity = float('inf'), None, None, None
        for w, f, label in applied_feature:
            error = min(neg_weights + total_pos - pos_weights, pos_weights + total_neg - neg_weights)
            if error < min_error:
                min_error = error
                best_feature = features[index]
                best_threshold = f
                best_polarity = 1 if pos_seen > neg_seen else -1

            if label == 1:
                pos_seen += 1
                pos_weights += w
            else:
                neg_seen += 1
                neg_weights += w

        clf = WeakClassifier(best_feature[0], best_feature[1], best_threshold, best_polarity)
        classifiers.append(clf)
    return classifiers
```

#### Selecting the best weak classifier

Once we have trained all of the weak classifiers, we can find the best one. That is as simple as iterating through all of the classifiers and calculating the average weighted error of each one. Add the `select_best` method to the ViolaJones class.

```python
def select_best(self, classifiers, weights, training_data):
    best_clf, best_error, best_accuracy = None, float('inf'), None
    for clf in classifiers:
        error, accuracy = 0, []
        for data, w in zip(training_data, weights):
            correctness = abs(clf.classify(data[0]) - data[1])
            accuracy.append(correctness)
            error += w * correctness
        error = error / len(training_data)
        if error < best_error:
            best_clf, best_error, best_accuracy = clf, error, accuracy
    return best_clf, best_error, best_accuracy
```

Notice that we returned to our original representation of the training data (an array of tuples). This is because we need the integral image to pass to the weak classifier, not the value of features. Also, notice that we keep track of the accuracy. This will come into play when updating the weights in the next step of the algorithm. Together, `train_weak` and `select_best` make up step three of the high-level algorithm I outlined earlier. Update `train` to make use of these two methods.

```python
def train(self, training, pos_num, neg_num):
    ...
    for t in range(self.T):
        weak_classifiers = self.train_weak(X, y, features, weights)
        clf, error, accuracy = self.select_best(weak_classifiers, weights, training_data)
```

These functions belong in the loop because the best weak classifier is dependent on the weights of each training example, and recall that the weights for each training example change after each new weak classifier is chosen.

#### Updating the Weights

The bulk of the work in Viola-Jones goes to building the features, training the classifiers, and choosing the best weak classifier in each iteration. The rest of the train method is relatively simple. Before choosing the best classifier, we should normalize the weights. After choosing the best weak classifier, we update the weights using its error: training examples that were classified correctly get smaller weights, while examples classified incorrectly keep the same weights.

wᵢ ← wᵢ · β^(1 − eᵢ), where β = ε / (1 − ε)

Epsilon (ε) represents the error of the best classifier, wᵢ is the weight of the ith example, and beta is the factor by which to change the weight. The exponent of beta is 1 − eᵢ, where eᵢ is 0 if the training example was classified correctly and 1 if it was classified incorrectly (this is why we returned the accuracy array when choosing the best weak classifier).
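A quick numeric illustration of the update (the numbers here are arbitrary):

```python
import math

error = 0.2                      # ε: weighted error of the chosen classifier
beta = error / (1.0 - error)     # β = 0.25

w = 0.1                          # some example's current weight
w_correct = w * beta ** (1 - 0)  # classified correctly (e = 0): weight shrinks
w_wrong = w * beta ** (1 - 1)    # classified incorrectly (e = 1): unchanged
alpha = math.log(1.0 / beta)     # this classifier's vote in the final ensemble

assert abs(w_correct - 0.025) < 1e-12
assert w_wrong == w
assert abs(alpha - math.log(4)) < 1e-12
```

The smaller the error, the smaller beta becomes, so correctly classified examples fade faster and the classifier earns a larger alpha.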

```python
def train(self, training, pos_num, neg_num):
    ...
    for t in range(self.T):
        weights = weights / np.linalg.norm(weights)
        weak_classifiers = self.train_weak(X, y, features, weights)
        clf, error, accuracy = self.select_best(weak_classifiers, weights, training_data)
        beta = error / (1.0 - error)
        for i in range(len(accuracy)):
            weights[i] = weights[i] * (beta ** (1 - accuracy[i]))
```

#### The Strong Classifier

Finally, we have to compile the strong classifier out of our weak classifiers. The strong classifier is defined as

C(x) = 1 if Σₜ αₜ hₜ(x) ≥ ½ Σₜ αₜ, and 0 otherwise, where αₜ = ln(1/βₜ)

The coefficient alpha is how much weight each weak classifier has in the final decision, and it is dependent on the error since it is the natural log of the inverse of beta. The weighted sum of the weak classifiers’ decisions is compared to half the sum of the alphas because it is akin to saying “At least half the weak classifiers agree with a positive classification.”
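As a toy illustration of the final vote (the alphas and weak-classifier outputs here are invented for the example):

```python
alphas = [1.0, 0.5, 0.8]   # hypothetical weak-classifier weights
votes = [1, 0, 1]          # hypothetical weak-classifier outputs h_t(x)

# weighted vote compared against half the total alpha mass
total = sum(a * v for a, v in zip(alphas, votes))
decision = 1 if total >= 0.5 * sum(alphas) else 0

assert decision == 1   # 1.8 clears the 1.15 bar: a positive classification
```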

We’ll need to store all of the classifiers and their corresponding alphas, so add two new instance variables to the init method.

```python
def __init__(self, T=10):
    self.T = T
    self.alphas = []
    self.clfs = []
```

Next, update the loop of the train method to compute the alphas and store them along with the classifiers.

```python
def train(self, training, pos_num, neg_num):
    ...
    for t in range(self.T):
        weights = weights / np.linalg.norm(weights)
        weak_classifiers = self.train_weak(X, y, features, weights)
        clf, error, accuracy = self.select_best(weak_classifiers, weights, training_data)
        beta = error / (1.0 - error)
        for i in range(len(accuracy)):
            weights[i] = weights[i] * (beta ** (1 - accuracy[i]))
        alpha = math.log(1.0/beta)
        self.alphas.append(alpha)
        self.clfs.append(clf)
```

The final train method should look as follows

```python
def train(self, training, pos_num, neg_num):
    weights = np.zeros(len(training))
    training_data = []
    for x in range(len(training)):
        training_data.append((integral_image(training[x][0]), training[x][1]))
        if training[x][1] == 1:
            weights[x] = 1.0 / (2 * pos_num)
        else:
            weights[x] = 1.0 / (2 * neg_num)

    features = self.build_features(training_data[0][0].shape)
    X, y = self.apply_features(features, training_data)

    for t in range(self.T):
        weights = weights / np.linalg.norm(weights)
        weak_classifiers = self.train_weak(X, y, features, weights)
        clf, error, accuracy = self.select_best(weak_classifiers, weights, training_data)
        beta = error / (1.0 - error)
        for i in range(len(accuracy)):
            weights[i] = weights[i] * (beta ** (1 - accuracy[i]))
        alpha = math.log(1.0/beta)
        self.alphas.append(alpha)
        self.clfs.append(clf)
```

Make sure to add `import math` to the top of the file. Finally, implement the classification method for the strong classifier.

```python
def classify(self, image):
    total = 0
    ii = integral_image(image)
    for alpha, clf in zip(self.alphas, self.clfs):
        total += alpha * clf.classify(ii)
    return 1 if total >= 0.5 * sum(self.alphas) else 0
```

#### Improvement

The number of features grows incredibly fast: for a 24x24 image there are over 160,000 features, and for a 28x28 image there are over 250,000. Moreover, many of these features offer little classification power and are mostly useless. It would be useful to remove as many features as possible so we can train fewer WeakClassifiers. Fortunately, the scikit-learn package can help us narrow down our feature space with its SelectPercentile class. To take advantage of this, add the following to the `train` method and import the required classes.

```python
from sklearn.feature_selection import SelectPercentile, f_classif

def train(self, training, pos_num, neg_num):
    ...
    X, y = self.apply_features(features, training_data)
    indices = SelectPercentile(f_classif, percentile=10).fit(X.T, y).get_support(indices=True)
    X = X[indices]
    features = [features[i] for i in indices]
    ...
```

Fitting the SelectPercentile class finds the best k% of features (I have chosen 10% here). The `get_support` method then returns the indices of those features. When fitting SelectPercentile, notice that X.T is passed in rather than just X. This is because SelectPercentile expects each row to be a single training example and each column to be a feature value, but our X array has this transposed.
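A small standalone illustration of what SelectPercentile does, on synthetic data already in the examples-by-features orientation (only column 3 carries any signal, so it should survive the cut):

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))       # 200 examples, 40 features (rows = examples)
y = (X[:, 3] > 0).astype(int)        # labels depend only on feature 3

selector = SelectPercentile(f_classif, percentile=10).fit(X, y)
indices = selector.get_support(indices=True)

assert 3 in indices                  # the informative feature is kept
assert len(indices) < X.shape[1]     # most features are discarded
```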

Since ViolaJones takes a long time to train, it is very useful to be able to save and load the model from a file. This is easy with Python's pickle module (add `import pickle` to the top of the file). Add the following methods to the ViolaJones class.

```python
def save(self, filename):
    with open(filename+".pkl", 'wb') as f:
        pickle.dump(self, f)

@staticmethod
def load(filename):
    with open(filename+".pkl", 'rb') as f:
        return pickle.load(f)
```

### Testing

ViolaJones was created for the task of face detection, so what better way to test our code than on that same task? A good dataset for this is the CBCL Face Database created by MIT's Center for Biological and Computational Learning. The dataset contains a training set with over 2,000 face images and over 4,500 non-face images. Each image is 19x19 and grayscale. The original dataset stores each image in .pgm format, so I have created two pickle files, one for training and one for testing, that let us load the data directly into our ViolaJones class. You can download them here on GitHub.

```python
def train(t):
    with open("training.pkl", 'rb') as f:
        training = pickle.load(f)
    clf = ViolaJones(T=t)
    clf.train(training, 2429, 4548)
    evaluate(clf, training)
    clf.save(str(t))

def test(filename):
    with open("test.pkl", 'rb') as f:
        test = pickle.load(f)
    clf = ViolaJones.load(filename)
    evaluate(clf, test)

def evaluate(clf, data):
    correct = 0
    for x, y in data:
        correct += 1 if clf.classify(x) == y else 0
    print("Classified %d out of %d test examples" % (correct, len(data)))

train(10)
test("10")
```

With T = 10, the algorithm took an hour to train and achieved 85.5% accuracy on the training set and 78% accuracy on the test set. With T = 50, the algorithm took five hours to train and achieved 93% accuracy on both the training and the test sets.

### Final Remarks

Viola-Jones implements several concepts important to machine learning and artificial intelligence. First, it is an ensemble method. Second, it makes use of a variant of the AdaBoost algorithm for feature selection. These characteristics made it a strong method for image classification that was widely used in the past and continues to be used. The original paper also introduced the idea of an "attentional cascade" to greatly reduce classification time for negative examples. Learn more about that in Part Two of this tutorial.

I hope this tutorial gave you a good understanding of the Viola-Jones algorithm. You can view all of the code here. Thanks for reading!