OpenCV — Node.js Tutorial Series
The main goal of this article is to give you an introduction to the Machine Learning module of OpenCV 3 and to show you how you can utilize it with Node.js using my npm package opencv4nodejs. As always, the source code can be found on my GitHub repository. Let’s go …
This article is part of an OpenCV — Node.js tutorial series. Make sure to check out my other articles as well:
Collecting the training data
Before we can start, we need some training data as well as some data to evaluate our trained model on afterwards. So I sat down for 10 minutes and wrote some letters…
After scanning the documents, I applied some cropping and perspective transforming magic with OpenCV to align the letters to a rectangular frame and I filtered out the grid lines from the background:
From both images I generated the training and test images by cutting out each single letter and storing it as a binary image with a resolution of 40 x 40 pixels. For each letter we should end up with 19 images for training and 15 images for testing.
Initializing the HOGDescriptor
Now that we have our data in place, we can get started. The HOG feature descriptor is a common descriptor used for object detection, which was initially proposed for pedestrian detection. HOG, as the name suggests, works with histograms of oriented gradients, which are determined by edge detection from different angles. For each of our letter images we will compute a HOG feature vector and supply the SVM with these feature vectors to learn the characteristics of each letter. Our HOGDescriptor object will be initialized with the following parameters:
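A minimal sketch of that initialization with opencv4nodejs might look like this (only the parameters discussed below are set; everything else keeps its default):

```javascript
const cv = require('opencv4nodejs');

const hog = new cv.HOGDescriptor({
  winSize: new cv.Size(40, 40),     // same size as our training and test images
  blockSize: new cv.Size(20, 20),   // each image is scanned in 20 x 20 px blocks
  blockStride: new cv.Size(10, 10), // blocks are shifted in 10 px steps
  cellSize: new cv.Size(10, 10),    // each block is divided into 4 cells of 10 x 10 px
  nbins: 9                          // the gradients of each cell go into a 9 bin histogram
});
```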
We choose a winSize with the same size as our training and test images, which is 40 x 40 px. Each image is scanned in blocks of 20 x 20 px and each block is divided into four cells of 10 x 10 px. For each cell the image gradients are computed and mapped to a histogram with 9 bins (nbins). A blockStride of 10 x 10 means that each block is shifted in 10 px steps in x- and y-direction, which gives us 9 different locations of 20 x 20 px blocks that fit into a 40 x 40 px image. In total this produces cells per block * number of blocks (4 * 9) = 36 histograms. After computing the histograms, each of them having 9 bins, they are concatenated to build a single feature vector with 36 * 9 = 324 entries.
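The descriptor length can be double-checked with a little arithmetic on the parameters:

```javascript
// verify the descriptor length that follows from the HOG parameters
const winSize = 40, blockSize = 20, blockStride = 10, cellSize = 10, nbins = 9;

// block positions per axis: (40 - 20) / 10 + 1 = 3, so 3 * 3 = 9 block locations
const blocksPerAxis = (winSize - blockSize) / blockStride + 1;
const numBlocks = blocksPerAxis * blocksPerAxis;

// cells per block: (20 / 10)^2 = 4
const cellsPerBlock = (blockSize / cellSize) ** 2;

// 9 blocks * 4 cells * 9 bins = 324 entries
const descriptorLength = numBlocks * cellsPerBlock * nbins;
console.log(descriptorLength); // 324
```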
Initializing the SVM
A Support Vector Machine is one of the various classifiers machine learning offers us, alongside decision trees, regression, Bayes classifiers and neural networks. Roughly speaking, given two sets of 2D points, an SVM tries to find the line separating these sets, the decision boundary, with the maximum margin to the closest points of each set, the support vectors. If you now go ahead and throw some new points at it, the SVM will tell you on which side of the boundary they are located. In other words, it will predict which set of points a new point belongs to.
After that amateurish explanation you may think: OK, sounds nice, but I can do that myself, just give me a paper and a pen. Well, first of all we do not only have two sets of points, we have 26 classes, one for each lowercase letter. Furthermore, we are not actually feeding the SVM with 2D points; we are going to supply it with the computed HOG descriptors of 324 entries each. In this case the SVM will not find a single line as in the 2D example, but the hyperplanes that separate the feature vectors of our 26 classes in a 324-dimensional space.
We can initialize our SVM and give it some parameters like this:
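A sketch of that initialization with opencv4nodejs — treat the c and gamma values as placeholders, since the ones that actually work best depend on your data (see the note on trainAuto below):

```javascript
const cv = require('opencv4nodejs');

// c and gamma are example values; trainAuto can determine suitable ones for you
const svm = new cv.SVM({
  kernelType: cv.ml.SVM.RBF, // radial basis function kernel
  c: 12.5,
  gamma: 0.50625
});
```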
Oftentimes your data is not linearly separable, meaning you cannot simply draw a straight line between two sets of points, if we consider our 2D example again. Simply put, by choosing a different kernel type we attempt to transform the input data such that our problem becomes linearly separable. For that reason we did not choose a linear kernel here, but the radial basis function (RBF). OK, you got me, the true reason we go with RBF is that the OpenCV documentation suggests it as a good choice for most cases.
Choosing the RBF kernel, you can play around with the parameters C and gamma. Some of the vectors of differently labeled data might overlap, making it impossible to strictly separate the datasets. You can control how strictly you want these vectors close to the edge to be located on the correct side of the decision boundary by adjusting the C parameter. A low C value results in a smooth boundary, allowing margin values to be classified less strictly. Gamma determines the radius within which each training sample influences the classification of other samples.
From these vague explanations you may have figured that it might be better to read up on SVM parameters from professional sources, in case you really want to tweak them yourself. The reason I chose these specific parameter values is that I initially trained the SVM with trainAuto, which kindly figures out the parameters for you by adjusting them automatically during the learning process. However, this approach takes much more time to train the SVM.
Training the SVM
Enough theory for now, let’s look at some code. Our training and test images are stored in separate folders named after the lowercase letters. First we will get the absolute file path of each image and store them in an array in order from a to z as follows:
To compute the descriptor we will use the following helper function to ensure our input images are all equally sized and that the letter is centered in the image:
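A rough sketch of such a helper — `centerLetterInImage` is a hypothetical stand-in here, the actual centering logic lives in the repo’s source code:

```javascript
const computeHOGDescriptorFromImage = (img, isIorJ) => {
  // centerLetterInImage is a placeholder for the repo's centering helper
  const centered = centerLetterInImage(img, isIorJ);
  // make sure every sample matches the 40 x 40 winSize of the HOGDescriptor
  return hog.compute(centered.resize(40, 40));
};
```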
Centering the letter is important, as the HOG descriptor of a letter in the upper left corner of an image differs from the descriptor of the same letter in the bottom right corner. This is due to the gradients ending up in different cells of the feature vector, which is the result of concatenating the histograms of each cell, as discussed in the HOG section. The reason I pass an isIorJ flag is that these letters are made up of two components (the body and the dot). If you want to know how the centering is done, you can take a look at the source code.
Now we will go through each training image, compute its HOG descriptor and push the descriptor to a samples array. Furthermore we insert the label of that descriptor into a separate labels array at the same index. Once we have processed all images, we will train the SVM. The samples have to be wrapped in a cv.Mat of floating point vectors and their labels go into a vector-shaped cv.Mat holding integer values:
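A sketch of that loop and the training call — `trainImagePathsByLetter` and `computeHOGDescriptorFromImage` are placeholder names for the a-z ordered path list and the centering + HOG helper described above:

```javascript
const cv = require('opencv4nodejs');

const samples = [];
const labels = [];

trainImagePathsByLetter.forEach((files, label) => {
  files.forEach((file) => {
    const img = cv.imread(file).bgrToGray();
    // labels 8 and 9 correspond to 'i' and 'j', which have two components
    const desc = computeHOGDescriptorFromImage(img, label === 8 || label === 9);
    samples.push(desc);
    labels.push(label);
  });
});

// samples: one row of 324 floats per image, labels: one integer per row
svm.train(
  new cv.TrainData(
    new cv.Mat(samples, cv.CV_32F),
    cv.ml.ROW_SAMPLE,
    new cv.Mat([labels], cv.CV_32S)
  )
);
```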
That’s already it for the training phase. Let’s see if we can actually get the SVM to recognize some letters correctly…
Predicting the label of a sample is as simple as calling svm.predict with a feature descriptor. The feature descriptors have to have the same shape, more precisely the same length, as the descriptors we used to train the SVM. For that reason computing the HOG descriptor for training and test data is done the same way.
Now we will go through the test data images, compute the HOG descriptor of each image and predict its label. We keep track of the number of wrong predictions and print the percentage of misclassifications for each letter at the end:
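This might look roughly as follows — again, `testImagePathsByLetter` and `computeHOGDescriptorFromImage` are placeholder names for the test path list and the HOG helper described above:

```javascript
const cv = require('opencv4nodejs');

// one error counter per letter class
const numWrong = Array.from({ length: 26 }, () => 0);

testImagePathsByLetter.forEach((files, label) => {
  files.forEach((file) => {
    const img = cv.imread(file).bgrToGray();
    const desc = computeHOGDescriptorFromImage(img, label === 8 || label === 9);
    if (svm.predict(desc) !== label) {
      numWrong[label] += 1;
    }
  });
});

// 15 test images per letter
numWrong.forEach((errs, label) => {
  const letter = String.fromCharCode(97 + label);
  console.log(`${letter}: ${((errs / 15) * 100).toFixed(2)}% misclassified`);
});
```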
OK, I admit I cheated a bit on that one. As I said, I need to tell the letter centering function whether the letter is an I or a J. Of course that’s strange if we actually want the SVM to predict the letter for us, but I was too lazy to rewrite the letter centering function to figure out the number of components itself.
If we run this for our sample data we will get the following result:
Looks actually pretty good to me, considering we only used 19 images per class for 26 classes in total to train the SVM. On average, 85% of all letters of our test image set have been classified correctly. For some letters even all of the 15 images have been recognized correctly. There are some outliers, however, partly owing to my awesome handwriting. For example, 6 of my ‘l’s have been classified as ‘e’s. Also 3 of the ‘e’s have been mistaken for ‘c’s and 3 of my ‘r’s have been confused with ‘v’s, which are indeed hard to distinguish if you look at the sample images.
The SVM is trained to recognize my own handwriting at the moment. If we wanted to make it more generic, we would probably have to train it with handwritten letters from other people as well, to capture all kinds of different ways to write a letter. Furthermore, we might also want to consider different font weights and different scales of a letter, as letters may appear thicker with different pens and some people write smaller. In these cases it can be helpful to generate more training data from the existing training images by scaling the letters up and down, or by dilating and eroding them, which can easily be done with OpenCV.
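A hypothetical augmentation sketch for a single binary letter image (the file path is a placeholder):

```javascript
const cv = require('opencv4nodejs');

const img = cv.imread('./a_0.png').bgrToGray(); // placeholder path

// a 3 x 3 rectangular kernel for the morphological operations
const kernel = cv.getStructuringElement(cv.MORPH_RECT, new cv.Size(3, 3));

const thicker = img.dilate(kernel);   // simulates a thicker pen stroke
const thinner = img.erode(kernel);    // simulates a thinner pen stroke
const smaller = img.resize(30, 30);   // simulates smaller handwriting; re-center to 40 x 40 afterwards
```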
Hopefully this article showed you how easy it is to train your first simple classifier from scratch using OpenCV’s SVM implementation. It literally takes a few lines of code to get started training the model with your own images. Personally, experimenting with this OCR example helped me a lot to get started with machine learning and to demystify some of the concepts of SVMs. Of course decision trees as well as convolutional and deep neural networks are available in OpenCV too, and they are next on my list to make accessible in my package.
If you liked this article, feel free to clap and comment. I would also highly appreciate you supporting the opencv4nodejs project by leaving a star on GitHub. Furthermore, feel free to contribute or get in touch if you are interested :).