Incremental Learning with Support Vector Machines

Krisha Mehta
Computers, Papers and Everything
6 min read · May 4, 2018

This paper focuses on Support Vector Machines (SVMs). For anyone who is not familiar with support vector machines, I will try to explain them below using an example I came across.

Suppose you have a bunch of red and blue balls on a table.
You wish to separate them.
What you do is place a stick between them so that they are separated.
Now, someone comes across your table and mixes them up again.
But what if there are so many balls that you can't really find the space to place the stick?
You will have to flip the table, and while the balls are in the air, grab a sheet of paper and hold it between the balls, thus separating them.

Boring people call the balls data, the stick a classifier, the table-flipping kernelling, and the sheet of paper a hyperplane.

You could also try to place the stick so that there is room for more balls between the existing balls and the stick. This is called optimization.
Once you have placed the stick, you notice that only a few balls, the ones closest to the stick, really determine where it goes. This makes these balls very useful; the other balls may or may not be needed. The useful balls are called support vectors.

As databases keep growing, there is a constant need to handle growing amounts of training data as well. One possible solution is to use incremental learning techniques, where the data is processed in parts (only a subset of the data is considered at any given point in time) and the results are then combined, saving memory. This is where support vectors become useful, since they summarize the data well.
This paper proposes a framework for incremental learning with SVMs and evaluates its effectiveness using a set of proposed goodness criteria on some standard machine learning benchmark data-sets. The aim of this paper is to:
1. Present an incremental learning procedure using SVMs.
2. Propose criteria for evaluating the goodness of the incremental learning procedure.
3. Empirically demonstrate that the proposed incremental SVM procedure performs well according to the above criteria on some standard benchmark data sets.
Given l data points {(x1, y1), (x2, y2), …, (xl, yl)}, the decision function of an SVM has the form

f(x) = sign( Σ_{i=1..l} α_i y_i K(x_i, x) + b )

where K is the kernel function, the α_i are coefficients obtained from training, and b is the bias term.
Often, only a small fraction of the α_i coefficients are non-zero. Because of this, the corresponding x_i entries and y_i output labels fully define the decision function, and only these three quantities need to be preserved for classification. The rest of the training data contributes nothing to the decision function and is regarded as redundant. The x_i with non-zero α_i are the support vectors.
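To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) that fits an SVM with scikit-learn and rebuilds the decision function from the stored support vectors alone; the dataset, kernel, and parameter values are arbitrary choices, not the paper's.

```python
# Sketch: the decision function depends only on the support vectors.
# scikit-learn's dual_coef_ stores alpha_i * y_i for the support vectors only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)
print("support vectors:", len(clf.support_), "out of", len(X), "training points")

# Rebuild f(x) = sum_i alpha_i * y_i * K(x_i, x) + b using only the support vectors.
K = rbf_kernel(X, clf.support_vectors_, gamma=0.1)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(f_manual, clf.decision_function(X)))  # True
```

The manually rebuilt scores match the model's own decision_function, which is exactly the point: the non-support-vector training examples can be thrown away without changing the classifier.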
Since only a small number of data points end up as support vectors, the support vector algorithm is able to summarize the data space in a very concise manner. Incremental training builds on this (the Type 1 procedure and sketch below spell it out):
1. Feed a subset of the data into the model.
2. Preserve only the support vectors.
3. Add them to the next subset of data.

To evaluate the effectiveness of incremental training, it is compared with a model trained on all the data in one batch, i.e., a multiple-batch model is compared with a single-batch model. Two cases were compared.

Case 1: When the model has learned only from previously seen data, how does SVM perform when new information is added? Does the performance improve or deteriorate at each successive incremental step?

Case 2: If all the previously seen data is saved, how well does SVM perform on unknown data? This is effectively batch learning using all the data seen so far.

To evaluate these cases, the data is split into a training set TR and a test set TE.
Each TR is further divided into 10 mutually exclusive sections, each containing 10% of the data. Two types of incremental learning are carried out.

Type 1:
1. Train the SVM on TR1.
2. Take the support vectors SV1 chosen from TR1 and add them to TR2.
3. Run the algorithm on SV1 + TR2.
4. Now, the Support Vectors from SV1+ TR2 are taken and called SV2.
5. Add these SV2 to TR3 and run the algorithm again.
6. This is done until all TRi are used, i.e., the last iteration runs on SV(i-1) + TRi.
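The following is a minimal sketch of the Type 1 procedure (an illustration under my own assumptions, not the paper's code): the dataset, kernel, parameters, and the 90/10 train/test split are arbitrary, and scikit-learn's SVC stands in for the paper's SVM implementation.

```python
# Sketch of Type 1 incremental learning: train on SV(i-1) + TRi, keep only SVi.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# Split TR into 10 mutually exclusive chunks TR1..TR10.
chunks = list(zip(np.array_split(X_tr, 10), np.array_split(y_tr, 10)))

sv_X = np.empty((0, X.shape[1]))          # carried-over support vectors (SV0 is empty)
sv_y = np.empty((0,), dtype=y.dtype)

for X_i, y_i in chunks:
    X_step = np.vstack([sv_X, X_i])       # SV(i-1) + TRi
    y_step = np.concatenate([sv_y, y_i])
    clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X_step, y_step)
    sv_X = clf.support_vectors_           # SVi: only the support vectors survive
    sv_y = y_step[clf.support_]

print("incremental model accuracy on TE:", clf.score(X_te, y_te))
```

At every step only the support vectors from the previous step are retained, so the memory footprint stays close to the size of one chunk plus the current support vector set.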

Type 2:
1. An SVM is trained using TR1.
2. This trained model is used to classify the examples in TE, and the classification rate is recorded.
3. TR2 is then added to TR1 and the combined set is used for training.
4. This model is used to classify TE, and the results are recorded.
5. TR3 is then added to TR1 + TR2, and so on, until the entire original training set has been used for training.
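A corresponding sketch of the Type 2 procedure, under the same assumptions as the Type 1 sketch above (again my own illustration, not the paper's code): at step i the SVM is retrained on TR1 + … + TRi and scored on TE.

```python
# Sketch of Type 2 (cumulative batch) learning: retrain on all data seen so far.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
chunks = list(zip(np.array_split(X_tr, 10), np.array_split(y_tr, 10)))

X_seen = np.empty((0, X.shape[1]))        # TR1 + ... + TRi, all data kept
y_seen = np.empty((0,), dtype=y.dtype)
accuracies = []

for X_i, y_i in chunks:
    X_seen = np.vstack([X_seen, X_i])
    y_seen = np.concatenate([y_seen, y_i])
    clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X_seen, y_seen)
    accuracies.append(clf.score(X_te, y_te))  # classification rate on TE at each step

print("accuracy after each step:", np.round(accuracies, 3))
```

Comparing the accuracy curve from this sketch with the Type 1 curve is exactly the multiple-batch versus single-batch comparison described above.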

The two procedures above are repeated 10 times. Each time, a mutually exclusive 10% of the examples is used as the test set TE and the rest as the training set TR (along the lines of 10-fold cross-validation). The problem being handled in all these experiments is a two-class classification problem.

The datasets used for evaluation are well known for benchmarking purposes in the machine learning community.

(Table from the paper: data sets used and their accuracy.)

The Type 1 and Type 2 methods are run on these datasets and the results plotted: the X-axis shows the incremental step, and the Y-axis shows the prediction accuracy of the model obtained at the end of the corresponding incremental step. The kernel and parameters used are also reported.

The accuracy of the incremental method closely follows that of a single-batch model. Table 2 shows that after 10 iterations the SVM algorithm is able to retain the relevant parts of the data, and there is negligible or no drop in the accuracy of the model.
The novelty of this approach is that even after successive steps of discarding redundant examples, the collected set of support vectors still remains a representative set.

The main aim of this paper is to reduce the memory needed to train the model. The next step is to analyze whether all the support vectors in the set are really necessary and to minimize the set as much as possible. The following experiment was conducted to check this:
1. The SVM algorithm was run on the whole dataset D, and the support vector set SV was obtained.
2. An SVM was trained on the obtained support vector set SV using 10-fold cross-validation, i.e., both the training set TR and the test set TE were drawn from SV. The average prediction accuracy of the trained machine on the test set was recorded.
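A minimal sketch of this experiment, under the same assumptions as the earlier snippets (the dataset, kernel, and parameters are mine, not the paper's):

```python
# Sketch: train on the whole dataset D, keep the support vector set SV,
# then run 10-fold cross-validation using only SV for both TR and TE.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

full = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)          # step 1: SVM on all of D
X_sv, y_sv = full.support_vectors_, y[full.support_]           # support vector set SV

scores = cross_val_score(SVC(kernel="rbf", gamma=0.1, C=1.0),  # step 2: 10-fold CV on SV
                         X_sv, y_sv, cv=10)
print("mean accuracy with TR and TE drawn from SV:", scores.mean())
```

Note that each fold here trains on only 90% of the support vectors, which corresponds to the 10% removal discussed below.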

When this experiment was carried out, it was observed that removing even 10% of the support vectors reduced the accuracy of the model significantly.

This means that the set is already minimal: removing further support vectors leads to a loss of vital information about the class distribution. This shows that support vector machines are very good at summarizing the original data space in a compact form, and that the resulting support vector set is a minimal set.
