Improving class representation in training dataset — from greedy algorithm to linear programming

Tamar David
Published in Tech at Trax · Jan 31, 2023

The problem

Occasionally, when we create a dataset for training, we don't have enough samples for all the classes. Sometimes we have many classes with a small number of samples (a long tail), so the model can't classify those classes well. Even if we train the model several times or try different models, we still get the same poor results, because the real problem is that we don't have enough examples from those classes. How can we collect more data for those specific classes?

Let's take this example: we want to classify the colors of cars in a parking lot. We have images of the parking lot, and our model has already classified them. Still, it doesn't recognize some of the colors well (for example, the results for pink and red are poor). So how can we improve the model's performance on those colors, and as a result improve the model overall?

The first choice would be to collect more data on those classes, but in this case we want to be very precise about the data we collect. That is, we want to collect exactly the data from the classes we are missing, and avoid collecting data from the classes we already have enough of. So how can we collect this kind of data?

First solution — the Greedy algorithm

Let's assume we have a set of images in which we can manually revalidate part of the data, and that the current model gives us suspected detections of where each car color can be found.

One option to improve the number of samples from each class is to use a greedy algorithm with these steps (a short code sketch follows the list):

  1. For each color, calculate how many samples are missing to reach the lower limit we assume is needed for good recognition of that color
  2. For each image, calculate how many samples of each class we suspect it contains
  3. Sort the images from the maximum number of tags per image to the lowest
  4. Take the first image and recalculate how many samples we still need from each color (after subtracting the number of tags of each color in the selected image)
  5. Return to steps 2–4 until we have enough tags for each class or there are no more images
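Below is a minimal Python sketch of this greedy loop. The function name and data structures are illustrative assumptions, not our production code: images are a mapping from image id to suspected color counts, `needed` holds how many samples each color is still missing, and we score each image by the tags that still count toward a missing color.

```python
from collections import Counter

def greedy_image_selection(images, needed):
    """Greedily pick images until every color reaches the number of
    additional samples it needs, or no useful images are left.

    images: dict mapping image_id -> Counter of suspected color counts
    needed: dict mapping color -> number of missing samples
    """
    remaining = dict(needed)      # how many samples each color still needs
    candidates = dict(images)     # images not yet selected
    selected = []

    def useful_tags(counts):
        # Tags in this image that still count toward a missing color
        return sum(min(counts.get(c, 0), remaining[c]) for c in remaining)

    while any(v > 0 for v in remaining.values()) and candidates:
        # Steps 2-3: rank the remaining images by how many useful tags they add
        best_id = max(candidates, key=lambda i: useful_tags(candidates[i]))
        best_counts = candidates.pop(best_id)
        if useful_tags(best_counts) == 0:
            break                 # no remaining image helps any missing color

        # Step 4: take the best image and update what is still missing
        selected.append(best_id)
        for color in remaining:
            remaining[color] = max(0, remaining[color] - best_counts.get(color, 0))

    return selected


# Tiny illustrative usage (made-up numbers)
images = {
    "img_1": Counter({"pink": 2, "red": 1, "white": 5}),
    "img_2": Counter({"red": 3}),
    "img_3": Counter({"white": 4}),
}
needed = {"pink": 2, "red": 3}
print(greedy_image_selection(images, needed))   # ['img_1', 'img_2']
```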

The problems with this algorithm in a production environment are:

  1. There is a limit on the number of images we can collect (~5K), because it becomes difficult to sort the images beyond a certain amount
  2. Each iteration takes a lot of time, so running the algorithm on a reasonable number of classes takes more than 12 hours, which is far too long

For these reasons, we need to think of another solution…

Second solution — the linear programming algorithm

To solve the problem in another way, we defined it as a linear programming problem, with the following steps:

  1. Calculate which colors need more samples and how many appearances of each are missing
  2. Find the images in which we suspect those colors appear (based on the current model)
  3. Build a matrix of suspected appearances: one row per color, one column per candidate image, where each cell holds how many cars of that color we suspect appear in that image
  4. For each color, calculate the minimum between the number of missing appearances and the number of appearances actually available in those images (a small numeric sketch of this step follows below)

  * For example, in the matrix above: for blue we need 10 more appearances for training, and the candidate images can provide an additional 20, so the minimum is 10. For orange we need 35 more appearances, but the images can only provide 33, so the minimum is 33.

  5. Define and solve the linear problem

The output: the list of colors that still need samples, and the chosen images to validate again.
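As a small numeric sketch of step 4, here is how the required amount per color could be computed (assuming NumPy; the per-image counts are made up for illustration, and only the totals of 20 blue and 33 orange come from the example above):

```python
import numpy as np

# Suspected appearances per candidate image (illustrative counts):
# rows = colors, columns = images, matching the C_ji notation defined below.
C = np.array([
    [3, 7, 10, 0],    # blue: 20 appearances available in total
    [9, 8,  6, 10],   # orange: 33 appearances available in total
])
missing = np.array([10, 35])        # appearances still needed per color

available = C.sum(axis=1)           # [20, 33]
required = np.minimum(missing, available)
print(required)                     # [10 33], as in the example above
```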

Linear problem definition

Linear Programming (LP) is a type of mathematical optimization in which we choose values for a set of variables so that an objective function reaches its maximum or minimum value, while all the requirements are expressed as linear relationships. (A brief introduction to Linear Programming, Optimization, and Linear Programming)

Let us express this problem in a mathematical way.

We will define a binary decision variable for the chosen images:

M_i \in \{0, 1\}, \quad i = 1, 2, \dots, n

If M_i = 1 we select image i, and if M_i = 0 we don't.

We also define:

C_{ji} - the number of appearances of a car with color j (j = 1, 2, \dots, m) in image i

R_j - the number of additional appearances of color j we still need (the minimum computed in step 4)

So we can define this minimization target function (select as few images as possible):

\min \sum_{i=1}^{n} M_i

And these are the constraints (every color must reach its required number of appearances, and every image is either selected or not):

\sum_{i=1}^{n} C_{ji} M_i \ge R_j, \quad j = 1, 2, \dots, m

M_i \in \{0, 1\}, \quad i = 1, 2, \dots, n

In the above example, the constraints for the two colors are:

\sum_{i=1}^{n} C_{\mathrm{blue},i} M_i \ge 10, \qquad \sum_{i=1}^{n} C_{\mathrm{orange},i} M_i \ge 33
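As a sketch of step 5, the problem can be written and solved with an off-the-shelf modeling library, for example PuLP (not necessarily the solver we use in production). Strictly speaking, because M_i is binary this is a 0/1 integer program rather than a pure LP, and PuLP hands it to its bundled MILP solver. The data below is the illustrative blue/orange example from above.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

colors = ["blue", "orange"]
n_images = 4
# C[j][i]: suspected appearances of color j in image i (illustrative counts)
C = [
    [3, 7, 10, 0],     # blue
    [9, 8, 6, 10],     # orange
]
required = [10, 33]    # R_j: min(missing, available), as computed in step 4

prob = LpProblem("image_selection", LpMinimize)
M = [LpVariable(f"M_{i}", cat=LpBinary) for i in range(n_images)]

# Target function: select as few images as possible
prob += lpSum(M)

# Constraints: the selected images must cover the required appearances of each color
for j, color in enumerate(colors):
    prob += lpSum(C[j][i] * M[i] for i in range(n_images)) >= required[j], f"cover_{color}"

prob.solve()
chosen = [i for i in range(n_images) if M[i].value() == 1]
print(chosen)   # in this tiny example all four images are needed to cover orange
```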

When the problem is solved, we get a list of images together with the colors we need to recheck and validate in each of them. After that, we can retrain the engine and examine whether the change in the training dataset improves its ability to recognize those cars.

Results

In the trials we ran, we saw a significant improvement in the engine's accuracy: its ability to recognize the weak colors improved, and the total accuracy improved by 5%.

Future Work

In order to improve this algorithm, we can consider the following options:

Instead of producing two separate lists (an images list and a colors list), we can produce a combined list of image-and-color pairs ((image 1, colors), (image 2, colors), …, (image n, colors)).

In this way, we get more accurate color appearances per image and avoid validating more colors than we want. With the separate lists, we end up with more tags than we need: for example, an image that was selected because of its red cars may also contain blue cars, and blue is also on the colors list, so those blue cars get validated as well even though we did not select the image for them. To solve this, we need to define a variable for each color j in each image i, as sketched below.
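Here is a possible sketch of that extended formulation, again with PuLP and the same illustrative data. It assumes one binary variable per (image, color) pair instead of one per image, and one natural objective is to minimize the total number of validation requests; this is a speculative variant, not something we have deployed.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Same illustrative data as the earlier sketch
colors = ["blue", "orange"]
n_images = 4
C = [
    [3, 7, 10, 0],     # blue appearances per image
    [9, 8, 6, 10],     # orange appearances per image
]
required = [10, 33]

prob = LpProblem("image_color_selection", LpMinimize)

# One binary variable per (image, color) pair: validate color j in image i, or not
x = [[LpVariable(f"x_{i}_{j}", cat=LpBinary) for j in range(len(colors))]
     for i in range(n_images)]

# Target: minimize the total number of (image, color) validations we request
prob += lpSum(x[i][j] for i in range(n_images) for j in range(len(colors)))

# Each color must still reach its required number of appearances,
# counted only over the pairs we actually ask to validate
for j in range(len(colors)):
    prob += lpSum(C[j][i] * x[i][j] for i in range(n_images)) >= required[j]

prob.solve()
requests = [(i, colors[j])
            for i in range(n_images) for j in range(len(colors))
            if x[i][j].value() == 1]
print(requests)
```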

Summary

In this post, I showed how we initially solved the problem with a naive greedy algorithm. Since it did not perform well enough, we replaced it with a linear programming algorithm. The linear programming solution runs in a reasonable time, has a big positive influence on the training dataset, and improves the engine's results.
