How to define an image recognition task?

Tips for developers building an image recognition solution with a small dataset

Machine learning is still young, but it is becoming highly available to developers. Apple Vision SDK, Google TensorFlow or the Vize image classifier make it easy to train custom recognition models. While running a vision platform I get many questions, and we deal with the limitations of the technology. This post is for people who are building an image classifier, to help them define the task and get the most out of today's technology.


I had a conversation with a client who said that Google Vision does not work and returns irrelevant tags. He had hired a few students to do the tedious tagging by hand. The problem was not the API but the approach to it. After we showed him our custom approach and shared some tips, we were able to start testing image classification in 10 minutes.

What works:

  • Binary classification (striped/not striped) using 20–50 img/class
  • Up to 20 classes for “hard to recognize classes” using ~50 img/class
  • Up to 100 classes for well-defined classes using 50–100 img/class
  • Pattern recognition (crystal structures, x-ray images) using 20–50 img/class
  • Abstract classes up to 20 categories using ~50 img/class

What does not work:

  • Many classes with a small dataset (20+ classes need at least 50 images per class)
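
The rule of thumb above is easy to check before training. The following is a minimal sketch, not part of any particular platform: it assumes you have one label per training image in a plain list, and the function name check_dataset and the example labels are made up for illustration. The threshold of 50 comes from the list above.

```python
from collections import Counter

def check_dataset(labels, min_per_class=50):
    """Return the classes that have fewer than min_per_class images."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_per_class}

# Hypothetical labels: one entry per training image.
labels = ["sneaker"] * 60 + ["boot"] * 55 + ["sandal"] * 12

too_small = check_dataset(labels)
print(too_small)  # classes that need more images before training
```

Any class reported here is a candidate for collecting more images or for merging with a neighbouring class.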

Based on the above:

Every client wants reliability, which in practice means accuracy. Keep the task simple if you aim for high accuracy. The technology is still dumb. Building an image classifier with a limited number of training images currently requires an iterative approach. I recommend the following:

  • Break the task into simple tasks (yes or no)
  • Make categories smaller and connect them in some logical manner
  • Use general models for general categories
  • Always collect images to extend dataset
  • Merge very close classes together
  • Use UI/human feedback to get better data
  • Maintain quality of your dataset
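
One way to decide which "very close" classes to merge is to look at which pairs a first model confuses most often on held-out images. The sketch below assumes you already have true and predicted labels for a validation set. The helper names most_confused_pairs and merge_classes, the shoe labels, and the merged name "slip-on" are all hypothetical.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top=3):
    """Count how often each unordered pair of classes is mixed up."""
    confusions = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth != pred:
            confusions[tuple(sorted((truth, pred)))] += 1
    return confusions.most_common(top)

def merge_classes(labels, mapping):
    """Relabel a dataset; labels missing from mapping stay unchanged."""
    return [mapping.get(label, label) for label in labels]

# Hypothetical validation results from a first model.
y_true = ["loafer", "loafer", "moccasin", "moccasin", "boot", "boot"]
y_pred = ["moccasin", "loafer", "loafer", "loafer", "boot", "boot"]

pairs = most_confused_pairs(y_true, y_pred)
print(pairs)

# Merge the pair that is confused most often into one broader class.
merged = merge_classes(y_true, {"loafer": "slip-on", "moccasin": "slip-on"})
print(merged)
```

After merging, retrain on the relabeled data. A separate recognizer can split the merged class again once more images are available.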

Real examples:

Start with fewer categories

If you are building an app that lets people recognize shoes, I recommend starting with ~50 shoe types. This is an easy task to train with 50 images of each shoe. Let users add and upload new shoes in the user interface, and let them give you feedback on your classifications. This way you can collect an amazing dataset of real images within a month and then update your app.

Use models with fewer categories

If you are building a classifier for plane types with a small training dataset, separate your images into “in the air” and “on the ground” images. Build two different models, one for air and one for ground, and get better overall results for both. You can even merge similar planes into one class and train another recognizer to tell them apart. Once you have more images, you can merge these categories together.
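
The air/ground split can be wired together as a small router: a first model decides the scene, then a specialist model picks the plane type. This is only a sketch under assumptions: each model is represented as a plain function from an image path to a label, and the three fake_* functions are stand-ins for real trained models, not real recognizers.

```python
from typing import Callable, Dict

# A classifier takes an image path and returns a label.
Classifier = Callable[[str], str]

def make_router(scene_model: Classifier,
                specialists: Dict[str, Classifier]) -> Classifier:
    """Chain a scene model with one specialist model per scene."""
    def classify(image_path: str) -> str:
        scene = scene_model(image_path)        # "air" or "ground"
        return specialists[scene](image_path)  # plane type
    return classify

# Stand-ins for trained models (hypothetical, for illustration only).
def fake_scene_model(path: str) -> str:
    return "air" if "sky" in path else "ground"

def fake_air_model(path: str) -> str:
    return "plane type A"

def fake_ground_model(path: str) -> str:
    return "plane type B"

classify_plane = make_router(
    fake_scene_model,
    {"air": fake_air_model, "ground": fake_ground_model},
)

print(classify_plane("sky_001.jpg"))
print(classify_plane("runway_001.jpg"))
```

The same pattern works for merged classes: a specialist can return a broad class first and hand the image to another recognizer that tells the similar planes apart.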

Use binary classifiers for important classes

If you are creating captions for images in an e-shop database, build a custom model for each tag. One model will classify “rounded” vs. “not rounded”, and so on. This way you get a very reliable specialized classifier for every tag.
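
A minimal sketch of the one-model-per-tag idea follows. It assumes each binary model is a function that returns the probability that its tag applies to an image. The function tag_image, the 0.5 threshold and the lambda stand-ins are illustrative assumptions, not a specific API.

```python
def tag_image(image_path, tag_models, threshold=0.5):
    """Run every binary model and keep the tags that pass the threshold."""
    return [
        tag
        for tag, model in tag_models.items()
        if model(image_path) >= threshold
    ]

# Stand-ins for trained binary classifiers (hypothetical scores).
tag_models = {
    "rounded": lambda path: 0.92,  # "rounded" vs. "not rounded"
    "striped": lambda path: 0.10,  # "striped" vs. "not striped"
    "leather": lambda path: 0.75,
}

tags = tag_image("product_123.jpg", tag_models)
print(tags)
```

Because every tag has its own model, a new tag can be added or an existing one retrained without touching the others.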

Don’t mix the input images

Machine learning performs best when the training images and the images evaluated in production come from the same distribution. This means the training images should look like the ones we are going to evaluate. We can work around this with internet images at the beginning, but we should start gathering users' images as soon as possible. These will make our model robust in the future.


Building an image classifier is hard not only because it needs a good deep learning model, but also because it needs a good task definition and a good dataset. If the size of the dataset is a challenge, start simple and iterate towards your goal. If you have any questions, feel free to message me or comment below.