Some time ago, I was standing at the fruit basket with my colleague Casper when we noticed a rather odd-looking fruit. We asked each other the question: “What the * is that?!”
Neither of us could answer it, leaving us hopeless and desperate. You need to know what you eat, right? At least we thought so. We figured the only logical approach would be to build an app. An app that allows you to point the camera of your phone at a fruit, and tells you what kind of fruit it is!
WTFruit was born.
The goal of this article
I discussed this topic during our previous meetup. The problem we’re trying to solve is an image recognition problem. It is an incredibly easy problem for us humans but a much harder one for computers, and until (somewhat) recently it was harder still. In this blog post I want to touch on some learnings we had when building WTFruit. I won’t go into detail on the ‘how’, but will refer to great articles that give more in-depth information on that matter. I will include a few lines of code, but if you’re not a programmer, don’t worry if you don’t understand them.
Computers have advanced so much that image recognition is now a solvable problem using Machine Learning. Not only that, it’s also a very hot topic! Searching Google for ‘image recognition’ turns up tons of interesting reading material. So, if you want to dive deeper into Neural Networks, or into how to approach Image Recognition (with ML), check out part 2 and part 3 of the Machine Learning is Fun articles. Otherwise no worries, keep on reading!
The first thing we need is a dataset of fruit images. We found one on GitHub. We download the repo to our Jupyter environment and do some basic preprocessing (resizing the images to fit our Neural Network and normalising the pixel values).
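To give an idea of what that preprocessing involves, here is a minimal sketch using NumPy. The function names, the 100 x 100 target size, the nearest-neighbour resize and the scaling to the range 0 to 1 are all assumptions for illustration; an image library would normally handle the resizing, and other normalisation schemes are just as valid.

```python
import numpy as np


def resize_nearest(image: np.ndarray, size: int) -> np.ndarray:
    """Resize an H x W x C image to size x size by nearest-neighbour sampling."""
    height, width = image.shape[:2]
    # For every output row/column, pick the source row/column it maps to.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows[:, None], cols]


def preprocess(image: np.ndarray, size: int = 100) -> np.ndarray:
    """Resize, scale pixel values to [0, 1] and put the channels first."""
    resized = resize_nearest(image, size)
    scaled = resized.astype(np.float32) / 255.0
    # Most deep learning libraries, PyTorch included, expect C x H x W.
    return scaled.transpose(2, 0, 1)


# Hypothetical input: random pixels standing in for a 240 x 320 fruit photo.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
network_input = preprocess(photo)
```

Every image that goes into the network, during training and later in the app, has to pass through the same steps.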
Now that we have our data we choose a model. We start out with a very basic one.
What we have is a deep neural network with 2 convolutional layers and 3 fully connected layers. The number of outputs of the last layer equals the number of classes. In our case we have 81 classes, which is the number of different types of fruit in the dataset.
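A network of that shape could be sketched in PyTorch roughly as follows. Only the layer counts and the 81 outputs come from the description above. The class name `FruitNet`, the channel counts, kernel sizes, pooling layers, hidden layer widths and the 100 x 100 input size are illustrative assumptions, not the exact network from the project.

```python
import torch
from torch import nn


class FruitNet(nn.Module):
    """2 convolutional layers followed by 3 fully connected layers."""

    def __init__(self, num_classes: int = 81):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5),   # 100 x 100 -> 96 x 96
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 96 x 96 -> 48 x 48
            nn.Conv2d(16, 32, kernel_size=5),  # 48 x 48 -> 44 x 44
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 44 x 44 -> 22 x 22
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 22 * 22, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),       # one score per fruit class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


# Hypothetical batch of 4 random "images" to check the output shape.
model = FruitNet()
batch = torch.randn(4, 3, 100, 100)
logits = model(batch)
```

Each row of `logits` holds 81 scores, and the class with the highest score is the network's guess for that image.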
We start training and get nice results.
The first challenge
However, we run into a problem. Since our dataset only has images with white backgrounds, the network “assumes” that the background is part of the object (fruit) it is trying to learn to recognise. We have a feeling this will be an issue, and verify it by replacing the background of one of the fruits in the test set with black.
As you can see, the result is terrible: the image clearly shows a banana, but the network thinks it’s a cherry! The ‘normal’ solution to this problem would be to find images of fruit in context. Think of images where people are holding the fruit, or where it is lying on a surface. We could create this dataset ourselves by taking a lot of pictures of fruit, or we could try to find them online. However, we want to try a different method. What will happen if we just manipulate the white background?
Our little experiment
Our reasoning was as follows: if we can change the background color so there is no correlation between the backgrounds in different images, the network may be able to “regard” this data as irrelevant. We create a simple script that replaces the white background with random noise.
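A script along those lines could look like the sketch below. It assumes the image is an H x W x 3 array of 8-bit RGB values and treats every pixel whose three channels are all at least 240 as background. The function name and the threshold are hypothetical, and white parts of the fruit itself would be replaced as well.

```python
import numpy as np


def noise_background(image, threshold=240, rng=None):
    """Replace near-white pixels of an RGB image with random noise."""
    if rng is None:
        rng = np.random.default_rng()
    # True wherever all three channels are close to white.
    mask = (image >= threshold).all(axis=-1)
    # A different random color for every single pixel.
    noise = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    result = image.copy()
    result[mask] = noise[mask]
    return result


# Hypothetical 4 x 4 image: white background with a 2 x 2 red "fruit".
image = np.full((4, 4, 3), 255, dtype=np.uint8)
image[1:3, 1:3] = (200, 30, 30)
noisy = noise_background(image, rng=np.random.default_rng(0))
```

The fruit pixels stay untouched, while every background pixel gets its own random value.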
After we create our modified training set, we feed the images to the network and wait for it to finish training. Once it’s finished we test our network. The test set still contains images with a white background.
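Accuracy here simply means the share of test images for which the highest-scoring class matches the label. A minimal sketch of that calculation in PyTorch, using made-up scores for 4 images and 3 classes (the function name is hypothetical as well):

```python
import torch


def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of rows whose highest score is at the labelled class."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()


# Made-up network outputs: one row per image, one column per class.
scores = torch.tensor([
    [2.0, 0.1, 0.3],   # predicted class 0
    [0.2, 1.5, 0.1],   # predicted class 1
    [0.9, 0.8, 0.1],   # predicted class 0, but the label is 1
    [0.1, 0.2, 3.0],   # predicted class 2
])
labels = torch.tensor([0, 1, 1, 2])
result = accuracy(scores, labels)  # 3 out of 4 correct
```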
The results? Terrible! We get an accuracy of 24%…
Now, it’s difficult to say exactly why the results are this bad. However, we can make quite a good guess by looking at the images. Even though each background is random, the backgrounds still look very similar to one another! So, presumably, they aren’t ‘different’ enough.
It’s time for approach 2: making it more different!
We decide to just replace the backgrounds with a single random color.
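As a sketch of this second approach, the function below fills the whole background of an image with one randomly chosen color. It rests on the same assumptions as before: an H x W x 3 array of 8-bit RGB values, and a hypothetical threshold of 240 for deciding which pixels count as white background.

```python
import numpy as np


def color_background(image, threshold=240, rng=None):
    """Replace near-white pixels of an RGB image with one random color."""
    if rng is None:
        rng = np.random.default_rng()
    # True wherever all three channels are close to white.
    mask = (image >= threshold).all(axis=-1)
    # One color for the whole image, drawn anew on every call.
    color = rng.integers(0, 256, size=3, dtype=np.uint8)
    result = image.copy()
    result[mask] = color
    return result


# Hypothetical 4 x 4 image: white background with a 2 x 2 red "fruit".
sample = np.full((4, 4, 3), 255, dtype=np.uint8)
sample[1:3, 1:3] = (200, 30, 30)
background = (sample >= 240).all(axis=-1)
filled = color_background(sample, rng=np.random.default_rng(0))
```

Calling the function once per training image gives each image its own uniform background color, so the backgrounds differ far more between images than noise does.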
We now get an accuracy of 42%, which is already better. If we had more (similar) training data, we could likely get a higher accuracy. This shows the importance of having a lot of data!
An even more important learning we can take from this case is the importance of having data that represents the problem you’re trying to solve. If we just wanted to recognise (nicely cropped) fruits on a white background, this dataset would suffice. However, in our case we want to point the camera of our phone at fruit on any surface and still be able to recognise it. Your data should reflect your scenario.
My goal was to give you an idea of how we started the machine learning process for the given problem. If you’re new to this matter, I hope you got a grasp of what it takes to process data and of the experiment we did by giving the fruit random background colors. Please note that this is just the first step towards creating an actual app, which I left untouched in this blog post.
At the moment we’re training on more and more data on our servers to improve the recognition process. As this is a project we’re doing outside work hours, creating the app will take a while. When we’ve actually built it, I’ll write another blog post going into more detail on that process!
Summarising our experience in one sentence: you’ll always need more and better data! We may even go as far as saying: data is more important than the network!
For the developers amongst us
We wrote our ML script in Python and used the PyTorch library. There are a lot of nice Python libraries out there. We didn’t choose it for any specific reason, except that we had used it before. On our local server we installed JupyterLab (created by Project Jupyter), which gives you a nice environment for interactive programming. I highly suggest checking it out.