The demo of a Machine Learning algorithm always starts with a statement close to this: “Let X be our set of samples, and y the values we want to infer…”. Sorry folks but the reality is pretty different! Getting to that point is a whole journey in itself. In the next sections, we will share with you what the Criteo team learned while getting 5 million images annotated.
At Criteo, our job is to present the right ad to the right person at the right time. We also want to do that in an ethical manner. In the context of this article, this means we have defined a set of rules on what we don’t want to show to our customers. This not only contains nudity and other explicit material, but also guns, drugs, alcohol, gambling, etc. In the age of Deep Learning everywhere, having such a classifier should be a piece of cake, right? Well… not so much! If you want to get decent accuracy, you will have to perform at least transfer learning or better still, retrain a full CNN model end-to-end (in our case, an Inception-v3 model). This (as well as just evaluating the metric) requires annotated images… in fact, a LOT of them. Let’s see together how we got them.
Lesson 0 — Finding the n̶e̶e̶d̶l̶e̶s̶ bad hay in the haystack (on the importance of selecting pictures sufficiently similar to your target dataset)
For this job, we needed to collect images allowing us to catch non-recommendable products. Fortunately for us, such images make up a very small portion of our advertisers’ product catalogs, leading to relatively rare incidents. The drawback is that this also makes it hard to find enough of them to train a model. We tried a lot of different approaches. Long story short, there is a pitfall you should avoid: taking pictures from a distribution which is not the same as your target. In our case, we looked at available NSFW datasets. However, there is a difference between a pornographic picture and, let’s say, the cover of an X-rated DVD. The same goes for images coming from classified ads (which could contain a lot of amateur pictures) versus the ones from professional sellers. Matching your target distribution, or at least getting close to it, is essential to get good performance later on.
Lesson 1 — No policy, no glory (on the importance of having a written set of rules)
To properly annotate images, we need a definition of what needs to be blacklisted. This seems relatively straightforward but, be warned, in our case it was not. Everyone has their own understanding/interpretation of what needs to be blacklisted. For instance, you could get two diametrically opposed opinions on whether or not a picture is ‘too suggestive’. So, this should be written down and agreed upon with a representative set of participants. This is especially important if, like us, the people doing the heavy lifting of annotating the majority of images are not the authors of the policy and are not co-located with them. Don’t skip this step or take it too lightly… this one bites back, hard. Indeed, an important change to the policy means reviewing your whole dataset to ensure alignment: a pretty heavy cost for a misunderstanding or a poorly phrased guideline. For instance, we had of course forbidden any form of nudity. The related section in our policy listed different cases, in particular the case of ‘children’. Due to the formatting, the point was interpreted as ‘any picture depicting children should be forbidden’. This led to a batch in which a lot of children’s clothes were blacklisted. This was quickly spotted, but subtler cases can lead to long-term consequences.
Lesson 2 — Diversity is strength (on the importance of having different opinions)
As stated in the lesson above, there is room for interpretation in the policy. When the moment comes to decide on which side a given case falls, it is important to have an annotation committee which of course knows all the details of the policy, but is also as diverse as possible. For instance, a good balance of male and female members is crucial, as well as diversity in culture, skills/knowledge (technical/non-technical). What you want at the end is a balanced decision that will be relied on in future cases. The stability of the committee is also an important factor. You don’t want to re-challenge old cases each time you have a case to settle.
Lesson 3 — You don’t know until you try (on the importance of test-driving your policy)
Policy writers must experience firsthand what applying the policy means. In particular, they must try to link their decisions to the policy they have written. The same goes for the development of the tools used for annotating (more on this below). By putting themselves in the shoes of the people doing the annotation, they catch issues early and avoid a lot of back and forth later on in the process. An example: while reviewing images in committee, hiding images identified as distasteful appeared necessary, while it was clearly not the first feature a product manager would have thought of implementing had they not tried the tool for themselves first.
Lesson 4 — No policy’s perfect (on the mutability of the policy)
Don’t expect your policy to be static and written once and for all. We observed that whatever is written in the policy, not all the cases will be covered. Clearly defining the boundary is challenging. Many (all?) rules have a grey area. The images in the grey area were the main challenge for both the annotators and the model. What worked for us: have a system allowing the annotators to report cases they are unsure about, and have a procedure in place to have those points settled by the annotation committee. Once the decision is taken, the policy is updated to take the clarification into account if needed. Looking closely at the reported issues, we observed two cases: (i) debatable status: when the rule’s objective was clear but the decision subject to discussion (e.g. are those clothes considered “too revealing”?) and (ii) rule to clarify: when the rule didn’t apply properly to the case at hand. As an example, we forbade ads for any form of ‘weapons’ until our annotators reported cases of toy guns (especially water guns). Following the report, we white-listed toy guns if they couldn’t be confused with real weapons.
Along the same lines, it is only natural to see these policies change as we serve ads across diverse cultures in different geographic regions.
To cope with such evolutions of the policy, a finer classification (e.g. including the reason for blacklisting) can help reduce the scope of images to review following a modification.
Another interesting observation: as the rules for commonly found products get clearer, new cases for less frequent products start to pop up in the policy review discussions, indicating an increased alignment with regards to the guidelines. Furthermore, to a certain extent, building the policy collectively guaranteed involvement of the whole team and inclusion of most opinions.
Lesson 5 — One size fits all (on the importance of having a unique reference to avoid bias)
When working with a remote annotation team, it can be tempting for the team to build their own understanding and interpretation of the policy. This must be avoided as much as possible. The policy must remain the unique reference and, in case of interpretation issues, it must be updated/clarified as described above. If this is not done, you incur the risk of having a big part of your dataset annotated with a slowly but surely diverging policy, introducing a bias which is going to be costly to fix.
The downside of this approach is that the policy must be both comprehensive and understandable to almost all audiences, which is quite difficult to achieve.
Lesson 6 — A picture is worth a thousand words (on the importance of having a visual representation of what is OK and what is NOT)
Some wording of the policy will remain difficult to interpret no matter what you do. What worked for us was collecting borderline pictures (e.g. the ‘unsure’ cases discussed above) and using them to document where the boundary lies in our classification problem. Those pictures were used to complement the policy. For instance, we tried as much as possible to give three examples each of safe, unsafe and debatable images per policy section.
Lesson 7 — We are not born annotators, we become one (on the importance of investing in annotators to reduce variance)
We can’t expect someone to read the policy and apply it flawlessly to batches of images. You have to train the annotators. Not a lot of options here: have them re-annotate a set of images already reviewed by the annotation committee (we called it the reference dataset) and debrief the differences with them. It is especially important that the exact same process be applied to all annotators joining the group, to avoid progressive divergence of a group of annotators (especially when remote). During the process, we collected metrics to measure the variance among the different annotators in flagging images. In particular, we produced confusion-like matrices and tracked the true and false positive rates of the new annotators over time (taking as ground truth the annotations previously made by the committee). This allowed us to observe the progressive alignment of the different people to a common understanding, and to stop the ‘training’ when the objective was achieved. It is important to stress that during that phase, the feedback had only one objective: helping, and certainly not judging. You should see it as a (wise) investment in the people who will do the heavy lifting.
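As a rough sketch of what tracking those rates can look like (the function name and the label encoding, 1 for ‘blacklisted’, are ours for illustration, not our actual tooling):

```python
from collections import Counter

def annotator_rates(reference, predicted):
    """Compare an annotator's labels to the committee's reference labels
    and return (true positive rate, false positive rate), where
    'positive' means 'blacklisted' (encoded as 1)."""
    counts = Counter(zip(reference, predicted))  # (truth, guess) pairs
    tp, fn = counts[(1, 1)], counts[(1, 0)]
    fp, tn = counts[(0, 1)], counts[(0, 0)]
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Track a new annotator over successive reference batches: the TPR
# should rise towards the committee's, while the FPR should fall.
reference = [1, 1, 0, 0, 1, 0]
week_one  = [1, 0, 0, 1, 1, 0]
print(annotator_rates(reference, week_one))  # TPR 2/3, FPR 1/3
```

Plotting these two numbers per annotator over time is enough to decide when the ‘training’ phase can stop.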
Lesson 8 — 𝘘𝘶𝘪𝘴 𝘤𝘶𝘴𝘵𝘰𝘥𝘪𝘦𝘵 𝘪𝘱𝘴𝘰𝘴 𝘤𝘶𝘴𝘵𝘰𝘥𝘦𝘴 i.e. Who watches the watchmen? (on the importance of cross-checking results)
Probably the hottest topic: once the training is complete, how do you ensure the quality of the annotations over time? It would be tempting to have only a small part of each annotator’s batches reviewed by another annotator. We found that approach too light: we feared a progressive divergence of the group itself, coming from communication between the remote annotators developing their own local interpretation of the policy. We also had the advantage of having two groups of annotators: one internal to Criteo but with reduced bandwidth, and one remote with 100% dedicated people. The internal people had longer experience with the policy and were used as the reference for annotating new, ‘control’, images. We then injected those control images among the new pictures to spot diverging annotations. By identifying the origin of each annotation, we could spot who was diverging and provide feedback. Of course, we also included cross-checking among the annotators. In the end, no matter what you put in place as control and safety net, no annotation will be 100% accurate. This is not too much of a problem, as your neural network can cope with that ‘labeling noise’. For instance, we found missed positives by comparing our model predictions to the labels.
Hence, we put in place an annotation feedback loop, allowing annotators to flag ‘returned’ products when they did not agree with the classification done by the internal team. This proved very useful in catching a few misses.
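A minimal sketch of this control-image mechanism, with hypothetical helper names and integer labels (this is an illustration of the idea, not our production code):

```python
import random

def build_batch(new_images, control_images, control_ratio=0.1, seed=0):
    """Hide a few committee-labeled control images inside a batch of
    new images, so agreement can be measured after annotation."""
    rng = random.Random(seed)
    n_controls = max(1, int(len(new_images) * control_ratio))
    batch = list(new_images) + rng.sample(list(control_images), n_controls)
    rng.shuffle(batch)  # controls must be indistinguishable from new images
    return batch

def control_agreement(annotations, control_labels):
    """Fraction of control images on which the annotator matched the
    reference label; a drop here signals a diverging interpretation."""
    hits = sum(1 for img, label in annotations.items()
               if img in control_labels and label == control_labels[img])
    return hits / len(control_labels)
```

Running `control_agreement` per annotator and per batch is what lets you trace a divergence back to its origin.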
Lesson 9 — Don’t waste their time (on the importance of having adapted tools)
We developed our own annotation tool. Something important to note is that an annotation tool is not a general-public UI. This means that you are not forced to make it look nice, to follow the latest shiny framework, etc. First and foremost, the annotation tool is/must be the Formula 1 that will allow the annotators to reach their objectives. As such, special attention needs to be paid to the interface. Our main drivers when designing and reviewing the interfaces were:
- It must be easy for the annotators to annotate images: don’t ask them to click on 3 different buttons to validate their decision.
- Shortcut keys should be available, as for trained people a keyboard always beats a mouse. We kept a collapsible ‘cheat sheet’ visible during the annotation.
- You can show much more than a single image at a time: for our task, 15 images at a time was a sweet spot for annotators who don’t like to scroll, and 50 otherwise.
- The resolution of the image is also important to spot details (e.g. ‘too revealing’ images). A full size image was thus made available when clicking on the smaller version.
- Especially when the images could be distasteful, having a mechanism to hide the image (or make it transparent) is a nice feature to have.
- It should be easy to identify the classes of the image by visual cues (in our case, we used the color and type of the frame of the images).
- If the classes are unbalanced (this was our case: the infringing products were much less frequent than normal products), the default setting should be ‘all the images are of the most represented class’ to reduce the number of manual operations to perform.
- The tool can have an option to show only the positive, negative and/or unsure cases to ease the reviews.
- The tool must support edge cases like missing pictures (more on this below), animated GIFs or placeholders (pictures containing only the text ‘image not found’).
- The tool must be accessible to the annotators, obviously. This is not always straightforward, especially when external people are involved, and it might require some involvement of network and security teams.
- The tool must record who did what and when (e.g. did they use the new or the old version of the policy?)
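The last point can be captured with a simple annotation record. The schema below is hypothetical (field names are ours, not our actual storage format), but it shows the kind of metadata worth keeping: who, when, under which policy version, and the finer-grained reason mentioned in Lesson 4.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AnnotationRecord:
    image_url: str
    label: str                # e.g. 'safe', 'blacklisted', 'unsure'
    reason: Optional[str]     # finer classification, eases policy changes
    annotator_id: str
    policy_version: str       # which revision of the policy was in force
    annotated_at: str         # UTC timestamp, ISO 8601

def record(image_url, label, annotator_id, policy_version, reason=None):
    return AnnotationRecord(
        image_url=image_url,
        label=label,
        reason=reason,
        annotator_id=annotator_id,
        policy_version=policy_version,
        annotated_at=datetime.now(timezone.utc).isoformat(),
    )

# asdict(record(...)) yields a JSON-serialisable dict for storage.
```

Recording the policy version per annotation is what makes a later policy change tractable: you only re-review the records annotated under the old version and the affected reason.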
Lesson 10 — The internet n̶e̶v̶e̶r̶ forgets (on the importance of saving your data)
In our case, the ‘images’ were provided as URLs pointing to the original images. In the beginning, we didn’t take the time to save the pictures for later use (they were on the web and ‘the internet never forgets’, right?) and just stored the identified class of the picture alongside its URL. Unfortunately, we were strongly reminded that our customers update their catalogs pretty regularly (especially the images to blacklist… the most difficult to spot) and that the annotations done on images that had since been removed were in fact wasted. Once annotated, an image is like a small gold nugget: it is the summary of a human analysis. You lose the artifact, you lose the gold nugget.
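A minimal, illustrative way to archive images at annotation time (the file layout and function names are assumptions, not a description of our pipeline):

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen

def local_name(url):
    """Content-address the local copy by the SHA-1 of the URL, so the
    same URL always maps to the same file."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".img"

def archive(url, out_dir="annotated_images"):
    """Download the image at annotation time; the URL may be dead
    tomorrow, but the annotation will still have its artifact."""
    path = Path(out_dir) / local_name(url)
    if not path.exists():  # makes re-runs idempotent
        path.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as resp:
            path.write_bytes(resp.read())
    return path
```

Storing `local_name(url)` next to the label keeps the gold nugget even after the advertiser refreshes the catalog.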
Lesson 11 — Just the right quantity of data, not more, not less (on the importance of knowing when to stop collecting data)
We used the technique presented by Andrew Ng in his book Machine Learning Yearning (Section 29: Plotting Training Error): the model is trained on progressively larger datasets, and the training and test errors are plotted. When the test error plateaus, adding more data won’t bring additional value, so we can stop annotating.
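A sketch of that procedure, with `train` and `evaluate` left as placeholders for your actual model and metric (names and thresholds below are ours, for illustration):

```python
def learning_curve(train, evaluate, data, test, fractions=(0.1, 0.25, 0.5, 1.0)):
    """Train on nested, progressively larger subsets of `data` and
    record (subset size, training error, test error) for each."""
    points = []
    for frac in fractions:
        subset = data[: max(1, int(len(data) * frac))]
        model = train(subset)
        points.append((len(subset), evaluate(model, subset), evaluate(model, test)))
    return points

def plateaued(points, tolerance=0.01):
    """True when the last increase in data improved test error by less
    than `tolerance`: more annotation is unlikely to help."""
    return len(points) >= 2 and points[-2][2] - points[-1][2] < tolerance
```

With a CNN, `train` would be a full training run per fraction, so in practice you compute only a handful of points.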
Lesson 12 — Never stop learning (on the importance of being ready for changes)
Completing the training and getting good accuracy is unfortunately not the end of the story. Indeed, in our case, Criteo keeps adding new advertisers every day, meaning new catalogs, meaning potentially new types of images and new surprises as far as blacklisting reasons are concerned. As a consequence, you should keep adding new images to your test set regularly, verify that your accuracy does not degrade too quickly, and start a new training when the accuracy is no longer sufficient.
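One simple, hypothetical way to turn that monitoring into a retraining trigger (the thresholds are placeholders, not the values we used):

```python
def should_retrain(accuracy_history, baseline, drop=0.02, window=3):
    """Trigger retraining when accuracy on the regularly refreshed test
    set stays more than `drop` below the baseline for `window`
    consecutive evaluations (avoids reacting to a single noisy point)."""
    recent = accuracy_history[-window:]
    return len(recent) == window and all(a < baseline - drop for a in recent)
```

Requiring several consecutive low readings before retraining filters out one-off fluctuations in the freshly annotated test batches.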
This concludes the main lessons we learned during the lengthy process of annotating 5 million images. We take the opportunity to warmly thank all the people who contributed to this project, and especially the annotator teams in Spain and India.
If this kind of work interests you, Criteo is hiring…
Authors (in alphabetical order): Beranger Dumont, Emmanuel Augustin, Jaideep Sarkar, Oussama Taktak, Renaud Bauvin, Stephane Le Roy