Machine Learning Powered Content Moderation: AI and Computer Vision Applications at Expedia
How to build a highly customized AI framework for content moderation using state-of-the-art deep learning.
Authors: Shervin Minaee, Harsh Pathak, Thomas Crook
For an online travel agency, images of a property (e.g. hotels, resorts, apartments, vacation rentals) are invaluable references for travel shoppers deciding which property to book.
Expedia Group™️ receives millions of lodging images from its customers and suppliers each quarter. Before we can display these images to customers, we must do all we can to ensure that they meet our requirements for quality and for business and legal compliance. Therefore, all the images uploaded to our sites go through a moderation process to make sure they meet our standards. These guidelines are developed by our legal and content teams and comprise the minimum standard required for an image to be approved and displayed on our website. The process of making sure that image (or text) content complies with the guidelines is called content moderation.
In this post we detail several machine learning models we built to detect whether images conform with the guidelines, along with examples of approved and rejected images.
To give you an idea of what gets approved and rejected under our guidelines, we show a few images from each category below. Figure 1 provides four examples of images approved under the guidelines.
Figure 2 provides 4 example images which were rejected based on the guidelines. There are around 10 different reasons which can prevent an image from being approved in our moderation.
Historically, we carried out content moderation using third party vendors, but with the increasing volume of the images (and text content) we started to automate as much of this work as possible with the help of machine learning models.
In the next few sections, we will provide an overview of our modeling framework, data collection, and evaluation frameworks.
2. Modeling Framework
One challenge we faced when we started this project was the lack of enough labeled data with granular categories for user generated content.
In the past, Expedia teams labeled content using crowd-sourcing, but in many cases we found that images had only been labeled as approved or rejected without specifying the reason. This meant we lacked the training data to inform models why an image was rejected (an image can be rejected because it had low quality, or because it contains identifiable children, or for many other reasons).
2.1. Single Binary Classification Model
With the available labeled data, we first decided to train a single model based on a convolutional neural network (CNN) that predicts whether an image can be approved or not, which is a binary classification problem. This model was built using a combination of hand-crafted features and a CNN¹. The architecture of this model is shown in Figure 3.
When we tested this model, however, it did not perform well on our test set, and we ended up over-rejecting good images. Our hypothesis was that the distribution of rejected images varies too widely for a single model to capture: an image rejected for being blurry looks very different from one rejected for being rotated or for containing identifiable children. We therefore needed a different approach.
2.2. Using An Ensemble of Category-specific Models
We changed our strategy to build several models, each one specialized in detecting images that are going to be rejected for a specific reason. This strategy brought us several advantages:
- It can potentially be more accurate, since each model operates in a much narrower scope (therefore focusing on a much smaller distribution).
- The resultant architecture can help us better handle multi-label scenarios, i.e. cases where an image can be rejected due to multiple reasons.
- We can provide more detailed feedback to our customers/partners on why an image is rejected, enabling them to more easily resolve that issue.
Our model architecture comprises a wide range of models, used for different categories, including:
- CNN based image classifiers
- Image-processing based classifiers
- Object detection model
- Text detection model
On a high level, each category is handled by a classification component, plus, for some of the categories, an object detection or text detection component. The classification component of each category makes a binary prediction on whether the image should be flagged as belonging to that category or not.
We trained/fine-tuned our models on our internal images, as well as publicly licensed images related to the travel domain. In discussions with colleagues, we considered whether we could just use pre-trained models or commercially available external services for our image moderation without any further fine-tuning or training. We tried several pre-trained models/solutions for our moderation, and none of them met our minimum accuracy expectations (they either had a very low recall rate or a high false-rejection-rate). We hypothesize that this is due to differences between the distribution of images those models were trained on and the distribution of Expedia images.
For most of our categories, the classifiers are trained by fine-tuning a ResNet18 model² on an internally labeled dataset of 20k-100k images. The architecture of ResNet18 is shown in Figure 4.
However, for some of the categories we found that the fine-tuned ResNet model did not perform well and produced many false rejections. One such case is the “images with a border” category. For these categories we trained a model based on traditional image processing techniques (such as edge detection on the quantized image, and Wavelet and Fourier domain image analysis), and were able to get a significant boost in our performance.
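To make the idea of an image-processing classifier concrete, here is a minimal sketch of a border check. It is a deliberately simple stand-in, not the edge-detection, Wavelet, or Fourier analysis described above: it assumes the image is a 2D list of grayscale values and flags a border when the outer frame is nearly constant while the interior is not. The function name and both parameters are hypothetical.

```python
from statistics import pstdev


def has_uniform_border(pixels, border=2, max_std=2.0):
    """Return True if the outer `border` pixels form a near-constant frame.

    `pixels` is a 2D list of grayscale values (rows of equal length).
    `border` and `max_std` are illustrative assumptions, not tuned values.
    """
    height, width = len(pixels), len(pixels[0])
    if height <= 2 * border or width <= 2 * border:
        return False  # too small to hold a frame plus content

    frame, interior = [], []
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            on_frame = (y < border or y >= height - border
                        or x < border or x >= width - border)
            (frame if on_frame else interior).append(value)

    # A border means a flat frame around non-flat content.
    return pstdev(frame) <= max_std and pstdev(interior) > max_std
```

Requiring a non-flat interior keeps a completely blank image from being reported as "bordered".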
Object and Text Detection Models
We noticed that for some of the categories, if we only relied on the classification model, we would end up having a large false-rejection-rate (i.e. we would falsely reject many good quality images). For example, for the “selfie detection” model, we noticed that some of the images with circular shapes were mis-flagged as selfies (perhaps the classification model was confusing them with faces).
To decrease the false-rejection-rate, we decided to leverage other models such as object detection³ and text detection⁴ when appropriate, and use an aggregated decision of classification and detection models to make the final prediction (as shown in Figure 5). For example, for the identifiable children category, we flag an image as “children” only if the “children classifier” classifies the image as children and, at the same time, the object detection model detects a human in the image. We found that the aggregate model produced significantly more accurate predictions than a single classifier.
We use a combination of pre-trained object detection and text-detection models, plus internally fine-tuned object-detection models curated toward an internal set of object categories (such as for logo detection).
By incorporating object/text detection, we were able to reduce the false-rejection rate of some of the models from 20% to less than 1% at the same recall rate, which was a huge gain.
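The aggregated decision for the identifiable children category can be sketched as a logical AND of the two components. This is a minimal illustration under stated assumptions: the classifier is assumed to return a probability, the detector a collection of label strings, and the function name, the `"person"` label, and the 0.5 threshold are all hypothetical.

```python
def flag_identifiable_children(classifier_score, detected_labels,
                               score_threshold=0.5, required_label="person"):
    """Flag only when the classifier and the object detector agree.

    classifier_score: probability from the category classifier (assumed in [0, 1]).
    detected_labels: labels returned by an object detector for the same image.
    """
    classifier_says_yes = classifier_score >= score_threshold
    detector_confirms = required_label in detected_labels
    return classifier_says_yes and detector_confirms
```

A high classifier score on an image with no detected human is therefore not flagged, which is how the second component cuts false rejections.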
Overall ML model
After training the category-specific models, we combine their predictions in an ensemble to make the final call on an image. We approve an image only if it is approved by all models. The block diagram of our ensemble model is shown in Figure 6.
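The "approved only if every model approves" rule can be sketched as follows. The category names, scores, and thresholds are hypothetical, and each category is assumed to have already been reduced to a single score (after any detection step).

```python
def moderate(category_scores, thresholds):
    """Combine per-category scores into one decision.

    category_scores: mapping of category name -> score for one image.
    thresholds: mapping of category name -> cut-off for that category.
    An image is approved only when no category flags it.
    """
    rejection_tags = sorted(
        category
        for category, score in category_scores.items()
        if score >= thresholds[category]
    )
    return {"approved": not rejection_tags, "rejection_tags": rejection_tags}
```

Returning the full list of tags, rather than a single boolean, is what supports multi-label cases and detailed feedback on why an image was rejected.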
3. Collecting The Image Dataset
3.1 Training Data
To train each of the above models (for example images with identifiable children), we needed to have training, validation/tuning and test datasets with positive and negative examples for that category.
To get labeled images for training, we used a combination of approaches depending on the category:
1- Leverage internally labeled images on similar tasks, whenever they were available.
2- Collect images from search engines by searching for relevant queries (and restricting the result set to images which are labeled for reuse and modification), and then clean their labels.
3- Synthesize images for some of the categories (such as for photo montage, text, and logo).
Using these approaches, we were able to get a reasonably sized training dataset for all of our models. We used a dataset of between 10,000 and 100,000 labeled images per model, depending on data availability and modeling difficulty. A subset of this data was used for tuning the model parameters/hyper-parameters.
3.2 Test Data
For testing these models, we do both offline and online cross-validation against an incumbent vendor moderation system.
For offline testing, we wanted to make sure we have a cleanly labeled dataset that we can rely on for evaluating our models. Ideally, we want our test set to meet the following criteria:
- Be as close as possible to the distribution of images we are receiving. This makes it easier to get a reliable estimate of the models’ performance. For that we need to have a random subset of manually annotated images.
- Have enough samples for each category, so that we can estimate the recall and precision of models for all categories.
- Have each image labeled by multiple human labelers, so that we can have reliable labels.
The second criterion induces a limitation, as our class labels are highly imbalanced (90% of the images are approved, and different rejection reasons have different frequencies, some around 1%). This means that to have enough samples for each class, we need to send out a large number of images for manual labeling.
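A quick back-of-the-envelope calculation shows why the imbalance is costly. The sketch below computes the expected number of randomly sampled images needed to see a given number of examples of a category. The target of 500 examples is a hypothetical figure chosen for illustration, and the result is an expectation, so an actual random sample may contain somewhat more or fewer.

```python
import math
from fractions import Fraction


def images_to_label(min_examples, category_rate):
    """Expected number of random images needed to collect `min_examples`.

    category_rate: fraction of incoming images in the category (e.g. 0.01).
    Fraction avoids floating-point rounding in the division.
    """
    rate = Fraction(str(category_rate))
    return math.ceil(min_examples / rate)


print(images_to_label(500, 0.01))  # 50000
```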
In the end, we used crowd-sourcing to label sufficiently large random subsets of images. We had each image labeled by 3–5 annotators, and assigned image labels by majority voting. We also collected some internally labeled datasets with the help of our experts from our Expedia media team. We tested our models on both of these datasets.
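Majority voting over annotator labels can be sketched as below. How ties were resolved is not stated above, so returning `None` for a tie (for example, to send the image for another round of labeling) is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def majority_label(annotations):
    """Return the most common label, or None when there is no clear winner."""
    if not annotations:
        return None
    ranked = Counter(annotations).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between the two most frequent labels
    return top_label
```

With an odd number of annotators and two possible labels a tie cannot occur; it becomes possible with four annotators or with more than two labels.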
4. Model Evaluation
Before getting into model performance, let’s take a look at the output of one of the models. Most of our classification models output the probability that an image belongs to the class it was built for (e.g., the probability that the current image is a selfie). In Figure 7, we show the distribution of model scores on a set of approved and rejected images. As we can see, for rejected images most of the scores are skewed toward 1, whereas for approved images most of them are around 0 (which is a good sign, Yayyy!).
We can then calibrate a cut-off threshold on each model’s scores to decide if an image should be rejected, or approved. Depending on the threshold, there could be some images misclassified as approved or rejected, and our goal is to minimize those numbers. One can also think of our moderation framework as a tagging process in which each image is tagged with either an “approve tag”, or a set of rejection tags.
In Figure 8, we show some of the models’ predicted labels on sample images.
Now let’s get into model performance evaluation. There are various machine learning metrics we can use to evaluate the performance of moderation or classification models, such as: classification accuracy, precision, recall, F1, False Discovery Rate (FDR) and False Rejection Rate (FRR). But from a business perspective, the most important factors are:
- Accuracy in detecting images which should be flagged (i.e. the recall rate on the rejected class).
- The percentage of the images which are being auto-moderated, which is inversely correlated with our total content moderation costs.
Adopting a business-driven mindset, we decided to measure the performance of our framework in terms of class-wise recall rate, as well as the false-rejection-rate (the percentage of the images falsely rejected by our models). One could also look at the class-wise F1-score, but given the highly imbalanced nature of the data, it would not be very informative since F1 scores give equal importance to precision and recall.
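The two metrics can be computed from ground-truth and predicted labels as in the sketch below. One assumption to flag: the false-rejection-rate here is taken over the images that should have been approved (false rejections divided by all good images). If it is instead measured over all images, the denominator changes accordingly.

```python
def recall_and_frr(should_reject, predicted_reject):
    """Return (recall on the rejected class, false-rejection-rate).

    should_reject: ground-truth booleans, True when the image must be rejected.
    predicted_reject: model decisions for the same images, in the same order.
    """
    pairs = list(zip(should_reject, predicted_reject))
    true_rejects = sum(1 for truth, pred in pairs if truth and pred)
    false_rejects = sum(1 for truth, pred in pairs if not truth and pred)
    bad_total = sum(1 for truth, _ in pairs if truth)
    good_total = len(pairs) - bad_total

    recall = true_rejects / bad_total if bad_total else 0.0
    frr = false_rejects / good_total if good_total else 0.0
    return recall, frr
```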
We evaluated the performance of our models on several test sets:
- Images labeled through a crowdsourcing service
- Images labeled by expert internal photo editors
- Images submitted by our customers
- Images submitted by our hoteliers and vacation rental owners
Our recall rate across the various test sets ranged from 80–91%. Our overall false-rejection-rate is around 5–10%, depending on the dataset.
As a side note, the ROC curve of a model’s scores can be useful for selecting a cut-off threshold that gives a good trade-off between recall and false-rejection-rate. The ROC curve for one of our models is provided in Figure 9.
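As a minimal sketch of that selection, the code below sweeps every distinct score as a candidate cut-off, records the resulting recall and false-rejection-rate, and picks the cut-off with the highest recall that stays within a false-rejection budget. The function names and the budget parameter are hypothetical, and the sketch assumes the labeled set contains at least one image of each class.

```python
def roc_points(scores, should_reject):
    """Return (threshold, recall, false_rejection_rate) per candidate cut-off."""
    positives = sum(should_reject)
    negatives = len(should_reject) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [score >= threshold for score in scores]
        tp = sum(1 for f, t in zip(flagged, should_reject) if f and t)
        fp = sum(1 for f, t in zip(flagged, should_reject) if f and not t)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(points, max_frr):
    """Highest-recall cut-off whose false-rejection-rate is within budget."""
    allowed = [point for point in points if point[2] <= max_frr]
    if not allowed:
        return None
    return max(allowed, key=lambda point: point[1])[0]
```

Plotting recall against false-rejection-rate for these points gives the ROC curve; tightening the budget moves the chosen cut-off toward higher scores.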
For semi-online evaluation, once the models were deployed we compared the predicted labels from our models with labels assigned by an external annotation company. In this way we could track precision, recall, and False Rejection Rate (FRR) over a period of time to make sure our models were working as expected.
5. Model Deployment
Once the models are finalized, they need to be deployed to our auto-moderation pipeline so they can be used by internal teams for moderating text and images. The models are currently deployed in Docker containers in a single cluster behind an Elastic Load Balancer. Model deployment is configured in such a way that traffic can be routed to different clusters. Each cluster can host one or more models as necessary to optimize throughput for performance and memory use.
The auto-moderation pipeline first stores the incoming requests in MongoDB and then processes them using AWS SQS queues. In processing, it first identifies the moderation service to call based on the content type, such as for text or a photograph. The moderation services are called with model names and model threshold as parameters and these can be configured based on clients. Once the processing is done we consolidate the text and image responses and call the endpoint configured for the client with the consolidated and individual request statuses.
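The routing and consolidation steps can be sketched in a few lines. This is an in-memory illustration only: a `deque` stands in for the queue, no database or AWS calls are made, the items carry precomputed scores instead of calling real models, and every name in it (`CLIENT_CONFIG`, `run_models`, `process_request`, the client and model names) is hypothetical.

```python
from collections import deque

# Hypothetical per-client configuration: model names and threshold per content type.
CLIENT_CONFIG = {
    "demo-client": {
        "image": {"models": ["blurry", "selfie"], "threshold": 0.5},
        "text": {"models": ["profanity"], "threshold": 0.7},
    }
}


def run_models(item, models, threshold):
    """Stand-in moderation service; a real one would invoke the deployed models."""
    flags = [name for name in models if item["scores"].get(name, 0.0) >= threshold]
    status = "rejected" if flags else "approved"
    return {"id": item["id"], "status": status, "flags": flags}


# One service per content type; both use the same stand-in here.
SERVICES = {"image": run_models, "text": run_models}


def process_request(request, config):
    """Route each item by content type, then consolidate the statuses."""
    queue = deque(request["items"])  # stands in for the message queue
    results = []
    while queue:
        item = queue.popleft()
        settings = config[request["client"]][item["type"]]
        service = SERVICES[item["type"]]
        results.append(service(item, settings["models"], settings["threshold"]))
    approved = all(result["status"] == "approved" for result in results)
    return {"status": "approved" if approved else "rejected", "items": results}
```

The returned dictionary plays the role of the payload sent to the client's configured endpoint, carrying both the consolidated status and the individual ones.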
6. Conclusion and Future Work
To conclude this article, we would like to mention a few key lessons we learned over the course of the project:
- Having a large volume of data is crucial for training powerful models that can solve business problems. Having clean, labeled data is usually even more important than the modeling part itself. Therefore it is worth allocating a large proportion of the project schedule to producing clean quality labels. We can only hope that some day powerful unsupervised models may help us with big labeling tasks.
- Try to use various approaches for collecting data. In the image domain, getting images from public search engines and Wikimedia, and synthesizing relevant images can be very helpful.
- The first model you try may not be the best (or even a good) model for the application at hand. Always be open and try different approaches to solve a problem. For some of these categories we found that traditional computer vision techniques led to a more accurate model than fine-tuning a pre-trained CNN (initially we were doubtful about trying those models and were not hopeful that they would produce better results than a fine-tuned ResNet model).
- A difficult predictive modeling problem can sometimes be made more tractable by breaking it down into smaller problems that can be solved separately.
As an ongoing effort on this project, we are working on:
- Improving our models to lower their false-rejection-rate, which can help us to increase the percentage of images which can be auto-moderated.
- Developing image enhancement models for improving the quality of low-quality images. By doing that, some of the images which are being rejected today can be enhanced and onboarded on our website.
This work was done in collaboration with several teams including data science, content, UGC, and destination. We would like to thank Harsh Pathak, Chao Zhang, Xinxin Li, Brooke Cowan, Peter Barszczewski, Aida Mashkouri, Lauren Houchin, Sveta West, Jesse Farmer, Gayatri Diwan, Ankur Aggrawal, Payal Goel, Etienne B-Dury, and many others for lots of valuable comments/inputs during this project. Finally, we thank Cliff DesPeaux, Zach Kuntz, and Maj Askew for their constant support of this project.
¹ Y LeCun, L Bottou, Y Bengio, P Haffner, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 86(11), pp. 2278–2324, 1998.
² K He, X Zhang, S Ren, J Sun, “Deep residual learning for image recognition”, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
³ J Redmon, S Divvala, R Girshick, A Farhadi, “You only look once: Unified, real-time object detection”, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
⁴ X Zhou, C Yao, H Wen, Y Wang, S Zhou, W He, J Liang, “EAST: an efficient and accurate scene text detector”, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560, 2017.