tLabel: Talabat AI Labels Restaurant Cuisines

Fadi Nader
Published in talabat Tech
8 min read · Jan 10, 2022

Introduction

Talabat is considered to be the home of food delivery in the Middle East. We work with over 50,000 partner restaurants across 8 markets, supported by 80,000 heroic riders serving millions of customers, and these numbers are only growing. We are passionate about creating amazing experiences for our ecosystem: our partner restaurants, customers, riders, and communities.

With thousands of options on our platform, setting accurate cuisine tags is essential to enable customers to narrow down their choices to reach the desired restaurant. The food industry is also very dynamic, and we need to update our set of tags to cope with the new cuisines and items in today’s market.

Tagging restaurants shouldn’t be restricted to famous cuisines such as American, Arabic, and Asian, but should also cover the famous dishes or plates they serve, such as Sushi, Burgers, and Wings.

Our goal here is to correctly label all restaurants on our platform with the appropriate cuisines to:

  • Enable a smooth experience for every Talabat customer to find what they’re looking for.
  • Have a reliable assessment of our inventory for any applications that depend on the correctness of cuisine labeling content, e.g. recommendations, logistics, reporting, analysis, etc.
Cuisine tags for a restaurant

Challenges

We can summarize the problems of cuisine tagging in the following main points:

  1. Breadth: There are over a hundred possible labels to use.
  2. Subjectivity: Maintaining reliable and solid criteria when setting cuisine labels. Examples: How many burgers need to be on a restaurant’s menu to be eligible for the tag “Burgers”? What qualifies as a “Healthy” restaurant?
  3. Efficiency: Avoid manually tagging each restaurant that we have on our platform, which consumes a massive amount of time.
  4. Coverage: Covering as many tags as possible that correctly describe a given restaurant.
  5. Accuracy: Doing all of the above while still maintaining highly-accurate cuisine labels.

The problem is inherently subjective: there is always a gray area where many tags are possible, but we need to efficiently select the most relevant ones while simultaneously being as descriptive and accurate as possible. For example, if we want to set three tags for the famous McDonald’s restaurant, we can come up with:

American, Burgers, Chicken

Burgers, American, Fast Food

Sandwiches, American, Fast Food

At Talabat, this problem becomes even more pronounced with less popular restaurants and less commonly used tags. This makes any manual labeling process very costly and inconsistent.

Approach

We tackled the problem holistically to cover all aspects of the challenge. To account for the Breadth and Coverage issues, we created an exhaustive list of tags and split it into the following four categories. The purpose of this step is to build an organized and consistent process for thinking about a restaurant, and hence tagging it with the most relevant keywords.

  1. Geography: Tags correspond to a place or a country, ex: American, French, Asian.
  2. Food Category: High-level tags describing the food, ex: Grills, Beverages, Sandwiches.
  3. Dish: Lower level tags representing plates or dishes, ex: Sushi, Pizza, Biryani.
  4. Diet: Dietary tags such as Vegan, Gluten-free, Keto.

To reduce Subjectivity and improve Efficiency, we curated a rich, structured ground-truth dataset of restaurants and their cuisine labels, which we used to train a deep learning model to automatically label new data with a human-in-the-loop. We designed a thorough process to achieve this:

  1. Selected a small set of human taggers.
  2. Conducted training sessions to ensure adherence to the tagging criteria.
  3. Provided them with a drop-down containing the exhaustive list of options for each of the four label categories.
  4. Suggested a set of tags that may have been missed by the taggers to be reviewed for each restaurant.

To ensure high Accuracy, we then evaluated our best model against human performance, while choosing the most suitable evaluation methodology for the task at hand.

Putting all of this together gives Talabat’s in-house restaurant cuisine labeling system: tLabel. In the following sections, we will describe both modeling and evaluation in detail.

Model

When designing our model architecture we have to take into consideration the available input data and the desired output structure. In this case, the most relevant source of input is the restaurant’s menu itself, whereas the output is the appropriate cuisine tags.

However, since we have a relatively large vocabulary that can occur in any restaurant’s menu, it would be inefficient to use the menu’s keywords directly as features, as in a simple bag-of-words model.

Hence, a word-embedding model is trained over the available menus to represent the meaning of each word with a fixed-size vector; a single vector for the whole menu is then obtained by simply averaging the embeddings of all the keywords it contains.

Training word embeddings from menu items
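As a rough sketch of the averaging step, with a toy, made-up embedding table standing in for the trained word-embedding model (the real model would learn vectors of a few hundred dimensions from all partner menus):

```python
import numpy as np

# Toy embedding table: the 4-dimensional vectors and keywords here are
# invented for illustration only.
EMBEDDINGS = {
    "chicken": np.array([0.9, 0.1, 0.0, 0.2]),
    "burger":  np.array([0.8, 0.3, 0.1, 0.0]),
    "bun":     np.array([0.7, 0.2, 0.2, 0.1]),
    "sushi":   np.array([0.0, 0.9, 0.8, 0.3]),
}

def menu_embedding(menu_keywords, embeddings, dim=4):
    """Average the embeddings of all known keywords in a menu."""
    vectors = [embeddings[w] for w in menu_keywords if w in embeddings]
    if not vectors:  # no known keywords: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = menu_embedding(["chicken", "burger", "bun"], EMBEDDINGS)
print(vec.shape)  # one fixed-size vector for the whole menu
```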

This successfully represents our input, but how about the output? It turns out that we have an interesting multi-label problem where several outputs can co-exist at the same time for the same restaurant. It’s important to distinguish between multi-class and multi-label classification problems.

Multi-Class: In the case of multi-class problems, we would work with more than two labels but each instance can only be labeled with exactly one class.

Multi-Label: This is where instances can be labeled with one or more classes, similar to our problem.

Read more about the difference between multi-class and multi-label problems here and here.
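Concretely, in the multi-label setting each restaurant’s tags become a binary indicator row: 1 where a label applies, 0 elsewhere. A small sketch using scikit-learn’s MultiLabelBinarizer (the restaurants and tags below are illustrative, not Talabat data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One set of tags per restaurant; several labels can co-exist.
tags = [
    {"American", "Burgers"},    # e.g. a burger joint
    {"Asian", "Sushi"},         # e.g. a sushi restaurant
    {"American", "Fast Food"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
print(mlb.classes_)  # alphabetical label order
print(Y)             # one row per restaurant, one column per label
```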

With a large number of labels, a good approach is to make use of the label categorization (Geography, Food Category, Dish, Diet) and break the problem down into four classifier models. Each model is a vanilla deep neural network of 15 hidden layers with the same fixed input size, which is the embedding length of the restaurant, but they differ in output layer size, which depends on the number of labels in that category. After passing the restaurant embedding to the four classifiers, we just need to merge their predictions to get the final one.
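A minimal sketch of the merge step, with made-up label subsets and untrained random linear heads standing in for the four deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative label sets per category (a tiny subset of the real taxonomy).
CATEGORIES = {
    "Geography":     ["American", "Italian", "Asian"],
    "Food Category": ["Sandwiches", "Grills", "Beverages"],
    "Dish":          ["Burgers", "Pizza", "Sushi"],
    "Diet":          ["Vegan", "Keto"],
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for the four trained networks: a random linear head per category,
# mapping the menu embedding to one sigmoid score per label in that category.
DIM = 8  # hypothetical embedding length
heads = {c: rng.normal(size=(DIM, len(ls))) for c, ls in CATEGORIES.items()}

def predict_tags(menu_vec, threshold=0.5):
    """Run all four heads and merge the labels whose score clears the threshold."""
    tags = []
    for category, labels in CATEGORIES.items():
        scores = sigmoid(menu_vec @ heads[category])
        tags.extend(l for l, s in zip(labels, scores) if s >= threshold)
    return tags

print(predict_tags(rng.normal(size=DIM)))
```

Note that the output layer of each head matches the number of labels in its category, as described above.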

Evaluation

Given the imbalanced nature of the problem, accuracy is not a solid metric to rely on; instead, we optimized the classifiers to perform with maximum precision. The logic here is that we care more about predicting correct cuisines than about predicting all possible cuisines for a given restaurant.

However, the multi-label nature of the problem is a bit confusing when choosing the suitable formula for precision. There are two broad types of evaluation for multi-label problems to calculate precision: Example-based and Label-based.

Example-Based Precision

Example-based precision is defined as the proportion of predicted correct labels to the total number of predicted labels in each example, averaged over all examples:

Precision = (1/N) · Σᵢ Pᵢ, with Pᵢ = |Yᵢ ∩ Zᵢ| / |Zᵢ|

where Pᵢ is the precision of example i (Yᵢ its true labels, Zᵢ its predicted labels) and N is the number of examples.
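In code, this could look like the following (the handling of examples with no predicted labels is a convention choice; they are counted as precision 0 here):

```python
def example_based_precision(y_true, y_pred):
    """Mean over examples of |true ∩ predicted| / |predicted|.

    y_true, y_pred: lists of label sets, one pair per restaurant.
    """
    per_example = []
    for true, pred in zip(y_true, y_pred):
        per_example.append(len(true & pred) / len(pred) if pred else 0.0)
    return sum(per_example) / len(per_example)

# Illustrative data: two restaurants with their true and predicted tags.
y_true = [{"American", "Burgers"}, {"Asian", "Sushi"}]
y_pred = [{"American", "Burgers", "Pizza"}, {"Sushi"}]
print(example_based_precision(y_true, y_pred))  # (2/3 + 1/1) / 2 ≈ 0.833
```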

Label-Based Metrics

As opposed to example-based metrics, label-based metrics evaluate each label separately across all instances and then average over all the labels.

And here there are two kinds of averaging over the labels:

Micro-avg precision = Σₖ TPₖ / Σₖ (TPₖ + FPₖ)

Macro-avg precision = (1/K) · Σₖ TPₖ / (TPₖ + FPₖ)

where TPₖ is the true positives and FPₖ the false positives of label k, and K is the number of labels. Macro-avg precision was not the best fit here because it gives the same weight to minor and major classes, which leads to unfair evaluation.
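Both averages can be computed directly from binary indicator matrices; the function and data below are illustrative:

```python
import numpy as np

def micro_macro_precision(Y_true, Y_pred):
    """Label-based precision with both averaging schemes.

    Y_true, Y_pred: binary indicator matrices, shape (n_examples, n_labels).
    """
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0).astype(float)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0).astype(float)
    # Micro: pool counts over all labels, then divide once.
    micro = tp.sum() / (tp.sum() + fp.sum())
    # Macro: per-label precision first, then an unweighted mean.
    per_label = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    macro = per_label.mean()
    return micro, macro

Y_true = np.array([[1, 1, 0], [0, 1, 1]])
Y_pred = np.array([[1, 0, 1], [0, 1, 1]])
micro, macro = micro_macro_precision(Y_true, Y_pred)
print(micro, macro)  # 3/4 and (1 + 1 + 0.5)/3
```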

The chosen candidate was example-based precision, because it allows us to know where to set our sigmoid threshold to obtain the maximum precision across all instances: increasing the sigmoid threshold beyond a certain value decreases the number of true positives until it vanishes, which is not the desired behavior.

Confusion Matrix

In the case of multi-class problems, the confusion matrix doesn’t only show the number of false positives and false negatives, but also how those numbers are distributed over the other classes. This information is not easy to attain in multi-label problems, because predicting fewer or more labels than the ground truth is not directly attributable to any single label. For example, if the ground truth is American, Burgers and the prediction is Italian, which class should be charged with the false positive of predicting the wrong label Italian, and which class with the false negatives of missing the right labels?

Even scikit-learn, the only library with relevant support at the time of writing this blog post, can only generate a one-vs-rest binary confusion matrix for each class.

Fortunately, this paper suggests a convenient way to come up with a confusion matrix just like in multi-class problems, and we could easily transform the pseudocode into a simple Python function that provides both an absolute-count and a percentage confusion matrix:
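Since the paper is not named here, the sketch below implements one plausible reading of such a scheme: correctly predicted labels count on the diagonal, and each missed true label is split evenly across the wrongly predicted labels (or an extra “no prediction” column when nothing wrong was predicted). The paper’s exact allocation rule may differ.

```python
import numpy as np

def multilabel_confusion(y_true, y_pred, labels):
    """Approximate multi-label confusion matrix (counts and row percentages).

    y_true, y_pred: lists of label sets, one pair per example.
    Returns a (K, K+1) matrix: last column = "no prediction".
    """
    idx = {l: i for i, l in enumerate(labels)}
    K = len(labels)
    cm = np.zeros((K, K + 1))
    for true, pred in zip(y_true, y_pred):
        for l in true & pred:                # exact hits on the diagonal
            cm[idx[l], idx[l]] += 1
        missed = true - pred
        wrong = pred - true
        for l in missed:
            if wrong:                        # confused with wrong labels
                for w in wrong:
                    cm[idx[l], idx[w]] += 1 / len(wrong)
            else:                            # simply not predicted at all
                cm[idx[l], K] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    pct = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
    return cm, pct

# The American/Burgers vs Italian example from the text, plus one exact hit.
labels = ["American", "Burgers", "Italian"]
y_true = [{"American", "Burgers"}, {"American"}]
y_pred = [{"Italian"}, {"American"}]
counts, pct = multilabel_confusion(y_true, y_pred, labels)
print(counts)
```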

We can then use the usual heatmap plot to visualize it:

Impact

This had a remarkable positive impact on the customer experience in Talabat. The approach was successful in meeting our needs and hitting our initial goal. We were able to:

  1. Improve Efficiency: Thousands of restaurants labeled with minimal human involvement.
  2. Improve Customer Experience: Enhanced label coverage and objectivity improved the filtering and search experiences for our customers, as measured by a significant increase in conversion and click-through rates.

Conclusion

Cuisine labeling is a quite subjective and error-prone process. However, here at Talabat, we’ve built an AI-powered system, called tLabel, to improve and standardize the labeling process. The multi-label approach provides some flexibility in the output; however, it adds a layer of complexity to the evaluation process, which we overcame by understanding the different metrics available and choosing the one suitable for the problem. This had a significantly positive impact on conversion.
