Your pizza has a white bottom!

How identify defects using computer vision and Databricks

Eugene Bikkinin

Published in

dodoengineering

9 min readApr 25, 2022

Written by Kristina Fedorova and Eugene Bikkinin

Our quality control (QC) team receives several thousand questionnaires with photos, according to which it checks up the quality of products and services in Dodo Pizza. This is a routine job that can be automated with the help of computer vision.

In this article, we will tell you how we created and trained a model with Databricks, launched it into production and what results we got.

We have a QC team whose main task is to ensure the high quality of products and services of all Dodo Pizza restaurants. There are certain standards and criteria by which inspections are carried out. Mystery shoppers participate in the process: they send reports with photos of the order, premises (if they check the restaurant directly) and answers to questions about service. Based on these reports, the QC team identifies violations of quality.

At the same time, there are a number of pizza defects that can be detected using computer vision — and this can greatly facilitate manual verification. To begin with, we focused on the most common defects:

white pizza bottom;
white crust edges;
burnt crust edges;
the bottom is poorly cut;
deformed.

Examples of defected pizzas listed above

Data preparation

To train a neural network, we need labeled data. In our case we already had a set of labeled images that have been tagged with defect names of pizza. It remained only to select the necessary data and train the model.

A short reference about convolutional neural networks (CNN)
Many articles have been written about what neural networks are and how they work. In short, a neural network is a set of weight coefficients that are adjusted during training in such a way as to minimize the loss function. You can read more details, for example, here.
CNN models have proven themselves well for working with images. In general, they are called so due to their key operation — convolution. From a purely mathematical point of view, a convolution is the sum of the products of the matrix of input signals and the weight coefficients of the convolution kernels, which are learning parameters. By applying the convolution operation to the image, we extract certain features from it. On the first convolutional layers, the low-level features are extracted such as edges and curves. The deeper into the neural network, the more abstract the features are.
High-level features are already sent to the fully connected layer, where, in fact, the prediction of the model is formed.

Training the model with Databricks

To train the model, we use the Azure Databricks platform. It allows you to develop and deploy ML solutions.

Initially, our data was presented in the form of a CSV table, which indicates the path to the image in the BLOB storage and the presence of a defect type. But since we are dealing with images, this format was not particularly suitable. Parquet, a binary column-oriented big data storage format originally created for the Hadoop ecosystem, came to the rescue here. Parquet is much faster to read than CSV. It is also supported by Spark.

The next step is to feed the data from this dataset to the input of the model. For this we used Petastorm.

Petastorm is an open source data access library. It allows you to perform single-node or distributed training and validation of deep learning models from datasets in the Apache Parquet format. Petastorm is also friendly with PyTorch.

You can read about how to upload data using Petastorm here.

Model Training is usually not a one-time event, it is necessary to store information about each experiment somewhere. And if the experiments were conducted a month ago, then remembering what happened there will seem like real torture.

Ideally, we wanted to:

have information about each experiment;
version control for models;
easy to publish retrained models in production.

MLflow, a platform designed to manage the lifecycle of machine learning models, is suitable for this. It consists of four components:

MLflow Tracking allows you to log and query experiment results;
MLflow Models allows you package models that can be used in the future;
MLflow Projects allows you to save the code for its further reproduction;
MLflow Registry manages the full lifecycle of models. It provides model lineage, model versioning, stage transitions and annotations.

Versioning in MLflow

From time to time there is a need to retrain machine learning models. And if we manage to improve the metrics, then we would like to use the new model in production.

The first step is to register the model and assign it a name:

After that, the model will have a version. With each subsequent registration of a model with the same name, its version will be updated. A list of all registered models in Databricks can be seen in the Models tab. And when you click on any of them, you can see all its versions.

There you can also choose one of three states for the model: Staging, Production, Archived.

Staging is a test environment that is as close as possible to the conditions in production;
Production — operational environment;
Archived — archiving the model.

Subsequently, to use the newest version of the model in Spark job, we only need to set its name and state:

model_name = 'WhiteBottom'
model_stage = 'Staging'
model = mlflow.pytorch.load_model(f"models:/{model_name}/{model_stage}")

And now we can smoothly move on to how to use the model in production.

How we deployed the model in production

So, we had a ready model in the MLflow register that was able to predict a defect.

In order for it to start to be useful, we needed to deploy it in production.

First of all, it was necessary to choose a scenario for using the prediction. There are two modes: online — when the user enters his data and receives an immediate response from the web service, and asynchronous — when data processing occurs at certain intervals and usually works with a batch of events.

We choose the second mode. Mystery shoppers and customers send reports that our QC team employees do not check instantly. That is, the scenario that a job is launched once an hour, processes all the checkup events received at that moment and sends results to the QC automation team suited us.

Secondly, we need to choose a tool. Since our data pipelines are built using Databricks, it was logical to make a ML solution on it.

We took the job code from the Databricks website as a basis

The first version looked something like this:

Checkup events are collected in the Azure EventHubs topic (this is a Microsoft’s analog of Apache’s Kafka). Once an hour, the job starts and collects all events that have accumulated in the topic and have not yet been processed. A Dataframe is formed from a set of events, in which only those images that are needed for the defect remain. More precisely, not the images themselves, but the paths to them, because PyTorch ImageDataset uses its own mechanism for loading and batching images.

Then this array is passed to the model, and as an output we get a Dataframe with a prediction of the presence of a defect for each of the checkups, and for each of the checkups we create an event and send it to the EventHubs output topic. Then the QC automation service uses the data of these events to show results to our employees.

As a POC, this turned out to be quite a working solution. And we actually launched it into production. But literally the next day after the launch, the guys from the control team came with a new request: they wanted to add an arbitrary number of new models, and preferably without or with minimal participation of a data engineer.

There was another requirement: all models predictions for one checkup event should be returned in a single event too. If several defects are detected, they should not be sent in separate events, but packed into an array in one event. This requirement led to the appearance of a job that should collect the results of different models predictions.

The result is a scheme presented in the picture. At its core, you can still see the previous scheme, but now one initial job has split into four:

Landing. This job collects events from EventHubs and puts them into DeltaLake as is. That is, in fact, it just persists the input data.
Bronze. This job takes data persisted in the landing layer, converts it from the EventHub format into a meaningful Dataframe using the scheme and writes it to DeltaLake. In fact, you can launch the whole process without this job by doing schema transformation in the next stage, but we decided to add a little granularity to the separation of concerns.
Making predictions. These are the same jobs from the first solution, only now they do not communicate with EventHubs directly, but receive input data and write the results to DeltaLake. And it’s not single but multiple parallel and independent processes.
Result. This is the most interesting part of the new solution (from the perspective of writing the code). As mentioned above, the QC automation service is waiting for only one event that the checkup has been completed and a list of defects has been found. And we have a bunch of prediction jobs that output results at different times at different speeds independently and we need to somehow collect these results.

Here we took advantage of the checkpoint feature of the DeltaLake. Tables in Delta Lake can behave like tables and like streams. In order to determine which rows have already been processed and which have not, we use the stream checkpoint (it’s kind of like offset in Kafka). All lines that are after the checkpoint are not yet processed.

In fact we use this feature all the way down the pipeline: beginning from offsets in EventHubs through landing and bronze layers. But collecting results in one message is the case where checkpointing really shines.

We select unprocessed prediction records generated in the previous stage and select only the CheckupId from them — in fact, we get a list of checkups for which the new predictions of the models came. Then we make a join between this CheckupIds list and the whole predictions table and check if all the models have done their work for a certain checkup. If not, then we ignore this checkup further, and if yes, we collect the results of the checkups in an array and send them as an event to EventHubs.

Below is the essential code of this job.

yesterday = datetime.date.today() - datetime.timedelta(days=1)
# here we prepare prediction table to joinstruct_col = F.when(F.col("Infer") != 0, F.col("DefectType")).otherwise(None)table_df = (
    spark.read.format("delta")
    .options(**reader_options)
    .load(source_path)
    .where(
        F.col("EventDate") >= yesterday
    )  # we take only last day results to reduce join cost
    .withColumn("Struct", struct_col)
)


def process_microbatch(microbatch_df, batch_id):
    (
        microbatch_df.select("CheckupId")
        .distinct()
        .join(table_df, "CheckupId")
        .groupBy("CheckupId")
        .agg(
            F.collect_set("DefectType").alias("Infers"),
            F.collect_list("Struct").alias("Defects"),
        )
        .withColumn("Diff", F.array_except(infered_defects, "Infers"))
        .where(F.size(F.col("Diff")) == 0)
    )

(
    spark.readStream.format("delta")
    .options(**reader_options)
    .load(source_path)
    .writeStream.foreachBatch(process_microbatch)
)

Conclusion

Now the defects recognized by the model are displayed in the QC automation service. When opening the checkup, the manager can agree with the prediction, or, in case of a false positive, remove it. There are also such checks, which reveal several defects at once.

Now we are detecting a small part of pizza defects with the help of computer vision, but this allows us to draw the attention of the QC team to those defects that they could have missed earlier.

By metrics:

white pizza bottom (Precision=0.85, Recall=0.94);
white crust edges (Precision=0.76, Recall=0.77);
burnt crust edges (Precision=0.80, Recall=0.82);
the bottom is poorly cut (Precision=0.84, Recall=0.88);
deformed (Precision=0.80, Recall=0.83).