A self-improving model using partially annotated data

Amir Mizrahi
Zencity Engineering
Oct 31, 2021

One of Zencity’s services is classifying municipal hotline requests — like those made to 311 — according to the city department best suited to meet the request.

Sorting and assigning the requests manually, especially in large cities that receive a high volume of requests, can be expensive, cumbersome, and time-consuming.

Service requests can be made in a number of ways — for example, submitted by the reporting citizen online, or by a hotline operator following a call from a concerned citizen.

The person submitting a request manually selects a Nature of Request (NOR), which then determines the department that should handle the request.

Here’s a real example: Someone reported broken bulbs on the street as “Electronics Debris,” which usually refers to discarded electronic equipment such as microwaves or soundbars. The city team sent to respond to the request was equipped to collect and dispose of these devices, and not to sweep up broken glass. So, in this case, the team arrived at the reported location and had to call a different team instead.

In this case, a slight misclassification of the request by a simple human error caused real-life time and resources to be wasted. Multiply this over an entire city for the various types of misclassifications that can happen, and we have a very expensive problem to solve.

Fortunately, every case is recorded, and the recording includes the final team that resolved the incident and the original NOR that was reported.

In our data set at Zencity, about 70% of the requests have a human-allocated NOR that maps to the final and correct queue. Let’s refer to these requests as “Correct Requests”, and to the other 30% as “Incorrect Requests”.
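To make this split concrete, here is a minimal sketch of how the two sets could be derived from the case records. The field names and the tiny NOR-to-queue map are hypothetical illustrations, not our actual schema.

```python
# Hypothetical many-to-one NOR-to-queue map (illustrative values only).
NOR_TO_QUEUE = {
    "Electronics Debris": "Bulk Waste Collection",
    "Broken Street Light": "Street Maintenance",
    "Pothole": "Street Maintenance",
}

def split_requests(requests):
    """Split requests by whether the reported NOR maps to the queue that
    actually resolved the incident."""
    correct, incorrect = [], []
    for req in requests:
        mapped_queue = NOR_TO_QUEUE.get(req["reported_nor"])
        if mapped_queue == req["resolving_queue"]:
            correct.append(req)
        else:
            incorrect.append(req)
    return correct, incorrect
```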

Figure 1 — NOR-to-Queue mapping example

Now we want to train a model that correctly classifies NOR, so requests will be mapped to the correct queue.

Our next step is to collect all the samples that hit the correct queue and use them for training. Why don’t we predict the queue directly? Because the client has other processes dependent on NOR. How do we know the NOR won’t just be a random one that maps to the correct queue? Hold your horses, I’m getting there.

The problem for us to solve is a classification problem. In each case, we have a textual request description (we’re also classifying images, but this is out of the scope of this post), and we want to classify it into a NOR.

There are many ways to solve this problem, ranging from rule-based systems, through traditional machine learning methods such as SVM and Naive Bayes, to deep learning approaches such as convolutional neural networks (CNNs) and transformers (like the widely used BERT and the rest of Sesame Street).

In this situation, we have a very large number of samples to train on (over a million cases), so we can tap into the potential of deep neural networks (DNNs), which provide good accuracy but require a lot of data. On the other hand, this classification pipeline needs to run in real time as cases come in, so we can’t use DNNs with slow inference, such as BERT. We also need to retrain the network to keep track of the changing life in the city, and transformers are expensive to train on this much data.

This is why we’ve chosen to use a CNN for our problem. We’re using a rather lean one, with just three conv-blocks, each made of a convolutional layer, normalization, and max-pooling (see figure 2 for the full architecture, figure 3 for the conv-block), that we built using Keras. The input text is embedded using GloVe embeddings pre-trained on 27 billion tokens of tweets.
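Here is an illustrative Keras sketch of that architecture; the vocabulary size, sequence length, filter counts, and kernel sizes are placeholder values rather than our exact production hyperparameters.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sketch: an embedding layer initialized with pre-trained GloVe
# vectors, three conv-blocks (Conv1D -> BatchNormalization -> MaxPooling1D),
# and a softmax over NOR classes (queue accuracy is derived later through the
# NOR-to-queue map). All sizes below are placeholders.
VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_NOR_CLASSES = 50_000, 200, 200, 300

# In practice this matrix would be filled with the pre-trained GloVe vectors.
glove_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

def conv_block(x, filters):
    """One conv-block: convolution, normalization, max-pooling."""
    x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling1D(pool_size=2)(x)

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(
    VOCAB_SIZE, EMBED_DIM,
    embeddings_initializer=keras.initializers.Constant(glove_matrix),
    trainable=False,
)(inputs)
for filters in (128, 128, 128):
    x = conv_block(x, filters)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(NUM_NOR_CLASSES, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```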

Figure 2 — CNN for Queue classification
Figure 3 — Conv-Block for the CNN in Figure 2

Training the model using a single V100 Nvidia GPU takes around 2 minutes per epoch, and the training converges after 8 epochs.

For the purpose of comparing correct and incorrect cases, we will define and use two types of test sets. One test set is made of 20K correct cases. Let’s call this the correct test set. The other test set is made of 20K incorrect cases. Appropriately, let’s call this the incorrect test set.

When we train our CNN on correct requests and test it on the correct test set, classifying the queue based on the request description, we get a decent 88.7% accuracy. What happens when we test on incorrect requests? We only get 57.1% accuracy! Let’s call this model the “Initial model.”

This could mean that the distributions of incorrect and correct request descriptions are different. We’ve plotted the distribution of queue classes in figure 4 to visualize the difference. For our model to be useful, we must classify incorrect requests better.

Figure 4 — A histogram of the queue distribution. On the left, the density histogram, which is the counts normalized by the total number of requests. On the right, the same histogram on a log scale. We see here that both the common and the uncommon queues are distributed differently between correct and incorrect requests.
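For reference, here is a sketch of how such a comparison could be plotted with matplotlib, assuming integer-encoded queue labels for both sets; the function and argument names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of the comparison in Figure 4: density histograms of the queue labels
# for correct vs. incorrect requests, on a linear and on a log scale.
# correct_queue_ids / incorrect_queue_ids are assumed to be integer-encoded
# queue labels for the two sets.
def plot_queue_distributions(correct_queue_ids, incorrect_queue_ids, num_queues):
    bins = np.arange(num_queues + 1)
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, log in zip(axes, (False, True)):
        ax.hist([correct_queue_ids, incorrect_queue_ids], bins=bins,
                density=True, log=log, label=["correct", "incorrect"])
        ax.set_xlabel("queue")
        ax.set_ylabel("density")
        ax.legend()
    axes[1].set_title("log scale")
    plt.show()
```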

Our next step is to use the incorrect requests in our training process. We’ve already established that we can’t use these requests as is, because the NOR in these requests maps to the incorrect queue. These requests have a valuable request description distribution that we want to learn, so it’s worthwhile figuring out a way to include them in our training process.

We did so by recruiting the help of our initial model in training our complete model!

With the assumption that the initial model has some idea about classifying NOR, we used it to get the prediction probabilities for each incorrect request and picked the highest-probability NOR that maps to the correct queue of that request.

We then took these incorrect requests — with their newly assigned NOR — and used them to enrich our training set.
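Here is a minimal sketch of that relabeling step, assuming the initial model outputs a probability per NOR class; the variable names are illustrative, not our actual code.

```python
import numpy as np

# For each incorrect request, keep the NOR class the initial model ranks
# highest among the NORs that map to the request's known correct queue.
def relabel_incorrect(initial_model, encoded_texts, correct_queues,
                      nor_classes, nor_to_queue):
    """encoded_texts: request descriptions already encoded as model inputs;
    correct_queues: the queue that actually resolved each request."""
    probs = initial_model.predict(encoded_texts)  # shape: (n_samples, n_nor_classes)
    new_labels = []
    for p, queue in zip(probs, correct_queues):
        # Consider only NOR classes consistent with the resolving queue.
        candidates = [i for i, nor in enumerate(nor_classes)
                      if nor_to_queue[nor] == queue]
        new_labels.append(max(candidates, key=lambda i: p[i]))
    return np.array(new_labels)
```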

The new model gets us to 71.2% queue accuracy on the incorrect test set — a 14.1-percentage-point increase! The queue accuracy on the correct test set slightly decreased by 0.5 points to 88.2%, so learning the incorrect request distribution came at a small expense of the correct request distribution, as one would expect. We also want to test the new model on a test set that represents the real distribution of the data — 30% incorrect cases and 70% correct cases — sampled from the incorrect and correct test sets, for a total of 20k requests. Our queue accuracy improved from 78.8% for the initial model to 82.1% with the new model. (See table 1 and figure 5.)
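For completeness, here is a sketch of how such a 70/30 mixed test set could be sampled; the function and argument names are illustrative.

```python
import random

# Sample a test set that mirrors the real-world mix:
# 70% correct and 30% incorrect requests, 20k samples total.
def sample_mixed_test(correct_test, incorrect_test, n_total=20_000, seed=42):
    """correct_test / incorrect_test: held-out request records."""
    rng = random.Random(seed)
    mixed = (rng.sample(correct_test, int(0.7 * n_total))
             + rng.sample(incorrect_test, int(0.3 * n_total)))
    rng.shuffle(mixed)
    return mixed
```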

Figure 5 — Queue accuracy on the initial model and the new model
Table 1 — Accuracies on Queue classification task

The above accuracies only mean that the model successfully learned how to classify requests to their correct queue. What does this mean regarding NOR accuracy? Does the model even learn that NORs are different from one another, or does it just internally cluster them together and give us the most common one that maps to the correct queue?

When we assigned our incorrect cases a corrected NOR, we could have just selected an arbitrary NOR that maps to the correct queue (“arbitrary new model”) and skipped using the initial model. To test what happens in this case, we’re going to conduct another experiment using the correct test set (because it’s the only set with a reliable NOR baseline) and measure the NOR accuracy. See figure 6 and table 2.
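A sketch of that arbitrary baseline, reusing the illustrative names from the relabeling snippet above:

```python
import random

# "Arbitrary new model" baseline: instead of asking the initial model,
# pick any NOR that maps to the request's correct queue at random.
def relabel_arbitrary(correct_queues, nor_classes, nor_to_queue, seed=0):
    rng = random.Random(seed)
    new_labels = []
    for queue in correct_queues:
        candidates = [i for i, nor in enumerate(nor_classes)
                      if nor_to_queue[nor] == queue]
        new_labels.append(rng.choice(candidates))
    return new_labels
```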

Table 2 — NOR and Queue accuracies on the correct test set
Figure 6 — NOR and Queue accuracies on the correct test set

We observe that the model has learned an internal representation of the queues, and the quality of that representation with regard to the queue classification task remains stable even when we harm the correctness of the NOR labels.

Using the initial model to train the new model results in better NOR accuracy, compared to arbitrary NOR selection.

In conclusion,

We’ve trained an initial model that classifies NOR based on the request description, supervised through the queue with the help of a many-to-one NOR-to-queue map. The training set consisted of requests where the requester chose a NOR that maps to the correct queue.

We then improved this model by expanding the training set to include the requests where the requester chose an incorrect NOR, filling in their NOR using the initial model.

Finally, we’ve shown that leveraging the initial model to fill in NOR when expanding the training set, rather than choosing a NOR arbitrarily, is better in terms of NOR classification accuracy.
