How to use machine learning for the classification of citizen service requests

In many cities around the world, local governments offer a service through which citizens can submit requests (in the United States this is called 311), for example to complain about garbage on the street or to report a nuisance. These reports can often be made by phone or through a web form, by writing a short text, selecting a location, and selecting a category. The chosen category matters for how fast the problem is resolved: when the wrong category is selected, the issue may be sent to the wrong department, delaying its resolution.

The category can be selected automatically by applying supervised machine learning to historical service requests, so that only a text needs to be entered. The City of Amsterdam uses this method to detect the class of a report and route it to the correct department. In this Medium story I will describe how to build a simple yet effective text classifier capable of routing citizen service requests.

The web form used in Amsterdam to make a service request, no longer requiring the selection of a category.

The classifier used in Amsterdam distinguishes 8 main classes and over 50 sub-classes. It was trained on a data set of over 500,000 citizen reports. An example of how the classifications are made can be found here. The classifier detects the main category very accurately, with a macro-averaged F1 score of 0.88. Performance is best for the larger classes, but the smaller classes are also classified correctly most of the time.

A screenshot of the demo showing how (Dutch) text is classified; the demo can be found here.

A GitHub repository is available with the code to create a text classifier for service requests using Python and scikit-learn. The data used for the classifier in the demo and the 311 system is not publicly available, since the reports may contain privacy-sensitive information. To train a classifier, enough examples should be available for each class; what counts as enough depends on many factors, and the best way to find out is to build the classifier and evaluate it.

The first step in creating the classifier is loading the data. After loading, the data has to be split into a train set, for training the classifier, and a test set, for evaluating it. For the extremely small example data set a 50/50 split is used; for a larger data set it is often enough to use a smaller percentage for the test set, for example an 80/20 train/test split. Shuffling the data before splitting is also advisable for a larger data set, and can easily be done with scikit-learn's train_test_split function.
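The loading and splitting step can be sketched as follows. The real data set is not public, so a tiny inline set of made-up reports and category names stands in for it here:

```python
from sklearn.model_selection import train_test_split

# Made-up example reports and categories; in the real project these would
# be loaded from the (non-public) data set.
texts = [
    "garbage bags next to the container",
    "broken street light on the corner",
    "loud music from the neighbours at night",
    "overflowing litter bin in the park",
    "car parked on the bicycle path",
    "rats near the garbage containers",
]
labels = ["garbage", "public space", "nuisance",
          "garbage", "parking", "pests"]

# 50/50 split as used for the small example set; train_test_split shuffles
# the data by default, which matters for larger, ordered data sets.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, random_state=42)
```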

Loading the data and splitting it into a train and test set.

After loading the data, it is time to create a classifier. For this we will use TF-IDF and logistic regression. Other methods (word2vec, CNN+LSTM, BERT) have been implemented, but their performance was not better when comparing macro F1-scores.

TF-IDF is short for term frequency-inverse document frequency. This representation assigns weights to words that reflect how distinctive they are for a specific citizen report compared to the overall collection. A word like 'the' will get a low weight, while a word like 'garbage' will get a higher weight. This makes it well suited for classes that are described by very specific words. It also prevents unigrams and bigrams that occur in nearly all documents (like "please" or "thank you") from affecting the classification too much.

The classification itself is done with logistic regression; when evaluated on the large data set, this performed best across all classes.

The hyperparameters are optimized using a cross-validated grid search, which essentially tries all combinations of the given parameter values and evaluates performance on held-out parts of the train set. More parameters can be found in the scikit-learn documentation.
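The pipeline and grid search can be sketched like this. A tiny inline data set stands in for the training split, and the parameter values in the grid are illustrative, not the ones used in Amsterdam:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up training examples standing in for the real (non-public) data.
train_texts = [
    "garbage bags on the sidewalk", "overflowing litter bin",
    "rats near the garbage containers", "trash next to the container",
    "loud music at night", "noisy neighbours every evening",
    "barking dog keeps us awake", "construction noise before 7am",
]
train_labels = ["garbage"] * 4 + ["nuisance"] * 4

# TF-IDF features followed by a logistic regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: every combination is tried and scored with
# cross-validated macro F1.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs uni+bigrams
    "clf__C": [0.1, 1.0, 10.0],              # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(train_texts, train_labels)
model = search.best_estimator_
```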

Creating the classifier by optimizing hyperparameters using a cross-validated grid search.

After creating the model, it can be saved and loaded. This allows the model to be stored on disk and used later to predict the classes of new texts.
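Persisting the model can be done with joblib; a small fitted pipeline (trained on two made-up reports) stands in for the grid-searched classifier here:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained model; normally this is the grid search result.
model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(["garbage on the street", "loud music at night"],
          ["garbage", "nuisance"])

joblib.dump(model, "model.joblib")    # write the fitted model to disk
loaded = joblib.load("model.joblib")  # load it back, e.g. in an API process
prediction = loaded.predict(["garbage bags on the sidewalk"])
```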

Saving and loading the classifier.

A simple evaluation of the model can be done using the metrics functions built into scikit-learn. In the code below, precision, recall, and accuracy are calculated by comparing predictions made on the test texts with the test labels. To assess the performance of individual classes, the classification_report function can be used, and for a clear visualization of confusion between classes a confusion matrix can be created.
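The evaluation step can be sketched as follows, with a small fitted pipeline and made-up test texts standing in for the full model and test split:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)

# Stand-in model trained on a few made-up reports.
model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(["garbage on the street", "rats near the garbage containers",
           "loud music at night", "noisy neighbours every evening"],
          ["garbage", "garbage", "nuisance", "nuisance"])

test_texts = ["garbage bags on the sidewalk", "loud barking all evening"]
test_labels = ["garbage", "nuisance"]
predictions = model.predict(test_texts)

# Overall scores: compare predictions against the true test labels.
print(accuracy_score(test_labels, predictions))
print(precision_score(test_labels, predictions, average="macro"))
print(recall_score(test_labels, predictions, average="macro"))

# Per-class scores and confusion between classes.
print(classification_report(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))
```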

Evaluation of the classifier.

You have now created a model that can classify citizen service requests. Hopefully the performance is good. If you are interested in how to deploy this model in a web form or a demo application, let me know, and the next story will be about how to create a Flask API that takes text as input and returns a prediction as output, so the classifier can be deployed in a web application or a demo like this.