How to classify news articles in the “real world”?

Training an article classifier and putting it in production in 30 days.

Gosia Adamczyk
Axel Springer Tech
8 min read · Apr 21, 2020


Joint work by Gosia Adamczyk (Machine Learning Engineer @ Axel Springer AI) and Helena Peña (Data Scientist @ Upday).

Every day at Upday we serve over 85K news articles to millions of users across Europe. This means we process a lot of textual data in many languages and contexts. In order to connect people with the right content, we need to know what the articles are about: we need to classify them. The existing classifier uses a clever application of Elasticsearch with some business rules on top. It does well, it is fast and we like it. But it has neither a retraining pipeline nor automatic monitoring, and it requires manual maintenance of rules, something that could be avoided by using an ML model.

That’s why we decided to work on a new classifier. In this article, we will talk about our experiences in developing this new classifier from training to putting it into production. We will also explore the data in more detail and provide some interesting insights from it.

Our new classifier should perform as well as the existing one, use an ML algorithm whose choice and calibration are transparent, and include monitoring of classification metrics. With the recent expansion to new target markets, Upday is present in more than 30 countries with more than 20 languages. Therefore, it is also crucial that our algorithms for news personalization and recommendation work well not only with English and German but also with languages that are less explored in the NLP domain, such as Polish, Romanian, Hungarian and Bulgarian.

Article pipeline

One key input to the recommendation algorithms is the category of an article: music, culture, architecture, opinion, food, and so on. Overall, there are more than 80 article categories. Upday aggregates content from more than 5,000 RSS feeds in real time, and each imported article needs to pass through the category classifier before reaching the user.

When an article gets published, the first step in the pipeline is to extract all article components, such as text, title, images, author and URL (figure 1). A category for the article is predicted at the classifier step, and this classification component is what this post is about! Afterwards, the article undergoes further enrichments until it is finally consumed by downstream services and shown in the app.

Figure 1. Article pipeline.

Getting to know the data

In the first iteration, we worked on the English dataset that was labelled specifically for the UK market. Our training set contained around 12K samples, with between 98 and 444 samples per category (imbalanced data). We believed this to be enough to achieve satisfactory performance. Our evaluation set, however, contained only ≈10 samples per category. We considered this too little to confidently assess the quality of our model's predictions. Unfortunately, it was the only dataset we could use to compare our performance with the existing classifier. To gain more confidence in the validation metrics, we decided to set aside ≈2K data points from the training set for an additional evaluation of the model.

Table 1. The training dataset contains textual data: category, title, text (the article's body) and URL, as well as the image attached to the article.
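As a rough illustration, such an extra hold-out can be taken with a stratified split so that rare categories still appear on both sides. This is a minimal sketch, not our actual pipeline; the DataFrame, file name and column name are hypothetical.

```python
# A minimal sketch of the extra hold-out split, assuming the labelled articles
# sit in a pandas DataFrame with a "category" column (hypothetical schema).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("articles_uk_labelled.csv")  # hypothetical file

# Keep ≈2K samples aside for the additional evaluation, stratified by
# category so that rare categories are represented in both parts.
train_df, extra_eval_df = train_test_split(
    df,
    test_size=2000,
    stratify=df["category"],
    random_state=42,
)
print(train_df.shape, extra_eval_df.shape)
```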

In many domains, the age of the training data doesn't affect the success of the project, but in the case of news article classification it surely does. Our data is from 2016. In the last four years the UK's politics, culture and pop culture have gone through dynamic transformations: the Prime Minister changed twice, the new word “Brexit” appeared, and so did “Megxit”, and Liverpool almost won the Premier League after 30 years without the title. This new reality could be a challenge for a model trained on old data, and we had to keep it in mind during experimentation, evaluation in production and the design of future quality monitoring.

Labelling inconsistency

Another problem we identified was inconsistency in the labelling. Table 2 presents three samples of articles about opera (a form of theatre). The two top samples come from the training set and were labelled with the category theatre_stage. The bottom sample comes from our evaluation set and was labelled with the category classical_music.

Table 2. Samples of articles about opera (a form of theatre) from training and evaluation sets.

We believed this inconsistency could confuse our model. To verify this assumption, we generated feature importance charts that show which words in a specific column (feature) push the model (here a logistic regression) towards or away from a specific category. Here you can look up the code we used for it. Figure 2 presents the feature importance charts for the categories classical_music and theatre_stage. The title feature (article title) containing the word “opera” ranks low in the left-hand chart. This means that the appearance of “opera” in the title discourages the model from assigning the label classical_music. On the other hand, the same word present in the article's text or title is one of the strongest signals for the model to classify the sample as theatre_stage.

Figure 2. Feature importance charts for categories: classical_music (left) and theatre_stage (right). Charts created with this code.
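For readers who want the gist without the linked notebook: a linear model's per-class coefficients over TF-IDF features are exactly such word-level importances. The toy titles and labels below are illustrative and this is not the exact code linked above.

```python
# A rough sketch of reading "feature importance" off a linear model:
# rank TF-IDF features by their logistic-regression coefficients.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for article titles from the two categories above.
titles = [
    "opera premiere at the royal opera house",
    "ballet and opera season opens on stage",
    "new symphony recording released",
    "orchestra plays beethoven symphony tonight",
]
labels = ["theatre_stage", "theatre_stage",
          "classical_music", "classical_music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)
clf = LogisticRegression().fit(X, labels)

# For a binary problem coef_ has shape (1, n_features): positive weights push
# the model towards clf.classes_[1], negative ones towards clf.classes_[0].
feature_names = np.array(vectorizer.get_feature_names_out())
coefs = clf.coef_[0]
top_pos = np.argsort(coefs)[-5:][::-1]
top_neg = np.argsort(coefs)[:5]
print(f"words pushing towards {clf.classes_[1]}: {feature_names[top_pos]}")
print(f"words pushing towards {clf.classes_[0]}: {feature_names[top_neg]}")
```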

Needless to say, our model predicted the category theatre_stage for the sample from the evaluation set presented in table 2 above. It also turned out the model was very confident in this prediction. As you can see in table 3, according to our model there was a 98% probability that this article's category should be theatre_stage and only a 0.03% probability that it is classical_music.

Table 3. Prediction probabilities for categories theatre_stage and classical_music for the investigated article.

The opera case was not the only example exposing inconsistency in label assignment. We found similar confusion in other categories. A manual check of 45 misclassified samples (table 4) showed that our model was clearly wrong in only ≈40% of the cases.

Table 4. Results of manual check on 45 misclassified samples from evaluation dataset. Predictions were generated with our baseline model (logistic regression).

In the remaining ≈60%, either the ground truth was assigned incorrectly (and our prediction was actually correct) or both the ground truth and the prediction were equally adequate (ambiguous content, 33%). Through this thorough check, we got to understand our training and evaluation datasets better, and we learned not to blindly trust our evaluation metrics.

Our workflow for selecting a model

Figure 3 shows the steps involved in our workflow. The arrow shows the direction of the data flow in our experiments. Even though the evaluation metric appears as the last step, choosing it is an important decision you should make at the start. When comparing against an existing classifier, you need a metric that is available for both models and can serve as the basis for comparison. For our project, we chose the weighted F1-score as the overall metric, but we also tracked per-category F1-scores to detect problems in the model more precisely.

Figure 3. Workflow of the project.
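With scikit-learn, the headline number and the per-category breakdown are one call each. The labels below are made up purely for illustration.

```python
# A minimal sketch of the metrics we tracked: the weighted F1-score as the
# headline number, plus per-category scores to spot weak categories.
from sklearn.metrics import classification_report, f1_score

# Illustrative ground-truth and predicted labels, not real evaluation data.
y_true = ["music", "politics", "food", "music", "politics", "food"]
y_pred = ["music", "politics", "music", "music", "food", "food"]

print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print(classification_report(y_true, y_pred, zero_division=0))
```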

We experimented with different data cleaning techniques, ranging from simple preprocessors based on regular expressions to language-specific preprocessors that removed stop words and/or stemmed words. Although removing stop words is considered helpful in reducing the dimension of the feature space, in our case it often lowered the performance of the model whilst considerably increasing the preprocessing time. Therefore, we decided to limit preprocessing to simple regex formulas and let our text embedder (TF-IDF) handle the most common words without our assistance. A further iteration could check the effect of removing stop words with a customized, news-specific list.
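The kind of lightweight setup this leads to can look roughly like the following; the regex rules and parameters are illustrative, not our production configuration.

```python
# A minimal sketch: a simple regex clean-up plugged into a TF-IDF vectorizer,
# with no stop-word removal or stemming.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep only alphanumerics
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

vectorizer = TfidfVectorizer(preprocessor=simple_clean, ngram_range=(1, 2))
X = vectorizer.fit_transform([
    "Opera premiere at the Royal Opera House! https://example.com",
    "New symphony recording released today.",
])
print(X.shape)
```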

The models we experimented with were mostly from the family of linear classifiers available in the scikit-learn package. We used Bayesian optimization to find the best model parameters and feature weights, leaving a more exhaustive grid search for the TF-IDF parameters.
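One way such a search can be wired up is sketched below, assuming scikit-optimize's BayesSearchCV for the classifier hyperparameters and scikit-learn's GridSearchCV for the TF-IDF parameters; the parameter ranges are illustrative and this is not our exact setup.

```python
# A sketch of a combined search: Bayesian optimisation over the classifier's
# hyperparameters, a plain grid over the TF-IDF parameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Exhaustive grid over the TF-IDF parameters...
tfidf_search = GridSearchCV(
    pipeline,
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "tfidf__min_df": [1, 2, 5]},
    scoring="f1_weighted",
    cv=3,
)

# ...and Bayesian optimisation over the classifier's regularisation strength.
clf_search = BayesSearchCV(
    pipeline,
    {"clf__C": Real(1e-3, 1e3, prior="log-uniform")},
    n_iter=20,
    scoring="f1_weighted",
    cv=3,
)
# Both objects are then fitted with .fit(texts, labels).
```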

For the BERT classifier, we fine-tuned a base model without much additional effort. In the case of CatBoost, we experimented with a classifier both on the default bag-of-words tokenizer and on TF-IDF embeddings.
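For the CatBoost variant with the built-in tokenizer, the setup can look roughly like this; the toy DataFrame stands in for the real training data and this is a sketch rather than our actual configuration.

```python
# A rough sketch: raw title/text columns declared as text_features in a Pool,
# so CatBoost tokenizes the text itself with its default bag-of-words tokenizer.
import pandas as pd
from catboost import CatBoostClassifier, Pool

train_df = pd.DataFrame({
    "title": ["opera premiere tonight", "team wins the league",
              "new symphony recording", "transfer window rumours"],
    "text": ["the royal opera house opens its new season",
             "a dramatic final match decided the title",
             "the orchestra recorded beethoven's ninth",
             "clubs negotiate over a star striker"],
    "category": ["theatre_stage", "sports", "classical_music", "sports"],
})

train_pool = Pool(
    data=train_df[["title", "text"]],
    label=train_df["category"],
    text_features=["title", "text"],   # let CatBoost tokenize the raw text
)
model = CatBoostClassifier(iterations=100, verbose=0)
model.fit(train_pool)
print(model.predict(Pool(train_df[["title", "text"]],
                         text_features=["title", "text"])))
```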

Jupyter notebooks are a great tool for drafting experiments. We used them to get to know our data, visualize it, and evaluate and interpret (feature explanation) our models. We used MLflow for systematic tracking of our experiments and for keeping track of all the pipeline steps of figure 1. This gave our experiments transparency and allowed us to work and share progress in a team of 2+ data scientists. With a cloud server running MLflow, we collected the model parameters, code, artifacts and metrics of all our experiments.
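A run logged against such a server looks roughly like this; the tracking URI, experiment name and values are placeholders, not our actual configuration.

```python
# A minimal sketch of logging one experiment run to a remote MLflow server.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # placeholder URI
mlflow.set_experiment("category-classifier-en")                 # placeholder name

with mlflow.start_run():
    mlflow.log_params({"model": "logistic_regression",
                       "tfidf_ngram_range": "(1, 2)"})
    mlflow.log_metric("f1_weighted", 0.87)
    # Any local file can be stored alongside the run, e.g. a confusion matrix plot.
    mlflow.log_artifact("confusion_matrix.png")
```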

Selecting the best model

The results of our experimentation can be seen in table 5. The models are ordered by their performance from top to bottom. In our case, the two best performing models were logistic regression and passive-aggressive (both with an F1-score of 0.87). Both scored better than the existing classifier (F1-score 0.83). We learned that complex models don't necessarily outperform classical linear models out of the box. A simple model is also a plus when it comes to productionization, since it responds quickly and can be trained on a relatively small dataset.

Table 5. F1-scores achieved on the evaluation dataset by each of the models participating in our model search phase.
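The comparison itself boils down to fitting each candidate on the same features and scoring it with the same metric. The snippet below sketches this with toy data and two of the linear models mentioned above; it is not the code behind table 5.

```python
# A sketch of a side-by-side comparison: same TF-IDF features, several
# candidate classifiers, one weighted F1-score each. Toy stand-in data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import f1_score

train_texts = ["opera premiere tonight", "election results announced",
               "new symphony released", "parliament votes on budget"]
train_labels = ["culture", "politics", "culture", "politics"]
eval_texts = ["orchestra concert review", "prime minister resigns"]
eval_labels = ["culture", "politics"]

vectorizer = TfidfVectorizer().fit(train_texts)
X_train = vectorizer.transform(train_texts)
X_eval = vectorizer.transform(eval_texts)

candidates = {
    "logistic_regression": LogisticRegression(),
    "passive_aggressive": PassiveAggressiveClassifier(),
}
for name, clf in candidates.items():
    clf.fit(X_train, train_labels)
    score = f1_score(eval_labels, clf.predict(X_eval), average="weighted")
    print(f"{name}: weighted F1 = {score:.2f}")
```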

Conclusion

In 30 days we managed to bring the English classifier into production. In this time we learned to prioritize and to keep ourselves from endlessly digging deeper and deeper. In our model search, we tried to focus on algorithms that showed good performance from the beginning of the experimentation. We also learned that the quality of predictions is not the only aspect that contributes to a good model: for our use case it also has to be lightweight, deliver predictions quickly and meet the application's response-time targets. Moreover, since labelling data is costly and time-consuming, we decided to prioritize algorithms that don't require a lot of training data.

Next steps

In the upcoming months, we plan to scale up to further languages in the article pipeline. In parallel, we are designing the workflow for quality monitoring, relabelling and retraining. These tools should not only help us data scientists automate our workflow in the future, but also be available to the content experts at Upday, allowing them to monitor the quality of the model and adjust the training set.

If you found this article useful, give us a high five 👏🏽 so others can find it too, and share it with your friends. Follow us on Medium (Helena Peña and Gosia Adamczyk) to stay up-to-date with our work. Thanks for reading!

