Text classification on local newspapers articles

Guillaume Barrois
Published in Explain AI
Nov 12, 2019 · 10 min read

Companies that carry out infrastructure projects (wind farms, high-voltage lines, quarries, etc.) are very interested in building deep knowledge of the local context of their projects: what do people think about renewable energy or the environment, who are the potential supporters and opponents of their projects, what are the main points of discussion?

At eXplain, we apply data science to understand public opinion and government at a local level. One of our pretty cool sources of data is local newspapers: in a country such as France, several million local press articles are published every year, covering every location and every topic.

A human reader can learn a lot about these issues by reading recent and older local newspapers. But French local newspaper archive databases contain over 100 million articles. The technical challenge is therefore to identify, among them, the few articles that are relevant.

Our task: selecting 300 relevant articles among a database of 100M

This post details one of the very first steps: selecting a curated list of articles relevant to, for instance, an infrastructure company wishing to build a new wind farm in the Northeast of France. The core of eXplain’s value proposition is to bring our clients all the relevant information on a topic at a very local level. But the problem is that only a small subset of articles contains information that is truly relevant to a given client. For instance, a wind farm operator will be very interested in articles such as this one:

An article about protests against a wind farm

We know for sure it’ll care less about this one

An article about a series of robberies

But will it care about this one?

An article about oil pollution in a French river

Typically, only about 0.01% of the articles published by local newspapers in a client’s geographic area of interest are considered relevant by our clients and should be displayed. This puts a strong emphasis on our ability to select the relevant documents for our clients. And because the total volume of published articles is huge, we obviously cannot afford to do it manually.

A standard multi-label text classification problem

Breaking down what exactly it means for an article to be “relevant” to a client is complex and interesting, but the very first approximation we make is: an article is relevant for a user if it is about an important topic given the client’s business activities. For a mining company, this means topics such as natural resources, infrastructure, the environment, wildlife and biodiversity, road traffic (mines generate a lot of truck traffic), tourism (mines are not usually considered attractive), etc.

Once the list of topics has been chosen, our job is to find articles that are about these topics. This can be formulated as a pretty straightforward multi-label text classification problem (https://en.wikipedia.org/wiki/Multi-label_classification): a text can be assigned to several classes, because some topics overlap (e.g. Wildlife and Biodiversity) or some classes are subsets of others (e.g. Renewable energy and Wind energy).

We treated each topic as a binary classification problem independent from the others and trained a separate model for each. The advantage of this approach is that it is not necessary to retrain everything when a class is modified, added or removed. The drawback is that the dataset for each topic is small. In the future, we will investigate another approach, using a single multi-label classifier.
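Schematically, the per-topic setup could look like the following sketch. This is illustrative, not our production code: the function names are hypothetical and the TF-IDF + logistic regression pipeline is just a placeholder for the models discussed below.

```python
# Sketch: one independent binary classifier per topic (illustrative names and models).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_topic_classifiers(datasets):
    """datasets maps a topic name to (texts, labels), with labels in {0, 1}."""
    models = {}
    for topic, (texts, labels) in datasets.items():
        # Each topic gets its own pipeline, so a topic can be added, modified
        # or removed without retraining the classifiers of the other topics.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        models[topic] = model
    return models


def predict_topics(models, article_text):
    """Return the topics whose classifier flags the article (multi-label output)."""
    return [topic for topic, model in models.items()
            if model.predict([article_text])[0] == 1]
```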

How to build a training set with only 0.01% positives

Obviously, a key for such a method to succeed is being able to gather a large enough training set. This can be tricky for us because, as mentioned earlier in this post, our global dataset is very imbalanced: for a given topic, the number of negative examples is about 10,000x larger than the number of positives. Therefore, it is not possible to build a dataset by randomly sampling articles from the press and labelling them.

To tackle this issue, we decided to proceed in the two steps described below. The idea is to implement a type of semi-supervised learning, as described in Elkan and Noto [1].

First “naïve” dataset

For a given topic, we implemented a coarse classification method, as follows:

  • we built a lexicon of regular expressions, bigrams or trigrams directly related to the topic: for instance, for a topic such as Biodiversity, the lexicon was something like [“biodiversity”, “biosphere”, “protect wildlife”, “protect bird”, “species protection”];
  • if one of the elements of the lexicon is present in the text of an article, the article is considered positive.

This method has obvious limits: since the lexicon is built manually, it can miss important aspects of a topic, it is sensitive to ambiguity, etc. However, it provides an easy way to build a set of probably positive elements, which can be completed with randomly drawn articles to create a more balanced dataset.

Probably positive elements are articles detected using the lexicon-based approach. Among them, a majority are positive but some are negative.
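To make the coarse pre-filtering concrete, here is a minimal sketch. The lexicon, the function names and the number of randomly drawn articles are illustrative (and in practice the lexicon terms are in French).

```python
import random
import re

# Illustrative lexicon for the Biodiversity topic.
LEXICON = ["biodiversity", "biosphere", "protect wildlife", "protect bird", "species protection"]
PATTERNS = [re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE) for term in LEXICON]


def is_probably_positive(text):
    """Coarse rule: an article matching any lexicon entry is a 'probably positive'."""
    return any(pattern.search(text) for pattern in PATTERNS)


def build_candidate_set(articles, n_random=1000, seed=0):
    """Mix lexicon hits with randomly drawn articles to get a roughly balanced set to annotate."""
    probably_positive = [article for article in articles if is_probably_positive(article)]
    random_draw = random.Random(seed).sample(articles, min(n_random, len(articles)))
    return probably_positive + random_draw
```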

Manual labelling

The previous dataset was built to be approximately balanced; however, it still needs to be labelled. To do so, our tool of choice was Prodigy from explosion.ai. This is an annotation tool that implements active learning methods [2]: it streams the articles to annotate depending on how certain it is of the label it predicts.

For the construction of our first training set, we do not actually benefit from this feature, because we want to attribute the correct label to all the articles in our dataset. However, it will be a good tool to further enrich this first set.

Topics are tricky: is this article really about tourism?

Accurately labelling articles is actually not as easy a task as it seems: defining what does or does not fall within the boundaries of a specific topic can be subjective, and it is therefore important to specify precisely which kinds of articles should be considered positive and which should not. As an example, let’s consider the topic of Tourism. For our manual labelling, we specified the following rules:

  • We are interested only in the economic aspect of tourism in a territory (attractiveness of the territory, tourism industry…)
This article talks about the consequences of a weather event on tourism and the local economy, and can therefore be relevant for our clients.
  • An article that describes a touristic site, such as a new museum opening or a traditional festival, ought not be included
  • An article that gives the time and date of touristic events, such as visits or exhibitions, ought not be included
This article describes a specific touristic event and is therefore not relevant.

Organizing and monitoring large-scale manual labelling is an important and difficult topic in itself, and we won’t dig into it in this article; however, anyone who wants to take it on should plan it carefully.

After the manual labelling phase we have a dataset that can be fed to the model.

But wait: are we so sure that rule-based systems wouldn’t work?

If we step back, it seems rule-based systems could work. Topics make intuitive sense, and people understand what it means for an article to be about a given topic: it therefore seems feasible to construct a lexicon that captures this.

In the case of Wind power, a lexicon-based approach seems reasonable: the associated vocabulary is quite specific and there are few articles mentioning the word windmill that are not about wind power (although it can happen: we once found an article describing a paper-windmill construction workshop in an elementary school). However, in the case of Biodiversity, it is not as clear: the associated vocabulary is broader, and there are more ambiguities and context dependencies. It is therefore harder to define a vocabulary that is comprehensive and specific enough to capture the different dimensions of the topic without becoming too large and producing false positives.

A text classification approach would solve this issue: with the right features to capture the global context and the word distribution of articles, it is possible to learn from a training set the fine interactions and combinations that constitute a topic, and to reduce the risks associated with word ambiguity and polysemy.

As we will see later, this qualitative intuition is confirmed when we evaluate the different methods.

Playing around with scikit-learn to find the right vectorizer+classifier combo

In the literature, this kind of problem is generally tackled with a two-phase approach: text vectorization, followed by classification using a binary classifier [3]. For a given topic, the chosen model is therefore a combination of a vectorization method and a classifier.

Illustration of a classic text classification pipeline

Vectorization

The vectorization step is important because the vector representation of a text should capture the features that are sufficient to classify the text accurately. We implemented three common ways to vectorize text:

  • Bag of words + dimension reduction (SVD)
  • TF-IDF + dimension reduction (SVD)
  • w2v + TF-IDF weighted average.

For Bag of words, TF-IDF and SVD, we used the versions implemented in scikit-learn, while for the word2vec vectorizer we used the pre-trained French embeddings from fastText (https://fasttext.cc/docs/en/crawl-vectors.html).
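The first two vectorizers map directly onto scikit-learn building blocks; the third needs a small custom transformer. The sketch below shows one possible implementation of the IDF-weighted average of word vectors: the `word_vectors` mapping (word to vector) is assumed to be built beforehand from the fastText French vectors, and the names and dimensions are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Bag of words / TF-IDF followed by a dimension reduction, straight from scikit-learn.
bow_svd = make_pipeline(CountVectorizer(), TruncatedSVD(n_components=300))
tfidf_svd = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=300))


class TfidfWeightedWord2Vec(BaseEstimator, TransformerMixin):
    """Average pre-trained word vectors, weighting each token by its IDF.

    `word_vectors` is a mapping word -> np.ndarray, e.g. built from the
    fastText French vectors (cc.fr.300.vec); loading it is left out here."""

    def __init__(self, word_vectors, dim=300):
        self.word_vectors = word_vectors
        self.dim = dim

    def fit(self, texts, y=None):
        self.tfidf_ = TfidfVectorizer().fit(texts)
        self.idf_ = dict(zip(self.tfidf_.get_feature_names_out(), self.tfidf_.idf_))
        return self

    def transform(self, texts):
        analyzer = self.tfidf_.build_analyzer()
        rows = []
        for text in texts:
            tokens = [t for t in analyzer(text) if t in self.word_vectors]
            if not tokens:
                rows.append(np.zeros(self.dim))
                continue
            weights = np.array([self.idf_.get(t, 1.0) for t in tokens])
            vectors = np.array([self.word_vectors[t] for t in tokens])
            rows.append((weights[:, None] * vectors).sum(axis=0) / weights.sum())
        return np.vstack(rows)


# The three candidate vectorizers could then be gathered in a mapping, e.g.:
# VECTORIZERS = {"bow_svd": bow_svd, "tfidf_svd": tfidf_svd,
#                "w2v_tfidf": TfidfWeightedWord2Vec(word_vectors)}
```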

Classification

Thanks to scikit-learn, it is relatively easy to test different classifiers for a given topic classification task. We therefore decided to test four common binary classifiers:

  • Logistic regression
  • Naïve Bayes
  • Support Vector Machine
  • K-Nearest Neighbours
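For illustration, these four candidates could be instantiated as follows; the hyperparameters are placeholders, not the ones used in our library.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    # GaussianNB is used in this sketch because it accepts the dense (and
    # possibly negative) features produced by SVD or by averaged embeddings.
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
}
```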

Evaluation and model choice

With three vectorization methods and four possible classifiers, we had a total of 12 models that we could use to tag articles for a given topic. Because our datasets are quite small (see below), training is fast enough to allow us to benchmark the 12 models. We therefore implemented in our library an automatic training, evaluation and model selection phase.

Evaluation was done by computing the F1 score using cross-validation on our labelled dataset. The core of eXplain’s value proposition is to bring our clients all the relevant information on a topic at a local level, so we also set a threshold for recall, below which a model was not considered acceptable. The model (combination of a vectorization method and a trained classifier) with the best score was then chosen.
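A possible sketch of this benchmark-and-select step is shown below; the 0.8 recall threshold is a placeholder, and the function would be called with mappings such as the VECTORIZERS and CLASSIFIERS sketched earlier.

```python
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline


def select_best_model(texts, labels, vectorizers, classifiers, min_recall=0.8, cv=5):
    """Cross-validate every vectorizer/classifier pair and keep the best F1
    among the models reaching the recall threshold."""
    best_name, best_f1, best_model = None, -1.0, None
    for vec_name, vectorizer in vectorizers.items():
        for clf_name, classifier in classifiers.items():
            model = make_pipeline(vectorizer, classifier)
            scores = cross_validate(model, texts, labels, cv=cv, scoring=["f1", "recall"])
            f1 = scores["test_f1"].mean()
            recall = scores["test_recall"].mean()
            # Models that miss too many relevant articles are discarded outright.
            if recall >= min_recall and f1 > best_f1:
                best_name, best_f1, best_model = (vec_name, clf_name), f1, model
    return best_name, best_f1, best_model
```

The winning (still unfitted) pipeline would then be refit on the full labelled dataset before being used to tag new articles.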

The complete process is summarized in the figure below.

Results

The method was evaluated for two topics, Biodiversity and Tourism. Below is a table presenting the results of the three best models for each topic.

Results for the two topics and the different models

Overall, we found the results very satisfactory. By comparison, the purely lexicon-based approach leads to F1 scores of 0.36 and 0.20 for Biodiversity and Tourism, respectively.

Interestingly enough, the results are very different for the two topics. We can find two explanations for this:

  • Some topics, such as Biodiversity, have a very specific vocabulary, which makes them easy to identify. For Tourism, this is less the case, so the classification task is more complex (we see this in the results of both the ML-based and the lexicon-based approaches).
  • Even with very precise guidelines, the boundaries of some topics can still be ambiguous and blurry: this can lead to differences in interpretation and labelling, depending on the subjectivity of the annotator. In that case, the classifier can only be as good as the training set.

Main takeaways

  • Even with a relatively small training corpus, text classification methods systematically outperform lexicon / rule-based approaches.
  • The NLP ecosystem, particularly in Python, makes it possible to quickly implement and test many different vectorizers and classifiers. This is key in our case because, as the topics are heterogeneous, the right model is not always the same.
  • A key success factor for the project was being able to build a large labelled dataset. Using the right labelling tool (Prodigy in our case) and carefully preparing the manual labelling task, with clear guidelines and a pre-filtered dataset, were essential.
  • Because of scalability and processing-time constraints, we decided not to consider more recent and advanced deep learning methods, for instance language models such as BERT. The results as they are are very satisfactory, but this is definitely something we will look into in the future.

Bibliography

[1] Elkan, C., & Noto, K. (2008, August). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 213–220). ACM.

[2] Montani, I., & Honnibal, M. (2019, to appear). Prodigy: A new annotation tool for radically efficient machine teaching. Artificial Intelligence.

[3] Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019). Text classification algorithms: A survey. Information, 10(4), 150.
