Create your own NLP pipeline

How to transform unstructured data into practical tools

Ivo Merchiers
VectrConsulting
Nov 28, 2018

This article will show how you can use NLP algorithms to extract value from raw text data. In particular, we’ll describe the pipeline that we used to win the NLP4GOV hackathon as best start-up. In a previous post, Ignaz Wanders already explained the high-level approach, which used knowledge graphs to provide additional insights.

This post takes a different approach and focuses on some of the more technical aspects. We won’t delve (too) deeply into the nitty-gritty code details, but rather highlight the most important steps in our data pipeline. Along the way, we’ll show which tools we used and suggest some tips and packages that might be useful for your own projects.

The hackathon

Let’s start with a short overview of what our hackathon case entailed. The Flemish agency for innovation and entrepreneurship (VLAIO) has a large dataset of grant applications. Apart from some metadata tags, these grants consist entirely of raw, unstructured text. It is now up to VLAIO to answer questions such as:

`How many grants were labelled as machine learning/energy/…?`

Apart from a handful of topics that were already being tracked, answering such a question for a new topic means rereading every single grant. Of course this approach is not sustainable, especially since their database keeps growing.

Some of these topics are extremely specific, with very specialized vocabulary; think, for example, of grants for genetic research. Because of this specialized nature, it is very hard to use out-of-the-box solutions. Instead, it is the ideal playground for Natural Language Processing (NLP) tools, which can automatically adapt to the nature of the text.

The NLP pipeline

We used a rather straightforward data flow. We start from the data provided to us, preprocess it to remove any errors and then transform it into a representation suited for machine learning. Once we have our new data format, the fun begins and we can apply some fancier data science techniques.

Loading data

The dataset was stored as XML following a predefined schema. Python’s built-in xml.etree.ElementTree module allowed us to load the data. As the name indicates, it returns a tree representation of the XML file, which is not well suited for our NLP tools.

Luckily, we can use some of Python’s built-in functional-style constructs to transform this into a more workable format. The following code snippet transforms our tree into a generator of dictionaries.
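
A minimal sketch of what that snippet might look like, assuming a file called grants.xml in which each grant is a direct child of the XML root with its fields as sub-elements (both the file name and the structure are illustrative, not VLAIO’s actual schema):

```python
import xml.etree.ElementTree as ET

# Illustrative file name; the real dataset follows VLAIO's predefined schema.
tree = ET.parse('grants.xml')
root = tree.getroot()

# One dictionary per grant, mapping each sub-element's tag to its text content.
# A generator expression keeps this lazy until we actually need the data.
grant_dicts = ({field.tag: field.text for field in grant} for grant in root)
```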

The keys and values in these dictionaries are things such as 'year': 2015 or 'title': '...'. Turning this generator into a list means it can be converted directly into a pandas dataframe, which is much more user-friendly.
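
Continuing the sketch above, that conversion is a one-liner:

```python
import pandas as pd

# pandas accepts a list of dictionaries directly and infers the columns from the keys.
df = pd.DataFrame(list(grant_dicts))
```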

Preprocessing data

For this hackathon we had the joy of working with relatively clean data, so not much preprocessing was needed. The obvious steps, such as dropping empty values and dealing with weird characters, were performed immediately. Additionally, duplicate entries were identified and removed.
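
In pandas these steps stay close to one-liners. A sketch, where the column name description is a stand-in for the actual text field:

```python
# Drop grants without text, collapse runs of whitespace and stray control
# characters, and remove exact duplicates. 'description' is a hypothetical column name.
df = df.dropna(subset=['description'])
df['description'] = df['description'].str.replace(r'\s+', ' ', regex=True).str.strip()
df = df.drop_duplicates(subset=['description'])
```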

A more involved (and typically Belgian) challenge was the use of multiple languages throughout the dataset. Since the language was not provided as a tag, an alternative solution was needed. The langdetect package was very helpful for this purpose.
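
A sketch of how langdetect can be applied per grant; the description column and the set of languages to keep are assumptions for illustration:

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed

df['language'] = df['description'].apply(detect)
df = df[df['language'].isin(['nl', 'fr', 'en'])]  # keep only the expected languages
```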

Quite remarkably, it labeled a few texts as Afrikaans! Closer inspection showed that these grants actually contained badly formatted text and had to be removed. This once again illustrates how important it is to constantly re-evaluate and test what you actually know about the data.

Selecting a data representation

Now that the data is cleaned and nicely structured, we can choose how to actually represent these texts. For this hackathon, we opted for a tried-and-tested method, namely TF-IDF vectorization as implemented in the scikit-learn package. Apart from its generally good performance, this algorithm has the advantage of easily interpretable parameters and language agnosticism.
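
A minimal version with scikit-learn; the min_df/max_df values are illustrative rather than the ones we actually tuned:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore words that are too rare or too common; the exact thresholds are just an example.
vectorizer = TfidfVectorizer(min_df=5, max_df=0.8)
tfidf_matrix = vectorizer.fit_transform(df['description'])
```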

The main drawback of such a representation is that it focuses on individual words rather than sentences. As a result, much information about the grammatical properties of the words is lost. Other methods, such as word embeddings or smarter text tokenization, could greatly improve the predictive power.

Training models

We developed two approaches for the hackathon use case.

The first solution interprets text fragments and automatically suggests topics. This is extremely useful for better understanding the data and adding general labels to it. Non-negative matrix factorization is a very powerful method that automatically detects and extracts patterns from text.
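
A rough sketch using scikit-learn’s NMF implementation; the number of components echoes the roughly 50 topics mentioned below, and the top-words loop is only there to make the extracted patterns inspectable:

```python
from sklearn.decomposition import NMF

# Factorize the TF-IDF matrix into topic components (50 is illustrative).
nmf = NMF(n_components=50, random_state=0)
doc_topics = nmf.fit_transform(tfidf_matrix)  # topic weights per grant

# Print the ten heaviest words of the first few topics to help name them.
terms = vectorizer.get_feature_names_out()
for idx, component in enumerate(nmf.components_[:3]):
    top_words = [terms[i] for i in component.argsort()[::-1][:10]]
    print(idx, top_words)
```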

Distribution for the topics that were automatically extracted from the dataset.

Unfortunately, this doesn’t mean the algorithm actually understands what the patterns mean. So it is still up to us to find good descriptive terms that properly match those topics. In this way, labelling thousands of texts has been reduced to naming a set of roughly 50 good descriptive topics!

Now imagine that you want to find grants relating to a rare or emerging topic. Although the previous system already suggests interesting topics, it’s very likely that your specific topic is not yet one of them. This is where the second solution comes into play.

This one is a bit trickier, though, and suffers from a serious cold-start problem. As the topic is new, we don’t have any labeled training data available. And as a user, I’m pretty sure you don’t want to start reading grants at random and marking whether or not they match your definition.

Connecting to linked data

So our approach turns to another source of data for its initial input: a knowledge source that all of us use regularly, namely Wikipedia. A user selects a relevant Wikipedia article, and the algorithm ranks the most similar grants from the dataset.

An extra advantage of this approach is that it provides a natural way to connect your dataset to the semantic web. This means that future extensions are not limited to looking at one article, but can also inspect relevant linked articles.

To perform this similarity ranking, we encoded the Wikipedia article with the TF-IDF vectorizer trained on our own texts. This ensures that only relevant words are selected, but it also requires at least a partial vocabulary overlap (synonyms are not captured). We then calculate the cosine similarity between the Wikipedia article and each text. In code this is as simple as it sounds.
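
A sketch under the same assumptions as before, where wiki_text stands in for the plain text of the selected Wikipedia article (fetching it is not shown):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Encode the Wikipedia article with the vectorizer fitted on the grant texts,
# then rank every grant by its cosine similarity to that article.
wiki_vector = vectorizer.transform([wiki_text])
similarities = cosine_similarity(wiki_vector, tfidf_matrix).ravel()
ranking = similarities.argsort()[::-1]  # grant indices, most similar first
```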

Instead of labeling grants at random, users then start off by checking the grants that are most likely to match. In our case, this meant that instead of only a 7% chance of randomly finding a good grant, users now had a 50% chance. Although far from perfect, this increases the initial efficiency roughly sevenfold and provides a good answer to the cold-start problem. After using this approach for a while, enough data has been labeled to switch to a more classical supervised learning approach.

Wrap-up

The methods used throughout our project were kept simple on purpose. However, they are remarkably effective and their simplicity enables users to easily understand and interpret the techniques, which is exactly what you need for a proof of concept. Further iterations can then increase performance and enable the use of newer, better algorithms. As such, working agile is not limited to software development, but has a very important place in data science as well.

Additionally, connecting to external data can be extremely rewarding. In our case it helped to avoid a cold start, but it could also provide extra features down the pipeline. Capitalizing on the linked nature of Wikipedia can automatically bring extra data to the workflow.

Acknowledgements

This project is a result of the NLP4GOV hackathon hosted by the Flemish government. I also want to thank our motivated team, without whose help and insights we would not have won as best start-up.
