Structuring References from Text Using Machine Learning

Liz Gallagher
Published in Wellcome Data · 5 min read · Oct 29, 2019

Using a naïve Bayes classifier in scikit-learn to predict whether a sentence from a references section is an author, title, journal, etc.

In Wellcome Data Labs we have been developing a tool that flags any Wellcome Trust funded publications found in the reference sections of policy documents (e.g. WHO and NICE guidelines). This relies on us being able to pick out the reference titles from a large amount of unstructured text. An example of this unstructured text is:

What unstructured text from a policy document looks like. How can we extract all the titles from millions of lines of this type of text?

How do we know which parts of this text are the titles (or journal names or authors for that matter)? To explain this we will use a much simpler example.

Is a sentence from a magazine about cats or a magazine about dogs?

“The Cat magazine” by Ryan O’Hara is licensed under CC BY-NC-ND 4.0

Imagine we have some sentences, each tagged with whether it comes from a magazine about cats or a magazine about dogs. With this data, we can use supervised machine learning to build a model that predicts whether a new sentence is likely to be from a cat magazine or a dog magazine.

The first step is to vectorise the tagged sentences. One way to do this is to take every word that appears in any of the sentences (the columns) and count, for each sentence (the rows), how many times it uses each word. For a toy example of 6 basic sentences (the ones used throughout this post), this looks like:
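As a minimal sketch of this step, here is how you could build that word-count matrix with scikit-learn's CountVectorizer (which we come back to later in this post), using the 6 training sentences from the code further down:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['the cat sat on the mat', 'dogs are better than cats',
             'i love dogs', 'jake is a dog', 'the cat', 'the dog is nice']

vectorizer = CountVectorizer(analyzer='word', token_pattern=r'(?u)\b\w+\b')
matrix = vectorizer.fit_transform(sentences)

# The words (columns); in scikit-learn versions before 1.0 this method
# was called get_feature_names()
print(vectorizer.get_feature_names_out())
# The word counts, one row per sentence
print(matrix.toarray())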

You may also end up only using the top 1000 words, or making everything lower case, or removing common words (e.g. ‘and’).
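With CountVectorizer, each of those tweaks is a keyword argument:

from sklearn.feature_extraction.text import CountVectorizer

# Keep only the 1000 most frequent words, lower-case everything,
# and drop common English words such as 'and'
vectorizer = CountVectorizer(max_features=1000,
                             lowercase=True,
                             stop_words='english')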

Next you can build a model from this vectorised data. One type of model (and the one we use) is a multinomial naïve Bayes classifier. This model uses some quite simple maths to work out how likely each word in a sentence is to appear in a cat magazine or a dog magazine. Then, by multiplying all these probabilities together, we can predict whether a sentence reads more like a dog magazine sentence or a cat magazine sentence. A simple worked example of this maths is sketched below.
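This back-of-the-envelope version uses the 6 training sentences from the code further down. Like scikit-learn's MultinomialNB, it applies add-one (Laplace) smoothing and ignores words that never appeared in training; the exact probabilities a fitted model reports may differ slightly from a hand calculation like this:

from collections import Counter

# Word counts for each class, taken from the training sentences below
cat_words = Counter('the cat sat on the mat'.split()) + Counter('the cat'.split())
dog_words = (Counter('dogs are better than cats'.split())
             + Counter('i love dogs'.split())
             + Counter('jake is a dog'.split())
             + Counter('the dog is nice'.split()))
vocab = set(cat_words) | set(dog_words)

def word_prob(word, counts):
    # Add-one (Laplace) smoothing so rare words don't zero out the product
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

def score(sentence, counts, prior):
    p = prior  # the fraction of training sentences in this class
    for word in sentence.split():
        if word in vocab:  # like the vectoriser, skip words never seen in training
            p *= word_prob(word, counts)
    return p

sentence = 'dog or cat'
p_cat = score(sentence, cat_words, 2 / 6)  # 2 of the 6 training sentences are 'cat'
p_dog = score(sentence, dog_words, 4 / 6)
print('dog' if p_dog > p_cat else 'cat')  # prints 'dog', as in the table below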

I trained a model on these 6 sentences and then used it to predict whether 5 new sentences were from a dog or a cat magazine; the results were:

+------------------------+------------+-------------+
| Sentence to predict    | Prediction | Probability |
+------------------------+------------+-------------+
| dog                    | dog        | 0.82        |
| cat                    | cat        | 0.66        |
| this is a cat sentence | dog        | 0.63        |
| dog or cat dog         | dog        | 0.72        |
| dog or cat             | dog        | 0.53        |
+------------------------+------------+-------------+

All this was done using CountVectorizer and MultinomialNB from the Python package scikit-learn. Given the training sentences and their classifications as input, we can build a model and predict the categories of 5 new sentences in only a few lines of code:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> training_sentences = ['the cat sat on the mat', 'dogs are better than cats', 'i love dogs', 'jake is a dog', 'the cat', 'the dog is nice']
>>> training_classifications = ['cat', 'dog', 'dog', 'dog', 'cat', 'dog']
>>> vectorizer = CountVectorizer(analyzer='word', token_pattern=r'(?u)\b\w+\b')
>>> count_words_sentences = vectorizer.fit_transform(training_sentences)
>>> mnb = MultinomialNB()
>>> mnb.fit(count_words_sentences, training_classifications)
>>> predict_vec_list = vectorizer.transform(['dog', 'cat', 'this is a cat sentence', 'dog or cat dog', 'dog or cat']).toarray()
>>> mnb.predict(predict_vec_list)
array(['dog', 'cat', 'dog', 'dog', 'dog'], dtype='<U3')
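The probability column in the table above can be recovered from the same model (presumably this is how such a column would be produced); predict_proba returns the probability of each class for each sentence:

>>> mnb.predict_proba(predict_vec_list)  # one row per sentence; columns ordered as in mnb.classes_, here ['cat', 'dog']
>>> mnb.predict_proba(predict_vec_list).max(axis=1)  # the probability of the predicted class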

So, going back to our reference example …

How do we categorise titles/authors/journals in sentences?

We have lots of training data available from Europe PMC (EPMC), from which we can download thousands of titles, authors and journal names (amongst other reference attributes).

This data is then vectorised and a multinomial naïve Bayes classifier is trained using it. We can then split the unstructured policy document text into pieces (splitting on full stops, exclamation marks and question marks) and use the trained model to predict what type of sentence (or ‘reference component’) each piece is:

This data shows the most likely category for each reference component as predicted by our model, and how probable the model thought this category was.
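A minimal sketch of this step is below. It assumes a vectorizer and mnb trained as in the cat/dog code above but on the EPMC reference data; the category labels ('Title', 'Authors', 'Journal') and the snippet of policy text are invented for illustration:

import re

# An invented example of unstructured reference text
policy_text = 'Smith J, Jones A. A study of cats. J Feline Med. 2018'

# Split into candidate reference components on . ! and ?
components = [c.strip() for c in re.split(r'[.!?]', policy_text) if c.strip()]

# Predict a category for each component with the trained model
vectors = vectorizer.transform(components)
for component, category in zip(components, mnb.predict(vectors)):
    print(category, '-', component)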

As you can see, it doesn’t always predict the categories correctly (I don’t think there is a Dr ‘J Acquire Immune Defic Syndr’ out there), but we can estimate how much we can trust our model. One metric for this is the accuracy score: using a new ‘test’ dataset (i.e. not the dataset we trained the model on), we divide the number of correctly predicted categories by the total size of the test data. In our case the accuracy score for the model predicting titles is 87%. See this blog post for other ways to evaluate your model.
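As a sketch of how such an accuracy score is computed, reusing the cat/dog toy data as a stand-in for the real reference data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

sentences = ['the cat sat on the mat', 'dogs are better than cats',
             'i love dogs', 'jake is a dog', 'the cat', 'the dog is nice']
labels = ['cat', 'dog', 'dog', 'dog', 'cat', 'dog']

vectorizer = CountVectorizer(analyzer='word', token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(sentences)

# Hold back a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=0)

mnb = MultinomialNB().fit(X_train, y_train)

# Fraction of test categories predicted correctly
print(accuracy_score(y_test, mnb.predict(X_test)))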

And voilà, we’ve gone from lots of text to a neat list of the titles (or authors etc) within it, which can then be matched to a database of titles from Wellcome Trust acknowledged publications!
