Concept Extraction: How I learned to stop worrying and love multilingual data

How to make text data from different languages understandable

Mar 24, 2020 · 4 min read

As humans, we use text in our daily life. And as humans, we like to work with computers. The problem is that computers don’t like to work with text. When dealing with large (or huge) amounts of text data, we need to automate the way we process it. And this is where NLP comes in!

Concept extraction is a field of Natural Language Processing that focuses on finding out the semantics of a text. For instance, it can tell you which people, locations, objects and so on are mentioned. It is used to better understand the data we’re dealing with. Most tools focus on only a small subset of languages, English in particular. That’s why we are going to use the Neutral News API. Neutral News is a company focusing on NLP tasks in many different languages.

Disclaimer: This blog is written by a co-founder of Neutral News, just so you know :)

In this post, I am going to show how we can highlight the main topics of a corpus of articles in 8 different languages using the Neutral News API.

The API requires a subscription, but fortunately for us we offer a free trial of 300 requests. Let’s use it for this tutorial! You can register here.

Let’s dive in!

We will assume you already have Python 3 installed. You can find the corpus I’ll be using here.

Let’s start by creating a simple function to load the articles:

import json

def get_data(path):
    # Read the whole corpus from a JSON file
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

data = get_data('articles.json')
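The rest of the post indexes the corpus as `data['ko'][0]` and iterates over `corpus.items()`, so the file is expected to map a language code to a list of article texts. Here is a small sketch of that shape with a made-up miniature corpus; the `count_articles` helper and the sample texts are only illustrative and are not part of the real data set:

```python
import json

# Hypothetical miniature corpus with the assumed shape:
# language code -> list of article texts
sample_json = '{"en": ["First article.", "Second article."], "ko": ["기사 본문"]}'
corpus = json.loads(sample_json)

def count_articles(corpus):
    # Number of articles available for each language
    return {lang: len(articles) for lang, articles in corpus.items()}

print(count_articles(corpus))
```

Running the same helper on the real `data` is a quick way to check how many API requests a full pass would cost.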

First steps with the API

Neutral News provides a Python client to simplify the use of the API. It is pretty simple. Let’s install it first.

pip3 install PyNeutralNews

Now let’s create the client object that lets us call the API (you will need to replace the credentials with the ones from your free trial):

from PyNeutralNews import Client

client = Client("<email>", "<password>")

We are going to use the function get_concepts below to get the extracted concepts with the associated weights from a text.

from collections import Counter

def get_concepts(text, lang=None):
    res = client.nlp.semantic_analysis(text, lang, concepts_properties=["titles.en", "titles.auto"])
    concepts = Counter()
    for concept in res.concepts:
        titles = concept.properties["titles"]
        # Prefer the English title, fall back to the title in the text's language (res.lang)
        title = titles.get("en") or titles[res.lang]
        # Add up the weights of concepts that share the same title
        concepts[title] += concept.weight
    return concepts

The function returns a Counter, which is a dictionary with the concept name as key and its weight as value. The weight corresponds to the semantic value of the concept in the text: a high weight means the concept is important to the meaning of the text.

>> get_concepts(data['ko'][0], 'ko')
Counter({'United States': 0.16974641714195385,
'President (government title)': 0.11700783103451891,
'Justice Party (South Korea)': 0.11516047368626284,
'Facebook': 0.0977350464727538,
'Diplomacy': 0.09208318129951651,
'Washington, D.C.': 0.09057665492040855,
'Mass media': 0.07374152731976248,
'German reunification': 0.06379742089362807,
'Literature': 0.06115040296658773,
'Ambassador': 0.06047045207677463,
'Human': 0.058530592187832624})
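Since the result is a `collections.Counter`, the standard `most_common` method is a handy way to keep only the heaviest concepts. A minimal sketch, using a few values copied (and rounded) from the output above so that it runs without calling the API:

```python
from collections import Counter

# A few concept weights copied from the output above (rounded),
# so the example runs without an API call.
sample = Counter({
    'United States': 0.1697,
    'President (government title)': 0.1170,
    'Justice Party (South Korea)': 0.1152,
    'Facebook': 0.0977,
    'Human': 0.0585,
})

# most_common(n) returns the n (concept, weight) pairs with the highest weights
top3 = sample.most_common(3)
for title, weight in top3:
    print(f'{title}: {weight:.3f}')
```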

It works! Great, now let’s create a function that stores every concept in our corpus along with its number of occurrences and its cumulative weight.

def get_all_concepts(corpus):
    # Maps each concept title to (number of occurrences, cumulative weight)
    concepts = {}
    for lang, articles in corpus.items():
        print('get concepts from', lang)
        for i, article in enumerate(articles):
            if (i + 1) % 10 == 0:
                print(i, '/', len(articles))
                # Stop here: only the first 9 articles of each language are analysed
                break
            res = get_concepts(article, lang)
            for concept, weight in res.items():
                if concept not in concepts:
                    concepts[concept] = (0, 0)
                c, w = concepts[concept]
                concepts[concept] = (c + 1, w + weight)
    return concepts

concepts = get_all_concepts(data)

We will first plot the most common concepts according to their occurrences.

import matplotlib.pyplot as plt
import numpy as np

# Keep only the occurrence count, most frequent concepts first
concepts_occ = {k: v[0] for k, v in sorted(concepts.items(), key=lambda item: item[1][0])[::-1]}

def plot_concepts(concepts, limit=10):
    # Bar chart of the first `limit` concepts (expects at least `limit` entries)
    fig, ax = plt.subplots(figsize=(20, 10))
    ax.bar(np.arange(limit), height=list(concepts.values())[:limit], tick_label=list(concepts.keys())[:limit])
    plt.show()

plot_concepts(concepts_occ)

Nice! We now start to have a good idea of the main topics in the corpus. But we still have some words that don’t really help us. Let’s see if we can get rid of them and have a more precise semantic analysis by taking the weights into account.
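One way to do this, sketched here under the assumption that `concepts` has the `(occurrences, cumulative weight)` shape built by `get_all_concepts`, is to sort on the second element of each tuple instead of the first. The helper name `rank_by_weight` and the sample values are made up for illustration:

```python
def rank_by_weight(concepts):
    # concepts maps a title to (occurrences, cumulative_weight);
    # keep only the weight and order from heaviest to lightest.
    ordered = sorted(concepts.items(), key=lambda item: item[1][1], reverse=True)
    return {title: weight for title, (_, weight) in ordered}

# Made-up values, only to show the effect of the re-ranking:
# 'Human' appears most often but carries the least weight.
sample_concepts = {
    'Human': (8, 0.40),
    'United States': (5, 0.90),
    'Diplomacy': (3, 0.55),
}

concepts_weights = rank_by_weight(sample_concepts)
print(list(concepts_weights))
```

With the real data, passing `rank_by_weight(concepts)` to `plot_concepts` should draw the same kind of bar chart as before, this time ordered by cumulative weight.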

Now we have a better overview of the main topics from our corpus of articles. Mission accomplished! 😃

What’s next?

Now that we have our topics, we can do a lot more! For instance, we could do a clustering of our new data to find related articles even in different languages.
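As a starting point for that idea, here is a minimal sketch of a similarity measure between two articles. It relies on the fact that `get_concepts` prefers English titles, so articles written in different languages can share the same keys. The cosine similarity used here is my own choice of building block rather than a full clustering algorithm, and the three sample articles are made-up stand-ins for real `get_concepts` results:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Dot product over the concepts the two articles have in common
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up concept weights standing in for get_concepts() results
article_ko = Counter({'United States': 0.17, 'Diplomacy': 0.09, 'Ambassador': 0.06})
article_fr = Counter({'United States': 0.20, 'Diplomacy': 0.12, 'Facebook': 0.05})
article_es = Counter({'Football': 0.30, 'Madrid': 0.10})

print(cosine_similarity(article_ko, article_fr))  # high: shared concepts
print(cosine_similarity(article_ko, article_es))  # 0.0: no concept in common
```

A pairwise similarity like this could then be fed to whichever clustering method you prefer.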
