Concepts Extraction: How I learned to stop worrying and love multilingual data

How to make text data from different languages understandable

Jeremie Zimmer
Mar 24 · 4 min read

As humans, we use text every day, and we like to work with computers. The problem is that computers don’t handle raw text well. When dealing with large amounts of text data, we need to automate the processing. This is where NLP comes in!

Concept extraction is a field of Natural Language Processing (NLP) that focuses on finding out the semantics of a text. For instance, it can identify the people, locations, objects and so on mentioned in a text. It is used to better understand the data we’re dealing with. Most tools focus on only a small subset of languages, English in particular. That’s why we are going to use the Neutral News API. Neutral News is a company focusing on NLP tasks in many different languages.

Disclaimer: This blog is written by a co-founder of Neutral News, just so you know :)

In this post, I am going to show how to highlight the main topics of a corpus of articles in 8 different languages using the Neutral News API.

The API requires a subscription, but fortunately we offer a free trial of 300 requests. Let’s use it for this tutorial! You can register here.

Let’s dive in!

We will assume you already have Python 3 installed. You can find the corpus I’ll be using here.

Let’s start by creating a simple function to load the articles:

import json

def get_data(path):
    # Load the corpus from a JSON file and close the file afterwards
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

data = get_data('articles.json')
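
Later snippets index the corpus as data['ko'][0] and iterate over corpus.items(), so the file presumably holds a JSON object that maps a language code to a list of article texts. Here is a minimal sketch of that assumed layout. The sample articles and the helper describe_corpus are made up for illustration and are not part of the real corpus or of the Neutral News client.

```python
import json
import os
import tempfile

# Made-up sample in the assumed layout: {language code: [article text, ...]}
sample = {
    "en": ["First English article.", "Second English article."],
    "ko": ["첫 번째 한국어 기사."],
}

def describe_corpus(corpus):
    """Return the number of articles per language code (hypothetical helper)."""
    return {lang: len(articles) for lang, articles in corpus.items()}

# Round-trip through a temporary file to mimic loading articles.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "articles.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

counts = describe_corpus(loaded)
print(counts)
```

Checking the article count per language before calling the API is a cheap way to estimate how many of the 300 trial requests a full run would use.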

First steps with the API

Neutral News provides a Python client to simplify the use of the API. It is pretty simple, let’s install it first.

pip3 install PyNeutralNews

Now let’s create the client object that will let us call the API (you will need to replace the credentials with those from your free trial):

from PyNeutralNews import Client

client = Client("<email>", "<password>")

We are going to use the function get_concepts below to get the extracted concepts with the associated weights from a text.

from collections import Counter

def get_concepts(text, lang=None):
    res = client.nlp.semantic_analysis(text, lang, concepts_properties=["titles.en", ""])
    concepts = Counter()
    for concept in res.concepts:
        titles =["titles"]
        title = titles.get("en") or titles[res.lang]
        concepts[title] += concept.weight
    return concepts

The function returns a Counter, a dictionary with the concept name as the key and its weight as the value. The weight reflects the semantic importance of the concept in the text: the higher the weight, the more central the concept is to the text.

>>> get_concepts(data['ko'][0], 'ko')
Counter({'United States': 0.16974641714195385,
         'President (government title)': 0.11700783103451891,
         'Justice Party (South Korea)': 0.11516047368626284,
         'Facebook': 0.0977350464727538,
         'Diplomacy': 0.09208318129951651,
         'Washington, D.C.': 0.09057665492040855,
         'Mass media': 0.07374152731976248,
         'German reunification': 0.06379742089362807,
         'Literature': 0.06115040296658773,
         'Ambassador': 0.06047045207677463,
         'Human': 0.058530592187832624})
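
Because the result is a standard collections.Counter, its most_common method can pull out the heaviest concepts directly. Here is a small sketch using a few of the weights from the example output above, rounded for readability:

```python
from collections import Counter

# Weights copied (rounded) from the example output above
concepts = Counter({
    "United States": 0.1697,
    "President (government title)": 0.1170,
    "Justice Party (South Korea)": 0.1152,
    "Facebook": 0.0977,
    "Diplomacy": 0.0921,
})

# most_common(n) returns the n (key, value) pairs with the highest values
top3 = concepts.most_common(3)
for title, weight in top3:
    print(f"{title}: {weight:.3f}")
```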

It works! Great, now let’s create a function that stores every concept in our corpus along with its number of occurrences and its cumulative weight.

def get_all_concepts(corpus):
    # Maps each concept to (number of occurrences, cumulative weight)
    concepts = {}
    for lang, articles in corpus.items():
        print('get concepts from', lang)
        for i, article in enumerate(articles):
            # Report progress every 10 articles
            if (i + 1) % 10 == 0:
                print(i + 1, '/', len(articles))
            res = get_concepts(article, lang)
            for concept, weight in res.items():
                if concept not in concepts:
                    concepts[concept] = (0, 0)
                c, w = concepts[concept]
                concepts[concept] = (c + 1, w + weight)
    return concepts

concepts = get_all_concepts(data)

We will first plot the most common concepts according to their occurrences.

import matplotlib.pyplot as plt
import numpy as np

# Concepts sorted by number of occurrences, most frequent first
concepts_occ = {k: v[0] for k, v in sorted(concepts.items(), key=lambda item: item[1][0])[::-1]}

def plot_concepts(concepts, limit=10):
    fig, ax = plt.subplots(figsize=(20, 10))
    # Bar chart of the first `limit` concepts
    ax.bar(np.arange(limit), height=list(concepts.values())[:limit], tick_label=list(concepts.keys())[:limit])

Nice! We are starting to get a good idea of the main topics in the corpus. But some of the concepts are too generic to really help us. Let’s see if we can get rid of them and get a more precise semantic analysis by taking the weights into account.
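
One way to take the weights into account, assuming the concepts dictionary maps each title to an (occurrences, cumulative weight) tuple as built by get_all_concepts, is to sort on the second element of the tuple instead of the first. The sample values and the helper name rank_by_weight below are invented for illustration.

```python
# Invented sample in the shape produced by get_all_concepts:
# {concept title: (occurrences, cumulative weight)}
concepts = {
    "Human": (12, 0.40),
    "United States": (9, 1.35),
    "Diplomacy": (4, 0.62),
    "Literature": (7, 0.21),
}

def rank_by_weight(concepts):
    """Return {title: cumulative weight}, heaviest concept first."""
    ordered = sorted(concepts.items(), key=lambda item: item[1][1], reverse=True)
    return {title: weight for title, (_, weight) in ordered}

concepts_weights = rank_by_weight(concepts)
print(list(concepts_weights))
```

In this sample, "Human" occurs most often but carries little weight per article, so it drops behind "United States" and "Diplomacy". The resulting dictionary could be passed to plot_concepts in the same way as concepts_occ.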

Now we have a better overview of the main topics from our corpus of articles. Mission accomplished! 😃

What’s next?

Now that we have our topics, we can do a lot more! For instance, we could cluster the articles by their concepts to find related articles, even across different languages.
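
As a rough sketch of that idea, the concept weights of each article can be treated as a sparse vector and compared with cosine similarity. This is a nearest-neighbour lookup rather than a full clustering algorithm, and the article keys and weights below are invented. In practice they would come from get_concepts, whose titles are language-independent when an English title is available.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse concept-weight vectors."""
    dot = sum(weight * b[title] for title, weight in a.items() if title in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented concept weights for three articles in different languages
articles = {
    "ko-0": Counter({"United States": 0.17, "Diplomacy": 0.09, "Ambassador": 0.06}),
    "fr-0": Counter({"Diplomacy": 0.12, "Ambassador": 0.10, "France": 0.08}),
    "de-0": Counter({"Football": 0.20, "Germany": 0.11}),
}

def most_similar(name, articles):
    """Return the key of the other article closest to `name`."""
    others = (key for key in articles if key != name)
    return max(others, key=lambda key: cosine_similarity(articles[name], articles[key]))

print(most_similar("ko-0", articles))
```

Here the Korean and French articles share the "Diplomacy" and "Ambassador" concepts, so they are matched even though their texts have no words in common.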
