DApp of the Week #07 — Troubadour

A Natural Language Processing (NLP) platform using decentralized cloud computing.

Daan Helsloot
iExec
9 min read · Jan 8, 2019


Talking, writing, and reading: as human beings, we can do it all. For this, we use one of the many human languages found around the globe. Through language, we attach meaning to the things we perceive in our daily lives. And it is not only human beings who communicate through language: computers do so as well.

Nowadays, we use programming languages to communicate with computers, but what possibilities would open up if we could use a language that both parties understand directly? This installment of DApp of the Week looks at examples of how we are already bridging the gap between data on a computer and human language.

💡 Want to learn more about iExec? Check out iExec Academy!

iExec Academy aggregates all content related to the project. You’ll find articles, tech documentation, videos, interactive demos, and much more! Whether you are a beginner or an expert, a developer or crypto-enthusiast, you’ll find what you are looking for on iExec Academy!

📚➡️ https://academy.iex.ec

Introduction

It is easy to forget just how complex the process of reading and understanding human language is. Computers can process data much faster than we humans can, and they are extremely efficient at working with standardized, structured data such as database tables.

Unfortunately, humans don’t communicate data in such a ‘structured’ way. We communicate using words, a form of unstructured data. With unstructured data, the rules are abstract and challenging to define concretely. Context, sarcasm, and proverbs are already complicated for humans to understand, so can you imagine what they are like for computers?

In human languages, we don’t always say what we mean; and we don’t always mean what we say.

Nowadays the majority of data collected within enterprises exists in the form of emails, reports and other documents. Analysts at Gartner, one of the world’s leading research and advisory companies, estimated that more than 80% of enterprise data today is unstructured.

Businesses across all industries face a growing need to observe, interpret, and evaluate this type of data within their own specific use cases. The expertise, knowledge, and information of industry professionals are time-sensitive and quickly become outdated. At the same time, decisions have a bigger impact in a highly interconnected world. At the moment, users are already able to perform simple textual searches on unstructured data.

Finding the desired information can be hard and very time-consuming due to the overload of information.

There still exists an urgent need to retrieve all useful information contained within data without having to manually read through all of it. In other words, this unstructured data needs to be transformed into structured data before it can be of any practical use.

What is Natural Language Processing?

NLP is the ability of machines to understand and interpret human language the way it is written and spoken. NLP sits at the intersection of computer science, AI, and computational linguistics. The objective of NLP is to make machines as intelligent as humans in understanding language.

This technology allows us to break human (natural) language down into elementary components that can be tagged and organized accordingly. Storing this information in a standardized format allows us to use this data as structured data.

Such a standardized format for the content of the data would make it much more accessible to users and would allow textual analytics to be conducted in order to extract knowledge and information from this data. Some examples of extracted knowledge consist of entities, facts, relations between concepts as well as sentiment, opinions, and emotions.
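To make this concrete, here is a minimal sketch of such tagging and entity extraction using the open-source spaCy library. This is purely illustrative; the post does not say which tools Troubadour uses internally.

```python
# A minimal sketch of tagging and entity extraction with spaCy
# (illustrative only; not necessarily what Troubadour uses).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The pope arrived in Iraq on Friday, Reuters reported.")

# Elementary components: each token gets a part-of-speech tag
# and a syntactic dependency label.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Extracted knowledge: named entities such as people, places, and dates.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

The entity list at the end is exactly the kind of structured record that can be stored in a database and searched, instead of the raw sentence.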

Although the term ‘NLP’ is not commonly heard, most people interact with NLP daily without realizing it. Google Translate, spam filters, and web search engines are all commonly used products that rely on NLP methods.

Troubadour

Troubadour is a data enhancement platform providing intuitive and accessible Natural Language Processing (NLP) tools that anyone can use as a solution to the ‘information overload’ problem. The aim of the platform is to let users make optimal use of all of their unstructured data by converting it into structured data. Our vision is that Troubadour can offer this solution to any domain or industry dealing with unstructured data in the form of written natural language. Education, legal affairs, clinical research, and tourism are just a few examples of such domains.

Troubadour is based on the NewsReader project, an EU-funded academic initiative aiming to provide NLP solutions that are accessible to everyone. NewsReader is being developed by a consortium including the Computational Lexicology and Terminology Lab (CLTL) of VU University Amsterdam, led by Prof. Dr. Piek Th.J.M. Vossen.

NewsReader is a system that extracts what happened to whom, when and where from multiple sources, and stores this in a structured database, enabling more precise search over this immense stack of information.
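Event-centric systems like NewsReader typically store such ‘who, what, when, where’ facts as RDF triples, building on vocabularies such as the Simple Event Model (SEM). Here is a minimal sketch in Python with rdflib; the example.org URIs and the event itself are hypothetical and purely for illustration.

```python
from rdflib import Graph, Literal, Namespace, RDF

# The Simple Event Model (SEM) vocabulary; the example.org URIs
# below are hypothetical placeholders for illustration.
SEM = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
EX = Namespace("http://example.org/news/")

g = Graph()
g.bind("sem", SEM)

visit = EX["visit-001"]
g.add((visit, RDF.type, SEM.Event))
g.add((visit, SEM.hasActor, EX["pope"]))            # who
g.add((visit, SEM.hasPlace, EX["iraq"]))            # where
g.add((visit, SEM.hasTime, Literal("2019-01-08")))  # when

print(g.serialize(format="turtle"))
```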

Scanning data in news stories from around the globe, NewsReader provides a solution to the data volume problem, by partly mimicking how humans read text and integrate new information with what is known of the past. Like human readers, NewsReader will reconstruct a coherent story in which new events are related to past events.

In contrast to human readers, however, NewsReader will not forget any detail: it keeps track of all extracted facts and even knows how stories differ from source to source.

Likewise, NewsReader is able to present the essential knowledge and information both as structured lists of data and facts and as abstract schemas of event sequences that represent stories going back in time, the way humans do. This allows us to detect trends, high-impact events, and social networks of people across time and regions. We can query long-term developments spanning decades for individuals, or types of individuals, to discover events that previously went unnoticed.
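Once events live in such a graph, questions about long-term developments become queries. Continuing the hypothetical graph sketched above, a SPARQL query can ask which recorded events involve a given actor, across all processed sources:

```python
# Query the event graph built in the previous sketch (hypothetical URIs).
results = g.query("""
    PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
    SELECT ?event ?place ?time WHERE {
        ?event a sem:Event ;
               sem:hasActor <http://example.org/news/pope> ;
               sem:hasPlace ?place ;
               sem:hasTime ?time .
    }
""")
for event, place, time in results:
    print(event, place, time)
```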

Use case: The polarizing vaccination debate

The importance of proper access to unstructured data becomes clearly visible when we look at the fierce vaccination debate currently taking place in our society. Not having access to information, or worse, having access to false information, can literally mean the difference between life and death. Despite their significant potential to enable the dissemination of factual information, social media are frequently abused to spread harmful health content. This potentially reduces vaccine uptake rates and increases the risk of global pandemics, especially among the most vulnerable.

Recent outbreaks of measles, mumps, and pertussis, and increased mortality from vaccine-preventable diseases such as influenza and viral pneumonia, show how important it is to combat online misinformation about vaccines. In 2015, for example, Québec was hit by a measles outbreak even though a vaccine that prevents this childhood infection is freely available. Unfortunately, due to the doubts that currently hang over vaccination, new episodes have emerged. Findings revealed that 83% of parents who hesitate to vaccinate their children are concerned about the potential side effects of vaccines, and 77% doubt their efficacy: two misconceptions that tend to spread through social media.

Studies have shown that access to a vast amount of content through the Internet, without intermediaries, has resulted in major segregation of users into polarized groups. Users select information that adheres to their system of beliefs and tend to ignore dissenting information. In other words, we fit logic to our perspective instead of our perspective to logic.

Since we only tend to see what we want to see, we cannot rely on the subjective nature of our own skewed perspective to determine right from wrong. Fortunately, using NewsReader, we are able to objectively map the unstructured data available from both sides. First, we scrape a plethora of websites related to vaccination and run the content through our NLP pipeline. By connecting the information in the processed data and turning it into event-centric knowledge graphs, we are able to generate a so-called perspective web.

An example subset of the perspective web for the ‘vaccines cause swelling’ and ‘vaccines cause autism’ propositions

This perspective web is centered around the propositions found and gives us insight into what kind of statements are being made about vaccination (e.g. vaccines cause autism, measles are prevented by vaccines, vaccines protect children, and vaccines contain unsafe toxins). Furthermore, it shows us which sources these statements come from (e.g. government, medical science, anti-vaccine parents) and whether a source claims or rejects a specific statement. This gives us a clearer overview of the situation and a better chance to differentiate between opinions on a specific subject, ultimately helping to improve public health.
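Conceptually, a perspective web links propositions to sources and each source’s stance toward the proposition. The following is a simplified, hypothetical sketch of that structure in Python; the propositions and sources mirror the examples above:

```python
# Simplified, hypothetical perspective web: proposition -> source -> stance.
perspective_web = {
    "vaccines cause autism": {
        "medical science": "rejects",
        "anti-vaccine parents": "claims",
    },
    "measles are prevented by vaccines": {
        "government": "claims",
        "medical science": "claims",
    },
}

def print_stances(proposition: str) -> None:
    """Show which sources claim or reject a given proposition."""
    for source, stance in perspective_web.get(proposition, {}).items():
        print(f"{source} {stance}: '{proposition}'")

print_stances("vaccines cause autism")
```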

As the above use case shows, proper usage of unstructured data could increase decision-making speed and cost-efficiency, as well as overall quality, by providing the customer or end user with only relevant information. By reducing the information overload, people can work with only the data that really matters while discarding the background noise.

Troubadour is powered by iExec

Running the NLP pipeline is very CPU-intensive, and as more people start to use Troubadour, more computing power will be required. Scaling with usage growth would normally mean continuously buying new, expensive hardware. The iExec platform is therefore the perfect solution to reduce costs for both the client and the product owner, while offering greater flexibility at the same time.

iExec provides us with a platform where developers don’t have to maintain any servers, yet have the opportunity to rapidly upscale/downscale based on the needs of customers. Since we don’t have to focus on servers, we can completely focus on providing the best possible user experience through our NLP solutions.

Since the potential users of Troubadour range from students to big financial enterprises, they may require different levels of computation. The fact that we can let a customer pick on the fly which provider they want to use brings added value to the system. Furthermore, since iExec has introduced a Pay-per-Task payment scheme, users only pay for what they consume.

iExec and datasets

As the development of the Troubadour project progresses, domain-specific training datasets will be required to meet the needs of our clients in different domains. If the necessary datasets are not available, we will have to create them ourselves. This is often a tedious and expensive task, requiring many hours of work.

Fortunately, since the iExec platform provides the opportunity to monetize such datasets, they can be developed without having to worry about investing too much money and time. Furthermore, existing datasets are often expensive, and it is unknown beforehand whether they contain useful information. The iExec platform will provide the ability to rent such datasets, offering a cheap way to check whether a dataset meets the needs of our clients.

Example: Running Troubadour on iExec

In the example shown below, a news article covering the pope and Iraq was processed by the NLP pipeline. The computation was powered by workers on the iExec network.

The workflow of Troubadour is as follows: a user uploads their chosen text file, and a MetaMask pop-up requests approval of a payment to an iExec worker. After the payment has been approved and the transaction has gone through, the text file is processed. When the work is done, a second MetaMask pop-up appears with a sign request. This signature is necessary to access the processed results of the NLP pipeline. The extracted information then becomes available in its corresponding sections.

Troubadour will be able to provide entity extraction, word sense disambiguation, semantic role labeling, (event) coreference resolution, factuality checking, temporal and causal relation identification, and sentiment analysis for any text you provide.

Additional information on extracted entities can be retrieved dynamically by Troubadour.
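Purely as an illustration of what such structured output could look like (the post does not specify Troubadour’s actual result format), a processed result might resemble the following:

```python
# Hypothetical shape of one processed result; all field names are
# invented for illustration and do not reflect Troubadour's real output.
result = {
    "entities": [
        {"text": "the pope", "type": "PERSON"},
        {"text": "Iraq", "type": "LOCATION"},
    ],
    "semantic_roles": [
        {"predicate": "visit", "agent": "the pope", "location": "Iraq"},
    ],
    "factuality": {"visit": "asserted"},
    "sentiment": {"polarity": "neutral"},
}

for entity in result["entities"]:
    print(entity["text"], "->", entity["type"])
```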

The future of Troubadour

The developers of Troubadour are hard at work on a first official release in Q1 2019. This version will have the full NLP pipeline implemented in the DApp and will be usable by anyone with a cryptocurrency wallet. After this, our main focus will be turning the platform from a document-centric to an event-centric approach, where multiple files can be combined to find relations between different text sources. This will make the system ready for enterprise use.

Finally, we are happy to announce that after we presented a beta version of Troubadour to the NewsReader team, they proposed to team up with us. We will work closely together to ensure that end users have direct access to the newest developments in the NLP world. We are ready to show the world the possibilities of this emerging field and are excited to see what the future holds.

If you have any questions, want to partner up, or need a custom solution, feel free to contact the authors of this application by sending an email to Daan Helsloot (creator of Troubadour) or Piek Vossen (director of the Computational Lexicology and Terminology Lab / NewsReader team).

Connect with Troubadour

Twitter

Connect with iExec

Website | Blog | Slack | Telegram | Reddit | Twitter | Facebook | LinkedIn | YouTube | GitHub | Kakao | Instagram | Steemit | Katacoda | Docs
