Extracting rich insights from Unstructured Data

Sushrut Mair
Published in Voice Tech Podcast · Jun 16, 2019

Unstructured data is rampant on the internet and in private data stores. Per Wikipedia, unstructured data is defined as “information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.”

The vast majority of unstructured data is text in various forms (documents, images of text, databases storing text blobs, streaming text and so on). IDC and Seagate predict that the global datasphere will grow to 163 zettabytes by 2025, and that the majority of it will be unstructured.

The vast majority of text data is still unanalyzed. The main reason is that, apart from the usual context-sensitive search and similar tools, no good tools have been available to derive useful insights from this data. Until now.

NLP is a proven technique for getting useful information out of text data. One of my previous blogs talked about that.

In this article, we shall see how combining NLP and Graph Analysis (a mathematical technique for evaluating graph structures) can yield even richer insights. This post is not deeply technical and aims to educate about the technique. My program demonstrating these combined techniques is linked later in the article.

Let’s start off. Our sample data is an 8-page PDF document hosted on WikiLeaks. This document was chosen because it represents a microcosm of different kinds of text data and thus serves as a useful example. It is also long enough to be interesting, yet short enough for someone to read through and review the analysis done here. The document is about an IMF (International Monetary Fund) internal meeting that predicted the Greek Debt Crisis.

The best way to start analyzing is to be clear about which insights we are interested in. If we ‘generify’ insights, we end up with the following list.

  1. Which entities are present in the data? By entities, we mean people, organizations, locations, things like medicine names and so on.
  2. Which are the most important entities? Key entities that appear important or critical to the corpus.
  3. Are there any emergent groups (communities) in this text? What are they? Who are the key entities in each group? Or in other words, are there clusters of information that look important and useful?
  4. For a particular group or cluster, identify the key target entities for processing / managing the group or cluster in a certain way. For example, if we are analyzing information on an extremist group, we would like to know which entities to target to, say, break that group apart. Or if we are analyzing the molecular structure of a virus, we would like to know where there are structural weaknesses to exploit. If we’re analyzing a community / locality in a city for a logistics provider, we may want to know the optimal locations for situating warehouses / supply points. There are many more use cases that fit here — I’ve just mentioned a few…
  5. Predictive analytics — how may the found networks evolve in the future? We’re trying to find the various paths of connection and relations among the members of the group, more specifically, how new (yet unformed) connections may be forged. This is useful to figure out what the group may look like in the future and hence aid in decision making now.
  6. What are the typical information flow paths through the groups / networks that were found? Which entities are responsible for key information flows? We are trying to figure out which entities are typically involved in what paths for disseminating information. This helps us understand how various entities exchange information.

The general method discussed here works on any kind of text data from any domain (medical, intelligence, engineering, military, etc.) and also works at scale. A couple of steps need to be adapted with domain knowledge, but that’s about all.

Domain-specific adaptation is typically required during the NLP-based entity recognition step. Adaptations can be as simple as adding your own entities to the existing list, or more complex in terms of training new models on representative data sets.
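As a concrete illustration, here is a minimal sketch of the simpler kind of adaptation, using spaCy (one of the libraries listed in the technical note at the end). The “Grexit” pattern is a hypothetical example of a domain term, not something taken from my notebook, and the snippet assumes the spaCy 3.x API:

```python
import spacy

# Load a small pretrained English pipeline (includes a statistical NER model).
nlp = spacy.load("en_core_web_sm")

# Simple adaptation: layer an EntityRuler with hand-written patterns on top of
# the statistical NER. Treating "Grexit" as an EVENT is a hypothetical example.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "EVENT", "pattern": "Grexit"}])

doc = nlp("The IMF meeting discussed whether a Grexit could destabilize Greece.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```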

Here are the insights gained on the sample PDF we talked about earlier. (The full source code leading to these insights is available here on my GitHub. It’s a Jupyter-based Python notebook; you can run it via nbviewer or, if not, just read through the code on GitHub. Note that the notebook is large, so it might take a little longer than usual to load.)

First, we present a view of the graph structure of the text data. This turns our entities and a lot of other information into a graph data structure, on which we can then run various analyses.

It is a veritable hairball.
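Under the hood, building such a graph can be as simple as linking two entities whenever they co-occur in a sentence. The sketch below (not the notebook’s exact code) does just that with networkx, reusing the `doc` from the spaCy snippet above:

```python
import itertools

import networkx as nx

G = nx.Graph()

# Entities that co-occur in a sentence get an edge; the edge weight counts
# how often the pair co-occurs across the document.
for sent in doc.sents:
    ents = {e.text for e in sent.ents}
    for a, b in itertools.combinations(sorted(ents), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```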

  • Which entities are present in the data? Some of the key entities are shown below. Notice that some of these entities were informal terms but still got picked up.

Which are the most important entities? This is a nuanced question, and the answer depends on how ‘important’ is defined. Accordingly, we have various categories of answers.


Most connected entities — entities with the highest number of connections to general terms and other entities.

Most influential entities — ranked by their influence on how information flows through the graph (or network).

Entities disseminating information fastest — those that move information fastest through this network.

Entities dealing with large information — entities through which the largest volumes of information pass.
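These four notions of importance map naturally onto standard graph centrality measures. The mapping below (degree, betweenness, closeness and load centrality, respectively) is my suggested reading of the categories above, computed over the co-occurrence graph `G` built earlier; the notebook may compute them differently:

```python
import networkx as nx

degree      = nx.degree_centrality(G)       # most connected
betweenness = nx.betweenness_centrality(G)  # most influential over information flow
closeness   = nx.closeness_centrality(G)    # fastest disseminators
load        = nx.load_centrality(G)         # handling the heaviest traffic

def top5(scores):
    return sorted(scores, key=scores.get, reverse=True)[:5]

print("Most connected:  ", top5(degree))
print("Most influential:", top5(betweenness))
```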

Are there any emergent groups (communities) in this text? What are they? Who are the key entities in each group? Groups are typically implicit in text data and are hard to form by reading or standard automated analysis. Here again, graph theory comes to the rescue. From Wikipedia — “Finding an underlying community structure in a network, if it exists, is important for a number of reasons … Individual communities also shed light on the function of the system represented by the network since communities often correspond to functional units of the system”. More here.

There are many algorithms to extract clusters out of graphs. In this article, I show only one but my code has many more. It is important to experiment with various methods to understand the underlying network better and extract meaningful data.
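For instance, the Louvain method (available via python-louvain, which is in my tech stack below) is one widely used option. A minimal sketch, assuming the weighted co-occurrence graph `G` from earlier:

```python
from collections import defaultdict

import community as community_louvain  # pip install python-louvain

# best_partition maps each node to a community id; invert it into member lists.
partition = community_louvain.best_partition(G, weight="weight")
communities = defaultdict(list)
for node, cid in partition.items():
    communities[cid].append(node)

for cid, members in sorted(communities.items()):
    print(f"Community {cid}: {members[:8]}")
```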

The community list is shown below. Observe how individual words (along with some junk) appear. The right analysis techniques can help deal with such eventualities.

The communities above are modeled in a graph, with each member color-coded according to its community.

Some of the key entities in the above community network:

For a particular group or cluster, identify the key target entities for processing / managing the group or cluster in a certain way — as we discussed earlier, such a target list can help us impact the community / group / cluster in a way that benefits us, by showing how its structure could be altered.

How may this network evolve in the future? What new connections might be forged amongst the members of the network in the future? Forewarned is forearmed, or something like that. Note that the connections shown in the list below do not exist at this point in time; hence, this is a predictive step. The boundary of this prediction is limited to the original document in question, and thus the interpretation should not move beyond that boundary. Of course, as more data becomes available, predictions improve.
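One standard way to score such not-yet-existing connections is a link-prediction measure such as the Adamic-Adar index; the sketch below uses networkx’s built-in implementation and is illustrative rather than the notebook’s exact method:

```python
import networkx as nx

# Score every currently non-existent edge; high scores suggest node pairs
# that are structurally likely to become connected as the network evolves.
predictions = nx.adamic_adar_index(G)
for u, v, score in sorted(predictions, key=lambda t: t[2], reverse=True)[:10]:
    print(f"{u} -- {v}: {score:.3f}")
```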

What are the typical information flow paths through the groups / networks that were found? Which entities are responsible for key information flows? We are trying to figure out which entities are typically involved in what paths for disseminating information. This helps us understand how various entities exchange information. We want the most efficient paths that take the shortest time. Here I created a modification of the standard shortest path algorithm that I call shortest-path-heavy. More details on the what and why of this in the notebook.
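The precise definition of shortest-path-heavy is in the notebook. One plausible reading, sketched below under that assumption, is a standard weighted shortest path in which heavier (stronger) edges cost less, so the resulting paths are both short and carried by high-weight connections. The source and target names here are hypothetical:

```python
import networkx as nx

# Invert edge weights so that heavy (high-interaction) edges become cheap,
# then run an ordinary weighted shortest-path search over the inverted costs.
for u, v, data in G.edges(data=True):
    data["inv_weight"] = 1.0 / data["weight"]

# "IMF" and "Greece" are hypothetical node names, for illustration only.
path = nx.shortest_path(G, source="IMF", target="Greece", weight="inv_weight")
print(" -> ".join(path))
```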

Since many shortest-path-heavy paths were found, I am showing just a few examples, along with a visualization.

The visualization below is another (probably worse) hairball. For a much more decipherable image, please click here to view (you will need to scroll to see the whole image) or here to download. Note that it is an almost 10 MB image. The red circles (called nodes or vertices) are key entities, the pink nodes are other key terms (non-entities) and the light blue lines (called edges) are the information flow paths.

Fig X — Entities, Terms & Information flow paths

Finally, I want to spend some time driving home the message that the techniques discussed are extremely powerful. Consider the examples below.

  • In Fig X above, assume that it represents a terrorist cell, with the red circles being names of members, affiliates and locations. It is not a leap of imagination to deduce how information flows among those members / locations. Analysis can also help in planning the break-up of such a cell.
  • Now assume that Fig X represents a city and that the red circles represent various locations in that city that are important targets for a logistics provider. With the blue information flow lines showing optimal traffic patterns, the provider can be greatly aided in selecting optimal locations for its warehouses / service points.
  • A last one. Assume Fig X represents the molecular structure of a virus. The red circles would be key substructures that could be targets for various medical compounds.

A short technical note, for the geeks and gearheads: I’ve already provided the link to my notebook at the start. Please go through it and let me know if you have questions. The technologies that I have used here are:

NLP — spaCy, textacy, PyPDF2, python-levenshtein, fuzzywuzzy.

Graph libraries — networkx, python-louvain.
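(A hedged aside on why fuzzywuzzy and python-levenshtein appear in the NLP list: a likely role is merging near-duplicate entity mentions before graph construction, for example “IMF” versus “the IMF”. The threshold of 90 in this sketch is an assumption, not a value from the notebook.)

```python
from fuzzywuzzy import fuzz

mentions = ["IMF", "the IMF", "International Monetary Fund", "Greece"]
canonical = []
for m in mentions:
    # Reuse an existing canonical form if it is a close fuzzy match.
    match = next((c for c in canonical if fuzz.token_set_ratio(m, c) >= 90), None)
    if match is None:
        canonical.append(m)

print(canonical)  # "the IMF" folds into "IMF"; the others stay separate
```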

Thank you for reading this article! I welcome any comments or queries you may have. I am an independent technology consultant in the area of digital transformation and its related fields. I help organizations navigate their journey towards leveraging technologies like Machine Learning, Cloud Computing, Internet of Things, Mobile Applications and others for their business. Most recently, I’ve helped bootstrap, design and prototype a biofeedback platform, and I am helping organizations in the Healthcare and IoT sectors. If you or your organization are looking for help with digital transformation, drop me a message!
