Extracting Value from Non-Structured Data

A guide through data concepts to understand some possible ways to extract value from non-structured data

Published in CodeX · 5 min read · Jul 13, 2022

By a commonly cited estimate, unstructured data represents about 80% of existing data. I open with that number because I believe the best way to convince you to pay more attention to your data is with an impactful statistic.

The goal of this article is to explain what unstructured data is, present the advantages of working with it, hypothesize why it is still a little-explored area, and give examples of how to handle it (drawing on a real project).

Unstructured data is anything that is neither tabular (structured data, such as a database table) nor semi-structured (e.g., JSON, XML). PDF reports, emails, HTML web pages, audio, and photos are therefore all examples of unstructured data.

It is now easy to see why this kind of data is more abundant than structured data: it occupies a greater part of our daily lives than CSV tables do. Even so, many data initiatives today consist mainly of Business Intelligence products (dashboards, analytical reports, summaries, graphical panels) built on already structured data or fed by a Data Warehouse.

Unstructured data can be stored in repositories called Data Lakes, which support both structured and unstructured data. Although this kind of storage exists, and although unstructured data is a precious source of information, it is often left out for reasons that will be explored below.

Why is unstructured data deprioritized?

First, understanding tabular (or even semi-structured) data is much simpler than understanding unstructured data, and it therefore takes less time to generate value. To make this clearer, look at the three (fictitious) examples below and try to find four place names in each one.

[Figures omitted: structured data (table); semi-structured data (JSON); unstructured data (email, screenshot by author)]

Understanding concepts

The first is structured data (a table), the second is semi-structured data (JSON), and the third is unstructured data (an email). I believe this makes it clearer that, in a Big Data scenario, quickly understanding multiple types of unstructured data is a slower and, consequently, more expensive process.
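To make the difference concrete, here is a minimal Python sketch that pulls place names out of each of the three kinds of data. Everything in it is an assumption made for illustration: the sample data is invented (it is not the data from the figures), and `KNOWN_PLACES` is a hypothetical lookup list standing in for what a real NLP model would learn.

```python
import csv
import io
import json

# Fictitious sample data, invented for this sketch.
TABLE = "name,city\nAna,Lisbon\nBruno,Recife\n"
DOCUMENT = (
    '{"trips": [{"traveler": "Ana", "city": "Lisbon"},'
    ' {"traveler": "Bruno", "city": "Recife"}]}'
)
EMAIL = "Hi team, Ana lands in Lisbon on Monday and Bruno is still stuck in Recife."

# Hypothetical lookup list: free text has no "city" field,
# so outside knowledge is needed to recognize a place name.
KNOWN_PLACES = {"Lisbon", "Recife", "Rio de Janeiro"}


def places_from_table(text):
    """Structured: the column name says exactly where the places are."""
    return [row["city"] for row in csv.DictReader(io.StringIO(text))]


def places_from_json(text):
    """Semi-structured: follow a known key path through the document."""
    return [trip["city"] for trip in json.loads(text)["trips"]]


def places_from_email(text, known_places):
    """Unstructured: no schema, so search the text for known names."""
    found = [(text.find(place), place) for place in known_places if place in text]
    # Sort by position so the result follows the order of the text.
    return [place for _, place in sorted(found)]


print(places_from_table(TABLE))
print(places_from_json(DOCUMENT))
print(places_from_email(EMAIL, KNOWN_PLACES))
```

The first two functions only need to know the column or key name. The third needs a list of every place it might meet, and it would still miss misspellings or places that are not on the list, which is where the extra cost of unstructured data comes from.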

Another factor that makes this kind of analysis and data extraction harder is the shortage of specialized people. There is already a serious shortage of people for structured data analysis. For unstructured data, the analyst must also know how to apply text analysis (natural language processing, or NLP, and sentiment analysis), image analysis, and audio analysis.

With that, we arrive at a scenario in which a lot of valuable information is spread across different contexts. However, we live in an immediate world that demands fast delivery of value, on the belief that this is the only way to know an investment is paying off. Extracting that value also requires many skills, which take a person a long time to build before they can effectively bring value to the customer.

Benefits of unstructured data

Examples bring out the advantages of working with unstructured data more explicitly. The first is text analysis, for example of an email body. With Machine Learning and Natural Language Processing algorithms, it is possible to detect that locations and names are cited in the text and to identify its language. Sentiment analysis can also reveal urgency and stress in the writer's tone. Although this is obvious to us humans, it is very difficult for algorithms to detect.
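As a rough illustration of why tone is hard for algorithms, here is a toy urgency scorer in Python. It is a sketch under strong assumptions, not how a trained sentiment model works: the cue words in `URGENCY_CUES` are hypothetical and hand-picked, and the score is just the share of cue words and exclamation marks in the message.

```python
import re

# Hypothetical cue words, chosen only for illustration.
URGENCY_CUES = {"urgent", "asap", "immediately", "now", "deadline"}


def urgency_score(text):
    """Return (cue words + exclamation marks) divided by the word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in URGENCY_CUES)
    exclamations = text.count("!")
    return (hits + exclamations) / len(words)


print(urgency_score("Please reply ASAP, this is urgent!"))
print(urgency_score("See you next week."))
```

A word list like this misses sarcasm, context, and any urgent message that avoids the listed words, which is why real projects rely on trained models or managed services instead.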

A second example is audio: the same message, but spoken. Audio algorithms focus on breaking the audio, that is, the sound waves, down to the level of individual phonemes and then mapping those phonemes to written words, which produces a transcription. The same kinds of characteristics can then be extracted from the transcribed text.
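The sketch below illustrates only the very first step of that process: cutting a waveform into short overlapping frames that a recognizer could later analyze. It assumes a synthetic tone instead of real speech, and the 25 ms frame and 10 ms hop are common defaults that I picked for illustration. Recognizing phonemes would require a trained acoustic model, which is not included here.

```python
import math


def make_tone(freq_hz, seconds, sample_rate=16000):
    """Generate a pure sine tone as a stand-in for recorded audio."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def split_into_frames(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Cut the waveform into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


def frame_energy(frame):
    """Mean squared amplitude, a very simple per-frame feature."""
    return sum(s * s for s in frame) / len(frame)


tone = make_tone(440, 1.0)
frames = split_into_frames(tone)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each frame is small enough to be treated as a single sound, which is what makes the later phoneme-level analysis possible.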

As for images, numerous characteristics can be extracted from the set of pixels (the algorithm examines each pixel to understand the image's content as a whole). For example, an algorithm could assign a location to an image, the one above being the city of Rio de Janeiro. Another example is facial recognition, or the recognition of facial expressions and of what the people in a photo appear to be feeling.

These examples are just a few of the scenarios in which this data can be analyzed, and there are more types of unstructured data than the ones covered here. Unstructured data can be used in many ways to add value, to understand contexts better, and even to personalize an experience.

How to deal with this unstructured data

As already mentioned, it takes years for each specialist, probably working in a team, to build the skills needed to apply complex algorithms and analyze this data with high quality.

However, cloud services dedicated to handling unstructured data are now available. Text analytics (such as search and sentiment analysis) and the analysis of audio, video, and images have become much more attainable through APIs offered by Microsoft, Google, and AWS, for example.

A real case comes from a project I participated in, which needed a more accurate search system for documents (PDF files), along with extraction of the names of the politicians who appeared in each document. We created a solution in which the data was automatically extracted from the source, transformed, and then processed with the Azure Cognitive Search API, which provided relevance-based search and extracted the personal names from the documents.

That way, not only was a search system implemented, but the names were also extracted for each file, making it easier for the person reading the title to link the information to the politician's name.
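To give a feel for what calling such a service looks like, here is a minimal Python sketch that builds (but does not send) a query for the Azure Cognitive Search REST API. It is not the project's actual code. The service name, index name, and the `title` and `people` fields are hypothetical, and the sketch assumes the person names were already stored in that `people` field when the documents were indexed. The API version shown is one stable release; check the current documentation before relying on it.

```python
import json
from urllib.parse import quote

# One stable REST API version; newer versions may exist.
API_VERSION = "2020-06-30"


def build_search_request(service, index, api_key, query, top=5):
    """Build the URL, headers, and JSON body for a document search.

    Nothing is sent over the network; the caller decides how to POST it.
    """
    url = (
        f"https://{service}.search.windows.net/indexes/{quote(index)}"
        f"/docs/search?api-version={API_VERSION}"
    )
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = {
        "search": query,           # free-text query, ranked by relevance
        "top": top,                # number of results to return
        "select": "title,people",  # hypothetical index fields
    }
    return url, headers, json.dumps(body)


url, headers, body = build_search_request(
    "demo-service", "documents", "YOUR-QUERY-KEY", "budget report"
)
print(url)
print(body)
```

The point of the sketch is the size of the effort: a relevance-ranked search over PDFs becomes one HTTP request, instead of a ranking algorithm built and tuned by hand.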

In this article, I covered the basic concepts of structured, semi-structured, and unstructured data, using examples to help clarify the complexity of each type. I also explored some examples of how to work with unstructured data to extract its value.

I’m Aline, the author of this article. Find me here and here!