If Data Is the New Oil, Then Most of It Is Still in the Ground

Published in The Startup · 4 min read · Oct 15, 2020

We see only the tip of the ‘iceberg’ when it comes to data: only a fraction of it is structured and currently used for analysis.

In the midst of the massive and long-lasting digital revolution, it is hard to even comprehend the sheer volume of the energy that fuels it: data. Data is the fuel that drives the technology and processes that are changing our world. Without data, hardware and software are just a novelty, perhaps things that make our lives mildly easier. As fossil fuels powered the industrial revolution, data powers the digital revolution, and much of this potential fuel is still in the ground, unseen and untapped.

It’s not that we can’t get it out of the ground; we just can’t get it out fast enough

We are certainly not experiencing an energy shortage when it comes to the data needed to power this digital revolution. However, we may not have figured out how to efficiently and effectively tap it or process it yet. The amount of data being created is staggering, and the growth rate is exponential. We generate quintillions of bytes of data from internet searches, social media use, communication, services, and the internet of things every day. The data makes up the Global Datasphere, which is predicted to grow to 175 Zettabytes (ZB) by 2025.[1]

Organisations have enlisted some of the best business intelligence teams in an attempt to structure this data and make it useful. However, to return to our fossil fuel analogy, with the current structured-data approach they can’t get it out of the ground fast enough.

The Problem of Unstructured Data

Unstructured data accounts for an estimated 80-90% of the total data being created, and the percentage is growing.[2] This type of data is made up of emails, tweets, books, documents, health records, and web pages, but also images, video, and audio files. Because many organisations have become accustomed to processing structured data, they are trapped in the expensive, time-consuming, and inflexible process of structuring data prior to analysis. This is not a sustainable model. There is no cost-effective or time-efficient way to make the most of this data while insisting on structuring it first. Moreover, the process of structuring data provides an incomplete view and may limit the insights gained.
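To put these percentages in perspective, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that the 80-90% share applies to the projected 175 ZB Global Datasphere of 2025; the two figures come from different sources, so treat the result as a rough order of magnitude, not a forecast.

```python
# Rough estimate only. Assumption: the 80-90% unstructured share [2]
# applies to the projected 2025 Global Datasphere of 175 ZB [1].
DATASPHERE_2025_ZB = 175

def unstructured_volume_zb(total_zb: int, share_percent: int) -> float:
    """Return the unstructured volume in zettabytes for a given share."""
    return total_zb * share_percent / 100

low_estimate = unstructured_volume_zb(DATASPHERE_2025_ZB, 80)
high_estimate = unstructured_volume_zb(DATASPHERE_2025_ZB, 90)

print(f"Unstructured data in 2025: roughly {low_estimate} to {high_estimate} ZB")
```

Under these assumptions, somewhere between 140 and 157.5 ZB would be unstructured.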

The growth of structured and unstructured data over the last decades. Most of it can be attributed to online data generation: email, social media content, videos, and recordings.

A New Approach

The good news is that there are far more efficient ways to deal with unstructured data, without any need to convert it to a more structured format first. Advances in computer science, AI, and machine learning give us access to the wealth of potential knowledge in unstructured data. For unstructured text in particular, models built upon techniques like BERT (Bidirectional Encoder Representations from Transformers), Megatron-LM, XLNet, SGC, T5-11B, and more recently OpenAI’s GPT-3 let us obtain more information at a fraction of the cost of structuring the data. Where older language models only understood the relations between words, today’s models understand sentences, paragraphs, and even entire stories.

While there is a natural hesitancy among business insights departments to make radical shifts in their data processing and analysis methods, there is much to be gained: apart from a significant cost reduction, textual and visual datasets hold far richer information than a structured dataset. Take, for example, a dataset containing your clients with some additional information like age, marital status, occupation, income, address, contract value, and last service contact (e.g. Peter Smith, 34, married, €48.000, Main Street 1, €240, Oct. 1, 2020) and compare it with an email in which Peter indicates that he lost his job and got divorced. Which piece of information tells you more about a potential payment problem at the end of the month? Certainly, you can train a machine learning model on the structured data and predict, with a certain precision, the payment problems of clients like Peter. But with the deeper understanding of his actual situation obtained from the email, you can predict them with much higher precision.
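The contrast between the two sources can be sketched in a few lines of Python. This is a toy illustration only: the email text, the field names, and the `RISK_SIGNALS` phrase list are all hypothetical, and the simple phrase matcher merely stands in for what a trained language model would infer from context. It is not how models like BERT or GPT-3 work internally.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ClientRecord:
    """Structured view of a client (field names are illustrative)."""
    name: str
    age: int
    marital_status: str
    annual_income_eur: int
    address: str
    contract_value_eur: int
    last_service_contact: date

# Hypothetical phrase list: a stand-in for a real language model.
RISK_SIGNALS = {
    "job_loss": ("lost my job", "laid off", "unemployed"),
    "divorce": ("divorce",),
}

def extract_risk_signals(text: str) -> list[str]:
    """Return the sorted names of all risk signals mentioned in the text."""
    lowered = text.lower()
    return sorted(
        signal
        for signal, phrases in RISK_SIGNALS.items()
        if any(phrase in lowered for phrase in phrases)
    )

record = ClientRecord(
    "Peter Smith", 34, "married", 48_000, "Main Street 1", 240, date(2020, 10, 1)
)

# Invented email text, for illustration only.
email = (
    "Hi, I lost my job last week and my wife and I are getting divorced. "
    "Could we talk about my contract?"
)

signals = extract_risk_signals(email)
print("Structured record still says:", record.marital_status, record.annual_income_eur)
print("Signals found in the email:", signals)
```

The structured record still describes a married client with a steady income, while the email surfaces the two events that actually matter for a payment-risk estimate. A real system would replace the phrase list with a language model that generalises beyond exact wording.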

Call to Action

Bottom line: we have to let go of the idea that we can only do analysis on structured data. With modern machine learning techniques, we are able to obtain much richer and deeper insights from raw, unstructured documents. Organisations that insist on using only structured data will leave much of the energy for their own digital transformation trapped underground. It is time to turn the tide and begin harvesting the nearly unlimited data that will drive our organisations into the digital future.

At Hemisphere, we help organisations unlock the potential of unstructured textual data for better decision-making. Want to know more and see what we can do for you? Or interested in working with us? Contact info@hemisphere.ai

1. The Digitization of the World, From Edge to Core, An IDC White Paper, David Reinsel, John Gantz, and John Rydning, November 2018, https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

2. What Is Unstructured Data And Why Is It So Important To Businesses?, Bernard Marr, Forbes, Oct 16, 2019, https://www.forbes.com/sites/bernardmarr/2019/10/16/what-is-unstructured-data-and-why-is-it-so-important-to-businesses-an-easy-explanation-for-anyone/#6e73011c15f6


Written by Thomas Schijf