Roads to Insight

David Montag
Dec 18, 2014

Thoughts on dark data, and why you should look into it before big data

Disclaimer: I work for Neo Technology, commercial backer of Neo4j, the leading graph database.

A few days ago, our CEO Emil Eifrém wrote a piece on dark data. Dark data is defined by Gartner as data that is collected as part of various processes but not commonly used. I would like to elaborate on this notion, and compare it to big data.

The premise of big data is: collect as much relevant data as you can, then mine it for insights that can inform decision making. The massive needs for storage and compute capacity in turn become the main technology driver. This is reflected in the strong traction that scalable batch-compute technologies have seen, including Hadoop, Spark, and even aggregate-oriented databases such as MongoDB.

However, the required investment in infrastructure is steep. Large amounts of data need to be stored, organized, and mined. Anecdotally, the lead time from project conception to actionable insights is often long, and getting there is costly.

One of the reasons we attempt this data mining is that our competition is doing the exact same thing, and if they learn something we don’t know, we’ll be at a disadvantage. Undeniably, there are insights to be found in big data, but a question remains: can we do something else before plunging into this massive, opaque undertaking?

The allure of big data is strong: collect and analyze vast amounts of data, with the promise of CSI-grade “enhance” capabilities. However, first we should ask ourselves: are we doing the most with the data that we already have?

Looking for a practical example? Let’s recall how Google came about. Hyperlinks were not really being used by the large search engines back then, and Google realized that they could do something smart with that data. And they did, by looking at the web as a graph, not as a collection of documents. What they didn’t do was say “we need more data to build a better service.”

Dark data is data that you are already collecting but not using. Or, you may be using two datasets in isolation without ever having cross-referenced them. The technology driver behind dark data mining is starkly different from the one behind big data. Massive scale is seldom a concern; rather, it’s about deeper insights into and across available data sources.
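To make the cross-referencing idea concrete, here is a minimal sketch in plain Python. The two datasets (support tickets and cancellations), their field names, and the function `cancellation_rate_by_topic` are all invented for illustration; the point is only that two small datasets already on hand can answer a question that neither answers alone.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
support_tickets = [
    {"customer_id": "c1", "topic": "billing"},
    {"customer_id": "c2", "topic": "login"},
    {"customer_id": "c1", "topic": "billing"},
    {"customer_id": "c3", "topic": "billing"},
]
cancellations = [
    {"customer_id": "c1"},
    {"customer_id": "c3"},
]


def cancellation_rate_by_topic(tickets, cancelled):
    """For each ticket topic, return the share of distinct customers who later cancelled."""
    cancelled_ids = {row["customer_id"] for row in cancelled}

    # Group distinct customers by the topic they raised a ticket about.
    customers_by_topic = defaultdict(set)
    for ticket in tickets:
        customers_by_topic[ticket["topic"]].add(ticket["customer_id"])

    # Cross-reference each group against the cancellation dataset.
    return {
        topic: len(ids & cancelled_ids) / len(ids)
        for topic, ids in customers_by_topic.items()
    }


rates = cancellation_rate_by_topic(support_tickets, cancellations)
for topic, rate in sorted(rates.items()):
    print(f"{topic}: {rate:.0%} of customers cancelled")
```

Nothing here needs a cluster. The work is in knowing that the two datasets share a customer identifier and that joining them is worth doing.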

The barrier to entry is extremely low compared to big data, and the key to success is twofold:

  1. Solid understanding of one’s own data, and how the various data sources fit together.
  2. Technology capable of mining deep insights across the relationships in the data. Neo4j is an excellent tool for this.
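As a rough illustration of what mining across relationships means, the sketch below walks a tiny, made-up "works with" network in plain Python and finds everyone within two hops of a given person. The names, the `works_with` data, and the `within_hops` helper are assumptions for this example. A graph database would typically express this as a query rather than a hand-written traversal.

```python
from collections import deque

# Hypothetical relationship data: who works closely with whom.
works_with = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}


def within_hops(graph, start, max_hops):
    """Return {person: hop distance} for everyone reachable from start in at most max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        # Do not expand beyond the hop limit.
        if distances[person] == max_hops:
            continue
        for neighbor in graph.get(person, []):
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    del distances[start]  # the starting person is not part of the result
    return distances


nearby = within_hops(works_with, "ann", 2)
print(nearby)
```

If "ann" had just left the company, a result like this would be a starting point for asking who else might be affected. That question is about the connections between records, not about any single record.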

By looking inward at the data you already have, you can gain insights with comparatively small investments in infrastructure, and the payoff can be instant. It may be an understanding of how to support your customers better, or (as Emil cited in his article) what may be causing your employee turnover.

Lastly, I will be diplomatic (and truthful) and say that there is a place for both big data and dark data mining. I will, however, recommend that you first make the most of the data you have before investing in the collection of even more.

Contact me @dmontag if you have questions or comments.

David Montag

Field Engineering at Neo4j. We help bring graphs to the world.