Why your database must fit your data

Jan 13, 2015

How using the right tool for the right job can save you time and money

For a summary, read the first paragraph, the image captions and the conclusion.

Today’s leading businesses usually consider their data a core asset. Moreover, the kinds of data we deal with have been changing, but the traditional ways of storing it have not. Can we get more value from our data, e.g. new insights and services, by changing the way we store it?

In order to pick a suitable technology to store our data, we should first understand the concept of data shape. It consists mainly of three things: connectedness, volume and semi-structure.

Connectedness

Let’s imagine a retail invoice processing system 15 years ago. Back then, we probably cared mostly that the billed amount was correctly noted in our accounting system. But if that’s the only thing we’re doing with an invoice today, we’re missing out.

To begin with, an invoice has a customer attached to it. This customer may have family and friends with interests relevant to us. Who is an influencer in this person’s social sphere? What else can we learn about them?

The invoice also has items, and these are commonly organized into categories. What are bestsellers in those categories? Can we recommend an item that’s similar but better, and perhaps costs a little bit more? What else would go well with this shopping basket, based on what other customers have bought?
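To make the "what other customers have bought" idea concrete, here is a minimal sketch that counts which items share an invoice and ranks suggestions for a basket. The invoice data and the function names (co_purchases, recommend) are made up for illustration; a production recommender would weigh far more signals than raw co-occurrence.

```python
from collections import Counter
from itertools import permutations

# Hypothetical invoices: each invoice id maps to the set of items bought.
invoices = {
    "inv-1": {"tent", "sleeping bag", "stove"},
    "inv-2": {"tent", "sleeping bag"},
    "inv-3": {"tent", "lantern"},
    "inv-4": {"stove", "lantern"},
}


def co_purchases(invoices):
    """Count how often each ordered pair of items appears on the same invoice."""
    counts = {}
    for items in invoices.values():
        for a, b in permutations(sorted(items), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(basket, counts, limit=3):
    """Rank items outside the basket by how often they co-occur with it."""
    scores = Counter()
    for item in basket:
        for other, n in counts.get(item, {}).items():
            if other not in basket:
                scores[other] += n
    return [item for item, _ in scores.most_common(limit)]


# "sleeping bag" shares two invoices with "tent", more than any other item.
print(recommend({"tent"}, co_purchases(invoices), limit=1))  # ['sleeping bag']
```

Note that every question here is really a question about relationships: invoice to item, item to item, and eventually customer to customer.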

If we’re not considering these points, we’re missing a valuable opportunity to get a leg up on the competition. Or worse, our competitors may already be doing this. And this is just one or a few levels out into the data. More importantly, all of this can be learned from data we already have.

Volume

Exponential growth in data is breaking systems that previously worked great

Not only is the connectedness of data increasing, so is the volume. And to be clear, it is exponential growth we’re talking about. Exponential growth is a very unnatural and unintuitive kind of growth that often surprises us, sometimes even before we have time to act.

To give an example, let’s say you are sitting in a top row seat of a large football stadium. In the middle of this stadium we place a single drop of water on the grass. But it’s not just any water: it’s magical water that doubles in size every minute. How long do you think it takes until the playing field is covered in water, starting from that one drop? Just answering that question is hard enough, right? It takes a bit less than 40 minutes. At that point you’ve waited quite a while, and that small body of water down there doesn’t look very intimidating. Except that at about 45 minutes the whole stadium will be filled to the top. By the time you realize the danger you’re in, you may no longer have time to get out.
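The arithmetic behind the story is easy to check. The sketch below makes no assumption about the size of the drop or the stadium; it only shows what doubling once per minute implies for the last few minutes. The function names are made up for illustration.

```python
import math


def size_after(start, minutes):
    """Size after the given number of one-minute doublings."""
    return start * 2 ** minutes


def doublings_needed(growth_factor):
    """Number of doublings needed to grow by at least growth_factor."""
    return math.ceil(math.log2(growth_factor))


# Five doublings (minute 40 to minute 45 in the story) mean 32 times more water.
print(size_after(1, 5))  # 32

# Put differently: when the water is at 1/32 of capacity, only 5 minutes remain.
print(doublings_needed(32))  # 5

# One minute before the stadium is full, it is still half empty.
print(size_after(1, 44) / size_after(1, 45))  # 0.5
```

That last line is the uncomfortable part: for almost the entire wait, the problem looks small.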

That was dramatic, but that’s literally the type of growth we’re dealing with in business today. It’s the now-classic story of how a business that has been working fine for years (sometimes decades) on a relational database suddenly finds itself stuck in proverbial technology quicksand. Make a wrong move, and it could cost you dearly in time and money.

Semi-structure

We’re moving away from knowing everything about something, to knowing something about everything.

A fragmented view of the world results in sparse tables and poor performance

A person on Facebook, or a piece of network equipment, or a car part, might have up to 1,000 possible attributes, but you may only know 10 of them on average. Historically we have stored this data in a massive table (spreadsheet) with 1,000 columns, each row having on average 10 cells filled in. That’s not making efficient use of space. Moreover, a new project may need to add another 100 possible attributes. Do we just tack on another 100 columns to that massive table?
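Here is a small sketch of that trade-off, using the numbers from the example above (1,000 possible attributes, about 10 known). The attribute names and values are placeholders, and a plain Python dict stands in for any store that keeps only the attributes it actually knows.

```python
# Placeholder attribute names; the 1,000 and 10 come from the example above.
POSSIBLE_ATTRIBUTES = [f"attr_{i}" for i in range(1000)]


def as_wide_row(known):
    """Wide-table style: one cell per possible attribute, mostly empty (None)."""
    return [known.get(name) for name in POSSIBLE_ATTRIBUTES]


def fill_ratio(row):
    """Fraction of cells in a row that actually hold a value."""
    return sum(cell is not None for cell in row) / len(row)


# Semi-structured style: store only what is known about this record.
person = {f"attr_{i}": f"value_{i}" for i in range(10)}

wide = as_wide_row(person)
print(len(wide), fill_ratio(wide))  # 1000 0.01
print(len(person))                  # 10

# A new project's attribute is just another key; no extra columns are needed.
person["attr_new_project"] = "some value"
```

In the wide form, 99% of every row is empty; in the sparse form, the record is exactly as big as what we know.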

These are problems that relational database developers and administrators battle with. The net result is a significant increase in the cost and time invested to produce a certain outcome. This traditional approach of optimizing old tools reminds me of a famous quote often attributed to Henry Ford: “If I’d asked people what they wanted, they would have asked for a faster horse.” Now, I’m not saying relational databases are stone age technology; au contraire, they’re very advanced. But are they the best tool for solving most modern business challenges? Will they put you ahead of the competition? No.

Data/Storage Fit

Now that we understand the concept of data shape (connectedness, volume, semi-structure), we can talk about data/storage fit. We want our database to fit the shape of our data; we want it to match its characteristics. When quantified, we call this data/storage fit. If we can achieve a good fit, we can save time and money with little effort, and gain new insights. I have illustrated this phenomenon in a few charts below.

Let’s consider a hypothetical example. We’re building a real-time recommendation engine, and we choose a key-value store (such as Redis) for the task. The data is highly connected, and we’ll be drawing deeply on those connections to provide an accurate recommendation. If we attempt to store that data in a key-value store, which knows nothing about relationships, we would end up with a very low data/storage fit (left side of chart). We will then spend a comparatively large amount of time and money on making that recommendation system work, and for any changes we make we will continue to pay that tax.

If we instead choose a technology that fits this data really well, such as Neo4j, we will achieve a high data/storage fit (right side of chart). We will spend little time on building, maintaining, and evolving that system. This is where we want to be.

By striving for high data/storage fit, we can eliminate a lot of unnecessary waste and instead spend that time on being more productive

By moving toward a better data/storage fit, we can eliminate a lot of waste. In the Redis recommendation engine case, we might spend time on building and maintaining separate views of our data, keeping the data consistent, and hand-coding the algorithms. This is work that doesn’t produce any additional benefits beyond the recommendation engine itself. It’s waste. If we can take most of that time and put it into producing actual value (e.g. new features that we can sell, new insights we can learn from, etc.), we can significantly increase our productivity, and ultimately profit.
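To illustrate the kind of waste described here, the sketch below builds a tiny recommendation lookup on top of a key-value store. A plain Python dict stands in for the store (no real Redis commands are used), and the key names, customers, and functions are made up for illustration.

```python
from collections import Counter

# A plain dict standing in for a key-value store (illustrative assumption).
store = {}


def record_purchase(customer, item):
    """Write both views by hand: customer -> items and item -> customers."""
    store.setdefault(f"bought:{customer}", set()).add(item)
    # The reverse view must be updated on every write, or recommendations
    # quietly go stale. The store itself knows nothing about this link.
    store.setdefault(f"buyers:{item}", set()).add(customer)


def recommend_for(customer, limit=3):
    """Hand-coded two-hop traversal: my items -> their buyers -> their items."""
    mine = store.get(f"bought:{customer}", set())
    scores = Counter()
    for item in mine:
        for other in store.get(f"buyers:{item}", set()) - {customer}:
            for candidate in store.get(f"bought:{other}", set()) - mine:
                scores[candidate] += 1
    return [item for item, _ in scores.most_common(limit)]


record_purchase("ann", "tent")
record_purchase("bob", "tent")
record_purchase("bob", "stove")
record_purchase("cat", "tent")
record_purchase("cat", "stove")
record_purchase("cat", "lantern")

print(recommend_for("ann", limit=1))  # ['stove']
```

Everything except the last line is plumbing: the duplicate view, the discipline of keeping it consistent, and the traversal logic all live in our own code rather than in the database. Each extra hop in the data adds another nested loop and usually another hand-maintained view.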

Categorizing Shape of Data

The million dollar question still remains. How do you know what shape your data is? Let me present you with another chart!

Some data sources are volume-heavy, but most data sources are connectedness-heavy

Look at the two most prominent characteristics of data, namely connectedness and volume. If we consider data with low connectedness but high volume, we might find things like sensor readings and raw log data. At the other end of the spectrum, with high connectedness and moderate volume, we actually find quite a broad range of use cases. In fact, most companies today (unless you’re in the 0.01% with Google) have a connectedness challenge, not a data volume challenge. There are massive insights to be unlocked in the data you already have.

Conclusion

If you can achieve a good data/storage fit, you can count on gains in time, cost savings, and competitiveness. Combine this with the dark data approach to insights, where you first focus on doing more with the data you already have before hoarding data without a strategy, and you will be on a solid path forward. That’s exactly what companies like eBay, HP, Cisco, Medium(!), Walmart, Johnson & Johnson, TechCrunch, and countless others do with Neo4j. When are you going to stop letting your technology choices hold you back?

Disclaimer: I work for Neo Technology, the company behind Neo4j, the leading graph database.