It’s All About the Data!

Most business applications can be thought of as routines that store, retrieve, and manipulate data. Without data, there’s not much value in the application. It can be easy to assume that an application will know how to deal with the data — that information will magically disappear and reappear in a useful form when you need it.

It’s not magic, it’s strategy. Choosing the best way to store different types of data from the beginning is important. If you choose a storage option that isn’t well suited to the data you’re working with, you might not see any issues at first, but when your data grows by orders of magnitude it’ll make a big difference.

One question to ask when choosing data storage is “How is this information organized?” Is it wrapped up in a nice, neat package? Is it a blob of information that we don’t know what to do with yet but think might be important later?

Let’s walk through how data representing a physical location might be stored for different use cases.

If there’s structure and you know it (clap your hands 👏🏻)

Some information we already know how to store because there’s a convention for how to store it. For describing places we could use longitude and latitude, a common, standard way to store an absolute position on earth. Or, we could simply use an address. It’s widely accepted that an address will have a number, a street, a city, a state, and a zip code.

Since the data is structured in a way that we can expect, we can go about setting up data storage that conforms to these conventions. Relational databases like PostgreSQL, MySQL, and SQL Server excel at this because they store data based on a structure that you know ahead of time. We could create a locations table that stores addresses. Because we know the conventional elements of an address, we can define those as columns and be on our way:

|number|street      |apartment_or_suite|city    |state_code|zip_code|
|123   |Somewhere Dr|#54               |Anyplace|MI        |54321   |
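
As a rough sketch of what that could look like in practice, here is the same table built with Python’s built-in sqlite3 module. SQLite is standing in for PostgreSQL, MySQL, or SQL Server only because it needs no setup, and the column types and NOT NULL rules are my assumptions, not part of any address standard.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# The structure is declared up front; every row has to fit these columns.
# Column types and NOT NULL constraints are assumptions for illustration.
conn.execute(
    """
    CREATE TABLE locations (
        number             TEXT NOT NULL,
        street             TEXT NOT NULL,
        apartment_or_suite TEXT,
        city               TEXT NOT NULL,
        state_code         TEXT NOT NULL,
        zip_code           TEXT NOT NULL
    )
    """
)

# Insert the example address from the table above.
conn.execute(
    "INSERT INTO locations VALUES (?, ?, ?, ?, ?, ?)",
    ("123", "Somewhere Dr", "#54", "Anyplace", "MI", "54321"),
)

# Because the columns are known ahead of time, queries can name them directly.
rows = conn.execute(
    "SELECT number, street, city FROM locations WHERE state_code = ?",
    ("MI",),
).fetchall()
print(rows)
```

The zip code is stored as text on purpose, since it’s an identifier and not a number we’d ever do math on.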

Maybe, maybe not 🤷🏼‍♀️

Sometimes we’re not sure exactly what information we’ll have. There isn’t a well-known or standard way to describe something but we’ve got some information about it. The catch is that we don’t necessarily have the same type of information for each thing we’ll be describing. Let’s say we’re heading to Grand Rapids, MI for vacation and we ask friends for recommendations:

  • You’ve got to check out Founders Brewing Company. From your hotel head south on Monroe until you cross under the highway. Take a left and go a block to the 4-way stop. Take a right, go two blocks, and you’re there. It’s a brick building and you’ll see the beer garden in front. If you see the bus station, you’re in the right place.
  • If you’re going to Grand Rapids you’d be crazy not to visit Lake Michigan. My favorite beach is called Kirk Park. It’s about 40 minutes away. From your hotel head west on Pearl St. That turns into Lake Michigan Drive. Keep driving straight until the road ends at Lake Michigan. Take a left, head south for a bit, and then you’ll see the park sign on the right.

These bits of information are probably enough to get you to Founders and Lake Michigan, but we can’t necessarily anticipate every characteristic of the directions that someone might give us. Document-oriented databases like MongoDB and CouchDB are a good fit for this because they don’t enforce a predefined structure. They allow you to assign structure individually within each document.
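
To make that concrete, here is a minimal sketch of the two recommendations as documents. Plain Python dictionaries stand in for what a document store would hold, and every field name (directions, landmarks, drive_time_minutes, and so on) is something I made up for illustration.

```python
import json

# Each document carries only the fields we happen to know about that place.
# All field names here are hypothetical.
founders = {
    "name": "Founders Brewing Company",
    "directions": [
        "Head south on Monroe until you cross under the highway",
        "Take a left and go a block to the 4-way stop",
        "Take a right and go two blocks",
    ],
    "landmarks": ["brick building", "beer garden in front", "bus station"],
}

kirk_park = {
    "name": "Kirk Park",
    "type": "beach",
    "drive_time_minutes": 40,
    "directions": [
        "Head west on Pearl St, which turns into Lake Michigan Drive",
        "Keep driving straight until the road ends at Lake Michigan",
        "Take a left and head south until the park sign appears on the right",
    ],
}

places = [founders, kirk_park]

# With no shared schema, the reading code has to tolerate missing fields.
for place in places:
    landmarks = place.get("landmarks", [])
    print(place["name"], "-", len(place["directions"]), "steps,", len(landmarks), "landmarks")

# Documents like these are commonly serialized as JSON.
print(json.dumps(kirk_park, indent=2))
```

Notice that the flexibility moves the work somewhere else: nothing stops a document from leaving out a field, so the code that reads it has to decide what to do when it’s missing.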

Guilty by Association 👥

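Maybe we have some information about how a location relates to other locations. I really like Brewery Vivant, a craft brewery in Grand Rapids. I could describe it by its connections: it’s in East Hills, and it’s served by a bus stop on bus route 6. Now suppose a friend asks: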

“What’s the name of that brewery by Congress Elementary? I took bus route 6 to get there last week, but the name escapes me.”

Since we stored entities and their associations, we can quickly see that Congress Elementary School is in East Hills, just like Vivant. We can also see that Brewery Vivant is served by a bus stop that’s included in bus route 6. Given what we know we can suggest that the brewery in question is probably Brewery Vivant.
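
The lookup described above can be sketched with a tiny in-memory graph. This is only an illustration in plain Python, not how any particular graph database stores or queries data. The relationship names (IS_A, LOCATED_IN, SERVED_BY, PART_OF) and the bus stop’s name are hypothetical.

```python
# Each edge is (source, relationship, target).
# Relationship names and "Bus Stop A" are hypothetical placeholders.
edges = [
    ("Brewery Vivant", "IS_A", "Brewery"),
    ("Founders Brewing Company", "IS_A", "Brewery"),
    ("Brewery Vivant", "LOCATED_IN", "East Hills"),
    ("Congress Elementary School", "LOCATED_IN", "East Hills"),
    ("Brewery Vivant", "SERVED_BY", "Bus Stop A"),
    ("Bus Stop A", "PART_OF", "Bus Route 6"),
]


def targets(source, relation):
    """Everything that `source` points at through `relation`."""
    return {t for s, r, t in edges if s == source and r == relation}


def sources(relation, target):
    """Everything that points at `target` through `relation`."""
    return {s for s, r, t in edges if r == relation and t == target}


# Clue 1: it's a brewery.
breweries = sources("IS_A", "Brewery")

# Clue 2: it's in the same neighborhood as Congress Elementary School.
nearby = set()
for neighborhood in targets("Congress Elementary School", "LOCATED_IN"):
    nearby |= sources("LOCATED_IN", neighborhood)

# Clue 3: it's served by a stop that is part of bus route 6.
stops_on_route = sources("PART_OF", "Bus Route 6")
on_route_6 = {s for s, r, t in edges if r == "SERVED_BY" and t in stops_on_route}

# The answer is whatever satisfies all three clues.
candidates = breweries & nearby & on_route_6
print(candidates)  # {'Brewery Vivant'}
```

Each clue on its own is too broad, but following the relationships and intersecting the results narrows things down to a single guess.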

This is how graph databases like Neo4j operate. These databases store information about how things are related so that interconnected data sets can be analyzed quickly. This works really well for things like fraud detection (this credit card number is used an awful lot and constantly gets rejected, which seems fishy) and recommending content (all your friends liked this article, so I bet you would too). These problems don’t necessarily have cut-and-dried answers, but we can come up with educated guesses based on relationships we already know about.

How did we get here?

A friend and I decided to grab a beer after working out at the gym because why wouldn’t you replace those carbs immediately? She beats me there so I ask “How did you get here so fast?” We compare routes and she shares a sneaky back way that avoids a couple of intersections that are always a mess. Genius.

Here we’re asking a question about the events that led up to our arrival at a particular location. We’re not so concerned with describing the location itself. We’re curious about the sequence of events that got us here.

This is a type of stream processing, which is central to products like Kafka. In stream processing, the emphasis is on recording events (a barrage of data, like a stream that keeps flowing) in the order they happened. Usually, we’re concerned with reacting to this data as it arrives. This can be handy when we want to make decisions in real time. It can also be useful if we’d like to replay past events through a new lens.
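
Here is a minimal sketch of that idea: an append-only log of events that can be replayed later to answer different questions. A plain Python list stands in for a real event stream (this is not Kafka’s API), and the Event fields, the helper names, and the events themselves are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int  # position in the log, so order is never ambiguous
    traveler: str
    action: str


# Append-only log: events are added at the end and never edited.
log = []


def record(traveler, action):
    """Append a new event to the end of the log and return it."""
    event = Event(sequence=len(log), traveler=traveler, action=action)
    log.append(event)
    return event


def replay(events, traveler):
    """Re-read the log in order, keeping one traveler's actions."""
    return [e.action for e in events if e.traveler == traveler]


# Hypothetical events from the trip between the gym and the brewery.
record("friend", "left the gym")
record("me", "left the gym")
record("friend", "took the back way")
record("me", "waited at an intersection")
record("me", "waited at an intersection")
record("friend", "arrived")
record("me", "arrived")

# Lens 1: how did my friend get here?
friend_route = replay(log, "friend")
print(friend_route)

# Lens 2: same events, different question. How many waits happened?
waits = sum(1 for e in log if e.action == "waited at an intersection")
print(waits)
```

Because the log keeps every event in order and never rewrites history, new questions can be asked of old data without having planned for them up front.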

This one’s a little different from the previous examples because it’s concerned with what we’re storing and not necessarily how we’re storing it. However, the use case for storing streaming events can give us some clues about how the data should be structured. Lots of marketers like to use streams of events to model conversion pipelines, for example. Because those models tend to be revised as new information becomes available, it’s probably a good idea to choose document-oriented storage so that the structure of the event data can change at a moment’s notice. Segment’s API is a good example of this.

You’ve got options!

Based on the data you’ll be working with, you have some options about which storage tool makes the most sense.

  • Relational databases work best when you have highly structured data.
  • Document-oriented databases give you some wiggle room about what information you’re storing.
  • Graph databases help answer questions about relationships between information.
  • Stream processing allows you to ingest large amounts of granular data to make informed decisions now and later.

~Dan, Software Developer