Evaluate Big Data with These Five Qualities


One of the top business buzzwords of the last decade has to be “big data.” The term has escaped Silicon Valley and now rears its head everywhere, in every industry.

First, what is big data? Simply put, big data is a collection of data — a big pile of information. Sometimes, it’s structured, organized into a nice searchable format. This could be a spreadsheet with the sales data for every single book sold at Barnes & Noble. Other times, big data comes in an unstructured format, like a backup of someone’s emails. There’s a lot of valuable information in those emails, but it’s not easily searched.
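
To make that distinction concrete, here’s a minimal sketch in Python (the records and emails are invented for illustration): the structured data answers a question with a single expression, while the unstructured data only supports crude keyword searching.

```python
# Structured data: uniform fields, trivially queryable.
book_sales = [
    {"title": "Book A", "store": "NYC", "copies": 120},
    {"title": "Book B", "store": "NYC", "copies": 45},
]
nyc_total = sum(row["copies"] for row in book_sales if row["store"] == "NYC")

# Unstructured data: the same facts may be buried in free text,
# and a keyword search is the best we can do without more work.
email_backup = [
    "Hi, we moved 120 copies of Book A at the NYC store last week.",
    "Reminder: staff meeting on Friday.",
]
mentions = [msg for msg in email_backup if "Book A" in msg]
```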

Big data holds a lot of potential value — not just for existing, but for the insights and conclusions that can be mined from it. Big data is like a field of soil: fertile, but needing a worker to put in the effort to tease a delicious bounty from it.

There’s nothing magical about big data, just like there’s nothing magical about a field full of dirt. Just accumulating a bunch of data and expecting it to magically yield conclusions is a bit like piling up a bunch of dirt in your driveway and expecting it to start growing crops. Probably not going to happen.

But similarly, while you could try to grow a plant in a teacup full of dirt, it probably won’t yield much of a harvest. Big data can be recognized by five distinct dimensions:

  • Velocity
  • Volume
  • Variety
  • Value
  • Veracity

Each of these qualities is a spectrum; it’s not that one dataset has volume while another does not. One dataset may simply have more volume than another.

Let’s dive into each dimension of big data.

Velocity


Velocity describes how fast your big data is growing: how quickly are you generating more data?

Predictions made from big data are more valuable if they can update and adjust as new data flows in. The faster you bring in new data, the sooner your predictions can adjust.

Imagine you’re a bookseller, and you look at a spreadsheet of all books sold last month. “Ah!” you say to yourself. “A new bestseller is at the top of the charts! I should make sure to keep plenty of copies in stock this month!”

That’s a good prediction — but what if you received updates more frequently? What if, instead of looking at last month’s results, you could look at last week’s results?

With that increase in velocity, you see that demand for that bestseller is already cooling — and instead, people have started buying the author’s other, older books! You adjust your order. Instead of just purchasing more copies of the bestseller for your shelves, you also order the author’s back catalogue.
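
Here’s a rough sketch of that scenario in Python (the sales numbers are invented): a monthly total hides the trend, while weekly numbers expose it.

```python
# Weekly sales of the bestseller (invented numbers for illustration).
weekly_sales = [900, 700, 400, 200]

monthly_total = sum(weekly_sales)  # 2200 -- at monthly velocity, it looks like a hit

# With weekly velocity, a simple trend check tells a different story.
trend = weekly_sales[-1] - weekly_sales[0]  # -700: demand is falling
if trend < 0:
    print("Bestseller cooling off; order the back catalogue instead.")
else:
    print("Demand holding; restock the bestseller.")
```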

The faster you can generate and analyze data, the more powerful your conclusions. The best big data is rapidly growing and instantly accessible.

Volume


Volume is an easy dimension to understand — the more data, the better! Big data is valuable because there’s enough data there to support drawing conclusions. If you flip a coin five times, you might get all heads. Does this mean that the coin always lands on heads? Would you bet your life savings on it?

Probably not. Five flips is not enough data to support this conclusion. On the other hand, if you flip a coin 10,000 times and get 9,995 heads, you can feel fairly confident betting your life savings that the next flip will come up heads.
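
A quick simulation makes the point; this is a sketch using only Python’s standard library, assuming a fair coin. Run it a few times: the five-flip estimate swings wildly, while the 10,000-flip estimate barely moves. That stability is what volume buys you.

```python
import random

def heads_fraction(flips: int) -> float:
    """Flip a fair coin `flips` times and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(flips)) / flips

# With 5 flips, the estimate is all over the place; with 10,000 it hugs 0.5.
for n in (5, 100, 10_000):
    print(n, round(heads_fraction(n), 3))
```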

Volume adds a challenge to analyzing big data — when you have this much raw information, where do you put it? It’s tough to manage, but it’s also useful — more volume means more power for drawing conclusions and making predictions from a dataset.

Variety


Variety describes the different types of data in a “big data” dataset.

Imagine two datasets:

  • A set of a million different phone numbers.
  • A set of a hundred thousand identities, each with a phone number, name, address, Facebook profile photo, and favorite flavor of ice cream.

The first dataset is ten times as big as the second, based solely on the number of entries. The second dataset, however, has much more variety — and if you were trying to make a prediction (such as what flavor of ice cream sells best to women in the Southwest United States), the second dataset is much more valuable.
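
As a sketch (all records here are invented), the varied dataset supports a grouped query that the phone-number list simply cannot answer:

```python
from collections import Counter

# The phone-number-only dataset: big, but one-dimensional.
phone_numbers = ["555-0100", "555-0101"]  # ...imagine a million of these

# The varied dataset: fewer entries, but many dimensions per entry.
identities = [
    {"name": "Ana",  "region": "Southwest", "gender": "F", "flavor": "pistachio"},
    {"name": "Bea",  "region": "Southwest", "gender": "F", "flavor": "pistachio"},
    {"name": "Carl", "region": "Northeast", "gender": "M", "flavor": "vanilla"},
]

# Which flavor sells best to women in the Southwest? Only the varied
# dataset can answer, because only it carries the needed dimensions.
counts = Counter(
    p["flavor"] for p in identities
    if p["region"] == "Southwest" and p["gender"] == "F"
)
print(counts.most_common(1))  # [('pistachio', 2)]
```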

The best, most powerful prediction tools are able to use many different types of data in combination to fuel their guesses. Providing variety in big data makes it more valuable for these tools.

Value


Value is, perhaps, the hardest dimension of any “big data” set to measure. How do you determine the worth of a set of data?

Usually, the best way to predict the value of a “big data” set is to compare the cost of gathering that data to the benefits that are derived from its analysis. This is easier with data which has already yielded past conclusions — if sales data helped a company avoid overstocking some product in the past, it’s likely to continue to be valuable to collect in the future.

However, there’s always a bit of a chicken-and-egg conundrum: this data may turn out to be valuable, but we won’t know until after we’ve already spent the money on collecting, storing, and analyzing it. This is where a cost-benefit analysis comes in. Can the data be acquired and analyzed cheaply, at least for a pilot study to determine its potential value?
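
A back-of-the-envelope version of that cost-benefit check might look like the sketch below; all of the figures are hypothetical.

```python
def pilot_roi(collection_cost: float, storage_cost: float,
              analysis_cost: float, expected_benefit: float) -> float:
    """Return the ratio of expected benefit to total cost for a pilot study."""
    total_cost = collection_cost + storage_cost + analysis_cost
    return expected_benefit / total_cost

# Hypothetical pilot: $5k to collect, $1k to store, $4k to analyze,
# versus $30k saved by avoiding one overstocked product.
print(pilot_roi(5_000, 1_000, 4_000, 30_000))  # 3.0 -- worth pursuing
```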

Veracity


Veracity describes the trustworthiness or quality of the data. Imagine if your big data comes from a social media network, where you’ve scraped millions of posts to look for keywords about a product. Are all of those posts going to contain perfect spelling, or will there be misspellings? Are all reviews on a sales site going to be genuine, or are some reviews left by paid shills to pad the scores? If you’re using location tracking to monitor foot traffic in an area, how do you account for inconsistencies, like GPS drift?

Every set of big data will have some concerns about inaccuracy, either intentional (lies) or accidental (misspellings, GPS drift). The best way to counter concerns about big data veracity is to add in another data type for verification — if you’re using GPS data to track foot traffic in an area, for example, you may also want to include sales and customer data from businesses in the area to verify that your GPS data is accurate.
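
Here’s a minimal sketch of that kind of cross-check, assuming you have daily foot-traffic counts from GPS and transaction counts from nearby businesses (both invented here): days where the two sources wildly disagree get flagged for review.

```python
# Daily visitor counts from GPS pings vs. transactions reported by
# nearby businesses (invented numbers for illustration).
gps_visits   = {"Mon": 480, "Tue": 500, "Wed": 2400, "Thu": 510}
sales_counts = {"Mon": 120, "Tue": 130, "Wed":  125, "Thu": 135}

# If GPS traffic jumps but sales don't, suspect drift or bad pings.
for day in gps_visits:
    ratio = gps_visits[day] / sales_counts[day]
    if ratio > 8:  # threshold is a guess; tune it against known-good days
        print(f"{day}: GPS count looks suspect (ratio {ratio:.1f})")
```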

Putting it all together to evaluate big data

Velocity + Volume + Variety + Veracity = Value

By measuring these dimensions of big data, it’s possible to estimate the value of a dataset. Many startups and businesses these days are valued not just by the business they perform, but also by the data they accumulate.

Personal genetics company 23andMe, for example, isn’t valued at $1.75 billion because it sells personal testing kits for $200 each; its high valuation comes from the huge volume of varied data it gathers on its customers. Similarly, Facebook is so successful at targeting advertisements at its users because of all the data it has about them — Facebook knows what other websites you visit, where you live, what you like (and dislike), and which friends you consider closest, along with many other bits of information. That level of big data gives Facebook a huge advantage over other advertising companies, like the ones who stick junk advertisements in your mailbox.
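
One crude way to operationalize the formula above is a scoring rubric: rate each of the four input V’s and let the sum stand in for value. This is a sketch, not an industry standard; the 1-to-5 scale is invented.

```python
def data_value_score(velocity: int, volume: int,
                     variety: int, veracity: int) -> int:
    """Sum 1-5 ratings of the four input V's into a rough value score."""
    for score in (velocity, volume, variety, veracity):
        assert 1 <= score <= 5, "rate each dimension from 1 to 5"
    return velocity + volume + variety + veracity

# A fast-growing, huge, varied, but noisy dataset:
print(data_value_score(velocity=5, volume=5, variety=4, veracity=2))  # 16/20
```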

However, keep in mind that no set of data is infallible. British economist and Nobel laureate Ronald Coase famously remarked, “Torture the data long enough, and it will confess.” In more recent years, the internet has appended an ending to his famous quote: “Torture the data, and it will confess to anything.”

Even a big dataset can be twisted to support unlikely, even preposterous conclusions. This is why conclusions drawn from a single dataset shouldn’t always be trusted; it’s still prudent to verify and corroborate them using multiple sets and sources of data.

Next time you hear about a conclusion mined from “big data”, consider the five V’s of the dataset, and use this information to guide your evaluation and trust of the conclusions!

“Torture the data long enough, and it will confess.” -Ronald Coase
“Torture the data, and it will confess to anything.” -Internet