The Hardest Problem in Data

Ronnie
3 min read · Oct 28, 2016


Questions by Oberazzi

Data science is an increasingly popular specialization, and demand for data analysts and data engineers is growing rapidly. When I staffed the Slack booth at Grace Hopper this past week, one of the most common questions from prospective applicants was about working in data science.

Big Data is about wrangling large amounts of information into digestible, actionable data sets. It’s about finding the signal in the noise. It satisfies that incredibly human itch to understand. So when people ask me to explain what a job in data is like, many of them are imagining using data mining to build complex models. They invariably look a little deflated when I reveal to them the dark secret that the hardest problem in data is counting.

Counting

Counting is something we all learn at an early age. Toddlers can count. Elephants can count. Crows can count. Even salamanders have been shown to count. So how come data science teams aren’t filled with rambunctious corvids or slick-skinned amphibians? Allow me to make a case for counting being harder than it seems at first glance.

I’m going to ask you a simple question that requires only your ability to count to a finite integer. Once you come up with your number, I’m going to ask you to write it down and then prove to me that it’s correct. Are you ready? Here we go:

How many friends do you have?

Write down your number. Okay, now comes the fun part.

Will I get the same number

  • if I ask every person you know if they consider you their friend?
  • if I ask the people you counted as friends how many friends you have?
  • if I count the number of people on your Facebook friends list?
  • if I count the number of people on your Snapchat friends list?
  • if I add up all of your work friends, school friends, and social friends?
  • if I count the number of people you invited to your last birthday party?
  • if I count the number of people who invited you to their last birthday party?
  • if I count the people you interacted with in the last year and subtract all the work and logistical interactions?
  • if I count the number of people you would tell a secret to?
  • if I count the number of people that you would ask to help you move?
  • if I count the number of people that you would help move?

Most people will have a hard time giving the exact same number to every one of these questions because the way that we define what a friend is changes for each of these examples.

So easy an elephant could do it?

African Elephant by Ivan Svatko

In much the same way, the definition of a user changes depending on who is asking and what they are using that number for. An essentialist might claim it’s the number of rows in your users table. Operations may count logins. Finance may count active accounts. And how should you define active, anyway? Now consider that the number of users for your product is a value that most of your other data depends on. Then add the complication that many of those values are gathered from multiple sources, that every one of those sources needs to agree on common definitions, and that each one has to be accurately countable from its component parts. The scope of the problem starts to become clear.
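To make that concrete, here is a minimal Python sketch of several reasonable definitions of "user" applied to the same tiny data set. Everything in it is hypothetical and made up for illustration: the `users` and `logins` records, the field names, the `deactivated` flag, and the 30-day window for "active" are my assumptions, not anyone's real schema or policy.

```python
from datetime import date, timedelta

# Hypothetical records; every field name and value is invented for illustration.
users = [
    {"id": 1, "deactivated": False},
    {"id": 2, "deactivated": False},
    {"id": 3, "deactivated": True},
    {"id": 4, "deactivated": False},
]
logins = [
    {"user_id": 1, "day": date(2016, 10, 26)},
    {"user_id": 1, "day": date(2016, 10, 27)},
    {"user_id": 1, "day": date(2016, 10, 28)},
    {"user_id": 2, "day": date(2016, 9, 1)},
    {"user_id": 3, "day": date(2016, 10, 20)},
]
today = date(2016, 10, 28)


def count_rows(users):
    """The 'essentialist' answer: one user per row in the users table."""
    return len(users)


def count_logins(logins):
    """One possible operations answer: every login event counts."""
    return len(logins)


def count_active_accounts(users):
    """One possible finance answer: accounts that have not been deactivated."""
    return sum(1 for u in users if not u["deactivated"])


def count_recently_active(users, logins, today, window_days=30):
    """Assumed definition of 'active': not deactivated AND logged in within the window."""
    cutoff = today - timedelta(days=window_days)
    recent_ids = {l["user_id"] for l in logins if l["day"] >= cutoff}
    return sum(1 for u in users if not u["deactivated"] and u["id"] in recent_ids)


counts = {
    "rows in users table": count_rows(users),
    "login events": count_logins(logins),
    "active accounts": count_active_accounts(users),
    "active in last 30 days": count_recently_active(users, logins, today),
}

for definition, n in counts.items():
    print(f"{definition}: {n}")
```

With these made-up records, the four definitions give 4, 5, 3, and 1. Same data, same question, four different answers, and none of them is wrong until somebody agrees on what a user is.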

Hopefully this example has convinced you that counting is actually a challenging data problem worthy of your time. Unfortunately, this may mean that it will be some time yet before our gentle pachyderm friends can work alongside us in Big Data.

Interested in data challenges at Slack? We have open positions in data analytics and data engineering.
