Big Data: A cheat sheet for the rest of us

Published in The Chic Geek
5 min read · Oct 11, 2017

Karina Korpela

In the last couple of years there has been a lot of buzz around data, especially Big Data. By now you have probably heard Big Data referenced in TV shows, at conferences, or in magazines, or you might even know someone working with it.

Dan Ariely of Duke University once tweeted: “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

Yes, not all data is Big Data. In this article, we’ll provide you with a cheat sheet to understand what it is and boost your confidence when discussing the subject.

Did you know that Big Data can come from you? It is data about your online purchases, your Twitter likes, your Facebook photos, your Google searches, your medical records, your calls, your GPS navigation, your Fitbit and so on. Anything connected to the internet generates data. A lot of it.

Big Data checklist

The first and foremost characteristic of Big D. is its ability to make you feel like you are standing on slippery stones in a rushing torrent, trying to catch a fish barehanded, and with no idea whether there are any fish in this particular stream! If a dataset doesn’t make you feel like that, then, my friend, you are most likely NOT dealing with Big Data.

As in a torrent, volume (or size) is just part of it. Let’s recap the definition of ‘torrent’: a strong and fast-moving stream of water or other liquid.

Now let’s adapt it to Big Data: A strong and fast-moving stream of multi-structured data.

Not sure yet?

Well, the general consensus is that there are specific attributes that define Big Data. They’re referred to as the four V’s: volume, variety, velocity, veracity. IBM also adds a fifth ‘V’ for “Value” but, honestly, all data analytics, big or not, should add value to the business, research activities, and so on.

And to assist you even further, here are a few questions, based on these V’s, to help you understand whether your company does indeed use Big Data:

  • Is the data collected too large for a traditional SQL database?
  • Is your data a Frankenstein, spread over multiple files, servers and/or geographical locations?
  • Does your data have a much longer and uncertain life span?
  • Can you easily recover your data in the case of corruption without having to re-perform any transformation?
  • Does it contain audio, videos or images?
  • Does it require an immediate response, as in High-Frequency Trading (HFT)?
  • Is it being generated in real time, as on social media platforms, IoT devices or other internal sensors?
  • Do you need a transformation tool to make it identifiable and legible?
  • Do you need a reduction tool to make it more manageable?
  • Is the range of potential correlations and relationships between disparate data sources too great for any analyst to test all hypotheses?
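If you like to see things written down, here is a minimal Python sketch of the checklist above. It only lists which questions you answered "yes" to. The article does not define a score or a cut-off, so none is applied here. The labels, the function name `yes_items` and the sample answers are all made up for illustration.

```python
# The ten checklist questions, shortened to labels (labels are illustrative).
CHECKLIST = [
    "too large for a traditional SQL database",
    "spread over multiple files, servers or locations",
    "longer and uncertain life span",
    "easy to recover after corruption without redoing transformations",
    "contains audio, videos or images",
    "requires an immediate response",
    "generated in real time",
    "needs a transformation tool to be legible",
    "needs a reduction tool to be manageable",
    "too many possible correlations for an analyst to test",
]


def yes_items(answers):
    """Return the checklist labels that were answered 'yes' (True)."""
    if len(answers) != len(CHECKLIST):
        raise ValueError("expected one answer per checklist question")
    return [label for label, ans in zip(CHECKLIST, answers) if ans]


if __name__ == "__main__":
    # Made-up answers for an imaginary company, purely for illustration.
    answers = [True, True, False, False, True, False, True, True, False, True]
    matched = yes_items(answers)
    print(f"{len(matched)} of {len(CHECKLIST)} questions answered yes:")
    for label in matched:
        print(" -", label)
```

How you read the result is up to you: the questions are conversation starters, not a formal test.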

Check out the link below for a sample Big Data dataset that would meet some of these criteria:

http://www.datasciencecentral.com/profiles/blogs/two-big-datasets-to-challenge-your-data-science-expertise

It is important to note that not all organizations have or use Big Data yet, and there is still a lot that can be done with non-Big Data.

Furthermore, you DO NOT need to be a statistician or math wizard to start playing with data. All you need is a curious and patient mind. Data Science does require a certain level of Zen and perseverance in solving puzzles. This is your base; from here, all other skills can be acquired.

Do you want to play with data?

Data science is a varied field with numerous opportunities available to you. These are the four main areas within Data Science that might interest you:

1. Data Engineers: gather, collect and store data. Their ultimate goal is to provide clean, usable data to whoever may require it. A Data Engineer would have knowledge of RDBMS, Hadoop, NoSQL, Spark and Python.

2. Data Analysts: automate insights into a given dataset and support the use of queries and data aggregation procedures. Data Analysts have strong data wrangling skills and are good at gathering business requirements. They often work as liaisons between end users and the other data areas. Good to know: SQL, Datawatch, Alteryx, Python, R, Trifacta, Google Analytics, SAS, SAP HANA, etc.

3. Data Scientists: Here is where it gets tricky. The community hasn’t yet settled on a definition. Some say that Data Science comprises everything that relates to data, including engineering, analytics, etc., while others believe Data Science is an area within the data universe where statistical and machine learning algorithms are applied to solve problems and answer business questions. I like the definition given by Pete Skomoroch, Principal Data Scientist at LinkedIn: “Data scientists are better statisticians than most programmers and better programmers than most statisticians.”

4. Data Visualization: this area has long been embedded in all the others but, lately, it has started to become its own pillar within the data universe. Data visualization is about combining design principles with storytelling skills to provide the end user with visual insights. The most used tools are: Tableau, Microsoft Power BI, QlikView, R, Google Charts, SAP Lumira, SAS, Mapbox, etc.

The truth is there is a lot of overlap among all four areas. As these niches mature, we will see clearer definitions, and roles will be separated more consistently within organizations. For now, a well-rounded data professional will know a bit of each of the pillars above and be an expert in one or two particular areas. Which area will you choose?


The Chic Geek is a Calgary-based non-profit building a supportive community for women at the intersection of technology and entrepreneurship. http://www.thechicgeek.ca