Big Data & Data Science: Two Efforts Vying for a Shared Goal

Wizeline Engineering
4 min read · Jun 1, 2017

Welcome to part three in our Demystifying Data Science series! In the previous posts we discussed what Data Science is and what a Data Science project requires. Now we will draw a distinction between two data initiatives that people often confuse: Big Data and Data Science.

Nearly every tool and service we use consumes or generates data. Take a step back and think about it: from the emails we exchange and the websites we visit to the products we buy in online stores, we are continuously using data. The Internet of Things (IoT) contributes even more through sensor-enabled wearables, smart homes, and even smart cities. To put this in perspective, in 2015 we generated over 2.5 exabytes each day (an exabyte is 10^18 bytes; that’s right, a 1 followed by 18 zeros). To grasp the idea of an exabyte: if you were to stream the entire Netflix catalog 3,000 times, you would only consume 1 exabyte. The availability of such a massive amount of information represents a huge, untapped opportunity for many organizations.
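To make that scale a little more concrete, here is a small back-of-the-envelope sketch. It assumes the SI definition of an exabyte (10^18 bytes) and simply spreads the 2.5 exabytes per day quoted above evenly over 24 hours; the variable names are ours, chosen for illustration.

```python
# SI definition: 1 exabyte = 10**18 bytes (a 1 followed by 18 zeros).
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

# The 2015 figure quoted above: 2.5 exabytes generated per day.
# Integer arithmetic avoids floating-point rounding on such large numbers.
daily_bytes = 25 * EXABYTE // 10

# Assumption: data is produced at a constant rate throughout the day.
bytes_per_second = daily_bytes // SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / 10**12

print(f"{daily_bytes:,} bytes per day")
print(f"about {terabytes_per_second:.1f} TB per second")
```

Under that even-rate assumption, the daily figure works out to roughly 29 terabytes generated every second.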

Two distinct efforts have emerged to harness the potential of data: Big Data and Data Science. These terms are often used interchangeably despite having fundamental differences. Essentially, Big Data is the (often real-time) collection and processing of massive amounts of varied data. You may have heard of the 3 V’s of Big Data: volume, variety, and velocity. Data Science, on the other hand, involves (among other tasks) creating mathematical models that capture the underlying patterns in complex datasets.

Unfortunately, many companies eager to join the data hype focus on collecting data without having an appropriate strategy in place for either effort. At the end of the day, collecting does not imply discovering, and business value does not increase by solely collecting more data.

The blurry line between Big Data and Data Science

Although we have established the difference between Big Data and Data Science, there are projects in which the two are intertwined. This collaboration will be determined by factors such as:

  • Memory and processing capacity: Training machine learning algorithms requires accessing, storing, and processing large amounts of data, so Big Data skills are needed for such projects.
  • Dataset-model relationship: The amount of data needed to train a model depends on the model’s complexity.

A good example of the dataset-model relationship is the set of considerations that apply when training a Neural Network. This algorithm attempts to mimic the way the brain processes data by stacking layers of connections, and the number of connections can easily scale to hundreds of thousands. If the number of connections in the Neural Network is greater than the number of observations in your dataset, the model tends to memorize the training cases and will likely perform poorly on new data once the system is deployed. This makes ‘big’ datasets an important requirement for Neural Network learning, because they enable a higher ratio of observations per connection.
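The observations-per-connection ratio described above can be estimated before any training happens. The following is a minimal sketch, assuming a plain fully connected network in which every unit in one layer connects to every unit in the next; the layer sizes and dataset sizes are hypothetical, and the function names are ours rather than part of any library.

```python
def count_connections(layer_sizes):
    """Count the weights between consecutive fully connected layers.

    Biases are ignored to keep the sketch simple.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def observations_per_connection(n_observations, layer_sizes):
    """Ratio of training observations to network connections."""
    return n_observations / count_connections(layer_sizes)


# Hypothetical network: 300 inputs, two hidden layers of 500 units, 10 outputs.
layers = [300, 500, 500, 10]
connections = count_connections(layers)
print(f"connections: {connections:,}")

# Two hypothetical dataset sizes, one small and one 'big'.
for n_obs in (50_000, 5_000_000):
    ratio = observations_per_connection(n_obs, layers)
    verdict = "risk of memorization" if ratio < 1 else "healthier ratio"
    print(f"{n_obs:,} observations -> {ratio:.2f} per connection ({verdict})")
```

This ratio is only a rough rule of thumb that mirrors the argument above; in practice, techniques such as regularization also influence how well a network generalizes.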

On the whole, Neural Networks have been successfully applied to numerous Natural Language Processing (NLP) tasks such as intent detection, duplicate document detection, language translation, and so on. As you can imagine, you need a lot of data to train robust models for these different tasks. To put the word ‘big’ in perspective, consider the following open datasets that could be used for different NLP tasks:

  • Wikipedia in English: 100 GB compressed; 10 TB uncompressed
  • WMT 14 English to French dataset: 950 MB, ~50 million words per language
  • UMBC WebBase corpus: 13 GB uncompressed, 3 billion words
  • Google Books N-grams: 2.2 TB
  • Amazon Reviews: 11 GB, ~35 million reviews

These datasets cannot be processed using traditional approaches (some won’t even fit into the memory of commodity computers), so Big Data technologies are needed to efficiently perform computations. It is in situations like these that Big Data and Data Science go hand in hand, even though they are different.
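To illustrate the memory problem on a small scale, here is a minimal sketch of streaming a text corpus line by line instead of loading it whole. It is a single-machine illustration, not a Big Data framework: the `count_words` helper is hypothetical, and the in-memory `StringIO` object stands in for a file far too large to read at once.

```python
import io
from collections import Counter


def count_words(lines):
    """Count words from any iterable of text lines, one line at a time.

    Only the current line and the running counts are held in memory,
    so the input itself never needs to fit in RAM.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Stand-in for a huge corpus. With a real file you would pass the open
# file object instead, since iterating over it also yields one line at a time.
sample = io.StringIO("Big Data and Data Science\ndata science needs data\n")
counts = count_words(sample)
print(counts.most_common(2))
```

Big Data technologies apply a similar idea in spirit, splitting the work into pieces that fit in memory, but they distribute those pieces across many machines.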

Stay tuned for our next post where we will discuss everyday applications of Data Science! Check out open roles at Wizeline here.

About the authors:

Diego holds a B.Sc. in Applied Math, with a concentration in Machine Learning and Statistical Learning. He has spent the past four years modeling data at startups in both Dubai and Mexico. At Wizeline, he contributes to Deep Learning research.

Juan holds an M.Sc. with distinction in Statistics and Operational Research from the University of Edinburgh. He’s currently a lecturer in the Industrial Engineering Department of ITESM. Before joining Wizeline, Juan spent several years at HP Labs as a Research Assistant in mathematical optimization. Now he’s busy analyzing the real estate market in the US, while contributing to Deep Learning research.
