Data First — the New Scientific Method?

Does the basic foundation of elementary science education hold up in the world of big data?

Satya Basu
insight io
7 min read · Mar 19, 2019

--

One of the challenges to research and analytics today is a dramatic shift in methodology and sequence driven by the explosion of “Big Data.” The raw information in big data sets and streams updates and grows continually, fed by automated output from systems and sensors. Yet most problems, and most data sets, still deal with small to medium data, which may exist beforehand or be generated using traditional methods of gathering and experimentation. The fact remains that there is an enormous difference in volume between traditional data sets and raw “big” data sets.

One of the advantages of traditional data gathering is that the data is largely predetermined by the research methodology (survey, study, etc.). That approach has its own challenges, but even the raw data is often quite reliable because it is validated at many stages. Big data, by contrast, is generally just “raw.” We also need to look at the value of the data. Traditional data gains value once it is collected and validated; because it was gathered or generated for a specific use, it holds that value consistently but does not increase until the next round of experimentation or collection. Big data is continuously growing, and even in its most unrefined form there is potential value growth based on its volume and history. Take sales data as an example. If you are tracking all transactions over time, that data set has inherent value which grows with it. If you then augment the data with information about product categories or the demographics of purchasers, the entire data set becomes much more valuable. Once you analyze and model the augmented data set, you increase its value again, and potentially the rate at which that value grows as the raw data volume increases.
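To make this concrete, here is a minimal sketch in Python (pandas) of that augmentation step, using made-up table and column names purely for illustration: joining reference data onto a raw transaction log lets you ask questions the raw log alone could not answer.

```python
import pandas as pd

# Hypothetical raw transaction log -- the continuously growing "big" dataset.
transactions = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "user_id": ["u1", "u2", "u1"],
    "product_id": ["p7", "p9", "p7"],
    "amount": [19.99, 54.00, 19.99],
})

# Augmentation sources: product categories and purchaser demographics.
products = pd.DataFrame({
    "product_id": ["p7", "p9"],
    "category": ["outdoor", "electronics"],
})
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "age_band": ["25-34", "45-54"],
})

# Joining the reference data onto the raw stream makes every past and
# future transaction more valuable to analyze.
augmented = (
    transactions
    .merge(products, on="product_id", how="left")
    .merge(users, on="user_id", how="left")
)

# e.g. revenue by category and age band -- a question the raw log alone
# could not answer.
print(augmented.groupby(["category", "age_band"])["amount"].sum())
```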

This isn’t to say that big data can completely supplant traditional data and scientific methods. The question is whether the big data paradigm differs intrinsically in its approach (it does). In the world of big data there is a total transformation of the fundamental approach to research and analysis. Big data might also be called “data first,” which is a different paradigm from the scientific method we all learned in grade school.

The traditional scientific method

Let’s take a classic example of the scientific method at work and compare it to how we might approach the problem today. The development of Darwin’s theory of evolution and the origin of species is an excellent illustration of the traditional scientific model. It begins with an observation, in this case that different birds of a similar species have different beaks. This leads to a question, most generally framed as “Why is this so?” From here you begin an iterative process which forms the core of the traditional scientific method we learn in grade school. You propose an answer to your question, a hypothesis, and proceed to collect data to support or refute it. The results of your experiment lead you to adjust or reframe your hypothesis to align with the data that you generate. In the case of Darwin’s finches, this process led to the development of the theory of evolution.

What is a data first paradigm?

Now let us consider how this exercise would develop using a “big-data” paradigm. One of the first assumptions of a big data exercise is that you have data on hand or can easily capture data at scale using some digital process such as logging, scraping, sensing, etc. Once you have your dataset, you proceed to “mine” it for insights. To tackle this problem today we might rely on Google image search as our data source. By typing the word “bird” or “finch” into the search field, we can eventually return hundreds of thousands of images of birds to study.

Once we have our dataset, we can proceed with our analysis. Because the entire dataset is digital, we can use computer algorithms to examine every single image and extract “features” (see image intelligence). An algorithm can learn to identify a feature by analyzing each image repeatedly with different parameters, gradually teaching itself what the different parts of birds might be and where they appear in each image. Eventually, by reviewing and comparing which features allow for the most consistent grouping and classification of birds into meaningful categories (with the assistance of human intelligence), a library of features and their combinations emerges. At the end of this process, the results might reveal that birds can be grouped into distinct species, and that birds of the same species can have different defining characteristics (such as beaks) while still belonging together.
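As a rough illustration of what such a pipeline might look like, here is a minimal Python sketch that uses a pretrained network purely as a generic feature extractor and then groups the images by feature similarity. The model choice, folder name, and cluster count are assumptions for illustration, not a prescription; a human would still inspect the resulting clusters and decide which ones are meaningful.

```python
import glob
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models, transforms

# Pretrained network used purely as a generic feature extractor
# (assumption: any general-purpose image model would do for this sketch).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classification head, keep features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

paths = sorted(glob.glob("finch_images/*.jpg"))  # hypothetical image folder
features = []
with torch.no_grad():
    for path in paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        features.append(model(img).squeeze(0).numpy())

# Group the images by feature similarity; a person then reviews the clusters
# and decides which correspond to meaningful species or beak types.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(
    np.array(features)
)
for path, label in zip(paths, clusters):
    print(label, path)
```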

But wait…

You may be saying to yourself: that’s all well and good, assuming you could just search for whatever you want to find on Google Images and then run it through this series of processes. Of course, Google image search now uses computer vision algorithms itself to auto-generate the searchable tags on images, which creates a chicken-and-egg conundrum. However, let us imagine that Darwin had our tools (cameras and computers) but not our services (in this case, Google). Returning to the Galapagos, he could set up cameras to monitor the different islands and store the images.

While more complexity, training, and computation would be required, the same principles outlined above would still work. And if there were a steady Wi-Fi signal, this data could be captured in real time and streamed continuously to study the entire population of finches over time (more on this later). In either case, the entire study of birds (or species) could proceed from the acquisition of data, irrespective of a specific research question. The same data set of images could then be used to study other questions about birds, or even their environment, and changes to both over time. By re-applying the findings from one study back to the original dataset (augmentation), the core dataset itself is transformed into a more valuable raw data source. This approach fundamentally changes the way we perform research and analysis in several significant ways.
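A field setup like that could be as simple as a capture loop. The sketch below (Python with OpenCV) assumes a camera attached to the device, an existing output folder, and an arbitrary capture cadence; streaming or uploading would be layered on top whenever connectivity allows.

```python
import time
import cv2  # OpenCV; assumes the OS can see an attached camera

# A minimal field-camera loop: grab a frame every few minutes and append it
# to the growing raw dataset -- no research question required yet.
CAPTURE_INTERVAL_SEC = 300          # assumed cadence for illustration
camera = cv2.VideoCapture(0)        # 0 = first attached camera

try:
    while True:
        ok, frame = camera.read()
        if ok:
            # "island_a/" is a hypothetical, pre-existing output folder.
            filename = f"island_a/frame_{int(time.time())}.jpg"
            cv2.imwrite(filename, frame)  # later: upload when Wi-Fi allows
        time.sleep(CAPTURE_INTERVAL_SEC)
finally:
    camera.release()
```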

Crucial impact of a data-centric approach

One of the most critical changes to methodology in the “big data” paradigm is the ability to study the entire population at once. If you compare the data science method to the traditional scientific method, the acquisition and application of data is fundamentally transformed. While the traditional method seeks specific data points to support or reject a hypothesis, the data first method relies on not just most but (theoretically) ALL of the data. Hypotheses and experiments are prone to all sorts of sampling errors and unwitting (or deliberate) biases. This naturally leads to a healthy distrust of individual experiments and to the slow development of theory as multiple researchers frame and study various hypotheses.

When you study the population in its entirety, however, many of these concerns are alleviated. There is less danger of selecting a poor sample or control group. If a population is monitored continuously, there is less danger of selecting an experiment window that misses critical time periods. Certainly it is not always possible to capture or generate data at that scale, and plenty of research still relies on traditional scientific methods. But as technology and digital tools proliferate, it becomes easier and easier to find and capture data on entire groups and conditions.
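A toy example makes the sampling point concrete: when the whole population is on hand, the gap between a sample statistic and the true value simply stops being part of the analysis. The numbers below are illustrative, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "population": beak lengths (mm) for every finch on an island
# (purely illustrative values).
population = rng.normal(loc=11.0, scale=1.5, size=100_000)
population_mean = population.mean()

# A small, possibly unlucky sample -- the traditional situation.
sample = rng.choice(population, size=30, replace=False)
sample_mean = sample.mean()

print(f"population mean: {population_mean:.2f} mm")
print(f"sample mean (n=30): {sample_mean:.2f} mm")
# With the full population captured, sampling error is no longer a concern.
```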

In the latter example of a modern-day Darwin, even if the study originates with a question and a hypothesis, it still leads to a fundamentally different approach: studying the entire ecosystem of an island and its feathered occupants. And instead of a specific, limited data set of bird drawings and specimens, the data set of streaming video could have other applications for ecology, climate change, and more.

In the world of big data, the keys are data first and data persistence: the more data you have, the more questions you can ask; the more insights you uncover, the more valuable your data becomes.

So why aren’t we using big data all the time?

Of course, we also have to consider the challenges of adopting this (or any) radically new approach. In the case of big data, besides the challenge to traditional ways of thinking, there is also the cost of adoption. It generally takes a fair amount of effort to put in place or build out the systems required to generate or harvest big data. And because the data first methodology requires a meaningful amount of data to mine for insights, there is a period of negative return on investment (ROI) that you must patiently suffer through. Once you begin to augment and increase the value of the data stream, the ROI begins to climb dramatically.
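To see the shape of that curve, here is a back-of-the-envelope sketch with purely illustrative numbers: a fixed build-out cost up front, and insight value that compounds as the data accumulates and gets augmented.

```python
# Illustrative-only assumptions: a fixed build-out cost, then value that
# compounds as the dataset grows and is augmented.
BUILD_COST = 100_000          # assumed upfront system cost
monthly_value = 2_000         # assumed initial value of insights per month
growth = 1.15                 # assumed monthly compounding of that value

cumulative_value = 0.0
for month in range(1, 37):
    cumulative_value += monthly_value
    monthly_value *= growth
    roi = (cumulative_value - BUILD_COST) / BUILD_COST
    if month % 6 == 0:
        print(f"month {month:2d}: ROI = {roi:+.0%}")
# ROI stays negative for roughly the first year, then turns sharply positive.
```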

--

Satya Basu
insight io

Satya is a designer, strategist and technologist who utilizes his diverse background to respond to design challenges with inventive problem-solving.