How to Collect Data So Big Your Brain Cannot Handle It

Nicole Qiu
Published in CISS AL Big Data
Dec 14, 2021
Figure 1: Sciencenotes.org

How do the hours of sunlight a plant receives affect its height in centimeters? This is a simple, classic experiment we’ve all encountered at some point in middle school science class. To answer the question, we would set out on a journey of germinating mung beans and measuring plant growth day by day, carefully entering our data into neatly designed tables with proper units and labels. And finally, when the mung bean sprouts wilted and died from lack of care, we would plot each and every data point by hand and connect the dots to form a rising, roughly linear graph. The answer was loud and clear: sunlight and plant height had a positive correlation.
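For readers who would rather not plot by hand, here is a minimal sketch of the same analysis in Python. The measurements are invented purely for illustration (they are not real experimental results), and the helper names `pearson_r` and `fit_line` are my own. The sketch computes the Pearson correlation coefficient and a least-squares line through the points.

```python
from statistics import mean

# Hypothetical measurements, invented purely for illustration.
sunlight_hours = [2, 4, 6, 8, 10]
plant_height_cm = [3.1, 5.0, 6.8, 9.2, 11.1]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


r = pearson_r(sunlight_hours, plant_height_cm)
slope, intercept = fit_line(sunlight_hours, plant_height_cm)
print(f"correlation r = {r:.3f}")
print(f"height ≈ {slope:.2f} * hours + {intercept:.2f}")
```

A value of r close to 1 with a positive slope is the numerical version of "connect the dots and see the line go up." Note that this only measures linear association in the sample; it says nothing about why the plants grew.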

This is data collection: gathering, organizing, and analyzing quantitative and qualitative data in order to test a hypothesis or answer a question as simple as “what is the effect of x on y?” However, the mung bean example above is only a trivial, one-dimensional data collection process that even a middle schooler can carry out. What if we collected data on a scale so large that the human brain could not handle it on its own? What insights, then, could that data bring?

Figure 2: Freepik

With the proper use and interpretation of big data, we can see correlations and patterns that the human brain simply cannot. Whether it’s the perfect time to advertise baby clothing to a pregnant mother, or a strong indicator that a lung nodule is cancerous, big data gives us insight that even the smartest scientists on Earth cannot reach unaided. Clearly, big data brings us great power, but with great power also comes great responsibility: misinterpreting bad data can affect us just as much as good big data benefits us (one example is the failure of Google Flu Trends). So how and where exactly do we find accurate, reliable, and rich big data?

There are two main sources of data: primary and secondary datasets. Primary datasets are collected by the researchers themselves, while secondary datasets consist of data that already exists and was collected by someone else. While each has advantages that fit particular researchers’ needs, each also has disadvantages that make it unsuitable for certain research topics.

Figure 3: QuestionPro

Primary datasets are data that researchers collect themselves. After identifying a topic and formulating a hypothesis, researchers use methods such as surveys, experiments, or interviews to gather data tailored to the research topic. The main advantage of primary data collection is that the data is more customizable and accessible than data collected by others: researchers have a clear understanding of what the data means and why it is significant to the bigger picture. However, one obvious challenge is that collecting data at such a massive scale is time-consuming and costly. Researchers must pour in a significant amount of both time and energy to produce a reliable, useful, and compelling dataset. Even once enough data has been gathered to be useful, challenges may arise in storing and analyzing it. These datasets require professionals who are well-versed in data science to use the data to its full potential.

Figure 4: Intellspot.com

Secondary datasets, on the other hand, are datasets that have already been collected and made available to the public. In contrast to primary data, secondary data was often collected for a purpose different from the researcher’s own. A dataset may contain one variable the researcher wants while missing others. Thus, it may not answer the researcher’s questions perfectly. Since the data was not produced by the researchers themselves, it may also be difficult to access the raw data. Instead, it is common to find data that has already been processed into pre-coded forms, which may have stripped it of some of its value.
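One quick way to spot this mismatch early is to compare the variables a research question needs against the columns a secondary dataset actually provides. The sketch below is only an illustration: the CSV contents, the column names, and the helper `missing_variables` are all hypothetical, not taken from any real dataset.

```python
import csv
import io

# Hypothetical secondary dataset; column names and values are invented.
SECONDARY_CSV = """participant_id,age,daily_calories
1,34,2100
2,51,1850
"""


def missing_variables(csv_text, required):
    """Return the required variable names that the dataset's header lacks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    available = set(reader.fieldnames or [])
    return sorted(set(required) - available)


# Variables our (hypothetical) research question would need.
needed = ["age", "daily_calories", "hours_of_exercise"]
missing = missing_variables(SECONDARY_CSV, needed)
print("Missing variables:", missing)
```

If the list comes back non-empty, the dataset cannot answer the question on its own, and the researcher has to find another source or collect the missing variable first-hand.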

Despite these disadvantages, secondary datasets come in handy if they fit one’s research topic. Unlike primary datasets, secondary datasets are rather easy to come by. In addition, they are much more time- and cost-efficient, as they have already been collected, organized, and placed online for public use. More importantly, high-quality, reliable secondary datasets from ongoing health surveys, health care claims, and longitudinal studies on diet can often be found on institutional websites.

The choice between primary and secondary datasets depends mostly on the researcher’s topic, experience, and willingness to gather data. Either type of dataset can yield high-quality results when combined with proper analysis methods.

At the end of the day, big data is the new oil of the 21st century.

Citations

Mash, Micky. “What Are Secondary Data Collection Methods?” Business Jargons, 9 July 2016, https://businessjargons.com/secondary-data-collection-methods.html.

Pratt, Mary K. “Big Data Collection Processes, Challenges and Best Practices.” SearchDataManagement, TechTarget, 12 May 2021, https://searchdatamanagement.techtarget.com/feature/Big-data-collection-processes-challenges-and-best-practices.

“Secondary Research- Definition, Methods and Examples.” QuestionPro, 19 July 2021, https://www.questionpro.com/blog/secondary-research/.
