Wrestling with Data Collection

Sunny Zhang
Published in CISS AL Big Data
Dec 10, 2021


Big Data is a rich source of insight that helps people in industry, business, and science work toward very different goals. It is not limited to a specific field, a type of person, or a chosen topic. As mentioned many times in previous modules, a wide variety of projects lean on Big Data to produce better results. This also means that teenagers and adults alike can apply Big Data to their work and daily life and gain new insights and improvements from it. With the importance of Big Data established, it is time to turn to the step that supplies the raw material for Big Data analysis: data collection.

“Collecting Data in a Remote Investigation — Lit & More: Litigation Services.” Lit & More | Litigation Services, 15 Oct. 2020, http://www.litnmore.com/collecting-data-in-a-remote-investigation/.

The success that Big Data can bring depends fundamentally on the quality and quantity of the data examined. Data collection is the process of gathering and measuring information from all relevant sources in a way that allows one to answer stated research questions, test hypotheses, and evaluate outcomes. The data collection component of research is common to all fields of study, including the physical and social sciences, the humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. Regardless of the field, accurate data collection is essential to maintaining research integrity. Both the selection of appropriate data collection instruments and clearly delineated instructions for using them reduce the likelihood of errors occurring.1

When data collection is done in a careful, considered way, Big Data can help answer complex questions and guide good decisions. However, the difficulties of data collection are also worth mentioning. To begin with, the value of collected data depends primarily on its accuracy, which means drawing information from well-organized, reliable sources and using data analysis tools with a cautious mindset.

“Tất Cả Điều Cần Biết Cho Việc Dán Nhãn Dữ Liệu Trong Machine Learning.” TesterViet, 12 July 2021, https://testerviet.com.vn/dan-nhan-du-lieu-machine-learning/.

When data collection is not accurate, the resulting problems include the inability to answer research questions accurately, the inability to replicate and verify the research, misleading findings that waste both time and resources, wasted effort and funding for those who build on the results, and harm to human participants and animal subjects. Ensuring the accuracy of the data is often the first step of successful data collection, and a failure at this step can undermine the whole project. It is just like solving a math problem: when the wrong numbers are plugged into the equation at the very beginning, no credit can be earned, no matter how much energy and time goes into working through the rest of the problem.
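To make the "wrong numbers at the very beginning" idea concrete, here is a minimal Python sketch of checking collected records before any analysis starts. Everything in it is an assumption made for illustration: the `SurveyResponse` fields, the age limits, and the 1 to 5 satisfaction scale are hypothetical and do not come from any real survey.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SurveyResponse:
    """One collected record (hypothetical fields, for illustration only)."""
    respondent_id: str
    age: Optional[int]
    satisfaction: Optional[int]  # assumed scale: 1 (low) to 5 (high)


def find_problems(response: SurveyResponse) -> List[str]:
    """Return a list of accuracy problems found in a single record."""
    problems = []
    if not response.respondent_id:
        problems.append("missing respondent_id")
    if response.age is None:
        problems.append("missing age")
    elif not 0 <= response.age <= 120:  # assumed plausible range
        problems.append(f"age out of range: {response.age}")
    if response.satisfaction is None:
        problems.append("missing satisfaction")
    elif not 1 <= response.satisfaction <= 5:
        problems.append(f"satisfaction out of range: {response.satisfaction}")
    return problems


def split_valid(
    responses: List[SurveyResponse],
) -> Tuple[List[SurveyResponse], List[Tuple[SurveyResponse, List[str]]]]:
    """Separate clean records from records that need a second look."""
    valid, rejected = [], []
    for response in responses:
        problems = find_problems(response)
        if problems:
            rejected.append((response, problems))
        else:
            valid.append(response)
    return valid, rejected


if __name__ == "__main__":
    collected = [
        SurveyResponse("a1", 34, 4),
        SurveyResponse("a2", 230, 5),   # likely a typing error
        SurveyResponse("", 20, None),   # incomplete record
    ]
    valid, rejected = split_valid(collected)
    print(f"{len(valid)} usable record(s)")
    for response, problems in rejected:
        print(response.respondent_id or "<no id>", "->", problems)
```

The rejected records are kept together with the reasons they failed rather than silently dropped, so the collector can go back to the source and correct them.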

Having acknowledged the danger of improper collection, let’s look at the different kinds of datasets, since knowing them makes it easier to collect good data. Datasets fall into two groups: primary and secondary. Primary datasets are data produced by the researchers themselves through surveys, interviews, and experiments. They are designed specifically for understanding and answering the research question. Primary datasets can also be called firsthand datasets; they are often unique and available only to the researchers. Secondary datasets, on the other hand, are data produced by large government institutions, healthcare facilities, and other public groups. These datasets are often kept for organizational record keeping, and the people who hold them often do not plan to analyze them. Researchers then extract the data from these varied files for use in big data analytics projects.

“Marketing Teams Articles.” Unstack Inc., https://www.unstack.com/tag/marketing-teams. [Figure 1]

These datasets differ, but both are essential in data collection. To draw a clearer distinction between primary and secondary datasets, the following points, based on descriptions by Research Guides, are helpful:

  1. They are collected at different times. Primary data is real-time data collected during the project, while secondary data is past data that was not collected with a specific future project in mind.
  2. Collecting primary datasets is precise and time-consuming because the collection is designed specifically to serve the big data project. In contrast, secondary datasets are quick and easy to obtain because they were collected as a byproduct of past activities rather than as the main product.
  3. As mentioned above, primary datasets are always specific to the researcher’s needs, while secondary datasets were not gathered with the project’s needs in mind.
  4. Primary data is often more accurate for the question at hand because it is collected deliberately for that purpose, while secondary datasets were never intended to correspond fully to the specific topic being analyzed.2
  5. Some specific examples of primary datasets are surveys, interviews, questionnaires, and observations from experiments. Secondary datasets are often textbooks, articles, reviews, and other related but indirect records. Here’s a quick example: suppose the question is which purchases are made most frequently on a stormy day in Chicago. A primary dataset would be records of the shopping lists of customers at Costco on that stormy day. A secondary dataset could be an article about people craving sweets when they are stuck at home due to severe weather.

Please refer to Figure 1 for a visual summary of the distinctions described above.
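The stormy-day example can also be sketched in a few lines of Python. This is only an illustration of how a primary dataset might be summarized: the shopping lists below are made-up sample data, and the function name `most_frequent_purchases` is a hypothetical helper, not part of any library.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_frequent_purchases(
    shopping_lists: Iterable[List[str]], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Count how many customers bought each item and return the top items.

    Item names are lowercased and trimmed so that "Milk" and "milk " are
    treated as the same product, and each item counts once per customer.
    """
    counts = Counter()
    for items in shopping_lists:
        counts.update({item.strip().lower() for item in items})
    # Sort by count (highest first), then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up primary data: one list per customer, recorded on the stormy day.
    stormy_day_lists = [
        ["Bread", "milk", "candles"],
        ["milk", "chocolate"],
        ["bread", "Milk ", "batteries"],
    ]
    for item, customers in most_frequent_purchases(stormy_day_lists):
        print(f"{item}: bought by {customers} customer(s)")
```

A secondary dataset, such as the article about people craving sweets, could not be counted this way, because it was never recorded as individual purchases for this question.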

To recap, Big Data analytics is never a skill limited to a few. Everyone is welcome to dive into data collection and to expand their knowledge as Big Data continues to reveal information that other techniques cannot provide.

Philip, Nate, and Nate Philip. “The Key to Success with Big Data Projects (Updated).” Qubole, 14 Sept. 2021, https://www.qubole.com/blog/big-data-project-success/.

However, paying close attention to collecting accurate datasets is a crucial factor in whether an individual succeeds with Big Data. Understanding the fundamental concepts of data collection and its components helps an individual make the best possible use of Big Data.

Sources:

1 “Data Collection.” https://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dctopic.html.

2 “Public Health Research Guide: Primary & Secondary Data Definitions.” Research Guides, https://researchguides.ben.edu/c.php?g=282050&p=4036581.
