Published in CISS AL Big Data

Data: The Commodity of the Modern World


Data is the commodity of the modern world, often described as more valuable than gold or oil. You might be wondering why it is worth so much. What’s so good about data? Data informs companies about the best course of action in their industry. From providing YouTube recommendations to predicting stock trends, using data well is crucial to success. A key step in effective data usage is data collection.

Usually, data collection means only the gathering of data; in this article, however, the term also covers data wrangling, the cleaning and organizing of raw data into a usable form.

Data Collection

Data collection is the process of gathering and quantifying information (qualitative or quantitative) in order to find correlations and relationships among the factors being studied. Because every later step depends on it, data collection is a pillar of big data analytics and arguably the most important step in the pipeline. Done well, it sets your research up for success; done poorly, it can ruin the analysis.

An issue that often arises during data collection is reliability. Sensors, external datasets, and measuring devices are susceptible to error, and verifying their accuracy early reduces unnecessary work later. A fast and easy way to gauge accuracy is cross-checking across many sources. Instead of relying on one dataset or one sensor that needs thorough checking, you can use multiple independent sources: when they lead to the same conclusion, that agreement provides comparable credibility.

“Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture.” (Viktor Mayer-Schönberger)
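To make the cross-checking idea concrete, here is a minimal sketch in Python. It assumes several sensors measure the same quantity, takes the median as the consensus value, and flags any sensor that strays too far from it. The sensor names, readings, and tolerance are hypothetical and chosen only for illustration.

```python
from statistics import median

def cross_check(readings, tolerance):
    """Combine readings of the same quantity from several sources.

    Returns the median as the consensus value, plus the names of the
    sources whose reading differs from it by more than `tolerance`.
    """
    consensus = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items()
        if abs(value - consensus) > tolerance
    )
    return consensus, suspects

# Hypothetical temperature readings (degrees Celsius) from four sensors.
readings = {
    "sensor_a": 21.4,
    "sensor_b": 21.6,
    "sensor_c": 21.5,
    "sensor_d": 35.0,  # an obviously faulty reading
}
consensus, suspects = cross_check(readings, tolerance=1.0)
print(consensus, suspects)
```

The median is used here rather than the mean because a single faulty reading barely moves it, so the aggregate stays trustworthy even when one source does not.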

Another issue that may arise is the correct and effective quantification of data. A key characteristic of Big Data is the inherent messiness of the dataset. With each new category imposed on the data, however, more of that essence is lost. As we categorize and quantify our data instead of embracing its messiness, potential insights vanish into the luminiferous aether.
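A small Python sketch can show how categorizing discards detail. The age bands and sample values below are hypothetical; the point is only that many distinct raw values collapse into a handful of labels, and the original values cannot be recovered from the labels afterwards.

```python
def to_category(age):
    """Map an exact age to a coarse, hypothetical age band."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# Hypothetical raw ages collected from respondents.
ages = [18, 19, 23, 41, 64, 65, 90]
categories = [to_category(a) for a in ages]

# Count how many distinct values exist before and after categorizing.
distinct_before = len(set(ages))
distinct_after = len(set(categories))
print(distinct_before, distinct_after)
```

After categorizing, an 18-year-old and a 64-year-old are indistinguishable, so any pattern that depends on the difference between them can no longer be found.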

Primary vs Secondary Data

These potential issues affect each type of data differently. Two types of data are collected: Primary and Secondary. Primary data is collected directly from the original source and is often raw and unorganized. Interviews, surveys, and experiments are all sources of Primary data.

Secondary data, on the other hand, has already been collected, and usually processed, by someone else.

Books, journals, newspapers, and government records are all sources of Secondary data.

Both Primary and Secondary datasets have their merits. Which one to use depends on the research project. The first difference between the two data types is their level of organization.

Primary data is usually full of raw, unfiltered, and confusing numbers. It often comes directly from the source and arrives quickly. Most of the work lies not in analyzing the data but in processing it: creating easily understandable categories for the dataset.
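As a rough illustration of that processing step, the Python sketch below turns messy free-text survey answers into a few easily understandable categories. The accepted spellings and the sample answers are assumptions made up for this example, not part of any real survey.

```python
def normalize_answer(raw):
    """Map a free-text yes/no survey answer to 'yes', 'no', or 'unknown'."""
    cleaned = raw.strip().lower()
    if cleaned in {"y", "yes", "yeah", "true", "1"}:
        return "yes"
    if cleaned in {"n", "no", "nope", "false", "0"}:
        return "no"
    # Anything unrecognized is kept but flagged, never silently dropped.
    return "unknown"

# Hypothetical raw answers, as they might arrive from a survey form.
raw_answers = [" Yes", "no ", "Y", "maybe", "NOPE", ""]

counts = {}
for answer in raw_answers:
    label = normalize_answer(answer)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Keeping an explicit "unknown" category makes it visible how much of the raw data did not fit the chosen categories.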

Secondary datasets make it easier to find correlations and derive relationships because someone else has already done the bulk of the processing. Although you do not have to spend time creating categories, that does not mean it’s all sunshine.

The second difference is speed.

All real-time data is Primary data, because data in its natural state is raw. New and fresh insights will therefore mostly come from Primary data. For example, in the automobile industry, ultrasonic sensors and cameras are mounted on self-driving cars. These sensors and cameras capture raw, unfiltered data directly from the source. That data can be used in real time to help predict pedestrian and car movement at an intersection. This cannot be done with Secondary data.
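To give a feel for processing raw readings as they arrive, here is a minimal Python sketch that smooths a stream of distance readings with a moving average. It is an illustration only and not how any production self-driving system works; the readings, the window size, and the function name are hypothetical.

```python
from collections import deque

def smoothed_distances(stream, window=3):
    """Yield the moving average of the last `window` readings as each arrives."""
    recent = deque(maxlen=window)  # older readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)

# Hypothetical ultrasonic distance readings in metres, in arrival order.
readings = [4.0, 3.0, 2.0, 1.0]
for distance in smoothed_distances(readings):
    print(round(distance, 2))
```

Because the function is a generator, each smoothed value is available as soon as its reading arrives, without waiting for the full dataset.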

However, that doesn’t mean Secondary data is useless. Because Secondary data has already been preprocessed into categories, someone else has usually analyzed it for their own study or research, so new insights are limited. Secondary data has other uses, though. It can help you familiarize yourself with the industry of interest, since it is easier to handle and makes the topic at hand easier to understand. New insights can still be drawn from Secondary data, but the chances are much smaller.

Even though these stark differences mean that Primary and Secondary data each suit specific types of research, both shape the trajectory of your project. Good data collection will help you avoid unnecessary issues and changes of subject.


1. Board, U. P. M. E. (2019, August 17). The challenges of big data for businesses. UP’ Magazine. Retrieved November 8, 2021, from

3. Importance of data collection & how to use it. Innovative Advertising. (2020, March 30). Retrieved November 8, 2021, from

4. Laqua, R. (2020, November 13). Organizational hazards. Lean Compliance. Retrieved November 8, 2021, from

5. Mayer-Schönberger, V., & Cukier, K. (2017). Big Data: The Essential Guide to work, life and learning in the age of insight. John Murray.

6. Testing, testing, 1, 2, speed… Mind The Speed Premium Fibre Internet. (n.d.). Retrieved November 8, 2021, from




This publication demonstrates the work done by students in Concordia’s AL Big Data class. Big Data, in the simplest of terms, refers to the tools and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities.
