Data Collection: The Process Companies Undergo

Kate Anderson
Published in CISS AL Big Data
Oct 25, 2023 · 5 min read
Fig. 1: Smartphones generate data with every minute of use (Dhunicorn, 2021).

Introduction

Big data analytics is a field of data science focused on collecting and analyzing datasets so large that conventional data tools cannot handle them. The field is growing as data becomes an ever more valuable resource, and so are the ways of collecting it: social media posts, sensors on buildings, medical scans, and more (see Figure 1). Companies and organizations use this data to reach important conclusions, such as whether now is a good time to post an advertisement, whether a building is at risk of damage, or whether a scan shows a tumor. But how do big data scientists obtain data? What happens between collecting the data and analyzing it? In this article, we will explore the kinds of data that organizations collect, the ways data can be "dirty", and how organizations can clean their data.

Step One: Get

Many companies need data about their consumers to make decisions tailored to their audience's needs, like what to show on a homepage feed or which new products could make the most money when aimed at a certain group. To know what a person likes and dislikes, or which groups they fall into, companies must collect data. The same logic applies beyond commerce: cities might keep data on certain buildings, and biologists might keep data on individual creatures in the wild.

Network data: You generate data for every minute you spend on devices connected to the web, as portrayed in Figure 1. Google records every internet search it has ever received, and web trackers record which websites you visit. When you do anything on the internet, an organization is out there taking notes.

Transactional data: Companies usually keep records of purchases. Amazon records how long you read about a certain product and whether or not you buy it afterward. Online stores and large retailers like Target or Walmart record your purchases when you check out.

Sensor data: Another way organizations collect data is through sensors. For example, sensors in a farm might record concentrations of harmful chemicals and the ground temperature. Sensors on wild sea creatures may record the location of their hosts and generate geographic data. Buildings may have sensors that record the structural integrity of certain areas. These sensors are usually connected to the web and often upload their data to organizations in real time.
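As a sketch of how such a sensor might package a reading for upload, here is a minimal Python example. The field names, the sensor ID, and the `make_reading` helper are all hypothetical; a real deployment would follow whatever schema its ingestion service expects.

```python
import json
import time

def make_reading(sensor_id: str, metric: str, value: float, unit: str) -> str:
    """Package one sensor measurement as a JSON record ready to upload.
    (Hypothetical schema, for illustration only.)"""
    return json.dumps({
        "sensor_id": sensor_id,
        "metric": metric,
        "value": value,
        "unit": unit,
        "timestamp": int(time.time()),  # seconds since the Unix epoch
    })

# A farm soil sensor reporting ground temperature in Celsius:
record = make_reading("farm-07", "ground_temp", 11.4, "C")
```

A real sensor would then send this record to a collection endpoint; the timestamp lets the receiving organization order and deduplicate readings later.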

Secondary data: Sometimes, data is not collected firsthand. Instead, it can be taken from another source that has made its data publicly available. While shiny-new data is called primary data, this hand-me-down data is called secondary data. After all, why collect data yourself if someone else has already done the hard work? The catch is that secondary data tends to make step two, cleaning the data, quite a bit tougher.

Fig. 2: Data connects the world (WallpaperAccess).

Step Two: Clean

Often, collected data is riddled with errors, incomplete portions, duplicates, and other erroneous bits. It's important for data scientists to clean their data before analyzing it, so that poor-quality input does not produce poor-quality results and the data stays valuable and relevant in a connected world, as shown in Figure 2. Cleaning data involves detecting, modifying, or omitting dirty data, and is usually done by running the data through data-cleaning tools. Here are some examples of dirty data:

Duplicate records are usually easy to spot and should be removed from a dataset to avoid inflated counts.

Inconsistencies in data are quite prevalent as well. One instance of this might be a sensor in a farm that suddenly reports high amounts of water even though all the nearby sensors report normal amounts. This sensor is likely faulty, and the single inconsistent datum it produced should be omitted from the analysis. In some cases, though, an unusual reading can signal a real issue: for example, a whale's location readings during migration may drift farther from those of the rest of the pod, suggesting the whale has become lost. Because of this, it's important to evaluate whether an outlying data point is faulty or just an outlier.

Data in different formats should generally be converted into a single format. For example, measurements in different units should be converted to one system of units to avoid inaccuracies.
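The three kinds of dirty data above can be sketched in a few lines of Python. This is a toy example with made-up moisture readings, not a production pipeline: the sensor names, the fixed 20-point tolerance, and the inch-to-millimeter helper are all assumptions made for illustration.

```python
import statistics

# Hypothetical raw readings (sensor ID, soil moisture %), containing one
# exact duplicate row and one suspiciously high value.
rows = [
    ("sensor-1", 31.2),
    ("sensor-2", 29.8),
    ("sensor-2", 29.8),  # duplicate entry
    ("sensor-3", 30.5),
    ("sensor-4", 97.0),  # possible faulty sensor
]

# 1. Duplicates: drop exact repeats while preserving order.
deduped = list(dict.fromkeys(rows))

# 2. Inconsistencies: flag readings far from the group median for manual
#    review rather than silently deleting them (the spike might be real).
median = statistics.median(v for _, v in deduped)
flagged = [(sid, v) for sid, v in deduped if abs(v - median) > 20]

# 3. Formats: convert everything into one system of units.
def inches_to_mm(inches: float) -> float:
    return inches * 25.4
```

Running this flags only `sensor-4`, leaving a human to decide whether the spike is a broken sensor or a genuine flood. Real cleaning tools apply these same moves at a much larger scale.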

Usually, new data can be run through a program that recognizes issues like these and automatically omits them, fixes them, or flags them for manual review. Many technologies out there today aid companies in sifting through data and dealing with erroneous entries.

The most effective way to avoid poor data quality is to ensure that new data is as clean as possible right off the bat. Making sure that sensors are properly calibrated and that data is formatted correctly from the very beginning will go a long way. For example, checking the validity of a phone number before allowing a user to submit it will make incorrect entries much less common, and unambiguous survey questions will help users submit accurate responses.
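As a sketch of validation at the point of entry, the snippet below rejects malformed phone numbers before they ever reach the dataset. The pattern assumes 10-digit North American numbers with optional separators; a real form would use a dedicated validation library with locale-aware rules.

```python
import re

# Accepts forms like "(555) 867-5309", "555-867-5309", or "5558675309".
PHONE_PATTERN = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def is_valid_phone(number: str) -> bool:
    """Return True only when the string looks like a 10-digit phone number."""
    return bool(PHONE_PATTERN.match(number.strip()))
```

Rejecting an entry like "867-5309" at submission time is far cheaper than discovering thousands of unusable records during analysis.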

Organizations need data to keep up with a world diving into the age of datafication. Today, data is more than just numbers on a chart, and it can come from anywhere. Just by reading this article, you are generating data ripe for the picking. But even though all this data is collected, not all of it is clean, especially data more analog in nature. Organizations must ensure that their data is accurate, valid, and usable before trusting it. It's important to ensure that data is clean: low-quality data can lead to low-quality decisions.

References

Bhandari, P. (2023). What is data cleansing? Definition, guide & examples. Scribbr. https://www.scribbr.com/methodology/data-cleansing/

Computools, R. (2022). How is big data collected by companies? Computools. https://computools.com/how-is-big-data-collected/

Dhunicorn. (n.d.). Data mining techniques for stock market analysis and prediction. https://dhunicorn.net/data-mining-techniques-for-stock-market-analysis-and-prediction/

Hansen, B. (2022, December 28). How companies can keep their data clean in 2023. Forbes. https://www.forbes.com/sites/forbescommunicationscouncil/2022/12/28/how-companies-can-keep-their-data-clean-in-2023/?sh=258d08d67dc4
