Big Data: Messy

Celeste Ma
Published in CISS AL Big Data
Oct 16, 2020
Source: https://www.breakingburgh.com/wp-content/uploads/2020/03/MessyBookstore.jpg

A problem with collecting large amounts of data is that the data tends to be imprecise and disorganized. The field of Big Data has had to adapt to both issues.

Issue 1: Imprecision

Imprecise data may seem useless, but imprecision matters much less when a large amount of data is analyzed: random noise in individual measurements tends to be drowned out by the general patterns in the data as a whole.
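As a rough illustration of noise being drowned out, here is a minimal sketch that averages simulated noisy readings of a single quantity. The true value, the noise level, and the function name `mean_of_noisy_readings` are all made up for this example, and the sketch assumes the noise is random and centered on zero. A consistent bias in the measurements would not average out this way.

```python
import random
import statistics


def mean_of_noisy_readings(true_value, noise_sd, n, seed=0):
    """Average n simulated readings of true_value with Gaussian noise added."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    readings = [true_value + rng.gauss(0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)


TRUE_VALUE = 20.0  # assumed "real" quantity being measured
NOISE_SD = 5.0     # assumed spread of the error in each single reading

for n in (10, 1_000, 100_000):
    estimate = mean_of_noisy_readings(TRUE_VALUE, NOISE_SD, n)
    # Print how far the average lands from the true value for each sample size
    print(n, round(abs(estimate - TRUE_VALUE), 3))
```

Each individual reading here can be off by several units, yet the average over many readings typically lands much closer to the true value than the average over a handful.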

Imprecision that is small relative to the quantity being measured is also negligible. For example, an error of ±1 kg is a big deal when measuring the mass of apples, but it is negligible when measuring the mass of elephants.
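The apples-and-elephants comparison comes down to relative error: the size of the error divided by the size of the thing measured. A small sketch, where the masses of 0.2 kg for an apple and 5,000 kg for an elephant are assumed ballpark figures rather than values from the book:

```python
def relative_error(absolute_error, measured_value):
    """Return the error as a fraction of the measured value."""
    return absolute_error / measured_value


ERROR_KG = 1.0        # the ±1 kg error from the example above
APPLE_KG = 0.2        # assumed typical mass of one apple
ELEPHANT_KG = 5000.0  # assumed typical mass of one elephant

print(f"apple:    {relative_error(ERROR_KG, APPLE_KG):.0%}")
print(f"elephant: {relative_error(ERROR_KG, ELEPHANT_KG):.2%}")
```

Under these assumptions the same 1 kg error is several times the mass of the apple but only a tiny fraction of a percent of the mass of the elephant.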

Issue 2: Disorganization

Organized data (arranged in arrays or tables, with homogeneous types, etc.) is easy to store and analyze. Unfortunately, real-world data is rarely like this: it tends to have messy structure and to vary in quality and type.

To combat this issue, database designs have been created that allow disorganized data to be stored in a meaningful way. An example is NoSQL, a family of non-relational databases (such as document and key-value stores) that do not require every record to fit a fixed table schema, so they can hold data that a traditional SQL database would handle poorly.
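To give a feel for storing records that do not share a common set of fields, here is a minimal sketch using only Python's standard `json` module. It is not a real NoSQL database: the plain list `store` merely stands in for a document store, and the records and the helper `find_with_field` are hypothetical.

```python
import json

# Hypothetical records with different fields; a fixed table schema would
# need a column for every field and leave most cells empty.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Lin", "phones": ["555-0100", "555-0101"]},
    {"id": 3, "title": "Untitled note", "tags": ["draft"], "meta": {"source": "scan"}},
]

# Each record is saved as its own self-describing JSON document.
store = [json.dumps(record) for record in records]


def find_with_field(store, field):
    """Return the ids of documents that contain the given top-level field."""
    documents = (json.loads(text) for text in store)
    return [doc["id"] for doc in documents if field in doc]


print(find_with_field(store, "name"))
print(find_with_field(store, "tags"))
```

Because each document carries its own field names, a new kind of record can be added without changing the records that are already stored.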

Some information is lost when real-world data is forced into an organized structure, but as discussed under “Issue 1”, with enough data these small losses rarely obscure the overall patterns.

Conclusion

Imprecision and disorganization may seem like big roadblocks, but imprecision is not a serious issue as long as there is enough data and the errors are small relative to the values being measured, and there are workarounds for disorganized data, such as NoSQL databases.

This article is based on Chapter 3 of Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight by Viktor Mayer-Schönberger and Kenneth Cukier.
