The Big Data Era

Ayoub_Ali
Published in The Startup
4 min read, Feb 21, 2021

Barton Poulson argues that just a few years ago the terms Big Data and Data Science were practically synonymous. But things are a little different now and it is important to distinguish between the two fields.

What is Big Data?

Big data is data that is very large, fast-moving, complex, or some combination of these, to the point that it cannot be processed using traditional methods.

Characteristics of Big Data

Three main characteristics distinguish big data from ordinary data: Volume, Velocity, and Variety. Together they are called the 3Vs. Some people argue that the list should be extended to four or five characteristics by adding Veracity and Value.

1. Volume

It refers to the amount of data to be analyzed and processed.

According to the Data Never Sleeps 8.0 infographic created by DOMO for 2020, the world’s internet population is growing significantly. By April 2020, the internet had reached 59% of the world’s population, or 4.57 billion people, a 6% increase from January 2019. More people online means more data to analyze and process.

2. Velocity

It refers to the speed with which data are being generated.

Figure: Data Never Sleeps Infographic 8.0

In 2016, Tx Zhuo wrote: “data is growing at a faster rate than ever before. By 2020, every person online will create roughly 1.7 megabytes of new data every second of every day, and that’s on top of the 44 zettabytes (or 44 trillion gigabytes) of data that will exist in the digital universe by that time.” That projection is very close to how much data exists nowadays.
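To get a feel for the numbers in that quote, here is a small back-of-the-envelope calculation. It only restates the quoted figures in other units and assumes decimal units (1 GB = 1,000 MB, 1 ZB = 10^21 bytes); it is not a measurement of actual data growth.

```python
# Figures taken from the quote above.
MB_PER_SECOND_PER_PERSON = 1.7
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Data created by one person online in a day, in decimal units.
mb_per_day = MB_PER_SECOND_PER_PERSON * SECONDS_PER_DAY
gb_per_day = mb_per_day / 1000  # assumption: 1 GB = 1,000 MB

# 1 zettabyte = 10**21 bytes and 1 gigabyte = 10**9 bytes.
GB_PER_ZB = 10**21 // 10**9
total_gb = 44 * GB_PER_ZB

print(f"{gb_per_day:.1f} GB per person per day")
print(f"44 ZB = {total_gb:,} GB")
```

At 1.7 MB per second, one person would generate roughly 147 GB per day, and 44 zettabytes does work out to 44 trillion gigabytes.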

3. Variety

It refers to the different forms of data gathered from multiple sources: Structured, Semi-structured, Quasi-structured, and Unstructured.

  • Structured data: Data that has a defined data model, format, and structure. E.g. tables in a database.
  • Semi-structured data: Textual data files with an apparent pattern enabling analysis. E.g. XML files, JSON files, and sensor data.
  • Quasi-structured data: Textual data with erratic formats that can be formatted with effort and software tools. E.g. Clickstream data, which records the webpages a user visited and in what order.
  • Unstructured data: Data that has no inherent structure and is usually stored as different types of files. E.g. Text documents, images, videos, and audio files.
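The difference between semi-structured and quasi-structured data is easier to see in code. The sketch below is only an illustration: the sensor record, the clickstream line format, and the `pages_visited` helper are all made up for this example. The JSON record can be read directly, while the clickstream lines need a hand-written pattern before the visited pages can be extracted.

```python
import json
import re
from urllib.parse import urlparse

# Semi-structured: a JSON record names its own fields (made-up sensor reading).
sensor_json = '{"sensor_id": "s-17", "temperature_c": 21.5}'
reading = json.loads(sensor_json)

# Quasi-structured: clickstream lines (made-up format) need some effort to parse.
clickstream = [
    "2021-02-21T10:00:01 user=42 GET https://example.com/home",
    "2021-02-21T10:00:09 user=42 GET https://example.com/pricing?ref=home",
]
LINE = re.compile(r"^(?P<ts>\S+) user=(?P<user>\d+) GET (?P<url>\S+)$")


def pages_visited(lines):
    """Return the page paths in the order they were visited."""
    pages = []
    for line in lines:
        match = LINE.match(line)
        if match:  # skip lines that do not fit the assumed pattern
            pages.append(urlparse(match.group("url")).path)
    return pages


print(reading["temperature_c"])
print(pages_visited(clickstream))
```

Unstructured data such as images or audio has no comparable shortcut; it usually needs specialised tools before any fields can be extracted at all.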

As mentioned earlier, there are two more Vs:

Veracity: refers to the uncertainty of the available data. Veracity problems arise because high volumes of data bring incompleteness and inconsistency.

Value: refers to turning data into something useful. By extracting value from the big data they have access to, businesses may generate revenue.

Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus — as long as the right policies and enablers are in place. — McKinsey Report 2011

These days, we live in the era of Big Data and that is due to three main factors:

1. Storage Capability

Storing large amounts of data now takes far less effort and physical space than ever before.

Figure: 5 MB in 1956 vs. 1 TB in 2021

The shape of storage has changed, and so has the method of storing data. Several new concepts are used to store data in non-traditional ways. One of them is the Distributed File System (DFS), a storage system that is spread across multiple file servers or multiple locations.
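The idea behind a distributed file system can be shown with a toy sketch. It is not the API of any real DFS: the function names, the block size, the server names, and the round-robin placement with two replicas are all assumptions chosen for illustration. A file is cut into blocks, and each block is assigned to more than one server so that losing a single server does not lose data.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, servers: list[str], replicas: int = 2) -> dict[int, list[str]]:
    """Assign each block to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + r) % len(servers)] for r in range(replicas)
        ]
    return placement


# Hypothetical file contents and server names.
data = b"a file that is too large for a single machine"
blocks = split_into_blocks(data, block_size=16)
placement = place_blocks(len(blocks), ["server-a", "server-b", "server-c"])

for block_id, holders in placement.items():
    print(block_id, holders)
```

Real systems add much more on top of this, such as metadata tracking, failure detection, and re-replication, but the split-and-replicate idea is the core of it.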

2. On-Demand Computing

It is a cloud computing model where computing resources are assigned on an as-needed and when-needed basis. In other words, it is the delivery of different services and resources through the web.

Because On-Demand Computing keeps computing resources to a minimum until more are needed, companies can cut costs significantly. It also addresses the challenge of meeting unpredictable, fluctuating computing demands efficiently.
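A rough sketch of why this saves money, using entirely hypothetical numbers: the hourly demand, the per-server capacity, and the price per server-hour below are invented for illustration and do not come from any cloud provider. The comparison is between keeping enough servers for the peak running all the time and running only what each hour needs.

```python
import math

# Hypothetical workload: requests arriving in each of six hours.
hourly_demand = [200, 150, 900, 2400, 800, 300]
CAPACITY_PER_SERVER = 500        # assumed requests one server handles per hour
CENTS_PER_SERVER_HOUR = 10       # assumed price


def servers_needed(requests: int) -> int:
    """Smallest number of servers that can handle the given requests."""
    return math.ceil(requests / CAPACITY_PER_SERVER)


# On-demand: run only what each hour needs.
on_demand_hours = sum(servers_needed(d) for d in hourly_demand)

# Fixed: provision for the peak and keep it running every hour.
peak_servers = max(servers_needed(d) for d in hourly_demand)
fixed_hours = peak_servers * len(hourly_demand)

print("on-demand:", on_demand_hours, "server-hours,", on_demand_hours * CENTS_PER_SERVER_HOUR, "cents")
print("fixed:    ", fixed_hours, "server-hours,", fixed_hours * CENTS_PER_SERVER_HOUR, "cents")
```

With these made-up figures, on-demand provisioning uses 12 server-hours against 30 for fixed peak provisioning. The gap grows as demand becomes more uneven.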

3. Open-Source Software Development

Open-source technology has democratized automation in a way that has never happened before.

Throughout the history of technology, inventors and researchers generally treat new inventions and research findings as carefully guarded secrets, due to the substantial costs needed to bring them into existence as well as the outsized profits that they promise. — Sylvia Liu on Berkeley MDP.

Open-source software development enables everyone interested in the field of data to learn and to try out what they have learned.
