# What is Big Data? How to handle it?

--

Article 1 in Big Data Series

How big is big data? 1GB, 10GB, 100GB? And how do we deal with it?

Let’s try to understand the term “Big Data” and how to handle it.

## What is Big Data???

If a laptop has 16GB of memory, a 1TB hard disk, and 8 CPU cores, it can process a 1MB file in a few seconds to a few minutes. What if the file is 1GB? It could take a few minutes to an hour or so, which might still be OK. What if the file is a few TBs in size? Do you think the laptop can handle it? Well, maybe, in a few days (if at all).
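Part of why a huge file overwhelms a laptop is that it cannot fit in memory at once. A minimal sketch of one common workaround (this is an illustrative example, not from the article; the function name and chunk size are assumptions) is to stream the file in fixed-size chunks so only one chunk is ever in RAM:

```python
def count_records_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Count newline-delimited records while holding only one chunk in RAM.

    Reads the file in binary chunks (default 64MB) instead of loading
    the whole file, so even a multi-TB file never exhausts memory.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # read at most chunk_size bytes
            if not chunk:
                break  # end of file reached
            count += chunk.count(b"\n")
    return count
```

This keeps memory use flat regardless of file size, but it is still limited by one machine's disk and CPU speed, which is exactly the limitation discussed next.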

Now, in this case, a 1TB file is probably big data for the laptop. Let's assume there is a super machine with 1000TB of memory (just for argument's sake); it can still handle only data of some finite size in a few days. For this super machine, maybe 10000TB could be big data.

So, big data is a relative term and there is no definite answer.

## Why do we need to solve the problem above?

As an expert famously said, "Data is the new oil. Like oil, data is valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc., to create a valuable entity that drives profitable activity; so must data be broken down and analyzed for it to have value." And ideally, in as little time as possible.

## How do we solve the problem above?

Thinking logically, there are 2 options possible:

i) Increase the size of the machine
ii) Increase the # of machines.

Let’s try to evaluate both the options.
i) Increase the size of the machine (monolithic systems): It is observed that performance does not increase proportionally with the increase in size (RAM, hard disk, CPU) beyond a certain extent.
ii) Increase the # of machines (distributed systems): Performance increases significantly with the # of machines, i.e., 10 machines with 8GB RAM, a 1TB hard disk, and an 8-core CPU each would generally perform better than 1 machine with 80GB RAM, a 10TB hard disk, and an 80-core CPU.

Let’s try to understand why the second option is better in layman terms with a simple example below:

Assume there is a big container with balls of 10 different colors, and the task is to segregate them by color and get the count for each color. There is only 1 person assigned to the task, who consumes about 10 mins per color. The total time consumed is 10 colors * 10 mins = 100 mins.

What if we don't have time to wait for 100 mins? What if we need to get it done in 10 mins? Yes!!! You thought it correctly: assign the task to 10 people who can work simultaneously, so that the task is complete in about 10 mins.
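The split-the-work-and-merge-the-counts idea above can be sketched in a few lines of Python. This is a toy illustration (the function names and worker count are my own, not from the article), and it uses threads on one machine purely to show the pattern; in a real cluster, each share of the data would go to a separate machine:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_colors(balls):
    """One 'person': count the colors in their share of the container."""
    return Counter(balls)


def parallel_count(balls, n_workers=10):
    """Split the container into shares, count each share concurrently, merge."""
    # Deal the balls out round-robin into roughly equal shares, one per worker.
    shares = [balls[i::n_workers] for i in range(n_workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_colors, shares):
            total.update(partial)  # merge each worker's partial counts
    return total
```

The two-phase shape here (count locally, then merge the partial results) is the same divide-and-combine pattern that distributed frameworks apply across whole machines.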

## What is a cluster???

This is exactly how companies attack the problem of enormously increasing data sizes and the need to analyze them as early as possible. Comparing the above example to the real world: the big container = big data, 1 person = 1 computer, and 10 persons (i.e., >1 person) = a group of computers, also called a cluster.

While there is no specific definition around the size of big data, there are certain characteristics of big data, famously known as the 5 V's.

• Volume: As discussed so far, big data involves huge volumes of data in the real world. (Can you imagine how huge the data being collected and analyzed by Google is?)
• Value: Churning and analyzing huge volumes of data can bring really huge value to organizations, provided the quality and analysis are good. (How does Amazon send recommendations and derive value from customer purchase behavior?)
• Variety: Data can come in various forms, like structured (traditional rows and columns), semi-structured (XML, etc.), and unstructured (images, etc.). Depending on the business, companies might need to analyze one to many different varieties of data. (Have you thought about the varieties of data a company like Facebook deals with?)
• Velocity: Data nowadays is generated in seconds, and the demand to analyze it in near real time is high. (How does YouTube calculate the # of views of a video within a minute of it being posted?)
• Veracity: This is the quality of the data. No matter how humongous the volume of data is, it is of no use unless it is of decent to high quality. (What happens if the data entered in a patient intake form for clinical trials is of poor quality, like incorrect gender, incorrect no. of patients, etc.?)

Hope this article created some level of understanding of the term big data.

Note: All the images embedded here are retrieved from internet and the author has no intention of copyright infringement.


A Data guy, hustling to be a full-time Data Engineer. Fun Fact: Majored in Pharmacy, Chemistry, Information Systems. www.linkedin.com/in/ravi-nalla