Understanding the basics of Big Data Analytics

Yash Gupta
Data Science Simplified
7 min read · Jan 24, 2022


Around 2 quintillion (2,000,000,000,000,000,000) bytes of data are generated each day across all industries. In 2020, it was estimated that the total amount of data in the world was almost 65 zettabytes, and that more data had been created in the previous two years than in all of prior human history. To put that in perspective, a single zettabyte is a trillion gigabytes.

In this article, we will look at how Big Data Analytics is carried out on datasets that run into terabytes and petabytes. While we are often unaware of how big data influences our lives, we rely on it constantly as part of our day-to-day activities.

Before that, you may ask, what is so different about Big Data in comparison to normal or small data?

Well, the definitions given by multiple organizations come down to the following “5 V’s” of data that make up the Big Data scenario:

Volume: referring to the size of the data.

Velocity: referring to the speed at which the data is flowing.

Variety: referring to the different types of data you’re gathering.

Veracity: referring to the reliability and trustworthiness of the data you’re gathering from multiple sources.

Value: referring to the value that you can gain from analyzing the data because, at the end of the day, the data has to add value to your goals.


To understand just how big this data can be, let’s take some examples.

Companies like Netflix and Amazon use big data to recommend products and content their customers might like. This data reflects the choices made by the millions of customers who visit their platforms daily, and it is continuously fed back into their recommendation engines to serve customers’ needs better.

When data this big needs to be analyzed, the processing capacity of any single system falls short of what is required to handle data at this scale. This is where the beautiful idea of “parallel computing” comes in.

Google’s technological advancements in the early 2000s, with research papers published in 2003 and 2004, together with the work of Doug Cutting and Mike Cafarella and later Yahoo!, eventually led to the development of Hadoop, which Cutting named after his son’s toy elephant. This marked one of the first moments when Big Data could be harnessed at scale, turning massive chunks of data into an organization's holy grail for generating value.

Now onto the next question, what is Hadoop?

According to the official website:

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

In simpler terms, Hadoop helps users store massive amounts of data across multiple smaller systems, called nodes. Files are broken down into blocks of a fixed size, such as 64 or 128 MB, and each block is replicated across nodes to ensure easy accessibility and reliable storage. Because the data is stored in smaller blocks, it can be analyzed in parallel, which makes accessing chunks of data out of terabytes or petabytes quicker than you might imagine.
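To make that idea more concrete, here is a minimal, purely conceptual Python sketch of how a file could be split into fixed-size blocks and each block replicated across several nodes. This is not the actual HDFS API; the block size, node names and replication factor are illustrative assumptions (real HDFS defaults are 64 or 128 MB blocks with a replication factor of 3).

```python
# Conceptual sketch only: splitting a "file" into fixed-size blocks and
# assigning each block to several nodes, the way an HDFS-like system would.
# NOT the real HDFS API; sizes are tiny here so the output stays readable.
from itertools import cycle

def split_into_blocks(data: bytes, block_size: int):
    """Return fixed-size blocks of the input data."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` nodes, round-robin style."""
    node_cycle = cycle(nodes)
    return {
        block_id: [next(node_cycle) for _ in range(replication)]
        for block_id in range(num_blocks)
    }

nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
blocks = split_into_blocks(b"a small stand-in for a very large file", block_size=8)
print(place_blocks(len(blocks), nodes))
# e.g. {0: ['node-1', 'node-2', 'node-3'], 1: ['node-4', 'node-5', 'node-1'], ...}
```

If a node fails, the same block can still be read from one of its replicas, which is what the official description above means by detecting and handling failures at the application layer.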

If you have ever worked with an Excel sheet of over 500,000 rows and tried running a complex formula on it, you know how long it takes. Now imagine the same thing for 500,000,000 rows, only quicker.

Quicker because hundreds or thousands of computers work on the big block of data in parallel, splitting it into smaller blocks that they analyze independently and then combine to derive information from it.
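As a small, single-machine illustration of that idea, the hedged sketch below uses Python’s multiprocessing module to split a list of numbers into chunks, have worker processes compute partial sums independently, and then combine the partial results. A Hadoop cluster applies the same split-process-combine pattern across many machines rather than CPU cores.

```python
# Single-machine illustration of the parallel idea: split the data into
# chunks, process each chunk independently, then combine the partial results.
# A Hadoop cluster does this across many machines instead of CPU cores.
from multiprocessing import Pool

def partial_sum(chunk):
    """The work done independently on one chunk of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))               # stand-in for millions of rows
    n_chunks = 8
    chunk_size = len(data) // n_chunks
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=n_chunks) as pool:
        partials = pool.map(partial_sum, chunks)  # each chunk summed in parallel

    print(sum(partials))                          # combine the partial results
```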

Now, you might assume this would be pretty difficult for most people to do. Hadoop was originally written in Java, and early on even a simple job required many lines of code. As the ecosystem developed over time, Hadoop became accessible through languages such as Python, SQL and Scala, leading to wider and easier usage.
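As a hedged example of what that higher-level access can look like, here is a short PySpark snippet (Apache Spark’s Python API, a common engine in the Hadoop ecosystem that can read data stored on HDFS). The HDFS path is hypothetical, and the snippet assumes a configured Spark installation.

```python
# A few lines of PySpark replace what would once have been a long Java job:
# read a large CSV stored on HDFS and run a distributed count over it.
# The path "hdfs:///data/transactions.csv" is a hypothetical example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-count-example").getOrCreate()

df = spark.read.csv("hdfs:///data/transactions.csv", header=True)
print(df.count())        # the count is computed in parallel across the cluster

spark.stop()
```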

Hadoop is built on top of two major components. The first is HDFS, the Hadoop Distributed File System. The second is MapReduce, the framework that performs two functions: mapping a function over the data and reducing the intermediate results of a given computation, working on a key-to-value basis.
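To make the map and reduce phases concrete, here is a minimal, self-contained Python sketch of the classic word-count example. It imitates the map, shuffle and reduce steps locally in plain Python rather than using Hadoop’s actual API, so treat it as a conceptual illustration of the key-value flow, not a Hadoop program.

```python
# Word count in the MapReduce style, imitated locally in plain Python.
# Real Hadoop distributes the map and reduce phases across a cluster;
# here each phase runs in-process to show the key/value flow.
from collections import defaultdict

documents = [
    "big data needs big storage",
    "big data needs parallel computing",
]

# Map phase: emit a (key, value) pair of (word, 1) for every word.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all values that share the same key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: combine the values for each key into a single result.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)
# {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'parallel': 1, 'computing': 1}
```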


How can storing so much data help a company? How can you derive value out of so much data and what is “so much data”?

Let’s take Netflix’s use of big data as a case study.

Netflix, one of the best examples of how big data can help, collects data on how customers interact with and respond to a TV show or a movie. For example, Netflix knows the time and date a user watched a show, the device used, whether the show was paused, whether the viewer resumed watching after pausing, whether people finish an entire TV show, how long it takes a user to finish a show, and so on.

It also knows which scenes people have viewed repeatedly, the ratings given to content, the number of searches made, and what is being searched for. With this data, Netflix can build a detailed profile of each of its users. To collect all this data and harness it into meaningful information, Netflix relies on data analytics. For example, Netflix uses what is known as a recommendation algorithm to suggest TV shows and movies based on users’ preferences.

Netflix’s ability to collect and use the data is the reason behind its success.

According to Netflix, its recommendation system is worth over a billion dollars a year in customer retention, and it accounts for over 80% of the content streamed on the platform. Netflix also uses its big data and analytics tools to decide whether to greenlight original content. To an outsider, it might look like Netflix is throwing its cash at whatever it can get, but in reality, it greenlights original content based on several touchpoints derived from its user base.

For example, Netflix distributed ‘Orange Is the New Black’ knowing it would be a big hit on its platform. How? Because ‘Weeds’, Jenji Kohan’s previous hit, had performed well on Netflix in terms of viewership and engagement.

Will Big Data ever go out of trend?

With data growing exponentially, at an estimated 1.7 MB created every second for every person on earth, the data we are going to amass in the near future will be massive. It will drive future advances and answer a lot of questions about how companies can cater to their customers’ needs more easily.

The entire idea of companies gathering data is to ensure that value can be derived from it. Big Data and scalable data science will shape the future as more and more data piles up within companies.

To get a grip on how Hadoop works and how big data is growing, the Big Data 101 course from Cognitive Class (formerly Big Data University), an IBM initiative, is a really good place to start.

Another resource is the free course on Introduction to Hadoop and MapReduce provided by Udacity.

For more such articles, stay tuned with us as we chart out paths to understanding data and coding and demystify other concepts related to Data Science. Please leave a review in the comments.

Do connect with me on LinkedIn if you want to discuss how Big Data will impact our lives further!


Yash Gupta
Data Science Simplified

Business Analyst at Lognormal Analytics and Data Science Enthusiast! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss