About Big Data Foundations

Today I bring you an overview of the Big Data Fundamentals course I completed over the weekend at Data Science Academy. The course covers the fundamentals, concepts, and tools of Big Data.

What is Big Data?

Wikipedia: “a term used to describe data sets so large or complex that traditional data processing applications still cannot handle them”

Oracle: “big data is a holistic information management strategy that includes and integrates many new types of data and data management alongside traditional data”

Data Science Academy: “The ability of a society to obtain information in new ways, generating new ideas and goods and services of significant value”

Big Data is a large volume of structured, unstructured, or streaming data from various sources and in various formats which, combined with analytical techniques, generates insights for the business and supports real-time decision making.

When we talk about Big Data, there are four pillars that help us understand its reality, the 4 V’s of Big Data: Volume (the size of the data), Variety (the formats of the data), Velocity (the speed at which the data is generated), and Veracity (the trustworthiness of the data).

Have you ever wondered how much data Facebook and Google handle?

The volume of data handled by these internet giants is on the order of 1 Yottabyte.

It would be something like:

1000 Terabytes = 1 Petabyte

1000 Petabytes = 1 Exabyte

1000 Exabytes = 1 Zettabyte

1000 Zettabytes = 1 Yottabyte
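
Just to make the scale concrete, here is a small illustrative Java snippet (not from the course, purely a toy example) that prints how many bytes each of these decimal units represents, starting from 1 Terabyte = 10^12 bytes:

```java
import java.math.BigInteger;

public class DataScale {
    public static void main(String[] args) {
        // Decimal (SI) units: each step up the scale multiplies the previous unit by 1000.
        String[] units = {"Terabyte", "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"};
        BigInteger bytes = BigInteger.TEN.pow(12); // 1 Terabyte = 10^12 bytes

        for (String unit : units) {
            System.out.printf("1 %-9s = %s bytes%n", unit, bytes);
            bytes = bytes.multiply(BigInteger.valueOf(1000));
        }
    }
}
```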

Surreal, don’t you agree? But how do you store, process, and manage such massive amounts of information? Now that’s the cool part: nice to meet you, Hadoop!

At the beginning of the course I was intrigued, because it was Hadoop here, Hadoop there, and I still had no idea how big this tool is, but we will get there.

So what is Hadoop, also known as Apache Hadoop?

It is an open-source framework, an Apache top-level project since 2008, for distributed, highly reliable storage and processing of large datasets on clusters of low-cost hardware.

This tool is extremely powerful because it was created to work with Big Data; it was born to meet a reality that, until then, we did not know how to handle, and no tool on the market did it effectively. It is a framework that makes a cluster of machines behave as one, spreading storage and processing across low-cost hardware with reliability and scalability, and leveraging the combined power of those machines to work with massive amounts of data.

Hadoop consists of three main components:

  • Hadoop Distributed File System (HDFS): manages the disks of a cluster of machines so that many disks appear as a single one. It is responsible for writes and reads on the cluster’s disks.
  • Hadoop YARN: responsible for managing the cluster’s computing resources and for scheduling them.
  • Hadoop MapReduce: responsible for managing data processing across the cluster (see the sketch after this list).
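
To give a feel for what MapReduce looks like in practice, below is a rough sketch of the classic word-count job in Java, close to the example shipped with the Hadoop documentation; it is illustrative only, not code from the course:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each word in a line of input, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this would typically be packaged into a jar and submitted to the cluster with something like `hadoop jar wordcount.jar WordCount /input /output` (the jar name and paths here are hypothetical).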

It is worth noting that Apache Hadoop is meant for problems so large and complex that traditional systems cannot handle them; it was born to meet the present and future of Big Data.

Hadoop is a free, Java-based framework inspired by the Google File System (GFS), a file system created by Google. In other words, it builds on a technique from one of the best technology companies in the world, which gives us a sense of the real power of this tool.
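
Because HDFS presents all of the cluster’s disks as a single file system, client code talks to it much like a local file system. Here is a minimal sketch using Hadoop’s Java FileSystem API, assuming an already configured cluster; the file paths are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings (fs.defaultFS, replication, etc.) from
        // core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The destination looks like one file system path, even though HDFS
        // splits the file into blocks replicated across the cluster's disks.
        fs.copyFromLocalFile(new Path("/tmp/local-data.csv"),        // hypothetical local file
                             new Path("/user/demo/local-data.csv")); // hypothetical HDFS path

        fs.close();
    }
}
```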

I split this post into four parts so that none of them has too much content; over the next weeks I will post the rest.

If you want to leave your opinion, tips, or corrections, feel free to contact me.

Sources:

https://pt.wikipedia.org/wiki/Big_data / https://www.oracle.com/br/big-data/index.html / https://pt.wikipedia.org/wiki/Hadoop / https://www.datascienceacademy.com.br/