Getting to Know Big Data Analytics

Isha Thakur · Published in The Startup · 10 min read · Jan 15, 2021

Hi guys! If you have read my previous blog (if not, find it here), you have probably seen that Big Data Analytics is the talk of the town and the buzz of every industry. Everyone is trying to leverage this powerful tool to catapult their business to success, increase revenue, and gain global recognition.

IBM maintains that businesses around the world generate nearly 2.5 quintillion bytes of data daily! Almost 90% of the global data has been produced in the last 2 years alone.

“Big data” is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Let’s get started!

I will try to provide better insight into the terms and technologies that revolve around Big Data Analytics through a set of frequently asked questions.

20 MOST ASKED BIG DATA QUESTIONS


1. What are the five V’s of Big Data?

Ans.

The five V’s are:
  • Volume — the amount of data, which is growing at a high rate; data volumes are now measured in petabytes.
  • Velocity — the rate at which data grows. Social media plays a significant role in the velocity of growing data.
  • Variety — the different types of data, i.e. the various data formats such as text, audio, video, etc.
  • Veracity — the uncertainty of the available data. The uncertainty arises mainly because the high volume of data brings incompleteness and inconsistency.
  • Value — turning data into value. By turning the big data they access into value, businesses can generate revenue.

2. What do you mean by the term Data Analytics?

Ans.

Data analytics is the science of analyzing raw data to draw conclusions from that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.

Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass of information. This information can then be used to optimize processes to increase the overall efficiency of a business or system.

3. What is Hadoop?

Ans. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

Fun Fact: Hadoop was named after a toy elephant.

4. How is Hadoop related to Big Data?

Ans. Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Hadoop uses the MapReduce programming model to process data quickly across its nodes.

5. What are the components of the Hadoop Ecosystem?

Ans.

The Hadoop ecosystem is a combination of various components. The components that come under the Hadoop ecosystem’s umbrella include:

  • HDFS
  • YARN
  • MapReduce
  • Pig
  • Hive
  • Sqoop, etc.

6. What do you mean by HDFS?

Ans. HDFS stands for Hadoop Distributed File System.

The HDFS is Hadoop’s default storage unit and is responsible for storing different types of data in a distributed environment.

HDFS has the following two components:

NameNode — This is the master node; it holds the metadata for all the data blocks in HDFS.
DataNode — These are the slave nodes, responsible for storing the actual data.
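
To make this concrete, here is a minimal sketch (assuming a running Hadoop installation with the hdfs CLI on the PATH; all file paths are illustrative) that copies a file into HDFS and then asks the NameNode where the file’s blocks ended up:

```python
import subprocess

# Copy a local file into HDFS. The NameNode records the metadata;
# the blocks themselves are written to DataNodes.
subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/data/sales.csv"], check=True)

# Report the blocks that make up the file and the DataNodes holding them.
subprocess.run(
    ["hdfs", "fsck", "/data/sales.csv", "-files", "-blocks", "-locations"],
    check=True,
)
```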

7. What is YARN?

Ans. YARN, short for Yet Another Resource Negotiator, is responsible for managing resources and providing an execution environment for the processes running on Hadoop.


The two main components of YARN are:
ResourceManager — Responsible for allocating resources to the respective NodeManagers based on their needs.
NodeManager — Executes tasks on every DataNode.

8. What is MapReduce?

Ans. MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.

The term “MapReduce” refers to two distinct tasks that Hadoop programs perform.

The first is the map job, which simply takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

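As an example, here is a minimal sketch of the classic word count written as two Hadoop Streaming scripts in Python (assuming the Hadoop Streaming jar is available; file names are illustrative). The map job emits a (word, 1) pair for every word, and the reduce job sums the counts for each word:

```python
#!/usr/bin/env python3
# mapper.py — the map job: break each input line into (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py — the reduce job: sum the counts for each word.
# Hadoop sorts the map output by key, so identical words arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These scripts would then be submitted to the cluster with the Hadoop Streaming jar (its exact path varies by installation), e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out.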

9. What is Hive?

Ans. Hive is an Apache Hadoop project: data warehouse software that runs on top of Hadoop. Hive provides a structured, table-like layer over data stored in Hadoop, and it is a very useful and convenient tool for SQL users because Hive uses HQL.


HQL is an abbreviation of Hive Query Language. It is designed for users who are already comfortable with SQL and is used to query structured data stored in Hive.
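
As an illustration, here is a minimal sketch that submits an HQL query through the third-party PyHive package (the host, port, and table name are assumptions for the example):

```python
from pyhive import hive  # pip install pyhive

# Connect to a HiveServer2 endpoint (host/port are illustrative).
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# HQL reads like SQL; Hive compiles the query into jobs on the cluster.
cursor.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)
```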

10. What do you mean by Pig?

Ans. Pig is a procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce and automatically generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster. Pig is popular because it automates some of the complexity in MapReduce development.

Pig is commonly used for complex use cases that require multiple data operations. It is more of a processing language than a query language. Pig helps develop applications that aggregate and sort data, and it supports multiple inputs and outputs. It is highly customizable, because users can write their own functions in their preferred scripting language; Ruby, Python, and even Java are all supported.
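
To give a flavor of Pig Latin, here is a minimal word-count sketch that writes a script and runs it in local mode through the pig CLI (assuming Pig is installed; file and field names are illustrative):

```python
import pathlib
import subprocess

# A tiny Pig Latin script: load words, group them, and count each group.
script = """
lines  = LOAD 'words.tsv' AS (word:chararray);
groups = GROUP lines BY word;
counts = FOREACH groups GENERATE group, COUNT(lines);
DUMP counts;
"""

pathlib.Path("wordcount.pig").write_text(script)

# "-x local" runs against the local file system rather than an HDFS cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```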

11. Compare between MapReduce, HIVE, and PIG.

Ans.

In brief: MapReduce is a low-level programming model (typically written in Java) that offers the most control but requires the most development effort. Pig provides a procedural scripting language, Pig Latin, that compiles down to MapReduce and suits multi-step data pipelines. Hive provides a declarative, SQL-like language, HQL, that also compiles to cluster jobs and suits ad hoc queries over structured data. (The original comparison table was credited to DeZyre.)

12. What are the basic steps to be performed while working with big data?

Ans.

  • Data Ingestion

Data ingestion is the process of moving/ingesting data from one place to another. In the context of Big Data, moving data from an RDBMS into Hadoop is known as data ingestion; a sketch of this step follows below.

  • Data Storage

The ingested data is stored in different storage layers, such as HDFS, Hive tables, etc.

  • Data Processing

Once the data is in HDFS, it is processed for different purposes. Data can be processed using MapReduce, Hive, etc.
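
As a sketch of the ingestion step, the snippet below drives Sqoop (mentioned earlier as part of the ecosystem) to copy an RDBMS table into HDFS; the JDBC URL, credentials, and table name are all illustrative:

```python
import subprocess

# Pull the "orders" table out of MySQL and land it in HDFS.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/shop",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", "orders",                  # RDBMS table to ingest
        "--target-dir", "/data/raw/orders",   # destination in HDFS
        "--num-mappers", "4",                 # parallel map tasks doing the copy
    ],
    check=True,
)
```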

13. What are Edge Nodes in Hadoop?

Ans. Edge nodes refer to the gateway nodes which act as an interface between the Hadoop cluster and the external network. These nodes run client applications and cluster management tools and are used as staging areas as well. Enterprise-class storage capabilities are required for Edge Nodes, and a single edge node usually suffices for multiple Hadoop clusters.

14. What does the P-value tell us about statistical data?

Ans. The main task of the P-value is to determine the significance of results after a hypothesis test in statistics.

The P-value always lies between 0 and 1, and readers can draw conclusions from it as follows (a worked example follows the list):

  • P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
  • P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
  • P-value = 0.05 is the marginal value, indicating it is possible to go either way.
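
As a worked example, the snippet below runs a one-sample t-test with SciPy and applies the rules above (the sample values and the hypothesized mean of 50 are made up):

```python
from scipy import stats

sample = [51.2, 49.8, 52.4, 50.9, 53.1, 48.7, 52.0, 51.5]

# Test whether the sample mean differs from a hypothesized mean of 50.
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"P-value = {p_value:.4f}")

if p_value <= 0.05:
    print("Strong evidence against the null hypothesis: reject it.")
else:
    print("Weak evidence against the null hypothesis: cannot reject it.")
```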

15. What are the two types of tables in Hive?

Ans.

  • Managed/internal table — when the table is dropped, both the metadata and the actual data are deleted.

  • External table — when the table is dropped, only the metadata is deleted; the actual data remains in HDFS.

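The original post showed a code snippet here; as a stand-in, here is a representative sketch that creates one table of each type through PyHive (the database connection, columns, and HDFS path are illustrative):

```python
from pyhive import hive  # pip install pyhive

cursor = hive.Connection(host="localhost", port=10000).cursor()

# Managed/internal table: Hive owns the data, so DROP TABLE removes
# both the metadata and the files under Hive's warehouse directory.
cursor.execute("""
    CREATE TABLE managed_sales (id INT, amount DOUBLE)
    STORED AS ORC
""")

# External table: Hive only tracks metadata. DROP TABLE removes the
# metadata, but the files under /data/sales remain in HDFS.
cursor.execute("""
    CREATE EXTERNAL TABLE external_sales (id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/sales'
""")
```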

16. What is the role of a JobTracker in Hadoop?

Ans. A JobTracker’s primary function is resource management (managing the TaskTrackers), tracking resource availability, and task life cycle management (tracking the tasks’ progress and fault tolerance).

The JobTracker works in a master-slave relationship with the TaskTrackers:
  • It is a process that runs on a separate node, often not on a DataNode.
  • The JobTracker communicates with the NameNode to identify data location.
  • It finds the best TaskTracker nodes to execute the tasks on the given nodes.
  • It monitors individual TaskTrackers and submits the overall job back to the client.
  • It tracks the execution of MapReduce workloads, which run locally on the slave nodes.

17. List the different file permissions in HDFS for files or directory levels.

Ans. The Hadoop Distributed File System (HDFS) has specific permissions for files and directories. There are three user levels in HDFS — Owner, Group, and Others. For each user level, there are three available permissions:

  • read (r)
  • write (w)
  • execute (x)

These three permissions work differently for files and directories; a short example of setting them follows below.

For files –

  • The r permission is for reading a file.
  • The w permission is for writing to a file.

Although there is an execute (x) permission, you cannot execute HDFS files.

For directories –

  • The r permission lists the contents of a specific directory.
  • The w permission creates or deletes a directory.
  • The x permission is for accessing a child directory.
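
Here is a short sketch of setting and inspecting these permissions through the hdfs CLI, invoked from Python (the paths are illustrative):

```python
import subprocess

# Give the owner read/write and everyone else read-only on a file:
# rw-r--r-- corresponds to the octal mode 644.
subprocess.run(["hdfs", "dfs", "-chmod", "644", "/data/sales.csv"], check=True)

# Directories need the execute bit so users can access their children:
# rwxr-xr-x corresponds to 755.
subprocess.run(["hdfs", "dfs", "-chmod", "755", "/data"], check=True)

# -ls prints the permission string (e.g. -rw-r--r--) for each entry.
subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)
```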

18. What is Lambda Architecture in Big Data?

Ans. Lambda architecture is a Big Data processing architecture. To handle enormous quantities of data, it makes use of batch as well as stream processing methods. It is a fault-tolerant architecture that achieves a balance between latency and throughput. Lambda architecture uses a data model with an append-only, immutable data source that serves as a system of record.

In Lambda architecture, we have a system that consists of three layers:

  1. Batch processing
  2. Real-time processing
  3. Serving layer

19. Explain the three layers of Lambda Architecture.

Ans.


Batch Layer (Apache Hadoop)

Hadoop is an open-source platform for storing massive amounts of data. Lambda architecture provides “human fault tolerance”: it allows simple data deletion (to remedy human error), after which the affected views are recomputed (immutability and recomputation).

The batch layer stores the master data set (HDFS) and computes arbitrary views (MapReduce). View computation is continuous: new data is aggregated into the views as they are recomputed during MapReduce iterations. Because views are computed from the entire data set, the batch layer does not update them frequently, which results in latency.

Serving Layer (Real-time Queries)

The serving layer indexes and exposes the precomputed views so they can be queried ad hoc with low latency. Open-source real-time Hadoop query implementations like Cloudera Impala, Hortonworks Stinger, Dremel (Apache Drill), and Spark Shark can query the views immediately. Hadoop can store and process large data sets, and these tools can query that data fast. At this time, Spark Shark outperforms the others thanks to its in-memory capabilities and has greater flexibility for machine learning functions.

Note that MapReduce has high latency, so a speed layer is needed for real-time results.

Speed Layer (Distributed Stream Processing)

The speed layer compensates for the batch layer’s high latency by computing real-time views in open-source distributed stream processing solutions like Storm and S4. These provide:

  • Stream processing
  • Distributed continuous computation
  • Fault tolerance
  • Modular design

In the speed layer, real-time views are incremented as new data is received. Lambda architecture also provides “complexity isolation”: real-time views are transient and can be discarded, which allows the most complex part of the system to be moved into the layer whose results are only temporary.
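
To tie the three layers together, here is a toy sketch (all names and data are made up, and a real system would use Hadoop and a stream processor rather than in-memory dictionaries):

```python
from collections import Counter

master_dataset = []        # append-only, immutable system of record
batch_view = Counter()     # recomputed from scratch by the batch layer
realtime_view = Counter()  # incremented by the speed layer, then discarded

def ingest(event):
    """New data goes to the master data set AND the speed layer."""
    master_dataset.append(event)
    realtime_view[event["page"]] += 1

def batch_recompute():
    """Batch layer: rebuild the view from the entire master data set."""
    global batch_view, realtime_view
    batch_view = Counter(e["page"] for e in master_dataset)
    realtime_view = Counter()  # transient views can simply be thrown away

def query(page):
    """Serving layer: merge the precomputed view with the real-time view."""
    return batch_view[page] + realtime_view[page]

ingest({"page": "/home"})
ingest({"page": "/home"})
print(query("/home"))  # -> 2, answered by the speed layer before any batch run
batch_recompute()
print(query("/home"))  # -> 2, now answered by the recomputed batch view
```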

20. What are the applications of Big Data?

Ans. Some common applications include healthcare analytics, fraud detection in banking and finance, recommendation engines in retail and e-commerce, social media analytics, and route optimization in transportation and logistics.

If you have reached the end, you will have gained a great deal of information and conceptual clarity about Big Data.

As students and professionals, we need to constantly upgrade and update our knowledge with the latest trends.

Happy Learning!
