A Guide to Big Data Analytics

Heena Rijhwani · Published in Analytics Vidhya · Feb 5, 2021

With more than 3.7 billion people using the internet, over 40,000 Google searches performed every second, and 16 million text messages sent every minute, the amount of data being generated is growing exponentially. Big data refers to the massive data sets collected from a variety of sources to reveal insights for optimized business decision making. For instance, social media data about human behaviour and interactions is used for Sentiment Analysis to help drive businesses, make predictions in politics, and more.
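As a rough illustration of what such sentiment analysis can look like in practice, here is a minimal Python sketch using NLTK's VADER analyzer; the library choice and the example posts are assumptions for illustration, not part of any particular production pipeline.

```python
# pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()

# Hypothetical social media posts
posts = [
    "Absolutely love the new update, great job!",
    "The service was slow and the support team never replied.",
]

for post in posts:
    scores = sia.polarity_scores(post)  # returns neg/neu/pos/compound scores
    print(post, "->", scores["compound"])
```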

Ever wondered what led to Big Data Analytics? Here are the computing trends behind it:

  • Social networking
  • Cloud computing
  • Mobile computing

To gain a deeper understanding of big data, let us look at its characteristics, popularly known as the 7 V’s of Big Data.

  • Volume: This characteristic refers to the huge amount of data generated in big data applications. The amount of data generated, as well as the storage required, is enormous and is often measured in petabytes. For instance, businesses gather large amounts of data from business deals, transactions and investments, social media posts, and more. This data is analyzed to produce actionable insights and help the business achieve a competitive advantage in the market.
  • Variety: This refers to the type and nature of the data. Big data can be structured, unstructured, or semi-structured (a short Python sketch of all three types follows this list).

1. Structured data- can be stored, processed and accessed in a fixed format. It includes data stored in rows and columns in an ordered manner. Example of structured data: Data in a database.

2. Unstructured data- has no predefined format. The data size is massive and it is not easy to derive value from it. It can contain a mix of text files, videos, and images. Example of unstructured data: audio and video files.

3. Semi-structured data- contains structured as well as unstructured data. Example of semi-structured data: emails and blogs.

  • Velocity: refers to the high speed at which data is generated. This data also needs to be processed at high speed. Example: Twitter messages or Facebook posts; GPS and behavioral data are used to recommend nearby restaurants to users.
  • Veracity: refers to the quality of the data being analyzed. The data must be accurate and truthful. Veracity is very important in big data processing because inaccurate, fake, or meaningless data can have adverse consequences. For instance, using outdated sales data to design a new marketing or sales campaign will not produce useful predictions.
  • Value: insights have to be gained from the available data in order for it to be useful. Example: data on employee age, experience, qualifications, etc., can be analyzed to determine salaries.
  • Visualization: Using charts and graphs to visualize large amounts of complex data is more effective in conveying meaning than spreadsheets and reports. It helps make data readable, understandable, and accessible. Example: use of bar graphs to represent an increase in sales.
  • Variability: refers to data whose meaning keeps changing. It focuses on understanding and interpreting the correct meaning of raw data. If data is constantly changing, it can have an impact on the quality of your data. For example, a soda shop may offer 6 different blends of soda, but if you order the same blend every day and it tastes different every day, that is variability.
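To make the Variety characteristic concrete, here is a minimal Python sketch contrasting how structured, semi-structured, and unstructured data might be handled. The records and fields are made up purely for illustration, and pandas is assumed to be available.

```python
import json
import pandas as pd

# Structured: fixed schema, rows and columns (e.g. a database table or CSV export)
orders = pd.DataFrame(
    {"order_id": [101, 102], "amount": [250.0, 99.5], "country": ["IN", "US"]}
)

# Semi-structured: JSON has named fields but no rigid, uniform schema
email_json = '{"from": "a@example.com", "subject": "Q1 report", "tags": ["finance"]}'
email = json.loads(email_json)

# Unstructured: free text (or audio/video bytes) with no predefined format
review = "The delivery was late but the product quality is excellent."

print(orders.dtypes)        # the schema is explicit for structured data
print(email["subject"])     # semi-structured fields are looked up by key
print(len(review.split()))  # unstructured text needs feature extraction first
```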

Now that we know the properties of big data, it is clear that traditional systems are not adequate to process it; big data needs to be handled differently. Let us look at the comparison between the traditional and big data analytics approaches with respect to both hardware and software.

On the basis of hardware: traditional analytics typically relies on scaling up a single powerful, expensive server with centralized storage, while big data analytics scales out across clusters of inexpensive commodity machines, distributing both storage and processing over the nodes.

On the basis of software: traditional analytics is built around relational databases and SQL, where data is moved to a central processing engine, whereas big data analytics uses distributed frameworks such as Hadoop and MapReduce, which move the computation to where the data resides and handle unstructured as well as structured data.

Big Data Challenges

1. Data integration- combining data from varied sources and generating meaningful reports can be a convoluted task.

2. Data complexity- the raw input data is becoming increasingly complex. Complex data structures and architectures are needed to handle it.

3. Data security- big data analytics processes can be vulnerable to various security threats and attacks. Security measures such as firewalls, access control, and encryption are used to mitigate them (a brief encryption sketch follows this list).

4. Data capture- capturing data from diverse sources can itself be a complex task, and a variety of data capturing techniques are needed today.

5. Data mobility and scalability- moving large volumes of data between systems and scaling infrastructure as data grows are ongoing challenges.
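As a small illustration of the encryption measure mentioned above, here is a minimal Python sketch using the `cryptography` package's Fernet symmetric encryption; the library choice and the sample record are assumptions for illustration only.

```python
# pip install cryptography
from cryptography.fernet import Fernet

# Generate a symmetric key; in practice this would live in a secure key store
key = Fernet.generate_key()
fernet = Fernet(key)

# Hypothetical sensitive record to protect at rest or in transit
record = b'{"customer_id": 42, "card_last4": "1234"}'

token = fernet.encrypt(record)    # ciphertext, safe to store or transmit
original = fernet.decrypt(token)  # recoverable only with the same key

print(token[:20], "...")
print(original == record)         # True
```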

Hadoop

Hadoop is a Java-based big data analytics framework designed to fill the gaps and overcome the pitfalls of the traditional approach when data is voluminous. It is an open source framework for storing data and running applications on clusters of commodity hardware. It offers massive storage and enormous processing power. It is built on the assumption that hardware failure is possible and must be handled by the framework. Its core components include:

  • HDFS (Hadoop Distributed File System)
  • MapReduce

Hadoop splits large data files into blocks, distributes these blocks across nodes in a cluster, and transfers code to the data to allow parallel processing. Because the data is locally available and inter-process communication is reduced, processing is faster. Another important feature of Hadoop is data redundancy, due to which node failures can be easily managed.
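The following toy Python sketch simulates the block-splitting and replication idea described above. It is a conceptual simulation only; the block size, node names, and replication factor are made-up values, not how HDFS is actually implemented.

```python
import itertools

BLOCK_SIZE = 16          # bytes per block here; HDFS defaults to 128 MB
REPLICATION_FACTOR = 3   # each block is stored on this many nodes
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, like a file split into HDFS blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes in round-robin fashion."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for idx, _ in enumerate(blocks):
        placement[idx] = [next(node_cycle) for _ in range(replication)]
    return placement

data = b"a large file that would normally be many gigabytes in size"
blocks = split_into_blocks(data)
placement = place_replicas(blocks, NODES)

for idx, block in enumerate(blocks):
    print(f"block {idx}: {block!r} -> stored on {placement[idx]}")
```

Because every block lives on several nodes, losing one node never loses data, which is the redundancy property mentioned above.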

Hadoop environment

  • Hadoop Common- includes the libraries needed by the other Hadoop modules
  • HDFS- distributed file system modeled after the Google File System (GFS) paper. It is the storage component of Hadoop and provides data replication and fault tolerance.
  • MapReduce- distributed processing framework modeled after Google's MapReduce paper. It is the computing component of Hadoop and allows work to be parallelized over large amounts of raw data (a small word-count sketch follows this list).
  • Hadoop YARN- resource management platform for managing computing resources in clusters and using them to schedule users' applications.
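Here is a minimal local simulation of the MapReduce word-count pattern in Python. The sample lines are made up; on a real cluster the map and reduce functions would run as distributed tasks (for example via Hadoop Streaming), with the framework performing the shuffle between them.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts per word (input must be sorted by key)."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["big data needs big clusters", "hadoop processes big data"]
    # The shuffle step: sort intermediate pairs by key before reducing,
    # which is what the Hadoop framework does between map and reduce.
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    for word, count in reducer(shuffled):
        print(word, count)
```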

We will further delve into HDFS, MapReduce and the Hadoop Ecosystem in my next blog.
