Introduction to Big Data

Sruthi Sree Kumar
Big Data Processing
3 min read · Sep 26, 2022

Big Data is data characterized by four key attributes, also known as the 4 V's.

The Four Dimensions of Big Data

Volume: Volume indicates the size of the data. Big data is so large that traditional systems cannot handle it.

Variety: Variety indicates the heterogeneity of the data. Data comes in different kinds, such as structured data (e.g., RDBMS tables), semi-structured data (e.g., CSV, XML, JSON), and unstructured data (e.g., logs, audio, video).

Velocity: Velocity is the data generation rate or speed at which new data is formed.

Veracity: Veracity relates to the reliability and quality of the data. High-veracity data has many records that are valuable to analyze, whereas low-veracity data contains a high percentage of meaningless values, such as nulls or missing fields.
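One rough way to quantify veracity is the share of missing values in a dataset. The sketch below uses a small made-up example dataset (the records and field names are illustrative, not from any real system):

```python
# A rough veracity check: what fraction of fields are missing?
# "records" is a hypothetical example dataset.
records = [
    {"user": "a", "age": 31},
    {"user": "b", "age": None},   # missing value lowers veracity
    {"user": "c", "age": 27},
    {"user": None, "age": None},
]

total_fields = sum(len(r) for r in records)
missing = sum(1 for r in records for v in r.values() if v is None)
print(f"missing ratio: {missing / total_fields:.0%}")
```

A high missing ratio is one signal of low-veracity data that may need cleaning before analysis.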

What Happens in an Internet Minute in 2022

With the advancement of the Internet and progress in information technology, the amount of data generated, consumed, and transferred has been increasing exponentially. These volumes exceed the capabilities of traditional data storage and processing systems. To process data, we first need to store it, but traditional systems cannot store data at this scale either. This is where new systems for storing and processing large-scale data come in. The major requirements of such a system are:

Requirements of a big data system
  1. Store large amounts of data
  2. Process the data in a timely manner
  3. Scale the system as the data grows

Scalability

There are two types of scaling:

Horizontal Scaling vs Vertical Scaling
  1. Vertical scaling: Vertical scaling is also known as scale-up. Scaling up means making a resource larger: we keep adding resources (CPU, memory, disk) to a single node in the system. It is more expensive, and we eventually reach a point beyond which we cannot scale further.
  2. Horizontal scaling: Horizontal scaling is also known as scale-out. Scaling out means adding more nodes to the system. This is only possible in distributed systems, and it makes fault tolerance more challenging.
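A common way to scale out is to spread records across nodes by hashing their keys. The sketch below is illustrative (the node names and simple modulo scheme are assumptions, not from the article); real systems often use consistent hashing so that adding a node remaps fewer keys:

```python
# Sketch of horizontal scaling: route each record to a node by key hash.
# Node names and the modulo partitioning scheme are illustrative only.
import hashlib

def node_for(key: str, nodes: list) -> str:
    # Stable hash so the same key always lands on the same node.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
for key in ["user:42", "user:43", "user:44"]:
    print(key, "->", node_for(key, nodes))

# Scaling out = appending a node; with plain modulo hashing many keys
# then remap, which is part of the rebalancing/fault-tolerance challenge.
nodes.append("node-4")
```

The design choice here (modulo over a stable hash) keeps the sketch short; it illustrates why rebalancing is a real cost of scale-out.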

Since big data does not fit into traditional storage, we need distributed storage systems to store it and distributed processing systems to process it. There are different distributed storage systems, such as distributed file systems (HDFS, GFS), NoSQL databases (DynamoDB), and distributed messaging systems (Kafka). Similarly, there are multiple distributed data processing systems, such as MapReduce, Spark, and Flink. We will look deeper into different distributed storage and processing systems in the following blogs.
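To give a feel for the processing model, here is a minimal single-process sketch of MapReduce-style word counting. Real systems like Hadoop MapReduce run the map and reduce phases in parallel across many nodes, with a shuffle step in between; here the shuffle and reduce are collapsed into one in-memory step:

```python
# Minimal, single-process sketch of the MapReduce model (word count).
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle + reduce collapsed into one step: sum counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data grows fast"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 1, 'grows': 1, 'fast': 1}
```

In a real cluster, the map output would be partitioned by key (as in the hashing sketch above) so each reducer handles a disjoint set of words.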

