Architecture of Big Data Systems

Maninder Singh Bawa
Published in LearnifyMe · 5 min read · Sep 6, 2020

In the previous post we looked at the need for Big Data systems. Let's take it further into the architecture of Big Data systems and see what different components are required for creating an efficient Big Data system.

A crucial part of any data-intensive application is its data model. But before diving into the data model for Big Data, you need to be aware of several properties of data:

  1. Rawness
  2. Immutability
  3. Eternity

Rawness

Rawness can be defined as the granularity at which data is stored.

When data is in its raw form, far more information and insights can be drawn from it than when it has been structured or summed up.

To understand rawness, let's take an example where you have the sales-transaction data of a store. Many data points are gathered per transaction: order_id, customer details, product details, discounts, and so on. Now suppose you summarize these records by location to create another dataset, so that, for example, from 100 original records you get 5 records, one per location, by summing up the sales figures. The information you can extract from the summarized dataset is far less than what you can extract from the original one. Thus many more queries can be answered from raw data.
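The store example above can be sketched in a few lines of Python. The record fields here are illustrative, not taken from any real dataset:

```python
# A sketch of how summarization loses information, assuming a simple
# list of sales-transaction dicts (field names are illustrative).
from collections import defaultdict

raw_transactions = [
    {"order_id": 1, "location": "Punjab", "product": "laptop", "sale": 900},
    {"order_id": 2, "location": "Punjab", "product": "mouse", "sale": 25},
    {"order_id": 3, "location": "Delhi", "product": "laptop", "sale": 950},
]

# Summarize by location: one record per location, sales summed up.
summary = defaultdict(int)
for t in raw_transactions:
    summary[t["location"]] += t["sale"]

# The raw data still answers "how many laptop orders were there?",
# but the summary cannot: product information was aggregated away.
laptops_sold = sum(1 for t in raw_transactions if t["product"] == "laptop")
print(dict(summary))   # {'Punjab': 925, 'Delhi': 950}
print(laptops_sold)    # 2
```

Once the summary is produced, there is no way to recover per-product or per-order questions from it; the raw records keep every query open.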

Immutability

Immutability is the property whereby you never delete or update any piece of data; a new record is added for every new piece of information.

With an immutable data system, the original data remains untouched, which helps in recovering the system from failure.

Since RDBMSs, where updating a record in the database is very common, are widely used in industry, immutability can be difficult to digest. Let's take an example where you are storing a customer's address, and as the person moves to a new location and the address changes, you update the record in the database. This leads to a loss of information about the person's previous address, which, as an analyst, you could have used to derive insights. In Big Data systems, immutability lets you retain the original data and add new data with timestamps.
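The append-only approach described above can be sketched as follows; the record schema and helper function are illustrative:

```python
# A minimal sketch of immutability: instead of updating a customer's
# address in place, every change is stored as a new, timestamped record.
import time

address_records = []

def record_address(customer_id, address, ts=None):
    """Append a new immutable address record; never update or delete."""
    address_records.append({
        "customer_id": customer_id,
        "address": address,
        "timestamp": ts if ts is not None else time.time(),
    })

record_address(42, "Ludhiana, Punjab", ts=1)
record_address(42, "Chandigarh", ts=2)  # customer moved: add, don't overwrite

# The full history is preserved, so an analyst can still see the old address.
history = [r["address"] for r in address_records if r["customer_id"] == 42]
print(history)  # ['Ludhiana, Punjab', 'Chandigarh']
```

Nothing is ever overwritten, so the previous address that an in-place update would have destroyed remains available for analysis.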

Eternity

Eternity follows from immutability: if the data is never tampered with and no updates or deletions are allowed, the data is said to be eternal.

The data always remains pure and true.

A timestamp is attached to every event or fact as it is stored in the database, and the latest timestamp tells you the current state of the data.
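Deriving the current state from timestamped records can be sketched like this, assuming illustrative address records like the ones above:

```python
# A sketch of query-time state derivation: the current state is simply
# the record with the latest timestamp (records are illustrative dicts).
records = [
    {"customer_id": 42, "address": "Ludhiana, Punjab", "timestamp": 1},
    {"customer_id": 42, "address": "Chandigarh", "timestamp": 2},
]

def current_address(customer_id):
    """Return the most recent address without mutating any record."""
    matching = [r for r in records if r["customer_id"] == customer_id]
    return max(matching, key=lambda r: r["timestamp"])["address"]

print(current_address(42))  # 'Chandigarh'
```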

By enforcing these three properties in the Big Data world we can achieve more robust systems and gain powerful capabilities. Keeping these properties in mind, let's have a look at fact-based models in Big Data.

Fact Based Model in Big Data

A fact is the fundamental unit of data. Facts are atomic and timestamped: a fact is the smallest unit of data that cannot be broken down further, and a timestamp is attached to every fact to make it unique.

Examples of facts:

  • I am a blogger
  • I live in Punjab
  • I like Big Data and Stream Processing

In the fact-based model we store the data as atomic facts.

Facts are immutable and unique.
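The example facts above can be sketched as atomic, timestamped records; the entity/attribute/value layout and identifiers here are illustrative, not a prescribed schema:

```python
# A sketch of the fact-based model: each fact states exactly one thing
# about one entity, and the timestamp makes repeated facts distinguishable.
from collections import namedtuple

Fact = namedtuple("Fact", ["entity", "attribute", "value", "timestamp"])

facts = [
    Fact("user:1", "occupation", "blogger", 1599350400),
    Fact("user:1", "location", "Punjab", 1599350401),
    Fact("user:1", "interest", "Big Data", 1599350402),
    Fact("user:1", "interest", "Stream Processing", 1599350403),
]

# Because facts are atomic, queries can recombine them freely,
# e.g. all interests of a user:
interests = [f.value for f in facts if f.attribute == "interest"]
print(interests)  # ['Big Data', 'Stream Processing']
```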

Why use a fact-based model?

  • You can query the data at any time
  • The data is tolerant to human error
  • You can store data in both structured and unstructured formats

Facts within a fact-based model capture single pieces of information, however, and do not convey the relationships between different types of entities. The solution to this is graph schemas.

Graph Schema:

Graphs that capture the structure of a dataset stored using the fact-based model are termed graph schemas. There are three core components of a graph schema: nodes, edges, and properties.

Graph Schema for Fact Based Model of Big Data
  • Nodes are the entities in the system
  • Edges are the relationships between nodes
  • Properties are information about entities
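These three components can be sketched with plain dictionaries; the entity names and relationship labels are made up for illustration:

```python
# A sketch of a graph schema over the facts above: nodes are entities,
# edges are relationships between nodes, and properties describe entities.
graph = {
    "nodes": {
        "user:1": {"type": "person"},
        "loc:punjab": {"type": "location"},
    },
    "edges": [
        ("user:1", "lives_in", "loc:punjab"),
    ],
    "properties": {
        "user:1": {"occupation": "blogger"},
    },
}

# Edges make relationships between entities explicit, which bare facts
# alone do not convey:
relations = [(s, r, t) for (s, r, t) in graph["edges"] if s == "user:1"]
print(relations)  # [('user:1', 'lives_in', 'loc:punjab')]
```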

Okay, so by now you should have the idea that information is stored as facts and that a graph schema describes the types of facts contained in the dataset. But you still don't know what format you would use for storing the facts. There are several options available, such as JSON, but it has its problems, and a serialized binary format like Avro or Parquet would be a better option. Check this article to learn more about data formats.
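As a point of comparison, here is a sketch of round-tripping one illustrative fact through JSON using the standard library. It shows JSON's main weaknesses for this use: the encoding is verbose text, and nothing enforces a schema on the way back in, whereas binary formats like Avro or Parquet store the same record compactly against a declared schema:

```python
# JSON is easy to start with but schemaless and verbose; a sketch of
# serializing a single fact (the record layout is illustrative).
import json

fact = {"entity": "user:1", "attribute": "location",
        "value": "Punjab", "timestamp": 1599350401}

encoded = json.dumps(fact)      # plain text: human-readable but large
decoded = json.loads(encoded)   # no schema validation on decode
print(decoded == fact)  # True
```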

Now that we have a basic understanding of Big Data systems, let's dive a bit deeper into Big Data system architecture.

Generalized Big Data Architecture

Big Data applications generally require several different kinds of workloads, such as:

  • Batch Processing of data at rest
  • Real Time Processing of data in motion
  • Interactive exploration of Big Data
  • Predictive Analysis and Machine Learning

Big Data systems are designed in such a way that they can handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.

Most Big Data architectures include some or all of the components shown in the figure:

Generalized Big Data Architecture
  • Data Sources: All Big Data solutions start with one or more data sources, such as databases or IoT sensors.
  • Data Storage: Data for batch processing is generally stored in distributed file systems that can hold high volumes of large files.
  • Batch Processing: Jobs that process huge amounts of data, usually by reading data from files, processing it, and writing the output back to files.
  • Real-time Message Ingestion: Used to capture and store streams of data in real time.
  • Stream Processing: Processing the data in real time and delivering the output to a sink.
  • Analytical Data Store: The data storage required for and used by analytics and reporting tools.
  • Orchestration: Used to coordinate repeated data-processing operations, moving data between multiple sources and sinks, and so on.
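The batch-processing component described above can be sketched as a toy end-to-end job: read records from storage, transform them, and write the output back. The data and file layout are illustrative; real batch jobs run on frameworks such as Hadoop or Spark over distributed storage:

```python
# A toy batch job: read sales records, aggregate by location,
# write a summary back out (in-memory files stand in for storage).
import csv
import io

raw_csv = "location,sale\nPunjab,900\nPunjab,25\nDelhi,950\n"

# "Read data from files" and process it: total sales per location.
totals = {}
for row in csv.DictReader(io.StringIO(raw_csv)):
    totals[row["location"]] = totals.get(row["location"], 0) + int(row["sale"])

# "Write output to files" for downstream analytics.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["location", "total_sales"])
for loc, total in sorted(totals.items()):
    writer.writerow([loc, total])
print(out.getvalue())
```

In a real system the input and output would live in a distributed store, and an orchestration layer would schedule this job to run repeatedly.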

These are the various components of a Big Data system. Now the question is:

When to use this style of architecture?

You should consider using it when you need to:

  1. Store and process data in volumes too large for traditional databases.
  2. Transform unstructured data for analysis and reporting.
  3. Capture, process, and analyze unbounded streams of data in real time.

Advantages of Big Data Architecture

  • A wide range of mature, open-source technology options is available.
  • High performance and throughput through parallelism.
  • Scale-out is supported by default, making these systems highly scalable.

Things to keep in mind

Although using a Big Data system may seem alluring, there are several things to take care of when deciding to adopt one:

  • Complexity of these solutions may increase in some cases.
  • A big-data skillset is very important for the team implementing it.
  • Security is also a concern: since all the data goes into a data lake, it's important to grant the correct access rights.

That's all about the different components of a Big Data system. The most important of them are batch processing and stream processing; check this post to understand the differences between the two.

The article was originally published at https://msbawa.com/architecture-of-big-data-systems/

