Journey to the Cloud (2/3)

Bayu Satria Setiadi
4 min read · Nov 6, 2019


In 2017, we started a big data pilot project. The project consisted of topology design, architecture, benchmarking, and operations. In this project, we committed to Open Source technology. That’s why 100% of our big data ecosystem was built from scratch. In other words, it’s a plain vanilla setup. No distribution like Cloudera/MapR/etc. I’m not joking.

A scalable data system isn’t cool. You know what’s cool? Real-time big data processing.

Credit goes to Nathan Marz, who coined the term Lambda Architecture in his book “Big Data: Principles and best practices of scalable real-time data systems”, available on Amazon. Think of it primarily as a theory book, focused on how to approach building a solution to any Big Data problem. I recommend it to any big data practitioner.

Lambda Architecture has 3 main layers:

  1. Batch Layer
    The batch layer stores an immutable, constantly growing master dataset. You use batch processing to compute over the data in this layer.
  2. Serving Layer
    The batch layer produces precomputed views as a result of batch processing. You need to load them somewhere so they can be queried; a NoSQL database is a typical choice for this purpose. Why not an RDBMS? Technically you could use one, but it doesn’t scale as well.
  3. Speed/Streaming Layer
    The serving layer is updated whenever batch processing finishes its work, which means the only data not represented in the serving layer is the data that arrived in the batch layer after that run. Some business decisions just can’t wait until the next day/period when that data becomes available. All that’s left to do is compute functions over those last few hours of data in real time and store the results in the serving layer as well (sketched right after this list). This is what separates the lambda architecture from typical batch processing.
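
To make the batch/speed split concrete, here’s a minimal sketch in Scala of how a query can be answered by merging a precomputed batch view with a much smaller realtime view. The page names and counts are made up for illustration; this isn’t our production code.

```scala
// Illustrative only: answer a query by merging the precomputed batch view
// with the realtime view covering data ingested since the last batch run.
def merge(batchView: Map[String, Long], realtimeView: Map[String, Long]): Map[String, Long] =
  (batchView.keySet ++ realtimeView.keySet).map { page =>
    page -> (batchView.getOrElse(page, 0L) + realtimeView.getOrElse(page, 0L))
  }.toMap

// Hypothetical page-view counts: last night's batch run plus today's stream.
val batchView    = Map("home" -> 10000L, "pricing" -> 1200L)
val realtimeView = Map("home" -> 35L, "signup" -> 4L)

merge(batchView, realtimeView)
// => Map(home -> 10035, pricing -> 1200, signup -> 4)
```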

I’d like to add one more layer: the Ingestion Layer. As the name implies, it handles data ingestion, simply delivering your data sources into the other parts of the lambda architecture.

Here’s what our lambda architecture looks like:

Lambda Architecture at Link Net

Each part of the system is described below:

  • Sqoop
    If you need to export tables from an RDBMS to the Hadoop Distributed File System, Sqoop is the answer. You create import jobs and set up a schedule to run them periodically.
  • Flume
    Variety in Big Data means your data isn’t just structured data (like tables). At Link Net we also store unstructured data such as log files and messages from social media. Flume was created for that purpose: you write a configuration file that defines a source, channel, and sink, then start Flume as a background service, and it collects the data in real time.
  • Kafka
    Kafka is a message broker. Basically it has three parts: producers, topics, and consumers. You can think of a topic as volatile storage. Producers send messages to a topic, and on the other side, consumers retrieve the data as soon as it arrives for further processing. In our topology, Kafka is tightly coupled to Spark Streaming (see the streaming sketch after this list).
  • Spark
    The heart of our Big Data processing. Spark supports four languages: Scala, Java, Python, and R, and it has numerous capabilities including batch, streaming, and machine learning computation. Most of our data processing is written in Scala; two short sketches of a batch job and a streaming job appear after this list. We also use TensorFlow for a very specific purpose, but that’s a separate box outside the lambda architecture.
  • HDFS (Hadoop Distributed File System)
    HDFS is the main distributed storage for our data lake.
  • Hive
    For those who are already familiar with SQL, Hive is quite a handy little tool: it lets you query data stored in HDFS with a SQL-like language.
  • Cassandra
    Cassandra is a NoSQL database. It stores the final results of computations from Spark.
  • Solr/Lucene
    Solr is built on Apache Lucene, a Java library for full-text search. Lucene indexes terms and searches over those terms, which is why search performance can be extremely fast. In our business case, we use it for the Check Coverage Area page on our website.
  • Oozie
    People usually use crontab to schedule jobs, but Oozie’s capabilities go beyond regular cron: workflows can be triggered by time intervals, data availability, or external events. It’s comparable to Apache Airflow, Luigi, and Azkaban.
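
To give a feel for how these pieces fit together, here’s a rough sketch of a batch-layer job in Scala: it reads raw events from a Hive table backed by HDFS, precomputes a daily aggregate with Spark, and loads the result into Cassandra through the DataStax Spark Cassandra Connector. The table, keyspace, and column names are made up for illustration, not our actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch of a batch-layer job: Hive table (on HDFS) -> Spark aggregate -> Cassandra.
object DailyViewCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-view-counts")
      .enableHiveSupport()                      // lets Spark read Hive tables on HDFS
      .getOrCreate()

    // Hypothetical Hive table of raw page-view events.
    val dailyCounts = spark.table("raw.page_views")
      .groupBy(col("page_id"), to_date(col("event_ts")).as("view_date"))
      .agg(count(lit(1)).as("views"))

    // Write the precomputed view to the serving layer (requires the
    // spark-cassandra-connector package on the classpath).
    dailyCounts.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "serving", "table" -> "daily_view_counts"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```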
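
And a similar sketch of a speed-layer job: Spark Structured Streaming consumes a Kafka topic and maintains a running count that could later be merged with the batch view. The broker address and topic name are placeholders, and the console sink is there only to keep the example self-contained; in practice the results would land in the serving layer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch of a speed-layer job: Kafka topic -> Spark Structured Streaming -> running counts.
object RealtimeViewCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("realtime-view-counts")
      .getOrCreate()

    // Placeholder broker and topic; each Kafka record value is assumed to be a page id.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "page-views")
      .load()
      .select(col("value").cast("string").as("page_id"))

    val counts = events.groupBy("page_id").count()

    // Console sink for illustration; a real job would write to Cassandra instead.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```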

For more details, please refer to their documentation.

Each part of the architecture is horizontally scalable. But building, maintaining, and analyzing Big Data on-premise is challenging, especially with a limited number of resources. Once you hit a threshold, you need to allocate new resources (servers, budget, time, etc.) to keep things running smoothly. After a while, it’s time to learn how to leverage technology to reduce the negative impact of complexity on operations, budgets, and outages. Otherwise we will never achieve true scalability, both technically and in business terms.

In the future, the economics favor renting processing power over buying it, as it dramatically accelerates business agility.

That’s why we are going to move to the cloud. Part 3

