Journey to the Cloud (3/3)

Bayu Satria Setiadi
4 min read · Nov 6, 2019


We are currently working on a cloud migration PoC with GCP (Google Cloud Platform). There are three parts that we'd like to explore: BigQuery, Dataflow, and TPU (Tensor Processing Unit).

In the first and second posts, I talked a lot about our journey and how we transformed the infrastructure from a traditional data warehouse into a scalable, real-time data system.

This article, however, focuses mainly on architecting data with a mix of on-premises and cloud infrastructure.

Without further ado, here's what our lambda architecture with GCP looks like:

Lambda architecture with GCP

On-Premises:

  • Cloud Bridge
    The name doesn't refer to any commercial product; it's just what we call it. This server facilitates data exchange between on-premises and GCP, acting as a bridge. All custom data ingestion scripts and projects, including the PubSub publisher applications, live here (see the publisher sketch after this list).
  • Apache Airflow
    A workflow scheduler that replaces our Oozie. It's the same engine that powers Cloud Composer in GCP. Airflow plays a vital role in orchestrating our big data pipeline (a minimal DAG sketch appears after this list).
  • Flask
    A RESTful web service API. It's the data access layer for all of our end users.
  • Cassandra
    A wide-column NoSQL store, inspired by Google's Bigtable research paper from 2006. We use Cassandra as a time-series database.
  • MongoDB
    A document-oriented NoSQL database that stores JSON-like documents with flexible schemas. Some of our data just won't fit on a single server. Besides, it has a reasonably short learning curve and its API is very easy to use, despite its resource-intensive footprint. So, we gave it a go in this phase of the implementation.
  • Solr
    Solr is a full-text search engine. Because the data is indexed, it provides fast search across structured, semi-structured, and unstructured data. We use this technology extensively as a search platform; one of the use cases is “Cek Coverage Area” on our site firstmedia.com (a query sketch appears after this list).
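
To make the Cloud Bridge item more concrete, here is a minimal sketch of a Pub/Sub publisher script of the kind that lives on that server. The project ID, topic name, and message fields are hypothetical, not our actual pipeline.

```python
# Minimal Pub/Sub publisher, the kind of script kept on the Cloud Bridge.
# Project, topic, and payload fields below are hypothetical.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "usage-events")

def publish_event(event: dict) -> None:
    data = json.dumps(event).encode("utf-8")
    future = publisher.publish(topic_path, data, source="on-prem")
    future.result()  # block until Pub/Sub acknowledges the message

if __name__ == "__main__":
    publish_event({"customer_id": "123", "event": "modem_online"})
```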
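
And here's a rough sketch of how an Airflow DAG orchestrates a daily batch run. The task names and commands are placeholders, only meant to show the shape of a DAG (Airflow 1.x style).

```python
# Skeleton of a daily DAG; task names and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "data-eng", "retries": 1}

dag = DAG(
    dag_id="daily_ingestion",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

ingest = BashOperator(
    task_id="ingest_rdbms_to_gcs",
    bash_command="python /opt/cloud-bridge/ingest.py --date {{ ds }}",
    dag=dag,
)

load = BashOperator(
    task_id="load_bigquery",
    bash_command="python /opt/cloud-bridge/load_bq.py --date {{ ds }}",
    dag=dag,
)

ingest >> load  # run the load only after ingestion succeeds
```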
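
For the Solr use case, a coverage-area lookup boils down to a simple search against an indexed core. The core name and field names below are made up for illustration.

```python
# Hypothetical "Cek Coverage Area" lookup against a Solr core.
import pysolr

solr = pysolr.Solr("http://solr.internal:8983/solr/coverage_area", timeout=10)

def check_coverage(address: str):
    # Search the indexed addresses and return the top matches.
    results = solr.search(f'address:"{address}"', rows=5)
    return list(results)

print(check_coverage("Jalan Sudirman Jakarta"))
```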

Google Cloud Platform:

  • Cloud PubSub
    PubSub replaces Kafka.
  • Cloud Dataproc
    Dataproc is a managed service for a Hadoop cluster and its ecosystem (Sqoop/Hive/Pig/Spark/etc.). If you have written pipelines in Spark and want to run them on GCP, or if you need something like Spark's MLlib, use Dataproc.
    Surprisingly, we put Dataproc in the ingestion layer rather than the batch layer for data processing. Why? Because we need Sqoop to move RDBMS data into Cloud Storage (a job-submission sketch appears after this list).
  • Cloud Dataflow
    Dataflow is essentially a runner for Apache Beam. Apache Beam itself is a unified programming model: you can do both batch processing and streaming. Originally developed by Google, Beam lets you run your code on various execution engines such as Dataflow, Spark, Flink, Samza, etc. Its ambitious goal is to unify big data development. In this project, we made a bet on Beam for our big data pipeline (a small pipeline sketch appears after this list).
  • Cloud Storage
    Replaces HDFS. Known internally at Google as the Colossus file system.
  • BigQuery
    Replaces Hive. BigQuery is the external implementation of an internal Google technology code-named Dremel. To keep your costs down, you should consider partitioned and clustered tables as part of your strategy (see the DDL sketch after this list).
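
The Dataproc ingestion step is essentially "submit a Sqoop import as a Hadoop job and point it at a GCS bucket." Here is a rough sketch; the cluster name, connection string, and jar locations are all hypothetical, and it assumes the Sqoop jar and JDBC driver have been staged in GCS.

```python
# Sketch: submit a Sqoop import to Dataproc as a Hadoop job.
# Cluster, bucket, database, and jar URIs are hypothetical.
from google.cloud import dataproc_v1

region = "asia-southeast1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "ingestion-cluster"},
    "hadoop_job": {
        "main_class": "org.apache.sqoop.Sqoop",
        "jar_file_uris": [
            "gs://my-bucket/jars/sqoop-1.4.7.jar",
            "gs://my-bucket/jars/mysql-connector-java.jar",
        ],
        "args": [
            "import",
            "--connect", "jdbc:mysql://db.internal/billing",
            "--username", "etl",
            "--password-file", "gs://my-bucket/secrets/db.password",
            "--table", "invoices",
            "--target-dir", "gs://my-datalake/raw/invoices",  # lands in Cloud Storage
            "--as-avrodatafile",
        ],
    },
}

client.submit_job(request={"project_id": "my-gcp-project", "region": region, "job": job})
```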
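
To give a feel for the Dataflow piece, here is a toy Beam pipeline in the Python SDK that reads from Pub/Sub and writes to BigQuery. The subscription, table, and parsing logic are placeholders; our real pipelines are written in Java, as discussed later.

```python
# Tiny streaming pipeline on the Beam Python SDK, runnable on Dataflow.
# Subscription, table, and schema below are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",   # use "DirectRunner" for local testing
    project="my-gcp-project",
    region="asia-southeast1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-gcp-project/subscriptions/usage-events-sub")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:speed_layer.usage_events",
            schema="customer_id:STRING,event:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```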
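
And on the BigQuery side, creating a partitioned and clustered table is a one-statement affair. Dataset, table, and column names here are made up for illustration.

```python
# Create a partitioned and clustered table to keep scan costs down.
# Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

ddl = """
CREATE TABLE IF NOT EXISTS analytics.usage_events (
  customer_id STRING,
  event STRING,
  ts TIMESTAMP
)
PARTITION BY DATE(ts)
CLUSTER BY customer_id
"""

client.query(ddl).result()  # queries filtered on a single partition scan far less data
```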

The above architecture significantly reduces TCO while still utilizing our physical resources at the data center. But the most important and critical parts are already in the cloud, serverless and auto-scaled. In case something happens to our NoSQL (Cassandra/MongoDB/Solr) clusters, we'd be able to reconstruct everything from the cloud. Thanks to the Lambda Architecture.

The Challenge

Although most of our problems have been resolved by this design, there's still some work to be done. The real challenge is converting our existing data pipelines into Apache Beam workflows… in Java.

I know Apache Beam supports Python as well, but not all API features are Python-friendly; for example, at the time of writing, Beam SQL is only available in the Java SDK *sad*. Also, Java has more built-in IOs.

Built-in IO transforms for Apache Beam

Given the popularity of Python in the Big Data and Data Science fields, those limitations seem a little unfair. I hope there's more support for Python in the near future.

P.S.

In 2019, Google publicly launched Cloud Data Fusion, a no-code tool with a graphical interface for building data pipelines. It lets you run processing on both Dataflow and Dataproc.
