Journey to the Cloud (3/3)

Bayu Satria Setiadi
Nov 6 · 3 min read

We are currently working on a cloud migration PoC with GCP (Google Cloud Platform). There are three parts we'd like to explore: BigQuery, Dataflow, and TPU (Tensor Processing Unit).

In the first and second posts, I talked a lot about our journey and how we transformed our infrastructure from a traditional data warehouse into a scalable, real-time data system.

This article focuses on architecting data with a mix of on-premises and cloud infrastructure.

Without further ado, here's what our Lambda Architecture on GCP looks like:

Lambda architecture with GCP.

On-Premises:

  • Cloud Bridge
The name doesn't refer to any commercial product; it's just what we call it. This server facilitates data exchange between our on-premises systems and GCP, acting as a bridge. All custom data ingestion scripts and projects, including Pub/Sub publisher applications, live here.
  • Apache Airflow
Workflow scheduler, replacing Oozie. Cloud Composer is its managed equivalent on GCP.
  • Flask
Serves our RESTful web service APIs.
  • Cassandra
Wide-column NoSQL database, inspired by Google's Bigtable research paper from 2006.
  • MongoDB
Document-oriented NoSQL database; it stores JSON-like documents with flexible schemas.
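To give a feel for what lives on the Cloud Bridge, here is a minimal sketch of a Pub/Sub publisher in Python. The `encode_event` helper and all names are our own illustration, not part of the client library; the `publisher` argument is assumed to be a `google.cloud.pubsub_v1.PublisherClient`.

```python
import json


def encode_event(event: dict) -> bytes:
    # Pub/Sub message payloads are raw bytes; we serialize events as UTF-8 JSON.
    # sort_keys gives a stable byte representation for identical events.
    return json.dumps(event, sort_keys=True).encode("utf-8")


def publish_event(publisher, project_id: str, topic_id: str, event: dict):
    # `publisher` is a google.cloud.pubsub_v1.PublisherClient instance
    # (hypothetical wiring; credentials and project are configured elsewhere).
    topic_path = publisher.topic_path(project_id, topic_id)
    # publish() returns a future; callers can block on .result() for the message ID.
    return publisher.publish(topic_path, data=encode_event(event))
```

On the bridge server, a script like this would be triggered by Airflow on a schedule or run continuously as a small daemon.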

Google Cloud Platform:

  • Cloud Pub/Sub
    Pub/Sub replaces Kafka.
  • Cloud Dataproc
    Dataproc is a managed service for Hadoop clusters and the Hadoop ecosystem (Sqoop, Hive, Pig, Spark, etc.). If you've already written pipelines in Spark and want to run them on GCP, or if you need something like Spark's MLlib, use Dataproc.
    Surprisingly, we put Dataproc in the ingestion layer rather than the batch layer. Why? Because we need Sqoop to move RDBMS data into Cloud Storage.
  • Cloud Dataflow
    Dataflow is a runner for Apache Beam. Beam itself, originally developed by Google, is a unified programming model: you can write both batch and streaming pipelines with it, and run the same code on various execution engines such as Dataflow, Spark, Flink, and Samza. Its ambitious goal is to unify big data development. In this project, we bet on Beam for our big data pipeline.
  • Cloud Storage
    Replaces HDFS. Internally at Google, it is backed by the Colossus file system.
  • BigQuery
    Replaces Hive. BigQuery is the external implementation of Google's internal query technology, code-named Dremel. To keep your costs down, consider partitioned and clustered tables as part of your strategy.
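As a concrete illustration of the partitioning advice above, a BigQuery table can declare partitioning and clustering directly in standard SQL DDL. The dataset, table, and column names below are hypothetical:

```sql
-- Queries filtered on event_date scan only the matching partitions,
-- and clustering on user_id co-locates related rows for cheaper lookups.
CREATE TABLE mydataset.events (
  event_date DATE,
  user_id STRING,
  payload STRING
)
PARTITION BY event_date
CLUSTER BY user_id;
```

Combined with a `WHERE event_date ...` filter in your queries, this keeps the bytes scanned (and the bill) proportional to the dates you actually touch.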

The architecture above significantly reduces TCO while still utilizing the physical resources in our data center. Meanwhile, the most important and critical parts are already in the cloud, serverless and auto-scaled. If something were to happen to the Cassandra/MongoDB clusters, we'd be able to reconstruct everything from the cloud, thanks to the Lambda Architecture.

The Challenge

Although this design resolves most of our problems, there's still work to be done. The real challenge is converting our existing data pipelines into Apache Beam workflows… in Java.

I know Apache Beam supports Python as well, but not all API features are available there; at the time of writing, for example, BeamSQL is only available in the Java SDK *sad*. The Java SDK also has more built-in IOs.
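One pattern that eases the conversion, regardless of SDK, is keeping transform logic in plain functions and only wrapping them in Beam primitives (`beam.Map`, `beam.combiners.Count`, etc.) at pipeline-assembly time. A minimal sketch, with hypothetical event fields, showing the Beam wiring as comments so the functions stay unit-testable without a runner:

```python
import json
from collections import Counter


def parse_event(line: str) -> dict:
    # Each Pub/Sub message or file line is assumed to be a JSON-encoded event.
    return json.loads(line)


def event_key(event: dict) -> str:
    # Key events by action type for a per-action count ("action" is hypothetical).
    return event["action"]


# In a Beam pipeline, this logic would be wired up roughly as:
#   (p | beam.io.ReadFromText(...)
#      | beam.Map(parse_event)
#      | beam.Map(event_key)
#      | beam.combiners.Count.PerElement())
# Outside Beam, the same functions compose with plain Python:
def count_actions(lines):
    return Counter(event_key(parse_event(line)) for line in lines)
```

Because the per-element functions are ordinary Python, porting them later to the Java SDK (or testing them locally) only touches the pipeline-assembly layer, not the business logic.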

Built-in IO transforms for Apache Beam

Given Python's popularity in the big data and data science fields, these limitations feel a little unfair. I hope the Python SDK gets more support in the near future.

P.S.

In 2019, Google publicly launched Cloud Data Fusion, a no-code tool with a graphical interface for building data pipelines. It lets you run processing on both Dataflow and Dataproc.
