Building a Data Pipeline Framework (Part 1)

By – Aayush Devgan (Engineer, Data Platform)

Urban Company – Engineering · Oct 8, 2019

This blog gives an overview of how we built a data pipeline framework for UrbanClap that captures data in near real-time, processes it, and loads it into a data warehouse/data lake.

You must have seen the movie Charlie and the Chocolate Factory, an all-time classic starring Johnny Depp. The moment we think about the movie, the first image that comes to mind is a mouth-watering, exquisite variety of chocolates being made on high-speed conveyor belts out of almost nothing, with kids running around filling their mouths and pockets and simply enjoying themselves. The kids have an unlimited supply of delicious chocolates and are absolutely delighted. What more could they ask for? Maybe I’m being too dramatic, but it’s a perfect analogy for how impactful a data pipeline is for all the data-hungry analysts, scientists, and business heads.

To put the analogy in more practical terms, a data pipeline is a framework built for data that eliminates many manual steps from the data transition process and enables a smooth, scalable, automated flow of data from one station to the next. It starts by defining what, where, and how data is collected. It automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It combats possible bottlenecks and latency, and its plug-and-play nature makes it versatile and flexible.

When the data engineering team was being set up at UrbanClap, there was an urgent requirement for a data platform to drive all analytical needs, both real-time and hourly. Once we were convinced about all the possible use cases, we designed the whole architecture of the data pipeline.

Components of the Data Pipeline:

  • Ingestion
  • Processing
  • Persistence

Let’s dive into all of them one by one.

1. Ingestion:

It’s been more than a decade since big data came into the picture and people truly understood the power of data and how it can help a company build better, smarter, and more adaptable products. Over time, analytics has also become very insightful and mature, and the majority of analytical use cases have shifted towards streaming (real-time data) from traditional batch (historical data) processing.

We needed a strong, heavyweight tool that would take in all the streaming data from all sources, persist it for as long as we want, have low latency, and be fault-tolerant to avoid data loss. We needed something like a black hole in our big data universe that would just take everything in!

Gargantua Black Hole from the movie Interstellar

Kafka saves the day for us!

UrbanClap deals with both consumers and providers across platforms like Android, iOS, and web at a huge scale, and all our transactional and non-transactional data gets ingested into various topics of our multi-node Kafka cluster with appropriate replication and partitioning. We went with Confluent Kafka so that we could leverage the schema registry for schema evolution and connectors like Debezium for our various use cases. And ya, chill! It’s free until Confluent runs out of funding 😜.

On top of that, we wrote multi-threaded Node.js and Scala producers to push these events to Kafka (a minimal sketch of one follows the list below).

Also, we decided to go with the Avro file format because:

  1. It is a very compact and space-efficient format compared to bulky JSON.
  2. It supports schema evolution, making it the best fit for our data, which evolves over time.
  3. Avro serialization and deserialization are very fast, making data movement across platforms fast.
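To make this concrete, here is a minimal sketch of what such a Scala Avro producer could look like, assuming a Confluent schema registry; the event schema, topic name, and endpoints below are illustrative placeholders, not our actual production setup.

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object AvroEventProducer {
  // Hypothetical schema for an app event; field names are illustrative only.
  val schemaJson: String =
    """{
      |  "type": "record",
      |  "name": "AppEvent",
      |  "fields": [
      |    {"name": "event_name", "type": "string"},
      |    {"name": "user_id",    "type": "string"},
      |    {"name": "timestamp",  "type": "long"}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    // Confluent's Avro serializer registers/validates the schema with the schema registry.
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("schema.registry.url", "http://schema-registry:8081")

    val schema = new Schema.Parser().parse(schemaJson)
    val record: GenericRecord = new GenericData.Record(schema)
    record.put("event_name", "booking_created")
    record.put("user_id", "u-123")
    record.put("timestamp", System.currentTimeMillis())

    val producer = new KafkaProducer[String, GenericRecord](props)
    // Key by user id so events for the same user land on the same partition.
    producer.send(new ProducerRecord[String, GenericRecord]("app-events", "u-123", record))
    producer.close()
  }
}
```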

2. Processing:

Traditional ETLs/ELTs with a push-and-ingest model are long gone. With ever-evolving business requirements, we wanted a shift in thinking towards a serve-and-pull model across all domains. Instead of taking ownership of data end to end, we wanted to design a platform that would give ownership to different teams. They could pull data from a source quickly, in the way they want, with the help of a skeleton framework we provide to them.

Apache Spark is the flagship large-scale data processing framework, originally developed at UC Berkeley’s AMPLab and the framework of choice for tech giants like Netflix, Airbnb, and Spotify. It is the Flash of the big-data world with its powerful and fast in-memory computation. And as Uncle Ben from Spider-Man said, “With great power comes great responsibility”: Spark does all the heavy lifting in our data pipeline. It is the engine of the pipeline, doing all the cleaning and transformations on the data based on the actions given to it.

To fulfill business requirements, we needed a streaming analysis framework to cater to queries on real-time data and a batch analysis framework to process queries on historical data. Thus, we went ahead with Spark Structured Streaming for streaming and Spark SQL for batch. We chose the newer Structured Streaming API over the legacy Spark Streaming API because:

  1. Structured Streaming is an extension built on the Spark SQL API, so it takes advantage of Spark SQL’s code and memory optimizations as well as the DataFrame/Dataset APIs. With Structured Streaming, we could build an interactive platform so that data analysts could update streaming transformations themselves, based on requirements, using SQL.
  2. Structured Streaming comes with an exactly-once guarantee: data is processed only once and the output doesn’t contain duplicates. This helped us generate views and dashboards directly from the streaming job.
  3. It comes with watermarking, which helps the stream deal with late-arriving data efficiently (a stripped-down example follows this list).
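As an illustration of watermarking and windowed aggregation in Structured Streaming, here is a stripped-down Scala job; the topic, schema, and S3 paths are placeholders, and for simplicity it assumes JSON-encoded values (our real pipeline decodes Avro via the schema registry instead).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object StreamingEventCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-event-counts")
      .getOrCreate()
    import spark.implicits._

    // Read raw events from a hypothetical Kafka topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "app-events")
      .option("startingOffsets", "latest")
      .load()

    // Illustrative event shape; assumes JSON values for simplicity.
    val schema = new StructType()
      .add("event_name", "string")
      .add("event_time", "timestamp")

    val events = raw
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    // The watermark bounds how late an event may arrive before its window is finalized.
    val counts = events
      .withWatermark("event_time", "10 minutes")
      .groupBy(window($"event_time", "5 minutes"), $"event_name")
      .count()

    // Write finalized windows to S3 as Parquet; the checkpoint enables fault-tolerant restarts.
    val query = counts.writeStream
      .format("parquet")
      .option("path", "s3://data-lake/streaming/event_counts/")
      .option("checkpointLocation", "s3://data-lake/checkpoints/event_counts/")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```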

All jobs are cron-scheduled and designed to auto-scale depending on the data volume, and they are powerful enough to handle complex joins, aggregations, and other computations on historical and streaming data. We run our Spark jobs on AWS EMR.

3. Persistence:

Now that we had figured out what we wanted to do with the data and how to do it, the most important thing was to figure out where to put the raw/processed data that would eventually power the dashboards for analytics. We needed a data lake where we could persist raw data and a data warehouse from which anyone could generate views.

If you have watched the popular TV show Suits, you know the two protagonists, Harvey and Mike, who basically suit up and solve cases. They have different personalities and traits, but in the end, they are the best at what they do. Even though they’re different, their strengths and weaknesses complement each other.

Harvey and Mike from Suits

Amazon Redshift (Data Warehouse) and Amazon S3 (Data Lake) are Harvey and Mike in our case :P

Data lakes are still comparatively new in the data engineering ecosystem: still taking shape, evolving, and finding their purpose, but at the same time stable, secure, and scalable. They have a bright future in the big data ecosystem. Data warehouses, on the other hand, have been around for a very long time; their purpose is to drive all business needs, power the dashboards, and provide faster insights. They are always at the forefront when it comes to decision making, but they are comparatively more complicated and slower to adapt to evolving data.

S3 takes in all the high-throughput raw/transformed data coming from the streaming pipeline, as well as derived data from some of the batch jobs. It is a very crucial piece in our data pipeline, as it is the input source for the view Spark jobs. It contains data from all possible sources, well partitioned and in Parquet format, which makes querying super fast.
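For intuition, here is a tiny sketch (with hypothetical bucket and column names) of how date-partitioned Parquet on S3 lets downstream Spark jobs prune partitions and read only what they need:

```scala
import org.apache.spark.sql.SparkSession

object LakeReadWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lake-read-write").getOrCreate()
    import spark.implicits._

    // Land transformed events in the lake, partitioned by date (illustrative paths/columns).
    val events = spark.read.parquet("s3://data-lake/staging/events/")
    events.write
      .mode("append")
      .partitionBy("dt")
      .parquet("s3://data-lake/processed/events/")

    // Downstream view jobs filter on the partition column, so Spark reads only that
    // day's directory instead of scanning the whole dataset (partition pruning).
    val oneDay = spark.read
      .parquet("s3://data-lake/processed/events/")
      .where($"dt" === "2019-10-01")

    println(s"Rows for 2019-10-01: ${oneDay.count()}")
    spark.stop()
  }
}
```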

Redshift is the analytical and reporting store that we use to generate views on the filtered, transformed data. It is the single source of truth for all data analysts, sales, product, and marketing folks, as most of the dashboards are powered by Redshift. Its SQL-like interface and massively parallel processing architecture make it easy to use and, at the same time, powerful and fast for all business needs. The tables in Redshift are updated at half-hourly, hourly, and daily intervals.
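As a rough sketch of how a batch view job could hand results to Redshift, the snippet below reads partitioned Parquet from S3, builds an aggregate with Spark SQL, and writes it out over JDBC. The table, bucket, and credential names are placeholders, and for large loads the usual route is to unload to S3 and use Redshift’s COPY command (or a Spark-Redshift connector) rather than plain JDBC.

```scala
import org.apache.spark.sql.SparkSession

object DailyBookingsView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-bookings-view").getOrCreate()

    // Read from the S3 data lake (hypothetical path and columns).
    spark.read.parquet("s3://data-lake/processed/bookings/")
      .createOrReplaceTempView("bookings")

    // Build a small derived view with Spark SQL.
    val daily = spark.sql(
      """SELECT dt, city, count(*) AS bookings
        |FROM bookings
        |WHERE dt >= date_sub(current_date(), 7)
        |GROUP BY dt, city""".stripMargin)

    // Push the small aggregate to Redshift over JDBC (placeholder endpoint/credentials).
    daily.write
      .format("jdbc")
      .option("url", "jdbc:redshift://redshift-cluster:5439/analytics")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      .option("dbtable", "views.daily_bookings")
      .option("user", sys.env("REDSHIFT_USER"))
      .option("password", sys.env("REDSHIFT_PASSWORD"))
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```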

The underlying design of the framework is as follows:

In this post, we saw an overview of the data pipeline framework we have built so far. We briefly discussed some of the paradigms and tools used by UrbanClap to meet business use cases, but there is so much more to learn and discuss.

In the coming posts, I will dive into specifics and demonstrate other aspects, such as code snippets of the pipelines, as well as best practices around developing them based on my experience.

If you found this post useful, stay tuned for more!

Aayush Devgan with his bike-love!

About the author:
Aayush Devgan is an engineer working in the Platform team. He works on the foundational layer for data-driven systems and is part of the team that owns the data stack at UrbanClap.

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (UrbanClap Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

You can read up more about us on our publications —
https://medium.com/urbanclap-design
https://medium.com/urbanclap-engineering

If you are interested in finding out about opportunities, visit us at http://careers.urbanclap.com
