Self-serve data pipelining platform

By – Karuna Saini ( Engineer, Data Platform)

UC Blogger
Urban Company – Engineering
6 min read · May 1, 2020

--

Querying your databases directly for analytics stops being efficient beyond a certain scale. You need your transactional data, or for that matter any data source, to be available for large-scale analytics. What is the first thing that comes to your mind?

If you are like most engineers, you would think of writing a script to fetch data from your source at certain intervals and load it into your data lake or data warehouse alongside your other analytical use cases. Congratulations! Your approach now resembles the first version of how transactional use cases were pipelined at Urban Company.

We were writing customised scripts for each new use case to read data in batches from the source, apply the required transformations and persist it into our data warehouse tables.
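To make that starting point concrete, here is a minimal sketch of what one such custom script might have looked like. The connection details, table, columns and loader are purely illustrative, not our actual scripts.

```
import pymysql

# Illustrative connection details; each custom script hard-coded its own.
SOURCE = dict(host="orders-mysql.internal", user="reader", password="...", database="orders")


def load_into_warehouse(table, rows):
    # Stand-in for the warehouse loader that every such script re-implemented.
    print(f"loading {len(rows)} rows into {table}")


def sync_orders(last_synced_at):
    """Pull rows updated since the last run and persist them to a warehouse table."""
    conn = pymysql.connect(**SOURCE)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > %s",
            (last_synced_at,),
        )
        rows = cur.fetchall()
    conn.close()
    # Use-case-specific transformation, duplicated across every such script.
    transformed = [(order_id, round(float(amount), 2), updated_at) for order_id, amount, updated_at in rows]
    load_into_warehouse("analytics.orders", transformed)
```

Multiply this by every new use case, each with its own schedule, transformations and failure modes, and the maintenance burden becomes obvious.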

This general approach served us well for a while, but the ever-increasing number of use cases called for a different solution. We took a step back and looked at the direction we were heading in.

The Need – for a platform


Let’s look at the problem from this angle. You decide to read a book and visit a library to get it. Imagine walking into a library that has no catalogue whatsoever of the books it holds, no metadata about their exact location, and no record of whether someone has already borrowed a book and for how long. What would you do? You obviously cannot walk around trying to find that one book among a plethora of others. Now think about the people responsible for maintaining this library. It might be manageable for a small number of books with some ad hoc effort, but doing it for a substantial number of books is definitely unimaginable.

That is the situation we would end up in if we did not have a standardised mechanism to onboard new books and a centralised place that stores all the metadata related to them right from the beginning.

This makes it quite clear that we need to build a platform to solve this problem and deal with it in the following manner:

  1. Have a central repository serve as the single source of truth for all event definitions and metadata.
  2. Get rid of the ever-increasing number of scripts bloating our code repository and requiring regular maintenance.
  3. Aim for complete automation around pipeline creation, with self-serve onboarding of new events through a dashboard.
  4. Provide flexibility around searching for onboarded events and updating their metadata.
  5. Make this information available to other components in the system to aid in their work.

So what did we do?

We invested in building a data platform to ingest data from heterogeneous sources and standardise it in such a manner that our downstream data processing jobs are indifferent to the source of the data. As part of this platform, we needed a central tool that acts as the interface for onboarding new use cases. It maintains and enforces the event definitions across the platform and is the single source of truth for all data events. We call it Data Platform Admin, or DP Admin.

The Solution – Data Platform Admin

Defining each use case as an event in Admin

Each unique source of data becomes an event for us, with its own set of definitions and metadata. A source could be anything: a transactional table, backend events, or even clickstream messages triggered by the front end.

Components of Event Metadata

Major components of Event Definitions & Metadata:

  1. Schema: Each event needs to have a schema associated with it. This keeps a check on the fields that get passed on to our data lake and also enforces validations on their data types and nullability. We use Confluent Schema Registry for this and support schema evolution with backward compatibility.
  2. Transformations: The platform supports defining basic transformations for all data events as per the needs of analysis use cases. Through the DP Admin UI, one can add transformations such as renaming columns or changing data types and date-time formats, which then get handled automatically by the data processing layer.
  3. Configurations: All the metadata that determines how an event is treated and processed on the Data Platform is part of its configurations. This also includes details about the event source (for example, the database cluster and table information in the case of a transactional event). A sketch of a complete event definition follows this list.
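For illustration, an event definition pulled together from these three components might look something like the following. The event name, fields, transformation operations and configuration keys are hypothetical; the actual representation inside DP Admin may differ.

```
# A hypothetical event definition, illustrating the three metadata components
# described above; names and values are made up for illustration.
order_events = {
    "schema": {  # enforced via Confluent Schema Registry
        "type": "record",
        "name": "OrderEvent",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount", "type": "double"},
            {"name": "created_at", "type": "long"},
        ],
    },
    "transformations": [
        {"op": "rename", "from": "created_at", "to": "order_created_at"},
        {"op": "cast", "column": "amount", "to": "decimal(10,2)"},
        {"op": "date_format", "column": "order_created_at", "format": "yyyy-MM-dd HH:mm:ss"},
    ],
    "configurations": {
        "source_type": "transactional",
        "cluster": "orders-mysql",        # database cluster of the source table
        "table": "orders.order_events",
        "refresh_cycle": "hourly",
    },
}
```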

We use MySQL as the database for DP Admin and store all of the event definitions and metadata in various tables within it. DP Admin also exposes APIs that provide this information to other components of the system and support various functions.
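As a rough sketch of how other components consume this metadata (the host, endpoint and payload shape below are assumptions, not DP Admin's actual API), a downstream processing job might fetch an event's definition like this:

```
import requests

DP_ADMIN_URL = "http://dp-admin.internal"  # hypothetical internal host


def get_event_definition(event_name: str) -> dict:
    """Fetch the schema, transformations and configurations for one event."""
    resp = requests.get(f"{DP_ADMIN_URL}/events/{event_name}")
    resp.raise_for_status()
    return resp.json()


definition = get_event_definition("orders.order_events")
refresh_cycle = definition["configurations"]["refresh_cycle"]
```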

Major flows of DP Admin

Onboarding Wizard

The most important use case served by DP Admin is onboarding new events onto the Data Platform through its onboarding wizard.

The user is asked to provide the inputs required to identify the source to be onboarded.

  1. DP Admin extracts the schema of the source. Our transactional use cases were spread across MySQL and MongoDB. We utilised DDL statements to extract the schema for MySQL tables. MongoDB was a bit more challenging since it is schema-less in itself, so we parsed the mongoose schemas defined in the microservices instead.
  2. The extracted schema is displayed to the user.
  3. The user is asked to add the transformations required on top of it and submit all the details.
  4. After all the validations and compatibility checks, the event gets onboarded onto the data platform.
  5. Pipeline instantiation for the event is done. For a transactional event, this could include whitelisting that particular table in the Kafka connector to start its change data flow (a sketch of this step follows the flow diagram below).
  6. The user starts getting data for analysis automatically, as per the event's refresh cycle.
DP Admin Event Onboarding Flow
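As an example of the pipeline-instantiation step, whitelisting a newly onboarded table in a Debezium-style MySQL connector can be done through the Kafka Connect REST API. The host, connector and table names below are hypothetical, and `table.whitelist` is the connector property Debezium's MySQL connector used at the time; treat this as a sketch of the idea rather than our exact automation.

```
import requests

CONNECT_URL = "http://kafka-connect.internal:8083"  # hypothetical Kafka Connect host
CONNECTOR = "mysql-orders-cdc"                      # hypothetical Debezium connector name

# Read the connector's current configuration, add the newly onboarded table
# to its whitelist, and push the updated configuration back.
config = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/config").json()
tables = set(filter(None, config.get("table.whitelist", "").split(",")))
tables.add("orders.order_events")                   # hypothetical table being onboarded
config["table.whitelist"] = ",".join(sorted(tables))
requests.put(f"{CONNECT_URL}/connectors/{CONNECTOR}/config", json=config).raise_for_status()
```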

View and Edit Event Metadata

Once an event has been onboarded through the onboarding wizard, all of its metadata can be viewed on the dashboard. The user also has access to modify the schema of an event: they can simply look up the existing schema of their event and submit a new schema after making the required changes. The new schema automatically gets updated everywhere in the pipeline, provided it fulfils the backward-compatibility criteria.

DP Admin Edit Event Schema Flow
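This kind of check maps naturally onto Confluent Schema Registry's compatibility API. The sketch below shows the general idea with a hypothetical host, subject and schema; it is not DP Admin's actual code.

```
import json
import requests

SCHEMA_REGISTRY_URL = "http://schema-registry.internal:8081"  # hypothetical host
SUBJECT = "orders.order_events-value"                         # hypothetical subject name
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# Edited schema: adding an optional field with a default keeps it backward compatible.
new_schema = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "channel", "type": ["null", "string"], "default": None},
    ],
}
payload = {"schema": json.dumps(new_schema)}

# Ask Schema Registry whether the edited schema is compatible with the latest version...
check = requests.post(
    f"{SCHEMA_REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers=HEADERS,
    json=payload,
)
check.raise_for_status()

# ...and register it as a new version only if the check passes.
if check.json().get("is_compatible"):
    requests.post(
        f"{SCHEMA_REGISTRY_URL}/subjects/{SUBJECT}/versions",
        headers=HEADERS,
        json=payload,
    ).raise_for_status()
```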

Clean up Flow

We have also integrated a cleanup flow into DP Admin, which is quite useful for cleaning up the entire pipeline, or parts of it, for an event in case something goes wrong.

Any platform serves best when accompanied by sound metrics monitoring and alerting. To support DP Admin, we have built a DP-DB diff framework on top of it. This framework is focussed on identifying gaps between the data source and the data platform at regular intervals and triggering alerts in case of major misses.
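As a rough illustration of what such a diff check can look like (the helpers, counts and threshold below are made up), the framework can periodically compare row counts on the source and on the platform and raise an alert when the drift crosses a threshold:

```
THRESHOLD = 0.01  # tolerate up to 1% drift between refresh cycles


def check_event(event_name, source_count, platform_count, alert):
    """Compare counts from the source and the data platform; alert on large gaps."""
    src = source_count(event_name)
    dst = platform_count(event_name)
    drift = abs(src - dst) / max(src, 1)
    if drift > THRESHOLD:
        alert(f"{event_name}: source={src}, platform={dst}, drift={drift:.2%}")


# Example wiring with stand-in callables; a real implementation would query
# the source database and the warehouse, and push alerts to a pager or Slack.
check_event(
    "orders.order_events",
    source_count=lambda event: 1_000_000,
    platform_count=lambda event: 973_000,
    alert=print,
)
```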

The Impact – How does this help us?

Well, now we can scale up to any number of use cases on our platform, and managing them is not an issue at all. We have all the definitions and metadata stored and updated in a central place, in a standardised format. There is no custom logic or script for each and every use case to take care of, and end users get complete visibility. As a bonus, everything is completely automated: collecting source information and transformations, storing and applying them on event messages, and even creating the pipeline for a new event. So there is no repetitive work for new use cases, and we can work on something awesome instead!

Conclusion:

When in doubt, think platform! :)

About the author –

Karuna likes to play with colours on canvas, and with data while working. She is part of the Data Platform team, which ensures that everyone gets their data in the most standardised and scalable way.

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

You can read up more about us on our publications —
https://medium.com/uc-design
https://medium.com/uc-engineering
https://medium.com/uc-culture

https://www.urbancompany.com/blog/humans-of-urbanclap

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
