Data Engineering at Dwelo

Anusha Karunakaran
Dwelo Research and Development
Sep 10, 2020 · 6 min read

A reflection on my three-month journey.

In large companies, data is typically stored in a distributed manner, so it is critical to build reliable data pipelines that handle both batch and incremental loads.

Big Data: Goals & Challenges

  • How do you develop pipelines for incrementally loading data?
  • How do you debug transformation logic in a highly distributed environment?
  • How do you optimize pipeline runs and ensure reliability and availability?
  • How do you deal with endless changes in the underlying technology stack?
  • How does the system handle the propagation of upstream changes?

Dwelo uses bleeding-edge data warehouse technology to architect distributed systems, build reliable pipelines, and collaborate with data science teams on the right solutions for them.

Here’s a brief description of the technologies we use:

1. Airflow + Docker + Kubernetes (Scalable and painless data pipeline)

Airflow:

Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs). Its Python codebase makes it easily extendable.

Web UI: DAGs at Dwelo

The Airflow UI above lets users visualize a DAG in a graph view, using code as configuration. The author of a data pipeline must define the structure of dependencies among tasks in order to break complex workflows into granular parts that are safer, more modular, and reusable. This specification is written in a file called the DAG definition file, which lays out the anatomy of an Airflow job.
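For illustration, here is a minimal sketch of what such a DAG definition file can look like. The DAG name, schedule, owner, email address, and task callables are hypothetical placeholders, not one of Dwelo's actual pipelines:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    default_args = {
        "owner": "data-eng",                      # hypothetical owner
        "email": ["data-eng@example.com"],        # hypothetical alert address
        "email_on_failure": True,                 # report/alert on failures
        "retries": 2,                             # handle transient task failures
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),                # enforce an SLA per task
    }

    def extract(**context):
        # Placeholder: pull the day's records from a source system.
        print("extracting batch for", context["ds"])

    def load(**context):
        # Placeholder: load the extracted batch into the warehouse.
        print("loading batch for", context["ds"])

    with DAG(
        dag_id="example_daily_load",              # hypothetical pipeline name
        start_date=datetime(2020, 9, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(
            task_id="extract", python_callable=extract, provide_context=True
        )
        load_task = PythonOperator(
            task_id="load", python_callable=load, provide_context=True
        )

        # The dependency structure: extract must succeed before load runs.
        extract_task >> load_task

Note how the retry, alerting, and SLA settings in default_args map directly onto the advantages listed below.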

Advantages of Airflow:

  • Handle task failures
  • Report/Alert on failures
  • Enforce SLAs
  • Easily scale for growing load

Docker:

Developing apps today requires much more than writing code. Multiple languages, frameworks, architectures, and discontinuous interfaces between tools at each lifecycle stage create enormous complexity. Docker simplifies and accelerates the workflow while giving developers the freedom to innovate with their choice of tools, application stacks, and deployment environments for each project.

Advantages:

  • Reproducibility
  • Isolation
  • Portability
  • Shareability

Kubernetes:

Kubernetes is a system for running and coordinating containerized applications across a cluster of machines. It is a platform designed to completely manage the life cycle of containerized applications and services using methods that provide predictability, scalability, and high availability.
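To show how these three pieces fit together, here is a hedged sketch of an Airflow task that runs a Dockerized job as a pod on a Kubernetes cluster. It uses the Airflow 1.10-era import path for the KubernetesPodOperator, and the image, namespace, and task names are hypothetical:

    # Airflow 1.10-era import path; in Airflow 2 this operator moved to the
    # cncf.kubernetes provider package.
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    # Each run launches the containerized job as a pod: the Docker image gives
    # reproducibility and isolation, Kubernetes handles scheduling and scaling.
    transform = KubernetesPodOperator(
        task_id="run_transform",
        name="run-transform",
        namespace="data-pipelines",                   # hypothetical namespace
        image="gcr.io/example-project/etl:latest",    # hypothetical image
        cmds=["python", "transform.py"],
        get_logs=True,                                # stream pod logs into Airflow
        is_delete_operator_pod=True,                  # clean up the pod afterwards
        dag=dag,                                      # DAG from the definition file
    )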

For more information and the latest updates on Airflow, see the following link: https://github.com/jghoman/awesome-apache-airflow

2. GCP (Storage and Efficiency)

Google Cloud Platform (GCP) is a suite of cloud computing services that runs on the same infrastructure Google uses internally for its end-user products, such as Google Search and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics, and machine learning.

Google BigQuery was designed as a “cloud-native” data warehouse, built to address the needs of data-driven organizations in a cloud-first world. We can ingest data into BigQuery by uploading it in batch or by streaming it directly, enabling real-time insights.
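As a sketch of both ingestion paths using the google-cloud-bigquery Python client (the project, dataset, table, and file names are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "example-project.analytics.events"     # hypothetical table

    # Batch load: upload newline-delimited JSON (loading is free of charge).
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    with open("events.json", "rb") as f:
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
    load_job.result()                                 # block until the load finishes

    # Streaming insert: rows are queryable within seconds, enabling
    # near-real-time insights (streaming inserts are billed separately).
    errors = client.insert_rows_json(
        table_id, [{"event": "unit_online", "ts": "2020-09-10T00:00:00Z"}]
    )
    assert not errors, errors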

BigQuery exposes two graphical web UIs for creating and managing BigQuery resources and running SQL queries: the BigQuery web UI in the Cloud Console and the classic BigQuery web UI.

The pricing model is simple. We pay for data storage, streaming inserts, and querying data; loading and exporting data are free of charge. Storage costs are based on the amount of data stored. For queries, you can choose to pay per query or a flat rate for dedicated resources. At Dwelo, we use the billing dashboard in GCP to identify the queries with the highest usage and spend, and then optimize those queries to reduce costs.
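Because on-demand querying is billed by bytes processed, a dry run is a cheap way to see what a query would scan before actually running it. A minimal sketch, assuming a hypothetical table:

    from google.cloud import bigquery

    client = bigquery.Client()

    # dry_run validates the query and estimates bytes scanned without billing.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        """
        SELECT device_id, status
        FROM `example-project.analytics.events`      -- hypothetical table
        WHERE _PARTITIONDATE = '2020-09-01'
        """,
        job_config=job_config,
    )
    print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")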

3. dbt Models (Data Transformations)

ETL pipelines are a fundamental component of any data system. They extract data from many disparate sources, transform (a.k.a. wrangle) it, often to fit the data model defined by your data warehouse, and then load it into the warehouse.

How do we handle data transformations? dbt is a new way to transform data and build pipelines. It applies the principles of software engineering to analytics code, an approach that dramatically increases your leverage as a data analyst.

The following are four common approaches to structuring a dbt project:

  1. Monolithic
  2. Micro-services
  3. Layers
  4. Layers & Verticals

Dwelo uses the monolithic approach.

Advantages:

  • Easier to debug, test, and deploy
  • Comprehensive data lineage and dbt documentation
  • Macros are defined in one place to help standardize data transformations
  • Easier to enforce standards with everything in a single place

The following sections explain configuration, partitioning, and clustering as used in our underlying models:

Configuration:

Materializations are strategies for persisting dbt models in a warehouse. There are four types of materializations built into dbt:

  • table
  • view
  • incremental
  • ephemeral

Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.
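dbt models themselves are SQL files with a Jinja config block (for example, materialized='incremental'). As a rough Python illustration of what an incremental run effectively does on a BigQuery warehouse, the sketch below merges only newly arrived rows into the target table; all table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Roughly what an incremental run does: transform only rows that arrived
    # since the last run instead of rebuilding the whole table. All names here
    # (fct_events, raw.events, event_id, loaded_at) are hypothetical.
    merge_sql = """
    MERGE `example-project.analytics.fct_events` AS target
    USING (
      SELECT *
      FROM `example-project.raw.events`
      WHERE loaded_at > (SELECT MAX(loaded_at)
                         FROM `example-project.analytics.fct_events`)
    ) AS source
    ON target.event_id = source.event_id
    WHEN MATCHED THEN
      UPDATE SET target.status = source.status, target.loaded_at = source.loaded_at
    WHEN NOT MATCHED THEN
      INSERT ROW
    """
    client.query(merge_sql).result()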

Partition:

BigQuery supports a partition by clause to easily partition a table by a column or expression. This option can help decrease latency and cost when querying large tables. Originally only date fields could be used for partitioning, but in December 2019 Google released a new capability: integer range partitioning, which stores all values that fall in the same range in the same partition. Integer partitioning effectively lets you partition on any field type (float, string, date, and so on): you transform the partition field into an integer value when storing and querying your data.

Clustering:

BigQuery tables can be clustered to colocate related data, which narrows the volume of data scanned by the database. Column order is extremely important in clustering. Clustering was originally supported only on partitioned tables, but as of June 2020 it can be applied to any table.
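Both options can be set when a table is created. Here is a sketch with the Python client showing date partitioning, the integer range alternative, and clustering; the schema and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("property_id", "INTEGER"),
        bigquery.SchemaField("device_id", "STRING"),
        bigquery.SchemaField("reading_date", "DATE"),
        bigquery.SchemaField("value", "FLOAT"),
    ]
    table = bigquery.Table("example-project.analytics.readings", schema=schema)

    # Date partitioning on a DATE column (one partition per day) ...
    table.time_partitioning = bigquery.TimePartitioning(field="reading_date")

    # ... or, alternatively, integer range partitioning (a table takes one
    # partitioning scheme, not both):
    # table.range_partitioning = bigquery.RangePartitioning(
    #     field="property_id",
    #     range_=bigquery.PartitionRange(start=0, end=100000, interval=1000),
    # )

    # Clustering: column order matters; filters on the first column prune most.
    table.clustering_fields = ["property_id", "device_id"]

    client.create_table(table)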

Some of the best ETL practices we follow:

  1. Partition data tables: Partitioning your tables by date and querying only the relevant partition (for example, WHERE _PARTITIONDATE = "2017-01-01" scans only the January 1, 2017 partition) reduces the cost of processing queries and improves performance.
  2. Load data incrementally: Since we use on-demand pricing, we are charged for the number of bytes processed, whether the data lives in BigQuery or in external data sources. Incremental loading reduces the amount of data transferred: a full load may take hours or days depending on data volume, and even a full load that finishes in 2–3 minutes still processes everything, which is quite expensive.
  3. Modularity: Breaking our model logic into base and staging models that then feed dim models keeps things manageable. Additionally, dbt's ref function encourages you to write modular transformations, so you can reuse models and reduce repeated code.
  4. Adding data checks early and often: When processing data, it is useful to write data into a staging table, check the data quality, and only then exchange the staging table with the final production table, as sketched below.
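For practice 4, here is a hedged sketch of the staging-then-swap pattern using the Python client; the quality check and table names are illustrative only:

    from google.cloud import bigquery

    client = bigquery.Client()
    staging = "example-project.analytics.events_staging"   # hypothetical tables
    prod = "example-project.analytics.events"

    # Step 1: load the new batch into the staging table (load job omitted here).

    # Step 2: check data quality on staging before touching production.
    rows = client.query(
        f"SELECT COUNT(*) AS bad FROM `{staging}` WHERE event_id IS NULL"
    ).result()
    if next(iter(rows)).bad > 0:
        raise ValueError("staging data failed quality checks; aborting swap")

    # Step 3: only then overwrite the production table from staging.
    job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
    client.copy_table(staging, prod, job_config=job_config).result()

Because the copy in step 3 only happens after the checks pass, bad data never reaches the production table.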

Data engineering is a specialized skill that often does not stand out as a necessity. Its importance is realized only when enterprises are stuck on ROI (return on investment), limited by scale, or lacking the analytics velocity of the behemoths worldwide. Data engineers are the heroes working in the shadows to ensure that enterprises have the right data, at the right time, delivered to the right people.

It’s a long journey, and we are all still learning.

References:

  1. https://docs.getdbt.com/docs/introduction
  2. https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4
  3. https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71
  4. https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5
  5. https://medium.com/google-cloud/partition-on-any-field-with-bigquery-840f8aa1aaab
  6. https://www.datalife8020.com/post/data-engineers-the-underappreciated-siblings
