Modernizing the Data Engineering Stack in 99 Group

Andy Yan
Published in 99.co
Feb 24, 2022 · 9 min read
The Pinnacle at Duxton, Singapore

A Happy Problem

99 Group has experienced tremendous growth over the past few years. With new companies joining the group, we have inevitably inherited engineering systems of varying architecture, which in turn increases the complexity of our engineering development process. Eyeing a strong growth trajectory ahead, the team will have to keep up with more frequent and more complex data demands. There is a strong motivation to re-design our data engineering systems to anticipate and meet such demands.

Introspection

The eventual goal is to have an integrated data engineering system that serves all property portals. As a starting point, we looked at our existing systems for inspiration, as well as to identify their strengths and pain points.

99.co SG

Data Pipeline Architecture of 99.co SG

99.co utilises the Data Lake and Data Warehouse concept. A cronjob is set up on a VM instance to run a Python script at midnight. The script crawls the microservices to fetch the full set of data from each service. The data is then processed, transformed and cached in the Data Lake. Subsequently, the datasets are loaded into the Data Warehouse one by one.

One of the biggest issues with this process is that all the different pipelines are tightly coupled in one bulky sequential process. When any step in the pipeline fails, data fetched in earlier steps might not be loaded into the final destination. Rectifying a failed pipeline often involves commenting parts of the code in and out, and re-running them manually.

Furthermore, since all pipelines follow a truncate pattern (i.e. fetching all data from the source and replacing the contents of the final table), the time taken to fully load the data grows quickly with our user base. At one point, it took about 13 hours to fully load the data into the Data Warehouse. If the pipeline fails, re-running it takes up to another 13 hours, which deprives our business users of fresh, updated data to work with.

One good feature we can adopt from this pipeline is that it algorithmically generates the table schema on the fly by inspecting the data. With this information, the process prepares the destination table by updating its schema before it receives the data. This saves engineers the time and effort of maintaining the ever-changing schemas of the source tables.
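For illustration, here is a minimal sketch of how such on-the-fly schema generation might work against BigQuery. The type mapping and function names are assumptions for the example, not the actual implementation.

from google.cloud import bigquery


def infer_schema(rows: list[dict]) -> list[bigquery.SchemaField]:
    """Infer a BigQuery schema by inspecting a sample of source rows."""
    type_map = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "STRING"}
    fields = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in fields:
                fields[column] = type_map.get(type(value), "STRING")
    return [bigquery.SchemaField(name, field_type) for name, field_type in fields.items()]


def sync_destination_schema(client: bigquery.Client, table_id: str, rows: list[dict]) -> None:
    """Add columns that appear in the incoming data but are missing from the destination table."""
    table = client.get_table(table_id)
    existing = {field.name for field in table.schema}
    new_fields = [f for f in infer_schema(rows) if f.name not in existing]
    if new_fields:
        table.schema = list(table.schema) + new_fields
        client.update_table(table, ["schema"])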

SRX

Data Pipeline Architecture of SRX

In SRX, there is no concept of a Data Lake or Data Warehouse. To answer business questions, a customised script connects directly to the production database and queries it. While this gives business users the most up-to-date data, complex queries draw resources from the database, which could spell problems for service stability. Furthermore, connecting to the production database directly could also corrupt data through erroneous query logic. Not being connected to a centralised Data Warehouse also impedes analysis across different portals, which is an important data use case for the group as we move forward.

Rumah123.com and 99.co ID (previously UrbanIndo)

Data Pipeline Architecture of 99.co ID and Rumah123.com

For the two portals in Indonesia, the common theme among the pipelines is that the data is generally saved into the Data Lake before being manually loaded into the Data Warehouse. However, data extraction requires the use of different tools in a non-standardised manner. Some are built-in tools, while others are third-party. Tables of the same data format, or from the same data source, might be imported into the Data Lake in very different ways. This imposes a high mental overhead on the team maintaining the pipelines.

In addition, the schemas of the tables are hardcoded and defined within the pipelines. Every time a new column is added to an existing table, data engineers have to introduce a corresponding change to the pipelines. When a new table is created, a new schema has to be defined in the pipelines once again. Such maintenance work gives the data engineers less time to improve the system overall.

That said, one lesson we can adopt from the Indonesian portals is the use of a pipeline orchestrator to handle our data importing process. These orchestrators often come with pre-built operators that integrate well with common endpoints. The framework provided by the orchestrators steers engineers to build standardised data pipelines that can be easily maintained.

Key Considerations

The team envisions the following key characteristics for the new data engineering system:

  • Resilience: The system should be resilient against upstream failure. Data pipelines should have retry mechanisms in place. In the case of unexpected system failure, the supporting infrastructure should have self-healing mechanisms.
  • Ease of maintenance: Pipelines of a similar data format should follow a standardised pattern as much as possible. Information that is specific to the context of a data source should not be hardcoded, and workloads should be automated if possible. Grunt maintenance work by data engineers should be minimised.
  • Scalability: Data pipelines should be designed to handle a growing volume of data and higher ingestion frequency. The underlying infrastructure should also be scalable to meet sudden spikes in the demand for resources.
  • Flexibility: Individual processes that constitute the entire system should be modular and reusable in different contexts. Data pipelines should be de-coupled from each other, and data assets should be easily portable to other systems.

These characteristics will determine the tools and system architecture to be used, as well as the strategy to be implemented.

The Solution

Infrastructure

The wider engineering team has been adopting Kubernetes to deploy and manage our services. Its scalability and resilience against system failures make it the go-to deployment tool.

A complete CI/CD process will also be introduced to improve deployment reliability. Apart from automating deployments, it also opens up the possibility of adding test cases for our pipelines.

Since we are already located within the Google Cloud Platform ecosystem, we use Google Cloud Storage as our Data Lake and BigQuery as our Data Warehouse. Using services from the same ecosystem minimises compatibility issues between systems.

Transformation Layer

Previous iterations of the data engineering system have implemented a Transformation Layer to some degree.

In 99.co SG, the transformation query sits with the visualisation tool, which is also used to create aggregated tables (by aggregating the data and appending a new row of aggregated results). This creates tight coupling between the transformation layer and the visualisation tool, which is not ideal.

In 99.co ID, raw tables are transformed through BigQuery’s ‘Scheduled Queries’ feature. This works well when there are few transformation jobs, but when there are too many of them, it becomes challenging to manage and track the transformations.

To address these problems, dbt (data build tool) is being adopted as our main tool to manage the transformation layer. Besides running the transformation jobs, it also provides a web interface to visualise the relationships between our base, intermediate and final tables. Through dbt, we can also document what each column in a table means and how it relates to other tables. dbt picks up this documentation and displays it in the same web interface, hence simplifying the data cataloguing process, enriching our metadata, and eventually lowering the barrier for other teams to explore and understand our datasets.

Lineage Graph on dbt docs site
Table and columns descriptions on dbt docs site

Pipeline Orchestrator

Apache Airflow is our chosen pipeline orchestrator for the new data engineering system. As the advantages of using Airflow have been widely documented, we will focus on one key feature that greatly improves the team’s efficiency:

The concept of building DAGs with operators opens up the possibility of using a DAG factory class to generate them, with minimal input from data engineers. We can identify a common recurring pattern and encode it into a factory class. Subsequently, we can maintain a config file that tells the DAG factory which source to fetch data from, how to iterate through the datasets, and how to load them. Here's an example:

This is a typical DAG, where we are attempting to import the Folder objects from a microservice. The steps for fetching data follow this pattern (a simplified sketch in code follows the list):

  1. Request new data and upload it to GCS.
  2. Check if there is an existing table in the data warehouse.
  3. If there is, prepare the destination table by adding all necessary columns. If there isn't, do nothing.
  4. Load the new data into the destination table.
  5. Check whether the run has succeeded or failed. If it has failed, send an alert to Slack.
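As a rough sketch, such a DAG could be wired up as follows. The task callables and their import paths are hypothetical placeholders standing in for our actual operators.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables wrapping the steps described above.
from pipelines.folders import fetch_to_gcs, prepare_destination_table, load_to_bigquery
from pipelines.alerts import notify_slack_on_failure

with DAG(
    dag_id="SG_99_folders_import",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # 5. Alert Slack if any task in the run fails.
    on_failure_callback=notify_slack_on_failure,
) as dag:
    # 1. Request new data from the microservice and upload it to GCS.
    fetch = PythonOperator(task_id="fetch_to_gcs", python_callable=fetch_to_gcs)

    # 2 & 3. Check whether the destination table exists; if it does,
    # add any missing columns so it can receive the new data.
    prepare = PythonOperator(
        task_id="prepare_destination_table",
        python_callable=prepare_destination_table,
    )

    # 4. Load the new data from GCS into the destination table.
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    fetch >> prepare >> load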

Many other DAGs could follow the same pattern: we could use a DAG factory that takes a config file as an argument, iterates over it, and generates a new DAG for each config object. Creating new DAGs is as simple as defining the following configuration in JSON:

"folders": {
"dag_name": "SG_99_folders_import",
"url": "http://prod.shortlists.99.int/v3/shortlists/folders/data-dump",
"params": {
"page": 1,
"page_size": 100,
"sort_field": "created_at",
"sort_order": "asc"
},
"response_key": "folders",
"sort_key": "created_at",
"sort_param_key": "created_after",
"incr_timestamp": false,
"table_name": "folders",
"dag_template": "timestamped_endpoint_to_truncate"
}

With this functionality, only the bare minimum of input is required from data engineers. Our team has moved from manually creating DAGs one by one to a configuration-as-code paradigm.
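The factory itself can be reduced to a loop over the config file. The following is a minimal sketch, assuming a hypothetical build_import_dag helper that wires up the fetch, prepare, load and alert tasks shown earlier, and an assumed config file location.

import json

from airflow.models import DAG

# Hypothetical factory that builds the standard import DAG for one config entry.
from pipelines.dag_factory import build_import_dag

CONFIG_PATH = "/opt/airflow/dags/configs/import_dags.json"  # assumed location

with open(CONFIG_PATH) as f:
    configs: dict[str, dict] = json.load(f)

# One DAG per config entry; registering the DAG object in globals() is how
# Airflow discovers dynamically generated DAGs within a DAG file.
for name, config in configs.items():
    dag: DAG = build_import_dag(name, config)
    globals()[config["dag_name"]] = dag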

The same idea can be applied to transformation-related DAGs. Transforming a table often involves running the transformation on the base tables first, then the intermediate tables, and eventually combining them to form the final table. A typical Airflow DAG representation of the transformation looks like this:

Example of a simple dbt DAG

If one were to construct such a DAG by hand, they would need to identify the dependencies manually and map child tasks to their parents one by one. It would be a nightmare to maintain these over time, especially as the DAGs grow in number and complexity.

Fortunately, dbt documents the relationships between the tables in a file named ‘manifest.json’. We can use this manifest file as our configuration file, programmatically map child tasks to their parents, and generate the DAG. A detailed example of how to do so can be found here.
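A stripped-down sketch of the idea is shown below. It assumes the dbt models are run via a BashOperator and that the manifest sits at an assumed path; both are illustrative choices rather than a description of our exact setup.

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

MANIFEST_PATH = "/opt/airflow/dbt/target/manifest.json"  # assumed location

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {}

    # One task per dbt model, each running only that model.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] == "model":
            tasks[node_id] = BashOperator(
                task_id=node["name"],
                bash_command=f"dbt run --select {node['name']}",
            )

    # Wire each model to its upstream model dependencies recorded in the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        for parent_id in node["depends_on"]["nodes"]:
            if parent_id in tasks:
                tasks[parent_id] >> tasks[node_id]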

Automating the creation of transformation DAGs makes it easy to maintain DAGs as complicated as this:

Example of a complex dbt DAG

A New Architecture

After about one year of experimenting, integrating and migrating the data pipelines, our data engineering system has evolved into the following architecture:

Data Pipeline Architecture of 99 Group

Instead of having four different engineers maintain data pipelines for four different portals with four different tech stacks, we now have a team of four maintaining one system that caters to all. This significantly reduces the bus factor, improves efficiency and encourages collaboration within the team. Data engineers have more time and space to develop the overall data engineering system, the stability of the data pipelines has improved significantly, and the team as a whole now moves faster.

P.S. This article primarily gives a high-level overview of the transformation process, without expanding on the implementation details. If you have further questions or feedback, simply leave a comment below! Also, we’re actively hiring so if you’re interested, join us here.
