How to Design and Maintain a High-Performing Data Pipeline

Data Pipeline Best Practices

Agoda Engineering
Agoda Engineering & Design
11 min read · Sep 27, 2023


by Lalit Ratanapusdeekul

Data pipelines are essential for managing data flow from various sources to a target destination. The BI-Infra-Ops team at Agoda has presented a comprehensive guide on best practices for designing, monitoring, and ensuring the quality of data pipelines.

What is a Data Pipeline?

A data pipeline is a process in which data is extracted from one or more sources, transformed into a suitable format, and loaded into a target, most often a table.

Designing a Healthy Data Pipeline

When creating a data pipeline, one of the most important considerations is how to keep it healthy. A healthy data pipeline is critical for ensuring we can use its data effectively for strategic decision-making and operational efficiency. If our data pipelines are unhealthy, data will not be delivered on time, or worse, data issues will be passed on to downstream users.

Designing a data pipeline properly also makes it reliable and minimizes data loss and system downtime. A pipeline should have error handling, data recovery, and monitoring mechanisms to detect and address issues promptly. When our data pipelines are well designed and optimized, data is processed faster and more cost-effectively, significantly reducing the time for data to be ready.

Moreover, as our data requirements change quickly, our data pipelines should be flexible and adaptable to those changing requirements. New data sources, formats, and schemas should be accommodated with little effort. A well-designed data pipeline can also scale to handle increasing data volumes and sources without significant performance degradation.

This scalability is crucial as organizations grow and generate more data. Our data grows as we grow, and our pipelines should be able to keep up with continuously increasing data volumes without losing performance.

As a result, several factors must be considered to build a healthy data pipeline. These include:

  1. Data: where is the data stored? What is the data behavior?
  2. Resource Used: how many resources should be allocated to our data pipeline?
  3. Partitioning: how should we partition our table (in Hadoop)?
  4. Job Scheduling: how frequently should our data pipeline run?
  5. Data Dependency: does the pipeline depend on the data of other tables?

We will go through each factor individually and give some generic examples of how these factors shape the way we design our data pipelines at Agoda.

Resources must be allocated appropriately to ensure the pipeline runs efficiently. Partitioning can improve write performance and decrease run time. Job scheduling must take into account data usage, the SLA, the job duration, and the freshness of the data source. Data dependency must be considered when one job needs another job to finish before it can start.

1. Data

The data source and data behavior determine how the data will be processed. The data source refers to where the source data is stored (e.g., in MSSQL or Hadoop), as well as where the output data will be stored. Data behavior covers whether the data changes over time and what the granularity (level of detail) of the data is.

In one of our data pipelines at Agoda, we divide the process into three sub-processes based on the characteristics of the data we are working with. This pipeline loads data into a fact table in our data warehouse called the Fact Booking table, which stores data about bookings at the booking level. Our design is based on the Star Schema method. The three sub-processes are:

  1. Original column process
    a. Processes columns that do not change (static)
    b. Implementation: snapshot the value from when the booking was created (ignore changes)
    c. Source: MSSQL + Hadoop
  2. Current column process
    a. Processes columns that can be updated over time (dynamic)
    b. Implementation: all values can change if they are updated in the source
    c. Source: MSSQL + Hadoop
  3. Flatten process
    a. Updates dimensional columns
    b. Implementation: get values by joining with dimension tables
    c. Source: Hadoop

Each sub-process is suitable for different types of data and needs to be implemented differently. This shows why understanding the data is crucial in designing a pipeline.
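
To make this more concrete, here is a minimal PySpark sketch of the flatten process, the third sub-process above. The table and column names (staging.bookings, dw.dim_property, property_id, and so on) are hypothetical and purely for illustration; the actual Fact Booking pipeline is more involved.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fact_booking_flatten").getOrCreate()

bookings = spark.table("staging.bookings")        # booking-level records (hypothetical)
dim_property = spark.table("dw.dim_property")     # dimension table in Hadoop (hypothetical)

# Flatten process: resolve dimensional columns by joining the fact data with dimension tables.
flattened = (
    bookings
    .join(dim_property, on="property_id", how="left")
    .select("booking_id", "booking_datetime", "property_id", "property_name", "country_name")
)
```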

2. Resource Used

The resources used refer to the amount of compute and memory consumed to process the data in a pipeline. How many resources to allocate is another factor we should be able to specify.

  • If too few resources are allocated
    - The pipeline takes too long to finish
    - There is a risk of pipeline failure due to insufficient resources
  • If too many resources are allocated
    - Other pipelines are negatively affected
    - Resources are wasted

At Agoda, most of our data pipelines run on Apache Spark. Allocating resources to Spark applications appropriately is essential for efficient and optimal performance of the data pipelines. Spark allows you to set resource configurations for individual jobs, or for stages within jobs, to fine-tune resource allocation. You can use YARN or Kubernetes to help manage and allocate resources to Spark applications.

Back to our Fact Booking table: apart from the data characteristics, we also separate the processes in our pipeline based on how much data each one consumes per run. Resources are then allocated to each process based on data volume, and processes that consume higher data volumes run less often to avoid consuming too many resources too frequently.

In Spark, you can adjust several settings, such as the driver memory and the number of executors, to best suit your pipeline. For example, a process that consumes a high volume of data tends to require more driver memory and more executors.
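
As a rough illustration, here is how such resource settings might be passed to a Spark application in code. The job name and numbers are hypothetical; the right values depend on each process's data volume, and driver memory is typically set at submit time rather than in code.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fact_booking_current_columns")   # hypothetical job name
    # Driver memory usually has to be set at submit time (e.g. spark-submit --driver-memory 8g),
    # because the driver JVM is already running by the time this code executes.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "20")
    .getOrCreate()
)
```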

At Agoda, we also have dashboards, built by our engineering team, that track several Spark metrics used by our pipelines. These dashboards can be used to monitor our Spark applications.

3. Partitioning

In Hadoop, partitioning refers to the process of dividing and organizing data into smaller, more manageable subsets called partitions. We mostly partition our tables based on particular columns in our tables.

Partitioning a table properly can improve write performance and decrease the run time of your data pipeline. These are some factors that can be used to decide how to partition a table:

  1. Size and Number of Partitions
    a. Partitions should be neither too large nor too small, and there should be neither too many nor too few of them.
  2. Distribution of Data
    a. We want the data to be distributed evenly across all partitions.

Back to our Fact Booking table: we use datamonth, the month of the booking date, as the partitioning column.

  • Why the month of the booking date?
    - The number of bookings is distributed fairly evenly across booking months.
  • Why month level, not day level?
    - If we used the day of the booking date, each partition would be too small and there would be too many partitions.

In reality, it’s often more practical to start with a reasonable partition size based on an initial analysis and then fine-tune it through performance testing. Measure the impact of different partition sizes on query performance and resource utilization to find the optimal balance for each use case.
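
As an illustration, here is a minimal PySpark sketch of writing a table partitioned by datamonth. The staging table name and write mode are assumptions; the actual Fact Booking load is more involved.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fact_booking_write").getOrCreate()

fact_booking = (
    spark.table("staging.fact_booking_ready")                       # hypothetical staging table
    .withColumn("datamonth", F.date_format("booking_datetime", "yyyy-MM"))
)

(
    fact_booking.write
    .mode("overwrite")
    .partitionBy("datamonth")                                       # one partition per booking month
    .saveAsTable("dw.fact_booking")
)
```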

4. Job Scheduling

In normal circumstances, we would not manually submit and run our pipelines. Job scheduling is how we automate a data pipeline: it lets us determine in advance when and how frequently the pipeline should run. The main consideration when setting up job scheduling is how frequently the job should run.

The frequency of a scheduled pipeline can depend on:

  • Data Usage
    - If the data is used once a day (e.g., for a daily report), we can schedule the job to run once a day.
    - If the data is used heavily, we can schedule the job to run more frequently.
  • SLA of the data
    - A tighter SLA means the pipeline should run more frequently.
  • Pipeline duration
    - How much time a single run of the job needs.
  • Freshness of the data source
    - If the source data is updated only once a day, we do not have to schedule the job very frequently.
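
As a sketch of what such scheduling can look like, here is a minimal example that assumes Apache Airflow, which this post does not name; Agoda's actual scheduler, job names, and schedules may differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="fact_booking_daily",          # hypothetical pipeline name
    schedule_interval="30 10 * * *",      # run daily at 10:30 AM
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    load_fact_booking = BashOperator(
        task_id="load_fact_booking",
        bash_command="spark-submit --deploy-mode cluster /jobs/load_fact_booking.py",
    )
```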

5. Data Dependency

Sometimes, your data pipeline might rely on a data source that is loaded by another upstream pipeline. This means there is a data dependency in your pipeline. Data dependencies should also be taken into account when designing a data pipeline.

Here is an example of why data dependency is crucial when designing your data pipeline.

Example Use Cases:

  • Pipeline B loads data into Table B (runs daily at 10:30 AM)
  • Pipeline B uses Table A as a source table
  • Pipeline A loads data into Table A (runs daily at 10:00 AM)
  • Therefore, Pipeline B depends on the data in Table A
If no data dependency rule is set on Pipeline B, it will start at 10:30 AM regardless of whether Pipeline A has finished. If Pipeline A is still running or has failed at that point, the data loaded into Table B will be incomplete.
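
One simple way to enforce such a dependency, sketched below under the assumption that Pipeline A writes a _SUCCESS marker when it finishes loading Table A, is to have Pipeline B poll for that marker before it starts. The marker path is hypothetical.

```python
import subprocess
import time


def upstream_ready(marker_path: str) -> bool:
    # `hdfs dfs -test -e` exits with code 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", marker_path], check=False).returncode == 0


def wait_for_upstream(marker_path: str, timeout_s: int = 3600, poll_s: int = 60) -> None:
    """Block Pipeline B until Pipeline A's completion marker appears."""
    waited = 0
    while not upstream_ready(marker_path):
        if waited >= timeout_s:
            raise TimeoutError(f"Upstream data not ready: {marker_path}")
        time.sleep(poll_s)
        waited += poll_s


wait_for_upstream("/warehouse/table_a/_SUCCESS")  # then start loading Table B
```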

Monitoring Your Data Pipeline

After you have a data pipeline in place, the next thing you need is a way to monitor it. Monitoring your data pipeline is crucial to ensure that it runs smoothly and that any issues are quickly identified and resolved. In the BI team at Agoda, we mainly monitor the duration and the status of submitted data pipelines.

When our pipelines are submitted to Spark, log tables store information about each submitted pipeline: when it started and ended, whether it succeeded or failed, and more. The Hadoop Data team is responsible for the platform used to run our pipelines and manages all these logs. As the BI team, we created a pipeline (an ETL) that consolidates the data from these logs and loads it into a destination table, where it is ready to be used in dashboards. We then created several dashboards for monitoring our submitted pipelines.
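
As a rough sketch of that consolidation step, the example below reads a hypothetical log table and derives the duration and status per pipeline run; the real log schema is owned by the Hadoop Data team and is not shown in this post.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline_log_etl").getOrCreate()

logs = spark.table("platform.spark_job_logs")     # hypothetical log table

pipeline_runs = logs.select(
    "pipeline_name",
    "status",                                     # e.g. SUCCEEDED / FAILED
    "start_time",
    "end_time",
    (F.unix_timestamp("end_time") - F.unix_timestamp("start_time")).alias("duration_s"),
)

pipeline_runs.write.mode("overwrite").saveAsTable("bi.pipeline_monitoring")
```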

These are some examples of our dashboards used for monitoring.

  • Dashboard for monitoring the duration of pipelines
  • Dashboard for monitoring the status of pipelines (succeeded or failed)

Ensuring Data Quality

The product of our data pipelines is data, so ensuring data quality is essential. Data quality can be measured in terms of freshness, integrity, consistency, completeness, and accuracy. We’ll go through each aspect individually and explain how we test it at Agoda.

Freshness

The freshness of the data refers to the timeliness of the data. This is to ensure that the data is available to be used in a timely manner.

At Agoda, we built an in-house tool to track the freshness of the data in our tables. For example, we configured the tool on the Fact Booking table to track the booking_datetime column. As the Fact Booking table’s SLA is 6 hours, the tool notifies us if the latest booking_datetime is older than 6 hours. Data delays can have various causes, such as pipelines failing or taking longer than usual.
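
The idea behind such a freshness check can be sketched as follows; our in-house tool is not shown here, and the table name and alerting behavior are assumptions.

```python
from datetime import datetime, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fact_booking_freshness").getOrCreate()

latest = (
    spark.table("dw.fact_booking")                      # hypothetical table name
    .agg(F.max("booking_datetime").alias("latest_booking"))
    .collect()[0]["latest_booking"]
)

sla = timedelta(hours=6)
if latest is None or datetime.now() - latest > sla:
    # In practice this would trigger an alert rather than just fail.
    raise RuntimeError(f"Fact Booking data is stale: latest booking_datetime = {latest}")
```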

Integrity and Completeness

In most cases, data integrity refers to the uniqueness of the data, while data completeness means that required values are present (no NULL or empty values).

For example, table A has a column “id” that uniquely identifies each record in the table, so no two records should have the same id.

Another example, this time of completeness, is when we do not want certain values in the table to be NULL or empty; in this case, we do not want value1 to be NULL.

At Agoda, we validate data before writing it into the target table. If the data fails the integrity or completeness checks, the pipeline will not write it into the target table.
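
A minimal sketch of such validation, using the id and value1 columns from the examples above, might look like this; the actual checks in our pipelines are more extensive.

```python
from pyspark.sql import DataFrame, functions as F


def validate(df: DataFrame) -> None:
    total = df.count()

    # Integrity: "id" must uniquely identify each record.
    distinct_ids = df.select("id").distinct().count()
    if distinct_ids != total:
        raise ValueError(f"Integrity check failed: {total - distinct_ids} duplicate id(s)")

    # Completeness: "value1" must not be NULL or empty.
    missing = df.filter(F.col("value1").isNull() | (F.trim(F.col("value1")) == "")).count()
    if missing > 0:
        raise ValueError(f"Completeness check failed: {missing} row(s) with missing value1")
```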

Accuracy

Data accuracy ensures that the data values are reasonable by checking the current data against the previous trend.

At Agoda, our BI team uses a tool called ThirdEye to help us detect anomalies in our data. ThirdEye can investigate deviations in metrics (anomalies) with interactive root-cause analysis.

ThirdEye detects anomalies in our data by comparing the current value of a metric against a predicted value based on the previous trend. If the difference exceeds a threshold, ThirdEye notifies us about the anomaly.

ThirdEye is also very easy to use, as it provides four basic, ready-to-use algorithms for calculating the predicted value.

More about ThirdEye: https://github.com/project-thirdeye/thirdeye
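
The underlying idea can be sketched in a few lines (this is not ThirdEye’s actual API): compare the current value of a metric with a prediction based on the recent trend and flag it if the deviation exceeds a threshold.

```python
def is_anomaly(current: float, history: list, threshold: float = 0.2) -> bool:
    """Flag the current value if it deviates from the trailing mean by more than threshold."""
    predicted = sum(history) / len(history)          # naive prediction: trailing mean
    deviation = abs(current - predicted) / max(abs(predicted), 1e-9)
    return deviation > threshold


# Example: today's booking count vs. the last five days (deviation ~0.27 > 0.2).
print(is_anomaly(current=1100.0, history=[1500.0, 1480.0, 1520.0, 1490.0, 1510.0]))  # True
```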

Consistency

Data consistency ensures that the data is consistent between two places (the source and the destination).

Our BI team uses a tool called Quilliup to check the consistency of the data after the data pipeline successfully loads the data into the table.

The main concept of Quilliup is that we have a source dataset that contains data from source tables and a destination dataset that contains data from the destination table. The data in the source and destination datasets should match.

We typically trigger the Quilliup test once the pipeline completes, to check the consistency of the loaded data.

More about Quilliup: https://quilliup.zendesk.com/hc/en-us
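
The idea behind such a consistency test can be sketched generically (this is not Quilliup itself): compare the source and destination datasets for the same partition, for example by row count. The table names and partition filter below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("consistency_check").getOrCreate()

partition = "datamonth = '2023-09'"                    # hypothetical partition filter
source_count = spark.table("staging.bookings").where(partition).count()
dest_count = spark.table("dw.fact_booking").where(partition).count()

if source_count != dest_count:
    raise ValueError(f"Consistency check failed: source={source_count}, destination={dest_count}")
```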

Data Monitoring

Data monitoring is the process of collecting, analyzing, and responding to the results of data quality testing. At Agoda, we have an in-house centralized alert system called Hedwig. Hedwig receives testing results from the data quality testing tools mentioned in the previous sections and sends alerts through email and Slack to notify us when a test fails. This allows for quick and efficient responses to any unexpected changes in the data.

We also automate the creation of JIRA tickets when Hedwig sends alerts. This allows us to track the start and end time of each data issue, its root cause, and how it was solved.
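
As a generic illustration of that automation (Hedwig is in-house, so the endpoint, credentials, and project key below are assumptions), an alert can be turned into a JIRA ticket through the Jira REST API:

```python
import requests


def create_issue(summary: str, description: str) -> str:
    """Create a JIRA ticket for an alert and return its issue key."""
    resp = requests.post(
        "https://jira.example.com/rest/api/2/issue",        # hypothetical Jira instance
        auth=("svc-hedwig", "********"),                     # hypothetical service account
        json={
            "fields": {
                "project": {"key": "BIOPS"},                 # hypothetical project key
                "issuetype": {"name": "Bug"},
                "summary": summary,
                "description": description,
            }
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["key"]
```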

  • Example of an email and Slack alert from Hedwig
  • A dashboard that summarizes all the JIRA tickets created by Hedwig when it sends alerts

Conclusion

In conclusion, designing, monitoring, and ensuring the quality of data pipelines is crucial for managing the flow of data. By following the best practices presented by the BI-Infra-Ops team, data pipelines can be designed and maintained to ensure their efficiency and accuracy.

Disclaimer: The approaches listed in this blog post are tailored to Agoda’s specific requirements, and all the tools we mentioned in the blog are used internally only.
