Site Reliability Engineering Best Practices for Data Pipelines
Managing production data pipelines the Google way
By now, most of you have probably heard of site reliability engineering (SRE). If not, here is a simple definition:
SRE is what you get when you treat operations as if it’s a software problem.
Google has been managing its production systems using these principles for a long time. They published a couple of books about this topic:
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE
In this article, I summarize the best practices discussed in Chapter 13 of The Site Reliability Workbook: Practical Ways to Implement SRE.
There are various types of data pipelines we run these days in production systems.
Data transformation/event processing pipelines
- The extract, transform, load (ETL) model is a common paradigm in data processing: data is extracted from a source, transformed, and possibly denormalized, and then “reloaded” into a specialized format.
- The transformation phase can serve a variety of use cases, such as making changes to the data format to add or remove a field, aggregating computing functions across data sources, and applying an index to the data so it has better characteristics for serving jobs that consume the data.
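To make the ETL model concrete, here is a minimal sketch of the three phases in plain Python. The record layout, the field names, and the in-memory "target" are assumptions made for illustration; a real pipeline would read from and write to actual storage systems.

```python
from collections import defaultdict

# Hypothetical source records; in practice these would come from a database or log.
SOURCE_ROWS = [
    {"user": "a", "amount": "10.5", "debug": "x"},
    {"user": "b", "amount": "4.0", "debug": "y"},
    {"user": "a", "amount": "2.5", "debug": "z"},
]

def extract(rows):
    """Extract: yield raw records from the source."""
    yield from rows

def transform(records):
    """Transform: parse amounts and aggregate them per user.

    Fields that are not needed downstream (here, 'debug') are not carried over.
    """
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += float(record["amount"])
    return dict(totals)

def load(totals):
    """Load: build a lookup keyed by user, standing in for an indexed target store."""
    return {user: {"user": user, "total": total} for user, total in sorted(totals.items())}

target = load(transform(extract(SOURCE_ROWS)))
```

Here the transformation removes a field, aggregates across records, and produces output keyed for fast lookup, which mirrors the three use cases listed above.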
Machine learning pipelines
Machine learning (ML) applications are used for a variety of purposes, like helping predict cancer, classifying spam, and personalizing product recommendations for users.
Typically, an ML system has the following stages:
- Data features and their labels are extracted from a larger data set.
- An ML algorithm trains a model on the extracted features.
- The model is evaluated on a test set of data.
- The model is made available (served) to other services.
- Other systems make decisions using the responses served by the model.
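The five stages can be sketched end to end with a deliberately tiny example. This is only a toy under stated assumptions: the data set, the single feature (the number of exclamation marks), and the threshold "model" are all made up for illustration and are not how a production spam classifier would be built.

```python
# Hypothetical labeled data set: (message, is_spam).
DATASET = [
    ("win money now!!!", True),
    ("free prize!!! click", True),
    ("meeting at noon", False),
    ("lunch tomorrow?", False),
    ("claim your reward!!", True),
    ("see you at the gym", False),
]

def extract_features(message):
    """Stage 1: turn raw data into a feature (here, the count of '!')."""
    return message.count("!")

def train(examples):
    """Stage 2: learn a threshold halfway between the two class means."""
    spam = [extract_features(m) for m, label in examples if label]
    ham = [extract_features(m) for m, label in examples if not label]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def predict(threshold, message):
    """Stage 4: serve a prediction for a single message."""
    return extract_features(message) > threshold

def evaluate(threshold, examples):
    """Stage 3: measure accuracy on held-out test data."""
    correct = sum(predict(threshold, m) == label for m, label in examples)
    return correct / len(examples)

train_set, test_set = DATASET[:4], DATASET[4:]
model = train(train_set)
accuracy = evaluate(model, test_set)

def route_message(message):
    """Stage 5: a downstream system makes a decision from the served prediction."""
    return "spam-folder" if predict(model, message) else "inbox"
```

Each function boundary is a place where a pipeline can fail or serve stale results, which is why the practices below apply to ML pipelines as much as to ETL jobs.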
The book discusses the following best practices for maintaining these pipelines.
Pipeline Best Practices
In general, the following best practices should be followed when working on data pipelines.
Define and measure service-level objectives
SRE practice is based on some important concepts like service-level objectives (SLO), service-level indicators (SLI), and error budgets.
For data pipelines, SLOs can be defined along the following dimensions.
Data freshness:
- X% of data is processed within Y [seconds, minutes, days].
- The oldest data is no older than Y [seconds, minutes, days].
- The pipeline job completes successfully within Y [seconds, minutes, days].
Data correctness:
- Correctness SLOs are difficult to define.
- For data ingestion, you can compare source and target.
- Backward-looking analysis also helps. For example, track the number of hours or days that bad data or errors are served from the pipeline output data.
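One simple way to compare source and target after ingestion is to compare row counts and an order-independent checksum. The sketch below assumes rows are available as dictionaries with string keys; the function names are hypothetical, and real tables would usually be fingerprinted inside the database rather than in application code.

```python
import hashlib

def table_fingerprint(rows):
    """Return (row_count, digest) for the rows, independent of row order."""
    digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined

def ingestion_matches(source_rows, target_rows):
    """True when source and target contain the same rows."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

A mismatch does not say which row is wrong, only that something differs, so it works best as an alerting signal rather than a debugging tool.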
Data isolation/load balancing:
- Segments of data: high/medium/low.
- It can be implemented using different queues, hardware, network tiers, etc.
- End-to-end measurement is important.
- Even if the pipeline has several stages, it is recommended to measure SLOs for end-to-end delivery rather than per stage. A per-stage SLO says little about the value the customer actually receives.
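As a rough illustration of an end-to-end freshness SLO of the form "X% of data processed in Y", the sketch below computes the fraction of records delivered within a deadline, measured from creation at the source to delivery by the final stage. The function names, the record shape (a pair of timestamps), and the sample values are assumptions made for this example.

```python
from datetime import datetime, timedelta

def freshness_sli(records, max_delay):
    """Fraction of records whose end-to-end delay is within max_delay.

    Each record is a (created_at, delivered_at) pair of datetimes.
    """
    if not records:
        return 1.0  # nothing was late
    on_time = sum(
        1 for created_at, delivered_at in records
        if delivered_at - created_at <= max_delay
    )
    return on_time / len(records)

def meets_slo(records, max_delay, target_fraction):
    """True when the measured SLI reaches the SLO target."""
    return freshness_sli(records, max_delay) >= target_fraction

# Hypothetical sample: three records arrive within 5 minutes, one takes 30.
start = datetime(2024, 1, 1, 12, 0)
records = [(start, start + timedelta(minutes=m)) for m in (1, 2, 3, 30)]
sli = freshness_sli(records, timedelta(minutes=5))
```

With this sample, 75% of records meet the 5-minute deadline, so a 90% target would be missed and would consume error budget.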
Plan for dependency failures
- Once you define your SLO, check that you are not overdependent on products that fail to meet their own SLOs.
- At Google, to encourage pipeline development with dependency failure in mind, SREs stage planned outages.
- Even the best products will fail and experience outages.
- Regularly practice disaster recovery scenarios to ensure your systems are resilient to common and uncommon failures.
- Assess your dependencies and automate your system responses as much as possible.
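One common way to automate the response to a failing dependency is to retry with exponential backoff and then fall back to a degraded answer. This is my own minimal sketch, not a pattern prescribed by the book; `call_with_fallback` and its parameters are hypothetical names, and catching every `Exception` is a simplification.

```python
import time

def call_with_fallback(primary, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call primary(); retry with exponential backoff, then use fallback().

    `sleep` is injectable so the backoff can be observed or skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

For example, `call_with_fallback(fetch_live_rates, load_cached_rates)` (both hypothetical functions) would serve cached data when the live dependency stays down. Planned outages are a good way to verify that the fallback path actually works.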
Create and maintain pipeline documentation
- Well-written and maintained system documentation can help engineers visualize the data pipeline and its dependencies, understand complex system tasks, and potentially shorten downtime in an outage.
- The book recommends three types of documentation.
1. System diagrams:
2. Process documentation:
- For example, how to release the pipeline to production.
- Once all tasks are documented, look for automation opportunities.
3. Playbook entries:
- Each alert condition in your system should have a corresponding playbook entry that describes the steps to recovery.
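The rule that every alert needs a playbook entry is easy to check automatically, for instance in a CI job. The sketch below assumes alert names and playbook entries are available as simple Python collections; the names and playbook text are invented for illustration.

```python
def alerts_missing_playbooks(alerts, playbooks):
    """Return the sorted alert names that have no playbook entry."""
    return sorted(set(alerts) - set(playbooks))

# Hypothetical alert names and playbook entries.
ALERTS = ["pipeline_stalled", "freshness_slo_burn", "disk_full"]
PLAYBOOKS = {
    "pipeline_stalled": "Check the scheduler, then restart the stuck stage.",
    "freshness_slo_burn": "Inspect the input backlog and scale out workers.",
}

missing = alerts_missing_playbooks(ALERTS, PLAYBOOKS)
```

Failing the build when `missing` is non-empty keeps the documentation from drifting behind the alerting configuration.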
Map your development lifecycle
Plan the development lifecycle of a data pipeline in stages:
- Testing with a 1% dry run.
- Performing a partial deployment.
- Deploying to production.
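A 1% dry run and a partial deployment both need a stable way to decide which records take the new code path. One possible approach, sketched below under my own assumptions, hashes a record key into a bucket so the same record is always in or out of the sample. The function names and the 10,000-bucket resolution are hypothetical choices.

```python
import hashlib

def rollout_bucket(record_key):
    """Map a key to a stable bucket in the range [0, 10000)."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000

def in_rollout(record_key, percent):
    """True when the key falls inside the first `percent` percent of buckets."""
    return rollout_bucket(record_key) < percent * 100
```

Because the buckets are nested, every record in the 1% dry run is also part of a later 10% partial deployment, so results stay comparable as the rollout widens.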
Reduce hot-spotting and workload patterns
- Hot-spotting happens when a resource becomes overloaded from excessive access, resulting in an operation failure.
- To avoid hot-spotting, it is recommended to restructure your data or access patterns to spread the load evenly.
- Another option is to reduce lock granularity to avoid data lock contention.
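One way to restructure access patterns is key salting: a hot key is split into several sub-keys so its writes land on different shards. The sketch below is an illustration under assumed names (`NUM_SHARDS`, `shard_for`) and an assumed hash-based sharding scheme; it is not tied to any particular storage system.

```python
import hashlib

NUM_SHARDS = 8  # assumed number of storage shards

def shard_for(key, salt_buckets=1, sequence=0):
    """Pick a shard for a write.

    With salt_buckets=1 every write for a key hits the same shard.
    With salt_buckets > 1, writes for one key rotate over several
    salted sub-keys, which hash to different shards.
    """
    salt = sequence % salt_buckets
    digest = hashlib.sha256(f"{key}#{salt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The trade-off is on the read side: a reader has to fetch and merge all salted sub-keys for a key, so salting is worth it mainly for keys that are written far more often than they are read.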
Implement autoscaling and resource planning
- Spikes in workload are common and can lead to service outages if you’re unprepared for them.
- Autoscaling can help you handle these spikes.
- Predicting the future growth of your system and allocating capacity accordingly ensures that your service won’t run out of resources.
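The core of a backlog-based autoscaling decision is simple arithmetic: how many workers are needed to drain the current backlog within a target time. The sketch below assumes a known per-worker throughput and hypothetical parameter names; real autoscalers also smooth the signal to avoid scaling up and down too often.

```python
import math

def desired_workers(backlog_items, items_per_worker_per_min, drain_minutes,
                    min_workers=1, max_workers=100):
    """Workers needed to drain the backlog within drain_minutes, clamped to limits."""
    needed = math.ceil(backlog_items / (items_per_worker_per_min * drain_minutes))
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 10,000 items with workers that handle 100 items per minute and a 10-minute drain target needs 10 workers. The `max_workers` cap is where resource planning comes in: if forecasts show the cap being hit regularly, capacity has to grow.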
Adhere to access control and security policies
Adhere to the following privacy, security, and data integrity principles:
- Avoid storing personally identifiable information (PII) in temporary storage. If you’re required to store PII temporarily, make sure the data is properly encrypted.
- Restrict access to the data. Grant each pipeline stage only the minimal access it needs to read the output data from the previous stage.
- Put time-to-live (TTL) limits on logs and PII.
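A TTL limit can be enforced with a periodic purge job. The sketch below assumes records are kept as (created_at, payload) pairs in memory, which is a simplification; many storage systems offer built-in expiry that is preferable when available.

```python
from datetime import datetime, timedelta

def purge_expired(records, ttl, now):
    """Return only the records that are younger than ttl.

    Each record is a (created_at, payload) pair; `now` is passed in
    explicitly so the behavior is deterministic and testable.
    """
    cutoff = now - ttl
    return [(created_at, payload) for created_at, payload in records
            if created_at > cutoff]
```

Running this on a schedule, with a TTL agreed with your privacy and security teams, keeps logs and temporary PII from accumulating indefinitely.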
Idempotent and two-phase mutations
Pipelines can process large amounts of data. When a pipeline fails, some data must be reprocessed. You can use the idempotent mutations design pattern to prevent storing duplicate or incorrect data: an idempotent mutation can be applied several times and still produce the same result.
Two-phase mutations complement this. In simple terms, when ingesting data, first write to staging tables, and write to the actual tables only after the checks on the staged data pass.
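Here is a minimal sketch of both ideas using Python's built-in `sqlite3` module. The table names (`staging`, `events`), the validation rule, and the `ingest` function are assumptions for illustration. The batch is first written to a staging table and validated, and only then copied into the final table with a keyed upsert, so replaying the same batch leaves the result unchanged.

```python
import sqlite3

def ingest(conn, batch):
    """Stage a batch, validate it, then upsert it into the final table."""
    with conn:  # one transaction: commit on success, roll back on error
        # Phase 1: write to the staging table.
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging (id, amount) VALUES (?, ?)", batch)
        bad = conn.execute(
            "SELECT COUNT(*) FROM staging WHERE amount IS NULL OR amount < 0"
        ).fetchone()[0]
        if bad:
            raise ValueError(f"{bad} staged rows failed validation")
        # Phase 2: copy validated rows. The primary key plus OR REPLACE makes
        # the write idempotent: replaying a batch does not create duplicates.
        conn.execute(
            "INSERT OR REPLACE INTO events (id, amount) SELECT id, amount FROM staging"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, amount REAL)")
```

Calling `ingest(conn, batch)` twice with the same batch leaves `events` with one row per id, and a batch that fails validation never reaches `events` at all.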
Checkpointing
- Checkpointing is a technique that enables long-running processes like pipelines to periodically save partial state to storage so that they can resume the process later.
- While checkpointing is often used for failure cases, it’s also useful when a job needs to be preempted or rescheduled.
- Checkpointing has the added advantage of enabling a pipeline to skip potentially expensive reads or computations because it already knows the work is done.
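A minimal checkpointing sketch, assuming the pipeline processes an ordered list of items and that a small JSON file is an acceptable place to keep state. The file layout and the function names are hypothetical. The checkpoint is written to a temporary file and then renamed, so a crash during the write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint exists)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0

def save_checkpoint(path, next_index):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)

def run(items, path, process, checkpoint_every=2):
    """Process items, resuming from the last saved checkpoint."""
    start = load_checkpoint(path)
    for index in range(start, len(items)):
        process(items[index])
        if (index + 1) % checkpoint_every == 0 or index + 1 == len(items):
            save_checkpoint(path, index + 1)
```

Items processed after the last checkpoint but before a crash are processed again on restart, which is one more reason to keep the mutations idempotent.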