Principled Data Engineering, Part II: Data Governance

Prateek Sanyal
Published in SSENSE-TECH · May 2, 2019

The second of a two-part series exploring the fundamental themes of architecting and governing a data lake.

Click here for Part I

The fundamental mandate of a data engineer is to deliver clean, accurate, and meticulously supervised data. Herein lies the difference between a data engineer and a software engineer — for the latter, the product is the software, whereas for us, the product is the data, and any software we build is auxiliary to it. In other words, it is the responsibility of a data engineer to treat data, rather than software, as a ‘first-class entity’. In this context, it goes without saying that data governance is a central theme in data engineering.

Our industry has not yet reached a consensus about what data governance entails. Factors such as legal regulations and the nature and size of the data play a major role in determining how to govern the data. Having said that, there is an underlying theme that is common across the board — the concept of meticulous supervision. This can be broken down further into three major points, as outlined below:

Metadata Management — Data Lineage and Cataloging

Arguably the very foundation of data governance is metadata management. In this case, metadata refers to data about your data — this includes any meaningful information about your data which can help you understand it better. At the very least, for any data set you store, your metadata should be able to answer two questions — what data can be found in the dataset, and where this data came from. The ‘what’ question is usually answered by a data catalog (sometimes known as a meta-store, metadata catalog, etc.). A data catalog is usually some representation of the schema for a particular dataset. This might include information about labels (column names, keys, etc.), data types, partitioning rules, memory footprint, and other relevant information about certain units of data (such as columns).

At SSENSE, we use the AWS Glue Catalog to catalog our data. Note that while frameworks such as Glue usually offer schema inference capabilities, it is always prudent to review inferred schemas and to use a file format that supports explicit schemas (such as Parquet), preferably one that is compatible with your framework’s schema inference tooling.

A sample Glue Catalog from the AWS docs
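To make this concrete, here is a minimal sketch of what writing a dataset with an explicit schema and registering it in the Glue Catalog could look like using pyarrow and boto3. The database name, table name, columns, and S3 location are illustrative assumptions, not our actual setup.

```python
from datetime import datetime, timezone
from decimal import Decimal

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# Declare the schema explicitly rather than relying on inference alone.
schema = pa.schema([
    ("order_id", pa.string()),
    ("order_total", pa.decimal128(12, 2)),
    ("created_at", pa.timestamp("us", tz="UTC")),
])

table = pa.table(
    {
        "order_id": ["A-1"],
        "order_total": [Decimal("129.99")],
        "created_at": [datetime(2019, 5, 2, tzinfo=timezone.utc)],
    },
    schema=schema,
)
pq.write_table(table, "orders.parquet")  # in practice, written to an S3 location

# Register the table in the Glue Catalog so the schema is explicit and queryable.
glue = boto3.client("glue")
glue.create_table(
    DatabaseName="analytics_db",  # assumed database name
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "etl_date", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "order_total", "Type": "decimal(12,2)"},
                {"Name": "created_at", "Type": "timestamp"},
            ],
            "Location": "s3://example-business-bucket/orders/",  # assumed path
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```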

Answering the second question, about where the data comes from, requires metadata about data lineage. Conversations about tracking data lineage in data ecosystems have not yet matured to the point of identifying or prescribing best practices. While projects like Apache Atlas are trying to standardize best practices around tracking data lineage, the data engineering industry at large has only agreed upon the problem, not the solution. In essence, tracking data lineage involves maintaining a recorded history of the origin, transformation, and movement of each unit of data. The granularity of a ‘unit’ may vary from an entire data set to a single point of data. Usually, more granularity leads to better governance. When done right, recording data lineage can make your data a lot easier to understand, reproduce, and debug.

At SSENSE, all our data pipelines are managed by Apache Airflow DAGs (Directed Acyclic Graphs). This has allowed us to develop a custom lineage solution wherein, every time we load data into our lake (into any S3 bucket), we tag the object storing the data with metadata containing lineage information. The logic for adding this metadata lives in the load tasks of our Airflow pipelines (a minimal sketch follows the list below). Some of the lineage information is readily available from the DAG itself, and some of it is generated by the pipeline’s logic. At the very least, we track the following data:

  1. Data Source (data_src): The immediate source of the data. For example, if the data is being moved from our interim bucket to the business bucket, the interim bucket’s filepath becomes the data source. If a file inherits data from multiple files in the previous bucket, all paths are listed or the parent path’s prefix is used. This ensures that for all stored data, we can backtrack on a file-by-file basis and trace its full path from origin to final destination.
  2. ETL Date (etl_date): While this usually corresponds with the Last-Modified system-defined metadata on S3 objects, we prefer maintaining a custom field with this date set by our DAG. This ensures that we don’t run into unforeseen inconsistencies such as timezone issues.
  3. ETL Source (etl_src): We manage all our ETL logic in an independent pip package. This allows us to easily specify the source of our transformation logic as the module’s dot notation path, for example: etl_jobs.sample_module.sample_job.
  4. ETL Version (etl_version): Our aforementioned pip package is also semantically versioned, which means that we can specify not only which Python module contains the logic for the data’s transformation, but also the version of the pip package used for the transformation. This, combined with the ETL Source, ensures that all our data transformations can be audited and reverse engineered.
An illustration of how we track data lineage for aggregated data sets.
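As a rough sketch (not our production code) of what the tagging step in such a load task might look like, the snippet below copies an object from the interim bucket to the business bucket while attaching the lineage fields above as user-defined S3 object metadata. The bucket names, key, and version string are assumptions.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def load_with_lineage(interim_bucket: str, business_bucket: str, key: str) -> None:
    """Copy an object from the interim bucket to the business bucket,
    attaching lineage metadata to the destination object."""
    lineage = {
        "data_src": f"s3://{interim_bucket}/{key}",
        "etl_date": datetime.now(timezone.utc).isoformat(),
        "etl_src": "etl_jobs.sample_module.sample_job",  # dot notation path of the ETL logic
        "etl_version": "1.4.2",                          # assumed pip package version
    }
    s3.copy_object(
        Bucket=business_bucket,
        Key=key,
        CopySource={"Bucket": interim_bucket, "Key": key},
        Metadata=lineage,
        MetadataDirective="REPLACE",  # required so the new user-defined metadata is applied
    )
```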

While there is undeniable business value in proper metadata management, one of our primary goals in all of this is to protect the personal data of our customers and business partners while respecting every customer’s ‘right to be forgotten’. Not only is this critical in order to meet regulatory compliance standards, such as the EU General Data Protection Regulation (GDPR), it is also an ethical obligation towards anyone who entrusts our organization with their data. Moreover, we believe that regardless of the size and nature of an organization, it is advisable to strictly regulate personal data and to avoid storing Personally Identifiable Information (PII) in any situation where it is not absolutely business-critical. In our experience, anonymizing PII almost never disrupts analytics and machine learning systems.
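As an illustrative example only, one common way to avoid storing raw PII is to replace direct identifiers with salted hashes before the data lands in the lake. The column names and salt handling below are assumptions, not a description of our actual implementation.

```python
import hashlib
import os

import pandas as pd

# In practice the salt would come from a secrets manager, not an environment default.
SALT = os.environ.get("PII_HASH_SALT", "example-salt")


def pseudonymize(df: pd.DataFrame, pii_columns) -> pd.DataFrame:
    """Replace PII columns with salted SHA-256 digests so records remain
    joinable for analytics without exposing the underlying identifiers."""
    out = df.copy()
    for col in pii_columns:
        out[col] = out[col].astype(str).map(
            lambda value: hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
        )
    return out


# Example usage with hypothetical column names:
# anonymized = pseudonymize(orders_df, ["email", "shipping_address"])
```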

Data Contracts and Service Level Agreements

While metadata management handles governing the data itself, there is also the matter of governing your data store’s relationship with the outside world — the sources and consumers of your data. Once again, due to the blistering pace of change in the data ecosystem, there are no unanimously accepted standards for how this should be done. We can, however, agree on the importance of governing these relationships carefully. Every data store relies on certain expectations of both its data sources and its consumers — it expects the interfaces of its sources to be accessible under certain constraints, and it expects its consumers to access its data within certain constraints.

Good governance involves ensuring that all such implicit, informal, or ambiguous constraints are turned into explicit, formal, and specific agreements (or contracts). This is where concepts such as Data Contracts and SLAs (Service Level Agreements) come in handy. An SLA explicitly and precisely defines, for the user of a software service, the constraints within which the service is expected to perform. Usually, the underlying infrastructure and software of your data store (such as an S3 bucket) will be able to provide clear SLAs about uptime, consistency, reliability, etc. Sometimes, your data sources might be able to do the same (such as well-built REST APIs). In the ideal case, where all your underlying software and all your data sources provide such SLAs, your data store may in turn be able to provide reliable SLAs to its consumers. This is particularly useful if many of your consumers are other automated services that cannot account for ambiguous constraints.

A Data Contract acts like an SLA but for every individual unit of data. For example, a contract might mandate that a certain column must contain the sum of certain other columns. Depending on the maturity of your data ecosystem, these may be implemented as modules of code that validate data during runtime (as shown here), or they could be implemented manually as an agreement between the data store’s team and the stakeholders to define the constraints for every unit of data.
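As a simplified sketch of such a runtime check, here is what enforcing a hypothetical contract like ‘order_total must equal the sum of its component columns’ could look like; the column names and tolerance are assumptions.

```python
import pandas as pd


def validate_order_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce the contract: order_total == item_total + shipping + tax.
    Raises instead of silently loading data that violates the contract."""
    expected = df["item_total"] + df["shipping"] + df["tax"]
    violations = df.loc[(df["order_total"] - expected).abs() > 0.01]  # small float tolerance
    if not violations.empty:
        raise ValueError(
            f"Data contract violated for {len(violations)} rows; "
            f"first offending index: {violations.index[0]}"
        )
    return df
```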

At SSENSE, for every data pipeline we support, we first manually draft a contract with our stakeholders and then implement their constraints programmatically as runtime validators. In terms of SLAs, thanks to Airflow’s reliable dependency management, we are able to provide very specific consistency guarantees such as 100% temporal consistency. For other guarantees such as availability and durability, the SLAs provided by the AWS services we use (such as S3 and Athena) remain applicable.
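For illustration, the kind of dependency chaining that makes such ordering guarantees possible looks roughly like the following; the DAG id, task names, and callables are placeholders rather than our actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    """Pull raw data from the source into the interim bucket."""


def transform_and_validate():
    """Apply the ETL logic and run the contract validations."""


def load():
    """Write the validated output to the business bucket with lineage metadata."""


with DAG(
    dag_id="orders_pipeline",  # hypothetical DAG name
    start_date=datetime(2019, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(
        task_id="transform_and_validate", python_callable=transform_and_validate
    )
    load_task = PythonOperator(task_id="load", python_callable=load)

    # A downstream task only runs once every upstream task has succeeded,
    # which is what allows the pipeline to make ordering guarantees.
    extract_task >> transform_task >> load_task
```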

Data Testing

The subject of testing code has been discussed thoroughly for many decades. Software engineers have come a long way from sprinkling unit tests onto code to adopting comprehensive paradigms like Behaviour Driven Development and aiming for 100% test coverage. Data engineering, however, is slightly different. Traditional testing strategies often do not cater to the specific concerns of data pipelines. Unlike most client-facing services, a data pipeline is not subject to large numbers of unpredictable users. A lot of the code a data engineer writes will only be used by other data engineers in very predictable and specific ways. This negates the need for much of the behavioural testing that is so important for APIs. Also, as mentioned earlier, the product for a data engineer is the data and not the software itself. This implies that testing in data engineering should prioritize data quality over code quality. That can be especially challenging since your data sources are usually out of your control, and catering for every incoming data anomaly in unit tests is practically impossible.

Keeping all this in mind, we have developed a two-fold approach to testing. Firstly, we use snapshot testing to test the code for our data transformations. For every pipeline, we maintain static input data and its resultant output (the transformed data, validated by the relevant stakeholders) as the fixture data for our tests. While this is supplemented by some traditional unit tests for the rest of our code, we only aim for 100% coverage of our transformation logic. These tests run in our CI/CD pipelines, guaranteeing that any code we push to production does not corrupt our data. Secondly, to cater for corrupt incoming data, we have runtime validations built into our data pipelines which run immediately after incoming data is transformed. These validations reflect our data contracts with our stakeholders and ensure that bad data never enters the data lake. Unlike client-facing services, we are often comfortable with our pipeline tasks raising errors and failing in production. Airflow allows us to handle this easily by fixing the issue(s) and re-running the failed tasks. What we cannot afford is the unnoticed corruption of our data.
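A minimal sketch of one such snapshot test is shown below; the fixture file names and the transform function are assumptions, loosely following the etl_jobs.sample_module.sample_job example from earlier.

```python
# tests/test_sample_job_snapshot.py
from pathlib import Path

import pandas as pd

# Hypothetical import, following the etl_jobs.sample_module.sample_job dot path above.
from etl_jobs.sample_module.sample_job import transform

FIXTURES = Path(__file__).parent / "fixtures"


def test_transform_matches_stakeholder_approved_snapshot():
    """The transformation of static input data must reproduce the snapshot
    that stakeholders previously validated."""
    input_df = pd.read_csv(FIXTURES / "sample_input.csv")
    expected_df = pd.read_csv(FIXTURES / "sample_expected_output.csv")

    result_df = transform(input_df)

    pd.testing.assert_frame_equal(
        result_df.reset_index(drop=True),
        expected_df.reset_index(drop=True),
        check_dtype=False,  # CSV snapshots lose some dtype precision
    )
```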

Conclusion

In Part I of the Principled Data Engineering series, we introduced our data lake architecture as a reference point for discussing the challenges of storing and pipelining enterprise-scale data. In Part II, we focused solely on the subject of data governance. It may seem unusual that we invested half the series into discussing data governance, a subject that is often treated as an afterthought in both industry and academia. However, in our opinion, this neglect of good data governance has led to serious problems in the tech industry. As data engineers, we must assume an ethical responsibility to ensure the integrity, validity, confidentiality, and accountability of our data. We owe this to our clients, associates, and the company itself. Having read both parts of this series, you should now have a mental framework for architecting a data system in a practical and principled manner. We encourage you to let business needs drive your technical decisions, and to ground all such decisions in the philosophy that your data is your product, and by extension, your responsibility.

Editorial reviews by Deanna Chow, Hussein Danish & Liela Touré.

Want to work with us? Click here to see all open positions at SSENSE!
