The Meaning of Production in the Data World
Early on, software engineers are taught: develop locally, test in staging, deploy to production. What does this mean for analytics?
Development on many analytics teams
This might sound familiar: “developing locally” means running queries in the one shared data warehouse and hoping you don’t stall the jobs powering BI reporting; “testing in staging” means performing ad-hoc analysis of the output and hoping you didn’t miss anything; and “deploying to production” means, well, nervously pushing code changes to the data powering dashboards.
Why is this a problem?
Many organizations strive to be “data-driven”, i.e. to make key business decisions from data. Rarely is the data taken as is; usually it needs some transformation before it can feed the dashboards internal stakeholders reference.
If changes pushed to the queries powering dashboards aren’t thoroughly tested and contain errors, confidence in reporting erodes, leaving the organization no choice but to become less data-driven.
The solution: multi-environment development where data quality tests are run in a staging environment before changes are made to data processes with downstream impacts.
Multiple environments for analytics teams
An environment in the analytics ecosystem can mean many things. Consider the complete data product as the data warehouse, aggregation pipeline, and workflow orchestrator that runs the aggregations. Then, we can consider one environment as one instance of the entire data product (more on this architecture later).
What differentiates each environment?
The “production” environment is defined by the visualizations and processes powered by the data warehouse; anyone outside the analytics organization will primarily care about the insights in your BI tool. The data powering dashboards can be considered a product of the analytics team, one with a state that is generally agreed upon as correct. This mutually agreed-upon correct state is the production state.
The “local” environment is where analysts can run queries for ad-hoc analysis or large scale development. The analyst will query a data warehouse that has real data, but the warehouse does not have any other downstream dependencies. The real data can be replicated by either a real-time or batch replication service from the production warehouse.
The “staging” environment is a replica of production minus the BI Tool. This makes it the perfect end-to-end test environment for a new feature or data model being developed, a “staging” area so to speak before moving changes to production.
What does it mean to test data?
Testing. There are QA engineers, integration engineers, and software engineers who just love unit tests. Have you heard of a data quality engineer? Neither have I.
Testing data is completely different from testing software applications. A query can complete successfully and transform data, yet the result can still be, well, wrong given the expected business logic. When that happens, the biggest question is: is the output incorrect because of an upstream issue affecting the source data, or is the aggregation itself implemented incorrectly?
A good data testing layer tests the output of every aggregation and encodes business logic into the tests, both to surface issues and to narrow down where they originate. As outlined below, each table can be tested for valid and missing values; if anything is unusual, an alert fires and no downstream queries run:
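Concretely, a table-level check with that alert-and-halt behavior might look like the following minimal sketch. The table, column names, and checks here are hypothetical, standing in for whatever business logic your aggregations encode:

```python
def validate_orders_table(rows):
    """Run basic data quality checks on the output of an aggregation.

    Returns a list of failure messages; an empty list means the table
    passed and downstream queries are safe to run.
    """
    failures = []

    # Valid values: revenue should never be negative.
    bad_revenue = [r for r in rows if r["revenue"] < 0]
    if bad_revenue:
        failures.append(f"{len(bad_revenue)} rows with negative revenue")

    # Missing values: every row needs an order_id.
    missing_ids = [r for r in rows if r.get("order_id") is None]
    if missing_ids:
        failures.append(f"{len(missing_ids)} rows missing order_id")

    return failures


def run_pipeline_step(rows, downstream_jobs):
    """Gate downstream jobs on the table passing validation."""
    failures = validate_orders_table(rows)
    if failures:
        # Unusual data: alert and skip everything downstream.
        print(f"ALERT: validation failed: {failures}")
        return False
    for job in downstream_jobs:
        job(rows)
    return True
```

In a real pipeline the alert would go to a paging or chat tool and the orchestrator would mark downstream tasks as skipped, but the shape of the gate is the same.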
There are plenty of data quality testing tools; one such tool is Great Expectations. With Great Expectations, you can write what you expect of the data as code, and validate actual data against expectations within your data pipeline. Great Expectations is so useful that it deserves its own blog post, so I won’t dive into the technical details of that here.
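Leaving the Great Expectations API itself for that other post, the core pattern it implements — expectations declared as code, then validated against actual data — can be sketched in plain Python. The column names and rules below are hypothetical:

```python
# A hand-rolled sketch of the "expectations as code" pattern that
# tools like Great Expectations implement far more completely.
expectations = [
    # (description, predicate applied to every row)
    ("user_id is never null", lambda row: row["user_id"] is not None),
    ("signup_date is YYYY-MM-DD", lambda row: len(row["signup_date"]) == 10),
    ("country is a known code", lambda row: row["country"] in {"US", "CA", "GB"}),
]


def validate(rows, expectations):
    """Return the description of every expectation some row violates."""
    return [
        desc for desc, predicate in expectations
        if not all(predicate(row) for row in rows)
    ]
```

Because the expectations live in code, they version alongside the SQL that produces the data and run identically in staging and production.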
Architecting the platform
They say a picture is worth a thousand words.
Let’s unpack the proposal here. The elements:
- SQL jobs: This is the code and infrastructure: the queries that aggregate data into the visuals the rest of the organization asks for. The queries are likely scheduled to run every few hours in a workflow orchestrator, which is your infrastructure.
- Data warehouse: This is where your data resides. This is the single source of truth for all analytical investigations, or at least it should be.
- BI tool: Mode, Looker, Tableau, Data Studio, Domo, etc. This is the culmination of the analytics team’s efforts and likely has the highest visibility outside the team of anything mentioned here.
- Data testing layer: If software engineers build unit tests, why shouldn’t data engineers build data tests? The data testing layer helps catch human error or bugs in your code that would otherwise make the data inaccurate.
Staging is a replica of production, except the pipeline ends at the data testing layer as there are no downstream dependencies.
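One way to make the three environments concrete is a per-environment configuration the orchestrator reads at deploy time. Every name below is hypothetical; the point is which pieces of the data product are switched on where:

```python
# Hypothetical per-environment settings for the workflow orchestrator.
# Each environment is a full instance of the data product (warehouse,
# SQL jobs, testing layer); only production feeds the BI tool.
ENVIRONMENTS = {
    "local": {
        "warehouse": "replica-warehouse.internal",  # replicated prod data
        "run_scheduled_jobs": False,  # analysts query ad hoc instead
        "run_data_tests": True,
        "feeds_bi_tool": False,
    },
    "staging": {
        "warehouse": "staging-warehouse.internal",
        "run_scheduled_jobs": True,   # full pipeline, end to end
        "run_data_tests": True,
        "feeds_bi_tool": False,       # pipeline ends at the testing layer
    },
    "production": {
        "warehouse": "prod-warehouse.internal",
        "run_scheduled_jobs": True,
        "run_data_tests": True,
        "feeds_bi_tool": True,        # only prod powers dashboards
    },
}
```

Keeping the differences down to a small config block like this is what lets the same SQL and test code run unchanged in every environment.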
Stability of infrastructure. Confidence in data. Who doesn’t want that?
Stability is ensured by running the SQL jobs that generate the data powering the BI tool in a completely isolated environment. This means an accidentally rushed query that scans a rapidly growing number of rows can’t lock up the tables behind high-visibility dashboards.
Confidence comes from deploying all changes to a staging environment, where data quality checks can run on the entire system. Confidence is increased by knowing that data not passing these tests in staging won’t be moved to production, and data not passing tests in production won’t be moved to the BI tool.
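That two-stage gate — staging to production, then production to the BI tool — can be sketched as a single promotion step applied twice. The function and dataset names here are hypothetical:

```python
def promote(dataset, run_tests, publish):
    """Publish a dataset to the next stage only if its tests pass.

    run_tests: callable returning True when all data quality checks pass.
    publish:   callable that moves the dataset to the next stage.
    """
    if not run_tests(dataset):
        print(f"Tests failed for {dataset!r}; promotion blocked.")
        return False
    publish(dataset)
    return True
```

The same `promote` gate runs once to move tested data from staging to production, and again to expose production data to the BI tool, so untested data never reaches a dashboard.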
The goal: align data engineering to software engineering workflows.
In this new world: “Developing locally” means running queries on replicas of production data without worrying about production jobs; “testing in staging” means deploying a code branch into a truly production-like environment with no downstream dependencies; and “deploying to production” means the same thing, but deploying thoroughly tested code with confidence.
First, I have purposefully stayed vendor-agnostic in the proposed architecture. It doesn’t matter whether you’re implementing on GCP or AWS, orchestrating with Airflow or Argo, or testing with Great Expectations or Soda. There is a wide variety of tools to implement each block, and the best one depends on your organization’s data and business requirements.
Separate environments and data quality testing are universal.
Lastly, I have not touched on the cost of such an architecture. It is fairly obvious that costs rise when you replicate your production data warehouse twice over. I would challenge any organization weighing a lower-cost alternative to also weigh the cost of making decisions on incorrect data. The more data-driven an organization, the more costly incorrect data becomes.
Thanks for reading! I love talking data stacks. Shoot me a message.