My data portfolio — Case #02: Data ecosystem as a platform
Here is how I built a single codebase that provides a full platform for data processing and quality validation for data in transit. It uses many tools, but its focus is to leverage the potential of all those tools without Data Engineers losing agility for new deployments.
Why call it a platform if I don’t own it and I implemented it with external tools? Because platforms provide the tools and infrastructure to build applications (in our case, data pipelines). Some concepts are an adaptation (or copy) of the “12 Factor” principles (read them here: https://12factor.net/).
Value delivery:
- Reduced onboarding time and learning curve for new data team members;
- Faster deliveries from the data team by separating flow execution logic from business rules;
- Data quality assured by capturing issues in transit with serialized validations;
- Data team sleeping better without incidents 😴 😁
Some challenges along the way:
To begin, we face one of the worst problems in data engineering: how do you organize an ever-growing number of ETL flows without getting lost in them? It is easy to end up with hundreds or even thousands of flows in a “mesh” architecture, and things start getting wild when this happens:
- Duplicated code or ETLs doing the same thing, violating the simple rule of DRY;
- Business logic spread across different sources/technologies: database, ETL logic, modules/libs, external tools, etc.;
- Losing track of what is considered the SLA for data, since it is ever changing and only a few people in the company know why.
The Idea:
Some key things must be done to address the issues above correctly. Let's go through each:
1. Templates everywhere:
- Data pipelines should be configurable using templates that hold the task execution logic;
- Database modeling is in template files (AKA migrations), so the team has the same environment in development and production;
- SQL extraction/loading queries should use templates that hold the extraction logic according to the database modeling;
- Data validation for each necessary stage should use templates that verify nullable fields, regex patterns, min, max, etc., and define the default routing/sinking for invalid data;
Notice what is happening here: if you want to know everything about a pipeline, you do not need to dive deep into code. Just check a few templates and you are good to go.
2. The processing of data must be safeguarded inside a module. Here we keep the business logic to transform/add/remove fields in one place. This can be reused across multiple ETL flows;
3. Keep modules specialized. Single responsibility principle. For example: one abstract module for database connections and query execution and another for data validation.
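To make the single responsibility idea more concrete, here is a minimal sketch (module and function names are hypothetical, not the real platform code) of how a database module and a validation module could stay separated:

```python
# db.py: only knows about connections and query execution.
import os
from contextlib import contextmanager
from typing import Optional

import psycopg2  # assuming PostgreSQL, as used elsewhere in this post


@contextmanager
def get_connection():
    """Open a connection using environment variables (12 Factor style config)."""
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        yield conn
    finally:
        conn.close()


def run_query(sql: str, params: Optional[dict] = None) -> list:
    """Execute a (templated) query and return the raw rows."""
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()


# validation.py: only knows how to check rows against a contract.
def validate(rows: list, contract: dict) -> tuple:
    """Split rows into (valid, invalid) according to the contract constraints."""
    valid, invalid = [], []
    for row in rows:
        broken = [
            field
            for field, rules in contract["fields"].items()
            if row.get(field) is None and not rules.get("nullable", True)
        ]
        (invalid if broken else valid).append(row)
    return valid, invalid
```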
Going deeper into the tech aspects:
Here are some ideas from the 12 Factor app applied to data to help the data team deliver faster:
- Code base: Deliver fast, deliver early! One code base, easily deployable with CI/CD, with environments mapped to branches (development, staging, production);
- Dependencies: Isolated and declared. Easily solved with a shareable requirements.txt and requirements-dev.txt in Python repositories;
- Same environment for dev, staging and prod: We have a Docker container to process the data pipelines, plus other services like a PostgreSQL database and a lib such as yoyo to run migrations and seed data into the database (see the migration sketch after this list). Also, .env files so the application knows where it is running;
- Application modules log things by default for each event, making debugging easier;
- TDD should be used to validate deployments and must be kept separate from the run stage.
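Since yoyo migrations are mentioned above, here is a minimal sketch of what a migration file could look like (the table names and columns are placeholders, not the real schema). yoyo migration files declare a list of steps, each with an apply statement and a rollback statement:

```python
# migrations/0001.create-base-tables.py: a hypothetical yoyo migration.
from yoyo import step

steps = [
    step(
        "CREATE TABLE orders ("
        "  order_id SERIAL PRIMARY KEY,"
        "  customer_email TEXT NOT NULL,"
        "  amount NUMERIC NOT NULL"
        ")",
        "DROP TABLE orders",  # rollback
    ),
    step(
        "CREATE TABLE etl_errors ("
        "  id SERIAL PRIMARY KEY,"
        "  flow_name TEXT,"
        "  phase TEXT,"
        "  payload JSONB,"
        "  created_at TIMESTAMP DEFAULT now()"
        ")",
        "DROP TABLE etl_errors",
    ),
]
```

Applying the folder with something like yoyo apply --database postgresql://… migrations/ keeps dev, staging and production on the same schema.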
Design Overview
The architecture overview and a few constraints:
Techs and libs:
Flow execution overview:
Please note that invalid data checked against a contract is sunk as JSON into an error table in the database. So if you want to know whether something is wrong, check the error table ;)
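As a rough illustration (the table and column names here are made up, not the real ones), sinking invalid rows could look like this with pandas and SQLAlchemy:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; in the platform this would come from .env files.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")


def sink_invalid_rows(invalid: pd.DataFrame, flow_name: str, phase: str) -> None:
    """Persist rows that failed a contract check, keeping the raw payload as JSON."""
    if invalid.empty:
        return
    errors = pd.DataFrame(
        {
            "flow_name": flow_name,
            "phase": phase,
            "payload": invalid.apply(lambda row: row.to_json(), axis=1),
        }
    )
    errors.to_sql("etl_errors", engine, if_exists="append", index=False)
```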
What is a contract and why use it?
A contract is a simple template file that holds the constraints the data is checked against. It might seem overkill to use one for each step/task of a flow run, but it adds another layer of data quality checks and prevents bugs in our code. Doing this, everyone can see exactly what is expected at each phase of data processing (in transit), because it works as documentation too. Here is a simple example:
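(Illustrative only: the field names and rules below are assumptions to show the idea, and the real contract could just as well be a YAML or JSON file; it is shown here as a Python dict.)

```python
# Illustrative contract for the "load" phase of a hypothetical orders flow.
orders_load_contract = {
    "fields": {
        "order_id": {"type": "int", "nullable": False, "min": 1},
        "customer_email": {
            "type": "str",
            "nullable": False,
            "regex": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
        },
        "amount": {"type": "float", "nullable": False, "min": 0, "max": 1_000_000},
        "coupon_code": {"type": "str", "nullable": True},
    },
    # Where rows that violate the contract should be routed/sunk by default.
    "on_invalid": {"sink": "etl_errors", "format": "json"},
}
```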
The anatomy of a flow and its phases:
Below is the template configuration for a flow. We could even customize it with a run schedule, external processing, etc. Also note that extraction/load queries, contracts and transform functions are written in a way that facilitates their reuse:
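An illustrative version of such a template (paths, names and the schedule are placeholders; the real file could be YAML/JSON, shown here as a Python dict to keep the example short):

```python
orders_daily_flow = {
    "name": "orders_daily",
    "schedule": "0 3 * * *",                      # optional cron-style schedule
    "extract": {
        "query": "sql/extract/orders_daily.sql",  # reusable templated SQL
        "contract": "contracts/orders_extract.yaml",
    },
    "transform": {
        "function": "transforms.orders.enrich",   # business rule kept in a module
        "contract": "contracts/orders_transform.yaml",
    },
    "load": {
        "table": "orders",  # simplified: a load query template could be used instead
        "contract": "contracts/orders_load.yaml",
    },
}
```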
How to input this in an abstract flow for Prefect:
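A minimal sketch, assuming Prefect 2's flow/task decorators; the helpers, paths and connection string are hypothetical stand-ins for the platform modules described above, and contract validation/error sinking would be called between the steps:

```python
from importlib import import_module
from pathlib import Path

import pandas as pd
from prefect import flow, task
from sqlalchemy import create_engine

# Placeholder connection; in practice this comes from .env configuration.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")


@task
def extract(config: dict) -> pd.DataFrame:
    """Read the templated SQL file declared in the flow template and pull the data."""
    sql = Path(config["extract"]["query"]).read_text()
    return pd.read_sql(sql, engine)


@task
def transform(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Resolve the business-rule function declared in the template and apply it."""
    module_path, func_name = config["transform"]["function"].rsplit(".", 1)
    func = getattr(import_module(module_path), func_name)
    return func(df)


@task
def load(df: pd.DataFrame, config: dict) -> None:
    """Append the transformed data to the target table."""
    df.to_sql(config["load"]["table"], engine, if_exists="append", index=False)


@flow(name="abstract-etl")
def abstract_etl(config: dict) -> None:
    """One abstract flow; the template decides what actually runs."""
    df = extract(config)
    df = transform(df, config)
    load(df, config)
```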
Running abstract flow:
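Again just a sketch, reusing the hypothetical names from the snippets above:

```python
# Hypothetical entry point: pick a flow template and hand it to the abstract flow.
if __name__ == "__main__":
    abstract_etl(orders_daily_flow)  # the template dict shown earlier
```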
Nice, huh?
Questions and pitfalls:
How to organize it in our repository? It's not perfect, because the more flows we have, the more complexity is added. It works fine anyway if developers follow the rules. We could also separate flows by context/source/destination to make a flow easier to find. Here is what I have done:
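One possible layout (illustrative sketch, not the exact tree I used):

```
repo/
├── contracts/              # data contracts, one per flow phase
├── migrations/             # yoyo migration files (database modeling)
├── sql/
│   ├── extract/            # reusable extraction query templates
│   └── load/               # reusable load query templates
├── flows/                  # flow templates, grouped by context/source
├── src/
│   ├── database.py         # connections and query execution
│   ├── validation.py       # contract checks and error sinking
│   └── transforms/         # business rules, one module per context
├── tests/
├── requirements.txt
└── requirements-dev.txt
```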
Why use yoyo migrations and not custom flows to generate/populate the database? Why process everything in the Python modules? Why pandas for data manipulation? Why not dbt, or keeping the logic in the SQL database? Why is the run logic separated from the build logic?
First: context. Second: keeping a single source of responsibility, meaning one place to find database modeling, one place to find ETL logic and validations, one place to find applied business rules, one place to take care of runs and schedules…
Isolating everything makes it easier to do things like:
- Create a module to process something in PySpark (see the sketch after this list);
- Load or offload processed data into tool ABC;
- Keep it DRY!
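For instance, a hypothetical PySpark version of a transform module could plug into the same flow template simply by changing the declared function path; the rule below is made up for illustration:

```python
# transforms/orders_spark.py: hypothetical PySpark counterpart of a pandas transform.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def enrich(df: DataFrame) -> DataFrame:
    """Same kind of business rule, expressed with Spark instead of pandas."""
    return df.withColumn("amount_with_tax", F.col("amount") * F.lit(1.1))
```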
Thanks for reading this far. Feel free to get in touch or give me your feedback! 🙃 Cheers!