How to make the Data Warehouse automagical

Eva Lond · Published in Pipedrive R&D Blog
6 min read · May 17, 2022

An efficient company wants to produce minimal waste and work only on the features and products that are most valuable to its customers. At Pipedrive, we strive to be like this. One way we do that is by having all the data we need at our fingertips.

Our data layers

The data needs to go through a few layers before we can use it.

  • Bronze layer
    Getting the raw data into the data warehouse is not an easy task at all. Different data domains might use completely different tools and technologies, from Excel spreadsheets to the APIs of SaaS tools. We also need to think about our own production databases. So the main purpose of this layer is to find a way to integrate all of this data into our central data storage.
  • Silver layer
    This layer curates datasets and makes sure they’re up to standard. It’s organized by source domain, and its tables typically have a one-to-one relationship with tables in the previous layer.
  • Gold layer
    At this point, data is finally displayed using complex models, reports and dashboards packed with useful insights.

“Silver layer” data is usually good enough for mission teams to see how their product is performing. “Gold layer” data is more for company-level reporting and other complex use cases. We’ll explain more about the “gold layer” soon on the Pipedrive Engineering blog.

In the past, “silver layer” work was done manually with a custom ETL map, and it took a lot of effort. When planning our sprints, we treated this “standard task” as one estimation point, a bit more than one day of development work.

Avoid the bottleneck

Image: data flowing through a bottle, with most streams blocked by the narrowing neck (photostockeditor.com)

These manual tasks started filling up our backlog. The whole company would fight over who got their data into the data warehouse next, and it could take months to add new datasets. On top of that, the work was given to one centralized, highly specialized team that had to write boilerplate code to achieve pretty simple transformations, which is not the most enjoyable work from an engineer’s perspective. As the company grew, the situation only became more challenging.

Finding a solution

As engineers, we knew we needed a way to keep the technical code separate from business logic. That’s when we realized that the whole process could be automated, giving data owners the tools to govern their data in the warehouse.

To make it happen, we saw that three conditions had to be met:

  • Datasets automatically available in the “bronze layer”
    Although countless types of data sources come in from all different business domains, the most common data flow comes from microservices, which are backed up daily. The bronze layer automation takes care of parsing these backups.
  • Data catalog with annotations
    Data owners annotate their datasets with descriptions and some technical details. We get that data owners don’t need to know the ins and outs of data warehousing. Annotations should be about their data, not about what should happen to it in the “silver layer.” The logic of what to do is defined by the automation.
  • Testing framework
    Even though testing is not that common in the data world, we saw immediately that, in this case, the logic was far too complex to manage without it. This way, we could test each functional requirement of the system we had designed and make sure that adding more functionality would not break existing logic for the hundreds or thousands of tables already processed by this automation. We used Dockerized AWS LocalStack with Hive and Spark to make it happen (a minimal sketch of such a test follows this list).
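For illustration only, here is a minimal sketch of what one such functional test could look like, using a plain local SparkSession rather than the full Dockerized LocalStack and Hive setup; the rule under test and all names are hypothetical, not Pipedrive’s actual test suite:

```python
# Minimal sketch of one functional test for a "silver layer" rule,
# run against a plain local SparkSession. All names are hypothetical.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("silver-layer-tests")
        .getOrCreate()
    )


def deduplicate(df, key_cols):
    """Hypothetical rule under test: keep one row per key combination."""
    return df.dropDuplicates(key_cols)


def test_deduplication_keeps_one_row_per_key(spark):
    rows = [("deals", 1, "open"), ("deals", 1, "open"), ("deals", 2, "won")]
    df = spark.createDataFrame(rows, ["entity", "id", "status"])

    result = deduplicate(df, ["entity", "id"])

    # Duplicate (entity, id) pairs collapse to a single row.
    assert result.count() == 2
```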

Old-timers in the data world are used to SQL as the language all ETL logic is defined in, and may be less familiar with generic software engineering. To some, having heavily dynamic transformation logic in the ETL map that performs the “silver layer” work seemed like a wild idea. We could have rendered out the exact transformations from the annotations to document exactly what was done with the data, but since we already keep daily backups of the data, we opted to log the Spark execution plans instead. That way, we can always trace which transformations were applied, while a single dynamic transformation map handles all the logic needed to run the “silver layer”.
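As a rough sketch of that idea (not our actual code), the execution plan of a transformed DataFrame can be captured as text and written to an ordinary logger; the function and logger names below are assumptions:

```python
# Minimal sketch: capture a DataFrame's execution plan as a string so it
# can be written to logs. `silver_df` stands in for the transformed output.
import io
import logging
from contextlib import redirect_stdout

logger = logging.getLogger("silver_layer")


def log_execution_plan(df, table_name):
    # df.explain() prints to stdout, so redirect it into a buffer first.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(extended=True)
    logger.info("Execution plan for %s:\n%s", table_name, buffer.getvalue())


# Usage, assuming `silver_df` is the result of the dynamic transformations:
# log_execution_plan(silver_df, "deals.activities")
```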

The solution worked like a charm. Based on the annotations, it’s clear both what should happen to the data and what actually happened. In fact, the logic is so reliable that it catches issues in the “bronze layer” and acts as an extra validation step.

An automagical new world

Illustration: https://www.vecteezy.com/free-vector/human

Finally, this is how our “silver layer” automation works at a higher level (a simplified Airflow sketch follows the list):

  • The daily task in Airflow renders a DAG with one task per schema (e.g., microservices, applications or databases) in the “bronze layer” that needs to be processed by the “silver layer.”
  • Each of the tasks in the main DAG renders another DAG for that specific schema, with one or more tasks for each annotated table that needs processing.
  • Each task takes the annotations from the data catalog, the state of the bronze dataset and the state of the destination dataset as input, then carries out any needed transformations based on the logic agreed on by our team.
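The sketch below illustrates the idea in simplified form, collapsing the two levels of rendering into a single DAG per schema; get_annotated_tables() and process_table() are hypothetical stand-ins for the data catalog lookup and the dynamic transformation map, and the schema list would normally be discovered rather than hard-coded:

```python
# Simplified sketch of the DAG rendering, not Pipedrive's actual code.
# It collapses the two rendering levels into one DAG per schema.
# get_annotated_tables() and process_table() are hypothetical helpers.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def get_annotated_tables(schema: str) -> list[str]:
    # Hypothetical: query the data catalog for annotated tables in this schema.
    return ["deals", "deal_fields"] if schema == "deals" else []


def process_table(schema: str, table: str, **context) -> None:
    # Hypothetical: read annotations plus bronze/silver state, then apply
    # the needed transformations via the dynamic transformation map.
    ...


for schema in ["deals", "activities", "billing"]:  # normally discovered, not hard-coded
    dag = DAG(
        dag_id=f"silver_layer_{schema}",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    with dag:
        for table in get_annotated_tables(schema):
            PythonOperator(
                task_id=f"process_{table}",
                python_callable=process_table,
                op_kwargs={"schema": schema, "table": table},
            )
    # Register the rendered DAG so the Airflow scheduler picks it up.
    globals()[dag.dag_id] = dag
```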

If a new dataset comes up somewhere in production, it lands in the “bronze layer” the next day and its schema is pre-populated in the data catalog. Once annotations have been added, the data moves to the “silver layer” the day after (or, if we’re in a real hurry, we can trigger the tasks manually).

The most important “silver layer” functions are listed below (a rough PySpark sketch of a couple of them follows the list):

  • setting the primary key for a table and calculating hashes when we need composite keys (the primary key is also the only required annotation besides the description)
  • breaking up huge datasets for better performance
  • deduplication: when duplicate rows are found, data owners get an alert and can set rules for handling them
  • hashing or discarding fields and tables based on their information classification level (satellite tables are automatically created in restricted-access schemas for the hashed fields)
  • special handling for a limited list of dimensions: splitting times into dates and timestamps, and transforming keys of common entities such as companies or users, languages, countries, IP addresses, user agents, etc.
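To make a couple of those functions more concrete, here is a rough PySpark sketch (not the actual implementation) of composite-key hashing, deduplication and the hashing of restricted fields; the annotation structure and column names are assumptions:

```python
# Rough PySpark sketch of a few of the functions above: composite-key hashing,
# deduplication and hashing of restricted fields. The annotation structure
# and column names are assumptions for illustration only.
from pyspark.sql import DataFrame, functions as F

annotations = {
    "primary_key": ["company_id", "deal_id"],  # the only required annotation (plus description)
    "restricted_fields": ["email"],            # e.g. personal data to be hashed
}


def apply_silver_rules(df: DataFrame, annotations: dict) -> DataFrame:
    key_cols = annotations["primary_key"]
    if len(key_cols) > 1:
        # Composite key: derive a single hashed surrogate key.
        df = df.withColumn("pk_hash", F.sha2(F.concat_ws("||", *key_cols), 256))

    # Deduplicate on the primary key; in production this step would also
    # alert the data owners about the duplicates it found.
    df = df.dropDuplicates(key_cols)

    # Hash restricted fields; the originals would go to a satellite table
    # in a restricted-access schema (not shown here).
    for field in annotations.get("restricted_fields", []):
        df = df.withColumn(field, F.sha2(F.col(field).cast("string"), 256))

    return df
```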

We haven’t added automation for all cases yet, such as datasets coming from lesser-used technologies, but we’ve covered the main data flows. Despite its perks, automating the silver layer has actually hurt our ability to estimate tasks for our sprints: we don’t really do our “standard tasks” anymore.

Key takeaways

Without mature tools for data engineering, even the simplest tasks could involve writing endless rows of boilerplate code. If we only used our data engineers to shovel data from source systems to the data warehouse, we would waste an expensive and hyper-specialized skill set. Now, our engineers are free to create more tools to spend less time on tedious and repetitive tasks, develop better data governance practices and focus on why we actually need the data in the data warehouse — to gain insights that solve business questions. And now, business people can access their data in days, instead of weeks or months.

Read more about our data practices at Pipedrive.

Let us know in the comments which topics you would like to hear about more in-depth!

Interested in working at Pipedrive?

We’re currently hiring for several positions across different countries and cities.

Take a look and see if something suits you

Positions include:

  • Junior Data Engineer
  • Junior Data Platform Developer
  • Software Engineer in DevOps Tooling
  • Backend, Full-Stack, iOS Engineers
  • Infrastructure Engineer
  • And several more…

Eva Lond
Curious soul, working as a Data Engineer at Pipedrive