GWI’s new Business Data Platform

Georgios Ntanakas
Published in tech-gwi
Feb 21, 2024
AI generated using “Business Data Platform” and GWI’s main brand color as a prompt — by DALL·E 2

GWI has data at the core of its product offering. Beyond that product data, though, we also deal with every other type of business data: CRM, ERP, hiring, HR, and the-list-goes-on tools. Below is a short overview of how we handle this business data.

The why

GWI’s business data needs have traditionally been served by what we now call our “legacy” platform. While that system was adequate for certain tasks, it was a struggle to scale and adapt it to increasingly demanding business needs. Its lack of features, flexibility, and modularity led to corner-cutting deliveries built on shaky foundations, which eventually eroded trust in the data.

The (short) history

In early 2023, we enlisted the help of a consultancy to develop a first version of our new Business Data Platform, or BDP for short. The prototype was delivered in June and has evolved significantly since then. We have tackled numerous challenges, enhancing its functionality and scalability to meet business needs while keeping simplicity as a core principle.

The features

Here is a simplified system diagram of the current setup.

Simplified system diagram of BDP

Fast ingestion
We moved away from custom scripts for integrating with the various data sources and invested in Fivetran, an ingestion tool that offers direct connectors to the vast majority* of the tools we use as a business. The time saved writing, validating, and debugging API and other integrations almost equals a full-time engineer.

* BDP is not restricted to Fivetran and can leverage other tools such as Segment or custom integrations through Airflow operators.
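To make the orchestration side concrete, here is a minimal sketch of triggering and awaiting a Fivetran sync from Airflow. It assumes the community airflow-provider-fivetran package; the connection and connector IDs are hypothetical placeholders, not our actual setup.

```python
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="salesforce_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Kick off the sync for one connector configured in Fivetran.
    trigger_sync = FivetranOperator(
        task_id="trigger_salesforce_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="salesforce_connector",  # hypothetical connector id
    )

    # Block downstream tasks until the sync completes.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_salesforce_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="salesforce_connector",  # hypothetical connector id
        poke_interval=60,
    )

    trigger_sync >> wait_for_sync
```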

Reliable transformations
We went from multiple hidden layers of materialized views to an open-source solution, dbt, to apply business logic to raw source data. dbt is among the top data transformation tools at the moment, offering out-of-the-box functionality such as data lineage, a data catalog, and taxonomies for governance. Leveraging it, one gets comprehensive views such as an end-to-end lineage graph of all models.
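For a flavour of how such runs can be driven, here is a minimal sketch of invoking dbt programmatically, assuming dbt-core 1.5+ (which exposes dbtRunner); the model selector is a hypothetical placeholder.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Build (run + test) the hypothetical Salesforce staging models
# and everything downstream of them.
result: dbtRunnerResult = runner.invoke(
    ["build", "--select", "staging.salesforce+"]
)

if not result.success:
    raise RuntimeError("dbt build failed; check the logs for failing models or tests")
```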

Transformations alone would not be enough without the trust built by integrated data quality alerts and monitoring. Our latest implementation leverages Elementary to monitor both test results and model execution times.
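As an illustration, a monitoring job along these lines could run Elementary’s edr CLI from Airflow after each dbt build; the Slack webhook variable below is a hypothetical placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elementary_monitoring",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Scan Elementary's tables for new test failures and anomalies,
    # and push alerts to Slack.
    send_alerts = BashOperator(
        task_id="edr_monitor",
        bash_command=(
            "edr monitor "
            "--slack-webhook {{ var.value.elementary_slack_webhook }}"
        ),
    )
```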

Access Control
The system is designed to manage access rights based on individual job requirements. We’ve established user groups aligned with specific work functions, granting access to datasets (the term BigQuery uses for a group of tables) as needed. Within the data warehouse itself, we leverage policy tags applied through dbt to enable column-level access control. This is crucial for safeguarding sensitive information, including Personally Identifiable Information (PII) and other confidential data, should the need arise.
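In BDP this is configured declaratively through dbt, but as a rough illustration of the underlying mechanism, here is a sketch of tagging a PII column with a BigQuery policy tag using the google-cloud-bigquery client; the project, taxonomy, table, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="gwi-bdp")  # hypothetical project

table = client.get_table("gwi-bdp.salesforce.contacts")  # hypothetical table

# A policy tag from a Data Catalog taxonomy; access to tagged columns
# is then governed by the tag's IAM policy.
pii_tag = bigquery.PolicyTagList(
    names=["projects/gwi-bdp/locations/eu/taxonomies/123/policyTags/456"]
)

# Re-declare the schema, attaching the tag to the email column.
new_schema = []
for field in table.schema:
    if field.name == "email":  # hypothetical PII column
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            policy_tags=pii_tag,
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```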

Orchestration — Infrastructure
Without getting any more technical (for now), for completeness: orchestration is achieved with Apache Airflow, and the infrastructure is maintained by our astounding DevOps team to the industry’s highest standards using Terraform, an infrastructure-as-code tool (instead of untraceable manual settings). The emphasis, once again, is on agility as well as robustness.
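To give a feel for how the pieces above hang together, here is a deliberately skeletal sketch of a daily DAG chaining ingestion, transformation, and monitoring; task internals are elided and the commands are placeholders, not our actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bdp_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each stage is elided; see the earlier sketches for details.
    trigger_syncs = BashOperator(
        task_id="trigger_fivetran_syncs", bash_command="echo 'trigger syncs'"
    )
    dbt_build = BashOperator(task_id="dbt_build", bash_command="dbt build")
    edr_monitor = BashOperator(task_id="edr_monitor", bash_command="edr monitor")

    trigger_syncs >> dbt_build >> edr_monitor
```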

So what?

So far, the most significant source we’ve ingested and modelled into the BDP is Salesforce. Part of this work aligned with the company-wide initiative of transitioning to CPQ (Configure, Price, Quote). This shift alone has saved an impressive 15 hours per week in data entry, plus 3 working days per month in month-end reporting.

Salesforce data from the BDP is also used for Tableau reporting, and early adoption of the new dashboards looks promising.

Additionally, the BDP serves as the data core for what we call the Product Data Foundations work-stream, a first-of-its-kind attempt at GWI to automate the combination of GWI’s platform events and Salesforce data, in collaboration with the Product Analytics team. As an early enablement milestone for this work-stream, an automated data export was recently delivered to our Customer Success org, estimated to save over 10 minutes per report, or 1,650 hours per year.
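As a rough illustration of what such an export could look like (not our actual implementation), here is a sketch of extracting a modelled BigQuery table to Cloud Storage; the project, dataset, and bucket names are hypothetical placeholders.

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client(project="gwi-bdp")  # hypothetical project

# Export the modelled report to a dated CSV in Cloud Storage.
destination = f"gs://gwi-bdp-exports/cs_report_{date.today():%Y%m%d}.csv"
extract_job = client.extract_table(
    "gwi-bdp.product_foundations.cs_report",  # hypothetical model
    destination,
)
extract_job.result()  # block until the export finishes
```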

Besides reporting, we’ve also implemented email alerting, currently applied to monitoring specific changes in Hibob data. As this blog post is being written, we’re creating dashboards from already-ingested Greenhouse data and modelling ingested SAP data to enable reporting in the near future.
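To sketch the alerting pattern (with hypothetical table and recipient names, not our actual setup): a Python task checks a modelled Hibob table for changes, and an email is sent only when changes exist.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import ShortCircuitOperator
from google.cloud import bigquery


def hibob_changes_exist() -> bool:
    """Return True if the hypothetical changes model has rows for today."""
    client = bigquery.Client()
    query = """
        SELECT COUNT(*) AS n
        FROM `gwi-bdp.hibob.employee_changes`  -- hypothetical model
        WHERE change_date = CURRENT_DATE()
    """
    rows = list(client.query(query).result())
    return rows[0]["n"] > 0


with DAG(
    dag_id="hibob_change_alerts",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Skip the email when no changes were detected.
    check = ShortCircuitOperator(
        task_id="check_for_changes",
        python_callable=hibob_changes_exist,
    )

    notify = EmailOperator(
        task_id="send_alert",
        to="people-team@example.com",  # hypothetical recipient
        subject="Hibob changes detected",
        html_content="New Hibob changes were detected today.",
    )

    check >> notify
```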

Outro

This write-up is only the foreword to a series in which we’ll dive deep into various aspects of our deployment. Stay tuned for more on our (dev/test/prod) environment setup, our data modelling philosophy, the specifics of our CI/CD, creating DAGs through config files, cost-optimization efforts, challenges, pitfalls, and small wins! :)
