Data at Pennylane: what it entails, and how we do it
From the very first day, data has been part of Pennylane’s culture. While that’s an easy statement to make, it is not that easy to get right in practice. “Data” — and our data team — covers a wide range of topics, from infrastructural considerations to internal insights and intelligent systems, each of them holding many layers of complexity.
Echoing recent trends around “data-centric AI”, I want to share a few aspects of how we generally approach data at Pennylane, an approach that can be captured in the idea of “engineering-centric data”.
Data as a production system ⚙️
Data we create or collect is a key part of the product at Pennylane: it is what our users interact with and, ultimately, trust to steer their business. Just like we would not want our application to be down or suffer from bugs, we want to take good care of our data.
Furthermore, data is part of our own culture and of how we steer our product every day, so we’d rather make sure we can trust it and make quality decisions based on it.
Simply put: we treat data as a production system, and naturally apply the same standards and best practices that have developed in the software industry over the last decades.
In practice, this has meant establishing SLOs and alarms around our data systems, treating data issues as full-fledged outages, jumping on war-room calls, and writing post-mortems to learn from our mistakes and improve over time.
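As a sketch of what one of those alarms can look like, here is a minimal freshness check; the two-hour threshold and the function name are illustrative, not our actual SLOs:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical SLO: this dataset must have been refreshed within the last 2 hours.
FRESHNESS_SLO = timedelta(hours=2)

def freshness_breach(last_refreshed_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the dataset's age exceeds the freshness SLO."""
    now = now or datetime.now(timezone.utc)
    return (now - last_refreshed_at) > FRESHNESS_SLO
```

A breach would then be handled like any production incident: an alert fires, someone owns it, and a post-mortem follows if it was severe.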
To avoid reaching that point too often, we also follow strict development cycles. Our contributions gradually move from a development environment to larger-scale staging validations, and finally to production releases, whether we are talking about data models we’ll put in front of our internal users, or data-intensive parts of our application receiving live traffic.
We have been combining this environment logic with systematic CI/CD pipelines (GitHub Actions) that perform various steps such as:
- running linters on our code with pre-commit to guarantee shared standards and increase code quality,
- running unit tests to prevent regressions,
- deploying our code to the right environment,
- notifying our Slack channels when it’s appropriate.
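To make the unit-testing step concrete, here is a small sketch: a hypothetical parsing helper and the test that pins down its expected behavior, so any regression fails the pipeline before reaching production:

```python
def normalize_amount(raw: str) -> float:
    """Parse a French-formatted amount like '1 234,56' into a float (illustrative helper)."""
    return float(raw.replace(" ", "").replace(",", "."))

def test_normalize_amount():
    # Any future change that breaks these expectations fails CI.
    assert normalize_amount("1 234,56") == 1234.56
    assert normalize_amount("0,99") == 0.99
```

The helper and its format are invented for the example; the point is that behavior is encoded as executable expectations rather than tribal knowledge.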
Our infrastructure is defined as code with Terraform, which makes it fast and reliable to spin up equivalent resources across all environments.
By now you can probably guess this one: we perform code reviews on any contribution meant to go live, from heavy streaming data pipelines to simpler tweaks of batch processing jobs or queries.
Data quality over quantity 🏅
Pennylane is growing really fast, which implies more and more data. In order to keep delivering quality insights, intelligent solutions, and ultimately make good decisions, we strive to maintain a healthy trade-off between the quantity of data we expose to our users or algorithms, and their quality.
From our past experience, we have learned that the flywheel “more data > more data practitioners > more data” can do a lot more harm than good to decision making.
We continuously work on maintaining a unique source of truth with robust lineage behind every user-facing solution, and migrate their consumers in non-breaking steps whenever needed.
Our “gold standard” data always goes through a series of tests ensuring its compliance with our expectations. It is fully documented and follows strict data modeling and nomenclature rules, as well as reviews to remain intelligible to our end users.
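To illustrate what a nomenclature rule can look like (the convention below is hypothetical, not our actual one), such rules are easy to enforce automatically:

```python
import re

# Hypothetical convention: gold models are named "<domain>__<entity>", e.g. "finance__invoices".
MODEL_NAME_RULE = re.compile(r"^[a-z]+__[a-z_]+$")

def nomenclature_violations(model_names):
    """Return the model names that break the naming convention."""
    return [name for name in model_names if not MODEL_NAME_RULE.match(name)]

# nomenclature_violations(["finance__invoices", "TmpTable3"]) -> ["TmpTable3"]
```

Running such a check in CI keeps the warehouse intelligible as the number of models grows.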
At this stage in our development, only a select few users are contributors to our data warehouse or streaming data pipelines to achieve our target “quality-quantity” balance. Not everyone can create public metrics or dashboards in our internal visualization tool either — while pretty much everyone can query data to answer ad hoc questions.
Finally, when it comes to designing algorithms, a large fraction of our time is spent defining the appropriate approach, data, and metrics before we even discuss technical options, so that we can measure the performance of our solutions and iterate on them. This sometimes means making changes to our application and collecting new data, and we favor this approach over fuzzy proxies for anything that’s core to our product.
Explicit producer-consumer models 🤝
Beyond the increasing quantity of data, Pennylane’s growth translates into more and more producers and consumers of data too. Of course, our data team may sit on either side here, depending on the context. As much as we trust the practices we just described to avoid common pitfalls and quality issues, it is equally important to define contracts between data producers and consumers, outlining clear scopes and accountabilities. We have found that explicit producer-consumer models lead to a good isolation of concerns, and allow us to place accountability where the deepest knowledge about the data lies.
In practice, this depends a lot on where data is produced and how much control we have over it:
- full control — typically data produced by our data team (e.g. machine learning predictions or database change data capture). Here, our code and data can fully follow our standards;
- good control — data produced by our tech teams or through the Pennylane application, for example. Contracts here take the shape of unit tests guaranteeing non-breaking data model changes, schema validation at the API level, appropriate alarms to quickly detect issues, and cross-team code reviews to coordinate work and spread knowledge;
- poor control — typically data imported from external tools such as our CRM or customer support tools. Because we don’t want our end users to be affected by schema or content changes that we cannot control, we introduce buffers in the form of “landing zones” in our data warehouse. Landing-zone data is synchronized in near real-time, then extracted, transformed, and loaded into data models we do control, which go through data tests (allowed values, freshness, volume, or distribution, for example). This design lets us know what may go wrong, while giving us enough time to investigate issues and restore data before it is consumed in the wrong way. Data quality issues we identify are systematically addressed at the level of their root cause (most often tooling configuration) by the teams operating those tools, so that no short-term patches are made on top of corrupt data.
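Here is a minimal sketch of the kind of data tests run when promoting landing-zone data into controlled models; the column, allowed values, and threshold are made up for the example:

```python
# Hypothetical allowed values for a "status" column synced from an external tool.
ALLOWED_STATUSES = {"draft", "validated", "archived"}

def run_data_tests(rows, min_volume=1):
    """Return the names of the failed tests for a batch of landing-zone rows."""
    failures = []
    # Volume: an empty or suspiciously small batch often signals an upstream issue.
    if len(rows) < min_volume:
        failures.append("volume")
    # Allowed values: reject statuses our controlled models do not know about.
    if any(row.get("status") not in ALLOWED_STATUSES for row in rows):
        failures.append("allowed_values")
    return failures

# run_data_tests([{"status": "draft"}, {"status": "unknown"}]) -> ["allowed_values"]
```

A non-empty result blocks the promotion and triggers an investigation, giving the owning team time to fix the root cause upstream.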
Tooling ergonomics 📠
What we’ve described so far is, in short, a lot of work! And because we want to be more helpful to our users every day, we focus heavily on making sure the right conditions are met for our data team to keep delivering value at a good pace.
It goes without saying that our data ecosystems are hosted on the cloud (AWS): we’d rather focus on designing valuable solutions than on building and managing vanilla infrastructure. We have observed that using the right tools, abstractions, and automation goes a long way, and here is a short list to illustrate it:
- Scheduling: data jobs and their dependencies can be a hell to manage, and we have found that using Apache Airflow with a number of custom operators tailored to our stack greatly sped up our work.
- Heavy ETL jobs: we have been relying on Spark for a while, and learned the hard way that the underlying environment is a key factor in being able to iterate fast. We initially opted for AWS EMR, but very quickly looked for a more flexible, robust, and friendlier solution. We got very lucky to find out about Data Mechanics — if you have not heard about them, give it a try, it’s been a real enabler for us!
- Data discovery: one of our ambitions is to make internal analytics self-service for most teams. There are a few really solid business intelligence tools out there; however, they tend to solve only the querying/dashboarding/sharing part of the problem. When it comes to searching data and knowing what is available in the first place, Castor has been filling the gaps very nicely, allowing us to focus on business-related tasks.
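To illustrate the dependency problem the Scheduling point refers to, here is a toy job graph ordered with Python’s standard library; the job names are invented, and an orchestrator like Airflow adds retries, backfills, and monitoring on top of this core idea:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy job graph: each job maps to the set of jobs it depends on.
jobs = {
    "extract_invoices": set(),
    "build_revenue_model": {"extract_invoices"},
    "refresh_dashboard": {"build_revenue_model"},
}

# A scheduler must execute jobs in an order that respects every dependency.
order = list(TopologicalSorter(jobs).static_order())
# order == ["extract_invoices", "build_revenue_model", "refresh_dashboard"]
```

With dozens of jobs and fan-in/fan-out dependencies, managing this by hand quickly becomes unmanageable, which is exactly why a scheduler earns its place in the stack.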
Data at the core of the product 🧠
Everything we described so far can be seen as prerequisites to put data at the center of what we do. At Pennylane, insights derived from data feed decisions every day, and data solutions fuel various parts of the product itself. Whether it’s through automation, increased relevance, predictive insights, or observational studies, the value we gain from data is huge — and we are only at the start of our journey.
We hope you have enjoyed this first glimpse at how we approach data challenges. Some ideas may sound obvious to you, others perhaps more surprising: either way, let us know how you feel!
We will dive deeper into specific solutions in future articles; until then, if you feel intrigued and want to know more, check out our open data positions and get in touch!