Data Contracts at GoCardless — 6 Months On
It’s been 6 months since I introduced Data Contracts as our initiative to improve data quality at GoCardless. So, how are we getting on? What’s gone well, and what are the challenges we’ve faced?
Starting with why
Data quality matters as soon as your organisation starts any initiative that puts data directly in front of executives, non-data-team members, or customers. For a company like GoCardless with data-driven products such as Success+, having good quality data is essential for us to unlock the full value of our data.
We’ve found it’s important to communicate regularly to remind people why we’re doing this. Data Contracts is a big culture shift that is going to take some time, and it can easily become something people feel they have to do, rather than something they want to do for the benefits it brings their team, their stakeholders, and the organisation.
Starting with why is one of our values at GoCardless — we believe you can only do a job properly when you know why you are doing it — and it’s important to continually reiterate the why for Data Contracts.
Data Contracts for inter-service communication
We often think of data consumers as those who work in data teams building reports and ML models, but almost every service is also consuming data from another service and relies on that data to do its job.
Like any data consumer, the service needs this data to be reliable, timely, and of good quality — exactly what we are trying to achieve with Data Contracts! So it shouldn’t have been a surprise to us that we have seen great adoption of Data Contracts for asynchronous inter-service communication, with it now powering around 50% of these events.
We haven’t yet seen the same kind of adoption for data primarily produced for the data teams. While projects that generate new data do default to using Data Contracts for publishing that data, most of the usage is still driven by data coming from our existing change data capture (CDC) pipelines. We plan to decommission these pipelines, and as that deadline approaches we expect data generators to move over to Data Contracts. Until then, it can be difficult to make the case for teams to do so when weighed against their other priorities.
Patterns for effectively publishing data
While our CDC pipelines just silently pulled data out of source databases, with Data Contracts we’re asking users to explicitly publish the data they want to make available for consumption. This is intentional: the friction is a necessary part of creating an interface to your data that you commit to maintaining, exactly like when you create an API rather than handing out database credentials to your users. But while most software engineers are comfortable designing APIs, many are less comfortable with the best practices around modelling and publishing data.
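To make the idea of an explicit interface more concrete, here is a minimal sketch of a contract that is checked before a record is published. It is illustrative only and is not our actual Data Contract format: the `Field` and `Contract` classes, the `payment_created` contract, and its field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field the producer commits to publishing (hypothetical shape)."""
    name: str
    py_type: type
    required: bool = True


@dataclass(frozen=True)
class Contract:
    """A named, versioned description of the records a producer publishes."""
    name: str
    version: int
    fields: tuple

    def validate(self, record):
        """Return a list of human-readable violations (empty if the record conforms)."""
        errors = []
        known = {f.name for f in self.fields}
        for f in self.fields:
            if f.name not in record or record[f.name] is None:
                if f.required:
                    errors.append(f"missing required field: {f.name}")
            elif not isinstance(record[f.name], f.py_type):
                actual = type(record[f.name]).__name__
                errors.append(f"{f.name}: expected {f.py_type.__name__}, got {actual}")
        # Anything not declared in the contract is rejected, so data is only
        # ever exposed deliberately rather than leaking out by accident.
        for key in record:
            if key not in known:
                errors.append(f"unexpected field: {key}")
        return errors


# Hypothetical contract for an illustrative "payment created" event.
payment_created = Contract(
    name="payment_created",
    version=1,
    fields=(
        Field("payment_id", str),
        Field("amount_pence", int),
        Field("reference", str, required=False),
    ),
)
```

A producer would call `payment_created.validate(record)` and only publish when the returned list is empty. Rejecting undeclared fields is one possible design choice: it is what separates a deliberate interface from simply exposing whatever happens to be in a table.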
One data publishing pattern that we use within GoCardless is the outbox pattern. This pattern is very useful when you want to send a message to another service only if the transaction commits. However, it does add load to your database, both when you write the messages and when the component that processes them reads them and sends them on.
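As a rough illustration of the general pattern (not our production implementation), the sketch below uses SQLite from the Python standard library. The table names, the `payment-created` topic, and the `publish` callback are all assumptions made for the example; in a real system the callback would hand the message to the message broker.

```python
import json
import sqlite3


def create_schema(conn):
    """Create an illustrative business table and its outbox table."""
    conn.execute(
        "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_pence INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )


def create_payment(conn, payment_id, amount_pence):
    """Write the business row and the outgoing message in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the outbox row exists only if the payment row was committed too.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, amount_pence) VALUES (?, ?)",
            (payment_id, amount_pence),
        )
        payload = json.dumps({"payment_id": payment_id, "amount_pence": amount_pence})
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment-created", payload),
        )


def relay_outbox(conn, publish, batch_size=100):
    """Send unsent outbox rows via `publish` and mark them sent. Returns the count."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # hypothetical hand-off to the message broker
        # Marking the row after publishing gives at-least-once delivery: a
        # crash between the two steps means the message is sent again.
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Both sources of extra load are visible here: every business write carries a second insert, and the relay repeatedly queries and updates the outbox table on the same database.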
We’re finding that this pattern is being used not only for sending messages as part of a transaction but also in other cases, for example where a team just needs lower latency than writing directly to GCP Pub/Sub (our message broker of choice). Combined with our CDC service and everything else that happens on a database, it can be easy to overload that database if we overuse this pattern or if the implementation isn’t as efficient as it could be, leading to outages and incidents.
As well as improving the performance of our outbox implementation, we are doing more work to understand the different use cases it serves, and thinking about how best to support teams writing their Data Contract-backed events.
What’s next for Data Contracts
We see Data Contracts as a vessel for improving how we generate, manage and consume data at GoCardless. Through Data Contracts we’ve been providing tools that allow teams to autonomously handle our data better, with a privacy-by-design approach from the start that increases our data security and reduces risk.
While we continue to invest in these tools, we are also starting to look more at what we can do to help our consumers discover and make use of the data. Initially this will be through a data catalog, and in future we will consider building on that further to add things like data lineage, SLOs and other data quality measures.
Our primary aim at the moment, though, is to get into a position to decommission our CDC pipelines. As well as reducing the cost of supporting legacy data pipelines and removing a significant amount of load from the production databases, this will be a huge milestone in our move towards a data-driven organisation, backed by quality data, guaranteed by a Data Contract.