A Data engineering journey at Kingfisher

James Arnold
Kingfisher-Technology
6 min read · Mar 30, 2023

In this blog post, we will look at the journey data engineering has taken at Kingfisher so far, where we are now, and where we’re looking to go next!

Background

As previous readers of this blog might know, Kingfisher is a FTSE 100 group made up of several ‘banners’, such as Screwfix and B&Q, as well as a number of others across Europe. These banners all operate largely independently, with certain capabilities ‘Powered by Kingfisher’ provided by the group-level Kingfisher teams. One such capability is Data.

Each banner has sophisticated and complex operations, producing a large amount of data, with varying degrees of analytical capability to exploit it. For the majority of Kingfisher’s history, the banners relied on internal teams to perform data engineering and analytics activities. Whilst this delivered some localised results, the banners were unable to get the most out of their data due to a lack of tooling and scalability, and the challenge of coordinating and producing insight at the group level.

Around two years ago, Kingfisher began investing in building out a data capability at the group level. This began with the establishment of Data Science, Data Engineering and Data Analytics teams to help lay the foundations. The data engineering capability was initially set up by Redkite, who were chosen as Kingfisher’s partner to help establish and accelerate the fledgling area.

Redkite and Kingfisher worked together to establish the first central data platform that Kingfisher had seen. There were a few requirements at the start that drove the architectural decisions:

  • The platform must be able to ingest from sources all across the business
  • It must comply with all relevant regulations and laws
  • It must scale to support the ingestion and storage of large quantities of data
  • It must provide interfaces for banners and teams to access and wrangle data in the platform, as well as bring in their own data for analysis
  • It must offer a permission model that allows banner- and team-level access to the data

As a result, the initial tools for the platform were selected and the platform was developed. The output was Nucleus — the Kingfisher central data platform.

Nucleus Architecture

Figure 1 — Nucleus high level architecture

Figure 1 is a high-level diagram of the initial architecture we decided upon for the centralised data platform. We decided to go for a data ‘lakehouse’ approach (https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html). To achieve this, we chose Databricks as the platform to provide the lakehouse capabilities and selected the Azure ecosystem to complement it and meet the other requirements of the platform.

Here are some of the key technology decisions we made:

  • Azure Data Lake Storage Gen2 — provides the lake capabilities. Files can be pushed to the lake from external sources or pulled in via Data Factory. The storage account also allows us to create nested folders and apply permissions at the folder level. The lake contains all the layers of data, from landed to presented, and acts as the physical storage for the tables served within Databricks
  • Azure Data Factory (ADF) — ADF provides two key capabilities within our architecture: connectors to a variety of sources for batch pulling data (such as from SAP), and orchestration of our pipelines through its ability to invoke Spark jobs within Databricks
  • Databricks — Databricks provides multiple capabilities to the platform. First, the Apache Spark capabilities used to transform the data through the layers in the storage account until it is in the presented Delta table format, ready for consumption (a minimal sketch of this flow follows this list). The Databricks-mounted Hive metastore also provides the interface that consumers interact with to retrieve data from the platform. Finally, Databricks provides our ‘workspace’ capability, where different teams are able to load data into a controlled area, run notebooks and analyse data
  • Bicep — All of our infrastructure is defined using Azure Bicep infrastructure as code (IaC). This has allowed us to use pull requests to manage changes to our infrastructure, as well as keep our environments aligned
  • Azure DevOps — We use Azure DevOps for version control and deployment pipelines, primarily because it is well integrated with the rest of the Azure estate, particularly ADF
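
To make the Databricks flow above a little more concrete, here is a minimal sketch of a job that promotes a landed file into a presented Delta table registered in the Hive metastore. The paths, layer names, column names and table name are illustrative assumptions rather than our actual pipeline code.

```python
# Illustrative PySpark job: promote a landed CSV extract into a presented Delta
# table registered in the Databricks Hive metastore. Paths, layer names and the
# table name are invented for this sketch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided for you on Databricks

landed_path = "/mnt/lake/landed/sales/2023-03-30/"    # hypothetical lake mount
presented_path = "/mnt/lake/presented/sales_daily/"   # hypothetical lake mount

# Read the raw extract that was pushed to (or pulled into) the landed layer.
raw = spark.read.option("header", "true").csv(landed_path)

# Apply some basic cleansing and conformance before presenting the data.
presented = (
    raw.withColumn("ingested_at", F.current_timestamp())
       .withColumn("net_amount", F.col("net_amount").cast("decimal(18,2)"))
       .dropDuplicates(["order_id", "line_id"])
)

# Persist as Delta in the presented layer of the storage account...
presented.write.format("delta").mode("overwrite").save(presented_path)

# ...and expose it through the Hive metastore so consumers can query it by name.
spark.sql("CREATE DATABASE IF NOT EXISTS presented")
spark.sql(
    f"CREATE TABLE IF NOT EXISTS presented.sales_daily USING DELTA LOCATION '{presented_path}'"
)
```

In practice, jobs like this are invoked by ADF as part of the orchestrated pipelines described above.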

With this architecture, we set out to ingest a number of key sources from across the business and began to provide the first look at group-level data, allowing us to compare apples with apples across the banners for the first time. This is driving several initiatives to reduce costs and improve efficiency.

Making the platform more mature

After the initial establishment of the platform, and rapid growth both in the amount of data we were ingesting and in the number of consumers of that data, we began to look at what we needed to do to mature the platform and ensure we were providing a reliable, scalable and high-trust service to our stakeholders. To achieve this, we focused on three key areas to improve the maturity of our platform: observability, data quality and trend analysis.

Observability — Whilst we had alerts around failures, the visibility of our data platform was quite poor. To correct this, we built log forwarding into Datadog and created dashboards that allow us to easily review how different pipelines are performing and to quickly identify any errors that arise. Introducing this observability has allowed us to be much more proactive around failures, letting stakeholders know straight away and having teams investigate and resolve issues quickly.
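
As a simplified illustration of the kind of signal this gives us, the sketch below emits one structured log line per pipeline run using plain Python logging; the field names are invented for this example, and in our setup the forwarding into Datadog is handled by log shipping rather than application code.

```python
# Minimal sketch: emit one JSON log line per pipeline run so a log-based
# dashboard can chart status, volume and duration. Field names are illustrative.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("nucleus.pipelines")  # hypothetical logger name

def log_pipeline_run(pipeline: str, status: str, rows: int, duration_s: float) -> None:
    """Record the outcome of a single pipeline run as a structured log event."""
    logger.info(json.dumps({
        "pipeline": pipeline,
        "status": status,              # e.g. "succeeded" or "failed"
        "rows_processed": rows,
        "duration_seconds": duration_s,
    }))

log_pipeline_run("sales_daily", "succeeded", rows=10_342, duration_s=184.2)
```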

Data quality — Data quality is a significant topic that could warrant its own blog post. We wanted to improve the data quality controls on our assets and be able to identify when certain fields were receiving out-of-profile or incorrect data. To do this, we use Great Expectations to evaluate columns and set up fail or alert rules that allow us to be proactive around data quality issues. We can now either resolve them with source system teams or inform consumers of the data so they are able to adjust their reports accordingly.
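
For a flavour of what this looks like, here is a minimal Great Expectations style check; the columns, thresholds and sample data are invented, and the exact API differs between Great Expectations versions, so treat this as a sketch rather than our production configuration.

```python
# Minimal data quality sketch using Great Expectations' older dataset-style
# interface. Column names, thresholds and the sample data are hypothetical.
import pandas as pd
import great_expectations as ge

batch = pd.DataFrame({
    "store_id": ["SFX001", "BQ017", None],
    "net_amount": [19.99, 250.00, -5.00],
})
ge_batch = ge.from_pandas(batch)

checks = [
    ge_batch.expect_column_values_to_not_be_null("store_id"),
    ge_batch.expect_column_values_to_be_between("net_amount", min_value=0),
]

# Fail the run (or raise an alert) if any expectation is out of profile.
if not all(check.success for check in checks):
    raise ValueError("Data quality expectations failed - notify the source system team")
```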

Trend analysis — Trend analysis is another invaluable tool for providing a resilient experience to stakeholders. The premise is that if a source system typically delivers 10,000 records via a daily batch pull, we can set up checks to ensure the number of records ingested stays close to 10,000, with a +/- 20% variance as an example. If the number of records falls outside this band, we trigger alerting and investigate potential issues with the source system ingestion.
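
A simple version of such a check might look like the sketch below, comparing today's record count against a rolling baseline of recent runs; the numbers and the alerting hook are placeholders for the monitors that actually raise the alerts.

```python
# Minimal sketch of a trend check: flag any ingest whose record count falls
# outside a +/-20% band around the recent average. Values are illustrative.
def within_expected_range(todays_count, recent_counts, tolerance=0.20):
    """Return True if today's ingest volume is within tolerance of the recent average."""
    baseline = sum(recent_counts) / len(recent_counts)
    lower, upper = baseline * (1 - tolerance), baseline * (1 + tolerance)
    return lower <= todays_count <= upper

history = [9_800, 10_150, 10_020, 9_950, 10_300]   # hypothetical daily batch volumes
if not within_expected_range(7_200, history):
    print("ALERT: ingest volume outside the +/-20% band - investigate the source system")
```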

These three areas have been key to providing an enterprise-ready experience to our stakeholders and ensuring that we are always the first to identify issues and begin working on them. This is preferable to our consumers finding issues in their reports or analysis.

What is next?

With the central data platform established and supporting several group- and banner-level initiatives, the question began to arise of how we could scale this system to support all the data needs of the different banners at Kingfisher. One of the current issues is that all development on the platform needs to go through the central data team. This means the team can become a bottleneck, and data assets that might be extremely valuable to a single banner, but not others, sit lower down the data engineering team’s priority list.

Our solution to this is the federated data model. This is a model that draws inspiration from Data Mesh principles (https://www.datamesh-architecture.com/).

The federated data model aims to give banners a complete data platform stack that will allow them to ingest, transform and serve data assets relevant to their local domain. Additionally, the central data team will provide certain capabilities such as data quality tooling, data cataloguing, compliance and security, and observability. This will be done through a self-service layer into which the banners can plug their respective data platforms to receive this functionality.

We will dive into more detail on the federated model in a future blog post and share details on some of the architecture decisions and how we will support it through the use of chapters and guilds.

Thank you for reading this post. I hope you’ve enjoyed hearing about our data engineering journey so far at Kingfisher, and I look forward to coming back to share future developments.
