Democratizing Dataproc — dunnhumby’s journey on Google Cloud Platform

Jamie Thomson
dunnhumby Science blog
5 min read · Dec 2, 2019

dunnhumby are a customer-first data science company. We believe that if our clients put their customers first, they drive loyalty, and from loyalty they drive long-term sustainable profit. We have been helping our clients do this for over 25 years, since long before the term “big data” was coined.

At a high level we do two things for our clients. Firstly, we provide consultancy over the data that our clients provide. Clients ask us tactical questions such as “why am I losing very price-sensitive customers to my competitor?” and strategic questions such as “what media strategy will give me the best ROI over the next four or five years?” In each case we use data and science to provide answers. Secondly, we offer a number of products that assist in the areas of category management, price and promotions, and media.

Traditionally, our products and services were built upon proprietary, licensed software which we ran in our four data centres, where we managed ~4PB of data. This suited us well because we typically sought long, multi-year engagements with our clients, allowing us to amortize the cost of software licenses and hardware over the lifetime of those engagements.

About four years ago we realised that our business was changing: long engagements had become unpalatable to some of our prospective clients, who wanted shorter engagements and more flexibility in our offerings. We made the conscious decision to move to open source and the cloud: open source because it was cheaper and offered us much more flexibility, and cloud because buying hardware for client engagements that last six months was never going to be feasible. We called this program of work “tech renewal”, the cornerstone of which is our investment in a number of open source technologies.

Open source technologies upon which we at dunnhumby are now building our business

We primarily use Python for development and HashiCorp Terraform for managing our cloud infrastructure as code. We provide Jupyter notebooks for our data scientists, our data engineers use Apache Airflow for scheduling and workflow, and Grafana is used for monitoring visualisation. Most of all, we made a strategic bet on Apache Spark as our chosen big data processing technology.

We chose Google Cloud Platform (GCP) as our preferred public cloud provider and specifically chose GCP’s managed Hadoop, Hive and Spark service, Google Cloud Dataproc, as the basis of our data processing and data science efforts.

Four years on we are reaping the benefits of our tech renewal program. We have a modern tech stack, we have reduced our cost base, we have a more scalable platform, we have more flexibility as to where we host our clients’ data, we’re able to leverage Google’s security investments, and being on the cloud is creating new business opportunities because we are better able to integrate with partners. We are halfway through our cloud migration journey, with aspirations to eventually deprecate all of our data centres.

We have worked hand-in-hand with Google’s engineers to push the limits of GCP’s Dataproc service and have made ephemeral Dataproc clusters a central pillar of our platform. Ephemeral means that we turn clusters off when we’re not using them (or configure them to shut down automatically when idle). Google have an article explaining the architecture of ephemeral Dataproc clusters, A flexible way to deploy Apache Hive on Cloud Dataproc, from which I have lifted the following image:

Ephemeral Dataproc clusters

This architecture separates the compute resources of a Dataproc cluster from the storage of data, enabling us to turn clusters off when data processing routines are complete while the data itself persists. The key points to take away from this architecture are:

  • The Hive metastore, which contains information about the Hive databases and tables that we create, is hosted in a dedicated MySQL database and is shared by multiple ephemeral Dataproc clusters
  • The data is stored in Google Cloud Storage (GCS) buckets
  • We can tailor clusters according to the workload that we run upon them, including using labels to name the workload and thus enabling fine-grained identification of costs (see the sketch after this list)
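
To make this concrete, here is a minimal sketch of how such a cluster might be created with the gcloud CLI, using the Cloud SQL proxy initialization action described in the Google article linked above. The project, region, Cloud SQL instance, bucket and label names are all hypothetical rather than our production values, and depending on your gcloud version the scheduled-deletion flag may require the beta component:

    # Create an ephemeral cluster that shares a Hive metastore hosted in
    # Cloud SQL (all resource names below are hypothetical).
    gcloud dataproc clusters create campaign-measurement \
        --region europe-west1 \
        --num-workers 10 \
        --scopes sql-admin \
        --initialization-actions gs://goog-dataproc-initialization-actions-europe-west1/cloud-sql-proxy/cloud-sql-proxy.sh \
        --metadata "hive-metastore-instance=my-project:europe-west1:hive-metastore" \
        --properties "hive:hive.metastore.warehouse.dir=gs://my-warehouse-bucket/datasets" \
        --labels workload=campaign-measurement \
        --max-idle 30m

The --labels flag drives the fine-grained cost identification mentioned above, and --max-idle tells Dataproc to delete the cluster once it has been idle for the given duration, which is what makes the cluster ephemeral.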

In addition to using ephemeral Dataproc clusters, we also automatically increase or decrease the number of nodes in a cluster according to how heavily it is being utilised (aka autoscaling). When we started out Dataproc did not have an inbuilt autoscaling feature, so we built our own autoscaler, which we call Tidebell. Since then, and based in part on our experiences with Tidebell, Google have made autoscaling a built-in feature of Dataproc, sketched below.
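
With the built-in feature, an autoscaling policy is defined once and then attached to clusters when they are created. Below is a minimal sketch; the policy name, region and threshold values are illustrative, not recommendations:

    # Define a simple autoscaling policy (values are illustrative).
    cat > autoscaling-policy.yaml <<EOF
    workerConfig:
      minInstances: 2
      maxInstances: 50
    basicAlgorithm:
      cooldownPeriod: 4m
      yarnConfig:
        scaleUpFactor: 0.5
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h
    EOF

    # Register the policy, then reference it when creating clusters.
    gcloud dataproc autoscaling-policies import scale-with-demand \
        --source autoscaling-policy.yaml --region europe-west1
    gcloud dataproc clusters create my-cluster --region europe-west1 \
        --autoscaling-policy scale-with-demand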

Prior to introducing ephemeral clusters to our estate, we were wasting a lot of money by creating clusters with many nodes and leaving them running unnecessarily. The chart below depicts the cost profile of providing services to one of our clients:

The drop in our cost base from moving to ephemeral clusters is stark. Overnight we were able to achieve a cost saving of over 75% simply by separating compute from storage and automatically shutting down Dataproc clusters when they’re no longer being used.

As mentioned earlier, we have invested in Terraform as a means of managing our infrastructure as code. We have made available a Terraform project that deploys the infrastructure outlined in this blog post; please visit https://github.com/dunnhumby/democratizing-dataproc to peruse it. The README explains more about the purpose of the project and how to use it.

The instructions are quite simple: clone the project and issue the usual Terraform commands, along the lines of the following (consult the README for the variables you will need to supply):
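
    # Sketch of the workflow; the README describes the variables that
    # need to be supplied before applying.
    git clone https://github.com/dunnhumby/democratizing-dataproc.git
    cd democratizing-dataproc
    terraform init
    terraform plan
    terraform apply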

If you’re interested in knowing more about our journey with Dataproc and GCP then please check out the recording of my session Democratizing Dataproc, which I presented at Google’s Next conference in San Francisco in April 2019. In that session I dive into the detail of some of the concepts discussed in this blog post.
