A tale of two data warehouses — News UK’s data platform migration to Google Cloud (Part 1)

Karthik Ramani
Published in News UK Technology
11 min read · Aug 10, 2020

News UK is a British newspaper publisher whose titles include The Times, The Sunday Times and The Sun. In addition, News UK owns the Wireless Group (Virgin Radio, Talk Sport and the newly launched Times Radio) and a diversified range of products and loyalty programmes such as The Times Expert Traveller, The Sunday Times Wine Club, Sun Savers, Sun Holidays and Sun Bingo.

This broad range of digital content and service offerings generates massive amounts of first-party customer data, along with the several billions of interactions that these readers and users generate every day. This data helps News UK learn from and respond effectively to the ever-growing demand from customers for even better content, products and services.

The two data warehouses

News UK’s data warehouses do the heavy lifting of ingesting, integrating and providing a simple 360-degree view of our customer journeys and interactions. They were distributed across two separate cloud platforms: AWS’s Redshift (customer demographics, subscriptions, registrations) and GCP’s BigQuery (on-site interactions and transactions). This distributed set-up was mainly a consequence of evolving from batch-only, traditional data warehouse (DW) workloads to more advanced analytics and data science applied to data with larger volumes (~50x larger), more varied sources (e.g. APIs, Google Analytics) and higher velocity (real-time streams). The evolution started with Redshift for the DW workload; BigQuery was later added to the landscape for the newer use cases, and we ended up with a dual ecosystem of database technologies and data processing tooling.

Building the business case for consolidation

Having two technology platforms for the same purpose was not sustainable for an ambitious, fast-growing business and for a technology department that was increasingly cost and efficiency conscious. Some of the key issues with this model were:

Technology & delivery issues

  • Business agility: a cloud-based data warehouse platform on Redshift, but run with an “on-premise” mindset (e.g. slow and limited code release cycles, weeks of planning to upscale clusters). This was partly owing to the way it was implemented, and partly due to some of Redshift’s mechanisms (e.g. manual snapshot/restore to copy data from the prod to the dev cluster) and the ETL tool used at the time, all of which slowed down time to market.
  • Distributed data: two data technology platforms operating in isolation, one handling batch-based, customer-related ‘small’ data (owned by the BI engineering team) and the other handling batch plus real-time, customer interaction and behavioural ‘big’ data (owned by the data engineering team). Two underlying cloud platforms, one for each type of data (AWS for batch-based sources, Google for real-time and API-based sources), with no cross-utilisation of architectural implementations and practices.
  • Operational overhead: Two different delivery models aligned to individual technology platforms rather than business usage and demand.
  • Duplicated costs in terms of cloud and technology spend, people and skills.

Business consumption issues

  • The key problem for our data scientists, analysts and business users was the inability to consume customer data alongside customer behavioural data from a single trusted source.
  • Very often, data either had to be joined up externally within reporting tools like Tableau, creating redundant data integration layers outside the warehouse, or transferred between the two platforms, which required additional pipeline development and maintenance.
  • Usage characteristics of platform A (Redshift): reasonably good data modelling and alignment with DW principles, worked well for batch pipelines, relatively high administration outsourced to a central database team, not easily scalable when required, and it generally forced us to be less agile.
  • Usage characteristics of platform B (BigQuery): much lower administration, easily scalable, well suited to both real-time and batch pipelines, better and native integration with streaming and API-based data sources, but not much data modelling done yet and unproven for DW-like workloads, as most of the data used so far was transactional in nature. This meant that if we decided on BigQuery as the go-to solution, we would also have to revamp and improve the existing implementation within BigQuery/GCP.

Building the strategy — best of both worlds

We could see straight away that we could kill two birds with one stone, solving both the technology problem and the unified data consumption problem by having a single trusted source of all customer data, serviced and maintained by one data platforms team on one data technology stack. This was the business case presented and the challenge we took up to solve using Google Cloud Platform and BigQuery.

We set out on a mission to build:

“A well-modelled and governed, scalable and secure data platform, low on maintenance, suitable for both batch and real-time data sources, and fit for complex and creative data workloads not bounded by limitations of performance, scalability or access”

Google Cloud & BigQuery to the rescue

Around this time I had personally attended at least two Google Next conferences and, thanks to the News UK Google Cloud account management team, met a number of Google’s product managers and technical specialists. We used these conversations to get our questions answered about BigQuery’s ability to handle traditional DW-type workloads, and to establish the right set of GCP services to build around BigQuery so that we could have an end-to-end data platform within GCP without relying on an external ETL/ELT tool.

With the fast pace of developments and releases in BigQuery (such as the MERGE statement, scripting, enhanced DML capabilities and performance improvements) and some new Data & Analytics (now called ‘Smart Analytics’) services on the roadmap (Data Fusion, Data Catalog, DLP etc.), we came away both excited and convinced that we were working with the right cloud technology partner to achieve our objectives.

A quick chat with one of BigQuery’s co-founders was so insightful that it made me realise that moving to something like BigQuery requires a switch in mindset about how you do data warehousing in the cloud, and especially on BigQuery.

Key Reasons for migrating to BigQuery

A quick SWOT-like analysis helped us establish our current situation so that we could prepare better for the journey ahead:

Driving Principles

  • Aim for zero or low admin overhead
  • Focus on current scope of migration, but build the ecosystem for the future
  • Abstract away data engineering for business intelligence engineers
  • Build over buy (given possibilities using GCP services)
  • Lift and shift (unless warranted by any BigQuery limitation)

Constraints

  • Time and Budget
  • Skills availability in market and on-boarding team members to GCP
  • Maintaining business continuity with existing deliverables and initiatives through the migration period

Unknowns

  • To-be system architecture and GCP services to use for ELT operations
  • Suitability of BigQuery for specific DW operations (eg. SCD)
  • Complexity and duration of implementation and migration

Opportunities

  • Strong support from senior leadership within News UK Technology
  • Access to Google Cloud experts
  • Clear outcome and end goal in mind
  • A highly skilled business intelligence and data engineering team

With the above mix of elements, we embarked on a journey with a clear end state in mind and an uncharted path to get there. Although several other organisations have probably solved a similar problem before, there were no clear-cut answers for the specific situation we were in and where we wanted to get to. We had to build a custom solution to a custom problem.

The challenge ahead

The scope of most standard DW migrations includes the data pipelines (aka ETL jobs) and the data itself (historical, and refreshed up to the switchover date). The other key challenge we had here was replacing the ETL tool: we intended to move to Google-native services, mainly for orchestration of the pipelines, and to use the power of BigQuery for data cleansing, SQL transformations and scripts.
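To make that split concrete, here is a minimal, hypothetical sketch of the ELT pattern: orchestration only triggers the work, while cleansing and transformation run as a BigQuery SQL script. The project, dataset, table and column names, and the transformation itself, are illustrative assumptions rather than the actual News UK implementation.

# Minimal ELT sketch: push cleansing and transformation down into BigQuery.
# All project/dataset/table names below are hypothetical examples.
from google.cloud import bigquery

client = bigquery.Client(project="example-data-platform")  # hypothetical project

# A BigQuery script: raw data is already loaded; cleansing happens in SQL.
transform_sql = """
DECLARE run_date DATE DEFAULT CURRENT_DATE();

-- Cleanse and reshape the raw layer into a curated table (illustrative columns).
CREATE OR REPLACE TABLE curated.subscriptions AS
SELECT
  TRIM(subscription_id)         AS subscription_id,
  SAFE_CAST(start_date AS DATE) AS start_date,
  LOWER(status)                 AS status,
  run_date                      AS load_date
FROM raw.subscriptions_landing;
"""

client.query(transform_sql).result()  # the whole script runs server-side in BigQuery

The orchestrator's job then reduces to scheduling and sequencing calls like this, rather than doing any data processing itself.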

Moving from A to B

ETL (/ELT) Tooling

Cloud Data Fusion on Google Cloud (built on the open-source CDAP project) had just been released in beta and was a strong contender for our preferred ELT tool. It addressed one of our key principles of abstracting away the data engineering layer from the BI engineers, letting them focus on building pipelines, transformations and data models without having to code. However, because the forecast General Availability (GA) date did not align with our project timelines, we had to look for another option: probably a more difficult route, but one we would have more control over. That option was Cloud Composer.

Note: Data Fusion is now generally available; you may want to review its latest capabilities if you are considering it for your migration.

We were quite clear that we did not want to use Cloud Composer in its vanilla form. That would have required BI engineers to program in Python, which all of them were eager to do, but it would have eaten into their BI development lifecycle and workload. We wanted to make configuring data ingestions and transformations, and deploying them, as easy and quick as possible, and that is exactly what we decided to build: a wrapper on top of Composer that allowed BI engineers to configure all pipelines with simple metadata in YAML files. These YAML files could then be converted to Python files using a converter service (surfaced through a simple UI) and deployed as standard Composer Python files using CircleCI, the tool of choice for CI/CD at News UK. We called this wrapper ‘NewsFlow’.
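As a rough illustration of the pattern (not the actual NewsFlow code), the sketch below assumes a made-up YAML schema and uses the standard Airflow Google operators; the real converter, configuration keys and operator choices will differ.

# Sketch of the NewsFlow idea: a YAML pipeline definition converted into a
# standard Composer (Airflow) DAG. YAML keys, names and operators here are
# illustrative assumptions only.
import yaml
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

EXAMPLE_YAML = """
pipeline: subscriptions_daily          # hypothetical pipeline config
schedule: "0 6 * * *"
load:
  bucket: example-landing-bucket
  source_objects: ["subscriptions/{{ ds }}/*.csv"]
  destination: example-project.raw.subscriptions_landing
  source_format: CSV
transform_sql: |
  CREATE OR REPLACE TABLE curated.subscriptions_latest AS
  SELECT * FROM raw.subscriptions_landing
"""

def build_dag(config: dict) -> DAG:
    """Convert one pipeline config into an Airflow DAG (ingest, then transform)."""
    dag = DAG(
        dag_id=config["pipeline"],
        schedule_interval=config["schedule"],
        start_date=days_ago(1),
        catchup=False,
    )
    with dag:
        load = GCSToBigQueryOperator(
            task_id="load_to_bigquery",
            bucket=config["load"]["bucket"],
            source_objects=config["load"]["source_objects"],
            destination_project_dataset_table=config["load"]["destination"],
            source_format=config["load"]["source_format"],
            write_disposition="WRITE_APPEND",
            autodetect=True,
        )
        transform = BigQueryInsertJobOperator(
            task_id="transform_in_bigquery",
            configuration={"query": {"query": config["transform_sql"], "useLegacySql": False}},
        )
        load >> transform
    return dag

# Composer simply parses the generated module; in this pattern the converter
# service would emit a file like this for each YAML config.
dag = build_dag(yaml.safe_load(EXAMPLE_YAML))

In this set-up, the converter service generates one such Python module per YAML file and CircleCI deploys it to the Composer environment, so BI engineers only ever touch the YAML.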

Looking back: it took several weeks of research and brainstorming to evaluate potential tools and services, and finally a couple of days (and nights) of hacking to prove that Composer with a functional programming wrapper was the best choice and a feasible option for in-house development. So, if in doubt, hack it out.

NewsFlow’s capabilities

BigQuery

After our initial consultations and further prototyping on BigQuery, we formed the hypothesis that BigQuery would be as suitable for DW/ETL workloads as it is for big data and OLAP-style workloads (and we were prepared to be proven wrong). This meant we did not see the need to re-engineer any of our existing pipelines and data models and could “lift and shift” them in the broader sense of the term. We did, however, want to use the opportunity to simplify and innovate where possible to make the best use of some of BigQuery’s unique capabilities.

Some areas where we’ve modified or improved our existing implementation:

  • Minimal file cleansing is done, only to the extent of making a file suitably structured to load into a BigQuery table; any further data cleansing is deferred until after the load. Pre-cleansing is achieved using Python libraries (built into NewsFlow as a feature for configuring standard cleansing steps without any coding), similar to and inspired by DataPrep’s “wrangler” functionality. No column or record filters are applied, as we did not see any reason to limit data volumes, mainly because of BigQuery’s low storage costs and the scalable processing power of Dataproc for cleansing larger files (the local Composer environment was used as the default for files < 1 GiB, although this is not strictly recommended).
  • Replaced the SCD 1/2 change-detection mechanism with BigQuery partitions to capture and track history. This massively simplified the data loading mechanism: no more MD5 record and column comparisons to determine a new record vs a changed record, which take up a lot of processing power as destination tables accumulate history. Each day’s file is pre-configured to be ingested into the relevant daily partition, which by design captures history, and there is no need to perform UPDATEs (a concern at the time because of the per-project/day limits, which have since been removed). A view has been created for each table to always surface the latest record for a given natural key, which is the most common use of a source table in 95% of use cases (a sketch after this list illustrates the pattern); query performance of these views has been fine so far. Load date is used as the partition date, and clustering columns are used where possible. Obviously, this meant we would only accept incremental files from our data suppliers and would not track deletions; we only had a couple of full data feeds, which were handled separately as exceptions. BigQuery’s MERGE statement is an alternative option to explore; we tried it and skipped it due to a then limitation where the statement did not behave in a completely transactional way. This may have since been addressed, and it is another good way to handle change data capture.
  • In the Redshift world, a snapshot of the production cluster was taken monthly and restored onto the UAT/Dev cluster for pre-deployment tests and validations. This involved administration activity and felt cumbersome, to say the least. While thinking about replicating this on BigQuery, we were kindly reminded of BigQuery’s powerful cross-project querying (think of cross-instance queries in traditional RDBMSs, which were either not possible or complex to set up). You can query the data source layer in the production project directly from the UAT/Dev project without moving an iota of data across; this is where BigQuery’s separation of storage and compute magically saves the day. No more painful database replications or copying and restoring of snapshots (also shown in the sketch after this list). Thanks, Jordan Tigani, for the simple yet mind-shifting tip!
  • Use of BigQuery’s powerful native nested structures to ingest JSON files. Previously, these types of files had to be flattened out before being ingested into a flattened table structure. Using native nested data types (arrays and structs) in BigQuery means API-based data sources can be made available to business users and analysts much more quickly (see the same sketch below). This is a game changer, as most of our new data sources tend to be API based.
  • Avoid materialisation if and where possible. The general presumption (based partly on our experience with views in Redshift) was that views would be slower for large queries, and hence it is best to materialise them using a pipeline. Consequently, you end up creating tens or hundreds of pipelines just to schedule and pre-populate tables, and then tinker with dependencies between them because they need to run in a certain sequence. We found this was not necessarily true for some of the pre-existing views on big data sets in BigQuery, such as ad clicks and impressions. This changed our thinking about the need to materialise, and we pushed it to the limit by converting all of our materialised jobs into simple views. This has brought a lot of simplicity to the architecture and its maintenance. All of our business-facing views are now ready for use as soon as the data feeds in the source tables have been refreshed. Performance has not been a problem at all, except where a single query references more than ten levels of nested views; there you are likely to hit a ‘resources exceeded’ error and a failed query. We managed to rewrite some of our views to avoid that depth of nesting.
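To make three of the ideas above more tangible (the partition-plus-latest-record view, cross-project querying from a dev project, and loading JSON with native nested types), here is a single illustrative sketch. All project, dataset, table and column names are hypothetical, and the SQL is simplified rather than the exact production code.

# Illustrative sketch only: hypothetical names, simplified SQL.
from google.cloud import bigquery

# Assume the dev/UAT project; prod data is referenced by its fully qualified name.
client = bigquery.Client(project="example-dev-project")

# 1) History via daily partitions, with a view surfacing the latest record
#    per natural key (instead of SCD-style MD5 comparisons and UPDATEs).
client.query("""
CREATE TABLE IF NOT EXISTS raw.customer (
  customer_id STRING,
  email       STRING,
  status      STRING,
  load_date   DATE
)
PARTITION BY load_date
CLUSTER BY customer_id;

CREATE OR REPLACE VIEW curated.customer_latest AS
SELECT * EXCEPT (rn)
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_date DESC) AS rn
  FROM raw.customer AS c
)
WHERE rn = 1;
""").result()

# 2) Cross-project querying: read the prod source layer directly from dev,
#    with no snapshot/restore or data copy required.
rows = client.query("""
SELECT COUNT(*) AS customers
FROM `example-prod-project.raw.customer`
WHERE load_date = CURRENT_DATE()
""").result()

# 3) Native nested structures: load newline-delimited JSON as-is, then UNNEST.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                  # arrays/objects become REPEATED/RECORD fields
    write_disposition="WRITE_APPEND",
)
client.load_table_from_uri(
    "gs://example-landing-bucket/api_feed/2020-08-01/*.json",
    "example-dev-project.raw.api_events",
    job_config=job_config,
).result()

client.query("""
SELECT e.event_id, item.product_id, item.quantity   -- hypothetical nested fields
FROM raw.api_events AS e, UNNEST(e.items) AS item
""").result()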

More on migration, project execution and key takeaways are in Part 2


Karthik Ramani
News UK Technology

Head of Data Platforms, NewsUK; travel and nature enthusiast.