A tale of two data warehouses — News UK’s data platform migration to Google Cloud (Part 2)

Continued from Part-1

Karthik Ramani
News UK Technology
11 min read · Aug 17, 2020


BigQuery

  • Avoiding materialisation if and where possible. The general presumption (based in part on our experience with views in Redshift) was that views would be slower for large queries and that it is therefore best to materialise them using a pipeline. The consequence is that you end up creating tens or hundreds of pipelines just to schedule and pre-populate tables, and then tinker with dependencies between them because they need to run in a certain sequence. We found this was not necessarily true for some of the pre-existing views on big data sets in BigQuery, such as ad clicks and impressions. This changed our thinking about the need to materialise, and we pushed the idea to its limits by converting all of our existing materialisation jobs into views. This has greatly simplified the design and maintenance of our DAGs, and all of our business-facing views are now ready for use as soon as the data feeds in the source tables have been refreshed. Performance has not been a problem at all, except where a single query references more than ten levels of nested views, in which case you are likely to hit a ‘resources exceeded’ error and a failed query. We managed to rewrite some of those scripts to avoid such deep nesting (which indicates a bad design in itself).

BigQuery has recently released a native materialised view capability, which was in pre-GA at the time of writing and was not considered for use in our case.
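To make the shift away from materialisation concrete, here is a minimal sketch of deploying a transformation as a plain view instead of a scheduled materialisation job, using the BigQuery Python client. The project, dataset and query names are hypothetical illustrations, not our actual objects.

```python
# Minimal sketch (hypothetical names): a transformation deployed as a view
# rather than a scheduled materialisation job.
from google.cloud import bigquery

client = bigquery.Client(project="newsuk-dw-core")  # hypothetical DW project

view = bigquery.Table("newsuk-dw-core.reporting.ad_clicks_daily_v")
view.view_query = """
SELECT click_date, campaign_id, COUNT(*) AS clicks
FROM `newsuk-dw-core.staging.ad_clicks`
GROUP BY click_date, campaign_id
"""

# Creating the view is a metadata-only operation; the query runs on demand
# whenever a consumer selects from the view, so nothing needs scheduling.
client.delete_table(view, not_found_ok=True)  # drop any previous definition
client.create_table(view)
```

Because nothing has to be pre-populated, the view is usable as soon as its source tables are refreshed, which is exactly what removed so many pipelines from our DAGs.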

  • Deciding on BigQuery project architecture — another key consideration was whether to treat the warehouse as one single BigQuery project that mimics how the existing Redshift cluster is configured for both ETL and analytics workloads, or to separate them out. More projects meant more overhead in terms of access configuration, governance and general operations, but would allow us to take advantage of BigQuery’s underlying separation of data and compute. A single project for everything meant less overhead but greater restriction in managing workload demands.

After weighing the pros and cons, and although it was not an immediate necessity, we decided to future-proof the design and created multiple projects: one for the core DW with the ELT workload, and one for each consumer group to run campaigns, analytics or data science modelling against the data in the DW project. This allows workloads to be separated and, in practice, makes governance better and simpler thanks to clear data storage and access boundaries. It also lets us scale the number of slots reserved on consumer projects up or down with varying demand by taking advantage of BigQuery Reservations.
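As a rough sketch of how that separation plays out in practice, a consumer project can query tables that live in the DW project while the compute is accounted against the consumer project’s own slots or reservation. The project, dataset and table names below are hypothetical.

```python
# Minimal sketch (hypothetical names): a consumer project queries data held in
# the core DW project. The query's compute is billed to / limited by the
# consumer project's slots or reservation, while the data stays where it is.
from google.cloud import bigquery

client = bigquery.Client(project="newsuk-analytics")  # consumer project

query = """
SELECT customer_id, SUM(revenue) AS total_revenue
FROM `newsuk-dw-core.warehouse.subscriptions`  -- data lives in the DW project
GROUP BY customer_id
"""

for row in client.query(query).result():
    print(row.customer_id, row.total_revenue)
```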

Migration tooling

  • BigQuery’s Data Transfer Service was very handy for asynchronously migrating data from the Redshift cluster to BigQuery. Once the data feed pipelines were deployed and validated, a snapshot of the data up to that point would be transferred to the corresponding destination table in BigQuery using this service (i.e. backfilling). With some custom scripting and another temporary staging layer in BigQuery, you can then insert the data into the final partitions using a date field in the source table as the partition date (a sketch of this backfill step appears at the end of this section). A big advantage of this service is that it automatically creates the destination object structures in BigQuery based on the source table/view in Redshift. The service and its UI had some limitations at the time of use, so be prepared for some manual intervention (e.g. a full roll-back in case of a single table failure, limited logging etc., unless these have been addressed in the recent GA release). It is not meant to be used as a synchronous ETL scheduler; it is mainly aimed at one-off bulk data migrations. Using it may also require network level changes, such as a VPN setup between the Redshift and BigQuery environments, to allow communication between the two. The official documentation has sufficient information to assist you with this.
  • SQL transformation pipelines from Redshift (i.e. scripts and stored procedures) had to be converted or re-written to make them compatible with BigQuery SQL. Given both are based on ANSI SQL standards there are a lot of commonalities, but there are some differences too. At the time of writing we are not aware of any tools or services that do this automatically, so the team built a SQL Compatibility Assistant as a side-project (and as a surprise present for me!) to help the BI engineering team. The service highlights SQL keywords and functions that are not compatible with BigQuery, suggests a suitable replacement and links to the official documentation page, based on a pre-populated mapping table in a Google Sheet (a minimal sketch of the idea follows the screenshot below). If you have hundreds of scripts with thousands of lines to migrate, this is a big time saver. I’m sure it could be extended to auto-convert scripts too, but the team took a sensible call not to in the interest of project timelines. The BI team used it extensively and converted their scripts much faster than they would have without it.
Redshift vs BigQuery SQL compatibility assistant
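To give a flavour of how the assistant works, here is a minimal Python sketch of the idea. The Redshift-to-BigQuery mapping shown is a small illustrative subset, not the full mapping table we maintain in a Google Sheet.

```python
# Minimal sketch of the compatibility-assistant idea: flag Redshift-specific
# functions in a script and suggest a BigQuery replacement from a mapping.
# The mapping below is a small illustrative subset only.
import re

REDSHIFT_TO_BIGQUERY = {
    "GETDATE": "CURRENT_TIMESTAMP",
    "NVL": "IFNULL",
    "CONVERT_TIMEZONE": "DATETIME(timestamp_expr, time_zone)",
    "LISTAGG": "STRING_AGG",
}

def check_compatibility(sql: str):
    findings = []
    for redshift_fn, bq_equivalent in REDSHIFT_TO_BIGQUERY.items():
        # Word-boundary match so e.g. NVL2 is not reported as NVL.
        if re.search(rf"\b{redshift_fn}\b", sql, flags=re.IGNORECASE):
            findings.append((redshift_fn, bq_equivalent))
    return findings

script = "SELECT NVL(clicks, 0), GETDATE() FROM ad_stats;"
for found, suggestion in check_compatibility(script):
    print(f"{found} is not supported in BigQuery SQL; consider {suggestion}")
```

The real assistant also links each finding to the relevant page of the official BigQuery documentation, which is what made it so useful for engineers working through large scripts.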

The actual migration of all of our scripts, data, tables and views took about 75 days. This was a record time for such a complex implementation, and all the pre-planning and design thinking done prior to the migration helped accelerate the work in this phase.
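Returning to the Data Transfer Service backfills mentioned earlier, here is a minimal sketch of the custom scripting step that moves a staged Redshift snapshot into the final date-partitioned table. All project, dataset, table and column names are hypothetical, and the sketch assumes the destination table is partitioned on the derived date column.

```python
# Minimal sketch (hypothetical names): insert data landed by the Data Transfer
# Service in a staging table into the final date-partitioned table, using a
# date field from the source data as the partition date.
from google.cloud import bigquery

client = bigquery.Client(project="newsuk-dw-core")

backfill_sql = """
INSERT INTO `newsuk-dw-core.warehouse.page_views` (view_date, customer_id, url)
SELECT DATE(event_timestamp) AS view_date, customer_id, url
FROM `newsuk-dw-core.staging.page_views_redshift_snapshot`
WHERE DATE(event_timestamp) BETWEEN '2015-01-01' AND '2020-06-30'
"""

client.query(backfill_sql).result()  # blocks until the backfill job completes
```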

Security

One of the key concerns before choosing BigQuery was whether it was safe enough to hold customer and PII data. After several consultations and internal discussions, it became apparent that security is a vast topic in itself and that there is no single, set level of security control you must implement to store customer and PII data. It is dictated by organisational requirements and guidance from your Cyber Security team, and it is usually a combination of measures including process controls, auditing controls, security awareness and training, and access controls, along with the system and network level security measures you put in place.

Based on our Cyber Security team’s assessments and recommendations, and the suite of Google products and technologies available to protect data, we became sufficiently confident that the necessary technology measures and process controls could be put in place to start storing customer data within BigQuery and Google Cloud. Security requirements vary widely between organisations, so it is best to check and implement these based on the guidance of your internal Security team and policies. Some of the key organisational and system measures that were put in place are:

  • Integration with News UK’s Corporate Okta
  • Multi-factor authentication enabled via Okta
  • Group based access to roles and permissions in IAM
  • Adherence to provisioning access based on least privilege needs
  • Reliance on default encryption of data at rest and in transit within Google’s services, and use of Google-managed keys
  • Use of authorised views, table-level access controls and column-level access controls to restrict access to customer and PII data (illustrated in the sketch below)
  • Monitoring access to sensitive data using BigQuery Data Access Audit logs

A combination of the above, plus additional processes to audit access to BigQuery and GCS, reduces the risk of inadvertent loss of data. These measures are by no means exhaustive and, given we are serious about protecting our customers’ data, we aim to review and add further security measures and capabilities periodically.
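As an illustration of the authorised-view control listed above, the sketch below creates a view that exposes only non-PII columns and then authorises it against the dataset holding the raw table, so consumers can query the view without being granted access to the underlying data. Dataset, table and column names are hypothetical.

```python
# Minimal sketch (hypothetical names) of the authorised-view pattern:
# a PII-free view is authorised to read from the source dataset.
from google.cloud import bigquery

client = bigquery.Client(project="newsuk-dw-core")

# 1. A view in a consumer-facing dataset that excludes PII columns.
view = bigquery.Table("newsuk-dw-core.reporting.customers_safe_v")
view.view_query = """
SELECT customer_id, subscription_tier, signup_date  -- no name/email/address
FROM `newsuk-dw-core.warehouse.customers`
"""
client.create_table(view, exists_ok=True)

# 2. Authorise the view against the dataset that holds the raw table.
source_dataset = client.get_dataset("newsuk-dw-core.warehouse")
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": "newsuk-dw-core",
            "datasetId": "reporting",
            "tableId": "customers_safe_v",
        },
    )
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```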

Data Governance and Compliance

Migrating our data to BigQuery opened up a lot of technical possibilities to further strengthen our compliance with GDPR and the data protection guidance of the Information Commissioner’s Office (ICO). Like security, compliance is a vast topic in itself and there are several organisation-specific measures you may need to put in place to align with GDPR’s principles and requirements. The interpretation of GDPR principles will again vary between organisations, so it is best to check and align with the requirements of your own Compliance and Data Governance teams. We mapped out some of the key requirements and addressed them through process changes and improvements, documentation or technology implementations. I aim to cover this topic in detail in a future blogpost, but this is how we addressed some of the key requirements:

  • Purpose limitation — introduced a workflow for the BigQuery access request and approval process which records purpose, expiry of access, etc., with periodic reviews of all user access. Access is provided on a least-privilege basis at the project level in IAM or at the dataset level. Table-level access has also recently been introduced, which gives even finer control over your data and supports compliance with the least-privilege principle.
  • Data minimisation and Storage limitation — where applicable, setting retention periods at the dataset level in BigQuery (especially on workpads and sandboxes) and on GCS buckets (a sketch of these settings follows this list). In addition, internal data retention schedules are adhered to using custom SQL scripts, as they involve business logic to determine data deletions.
  • Special category data — Data Catalog and DLP are being integrated with NewsFlow to identify and tag PII data, and to use metadata tags to restrict the visibility of columns with special category data to downstream users and applications. For now these columns are excluded using simple filters on the authorised views exposed to each consumer group. BI engineers, with input from the data governance team, can tag business and technical metadata against BigQuery datasets/tables in the YAML configuration files in NewsFlow, which is then synced with the metadata in Data Catalog and BigQuery labels.
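As a minimal sketch of the retention controls mentioned under data minimisation and storage limitation, the snippet below sets a default table expiration on a sandbox dataset and an age-based delete rule on a GCS bucket. The dataset and bucket names and the 90-day periods are hypothetical.

```python
# Minimal sketch (hypothetical names and periods) of retention controls:
# a default table expiration on a sandbox dataset and an age-based
# lifecycle delete rule on a GCS bucket.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="newsuk-dw-core")
gcs = storage.Client(project="newsuk-dw-core")

# Tables created in the sandbox dataset are dropped automatically after 90 days.
sandbox = bq.get_dataset("newsuk-dw-core.analyst_sandbox")
sandbox.default_table_expiration_ms = 90 * 24 * 60 * 60 * 1000
bq.update_dataset(sandbox, ["default_table_expiration_ms"])

# Objects older than 90 days are deleted from the landing bucket.
bucket = gcs.get_bucket("newsuk-landing-zone")
bucket.add_lifecycle_delete_rule(age=90)
bucket.patch()
```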

There is also official guidance on data governance capabilities from Google Cloud for reference.

Project management and execution

Approach to delivery timelines

The diagram below shows the approximate split and sequence of the key phases in months. Project timelines and sequencing were worked backwards from the end outcome and a set deadline, based on an initial assessment of the volume and scope of the migration. We set the high level roadmap as below and stayed flexible within those windows. The input required for each phase became the output the previous phase worked towards. Prior experience with migration projects and a strong grasp of the technical deliverables and engineering work help in setting approximate durations for these phases and sticking to them with a good level of confidence. All other factors were treated as variables that could be influenced: capacity, skills, people, requirements, scope, priorities, ways of working, milestones, dependencies etc.

We are pleased to say that this approach worked as planned, and we went live with the full launch of our customer data warehouse on BigQuery over the weekend of 8th/9th August 2020!

High level timeline of key project phases

Stakeholder engagement

As with most technology projects, engaging with all of your stakeholders and teams at the right time was crucial to the execution and delivery of this project, which had multiple dependencies on teams and partners, both internal and external. Below are the ones we had to work with during different phases of the project; in the diagram, amber ones are internal and blue ones are external, mainly the Google Cloud teams. At times we deliberately put the cart before the horse so that our plans and timelines would not be delayed by external dependencies, but ultimately we managed to line everything up successfully by the end of the project.

In some cases it took us longer to resolve dependencies with internal teams than with external ones, so I would suggest not underestimating the importance of early engagement with other departments in the organisation.

With our teams being geographically distributed, the logistics of getting them trained on BigQuery and on data engineering in GCP took much longer than anticipated. The training has been mostly complementary to the self-taught knowledge they had already gained by working on the platform.

The key partnerships for us revolved around:

  • Internal project stakeholders and business users — business case, agreeing on scope and getting a strong buy-in
  • Google Cloud and BigQuery product teams — product assessment, roadmaps, gap analysis, migration support and agreements
  • Service partners (if deemed required) — assess areas for support & scope
  • Compliance & Legal — contracts with Google, service partners and capturing compliance requirements
  • Cyber Security — sign-off on security requirements and high level design
  • Google Cloud Professional Services — detailed architecture and technical consultations
  • Training partners (unless internally done)
  • Tech & engineering teams — networking, DB teams, data source suppliers and consumers (internal and external)

Key benefits and outcomes of modernising our platforms

  • Simplified ELT architecture & accelerated pipeline development (50–100% faster delivery times in setting up data ingestion pipelines)
  • Around 40x improvement in the performance of data transformation pipelines. A 40M-row data set with a complex transformation script took around 2 minutes on BigQuery vs 1.5 hours in the previous implementation on Redshift with our legacy ETL tool!
  • Better monitoring, transparency and optimisation of data access, performance and costs. The team feels much more in control of these aspects by automating alerts and using real-time logging data, which has increased the reliability of the warehouse and the business’s confidence in our services.
  • Established strong foundations for both data democratisation and data governance (usually conflicting themes).
  • Having infrastructure as code (using Terraform to provision IAM, BigQuery datasets and Composer services where integrations are available) ensures these are repeatable and automated, and the ability to run integration tests during the CI/CD process using CircleCI has greatly improved our service quality.
  • A Unified Data Platforms team with a clear data culture at heart and strategy in mind to fulfil business needs and ambitions using data.

Learning and takeaways — tried and tested

  • Design and build stuff that works for the primary 70–80% of your workload and use cases (thinking of the entire 100% is likely to slow things down).
  • Establish a few high level tenets/principles at the start — these could be around purpose, methods, ways of working, technical choices or more. Stick to them throughout the project.
  • If time is money for you, be prepared to run parallel technical work streams (eg. Newsflow data ingestion design and detailed migration analysis) even if they don’t fit together logically at the time and increase the management overhead. Once you have the smaller pieces ready, assembling them later becomes easier.
  • Prioritise work streams towards building a minimum, core functional product. Separate the must-haves from the nice-to-haves but plan ahead for both. It’s easy to leave things like CI/CD strategy or metadata management until the very end, but you may not get another opportunity to build them out.
  • At any given point in time, you will be racing against Google Cloud’s and BigQuery’s product development and engineering teams. If your migration project is going to last beyond six months, stay up to date with significant developments and releases and be prepared to incorporate them into your tech stack and product plan mid-way through the project. I would highly recommend checking their product roadmaps before you decide to build something internally.
  • Set a soft and a hard deadline for the project with solid reasoning, and work backwards from them. If you can do this right from the start and keep it as your goalpost, you have a better chance of taking the most optimal route and hitting the deadline. You will be forced to make compromises along the way, and that is key to reaching a logical completion point with maximum return on investment. You can always carry on improving afterwards, but there is a tipping point after which there are more nice-to-haves than must-haves and the costs outweigh the benefits.

If there is one key takeaway, it is this: data warehouse migration projects are expensive and time-consuming. Technology is not always the driver for change but, chosen appropriately, it can be a great enabler in shifting the culture and mindset of organisations and teams that are serious about being data-driven. Seize the opportunity of a migration project to also modernise your data landscape and enable the business to get better and faster value from your data.

Thank you for reading. Please feel free to provide your feedback below or via LinkedIn.
