Data Engineering for Public Good: How we kickstarted a public agency’s transformation towards a modern data stack on cloud

Colin
AI Practice and Data Engineering Practice, GovTech
17 min read · Nov 30, 2022

Introduction

Hello, I am Colin Ng, a Data Engineer in the Data Engineering team at DSAID (Data Science & Artificial Intelligence Division), GovTech.

At the Data Engineering team in DSAID, we do two main kinds of work:

1. Building Data Products / Platforms that are targeted at Whole-Of-Government (WOG) agencies and users. For example:

  • Analytics.Gov, an in-house data analytics exploitation platform
  • enTRUST, a platform to facilitate public-private sector data exchange and exploitation
  • enCRYPT, data privacy and anonymization toolkit
  • Data Infrastructure in a Box (DIAB), using Infrastructure as Code (IaC) to help agencies set up secure and compliant data infrastructure leveraging on our Government on Commercial Cloud (GCC) services.

2. Agency Projects and Consultancy work. For example:

  • Design new, or upgrade existing, data infrastructures (e.g shifting from on-premise to cloud infrastructure)
  • Run proof-of-concept work to solve specific agency use-cases
  • Provide general strategy, consulting, and propagation of best practices

Data Platforms vs Agency Projects

Platforms focus on providing Data Engineering related capabilities to as many agencies and Whole-Of-Government (WOG) users as possible, while agency projects deep dive into complex agency use-cases which existing products cannot support.

We will talk about these data platforms in future blog posts! For now, I will be writing a series of blog posts to explain how we kickstarted a public agency’s transformation towards a modern data stack on cloud.

The series of blog posts will be broken down into three parts:

1. Part 1 provides an overall intro, focusing on the use cases, strategy, and design decisions behind the project

2. Part 2 will focus on the technical infrastructure, architecture, and tooling that were considered and implemented

3. Part 3 further dives into our adventures on deploying open-source components, as well as implementing IaC and CI/CD processes

Sneak peek at the final architecture diagram below (which we will discuss more in part 2):

Architecture Diagram — Phase 1

Part 1 — The Strategic Thinking and Design Decisions of an Agency’s Data Engineering Project

TLDR — An intro to a Data Engineering project at a public agency, and the thought process behind how we ended up adopting Data Mesh as part of our strategy

Background

Agency X wanted to build a brand new data warehouse on the Government on Commercial Cloud (GCC) service to replace their existing on-premise database, and this would become a central piece of their data infrastructure moving forward.

However, as the agency lacked data engineering expertise, they engaged us (DSAID Data Engineering team) to develop a proof-of-concept (PoC) where we would:

1. Design and implement a modern and scalable data infrastructure on GCC AWS

2. Work with a given dataset to demonstrate what the ingestion, transformation, and exploitation process would look like

3. Provide hands-on data engineering expertise for knowledge transfer, allowing them to run their data warehouse and pipelines, and extend their data model post-PoC

The overall vision is to design and create a data infrastructure that would enable the agency and its officers to make faster and better data-driven decisions and policies.

How we typically decide on what projects to take

Within our Data Engineering team at DSAID, we use the following in-house five-pillar framework to drive and develop data engineering projects at government agencies.

Our in-house framework for Data Engineering projects

The five pillars of the framework serve as an important guide for us to assess the data readiness and maturity of an agency. They cover both the business and technical aspects necessary to ensure that a project succeeds.

Assessment and Evaluation of Agency X’s readiness for the proposed project

1. Sponsors and Use Cases: There was clear senior management support (both the Chief Data Officer and Chief Information Officer of the agency were on board the project), and the use cases were well-defined and of high impact to the organisation.

2. People and Resources: The agency had well-established IT and Data Governance teams who already had experience working on other projects on-premise and a few small data science PoC projects on GCC. They also had Forward Deploy Teams (FDT) Engineers who were able to provide additional engineering expertise, domain expertise, and support for the project.

3. Processes and Data: The agency’s data governance team could help us to answer questions regarding the various data processes and governance measures that needed to be done, while the FDT engineers could also provide domain expertise and knowledge on the data and relevant processes.

4. Infrastructure and Architecture: The agency already had experience with spinning up GCC environments and had a central account that was used for replicating data. The project had no other existing requirements or dependencies, making it easy to build from scratch or extend the existing architecture.

5. Tools and Techniques: Other than needing to deploy Tableau, the agency was flexible to explore other new tooling and products. They also had the engineering resources to develop their own CI/CD and IaC pipelines, which helped greatly with the overall project.

Overall, we assessed this agency to have relatively high data maturity and readiness. On top of that, their proposed project had a very clear scope, use case, and value. Thus, we decided to embark on this project.

Gathering use cases and requirements to formulate an overall strategy

The first step was to gather information and formulate user stories and use cases that would help us meet our end objectives and deliverables. We did this by interviewing multiple stakeholders in the project, starting with the high-level vision, goals, and objectives, before filtering down into lower-level details and considerations, and then formulating our overall strategy and architecture based on those. ¹

High Level Goals

From our interviews and discussions with various stakeholders, we were able to synthesise the following high-level goals:

  • Design a modernised data infrastructure on GCC that can adapt to their numerous use cases and requirements
  • Solve user data journey pain points by ensuring that data can be easily and quickly requested, accessed, shared, and exploited by end users within the agency while still adhering to data governance standards
  • Ensure compliance with data governance standards and policies by empowering the data governance teams with tooling and/or designing systems
  • Consolidate and unify their multiple datasets and models to enable and support cross-domain analysis and use cases
  • Solve data quality issues at scale by recommending data modelling or other solutions
  • Scale all the above given the agency’s engineering and resource constraints

Design Decisions

With the high-level goals and overall vision as our anchor, we gathered various low-level use-cases and requirements and consolidated them according to the five-pillar framework above. This allowed us to get a good overview of the various issues at hand and the key design decisions we needed to make in different areas of the project.

A. Infrastructure and Architecture

  • Which reference architecture to follow (e.g Analytical Data Warehouse + BI, Data Lake / Lakehouse, ML-Centric infrastructures; Operational vs. Analytical use cases and infrastructures)
  • Any existing architecture, requirements, or downstream dependencies to take note of
  • Consideration between managed components vs. open-source (e.g Price, flexibility / vendor-locked, ease of maintenance)
  • How best to design a system that can intake and consolidate the multiple domains of datasets without becoming yet another data swamp with multiple data silos

B. Centralised vs Decentralised vs Hybrid approach to Data Models and Processes

  • How to build rigorous data governance and data quality controls into the data models and processes without overburdening the data governance team
  • How to enable end users to quickly perform analysis across different domains of datasets
  • How best to solve dimensional modelling and semantic / metrics layer issues around datasets
  • How best to approach data quality issues (e.g using tooling like dbt tests, Great Expectations, Pandera, or utilizing other frameworks)
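
For a flavour of what such data quality checks involve, here is a minimal sketch using plain Python assertions as a stand-in for what tools like dbt tests or Great Expectations automate. The records and column names are hypothetical:

```python
# Minimal data quality checks, standing in for what tools like
# dbt tests or Great Expectations automate. Records are hypothetical.

def check_no_nulls(rows, column):
    """True if no row has a missing value in the given column."""
    return all(row[column] is not None for row in rows)

def check_unique(rows, column):
    """True if the column contains no duplicate values."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

records = [
    {"case_id": "C001", "status": "open"},
    {"case_id": "C002", "status": "closed"},
]

assert check_no_nulls(records, "status")
assert check_unique(records, "case_id")
```

The dedicated tools add scheduling, reporting, and declarative configuration on top, but the underlying idea is the same: encode expectations about the data and fail loudly when they are violated.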

C. Manpower, Skillset, and Resource Considerations

  • Does the agency have the necessary resources to develop, maintain, and scale the infrastructure post-project?
  • Do the end users require training and up-skilling to use the desired products (e.g SQL, Python, Tableau)?

D. Tooling and Techniques

  • Which specific tooling, services, or components to use to best address various pain points and requirements (e.g AWS managed services vs. SaaS / Enterprise platforms vs. Open-source Solutions)
  • Which existing tooling were end users familiar with, and whether to improve, extend, or revamp their existing tooling stack and workflows

Proposed Strategy and Architecture

A. Infrastructure and Architecture

A1. Implement a modern data stack in GCC AWS — with modularized components that allow for best-in-class tools and solutions while providing flexibility

  • Architecture follows a typical analytical DWH with BI archetype (e.g Redshift with Tableau) as this suited the agency’s use cases and datasets best (mainly analytical use-cases)
  • Choose components which prevent unnecessary vendor lock-in and provide flexibility in the choice of components to solve current and future needs
  • Creation of a modernised tech stack also helps with talent acquisition

A2. Adopt a mainly decentralised, data mesh approach to the data architecture and models, with a centralised model only for a small set of core datasets (more on this in the later part of this post)

  • Engineering efforts can be focused on building up the infrastructure to enable self-serve and federated computational governance (two of the principles of data mesh)
  • The decentralised model suited the agency’s use case better — especially given their multiple domains of datasets and fast-changing use-cases, which required flexibility over standardisation
  • The centralised model is reserved for a small set of core datasets that are consistently used across multiple teams

B. Processes and Data

B1. SQL as the correct layer of abstraction as opposed to Tableau, Python or R

  • Existing suite of tools (Tableau, Jupyter Notebook / R Notebook) were insufficient to address existing use cases
  • Tableau was more suited for data visualisation, whilst the Jupyter / RStudio notebooks were generally used for more advanced use-cases. Further, it was difficult for beginners to write good code, and hard for non-technical folks to read and understand the business logic within the Python / R scripts
  • SQL was able to provide a middle ground, was better for general data purposes, and could be used as a common layer for data modelling between business users and central engineers
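
To make that middle ground concrete, here is a minimal sketch of the kind of business logic that reads naturally in SQL. Python’s built-in sqlite3 stands in for the actual warehouse, and the table and column names are purely illustrative:

```python
import sqlite3

# sqlite3 stands in for the actual warehouse here; the table and
# column names are illustrative, not the agency's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (dept TEXT, approved INTEGER)")
conn.executemany(
    "INSERT INTO applications VALUES (?, ?)",
    [("housing", 1), ("housing", 0), ("transport", 1)],
)

# The business logic is a single declarative query that a
# non-engineer can read and review line by line.
rows = conn.execute("""
    SELECT dept,
           COUNT(*)      AS total,
           SUM(approved) AS approved_count
    FROM applications
    GROUP BY dept
    ORDER BY dept
""").fetchall()

print(rows)  # [('housing', 2, 1), ('transport', 1, 1)]
```

The same aggregation written as an imperative script would bury the business logic in loops and dictionaries; expressed as SQL, it doubles as documentation that both business users and central engineers can review.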

B2. Incorporating best ideas, practices, and frameworks from other tech companies to solve data quality and governance issues

  • Advocating Domain Ownership and Data as a Product principles (the other two principles from Data Mesh)
  • Adapting ideas and frameworks such as Data Contracts, Data Quality frameworks (Airbnb, Monte Carlo), Data Metrics frameworks (Airbnb), and Semantic Layers into our own data processes

B3. Teach and apply data modelling techniques where applicable to improve the quality of downstream data models

  • Teach end users basic data modelling (e.g Kimball) as they are the domain experts
  • Apply data modelling techniques to source datasets where applicable
  • Apply SCD2 where necessary to retain historical snapshots and to reproduce important reports made to management
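
As a rough illustration of the SCD2 point above, the sketch below closes the current version of a changed record and appends a new one, so historical snapshots can always be reproduced. The record structure and field names are hypothetical:

```python
from datetime import date

# Simplified SCD Type 2: instead of overwriting a changed row, close
# the current version and append a new one, preserving history.
# The key and attribute names are illustrative only.
def scd2_update(history, key, new_attrs, today):
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == new_attrs:
                return history          # no change, nothing to do
            row["valid_to"] = today     # close the old version
    history.append({"key": key, "attrs": new_attrs,
                    "valid_from": today, "valid_to": None})
    return history

history = [{"key": "E001", "attrs": {"grade": "A"},
            "valid_from": date(2021, 1, 1), "valid_to": None}]
scd2_update(history, "E001", {"grade": "B"}, date(2022, 6, 1))
# history now holds both versions: the old row closed on 2022-06-01,
# and a new current row with grade "B".
```

In practice this would be expressed as a dbt snapshot or a MERGE statement in the warehouse rather than in application code, but the bookkeeping is the same.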

C. People and Resources

C1. Tapping into existing whole of government (WOG) resources and training to upskill general public officers

  • This works together with B1. on SQL as the correct layer of abstraction

C2. Creating valuable skillsets for officers to learn, and providing a platform where they can apply those skillsets

  • Provide the necessary platform for officers to utilize the skills they have learned from WOG data upskilling initiatives
  • Create highly desired skillsets and roles to attract the next generation of fresh graduates and new hires

D. Tooling and Techniques

D1. Strategic bets on specific tooling like dbt and Amundsen, which deliver outsized value and reduce operational workload

  • GCC constraints meant that we had to dedicate engineering resources to implement this tooling ourselves, as we were unable to onboard the SaaS equivalents
  • dbt as a tool incorporates many best practices in DE (there are many great articles and discourse on how to use dbt to tackle common DE issues); it is a better approach than re-inventing the wheel each time we do ETL/ELT processes, and is more aligned with the data mesh goal of letting end users build their own data pipelines and models
  • Amundsen reduces the operational workload of domain owners and data governance folks by automatically pulling in metadata information, and allows end users to easily search and populate data column information on their own; this gels with the overall idea of allowing users to self-serve

D2. Front-loading engineering effort into developing rigorous IaC and CI/CD processes to build secure and anti-fragile systems²

  • Heavier engineering work was done upfront to figure out how to link the various infra-app-data processes together into a seamless IaC and CI/CD framework; rather than over-documenting, we spent the time and effort building systems and processes instead
  • Antifragile — failed deployments and errors are corrected in our IaC and CI/CD processes, ensuring that future deployments do not face the same issues; these processes give us the confidence to quickly build, test, and deploy new changes

Deep Dive into Data Mesh Strategy

NB: Readers are encouraged to read up on what data mesh is before proceeding with this section.

In this section, we will go through the key points that led to our decision to adopt the data mesh strategy for the project, some sample ideas and example implementations, and the considerations behind keeping a centralised model for some datasets.

Reasons that led to adoption of Data Mesh Strategy

1. There was an over-reliance on the central teams (engineering, data governance) and service requests to vendor systems to address data issues. A lot of these issues were operational in nature, and this prevented the central teams and vendor source systems from focusing on more important higher-level issues.

  • The existing system was also a bottleneck, as the central team could not scale faster than the requests coming in from users
  • Users either had to wait a long time for simple data fixes, or had to manually incorporate fixes into their own workflows, which was not sustainable
  • This was exacerbated by the fact that requirements were generally fast-changing, and speed in answering queries was important

2. Scalability and resourcing constraints — the lack of central resources (engineers and data governance officers) meant that a centralised model would not scale

  • Centralised Data Model scales linearly (think one engineer and one data governance officer for every few domains of datasets, not including redundancy), which was not suitable for the large number of datasets that the agency owned
  • Decentralised Model allows the central engineering team to focus on building a platform and infrastructure to enable and empower end users
  • Decentralised Model allows the data governance team to focus on figuring out how to implement high-level data governance policy and framework, which are then disseminated to the domain team’s data governance stewards and owners to follow

3. Organisationally, teams and departments were already arranged into broad domain groups, each with their own datasets and source systems handled by a variety of business and data analysts. However, the individual data systems existed in silos with little central oversight or incentive to gel well with other systems. This made cross-domain use cases messy and difficult to implement.

  • Typically, data produced by a domain’s own source systems is also consumed by its own users. There is little incentive for domain owners and data stewards to promptly respond to data requests outside of their own use-cases, or to keep their metadata information updated
  • The existing system was a bunch of data silos with no central unifying data catalogue or system, making data hard to find, access, and consume for end users in other domains
  • As these teams / departments were already managing their own data systems, there was no need for us to redesign the upstream operational systems (and it was difficult for us to touch those systems anyway) — instead, we could focus on creating a platform to “centralise” those data silos and work on downstream analytics use cases (e.g. focusing on consolidating the data from upstream systems, transforming, cleaning and refining them, and allowing downstream users to easily access and exploit these data)

What the data mesh principles would look like, and how they would help the organisation achieve its goals

1/ Domain Ownership — Creates alignment and demarcates clear responsibilities, reducing the business and operational workload on central teams

  • Currently, domain teams are already in charge of their own relevant source data systems. However, there is a general reliance on the central engineering, data governance, and vendor systems for both small and big fixes, which inherently does not scale. For example, minor updates to data tables, data modelling and transformation, and data access and approval are currently being done by the central teams.
  • Domain ownership advocates that domain owners and their teams should take charge of these responsibilities. The central engineering and data governance teams need to create the necessary infrastructure and framework to enable the domain teams to do this (see the following two points on self-serve data platforms and federated computational governance), as well as provide the necessary training and support for them to utilize these infrastructures effectively

2/ Data as a Product — Ensures that data tables and artefacts are of high quality and standards for use across the agency

  • Data as a product principle complements domain ownership and brings this one step further by ensuring that any data artefacts (e.g data marts, Tableau workbooks, and other metadata information) produced are of high quality and ready for consumption by other domains / end users
  • One proposed strategy was to use bronze-silver-gold curated table tiers (i.e the Medallion Architecture), which provide a clear separation between raw data, intermediate data and models, and curated data marts. The benefit of adopting this medallion architecture is that we could immediately use existing tables and data as-is (for existing operational needs), while clearly demarcating data-as-a-product-ready (gold-tier) data mart tables to strive towards
  • “Gold-tier” data mart tables should have their metadata information populated, follow naming conventions, adhere to basic data modelling, and have SLAs and table owners specified. Enforcement of this principle is further discussed in the federated computational governance section below
  • Importantly, for both domain ownership and data-as-a-product, there must also be management support and buy-in as training needs to be done and additional time dedicated for the domain team to be able to do their own data modelling and update their metadata information to deliver high-quality data products. Incentives also need to be aligned to ensure that good data artefacts are clearly rewarded
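
As a small illustration, the tier demarcation can be as lightweight as a schema naming convention. The prefixes and table names below are assumptions for illustration, not the agency’s actual convention:

```python
# Tier demarcation via naming convention; the prefixes and table
# names are illustrative assumptions, not the agency's real scheme.
TIER_PREFIXES = {
    "bronze_": "raw, as-ingested data",
    "silver_": "cleaned / intermediate models",
    "gold_":   "curated, data-as-a-product data marts",
}

def tier_of(table_name):
    """Return the tier a table belongs to, or None if unclassified."""
    for prefix in TIER_PREFIXES:
        if table_name.startswith(prefix):
            return prefix.rstrip("_")
    return None

assert tier_of("gold_monthly_case_summary") == "gold"
assert tier_of("bronze_source_system_dump") == "bronze"
assert tier_of("scratch_tmp") is None  # not yet classified into a tier
```

Because the tier is visible in the table name itself, consumers immediately know whether a table is raw operational data or a curated, data-as-a-product-ready mart.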

3/ Self-Serve Data Platform — Ensures efficient use of limited engineering resources to create superior products that solve pain points and provide value to the agency users

  • Focusing on developing self-serve data platforms was a great way to utilise the limited engineering resources and allow our system to scale. The engineering team only needs to focus on building the necessary infrastructure and platform to allow domain teams and end users to produce and consume data (and everything else in between)
  • Importantly, these data platforms need to solve existing user pain points and integrate nicely with their current workflow for adoptability

4/ Federated Computational Governance — Allows data governance processes to scale, and enables the data governance team to focus on higher-value work, by using a federated system (guild) to decentralize data governance work and computational tools to automate enforcement and compliance

Federated computational governance consists of two main points:

  1. Using a guild-like structure to agree on global data governance policies (e.g aligning on metadata information, data modelling, and naming conventions), allowing data governance work to be decentralized to domain teams
  2. Using computational tools to automate enforcement of, and compliance with, data governance policies

We discuss two example use cases below:

  • Access Control and Permissions — Previously, access control and permissions to datasets often had to go through both the data governance team (in a middle-man role) and the domain team, resulting in inefficiency and slowness. Our federated computational governance approach relies on using RBAC (Role Based Access Control) in the data consumption tools (e.g Tableau, Query Editor) to achieve both federated and computational governance. The central data governance team establishes global policies on data tables (e.g which datasets can be shared with whom, PII masking, default duration of access); these are then codified into RBAC roles and permissions and delegated to domain owners / data stewards to apply to end users
  • Metadata Information and Documentation — Another area that can be improved is the documentation and population of metadata information in the data catalogue. Previously, this was a slow and painful process, with frequent back and forth between the data governance team and the domain teams, as there was no incentive, alignment, or agreement for the domain teams to do this well. Our federated governance approach involves first aligning with domain owners on the necessary expectations for their data tables (e.g metadata information, data modelling, naming conventions, SLAs, table owners, masking of PII). Computational enforcement can then be baked into a process (e.g pull requests) that runs before the central team publishes these tables for usage
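
A minimal sketch of what such computational enforcement could look like as a pull-request gate. The required fields echo the expectations listed above, but the function and table definition are hypothetical:

```python
# A sketch of "computational governance": a pull-request check that
# refuses to publish a gold-tier table unless the agreed metadata is
# present. The field names and table below are hypothetical examples.
REQUIRED_FIELDS = ("description", "owner", "sla", "pii_columns")

def publishable(table):
    """Return (ok, missing_fields) for a candidate gold-tier table."""
    missing = [f for f in REQUIRED_FIELDS if not table.get(f)]
    return (len(missing) == 0, missing)

table = {
    "name": "gold_monthly_case_summary",
    "description": "Monthly case counts by department",
    "owner": "domain-team-x",
    "sla": "refreshed by 9am daily",
    # "pii_columns" deliberately left out to show a failing check
}

ok, missing = publishable(table)
print(ok, missing)  # False ['pii_columns']
```

Run as part of CI on every pull request, a check like this turns the governance policy into something enforced automatically, rather than via back-and-forth between the governance and domain teams.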

Considerations for a centralised model, and the decision to adopt a hybrid model

Despite the benefits of Data Mesh, it was initially difficult to convince stakeholders and management to entirely move away from the centralised model which they were familiar with.³ Furthermore, as Data Mesh is a relatively new concept, there were concerns about how to implement it effectively.

The centralised model, when done right, brings about certain benefits, such as having a single source of truth, higher data quality, and better data governance in general — which are highly valued by the senior management in an enterprise environment like public agencies.

However, as elaborated above, due to the agency’s general needs, resourcing constraints, and fast-changing use-cases and datasets, it made more sense to go with the data mesh model for most of the datasets, so that engineers and data governance folks would not be tied up with day-to-day operational work. The centralised model would not scale past a certain size.

Thus, we decided to go with a hybrid model⁴ — a centralised model for our core datasets (those that were consistently used across the organisation and required a high degree of oversight), and a decentralised model for the other domain datasets.

This hybrid approach brought about a few additional benefits:

  • Dogfooding — The engineers and central team become a domain team themselves, and get to use first-hand the very products and systems they designed (i.e testing that our self-serve platform and federated governance model work as intended)
  • Gather feedback quickly — The non-technical folks within the central team (e.g data governance folks, engagement officers) provide the engineering team with valuable feedback on the usability, adoptability, and feasibility of our platform products and federated governance model
  • Refine and develop best practices in data modelling and data catalogues, and provide examples and documentation for other domain teams to follow; take the lead in driving adoption by creating useful data marts for the entire agency to use
  • Provides flexibility and optionality. Even if all else fails (e.g the data mesh concept does not take off due to non-engineering issues beyond our control⁵) — we could quickly pivot and fall back to this centralised model. Depending on priorities and how things turn out, we could choose to scale either the centralised model or decentralised model accordingly

Reflection

Overall, it was a very interesting experience to be able to build and deploy a brand-new data infrastructure for this agency.

This involved end-to-end work from conceptualisation, to use case and requirements gathering, convincing the numerous stakeholders on our strategy, exploration and research into multiple tooling options, and finally linking everything together into a coherent PoC product and deploying it. Along the way, we were expected to know not just our core data engineering expertise, but also other relevant areas like networking, security, DevOps, and more.

Interested in embarking on transformational data engineering projects for agencies like these? Or would you prefer to create data platforms / products for Whole-Of-Government usage instead? Check out our job postings here!

Stay tuned for part 2!

Endnotes

  1. Top-down heuristics ensure that we do not stray away from the high-level goals, objectives, and vision. Some useful top-down heuristics that can be used here are the Lean Value Tree (LVT), and Govtech’s own Analytics-By-Design.
  2. Antifragile — I got this idea from this fantastic reading on Picnic’s antifragile Data Warehouse Principles
  3. See this excellent blogpost by Piethein Strengholt — which discusses the finer points between centralisation and decentralisation topology for data mesh especially from an enterprise’s perspective
  4. Later on, with the benefit of hindsight and further readings, I found that the model we implemented best corresponded to the hybrid mesh approach as explained in this blogpost by Piethein Strengholt
  5. “Organisational changes are the hardest part of an enterprise data mesh journey.”
