Building a Data Warehouse on Google Cloud Platform That Scales With the Business

Lily Chang
Policygenius
Nov 12, 2021


Policygenius is America’s leading online insurance marketplace. Our mission is to help people get insurance right by making it easy for them to understand their options, compare quotes, and buy a policy, all in one place.

This story begins in 2017, when Policygenius had only 50 or so employees and a simple data warehouse setup: a business intelligence tool hooked up to a transactional read replica database in production. While it worked great in the beginning, we started to see a few pain points as the business grew. Queries on the Postgres database ran very slowly and often timed out. As we built more backend services, we found no out-of-the-box way to create unified views of data across multiple databases or to integrate with external sources to enrich our internal data for analysis. It soon became clear that our business needs had outgrown our initial setup.

And so, later that year, we migrated to a new data warehouse architecture shown below. Over the last four years, this data warehouse has scaled as our employee headcount increased more than 10-fold and usage jumped more than 15-fold. We’ve moved beyond a barebones data warehouse used primarily for reporting to a more sophisticated platform built to support machine learning models and other data capabilities. Our process has also evolved, from a single data engineer (that would be me!) cranking out code and queries for the first two-plus years to five data engineers following a standard development workflow. This post provides a high-level overview of this data architecture and the key lessons we learned along the way.

ELT Pipelines

At a high level, our data pipelines follow an Extract, Load, Transform (ELT) design built on Google Cloud Platform services, though it could be made to work with any cloud or on-premises stack. The majority of the data is extracted from Cloud SQL Postgres instances, staged in Cloud Storage buckets, loaded into BigQuery tables, and finally transformed into BigQuery views that are consumed for analytical purposes. We chose Airflow to orchestrate the jobs because it supports error handling, retrying, and alerting out of the box and has a rich set of operators with an abstraction layer that makes scripting very easy. Analytics users have access to a few tools, including Tableau, Data Studio or Google Sheets (BigQuery Connector), R Studio, and Jupyter Notebook, to connect to BigQuery for reporting, analysis, and building machine learning models.
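As a rough sketch of what one of these jobs can look like (this is not our actual DAG code; the connection, bucket, dataset, and table names are placeholders and the SQL is simplified), an ELT job built on the Google provider's Airflow operators might be:

    # Hypothetical sketch of a single ELT job: Postgres -> GCS -> BigQuery -> transform.
    # All names (connections, buckets, datasets, tables) are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="elt_orders",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        # Extract: dump recent rows from the Cloud SQL read replica into a staging bucket.
        extract = PostgresToGCSOperator(
            task_id="extract_orders",
            postgres_conn_id="cloudsql_read_replica",
            sql="SELECT * FROM orders WHERE updated_at >= '{{ ds }}'",
            bucket="staging-bucket",
            filename="orders/{{ ds }}/part-{}.json",
            export_format="json",
        )

        # Load: append the staged files into the Level 1 source table.
        load = GCSToBigQueryOperator(
            task_id="load_orders",
            bucket="staging-bucket",
            source_objects=["orders/{{ ds }}/part-*.json"],
            destination_project_dataset_table="source_orders_db.orders",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_APPEND",
        )

        # Transform: rebuild a downstream table that analytics users query.
        transform = BigQueryInsertJobOperator(
            task_id="build_fact_orders",
            configuration={
                "query": {
                    "query": (
                        "CREATE OR REPLACE TABLE warehouse.fact_orders AS "
                        "SELECT id, customer_id, status, updated_at "
                        "FROM source_orders_db.orders_view"
                    ),
                    "useLegacySql": False,
                }
            },
        )

        extract >> load >> transform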

Config-driven

Currently, we have over 300 jobs running each day on different schedules, and about 250 of them are auto-generated from configuration files to pull data from the Postgres read replicas. Adopting this config-driven approach has increased the velocity of the data engineering team and enabled the product engineering teams to customize the data that is sent to BigQuery with minimal assistance from our team. When there are upstream schema changes, the product engineering teams typically update the config files directly and create a pull request for the data engineering team to review; once the code is merged, the BigQuery tables update automatically. This has saved a lot of time, overhead, and unnecessary coordination between both teams.

We structured our data warehouse code repo to look like this:

postgresql/
    schema/
        {database_name}/
            {table_name}.json

A typical {table_name}.json config file consists of these fields:

  • start_datetime: the earliest date and time this job should run.
  • priority_weight: the priority of this job relative to others in the same queue.
  • interval_minutes: the frequency of this job (i.e., how often the data is updated in BigQuery).
  • extracted_by: field used to identify newest incoming data. Some tables have both system timestamps — created_at and updated_at — and others just created_at.
  • deny_list: fields we don’t want to pull into BigQuery, e.g. PII fields or other info we don’t need for analytic purposes.
  • fields: columns in the BigQuery table schema.
  • stringify: There is no 1:1 translation to BigQuery for every Postgres data type, and arrays and nested data types are a bit of a pain to work with in BigQuery, so we stringify columns that are not compatible with the common BigQuery data types (STRING, NUMERIC, DATE, DATETIME, BOOLEAN, TIME, TIMESTAMP). We have a lot of JSONB data from Postgres that is loaded into BigQuery as strings, which we then parse with JSON functions.
  • view: custom view used to eliminate duplicates. By default, it dedupes by table id.

Here’s an example config file:
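The values below are illustrative placeholders rather than one of our real configs (and the exact shape of the fields and view entries is simplified), but a typical file looks roughly like:

    {
        "start_datetime": "2021-01-01 00:00:00",
        "priority_weight": 5,
        "interval_minutes": 60,
        "extracted_by": "updated_at",
        "deny_list": ["email", "phone_number"],
        "fields": [
            {"name": "id", "type": "NUMERIC"},
            {"name": "status", "type": "STRING"},
            {"name": "metadata", "type": "STRING"},
            {"name": "created_at", "type": "TIMESTAMP"},
            {"name": "updated_at", "type": "TIMESTAMP"}
        ],
        "stringify": ["metadata"]
    }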

Another use case is ingesting Google Sheets into BigQuery. You heard that right, Google Sheets! We acknowledge that Google Sheets is not perfect, because the data in those spreadsheets is prone to human error and unexpected automatic formatting, but we also believe we should not let perfect be the enemy of good. Sometimes Google Sheets lets us move fast to support a reporting initiative, and we can iterate toward a better solution later. Other times it’s the best option, such as when data has to be manually retrieved from third-party systems. As long as you have checks and balances in place, it should not deter you from adopting a working solution. A typical config file for ingesting a Google Sheet looks like this:
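The field names here are invented for illustration rather than our actual schema; the gist is simply that each config points at a spreadsheet and range and describes the destination BigQuery table:

    {
        "spreadsheet_id": "your-google-sheet-id",
        "worksheet": "carrier_report",
        "range": "A1:D",
        "destination_table": "source_sheets.carrier_report",
        "interval_minutes": 1440,
        "fields": [
            {"name": "carrier", "type": "STRING"},
            {"name": "policies_sold", "type": "NUMERIC"},
            {"name": "report_month", "type": "DATE"}
        ]
    }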

Latency

Analytics users often ask if they can get the data in real time, and while it can feel bad to say no to your business stakeholders, learning to say no is a great skill to acquire, and in this case, necessary! The reality is that, unless you work in trading, stakeholders don’t stare at data dashboards all day. Business teams are quite understanding once we explain the extra effort it would take to reduce latency down to seconds. At Policygenius, our BigQuery data lags behind the transactional production databases by anywhere from 15 minutes to 24 hours, depending on reporting needs. That said, we are actively exploring a new centralized data bus architecture to reduce latency in the future.

Event-driven Pipelines

Aside from the batch pipelines, we also make heavy use of event-driven pipelines for ingesting data into BigQuery. These are most common when integrating with third parties via webhook notifications or SFTP file transfers. An example of such a pipeline: a file arrives in a Cloud Storage bucket, which we mount as part of the file system on an SFTP server; the new file triggers a Google Cloud Function, which processes it and writes the data into BigQuery. We chose Cloud Functions because we are almost exclusively a Google Cloud shop, and a managed service requires very little infrastructure work and is easy to develop against.
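A minimal sketch of that pattern, assuming a first-generation background Cloud Function triggered by a file being finalized in the bucket (the destination dataset and table names are placeholders):

    # Hypothetical Cloud Function: load a newly arrived GCS file into BigQuery.
    # The destination dataset/table are placeholders, not our real ones.
    from google.cloud import bigquery

    def ingest_sftp_file(event, context):
        """Triggered by a google.storage.object.finalize event on the bucket."""
        uri = f"gs://{event['bucket']}/{event['name']}"

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )

        # Start the load job and block on it so failures surface in the
        # function logs and can be alerted on.
        load_job = client.load_table_from_uri(
            uri, "source_partner_files.sftp_drops", job_config=job_config
        )
        load_job.result()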

BigQuery

We chose BigQuery as our data warehouse because it’s serverless and its compute engine handles large analytical queries with ease, which allows the data engineering team to focus on serving up analytics rather than getting bogged down in infrastructure upkeep.

Below is how we classify the data in our warehouse. Given that BigQuery only allows two levels of nesting (dataset and table), we use Google Cloud projects as the boundary for organizing data into four levels, as well as for creating data marts to control access. Levels 1–2 are co-located in one project owned by the data engineers, while Levels 3–4 reside in another project owned by data analysts and scientists.

  1. Level 1: Source data: Each BigQuery dataset corresponds to an internal service (e.g., a production Postgres database) or an external third party (e.g., Facebook). Most tables are append-only, which allows for historical views, with a few write-truncate exceptions. Each table has a logical view that de-dupes the appended data to show the latest state (a sketch of this pattern follows this list).
  2. Level 2: Foundational views: These are views created following the star schema design (e.g., fact_application, dim_customer, dim_customer_profile_household). They are meant for power users doing analysis.
  3. Level 3: Unified views: These are the views derived from the foundational views described above. This allows the data from the foundational views to be augmented, and often denormalized through additional table joins. Level 3 views are meant to be business function agnostic and used by a larger analytics audience who need a lot of data from across multiple tables and views. This approach allows us to design and centralize some of our business logic to provide a consistent view to stakeholders. For example, a customer view joins dim_customer, dim_customer_profile_household and other customer related views to provide a more complete profile of a customer.
  4. Level 4: Reporting views. These views leverage our Level 3 views, and provide data that is further denormalized and aggregated for a given business function. You can think of them as data marts. They previously existed in Tableau Server and we are moving them to BigQuery for consolidation and quality control.
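As an aside on the Level 1 de-duplicating views mentioned above, the pattern looks roughly like the sketch below. Our real views are generated from the table configs described earlier; the project, dataset, table, and column names here are placeholders.

    # Hypothetical: create a "latest state" view over an append-only Level 1 table,
    # keeping only the most recent row per id. All names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    view = bigquery.Table("my-project.source_orders_db.orders_view")
    view.view_query = """
        SELECT * EXCEPT (rn)
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
            FROM `my-project.source_orders_db.orders`
        )
        WHERE rn = 1
    """
    client.create_table(view, exists_ok=True)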

A few other features we added over the years:

  • Data cleaning: As explained above, each Level 1 table has a view that de-dupes the data to show the current state. When bad data cannot be fixed promptly upstream by the product teams, we build the data cleaning logic into the view as an intermediate solution. This allows us to quickly patch a bug that would otherwise skew results.
  • RBAC
    - Access: Access to a BigQuery resource is configured at the Google Group level rather than for individuals, following the DRY (don’t repeat yourself) principle. All permissions are provisioned via Terraform (see the sketch after this list).
    - Sensitive info: In the beginning, we deny-listed PII fields from being synced into BigQuery. If an internal user needed that information for any reason, they had to request permission to query the production read replica. In late 2020, we started leveraging Google Cloud’s Data Catalog feature for tagging and gating sensitive fields. Users still need to go through an approval process, but this unified the data and lets users get all of the information they need from BigQuery.
  • Data governance: Right now, our data dictionary lives in a Google Sheet, which requires manual updating and isn’t super great from a discoverability perspective. We are looking at automated tooling to improve the usability and lineage of our data ecosystem.
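For the group-level access mentioned above, a minimal Terraform sketch might look like the following (the project, dataset, and group names are placeholders, not our real resources):

    # Hypothetical: grant an analyst Google Group read access to a BigQuery dataset.
    resource "google_bigquery_dataset_iam_member" "analysts_read" {
      project    = "analytics-project"
      dataset_id = "reporting_views"
      role       = "roles/bigquery.dataViewer"
      member     = "group:data-analysts@example.com"
    }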

Future

Our architecture has gone through a few iterations over the past four years as we constantly think about what’s next and how we can better support the business’s future growth. To that end, we are expanding both our breadth and our depth. On the breadth side, we are moving up the data maturity ladder from descriptive analytics to predictive analytics, working closely with the data science team to ramp up machine learning operations such as model deployment and serving. On the depth side, we are solidifying our infrastructure to scale with growing analytical needs and workloads; for example, we are moving Airflow to a dedicated private Kubernetes cluster to give us more flexibility to scale. Interested or experienced in building a scalable data ecosystem? Come talk to us, we are hiring!

Many thanks to my colleagues, Michael Xu, Hoa Nguyen and others, for the feedback on this post.

We’re hiring in NYC, Durham, and remote. Check out our careers page to see open roles: https://www.policygenius.com/careers/
