How We Created and Adopted an ELT Data Warehouse

Mike Gordon
Hippo Engineering Blog
Jul 26, 2022

History of Data at Hippo

Initially, we had two data needs at Hippo: business analytics, and reporting our policy and claims information to carriers. As a fronting carrier, Hippo has contracts with insurance companies to underwrite policies. To do so, we must report the status of our policy book and claims to these carriers so they can verify that our underwriting is being done according to their expectations. These reports must be sent electronically on a daily, weekly, or monthly cadence, in a format and via a delivery mechanism that each business partner specifies.

Prior to setting up a data warehouse, our team of analysts used a read replica of our production database to perform analysis and set up views of our data. That worked for a while, but it excluded the many external data sources we had, taxed our production systems, and offered no process for verifying and cleaning our data.

Data Requirements and Design

Hippo’s business runs on daily, weekly, and monthly cycles. With a few exceptions, our business doesn’t make decisions in real time. Given this, we initially chose not to implement a real-time solution for syncing our production data into a data warehouse.

After evaluating a number of batch solutions and the possibility of a homegrown data sync, we chose Fivetran as our sync solution. Fivetran provides solid capability to sync from many data sources into a destination data warehouse, and it has monitoring, logging, and recovery hooks (a sketch of driving a sync through its API appears after the list below). We reviewed a few other options, including:

  • Building an in-house sync solution with Python
  • Using Stitch, a similar tool to Fivetran
  • Debezium, an open source alternative to SaaS sync tools
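To give a feel for those operational hooks, here is a minimal sketch of triggering and monitoring a connector sync through Fivetran’s REST API. The connector ID and credentials are placeholders, the exact response fields may differ by API version, and real alerting would go to an observability system rather than stdout.

```python
import time

import requests

FIVETRAN_API = "https://api.fivetran.com/v1"
# Placeholder credentials and connector ID -- substitute your own.
AUTH = ("FIVETRAN_API_KEY", "FIVETRAN_API_SECRET")
CONNECTOR_ID = "example_connector_id"


def trigger_sync(connector_id: str) -> None:
    """Ask Fivetran to start a sync for a single connector."""
    resp = requests.post(f"{FIVETRAN_API}/connectors/{connector_id}/sync", auth=AUTH)
    resp.raise_for_status()


def wait_for_sync(connector_id: str, poll_seconds: int = 60) -> str:
    """Poll the connector until the current sync finishes and return its state."""
    while True:
        resp = requests.get(f"{FIVETRAN_API}/connectors/{connector_id}", auth=AUTH)
        resp.raise_for_status()
        status = resp.json()["data"]["status"]
        if status.get("sync_state") != "syncing":
            return status.get("sync_state", "unknown")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    trigger_sync(CONNECTOR_ID)
    print("final sync state:", wait_for_sync(CONNECTOR_ID))
```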

Hippo does have a small number of real-time requirements, such as tracking certain sales targets throughout the day. To satisfy these, we created materialized views on our production read replica database and use a dashboarding tool to display the numbers.
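As a hedged illustration of that pattern (the connection string, view, and table names are invented, and it assumes a Postgres database where materialized views can be created), the setup looks roughly like this; the dashboarding tool then simply reads from the materialized view:

```python
import psycopg2

# Invented connection details and schema -- not Hippo's actual database.
conn = psycopg2.connect("postgresql://reader:secret@replica-host:5432/appdb")

CREATE_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales_targets AS
SELECT date_trunc('hour', created_at) AS hour_bucket,
       count(*)                       AS policies_sold
FROM   policies
WHERE  created_at >= date_trunc('day', now())
GROUP  BY 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_VIEW)
    # Refresh on a short schedule (e.g., a cron job) so dashboards stay close to real time.
    cur.execute("REFRESH MATERIALIZED VIEW daily_sales_targets;")
```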

The final notable requirement is that our data warehouse needs higher reliability than the typical data warehouse, because of the carrier reporting requirements mentioned earlier. In most places, if the warehouse goes down or has issues, the analysts and a few business folks get annoyed and that’s the end of it. The service level objective (SLO) we’re required to maintain is higher than that. An increasing number of organizations are treating their data warehouse as a “production” system, and this kind of requirement will become more typical over time.

Implementation Details

Simple Data Flow diagram of our ELT data warehouse

Our data infrastructure consists of a number of parts. The source systems are our multiple production databases. At the time we started implementing our data warehouse, our primary production database was a monolithic database that stored our policy, customer, leads, claims, and property data. Since then, we’ve been splitting that monolith into microservices and updating our data ingestion along with the changes.

We decided to go with an Extract, Load, and Transform (ELT) paradigm for our data warehouse because it’s simple and well suited to dbt: we copy raw data into the warehouse and then transform that raw data into more useful formats. We worked closely with an expert consulting firm, Data CRT, to help design and implement this system.

We use Google BigQuery for our data warehouse. We’ve found that BigQuery is a best-in-class data warehouse with a rich query language and the scalability to meet our needs for the foreseeable future.

Our sync tool is Fivetran. We also have a limited number of direct syncs from third-party software systems that integrate well with BigQuery. A few of our other syncs are from flat files stored in AWS S3 or Google Drive. That data is usually reference data, such as insurance or commission tables.
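As a rough sketch of the flat-file path (the bucket, object, and table names are invented for illustration), reference data can be pulled out of S3 and loaded into BigQuery along these lines:

```python
import io

import boto3
from google.cloud import bigquery

# Invented names -- replace with a real bucket, object, and table.
S3_BUCKET = "example-reference-data"
S3_KEY = "commission_tables/commissions.csv"
BQ_TABLE = "example-project.reference.commission_rates"

# Pull the flat file out of S3.
s3 = boto3.client("s3")
body = s3.get_object(Bucket=S3_BUCKET, Key=S3_KEY)["Body"].read()

# Load it into BigQuery, letting BigQuery infer the schema from the CSV header.
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = bq.load_table_from_file(io.BytesIO(body), BQ_TABLE, job_config=job_config)
load_job.result()  # Wait for the load job; raises if it failed.
```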

We use dbt for our transform step. dbt provides SQL-first transformations and mixes SQL with code-like iteration (via Jinja templating) into a powerful processing layer. dbt has allowed our teams to build multi-step data validation and transformation. Because it’s based on SQL, we’ve also been able to empower our analysts to help with ELT development. This has proven a great way for our analysts to learn how data engineering works without requiring them to write code. Since Hippo was a small startup when we started this project, we didn’t have the resources for a large, specialized data engineering team. dbt allowed team members of different skill levels to pitch in on developing our ELT and expanded our pool of data warehouse developers to include our analyst partners.
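To make the multi-step idea concrete, here is a simplified sketch of how a run might be orchestrated with the dbt CLI; the model selectors and the wrapper script are illustrative, not our actual pipeline:

```python
import subprocess
import sys


def run(cmd: list[str]) -> None:
    """Run a dbt CLI command and stop the pipeline if it fails."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)


# Build staging models first, then the marts that depend on them.
run(["dbt", "run", "--select", "staging"])
run(["dbt", "run", "--select", "marts"])

# Validate the freshly built tables with dbt tests (not-null, unique, custom checks).
run(["dbt", "test", "--select", "staging", "marts"])
```

Because the models and tests themselves are plain SQL, analysts can add or modify them without touching this kind of orchestration code.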

Reports Framework

Given Hippo’s significant reporting requirements, a way to auto-package and auto-deliver reports from our data warehouse is critical. Some reports are reviewed by people and others are machine-ingested. After a short debate and tool evaluation, we decided to write a home-grown Python framework for report delivery. Report preparation and delivery looks something like this:

Report delivery using a framework we developed in-house

Our in-house reports framework is written in Python and can be configured to schedule reports at any time and frequency and deliver them to a variety of destinations. Report metrics are sent to our observability infrastructure, allowing us to configure system-wide alerts if reports fail to generate or fail to reach their destination.
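To make the shape of that framework concrete, here is a heavily simplified sketch of the idea. The config fields, delivery path, and metric line are all hypothetical; the real framework supports more formats and destinations and reports its metrics to our observability stack rather than printing them.

```python
import smtplib
from dataclasses import dataclass
from email.message import EmailMessage

from google.cloud import bigquery


@dataclass
class ReportConfig:
    """Hypothetical report definition: what to run, how often, and where to send it."""
    name: str
    query: str             # SQL to run against the warehouse
    schedule: str          # cron expression evaluated by an external scheduler
    recipients: list[str]  # email destinations (a real framework would also support SFTP, S3, etc.)


def deliver_report(cfg: ReportConfig) -> None:
    # 1. Extract the report rows from BigQuery.
    rows = bigquery.Client().query(cfg.query).result()
    csv_body = "\n".join(",".join(str(value) for value in row.values()) for row in rows)

    # 2. Package and deliver (email shown here; other destinations plug in the same way).
    msg = EmailMessage()
    msg["Subject"] = f"Report: {cfg.name}"
    msg["From"] = "reports@example.com"
    msg["To"] = ", ".join(cfg.recipients)
    msg.set_content(csv_body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

    # 3. Emit a delivery metric so system-wide alerts can fire on failures.
    print(f"report_delivery_success report={cfg.name} value=1")
```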

Hippo has scheduled 150+ reports with this framework at various cadences, and we’ve had few reporting failures. What once took a team of analysts days to compile is now completely automated by our reports infrastructure.

Gaining Adoption

When we first stood up our data warehouse, the organization had gotten used to having direct access to production data. With a number of systems, processes, and reports dependent on our production database, moving over to the data warehouse was a difficult climb.

The decision that most aided organizational adoption was moving most of our analyst team in with our data engineering team. Making them “one team” that resolved data warehouse issues together in real time built analyst trust in our systems. As our analysts became more comfortable using the data warehouse as the source of truth for their reports and analytics, we began to remove views and access from our production read replica database.

This process took about six months of hard work, but it ultimately transformed the way the whole company worked. Instead of joining marketing or financial data in a spreadsheet exported from a production database, our analyst team could write queries in real time to join data from these sources. Once they realized the benefit of less manual work, they jumped aboard the data warehouse bandwagon.

Our Future Plans

In building a functioning and robust “v1” data warehouse, we’ve satisfied our business need to get data and dashboards from any source into the hands of our users. Our data warehouse normalizes/denormalizes this data and performs some basic data quality checks in the course of the ELT. The data is refreshed more often than daily, and we use BigQuery to run queries, which means users aren’t waiting very long for their queries to complete, no matter how complex their analysis.

By integrating readily available SaaS and cloud infrastructure, writing some custom code for functions specific to our business, and organizing our team to drive adoption of our data warehouse, we were able to transform the way our data-driven business uses that data. We did this with a relatively small team of fewer than 10 engineers and analysts, in a 6–8 month timeframe. This was quite an accomplishment for a small and capable team!

For our “v2” data warehouse, we’ll be looking at a more modular, workflow-like approach to our ELT. We’ll also examine how we can build more powerful data quality checks and anomaly detection into our ELT, so that we can be sure the data is clean enough for building and running ML models. We’ve consolidated our PII and user data into special access tables where only authorized users can view PII. A reliable real-time pipeline and an event-driven approach to real-time data are also on our minds as we evolve our data infrastructure.

What We Learned

Sometimes you have to organize the team in non-traditional ways to meet a goal. If our analyst team and engineering team hadn’t been combined, the pace of adoption of our new data warehouse would have been much slower.

Data sync is a difficult but solved problem. We chose a sync system like Fivetran because we calculated the cost of maintaining our own would be far greater. It’s a nice idea to have an engineer write a data sync in an afternoon. It’s a much taller order to actually maintain one over years.

The cost of a data warehouse as a service isn’t prohibitive. Many organizations maintain their own Hadoop or Presto clusters to use as a query engine. Using a SaaS tool like BigQuery that has user management, a UI, and scalability built in isn’t as cheap as maintaining your own, but the costs can be managed.

And building your own can be the right solution. We built our own reporting framework even though we knew we could probably find a commercial product to do it. We intended our framework to be a stopgap lasting six months, but it has had a much longer shelf life. Tools like Kubernetes cron jobs and infrastructure-as-code now let teams stitch tested components together and build a good solution rather than buy one.

Data warehouses are being treated more like production systems. Data warehouses, lakes, and pipelines are starting to come with the expectation of reliability and timeliness. So many businesses now run on data, and the more up to date and correct that data is, the easier it is to make effective and quick business decisions. Cycle time is shrinking as new technologies are harnessed to deliver data to business decision makers.

References

Fivetran

Google BigQuery

dbt data transformation tool

Data CRT

Thanks to our data engineering team, data analytics team, and the team at Data CRT for helping to provide detail and review this post. If you’d like to learn more about working in data engineering at Hippo, go to https://www.hippo.com/careers

Engineering Blog, Episode 3

