Taking the Star Schema into the Future 💫

Avi Chad-Friedman
LeaseLock Product & Engineering
5 min read · Jul 13, 2021

How LeaseLock’s Data Science team combines data warehousing & simulation modeling to create interactive, scenario-driven Looker dashboards

If you find any of this interesting, be sure to check out our open job postings or email jobs@leaselock.com!

Note: All data and code in this article have been scrubbed of proprietary information and are meant only to be illustrative of the concepts discussed.

Introduction

LeaseLock is a Risk Platform that automates the deployment and monitoring of financial protection (in the form of insurance) through software integrations with property management systems. The core challenge for the data science team is to create a layer of understanding to assess, diagnose, and monitor risk. To this end, we’ve implemented a variety of sophisticated models in… SQL.

Yes, SQL. Plain old SQL. We do have a layer of machine learning models, written in Python, but the majority of the heavy lifting is done on Amazon Redshift’s engine. We’ve put a unique twist on the classic Star Schema, which allows us to simultaneously monitor and forecast our KPIs in Looker dashboards. Like a standard Star Schema data warehouse, ours divides our universe into logical dimensions and measurements, but, through the injection of synthetic data, we’ve added a forecasting capability not typically seen at this layer of the data stack.

The Foundation: Dimensional Modeling

Dimensional modeling, or a “Star Schema,” forms the foundation of everything the data science team does. Our deep software integrations vacuum data into our warehouse. While the core business primitives (tenants, leases, ledgers, etc.) are the same across property management systems, the data structures vary drastically. Our Star Schema ETL pipeline creates a system-agnostic layer for this massive, disparate data and additionally enriches it with first-party (e.g. claims) and third-party sources (e.g. the US census), the end result being a set of ultra-wide, analytical “fact” tables that feed our reporting and statistical models. You can see a snapshot of the complex Airflow DAG below. The focus of this blog is not on the intricacies of how we transform PMS data but on the unique way in which we marry probabilistic forecasting with an otherwise mundane ETL pipeline.

Airflow DAG that constructs LeaseLock’s Star Schema
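
For readers newer to dimensional modeling, here is a heavily simplified sketch of what such a layout can look like. Every table and column name below is hypothetical and stands in for much wider production tables:

```sql
-- A heavily simplified, hypothetical star-schema layout (illustrative
-- names, not LeaseLock's actual schema). Dimension tables describe the
-- business entities; the wide fact table records the measurements.
CREATE TABLE dim_property (
    property_key  BIGINT PRIMARY KEY,
    pms_system    VARCHAR(64),    -- which property management system fed this row
    market        VARCHAR(128),
    census_region VARCHAR(128)    -- third-party enrichment (e.g. the US census)
);

CREATE TABLE dim_lease (
    lease_key    BIGINT PRIMARY KEY,
    property_key BIGINT REFERENCES dim_property (property_key),
    lease_start  DATE,
    lease_end    DATE
);

CREATE TABLE fact_lease_ledger (
    lease_key    BIGINT REFERENCES dim_lease (lease_key),
    period_month DATE,
    rent_billed  DECIMAL(12, 2),
    rent_paid    DECIMAL(12, 2),
    claim_amount DECIMAL(12, 2)
);
```

Real leases, ledgers, and claims flow through this shape today; the twist comes from what else we allow into it.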

Growth Forecasting

From a SaaS perspective, a beautiful aspect of LeaseLock’s business model is the predictable rate at which new homes “come online” after a property is launched. This is driven by our high conversion rates and the predictable seasonality of the leasing cycle: it peaks in the summer months and troughs in the winter.

Logic discussed in this section is packaged in the “dim_synthetic_lease” step of the pipeline

The dim_synthetic_lease step in our pipeline uses a seasonality model to randomly sample our existing leases from our PMS data and create synthetic copies of them. The number of synthetic leases created is driven by a combination of the leasing seasonality model and the sales expectations of the business; the lifecycle of each lease is driven by our ML-based survival model (which we don’t discuss here). Some lightweight Python code generates the synthetic data according to this business logic and hooks into our pre-trained survival model to simulate lease occupancy terms.
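
The production version of this sampling lives in that Python layer, but the core idea is easy to sketch in warehouse terms. In the hypothetical sketch below, a seasonality_targets table (an assumed name) encodes the expected number of new leases per forecast month, and real leases are shuffled and copied until each month’s target is met:

```sql
-- Hypothetical sketch of the sampling idea; the production logic lives in
-- the lightweight Python layer described above. Assumes a
-- seasonality_targets(forecast_month, target_new_leases) table encoding the
-- seasonality model and sales expectations, and that dim_synthetic_lease
-- mirrors dim_lease's shape.
INSERT INTO dim_synthetic_lease (lease_key, property_key, lease_start)
WITH draws AS (
    -- shuffle the real leases independently for every forecast month
    SELECT
        t.forecast_month,
        t.target_new_leases,
        l.property_key,
        ROW_NUMBER() OVER (
            PARTITION BY t.forecast_month
            ORDER BY RANDOM()
        ) AS draw_rank
    FROM seasonality_targets t
    CROSS JOIN dim_lease l
)
SELECT
    1000000000 + ROW_NUMBER() OVER (ORDER BY forecast_month) AS lease_key,  -- stay clear of real keys
    property_key,
    forecast_month AS lease_start  -- the end date is later filled in by the survival model
FROM draws
WHERE draw_rank <= target_new_leases;  -- one synthetic copy per expected new lease
```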

The end result is that our Star Schema contains full historical lease information and synthetic data that allows us to chart our future. You can see in the charts below how this allows us to track and project growth KPIs in one Looker dashboard. The first chart shows our raw lease volume forecast, while the second demonstrates how the survival model normalizes this data into a forecast of how many homes are online at any given time.

Seasonality-driven forecast of new LeaseLock homes
History & forecast of how many homes are online per period, driven by survival modeling

Pro Forma Modeling in Looker

Pro forma models are familiar tools to people in finance: they’re typically large Excel workbooks with a multitude of inputs that drive scenarios. LeaseLock’s finance team, of course, loves these too. But our philosophy as a company is to make data available and actionable in real-time across the company. To this end, we’ve deployed Looker, which is a powerful tool for creating complex, real-time dashboards. It’s not, however, intended to be interactive in the way an Excel model is: it’s a summarizer and not a manipulator of data. This poses a problem if you’re attempting to build an interactive, scenario-driven dashboard. The Data Science team has, again, injected some interesting, non-standard features into our Star Schema in order to make our Looker dashboards interactive.

Logic discussed in this section is packaged in the “fact_derived_ts” step of the pipeline

Let’s focus on the fact_derived_ts step in the pipeline. This step again takes a trained machine learning model (this time to predict the timing of defaults) and simulates it against historical data. It also piggybacks on the “synthetic lease” dimension to predict defaults against our future business. This prediction is done in pure SQL, using a window function to sample from a pre-built distribution.
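
The production query isn’t reproduced here, but the pattern it describes is classic inverse-CDF sampling: materialize the model’s output as a cumulative distribution table, give each lease one uniform draw, and let a window function pick the first distribution bucket at or past the draw. All names in this sketch (default_timing_dist and its columns) are hypothetical:

```sql
-- Hypothetical sketch of inverse-CDF sampling in SQL (illustrative names;
-- not the production query). default_timing_dist(months_into_lease, cum_prob)
-- holds the model's pre-built cumulative distribution of default timing.
WITH all_leases AS (
    SELECT lease_key FROM dim_lease            -- history
    UNION ALL
    SELECT lease_key FROM dim_synthetic_lease  -- forecasted future business
),
draws AS (
    SELECT lease_key, RANDOM() AS u            -- one uniform draw per lease
    FROM all_leases
),
candidates AS (
    SELECT
        d.lease_key,
        dist.months_into_lease,
        -- rank the distribution buckets at or past the draw;
        -- the lowest one is the sampled value
        ROW_NUMBER() OVER (
            PARTITION BY d.lease_key
            ORDER BY dist.cum_prob
        ) AS pick
    FROM draws d
    JOIN default_timing_dist dist
      ON dist.cum_prob >= d.u
)
SELECT
    lease_key,
    months_into_lease AS simulated_default_month
FROM candidates
WHERE pick = 1;  -- every single lease gets a simulated default
```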

This explains how we merge our ML models with our Star Schema but not how we achieve interactivity in our Looker dashboards. Default rates are a key driver of our carriers’ future performance and our internal operational planning (e.g. what volume of claims our system will be handling). We like to run stress tests against our data — at various default rates — to ensure our program can maintain performance for our carriers and value for our customers under a variety of macroeconomic conditions. In the query above, we’re actually simulating a default on every single lease; the “rate” portion is input as a Looker filter.

Sample control panel of interactive Looker dashboard
Looker dimensions that drive forecasts based on dashboard inputs

Add a Looker dimension to simulate which leases will experience a default and voilà… you have a Looker tool capable of accepting rate parameters and adjusting the forecast accordingly! Because the synthetic and real data coexist in the same fact table of our Star Schema, we’re able to chart history and the future in one elegant chart. Seasonality is, of course, baked in, because this is built on top of the growth forecast.
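
In LookML this would be a user-facing parameter wired into a dimension; the effective SQL Looker generates would look roughly like the hypothetical sketch below, where each fact row carries a persisted uniform draw (default_draw, an assumed column) and the literal 0.05 stands in for a 5% rate typed into the dashboard:

```sql
-- Hypothetical sketch of the effective query behind the rate control.
-- Assumes fact_derived_ts rows carry default_draw, a persisted uniform
-- value in [0, 1]; Looker substitutes the dashboard's rate input where
-- the 0.05 literal appears here.
SELECT
    period_month,
    SUM(CASE WHEN default_draw <= 0.05  -- lease counts as a simulated default
             THEN claim_amount
             ELSE 0
        END) AS forecast_claim_volume
FROM fact_derived_ts
GROUP BY period_month
ORDER BY period_month;
```

Raise the rate and more of the pre-simulated defaults are counted; the entire time series reshapes instantly, with no model re-run required.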

It has taken some unorthodox schema design, but by combining real and synthetic data, we’ve transformed Looker from a KPI summarizer into an interactive modeling interface. So… who needs Excel?

Example parameterized claims volume forecast
