Engineering Business Insights from Unstructured Data

Published in

LeaseLock Product & Engineering

6 min readMay 9, 2022

Intro

LeaseLock is an AI-powered lease insurance platform that eliminates security deposits from the rental housing industry. Through software integrations with various property management systems, we are able to consume valuable data that is the foundation for our entire business model. The LeaseLock Data Engineering team is on the front line when it comes to turning messy and unstructured, raw data into meaningful, usable insights for the entire company. We use dimensional modeling to transform raw data into “one source of truth,” from which our data science and analytics teams create sophisticated models and uncover business insights.

Overview

What Is Dimensional Modeling?

Dimensional modeling (AKA Star Schema) is a data structure consisting of two layers: fact tables and dimension tables. Fact tables store measures, the data that is used for analysis. Dimension tables store dimensions, the data used to look at the fact table measures from different angles.

Each fact table pulls, transforms, and combines data from one or multiple dimensions, while each dimension represents an aspect of the business. For example, an online shopping platform would use dimension tables such as “customers” and “products.” The customer table stores a record of each customer with a unique ID (“customer dimension ID”), as well as information on each customer such as name, age, email address and phone number. The product table is structured in a similar way, with a unique record of each product the company sells (“product dimension ID”), with corresponding product information such as name, SKU, price, date added, etc.

As you might expect, dimension tables are typically narrow tables as the goal is to isolate one dimension of the business. This focused approach enables us to provide all-encompassing information about that one business aspect. Fact tables, on the other hand, are wider tables, built on top of the dimension layer. Each fact table is tailored by combining data from different dimensions in order to produce the measures on which internal and client-facing dashboards depend. These fact tables are more than just a glimpse into the business — in fact :) — the measures produced in fact tables are a valuable asset to the data analytics and data science teams.

How Does LeaseLock Use Dimensional Modeling?

As a B2B tech company in the multifamily housing industry, LeaseLock ingests raw client data from a variety of sources depending on what property management software (or “PMS platform”) the customer uses. This means that we receive the same type of data in different formats and structures. After ingestion, it takes dozens of ETL tasks and SQL queries to normalize data such that it can be easily interpreted and consumed across the company.

Here’s an overview of the process:

Layer 0: Raw Data

This is unprocessed data that is stored in a RedShift database, ready to be utilized by our dimensional model. It consists of client data and internal data.
Client data is received daily in a different unstructured format for each PMS platform, and we create specific data ingestion pipelines to extract data from different endpoints.
Internally generated data is mostly related to the insurance and underwriting side of the company.

Layer 1: Dimensions

To create dimension tables, unstructured data with different schemas is normalized and combined into one master schema. For example, in order to create our Lease Dimension table, we combine data from six sets of tables (from Layer 0). Each set of tables has a distinct structure, definitions, naming conventions, and number of fields. But the same type of normalized data can be extracted from each.
In order to achieve this, it is essential to define the join rules for each set of tables. This will usually generate SQL with complex nested/layered joins. Therefore, the focus in this layer is on join accuracy and pulling the raw data with minimal transformation.
Some simplified SQL logic is utilized to avoid duplicates and to create basic metrics related to each dimension. In the Lease Dimension table for example, we calculate paid and owed measures by aggregating the raw transaction data.

Below is an example of a generic ETL function that we use to build our dimensions, assuming that queries to pull from each set of PMS tables are imported in string format.

Layer 2: Facts

In this layer we combine dimensions and add more sophisticated SQL that replicates our business logic to create ready-to-use data assets.
Unlike in Layer 1, joins are very simple when creating fact tables. Remember, dimensions have already taken care of the complicated task of unifying the raw data. Fact tables don’t need to know anything about the raw data and the diversity in its structures, so we focus on replicating the business logic and the accuracy of calculations.

Here’s an example of how we use simple joins in SQL to build fact tables:

For Layers 1 and 2, we use Airflow to manage, schedule, and monitor sequences of tasks (each task creates either a dimension or a fact table). As shown in the diagram below, we designed our Airflow DAG in a way that all dimension tasks run simultaneously since they are independent of each other and only use/combine the raw data. A successful run for all dimensions triggers the fact table task to run.

Business Impacts

Dimensional modeling powers over 90% of our analytics and reporting needs, and our Risk Platform relies heavily on its output. The benefits of this data structure are wide-ranging and include:

Faster queries: We used to write a lot of custom queries for each raw data source. Blending data with different formats required a lot of knowledge, caveats, and caution. Each of these complex queries was taking on average two minutes to run. Dimensional modeling abstracts all that manual work under the hood and eliminates the time required to gather and blend raw data. As a result, data analysts and data scientists can simply query the fact or dimension tables.

Performant dashboards: Dashboards are query results turned into visualizations. Faster queries accelerate the performance of our dashboards which directly improves user experience.
Saving Cloud spending: Normally, storage is cheaper than computation. Dimensional modeling avoids using cloud processing power multiple times a day for the same computation. Instead, it does the computation once a day, and stores the results in the database.
Uncovering hidden insights: By seamlessly combining data from previously separate data sources, we shed light on issues and insights that weren’t apparent before. For example, we can link data on one resident across different PMS ecosystems and use data from past properties that the resident resided in to build insights — this is the data that our clients don’t have access to.

Easy to diagnose bugs and make changes: When we find an anomaly or an unexpected behavior, it’s easy to trace it back to the dimension that is causing the problem, allowing for faster fixes. Also, using Airflow to orchestrate the pipeline makes it easy to find the point of failure.
Understanding of the business: Dimensional modeling is a structured way of using data to represent the business model. It enables the data analytics team to understand the business process.

Dimensional Modeling gives us the ability to bring bespoke data into one unified schema, allowing us to uncover many insights and answer questions that were previously difficult or impossible to answer. For the past two years, we’ve invested heavily in the development of this data structure and have expanded it to all areas of our business by introducing multiple dimension and fact tables. While what we have today is technically a galaxy schema, we still humbly call it “the star schema”!

If data and dimensional modeling excite you, scan our careers page to see our current openings. Don’t see the role you’re looking for? Reach out to us directly at talent@leaselock.com.