Lime Analytics Engineering: Modeling GovTech
GovTech systems, designed to help businesses manage locality-specific compliance, require data modeling that involves navigating the complexities of spatial, temporal, and density attributes — each of which can be multivariate and intricate. At Lime, where managing a global fleet of vehicles is critical for operations, the stakes are especially high. Accessibility and availability are paramount to supporting our broad spectrum of stakeholders, from analysts to government relations and operations leaders.
As Analytics Engineers, our challenge is to transform these complex attributes into digestible and useful data products. But how do we achieve this in a way that is both performant and maintainable? This blog post delves into how our team employs dbt, a powerful tool in our data engineering toolkit, to implement advanced dimensional modeling techniques. We’ll explore the use of dimension-to-dimension joins, bridge tables, and SCD Type-2 tables, which are typically maintenance-heavy. Our goal is to showcase how we model operational data sources with complex relationships in a flexible, user-friendly manner, enhancing understanding and usability of the dimensional model.
GovTech Specifically at Lime and why it’s not Unique
Predicting the exact needs of city governments is a formidable challenge due to the diverse and unique requirements of each cityscape. This complexity isn’t unique to Lime — many businesses face similar challenges when operating across numerous jurisdictions, each with its own set of rules and expectations.
At Lime, we address this variability through a system we call “Operational Targets.” This system allows our operations team to collect a wide range of inputs that reflect the specific requirements of each market. For example, these inputs might include data on local traffic patterns, regulatory restrictions, or public transit integration, ensuring that our services align closely with local needs.
But collecting this data is only part of the solution. It’s equally crucial to integrate these inputs effectively into our data warehouse, making them accessible and actionable for our stakeholders. This integration enables us to transform raw data into valuable insights, which in turn supports decision-making across functions — from government relations to strategic planning.
By surfacing this tailored data in our data warehouse, we empower our analysts and product leaders with the nuanced understanding needed to optimize operations and meet diverse regulatory environments effectively.
GovTech Complexity
The needs of modeling compliance are ever-evolving, often without much notice. In the context of understanding where our fleet was and when, it’s essential to maintain a view of history while planning for the future.
Methods of Counting: We need to account for varied methods of compliance. Some cities prefer to measure vehicles in the public right of way (PROW) using hourly average while others prefer to look at the max deployed at any given time. In addition, opposed to absolute counts of vehicles, we have percentage based counts where x% of vehicles in the entire city need to be in a given subsection of the city. Point being, we will want to calculate the data in different ways.
Temporal: Our vehicles are deployed 24/7, however, from a compliance standpoint we must be able to measure where they are at certain times of the day. For example, we may need to maintain a minimum number of vehicles in a given area between 6–9 AM and then 5–8 PM. From a modeling perspective, we’re inclined to create columns for each set; time_span_one & time_span_two. However, it’s difficult to predict the number of time spans, what if tomorrow a city decided they now want 3 time spans? We’d need to rebuild our table!
Spatial: Where our vehicles are deployed in a given city matters a lot, both from a revenue and compliance perspective. There are two main problems to address: measure changes to shapes of city sections and managing denominator shapes for percentage based counts.
For example, if we need to have a minimum of 500 vehicles in a section of a city, we’ll want to track over the changes to the boundary of this section over time. This example could just as easily be deploying 10% of all vehicles to a certain subsection. Not only will we have to manage the ever-changing spatial shapes, but also the shapes of the denominator spatial shapes (which could be more than 1 shape)!
Our Technical Solution
In our pursuit to manage complex compliance and operational data effectively, dbt snapshots emerge as a pivotal solution. These snapshots are especially adept at creating and managing Slowly Changing Dimension (SCD) Type-2 tables, which allow us to look back in history and understand performance within the context of past configurations.
Taking inventory of the modeling concepts:
- count_method: standard SCD type-2 attribute
- timeframe: managed with a bridge table to manage the multivariate nature of time
- Spatial_id & denominator_spatial_id: utilize a dimension-to-dimension join to capture spatial attributes over time and manage the relationships between multiple spatial dimensions.
Let’s dive into how we’ll leverage dbt snapshot to make managing these techniques a breeze.
dbt snapshot
dbt’s snapshot feature simplifies the heavy lifting involved in SCD type-2 transformations and the integration of complex bridge and dimension-to-dimension joins. Here’s how we implement it:
- Load Dimension Attributes: All dimension attributes are loaded into an intermediate layer model using the current state of the attributes. In addition, we keep the arrays for time and spatial data unflattened. This setup preserves the flexibility to adapt to changes in compliance requirements.
- Generate Surrogate Keys: Using dbt’s generate_surrogate_key function, we create unique keys for both the timeframe and denominator spatial attributes. This facilitates the creation of bridge tables and ensures consistent reference points across different snapshots.
- SCD Type-2 Transformations: Lastly, a dbt snapshot model performs the SCD Type-2 transformation.
That is the bulk of the dimensional modeling; dbt snapshot handles the SCD type-2 transformations and our bridge tables are effectively already built within the snapshot table.
Surfacing Models to the Data Mart
Once the dbt snapshot has completed the foundational data transformations, our next step is to make this enriched data accessible and practical for end-user queries. This involves surfacing the snapshot table to our data mart, which represents the most refined and business-ready data in our warehouse.
To maximize the usability of our snapshot table, we implement a series of lightweight views:
- Dimension View: Focuses on single-variate attributes, providing a clean and straightforward representation of each operational target.
- Timeframe View: Isolates the timeframe attribute, facilitating temporal analysis by flattening the relevant arrays and aligning this data with our standard time dimensions.
- Spatial View: Handles the denominator spatial attributes, ensuring that spatial calculations and visualizations are consistent and accurate.
These views serve not only to align the column naming conventions across our datasets but also to simplify the orchestration of data queries. By flattening arrays in dedicated views, we ensure that each aspect of our data can be easily accessed and manipulated as needed.
The resulting data lineage is streamlined and efficient, designed to support a wide range of queries and analyses. This structure allows stakeholders to trace changes over time through a dimension table that contains one row per operational target, meeting the general use-case of querying operational target changes.
Furthermore, for stakeholders requiring more detailed analysis, the setup democratizes access to array transformations. Easy joins to the two bridge tables allow for scalable handling of complex queries. For instance, the pipeline computing fleet distribution can now effortlessly filter and group vehicles using the criteria established in these views.
Final ERD Diagram
Elevate Your Data Strategy with dbt Snapshots
dbt is the next-generation tool in data stack technology, the robust capabilities of dbt snapshots are a microcosm of the full feature set. Uplevel your data modeling processes with dbt’s cutting-edge approach that simplifies management of complex, evolving datasets.
Join the Community: Connect with us and other data professionals on the dbt Slack community. Exchange ideas, ask questions, and stay at the forefront of data engineering innovations. Tell us how you’re leveraging dbt to transform your data!
Career Opportunities: Inspired by what you’ve read? We’re looking for talented individuals to join our team. Explore our careers page to find opportunities to work on exciting projects like those discussed here. Help us shape the future of transportation and solve tough problems along the way.
Take your data architecture to the next level with dbt. Join us in redefining the possibilities of analytics engineering.