Analytics Engineering at Mothership
The Power of Data Modeling
Right now is an exciting time to be part of a data team, and early-stage is always an exciting time to be part of a start-up. New tools in the Modern Data Stack have made data modeling more dynamic and accessible, multiplying the potential impact of a small data team. A good data modeling process empowers the business to make data-driven decisions, enables curated self-service, and ensures scalability, reliability, and shared context. This post covers how we organize our analytics data processes, with a focus on the data modeling layer.
The workflows enabled by having analysts and analytics engineers own their own data models are meaningfully different from how most teams have worked until quite recently. In the past, data teams at start-ups have either had to build and maintain single-report queries with siloed and potentially brittle definitions, or else log requests to an engineering team backlog to execute data transformations in the ETL pipeline. By the time a data engineer is hired to standardize some centralized views, the accumulated technical debt and maintenance work is substantial (and expensive).
New technologies have granted data teams the power (and responsibility) to build their own flexible data models into a centralized codebase, allowing us to react quickly to changing products and take a new level of ownership of the data that we work with every day. At Mothership we primarily use an ELT process where data is extracted and loaded via Fivetran and Segment, then transformed and modeled in dbt. Our BI tool is Looker, which primarily pulls directly from the modeled outputs rather than performing any significant calculations. We organize our analytics database into distinct layers: an ingestion layer, where we clean and stage data, and a modeling layer, where we transform data, render metrics, and develop models.
How it Starts
If you’ve been part of an early stage start-up you’ve seen this: someone (usually in engineering) has set up a simple iteration of a BI tool. Maybe it’s Metabase, maybe it’s Looker, but what tends to happen is that the BI tool sits directly on top of a replica of production data. When schemas are concise, or for people close to the source data, this can be a handy solution, but as more data sources are connected and more people get access to the BI tool, a few key problems emerge:
- Difficulty in reproducing or interpreting metrics
- Data structures that are difficult to align to business problems
- Inflexibility and data silos
- No formal data validation or testing
- Minimal data governance or security
All this impacts the company’s ability to leverage data efficiently and make informed decisions about the right problems. The solution here is a thoughtful approach to data modeling that can build a shared context for teams and leaders.
Analytics Engineering is the practice of building clean and reusable data sets (models) for analysts, machine learning teams, and everyone else who interacts with our BI tools. At Mothership this is done by analytics engineers and data analysts. We’re enthusiastic about dbt and agree that analysts are the right people to design and build these models, as they are the ones who use them most extensively.
As we operate an Extract, Load, and Transform (ELT) model, rather than ETL, data arrives in the analytics database unmodified from the source system. It is then cleaned into staging tables where it is formatted appropriately and presented as an idealized version of the production data. This is where the first set of basic validation tests live, usually generic field-level constraints.
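As an illustrative sketch (the source, model, and column names here are hypothetical, not Mothership's actual schema), a dbt staging model and its generic field-level tests might look like:

```sql
-- models/staging/stg_shipments.sql
-- Hypothetical staging model: rename, cast, and lightly format raw columns
-- so downstream models see an idealized version of the production table.
select
    id as shipment_id,
    customer_id,
    lower(status) as shipment_status,
    cast(created_at as timestamp) as created_at,
    price_cents / 100.0 as price_dollars
from {{ source('production', 'shipments') }}
```

```yaml
# models/staging/stg_shipments.yml
# Generic field-level constraints: the first line of validation.
version: 2
models:
  - name: stg_shipments
    columns:
      - name: shipment_id
        tests:
          - unique
          - not_null
      - name: shipment_status
        tests:
          - accepted_values:
              values: ['scheduled', 'in_transit', 'delivered', 'canceled']
```

Because these tests run against the staging layer, a broken assumption about the source data surfaces before it can propagate into downstream models.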
Given the mission to generate trusted and extremely usable data models, Mothership’s data team cultivates distinct expertise on the domains we work in even before any analysis begins. The task of mapping production data to business processes (and testing all the assumptions in between!) is necessary for the responsible presentation of data. Providing a shared decision-making context requires models that are natural to the business stakeholders and legible to the engineering and product teams. Documentation and data stewardship are key processes of the data team.
The Modular Approach
Once data is staged and understood we build modularly toward wide tables that we call marts. These tend to align to key business entities and largely follow Kimball's dimensional modeling conventions. Where we deviate from Kimball, and lean into the strengths of modern column-store databases, is by collecting very many dimensions and values onto a single table. This takes advantage of the efficiency of columnar storage and eliminates the expense and potential confusion of multiple or complex joins.
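As a sketch of what such a wide table can look like (model and column names are hypothetical), a mart model is largely a join of staged and intermediate modules:

```sql
-- models/marts/fct_shipments.sql
-- Hypothetical wide mart: one row per shipment, carrying many dimensions
-- and measures so consumers rarely need to write their own joins.
select
    s.shipment_id,
    s.shipment_status,
    s.created_at,
    c.customer_name,
    c.customer_segment,
    a.revenue,
    a.cost,
    a.margin
from {{ ref('stg_shipments') }} as s
left join {{ ref('dim_customers') }} as c
    on s.customer_id = c.customer_id
left join {{ ref('int_shipment_accounting') }} as a
    on s.shipment_id = a.shipment_id
```

In a columnar warehouse, queries that touch only a few of these columns scan only those columns, so the width of the table costs consumers very little.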
Rather than define marts in long queries with multiple derived/temporary tables (CTEs), we break these down into dbt models and construct the mart modularly. An example of this modular approach would be calculating the relevant accounting metrics per shipment onto their own table. Where Kimball might have us model all the facts relating to a shipment in the mart directly, by treating what would be CTEs as their own dbt models we can test each model's output against source data before passing it through to other models. This also encourages re-use of the defined and documented values in other analysis or outputs.
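For instance (again with hypothetical names), the per-shipment accounting metrics could be rendered as their own dbt model, testable in isolation and reusable by any mart or analysis that refs it:

```sql
-- models/intermediate/int_shipment_accounting.sql
-- Hypothetical intermediate model: what would otherwise be a CTE inside the
-- mart query becomes its own testable, documented model.
-- Assumes accounting entries carry only 'revenue' and 'cost' entry types.
select
    shipment_id,
    sum(case when entry_type = 'revenue' then amount else 0 end) as revenue,
    sum(case when entry_type = 'cost' then amount else 0 end) as cost,
    sum(case when entry_type = 'revenue' then amount else -amount end) as margin
from {{ ref('stg_accounting_entries') }}
group by shipment_id
```

Any mart that needs these figures pulls them with `{{ ref('int_shipment_accounting') }}`, so the definition of "margin" lives in exactly one place.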
These modules tend to be domain specific and, in the case of complex business logic, rendered in distinct stages. Here we implement more complex secondary testing, including relational tests. This also gives us greater visibility into the runtime of individual elements of the mart.
Like most teams using a modern data stack we make use of version control, continuous integration, and typical software development practices. Tests are included as part of the release of new models and usually cover not just completeness and matching against source and staging data but also comparisons against expected forecasts designed to alert us to issues in the extract and load steps.
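As a rough sketch of what such release tests can look like (the model names are hypothetical; `dbt_utils.equal_rowcount` and the built-in `relationships` test are standard dbt tests):

```yaml
# models/marts/fct_shipments.yml
# Hypothetical release tests: completeness against source plus relational checks.
version: 2
models:
  - name: fct_shipments
    tests:
      # Completeness: the mart should account for every source row.
      - dbt_utils.equal_rowcount:
          compare_model: source('production', 'shipments')
    columns:
      - name: shipment_id
        tests:
          - not_null
          # Relational check: every mart row traces back to a staged shipment.
          - relationships:
              to: ref('stg_shipments')
              field: shipment_id
```

These run in CI on every pull request, so a model change that drops or duplicates rows fails before it is released.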
This modularity allows us to partner with our Machine Learning Engineering team as creators and consumers. When they craft a query or deploy a model that generates data, those outputs can be added to the marts or combined with some other set of modules. Likewise, key values or segmentations used for training ML models can be added for common use. Another benefit here is that when a model graduates to a feature we simply sub in the source systems and transition to production visibility very quickly, having already built pipelines for data visualization or other analysis.
For those of us working as analysts, especially at start-ups, migrating ownership of analytics models from engineering to the people who will use the data directly feels like a natural progression. While some would say this is just Kimball by another name, the consolidation of data ingestion and the opening up of transformation and modeling processes allows data teams to build sustainable resources in the act of discovery and analysis.
Many analytics teams measure themselves by a time to insight metric (or at least their stakeholders do) and in some orgs this has been at odds with the initiative to automate systems and follow better development practices. Analytics engineering aims to resolve that conflict by embedding automation and engineering into the process.
A good data modeling process empowers the business to make data-driven decisions, enables curated self-service, and ensures scalability, reliability, and shared context. Part of what makes tools in the Modern Data Stack so successful is that their value is obvious to users. Our goal is to make the value of the data team as obvious to our stakeholders as these tools are to us.
In the future we’ll follow up with posts about our data ingestion processes, metrics modeling, and some alternatives to wide mart tables that we’ll be using for event data.
Learn more about Mothership and join our team!