How to design scalable analytics

Alex Mao
Loblaw Digital
10 min read · Jan 9, 2023


When I first started leading the Marketing Analytics & Insights team in March 2021, I was amazed by all of the curated datasets the previous team had built up in the preceding two years. They spanned a rich spectrum of marketing, product, transactional, and aggregated customer-level data. The amount of work was extraordinary and I was excited to build on this great foundation.

As I began learning about the troves of pipelines and semantic layers, I wondered how so much could have been accomplished in such a short amount of time. I dug further and it became clear there was an unfortunate lack of documentation and process. The team had grown incredibly fast. They prioritized getting work done, which meant there was insufficient time to focus on developing appropriate processes and standards.

The result was opaque logic: code that depended on other code, which in turn depended on yet more code, sometimes across different tools. The lack of documentation and clear naming conventions made it difficult to discern dependencies across pipelines. As issues arose, it became difficult and sometimes impractical to fix them.

But there was hope. I had heard of new tools implemented by our Business Intelligence team, including Data Build Tool, and knew we could rebuild the work with a robust set of processes and practices.

Industry wide issues across data & analytics

Unfortunately, the problems I encountered are universal to the data and analytics industry. Analysts and data scientists need good, documented, and contextualized data. A lot of work goes into shaping and processing raw data. It’s the mise en place — the washing, chopping, and preparation of your ingredients — before they can be cooked. The unconscionable amount of time spent on this work is well reflected in the meme below.

Data processing and curation is crucial but often a grind. Piecing together information from tribal knowledge, without properly standardizing and scaling its outputs, leads to too many people spending too much time reinventing the wheel again and again. This also contributes to misaligned data sources and definitions.

There are other issues as well. In a large company, analytics teams by and large remain disconnected from each other and continue to work independently. Any expertise gained and the final outputs of their work are often localized to the team and individual contributors that completed it. This reflects the root cause of the problem: analysts and data scientists function merely as consumers of data. We develop expertise in certain domains and datasets but there is no scalable way for us to contribute back to the greater data ecosystem.

From services to products

Back to the story of our team’s transformation. In the beginning we didn’t have a clear idea of where to start. As we discussed possible solutions to our problems, we were intrigued by the idea of products. Organizationally, we sit within a product vertical. So instead of working on projects, we would develop products with clearly defined scopes, objectives, and carefully gathered requirements. We would also apply standards across the overall product lifecycle, from development to QA to production to enhancements and possible deprecation.

It was then I realized I had known about data products for many years; they just weren’t positioned as such. A team at my previous company had built a ‘customer data warehouse’ over the course of two decades. It was the most popular source of data for analysts and data scientists alike: well-documented, well-designed, carefully curated datasets.

However, this team had dozens of people, the meticulous work was done slowly, and there was little room for analysts to contribute back to this data. I had a template for how the ‘product’ side of this change could be managed, but how would we execute on the vision with only a handful of analysts? As we discovered, this was where analytics engineering came into play.

Analytics engineering

At its core, analytics engineering is an application of software engineering best practices, like modularity, version control, and use of a shared repository for code, to analytics. You can think of this practice as multiple teams building Lego blocks in parallel and eventually connecting into one another to form a larger structure. This is in contrast to what we commonly find in analytics — spaghetti code that can’t hold its own weight.

Analytics engineering vs data engineering

To understand analytics engineering it’s helpful to compare it against data engineering.

Historically, the industry paradigm for data warehousing has been extract, transform, load (ETL). With the rise of modern data warehouses like BigQuery and Snowflake, the paradigm has shifted to extract, load, transform (ELT). Instead of transforming data prior to loading, all relevant data is loaded into the data warehouse and then transformed, for maximum flexibility and robustness.

Data engineering is responsible for the extraction (from APIs or other sources) and loading. It is heavier on the engineering side and has strong processes to guide its development cycle. There is less dependency on understanding the data, since transformations are typically limited to normalization and formatting. Analytics engineering, on the other hand, is primarily concerned with the transformation and combination of disparate raw data into more usable curated datasets. This requires an engineering mindset as well as a deep understanding of particular sets or domains of data.
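
To make the division of labour concrete, here is a minimal ELT-style sketch. It assumes data engineering has already loaded hypothetical raw.orders and raw.customers tables into the warehouse; the transformation and combination then happen in SQL on the warehouse side. All table and column names are illustrative only, not from our actual pipelines.

```sql
-- Hypothetical ELT transform: the raw tables are already loaded into the
-- warehouse; curation happens here, in SQL, on the warehouse side.
create or replace table curated.customer_orders as
select
  o.order_id,
  o.customer_id,
  c.loyalty_tier,
  o.order_total,
  date(o.created_at) as order_date
from raw.orders as o
join raw.customers as c
  on o.customer_id = c.customer_id
where o.status = 'completed'   -- standardized exclusions applied once, centrally
  and o.order_total > 0;
```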

Analytics engineering vs alternative operating models

The diagram below summarizes the operating models I have seen in data and analytics.

There are the traditional ETL or data warehousing teams who have a strong set of processes but also long timelines — often taking quarters if not longer from project kickoff to delivery. They also tend to be further removed from the use cases of the data, so the context and expertise on specific datasets can be lacking for meaningful curation. On the other end of the spectrum, there are business analytics teams who respond rapidly to business needs — pulling together reports and analysis within days or weeks. However, these teams tend to have few processes and are subject to the industry challenges described previously.

Analytics engineering aims to deliver both speed and quality in data transformation. It does so through a decentralized organizational structure across analytics teams with complementary domain expertise. Decentralization facilitates speedy development, while appropriate tooling and a strong set of shared practices allow close collaboration across teams.

Tooling

There are several tools that make analytics engineering work effectively. Data Build Tool (dbt) is the all-important hub that connects to the spokes of existing technology to enable analytics engineering. I first heard of it when I started at Loblaw Digital but didn’t give it much thought since it wasn’t widely adopted. But as we embarked on the journey to build products, my team started learning about dbt and analytics engineering. They were the ones who really sold me on dbt and the critical role it plays in analytics engineering.

So what does dbt do? It has two key features:

  1. Connects multiple tools to enable SQL as a full development language for data transformation.
  2. Creates lineage graphs to show all relationships from raw data sources to output datasets and beyond.
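
To make the first point concrete, here is a minimal sketch of a dbt model: a plain SELECT statement saved as a .sql file, which dbt compiles and materializes in the warehouse. The model and column names are hypothetical, not taken from our project. The ref() call is also what powers the second point, since dbt uses these references to assemble the lineage graph.

```sql
-- models/marts/fct_daily_orders.sql (hypothetical model)
{{ config(materialized='table') }}

select
  order_date,
  count(*)         as order_count,
  sum(order_total) as revenue
from {{ ref('stg_orders') }}  -- upstream model; this reference becomes an edge in the lineage graph
group by order_date
```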

Other tools connected to the hub

  • ‘Modern’ data warehouses like BigQuery and Snowflake
    - These warehouses can both store and process data at petabyte scale
  • Git
    - The industry standard for version control and managing personal development branches and production branches of code
  • Airflow
    - Powerful scheduling and monitoring tool

The most important of these connections is to Git. It allows multiple teams to view, develop, and contribute to the same code base following DevOps best practices. It also enables version control and, by extension, standardization of data sources and logic through this shared code base.

Lineage graphs

Each block in the visualization is a data source or a piece of logic.

https://docs.getdbt.com/docs/collaborate/documentation

Take, for example, a common set of filters used for a particular dataset. In the past, an analyst would need to remember to apply those filters every time they ran a query. Now we can materialize the query with the filters as a view and use that view as the standardized starting point for downstream queries. This becomes exponentially more powerful as more joins, processing, and logic are applied to the source.
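
As a rough sketch of what this looks like in dbt (assuming the raw table has been declared as a dbt source, and with hypothetical names throughout):

```sql
-- models/staging/stg_transactions.sql (hypothetical staging model)
{{ config(materialized='view') }}

select *
from {{ source('raw', 'transactions') }}
where is_test_transaction = false   -- the common filters live here, once,
  and transaction_amount > 0        -- instead of in every analyst's query
```

Downstream models simply select from {{ ref('stg_transactions') }} (or query the resulting view directly) and inherit the standardized filters for free.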

Upskilling and standards

Starting from challenges, we’ve arrived at analytics engineering as a solution. But how do we implement it? As analysts, we already had SQL as a core expertise and basic data modeling experience. What we lacked was experience with software engineering tools and standards: Git, an IDE like Visual Studio Code for development, and dbt itself. We also wanted to onboard to an agile project management practice with two-week sprints.

To accomplish our learning objectives, we found it extremely helpful to make time for them by creating stories and epics. This way the upskilling process was formalized as part of the work our team delivered and updates could be easily communicated to stakeholders.

In addition to our skill sets, we also needed to create a set of standards — naming conventions, folder structures, and workflows — to ensure we were moving forward together as a team. Initially we set the scope of these core practices to cover development and pushing to production. However, there were also many areas we wanted to develop further as we matured in our analytics engineering practice: ongoing data quality checks, robust documentation, lifecycle management, and standardized communication of changes and updates. The work in these areas is ongoing as we continue to evolve our practice.

The results of this journey — data products, analytics products, and the analytics platform

The results of analytics engineering and our services-to-products strategy are data products and analytics products. Together, these form the analytics platform, an ecosystem designed for analysts and data scientists.

Data products

Data products are curated datasets developed through analytics engineering with a managed lifecycle of enhancements, fixes, and eventual deprecation. There are two main types of data products — foundational and derived.

Foundational data products are built on raw data for a wide range of potential use cases, with very fine granularity. Since they query data directly from the source, foundational data products include staging steps that define standardized exclusions and highlight useful columns. Development of foundational data products is executed by, or in collaboration with, teams who are experts in the source data.

Derived data products are built primarily on foundational data products. Because standardized transformations have already been applied to the raw data, it can be worked with more easily. The goals of derived data products are more specific and the scope of their use cases more limited.
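
As an illustration, a derived data product might be a simple aggregation over a foundational one. The sketch below assumes a hypothetical foundational model called fdn_transactions; none of these names come from our actual products.

```sql
-- models/derived/drv_customer_weekly_spend.sql (hypothetical derived product)
{{ config(materialized='table') }}

select
  customer_id,
  date_trunc(transaction_date, week) as spend_week,
  sum(transaction_amount)            as weekly_spend,
  count(distinct transaction_id)     as transaction_count
from {{ ref('fdn_transactions') }}  -- standardized exclusions already applied upstream
group by customer_id, spend_week
```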

Analytics products

Analytics products are the interface between foundational or derived data products and the end user, whether that’s an analyst or a business stakeholder. With our current tech stack, analytics products refer to Looker Explores, Looks, Actions, and related code in Cloud Functions. To give a sense of the workflow, an Explore is developed from several data products and is then built into various Looks on an ad hoc basis. The fine granularity of many data products enables Looks to be highly flexible, capable of answering a suite of business questions or providing quick access to the underlying data. This standardizes ad hoc analysis and data pulls while significantly reducing turnaround time.

Some insights can be directly leveraged using pre-built Actions, which trigger certain Cloud Functions. A potential use case is to use Looks to identify target audiences and Actions to build them by updating an audience inventory table in BigQuery.
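
As a rough sketch of that use case (all table, column, and audience names are hypothetical), the Cloud Function behind the Action might run a statement along these lines against BigQuery:

```sql
-- Append the customers surfaced by a Look to an audience inventory table,
-- skipping anyone already in the audience. Names are illustrative only.
merge `analytics.audience_inventory` as inv
using (
  select
    customer_id,
    'low_spend_customers' as audience_name,
    current_date()        as added_on
  from `analytics.drv_customer_weekly_spend`
  where weekly_spend < 50   -- stand-in for whatever filter the Look applied
) as cand
on inv.customer_id = cand.customer_id
   and inv.audience_name = cand.audience_name
when not matched then
  insert (customer_id, audience_name, added_on)
  values (cand.customer_id, cand.audience_name, cand.added_on)
```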

The analytics platform

The analytics platform describes the ecosystem of ongoing, decentralized development of data and analytics products. We can imagine a forest of various trees and vegetation — data and analytics products that comprise the analytics platform ecosystem. They make up the habitat that allows forest dwellers, like analysts and data scientists, to thrive.

In this way, we are no longer limited to our roles as consumers of data. We become squirrels who live off the acorns and also help to grow the forest by burying nuts and (accidentally) planting trees. As we contribute our knowledge back in the form of data and analytics products, we help to grow our shared ecosystem.

Closing thoughts

In the past decade, we’ve seen enormous industry-wide innovation in the technical and tooling domains, where cloud platforms, Python/R, and the like have solved many ‘Big Data’ and AI/ML challenges. But process-related issues like silos, non-standardized data sources, and misaligned logic across teams still plague the data and analytics industry. There is still too much time devoted to data preparation and cleaning, work that can and should be scaled through the development of data products.

Our goal is to continue to enhance our analytics engineering practice, build more data products in collaboration with other teams, and showcase the benefits of our proposed analytics platform more widely across the enterprise. We are planning 2023 data and analytics product development roadmaps with several teams and we are thrilled to have so many folks join us in this journey.

Learn more about our work and our team at loblawdigital.co
