Improving Looker Governance With User-Centric Models and Automated Tests

If you’ve ever worked with Looker, you might have noticed that, although it is one of the most powerful data analytics tools out there, it can be difficult to govern and maintain. This is especially true when you have a large team of analysts and engineers contributing to the Looker model, each trying to respond to their stakeholders’ different requests.

At QuintoAndar, we have almost 30 data analysts who can contribute to our Looker model. Although analysts are technical folks, our objective is to make Looker as accessible as possible to our hundreds of stakeholders in product and operations teams.

Because of this, and after much trial and error, we’ve created a series of modeling guidelines as well as automated tests to enforce them, which we’ll cover in this post.


What is the Looker model and what is it for?

Looker uses its own modeling language, LookML, to define a model on top of a database. This model defines what users will see in their interface, what data they can query, and how.

You can think of the Looker model as a description of the tables and relationships in your database. When users create a chart in Looker, the tool uses the model to generate SQL on the fly that queries the data from your database (you can read more about LookML here).

How Looker works. Credits: Looker

The naive way: let’s model everything!

When we started using Looker, our first attempt was to model absolutely everything that existed in our data warehouse, which follows the star schema modeling approach.

This meant creating explores that started from our fact tables, joined every possible dimension, and brought in all the fields from all of those tables. Then, when an analyst wanted to add another measure or join to the model, they would usually just add it to an existing explore.

Our thought was: since this is a data self-service tool, let’s give users everything possible and they’ll just figure out what they want, right? Turns out we were very wrong.

A diagram of two Looker models: above, the naive approach with every possible join; below, the refactored version with just a few joins. The names of the joins were anonymized.

A few months and hundreds of commits in, our models were so polluted they were almost impossible to use. This caused the following problems:

  • Bad usability (there were too many options; too many ways to do the same thing)
  • Duplicated and redundant measures (sometimes with different results)
  • Overly complex models and explores, with too many joins and fields (difficult to maintain)

A better way: a user-centric model

All of these problems led to inconsistent analytics and, eventually, user dropout. In a nutshell, we were not using Looker at its fullest capacity.

It turns out that users don’t want to be overloaded with content. Instead, they want well-defined, specific, and curated content that helps them find the data they need and get their job done.

KPI overload, Marketoonist, Tom Fishburne

With this in mind, we took a step back and answered these questions:

  1. What do we want from Looker (versus our other BI tools, mainly Metabase)?
    A: Looker should be the first and easiest option for users to visualize and analyze our main company metrics, and to do exploratory data analysis
  2. What do we not want from Looker?
    A: Looker should not be the place for analyses that are too specific (e.g. too granular, or related to a very specific project)

Defining new model guidelines

Based on these definitions, we redefined our contribution guidelines, and then proceeded to refactor our models, explores, and views accordingly. We summarize these guidelines below.

1. A model should contain a specific business topic (and its name should reflect that)

Previously, we had models with generic names. For example, one was called “Conversion,” and it was related to the conversion of our apartment listings into rental contracts. But other parts of our business funnel also have conversion metrics, so we renamed it to something more descriptive: “Rental Listing Demand.” This also required us to take a step back and create a diagram that exhaustively mapped all of our business topics in a coherent way, something that didn’t exist before.

2. Explores should permit analysis of a specific aspect within a business topic

Before, we often had explores that covered more than one aspect of a topic. For example, in a single explore you could count active users, measure lead conversion, or check average rental prices. We remodeled these so that each explore was related to a single aspect of a business topic, which makes it much easier for users to find the measures and dimensions they need.

3. All explores should have a defined grain (and use left joins)

Much like fact tables in a star schema, explores should have a defined, unique grain. For example, if my explore is called “Active Users,” then each row of the explore should be a unique active user. One consequence of this is that explores should only use “left outer” joins, since other types of joins could break the granularity of the explore.

4. Display only the data that the user needs

Too many fields and joins can leave a user overwhelmed, which leads to confusion and faulty analytics. We limit the joins and fields we display in an explore to the minimum, to make it easier for users to find what they need. We use sets and the fields parameter in explores to make this easier.

5. Document absolutely everything

It’s essential for users to know what the data they’re using actually means. Descriptions for fields and explores are a key way to achieve this, and we use them extensively (and made them mandatory, as you’ll see in the tests below).

6. Break free from your Data Warehouse conventions when necessary

As described above, at the beginning we took the approach of simply mirroring our DW structure in our Looker model. However, we found that it was sometimes necessary to do things in Looker that departed from our star schema model. For example, we might start an explore from a dimension instead of a fact table. Or we might radically rename an explore to something that has little relation to the underlying database table. And it turns out that this was fine, as long as it made it easier for end users to find what they needed.

Automated testing to ensure model quality

It’s great to have a shared set of guidelines that everyone agrees upon; however, we’ve found that, unless you’re able to enforce them via automated testing, it’s very difficult to maintain a high-quality code base. This is why we wrote automated tests to enforce our guidelines, as well as overall code quality.

Every time someone pushes a commit to our Looker model repository, our Drone CI pipeline runs tests on the code. We use the lkml library to parse our model and run the tests with pytest.
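To give a concrete (if simplified) picture of this setup, the sketch below shows how the parsing step might look. The directory layout and helper name are illustrative, and the exact dictionary keys returned by lkml can vary between versions:

```python
import glob

import lkml


def load_views():
    """Parse every LookML view file and return (file path, parsed view) pairs."""
    views = []
    for path in glob.glob("views/**/*.view.lkml", recursive=True):
        with open(path) as f:
            parsed = lkml.load(f)  # lkml.load turns a LookML file into a plain dict
        for view in parsed.get("views", []):
            views.append((path, view))
    return views
```

The test sketches below build on this helper, so you can picture them all living in the same pytest module.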

Our automated tests for code and model quality

A simple test, for example, guarantees that all dimensions in the model have descriptions. This is important because descriptions are really helpful for end-users to understand our data:
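A minimal sketch of such a test, parametrized over the views returned by the load_views() helper above (the test and parameter names are ours, for illustration only):

```python
import pytest


@pytest.mark.parametrize("path,view", load_views())
def test_all_dimensions_have_descriptions(path, view):
    # Every dimension should carry a non-empty description for end-users.
    for dimension in view.get("dimensions", []):
        assert dimension.get("description"), (
            f"Dimension '{dimension['name']}' in {path} is missing a description"
        )
```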

Other tests enforce folder organization. For example, this one makes sure that all the view files are located in a folder called views/:
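A sketch of this check, assuming view files follow the *.view.lkml naming convention:

```python
import glob
import os


def test_view_files_are_in_views_folder():
    # Find view files anywhere in the repository and flag those outside views/.
    misplaced = [
        path
        for path in glob.glob("**/*.view.lkml", recursive=True)
        if not os.path.normpath(path).startswith("views" + os.sep)
    ]
    assert not misplaced, f"View files found outside views/: {misplaced}"
```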

We also test things such as naming conventions. For example, whenever we use a measure that is an average of values, we want it to be prefixed with avg_. This helps make our fields consistent and readable for users:
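A sketch of the naming-convention test, again parametrized over the views returned by load_views() (in LookML, an average measure is declared with type: average, which is what the check keys on):

```python
import pytest


@pytest.mark.parametrize("path,view", load_views())
def test_average_measures_are_prefixed_with_avg(path, view):
    # Measures of type "average" should follow the avg_ naming convention.
    for measure in view.get("measures", []):
        if measure.get("type") == "average":
            assert measure["name"].startswith("avg_"), (
                f"Measure '{measure['name']}' in {path} should be prefixed with avg_"
            )
```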

For what can’t be tested (yet), trust your code reviewers

As much as we want to create tests for everything, some of our guidelines are very difficult to test automatically. For example, we can’t automatically check whether a model or explore name was chosen well, or whether an explore’s grain is actually guaranteed.

Because of this, we’ve implemented a serious code review process in our Looker model repository. Data analysts contribute by creating pull requests, and part of our team is designated as “code owners”. Among our code owners, we have at least one analyst from each tribe (our tribes are organized according to business topics).

For a PR to be approved, we require at least two approvals from code owners. This guarantees that every PR will need at least one approval from an analyst in a different tribe.

When reviewing PRs, analysts should use the development mode to see exactly what users will see, thus putting themselves in their shoes. Credits: Looker

Reviews from analysts in other tribes are great, since they will likely not have the full context of what is contained in the PR. This is positive, as it forces reviewers to put themselves in the mindset of users, who will often not have all the context on the data they’re seeing. For this, we take advantage of Looker’s development mode, which allows you to see exactly what users will see once the pull request is merged.

By refactoring our Looker model and learning from this process, we hope that in the future we will be able to create even more automated tests to help ensure that our guidelines are being followed. But in any case, we are sure that good code reviews will still be necessary to achieve better Looker governance.

Good data governance leads to good analytics

At QuintoAndar, we want to enable every team to make decisions with data. For this to happen, we need to make sure that users are able to find the data they need, understand it, and use it. Refactoring our Looker model is just one of our many ongoing data governance initiatives.

We believe that data governance should be (like the product we develop) user-centric, since in the end it is our own users who will use the data to make decisions. We will continue working to improve our data governance, as we believe this will lead to better analytics for everyone in the company.
