Why and how we created the Taxfix glossary of terms and metrics
Dolf ter Hofsté from our Data Insights team walks us through how he and his team set up the Taxfix glossary of terms and metrics, why it was an essential step for a rapidly scaling business, and how we’re looking to automate it in the future.
Why why why?: The need for a Taxfix glossary
As Taxfix grows, so do our Data team and the amount of data we have available. With this rapid expansion, the meaning of specific terms can become unclear. We might ask ourselves: “When do we consider a person using the app a user?” or “Does Net Revenue include taxes and discounts, or not?”
In 2021, we decided to implement a glossary of terms to tackle this challenge. Our goals?
- Facilitate understanding and onboarding, by defining terms
- Improve communication, providing one source of truth for definitions of terms
- Establish ownership, by assigning owners for specific terms and business areas
- Improve productivity, by making reports easier to build
- Increase trust, by having confidence in the information
We had tried to set up a glossary at Taxfix before. Back then, there was no clear process for setting one up, which made it hard to get the glossary adopted. This time around, we decided to learn from other companies’ successful examples. This discovery process helped us gain clarity and confidence in creating the Taxfix glossary.
What is a (business) glossary?
Since several related terms are used in the metadata space, it’s only appropriate to start with three important definitions:
- Business glossary: A list of business terms with their unique definitions.
- Data dictionary: A repository of information about data.
- Data (discovery) catalogue: An inventory of data assets, to help find appropriate data for analytical purposes.
Although these are separate items with separate uses, they are connected. For example, the glossary and dictionary are usually linked so that the business definitions can be written in terms of elements in a database. A data catalogue, in turn, contains the elements of the data dictionary as well as complete data sets.
We started by reading about the experiences of other companies and found three very useful resources:
- Štefan Urbánek, former Facebook & Squarespace tech lead for data warehouse architecture: A slide deck on taking an MVP approach to metadata problems
- Carl Anderson, Sr. Director of Data Science at Weight Watchers, formerly of Warby Parker: A list of data dictionary best practices
- Dataedo, a company selling metadata solutions: A list of best practices from companies they have worked with to build business glossaries
Following what we learned from these resources, we started with a spreadsheet. We quickly moved the work to Notion, the tool we use for our internal documentation, to make it readily accessible for everyone in the company. We seeded the glossary with an existing list of abbreviations used in the company, then went domain by domain, team by team, and made an inventory of what each team already had. We talked to Brand, Operations, Finance, CRM (Customer Relationship Management), Marketing, and multiple Product Management teams. Most teams had a person who supported the effort: an analyst, a product manager, or an enthusiastic data user.
We started defining terms by pulling definitions from existing documentation or from actual code. It wasn’t long before we found out we were collecting two things: how data is sliced (dimensions) and business concepts (metrics). For the dimensions, we listed the definitions only.
For the metrics, we recorded:
- Name of the metric
- The domain that represents the area of activity, making it easier to check if the domain is fully covered
- The abbreviation or alias we use. For example, our install-to-registration conversion rate is usually abbreviated as I2R
- Data type
- Definition and calculation in words, avoiding abbreviations and tautologies
- A link to a Looker dashboard, so an untrained data user can easily start exploring
- The owner, meaning the person responsible for the business metric.
Note: Looker is the visualisation tool we use for descriptive analysis at Taxfix.
If there are competing ideas on how a metric should be calculated, the metric owner makes the final decision. The owner should also make sure all employees are aligned on the definition and aware that it exists.
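The fields listed above map naturally onto a small record type. The following is a minimal sketch in Python, not the structure Taxfix actually uses (the glossary lives in Notion). The class name, field names, helper functions, the "Product" domain, and the example definition text are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetricEntry:
    """One glossary entry for a metric (hypothetical field names)."""
    name: str
    domain: str
    aliases: tuple[str, ...]
    data_type: str
    definition: str
    dashboard_url: str
    owner: str


def missing_fields(entry: MetricEntry) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


def entries_by_domain(entries: list[MetricEntry]) -> dict[str, list[str]]:
    """Group metric names by domain to help check coverage per domain."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry.domain].append(entry.name)
    return dict(grouped)


# Illustrative entry only: the domain and definition text are placeholders,
# and the dashboard link and owner are deliberately left empty.
i2r = MetricEntry(
    name="Install to registration conversion rate",
    domain="Product",
    aliases=("I2R",),
    data_type="percentage",
    definition="Placeholder wording, not the actual Taxfix definition.",
    dashboard_url="",
    owner="",
)

print(missing_fields(i2r))
print(entries_by_domain([i2r]))
```

A check like `missing_fields` makes it visible which entries still lack an owner or a dashboard link before they are presented to the wider company.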
The first successful entries in the glossary came from the discussions around common terms that different teams use. In a few cases we had to introduce a different word for one of the concepts, but mostly a misunderstanding had led to multiple interpretations of the same term. When we felt we had a significant number of items in the glossary, we started marketing it by presenting it in our weekly all-hands meeting and referring to it as “The green book” (Taxfix’s colour is green), with its own emoji: 📗. We also show the glossary to new employees joining Taxfix in the data onboarding session.
Repetition is the mother of learning, as they say. Whenever someone reaches out to the data team to ask a question about the meaning of a term, we respond by pointing to the corresponding entry in the glossary. If the item is missing or the definition is incomplete, we add it. This way, every question leaves the glossary a little more complete.
The main improvement we made was the implementation of a change process. Whenever a definition changes, we follow this process:
- We comment on the existing definition in the glossary or we add a new entry
- The Analytics Engineering team reflects the change in the relevant data model or dashboard
- We alter the definition in the glossary
Using Notion for the glossary makes this easy because every item has a subpage. This means stakeholders can easily see comments.
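The three steps of the change process can be read as a small state machine in which a change cannot skip a step. The sketch below is an assumption-laden illustration, not tooling we run: the step names `PROPOSED`, `IMPLEMENTED`, and `PUBLISHED`, the `DefinitionChange` class, and the placeholder definition text are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeStep(Enum):
    PROPOSED = "commented on the entry or added a new one"
    IMPLEMENTED = "reflected in the data model or dashboard"
    PUBLISHED = "definition altered in the glossary"


# Allowed transitions: each step leads to exactly one next step.
NEXT_STEP = {
    ChangeStep.PROPOSED: ChangeStep.IMPLEMENTED,
    ChangeStep.IMPLEMENTED: ChangeStep.PUBLISHED,
}


@dataclass
class DefinitionChange:
    term: str
    proposed_definition: str
    step: ChangeStep = ChangeStep.PROPOSED

    def advance(self) -> ChangeStep:
        """Move the change to the next step, refusing to go past the last."""
        if self.step not in NEXT_STEP:
            raise ValueError(f"{self.term!r} is already published")
        self.step = NEXT_STEP[self.step]
        return self.step


# Placeholder wording, used only to exercise the workflow.
change = DefinitionChange("Net Revenue", "Placeholder definition text.")
change.advance()  # Analytics Engineering updates the model or dashboard
change.advance()  # the glossary definition is altered last
print(change.step.name)
```

Keeping the glossary update as the final step means the published definition never describes a calculation that the data model does not yet implement.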
As the glossary is constantly evolving, it is becoming harder to maintain in its current form. We are looking for automation solutions to help us manage the change workflow. We’ve watched the discussions around the concept of “Headless BI” and are currently investigating whether the metrics layer, introduced in dbt v1.0, can serve as a central metrics repository for us. The idea is that this central repo would hold the definitions of the items that are currently defined in the glossary, plus the actual code for calculating the metrics.
As for the glossary itself, we would like to keep it as the central place where everyone in the company can read about the definitions we use internally. However, in the future, we want to start generating it from our central metrics repository. We would then also want to generate the semantic layer for our data tooling, such as the LookML for Looker, from this central repository. This way, changes in definitions will also propagate to the producers of new data insights.
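To make the idea concrete, here is a rough sketch of how one central definition could feed both the glossary text and a LookML measure. It does not use dbt's metrics schema or any Looker API; the `MetricDefinition` class, its fields, both generator functions, and the example metric are hypothetical. It also assumes labels and descriptions contain no double quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """A single source of truth for one metric (hypothetical shape)."""
    name: str          # machine-readable identifier
    label: str         # human-readable name shown in the glossary
    description: str   # definition in words
    aggregation: str   # LookML measure type, e.g. "sum" or "count_distinct"
    sql: str           # SQL expression the measure aggregates


def to_glossary_entry(metric: MetricDefinition) -> str:
    """Render the plain-text line a reader would see in the glossary."""
    return f"{metric.label}: {metric.description}"


def to_lookml_measure(metric: MetricDefinition) -> str:
    """Render a LookML measure block from the same definition."""
    lines = [
        f"measure: {metric.name} {{",
        f'  label: "{metric.label}"',
        f"  type: {metric.aggregation}",
        f"  sql: {metric.sql} ;;",
        f'  description: "{metric.description}"',
        "}",
    ]
    return "\n".join(lines)


# Illustrative metric only; the wording and column name are placeholders.
registrations = MetricDefinition(
    name="registrations",
    label="Registrations",
    description="Placeholder definition, not the actual Taxfix wording.",
    aggregation="count_distinct",
    sql="${TABLE}.user_id",
)

print(to_glossary_entry(registrations))
print(to_lookml_measure(registrations))
```

Because both outputs are rendered from the same object, a change to `description` shows up in the glossary and in Looker at the same time, which is the propagation effect described above.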
Our Data & Insights team is looking for new members. Check out our career site to learn more.