Automating simple processes to allow time for better analysis

Pedro Paiva
Bondora Engineering and Data
4 min read · Feb 27, 2023

Our day-to-day is full of all types of activities. Typically it involves talking with stakeholders during dailies and ad-hoc meetings to better understand which analyses are needed and the context surrounding them; working on the analysis itself; and, finally, ensuring it becomes part of our monitoring through our datasets and dashboards.

At first glance, it seems pretty straightforward, but we must remember that these steps often require small processes even to get started! When working in a fast-growing and fast-developing environment, we constantly have new tables and changes to existing tables, both in our services’ databases and in our analytical one (through our DBT models). To accommodate new products and features, we need to keep all of this in sync with Looker, our primary BI tool, so that our dashboards and graphs can be built with all the data available.

Although Looker has a built-in LookML generator, whenever tables change someone has to rerun it or manually update the LookML to include the new changes. As we all know, this is a huge failure point: we often forget this simple step and have to debug our way around an issue just to realize the update was missing all along. How many times have you forgotten you were working in development mode, no longer up to date with the main branch?

On top of that, what if you always want every view to include extra measures, custom filters made available through liquid variables, and so on?

First, we built our very own LookML generator so that we could include this custom-made code in every view. Then we set it up to run automatically whenever our databases change, keeping our LookML files always up to date!

Let’s have a look at how this works:

Base Settings

Our environment runs on top of Databricks, which allows us to build Python notebooks and schedule them to run daily. Within Databricks we are fully on Unity Catalog, which gives us a three-level namespace and the convenient information_schema mentioned below. We also use Azure DevOps, and together with GitPython it lets us manipulate and operate on git folders and repositories. Looker is also connected to Azure DevOps, where we keep its repository; for the folder structure, we went with a /base folder holding all the machine-generated code and a /refinements folder where we adjust and create extra content specific to our views and explores.

The Logic

Every catalog (database) in Databricks has an information_schema.columns view containing a list of all the tables and columns available, as well as their data types and extra information, so we can simply query it and build our very own LookML from it. We start with a basic template that can be as simple as the following:

LOOKML_TEMPLATE = """
### ==================================================== ###
### MACHINE GENERATED! DO NOT UPDATE THIS FILE MANUALLY! ###
### ==================================================== ###

view: {table} {{
  # The sql_table_name parameter indicates the underlying database table
  # to be used for all fields in this view.
  sql_table_name: {schema}.{table} ;;
  suggestions: yes

  # DIMENSIONS
  # ==========
  {dimensions}

  # MEASURES
  # ========

  measure: count {{
    type: count
  }}
  {measures}
}}
"""

We create one of these for every table, and for every column we add the relevant dimensions and measures according to the column’s data type. We have an extensive dictionary covering all the data types we use; STRING, for example, maps to the following dimension:

"STRING": """
dimension: {column_name} {{
type: string
sql: ${{TABLE}}.{column_name} ;;
}}
""",

and it also maps to an empty measure string (we don’t have measures associated with string types). BOOLEAN, on the other hand, maps to the following dimension:

"BOOLEAN": """
dimension: {column_name} {{
type: yesno
sql: ${{TABLE}}.{column_name} ;;
}}
""",

but it also generates a couple of measures that might be useful. Note how we keep our flags as yesno dimensions here, so that in the measures we can cast them to integers and average or sum them, giving us a friendly and easy way to calculate conversion rates, for example:

"BOOLEAN": """
measure: avg_{column_name} {{
type: average
sql: ${{TABLE}}.{column_name}::int ;;
value_format_name: percent_2
}}

measure: count_{column_name} {{
type: sum
sql: ${{TABLE}}.{column_name}::int ;;
}}
""",

Ultimately, the generated LookML should cover most of what’s automatically generated by Looker, plus some extra measures that better fit our context.
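
Putting it together, the per-table generation step might look roughly like the sketch below. DIMENSION_TEMPLATES and MEASURE_TEMPLATES stand in for the data-type dictionaries shown above, tables for the metadata gathered earlier, and REPO_NAME for the cloned repository folder used in the next section; these names and the generate_view_lookml helper are illustrative rather than our exact implementation:

def generate_view_lookml(schema, table, columns):
    """Render the LookML view for one table from its (column_name, data_type) pairs."""
    dimensions, measures = [], []
    for column_name, data_type in columns:
        # Fall back to a plain string dimension for types we haven't mapped yet.
        dim_template = DIMENSION_TEMPLATES.get(data_type, DIMENSION_TEMPLATES["STRING"])
        dimensions.append(dim_template.format(column_name=column_name))
        measure_template = MEASURE_TEMPLATES.get(data_type, "")
        if measure_template:
            measures.append(measure_template.format(column_name=column_name))
    return LOOKML_TEMPLATE.format(
        schema=schema,
        table=table,
        dimensions="".join(dimensions),
        measures="".join(measures),
    )


# One .view.lkml file per table, written into the machine-generated /base folder.
for (schema, table), columns in tables.items():
    with open(f"./{REPO_NAME}/base/{table}.view.lkml", "w") as f:
        f.write(generate_view_lookml(schema, table, columns))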

Dealing with GIT

GitPython allows us to run git commands straight from our Python notebook to clone our Looker repo into the Databricks environment. For every table available in our analytical domain, we generate our custom LookML and save it into our /base folder. In the end, the Azure DevOps API allows us to open a Pull Request and even merge it if we so wish.

The result looks something like this:

import git
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication


repo = git.Repo.clone_from(f"{REPO_URL}", f"./{REPO_NAME}")
repo.git.checkout("-b", DEV_BRANCH)

# custom function to create and save all the relevant LookML
generate_catalog_lookml(CATALOG_NAME)

repo.git.add(".")
repo.git.commit("-m", COMMIT_MSG)
repo.git.push("--set-upstream", "origin", DEV_BRANCH)

credentials = BasicAuthentication('', PERSONAL_ACCESS_TOKEN)
connection = Connection(base_url=AZURE_ORG_URL, creds=credentials)
azure_git_client = connection.clients_v6_0.get_git_client()
pr = azure_git_client.create_pull_request(
    {
        "title": f"{PR_TITLE}",
        "description": f"{PR_DESCRIPTION}",
        "sourceRefName": f"refs/heads/{DEV_BRANCH}",
        "targetRefName": f"refs/heads/{MAIN_BRANCH}",
    },
    REPO_NAME,
    project=PROJ_NAME,
)

# optionally, already merge the PR as well
# you might want to merge it manually for the first few times
# to ensure everything is running fine
azure_git_client.update_pull_request(
    {
        "status": "completed",
        "last_merge_source_commit": pr.last_merge_source_commit,
    },
    REPO_NAME,
    pr.pull_request_id,
    project=PROJ_NAME,
)

Conclusion

With this, we are sure our LookML views are always up to date. This simple tool saves us many hours every week that were previously wasted debugging and double-checking that the views were correct and current. We can skip ahead, confident that when our DBT models run, Looker and everything else needed will also be updated. We can dedicate this extra time to what really matters: pushing our analytical capabilities forward and bringing better insights to our teams and customers.
