Runaway Query Termination via Looker and Slack

Jay Ozer · Doma · Sep 28, 2020

I believe that the hallmark of tech-forward businesses in the 2020s will be the ability to combine business intelligence (BI) and data democratization (DD). BI tools allow users across the company to monitor sales, operational, and customer metrics live as they happen. DD allows individual non-technical users to go beyond just monitoring reports to actively modifying, adapting, and even creating unique views on the fly, using easy point-and-click interfaces.

One inevitable consequence of BI/DD platforms such as Looker is slow, non-performant queries. Because the SQL is generated procedurally from users’ clicks, strange inefficiencies come up frequently. It is important to identify those queries, shut them down, and then work with the users to improve them.

During overloaded and extremely busy times, I’ve found even the best-built database can lag with these types of queries. If you, too, find your queries are always queued and never seem to complete; if you, too, find your Looker dashboards are running slow, read on! An ever-growing list of scheduled jobs in the Queries section of Looker with no completed status in sight is never fun, especially on a Friday. I will show you how to find runaway queries that bog down your system, terminate them, and then post the details to Slack using the Looker SDK.

Before we get into the solution, some background. It all started when Looker users reached out to me to help solve two related issues: a steady increase in the number of incomplete queries (resulting in failed dashboards), and slow overall performance. I asked users to stagger their scheduled runs throughout the day instead of clustering them in the mornings, and we created workarounds for taxing queries such as merged queries and ones with custom filters. These measures helped, but they did not completely solve the issues. This motivated me to find a better solution!

“At Doma, we are solution-driven and believe that inclusion and teamwork produce the best results. These are part of our core values. Data democratization allows us to move quickly as one team.”

One component of data democratization is reducing the complexity and redundancy of data storage and processing. Looker enables centralized data sharing and easy data discovery and analysis across multiple input systems, but you must build it! We have been building a modern data stack at Doma with the philosophy of storing all data, all the time, in a uniform structure that allows for rapid data ingestion and consumption. Our Data Engineering Manager wrote up the details on Doma’s Data Science Blog.

Our current strategy is to allow all Looker users to develop and schedule dashboards, but as mentioned above, this can create inconsistencies and inefficient queries that often slow down the entire system. Optimizing queries is tough, with many nuances, but it is necessary for faster response times. We at Doma believe in continuous improvement of the entire reporting platform. That is why I will provide you with the methods necessary to locate runaway queries, kill them, and share the details on Slack so other Looker users can improve them as a community.

Prerequisites

Create the History dashboard:

This dashboard will provide Looker SMEs with actionable analytics and help them respond promptly. Creating metadata reports with i__looker is intuitive, and for this post, Looker Customer Love confirmed that this look should work on any instance. I saved mine as “Health Dash”. I hope you can come up with a better name! Here is the Expanded URL; simply add your instance name, and you should be ready. Remember, for users to see the History explore, they will need the ‘see_system_activity’ permission. To limit visibility to the dashboard, I created a new permission set and assigned it to each of our Looker users individually. After adding your instance name, save it as a dashboard because the available formats in a Look when posting to a Slack channel are limited to .csv and .txt, and neither would work here.

Health Dash

Create Looker.ini and the Slack App:

My goal here was not only to find a practical solution to an existing problem but an actionable one. Posting information to a Slack channel brings information in front of the larger Looker community, allowing instant insights and remediation of queries as indicated.

To set up Looker.ini, I followed the instructions on the Looker SDK PyPI site within the “Configuring the SDK” section. When this script was developed, the only versions available were 3.0 and 3.1. I have not tested this with the 4.0 version yet; if you do, please let me know how it works out for you!

Creating the Looker.ini file is as simple as creating a text file, populating it with the info in the instructions, and saving it in your project directory. The Looker.ini file holds the API version, the base URL for the API, your client ID, and the client secret. Your Looker admin can help you obtain your API keys.
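Following the template in the SDK instructions, the file looks roughly like this (all values below are placeholders; substitute your own instance and credentials):

```ini
[Looker]
# API version to use (3.1 at the time of writing)
api_version=3.1
# Your instance's API host and port
base_url=https://yourinstance.looker.com:19999
# API3 credentials from your Looker admin -- keep these out of source control
client_id=your_client_id
client_secret=your_client_secret
```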

If you want to replicate the environment I work in, consider AWS SageMaker. First, I created a base model, then installed the looker_sdk and slackclient packages. See requirements.txt.

To post to Slack via the API, you need to create a Slack app and install the slackclient Python package. I aptly called my app “lookerhealth.” There are many Bot Token Scopes to choose from; the combination of chat:write, chat:write.public, channels:read, files:read, and files:write worked for me. Another tip: remember to reinstall the Slack app every time a new scope is added. The Slack API documentation is easy to follow and comprehend — I was able to debug issues quickly with a little trial and error.

The method

  1. The following code imports the looker_sdk and slack libraries and loads all running queries. I am using API version 3.1.
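A minimal sketch of this step, assuming a Looker.ini configured as described above. `init31` and `all_running_queries` are Looker SDK 3.1 calls; the import is deferred into the function so the snippet stays self-contained:

```python
def fetch_running_queries():
    """Return every query currently running on the instance.

    Requires the looker_sdk package and a looker.ini file in the
    working directory (see the SDK configuration notes above).
    """
    import looker_sdk  # third-party: pip install looker_sdk

    sdk = looker_sdk.init31()         # 3.1 API; reads looker.ini for credentials
    return sdk.all_running_queries()  # one record per in-flight query

# Example (with a configured instance):
# for q in fetch_running_queries():
#     print(q.query_task_id, q.source, q.runtime)
```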

2. The following code kills queries running longer than the given threshold value and collects their QueryIDs in a list called results.
After following the queries for a couple of days, which I find fairly soothing, I concluded that none of the reports take more than a few minutes to complete; therefore, a ten-minute threshold is safe to set.

The “sources_to_exclude” variable lists three sources. For my use case, I didn’t want to cancel any jobs from these; however, you can alter the list to your needs.
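A sketch of this kill loop, under the same SDK assumptions. The three source names in `sources_to_exclude` are placeholders (take the real ones from your own History explore), the threshold is the ten minutes discussed above, and `kill_query` is the SDK call that cancels a running task:

```python
TEN_MINUTES = 600.0  # threshold in seconds, per the discussion above

def is_runaway(source, runtime_secs, threshold_secs=TEN_MINUTES,
               sources_to_exclude=("scheduler", "renderer", "suggest")):
    """True when a query has outlived the threshold and its source is not excluded.

    The excluded source names here are placeholders; substitute the ones
    from your own instance.
    """
    return runtime_secs > threshold_secs and source not in sources_to_exclude

def kill_runaways(sdk):
    """Kill every runaway query and return the collected QueryIDs."""
    results = []
    for q in sdk.all_running_queries():
        if is_runaway(q.source, q.runtime or 0.0):
            sdk.kill_query(q.query_task_id)  # cancel the running task
            results.append(str(q.query_id))
    return results
```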

3. The next step is to load the Slack API token and set the Slack channel. To append the QueryIDs as a suffix to the dashboard URL, some data prep was necessary: I replaced all spaces and dropped the square brackets from the list.
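That data prep can be sketched as below. The `Query+ID` filter parameter and the channel name are hypothetical; copy the real ones from your own dashboard's expanded URL and your own workspace:

```python
import os

def dashboard_link(base_url, query_ids):
    """Append the killed QueryIDs to the dashboard URL as a filter suffix.

    Stringify the list, then drop spaces, square brackets, and quotes,
    per the data prep described above. The "Query+ID" parameter name is
    hypothetical; take the real one from your dashboard's expanded URL.
    """
    suffix = str(query_ids).replace(" ", "").strip("[]").replace("'", "")
    return f"{base_url}?Query+ID={suffix}"

# Slack settings: keep the token out of source control
slack_token = os.environ.get("SLACK_API_TOKEN")
slack_channel = "#looker-health"  # hypothetical channel name
```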

4. The final step is to post the link to the Slack channel with the post_message_to_slack function. I adapted it from python-slackclient’s GitHub page.
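A sketch of that posting step, assuming slackclient 2.x (where `WebClient` and `chat_postMessage` live); the imports are deferred so the snippet stays self-contained:

```python
def post_message_to_slack(token, channel, text):
    """Post text to a Slack channel via slackclient's WebClient.

    Needs the slackclient package and a bot token carrying the
    chat:write scopes set up earlier.
    """
    from slack import WebClient            # third-party: pip install slackclient
    from slack.errors import SlackApiError

    client = WebClient(token=token)
    try:
        return client.chat_postMessage(channel=channel, text=text)
    except SlackApiError as err:
        # Surface the Slack-side error (bad scope, unknown channel, ...)
        print(f"Slack post failed: {err.response['error']}")

# Example (with a real token and channel):
# post_message_to_slack(slack_token, slack_channel, "Runaway queries killed: " + link)
```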

Putting it all together: https://gist.github.com/jayozer/96d42c9b12eecfd2b3d95eb70198ee04

5. Celebrate success! A link to “Health Dash” is posted to the Slack channel.

Slack post

Final thoughts

Currently, we plan to schedule the script at the top of every hour to catch runaway queries before they become issues. By utilizing this method, Looker users can rapidly free up their data pipeline from runaway queries, thereby preventing the system from being bogged down. At the same time, we help users to improve the queries so that the slowdown doesn’t happen again. Throw in an upgrade to Snowflake Enterprise, and the Doma dashboards are now running blazing fast!
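As a sketch, that hourly schedule could be a cron entry like the following (the script name and path are hypothetical):

```
# crontab -e: run the watchdog at the top of every hour
0 * * * * /usr/bin/python3 /opt/looker/kill_runaways.py
```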
