How we tracked Mode Analytics usage to reduce costs

Published in

Open House

9 min readNov 16, 2020

At Opendoor, data are critical in understanding historical and current business performance and also in predicting future performance. Consequently, the quality of our data and the presentation of that data are also critical.

We use Mode Analytics as one of our Business Intelligence (BI) tools to query data, create reports, and serve metrics that are shared across the company. However, as our volume of Mode reports grew, we realized we needed to keep better track of all reports and monitor the cost incurred by each.

This post explains how we used a combination of external tools and Mode metadata to improve the visibility of our Mode usage.

Background

To serve data in Mode, we established a direct connection between Mode and our Data Warehouse (Snowflake) database instance. Any queries executed on Mode will actually be executed in our Data Warehouse environment, and the cost of execution will be billed within the Data Warehouse, not Mode. The connection between our Data Warehouse and Mode uses a dedicated database user and role. Thus, we can identify all the queries run by Mode from Snowflake query logs.

Furthermore, Mode’s API offers analytical endpoints, such as count and last run time, through which we are better able to understand Mode usage. However, these data alone were not enough to answer questions like:

How many views did this report have in the past 30 days?
How many individual people viewed the report?

Answering these questions would be crucial to start taking action on high-cost reports.

Limitations

Analyzing cost and usage with data immediately available from Mode and our Data Warehouse has the following problems:

Accessibility
Data exist in different platforms and in different tables, so it is difficult to get all the data you need in one place.
Bias on Cost
While a high-cost report might immediately seem alarming, it isn’t always the end of the world. We also need to account for how often the report is run, how many people are actively using the report, etc., to understand if the high cost is worth it.
Bias on Usage
Mode’s API includes the view count of reports in the /reports endpoint, but does not include when the reports were viewed. For example, a view_count can be greater than 10,000, but have 0 views in the last month.

Analyzing Mode Report Cost (feat. Snowflake)

To analyze Mode reports’ costs, we first needed to understand the cost of each report execution. To do that, we needed more details than were immediately available to us on the queries within the report as well as the individual cost of one query execution.

Mode Query Execution Cost

Calculating Mode’s query execution cost is dependent on the Data Warehouse that your company chooses to use. As a result, we will not be going into much detail on how to derive cost from a Data Warehouse, but you can find more information about our Data Warehouse (Snowflake) cost in the following blog post.

Following is the set of dimensions we wanted to explore with Mode query execution cost and how we aggregated data to get them:

What are the most expensive reports per run?
Sum of cost grouped by mode_report_id and mode_run_id for all most recent report runs.
What are the most expensive reports [T28]?
Sum of cost grouped by mode_report_idfor all queries run in the last 28 days.

Mode Query Metadata

Once we had the data on how much query execution would cost, we needed more details on the queries within a report. Luckily, when you run a query on Mode, Mode adds certain metadata to that query run:

Report Identifier
Unique identifier for the report in which the query exists
Report Run Identifier
Unique identifier for the specific run of the report (a new identifier is generated per run)
Query Identifier
Unique identifier for the query
Report Creator
Email address of the user who created the report
Is Scheduled
Whether the report run was from an automated schedule or run manually

These additional metadata are appended to the end of each query as a comment at execution time. The metadata are ephemeral and controlled by Mode. For example, the query body from a log in our Data Warehouse for Mode might look like:

SELECT * FROM tutorial.flights;-- {"user": "@jaelee", "email": "<email address>", "url": "https://modeanalytics.com/opendoor/reports/8cdfbc5fb16a/runs/1235abvdkfj2/queries/12vidjf", "scheduled": false}

You can parse the above SQL text as following:

-- Parsing SQL (Snowflake)
SELECT OBJECT_CONSTRUCT(
         'mode_report_id',
         SPLIT_PART(
           REGEXP_SUBSTR(QUERY_TEXT, 'reports/[A-Za-z0-9]*\/'),
           '/',
           2
         ),         'mode_run_id',
         SPLIT_PART(
           REGEXP_SUBSTR(QUERY_TEXT, 'runs/[A-Za-z0-9]*\/'),
           '/',
           2
         ),         'mode_query_id',
         SPLIT_PART(
           REGEXP_SUBSTR(QUERY_TEXT, 'queries/[A-Za-z0-9]*'),
           '/',
           2
         ),         'user',
         GET(
           PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)),
           'user'
         ),         'email',
         GET(
           PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)),
           'email'
         ),         'scheduled',
         GET(
           PARSE_JSON(SPLIT_PART(QUERY_TEXT, '--', -1)), 
           'scheduled'
         )
       )
  FROM QUERY_LOGS

In order to serve the parsed metadata faster, we created a daily job that grabs all the queries for the given day and stores the parsed metadata in a new table.

Analyzing Mode User Activity (feat. Segment)

As mentioned previously, the view_count metrics offered by Mode with its API was not sufficient for us to make strong arguments as to why a report should be maintained (or removed). As a result, we needed to come up with a different way to collect user activity — Segment.

Segment is a third-party tool that collects user events from web and mobile applications. We integrated Segment with Mode and our Data Warehouse so that we can have Mode user activity ultimately pipe into our Data Warehouse. In other words, when a user accesses a Mode report, the event will be pushed to Segment and Segment will sync this data incrementally to our Data Warehouse for easy discoverability. Here are the steps we took to set this up:

1. Segment <-> Data Warehouse (Snowflake) Integration

Create a user in Data Warehouse that has read and write permissions.
Add a new destination in Segment for Data Warehouse
We use Snowflake for our Data Warehouse, but Segment also offers integration with other data warehouses.

Fill out the credentials for the database and user.

2. Segment <-> Mode Integration

Add a new Segment source for Mode.
Select Javascript for type of source.

Setup source configuration

3. Inject Segment Script in Mode

Once you set up the source in Segment’s source for Mode, Segment will provide you with a code snippet that you can insert into Javascript code in your Mode report.

Navigate to a report view in Mode and open the HTML Editor.

This view allows you to enter custom HTML / JS code that renders the graphs in your report. Insert the Javascript script provided by Segment here.

Click on “Apply HTML Changes”.

Segment Script Automation

The current setup so far introduces another problem of maintainability:

How do we ensure all reports have the script?

Luckily, using one of the API endpoints offered by Mode, we can update the HTML body via code as well. Therefore, we created a job that will get the full set of Mode reports, check if the Segment script exists, and inject the script if it is not present.

1. Set up Mode API Credentials

Navigate to Mode access token settings.
Create a token and save the credentials.

2. Get all reports

There are multiple approaches you can take to get the list of all reports in Mode. One way could be to use the API offered by Mode to get the unique report IDs. However, we decided to use query logs, since we did not want to exhaust the job with stale reports that have not been run (and stale jobs do not incur cost, anyway).

SELECT DISTINCT(METADATA:mode_report_id::VARCHAR) AS MODE_REPORT_ID
  FROM DATA_MART_AXE.SFLK_QUERY_LOGS
 WHERE START_TIME BETWEEN DATEADD(DAY, -1, CURRENT_DATE)
   AND MODE_REPORT_ID IS NOT NULL

3. Add Segment script if it does not exist

For each unique report, check if the Segment script exists. If it does not exist, add it. We added a custom header for the script so that identifying whether the script exists is easier. Also, add Mode metadata to the script so that the data returned to Segment would distinctly identify a unique report.

# Modify last piece of segment script as:
#
#   analytics.page(null, "{report_name}", {
#       mode_report_id: "{report_token}",
#       created_by: "{created_by_email_address}"
#   })# See Appendix A. for full code of ModeClient().mode = ModeClient()
mode_report_ids = <mode_report_ids>
segment_script = <segment_script>
segment_script_identifier = "<!-- [Opendoor] Injected Segment Script Fragment -->"for report_token in mode_report_ids:
    report = mode.get_report_by_token(report_token)
    layout = report["layout"]
    
    # For additional metadata on Mode reports
    # report_name = report["name"]
    # creator_api_url = report["_links"]["creator"]["href"]
    # creator = mode.get_user_by_username(creator_username)
    # creator_email = creator["email"]    if segment_script_identifier not in layout:
        report["layout"] = "\n".join([
            segment_script_identifier,
            segment_script,
            layout,
        ])        mode.update_report(report_token, report)

Alternatives to Injection Process

Mode allows users to subscribe to event based webhooks on events like reports being created, updated and run, etc. We explored creating an AWS Lambda endpoint to accomplish the above based on webhooks, but we decided not to undertake the overhead of creating and maintaining a new AWS Lambda endpoint.

Results and Next Steps

Once the Mode report metadata and usage metrics were being piped to our Data Warehouse via Segment, we were easily able to create Mode reports based on this data to track high-cost and high-use reports. We used this cost and usage data to make strong analytical arguments on what reports should be maintained and what reports should be deprecated.

We first focused on deprecating stale scheduled reports — reports that run and incur charges but have low or no views — and realized 5% in cost savings. Next, we began deprecating and optimizing high-cost reports.

Since many companies are focused on optimizing costs, we’re hoping our insights and strategy might help other teams looking to similarly optimize and save on their analytics costs.

Appendix

Appendix A. `ModeClient()` Full Code

from json
import dumps
from os import getenv
from typing import Dict, Optional
from urllib.parse import urljoin
 
from requests import request
from requests.auth import HTTPBasicAuth
class ModeClientError(Exception):
    pass
class ModeClient:
    ACCOUNT = "opendoor"  # a.k.a. ORGANIZATION
    HOST = "https://app.mode.com"
    BASE_URI = f"{HOST}/api/"
    USERNAME = getenv("MODE_ANALYTICS_USERNAME", None)
    PASSWORD = getenv("MODE_ANALYTICS_PASSWORD", None)
    BASE_HEADERS = {
        "Content-Type": "application/json",
        "Accept": "application/hal+json",
    }    def __init__(self):
        if self.USERNAME is None or self.PASSWORD is None:
            raise ModeClientError("incorrect username or password")    def construct_mode_url_from_report_token(
        self,
        report_token: str = "",
    ):
        return f"{self.HOST}/opendoor/reports/{report_token}"    def get_memberships(self):
        return self._request(
            path=f"{self.ACCOUNT}/memberships/"
        )["_embedded"]["memberships"]    def get_report_by_token(self, report_token: str = ""):
        return self._request(
            path=f{self.ACCOUNT}/reports/{report_token}",
        )    def get_reports_by_space_token_and_page_id(
        self,
        space_token: str = "",
        page_id: Optional[int] = None,
    ):
        return self._request(
            path=f"{self.ACCOUNT}/spaces/{space_token}/reports/?page={page_id}"
        )["_embedded"]["reports"]    def get_segment_client_key(self):
        return self.SEGMENT_CLIENT_KEY    def get_schedules_by_report_token(
        self,
        report_token: str = "",
    ):
        return self._request(
            path=f"{self.ACCOUNT}/reports/{report_token}/schedules"
        )["_embedded"]["report_schedules"]    def get_spaces(self):
        return self._request(
            path=f"{self.ACCOUNT}/spaces/?filter=all"
        )["_embedded"]["spaces"]    def get_user_by_username(self, username: str = ""):
        return self._request(path=f"{username}")    def update_report(
        self,
        report_token: str = "",
        report: Dict = {},
    ):
        return self._request(
            method="PATCH",
            path=f"{self.ACCOUNT}/reports/{report_token}",
            body={"report": report},
        )    def _request(
        self,
        method: str = "GET",
        path: str = "",
        params: Dict = {},
        body: Dict = {},
    ):
        url = urljoin(self.BASE_URI, path)        if self.USERNAME is None or self.PASSWORD is None:
            raise ModeClientError(
                dumps({"message": "invalid mode api credentials"})
        )        response = request(
            method,
            auth=HTTPBasicAuth(self.USERNAME, self.PASSWORD),
            url=url,
            headers=self.BASE_HEADERS,
            params=params,
            json=body,
        )        status_code = response.status_code        if status_code > 300:
            err_msg = {
                "base_uri": self.BASE_URI,
                "body": body,
                "message": response.json()["message"],
                "method": method,
                "path": path,
                "params": params,
                "status_code": response.status_code,
            }
            raise ModeClientError(dumps(err_msg))         else:
            return response.json()