Go Fetch: Building a Spreadsheet Plugin Using Serverless Architecture on GCP for Data Self-Serve Users Love

Georgian
Georgian Impact Blog
7 min read · Apr 7, 2021

By Jing Zhang

Have you ever been in a situation where you’ve built a powerful reporting tool hoping to enable data self-serve, only to get requests asking for the data in a spreadsheet? Even a platform that gathers the latest data can struggle with last-mile delivery.

Let’s face it: spreadsheets are deeply wired into the daily workflows of many types of analysis. Building something to replace them could be a losing battle. Instead, how about making data self-serve easier inside spreadsheets, complementing other, more advanced internal systems?

Google Cloud Platform (GCP) has already made efforts to address this need through the BigQuery connector with Google Sheets. While this is a great step to bridge the gap between data lakes and end users, there are still a few limitations:

  • End users need to know SQL.
  • It’s not clear how to reference the data from the BigQuery connector in other cells or tabs.

At Georgian, our data lake contains information about prospective investment targets. Our users — our Growth team — often want to pull this data to help them process leads and prioritize their outreach. We were getting questions from our end users like:

I just want to know the company headquarters location and put it in a cell/tab of my choice. Can we do that?

Can we get the latest data from the data platform in Google Sheets as soon as it is updated?

Until GCP provides something more flexible, we saw that we would need a customized solution to address these questions.

So we created Fetch.

What is Fetch?

Fetch is a collection of components that connect the end user with our data platform. It includes:

  • a Google Sheets plugin
  • multiple Google Cloud Functions with HTTP triggers
  • and a data mart

How does it work?

This diagram shows Fetch’s architecture.

As an end user, you interact with the usual Google Sheets UI. After installing the plugin through the G Suite Marketplace, all you need to do is invoke the function FETCH(company, attribute) in a cell to access the latest data. For example, to get Georgian’s own HQ location, you simply type =FETCH("georgian.io", "hq_location").

On the backend, this triggers a chain of actions. It invokes the HTTPS API hosted by a Google Cloud Function. The Cloud Function in turn queries the data mart to retrieve the data. The query result flows back to the Google Sheets plugin and renders on the Google Sheets UI.
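To make the chain concrete, here is a minimal sketch of what such a Cloud Function could look like in Python. The table name, attribute allowlist, and handler shape are assumptions for illustration, not Fetch’s actual implementation.

```python
# Hypothetical sketch of the Cloud Function behind FETCH. The allowlist and
# table name are illustrative assumptions, not Fetch's real schema.
ALLOWED_ATTRIBUTES = {"hq_location", "employee_count"}  # assumed columns

def build_fetch_query(company: str, attribute: str):
    """Return a parameterized BigQuery query for one (company, attribute) lookup."""
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    # The attribute comes from an allowlist, so interpolating it is safe;
    # the company name is passed as a query parameter to avoid injection.
    sql = f"SELECT {attribute} FROM `project.data_mart.companies` WHERE domain = @company"
    return sql, {"company": company}

def fetch_handler(request):
    """HTTP entry point (Cloud Functions pass a Flask-style request object)."""
    company = request.args["company"]
    attribute = request.args["attribute"]
    sql, params = build_fetch_query(company, attribute)
    # In the real function this would run via google-cloud-bigquery and
    # return the first row as JSON; omitted here for brevity.
    return {"sql": sql, "params": params}
```

Keeping the query parameterized and the attribute names allowlisted means the endpoint never executes arbitrary SQL, even though the formula arguments come straight from user-edited cells.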

What about performance?

There are two requirements we need to meet:

  • Data freshness: We need to render the latest data from the data mart.
  • Response time: We need to render the data within a reasonable SLA.

Data Freshness

Depending on the data source, the tables in the data mart are updated either daily or weekly. Since Fetch is only a “getter” API, its data freshness is the same as the data mart’s.

BigQuery provides a query cache which expires in at most 24 hours. We can leverage the cache to speed up the query since it fits with our data freshness SLA.

Response time

Our data platform uses BigQuery as the data warehouse. Initially, we assumed we shouldn’t use it to serve API calls, because data warehouses are not built to serve simple, quick data lookups. Based on this assumption, the preliminary design included a Cloud SQL instance as the data mart.

One of our engineers pointed out that we shouldn’t introduce another component based on unchecked assumptions. He raised a few good questions:

  • What is the actual SLA requirement? Does it have to be an instantaneous response?
  • Can we do a performance test with BigQuery?

We went back to our users to learn more about their workflow to understand what the SLA should be. Their workflow consists of three steps:

  1. Copy a list of companies into a Google Sheets column.
  2. Enter a list of attributes to retrieve as row headers.
  3. Type in the FETCH command with cell references and drag it across the sheet.

Since our customers process data in batches (two to three hundred companies at a time), it’s ok for them to wait for a few seconds (say 5 seconds) for the whole report to load.

Given this information, we decided to do a load test using BigQuery as the database.

Load test setup

Prepare the Requests

To mimic a typical workflow, we prepared a spreadsheet of our own which, once opened, sends 200 requests simultaneously to the backend. To be more specific, the sheet has 200 cells, each containing a FETCH function call like =FETCH("COMPANY_NAME", "ATTRIBUTE").
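The same load pattern can be approximated from Python with a thread pool, which is handy if you want to rerun the test without opening the sheet. The endpoint body here is a stand-in stub; in a real run you would replace it with an HTTP call to your Cloud Function URL.

```python
# A Python approximation of the load test: fire 200 concurrent FETCH-style
# requests and record per-request latency. The real test issued the calls
# from 200 spreadsheet cells; the endpoint here is a hypothetical stub.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_attribute(company: str, attribute: str) -> float:
    """Issue one lookup and return its latency in seconds.
    Replace the body with an HTTP call to your Cloud Function endpoint."""
    start = time.monotonic()
    time.sleep(0.01)  # stand-in for the network round trip
    return time.monotonic() - start

requests = [("georgian.io", "hq_location")] * 200  # 200 cells' worth of calls
with ThreadPoolExecutor(max_workers=200) as pool:
    latencies = list(pool.map(lambda r: fetch_attribute(*r), requests))

print(f"max latency: {max(latencies):.2f}s over {len(latencies)} requests")
```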

Turn off the BigQuery Cache

In theory, since the plugin directly connects to the data platform, the data freshness should be identical to that in the data platform.

However, that is not always the case. BigQuery caches query results for 24 hours by default. To do a fair evaluation of BigQuery’s capability to handle the requests, we need to turn off the cache. This requires two changes:

Disable retrieval of cached results: Following the GCP documentation, set use_query_cache to False in the QueryJobConfig.

Add randomness to the query: By leveraging a non-deterministic function in BigQuery, we can force the query to not use the cache. Functions like CURRENT_TIMESTAMP() should do the trick.

from google.cloud import bigquery

client = bigquery.Client()
# Bypass the query result cache; CURRENT_TIMESTAMP() also makes the query non-deterministic, so it cannot be cached.
job_config = bigquery.QueryJobConfig(use_query_cache=False)
sql = "SELECT *, CURRENT_TIMESTAMP() FROM `YOUR_DATA_TABLE`;"
query_job = client.query(sql, job_config=job_config)

Metric Monitoring

Google Apps Script provides a built-in execution dashboard to monitor the plugin’s performance. Based on the test, we found that the average execution time per cell was within an acceptable range even with the cache disabled.

Apps Script Execution Dashboard

Can we do better?

According to the chart above, we meet the SLA even without any cache. However, waiting 4–5 seconds is not the most pleasant experience for our users.

Given that the data is updated at most daily, we wondered whether we could cache the results during the day. Enabling BigQuery’s query cache is one option, but each call would still have to travel over the network to the Cloud Function and BigQuery and back. Luckily, Google Apps Script provides a script-level CacheService with a maximum expiration time of six hours. This let us cache results across all end users and speed up the majority of calls significantly, from about 4 seconds to less than 0.2 seconds.
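CacheService is an Apps Script API, but the plugin’s use of it is a standard read-through cache keyed on the (company, attribute) pair. The same pattern, sketched in Python purely for illustration:

```python
# Read-through cache illustrating the plugin's CacheService pattern in Python.
# Apps Script's CacheService caps the TTL at 6 hours, hence the default here.
import time

class TTLCache:
    def __init__(self, ttl_seconds: int = 6 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, company: str, attribute: str, fetch_fn):
        key = f"{company}|{attribute}"
        hit = self._store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]  # warm hit: no network round trip
        value = fetch_fn(company, attribute)  # cold: call the Cloud Function
        self._store[key] = (time.monotonic() + self.ttl, value)
        return value
```

Because the cache is script-level rather than per-user, the first person to look up a company during the day pays the full round trip, and everyone else gets the sub-second cached path.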

Cold Start

Sometimes we noticed that the loading time exceeded the SLA, typically at the beginning of the day when users first open the Google Sheet. This was due to the cold start of the Cloud Functions on the backend (and no warm cache in the Apps Script CacheService). The tips and tricks in the GCP documentation helped to some extent, cutting cold start time by about 40%.

We also used Cloud Scheduler to keep the Cloud Function warm during working hours by sending a request with a light payload every 14 minutes (based on our observation, an idle Cloud Function instance is scaled down after roughly 15 minutes). This removes the lag at the beginning of the day and reduces latency for other clients that call the APIs infrequently during the day. The cost of the scheduler is trivial given its pricing.
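The keep-warm ping itself is just a cheap request that exercises the function without touching BigQuery. A sketch, where the URL and the warmup query parameter are hypothetical placeholders:

```python
# Sketch of the keep-warm ping that Cloud Scheduler issues every 14 minutes.
# FUNCTION_URL and the warmup parameter are hypothetical placeholders.
import urllib.request

FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/fetch"

def build_keepalive_request() -> urllib.request.Request:
    """A light request that wakes the function without querying BigQuery."""
    return urllib.request.Request(FUNCTION_URL + "?warmup=1", method="GET")

# Cloud Scheduler would run on cron "*/14 * * * *" and send:
# urllib.request.urlopen(build_keepalive_request())
```

The 14-minute cron interval deliberately undercuts the observed ~15-minute idle scale-down window, so at least one warm instance survives through working hours.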

Summary

Impact

The Fetch Google Sheets plugin MVP has been widely adopted by our internal teams. The plugin has removed a large portion of the manual effort of entering and updating data in spreadsheets, so the team can focus on the real analysis of companies we may be interested in. Based on our customer feedback, it saves them at least one hour per day (approximately 200 hours per year) per person. Multiplied by the 5 to 10 people on our BD and investment teams, that is a huge improvement.

Future Development

By adding more data sources to the data platform in the future, we can enable our users to access an even wider range of fresh data in their daily workflow. We hope this short walkthrough of Fetch can help you with any similar use cases.

About Georgian R&D

Georgian is a fintech that invests in high-growth software companies.

At Georgian, the R&D team works on building our platform that identifies and accelerates the best growth stage software companies.

For sourcing, our platform provides insights that help our Growth team to prioritize their outreach to the most promising growth stage companies.

Take a look at our open opportunities if you’re interested in a career at Georgian.
