Tracing Python Cloud Functions

Did you ever have the impression that after deploying a Cloud Function to the cloud, your view of it was… well… clouded? Because it was hard to understand what caused its latency, now that it had ascended to no-ops heaven?

Among the application performance management (APM) products on the Google Cloud Platform, there is a nice tool called Stackdriver Trace for analysing and dissecting the latency of your applications. It’s very useful for finding out which parts of your function are worth spending time on optimising. To use it with Cloud Functions, all you have to do is make sure traces are being captured and sent to Stackdriver Trace.

Getting Started

In the Python 3.7 runtime, this is very easy. In fact, it is so easy that you probably want to implement it by default. My advice: do not bother wrangling with the Stackdriver Trace API’s Python client directly; it’s a rather unpleasant experience from a usability and documentation point of view. Instead, use OpenCensus, which will save you head-scratching and get you straight down to the nitty-gritty.

Start by adding the OpenCensus and Stackdriver Trace dependencies to your requirements.txt. With the OpenCensus release used here, the Stackdriver exporter ships in the core package and builds on the Google Cloud Trace client library:

opencensus
google-cloud-trace
In your main.py, import the required OpenCensus modules and create a Stackdriver exporter instance:

from opencensus.trace import tracer as tracer_module
from opencensus.trace.exporters import stackdriver_exporter
from opencensus.trace.exporters.transports.background_thread \
    import BackgroundThreadTransport

# instantiate trace exporter
exporter = stackdriver_exporter.StackdriverExporter(
    project_id=PROJECT_ID, transport=BackgroundThreadTransport)

Of course, introducing extra latency to the code execution by adding tracing would be silly. To avoid this, initialise the exporter with the transport=BackgroundThreadTransport argument so that it uses a background thread to send the data to Stackdriver.

Define Traces and Spans

Now, all that is left to do is to create a tracer instance and to define what you want to have traced. In case you’re wondering, the Cloud Function snippet I used here as an example queries data from BigQuery and calculates some key performance indicators (KPI) with pandas.
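The snippet below calls two helpers, parse_request and query_data, whose implementations are omitted here. Just so you can follow along, here is what a minimal parse_request might look like — a hypothetical sketch of its shape, not the original implementation:

```python
def parse_request(payload):
    """Hypothetical helper: extract the client id and optional
    filters from the JSON body of the HTTP request."""
    if not payload or 'client_id' not in payload:
        raise ValueError('request body must contain a client_id')
    return payload['client_id'], payload.get('filters', {})

# a body with filters...
print(parse_request({'client_id': 'acme', 'filters': {'year': 2019}}))
# ...and one without, falling back to an empty filter dict
print(parse_request({'client_id': 'acme'}))
```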

def get_kpis(request):
    """This is the entry point to my Cloud Function.
    I will complete the docstring later -- promised!
    """
    # instantiate tracer
    tracer = tracer_module.Tracer(exporter=exporter)

Make sure not to create the tracer instance in the global scope. Otherwise, every function call will be logged in the same single trace.
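To illustrate why — with a stand-in Tracer class instead of the real OpenCensus one, so the snippet runs anywhere: each tracer instance starts a new trace, but a module-level instance is created only once per container and then shared across all requests.

```python
import uuid

class Tracer:
    """Stand-in for OpenCensus' Tracer: every instance starts a new trace."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex

# Anti-pattern: a tracer in the global scope is created once per container
# and reused, so every request is logged into one single trace.
global_tracer = Tracer()

def handler_shared(request):
    return global_tracer.trace_id

# Correct: a fresh tracer per invocation yields one trace per request.
def handler_fresh(request):
    tracer = Tracer()
    return tracer.trace_id

assert handler_shared(None) == handler_shared(None)   # same trace -- bad
assert handler_fresh(None) != handler_fresh(None)     # distinct traces -- good
```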

Next, use the tracer’s span method to define the spans, i.e. the operations within a trace that shall be observed. Probably the most convenient way to do this is to use with statements. Also, giving the span a meaningful name makes it easier to identify in Stackdriver Trace. In my example, the root span named get_kpis covers the whole function:

    with tracer.span(name='get_kpis') as span_get_kpis:
        # parse HTTP request
        client_id, filters = parse_request(request.get_json())

Everything within this with statement will now be traced as one span. To get a more granular view, let’s create nested spans by enclosing sub-operations in child spans.

        # query data from database
        with span_get_kpis.span(name='query_data'):
            data = query_data(client_id, filters)
        # calculate KPIs
        with span_get_kpis.span(name='calc_kpis') as span_calc_kpis:
            # instantiate KPI object
            kpis = KPI(data)
            # define DataFrame index
            with span_calc_kpis.span(name='create_df_index'):
                ...
            # add I/O counts
            with span_calc_kpis.span(name='add_io_counts'):
                ...
            # add smoothed I/O
            with span_calc_kpis.span(name='add_smoothed_io'):
                ...
            # [other KPI calculations omitted]
        # create response body dict from DataFrame
        with span_get_kpis.span(name='df_to_dict'):
            body = kpis.to_dict()
    # return HTTP response
    return json.dumps(body), 200

Voilà. Now call your function, locally or after deploying, and head over to your Cloud Console to inspect the latency.

Trace Visualisation and Analysis

Stackdriver Trace dashboard in Cloud Console

Select one particular trace to break it down and see where the latency comes from. In the timeline, the spans appear hierarchically, exactly as you defined them in your code.

I may want to look into those query_data and add_tpt operations again…

Analysing single traces is helpful for uncovering the bottlenecks in your code. In terms of overall latency, however, they only give you anecdotal evidence. To get a better statistical understanding of how your function performs overall, you can create analysis reports with nice distribution charts. This is particularly useful for assessing the impact of code changes on latency. Analysis reports are available once you have at least 100 traces. If you want to fast-forward, you can bring in Artillery to take your function under fire.
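If you go the Artillery route, a minimal load-test config could look like the following — the target URL, function name, and request body are placeholders you would swap for your own deployment:

```yaml
config:
  target: "https://us-central1-my-project.cloudfunctions.net"
  phases:
    - duration: 60
      arrivalRate: 2   # 2 requests per second for one minute -> ~120 traces
scenarios:
  - flow:
      - post:
          url: "/get_kpis"
          json:
            client_id: "demo"
```

Save it as load-test.yml and run it with artillery run load-test.yml.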

Comparing the latency distributions of two different deployments

Automated Monitoring and Alerting

Even after optimising the latency, it can still deteriorate every now and then. For example, external dependencies such as a database or network connection could degrade, slow down your function, and require you to take action.

Of course, you don’t want to continuously keep an eye on the dashboard to observe how the latency evolves over time. And you don’t have to. Instead, you can set up an alerting policy in Stackdriver Monitoring to have your latency monitored around the clock. Whenever it surpasses a level you defined, you’re notified automatically so that you can act early. Pretty neat, isn’t it?

So, grab a cup of coffee, lean back, rejoice, and look smart.