Performance debugging for serverless functions using the Cloud Shell

Rodric Rabbah
5 min read · Feb 5, 2018


The execution of functions that run in a serverless cloud is, for the most part, opaque. In general terms, when a serverless function is invoked, the platform accepts the request and provisions resources before it executes the function. We can refer to the duration between accepting the request and the start of the function execution as the “system time”. The system time, together with the “user time” (i.e., the time consumed executing the function), makes up the total time it takes to service a single function request.

In the cloud, the serverless platform has almost total visibility into the system and user times and can aggregate this information systematically. Such information enables the kind of performance analysis that is often central to improving the efficiency of applications, cloud native or not. A service like AWS X-Ray is one way to address this problem, although only for applications tied to AWS.

Apache OpenWhisk is an open source serverless platform for functions as a service. As with AWS Lambda, functions execute in opaque resources, and instances scale quickly when needed. OpenWhisk may be deployed on-prem, or used as a service with IBM Cloud Functions. The platform provides visibility into each function activation in terms of the system and user times. These are reported for every activation and stored as part of the activation record.
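
For reference, here is a trimmed, hypothetical activation record of the kind the platform produces; the exact fields can vary by release, but the duration of the run and the system-side waitTime and initTime annotations are what the shell aggregates.

```json
{
  "activationId": "a1b2c3d4e5f6",
  "name": "analyze.log",
  "start": 1517846401000,
  "end": 1517846401742,
  "duration": 742,
  "annotations": [
    { "key": "waitTime", "value": 134 },
    { "key": "initTime", "value": 53 },
    { "key": "limits", "value": { "memory": 256, "timeout": 60000 } }
  ],
  "response": { "status": "success", "result": { "ok": true } }
}
```

Roughly speaking, the waitTime (plus the initTime on a cold start) accounts for the system time, while the remainder of the duration is the user time.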

The Cloud Shell aggregates and visualizes this information, making it easy to measure the mean execution time of individual functions and their variance. In addition, when functions are composed into sequences or used to orchestrate a more general control flow, the shell provides a detailed trace view to quickly identify performance bottlenecks and properly account for system versus user time.

Below are some examples of the performance visualization, taken from an application I run regularly to alert Apache OpenWhisk committers when their pull requests break the build. The application is a composition of functions, which I’ll describe in a separate article. Briefly, when a Travis CI build completes, it triggers my application, called travis2slack.app, to fetch the build logs, analyze them for failures, and alert the contributor on Slack with a build summary.

Example Slack notifications for a failed and passing Travis CI build, here shown for two different pull requests.

For now, I think you’ll understand what it generally does by looking at the visual representation of the application as generated by the shell below. The application control flow is shown on the left. It should be evident that there are several steps in the execution, including some error handling logic (e.g., a try-catch) and a mixture of cloud and inline functions. Since cloud functions can be written in many languages, it is easy to mix and match Node.js and Python functions, for example, where appropriate.
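
To give a flavor of how such a composition is expressed (the actual source will come with the follow-up article), here is a minimal sketch using the Composer library. Only analyze.log is named in this article; the other action names are placeholders, and the Composer API has evolved over time, so treat the details as illustrative.

```javascript
// Hypothetical sketch of a travis2slack-style composition.
const composer = require('@ibm-functions/composer')

module.exports = composer.try(
  composer.sequence(
    'travis2slack/fetch.job.id',   // placeholder: extract the Travis job id from the webhook payload
    'travis2slack/analyze.log',    // fetch the build log and scan it for failed tests
    params => ({ text: `Build ${params.ok ? 'passed' : 'failed'}` }), // inline function
    'travis2slack/post.to.slack'   // placeholder: notify the contributor on Slack
  ),
  'travis2slack/report.error'      // placeholder: the catch of the try-catch
)
```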

The figure on the right below represents a single execution of the application, showing which of the functions actually executed, with rectangles colored green (the function completed without error) or red (the function exited with an error). This is the dynamic trace of the execution. A grey rectangle marks a path not taken.

The shell is a desktop application. You can install it with “npm -g install @ibm-functions/shell”. From your terminal, you launch it with “fsh shell”. The figure on the left is generated by typing “app list” at the shell prompt and then selecting an application from the list of deployed apps. The figure on the right is an example session. You can see your sessions with the command “session list” and open one by clicking on it in the list.
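
Putting those steps together, a typical first session looks roughly like this:

```
$ npm -g install @ibm-functions/shell    # install the shell once
$ fsh shell                              # launch the desktop window from your terminal

# then, at the shell prompt:
app list        # list deployed apps; click one to see its control flow
session list    # list recent sessions; click one to see its dynamic trace
```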

The session flow is generally useful for debugging application execution. It lends itself well to finding specific function failures, and replaying only those executions, possibly using a debugger to identify the root cause of a failure; this is a feature I’ll cover in a future article.

Another set of views that is useful for understanding performance and locating bottlenecks comprises the trace and summary views.

The trace view is another way of visualizing a session. You can cycle through the views by selecting the appropriate tab in the bottom panel. In addition to Trace and Session Flow, you can also inspect the activation result and the raw activation record.

The trace is rendered as a table, with each row representing a step in the execution as it unfolds from top to bottom. Various metrics are shown including the execution duration of the function and the system overhead. In the figure above, the latter is visible as queueing delays.
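
If you want to recompute that split yourself from a raw activation record, the arithmetic is simple. Here is a small sketch; the record shape follows the example shown earlier, and the exact accounting may differ slightly between releases.

```javascript
// Split an activation record into system and user time.
// Assumes the annotations array carries waitTime and (on cold starts) initTime.
function accountTime(activation) {
  const annotation = key =>
    (activation.annotations.find(a => a.key === key) || { value: 0 }).value

  const waitTime = annotation('waitTime')   // queueing delay before the container runs
  const initTime = annotation('initTime')   // cold-start initialization, 0 when warm
  const systemTime = waitTime + initTime
  const userTime = activation.duration - initTime

  return { systemTime, userTime, total: systemTime + userTime }
}
```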

The shell complements the trace and session views with performance summaries for individual functions, as shown next.

Performance of individual functions, here showing the mean and the 25%–75% execution time spread. This view is accessible by typing the command “summary” at the shell prompt. Once in this view, you can switch between the summary, timeline, and grid views, which are shown later.

The summary view is useful for identifying performance bottlenecks and homing in on functions whose performance varies significantly from one invocation to the next. The shell attempts to explain the performance of outliers, based on the information gleaned from the activation records. An example is shown below for the function analyze.log.

Clicking on a specific outlier brings up its corresponding activation record, which permits further inspection of the input, the result, and even the function code. It’s also possible to replay a specific activation and attach a debugger, or edit the code in place and redeploy; I will cover these in future articles. For the curious, analyze.log fetches the Travis CI logs from S3, and occasionally this is a slow operation. It then parses the log file for failed tests. I will make the code available on GitHub when I describe the composition in more detail.
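
Since that code isn’t published yet, here is a rough, hypothetical sketch of what a function like analyze.log might look like as a Node.js action; the logUrl parameter and the failure pattern are assumptions for illustration.

```javascript
// Hypothetical sketch of an analyze.log-style action: download a Travis CI
// build log (served from S3) and scan it for lines that indicate failed tests.
const https = require('https')

function fetchLog(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = ''
      res.on('data', chunk => { body += chunk })
      res.on('end', () => resolve(body))
    }).on('error', reject)
  })
}

async function main(params) {
  const log = await fetchLog(params.logUrl)   // assumed input parameter
  const failures = log.split('\n').filter(line => /FAILED|Tests? in error/i.test(line))
  return { ok: failures.length === 0, failures }
}

exports.main = main
```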

I’ll conclude with two more shell views that I think are worthwhile and handy. The first of these is the grid view.

A heat map of activations allows one to quickly find slow or failed activations. You can type the command “grid” at the shell prompt to bring up this view.

The grid may be aggregated over thousands of activations. It provides a visual representation of the execution variance, while also making it readily possible to find failed executions.

Lastly, the timeline shows executions unfolding over time, with an estimated cost per million activations (ignoring the free tier).
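
As a rough illustration of how such an estimate works, the cost per million activations is approximately 1,000,000 × memory (in GB) × billed duration (in seconds) × the per-GB-second rate. Assuming, for example, a 256 MB function that runs for half a second and a rate of $0.000017 per GB-second (an assumption here; check current pricing), that works out to 1,000,000 × 0.25 × 0.5 × $0.000017 ≈ $2.13 per million activations.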

The timeline view is another way of looking at a set of activations over time. The command for this view is “timeline”, which you can enter at the shell prompt or select from the view tabs in the bottom panel.

The timeline highlights the non-uniform nature of my travis2slack.app activations. When developers and contributors are active, there are many pull requests. It so happens there are contributions to Apache OpenWhisk from other continents, explaining the spurt of activity at 4AM in this timeline (I’m using IBM Cloud Functions in the US region).

While it is possible to generate some of these views in other data visualization stacks, one of the advantages of the shell is its fully immersive, fast, and interactive nature. One can quickly switch from development, to monitoring, to debugging, without having to leave the shell environment, which runs as a native application on your desktop.

The shell is a plugin-based application that is rapidly evolving, with features added every day. It is available on GitHub, and we welcome contributions. For more information about the composer, visit the getting started guide, and stay tuned for more articles on this topic. You can reach us directly by opening issues on GitHub or via Slack.
