Integrating OpenTelemetry into Cloud Pub/Sub

Tracing Cloud Pub/Sub messages from start to finish

Seth Maxwell
Google Cloud - Community
5 min readAug 14, 2020

--

The Observability Problem

The Cloud Pub/Sub service provides a convenient way to communicate information as an alternative to a typical request-response model. This makes it an ideal system to use for event-driven applications that are composed of multiple services.

However, the inner workings of Cloud Pub/Sub are largely hidden to the user. This makes debugging the cause of undelivered messages or large gaps between sending and receiving messages much more difficult, as there is not a good way to track a message from when it is published to the point where a subscriber picks it up.

The Problem in Action

Take, for example, a Python chat application that sends messages through the Cloud Pub/Sub service. A message sent from a publisher to a subscriber always arrives, but there seems to be a significant delay between the point at which a publisher sends a message and when a subscriber receives it. For example, the message “hello world” can be sent to all clients.

The message eventually appears in the message list, but the time between when it was sent and when it appeared in the list of messages was approximately six seconds.

This delay is only present in some messages, and it does not always last for the same amount of time. Without some way to monitor a message throughout its delivery cycle, discovering the source of the inconsistent delay may be difficult.

Integration with OpenTelemetry

This is where OpenTelemetry comes into play. OpenTelemetry is an observability library that adds tracing and metrics to applications. When added to Cloud Pub/Sub, it provides a trace that enables more visibility into how a message is being handled outside the scope of an application. As of the time of writing, the Node.js and Python Cloud Pub/Sub clients have OpenTelemetry support. The Java and Go clients have OpenCensus support, the predecessor library to OpenTelemetry, but have not been instrumented with OpenTelemetry yet.

To discover the cause of the message delay, OpenTelemetry tracing can be added to this application. By default, OpenTelemetry is an optional feature. For the Python library, OpenTelemetry can be enabled simply by installing the pip packages:

pip3 install opentelemetry-api opentelemetry-sdk

Now that the packages have been installed, the publisher and subscriber need to be told how to create and handle traces. This is done through the tracer provider, which serves as a trace creation factory. For this example, when a trace is created, it will be exported as soon as the trace is completed. OpenTelemetry supports exporting to a variety of tracing backends, but for this example, traces will be exported to Google Cloud Trace. This can be set up in the publisher and subscriber by adding the following to both:

The application now knows how to handle exporting traces. All that is left to do is to redeploy it and send a new message in chat. An OpenTelemetry trace should appear in Google Cloud Trace that shows the new message being sent.

Looking at the trace that was generated by the Cloud Pub/Sub message, it appears that the message was delivered to the same subscriber multiple times. This could indicate that there is nothing wrong with the message delivery, but rather that when the message is received by the application’s subscriber, something is wrong with the way that new messages are handled when they are received. We can take a look at the subscriber to see what’s going on.

It appears that the culprit is the callback method provided to the subscriber to handle incoming messages. There’s a random chance that incoming messages are not acknowledged or added to the list of messages to display to the user. This prompts Cloud Pub/Sub to resend the message since the subscriber did not send back an acknowledgement that it was received.

Taking the random return out of the callback method seems to have fixed the issue. Now, messages are appearing approximately one second after they are sent, and the traces for each message back this up.

Yay! Adding OpenTelemetry observability to Cloud Pub/Sub messages helped to diagnose why the messages were being delayed, and it ultimately provided a better user experience.

This was a simple example of why OpenTelemetry can be useful in Cloud Pub/Sub applications, but for your application, this random number-driven error could be an asynchronous call that takes longer than the deadline to acknowledge a message. This would cause the same issue as the one explored here, and viewing the OpenTelemetry trace for these messages would be a useful debugging tool to find its cause.

Support for OpenTelemetry in the Node.js Cloud Pub/Sub client is now available, and support for the Python client will be available shortly.

To try out this example for yourself, the source can be found here. Please also see the OpenTelemetry Python 3 docs and Node.js docs for more information.

--

--