Troubleshooting Cloud Spanner Applications with OpenCensus
This tutorial shows how to configure client-side metrics recording in your Cloud Spanner workloads using OpenCensus. While Cloud Spanner surfaces a number of helpful server-side metrics, applications can realize added benefits by collecting the metrics emitted by the Cloud Spanner client library; for example, only client-side metrics include session and transaction related information. Together, these Spanner and OpenCensus metrics provide enough data for you to spot unusual deviations from normal behavior and investigate their cause.
Before I dive into the details, let me provide a brief introduction to OpenCensus.
OpenCensus provides libraries for C++, Java, Go, Python, JavaScript, .NET, Ruby, and PHP that capture distributed traces and metrics from applications and send this telemetry to backends like Cloud Monitoring, Cloud Trace, Prometheus, Jaeger, Datadog, etc. More information and the user manual are available on opencensus.io.
**Important:** OpenCensus is being replaced by OpenTelemetry (currently entering beta), but the migration should not require many changes to the way you use the API. Read more here.
Before you begin
- Make sure you’re using Java 8 or greater and Apache Maven.
- Our example will run on Google Cloud and emit metrics to Cloud Monitoring and traces to Cloud Trace; however, you can run your code anywhere and use any exporter that OpenCensus supports.
Client-side metrics for investigating the use and abuse of the session pool
These simple metrics should allow you to diagnose a range of different issues, from the effects of changing the min/max session configuration, to detecting session leaks.
Sessions are a central concept in the Cloud Spanner API. All Spanner reads and writes are performed through a session. A session can have only one active transaction at any time.
1. Tracking the number of max-allowed sessions, configurable by the user
cloud.google.com/java/spanner/max_allowed_sessions
: This shows the maximum number of sessions allowed. If the current number of sessions in the pool is less than this and they are all in use, a new session will be created for any new operation. If the current number of in-use sessions is already equal to this number and a new request comes in, the pool can either block or fail. This metric is labeled by database name, instance name and library version.
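For reference, the limit reported by this metric comes from the session pool configuration that is applied when the Spanner client is created. Below is a minimal sketch using the Java client's SessionPoolOptions builder (the pool sizes are placeholders):

import com.google.cloud.spanner.SessionPoolOptions;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;

public class SpannerClientFactory {
  public static Spanner createSpanner() {
    // Placeholder pool sizes: tune these for your workload.
    SessionPoolOptions sessionPoolOptions =
        SessionPoolOptions.newBuilder()
            .setMinSessions(100)
            .setMaxSessions(400) // surfaced as max_allowed_sessions
            .build();
    SpannerOptions options =
        SpannerOptions.newBuilder()
            .setSessionPoolOption(sessionPoolOptions)
            .build();
    return options.getService();
  }
}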
2. Tracking the number of max-in-use sessions
cloud.google.com/java/spanner/max_in_use_sessions
: This returns the maximum number of sessions that have been in use during the last maintenance window interval, providing an indication of the amount of activity currently in the database. It is specific to the database and instance it runs in. A maintenance window is a fixed 10-minute interval. After a complete maintenance window has passed, the value is reset to zero (and then starts increasing again). The value is updated every time a session is checked out of the pool.
3. Tracking the number of recently-in-use sessions
cloud.google.com/java/spanner/in_use_sessions
: This metric allows users to see instance-level and database-level data for the total number of sessions in use (that is, checked out from the pool) at this very moment. With the help of this metric, we can tell if the number of in-use sessions increases for any reason or suddenly drops to zero. Depending on probe intervals and average transaction duration, this number might often be measured as zero for a system under low load. For example, if the system executes 1 transaction per second, the average transaction duration is 300 ms, and the value is probed once every 10 seconds, there is a good chance that you will often measure the value between two transactions and get a zero. On a system with higher load, measuring this value will give you a better idea of the distribution of transactions over time. If this value is relatively constant most of the time but suddenly shows a spike that cannot be correlated with a similar spike in user requests, it might be an indication of a concurrency problem somewhere (locks in the database, concurrency issues in the application itself, etc.).
4. Tracking the number of get-session timeouts (cumulative)
cloud.google.com/java/spanner/get_session_timeouts
: This gives you an indication of the total number of get-session requests that timed out instead of being granted (the thread that requested the session is placed in a wait queue, where it waits until another thread releases a session back into the pool) due to pool exhaustion, counted since the server process started. Consider this value in combination with the current value of in_use_sessions to understand the load on the database. This metric is labeled by database name, instance name and library version. In such a situation, you should observe an in_use_sessions value that is equal to (or almost equal to) max_allowed_sessions all the time. While that is an indication that something is not well, it could also be that the application and the session pool are operating at their maximum without actually being exhausted.
Below are the steps to troubleshoot this problem:
- Check your application log for LeakedSessionExceptions that occur when you close your Spanner instance. (Closing a Spanner instance will close the session pool, and closing the session pool will trigger these LeakedSessionExceptions for all sessions that have not been checked back into the session pool before the pool is closed.)
- If you have LeakedSessionExceptions: investigate their stack traces and fix the code that checks out sessions but does not check them back in, for example by not closing all ResultSets or by not closing a read-only transaction. (See the sketch after this list.)
- If you do not have LeakedSessionExceptions: increase SessionPool.maxSessions.
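As mentioned in the list above, the typical causes of leaked sessions are ResultSets and read-only transactions that are never closed. Here is a minimal sketch of leak-free reads using try-with-resources (the query and client are placeholders):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ReadOnlyTransaction;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

public class LeakFreeReads {
  // Single-use read: closing the ResultSet checks the session back into the pool.
  static void singleUseQuery(DatabaseClient dbClient) {
    try (ResultSet rs = dbClient.singleUse().executeQuery(Statement.of("SELECT 1"))) {
      while (rs.next()) {
        // process the row
      }
    }
  }

  // Read-only transaction: close the transaction as well as its ResultSets.
  static void readOnlyQuery(DatabaseClient dbClient) {
    try (ReadOnlyTransaction txn = dbClient.readOnlyTransaction();
        ResultSet rs = txn.executeQuery(Statement.of("SELECT 1"))) {
      while (rs.next()) {
        // process the row
      }
    }
  }
}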
5. Tracking the number of acquired and released sessions
cloud.google.com/java/spanner/num_acquired_sessions
: This metric allows users to see the total number of acquired sessions.
cloud.google.com/java/spanner/num_released_sessions
: This metric allows users to see the total number of released (destroyed) sessions.
Under ideal conditions, the number of acquired sessions should be equal to (or almost equal to) the number of released sessions. If the difference between the two is steadily increasing, it is a strong indication of a session leak. The metric is also an indication of the load on the system, so if you present it as a timeline, you will be able to recognize peak and low-load moments.
To set up and enable Metrics, follow these steps:
Step 1 — Setting Up and Enabling Enhanced Metrics
In order to capture client-side metrics and distributed traces from the Cloud Spanner client, we need to use OpenCensus Observability Ready Util for Java. This package provides a convenient wrapper so that developers can use OpenCensus easily.
By default, OpenCensus Observability Ready Util does the following things:
- Enables basic RPC views (an RPC is a single call against a gRPC service, either streaming or unary) to surface the RPC-level metrics mentioned below
- Sets the probabilistic trace sampling rate to 0.0001 (1 in 10,000); see the sketch after this list if you need to override it
- Creates and registers OCAgent Trace Exporter to collect traces
- Creates and registers OCAgent Metrics Exporter to collect metrics
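If the default 1-in-10,000 sampling rate mentioned above is too low for your traffic, it can be overridden through OpenCensus's trace configuration. A minimal sketch (the 0.001 rate is a placeholder):

import io.opencensus.trace.Tracing;
import io.opencensus.trace.config.TraceConfig;
import io.opencensus.trace.samplers.Samplers;

public class TraceSampling {
  public static void overrideSamplingRate() {
    TraceConfig traceConfig = Tracing.getTraceConfig();
    // Replace the active sampler with a probabilistic sampler (placeholder rate: 1 in 1,000).
    traceConfig.updateActiveTraceParams(
        traceConfig.getActiveTraceParams().toBuilder()
            .setSampler(Samplers.probabilitySampler(0.001))
            .build());
  }
}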
If you are using Maven, add these dependencies to your pom.xml file (they mirror the Gradle snippet below):
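<!-- Same artifacts and versions as the Gradle snippet below. -->
<dependency>
  <groupId>io.opencensus</groupId>
  <artifactId>opencensus-impl</artifactId>
  <version>0.25.0</version>
</dependency>
<dependency>
  <groupId>io.opencensus</groupId>
  <artifactId>opencensus-contrib-observability-ready-util</artifactId>
  <version>0.25.0</version>
</dependency>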
If you are using Gradle, add this to your dependencies:
compile 'io.opencensus:opencensus-impl:0.25.0'
compile 'io.opencensus:opencensus-contrib-observability-ready-util:0.25.0'
At the start of your application, configure OpenCensus:
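Here is a minimal sketch using the Observability Ready Util wrapper; the package path, method signature and agent endpoint below are assumptions based on the 0.25.0 artifact and the OC Agent's default gRPC port, so verify them against the version you depend on:

// Package path and method signature assumed from opencensus-contrib-observability-ready-util 0.25.0.
import io.opencensus.contrib.observability.ready.util.BasicSetup;

public class App {
  public static void main(String[] args) {
    // Registers the OC Agent trace and metrics exporters and the basic RPC views in one call.
    // "localhost:55678" is the OC Agent's default gRPC endpoint (assumes the agent runs locally).
    BasicSetup.enableOpenCensus("localhost:55678", 0.0001);

    // ... create the Spanner client and run your workload ...
  }
}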
The Cloud Spanner client supports OpenCensus metrics, which give insight into the client internals and aid in debugging and troubleshooting production issues. Metrics prefixed with cloud.google.com/java/spanner/ focus on operational-level metrics. RPC-level metrics can be gleaned from gRPC's metrics, which are prefixed with grpc.io/client/.
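If you prefer not to use the Observability Ready Util wrapper, the gRPC views behind these grpc.io/client/ metrics can also be registered directly. A minimal sketch, assuming the io.opencensus:opencensus-contrib-grpc-metrics artifact (not listed in the dependencies above) is on the classpath:

import io.opencensus.contrib.grpc.metrics.RpcViews;

public class GrpcViewsSetup {
  public static void registerGrpcViews() {
    // Registers the standard gRPC views so that grpc.io/client/... and
    // grpc.io/server/... metrics are collected and exported.
    RpcViews.registerAllGrpcViews();
  }
}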
Step 2 — Deploy OpenCensus Agent
The OpenCensus Agent is a component that can collect traces and metrics from processes instrumented by OpenCensus or other monitoring/tracing libraries (Jaeger, Prometheus, etc.), do aggregation and smart sampling, and export traces and metrics to one or more monitoring/tracing backends. This component is optional (metrics and traces can be exported directly from any OpenCensus library), but the agent makes it easier to choose an exporter without rebuilding and redeploying your application.
Here’s an agent configuration that exports metrics to Cloud Monitoring:
The oc-agent can be run directly from sources, binary, or a Docker image.
Cloning source: Open up a new shell to get the OpenCensus agent
$ git clone https://github.com/census-instrumentation/opencensus-service.git
$ cd opencensus-service
Building it: $ make agent
which will place the binary in the bin folder in your current working directory.
Running it: $ ./bin/ocagent_darwin --config=/config.yaml
Step 3 — View the captured metrics with Stackdriver Metrics Explorer
After you have enabled metrics for your Spanner client, you can view them using the Stackdriver Metrics Explorer.
- Navigate to the Metrics Explorer.
- In the “Find resource type and metric” field, enter the following:
cloud.google.com/java/spanner/in_use_sessions
- Select this metric from the list.
- The right pane shows the count of in-use sessions.
Step 4 — Time to clean up
Since this tutorial uses multiple GCP components, please be sure to delete the associated resources once you are done.
Sample Java application
Clone the source repository for this tutorial: $ git clone https://github.com/mayurkale22/spanner-oc.git
Update the Java application with some configuration specific to your project. Navigate to the folder containing the Java source: $ cd src/main/java/com/example/spanner
Configure the environment variables in the application code to use your Cloud Spanner instance and database.
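For reference, the client is typically wired up from a project, instance and database ID along these lines (the IDs below are placeholders, and the sample app's variable names may differ):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;

public class SpannerClientSetup {
  public static DatabaseClient createClient() {
    // Placeholder IDs: replace with your own project, instance and database.
    String projectId = "my-project";
    String instanceId = "my-instance";
    String databaseId = "my-database";

    Spanner spanner = SpannerOptions.newBuilder().setProjectId(projectId).build().getService();
    return spanner.getDatabaseClient(DatabaseId.of(projectId, instanceId, databaseId));
  }
}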
To build the example: $ mvn clean package
Then run the program: $ mvn exec:java -Dexec.mainClass=com.example.spanner.App
Go to http://localhost:8080/spanner/, and start sending read requests.
To Conclude…
OpenCensus’s integration with Cloud Spanner thrives on being able to detect any unusual deviation from normal in one or more metrics. The number of recently in-use sessions is a good example of a metric that needs a baseline to tell you whether a measurement is reasonable or a sign of a problem. Having established a baseline for the metric, you’ll be able to spot unexpected deviations from normal behavior and investigate their cause.
I hope this simple getting-started guide gets you up and running. Let us know in the comments section below if you found this helpful, and share any suggestions or questions you may have.