Microservice Observability with MicroProfile 3 on OpenShift

When adopting a microservices architecture, as we show in our IBM Stock Trader sample, one of the key benefits of such an approach is the “observability” offered by the platform.

John Alcorn
AI+ Enterprise Engineering
19 min read · Aug 26, 2019

That is, being able to “observe” the application, as the flow goes from microservice to microservice, both in the happy-path case and in problem-determination cases. Some of this comes from the Kubernetes platform, such as Red Hat’s OpenShift Container Platform (OCP), and some comes from the frameworks upon which you build your microservices, such as the Eclipse MicroProfile project, for those that like to write their microservices in Java.

As always in this space, things are evolving quickly, and MicroProfile 3.0 was recently released (specs finalized in June and delivered in beta in the June release of Open Liberty, then promoted to GA quality in the July release, Open Liberty 19.0.0.7), and it offers some key improvements here. Specifically, it delivers mpHealth-2.0, which is now fully Kubernetes-ready, offering separate annotations and endpoints for readiness and liveness probes. You get default implementations of both just by adding the mpHealth-2.0 feature to your server.xml (these basically return "ready" as soon as the server is fully started, and "live" as long as the server is still healthy enough to process a JAX-RS call), and you can code up your own custom implementations by using the new @Readiness and @Liveness annotations.

Likewise, the new mpMetrics-2.0 feature is updated to more naturally support the OpenMetrics format, providing an endpoint that is more Prometheus-friendly, and giving you many infrastructure-level metrics (memory usage, CPU usage, thread count, etc.) and app server-level metrics (JDBC and JMS connection pools, etc.) for free, and a simple programming model for adding application-specific metrics such as counters and gauges.

Stepping back for a moment, there are really two forms of observability: that which is automatically utilized by the platform (enabling simple autonomic/self-tuning/self-healing capabilities in the platform), and that which is meant for human consumption, usually in some browser-based user interface (UI). There’s a lot to cover in this space, so here we’ll focus on discussing what to do in your Java code and related artifacts like your Dockerfile, server.xml, pom.xml, and deployment yaml, and will use the curl command-line tool to demonstrate what’s available at each endpoint. There are already plenty of articles and blog posts showing the UIs built-in to the OpenShift console, as well as third-party UIs like Grafana, but here we want to focus on how to publish the data that feeds into those pretty UIs.

To begin, let’s look at what image you should start from when you want to build a microservice that leverages the new features of MicroProfile 3.0. As we’ve discussed in past blog entries, there are two flavors of Liberty: one that runs atop Ubuntu, which is accessed via “docker pull open-liberty:microProfile3”, and one that runs atop Red Hat’s Universal Base Image (UBI), accessed via “docker pull openliberty/open-liberty:microProfile3-ubi-min” (the difference in naming convention is due to the fact that the UBI hasn’t been accepted as “official” by Docker yet, so nothing built atop it can be marked “official” either, hence the extra directory prefix for now). So here’s the full Dockerfile for my Trader microservice, showing the use of this new MP 3 tag:

# FROM open-liberty:microProfile3 #Ubuntu flavor; I prefer the UBI instead
FROM openliberty/open-liberty:microProfile3-ubi-min
COPY --chown=1001:0 src/main/liberty/config /config/
COPY --chown=1001:0 target/trader-1.0-SNAPSHOT.war /config/apps/TraderUI.war
RUN configure.sh

For those new to running their containers in OpenShift, there are a few things to note here. First, as stated above, we’re using (the MP3 flavor of) the UBI, which tends to be a more OpenShift-friendly base operating system. Second, we are specifying ownership metadata for the files we copy into the container, since OpenShift runs containers under a different user ID than was used to create the image (for security reasons). And lastly, we are running Liberty’s configure.sh, which among other things (like starting/stopping the server, which implicitly validates your server.xml and will fail the build if it’s invalid, and constructing the feature metadata cache for a performance benefit when the container is used later), adjusts file permissions so that everything will be properly accessible in the image on OpenShift.

Now that we know how to build a container image based on MP 3, we need to adjust the version number on the compile-time dependency in our Maven pom.xml. To ensure you can use the new features in mpHealth-2.0 and mpMetrics-2.0 in your Java code, you need the following stanza:

<dependency>
    <groupId>org.eclipse.microprofile</groupId>
    <artifactId>microprofile</artifactId>
    <version>3.0</version>
    <type>pom</type>
    <scope>provided</scope>
</dependency>

And we need to add either mpHealth-2.0 and mpMetrics-2.0 individually, or the convenience feature microProfile-3.0, which enables all of the MP features, to the featureManager stanza in our server.xml:

<featureManager>
    <feature>microProfile-3.0</feature>
    <feature>monitor-1.0</feature>
</featureManager>

If you’re curious, just enabling the microProfile-3.0 feature will actually enable all of the following features that you’ll see if you look at the console log for the container:

[AUDIT   ] CWWKF0012I: The server installed the following features: [appSecurity-2.0, cdi-2.0, concurrent-1.0, distributedMap-1.0, jaxrs-2.1, jaxrsClient-2.1, jndi-1.0, json-1.0, jsonb-1.0, jsonp-1.1, jwt-1.0, microProfile-3.0, mpConfig-1.3, mpFaultTolerance-2.0, mpHealth-2.0, mpJwt-1.1, mpMetrics-2.0, mpOpenAPI-1.1, mpOpenTracing-1.3, mpRestClient-1.3, opentracing-1.3, servlet-4.0, ssl-1.0].

As you can see, I also added the monitor-1.0 feature, because that makes all of the app server-level metrics, such as servlet container and connection pool statistics, available when you hit the /metrics endpoint (otherwise you just get JVM-level metrics; more on all of that later).

One last thing before we start editing Java code: you just saw a log entry from our microservice. If you want such log entries to get fed to Elasticsearch, so that they are viewable in Kibana, then you need to set the WLP_LOGGING_CONSOLE_FORMAT environment variable to "json". You can also tell it you’d like the FFDC logs to get sent there too, via the WLP_LOGGING_CONSOLE_SOURCE environment variable. Here’s the stanza you’ll need in your deployment yaml to make that happen (of course, if you are using a helm chart, you can parameterize those values so they can be specified for that helm release, rather than hard-coding the values):

env:
- name: WLP_LOGGING_CONSOLE_FORMAT
  value: json
- name: WLP_LOGGING_CONSOLE_SOURCE
  value: message,ffdc

OK, now we’re ready to start altering our code to take advantage of the new features in MP 3. Let’s start with mpHealth 2.0. The key is to use the new @Readiness and @Liveness annotations, from the org.eclipse.microprofile.health package. You need a separate Java class for each. Let’s look at the one for the Readiness probe (the one for the Liveness probe is essentially the same, except it uses @Liveness, and checks a different condition to decide whether to report that it’s alive).

package com.ibm.hybrid.cloud.sample.stocktrader.trader.health;

import java.util.logging.Logger;

//CDI 2.0
import javax.enterprise.context.ApplicationScoped;

//mpHealth 2.0
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.HealthCheckResponseBuilder;
import org.eclipse.microprofile.health.Readiness;

@Readiness
@ApplicationScoped
/** Use mpHealth for readiness probe */
public class ReadinessProbe implements HealthCheck {
    private static Logger logger = Logger.getLogger(ReadinessProbe.class.getName());

    private static String jwtAudience = System.getenv("JWT_AUDIENCE");
    private static String jwtIssuer = System.getenv("JWT_ISSUER");

    //mpHealth probe
    public HealthCheckResponse call() {
        HealthCheckResponse response = null;
        String message = "Ready";
        try {
            HealthCheckResponseBuilder builder = HealthCheckResponse.named("Trader");
            if ((jwtAudience == null) || (jwtIssuer == null)) { //can't run without these env vars
                builder = builder.down();
                message = "JWT environment variables not set!";
                logger.warning("Returning NOT ready!");
            } else {
                builder = builder.up();
                logger.fine("Returning ready!");
            }

            builder = builder.withData("message", message);
            response = builder.build();
        } catch (Throwable t) {
            logger.warning("Exception occurred during health check: " + t.getMessage());
            throw t;
        }
        return response;
    }
}

That may look like a lot to digest, but really it’s pretty simple: the check returns success ("ready") if the two environment variables it requires are set. That is, an HTTP 200 ("success") return code will be returned when those environment variables have been configured on the container, and an HTTP 503 ("unavailable") will be returned if not. Kubernetes knows not to schedule work to a pod that isn’t ready. Someone checking that pod’s log will see the warning-level message stating why it isn’t ready, and should be motivated to correct the problem (usually by adjusting the Deployment yaml so that the values for those env vars come from fields in a config map or secret).
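For reference, here’s a minimal sketch of how those env vars could be wired up in the deployment yaml; the Trader sample happens to pull them from a secret named jwt (as the oc describe output later in this article shows), and a configMapKeyRef works the same way:

env:
- name: JWT_AUDIENCE
  valueFrom:
    secretKeyRef:
      name: jwt
      key: audience
- name: JWT_ISSUER
  valueFrom:
    secretKeyRef:
      name: jwt
      key: issuer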

We can test our microservice directly from the CLI, before even deploying it to our Kube cluster (OpenShift), by just doing a "docker run -p 9080:9080 ibmstocktrader/trader:latest" (after first doing the "mvn package" and "docker build -t trader ." steps as usual, of course). This should cause the "not ready" situation to occur, since we aren’t specifying any "-e" parameters to pass in environment variables. Note we are telling it to expose the 9080 port on the host, so we can hit that port via curl.
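To make that concrete, here’s the local build-and-test loop in one place (a sketch; the docker run here uses the published Docker Hub image name, as in the command above, but you could just as easily run the trader tag you built locally):

# build the WAR and then the container image
mvn package
docker build -t trader .

# run locally with no -e parameters, to exercise the "not ready" path
docker run -p 9080:9080 ibmstocktrader/trader:latest

With the container running, hitting the readiness endpoint looks like this: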

johns-mbp-8:trader jalcorn$ curl http://localhost:9080/health/ready
{"checks":[{"data":{"message":"JWT environment variables not set!"},"name":"Trader","status":"DOWN"}],"status":"DOWN"}
johns-mbp-8:trader jalcorn$ curl -I http://localhost:9080/health/ready
HTTP/1.1 503 Service Unavailable
X-Powered-By: Servlet/4.0
Content-Type: application/json; charset=UTF-8
Content-Language: en-US
Connection: Close
Date: Wed, 21 Aug 2019 22:52:39 GMT

As you can see, we got back the expected 503, and the JSON output with our message. Now if we kill the running container and run it again, this time passing values for those two environment variables via "docker run -p 9080:9080 -e JWT_AUDIENCE=blah -e JWT_ISSUER=blah ibmstocktrader/trader:latest", and try the same curl commands:

johns-mbp-8:trader jalcorn$ curl http://localhost:9080/health/ready
{"checks":[{"data":{"message":"Ready"},"name":"Trader","status":"UP"}],"status":"UP"}
johns-mbp-8:trader jalcorn$ curl -I http://localhost:9080/health/ready
HTTP/1.1 200 OK
X-Powered-By: Servlet/4.0
Content-Type: application/json; charset=UTF-8
Content-Language: en-US
Date: Wed, 21 Aug 2019 22:57:26 GMT

We see that all is well now. Likewise, we can curl the liveness endpoint, and see it is happy too:

johns-mbp-8:trader jalcorn$ curl http://localhost:9080/health/live
{"checks":[{"data":{"message":"Live"},"name":"Trader","status":"UP"}],"status":"UP"}
johns-mbp-8:trader jalcorn$
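As mentioned earlier, the liveness check follows the same pattern as the readiness check; here’s a minimal sketch of what such a class could look like (the Trader sample’s actual implementation may check additional conditions):

package com.ibm.hybrid.cloud.sample.stocktrader.trader.health;

//CDI 2.0
import javax.enterprise.context.ApplicationScoped;

//mpHealth 2.0
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

@Liveness
@ApplicationScoped
/** Use mpHealth for liveness probe */
public class LivenessProbe implements HealthCheck {
    //Answer "live" as long as the server can still service this call;
    //a fancier check might also verify heap headroom or other conditions
    public HealthCheckResponse call() {
        return HealthCheckResponse.named("Trader").up()
                .withData("message", "Live").build();
    }
}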

Lastly, we need to tell Kubernetes the URLs for these two new endpoints, via a stanza like this in our deployment yaml:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 9080
  initialDelaySeconds: 30
  periodSeconds: 15
  successThreshold: 1
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 9080
  initialDelaySeconds: 60
  periodSeconds: 15
  successThreshold: 1
  failureThreshold: 5 # Allow a little time to exec into the pod to grab FFDCs before it's killed

OK, now Kubernetes, such as OCP, can "observe" when a pod is "ready", and knows not to schedule any work to it until it acknowledges its readiness. (As an aside, mpHealth 2.0 is now smart enough not to start answering "ready" until the server is truly fully started, including your application; back in mpHealth 1.0 there was a small time window where the server could answer "up" as soon as the http port was activated, before your app had actually started, briefly leading to 404 responses since its context root wouldn’t be active yet.) It can also "observe" when a pod considers itself "live", and if the pod answers in the negative (or doesn’t answer at all) for the specified number of consecutive attempts, Kubernetes will kill the pod and start a fresh one to replace it. It should also be pointed out that, since the readiness and liveness endpoints aren’t under the application’s context root (that is, they are under /health, not /trader, in this case), none of the security you might have in place for your app, such as requiring a JWT to be passed, applies to these health probes. That’s a good thing, because Kubernetes, such as OCP, wouldn’t know how to construct and pass such a JWT.

One quick note about auto-scaling, as my colleague Greg discussed in an earlier blog entry — if the HorizontalPodAutoscaler policy “observes” that another pod is needed, it will start one, and will hold off dispatching any work to it until the readiness probe answers that the pod is ready (on Liberty, this tends to be about 30 seconds after Kube starts the container). Anyway, having proper health probes is key to Kubernetes being able to properly observe when your app is “ready”, and when it’s no longer “live”, and then being able to automatically act accordingly on this info.
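To make that a bit more concrete, here’s a hedged sketch of what a simple CPU-based HorizontalPodAutoscaler for the Trader deployment could look like (the min/max replica counts and CPU threshold here are illustrative, not taken from the sample’s charts):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: trader
  namespace: stock-trader
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trader
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70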

So mpHealth is good at giving a boolean response: up or down. But sometimes you need finer-grained telemetry from your pod than that, actual numbers that you can see and act upon. So now let’s turn our attention to mpMetrics. As mentioned earlier, I’ve enabled the new microProfile-3.0 feature in my Trader microservice, which, among other things, gives you mpMetrics-2.0. This automatically enables a /metrics endpoint. Let’s curl that endpoint, again against the Trader microservice that we did the "docker run" on earlier (note I ran through a few screens in the UI first, so it would have some non-zero metrics to report):

johns-mbp-8:trader jalcorn$ curl -k https://localhost:9443/metrics/
# TYPE base_cpu_processCpuLoad_percent gauge
# HELP base_cpu_processCpuLoad_percent Displays the 'recent cpu usage' for the Java Virtual Machine process.
base_cpu_processCpuLoad_percent 0.009217570821211652
# TYPE base_memory_maxHeap_bytes gauge
# HELP base_memory_maxHeap_bytes Displays the maximum amount of heap memory in bytes that can be used for memory management. This attribute displays -1 if the maximum heap memory size is undefined. This amount of memory is not guaranteed to be available for memory management if it is greater than the amount of committed memory. The Java virtual machine may fail to allocate memory even if the amount of used memory does not exceed this maximum size.
base_memory_maxHeap_bytes 5.36870912E8
# TYPE base_gc_time_seconds gauge
# HELP base_gc_time_seconds Displays the approximate accumulated collection elapsed time in milliseconds. This attribute displays -1 if the collection elapsed time is undefined for this collector. The Java virtual machine implementation may use a high resolution timer to measure the elapsed time. This attribute may display the same value even if the collection count has been incremented if the collection elapsed time is very short.
base_gc_time_seconds{name="scavenge"} 0.366
# TYPE base_gc_time_seconds gauge
# HELP base_gc_time_seconds Displays the approximate accumulated collection elapsed time in milliseconds. This attribute displays -1 if the collection elapsed time is undefined for this collector. The Java virtual machine implementation may use a high resolution timer to measure the elapsed time. This attribute may display the same value even if the collection count has been incremented if the collection elapsed time is very short.
base_gc_time_seconds{name="global"} 0.073
# TYPE base_memory_committedHeap_bytes gauge
# HELP base_memory_committedHeap_bytes Displays the amount of memory in bytes that is committed for the Java virtual machine to use. This amount of memory is guaranteed for the Java virtual machine to use.
base_memory_committedHeap_bytes 6.6453504E7
# TYPE base_cpu_systemLoadAverage gauge
# HELP base_cpu_systemLoadAverage Displays the system load average for the last minute. The system load average is the sum of the number of runnable entities queued to the available processors and the number of runnable entities running on the available processors averaged over a period of time. The way in which the load average is calculated is operating system specific but is typically a damped time-dependent average. If the load average is not available, a negative value is displayed. This attribute is designed to provide a hint about the system load and may be queried frequently. The load average may be unavailable on some platform where it is expensive to implement this method.
base_cpu_systemLoadAverage 0.1
# TYPE base_gc_total counter
# HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total{name="global"} 8
# TYPE base_gc_total counter
# HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
base_gc_total{name="scavenge"} 81
# TYPE base_thread_daemon_count gauge
# HELP base_thread_daemon_count Displays the current number of live daemon threads.
base_thread_daemon_count 51
# TYPE base_cpu_availableProcessors gauge
# HELP base_cpu_availableProcessors Displays the number of processors available to the Java virtual machine. This value may change during a particular invocation of the virtual machine.
base_cpu_availableProcessors 4
# TYPE base_memory_usedHeap_bytes gauge
# HELP base_memory_usedHeap_bytes Displays the amount of used heap memory in bytes.
base_memory_usedHeap_bytes 4.69968E7
# TYPE base_thread_max_count gauge
# HELP base_thread_max_count Displays the peak live thread count since the Java virtual machine started or peak was reset. This includes daemon and non-daemon threads.
base_thread_max_count 63
# TYPE base_thread_count gauge
# HELP base_thread_count Displays the current number of live threads including both daemon and non-daemon threads.
base_thread_count 54
# TYPE base_classloader_loadedClasses_count gauge
# HELP base_classloader_loadedClasses_count Displays the number of classes that are currently loaded in the Java virtual machine.
base_classloader_loadedClasses_count 11014
# TYPE base_classloader_loadedClasses_total counter
# HELP base_classloader_loadedClasses_total Displays the total number of classes that have been loaded since the Java virtual machine has started execution.
base_classloader_loadedClasses_total 11017
# TYPE base_jvm_uptime_seconds gauge
# HELP base_jvm_uptime_seconds Displays the start time of the Java virtual machine in milliseconds. This attribute displays the approximate time when the Java virtual machine started.
base_jvm_uptime_seconds 438.083
# TYPE base_classloader_unloadedClasses_total counter
# HELP base_classloader_unloadedClasses_total Displays the total number of classes unloaded since the Java virtual machine has started execution.
base_classloader_unloadedClasses_total 3
# TYPE vendor_threadpool_size gauge
# HELP vendor_threadpool_size The size of the thread pool.
vendor_threadpool_size{pool="Default_Executor"} 12
# TYPE vendor_session_liveSessions gauge
# HELP vendor_session_liveSessions The number of users that are currently logged in.
vendor_session_liveSessions{appname="default_host_trader"} 1
# TYPE vendor_session_create_total gauge
# HELP vendor_session_create_total The number of sessions that have logged in since this metric was enabled.
vendor_session_create_total{appname="default_host_metrics"} 1
# TYPE vendor_session_invalidated_total counter
# HELP vendor_session_invalidated_total The number of sessions that have logged out since this metric was enabled.
vendor_session_invalidated_total{appname="default_host_health"} 0
# TYPE vendor_session_invalidatedbyTimeout_total counter
# HELP vendor_session_invalidatedbyTimeout_total The number of sessions that have logged out by timeout since this metric was enabled.
vendor_session_invalidatedbyTimeout_total{appname="default_host_health"} 0
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_Login"} 2
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_Summary"} 3
# TYPE vendor_threadpool_activeThreads gauge
# HELP vendor_threadpool_activeThreads The number of threads that are running tasks.
vendor_threadpool_activeThreads{pool="Default_Executor"} 2
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_Login"} 0.26823716000000003
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="com_ibm_ws_microprofile_health_2_0_HealthCheckLivenessServlet"} 1
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_DisplayMessage"} 1
# TYPE vendor_session_liveSessions gauge
# HELP vendor_session_liveSessions The number of users that are currently logged in.
vendor_session_liveSessions{appname="default_host_health"} 2
# TYPE vendor_session_invalidated_total counter
# HELP vendor_session_invalidated_total The number of sessions that have logged out since this metric was enabled.
vendor_session_invalidated_total{appname="default_host_trader"} 0
# TYPE vendor_session_activeSessions gauge
# HELP vendor_session_activeSessions The number of concurrently active sessions. (A session is active if the product is currently processing a request that uses that user session).
vendor_session_activeSessions{appname="default_host_trader"} 0
# TYPE vendor_session_activeSessions gauge
# HELP vendor_session_activeSessions The number of concurrently active sessions. (A session is active if the product is currently processing a request that uses that user session).
vendor_session_activeSessions{appname="default_host_metrics"} 0
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="com_ibm_ws_microprofile_metrics_public_PublicMetricsRESTProxyServlet"} 4
# TYPE vendor_session_liveSessions gauge
# HELP vendor_session_liveSessions The number of users that are currently logged in.
vendor_session_liveSessions{appname="default_host_metrics"} 1
# TYPE vendor_session_activeSessions gauge
# HELP vendor_session_activeSessions The number of concurrently active sessions. (A session is active if the product is currently processing a request that uses that user session).
vendor_session_activeSessions{appname="default_host_health"} 0
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_Summary"} 1.1705176620000002
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="com_ibm_ws_microprofile_health_2_0_HealthCheckReadinessServlet"} 0.386380873
# TYPE vendor_session_invalidated_total counter
# HELP vendor_session_invalidated_total The number of sessions that have logged out since this metric was enabled.
vendor_session_invalidated_total{appname="default_host_metrics"} 0
# TYPE vendor_session_invalidatedbyTimeout_total counter
# HELP vendor_session_invalidatedbyTimeout_total The number of sessions that have logged out by timeout since this metric was enabled.
vendor_session_invalidatedbyTimeout_total{appname="default_host_metrics"} 0
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="com_ibm_ws_microprofile_health_2_0_HealthCheckReadinessServlet"} 1
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="com_ibm_ws_microprofile_health_2_0_HealthCheckLivenessServlet"} 0.01125599
# TYPE vendor_servlet_request_total counter
# HELP vendor_servlet_request_total The number of visits to this servlet since the start of the server.
vendor_servlet_request_total{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_AddPortfolio"} 2
# TYPE vendor_session_create_total gauge
# HELP vendor_session_create_total The number of sessions that have logged in since this metric was enabled.
vendor_session_create_total{appname="default_host_trader"} 1
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_AddPortfolio"} 0.08634327700000001
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="com_ibm_ws_microprofile_metrics_public_PublicMetricsRESTProxyServlet"} 0.10208623300000001
# TYPE vendor_session_invalidatedbyTimeout_total counter
# HELP vendor_session_invalidatedbyTimeout_total The number of sessions that have logged out by timeout since this metric was enabled.
vendor_session_invalidatedbyTimeout_total{appname="default_host_trader"} 0
# TYPE vendor_servlet_responseTime_total_seconds gauge
# HELP vendor_servlet_responseTime_total_seconds The total response time of this servlet since the start of the server.
vendor_servlet_responseTime_total_seconds{servlet="TraderUI_com_ibm_hybrid_cloud_sample_stocktrader_trader_DisplayMessage"} 0.009028340000000001
# TYPE vendor_session_create_total gauge
# HELP vendor_session_create_total The number of sessions that have logged in since this metric was enabled.
vendor_session_create_total{appname="default_host_health"} 2
johns-mbp-8:trader jalcorn$

That’s a lot of output. It falls into two categories: the JVM-level metrics, which all start with a prefix of "base", and the app server-level metrics, which have a "vendor" prefix. The data is returned in the OpenMetrics format that tools like Prometheus expect. For example, we can see that our microservice JVM is currently running 54 threads (base_thread_count), and that our mpHealth hit count (vendor_session_create_total{appname="default_host_health"}) is 2 (one for readiness and one for liveness, in this case). This particular microservice isn’t using JDBC or JMS, but if it were, we’d see metrics around the connection pools for those types of resources. Also, if we had any application-defined metrics, we’d see those here too.

Now let’s look at how to add application-defined metrics. Given my "past life" as the chief architect for IBM Business Monitor, I especially like these kinds of metrics, as they tend to be more "business level" metrics, whereas the metrics we have seen so far are more "IT level" metrics. Said another way, the audience for these app-level metrics is different: a business analyst would find a Grafana dashboard showing the values of such metrics a lot more interesting, whereas the built-in metrics are more of interest to an ops team. (As an aside, you could build a HorizontalPodAutoscaler policy, as discussed earlier, based on the values of those IT-level metrics, such as scaling up if the number of concurrent Trader sessions (vendor_session_activeSessions{appname="default_host_trader"}) goes above 10 or whatever threshold you like.)

I’ve actually had simple annotation-defined metrics in my Portfolio microservice for a year or so. But new in MP Metrics 2.0 is support for multi-dimensional metrics, that is, metrics that can report a different value for each value of a "dimension". That way, you can make graphs that show each value, like in a Grafana dashboard. For now, let’s look at how we can define a new multi-valued metric on the fly, in the Trader UI microservice that we’ve been using throughout this blog entry. I just added a call, in the for loop that iterates over the portfolios in my Summary servlet, to a new setPortfolioMetric method, passing it the portfolio’s owner and its total value. Here’s what the implementation of that method looks like:

void setPortfolioMetric(String owner, double total) {
    org.eclipse.microprofile.metrics.Gauge<Double> gauge = () -> { return total; };

    Metadata metadata = Metadata.builder().withName("portfolio_value")
        .withType(MetricType.GAUGE).withUnit(DOLLARS).build();

    metricRegistry.register(metadata, gauge, new Tag("owner", owner)); //registry injected via CDI
}

Let’s unpack those three lines of code. First, it defines a Gauge whose value is the portfolio’s total value (the sum of the values of all of the stocks in that portfolio). Personally, I still find lambda expressions a little spooky, but that’s the programming model for defining one of these things, so it is what it is. Then it defines the metadata around this created-on-the-fly metric, giving its base metric name, its type, and its units. Finally, it registers that metric with the metric registry (which was injected via CDI, via private @Inject MetricRegistry metricRegistry;). Note that the semantics of registering a metric are that it creates a new one the first time, or replaces the old value if one already existed. Lastly, note the Tag that is defined and passed in: this is the "dimension" of the metric, allowing us to say that we want a different value tracked for each value of the portfolio owner, in this case (prior to the introduction of multi-dimensional metrics to the programming model, all we could have tracked would have been the total value across all portfolios, which isn’t as interesting).

OK, with that code in place, let’s look at how it works. Given the following portfolio values, as seen in the Trader UI:

UI showing six portfolios

Now if we hit the /metrics/application endpoint (which narrows down the results to just the application-defined metrics), we should see the result:

johns-mbp-8:trader jalcorn$ curl -k https://localhost:9443/metrics/application
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="John"} 1234.56
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="Kyle"} 0.0
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="Greg"} 98765.43
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="Eric"} 123456.78
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="Ryan"} 23456.78
# TYPE application_portfolio_value_USD gauge
application_portfolio_value_USD{owner="Karri"} 12345.67
johns-mbp-8:trader jalcorn$

As you can see, the server is now reporting back the value of the portfolio for each value of the dimension/tag — which in this case is the portfolio’s primary key, its owner. So now we can do dimensional analysis of this metric in a dashboard.
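As an aside, for simpler cases you don’t need to touch the MetricRegistry at all; the annotation-driven style mentioned earlier (which the Portfolio microservice has used for a while) lets you just decorate a method. Here’s a hedged sketch, with illustrative names rather than the sample’s actual code:

import javax.ws.rs.GET;
import javax.ws.rs.Path;

import org.eclipse.microprofile.metrics.annotation.Counted;
import org.eclipse.microprofile.metrics.annotation.Timed;

@Path("/example")
public class ExampleResource {
    @GET
    @Counted(name = "example_gets", absolute = true,
             description = "Number of times this endpoint has been called")
    @Timed(name = "example_get_time", absolute = true,
           description = "Time spent servicing this endpoint")
    public String get() {
        return "ok"; //the counter and timer are updated automatically on each call
    }
}

Each invocation increments the counter and records the elapsed time, and both show up under the application section of the /metrics endpoint without any registry code.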

To complete the story, we need to tell Kubernetes (OpenShift, in this case) that we want the Prometheus scraper to periodically poll the /metrics endpoint on our microservice. We can accomplish that by simply adding the following stanza to our deployment yaml (and the same thing in our service yaml):

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9080"

If we were only exposing a single port, we’d only have needed that first annotation. But since we’re exposing both 9080 and 9443, we need the second line as well, to tell it to hit that specific port (otherwise, it polls both ports at the default scrape interval of once a minute, resulting in an SSLException in the server’s log each time the 9443 port is hit). Note that we are able to use the plain http (9080) port because of the following line in our server.xml; without this, we’d have to mess with giving Prometheus our SSL certificate, and our server’s login credentials:

    <mpMetrics authentication="false"/>
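And here’s a minimal sketch of how those same annotations would appear in the service yaml (the port numbers and the app=trader selector come from the sample; the rest of the layout may differ from the actual chart):

apiVersion: v1
kind: Service
metadata:
  name: trader
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9080"
spec:
  selector:
    app: trader
  ports:
  - name: http
    port: 9080
  - name: https
    port: 9443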

Let’s wrap things up by showing the metadata for the Trader deployment in OpenShift, via "oc describe deployment trader"; note in particular the probe settings, the environment variables, and the Prometheus annotations we’ve discussed here:

johns-mbp-8:trader jalcorn$ oc describe deployment trader
Name: trader
Namespace: stock-trader
CreationTimestamp: Mon, 29 Jul 2019 15:48:17 -0400
Labels: app=stock-trader
Annotations: deployment.kubernetes.io/revision=1
prism.subkind=Liberty
Selector: app=trader
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: app=trader
Annotations: prometheus.io/port=9080
prometheus.io/scrape=true

Containers:
trader:
Image: ibmstocktrader/trader:basicregistry
Ports: 9080/TCP, 9443/TCP
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 100m
memory: 128Mi
Liveness: http-get http://:9080/health/live delay=60s timeout=1s period=15s #success=1 #failure=5
Readiness: http-get http://:9080/health/ready delay=30s timeout=1s period=15s #success=1 #failure=3
Environment:
JWT_AUDIENCE: <set to the key 'audience' in secret 'jwt'> Optional: false
JWT_ISSUER: <set to the key 'issuer' in secret 'jwt'> Optional: false
WLP_LOGGING_CONSOLE_FORMAT: json
WLP_LOGGING_CONSOLE_SOURCE: message,ffdc

OIDC_NAME: <set to the key 'name' in secret 'oidc'> Optional: true
OIDC_ISSUER: <set to the key 'issuer' in secret 'oidc'> Optional: true
OIDC_AUTH: <set to the key 'auth' in secret 'oidc'> Optional: true
OIDC_TOKEN: <set to the key 'token' in secret 'oidc'> Optional: true
OIDC_ID: <set to the key 'id' in secret 'oidc'> Optional: true
OIDC_SECRET: <set to the key 'secret' in secret 'oidc'> Optional: true
OIDC_KEY: <set to the key 'key' in secret 'oidc'> Optional: true
OIDC_NODEPORT: <set to the key 'nodeport' in secret 'oidc'> Optional: true
PORTFOLIO_URL: <set to the key 'portfolio' of config map 'urls'> Optional: true
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: trader-5d4546666b (2/2 replicas created)
Events: <none>
johns-mbp-8:trader jalcorn$

We’ve now seen what the microservice developers need to do to make their apps observable. You can use various tools for reporting against this data and visualizing the results, such as Kibana for federated logging, Jaeger for federated tracing, and Prometheus and Grafana for federated metrics. There are also fancier tools, like LogDNA and Sysdig, that are routinely used in the IBM Cloud.

Thanks for visiting, and let us know if you’d like to see a follow-up article that does a walkthrough of such browser-based user interfaces!

Originally published at https://www.ibm.com on August 26, 2019.
