Google Cloud Platform — Kubernetes Logs to Stackdriver Graphs — Epiphanies
This is just to connect the dots, going from a log you see in Stackdriver Logging to a Stackdriver Monitoring dashboard with pretty-looking graphs.
We use Kubernetes on Google Cloud Platform’s GKE. And if you’ve got a crapload of pods, you need a centralized place to see their logs.
We used to use Elasticsearch-Logstash/Fluentd-Kibana (ELK), but switched to Stackdriver recently; the concepts here should apply to both platforms.
How do you go from a stdout log output that’s on a single pod to a graph?
First, in whatever platform you’re using, make sure you can see the logs in Stackdriver Logging or Kibana.
The most confusing part of the GCP Logs Viewer is finding the right logs for your GKE Kubernetes cluster, because everything is logged. The path is:
GKE Container -> {{your k8s cluster name}} -> {{ k8s namespace }}
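For reference, that menu path corresponds roughly to an advanced log filter like the following. The cluster and namespace names are placeholders; note that newer GKE logs use the `k8s_container` resource type, while older clusters used the legacy `container` type (which labels the namespace as `namespace_id` instead):

```
resource.type="k8s_container"
resource.labels.cluster_name="your-k8s-cluster-name"
resource.labels.namespace_name="your-namespace"
```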
Even just finding this out is awesome, because you can do two very powerful things: see all your pods’ logs in a centralized place, and filter and search them. This is true of the ELK stack too, I just don’t have screenshots.
The above shows the autographer deployment in the default namespace of the sandbox-cluster GKE cluster. The awesome thing is you can filter by your deployments, statefulsets, etc…
You can also search for very specific keywords and errors.
Here’s the first epiphany: logging in JSON allows you to filter and create metrics around your logs.
```python
print('{"error_message": "this was some error", "latency_in_ms": 123132}')
# or
log.info(json.dumps({"error_message": "this was some error", "latency_in_ms": 123132}))
```
```javascript
console.log({"error_message": "this was some error", "latency_in_ms": 123132})
// or
debug({"error_message": "this was some error", "latency_in_ms": 123132})
// or
log.info({"error_message": "this was some error", "latency_in_ms": 123132})
```
This means that in whatever language or stack you’re using, if you log the output as JSON, it goes from textPayload to jsonPayload. And when you do this, it becomes really powerful.
Yes, there are plenty of libraries that convert your current logs into JSON, but you really want to keep this in mind whenever you log something you want to track: for example, latency or response times.
It doesn’t really matter what language or stack you’re using; whatever logging library you’re already using usually supports JSON output. When you output the logs, structure them so the metrics you want to filter by actually can be filtered. That means: don’t put units inside the value, because that turns the number into a string; put the unit in the field name instead. For example, latency in milliseconds.
Instead of:

```json
{
  "latency": "205ms"
}
```

Do this:

```json
{
  "latency_in_ms": 205
}
```
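As a minimal sketch of what this looks like in practice (plain Python with the standard logging module, no Stackdriver-specific library; the JSON log line itself is the interface):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger(__name__)

start = time.monotonic()
# ... do the work you want to measure ...
latency_in_ms = int((time.monotonic() - start) * 1000)

# One JSON object per line: the unit lives in the field name,
# so the value stays a number Stackdriver can filter and aggregate.
log.info(json.dumps({"status_code": 200, "latency_in_ms": latency_in_ms}))
```

Because the payload is a single JSON object printed as one line, the logging agent parses it into jsonPayload for you.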
Stackdriver Logging has documentation for different languages, but like I said, at the end of the day you just want to log JSON so that it’s structured.
Second Epiphany: Turning your jsonPayload logs into metrics that become a graph on a dashboard
I’ll use the example above because we actually use this. Notice there are a few attributes I wanted to track: response_duration as a number, status_code as an int, and request_payload, which is nested, with whatever strings I want to track.
Let’s say I want to get the average response_duration, or see how many 200s or 400s I get in status_code.
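For concreteness, a single jsonPayload of that shape might look like this (the values here are made up for illustration; service_name is one of the nested strings we track):

```json
{
  "response_duration": 0.205,
  "status_code": 200,
  "request_payload": {
    "service_name": "autographer"
  }
}
```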
The way to do this is to create a metric. There are two types of metrics in any platform you use: counters or distributions.
In the GCP Stackdriver Logging console, you can create a metric. Note: you can also filter by labels, clusters, namespaces, deployments, etc.
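If you prefer the CLI over the console, a counter metric can be sketched with gcloud like this (the metric name and filter here are hypothetical, not the exact ones from our setup):

```shell
# Create a counter metric over logs whose jsonPayload has an error status code.
gcloud logging metrics create http_error_count \
  --description="Count of 4xx/5xx responses" \
  --log-filter='resource.type="k8s_container" AND jsonPayload.status_code>=400'
```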
So when you’re creating the metric, the difference between a textPayload and a jsonPayload really shows, because now you’re able to choose a very specific field.
While you’re at it, add a few labels for the attributes you want to pay attention to, so that when you’re searching for them in the graphing part you can easily find them.
Here we can see in the filters the specific metric you want to graph: response_duration.
So once you’re in the Stackdriver Monitoring tab, you can go to Resources -> Metrics Explorer.
Here you can search for your metric name, then filter by the attributes you wanted. Above, for example, is the mean response_duration, grouped by service_name. There’s a boatload of filter syntax you can also use to get the graphs you want, but that’s for another post.
Here’s just one more complex graph, where you can count events per second as a stacked bar. This one graphs 400 and 500 errors using status_code over time.