Stackdriver System Logs-based Metrics

System Logging Metrics

“A place for everything and everything in its place”

It’s probably me. Because I use monitoring tools irregularly, I spend the first little while with them trying to remember the details of their very specific types (metrics, time-series, resources, labels) while watching the clock as I try to get a sample built.

TL;DR a customer asked how they could determine the net, aggregate growth in their Stackdriver Logs. My thanks to Mary and Summit who helped educate me on the solution summarized here.

Aside: if you’re like me, the singular most valuable trait I have to pretend competence coding is that I document EVERYTHING I do. I’ve used different tools over time but I currently have a Google Doc open and I copy and paste everything. These Medium posts are my public version of this behavior. If no-one else reads this, I’ll likely return to it next time I use Stackdriver Monitoring :-)

System Logging Metrics

Stackdriver Monitoring (!) now provides metrics that “track the number and volume of log entries received”; these are called (slightly confusingly) “System Logging Metrics”. In this case, “system” means not “user-defined” (they’re metrics provided by default by Stackdriver), and “Logging Metrics” are elsewhere called “Logs-based Metrics”. Metrics are measurements and commonly include CPU utilization, memory usage, and network throughput. When we measure some aspect of logs (entries, size), we create what Stackdriver calls Logs-based Metrics. This feature permits us, among other things, to measure the byte counts of our logs. And, if we can measure the bytes being added to our logs, we can measure the rate of growth in our logs over time.

Every Metric must be uniquely named and the 3 Stackdriver-provided Logs-based Metrics (summarized here) are all logging.googleapis.com/[name]:

  • log_entry_count
  • byte_count ← This is the metric we’ll be using
  • dropped_log_entry_count

You can review the entire list of predefined metrics provided by Stackdriver on this page. Here again is the link to the “logging” metrics.

Lastly, for this preamble: metrics have a type and a kind. byte_count is of type int64 and of kind delta. Delta metrics record a change in value: the difference between the current value and the prior value. To determine how many bytes were added to a log today, I must sum all of today’s delta values.
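As a toy illustration (hypothetical numbers, plain Python), summing a day’s delta points gives the day’s net growth:

```python
# Hypothetical per-minute byte_count deltas for one log on one day.
# Each delta is the change since the previous measurement, so the
# day's net growth is simply their sum.
deltas_today = [0, 2636, 44, 0, 512]
bytes_added_today = sum(deltas_today)
print(bytes_added_today)  # 3192
```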

Resources

For the sake of simplicity, I’m going to assume there is one type of resource on Google Cloud Platform (GCP). This is not true; there are many types of resources. The one I’m going to focus on is the resource type for Google Compute Engine (GCE) Instances, which Stackdriver calls “gce_instance”. Here it is among the list of all the types of resources. GCP projects may contain many instances of many types of resources. In this case, each GCE Instance generates its own logs, so we will need not only to summarize the byte counts for the period of time we’re interested in, but also to summarize these byte counts across all the VMs (Instances) in our project(s).

Let’s Code!

I’m going to build upon some of the foundation described in “Getting Started w/ Python on GCP” and, of course, in Google’s documentation. Specifically, we’ll use the Stackdriver Monitoring API v3. The Python documentation is summarized here.

I recommend using virtualenv to partition this sample and a requirements.txt file to keep everything repeatable:

requirements.txt:

google-cloud-monitoring==0.27.0
pandas==0.20.3

Then:

cd /path/to/your/desired/directory
virtualenv venv
source venv/bin/activate
pip install --requirement requirements.txt

Because I’m not creative… python.py:

import datetime
from google.cloud import monitoring
PROJECT_ID = "[[YOUR-PROJECT-ID]]"
START = datetime.datetime(2017,8,30,0,0,0)
END = datetime.datetime(2017,9,5,23,59,59)
METRIC = "logging.googleapis.com/byte_count"
RESOURCE = "gce_instance"
WEEK_HOURS = 7*24

Assuming (please do) you’re using Application Default Credentials, all you need to do to create an auth’d client is:

client_monitor = monitoring.Client(
    project=PROJECT_ID
)

OK… all the measurements made for a Resource for a Metric form a time-series of values. We must first query Stackdriver for the correct set of time-series. Stackdriver’s Python Monitoring Library provides a fluent interface, so we can chain methods together. Let’s create a query based on our byte_count metric type:

query = client_monitor.query(
    metric_type=METRIC
)

Then we’ll refine it by specifying a time period of interest:

query = query.select_interval(
    start_time=START,
    end_time=END
)

At this point, we can manifest the query (results) in the form of a pandas dataframe:

print(query.as_dataframe())

If the code is working as expected, you should receive repeated variants of the following:

resource_type                               gce_instance  \
project_id [[YOUR-PROJECT-ID]]
location
zone us-east4-c
backend_name
backend_zone
bucket_name
cluster_name
container_name
disk_id
forwarding_rule_name
instance_group_name
instance_id 1234567890123456789
matched_url_path_rule
namespace_id
pod_id
storage_class
target_proxy_name
target_proxy_type
url_map_name
log cloudaudit.googleapis.com/activity
severity NOTICE
2017-08-30 23:22:59 NaN
2017-08-30 23:23:59 NaN
2017-08-30 23:24:59 NaN
2017-08-30 23:25:59 NaN
...
2017-09-05 16:05:59 0.0
2017-09-05 16:06:59 2636.0
2017-09-05 16:07:59 44.0
2017-09-05 16:08:59 0.0

What does it all mean?

In this case, this time-series includes byte counts by log-entry severity; these are all NOTICE. Those in August (08) don’t contain values (NaN: Not-A-Number). This is probably because the resource (in this case a “gce_instance”) did not exist in August but was created in September (09), when values are produced.

project_id, zone, backend_name, instance_id etc. are all resource labels. These represent metadata associated with this resource type (gce_instance) by which we could refine (filter) the query.

The metadata also includes metric labels. These are “log” and “severity” in the above table. These represent metadata associated with this metric type (byte_count). In this representation, we see only one value for log (cloudaudit.googleapis.com/activity), but there are actually several logs associated with this GCE Instance. The full set is:

  • cloudaudit.googleapis.com/activity
  • compute.googleapis.com/activity_log
  • kubernetes.datalab-server-xxx-datalab_default_datalab

Aside: we can confirm this by filtering our query by the specific instance’s name or ID, as both of these values are associated (thanks to resource labels) with the data we have.

aside = query.select_resources(
    instance_id="1234567890123456789"
)

Would yield (edited) results:

resource_type                             gce_instance  \
project_id [[YOUR-PROJECT-ID]]
zone us-east4-c
instance_id 1234567890123456789
log cloudaudit.googleapis.com/activity
severity NOTICE
...
resource_type                                            \
project_id
zone
instance_id
log compute.googleapis.com/activity_log
severity INFO
...
resource_type
project_id
zone
instance_id
log kubernetes.datalab...default_datalab
severity DEFAULT

It’s slightly confusingly presented, but note the “\” (continuation) characters after the resource_type lines. The result is actually intended to be presented as a very wide table, but the long log names (among other things) require it to be folded and presented this way.
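You can reproduce this folding behavior with a toy pandas frame (hypothetical values, mimicking the shape of the query result: timestamps as rows, (log, severity) pairs as columns) by narrowing the display width:

```python
import pandas as pd

# A toy frame mimicking the query result: time-series rows keyed by
# timestamp, columns keyed by (log, severity). The real frame is much
# wider, which is why pandas folds it and marks wrapped blocks with "\".
df = pd.DataFrame(
    {
        ("cloudaudit.googleapis.com/activity", "NOTICE"): [0.0, 2636.0],
        ("compute.googleapis.com/activity_log", "INFO"): [44.0, 0.0],
    },
    index=pd.to_datetime(["2017-09-05 16:05:59", "2017-09-05 16:06:59"]),
)
df.columns.names = ["log", "severity"]
pd.set_option("display.width", 40)  # a narrow display forces the fold
print(df)
```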

Okay, but this is a useful path to pursue… For one GCE Instance, we have several logs, and each has many metric (delta) values for the time period we chose. Let’s aggregate these time-series by log by week:

aside = aside.align(
    monitoring.Aligner.ALIGN_SUM, hours=WEEK_HOURS
)

Here’s an explanation of aggregation of time-series data; it’s easiest to understand with examples… Using our previous (aside) query and adding this align method call, we now get:

resource_type                             gce_instance  \
project_id [[YOUR-PROJECT-ID]]
zone us-west1-c
instance_id 1234567890123456789
log cloudaudit.googleapis.com/activity
severity NOTICE
2017-09-05 23:59:59 8701
resource_type                                            \
project_id
zone
instance_id
log compute.googleapis.com/activity_log
severity INFO
2017-09-05 23:59:59 7693
resource_type
project_id
zone
instance_id
log kubernetes.datalab...default_datalab
severity DEFAULT
2017-09-05 23:59:59 84572

Although I didn’t include the many data points in the series in the previous output, these are now summarized (by week) in this table. We specified a week of data (30Aug-05Sep), so there’s in fact, only one week of data to summarize.
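To see what ALIGN_SUM with a weekly window does to a single time-series, here’s a minimal sketch in plain pandas (hypothetical numbers): the per-minute delta points falling within each 7-day (168-hour) window collapse into one summed value.

```python
import pandas as pd

# Hypothetical per-minute byte_count deltas, as in the earlier output.
points = pd.Series(
    [0.0, 2636.0, 44.0, 0.0],
    index=pd.to_datetime(
        [
            "2017-09-05 16:05:59",
            "2017-09-05 16:06:59",
            "2017-09-05 16:07:59",
            "2017-09-05 16:08:59",
        ]
    ),
)
# "7D" (7 days == 168 hours) plays the role of the weekly alignment
# window; sum() plays the role of ALIGN_SUM.
weekly = points.resample("7D").sum()
print(weekly)
```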

Let’s continue by summarizing across the logs. The dimension we want to retain this time is the instance_id:

aside = aside.reduce(
    monitoring.Reducer.REDUCE_SUM, 'resource.instance_id'
)

And this is the result:

resource_type                    gce_instance
project_id [[YOUR-PROJECT-ID]]
instance_id 1234567890123456789
2017-09-05 23:59:59 100966

And, like me, you’ll be reassured to see that 100,966 = 8,701 + 7,693 + 84,572!
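A quick sanity check on that arithmetic, using the per-log weekly sums from the tables above: REDUCE_SUM collapses the log dimension, leaving one total per retained label (here, per instance_id).

```python
# Weekly per-log sums from the aligned (aside) query above.
per_log = {
    "cloudaudit.googleapis.com/activity": 8701,
    "compute.googleapis.com/activity_log": 7693,
    "kubernetes.datalab...default_datalab": 84572,
}
# REDUCE_SUM across the log dimension is just a sum of these values.
total = sum(per_log.values())
print(total)  # 100966
```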

Pop: Aside | Resume

OK. This isn’t what we actually intended to do, but it’s a useful diversion, because we will now do the same align → reduce, this time using all (and only) GCE Instances, and we’ll fold the data to retain the “by log” dimension. This is relatively straightforward:

query = query.select_resources(
    resource_type=RESOURCE
).align(
    monitoring.Aligner.ALIGN_SUM, hours=WEEK_HOURS
).reduce(
    monitoring.Reducer.REDUCE_SUM, 'metric.label.log'
)

NB in the example above, we’re employing the method-chaining capability of the Library to string together the query → align → reduce steps:

resource_type                             gce_instance  \
project_id [[YOUR-PROJECT-ID]]
log cloudaudit.googleapis.com/activity
2017-09-05 23:59:59 18737
resource_type                                            \
project_id
log compute.googleapis.com/activity_log
2017-09-05 23:59:59 15312
resource_type                                            \
project_id
log kubernetes.datalab...default_datalab
2017-09-05 23:59:59 84572

Datalab

Now, at the risk of doubling the length of this post, I’m going to show you another way to achieve the above, using Datalab (derived from Jupyter). It’s a very powerful tool.

https://cloud.google.com/datalab/
https://cloud.google.com/datalab/docs/quickstarts

The easiest way to launch Datalab is using Google Cloud Shell.

So, what?

Well, if you can create a new Notebook:

Empty Notebook

Then, paste in the following code:

import datetime
from google.datalab.stackdriver import monitoring
START = datetime.datetime(2017,8,30,0,0,0)
END = datetime.datetime(2017,9,5,23,59,59)
RESOURCE = "gce_instance"
WEEK_HOURS = 7*24

NB the “from” statement uses “google.datalab.stackdriver” this time. This is a version of the library that supports some Datalab-specific goodies.

Then click “Add Code” and paste in:

query = monitoring.Query(
    "logging.googleapis.com/byte_count"
).select_interval(
    start_time=START,
    end_time=END
)
query.as_dataframe().head(5)

NB this library requires a capitalized Query but is otherwise the same as the regular Python code from before, except for the “head” method, which we can use to return, as expected, the first 5 rows. We also don’t need the print statement; Datalab renders the results for us:

Datalab renders as_dataframe() automatically

This provides a very compelling way to interact with the data which makes using the metrics easier. Let’s continue, “Add Code” and paste:

query.metadata().as_dataframe().head(5)
Datalab and the Python Library provide metadata

This is super useful as it provides a way to introspect the dataframe to see what metadata we have to use. As you may be able to just about see, we have resource_type, and we can see the resource-specific metadata (resource.labels) and metric-specific metadata (metric.labels).

Almost there, “Add Code” and paste:

query = query.select_resources(
    resource_type=RESOURCE
).align(
    monitoring.Aligner.ALIGN_SUM, hours=WEEK_HOURS
).reduce(
    monitoring.Reducer.REDUCE_SUM, 'metric.label.log'
)
query.as_dataframe()

and….

Results!

Conclusion

Hopefully this post has helped demystify some of the intricacies of Stackdriver. For someone who uses Stackdriver infrequently, it can be confusing to untangle metrics, their types, resource and metric labels, etc. But, in truth, there is a place for everything and everything in its place.

Extra #1: Cloud Console

You may access these “System Metrics” from Google Cloud Console:

https://console.cloud.google.com/logs/metrics
Google Cloud Console “Logging”

and, from here, selecting the hamburger for e.g. “byte_count” links to “view in Metric Explorer”:

Stackdriver “Metric Explorer” of logs-to-metrics for byte_count

Extra #2: Google API Explorer

My colleague-buddy got me addicted to API Explorer and it’s a boon!

For Stackdriver Monitoring API v3 you can start here:

https://developers.google.com/apis-explorer/#search/monitoring/monitoring/v3/
https://developers.google.com/apis-explorer/#search/monitoring/monitoring/v3/monitoring.projects.timeSeries.list
API Explorer Populated

Authenticate and click “Execute” and, all being well:

API Explorer “Executed”