Google Cloud / Stackdriver Monitoring for Batch Processes

So you’ve got a batch process. It might run on a regular schedule or even continuously. It might take a long while to run, and you don’t necessarily care if a run takes a little longer or shorter than usual, but it does need to run on whatever cadence you expect.

If you are using Google Cloud to run your infrastructure, you might also be using Google Stackdriver for monitoring and logging. If so, you might want to know how to set up a monitoring alert in Stackdriver for your batch process.

I know I did. We have a batch process that runs continuously and typically takes about 10 to 25 minutes per run. When it completes, it starts over and runs again. We’ve set it up to log to Stackdriver Logging.

If you want to set up logging like this for an arbitrary application that Stackdriver doesn’t automatically support, you’ll want to set up the Stackdriver Logging Agent with a proper fluentd configuration. Our application writes its log entries in a specific format to a text-based log file on the server it runs on, and this is the configuration file we used for it:

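# Tail the application's log file and parse each line into
# time, log_level, and message fields.
<source>
  type tail
  path D:\_CM_runtime\Current\Logs\logs.log
  tag asset.core.context.manager.development
  format /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4}) — (?<log_level>\w+) — (?<message>.*)$/
  time_format %Y-%m-%d %H:%M:%S.%L
</source>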
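Before pointing the agent at a real log file, it can help to check the parsing pattern against a sample line. Here is a minimal sketch in Python, assuming a port of the fluentd pattern is close enough for a sanity check. Python writes named groups as `(?P<name>...)` where Ruby uses `(?<name>...)`. The sample line and its message text are made up, and the separator class (one or more hyphens, or an em dash) is my assumption about what the application really writes between fields.

```python
import re
from datetime import datetime

# Python port of the fluentd "format" regex. The separator class accepts one
# or more hyphens or an em dash (\u2014); that tolerance is an assumption.
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4})"
    r" [-\u2014]+ (?P<log_level>\w+)"
    r" [-\u2014]+ (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with time, log_level and message, or None if no match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Python's %f accepts one to six fractional digits, so four digits parse.
    fields["time"] = datetime.strptime(fields["time"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


# Hypothetical sample line; the message text is invented for illustration.
sample = "2018-06-01 12:34:56.1234 - INFO - Run completed"
print(parse_line(sample))
```

If `parse_line` returns `None` for lines copied from your real log, the pattern (or the separator) needs adjusting before the agent will pick those entries up.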

Back to the alert. For this to work, you first need to know what indicates that your batch process is running successfully. For our needs, it was a simple log entry that, if present in the log every so often, meant the batch process was running as we expected.

What we wanted was an alert that would bother us if that entry didn’t appear in the log within some reasonable time threshold. The first thing to do was to set up a log-based metric for it. To do that, you create a new log-based metric in the Stackdriver Logging console and add whatever filtering logic you need to identify the entry you expect to find.
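If you’d rather script the metric than click through the console, here is a rough sketch using the `google-cloud-logging` Python client. Treat it as an illustration under assumptions, not the way we actually did it: the project ID, metric name, and completion text are placeholders, I’m assuming the agent uses the fluentd tag as the log name, and the filter relies on a bare quoted string matching the text in any field of the entry.

```python
def build_completion_filter(project_id, log_id, completion_text):
    """Build a Logging filter that matches the completion entry in one log."""
    # Escape backslashes and double quotes so the text stays one quoted string.
    escaped = completion_text.replace("\\", "\\\\").replace('"', '\\"')
    return f'logName="projects/{project_id}/logs/{log_id}" AND "{escaped}"'


def create_completion_metric(project_id, metric_name, log_filter):
    """Create the log-based metric. Needs google-cloud-logging and credentials."""
    # Imported lazily so the filter helper works without the package installed.
    from google.cloud import logging

    client = logging.Client(project=project_id)
    metric = client.metric(
        metric_name,
        filter_=log_filter,
        description="Counts batch completion log entries",
    )
    metric.create()
    return metric


# Placeholder values: substitute your own project, log name and entry text.
example_filter = build_completion_filter(
    "my-project",
    "asset.core.context.manager.development",
    "Run completed",
)
print(example_filter)
```

Pasting the printed filter into the Logging console first is a cheap way to confirm it matches the entries you expect before creating the metric.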

We’re simply filtering the log based on the text we expected to find in its entries. In our case, the entries occur about every 26 minutes or so. What we’d like is an alert that tells us something is wrong if we haven’t seen that log entry for, say, 30 minutes.

You can use Stackdriver Monitoring to set up an alert policy based on the log metric you created above. Create a new alert policy with a condition that uses that log-based metric.
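For reference, here is roughly what such a policy looks like when written out as data instead of clicked together in the UI. This is a sketch that builds a Python dict in the shape of the Monitoring API’s AlertPolicy resource and prints it as JSON. The metric name `batch_completed`, the `gce_instance` resource type, and the 60-second alignment period are my assumptions. The 0.001/second threshold and 30-minute duration are the values this post settles on further down.

```python
import json


def build_stall_policy(metric_name, threshold_per_second, duration_seconds,
                       alignment_seconds=60, resource_type="gce_instance"):
    """Build an AlertPolicy-shaped dict that fires when the rate stays low."""
    metric_filter = (
        f'metric.type="logging.googleapis.com/user/{metric_name}" '
        f'AND resource.type="{resource_type}"'
    )
    condition = {
        "displayName": f"{metric_name} rate below {threshold_per_second}/s",
        "conditionThreshold": {
            "filter": metric_filter,
            # Fire when the aligned rate is LESS than the threshold...
            "comparison": "COMPARISON_LT",
            "thresholdValue": threshold_per_second,
            # ...and stays that way for this long.
            "duration": f"{duration_seconds}s",
            "aggregations": [
                {
                    "alignmentPeriod": f"{alignment_seconds}s",
                    "perSeriesAligner": "ALIGN_RATE",
                }
            ],
        },
    }
    return {
        "displayName": f"{metric_name} stalled",
        "combiner": "OR",
        "conditions": [condition],
    }


# Hypothetical metric name; threshold and duration come from this post.
policy = build_stall_policy("batch_completed", 0.001, 30 * 60)
print(json.dumps(policy, indent=2))
```

Notification channels are left out on purpose, since those depend entirely on how you want to be told about a problem.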

You’ll notice a few things on the chart for this condition. First of all, the 1-day graph is extremely spiky. That’s because this batch process completes only every 25 minutes or so, which certainly isn’t frequent, so we expect to see these spikes. Also, there was a recent problem, so some earlier completion spikes are missing because the process hadn’t actually been running.

Secondly, because the entries are so infrequent, the graph shows very small numbers on a per-second basis. Again, this makes sense for an infrequently running batch process like this one. For us, a rate below a threshold of 0.001/second was enough to indicate a problem, and if that persisted for 30 minutes or more (our threshold for deciding to go investigate), then go ahead and alert us!
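To make those small numbers concrete, here is a quick back-of-the-envelope sketch. It assumes the metric is aligned as a per-second rate over 60-second windows, which is my assumption and not something read off the condition itself. Under that assumption a window containing one completion has a rate of 1/60 (about 0.0167/second) and an empty window has a rate of 0, so 0.001/second sits comfortably between the two.

```python
THRESHOLD_PER_SECOND = 0.001
ALIGNMENT_SECONDS = 60      # assumed alignment window
DURATION_SECONDS = 30 * 60  # how long the rate must stay low


def aligned_rate(count, alignment_seconds):
    """Per-second rate for one alignment window holding `count` entries."""
    return count / alignment_seconds


def is_stalled(counts, alignment_seconds, threshold, duration_seconds):
    """True if every window in the most recent duration is below threshold.

    `counts` holds one entry count per alignment window, oldest first.
    """
    windows_needed = duration_seconds // alignment_seconds
    if len(counts) < windows_needed:
        return False  # not enough history to decide yet
    recent = counts[-windows_needed:]
    return all(aligned_rate(c, alignment_seconds) < threshold for c in recent)


# One hour of per-minute counts with a completion every 25 minutes.
healthy = [1 if minute % 25 == 0 else 0 for minute in range(60)]
# One completion at the start of the hour, then nothing.
stalled = [1] + [0] * 59

print(aligned_rate(1, ALIGNMENT_SECONDS))  # one completion in a window
print(is_stalled(healthy, ALIGNMENT_SECONDS, THRESHOLD_PER_SECOND, DURATION_SECONDS))
print(is_stalled(stalled, ALIGNMENT_SECONDS, THRESHOLD_PER_SECOND, DURATION_SECONDS))
```

One thing the arithmetic shows: averaged over a whole 25-minute run, one completion works out to 1/1500, or roughly 0.00067/second, which is already under the threshold. The comparison only works because the rate is computed per short window, so it is worth checking the alignment period on your own condition.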

After you create that condition and configure the rest of your alert policy, such as how you want to be notified, you can save it and you’ll start to get alerts for your batch process. Good luck and happy monitoring!