Bring us to red alert!
Sometimes it can be hard to know what is going on in your cloud environment. With so many resources, projects, and assets, it can be difficult to manage everything, much less know what is happening at any given moment. I won’t go into a sales pitch about how Stackdriver can help with cloud management pain points, but hopefully a tangible example of how this awesome technology works will show why being proactive about managing your cloud environment is a win. So, without further ado, let’s get started.
Where’d that come from?
The idea behind this Stackdriver bonanza is to answer the question “How can I know if someone changes an IAM policy on a BigQuery dataset?” It’s a great question, because the dataset is the most granular level at which BigQuery IAM permissions can be set (at the time of writing, at least). A change to IAM permissions on a dataset, especially one holding sensitive data, has large implications, since every table and view within that dataset could become accessible depending on the change. And because Dataset Owners can change these permissions themselves, we want to know if and when those changes occur.
We’ll set out to accomplish a few things in this blog post:
- Change some permissions
- See what Stackdriver logs look like when we do that
- Create Stackdriver Metrics for the specific logs
- Create Stackdriver Alerts on those metrics
- Test it out
A few notes before we get started. I’ll be showing all of this in the GCP Console. That isn’t to say you have to do it this way; if you’re a GCP pro and prefer using gcloud and the APIs, go for it! Additionally, this is a starting point and an example only. Ideally we would take this concept further than this blog post does, by capturing what has changed rather than just alerting that something has. Disclaimers done. Onward!
The Set Up
Log into the GCP Console and open the hamburger menu on the left-hand side. Navigate to the IAM section and open the Service Accounts page. Create a service account here; give it a name and don’t sweat the particulars. We’ll clean it up later, and it doesn’t need to do anything anyway. I recommend copying the name of the service account into a text doc or notepad; we’ll need it a few times during this exercise, and this is a quick and easy way to keep it handy.
Once you have a service account created, head over to BigQuery via the hamburger menu or the search bar. Once in BigQuery, create a new dataset. When the dataset is created, click on it in the tree menu on the left, and then click the “Share Dataset” button.
Let’s change some permissions. Paste the name of the new service account into the name field, then open the drop-down menu next to the name box. Select BigQuery, and then select “Data Viewer”. Click “Add” and then click “Done”. Check out Figure 2 for how that looks in the console.
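Under the hood, sharing a dataset this way edits the dataset’s access list. Here is a rough sketch in Python of the change the console makes; the access-entry structure mirrors a BigQuery dataset’s access list, and the service account email is a hypothetical placeholder. Note that at the dataset level, the Data Viewer role corresponds to the READER access role.

```python
# Rough sketch of the access-entry change "Share Dataset" makes. The structure
# mirrors a BigQuery dataset's access list; the service account email below is
# a hypothetical placeholder.
def grant_data_viewer(access_entries, service_account_email):
    """Append a READER entry (the dataset-level equivalent of Data Viewer)."""
    entry = {"role": "READER", "userByEmail": service_account_email}
    if entry not in access_entries:  # avoid duplicate grants
        access_entries.append(entry)
    return access_entries

# Example: a dataset with one owner gains a reader entry for our new account.
access = [{"role": "OWNER", "userByEmail": "owner@example.com"}]
grant_data_viewer(access, "demo-sa@my-project.iam.gserviceaccount.com")
```

This is just to make the shape of the change concrete; the console (or the `bq` tool and the API) does the real work for us.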
Logs… they’re everywhere!
Let’s see what happened in the logs, shall we? (Pro tip: open a new tab.) Navigate to Stackdriver Logging via the hamburger menu. Once in Stackdriver, take a moment to browse through the logs. Depending on your GCP environment, project, and how Stackdriver is implemented, there could be a deluge of logs, or only a few. If you’re feeling overwhelmed, use the query bar to search for BigQuery logs. If you’re feeling bold, also filter the logs on the dataset you just created. Check out how I did that in Figure 3 below.
Now that we have found the appropriate logs, let’s take a look at what they contain. This is important because we’ll create a metric around these events. You may notice that there is a lot going on in these logs, with many nested objects. The fields we are looking for are the resource type and protoPayload.methodName.
These will give us the logs that are created when a dataset has an IAM policy change. Depending on how your GCP environment is set up, you may also want to include a filter for the project ID. Once you find the log with the permission change, start expanding it. You’ll see (in Figure 4 below) that buried deep in the log is the new IAM policy binding on the Data Viewer role for the dataset.
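If it helps to see the filter logic spelled out, here is a small Python stand-in for it. The field names follow Cloud Audit Logs, but treat the specific resource type and method name as assumptions to verify against what your own logs show.

```python
# Python stand-in for the Stackdriver filter. The resource type
# ("bigquery_resource") and method name ("datasetservice.update") are
# assumptions based on what dataset ACL changes looked like in my logs;
# check them against your own environment.
def is_dataset_policy_change(entry, project_id=None):
    if entry.get("resource", {}).get("type") != "bigquery_resource":
        return False
    payload = entry.get("protoPayload", {})
    if payload.get("methodName") != "datasetservice.update":
        return False
    # Optionally restrict to a single project, as suggested above.
    if project_id and project_id not in payload.get("resourceName", ""):
        return False
    return True
```

The optional `project_id` check mirrors the extra project filter mentioned above.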
Now that we know what logs are generated when we make a policy change for a dataset in BigQuery, let’s create a log-based metric to track them. Click on “Create Metric”. My metric configuration can be seen in Figure 5.
The key step when creating the metric is setting the label. The label tells Stackdriver what to watch for within the logs; in this case, we are looking for the logs we filtered on earlier. Enter the appropriate details for protoPayload.methodName, then add a regex for the value. Click “Done” when you’re finished.
When logs matching these labels are found in Stackdriver, they are measured according to the metric configuration. In this case I am using a counter, but in other use cases a distribution may be more appropriate.
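Conceptually, the label extraction plus the counter works something like the sketch below. The regex here is illustrative, not the one from my configuration; use whatever matches the method names in your logs.

```python
import re

# Sketch of a log-based counter metric with a label extracted from
# protoPayload.methodName via a regex. The regex is an assumption;
# adapt it to the method names your logs actually contain.
LABEL_RE = re.compile(r"datasetservice\.(\w+)")

def count_by_label(entries):
    counts = {}
    for entry in entries:
        method = entry.get("protoPayload", {}).get("methodName", "")
        match = LABEL_RE.search(method)
        if match:  # only matching logs increment the counter
            label = match.group(1)
            counts[label] = counts.get(label, 0) + 1
    return counts
```

A distribution metric would record a value from each log instead of just incrementing a count.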
Now that we have a metric configured, open the hamburger menu, scroll down to Stackdriver Monitoring, and click on it. If you’ve never used Stackdriver Monitoring, a new workspace will be created for you.
Once Stackdriver Monitoring has opened, click on “Dashboards” in the navigation menu. Click on “Create Dashboard” and give your new dashboard a name, then click on “Add Chart” in the right-hand corner. Search for your metric name in the metric field to add it (see Figure 6). I like to give the chart the same name as the metric, so I know exactly what I’m looking at. Once you’ve added the metric and a name, click “Save” to create the chart. You can now see the chart in your dashboard. Don’t sweat it if you don’t see any data yet; we’ll test this out soon enough.
Creating the Alerting Policies
Now that we have metrics and charts set up, let’s create an alerting policy so we’re notified when something happens. In the navigation menu, click on the “Alerting” section. Once on the Alerting page, click on “Create Policy” and give your new policy a name. Click on “Add Condition” to create a new condition. The condition in this case is that our new metric is above 0, meaning a policy change has occurred. For the configuration, set the threshold to 0 and the measurement to “Most Recent Value”. Click “Add” when you’re done.
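In code terms, the condition we just configured amounts to checking the most recent sample of the counter against the threshold. A minimal sketch:

```python
# The alert condition in plain code: fire when the most recent value of the
# counter metric is above the threshold of 0.
def should_alert(samples, threshold=0):
    """samples: time-ordered values of the counter metric."""
    if not samples:
        return False
    return samples[-1] > threshold
```

Because only the most recent value matters, the alert clears as soon as a quiet interval brings the count back to 0, which we’ll see shortly.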
Next, click on “Add Notification Channel”. Select “Email” from the drop-down list and enter the email address you would like the alert sent to. There are some other interesting options for notifications, including Slack, SMS, and GCP Console Mobile. I think the most useful may be PubSub, but perhaps that is best discussed in another blog post… Leave the rest of the options at their defaults.
Once you’ve set up your notification channel, add a description that will be sent along with the notification. This is helpful for letting recipients know what happened and why they are getting an alert notification. Once done, click on “Save”.
Time to Test!
Nice job on making it this far! Let’s put what we’ve built to the test. Go back to BigQuery and remove the service account we created from the Data Viewer role we previously assigned it.
It may take a moment, but if you navigate back to the Stackdriver Monitoring dashboard we created, you’ll see the chart update with the new activity. You can also check the logs in Stackdriver Logging; you’ll see the recent IAM policy change reflected there too, as well as in the alerting policy we just created. By the time you’re done checking all of that out, you should have received an email. My email looked like this:
We’ve Come So Far…
By this point, you’ll notice that you also got another email from Stackdriver saying that the incident has recovered. This is because the alert is triggered whenever a log matches our alerting policy (that is, when the count of the metric is greater than 0), and when the measure of the metric drops back to 0 (our threshold), the system has “recovered”. It’s not a perfect solution, but it shows off the basics of configuring log-based metrics and alerting on events in Stackdriver.
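That open-then-recover lifecycle can be sketched as a tiny state machine over the metric samples; this is a simplification of what the alerting backend actually does, but it matches the behavior we just observed.

```python
# Simplified sketch of the incident lifecycle: an incident opens when the
# metric crosses the threshold and "recovers" when it drops back to it.
def incident_events(samples, threshold=0):
    events, incident_open = [], False
    for t, value in enumerate(samples):
        if value > threshold and not incident_open:
            events.append((t, "opened"))
            incident_open = True
        elif value <= threshold and incident_open:
            events.append((t, "recovered"))
            incident_open = False
    return events
```

One policy change produces a brief spike in the counter, so you get exactly the pair of emails described above: one "opened" and one "recovered".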
Something to consider for the real world: work with your Ops/SRE team to determine which logs, metrics, and alerts are appropriate, and what actions should be taken once these alerts fire.
Be sure to clean up the metrics, policies, service accounts, dataset, and any other extraneous items you created for this exercise, not just to avoid GCP charges, but because it’s good housekeeping.
Phil Goerdt is a cloud architect and consultant, currently working for Google Cloud’s Professional Services Organization. At Google Cloud PSO, Phil helps organizations tackle cloud migrations, tune their environments, and implement best practices for cloud operations and management. Opinions stated here are his own.