How Is My Cloud Storage Data Being Used?

Jesselovelace
Jul 14, 2020 · 9 min read

This tutorial guides you through creating a visualization of your Google Cloud Storage logs in Data Studio, giving you valuable insight into your Cloud Storage traffic and storage data. We’ll create a dashboard that compares the regions of your buckets with the regions of your Compute Engine instances, and highlights the ones that are zoned inefficiently.

This tutorial is based on the 2020 Google Cloud Next Showcase for data visualization.

Enabling Cloud Access Logs For Your Bucket

In order to analyze access to your data, you must collect that data through Cloud Storage access logs. Enable usage logging for any buckets that you would like to analyze using the following commands:

  1. Choose your bucket name, your logs bucket name, and a prefix for your logs and set environment variables for them in your terminal:

2. Create a new logs bucket:

3. Grant necessary permissions:

4. Enable Cloud Access Logging for bucket gs://your-bucket and store logs in gs://your-logs-bucket:
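The four steps above can be sketched as follows. The bucket names and log prefix are placeholders (substitute your own); the log-writer group `cloud-storage-analytics@google.com` is the account Cloud Storage uses to deliver usage logs:

```shell
# 1. Set environment variables (placeholder values -- substitute your own)
export BUCKET=your-bucket
export LOGS_BUCKET=your-logs-bucket
export LOGS_PREFIX=usage_logs

# 2. Create a new logs bucket
gsutil mb gs://${LOGS_BUCKET}

# 3. Grant Cloud Storage write permission on the logs bucket
gsutil acl ch -g cloud-storage-analytics@google.com:W gs://${LOGS_BUCKET}

# 4. Enable usage logging for the bucket, storing logs under ${LOGS_PREFIX}
gsutil logging set on -b gs://${LOGS_BUCKET} -o ${LOGS_PREFIX} gs://${BUCKET}
```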

If you enable logging for multiple buckets, specify the same ${LOGS_PREFIX} for all buckets. Otherwise, the logs prefix defaults to the name of the bucket associated with the logs, and it will be difficult to search for them in the next step.

Loading Logs into BigQuery

Follow the instructions below to load data into BigQuery. For further assistance with access logs, consult the Google Cloud Storage documentation.

  1. Download the Storage Usage schema:

2. Create a BigQuery dataset:

3. Load usage data into BigQuery:
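A sketch of these three steps, assuming the `LOGS_BUCKET` and `LOGS_PREFIX` variables from the previous section and the `storageanalysis` dataset name used throughout this tutorial:

```shell
# 1. Download the published Cloud Storage usage-log schema
wget http://storage.googleapis.com/pub/cloud_storage_usage_schema_v0.json

# 2. Create a BigQuery dataset to hold the logs
bq mk storageanalysis

# 3. Load all usage logs from the logs bucket into a table named "usage";
#    the first row of each log file is a header, so skip it
bq load --skip_leading_rows=1 \
  storageanalysis.usage \
  "gs://${LOGS_BUCKET}/${LOGS_PREFIX}_usage*" \
  ./cloud_storage_usage_schema_v0.json
```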

Remember to clean up logs loaded into BigQuery so you don’t reload them accidentally.

See the appendix at the end of this post to learn how to automate loading of Access Logs when they’re created using Google Cloud Function Triggers.

Mapping Cloud Storage Bucket Locations

Next, create a mapping table of bucket names to locations, so we can later look up where each bucket lives:

  1. Get each bucket’s name and location and store them in a CSV file. Note that this only gets the data from one bucket. If you want to analyze multiple buckets, you may want to run a prefix search (for example, “gs://bucket-prefix*”), or run this command multiple times and keep appending to the CSV:

2. Load the CSV file into a table named storageanalysis.bucket_locations:
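Assuming the bucket name is in `${BUCKET}`, the two steps might look like this (the `awk` pattern matches the `Location constraint` line in `gsutil ls -L -b` output):

```shell
# 1. Append the bucket's name and location to a CSV file
#    (rerun with a different ${BUCKET} to cover additional buckets)
LOCATION=$(gsutil ls -L -b gs://${BUCKET} | awk '/Location constraint:/ {print $3}')
echo "${BUCKET},${LOCATION}" >> bucket_locations.csv

# 2. Load the CSV into the bucket_locations table
bq load --source_format=CSV \
  storageanalysis.bucket_locations \
  ./bucket_locations.csv \
  bucket:STRING,location:STRING
```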

These commands create a new BigQuery table that can be used to look up the location of each bucket we encounter in the usage logs. To verify that everything worked, query the `bucket_locations` table; you should see each bucket listed alongside its location.
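For example, a quick check from the command line (the rows returned will depend on your buckets):

```shell
bq query --use_legacy_sql=false \
  'SELECT bucket, location FROM storageanalysis.bucket_locations'
```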

Mapping Compute Engine Instance Locations

Next, we’ll want to look at the request logs for Compute Engine instances.

  1. Get a CSV of Compute instance external IPs in use, with an additional empty column for country, which we will fill in a moment:

Note: If you’re analyzing a multi-project environment, you might need to run this command once for each of your projects.

2. Load the CSV into a BigQuery table called storageanalysis.compute_instance_locations:

3. Next, map each instance’s zone to the two-letter ISO country code of its location. You can run the query either with `bq query` from the command line or from the BigQuery console.
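The three steps can be sketched as follows. The CSV column names and the zone-to-country `CASE` expression are illustrative assumptions, not the post’s original query; extend the mapping to cover every zone family you actually use:

```shell
# 1. Export each instance's name, zone, and external IP, with a trailing
#    empty "country" column to be filled in below
echo "name,zone,external_ip,country" > compute_instance_locations.csv
gcloud compute instances list \
  --format="csv[no-heading](name,zone,networkInterfaces[0].accessConfigs[0].natIP)" \
  | sed 's/$/,/' >> compute_instance_locations.csv
# (multi-project: rerun the gcloud command with --project=<id> for each project)

# 2. Load the CSV into BigQuery
bq load --source_format=CSV --skip_leading_rows=1 \
  storageanalysis.compute_instance_locations \
  ./compute_instance_locations.csv \
  name:STRING,zone:STRING,external_ip:STRING,country:STRING

# 3. Fill in the ISO country code from the zone name (illustrative mapping)
bq query --use_legacy_sql=false '
UPDATE storageanalysis.compute_instance_locations
SET country = CASE
  WHEN STARTS_WITH(zone, "us-") THEN "US"
  WHEN STARTS_WITH(zone, "europe-west1") THEN "BE"
  WHEN STARTS_WITH(zone, "asia-northeast1") THEN "JP"
  ELSE country
END
WHERE TRUE'
```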

See the appendix at the end of this post to learn how to do this with individual user requests, and how to automate tracking ephemeral compute instance IPs.

Creating a BigQuery View for Data Studio Visualization

Next, create a BigQuery view, which allows you to update underlying tables without having to regenerate this table. In other words, it will reflect the most up-to-date data from other tables.

You can either run this query from the BigQuery Console and click “Save view” on the results to save it into a view called “visualization_view,” or run the following in the terminal:

The query is as follows:
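A sketch of such a view, assuming the tables built above and the standard usage-log column names (`time_micros`, `c_ip`, `cs_bucket`, `cs_bytes`, `sc_bytes`); adjust the selected columns to whatever your dashboard needs:

```shell
bq query --use_legacy_sql=false '
CREATE OR REPLACE VIEW storageanalysis.visualization_view AS
SELECT
  TIMESTAMP_MICROS(u.time_micros) AS request_time,
  b.location AS bucket_location,
  c.zone AS instance_zone,
  c.country AS instance_country,
  u.cs_bytes,
  u.sc_bytes
FROM storageanalysis.usage AS u
JOIN storageanalysis.bucket_locations AS b
  ON u.cs_bucket = b.bucket
LEFT JOIN storageanalysis.compute_instance_locations AS c
  ON u.c_ip = c.external_ip'
```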

Connecting Data to Data Studio

  1. Open the Data Studio dashboard, which contains simulated data.
  2. Click the “Make a copy of this report” icon at the top.

3. Under “New data source,” click the drop-down and then click “Create new data source” (You can also leave it unchanged if you want to use the simulated data in the template).


4. Choose “BigQuery” as the connector.


5. Navigate to the view you just created, select it, and then click “Connect” at the top right.


6. Click “Connect,” then click “Add to report,” then click “Copy Report.”


Conclusions

Thank you for taking the time to read this tutorial and learn how you can use Cloud Storage, BigQuery, and Data Studio to gain insights through request-log visualizations.

Ready to take the next step?

Appendix

A.1: Automate IP tracking for compute engine instances

Compute Engine instance external IPs are allocated at startup and deallocated at shutdown or deletion, after which an IP can be reused by another instance. Cloud Storage access logs record the external IP used, but not the specific Compute Engine instance associated with it, so you must keep track of IP allocation and deallocation over the lifetime of your Compute Engine instances.

To automate IP tracking for Compute Engine instances, use the Monitoring Asset Changes feature of the Cloud Asset Inventory API, which tracks historical resource metadata and can notify us of changes via Cloud Pub/Sub. We’ll use a Cloud Function to log IP updates from the Pub/Sub feed into BigQuery, using the same table created earlier (storageanalysis.compute_instance_locations).

  1. Enable the Resource Manager API for your project in the Cloud Console.
  2. Create a Pub/Sub topic to associate a Cloud Function trigger:

3. Create an Asset Inventory Feed to push updates to the Pub/Sub topic compute_instance_updates:
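These two steps can be sketched with the `gcloud` CLI. `${PROJECT_ID}` is a placeholder for your project ID, and the feed name `compute-instance-feed` is an arbitrary choice:

```shell
# 2. Create the Pub/Sub topic that will receive asset-change events
gcloud pubsub topics create compute_instance_updates

# 3. Create an Asset Inventory feed that publishes Compute Engine
#    instance changes to that topic
gcloud asset feeds create compute-instance-feed \
  --project=${PROJECT_ID} \
  --asset-types="compute.googleapis.com/Instance" \
  --content-type=resource \
  --pubsub-topic="projects/${PROJECT_ID}/topics/compute_instance_updates"
```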

Next, we’ll create a Cloud Function to process events from the Pub/Sub topic compute_instance_updates.

  1. From the Cloud Functions console, click “Create Function”.
  2. Name the function “compute_instance_updates_function”.
  3. In the “Trigger” dropdown, select “Cloud Pub/Sub,” then select the “compute_instance_updates” topic from the dropdown that appears.
  4. Click “Save” and then click “Next”.
  5. Select Python 3.7 as the runtime.
  6. Select Source code as “ZIP from Cloud Storage”.
  7. For “Cloud Storage location”, enter “gs://storage-traffic-project/compute_instance_updates_function.zip”.
  8. You can download the zip file to view the source of the Cloud Function.
  9. For “Function to execute” enter “handle_asset_event”.
  10. Click “Deploy” to deploy the function.

When Compute Engine instances are created, started, shut down, or deleted, the deployed Cloud Function records when external IPs are allocated to and deallocated from them. Using this information, BigQuery can associate each request in the Cloud Storage access logs with the Compute Engine instance that held that IP at the time.

For additional information, take a look at the Cloud Asset Inventory and Cloud Functions documentation.

A.2: Automate Loading Access Logs into BigQuery

The tutorial describes how to manually load request logs into the storageanalysis.usage table. To automate this process, you can use a Cloud Functions Storage trigger based on object events.

Next, we’ll create a Cloud Function to process events for the Cloud Storage logs bucket when new objects are created.

  1. From the Cloud Functions console, click “Create Function”.
  2. Name the function “storage_logs_updates_function”.
  3. In the “Trigger” dropdown, select “Cloud Storage,” then select the “Finalize/Create” event from the “Event Type” dropdown that appears and the logs bucket in “Bucket” that stores Access Logs.
  4. Click “Save” and then click “Next”.
  5. Select Node.js 10 as the runtime.
  6. Select Source code as “ZIP from Cloud Storage”.
  7. For “Cloud Storage location”, enter “gs://storage-traffic-project/access_logs_loader.zip”.
  8. You can download the zip file to view the source of the Cloud Function.
  9. For “Function to execute” enter “processUsageUpdate”.
  10. Click “Deploy” to deploy the function.

A.3: Understanding Individual User Requests

Another way to understand client requests is to determine the country where your public data was accessed using the MaxMind IP-to-geolocation dataset. To do this, register for a MaxMind account on the MaxMind website. Once you’ve logged in, navigate to “Download Files” under “GeoIP2 / GeoLite2” on the left. Click the “Download ZIP” link for “GeoLite2 City: CSV Format” and extract the ZIP file.

In order to get the location of the user requesting your data, look up the location of the IP address accessing data. You can do this using a BigQuery query against the MaxMind GeoIP database we just downloaded. For more detailed information, review Geolocation with BigQuery.

Finally, we want to aggregate the data access into groups of time with the same origin/destination location. Our final BigQuery query looks like this:
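A sketch of such an aggregation, assuming the client IPs have already been resolved to countries in a hypothetical `storageanalysis.ip_countries` table built from the MaxMind lookup above (the table and its column names are illustrative):

```shell
bq query --use_legacy_sql=false '
SELECT
  TIMESTAMP_TRUNC(TIMESTAMP_MICROS(u.time_micros), HOUR) AS hour,
  ip.country AS client_country,
  b.location AS bucket_location,
  COUNT(*) AS requests,
  SUM(u.sc_bytes) AS bytes_sent
FROM storageanalysis.usage AS u
JOIN storageanalysis.bucket_locations AS b
  ON u.cs_bucket = b.bucket
JOIN storageanalysis.ip_countries AS ip
  ON u.c_ip = ip.ip
GROUP BY hour, client_country, bucket_location
ORDER BY hour'
```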

From here, go back to the Creating a BigQuery View section and continue following along.

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.