How is my Cloud Storage Data being used?

Jesselovelace
Google Cloud - Community
Jul 14, 2020 · 9 min read

This tutorial guides you through creating a visualization of your Google Cloud Storage logs in Data Studio, giving you valuable insights into your Cloud Storage traffic and storage data. We’ll create a dashboard that compares the regions of your buckets with the regions of your Compute Engine instances, and highlights the ones that are zoned inefficiently.

This tutorial is based on the 2020 Google Cloud Next Showcase for data visualization.

Table of Contents

Enabling Cloud Access Logs For Your Bucket
Loading Logs into BigQuery
Mapping Cloud Storage Bucket Locations
Mapping Compute Engine Instance Locations
Creating a BigQuery View for Data Studio Visualization
Connecting Data to Data Studio
Conclusions
Appendix

Enabling Cloud Access Logs For Your Bucket

In order to analyze access to your data, you must first collect that data through Cloud Storage Access Logs. Enable usage logging for any buckets that you would like to analyze using the following commands:

  1. Choose your bucket name, your logs bucket name, and a prefix for your logs and set environment variables for them in your terminal:
BUCKET="your-bucket-name"
LOGS_BUCKET="your-logs-bucket"
LOGS_PREFIX="your-logs-prefix"

2. Create a new logs bucket:

gsutil mb "gs://${LOGS_BUCKET}"

3. Grant necessary permissions:

gsutil iam ch group:cloud-storage-analytics@google.com:objectCreator "gs://${LOGS_BUCKET}"

4. Enable Cloud Access Logging for gs://${BUCKET}, storing the logs in gs://${LOGS_BUCKET}:

gsutil logging set on -b "gs://${LOGS_BUCKET}" -o "${LOGS_PREFIX}" "gs://${BUCKET}"

If you enable logging for multiple buckets, specify the same ${LOGS_PREFIX} for all buckets. Otherwise, the logs prefix defaults to the name of the bucket associated with the logs, and it will be difficult to search for them in the next step.
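
If you have several buckets to analyze, a small loop saves some typing. This is only a sketch; the BUCKETS list below is a placeholder for your own bucket names:

# Hypothetical list of buckets to analyze; replace with your own.
BUCKETS="bucket-one bucket-two bucket-three"

for B in ${BUCKETS}; do
  # Reuse the same logs bucket and logs prefix for every bucket.
  gsutil logging set on -b "gs://${LOGS_BUCKET}" -o "${LOGS_PREFIX}" "gs://${B}"
done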

Loading Logs into BigQuery

Follow the instructions below to load data into BigQuery. For further assistance with access logs, consult the Google Cloud Storage documentation.

  1. Download the Storage Usage schema:
wget https://storage.googleapis.com/pub/cloud_storage_usage_schema_v0.json

2. Create a BigQuery dataset:

bq mk storageanalysis

3. Load usage data into BigQuery:

bq load --skip_leading_rows=1 storageanalysis.usage "gs://${LOGS_BUCKET}/${LOGS_PREFIX}_usage*" ./cloud_storage_usage_schema_v0.json

Remember to clean up (or move) log objects after loading them into BigQuery so you don’t accidentally load them twice.
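
One way to do that cleanup, assuming you want to keep the raw logs around, is to move the already-loaded objects into a separate archive prefix in the logs bucket so they no longer match the load wildcard (the archive/ prefix here is just an example):

# Move processed usage logs out of the path matched by the bq load wildcard.
gsutil -m mv "gs://${LOGS_BUCKET}/${LOGS_PREFIX}_usage*" "gs://${LOGS_BUCKET}/archive/"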

See the appendix at the end of this post to learn how to automate loading of Access Logs when they’re created using Google Cloud Function Triggers.

Mapping Cloud Storage Bucket Locations

Next, create a mapping table of bucket names to bucket locations, so we can later look up which country each bucket is in:

  1. Get each bucket’s name and location and store them in a CSV file. Note that this only gets the data for one bucket. If you want to analyze multiple buckets, you can run a prefix search (for example, “gs://bucket-prefix*”), or run this command multiple times and keep appending to the CSV (see the loop sketched after this command):
gsutil ls -L -b "gs://${BUCKET}" |
grep -e "gs://" -e "Location constraint" |
awk -F ":" '{printf (NR%2==0) ? $2 "\n" : $2 ","}' |
sed 's/\///g;s/[[:blank:]]//g' > bucket_locations.csv
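
For the multi-bucket case, here is the same pipeline wrapped in a loop, as a sketch; the BUCKETS list is a placeholder for your own bucket names:

# Hypothetical list of buckets; replace with your own (or the result of a prefix search).
BUCKETS="bucket-one bucket-two bucket-three"

rm -f bucket_locations.csv
for B in ${BUCKETS}; do
  # Same name/location extraction as above, appended for each bucket.
  gsutil ls -L -b "gs://${B}" |
  grep -e "gs://" -e "Location constraint" |
  awk -F ":" '{printf (NR%2==0) ? $2 "\n" : $2 ","}' |
  sed 's/\///g;s/[[:blank:]]//g' >> bucket_locations.csv
done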

2. Load the CSV file into a table named storageanalysis.bucket_locations:

bq load --replace --source_format=CSV storageanalysis.bucket_locations ./bucket_locations.csv bucket_name:STRING,bucket_location:STRING

These commands create a new BigQuery table that can be used to look up the location of each bucket we encounter through our usage logs. To verify everything worked, run the following command:

bq ls storageanalysis

You should see something similar to the following:

      tableId        Type    Labels   Time Partitioning   Clustered Fields
 ------------------ ------- -------- ------------------- ------------------
  bucket_locations   TABLE
  usage               TABLE

Mapping Compute Engine Instance Locations

Next, we’ll map the locations of the Compute Engine instances that appear in our request logs.

  1. Get a CSV of the external IPs in use by your Compute Engine instances, with additional empty columns for end_time and country, which we will fill in later:
gcloud compute instances list --format="csv(id,name,creationTimestamp,end_time,EXTERNAL_IP,INTERNAL_IP,zone,country)" --filter="status=RUNNING" > compute_instance_locations.csv

Note: If you’re analyzing a multi-project environment, you might need to run this command for all your projects:

gcloud compute instances list --format="csv(id,name,creationTimestamp,end_time,EXTERNAL_IP,INTERNAL_IP,zone,country)" --filter="status=RUNNING" --project={your-project-id} >> compute_instance_locations.csv
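
One way to build the combined CSV is to loop over every project you can list. A sketch, assuming your account can list instances in each project; the csv[no-heading] format keeps the header row from being repeated on every append:

# Write the header once, then append one block of rows per project.
echo "id,name,creationTimestamp,end_time,external_ip,internal_ip,zone,country" > compute_instance_locations.csv

for PROJECT in $(gcloud projects list --format="value(projectId)"); do
  gcloud compute instances list \
    --format="csv[no-heading](id,name,creationTimestamp,end_time,EXTERNAL_IP,INTERNAL_IP,zone,country)" \
    --filter="status=RUNNING" \
    --project="${PROJECT}" >> compute_instance_locations.csv
done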

2. Load the CSV into a BigQuery table called storageanalysis.compute_instance_locations:

bq load --replace --source_format=CSV --skip_leading_rows=1 storageanalysis.compute_instance_locations ./compute_instance_locations.csv resource_id:STRING,resource_name:STRING,start_time:TIMESTAMP,end_time:TIMESTAMP,external_ip:STRING,internal_ip:STRING,zone:STRING,country:STRING

  3. Next, map each instance’s zone to the two-letter ISO country code of its location. If you want to run the query from the command line, you can use:

bq query --use_legacy_sql=false '{QUERY_BELOW_HERE}'

Or you can use the BigQuery console. Using either method, run the following query:

CREATE TEMP TABLE `mappings` AS
SELECT *
FROM UNNEST(
[STRUCT("US" AS abbr, "US%" AS long), ("TW", "ASIA-EAST1%"),
("JP", "ASIA-NORTHEAST1%"), ("HK", "ASIA-EAST2%"),
("JP", "ASIA-NORTHEAST2%"),("KR", "ASIA-NORTHEAST3%"),
("IN", "ASIA-SOUTH1%"),("SG", "ASIA-SOUTHEAST1%"),
("AU", "AUSTRALIA%"),("FI", "EUROPE-NORTH1%"),
("BE", "EUROPE-WEST1%"),("GB", "EUROPE-WEST2%"),
("DE", "EUROPE-WEST3%"),("NL", "EUROPE-WEST4%"),
("CH", "EUROPE-WEST6%"),("CA", "NORTHAMERICA%"),
("BR", "SOUTHAMERICA%")
]);
UPDATE storageanalysis.compute_instance_locations
SET country = abbr
FROM mappings
WHERE UPPER(zone) LIKE long;
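
The mapping above only covers a subset of regions, so after running the update it’s worth checking for zones that didn’t match anything. A quick sanity check (not part of the original steps):

# List any instance zones that still have no country mapping.
bq query --use_legacy_sql=false \
  'SELECT DISTINCT zone
   FROM storageanalysis.compute_instance_locations
   WHERE country IS NULL OR country = ""'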

See the appendix at the end of this post to learn how to do this with individual user requests, and how to automate tracking ephemeral compute instance IPs.

Creating a BigQuery View for Data Studio Visualization

Next, create a BigQuery view. Because a view is a saved query rather than a copy of the data, it always reflects the most up-to-date contents of the underlying tables, so you can update those tables without regenerating anything.

You can either run this query from the BigQuery Console and click “Save view” on the results to save it into a view called “visualization_view,” or run the following in the terminal:

bq mk --use_legacy_sql=false --description "View for Data Studio visualization" --view '{QUERY_BELOW_HERE}' storageanalysis.visualization_view

The query is as follows:

SELECT 
time_micros AS timestamp,
SUBSTR(zone,0,length(zone)-2) as client_location,
compute_instances.country as client_country,
cs_bucket AS bucket_name,
LOWER(bucket_location) AS bucket_location,
IF(LOWER(SUBSTR(bucket_location,0,2))=LOWER(SUBSTR(zone,0,2)), "Colocated", "Non Colocated") AS colocated,
cs_object AS object_name,
sc_bytes AS egress_bytes,
cs_method AS operation,
IF(LOWER(SUBSTR(bucket_location,0,2))=LOWER(SUBSTR(zone,0,2)), sc_bytes/POWER(2,30)*0.01, sc_bytes/POWER(2,30)*0.10) AS calculated_cost
FROM `storage-traffic-project.storageanalysis.usage`
LEFT JOIN `storage-traffic-project.storageanalysis.bucket_locations`
ON cs_bucket=bucket_name
LEFT JOIN `storage-traffic-project.storageanalysis.compute_instance_locations` AS compute_instances
ON compute_instances.external_ip=c_ip AND compute_instances.start_time < TIMESTAMP_MICROS(time_micros) AND (TIMESTAMP_MICROS(time_micros) < compute_instances.end_time OR compute_instances.end_time IS NULL)
WHERE cs_method LIKE "GET" AND cs_object NOT LIKE "" AND cs_object IS NOT NULL AND zone IS NOT NULL;
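
Once the view exists, a quick aggregate query is an easy way to confirm it returns data before wiring up Data Studio. A sketch, assuming you saved the view as storageanalysis.visualization_view as above:

# Summarize egress by colocation status as a smoke test of the view.
bq query --use_legacy_sql=false \
  'SELECT colocated, COUNT(*) AS requests, SUM(egress_bytes) AS egress_bytes, SUM(calculated_cost) AS calculated_cost
   FROM storageanalysis.visualization_view
   GROUP BY colocated'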

Connecting Data to Data Studio

  1. Open the Data Studio dashboard, which contains simulated data.
  2. Click the “Make a copy of this report” icon at the top.

3. Under “New data source,” click the drop-down and then click “Create new data source” (You can also leave it unchanged if you want to use the simulated data in the template).

4. Choose “BigQuery” as the connector.

5. Navigate to the view you just created, select it, and then click “Connect” at the top right.

6. Click “Connect,” then click “Add to report,” then click “Copy Report.”

Conclusions

Thank you for taking the time to read this tutorial and learn how you can use Cloud Storage, BigQuery, and Data Studio to gain insights through request log visualizations.

Ready to take the next step?

Appendix

A.1: Automate IP tracking for Compute Engine instances

Compute Engine instance external IPs are allocated at startup and deallocated at shutdown or deletion, after which they can be reused by another instance. Cloud Storage Access Logs record the external IP that made each request, but not the specific Compute Engine instance behind it, so you must keep track of IP allocation and deallocation over the lifetime of your Compute Engine instances.

To automate IP tracking for Compute Engine instances, use the Monitoring Asset Changes feature of the Cloud Asset Inventory API, which tracks historical resource metadata and publishes change notifications to Cloud Pub/Sub. We will use a Cloud Function to log IP updates from the Pub/Sub feed into BigQuery, using the same table created earlier (storageanalysis.compute_instance_locations).

  1. Enable the Resource Manager API for your project in the Cloud Console API Library.
  2. Create a Pub/Sub topic to use as the Cloud Function trigger:
gcloud pubsub topics create compute_instance_updates

3. Create an Asset Inventory Feed to push updates to the Pub/Sub topic compute_instance_updates:

gcloud asset feeds create feed-name --project="{your-project-id}" --content-type="resource" --asset-types="compute.googleapis.com/Instance" --pubsub-topic="projects/{your-project-id}/topics/compute_instance_updates"

Next, we’ll create a Cloud Function to process events from the Pub/Sub topic compute_instance_updates.

  1. From the Cloud Functions console, click “Create Function”.
  2. Name the function “compute_instance_updates_function”.
  3. In the “Trigger” dropdown, select “Cloud Pub/Sub,” then select the “compute_instance_updates” topic from the dropdown that appears.
  4. Click “Save” and then click “Next”
  5. Select Python 3.7 as the runtime.
  6. Select Source code as “ZIP from Cloud Storage”.
  7. For “Cloud Storage location” enter “gs://storage-traffic-project/compute_instance_updates_function.zip”
  8. You can download the zip file to view the source of the Cloud Function.
  9. For “Function to execute” enter “handle_asset_event”.
  10. Click “Deploy” to deploy the function.

When Compute Engine instances are created, started, shut down, or deleted, the deployed Cloud Function records when external IPs are allocated to and deallocated from them. Using this information, BigQuery can associate each request in the Cloud Storage Access Logs with the Compute Engine instance that held that IP at the time of the request.
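
To see what that association looks like, here is a sketch of a point-in-time lookup; the IP address and timestamp are made-up example values:

# Find which instance held a given external IP at a given moment.
bq query --use_legacy_sql=false \
  'SELECT resource_name, zone
   FROM storageanalysis.compute_instance_locations
   WHERE external_ip = "203.0.113.7"
     AND start_time <= TIMESTAMP("2020-07-01 12:00:00")
     AND (end_time IS NULL OR end_time > TIMESTAMP("2020-07-01 12:00:00"))'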

A.2: Automate Loading Access Logs into BigQuery

The tutorial describes how to manually load request logs into the storageanalysis.usage table. To automate this process, you can use a Cloud Function with a Cloud Storage trigger that fires on object events.

Next, we’ll create a Cloud Function that processes events from the Cloud Storage logs bucket whenever new log objects are created (a command-line equivalent is sketched after the steps below).

  1. From the Cloud Functions console, click “Create Function”.
  2. Name the function “storage_logs_updates_function”.
  3. In the “Trigger” dropdown, select “Cloud Storage,” select the “Finalize/Create” event from the “Event Type” dropdown that appears, and choose the logs bucket that stores your Access Logs under “Bucket”.
  4. Click “Save” and then click “Next”
  5. Select Node.js 10 as the runtime.
  6. Select Source code as “ZIP from Cloud Storage”.
  7. For “Cloud Storage location” enter “gs://storage-traffic-project/access_logs_loader.zip”
  8. You can download the zip file to view the source of the Cloud Function.
  9. For “Function to execute” enter “processUsageUpdate”.
  10. Click “Deploy” to deploy the function.
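
If you’d rather script this than click through the console, something like the following gcloud command mirrors the steps above; the zip path, runtime, and entry point come from those steps, and it reuses the ${LOGS_BUCKET} variable from earlier. Treat this as a sketch rather than the exact deployment used in the tutorial:

# Deploy the log-loading function with a Cloud Storage "object finalized" trigger.
gcloud functions deploy storage_logs_updates_function \
  --runtime=nodejs10 \
  --entry-point=processUsageUpdate \
  --source="gs://storage-traffic-project/access_logs_loader.zip" \
  --trigger-resource="${LOGS_BUCKET}" \
  --trigger-event="google.storage.object.finalize"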

A.3: Understanding Individual User Requests

Another way to understand client requests is to determine the country where your public data was accessed, using the MaxMind IP-to-geolocation dataset. To do this, register for a MaxMind account. Once you’ve logged in, navigate to “Download Files” under “GeoIP2 / GeoLite2” on the left. Click the “Download ZIP” link for “GeoLite2 City: CSV Format” and extract the ZIP file.

To get the location of the user requesting your data, look up the location of the IP address that accessed it. You can do this with a BigQuery query against the MaxMind GeoLite2 data we just downloaded; for more detailed information, review Geolocation with BigQuery. First, load the CSV files into BigQuery:

bq load --source_format=CSV --autodetect storageanalysis.ipv6_city_locations GeoLite2-City-Locations-en.csv
bq load --source_format=CSV --autodetect storageanalysis.ipv6_city_blocks GeoLite2-City-Blocks-IPv6.csv

Finally, we want to aggregate data access into time blocks that share the same origin and destination location. The final BigQuery query looks like this:

SELECT time_block*600 AS time, country, region, ingress, egress, requests FROM (
SELECT
country_name as country,
CAST(FLOOR(time_micros/(1000000*600)) AS int64) AS time_block,
SUM(cs_bytes) as ingress,
SUM(sc_bytes) as egress,
COUNT(cs_bucket) as requests,
bucket_locations.bucket_location AS region
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(c_ip) & NET.IP_NET_MASK(16, mask) AS network_bin
FROM storageanalysis.usage, UNNEST(GENERATE_ARRAY(9,128)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(c_ip)) = 16
)
JOIN (SELECT *
, NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin
, CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask
FROM `storageanalysis.ipv6_city_blocks` AS city_blocks
JOIN `storageanalysis.ipv6_city_locations` AS city_locations
USING(geoname_id)
)
USING (network_bin, mask)
LEFT OUTER JOIN
storageanalysis.bucket_locations AS bucket_locations
ON
cs_bucket = bucket_locations.bucket_name
GROUP BY country_name, time_block, region
ORDER BY time_block ASC
)
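
If you want this per-country aggregation available in Data Studio too, one option is to save it as another view, just like visualization_view earlier. A sketch, assuming you have put the query above into a local file named aggregate_query.sql; both that filename and the view name user_requests_view are examples, not part of the original tutorial:

# Save the aggregation query as a view that Data Studio can connect to.
bq mk --use_legacy_sql=false \
  --description "Per-country Cloud Storage access aggregates" \
  --view "$(cat aggregate_query.sql)" \
  storageanalysis.user_requests_view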

From here, go back to Creating a BigQuery View for Data Studio Visualization and continue following along.
