Resource Metrics Collection using GCP Compute Engine API

Bhabani Ranjan Mahapatra
Google Cloud - Community
9 min read · Feb 27, 2023

Written in collaboration with Kriti Maloo, Avani Vyas, Karthik Venkatesh and Megha Manglani

Audience: Developers working on GCP
Language: Python
Expected time for reading: Approximately 15 minutes

Google Cloud Platform (GCP) offers a wide range of powerful and scalable infrastructure services to meet the needs of businesses of all sizes. The Compute Engine API is one of the core components of GCP, providing an interface for managing virtual machine instances and other compute resources.

Leveraging this API, developers can easily obtain information on the number of instances, vCPUs, images, persistent disks, snapshots, etc., including their configuration, and use this data to optimize performance, reduce costs, implement custom business logic, and explore it with any reporting service.

For larger organizations that have thousands of projects in their GCP environment, retrieving information from each project can be a cumbersome and time-consuming task. However, the Compute Engine API offers a solution to this challenge. By using service accounts and delegating authority to the API, you can retrieve data from multiple projects simultaneously, without having to switch between projects or authenticate multiple times.

This can help you gain valuable insights into the entire GCP infrastructure, enabling data-driven decisions and optimizing resources across your organization. With its powerful and flexible capabilities, the Compute Engine API becomes an invaluable tool for managing and monitoring large-scale GCP environments.

For more details, you can refer to the Compute Engine API documentation.

There are alternative solutions available as well. For example, you can use GCP Cloud Asset Inventory (CAI) and its API. I can write another post describing how CAI works in detail.

In this post, we’ll explore specific resources and their metrics using Compute Engine API in detail and look at how we can leverage it to get valuable insights from the GCP environment. You can extend it further to other compute resources based on your preferences.

Prerequisites:

Ensure the following points are taken care of before you start executing the Python scripts.

  • Authenticate with application-default credentials by running
gcloud beta auth application-default login
  • Install the Python client library for Google APIs by running
pip install --upgrade google-api-python-client
  • A service account should have the following permissions
    –compute.instances.get
    –compute.instances.list
    –compute.disks.get
    –compute.disks.list
    –compute.images.get
    –compute.images.list
    –compute.zones.get
    –compute.zones.list
    –compute.regions.get
    –compute.regions.list
    –compute.snapshots.get
    –compute.snapshots.list
    –compute.projects.get
    –compute.addresses.list
  • For BigQuery and Cloud Storage, you can have the following roles attached to the service account. You can also grant narrower, specific permissions instead, as above.
    –BigQuery Data Editor
    –BigQuery Job User
    –Storage Object Admin
  • One project id where the script would run to fetch metrics information from the remaining project ids. Let’s assume this project id is google-dev-resource-monitoring-1122
  • Two tables in the BigQuery dataset (under google-dev-resource-monitoring-1122 project id). You will find more information regarding these tables later in this post.
    –resource_metrics
    –resource_metrics_config

Our objective is to get the count of instances and the vCPUs belonging to each instance, along with the count of images, snapshots and persistent disks. We will create a config table to hold all the project ids. You might wonder how to list these project ids. There are actually multiple ways of doing this. If you have Org-level permissions, you can simply list them using the following command.

$ gcloud asset search-all-resources \
    --asset-types="cloudresourcemanager.googleapis.com/Project" \
    --scope=organizations/YOUR_ORG_NAME

To use the above command, the Cloud Asset API (cloudasset.googleapis.com) must be enabled.

For more on this please refer to the documentation.

Now coming back to the specifics of the configuration table, this is how the table structure would look.

resource_metrics_config:

CREATE TABLE `project_id.monitoring.resource_metrics_config`
(
  project_id STRING,
  is_active STRING,
  last_inserted TIMESTAMP
);
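
Once you have the project ids (for example, from the gcloud command above), they need to be loaded into this config table. Here is a minimal sketch that uses the same BigQuery streaming-insert call used later in this post; the load_config helper name and json_file variable are illustrative, not part of the original codebase.

# Minimal sketch: load project ids into the resource_metrics_config table.
# load_config is an illustrative helper; json_file is the service account key file path.
import datetime
from google.cloud import bigquery
from google.oauth2 import service_account

def load_config(project_ids, json_file):
    credentials = service_account.Credentials.from_service_account_file(json_file)
    client = bigquery.Client(credentials=credentials,
                             project="google-dev-resource-monitoring-1122")
    rows = [{"project_id": p,
             "is_active": "Y",
             "last_inserted": datetime.datetime.utcnow().isoformat()}
            for p in project_ids]
    errors = client.insert_rows_json(
        "google-dev-resource-monitoring-1122.monitoring.resource_metrics_config", rows)
    if errors:
        print("Errors while loading config: {}".format(errors))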

We will create another table that will store the metrics information belonging to the resources within each project id.

resource_metrics:

CREATE TABLE `project_id.monitoring.resource_metrics`
(
  project_id STRING,
  metric_name STRING,
  metric_value FLOAT64,
  last_inserted TIMESTAMP
)
PARTITION BY DATE(last_inserted);

The following architecture diagram would give a fair idea of how this setup works.

Once the above tables are created and the configuration table is updated with all the project ids, we can start looking at the Python code and exploring the API.

The very first function gets the list of project ids from the config table.

# Get configuration details from BigQuery.
# json_file is the path to the service account key file.
from google.oauth2 import service_account
from googleapiclient import discovery
from google.cloud import bigquery

def get_project_ids():
    credentials = service_account.Credentials.from_service_account_file(json_file)
    client = bigquery.Client(credentials=credentials, project="google-dev-resource-monitoring-1122")
    records = client.query(
        "SELECT project_id "
        "FROM `google-dev-resource-monitoring-1122.monitoring.resource_metrics_config` "
        "WHERE is_active = 'Y'")
    return records

Once we have the list of project ids, we can loop through each project id and call the respective resource metric function. Some of those functions are outlined below, and a sample driver that ties them together is shown after the write_to_bq function.

Get Snapshot Count:

# Get the count of snapshots that are in READY status
def get_snapshot_count_from_compute_api(project_id, service):
    resource_count = []
    try:
        request = service.snapshots().list(project=project_id, filter="status eq 'READY'")
        while request is not None:
            response = request.execute()
            if 'items' in response:
                for snapshot in response['items']:
                    resource_count.append(snapshot['name'])
            # Fetch the next page of results
            request = service.snapshots().list_next(previous_request=request, previous_response=response)
    except Exception as e:
        print("Error : " + str(e))
        return None
    return len(resource_count)

Get Disk Count:

# Get the count of persistent disks that are in READY status
def get_disk_count_from_compute_api(project_id, service):
    resource_count = []
    try:
        request = service.disks().aggregatedList(project=project_id, filter="status eq 'READY'")
        while request is not None:
            response = request.execute()
            # aggregatedList groups results by scope (zone/region), so there is no single zone parameter
            for name, disks_scoped_list in response['items'].items():
                if 'disks' in disks_scoped_list:
                    for disk in disks_scoped_list['disks']:
                        resource_count.append(disk['name'])
            # Fetch the next page of results
            request = service.disks().aggregatedList_next(previous_request=request, previous_response=response)
    except Exception as e:
        print("Error : " + str(e))
        return None
    return len(resource_count)

Get Image Count:

# Get the count of images that are in READY status
def get_image_count_from_compute_api(project_id, service):
    resource_count = []
    try:
        request = service.images().list(project=project_id, filter="status eq 'READY'")
        while request is not None:
            response = request.execute()
            if 'items' in response:
                for image in response['items']:
                    resource_count.append(image['name'])
            # Fetch the next page of results
            request = service.images().list_next(previous_request=request, previous_response=response)
    except Exception as e:
        print("Error : " + str(e))
        return None
    return len(resource_count)

Now comes the final and most interesting part: the instance and vCPU count. Instances themselves are listed with the compute.instances().aggregatedList() method, but to translate each instance’s machine type into a vCPU count we also need the following API.

https://cloud.google.com/compute/docs/reference/rest/v1/machineTypes/aggregatedList

GET https://compute.googleapis.com/compute/v1/projects/{project}/aggregated/machineTypes

If you go through the response payload carefully, you will notice that there is no field in the compute.instances().aggregatedList() response that directly gives the vCPU count. What we can do is retrieve the machine type from that method and then do a lookup against compute.machineTypes().aggregatedList().

Note that this API returns all the predefined machine types but not custom ones. For now, let’s store this information in a dictionary and use it during our lookup.

In case there is a custom machine type in any of the projects, we will not find a match in the dictionary. In that case we can take the instance’s zone and make another API call to the zone-scoped list (or get) method instead of the aggregatedList method.

https://cloud.google.com/compute/docs/reference/rest/v1/machineTypes/list

GET https://compute.googleapis.com/compute/v1/projects/{project}/zones/{zone}/machineTypes

So here is the lookup function:

# Build a lookup of {zone_machineType: [guestCpus, memoryMb]} for all predefined machine types
def machine_type_lookup(project_id, compute):
    mt_lookup = {}
    try:
        request = compute.machineTypes().aggregatedList(project=project_id)
        while request is not None:
            response = request.execute()
            for name, machine_types_scoped_list in response['items'].items():
                try:
                    for machine_type in machine_types_scoped_list.get('machineTypes', []):
                        mt_lookup[machine_type['zone'] + '_' + machine_type['name']] = [
                            machine_type['guestCpus'], machine_type['memoryMb']]
                except Exception as e:
                    print("Error from machine_types_scoped_list.get : " + str(e))
            # Fetch the next page of results
            request = compute.machineTypes().aggregatedList_next(previous_request=request, previous_response=response)
        return mt_lookup
    except Exception as e:
        print("Error from machine type lookup : " + str(e))
        return mt_lookup

Get Instance and VCPU Count:

# Get the count of RUNNING instances and the total vCPUs they consume
def get_instance_vcpu_count_from_compute_api(project_id, compute, mt_lookup):
    try:
        data_instance = []
        data_vcpu = 0
        request = compute.instances().aggregatedList(project=project_id, filter="status eq 'RUNNING'")
        while request is not None:
            response = request.execute()
            for zone_name, instances in response.get('items', {}).items():
                for instance in instances.get('instances', []):
                    machine_type_name = instance['machineType'].split("/")[-1]
                    zone = zone_name.split("/")[-1]
                    lookup_key = zone + '_' + machine_type_name
                    data_instance.append(instance['name'])
                    if mt_lookup is not None and lookup_key in mt_lookup:
                        data_vcpu += mt_lookup[lookup_key][0]
                    else:
                        # Custom machine types are not in the lookup; fall back to a zone-scoped get call
                        try:
                            data_vcpu += compute.machineTypes().get(
                                project=project_id, zone=zone,
                                machineType=machine_type_name).execute()['guestCpus']
                        except Exception as e:
                            print("Error : " + str(e))
                            data_vcpu = None
            # Fetch the next page of instances
            request = compute.instances().aggregatedList_next(previous_request=request, previous_response=response)
        return [len(data_instance), data_vcpu]
    except Exception as e:
        print("Error : " + str(e))
        return None

Finally, write the consolidated counts into the BigQuery table.

# Write all the collected records into the BigQuery table
TABLE_ID = "google-dev-resource-monitoring-1122.monitoring.resource_metrics"

def write_to_bq(record_of_resource_details):
    if len(record_of_resource_details) > 0:
        credentials = service_account.Credentials.from_service_account_file(json_file)
        client = bigquery.Client(credentials=credentials, project="google-dev-resource-monitoring-1122")
        errors = client.insert_rows_json(TABLE_ID, record_of_resource_details)  # Make an API request.
        if errors == []:
            print(str(len(record_of_resource_details)) + " rows have been added to " + TABLE_ID + " table")
        else:
            print("Encountered errors while inserting rows into resource_metrics table: {}".format(errors))
    else:
        print('No record available to write into resource_metrics table')
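
To tie everything together, here is a minimal driver sketch that loops over the active project ids, calls the metric functions above, and builds one row per metric for write_to_bq. The collect_metrics_for_project helper and the exact metric names are illustrative, not part of the original codebase.

# Illustrative driver: one row per metric per project, written in a single batch.
import datetime

def collect_metrics_for_project(project_id, compute):
    mt_lookup = machine_type_lookup(project_id, compute)
    instance_vcpu = get_instance_vcpu_count_from_compute_api(project_id, compute, mt_lookup)
    instance_count, vcpu_count = instance_vcpu if instance_vcpu else (None, None)
    metrics = {
        "instance_count": instance_count,
        "vcpu_count": vcpu_count,
        "snapshot_count": get_snapshot_count_from_compute_api(project_id, compute),
        "disk_count": get_disk_count_from_compute_api(project_id, compute),
        "image_count": get_image_count_from_compute_api(project_id, compute),
    }
    now = datetime.datetime.utcnow().isoformat()
    return [{"project_id": project_id,
             "metric_name": name,
             "metric_value": value,
             "last_inserted": now}
            for name, value in metrics.items() if value is not None]

def main():
    credentials = service_account.Credentials.from_service_account_file(json_file)
    compute = discovery.build('compute', 'v1', credentials=credentials, cache_discovery=False)
    records = []
    for row in get_project_ids():
        records.extend(collect_metrics_for_project(row.project_id, compute))
    write_to_bq(records)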

Here is how the final output would look. You can expand it further to create resource-specific tables and write more attributes for your analysis.
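
A quick way to inspect the collected numbers is to query the resource_metrics table directly. The snippet below is only an illustration (it relies on application-default credentials and today's partition); adapt it to your dataset.

# Illustrative query to inspect today's collected metrics per project
from google.cloud import bigquery

client = bigquery.Client(project="google-dev-resource-monitoring-1122")
query = """
    SELECT project_id, metric_name, metric_value
    FROM `google-dev-resource-monitoring-1122.monitoring.resource_metrics`
    WHERE DATE(last_inserted) = CURRENT_DATE()
    ORDER BY project_id, metric_name
"""
for row in client.query(query):
    print(row.project_id, row.metric_name, row.metric_value)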

To understand these numbers better, we can create a few tiles in Looker. See this post from Avani Vyas for detailed information on building a Looker dashboard.

There is still scope to optimize the codebase so that the per-project calls run in parallel; I just don’t want you to hit the API limits on your first try. Once you are familiar with the codebase, you can enhance it further and implement your own logic based on your requirements.

Note that these are point-in-time numbers that will change over time, and there are quotas and costs associated with the number of API calls made in a day. Make sure you stay within your quota, and avoid exhausting it, since that may cause failures in other jobs.

Scheduling:

Once the program is ready, we can schedule it. There are multiple scheduling options available; some of them are outlined below.

  • Composer:
    - Cloud Composer is a powerful tool for scheduling and automating workflows in GCP. With features such as task dependency management, dynamic task generation, and failure handling, a Composer (Airflow) DAG enables you to build complex workflows that run at scale.
  • Cloud Scheduler:
    - Cloud Scheduler is a fully managed service that allows you to schedule jobs that invoke HTTP or Pub/Sub endpoints on a flexible schedule.
  • Cloud Functions:
    - By using Cloud Functions, you can trigger code in response to events that occur in your GCP environment, enabling you to automate tasks such as data processing, image analysis, and more.

There are many such services that you can use for scheduling and orchestration depending on your business use cases. I preferred to run them in Airflow with 4 tasks in parallel. This can be run as many times as you want in a day.
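
For reference, here is a minimal sketch of what such an Airflow DAG could look like. The resource_metrics module, the collect_and_load wrapper (which would build the Compute API client, collect the metrics for one project, and write them with write_to_bq), and the PROJECT_IDS list are all illustrative names, not part of the original codebase; max_active_tasks caps parallelism in Airflow 2 (older versions call this concurrency).

# Illustrative Airflow DAG: one task per project id, at most 4 running in parallel.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module packaging the functions shown earlier in this post
from resource_metrics import collect_and_load

PROJECT_IDS = ["project-a", "project-b", "project-c"]  # or load from the config table

with DAG(
    dag_id="resource_metrics_collection",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_tasks=4,  # limit parallelism to stay within API quotas
) as dag:
    for project_id in PROJECT_IDS:
        PythonOperator(
            task_id="collect_" + project_id.replace("-", "_"),
            python_callable=collect_and_load,
            op_kwargs={"project_id": project_id},
        )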

Conclusion:

In conclusion, the Compute Engine API is a powerful tool that enables you to manage and monitor your GCP compute resources at scale. With the ability to obtain instance and resource metrics programmatically, you can automate many of the repetitive tasks associated with managing your infrastructure, freeing up your time to focus on higher-level activities.

From monitoring instance uptime and CPU utilization to tracking costs and ensuring compliance, the Compute Engine API provides a wide range of capabilities that can help you optimize the performance, security, and efficiency of your GCP environment.

In addition to the Compute Engine API, Google Cloud Platform offers a wide range of other powerful services that can help you manage and monitor your GCP infrastructure. Cloud Monitoring, for example, provides a comprehensive monitoring solution that enables you to track the performance and availability of your cloud resources in real-time. By leveraging features such as customizable dashboards, alerting, and metrics analysis, you can gain valuable insights into your GCP environment and identify potential issues before they impact your operations. Other services, such as Cloud Logging and Cloud Trace, provide additional capabilities for logging, debugging, and tracing your applications in the cloud.

By combining these services with the Compute Engine API, you can really create a powerful suite of tools for managing and optimizing your GCP environment, and ensure that your infrastructure is running at peak performance at all times.

Special thanks to Shreya Shrivastava for taking the time to review this blog and sharing her thoughtful feedback to ensure readers get the best out of it.
