Creating Your Kubernetes Cluster in Google Cloud Platform Using Service Account
You have seen how to secure your Kubernetes (K8S) cluster in the Google Cloud Platform (GCP) by creating a private K8S cluster, along with its associated management infrastructure, using Terraform scripts. But who should run the Terraform scripts to create the infrastructure? The GCP project owner? No way. The project owner is far too powerful, and any bug in the Terraform script can cause unwanted side effects, some of which can be catastrophic. It is better to use a service account in the GCP project possessing just the necessary and sufficient permissions to run the Terraform scripts that set up the K8S cluster and its helper systems.
Infrastructure as Code and SDLC
It may sound as if something is wrong with the title of this section; it may even taste like an oxymoron. The fact is, Terraform helps you build infrastructure using code, and every piece of code that is written should follow some form of software development lifecycle (SDLC). This means you will be writing code, running code and testing code to make sure that it produces the desired effect. In this context, the desired effect is the successful creation of the K8S infrastructure. During the development or exploratory phase, you have to execute the Terraform scripts as a user with powerful roles attached. After a few iterations, once you are happy with the results, it is time to create a service account in GCP and make sure that you can reproduce the same results by running the scripts as that service account.
Any software system will have a set of user personas associated with it. In the UML modeling paradigm, they are also called Actors. Many times, these user personas are human equivalents such as infrastructure administrator, database administrator, file archiver and so on. In this context, assume that the user persona doing all this K8S cluster creation is the infrastructure administrator.
Infrastructure Administrator Service Account
In a typical organization, the roles and responsibilities of an infrastructure administrator change almost daily. When it comes to systems, you cannot have that kind of loosely defined set of roles and responsibilities. On the other hand, you cannot give the infrastructure administrator every possible entitlement in the world either.
In GCP, for doing any kind of activity on any resource, you need to have the required permissions. A collection of permissions can be grouped together to form a role. You can bind one or more roles to an identity. An identity in GCP can be a human being, a service account or a group consisting of multiple individuals and/or service accounts.
You create a service account to represent the infrastructure administrator, with a name such as rajtmana-infra-admin. You can do this by going to the GCP Console option IAM & admin -> Service accounts and clicking CREATE SERVICE ACCOUNT. At some point in the future, as your Terraform scripting matures, you can also create service accounts using Terraform scripts. With the service account created using the GCP console, this article takes a trial and error approach to arrive at the permissions required for the rajtmana-infra-admin service account.
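If you prefer the command line to the console, the same service account can be created with the gcloud CLI. This is a sketch only; the project ID and display name below are placeholders you would replace with your own values.

```shell
# Hypothetical project ID; replace with your own.
PROJECT_ID="yyyy-xxxx"
SA_NAME="rajtmana-infra-admin"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Create the service account (equivalent to the console steps above).
# Guarded so the snippet is harmless on a machine without gcloud.
if command -v gcloud >/dev/null 2>&1; then
  gcloud iam service-accounts create "${SA_NAME}" \
    --project "${PROJECT_ID}" \
    --display-name "Infrastructure administrator"
fi
echo "${SA_EMAIL}"
```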
When you run the Terraform scripts, you need to use a security key as described in the README file of the code base. This time, download the security key of the newly created service account to the machine from which you run the Terraform scripts. You are now ready to run the Terraform scripts using this service account.
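One common way to wire this up (an assumption here, since the README referenced above is not shown) is to download the key with gcloud and point the Google provider at it via the standard GOOGLE_APPLICATION_CREDENTIALS environment variable. The key path below is a placeholder.

```shell
# Hypothetical key path; choose a location with restricted access.
KEY_FILE="${HOME}/infra-admin-key.json"

# Download a key for the new service account (console: Keys -> Add key works too).
if command -v gcloud >/dev/null 2>&1; then
  gcloud iam service-accounts keys create "${KEY_FILE}" \
    --iam-account "rajtmana-infra-admin@yyyy-xxxx.iam.gserviceaccount.com"
fi

# Terraform's Google provider picks this variable up automatically.
export GOOGLE_APPLICATION_CREDENTIALS="${KEY_FILE}"
```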
Binding Roles to Service Account
At this moment you don't know the full set of permissions needed to build the entire infrastructure coded up in these Terraform scripts: the complete K8S setup, including the associated machinery required for the proper functioning of the K8S cluster. The service account rajtmana-infra-admin has already been created. Once more, make sure that the key for this service account is downloaded to the system from which the Terraform scripts run, and follow the instructions to make sure that the Terraform scripts use this key. This way, you can be certain that the Terraform scripts are being executed as this service account. Now you can start the trials to identify the permissions.
Rajanarayanans-MacBook-Pro:k8s-cluster RajT$ terraform apply
  auto_create_subnetworks: "" => "false"
  gateway_ipv4:            "" => "<computed>"
  name:                    "" => "mservice-network"
  project:                 "" => "<computed>"
  routing_mode:            "" => "<computed>"
  self_link:               "" => "<computed>"
Error: Error applying plan:
1 error(s) occurred:
* google_compute_network.mservice_network: 1 error(s) occurred:
* google_compute_network.mservice_network: Error creating network: googleapi: Error 403: Required 'compute.networks.create' permission for 'projects/YYYY-XXXX/global/networks/mservice-network', forbidden
The first trial gives you an error saying that the service account doesn't have the compute.networks.create permission. By going to the GCP Console option IAM & admin -> Roles, you can create a new role, say mservice_admin, as shown in the following picture, and add the above permission to this custom role.
As a next step, bind this role to the service account rajtmana-infra-admin by going to the GCP Console option IAM & admin -> IAM. Now you can keep adding the required permissions to this role. Repeat the following steps until the Terraform scripts run successfully and the K8S cluster is running fine.
1. Run the Terraform scripts
2. Look at the error message and add the required permission(s) to the mservice_admin role
3. Look at the error message and add the required role(s) to the rajtmana-infra-admin service account if the error message is asking you to add a role
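The console steps for creating the custom role and binding it can also be scripted, which speeds up the iteration loop. A sketch, assuming the gcloud CLI and a placeholder project ID; the second permission added is just an example of the kind that surfaces in later iterations.

```shell
PROJECT_ID="yyyy-xxxx"   # hypothetical
SA_EMAIL="rajtmana-infra-admin@${PROJECT_ID}.iam.gserviceaccount.com"

if command -v gcloud >/dev/null 2>&1; then
  # Step 2 (first iteration): create the custom role with the missing permission.
  gcloud iam roles create mservice_admin --project "${PROJECT_ID}" \
    --title "mservice_admin" \
    --permissions compute.networks.create

  # Step 2 (later iterations): add further permissions as errors surface.
  gcloud iam roles update mservice_admin --project "${PROJECT_ID}" \
    --add-permissions compute.subnetworks.create

  # Bind the custom role to the service account (console: IAM & admin -> IAM).
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member "serviceAccount:${SA_EMAIL}" \
    --role "projects/${PROJECT_ID}/roles/mservice_admin"
fi
```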
You may find this a very laborious process and be tempted to use custom roles with lots of additional permissions. But this is a great way to understand the bare minimum permissions required to create a meaningful K8S cluster. When you are building mission-critical infrastructure, the need to drill down to the minutest level of permissions is a no-brainer. One word of caution: make sure that this kind of custom role does not proliferate as your infrastructure grows. Instead of putting strictly the required permissions into a single role, you may also create multiple roles for easier management. For example, group all the Compute Engine related permissions into one role and all the container related permissions into another.
This iterative process completes when the Terraform scripts run successfully and all the infrastructure pieces are ready to use. At this stage, you want to take a step back and automate the service account creation, role creation, adding permissions to the role, and binding the roles to the service account using Terraform scripts. Automation using Terraform will help you build the infrastructure repeatedly for higher environments such as production.
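That automation could look roughly like the following Terraform sketch. The resource names and the variables holding the project ID and the discovered permission list are assumptions, not part of the original scripts.

```hcl
# Sketch only; var.project_id and var.mservice_admin_permissions are assumed inputs.
resource "google_service_account" "infra_admin" {
  account_id   = "rajtmana-infra-admin"
  display_name = "Infrastructure administrator"
}

resource "google_project_iam_custom_role" "mservice_admin" {
  role_id     = "mservice_admin"
  title       = "mservice_admin"
  permissions = var.mservice_admin_permissions  # the list found by trial and error
}

resource "google_project_iam_member" "infra_admin_mservice" {
  project = var.project_id
  role    = google_project_iam_custom_role.mservice_admin.id
  member  = "serviceAccount:${google_service_account.infra_admin.email}"
}
```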
Sometimes, you will get an error message saying "Ask a project owner to grant you the iam.serviceAccountUser role on the service account". In this case, add the corresponding canned role (Service Account User) to the service account.
Sometimes you will get an error message saying "google_container_cluster.mservice: googleapi: Error 409: Already exists: projects/YYYY-XXXX/locations/europe-west2/clusters/mservice-dev-cluster., alreadyExists". In this case, you may have to delete the cluster manually from the GCP console and run the Terraform scripts once again. This makes sure that the resource being created is in a consistent state.
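Instead of the console, the stale cluster can also be removed from the command line. The cluster name and region below are taken from the 409 message above; treat them as placeholders for your own setup.

```shell
CLUSTER="mservice-dev-cluster"
REGION="europe-west2"

# Delete the half-created cluster so the next terraform apply starts clean.
if command -v gcloud >/dev/null 2>&1; then
  gcloud container clusters delete "${CLUSTER}" \
    --region "${REGION}" --quiet
fi
```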
Sometimes you will get an error message saying "google_compute_network.mservice_network: Error waiting for Deleting Network: The network resource 'projects/protean-217618/global/networks/mservice-network' is already being used by 'projects/YYYY-XXXX/global/firewalls/k8s-b539c081e55719da-node-http-hc'". In this case, you may have to look at the firewall rules in the custom VPC that are still hanging around. Terraform does not do a good job of identifying all the "under the hood" resources that get created, and hence they are not destroyed by Terraform.
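One way to clean up those leftover firewall rules is to list everything still attached to the custom VPC and delete it before retrying the network deletion. A sketch, assuming the gcloud CLI and the network name from the error above:

```shell
NETWORK="mservice-network"

if command -v gcloud >/dev/null 2>&1; then
  # List firewall rules still referencing the custom VPC...
  gcloud compute firewall-rules list \
    --filter "network:${NETWORK}" --format "value(name)" |
  while read -r rule; do
    # ...and delete each one, so the network itself can then be removed.
    gcloud compute firewall-rules delete "${rule}" --quiet
  done
fi
```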
SSH into Bastion Host Using Service Account
For any kind of K8S cluster management, you need to make sure that you are using the rajtmana-infra-admin service account. This includes SSH into the bastion host as well. K8S cluster management must not be confused with service deployments into the K8S cluster and similar activities; those require another user persona, which is out of the scope of this article.
The Google Cloud Shell is a safe place from which to SSH into the bastion host. To open the Google Cloud Shell, you should be a human being with the least privilege from a systems perspective, but you have the service account key with you (you might be the person who executed all the Terraform scripts). Once you log in to your Google Cloud Shell, upload the service account key to a directory and make sure that only YOU have read access to that file (no permission bits set for your group and others). Use the following command line from your Google Cloud Shell to SSH into the bastion host. It will ask you to enter a passphrase before connecting; make sure that you use a very strong passphrase.
$ gcloud compute ssh rajtmana-infra-admin@mservice-bastion --force-key-file-overwrite --ssh-key-file ~/infra.json --zone europe-west2-a
Warning: Permanently added 'compute.8821967775694683449' (ECDSA) to the list of known hosts.
Enter passphrase for key '/home/xyz/infra.json':
Linux mservice-bastion 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
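The key-file permissions step mentioned above can be done like this (the key path matches the SSH command; the touch is only a stand-in for the uploaded file):

```shell
KEY_FILE="${HOME}/infra.json"
touch "${KEY_FILE}"        # stand-in for the key you uploaded to Cloud Shell
chmod 600 "${KEY_FILE}"    # read/write for you, nothing for group and others
ls -l "${KEY_FILE}"
```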
You might recall that you, the very powerful human being, manually deleted the K8S cluster while doing the trial and error exercise of finding the required permissions for the custom role. You might think that everything is good. But once you have the infrastructure created, at some point you will have to destroy it as well. To test that, go ahead and do a terraform destroy and see whether it goes through fine. The command will almost certainly NOT complete properly, as the role doesn't have the required permissions to destroy some of the infrastructure that you have already created. So continue your exploration of the permissions required to destroy the infrastructure, and keep adding them until the infrastructure is destroyed completely and successfully. Before automating the creation of the service account, roles and so on, run the scripts multiple times to make sure that both creation and destruction of the K8S cluster work fine without any changes to the permissions and roles.
List of Permissions and Roles
The author has gone through the trial and error method to get the K8S cluster setup described by these Terraform scripts done end to end, using the approach described in the previous sections. It turns out that the following roles are to be attached to the service account rajtmana-infra-admin: 1) mservice_admin, and 2) Service Account User. The mservice_admin role has the following permissions attached to it.
This article focused on setting up a service account for creating the K8S infrastructure, identifying the permissions and roles required to create the infrastructure, and finally binding the roles to the service account. It also covered using the service account key to SSH into the bastion host for the upkeep and management of your K8S cluster. The security key of the service account must be managed extremely well, and a war-tested key management system should be used for that.
Just like you develop pieces of software in a modular fashion, you can build your infrastructure as well in a modular way as you are writing code to define your infrastructure.