Managing GCP projects at scale — part 2

How did we build our GCP project factory?

Francois BETREMIEUX
Decathlon Digital
6 min read · Feb 1, 2022



In the last chapter, you learned about the main goal of our GCP Project Factory. Let’s now discover how it works, and why we chose to build it that way.

First, let’s discover the main document for this chapter: the global architecture document.

Global architecture document

We will describe each part of this process, and discuss the technologies used.

Internal web portal

As we explained in our previous chapter, we serve many application teams, across multiple locations. This means that proficiency in technical tasks, and familiarity with the information Decathlon requires, may vary from one user to another.

For this reason, we included the front end of this process in the web portal that gives all Decathlon users access to our cloud services. Users only have to fill in a form with the required information; it is validated on the spot, with immediate feedback to the user, removing all ambiguity.

The API

The API is the first real component of the GCP Project Factory. Its purpose is twofold:

  • First, it validates the information provided by the user against Decathlon’s various reference data sources. This covers the cost center (who pays the bill), the team in charge, the type of project, and so on, making sure the information is valid and the resource exists and is still active.
  • Secondly, the API is responsible for creating the folders necessary for the “design” of our GCP organization.

There are a lot of different users, and some of them need to automate project creation, so having an API is mandatory.

This API is developed in Python, using the Flask framework. We were already familiar with these components and they were relevant for this application as well.
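To make the validation step concrete, here is a minimal sketch in plain Python. The field names and the mocked cost-center referential are assumptions for illustration; in the real API, logic like this sits behind a Flask endpoint and queries Decathlon’s actual reference data sources.

```python
# Sketch of the API's validation step. ACTIVE_COST_CENTERS stands in
# for a real referential lookup; the field names are hypothetical.
ACTIVE_COST_CENTERS = {"CC-1001", "CC-2002"}

REQUIRED_FIELDS = ("project_name", "cost_center", "team", "project_type")

def validate_request(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    # Every required field must be present and non-empty.
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if not payload.get(f)]
    # The cost center must exist and still be active in the referential.
    cc = payload.get("cost_center")
    if cc and cc not in ACTIVE_COST_CENTERS:
        errors.append(f"unknown or inactive cost center: {cc}")
    return errors
```

Returning a list of errors, rather than failing on the first one, is what lets the portal give the user complete feedback in one pass.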

This API runs inside a Docker container, on a shared GKE cluster managed by our containerization team. This provides stability, high availability, connection to our internal network and full compatibility with the other technologies used at Decathlon. At the time, it was the only option that met our needs.

When the API has finished its checks, it sends the full message to the next part of the process: the BUILD.

The BUILD

The BUILD is the part of the process in charge of generating the configuration of the project. Its name takes inspiration from development pipelines.

BUILD execution

By design, the BUILD can only handle one request at a time; the API, however, can send multiple requests in a short period. To avoid losing messages between the API and the BUILD, we added a message queue in the form of a Pub/Sub topic. Messages wait there and trigger the BUILD when it is ready.
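The buffering behavior described above can be sketched with a standard-library queue. This is only an analogy: the real factory uses a Pub/Sub topic and a push-triggered consumer, not an in-process queue, and the message fields are assumptions.

```python
import json
import queue

# Stand-in for the Pub/Sub topic: buffers messages from the API
# until the single BUILD worker is ready to take the next one.
build_queue: "queue.Queue[str]" = queue.Queue()

def api_send(message: dict) -> None:
    """API side: publish a validated message without waiting for BUILD."""
    build_queue.put(json.dumps(message))

def build_consume_one() -> dict:
    """BUILD side: take exactly one message when ready."""
    return json.loads(build_queue.get())

# The API can burst several requests...
for name in ("alpha", "beta", "gamma"):
    api_send({"project_name": name, "action": "create"})

# ...and the BUILD drains them one at a time, in order.
first = build_consume_one()
```

The key property is the same in both cases: producers never block or drop messages, and the single consumer processes them serially.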

The BUILD takes the message sent by the API and checks that all the necessary information is defined. The BUILD also makes sure that there cannot be any ambiguity in the requested action. If a message is sent for a project update, it could be catastrophic if it was understood as a project deletion request!

With all the necessary checks performed, the BUILD takes all the information it was given, and builds a configuration file, using a template. Once this is done, it pushes this configuration file to the next step, inside a GitHub repository. This is why we wanted to have only one BUILD execution at a time.
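The templating step can be sketched with Python’s standard library. The variable names and the Terraform-style output are hypothetical; the factory’s actual templates are internal.

```python
import string

# Hypothetical configuration template; the real template and its
# variable names are internal to the factory.
TEMPLATE = string.Template(
    'project_id  = "$project_id"\n'
    'cost_center = "$cost_center"\n'
    'team        = "$team"\n'
)

def render_config(message: dict) -> str:
    """Render the project's configuration file from the validated message.

    Template.substitute raises KeyError on a missing field, so an
    incomplete message can never produce a partial configuration file.
    """
    return TEMPLATE.substitute(message)

config = render_config(
    {"project_id": "demo-project", "cost_center": "CC-1001", "team": "cloud"}
)
# The real BUILD then commits the rendered file to the GitHub
# repository, which is why only one BUILD may run at a time.
```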

We went with Cloud Functions for the BUILD because they are completely serverless, and we are only billed for execution time: you only provide the code, and it runs when needed. They are perfect for event-driven applications. Not all languages are supported, but the most popular ones are, including Python. You can also cap how many instances run simultaneously, so we set the limit to 1, disabling concurrency in order to avoid any Git merge conflict.

The GitHub repository at the end of the BUILD step is connected to the last stage of the main process: DEPLOY.

DEPLOY

Here we are, at the last leg of our request’s journey. At this point, we just need one more step: applying the generated configuration. For this, we use HashiCorp’s Terraform.

If you work anywhere near infrastructure, you probably already know Terraform. It is an IaC (Infrastructure as Code) tool that allows infrastructure to be deployed and managed using a descriptive language. This lets us define a project configuration with only a few text files. Terraform also stores the state of the deployed configuration.

But how do we apply this Terraform configuration? We use a tool provided by GitHub called GitHub Actions, which allows execution of custom code inside a Linux environment. GitHub Actions workflows are triggered by events inside a GitHub repository (e.g. a push) and execute a workflow file. In our case, the trigger is the push of the configuration files.
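A workflow of this kind could look roughly like the following. This is a sketch, not the factory’s actual workflow: the watched path, the action versions and the Terraform flags are assumptions for illustration.

```yaml
# Hypothetical workflow: runs when BUILD pushes a configuration file.
name: deploy-project
on:
  push:
    paths:
      - "projects/**"   # assumed location of the generated configuration

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Apply the project's Terraform stack
        run: |
          terraform init
          terraform apply -auto-approve
```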

DEPLOY execution

When the GitHub action is launched, it executes the Terraform stack of the project. The stack contains the different modules, representing the components of the configuration (see our first article). It uses the variables provided by the user.

We went with Terraform in this step for several reasons:

  • Google’s provider contains everything we need for the configuration of the project, with no restrictions. Anything possible on GCP can be achieved using Terraform.
  • It is great for managing the evolution of configurations over multiple updates.
  • It works well at the scale of a single resource, in our case one project. If we managed all projects in a single stack, applying it would take a very long time, and any edit to any project would mean reapplying the whole configuration for every project. It would be a nightmare to manage.

GitHub is a SaaS platform, so we do not have to manage the infrastructure ourselves. GitHub Actions is integrated into GitHub by default, so the configuration is stored and applied in the same place.

At the end of DEPLOY, the project is created, configured and ready to use. But how is the user notified? And how do we manage errors if any? For this, let’s dive into a parallel process to the main factory: OPERATIONS.

OPERATIONS

What we call OPERATIONS is a complementary process to the main factory. It monitors the correct execution of the main process. Each key task in the factory reports a failure or success to the OPERATIONS process.

The OPERATIONS process

This allows our team to troubleshoot the entire factory easily. If it receives an error message, OPERATIONS:

  • updates the database with the tasks that failed, and the reason for the failure
  • alerts the cloud team that there was a problem
  • alerts the requester, asking them to wait for the cloud team to solve the issue

When the process completes successfully, OPERATIONS sends a message to the requester to inform them that their GCP project is ready.

For OPERATIONS, we leverage the same technologies as the BUILD part: Pub/Sub and Cloud Functions.
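The routing logic described above can be sketched as follows. The message fields, the in-memory database and the notification lists are assumptions for this sketch; the real OPERATIONS process is a Cloud Function fed by a Pub/Sub topic and backed by an actual database and alerting channels.

```python
import json

# In-memory stand-ins for the factory's database and alert channels.
task_db: dict = {}
notifications: list = []

def handle_status(raw_message: str) -> None:
    """Process a success/failure report from a factory step."""
    report = json.loads(raw_message)
    request_id = report["request_id"]
    step = report["step"]
    if report["status"] == "error":
        # Record which task failed and why, then alert both sides.
        task_db[request_id] = {"failed_step": step, "reason": report.get("reason")}
        notifications.append(f"cloud team: {step} failed for {request_id}")
        notifications.append(f"requester {request_id}: please wait while the cloud team fixes the issue")
    elif step == "DEPLOY":
        # A successful DEPLOY means the whole request is complete.
        notifications.append(f"requester {request_id}: your GCP project is ready")

handle_status(json.dumps({"request_id": "req-1", "step": "DEPLOY", "status": "success"}))
```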

You now have an entire view of the GCP Project Factory that we use daily at Decathlon, and the many technologies that allowed us to make it fully automated. In the next and last chapter of our story, we will look back on its development and usage since it was first deployed.

François Betremieux & Adeline Villette
