Building ML Infrastructure with Terraform
Whether you’re building a commercial application or a personal research stack, sooner or later you’ll have to set up the infrastructure that hosts it. Cloud providers like AWS, GCP, and Azure offer a wide variety of services to meet the needs of almost any application. Designing and provisioning cloud infrastructure takes some experimentation to get right, but by standardizing and automating the process you can make your workflow far more efficient, saving time and, more importantly, money. This is where Infrastructure as Code (IaC) tools such as Terraform come into play.
What is Terraform?
Terraform is an open-source Infrastructure as Code (IaC) framework that enables users to define and provision their infrastructure using a high-level configuration language known as HashiCorp Configuration Language (HCL). Terraform manages resources, such as public cloud infrastructure, by codifying the cloud APIs of different cloud providers into declarative configuration files. In simple terms, it allows you to define your entire infrastructure — including VMs, private networks, buckets, and databases — in a series of configuration files and deploy it with just a few commands. This paradigm allows for the automation of infrastructure deployment, scaling and management across various cloud providers, thereby enhancing efficiency and consistency.
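To give a feel for HCL, here is a minimal, hypothetical configuration that declares a Google Cloud provider and a single storage bucket (the project ID and bucket name are placeholders):
provider "google" {
  // Which project and region Terraform should create resources in
  project = "my-example-project" // placeholder project ID
  region  = "us-central1"
}

resource "google_storage_bucket" "example" {
  // Bucket names must be globally unique
  name     = "my-example-bucket-12345" // placeholder name
  location = "US"
}
Running terraform apply against this file creates the bucket; removing the block and applying again deletes it. Terraform works out the required API calls from the state you declare.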
Building your ML stack
Now, with the basics out of the way, let’s dive into building our infrastructure. Our ML setup focuses on training and evaluating a model on existing data, then saving the trained model artifact for later use. This scenario is agnostic of model type or framework and can be extended to support various deployment patterns, such as batch prediction or a REST API. In this guide we’ll build a simple application that uses PyTorch to train NLP models. We’ll use GCP, but the steps are easy to adapt to AWS or Azure.
Virtual Private Cloud
Our architecture will consist of several different resources that need to communicate with each other. The best way to achieve this is by creating a Virtual Private Cloud (VPC) network and attaching our resources to it. Once the VPC is created, we will reference it by name when connecting other resources. We also need a firewall rule that allows SSH ingress to all resources in the VPC.
resource "google_compute_network" "vpc_network" {
name = "nlp-vpc-network"
}
// Create a firewall rule to allow SSH ingress
resource "google_compute_firewall" "allow_ingress" {
name = "allow-ingress-from-iap"
network = google_compute_network.vpc_network.name
allow {
protocol = "tcp"
ports = ["22"]
}
// Allow incoming connections
source_ranges = ["0.0.0.0/0"]
direction = "INGRESS"
}
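As written, this rule accepts SSH from any address (0.0.0.0/0), which is convenient because the provisioners we define later connect directly to the worker’s public IP. The rule name hints at Identity-Aware Proxy (IAP); if you instead tunnel SSH through IAP (for example with gcloud compute ssh --tunnel-through-iap), you could optionally tighten the rule to Google’s IAP forwarding range:
  // Only allow SSH tunnelled through Identity-Aware Proxy
  source_ranges = ["35.235.240.0/20"]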
Cloud Storage
In our architecture, we use a Google Cloud Storage (GCS) bucket (the equivalent of S3 on AWS) to hold our data and trained model artifacts. Cloud Storage is a cheap way to store large artifacts shared between different parts of your application. The key setting here is to place the bucket in the same location as your other resources, to avoid slow and expensive data transfers. Keep in mind that bucket names must be globally unique, so you may need to adjust the name used below.
resource "google_storage_bucket" "artifact_bucket" {
name = "model-artifact-bucket"
location = "US"
force_destroy = true
uniform_bucket_level_access = true
}
Compute Engine
The heart of our setup is the worker node, responsible for training and evaluating our ML model. In the GCP world, this is a Compute Engine instance (EC2 on AWS). When creating our compute instance, it’s important to place it in a zone within the same location as our storage bucket. We should also keep in mind that different zones have different costs and availability for the same resource types.
Since we will be using this instance to train deep learning models, the bulk of the computation will happen on a GPU. We will therefore use an n1-standard-4 machine type with 4 virtual CPUs and 15 GB of memory, and attach an NVIDIA Tesla T4 GPU, which is a great value-for-money option. We will configure the worker with a Linux image that ships with PyTorch and CUDA 11.3.
resource "google_compute_instance" "gpu_instance" {
name = "training-worker-gpu-instance"
// use variables to configure different values
machine_type = "n1-standard-4"
zone = "us-central1-c"
tags = ["ssh-enabled"]
boot_disk {
initialize_params {
// pytorch image that works with cuda 11.3
image = "deeplearning-platform-release/pytorch-latest-cu113"
type = "pd-ssd"
size = 150
}
}
// add a single nvidia T4 GPU
guest_accelerator {
type = "nvidia-tesla-t4"
count = 1
}
metadata = {
// path to store ssh keys inside the worker
ssh-keys = "gcp:${trimspace(file(var.ssh_file))}"
install-nvidia-driver = true
proxy-mode = "project_editors"
}
scheduling {
// GCP might terminate your instance for maintainance. We don't want it to restart automatically
automatic_restart = false
on_host_maintenance = "TERMINATE"
// Preemptible instances are much cheaper but they might be terminated during a long training session
preemptible = false
}
network_interface {
network = "nlp-vpc-network"
access_config {
// Ephemeral IP
}
}
// give read/write access to cloud storage
service_account {
scopes = ["https://www.googleapis.com/auth/devstorage.read_write"]
}
provisioner "file" {
// We use a provisioner to copy our local ssh key inside the worker. This will be used to authenticate with GitHub.
source = var.ssh_file_private
destination = "/home/gcp/.ssh/id_ed25519"
connection {
type = "ssh"
user = "gcp"
port = 22
private_key = "${file(var.ssh_file_private)}"
host = google_compute_instance.gpu_instance.network_interface[0].access_config[0].nat_ip
}
}
provisioner "remote-exec" {
// Here we define some init steps we want to run when our instance is created
inline = [
"sudo apt-get update",
"mkdir /home/gcp/gcs-bucket", // create mount path for our bucket
"sudo chown gcp: /home/gcp/gcs-bucket",
"sudo gcsfuse -o allow_other -file-mode=777 -dir-mode=777 model-artifact-bucket /home/gcp/gcs-bucket", // mount our bucket
"sudo /opt/deeplearning/install-driver.sh", // install required GPU drivers
"sudo apt-get install -y git",
"chmod 400 /home/gcp/.ssh/id_ed25519", // allow git to use our ssh key
"echo 'Host github.com' >> ~/.ssh/config",
"echo ' StrictHostKeyChecking no' >> ~/.ssh/config",
"git clone ${var.git_ssh_url}", // clone our application repository
"cd ~/${var.git_clone_dir}",
]
connection {
type = "ssh"
user = "gcp"
port = 22
private_key = "${file(var.ssh_file_private)}"
host = google_compute_instance.gpu_instance.network_interface[0].access_config[0].nat_ip
}
}
}
We need to configure a couple of scheduling options:
- automatic_restart: Specify what should happen if GCP terminates our instance for maintenance. In setups that don’t require 100% uptime, we will configure this to “false” to avoid paying for idle instances.
- preemptible: Preemptible instances are much cheaper than standard ones but may be terminated if GCP needs the resources. This option could be problematic for long training sessions that should not be interrupted.
scheduling {
  // GCP might terminate the instance for maintenance; we don't want it to restart automatically
  automatic_restart   = false
  on_host_maintenance = "TERMINATE"
  // Preemptible instances are much cheaper but may be terminated during a long training session
  preemptible = false
}
We also need to attach the worker instance to our VPC network, which is done by referencing the VPC name. Additionally, we add a service account scope that grants read/write access to our GCS bucket.
network_interface {
  network = "nlp-vpc-network"
  access_config {
    // Ephemeral IP
  }
}

// give read/write access to Cloud Storage
service_account {
  scopes = ["https://www.googleapis.com/auth/devstorage.read_write"]
}
Terraform provisioners are a convenient way of defining initialization steps for your instance. Here we use a “remote-exec” provisioner to install the NVIDIA drivers required to operate the GPU and to mount our GCS bucket to a local path. Most projects keep their source code in a GitHub repository, and our worker needs to be able to clone that code. For private repositories, the safest method is an SSH key added to your GitHub account (see GitHub’s documentation on adding SSH keys). After following those steps, you’ll end up with a private key (in my case id_ed25519) on your local machine. We use a “file” provisioner to copy this key into the worker, then configure Git through the “remote-exec” provisioner to use it for authentication.
provisioner "file" {
// We use a provisioner to copy our local ssh key inside the worker. This will be used to authenticate with GitHub.
source = var.ssh_file_private
destination = "/home/gcp/.ssh/id_ed25519"
connection {
type = "ssh"
user = "gcp"
port = 22
private_key = "${file(var.ssh_file_private)}"
host = google_compute_instance.gpu_instance.network_interface[0].access_config[0].nat_ip
}
}
provisioner "remote-exec" {
// Here we define some init steps we want to run when our instance is created
inline = [
"sudo apt-get update",
"mkdir /home/gcp/gcs-bucket", // create mount path for our bucket
"sudo chown gcp: /home/gcp/gcs-bucket",
"sudo gcsfuse -o allow_other -file-mode=777 -dir-mode=777 model-artifact-bucket /home/gcp/gcs-bucket", // mount our bucket
"sudo /opt/deeplearning/install-driver.sh", // install required GPU drivers
"sudo apt-get install -y git",
"chmod 400 /home/gcp/.ssh/id_ed25519", // allow git to use our ssh key
"echo 'Host github.com' >> ~/.ssh/config",
"echo ' StrictHostKeyChecking no' >> ~/.ssh/config",
"git clone ${var.git_ssh_url}", // clone our application repository
"cd ~/${var.git_clone_dir}",
]
connection {
type = "ssh"
user = "gcp"
port = 22
private_key = "${file(var.ssh_file_private)}"
host = google_compute_instance.gpu_instance.network_interface[0].access_config[0].nat_ip
}
}
}
Deployment
Now, with all the individual parts in place, let’s put them together in our deployment script. To keep the script concise, the VPC and worker resources are defined in separate Terraform modules and imported here. It’s best practice to keep sensitive values out of the configuration itself and pass them in as input variables declared in a variables.tf file (a sketch of such a file follows the script below).
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.49.0"
    }
  }
}

provider "google" {
  credentials = file(var.credentials_file)
  project     = var.project
  region      = "us-central1"
  zone        = "us-central1-c"
}

module "vpc_network" {
  source = "./modules/vpc"
}

resource "google_storage_bucket" "artifact_bucket" {
  name                        = "model-artifact-bucket"
  location                    = "US"
  force_destroy               = true
  uniform_bucket_level_access = true
}

module "training_worker" {
  source           = "./modules/worker"
  ssh_file         = var.ssh_file
  ssh_file_private = var.ssh_file_private
  bucket_url       = var.bucket_url
  git_ssh_url      = var.git_ssh_url
  git_clone_dir    = var.git_clone_dir
}
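For reference, a minimal variables.tf consistent with the names used above might look something like this (the descriptions and sensitive flags are suggestions rather than requirements):
variable "project" {
  description = "GCP project ID to deploy into"
  type        = string
}

variable "credentials_file" {
  description = "Path to the service account key used by the provider"
  type        = string
  sensitive   = true
}

variable "ssh_file" {
  description = "Path to the public SSH key added to the worker"
  type        = string
}

variable "ssh_file_private" {
  description = "Path to the private SSH key (also used for GitHub)"
  type        = string
  sensitive   = true
}

variable "bucket_url" {
  description = "Name of the artifact bucket to mount on the worker"
  type        = string
}

variable "git_ssh_url" {
  description = "SSH URL of the application repository"
  type        = string
}

variable "git_clone_dir" {
  description = "Directory name created by git clone"
  type        = string
}
Values for these variables can then be supplied through a terraform.tfvars file or TF_VAR_* environment variables, keeping credentials and key paths out of version control.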
Now it’s time to deploy the infrastructure defined in the script. When we create a new configuration, we start by initializing the working directory with terraform init. This step downloads the providers defined in the configuration. You can check the configuration’s validity with terraform validate, and if it’s valid, deploy it with terraform apply. Terraform will show the infrastructure changes it plans to make and prompt for your approval. If the plan looks good, type ‘yes’ at the confirmation prompt to proceed. It will take a few minutes for Terraform to provision and set up your resources, but once complete, you should be able to SSH into your worker instance and begin development.
Once the resources are created, you can view and manage them from the cloud console. From there, you can start or stop your worker instance as needed. If at some point you want to change your infrastructure, rerun terraform apply and Terraform will adjust existing resources or create new ones as needed. Finally, if you wish to tear everything down and remove all resources, run terraform destroy.
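To recap, the whole lifecycle comes down to a handful of commands run from the directory containing your configuration:
terraform init      # download providers and initialize the working directory
terraform validate  # check the configuration for errors
terraform apply     # review the plan, confirm, and provision the resources
terraform destroy   # tear down everything managed by this configuration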
Conclusion
In this guide, we’ve walked through building ML infrastructure with Terraform, showing how the same approach carries over to cloud platforms like GCP, AWS, and Azure. By leveraging Terraform’s Infrastructure as Code (IaC) capabilities, we’ve demonstrated how to create a scalable, secure, and cost-effective ML setup. From setting up a Virtual Private Cloud to configuring Compute Engine instances with GPUs for ML workloads, the steps outlined offer a solid foundation for anyone looking to train and deploy ML models at scale.
The beauty of using Terraform lies in its ability to automate and standardize the deployment process, making the management of cloud resources more streamlined and less error-prone. Whether you’re working on commercial applications or individual research projects, knowing how to use tools like Terraform effectively can significantly enhance your productivity and the reliability of your ML workflows.
Do you have any questions about setting up your ML infrastructure or need further insights into optimizing your ML projects? Feel free to reach out through the comments or on social media.