MLOps on Red Hat with Terraform, Ansible, and Podman

Enabling Machine Learning Infrastructure on Enterprise Servers

Mathieu Lemay
8 min read · Jun 19, 2022

So you want to do machine learning on Red Hat. Good for you, you brave soul. Maybe it’s because you work in a bank, or maybe it’s because you enjoy pain. Likely both.

Photo by Jishnu Viswanath from Pexels.com

Overview

In this example, we will focus on the automated provisioning of a Red Hat Enterprise Linux (RHEL) GPU instance and on automatically installing machine learning libraries through Podman, acting as a drop-in Docker replacement. Any containerized MLOps toolchain requires GPU access from within a container, which is not trivial on RHEL.

The goal is simple: automatically provision a GPU-enabled instance capable of running TensorFlow code, on a Red Hat instance, from within a container in an enterprise-grade environment.

Once you’re at a point of running GPU-enabled containers, many of the standard data science packages (like the simple tensorflow-gpu Docker container that we’ll use here) will run just the same.

Many MLOps tools and model deployments can run as containers. Once you can deploy your inference pipeline, you’re set. Most of the issues we see in our consulting work are about getting teams to the point of actually enabling that work.

The GitHub repo for this article is available here.

Background

Red Hat Enterprise Linux (RHEL), in its present form, is a security-first, enterprise-focused Linux distro. One of the oldest surviving Linux distro lineages, it has a lot of history and a proven track record.

Where it differentiates itself from Ubuntu (the most familiar Linux distro for budding data scientists and machine learning engineers) is in its additional native security features through SELinux. Additionally, its enterprise focus means roughly ten years of package support, as opposed to five years of standard support for Ubuntu LTS.

Due to its enterprise focus, a lot of OS-specific issues arise when trying to run cutting-edge libraries. As listed in the notes at the end of this article, there are many version support gaps, mainly artifacts of a smaller user pool and more stringent reliability requirements: vendors would rather not support a specific version at all than support it poorly. The best chance we have of keeping our codebase up to date in this environment is containerization, so that any deployment can be fairly standardized.

The Tool Stack

To support the provisioning of our infrastructure, we will use the following tools:

  • Amazon Web Services. This is just our preferred cloud provider, since we’ll be using its dirt-cheap g4dn instances. The Terraform configuration can easily be adjusted to provision Google Compute Engine instances or Azure Virtual Machines instead.
  • Ansible is an IT automation tool that supports software provisioning, configuration management, and application deployment. (It is also, fittingly, owned by Red Hat.)
  • Podman. Launched as a drop-in Docker replacement, its reason for being is security: it runs without root privileges and without exposing root permissions.
  • Red Hat Enterprise Linux. Big Red is the star of the show here, acting as the preferred operating system for mission-critical infrastructure in banks.
  • Terraform. Terraform is an infrastructure provisioning automation tool (“Infrastructure as Code”) that turns the usually tedious provisioning activities into deployment scripts.

What You’ll Need

  • An AWS account with credentials enabled for API provisioning, and an SSH key pair registered in your account. (If you wish to run this on an existing VM instead, you can skip the Terraform steps entirely. Please make sure that you’ve enabled GPU passthrough, however.)
  • A Linux host. (WSL works great for running everything from Windows.)
  • Both Terraform and Ansible downloaded and installed.
  • After downloading the repository, create a terraform.tfvars file with the following content:
shared_config_files      = ["<path to your AWS config file, sometimes blank>"]
shared_credentials_files = ["<path to your AWS credentials file with key ID and secret>"]
key_name                 = "<name of the SSH key pair on AWS>"
private_key              = "<path to the PEM file>"
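
For context, these variables typically feed the AWS provider block along these lines (a sketch; the repo's actual wiring may differ, and us-east-1 is assumed from the US East region mentioned later):

provider "aws" {
  region                   = "us-east-1"  # assumed; the article works in the US East region
  shared_config_files      = var.shared_config_files
  shared_credentials_files = var.shared_credentials_files
}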

From there, we’re off to the races.

Execution Overview

An overview of the related project GitHub repo.

The following steps take place to execute the provisioning of the ML-ready RHEL instance:

  1. Terraform provisions a GPU server on AWS and records its address in the hosts.cfg file.
  2. Ansible then reads the hosts.cfg file and runs the playbook.yml file against the defined hosts.
  3. Once the server is fully configured, we can use Podman to download the TensorFlow container and run GPU-enabled code.

Terraform

Let’s take a look at the Terraform provisioning file.

The goal is to automatically select the Red Hat 8.5 AMI from the US East region and deploy it on a g4dn.xlarge instance. This instance type provides an NVIDIA T4, plenty of GPU power for this exercise.

Once we connect to it (for the added option of running scripts directly from Terraform), we save its IP to an Ansible template and prepare for the next step.
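
The full configuration lives in the repo; a trimmed sketch of the key pieces looks roughly like this (the AMI lookup, owner ID, and template path are illustrative guesses, not copies of the repo's code):

# Look up the RHEL 8.5 AMI in us-east-1 (owner ID assumed to be Red Hat's AWS account)
data "aws_ami" "rhel_8_5" {
  most_recent = true
  owners      = ["309956199498"]

  filter {
    name   = "name"
    values = ["RHEL-8.5*_HVM-*-x86_64-*"]
  }
}

# One g4dn.xlarge instance: a single NVIDIA T4
resource "aws_instance" "new_rhel" {
  ami           = data.aws_ami.rhel_8_5.id
  instance_type = "g4dn.xlarge"
  key_name      = var.key_name
}

# Render the instance's public IP into the inventory file Ansible will read
resource "local_file" "hosts_cfg" {
  filename = "Ansible/hosts.cfg"
  content  = templatefile("hosts.tpl", { public_ip = aws_instance.new_rhel.public_ip })
}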

Here are the basic Terraform commands:

# Initial commands
./terraform init # Initializes the environment
./terraform plan # Checks that the configuration is valid
./terraform apply # Provisions the instance
# Termination commands
./terraform destroy # Ends all of the provisioned instances

It’s a really good mechanism for quickly testing infrastructure, especially with destroy as an easy option to terminate all of the running instances that were created by the script.

If you execute init, plan, and apply, you will see a large configuration output and a prompt to execute the plan:

$ ./terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.new_rhel will be created
  + resource "aws_instance" "new_rhel" {
      + ami                         = "ami-03c1de4e158cf48d4"
      + arn                         = (known after apply)
      + associate_public_ip_address = (known after apply)

[...]

Plan: 2 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + instance_id        = (known after apply)
  + instance_public_ip = (known after apply)

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_instance.new_rhel: Creating...
aws_instance.new_rhel: Still creating... [10s elapsed]
aws_instance.new_rhel: Creation complete after 14s [id=i-0b1c83c65d1dc00e7]
local_file.hosts_cfg: Creating...
local_file.hosts_cfg: Provisioning with 'local-exec'...
[...]

The script creates an AWS instance and a hosts.cfg file, which is why the plan reports two resources to add.

You now have a running GPU instance.

Ansible

The hosts.cfg file has now been populated, so Ansible can do its thing. Within the playbook, all of the libraries to be installed and all of the complex tasks that need to be accomplished can be automated and centralized, so we don’t need to worry about whether or not we completed the steps in the right order.

Here’s the playbook.yml file:
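
(The full playbook is in the repo; as a rough sketch of its shape, assuming RHEL 8.5 on EC2 with the ec2-user account, it does something along these lines. The repository URLs and package names below are illustrative, not copied from the repo.)

- hosts: all
  remote_user: ec2-user
  become: yes
  tasks:
    # Point dnf at NVIDIA's CUDA repository for RHEL 8 (URL is illustrative)
    - name: Add the CUDA repository
      shell: dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

    # Driver first, then a reboot so the T4 is detected
    - name: Install the NVIDIA driver
      shell: dnf module install -y nvidia-driver:latest-dkms
    - name: Reboot after the driver install
      reboot:

    # Podman and the rest of the container tooling
    - name: Install the container tools
      shell: dnf module install -y container-tools

    # The runtime hook that lets rootless containers reach the GPU
    - name: Add the libnvidia-container repository (8.5-specific, see notes)
      shell: curl -s -L https://nvidia.github.io/libnvidia-container/rhel8.5/libnvidia-container.repo -o /etc/yum.repos.d/libnvidia-container.repo
    - name: Install the NVIDIA container toolkit
      shell: dnf install -y nvidia-container-toolkit
    - name: Final reboot before handing off to Podman
      reboot: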

What we see are a few things:

  • There is a sequential installation process: first the NVIDIA and CUDA drivers, then the container tools, then the NVIDIA container runtime. (The simplified NVIDIA driver installation process was adapted from Darryl Dias’s article.)
  • There are a lot of reboots. This may warrant investigation, but I found that a reboot does wonders to ensure that the graphics card gets detected at every step.
  • It’s a bit more elaborate than a typical driver + Docker + runtime install on Ubuntu, but that seems to be due to thinner documentation rather than to inherent complexity.
  • Rather than having Terraform run Ansible directly through a local-exec command, I separated the two for readability and easier troubleshooting. (The repo still includes the local-exec invocation of Ansible.)

To run this playbook, use the following command:

ansible-playbook \
  --private-key <YOUR PRIVATE KEY> \
  --inventory Ansible/hosts.cfg \
  -e 'ansible_python_interpreter=/usr/bin/python3' \
  Ansible/playbook.yml

The playbook will connect to the machine and start executing every command, line by line:

We have now installed all of the libraries that we need to run GPU-enabled containers.

Podman

Now that Ansible has configured the machine with all of the NVIDIA libraries and the container engine, we can connect to the server and make sure that we can see the GPU with nvidia-smi.
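
Assuming the ec2-user account and the key pair used for provisioning (the key path and address below are placeholders), that check looks like this:

# SSH in with the same key the provisioning used
ssh -i <path to PEM file> ec2-user@<instance public IP>
# Once on the instance, confirm the driver sees the T4
nvidia-smi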

A nice nvidia-smi output showing T4 recognition.

This output means that we’ve successfully provisioned a GPU instance, installed all of the NVIDIA libraries, and configured it so that a non-root user can see the GPU.

We’ll use Podman to pull the TensorFlow GPU container from Docker Hub and run a Python scriptlet to confirm that the GPU is visible from within the container:

podman run \
  --security-opt=label=disable \
  --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  tensorflow/tensorflow:latest-gpu \
  python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The first thing you’ll notice is that Podman asks you to choose the image source from multiple registries. Just pick your preferred provider; Docker Hub is among the options.
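
If you’d rather not be prompted at all, fully qualifying the image name points Podman at a specific registry up front; with Docker Hub, for example:

# Pull explicitly from Docker Hub instead of answering the registry prompt
podman pull docker.io/tensorflow/tensorflow:latest-gpu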

Once your preferred source is selected, Podman downloads the image layers, runs the container, executes the scriptlet within it, and shows us that we have full GPU access:

This output shows that TensorFlow-GPU has successfully detected the GPU from inside the container. Training and inference are now possible as you would do on any other machine, and your MLOps stack will work just as well.

At this point, we have achieved the objective of running GPU-enabled machine learning code from within a container in an enterprise environment.

Notes

  • There are a few things worth mentioning about version selection at the time of writing. NVIDIA does not support container-based installations on RHEL 9. Also, RHEL 8.6 breaks when trying to access the libnvidia-container repo. 8.5 works, though, so that’s the version we focus on.
  • Although we’re deploying to a Cloud instance, we ran the Ansible management script on a local VM and it ran just as well.
  • AWS no longer allows root SSH connections. This means that with Ansible, we need to connect as ec2-user and then escalate privileges on every command (i.e., run it with sudo). That’s the role of become: yes.
  • Splitting credentials out into a terraform.tfvars file is not strictly necessary, but it keeps them cleanly separated from the rest of the configuration.
  • Setting ansible_python_interpreter is required due to a historical Ansible quirk that would otherwise try to run the remote Ansible client under Python 2.7.
  • I sometimes ran into issues with using the native Ansible modules like dnf. Luckily, there’s the God-mode option of running command or shell to drop in complex tasks.
  • The --security-opt=label=disable flag is an aggressive policy and a security risk, but avoiding it would require tying Podman into the organization’s SELinux and IT security policies, which goes beyond the scope of this article.


Mathieu Lemay

Matt Lemay, P.Eng (matt@lemay.ai) is the co-founder of lemay.ai, an international enterprise AI consultancy, and of AuditMap.ai, an internal audit platform.