Package and deploy your machine learning models to Google Cloud with Cog

Billy Jacobson
Google Cloud - Community
Jul 25, 2024

Cog is an open-source tool that lets you package machine learning models in a standard, production-ready container. I discovered Cog while I was looking for interesting machine learning models, and found a great community of creators and models. I was able to run some simple ones on my laptop, but there were a few beefier models I wanted to try, and as a Google Developer Advocate, I had to see how to run them on Google Cloud. There wasn't any specific documentation for this, aside from a few useful StackOverflow questions, but once I understood the requirements, it was fairly easy to deploy to a variety of compute serving tools.

I was able to deploy some fun generative AI models, like a voice-changing one where I had Squidward sing pop music and an interior design model that let me redecorate the Oval Office for Halloween. I did this using Cog-powered GPU models on Google Cloud, and after reading through this, you will be able to do all this and more.

In this blog I'm going to explore ways to deploy models with Cog to three different Google Cloud compute services: starting with Cloud Run, the highest level of abstraction; then Google Kubernetes Engine (GKE); and finally Compute Engine, the most customizable. I'll walk through the steps, but you should have some familiarity with these services and with containers to get the most out of this blog.

Note that Cog is open-source, but it was created by Replicate, a company that sells production-ready APIs for Cog deployments and hosts a container registry for the community's open-source models.

The Cog framework

Cog is an abstraction over Docker that is optimized for serving machine learning models. You use a cog.yaml file to define your environment; it's a simplified version of what you'd need to include in a Dockerfile to do the same task. You'll also create a predict.py file which defines how predictions are run on your model. You can use Python to do any preprocessing or even combine models. When you're ready to deploy, Cog will generate an OpenAPI schema and validate the inputs and outputs with Pydantic.
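As a concrete illustration, here's a minimal sketch of those two files, modeled on the Hello World example used later in this post (the Python version is an arbitrary choice):

# cog.yaml: declares the environment the model runs in
build:
  python_version: "3.11"
predict: "predict.py:Predictor"

# predict.py: defines how predictions are run
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def predict(self, text: str = Input(description="Who to greet")) -> str:
        # Preprocessing or multi-model logic could go here
        return "hello " + text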

Prerequisites

Cog doesn't have many requirements to get up and running. You'll need to install Cog and Docker on your machine, and since Cog serves predictions on port 5000, you'll need that port open for traffic. I'll walk through how to do this for each deployment method.

Cloud Run

Cloud Run is an easy, serverless way to run containers on Google Cloud.

Push the image

Cloud Run requires your image to be in Docker Hub or in Google Cloud's Artifact Registry.

I’m going to use the Hello World model hosted on the Replicate website. You can find the Docker image URL on the Docker tab of the model’s playground.

To deploy an image directly from Replicate, first pull the image from Replicate's registry, then follow the steps below. Thank you to the StackOverflow post that helped me figure this out.

export IMAGE_URL=r8.im/replicate/hello-world@sha256:5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa
docker pull $IMAGE_URL

You can also use an image you build yourself with cog build -t YOUR-IMAGE and then use YOUR-IMAGE as the IMAGE_URL.

We’ll define a few environment variables to make this process easier.

export PROJECT_ID=your-project-id
export ARTIFACT_URL=gcr.io/$PROJECT_ID/cog-hello-world

Tag and push the image, and you should be set to go.

docker tag $IMAGE_URL $ARTIFACT_URL
docker push $ARTIFACT_URL

Note that if you’re doing this from your local machine you’ll need to configure your Docker client with gcloud auth configure-docker. More instructions can be found in the documentation.

Deploy

You can deploy a new service from the console, and you should see your new image in the Artifact Registry selection menu.

Set the container port to 5000, so the Cog service can be accessed, and then deploy.
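If you'd rather use the command line than the console, a rough gcloud equivalent might look like this (the service name and region here are my own choices):

gcloud run deploy cog-hello-world \
  --image=$ARTIFACT_URL \
  --port=5000 \
  --region=us-central1 \
  --allow-unauthenticated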

Once the service is deployed, Cloud Run will create a URL for it, which you'll use to make prediction requests.

Use curl to test out your service and see the results.

export CLOUD_RUN_URL=https://cog-hello-world-your-id.a.run.app
curl -s -X POST \
-H "Content-Type: application/json" \
-d $'{
"input": {
"text": "Billy"
}
}' \
$CLOUD_RUN_URL/predictions

If it works, you should see a JSON response, roughly like this (some fields omitted):
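{
  "input": {
    "text": "Billy"
  },
  "output": "hello Billy",
  "status": "succeeded"
}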

Deploying to Google Kubernetes Engine

For more customization of your container infrastructure, Google Kubernetes Engine (GKE) can be a great fit. GKE can also work with GPUs, which are commonly used in generative AI. If you're a Cog user and already have your code ready, or want to use someone else's model hosted on Replicate, this could be a good method for you.

For your own Cog deployment, this method requires the image to be hosted in a Docker repository. Artifact Registry is an easy place to host your image, and the Cloud Run section shows how to push your images there.

In the Google Cloud Console, open Kubernetes Engine and create a new Workload deployment. Paste your image URL for the container image.

Then you can set a deployment name. Here I’m using cog-to-gcp-tutorial.

Finally, expose the Cog service by allowing TCP on port 5000.

Click “deploy,” and after a few minutes the service should be running. You can grab the IP address from the Exposing services section and make a curl request to make a prediction.
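If you prefer the command line, a rough kubectl equivalent of these console steps might look like this (it assumes you've already fetched cluster credentials, and reuses the image URL from the Cloud Run section):

# Create the deployment and expose it as a load-balanced service on port 5000
kubectl create deployment cog-to-gcp-tutorial --image=$ARTIFACT_URL
kubectl expose deployment cog-to-gcp-tutorial --type=LoadBalancer --port=5000 --target-port=5000

# Once the external IP is provisioned, it will show up here
kubectl get service cog-to-gcp-tutorial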

export IP_ADDR=your-external-ip:5000
curl -s -X POST \
-H "Content-Type: application/json" \
-d $'{
"input": {
"text": "Billy"
}
}' \
http://$IP_ADDR/predictions

Deploying to Compute Engine

If you are developing and debugging your own model, a virtual machine can be a helpful tool. For smaller models you can likely do this on your own machine, but VMs are a good option if you aren’t familiar with Kubernetes and want a straightforward serving option during testing.

Create your VM

For running this small model, you can create a virtual machine with the default machine configuration (e2-medium with 2 vCPUs and 4GB of memory).

Many machine learning models take advantage of the parallel math that GPUs can do. NVIDIA CUDA is a toolkit that allows GPUs to speed up computing applications by running tasks in parallel instead of sequentially. You can create a VM with GPUs, but you need to do some work to allow all the parts to talk to each other.

The Google Cloud documentation provides instructions for installing the NVIDIA drivers and CUDA toolkit, which you can follow once the VM spins up. Also, set your boot disk storage to around 200GB to accommodate the installation and image downloads.

Enable HTTP traffic so you can make predictions from other networks. Then deploy the VM.
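As a sketch, creating a GPU-backed VM from the command line might look something like this (the instance name, zone, machine type, and GPU type are my own example choices; the http-server tag corresponds to the "Allow HTTP traffic" checkbox):

gcloud compute instances create cog-gpu-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=200GB \
  --tags=http-server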

You’ll also need to add a firewall rule to allow access to port 5000, so the Cog service can be accessed. Here I just allowed all instances in my network and all IP addresses to access it, but for a production deployment make sure to have a more secure rule.
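With gcloud, such a rule might look like this (the rule name is mine; 0.0.0.0/0 opens the port to everyone, so use a tighter source range in production):

gcloud compute firewall-rules create allow-cog-5000 \
  --network=default \
  --allow=tcp:5000 \
  --source-ranges=0.0.0.0/0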

Once your VM is deployed, SSH into it and start configuring the environment.

Run these commands to install Docker. The Docker documentation has the latest installation instructions.

sudo apt-get update
sudo apt-get install -y docker.io

Run these commands to install Cog. The Cog documentation has the latest installation instructions.

sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

We're using the same Hello World example as above, but this time we're using the unbuilt Cog code. Clone the repository from GitHub and navigate to the directory.

git clone https://github.com/replicate/cog-examples.git
cd cog-examples/hello-world

Build the Cog model.

sudo cog build -t cog-hello-world

You can make your first prediction.

sudo cog predict -i text=Billy
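For the Hello World model, this should print a greeting built from your input, something like:

hello Billy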

Now, you can deploy the model with Docker.

sudo docker run -d -p 5000:5000 cog-hello-world
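You can confirm the container is running and see its port mapping with:

sudo docker ps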

You can now make a curl request to your server using the external IP address and port 5000.

export IP_ADDR=your-external-ip:5000
curl -s -X POST \
-H "Content-Type: application/json" \
-d $'{
"input": {
"text": "Billy"
}
}' \
http://$IP_ADDR/predictions

Next steps

I hope you can use Cog to deploy your own models to Google Cloud, or find some cool ones the community created and try deploying them yourself. Here are a few of my favorites:
