Running AlphaFold with Google Cloud Life Sciences

Nobuhisa Mizue
Google Cloud - Community
9 min read · Sep 10, 2021

AlphaFold

AlphaFold is an AI model developed by DeepMind for predicting the 3D structure of proteins. The first AlphaFold (version 1) was released in 2018, followed by version 2 in 2020. It has attracted attention from all over the world as an innovative invention that is said to solve a 50-year-old grand challenge in biology. DeepMind also published AlphaFold on GitHub in 2021. This is expected to put AlphaFold in everyone's hands, and in the near future it may help elucidate the function of unknown proteins, accelerate the development of new drugs, and treat many diseases.

Immediately after AlphaFold was published on GitHub, research institutes and many individuals posted the steps to run AlphaFold on cloud or on-premises servers on their blogs. However, to run AlphaFold, you need to download more than 2.2 TB of database files from the Internet and store them on the server's disk. It also requires a certain amount of CPU, memory, and GPU, so it is cumbersome and costly to build and configure a cloud server for AlphaFold and keep it running for long periods of time.

In this article, I will explain how to run AlphaFold on Google Cloud quickly and inexpensively, without setting up a virtual server (Compute Engine VM instance) dedicated to running it.

Disclaimer: This article assumes that you have some background knowledge of Linux OS, Docker, and Google Cloud. Please note that this article is not a complete runbook and some steps have been omitted.

One-time preparation

Create a VM instance that is used only for this one-time preparation. The VM instance is needed temporarily to create the disk image.

The preparatory steps are as follows.

  1. Create an AlphaFold container image and register it in the container registry
  2. Download the database files and convert the disk to an image
  3. Enable Cloud Life Sciences API
  4. Install dsub
  5. Create a script to run inside the container

The VM instance is only required for steps 1 and 2 above. The subsequent steps assume you are working in Cloud Shell.

Let's go through the procedure in order.

1. Create an AlphaFold container image and register it in the container registry

Create a Compute Engine VM instance. Attach a 2.2 TB or larger disk to this VM and mount it on the instance. The following is an example command for creating a VM instance in the us-central1-a zone from the command line. Note that the boot disk size is 100 GB and the command also creates a 3,000 GB non-boot disk. You will need to format and mount the non-boot disk after the VM spins up; please see the documentation for details.

gcloud compute instances create <INSTANCE_NAME> \
--zone=us-central1-a \
--machine-type=e2-standard-8 \
--boot-disk-size=100GB \
--create-disk=mode=rw,size=3000,type=projects/<PROJECT_ID>/zones/us-central1-a/diskTypes/pd-balanced,name=alphafold-data,device-name=alphafold-data

You can also do the same from the Google Cloud console.

In Google Cloud, “images are global resources”. You only need to create an image once, in one place, and you can use the image from any regions or zones in the world. Therefore, you can create a VM instance in any region.

SSH into the VM instance you created, clone AlphaFold from GitHub, build a container image, and push it to Container Registry. Execute the following commands.

# install git
sudo apt-get update
sudo apt-get install git
# install docker
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
# make docker available with general user privilege
sudo gpasswd -a $(whoami) docker
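# log out and back in (or run: newgrp docker) so the new group membership takes effect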
# docker build, tag and push
git clone https://github.com/deepmind/alphafold.git
cd alphafold
docker build -f docker/Dockerfile -t alphafold .
docker tag alphafold gcr.io/<PROJECT_ID>/alphafold
docker push gcr.io/<PROJECT_ID>/alphafold
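If the docker push fails with an authentication error, you may need to configure Docker to authenticate to Container Registry with your gcloud credentials first:

gcloud auth configure-docker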

2. Download the database files and convert the disk to an image

Run the script that downloads the database files, specifying the path where the attached disk is mounted. <MOUNT_POINT> is the path where the 3,000 GB disk is mounted. See the Compute Engine documentation for how to format and mount an additional non-boot disk.
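For reference, formatting and mounting the disk looks roughly like the following. This is a sketch that assumes the disk was created with device-name=alphafold-data, as in the earlier create command, and that you mount it at /mnt/disks/alphafold-data:

# format the additional disk (this destroys any existing data on it)
sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/disk/by-id/google-alphafold-data
# create a mount point, mount the disk, and make it writable
sudo mkdir -p /mnt/disks/alphafold-data
sudo mount -o discard,defaults /dev/disk/by-id/google-alphafold-data /mnt/disks/alphafold-data
sudo chmod a+w /mnt/disks/alphafold-data

In that case, <MOUNT_POINT> below is /mnt/disks/alphafold-data.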

sudo apt-get install rsync
sudo apt-get install aria2
cd scripts
./download_all_data.sh <MOUNT_POINT>

This download will take 2–3 hours to complete. Once it finishes, the directory structure on the disk will be as follows.

<MOUNT_POINT>/                             # Total: ~2.2 TB (download: 438 GB)
    bfd/                                   # ~1.7 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~64 GB (download: 32.9 GB)
        mgy_clusters_2018_12.fa
    params/                                # ~3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    small_bfd/                             # ~17 GB (download: 9.6 GB)
        bfd-first_non_consensus_sequences.fasta
    uniclust30/                            # ~86 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~58 GB (download: 29.7 GB)
        uniref90.fasta

When the download has finished, detach the disk from the VM instance and then create an image from the disk. You can create the image from the Google Cloud console or from the command line.
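For example, detaching the disk from the command line could look like this (using the instance and disk names from the earlier create command):

gcloud compute instances detach-disk <INSTANCE_NAME> --disk=alphafold-data --zone=us-central1-a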

If you run it with the command line:

gcloud compute images create <IMAGE_NAME> --source-disk <DISK_NAME>

Once you've created the image, you no longer need the VM instance, so delete it. To delete the instance from the command line:

gcloud compute instances delete <INSTANCE_NAME>

You can perform the remaining steps in Cloud Shell. Please refer to the following document for how to use Cloud Shell.

https://cloud.google.com/shell/docs/launching-cloud-shell

3. Enable the Cloud Life Sciences API


Google Cloud Life Sciences is a set of services and tools for managing, processing, and transforming life sciences data. The “dsub” used in the following steps calls the Google Cloud Life Sciences API, so let’s enable the API here. You can enable the API from the Google Cloud console or the command line.

If you run it with the command line:

gcloud services enable lifesciences.googleapis.com
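You can check that the API is enabled with, for example:

gcloud services list --enabled | grep lifesciences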

4. Install dsub

dsub is a command-line tool developed by Google to run batch jobs, particularly bioinformatics software. dsub creates a container from a specified image and executes any command or script inside it.

In addition, dsub has functions such as mounting a disk so that it can be accessed from the container, copying input files from Google Cloud Storage (GCS) into the container, and writing results produced in the container back out to GCS.

Many bioinformatics tools are containerized and need to read and write large amounts of data, so tools like dsub are very useful.

dsub can be easily installed with pip. To install using Python’s venv, follow the steps below.

python3 -m venv env
source env/bin/activate
pip install dsub

For more information on dsub, see GitHub.

https://github.com/DataBiosphere/dsub

5. Create a script to be executed inside the container

Create a script file to be executed inside the container. Save the following contents with the file name "alphafold.sh". There are a lot of parameters, but the script simply changes to /app/alphafold and runs "/app/run_alphafold.sh". The variables enclosed in ${} are passed in through the parameters of the dsub command described later.

cd /app/alphafold
/app/run_alphafold.sh \
--fasta_paths=${FASTA} \
--uniref90_database_path=${DB}/uniref90/uniref90.fasta \
--mgnify_database_path=${DB}/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path=${DB}/pdb70/pdb70 \
--data_dir=${DB} \
--template_mmcif_dir=${DB}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=${DB}/pdb_mmcif/obsolete.dat \
--uniclust30_database_path=${DB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path=${DB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--output_dir=${OUT_PATH} \
--model_names=model_1,model_2,model_3,model_4,model_5 \
--max_template_date=2020-05-14 \
--preset=full_dbs \
--benchmark=False \
--logtostderr

This completes the one-time preparatory work.

The following resources should have been created by the work so far.

- The container image is registered in Container Registry.
- The disk image has been created.

Run AlphaFold

Now you can run AlphaFold. The preparatory work up to this point only needs to be done once. After that, you can run AlphaFold as many times as you like with the following simple steps.

  1. Save the amino acid sequence (FASTA file) in GCS
  2. Execute AlphaFold

Let’s go through each procedure.

1. Save the amino acid sequence (FASTA file) in GCS

Prepare the amino acid sequence that will be the input to AlphaFold. Amino acid sequence data is usually plain text in which each letter represents an amino acid. Create a sample file and copy it to GCS. Save the following data with the file name "all0174.fasta". The line starting with ">" is a description line, and the amino acid sequence follows on the next line.

>all0174
MTEADSSVLQIWGGHPLQGHVKISGAKNSALVIMAGALLCSGDCRIRNVPLLADVERMGEVISALGVRLTRQADIIDINASEIKTSKAPYELVTQLRASFFAIGAILARLGVAQMPLPGGCAIGARPVDLHVRGLQAMGAEVQIEHGICNAYVPGSGGRLKGAKIYLDTPSVGATETLMMAATLADGETILENAAREPEVVDLANFCKAMGANIQGAGTSTITIVGVPKLHSVDYSIIPDRIEAGTFLVAGAITRSEITLSSVVPEHLIPLIAKLRDIGVTIIEESPDCLRILPAEILKATDIDTLPHPGFPTDMQAPFMALLTLAEGDSIINESVFENRLRHASELNRLGADIRVKGNTAFVRGVPLLSGAPVIGTDLRASAALVIAGLAAEGKTTIQGLHHLDRGYDQIDVKLQQLGAKILRVREEPANAEVAVNNNVSPASIST

First of all, you have to create a GCS bucket. From the command line:

gsutil mb gs://<PROJECT_ID>-alphafold

Copy the FASTA file you created to the input folder under your GCS bucket. From the command line:

gsutil cp all0174.fasta gs://<PROJECT_ID>-alphafold/input/
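You can confirm that the file is in place with:

gsutil ls gs://<PROJECT_ID>-alphafold/input/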

2. Run AlphaFold

Execute AlphaFold with the dsub command below.

dsub --provider google-cls-v2 \
--project <PROJECT_ID> \
--zones <ZONE_NAME> \
--logging gs://<PROJECT_ID>-alphafold/logs \
--image=gcr.io/<PROJECT_ID>/alphafold:latest \
--script=alphafold.sh \
--input FASTA=gs://<PROJECT_ID>-alphafold/input/all0174.fasta \
--mount DB="<IMAGE_URL> 3000" \
--output-recursive OUT_PATH=gs://<PROJECT_ID>-alphafold/output \
--machine-type n1-standard-8 \
--boot-disk-size 100 \
--subnetwork <SUBNET_NAME>

Description of parameters:

--provider google-cls-v2: Use the Google Cloud Life Sciences API (v2)
--project: PROJECT ID
--zones: ZONE NAME (wildcard "*" is possible. e.g. "us-central1-*")
--logging: Log output destination folder
--image: AlphaFold Container Image URL on GCR
--script: Script file name created earlier
--input: Path of the FASTA file uploaded to GCS
--mount: URL of the created disk image (*1)
--output-recursive: Path of GCS where the artifacts are output
--machine-type: Machine type of the worker VM (here n1-standard-8)
--boot-disk-size: Boot disk size in GB (here 100 GB)
--subnetwork: Subnetwork name if it's not "default"
*1:
The URL of the disk image can be found with the following command.
gcloud compute images list --no-standard-images --uri
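The command returns a URI in roughly the following format (shown here only as an illustration, with the project and image names you used):

https://www.googleapis.com/compute/v1/projects/<PROJECT_ID>/global/images/<IMAGE_NAME>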

When you submit the dsub command, it prints a job ID. You can use this job ID to check the status of the running job with the following command.

dstat --provider google-cls-v2 --project <PROJECT_ID> --jobs <JOB_ID>
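If you want more detail, such as events and error messages, you can add dstat's --full flag:

dstat --provider google-cls-v2 --project <PROJECT_ID> --jobs <JOB_ID> --full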

You can also cancel the job with the ddel command. The command looks like this:

ddel --provider google-cls-v2 --project <PROJECT_ID> --jobs <JOB_ID>

The command above uses only CPUs, but AlphaFold can also run faster on GPUs. For example, to use one NVIDIA K80, run the dsub command with the following additional parameters:

--accelerator-type nvidia-tesla-k80
--accelerator-count 1

You can check which GPU types are available in which regions and zones at the following URL.

https://cloud.google.com/compute/docs/gpus/gpu-regions-zones

In addition, preemptible VMs keep costs down. Preemptible VMs are short-lived, affordable Compute Engine VM instances. If you want to use a preemptible VM, add the following parameter when you run dsub.

--preemptible
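For reference, putting these options together, a GPU + preemptible run is just the earlier dsub command with the extra parameters appended (same placeholders and assumptions as before):

dsub --provider google-cls-v2 \
--project <PROJECT_ID> \
--zones <ZONE_NAME> \
--logging gs://<PROJECT_ID>-alphafold/logs \
--image=gcr.io/<PROJECT_ID>/alphafold:latest \
--script=alphafold.sh \
--input FASTA=gs://<PROJECT_ID>-alphafold/input/all0174.fasta \
--mount DB="<IMAGE_URL> 3000" \
--output-recursive OUT_PATH=gs://<PROJECT_ID>-alphafold/output \
--machine-type n1-standard-8 \
--boot-disk-size 100 \
--subnetwork <SUBNET_NAME> \
--accelerator-type nvidia-tesla-k80 \
--accelerator-count 1 \
--preemptible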

Visualize the predicted 3D structure

AlphaFold can take from tens of minutes to several hours to complete. When the job finishes, PDB files containing the protein structures predicted by AlphaFold are written under the GCS path specified by the --output-recursive parameter of dsub.
Download one of the .pdb files and visualize it (a gsutil example follows the list of sites below).
There is a lot of software for working with PDB files, and there are also several websites where you can visualize the 3D structure simply by uploading the file. For example:

https://www.rcsb.org/3d-view
https://molstar.org/viewer/
https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP
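To copy a result file from GCS to your local machine (or to Cloud Shell), you can use gsutil. This is a minimal sketch; the exact file names under the output folder depend on the AlphaFold version:

# copy the entire output folder from GCS to the current directory
gsutil -m cp -r gs://<PROJECT_ID>-alphafold/output .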

The operation is similar on each of these websites. Select the PDB file from the Open File menu and click Apply to see the 3D structure.

Conclusion

Running AlphaFold with the dsub command introduced here is inexpensive because you don't have to keep a virtual server running. If you combine GPUs with preemptible VMs, you can speed up processing while keeping costs even lower.

While I used dsub this time, the same thing can be achieved with Cloud Life Sciences alone without using dsub.

Google Cloud offers a variety of solutions to support life sciences. Please refer to the following documents for details.

https://cloud.google.com/life-sciences/docs/how-tos
