Running Alphafold on Google Cloud compute engine

Hariprasad Radhakrishnan
Google Cloud - Community
7 min read · Aug 4, 2021

Hope you were as excited as I was when DeepMind open sourced AlphaFold to the wider scientific community: https://github.com/deepmind/alphafold

I was eager to try it out after the release, closely following the installation steps in the repo. I ran it on Google Cloud and documented the steps as I went along. Sharing them here in case it’s useful to someone out there.

The instructions in the DeepMind AlphaFold repo are fairly straightforward to set up and run on a GCE instance. Hope the small additions and notes here help as well.

Disclaimer: The instructions and opinions in this post are my own and do not reflect those of DeepMind or Google Cloud.

So here goes…

To begin, it’s always good to picture what you are trying to do, maybe in my own overcomplicated way :) . All we are doing is spinning up a machine on the cloud, installing AlphaFold, and downloading the reference data it needs to run.

You will need access to Google Cloud for this to work; new users also get some free credits they can make use of. There is plenty of help out there on setting up your account.

  • From your Google Cloud console, navigate to your project -> Compute Engine to create a virtual machine. You will have to enable the Compute Engine API if it is not already enabled for your project.
  • Spin up a Compute Engine instance (Create Instance) with the e2-standard-8 spec (8 vCPUs, 32 GB memory), a Debian OS, a 100 GB boot disk, and a 3 TB persistent disk for the reference data, in the region of your choosing. (An equivalent gcloud sketch follows this list.)
  • Change the default boot disk size from 10 GB to 100 GB.
  • Be sure to choose "Keep disk" for the persistent disk; this lets you retain the reference data when you delete VMs and switch machines later. I also chose to make a copy of the reference data to a GCS bucket with the gsutil command.
  • These instructions use a Compute Engine instance without a GPU attached. Try to locate your persistent disk in a region/zone where GPUs are available, for possible future experiments with accelerators.
  • If you can create a Compute Engine instance with an external IP, skip the next steps on creating a NAT gateway. But if your GCP account has policy restrictions, or you would like to avoid an external IP, a NAT gateway helps.
  • You might also want to create a firewall rule that allows TCP forwarding from IAP so you can SSH into the server, and a NAT gateway to allow outbound internet traffic (to download the git repository and data to the VM if you are not using an external IP). If you still run into SSH issues, refer to this page.
  • Basically, what we are trying to achieve is a Compute Engine VM that you can SSH into and that can download libraries and data from the internet.
  • Since the reference data needs at least 2.2 TB uncompressed, we need to make sure the required disk space (the 3 TB persistent disk) is mounted on the VM.
  • Once the VM is up, follow the instructions below (after the gcloud sketch) to mount the persistent disk on your Linux VM.
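For reference, creating the VM and persistent disk from the command line looks roughly like the sketch below. The instance name, disk name, zone, and image family are illustrative assumptions; the console flow described above works just as well.

# Create the 3 TB persistent disk for the reference data (name and zone are placeholders)
gcloud compute disks create alphafold-data --size=3TB --type=pd-standard --zone=us-central1-a

# Create the VM with a 100 GB boot disk and attach the data disk;
# auto-delete=no keeps the disk around after the VM is deleted
gcloud compute instances create alphafold-vm \
  --zone=us-central1-a \
  --machine-type=e2-standard-8 \
  --image-family=debian-11 --image-project=debian-cloud \
  --boot-disk-size=100GB \
  --disk=name=alphafold-data,mode=rw,auto-delete=no

Once the VM is up, the mounting steps below apply regardless of how it was created.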

SSH into your VM and check for attached storage using the command below; the data disk usually shows up as /dev/sdb

lsblk

Format the disk as ext4; this helps if we need to resize it later

sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb

Create directory to mount the disk

sudo mkdir -p /mnt/disks/data

Mount the disk

sudo mount -o discard,defaults /dev/sdb /mnt/disks/data

Make sure there is write access to the disk

sudo chmod a+w /mnt/disks/data

Check by running

df -h

Your newly mounted disk should be available under

cd /mnt/disks/data

If you want to mount the disk automatically even after rebooting the VM, add the following line to /etc/fstab

/dev/sdb /mnt/disks/data ext4 rw,discard,defaults 0 0
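To do this non-interactively, and to verify the entry works before a reboot, something like the following should do; using the disk UUID from blkid instead of /dev/sdb is a slightly more robust variant.

# Append the mount entry and confirm everything in fstab mounts cleanly
echo '/dev/sdb /mnt/disks/data ext4 rw,discard,defaults 0 0' | sudo tee -a /etc/fstab
sudo mount -a

# Optional: look up the disk UUID if you prefer a UUID-based fstab entry
sudo blkid /dev/sdb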

  • Now to the interesting bits. We will clone the AlphaFold open source repo to our VM. The repo contains scripts for the Docker build and run, and for downloading the reference data. (You might have to run sudo apt-get update && sudo apt-get install git if the VM does not come with git.)

git clone https://github.com/deepmind/alphafold.git

  • Install rsync and aria2, as they are required to download the reference data from multiple sources (HTTPS, FTP, rsync).

sudo apt-get install rsync

sudo apt-get install aria2

  • Navigate to the scripts folder of the cloned repo (alphafold/scripts) and run the script below to download the reference sequence data to your persistent disk mount location. The download takes a few hours, depending on whether you go for the full volume or use the reduced_dbs option.

./download_all_data.sh /mnt/disks/data &
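Since the download runs for several hours, it can be worth running it so that it survives SSH disconnects and keeps a log. A variant of the command above (the log path is just a suggestion):

# Run the download detached from the SSH session and log its output
nohup ./download_all_data.sh /mnt/disks/data > ~/download.log 2>&1 &

# Check on progress later
tail -f ~/download.log
du -sh /mnt/disks/data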

  • Build the AlphaFold container image with the Docker command below. Install Docker if it is not already on the VM, and set it up so you can run it as a non-root user.
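If Docker is not already on the VM, one common way to install it on Debian and let your user run it without sudo is sketched below (the docker.io package from the Debian repos is an assumption; Docker's own installation instructions work just as well):

sudo apt-get update
sudo apt-get install -y docker.io

# Add your user to the docker group; log out and back in for this to take effect
sudo usermod -aG docker $USER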

docker build -f docker/Dockerfile -t alphafold .

  • Update the parameters in docker/run_docker.py: set DOWNLOAD_DIR to the path of the reference data (the mounted persistent disk where the download script stored the reference files, /mnt/disks/data here) and output_dir to a valid path for the output files, which needs to be created beforehand (/mnt/disks/data/output below).
  • Use your favourite editor to update the parameters in the Python script alphafold/docker/run_docker.py

vi docker/run_docker.py

Type :wq! to save your changes and quit.

Create output folder

mkdir /mnt/disks/data/output
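A quick sanity check that the edits are in place (variable names as referenced above; the exact layout of run_docker.py may differ between AlphaFold versions):

grep -nE 'DOWNLOAD_DIR|output_dir' docker/run_docker.py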

  • You will need a FASTA file as input, passed via the --fasta_paths parameter when you run AlphaFold. Create a file T1029.fasta with the content below. If you are not familiar with the FASTA format, please see here.

>T1029 EbsA, Cyanobacteria, 125 residues|

MRIDELVPADPRAVSLYTPYYSQANRRRYLPYALSLYQGSSIEGSRAVEGGAPISFVATWTVTPLPADMTRCHLQFNNDAELTYEILLPNHEFLEYLIDMLMGYQRMQKTDFPGAFYRRLLGYDS
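If you prefer to create the file straight from the shell, a heredoc works; the home directory location is just a suggestion and matches the FASTA_DIR used further below.

cat > ~/T1029.fasta <<'EOF'
>T1029 EbsA, Cyanobacteria, 125 residues|
MRIDELVPADPRAVSLYTPYYSQANRRRYLPYALSLYQGSSIEGSRAVEGGAPISFVATWTVTPLPADMTRCHLQFNNDAELTYEILLPNHEFLEYLIDMLMGYQRMQKTDFPGAFYRRLLGYDS
EOF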

  • We will execute AlphaFold with the GPU flag set to false (--use_gpu=False). This runs AlphaFold on CPU only, which is all our VM has.

Create a symlink named mgy_clusters_2018_08.fa pointing to the downloaded mgy_clusters_2018_12.fa, to work around the issue raised here: https://github.com/deepmind/alphafold/issues/54

cd /mnt/disks/data/mgnify

ln -s mgy_clusters_2018_12.fa mgy_clusters_2018_08.fa

Also install the other Python libraries required to run AlphaFold (if pip is missing on the VM, install it first with sudo apt-get install python3-pip)

pip install -r docker/requirements.txt

  • We need an input sequence to pass to AlphaFold. This is done by pointing the run command at the FASTA file via a FASTA_DIR variable, which needs to be the absolute path to the directory where you created or downloaded the FASTA file above.

FASTA_DIR=/home/<your user home path>

  • Now it’s time to run AlphaFold, using the command below.

python3 docker/run_docker.py --fasta_paths=${FASTA_DIR}/T1029.fasta --max_template_date=2020-05-14 --use_gpu=False

You should see output something like this:

I0724 23:29:46.616896 139835467216704 run_alphafold.py:130] Running model model_2

I0724 23:29:48.999432 140445071447872 run_docker.py:193] I0724 23:29:48.998316 139835467216704 model.py:132] Running predict with shape(feat) = {'aatype': (4, 125), 'residue_index': (4, 125), 'seq_length': (4,), 'template_aatype': (4, 4, 125), 'template_all_atom_masks': (4, 4, 125, 37), 'template_all_atom_positions': (4, 4, 125, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 125), 'msa_mask': (4, 508, 125), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 125, 3), 'template_pseudo_beta_mask': (4, 4, 125), 'atom14_atom_exists': (4, 125, 14), 'residx_atom14_to_atom37': (4, 125, 14), 'residx_atom37_to_atom14': (4, 125, 37), 'atom37_atom_exists': (4, 125, 37), 'extra_msa': (4, 1024, 125), 'extra_msa_mask': (4, 1024, 125), 'extra_msa_row_mask': (4, 1024), 'bert_mask': (4, 508, 125), 'true_msa': (4, 508, 125), 'extra_has_deletion': (4, 1024, 125), 'extra_deletion_value': (4, 1024, 125), 'msa_feat': (4, 508, 125, 49), 'target_feat': (4, 125, 22)}

….

Final timings for T1029: {'features': 2304.6203739643097, 'process_features_model_1': 5.570271968841553, 'predict_and_compile_model_1': 1880.1474549770355, 'relax_model_1': 21.630855560302734, 'process_features_model_2': 2.1305012702941895, 'predict_and_compile_model_2': 1656.9139652252197, 'relax_model_2': 19.625572681427002, 'process_features_model_3': 1.6185104846954346, 'predict_and_compile_model_3': 1509.6751911640167, 'relax_model_3': 22.60988211631775, 'process_features_model_4': 1.6407415866851807, 'predict_and_compile_model_4': 1531.7309045791626, 'relax_model_4': 20.198458194732666, 'process_features_model_5': 1.7291264533996582, 'predict_and_compile_model_5': 1455.134045124054, 'relax_model_5': 19.432605981826782}

  • Results / Success: Once the run completes, the model outputs predictions to your output directory as below. Runs on CPUs tend to be slower than runs with GPUs.

Predicted structures and other files appear in the output dir /mnt/disks/data/output/T1029:

pkl:
features.pkl
result_model_1.pkl … result_model_5.pkl

pdb:
ranked_0.pdb … ranked_4.pdb
relaxed_model_1.pdb … relaxed_model_5.pdb
unrelaxed_model_1.pdb … unrelaxed_model_5.pdb

json:
ranking_debug.json
timings.json

others:
msas/ (bfd_uniclust_hits.a3m, mgnify_hits.sto, uniref90_hits.sto)
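If you want the results to outlive the VM, one option is to copy the output folder to a GCS bucket (the bucket name and prefix below are placeholders), much like the earlier suggestion for the reference data:

gsutil -m cp -r /mnt/disks/data/output/T1029 gs://<your-bucket>/alphafold-output/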

  • Viewing your output PDB files: The output PDB files are plain text, composed mainly of atomic coordinate records. If you are not familiar with PDB atomic coordinate records, see https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/dealing-with-coordinates
  • There are multiple online portals that let you visualize your protein structures (the output PDB files), including the NCBI viewer linked below. There are also libraries that you can install and run locally to visualize the structures.
    PDB viewers:

https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP

https://www.rcsb.org/3d-view

https://molstar.org/viewer/
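To open a prediction in one of these viewers, you first need the PDB file on your local machine. One way is gcloud compute scp; the instance name and zone below match the illustrative gcloud sketch earlier, so adjust them to your setup.

# Copy the top-ranked prediction to the current local directory
# (add --tunnel-through-iap if the VM has no external IP)
gcloud compute scp alphafold-vm:/mnt/disks/data/output/T1029/ranked_0.pdb . --zone=us-central1-a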

Structure predicted by AlphaFold

Hope that helps and works out for you! :)


Hari is a customer engineer at Google Cloud with a focus on life sciences, genomics, and biotech.