How to Use GCP with Weights & Biases

Published in Weights & Biases · 9 min read · Nov 25, 2019

By: Sayak Paul

Google Cloud Platform, more popularly known as GCP, makes it really easy to quickly fire up a virtual machine loaded with a GPU and all the setup necessary to accelerate your machine learning experimentation. Among the various services GCP offers in this area, Notebooks from the AI Platform section are my favorite.

In this article, I will show you how to quickly spin up a Notebook instance via the AI Platform and how to configure Weights and Biases (W&B) in that instance. I’ll also highlight some cool features of W&B that make it a lot of fun to work with.

Note: The rest of this post assumes you already have a billing-enabled GCP account.

Getting up and running with a Notebook instance

  • After navigating to the Notebooks section, click on NEW INSTANCE.
  • You will have a number of pre-configured environment choices. Since TensorFlow 2.0 is new and officially released, let’s go with that version.
  • Let’s also use a GPU.

Note that under Customize Instance you get a lot more options, including a range of GPU choices. Upon selecting the environment you wish to proceed with, you will be shown a popup like the following where you can specify the additional details. Be sure to tick the installation option for the NVIDIA GPU drivers.

  • After you click CREATE on that popup, it will take some time to create the instance, so hold tight. If your notebook instance was created successfully, you should see something like so -
  • Now click OPEN JUPYTERLAB to access your notebook instance. This will start a JupyterLab instance where you will be able to:
  • Fire up notebooks
  • Execute your Python scripts via the terminal

Now let’s configure Weights and Biases within this instance.

Configuring W&B within a GCP notebook instance

Your notebook instance comes with both Python 2 and Python 3 (3.5) installed, so we will have to be very careful about the python and pip aliases here. By default, typing python in the terminal selects version 2.7. We could, of course, reconfigure this, but let’s not focus on that for the time being.
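You can check what the aliases point to before installing anything:

$ python --version   # 2.7.x on these instances

$ python3 --version  # 3.5.x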

Let’s go ahead and execute the following from a terminal of your notebook instance -

pip3 install wandb

Notice that I am using pip3 here. Let’s verify the installation.
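One quick way to verify (in place of the screenshot) is to ask pip3 about the package and then import it from Python 3:

$ pip3 show wandb

$ python3 -c "import wandb; print(wandb.__version__)"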

Next, let’s ensure that the W&B command-line interface (CLI) is also working. We can check it like so -

In this instance, it is not working as expected. So what is going on here? This is a common stumbling block for those new to working with Python: the PATH environment variable, which tells the operating system which directories to search for executables like wandb, does not include the directory pip3 installed the CLI into.
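You can see this for yourself from the terminal: which reports where (or whether) the shell finds a command, and echo $PATH lists the directories it searches:

$ which wandb   # prints nothing - the shell cannot find it

$ echo $PATH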

Let’s fix this.

First, we need to find out which directory wandb was installed into. The easiest way to do this I know of is to simply run pip3 uninstall wandb; before asking for confirmation, it lists the files it would remove (answer n so the package stays installed). Below is an example -

Note the path of wandb (this is not the Python library but the CLI utility), which in this case is: /home/jupyter/.local/bin/wandb. Now, we need to symlink wandb inside /usr/bin, a directory that is already on PATH -

$ cd /usr/bin

$ sudo ln -s /home/jupyter/.local/bin/wandb wandb

Alternatively, instead of touching /usr/bin, you can add the directory that contains wandb to the PATH environment variable. Type in nano ~/.bashrc and enter the following (note that PATH entries must be directories, not paths to individual executables) -

export PATH="$PATH:/home/jupyter/.local/bin"

After you are done typing the above, go ahead and save it. After that, do not forget to run source ~/.bashrc; otherwise, the changes won’t take effect.
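Either way, you can confirm that the shell now resolves the CLI:

$ which wandb   # should now print the path to the wandb executable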

Now, when you run wandb login, you should see the following -

Pick whichever option suits you and you should be good to go. After this step has been completed, W&B is authorized on the instance.
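If you would rather authorize non-interactively (handy for scripts or startup hooks), wandb also reads your API key from the WANDB_API_KEY environment variable:

$ export WANDB_API_KEY=<your-api-key>

$ wandb login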

Tracking metrics: Using W&B to keep track of your model’s performance

Now that we have set up and authorized W&B for our Notebook instance, let’s actually see how to use it to keep track of a model’s performance while it is training. We will be using TensorFlow 2.0, and particularly the high-level tf.keras API.

I will walk you through the primary steps you would need to perform in order to use W&B to keep track of your model’s performance. You may find this notebook handy if you want to follow along.

  • Initialize wandb:

import wandb

# Initialize your W&B project, allowing it to sync with TensorBoard
wandb.init(project="tensorboard-integration", sync_tensorboard=True)
config = wandb.config

  • Note the sync_tensorboard argument in the above code block. If we set it to True, W&B will spin up a TensorBoard instance on your W&B project (tensorboard-integration) dashboard, provided that we supply the TensorBoard callback in the right manner.
  • After running the above code block, W&B will create a run for you and it will also provide you with an online run page that should be similar to this: https://app.wandb.ai/sayakpaul/tensorboard-integration/runs/e8kv5zab.
  • Specify the configuration variables:

# Specify the configuration variables
config.dropout = 0.2
config.hidden_layer_size = 128
config.layer_1_size = 16
config.layer_2_size = 32
config.learn_rate = 0.01
config.decay = 1e-6
config.momentum = 0.9
config.epochs = 25

  • These are essentially your model’s hyperparameters.
  • Prepare your dataset that will go into your model and kickstart the training process with the appropriate callback(s):

from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import TensorBoard
from wandb.keras import WandbCallback

# Specify your model definition
# Try supplying the configuration variables (a sketch follows this list)
model = Sequential([...])

# Compile the model
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])

# The WandbCallback logs metrics and some examples of the test data
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=config.epochs,
          callbacks=[WandbCallback(data_type="image", labels=labels),
                     TensorBoard(log_dir=wandb.run.dir)])

  • Two things to pay attention to in the above code block:
  • Note how the number of epochs has been specified. Instead of just supplying a hard-coded value, we are supplying config.epochs which we defined earlier. Doing this will allow W&B to keep track of the configuration variables as well.
  • WandbCallback(data_type="image", labels=labels): We can simply supply WandbCallback without any arguments and it will keep track of the performance of the model. Additionally, if we supply the data_type and labels arguments, W&B will log some sample predictions on the validation data points as the model is training.
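For completeness, here is a minimal sketch of how the elided model definition above might consume these configuration variables. This is an illustrative guess rather than the notebook’s exact architecture; the layer arrangement and the (28, 28, 1) input shape assume a small image-classification task:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential

# Hypothetical model wired to the config values defined earlier
model = Sequential([
    Conv2D(config.layer_1_size, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(config.layer_2_size, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(config.hidden_layer_size, activation="relu"),
    Dropout(config.dropout),
    Dense(10, activation="softmax"),
])

# The optimizer can be configured from the same config object
optimizer = tf.keras.optimizers.SGD(
    learning_rate=config.learn_rate, decay=config.decay, momentum=config.momentum)
loss = "categorical_crossentropy"

Because the hyperparameters flow through wandb.config, they show up on the run page alongside the metrics.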

And that’s it! Now if you go to this run page (https://app.wandb.ai/sayakpaul/tensorboard-integration/runs/e8kv5zab) you will see:

Visualizing the model performance:

TensorBoard:

Sample predictions from different steps in the training:

How to read metrics automatically using W&B with GCP

Weights and Biases lets you read your previous runs back for analysis purposes. Here’s an excellent analysis done by Lukas on some publicly available Weights and Biases runs. Accessing a run programmatically is as easy as -

import wandb

api = wandb.Api()
run = api.run("sayakpaul/arxiv-project-complex-models/6t93vdp7")

In the above example, https://app.wandb.ai/sayakpaul/arxiv-project-complex-models/runs/6t93vdp7 is a publicly available run. Now, after the run is loaded, you can extract the configuration variables of the run with run.config. It will print out -

If you want to read the metrics associated with a particular run, along with other useful metadata, you can easily do so by -

api = wandb.Api()
run = api.run("sayakpaul/arxiv-project-complex-models/6t93vdp7")
run.history()

You get a pandas DataFrame of the logged metrics -
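Because it is a DataFrame (sampled by default, so very long runs stay manageable), the usual pandas tooling applies; for example:

history = run.history()
print(history.columns)  # which metrics were logged
print(history.head())   # the first few logged steps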

In order to read multiple runs residing in a project and summarize them, you need three lines of code -

runs = api.runs("sayakpaul/arxiv-project-complex-models")
for run in runs:
    print(run.summary)

And -

Of course, you have the flexibility to trim the parts of the summary you don’t need. To see the full potential of the Weights and Biases API, check out the official documentation: https://docs.wandb.com/library/api.
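As a sketch of that trimming, here is one way to tabulate selected summary fields across runs into a single table. The _json_dict attribute follows the pattern used in W&B’s own export examples; dropping underscore-prefixed internal keys is my own convention:

import pandas as pd

rows = []
for run in runs:
    summary = run.summary._json_dict  # the final logged values as a plain dict
    # Keep only user-logged fields, dropping internal keys like _runtime
    fields = {k: v for k, v in summary.items() if not k.startswith("_")}
    rows.append({"run": run.name, **fields})

print(pd.DataFrame(rows).head())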

Advanced Features: Resuming and Grouping Runs

Resuming Runs

In machine learning, MemoryErrors are extremely common, even when training takes place on GPUs, and when these errors occur, model training is interrupted. Beyond MemoryErrors, it is also common to lose days of training progress to something as mundane as a power failure. This is all to say that setting up your development environment thoughtfully will pay dividends in the long run.

But let’s say you are training a model, water spills on your computer, and your computer shuts down unexpectedly. Weights and Biases helps you prepare for these types of situations by allowing you to resume a run that did not complete. Let’s use an example to demonstrate.

First things first — you need to set the resume argument in wandb.init() to True:

wandb.init(project="resume-read-group-runs", name="resume_runs", resume=True)

Now, let’s say, while training my model I accidentally restarted my Jupyter Notebook instance. After I restarted the kernel, to be able to resume the training from exactly where it was last, I would do the following -

model = tf.keras.models.load_model(wandb.restore("model-best.h5").name)

It is important to note that this will only work if you supplied the WandbCallback (which saves the best model as model-best.h5 in the run directory) while calling model.fit().

After loading the model, I simply need to compile it exactly the same way I did before restarting the kernel, and then call model.fit() -

model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=20, initial_epoch=wandb.run.step,
          callbacks=[WandbCallback(data_type="image", labels=labels, save_model=True)])

Pay attention to the initial_epoch argument: wandb.run.step holds the step the interrupted run had reached, so the training will begin from exactly where it left off.

Grouping Runs

You might want to group many different runs with respect to one or more configuration variables. This helps you draw comparisons between many different runs within a project, or even between the workers of a distributed training job. I am going to show a simple example of grouping runs together.
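Before the UI walkthrough, it is worth noting that grouping can also be set at logging time: wandb.init() accepts a group argument (plus an optional job_type), which is how runs from distributed training are usually tied together. A minimal sketch, with hypothetical names:

import wandb

# Runs initialized with the same group name are grouped
# together automatically on the project page.
wandb.init(project="resume-read-group-runs",  # project from the earlier example
           group="lr-0.01-experiments",       # hypothetical group name
           job_type="train")

Grouping after the fact in the UI, however, needs no code changes at all.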

Go to your project page (the URL should be like https://app.wandb.ai/sayakpaul/arxiv-project-complex-models) and press “ALT + Space”. It should look like -

Now, click on Group and you will see a list of the available configuration variables -

Now, I wish to group the runs with respect to the learning rate which is present as lr. I will select it accordingly and I am done -

You can go beyond just one field and select any field you may find necessary to group together the runs for your purpose -

Note: Be sure to turn off your notebook instance after you are done with your work -

  1. Just go to your Notebook homepage: https://console.cloud.google.com/ai-platform/notebooks/instances.
  2. Select the running instance that you want to stop.
  3. Click on STOP from the upper panel.
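If you prefer the command line, an AI Platform Notebook instance is a regular Compute Engine VM under the hood, so (assuming you have the gcloud SDK configured, and substituting your own instance name and zone) the following should work too:

$ gcloud compute instances stop my-notebook-instance --zone=us-west1-b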

What runs will you be grouping?

I hope this article gave you a flavor of different useful features offered by Weights and Biases to help you keep track of your deep learning experiments smoothly and systematically. I have made available all the experiments I did for this article here. I hope they are useful :)
