AI Platform Notebooks “headless” Training

Less triggering than a chicken in a barnyard

Ferris Argyle
Google Cloud - Community
3 min read · Dec 27, 2020


AI Platform Notebooks provide managed JupyterLab notebook instances, a familiar environment for experimenting, developing, and deploying models into production.

The missing piece is training models for production; this typically runs much longer than the initial experimentation or the production prediction. The interactive Notebook in which you did your experimentation is probably not the right place for this: browser sessions time out, connections are lost, image rendering freezes, and the instance isn’t right-sized for training.

This article focuses on two approaches to running your training in a headless manner: executing the notebook with Papermill from the Notebook VM’s command line, and packaging your training code for AI Platform Training.

AI Hub JupyterLab VM: Papermill

Papermill is a tool for parameterizing and executing Jupyter Notebooks in a number of different ways, including by spinning up a VM.

One of these is submitting the notebook from the command line of the Notebook’s VM: this takes advantage of the VM you’ve already configured with your AI libraries, while bypassing the potential freezing issues associated with rendering results in the Notebook UI. The Notebook VM also comes with Papermill pre-installed, so no additional configuration is required.

You can access the command line either from the Cloud Console -> Compute Engine -> VM instances -> SSH option for the VM, or from within the Notebook itself; the former excludes the Notebook UI as a potential concern. You can open the SSH session in a browser window, or in a dedicated client for additional state persistence.
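
If you’d rather not use the browser window, you can open the session from any machine with the Cloud SDK installed; the instance name and zone below are placeholders for your own Notebook VM:

    # SSH to the Notebook VM from any machine with the Cloud SDK installed
    gcloud compute ssh my-notebook-instance --zone us-central1-a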

Parameterization

You can parameterize your Notebook to pass variable information such as the number of epochs into the notebook at run time. Set default parameters in a cell tagged with the “parameters” keyword and refer to these in the training or other steps; they are then overridden by whatever you specify on the command line.

For example, the command line for a Notebook with an epochs parameter:
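
Something along these lines, where the notebook names and epoch count are placeholders for your own:

    # execute the notebook headlessly, overriding the default epochs parameter
    papermill train.ipynb train_output.ipynb -p epochs 20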

Execution

When you run the command, you specify an output Notebook; all intermediate results, logging, etc. are recorded there. Include the --log-output flag to write notebook output to stderr (i.e. the terminal window).
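
For example, building on the command above (notebook names again placeholders):

    # record intermediate results in the output notebook and stream logs to the terminal
    papermill train.ipynb train_output.ipynb -p epochs 20 --log-output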

Note that if the terminal is closed, the SSH session terminates, and by default so does anything running within it, including headless notebook executions.

Continue execution if the terminal is closed

To continue execution even if the terminal session is closed, use a command sequence like one of the following:
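
Two variants along these lines should work; the notebook names are placeholders, as before:

    # nohup ignores the hangup signal sent when the terminal closes;
    # output goes to nohup.out by default
    nohup papermill train.ipynb train_output.ipynb -p epochs 20 --log-output &

    # same idea, but redirecting stdout and stderr explicitly to ~/output.txt
    nohup papermill train.ipynb train_output.ipynb -p epochs 20 --log-output > ~/output.txt 2>&1 &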

These commands show the process id (pid) and redirect stderr and stdout to nohup.out or ~/output.txt respectively.

You can monitor the running process from the output notebook, as well as from a new terminal session using the following command:
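
For instance, any standard process listing will do:

    # check whether the papermill process is still running
    ps -ef | grep papermill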

Note that redirecting stdout and stderr throws the following error; this doesn’t affect Notebook execution or logging:

  • AttributeError: 'NoneType' object has no attribute 'send_multipart'

More background here:

AI Platform Training

This approach is based on creating a Docker image with your libraries layered on top of a base TensorFlow image, then deploying that image to AI Platform Training; you can also run the image locally for testing. This has the advantage that you can size your training environment differently from your Notebook environment and close the terminal session without affecting job execution; you can track job execution in the Cloud Console, and logs are written to Stackdriver.
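
A minimal sketch of such an image; the base image tag, extra libraries, and trainer script below are illustrative assumptions rather than a fixed recipe:

    # Dockerfile: layer your libraries and training code on a TensorFlow base image
    FROM tensorflow/tensorflow:2.3.0-gpu
    RUN pip install pandas google-cloud-storage
    COPY trainer/ /trainer/
    ENTRYPOINT ["python", "/trainer/task.py"]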

Store the model weights in Google Cloud Storage as the link between the training and prediction steps; note that you can also save the whole model instead of just the weights. Which approach you choose will depend on your use case.
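
One way to do this, with placeholder file and bucket names:

    # copy just the trained weights to Cloud Storage...
    gsutil cp model_weights.h5 gs://my-training-bucket/models/run-001/
    # ...or copy an entire SavedModel directory instead
    gsutil -m cp -r saved_model/ gs://my-training-bucket/models/run-001/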

Parameterization

As before, you can parameterize the execution, e.g. with the number of epochs.

Sample script
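
A sketch of what such a script could look like; the project, bucket, image, and job names are placeholders, and the arguments after the bare -- are whatever your own training code accepts:

    #!/bin/bash
    PROJECT_ID=my-project
    BUCKET=gs://my-training-bucket
    IMAGE_URI=gcr.io/${PROJECT_ID}/headless-trainer:latest
    JOB_NAME=headless_training_$(date +%Y%m%d_%H%M%S)

    # build the training image and push it to Container Registry
    docker build -t ${IMAGE_URI} .
    docker push ${IMAGE_URI}

    # submit the job to AI Platform Training with a GPU machine configuration
    gcloud ai-platform jobs submit training ${JOB_NAME} \
      --region us-central1 \
      --master-image-uri ${IMAGE_URI} \
      --scale-tier BASIC_GPU \
      -- \
      --epochs=10 \
      --model-dir=${BUCKET}/${JOB_NAME}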

Execution

Work through the steps outlined above on your VM, or run them all as a bash script. You will need to create your own GCS bucket in the project in which you’re running the VM; the bucket can also live in a different project if you configure service accounts appropriately.

If you choose to run the optional local training test step, your VM will need a GPU.
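
That local test could look something like this, assuming the image name from the script above and a GPU-enabled Docker runtime on the VM:

    # quick local smoke test with a small number of epochs
    docker run --gpus all gcr.io/my-project/headless-trainer:latest --epochs=2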

Checkpoints

To support better recovery from failures during a job, you can implement checkpoints; this Keras sample demonstrates checkpointing on epochs.


These are my personal writings; the views expressed in these pages are mine alone and not those of my employer, Google.