AI Platform Notebooks “headless” Training
Less triggering than a chicken in a barnyard
AI Platform Notebooks provides managed JupyterLab notebook instances, a familiar environment for experimenting with, developing, and deploying models into production.
The missing piece is training models for production, which typically runs much longer than either the initial experimentation or production prediction. The interactive Notebook in which you did your experimentation is probably not the right place for this: browser sessions time out, connections are lost, image rendering freezes, and the instance isn’t right-sized for training.
This article focuses on two approaches to running your training in a headless manner:
- AI Hub JupyterLab VM: Papermill
- AI Platform Training
AI Hub JupyterLab VM: Papermill
Papermill is a tool for parameterizing and executing Jupyter Notebooks, and it can run them in a number of different ways, including by spinning up a dedicated VM.
One of these is submitting the notebook from the command line of the Notebook’s own VM: this takes advantage of the custom VM configuration you’ve already set up with your AI libraries, while bypassing the potential freezing issues associated with rendering results in the Notebook UI. The Notebook VM also comes with Papermill pre-installed, so no additional setup is required.
- If you’d like to run on a separate VM with Papermill, this TPU-based Next 2019 talk walks through that scenario.
You can access the command line either from the Cloud Console -> Compute Engine -> VM instances -> SSH option for the VM, or from within the Notebook itself; the former excludes the Notebook UI as a potential concern. You can open the SSH session in a browser window, or in a dedicated client for additional state persistence.
Parameterization
You can parameterize your Notebook to pass variable information, such as the number of epochs, into the notebook at run time. Set default parameters in a cell tagged with the “parameters” keyword and refer to these in the training and other steps; these defaults are then overridden by whatever you specify on the command line.
E.g., the command line for a Notebook with an epochs parameter:
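A minimal sketch, assuming your notebook is named train.ipynb (a placeholder) and defines an epochs default in its tagged parameters cell:

```shell
# Run train.ipynb headlessly, overriding the default epochs value;
# results are recorded in the output notebook, output.ipynb
papermill train.ipynb output.ipynb -p epochs 20
```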
Execution
When you run the command, you specify an output Notebook; all intermediate results, logging, etc. are recorded there. Include the --log-output flag to also write notebook output to stderr (i.e. the terminal window).
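For example (notebook names are placeholders):

```shell
# --log-output streams each cell's output to stderr as it executes,
# in addition to recording it in output.ipynb
papermill train.ipynb output.ipynb -p epochs 20 --log-output
```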
Note that closing the terminal terminates the SSH session and, by default, any Notebooks running within it, including headless notebooks.
Continue execution if the terminal is closed
To continue execution even if the terminal session is closed, use a command sequence like one of the following:
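Two sketches of such a sequence, assuming the same placeholder notebook names as above; nohup keeps the process alive after the terminal hangs up, and the trailing & backgrounds it:

```shell
# Option 1: stdout and stderr go to nohup.out by default
nohup papermill train.ipynb output.ipynb --log-output &

# Option 2: redirect stdout and stderr explicitly to ~/output.txt
nohup papermill train.ipynb output.ipynb --log-output > ~/output.txt 2>&1 &
```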
These commands show the process id (pid) and redirect stderr and stdout to nohup.out or ~/output.txt respectively.
You can monitor the running process from the output notebook, as well as from a new terminal session using the following command:
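One possible form of that command, run from a fresh SSH session:

```shell
# List any running papermill processes; the [p] pattern stops grep
# from matching its own process in the list
ps -ef | grep [p]apermill
```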
Note that redirecting stdout and stderr throws the following error; this doesn’t affect Notebook execution or logging:
- AttributeError: ‘NoneType’ object has no attribute ‘send_multipart’
More background here:
AI Platform Training
This approach is based on creating a Docker image with your libraries layered on top of a base TensorFlow image, then deploying it to AI Platform Training; you can also run the image locally for testing. The advantages: you can size your training environment differently from your Notebook environment, and you can close the terminal session without affecting job execution. You can track job execution in the Cloud Console, and logs are written to Stackdriver.
Store the model weights in Google Cloud Storage as the link between the training and prediction steps. Note that you can also save the whole model instead of just the weights; the right choice depends on your use case.
Parameterization
As before, you can parameterize the execution, e.g. with the number of epochs.
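A sketch of passing the parameter through at job submission time; the job name, region, and image URI here are placeholders. Everything after the bare -- separator is forwarded to your training code:

```shell
gcloud ai-platform jobs submit training epochs_demo_job \
  --region us-central1 \
  --master-image-uri gcr.io/my-project/my-trainer:latest \
  -- \
  --epochs=20
```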
Sample script
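A sketch of what such a script might look like; the image name, bucket name, region, and the assumption that your Dockerfile layers your libraries on a base TensorFlow image are all placeholders to adapt:

```shell
#!/bin/bash
set -e

PROJECT_ID=$(gcloud config get-value project)
IMAGE_URI=gcr.io/${PROJECT_ID}/my-trainer:latest
BUCKET=gs://${PROJECT_ID}-training
REGION=us-central1
JOB_NAME=training_$(date +%Y%m%d_%H%M%S)

# Build the training image and push it to Container Registry
docker build -t ${IMAGE_URI} .
docker push ${IMAGE_URI}

# One-time step: create the bucket that will hold the model weights
gsutil mb -l ${REGION} ${BUCKET}

# Submit the training job; arguments after -- go to the training code
gcloud ai-platform jobs submit training ${JOB_NAME} \
  --region ${REGION} \
  --master-image-uri ${IMAGE_URI} \
  -- \
  --epochs=20 --bucket=${BUCKET}
```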
Execution
Work through the steps outlined above on your VM, or run them all as a bash script. You will need to create your own GCS bucket, normally in the project in which you’re running the VM; the bucket can live in a different project if you configure service accounts appropriately.
If you choose to run the optional local training test step, your VM will need a GPU.
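A hedged example of that local test, assuming the image’s entrypoint is your training script and the NVIDIA container runtime is installed on the VM:

```shell
# Run one short epoch locally on the VM's GPU before submitting
docker run --gpus all gcr.io/my-project/my-trainer:latest --epochs=1
```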
Checkpoints
To support better recovery from failures during a job, you can implement checkpoints; this Keras sample demonstrates checkpointing at each epoch.