Saving, resuming, and restarting experiments with Polyaxon

In this post we will introduce a new set of Polyaxon features: checkpointing, resuming, and restarting experiments.

Often, data scientists can’t afford to let their models run for days before making adjustments. Sometimes infrastructure crashes interrupt the training and force them to run their models all over again.

It’s crucial to be able to stop training at any point, for any reason, and resume it later on. It’s also crucial to be able to resume an experiment with different parameters multiple times without losing the original progress.

Experiments should ideally be immutable and reproducible. To add this structure to your experiments, Polyaxon creates and exposes a couple of paths for every experiment; these paths are created on the volumes (logs and outputs) provided during the deployment.

You don’t need to figure out these paths or hardcode them manually. Polyaxon provides an environment variable for the outputs, POLYAXON_OUTPUTS_PATH, that you can use to export your outputs, artifacts, and checkpoints. You can also use our helper get_outputs_path to get the paths.
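For illustration, reading the environment variable directly works just as well. A minimal sketch: outputs_path below is a hypothetical wrapper, not a Polyaxon API (the real helper is get_outputs_path from polyaxon_helper).

```python
import os

# Polyaxon injects POLYAXON_OUTPUTS_PATH into each experiment's container.
# outputs_path is a hypothetical wrapper shown for illustration; the real
# helper is polyaxon_helper.get_outputs_path.
def outputs_path(default="/tmp/outputs"):
    return os.environ.get("POLYAXON_OUTPUTS_PATH", default)

# Build checkpoint file paths relative to the provided outputs directory.
checkpoint_file = os.path.join(outputs_path(), "model.ckpt")
```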

In this post, we will go over some strategies to save your work and resume or restart it.

Saving and checkpointing

Most frameworks provide ways to save your progress as you train your models. It’s up to the user to decide the frequency and the number of checkpoints to keep.
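Deciding how many checkpoints to keep is framework-agnostic housekeeping. As a sketch (not a Polyaxon API), you could prune old checkpoint files in the outputs directory, keeping only the most recent N:

```python
import os

# Keep only the `keep` most recent checkpoint files in a directory.
# Hypothetical helper for illustration; frameworks like TensorFlow's
# Estimator manage this for you via their own options.
def prune_checkpoints(directory, keep=3, suffix=".ckpt"):
    ckpts = [os.path.join(directory, f) for f in os.listdir(directory)
             if f.endswith(suffix)]
    ckpts.sort(key=os.path.getmtime)  # oldest first
    for old in ckpts[:-keep]:
        os.remove(old)
```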

Saving with Tensorflow

TensorFlow provides different ways to save and resume a checkpoint. The easiest is to use the Estimator API. The Estimator takes care of saving checkpoints automatically; you only need to specify the top-level directory in which the Estimator stores its information. This is done by assigning a value to the optional model_dir argument of any Estimator's constructor.

In the case of Polyaxon, you should assign the provided path POLYAXON_OUTPUTS_PATH, e.g.

import tensorflow as tf

from polyaxon_helper import get_outputs_path

classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir=get_outputs_path())

Saving with Keras

The Keras API provides callbacks; for checkpointing, the one to use is ModelCheckpoint.

from keras.callbacks import ModelCheckpoint

from polyaxon_helper import get_outputs_path

# ModelCheckpoint expects a file path, not just a directory.
checkpoint = ModelCheckpoint('{}/model.h5'.format(get_outputs_path()),
                             monitor='val_loss',
                             verbose=0,
                             save_best_only=True,
                             period=1,
                             mode='auto')

Saving with Pytorch

PyTorch also provides different approaches to save models:

import torch

from polyaxon_helper import get_outputs_path

# Save only the model's learned parameters.
model_path = '{}/checkpoint1.pth'.format(get_outputs_path())
torch.save(the_model.state_dict(), model_path)

or

import torch

from polyaxon_helper import get_outputs_path

# Save the entire model object.
model_path = '{}/checkpoint20.pth'.format(get_outputs_path())
torch.save(the_model, model_path)
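Beyond these two options, a common PyTorch pattern (not Polyaxon-specific) is to bundle the model state, the optimizer state, and the current epoch into one dictionary, so that training can later resume exactly where it stopped. A minimal sketch; the key names are conventions, not a PyTorch API:

```python
import torch

# Bundle everything needed to resume training into one file.
# The key names are conventions chosen for this sketch.
def save_checkpoint(model, optimizer, epoch, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)
```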

Resuming

Polyaxon makes it possible to resume the training of an already stopped experiment.

In order to resume an experiment, it should have some checkpoints to resume from.

You can also resume an experiment with an updated environment or some parameters.

When resuming an experiment, the outputs will be appended to the original experiment’s outputs.

Resuming with Tensorflow

If you used the Estimator API, you don’t need to do anything; the Estimator will take care of resuming your training from the latest checkpoint in model_dir. Of course, you can tweak your code to resume from a very specific state.

You just need to call:

$ polyaxon experiment -xp 23 resume

Resuming with Keras

To resume with Keras, your code must load the model’s weights before training continues:

from polyaxon_helper import get_outputs_path

# Load the weights saved by the ModelCheckpoint callback.
model_weights = '{}/model.h5'.format(get_outputs_path())
model.load_weights(model_weights)
...

And you need to call:

$ polyaxon experiment -xp 23 resume

Resuming with Pytorch

Depending on how you saved your model with PyTorch, i.e. the state dict or the entire model, you can resume training using the corresponding PyTorch load call:

import torch

from polyaxon_helper import get_outputs_path

# Rebuild the model, then load the saved parameters into it.
the_model = TheModelClass(*args, **kwargs)
model_path = '{}/checkpoint1.pth'.format(get_outputs_path())
the_model.load_state_dict(torch.load(model_path))

or

import torch

from polyaxon_helper import get_outputs_path

# Load the entire model object directly.
model_path = '{}/checkpoint20.pth'.format(get_outputs_path())
the_model = torch.load(model_path)

and you need to call:

$ polyaxon experiment -xp 23 resume
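If you saved a richer checkpoint dictionary bundling the model weights, the optimizer state, and the last finished epoch, resuming can restore all of them and continue from the recorded epoch. A sketch, assuming such a dictionary was written with torch.save; the key names are conventions, not a PyTorch or Polyaxon API:

```python
import torch

# Restore weights, optimizer state, and the epoch to resume from,
# assuming a checkpoint dict with these (conventional) key names.
def load_checkpoint(model, optimizer, path):
    state = torch.load(path)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1  # next epoch to train
```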

Resuming with updated code or parameters

To resume an experiment with latest code:

$ polyaxon experiment -xp 23 resume -u

To override the config of the experiment you wish to resume, create a polyaxonfile with the sections/params to override:

$ polyaxon experiment -xp 23 resume -f polyaxonfile_override.yml
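A minimal override file might look like the following. This is a sketch: the exact sections depend on your original polyaxonfile, and learning_rate is a hypothetical parameter name.

```yaml
version: 1

declarations:
  learning_rate: 0.001
```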

Restarting

Sometimes you don’t want to resume an experiment; instead, you wish to keep it intact and restart the training with the same or different code or parameters.

Polyaxon provides a way to do that:

$ polyaxon experiment -xp 23 restart

To restart an experiment with latest code:

$ polyaxon experiment -xp 23 restart -u

To override the config of the experiment you wish to restart, create a polyaxonfile with the sections/params to override:

$ polyaxon experiment -xp 23 restart -f polyaxonfile_override.yml

For example, you can restart an experiment on a GPU or with a different learning rate.

Copying

Another option that Polyaxon offers is to copy an experiment before restarting it. This is useful if the user wants to resume the training of an experiment with multiple updated versions of her code or parameters. What it does, basically, is copy all outputs from the original experiment to the new experiment.

$ polyaxon experiment -xp 23 restart --copy

To copy an experiment with latest code:

$ polyaxon experiment -xp 23 restart --copy -u

To override the config of the experiment you wish to copy, create a polyaxonfile with the sections/params to override:

$ polyaxon experiment -xp 23 restart --copy -f polyaxonfile_override.yml

Conclusion

You can find examples showing how to restart/resume experiments with these frameworks in our examples repo. And as always, let us know if you have any feedback on this and other features of Polyaxon.
