ML Design Pattern #3: Virtual Epochs

Base machine learning model training and evaluation on the total number of examples shown to the model, not on epochs or steps

The conventional way to write a Keras training loop is in terms of epochs:

model.fit(X_train, y_train,
          batch_size=100,
          epochs=15)

Checkpointing works the same way, with the training length still expressed in epochs:

cp_callback = tf.keras.callbacks.ModelCheckpoint(...)
history = model.fit(trainds,
                    validation_data=evalds,
                    epochs=15,
                    # trainds is a tf.data.Dataset, so batching (here, 128)
                    # happens in the input pipeline, not in fit()
                    callbacks=[cp_callback])

Epochs are problematic

However, using epochs is a bad idea. Epochs may be easy to understand, but relying on them causes problems in real-world ML models. To see why, imagine that you have a training dataset with 1 million examples. It can be tempting to simply go through this dataset 15 times (for example) by setting the number of epochs to 15. There are several problems with this, though:

  • The number of epochs is an integer, but the difference in training time between processing the dataset 14.3 times and 15 times can be hours. If the model has converged after having seen 14.3 million examples, you might want to exit and not waste the computational resources necessary to process 0.7 million more examples.
  • You checkpoint once per epoch, and waiting 1 million examples between checkpoints might be far too long. For resilience, you might want to checkpoint more often.
  • Datasets grow over time. If you get 100,000 more examples, train the model, and get a higher error, is it because you need to do an early stop, or is the new data corrupt in some way? You can’t tell, because the prior training was on 15 million examples and the new one is on 16.5 million examples (see the quick calculation below).
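To make the first and third points concrete, here is the arithmetic in plain Python (using the numbers from this example):

dataset_size = 1_000_000
epochs = 15
examples_seen = epochs * dataset_size     # 15,000,000
converged_after = 14_300_000              # where the model actually converged
wasted = examples_seen - converged_after  # 700,000 examples of wasted compute

# After 100,000 new examples arrive, the same "15 epochs" means something else:
new_examples_seen = epochs * 1_100_000    # 16,500,000 -- curves no longer comparable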

Fixing the number of steps is not the answer

What if you fix the number of steps?

NUM_STEPS = 143000
BATCH_SIZE = 100
NUM_CHECKPOINTS = 15

trainds = trainds.repeat()  # repeat indefinitely; fit() stops via steps_per_epoch

cp_callback = tf.keras.callbacks.ModelCheckpoint(...)
history = model.fit(trainds,
                    validation_data=evalds,
                    epochs=NUM_CHECKPOINTS,
                    steps_per_epoch=NUM_STEPS // NUM_CHECKPOINTS,
                    # BATCH_SIZE is baked into trainds: 143,000 steps of 100
                    # examples each works out to the 14.3 million examples above
                    callbacks=[cp_callback])

This gets you checkpoints at reasonable intervals and stops after 14.3 million examples, but the number of steps is an indirect measure: halve the batch size and the same step count covers half as much data, and when the dataset grows, a fixed step budget means each example is seen proportionally less often. The quantity you actually want to control is the number of training examples the model sees.
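Both dataset-based snippets assume trainds is a tf.data.Dataset that is already batched. As a minimal sketch (the helper name and the shuffle buffer are illustrative, not from the original), it might be built like this:

import tensorflow as tf

def make_dataset(features, labels, batch_size):
    # Any tf.data source works; what matters is that batching happens
    # here, in the input pipeline, rather than in fit().
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(10_000).batch(batch_size)

trainds = make_dataset(X_train, y_train, batch_size=100)
# The snippets above then call trainds.repeat() so that fit() controls
# stopping through steps_per_epoch instead of exhausting the data.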

Virtual epochs

The answer is to keep the total number of training examples shown to the model (not the number of steps) constant:

NUM_TRAINING_EXAMPLES = 1000 * 1000
STOP_POINT = 14.3  # number of virtual epochs: passes over the original dataset
TOTAL_TRAINING_EXAMPLES = int(STOP_POINT * NUM_TRAINING_EXAMPLES)
BATCH_SIZE = 100
NUM_CHECKPOINTS = 15
steps_per_epoch = (TOTAL_TRAINING_EXAMPLES //
                   (BATCH_SIZE * NUM_CHECKPOINTS))

trainds = trainds.repeat()  # as before: batched in the pipeline, repeated indefinitely

cp_callback = tf.keras.callbacks.ModelCheckpoint(...)
history = model.fit(trainds,
                    validation_data=evalds,
                    epochs=NUM_CHECKPOINTS,
                    steps_per_epoch=steps_per_epoch,
                    callbacks=[cp_callback])

When the dataset grows, you update NUM_TRAINING_EXAMPLES (and, if convergence behavior changes, STOP_POINT); the checkpoint cadence and the meaning of the training curve stay comparable from run to run.
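A small hypothetical helper (the function name is illustrative, not from the original) makes the arithmetic explicit:

def virtual_epoch_steps(total_examples, batch_size, num_checkpoints):
    # Steps per virtual epoch so the model sees total_examples in all,
    # checkpointing num_checkpoints times along the way.
    return total_examples // (batch_size * num_checkpoints)

print(virtual_epoch_steps(14_300_000, 100, 15))  # 9533
print(virtual_epoch_steps(16_500_000, 100, 15))  # 11000: more data, same cadence

Only the total number of examples changes between the two calls; the batch size and number of checkpoints, and therefore the comparability of runs, are untouched.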
