IBM’s Deep Learning Service: Terms and Definitions
A list of key terms and definitions to help you get started with IBM’s Deep Learning Service.
What is IBM’s Deep Learning service in Watson Studio?
IBM’s Deep Learning service in Watson Studio is a platform for training deep learning models. It lets you use popular frameworks like TensorFlow, Caffe, and PyTorch to train neural network models on on-demand GPU compute instances.
If it’s your first time using this service, you can get set up by following a step-by-step tutorial.
Additional Resources:
Behind IBM’s DL Service in Watson Studio
Key Terms and Definitions:
- IBM COS Bucket
- Model Definition (.zip file)
- Training-definition
- Training-run
- Experiment
- Hyperparameter Optimization Experiment
- Experiment-run
IBM COS Bucket: A container for data on an IBM Cloud Object Storage (COS) instance.
A Cloud Object Storage (COS) instance can contain several uniquely named buckets.

You can upload files to your bucket and access them during training.
You can explore the contents of a bucket through the command line:
$ bucket_name=<your_bucket_name>
$ bxaws s3 ls s3://$bucket_name/
Model Definition: A zip file containing the files that will be run during training, such as the defined neural network model and any additional programs you require.

Training-definition: An entity that stores metadata about how a model needs to be trained, and has a uniquely-generated training-definition-id.
Training-definitions are created based on a manifest file, which allows you to specify a training’s requirements and configure an individual training session:
You also need to specify the COS bucket you want to upload your model to (line 19). For example, we specified our bucket as samplebucketone and the execution command as python3 mnist_classifier.py (line 10). We also specified that we’re using TensorFlow version 1.5 (lines 5–6) and Python version 3.5 (lines 7–8).
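As a rough, illustrative sketch of how those manifest lines fit together (the key names below are assumptions based on our example, not the authoritative schema — consult the service documentation for the exact format):

```yaml
# Illustrative training-definition manifest sketch -- field names are
# assumptions, not the authoritative DLaaS schema.
model_definition:
  framework:
    name: tensorflow                       # lines 5-6: framework and version
    version: "1.5"
    runtimes:
      name: python                         # lines 7-8: Python version
      version: "3.5"
  execution:
    command: python3 mnist_classifier.py   # line 10: execution command
training_results_reference:
  target:
    bucket: samplebucketone                # line 19: bucket for results
```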
Training-run: An instance of an executed training-definition, with a uniquely-generated training-run ID.
When a training is launched, a folder under the name of the training-run ID is created in the specified COS bucket.
This folder contains the model definition (model.zip), the standard output of the training (training-log.txt), the results of the training runs, and more.

The path,
$ s3://$bucket_name/<training-run-id>
is saved under the environment variable RESULT_DIR during training.
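Inside your training code, you can read RESULT_DIR to write artifacts next to the other run outputs. A minimal sketch (the local fallback to the current directory is our own addition for testing outside the service):

```python
import os


def results_path(filename):
    """Build a path inside the current training run's results directory.

    During a DLaaS training run, the RESULT_DIR environment variable
    points at s3://<bucket>/<training-run-id>. The "." fallback below
    is our own addition so the code also runs locally.
    """
    result_dir = os.environ.get("RESULT_DIR", ".")
    return os.path.join(result_dir, filename)


# e.g. place the training log alongside the other run artifacts
log_path = results_path("training-log.txt")
```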
In our example, a training run’s data would be stored in
$ s3://samplebucketone/<training-run-id>
Experiment: An entity that stores metadata about how a group of training-definitions needs to be trained, and has a uniquely-generated experiment-id.
An experiment manifest file allows you to specify the training-definition you will be launching (line 10) and choose your GPU usage (line 13).
HPO Experiment: An HPO experiment is a special kind of experiment that includes Hyperparameter Optimization instructions in its manifest file.
With the specified HPO algorithm, an objective to minimize or maximize, and hyperparameter ranges, an HPO experiment launches training-runs based on the training-definition, where each run has a different set of hyperparameters.
This manifest file will trigger several training-runs and the creation of a file named config.json.
config.json contains the hyperparameter values chosen by the HPO algorithm, and we can read this file to set the hyperparameter values, e.g. lam, in our code.
So, in the manifest file, we’re performing a random search (line 16), our objective is error_val (line 18), we want to minimize error_val (line 21), and we rely on specific ranges/values of lam (lines 25–26).
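A rough sketch of how that HPO section of the manifest might look (again, the key names are assumptions drawn from our example, not the authoritative schema — see the DLaaS HPO docs for the exact format):

```yaml
# Illustrative HPO-experiment manifest sketch -- key names are
# assumptions, not the authoritative DLaaS schema.
hyper_parameters_optimization:
  method:
    name: random                     # line 16: the search strategy
  objective: error_val               # line 18: metric to optimize
  maximize_or_minimize: minimize     # line 21: direction of optimization
  hyper_parameters:
    - name: lam                      # lines 25-26: values to sample from
      values: [0.0001, 0.001, 0.01]
```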
There will be an overall training-run ID related to the HPO experiment, and sub-IDs related to the individual runs of our model, where each sub-ID has a different set of hyperparameters.

For example, the GUID training-Zlo4wHhiR_0 is the ID for the first training-run, when lam=0.0001, and training-Zlo4wHhiR_1 is the ID for the second, when lam=0.001.
When an HPO experiment is run, the structure of our RESULT_DIR (s3://$bucket_name/<training-run-id>) is slightly different too.
The RESULT_DIR is created based on a single training-run-ID for the overall HPO experiment.
In our bucket, RESULT_DIR (i.e. s3://$bucket_name/training-Zlo4wHhiR) contains sub-directories (0/, 1/, ...), each of which refers to one of the sub-training-runs that were launched (i.e. training-Zlo4wHhiR_0, training-Zlo4wHhiR_1).

The results of each sub-run will be stored in the file <SUB_ID_INDEX>/val_dict_list.json.

After all sub-training-runs have completed, the HPO algorithm compares the results across each of the sub-directories’ val_dict_list.json files. A result.json is then created in the main RESULT_DIR, with the optimal value for lam.

For a more detailed explanation of the items in the HPO-experiment manifest file, you can check out the DLaaS HPO Docs.
Experiment-run: An instance of an experiment that generates its respective training-runs, and has a uniquely-generated experiment-run ID.

