A Guide to Distributed TensorFlow: Part 2

How to set up large-scale distributed training of TensorFlow models using Kubeflow.

Roshan Thaikkat
Oct 16, 2020 · 7 min read

TL;DR

Distributed Training

Model initialization

Step-by-Step Kubeflow

Creating a StorageClass

Example of a storage class manifest.
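The manifest itself is not reproduced in this text. As a rough sketch only, a StorageClass manifest might look like the following. It assumes a cluster that provisions GCE persistent disks; the name `example-ssd`, the provisioner, and the disk type are illustrative assumptions, not the values used in the original setup.

```yaml
# Hypothetical StorageClass; every value below is an assumption for illustration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-ssd                 # hypothetical name
provisioner: kubernetes.io/gce-pd   # assumes GCE persistent disks; swap in your cluster's provisioner
parameters:
  type: pd-ssd                      # assumed disk type
reclaimPolicy: Delete               # delete the backing disk when the claim is released
volumeBindingMode: WaitForFirstConsumer  # delay provisioning until a pod is scheduled
```

Saved to a file, a manifest of this shape is what gets passed to `kubectl create -f`.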
kubectl create -f <manifest>.yaml

Preparing the Node Pool and Configuring Firewall Rules

Installing and Deploying Kubeflow

kubectl -n kubeflow get pods
Kubeflow pods up and running

Creating a TFJob

Example of a training manifest.
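The training manifest is likewise not reproduced in this text. A minimal sketch of a TFJob spec, assuming the `kubeflow.org/v1` API served by the Kubeflow training operator, could look like this. The job name, replica count, image placeholder, and script path are hypothetical and stand in for whatever the real training setup uses.

```yaml
# Hypothetical TFJob; names, counts, and paths are assumptions for illustration.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob               # hypothetical job name
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                   # assumed number of workers
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow      # the operator expects the training container to use this name
              image: <your-training-image>        # placeholder, not a real image
              command: ["python", "/opt/train.py"]  # hypothetical entry point
```

For each replica, the operator sets the `TF_CONFIG` environment variable, which is what lets a `tf.distribute` strategy inside the training script discover the other workers.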

Connecting the Pieces Together

kubectl -n kubeflow create -f <path_to_training_spec>.yaml
kubectl -n kubeflow logs <worker-id>

Training Statistics

Conclusion

When Machines Learn

Sharing research and insight into applying machine learning to industrial asset management.