Setting up a Kubeflow deployment on a Kubernetes cluster

After a long struggle with the configuration of Kubeflow, I decided to write a short tutorial on how to create a Kubernetes cluster and deploy Kubeflow on it. In this tutorial, I will show how to deploy a Kubernetes cluster using Kubeadm with Flannel as the pod network, how to create a Kubeflow deployment, and how to create the persistent volumes and the load balancer needed for the Kubeflow deployment to succeed.

  • Disclaimer: Because Kubernetes and Kubeflow are open-source projects, their functionality and documentation can change significantly in a short time. For this tutorial, I am using Kubernetes version 1.14. Please check the official documentation for the validity of the commands.

What you need

Hardware:

  • One machine running an Ubuntu OS
  • 12 GB of RAM or more
  • 2 CPUs or more
  • At least 50 GB of free storage under the root (/) partition

Software:

  • Docker
  • Kubernetes version 1.14
  • Kubeadm version 1.14

Setup

Kubernetes runs its workloads in containers, so the first thing we need is a container runtime. Any application that manages containers should work, but to stay on the safe side we will use Docker.

For installation it’s best to follow the Docker website documentation: https://docs.docker.com/install/linux/docker-ce/ubuntu/

  1. First, we will update our repository:
$ sudo apt-get update

2. Afterward, install the packages that allow apt to use a repository over HTTPS:

$ sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common

3. Then add Docker's official GPG key:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

4. Then we should verify if everything worked as planned:

$ sudo apt-key fingerprint 0EBFCD88

5. The output you are aiming for should be similar to this:

pub   rsa4096 2017-02-22 [SCEA]
      9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
uid           [ unknown] Docker Release (CE deb) <docker@docker.com>
sub   rsa4096 2017-02-22 [S]

6. Afterward, add the Docker repository:

$ sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"

Then update the package index and install Docker Engine:

$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io

7. Test the installation:

$ sudo docker run hello-world

The output should contain a "Hello from Docker!" message confirming that the installation works correctly.

Now that we have Docker to manage our containers, it is time to install Kubernetes and Kubeadm.

8. Before we install Kubernetes we need to disable swap, otherwise the installation will not work:

$ sudo swapoff -a

ATTENTION! Swap will be reactivated after a reboot, and that will block the cluster from starting. If you don't want to disable swap permanently, simply rerun the command above after each reboot. Otherwise, see the sketch below for one way to disable it permanently!
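If you decide to disable swap permanently, one common approach (my own addition, not part of the original steps, so check your /etc/fstab first) is to comment out the swap entry so it is not mounted at boot:

$ sudo sed -i '/ swap / s/^/#/' /etc/fstab

This assumes the swap line in /etc/fstab contains the word swap surrounded by spaces; verify the result with cat /etc/fstab.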

9. Run the following command to install Kubernetes v1.14:

$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add - && \
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list && \
sudo apt-get update -q && \
sudo apt-get install -qy kubelet=1.14.8-00 kubectl=1.14.8-00 kubeadm=1.14.8-00

This command adds the signing key for the Kubernetes repository, updates the package lists, and finally installs Kubernetes version 1.14. You can replace the version in the command with any released Kubernetes version, depending on your needs.

10. To test the installation, run:

$ sudo kubectl version
$ sudo kubeadm version

Now that we have installed the software, we can move on to creating our cluster!

Cluster creation

As stated at the beginning, we will use Flannel as the pod network, since it has good compatibility. You can find more information on the available pod network add-ons here.

  1. First, we will initialize our cluster using Kubeadm:
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16

The network range (CIDR) in the command is the one the Flannel pod network expects by default.
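When kubeadm init finishes, it also prints a kubeadm join command containing a token and a certificate hash. You only need it if you later want to add worker nodes; for the single-machine setup in this tutorial it can be ignored. The printed line looks roughly like this (the values below are placeholders, your output will differ):

$ kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>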

2. Afterward, we need to create the configuration file:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

3. We will set the bridge-nf-call-iptables value to 1, which is a mandatory condition for Flannel:

$ sudo sysctl net.bridge.bridge-nf-call-iptables=1
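Note that this sysctl setting is not persistent across reboots by default. If you want to keep it after a restart (an optional step, not in the original tutorial), you can store it in a sysctl configuration file, for example:

$ echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/k8s.conf
$ sudo sysctl --system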

4. Finally, we can install Flannel using the command:

$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml

5. By default, your cluster will not schedule pods on the control-plane node for security reasons. In order to install Kubeflow we need to change this, so we need to run the following command:

$ kubectl taint nodes --all node-role.kubernetes.io/master-

6. In order to verify the functionality, run:

$ sudo kubectl get all -A

7. Afterwards, run:

$ sudo kubectl get nodes

The output should show the name of your node with the status "Ready".
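For a single-node cluster the output should look roughly like this (the node name, age and exact patch version will of course differ on your machine):

NAME     STATUS   ROLES    AGE   VERSION
mynode   Ready    master   3m    v1.14.8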

If you encounter errors, check the Troubleshooting section at the end, where I documented some of the errors I ran into.

Now that we have our cluster up and running it is time to install Kubeflow.

8. Before we install Kubeflow, we need to create persistent volumes. In order to do so, open an empty file and paste the following lines:

apiVersion: v1
kind: Namespace
metadata:
  name: local-path-storage
---
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-path-provisioner-service-account
  namespace: local-path-storage
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: local-path-provisioner-role
rules:
- apiGroups: [""]
  resources: ["nodes", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["endpoints", "persistentvolumes", "pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-path-provisioner-bind
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: local-path-provisioner-role
subjects:
- kind: ServiceAccount
  name: local-path-provisioner-service-account
  namespace: local-path-storage
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-path-provisioner
  namespace: local-path-storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: local-path-provisioner
  template:
    metadata:
      labels:
        app: local-path-provisioner
    spec:
      serviceAccountName: local-path-provisioner-service-account
      containers:
      - name: local-path-provisioner
        image: rancher/local-path-provisioner:v0.0.11
        imagePullPolicy: IfNotPresent
        command:
        - local-path-provisioner
        - --debug
        - start
        - --config
        - /etc/config/config.json
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config/
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumes:
      - name: config-volume
        configMap:
          name: local-path-config
      tolerations:
      - operator: "Exists"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: local-path-config
  namespace: local-path-storage
data:
  config.json: |-
    {
      "nodePathMap":[
        {
          "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
          "paths":["/opt/local-path-provisioner"]
        }
      ]
    }
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume1
spec:
  storageClassName: local-path
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv1"
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume2
spec:
  storageClassName: local-path
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv2"
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume3
spec:
  storageClassName: local-path
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv3"
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume4
spec:
  storageClassName: local-path
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv4"
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume5
spec:
  storageClassName: local-path
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv5"
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume6
spec:
  storageClassName: local-path
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/pv6"

9. Save the file with the name kubeflow_persistent_volume_setup.yaml.

10. In order to apply the script, we first need to make sure that our application has permission to create these storage volumes. Run the following command to do so:

$ sudo chmod 666 /mnt
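Depending on your container runtime, the hostPath directories referenced by the persistent volumes may not be created automatically. If you want to create them up front (an optional precaution, not part of the original steps), you can run:

$ sudo mkdir -p /mnt/pv1 /mnt/pv2 /mnt/pv3 /mnt/pv4 /mnt/pv5 /mnt/pv6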

11. In order to run the script, use the command:

$ sudo kubectl apply -f kubeflow_persistent_volume_setup.yaml

12. Verify that the storage class was created with the following command:

$ sudo kubectl get storageclass -A
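You can also check that the six persistent volumes defined above were created (they typically show up as Available until a claim binds to them):

$ sudo kubectl get pv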

13. If you followed the tutorial step by step, you should now have all the infrastructure needed to install Kubeflow.

Kubeflow installation

The following steps are based on the official Kubeflow installation guide.

We will install a vanilla version of Kubeflow on an existing Linux cluster.

  1. First, we need to download the kfctl v0.7.0 release; you can find the archive on the Kubeflow releases page on GitHub.

2. Unpack the archive:

$ tar -xvf kfctl_v0.7.0_<platform>.tar.gz

3. Open a terminal in the folder where you unpacked the binary and run the command:

$ export PATH=$PATH:$PWD

This should allow you to run the kfctl command from any folder. If it does not, use the full path to the binary instead (e.g. /path/to/command/kfctl).
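To confirm that the binary is reachable, you can print its version (the exact output format may vary between releases):

$ kfctl version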

4. Next, run the following lines:

$ export KF_NAME=kf-cluster
$ export BASE_DIR=$HOME
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}
  • ${KF_NAME} — The name of your Kubeflow deployment. If you want a custom deployment name, specify that name here. For example, my-kubeflow or kf-test. The value of KF_NAME must only contain lower case alphanumeric characters and ‘-’(dashes), and must start and end with an alphanumeric character. The value of this variable cannot be greater than 25 characters. It must only contain a name, not a whole directory path. You will also use this value as a directory name when creating the directory where your Kubeflow configurations are stored (the Kubeflow application directory).
  • ${KF_DIR} — The full path to your Kubeflow application directory.

5. In ${KF_DIR}, make a new file and paste the following lines:

apiVersion: kfdef.apps.kubeflow.org/v1beta1
kind: KfDef
metadata:
  creationTimestamp: null
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-crds
    name: istio-crds
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-install
    name: istio-install
  - kustomizeConfig:
      parameters:
      - name: clusterRbacConfig
        value: "OFF"
      repoRef:
        name: manifests
        path: istio/istio
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/application-crds
    name: application-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: application/application
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller
    name: metacontroller
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: argo
    name: argo
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kubeflow-roles
    name: kubeflow-roles
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: common/centraldashboard
    name: centraldashboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap
    name: bootstrap
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: admission-webhook/webhook
    name: webhook
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: jupyter/jupyter-web-app
    name: jupyter-web-app
  - kustomizeConfig:
      overlays:
      - istio
      repoRef:
        name: manifests
        path: metadata
    name: metadata
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: jupyter/notebook-controller
    name: notebook-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-job-crds
    name: pytorch-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-operator
    name: pytorch-operator
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: namespace
        value: knative-serving
      repoRef:
        name: manifests
        path: knative/knative-serving-crds
    name: knative-crds
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: namespace
        value: knative-serving
      repoRef:
        name: manifests
        path: knative/knative-serving-install
    name: knative-install
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: kfserving/kfserving-crds
    name: kfserving-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: kfserving/kfserving-install
    name: kfserving-install
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: usageId
        value: <randomly-generated-id>
      - name: reportUsage
        value: "true"
      repoRef:
        name: manifests
        path: common/spartakus
    name: spartakus
  - kustomizeConfig:
      overlays:
      - istio
      repoRef:
        name: manifests
        path: tensorboard
    name: tensorboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-crds
    name: tf-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-operator
    name: tf-job-operator
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: katib/katib-crds
    name: katib-crds
  - kustomizeConfig:
      overlays:
      - application
      - istio
      repoRef:
        name: manifests
        path: katib/katib-controller
    name: katib-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/api-service
    name: api-service
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: minioPvcName
        value: minio-pv-claim
      repoRef:
        name: manifests
        path: pipeline/minio
    name: minio
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: mysqlPvcName
        value: mysql-pv-claim
      repoRef:
        name: manifests
        path: pipeline/mysql
    name: mysql
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/persistent-agent
    name: persistent-agent
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-runner
    name: pipelines-runner
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-ui
    name: pipelines-ui
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-viewer
    name: pipelines-viewer
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/scheduledworkflow
    name: scheduledworkflow
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipeline-visualization-service
    name: pipeline-visualization-service
  - kustomizeConfig:
      overlays:
      - application
      - istio
      parameters:
      - name: admin
        value: johnDoe@acme.com
      repoRef:
        name: manifests
        path: profiles
    name: profiles
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: seldon/seldon-core-operator
    name: seldon-core-operator
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz
  version: master
status: {}

6. Save the .yaml file with the name "kfctl_k8s_istio.yaml".

7. Next, apply the .yaml file using the command:

$ kfctl apply -V -f kfctl_k8s_istio.yaml

The deployment should start shortly. This might take a few minutes.
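You can follow the progress by watching the pods in the kubeflow namespace; the deployment is finished when all of them reach the Running state:

$ kubectl get pods -n kubeflow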

8. After the deployment is successful, we need a Load Balancer to expose our IP. In order to enable the Load Balancer, we need to change the configuration file for the Istio gateway.

Open the following file:

${KF_DIR}/kustomize/istio-install/base/istio-noauth.yaml

Search for the "NodePort" service type and replace it with "LoadBalancer".
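If you prefer to make the change from the command line, a quick way to do it (assuming type: NodePort appears in that file only for the ingress gateway service) is:

$ sed -i 's/type: NodePort/type: LoadBalancer/' ${KF_DIR}/kustomize/istio-install/base/istio-noauth.yaml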

Next, it is time to install the Load Balancer; we will use MetalLB. Run the following command:

$ kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.1/manifests/metallb.yaml

9. Insert the configurations:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - xxx.xxx.xxx.xxx-xxx.xxx.xxx.xxx
EOF

In the address field, add the IP range that suits your network and IP configuration.
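For example, on a typical home network the address pool could look like this (a hypothetical range; make sure it lies outside your router's DHCP pool):

addresses:
- 192.168.1.240-192.168.1.250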

10. To finalize the installation run the following commands:

$ kubectl patch service -n istio-system istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'
$ kubectl get svc -n istio-system istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0]}'

11. To verify the installation run:

$ sudo kubectl get svc -A

The istio-ingressgateway service should now show an external IP from the range you configured.

Deleting Kubeflow

cd ${KF_DIR}
# If you want to delete all the resources, run:
kfctl delete -f ${CONFIG_FILE}

Here, ${CONFIG_FILE} is the path to the kfctl_k8s_istio.yaml file created earlier.

Deleting the Master Node and Cluster

First, to check the name of the master node, run:

sudo kubectl get nodes

For node deletion, run the following lines:

sudo kubectl drain <node name> --delete-local-data --force --ignore-daemonsets
sudo kubectl delete node <node name>

For the cluster reset, run the following lines:

sudo kubeadm reset

For deleting the configuration file, run the following lines:

sudo rm -r .kube

For resetting the iptables rules, run the following line:

sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X

Conclusion

Kubernetes and Kubeflow can open a new perspective on automated deployment. After completing this tutorial, you should be able to start experimenting with the power of Kubernetes and Kubeflow.

I will also keep updating the troubleshooting section as I encounter issues on different setups.

Troubleshooting

In this process, you can encounter a lot of issues. I will list some of them and I will keep updating this list as more issues come up.

First of all, the following command is a must when troubleshooting. This will explain what happened with every pod and why it failed.

sudo kubectl describe pods -A

Connection to localhost refused

This is a common error. It can occur if you didn't initialize the .kube config file. In order to fix it, you need to restart the cluster and make sure you run the commands from Cluster Creation step 2.

This error can also occur if you restarted the machine and forgot to disable swap.

Also, if you restarted the machine and disabled swap but the error still persists, it might be because the container has not been created yet. After a few minutes (10-15 at most), the error should resolve itself.

In rare cases, this error can occur due to changes in the IP address. For this, you need to monitor whether the network settings change. In that case, it is best to set a static IP.

The installation of Kubeflow does not finish properly.

I encountered this error for two reasons:

  • There was not enough space in the root folder
  • Not having the latest release of the kfctl command.