Deploying Large, Data-Intensive AI Applications Using Kubernetes on IBM Cloud

Falk Pollok
AI Platforms Research
Nov 30, 2018 · 7 min read

Arunima Chaudhary, Falk Pollok, Hendrik Strobelt, Daniel Weidele

Kubernetes for Hosting Applications

The intent of this tutorial is to enable you to create applications that expose APIs and UIs on top of large data backends in custom formats (such as HDF5 or BAM) and deploy them through Kubernetes with Cloud Object Storage. The simple application you will host on your Kubernetes cluster is a CIFAR10 image browser that shows, for each instance, whether your model's classification matches the ground truth. Here is a preview of the UI:

The sample application exposes a REST API that takes in image IDs and provides the image pixel data, ground truth label and prediction label. We use the CIFAR10 image dataset and the predictions are provided by a LeNet model trained using Watson Machine Learning.

Here is an architecture diagram of the application internals:

The following text describes how to deploy a sample application to a Kubernetes cluster.

Step 1: Download CIFAR10 to your machine and convert it to the HDF5 format

cd tutorial
python3 download_cifar.py
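
Assuming the script writes its output under tutorial/data/cifar10_hdf5 (the directory referenced later when uploading to Cloud Object Storage), a quick sanity check after it finishes:

# From the repository root: list the converted HDF5 files
# (output directory assumed from the upload command used later in this tutorial)
ls -lh tutorial/data/cifar10_hdf5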

You can now either proceed to Step 2 or, optionally, follow the instructions below, which are based on a Watson Studio tutorial, to train the model on Watson Machine Learning.

Step 1.1: Create a Watson Machine Learning Instance

Step 1.2: Create Data and Results Buckets
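
The buckets can be created through the IBM Cloud console or, if you have configured the AWS CLI with HMAC credentials (the same profile used in the next step), from the command line; the bucket names below are placeholders:

# Create the data and results buckets (names are placeholders)
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 mb s3://<data-bucket-name>
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 mb s3://<results-bucket-name>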

Step 1.3: Upload CIFAR10 to the Data Bucket

$ aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp cifar10/ s3://<bucket-name> --recursive

Step 1.4: Prepare a Watson Machine Learning Manifest

model_definition:
  framework:
    #framework name and version (supported list of frameworks available at 'bx ml list frameworks')
    name: pytorch
    version: 0.3
  #name of the training-run
  name: cifar10 in pytorch
  #Author name and email
  author:
    name: John Doe
    email: johndoe@in.ibm.com
  description: This is running cifar training on multiple models
  execution:
    #Command to execute -- see script parameters in later section !!
    command: python3 main.py --cifar_path ${DATA_DIR}
      --checkpoint_path ${RESULT_DIR} --epochs 10
    compute_configuration:
      #Valid values for name - k80/k80x2/k80x4/p100/p100x2/v100/v100x2
      name: k80
training_data_reference:
  name: training_data_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  source:
    bucket: < data bucket name >
  type: s3
training_results_reference:
  name: training_results_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  target:
    bucket: < results bucket name >
  type: s3

Step 1.5: Compress the Code

$ zip model.zip main.py utils.py models/*

Step 1.6: Train the Model in the Cloud

$ ibmcloud ml train model.zip manifest.yml

Step 1.7: Download model.ckpt from the Results Bucket and Place it in the Project’s Root Directory
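
The exact path of the checkpoint inside the results bucket depends on the training run, so list the bucket first; the copy command below uses a placeholder path:

# Locate the checkpoint in the results bucket (layout depends on the training run)
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 ls s3://<results-bucket-name> --recursive
# Copy it to the project's root directory (<path-to> is a placeholder)
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp s3://<results-bucket-name>/<path-to>/model.ckpt .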

Step 1.8: Please Proceed with Steps 3 & 4

Step 2: Set up and activate the conda environment

Use Anaconda to create a virtual environment, here named k8st.

conda env create -f environment.yml
source activate k8st

Step 3: Run the application

Now you should be able to run the Flask server.

python server.py

Step 4: Test the application

Test your application at http://localhost:5001/images/?ids=0,1,2,3,4,5 which should return a JSON object.
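
For example, a quick check from the command line (the response contains the fields described earlier: image pixel data, ground truth label and prediction label; exact field names depend on the server implementation):

curl "http://localhost:5001/images/?ids=0,1,2,3,4,5"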

You can also view the results through a UI at http://localhost:5001 .

Dockerize your code

To host the app on Kubernetes, it has to be containerized, i.e. bundled into a Docker image. For this part of the tutorial you need to have Docker installed on your machine.

The Dockerfile describes how to build such an image. Once built, the image has to be pushed to a registry from which Kubernetes can pull it.

Step 1: Create Dockerfile

A Dockerfile can look like this:

FROM continuumio/miniconda3

# Update package lists
RUN apt-get -y update
RUN apt-get -y upgrade
RUN apt-get -y install s3fs

WORKDIR /usr/app

# Build conda environment from a temporary directory, then clean it up
RUN mkdir tmp
COPY tutorial/environment.yml tmp/
RUN conda env create -f tmp/environment.yml
RUN rm -rf tmp

# Create data dir
RUN mkdir data

# Copy secret file
COPY .passwd-s3fs .
RUN chmod 600 .passwd-s3fs

# Copy full tutorial code
COPY tutorial .

# Run instructions
CMD ["source activate k8st && exec python3 server.py"]
ENTRYPOINT ["/bin/bash", "-c"]
EXPOSE 5001

Step 2: Create a .passwd-s3fs file to store credentials for the Cloud Object Storage instance

Step 2.1: Create a Cloud Object Storage Instance

Step 2.2: Get the access credentials for the Cloud Object Storage instance

Step 2.3: Create a .passwd-s3fs file to store the access credentials obtained in Step 2.2

<aws_access_key_id>:<aws_secret_access_key>

Step 3: Build & Test the Container

From the main directory execute

docker build -t k8-tut/tutorial .

After the image is built, it can be tested. The following command mounts the tutorial/data directory into the container, runs it, and maps port 5001 from the container to the local machine:

docker run -it -v "${PWD}/tutorial/data:/usr/app/data" -p "5001:5001" k8-tut/tutorial

Step 4: Create Namespaces and Upload Docker Container to Registry

Step 4.1: Install the IBM Cloud CLI

Step 4.2: Install the Container Registry plug-in

ibmcloud plugin install container-registry -r Bluemix

Step 4.3: Create the Namespace and Upload Docker Container

# Log in to your IBM Cloud account (IBMers add --sso)
ibmcloud login -a https://api.us-east.bluemix.net
# Create namespace, e.g. "k8-tut"
ibmcloud cr namespace-add k8-tut
# Tag the docker image
docker tag k8-tut/tutorial registry.ng.bluemix.net/k8-tut/tutorial
# Push the image
docker push registry.ng.bluemix.net/k8-tut/tutorial
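
The push assumes your local Docker client is already authenticated against the IBM Cloud Container Registry; if it is rejected with an authentication error, log in first:

# Authenticate the local Docker client with the IBM Cloud Container Registry
ibmcloud cr login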

Getting Access to Kubernetes Cluster

Contact your resource administrator to make sure you have a Kubernetes cluster with admin access.

The following steps set up the kubectl CLI to work with your Kubernetes cluster. The sequence of commands is replicated from the Access section of a specific Kubernetes cluster on the IBM Cloud.

Target the IBM Cloud Container Service region in which you want to work:

ibmcloud cs region-set <cluster-region>

Run the command to download the configuration files and set the Kubernetes environment accordingly:

ibmcloud cs cluster-config <cluster-name>

Set the KUBECONFIG environment variable. Copy the output from the previous command and paste it in your terminal. The command output should look similar to the following:

export KUBECONFIG=/Users/$USER/.bluemix/plugins/container-service/clusters/<cluster-name>/<kube-config.yml>

Verify that you can connect to your cluster and have admin access by listing your worker nodes:

kubectl get nodes

COS Driver Setup

Install kubernetes-helm and run the following commands to set up the COS driver:

helm repo add stage https://registry.stage1.ng.bluemix.net/helm/ibm
helm repo update
helm fetch --untar stage/ibmcloud-object-storage-plugin
helm plugin install ibmcloud-object-storage-plugin/helm-ibmc
helm init
kubectl get pod -n kube-system | grep tiller
# Check until state is Running
helm ibmc install stage/ibmcloud-object-storage-plugin -f ibmcloud-object-storage-plugin/ibm/values.yaml

You can then list the new storage classes:

kubectl get sc

Cloud Object Storage

Use the following steps to create a Persistent Volume Claim (PVC) for Cloud Object Storage.

Step 1: Obtain the Base64 encoded credentials…

echo -n "<AWS_ACCESS_KEY>" | base64
echo -n "<AWS_SECRET_ACCESS_KEY>" | base64

…and create a secret with them:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: ibm/ibmc-s3fs
metadata:
  name: test-secret
  namespace: default
data:
  access-key: <base64-encoded AWS_ACCESS_KEY>
  secret-key: <base64-encoded AWS_SECRET_ACCESS_KEY>
EOF

Step 2: Upload cifar10_hdf5 files to a separate bucket:

aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp tutorial/data/cifar10_hdf5/ s3://<bucket-name>/cifar10_hdf5 --recursive

Step 3: Request the PVC

Replace <bucket-name> with the bucket name you chose in Step 2 and run the entire command. You can change the size of the request from 10Gi to any desired value.

kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: cos-pvc
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: "ibmc-s3fs-standard"
    ibm.io/auto-create-bucket: "false"
    ibm.io/auto-delete-bucket: "false"
    ibm.io/bucket: "<bucket-name>"
    ibm.io/endpoint: "https://s3-api.us-geo.objectstorage.softlayer.net"
    ibm.io/region: "us-standard"
    ibm.io/secret-name: "test-secret"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
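
Before moving on, it is worth checking that the claim actually binds (the STATUS column should eventually show Bound):

kubectl get pvc cos-pvc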

Create the Deployment

The Secret and PVC created in the previous steps can now be referenced by the deployment.

kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tut-deploy
  labels:
    app: tut-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tut-deploy
  template:
    metadata:
      labels:
        app: tut-deploy
    spec:
      containers:
      - name: tut-deploy
        image: registry.ng.bluemix.net/k8-tut/tutorial
        ports:
        - containerPort: 5001
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/usr/app/data"
          name: s3fs-test-volume
      volumes:
      - name: s3fs-test-volume
        persistentVolumeClaim:
          claimName: cos-pvc
EOF
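
You can follow the rollout and confirm that the pod starts up with the Cloud Object Storage volume mounted:

kubectl rollout status deployment/tut-deploy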

Scale your application

kubectl apply -f - <<EOF
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tut-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: tut-deploy
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
EOF
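
Note that the autoscaler needs cluster metrics to be available before it can act on CPU utilization; its current state can be inspected with:

kubectl get hpa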

Expose the service

Paid Cluster: Expose the service using an External IP and Loadbalancer

$ kubectl expose deployment tut-deploy --type LoadBalancer --port 5001 --target-port 5001

Free Cluster: Use the Worker IP and NodePort

$ kubectl expose deployment tut-deploy --type NodePort --port 5001 --target-port 5001

More details can be found at https://github.com/IBM-Cloud/get-started-python/blob/master/README-kubernetes.md

Access the application

Verify that the status of the pod is Running:

$ kubectl get pods -l app=tut-deploy

Standard (Paid) Cluster:

Identify your LoadBalancer Ingress IP using

$ kubectl get service tut-deploy

Access your application at http://<EXTERNAL-IP>:5001/

Free Cluster:

Identify your Worker Public IP using

$ ibmcloud cs workers <cluster-name>

Identify the NodePort using kubectl describe service tut-deploy. Access your application at http://<WORKER-PUBLIC-IP>:<NODE-PORT>/

(Optional) Create an Ingress to Access your Application at a Requested Hostname

kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: tut-ingress
spec:
  rules:
  - host: <HOSTNAME>
    http:
      paths:
      - path: /
        backend:
          serviceName: tut-deploy
          servicePort: 5001
EOF

Access your app at http://<HOSTNAME>/
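
To confirm that the Ingress resource was created and picked up by the cluster, check it with:

kubectl get ingress tut-ingress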
