Deploying Large, Data-Intensive AI Applications Using Kubernetes on IBM Cloud
Arunima Chaudhary, Falk Pollok, Hendrik Strobelt, Daniel Weidele
Kubernetes for Hosting Applications
The intent of this tutorial is to enable you to create applications that expose APIs and UIs backed by large datasets in custom formats (such as HDF5 or BAM) and to deploy them through Kubernetes with Cloud Object Storage. The sample application you will host on your Kubernetes cluster is a CIFAR10 image browser that shows, for each instance, whether your model's classification matches the ground truth. Here is a preview of the UI:
The sample application exposes a REST API that takes in image IDs and provides the image pixel data, ground truth label and prediction label. We use the CIFAR10 image dataset and the predictions are provided by a LeNet model trained using Watson Machine Learning.
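The actual server code ships with the tutorial; as an illustration only, the payload the /images/ endpoint assembles could be structured roughly like the sketch below. The helper name and field names are assumptions, not the tutorial's exact code.

```python
# Hypothetical sketch of the JSON-serializable payload the /images/ REST
# endpoint returns. Dataset access is stubbed out; the real application
# reads pixels from the HDF5 file and predictions from the trained model.

def images_payload(ids, pixels, ground_truth, predictions):
    """Build the response for a list of image IDs.

    pixels / ground_truth / predictions are assumed to be indexable
    by image ID (e.g. lists or HDF5 datasets).
    """
    return {
        "images": [
            {
                "id": i,
                "pixels": pixels[i],
                "ground_truth": ground_truth[i],
                "prediction": predictions[i],
                "correct": ground_truth[i] == predictions[i],
            }
            for i in ids
        ]
    }
```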
Here is an architecture diagram of the application internals:
The following text describes how to deploy a sample application to a Kubernetes cluster.
Step 1: Download CIFAR10 to your machine and convert it to the HDF5 format
cd tutorial
python3 download_cifar.py
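download_cifar.py handles the download and conversion for you. For reference, reading one of the original CIFAR-10 python batch files into arrays suitable for HDF5 export can be sketched as follows (the function name is illustrative, not part of the tutorial code):

```python
import pickle

import numpy as np


def load_cifar_batch(path):
    """Read one CIFAR-10 python batch file into (images, labels).

    Each batch stores N rows of 3072 uint8 values (3 channels x 32 x 32);
    we reshape them to (N, 3, 32, 32) before writing them to HDF5.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    images = np.asarray(batch[b"data"], dtype=np.uint8).reshape(-1, 3, 32, 32)
    labels = np.asarray(batch[b"labels"], dtype=np.int64)
    return images, labels
```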
You can now either proceed to Step 2 or, optionally, follow the instructions below, which are based on a Watson Studio tutorial, to train the model on Watson Machine Learning.
Step 1.1: Create a Watson Machine Learning Instance
Step 1.2: Create Data and Results Buckets
Step 1.3: Upload CIFAR10 to the Data Bucket
$ aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp cifar10/ s3://<bucket-name> --recursive
Step 1.4: Prepare a Watson Machine Learning Manifest
model_definition:
  framework:
    #framework name and version (supported list of frameworks available at 'bx ml list frameworks')
    name: pytorch
    version: 0.3
  #name of the training run
  name: cifar10 in pytorch
  #author name and email
  author:
    name: John Doe
    email: johndoe@in.ibm.com
  description: This is running cifar training on multiple models
  execution:
    #command to execute -- see script parameters in a later section
    command: python3 main.py --cifar_path ${DATA_DIR}
      --checkpoint_path ${RESULT_DIR} --epochs 10
    compute_configuration:
      #valid values for name - k80/k80x2/k80x4/p100/p100x2/v100/v100x2
      name: k80
training_data_reference:
  name: training_data_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  source:
    bucket: < data bucket name >
    type: s3
training_results_reference:
  name: training_results_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  target:
    bucket: < results bucket name >
    type: s3
Step 1.5: Compress the Code
$ zip model.zip main.py utils.py models/*
Step 1.6: Train the Model in the Cloud
$ ibmcloud ml train model.zip manifest.yml
Step 1.7: Download model.ckpt from the Results Bucket and Place it in the Project’s Root Directory
Step 1.8: Proceed with Steps 3 and 4
Step 2: Setup and activate conda environment
Use Anaconda to create a virtual environment, here named k8st.
conda env create -f environment.yml
source activate k8st
Step 3: Run the application
Now you should be able to run the Flask server.
python server.py
Step 4: Test the application
Test your application at http://localhost:5001/images/?ids=0,1,2,3,4,5 which should return a JSON object.
You can also view the results through a UI at http://localhost:5001.
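For a programmatic check, the API can also be queried from Python. The sketch below (hypothetical helper names; fetch_images requires the server from Step 3 to be running locally) builds the query URL with the stdlib and decodes the JSON response:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5001/images/"


def build_images_url(ids, base_url=BASE_URL):
    """Build the query URL for a list of image IDs (commas are URL-encoded)."""
    query = urllib.parse.urlencode({"ids": ",".join(str(i) for i in ids)})
    return f"{base_url}?{query}"


def fetch_images(ids):
    """Fetch and decode the JSON payload for the given IDs.

    Requires the Flask server from Step 3 to be running on localhost:5001.
    """
    with urllib.request.urlopen(build_images_url(ids)) as resp:
        return json.load(resp)
```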
Dockerize your code
To host the app on Kubernetes, it has to be containerized, i.e. bundled into a Docker container. For this part of the tutorial you need Docker installed on your machine. The Dockerfile describes how to build such a container. Once built, the container image has to be pushed to a registry from which Kubernetes can pull it.
Step 1: Create Dockerfile
A Dockerfile can look like this:
FROM continuumio/miniconda3

# Update package lists
RUN apt-get -y update
RUN apt-get -y upgrade
RUN apt-get -y install s3fs

WORKDIR /usr/app

# Build conda environment
RUN mkdir tmp && cd tmp
COPY tutorial/environment.yml .
RUN conda env create -f environment.yml
RUN cd .. && rm -rf tmp

# Create data dir
RUN mkdir data

# Copy secret file
COPY .passwd-s3fs .
RUN chmod 600 .passwd-s3fs

# Copy full tutorial code
COPY tutorial .

# Run instructions
CMD ["source activate k8st && exec python3 server.py"]
ENTRYPOINT ["/bin/bash", "-c"]
EXPOSE 5001
Step 2: Create a .passwd-s3fs file to store credentials for the Cloud Object Storage instance
Step 2.1: Create a Cloud Object Storage Instance
Step 2.2: Get the access credentials for the Cloud Object Storage instance
Step 2.3: Create a .passwd-s3fs file to store the access credentials obtained in Step 2.2
<aws_access_key_id>:<aws_secret_access_key>
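s3fs requires this credentials file to be readable by its owner only (the Dockerfile above runs chmod 600 on it for that reason). As an illustration, writing the file with the correct format and permissions can be sketched like this (the helper name is hypothetical, not part of the tutorial code):

```python
import os


def write_s3fs_credentials(path, access_key, secret_key):
    """Write a .passwd-s3fs file in the <access_key>:<secret_key> format.

    s3fs rejects credentials files with permissive modes, so the file
    is chmod'ed to 0o600 (owner read/write only).
    """
    with open(path, "w") as f:
        f.write(f"{access_key}:{secret_key}\n")
    os.chmod(path, 0o600)
    return path
```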
Step 3: Build & Test the Container
From the main directory execute
docker build -t k8-tut/tutorial .
After the container is built, it can be tested. The following command mounts the tutorial/data directory into the container and runs the container, exposing port 5001 from the container to the local machine:
docker run -it -v "${PWD}/tutorial/data:/usr/app/data" -p "5001:5001" k8-tut/tutorial
Step 4: Create Namespaces and Upload Docker Container to Registry
Step 4.1: Install the IBM Cloud CLI
Step 4.2: Install the Container Registry plug-in
ibmcloud plugin install container-registry -r Bluemix
Step 4.3: Create the Namespace and Upload Docker Container
# Log in to your IBM Cloud account (IBMers add --sso)
ibmcloud login -a https://api.us-east.bluemix.net

# Create namespace, e.g. "k8-tut"
ibmcloud cr namespace-add k8-tut

# Tag the docker image
docker tag k8-tut/tutorial registry.ng.bluemix.net/k8-tut/tutorial

# Push the image
docker push registry.ng.bluemix.net/k8-tut/tutorial
Getting Access to Kubernetes Cluster
Contact your resource administrator to make sure you have a Kubernetes cluster with admin access.
The following steps set up the kubectl CLI to work with your Kubernetes cluster. The sequence of commands is replicated from the Access section of a specific Kubernetes cluster on the IBM Cloud.
Target the IBM Cloud Container Service region in which you want to work:
ibmcloud cs region-set <cluster-region>
Run the command to download the configuration files and set the Kubernetes environment accordingly:
ibmcloud cs cluster-config <cluster-name>
Set the KUBECONFIG environment variable. Copy the output from the previous command and paste it in your terminal. The command output should look similar to the following:
export KUBECONFIG=/Users/$USER/.bluemix/plugins/container-service/clusters/<cluster-name>/<kube-config.yml>
Verify that you can connect to your cluster and have admin access by listing your worker nodes:
kubectl get nodes
COS Driver Setup
Install kubernetes-helm and run the following commands to set up the COS driver:
helm repo add stage https://registry.stage1.ng.bluemix.net/helm/ibm
helm repo update
helm fetch --untar stage/ibmcloud-object-storage-plugin
helm plugin install ibmcloud-object-storage-plugin/helm-ibmc
helm init
kubectl get pod -n kube-system | grep tiller
# Check until state is Running
helm ibmc install stage/ibmcloud-object-storage-plugin -f ibmcloud-object-storage-plugin/ibm/values.yaml
You can then list the new storage classes:
kubectl get sc
Cloud Object Storage
Use the following steps to create a Persistent Volume Claim (PVC) for Cloud Object Storage.
Step 1: Obtain the Base64 encoded credentials…
echo -n "<AWS_ACCESS_KEY>" | base64
echo -n "<AWS_SECRET_ACCESS_KEY>" | base64
…and create a secret with them:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: ibm/ibmc-s3fs
metadata:
  name: test-secret
  namespace: default
data:
  access-key: <base64-encoded AWS_ACCESS_KEY>
  secret-key: <base64-encoded AWS_SECRET_ACCESS_KEY>
EOF
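The values in the Secret's data section must be the Base64-encoded credentials from Step 1. The echo commands above produce them; equivalently, in Python (stdlib base64, helper name illustrative):

```python
import base64


def b64(value):
    """Base64-encode a credential string, like `echo -n <value> | base64`."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")
```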
Step 2: Upload cifar10_hdf5 files to a separate bucket:
aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp tutorial/data/cifar10_hdf5/ s3://<bucket-name>/cifar10_hdf5 --recursive
Step 3: Request the PVC
Replace <bucket-name> with the value you chose in Step 2 and run the entire command. You can change the requested size from 10Gi to any desired value.
kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: cos-pvc
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: "ibmc-s3fs-standard"
    ibm.io/auto-create-bucket: "false"
    ibm.io/auto-delete-bucket: "false"
    ibm.io/bucket: "<bucket-name>"
    ibm.io/endpoint: "https://s3-api.us-geo.objectstorage.softlayer.net"
    ibm.io/region: "us-standard"
    ibm.io/secret-name: "test-secret"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
Create the Deployment
The PVC created in the previous step can now be mounted into the deployment.
kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tut-deploy
  labels:
    app: tut-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tut-deploy
  template:
    metadata:
      labels:
        app: tut-deploy
    spec:
      containers:
      - name: tut-deploy
        image: registry.ng.bluemix.net/k8-tut/tutorial
        ports:
        - containerPort: 5001
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/usr/app/data"
          name: s3fs-test-volume
      volumes:
      - name: s3fs-test-volume
        persistentVolumeClaim:
          claimName: cos-pvc
EOF
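The deployment mounts the bucket at /usr/app/data, the same path the local Docker test run used. A defensive startup check (hypothetical, not part of the tutorial's server.py) could verify the mount before serving requests:

```python
from pathlib import Path


def assert_data_mounted(data_dir="/usr/app/data"):
    """Fail fast if the COS bucket is not mounted where the app expects it."""
    path = Path(data_dir)
    if not path.is_dir() or not any(path.iterdir()):
        raise RuntimeError(
            f"Expected a mounted, non-empty data directory at {data_dir}; "
            "check the PVC and the volumeMounts in the deployment."
        )
    return path
```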
Scale your application
kubectl apply -f - <<EOF
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tut-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: tut-deploy
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
EOF
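This autoscaler follows the standard HPA rule: desired replicas = ceil(current replicas * current CPU utilization / target utilization), clamped to the [minReplicas, maxReplicas] range. A small sketch of that calculation (helper name illustrative):

```python
import math


def desired_replicas(current, cpu_util, target_util=50, min_r=1, max_r=10):
    """Replica count the HPA above would request, per the standard formula.

    cpu_util and target_util are percentages; the result is clamped
    to the configured min/max replica bounds.
    """
    desired = math.ceil(current * cpu_util / target_util)
    return max(min_r, min(max_r, desired))
```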
Expose the service
Paid Cluster: Expose the service using an External IP and Loadbalancer
$ kubectl expose deployment tut-deploy --type LoadBalancer --port 5001 --target-port 5001
Free Cluster: Use the Worker IP and NodePort
$ kubectl expose deployment tut-deploy --type NodePort --port 5001 --target-port 5001
More details can be found at https://github.com/IBM-Cloud/get-started-python/blob/master/README-kubernetes.md
Access the application
Verify that the status of the pod is Running:
$ kubectl get pods -l app=tut-deploy
Standard (Paid) Cluster:
Identify your LoadBalancer Ingress IP using
$ kubectl get service tut-deploy
Access your application at http://<EXTERNAL-IP>:5001/
Free Cluster:
Identify your Worker Public IP using
$ ibmcloud cs workers <cluster-name>
Identify the NodePort using kubectl describe service tut-deploy. Access your application at http://<WORKER-PUBLIC-IP>:<NODE-PORT>/
(Optional) Create an Ingress to Access your Application at a Requested Hostname
kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: tut-ingress
spec:
  rules:
  - host: <HOSTNAME>
    http:
      paths:
      - path: /
        backend:
          serviceName: tut-deploy
          servicePort: 5001
EOF
Access your app at http://<HOSTNAME>/