On-Premises Data Science — Challenges and Opportunity

Ahmed Khan
Jun 14, 2020 · 6 min read


Democratization of AI is a popular topic these days, and rightly so. For AI to be adopted at scale, industries such as manufacturing, healthcare, and automotive need to be able to apply it to their specific problems. A major challenge is building an AI platform that combines the right practices, tools, and hardware.

Companies such as Amazon, Google, and Microsoft offer cloud-based Machine Learning (ML) solutions that simplify getting started: the required infrastructure spins up with a click, removing the burden of IT management and of complex hardware bring-up and configuration. They provide integrated, state-of-the-art GPUs, TPUs, and FPGAs that deliver the massive compute capacity needed for neural network training and inference. They also integrate many useful tools and development frameworks to give data scientists a seamless machine learning workflow.

While working in the cloud can be simple and beneficial for early modelling and for experiments with open-source data, several challenges arise when ML is adopted in these industries:

  • Data Gravity — Many industries generate data in-house and if AI is deployed in the cloud, that data has to be moved into the cloud. This may not be feasible.
  • Security & Compliance — There may be governance or security issues with putting sensitive data in the cloud.
  • Stickiness — Migration between cloud providers is difficult, by design. It may involve anything from rethinking the infrastructure to remodeling the ML program. While there have been efforts to develop portable frameworks, once a data scientist uses a cloud-specific API the program becomes non-portable.

Based on the points above, an on-prem AI solution is imperative, but it must provide feature parity with cloud solutions and an equally simple workflow. While the ML modeling itself is domain-specific and done by experts, most of the infrastructure, IT management, frameworks, and tools are generic.

The challenges involved in building an on-prem AI solution are:

  • Cluster Management — Cluster bring up, maintenance, server add/remove, and the management of essential services.
  • Specialized Hardware Management — Bring-up, tear-down, and configuration of specialized devices such as GPUs, RDMA NICs, and FPGAs that cater to the high computational demands of ML programs. Bring-up also involves dealing with all of the software packages and drivers the devices need.
  • Essential Tools — Integration with the tools data scientists need; they should spin up and be available with a click.
  • Monitoring — Keeping track of hardware utilization and other metrics in order to make informed decisions on capacity planning and hardware allocation.
  • Management Portal — Offering a consolidated view for control and management of the entire cluster.
  • Fail-over and Access Guarantees — Uninterrupted availability of tools and services for the data scientist.
  • Acceleration & Optimization — Native integration with frameworks to provide the required acceleration and optimization for ML programs.
  • Seamless Workflow — Similar to what the cloud offers — Starting with development, through unsupervised training, and all the way through inference.
  • Provisioning — One click spin-up of tools, auto scaling based on utilization, and auto provisioning of required devices such as GPUs.
  • Authentication & Authorization — Access to different ML services based on organization-wide policies and user profiles.

Kubernetes For Machine Learning

Kubernetes (k8s) has a major role in enabling the democratization of AI. It is a popular and flexible container management and orchestration system that handles the scheduling of containerized applications on a cluster.

Containers package the required software as an image, providing the right level of portability for ML/DL workloads. Many popular frameworks such as TensorFlow, PyTorch, and Scikit-Learn are available as containers, and there are specialized containers with optimized stacks from NVIDIA that enable access to their GPUs. Containerizing ML/DL programs and executing them with Kubernetes lets them be handled like any other application. Kubernetes provides the required cluster management, monitoring, and scaling for ML applications, automated allocation of resources, and abstract APIs for storage management. Projects like Ambassador and Istio provide the ingress function, while the Knative project provides application scaling.
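
As a minimal sketch of that idea, a containerized training script can be scheduled like any other workload, for example as a plain Kubernetes Job. The job name, image tag, and script path below are illustrative assumptions, not part of the original article.

# Sketch: a containerized training script run as an ordinary Kubernetes Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-train                          # placeholder name
spec:
  backoffLimit: 2                            # retry the pod up to 2 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: tensorflow/tensorflow:2.3.0   # stock TensorFlow image from Docker Hub
        command: ["python", "/workspace/train.py"]   # assumes train.py is baked into the image

Once a program is expressed this way, the scheduling, monitoring, and scaling machinery described above applies to it automatically.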

Specialized hardware such as GPUs is enabled through the Kubernetes device plug-in framework (the DevicePlugins feature gate). Vendors can implement a plug-in to advertise custom hardware resources. Device plug-ins report the devices and their capabilities, which are then reflected in the node spec. ML applications can then request devices using the resources section of a pod spec, as shown in the sample below.
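
A minimal sketch of such a pod spec, modeled on the standard NVIDIA device plug-in example (the pod name and image are illustrative):

# Sketch: a pod requesting one NVIDIA GPU via the extended resource advertised by the device plug-in.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                       # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base       # any CUDA-enabled image works
    command: ["nvidia-smi"]            # simply prints the GPUs visible to the container
    resources:
      limits:
        nvidia.com/gpu: 1              # request a single GPU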

NVIDIA offers a device plug-in for their GPUs, detailed at https://github.com/NVIDIA/k8s-device-plugin. The plug-in runs as a pod in Kubernetes and enumerates the GPUs on each node. For the device plug-in pod to detect and report GPUs, and for any other ML application pod to see and use them, some bootstrapping is needed on the node: installing the drivers, the NVIDIA libraries, and other modules. The https://github.com/NVIDIA/gpu-operator project handles this bootstrapping and brings up the necessary components across a Kubernetes cluster, avoiding the mundane, repetitive task of managing the software stack on each node.

Specialized NICs and RDMA devices are also supported through their own plug-ins. When GPUs are set up with RDMA, for example, distributed training achieves its best performance. This is an advanced topic, which I will save for another day!

Kubeflow in Machine Learning Infrastructure

Kubeflow is an important open source effort aimed at bringing together all of the packages, tools, and frameworks required for machine learning. It provides an end-to-end ML workflow, from simple notebooks and distributed training all the way through complex pipelines and serving, while abstracting the infrastructure details behind Kubernetes resources and APIs. Kubeflow can be deployed on any cloud or on-prem platform with a consistent API definition, which keeps the ML application portable.

A typical installation and usage workflow for on-prem Machine Learning would be:

  1. Bring up a Kubernetes cluster, using Rancher or some other k8s-based cluster manager
  2. Install the NVIDIA software using project https://github.com/NVIDIA/gpu-operator
  3. Set up a persistent storage device and create a Kubernetes persistent volume
  4. Install Kubeflow
  5. Launch a notebook, requesting GPUs if needed, and code and test a machine learning program
  6. Build and containerize the program and push it to a registry
  7. Trigger a training run, distributed if required, using TensorFlow or PyTorch (see the sketch after this list)
  8. Use Katib to trigger a hyperparameter tuning experiment if needed
  9. Build a pipeline and deploy it to run for a given dataset — automate using CronJob
  10. Store generated artifacts in MinIO
  11. Collect and view the artifacts using TensorBoard or another visualization tool
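
As a rough sketch of step 7, a distributed TensorFlow run can be submitted through Kubeflow's TFJob operator. The replica counts, image name, and script path below are assumptions for illustration, not prescriptions from the workflow above.

# Sketch: a distributed training run expressed as a Kubeflow TFJob.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-train                         # illustrative name
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow               # TFJob expects the container to be named "tensorflow"
            image: registry.local/mnist:v1 # placeholder: the image built and pushed in step 6
            command: ["python", "/app/train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1          # one GPU per replica, via the device plug-in
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: registry.local/mnist:v1
            command: ["python", "/app/train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1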

While Kubeflow does provide an essential platform for Machine Learning that can be deployed on-prem, the goal of such an on-prem solution would be to achieve the feature parity and simplicity of cloud solutions. This requires a few more points to be addressed:

  • Users currently need to be aware of Kubernetes APIs and constructs. This is a limitation and needs to be abstracted behind higher-level ML workflow APIs.
  • Installation and teardown with a one-click operation.
  • Native integration with on-prem storage layer, user profiles, access restrictions.
  • Integrations with on-prem identity providers such as LDAP/Gitlab/AD/Custom.
  • Clearly defined and enhanced Admin workflow with a unified portal to manage Data Science applications.
  • Support for custom tool integrations. Bring Your Own Tool.
  • Integration with data sources and a unified way for accessing data — typically using handles.
  • Enhanced visualization of Data, Model, Program and other Artifacts.
  • Support for integration at the infrastructure level with services like MySQL, Postgres, etc., for storing and searching data frames.
  • Guaranteed resiliency against abrupt shutdowns and unexpected node issues such as memory pressure.
  • Seamless integration with an Enterprise IT environment.
  • Etc…

This article has explained why an on-prem ML solution is required and highlighted how Kubernetes and Kubeflow enable such deployments. Future articles will explore specific aspects of on-prem deployments, the tools in Kubeflow, and a range of other topics.


Ahmed Khan

Principal Engineer @oneconvergence. Active user and member of the Kubeflow community.