Introducing RiseML — A platform for deep learning on Kubernetes

Henning Peters
Published in RiseML Blog
Jan 17, 2018

Today we’re thrilled to announce the 1.0 release of RiseML, a platform for deep learning on Kubernetes. RiseML lets your machine learning team treat individual GPU servers as a single compute cluster and share its resources for training machine learning models. This allows your team to focus on what really matters: deep learning research rather than repetitive manual work.

RiseML abstracts your cluster’s compute resources and provides an interface tailored for machine learning engineers, allowing them to prepare, run, monitor, and scale experiments in parallel using the machine learning framework of their choice. Advanced techniques such as hyperparameter optimization and distributed training can be enabled easily, giving your team powerful tools to train their models.

Getting started with RiseML on AWS only takes 10 minutes

Already have a Kubernetes cluster? You can install RiseML alongside your existing infrastructure, whether it is deployed on-premises or in the cloud. Compared to a hosted offering, this allows maximum flexibility regarding data privacy, security, and hardware choices while keeping costs low. To give RiseML a spin, simply sign up for our free Community Edition and follow the installation instructions.

A glimpse into using RiseML

At the core of RiseML is the command-line interface, which is typically installed locally on every workstation and connects to a remote cluster. Let’s take a look at our cluster:

$ riseml system info
RiseML Client/Server Version: 1.0.0/1.0.1
RiseML Cluster ID: 30fd476c-90a9-4b90-820e-9f4460869f75
Kubernetes Version 1.7 (Build Date: 2017-11-25T17:51:39Z)
NODE             CPU     MEM   GPU  GPU MEM
ip-172-20-36-52   32   480.2     8     89.4
ip-172-20-38-1    32   480.2     8     89.4
ip-172-20-68-32   32   480.2     8     89.4
ip-172-20-9-37    32   480.2     8     89.4
-------------------------------------------
Total            128  1920.8    32    357.6

Experiments are organized by project and user, and each one is identified by a unique ID. Here’s an overview of the experiments currently running on our cluster:

$ riseml status -au
ID   USER    PROJECT      STATE    AGE         TYPE
112  martin  imagenet     RUNNING  4 day(s)    Series
139  elmar   deepspeech   RUNNING  15 hour(s)  Experiment
155  bill    20bn-jester  RUNNING  12 hour(s)  Experiment

Next, we’ll train an image classification model on the popular CIFAR-10 dataset. We maintain a few code examples in a GitHub repository. Simply clone the repository and go to the cifar10 directory:

$ git clone https://github.com/riseml/examples
$ cd examples/cifar10

You’ll find a riseml.yml file in the project’s root directory. The riseml.yml file contains everything you need to define and run your deep learning experiments:

project: cifar10
train:
  framework: tensorflow
  tensorflow:
    version: 1.2.1
  install:
    - apt-get update && apt-get install -y curl git
    - pip install -r requirements.txt
  resources:
    cpus: 3
    mem: 4096
    gpus: 1
  run:
    - python cifar10.py --epochs 5
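
The run entry in this configuration passes --epochs 5 to cifar10.py. The script itself is not reproduced in this post, so the following is only a minimal sketch of how a training script might consume that flag, assuming it uses Python's standard argparse module. The names parse_args and train, the default of 1 epoch, and the placeholder loop are all hypothetical and not taken from the examples repository.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the `run` entry in riseml.yml passes to the script."""
    parser = argparse.ArgumentParser(
        description="Hypothetical CIFAR-10 training entry point"
    )
    # --epochs matches the flag used in riseml.yml; the default is an assumption.
    parser.add_argument(
        "--epochs", type=int, default=1,
        help="number of passes over the training set",
    )
    return parser.parse_args(argv)


def train(epochs):
    """Placeholder loop; a real script would build and fit a model here."""
    completed = 0
    for _ in range(epochs):
        completed += 1  # stand-in for one pass over the training data
    return completed


# Simulate the command line from riseml.yml: python cifar10.py --epochs 5
args = parse_args(["--epochs", "5"])
print("finished epochs:", train(args.epochs))
```

Keeping hyperparameters such as the epoch count on the command line means they can be changed by editing the run entry alone, without touching the training code.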

To train the model we can simply run:

$ riseml train
Syncing project (8.7 KB, 3 files)...done
TensorBoard: http://ae77052fcfb7211e7b74806202995e7d-1923027246.us-west-2.elb.amazonaws.com/tensorboard/henning-cifar10-181-tensorboard
Type `riseml logs 181` to connect to log stream again.

To monitor training progress, we can follow the TensorBoard link printed by the train command, or obtain the training results via the command line:

$ riseml status 181 | grep accuracy
Result: accuracy=0.5244
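
When results need to feed into a script rather than be read by eye, the same line can be parsed programmatically. The sketch below is a hedged illustration, not part of RiseML: it assumes that result lines look exactly like the single sample above ("Result:" followed by key=value pairs with numeric values), and the function name parse_results is hypothetical.

```python
import re

# Matches numeric key=value pairs such as "accuracy=0.5244".
# The format is assumed from the single sample line in this post.
RESULT_PAIR = re.compile(r"(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_results(status_output):
    """Collect numeric key=value pairs from lines that start with 'Result:'."""
    results = {}
    for line in status_output.splitlines():
        line = line.strip()
        if not line.startswith("Result:"):
            continue  # ignore every other line of the status output
        for key, value in RESULT_PAIR.findall(line):
            results[key] = float(value)
    return results


sample = "Result: accuracy=0.5244"
print(parse_results(sample))
```

In practice, the text passed to parse_results would be the captured output of the status command for the experiment in question.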

Try it out!

RiseML boosts productivity for your machine learning and infrastructure teams.

Beyond the free Community Edition, which is limited to individual users, we offer a Professional Edition for teams starting at only $249 USD per month per node. For research, we offer academic discounts, and an open-source release is in preparation.

Looking for priority support, training or custom features? Please contact us for our Enterprise Edition that can be fully tailored to your requirements.

Try RiseML now for free!
