Scale DO execution on WML

Feb 24, 2020

In an earlier post, I covered how Decision Optimization (DO) was introduced in Watson Machine Learning (WML).

In this post, I go into more detail and try to answer the questions and concerns that come up when you want to scale: running more jobs, which take longer and need more memory, and which may all have to run in parallel to fit into a given time window.

Some reminders on models, deployments, nodes, sizes and jobs

The general concepts involved in deploying a DO model in WML were introduced in this previous post. An optimization problem is an optimization model applied to some data. DO in WML lets you create one or several deployments for a given model. Creating a deployment requires a configuration, which includes a size (S, M or XL) and a number of nodes.

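# `client` is assumed to be an already authenticated WML Python client.
size = "S"     # hardware size: "S", "M" or "XL"
n_nodes = 2    # maximum number of nodes for this deployment

props = {
    client.deployments.ConfigurationMetaNames.NAME: "MyName",
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "MyDesc",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.COMPUTE: {'name': size, 'nodes': n_nodes}
}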

(See this post for the latest updates on the Python API)

The size indicates the type of hardware which will be used for model execution:

Small (S): 2 CPUs and 8 GB
Medium (M): 4 CPUs and 16 GB
Extra Large (XL): 16 CPUs and 64 GB

(See this post about the newly introduced Large T-shirt size)

The size should be chosen to suit the execution of a single problem, because a node solves only one problem at a time. The size also affects pricing (see below).
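As a minimal sketch of that choice, assuming you already know roughly how many CPUs and how much memory one problem needs, the helper below picks the smallest of the sizes listed above. The function name and the requirement parameters are illustrative and not part of WML; the Large size is left out because its specification is not given here.

```python
from typing import Optional

# Sizes as listed in this post: name -> (CPUs, memory in GB).
SIZES = {
    "S": (2, 8),
    "M": (4, 16),
    "XL": (16, 64),
}


def smallest_size(min_cpus: int, min_memory_gb: int) -> Optional[str]:
    """Return the smallest listed size meeting both requirements, or None."""
    # Dicts keep insertion order, so sizes are tried from smallest to largest.
    for name, (cpus, memory_gb) in SIZES.items():
        if cpus >= min_cpus and memory_gb >= min_memory_gb:
            return name
    return None


# A problem that needs about 12 GB does not fit in S, so M is selected.
print(smallest_size(min_cpus=2, min_memory_gb=12))
```

A `None` result means that no listed size is large enough for a single problem, which adding nodes cannot fix since each problem runs on one node.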

The number of nodes is the maximum number of instances of this hardware that can be used. It determines how many problems can be solved at the same time (as detailed below). The number of nodes does not affect pricing.
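To get a feel for how the number of nodes relates to a time window, here is a rough back-of-the-envelope sketch. It assumes that every job takes about the same time to solve and it ignores node start-up and queueing overhead, so treat the result as an estimate only. The function names are illustrative.

```python
import math
from typing import Optional


def estimated_makespan(n_jobs: int, n_nodes: int, solve_minutes: int) -> int:
    """Rough wall-clock time when jobs run in waves of at most n_nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    waves = math.ceil(n_jobs / n_nodes)
    return waves * solve_minutes


def nodes_needed(n_jobs: int, solve_minutes: int, window_minutes: int) -> Optional[int]:
    """Smallest node count that fits all jobs in the window, or None if one job is too long."""
    if solve_minutes < 1:
        raise ValueError("solve_minutes must be at least 1")
    if solve_minutes > window_minutes:
        return None
    # Number of jobs a single node can solve one after another in the window.
    jobs_per_node = window_minutes // solve_minutes
    return math.ceil(n_jobs / jobs_per_node)


n = nodes_needed(n_jobs=100, solve_minutes=10, window_minutes=60)
print(n, estimated_makespan(100, n, 10))
```

With 100 jobs of 10 minutes each and a 60 minute window, each node can handle 6 jobs in sequence, so 17 nodes are needed under these assumptions.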

After a deployment is created, a new problem can be defined by creating a new job for this deployment. The job configuration includes the data, either inlined or referenced, to be used for this problem together with the model from the deployment. See the other posts for examples of how to add data to jobs using the REST API or the Python client.

No hardware is started or used when a deployment is created; hardware is started only as needed when jobs are created.
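As an illustration of what inlined data can look like, the sketch below assembles the data part of a job as a plain dictionary. The key names (`decision_optimization`, `input_data`, `output_data`, `id`, `fields`, `values`) are an assumption about the payload layout and should be checked against the API reference for your WML version. The helper names and the sample table are hypothetical, and the reference to the deployment, which a real job also needs, is left out. Referenced data is not shown.

```python
import json


def inline_table(table_id, fields, rows):
    """Describe one input table inline: a name, column names and row values."""
    rows = [list(row) for row in rows]
    for row in rows:
        if len(row) != len(fields):
            raise ValueError(f"row {row} does not match fields {fields}")
    return {"id": table_id, "fields": list(fields), "values": rows}


def job_payload(input_tables, output_pattern=r".*\.csv"):
    """Assemble the data part of a job (assumed layout, see the note above)."""
    return {
        "decision_optimization": {
            "input_data": list(input_tables),
            # Ask for every output table whose name matches the pattern.
            "output_data": [{"id": output_pattern}],
        }
    }


# Hypothetical table, for illustration only.
food = inline_table(
    "diet_food.csv",
    ["name", "unit_cost"],
    [["Roasted Chicken", 0.84], ["Spaghetti", 0.78]],
)
payload = job_payload([food])
print(json.dumps(payload, indent=2))
```

Checking the row lengths before submitting catches malformed tables locally instead of at solve time.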

How are jobs handled?

There is one queue per deployment.

When a job is created, it is automatically submitted to the queue. The way it gets executed depends on the current configuration and jobs already running for this deployment, as shown in the diagram below.

Jobs are run inside PODs. A Kubernetes POD consists of one or more containers that are guaranteed to be co-located on the host machine and can share resources.

How new jobs are handled.

The newly created job is sent to the queue.
If a POD is up but idle (not running a job), it immediately starts processing this job.
Otherwise, if the maximum number of nodes has not been reached, a new POD is started (this can take a few seconds) and the job is assigned to it.
Otherwise, the job waits in the queue until one of the running PODs finishes its job and can pick up the waiting one.
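The rules above can be sketched as a small toy model. This is not how WML is implemented: it is a simplified simulation with made-up class and method names, where a job that can start right away skips the queue, and where POD start-up time and idle time-outs are ignored.

```python
from collections import deque


class Deployment:
    """Toy model of one deployment: one queue and at most max_nodes PODs."""

    def __init__(self, max_nodes):
        self.max_nodes = max_nodes
        self.queue = deque()  # jobs waiting for a POD
        self.pods = []        # one entry per started POD: its job, or None if idle

    def submit(self, job):
        """Apply the rules to a new job and return what happened to it."""
        for i, running in enumerate(self.pods):
            if running is None:              # an idle POD takes the job
                self.pods[i] = job
                return "assigned to idle POD"
        if len(self.pods) < self.max_nodes:  # room left: start a new POD
            self.pods.append(job)
            return "new POD started"
        self.queue.append(job)               # all PODs busy: wait
        return "queued"

    def finish(self, job):
        """Mark a job as done; the freed POD picks up the next waiting job, if any."""
        i = self.pods.index(job)
        self.pods[i] = self.queue.popleft() if self.queue else None


d = Deployment(max_nodes=2)
print([d.submit(j) for j in ("a", "b", "c")])
d.finish("a")         # the POD that ran "a" picks up "c" from the queue
d.finish("b")         # nothing is waiting, so this POD becomes idle
print(d.submit("d"))  # goes straight to the idle POD
```

With two nodes, the third job is the first one that has to wait, which matches the behavior described in the list.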

A POD can stay idle for two hours, during which it still consumes some of the cluster hardware resources and can only serve jobs for its deployment. It then stops automatically and the hardware resources become available for another POD, potentially for another deployment or even another WML instance. You can force a POD to stop by manually deleting the deployment (using the REST API or the WML Python client).

Running time based pricing (CUH)

Only job solving time is charged; idle time for PODs is not.

Depending on the size of the POD, a different multiplier is used to compute the number of Capacity Unit Hours (CUH) consumed.
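The computation can be sketched as solve time in hours times the multiplier for the size. The multipliers below are placeholders invented for this example; the actual rates are not given in this post and must be taken from the WML pricing information.

```python
# Hypothetical multipliers, for illustration only: the real CUH rate for each
# size is not stated in this post.
EXAMPLE_CUH_PER_HOUR = {"S": 1.0, "M": 2.0, "XL": 8.0}


def cuh_consumed(solve_seconds, size, multipliers=EXAMPLE_CUH_PER_HOUR):
    """CUH for a list of job solve times (in seconds) run on one size.

    Only solve time goes in: idle POD time is not charged, so it is not an input.
    """
    total_hours = sum(solve_seconds) / 3600.0
    return total_hours * multipliers[size]


# Two jobs of 30 minutes each on size M, with the placeholder multiplier of 2.0.
print(cuh_consumed([1800, 1800], "M"))
```

Because the number of nodes does not appear in the formula, running the same jobs on more nodes changes the elapsed time but not the CUH under this model.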

An example of running many jobs

This notebook shows an example of creating a model, creating a deployment with a configurable size and number of nodes, then sending a set of new jobs and watching how they are processed.

Looking at how the jobs are processed, you can see two phases:

First, as these are the very first jobs for this deployment, there are a few iterations where all jobs are queued while the nodes are starting.
Then, one iteration shows some jobs completed and others still running. Several jobs run at the same time because the number of nodes was set to 2. Shortly after, all jobs are completed.

###############################
0:00:05.828219
0 queued... JOB -  b4170277-4db2-4663-9df9-74af58b97ce4
1 queued... JOB -  54a775b3-a190-40c9-8e00-2e49744d532a
2 queued... JOB -  2a5ff734-c1bb-4917-9b54-185bf88e454a
3 queued... JOB -  fa1a4869-dfcd-4e0f-a80a-5304ab0f41a1
4 queued... JOB -  e16cd811-1cac-46af-b7b7-e540a2122dcb
5 queued... JOB -  bae6dbc2-aff3-49fc-ae25-4fd81c912cd6
6 queued... JOB -  fa0986e6-9a8b-4190-8627-027a330c20b2
7 queued... JOB -  7314cca2-c5a5-4e4a-885e-6565580f8d96
8 queued... JOB -  50aac044-1d4c-40e8-9033-075e97f4013e
9 queued... JOB -  ac822ef6-7530-4ebe-a4ef-176784d8211d
{'queued': 10}
###############################
0:00:11.670979
0 queued... JOB -  b4170277-4db2-4663-9df9-74af58b97ce4
1 queued... JOB -  54a775b3-a190-40c9-8e00-2e49744d532a
2 queued... JOB -  2a5ff734-c1bb-4917-9b54-185bf88e454a
3 queued... JOB -  fa1a4869-dfcd-4e0f-a80a-5304ab0f41a1
4 queued... JOB -  e16cd811-1cac-46af-b7b7-e540a2122dcb
5 queued... JOB -  bae6dbc2-aff3-49fc-ae25-4fd81c912cd6
6 queued... JOB -  fa0986e6-9a8b-4190-8627-027a330c20b2
7 queued... JOB -  7314cca2-c5a5-4e4a-885e-6565580f8d96
8 queued... JOB -  50aac044-1d4c-40e8-9033-075e97f4013e
9 queued... JOB -  ac822ef6-7530-4ebe-a4ef-176784d8211d
{'queued': 10}
###############################
0:00:17.596201
0 queued... JOB -  b4170277-4db2-4663-9df9-74af58b97ce4
1 queued... JOB -  54a775b3-a190-40c9-8e00-2e49744d532a
2 queued... JOB -  2a5ff734-c1bb-4917-9b54-185bf88e454a
3 queued... JOB -  fa1a4869-dfcd-4e0f-a80a-5304ab0f41a1
4 queued... JOB -  e16cd811-1cac-46af-b7b7-e540a2122dcb
5 queued... JOB -  bae6dbc2-aff3-49fc-ae25-4fd81c912cd6
6 queued... JOB -  fa0986e6-9a8b-4190-8627-027a330c20b2
7 queued... JOB -  7314cca2-c5a5-4e4a-885e-6565580f8d96
8 queued... JOB -  50aac044-1d4c-40e8-9033-075e97f4013e
9 queued... JOB -  ac822ef6-7530-4ebe-a4ef-176784d8211d
{'queued': 10}
###############################
0:00:23.502208
0 completed... JOB -  b4170277-4db2-4663-9df9-74af58b97ce4
1 running... JOB -  54a775b3-a190-40c9-8e00-2e49744d532a
2 completed... JOB -  2a5ff734-c1bb-4917-9b54-185bf88e454a
3 queued... JOB -  fa1a4869-dfcd-4e0f-a80a-5304ab0f41a1
4 completed... JOB -  e16cd811-1cac-46af-b7b7-e540a2122dcb
5 completed... JOB -  bae6dbc2-aff3-49fc-ae25-4fd81c912cd6
6 running... JOB -  fa0986e6-9a8b-4190-8627-027a330c20b2
7 queued... JOB -  7314cca2-c5a5-4e4a-885e-6565580f8d96
8 queued... JOB -  50aac044-1d4c-40e8-9033-075e97f4013e
9 queued... JOB -  ac822ef6-7530-4ebe-a4ef-176784d8211d
{'completed': 4, 'running': 2, 'queued': 4}
###############################
0:00:29.404864
1 completed... JOB -  54a775b3-a190-40c9-8e00-2e49744d532a
3 completed... JOB -  fa1a4869-dfcd-4e0f-a80a-5304ab0f41a1
6 completed... JOB -  fa0986e6-9a8b-4190-8627-027a330c20b2
7 completed... JOB -  7314cca2-c5a5-4e4a-885e-6565580f8d96
8 completed... JOB -  50aac044-1d4c-40e8-9033-075e97f4013e
9 completed... JOB -  ac822ef6-7530-4ebe-a4ef-176784d8211d
{'completed': 10}
###############################
0:00:35.010775
{'completed': 10}
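The summary lines in the output above, such as `{'completed': 4, 'running': 2, 'queued': 4}`, can be produced by counting job states at each iteration. The sketch below is not the notebook's code: `fetch_states` is a hypothetical callable that returns the current state of every job, standing in for the real calls to WML, and any state other than `queued` or `running` is treated as final.

```python
import time
from collections import Counter


def summarize(states):
    """Count jobs per state, in order of first appearance."""
    return dict(Counter(states))


def wait_for_jobs(fetch_states, interval_seconds=5.0, max_polls=100):
    """Poll until no job is queued or running; return the summary of each poll."""
    history = []
    for _ in range(max_polls):
        summary = summarize(fetch_states())
        history.append(summary)
        if not {"queued", "running"} & set(summary):
            break
        time.sleep(interval_seconds)
    return history


# Fake snapshots standing in for successive status requests.
snapshots = iter([
    ["queued", "queued", "queued"],
    ["completed", "running", "queued"],
    ["completed", "completed", "completed"],
])
for summary in wait_for_jobs(lambda: next(snapshots), interval_seconds=0):
    print(summary)
```

The `max_polls` limit keeps the loop from running forever if a job never leaves the queue.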

After these jobs are completed, the two nodes stay idle for some time. If the cells that create and monitor the set of jobs are re-executed during that time, the new jobs start running immediately.

If the deployment is not used for some time, the nodes hibernate. If new jobs are created after that, the nodes are restarted.

Conclusions

WML makes it easy to scale the solving of optimization models.

It is very easy to choose the hardware configuration that will work best.
It is very easy to increase the number of problems that can be solved in parallel to fit in a given time window.
It is very easy to integrate into production applications using very simple REST APIs.
Only the time used to solve problems is billed.

Alain.chabrier@ibm.com

@AlainChabrier

https://www.linkedin.com/in/alain-chabrier-5430656/

Written by AlainChabrier