DO for WML throughput and latency

Alain Chabrier
5 min read · Jun 2, 2020


While supporting customers migrating from DO CPLEX Cloud, I often get questions about the throughput and overhead of solving models with Decision Optimization (DO) on Watson Machine Learning (WML). I answer some of these questions in this post.

DO is used in a wide variety of ways, which leads to a wide range of model sizes and complexities. I have worked on customer cases where a single model ran overnight and took 10 hours to solve. I have also worked on customer cases where a typical model solved to optimality in 1 second, but the customer needed to solve many such models.

When a model takes 10 minutes to solve, you generally don't care much whether the cloud or a distributed architecture adds 2 or even 10 seconds of delay to your execution. On the contrary, you expect the infrastructure to provide significantly bigger hardware than your laptop, with more CPUs and more memory, which can be used to solve your model better.

The story is different when your models take 1 second on average to solve. In that case, you quickly start wondering about the possible overhead of solving remotely.

Basic OPL and LP overhead test

So I ran a test that executes a batch of jobs and measures their running times. Most of the code to reproduce these tests is available in this GitHub repository: https://github.com/achabrier/assets/blob/master/DOforWMLwithJava/
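The core of such a test is a simple timing loop: run the same job repeatedly and record each wall-clock duration. The sketch below is not the repository's actual code (its helper names differ); the WML deployment and job-submission calls are abstracted behind a `Runnable` so the harness itself is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a job-timing harness. In the real test, the Runnable
// would create a WML job on an existing deployment and wait for completion.
public class JobTimer {
    public static List<Double> timeJobs(Runnable job, int count) {
        List<Double> seconds = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            job.run(); // placeholder for: submit one job, poll until done
            seconds.add((System.nanoTime() - start) / 1e9);
        }
        return seconds;
    }

    public static void main(String[] args) {
        List<Double> timings = timeJobs(() -> { /* solve one model */ }, 5);
        System.out.println("runs recorded: " + timings.size());
    }
}
```

Timing the whole request (submission, solve, and result retrieval) rather than just the solve is what makes the overhead visible.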

The following code runs the test with OPL models:

main.fullDietPythonFlow(false, 100);

It creates a deployment with a diet OPL model, runs 100 jobs on it with the data passed inline, and then deletes the deployment. The outcome is as follows:

May 25, 2020 2:10:47 PM com.ibm.Sample fullDietPythonFlow
INFO: Full flow with Diet
May 25, 2020 2:10:47 PM com.ibm.wmlconnector.impl.ConnectorImpl lookupBearerToken
INFO: Lookup Bearer Token from IAM (ASYNCH)
May 25, 2020 2:10:51 PM com.ibm.Sample createAndDeployDietPythonModel
INFO: Create Python Model
May 25, 2020 2:10:53 PM com.ibm.Sample createAndDeployDietPythonModel
INFO: model_id = bcd0e85b-00d5-4175-a1f5-0b34282af5f4
May 25, 2020 2:10:54 PM com.ibm.Sample createAndDeployDietPythonModel
INFO: deployment_id = d35a1f0b-7a4f-4704-b2de-14c6cf3c0f24
May 25, 2020 2:11:13 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 19.6440307
May 25, 2020 2:11:19 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 5.3815484
May 25, 2020 2:11:21 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0217193
May 25, 2020 2:11:23 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0118736
May 25, 2020 2:11:25 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.032722
May 25, 2020 2:11:27 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0216463
May 25, 2020 2:11:29 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0821861
May 25, 2020 2:11:31 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0499582
May 25, 2020 2:11:33 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0187806
/* ... */
May 25, 2020 2:14:51 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0242382
May 25, 2020 2:14:53 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 1.9995151
May 25, 2020 2:14:55 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 1.993785
May 25, 2020 2:14:57 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 1.9972234
May 25, 2020 2:14:59 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.0116058
May 25, 2020 2:15:01 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 1.9936405
May 25, 2020 2:15:03 PM com.ibm.Sample fullDietPythonFlow
INFO: Total time: 2.029247
May 25, 2020 2:15:03 PM com.ibm.Sample deleteDeployment
INFO: Delete deployment

It shows that after the first job, which takes around 20 seconds including hardware start-up, jobs run in about 2 seconds each.

With an LP file, again reusing a single deployment and passing the LP file inline in the job-creation request, the outcome of running 100 jobs is:

May 25, 2020 2:16:09 PM com.ibm.Sample createAndDeployEmptyCPLEXModel
INFO: Create Empty CPLEX Model
May 25, 2020 2:16:10 PM com.ibm.Sample createAndDeployEmptyCPLEXModel
INFO: model_id = ab60ba5a-10b7-4be8-a000-53af8d98db60
May 25, 2020 2:16:10 PM com.ibm.Sample createAndDeployEmptyCPLEXModel
INFO: deployment_id = 063972c9-f242-456d-8f35-ecd4535feeee
May 25, 2020 2:16:35 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 24.5004511
May 25, 2020 2:16:39 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 3.9077609
May 25, 2020 2:16:41 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 2.0885931
May 25, 2020 2:16:43 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 2.7108422
May 25, 2020 2:16:45 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4595121
May 25, 2020 2:16:52 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4397482
May 25, 2020 2:16:53 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.448809
May 25, 2020 2:16:55 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4331356
May 25, 2020 2:16:57 PM com.ibm.Sample fullLPInlineFLow
/* ... */
May 25, 2020 2:19:33 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4314276
May 25, 2020 2:19:35 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.498373
May 25, 2020 2:19:42 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4455167
May 25, 2020 2:19:43 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.4473214
May 25, 2020 2:19:45 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.5876827
May 25, 2020 2:19:46 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 1.6378852
May 25, 2020 2:19:49 PM com.ibm.Sample fullLPInlineFLow
INFO: Total time: 2.0456874
May 25, 2020 2:19:49 PM com.ibm.Sample deleteDeployment
INFO: Delete deployment

This is slightly faster than with OPL, a bit under 2 seconds per job, which can be explained by the fact that there is no need to launch OPL and interpret the OPL model to generate the CPLEX model.

Since the diet model itself takes a bit less than a second to solve with CPLEX, the overhead is quite small.

Note that with such small models, the choice of cluster can matter. From the Côte d'Azur, I used the London cluster; with the Dallas cluster, the average execution time is almost half a second longer.

Improving throughput

For small models like these, it makes no sense to increase the size of the hardware (CPLEX parallelization would not bring significant benefits). On the other hand, since we are creating many jobs, it can be worthwhile to exploit the scaling capabilities of the infrastructure.

With DO for WML, it is very easy to specify a maximum number of nodes to be used, so that several jobs can be run in parallel.
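The effect of capping the node count can be mimicked locally with a bounded thread pool: jobs queue up and at most N run at once, just as WML schedules jobs onto at most N nodes. This is an illustrative analogy, not the service's scheduler.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Local analogy for WML node scaling: a fixed pool of `nodes` workers
// draining `jobs` submitted jobs, returning total wall-clock seconds.
public class ParallelJobs {
    public static double runAll(int jobs, int nodes, Runnable job) {
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        long start = System.nanoTime();
        for (int i = 0; i < jobs; i++) pool.submit(job);
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        // 100 jobs of ~20 ms on 10 workers finish in roughly 10 waves of 20 ms.
        double total = runAll(100, 10, () -> {
            try { Thread.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        System.out.printf("total: %.3fs, per instance: %.4fs%n", total, total / 100);
    }
}
```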

Consider this other test:

main.parallelFullLPInlineFLow("acc-tight4.lp", 10, 100);

It runs another model, one that normally takes 20 seconds to solve, 100 times, using 10 nodes. The outcome is:

May 20, 2020 3:55:14 PM com.ibm.Sample parallelFullLPInlineFLow
INFO: Total time: 201.9855527
May 20, 2020 3:55:14 PM com.ibm.Sample parallelFullLPInlineFLow
INFO: Per instance: 2.049277635

This means it took about 202 seconds in total (hardware start-up included) to run the 100 instances, an average of 2 seconds per instance!

The throughput can be significantly improved by running jobs in parallel.
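The measured figure matches a back-of-the-envelope estimate: 100 jobs spread over 10 nodes run in about 10 waves of 20 seconds, i.e. 200 seconds of wall-clock time, or 2 seconds amortized per instance. A tiny helper (my own illustration, not from the repository) makes the arithmetic explicit:

```java
// Ideal amortized wall-clock time per job when `jobs` identical jobs,
// each taking `solveSeconds`, are spread over `nodes` parallel workers.
public class ThroughputEstimate {
    public static double amortizedSeconds(int jobs, double solveSeconds, int nodes) {
        double waves = Math.ceil((double) jobs / nodes); // sequential batches
        return waves * solveSeconds / jobs;
    }

    public static void main(String[] args) {
        System.out.println(amortizedSeconds(100, 20.0, 10)); // → 2.0
    }
}
```

The observed 201.99 seconds is within a percent of this ideal, so the scheduling overhead of the parallel runs is negligible here.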

You can easily see this with the DO for WML monitor UI code shared recently. While each job still takes 20 seconds, many jobs execute in parallel.

Conclusions

DO for WML allows you to benefit easily from low latency and parallelism.

Follow me on Twitter.


Alain Chabrier

Former Decision Optimization Senior Technical Staff Member at IBM. Opinions are my own; I no longer work for any company.