Inline data in DO for WML jobs

Alain Chabrier · 3 min read · Sep 22, 2020

This post describes how any type of data can now(*) be passed inline when creating a Watson Machine Learning (WML) batch job for Decision Optimization (DO).

(*) This is possible on the new WML v2 instances.

DO models in WML can be of different types: OPL, docplex, cplex or cp. These types correspond to different ways of formulating the optimization model or problem.

(Figure: types of models and data.)

With the docplex or OPL types, a model formulation can be included in the deployment and reused to solve different instances of the problem. These two modeling approaches are largely equivalent, though each has its pros and cons. Different jobs can then be created on the same deployment, all sharing the same model formulation. Input data for new jobs is provided either as tabular structures (fields and values) or as other, non-tabular structures (e.g. .dat files for OPL models). The DO runtime in WML combines the model and the data to build the problem to solve.

With the cplex or cp types, deployments are created empty. Each solve job then provides the complete problem, in the form of a .lp, .mps or .cpo file.
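To make this concrete, here is a minimal sketch of such a job creation with the Python client. It assumes an existing cplex-type deployment identified by deployment_id, a WML client instance named client, and a hypothetical diet.lp file; it uses the inline content mechanism described later in this post.

import base64

# Minimal sketch (assumptions: deployment_id identifies an existing
# cplex-type deployment; "diet.lp" is a hypothetical file name).
# The complete problem is passed as a base64-encoded .lp file.
with open("diet.lp", "rb") as f:
    lp_content = base64.b64encode(f.read()).decode("UTF-8")

solve_payload = {
    client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [
        {
            "id": "diet.lp",
            "content": lp_content
        }
    ]
}
job_details = client.deployments.create_job(deployment_id, solve_payload)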

Inline tabular data

When the data to be provided to a job is tabular (fields and values), it can be passed inline in the job creation payload:

"input_data": [
{
"id":"diet_food.csv",
"fields" : ["name","unit_cost","qmin","qmax"],
"values" : [ ["Roasted Chicken", 0.84, 0, 10] ]
}
]

With the Python client, you can even provide a pandas DataFrame directly:

import pandas as pd

diet_nutrients = pd.DataFrame([
    ["Calories", 2000, 2500],
    ["Calcium", 800, 1600],
    ["Iron", 10, 30],
    ["Vit_A", 5000, 50000],
    ["Dietary_Fiber", 25, 100],
    ["Carbohydrates", 0, 300],
    ["Protein", 50, 100]
], columns=["name", "qmin", "qmax"])

and

client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [
    {
        "id": "diet_nutrients.csv",
        "values": diet_nutrients
    }
],
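For completeness, this fragment goes into the meta props passed to create_job, along these lines (a sketch; deployment_id is assumed to identify an existing docplex or OPL deployment):

# Sketch: embed the fragment above in a full job creation call.
# deployment_id is assumed to identify an existing deployment.
solve_payload = {
    client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [
        {
            "id": "diet_nutrients.csv",
            "values": diet_nutrients
        }
    ]
}
job_details = client.deployments.create_job(deployment_id, solve_payload)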

The same can be done for the output solution.
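For instance, a tabular solution returned inline in the job details can be rebuilt into a DataFrame along these lines (a sketch; the output id "solution.csv" is just an example and depends on your model):

import pandas as pd

# Sketch: rebuild a DataFrame from tabular output returned inline.
# "solution.csv" is a hypothetical output id; job_details is assumed
# to hold the result of the job creation or a get_job_details call.
for output_data in job_details['entity']['decision_optimization']['output_data']:
    if output_data['id'] == 'solution.csv':
        solution = pd.DataFrame(output_data['values'],
                                columns=output_data['fields'])
        print(solution)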

Referenced data

Until recently, when input or output data was not tabular, the only option was to reference it in some external storage. Several alternatives are supported, such as Cloud Object Storage or databases.

You can see in this documentation how to configure these different references.
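For illustration, a referenced input uses an INPUT_DATA_REFERENCES entry instead of inline data. Here is a minimal sketch, where the connection id, bucket and file name are placeholders, and the exact fields depend on the storage type (see the documentation):

# Sketch only: referencing input data stored in Cloud Object Storage.
# The connection id, bucket and file name are placeholders; the exact
# fields depend on the storage type and are described in the documentation.
client.deployments.DecisionOptimizationMetaNames.INPUT_DATA_REFERENCES: [
    {
        "id": "diet_food.csv",
        "type": "connection_asset",
        "connection": {"id": "<connection asset id>"},
        "location": {"bucket": "<bucket name>", "file_name": "diet_food.csv"}
    }
],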

Inline non-tabular data

It is now possible to create a job with non-tabular inline data. Instead of providing reference information or fields/values, you use the content field. The content can be anything, but it needs to be base64-encoded.

Among other uses, you can use this to provide one or several OPL .dat input files, or a CPLEX .lp or .prm file.

Python example

Here is an example of a job creation payload in Python:

client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [
    {
        "id": dat_file,
        "content": getfileasdata(dat_file)
    }
],

using this method to base64-encode the content:

import base64

def getfileasdata(filename):
    # Read the file and base64-encode its content as a UTF-8 string.
    # (For true binary files, open with 'rb' and skip the first encode step.)
    with open(filename, 'r') as file:
        data = file.read()

    data = data.encode("UTF-8")
    data = base64.b64encode(data)
    data = data.decode("UTF-8")

    return data

Java example

Here is an example of creating the JSON payload in Java:

// Requires java.util.Base64, java.nio.charset.StandardCharsets,
// and a JSON library providing JSONObject.
@Override
public JSONObject createDataFromString(String id, String text) {
    // Base64-encode the content so it can be passed as a string in the payload.
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    String encoded = Base64.getEncoder().encodeToString(bytes);

    JSONObject data = new JSONObject();
    data.put("id", id);
    data.put("content", encoded);

    return data;
}

Output data

In exactly the same way, you can use this mechanism to get optimization output data returned directly in the payload response. For example, you can request the solve log to be returned, as follows:

solve_payload = {
    client.deployments.DecisionOptimizationMetaNames.SOLVE_PARAMETERS: {
        'oaas.logAttachmentName': 'log.txt'
    },
    client.deployments.DecisionOptimizationMetaNames.OUTPUT_DATA: [
        {
            "id": ".*\\.txt"
        }
    ]
}

And then handle the output for the .txt files:

for output_data in job_details['entity']['decision_optimization']['output_data']:
    if output_data['id'].endswith('txt'):
        print(output_data['id'])
        # Decode the base64 content back to text.
        output = output_data['content']
        output = output.encode("UTF-8")
        output = base64.b64decode(output)
        output = output.decode("UTF-8")
        print(output)
        with open(output_data['id'], 'wt') as file:
            file.write(output)

Inline vs referenced data

Using inline data can be easier to set up and use, but keep in mind that including large amounts of data in HTTP requests may lead to issues. For example, some firewalls impose limits on payload size.

On the other hand, while referenced data can be a bit harder to set up (creating credentials, finding the right URLs, etc.), it brings other benefits, such as the ability to log and monitor the inputs and outputs used by different jobs. It is, for example, quite easy to use different buckets in Cloud Object Storage for different applications or jobs, and to replay those jobs later.

Follow me on Twitter: https://twitter.com/AlainChabrier
