IBM Open Data for Industries and Cloud Pak for Data Services

General integration approach 1.0 (Part 2)


Quick Recap

Part 1 of this post covered the following steps of the Open Data for Industries and Cloud Pak for Data service integration:

STEP 1: Cloud Pak for Data Analytics Project and Notebook set up

STEP 2: Data refinery and cleansing

STEP 3: Data Ingestion to IBM Open Data for Industries instance

This is Part 2, which will cover:

STEP 4: Data searching and retrieving data from Open Data for Industries instance

STEP 5: Data analysis and prediction using Cloud Pak for Data services

Open Data for Industries and Cloud Pak for Data service integration pipeline: data sources flow into the Watson Studio analytics project, starting with Data Refinery and then the notebook and runtime environments, before being ingested into the IBM ODI instance. From there, the data flows back, via data searching and retrieving, into the notebook and runtime environments for the creation of ML models.
Open Data for Industries and Cloud Pak for Data service integration: data cleansing and analyzing flow

Search and retrieve data from an Open Data for Industries instance

IBM Open Data for Industries supports REST APIs to query and retrieve raw data and metadata. Within a Python notebook, the following steps are needed to retrieve data from an Open Data for Industries instance.

Getting a bearer token from Keycloak

Please refer to Part 1 for details of getting a token from Keycloak.
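As a quick reference, a minimal token request can look like the sketch below. It assumes a password-grant Keycloak client; the route, realm, and credential values are placeholders for your own instance.

import requests

# Placeholder values -- use the Keycloak route, realm, and credentials
# of your own Open Data for Industries instance (see Part 1 for details).
KEYCLOAK_ROUTE = "https://<keycloak-route>"
REALM = "<realm>"

token_url = KEYCLOAK_ROUTE + "/auth/realms/" + REALM + "/protocol/openid-connect/token"
payload = {
    "grant_type": "password",
    "client_id": "<client-id>",
    "client_secret": "<client-secret>",
    "username": "<username>",
    "password": "<password>",
}
# Keycloak expects form-encoded data on the token endpoint.
resp = requests.post(token_url, data=payload)
BEARER_TOKEN = resp.json()["access_token"]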

Set up API calls for data searching and retrieving

When retrieving raw data or metadata from an Open Data for Industries instance, the most commonly used services are the Open Data for Industries Search and Delivery services.

In a notebook, these calls can be wrapped in a small Python helper function (call_odi here is just an illustrative name):

import requests

def call_odi(method, service_path, service_endpoint, body=None):
    # Call an Open Data for Industries REST endpoint and return the JSON response.
    headers = {
        'data-partition-id': OSDU_DATA_PARTITION,
        'authorization': "Bearer " + BEARER_TOKEN,
        "Accept": "application/json",
        "Content-type": "application/json",
    }
    url = ODI_INSTANCE + service_path + service_endpoint
    r = requests.request(method, url, json=body, headers=headers)
    return r.json()
1. Query to get information from the Search service.

The following cURL command shows a search example, with the service path, endpoint, and body filled in:

curl --request POST \
--url <cpd-route>/osdu-search/api/search/v2/query \
--header 'authorization: Bearer <<access_token>>' \
--header 'content-type: application/json' \
--header 'data-partition-id: opendes' \
--data '{
"kind": "opendes:*:*:*",
"returnedFields": [
"id", "kind","data"
],
"query": "id:\"opendes:doc:c5cdb9bb4bb84baa81ccf067c58c2750\""
}'
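The same query can be issued from the notebook with the helper function defined above (call_odi is the hypothetical name introduced earlier in this post, not an Open Data for Industries API):

# Search for a specific record by id via the Search service.
search_body = {
    "kind": "opendes:*:*:*",
    "returnedFields": ["id", "kind", "data"],
    "query": "id:\"opendes:doc:c5cdb9bb4bb84baa81ccf067c58c2750\"",
}
# The response contains the matching records and their metadata.
results = call_odi("POST", "/osdu-search/api/search/v2", "/query", search_body)
print(results)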

2. Based on the search results, the end user can use the Delivery service "GetFileSignedUrl" endpoint to get a signed URL for a specific record, and then fetch its raw data with the requests library:

curl --request POST \
--url <cpd-route>/osdu-delivery/api/delivery/v2/GetFileSignedUrl \
--header 'authorization: Bearer <<access_token>>' \
--header 'content-type: application/json' \
--header 'data-partition-id: opendes' \
--data '{"srns": ["srn:file/segy:mysegy1:"] }'

The following is an example of a signed URL:

# Open Data for Industries uses MinIO to store raw data.
# https://minio-osdu-minio.odi-ibmslb-demo-b7cd7bacf7d92146ece9843b7b89c840-0000.us-south.containers.appdomain.cloud/osdu-seismic-test-data/140435.segy?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210210T022030Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=minio%2F20210210%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=d747a52bf3e401684c1a1dfd15151677bde848b5b07ea2cf00c4fdd67d5792e7

3. Once we get a signed URL, we can use the following code snippet to fetch the raw data and save it to the project as a data asset, in its original format or converted to CSV:

import os, lasio, requests
from urllib.parse import urlparse

# record name, taken from the last path segment of the signed URL
datapath = os.path.basename(urlparse(record_signed_url).path)
# record content
record = requests.get(record_signed_url)
if record.ok:
    datacontent = record.text

    datapath = "/project_data/data_asset/" + datapath
    # save the raw data as a data asset of the project
    with open(datapath, 'w') as outfile:
        outfile.write(datacontent)

    # convert the LAS well log to CSV with lasio
    las_csv = datapath[:-3] + 'csv'
    print(las_csv)
    lasio.read(datapath).df().to_csv(las_csv, encoding='utf-8')
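Once converted, the CSV data asset can be loaded back into the notebook for analysis like any other project file, for example with pandas:

import pandas as pd

# Load the converted well-log CSV into a dataframe; the first column is the depth index.
df = pd.read_csv(las_csv, index_col=0)
df.describe()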

Data analysis and business decisions


When metadata and raw data from Open Data for Industries are available within a notebook, or saved as data assets of a Cloud Pak for Data analytics project, end users can apply Cloud Pak for Data services to analyze the data and help customers make business decisions.

Data Analysis

Cloud Pak for Data provides Watson Studio (WSL), Watson Machine Learning (WML), and other supplemental services for analyzing data and building models.

Data scientists can analyze data, train ML models, and define decision optimization (DO) models. To train models, they can use SPSS Modeler or AutoAI. The screenshot below shows an in-progress AutoAI run.

An AutoAI flow in Cloud Pak for Data: a progress map runs from "read dataset" to "split holdout data", "read training data", "preprocessing", and "model selection", with not-yet-executed steps ("selected algorithm", "hyperparameter optimization", and "feature engineering") shown as dotted lines.
An in-progress AutoAI flow

The trained models can be promoted to a deployment space and deployed (as shown below), making them available for AI infusion into business processes and for integration with applications or services, such as Palantir.

Model promoted and deployed to the deployment space ("ODI-Demo-Space", with one deployment listed as online)
Model's online deployment and code snippets (the API reference tab, with an endpoint and a Python code snippet)

The deployed model can also be called via the model deployment REST API within notebooks.
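As a sketch, a scoring request against the Watson Machine Learning v4 deployments API can look like the following; the host, deployment ID, field names, and values are placeholders, which you would copy from the deployment's API reference tab:

import requests

# Placeholder scoring endpoint -- copy the real URL from the deployment's API reference tab.
SCORING_URL = CPD_HOST + "/ml/v4/deployments/<deployment_id>/predictions?version=2021-02-10"

payload = {"input_data": [{
    "fields": ["feature_1", "feature_2"],    # placeholder feature names
    "values": [[0.5, 1.2]],                  # one row per prediction request
}]}
headers = {
    "Authorization": "Bearer " + CPD_TOKEN,  # a Cloud Pak for Data platform token
    "Content-Type": "application/json",
}
response = requests.post(SCORING_URL, json=payload, headers=headers)
print(response.json())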

A specific example of the above is using an OpenVINO ML model to predict on seismic data within a notebook. That example does not use Cloud Pak for Data and its services directly, but similar code can also be used within Cloud Pak for Data analytics project notebooks.

Business Analytics

Cloud Pak for Data provides tools and services to analyze and visualize patterns and trends in existing data, helping customers make business decisions.

Cognos Dashboard is one of the base services that Cloud Pak for Data supports.

Within an analytics project, the end user can click "Add to project" -> "Dashboard" to create a new dashboard and select existing data assets to visualize, as the screenshot below shows:

Cognos Dashboard example: a visualization of analytics project data

Besides Cognos Dashboard, Cloud Pak for Data also supports Cognos Analytics and Planning Analytics services for business analytics.

Other service integrations

Cloud Pak for Data provides a platform that integrates many IBM and partner services to support data analytics following the AI Ladder. This post briefly lists the major steps of using Open Data for Industries and several Cloud Pak for Data services to analyze oil and gas domain-specific data.

With the Cloud Pak for Data platform, end users can use many other services for more powerful analyses. The following services are also often used for some use cases:

  • Watson OpenScale to understand how your AI models make decisions, to detect and mitigate bias and drift, and also to increase the quality and accuracy of your predictions;
  • Watson Discovery to extract answers from complex business documents.

Another integration approach and data flow between Open Data for Industries and Cloud Pak for Data services

So far, this post has described one flow of data processing with Open Data for Industries and other Cloud Pak for Data services, following the steps listed below (and also mentioned at the beginning of the post):

  1. Source data refinery and cleansing
  2. Data Ingestion to an IBM Open Data for Industries instance
  3. Data searching and retrieving data from an Open Data for Industries instance
  4. Data analysis and prediction using Cloud Pak for Data services

There is also another possible data flow, as the diagram below shows:

Alternative integration flow: the source data goes into the IBM ODI instance, then, via data searching and retrieving, feeds into the notebook and runtime environments, then into Data Refinery, and finally into data analysis, ML models, and Watson Discovery.
Open Data for Industries and Cloud Pak for Data service integration: data cleansing and analyzing flow
  1. Use data already ingested into an Open Data for Industries instance.
  2. Use a notebook to search and retrieve data from the Open Data for Industries instance, and save it as a data asset of the analytics project.
  3. Refine and cleanse the data with Data Refinery.
  4. Analyze the data and make predictions.

For some customer cases, where data is already loaded into the Open Data for Industries instance, this scenario fits well. Even though the data flow differs from the one described before, the basic techniques and details of these steps, from searching and retrieving data from an Open Data for Industries instance, to data refinery and cleansing, and then to data analysis and prediction, are the same as those described above. So this post can also help with this scenario.

Conclusion

This post highlighted the steps of integrating IBM Open Data for Industries with Cloud Pak for Data services through Python notebooks. It covers technical details that work across different scenarios, including the two described in this post.

Through the use of a notebook, once the data retrieved from Open Data for Industries is saved as data assets of a Cloud Pak for Data analytics project, these data assets can be consumed by many services, following the general practices of Cloud Pak for Data.

IBM Open Data for Industries has a roadmap to further integrate with Cloud Pak for Data services as a data source directly. Please stay tuned for my "Integrate IBM Open Data for Industries with Cloud Pak for Data Services — general integration approach 2.0" post later in 2021.

Some useful links:

  1. Accessing Data from Cloud Pak for Data instance.
  2. Install custom libraries through notebook (WSL).
  3. Analyzing data and building ML models.
  4. Deploying and managing ML models.
  5. AI solutions and Watson services.
  6. Data and AI applications with Palantir for IBM Cloud Pak for Data.
  7. IBM Data and AI Accelerators powered by Cloud Pak for Data.
