Using Shell Scripts to Control Data Flows Created in Watson Applications

Arron Harden
IBM watsonx Assistant
3 min read · Mar 26, 2018

IBM Watson offers a collection of REST APIs for creating, running, managing, and troubleshooting data flows to allow your applications to easily integrate with Data Refinery.

A flow can read data from a large variety of sources, process that data in a runtime engine using pre-defined operations or custom code, and then write it to one or more targets. The runtime engine can handle large amounts of data so it’s ideally suited for reading, processing, and writing data at volume.

A key requirement for many organisations is the ability to integrate different systems, and for many years the humble shell script has been one of the most convenient tools for the job, thanks to its ubiquity and suitability for quick prototyping. Once something can be shown to work in a shell script, moving it to another scripting language becomes a lot easier.

To demonstrate the use of some of the APIs for controlling data flows, I’ve created a sample shell script with reusable functions to perform the following operations:

  • Authenticate with IBM Cloud and generate a bearer token (see the example after this list)
  • Run an existing data flow
  • Wait for an in-progress data flow run to complete
  • Collect simple metrics (rows read and written) for a completed data flow run
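
Generating the bearer token in the first step uses the standard IBM Cloud IAM token exchange. As a minimal sketch, assuming APIKEY holds your platform API key (older scripts may target a legacy IAM host rather than iam.cloud.ibm.com):

```
# Exchange an IBM Cloud platform API key for an IAM bearer token
curl -s -X POST "https://iam.cloud.ibm.com/identity/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=${APIKEY}"
```

The JSON response includes an access_token field, which the script extracts and sends as an Authorization: Bearer header on every subsequent API call.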

Before you can run this script, you need a Platform API key for your IBM Cloud account; see the IBM Cloud documentation for instructions on creating one.

You also need to create two data flows by using Watson Studio or Watson Knowledge Catalog. Once you’ve created them, log in to https://dataplatform.ibm.com/ and open one of the data flows you want to run. Copy the data flow ID and the project ID from your browser’s URL (both IDs appear in the URL when a data flow is open). Do the same for the other data flow you want to run.

The sample shell script will simply run an existing data flow, wait for it to finish, and then run another data flow, waiting for that one to finish as well. This demonstrates how multiple data flows can easily be run one after another. With simple adjustments, multiple data flows can also be run in parallel.
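
For example, assuming a run_and_wait helper like the one sketched in the script below, going from sequential to parallel execution is just a matter of launching background jobs:

```
# Sequential: each data flow run completes before the next one starts
run_and_wait "$MY_DATA_FLOW_1_ID"
run_and_wait "$MY_DATA_FLOW_2_ID"

# Parallel: start both runs as background jobs, then wait for all of them
run_and_wait "$MY_DATA_FLOW_1_ID" &
run_and_wait "$MY_DATA_FLOW_2_ID" &
wait
```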

Here’s the output from running the script with my own data flows:

[Example output from script]

The shell script itself is shown below with the API calls separated into reusable functions. Before running the script, remember to replace the following environment variables that are defined at the bottom of the script.

  • APIKEY (your IBM Cloud platform API key)
  • MY_PROJECT_ID (the project ID containing the data flows)
  • MY_DATA_FLOW_1_ID (the ID of the first data flow to run)
  • MY_DATA_FLOW_2_ID (the ID of the second data flow to run)
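
Here is a minimal sketch of the script’s structure. The IAM token exchange uses the standard IBM Cloud endpoint, but the /v2/data_flows paths, query parameters, and JSON field names are assumptions; consult the Watson Data API documentation referenced below for the exact request and response shapes. The sketch requires curl and jq.

```
#!/bin/bash
# Hedged sketch of the sample script. The /v2/data_flows endpoints and JSON
# field names are assumptions; check the Watson Data API spec for details.
set -euo pipefail

HOST="https://api.dataplatform.ibm.com"   # assumed API host

# Exchange the platform API key for a bearer token (standard IBM Cloud IAM).
authenticate() {
  curl -s -X POST "https://iam.cloud.ibm.com/identity/token" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=${APIKEY}" \
    | jq -r '.access_token'
}

# Start a run of an existing data flow and print the new run's ID
# (the response field name is an assumption).
run_data_flow() {
  local flow_id=$1
  curl -s -X POST \
    "${HOST}/v2/data_flows/${flow_id}/runs?project_id=${MY_PROJECT_ID}" \
    -H "Authorization: Bearer ${TOKEN}" \
    | jq -r '.metadata.asset_id'
}

# Poll a run every few seconds until it leaves the queued/running states.
wait_for_run() {
  local flow_id=$1 run_id=$2 state="queued"
  while [ "$state" = "queued" ] || [ "$state" = "running" ]; do
    sleep 5
    state=$(curl -s \
      "${HOST}/v2/data_flows/${flow_id}/runs/${run_id}?project_id=${MY_PROJECT_ID}" \
      -H "Authorization: Bearer ${TOKEN}" \
      | jq -r '.entity.state')
    echo "Run ${run_id} state: ${state}"
  done
}

# Print simple metrics (rows read and written) for a completed run;
# the summary field is an assumption about the run resource's shape.
show_metrics() {
  local flow_id=$1 run_id=$2
  curl -s \
    "${HOST}/v2/data_flows/${flow_id}/runs/${run_id}?project_id=${MY_PROJECT_ID}" \
    -H "Authorization: Bearer ${TOKEN}" \
    | jq '.entity.summary'
}

# Run a data flow, wait for it to finish, then report its metrics.
run_and_wait() {
  local flow_id=$1 run_id
  run_id=$(run_data_flow "$flow_id")
  wait_for_run "$flow_id" "$run_id"
  show_metrics "$flow_id" "$run_id"
}

# Replace these values with your own before running.
APIKEY="<your IBM Cloud platform API key>"
MY_PROJECT_ID="<the project ID containing the data flows>"
MY_DATA_FLOW_1_ID="<the ID of the first data flow>"
MY_DATA_FLOW_2_ID="<the ID of the second data flow>"

TOKEN=$(authenticate)
run_and_wait "$MY_DATA_FLOW_1_ID"
run_and_wait "$MY_DATA_FLOW_2_ID"
```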

You can find more information about Data Refinery in the announcement blog post: Self-service data preparation with Data Refinery

API Documentation

The data flows API specification can be found in the Watson Data API documentation under Documentation > Data flows.

Tutorials and Notebooks

The Watson Studio Community is a hub of useful blogs, notebooks, tutorials and data sets to get you started.
