Rapidly build Machine Learning flows with DSX

Adam Massachi
6 min read · Oct 25, 2017


Create Flows for your data science workloads on Data Science Experience. Flows offer an interactive, graphical environment to ingest and clean data, to train and tune models, and to organize your workflow in a visual, collaborative ecosystem of tools. Take advantage of either Spark or SPSS Modeler runtimes to accelerate model development and deployment.

The interface allows you to build Flows quickly and intuitively. When we’re through with this guide, the entire Flow will look like this:

Completed Flow

To get started, create a project in IBM Data Science Experience.

Next, you’ll need to add the data assets we’ll use for the project. There are a few ways to handle this task — programmatically from notebooks, from the project user interface, and from the sidebar panel. Find the data here. Then add the data sets to your project.

We’re modeling customer churn for a telecommunications company. The data sets contain demographic and other information about customers, along with a label indicating whether each customer failed to renew with the company. Notice that these data sets are stored in separate files; we’ll merge them shortly.

Add data assets in DSX
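If you’d rather handle this step programmatically from a notebook, a minimal pandas sketch looks like the following, assuming you’ve downloaded customer.csv and churn.csv from the repo linked above into your working directory:

```python
import pandas as pd

# Assumes the two CSVs from the repo are in the working directory
customers = pd.read_csv("customer.csv")
churn = pd.read_csv("churn.csv")

print(customers.head())
print(churn.head())
```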

Now we’re ready to model the data using Flows. If you’d like more information, refer to the documentation.

Let’s create a new flow from within your project. After clicking (+) New flow, you’ll find this screen.

New flow with SPSS Modeler runtime

Choose IBM SPSS Modeler for the runtime. We’ll get to Spark later.

After clicking Create Flow, you’ll be taken to the Flow interface, where you can build flows, inspect your models, and analyze your data. Open the right-hand panel and add both data sets to your Flow. We’re using customer.csv and churn.csv. We need to merge the data sets, joining on the ID column; then we’ll have the feature data and the label data in one data set. Take a look at how it’s done.

First, add the data.

Then merge the data sets: first, drop a Merge node onto the canvas, then configure it with an Inner Join on the ID column.

Connect the data sets to the Merge node

Now configure the node.

Configure the node to join on ID
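For reference, the Merge node’s inner join corresponds to a single line of pandas, continuing the sketch above:

```python
# Equivalent of the Merge node configuration: inner join on ID
merged = customers.merge(churn, on="ID", how="inner")
print(merged.shape)
```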

Now that we’ve merged the data sets, let’s do some exploration with the powerful graphing capabilities that DSX provides.

Go to the Palette on the left-hand side and find the Graphs drop-down.

We’re going to visualize how the churn rate differs by gender. You can build many visualizations using Graphs.

First, add a Distribution node.

Adding and configuring the Distribution node

This node will produce Outputs once run. You can access the Outputs view from the right-hand panel. Flows allow you to intuitively organize all of your assets during the modeling process: you can quickly find and interact with your data, models, graphs, and exports. Run the node and take a look.

We use the graphing functionality to explore relationships in the data. Experiment with the different nodes in the Palette and the fields in the data.
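If you want the same picture in a notebook, here’s a rough pandas equivalent of the Distribution node, continuing the sketch above. Note that GENDER is a hypothetical column name used for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# GENDER is assumed for illustration; CHURN is the label field used later
counts = pd.crosstab(merged["GENDER"], merged["CHURN"])
counts.plot(kind="bar", stacked=True, title="Churn by gender")
plt.show()
```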

Now we’re going to finish building the model. Find the Type node under Field Operations and connect it to the output of the Merge node.

The Type node

Click (+) Add Columns and select all columns except ID. We won’t use ID in our model because it’s just an index with no predictive value. Then, for the CHURN field, select Target under the Role menu and Flag under Measure. That’s how we tell the SPSS runtime to use CHURN as the label column. You can find more information about these nodes and others in the docs.
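In notebook terms, the Type node configuration roughly amounts to separating the features from the label, continuing the sketch above:

```python
# Drop the ID index column; treat CHURN as the binary (flag) target
X = merged.drop(columns=["ID", "CHURN"])
y = merged["CHURN"]
```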

Then we’ll build a C5.0 model. The IBM Knowledge Center entry describes the C5.0 algorithm, noting that it

works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.
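C5.0 itself is proprietary and not available in scikit-learn, but if you want to experiment with the same ideas in a notebook, an entropy-based decision tree with cost-complexity pruning is a loose stand-in: it splits on information gain and then prunes branches that contribute little, much as the quote describes. A sketch, continuing from the feature/label split above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# One-hot encode categorical features and hold out a validation split
X_train, X_val, y_train, y_val = train_test_split(
    pd.get_dummies(X), y, test_size=0.2, random_state=42
)

# criterion="entropy" splits on information gain; ccp_alpha prunes
# weak branches after the tree is grown
clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.001)
clf.fit(X_train, y_train)
print(clf.score(X_val, y_val))
```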

Let’s add the node.

Add the C5.0 node from the Modeling menu

Now click the node and select Run. Running executes the entire flow and produces new nodes and outputs. One of the new nodes is the model nugget, a golden icon containing the trained model.

Running this flow will produce a trained model, available in the model nugget

Let’s take a look at its contents. Right-click the model nugget and click View Model on the node.

Viewing the trained model

Now that we’ve created a model in DSX with Flows, we’ll add a Table node and an Object Store node from the Outputs and Export menus. The Object Store node lets us export the predicted records to a familiar file format, such as CSV.

The Output and Export nodes

After running the Flow, you can view the Table in the Outputs view. The predicted label and confidence appear in the rightmost fields.

The Table output
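The notebook analogue of the Table output and Object Store export is to attach the prediction and its confidence to the scored records and write a CSV. The $C-/$CC- field names below simply mimic SPSS Modeler’s convention for generated fields:

```python
# Attach the predicted label and its confidence, then export to CSV
scored = X_val.copy()
scored["$C-CHURN"] = clf.predict(X_val)
scored["$CC-CHURN"] = clf.predict_proba(X_val).max(axis=1)
scored.to_csv("churn_scored.csv", index=False)
```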

Thus far, we’ve built and scored a model using training data. Now we’re going to score new records. Go back to the GitHub repo at the top and get the customer_churn.csv data. By now you know how to add it to a project and drop a data node.

Connecting the test data and the Analysis node

Then, connect the new data node that you’ve just dropped as input to the trained C5.0 model. Drop an Analysis node and connect the model as input. Navigate to the Analysis node and click Run. After running, you’ll be able to view the output.

The output of the analysis

Double-click on the item for details.

The details of the analysis

We’re satisfied with 99% accuracy on the test set. You can configure the Analysis node to provide other metrics as well.
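For comparison, the notebook equivalent of the Analysis node is a quick metrics report on held-out records, continuing the sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Compare predicted labels against the true labels of held-out records
y_pred = clf.predict(X_val)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))
```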

We’ve completed the guide. If you followed along, you loaded data into IBM Data Science Experience and created a new Flow. Then you explored your data and created visualizations that you can use to gain insight and to share with your team. Finally, you developed a model using SPSS implementations of machine learning algorithms and tested its performance on unseen data.

In the second part of this tutorial series, I’ll cover Model Deployment and the powerful Spark Runtime.

This guide is based on the work of Elena Lowery, an experienced Analytics Architect at IBM.
