Rapidly build Machine Learning flows with DSX
Create Flows for your data science workloads on Data Science Experience. Flows offer an interactive, graphical environment to ingest and clean data, to train and tune models, and to organize your workflow in a visual, collaborative ecosystem of tools. Take advantage of either Spark or SPSS Modeler runtimes to accelerate model development and deployment.
The interface allows you to build Flows quickly and intuitively. When we’re through with this guide, the entire Flow will look like this:
To get started, create a project in IBM Data Science Experience.
Next, you’ll need to add the data assets we’ll use for the project. There are a few ways to handle this task: programmatically from notebooks, from the project user interface, or from the side bar panel. Find the data here. Then add the data sets to your project.
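If you’d rather take the programmatic route, a minimal sketch in a DSX notebook could look like the following. It assumes you’ve downloaded both CSV files next to the notebook; adjust the paths to wherever you stored them.

```python
import pandas as pd

# Paths are assumptions; point them at wherever you saved the downloads.
customer = pd.read_csv("customer.csv")
churn = pd.read_csv("churn.csv")

# Quick sanity check that both files loaded.
print(customer.shape, churn.shape)
```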
We’re modeling customer churn for a telecommunications company. The data sets contain demographic and other information about customers, plus a label indicating whether or not the customer failed to renew with the company. Notice that the data sets are stored in separate files; we’ll merge them shortly.
Now we’re ready to model the data using Flows. If you’d like more information, refer to the documentation.
Let’s create a new flow from within your project. After clicking (+) New flow, you’ll find this screen.
Choose IBM SPSS Modeler for the runtime. We’ll get to Spark later.
After clicking Create Flow, you’ll be taken to the Flow interface, where you can build flows, inspect your models, and analyze your data. Open the right-hand panel and add both data sets to your Flow. We’re using customer.csv and churn.csv. We need to merge the data sets, joining on the ID column. Then we’ll have the feature data and the label data in one data set. Take a look at how it’s done.
First, add the data.
Then merge the data sets. First, we’ll need to drop a Merge node. Then, we’ll configure the node with Inner Join on the ID column.
Now configure the node.
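If it helps to see the same operation in code, the Merge node’s inner join is roughly equivalent to this pandas sketch; the ID column name comes from the data sets above, and the file paths are assumptions.

```python
import pandas as pd

customer = pd.read_csv("customer.csv")
churn = pd.read_csv("churn.csv")

# Inner join on the shared ID column, mirroring the Merge node's configuration.
merged = customer.merge(churn, on="ID", how="inner")
print(merged.head())
```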
Now that we’ve merged the data sets, let’s do some exploration with the powerful graphing capabilities that DSX provides.
Go to the Palette on the left-hand side and find the Graphs drop-down.
We’re going to visualize a comparison of churn rates by gender. You can build many visualizations using Graphs.
First, add a Distribution node.
This node will produce Outputs once run. You access the Outputs view from the right-hand panel. Flows allow you to intuitively organize all of your assets during the modeling process. You can quickly find and interact with your data, models, graphs, and exports. Run the node and take a look.
We use the graphing functionality to explore relationships in the data. Experiment with the different nodes in the Palette and the fields in the data.
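As a rough code analogue of the churn-by-gender Distribution node, the sketch below cross-tabulates the two fields with pandas. The GENDER and CHURN column names are assumptions based on the data sets described earlier.

```python
import pandas as pd

merged = pd.read_csv("customer.csv").merge(pd.read_csv("churn.csv"), on="ID", how="inner")

# Count churners and non-churners within each gender, similar in spirit
# to what the Distribution node displays.
counts = pd.crosstab(merged["GENDER"], merged["CHURN"])
print(counts)

# Optional bar chart (requires matplotlib).
counts.plot(kind="bar")
```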
Now we’re going to finish building the model. Find the Type node under Field Operations. We’ll connect the data traveling through the Merge node.
Click (+) Add Columns and select all columns except for ID. We will not be using ID in our model because it’s just an index and has no other relevance. Then, for the field CHURN, we select Target under the Role menu and Flag under Measure. That’s how we communicate to the SPSS runtime that we’d like to use CHURN as a label column. You can find more information about these nodes and others in the docs.
Then we’ll build a C5.0 model. The IBM Knowledge Center entry describes the C5.0 algorithm, noting that it
works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.
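SPSS handles the splitting and pruning for you. For readers who want a rough open-source analogue, the sketch below drops the ID column, treats CHURN as the target (mirroring the Type node), and fits a scikit-learn decision tree, since C5.0 itself isn’t available in scikit-learn; the column names and parameters are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

merged = pd.read_csv("customer.csv").merge(pd.read_csv("churn.csv"), on="ID", how="inner")

# Mirror the Type node: exclude the ID index and use CHURN as the label.
X = pd.get_dummies(merged.drop(columns=["ID", "CHURN"]))
y = merged["CHURN"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# CART stands in for C5.0 here; criterion="entropy" makes it split on
# information gain, which is closer to what C5.0 does.
model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```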
Let’s add the node.
Now click the node and select Run. Running will execute the entire flow and produce new nodes and outputs. One of the new nodes is the model golden nugget.
Let’s take a look at its contents. Right-click the model nugget and click View Model on the node.
Now that we’ve created a model in DSX with Flows, we’ll add a Table and an Object Store node from the Outputs and Export menus. The Object Store node allows us to export the predicted records to a familiar file format, such as csv.
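In code terms, the export step amounts to writing the scored records out as a CSV file. Here is a minimal sketch; the scored DataFrame, its values, and the SPSS-style prediction and confidence field names are all illustrative assumptions.

```python
import pandas as pd

# Stand-in for the merged data with the model's prediction and confidence
# appended; the values and SPSS-style field names are illustrative only.
scored = pd.DataFrame({
    "ID": [1, 2, 3],
    "CHURN": ["T", "F", "F"],
    "$C-CHURN": ["T", "F", "T"],      # predicted label (assumed field name)
    "$CC-CHURN": [0.97, 0.88, 0.61],  # prediction confidence (assumed field name)
})

# The Object Store export boils down to producing a CSV like this one.
scored.to_csv("churn_scored.csv", index=False)
```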
After running the Flow, you can view the Table in the Outputs view. The predicted label and confidence appear in the rightmost fields.
Thus far, we’ve built and scored a model using training data. Now we’re going to score new records. Go back to the GitHub repo at the top and get the customer_churn.csv data. By now you know how to add this to a project and drop a data node.
Then, connect the new data node that you’ve just dropped as input to the trained C5.0 model. Drop an Analysis node and connect the model as input. Navigate to the Analysis node and click Run. After running, you’ll be able to view the output.
Double-click on the item for details.
We’re satisfied with 99% accuracy on the test set. You can configure the Analysis node to provide other metrics as well.
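To reproduce the gist of the Analysis node in code, the sketch below retrains the stand-in tree from the earlier example, scores the new records, and reports accuracy. It assumes customer_churn.csv also carries the actual CHURN column, which is what the Analysis node compares predictions against.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Training data, as in the earlier sketches (file and column names assumed).
train = pd.read_csv("customer.csv").merge(pd.read_csv("churn.csv"), on="ID", how="inner")
new = pd.read_csv("customer_churn.csv")

X_train = pd.get_dummies(train.drop(columns=["ID", "CHURN"]))
X_new = pd.get_dummies(new.drop(columns=["ID", "CHURN"])).reindex(
    columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
model.fit(X_train, train["CHURN"])

# The Analysis node's headline figure is essentially this accuracy number.
predictions = model.predict(X_new)
print("accuracy:", accuracy_score(new["CHURN"], predictions))
```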
We’ve completed the guide. If you followed along, you have loaded data in IBM Data Science Experience and created a new Flow. Then, you explored your data and created visualizations that you can use to gain insight and to share with your team. Finally, we developed a model using SPSS implementations of machine learning algorithms and then tested the performance of this model on unseen data.
In the second part of this tutorial series, I’ll cover Model Deployment and the powerful Spark Runtime.
This guide is based on the work of Elena Lowery, an experienced Analytics Architect at IBM.