Predict Customer Churn by Building and Deploying Models Using Watson Studio Flows
The content for this tutorial came from a session at IBM’s Think conference, held March 20–22, 2018, titled ‘Accelerating Insight from Data by Building and Deploying Models Using Watson Studio Flows’.
Tutorial Update (May 1, 2018): We updated the ‘Adding and merging data using the flow’ section [Section 3] to match changes to the process of importing data assets into a flow.
Working with Watson Studio
The purpose of this lab session is to use the Modeler flows within Watson Studio to build and deploy a model for a telecommunications company to predict customer churn. Two different datasets will be provided and merged as input for the predictive model. One of the datasets contains information about the customers and which of the company’s services they use. The other set contains call records for the customers.
We will cover:
- Creating a Watson Studio Project
- Establishing a project association with a Spark instance
- Adding data assets to the project
- Building a Modeler Flow — preparing & merging the data in the flow, building the model, evaluating the model
- Comparing a CHAID tree model with an Auto Classifier
- Saving your Modeler Flow in your project
1. Create a project
All analytical assets within Watson Studio are organized into ‘Projects’. Projects are workspaces which can include data assets, Notebooks, Flows and models (among other items). Projects can be personal or shared with a team of collaborators.
1. From the ‘Welcome’ page of Watson Studio click on ‘New project’.
2. Enter a project name, and select your target IBM Cloud Object Storage Instance (we’ll help you create a storage instance if you don’t have one already). Press Create.
2. Add data assets to project
Data assets are added to a project to make them available for any of the tools included in Watson Studio. Data assets added to a project are also available for any of the collaborators sharing that project to use. Examples of data assets are files, such as .csv or .json or database tables added through a data connection.
1. Click on the ‘Community’ tab in the top bar. From here you can see content in the Watson Studio Community, which includes data sets.
2. Navigate to the data set ‘Calls by customers of a Telco company’, then click on the add (+) icon on the community card for this data set, select your project and click ‘Add’. Once this completes it will display ‘Added’. This may take up to a minute.
3. From the Community add the data set ‘Customers of a Telco including services used’, and once it displays ‘Added’ click on the ‘View Project’ link.
4. Within the project’s ‘Assets’ tab you will see the ‘Data assets’ section, which should now contain the two datasets added above. Click on the ‘Calls by customers of a Telco company.csv’ name to preview the data. From here you will see the description and name on the right. Click on the name, rename it to ‘calls’, then click ‘Apply’.
5. Return to the project by clicking on the project name in the navigation breadcrumb, then perform similar steps to rename ‘Customers of a Telco including services used.csv’ to ‘customers’.
3. Building a flow in Modeler
1. From the navigation bar select ‘Add to project’ then ‘Modeler flow’.
2. Provide a name and use the default runtime ‘IBM SPSS Modeler’, and click ‘Create’.
Adding and merging data using the flow
1. Open the node palette on the left side of the screen using the button in the toolbar. Add a ‘Data Asset’ node from the Import category in the palette.
2. Double click the node to open the editor, then click the ‘Change data asset’ button. This will open the asset browser, allowing you to select from the available data assets in your project.
3. Select the ‘calls’ dataset and click OK. This will close the asset browser. Then click ‘Save’ in the node editor to apply your changes.
This dataset contains the call records for the Telco customers. Each customer may have made multiple calls, some of which may have been dropped — meaning the call ended abnormally (perhaps because the phone lost service).
4. Right click on the ‘calls’ node, or click on the three dots within the node, and select Preview to examine this data set.
As you can see, this dataset contains five columns: from, to, dt, duration and dropped. However, only two columns in this dataset will be useful for building the predictive model. The ‘from’ column contains the customer’s phone number, and as this is unique to each customer (and is common with the ‘number’ field in the ‘customers’ dataset) it can be used as an ID. The ‘dropped’ column is a flag where 1 is yes (meaning the call ended abnormally) and 0 is no (meaning the call ended normally). This column could be an important input to the predictive model, as a reasonable hypothesis to test is:
If a customer has a large number of their calls end abnormally, are they more likely to churn?
The ‘to’ column is the number dialed for that call and the ‘dt’ column is the date of the call.
5. Add the ‘customers’ dataset to the canvas and preview it. Take a look at the Profile and Visualization tabs to get a better idea of the shape of your data.
The ‘customers’ dataset contains a row for each customer and various columns with data about the customer. This includes things like the customer’s age and gender, as well as which deals and offers provided by the Telco company that customer makes use of. It also includes the number of times the customer has called the company service desk. The ‘number’ column contains the customer’s phone number, and this maps directly to the ‘from’ column in the ‘calls’ dataset we just looked at.
In order to work with these datasets it will be useful to read the data and instantiate the fields within the flow. To do this use a Type node, which traverses the data and determines the field types and metadata.
6. Add a Type node from the node palette on the left hand side. The Type node is a ‘Field Operations’ node (or you can use the search bar).
7. Connect the ‘calls’ and Type nodes together by dragging a link from the output port of the source node to the input port of the Type node.
8. Double click the Type node and click on ‘Configure Types’ then click ‘Read Values’. Once this operation completes click OK to close the table dialog, then click ‘Save’.
9. Add another Type node to the flow and connect the customers node to it, open its settings and click on ‘Configure Types’ then click ‘Read Values’, and then ‘Save’.
The ‘calls’ dataset potentially contains multiple rows for each customer as multiple calls may have been made and will be logged as separate rows in the table. However, the ‘customers’ dataset contains only a single row for each customer. It is therefore necessary to take the multiple rows for each customer’s calls and aggregate them together, taking an average for the dropped column to get a ratio of the calls for each customer that were dropped.
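Conceptually (outside the flow), this Aggregate step can be sketched with pandas on made-up data; the column names mirror the tutorial’s datasets, but the code is illustrative and not part of Modeler itself:

```python
import pandas as pd

# Toy 'calls' data: one row per call, multiple calls per customer
calls = pd.DataFrame({
    "from": ["555-0100", "555-0100", "555-0101", "555-0101", "555-0101"],
    "dropped": [1, 0, 0, 0, 1],   # 1 = call ended abnormally
})

# Group by the customer's number and take the mean of 'dropped',
# mirroring the Aggregate node's 'Group by' + Mean settings
agg = (calls.groupby("from", as_index=False)
            .agg(dropped_Mean=("dropped", "mean"),
                 Record_Count=("dropped", "size")))
print(agg)
```

Each unique phone number now has exactly one row, with `dropped_Mean` giving the ratio of that customer’s calls that were dropped.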
10. To do this, add an Aggregate node from the node palette on the left hand side. The Aggregate node is a ‘Record Operations’ node (or you can use the search bar).
11. Connect the Type node for the calls dataset and the Aggregate node together by dragging a link from the output port of the Type node to the input port of the Aggregate node.
12. Open the Aggregate node and click on the ‘Add Columns’ for the ‘Group by’ section. Select the ‘from’ field in the field picker and click ‘OK’.
As we are grouping the records by the ‘from’ column which, as discussed earlier, can be considered an ID, each unique value will have a single row coming out of the Aggregate node.
13. Click on the ‘Aggregations’ link to open the Aggregations panel. Add the ‘dropped’ field to the aggregation table using the field picker. For this field, we need to determine what aggregation function will be used to combine the multiple values for each customer into a single value. In this case mean makes sense so click the edit button and ensure that only the ‘Mean’ checkbox is checked, then ‘Save’ the node settings.
14. Preview the Aggregate node to ensure that the data is as expected. There should now be three columns in the dataset: from, dropped_Mean and Record_Count. Click the ‘Visualizations’ tab and start typing dropped_Mean into the Columns text box; it will autocomplete, allowing you to select the field. Click ‘Visualize Data’. A histogram will be generated showing the distribution and frequency of values for the dropped_Mean field.
15. Return to the flow by clicking the flow name in the navigation breadcrumb.
16. Add a Filter node from the node palette and connect it to the Aggregate node. Use the Filter node to rename the ‘dropped_Mean’ field to ‘dropped_ratio’, and also rename the ‘from’ field to ‘number’ then click ‘Save’.
The next step is to merge the ‘calls’ and ‘customers’ datasets together and in order to do that a key field that has common values for each customer in both datasets is necessary. As discussed earlier the customers’ phone numbers can be used as this ID. As we have renamed the ‘from’ field to ‘number’ it is common with the equivalent field in the ‘customers’ dataset.
17. Add a Merge node to the canvas and connect the ‘customers’ dataset and then the ‘calls’ dataset to the Merge input port, by connecting first the Type node from customers to the Merge node, then the Filter node from calls to the Merge node.
The order in which the inputs are connected to the Merge node is important, as the setting we will use, ‘Partial outer join’, takes the first input as the primary.
18. In the Merge node settings tab change the ‘Merge method’ to ‘Keys’ and select the ‘number’ field as the key. Here we are stating that the ‘number’ field is in both datasets and has unique values for each row which match in both datasets (i.e. customer A in dataset 1 is the same person as customer A in dataset 2).
19. Select ‘Partial outer join’ as the join type. This means that incomplete records from either dataset (where there is no merge to be performed) will be retained in the output dataset. Click the ‘Select Dataset for Outer Join’ link and make sure the ‘customers’ dataset is checked as the primary.
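On made-up data, the merge the flow performs can be sketched in pandas; with ‘customers’ checked as the primary dataset, the partial outer join behaves like pandas’ left join:

```python
import pandas as pd

customers = pd.DataFrame({
    "number": ["555-0100", "555-0101", "555-0102"],
    "churned": ["No", "Yes", "No"],
})
# Aggregated calls, with 'from' already renamed to 'number'
calls_agg = pd.DataFrame({
    "number": ["555-0100", "555-0101"],
    "dropped_ratio": [0.5, 0.33],
})

# Keep every row from the primary ('customers') dataset, even customers
# with no matching call records; their dropped_ratio comes back null
merged = customers.merge(calls_agg, on="number", how="left")
print(merged)
```

Customer 555-0102 made no calls, so it survives the join but gets a null `dropped_ratio`, which is exactly the situation the next section deals with.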
Preparing the data for modelling
A partial outer join will include all records from both datasets. If there are any customers that only appear in the ‘customers’ dataset then they will still be included in the output dataset, but the columns added from the ‘calls’ dataset will be null for those rows. A null value is not useful for modeling, and in this situation the fact that a customer has made zero calls may itself be useful information for the model. We therefore need to convert the null values to ‘0’.
1. Add a Filler node and connect the Merge node to it.
2. In the Filler node add the ‘dropped_ratio’ field to the ‘Fill in fields’ list and select ‘Null values’ in the ‘Replace’ dropdown, and click ‘Save’. The default value for the ‘Replace with’ control is 0 so this is all we need to do here.
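The Filler node’s behaviour here is equivalent to a simple fill-with-zero on the merged data, sketched below on toy values:

```python
import pandas as pd
import numpy as np

merged = pd.DataFrame({
    "number": ["555-0100", "555-0102"],
    "dropped_ratio": [0.5, np.nan],   # null: this customer made no calls
})

# Replace nulls with 0, mirroring the Filler node's 'Null values'
# replace condition with 'Replace with' left at its default of 0
merged["dropped_ratio"] = merged["dropped_ratio"].fillna(0)
print(merged)
```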
Next we need to divide the data into training and testing sets for the model. The model is built using the training partition, and the testing partition is then used to measure its accuracy on data it has never seen before.
3. Add a Partition node to the canvas and connect it up. Open the node editor and ensure that the training and testing partitions use the default value of 50%. This means half of the data will be used to train and half used to test.
4. To make model building repeatable using the same partitions (meaning the same sets of customers will be used to train and test the model if it is rebuilt) check the ‘Use unique field to assign partitions’ field and select ‘customer_id’ in the dropdown that appears.
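One way to picture how a unique field makes partition assignment repeatable is to hash the ID and bucket on the result; this is an illustrative sketch, not Modeler’s actual algorithm:

```python
import hashlib
import pandas as pd

customers = pd.DataFrame({"customer_id": [f"C{i:04d}" for i in range(10)]})

def assign_partition(customer_id: str, train_pct: int = 50) -> str:
    """Deterministically assign a record to a partition from its ID, so the
    same customer lands in the same partition every time the flow is rerun."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return "train" if int(digest, 16) % 100 < train_pct else "test"

customers["partition"] = customers["customer_id"].map(assign_partition)
print(customers)
```

Because the assignment depends only on the ID, rebuilding the model uses the same train/test split each time.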
The model won’t need to use all fields in the data. By default all fields are given the ‘input’ role, meaning the model will use that field and try to find relationships between it and the target. For fields like ‘number’ or ‘customer_id’, which have no bearing on whether a customer churns, using them as inputs makes no sense, so we will use the Type node to stop the model using them.
5. Add a Type node from the palette and connect it up.
6. Open the node editor and click the ‘Configure Types’ link to open the wider editor.
7. As mentioned, all fields use the ‘input’ role by default, so we only need to add the fields we want to change to the type table. Add the ‘churned’ field along with the ID fields ‘number’ and ‘customer_id’.
8. Using the dropdown in the table set the role of churned to ‘Target’.
9. For all the other fields set the role to ‘None’. This means the model will ignore them.
Building the model
1. Add a CHAID node from the palette and connect it up.
The CHAID node is a tree model which iteratively splits the data into groups based on significant differences in field values. The model finds the most significant split first and divides the data, for example it may split the data by gender. Then for each subgroup it repeats the process, forming a tree-like diagram. When a new row of data is scored with the model the row effectively travels from the top of the tree down through this tree diagram based on the values for each field that split the data. When the row reaches the final level of the tree diagram, called the leaf node, a prediction is made based on the majority target value for training records contained in that node. The bigger the majority the higher the confidence in that prediction.
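scikit-learn has no CHAID implementation (CHAID chooses splits using chi-squared tests), but an ordinary decision tree trained on synthetic churn-like data illustrates the same prediction-plus-confidence output; everything below is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Synthetic inputs: a dropped-call ratio and a count of service-desk calls
X = np.column_stack([rng.uniform(0, 1, n), rng.integers(0, 10, n)])
# Synthetic target: churn is more likely when the dropped ratio is high
y = (X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.2, n) > 0.8).astype(int)

# An analogous tree learner (not CHAID) with a limited depth
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

pred = tree.predict(X)                     # the '$R-CHURNED'-style prediction
conf = tree.predict_proba(X).max(axis=1)   # confidence, between 0 and 1
```

Each leaf predicts the majority class of the training records it holds, and the class proportion in that leaf is the confidence, exactly the intuition described above.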
2. With the Type node determining the target field the CHAID node label should change to ‘churned’. We will run the node with the default settings but feel free to open the node editor to take a look.
3. Click the ‘Run’ button in the canvas toolbar. Once execution has completed a new model node will appear on the canvas.
4. Preview the new model node and confirm that the two new fields from the model are included ($R-CHURNED — the prediction, $RC-CHURNED — the confidence in that prediction as a value between 0 and 1).
5. Open the context menu on the model (by right-clicking, or by clicking the options glyph on the node) and click ‘View Model’. This opens the model viewer, which contains information, statistics and charts about the model. Use the tabs in the navigation pane to view the different pages about the model.
6. In particular the ‘Predictor Importance’ chart shows useful information about which input fields are most useful for making the prediction (including the dropped_ratio field we derived earlier).
7. Also look at the tree diagram and hover on the different nodes to see how the training records were split.
Evaluating the model
Now that a model has been built we can evaluate it to determine how accurate it is at predicting customer churn.
1. Add an Analysis node from the outputs section of the node palette and connect the generated model node to it.
2. Open the Analysis node editor and select the ‘Coincidence matrices’ checkbox then save the change.
3. Using the node context menu run the Analysis node. When complete it will generate an output object and the flyout panel containing your outputs will open. Double click on the new analysis object to open it.
The analysis output contains the accuracy figures for the model as well as a coincidence matrix showing the correct and incorrect predictions made by the model.
The accuracy and coincidence matrix items are divided into the training and testing partitions created earlier in the flow. It is expected that the accuracy for the testing partition is lower than for the training partition, since this is data the model did not see while being built.
The coincidence matrix is useful for determining if the model has a problem with predicting one category over another. For example, the model may be worse at predicting a customer staying than it is at predicting a customer churning.
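The accuracy figure and coincidence matrix can be reproduced by hand on a toy set of predictions; a coincidence matrix is simply a cross-tabulation of actual against predicted values:

```python
import pandas as pd

# Toy actual vs. predicted churn values for six customers
results = pd.DataFrame({
    "churned":   ["Yes", "Yes", "No", "No", "No", "Yes"],
    "predicted": ["Yes", "No",  "No", "No", "Yes", "Yes"],
})

# Coincidence (confusion) matrix: actual values as rows, predictions as columns
matrix = pd.crosstab(results["churned"], results["predicted"])
accuracy = (results["churned"] == results["predicted"]).mean()
print(matrix)
print(f"accuracy = {accuracy:.2%}")
```

The off-diagonal cells show where the model errs; if one of them dominates, the model is biased toward one outcome in the way described above.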
4. From within the Analysis output click on the flow name in the header bar to return to your flow.
5. Build an evaluation chart to get a graphical representation of the quality of our model — to do this add an Evaluation node from the graphs section of the node palette and connect it to your model.
6. We will run this node with default settings to build a ‘Gains’ chart, but feel free to open the editor to see what other settings are available. Run the Evaluation node using the context menu.
7. A new evaluation chart output item should be added to the flyout panel. Double click this to open the chart in a new breadcrumb.
The SPSS Modeler Knowledge Center description of Gains charts is:
“Gains charts. Cumulative gains charts always start at 0% and end at 100% as you go from left to right. For a good model, the gains chart will rise steeply toward 100% and then level off. A model that provides no information will follow the diagonal from lower left to upper right (shown in the chart if Include baseline is selected).”
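Computing a cumulative gains curve by hand makes that definition concrete; with toy churn outcomes and model scores:

```python
import numpy as np

# Actual churn flags and model scores (probability of churn) for ten customers
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
score  = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# Sort records by score, highest first, then take the cumulative share of
# all churners captured: this is the curve a cumulative gains chart plots
order = np.argsort(-score)
gains = np.cumsum(actual[order]) / actual.sum()
print(gains)
```

A good model concentrates churners among the highest scores, so the curve rises steeply at the left before levelling off at 100%.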
Build an Auto Model to compare classifiers
The SPSS Modeler runtime includes auto modeling nodes, which build several models of the same broad type (either classification or numeric) and then ensemble them together to provide a prediction. The auto modeler can also be useful for comparing different model types and their accuracy.
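As a rough picture of what ensembling means here, a majority vote over component-model predictions can be sketched as below (the Auto Classifier’s actual ensembling method is configurable in the node; this is only an illustration):

```python
import numpy as np

# Predictions from three hypothetical component models for five customers
preds = np.array([
    ["Yes", "No",  "Yes", "No", "Yes"],   # model 1
    ["Yes", "Yes", "No",  "No", "Yes"],   # model 2
    ["No",  "No",  "Yes", "No", "Yes"],   # model 3
])

# Majority vote across the component models gives the ensemble prediction
votes = (preds == "Yes").sum(axis=0)
ensemble = np.where(votes > preds.shape[0] / 2, "Yes", "No")
print(ensemble)
```

Deselecting a model in the ‘Use’ column (described below) simply removes its row from this vote.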
1. Add an Auto Classifier from the node palette and connect the Type node to it.
2. The Auto Classifier builds a default selection of classification models and keeps the top few models based on accuracy. Run the node.
3. When the Auto Classifier model is added to the canvas view it using the context menu.
4. The Auto Model is a special case in that it contains multiple individual models. These can be seen in the ‘Models’ table in the viewer along with information about each model. You can view the individual models using the action menu in the table row.
5. The ‘Use’ checkbox allows you to deselect one or more models from this list so that they are not included in the ensemble when the overall Auto Modeler is scored. Keep all the models selected and return to the flow using the breadcrumb in the header bar.
Evaluating the Auto Classifier and comparing with the CHAID model
To evaluate and compare the Auto Classifier and the CHAID model, we will need to connect them together before passing the predictions to the evaluation graph node.
1. Using the context menu on the generated Auto Classifier disconnect it from the Type node (note the dashed ‘model refresh’ link will remain).
2. Connect the CHAID model output port to the input port of the Auto Classifier model node and then disconnect the Evaluation node and connect the Auto Classifier to it so that the flow looks like this:
3. Run the Evaluation node to build a combined ‘Gains’ chart, then view it.
As both model prediction fields are being passed to the Evaluation node it will plot both on the same graph allowing a comparison. The models are likely to be similar in performance but note whichever has the generally higher line — this is the model with better performance at predicting customer churn. $R-CHURNED is the CHAID model, and $XF-CHURNED is the Auto Classifier.
You could also use an analysis node to compare the performance of the two models.
Congratulations! You’ve built two models predicting customer churn and evaluated them to compare their performance.
In this tutorial we have created a project in Watson Studio, added some data from the community and then used the Modeler Flows product to build a model that answers the question: Will a customer churn?
With the variety of other model types available in Modeler Flows you can build a model to answer practically any question and get real value from your data!
I’ve been a UX designer for IBM for six years and previously worked on the SPSS suite of data science products. My particular areas of interest are graphical representations of data flows and simplifying data science through good interaction design.