Watson Machine Learning for Developers

Understanding the basic problems and workflow (part 1)

--

I am not a Data Scientist, but I am a developer interested in data science and machine learning. I hope you are here because you are as well!

This is the first installment in a series of posts aimed at introducing developers like you and me to the basic machine learning concepts and tools required to get an ML system up and running. I won't spend a lot of time on how to clean and analyze data, or the finer points of how machine learning works, but I will introduce the fundamental concepts you need to get your first system running.

Let’s start by understanding when and why you would use machine learning.

We’ll eventually use the Watson ML service to deploy our model, but the problems and workflow I describe here apply broadly to machine learning.

Predictions

The ultimate goal of a machine learning system is to make a prediction. Here are some examples you may be familiar with:

  1. Predict whether an image is a cat or dog
  2. Predict the value of a home
  3. Predict which products to recommend to a user
  4. Predict which users share the same interests
  5. Predict when to turn, accelerate, or apply the brakes in a self-driving car

Machine learning is all about predictions. If you have a use case where you need to make predictions (and a lot of data), machine learning may be a good fit. How do ML systems make predictions?

It all starts with the data. ML libraries and platforms can make predictions by analyzing massive amounts of data and finding patterns or mathematical formulas that “explain” the data. The data is the most crucial component to a successful ML system. You need to have a lot of it, and it has to be good. Bad data in = bad predictions out.

Let’s go a tad deeper to get a better understanding of how machine learning works.

Data

It can’t be said enough. It all starts with the data, and it has to be good data.

Let’s start with a simple, well-known machine learning example: predicting house prices. Let’s say we have a data set of known houses and their associated prices:

Square Feet   # Bedrooms   Color   Price
-----------   ----------   -----   --------
2,100         3            White   $100,000
2,300         4            White   $125,000
2,500         4            Brown   $150,000

Obviously this is not a lot of data, and not good data, but ignore that for now. Our goal is to build a machine learning system to predict house prices using this data set.
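To keep things concrete, here's one way this tiny data set could be represented in plain Python. This is just a sketch for illustration (field names like `sqft` are my own, not from any particular library):

```python
# The three houses from the table above, as plain Python records.
houses = [
    {"sqft": 2100, "bedrooms": 3, "color": "White", "price": 100_000},
    {"sqft": 2300, "bedrooms": 4, "color": "White", "price": 125_000},
    {"sqft": 2500, "bedrooms": 4, "color": "Brown", "price": 150_000},
]

for h in houses:
    print(f'{h["sqft"]} sq ft, {h["bedrooms"]} bed, {h["color"]}: ${h["price"]:,}')
```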

Predicting the price of a house is a supervised machine learning problem: we know the outcome for a subset of cases (i.e., we know the prices of the houses listed above), and we can use those known outcomes to train an ML system to predict outcomes for new cases (i.e., predict the price of a house that is not in the list). An unsupervised ML problem is one where the system finds structure in unlabeled data on its own, rather than being trained on known outcomes. We'll cover unsupervised learning in a future post.

Specifically, this is a regression problem. A regression problem is one in which you want to predict a real number, like the price of a house. We will also cover binary and multiclass classification (when you want to predict a class or category from a predefined list of values) and clustering (when you want to group data that is similar).

When we build a supervised ML model, we need to specify which variables we want to use to make our predictions. These variables are referred to as features. We know that when a house is 2,100 square feet, has 3 bedrooms, and is white, the price is $100,000. In this example, color is not important to predicting the price of a home, but you could reason that both square footage and the number of bedrooms are. So it makes sense to choose Square Feet and # Bedrooms as our features.

The value we want to predict is the Price. This is referred to as our label.
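In code, separating features from the label is just splitting each record into inputs and the output we want to predict. A minimal sketch in plain Python (column order and names are my own):

```python
# Tiny data set from the table above: (sqft, bedrooms, color, price).
houses = [
    (2100, 3, "White", 100_000),
    (2300, 4, "White", 125_000),
    (2500, 4, "Brown", 150_000),
]

# Features: square feet and bedroom count (color is dropped as uninformative).
X = [(sqft, beds) for sqft, beds, _color, _price in houses]
# Label: the value we want the model to predict.
y = [price for _sqft, _beds, _color, price in houses]

print(X)  # [(2100, 3), (2300, 4), (2500, 4)]
print(y)  # [100000, 125000, 150000]
```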

We’ll use the features in our data set to build a model that can predict the label (Price). That process looks a little like this:

  1. Choose a ML algorithm. We’ll cover some of the common algorithms used in machine learning.
  2. Instruct our ML algorithm to use Square Feet and # Bedrooms as our features and Price as our label (the value we want to predict).
  3. Feed the data set to our ML algorithm to train an ML model that can make predictions. The algorithm will use that data to come up with a mathematical formula for predicting new outcomes.
  4. To predict a price, we feed our ML model a set of features (square footage and number of bedrooms) and in response receive a predicted price.
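The four steps above can be sketched end to end. To be clear, this is not Spark ML; it's a plain least-squares fit with NumPy, assuming a linear relationship between the features and the price, just to show the shape of train-then-predict:

```python
import numpy as np

# Training data: features (sqft, bedrooms) and label (price).
X = np.array([[2100, 3], [2300, 4], [2500, 4]], dtype=float)
y = np.array([100_000, 125_000, 150_000], dtype=float)

# "Train": fit price ~ a*sqft + b*bedrooms + c by least squares.
A = np.hstack([X, np.ones((len(X), 1))])  # add an intercept column
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# "Predict": score a house that is not in the training set.
new_house = np.array([2200, 3, 1], dtype=float)  # 2,200 sq ft, 3 bedrooms
predicted_price = new_house @ coeffs
print(round(predicted_price))  # 112500
```

On this tiny data set the three equations pin the line down exactly; with real data the fit would only approximate the prices, which is the normal case.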

Now that we have data, and I've outlined the general steps for getting from data to a prediction, let's see which tools can help us get there.

Tools

We’ll focus on the tools provided by the IBM Data Science Experience (DSX). Many of the tools are open source and can be run locally or on other platforms, and the general concepts should apply to other hosted machine learning offerings.

Jupyter Notebooks: Notebooks are used by data scientists to clean, visualize, and understand data. DSX uses Jupyter Notebooks, but notebooks come in different flavors. In DSX you code your notebooks in Python or Scala.

Apache Spark™: Spark is a cluster computing platform for analyzing massive amounts of data in-memory. For machine learning to be effective, you need lots of data, so it only makes sense that you have a platform like Spark to help.

Apache Spark ML: Spark ML is a library for building ML pipelines on top of Apache Spark. Spark ML includes algorithms and APIs for supervised and unsupervised machine learning problems.

IBM Watson ML: Watson ML is a service for deploying ML models and making predictions at runtime. Watson ML provides a REST API to your ML models which can be called directly from your application or your middleware.
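As a rough sketch of what calling a scoring endpoint looks like: you POST a JSON payload of feature values and get predictions back. The URL below is a placeholder and the exact payload shape depends on the Watson ML service version, but a fields/values structure like this is typical of scoring APIs:

```python
import json

# Hypothetical scoring endpoint -- a real one comes from your Watson ML service.
SCORING_URL = "https://example.com/deployments/my-model/online"

# Feature names and values for the houses we want priced.
payload = {
    "fields": ["SquareFeet", "Bedrooms"],
    "values": [[2200, 3], [2600, 4]],
}

body = json.dumps(payload)
print(body)
# In a real client you would POST this with an auth token, e.g.:
#   requests.post(SCORING_URL, data=body, headers={"Authorization": "..."})
```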

Let’s see how all these tools work together.

Workflow

Once I have identified a prediction I want to make, and a data set to help make it, I will typically take the following path to build and host my machine learning model:

  1. Create a Jupyter Notebook and import, clean, and analyze the data.
  2. Use Apache Spark ML to build and test a machine learning model.
  3. Deploy the model to Watson ML.
  4. Call the Watson ML scoring endpoint (REST API) to make predictions from a client application or backend service.

This path works for supervised and unsupervised machine learning, and I’ll use it to show you how you can solve regression, classification, and clustering ML problems.

Next steps

In this post, I gave an overview of what you can use machine learning for, a tool chain that you can use to build end-to-end ML systems, and the path I follow to build them. In part two, we’ll follow this path to build an ML system to predict housing prices. I’ll show you how to get from a raw data set to a REST API with just a few lines of code.
