[deprecated] Building Your First Machine Learning System

Train your model and deploy it, Watson ML for developers (part 2)

--

Editor’s note: This article has been deprecated. The IBM DSX service has been upgraded to the new IBM Watson Studio platform, which provides additional APIs for more advanced ML and AI applications. Find more recent tutorials here.

In Part 1 I gave you an overview of machine learning, discussed some of the tools you can use to build end-to-end ML systems, and the path I like to follow when building them.

In this post we are going to follow this path to train a machine learning model, deploy it to Watson ML, and run predictions against it in real time.

Look Ahead: In Part 3 we’ll create a small web application and backend to demonstrate how you can integrate Watson ML and make machine learning predictions in an end-user application.

The Model Cafe in the Allston neighborhood of Boston. Image: Toby McGuire.

We are going to use our small data set from Part 1 because the point of this post is to get something up and running quickly — not to actually build an accurate system for making predictions. Here’s the data set:

Square Feet    # Bedrooms    Color    Price
-----------    ----------    -----    --------
      2,100             3    White    $100,000
      2,300             4    White    $125,000
      2,500             4    Brown    $150,000
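If you're curious, the CSV file behind this table is nothing fancy. Here's roughly what it contains (a sketch, not the literal file: the SquareFeet, Bedrooms, and Price header names match the column names the notebook code uses later in this post, and the Color header is my assumption):

SquareFeet,Bedrooms,Color,Price
2100,3,White,100000
2300,4,White,125000
2500,4,Brown,150000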

In Part 1 I talked about the tools I use to build machine learning systems. Before we start building our ML system, let’s set up those tools.

Tool setup

Bluemix/DSX: You’ll need a Bluemix and Data Science Experience account. If you don’t have one, go to https://datascience.ibm.com to sign up. This will create a single account where you can access Bluemix and DSX.

Watson Machine Learning: You’ll need an instance of Watson Machine Learning. You can provision a new instance here.

Apache Spark™: You’ll need a Spark instance, but if you don’t have one now you can create one later.

Now that you’re all set up, let’s follow the process I outlined in Part 1.

Step 1: Identify what you want to predict and the source of your data

We’ve identified what we want to predict (house prices) and the data set we want to use to drive those predictions. I have made the data set available on GitHub:

https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv

This URL is important because we’ll need to pull this data into our Jupyter Notebook in the next step.

Step 2: Create a Jupyter Notebook — import, clean, and analyze the data

Create a Jupyter Notebook

We’re going to analyze our data in a Jupyter Notebook in the IBM Data Science Experience. Jupyter Notebooks are documents that run in a web browser and are composed of cells. Cells can contain markup or executable code. We’ll be coding in Python. I’ll show you how we can import and analyze our data with just three lines of code.

Download the following notebook to your computer:

https://dataplatform.ibm.com/analytics/notebooks/3e83ffa1-f52a-4b76-bbb5-498b6b7f9505/view?access_token=a7dfdd01dbc24c53a5ac9688fbdd32da1b59156117d721fe10d12660f18dd591

Open DSX and create a new project called “Watson ML for Developers”. From here, you can create a new Spark instance for the project or select one you already have.

In the project navigate to Analytic assets and click New notebook. Choose From file. Specify a name, like “House Prices”, and choose the notebook you downloaded above.

Finally, click Create Notebook. You should be taken directly to edit the notebook.

If this is your first time using Jupyter notebooks here are a few tips that you may find helpful (if you are already familiar with Jupyter notebooks, feel free to skip ahead):

1. Always make sure your kernel is running. You should see the status of your kernel in the top right.

2. If your kernel is not running, you can restart it from the Kernel menu. From here you can also interrupt your kernel, or change your kernel (if you want to use a different version of Python or Apache Spark).

3. A notebook is made up of markup and code cells. You can walk through the notebook and execute the code cells by clicking the run button in the toolbar or from the Cell menu.

Import, clean, and analyze the data

Let’s look at the first three code cells in the notebook where we will load and analyze our data. Here’s the first code cell:

import pixiedust

This cell just imports a Python library called PixieDust. PixieDust is an open source helper library that works as an add-on to Jupyter Notebooks, making it easy to import and visualize data.

In the second cell we load our sample data:

df = pixiedust.sampleData("https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv")

This will generate a Spark DataFrame called “df”. A DataFrame is a data set organized into named columns. You can think of it as a spreadsheet, or a relational database table. The Spark ML API uses DataFrames to train and test ML models.
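If you'd like to poke at the DataFrame directly (outside of PixieDust), Spark itself has a couple of handy methods. A quick sketch:

# Print the column names and the types Spark inferred for them
df.printSchema()

# Print the rows as a plain text table
df.show()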

Finally, we’ll call the display function in PixieDust to display our data:

display(df)

It should look something like this:

In this case we are displaying a simple table, but PixieDust also provides graphs and charts to help you understand and analyze your data without writing any code.

In just three lines of code we have imported and analyzed our data set. Now it’s time to do some machine learning!

Step 3: Use Apache Spark ML to build and test a machine learning model

Build a machine learning model

We’re going to build our first ML model in just a handful of cells. To start we need to import the Spark ML libraries that we’ll be using:

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

This is a regression problem (we’re trying to predict a real number), so we are going to use the LinearRegression algorithm in pyspark.ml.regression. There are other regression algorithms, but those are outside of the scope of this post.
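For reference, here's a sketch of a couple of those other algorithms in pyspark.ml.regression. Any of them could be dropped into the pipeline we build below in place of LinearRegression (we won't use them in this post):

# Alternative regressors that ship with Spark ML (illustrative only)
from pyspark.ml.regression import DecisionTreeRegressor, RandomForestRegressor

dt = DecisionTreeRegressor(labelCol='Price', featuresCol='features')
rf = RandomForestRegressor(labelCol='Price', featuresCol='features')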

We are going to build our ML model in just four lines of code. These four lines are in a single cell in our notebook, like so:

assembler = VectorAssembler(
    inputCols=['SquareFeet','Bedrooms'],
    outputCol="features"
)
lr = LinearRegression(labelCol='Price', featuresCol='features')
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)

Let’s break this down, line by line.

First of all we need to specify our features. In the previous post we decided that we would use Square Feet and # Bedrooms as our features. Our ML algorithm expects a single vector of feature columns, so here we use a VectorAssembler to tell our ML pipeline (we’ll talk about pipelines in a minute) that we want SquareFeet and Bedrooms as our features:

assembler = VectorAssembler(
    inputCols=['SquareFeet','Bedrooms'],
    outputCol="features"
)

Next, we create an instance of LinearRegression, the ML algorithm we are going to use. At a minimum, you must specify the features and the labels. There are other parameters you can provide to tweak the algorithm, but they’re not going to do us much good when working with three data points :)

lr = LinearRegression(labelCol='Price', featuresCol='features')
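Just to give you an idea, here's a sketch of the same line with a few of those optional parameters spelled out (the values shown are illustrative; we stick with the defaults in this post):

# Illustrative only -- not used in the notebook
lr = LinearRegression(
    labelCol='Price',
    featuresCol='features',
    maxIter=100,          # maximum number of optimization iterations
    regParam=0.0,         # regularization strength
    elasticNetParam=0.0   # mix between L1 and L2 regularization
)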

Next, we create our pipeline. A Pipeline allows us to specify the steps that should be performed when training an ML model. In this case, we first want to assemble our two feature columns into a single vector — that’s the assembler. Then we want to run it through our LinearRegression algorithm. In upcoming posts I’ll discuss other operations that you’ll run through the pipeline — like converting non-numeric data to numeric data.

pipeline = Pipeline(stages=[assembler, lr])
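As a sneak peek at the kind of extra stage I mean, here's a sketch of how the non-numeric Color column from our data set could be folded in with a StringIndexer before assembling the feature vector (assuming the CSV column is named Color; we won't use this in this post):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Convert the string Color column into a numeric ColorIndex column
color_indexer = StringIndexer(inputCol='Color', outputCol='ColorIndex')

# Include the new numeric column in the feature vector
assembler_with_color = VectorAssembler(
    inputCols=['SquareFeet', 'Bedrooms', 'ColorIndex'],
    outputCol='features'
)

# The indexer is simply an extra stage ahead of the assembler and the algorithm
pipeline_with_color = Pipeline(stages=[color_indexer, assembler_with_color, lr])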

Finally, we pass our DataFrame to the fit method on the pipeline to create our ML model.

model = pipeline.fit(df)

Congratulations, you now have a machine learning model that you can use to predict house prices!
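If you're curious what the model actually learned, the last stage of the fitted pipeline is the trained LinearRegression model, and you can peek at its weights. A small sketch:

# The fitted LinearRegression model is the last stage of the fitted pipeline
lr_model = model.stages[-1]
print(lr_model.coefficients)  # one weight each for SquareFeet and Bedrooms
print(lr_model.intercept)     # the learned intercept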

Test the model

It’s time to test our model. In our example we are going to run a single prediction. In future posts I’ll discuss how you can analyze the accuracy of your model by running a large number of predictions based on your original data set.
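We'll go deeper on accuracy in a future post, but as a rough idea of what that looks like, here's a sketch using Spark's built-in RegressionEvaluator to score the model's predictions against our (tiny) training set:

from pyspark.ml.evaluation import RegressionEvaluator

# Run the model over the training data and measure the error of its predictions.
# With only three rows this is just a smoke test, not a real accuracy measure.
predictions = model.transform(df)
evaluator = RegressionEvaluator(
    labelCol='Price',
    predictionCol='prediction',
    metricName='rmse'
)
print(evaluator.evaluate(predictions))  # root mean squared error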

Here we create a Python function to get our prediction:

def get_prediction(square_feet, num_bedrooms):
    request_df = spark.createDataFrame(
        [(square_feet, num_bedrooms)],
        ['SquareFeet','Bedrooms']
    )
    response_df = model.transform(request_df)
    return response_df

Let’s break this cell down. First of all, in order to generate a prediction against an ML model generated using Spark ML, we need to pass it a DataFrame with the data we want to use in our prediction (i.e., the square footage and # bedrooms for the house price we want to predict). This line of code creates the DataFrame we’ll pass to our model:

request_df = spark.createDataFrame(
    [(square_feet, num_bedrooms)],
    ['SquareFeet','Bedrooms']
)

Then we’ll call transform on the model, passing in the request DataFrame. This returns another DataFrame:

response_df = model.transform(request_df)

Let’s run a prediction for a house that is 2,400 square feet and has 4 bedrooms:

response = get_prediction(2400, 4)
response.show()

The result is a DataFrame that looks like this:

+----------+--------+------------+------------------+
|SquareFeet|Bedrooms| features| prediction|
+----------+--------+------------+------------------+
| 2400| 4|[2400.0,4.0]|137499.99999999968|
+----------+--------+------------+------------------+

Tip: You can use PixieDust to visualize any DataFrame, including this one. If you’ve imported PixieDust and you have a DataFrame, display() is your friend :)
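For example, the prediction we just made could be rendered the same way:

# Visualize the prediction result DataFrame with PixieDust instead of show()
display(response)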

Our ML model returned our features along with a prediction. In this case, it predicted that a house that is 2,400 square feet and has 4 bedrooms should have a price of about $137,500, which falls right between our 2,300 square foot house ($125,000) and our 2,500 square foot house ($150,000).

Step 4: Deploy and test the model with Watson ML

Deploy the model

We’ve trained and tested our machine learning model, but if we want to predict house prices from a web or mobile app it’s not going to do us much good in this notebook. That’s where Watson ML comes in.

In the same notebook, we’re going to deploy this model to Watson ML and create a “scoring endpoint”, or a REST API for making predictions.

The first thing you’ll need to do is specify your Watson ML credentials. You can find them by opening your Watson Machine Learning service in Bluemix and clicking Service Credentials on the left (if you haven’t provisioned the service yet, head to the catalog and create it first).
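For reference, the Service Credentials JSON you’ll copy from Bluemix looks roughly like the sketch below (this is an approximation; the exact fields can vary by plan). The values map directly onto the variables in the next cell:

{
  "url": "https://ibm-watson-ml.mybluemix.net",
  "username": "YOUR_WML_USER_NAME",
  "password": "YOUR_WML_PASSWORD",
  "instance_id": "YOUR_WML_INSTANCE_ID"
}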

Fill in the following cell with your credentials:

service_path = 'https://ibm-watson-ml.mybluemix.net'
username = 'YOUR_WML_USER_NAME'
password = 'YOUR_WML_PASSWORD'
instance_id = 'YOUR_WML_INSTANCE_ID'
model_name = 'House Prices Model'
deployment_name = 'House Prices Deployment'

The next cell initializes some libraries for connecting to Watson ML. These libraries are built into DSX:

from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

Next, we’ll use the same libraries to save our model to Watson ML. We pass the trained model, our data set, and a name for the model — in this case we’re calling it “House Prices Model”:

model_artifact = MLRepositoryArtifact(
    model,
    training_data=df,
    name=model_name
)
saved_model = ml_repository_client.models.save(model_artifact)
model_id = saved_model.uid

The call to save the model returns an object that we store in the saved_model variable, and from it we extract the model’s unique ID. This ID is important because we’ll use it later to create a deployment for the model.

We now have a trained machine learning model deployed to Watson ML, but we still don’t have a way to access it. The next few cells take care of that.

We are going to create a Deployment for our ML model. To do this, we are going to use the Watson ML REST API. The Watson ML REST API uses token-based authentication, so our first step is to generate a token using our Watson ML credentials:

# These libraries are available in DSX notebooks; import them if you haven't already
import json
import urllib3
import requests

headers = urllib3.util.make_headers(
    basic_auth='{}:{}'.format(username, password)
)
url = '{}/v3/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
ml_token = 'Bearer ' + json.loads(response.text).get('token')

Now we can create our deployment. Here we make an HTTP POST to the published_models/deployments endpoint — passing in our Watson ML instance_id and the model_id of our newly saved model.

deployment_url = (service_path
    + "/v3/wml_instances/" + instance_id
    + "/published_models/" + model_id
    + "/deployments/")
deployment_header = {
    'Content-Type': 'application/json',
    'Authorization': ml_token
}
deployment_payload = {
    "type": "online",
    "name": deployment_name
}
deployment_response = requests.post(
    deployment_url,
    json=deployment_payload,
    headers=deployment_header
)
scoring_url = (json.loads(deployment_response.text)
    .get('entity')
    .get('scoring_url'))
print(scoring_url)

The last line above prints the scoring_url parsed from the response received from Watson ML. This is an HTTP endpoint that we can use to make predictions. You now have a deployed machine learning model that you can use to predict house prices from anywhere! You can call it from a front-end application, your middleware, or from a notebook — we’ll do just that next :)

Test the model

For now, we’re going to test our Watson ML deployment from our notebook, but the real value of deploying your ML models to Watson ML is that you can run predictions from anywhere.

In the notebook I created a new function called get_prediction_from_watson_ml. Just like the last function, this one takes the square footage and the number of bedrooms for the house price you would like to predict.

Rather than calling the Spark ML APIs, you can see that this function performs an HTTP POST to the scoring_url we received earlier.

def get_prediction_from_watson_ml(square_feet, num_bedrooms):
    scoring_header = {
        'Content-Type': 'application/json',
        'Authorization': ml_token
    }
    scoring_payload = {
        'fields': ['SquareFeet','Bedrooms'],
        'values': [[square_feet, num_bedrooms]]
    }
    scoring_response = requests.post(
        scoring_url,
        json=scoring_payload,
        headers=scoring_header
    )
    return scoring_response.text

Let’s run the same prediction we ran earlier — a house that is 2,400 square feet and has 4 bedrooms:

response = get_prediction_from_watson_ml(2400, 4)
print(response)

The call to our Watson ML REST API returned our features along with the same prediction we received when we ran our test using Spark ML and the local ML model that we generated.

{
  "fields": ["SquareFeet", "Bedrooms", "features", "prediction"],
  "values": [[2400, 4, [2400.0, 4.0], 137499.99999999968]]
}
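Since the endpoint just returns JSON, pulling the predicted price out of that response is a one-liner. A small sketch, assuming the response shape shown above (json was imported earlier in the notebook):

# Grab the last value of the first row, which is the 'prediction' field
predicted_price = json.loads(response)['values'][0][-1]
print(predicted_price)  # roughly 137500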

Next steps

In this post we built an end-to-end machine learning system using the IBM Data Science Experience, Spark ML, and Watson ML. In just a few lines of code, we imported and visualized a data set, built an ML pipeline and trained an ML model, and made that model available to make predictions from software running anywhere. Although we barely scratched the surface of machine learning, I hope this article gave you a basic understanding of how to build an ML system.

In the next post, I will show you how to consume the Watson ML scoring endpoint from an end-user application. In future posts, I will slowly venture deeper into machine learning with working examples for common ML problems: supervised and unsupervised, binary and multiclass classification, clustering, and more.
