Machine Learning using AWS ML

Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience without being explicitly programmed. Amazon Machine Learning (Amazon ML) is a cloud-based service that makes it easy for developers of all skill levels to use machine learning technology. In this post, let’s look at how we can use Amazon ML to create machine learning models. Assuming that you already know the basic machine learning concepts, let us dive straight into Amazon ML.

Before creating a machine learning model in Amazon ML, let us first go over the key concepts that it uses.

Datasources

A datasource is an object that contains metadata about the input data given to Amazon ML. Amazon ML reads the input data, computes descriptive statistics, gathers the schema and other information, and stores them as part of the datasource object. Currently, Amazon ML only supports creating datasources from data in Amazon S3 and Amazon Redshift. One thing to remember when creating a datasource is that it does not store a copy of your input data; it only stores a reference to it. So if our input data resides in an S3 bucket and we move or change the S3 file, Amazon ML will no longer be able to create an ML model from that input data.

ML Models

An ML model is a mathematical model that generates predictions by finding different patterns in your data. Presently, Amazon ML supports three types of ML models.

  • Binary classification
  • Multi-class classification
  • Regression

Binary classification, multi-class classification, and regression all come under supervised learning, where we provide the model with training data that includes the known target values. Unsupervised learning, by contrast, works on data without known target values, and none of the three model types belongs to it.

Binary classification (logistic loss function + SGD)
Predicts values that can only fall into one of two categories, such as true or false (e.g. whether a person has diabetes, or whether a person might get a housing loan).

Multi-class classification (multinomial logistic loss + SGD)
Predicts values that belong to a limited set of predefined categories (e.g. which type of transportation a person might use: bus, train, or car).

Regression (squared loss function + SGD)
Predicts a numeric value (e.g. the number of patients per day, or the income of a person).
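
To make the three loss functions named above concrete, here is a minimal Python sketch that evaluates each one for a single example. These are the textbook definitions, not Amazon ML's internal code, and the function names are made up for illustration.

```python
import math
from typing import Sequence


def logistic_loss(score: float, label: int) -> float:
    """Logistic loss for binary classification (label is 0 or 1)."""
    signed = 1.0 if label == 1 else -1.0
    return math.log(1.0 + math.exp(-signed * score))


def multinomial_logistic_loss(scores: Sequence[float], label_index: int) -> float:
    """Softmax cross-entropy for multi-class classification."""
    top = max(scores)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[label_index]


def squared_loss(prediction: float, target: float) -> float:
    """Squared loss for regression."""
    return (prediction - target) ** 2
```

In each case the loss shrinks as the prediction moves toward the true target, which is what stochastic gradient descent (SGD) tries to achieve step by step during training.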

Evaluations

Evaluations measure the quality of ML models and determine whether a model is performing well. Metrics such as AUC, F1-score, accuracy, precision, and recall are used to determine the quality.
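
As a rough illustration of what some of these metrics mean, the sketch below computes accuracy, precision, recall, and F1-score from a list of actual and predicted 0/1 labels. The function name and the sample labels are invented for this example; Amazon ML computes these values for us during an evaluation.

```python
def classification_metrics(actual, predicted):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("at least one observation is required")
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 5 observations, 3 of them predicted correctly.
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```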

Batch Predictions

Batch predictions asynchronously generate predictions for multiple input observations. This is useful when a large number of records need predictions: rather than predicting them one at a time, we can predict them all in a single run.

Real-time Predictions

Real-time predictions synchronously generate predictions for individual observations. This is useful in scenarios such as interactive web applications, where predictions need low latency.

Those are the key concepts used in Amazon ML. Now it is time to get our hands dirty and create an ML model. For this blog post, let’s create a machine learning model that answers the question of whether a specific person will take a loan or not.

Let’s first head to the Amazon Machine Learning service in the console. Under Services, it is in the Machine Learning category. If you are launching it for the first time, you will be greeted with the following screen.

Here you are provided with resources for getting started. For now, let’s click Get started and begin creating our machine learning model. On the next page, click Launch under Standard setup to start creating a datasource.

Creating Datasource

Before creating a datasource in Amazon ML, there are a couple of things we need to do. First, we must save our data in comma-separated values (.csv) format, as this is the only file format that datasources support. Then we need to analyze the data and do feature processing on it. Feature processing is the process of transforming attributes to make them more meaningful. Examples of common feature processing are:

  • Replacing missing or invalid data
  • Forming Cartesian products of one variable with another (for variables that seem to be related)
  • Non-linear transformations, such as making numerical values categorical (if income < 5,000 then low, if 5,000 ≤ income < 50,000 then medium, else high)
  • Domain-specific features
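
A minimal Python sketch of the first three kinds of feature processing could look like the following. The function names, the "unknown" placeholder, and the sample values are assumptions made for illustration; the income boundaries come from the example in the list.

```python
def fill_missing(value, default="unknown"):
    """Replace a missing or blank value with a placeholder."""
    if value is None or str(value).strip() == "":
        return default
    return value


def cartesian_feature(first, second):
    """Combine two related categorical values into one feature."""
    return f"{first}_{second}"


def income_bucket(income):
    """Turn a numeric income into a categorical value."""
    if income < 5000:
        return "low"
    if income < 50000:
        return "medium"
    return "high"
```

For example, cartesian_feature("married", "tertiary") yields the single combined category "married_tertiary", which lets a linear model learn something specific about that combination.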

After transforming our data, we need to upload it to an Amazon S3 bucket in order to hand it over to Amazon ML. For this post, let us use the banking dataset that is provided by Amazon themselves. You can download this dataset from this link. The question we want to predict is whether a given person will take a loan or not. So the first thing we need to do is convert yes/no in the loan column to 1/0 and rename the column to target, because that is our target attribute (make sure to replace unknown values as well). Also, make sure to delete the column y.
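
The transformation described above can be sketched with Python's standard csv module. The function name and the tiny sample are invented, and dropping rows whose loan value is neither yes nor no is only one possible way to deal with unknown values; you may prefer to replace them instead.

```python
import csv
import io


def prepare_banking_csv(csv_text):
    """Rename loan to target (1/0) and drop column y."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = [
        "target" if name == "loan" else name
        for name in reader.fieldnames
        if name != "y"
    ]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        loan = row.pop("loan").strip().lower()
        row.pop("y", None)
        if loan not in ("yes", "no"):
            continue  # assumption: skip rows with an unknown target value
        row["target"] = "1" if loan == "yes" else "0"
        writer.writerow(row)
    return output.getvalue()


# A made-up three-row sample with only a few of the real columns.
sample = (
    "age,job,loan,y\n"
    "30,admin.,yes,no\n"
    "41,technician,no,yes\n"
    "50,services,unknown,no\n"
)
cleaned = prepare_banking_csv(sample)
```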

After transforming the data, we need to upload it to Amazon S3 in order to create the datasource. To do that, create a new bucket in S3 and upload the CSV file.

Now that the input data is available in an S3 bucket, we can continue on to creating the datasource. After selecting Launch on the previous screen, the next page is the create datasource page. Here, specify the S3 location and provide a datasource name, then click Verify. During verification, Amazon ML will ask for permission to access the S3 bucket. Choose Yes.

After a couple of seconds, Amazon ML will verify the datasource and let you know.

On the next page, we need to define a schema for the datasource. By default, Amazon ML scans the input data and auto-generates a schema for us, but we can modify the auto-generated schema. First, select Yes for the radio button Does the CSV contain the column names. When defining data types for our attributes, we must select from the following predefined categories.

Binary: choose for an attribute that has only two possible states, such as yes or no
Categorical: choose for an attribute that takes on a limited number of unique values
Numeric: choose for an attribute that takes a quantity as its value
Text: choose for an attribute that is a string of words
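
Behind the wizard, the schema is a small JSON document. The sketch below builds one in Python using the Amazon ML schema field names; the helper name and the handful of columns are chosen for illustration and do not cover the full banking file.

```python
import json

VALID_TYPES = {"BINARY", "CATEGORICAL", "NUMERIC", "TEXT"}


def build_schema(attribute_types, target="target"):
    """Build an Amazon ML style schema for a CSV file with a header row."""
    for name, attribute_type in attribute_types.items():
        if attribute_type not in VALID_TYPES:
            raise ValueError(f"unsupported type for {name}: {attribute_type}")
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": attribute_type}
            for name, attribute_type in attribute_types.items()
        ],
    }


# Illustrative subset of columns, one per data type where possible.
schema = build_schema(
    {"age": "NUMERIC", "job": "CATEGORICAL", "target": "BINARY"}
)
schema_json = json.dumps(schema, indent=2)
```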

On the next page, we need to select our target attribute. For the option asking whether we are creating an ML model, select Yes. After that, select the target attribute from the schema. (You only need to specify a target attribute if you will use the datasource for training and evaluating an ML model.)

On the Row ID page, for Does your data contain an identifier?, make sure that No, the default, is selected. Next, on the review page, review the datasource and select Create datasource.
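
The same datasource can also be created programmatically. The sketch below only builds the arguments for the create_data_source_from_s3 call of the boto3 machinelearning client, so it runs without AWS credentials. The IDs, the bucket name, and the one-attribute schema are placeholders, not values from this walkthrough.

```python
import json


def build_datasource_request(datasource_id, name, s3_uri, schema):
    """Build keyword arguments for create_data_source_from_s3."""
    return {
        "DataSourceId": datasource_id,
        "DataSourceName": name,
        "DataSpec": {
            "DataLocationS3": s3_uri,
            "DataSchema": json.dumps(schema),
        },
        # Statistics are needed when the datasource is used for training.
        "ComputeStatistics": True,
    }


# Placeholder values for illustration only.
datasource_request = build_datasource_request(
    "ds-banking-example",
    "Banking data",
    "s3://my-example-bucket/banking.csv",
    {
        "version": "1.0",
        "targetAttributeName": "target",
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [{"attributeName": "target", "attributeType": "BINARY"}],
    },
)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("machinelearning").create_data_source_from_s3(**datasource_request)
```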

Creating ML Model

To train an ML model we need to specify the following parameters.

  • Input training datasource
  • Name of the data attribute that contains the target to be predicted
  • Required data transformation instructions
  • Training parameters to control the learning algorithm

Since we are creating this ML model using the wizard, the datasource that we have just created is selected automatically. The name of the target attribute is also selected automatically in this step. For simplicity, we are going to use the Default (Recommended) settings that are generated by Amazon ML. If we want to specify further data transformations and set training parameters, we can select Custom. When we select Review, we can review the settings we have provided and the settings that Amazon ML generated automatically.

Here, under evaluation data, you can see that our data is split into 70% training data and 30% evaluation data. If we select Custom, we can customize these parameters. In custom mode, the following training parameters can be configured.

Maximum model size: the maximum model size in bytes. The default value is 100 MB. You are charged according to the model size.

Maximum number of passes over the data: to discover patterns, Amazon ML makes multiple passes over your data. The default value is 10, and the maximum is 100. More passes may let the model discover more patterns and may improve its quality.

Shuffle type for training data: how the data is shuffled when it is split. By default, Amazon ML does not shuffle the training data.

Regularization type and amount: the performance of a complex ML model suffers when it picks up too many patterns from the training data, including noise. Regularization helps prevent the model from overfitting the training data.
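
When a model is created through the API instead of the wizard, these four settings map to string-valued entries in the Parameters argument of create_ml_model. The sketch below builds such a request; the helper name, the IDs, and the chosen numbers are illustrative only and are not recommended settings.

```python
def build_ml_model_request(model_id, name, training_datasource_id,
                           max_model_bytes, max_passes, shuffle_type,
                           l2_regularization):
    """Build keyword arguments for create_ml_model (binary classification)."""
    if shuffle_type not in ("auto", "none"):
        raise ValueError("shuffle_type must be 'auto' or 'none'")
    return {
        "MLModelId": model_id,
        "MLModelName": name,
        "MLModelType": "BINARY",
        "TrainingDataSourceId": training_datasource_id,
        # Amazon ML expects every training parameter as a string.
        "Parameters": {
            "sgd.maxMLModelSizeInBytes": str(max_model_bytes),
            "sgd.maxPasses": str(max_passes),
            "sgd.shuffleType": shuffle_type,
            "sgd.l2RegularizationAmount": str(l2_regularization),
        },
    }


# Placeholder IDs and example values.
model_request = build_ml_model_request(
    "ml-banking-example", "Banking loan model", "ds-banking-train",
    max_model_bytes=32 * 1024 * 1024, max_passes=10,
    shuffle_type="auto", l2_regularization=1e-6,
)

# With AWS credentials configured:
# import boto3
# boto3.client("machinelearning").create_ml_model(**model_request)
```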

In this post, we are going to use the default values provided by Amazon ML. After reviewing, select Create ML model. In this step, Amazon ML adds your model to the processing queue. When Amazon ML creates your model, it applies the defaults and performs the following actions:

  • Splits the training datasource into two sections, one containing 70% of the data and one containing the remaining 30%
  • Trains the ML model on the section that contains 70% of the input data
  • Evaluates the model using the remaining 30% of the input data
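
Outside the wizard, a split like this is expressed with the DataRearrangement field of a datasource's DataSpec, which takes a JSON string. The sketch below produces the two strings for a 70/30 split; the helper name is made up, and the strings would be used when creating one training and one evaluation datasource from the same file.

```python
import json


def split_rearrangement(percent_begin, percent_end):
    """Return a DataRearrangement string selecting a slice of the data."""
    if not 0 <= percent_begin < percent_end <= 100:
        raise ValueError("expected 0 <= begin < end <= 100")
    return json.dumps(
        {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}
    )


training_split = split_rearrangement(0, 70)      # first 70% for training
evaluation_split = split_rearrangement(70, 100)  # remaining 30% for evaluation
```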

While your model is in the queue, Amazon ML reports the status as Pending. While Amazon ML creates your model, it reports the status as In Progress. When it has completed all actions, it reports the status as Completed. Wait for the evaluation to complete before proceeding.

That is it. We have created our first ML model using Amazon ML. But the steps do not stop there. The next step is to evaluate the ML model that we have created.

Evaluating Model

During evaluation, Amazon ML generates industry-standard quality metrics. One such metric is AUC (Area Under the Curve), which expresses the performance quality of your ML model. AUC measures the ability of the model to predict a higher score for positive examples than for negative examples. To review the AUC of the model that we have created, go to the ML model summary page and, in the ML model report pane, choose Evaluations and then the model.
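
The definition of AUC given above can be turned directly into code: take every pair of one positive and one negative example and count how often the positive one gets the higher score. The sketch below does exactly that on made-up scores. It is a simple quadratic-time illustration, not the way Amazon ML computes the metric.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: one positive (0.3) is ranked below one negative (0.8).
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 1.0 means every positive example was scored above every negative one, while a value near 0.5 is no better than random guessing.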

We can adjust the score threshold in order to change the accuracy of the model. The ML model generates a numeric prediction score for each record in a prediction datasource, and then applies a threshold to convert these scores into binary labels of 0 (for no) or 1 (for yes). By changing the score threshold, we change how the model assigns these labels. To set the threshold, select Adjust score threshold.

The default threshold provided by Amazon ML is 0.5. You can fine-tune this value to meet your requirements. Adjusting this value changes the level of confidence that the model must have in a prediction before it considers the prediction to be positive. It also changes how many false negatives and false positives you are willing to tolerate in your predictions. If only the highest-likelihood predictions should be labeled positive, you can set a higher threshold. An example is testing for a disease, where a false positive result may be critical.
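
The trade-off can be seen on a small made-up example. The sketch below applies a threshold to a list of scores and counts the resulting false positives and false negatives. Treating a score exactly equal to the threshold as positive is an assumption of this sketch.

```python
def apply_threshold(scores, threshold=0.5):
    """Convert prediction scores into 0/1 labels."""
    return [1 if score >= threshold else 0 for score in scores]


def count_errors(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold."""
    predicted = apply_threshold(scores, threshold)
    false_positives = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    false_negatives = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    return {"false_positives": false_positives, "false_negatives": false_negatives}


# Made-up scores and true labels.
scores = [0.2, 0.55, 0.7, 0.95]
labels = [0, 0, 1, 1]
at_default = count_errors(scores, labels, 0.5)
at_strict = count_errors(scores, labels, 0.9)
```

Raising the threshold from 0.5 to 0.9 removes the false positive in this sample but introduces a false negative, which is the kind of trade-off the threshold setting controls.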

Generating Predictions

Amazon ML can generate two types of predictions.

  • Real-time predictions
  • Batch predictions

Let us first look at real-time predictions. Real-time predictions are used when predictions from the ML model need low latency. They are ideal for interactive websites and mobile applications. For applications that require real-time predictions, we must create a real-time endpoint for the ML model, and we accrue charges while the endpoint is available. However, we can try the real-time prediction feature under Tools without creating a real-time endpoint. In the ML model report, select Try real-time predictions.

Here you can paste a record and choose to create a prediction. Amazon ML then populates the predicted label in real time.
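
From application code, a real-time prediction goes through the predict call of the boto3 machinelearning client, which does need a real-time endpoint. The sketch below only builds the arguments for that call. The model ID, the record, and the endpoint URL are placeholders; the real URL is returned when the endpoint is created.

```python
def build_predict_request(model_id, record, endpoint_url):
    """Build keyword arguments for predict; record values must be strings."""
    return {
        "MLModelId": model_id,
        "Record": {str(key): str(value) for key, value in record.items()},
        "PredictEndpoint": endpoint_url,
    }


# Placeholder values for illustration only.
predict_request = build_predict_request(
    "ml-banking-example",
    {"age": 32, "job": "services", "marital": "divorced"},
    "https://realtime.machinelearning.us-east-1.amazonaws.com",
)

# With AWS credentials and a real-time endpoint in place:
# import boto3
# response = boto3.client("machinelearning").predict(**predict_request)
# label = response["Prediction"]["predictedLabel"]
```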

To generate batch predictions, select Batch predictions in the Amazon Machine Learning console. Choose new batch prediction and, on the next page, select the ML model.

To generate a batch prediction, we need to have uploaded the batch data to an S3 bucket. After uploading the batch data, we can point to it in Locate the input data. After that, configure the batch data, where you will again be asked to grant Amazon ML permission to access the S3 object. For S3 destination, choose an S3 bucket; this is the location where the results will be uploaded. The batch prediction then runs, and on completion the results are uploaded to the specified S3 location.
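
The equivalent API call is create_batch_prediction on the boto3 machinelearning client. As a sketch, the code below builds its arguments without contacting AWS. All IDs and the bucket name are placeholders, and it assumes a datasource for the batch data has already been created.

```python
def build_batch_prediction_request(batch_id, name, model_id,
                                   datasource_id, output_uri):
    """Build keyword arguments for create_batch_prediction."""
    if not output_uri.startswith("s3://"):
        raise ValueError("output_uri must be an S3 location")
    return {
        "BatchPredictionId": batch_id,
        "BatchPredictionName": name,
        "MLModelId": model_id,
        "BatchPredictionDataSourceId": datasource_id,
        "OutputUri": output_uri,  # where Amazon ML writes the results
    }


# Placeholder values for illustration only.
batch_request = build_batch_prediction_request(
    "bp-banking-example",
    "Banking batch prediction",
    "ml-banking-example",
    "ds-banking-batch",
    "s3://my-example-bucket/batch-output/",
)

# With AWS credentials configured:
# import boto3
# boto3.client("machinelearning").create_batch_prediction(**batch_request)
```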

The above is the basic outline of creating a machine learning model using Amazon ML. More complex machine learning models can be created by following the AWS documentation.