Predicting used car prices with linear regression in Amazon SageMaker — Part 2

Capstone Project for Udacity Machine Learning Nanodegree

Charles Frederic Atienza
Mar 27, 2020

This is the second part of my article detailing my Capstone Project for the Udacity Machine Learning Nanodegree. This part goes through how I trained three supervised models with the data we prepared in the first part.

You can follow along with the notebooks on my repository. The Linear Learner and XGBoost notebooks have to be run on an Amazon SageMaker notebook instance to work properly. Most sections in this article skip some code, so it is best to refer to the notebooks if you want to dig deeper into the implementation.

Machine learning has come a long way and engineers have come up with amazing tools and frameworks to help other engineers. Here, I will employ Amazon SageMaker’s built-in algorithms to train a supervised machine learning model to predict used car prices. I will also build a custom PyTorch model and train it to achieve the same task.

Linear Learner model

First, we will train and test a model using one of Amazon SageMaker's built-in algorithms, LinearLearner. The LinearLearner algorithm is straightforward and easy to use. It can handle binary classification, multiclass classification, and linear regression. In our case, since our desired output is a continuous value, we will use it in its linear regression mode.

Training

Remember that we arranged the train and validation datasets so that the first column is the data label: the price feature. This is a requirement of SageMaker's built-in algorithms when feeding them data in CSV format. For this model, however, we will feed the data in through RecordSet objects, which require the features and labels to be supplied separately.

X_train = train_data[:, 1:]
y_train = train_data[:, 0]
X_validation = validation_data[:, 1:]
y_validation = validation_data[:, 0]

# LinearLearner estimator object instantiation code here

X_train_np = X_train.astype('float32')
y_train_np = y_train.astype('float32')
formatted_train_data = linear.record_set(X_train_np, labels=y_train_np)

X_validation_np = X_validation.astype('float32')
y_validation_np = y_validation.astype('float32')
formatted_validation_data = linear.record_set(X_validation_np, labels=y_validation_np, channel='validation')

We can now train the LinearLearner estimator by supplying the training and validation RecordSet objects.
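
Since the estimator instantiation is one of the pieces skipped above, here is a minimal sketch of how it and the subsequent fit call might look. The instance type, output path, and epoch count are assumptions on my part, and the exact constructor argument names depend on the SageMaker SDK version (older versions use train_instance_count and train_instance_type).

import sagemaker
from sagemaker import LinearLearner, get_execution_role

session = sagemaker.Session()

# hypothetical estimator configuration; predictor_type='regressor' selects linear regression
linear = LinearLearner(role=get_execution_role(),
                       instance_count=1,
                       instance_type='ml.c4.xlarge',
                       predictor_type='regressor',
                       output_path=f's3://{session.default_bucket()}/linear-learner',
                       sagemaker_session=session,
                       epochs=16)

# train on the RecordSet objects; the validation channel was set when its record set was created
linear.fit([formatted_train_data, formatted_validation_data])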

If we take a look at the validation loss at the end of training, we can see that it is a ridiculously high number for a validation loss — 30230839.25. This loss value is this high because, unlike the training dataset, LinearLearner does not normalize the labels of the validation set. Our labels or car prices are usually in the 4000 to 12000 range so it is no surprise that we arrive at such a value for the validation loss.

This, however, does not mean that our model is not learning at all. What we should be paying attention to is how much the validation loss has decreased throughout the training process. It has gone from 31856580.89 to 30230839.25 in 16 epochs.

Evaluation

In order to make predictions from our trained model, we need to deploy it as an endpoint first. Once that’s done, we can load in our test data and run predictions on the endpoint. Here are the LinearLearner model’s predicted prices compared to the true prices. The closer the plot is to a y = x function, the better.

As can be observed, the plot is more bent than straight, implying that the LinearLearner model is overestimating prices as the true price label increases.

R-squared score

To represent numerically how well the LinearLearner model performs, we will use the R-squared score (coefficient of determination). I find that this metric is the best fit given our model's linear regression objective. The R-squared score measures the proportion of the variance in the true prices that is explained by the predictions; it is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.
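
Here is a sketch of how the R-squared score and the distance metrics reported below might be computed with scikit-learn and NumPy, assuming y_test holds the true test prices and y_pred holds the predictions returned by the deployed endpoint (both variable names are assumptions).

from sklearn.metrics import r2_score
import numpy as np

# coefficient of determination between true and predicted prices
print('R-squared score:', r2_score(y_test, y_pred))

# absolute distance between each true price and its prediction
distances = np.abs(y_test - y_pred)
print('Min distance:', distances.min())
print('Max distance:', distances.max())
print('Mean distance:', distances.mean())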

R-squared score: 0.7372068518275521

When run through R-squared, our model's accuracy score on our test dataset turns out to be 0.74. The R-squared score can become negative but can only go as high as 1.0. Given that our score is over the 0.50 mark, we can say that our model did an okay job. We can also see below some metrics of our model's predictions.

Min distance: 0.3203125
Max distance: 69425.931640625
Mean distance: 3898.957455078954

Looking at these metrics, what stands out is the abnormally high max distance value. This value, and why it is actually a good thing when the project’s objective is kept in mind, will be explained in the conclusion at the end.

XGBoost model

For our second model, we will use another of Amazon SageMaker's built-in algorithms, XGBoost. XGBoost has a more advanced architecture and is generally a more complicated algorithm. I have not had the opportunity to fully understand it yet, so I will not try to summarize what it does here. It does support regression, which is why I opted to use it here.

Training

This time, we will feed the dataset to our model through CSV files, so no further manipulation of the data is needed since it already has the labels (price) in its first column. However, we need to upload the files to our SageMaker session's S3 bucket first.
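
A minimal sketch of the upload step, assuming the CSV files are named train.csv, validation.csv, and test.csv and sit next to the notebook (the file names and key prefix are assumptions):

import sagemaker

session = sagemaker.Session()
prefix = 'used-car-prices-xgboost'  # assumed S3 key prefix

# upload the CSV files to the session's default S3 bucket
train_location = session.upload_data('train.csv', key_prefix=prefix)
validation_location = session.upload_data('validation.csv', key_prefix=prefix)
test_location = session.upload_data('test.csv', key_prefix=prefix)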

After instantiating the XGBoost estimator, we will not train it right away. Instead, we will employ Amazon SageMaker's Hyperparameter Tuning to get the best model possible. Essentially, Hyperparameter Tuning trains multiple models with different hyperparameters, after which we can take the best-performing model and attach our estimator to it. We specify the hyperparameters we want to tune and the ranges over which they will be automatically adjusted. In my case, I opted to train a total of 20 models.
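
A sketch of how the tuner might be set up, assuming xgb is the XGBoost estimator instantiated earlier; the hyperparameters chosen here and their ranges are illustrative assumptions rather than the exact values I used:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

xgb_hyperparameter_tuner = HyperparameterTuner(
    estimator=xgb,                              # the XGBoost estimator instantiated earlier
    objective_metric_name='validation:rmse',    # minimize the validation RMSE
    objective_type='Minimize',
    max_jobs=20,                                # train a total of 20 models
    max_parallel_jobs=3,
    hyperparameter_ranges={
        'max_depth': IntegerParameter(3, 12),
        'eta': ContinuousParameter(0.05, 0.5),
        'min_child_weight': IntegerParameter(2, 8),
        'subsample': ContinuousParameter(0.5, 0.9),
        'gamma': ContinuousParameter(0, 10),
    })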

We can now supply the training and validation set to the hyperparameter tuner to train those 20 models on. Once that’s done, we take the best performing model and use it as our estimator.
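
The fit and attach steps might look like the sketch below; TrainingInput is the SageMaker SDK v2 name (older notebooks use sagemaker.s3_input), and the variable names are assumptions:

import sagemaker

# point the tuner at the training and validation CSVs uploaded earlier
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='text/csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_location, content_type='text/csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb_hyperparameter_tuner.wait()

# attach an estimator to the best-performing training job
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())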

Like the LinearLearner model, our XGBoost model does not normalize the labels, so we end up with very high losses as well given the range of the car prices. The losses are significantly lower here because the model reports Root Mean Square Error (RMSE) as its validation loss, but again, what is important is that the validation loss decreases throughout the training process. In this instance, the validation loss decreased from 19140.1 to 4988.79 by the end of training.

Evaluation

To run predictions on our model, we will create a transformer job and supply it with the location of our test dataset. A transformer job uses SageMaker’s Batch Transform feature to conduct inference on large datasets.
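
A sketch of the transform job, assuming xgb_attached is the attached best estimator and test_location points to the test CSV in S3; the instance type is an assumption:

# run batch inference over the test CSV, one record per line
xgb_transformer = xgb_attached.transformer(instance_count=1, instance_type='ml.m4.xlarge')
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
xgb_transformer.wait()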

Once the transform job is finished, we can download the output file from the transform job’s output path and compare it against the true labels.
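
Downloading and loading the predictions might look like the following; Batch Transform names each output file after its input with a .out suffix, and the local paths here are assumptions:

import pandas as pd

# copy the transform output from S3 to a local folder
!aws s3 cp --recursive {xgb_transformer.output_path} data/xgboost

# load the predicted prices so they can be plotted against the true labels
y_pred = pd.read_csv('data/xgboost/test.csv.out', header=None).values.squeeze()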

Going by the plot, we can observe that the XGBoost model did significantly better than the LinearLearner model. Its plot, although noisier, is a lot straighter, meaning the predicted prices and true prices are closer to each other.

R-squared score

Like the LinearLearner model, we will use the R-squared score to measure the XGBoost model's performance and compare it against the former.

R-squared score: 0.7832844072803054

We can see that the XGBoost model also performed significantly better on our accuracy metric, with a 0.78 score compared to the LinearLearner model's 0.74. On the distance metrics, however, it performed mostly the same, except for a lower mean distance.

Min distance: 0.06835940000019036
Max distance: 68591.40625
Mean distance: 3193.5946961869945

Again, ignore the abnormally high value of the max distance metric for now.

PyTorch model

For our third and final model, we will be building a custom PyTorch model. Our model will consist of two fully-connected hidden layers with ReLU activations. There will also be a dropout layer between the second hidden layer and the output layer. Since we want a continuous numerical value as output, our final layer will be a fully-connected layer with a single neuron.
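
A minimal sketch of this architecture; the hidden layer sizes and dropout probability are assumptions, since the article does not pin them down here:

import torch.nn as nn
import torch.nn.functional as F

class CarPriceRegressor(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, dropout=0.3):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # first fully-connected hidden layer
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # second fully-connected hidden layer
        self.dropout = nn.Dropout(dropout)            # dropout before the output layer
        self.out = nn.Linear(hidden_dim, 1)           # single-neuron output for the price

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        return self.out(x)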

Training

Since I was working with a custom model, I had to implement the train function myself. It supports early stopping in case our validation loss does not improve after a certain number of epochs and we want to stop training early.
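
A simplified sketch of such a train function; the patience value and the exact bookkeeping are assumptions:

import torch

def train(model, train_loader, valid_loader, criterion, optimizer, epochs, patience=20):
    best_valid_loss = float('inf')
    epochs_without_improvement = 0
    for epoch in range(1, epochs + 1):
        # training pass
        model.train()
        train_loss = 0.0
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # validation pass
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for features, labels in valid_loader:
                valid_loss += criterion(model(features), labels).item()
        print(f'Epoch: {epoch}, Train loss: {train_loss / len(train_loader)}, '
              f'Validation loss: {valid_loss / len(valid_loader)}')
        # early stopping: stop when the validation loss has not improved for `patience` epochs
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print('Early stopping condition reached. Stopping training')
                break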

Before we start feeding the datasets to our model, we have to normalize the features first. We will use scikit-learn's MinMaxScaler for normalization. It scales each column to the 0–1 range by mapping every value x to (x - min) / (max - min), where min and max are that column's minimum and maximum values. After normalization, we can format the train and validation datasets into DataLoaders, which we will use for training the model.
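
A sketch of the normalization and DataLoader setup, reusing the X_train/y_train arrays from earlier; the batch size is an assumption:

from sklearn.preprocessing import MinMaxScaler
import torch
from torch.utils.data import TensorDataset, DataLoader

# fit the scaler on the training features only, then apply it to the validation features
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_validation_scaled = scaler.transform(X_validation)

# wrap the scaled features and the price labels into DataLoaders
train_loader = DataLoader(
    TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32),
                  torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)),
    batch_size=64, shuffle=True)
valid_loader = DataLoader(
    TensorDataset(torch.tensor(X_validation_scaled, dtype=torch.float32),
                  torch.tensor(y_validation, dtype=torch.float32).unsqueeze(1)),
    batch_size=64)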

For the optimizer and loss function, I have decided to use Adam and Mean Squared Error (MSE) respectively. We can now start training for a maximum of 300 epochs.
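
Wiring these together might look like the following sketch; the learning rate is an assumption:

import torch.nn as nn
import torch.optim as optim

model = CarPriceRegressor(input_dim=X_train_scaled.shape[1])
criterion = nn.MSELoss()                              # Mean Squared Error loss
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

train(model, train_loader, valid_loader, criterion, optimizer, epochs=300)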

Epoch: 1, Train loss: 843259303.4766355, Validation loss: 556914428.3076923
...
Epoch: 203, Train loss: 21587198.14953271, Validation loss: 26712609.230769232
Early stopping condition reached. Stopping training

We can observe a similar scale for the validation loss compared to the models we have worked on prior. Going through the logs, we can see that the validation loss decreased from 556914428.31 to 26712609.23 before early stopping kicked in.

Evaluation

Now we can run predictions on the trained model. After we load in the test data, we still have some processing to do. Since we trained the model with normalized features, we have to normalize the test dataset features before inference as well, so the model receives inputs on the same scale it was trained on. It is important to transform the test data with the same scaler we fit on the training dataset to ensure that both datasets are scaled identically.
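
A sketch of this step, assuming X_test holds the raw test features and scaler is the MinMaxScaler fitted on the training data above:

import torch

# transform (never re-fit) the test features with the scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)

# run inference with dropout disabled and gradients turned off
model.eval()
with torch.no_grad():
    y_pred = model(torch.tensor(X_test_scaled, dtype=torch.float32)).numpy().squeeze()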

We can observe from the plot that our PyTorch model performed similarly to the XGBoost model — a mostly straight plot but with some outliers.

R-squared score

To measure accuracy, we will again use the R-squared score.

R-squared score: 0.7699063746869551

The PyTorch model’s accuracy is a little below the XGBoost model at 0.77. Along with this, other metrics below are also very close to that of the XGBoost model.

Min distance: 0.056640625
Max distance: 67580.33642578125
Mean distance: 3468.076180230625

Conclusion

The models we have built and trained seem to be generally performing well except for the alarmingly high value of max distance for each of them. This high value is also most likely what is causing our accuracy score to not be in the .80-.90 range. Let’s take a look at what this max distance actually is.

Max distance car true price: 75000.0

If we take a look at the actual price of the car with the max distance between true and predicted price, it is 75000 USD. In order to put this price in perspective, we need to find the car's model.
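
A sketch of how the model can be recovered from the one-hot encoded columns, assuming distances holds the absolute errors from the evaluation step and test_df is the test set as a pandas DataFrame (both names are assumptions):

# the row with the largest gap between true and predicted price
max_idx = distances.argmax()

# the one-hot model column that is set for that row
model_columns = [col for col in test_df.columns if col.startswith('model_')]
print('Max distance car model:', test_df.iloc[max_idx][model_columns].idxmax())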

Max distance car model: model_hyundai-elantra

After finding out that the car is a Hyundai Elantra, a quick search of the US used car market shows that its price typically ranges from 5000 to 11000 USD, far below its 75000 USD label in our test dataset.

We can also put the high price in context within our own dataset by taking the mean price of all Hyundai Elantra cars in our test dataset. We get a mean of roughly 12000 USD, which, again, is far below the label. So we can conclude that this car is an outlier in the dataset.
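
The mean itself can be computed directly from the test DataFrame, assuming a one-hot column named model_hyundai-elantra and a price column (both names are assumptions):

# mean true price of all Hyundai Elantras in the test set
elantra_rows = test_df['model_hyundai-elantra'] == 1
print('Max distance car model mean price:', test_df.loc[elantra_rows, 'price'].mean())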

Max distance car model mean price: 12317.555970149253

This actually shows that our models work as intended. The objective of this project is to build a model that can estimate a market-average price for used cars in order to help buyers and sellers avoid inflated car prices on the consumer-to-consumer market.

So when John Doe heads to Craigslist looking to buy a used Hyundai Elantra priced at 75000 USD, running its specs through our models will make him realize that this particular Hyundai Elantra's price is well above market and that he should look elsewhere.

References

  • Hudgeon, D., & Nichol, R. (2020). Machine Learning for Business: Using Amazon SageMaker and Jupyter. Manning Publications. Retrieved from https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
  • Hudgeon, D., & Nichol, R. (2020). Machine Learning for Business: Using Amazon SageMaker and Jupyter. Manning Publications. Retrieved from https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html
  • Hudgeon, D., & Nichol, R. (2020). Machine Learning for Business: Using Amazon SageMaker and Jupyter. Manning Publications. Retrieved from https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html
  • Bushaev, V. (2018, October 24). Adam — Latest Trends in Deep Learning Optimization. Towards Data Science. Retrieved from https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
  • Minitab Blog Editor. Regression Analysis: How Do I Interpret R-Squared and Assess the Goodness-of-Fit? Minitab Blog. Retrieved from https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
  • Mean Squared Error Loss Function. Peltarion Platform. Retrieved from https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/mean-squared-error
