Predicting used car prices with linear regression in Amazon SageMaker — Part 1

Capstone Project for Udacity Machine Learning Nanodegree

Charles Frederic Atienza
9 min read · Mar 26, 2020

This post details my approach to completing my Capstone Project for my Udacity Machine Learning Nanodegree. This is the first part that goes through data exploration and preprocessing.

Supervised learning has proven vital to businesses throughout the 2010s and will no doubt be applied to even more markets in the future. Here, I decided to explore how it can be applied to predicting used car prices on Craigslist. We will be using the Used Cars Dataset (Version 4) on Kaggle.

If you want to follow along, the source files and notebooks I worked on can be viewed in my public repository. The repository also contains a project proposal document, which describes my initial plan for how the phases of this project should be tackled. Most of that plan was followed, but some parts, most notably the accuracy metric for the supervised models, were replaced with more appropriate and efficient practices.

Data Exploration

Like most data science projects, we will start by exploring our data and determining which features contribute to the car’s price.

Our dataset consists of only one file, vehicles.csv, a collection of vehicle listings scraped from Craigslist. In total, the dataset has 20 columns. Most of these, like id, url, lat, long, and description, are Craigslist-specific features that have nothing to do with the price, so we can safely ignore them. The cylinders column will also be dropped. Although this column is more closely associated with car specifications than the others that were dropped, we will work under the assumption that most car owners do not take the number of cylinders into account when pricing their car.

We will only be exploring the columns odometer, manufacturer, model (the car model), condition, fuel, drive, size, type, and paint_color. First, we filter the dataset to keep only entries posted from 2015 to 2019, and then we can proceed.
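Here is a minimal sketch of this loading and filtering step, assuming pandas. The exact column names (for example region_url and image_url) are assumptions based on the Kaggle dataset and may differ from the notebook in the repository.

import pandas as pd

# Load the raw Craigslist listings.
df = pd.read_csv('vehicles.csv')

# Drop Craigslist-specific columns that carry no pricing signal,
# plus the cylinders column.
drop_cols = ['id', 'url', 'region_url', 'image_url', 'description',
             'lat', 'long', 'cylinders']
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Keep only entries from 2015 to 2019 (filtering on the year column here;
# adjust if filtering on the posting date instead).
df = df[df['year'].between(2015, 2019)]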

Let’s start by exploring the most obvious contributing feature to the price, the model feature. To avoid any confusion, the model feature here describes the car model.

The model feature requires additional clean-up. Unlike the other features, the model column's exact text value can vary greatly depending on the poster's prerogative. For example, a car's actual model might be 'Cruze', but the poster might opt to enter 'Cruze 2017 AT' instead, which is essentially the same category but, if left unhandled, will end up as its own separate class. This can be seen by getting the occurrence count of each class in the model column.

f-150                     3128
1500                      2548
silverado 1500            2201
2500                      1355
altima                    1351
                           ...
rogue sv automatic           1
escalade premium awd         1
transit-250 cutaway          1
tlx v6 w/tech                1
patriot sport 4x4 suv        1
Name: model, Length: 8897, dtype: int64

To remedy this, we will use substring containment to normalize redundant model classes, so 'Cruze 2017 AT' will simply be normalized to 'Cruze'. If we do not do this, our machine learning model will treat these two values as separate classes, and the resulting explosion in the number of unique model values will likely hurt training. The implementation of this containment normalization can be seen in the data exploration notebook. To further simplify our dataset and trim out outliers, we will also drop rows whose model class occurs fewer than 50 times in the dataset.
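The notebook has the authoritative implementation, but a simplified sketch of the containment idea looks something like this (the loop over candidate names is illustrative and slow on the full dataset):

# Lower-case the model strings so containment checks are consistent.
df['model'] = df['model'].astype(str).str.lower()

# Candidate canonical names, most frequent first.
base_models = df['model'].value_counts().index.tolist()

def normalize_model(raw):
    # Collapse e.g. 'cruze 2017 at' into 'cruze' when a more common
    # model string is contained in the longer one.
    for base in base_models:
        if base != raw and base in raw:
            return base
    return raw

df['model'] = df['model'].apply(normalize_model)

# Drop rows whose normalized model class occurs fewer than 50 times.
counts = df['model'].value_counts()
df = df[df['model'].isin(counts[counts >= 50].index)]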

Feature-price correlation

Before exploring correlations, we will fill null values in these columns with 'other', since that is the value the dataset already uses for nominal features whose entries do not fall under the more common classes.
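In pandas this is a one-liner over the nominal columns (the column list here reflects the features explored above):

nominal_cols = ['manufacturer', 'model', 'condition', 'fuel',
                'drive', 'size', 'type', 'paint_color']
df[nominal_cols] = df[nominal_cols].fillna('other')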

We can now explore how each feature is correlated, if at all, to the car’s price. We will plot the manufacturer-to-price correlation first.

Manufacturer-to-price correlation

We can see from the plot above that it barely makes any sense, but upon further inspection, we notice that this is caused by very extreme outliers in the data. These outliers squish the normal data so far down the plot that it becomes essentially unrecognizable.

To avoid this, we will limit the plot's y-axis to the range 2500 to 100000. We decided on these limits based on the real-world range of car prices. Although there are cars that cost well beyond USD 100,000, these are most likely luxury or custom cars, which we can afford to ignore since the goal of this project is to build a model that accurately estimates the prices of cars posted on Craigslist, most of which are common, non-luxury cars. Below is the much-improved plot.

Manufacturer-to-price correlation with y-axis limits

We remove these outliers from the dataset based on the model column. We will also remove outliers in the odometer column.
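A simple version of this filtering keeps prices inside the plotted range and caps the odometer reading. The odometer threshold below is an illustrative value of my own choosing, not necessarily the one used in the notebook.

# Keep prices within the real-world range used for the plots above.
df = df[df['price'].between(2500, 100000)]

# Drop listings with implausibly high mileage.
df = df[df['odometer'] < 300000]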

Directly plotting each feature against the price does not make much sense, since car prices are predominantly determined by the car model before anything else. This can be observed in the plot below, where we only plot car models under the Chevrolet manufacturer.

Chevrolet model-to-price correlation

Given that the car's model is ultimately the biggest contributor to the price, keeping the manufacturer as a separate feature does not make much sense, and doing so would only confuse the model we are building. So, we will simply concatenate the manufacturer and model values into the model column and then drop the manufacturer column.
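The merge itself is straightforward:

# Fold the manufacturer into the model name, e.g. 'chevrolet cruze',
# then drop the now-redundant manufacturer column.
df['model'] = df['manufacturer'].astype(str) + ' ' + df['model'].astype(str)
df = df.drop(columns=['manufacturer'])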

Now that we can isolate each car model, we can plot each feature's correlation with the price and clearly distinguish the contributing features. For this example, we will stick with the car model used above, the Chevrolet Cruze. Here are the plots of each feature's correlation with the price.

After careful observation, we can conclude that all of the above features except drive affect the price, given the distance between each feature's class medians. The drive feature's class medians are close to each other compared to the others, so it is safe to assume that this feature does not affect the price as much and can be dropped.
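As a sketch of how these per-model plots can be produced, assuming seaborn and matplotlib for plotting (the notebook may draw them differently):

import matplotlib.pyplot as plt
import seaborn as sns

# Hold the car model fixed (Chevrolet Cruze) and box-plot the price
# against each remaining categorical feature.
cruze = df[df['model'] == 'chevrolet cruze']
features = ['condition', 'fuel', 'drive', 'size', 'type', 'paint_color']

fig, axes = plt.subplots(len(features), 1, figsize=(8, 4 * len(features)))
for ax, feature in zip(axes, features):
    sns.boxplot(x=feature, y='price', data=cruze, ax=ax)
plt.tight_layout()
plt.show()

# The drive feature's class medians barely differ, so we drop it.
df = df.drop(columns=['drive'])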

Categorical features distribution

Next, we explore how balanced or imbalanced the distribution of our categorical features' classes is. Below is the distribution plot for the model feature.

Model class distribution

As can be observed, the model class distribution is absurdly imbalanced. If our model's goal were classification of the car model instead of linear regression on the car's price, this imbalance would be a real nuisance, since it could bias the model toward the high-occurrence model classes.

Nonetheless, in order to ensure our model trains on a balanced range of model-to-price relations, we will attempt to work around this imbalance with sampling techniques when we preprocess the data.

The other features' class distributions are also imbalanced. Unlike the car model feature, however, they barely compare in the scale of their effect on the price, so this imbalance can be safely ignored.

After dropping rows and columns we deemed unnecessary from our dataset, we are left with a dataset with 103670 rows and 9 columns.

Data Preprocessing

We now move on to preprocessing the data for feeding to our supervised model. For our model to properly process the dataset’s numerous nominal features, we have to one-hot encode them first.

Out of our 9 columns, 7 are nominal and only price and odometer are numerical. After one-hot encoding the 7 nominal columns, we will drop the original 7 columns except the model column which we will use as the target for sampling the data. We now have a dataset with 103670 rows and 295 columns. From this dataset, we will take 50% for the training set, 25% for the validation set, and the remaining 25% for the test set which will leave us with datasets with the below dimensions.

  • Training set — 51835 rows x 295 columns
  • Validation set — 25918 rows x 295 columns
  • Test set — 25918 rows x 295 columns
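A sketch of the encoding and splitting, assuming scikit-learn's train_test_split; the random seed is arbitrary and the nominal column list is taken here from the remaining object-typed columns rather than hard-coded:

import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode every nominal column, then re-attach the raw model
# column so it can serve as the sampling target later on.
nominal_cols = df.select_dtypes(include='object').columns.tolist()
encoded = pd.get_dummies(df, columns=nominal_cols)
encoded['model'] = df['model']

# 50% train, 25% validation, 25% test.
train, rest = train_test_split(encoded, test_size=0.5, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)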

Train dataset sampling

Earlier, I mentioned that the extreme imbalance of the car model class distribution can be improved with sampling techniques. Below is a plot of how this imbalance has skewed the price distribution in our training dataset.

Training dataset price distribution

We can improve the distribution of the car price through oversampling and undersampling the dataset around the car model feature.

SMOTE

We will use an algorithm called SMOTE (Synthetic Minority Over-sampling Technique) to handle the oversampling of the data. For this algorithm to work, every sample in the dataset must have at least one neighbor (a row with the same car model class). In our case, I decided to require at least 6 neighbors, so I am taking out any row whose model class has fewer than that.
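Dropping the under-represented model classes before resampling can look like this; the threshold of 6 follows the choice described above, and the exact cut-off in the notebook may differ slightly:

# SMOTE interpolates between same-class neighbors, so keep only car
# model classes with at least 6 rows in the training set, enough to
# supply the required neighbors.
model_counts = train['model'].value_counts()
train = train[train['model'].isin(model_counts[model_counts >= 6].index)]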

Edited Nearest Neighbor

For the undersampling, we will use the Edited Nearest Neighbor (ENN) algorithm, which removes samples whose class disagrees with the majority of their nearest neighbors.

I used the imbalanced-learn package, which already implements SMOTE and ENN combined in a single SMOTEENN class. After running it on our training dataset, we have a resampled dataset with 274019 rows and 295 columns and the price distribution below.
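Running the combined resampler from imbalanced-learn looks roughly like this (older versions of the package return NumPy arrays from fit_resample, hence the DataFrame rebuild; the random seed is arbitrary):

import pandas as pd
from imblearn.combine import SMOTEENN

# The car model class is the sampling target; everything else,
# including price and odometer, is treated as a feature.
y = train['model']
X = train.drop(columns=['model'])

sampler = SMOTEENN(random_state=42)
X_res, y_res = sampler.fit_resample(X, y)

# Rebuild a DataFrame and re-attach the model column for later steps.
resampled = pd.DataFrame(X_res, columns=X.columns)
resampled['model'] = pd.Series(y_res).values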

Resampled training dataset price distribution

After sampling, we can see that the plot is still skewed, but with a slightly longer tail. What is important here is that the price distribution is no longer so lopsided. If we look at the y-axis labels of the graph, we can observe that the counts of the high-occurrence car models have been drastically reduced and those of the low-occurrence car models increased.

We can now drop the car model feature from the three datasets; note that this feature has already been one-hot encoded earlier. This leaves the datasets with 294 columns.

We are now done preprocessing the data and can save the three datasets as CSVs to be fed to the models we are building. The training and validation datasets will each have their features and labels in a single CSV file with the label as the first column, while the test set will have separate CSV files for its labels and features.
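A sketch of the final export, assuming the DataFrames from the earlier sketches and file names of my own choosing. SageMaker's built-in algorithms expect label-first CSVs with no header or index.

# Drop the raw model column; its one-hot encoding is already present.
train_out = resampled.drop(columns=['model'])
val_out = val.drop(columns=['model'])
test_out = test.drop(columns=['model'])

def save_label_first(frame, path):
    # Label (price) as the first column, no header, no index.
    cols = ['price'] + [c for c in frame.columns if c != 'price']
    frame[cols].to_csv(path, header=False, index=False)

save_label_first(train_out, 'train.csv')
save_label_first(val_out, 'validation.csv')

# The test set keeps its labels and features in separate files.
test_out['price'].to_csv('test_labels.csv', header=False, index=False)
test_out.drop(columns=['price']).to_csv('test_features.csv', header=False, index=False)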

What’s next

In part 2, we will use the preprocessed data to build three types of models:

  1. Amazon SageMaker’s LinearLearner
  2. Amazon SageMaker’s XGBoost
  3. Custom PyTorch model

References

  • Brownlee, Jason. “Undersampling Algorithms for Imbalanced Classification.” Machine Learning Mastery, 19 Jan. 2020, machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/.
  • Chakrabarty, Navoneel. “Application of Synthetic Minority Over-Sampling Technique (SMOTE) for Imbalanced Datasets.” Towards AI, Medium, 16 July 2019, medium.com/towards-artificial-intelligence/application-of-synthetic-minority-over-sampling-technique-smote-for-imbalanced-data-sets-509ab55cfdaf.
  • Reese, Austin. “Used Cars Dataset.” Kaggle, 7 Jan. 2020, https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

