My cousin got pregnant recently and she is the first in our generation to have a kid. With that come a handful of traditions that I am not familiar with, like gender reveals and baby showers. Her baby shower is coming up and it might be too soon to get her a lego set for the baby…but I kinda want to get her a lego set for the baby.
Either way, I’m going to get her kid a lot of lego sets in the coming years, so I wanted to do an analysis on the price of legos. I wanted to see how certain features of a lego set interact with price and if I could find a strategy to buy this future child a ton of legos without breaking my bank. If I could build a model that predicted the price of a lego set, then I could understand what features I should consider when shopping around.
In order to conduct this analysis, I used Selenium, a browser-automation library often used for web scraping, to gather product data from Lego’s website. On the website you can browse the available lego sets and get information like how much they cost, how many pieces are in the set, product reviews, etc. You can also toggle between countries and see which sets are available in each. At the end of the day, I scraped around 11,000 observations, each with 14 columns representing different features of a lego set.
Cleaning the Data
The biggest hurdle in cleaning this dataset came down to converting the currencies into USD. Since I scraped data from 22 different countries, I needed to get them into one common currency. Finding the current exchange rate for all the different currencies was easy enough, but parsing the strings proved difficult.
Some countries swap the roles of periods and commas (a comma as the decimal separator, a period for thousands), and almost every country has a unique symbol for its currency. What was even more confusing was that different currencies can share the same symbol (a $ in Australia is different from a $ in the U.S., which is different from a $ in New Zealand). A few simple lines of regex removed non-numerical characters and handled the odd commas elegantly.
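As a minimal sketch of that parsing step, the logic looked something like this. The exchange rates and the decimal-separator heuristic here are illustrative assumptions, not the exact values or code from the analysis:

```python
import re

# Placeholder exchange rates to USD (assumed, not the real rates used)
EXCHANGE_RATES = {"EUR": 1.10, "AUD": 0.65, "USD": 1.0}

def parse_price(raw, currency):
    """Turn a raw price string like '1.299,99 €' or '$19.99' into USD."""
    # Strip everything except digits, commas, and periods
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # If the last comma comes after the last period, assume European
    # style: periods are thousands separators, the comma is the decimal
    if "," in cleaned and cleaned.rfind(",") > cleaned.rfind("."):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return float(cleaned) * EXCHANGE_RATES[currency]

parse_price("1.299,99 €", "EUR")  # ≈ 1429.99 USD with the rates above
```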
Beyond some basic cleaning, there were two features that I created from the raw data. The first was an average suggested age for the lego set. Each lego set comes with a suggested age range, but these ranges are not in a format conducive to modeling. My theory is that a set designed for an older child will be more expensive than one designed for a younger child. However, it would be hard to throw ranges like 8+, 6–12, and 4–7 into a model and expect it to understand what they mean.
To create an “average suggested age,” I made the decision to give an upper boundary of 22 for the suggestions without an upper boundary. I then took the average of the upper and lower bounds. So 6–14 turned into 10 and 14+ turned into 18.
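That conversion fits in a few lines. The cap of 22 for open-ended ranges follows the rule above; everything else is a straightforward sketch:

```python
UPPER_CAP = 22  # assumed upper bound for open-ended ranges like '14+'

def average_suggested_age(age_range):
    """Convert '6–14' -> 10.0 and '14+' -> 18.0."""
    age_range = age_range.replace("–", "-")  # normalize en-dashes
    if age_range.endswith("+"):
        return (int(age_range.rstrip("+")) + UPPER_CAP) / 2
    lower, upper = (int(x) for x in age_range.split("-"))
    return (lower + upper) / 2

average_suggested_age("6–14")  # -> 10.0
average_suggested_age("14+")   # -> 18.0
```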
The other features I created were part of speech tags from the description text. Each lego set had a long description and I wanted to understand if the verbiage of the description helps predict how much the lego set costs. The NLTK package has a handy function that gave me the occurrence of different parts of speech. Because I was comparing descriptions of different lengths, I took the percentage of times different parts of speech occurred in the description. After running the function below, I was left with features like VERB 17%, NOUN 28%, etc.
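The core of that function can be sketched as follows. The percentage logic takes a list of tags directly; in the actual pipeline the tags would come from NLTK's `pos_tag` with the universal tagset (shown in a comment so the sketch stays self-contained):

```python
from collections import Counter

def pos_percentages(tags):
    """Given a list of POS tags, return each tag's share of the total.

    In the real pipeline the tags would be produced by something like:
        tokens = nltk.word_tokenize(description)
        tags = [t for _, t in nltk.pos_tag(tokens, tagset="universal")]
    which yields coarse tags such as VERB, NOUN, ADV, CONJ, and PRT.
    """
    counts = Counter(tags)
    total = len(tags)
    return {tag: count / total for tag, count in counts.items()}

pos_percentages(["NOUN", "VERB", "NOUN", "ADP"])
# -> {'NOUN': 0.5, 'VERB': 0.25, 'ADP': 0.25}
```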
Transforming the Data
After creating the new features, I needed to find the appropriate transformations so the data would satisfy the assumptions of linear regression:
- Features are normally distributed
- Features are not autocorrelated
- There is no multicollinearity
The data passed 2 and 3, but it had several variables that were extremely skewed.
I considered three strategies to help make this data a bit more normally distributed.
- Apply log transformations: By taking the natural log of these variables, skewed distributions tend to return to a normal distribution.
- Add non-linear terms: By adding non-linear terms (x -> x²) we can use polynomial regression to fit data that isn’t strictly linear.
- Remove outliers: Since there are only a few examples of high prices, it would be fair to simplify the scope of this project by removing anything above $200 from my analysis.
I ended up using a combination of 1 and 3 to reduce complexity and home in on the lego sets that I might actually buy in the future. The result gave me features that were much closer to a normal distribution.
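On a toy frame, strategies 1 and 3 amount to a filter followed by a log transform. The column names and values here are assumptions for illustration, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the scraped set (names/values assumed)
df = pd.DataFrame({
    "price": [9.99, 39.99, 149.99, 799.99],
    "piece_count": [62, 447, 1969, 7541],
})

# Strategy 3: drop the few very expensive sets above $200
df = df[df["price"] <= 200].copy()

# Strategy 1: log-transform the skewed variables
df["log_price"] = np.log(df["price"])
df["log_pieces"] = np.log(df["piece_count"])
```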
After running an OLS model with key features, I had a model that performed well! It had an R² of .82 on the holdout sample and an RMSE of 0.135. After transforming that RMSE out of log space, the error is around $1.50. Considering the average lego set costs around $40, an error of ±$1.50 isn’t too bad!
I took a look at the distribution of the errors to ensure I wasn’t violating any of the aforementioned assumptions of linear regression.
I could see from my residual plot and the Q-Q plot that I wasn’t doing the best job of predicting across the entire range of prices. The points deviating from the red line show the model had difficulty with lego sets that are either very cheap or very expensive. Overall the model performs well, but we need to be aware of these limitations moving forward.
Insights gained from the Model
If we take a look at the coefficients we can learn some interesting things about lego sets. There are obvious relationships like the more pieces in a set, the more expensive the set. These are good for a sanity check, but aren’t the most fascinating relationships. Some of the more interesting relationships came from the country variables and the part of speech tag variables.
When we exclude the U.S. as a feature in the model, we can interpret the rest of the country coefficients as follows: “All things equal, a lego set in Finland is 1.5% more expensive than a lego set in the U.S.” Luckily for me, there is only one other place in the world where legos are cheaper: Canada.
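Because the target is the log of price, a country dummy’s coefficient b converts to a percentage difference via (e^b − 1) × 100 relative to the excluded baseline. A small sketch with made-up coefficient values (not the fitted ones):

```python
import numpy as np

# Hypothetical country-dummy coefficients from a log-price model;
# the U.S. is the dropped baseline category.
coefs = {"country_FI": 0.0149, "country_CA": -0.02}

# Percentage price difference vs. the U.S., all else equal
pct_vs_us = {k: (np.exp(b) - 1) * 100 for k, b in coefs.items()}
# country_FI -> roughly +1.5%, country_CA -> roughly -2%
```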
The other features I found interesting were the parts of speech tag variables. It turns out that a higher percentage of action-oriented words, like verbs and adverbs, were associated with more expensive lego sets. On the other hand, descriptions with a high percentage of conjunctions (and, or, but, if) and particles (at, on, over, out) were associated with less expensive lego sets.
But is it worth $800?
In addition to the insights outlined, we can feed the model any lego set we are interested in purchasing and predict its price. If the model predicts a price higher than what is listed on the website, then we can feel good that we are getting a deal on that set. When we plug the Millennium Falcon’s specs into the model, the predicted price comes out to $930.24! However, we must remember that our model doesn’t do a great job of predicting legos at the extreme ends of the price scale. Where my model fails, Martin Ray can help fill the gaps. His review says it all…
In addition to the insights outlined, we can feed the model any lego set we are interested in purchasing and predict the price. If the model predicts a price higher than what is listed on the website, then we can feel good that we are getting a deal on that set. When we plug the Millenium Falcon’s specs into the model, the predicted price comes out to $930.24! However, we must remember that our model doesn’t do a great job of predicting legos at the extreme ends of the price scale. Where my model will fail, Martin Ray can help fill the gaps. His review says it all…