A Very Simple Data Integration Project: Rebrickable LEGO Datasets

Emanuele Sicurella
May 20, 2019

I perfectly remember the first time I got an injection. I was almost 6 and very scared, but in the end it didn’t hurt too much.
To celebrate my courage, my father brought me a gift: my first ever LEGO set, a street sweeper from the LEGO System theme.

Since then, my parents have bought me hundreds of LEGO sets and I’ve spent hours and hours playing with them, building the sets as designed and trying to create new things. Maybe that’s one of the reasons why I was (almost) always comfortable with Mathematics, why I decided to study Physics, and why I’m into Data Science right now.

When I started to study Data Science I ran into a project from DataCamp called Exploring 67 years of LEGO. I couldn’t resist the urge to take a look at it.

Rebrickable is a repository that contains a lot of information about LEGO sets, like the name, release date, number of parts and even the name of every single part (crazy!).
Before downloading and analyzing the datasets, I of course did some searching on Google and found that some pretty nice data analyses had already been done.

In order to do something a bit different (and to practice my data retrieval, data integration and data cleaning skills), I tried to add some information about prices.

Data Scraping

Rebrickable has a section called Buy Sets, with information about current average prices calculated across several websites (excluding Amazon and eBay).

The data is organized as an HTML table spread across multiple pages. The first step was therefore to build a very simple web scraper in a file called scraper.py, following this guide.
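For reference, here is a minimal sketch of the simple_get helper that the code below relies on, assuming the usual requests-based pattern from that guide (my actual scraper.py may differ slightly):

# scraper.py: a minimal sketch of the helper assumed by the code below.
import requests

def simple_get(url):
    '''Return the raw HTML of `url` as text, or None if the request fails.'''
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # Only return the body if the server actually sent HTML.
        if 'html' in resp.headers.get('Content-Type', '').lower():
            return resp.text
        return None
    except requests.RequestException as e:
        print('Request to {} failed: {}'.format(url, e))
        return None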

The page URL format was something like “https://www.rebrickable.com/buy/sets/?query_1=value_1&query_2=value_2...&page=1”.
So I wrote a very simple function to read the table at a URL into a pandas DataFrame using pandas.read_html. With the page_query=True parameter, the function can be applied across different pages of the same URL, identified by a query string like https://www.url.com/?queries=values&page=1.

import scraper as scp
import pandas as pd

def scrape_tables(url, num_of_pages, page_query=False):
    '''
    Transform multiple HTML tables from a URL into a single pandas DataFrame.
    '''
    df_list = []
    for n in range(1, num_of_pages + 1):
        if page_query:
            # Append the page number to the existing query string.
            raw_html = scp.simple_get(url + "&page={}".format(n))
        else:
            raw_html = scp.simple_get(url)
        # pandas.read_html returns a list of DataFrames, one per table found.
        for df in pd.read_html(raw_html):
            df_list.append(df)
    return pd.concat(df_list)

As an example, here is how I scraped the data from this URL https://rebrickable.com/buy/sets/?q=&theme=&store=&min_year=2018&max_year=2018&min_parts=0&max_parts=300, which contains all the listings for 2018 sets with at most 300 parts.

df_scraped = scrape_tables('https://rebrickable.com/buy/sets/?q=&theme=&store=&min_year=2018&max_year=2018&min_parts=0&max_parts=300',
                           196, page_query=True)

Data Cleaning

After that, I wrote a simple function to clean the DataFrame a bit. I’m interested in average prices, and on Rebrickable there are multiple listings for the same set, all of them with the same average price of course. Hence, the first step was dropping duplicates, keeping only one record per set. Then I added the option to keep only certain columns and to save the resulting DataFrame to a CSV file.

def clean_and_save(df, cols=None, drop_subset=None, filename=None):
    '''
    Clean a pandas DataFrame by dropping duplicates and selecting the
    specified columns, optionally saving the result to a CSV file.
    '''
    if cols is not None:
        df = df[cols]
    # Keep a single record per set (listings share the same average price).
    df = df.drop_duplicates(subset=drop_subset)
    if filename is not None:
        df.to_csv(filename + '.csv')
    return df

Let’s apply this function to the previously scraped DataFrame:

df_clean = clean_and_save(df_scraped, cols=['Set', 'Cost'], filename='avg_price_1')

Repeating this procedure, I obtained 14 files. Unfortunately, I couldn’t find an efficient way to loop over the URLs automatically, since the results were limited to a maximum of 200 listings. This means I had to manually find combinations of year and number of parts so that each search returned fewer than 200 listings. Once those combinations are chosen, the repeated scrape-and-clean steps can still be scripted, as in the sketch below.
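Here is a hypothetical sketch of that loop; the filter values and page counts are placeholders, not the actual combinations behind my 14 files:

# Hypothetical sketch: the (min_year, max_year, min_parts, max_parts, pages)
# tuples are placeholders, not the real combinations behind the 14 files.
base_url = ('https://rebrickable.com/buy/sets/?q=&theme=&store='
            '&min_year={}&max_year={}&min_parts={}&max_parts={}')

filters = [
    (2018, 2018, 0, 300, 196),
    (2018, 2018, 301, 10000, 50),
    # ... one tuple per manually chosen combination
]

for i, (y0, y1, p0, p1, pages) in enumerate(filters, start=1):
    url = base_url.format(y0, y1, p0, p1)
    df_scraped = scrape_tables(url, pages, page_query=True)
    clean_and_save(df_scraped, cols=['Set', 'Cost'],
                   filename='avg_price_{}'.format(i))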

Rebrickable “buy sets” page.

The next step was reading all 14 files and concatenating them into a single DataFrame. The files were named “avg_price_1.csv” to “avg_price_14.csv”. With a simple wildcard pattern I matched all of them as follows:

import glob

# Match all the CSV files produced in the previous step.
avg_price_filenames = glob.glob('avg_price_*[0-9].csv')

avg_price_df = []
for file in avg_price_filenames:
    avg_price_df.append(pd.read_csv(file))
avg_price = pd.concat(avg_price_df, ignore_index=True)

avg_price.head()
avg_price.head() output.

As we can see, the DataFrame isn’t ready for analysis yet. The price information is a string of the form “€1.23 Avg: €4.56”. Moreover, some of the entries have no information about the average price, so we want to put a NaN value in those cases. Finally, prices higher than €999 contain a comma, which has to be removed to obtain a numerical value.

Summing up, what we need is:

  1. Split the string and keep only the part after “Avg: €”, as this is the average price we need.
  2. If the string doesn’t contain “Avg: €”, insert a NaN value, as there is no average price information.
  3. Drop commas so that ‘1,000’ becomes ‘1000’.
  4. Convert the string to a float.

Luckily, Python lets us handle all four steps in a single line of code, using the .apply() method with a lambda function:

from math import nan

avg_price['Cost'] = avg_price['Cost'].apply(
    lambda x: nan if 'Avg: €' not in x
    else float(x.split('Avg: €')[1].replace(',', '')))
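If the one-liner feels too dense, the same four steps read more naturally as a named function. This is just an equivalent, more verbose version of the lambda above (apply one or the other, not both):

from math import nan

def parse_avg_price(cost_string):
    '''Extract the average price as a float, or NaN if it is missing.'''
    # Steps 1 and 2: keep only the part after "Avg: €", NaN when absent.
    if 'Avg: €' not in cost_string:
        return nan
    price = cost_string.split('Avg: €')[1]
    # Step 3: drop thousands separators so '1,000' becomes '1000'.
    price = price.replace(',', '')
    # Step 4: convert the cleaned string to a float.
    return float(price)

avg_price['Cost'] = avg_price['Cost'].apply(parse_avg_price)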

After dropping the NaN values, the DataFrame is ready! Let’s take a look at it.

avg_price = avg_price.dropna()
avg_price.describe()
avg_price.describe() output.

Or graphically, with a box plot.

import matplotlib.pyplot as plt

avg_price['Cost'].plot(kind='box')
plt.show()

Seems there are a lot of outliers! Let’s take a look at the histogram:

plt.clf()
avg_price['Cost'].hist(bins=200, range=(0, 400))
plt.xlabel('Cost')
plt.show()

I limited the x-axis to €400 so the shape of the distribution is easier to appreciate.

Data Integration

After exploring the cost distribution, it is time to combine the newly acquired data with the data we already had. In particular, I want to use the sets.csv dataset (which can be downloaded here together with the other datasets), containing the name, year and number of parts of each set.

sets = pd.read_csv('sets.csv')
sets.info()

14542 entries, while we have only 6384 average prices. Let’s merge the two datasets with an inner join (pandas’ default) to discard all the entries without price information.

sets_cost = pd.merge(avg_price, sets, left_on='Set', right_on='set_num')
sets_cost.info()

Everything seems to be OK. Let’s see what our final DataFrame looks like.

sets_cost.head()

Now we can see how the current average selling price is related to the year of the set.

plt.clf()
sets_cost.plot(x='year', y='Cost', kind='scatter')
plt.show()

Another interesting thing to see is how the price changes with the number of parts of a set.

plt.clf()
sets_cost.plot(x='num_parts', y='Cost', kind='scatter')
plt.show()

Of course, the price increases with the number of parts. It is interesting to note, however, that it does so roughly linearly, but along several distinct trends.

Looking closely at the graph, we can see several distinct linear patterns. This is not surprising, since we mixed together sets from different themes. For example, the LEGO Bionicle theme has bigger, more complex parts which cost more than LEGO System parts. So in a single graph we see several lines with different slopes.
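One way to explore this further, if you also download themes.csv from Rebrickable, would be to colour the scatter plot by theme. Here is a rough sketch, assuming themes.csv has its usual id and name columns:

# Rough sketch: assumes themes.csv has the standard 'id' and 'name' columns.
themes = pd.read_csv('themes.csv')
sets_cost_themed = pd.merge(sets_cost, themes,
                            left_on='theme_id', right_on='id',
                            suffixes=('', '_theme'))

# Plot the five most common themes, each with its own colour.
top_themes = sets_cost_themed['name_theme'].value_counts().head(5).index
fig, ax = plt.subplots()
for theme in top_themes:
    subset = sets_cost_themed[sets_cost_themed['name_theme'] == theme]
    ax.scatter(subset['num_parts'], subset['Cost'], label=theme, alpha=0.6)
ax.set_xlabel('num_parts')
ax.set_ylabel('Cost')
ax.legend()
plt.show()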

Of course, we have to take into account that the current average prices might not match the original retail prices when the sets first hit the market. Some prices could be higher, for example, because they are now sold on collectors’ sites.

Conclusions

This is my first Data Analysis project, so I kept it very simple. I just wanted to show how easy it is to integrate external data into a dataset to obtain more useful insights. Web scraping is a powerful way to add insightful data available on the internet, but be careful: it usually requires extra cleaning work.

I hope you enjoyed reading this. Please feel free to comment with any suggestions to improve my work!

Thank you,

Emanuele
