P05 — Week 5— Data Gathering from HonestJohn and Data Merging

Bengü Barış Balkan
AIN311 Fall 2023 Projects
Dec 18, 2023

This week we gathered data from HonestJohn. We ran into the same connection errors as last week, but since we had already dealt with them once, we quickly switched to Scrapy. In the end, we uploaded the results to GitHub.

After that, we finally started combining our scraped data with the European Environment Agency’s (EEA) data.

First, we made sure that the two datasets could be combined. We pruned a lot from the scraped data. For example, we had to remove model names from the brand name column, and since year information appeared in both the model and brand columns, we dropped it from both. The given mpg values were also not uniform: some were intervals and some were just dashes. We took the average of each interval and dropped the NaN values. Lastly, we converted the mpg values to CO2 emissions using the conversion rates given on this website.
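As a rough sketch of the cleaning steps above (not the project's actual code), the helpers below strip year information from a text field, average mpg intervals, and turn mpg into grams of CO2 per km. The function names are hypothetical. The sketch assumes the mpg figures use imperial (UK) gallons, and it takes the grams of CO2 per litre of fuel as a parameter because the conversion rates of the website are not reproduced here; the value 2000.0 in the usage lines is a placeholder, not a real emission factor.

```python
import re
from typing import Optional

# Exact unit definitions: litres in one imperial (UK) gallon, kilometres in one mile.
LITRES_PER_UK_GALLON = 4.54609
KM_PER_MILE = 1.609344

_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_YEARS_IN_PARENS = re.compile(r"\([^)]*\b(?:19|20)\d{2}\b[^)]*\)")
_BARE_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def strip_years(text: str) -> str:
    """Remove year information such as "(2018 - 2022)" or a trailing "2018".

    Caveat: a bare four-digit number that is really part of a model name
    would be removed as well, so such names need special handling.
    """
    text = _YEARS_IN_PARENS.sub(" ", text)
    text = _BARE_YEAR.sub(" ", text)
    return " ".join(text.split())  # collapse leftover whitespace


def parse_mpg(raw: Optional[str]) -> Optional[float]:
    """Return one mpg figure, averaging intervals like "40-50".

    Returns None when the cell holds no number (for example a lone dash),
    so those rows can be dropped afterwards.
    """
    if not isinstance(raw, str):
        return None
    numbers = [float(n) for n in _NUMBER.findall(raw)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def mpg_to_co2_g_per_km(mpg: float, grams_co2_per_litre: float) -> float:
    """Convert UK mpg to g CO2/km for a given (assumed) emission factor."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    litres_per_km = LITRES_PER_UK_GALLON / (mpg * KM_PER_MILE)
    return grams_co2_per_litre * litres_per_km


# Usage with a PLACEHOLDER emission factor (2000.0 g/L is not a real value).
mpg = parse_mpg("40-50")  # 45.0
co2 = mpg_to_co2_g_per_km(mpg, grams_co2_per_litre=2000.0)
```

Keeping the emission factor as an argument makes it easy to plug in different rates for petrol and diesel once the real values are known.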

Data gathered from the web

In the end, we had the data we needed for the merge.

Then we moved on to the EEA’s data, which had a lot of sparse columns. We dropped the sparse and unnecessary columns for the sake of clarity.
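One way to drop sparse columns with pandas is to measure the share of NaN values per column and keep only the columns below a cutoff. This is a minimal sketch under assumptions: the function name, the 0.5 cutoff, and the toy column names are illustrative and are not taken from the project.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_nan_fraction: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose share of NaN values is at most max_nan_fraction."""
    nan_fraction = df.isna().mean()  # per-column fraction of missing values
    keep = nan_fraction[nan_fraction <= max_nan_fraction].index
    return df.loc[:, keep]


# Toy frame with illustrative column names (not the real EEA schema).
toy = pd.DataFrame(
    {
        "make": ["FORD", "BMW", "AUDI", "FIAT"],
        "model": ["FOCUS", None, "A3", "500"],
        "mostly_empty": [None, None, None, 1.0],
    }
)
dense = drop_sparse_columns(toy, max_nan_fraction=0.5)
```

Columns that are dense but not needed for the merge can then be removed by selecting the wanted columns explicitly, for example `dense[["make", "model"]]`.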

NaN values in the EEA dataset
Important abbreviations in the dataset

After cleaning the EEA data, we were finally ready to merge it with our scraped data. We merged on the brand name and the car model.
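A merge on two key columns could look like the sketch below. It is an illustration under assumptions rather than the project's code: the column names `brand` and `model`, the helper names, and the choice of an inner join are all hypothetical. The keys are lower-cased and stripped first, because differences in case or whitespace between the two sources would otherwise prevent rows from matching.

```python
import pandas as pd

KEYS = ["brand", "model"]  # assumed key column names


def normalise_keys(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the key columns stripped and lower-cased."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.strip().str.lower()
    return out


def merge_on_brand_and_model(scraped: pd.DataFrame, eea: pd.DataFrame) -> pd.DataFrame:
    """Inner-join the two frames on brand and model (the join type is an assumption)."""
    return pd.merge(
        normalise_keys(scraped, KEYS),
        normalise_keys(eea, KEYS),
        on=KEYS,
        how="inner",
        suffixes=("_scraped", "_eea"),  # disambiguate columns present in both
    )


# Toy data: only the Ford Focus appears in both sources.
scraped = pd.DataFrame(
    {"brand": ["Ford ", "BMW"], "model": ["Focus", "X1"], "co2": [120.0, 140.0]}
)
eea = pd.DataFrame(
    {"brand": ["ford", "audi"], "model": ["FOCUS", "A3"], "co2": [118.0, 110.0]}
)
merged = merge_on_brand_and_model(scraped, eea)
```

With an inner join, only brand and model pairs found in both datasets survive, so comparing row counts before and after the merge is a quick check on how well the keys line up.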

Merge operation
Merged Data

After the merge, we removed the duplicate rows and ended up with a dataset containing 210k values.
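Removing exact duplicate rows after a merge can be done in pandas with `drop_duplicates`. The sketch below assumes a row counts as a duplicate only when every column matches; the function name and the toy data are hypothetical.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that are identical in every column and renumber the index."""
    return df.drop_duplicates().reset_index(drop=True)


# Toy data: the first two rows are identical.
toy_merged = pd.DataFrame(
    {
        "brand": ["ford", "ford", "bmw"],
        "model": ["focus", "focus", "x1"],
        "co2": [120.0, 120.0, 140.0],
    }
)
deduplicated = drop_duplicate_rows(toy_merged)
```

If rows should instead count as duplicates based on a few columns only, `drop_duplicates` also accepts a `subset` argument listing those columns.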

Next week, we’ll be focusing on the exploratory data analysis and the project progress report.
