P05 — Week 5— Data Gathering from HonestJohn and Data Merging
This week we gathered data from HonestJohn. We ran into the same connection errors as last week, but since we had already dealt with them once, we quickly switched to Scrapy. In the end, we uploaded it to GitHub.
After that, we finally started combining our scraped data with the European Environment Agency’s (EEA) data.
First, we made sure that the two datasets could be combined with each other. We pruned a lot from our scraped data. For example, we had to remove model names from the brand name column, and year information appeared in both the model and brand columns, so we dropped it from both. The given mpg values were also not uniform: some were intervals and some were just dashes. We took the average of each interval and dropped the NaN values. Lastly, we converted the mpg values to CO2 emissions using the conversion rates given on this website.
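The mpg cleaning and conversion steps above could be sketched roughly as follows. This is only an illustration, not our actual script: the function names are made up, the raw string formats are assumed, mpg is assumed to mean miles per UK gallon, and the grams-of-CO2-per-litre figure is a rough placeholder for petrol rather than the rate from the website mentioned above.

```python
import re
from typing import Optional

# Unit constants (exact definitions of the UK gallon and the mile).
LITRES_PER_UK_GALLON = 4.54609
KM_PER_MILE = 1.609344
# ASSUMPTION: rough, commonly quoted figure for petrol. It is a placeholder,
# not the conversion rate from the website the post refers to.
ASSUMED_G_CO2_PER_LITRE = 2310.0

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_mpg(raw: str) -> Optional[float]:
    """Return a single mpg figure; intervals are averaged, dashes give None."""
    numbers = [float(n) for n in _NUMBER.findall(raw)]
    if not numbers:
        return None  # e.g. "-" or an empty cell
    return sum(numbers) / len(numbers)


def mpg_to_co2_g_per_km(
    mpg: float, g_co2_per_litre: float = ASSUMED_G_CO2_PER_LITRE
) -> float:
    """Convert miles per UK gallon to grams of CO2 per kilometre."""
    litres_per_km = LITRES_PER_UK_GALLON / (mpg * KM_PER_MILE)
    return litres_per_km * g_co2_per_litre


# Hypothetical raw values in the shapes described above.
raw_values = ["45.6", "40-50", "-", "38.2 - 41.8"]
# Parse every value and drop the ones that could not be parsed.
mpg_values = [m for m in (parse_mpg(v) for v in raw_values) if m is not None]
co2_values = [round(mpg_to_co2_g_per_km(m), 1) for m in mpg_values]
print(list(zip(mpg_values, co2_values)))
```

Keeping the emission factor as a parameter makes it easy to swap in a different rate, for example a separate one for diesel cars.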
In the end, we had the data we needed for the merge.
Then we proceeded with the EEA’s data. It had a lot of sparse columns. We dropped the sparse and unnecessary columns for the sake of clarity.
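Dropping the sparse columns might look something like the sketch below, assuming the data sits in a pandas DataFrame. The 50% missing-value threshold, the helper name and the column names are all hypothetical and chosen only for illustration; they are not the real EEA column names or the cut-off we used.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5, extra_drop=()):
    """Drop columns whose share of missing values exceeds max_missing,
    plus any explicitly listed unnecessary columns."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep].drop(columns=list(extra_drop), errors="ignore")


# Tiny made-up frame standing in for the EEA data.
eea_raw = pd.DataFrame(
    {
        "brand": ["Ford", "Audi", "BMW", "Fiat"],
        "model": ["Fiesta", "A3", "X1", "500"],
        "co2_g_km": [118.0, 110.0, 140.0, None],
        "mostly_empty": [None, None, None, 1.0],  # 75% missing, gets dropped
        "internal_id": [1, 2, 3, 4],              # not needed for the merge
    }
)

eea_clean = drop_sparse_columns(eea_raw, max_missing=0.5, extra_drop=["internal_id"])
print(list(eea_clean.columns))
```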
After cleaning the EEA data, we were finally ready to merge it with our scraped data. We merged on the brand name and the car model.
After the merge, we removed the duplicate rows and ended up with a dataset containing 210k values.
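The merge and the duplicate removal described above can be sketched with pandas as below. The data is made up, the column names are hypothetical, and both the inner join and the lowercasing of the key columns are assumptions for the example; the post does not say which join type or key normalisation we actually used.

```python
import pandas as pd

# Made-up stand-ins for the cleaned scraped data and the cleaned EEA data.
scraped = pd.DataFrame(
    {
        "brand": ["Ford", "Ford ", "Audi"],
        "model": ["Fiesta", "fiesta", "A3"],
        "co2_from_mpg": [120.0, 120.0, 110.0],
    }
)
eea = pd.DataFrame(
    {
        "brand": ["FORD", "Ford", "BMW"],
        "model": ["FIESTA", "Fiesta", "X1"],
        "co2_eea": [118.0, 118.0, 140.0],
    }
)


def normalise_keys(df, keys=("brand", "model")):
    """Strip whitespace and lowercase the key columns so they match on merge."""
    out = df.copy()
    for key in keys:
        out[key] = out[key].str.strip().str.lower()
    return out


merged = (
    normalise_keys(scraped)
    .merge(normalise_keys(eea), on=["brand", "model"], how="inner")
    .drop_duplicates()        # remove rows that are identical in every column
    .reset_index(drop=True)
)
print(merged)
```

Because each side can contain several rows for the same brand and model, the merge multiplies them, which is one reason duplicates show up afterwards.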
Next week, we’ll be focusing on the exploratory data analysis and the project progress report.