P05 — Week 4 — Data gathering from spritmonitor.de

Bengü Barış Balkan
AIN311 Fall 2023 Projects
2 min read · Dec 11, 2023

This week, as we mentioned last week, we used BeautifulSoup (bs4) and Scrapy to gather data from spritmonitor.de.

This picture displays all the cars in the Spritmonitor database, which provides reasonably accurate data about CO2 emissions. Because the website has a forum-like structure with user-submitted entries, many diesel and petrol cars were reported with zero CO2 emissions, likely because of how users chose to report their data. We focused on the “Most Efficient CO2 Cars” section of the website, which contained approximately 14,000 cars. From there, we successfully gathered information for 10,000 cars.

During data collection, we ran into connectivity issues and repeatedly lost our connection to the website. Our BeautifulSoup (bs4) based scraper kept failing because it sent a high volume of requests from a single source, so we moved away from it. To avoid overloading the website from one address, we implemented a multi-proxy system using Scrapy. This allowed us to send requests through different proxies, and as a result we collected information for around 10,000 cars in approximately 2.5 hours. The collected data has been uploaded to GitHub.
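As a rough illustration of the multi-proxy idea, here is a minimal sketch of a Scrapy downloader middleware that rotates through a list of proxies in round-robin order. This is not our exact code: the class name `RotatingProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names chosen for this example, and the proxy URLs you put in that list are up to you. The only Scrapy convention it relies on is that the proxy for a request is read from `request.meta["proxy"]`.

```python
import itertools


class RotatingProxyMiddleware:
    """Downloader middleware that assigns a proxy to each request, round-robin.

    Scrapy middlewares are plain classes, so no Scrapy import is needed here.
    """

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() repeats the list forever: p1, p2, ..., pn, p1, ...
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting for this sketch,
        # not a built-in Scrapy setting.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy reads the proxy to use from request.meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        return None  # let Scrapy continue processing the request
```

To try something like this, the middleware would be registered under `DOWNLOADER_MIDDLEWARES` in the project's `settings.py`, with `PROXY_LIST` defined alongside it. Combining it with a modest `DOWNLOAD_DELAY` also helps keep the load on the website low.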

Spritmonitor data in GitHub

Next week, we’ll focus on another website, HonestJohn, to collect more information. Here is a diagram of the data flow on that website.

HonestJohn data diagram

As we mentioned last week, we’ll be using the mpg (miles-per-gallon) metric and converting it to CO2 emissions.
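To give an idea of what that conversion could look like, here is a small sketch. It rests on assumptions we have not finalized: that the mpg figures are in imperial (UK) gallons, and that burning one litre of fuel emits roughly 2.31 kg of CO2 for petrol and 2.68 kg for diesel, which are commonly cited approximate values. The function name `mpg_to_co2_g_per_km` is just for this example.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344

# Approximate grams of CO2 emitted per litre of fuel burned (assumed values).
CO2_G_PER_LITRE = {"petrol": 2310.0, "diesel": 2680.0}


def mpg_to_co2_g_per_km(mpg, fuel="petrol"):
    """Convert imperial miles-per-gallon to an estimated CO2 figure in g/km."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    # One gallon covers `mpg` miles, so fuel used per km is
    # litres-per-gallon divided by kilometres-per-gallon.
    litres_per_km = LITRES_PER_IMPERIAL_GALLON / (mpg * KM_PER_MILE)
    return litres_per_km * CO2_G_PER_LITRE[fuel]


if __name__ == "__main__":
    for fuel in ("petrol", "diesel"):
        print(fuel, round(mpg_to_co2_g_per_km(50, fuel), 1), "g/km")
```

Since higher mpg means less fuel burned per kilometre, the estimated CO2 figure falls as mpg rises, and for the same mpg a diesel car comes out higher than a petrol one because each litre of diesel emits more CO2.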
