Web Scraping for Data Collection
Collecting McDonald’s Addresses using Python
I was working on a machine learning classification model to determine the level of diabetes prevalence across the US population. After searching the web for ready-to-use data, I realized I was going to have to do some hands-on work to put together my own data set.
I decided to include a McDonald’s-to-population ratio variable in my data. To compute it, I needed the count of restaurants by county. Some websites offer this data, but they charge around $75 for it, and I definitely did not want to pay for my data.
Since I did not find an open API, I decided to go with web scraping. In this blog post I will share the Python code I used to collect the addresses of McDonald’s restaurants by zip code.
Getting Started with Web Scraping
1. Choose a web page to use for data extraction: I used Menuism to extract McDonald’s addresses by zip code. Note that I collected the data for a personal project with no intent of using it for commercial purposes. Always make sure to read the terms and conditions of the web page you are extracting data from.
2. Check the anatomy of the web page’s URL: look for repeating patterns that tell you which variables to pass in the URL when loading new content. In my case, I searched for McDonald’s restaurants by passing different zip codes to the URL (Fig 1).
3. Inspect the web page’s HTML document: if you are using Google Chrome, you can open the Elements panel by following the steps in Fig 2. Find the data you wish to collect and inspect its HTML element. Identify the attribute of the element that contains the content to extract. Fig 3 shows an example of the “onclick” attribute of an “a” element that contains the city name of the McDonald’s listings; the short sketch after this list shows how to read such an attribute.
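As a toy example, here is how BeautifulSoup can read content out of an element attribute. The markup below is invented for illustration; the real attribute value is the one shown in Fig 3.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration -- the real attribute value is in Fig 3.
html = "<a href='#' onclick=\"trackCity('Chicago')\">McDonald's - Chicago</a>"
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
print(link["onclick"])                # trackCity('Chicago')
print(link["onclick"].split("'")[1])  # Chicago
```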
Python Code
After following the steps described in the last section, I put together my web scraping code. The routine captures the name of the restaurant, the first line of the address, and the city.
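A minimal sketch of such a routine, using requests and BeautifulSoup, might look like the following. The URL pattern, tag names, and CSS classes here are placeholders rather than Menuism’s actual markup; the real values come from the inspection steps above.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern -- the real Menuism search URL is the one in Fig 1.
SEARCH_URL = "https://www.menuism.com/restaurants/mcdonalds/{zip_code}"

def scrape_zip(session, zip_code):
    """Return a list of (name, address_line, city) tuples for one zip code."""
    response = session.get(SEARCH_URL.format(zip_code=zip_code), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    listings = []
    # Illustrative tag names and classes -- inspect the live page (Figs 2-3)
    # to find the real ones.
    for item in soup.find_all("li", class_="listing"):
        name = item.find("h2").get_text(strip=True)
        address = item.find("span", class_="address").get_text(strip=True)
        # The city sits inside the "onclick" attribute of an "a" element (Fig 3).
        city = item.find("a")["onclick"].split("'")[1]
        listings.append((name, address, city))
    return listings

# Example call for a single zip code:
# print(scrape_zip(requests.Session(), "60601"))
```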
Web Scraping Tips and Tricks
- Divide your web scraping requests into several rounds: I ran my scraping routine in rounds of 1,000 requests (1,000 zip codes) and saved the data to my computer after each round. When web scraping, the server can terminate the connection with your IP address at any time, and you don’t want to risk losing too much data. (The sketch after this list combines rounds, random delays, and header rotation.)
- Set random delays between requests to avoid overloading the website’s server.
- Switch user-agent headers and delete cookies between scraping rounds. You can search online for example user-agent strings.
- Use a proxy pool if you are scraping a large volume of data. I made around 15k requests without a proxy and had no problems, but ideally you would route large volumes of requests through a pool of proxy servers.
- If you need to download the data again after some time, check that the HTML elements and attributes have not changed so that your scraping code still runs.
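Putting the first few tips together, a scraping loop might look something like the sketch below. It assumes the `scrape_zip` function from the previous section; the user-agent strings, delay range, and file names are illustrative.

```python
import csv
import random
import time

import requests

# Example user-agent strings -- search online for more and swap in your own.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

ROUND_SIZE = 1000  # zip codes per round

def scrape_in_rounds(zip_codes):
    for start in range(0, len(zip_codes), ROUND_SIZE):
        # Fresh session per round: a new cookie jar and a new user-agent header.
        session = requests.Session()
        session.headers["User-Agent"] = random.choice(USER_AGENTS)
        # For very large jobs, rotate proxies here as well, e.g.:
        # session.proxies = {"https": "http://user:pass@proxyhost:8080"}

        rows = []
        for zip_code in zip_codes[start:start + ROUND_SIZE]:
            time.sleep(random.uniform(1.0, 4.0))  # random delay between requests
            rows.extend(scrape_zip(session, zip_code))

        # Save each round to disk immediately, so a dropped connection
        # costs at most one round of data.
        path = f"mcdonalds_round_{start // ROUND_SIZE}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["name", "address", "city"])
            writer.writerows(rows)
```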
Additional Quality Check
The Menuism scraping routine gave me the primary number and street name of each McDonald’s restaurant. I then used the Google Maps Places API to get the full formatted address, to confirm that each address actually corresponded to a McDonald’s restaurant, and to check whether the restaurant was still operating.
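A minimal sketch of this check, assuming the googlemaps Python client and a Places API key of your own; the query format and the matching logic here are illustrative, not the exact code I ran.

```python
import googlemaps  # pip install googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # replace with your own key

def verify_listing(name, address, city):
    """Return the full formatted address if the place really is a McDonald's
    that is still operating, otherwise None."""
    response = gmaps.places(query=f"{name} {address} {city}")
    for result in response.get("results", []):
        if ("mcdonald" in result["name"].lower()
                and result.get("business_status") == "OPERATIONAL"):
            return result["formatted_address"]
    return None

# Example (illustrative address):
# print(verify_listing("McDonald's", "600 N Clark St", "Chicago"))
```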
If you want to read about this process, take a look at my Google Maps: Places API blog post!