Web Scraping Weather Data With Selenium Webdriver
I recently finished a Master's Degree in Data Science, and as part of my final project (Manhattan Taxi Demand Prediction) I needed to scrape precipitation data from wunderground.com. I tried the Python library BeautifulSoup but ran into a problem that kept me stuck for days until I figured it out, so I thought I would write this article to help other people in the same situation.
On my first attempt, I used the code below to glance at the page source with soup.prettify(). To my surprise, I did not see the same code that my internet browser showed me in its inspector.
from bs4 import BeautifulSoup
import requests

url = 'https://www.wunderground.com/hourly/us/ny/new-york-city/KNYNEWYO1335'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
There were a lot of lines missing. I tried many options and searched the internet for hours (Stack Overflow and elsewhere) until I finally saw a glimmer of light: wunderground.com seems to have a security feature that blocks known spider/bot user agents (such as the one urllib uses in Python). This makes sense: if you want their data, they want you to pay for it (through their API).
If you cannot pay for the API, you have to pretend you are accessing the site from a known browser user agent (e.g. Chrome).
This is where Selenium WebDriver comes in. WebDriver drives a browser natively, as a user would. So to wunderground.com you are not a Python script making dozens of requests and scraping tons of data; you are just another person browsing their website with Google Chrome, and with Chrome you can access the entire page source.
Let's go into detail…
- Install selenium
!pip install selenium
- Download and install chromedriver.exe from here. It needs to be saved in the same folder as your notebook or Python script, and it needs to be compatible with your Chrome version.
My use case
My aim was to scrape the hourly precipitation forecast for New York City for the next three days and store it in a data frame I could use afterwards.
So here is the script explained step by step:
- lookup_URL: the URL I want to scrape data from. I wanted the precipitation forecast for the next 3 days. The curly brackets at the end allowed me to use .format(YYYY, M, D) within a while loop to open tomorrow's forecast page, and then the following days' (you will see later).
- start_date: the day after I run the script.
- end_date: 3 days later.
- df_prep: Data frame to store results.
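The variables above can be sketched as follows. The exact date-parameterised URL pattern and the data frame's column names are my assumptions; the original snippet is not shown in the article:

```python
from datetime import date, timedelta

import pandas as pd

# URL template -- the trailing {}-{}-{} slots are filled with
# year, month, day via .format() inside the loop (assumed pattern).
lookup_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}'

start_date = date.today() + timedelta(days=1)  # tomorrow
end_date = start_date + timedelta(days=3)      # three days later

# Empty data frame to collect the scraped values (assumed column names).
df_prep = pd.DataFrame(columns=['date', 'precipitation'])
```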
Add argument ‘headless’ to Chrome options
Configure webdriver to run in ‘headless’ mode. This will run Chrome in the background, without a visible UI shell.
options = webdriver.ChromeOptions()
options.add_argument('headless')
Create an instance of ChromeDriver
This is the first step to start using the webdriver. I pass in the location of chromedriver.exe and the options from the previous step. In my case, chromedriver.exe is stored in the same folder as the Python script.
driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)
The while loop allows me to open tomorrow’s forecast website, do something, open next day’s forecast, do the same, and so on.
I print the date to the console to make sure every loop iteration has run. Then I format lookup_URL with the right date, scrape the data, and advance start_date by one day.
This is the structure of the loop:
while start_date != end_date:
    print('gathering data from: ', start_date)
    formatted_lookup_URL = lookup_URL.format(start_date.year,
                                             start_date.month,
                                             start_date.day)
    # SCRAPE AND STORE DATA
    start_date += timedelta(days=1)
Open the URL in the background
The method .get() opens the URL in the background.
The next line is very important, and it took me hours to figure out why my code did not work without it. It turns out the website takes a couple of seconds to load, and my (old) code was scraping incomplete data because the page was not fully loaded yet.
WebDriverWait() tells the driver to wait a given number of seconds, or until() an expected condition (EC) is met. There are many expected conditions you can use; I used visibility_of_all_elements_located(), specifying the XPath (By.XPATH) found in the HTML structure of the page.
How do I find the XPath? By inspecting the website: select the information you want to scrape in your browser, then right-click > Inspect Element. This opens the developer tools and shows the HTML code of the page.
"mat-cell cdk-cell cdk-column-liquidPrecipitation mat-column-liquidPrecipitation ng-star-inserted" contains the precipitation values I am looking for (see image below). I store the elements matching this class in the rows variable.
Iterate through the elements inside ‘rows’
Precipitation values are stored under the class shown above. I loop through the rows matching that class, get each text value, and append it to the df_prep data frame.
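The iteration step can be sketched as below. The function name and column names are my assumptions; `rows` stands for the Selenium WebElements matching the class, and each element's `.text` holds the scraped value:

```python
import pandas as pd


def append_precipitation(rows, day, df_prep):
    """Append one record per scraped cell to `df_prep` and return the result.

    `rows` is any iterable of objects with a `.text` attribute
    (e.g. Selenium WebElements).
    """
    new = pd.DataFrame({'date': [day] * len(rows),
                        'precipitation': [r.text for r in rows]})
    # pd.concat replaces the deprecated DataFrame.append.
    return pd.concat([df_prep, new], ignore_index=True)
```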
The result is the screenshot below (I know, it looks like there won't be a drop of rain in the next 3 days).
And this is the entire script:
I hope I was able to help someone with this article. You can leave comments and questions below and I will gladly answer.
If you want to know more about this project feel free to visit my GitHub repository: Manhattan Taxi Demand Predictor.
You can contact me on my LinkedIn profile.