Mastering Web Scraping with Python — Introduction part 2

Tobiloba Adejumo
Published in Dataly.ai
8 min read · Mar 30, 2019

This article is the second part in the series — Mastering Web Scraping with Python. If you did not read the first part, you can start from the series index below.

Series Intermission

  • Mastering Web Scraping with Python — Introduction part 1
  • Mastering Web Scraping with Python — Introduction part 2 (you are here)
  • Mastering Web Scraping with Python — Intermediate part 1
  • Mastering Web Scraping with Python — Intermediate part 2
  • Mastering Web Scraping with Python — Advanced part 1
  • Mastering Web Scraping with Python — Advanced part 2

Previously on Web Scraping…

In the previous intermission, we learnt about the various libraries used in web scraping, such as Beautiful Soup, requests, lxml and pandas.

Prerequisite

I assume that you have basic programming skills and experience in any language. I also assume that you have successfully set up a Python environment (Jupyter Notebook, PyCharm, etc.) on your PC. If not, download the Anaconda Distribution. Anaconda is the easiest way to do Python/R data science and machine learning on Linux, Windows, and macOS. It also comes with pre-installed Python packages, and of course with Jupyter Notebook :).

In this intermission, I mainly cite Vik Paruchuri's post, Python Web Scraping Tutorial using BeautifulSoup.

Intermission index

  • National weather service
  • Downloading weather data
  • Exploring page structure with chrome dev tools
  • Extracting information from the page
  • Extracting all the information from the page
  • Combining our data into a pandas dataframe

National weather service

In this intermission, we will be scraping weather forecasts from the National Weather Service, and then analyzing them with the pandas library by converting the data to Python objects and representing them in a DataFrame.

Downloading weather data

We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We'll extract weather information about downtown San Francisco from its forecast page on forecast.weather.gov (the URL appears in the code below).

We’ll extract data about the extended forecast.

As you can see from the image, the page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions.

Exploring page structure with chrome dev tools

The first thing we’ll need to do is inspect the page using Chrome Devtools. If you’re using another browser, Firefox and Safari have equivalents. It’s recommended to use Chrome though.

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:

Chrome Developer Tools.

The elements panel will show you all the HTML tags on the page, and let you navigate through them. It’s a really handy feature!

By right clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel:

The extended forecast text.

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast:

The div that contains the extended forecast items.

If you click around on the console, and explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

We now know enough to download the page and start parsing it. In the below code, we:

  • Download the web page containing the forecast.
  • Create a BeautifulSoup object to parse the page.
  • Find the div with id seven-day-forecast, and assign it to seven_day.
  • Inside seven_day, find each individual forecast item.
  • Extract and print the first forecast item.
import requests
from bs4 import BeautifulSoup

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

The result obtained is:

<div class="tombstone-container">
<p class="period-name">
Tonight
<br/>
<br/>
</p>
<p>
<img alt="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. "/>
</p>
<p class="short-desc">
Mostly Clear
</p>
<p class="temp temp-low">
Low: 49 °F
</p>
</div>

Extracting information from the page

As you can see, inside the forecast item tonight is all the information we want. There are 4 pieces of information we can extract:

  • The name of the forecast item — in this case, Tonight.
  • The description of the conditions — this is stored in the title attribute of the img tag.
  • A short description of the conditions — in this case, Mostly Clear.
  • The temperature low — in this case, 49 degrees.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:
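Each of these three lives in a tag with a distinctive class, so we can find the tag by class and call get_text on it. A minimal sketch — the forecast item printed above is inlined here so the snippet runs without a network request; on the live page, tonight comes from the earlier snippet:

```python
from bs4 import BeautifulSoup

# The forecast item printed above, inlined so the sketch is self-contained.
html = """
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p class="short-desc">Mostly Clear</p>
<p class="temp temp-low">Low: 49 °F</p>
</div>
"""
tonight = BeautifulSoup(html, "html.parser")

# Find each tag by its class, then pull out the text it contains.
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)      # Tonight
print(short_desc)  # Mostly Clear
print(temp)        # Low: 49 °F
```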

Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

img = tonight.find("img")
desc = img['title']

print(desc)

The result obtained is:

Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph.

Extracting all the information from the page

Now that we know how to extract each individual piece of information, we can combine our knowledge with css selectors and list comprehensions to extract everything at once.

In the below code, we:

  • Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
  • Use a list comprehension to call the get_text method on each BeautifulSoup object.

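Those two steps can be sketched as follows — a small two-item stand-in for the seven-day div is inlined so the snippet runs without a network request; on the live page, seven_day is the div found earlier with soup.find(id="seven-day-forecast"):

```python
from bs4 import BeautifulSoup

# Two forecast items inlined so the sketch is self-contained.
html = """
<div id="seven-day-forecast">
<div class="tombstone-container"><p class="period-name">Tonight</p></div>
<div class="tombstone-container"><p class="period-name">Thursday</p></div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")

# Select every period-name element inside a tombstone-container,
# then call get_text on each tag in a list comprehension.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)  # ['Tonight', 'Thursday']
```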
As you can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other 3 fields:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

The result is:

['Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Slight ChanceRain', 'Rain Likely', 'Rain Likely', 'Rain Likely', 'Chance Rain']
['Low: 49 °F', 'High: 63 °F', 'Low: 50 °F', 'High: 67 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 55 °F']
['Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. ', 'Thursday: Sunny, with a high near 63. North wind 3 to 5 mph. ', 'Thursday Night: Mostly clear, with a low around 50. Light and variable wind becoming east southeast 5 to 8 mph after midnight. ', 'Friday: Sunny, with a high near 67. Southeast wind around 9 mph. ', 'Friday Night: A 20 percent chance of rain after 11pm. Partly cloudy, with a low around 57. South southeast wind 13 to 15 mph, with gusts as high as 20 mph. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: Rain likely. Cloudy, with a high near 64. Chance of precipitation is 70%. New precipitation amounts between a quarter and half of an inch possible. ', 'Saturday Night: Rain likely. Cloudy, with a low around 57. Chance of precipitation is 60%.', 'Sunday: Rain likely. Cloudy, with a high near 64.', 'Sunday Night: A chance of rain. Mostly cloudy, with a low around 55.']

Combining our data into a pandas dataframe

We can now combine the data into a pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about pandas, the official documentation is a good place to start.

In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
weather
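Once the data is in a DataFrame, analysis becomes a one-liner. As a sketch of one possible analysis step (not part of the original tutorial), we can pull the numeric value out of strings like "Low: 49 °F" with a regular expression and average it — two rows from the table above are inlined so the snippet is self-contained:

```python
import pandas as pd

# A couple of rows from the forecast table, inlined for the sketch.
weather = pd.DataFrame({
    "period": ["Tonight", "Thursday"],
    "temp": ["Low: 49 °F", "High: 63 °F"],
})

# Extract the digits from each temp string and convert them to integers.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
print(weather["temp_num"].mean())  # 56.0
```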

Our result is a DataFrame with one row per forecast period and the four columns period, short_desc, temp and desc.

What’s Next?

In the next intermission, we will be working with another Python library, Selenium. We will work with the Instagram web application and perform activities such as logging in, scrolling, and downloading captions and images. Imagine you have 2,000 pictures on Instagram: would you use web scraping to automate the process, or download them one after the other?

Additional Information

Recommended Read

Fork me on GitHub

Check out the complete source code in the following link:

https://github.com/themavencoder/web-scraping-tutorial
