Scrape a web page in 20 lines of code with Python and BeautifulSoup

Kashif Aziz
3 min read · Oct 25, 2017

Using Python and BeautifulSoup, we can quickly and efficiently scrape data from a web page. In the example below, I am going to show you how to scrape a web page in 20 lines of code.

What is Web Scraping:

Web scraping is the process of automatically extracting information from a website. Web scraping is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.

A word of caution: Always respect the website’s privacy policy and check robots.txt before scraping. If a website offers an API to interact with its data, it is better to use that instead of scraping.
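As a rough illustration of the robots.txt check, here is a minimal sketch using urllib.robotparser from the Python standard library. The robots.txt content, the example.com URLs and the is_allowed helper are all made up for this sketch; they are not taken from ESPN or any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/public/page")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/private/page")
print(public_ok, private_ok)
```

Against a live site, the same parser can load the file itself: call set_url() with the site's robots.txt address, then read(), before asking can_fetch().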

Web Scraping with Python and BeautifulSoup:

We are scraping college footballer data from the ESPN website. I am using the data set from 2006.

As we are scraping the web page using the BeautifulSoup and Requests libraries, we need to install them first. This can be done using pip:

pip install requests

pip install beautifulsoup4

Ok. Time to brew some Python magic.

We start by importing the required libraries. As we are going to save the extracted data to a CSV file, the csv module is imported too.

from bs4 import BeautifulSoup
import requests
import os, os.path, csv

The next step is to fetch the web page and store it in a BeautifulSoup object. We also need a parser to parse the fetched page. BeautifulSoup can work with a variety of parsers; in this example we are using html.parser, which ships with Python and needs no extra installation.

listingurl = "http://www.espn.com/college-sports/football/recruiting/databaseresults/_/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited"
response = requests.get(listingurl)
soup = BeautifulSoup(response.text, "html.parser")
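The fetch above assumes the request succeeds. As a slightly more defensive sketch, the request can be wrapped in a small helper that sets a timeout and raises an error on a bad HTTP status. The fetch_html name and the 10 second timeout are my own choices for illustration, not part of the original 20 lines.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text.

    Raises requests.HTTPError on a 4xx/5xx response and
    requests.Timeout if the server does not answer in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early instead of parsing an error page
    return response.text
```

With this helper, the soup object would be built as BeautifulSoup(fetch_html(listingurl), "html.parser").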

Now comes the fun part.

We are going to extract the player name, school, city, playing position and grade.

Web Scraping using Python and BeautifulSoup [more here]

On viewing the source code (CTRL + U in Chrome), we note that the page uses a table to display the data. The rows carry alternating oddrow and evenrow classes to give a shading effect, and the fields are enclosed in td tags.

The next step is to find all rows, checking for both odd and even rows, and traverse their columns to fetch the data.
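To see the row selection in isolation, it can be tried on a tiny table first. The HTML below is made up for this sketch and only mimics the structure described above (oddrow and evenrow classes, a name link inside a div, fields in td tags); it is not copied from the ESPN page. It also uses the class_ filter of find_all with a list of class names, which is an alternative to checking the class attribute by hand.

```python
from bs4 import BeautifulSoup

# Made-up sample table that imitates the structure described in the text
SAMPLE_HTML = """
<table>
  <tr class="colhead"><td>Name</td><td>Hometown</td><td>Pos</td><td>Stars</td><td>Grade</td></tr>
  <tr class="oddrow"><td><div class="name"><a href="#">Player One</a></div></td><td>Springfield, ILCentral High</td><td>QB</td><td>4</td><td>80</td></tr>
  <tr class="evenrow"><td><div class="name"><a href="#">Player Two</a></div></td><td>Riverton, UTRiverton High</td><td>WR</td><td>3</td><td>75</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Passing a list to class_ matches rows that have any of the listed classes,
# so the header row (class "colhead") is skipped
data_rows = soup.find_all("tr", class_=["oddrow", "evenrow"])

names = [row.find("div", class_="name").a.get_text() for row in data_rows]
positions = [row.find_all("td")[2].get_text() for row in data_rows]
print(names, positions)
```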

Note that we need to separate city and school from the hometown field.

The fetched data is appended to a list, which will be written to a CSV file at a later stage.

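listings = []
for rows in soup.find_all("tr"):
    # Data rows carry the "oddrow" or "evenrow" class; .get() avoids a KeyError on rows without a class
    row_classes = rows.get("class", [])
    if ("oddrow" in row_classes) or ("evenrow" in row_classes):
        name = rows.find("div", class_="name").a.get_text()
        hometown = rows.find_all("td")[1].get_text()
        # The slicing assumes the cell text looks like "City, ST" followed directly by the school name
        school = hometown[hometown.find(",")+4:]
        city = hometown[:hometown.find(",")+4]
        position = rows.find_all("td")[2].get_text()
        grade = rows.find_all("td")[4].get_text()
        listings.append([name, school, city, position, grade])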
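The city and school split is the least obvious part of the loop, so here is the same slicing pulled out into a small function. The split_hometown name and the sample strings are made up for this sketch, and it assumes the hometown cell reads as "City, ST" followed directly by the school name, which is what the +4 offset in the loop implies.

```python
def split_hometown(hometown):
    """Split 'City, STSchool Name' into ('City, ST', 'School Name').

    Assumes a comma, a space and a two-letter state code, as the +4 offset implies.
    """
    comma = hometown.find(",")
    if comma == -1:
        # No comma found: treat the whole text as the city and leave the school empty
        return hometown.strip(), ""
    cut = comma + 4  # comma + space + two-letter state code
    return hometown[:cut].strip(), hometown[cut:].strip()


city, school = split_hometown("Springfield, ILCentral High")
print(city)    # Springfield, IL
print(school)  # Central High
```

The extra check for a missing comma is my addition; without it, find() returns -1 and the slice would cut the text after its third character.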

The final section of the code opens a CSV file in append mode and writes the contents of the list to it. A confirmation message is printed at the end.

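# 'a' appends, so running the script again adds the same rows a second time
# newline="" stops the csv module from writing blank rows on Windows
with open("footballers.csv", 'a', newline="", encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(listings)

print("ESPN College Football listings fetched.")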
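The file written above has no header row. If you want column names in the output, one possible variation is sketched below. The write_listings and read_listings helpers, the column names in HEADER and the use of 'w' mode (which overwrites the file instead of appending) are my own choices for illustration.

```python
import csv

# Column names chosen for this sketch, in the same order the rows are built
HEADER = ["name", "school", "city", "position", "grade"]


def write_listings(path, rows):
    """Write a header row followed by the listings, replacing any existing file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_listings(path):
    """Read the file back as a list of dicts keyed by the header names."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Calling write_listings("footballers.csv", listings) would then replace the with block above, and read_listings is a quick way to check what was saved.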

That’s all folks!

This is how Python and BeautifulSoup are used to scrape a web page in just 20 lines of code.

While the code meets the requirements, it is not very elegant or self-explanatory. The detailed version of the code, which includes comments and extra bits to tie up the loose ends, is available on GitHub [here].

For more resources on web scraping with Python and BeautifulSoup, check my blog post here.

Kashif Aziz

Tech Consultant | Solopreneur | Python | Adsense | WordPress | SEO | Web Scraping