Web Scraping Using Python with a for Loop

Chinwe O.
4 min read · Mar 24, 2020


Web Scraping, also known as web extraction or web harvesting, ranks among the most sought-after skills for a Data Scientist. As an additional skill set, web scraping complements the rest of the data science toolkit.

The virtual world (the web) is a huge reservoir of information, with data spanning different sectors such as finance, health, education, and entertainment.

Web Scraping can be applied in the following:

  1. Research
  2. Price Comparison
  3. Job postings
  4. Social Media scraping

Python is a widely accepted programming language for Data Scientists because it is easy to learn, inexpensive to maintain, and has a rich suite of libraries, packages, and tools designed for data science. These libraries make it easy to connect to a website and request its data for scraping.

To build a Web scraper, the workflow looks like this:

[Image: Web Scraping workflow]

Website

To be clear, it is advisable to have basic background knowledge of HTML structures, e.g. <p>, <h1>, <div>, <a>, classes, and ids, among others, as this will help you access specific elements on a website.
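If these structures are new to you, here is a minimal sketch of how tags, classes, and ids map to BeautifulSoup lookups. The HTML below is a made-up example, not the actual Yellow Pages markup:

#toy example: tags, classes, and ids as BeautifulSoup lookups
from bs4 import BeautifulSoup

html = """
<div class="result" id="listing-1">
  <a class="business-name" href="/some-dentist">Dr. Example</a>
  <p class="phone">(555) 123-4567</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a', {'class': 'business-name'}).get_text())       #tag + class -> Dr. Example
print(soup.find('div', {'id': 'listing-1'}).find('p').get_text())  #tag + id -> (555) 123-4567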

In this project, we will be working with this website:

url = https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=San+Francisco%2C+CA

Web Scraper (Using Python and its Libraries)

Using Python, we need two libraries: BeautifulSoup and Requests.

  • BeautifulSoup is a library for parsing/analyzing HTML and XML documents, which makes it useful for web scraping. In simple terms, BeautifulSoup lets you extract the text inside a website's HTML tags and save that information.
  • Requests is a Python library that allows you to make HTTP requests from Python.

That is, response = requests.get("https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=San+Francisco%2C+CA")

Scraping with Requests and BeautifulSoup involves three components (see the sketch after this list):

  • URL
  • RESPONSE
  • SOUPCONTENT
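Here is a minimal sketch of the three components together, using the project's URL:

#the three components of a scrape
import requests
from bs4 import BeautifulSoup

url = "https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=San+Francisco%2C+CA"  #URL: the page to scrape
response = requests.get(url)                                  #RESPONSE: the HTTP reply from the server
soupcontent = BeautifulSoup(response.content, 'html.parser')  #SOUPCONTENT: the parsed HTML
print(response.status_code)  #200 means the request succeeded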

Now, let’s get our hands dirty.

PROJECT TOPIC: Write a script that scrapes 20 records from the website page and uploads them to a CSV or Excel file. We will be scraping specifically for name, occupation, reviews, address, and phone numbers, and the CSV or Excel headers should follow the same format. If a piece of information, such as a review, is not found on the page, the script should return a blank or null value.

STEP 1: IMPORT PYTHON LIBRARIES (BEAUTIFULSOUP AND REQUEST)

#import python libraries
from bs4 import BeautifulSoup  #to parse the page and search for specific elements
import requests  #to connect to the web
import re  #regular expressions
import urllib.request  #to download images from their urls
import pandas as pd
from pandas import DataFrame

#request with python beautifulsoup using URL, RESPONSE AND SOUPCONTENT
url = "https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=San+Francisco%2C+CA"
try:
    response = requests.get(url)
    soupcontent = BeautifulSoup(response.content, 'html.parser')
    print(soupcontent)  #prints out the page source
except:
    print('An error occurred')  #in case an error occurs
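A bare except hides what actually went wrong. One optional refinement (an assumption of mine, not part of the original script) is to let Requests raise a descriptive error for bad status codes:

#optional variant: surface the actual failure instead of a generic message
try:
    response = requests.get(url, timeout=10)  #timeout added as a safeguard (my assumption)
    response.raise_for_status()               #raises requests.HTTPError on 4xx/5xx responses
    soupcontent = BeautifulSoup(response.content, 'html.parser')
except requests.RequestException as err:
    print(f'Request failed: {err}')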

STEP 2: SELECT THE BODY ELEMENT CONTAINING THE DATA

#select the container with all the 30 different dentist listings
body_dentist = soupcontent.find('div', {'class': 'search-results organic'})

#select individual dentist listings
each_dentist = body_dentist.find_all('div', {'class': 'result'})
print(each_dentist)
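Before looping, a quick sanity check (not part of the original walkthrough) confirms the selectors actually matched something:

#sanity check: how many listings did the selectors match?
print(len(each_dentist))  #the page shows about 30 organic results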

STEP 3: ACCESS THE TEXT IN THESE ELEMENTS (such as Name, occupation, reviews, address, and phone numbers)

Using a for loop, access the details of the first 20 dentists.

alldentist = []

for i in each_dentist[:20]:  #to print only the first 20
    name = i.find('a', {'class': 'business-name'}).get_text()

    occupation = i.find('a', {'href': '/san-francisco-ca/dentists'}).get_text()

    reviews = i.find('div', {'class': 'ratings'}).get_text()

    address = i.find('div', {'class': 'street-address'}).get_text()
    address1 = i.find('div', {'class': 'locality'}).get_text()
    address2 = address + ", " + address1

    phone_numbers = i.find('div', {'class': 'phones phone primary'}).get_text()

    images = i.find('img')['src']

    dentist = [name, occupation, reviews, address2, phone_numbers, images]
    alldentist.append(dentist)
    print(dentist, '\n')
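The project brief says a missing field (such as a review) should come back blank rather than crash the loop, but find() returns None when a tag is absent, and calling get_text() on None raises an error. One way to handle this, sketched here as a hypothetical helper not in the original post, is to check for None first:

#hypothetical helper: return the tag's text, or '' when the tag is missing
def safe_text(element, tag, attrs):
    found = element.find(tag, attrs)
    return found.get_text() if found else ''

#example use inside the loop: reviews = safe_text(i, 'div', {'class': 'ratings'})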

STEP 4: Save the Data

#Save the data in a DataFrame with Name, Occupation, Reviews, Address, Phone Numbers and Images columns, then write it to a csv file
dentistdf = pd.DataFrame(alldentist, columns=['Name', 'Occupation', 'Reviews', 'Address', 'Phone Numbers', 'Images'], index=range(1, 21))
print(dentistdf)
dentistdf.to_csv('dentisttrial.csv')  #convert to csv; the columns and 1-20 index are written as-is
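To confirm the file was written correctly, you can read it straight back (a quick check, not in the original post):

#read the csv back to verify the export
check = pd.read_csv('dentisttrial.csv', index_col=0)
print(check.head())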

That is it! Building a web scraper is considered a good beginner-friendly project when starting out on the data science track because it helps solidify the basics of data collection, data conversion, the use of loops and functions, indexing/slicing, etc.

An important thing to note is that not all websites allow their data to be scraped, so scrape legally. A site's robots.txt file lets bots know which pages may or may not be crawled. It is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover hints about the website's structure.
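Python's standard library can do this check for you. Here is a minimal sketch using urllib.robotparser; the actual rules on yellowpages.com may differ, so treat the output as illustrative:

#check whether a url may be crawled according to robots.txt
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.yellowpages.com/robots.txt')
rp.read()  #fetch and parse the rules

url = 'https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=San+Francisco%2C+CA'
print(rp.can_fetch('*', url))  #True if the generic user agent may crawl this url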

I would love to say a very big thank you to the founder of SheCode Africa (@shecodeafrica), Ada Nduka Oyom (@kolokodess), and all the team behind the SCA Mentorship Program. Also, my heartfelt gratitude goes to Barri Sambaris (@Barine); thank you so much for doing such an amazing job mentoring me over the past three months and constantly following up to resolve whatever learning issues I encountered.
