Make a Web Scraper with Python and Beautiful Soup

Published in TheCodr, Sep 10, 2021

Here’s how to make your own web scraper in less than 10 minutes

Photo by Filiberto Santillán on Unsplash

Web scraping is the process of automatically collecting information from web pages. This tutorial will help you build a scraper from scratch in Python, using Selenium to load a page and Beautiful Soup to parse it.

Set up your environment

Make sure you have a recent version of Python 3 installed. Run this in a terminal to check which version you’re on:

python --version

Create your project folder and, inside it, a Python file named app.py.

Next, we’ll create a virtual environment named env. From the project folder, run:

python -m venv env

To activate the environment on macOS or Linux, run the command below (on Windows, run env\Scripts\activate instead):

source env/bin/activate
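
If you want to double-check that the environment is really active, here is a minimal sketch you could run from inside it. The helper name in_virtualenv is made up for this example; it relies on the fact that inside a venv, sys.prefix and sys.base_prefix point to different places.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Virtual environment active:", in_virtualenv())
```

If the second line prints False, activate the environment again before installing the libraries.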

Libraries

To run this scraper, we’ll need to add a few libraries to interact with the browser and collect the information we need.

Selenium: a browser automation framework, often used for cross-browser testing, that lets us drive a real browser from Python. To install Selenium, run:

pip install selenium

Web Driver Manager: this simplifies managing the binary drivers for different browsers. Instead of installing the exact driver yourself, you let webdriver-manager download the matching one for you. To install it, run:

pip install webdriver-manager

Beautiful Soup: a package for parsing HTML and XML documents. To install it, run:

pip install beautifulsoup4
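
To get a feel for what Beautiful Soup does before a browser is involved, here is a small sketch that parses a hard-coded HTML string. The HTML, the id, and the links are invented for illustration and stand in for a page you would normally download.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html>
  <body>
    <h1 id="title">How to Do Anything</h1>
    <a href="/Bake-Bread">Bake Bread</a>
    <a href="/Tie-a-Tie">Tie a Tie</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text content.
title = soup.find("h1", id="title").get_text(strip=True)

# find_all() returns every matching tag; tag["href"] reads an attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

The same find and find_all calls work on any HTML string, including the one we’ll pull out of the browser later.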

Let’s Code

We’ll first import our libraries:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

Then we’ll set up the driver. We’ll use wikiHow for this tutorial:

url = 'https://www.wikihow.com/Main-Page'
# Selenium 4 expects the driver path wrapped in a Service object
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get(url)

Now for the main part. We need to tell Selenium which element to collect, and we’ll do that with an XPath. To get one, inspect the element in your browser’s developer tools, right-click it in the Elements panel, and choose Copy > Copy XPath.
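
If XPath is new to you, here is a rough sketch of how such a path picks an element out of a document. It uses Python’s built-in xml.etree.ElementTree, which only supports a subset of XPath and needs a leading dot on the path, so treat it as an illustration and not as what the browser does internally. The markup and heading text are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; a real page is far larger.
page = """
<html>
  <body>
    <div id="hp_container">
      <div id="hp_container_inner">
        <h1>Example heading</h1>
      </div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# ".//*" searches every descendant, "[@id='...']" keeps the one with
# that id, and "/h1" steps down to its h1 child.
heading = root.find(".//*[@id='hp_container_inner']/h1")

print(heading.text)
```

The path reads left to right: find any element with the given id, then take the h1 directly inside it.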

The code below finds the element by its XPath, parses its HTML with Beautiful Soup, and prints the result. For this example, I copied the XPath of the title on wikiHow’s homepage:

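try:
    # find_element takes a locator strategy and a value;
    # 'xpath' is the same string as By.XPATH
    element = browser.find_element('xpath', '//*[@id="hp_container_inner"]/h1')
    element_data = element.get_attribute('innerHTML')
    parsed_info = BeautifulSoup(element_data, 'html.parser')
    print(parsed_info)
except Exception as error:
    # For example, the element is missing because the page layout changed
    print(f'Scraping failed: {error}')
finally:
    # Close the browser whether or not the scrape succeeded
    browser.quit()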
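
The innerHTML of an element can still contain nested tags. If you only want the readable text, one option is a small helper like the sketch below. The function name inner_html_to_text and the sample string are assumptions for this example, not part of the tutorial’s app.py.

```python
from bs4 import BeautifulSoup


def inner_html_to_text(inner_html: str) -> str:
    """Strip the tags from an HTML fragment and return its text."""
    soup = BeautifulSoup(inner_html, "html.parser")
    # strip=True trims each piece of text, and separator=" " joins
    # the pieces with a single space.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical fragment, similar to what get_attribute('innerHTML') returns.
sample = '<span class="mark">Welcome</span> to <b>the site</b>'
print(inner_html_to_text(sample))
```

In the scraper, you could pass element_data to this helper instead of printing the parsed object directly.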

Now run the app:

python app.py

Congrats! You’re running your first web scraper. Let me know how it goes in the comments.

Stay tuned for more & happy coding!
