Make a Web Scraper with Python and Beautiful Soup
Here’s how to make your own web scraper in less than 10 minutes
Web scraping is the process of automatically collecting information from web pages. This tutorial will help you build a scraper from scratch using Python.
Set up your environment
Make sure you have a recent version of Python 3 installed. Run this in a terminal to check which version you’re on (on some systems the command is python3 instead of python):
python --version
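If you'd rather do the same check from inside Python, here is a minimal sketch. The MINIMUM_VERSION threshold and the is_supported helper are names I made up for illustration; the threshold is not a requirement stated by any of the libraries used here, so adjust it to your needs.

```python
import sys

# Hypothetical threshold chosen for illustration; adjust it to your needs.
MINIMUM_VERSION = (3, 8)


def is_supported(version=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Show the interpreter version and whether it passes the check
    print("Running Python", sys.version.split()[0])
    print("Supported:", is_supported())
```

Save it in a file and run it with the same python command to confirm which interpreter your terminal is using.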
Create your project folder and, inside it, a Python file named app.py.
Next, create a virtual environment named env by running:
python -m venv env
To activate the environment on macOS or Linux, run the command below (on Windows, run env\Scripts\activate instead):
source env/bin/activate
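To confirm the environment is really active, you can ask Python itself. This is a small sketch, not part of the scraper: in an environment created with venv, sys.prefix points at the environment while sys.base_prefix points at the base installation. The in_virtualenv helper is a name I made up for this example.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
```

If this prints False after activation, the terminal is probably still using the system interpreter.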
Libraries
To run this scraper, we’ll need to add a few libraries to interact with the browser and collect the information we need.
Selenium: a browser automation framework, often used for cross-browser testing, that lets us drive a real browser from Python. To install Selenium, run:
pip install selenium
Web Driver Manager: this simplifies the management of binary drivers for different browsers. Instead of you downloading the exact driver that matches your browser, Web Driver Manager does it for you. To install webdriver-manager, run:
pip install webdriver-manager
Beautiful Soup: a package for parsing HTML and XML documents. To install it, run:
pip install beautifulsoup4
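Before moving on, it can help to check that all three packages landed in the active environment. Here is a rough sketch using only the standard library. The missing_modules helper and the REQUIRED_MODULES tuple are names I invented for this example; the module names are the ones the scraper imports, which differ from some of the pip package names.

```python
from importlib.util import find_spec

# Import names used by the scraper (not always the same as the pip package names).
REQUIRED_MODULES = ("selenium", "webdriver_manager", "bs4")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All libraries are installed.")
```

An empty result means every import in app.py should resolve.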
Let’s Code
We’ll first import our libraries:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
Then we’ll set up the driver. We’ll use wikiHow for this tutorial:
url = 'https://www.wikihow.com/Main-Page'
# Selenium 4 expects the driver path to be wrapped in a Service object
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get(url)
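Next comes the main part of the script. We need to tell Selenium which element to read, and we’ll identify it with an XPath. To get one, right-click the element on the page and choose Inspect, then right-click the highlighted node in the developer tools and choose Copy > Copy XPath.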
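If XPath is new to you, here is a tiny sketch of what such an expression selects. It uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath, instead of a browser. The markup and the title text are made up for illustration and are much simpler than a real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; a real page is far larger.
SAMPLE = """
<html>
  <body>
    <div id="hp_container_inner">
      <h1>Example title</h1>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# "Any element whose id is hp_container_inner, then its h1 child."
# ElementTree paths are relative to the root element, hence the leading dot.
heading = root.find(".//*[@id='hp_container_inner']/h1")
title = heading.text if heading is not None else None
print(title)
```

The expression reads left to right: find any element with the given id, then step down to its h1 child.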
The code below finds the element by its XPath, parses its HTML, and prints the result. For this example, I copied the XPath of the title on wikiHow’s homepage:
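from selenium.webdriver.common.by import By  # belongs with the other imports at the top of app.py

try:
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium 4 releases removed
    element = browser.find_element(By.XPATH, '//*[@id="hp_container_inner"]/h1')
    element_data = element.get_attribute('innerHTML')
    parsed_info = BeautifulSoup(element_data, 'html.parser')
    print(parsed_info)
except Exception as error:
    # Report the problem (for example, the element was not found) instead of failing silently
    print(f'Scraping failed: {error}')
finally:
    # Always close the browser, even if something went wrong
    browser.quit()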
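The scraper prints the parsed markup, tags included. If you only want the visible text, Beautiful Soup's get_text() can strip the tags. Here is a small sketch that runs without a browser. The inner_html_to_text helper and the sample fragment are made up for this example and stand in for whatever element.get_attribute('innerHTML') returns on the real page.

```python
from bs4 import BeautifulSoup


def inner_html_to_text(inner_html):
    """Parse an HTML fragment and return its visible text without tags."""
    soup = BeautifulSoup(inner_html, "html.parser")
    return soup.get_text().strip()


# Made-up fragment standing in for the innerHTML of a scraped element.
sample = "Welcome to <b>the site</b>"
print(inner_html_to_text(sample))
```

Trying the parsing step on a fixed string like this makes it easier to tell parsing problems apart from browser or XPath problems.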
Now run the app:
python app.py
Congrats! You’re running your first web scraper. Let me know how it goes in the comments.
Stay tuned for more & happy coding!