Web Scraping Using BeautifulSoup + Pushing Data to MySQL Database
Create your own dataset by regularly scraping and logging data from websites.
In this article, we will learn how to scrape data from a website’s HTML content with BeautifulSoup, a popular Python package for web scraping. Then, we will learn how to push the data into a table in a MySQL database.
Scrape HTML data with BeautifulSoup
BeautifulSoup is a Python package for parsing HTML and XML documents. Every website accessed via a web browser consists of HTML documents, which can be seen by inspecting the page.
Let’s say you are a bookworm. You want to keep yourself updated with new books from Book Depository, and you want to update your list regularly. You can check the New Releases page on Book Depository’s website every now and then (by visiting https://www.bookdepository.com/top-new-releases), but it will be far easier to process that information if you can scrape the data and save it in tabular format.
After inspecting the elements of that page, you discovered that all book elements are stored under a div tag with the "book-item"
class. From each book-item element, you can scrape various pieces of information, such as ISBN, title, author name, published date, book format, and price. You want to record all of them, plus one additional field: the scraping date. So now you have a rough plan for your web scraping journey: (1) scrape the information from the HTML and convert it to tabular format, and (2) save the table to an external location. For the sake of simplicity, we will use the requests package to send an HTTP request and get the HTML response, BeautifulSoup to parse the HTML, pandas to convert the information to a DataFrame, and finally the sqlalchemy package to push the data to a MySQL database.
Install BeautifulSoup using a Python package manager. If you use pip, install it with pip install beautifulsoup4
. This tutorial also uses the requests, pandas, sqlalchemy, and pymysql packages, which can be installed the same way. Then, import the packages and request the HTML via an HTTP request.
import requests
import pandas as pd
import datetime
from bs4 import BeautifulSoup
from sqlalchemy import create_engine

URL = 'https://www.bookdepository.com/top-new-releases'
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
books = soup.find_all("div", class_ = "book-item")
In the above code snippet, a GET request was sent to the URL. The response is stored in the page
variable, which is then parsed by BeautifulSoup. Next, we take all HTML elements with the book-item
class and store them in the books
list. The books
variable contains hundreds of book elements, each of which looks like this:
print(books[0])
===============
<div class="book-item" itemscope="" itemtype="http://schema.org/Book">
<div class="item-img">
<a href="/Mirror-Light-Hilary-Mantel/9780007481002" itemprop="url">
<img alt="The Mirror and the Light" class="lazy" data-lazy="https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/mid/9780/0074/9780007481002.jpg"/>
</a>
</div>
<meta content="9780007481002" itemprop="isbn"/>
<meta content="The Mirror and the Light" itemprop="name"/>
<meta content="Hilary Mantel" itemprop="contributor"/>
<div class="item-info">
<h3 class="title">
<a href="/Mirror-Light-Hilary-Mantel/9780007481002">
The Mirror and the Light<br/>
</a>
</h3>
<p class="author">
<a href="/author/Hilary-Mantel" itemprop="author">Hilary Mantel</a>
</p>
<div class="rating-wrap">
<a aria-label="Ratings for The Mirror and the Light" href="/Mirror-Light-Hilary-Mantel/9780007481002">
<div class="stars">
<span class="star full-star"></span>
<span class="star full-star"></span>
<span class="star full-star"></span>
<span class="star full-star"></span>
<span class="star half-star"></span>
</div>
</a>
</div>
<p class="published" itemprop="datePublished">29 Apr 2021</p>
<p class="format">Paperback</p>
<div class="price-wrap">
<p class="price">
Rp271.455</p>
</div>
</div>
<div class="item-actions">
<div class="btn-wrap">
<a class="btn btn-sm btn-primary add-to-basket" data-currency="IDR" data-isbn="9780007481002" data-price="271455" data-ref="carousel" data-show-related="false" href="/basket/addisbn/isbn13/9780007481002" rel="nofollow">
Add to basket</a>
</div>
</div>
</div>
Based on the element’s structure, we can determine where to scrape each piece of information from. For instance, ISBN, price, and currency can be scraped from the a
tag with the "btn btn-sm btn-primary add-to-basket"
class. The title can be scraped from the meta
tag whose itemprop attribute is "name"
, while the author name, book format, and published date can be scraped from the p
tags with the "author"
, "format"
, and "published"
classes respectively.
For the first book element, each piece of information can be scraped with these lines of code:
book_title = books[0].find("meta", {"itemprop":"name"}).get("content")
book_author = books[0].find("p", class_ = "author").find("a").text
book_format = books[0].find("p", class_ = "format").text
book_published = books[0].find("p", class_ = "published").text
book_isbn = books[0].find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-isbn")
book_currency = books[0].find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-currency")
book_price = books[0].find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-price")
To scrape information from all book elements, create a dictionary where each key corresponds to a column (e.g. title, author, or isbn), and each value holds a list of values for the respective column. Then, iterate over the books
variable and append to each list, using the same scraping method shown in the previous code snippet.
books_dict = {'title':[], 'author':[], 'format':[], 'published':[], 'isbn':[], 'currency':[], 'price':[], 'date':[]}

for b in books:
books_dict['title'].append(b.find("meta", {"itemprop":"name"}).get("content"))
books_dict['author'].append(b.find("p", class_ = "author").find("a").text)
books_dict['format'].append(b.find("p", class_ = "format").text)
books_dict['published'].append(b.find("p", class_ = "published").text)
books_dict['isbn'].append(b.find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-isbn"))
books_dict['currency'].append(b.find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-currency"))
books_dict['price'].append(b.find("a", class_ = "btn btn-sm btn-primary add-to-basket").get("data-price"))
books_dict['date'].append(datetime.date.today())
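Note that find returns None when a tag is missing, so a book element that lacks, say, a published date would make the loop above raise an AttributeError. If you want the scraper to be more defensive, you could wrap the lookup in a small helper (safe_text here is a hypothetical name, not part of BeautifulSoup) along these lines:

```python
from bs4 import BeautifulSoup

def safe_text(parent, tag, cls):
    """Return the stripped text of the first matching tag, or None if absent."""
    el = parent.find(tag, class_=cls)
    return el.text.strip() if el else None

# A minimal element missing the "published" tag, for illustration only
html = '<div class="book-item"><p class="format">Hardback</p></div>'
book = BeautifulSoup(html, "html.parser").find("div", class_="book-item")

print(safe_text(book, "p", "format"))     # Hardback
print(safe_text(book, "p", "published"))  # None instead of an exception
```

Appending None for a missing field keeps every list in books_dict the same length, which pandas requires when building the DataFrame.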
Finally, convert the dictionary to a pandas DataFrame.
df = pd.DataFrame.from_dict(books_dict)
df.head(5)
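Every scraped value arrives as a string, so you may want to cast the price and published columns to proper types before storing them. A minimal sketch, using a hand-made two-row frame that mirrors the scraped columns (the values here are illustrative):

```python
import pandas as pd

# Hypothetical sample mimicking the scraped, all-string columns
df = pd.DataFrame({
    "title": ["The Mirror and the Light", "Another Book"],
    "price": ["271455", "199000"],
    "published": ["29 Apr 2021", "01 May 2021"],
})

# Cast strings to numeric / datetime; unparseable values become NaN/NaT
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["published"] = pd.to_datetime(df["published"], format="%d %b %Y", errors="coerce")

print(df.dtypes)
```

Typed columns map onto sensible MySQL column types (numeric and datetime) instead of plain text when the table is created.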
Pushing DataFrame to MySQL Database
Assume you have a MySQL database with a table named "book-depo-new-releases"
. You also have the username and password of your MySQL database, along with the host and database name. To push the DataFrame to MySQL, create an engine with sqlalchemy, then push the data directly using the pandas to_sql method.
engine = create_engine(f'mysql+pymysql://{USERNAME}:{PASSWORD}@{HOST}/{DATABASE}')
df.to_sql(con=engine, name='book-depo-new-releases', if_exists='append', index=False)
By using the append
option, you won’t overwrite previously written rows in the table; new rows are added instead. Setting index=False also keeps the DataFrame’s index from being written to the MySQL table.
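You can see the effect of if_exists='append' without a live MySQL server. The sketch below uses an in-memory SQLite engine purely as a stand-in for the MySQL connection (the to_sql call itself is identical):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine as a stand-in for the MySQL engine above
engine = create_engine("sqlite://")

df = pd.DataFrame({"isbn": ["9780007481002"],
                   "title": ["The Mirror and the Light"]})

# The first write creates the table; the second appends rather than overwrites
df.to_sql(con=engine, name="book-depo-new-releases", if_exists="append", index=False)
df.to_sql(con=engine, name="book-depo-new-releases", if_exists="append", index=False)

count = pd.read_sql('SELECT COUNT(*) AS n FROM "book-depo-new-releases"',
                    engine)["n"].iloc[0]
print(count)  # 2 -- one row per append
```

Run this script once a day (for example via cron) and the table accumulates a dated snapshot of new releases, which is exactly the dataset we set out to build.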