Python app that scrapes data and sends email

Atm Ahad
Published in Big0one
5 min read · Jun 19, 2020

The huge amount of data on the internet is a great resource for any field of research or for personal use, and web scraping is one of the best ways of harvesting it. In this article, you will learn briefly what web scraping is and build your very first web scraping app using Python. We will scrape the top 10 best-selling children’s series books from ‘The New York Times’ and send an email from our app containing the list. Here is the link: https://www.nytimes.com/books/best-sellers/series-books/ . You can also find the project in the git repo.


What is web scraping?

Web scraping, also called web data extraction, is the process of extracting or scraping data from websites. The words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

Why web scraping?

Web scraping can help you extract any kind of data you want. You can then retrieve, analyze, and use that data however you like. Web scraping simplifies the process of extracting data, speeds it up through automation, and gives you easy access to the scraped data. Before starting, let’s check the versions of Python and pip (pip is the package manager for Python packages) with the following commands:

python --version
pip3 --version
checking the Python and pip versions

Let’s start-

We will start with a simple Python project containing a single file named scraper.py

python project containing scraper.py file

First, you’ll want to get the site’s HTML code into your Python script so that you can interact with it. For this task, you’ll use Python’s requests library. Type the following in your terminal to install it:

pip3 install requests

Then type the following in the scraper.py file to retrieve the HTML:

import requests
URL = 'https://www.nytimes.com/books/best-sellers/series-books/'
page = requests.get(URL)

This code performs an HTTP request to the given URL. It retrieves the HTML that the server sends back and stores it in a Python object.
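If you want to see requests.get in action before pointing it at the live site, you can serve a tiny page locally with the standard library and fetch it. This is just an illustrative sketch; the HTML content and handler below are made up for the demo:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# A tiny local page so we can try requests.get without touching the live site.
HTML = b"<html><body><h3 class='title'>Harry Potter</h3></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(HTML)

    def log_message(self, *args):
        pass  # keep the console quiet

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

page = requests.get(f'http://127.0.0.1:{server.server_port}/')
print(page.status_code)   # 200 on success
print(page.content)       # the raw HTML bytes the server sent back
server.shutdown()
```

The same `page.status_code` and `page.content` attributes are what you will work with when fetching the real best-seller page.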

Now it’s time for another library, Beautiful Soup, for parsing structured data. Beautiful Soup is a Python library that lets you interact with HTML in a similar way to how you would inspect a web page using developer tools. Install it with:

pip3 install beautifulsoup4

Then, import the library and create a Beautiful Soup object (we will do this inside a function to keep the later steps separate). So this is what we have so far:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.nytimes.com/books/best-sellers/series-books/'

def check_book_list():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup)

check_book_list()

You can uncomment the print(soup) line to see what it prints.
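To get a feel for what a BeautifulSoup object gives you, here is a tiny standalone sketch on an inline HTML string (the titles and class name are made up for illustration):

```python
from bs4 import BeautifulSoup

# A small inline document standing in for a downloaded page.
html = """
<html><body>
  <h3 class="title">Dog Man</h3>
  <h3 class="title">Wings of Fire</h3>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# soup behaves like a navigable tree of the document:
print(soup.h3)        # the first <h3> tag, markup included
print(soup.h3.text)   # just its text: Dog Man
```

The same navigation works on the soup built from page.content in our scraper.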

Now go to the site mentioned above and inspect it using the developer tools. What we are looking for is the div that contains the book details:

the div that contains the book details

Add this line:

books_area = soup.find_all('div', {'class': 'css-xe4cfy'})

Now let’s complete the function by adding a few more lines. Here is everything we have so far:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.nytimes.com/books/best-sellers/series-books/'
book_list = []

def check_book_list():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup)
    books_area = soup.find_all('div', {'class': 'css-xe4cfy'})
    for index in range(0, len(books_area)):
        if len(book_list) > 9:
            break
        title = books_area[index].select('h3.css-5pe77f')[0].text
        book_list.append(title)
    print(book_list)
    send_mail()

check_book_list()

We declare a global variable book_list outside the function. books_area is a list of all div tags having the class ‘css-xe4cfy’. The following for loop iterates over each div in the books_area list. Remember, we want to scrape only the first 10 books, so we add an if condition that breaks out of the loop once the list already holds 10 titles. After that, we select the book title, which sits inside an h3 tag with the class name ‘css-5pe77f’ (as you may have noticed in the HTML), and append it to our global list.

h3 tag that holds the book title
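The find_all/select pattern above can be tried on a small inline snippet that mimics the best-seller markup. The class names and titles below are placeholders, not the real ones on the NYT page:

```python
from bs4 import BeautifulSoup

# Inline HTML mimicking the best-seller layout (placeholder class names).
html = """
<div class="book"><h3 class="book-title">Dog Man</h3></div>
<div class="book"><h3 class="book-title">Wings of Fire</h3></div>
<div class="book"><h3 class="book-title">The Bad Guys</h3></div>
"""

soup = BeautifulSoup(html, 'html.parser')
books_area = soup.find_all('div', {'class': 'book'})  # every matching div

book_list = []
for index in range(0, len(books_area)):
    if len(book_list) > 1:   # stop after 2 titles here, like the top-10 cap
        break
    # select() takes a CSS selector: tag name, dot, class name
    title = books_area[index].select('h3.book-title')[0].text
    book_list.append(title)

print(book_list)  # ['Dog Man', 'Wings of Fire']
```

Swapping in the real URL and class names gives exactly the scraper above.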

Now it’s time to create the send_mail function. For sending email, import another library:

import smtplib

And then complete the function-

def send_mail():
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login('stayhome@gmail.com', 'your password here')
    all_books = ''
    increment = 1
    for book in book_list:
        all_books = all_books + f"{increment}. " + book + "\n"
        increment += 1
    subject = 'Bestseller children\'s books !'
    body = f'Top 10 Bestseller children\'s series on \'The New York Times\'---> {URL}\n' + all_books
    message = f"Subject: {subject}\n\n{body}"
    server.sendmail(
        'stayhome@gmail.com',
        'ahad@gmail.com',
        message
    )
    print('Hey, email has been sent!')
    server.quit()

Here we use Python’s smtplib library to send the email; you can learn more about it in the official documentation. send_mail consists of a few lines of code that configure the mail server and compose the mail body. Don’t worry, it’s really simple. And that’s it. Let’s run the program and check your email.
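One detail worth knowing: the function above builds the raw message string by hand, and the header line must be exactly "Subject:" followed by a blank line. The standard library’s email.message.EmailMessage class builds a correctly formatted message for you. Here is a sketch with made-up addresses and sample titles; with smtplib you would hand it to server.send_message(msg) instead of server.sendmail(...):

```python
from email.message import EmailMessage

# Sample data standing in for the scraped titles.
book_list = ['Dog Man', 'Wings of Fire']

msg = EmailMessage()
msg['Subject'] = "Bestseller children's books!"
msg['From'] = 'stayhome@gmail.com'   # placeholder sender
msg['To'] = 'ahad@gmail.com'         # placeholder recipient

# Number the titles, one per line.
all_books = ''.join(f"{i}. {book}\n" for i, book in enumerate(book_list, start=1))
msg.set_content(
    "Top 10 bestseller children's series ---> "
    "https://www.nytimes.com/books/best-sellers/series-books/\n" + all_books
)

print(msg['Subject'])
```

EmailMessage also handles headers and encoding edge cases (long subjects, non-ASCII text) that a hand-built f-string would get wrong.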

email that has been received from the application
