How to Scrape HTTPS Sites in Python (BeautifulSoup)

Kalp Panwala · Published in Analytics Vidhya · Jun 3, 2020 · 5 min read

You might have heard of web scraping and its various applications, but just in case you want an introduction, here it is. In this article we will focus on how to scrape HTTPS sites, taking https://www.accuweather.com as our example.

What is web-scraping?

Scraping is simply the process of extracting data. When we scrape or extract data or feeds from the web (from web pages or websites), it is termed web scraping.

So web scraping, also known as web data extraction or web harvesting, is the extraction of data from the web. In short, web scraping gives developers a way to collect and analyze data from the internet.

Why Web-scraping?

Web scraping is one of the best tools for automating most of the things a human does while browsing. It is used in enterprises in a variety of ways:

  • Finance: To fetch stock opening and closing prices from the BSE (Bombay Stock Exchange) and from many other sites.
  • Price Comparison: Web scraping is used to collect data from online shopping websites and compare the prices of products.
  • Email address gathering: Many companies that use email as a marketing medium use web scraping to collect email IDs and then send bulk emails.
  • Social Media Scraping: Web scraping is used to collect data from social media websites such as Twitter to find out what’s trending.
  • Research and Development: Web scraping is used to collect large sets of data (statistics, general information, temperature, etc.) from websites, which are analyzed and used to carry out surveys or for R&D.
  • Job listings: Details regarding job openings and interviews are collected from different websites and listed in one place so that they are easily accessible to the user.

Is Web Scraping legal?

Talking about whether web scraping is legal or not, some websites allow it and some don’t. To know whether a website allows web scraping, you can look at the website’s “robots.txt” file. You can find this file by appending “/robots.txt” to the root URL of the site you want to scrape.
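
For example, here is a quick way to look at a site’s robots.txt from Python (a small sketch using the requests library, with AccuWeather’s robots.txt purely as an illustration):

import requests

# Fetch and print the site's robots.txt to see which paths are allowed or disallowed
print(requests.get('https://www.accuweather.com/robots.txt').text)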

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping, as it handles most web-crawling tasks very easily:

  1. Ease of Use (doesn’t have any curly braces “{ }” or semi-colons “;” anywhere)
  2. Huge Library Support
  3. Huge Community
  4. Dynamically-typed language (the data assigned to a variable determines its type, so you don’t declare types yourself)

How does Web Scraping work?

When you run the code for web scraping, a request is sent to the URL that you have mentioned. In response, the server sends back the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data, and extracts it.

To extract data using web scraping with Python, you need to follow these basic steps (a short sketch of them follows the list):

  1. Find the URL that you want to scrape
  2. Inspect the page
  3. Find the data you want to extract
  4. Write the code
  5. Run the code and extract the data
  6. Store the data in the required format
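
Here is a minimal sketch of these steps, using a placeholder URL and Python’s built-in ‘html.parser’; the tags and file names are only examples, not anything specific to this tutorial:

import csv
import requests
from bs4 import BeautifulSoup

# Step 1: the URL you want to scrape (placeholder)
url = 'https://example.com'

# Steps 2-3 happen in the browser: inspect the page and note which tags/classes hold the data

# Steps 4-5: write and run the code
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
headings = [h1.get_text(strip=True) for h1 in soup.findAll('h1')]

# Step 6: store the data in the required format (a CSV file here)
with open('headings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['heading'])
    for h in headings:
        writer.writerow([h])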

Now that’s enough theory; let’s jump to scraping an HTTPS site, https://www.accuweather.com. We want the weather info, and although AccuWeather provides an API, for learning purposes we will use BS4 (Beautiful Soup).

End result: we will fetch the data in this format.

Output data format (output.csv)
Screenshot from https://www.accuweather.com/en/in/surat/202441/daily-weather-forecast/202441

For web scraping, let’s first head to the AccuWeather site. I was in Surat, so I have the weather info for Surat; you can head to your own locality and get its data instead. In the screenshot from the site you can see the class “page-column-1”, our parent class, while “content-module non-ad” and “content-module non-ad bottom-forecast” hold our data in the form of <a> tags. Inside them we have three classes: “date”, “temps” and “info”.

Detailed look into “content-module non-ad” class

In the “date” class, the day details lie in the “dow” class and the date details in the “sub” class.

In the “temps” class, the high and low temperature details are in the “high” and “low” classes respectively. We also have the phrase details in the “phrase” class.

Code

You should have the BeautifulSoup library installed; if not, you can run

pip install beautifulsoup4

in your command prompt.

# Importing required libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

Always remember to include headers while scraping HTTPS sites; without them you will get an access-denied error. The headers are the same for every site, so don’t worry. Many HTTPS sites don’t allow plain web scraping, yet you can still view them in a browser because the browser uses headers to let the server know that the request comes from a web browser. By passing the same kind of headers (in particular a User-Agent) with our request, we get the same treatment from the server.

url = 'https://www.accuweather.com/en/in/surat/202441/daily-weather-forecast/202441'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}

Here ‘html5lib’ is a pure-Python library for parsing HTML (install it with pip install html5lib if you don’t already have it).

r = requests.get(url, headers=headers)
soup1 = BeautifulSoup(r.content, 'html5lib')

We will use the .findAll() and .find() functions of bs4 to locate the data elements and .get_text() to get the text of an element. If you have multiple classes to fetch from, pass them as a list, like ['content-module non-ad', 'content-module non-ad bottom-forecast'].

Here’s the code that puts these pieces together.
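
This is a minimal sketch assuming the class structure described above (“content-module non-ad” blocks containing “date”/“dow”/“sub” and “temps”/“high”/“low”/“phrase”); the DataFrame column names are my own choice rather than anything fixed by the site:

# Grab every forecast block; pass both class strings in a list, as described above
cards = soup1.findAll('div', class_=['content-module non-ad',
                                     'content-module non-ad bottom-forecast'])

rows = []
for card in cards:
    # Each forecast entry sits inside an <a> tag within the block
    for a in card.findAll('a'):
        date = a.find(class_='date')
        temps = a.find(class_='temps')
        phrase = a.find(class_='phrase')
        if date is None or temps is None:
            continue  # skip <a> tags that are not forecast entries
        rows.append({
            'Day': date.find(class_='dow').get_text(strip=True),
            'Date': date.find(class_='sub').get_text(strip=True),
            'High': temps.find(class_='high').get_text(strip=True),
            'Low': temps.find(class_='low').get_text(strip=True),
            'Phrase': phrase.get_text(strip=True) if phrase else '',
        })

df = pd.DataFrame(rows)
print(df.head())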

Here’s how to export the DataFrame to a .csv file, which you can then use as your application requires:

df.to_csv('./output.csv', encoding='utf-8', index=False)

Thanks so much for reading; if you liked the story, do give it a clap.

Connect with me on LinkedIn: https://www.linkedin.com/in/kalp-panwala-72284018a

Follow me on Twitter: https://twitter.com/PanwalaKalp

Follow me on GitHub: https://github.com/kpanwala

If you have any queries regarding the story or suggestions for improvement, mail me at kpanwala33@gmail.com.
