This article covers web scraping with Python using the Beautiful Soup 4 library. It walks through the simple steps required to scrape data from a webpage, and we’ll write sample code to extract data from a website.

Let’s take a look at the required Python libraries:

1. The requests library to make network requests

To scrape data from a website, we first need to fetch the content of the webpage. Once the request to the website completes, the entire content of the page is available as plain text (the raw HTML), and we can then parse that text to extract data from it.

2. The html5lib library for parsing HTML

Once the content is available, we need to choose the library that will parse that text. We’ll be using the html5lib library to parse the text content into an HTML DOM-based representation.

3. The beautifulsoup4 library for navigating the HTML tree structure

beautifulsoup4 takes the raw text content and the name of the parsing library as input parameters. In our example, we specify html5lib as the parsing library. The resulting object can then be used to navigate and search the parsed HTML nodes and to pull data out of the nodes we need.
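To make the roles of the parser and Beautiful Soup concrete, here is a minimal sketch of parsing and searching a page. The HTML string, the class name article, and the variable names are made up for illustration and do not come from any real site; the sketch assumes both beautifulsoup4 and html5lib are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; the <li> tags are
# deliberately left unclosed to show that html5lib repairs them.
raw_html = """
<html><body>
  <h1>Articles</h1>
  <ul>
    <li class="article"><a href="/closures">Closures</a>
    <li class="article"><a href="/promises">Promises</a>
  </ul>
</body></html>
"""

# Pass the raw text and the parser name. "html5lib" selects the
# html5lib parser, which must be installed separately.
soup = BeautifulSoup(raw_html, "html5lib")

# Navigate the tree: first <h1>, then every <li class="article">.
heading = soup.find("h1").get_text(strip=True)
items = soup.find_all("li", class_="article")

# Pull data out of the matched nodes.
titles = [li.a.get_text(strip=True) for li in items]
links = [li.a["href"] for li in items]

print(heading, titles, links)
```

Because the parser is passed by name, swapping in a different one later only changes the second argument to BeautifulSoup.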

Making the Request for the Web Content

Let's make the web request for the website to be scraped using the requests library. It is a third-party library, so we first need to install it with the following command:

  • pip install requests

We will be scraping the website www.learn-javascript.in to see how many articles are available. Let’s first make a request to fetch the content of that website. requests.get makes a request to the webpage and returns a response object whose text attribute holds the raw HTML content.
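The request step might look something like the sketch below. The helper name fetch_html, the 10-second timeout, and the https:// scheme in the usage note are assumptions made for illustration; the article itself only names the site and the call that fetches it.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML of `url` as text (hypothetical helper)."""
    # requests.get sends an HTTP GET and returns a Response object.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of
    # silently scraping an error page.
    response.raise_for_status()
    # .text is the response body decoded to a string.
    return response.text
```

Calling fetch_html("https://www.learn-javascript.in") should then return the page's HTML as a string, assuming the site is reachable over HTTPS. That string is what gets handed to Beautiful Soup for parsing.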



