The need to extract data from websites has become increasingly clear, especially with the boom in data analytics. We often find ourselves needing data, for example to build and train a machine learning model.
There are several ways to extract information from the web. One of them is using an API, which is arguably the best option, since it gives you access to the data in a structured form. Unfortunately, not every website can employ people with the technical know-how to build an API for their data, and some simply don’t see the need for one. In cases like these, we get our hands dirty.
To follow the steps in this tutorial, you’ll need to install some software. Below is the list, along with how to set up your environment.
Note:
The requests library allows us to easily make HTTP requests.
BeautifulSoup will make scraping much easier for us.
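Both libraries can be installed with pip. Note that the PyPI package for BeautifulSoup is named beautifulsoup4, even though you import it as bs4:

```shell
# Install both libraries into your active environment
pip install requests beautifulsoup4
```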
Web scraping is a technique for extracting information from websites. It mostly focuses on transforming unstructured HTML data into structured data (databases or spreadsheets).
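As a minimal sketch of how the two libraries fit together: requests fetches the raw HTML, and BeautifulSoup turns it into structured data. The HTML below is an inline placeholder (not from any real site) so the example runs without a network connection; in practice you would fetch a page with requests first.

```python
import requests  # used for the real fetch, shown commented below
from bs4 import BeautifulSoup

# In a real scrape you would fetch a page first, e.g.:
#   html = requests.get("https://example.com").text
# Here we parse a small inline document so the example is self-contained.
html = """
<html><body>
  <h1>Quotes</h1>
  <ul>
    <li class="quote">Simple is better than complex.</li>
    <li class="quote">Readability counts.</li>
  </ul>
</body></html>
"""

# Parse the unstructured HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every <li class="quote"> into a plain Python list
quotes = [li.get_text(strip=True) for li in soup.find_all("li", class_="quote")]
print(quotes)
```

This is the core loop of every scraper in this tutorial: fetch, parse, select, and collect into a structure you can save to a database or spreadsheet.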
Open VS Code and navigate until you see something similar to the image below. As you can see, I’m using Git Bash in my terminal; you can use PowerShell too, it’s a matter of choice.
To set up your virtual environment, navigate to your project folder in the terminal and type ‘python -m virtualenv env’. Once this completes, you will notice it creates a folder called env in the root of your project. This will contain the packages that Python needs to work.
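In Git Bash that looks like the following (the activation path differs by shell and OS; the paths shown are the usual defaults, not something specific to this project):

```shell
# Create the virtual environment in the project root
python -m virtualenv env

# Activate it (Git Bash on Windows):
source env/Scripts/activate
# PowerShell:     env\Scripts\Activate.ps1
# macOS/Linux:    source env/bin/activate
```

Once activated, your prompt usually shows (env), and pip will install packages into this folder instead of your global Python.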
Select your Python Interpreter
To select your Python interpreter, press ‘Ctrl + Shift + P’ to open the command palette, then search for and select ‘Python: Select Interpreter’ as shown below, and pick ‘Python 3.7.* 64bit (env: virtualenv)’.
If you’ve successfully made it through to this stage, congrats! You’re only a few steps away from knowing web scraping.