The need to extract data from websites has become increasingly clear, especially with the boom in data analytics. We often find ourselves needing data, for example to build and train a machine learning model.
There are several ways to extract information from the web. One of them is using an API, which is arguably the best option, since it gives you access to the data in a structured form. Unfortunately, not every website can employ people with the technical know-how to build an API for their data, and some simply don’t see the need for one. In cases like these, we get our hands dirty.
To follow the steps in this tutorial, you’ll need to install some software. Below is the list, along with how to set up your environment.
Note:
The requests library allows us to easily make HTTP requests.
BeautifulSoup will make scraping much easier for us.
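Both libraries can be installed with pip. Note that the PyPI package for BeautifulSoup is named beautifulsoup4, even though you import it as bs4:

```shell
# Install both libraries into your active environment
pip install requests beautifulsoup4
```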
Web scraping is a technique for extracting information from websites. It mostly focuses on transforming unstructured HTML data into structured data (databases or spreadsheets).
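As a minimal sketch of how the two libraries fit together: requests fetches the raw HTML, and BeautifulSoup turns it into structured data. The HTML below is an inline placeholder (not from any real site) so the example runs without a network connection; in practice you would fetch a page with requests first.

```python
import requests  # used for the real fetch, shown commented below
from bs4 import BeautifulSoup

# In a real scrape you would fetch a page first, e.g.:
#   html = requests.get("https://example.com").text
# Here we parse a small inline document so the example is self-contained.
html = """
<html><body>
  <h1>Quotes</h1>
  <ul>
    <li class="quote">Simple is better than complex.</li>
    <li class="quote">Readability counts.</li>
  </ul>
</body></html>
"""

# Parse the unstructured HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every <li class="quote"> into a plain Python list
quotes = [li.get_text(strip=True) for li in soup.find_all("li", class_="quote")]
print(quotes)
```

This is the core loop of every scraper in this tutorial: fetch, parse, select, and collect into a structure you can save to a database or spreadsheet.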
Open VS Code and navigate until you see something similar to the image below. As you can see, I’m using Git Bash in my terminal; you can use PowerShell too, it’s a matter of choice.
To set up your virtual environment, navigate to your project folder in the terminal and type ‘python -m virtualenv env’. Once this completes, you will notice it creates a folder called env in the root of your project. This will contain the packages that Python needs to work.
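In Git Bash that looks like the following (the activation path differs by shell and OS; the paths shown are the usual defaults, not something specific to this project):

```shell
# Create the virtual environment in the project root
python -m virtualenv env

# Activate it (Git Bash on Windows):
source env/Scripts/activate
# PowerShell:     env\Scripts\Activate.ps1
# macOS/Linux:    source env/bin/activate
```

Once activated, your prompt usually shows (env), and pip will install packages into this folder instead of your global Python.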
Select your Python Interpreter
To select your Python interpreter, press ‘Ctrl + Shift + P’ to open the command palette, then search for and select ‘Python: Select Interpreter’ as shown below, and pick ‘Python 3.7.* 64bit (env: virtualenv)’.
If you’ve successfully made it through to this stage, congrats! You’re only a few steps away from knowing web scraping.