How to scrape all types of websites with Python — part 1
A comprehensive guide on how I scraped 19 thousand Medium posts with Scrapy and Splash.
You have probably come across the catchphrase ‘Data is the new oil’ or the newer buzzword ‘data economy’. Either way, these phrases were coined by industry experts to capture how prominent and invaluable data has become in our economy.
Like oil, ‘data is the new oil’ conveys that data on its own is not very valuable until it is accurately and completely gathered in a well-structured form; only then does it become a powerful decision-making mechanism to propel business growth.
Now, the question is, where do we find this data? Simply put, on the Internet!
I know this answer didn’t come as a surprise, but why the internet?
Allow me to whet your appetite:
1. 18.7 billion text messages are sent every day worldwide.
2. In every 24 hours, 500 million tweets are published on Twitter.
3. Four petabytes is the estimated amount of new data being generated by Facebook daily.
4. By 2025, there will be 175 zettabytes of data in the global datasphere.
I know these figures are hard to wrap your head around, but don’t get hung up on the exact numbers :)
5. In fact, at the beginning of 2020, the number of bytes in the digital universe was 40 times bigger than the number of stars in the observable universe.
I guess this is enough because the list goes on and on.
Okay, we now know the value of data and where it can be found, but guess what the next most pressing question is...
How can we get it?
This is where the application of web scraping comes in.
Web scraping is, basically, the process of extracting information and data from a website, transforming the content of a webpage into structured data for further analysis.
This is an imperative skill for any modern-day data scientist, business intelligence analyst, or machine learning practitioner. And that is exactly what this article is all about.
Project goal: setting up the environment for the web scraping project. The actual scraping begins in the next part.
- Download and install Anaconda Navigator and Docker.
- Install Scrapy and Splash.
Downloading and installing Anaconda and Scrapy
Anaconda is a Python distribution that ships with many built-in packages and libraries to aid in data projects. Download the installer for your OS from here and simply follow the installation wizard to completion.
After the installation, follow these steps to launch Navigator and create a new virtual environment.
1. Find the search bar on your OS and type ‘anaconda navigator’; in my case, I’m using Windows. Launch Anaconda Navigator as shown below.
2. Click on ‘Environments’, then click on Create.
3. Type a name for your environment and click Create; in my case, ‘web_scraping_project’. That is all it takes to create a new virtual environment using Anaconda.
4. Next, let’s install Scrapy.
Scrapy is an open-source and collaborative framework for extracting the data you need from websites. You can read more here.
Navigate to the virtual environment we created, click on the play button, then click ‘Open terminal’ in the pop-up menu to open the project terminal.
5. Type the following command in the terminal to install Scrapy: ‘conda install scrapy’.
We are now done creating a virtual environment in Anaconda and installing Scrapy.
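If you prefer the command line, the Navigator steps above can be sketched as the following commands. This assumes conda is already on your PATH after the Anaconda installation; the environment name matches the one used above:

```shell
# Create a new virtual environment named "web_scraping_project"
conda create --name web_scraping_project python=3 --yes

# Activate the environment
conda activate web_scraping_project

# Install Scrapy into the active environment (same as step 5)
conda install scrapy --yes
```

Either route ends in the same place: an isolated environment with Scrapy installed.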
Downloading and installing Docker and Splash
Docker is an open platform for developing, shipping, and running applications. Docker provides the ability to package and run an application in a loosely isolated environment called a container.
The installation process is as follows.
1. Download the installer supported by your OS from here and follow the default installation process.
The installation can be tricky at times; feel free to reference this guide if you encounter any difficulty in the process.
2. Now let’s install Splash.
If you are a Windows user, simply open your command prompt and install the Scrapy plugin for Splash with the command
pip install scrapy-splash and then pull the Splash Docker image with this command
docker pull scrapinghub/splash
For users of other operating systems, and for reference, visit the official documentation site for more installation guidance.
3. To run the Splash container, open your Windows command line and type:
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
You should get output similar to the one below.
4. The Splash server runs on the default port 8050; you can type http://localhost:8050/ to open Splash in your browser. The browser should open as shown below.
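Beyond the browser UI, the running container can also be queried programmatically. As a quick sketch (a hypothetical helper, not part of this tutorial’s project code), the snippet below builds a request URL for Splash’s render.html endpoint, which returns the JavaScript-rendered HTML of a page:

```python
from urllib.parse import urlencode

# Default address from the `docker run -p 8050:8050 ...` command above
SPLASH_URL = "http://localhost:8050"

def splash_render_url(page_url: str, wait: float = 2.0) -> str:
    """Build a URL for Splash's render.html endpoint.

    `url` is the page Splash should load, and `wait` is how many
    seconds Splash waits for JavaScript before returning the HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

# Example: render a Medium page through the local Splash instance
print(splash_render_url("https://medium.com", wait=1.5))
# http://localhost:8050/render.html?url=https%3A%2F%2Fmedium.com&wait=1.5
```

Fetching that URL (for example with the requests library) while the container is running returns the fully rendered page, which is exactly what Scrapy will ask Splash to do in part 2.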
It has been a long ride, but we have successfully set up our environment and are ready for the next leg of the journey: web scraping.
Follow this link to part 2 of this tutorial, where the actual work begins.