Scraping from Wikipedia using Python and Selenium
Web scraping helps us automate tasks that would otherwise take a lot of manual work, or that have to be repeated at regular intervals, such as fetching stock prices or downloading images in bulk from a website. With Python and Selenium, this becomes easy!
In this story, we’ll write a script to scrape the first paragraph of information about any keyword from Wikipedia. We’ll then run this script for the keyword ‘Bitcoin’ and fetch the information!
Selenium is an open-source browser automation tool. With it, we can launch a web browser and make it do anything we could otherwise have done by hand. To install Selenium, just run the command:
pip install selenium
Writing The Scraping Script
Let us first import webdriver and Keys from Selenium, along with Python’s built-in re module.
We’ll initialize the search keyword as ‘Bitcoin’, since we want information about Bitcoin from Wikipedia. We’ll also specify a window size for the browser that chromedriver launches. Note that we have commented out a line! Uncomment it if you don’t want to see the Chrome window that does the scraping for us (that is, to run Chrome in headless mode).
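A minimal sketch of this setup might look like the following. The names SEARCH_KEYWORD, WINDOW_SIZE, chrome_arguments and build_driver, the 1920x1080 size, and the headless parameter (standing in for the commented-out line) are assumptions made for illustration, not the original script. It also assumes a recent Selenium release and a chromedriver that Selenium can locate.

```python
SEARCH_KEYWORD = "Bitcoin"
WINDOW_SIZE = (1920, 1080)  # assumed width and height in pixels


def chrome_arguments(window_size=WINDOW_SIZE, headless=False):
    """Return the command-line flags to hand to Chrome."""
    width, height = window_size
    args = [f"--window-size={width},{height}"]
    if headless:
        # Run Chrome without a visible window.
        args.append("--headless")
    return args


def build_driver(headless=False):
    """Start Chrome through chromedriver with the flags above."""
    # Imported here so chrome_arguments() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling build_driver(headless=True) would then play the role of uncommenting the headless line.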
We want to make the script search the keyword + ‘wikipedia’ on Google and then navigate to the top search result (which will be the Wikipedia page for the keyword).
Let’s inspect Google’s search bar.
Notice the name="q" attribute on the input element. We’ll use this name to locate the search bar, enter text into it, and execute the search.
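Typing into the search bar and pressing Enter can be sketched roughly as below. build_query and search_google are hypothetical helper names, and starting from the Google home page URL is an assumption; the driver argument is expected to be an already running Selenium WebDriver.

```python
def build_query(keyword):
    """Combine the keyword with 'wikipedia' to form the search text."""
    return f"{keyword} wikipedia"


def search_google(driver, keyword):
    """Type the query into Google's search bar and submit it."""
    # Imported here so build_query() can be tried without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver.get("https://www.google.com")  # assumed starting page
    # The search bar is the input element carrying name="q".
    search_bar = driver.find_element(By.NAME, "q")
    search_bar.send_keys(build_query(keyword))
    # Pressing Enter runs the search.
    search_bar.send_keys(Keys.RETURN)
```

For our keyword, build_query("Bitcoin") gives the text "Bitcoin wikipedia".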
Now, let’s inspect the search results so that we can get the link to the top result using its class name.
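Following the top result could be sketched as below. The class name is left as a parameter (result_class) because it has to be read from the inspector and Google changes it from time to time. The helper names are hypothetical, and the sketch assumes the element with that class contains the result's link. The is_wikipedia_url check is an extra safeguard added here, not part of the original script.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """Return True if the URL points at wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def open_top_result(driver, result_class):
    """Open the first search result and return its URL."""
    # Imported here so is_wikipedia_url() can be tried without Selenium installed.
    from selenium.webdriver.common.by import By

    # find_element returns the first element on the page with this class.
    top_result = driver.find_element(By.CLASS_NAME, result_class)
    link = top_result.find_element(By.TAG_NAME, "a").get_attribute("href")
    if not is_wikipedia_url(link):
        raise ValueError(f"Top result is not a Wikipedia page: {link}")
    driver.get(link)
    return link
```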
We repeat the same process with the Wikipedia article so that we can retrieve only the introductory paragraph from it.
Note that here we are using an XPath to fetch the first paragraph of the page. XPath is just one more interesting way of locating an item on a page.
Now that we know where the items we want are located, we can extend the script to fire up chromedriver, watch it navigate all the way to our Wikipedia page, and fetch the keyword’s information for us.
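To get a feel for XPath without opening a browser, here is a small sketch using Python's standard-library ElementTree, which understands a limited subset of XPath. The page snippet and the content id are made up for illustration; in the scraper itself, the XPath would be handed to Selenium instead.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page with two paragraphs inside a content block.
PAGE = """
<html>
  <body>
    <div id="content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)
# Read as: any div whose id is "content", then its first p child.
first = root.find(".//div[@id='content']/p[1]")
print(first.text)  # prints: First paragraph.
```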
To improve readability, we use re to remove the bracketed reference markers from the retrieved text. Along with that, there is some routine text cleaning of the kind you will need, in some form, in every web scraping project.
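As a rough sketch of this cleaning step, assuming the references appear as square-bracketed markers such as [1] or [note 2], something like the following would do. The function name clean_text and the exact patterns are illustrative choices, not necessarily the ones used in the original script.

```python
import re


def clean_text(raw):
    """Strip bracketed reference markers and tidy up whitespace."""
    # Remove anything in square brackets, e.g. [1] or [note 2].
    text = re.sub(r"\[[^\]]*\]", "", raw)
    # Collapse runs of spaces and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("Bitcoin[1] is a  cryptocurrency.[note 2]\n"))
# prints: Bitcoin is a cryptocurrency.
```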
You might have observed the if condition here. While inspecting different Wikipedia pages, we found that the first paragraph is generally present in one of two locations on a page. So, we have included the XPaths to both of them here.
Cool. Now it’s time to run the Python script we just wrote.
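The idea of trying one location and falling back to the other can be sketched as below, written as a loop over candidates rather than an if condition. The two XPaths are placeholders chosen for illustration, not the ones from the original script, and first_paragraph and selenium_text_finder are hypothetical names.

```python
# Illustrative XPaths only: replace them with the locations the inspector shows you.
CANDIDATE_XPATHS = (
    "//*[@id='mw-content-text']/div/p[1]",
    "//*[@id='mw-content-text']/div/p[2]",
)


def first_paragraph(find_text, xpaths=CANDIDATE_XPATHS):
    """Return the first non-empty text found at any of the XPaths, else ''."""
    for xpath in xpaths:
        text = find_text(xpath).strip()
        if text:
            return text
    return ""


def selenium_text_finder(driver):
    """Build a find_text function backed by a Selenium WebDriver."""
    # Imported here so first_paragraph() can be tried without Selenium installed.
    from selenium.webdriver.common.by import By

    def find_text(xpath):
        # find_elements returns an empty list when nothing matches.
        elements = driver.find_elements(By.XPATH, xpath)
        return elements[0].text if elements else ""

    return find_text
```

With a running driver on the Wikipedia page, first_paragraph(selenium_text_finder(driver)) would return whichever location has text first.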
Here’s the output that we got after it completed running:
Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software.:ch. 1 Bitcoin is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries. Transactions are verified by network nodes through cryptography and recorded in a public distributed ledger called a blockchain.
You can try this with different keywords and see the magic happen!
Here’s the entire code for the web scraper:
A Last Important Point to Remember
One point that anyone creating a web scraper should keep in mind is that the structure, HTML, and CSS of a web page can change over time. When that happens, a scraper you wrote earlier can break. So, expect to make minor tweaks to your scraper every now and then!