Published in hackerdawn

Scraping from Wikipedia using Python and Selenium

Photo by André François McKenzie on Unsplash

Web scraping can help us automate tasks that involve a lot of manual work or that have to be repeated at regular intervals, such as fetching stock prices or downloading images in bulk from a website. Python, with the help of Selenium, lets us do this easily!

In this story, we’ll write a script to scrape the first paragraph of information about any keyword from Wikipedia. We’ll then run this script for the keyword ‘Bitcoin’ and fetch the information!

Selenium

Selenium is an open-source tool for automating web browsers. Using Selenium, we can fire up a web browser and make it do anything that we could otherwise have done manually. To install Selenium, just run the command:

pip install selenium

ChromeDriver

Before we move forward, one last thing: downloading ChromeDriver. ChromeDriver is a tool that provides capabilities for navigating to web pages, user input, JavaScript execution, and more. To download ChromeDriver, navigate to https://chromedriver.chromium.org and download the stable release for your OS that matches your installed version of Chrome.

Writing The Scraping Script

Let us first import webdriver and Keys from Selenium along with the re package.

We’ll initialize the search keyword as ‘Bitcoin’, since we want information about Bitcoin from Wikipedia. We’ll also specify a window size for the Chrome window that ChromeDriver opens. Note that we have commented out a line! Uncomment it if you don't want to see the Chrome window that does the scraping for us (that is, to run Chrome in headless mode).
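As a rough sketch of this setup step, something like the following could work. The window size of 1920x1080 and the helper names `chrome_arguments` and `make_driver` are assumptions made for illustration and are not taken from the original script. The headless switch is optional, mirroring the commented-out line mentioned above.

```python
KEYWORD = "Bitcoin"  # the term we want Wikipedia's introduction for


def chrome_arguments(headless=False, width=1920, height=1080):
    """Return the command-line switches to pass to Chrome.

    The default size is an assumed value, not the article's.
    """
    args = [f"--window-size={width},{height}"]
    if headless:
        # Headless mode runs Chrome without opening a visible window.
        args.append("--headless")
    return args


def make_driver(headless=False):
    """Start Chrome with the switches above (needs Selenium and Chrome)."""
    # Imported here so chrome_arguments() can be tried without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `make_driver(headless=True)` would correspond to uncommenting the headless line.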

We want to make the script search the keyword + ‘wikipedia’ on Google and then navigate to the top search result (which will be the Wikipedia page for the keyword).

Let’s inspect Google’s search bar.

Notice name="q" inside the input element. We’ll use this name to enter text into the search bar and execute the search.
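A minimal sketch of this search step might look like the following. The function names `build_query` and `search_google` are hypothetical, and `driver` is assumed to be any Selenium WebDriver instance that is already running. Google may show a consent or verification page before the search bar, which this sketch does not handle.

```python
def build_query(keyword):
    """Return the text to type into Google for a given keyword."""
    return f"{keyword} wikipedia"


def search_google(driver, keyword):
    """Type the query into Google's search bar and press Enter."""
    # Imported here so build_query() can be tried without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver.get("https://www.google.com")
    # Locate the search bar through the name="q" attribute seen above.
    search_box = driver.find_element(By.NAME, "q")
    search_box.send_keys(build_query(keyword))
    search_box.send_keys(Keys.RETURN)
```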

Now, let’s inspect the search results so that we can get the link to the top result using its class name.

We repeat the same process with the Wikipedia article so that we can retrieve only the introductory paragraph from it.

Note that here we are using XPath to fetch the first paragraph of the page. XPath is just one more interesting way of locating an item on a page.

Now that we know where the items we want are located, we can extend the script to fire up ChromeDriver, watch it navigate all the way to our Wikipedia page, and fetch the keyword’s information for us.

We use re to remove the bracketed reference markers from the retrieved text, which improves readability. Along with that, there is some other routine text cleaning of the kind you will need, in some form, in every web scraping project.
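The cleaning step could be sketched as below. The function name `clean_text` and the exact patterns are assumptions; the original script may clean the text differently (for example, this sketch also collapses repeated whitespace).

```python
import re


def clean_text(raw):
    """Remove bracketed reference markers and tidy up whitespace."""
    # Drop bracketed markers such as [1], [12] or [a].
    text = re.sub(r"\[[^\]]*\]", "", raw)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For instance, `clean_text("Bitcoin[a] is a cryptocurrency.[7]")` returns `"Bitcoin is a cryptocurrency."`.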

You might have observed the if condition here. While inspecting different Wikipedia pages, it was found that the first paragraph in Wikipedia is generally present in one of the two locations on a page. So, we have included the XPaths to both of them here.
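One way to express this "try one location, then the other" idea is sketched below. The two XPaths are illustrative guesses and are not the ones used in the original script, and the names `CANDIDATE_XPATHS`, `first_nonempty`, and `first_paragraph` are hypothetical. The sketch uses a loop over candidates rather than a single if condition, so more locations can be added later.

```python
# Illustrative XPaths only; the article's own two XPaths are not shown here.
CANDIDATE_XPATHS = [
    "//div[@id='mw-content-text']//p[1]",
    "//div[@id='mw-content-text']//p[2]",
]


def first_nonempty(texts):
    """Return the first non-blank string, stripped, or '' if there is none."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return ""


def first_paragraph(driver, xpaths=CANDIDATE_XPATHS):
    """Return the text of the first non-empty paragraph found via xpaths."""
    # Imported here so first_nonempty() can be tried without Selenium.
    from selenium.webdriver.common.by import By

    def candidate_texts():
        for xpath in xpaths:
            # find_elements returns an empty list when nothing matches.
            for element in driver.find_elements(By.XPATH, xpath):
                yield element.text

    return first_nonempty(candidate_texts())
```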

Cool. Now, it's time to run the Python script we just wrote.

Here’s the output that we got after it completed running:

Bitcoin  is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software.:ch. 1 Bitcoin is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries. Transactions are verified by network nodes through cryptography and recorded in a public distributed ledger called a blockchain.

You can try this with different keywords and see the magic happen!

Here’s the entire code for the web scraper:

A Last Important Point to Remember

One point that anyone creating a web scraper should keep in mind is that the structure, HTML, and CSS of a web page can change over time. A scraper that you wrote earlier can break when this happens. So, you may have to make minor tweaks to your scraper from time to time!

hackerdawn is a place where you find stories that help you build the stuff you’ve always wanted to build. At hackerdawn, we always try to keep things simple and avoid adding complexity where it is not required.

Sidharth Pandita
