Web Scraping with MS Excel and Python: Static Site Contents.

This article introduces you to the concept of web scraping by discussing what it is, why it is done, and how to do it for static site contents, with practical examples in Excel and Python using the Requests and Beautiful Soup (bs4) libraries.

Einstein EBEREONWU
8 min read · Sep 13, 2022
Photo by Markus Spiske on Unsplash

Web Scraping

What?

The term web scraping refers to the process of extracting data of various kinds (text, images, video, etc.) from web pages, which can then be stored and used for various operations.

Why?

Web scraping is performed to extract data of various forms from web pages for analytical, record-keeping, and other purposes. When the term “web scraping” is mentioned, the first thought that comes to mind (at least for me) is usually “automated data extraction from sites”, but web scraping can also be done manually. Copying and pasting information or downloading images from websites is web scraping too; it just becomes overwhelming when you have to repeat the process 100 times or more, and that’s where automated web scraping comes in handy. Let’s now see how it can be done with amazing tools like Python and Excel, starting with Microsoft Excel.

How?

With Microsoft Excel

  • Navigate to the Data tab (1).
  • Select Get External Data (2).
  • Select From Web (3).
Image referenced in steps above.

In this article, I will be making use of this web page for data scraping.

  • In the address bar (1), enter the web address and hit Go (2). NB: several modals may come up asking whether you want to continue running scripts on the page; click Yes.
  • In some cases you may want to scrape specific content, such as tables, from particular locations on the site. Next to each table or location you will see an icon (3); if you select it and hit Import (4), only the data in the selected locations will be extracted.
  • If you do not select any specific data and hit Import (4), all the data on the web page will be imported into your Excel sheet.
  • After clicking Import, a modal will come up asking where you want to put the data; by default it starts from the first cell, but you may adjust this if you wish.
Image referenced in steps above.

After scraping and importing the data, you can format it however you wish and save it in the format of your choice. That is how to scrape data easily with Excel. In the image below, you can see a segment of the data scraped from the web page used in this article.

Data extracted from the website.

With Python

When it comes to Python, web scraping is a whole different ball game. You have to study the structure of whatever website you wish to scrape data from. You can view the site structure by right-clicking anywhere within the page and then clicking Inspect. Your browser will open a sidebar where you can view the dev tools, such as the site elements, console, sources, etc., as shown below. To identify a specific element on a website, click the element selector (2), then hover over or click the element you wish to identify, just as shown below.

Sample site showing dev tools and more.

From the picture above, you can see the dev tools (1) and the element selector (2), which, when used to select an element, exposes the element’s information such as its class, id, name, etc. in two places (4, 5). Once you have information about an element, you can search for it by that information. Now let’s see how this is done with Beautifulsoup.

Beautifulsoup (bs4) and Requests

View GitHub repo HERE.

Beautifulsoup is a Python web scraping library that helps extract data from HTML and XML files, while Requests is a Python library that lets you send HTTP requests and returns a response consisting of the site content, status, encoding, etc. With Beautifulsoup, you can scrape data from downloaded HTML/XML files or directly from the HTTP response of a request made with the Requests library.

To install the Beautifulsoup library, simply type pip install beautifulsoup4 into your terminal; to install the Requests library, type pip install requests. To scrape data from a live site, you need to import the libraries as shown below, but if the HTML/XML file is already saved locally, you do not need the Requests library: just read in the saved file, store it in a variable, and instantiate your soup with it.
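The local-file flow mentioned above might look like this (the file name and its contents are illustrative; a stand-in file is created here just so the sketch is self-contained):

```python
from pathlib import Path
from bs4 import BeautifulSoup

# Stand-in for a page you saved earlier; the file name is illustrative.
Path("page.html").write_text("<html><body><h1>Saved page</h1></body></html>", encoding="utf-8")

# Read the saved file and instantiate the soup with it -- no Requests needed.
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.h1.text)  # Saved page
```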

Image showing how to import Beautifulsoup and Requests libraries.
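In code, the two imports shown in the image amount to:

```python
# Requests fetches pages over HTTP; BeautifulSoup parses the HTML they return.
import requests
from bs4 import BeautifulSoup
```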

After importing both libraries, store the URL of the site to be scraped in a variable with a name such as URL, then make your HTTP request as shown below.

Image showing declaration of URL variable and HTTP request making.
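A minimal sketch of this step (the address below is a placeholder; substitute the page you actually want to scrape):

```python
import requests

# Placeholder address: substitute the page you actually want to scrape.
URL = "https://example.com/"

response = requests.get(URL)  # send an HTTP GET request to the site
```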

After making the request, you can extract basic information about your response, such as its status code, encoding, and content, just as shown below.

Image showing basic information about response.
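Inspecting the response might look like this (again, the URL is a placeholder standing in for the article's page):

```python
import requests

response = requests.get("https://example.com/")  # placeholder URL

print(response.status_code)  # e.g. 200 for a successful request
print(response.encoding)     # the response's character encoding
print(response.text[:80])    # the start of the raw HTML content
```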

In the image above, you can see the status code 200, which means the request was successful; the file encoding, which is UTF-8; and finally the site’s HTML content. Notice that the HTML content is disorganized and doesn’t appear in a standard format. You can change this with Beautifulsoup’s prettify method, as shown below, but NOTE: you do not need to prettify your response before you can scrape data from it.

Image showing formatted HTML content.
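A sketch of prettify in action, using a small inline snippet standing in for the downloaded page:

```python
from bs4 import BeautifulSoup

# A small inline snippet stands in for the downloaded page content.
html = "<html><body><h1>Top 50 Programming Languages to Learn in 2022</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())  # re-indents the markup, one tag per line
```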

Text Data extraction

Above, you can see that the HTML file now appears in a familiar and more orderly manner. And now for the extraction of content from the HTML file. If you recall from the first image of the site (Sample site showing dev tools and more.), the text “Top 50 Programming Languages to Learn in 2022” is contained in the header (h1) tag. To extract that text, I will search for the h1 tag and pull out the text it holds.

Image showing how to extract heading text with different styles.
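The two styles in the image can be sketched like this, with an inline snippet standing in for the real page:

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Top 50 Programming Languages to Learn in 2022</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of pulling the heading text:
print(soup.find("h1").text)  # explicit find()
print(soup.h1.get_text())    # attribute-style shortcut
```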

In the image above, you can see that the results are the same but were achieved through different methods. Sometimes your concern is with the first appearance of an element on a web page; other times it is with all appearances. Let’s take a further look at both instances: I’ll extract the first of the top programming languages, followed by all 50. To do so, I first have to find the attribute or tag name associated with the languages, as seen below.

Image showing element information for languages.

From the image above, in the element information area, you can see that the languages are within header 3 (h3) tags, so I’ll extract the text content of all h3 tags to get the languages, as shown below.

Image showing first and all top 50 languages.
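Sketching both cases with a tiny inline stand-in for the page (the language names here are illustrative, not the article's actual ranking):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the article page: each language name sits in its own h3 tag.
html = "<h3>JavaScript</h3><h3>Python</h3><h3>Go</h3>"
soup = BeautifulSoup(html, "html.parser")

first_language = soup.find("h3").text                    # first occurrence only
all_languages = [h3.text for h3 in soup.find_all("h3")]  # every occurrence

print(first_language)  # JavaScript
print(all_languages)   # ['JavaScript', 'Python', 'Go']
```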

And that’s how you scrape text data from websites using Beautifulsoup.

Link extraction

Links are usually embedded in anchor (a) tags as the href attribute. To extract links, you have to search for the anchor tags, then apply the get method, specifying the href attribute, as seen below.

Image showing how to extract all links on a web page.
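A minimal sketch, using two illustrative anchors in place of a real page:

```python
from bs4 import BeautifulSoup

# Two illustrative anchors standing in for a real page.
html = '<a href="/home">Home</a><a href="/services">Services</a>'
soup = BeautifulSoup(html, "html.parser")

# Every link lives in the href attribute of an anchor (a) tag.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # ['/home', '/services']
```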

Above is an illustration of how to extract all links on a web page; extracting specific links goes a bit further. Suppose I want to store the link to the page that shows all services offered by the Devox software organization (the organization that published the article about the top 50 languages). I first have to inspect the element to understand its structure.

Image showing structure of Services link.

From the image above, the only information provided about the ‘Services’ link is the element’s tag, which is an anchor tag. The way to go about this is to check all anchor tags and take the link (href attribute) of the anchor tag whose text content is ‘Services’, as shown below.

Image showing how to get specific link
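Sketched with stand-in anchors (the hrefs are illustrative, not the real site's paths):

```python
from bs4 import BeautifulSoup

html = '<a href="/home">Home</a><a href="/services">Services</a>'
soup = BeautifulSoup(html, "html.parser")

# Walk every anchor and keep the href of the one whose text is 'Services'.
services_link = None
for a in soup.find_all("a"):
    if a.text.strip() == "Services":
        services_link = a.get("href")

print(services_link)  # /services
```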

Some links may not be associated directly with any text written inside the anchor tag; instead, the link is stored in an anchor tag while the text is stored in a span tag inside that anchor, just as seen below.

Image showing structure of ‘the language of the year’ link.

Above you can see that the text ‘the language of the year’ is stored in a span, while the link is stored in the anchor tag around it. This is a parent-to-child relationship, where the anchor is the parent element and the span is the child. To get the link, I’ll search for the span whose text is ‘the language of the year’, then extract the href attribute from its parent element.

Image showing how to extract link from parent element.
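The parent-to-child lookup can be sketched like this (the href value is hypothetical, standing in for the real destination):

```python
from bs4 import BeautifulSoup

# The link sits on the anchor; the visible text sits in a span inside it.
html = '<a href="/tiobe-index"><span>the language of the year</span></a>'
soup = BeautifulSoup(html, "html.parser")

# Find the span by its text, then read the href off its parent anchor.
span = soup.find("span", string="the language of the year")
link = span.parent.get("href")
print(link)  # /tiobe-index
```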

And just like that, I have the link to the language-of-the-year web page. You can be extremely creative in your data scraping style: once you understand the relationship between your target data and the site structure, you can easily grab it. With the basic knowledge I’ve shared in this article, you can now scrape more interesting data in different styles for your various (good 😀) reasons. For more of the interesting things you can do with Beautifulsoup, visit its documentation here.

Make sure to leave any questions you have in the comment section; I’ll be sure to respond as soon as possible. Clap and share if you found this article helpful. In my next web scraping article, I’ll introduce you to Selenium, an amazing browser automation library that can also be used to scrape both static and dynamic site contents, along with a few other activities you can perform with it.

Thanks and see you again next time.
