Web Scraping IGN Article Using Python BeautifulSoup4

Nugroho
7 min read · Feb 22, 2024


How to scrape articles with Python and BeautifulSoup4

In this digital age, accessing large amounts of information from the web has become a routine part of our daily lives.

Whether it’s news articles, product reviews, or tutorials.

However, collecting this data manually can take time and effort, especially when dealing with large amounts of information.

This is where web scraping comes in.

Web scraping is the process of automatically extracting information from websites, allowing us to collect data quickly and efficiently.

In this article, I will show you how to utilize the power of web scraping to extract information from IGN articles using BeautifulSoup, a popular Python library for web scraping.

By the way, this is my first article on Medium. So, I would appreciate it if you follow and subscribe to get the latest article from me. Okay, let’s get started!

Prepare Packages and Tools

Before performing the web scraping process, it is imperative to set up the necessary Python libraries to facilitate data extraction and manipulation.

To start, we need to install the BeautifulSoup and Requests packages, along with Pandas to organize and present the scraped data.

To install these packages, I used Jupyter Notebook. Simply add a cell in your Notebook, then enter the following commands one by one.

# Install one by one
!pip install beautifulsoup4
!pip install requests
!pip install pandas

To install all three packages at once, use the command below:

# Install all three packages at once
!pip install beautifulsoup4 requests pandas

Okay, all packages are installed. Now, let’s start scraping the articles.

Start Scraping Articles

The IGN article that I use for web scraping is about “The Top 100 Video Games of All Time”.

The plan is to extract information such as Rank, Game Name, and also a Short Description of the game.

To do that, first, we import the packages that have been installed.

# Import Packages
from bs4 import BeautifulSoup
import pandas as pd
import requests

After that, we must define HTTP headers so that our request is not blocked by the destination web page (it could otherwise be mistaken for a bot).

We must also specify the destination URL from which we will retrieve the information: in this case, the IGN article page.

# Define HTTP Headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
"Accept-Encoding": "gzip, deflate",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"DNT": "1",
"Connection": "close",
"Upgrade-Insecure-Requests": "1"
}

# Destination URL
base_url = 'https://sea.ign.com/portal-2/180409/feature/the-top-100-video-games-of-all-time'

Alright, I’ll explain each of the headers above.

  • User-Agent: This is like your ID card when you visit a website. It tells the website what kind of device and browser you are using. In our case, it says that we are using Firefox on a Windows computer.
  • Accept-Encoding: This tells the website what type of compression algorithm our browser can handle.
  • Accept: This section tells the website what type of data format we want. This means, “I prefer HTML, but I can also handle XML if that’s what you have.”
  • DNT (Do Not Track): This is like saying, “Please don’t follow me on the internet and collect data about my browsing habits.”
  • Connection: This is simply saying that once we are done browsing a web page, we will close the connection, like saying goodbye after visiting a friend’s house.
  • Upgrade-Insecure-Requests: This header tells the website that we would prefer to receive its content over HTTPS (an encrypted connection) whenever possible.

So basically, this code is just a set of instructions to be a good guest when visiting a website!
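As an aside, if you plan to send several requests with the same headers, a requests.Session can carry them for you so you don’t repeat the dictionary on every call. A minimal sketch (the header values mirror the ones above; this runs without touching any website):

```python
import requests

# A Session reuses connections and applies default headers to every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "DNT": "1",
})

# Every session.get(...) call now sends these headers automatically
print(session.headers["DNT"])  # -> 1
```

After this setup, `session.get(base_url)` behaves like `requests.get(base_url, headers=headers)`.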

Next, we start sending GET requests to the IGN article page URL.

The goal is to obtain the page’s HTML document and its elements.

# Send a GET request to the IGN article page
req = requests.get(base_url, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')

Here’s what each part of the code does:

  • req = requests.get(base_url, headers=headers): This line of code sends a GET request to the URL specified by base_url. The headers parameter contains the headers information described earlier.
  • soup = BeautifulSoup(req.text, 'html.parser'): After sending the request and receiving the HTML content of the page in response (req.text), we create a BeautifulSoup object named soup to parse the HTML document.

You can display the parsed results of an HTML document with print(soup.prettify())

[Image: Results of parsing the HTML document with BeautifulSoup]

After successfully parsing the HTML document, the next step is to search for the article tag that contains the name of the game as well as a short description of the game.

I will refer to this short description as the “summary”.

[Image: The article section with the class name "article-section"]
# Find the article section containing the game names and summaries
article_section = soup.find('article', class_='article-section')

As you can see, there is an article tag with class="article-section article-page". In this case, we just need to use one of the two classes, namely article-section. Of course, we can also use both if necessary.
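To see why a single class name is enough, here is a small standalone sketch on made-up HTML (not the live IGN page): BeautifulSoup’s class_ filter matches an element if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the structure: the element carries two classes
html = '<article class="article-section article-page"><h2>100. Example Game</h2></article>'
soup = BeautifulSoup(html, 'html.parser')

# Matching on just one of the element's classes still finds it
section = soup.find('article', class_='article-section')
print(section is not None)  # -> True
print(section['class'])     # -> ['article-section', 'article-page']
```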

Next, look for the tags that display the game names, which are marked with an h2 tag (a level 2 heading). After that, we create an empty list to hold all the data.

[Image: The h2 (Heading 2) tags containing the game names]
# Find all h2 tags within the article section
h2_tags = article_section.find_all('h2')

# Initialize an empty list to store the data
data = []

Next, we loop through all h2 tags to extract the game names and summaries from paragraph tags or p.

# Loop through each h2 tag to extract the game name and summary
for h2 in h2_tags:
    game_name = h2.text.split('. ', 1)[1].strip()  # Remove the number prefix (split only on the first '. ')
    summary = h2.find_next_sibling('p').text.strip() if h2.find_next_sibling('p') else None
    data.append({'Game Name': game_name, 'Summary': summary})

Here is an explanation of the code above:

  • for h2 in h2_tags:: This line starts a loop that iterates over each h2 tag stored in the h2_tags variable.
  • game_name = h2.text.split('. ')[1].strip(): This line extracts the text content of the h2 tag (h2.text), splits it by the period and space character ('. '), and selects the second part of the split (index 1) using [1]. This removes the number prefix (like "100.", "99.", etc.) from the game name. The strip() method removes any leading or trailing whitespace.
  • summary = h2.find_next_sibling('p').text.strip() if h2.find_next_sibling('p') else None: This line finds the next sibling paragraph tag (p) after the current h2 tag (h2.find_next_sibling('p')). If such a paragraph tag exists, it extracts its text content using .text.strip(). If no paragraph tag is found, it assigns None to the summary.
  • data.append({'Game Name': game_name, 'Summary': summary}): This line appends a dictionary containing the extracted game name and summary to the data list. Each dictionary represents one game entry, with keys 'Game Name' and 'Summary'.
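If you want the loop to be more defensive, the sketch below (run on a made-up HTML snippet, not the live page) skips headings without a rank prefix and tolerates a missing summary paragraph. Note the split('. ', 1): splitting only on the first occurrence keeps names that themselves contain '. ' (like "Super Mario Bros. 3") intact.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one ranked entry, plus one heading without a rank prefix
html = """
<article class="article-section">
  <h2>3. Super Mario Bros. 3</h2>
  <p>A classic platformer.</p>
  <h2>Honorable Mentions</h2>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

data = []
for h2 in soup.find_all('h2'):
    parts = h2.text.split('. ', 1)  # split only on the first '. '
    if len(parts) < 2 or not parts[0].strip().isdigit():
        continue                    # skip headings that are not ranked entries
    p = h2.find_next_sibling('p')
    data.append({
        'Game Name': parts[1].strip(),
        'Summary': p.text.strip() if p else None,
    })

print(data)
# -> [{'Game Name': 'Super Mario Bros. 3', 'Summary': 'A classic platformer.'}]
```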

For the final step, the data from the list is stored in a DataFrame for easy reading.

In addition, I will create a new column showing the rank from 100 down to 1.

This column restores the numbers that were removed during the loop. This is just my preference.

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data)

# Create a new column "Rank" with rank numbers from 100 to 1
df['Rank'] = range(100, 0, -1)

# Reorder the columns to have "Rank" before "Game Name"
df = df[['Rank', 'Game Name', 'Summary']]

Here is an explanation of the code above:

  • df = pd.DataFrame(data): This line creates a DataFrame from the list of dictionaries called data. Each dictionary in the list represents a row in the DataFrame.
  • df['Rank'] = range(100, 0, -1): This line creates a new column named "Rank" in the DataFrame. It uses the range() function to generate a sequence of numbers from 100 to 1 in descending order (-1 specifies the step size). This assigns ranks to each game from 100 (top) to 1 (bottom).
  • df = df[['Rank', 'Game Name', 'Summary']]: This line reorders the columns in the DataFrame to have "Rank" as the first column, followed by "Game Name" and "Summary". It uses double square brackets to select and reorder the columns.
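Note that assigning range(100, 0, -1) assumes the scrape returned exactly 100 rows in page order. An alternative (my own suggestion, not part of the original approach) is to read the rank straight from each heading’s number prefix, so the rank stays correct even if an entry is missing. A sketch on made-up heading texts:

```python
# Made-up heading texts as they would appear on the page
headings = ['100. Half-Life', '99. Journey', '98. Chrono Trigger']

rows = []
for text in headings:
    prefix, name = text.split('. ', 1)  # e.g. '100' and 'Half-Life'
    rows.append({'Rank': int(prefix), 'Game Name': name.strip()})

print(rows[0])  # -> {'Rank': 100, 'Game Name': 'Half-Life'}
```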

You can display scraping results with the code below:

# Print the Scraping Result
df
[Image: Scraping results in a DataFrame]

After successfully performing web scraping and getting data in DataFrame form, you can export it to CSV or Excel format using the following code (the Excel export requires the openpyxl package):

# Save to CSV
df.to_csv('top_100_games_by_IGN.csv', index=False, encoding='utf-8')

# Save to Excel
df.to_excel('top_100_games_by_IGN.xlsx', index=False)

The resulting files will be saved in the same folder as your notebook.
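To double-check the export, you can read the CSV back with pandas and confirm nothing was lost in the round trip. A small self-contained sketch (using a made-up stand-in for the scraped DataFrame and a temporary file path):

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the scraped DataFrame
df = pd.DataFrame({'Rank': [2, 1], 'Game Name': ['B', 'A'], 'Summary': ['b', 'a']})

# Round-trip through CSV and confirm the shape survived
path = os.path.join(tempfile.gettempdir(), 'top_games_check.csv')
df.to_csv(path, index=False, encoding='utf-8')
restored = pd.read_csv(path)

print(len(restored) == len(df))  # -> True
print(list(restored.columns))    # -> ['Rank', 'Game Name', 'Summary']
```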

Conclusion

That’s how to do web scraping on an article, such as the IGN article entitled “Top 100 Video Games of All Time”.

You can also apply this technique to various other websites such as Wikipedia, simply by adjusting the tag elements you want to retrieve.
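For example, Wikipedia articles often present data in tables rather than headings, but the technique is the same: only the tags and class names change. A minimal sketch on a made-up Wikipedia-style table (offline, not the real site):

```python
from bs4 import BeautifulSoup

# Made-up Wikipedia-style table; on the real site you would fetch the page
# with requests first, exactly as in the IGN example above
html = """
<table class="wikitable">
  <tr><th>Year</th><th>Title</th></tr>
  <tr><td>1985</td><td>Super Mario Bros.</td></tr>
  <tr><td>1993</td><td>Doom</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
# Skip the first row, which holds the <th> header cells
for tr in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    cells = [td.text.strip() for td in tr.find_all('td')]
    rows.append({'Year': cells[0], 'Title': cells[1]})

print(rows)
# -> [{'Year': '1985', 'Title': 'Super Mario Bros.'}, {'Year': '1993', 'Title': 'Doom'}]
```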

Hopefully, this article is useful for web scraping fans.

Don’t forget to follow me for other interesting content and also check my website to get insight about finance.

Thank you and regards, Nugroho.

