Unleashing the Data Magic: Mastering Web Scraping with Python and Beautiful Soup

Mohammad Ghasemi
10 min read · Jun 25, 2023


Hey there! Ever found yourself manually copying and pasting data from websites, wishing there was a more efficient way? Enter web scraping — a game-changing technique that saves you time and unlocks a world of data possibilities. Web scraping is like having your personal data detective, gathering information from websites with a few lines of code. It’s a powerful tool that can greatly enhance your job, allowing you to extract valuable insights, automate repetitive tasks, and make data-driven decisions. Imagine effortlessly collecting product details, tracking market trends, or conducting competitor analysis. With web scraping, you can unlock the treasure trove of information available on the web and leverage it to your advantage. In this article, we’ll dive into the exciting world of web scraping and introduce you to Beautiful Soup, a Python library that makes extracting data from HTML and XML documents a breeze.

You can also view this project's code by following this link.

Introduction to Beautiful Soup

At the heart of web scraping lies Beautiful Soup, a Python library that makes extracting data from HTML and XML documents a delightful experience. Beautiful Soup acts as your trusty assistant, helping you navigate through the labyrinth of website code and effortlessly extract the information you need. With its intuitive and easy-to-use API, Beautiful Soup abstracts away the complexities of parsing and traversing HTML structures, allowing you to focus on the data that matters. Whether you’re searching for specific elements, extracting text, or even diving into nested tags, Beautiful Soup provides a comprehensive set of tools that simplifies the process. It handles messy and inconsistent HTML with grace, adapting to different page structures and making your scraping tasks smoother. In this article, we’ll delve into the world of Beautiful Soup and explore its powerful features, empowering you to become a proficient web scraper in no time.

Our use case

Imagine this: your boss expects daily reports on the latest prices of major indexes like the S&P 500, DOW30, NASDAQ, and Russell 2000. Not only do you need to scrape the data from multiple sources, but your boss also wants the reports conveniently delivered in an Excel file. Don’t worry: with the power of web scraping and a little extra code, you can meet this requirement seamlessly. By leveraging Beautiful Soup and adding Excel file generation, you can automate the retrieval of index data and save it in a structured spreadsheet format. In this article, we’ll guide you through scraping the latest prices of these four indexes, and we’ll provide the code to export them as an Excel file, so you can impress your boss with accurate reports in a format they’ll love, all with a single click!

Let the job begin!

Before we dive into the exciting world of web scraping using Beautiful Soup, let’s start by setting up our project in PyCharm. PyCharm is a powerful integrated development environment (IDE) that provides a seamless coding experience for Python developers. By creating a dedicated project for our web scraping endeavors, we can keep our code organized, leverage the features of the IDE, and make our development process more efficient.

In this section, we’ll walk you through the steps of creating a new project in PyCharm, setting up the necessary environment, and ensuring that we have everything in place to begin our web scraping journey. By the end of this section, you’ll have a fully set up project in PyCharm, ready to write your web scraping code with ease.

So, let’s roll up our sleeves and get started with creating our project in PyCharm!

Create a PyCharm project

To create your project in PyCharm, follow these steps:

1. Launch PyCharm: Open the PyCharm IDE on your computer.

2. Create a New Project: Click on “Create New Project” or go to “File” > “New Project” to create a new project.

3. Specify Project Details: Choose a location on your computer where you want to store your project files. Give your project a meaningful name, such as “WebScraper” or any name you prefer.

4. Select Interpreter: Choose the Python interpreter you want to use for your project. If you have a specific version of Python installed, select it from the list. Otherwise, you can create a new virtual environment for your project.

5. Create Project: Click on the “Create” button to create the project. PyCharm will set up the project structure and open the main project window.

Exploring the web page

In the exciting world of web scraping, our journey begins with delving into the underlying HTML code of web pages. By exploring the structure and elements of a webpage, we can identify the data we want to extract and devise an effective scraping strategy.

In this section, I’ll guide you through the process of inspecting the HTML code of a web page and understanding its structure. Armed with this knowledge, you’ll be ready to target the specific elements and content you wish to scrape. So, let’s roll up our sleeves and learn how to navigate the HTML terrain!

To get started with our web scraping adventure, let’s take a closer look at the Yahoo Finance main page. Head over to Yahoo Finance and explore the plethora of financial information available. This will be our playground for extracting valuable data. Take a moment to familiarize yourself with the elements we’ll be targeting for scraping. To assist you, refer to the image below which highlights the specific elements we aim to extract:

The Yahoo Finance main page, with the elements we want to extract highlighted.

By identifying these elements, we can direct our efforts toward capturing the desired data. Now, let’s proceed with the code implementation to retrieve the HTML code of the Yahoo Finance main page.

To get a glimpse of the underlying HTML code of a web page, we can utilize the developer view in our web browser. In this tutorial, we’ll focus on the popular Chrome browser. Follow these simple steps to access the developer view:

  1. Open Chrome: Launch the Chrome browser on your computer.
  2. Navigate to Yahoo Finance: Visit the Yahoo Finance main page or any other web page you want to inspect.
  3. Open Developer Tools: Right-click anywhere on the web page and select ‘Inspect’ or ‘Inspect Element’. This will open the developer view within Chrome.
  4. Explore the Elements: Within the developer view, you’ll see the HTML code of the web page. Take some time to navigate through the different elements and understand the structure of the page.

As you explore the HTML code within the developer view, you may initially be greeted with a multitude of tags, attributes, and text. Don’t be overwhelmed! To locate the specific element you want to scrape, Chrome offers a helpful tool. Look for a small button resembling a cursor or arrow icon in the top-left corner of the developer view. Click on this button to activate the ‘Select Element’ mode.

Once in ‘Select Element’ mode, move your cursor over different elements on the web page, and you’ll notice that they get highlighted in the developer view. By hovering over elements, you can visually identify the HTML code associated with them.

Chrome’s “Select Element” tool takes us straight to the code for the element we want.

For instance, let’s consider the S&P price element as an example:

<fin-streamer class="Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-symbol="^GSPC" data-field="regularMarketPrice" data-trend="none" value="4348.33" active="true">4,348.33</fin-streamer>

Scraping the elements using Beautiful Soup

To begin scraping the data from the Yahoo Finance page, we’ll utilize the power of Beautiful Soup (bs4), our trusty companion in navigating and extracting information from HTML code. With Beautiful Soup, we can traverse the HTML structure effortlessly and extract the elements we need.

To get started, we need to install the Beautiful Soup library in our Python project. Open your project in PyCharm and navigate to the terminal or command prompt. Use the following command to install Beautiful Soup:

pip install requests
pip install beautifulsoup4

Once installed, we’ll import the necessary modules in our Python script. Add the following lines of code at the beginning of your Python file:

import requests
from bs4 import BeautifulSoup

We import the requests module to fetch the HTML code of the Yahoo Finance page, and BeautifulSoup from bs4 to parse and navigate the HTML structure.

Next, we’ll write the code to retrieve the HTML code of the Yahoo Finance page using the requests library. Stay tuned as we embark on our web scraping journey and uncover the fascinating world of data extraction!

With Beautiful Soup and the necessary modules imported, we’re ready to fetch the HTML code of the Yahoo Finance page. To do this, we’ll use the requests library, which allows us to send HTTP requests and retrieve the HTML content of a web page.

Add the following code snippet to your Python script:

url = 'https://finance.yahoo.com/'
response = requests.get(url)
html_content = response.content

In the above code, we define the url variable to store the URL of the Yahoo Finance page we want to scrape. Then, we use requests.get(url) to send a GET request to the URL and retrieve the response. The response contains the HTML content of the page.

We store the HTML content in the html_content variable, which we'll later pass to Beautiful Soup for parsing and extraction.
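One practical caveat: some sites, Yahoo Finance among them, may reject requests that don’t look like they come from a browser, or may return an error page that would confuse the parser. Below is a minimal, more defensive variant of the fetch step; the function name, the User-Agent string, and the timeout value are my own illustrative choices, not part of the original script:

```python
import requests

def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a page's HTML, sending a browser-like User-Agent.

    Some sites reject the default requests User-Agent, so supplying
    one is a common precaution (the exact string here is illustrative).
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.content
```

You could then replace the two-line fetch above with `html_content = fetch_html(url)`.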

Now that we have the HTML code in our hands, we’re ready to unleash the power of Beautiful Soup to navigate through it and extract the desired data. Stay tuned for the upcoming section, where we’ll explore how to use Beautiful Soup to locate the element representing the S&P price and extract the data we need.

Keep the enthusiasm high as we delve deeper into web scraping with Python and Beautiful Soup!

# Create a Beautiful Soup object
soup = BeautifulSoup(html_content, 'html.parser')
# Find the element representing the S&P price
sp_price_element = soup.find('fin-streamer', {'data-symbol': '^GSPC', 'data-field': 'regularMarketPrice'})
# Extract the price value
sp_price = sp_price_element['value']
# Print the current S&P 500 price
print('The current S&P 500 price is:', sp_price)

To locate the element representing the S&P price, we use the soup.find() method. We pass the tag name 'fin-streamer' and a dictionary containing the attributes 'data-symbol': '^GSPC' and 'data-field': 'regularMarketPrice'. This helps us narrow down our search to the specific element we want.

Once we have the sp_price_element, we extract the price value by accessing the 'value' attribute using square brackets. We store the price in the sp_price variable.

Finally, we print the current S&P 500 price using print('The current S&P 500 price is:', sp_price).

By running the script, you will see the current S&P 500 price printed to the console.
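One thing to keep in mind: `soup.find()` returns `None` when nothing matches, so indexing the result directly will raise a `TypeError` if the page layout changes. Here is a minimal defensive variant, demonstrated against a trimmed copy of the sample element we saw in the developer tools, so it runs without a live request:

```python
from bs4 import BeautifulSoup

# A trimmed sample of the element from the developer view,
# so the lookup can be demonstrated offline.
sample_html = ('<fin-streamer data-symbol="^GSPC" '
               'data-field="regularMarketPrice" '
               'value="4348.33">4,348.33</fin-streamer>')

soup = BeautifulSoup(sample_html, 'html.parser')
element = soup.find('fin-streamer',
                    {'data-symbol': '^GSPC',
                     'data-field': 'regularMarketPrice'})

# find() returns None when no element matches, so guard before indexing
if element is not None:
    price = element['value']
    print('S&P 500 price:', price)  # prints: S&P 500 price: 4348.33
else:
    print('Price element not found - the page layout may have changed')
```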

Now let’s grab the other three elements that we wanted :

dow30_price_element = soup.find('fin-streamer', {'data-symbol': '^DJI', 'data-field': 'regularMarketPrice'})
dow30_price = dow30_price_element['value']

nasdaq_price_element = soup.find('fin-streamer', {'data-symbol': '^IXIC', 'data-field': 'regularMarketPrice'})
nasdaq_price = nasdaq_price_element['value']

russel_price_element = soup.find('fin-streamer', {'data-symbol': '^RUT', 'data-field': 'regularMarketPrice'})
russel_price = russel_price_element['value']

And in the end, let’s print all the results to see what we have done:

print('The current S&P 500 price is:', sp_price)
print('The current DOW30 price is:', dow30_price)
print('The current nasdaq price is:', nasdaq_price)
print('The current RUSSEL2000 price is:', russel_price)

As we can see, bs4 gives us the prices exactly as they appear on the page:

The current S&P 500 price is: 4348.33
The current DOW30 price is: 33727.43
The current nasdaq price is: 13492.516
The current RUSSEL2000 price is: 1821.6345
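The four lookups above repeat the same pattern, so they can be folded into a loop over a symbol map. A sketch of one way to do that (the `SYMBOLS` dictionary and the `extract_prices` name are my own, not from the original script):

```python
from bs4 import BeautifulSoup

# The four indexes and their Yahoo Finance symbols
SYMBOLS = {
    'S&P 500': '^GSPC',
    'DOW30': '^DJI',
    'NASDAQ': '^IXIC',
    'Russell 2000': '^RUT',
}

def extract_prices(soup: BeautifulSoup) -> dict:
    """Collect the regularMarketPrice value for each index symbol."""
    prices = {}
    for name, symbol in SYMBOLS.items():
        element = soup.find('fin-streamer',
                            {'data-symbol': symbol,
                             'data-field': 'regularMarketPrice'})
        # Store None rather than crash if an element is missing
        prices[name] = element['value'] if element is not None else None
    return prices
```

With this helper, adding a fifth index is a one-line change to `SYMBOLS`.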

Saving our data into an Excel file

Now that we have collected our data, we can save it to an Excel file. I’ll cover this topic in depth in a separate tutorial, but for now let’s write some simple code to finish the job.

First, we need to install the pandas library:

pip install pandas

Then we import it at the beginning of our code:

import pandas as pd

To store the scraped data in a structured format, we’ll use a pandas DataFrame. Add the following code snippet after scraping the stock prices:

# Creating a DataFrame
data = {
    'Symbol': ['S&P 500', 'Dow Jones', 'Nasdaq Composite', 'Russell 2000'],
    'Price': [sp_price, dow30_price, nasdaq_price, russel_price]
}
df = pd.DataFrame(data)

Now that we have our DataFrame ready, we can save it to an Excel file using the to_excel() function provided by pandas. Add the following code snippet after creating the DataFrame:

# Saving data to Excel
file_name = 'stock_prices.xlsx'
df.to_excel(file_name, index=False)
print('Data saved to', file_name)
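Since the boss wants a report every day, a dated file name keeps each run from overwriting the last. One possible sketch; the `build_report` helper and the `stock_prices_YYYY-MM-DD.xlsx` naming scheme are my own suggestions, not part of the original script:

```python
from datetime import date

import pandas as pd

def build_report(prices: dict, report_date: date):
    """Arrange scraped prices into a DataFrame plus a dated file name."""
    df = pd.DataFrame({
        'Symbol': list(prices),
        'Price': list(prices.values()),
    })
    # One file per day, e.g. stock_prices_2023-06-25.xlsx
    file_name = f"stock_prices_{report_date.isoformat()}.xlsx"
    return df, file_name

# Usage:
#   df, name = build_report({'S&P 500': sp_price}, date.today())
#   df.to_excel(name, index=False)  # requires the openpyxl engine
```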

Diving deeper into HTML elements

In the previous code snippet, we used the following attributes in the soup.find() method:

sp_price_element = soup.find('fin-streamer', {'data-symbol': '^GSPC', 'data-field': 'regularMarketPrice'})

These attributes help us identify the specific element we’re interested in within the HTML structure. Let’s break them down:

  1. 'data-symbol': '^GSPC': This attribute specifies the value of the data-symbol attribute of the element. In this case, we're looking for an element with the data-symbol attribute set to ^GSPC. The ^GSPC represents the S&P 500 symbol.
  2. 'data-field': 'regularMarketPrice': This attribute specifies the value of the data-field attribute of the element. We're interested in an element with the data-field attribute set to 'regularMarketPrice'. This particular attribute indicates that we want to extract the regular market price of the S&P 500.

By combining these attributes, we create a filter that helps narrow down the search to the specific element we need. If the element with the tag name 'fin-streamer' contains these attributes with the specified values, it will be selected as the sp_price_element.

It’s important to note that these attributes are specific to the HTML structure of the Yahoo Finance page. If the structure of the page changes or the element representing the S&P price has different attributes, you may need to adjust the attributes accordingly.

In such cases, you can inspect the HTML source code of the page or use browser developer tools to identify the relevant attributes and their values. Update the attributes in the soup.find() method accordingly to ensure accurate element selection.
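The same attribute filter can also be written as a CSS selector with `soup.select_one()`, which some people find easier to read and to update when attributes change. Demonstrated here against a trimmed copy of the sample element from earlier, so it runs offline:

```python
from bs4 import BeautifulSoup

sample_html = ('<fin-streamer data-symbol="^GSPC" '
               'data-field="regularMarketPrice" '
               'value="4348.33">4,348.33</fin-streamer>')
soup = BeautifulSoup(sample_html, 'html.parser')

# CSS attribute selectors express the same filter as the find() call
element = soup.select_one(
    'fin-streamer[data-symbol="^GSPC"][data-field="regularMarketPrice"]')
print(element['value'])  # prints: 4348.33
```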

Conclusion

Congratulations on completing this tutorial! You’ve learned how to scrape stock prices from Yahoo Finance using Python and Beautiful Soup. By leveraging the power of web scraping, you can gather valuable data for analysis and decision-making. We covered the basics of HTML parsing, locating specific elements, and extracting data using Beautiful Soup. Additionally, we explored how to save the scraped data into an Excel file using pandas. Armed with these techniques, you can now scrape stock prices for various symbols and store them for further analysis or sharing. Keep exploring and applying your newfound skills to unlock even more possibilities in web scraping. Happy scraping!

If you have any problems or questions, feel free to ask me!

You can also view the code repository by following this link.


Mohammad Ghasemi

Analyzing markets, dissecting trends, forecasting outcomes. Expertise in finance & economics. Data-driven insights for strategic decisions.