Published in Nerd For Tech

How to scrape Dow Jones Industrial Average data from Yahoo finance

credit: unsplash@markusspiske

Recently, I have been working on a stock price prediction project in which we use top news headlines to predict whether the Dow Jones Industrial Average will go up or down. There are plenty of datasets on Kaggle for you to play around with. However, to collect the most recent data, you might consider web scraping. It’s quite easy to pick up, so let’s get started!

How things are working in the background

We will start with how clients and servers communicate so that we have an idea of what web scraping is. Here is what happens when you visit this page in your browser: the browser makes a request to the server at this URL (www.medium.com/…), and the server responds by sending an HTML file that tells the browser how to present the data. Voilà! Your browser now has the data and displays it nicely!

Thus, web scraping means extracting data from the HTML file we request from the server, and the Python script we write to do it is called a web bot.

Libraries we need for web scraping

There are only two steps in web scraping: download the web page, then parse the data. We will use the requests library for downloading and beautifulsoup4 for parsing (plus lxml, the parser we will hand to BeautifulSoup later). Run the commands below in your project terminal to install these libraries.

pip3 install beautifulsoup4
pip3 install requests
pip3 install lxml

Why do we need to parse the data?

An HTML file is made up of data and tags. Since we only want the data, we need to parse the HTML file and strip out the tags and everything else we don’t need. As you can see from the screenshot below, to get this article’s title we would extract it from the highlighted <h2> tag.

Inspecting medium.com in the browser’s developer tools

Scraping data from Yahoo finance

Screenshot of Dow Jones Industrial Average page on Yahoo Finance

Now that we understand how web scraping works, we can get down to business. We want the Dow Jones Industrial Average data from Yahoo Finance, so we will extract the table highlighted in the screenshot.

Once you choose the Time Period and hit the Apply button, you will see that the URL has changed.

https://finance.yahoo.com/quote/%5EDJI/history?period1=1362700800&period2=1615161600&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

What just happened is that we made a GET request asking for the data from period1=1362700800 to period2=1615161600. The timestamps here are in *epoch time.
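You don’t have to assemble that long query string by hand, by the way: requests can build it for you from a dict of parameters. Here is a small sketch (the parameter values are the ones from the URL above) that prepares the request without sending it, just to show that the generated URL matches what we saw in the address bar.

```python
import requests

# The period values are Unix timestamps (seconds since 1970-01-01 UTC).
# requests builds the query string for us from a dict of parameters.
params = {
    "period1": 1362700800,
    "period2": 1615161600,
    "interval": "1d",
    "filter": "history",
    "frequency": "1d",
    "includeAdjustedClose": "true",
}
url = "https://finance.yahoo.com/quote/%5EDJI/history"

# Prepare (but don't send) the request to inspect the final URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
```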

An overview of our web bot

In our web bot, we can write a function that takes two timestamps so that we can fetch any interval we want without having to open the website again.
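Putting the pieces we are about to walk through together, the whole web bot might look something like this. This is a sketch: the function name scrape_djia is my own, and note that Yahoo sometimes rejects requests that don’t carry a browser-like User-Agent header, so you may need to pass one via the headers argument of requests.get.

```python
import requests
from bs4 import BeautifulSoup


def scrape_djia(period1, period2):
    """Scrape DJIA history from Yahoo Finance for the given epoch-time interval."""
    url = (
        "https://finance.yahoo.com/quote/%5EDJI/history"
        f"?period1={period1}&period2={period2}"
        "&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
    )
    try:
        page = requests.get(url)
        page.raise_for_status()
    except requests.exceptions.RequestException as e:
        print("Request exception:", e)
        return None, None

    soup = BeautifulSoup(page.content, "lxml")
    table = soup.table

    # Header cells live in <thead><th>; data cells in <tbody><td><span>.
    headings = [th.text.strip().replace("*", "")
                for th in table.find("thead").find_all("th")]
    data = [[span.get_text() for span in row.select("td span")]
            for row in table.find("tbody").find_all("tr")]
    return headings, data
```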

Breaking down the script

I know the script above is quite overwhelming. No worries, I will hold your hand and walk you through it step by step :)

Download the webpage

We will use the URL we discussed above, swapping the epoch times for the parameters of our function. Next, we write a try-except block in case any error comes up while requesting the data.

page = requests.get(url) is where the magic happens: the program makes the request and stores the response in the page variable. Note that the returned object, page, is a Response object. To get the content of the HTML page as a byte string, we use page.content.

url = (
    "https://finance.yahoo.com/quote/%5EDJI/history"
    f"?period1={period1}"
    f"&period2={period2}"
    "&interval=1d"
    "&filter=history"
    "&frequency=1d"
    "&includeAdjustedClose=true"
)
try:
    page = requests.get(url)
    page.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTP Error:", e)
except requests.exceptions.ConnectionError as e:
    print("Error Connecting:", e)
except requests.exceptions.Timeout as e:
    print("Timeout Error:", e)
except requests.exceptions.RequestException as e:
    print("Request exception:", e)

Parsing the data

Because the data in an HTML page is embedded among HTML tags, and tags don’t follow a strict syntax, parsing data out of an HTML page with vanilla Python can be a nightmare. That’s why we need BeautifulSoup: it makes things so easy! For example, you can get the first <table> tag with soup.table.

soup = BeautifulSoup(page.content, "lxml")
table = soup.table

Observe the HTML file

Before we move on, we need to know where the data is located so that we know how to extract it. In the screenshot below, we can see the data we want is in the <table> tag: the heading is in the <thead> tag and the body is in the <tbody> tag.

Let’s take a closer look at the structure of the table. Inside the <thead> tag there is a <tr> tag, which defines a row of the table, and the <th> tags define the header cells. What’s inside the <tbody> tag is quite similar, but <td> tags are used to define the standard data cells.

<table>
  <thead>
    <tr>
      <th>...</th>
      <th>...</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>...</td>
      <td>...</td>
    </tr>
  </tbody>
</table>
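To see how this structure maps onto BeautifulSoup calls before we tackle the real page, here is a toy version of the table (the date and price are made up) parsed with Python’s built-in html.parser:

```python
from bs4 import BeautifulSoup

# A miniature version of the Yahoo Finance table, with made-up values.
html = """
<table>
  <thead>
    <tr><th>Date</th><th>Close*</th></tr>
  </thead>
  <tbody>
    <tr><td><span>Mar 05, 2021</span></td><td><span>31,496.30</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.text for th in soup.thead.find_all("th")]   # header cells
cells = [span.get_text() for span in soup.tbody.select("td span")]  # data cells
print(headers)  # ['Date', 'Close*']
print(cells)    # ['Mar 05, 2021', '31,496.30']
```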

Extract the headers

To find data by tag, we use find() or find_all(). What we are doing here is extracting the headers from <thead> and storing them in the headings list.

headings = []
table_head = table.find('thead')
table_headrows = table_head.find_all('th')
for row in table_headrows:
    col = row.text.strip()
    headings.append(col.replace('*', ''))

Extract the body

data = []
table_body = table.find('tbody')
table_bodyrows = table_body.find_all('tr')
for row in table_bodyrows:
    cols = row.select('td span')
    cols = [col.get_text() for col in cols]
    data.append(cols)

If you look at what’s inside the <td> tags, you will find that each value is wrapped in a <span> tag, so we use row.select('td span') to get our data.

Now that we’ve found where the data is, we call get_text() on each element to strip away the tags and keep only the text.

There you have it! We just scraped all the data we want from the website. You can either use the json library to dump it to a JSON file or write it into a CSV file.
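For example, saving the scraped table could look like the sketch below. It assumes the headings and data lists we built above; here I seed them with a tiny made-up sample so the snippet runs on its own.

```python
import csv
import json

# Stand-ins for the lists our web bot produced (made-up sample row).
headings = ["Date", "Open", "Close"]
data = [["Mar 05, 2021", "30,793.98", "31,496.30"]]

# Write a CSV file with the headings as the first row.
with open("djia.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headings)
    writer.writerows(data)

# Or dump everything as JSON (the json module, not pickle, produces JSON).
with open("djia.json", "w") as f:
    json.dump({"headings": headings, "data": data}, f)
```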

The output from the web bot

Notes

*What is epoch time?

The Unix epoch (or Unix time or POSIX time or Unix timestamp) is the number of seconds that have elapsed since January 1, 1970 (midnight UTC/GMT), not counting leap seconds (in ISO 8601: 1970–01–01T00:00:00Z).
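To get the timestamps for our function without opening the website, you can convert a calendar date with Python’s standard datetime module. For instance, the period2 value from our URL corresponds to March 8, 2021 (midnight UTC):

```python
from datetime import datetime, timezone

# Convert a calendar date (midnight UTC) to a Unix timestamp...
ts = int(datetime(2021, 3, 8, tzinfo=timezone.utc).timestamp())
print(ts)  # 1615161600

# ...and back again, to double-check.
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # 2021-03-08 00:00:00+00:00
```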

References:

  1. Stock Sentiment Analysis- Classification, NLP
  2. Class note of my instructor Clare Nguyen
  3. What is epoch time? quoted from Epoch Converter


NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit https://www.nerdfortech.org/.

Yin Chang

An aspiring data scientist who tries to turn the scary technical terms into easy stories