How to scrape Dow Jones Industrial Average data from Yahoo Finance
Recently, I have been working on a stock price prediction project in which we use top news headlines to predict whether the Dow Jones Industrial Average will go up or down. There are plenty of datasets on Kaggle for you to play around with; however, to collect the most recent data, you might consider web scraping. It’s quite easy to pick up, so let’s get started!
How things work in the background
We will start with how clients and servers communicate, so that we have an idea of what web scraping is. Here is what happens when you visit this page in your browser: your browser makes a request to the server at this URL (www.medium.com/…), and the server responds by sending an HTML file that tells the browser how to present the data. Voilà! Now your browser has the data and displays it nicely!
Web scraping, then, is extracting data from the HTML file we request from the server, and the Python script we write is called a web bot.
Libraries we need for web scraping
There are only two steps in web scraping: download the web page, then parse the data. We will be using the requests library to download the web page and beautifulsoup4 to parse the data. Run the commands below in your project terminal to install these libraries.
pip3 install beautifulsoup4
pip3 install requests
Why do we need to parse the data?
An HTML file is made up of data and tags. Since we only want the data, we need to parse the HTML file and strip out the tags and all the other unwanted stuff. As you can see from the screenshot below, to get our article title, we need to extract it from the highlighted <h2> tag.
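To make the idea concrete, here is a minimal sketch of pulling the text out of an <h2> tag with BeautifulSoup. The HTML snippet is a made-up stand-in for the real page, and I use Python’s built-in "html.parser" backend here so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the HTML file the server would send back.
html = "<html><body><h2>How to scrape Dow Jones Industrial Average data</h2></body></html>"

soup = BeautifulSoup(html, "html.parser")
title = soup.h2.get_text()  # grab the first <h2> tag and drop the tags around the text
print(title)  # How to scrape Dow Jones Industrial Average data
```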
Scraping data from Yahoo Finance
Now that we understand how web scraping works, we can get down to business. We want the Dow Jones Industrial Average data from Yahoo Finance, so we will extract the table I highlighted in the screenshot.
Once you choose the Time Period and hit the Apply button, you will see that the URL has changed. What just happened is that we made a GET request, and the query parameters told the server we want the data up to period2=1615161600. The format of the timestamp here is *epoch time.
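Epoch timestamps are easy to produce with Python’s standard datetime module. As a quick sanity check, 1615161600 corresponds to March 8, 2021 at midnight UTC (the helper name below is my own, not from the script):

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    # Seconds elapsed since January 1, 1970 (midnight UTC) for the given date.
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

print(to_epoch(2021, 3, 8))  # 1615161600
```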
An overview of our web bot
In our web bot, we can write a function that takes in two timestamps, so that we can fetch any interval we want without having to open the website anymore.
Breaking down the script
I know the script above looks quite overwhelming. No worries, I will hold your hand and walk you through it step by step :)
Download the webpage
We will be using the URL we discussed above, swapping the epoch times with the parameters of this function. Next, we write a try-except block in case any error comes up while requesting the data.
page = requests.get(url) is where the magic happens: the program makes a request and stores the response in the page variable. Note that the returned object, page, is a Response object. To get the content of the HTML page as a byte string, we use page.content.
import requests

def download_page(period1, period2):
    # Swap the two epoch timestamps into the URL we saw after hitting Apply.
    url = f"https://finance.yahoo.com/quote/%5EDJI/history?period1={period1}&period2={period2}"
    try:
        page = requests.get(url)
        page.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("HTTP Error:", e)
    except requests.exceptions.ConnectionError as e:
        print("Error Connecting:", e)
    except requests.exceptions.Timeout as e:
        print("Timeout Error:", e)
    except requests.exceptions.RequestException as e:
        print("Request exception:", e)
    return page
Parsing the data
Because the data in an HTML page is embedded among the HTML tags, and the tags don’t always follow strict syntax, parsing data out of an HTML page with vanilla Python can be a nightmare. That’s why we need BeautifulSoup: it makes things so easy! You can get the data in the <table> tag with
soup = BeautifulSoup(page.content, "lxml")
table = soup.table
Observe the HTML file
Before we move on, we need to know where the data is located so that we know how to extract it. In the screenshot below, we can see that the data we want is in the <table> tag: the heading is in the <thead> tag, and the body is in the <tbody> tag.
Let’s take a closer look at the structure of the table. Inside the <thead> tag, there is a <tr> tag, which defines a row in the table, and <th> tags, which define the header cells. What’s inside the <tbody> tag is quite similar, but <td> tags are used to define the standard data cells.
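Putting that structure together, here is a small self-contained sketch showing how the tags nest and how BeautifulSoup walks them. The two-column table and its values are made up for illustration; the real Yahoo Finance table has more columns:

```python
from bs4 import BeautifulSoup

# A made-up miniature version of the table: <thead> holds <th> header
# cells, <tbody> holds <td> data cells (each wrapping a <span>).
html = """
<table>
  <thead>
    <tr><th>Date</th><th>Close</th></tr>
  </thead>
  <tbody>
    <tr><td><span>Mar 08, 2021</span></td><td><span>31,000.00</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text() for th in soup.table.find('thead').find_all('th')]
rows = [[span.get_text() for span in tr.select('td span')]
        for tr in soup.table.find('tbody').find_all('tr')]
print(headers)  # ['Date', 'Close']
print(rows)     # [['Mar 08, 2021', '31,000.00']]
```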
Extract the headers
To find data by tag, we use find_all(). What we are doing here is extracting the headers from <thead> and storing them in a list.
table_head = table.find('thead')
table_headrows = table_head.find_all('th')
headers = []
for row in table_headrows:
    headers.append(row.text.strip())
Extract the body
table_body = table.find('tbody')
table_bodyrows = table_body.find_all('tr')
data = []
for row in table_bodyrows:
    cols = row.select('td span')
    data.append([col.get_text() for col in cols])
If you look into what’s inside the <td> tags, you will find that the data is wrapped in a <span> tag, so we use row.select('td span') to get our data. Then, get_text() strips off the tags and leaves just the text.
There you have it! We just scraped all the data we want from the website. You may either dump it as a JSON file with the json library or write it into a CSV file with the csv library.
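For example, the scraped headers and rows can be written to a CSV file with the standard csv module. The filename and the sample values below are placeholders, not real scraped data:

```python
import csv

# Placeholder data standing in for the scraped headers and data rows.
headers = ["Date", "Open", "High", "Low", "Close"]
data = [["Mar 08, 2021", "31,000.00", "31,500.00", "30,900.00", "31,400.00"]]

# newline="" prevents csv from writing blank lines between rows on Windows.
with open("dji_history.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(data)
```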
*What is epoch time?
The Unix epoch (or Unix time or POSIX time or Unix timestamp) is the number of seconds that have elapsed since January 1, 1970 (midnight UTC/GMT), not counting leap seconds (in ISO 8601: 1970-01-01T00:00:00Z).