Using Python & Pandas Datareader to Analyze Financial Data

Analyze Financial Data using Pandas Datareader

Daniel Morales
Jan 31 · 12 min read
Image for post
Image for post
Image by Micheile Henderson

Finance and economics are becoming more and more interesting for all kinds of people, regardless of their career or profession. This is because we are all affected by economic data, or at least we are increasingly interested in being up-to-date, and we have a lot of information at hand.

Every day billions of bytes of financial data are sent over the Internet. Whether it is the price of a share, an e-commerce transaction, or even information on a country’s GDP. All this data, when properly organized and managed can be used to build some amazing and insightful software applications.

We will use Python to access public financial data, organize it and combine it to gain new insights into how money makes the world go round. We will focus mainly on two Python modules:

  • Pandas — used to organize and format complex data in table structures called DataFrames.
  • Pandas-datareader — used to access public financial data from the Internet and import it into Python as a DataFrame.

We will use these modules to import data from some of the largest financial organizations in the world, as well as data stored locally on our computers. By the end of the notebook, you should feel comfortable importing financial data, either from a public source or from a local file, into Python, organizing that data and combining it with each other

Importing data via Datareader

Many financial institutions, stock markets and global banks provide the public with large amounts of the data they publicly store. Most of this data is well organized, updated live and accessible through the use of an application programming interface (API), which offers programming languages such as Python a way to download and import it.

The pandas-datareader module is specifically designed to interface with some of the world's most popular financial data APIs, and import their data into an easily digestible pandas DataFrame. Each financial API is accessed by a different function exposed by pandas-datareader. Generally, accessing each API requires a different set of arguments and information to be provided by the programmer.

We will import data from several of these APIs and play with them. For a complete list of all data that the pandas-datareader can access, you can consult the official documentation.

If you want to download the notebook, link here.

Let’s make a small program that uses pandas-datareader to get data from the World Bank API gdp. First we will install the library as follows (Please run the command on your own in the next cell, I already have it installed!):

!pip3 install pandas_datareaderfrom pandas_datareader import wb
from datetime import datetime

Here we have imported the data from the World Bank via wb. And we imported the python datetime library, in order to put the start and end dates in which I want to do the analysis. For more information about Python Datetimes see this other notebook!

In [2]:

start = datetime(2005, 1, 1)
end = datetime(2008, 1, 1)
indicator_id = 'NY.GDP.PCAP.KD'

In [3]:

gdp_per_capita = wb.download(indicator=indicator_id, start=start, end=end, country=['US', 'CA', 'MX'])

In [4]:

gdp_per_capita
Image for post
Image for post
Image by Author

The download method has different parameters that we can pass on in order to get the data from the World Bank. Among them is the indicator_id. You can find more information about the parameters here.

As we can see, with this command we have obtained the GDP per capita data for 3 countries on a given date

Getting the NASDAQ Symbols

The NASDAQ Stock Exchange identifies each of its shares with a unique symbol:

  • Apple — APPL
  • Google — GOOGL
  • Tesla — TSLA

It also provides a useful API for accessing the symbols currently traded on it. Pandas-datareader provides several functions to import data from the NASDAQ API through its nasdaq_trader sub-module.

In [5]:

from pandas_datareader.nasdaq_trader import get_nasdaq_symbols

To import the list of stock symbols, we want to use the function get_nasdaq_symbols from nasdaq_trader. It is done in this way

In [6]:

symbols = get_nasdaq_symbols()

When called, it will go to the NASDAQ API, and import the list of symbols currently being traded. The advantage of using pandas-datareader is that all the logic to interact with the NASDAQ API or any other API is encapsulated in easy-to-use sub-modules and functions like the ones above.

In [7]:

symbols
Image for post
Image for post
Image by Author

We have a total of 9,338 different stocks that are listed on the Nasdaq, and in whose columns we have different economic and financial information about that stock. But what if we want to access a single symbol/stock?

In [8]:

symbols.loc['IBM']
Image for post
Image for post
Image by Author

Technical Analysis in Finance

Technical analysis in finance is the type of analysis performed by means of statistics and charts on stocks (or indices in our case). Let’s see how to do something very simple with ‘Plotly’ a Python library for charting. In this case we’ll access to Microsoft daily quotes. Let’s do it!

In [9]:

!pip3 install plotly

In [10]:

import plotly.graph_objects as go

In [11]:

import pandas_datareader.data as webstock = 'MSFT'
start = datetime(2019, 1, 1)
df = web.DataReader(stock, data_source='yahoo', start=start)

In [12]:

df
Image for post
Image for post
Image By Author

We have accessed the data of MSFT. We did this by importing .data from datareader and giving it the web alias. Under the hood we are using Yahoo Finance to import the data from an API, but in this case pandas.datareader allowed us to do it in a very simple way. Now we are going to plot the result to make Technical Analysis.

In [13]:

graph = {
'x': df.index,
'open': df.Open,
'close': df.Close,
'high': df.High,
'low': df.Low,
'type': 'candlestick',
'name': 'MSFT',
'showlegend': True
}

In [14]:

layout = go.Figure(
data = [graph],
layout_title="Microsoft Stock"
)

In [15]:

layout
Image for post
Image for post
Image by Author

We just did something very interesting and it was to chart MSFT’s acicon with the updated data! Today is November 20, 2020, so the last data of my graph is that date, you can do the same, place the mouse at the end of the graph and see the last quote of the stock! You could in this case access your investment portfolio and run the code daily and make a technical analysis on those tickets!

Data filtering by date

Many of the APIs that the pandas-datareader connects to allow us to filter the data we get by date. Financial institutions tend to keep track of data that goes back several decades, and when we import that data, it is useful to be able to specify exactly when we want it to come from

An API that does just that is the Federal Reserve Bank of San Luis (FRED), which we can access by first importing the pandas_datareader.data sub-module and then calling its DataReader function:

In [16]:

import pandas_datareader.data as webstart = datetime(2019, 1, 1)
end = datetime(2019, 2, 1)

In [17]:

sap_data = web.DataReader('SP500', 'fred', start, end)

In [18]:

sap_data
Image for post
Image for post
Image by Author

The DataReader function takes 4 arguments:

  • SP500 — An identifier provided by the API that specifies the data we want to retrieve, in this case data from the SP500
  • “fred” — The name of the API we want to access
  • start_date, end_date — The range of dates we want the data to be

By changing the start and end dates, you can easily filter the data you receive

Using the Shift() operation

Once we’ve imported a DataFrame full of financial data, there are some pretty cool ways to manipulate it. In this exercise we will see the shift() operation, a DataFrame function that moves all the rows in a column up or down

Image for post
Image for post
Image By Author
Image for post
Image for post
Image by Author

Shift() can be called in a single column (as in the image above), or in the entire DataFrame where all columns will be shifted. You can also scroll through more than one row, and in any direction.

# shifts all rows down by 1
dataframe.shift(1)
# shifts all rows in name column up 5
dataframe[name].shift(-5)
# shifts all rows in the price column down 3
dataframe[name].shift(3)

Shift is particularly useful when it comes to financial data. For example, it can be used to help calculate the percentage growth between one row and the next, or to find the difference in stock prices over a series of days. Let’s see an example

In [19]:

start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)

In [20]:

gdp = web.DataReader('GDP', 'fred', start, end)
gdp.head()
Image for post
Image for post
Image by Author

We have imported the GDP from the FRED, now we will create a new column called Growth where we can do the math of the difference (in absolute values) between the different days

In [21]:

gdp['Growth'] = gdp['GDP'] - gdp['GDP'].shift(1)

In [22]:

gdp.head()
Image for post
Image for post
Image By Author

We can now see the absolute differences in this new column. An important clarification: the first row of growth column is now ‘NaN' because it has no one to do the calculation with, it is the first row of the dataset

Calculating basic financial statistics

Two useful calculations that can be made with financial data are variance and covariance. To illustrate these concepts, let’s use the example of a DataFrame that measures stock and bond prices over time

Image for post
Image for post
Image by Author

Variance

Variance measures how far a set of numbers is from its average. In finance, it is used to determine the volatility of investments.

dataframe['stocks'].var() # 106427
dataframe['bonds'].var() # 2272

In the above variance calculations, stocks are greater in value than bonds (106427 vs 2272). That's because stock prices are more dispersed than bond prices, indicating that stocks are a more volatile investment.

Covariance

Covariance, in a financial context, describes the relationship between the returns of two different investments over a period of time, and can be used to help balance a portfolio. Calling our stock and bond columns cov() produces an array that defines the covariance values between each pair of columns in the DataFrame. Covariance is also known as a correlation in finance. In our example data, when stock prices go up, bonds go down. We can use the covariance function to see this numerically.

dataframe.cov()

The above code produces the following DataFrame output:

Image for post
Image for post
Image by Author

Each value above represents the covariance between two columns. The higher the number, the more investments tend to go up and down at the same time. The upper right and lower left columns represent the covariance between stocks and bonds. Here we have a negative number, which indicates that stocks and bonds tend to move in different directions. To see this with an example, we are going to make a more complex exercise, let’s go!

Calculating the Logarithmic Return

The Logarithmic return is one of the three methods for calculating the return and assumes that the returns are composed continuously rather than through sub-periods. It is calculated by taking the natural logarithm of the final value divided by the initial value.

As an example, let’s say that the final value of an investment was 11andtheinitialvaluewas11andtheinitialvaluewas10. The function would say =LN(11/10) for a result of 9.53%.

In mathematics and statistics, a distinction is often made between discrete and continuous data. The return of the record is the most theoretical continuous version. In the real world, however, most people think of returns divided into discrete periods.

  • Arithmetic Return — A period, not composed, discrete.
  • Geometric return — Multi-period, composed, discrete.
  • Logarithmic return — Infinite periods, composed, continuous.

So the logarithmic return is the non-discrete, continuous version, meaning that if the whole period were divided into an infinite number of periods, what would that return be? You see, it’s theoretical.

Except for very short periods of time (less than a day), and for theoretical applications involving the calculation and precise measurement of curves, logarithmic returns are not commonly used.

Price data are useful, but in this case, since we want to compare each data set, it would be even better if instead of daily/annual prices, we had information on the returns of daily/annual prices.

As a first step, let’s define a function called log_return, which should accept one parameter, prices.

In [23]:

def log_return(prices):
pass

The equation for calculating the log return between two prices is as follows natural_log(current price/previous price)

In our case we want to run this equation for every day/year of price data in our imported DataFrame series (the A series is a single column in a DataFrame).

The pandas shift function (sifth()) can be used to divide each current price by its previous price in the Series.

prices / prices.shift(1)

And we can use Numpy’s natural logarithm function to get the log return for each entry in the new Series.

import numpy as np
np.log(Series)

Let’s do it

In [24]:

import numpy as npdef log_return(prices):
return np.log(prices / prices.shift(1))

Let’s use our new log_return function to calculate the logarithmic return of the DataFrame of Nasdaq, GDP and other indicator that we've already loaded

In [25]:

start = datetime(1999, 1, 1)
end = datetime(2019, 1, 1)
nasdaq_data = web.DataReader("NASDAQ100", "fred", start, end)
sap_data = web.DataReader("SP500", "fred", start, end)
gdp_data = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=start, end=end)
export_data = wb.download(indicator='NE.EXP.GNFS.CN', country=['US'], start=start, end=end)

Now we are going to pass each variable to the log_return function

In [26]:

nasdaq_returns = log_return(nasdaq_data['NASDAQ100'])

In [27]:

nasdaq_returns
Image for post
Image for post
Image by Author
sap_returns = log_return(sap_data['SP500'])

In [29]:

sap_returns
Image for post
Image for post
Image by Author
gdp_returns = log_return(gdp_data['NY.GDP.MKTP.CD'])

In [31]:

gdp_returns
Image for post
Image for post
Image by Author
export_returns = log_return(export_data['NE.EXP.GNFS.CN'])

In [33]:

export_returns
Image for post
Image for post
Image by Author

Comparison of return volatility

We are now ready to compare the volatility of each type of data. Remember that variance, in the context of financial data, tells us how volatile an investment is. We’ll use the var() function in Pandas to calculate the variance of returns on stocks and World Bank data, and print the results.

The results can be interpreted in various ways, but in general, the higher the variance, the more volatile the data.

What conclusions can be drawn from these data? Which data set was the most volatile? Did any data sets have similar variances?

sap series: sap: 8.860342194008153e-05 which is equivalent to 0.00008860342194008153

In [34]:

print('nasdaq_returns:', nasdaq_returns.var())>> nasdaq_returns: 0.0003178379833057229

In [35]:

print('sap_returns:', sap_returns.var())>> sap_returns: 8.375653262337878e-05

In [36]:

print('gdp_returns:', gdp_returns.var())>> gdp_returns: 0.0003408930734314469

In [37]:

print('export_returns:', export_returns.var())>> export_returns: 0.006197236757109849

The S&P 500, a set of 500 large companies listed on U.S. stock exchanges, has the smallest variation, and therefore is the least volatile. Because the S&P 500 index is a weighted measure of many stocks in a variety of industries, it is considered a safer and more diversified investment.

Stocks are interesting. The NASDAQ 100 is more volatile than the S&P 500, which, when you think about it, makes sense since the S&P 500 is much more diversified and follows the market more closely.

So finally we have GDP and exports.

Exports are very volatile, which could have to do with industries that have moved abroad in the last 20 years, and global competition for the production of goods.

GDP is actually quite similar to the NASDAQ 100 in terms of volatility, which is perhaps an interesting correlation.

Note: we are building a private community in Slack of data scientist, if you want to join us you can register here: https://www.datasource.ai/en#slack

I hope you enjoyed this reading! you can follow me on twitter or linkedin

Thanks for reading!

Other posts written by me in Medium

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…

By Analytics Vidhya

Latest news from Analytics Vidhya on our Hackathons and some of our best articles! Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

Daniel Morales

Written by

Data Scientist. ML Engineer. Co-founder at DataSource.ai, Linkedin https://www.linkedin.com/in/danielmorales1/, Twitter https://twitter.com/daniel_moralesp

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Daniel Morales

Written by

Data Scientist. ML Engineer. Co-founder at DataSource.ai, Linkedin https://www.linkedin.com/in/danielmorales1/, Twitter https://twitter.com/daniel_moralesp

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store