# Using Python & Pandas Datareader to Analyze Financial Data

## Analyze Financial Data using Pandas Datareader

Finance and economics are becoming increasingly interesting to all kinds of people, regardless of their career or profession. We are all affected by economic data, we are increasingly interested in staying up to date, and we have a wealth of information at hand.

Every day, billions of bytes of financial data are sent over the Internet, whether it is the price of a share, an e-commerce transaction, or a country’s GDP. All this data, when properly organized and managed, can be used to build amazing and insightful software applications.

We will use Python to access public financial data, organize it and combine it to gain new insights into how money makes the world go round. We will focus mainly on two Python modules:

- **Pandas** — used to organize and format complex data in table structures called DataFrames.
- **Pandas-datareader** — used to access public financial data from the Internet and import it into Python as a DataFrame.

We will use these modules to import data from some of the largest financial organizations in the world, as well as data stored locally on our computers. By the end of the notebook, you should feel comfortable importing financial data, either from a public source or from a local file, into Python, organizing that data, and combining it with other data.

# Importing data via Datareader

Many financial institutions, stock exchanges and global banks make large amounts of the data they store publicly available. Most of this data is well organized, updated live and accessible through an application programming interface (API), which gives programming languages such as Python a way to download and import it.

The `pandas-datareader` module is specifically designed to interface with some of the world's most popular financial data APIs and import their data into an easily digestible pandas DataFrame. Each financial API is accessed through a different function exposed by `pandas-datareader`. Generally, accessing each API requires a different set of arguments and information to be provided by the programmer.

We will import data from several of these APIs and play with it. For a complete list of all the data that `pandas-datareader` can access, you can consult the official documentation.

If you want to download the notebook, link here.

Let’s make a small program that uses `pandas-datareader` to get GDP data from the World Bank API. First we will install the library as follows (please run the command on your own in the next cell, I already have it installed!):

```python
!pip3 install pandas_datareader

from pandas_datareader import wb
from datetime import datetime
```

Here we have imported the World Bank data module as `wb`, and the Python `datetime` library so we can set the start and end dates for the analysis. For more information about Python datetimes, see this other notebook!

```python
start = datetime(2005, 1, 1)
end = datetime(2008, 1, 1)
indicator_id = 'NY.GDP.PCAP.KD'

gdp_per_capita = wb.download(indicator=indicator_id, start=start, end=end,
                             country=['US', 'CA', 'MX'])
gdp_per_capita
```

The `download` method has different parameters we can pass in order to get the data from the World Bank, among them the `indicator_id`. You can find more information about the parameters here.

As we can see, with this command we have obtained the GDP per capita data for 3 countries over a given date range.
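`wb.download` returns a DataFrame indexed by a (country, year) MultiIndex. As a minimal offline sketch, with invented numbers standing in for the real World Bank figures, such a frame can be pivoted into one column per country with `unstack`:

```python
import pandas as pd

# A hand-built frame shaped like wb.download output (values are
# invented, not real World Bank figures): a (country, year)
# MultiIndex with one indicator column.
index = pd.MultiIndex.from_product(
    [['Canada', 'Mexico', 'United States'], ['2005', '2006']],
    names=['country', 'year'])
gdp_toy = pd.DataFrame(
    {'NY.GDP.PCAP.KD': [40000.0, 41000.0, 9000.0, 9200.0, 48000.0, 49000.0]},
    index=index)

# unstack() pivots the country level into columns: one column per
# country, one row per year.
by_country = gdp_toy['NY.GDP.PCAP.KD'].unstack(level='country')
print(by_country)
```

The same `unstack` call should work on the real frame returned above, which makes side-by-side country comparisons much easier.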

# Getting the NASDAQ Symbols

The NASDAQ Stock Exchange identifies each of its shares with a unique symbol:

- Apple — AAPL
- Google — GOOGL
- Tesla — TSLA

It also provides a useful API for accessing the symbols currently traded on it. Pandas-datareader provides several functions to import data from the NASDAQ API through its `nasdaq_trader` sub-module.

```python
from pandas_datareader.nasdaq_trader import get_nasdaq_symbols
```

To import the list of stock symbols, we use the `get_nasdaq_symbols` function from `nasdaq_trader`. It is done this way:

```python
symbols = get_nasdaq_symbols()
```

When called, it goes to the NASDAQ API and imports the list of symbols currently being traded. The advantage of using `pandas-datareader` is that all the logic for interacting with the NASDAQ API, or any other API, is encapsulated in easy-to-use sub-modules and functions like the ones above.

```python
symbols
```

We have a total of 9,338 different stocks listed on the NASDAQ, and the columns give us different economic and financial information about each stock. But what if we want to access a single symbol/stock?

```python
symbols.loc['IBM']
```

# Technical Analysis in Finance

Technical analysis in finance is the type of analysis performed by means of statistics and charts on stocks (or indices, in our case). Let’s see how to do something very simple with Plotly, a Python charting library. In this case we’ll access Microsoft’s daily quotes. Let’s do it!

```python
!pip3 install plotly
```

```python
import plotly.graph_objects as go
```

```python
import pandas_datareader.data as web

stock = 'MSFT'
start = datetime(2019, 1, 1)
df = web.DataReader(stock, data_source='yahoo', start=start)
```

```python
df
```

We have accessed the data for `MSFT`. We did this by importing `.data` from pandas-datareader and giving it the `web` alias. Under the hood we are using Yahoo Finance to import the data from an API, but `pandas-datareader` lets us do it in a very simple way. Now we are going to plot the result to do some technical analysis.

```python
graph = {
    'x': df.index,
    'open': df.Open,
    'close': df.Close,
    'high': df.High,
    'low': df.Low,
    'type': 'candlestick',
    'name': 'MSFT',
    'showlegend': True
}
```

```python
fig = go.Figure(
    data=[graph],
    layout_title="Microsoft Stock"
)
fig
```

We just did something very interesting: we charted MSFT’s stock with up-to-date data! Today is November 20, 2020, so the last data point in my chart is from that date. You can do the same: hover over the end of the chart and see the stock’s latest quote! You could also load the tickers from your own investment portfolio and run the code daily to do technical analysis on them!

# Data filtering by date

Many of the APIs that the `pandas-datareader`

connects to allow us to filter the data we get by date. Financial institutions tend to keep track of data that goes back several decades, and when we import that data, it is useful to be able to specify exactly when we want it to come from

An API that does just that is the Federal Reserve Bank of St. Louis (`FRED`), which we can access by first importing the `pandas_datareader.data` sub-module and then calling its `DataReader` function:

```python
import pandas_datareader.data as web

start = datetime(2019, 1, 1)
end = datetime(2019, 2, 1)
```

```python
sap_data = web.DataReader('SP500', 'fred', start, end)
```

```python
sap_data
```

The `DataReader` function takes 4 arguments:

- **'SP500'** — an identifier provided by the API that specifies the data we want to retrieve, in this case the S&P 500 data.
- **'fred'** — the name of the API we want to access.
- **start, end** — the range of dates we want the data for.

By changing the start and end dates, you can easily filter the data you receive.
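Date filtering also works after the import: DataFrames like these come back with a `DatetimeIndex`, which can be sliced with plain date strings. A small sketch with invented numbers (no API call needed):

```python
import pandas as pd
import numpy as np

# A toy daily series standing in for the FRED data (values invented).
dates = pd.date_range('2019-01-01', '2019-02-01', freq='D')
sap_toy = pd.DataFrame({'SP500': np.linspace(2500, 2700, len(dates))},
                       index=dates)

# Label-based slicing on a DatetimeIndex is inclusive on both ends.
mid_january = sap_toy.loc['2019-01-10':'2019-01-20']
print(len(mid_january))  # 11 rows, Jan 10 through Jan 20
```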

# Using the `shift()` operation

Once we’ve imported a DataFrame full of financial data, there are some pretty useful ways to manipulate it. In this exercise we will look at `shift()`, a DataFrame method that moves all the rows in a column up or down.

`shift()` can be called on a single column, or on the entire DataFrame, in which case all columns are shifted. You can also shift by more than one row, and in either direction.

```python
# shifts all rows down by 1
dataframe.shift(1)

# shifts all rows in the 'name' column up 5
dataframe['name'].shift(-5)

# shifts all rows in the 'name' column down 3
dataframe['name'].shift(3)
```
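To make the snippets above concrete, here is a tiny runnable example with invented prices showing how `shift(1)` lines each row up with its predecessor:

```python
import pandas as pd

# Invented prices, purely to illustrate shift().
df = pd.DataFrame({'price': [10.0, 12.0, 11.0, 15.0]})

# shift(1) moves every row down one slot, so each row now sits next
# to the previous row's price; the first entry has no predecessor
# and becomes NaN.
df['prev_price'] = df['price'].shift(1)
df['change'] = df['price'] - df['prev_price']
print(df)
```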

Shift is particularly useful when it comes to financial data. For example, it can be used to calculate the percentage growth between one row and the next, or to find the difference in stock prices over a series of days. Let’s see an example.

```python
start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)

gdp = web.DataReader('GDP', 'fred', start, end)
gdp.head()
```

We have imported the `GDP` series from `FRED`; now we will create a new column called `Growth` holding the difference between each period’s GDP and the previous one.

```python
gdp['Growth'] = gdp['GDP'] - gdp['GDP'].shift(1)
gdp.head()
```

We can now see the differences in this new column. An important clarification: the first row of the `Growth` column is now `NaN` because it is the first row of the dataset and has no previous row to compute the difference with.
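The same `shift()` trick gives percentage growth as well; here is a small sketch with invented GDP figures (pandas’ built-in `pct_change()` computes the same ratio):

```python
import pandas as pd

# Invented GDP figures, just to illustrate the formulas.
gdp_toy = pd.DataFrame({'GDP': [100.0, 102.0, 104.04, 103.0]})

# Absolute change: each period minus the previous one.
gdp_toy['Growth'] = gdp_toy['GDP'] - gdp_toy['GDP'].shift(1)
# Percentage growth: current divided by previous, minus 1.
gdp_toy['Pct'] = gdp_toy['GDP'] / gdp_toy['GDP'].shift(1) - 1
print(gdp_toy)
```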

# Calculating basic financial statistics

Two useful calculations that can be made with financial data are variance and covariance. To illustrate these concepts, let’s use the example of a DataFrame that measures stock and bond prices over time.

# Variance

Variance measures how far a set of numbers is from its average. In finance, it is used to determine the volatility of investments.

```python
dataframe['stocks'].var()  # 106427
dataframe['bonds'].var()   # 2272
```

In the variance calculations above, the stock variance is greater than the bond variance (`106427` vs `2272`). That's because stock prices are more dispersed than bond prices, indicating that stocks are a more volatile investment.
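A self-contained sketch of the same idea, with invented prices where the stock series swings far more widely than the bond series:

```python
import pandas as pd

# Invented prices: 'stocks' is widely dispersed, 'bonds' is not.
dataframe = pd.DataFrame({
    'stocks': [100.0, 350.0, 90.0, 610.0, 230.0],
    'bonds':  [100.0, 108.0, 95.0, 112.0, 104.0],
})

stock_var = dataframe['stocks'].var()
bond_var = dataframe['bonds'].var()
# The more dispersed series has the (much) larger variance.
print(stock_var, bond_var)
```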

# Covariance

Covariance, in a financial context, describes the relationship between the returns of two different investments over a period of time, and can be used to help balance a portfolio. Calling `cov()` on our stock and bond columns produces a matrix that defines the covariance values between each pair of columns in the DataFrame. Covariance is closely related to the idea of correlation in finance. In our example data, when stock prices go up, bonds go down; we can use the covariance function to see this numerically.

```python
dataframe.cov()
```

Each value in the DataFrame produced by the code above represents the covariance between two columns. The higher the number, the more the investments tend to go up and down at the same time. The upper-right and lower-left entries hold the covariance between stocks and bonds. Here we get a negative number, which indicates that stocks and bonds tend to move in different directions. To see this with an example, we are going to work through a more complex exercise. Let’s go!
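As a concrete sketch, here is an invented data set where stock prices fall whenever bond prices rise, so the off-diagonal covariance comes out negative:

```python
import pandas as pd

# Invented data: stocks and bonds deliberately move in opposite
# directions, so their covariance is negative.
dataframe = pd.DataFrame({
    'stocks': [100.0, 120.0, 90.0, 130.0, 80.0],
    'bonds':  [50.0, 44.0, 56.0, 40.0, 60.0],
})

cov_matrix = dataframe.cov()
# The diagonal holds each column's variance; the off-diagonal
# entries hold the stock/bond covariance (the matrix is symmetric).
print(cov_matrix)
```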

# Calculating the Logarithmic Return

The logarithmic return is one of the three methods for calculating return, and it assumes that returns are compounded continuously rather than over sub-periods. It is calculated by taking the natural logarithm of the final value divided by the initial value.

As an example, let’s say that the final value of an investment was $11 and the initial value was $10. The formula would be `ln(11/10)`, for a result of `9.53%`.
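We can check the $10 to $11 example with NumPy:

```python
import numpy as np

# Log return of an investment that goes from $10 to $11:
# ln(final / initial).
log_ret = np.log(11 / 10)
print(round(log_ret * 100, 2))  # 9.53 (percent)
```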

In mathematics and statistics, a distinction is often made between discrete and continuous data. The log return is the theoretical, continuous version. In the real world, however, most people think of returns divided into discrete periods.

- **Arithmetic return** — single-period, not compounded, discrete.
- **Geometric return** — multi-period, compounded, discrete.
- **Logarithmic return** — infinite periods, compounded, continuous.

So the logarithmic return is the non-discrete, continuous version: if the whole period were divided into an infinite number of sub-periods, what would that return be? As you can see, it’s theoretical.
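The three definitions can be compared on a toy series that rises 10% in each of two periods (invented prices, $10 → $11 → $12.10):

```python
import numpy as np

# Invented prices: 10% growth in each of two periods.
prices = [10.0, 11.0, 12.1]
n_periods = len(prices) - 1

# Arithmetic: total change over the initial value, not compounded.
arithmetic = (prices[-1] - prices[0]) / prices[0]
# Geometric: the constant per-period rate that compounds to the total.
geometric = (prices[-1] / prices[0]) ** (1 / n_periods) - 1
# Logarithmic: the continuously compounded per-period rate.
logarithmic = np.log(prices[-1] / prices[0]) / n_periods

print(arithmetic, geometric, logarithmic)
```

The arithmetic figure (0.21) ignores compounding, the geometric figure recovers the true 10% per period, and the log figure is slightly smaller because it assumes continuous compounding.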

Except for very short periods of time (less than a day), and for theoretical applications involving calculus and the precise measurement of curves, logarithmic returns are not commonly used.

Price data are useful, but in this case, since we want to compare each data set, it would be even better if instead of daily/annual prices, we had information on the returns of daily/annual prices.

As a first step, let’s define a function called `log_return`, which should accept one parameter, `prices`.

```python
def log_return(prices):
    pass
```

The equation for calculating the log return between two prices is `natural_log(current_price / previous_price)`. In our case we want to run this equation for every day/year of price data in our imported Series (a Series is a single column of a DataFrame).

The pandas `shift()` function can be used to divide each current price in the Series by its previous price:

```python
prices / prices.shift(1)
```

And we can use Numpy’s natural logarithm function to get the log return for each entry in the new Series.

```python
import numpy as np
np.log(series)
```

Let’s do it:

```python
import numpy as np

def log_return(prices):
    return np.log(prices / prices.shift(1))
```
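A quick offline check of the function with an invented three-price Series: each entry should be the log of the ratio to the previous price, and the first entry has nothing to compare against, so it comes back `NaN`:

```python
import numpy as np
import pandas as pd

def log_return(prices):
    # Log of each price divided by the previous price in the Series.
    return np.log(prices / prices.shift(1))

# Invented prices that grow 10% at each step.
prices = pd.Series([10.0, 11.0, 12.1])
returns = log_return(prices)
print(returns)
```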

Let’s use our new `log_return` function to calculate the logarithmic return of the NASDAQ, S&P 500, GDP and export series.

```python
start = datetime(1999, 1, 1)
end = datetime(2019, 1, 1)

nasdaq_data = web.DataReader("NASDAQ100", "fred", start, end)
sap_data = web.DataReader("SP500", "fred", start, end)
gdp_data = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=start, end=end)
export_data = wb.download(indicator='NE.EXP.GNFS.CN', country=['US'], start=start, end=end)
```

Now we pass each variable to the `log_return` function:

```python
nasdaq_returns = log_return(nasdaq_data['NASDAQ100'])
sap_returns = log_return(sap_data['SP500'])
gdp_returns = log_return(gdp_data['NY.GDP.MKTP.CD'])
export_returns = log_return(export_data['NE.EXP.GNFS.CN'])
```

# Comparison of return volatility

We are now ready to compare the volatility of each type of data. Remember that variance, in the context of financial data, tells us how volatile an investment is. We’ll use the `var()` function in Pandas to calculate the variance of the returns on the stock and World Bank data, and print the results.

The results can be interpreted in various ways, but in general, the higher the variance, the more volatile the data.

What conclusions can be drawn from these data? Which data set was the most volatile? Did any data sets have similar variances?


```python
print('nasdaq_returns:', nasdaq_returns.var())
# nasdaq_returns: 0.0003178379833057229

print('sap_returns:', sap_returns.var())
# sap_returns: 8.375653262337878e-05

print('gdp_returns:', gdp_returns.var())
# gdp_returns: 0.0003408930734314469

print('export_returns:', export_returns.var())
# export_returns: 0.006197236757109849
```

The S&P 500, a set of 500 large companies listed on U.S. stock exchanges, has the smallest variation, and therefore is the least volatile. Because the S&P 500 index is a weighted measure of many stocks in a variety of industries, it is considered a safer and more diversified investment.

Stocks are interesting. The NASDAQ 100 is more volatile than the S&P 500, which makes sense when you think about it, since the S&P 500 is much more diversified and tracks the market more closely.

So finally we have GDP and exports.

Exports are very volatile, which could have to do with industries that have moved abroad in the last 20 years, and global competition for the production of goods.

GDP is actually quite similar to the NASDAQ 100 in terms of volatility, which is perhaps an interesting correlation.

**Note**: we are building a private community in Slack of data scientist, if you want to join us you can register here: https://www.datasource.ai/en#slack

I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn.

Thanks for reading!
