【Data Analysis(2)】「Garbage in, garbage out.」

- Explaining the importance of data quality — Free data api versus TEJ API

People who do quantitative investing might be familiar with open data source/API such as Yahoo Finance, Google Finance, etc. which we usually get data from their websites for backtesting. However, free data sources we mentioned above are not database companies specialized in Taiwan. Therefore, comparing to the market in the US, they do worse on maintaining the database of the stock market in Taiwan. Moreover, if there is a flaw in the data for backtesting, it will strongly affect the performance of the models or strategies. Under this condition, we will emphasize the importance of quality of data, but how can you know whether your data is qualified enough? So, in this episode, we will share our methods for checking the quality of data with you.

✨ Highlights of this article ✨

  • 🌟 Adjusted stock price & non-adjusted stock price Intro/Application
  • 🌟 Free data v.s. TEJ API - The difference in the quality of data

🚪Links related to this article🚪

Adjusted v.s non-adjusted stock price

Using 0050 as an example:

Assume that 0050 was priced at 30 dollars 10 years before, and the current price of 0050 is 120 dollars. Therefore, we can say that the return on it is 300%. However, it doesn’t consider cash dividends, so the real return on 0050 is not precise. Using non-adjusted stock prices to calculate the return ignores the distribution of cash dividends, but people who deposit their money into the stock do care about dividend yields. As a result, the method that we calculate the return by non-adjusted stock price is not accurate.

However, the adjusted stock price is the price that has taken cash dividends/ stock dividends/cash capital increase into account. Assume that the adjusted price of 0050 was 10 dollars 10 years before, and the current adjusted price of 0050 is 100 dollars, which means increasing 900% in total. In this case, we can find the return calculate by the adjusted stock price is much more accurate!

The individual stock has to consider much more factors than ETFs, and we will provide a comprehensive introduction in the future. If you want to learn more please follow us and keep focusing on our articles.

Summary

  • Non-Adjusted Stock Price ➡️ Market Price
  • Adjusted Stock Price ➡️ The price which is converted from cash dividends/ stock dividends/Cash Capital Increase.

Using 0050 as an example to see the difference of the return from adjusted & non-adjusted stock prices.

1. Import packages

If you do not have an api_key yet, you can 🆓register an account on TEJ for a free trial🆓, but the trial account has a limit on getting data❗️ The database used in this episode is from 斜槓方案的證券交易資料表 (TWN/EWPRCD). If you want to use data without limit, check it on TEJ E Shop🎁.

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr
import tejapi
tejapi.ApiConfig.api_key = 'your key'

2. Extract adjusted & non-adjusted stock prices from TEJ.

stk_num = '0050'stk_data = tejapi.get('TWN/EWPRCD',coid = stk_num,paginate = True
,opts= {'columns':['coid','mdate','open_d','high_d','low_d','close_d']})
# non-adjusted
stk_adj = tejapi.get('TWN/EWPRCD',coid = stk_num,paginate = True
,opts= {'columns':['coid','mdate','open_adj','high_adj','low_adj','close_adj']})
# adjusted
non-adjusted (left)/ adjusted(right)

3. Data visualization

plt.style.use('seaborn')
plt.figure(figsize=(10,5))
plt.title('0050',fontsize = 20)
l1 = plt.plot(stk_data.mdate,stk_data.close_d,label = 'market_price')
l2 = plt.plot(stk_adj.mdate,stk_adj.close_adj,label = 'adjusted_price')
params = {'legend.fontsize': 14,
'legend.handlelength': 3,
'axes.labelsize': 25}
plt.rcParams.update(params)
plt.legend(loc="upper left")
plt.xlabel('Date', fontsize=20)
plt.ylabel('Price', fontsize=20)
plt.show()

4. View the return

The Return Formula:
The close price of the start date is our buy price(buy_price), and the close price of the most recent trading date is our sell price(sell_price).

Return = (sell_price-buy_price)/buy_price

# Difference of return
buy_price_mp, buy_price_adj = stk_data.close_d.values[0], stk_adj.close_adj.values[0]
sell_price_mp, sell_price_adj = stk_data.close_d.values[-1], stk_adj.close_adj.values[-1]return_mp, return_adj = ((sell_price_mp-buy_price_mp)/buy_price_mp), ((sell_price_adj-buy_price_adj)/buy_price_adj)print("未調整股價報酬率:",return_mp,"調整股價報酬率:", return_adj)

We can tell from the result, the difference between adjusted & non-adjusted stock return is more than 40%!!

Free data API vs TEJ API

Besides TEJ API, free data API also provides adjusted stock prices (we use Yahoo Finance API in this episode), so we would like to make a comparison between both.

1. Import packages

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr
import tejapi
tejapi.ApiConfig.api_key = 'your key'

2. Grab data from both API

stk_num = '0050'

stk_adj = tejapi.get('TWN/EWPRCD',coid = stk_num,paginate = True
,opts= {'columns':['coid','mdate','open_adj','high_adj','low_adj','close_adj']})
# adjusted
yf_data = pdr.DataReader(stk_num+'.tw','yahoo') # Yahoo Finance
TEJ adjusted price(left)/YF adjusted price(right)

3. Data visualization

plt.style.use('seaborn')
plt.figure(figsize=(10,5))
plt.title('0050',fontsize = 20)
l1 = plt.plot(yf_data.index,yf_data['Adj Close'],label = 'yf_adjusted')
l2 = plt.plot(stk_adj.mdate,stk_adj.close_adj,label = 'tej_adjusted')
params = {'legend.fontsize': 14,
'legend.handlelength': 3,
'axes.labelsize': 25}
plt.rcParams.update(params)
plt.legend(loc="upper left")
plt.xlabel('Date', fontsize=20)
plt.ylabel('Price', fontsize=20)
plt.show()

Because there(斜槓方案) is a 5-year limit on our data, the difference in the figure above is not clear. We further try to get data from the TEJ advanced database, now we can find:

❗️❗️the start date of the free data is around 2016, but TEJ already has the data before 2004. ❗️❗️

If you require data for a longer period, please contact us.

TEJ database

According to the public information on Taiwan Securities Exchange’s website, the listing date of 0050 is 2003/6/30, so we can know the quality of data is important.

0050 listing date

4. View the return

The Return Formula:
The close price of the start date is our buy price(buy_price), the close price of the most recent date is our sell price(sell_price).

Return = (sell_price-buy_price)/buy_price

Quality of data

  • Start date of the data
  • Difference on the Adjusted Return

* Checking the data period of the component stocks of 0050

- start_date_tej :The start date of the TEJ data api.
- end_date_tej :The most recent trading date of the TEJ data api.

- start_date_yf :The start date of the free data api.
- end_date_yf :The most recent trading date of the free data api.

According to the public information on Taiwan Securities Exchange’s website, the listing date of 台灣高鐵 & 永豐金 is 2016/10/27, 2002/05/09 respectively which is consistent with “start_date_tej”, so we can say that TEJ Data API has a better quality. Choosing the qualified database won’t let your model to buy the unlisted stock in order to provide an accurate result while backtesting.

2633(left) / 2890(right)

* Checking the difference on the adjusted return of component stocks of 0050

- ret_yf :The return calculated by free data api.
- ret_tej:The return calculated by TEJ data api.

Formulas:

- ret_yf = (the adjusted close of end_date_yf —the adjusted close of start_date_yf )/the adjusted close of start_date_yf

- ret_tej = (the adjusted close of end_date_tej — the adjusted close of start_date_tej)/the adjusted close of start_date_tej

difference = abs【 (ret_yf — ret_tej)/ret_tej 】

Take the top 10 rankings as an example

* Conclusion

What we share with you this time is the difference between adjusted & non-adjusted stock prices, and we will provide a more comprehensive introduction in the future. Second, we compare the data quality between the TEJ API and the free data API. From the aspect of the data period and the adjusted return, free data is indeed unstable(the free one is the most expensive one😂). If the data quality is unstable may generate additional variables and increase the risk of the models/ strategies, so we recommend you use a stable data source while backtesting.

Besides, if you have any question or interested in any topic, please leave the message below ❗️❗️ Then, we will go further into financial data analysis and applications in the next article, please look forward to it ❗️❗️

Finally, if you like this topic, please click 👏 below, giving us more support and encouragement. Additionally, if you have any questions or suggestions, please leave a message or email us, we will try our best to reply to you.👍👍

Links related to this article again!💪

If you have any question or difficulty, do not hesitate to contact us: Contact Information

Source Code🚪:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
TEJ 台灣經濟新報

TEJ 台灣經濟新報

TEJ 為台灣本土第一大財金資訊公司,成立於 1990 年,專門提供金融市場基本分析所需資訊,以及信用風險、法遵科技、資產評價、量化分析及 ESG 等解決方案及顧問服務。鑒於財務金融領域日趨多元與複雜,TEJ 結合實務與學術界的精英人才,致力於開發機器學習ML、人工智慧AI及自然語言處理NLP等新技術,持續提供創新服務