Build Stock Price Database With R Website Crawling & MYSQL

Lyn Chen
Data Room
Published in
2 min readAug 29, 2020

Why build my own stock price database?

  1. Limitations of the environment

Compared with American stock price research, there are no R packages to get Taiwanese stock price data. If you want to do research without many open data sources, this article might provide you an idea.

On the other hand, we query the stock price data from the TWSE website (Taiwan Stock Exchange Corporation) which treats website crawling strictly. Collecting historical data on local is necessary because few requests per minute and per day are allowed.

2. The historical stock price for backtesting

Backtesting is the most important thing for quantitative financial analysis.

Backtesting requires lots of historical data to test your model performances and make further predictions. The database is the best choice to store and sort the huge amount of data.

3. Combine with other data to do deep analysis

Most investors analyze stock prices in three perspectives, technical, fundamental, and chip analysis respectively.

Multiple data sources can optimize the model performances. Doing fundamental analysis requires monthly revenue, but chip analysis demands the institutional investors’ trading records. The database can store multiple data sources at the same place.

The structure of my database

The current database contains six tables.

  1. Stocks_Categories: Including stock Id and their industries categories.
  2. History_Prices: Data from Yahoo Finance API. Including open, high, low, close (OHLC), adjusted close prices, and volume.
  3. Monthly_Revenue: Including the monthly revenue of each stock.
  4. History_Prices_TWSE: The OHLC, last offered prices(which didn’t make a deal), and the amounts of deals.
  5. Institutional_Investor_Trading_TWSE: The institutional investors’ trading records of each stock
  6. History_Index_TWSE: Including Taiwanese industrial indexes, and indexes of top tier Taiwanese companies’ performances which share similar concepts with S&P 500.

There are two stock prices from Yahoo Finance API and TWSE site respectively. Although there is huge homogeneity between two data sources, they are complementary to each other. It’ difficult to get the whole historical prices from the TWSE site, due to the restricted limit of requests. On the other hand, yahoo finance lacks the price data of some stocks, but the TWSE site has all stocks’ today prices.

There are rooms for improvements for this database. I kept almost all the columns from the original sources, but some calculated metrics are redundant in the database. Besides, adding more different data about stocks is important to consummate models.

Using R crawls the data & writes into the database

--

--