【TEJ Finance Research Institute】Application of TEJ Investment Database in Quantitative Investment Analysis

TEJ 台灣經濟新報
TEJ Finance Research Institute
Dec 16, 2022

Build a multi-factor model for forecasting and stock selection with the TEJ investment database.

Preface

After Markowitz published Portfolio Theory, Sharpe built on it to propose the Capital Asset Pricing Model (CAPM), and Ross further developed the Arbitrage Pricing Theory (APT), scholars gradually found that the characteristics of stocks have some explanatory power for their expected returns. These findings laid the groundwork for quantitative investment analysis. With the rapid development of computers and algorithms, applying machine learning and artificial intelligence to data mining has also achieved good results, making quantitative investment analysis an important part of the financial field. At the same time, the investment market's demand for data has grown, because quantitative investment analysis depends on the support of a large amount of data.

The Taiwan stock market produces a large amount of trading information every day, such as prices, volumes, and margin and securities lending transactions, and companies also announce important information such as revenue, earnings, and dividend policy. Collecting and organizing this much information on a daily basis is difficult. Data quality is a further problem: although researchers can use web crawlers to gather data from the many websites that offer it for free, such data often contains missing or incorrect values, and cleaning and maintaining it can be costly. To solve these problems and meet quantitative analysts' data needs, a database with complete, high-quality data is necessary. The TEJ investment database was built for this purpose.

The TEJ investment database collects a large amount of Taiwan stock data, which researchers regularly clean and review to maintain its quality. It covers three types of databases: market data, financial accounting data, and corporate event data. The market data includes stock price, volume, and chip data (such as margin trading and institutional investors' trades). The financial accounting data includes company revenue and earnings. The corporate event data includes information on major decisions made by company management. The database is rich in content, has high coverage of Taiwan stock market information, and is point-in-time, which is necessary for quantitative analysis. The content and features are described in more detail later in this article.

Because of market inefficiency, irrational investor behavior, and other factors, stock prices often under-react or over-react to information, which creates investment opportunities. Fama and French (1993) showed that three factors can explain the expected return of stocks, making the multi-factor model one of the important stock selection models in quantitative investment. The rest of this article therefore discusses how to use the TEJ investment database to build a multi-factor model for forecasting and stock selection.

Keywords: Investment database, Point-in-time, Multi-factor model

Highlights

📍Three categories of TEJ investment database
📍Point-in-time data features
📍Introduction to multi-factor models
📍Portfolio construction, stock selection, and backtesting

Three categories of TEJ investment database

The main structure of the TEJ investment database is composed of three types of databases: market data, financial accounting data, and corporate action events. They contain different types of data, which are described below:

  1. Market transaction database:
    It covers stock prices and volumes, margin and securities lending transactions, and institutional investors' buying and selling. It also provides attribute data that can be used to determine a stock's listing status and industry, and to check whether on a given day the stock was under disposition measures, suspended from trading, or classified as a full-delivery stock. It further includes stocks that were listed and delisted in the past, as well as daily index constituents and ETF constituents. Using this data for quantitative analysis helps avoid survivorship bias.
  2. Financial accounting database:
    It includes monthly revenue, financial reports reviewed by accountants, and companies' self-reported profit and loss figures that have not been reviewed by accountants. Monthly revenue and the unreviewed self-reported figures are announced earlier, which helps investors adjust investment decisions promptly when a company's operations change. In addition, both the reviewed financial reports and the self-reported profit and loss are available as single-quarter, cumulative, and trailing four-quarter data, so analysts can use whichever they need without tedious data preparation.
  3. Corporate events database:
    It covers management personnel changes, insider shareholding declarations and transfers, mergers and acquisitions, capital formation (events that affect equity capital, such as capital increases and reductions and private placements), fixed asset changes, dividend and treasury stock policies, and other major company events. Each event type includes its announcement date and relevant details, which makes the data well suited to research on announcement effects, or to analysis in combination with other information.

Point-in-time data features

  1. Survivorship bias:
    When a listed company goes bankrupt, is delisted, or is merged or acquired, or when a futures contract expires, its price information may disappear from the historical database. It is also intuitive to pull historical data using only the pool of currently listed companies. Either way, the sample no longer represents the market as it was at the time, because investment targets that did exist are ignored, and strategy performance is overestimated or underestimated. TEJ provides complete listing information, so users can avoid survivorship bias when developing strategies.
  2. Look-ahead bias:
    Look-ahead bias arises when an experiment uses data that only became available later, instead of the data that could actually have been collected at the time, which distorts the results. For example, if the financial statements for the same period last year were later restated or revised, the restated figures are future data from the point of view of the original date. If they are used as stock selection criteria, the strategy will not accurately reflect real trading conditions.
  3. Foresight bias:
    If researchers do not pay attention to when a financial report is announced, they may mistakenly treat the report's period-end date as the date the information became available. For example, an annual financial statement has a period-end date of December 31, but it is announced by the end of March of the following year. Treating the data as known on December 31 introduces foresight bias, because the statistical analysis then uses information that was not yet public. In addition to the financial report date, the TEJ database provides the announcement date. Using announcement dates when studying stock price reactions prevents this misjudgment.
  4. Adjustment of historical investment data:
    In stock price analysis, computed returns are strongly affected by the dates on which a company pays dividends or increases or reduces its capital. To keep ex-dividend and ex-rights price gaps from appearing as abnormal fluctuations, prices must be put on the same basis as past prices before they are compared. For this reason, TEJ's adjusted stock prices should be used for backtesting.
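
The effect of price adjustment can be illustrated with a minimal sketch. It assumes one common convention, proportional back-adjustment using the theoretical ex-dividend reference price; TEJ's own adjustment method is not described in this article and may differ. The function name `back_adjust` and the sample prices are hypothetical.

```python
def back_adjust(closes, events):
    """Back-adjust a price series so returns across ex-dates are comparable.

    closes: list of raw closing prices, oldest first.
    events: dict mapping the index of an ex-date to (cash_dividend, stock_ratio),
            where stock_ratio is new shares issued per existing share.
    """
    factor = 1.0
    adjusted = [0.0] * len(closes)
    # Walk backwards: every price before an ex-date is scaled by that event's factor.
    for t in range(len(closes) - 1, -1, -1):
        adjusted[t] = closes[t] * factor
        if t in events and t > 0:
            cash, stock_ratio = events[t]
            prev_close = closes[t - 1]
            # Theoretical reference price implied by the previous close.
            reference = (prev_close - cash) / (1.0 + stock_ratio)
            factor *= reference / prev_close
    return adjusted


raw = [100.0, 100.0, 95.0, 96.9]
events = {2: (5.0, 0.0)}            # a 5.0 cash dividend goes ex on day index 2
adj = back_adjust(raw, events)

raw_ret = raw[2] / raw[1] - 1       # raw prices show a drop that is only the dividend
adj_ret = adj[2] / adj[1] - 1       # adjusted prices show no loss on the ex-date
```

On raw prices the ex-dividend day looks like a 5% loss, while on the adjusted series the return is zero, which is what a shareholder who received the dividend actually experienced.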

TEJ handles all four of the point-in-time (PIT) issues above before the data is delivered through the TEJ API. Researchers can therefore use the cleaned data directly, which greatly reduces data processing time before analysis.
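
The announcement-date rule described under foresight bias can be sketched as follows. This is an illustration only: the record layout, the ROE values, and the helper `latest_known` are hypothetical and do not represent the TEJ API.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical records for one company: (period_end, announce_date, ROE in %).
reports = [
    (date(2020, 12, 31), date(2021, 3, 25), 12.0),
    (date(2021, 3, 31), date(2021, 5, 14), 12.8),
    (date(2021, 6, 30), date(2021, 8, 13), 13.5),
]


def latest_known(reports, as_of, key_index):
    """Return the most recent report whose date in column key_index is <= as_of."""
    ordered = sorted(reports, key=lambda r: r[key_index])
    keys = [r[key_index] for r in ordered]
    pos = bisect_right(keys, as_of)
    return ordered[pos - 1] if pos > 0 else None


as_of = date(2021, 1, 31)
# Keyed on period end: the FY2020 report is treated as known two months too early.
by_period_end = latest_known(reports, as_of, key_index=0)
# Keyed on announcement date: nothing has been published yet on this date.
by_announce = latest_known(reports, as_of, key_index=1)
```

Keying the lookup on the announcement date is what keeps a backtest from trading on figures that the market had not yet seen.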

Introduction to multi-factor models

Fama and French (1993) showed empirically that three factors, namely the market, market capitalization, and the book-to-market ratio (the inverse of the price-to-book ratio), can explain the expected return of stocks. Subsequent academic research has found that many other stock characteristics can also be used to predict stock price changes. Financial and accounting characteristics perform well and are stable, such as the growth factors (revenue growth rate and earnings growth rate) and the quality factors (gross profit margin, ROA, and ROE). In addition, momentum factors, such as the stock return over the past 6 and 12 months, have been shown to be effective in both domestic and international literature.

The multi-factor model is developed from the Arbitrage Pricing Theory proposed by Ross (1976) and uses multiple linear regression (referred to below as Linear regression) to build a predictive model. However, the model's features may be collinear, and adding more features may cause the model to overfit, so models with regularization penalty terms have also been proposed, such as Ridge regression, Lasso regression, and ElasticNet regression.

This section uses the four regressions above to build multi-factor models. Each method is described below.

(1) Linear regression:
The model is fitted by least squares, which minimizes the squared error between the actual stock returns and the returns predicted from the feature values, and thereby estimates the model's regression coefficients. The objective function is shown in formula 2.1.

(2) Ridge regression:
Stock characteristics are sometimes collinear, which inflates the variance of the estimated regression coefficients and reduces prediction accuracy. Ridge regression therefore adds an L2 regularization penalty term, $\lambda \sum_{j} \beta_j^2$ (the sum of the squared regression coefficients $\beta_j$ scaled by a penalty strength $\lambda \ge 0$), to the objective function (formula 2.2), which reduces the impact of highly correlated features.

(3) Lasso regression:
Adding features can improve the model's explanatory power, but it also increases complexity, which can easily lead to overfitting. Although Ridge regression adds the L2 penalty term, it still keeps every feature in the model and cannot reduce complexity. Lasso regression changes the penalty term in the objective function from L2 to L1, $\lambda \sum_{j} |\beta_j|$ (the sum of the absolute values of the regression coefficients $\beta_j$ scaled by a penalty strength $\lambda \ge 0$), as in formula 2.3. This can shrink the coefficients of unimportant features to exactly 0, so Lasso regression also performs feature selection.

(4) ElasticNet regression:
ElasticNet combines the characteristics of Ridge and Lasso regression: it adds both the L1 and the L2 penalty terms to the objective function and uses a control ratio to set their relative weights. The objective function is shown in formula 2.4.
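
In commonly used notation, the four objective functions referred to above can be sketched as follows. The symbols are assumptions introduced here for illustration: $r_i$ is the next-month return of observation $i$, $x_{ij}$ is the value of feature $j$ for that observation, $\beta_0$ and $\beta_j$ are the intercept and regression coefficients, $\lambda \ge 0$ is the penalty strength, and $\rho \in [0, 1]$ is the control ratio between L1 and L2. Scaling conventions for the penalty terms differ between textbooks and software packages.

$$
\begin{aligned}
\text{(2.1) Linear:}\quad & \min_{\beta}\ \sum_{i=1}^{N}\Big(r_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \\
\text{(2.2) Ridge:}\quad & \min_{\beta}\ \sum_{i=1}^{N}\Big(r_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2 \\
\text{(2.3) Lasso:}\quad & \min_{\beta}\ \sum_{i=1}^{N}\Big(r_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j| \\
\text{(2.4) ElasticNet:}\quad & \min_{\beta}\ \sum_{i=1}^{N}\Big(r_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\Big[\rho\sum_{j=1}^{p}|\beta_j|+(1-\rho)\sum_{j=1}^{p}\beta_j^2\Big]
\end{aligned}
$$

Setting $\lambda = 0$ in any of the penalized forms recovers (2.1), while $\rho = 0$ and $\rho = 1$ in (2.4) recover (2.2) and (2.3) respectively.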

The above is a brief description of four commonly used methods for constructing multi-factor models. The steps for building investment portfolios are explained next.
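
As an illustration of how the four methods relate, the sketch below fits all of them with one coordinate-descent routine on synthetic data. It is not the article's implementation: the function `fit_elastic_net`, its parameter names (`alpha` for the penalty strength, `l1_ratio` for the control ratio), the penalty scaling, and the data are assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def fit_elastic_net(X, y, alpha=0.0, l1_ratio=0.0, n_iter=500):
    """Coordinate descent for
        (1 / 2n) * ||y - b0 - X b||^2
        + alpha * (l1_ratio * ||b||_1 + 0.5 * (1 - l1_ratio) * ||b||_2^2).
    alpha=0 gives least squares, l1_ratio=0 gives ridge, l1_ratio=1 gives lasso.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centring absorbs the intercept
    beta = np.zeros(p)
    resid = yc.copy()                          # residual when beta is all zeros
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += Xc[:, j] * beta[j]        # remove feature j's contribution
            z = Xc[:, j] @ resid / n
            beta[j] = soft_threshold(z, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio))
            resid -= Xc[:, j] * beta[j]        # put the updated contribution back
    intercept = y_mean - x_mean @ beta
    return intercept, beta


# Synthetic data: two correlated features, and two features with no true effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=200)
y = 0.2 + X @ np.array([1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

models = {
    "Linear": fit_elastic_net(X, y, alpha=0.0),
    "Ridge": fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.0),
    "Lasso": fit_elastic_net(X, y, alpha=0.1, l1_ratio=1.0),
    "ElasticNet": fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5),
}
for name, (b0, b) in models.items():
    print(name, round(float(b0), 3), np.round(b, 3))
```

In practice a library such as scikit-learn, which provides `LinearRegression`, `Ridge`, `Lasso`, and `ElasticNet` estimators, would normally be used instead of a hand-written solver; the hand-written version is shown only to make the role of each penalty visible.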

Portfolio construction, stock selection, and backtesting

With growth, quality, and momentum factors selected as the model's variables, and the multi-factor modeling methods introduced, we can now construct the models, produce forecasts, select stocks, and build investment portfolios. The overall process consists of three steps, described in detail below.

Data processing: Convert features of different frequencies into month-end values at monthly frequency. For example, quarterly financial report data is expanded to monthly frequency, and the daily series of cumulative stock returns over the past 6 or 12 months is sampled down to monthly frequency. Once the frequencies are unified, winsorize each feature at its upper and lower 1% and then standardize it.
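
The winsorizing and standardizing step can be sketched as below for one month's cross-section of a single feature. The function name, the percentile interpolation, and the synthetic ROE values are assumptions for illustration; frequency alignment is not shown.

```python
import numpy as np


def winsorize_and_standardize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip one month's cross-section of a feature at the given percentiles,
    then rescale it to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanpercentile(values, [lower_pct, upper_pct])
    clipped = np.clip(values, lo, hi)          # extreme values are pulled in to lo / hi
    return (clipped - np.nanmean(clipped)) / np.nanstd(clipped)


rng = np.random.default_rng(1)
roe = rng.normal(10.0, 5.0, size=150)          # synthetic ROE for 150 stocks
roe[0] = 500.0                                 # one extreme outlier
z = winsorize_and_standardize(roe)
```

Clipping before standardizing keeps a single extreme value from dominating the mean and standard deviation used to scale every other stock.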

Model building: Regress next-month stock returns on the feature values of the past n months, using each of the four methods from the previous section to estimate its regression coefficients and build a model. The literature generally recommends 2 to 5 years of data for n; this article uses 30 months as a compromise.
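
One way to read this step is as a pooled regression over a 30-month window, sketched below with plain least squares. The array layout, the function `fit_window`, and the noise-free synthetic data are assumptions for illustration; the article does not specify how observations are pooled.

```python
import numpy as np


def fit_window(features, returns, t, n=30):
    """Pooled least-squares fit using the n months ending at month t.

    features: array (months, stocks, factors) of month-end feature values.
    returns:  array (months, stocks); returns[m] is the return earned in month m.
    Each training row pairs features at month m with the return of month m + 1,
    for m = t - n, ..., t - 1, so nothing after month t is used.
    """
    X = features[t - n:t].reshape(-1, features.shape[2])
    y = returns[t - n + 1:t + 1].reshape(-1)
    X1 = np.column_stack([np.ones(len(X)), X])       # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef                                       # [intercept, beta_1, ..., beta_k]


rng = np.random.default_rng(2)
months, stocks, factors = 40, 150, 3
features = rng.normal(size=(months, stocks, factors))
true_beta = np.array([0.02, -0.01, 0.005])
returns = np.zeros((months, stocks))
returns[1:] = 0.01 + features[:-1] @ true_beta        # next-month return, noise-free

coef = fit_window(features, returns, t=35, n=30)
predicted = coef[0] + features[35] @ coef[1:]         # forecast for month 36
```

Moving `t` forward by one month and refitting gives the moving-window procedure; a penalized regression can be substituted for the least-squares call without changing how the window is assembled.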

Model stock selection: At the end of each month, feed the latest feature values into the fitted model to predict each stock's expected return for the next month. Then rank the stocks by expected return from largest to smallest, select the top 20%, and allocate to them with equal weights to build the portfolio. Repeat these three steps at the end of each month using a moving window.
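
The ranking and equal-weight allocation can be sketched as follows. The tickers and predicted returns are hypothetical, and rounding the portfolio size down is an assumption, since the article does not say how a fractional count is handled.

```python
def top_fraction_weights(predicted, fraction=0.2):
    """Rank stocks by predicted return (largest first) and equal-weight the top fraction."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))    # number of stocks held, at least one
    return {ticker: 1.0 / k for ticker in ranked[:k]}


# Hypothetical next-month predicted returns for ten stocks.
predicted = {"A": 0.031, "B": -0.004, "C": 0.012, "D": 0.027, "E": 0.001,
             "F": 0.019, "G": -0.010, "H": 0.008, "I": 0.022, "J": 0.015}
weights = top_fraction_weights(predicted)      # holds A and D at 50% each
```

With the article's universe of 150 stocks, the same rule would hold 30 stocks each month.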

After the portfolios of the four multi-factor models are built, a simple equal-weight portfolio (Equal) is added for performance comparison. It is constructed by summing the features with equal weights at the end of each month, ranking stocks on that sum from largest to smallest, selecting the top 20%, and allocating to them with equal weights. The next section compares the performance of the portfolios built with the four multi-factor methods, the simple equal-weight portfolio, and the broader market.

Performance analysis

This article uses as its sample the top 150 Taiwan-listed companies by market value, from January 1, 2015, to March 30, 2022. The model variables are the revenue and earnings growth rates (growth factors), the gross profit margin, ROA, and ROE (quality factors), and the stock returns over the past 6 and 12 months (momentum factors). These are used to build multi-factor models that forecast stock returns, producing five equal-weighted portfolios, Lasso, Ridge, Linear, Elastic, and Equal, whose performance is then analyzed.

Table 3.1, Figure 3.1, and Figure 3.2 show the performance of the five portfolios from July 2017 to March 2022. All five outperform the weighted index overall. The cumulative return is 155.9% for the Lasso portfolio and 114% for the Equal portfolio. The standard deviation of the five portfolios is about 24%, which indicates higher volatility than the weighted index. However, measured by the Sharpe ratio, which is the return per unit of risk, all five portfolios are higher than the weighted index: the Lasso portfolio performs best at 1.58, the Ridge, Linear, and Elastic portfolios differ little from one another, and Equal is the lowest of the five at 1.17. In terms of risk, the maximum drawdown of each portfolio built by the four multi-factor models is below 16.5%, which is lower than that of Equal and the weighted index.
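
The metrics discussed above can be computed from a series of monthly portfolio returns roughly as follows. This is a sketch under stated assumptions: the article does not say how it annualizes returns and volatility or which risk-free rate it uses, so the choices here (geometric annualization, volatility scaled by the square root of 12, a zero risk-free rate by default) and the sample returns are illustrative.

```python
import math


def performance(monthly_returns, risk_free_annual=0.0):
    """Cumulative return, annualized volatility, Sharpe ratio and maximum drawdown
    from a list of monthly simple returns."""
    n = len(monthly_returns)
    wealth, peak, max_dd = 1.0, 1.0, 0.0
    for r in monthly_returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        max_dd = max(max_dd, 1.0 - wealth / peak)   # largest fall from a prior peak
    mean = sum(monthly_returns) / n
    var = sum((r - mean) ** 2 for r in monthly_returns) / (n - 1)
    ann_return = wealth ** (12.0 / n) - 1.0
    ann_vol = math.sqrt(var) * math.sqrt(12.0)
    sharpe = (ann_return - risk_free_annual) / ann_vol if ann_vol > 0 else float("nan")
    return {"cumulative": wealth - 1.0, "annual_return": ann_return,
            "annual_vol": ann_vol, "sharpe": sharpe, "max_drawdown": max_dd}


stats = performance([0.10, -0.20, 0.05, 0.25])      # hypothetical monthly returns
```

For this short series the wealth path is 1.10, 0.88, 0.924, 1.155, so the cumulative return is 15.5% and the maximum drawdown is 20%, reached after the second month.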

▲ Table 3.1 Portfolio performance analysis table (data period: 2017/7–2022/3)
▲ Figure 3.1 Portfolio Annual Return
▲ Figure 3.2 Cumulative Portfolio Return

Conclusion

This article explains the content and characteristics of the TEJ investment database and uses the financial accounting database to establish a multi-factor model for forecasting and stock selection.

The article can be summarized in two points:

  1. The database records the announcement date of each piece of information, in keeping with the point-in-time principle. Using data aligned to the announcement date avoids foresight bias during analysis.
  2. Portfolios formed by stock selection with the four multi-factor models built on financial characteristics outperform both the market and the simple equal-weight portfolio. Among them, the Lasso portfolio performs best, which suggests that its ability to filter features is valuable with high-dimensional data.

For more complete information, please subscribe to E-Shop to read the full content.
Source: TEJ database

If you have questions about this article, or you want to get further access to the TEJ Database, feel free to leave a comment or email us.

✉️ tej@tej.com.tw
☎️ 02–87681088

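TEJ is Taiwan's largest local financial information company. Founded in 1990, it provides the information needed for fundamental analysis of financial markets, as well as solutions and consulting services in credit risk, regulatory technology, asset valuation, quantitative analysis, and ESG. As finance grows more diverse and complex, TEJ brings together top talent from industry and academia and is committed to developing new technologies such as machine learning, artificial intelligence (AI), and natural language processing (NLP) to keep delivering innovative services.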