[Signate] JPX Fundamentals Analysis Challenge

Bill Steimel · Published in Game of Data · 8 min read · Jun 19, 2021
Screenshot of competition page.

Competition Summary

The competition we participated in was the JPX Fundamentals Analysis Challenge on the Signate platform. In case you are not familiar, Signate is a very popular platform for data science competitions in Japan and is helpful for anyone wishing to develop their data science skills further on the tried and tested battlefield of real, live data. The goal of this competition was, given a specific market day, to estimate the highest and lowest prices of each stock over the following 20 business days. The datasets included financial statement data and stock price data for each company listed on the Tokyo Stock Exchange (TSE). In this article, we discuss our approach to this competition, along with some key tips for anyone who wants to shorten the time it takes to develop solutions in competitions. A link to our solution can be found here.

You can additionally find more information about the competition at the two links below (in Japanese).

An English overview of the competition, which made working with stock data even more appealing to us, can be found here:

https://www.jpx.co.jp/english/corporate/news/news-releases/0010/20210319-01.html

The Importance of Pipelines for Rapid Experimentation and Iteration

You can think of a pipeline as an approach for automating the typical machine learning workflow. What is the typical machine learning workflow? You may be familiar with some of its usual steps: data cleaning/preparation, feature engineering, model training, model deployment, and model serving. Dayal Chand Aichara and I always codify our work into pipelines, as this structure allows us to collaborate and iterate faster. You can see an example of the pipeline we used in this competition below; it is a simple shell script. Each python line is a step in the pipeline, and each step uses a config_id to determine which parameters to use. You should think of a data pipeline like a real pipeline, where data inputs/outputs are passed from one part of the pipeline to the next.

# Competition Pipeline
config_id=$1
echo $config_id
python ./competition_code/make_dataset.py --config_id ${config_id}
python ./competition_code/create_features.py --config_id ${config_id}
python ./competition_code/train_model.py --config_id ${config_id}
python ./competition_code/model_serving.py --config_id ${config_id}
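
For example, assuming the script above is saved as run_pipeline.sh (a file name we use here purely for illustration), an entire experiment can be launched with a single command such as sh run_pipeline.sh baseline_model, where baseline_model is the config_id described next.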

You may be wondering what the config_id in the above shell script is. We often create a config.yml file to store the parameters of our pipelines so that we can toggle functions of the pipeline on and off, for example cross-validation. The configuration below was used for our final result in the competition, and as you can see, it holds many different parameters which are consumed by our pipeline. This particular config_id is called baseline_model, but a developer can define any number of unique config_ids.

## Insert Params as you see fit in this file
baseline_model:
  test_model: private
  random_seed: 255
  low_memory_mode: False
  use_fin_data: False
  data_date_limit: "2021-01-01"
  train_split_date: "2020-01-01"
  test_split_date: "2020-01-01"
  drop_data: False
  drop_data_train_date: "2016-02-01"
  drop_data_test_date: "2020-01-01"
  cross_validation: False
  lgb_model: True
  use_test_as_validation: False
  train_with_all_data: True
  seed: 227
  lgb_params: {
    "application": "fair",
    "num_iterations": 130,
    "learning_rate": 0.1,
    "early_stopping_round": 10,
    "feature_fraction": 1.0,
    "bagging_fraction": 0.9,
    "subsample_freq": 1,
    "min_data_in_leaf": 1016,
    "metric": "lgb_spearmanr",
    "num_leaves": 1016,
    "reg_alpha": 0.3899,
    "reg_lambda": 0.648,
    "verbose": -1,
    "device_type": 'cpu'
  }
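
As a rough sketch of how a step like train_model.py might consume this file, assuming it is saved as config.yml (the loader below is our illustration, not the competition code):

# Minimal sketch of a config loader shared by the pipeline steps.
# Assumes the configuration above is saved as config.yml.
import argparse
import yaml  # PyYAML

def load_config(config_id, path="config.yml"):
    """Return the parameter block for the given config_id."""
    with open(path) as f:
        return yaml.safe_load(f)[config_id]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_id", required=True)
    config = load_config(parser.parse_args().config_id)
    # Toggle behaviour based on the config, e.g. cross-validation:
    if config["cross_validation"]:
        print("cross-validation enabled")
    print(config["lgb_params"])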

A key benefit of these pipelines for iteration is that we can create multiple configurations when we want to try different experiments. If you are interested in taking your pipelines to the next level, you can also look into open-source software like DVC, which makes pipeline building a breeze.

https://dvc.org/

We even have a template for this type of work, which can be found below in case you are interested (still a work in progress).

Data

Overview

The competition hosts provided the following datasets:

  • Stock price data: This data includes the open price, closing price, trading volume, and other stock-related factors.
  • Stock list data: This data has information about stock industries and sectors.
  • Stock fin data: Financial statement data, i.e., reports about companies that describe their financial performance.
  • Stock label data: Label data has labels for low_5, high_5, low_10, high_10, low_20, and high_20.

Feature Engineering

We first experimented without any feature engineering for our baseline model. Our cross-validation score was around 1.50 with this baseline. We then added technical features like volatility, simple moving average, exponential moving average, and percentage change for intervals of 10, 14, 20, 28, 30, and 42 days. With these features, the cross-validation score dropped from 1.50 to 0.95. We only used stock price data, as the other data didn't help us; you can read more about that in the "Things that didn't work" section of this article. It was clear at this point that technical features are very important, so we added more of them: Bollinger Bands (lower and upper), Relative Strength Index (RSI), and Moving Average Convergence Divergence (MACD). These indicators tell us about price momentum, trend, and volatility. Our cross-validation score reached 0.525784 with the new features. We used the PriceIndices python library, created by Dayal Chand Aichara, to calculate the price technical indicators.
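
For a flavour of what these technical features look like, here is a pandas sketch (column and function names are ours for illustration; in the competition we computed the indicators with PriceIndices rather than by hand):

# Sketch of simple technical features on a per-stock price series.
# The "close" column name is an assumption, not the competition schema.
import pandas as pd

def add_technical_features(df, windows=(10, 14, 20, 28, 30, 42)):
    close = df["close"]
    for w in windows:
        df[f"sma_{w}"] = close.rolling(w).mean()           # simple moving average
        df[f"ema_{w}"] = close.ewm(span=w).mean()          # exponential moving average
        df[f"volatility_{w}"] = close.pct_change().rolling(w).std()
        df[f"pct_change_{w}"] = close.pct_change(w)
    # 20-day Bollinger Bands: SMA +/- 2 rolling standard deviations
    sma20, std20 = close.rolling(20).mean(), close.rolling(20).std()
    df["bb_upper"], df["bb_lower"] = sma20 + 2 * std20, sma20 - 2 * std20
    return df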

Model

For models, we tried Ridge Regression, LightGBM, and TabNet. We chose LightGBM as our final model based on cross-validation results. We trained two models, one each for predicting the low and the high stock values.
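
In sketch form, the two-model setup looks like this (dummy data stands in for our engineered features; the parameters echo the config above, but this is an illustration rather than our exact training code):

# Train one LightGBM regressor per target: one for the 20-day
# high and one for the 20-day low. Dummy data for illustration.
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))     # stand-in for the technical features
y_high = rng.normal(size=500)     # stand-in for the high_20 label
y_low = rng.normal(size=500)      # stand-in for the low_20 label

params = {"objective": "fair", "learning_rate": 0.1}
model_high = lgb.LGBMRegressor(**params).fit(X, y_high)
model_low = lgb.LGBMRegressor(**params).fit(X, y_low)

pred_high, pred_low = model_high.predict(X), model_low.predict(X)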

Evaluation

The evaluation metric used in this competition was the Spearman’s rank correlation coefficient which was calculated on both the high and low predictions from the previous models.

ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))

where:

d: the difference between the two ranks of each observation

n: the number of observations

An integrated (combined) score is then calculated from the rank correlation coefficients of both the high and low models, where:

ρ_high: Spearman rank correlation coefficient of the stock high predictions

ρ_low: Spearman rank correlation coefficient of the stock low predictions

To select our submission, we relied on custom-coded 5-fold time-series split cross-validation over stock dates, built on sklearn, to evaluate the combination of models properly against the integrated metric above.
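
A minimal sketch of that evaluation loop, assuming the integrated score averages the (1 − ρ) terms of the two models so that lower is better (our reading of the metric, not the official implementation), and splitting on trading dates rather than raw rows:

# 5-fold time-series cross-validation scored with Spearman's rank
# correlation. Dummy business-day dates for illustration.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import TimeSeriesSplit

def combined_score(y_high, pred_high, y_low, pred_low):
    """Average the (1 - rho) terms of the high and low models."""
    rho_high = spearmanr(y_high, pred_high).correlation
    rho_low = spearmanr(y_low, pred_low).correlation
    return ((1 - rho_high) + (1 - rho_low)) / 2

# Split on trading dates so that all rows of a given market day
# always land in the same fold.
dates = pd.bdate_range("2016-02-01", "2020-12-31").to_numpy()
for fold, (tr_idx, va_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(dates)):
    train_days, val_days = dates[tr_idx], dates[va_idx]
    # ... fit the high/low models on train_days, then score
    # their val_days predictions with combined_score ...
    print(f"fold {fold}: train up to {train_days[-1]}, validate through {val_days[-1]}")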

An interesting component of this competition was definitely the real-time serving aspect, as we were all evaluated on stocks in a live market setting for 4 weeks. This competition required significant engineering skills, as participants needed to make their models not only train but also serve in a production-type setting. I hope for more competitions like this in the future, as they are more representative of current industry problems.

Things that didn’t work

In this section, we will briefly go over some things we experimented with that did not work so well for us.

Ridge Regression — Before building LightGBM models, we first tried a simple Ridge Regression to get a baseline performance. After that we tried LightGBM and found that it performed significantly better on our cross-validation. At first we wanted to build an ensemble of the LightGBM and Ridge Regression models, but the LightGBM performance was so much better that such an ensemble might have been detrimental to us in the competition. Ridge Regression, however, was much faster at inference.

TabNet — Recently there have been a number of neural-network-based models trying to compete with gradient-boosted tree models in the tabular domain. The research papers often argue that they can beat algorithms like XGBoost or LightGBM in predictive performance. We decided to give TabNet a try as it seemed interesting; you can see the paper and implementation below if you are interested.

Paper: https://arxiv.org/abs/1908.07442

Implementation: https://github.com/dreamquark-ai/tabnet

I would not say that TabNet did not work, as it gave us quite competitive performance. It was just that the performance was about the same, at a greater cost in inference time. There was also a degree of risk, as we were not as familiar with the framework as with well-established libraries like LightGBM. We thought about ensembling it with LightGBM, but this would have doubled our inference time, so we opted not to use it. It's possible that it could have boosted our performance to the top spot, but who wants a real-time stock prediction model that takes forever to predict?

Financial Statement Data — Financial statement data are reports about companies that tell you essentially about their financial health and performance. Many may wonder how or why we did not use such a core dataset as part of our solution. The answer is simply that, based on our cross-validation results, it did not improve our performance enough to warrant the extra cost of using it. It is possible other solutions used it, but in our case we saw no benefit and stuck with the basic stock price data.

Why Signate?

For those not in Japan: if you haven't tried Signate before, it may be a good alternative to your Kaggle addiction. There are occasionally competitions where the datasets and competition details are in English, and competitions related to computer vision are typically the most accessible to people regardless of language. In addition, if you are working in Japan as a foreign resident, Signate gives you the opportunity to show that you can work with Japanese datasets, which employers may find to be an asset. In fact, one of the key reasons I have been participating in Signate competitions is to improve my skills in working with Japanese datasets, which can differ slightly from the tabular datasets you may be familiar with in English. One negative point about Signate is that the prize money is typically lower than what you would see on Kaggle, but I hope that as demand for AI specialists in Japan increases, the reward pools will also increase. A second negative is that the rules are often not written in English, which can be challenging if you are not confident with the language.

Conclusion

In this article, we discussed, in simple terms, our solution for predicting the high and low values of stocks with LightGBM models. From this competition, we can recommend that new data scientists always turn their code into pipelines for faster experimentation and always trust cross-validation over any public leaderboard score. Much like in the real world, agility, enabled by machine learning pipelines and a short time to prototype, provides an advantage in competitive settings. We hope you find this advice helpful and look forward to others putting it to good use against us on the competition battlefield.
