In this article, I detail an approach to the Ethical Investing Challenge, a competition on Auquan’s website. See here for more details.
This article also presumes a certain level of knowledge about time series data. If you haven’t worked with time series much, take a look at our introduction to time series analysis series here.
Investors are becoming increasingly interested in ethical investing. Part of the reason is that the industry is starting to care more; the other is that there is a lot of evidence that it can produce better, or at least equivalent, returns.
One subset of this type of investing is known as ESG investing. In short, it uses company filings about environmental, social and governance activities to learn more about a company. I’ve recently written an article looking at this, so if you want to brush up on what ESG is and why it works, go here.
Below we look at whether we can use a combination of fundamental and ESG data to predict whether a company’s share price will increase or decrease over the following three-month period.
Step One: Getting Started
The first thing we do for any problem is to visualise the data and understand what we’re working with. This gives us some idea of which approaches will work and whether we need to do any work on the data before we can start building our model.
The data given in this competition comes in the form of a series of CSV files, each containing information about a particular stock.
Importing libraries and other setup
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
base_path = "historicalData/train/"
files = os.listdir(base_path)
Reading all available stocks
files = [file[:-4] for file in files if ".csv" in file]
Remove rows where the target variable is not available
df = pd.read_csv(base_path + files[0] + ".csv")  # files[0]: first stock in the folder, an arbitrary choice for illustration
df = df[~df.Y.isna()]
Setting Index and Printing Table
df = df.set_index("datetime")
print(df)
You can see that, near the end of the snippet, I’m removing rows where the target variable Y is missing without looking at them first. This is because we don’t need to make predictions about these rows for the competition.
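Since the second approach needs data from every stock, it can help to wrap the loading and cleaning steps into small helpers. The following is a minimal sketch, not the competition's own tooling: `load_stock` and `load_all` are hypothetical helper names, and the demo file is made-up data written to a temporary folder. Only the `datetime`, `Share Price` and `Y` column names come from the snippet above.

```python
import os
import tempfile

import pandas as pd


def load_stock(base_path, ticker):
    """Read one stock's CSV, drop rows with no target, index by datetime."""
    df = pd.read_csv(os.path.join(base_path, ticker + ".csv"))
    df = df[~df["Y"].isna()]  # keep only rows where the target is known
    return df.set_index("datetime")


def load_all(base_path):
    """Return {ticker: DataFrame} for every CSV file found in base_path."""
    tickers = sorted(f[:-4] for f in os.listdir(base_path) if f.endswith(".csv"))
    return {ticker: load_stock(base_path, ticker) for ticker in tickers}


# Made-up demo data (an assumption, not the competition files).
demo_dir = tempfile.mkdtemp()
pd.DataFrame({
    "datetime": ["2015-03-31", "2015-06-30", "2015-09-30"],
    "Share Price": [10.0, 11.0, 12.0],
    "Y": [1.0, 0.0, None],  # the last row has no target and will be dropped
}).to_csv(os.path.join(demo_dir, "DEMO.csv"), index=False)

stocks = load_all(demo_dir)
print(stocks["DEMO"])
```

Pointing `load_all` at "historicalData/train/" instead of the temporary folder should give one cleaned table per stock, assuming every file has the same columns.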
If we look at the data we see that for a particular stock, we have very few data points. The data is limited enough that it will not be sufficient to train anything complex (sorry neural net fans). Instead, we are going to have to work with models that are robust enough for our small sample.
In this example we will try two approaches:
- Train simple time series models
- Train a common model for all the stocks.
Approach One: ARIMA Model
A good place to start with a problem like this is to see how a simple linear model performs. Which models are worth trying depends on the type of data you have. Our data in this competition is a multivariate time series, which means that a Vector Autoregression (VAR) model would be a good thing to try.
Before we get on to that though, we are going to try something even simpler. Since the final goal of this challenge is to predict the change in share price, we’re going to start with an ARIMA model on share price.
As a quick recap, ARIMA stands for AutoRegressive Integrated Moving Average, a general class of models used to forecast time series data. The data has to be stationary (the mean and variance don’t change over time) or must be able to be made stationary by differencing. If you haven’t come across them before, I recommend you check out this article to get an idea of what we’re about to do.
As I said above, for ARIMA models to work we need to make sure the series we are investigating is stationary. To understand why, remember that stationarity means that the process creating a series (and therefore its statistical properties) does not change over time, so the series can be represented by a single model. If this wasn’t the case, it wouldn’t make sense to describe the series with a single model, as no single model could describe the underlying process.
Plotting Share Price
pl = df['Share Price'].plot()
This will give us some indication of whether our series is stationary or not.
In the resulting plot we can clearly see that there is a strong trend in our data, which means that it is not stationary. This is to be expected with stock prices, as they generally increase over time. In order to proceed, we need to make this series stationary. The first thing to try is differencing, which we can do by looking at returns (the percentage change from one period to the next) instead of the stock price.
Calculating Returns (Differencing)
returns = df['Share Price'].pct_change()
Plotted, the returns look better. They don’t appear to have any trend, so the series is likely now stationary. To check this, we would normally run an Augmented Dickey-Fuller test, which tests statistically for stationarity. If your data isn’t stationary after this, you can try taking the second difference, the third, or using more complicated approaches.
The next thing we need to know for the model is the lag period of any autocorrelation within the data series. So let’s print those now:
We see that the autocorrelation peaks when the lag is 4 and drops off suddenly after a lag of 5. If you looked closely at the data provided above, you might have noticed that it is reported quarterly. With this information, we can make a pretty strong hypothesis that the data shows the greatest autocorrelation with the value for the same quarter of the previous year (e.g. Q4 2015 is best predicted by Q4 2014). To take advantage of this, we are going to train our model with a lag of 5. This should ensure we capture all of the above information without including unnecessary noise.
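To make the lag analysis concrete, here is a minimal sketch that computes the autocorrelation at several lags with pandas' `Series.autocorr`. The series is made up (a repeating four-quarter pattern plus noise, chosen to mimic the seasonality described above), so the numbers are illustrative only and are not the competition results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Made-up quarterly returns: a repeating four-quarter pattern plus small noise.
pattern = np.array([0.05, -0.02, 0.01, -0.03])
returns = pd.Series(np.tile(pattern, 10) + rng.normal(0, 0.005, size=40))

# Correlation between the series and a copy of itself shifted by `lag` quarters.
acf = {lag: returns.autocorr(lag=lag) for lag in range(1, 7)}
for lag, value in acf.items():
    print(f"lag {lag}: {value:+.2f}")

best_lag = max(acf, key=acf.get)
print("strongest autocorrelation at lag", best_lag)
```

On real returns the peak will be far less clean than in this toy series, which is one reason to keep an extra lag or two in the model.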
Let’s build the model:
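Now we can model this using ARIMA models
from statsmodels.tsa.arima.model import ARIMA  # current statsmodels import path; the older statsmodels.tsa.arima_model (with fit(disp=0)) has been removed
model = ARIMA(returns.dropna(), order=(5, 0, 0))
model_fit = model.fit()
print(model_fit.summary())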
To start making sense of this output, first look at the ‘coef’ column in the second row. The largest coefficient is for the lag-4 (L4) share price term, which is what we expected based on our autocorrelation analysis. Unfortunately, the confidence intervals on the right all include 0, so none of the coefficients is statistically significant. This is due to the low number of data points for any single stock.
At this point, we need to decide what to do next. There are a couple of options:
- Try a different set of parameters for our model and compare them using AIC or BIC measures (see here and here respectively)
- Include other features using a VAR model. Note that we would first have to check that our data is cointegrated. (More detail)
- Use a multivariate GARCH model to account for any changing volatility
In order to build more complex models, we would need to gather more data. One way to do this would be to identify groups of stocks that behave similarly (i.e. clustering). Then you can assume that the returns of these stocks are derived from the same distribution. At this point, you could think about using cool-sounding ML models and train one model per cluster, using data from all the stocks in one cluster as training data for that cluster.
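One possible way to form such groups is hierarchical clustering on the correlation between stocks' returns. The sketch below is an assumption about how this could be done, not the method used in the competition: the six return series are made up (two groups of three, each sharing a common driver), and the choice of `1 - correlation` as the distance and of two clusters is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
n_quarters = 40

# Made-up returns: two groups of three stocks, each group with a shared driver.
factor_a = rng.normal(size=n_quarters)
factor_b = rng.normal(size=n_quarters)
returns = np.column_stack(
    [factor_a + 0.3 * rng.normal(size=n_quarters) for _ in range(3)]
    + [factor_b + 0.3 * rng.normal(size=n_quarters) for _ in range(3)]
)  # shape: (quarters, stocks)

# Highly correlated stocks get a small distance.
corr = np.corrcoef(returns, rowvar=False)
dist = 1.0 - corr

# linkage expects the upper triangle of the distance matrix as a flat vector.
condensed = dist[np.triu_indices_from(dist, k=1)]
Z = linkage(condensed, method="average")

# Cut the tree into two clusters; labels are integers starting at 1.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The rows of all stocks sharing a label could then be pooled into a single training set for that cluster.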
There is a lot more work you could do to create a better model, but this is one approach to starting a problem like this. Hopefully, this gives you some ideas that you can use to continue on from here and build your own model.
To join the competition and download the dataset go here: