Machine learning has many advantages, and it is a hot topic right now. For a trader or a fund manager, the pertinent question is “How can I apply this new tool to generate more alpha?” I will explore one such model that answers this question in a series of blogs.
This blog has been divided into the following segments:
- Getting the data and making it usable.
- Creating Hyper-parameters.
- Splitting the data into test and train sets.
- Getting the best-fit parameters to create a new function.
- Making the predictions and checking the performance.
- Finally, some food for thought.
You can install the necessary packages by running the following commands in the Anaconda Prompt.
- pip install pandas
- pip install pandas-datareader
- pip install numpy
- pip install scikit-learn
- pip install matplotlib
Before we go any further, let me state that this code is written in Python 2.7. So let’s dive in.
Problem Statement: Let’s start by understanding what we are aiming to do. By the end of this blog, I will show you how to create an algorithm that can predict a day’s closing price from the previous day’s OHLC (Open, High, Low, Close) data. I also want to monitor how the prediction error changes with the size of the input data.
Let us import all the libraries and packages needed for us to build this machine learning algorithm.
Getting the data and making it usable
To create any algorithm we need data to train the algorithm and then to make predictions on new unseen data. In this blog, we will fetch the data from Yahoo. To accomplish this we will use the DataReader function from the pandas-datareader library. This function is extensively used and it enables you to get data from many online data sources.
We are fetching the data of the SPDR ETF linked to the S&P 500. This ETF can be used as a proxy for the performance of the S&P 500 index. We specify the year starting from which we will be pulling the data. Once the data is in, we will discard any data other than the OHLC, such as volume and adjusted close, to create our data frame ‘df’.
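A minimal sketch of this step might look like the following. It assumes Python 3 and a current pandas-datareader rather than the Python 2.7 setup mentioned earlier. The helper names fetch_ohlc and keep_ohlc, the ticker default and the 2010 start date are placeholders chosen for illustration, not taken from the original code. The Yahoo endpoint has changed over the years, so the download itself may fail with some releases; any source that returns the same column names would work with keep_ohlc.

```python
import pandas as pd


def keep_ohlc(raw):
    # Keep only the four price columns and drop everything else
    # (for example Volume and Adj Close).
    return raw[["Open", "High", "Low", "Close"]].copy()


def fetch_ohlc(ticker="SPY", start="2010-01-01"):
    # Imported here so that keep_ohlc can be used without the package
    # or a network connection.
    from pandas_datareader import data as pdr

    # Assumption: the 'yahoo' data source is reachable from your setup.
    raw = pdr.DataReader(ticker, "yahoo", start=start)
    return keep_ohlc(raw)
```

Calling df = fetch_ohlc() would then give the data frame ‘df’ used in the rest of the post.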
Now we need to make our predictions from past data. So, let’s create new columns in the data frame that contain the same data with a one-day lag.
Note that the new column names use lower-case letters in place of the capital letters.
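One way to build these lagged columns is with the pandas shift method. This is a sketch under the assumption that ‘df’ has the columns Open, High, Low and Close; the function name add_lagged_columns is hypothetical.

```python
import pandas as pd


def add_lagged_columns(df):
    # Work on a copy so the caller's frame is left untouched.
    out = df.copy()
    for col in ["Open", "High", "Low", "Close"]:
        # shift(1) moves each value down one row, so row t holds the
        # value from row t-1; the first row becomes NaN.
        out[col.lower()] = out[col].shift(1)
    return out
```

The first row of each new column is NaN because there is no previous day to copy from, which is one reason an imputation step is useful later on.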
Creating Hyper-parameters
Although the concept of hyper-parameters is worthy of a blog in itself, for now I will just say a few words about them. These are the parameters that the machine learning algorithm can’t learn from the data; instead, we set them ourselves and iterate over candidate values. We use them to see which predefined functions or parameter values yield the best-fit function.
In this example, I have used Lasso regression, which uses L1 regularization. This type of regularization is very useful for feature selection, because it is capable of shrinking coefficient values all the way to zero.
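To see this effect in isolation, here is a small illustration on synthetic data (not the price data used in this blog). The target depends only on the first of three features, and the alpha value of 0.5 is an arbitrary choice made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends on the first feature only.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(200)

# A fairly strong L1 penalty, chosen only for illustration.
model = Lasso(alpha=0.5)
model.fit(X, y)

# The coefficients of the two irrelevant features are driven to zero,
# while the first one is shrunk but stays clearly positive.
print(model.coef_)
```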
The imputer function replaces any NaN values that can affect our predictions with mean values, as specified in the code. ‘steps’ is a list of operations that are passed to the Pipeline function. The pipeline is a convenient tool for chaining multiple operations on the data set.
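A sketch of such a pipeline is shown here, assuming a recent scikit-learn on Python 3. In current releases the imputer class is SimpleImputer in sklearn.impute; older releases, such as those available for Python 2.7, exposed a class named Imputer in sklearn.preprocessing instead. The step names 'imputation' and 'lasso' are my own labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Each step is a (name, estimator) pair; the steps run in order.
steps = [
    # Replace NaN values with the mean of the corresponding column.
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="mean")),
    # Lasso regression as the final estimator.
    ("lasso", Lasso()),
]
pipeline = Pipeline(steps)
```

Calling pipeline.fit(X, y) runs the imputation first and then fits the Lasso model on the imputed data.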
Here we have also passed the Lasso function parameters along with a list of values that can be iterated over. Although I am not going into details of what exactly these parameters do, they are something worthy of digging deeper into.
Finally, I called the randomized search function to perform the cross-validation.
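Putting the last two steps together, a hedged sketch could look like this. It assumes a recent scikit-learn on Python 3. The candidate values for alpha and max_iter, the number of sampled combinations (n_iter=10) and the five folds are illustrative choices, not the values from the original code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("lasso", Lasso()),
])

# Keys follow the '<step name>__<parameter name>' convention so the
# search knows which pipeline step each hyper-parameter belongs to.
parameters = {
    "lasso__alpha": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "lasso__max_iter": [1000, 5000, 10000],
}

# Randomly samples n_iter parameter combinations and scores each one
# with 5-fold cross-validation.
search = RandomizedSearchCV(
    pipeline, parameters, n_iter=10, cv=5, random_state=0
)
```

After search.fit(X, y), the best combination found is available as search.best_params_.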