Looking for correlations in the Stock Market

Trying to find which stocks behave the same in the Stock Market using Python and Scikit-learn.

Published in Overfitted Microservices · 8 min read · Apr 2, 2019

This story began with me trying to answer a simple question: given a group of stocks and their history over the years, is it possible to find a set of them that follow the same pattern?

Apple, Inc. (AAPL) stock on the NASDAQ during the past 5 years.

Salesforce.com, Inc. (CRM) stock on the New York Stock Exchange (NYSE) during the past 5 years.

The main idea is to find a way to measure “similarity” between two stocks. In other words, we are looking for a metric that tells us how “distant” two stocks are, i.e. how similarly they behave. We are going to address this using correlation.

To clarify a little bit, imagine we have the following stocks:

Just an awful drawing representing two stocks: Orange and Blue

If we can tell that they look pretty similar (comparing their weekly close values, for instance), it’s because we somehow notice that the lines share the same behavior. This reminded me of what one normally does when working on Machine Learning models: test that what you predicted is similar enough to what happens in reality.

So if we want to compare stock A with stock B, what we’ll try to do is predict stock A’s values using stock B’s values. Hands-on!

Code Alert!

In this section, we’ll code a proof of concept of the idea. If you’re only interested in the results, you can just jump into the results section.

Set up

To code, we’ll use some basic Python. If you’re unfamiliar with it, you can follow this guide to set up everything you need.

We’ll also use some datasets from Yahoo! Finance. You can download any stock you want to analyze, but we’ll start by analyzing Apple, Salesforce.com, Spotify, and Riot. You can download the weekly historical close price in CSV format.

Create a Python project and save those CSV files in the resources folder. The CSV files should look like this:

CSV file from Yahoo! Finance

Prepare variables

Create a Python file called test.py and declare some variables based on your stocks/files:

# Init constants
crm_stock_name = 'CRM'
spot_stock_name = 'SPOT'
aapl_stock_name = 'AAPL'
riot_stock_name = 'RIOT'

crm_path = '../resources/CRM.csv'
spot_path = '../resources/SPOT.csv'
aapl_path = '../resources/AAPL.csv'
riot_path = '../resources/RIOT.csv'

stocks = [
    crm_stock_name,
    spot_stock_name,
    aapl_stock_name,
    riot_stock_name
]

paths = {
    crm_stock_name: crm_path,
    spot_stock_name: spot_path,
    aapl_stock_name: aapl_path,
    riot_stock_name: riot_path
}

What we’re going to code next is something that reads our data, draws the stocks and tries to compare them.

I called this something Alice.

Let’s define Alice:

CLOSE = 'close'


class Alice:
    companies = {}

    similarity_calculator_strategy = SimilarityCalculatorStrategy()

    def add_company(self, company_name, historical):
        self.companies.update({company_name: historical})

    def get_company(self, company_name):
        return self.companies.get(company_name)

    def compare(self, company_x_name, company_y_name, criteria=CLOSE, strategy='cosine'):
        company_x = self.get_company(company_x_name)
        company_y = self.get_company(company_y_name)

        similarity_calculator = self.similarity_calculator_strategy.build_similarity_calculator(strategy)

        # Get only the criteria to be used to compare stocks
        company_x_criteria = company_x[criteria]
        company_y_criteria = company_y[criteria]

        return similarity_calculator.calculate_similarity(company_x_criteria.to_numpy(), company_y_criteria.to_numpy())

When I started experimenting with this, I thought there might be multiple ways of comparing stocks. That’s why you can compare by different criteria; the default is CLOSE, but you could compare by OPEN or VOLUME if you wanted to. The same applies to the strategy, which is simply the approach you choose to compare the stocks. The difference between criteria and strategy in this context is that the criteria refer to the data used to compare, whereas the strategy refers to the algorithm used. An example of a strategy is the Coefficient of Determination, a.k.a. R-squared or R2. I’ll give you a brief high-level introduction to R2 later. In the meantime, let’s use it as a black box and see what we get.
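The SimilarityCalculatorStrategy that Alice relies on isn't shown in this post, so here is a minimal sketch of what such a factory could look like. The registry, the strategy names ('r2' and 'cosine') and the two stub calculators are assumptions made for illustration; they are not the code from the repository.

```python
# Hypothetical sketch: maps a strategy name to a calculator object.
# The stub calculators only stand in for the real ones shown later in the post.


class StubR2Calculator:
    def calculate_similarity(self, x, y):
        return "R2 would be computed here"


class StubCosineCalculator:
    def calculate_similarity(self, x, y):
        return "cosine similarity would be computed here"


class SimilarityCalculatorStrategy:
    # Assumed registry of strategy names; the real names may differ.
    calculators = {
        'r2': StubR2Calculator,
        'cosine': StubCosineCalculator,
    }

    def build_similarity_calculator(self, strategy):
        try:
            return self.calculators[strategy]()
        except KeyError:
            raise ValueError(f"Unknown strategy: {strategy!r}") from None


calculator = SimilarityCalculatorStrategy().build_similarity_calculator('cosine')
print(type(calculator).__name__)  # StubCosineCalculator
```

Because the calculators are duck-typed, adding a new strategy in this sketch only means writing a class with a calculate_similarity method and registering it under a name.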

Read your datasets

Don’t reinvent the wheel: Use Pandas.

import pandas as pd


class ResourceManager:
    csv_cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']

    def read_from_csv(self, path='', converters=None):
        return pd.read_csv(path, converters=converters).rename(columns={
            'Date': 'date',
            'Open': 'open',
            'High': 'high',
            'Low': 'low',
            'Close': 'close',
            'Volume': 'volume'
        }).drop('Adj Close', axis=1)

Similarity Calculators

I chose duck typing for the Similarity Calculators. All of them have a method that compares two NumPy arrays:

def calculate_similarity(self, x, y):

The result will depend on the component they’re using internally. Let’s take a look at the NormalizedR2ScoreSimilarityCalculator:

from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler

from src.internal.SimilarityArrayUtils import fix_shape


class NormalizedR2ScoreSimilarityCalculator:
    def calculate_similarity(self, x, y):
        x, y = fix_shape(x, y)

        x = MinMaxScaler().fit_transform(x.reshape(-1, 1))
        y = MinMaxScaler().fit_transform(y.reshape(-1, 1))
        return r2_score(x, y)

That was easy! We’re delegating the logic to Scikit-learn and its ready-made r2_score. It’s normally used to evaluate a model by comparing predicted values against real values, as we stated before. In this case, however, we’ll simply compare x and y, which are arrays containing the stocks’ close prices.

Yes, we’re also normalizing our inputs! I first tested without normalization and didn’t get good results, which makes total sense: when we compare stocks, what matters is their behaviour, not their absolute values. I also tried comparing with cosine similarity, but the results don’t look good so far.

from sklearn.metrics.pairwise import cosine_similarity

from src.internal.SimilarityArrayUtils import fix_shape


class CosineSimilarityCalculator:
    def calculate_similarity(self, x, y):
        x, y = fix_shape(x, y)
        return cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
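One possible explanation for the weak cosine results, offered as a guess rather than something verified on the real data: prices are always positive, so two price vectors point in nearly the same direction even when their trends are opposite. A plain-Python sketch with made-up prices (the cosine helper below is written by hand for illustration and is not the Scikit-learn function):

```python
import math


def cosine(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up prices: one stock goes up, the other goes down.
rising = [100.0, 101.0, 102.0, 103.0]
falling = [103.0, 102.0, 101.0, 100.0]

similarity = cosine(rising, falling)
print(similarity)  # very close to 1, despite the opposite trends
```

If that guess is right, cosine similarity on raw prices mostly tells us that both stocks have positive prices, which is not very useful.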

Then just create a new instance of Alice and load your datasets:

def main():
    alice = Alice()
    stock_gateway = StockGateway()

    # Load stocks
    load_stocks(alice, stock_gateway)

    draw_stocks(alice)


def load_stocks(alice, stock_gateway):
    for stock in stocks:
        load_stock(alice, paths[stock], stock, stock_gateway)


def load_stock(alice, stock_path, stock_name, stock_gateway):
    alice.add_company(stock_name, stock_gateway.fetch_from_csv(path=stock_path))

I won’t go through the rest of the code, but here you can find the whole example. Remember that the file test.py has all the magic you need to use Alice 😙
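Once the stocks are loaded, you'll probably want to compare every pair. Here's a minimal sketch of how that loop could look. The compare_all helper and the fake_compare placeholder are hypothetical names introduced here; in the real project the placeholder would be replaced by a call to alice.compare.

```python
from itertools import combinations

stocks = ['CRM', 'SPOT', 'AAPL', 'RIOT']


def compare_all(stock_names, compare):
    # Call `compare` once for every unordered pair of stocks.
    return {(a, b): compare(a, b) for a, b in combinations(stock_names, 2)}


def fake_compare(stock_a, stock_b):
    # Placeholder standing in for alice.compare(stock_a, stock_b, ...).
    return 0.0


results = compare_all(stocks, fake_compare)
for pair in results:
    print(pair)
```

Keep in mind that R2 is not symmetric in general (predicting A with B is not the same as predicting B with A), so itertools.permutations may be a better fit than combinations if you care about both directions.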

Results

After running our example (test.py), you should see the following charts:

Simple scatter chart with **normalized** stocks. Note that their close values go from 0 to 1.
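The 0 to 1 range comes from min-max scaling. Here's a tiny plain-Python sketch of the formula (the same one MinMaxScaler applies with its default range), using made-up prices. The min_max helper is only for illustration and assumes the series is not constant, otherwise it would divide by zero.

```python
def min_max(values):
    # Scale values so the smallest becomes 0 and the largest becomes 1.
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up prices: same behaviour, very different price levels.
cheap_stock = [10.0, 15.0, 20.0]
expensive_stock = [1000.0, 1500.0, 2000.0]

print(min_max(cheap_stock))      # [0.0, 0.5, 1.0]
print(min_max(expensive_stock))  # [0.0, 0.5, 1.0]
```

After scaling, both stocks look identical, which is exactly what we want when we care about behaviour rather than price level.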

As I said before, in order to understand the output of R2, we have to understand how it works.

The Coefficient of Determination (a.k.a. R2)

Honestly, it took me a while to understand the output of R2. Even though it’s not the point of this blog to demonstrate mathematically how to get R2, we’ll go through a simple explanation to at least understand its meaning.

It was easy to understand that the closer to 1, the better. However, it was weird to me that R2 could take a negative value. I mean, if zero is bad, would a negative value be even worse?

Well, yes, that’s the case. I figured out why after reading this and this. It turns out that R2 is mainly used to evaluate regressions which, as we stated before, is the way we’re approaching our problem. The first thing you have to do to understand R2 is to draw your dataset and find the average.

The blue line is our dataset and the orange line is the average. To find the average line, take all the values of your target variable and compute their average.

In the context of R2, we say that it’s possible to predict our dataset using its average. If our dataset were tightly clustered around the average, the error would be rather small, but it would be high if the dataset were spread out. In R2, we call this difference between the dataset and its average the total error. It basically says how badly the average performs at predicting the dataset.

The next step is to try to predict our dataset using our model — which in our case is just another stock. By doing this, we will calculate the error we generate by predicting the blue stock, using the orange stock.

We call this error Regression Error. As in every Machine Learning model, we want this error to be as low as possible.

Now let’s take a look at the R2 formula:

R2 Formula. SSres is the regression error and SStotal is the total error.
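Written out, with y_i the real values, \hat{y}_i the predicted values and \bar{y} the average of the real values (the symbol names are chosen here for illustration; the definition itself is the standard one):

```latex
R^2 = 1 - \frac{SS_{res}}{SS_{total}}, \qquad
SS_{res} = \sum_i (y_i - \hat{y}_i)^2, \qquad
SS_{total} = \sum_i (y_i - \bar{y})^2
```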

We first compute the ratio between the Regression Error and the Total Error. This tells us how much of the total error is still present in our regression. If we subtract this value from 1, we get how much of the error we removed by using our model. Therefore, values close to zero mean that we removed pretty much nothing from the total error, whereas values close to 1 mean that we removed pretty much the whole error.
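To make this concrete, here's a small plain-Python sketch of the same computation with made-up numbers. The r2 helper is written by hand for illustration; in the project itself the value comes from Scikit-learn's r2_score.

```python
def r2(real, predicted):
    mean = sum(real) / len(real)
    # Total error: how badly the average predicts the real values.
    ss_total = sum((r - mean) ** 2 for r in real)
    # Regression error: how badly our prediction does.
    ss_res = sum((r - p) ** 2 for r, p in zip(real, predicted))
    return 1 - ss_res / ss_total


real = [1.0, 2.0, 3.0, 4.0]

print(r2(real, real))                  # 1.0, the whole error is removed
print(r2(real, [2.5, 2.5, 2.5, 2.5]))  # 0.0, no better than the average
print(r2(real, [2.5, 2.5, 3.0, 4.0]))  # 0.5, half of the error is removed
```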

It’s possible though, as we’ve seen in our charts, to get negative values of R2. Mathematically, this means that the Regression Error is higher than the Total Error!

Remember: The total error is the error you caused when “using” the average of your dataset. If you did worse than that, it means that you’re predicting with something that is performing worse than the average!

Your dataset in blue and its average in orange. Your prediction in green.

Generally speaking, it is said that if you get negative values it’s because you have probably chosen the wrong model for your dataset. In our case, I would say that the stocks don’t look similar at all.

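However, if R2 is close to -1 or lower, it may indicate that the dataset and the prediction move in opposite directions (unlike a correlation coefficient, R2 has no lower bound at -1, so this reading is a heuristic rather than a rule). This may sound weird, but in stocks it could indicate that when one stock is performing well, the other one is doing badly, and vice versa.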
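It's worth checking that intuition with numbers. Below is a plain-Python sketch with a made-up, already normalized series and its mirror image. Both helpers are written by hand for illustration, and the pearson function is only there for comparison; it is not part of Alice.

```python
import math


def r2(real, predicted):
    mean = sum(real) / len(real)
    ss_total = sum((r - mean) ** 2 for r in real)
    ss_res = sum((r - p) ** 2 for r, p in zip(real, predicted))
    return 1 - ss_res / ss_total


def pearson(x, y):
    # Pearson's correlation coefficient, always between -1 and 1.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up normalized series and its exact opposite.
stock = [0.0, 0.25, 0.5, 0.75, 1.0]
mirror = [1.0 - v for v in stock]

print(r2(stock, mirror))       # -3.0
print(pearson(stock, mirror))  # -1.0
```

In this toy case a perfectly opposite series gives an R2 of -3 while Pearson's correlation gives exactly -1, so a strongly negative R2 is better read as a hint of opposite behaviour than as a value on a -1 to 1 scale.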

Conclusions

No conclusions? 😂 So, yes, the conclusion is that it seems to be possible to find correlations in the stock market. As a general rule, from my experiment, I would say that when comparing two stocks a and b, if R2 is close to 1, there’s a correlation and they tend to behave similarly. On the other hand, if R2 is close to -1, there’s a correlation too, but they tend to behave in opposite ways. Finally, values around zero, and/or not so close to -1, indicate that there’s no correlation between them.

Anyway, it was just an experiment, and I hope the community shares its feedback and thoughts (:

Edit: Many thanks to Martin Rey for collaborating on this blog!