Stock Portfolio Construction: a proof of concept using Apache Spark

Data Reply · May 2, 2018

Recently I came across a conference paper by Joglekar (2014), who uses a two-stage approach to constructing low-risk, stable-return stock portfolios. The idea is simple:

Step 1: Perform correlation-based clustering on a set of financial instruments.

Step 2: Use a genetic algorithm to build an optimal portfolio.
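Before getting into the data, here is a rough sketch of step 1. It is a sketch only: it assumes a spark-shell session (so `sc` is predefined), assumes aligned per-ticker daily return series are already in hand (building them from raw closes is sketched in the next section), and uses k-means on the rows of the correlation matrix as a stand-in for whatever clustering scheme the paper actually prescribes:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.clustering.KMeans

// Assumed starting point: one aligned daily return series per ticker.
val returnsByStock: Map[String, Array[Double]] = ???  // ticker -> returns

val tickers = returnsByStock.keys.toArray
val nDays   = returnsByStock(tickers.head).length

// Statistics.corr expects observations as rows (days) over columns
// (stocks), so transpose the per-stock series into daily vectors.
val dailyRows = (0 until nDays).map { d =>
  Vectors.dense(tickers.map(t => returnsByStock(t)(d)))
}
val corr = Statistics.corr(sc.parallelize(dailyRows), "pearson")

// Treat each stock's row of the correlation matrix as its feature
// vector and cluster those rows: stocks in one cluster tend to co-move.
val stockRows = (0 until corr.numRows).map { i =>
  Vectors.dense(Array.tabulate(corr.numCols)(j => corr(i, j)))
}
val model = KMeans.train(sc.parallelize(stockRows), 10, 20)  // k = 10, 20 iterations
val clusterOf = tickers.zip(stockRows.map(v => model.predict(v))).toMap
```

Step 2 can be prototyped just as roughly. The toy genetic algorithm below evolves binary include/exclude portfolios under a made-up fitness (mean return minus a risk penalty); the paper's actual objective and operators will differ:

```scala
import scala.util.Random

case class Candidate(mask: Array[Boolean]) {
  // Placeholder fitness: average per-stock mean return minus a risk penalty.
  def fitness(meanRet: Array[Double], risk: Array[Double]): Double = {
    val picked = mask.indices.filter(i => mask(i))
    if (picked.isEmpty) Double.MinValue
    else picked.map(i => meanRet(i) - 0.5 * risk(i)).sum / picked.size
  }
}

def evolve(meanRet: Array[Double], risk: Array[Double],
           popSize: Int = 50, generations: Int = 200): Candidate = {
  val rnd = new Random(42)
  val n   = meanRet.length
  var pop = Array.fill(popSize)(Candidate(Array.fill(n)(rnd.nextBoolean())))
  for (_ <- 1 to generations) {
    // Keep the fitter half, refill with one-point crossover + mutation.
    val parents = pop.sortBy(-_.fitness(meanRet, risk)).take(popSize / 2)
    pop = parents ++ Array.fill(popSize - parents.length) {
      val a   = parents(rnd.nextInt(parents.length)).mask
      val b   = parents(rnd.nextInt(parents.length)).mask
      val cut = rnd.nextInt(n)
      val child = Array.tabulate(n)(i => if (i < cut) a(i) else b(i))
      if (rnd.nextDouble() < 0.1) child(rnd.nextInt(n)) ^= true  // bit-flip
      Candidate(child)
    }
  }
  pop.maxBy(_.fitness(meanRet, risk))
}
```

A natural way to wire the two stages together is to constrain candidates so that each cluster contributes at most one stock, which is where the diversification benefit of step 1 would come in.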

Why not implement this on a massive scale using Apache Spark? In this post I will explain how (and why) to do so based on ~5 years of daily closing price histories of 2,000 stocks (NASDAQ constituents; the dataset from Chapter 9 of Advanced Analytics with Spark).
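As a first data-wrangling step, here is a minimal sketch of turning raw closes into daily returns with the DataFrame API. The path and the (ticker, date, close) schema below are assumptions; adapt them to however the downloaded histories are laid out:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Assumed layout: CSV rows of (ticker, date, close).
val closes = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/stocks/*.csv")
  .select($"ticker", $"date", $"close")

// Daily simple return per ticker: close_t / close_{t-1} - 1.
val byTicker = Window.partitionBy($"ticker").orderBy($"date")
val returns = closes
  .withColumn("prevClose", lag($"close", 1).over(byTicker))
  .withColumn("return", $"close" / $"prevClose" - 1)
  .na.drop(Seq("return"))  // each ticker's first day has no previous close
```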

What has clustering got to do with this?

A widely used risk-management technique is portfolio diversification. This basically means that you want the stocks in your portfolio to be “different”. From a statistical point of view, one way to measure this difference is correlation. Take a moment and think about the following (simplified) scenarios:

  • Most stocks in a portfolio are (highly) positively correlated.

In such situations stock prices are expected to move in the same direction, so if your forecast is correct…
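To make the correlation point concrete, here is a tiny plain-Scala check with made-up numbers: two return series that rise and fall together score near +1, flagging them as a poor diversification pair:

```scala
// Pearson correlation of two return series (toy data, made up).
def pearson(x: Array[Double], y: Array[Double]): Double = {
  val mx  = x.sum / x.length
  val my  = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

val stockA = Array(0.01, -0.02, 0.015, 0.005, -0.01)
val stockB = Array(0.012, -0.018, 0.02, 0.004, -0.009)  // moves with A
println(pearson(stockA, stockB))  // ~0.99: holding both adds little diversification
```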

Originally published at www.datareply.co.uk.
