Combining Technical Analysis With K-Means

Charlie Shelbourne

Published in

Geek Culture

10 min read · Jun 20, 2022

A Technique to Classify Pricing Data into Market States

🐍 Codebase found here

🙋‍♂️ Intro

“The Way of The Turtle” by Curtis Faith is a book that led me to re-evaluate the way I think about trading. It introduced me to thinking of trading as a system, defined by rules that can be built and tested. As an engineer, this was great news, as systems are familiar ground. Any trading I took part in beforehand had been tantalising guesswork.

Curtis tells us a good trading strategy gives traders an edge over the market. However, this edge is likely to dissipate over time as markets change and adapt. Therefore, strategies must be revised. The advice is to start simple, and thoroughly backtest your system to ensure it’s a winning system.

This post details some of my experimentation towards building a system of trading, using technical analysis and clustering.

Technical analysis (TA) is an area of financial analysis that attempts to forecast price movements by applying transformations to historic price data. These types of transformations have been used in systematic trading strategies. TA is based solely on historic data, and gives indications (known as signals) of potentially good opportunities to enter or exit a market.

Investopedia definition of Technical Analysis

Curtis gives examples of signal generation using TA. Strategies can be as simple as calculating a market’s price moving average. Signals to enter and exit the market are then generated when the price crosses the moving average. Or, signal generation can be more complex, combining multiple technical indicators to generate a single signal.
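To make the moving average example concrete, here is a minimal sketch of a crossover signal in pandas. The function name, the window length and the toy prices are my own illustration, not something taken from the book or the codebase.

```python
import pandas as pd

def ma_crossover_signals(close: pd.Series, window: int = 20) -> pd.Series:
    """+1 where price crosses above its moving average, -1 where it crosses below, else 0."""
    ma = close.rolling(window).mean()
    above = (close > ma).astype(int)  # comparisons against NaN evaluate to False
    signal = above.diff().fillna(0).astype(int)
    # Only trust a crossing when the average exists on this bar and the previous one.
    valid = ma.notna() & ma.shift(1).notna()
    return signal.where(valid, 0)

# Toy prices: a rise followed by a fall.
close = pd.Series([1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
signals = ma_crossover_signals(close, window=3)
```

A +1 would be read as a possible entry and a -1 as a possible exit. Real strategies would add filters and risk rules on top of this.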

This post describes a method of selecting multiple technical indicators that capture the variance of a market, then combining them into a single output that classifies market conditions.

💡 Note, risk management is out of scope of this post but is extremely important!

🪖 Objectives of Post

Generate technical analysis of time-series pricing data.
Use k-means to cluster the technical analysis features and label the pricing data.
Review results.

🌦 Types of Market Conditions

Trending refers to either an uptrend or a downtrend. In an uptrend we see the price climbing to higher highs and higher lows, whilst a downtrend is described by lower highs and lower lows. Investopedia trending definition
Ranging, refers to the price cycling between an upper and lower bound, with no distinctive trend up or down. Investopedia ranging definition

💿 Data

For this post we will be using forex pricing data, because it’s easily attainable. However, the techniques used could be applied to any type of time-series pricing data.

The data used is the EURUSD currency pair, taken at 1 minute intervals. HistData.com provides this data and warns against using it to inform any trading decision, due to possible inaccuracies.

💶 Prices

The plot below is a snippet of the data. We have four prices for each 1 minute interval.

open — opening price (starting price of interval).
high — highest price during the interval.
low — lowest price during the interval.
close — closing price (final price before next interval).
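As a rough sketch of getting these four prices into pandas, the snippet below assumes the file is semicolon-separated with no header row, in the column order date-time, open, high, low, close, volume. That layout is an assumption and should be checked against the downloaded file. The two sample rows are made up for illustration only.

```python
import io
import pandas as pd

# Assumed column order; verify against the actual download.
COLUMNS = ["datetime", "open", "high", "low", "close", "volume"]

def load_prices(source) -> pd.DataFrame:
    """Read semicolon-separated 1 minute bars into a date-time indexed frame."""
    df = pd.read_csv(source, sep=";", header=None, names=COLUMNS)
    df["datetime"] = pd.to_datetime(df["datetime"], format="%Y%m%d %H%M%S")
    return df.set_index("datetime").sort_index()

# Made-up rows, standing in for a real file path.
sample = io.StringIO(
    "20220104 113600;1.12900;1.12920;1.12890;1.12910;0\n"
    "20220104 113700;1.12910;1.12950;1.12900;1.12940;0\n"
)
prices = load_prices(sample)
```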

🎪 Example of Data

In this data snippet there is an obvious hill-type feature. This is an example of a market cycle, where the price starts in a ranging state (accumulation), moves into an uptrend (mark up), reaches a peak (distribution), and then reverses into a downtrend (mark down), finishing in a ranging state.

Investopedia market cycle definition

🧐 Technical Analysis

🐍 Python Library

TA-Lib is the Python library used to generate the technical analysis. It offers over 70 functions including categories of overlap, momentum, cycle, pattern recognition, and more.

💡 Link to TA-Lib docs here

👍 Choosing Functions

Principal Component Analysis (PCA) was used to select the TA functions used in this post. The selected functions captured a significant amount of the variance in the data. Only 3 were selected to keep the analysis explainable.

The combination includes 2 momentum indicators, the Commodity Channel Index and the Aroon Oscillator, and 1 cycle indicator, the Hilbert Transform - Dominant Cycle Phase. However, there could be other combinations that perform well.
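The post does not spell out how PCA was turned into a ranking of indicators, so the following is only one plausible sketch: standardise the indicator columns, fit PCA, and score each indicator by its absolute loadings weighted by the variance each component explains. The function name, the scoring rule and the synthetic columns `ind_a` to `ind_d` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def rank_features_by_pca(features: pd.DataFrame, n_components: int = 3) -> pd.Series:
    """Score each column by its absolute PCA loadings, weighted by explained variance."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=n_components).fit(scaled)
    weights = pca.explained_variance_ratio_[:, None]  # shape (n_components, 1)
    scores = (np.abs(pca.components_) * weights).sum(axis=0)
    return pd.Series(scores, index=features.columns).sort_values(ascending=False)

# Synthetic stand-ins for indicator outputs (not real market data).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "ind_a": base + rng.normal(scale=0.1, size=500),
    "ind_b": base + rng.normal(scale=0.1, size=500),
    "ind_c": rng.normal(size=500),
    "ind_d": rng.normal(size=500),
})
ranking = rank_features_by_pca(demo, n_components=2)
```

The highest scoring columns would be candidates to keep. Highly correlated indicators tend to score similarly, so it is still worth checking that the final picks are not redundant.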

🚚 Momentum Indicators

Typically, momentum indicators signal how stable a trending price is. For example, if the price is increasing over time, the momentum indicator uses a window of previous prices to signal whether the new price is stable or likely to reverse. Investopedia momentum indicator definition

📺 Commodity Channel Index (CCI)

CCI uses the rate of rise and fall to determine if a currency, or stock, is overbought or oversold. These states lead to price corrections, where the price movement can reverse.

Its output is unbounded, but most values fall between roughly -200 and 200. Strongly negative values indicate oversold states and strongly positive values indicate overbought states. In an uptrend, the CCI generally shows positive values, whilst in a downtrend it generally shows negative values.

💡 CCI further explained here
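For intuition, here is a small pandas sketch of the textbook CCI formula: the typical price minus its moving average, divided by 0.015 times the mean absolute deviation over the same window. In practice TA-Lib's `CCI` function does this for us. The helper below is only an illustration and is not guaranteed to match TA-Lib to the last decimal.

```python
import numpy as np
import pandas as pd

def cci(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Textbook Commodity Channel Index over a rolling window."""
    typical = (high + low + close) / 3.0
    sma = typical.rolling(period).mean()
    # Mean absolute deviation of the typical price from its window mean.
    mean_dev = typical.rolling(period).apply(
        lambda w: np.abs(w - w.mean()).mean(), raw=True
    )
    # Note: a perfectly flat window gives mean_dev == 0 and an undefined value.
    return (typical - sma) / (0.015 * mean_dev)

# Toy data: a steadily rising price.
close = pd.Series(np.arange(20, dtype=float))
values = cci(close + 1, close - 1, close, period=5)
```

On this steadily rising toy series the last value sits above 100, which shows that the 0.015 constant scales the output without capping it.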

🔺🔻Aroon Oscillator

The Aroon Oscillator ranges between -100 and 100, where -100 signals oversold and 100 signals overbought. Aroon is more sensitive than CCI, so it seems better suited to indicating trends than peaks and troughs.

💡 Aroon further explained here
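As a sketch of where the -100 to 100 range comes from: Aroon Up measures how recently the highest high occurred within the window, Aroon Down does the same for the lowest low, and the oscillator is their difference. The helper below is my own illustration. How ties are broken (here the oldest extreme wins) is an assumption and may differ from TA-Lib's `AROONOSC`.

```python
import numpy as np
import pandas as pd

def aroon_oscillator(high: pd.Series, low: pd.Series, period: int = 14) -> pd.Series:
    """Aroon Up minus Aroon Down, each scaled to 0..100."""
    window = period + 1  # current bar plus `period` bars of history
    since_high = high.rolling(window).apply(
        lambda w: float(len(w) - 1 - np.argmax(w)), raw=True
    )
    since_low = low.rolling(window).apply(
        lambda w: float(len(w) - 1 - np.argmin(w)), raw=True
    )
    aroon_up = 100.0 * (period - since_high) / period
    aroon_down = 100.0 * (period - since_low) / period
    return aroon_up - aroon_down

# Toy data: one steadily rising and one steadily falling series.
rising = pd.Series(np.arange(10, dtype=float))
falling = rising[::-1].reset_index(drop=True)
osc_up = aroon_oscillator(rising + 1, rising, period=5)
osc_down = aroon_oscillator(falling + 1, falling, period=5)
```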

🚴‍♂️ Cycle Indicator

Cycle indicators are used to pick out repeating patterns known as cycles, signalling the start or end of a cycle. The cycle indicators offered by TA-Lib all use the Hilbert transform, a technique also used in signal processing to break a signal down into the component signals that make up the overall input. Wiki Hilbert transform definition

🧞‍♂️ Hilbert Transform — Dominant Cycle Phase (HTDP)

Tracks the phase of the dominant cycle in the pricing data. It ranges from 0 to 360 (includes all degrees of a circle), and is said to signal the most likely period. The start of a new cycle is indicated when the phase of the dominant cycle reaches 360 and jumps from 360 to 0.

💡 HTDP further explained here

🧐 Analysis of Technical Indicators

This is a review of the technical analysis occurring between the pink lines as indicated on the plot below. This is where the price shows a clear cycle.

🚚 Momentum

In the period between the pink lines, both CCI and Aroon shift from positive to negative values, signalling a change from uptrend to downtrend. Before this price reversal, both signals are close to their upper bound, indicating the price is overbought.

🚴‍♂️ Cycle

Our cycle indicator (HTDP) shows two cycles between the pink lines. Firstly, it signals the larger cycle and then a smaller cycle. Comparing this to the price, we see a potential small cycle within the larger one, on the downtrend.

🧩 Added complexity

This example of a market cycle makes it relatively easy to analyse the three indicators. In reality, price movements and patterns are not often as visually well-defined.

Therefore, reading the signals from these indicators becomes more complex. In addition, if we want to add more indicators to our combination, our analysis could start to become very time consuming.

👩‍💻 K-Means Clustering

Using k-means, we can combine the technical analysis outputs into a single indicator that classifies market states. This simplifies the analysis and makes it easier to interpret.

The beauty of using unsupervised learning for this classification is that we do not need the labour intensive step of labelling our data.

K-means is a good first step approach to clustering. It is a fairly simple clustering algorithm and is often used when exploring a data set to learn commonalities in the data.

Due to its simplicity the k-means algorithm requires the user to specify the number of means before running the clustering.
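A minimal sketch of this step with Scikit-Learn is shown below. The feature matrix is synthetic, standing in for one row per minute with the three indicator values. Scaling the features before clustering is my assumption, made because the indicators live on very different ranges, and is not necessarily what the codebase does.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an (n_samples, 3) matrix of CCI, Aroon and HTDP values.
features, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

# Put all features on a comparable scale so no single indicator dominates the distance.
scaled = StandardScaler().fit_transform(features)

# The number of means has to be chosen up front.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
```

Each entry in `labels` is the cluster assigned to the corresponding row, which is what later gets plotted against the price.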

🥇 Optimal Number of Clusters

To best split the data, we aim to label our data points in clusters that are compact and far away from each other.

Therefore, the silhouette score is a good option for deciding the optimal number of means. The score combines the distance between clusters and the distance between data points within clusters. A small distance within clusters and a large distance between clusters gives a higher score, signalling a better fit.

We can see the silhouette score plot peaks at 3 means. Therefore, 3 clusters are optimal when splitting the data.

💡 Silhouette further explained here
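The search for the number of means can be sketched as a simple loop. The data here is synthetic, with three deliberately well separated groups, so the peak at 3 is built into the example rather than being a result about the market data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well separated synthetic groups (illustrative only).
centers = [[0, 0], [10, 10], [-10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 means compact, distant clusters

best_k = max(scores, key=scores.get)
```

Plotting `scores` against `k` gives the kind of silhouette plot described above.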

🧐 Cluster Analysis

Using the box plot below, we can analyse how our data points have been clustered based on their features.

Cluster 0:

HTDP (ht) values are distributed mainly between 200 and 360, signalling that the majority of data points appear towards the end of a cycle.
Aroon values are mid-range with a negative mean, suggesting range-bound prices tending towards a downtrend.
CCI shows the most negative values compared to the other clusters, suggesting data points are downward trending and appear towards the end of a cycle, where price reversals occur.

Cluster 1:

HTDP averages around 0. Therefore, this cluster of data points is generally at the start of a new cycle.
Aroon shows the most negative average of all clusters, suggesting it signals a downtrend, leading to oversold prices that could reverse.
CCI is negative on average, similar to cluster 0, suggesting downtrend data points. However, this cluster is less likely to show price reversals.

Cluster 2:

HTDP has the widest distribution among clusters, but is mid-range on average, suggesting this cluster generally consists of mid-cycle data points.
Aroon values are positive, with few negative outliers, suggesting uptrends and overbought prices that could reverse.
CCI scores are positive on average, suggesting uptrends. The cluster includes large positive outliers, warning of price reversals from overbought prices.
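The numbers behind a box plot like this can also be tabulated per cluster. The sketch below uses random values and hypothetical column names (`cci`, `aroon`, `ht`, `cluster`), so its output says nothing about the real clusters. It only shows the shape of the summary.

```python
import numpy as np
import pandas as pd

# Random stand-in for the labelled feature table (not real results).
rng = np.random.default_rng(0)
labelled = pd.DataFrame({
    "cci": rng.normal(0, 100, 300),
    "aroon": rng.uniform(-100, 100, 300),
    "ht": rng.uniform(0, 360, 300),
    "cluster": rng.integers(0, 3, 300),
})

# Count, mean, std, min, quartiles and max of every feature, per cluster.
summary = labelled.groupby("cluster")[["cci", "aroon", "ht"]].describe()
```

The `25%`, `50%` and `75%` columns correspond to the box edges and median line in the plot.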

⏳ Clusters in Time-Series View

Plotting the price with the clusters marked out in colour helps to reveal cluster behaviour. Similar to the box plot analysis, clusters 1 and 2 (red and green) stand out as showing downtrends and uptrends respectively, whilst cluster 0 (blue) indicates the end of a cycle, signalling the price is consolidating.

Clusters appear well defined in trending market states, when price movements are large. Whilst, a ranging market is more difficult to indicate.

Prices occurring after date-time 20220104 113600, toward the end of the plot show a ranging market, bound between 1.1289 and 1.1295. However, no single class signals the ranging state. Instead our classes switch often as the model seems to pick out small cycles in the price.
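One way to quantify how often the classes switch is to collapse the label series into contiguous runs. The helper below is a hypothetical illustration on a toy label sequence. The resulting start and end points could also drive the coloured spans in a plot.

```python
import pandas as pd

def label_runs(labels: pd.Series) -> pd.DataFrame:
    """Collapse a label series into contiguous runs with start, end and length."""
    run_id = (labels != labels.shift()).cumsum()  # increments whenever the label changes
    grouped = labels.groupby(run_id)
    return pd.DataFrame({
        "cluster": grouped.first(),
        "start": grouped.apply(lambda s: s.index[0]),
        "end": grouped.apply(lambda s: s.index[-1]),
        "length": grouped.size(),
    }).reset_index(drop=True)

# Toy label sequence, one label per minute.
labels = pd.Series([1, 1, 2, 2, 2, 0, 1, 1])
runs = label_runs(labels)
```

Many short runs in a stretch of data would match the frequent switching seen in the ranging part of the plot.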

🧪 Classifications on Test Data

The plot below shows cluster predictions for the last 1% of pricing data. This was left out during our model's training step. The plot is highlighted green for cluster 2 (uptrends and peaks) and red for clusters 1 and 0 (downtrends and end of cycle price consolidation).

The test data shows a downtrend followed by an uptrend, with no significant ranging state. Whilst the model has classified uptrends and downtrends, if we consider any date-time before 20220132 202000 to be a downtrend, and any price after to be an uptrend, the model does not classify them consistently. This is likely a result of using a cycle indicator, which is attuned to smaller cycles.

Therefore, for longer term strategies this method would require some tuning.
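The hold-out described in this section can be sketched as a chronological split: fit on the first 99% of rows, then predict the final 1% without refitting. The features are random placeholders, and fitting the scaler on the training rows only is my assumption about a sensible setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random placeholder for time-ordered indicator rows.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))

# Keep the last 1% of rows, in time order, as unseen test data.
n_test = max(1, int(len(features) * 0.01))
train, test = features[:-n_test], features[-n_test:]

# Fit scaling and clustering on the training rows only.
scaler = StandardScaler().fit(train)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(train))

# Assign each unseen row to its nearest learned cluster centre.
test_labels = model.predict(scaler.transform(test))
```

Because time-series rows are ordered, the split is taken from the end rather than sampled at random.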

🎬 Conclusion

👀 Where Could This Classification be Used?

To simplify technical analysis when using multiple indicators.
Has the potential to incorporate labelling into generation of trading signals.
Used as a single metric to check the states of many markets (code does the heavy lifting).
Reviewing trading performance within a market state.

💪 Strengths

Analysis of technical indicators was simplified into a single clear visual of price labelled by market state clusters.
With the use of Python’s Scikit-Learn and TA-Lib, coding this solution is relatively simple. Furthermore, the results were fast, and can be scaled/generalised to other markets.
It was important to include a cycle indicator, as it showed distinctive differences between the 3 clusters and is therefore likely a strong feature for splitting the data. However, including this indicator makes it difficult to use the classifications to label long term trends.

❤️‍🩹 Weaknesses

Only a visual review of the model's classifications was possible using the unsupervised learning method. Usually, we would make use of a larger test data set with known labels to produce model performance metrics.
An inability to indicate ranging states with a single class. This could be traced back to the technical indicators selected.
Difficulty consistently labelling long term trends. Only the momentum indicators are time adjustable. Therefore, including the cycle indicator limits how we adjust for time. The momentum indicators tend to work best with the default time period of 14 ticks when paired with the cycle indicator. However, we could average the input data, in order to pick up longer term trends, rather than extending the time period of the technical indicators.
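One way to read "averaging the input data" is to aggregate the 1 minute bars into longer bars before computing any indicators. The sketch below resamples into 5 minute open, high, low and close values, which is aggregation rather than a literal average. The bar length, the helper name and the toy prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy 1 minute bars (made-up prices).
idx = pd.date_range("2022-01-04 11:30", periods=10, freq="1min")
close = pd.Series(np.linspace(1.1290, 1.1299, 10), index=idx)
bars = pd.DataFrame({
    "open": close,
    "high": close + 0.0001,
    "low": close - 0.0001,
    "close": close,
})

def resample_ohlc(df: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate shorter bars into longer ones, keeping OHLC semantics."""
    agg = {"open": "first", "high": "max", "low": "min", "close": "last"}
    return df.resample(rule).agg(agg).dropna()

longer = resample_ohlc(bars)
```

The indicators would then be computed on `longer` with their default periods, so a 14 tick window covers 70 minutes instead of 14.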

🕵️ Further Work

Increasing the number of clusters could lead to better signals for peaks and troughs.
Creating rules based on the clustering and back testing.
Exploring the use of different combinations of technical analysis functions.
Tuning the time range to pick up longer trends by averaging the input data.

💡 Codebase found here