Day-Trading with AI: When to Hold, When to Fold, and When to Not Play!
TLDR:
Characterizing day-trading markets with TML makes it possible to tailor strategies to specific market “personalities”. Those same “personalities” also suggest that profitable trading is effectively out of reach on roughly 40 percent of trading days!
Background Post: The New AI Gold Rush — Transdimensional Machine Learning (Pan Provided!)
A Jupyter Notebook with the accompanying code is provided in the GitHub repository HERE.
An Example:
- For a given underlying instrument, 10-minute OHLC candles were built from tick data covering an approximately 10-year period.
- Backtesting was performed on the candle data using a baseline “Buy/Sell” indicator, evaluating (9) cases per day spanning fixed stop-loss, ATR-based stop-loss, and no-stop-loss instances.
- Each case was optimized over the baseline indicator hyperparameter, asymmetrically for LONG and SHORT positions and for both BEST and WORST day P/L, recording the max and min day-trading P/L, the number of trades, and the associated optimal hyperparameter value.
- For each day, the results of the (9) cases were averaged to form that day’s “raw-data” vector (a sketch of this step follows the list).
- Preliminary data exploration using TML established the probable number of underlying clusters to be approximately (9).
- The “Fitness Function” was modified to focus the Hybrid-NEAT evolutionary process on (9)-cluster solutions.
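A minimal sketch of how the per-day raw-data vectors could be assembled. The column names and the helper below are hypothetical; only the overall shape (9 cases per day, averaged into one 12-dimensional vector per day) comes from the description above.
import numpy as np
import pandas as pd

# Hypothetical per-day backtest results: one row per (day, case), 9 cases per day,
# with 12 recorded metrics per case (best/worst P/L, trade counts, optimal
# hyperparameter values for LONG and SHORT). Column names are illustrative only.
metric_cols = [
    "long_best_pl", "long_worst_pl", "long_best_trades", "long_worst_trades",
    "long_best_hp", "long_worst_hp",
    "short_best_pl", "short_worst_pl", "short_best_trades", "short_worst_trades",
    "short_best_hp", "short_worst_hp",
]

def build_daily_vectors(backtest_results: pd.DataFrame) -> pd.DataFrame:
    """Average the (9) per-day cases into one 12-D raw-data vector per day."""
    return backtest_results.groupby("date")[metric_cols].mean()

# daily_vectors = build_daily_vectors(backtest_results)
# data_orig = daily_vectors.to_numpy()   # shape: (num_days, 12)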
Resulting “Typical” Day-Trading Market “Personalities” by relative size:
“BEFORE” Applying TML (12-D to 2-D via tSNE):
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=100, n_iter=1000)
tsne_results_orig = tsne.fit_transform( data_orig )
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

df_subset['tsne-2d-one'] = tsne_results_orig[:,0]
df_subset['tsne-2d-two'] = tsne_results_orig[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue="y",
palette=['purple','red','darkcyan','brown','blue', 'dodgerblue','green','lightgreen', 'black'],
data=df_subset,
legend="full",
alpha=0.3 )
Applying TML (12-D to 1000-D via TML):
...
metric = "jaccard"
n_neighbors_max = 100
n_neighbors_min = 2
min_dist_max = 0.99
min_dist_min = 0.0
n_components_max = 1000
n_components_min = 1
min_samples_max = 1000
min_samples_min = 2
min_cluster_size_max = 2
min_cluster_size_min = 2
...
if ( num_clusters_found == 9 ):
    genome.fitness = 10000.0 / abs( clustered_COMB_sum_SE + 1 )
elif ( num_clusters_found == 0 ):
    genome.fitness = -99999.0
else:
    genome.fitness = 10000.0 / abs( clustered_COMB_sum_SE + 1 ) - ( abs( num_clusters_found - 9 ) * 1000.0 )
...
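For context, a minimal sketch of how one genome's decoded hyperparameters could be evaluated: a UMAP mapping from 12-D up to the evolved n_components (here as large as 1000-D), followed by HDBSCAN clustering, with the outputs feeding the fitness terms above. This is an assumption about the notebook's evaluation step, not a copy of it; in particular the exact formula behind clustered_COMB_sum_SE is not shown in the post, so a placeholder combination of squared probability and outlier-score errors is used.
import numpy as np
import umap
import hdbscan

def evaluate_genome_params(data_orig, metric, n_neighbors, min_dist,
                           n_components, min_samples, min_cluster_size,
                           cluster_selection_epsilon):
    # Map the 12-D raw-data vectors up to the evolved dimensionality.
    mapper = umap.UMAP(metric=metric, n_neighbors=n_neighbors,
                       min_dist=min_dist, n_components=n_components)
    embedding = mapper.fit_transform(data_orig)

    # Cluster the high-dimensional embedding with HDBSCAN.
    fit_HDBSCAN = hdbscan.HDBSCAN(
        min_samples=min_samples,
        min_cluster_size=min_cluster_size,
        cluster_selection_epsilon=cluster_selection_epsilon,
    ).fit(embedding)

    labels = fit_HDBSCAN.labels_
    num_clusters_found = len(set(labels)) - (1 if -1 in labels else 0)
    ratio_clustered = np.mean(labels != -1)

    # Placeholder error term standing in for clustered_COMB_sum_SE; the
    # notebook's actual combination is not reproduced in the post.
    finite_scores = fit_HDBSCAN.outlier_scores_[np.isfinite(fit_HDBSCAN.outlier_scores_)]
    clustered_COMB_sum_SE = (np.sum((1.0 - fit_HDBSCAN.probabilities_) ** 2)
                             + np.sum(finite_scores ** 2))

    return num_clusters_found, ratio_clustered, clustered_COMB_sum_SE, fit_HDBSCAN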
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
New best_fitness_so_far = -2984.7710672587614 1
New best: metric = jaccard
New best: n_neighbors = 98
New best: min_dist = 0.06658783809866256
New best: n_components = 1000
New best: min_samples = 3
New best: min_cluster_size = 2
New best: cluster_selection_epsilon = 0.6658783809866257
OUT: num_clusters_found = 12
OUT: ratio_clustered = 1.0
OUT: clusterer_probabilities_sum = 0.9558447965277097
OUT: clusterer_probabilities_sum_SE = 184.0931575208609
OUT: clusterer_outlier_scores_sum = 0.13366803680011266
OUT: clusterer_outlier_scores_sum_SE = 471.55167569985406
OUT: clustered_COMB_sum_SE = 655.644833220715
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
…
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
New best_fitness_so_far = 182.75239956493343 104
New best: metric = jaccard
New best: n_neighbors = 100
New best: min_dist = 0.9899882983797373
New best: n_components = 1000
New best: min_samples = 2
New best: min_cluster_size = 2
New best: cluster_selection_epsilon = 9.899882983797372
OUT: num_clusters_found = 9
OUT: ratio_clustered = 1.0
OUT: clusterer_probabilities_sum = 0.9978606463926271
OUT: clusterer_probabilities_sum_SE = 1.649803561162849
OUT: clusterer_outlier_scores_sum = 0.03316588079964379
OUT: clusterer_outlier_scores_sum_SE = 52.31724504377773
OUT: clustered_COMB_sum_SE = 53.96704860494058
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
“AFTER” Applying TML (1000-D to 2-D via tSNE):
time_start = time.time()
raw_data = fit_HDBSCAN._raw_data
tsne = TSNE(n_components=2, verbose=1, perplexity=100, n_iter=1000)
tsne_results_1 = tsne.fit_transform( raw_data )
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

df_subset['tsne-2d-one'] = tsne_results_1[:,0]
df_subset['tsne-2d-two'] = tsne_results_1[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue="y",
palette=['purple','red','darkcyan','brown','blue', 'dodgerblue','green','lightgreen', 'black'],
data=df_subset,
legend="full",
alpha=0.3 )
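The hue column y used in both scatter plots above is not defined in the snippets; presumably it holds each day's HDBSCAN cluster label. A one-line assumption to that effect:
# Assumed, not shown in the post: color each day by its HDBSCAN cluster label.
df_subset['y'] = fit_HDBSCAN.labels_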
...
unique_elements, counts_elements = np.unique(fit_HDBSCAN.labels_, return_counts=True)
print("Frequency of unique values of the said array:")
print(np.asarray((unique_elements, counts_elements)))
# Frequency of unique values of the said array:
# [[  0   1   2   3   4   5   6   7   8]
#  [638 893 269  19  41  23 486 159 225]]

threshold = pd.Series(fit_HDBSCAN.outlier_scores_).quantile(0.9)
# threshold = 0.09822259079456185
outliers = np.where(fit_HDBSCAN.outlier_scores_ > threshold)[0]

sns.distplot(fit_HDBSCAN.outlier_scores_[np.isfinite(fit_HDBSCAN.outlier_scores_)], rug=True)
...
LONG trade characteristics: avoid LONG trades on the 38.5 percent of trading days associated with Cluster IDs 6, 7, 8 and 9.
SHORT trade characteristics: avoid SHORT trades on the 40.3 percent of trading days associated with Cluster IDs 0, 2, 5 and 8.
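A minimal sketch of how these cluster memberships could gate a strategy: map each trading day to its cluster label and skip LONG or SHORT entries on days belonging to the unprofitable personalities. The avoid-lists mirror the cluster IDs quoted above; the day-to-cluster assignment itself would come from fit_HDBSCAN.labels_ aligned with the daily raw-data vectors, and the helper below is hypothetical.
# Cluster IDs quoted above for which each direction should be avoided.
AVOID_LONG_CLUSTERS = {6, 7, 8, 9}
AVOID_SHORT_CLUSTERS = {0, 2, 5, 8}

def allowed_directions(day_cluster: int) -> dict:
    """Return which trade directions are permitted for a day's cluster label."""
    return {
        "LONG": day_cluster not in AVOID_LONG_CLUSTERS,
        "SHORT": day_cluster not in AVOID_SHORT_CLUSTERS,
    }

# Example: a day assigned to cluster 8 permits neither direction.
# allowed_directions(8) -> {"LONG": False, "SHORT": False}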
Summary:
- Avoid trading LONG on the roughly 38 percent of trading days associated with Cluster IDs 6, 7, 8 and 9.
- Avoid trading SHORT on the roughly 40 percent of trading days associated with Cluster IDs 0, 2, 5 and 8.
- The market has multiple “personalities”, rendering a single “one-size-fits-all” strategy inadequate!
- The ability to identify individual market “personalities” enables tailoring strategies to each specific “persona” in pursuit of profitability.
- Risk management dictates knowing “How-to-play”, but more importantly, knowing “When-not-to-play”!
- Each 12-dimensional raw-data vector was transformed into a 1000-dimensional vector using TML to achieve cluster separation, then reduced to 2-D with tSNE for visualization.
- Transdimensional Machine Learning (TML) could be defined as a holistic application perspective in which data preparation, metric selection/creation, manifold mapping, AI/ML/DL tool selection, and fitness-function design are driven solely by the specifics of the intended use-case and, more importantly, are unconstrained by the dimensionality of either the underlying raw data or the manifold mapping.
Inspiration:
- UMAP, Leland McInnes
- HDBSCAN, Leland McInnes, John Healy, Steve Astels
- NEAT, Kenneth Stanley
- How to Tune Hyperparameters of tSNE, Nikolay Oskolkov
About Andrew (Andy) Carl:
Andy is the enthusiastic developer of the “GitHub AI Brain-of-Brains” and “GITHUB2VEC” NLP productivity tools. He is a passionate multi-discipline Aerospace/Mechanical Engineer with extensive experience integrating Artificial Intelligence, Hybrid Reinforcement Machine Learning (Hybrid-NEAT), data science, and multi-discipline simulation into the design and analysis of complex air, space, and ground-based systems, as well as engineering tool development.