Adding artificial intelligence to your investing strategy; part 2

Patrick Collins
Published in Alpha Vantage
Jan 29, 2020

Clean and visualize your data


Every machine learning / AI project (and every algorithmic trading or investing project, too) follows a few core steps:

  1. Get data
  2. Play with the data and discover insights through visualization
  3. Clean and prepare the data
  4. Train a model
  5. Fine-tune it
  6. Run real-time, monitor, and maintain
  7. Repeat with new insights (important!)

For this tutorial, we are going to focus on how Python and scikit-learn can help us with steps 1–4.

Let’s say we suspect there might be some relationship between the daily percent changes of TSLA and SPY. Luckily, with Alpha Vantage the first step, getting the data, is easy. You may need to pip install a few packages first!

from alpha_vantage.timeseries import TimeSeries
import pandas as pd
from datetime import datetime

ts = TimeSeries(output_format = 'pandas', key = "XXX")
# If you have your ALPHAVANTAGE_API_KEY set you can just use:
# ts = TimeSeries(output_format = 'pandas')
# Get a free key at https://www.alphavantage.co/support/#api-key

# Pull the full daily adjusted history, tag each row with its ticker, and move
# the date out of the index into a regular column.
spy, spy_meta_data = ts.get_daily_adjusted(symbol = 'SPY', outputsize = 'full')
spy.insert(0, "ticker", 'SPY', True)
spy = spy.reset_index()
tsla, tsla_meta_data = ts.get_daily_adjusted(symbol = 'TSLA', outputsize = 'full')
tsla.insert(0, "ticker", 'TSLA', True)
tsla = tsla.reset_index()

# Trim SPY so both dataframes start on TSLA's first trading day.
spy = spy[spy['date'] >= tsla['date'].min()]

This gives you two dataframes, one for Tesla and one for the S&P 500 index, which is great! We can start some initial screening of the data… although you may quickly find that, in this format, there isn’t much to explore.
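Before reaching for plots, it can be worth printing a few rows just to see what came back:

# A quick look at the raw data: row counts and the column names the
# alpha_vantage pandas output uses.
print(spy.shape, tsla.shape)
print(tsla.head())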

import matplotlib.pyplot as plt

# Histogram of every numeric column in the Tesla dataframe.
tsla.hist(bins=50, figsize=(20,15))
plt.show()

This doesn’t tell us much of anything. Let’s add some columns and merge the two dataframes.

# Add a daily percent change column and a day-of-week column (Monday = 0).
spy.insert(10, "% change", spy['5. adjusted close'].pct_change(), True)
spy.insert(11, "weekday", spy['date'].dt.dayofweek, True)
tsla.insert(10, "% change", tsla['5. adjusted close'].pct_change(), True)
tsla.insert(11, "weekday", tsla['date'].dt.dayofweek, True)

# Merge the two dataframes into one long table, one row per ticker per day.
from functools import reduce
dfs = [tsla, spy]
tickers = reduce(lambda left,right: pd.merge(left,right,how = 'outer'), dfs)
tickers = tickers.sort_values(by = ['date', 'ticker'])
tickers

If functools.reduce or lambda functions are new to you, they are worth a quick look in the Python documentation; a tiny illustration follows.
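Here is a minimal, self-contained example of what reduce does:

from functools import reduce

# reduce folds a sequence down to a single value by repeatedly applying a
# two-argument function; the lambda here just adds pairs of numbers.
total = reduce(lambda left, right: left + right, [1, 2, 3, 4])
print(total)  # 10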

Now that we have our data in a better format, we can start using some tools to look for trends. Here are some common ones:

tickers.info()
tickers["ticker"].value_counts()
tickers.describe()

Of course, this raw data isn’t really that helpful on its own, so we have to dig a little deeper. By temporarily mapping the ticker strings to integers, we can visualize comparisons between tickers and dates. (Only temporarily; we’ll come back to this later.)

test_tickers = tickers.copy()
test_tickers.loc[(test_tickers['ticker'] == 'SPY'), 'ticker'] = 0
test_tickers.loc[(test_tickers['ticker'] == 'TSLA'), 'ticker'] = 1
# Make the mapped ticker column numeric so it can drive the color scale.
test_tickers['ticker'] = test_tickers['ticker'].astype(int)
test_tickers.plot(kind='scatter', x='weekday', y='% change', alpha=0.5,
                  s=test_tickers["6. volume"]/100000, figsize=(10,7), label="volume",
                  c='ticker', cmap=plt.get_cmap('jet'), colorbar=True)

This gives us a slightly more meaningful comparison of the data, and we can use the scatter_matrix function to get another visualization of correlations between variables.

from pandas.plotting import scatter_matrix

attributes = ["6. volume", "5. adjusted close", "1. open", 'weekday', '% change']
scatter_matrix(tickers[attributes], figsize=(12, 8))

Or more appropriately with test_tickers.corr()

The correlations range from -1 to 1, where 1 means a perfect positive linear relationship, -1 a perfect negative one, and values near 0 mean there is little linear correlation at all.
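For example, one common way to scan for relationships is to rank how strongly each numeric column moves with the daily percent change. A rough sketch (the numeric_only flag is only needed on newer pandas versions to skip the date column; older versions drop non-numeric columns automatically):

# Rank correlations against the % change column.
corr_matrix = test_tickers.corr(numeric_only=True)
print(corr_matrix['% change'].sort_values(ascending=False))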

Sadly, it doesn’t look like we’ve found much in the way of trends here, but it can still be fun to see what you can do. Let’s try feeding this data into a machine learning algorithm. First, we need to clean the data. You’ll notice above that we mapped weekdays and ticker symbols to numbers. That was fine for visualizing, but it can skew the results, because it implies a linear ordering between the weekdays that doesn’t really mean anything. The better way is to create a set of binary variables, for example monday = True, tuesday = False, and so on, as sketched below.
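Purely to illustrate the idea, on a throwaway copy so nothing downstream is affected:

manual = tickers.copy()
# One True/False column per weekday value; Monday is 0 in pandas' dayofweek.
for day_number, day_name in enumerate(['monday', 'tuesday', 'wednesday', 'thursday', 'friday']):
    manual[day_name] = (manual['weekday'] == day_number)
print(manual[['weekday', 'monday', 'tuesday', 'friday']].head())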

Writing those columns out by hand quickly gets annoying, though, and before long you’d have a massive array of data! An easier way is scikit-learn’s OneHotEncoder transformer. We also want to scale the numeric features; we do a very simple version of that here.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

tickers = tickers.sort_values(by = ['ticker', 'date'])
tickers_list = tickers[['ticker']]

# Turn the dates into plain integers (nanoseconds since epoch) so they can be
# scaled like any other numeric column.
tickers['date'] = tickers['date'].astype('int64')

# One-hot encode the ticker column: one 0/1 column per ticker symbol.
category_encoder = OneHotEncoder()
tickers_1hot = category_encoder.fit_transform(tickers_list)

# Keep track of which columns are numeric and which are categorical.
num_attribs = list(tickers.drop('ticker', axis = 1))
category_attribs = ['ticker']
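A quick sanity check on what the encoder produced (purely illustrative):

# categories_ lists the ticker values the encoder found; the result is a
# sparse matrix with one 0/1 column per ticker.
print(category_encoder.categories_)
print(tickers_1hot[:3].toarray())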

This makes the data much easier for an ML model to digest. Cleaning data often involves a lot of steps, and scikit-learn’s Pipeline class lets you chain them together so the same sequence of transformations is applied consistently every time.

from sklearn.pipeline import Pipeline

# Numeric columns: fill any missing values, then standardize them.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('std_scaler', StandardScaler()),
])

# Apply the numeric pipeline to the numeric columns and one-hot encode the ticker.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), category_attribs),
])
tickers_prepped = full_pipeline.fit_transform(tickers)
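It can be reassuring to check what came out the other end:

# One row per trading day per ticker; the scaled numeric columns come first,
# then the one-hot ticker columns appended by the ColumnTransformer.
print(tickers_prepped.shape)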

The last thing to do is fit an ML model. We are going to use a linear regression model again.

from sklearn.linear_model import LinearRegression
# Drop the rows whose % change is NaN (the first trading day for each ticker),
# then rebuild the feature matrix so the features and labels stay aligned.
tickers = tickers.dropna(subset = ['% change'])
tickers_prepped = full_pipeline.fit_transform(tickers)
tickers_labels = tickers['% change'].copy()
lin_reg = LinearRegression()
lin_reg.fit(tickers_prepped, tickers_labels)

Since we trained our model on the entire dataset, it may be a bit overfitted. We are going to ignore that for the moment, but in the future you really want to train your model on one subset of the data and then test it on a different subset. Scikit-learn has tools for that too:

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(tickers, test_size=0.2, random_state=2, shuffle = False)

This will give you a training set and a test set to do exactly that. Anyway, now that we have our model trained, we can start making predictions, including on data we have already seen.

# Run a handful of rows back through the pipeline and compare the predictions
# to the actual labels.
some_data = tickers.iloc[:5]
some_labels = tickers_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

Returns:

Predictions: [ 0.00959096  0.00447642  0.00547957 -0.00651265 -0.03053395]
Labels: [0.009590964598126028, 0.004476416039214115, 0.005479565662968255, -0.006512645101450443, -0.03053395044822982]

We can see that the predictions match the actual values almost exactly. That’s less impressive than it sounds: ‘% change’ is both the label and one of the input features in our pipeline, so the model can essentially read the answer off its inputs, and we evaluated it on data it was trained on. A more honest check is to leave the label out of the features and score the model on the held-out test set, roughly like the sketch below.
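Here is a minimal sketch of what that might look like, reusing train_set and test_set from the split above along with num_pipeline, num_attribs, and category_attribs; honest_pipeline and honest_num_attribs are names made up just for this example.

from sklearn.metrics import mean_squared_error
import numpy as np

# Leave the label out of the feature list so the model can't just copy it.
honest_num_attribs = [col for col in num_attribs if col != '% change']
honest_pipeline = ColumnTransformer([
    ("num", num_pipeline, honest_num_attribs),
    ("cat", OneHotEncoder(), category_attribs),
])

# Fit the preprocessing and the model on the training rows only...
train_prepped = honest_pipeline.fit_transform(train_set)
lin_reg = LinearRegression()
lin_reg.fit(train_prepped, train_set['% change'])

# ...then measure the error on rows the model has never seen.
test_prepped = honest_pipeline.transform(test_set)
test_predictions = lin_reg.predict(test_prepped)
rmse = np.sqrt(mean_squared_error(test_set['% change'], test_predictions))
print("Test RMSE:", rmse)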

These were just some quick tips on how to start playing with data and some of the awesome stuff you can do in Python. To go even deeper, check out the ageron/handson-ml GitHub repo, which walks through machine learning and deep learning in Python using Scikit-Learn and TensorFlow.

What are some tools you use to clean and visualize data? Share with us in the comments below!

Want to learn more?

Follow Alpha Vantage on Medium to catch the tutorials coming out soon, covering blockchain applications, machine learning with Python, hackathons, and a ton of other helpful topics.

You can also reach us on Slack, Twitter, or Discord.

#investing #machinelearning #AI #stockapi #fintech
