Predicting Gold Price with Machine Learning

Scikit-Learn under AWS SageMaker


Machine learning is a buzzword in the technology world right now, and it represents a major step forward in how computers can learn. It already reaches into many aspects of daily life: fraud detection in the financial sector, skin cancer diagnosis in healthcare, and recommendation engines in retail.

Machine learning in finance is reshaping the financial services industry like never before. Leading banks and financial services companies are deploying AI technology, including machine learning (ML).

In this article, I will try to predict the next day's gold price using Scikit-Learn in an AWS SageMaker Notebook.

Part 1: Install and Import Latest Package

Under AWS SageMaker Notebook, importing the yahoofinancials Python module kept raising ModuleNotFoundError. It took me hours to figure out that the AWS SageMaker Notebook instance doesn't come with the latest version of pip, and that was the root cause (for me at least).

pip install --upgrade pip
pip install yahoofinancials --user
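
If the import still fails after upgrading, it can help to check whether the module is visible to the interpreter the notebook kernel is actually running. The sketch below is one way to do that; the helper names `is_installed` and `pip_install_command` are made up for illustration, and only `importlib.util.find_spec` and `sys.executable` come from the standard library.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running kernel can import module_name."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command bound to the kernel's own interpreter."""
    return [sys.executable, "-m", "pip", "install", "--user", package]


if not is_installed("yahoofinancials"):
    # Pass this list to subprocess.check_call to actually run the install
    print(" ".join(pip_install_command("yahoofinancials")))
```

Calling pip through `sys.executable -m pip` installs into the same environment the kernel imports from, which avoids the case where a different `pip` on the PATH installs the package somewhere the notebook cannot see.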

Part 2: Gathering and Preparing Data

The yahoofinancials package is a powerful financial data module for pulling both fundamental and technical data from Yahoo Finance. It requires the Yahoo ticker symbol and the date range for which we want to import data.

ticker = "GC=F"
names = "Gold"
end_date= "2020-08-27"
start_date = "2001-01-01"
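
With those parameters, the download and the conversion to a DataFrame might look roughly like the sketch below. The helpers `prices_to_frame` and `fetch_prices` are hypothetical names, and the response layout (a `prices` list of dicts carrying `formatted_date` and `adjclose`) is an assumption about what `get_historical_price_data` returns, so check it against the version you have installed. The sample payload uses made-up numbers so the conversion can be tried without network access.

```python
import pandas as pd


def prices_to_frame(raw, ticker, name):
    """Turn a yahoofinancials-style response into a one-column DataFrame."""
    rows = raw[ticker]["prices"]
    frame = pd.DataFrame(
        {name: [row["adjclose"] for row in rows]},
        index=pd.to_datetime([row["formatted_date"] for row in rows]),
    )
    return frame.sort_index()  # oldest observation first


def fetch_prices(ticker, name, start_date, end_date):
    """Download daily prices; needs network access and yahoofinancials."""
    from yahoofinancials import YahooFinancials  # imported lazily on purpose

    raw = YahooFinancials(ticker).get_historical_price_data(
        start_date, end_date, "daily"
    )
    return prices_to_frame(raw, ticker, name)


# Hand-made payload in the assumed response shape (made-up prices)
sample = {
    "GC=F": {
        "prices": [
            {"formatted_date": "2020-08-26", "adjclose": 102.0},
            {"formatted_date": "2020-08-25", "adjclose": 101.5},
        ]
    }
}
values_demo = prices_to_frame(sample, "GC=F", "Gold")
print(values_demo)
```

In the notebook, something like `values = fetch_prices(ticker, names, start_date, end_date)` would then produce the `values` frame used in the rest of the article.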

It is crucial to ensure the data is clean, consistent and accurate. Check for missing values and fill or delete them where appropriate. In this case, we will replace NaN values using forward filling followed by backward filling.

Standardizing the values in a column ensures that your data aggregates correctly. Here that means converting the numeric columns to a common type.

# Forward fill first, then backward fill any NaN left at the start
# (fillna(method=...) is deprecated in recent pandas releases)
values = values.ffill(axis=0)
values = values.bfill(axis=0)
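
To see what each of the two fills does, here is a toy example using pandas' `ffill()` and `bfill()` methods on a series with made-up numbers and two gaps. It is only an illustration, not the article's data.

```python
import numpy as np
import pandas as pd

# Toy series with a gap at the start and another in the middle
prices = pd.Series([np.nan, 10.0, np.nan, np.nan, 13.0])

forward = prices.ffill()  # carries 10.0 across the middle gap; the leading NaN stays
filled = forward.bfill()  # fills the leading NaN with the next known value

print(filled.tolist())  # [10.0, 10.0, 10.0, 10.0, 13.0]
```

Forward filling only uses past observations. Backward filling copies a later value into an earlier row, so it is best kept for the few leading gaps that forward filling cannot reach.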

Always visually inspect your data

import matplotlib.pyplot as plt

# Plot the full gold price series
values.Gold.plot(figsize=(10, 7),color='r')
plt.ylabel("Gold Prices")
plt.title("Gold Price Series")

Part 3: Feature Selection and Training Data Preparation

Selecting appropriate features for the model is the next important step. Moving averages reflect the dominant trend and bias of gold prices, and so they arguably capture what the “insiders” know about gold, that is, the fundamental factors influencing gold price movements.

data = values

# add features
data['Gold/15MA'] = data[names].rolling(window=15).mean()
data['Gold/90MA'] = data[names].rolling(window=90).mean()

# add label
data['Gold-T+1'] = data[names].shift(-1)

data = data.dropna()
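
The effect of `rolling`, `shift(-1)` and `dropna` is easier to see on a tiny frame. The example below uses made-up numbers and a 3-row window (instead of 15 and 90) purely so the result fits on screen.

```python
import pandas as pd

toy = pd.DataFrame({"Gold": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Feature: the mean of the current row and the two rows before it
toy["Gold/3MA"] = toy["Gold"].rolling(window=3).mean()  # NaN for the first 2 rows

# Label: the next row's price, pulled up by one position
toy["Gold-T+1"] = toy["Gold"].shift(-1)  # NaN for the last row

# Rows without a full window or without a next-day price are dropped
clean = toy.dropna()
print(clean)
```

Only rows 2 and 3 survive: the first two lack a full window and the last one has no next-day price. On the real data the same logic costs the first 89 rows (because of the 90-day window) and the final row.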

A heatmap is a useful visualization tool for evaluating the correlation between each feature and the target column, as well as the correlation among the features themselves.

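import seaborn as sns  # seaborn is assumed here for drawing the heatmap

corr = data.corr()
plt.figure(figsize = (12,10))
plt.title('Correlation of df Features', y = 1.05, size=15)
sns.heatmap(corr, annot=True)  # show each correlation value in its cell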

Be sure to select non-overlapping subsets of your data for the training, validation and testing sets so that the evaluation is reliable. Moreover, to minimize any unpredictable impact of the dataset's sequential nature during model training, I used the Pandas sample() method to shuffle the training dataset.

t = .8
t = int(t*len(data))

# Train dataset with feature and label
train_data = data[:t]
train_data = train_data.sample(frac=1)

label_train = train_data['Gold-T+1']
feature_train = train_data.drop(['Gold-T+1'],axis=1)

# Test dataset with feature and label
test_data = data[t:]
label_test = test_data['Gold-T+1']
feature_test = test_data.drop(['Gold-T+1'],axis=1)
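
The same split can be wrapped in a small function, which makes it easy to check that the two sets really do not overlap. This is only a sketch: `chronological_split` is a hypothetical helper, and the fixed `seed` is an addition so the shuffle is reproducible.

```python
import pandas as pd


def chronological_split(frame, label, train_fraction=0.8, seed=0):
    """Split by position, then shuffle only the training rows."""
    cut = int(train_fraction * len(frame))
    train = frame.iloc[:cut].sample(frac=1, random_state=seed)
    test = frame.iloc[cut:]
    return (train.drop(columns=[label]), train[label],
            test.drop(columns=[label]), test[label])


toy = pd.DataFrame({"x": range(10), "y": range(1, 11)})
x_train, y_train, x_test, y_test = chronological_split(toy, "y")

# No row appears in both sets, and every test row comes after every training row
assert set(x_train.index).isdisjoint(x_test.index)
assert x_train.index.max() < x_test.index.min()
```

Cutting by position before shuffling keeps the test set strictly later in time than the training set, so the model is always evaluated on prices it could not have seen.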

Part 4: Regression modelling using Scikit-Learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is packaged with many Linux distributions, which encourages both academic and commercial use.

There are two types of supervised machine learning algorithms: regression and classification. Here we use a regression model, which predicts a continuous target value from a set of independent variables. Regression is mostly used for finding relationships between variables and for forecasting.

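# Create and fit a linear regression model
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

lreg = linear_model.LinearRegression()
linear = lreg.fit(feature_train, label_train)
label_predict = lreg.predict(feature_test)

print("Linear Regression model")
# coef_ follows the column order of feature_train, so pair each
# coefficient with its column name instead of hard-coding positions
terms = " + ".join("%.2f * %s" % (coef, name)
                   for coef, name in zip(linear.coef_, feature_train.columns))
print("Gold Price (y) = %s + %.2f (constant)" % (terms, linear.intercept_))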

# The mean squared error
print('Mean squared error: %.2f'
% mean_squared_error(label_test, label_predict))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
% r2_score(label_test, label_predict))

label_predict = pd.DataFrame(label_predict, index=label_test.index, columns=['price'])
ax = label_predict.plot(figsize=(10, 7))
label_test.plot(ax=ax)  # overlay the actual next-day prices on the same axes
plt.legend(['predicted_price', 'actual_price'])
plt.ylabel("Gold Price")
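
For readers who want to see what the two scores printed above measure, here is a minimal NumPy sketch of the standard definitions of mean squared error and the coefficient of determination. The function names `mse` and `r2` and the toy numbers are made up for illustration.

```python
import numpy as np


def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((actual - predicted) ** 2))


def r2(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]  # only the last prediction is off, by 1

print(mse(actual, predicted))  # 0.25
print(r2(actual, predicted))
```

A model that always predicts the mean of the actual values scores 0 on the coefficient of determination, and a perfect model scores 1, which is why values close to 1 are read as a good fit.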

Written by

Tech Blog, Consultant and Strategist
