HDSC Winter ’22 Premiere Project Presentation: Gold Price Prediction

HamoyeHQ · Published in Hamoye Blog · 6 min read · Mar 6, 2022

A project by Team Random Forest

How important is the forecast of the rise and fall of gold prices? The aim of this project is to accurately predict the future adjusted closing price of the Gold ETF over a given future period, using this dataset.

Deployed Project: https://goldpredict.herokuapp.com/

GitHub Repository: link

Team Random-Forest divided the project into smaller tasks and created a timeline with set deadlines.

  • Task 1: DATA CLEANING & PRE-PROCESSING

The dataset has 1,718 rows and 80 columns in total. It gathers data for attributes such as the oil price, the Standard and Poor’s (S&P) 500 index, the Dow Jones Index, US bond rates (10-year), the EUR/USD exchange rate, the prices of the precious metals silver and platinum as well as palladium and rhodium, the US Dollar Index, and the prices of Eldorado Gold Corporation and the Gold Miners ETF.

After importing the necessary libraries and reading and summarising the dataset, data cleaning and preprocessing were carried out. While cleaning the Gold Price Prediction data, with over 80 features, a heat map of the missing-value mask showed no colour variation, confirming that there were no null values. With Task 1 completed, the next step was Exploratory Data Analysis (EDA).
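
As a rough sketch of this loading and null-check step (the file name and plotting choices here are assumptions for illustration, not the team’s actual code):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical file name; the original dataset path is not given in the post.
df = pd.read_csv("gold_price_dataset.csv")
print(df.shape)       # expected: (1718, 80)
print(df.describe())  # summary statistics for the numeric features

# Heat map of the boolean missing-value mask: a single uniform colour
# (no colour variation) indicates there are no nulls in the dataset.
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing values per feature")
plt.show()
```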

  • Task 2: EXPLORATORY DATA ANALYSIS (EDA)

EDA entailed finding the correlations and relationships between each of the 80 features. The first step was to produce a summary table of the pairwise correlations between the features; since all 80 features are of numeric data type, every pair displayed some level of correlation.
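
A minimal sketch of that correlation summary, reusing the DataFrame `df` from the previous snippet (the target column name `Adj_close` is taken from the figure captions and may differ in the raw file):

```python
# Pairwise Pearson correlations between all 80 numeric features.
corr = df.corr()

# Features most strongly correlated with the adjusted closing price.
print(corr["Adj_close"].sort_values(ascending=False).head(10))
```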

A. The next step entailed checking the distribution plot of each feature (a plotting sketch follows the three observations below). The following was noted:

  1. The feature distributions of the markets (SP, DJ, EU, OF, USB, PLT, PLD, RHO, USDI, GDX, USO) and their respective price columns show a similar, noticeable trend: prices concentrate just before or after the median value of their range.
Fig 1.1 The features (Open, High, Low, Close, Adj_close) have a similar distribution curve, with the majority of their data points lying between $110 and $150

2. Another key inference is that the trend feature of every market follows a similar two-bin distribution, indicating alternating upward and downward price movements.

Fig 1.2 Histogram of EU_Trend

3. The trade volume of each base market (SP, GDX, USDI, etc.) is concentrated towards the left of its range, showing that smaller quantities are traded frequently while larger quantities are traded rarely.

Fig 1.3 Histogram of USDI_Volume
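
A rough sketch of these distribution checks (column names follow the captions of Figs 1.1 to 1.3 and may differ from the raw headers):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df["Adj_close"], kde=True, ax=axes[0])    # price mass near $110-$150
sns.histplot(df["EU_Trend"], bins=2, ax=axes[1])       # two-bin up/down trend
sns.histplot(df["USDI_Volume"], kde=True, ax=axes[2])  # volume skewed to smaller values
plt.tight_layout()
plt.show()
```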

B. A check on the 80 features showed that about 10% of them, i.e., 10 features, had a correlation with the target between -0.5 and 0.5. Feature engineering techniques could therefore be carried out on the remaining, strongly correlated features, while the weakly correlated ones could be ignored or deleted (a filtering sketch follows the figure below).

Fig. 2 (a) Features with correlation > 0.5 (b) Features with correlation < -0.5
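
A minimal sketch of that correlation filter against the target (again assuming the `Adj_close` column name):

```python
# Correlation of every feature with the adjusted closing price.
target_corr = df.corr()["Adj_close"].drop("Adj_close")

strong = target_corr[target_corr.abs() > 0.5]  # keep for feature engineering
weak = target_corr[target_corr.abs() <= 0.5]   # candidates to ignore or delete
print(f"kept {len(strong)} features, dropped {len(weak)}")

df_selected = df[list(strong.index) + ["Adj_close"]]
```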

  • Task 3: FEATURE ENGINEERING (FE)

The primary goal of this task was to identify the best set of features for building a useful and constructive model.

We used feature engineering to achieve two main objectives:

  • Creating a suitable input dataset that meets the requirements of the machine learning algorithm.
  • Improving the performance of the machine learning models.

Not all the features improved the model’s accuracy. Some features were redundant when compared to others. The attributes that were irrelevant to the problem were eliminated.

Feature selection solves these difficulties by selecting a subset of features that are most relevant to the problem. During the project, the following feature selection methods were used:

  • Correlation
  • Mutual Information
  • Recursive Feature Elimination (RFE)
  • Variance Threshold
  • Univariate Selection
  • ExtraTreesClassifier

Based on the selectors used in FE, the features chosen in common across these methods were used as a benchmark feature set for the modelling stage; a sketch of two of the methods follows the figure below.

Fig. Selected features using (a.) ExtraTreesClassifier (b.) Univariate Selection
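
A hedged sketch of two of the listed methods: univariate selection (SelectKBest) and tree-based importances. The write-up names ExtraTreesClassifier, but since the target here is a continuous price, the regressor variant is shown instead, and the hyper-parameters are illustrative assumptions:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, f_regression

X = df_selected.drop(columns=["Adj_close"])
y = df_selected["Adj_close"]

# Univariate selection: score each feature independently against the target.
kbest = SelectKBest(score_func=f_regression, k=10).fit(X, y)
print("Univariate picks:", list(X.columns[kbest.get_support()]))

# Tree-based importances: rank features by their contribution to error reduction.
trees = ExtraTreesRegressor(n_estimators=100, random_state=42).fit(X, y)
ranked = sorted(zip(trees.feature_importances_, X.columns), reverse=True)
print("Tree-based picks:", [name for _, name in ranked[:10]])
```
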
  • Task 4: MODELING & HYPERTUNING

After identifying the best features in the dataset, the next task was to determine the best working machine learning model for the project. This could only be carried out by experimenting & comparing the performances of different models.

This was where hyper-parameter tuning came in. Each model has its own set of parameters that must be tuned to obtain the ideal output. For every model, the goal was to minimise the error, i.e., to have predictions as close as possible to the actual values; this is one of the core objectives of hyper-parameter tuning.

The models employed in determining the best-predicted values were:

  • Random Forest
  • Decision Tree
  • ElasticNet
  • XGBoost

How well a regression model performs is measured by how close its predicted values are to the ground truth, so it is important to evaluate performance with an appropriate metric. In this case, R-squared (R²), also known as the coefficient of determination, was used to determine the goodness of fit of the models. The Random Forest algorithm gave the best fit, with an R² score of 0.999922.
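
A minimal sketch of this tuning-and-evaluation loop for the Random Forest model (the parameter grid and train/test split here are illustrative assumptions, not the team’s actual settings):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Exhaustively search a small parameter grid, scoring each candidate by R².
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="r2",  # the goodness-of-fit metric used in this project
    cv=5,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test R2:", r2_score(y_test, grid.predict(X_test)))
```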

Random Forest Algorithm: Random forest is a supervised machine learning algorithm that operates by constructing a multitude of decision trees at training time. In essence, a random forest is a group of decision trees, but there are certain differences between the two: a single decision tree learns one set of rules, which it uses to make decisions.

A random forest, by contrast, randomly samples features and observations, builds a forest of decision trees, and then averages their results. It is one of the most widely used algorithms because of its accuracy, simplicity, and flexibility.

The fact that it can be used for classification and regression tasks, combined with its nonlinear nature, makes it highly adaptable to a range of data and situations. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.
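
A small, self-contained illustration (not from the project) of the averaging behaviour described above: for regression, the forest’s prediction equals the mean of its individual trees’ predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to demonstrate the averaging behaviour.
Xd, yd = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xd, yd)

# Stack each tree's predictions and average them across the forest.
per_tree = np.stack([tree.predict(Xd) for tree in forest.estimators_])
assert np.allclose(per_tree.mean(axis=0), forest.predict(Xd))
```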

  • Task 5: DEPLOYMENT

The primary aim of the deployment is to make the model available to the public. Anyone across the world can predict future gold prices using the ML model developed by Team Random-Forest just by visiting the link at https://goldpredict.herokuapp.com/.

  1. Django, a Python web framework, was used for the deployment, since it allows easy deployment of web apps.
  2. A landing page was created using a form with the selected features displayed as labels; the user is prompted to enter the inputs accordingly and submit the form. The values of each feature were combined and passed through the prediction model as an ndarray to obtain a prediction (a view of this kind is sketched after this list).
  3. This prediction was then displayed to the user in a user-friendly manner.
  4. Aside from the Predict page, the website also includes a Team page introducing all the Team Random Forest members who contributed to this project and an About page, which gives a detailed description of the project.
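
A hedged sketch of such a Django prediction view; the form field names, model file, and template names below are hypothetical stand-ins, not the team’s actual code:

```python
import joblib
import numpy as np
from django.shortcuts import render

# Hypothetical serialized model and feature order, for illustration only.
model = joblib.load("gold_model.joblib")
FEATURES = ["Open", "High", "Low", "Close", "GDX_Close", "SF_Price"]

def predict(request):
    if request.method == "POST":
        # Combine the submitted feature values into a single-row ndarray.
        values = [float(request.POST[name]) for name in FEATURES]
        prediction = model.predict(np.array(values).reshape(1, -1))[0]
        return render(request, "result.html", {"prediction": prediction})
    # On GET, show the landing-page form with the feature labels.
    return render(request, "form.html", {"features": FEATURES})
```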

CONCLUSION:

  1. The gold stock price depends strongly on its Open, High, Low and Close prices. Other factors that contribute to its prediction are the GDX and SF stock prices.
  2. A basic ML Model was deployed which fairly predicts the Adjusted Closing price of Gold given its OHLC price, along with GDX and SF Price.
  3. As a continuous learning exercise, Team Random Forest would like to train the model with a larger & more recent dataset with more hyper-tuning & feature engineering.
