Unraveling Stock Price Predictions Using Supervised Learning

Arshul Shaik
INST414: Data Science Techniques
6 min read · Apr 29, 2024

In this post, I delve into the application of supervised learning techniques to extract insights from stock price data. Specifically, I explore the question: Can we predict stock prices accurately based on historical data? This question matters to investors, financial analysts, and traders, because the answer can inform decisions about buying, selling, or holding stocks.

To address whether stock prices can be accurately predicted from historical data, a suitable dataset typically comprises historical stock market data along with corresponding ground-truth labels. Such a dataset commonly includes fields like opening price, closing price, highest price, lowest price, trading volume, and possibly additional technical indicators, offering a comprehensive view of a stock’s performance over time. The ground-truth labels are the actual closing prices observed in the historical record, generated by real-world transactions: each day’s closing price reflects the price at which the last trade of that day occurred. The dataset’s value lies in the historical context and price-movement patterns it provides, which let researchers and analysts identify relationships between the historical features and closing prices. Understanding these relationships is essential for building predictive models that use historical data to forecast future stock prices.

We obtained the data, known as the Twitter Stocks Dataset, from Kaggle. This dataset provides roughly nine years of Twitter’s stock prices, from November 2013 to October 2022, in CSV format. After downloading it directly from Kaggle, we loaded the dataset into a Python environment using the `pandas` library, which facilitated data manipulation and analysis. The dataset is useful for a range of analytical purposes, including time series analysis, trend forecasting, pattern identification, and other applications related to understanding Twitter’s stock performance over time.
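To illustrate the loading step, here is a minimal sketch using `pandas`. The filename and column names are assumptions based on the typical layout of the Kaggle file, so adjust them to match the downloaded copy.

```python
import pandas as pd

# Load the Kaggle Twitter stock CSV (filename is an assumption;
# change it to match the downloaded file).
df = pd.read_csv("TWTR.csv", parse_dates=["Date"])

# Quick sanity checks: date range, available columns, basic statistics.
print(df["Date"].min(), df["Date"].max())
print(df.columns.tolist())
print(df[["Open", "High", "Low", "Close", "Volume"]].describe())
```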

For this analysis, a regression model is used rather than a classification model. The quantity being predicted is the closing price of the stock, which is a continuous variable, and regression models are designed to estimate the relationship between input features and a continuous target. Here, the input features are numerical attributes such as opening price, highest price, lowest price, and trading volume, while the target variable is the continuous closing price. Employing a regression model therefore matches the nature of the task: forecasting a numerical stock price from historical patterns and trends.
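As a rough sketch of this setup, the snippet below builds the feature matrix and continuous target, holds out a test set, and fits a RandomForestRegressor. It assumes the `df` loaded above and uses illustrative hyperparameters rather than the exact configuration in the repository.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Continuous target (Close) and numerical features; column names assume
# the standard OHLCV layout of the Kaggle file.
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

# Hold out 20% as a test set; shuffle=False keeps chronological order,
# a common (though debatable) choice for time-ordered stock data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, predictions))
```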

The features used for the supervised model are selected from the stock dataset. These features include:

  1. Opening Price (Open): The price at which a security first trades upon the opening of an exchange on a given trading day.
  2. Highest Price (High): The highest price at which a security trades during the trading day.
  3. Lowest Price (Low): The lowest price at which a security trades during the trading day.
  4. Volume: The total number of shares or contracts traded during a specified time period.

These features are selected because they provide valuable information about the stock’s trading activity and price movement. By including them, the model can learn the relationships between these inputs and the target variable, the closing price, and make predictions based on historical patterns and trends in the data. Examining the five test samples with the largest prediction errors, there is a clear discrepancy between the actual and predicted close prices (a sketch of how such samples can be surfaced follows the list). Here are some potential reasons the model may have mispredicted these samples:

  • Sample 1: The model predicted a close price lower than the actual close price. This discrepancy could be due to unforeseen market factors influencing the stock price, such as news events, market sentiment, or macroeconomic indicators, which the model did not account for in its training data.
  • Sample 2: Similarly, the model underestimated the actual close price. This discrepancy could stem from sudden changes in market conditions or external factors impacting the stock’s performance, which were not captured by the model’s training data.
  • Sample 3: In this case, the model again underestimated the actual close price. It’s possible that the model failed to capture subtle patterns or trends in the data that led to this misprediction, or it may have overfit to noise in the training data.
  • Sample 4: The model predicted a close price slightly lower than the actual close price. This discrepancy could be attributed to fluctuations in market dynamics or anomalies in trading patterns that were not adequately captured by the model during training.
  • Sample 5: Here, the model overestimated the actual close price. This discrepancy might be due to outliers or irregularities in the data that the model struggled to generalize from during training, leading to an inaccurate prediction.
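Continuing the earlier sketch, the snippet below is one way to surface the test samples the model got most wrong, by ranking held-out rows by absolute prediction error. It assumes the `y_test` and `predictions` objects from the sketch above.

```python
import numpy as np
import pandas as pd

# Collect actual vs. predicted closing prices for the held-out set and
# rank rows by absolute error to surface the worst predictions.
errors = pd.DataFrame({
    "actual_close": y_test.values,
    "predicted_close": predictions,
})
errors["abs_error"] = np.abs(errors["actual_close"] - errors["predicted_close"])

# The five samples the model got most wrong.
print(errors.sort_values("abs_error", ascending=False).head(5))
```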

The analysis of the collected data, specifically the Twitter Stocks Dataset obtained from Kaggle, aimed to address whether stock prices can be accurately predicted based on historical data. By training a RandomForestRegressor on features such as opening price, highest price, lowest price, and trading volume, the analysis sought to forecast Twitter’s closing stock prices over a specified period. Evaluating the model with metrics such as mean squared error (MSE) provided insight into the accuracy of the predictions, and examining the mispredicted samples highlighted limitations of the model and areas for improvement. Overall, the analysis demonstrated that stock prices can be predicted to a certain extent using historical data and machine learning techniques, while also acknowledging the challenges and uncertainties inherent in stock market forecasting.

During data cleanup, two common issues are missing values and outliers. Missing values may occur due to data collection errors or incomplete records, potentially harming model accuracy. They can be addressed through imputation, replacing missing values with estimates such as the mean, median, or mode of the respective feature, or by removing rows or columns with missing values. Outliers are data points that deviate significantly from the rest of the dataset and can skew model predictions. Techniques like truncation, winsorization, or data transformation, such as a log transformation, are commonly used to mitigate their impact on model performance. Handling missing values and outliers effectively during data cleaning preserves dataset integrity and reliability, leading to more accurate machine learning models.
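As a minimal sketch of these cleanup steps, the snippet below applies median imputation, percentile-based winsorization, and a log transform of volume to the same `df`. The column names and the 1st/99th percentile thresholds are illustrative assumptions, not the exact values used in the repository.

```python
import numpy as np

numeric_cols = ["Open", "High", "Low", "Close", "Volume"]

# Fill missing numeric values with each column's median (simple imputation).
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Winsorize: clip each column to its 1st and 99th percentiles to limit
# the influence of extreme outliers.
for col in numeric_cols:
    lower, upper = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=lower, upper=upper)

# Log-transform trading volume, which is typically heavily right-skewed.
df["log_volume"] = np.log1p(df["Volume"])
```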

The analysis has limitations, particularly regarding external factors and data quality. Firstly, external factors like geopolitical events or company-specific news, which significantly influence stock prices, are absent from the dataset and therefore not considered by the model. This omission limits predictive accuracy, particularly under unforeseen circumstances. Secondly, the accuracy and reliability of the model’s predictions depend on the quality of the training data: errors, inconsistencies, or biases within the dataset can undermine the model’s performance and generalizability. Additionally, the absence of pertinent features or the presence of irrelevant ones may prevent the model from capturing all relevant patterns and trends in the data, potentially leading to biased predictions. These limitations underscore the importance of caution and thorough validation when interpreting the results, and highlight areas for improvement in future iterations.

GitHub Repository: https://github.com/arshuls/INST414.git
