Clustering NASDAQ 100 Stocks

James van Doorn
INST414: Data Science Techniques
3 min read · Nov 3, 2023

Non-obvious Insight:

The non-obvious insight I aim to extract is which NASDAQ 100 stocks share similar price movement patterns. Grouping stocks this way can guide the construction of well-diversified portfolios, help identify sectors with correlated stock performance, and support strategic choices that improve the trade-off between risk and reward, both for me and for other investors.

Data Source and Feature Selection:

For this assignment, I used the NASDAQ-100 Stock Price Data dataset from Kaggle (kaggle.com). In my analysis, I use the features high price, low price, closing price, adjusted closing price, and volume to determine the similarity between stocks. These features are fundamental for dissecting stock price dynamics and capture different aspects of stock behavior. To measure similarity, I opted for the Euclidean distance metric, which is suitable for numeric data and lets me assess how close two stocks are based on their price-related attributes.
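
As a minimal sketch of the distance calculation, the snippet below builds one feature vector per stock and compares two pairs of stocks. The tickers and values are made up, the column names (`Name`, `High`, `Low`, `Close`, `Adj Close`, `Volume`) are assumptions about the file layout, and averaging each feature per stock is only one possible way to summarize the daily rows.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample: tickers, values, and column names are placeholders,
# not taken from the Kaggle file.
prices = pd.DataFrame({
    "Name": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "High": [10.0, 12.0, 11.0, 13.0, 100.0, 104.0],
    "Low": [9.0, 10.0, 10.0, 11.0, 95.0, 99.0],
    "Close": [9.5, 11.0, 10.5, 12.0, 98.0, 102.0],
    "Adj Close": [9.5, 11.0, 10.5, 12.0, 98.0, 102.0],
    "Volume": [1000.0, 1200.0, 1100.0, 1300.0, 5000.0, 5200.0],
})
FEATURES = ["High", "Low", "Close", "Adj Close", "Volume"]

# One row per stock: average each feature over its daily rows
# (an assumed aggregation, chosen only for illustration).
per_stock = prices.groupby("Name")[FEATURES].mean()


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


d_ab = euclidean(per_stock.loc["AAA"], per_stock.loc["BBB"])
d_ac = euclidean(per_stock.loc["AAA"], per_stock.loc["CCC"])
print(f"AAA-BBB: {d_ab:.2f}, AAA-CCC: {d_ac:.2f}")
```

On raw values the volume column sits on a much larger scale than the prices and dominates the distance, which is why the features are standardized before clustering.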

Selecting the Number of Clusters (k):

To determine the optimal number of clusters (k), I employed the elbow method. I ran K-Means clustering with varying k values and analyzed the within-cluster sum of squares for each. This allowed me to pinpoint the elbow in the plot, the point beyond which adding more clusters no longer produces a large drop in the within-cluster sum of squares. Based on this analysis, I settled on k=3 and applied K-Means with k=3 to group stocks based on their price behavior, providing a structured approach for pattern identification.

Figure: Elbow method plot used to determine the optimal number of clusters. The x-axis shows the number of clusters and the y-axis shows the within-cluster sum of squares, which quantifies the variation within clusters (lower is better).
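
The elbow method described above can be sketched as follows. This is an illustration on synthetic two-dimensional data with three well-separated groups, not the NASDAQ 100 features; the data, the range of k, and the `random_state` values are all assumptions. It relies on scikit-learn's `KMeans`, whose fitted `inertia_` attribute is the within-cluster sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate k and record the within-cluster
# sum of squares (exposed by scikit-learn as `inertia_`).
wcss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_

for k, value in wcss.items():
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against k (for example with Matplotlib) gives the elbow curve; on this synthetic data the large drops stop after k=3.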

The Clusters:

These clusters represent distinct groups of stocks that share similar price movement patterns. Here’s a brief interpretation of these clusters:

  • Cluster 1: Comprises high-volatility stocks with sharp price fluctuations. These stocks are akin to speculative or growth stocks, offering the potential for high returns but accompanied by higher risks.
  • Cluster 2: Consists of stable, low-volatility stocks characterized by consistent price trends. These stocks may represent mature, dividend-paying companies, making them a natural choice for risk-averse investors.
  • Cluster 3: Encompasses stocks with intermediate characteristics, combining elements of stability and growth.
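
One way to inspect clusters like these is to attach the K-Means labels back to the per-stock table and compare per-cluster averages. The sketch below does this on made-up data: the tickers, values, and column names are hypothetical, and the three synthetic groups are constructed to be easy to separate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up per-stock averages for 30 placeholder tickers in three groups.
rng = np.random.default_rng(1)
close = np.concatenate([
    rng.normal(20, 2, 10), rng.normal(100, 5, 10), rng.normal(300, 10, 10)
])
volume = np.concatenate([
    rng.normal(5e6, 2e5, 10), rng.normal(2e6, 2e5, 10), rng.normal(1e6, 1e5, 10)
])
per_stock = pd.DataFrame(
    {"High": close * 1.02, "Low": close * 0.98, "Close": close, "Volume": volume},
    index=[f"T{i:02d}" for i in range(30)],
)

# Standardize, cluster with k=3, and attach the labels to each stock.
X = StandardScaler().fit_transform(per_stock)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
per_stock["cluster"] = labels

# Per-cluster feature averages and cluster sizes.
summary = per_stock.groupby("cluster").mean()
sizes = per_stock["cluster"].value_counts()
print(summary)
print(sizes)
```

K-Means numbers its clusters arbitrarily (0, 1, 2 here), so names such as "Cluster 1" have to be assigned after looking at the summary.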

Software Used:

I conducted the analysis using Python and key libraries like Pandas for data manipulation, NumPy for numerical operations, Scikit-Learn for K-Means clustering and the elbow method, and Matplotlib for data visualization. These tools streamlined the process, allowing me to handle, cluster, and visualize NASDAQ 100 stock data efficiently. Python’s flexibility and the capabilities of these libraries helped in simplifying the analysis, making it a practical choice for uncovering insights within the dataset.

Data Cleaning and Common Bugs:

Dealing with real-world financial data involves addressing common issues like missing values, outliers, and data inconsistencies. Missing values were either imputed or removed, and stocks with incomplete data were dropped to keep the clustering robust. Because prices and volume are on very different scales, I then used StandardScaler from sklearn to standardize the features to zero mean and unit variance, making them directly comparable.
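
A minimal sketch of this cleaning and scaling step, on a made-up table with hypothetical tickers and column names. It shows only the removal option (`dropna`) for missing values; imputation would be an alternative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up per-stock features with two missing values (placeholder tickers).
features = pd.DataFrame(
    {
        "High": [10.0, 52.0, np.nan, 205.0, 31.0],
        "Low": [9.0, 49.0, 98.0, 195.0, 29.0],
        "Close": [9.5, 50.0, 99.0, 200.0, 30.0],
        "Adj Close": [9.5, 50.0, 99.0, 200.0, 30.0],
        "Volume": [1.0e6, 3.0e6, 2.0e6, np.nan, 5.0e5],
    },
    index=["AAA", "BBB", "CCC", "DDD", "EEE"],
)

# Drop every stock that has at least one missing feature.
clean = features.dropna()

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(clean)
scaled_df = pd.DataFrame(scaled, index=clean.index, columns=clean.columns)
print(scaled_df.round(2))
```

After scaling, every feature contributes on the same scale to the Euclidean distance, so volume no longer outweighs the price columns.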

Limitations and Bias:

Selecting the optimal k value is a critical yet somewhat subjective decision, as different k values may reveal alternative patterns within the data. Additionally, the choice of clustering method, similarity metric, and the selection of features can significantly impact the clustering results, making it essential to make informed choices. Furthermore, it’s crucial to be aware of potential biases introduced during data preprocessing and analysis, which can affect the outcomes and interpretations, including:

  • Selection Bias: If specific stocks or time periods were selectively included or excluded during data preprocessing, it could lead to biased clustering results, as the dataset may not represent the entire NASDAQ 100 accurately.
  • Missing Data Bias: Handling missing data, whether through imputation or removal, can introduce biases if the chosen method does not accurately represent the missing values’ characteristics.
  • Feature Selection Bias: The choice of features used for clustering can introduce bias, as selecting certain attributes over others may affect the grouping of stocks.
  • Metric Bias: The choice of the distance or similarity metric, such as Euclidean distance in this analysis, may not be the best representation of stock relationships and can introduce bias.
  • Algorithmic Bias: Different clustering algorithms have inherent biases, and the choice of K-Means in this analysis may not be the most suitable for all types of data.

GitHub Repository Link: inst414_work/assig4inst414.ipynb at main · jvand0/inst414_work (github.com)
