Exploring KMeans Clustering: Implementation, Advantages, and Challenges
Feb 27, 2024
In this blog post, we will explore how the KMeans algorithm can be used to cluster articles.
I take a deep dive into the implementation in the following video.
Many of the fundamental steps are discussed in my previous article on keyword extraction, which can be found here.
To evaluate the effectiveness of KMeans, we will select articles from three distinct categories: Geography, Sports, and Mathematics. Each category is approximated by a keyword match on the article text, and we randomly sample up to 1,000 matching articles per category.
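# Sports: keep articles that mention any of these whole words (case-insensitive).
# na=False treats rows with missing text as non-matches.
pattern = r'\b(?:player|team|score)\b'
# Shuffle the matches, keep the first 1000, and copy so the new column
# is added to an independent frame.
sports_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
sports_df["category"] = "sports"
sports_df.shape

# Mathematics: same procedure with maths-related keywords.
pattern = r'\b(?:theorem|proof|equation|formula)\b'
maths_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
maths_df["category"] = "mathematics"
maths_df.shape

# Geography: same procedure with geography-related keywords.
pattern = r'\b(?:river|lake|ocean)\b'
geo_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
geo_df["category"] = "geography"
geo_df.shape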
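Since the three blocks above follow the same pattern, they could be folded into one helper. The following is only a minimal sketch, not the code used in the video: the function name `sample_by_keywords`, the `seed` parameter, the `keyword_sets` dictionary, and the toy DataFrame are all assumptions made for illustration. The only thing taken from the original is the `articles` column and the keyword lists.

```python
import re

import pandas as pd


def sample_by_keywords(df, keywords, category, n=1000, seed=None):
    """Return up to `n` shuffled rows whose 'articles' text contains any keyword."""
    # Build a whole-word, non-capturing pattern like the ones used above.
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    # na=False treats missing article text as "no match".
    mask = df['articles'].str.contains(pattern, case=False, na=False)
    # Shuffle the matches, keep the first n, and copy so the label column
    # is assigned to an independent frame.
    sampled = df[mask].sample(frac=1, random_state=seed).head(n).copy()
    sampled['category'] = category
    return sampled


# Toy stand-in for the real `df` (hypothetical data, for illustration only).
toy_df = pd.DataFrame({'articles': [
    'The team won after a late score.',
    'A proof of the theorem follows from the equation.',
    'The river flows into the lake.',
    None,
]})

# Keyword lists copied from the patterns above.
keyword_sets = {
    'sports': ['player', 'team', 'score'],
    'mathematics': ['theorem', 'proof', 'equation', 'formula'],
    'geography': ['river', 'lake', 'ocean'],
}

# One labelled frame holding every sampled article.
labelled_df = pd.concat(
    [sample_by_keywords(toy_df, kws, cat, seed=0) for cat, kws in keyword_sets.items()],
    ignore_index=True,
)
print(labelled_df['category'].value_counts())
```

Note that an article matching more than one keyword list (for example, one mentioning both "team" and "river") can end up in more than one category, so these labels are only a rough ground truth for judging the clusters.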