Sample Size for Cluster Analysis

Alexis Idlette-Wilson
Data Viz for Fun · Nov 14, 2018

What a cluster. Photo by Phil Desforges on Unsplash

Classical statistics treats sample size as a major consideration when choosing which methodology to employ and which statistics to calculate. By contrast, sample size is often the last thing considered in a cluster analysis project. Data sets may be slowly shifting toward "big" data, but academic research continues to show that medium and small data sets can yield valid results as well.

Analysts and statisticians have basic guidelines to follow regarding sample size for most analyses. In his 2013 article "Heuristics for Sample Size Determination in Multivariate Statistical Techniques," Kamran Siddiqui offers a comprehensive summary of sample size advice for seven different statistical techniques, including cluster analysis.

Cluster analysis, an unsupervised machine learning technique, supports critical business problems like market segmentation and pattern recognition. Surprisingly, though, Siddiqui and other data science academics agree that no sample size guidelines exist for cluster analysis. Run it with two observations or two thousand, and the analyst is, technically, doing it the right way!

So if there are no best practices for sample size in cluster analysis, how does an analyst decide how much data to include? The obvious answer may be to run everything; population data reigns supreme. Of course, that analyst may get a friendly visit from their local database administrator after querying the entire database.

Some opinions do exist in the literature to guide an analyst if and when a sample must be selected.

1. Consider feature-to-observation ratio

In her 2002 analysis of cluster methodology, Sara Dolnicar suggests checking the feature-to-observation ratio to validate the choice of sample size. Dolnicar et al. also published a 2013 study recommending a sample size between 60*k and 70*k, where k equals the number of features. Dolnicar references an even older study suggesting a guideline of 2^k observations.

This technique also offers a potential strategy for managing feature "creep." If 2^k far exceeds the number of available observations, the analyst may consider trimming the initial feature set rather than increasing the sample size.
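To make these rules of thumb concrete, here is a minimal Python sketch. Only the 60*k to 70*k range and the 2^k threshold come from the studies above; the function name and structure are illustrative.

```python
# Minimal sketch of the sample-size heuristics discussed above.
# Only the 60*k..70*k range and the 2**k rule come from the literature;
# the function itself is illustrative.

def sample_size_guidelines(k: int) -> dict:
    """Return rule-of-thumb minimum sample sizes for k clustering features."""
    return {
        "dolnicar_2013_lower": 60 * k,  # lower end of the 2013 recommendation
        "dolnicar_2013_upper": 70 * k,  # upper end of the 2013 recommendation
        "two_to_the_k": 2 ** k,         # older 2^k rule of thumb
    }

# With 8 features, the 2^k rule already demands 256 observations,
# while the 2013 recommendation asks for 480 to 560.
print(sample_size_guidelines(8))
```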

2. Utilize a stratified sampling technique

Another consideration for sample size is the proportion of observation types in the data. Ensuring that the analysis data set represents a balanced cross-section of the overall population improves the odds of generating well-formed clusters. No single attribute or entity class should be over-represented if it can be avoided.

How does this affect sample size? If some class of entity, say "Retail," makes up 50% of the population data but 95% of the initial chosen sample, one may reduce the sample so that the mix returns to 50% retail, or add observations from the other categories. A 1987 study concluded that larger samples yielded better-quality clusters when the input proportions of observation types were balanced.
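As a rough sketch, a stratified sample like the one described above can be drawn with pandas. The DataFrame df and the "segment" column here are hypothetical placeholders for your own data and class label.

```python
# Hedged sketch: draw the same fraction from every class so that
# population proportions carry over into the analysis sample.
import pandas as pd

def stratified_sample(df: pd.DataFrame, label: str, frac: float,
                      seed: int = 42) -> pd.DataFrame:
    """Sample an equal fraction of each class in `label`, preserving
    the population's class proportions in the result."""
    return (
        df.groupby(label, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Example usage (df and "segment" are hypothetical): keep 20% of each
# segment, so if "Retail" is 50% of the population it stays 50% of the
# sample instead of dominating it.
# sample = stratified_sample(df, label="segment", frac=0.20)
```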

3. Repeat the analysis with different sample sizes

With regard to item #2, an analyst may also observe varying levels of cluster density depending on the sample size studied. A 2015 study published in the journal of the Society for Prevention Research took exactly this approach, observing how distinctly clusters formed at different sample sizes. The researchers found that cluster quality varied with sample size within each of the three clustering methods they used.
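One way to sketch that experiment in Python, assuming scikit-learn, is to re-run a single clustering method over several sample sizes and compare a quality measure. The silhouette score here stands in for whatever measures the 2015 study used, and the synthetic feature matrix is a placeholder for real data.

```python
# Sketch: re-run one clustering method across several sample sizes
# and compare cluster quality. Silhouette score is an assumption, not
# the 2015 study's exact metric; X is placeholder data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))  # substitute your own feature matrix

for n in (200, 500, 1000, 2000):
    idx = rng.choice(len(X), size=n, replace=False)  # random subsample
    sample = X[idx]
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(sample)
    print(f"n={n:4d}  silhouette={silhouette_score(sample, labels):.3f}")
```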

Discussions of cluster analysis typically bypass the examination of sample size. Research has shown that, despite this lack of attention, sample size can be a powerful tuning parameter. These studies suggest that an optimized sample size improves cluster formation, which can propel an analysis from inconclusive to informative.
