All You Need Is Statistics to Analyze Tabular Datasets
Analyzing tabular datasets does not require deep learning or large language models. I will demonstrate how (simple) statistics and techniques such as PCA can reveal new insights and give explainable results.
Tabular datasets are one of the most common forms of data and consist of a mix of variable types, such as binary, categorical, textual, and continuous values. A well-known example is the Titanic dataset. The major challenge in such datasets is that each variable type has to be analyzed differently: continuous values need different statistics and/or models than categorical values, and so on. It is also key to detect multicollinearity in the dataset, because variables with statistically similar behavior can affect the reliability of models. In this blog post I will demonstrate the steps of pre-processing tabular datasets and how statistical tests, such as hypergeometric testing, can show the relationships across variables. In addition, I will explain the importance of multiple test corrections, and show how to apply Principal Component Analysis to a tabular dataset.
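To make the idea of hypergeometric testing and multiple test correction concrete before going into the details, here is a minimal sketch that uses only the Python standard library. The function names, the toy counts, and the extra p-values are made up for illustration and are not taken from the Titanic dataset. The Benjamini-Hochberg procedure is my choice for this sketch; other corrections (for example Holm or Bonferroni) follow the same pattern.

```python
from math import comb


def hypergeom_pvalue(N, K, n, x):
    """One-sided hypergeometric test: P(X >= x).

    N: total number of samples
    K: samples in the population with the property of interest
    n: samples in the group being tested
    x: samples in the group that have the property
    """
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example (hypothetical counts): 20 passengers, 8 survived,
# 6 belong to one category, and 5 of those 6 survived.
p = hypergeom_pvalue(N=20, K=8, n=6, x=5)
print(f"raw p-value: {p:.4f}")

# When many variable pairs are tested, correct all p-values together.
# The other three values are placeholders standing in for further tests.
raw = [p, 0.04, 0.03, 0.5]
print([round(v, 4) for v in benjamini_hochberg(raw)])
```

The test asks how likely it is to see at least `x` hits in a group of size `n` if group membership and the property were unrelated. A small corrected p-value suggests the two variables are associated.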