Member-only story

Statistical Tests Won’t Help You to Compare Distributions

Forget p-values, and come to know “Standardized Wasserstein Distance”: a practically useful measure of the difference between distributions.

Samuele Mazzanti
Towards Data Science
8 min readApr 8, 2022

--

[Image by Author]

One of the most recurring questions in the world of data, from machine learning to business analysis, from finance to medical research is:

“How different are these variables among these groups?”

I bet you have been there before. You have a dataset and two (or more) groups of customers, say customers from US and customers from Germany, and you have recorded some variables, such as age, purchase frequency, and average purchase amount.

[Image by Author]

You would like to know which variable is better able to set apart American customers from German customers. Or, equivalently, which variable is more different between American and German customers.

To do that, you need a way to measure the difference between distributions, e.g. how different is the age of German customers from the age of American customers, and so on.

--

--

Towards Data Science
Towards Data Science

Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Samuele Mazzanti
Samuele Mazzanti

Written by Samuele Mazzanti

Applied Scientist @ Yelp | I write about data science in the real world | Opinions are my own

Responses (10)