How Real is Your Frankenstein Dataset?
Common metrics to compare a real dataset against those synthesized from it
“Beware, for I am fearless and therefore powerful.” — Mary W. Shelley
This blog post covers part of the talk I gave at PyConDE 2022.
You have synthesized a dataset from real-world data as per your requirements. It seems to have a similar structure and frees you from the legal responsibilities of working with sensitive information.
Mind you, synthetic datasets too are prone to disclosure, so do not heave a sigh of relief just yet.
For now, let us focus on how to measure the realness of this Frankenstein that you created from a real-world dataset. This dataset needs to be real enough to mingle with the world and not cause villagers to come marching with pitchforks and torches!
There are primarily four ways of comparing the two datasets.
Comparing Univariate Statistics
The comparison of univariate statistics focuses on a single feature at a time across the two datasets. Plotting the distribution followed by that feature in each dataset can give visual clarity.
The image on the left shows that the synthesis was pretty much successful, since the two distributions are comparable, while the image on the right is utter chaos. You may want to ask the model, “What are you doing?”
A picture says a thousand words, but data science runs on metrics and the numbers attached to them. Therefore, the Hellinger distance is used as a probabilistic measure bounded between 0 and 1, where 0 indicates no difference between the two distributions.
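The Hellinger distance is straightforward to estimate from binned frequencies. Below is a minimal sketch using NumPy; `real_col` and `synth_col` are hypothetical names for the same feature taken from the real and synthetic datasets.

```python
import numpy as np

def hellinger_distance(real_col, synth_col, bins=20):
    """Binned Hellinger distance between two empirical distributions.
    Returns a value between 0 (identical) and 1 (completely disjoint)."""
    # Use common bin edges so both histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([real_col, synth_col]), bins=bins)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    p = p / p.sum()  # convert counts to relative frequencies
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# e.g. hellinger_distance(real_df["age"], synth_df["age"])
```

A score near 0 means the synthetic feature reproduces the real distribution well; a score approaching 1 means the two distributions barely overlap.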
Comparing Bivariate Statistics
For such a comparison, the absolute difference in correlations between all variable pairs in the real and synthetic data is used to measure data utility. Of course, the types of the variables govern which coefficient needs to be computed. Some examples are:
- For a pair of continuous variables: the Pearson correlation coefficient
- Between a continuous and a nominal variable: the multiple correlation coefficient
- Between a continuous and a dichotomous variable: the point-biserial correlation
- For a pair of dichotomous variables: the phi coefficient
A heatmap of the absolute correlation differences can then be plotted for easier comparison.
In the above plot, we can see discrepancies that are typical of real-world datasets. The lighter shades show that the differences were close to 0 and the synthesis was successful, while the gray areas mark instances where a correlation could not be computed due to missing values or low variability.
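For the continuous-only case (Pearson coefficients), the comparison can be sketched with pandas and matplotlib as below; `real_df` and `synth_df` are hypothetical DataFrames holding the two datasets, and the other coefficient types listed above would each need their own computation.

```python
import matplotlib.pyplot as plt

def correlation_difference_heatmap(real_df, synth_df, columns):
    """Absolute difference between pairwise Pearson correlations
    of the real and synthetic data, shown as a heatmap."""
    diff = (real_df[columns].corr() - synth_df[columns].corr()).abs()

    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(diff.to_numpy(), vmin=0, vmax=1, cmap="viridis")
    ax.set_xticks(range(len(columns)))
    ax.set_xticklabels(columns, rotation=90)
    ax.set_yticks(range(len(columns)))
    ax.set_yticklabels(columns)
    fig.colorbar(im, ax=ax, label="|corr(real) - corr(synthetic)|")
    plt.tight_layout()
    plt.show()
    return diff
```

Pairs whose correlation could not be computed (for example, due to missing values) show up as blank cells, mirroring the gray areas described above.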
Comparing Multivariate Statistics
Comparing multivariate statistics is a game of “Tag! You’re it”, except that the feature that becomes ‘it’ does not chase the other features but is instead treated as the target variable. So, for each run, a classification model is built with one variable in the dataset as the outcome.
The intent behind such a comparison is to examine all possible models and compare their performance on the real and the synthetic dataset. The performance of each model is measured using the Area Under the ROC Curve (AUROC).
While the operations seem computationally heavy, Generalized Boosted Models (GBM) can be used to build the classification trees and speed things up.
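A rough sketch of this idea, assuming binary outcome columns and using scikit-learn’s gradient boosting as the GBM, could look like this; `real_df`, `synth_df`, and `binary_targets` are hypothetical names, not part of the original workflow.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auroc_per_outcome(df, outcome_cols):
    """Train one GBM per outcome column and return its hold-out AUROC.
    Assumes each outcome column is binary."""
    scores = {}
    for target in outcome_cols:
        X = pd.get_dummies(df.drop(columns=[target]))  # simple categorical encoding
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0
        )
        model = GradientBoostingClassifier().fit(X_train, y_train)
        scores[target] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return scores

# Compare the two sets of scores, outcome by outcome:
# real_scores = auroc_per_outcome(real_df, binary_targets)
# synth_scores = auroc_per_outcome(synth_df, binary_targets)
```

Similar AUROC values for the same outcome on both datasets suggest that the multivariate relationships were preserved during synthesis.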
Distinguishability
The distinguishability approach assigns a binary indicator to each record: a new feature is set to 1 if the observation comes from the real dataset and to 0 if it comes from the synthetic one.
A binary classification model is then built on the pooled data to discriminate between real and synthetic records. Its output, the predicted probability for each record, is called the propensity score.
A propensity score close to 0 suggests the record is synthetic, and a score close to 1 suggests it is real. Of course, the allocation of the values can be the other way around too, with real records marked 0 and synthetic records marked 1.
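Here is a minimal sketch of the propensity-score computation with scikit-learn, assuming the hypothetical `real_df` and `synth_df` share the same columns; any binary classifier could stand in for the logistic regression used here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def propensity_scores(real_df, synth_df):
    """Pool real (label 1) and synthetic (label 0) records, then estimate
    each record's probability of being real with a binary classifier."""
    pooled = pd.concat([real_df, synth_df], ignore_index=True)
    labels = np.concatenate([np.ones(len(real_df)), np.zeros(len(synth_df))])

    X = pd.get_dummies(pooled)  # simple encoding for any categorical columns
    clf = LogisticRegression(max_iter=1000)
    # Out-of-fold predicted probabilities that a record is real
    return cross_val_predict(clf, X, labels, cv=5, method="predict_proba")[:, 1]
```

If the classifier cannot tell the two datasets apart, the propensity scores cluster around the proportion of real records in the pooled data (0.5 for equally sized samples), which indicates that the synthetic data is hard to distinguish from the real thing.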