Predicting genuine and forged banknotes
I don’t know where you live, but cash transactions are still a reality here in Brazil. This type of transaction is not very safe, since a forged banknote is not easy to recognize. Banks work constantly to make banknotes more secure, adding ultraviolet and holographic features, watermarks, and metal threads, but forgers keep up with each new technology.
The core of a possible application for detecting forgeries would be a machine learning model that makes these predictions. To build one, we’ll use a dataset from OpenML called banknote authentication, which will help us distinguish between genuine and forged banknotes.
This dataset was built from images of both genuine and forged banknotes. A wavelet transform was used to extract features from these images, such as variance, skewness, kurtosis, and entropy. For this project, however, we’ll use a simplified version of the dataset that includes just the variance (V1) and skewness (V2) extracted from the images. Both features are continuous numeric values with no missing entries. The description of the dataset can be seen below:
To visualize the data, we’ll use a scatter plot from matplotlib, which makes the distribution of the two features easier to analyze.
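As a rough sketch of that plot, the snippet below draws V1 against V2 with matplotlib. The points here are random stand-ins (not the real banknote data), and the variable names v1 and v2 are just my choice for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the two features; the real values come from OpenML.
rng = np.random.default_rng(0)
v1 = rng.uniform(-7.0, 7.0, size=200)
v2 = rng.uniform(-14.0, 13.0, size=200)

fig, ax = plt.subplots(figsize=(6, 5))
# Small, semi-transparent markers keep dense regions readable.
ax.scatter(v1, v2, s=10, alpha=0.6)
ax.set_xlabel("V1 (variance)")
ax.set_ylabel("V2 (skewness)")
ax.set_title("Banknote features")
# Call plt.show() or fig.savefig(...) to view the figure.
```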
Since this simplified dataset doesn’t include a target, this project involves building an unsupervised machine learning model. K-means clustering is one of the simplest algorithms for finding structure in unlabeled data. It groups the data points into K clusters, assigning each point to the cluster whose centroid is nearest.
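To make the idea concrete, here is a minimal NumPy sketch of the assign-and-update loop behind K-means. It is only an illustration on toy data, not the implementation used for this project; the function name kmeans and its parameters are my own.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs standing in for V1 and V2.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(-3.0, -3.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
```

On data this cleanly separated, each blob ends up in its own cluster. The banknote data is messier, so the split there is less clear-cut.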
The points in the graph are neither too spread out nor too concentrated, and the number of instances seems sufficient, so it’s worth trying the K-means clustering algorithm even though the distribution isn’t spherical. Since the first feature ranges from -7.04 to 6.82 while the second ranges from -13.77 to 12.95, it could be useful to put both features on the same scale, i.e., standardize them before creating the model.
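Standardizing here means subtracting each feature’s mean and dividing by its standard deviation, so both columns end up with mean 0 and standard deviation 1. A small sketch with NumPy, using random stand-in values drawn over the ranges above rather than the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for V1 and V2, drawn uniformly over the observed ranges.
X = np.column_stack([
    rng.uniform(-7.04, 6.82, size=200),
    rng.uniform(-13.77, 12.95, size=200),
])

# Z-score each column: zero mean, unit (population) standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

scikit-learn’s StandardScaler performs the same transformation if you prefer a ready-made tool.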
We’ll run K-means with two clusters: one for genuine banknotes and the other for forged ones. The results after clustering can be seen below.
The K-means clustering algorithm starts by randomly choosing initial positions for the centroids. It then assigns each point to its nearest centroid, recalculates each centroid from its assigned points, and repeats. This random initialization can affect the final results, so it’s essential to check whether the algorithm is stable for the given dataset. To do so, we’ll rerun it several times and look for significant differences in the final results.
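One way to run such a check is sketched below, assuming scikit-learn’s KMeans and comparing cluster sizes across seeds. The data is a synthetic stand-in, and using the sizes as the stability measure is my simplification; comparing centroids or assignments would work as well.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two blobs of 60 and 90 points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(-3.0, -3.0), scale=0.5, size=(60, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.5, size=(90, 2)),
])

sizes = []
for seed in range(12):
    # n_init=1 with init="random" so each run depends on a single random start.
    model = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed)
    labels = model.fit_predict(X)
    # Sort the counts so the comparison ignores which cluster got which id.
    counts = np.bincount(labels, minlength=2)
    sizes.append(tuple(sorted(int(c) for c in counts)))

# The runs agree if every seed produced the same pair of cluster sizes.
stable = len(set(sizes)) == 1
```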
After rerunning K-means 12 times, the differences between runs are insignificant, which allows us to conclude that the algorithm is stable for this dataset. Of the 1,372 data points, 775 were assigned to cluster 1 and 597 to cluster 2.
To calculate the accuracy, it would normally be necessary to implement a test, but fortunately the data from OpenML includes the target, so we can simply compare our clusters with it. The accuracy was 87.82%, which could be considered a good result. Working with the other two features (kurtosis and entropy) as well might increase the accuracy.
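One detail in this comparison: K-means numbers its clusters arbitrarily, so cluster 0 may correspond to either class. The sketch below tries both mappings and keeps the better one. It assumes the labels are recoded as 0 and 1, and both the helper name clustering_accuracy and the tiny arrays are made up for the illustration.

```python
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Accuracy for two clusters, trying both mappings of cluster ids to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    direct = np.mean(y_true == y_pred)
    swapped = np.mean(y_true == 1 - y_pred)
    return max(direct, swapped)

# Made-up example: the cluster ids are flipped relative to the classes,
# and one point (index 6) is assigned to the wrong cluster.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 1])
acc = clustering_accuracy(y_true, y_pred)  # 7 of 8 correct -> 0.875
```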
The code behind this article can be found here. Feel free to give me any feedback. Thank you for reading this far.