A Brief Overview of K-Anonymity using Clustering in Python

Palakkohli
Brillio Data Science
6 min read · Aug 7, 2020

1. Background

Privacy of individuals is a matter of concern in today’s world, where enormous amounts of data are collected on a daily basis. Most institutions, in an attempt to protect privacy, exclude identifiers such as name, unique identification number, etc., and proceed with the release of data for numerous purposes. But such a release is not necessarily privacy-preserving. A study of a city’s voter list (in the research paper by Latanya Sweeney and Pierangela Samarati) showed that 97% of the individuals were uniquely identifiable using only their birth date and complete postal code [1]. Such information combined with sensitive data can result in dangerous leaks. For example, hospital records from which obvious unique identifiers such as name, UID, etc. are excluded can still be remapped to the correct individual by looking up their birth date and postal code in the city’s voter list, compromising the medical history of those individuals.

2. Problem Statement

In this article, we discuss k-Anonymity as a means of ensuring privacy protection and demonstrate the use of clustering to obtain groups of similar records.

3. Definitions and Methods

Quasi-identifiers: Let T (A1,.., An) be a table. A quasi-identifier of T is a set of attributes {Ai,.., Aj} ⊆ {A1,.., An} whose release must be controlled [1].

The goal of k-Anonymity is to ensure that any combination of quasi-identifier values maps to at least k records, thus protecting the privacy of individual records.

k-Anonymity: A release of data is said to satisfy k-Anonymity iff there exist at least k records for every possible combination of values of quasi-identifiers.

Generalization: This method suggests the usage of a less specific domain (zi) for an attribute. For example, specific birth years like 1956, 1954, etc. (z0) can be made less informative by replacing them with 195X (z1) in the data. This creates a less informative domain where 1956 and 1954 cannot be distinguished from each other. This domain can further be generalized to 19XX (z2) where records such as 1956 and 1947 are indistinguishable as well. (Here, zi refers to the ith level domain generalization of the attribute).
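The level-wise generalization described above can be sketched with a small helper. The function name and the digits-to-X scheme are illustrative, not part of the article’s code:

```python
def generalize_year(year: int, level: int) -> str:
    """Generalize a birth year to domain z_level by masking its
    last `level` digits (z1: 1956 -> 195X, z2: 1956 -> 19XX)."""
    s = str(year)
    if level == 0:
        return s  # z0: the original, fully specific value
    return s[:-level] + "X" * level

print(generalize_year(1956, 1))  # 195X
print(generalize_year(1954, 1))  # 195X -> indistinguishable from 1956 at z1
print(generalize_year(1947, 2))  # 19XX -> indistinguishable from 1956 at z2
```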

Suppression: This method suggests the removal of an attribute in its entirety. Suppression is used when too many attributes would otherwise have to be generalized, resulting in excessive loss of information. It is preferred whenever it incurs less information loss than generalizing multiple attributes.
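In pandas, suppression amounts to dropping the column. A minimal sketch with illustrative column names (not the article’s data):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 45], "zip": ["02138", "02139"]})

# Suppress the 'zip' attribute entirely rather than generalizing
# several other attributes to compensate for it.
df_suppressed = df.drop(columns=["zip"])
```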

4. Dataset

The Bank Marketing dataset from the UCI Machine Learning Repository is used for the purpose of this article [2]. The original dataset consists of 17 attributes, both numerical and categorical. We choose the attributes age (numeric), marital (categorical: marital status of the person) and housing (categorical: whether the person holds a housing loan) for the purpose of clustering.

5. Solution Methodology

We follow a two-step process towards solving the problem. The first step is the preprocessing of acquired data. Here, we choose/assume our quasi-identifiers and also decide distances between categorical attributes, if any. In the second step, we perform KMeans clustering on the preprocessed data followed by certain adjustments and obtain groups of similar records.

5.1 Data Preprocessing

We start with manually choosing the quasi-identifiers for our dataset. For our example, they are as follows — age, marital and housing. Here, age is a numeric identifier while marital and housing are categorical identifiers.

The categorical identifiers were handled as follows:

  • housing — Since the only unique values in this attribute are “yes” and “no”, we apply a binary mapping: yes >> 1, no >> 0 (Figure a).

Figure a: binary mapping of attribute ‘housing’.

  • marital — Since this attribute contains more than two distinct values, namely “married”, “single”, and “divorced”, binary mapping isn’t possible. Hence we use a one-hot encoding scheme for marital. Each unique value becomes an attribute of its own, taking only the values 1 and 0 (as shown in Figure b).

Our newly preprocessed dataset is shown in the below figure.

Figure b: one-hot encoding of attribute ‘marital’.
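Both transformations can be sketched in pandas. The sample rows below are illustrative; the article applies the same steps to the UCI Bank Marketing data:

```python
import pandas as pd

# Illustrative sample of the three chosen quasi-identifiers
df = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "divorced"],
    "housing": ["yes", "no", "yes"],
})

# Binary mapping for 'housing': yes -> 1, no -> 0
df["housing"] = df["housing"].map({"yes": 1, "no": 0})

# One-hot encoding for 'marital': one 0/1 column per category
df = pd.get_dummies(df, columns=["marital"], dtype=int)

print(df.columns.tolist())
# ['age', 'housing', 'marital_divorced', 'marital_married', 'marital_single']
```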

Now, we will use the KMeans clustering algorithm, which uses Euclidean distance between data points.

5.2 Clustering

Since we are dealing with a large number of data points, we proceed with the KMeans clustering algorithm; hierarchical clustering algorithms are generally more computationally expensive on large datasets. For the purpose of this article, we choose k=2000 and the number of clusters n_clusters=5 (not to be confused with k, which refers to the minimum size of each cluster under the k-Anonymity requirement).
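The clustering step can be sketched with scikit-learn. Synthetic data stands in here for the preprocessed quasi-identifiers:

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the preprocessed quasi-identifier matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))

# n_clusters=5 as in the article; the k-Anonymity threshold k is separate
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Cluster sizes keyed by label, like the dictionary shown below
sizes = dict(Counter(labels))
print(sizes)
```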

Following are the sizes of each cluster along with their labels after clustering:

Here, the keys of the obtained dictionary are the cluster labels and the values are their counts. We can clearly see that the cluster with label ‘2’ does not satisfy our k-Anonymity requirement, since its size is less than k=2000. So we perform readjustments and move points from satisfied clusters to the unsatisfied cluster. For this, we use the metric called the Silhouette Coefficient.

  • Silhouette Coefficient is defined as (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean nearest-cluster distance [4].
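scikit-learn computes this per-sample coefficient directly via silhouette_samples. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the preprocessed data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# One coefficient (b - a) / max(a, b) per sample, each in [-1, 1]
scores = silhouette_samples(X, labels)
```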

We proceed with the following steps for readjustments:

  • Obtain silhouette coefficient for each sample.
  • Sort the dataset on the basis of these silhouette coefficients in ascending order.
  • Since smaller silhouette coefficients indicate a poorly assigned point, such points can be reassigned to another cluster. Iterating over the sorted dataset, we reassign the first few points from satisfied clusters until all clusters obey k-Anonymity. Here, we also make sure that the size of no satisfied cluster is reduced below k.
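The steps above can be sketched as a small helper. The function name and the choice of always moving a point into the first deficient cluster are simplifying assumptions, not the article’s exact code:

```python
import numpy as np

def readjust(labels: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Move the worst-fitting points (lowest silhouette coefficient)
    from over-full clusters into clusters smaller than k, never
    shrinking a donor cluster below k."""
    labels = labels.copy()
    order = np.argsort(scores)           # ascending: worst-assigned first
    for i in order:
        sizes = np.bincount(labels)
        short = np.where(sizes < k)[0]   # clusters violating k-Anonymity
        if len(short) == 0:
            break                        # every cluster now has >= k points
        if sizes[labels[i]] > k:         # donor must keep at least k points
            labels[i] = short[0]         # simplest choice: first deficient cluster
    return labels

# Toy example: cluster 2 starts with only 2 points, below k = 4
labels = np.array([0] * 10 + [1] * 10 + [2] * 2)
scores = np.linspace(-1, 1, 22)
new_labels = readjust(labels, scores, k=4)
print(np.bincount(new_labels))  # every cluster now has size >= 4
```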

6. Results

The labels and sizes of readjusted clusters are as follows:

We see that the cluster with label ‘2’ now satisfies k-Anonymity, while labels ‘1’ and ‘0’ have reduced counts.

Following is the housing vs age graph plotted for the above 5 clusters.

Following is the marital_married vs age graph plotted for the above 5 clusters.

Each color in the above graphs represents a cluster. We can see how similar points are assigned to the same cluster.

7. Future Developments and Constraints

  • We have created a binary mapping of our categorical data, and KMeans clustering is generally not well suited to binary data. On the other hand, hierarchical clustering is computationally expensive. Since this is only a demonstration of ways to achieve k-Anonymity, clustering accuracy and efficiency are out of scope for this article. In the presence of more numeric attributes, KMeans would create better clusters.
  • When reassigning points to satisfy k-Anonymity after KMeans clustering, silhouette scores can be recomputed for every candidate reassignment. This helps find which unsatisfied cluster is the best match for the relocating point, but is time-consuming when the dataset and k are large.

References

  1. Pierangela Samarati and L. Sweeney. k-anonymity: a model for protecting privacy. Proceedings of the IEEE Symposium on Research in Security and Privacy (S&P). May 1998, Oakland, CA.
  2. [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31, June 2014.
  3. Ji-Won Byun, Ashish Kamra, Elisa Bertino, Ninghui Li. Efficient k -Anonymization Using Clustering Techniques. DASFAA 2007: 188–200.
  4. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
