Clustering the privacy concern levels of online media users across social media platforms

Published in

Web Mining [IS688, Spring 2021]

7 min read · Apr 1, 2021

What is the relationship between the number of social media followers and users’ privacy concern level?

Do people around the world use social media in the same way? It may not be something you typically think about, but does everyone who uses social media, or the internet in general, use it the same way? There are many ways to use social media, whether you are just scrolling through your news feed or posting on Instagram, TikTok, Snapchat, or YouTube. However, there is evidence that social media users in different countries use social media platforms differently and for different reasons. If citizens of different countries have different motivations for using social media, and different personality traits behind those motivations, is it possible that they also have different concerns about privacy?

A group of users may become unhappy with a platform, or use it differently, without the platform realizing it. More importantly, if users differ in how they view privacy concerns on a platform, then that platform should work to ensure a common understanding among all users. Ideally, there would be no major differences in how people, regardless of nationality, view privacy concerns, because that would indicate a good shared understanding of the issue at hand. By looking for such differences, we can discover whether some groups of people lack that understanding. Being informed about social media, especially when it is such a prominent part of our lives, is important. And if there are concerns about privacy online, then it is just as important to address those concerns.

In this study, I use the data set from a previous project to analyze the privacy concerns of Facebook and Instagram users from the United States and Thailand. The data were collected by sending out a survey to users between October and November 2020.

This study focuses on the privacy concerns of social media users in order to answer the question: “What is the relationship between the number of social media followers and users’ privacy concern level?”

Dataset

As mentioned, this study uses the data set from the previous project. That project created a 47-question survey on privacy concerns on SurveyMonkey, in Thai and English, and sent it out to friends, family, acquaintances, and other social media users. The data set has 235 responses, of which 145 are in Thai and 90 are in English. All responses are anonymous. Among the 90 English-language respondents, 63 identified as American; the other 27 took the English survey but identified as being from a country other than the United States. Since the responses were collected in two languages, I translated them into English and then edited the data into a common format. After formatting the data, I found that many rows were missing attributes.

The goal is to cluster the privacy concerns of online media users across social media platforms. Rows missing the privacy content or the user’s opinion are not useful for predicting the user’s privacy concerns, so they were removed. Responses from users who did not identify as Thai or American were also removed. To group respondents with similar values, I use the K-Means algorithm, which assigns each data point to the cluster whose centroid is closest.
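
The cleaning described above can be sketched with pandas roughly as follows. This is only a minimal illustration: the column names (`nationality`, `followers`, `concern_level`) and the sample rows are hypothetical and are not taken from the actual survey file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real survey columns may be named differently.
responses = pd.DataFrame({
    "nationality": ["Thai", "American", "Other", "Thai", "American"],
    "followers": [250, 1200, 800, np.nan, 4300],
    "concern_level": [9, 12, 7, 10, np.nan],
})

# Keep only respondents who identified as Thai or American.
is_target = responses["nationality"].isin(["Thai", "American"])

# Drop rows that are missing either value needed for clustering.
cleaned = (
    responses[is_target]
    .dropna(subset=["followers", "concern_level"])
    .reset_index(drop=True)
)

print(cleaned)
```

Filtering before dropping missing values keeps the two removal rules separate, so each one can be checked on its own.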

K-Means Clustering Algorithm

The K-Means clustering algorithm consists of the following steps:

Randomly choose k examples as the initial centroids
Create k clusters by assigning each example to the closest centroid
Compute k new centroids by averaging the examples in each cluster
Repeat the assignment and averaging steps until the centroids stop changing
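
The steps above, repeated until the centroids stop moving, can be sketched in plain NumPy. This is a minimal illustration rather than the scikit-learn implementation used later in this post; the function name `kmeans_sketch` and the toy data are made up for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain NumPy k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct examples as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every example to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned examples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups (made up for illustration).
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = kmeans_sketch(toy, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.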

To cluster the dataset of the number of social media followers and users’ privacy concern level, I wrote Python code for k-means clustering using the following steps:

Import the required libraries: pandas, NumPy, Matplotlib, and the KMeans class from scikit-learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

Read the dataset with the pandas read_csv method and store it in a data frame df. After populating df, call head() to inspect it. Passing 206 displays every record, since the CSV file contains 2 columns and 206 rows.

df = pd.read_csv('IS688.csv')
df.head(206)

Select the two columns of the dataset (the number of followers and the users’ concern level) and store them in a variable called x.
To do this, use iloc on df with the column indices 0 and 1, as shown below.

x = df.iloc[:, [0,1]].values

Set k to 5. I instantiate the KMeans class with n_clusters=5 and assign it to the variable kmeans5, then fit it to x and predict a cluster label for each row.

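kmeans5 = KMeans(n_clusters=5)
# Fit the model and get the cluster index (0-4) assigned to each row
y_kmeans5 = kmeans5.fit_predict(x)
print(y_kmeans5)
# Coordinates of the 5 cluster centers
kmeans5.cluster_centers_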

The figure below shows the output of the k-means clustering model with k=5. Note that the cluster_centers_ attribute gives the centers of the 5 clusters formed from the data.

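np.random.seed(206)
k = 3
# Pick k random starting centroids: [concern level 0-14, followers 0-4999]
centroids = {
    i+1: [np.random.randint(0,15), np.random.randint(0,5000)]
    for i in range(k)
}

fig = plt.figure(figsize=(5,5))
# Assumes df has columns named 'x' and 'y'
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
# Set the axis limits and draw the figure once, after all centroids are plotted
plt.xlim(0,15)
plt.ylim(0,5000)
plt.show()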

Use the Elbow method, which is designed to help find the optimal number of clusters in a dataset. I used this method to find the optimum value of k. To implement it, we fit k-means for a range of cluster counts and plot the number of clusters against the corresponding error value (code shown below).

Error = []
for i in range(1, 15):
    # Fit k-means with i clusters; inertia_ is the within-cluster sum of squared distances
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)
plt.plot(range(1, 15), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
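
The error value collected above is scikit-learn's `inertia_`: the sum of squared distances from each point to its closest cluster center. As a small check of that definition, the sketch below recomputes it by hand. The `demo_points` array is made-up data, not the survey data, and the `n_init` and `random_state` settings are my own choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points standing in for the two survey columns.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(60, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_points)

# Squared distance from every point to every center, shape (60, 3).
squared = (
    (demo_points[:, None, :] - model.cluster_centers_[None, :, :]) ** 2
).sum(axis=2)

# Inertia: sum over points of the squared distance to the nearest center.
manual_inertia = squared.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

Because adding clusters can only bring points closer to their nearest center, this value keeps falling as k grows, which is why the Elbow method looks for the bend in the curve rather than its minimum.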

The output graph of the Elbow method is shown below. Note that the elbow forms at approximately k=3.

As we can see, the bend in the curve lies between k=2 and k=4, with the elbow-like shape at k=3, so I take k=3 as the optimal value. Let’s implement k-means again using k=3.

Create a scatterplot with x and y values, which are the users’ privacy concern level and the number of social media followers, respectively.
Visualize the three clusters formed with the optimal k value. In the image below, we can clearly see three clusters, each represented by a different color.
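
A minimal sketch of this last step is given here, under stated assumptions. The `demo` array is randomly generated stand-in data (concern level on the x axis, follower count on the y axis), not the survey responses, and the names `kmeans3` and `labels3`, the colormap, and the `n_init` and `random_state` settings are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Made-up stand-in for the survey data: column 0 is the privacy concern
# level (0-15), column 1 is the follower count (0-5000).
rng = np.random.default_rng(206)
demo = np.column_stack([
    rng.integers(0, 16, size=90),
    np.concatenate([
        rng.integers(0, 1000, size=30),
        rng.integers(1500, 3000, size=30),
        rng.integers(4000, 5000, size=30),
    ]),
]).astype(float)

# Fit k-means with the k suggested by the Elbow method.
kmeans3 = KMeans(n_clusters=3, n_init=10, random_state=0)
labels3 = kmeans3.fit_predict(demo)

fig, ax = plt.subplots(figsize=(5, 5))
# Color each point by its cluster label.
ax.scatter(demo[:, 0], demo[:, 1], c=labels3, cmap="viridis", s=20)
# Mark the three centroids on top of the points.
ax.scatter(kmeans3.cluster_centers_[:, 0], kmeans3.cluster_centers_[:, 1],
           color="red", marker="x", s=100)
ax.set_xlabel("Privacy concern level")
ax.set_ylabel("Number of followers")
ax.set_title("K-Means clusters (k=3)")
```

In a notebook with %matplotlib inline the figure renders at the end of the cell; in a script, add plt.show() at the end.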

Conclusion

After applying K-Means clustering to the number of social media followers and users’ privacy concern level, we noticed that social media users with between 0 and 1,000 followers have a wide variety of privacy concern levels. The most crowded cluster (shown in green) belongs to this group and sits in the middle to high range of concern. The second most crowded cluster (shown in purple) consists of people with between 1,000 and 3,000 followers, most of whom are at concern level 12, which is very high. The cluster in red represents social media users with between 4,000 and 5,000 followers, who are highly concerned about their privacy on social media platforms.

Discussion and limitations

Given the large number of social media users nowadays, having more survey respondents or additional datasets would help this study by giving a stronger representation of social media privacy habits and trends. More survey participants would give more concrete and measurable results. Including different age groups would also show how privacy concerns differ with age. Moreover, it is clear that privacy is a genuine concern to many social media users and that many don’t understand privacy policies. Information privacy, or the lack thereof, is an important topic in online media today. Finally, k-means has trouble clustering data where clusters are of varying sizes and densities, which is a particular limitation for my dataset, in which the data points are sparsely distributed.
