Why is your Facebook Data so valuable? A method to predict human traits (gender, political preference, age) through Facebook Likes.
As election season approaches, we will look at how and why our Facebook data is so valuable to advertisers and politicians. Facebook is the world's largest social network, with over 2.5 billion active users. It processes data at a scale never seen before, and its highly sophisticated Facebook A.I. algorithms curate, categorize, and predict associations between data in an almost human way.
How and Why
Why: Given the influx of such a vast amount of data and processing power, we will explore how human traits can be predicted from just a collection of Facebook Likes. To do this, we will try to replicate the popular paper at the center of the Cambridge Analytica data scandal (see [1] in the References).
How: To build a predictive model, we will use the Facebook Ad Categories dataset. From it, we will create a user-like sparse matrix in which each category carries a rating: 1 means the user likes the content, 0 means the user does not.
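To make the idea concrete, here is a minimal sketch of such a matrix (the category names are placeholders for illustration, not from the real dataset):

import pandas as pd

# a tiny, hypothetical user-like matrix: rows are users, columns are ad
# categories; 1 = likes the content, 0 = does not like the content
likes = pd.DataFrame(
    data=[[1, 0, 1],
          [0, 1, 1]],
    index=["User 1", "User 2"],
    columns=["Sports", "Cooking", "Travel"],  # placeholder category names
)
print(likes)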
Requirements
- Python 3.8
- Scikit-Learn
- Pandas
- Numpy
Dataset
Like any ML/Data Mining project, we will start by analyzing and generating the dataset.
As you can see, the dataset is very broad, and because it is crowd-sourced, the entries come from real users. It is evident from the dataset that we only need the “name” column. With this as the starting point, we can generate the next part of our dataset.
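As a rough sketch of this step (the file name and column layout here are assumptions; adjust them to your copy of the Facebook Ad Categories dataset):

import pandas as pd

# load the crowd-sourced ad-categories dataset (file name is an assumption)
ads = pd.read_csv("facebook_ad_categories.csv")

# keep only the "name" column; the unique categories become the columns
# of our user-like matrix
cols = ads["name"].dropna().unique()
print(len(cols), "ad categories")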
import random
import pandas as pd

def generate_data(number_of_users=55000, cols=cols):
    """Generates random data consisting of each user's likes and dislikes.
    Arguments:
        number_of_users: int
    Returns:
        DataFrame, index, cols
    """
    assert number_of_users <= len(cols), "number_of_users should be less than or equal to the number of categories."
    index = ["User {}".format(i) for i in range(1, number_of_users + 1)]  # one row per user
    # generic categories become the columns
    cols = cols.tolist()
    data = {col: [] for col in cols}

    # random liking or disliking (1 or 0)
    def like_or_not():
        return random.randint(0, 1)

    for col in cols:
        for i in range(1, number_of_users + 1):
            data[col].append(like_or_not())

    print("Data generation complete.")
    return pd.DataFrame(data=data, index=index), index, cols
To generate our data, we extract the “name” column and randomly assign 1 or 0 for each user.
Target Variables
To add our target variables, we will focus on “age”, “gender” and “political”. These are the variables that we want to predict using our collection of Facebook Likes for each user.
def generate_target_variables(number_of_users=55000, target=['age', 'gender', 'political']):
    # for each target we will generate random data
    data = {"age": [], "gender": [], "political": []}

    # age (regression): ranges from 18 to 75
    data['age'] = [random.randint(18, 75) for x in range(1, number_of_users + 1)]

    # gender (classification): m or f encoded as 1 or 0
    data['gender'] = [random.randint(0, 1) for x in range(1, number_of_users + 1)]

    # political (classification): 1 -> democratic, 0 -> republican
    data['political'] = [random.randint(0, 1) for x in range(1, number_of_users + 1)]

    return data

# adding target variables to a feature DataFrame
target = generate_target_variables(len(df))
df['age'] = target['age']
df['gender'] = target['gender']
df['political'] = target['political']
Dimensionality Reduction
As our dataset contains 52,000+ ad categories, we will perform dimensionality reduction using SVD (also used in the paper).
For those of you who don’t remember SVD or PCA, here is a brief overview of the process.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
For SVD, we essentially factorize the user-like matrix M as M = U Σ Vᵀ, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values. Keeping only the k largest singular values yields the reduced k-dimensional representation U_k Σ_k.
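As a quick sanity check of that formula, here is a minimal NumPy sketch (purely illustrative, not part of the original code):

import numpy as np

# verify that M factorizes as U @ diag(s) @ Vt
M = np.random.rand(4, 3)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, U @ np.diag(s) @ Vt)

# keeping only the top k singular values gives the reduced representation,
# which is (up to approximation) what TruncatedSVD.fit_transform returns
k = 2
reduced = U[:, :k] * s[:k]
print(reduced.shape)  # (4, 2)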
To perform our dimensionality reduction, we will use the sklearn TruncatedSVD class with n_components=100 (according to the paper).
from sklearn.decomposition import TruncatedSVD

def dimen_reduce(values):
    reduced = TruncatedSVD(n_components=100)
    tr = reduced.fit_transform(values)
    return tr

def generated_reduced_df(df, index, number_of_users=55000):
    tr = dimen_reduce(df.values)  # 100 components
    dimen_df = pd.DataFrame(data=tr, index=index)
    target = generate_target_variables(len(dimen_df))
    dimen_df['age'] = target['age']
    dimen_df['gender'] = target['gender']
    dimen_df['political'] = target['political']
    return dimen_df
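Putting the pieces together, the end-to-end data preparation might look like the sketch below (a smaller number_of_users is assumed here just to keep the run fast):

# generate the raw user-like matrix, then reduce it to 100 components
df, index, cols_list = generate_data(number_of_users=1000, cols=cols)
dimen_df = generated_reduced_df(df, index)
print(dimen_df.shape)  # (1000, 103): 100 components + 3 target columns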
Training
As per the paper, we train a Linear Regression model for the age target and Logistic Regression models for the other (classification) targets.
The paper predicts many more traits, but due to computational constraints we restrict ourselves to three variables (age, gender, political).
from sklearn.model_selection import train_test_split

def split_dataset(df, test_size=0.2, stype="linear"):
    features_age, labels_age = df.drop(columns=['age']).values, df['age'].values
    features_gender, labels_gender = df.drop(columns=['gender']).values, df['gender'].values
    features_political, labels_political = df.drop(columns=['political']).values, df['political'].values

    if stype == 'linear':
        return train_test_split(features_age, labels_age, random_state=42, test_size=test_size)
    if stype == 'clas_gender':
        return train_test_split(features_gender, labels_gender, random_state=42, test_size=test_size)
    if stype == 'clas_pol':
        return train_test_split(features_political, labels_political, random_state=42, test_size=test_size)
As the gender and political target variables take the values 1 and 0, we use LogisticRegression for those models; for the continuous age target we use LinearRegression.
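The estimators themselves are plain scikit-learn models. A minimal sketch of setting them up and splitting the data (the variable names here are chosen to match the snippets below, and are assumptions rather than the original notebook):

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# age is continuous -> regression
linear_reg = LinearRegression()
x_train_linear, x_test_linear, y_train_linear, y_test_linear = split_dataset(dimen_df, stype="linear")

# gender and political are binary -> classification
log_reg = LogisticRegression(max_iter=1000)
x_train_gender, x_test_gender, y_train_gender, y_test_gender = split_dataset(dimen_df, stype="clas_gender")
x_train_pol, x_test_pol, y_train_pol, y_test_pol = split_dataset(dimen_df, stype="clas_pol")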
According to the paper, we apply 10-fold cross-validation on the 100 SVD components.
# age
cross_val_score(linear_reg, x_train_linear, y_train_linear, cv=10)

# gender and political
cross_val_score(log_reg, x_train_gender, y_train_gender, cv=10)
cross_val_score(log_reg, x_train_pol, y_train_pol, cv=10)
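Since the paper reports its results as the area under the ROC curve, one way to score the held-out splits is sketched below (an illustrative addition, not part of the original code):

from sklearn.metrics import roc_auc_score

# fit each classifier on its training split, then score with ROC AUC
log_reg.fit(x_train_gender, y_train_gender)
print("gender AUC:", roc_auc_score(y_test_gender, log_reg.predict_proba(x_test_gender)[:, 1]))

log_reg.fit(x_train_pol, y_train_pol)
print("political AUC:", roc_auc_score(y_test_pol, log_reg.predict_proba(x_test_pol)[:, 1]))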
Conclusion
To conclude, we see how Facebook Likes can be used to predict important human traits. With enough data (n = 55,000), we can achieve acceptable performance as measured by the area under the ROC curve. With the influx of data, we see how easy it is for organizations to target a specific set of people if they have access to Facebook data.
References
[1] M. Kosinski, D. Stillwell, and T. Graepel, “Private traits and attributes are predictable from digital records of human behavior,” PNAS 110(15), 2013. https://www.pnas.org/content/pnas/110/15/5802.full.pdf
Links
[1] Github: https://github.com/aaditkapoor/Facebook-Likes-Model
[2] Colab: https://colab.research.google.com/drive/1k75ODwkhEnXirVDKzyyBImPqNGISqjs6?usp=sharing

