Demystifying Fake Follower Detection: Unveiling Authenticity Online

Patrice Salnot
13 min read · Aug 20, 2023


Photo by Solen Feyissa on Unsplash

Instagram, a bustling realm of digital connection, now faces a pivotal challenge: the veracity of followers.

These elusive fake followers, often overlooked, wield a dual risk — businesses unwittingly inflate their engagement metrics, and the ominous specter of fake news dissemination is amplified.

Yet, solutions like HypeAuditor, HypeTrain and les influenceurs full stack API stand as stalwarts against this tide, revealing the truth behind these numbers.

In our quest for comprehension, we’ll peel back the layers, unveiling the intricate interplay of data and machine learning algorithms that reveals the true nature of these accounts.

In this article, we embark on an exploration that requires two essential companions: a fundamental grasp of Python and a touch of mathematical acumen. As we navigate through the intricacies of building a machine learning classifier, we’ll harness the power of a dataset from Kaggle, unraveling the nuances that distinguish genuine engagement from masked facades.

Our first stride involves immersing ourselves in the dataset. This initial step unfurls the canvas where insights are waiting to be uncovered. By loading and delving into the dataset’s intricacies, we lay the foundation for our quest to discern real from the simulated. Let’s embark on this exploratory voyage, where every data point has a tale to tell. 📊🔍

Initiate the process by importing necessary Python libraries and loading the dataset through the provided code snippet.

Note: In real-world applications, data retrieval from Instagram can be accomplished using either the official API or a custom-built solution (please review the terms and conditions of each provider for potential legal considerations).

Loading libraries

# Importing libraries
from imblearn.over_sampling import SMOTE  # Data augmentation for imbalanced classes
import pandas as pd, numpy as np  # Data manipulation and numerical operations
import plotly.express as px  # DataViz
import plotly.graph_objects as go  # DataViz
from plotly.subplots import make_subplots  # DataViz
import matplotlib.pyplot as plt  # DataViz
import seaborn as sns  # DataViz
from sklearn.pipeline import Pipeline  # Pipeline for chaining multiple data preprocessing steps
from sklearn.preprocessing import StandardScaler  # Data rescaling
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.metrics import roc_auc_score, roc_curve  # Evaluation metrics
from sklearn.preprocessing import FunctionTransformer
import shap  # Machine learning explainability library
# Models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Local imports
from minspect import inspect
from categorize_features import categorize_features
from create_barplots import create_barplots

Load data set

# Loading data
try:
    fulldata = pd.read_csv('./archive/final-v1.csv')  # Importing the data set
except Exception as error:
    print("Error while reading csv file: ", type(error).__name__, error)
    exit()

Data format output

print(fulldata.head())

edge_followed_by edge_follow username_length username_has_number full_name_has_number ... has_channel is_business_account has_guides has_external_url is_fake
0 0.001 0.257 13 1 1 ... 0 0 0 0 1
1 0.000 0.958 9 1 0 ... 0 0 0 0 1
2 0.000 0.253 12 0 0 ... 0 0 0 0 1
3 0.000 0.977 10 1 0 ... 0 0 0 0 1
4 0.000 0.321 11 0 0 ... 0 0 0 0 1

Visualising duplicate rows in the data set

# Visualising duplicated rows in the data set and counting them
fulldata_duplicated = fulldata[fulldata.duplicated(keep=False)]
print('Duplicated Values:')
print(fulldata.duplicated().sum())


Sample output

# Sample output
Duplicated Values:
3

The presence of three duplicate rows in our dataset can potentially lead to issues during the machine learning process. Duplicates can skew model training, as the same data points are counted multiple times, causing bias. This can result in an overestimation of model performance and reduced generalization to new data. Additionally, it might increase processing time during training, impacting efficiency. Addressing duplicates is crucial for maintaining data integrity and ensuring the reliability of our machine learning outcomes.

Eliminating Duplicate Rows

# Eliminating duplicate rows
fulldata.drop_duplicates(inplace=True)

Continuing forward, our attention shifts towards categorizing features into distinct continuous and binary categories within the dataset.

# Creating lists with the categorize_features function
continuous_features, binary_features = categorize_features(fulldata)

# Printing feature categorizations
print('\n')
print('Continuous features:')
print(continuous_features)
print('\n')
print('Binary features:')
print(binary_features)

Sample output

Continuous features:
['edge_followed_by', 'edge_follow', 'username_length', 'full_name_length']


Binary features:
['username_has_number', 'full_name_has_number', 'is_private', 'is_joined_recently', 'has_channel', 'is_business_account', 'has_guides', 'has_external_url', 'is_fake']
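
The categorize_features helper comes from a local module that is not listed in the article. Below is a minimal sketch of what such a helper might look like, assuming a column is treated as binary whenever it holds at most two distinct values:

# Hypothetical reconstruction of categorize_features: split columns into
# continuous and binary lists based on the number of unique values they hold
def categorize_features(df):
    continuous, binary = [], []
    for column in df.columns:
        if df[column].nunique() <= 2:
            binary.append(column)      # 0/1-style flags such as is_private
        else:
            continuous.append(column)  # counts and ratios such as username_length
    return continuous, binary

Applied to this dataset, a rule like this reproduces the split shown in the sample output above.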

Transitioning from data categorization, we confront the issue of dataset imbalance. This scenario arises when the distribution of classes within a dataset is uneven, creating an imbalance between dominant and minority classes. This imbalance can distort the learning process, as the model may overly emphasize the dominant class while neglecting the minority class. This can adversely affect the model’s ability to generalize and accurately classify the minority class. To tackle this challenge and ensure equitable model performance, addressing dataset imbalance becomes crucial.

Testing for dataset imbalance

# Creating another dataframe to label real and fake accounts
legend_df = fulldata.copy()

try:
    legend_df['is_fake'] = legend_df['is_fake'].replace({0: 'Real Accounts', 1: 'Fake Accounts'})
except Exception as error:
    print("Error while renaming fake column: ", type(error).__name__, "-", error)
    exit()
# End of data preparation


# Testing for dataset imbalance
fig = px.pie(legend_df, names='is_fake', title='Target variable distribution', color_discrete_sequence = ['#636EFA','#EF553B'])
fig.update_layout(template = 'ggplot2')
fig.show()
Dataset imbalance

The dataset exhibits an imbalance, with approximately 88% of entries representing fake followers and less than 12% corresponding to real followers. This skewed distribution can potentially impact the model’s ability to accurately predict both classes and warrants further attention to mitigate bias.

To mitigate the imbalance in our dataset, we will employ a data augmentation technique known as SMOTE (Synthetic Minority Over-sampling Technique). This method generates synthetic instances of the minority class, in this case real followers, by creating new data points interpolated between existing ones. By strategically increasing the representation of the minority class, we aim to achieve a more balanced distribution, enabling our model to learn effectively from both real and fake follower instances. Through SMOTE, we enhance our ability to address the challenge posed by imbalanced data and facilitate the development of a more accurate and robust prediction model.
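
The interpolation behind SMOTE is easy to illustrate with a toy example (the numbers below are purely illustrative and not taken from the dataset):

# A synthetic sample lies on the segment between a minority-class point
# and one of its nearest minority-class neighbours
x_i = np.array([0.2, 10.0])   # hypothetical minority sample
x_nn = np.array([0.4, 14.0])  # one of its nearest minority-class neighbours
lam = np.random.rand()        # random interpolation factor in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)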

Addressing dataset imbalance through data augmentation with SMOTE method

# Addressing dataset imbalance through data augmentation with the SMOTE method
# Instantiate a SMOTE object
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Splitting the dataset into independent variables (X) and the target variable (y)
X = fulldata.drop('is_fake', axis=1)
y = fulldata['is_fake']

# Resampled data
X_resampled, y_resampled = smote.fit_resample(X, y)

# Create a new DataFrame with the augmented data
augmented_df = pd.DataFrame(data=X_resampled, columns=X.columns)
augmented_df['is_fake'] = y_resampled

# Testing (again) for dataset imbalance
# Creating another dataframe to label real and fake accounts
augmented_legend_df = augmented_df.copy()
try:
    augmented_legend_df['is_fake'] = augmented_legend_df['is_fake'].replace({0: 'Real Accounts', 1: 'Fake Accounts'})
except Exception as error:
    print("Error while renaming fake column: ", type(error).__name__, "-", error)
    exit()

fig = px.pie(augmented_legend_df, names='is_fake', title='Target variable distribution after data augmentation', color_discrete_sequence=['#636EFA', '#EF553B'])
fig.update_layout(template='ggplot2')
fig.show()

After applying data augmentation using the SMOTE method, we have successfully balanced our dataset to have an equal distribution of 50% fake followers and 50% real followers. This equilibrium enables our model to learn from both classes more effectively, enhancing its capability to make accurate predictions for both categories.

Transitioning from addressing dataset imbalance, our exploration now takes a visual turn as we venture into the realm of plotting barplots for each binary feature. This analytical endeavor unveils the intricate interplay between these binary attributes and the division between real and fake followers. By visualizing the distribution of these attributes, we embark on a journey to decode the distinctions that set apart genuine accounts from their artificial counterparts. This approach serves as a dynamic tool to unravel underlying patterns, facilitating more informed decision-making and offering fresh perspectives on our data. Through these visual insights, we pave the way for deeper analysis and further exploration.
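
The create_barplots helper is imported from a local module that the article does not list. Here is a minimal sketch of what such a helper might do, assuming it draws one grouped count plot per binary feature using the labelled dataframe:

# Hypothetical reconstruction of create_barplots: one grouped bar chart per
# binary feature, split by the real/fake label of the legend dataframe
def create_barplots(df, legend_df, binary_features):
    for feature in binary_features:
        if feature == 'is_fake':
            continue  # the target itself needs no plot
        fig = px.histogram(legend_df, x=feature, color='is_fake',
                           barmode='group',
                           title=f'{feature} vs Fake followers')
        fig.update_layout(template='ggplot2')
        fig.show()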

# Plotting barplots for each binary feature
create_barplots(augmented_df, augmented_legend_df, binary_features)
User has number vs Fake followers

One striking revelation caught our attention: a notable tendency among fake followers to incorporate numbers within their usernames.

The visual storytelling offered by our bar plots illuminated this intriguing phenomenon. It appears that the realm of fake followers harbors a preference for usernames adorned with numeric characters.

This is precisely the kind of hidden insight that data exploration endeavors to uncover — subtle cues that may elude the naked eye but become vividly apparent through the lens of data visualization.

The power of data-driven exploration is its ability to unveil narratives that might otherwise remain buried, inviting us to explore further and decode the mysteries that lie within.

In the realm of spotting fake followers, the decision to employ the Random Forest (RF) algorithm over Deep Neural Networks (DNNs) is underpinned by various factors, including computational efficiency. RF’s inherent characteristics align well with the specific demands of this task.

When evaluating computational costs, RF often takes the lead due to its ensemble nature and decision tree-based structure. RF constructs multiple decision trees in parallel, harnessing the power of multiple CPU cores or threads. This parallelization substantially reduces training time, a crucial advantage when addressing the fake follower detection challenge. Conversely, DNNs involve intricate mathematical computations and matrix operations, which may demand more computational resources and time.
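
In scikit-learn, this parallelism is controlled by the n_jobs parameter. The snippet below is an aside rather than part of the article’s code, showing how a forest can be spread across every available core:

# n_jobs=-1 builds the individual trees on all available CPU cores
rf_parallel = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)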

Furthermore, RF’s hyperparameter tuning process is relatively straightforward, offering a pragmatic advantage when selecting the optimal model configuration. In comparison, DNNs often entail intricate tuning efforts, especially when dealing with complex architectures.

Given the specific nature of spotting fake followers, where the dataset may not be as extensive as other domains, RF’s well-tuned performance and computational efficiency make it an apt choice. This strategic selection aligns with the task’s requirements while ensuring practicality and efficiency, making RF a compelling contender in the pursuit of authenticating online presence.

We begin by splitting our resampled data into two parts: a training set (X_train, y_train) and a validation set (X_val, y_val). This division is crucial for assessing the model’s performance on data it hasn’t seen during training.

# Creating training and validation sets from resampled data
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size = 0.3, random_state = 42)

Next, we prepare to use a Random Forest classifier, a versatile machine learning algorithm that can distinguish between real and fake followers. The “random_state” parameter ensures that the randomness within the algorithm is consistent, making results reproducible. Now, directing our attention to the pivotal rf.fit(X_train, y_train) line. In this stride, the Random Forest model acquires insights from the training data, enabling it to deftly discern between genuine and counterfeit followers.

# Initializing the model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)  # Fitting to the training data

Moving forward, we put our trained model to use. We predict whether the profiles in the validation set are real or fake followers. This prediction process leverages the learned patterns to make assessments based on the features. To gauge the model’s performance, we compute key metrics for the Receiver Operating Characteristic (ROC) curve, specifically the False Positive Rate (FPR) and True Positive Rate (TPR). These metrics help us understand how well the model is separating real and fake followers. We also obtain the threshold values that correspond to the FPR and TPR points. These thresholds represent the decision boundaries that the model uses to classify profiles as real or fake followers. Lastly, we construct a DataFrame named “roc_df” to store the calculated FPR, TPR, and threshold values. This DataFrame serves as a valuable resource for generating visualizations of the ROC curve, aiding our understanding of the classifier’s performance.

y_pred = rf.predict(X_val)  # Predicting on the validation set
fpr, tpr, thresholds = roc_curve(y_val, y_pred)  # ROC curve metrics
roc_df = pd.DataFrame({'FPR': fpr, 'TPR': tpr, 'Thresholds': thresholds})  # Storing the metrics for plotting
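
Note that feeding hard 0/1 predictions into roc_curve yields only a handful of threshold points. A common alternative, shown here as a sketch rather than what the article ran, is to pass the positive-class probabilities instead:

# Using class probabilities instead of hard labels produces a smoother ROC curve
y_proba = rf.predict_proba(X_val)[:, 1]  # probability of the "fake" class
fpr_p, tpr_p, thresholds_p = roc_curve(y_val, y_proba)
roc_proba_df = pd.DataFrame({'FPR': fpr_p, 'TPR': tpr_p, 'Thresholds': thresholds_p})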

In summary, this code section guides us through the process of creating training and validation sets, initializing and training a Random Forest classifier, predicting on the validation set, calculating ROC curve metrics, and generating a DataFrame for insightful analysis.

Here, we calculate the baseline performance using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric. This metric quantifies the model’s ability to distinguish between real and fake followers.

baseline_score = roc_auc_score(y_val, y_pred)
print('\n')
print('AUC-ROC Baseline: ', baseline_score.round(2))
print('\n')

Sample output

AUC-ROC Baseline:  0.96

By comparing our model’s performance against this baseline, we gain a valuable perspective on its effectiveness. The roc_auc_score function computes the AUC-ROC score based on the predicted labels (y_pred) and the actual labels (y_val) from the validation set. This score ranges from 0 to 1, where higher values signify better discrimination. We then print the baseline AUC-ROC score rounded to two decimal places, providing a benchmark against which to gauge our model's progress. This process is integral to assessing the model's performance and its potential real-world impact (for an in-depth understanding of the ROC AUC score, consider delving into Sarang Narkhede's article “Understanding AUC - ROC Curve”).
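
As a quick sanity check on how the metric behaves, consider a toy example (the numbers are illustrative and unrelated to our dataset):

# Perfectly ranked scores give 1.0, a perfectly inverted ranking gives 0.0,
# and random guessing hovers around 0.5
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.3, 0.7, 0.9]))  # 1.0
print(roc_auc_score([0, 0, 1, 1], [0.9, 0.7, 0.3, 0.1]))  # 0.0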

In this code excerpt, we initialize the visualization with a defined style from Seaborn. A line plot is constructed to illustrate the Receiver Operating Characteristic (ROC) curve, featuring the False Positive Rate (FPR) on the horizontal axis and the True Positive Rate (TPR) on the vertical axis. This graph visually conveys the performance of the RandomForest Classifier model, presenting its AUC-ROC score along with the baseline score. A dashed line is superimposed to represent the ROC curve that would result from random guessing. The plot bears a title and is furnished with appropriate labels for its axes. The legend serves to distinguish between the classifier’s curve and the random guessing line. Subsequently, we exhibit the distribution of actual values (y_val) and the predicted values (y_pred), offering insights into the outcomes of the classification.

sns.set_style('darkgrid')
sns.lineplot(x='FPR', y='TPR', data=roc_df, label=f'RandomForest Classifier(AUC-ROC = {baseline_score.round(2)})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guessing')
plt.title('AUC-ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
print('\n')
print('y_val value counts')
print(y_val.value_counts())
print('\n')
print('predicted value counts')
print(np.unique(y_pred, return_counts=True))

Sample output

With an AUC score of 0.96, the graph produced by the code portion illustrates a well-separated curve that is positioned significantly above the dashed line representing random guessing. This visual representation signifies that the RandomForest Classifier model has a high capacity to distinguish between true positives and false positives. In other words, it demonstrates a strong ability to correctly identify real followers and fake followers, leading to a superior performance compared to random chance. The larger the area under the curve (AUC), the better the model’s overall predictive accuracy.

In this code portion, we are delving into the realm of feature importance analysis using the SHAP (SHapley Additive exPlanations) library. This powerful tool enables us to comprehend the impact of individual features on the model’s predictions.


# Plotting Feature Importance plot
shap_values = shap.TreeExplainer(rf).shap_values(X_val)
shap.summary_plot(shap_values, X_val, plot_type="bar")

By utilizing a TreeExplainer tailored to the Random Forest model, we generate SHAP values that quantify the contribution of each feature to the model’s output. The subsequent summary_plot call crafts an insightful bar plot, illustrating the magnitude and direction of these feature impacts. This visual representation helps us identify which attributes wield the greatest influence in our fake follower detection model, contributing to a deeper understanding of its decision-making process (for a comprehensive elucidation of SHapley Additive exPlanations, I recommend delving into Fernando López's article on the subject).

Sample output

The graphical representation produced by the SHAP function, specifically the shap.summary_plot, highlights the pronounced influence of features such as edge_followed_by and username_has_number. These attributes play a pivotal role in detecting fake followers.
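
To drill further into a single attribute, SHAP also offers per-feature dependence plots. The line below is a sketch, assuming shap_values is the per-class list that TreeExplainer returns for classifiers, with index 1 corresponding to the fake class:

# Dependence plot for one influential feature (class 1, the "fake" label);
# assumes shap_values is a list of per-class arrays
shap.dependence_plot("username_has_number", shap_values[1], X_val)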

Transitioning to machine learning modeling entails algorithm selection in line with our objectives. Our journey begins with computing means and standard deviations, a pivotal step for outlier detection and model optimization.

This code segment involves calculating the mean and standard deviation of the features within the augmented dataset. By computing these statistical measures, we gain insights into the central tendency and variability of the data. This analysis allows us to identify potential outliers, assess feature distribution, and understand the data’s overall characteristics, which are crucial steps in preparing and fine-tuning machine learning models.

# Measuring mean values and standard deviations
df_means = augmented_df.mean().round(2)
df_stds = augmented_df.std().round(2)
results = pd.concat([df_means, df_stds], axis = 1)
results.columns = ['Mean', 'Standard Deviation']
print(results)

Sample output

                       Mean  Standard Deviation
edge_followed_by       0.00                0.04
edge_follow            0.30                0.27
username_length       12.06                3.29
username_has_number    0.38                0.49
full_name_has_number   0.06                0.24
full_name_length       8.70                7.68
is_private             0.24                0.43
is_joined_recently     0.20                0.40
has_channel            0.00                0.00
is_business_account    0.09                0.29
has_guides             0.00                0.03
has_external_url       0.11                0.31
is_fake                0.50                0.50

Upon examining the output, several key insights emerge. Binary features such as “username_has_number” show means and standard deviations in the 0.4 to 0.5 range, the signature of 0/1-encoded attributes, while “edge_followed_by” and “edge_follow” sit on a much smaller scale, suggesting they have already been normalized. In contrast, features like “username_length” and “full_name_length” demonstrate considerably higher means and spreads, hinting at real variation among followers.

Given the variance in scale and distribution, employing feature scaling techniques, such as StandardScaler, could enhance model performance. This ensures that all features are treated equally during the learning process, preventing any undue influence.

As we proceed with algorithm selection, understanding these nuances is vital. To effectively discern the best algorithm for our task, employing techniques like cross-validation with various models (e.g., RandomForest, GradientBoosting, LogisticRegression) is prudent. This process considers both the spread and magnitude of features, guiding us towards an accurate and robust model.
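
The article does not show the cross-validation step itself, so the snippet below is only a sketch of how such a comparison could be run, assuming the resampled data and the imports introduced earlier:

# Cross-validation sketch: comparing candidate models with 5-fold CV on AUC-ROC
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

candidates = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_resampled, y_resampled, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC-ROC = {scores.mean():.2f} (+/- {scores.std():.2f})")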

This code defines a dictionary of machine learning models, including XGBoost and AdaBoost classifiers, with a set random state for reproducibility.

# Defining models
models = {
    "XGBoost": XGBClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42)
}

This code wraps each model in a pipeline with a StandardScaler, fits the pipelines on the training data, predicts on the validation data, and calculates an AUC-ROC score for each pipeline. Results are stored in a dictionary and printed for evaluation.


# Wrapping each model in a pipeline with feature scaling
pipelines = {
    name: Pipeline([("scaler", StandardScaler()), ("model", model)])
    for name, model in models.items()
}

results = {}
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)      # Fitting on the training data
    y_pred = pipeline.predict(X_val)    # Predicting on the validation set
    auc = roc_auc_score(y_val, y_pred)  # Evaluating with AUC-ROC
    results[name] = {
        "pipeline": pipeline,
        "auc": auc
    }
    print(f"{name}: AUC-ROC score = {auc:.2f}")

Sample output

XGBoost: AUC-ROC score = 0.95
AdaBoost: AUC-ROC score = 0.92

In the realm of model selection, we have staged a small expedition. Our two contenders, XGBoost and AdaBoost, have undergone rigorous evaluation. With XGBoost achieving an AUC-ROC of 0.95 and AdaBoost 0.92, we stand at an intriguing juncture: our baseline Random Forest has already set the bar at 0.96, prompting a thoughtful decision.

As we gaze at these numbers, an invitation emerges: extend your horizon. The code is a canvas waiting for you to introduce new contenders, such as Support Vector Machines, Logistic Regression, and beyond. The journey is not over; it is a call to arms. The path to enhanced authenticity detection lies in your hands. Embark on the challenge: update this code, explore more algorithms (a starting point is sketched below), and witness the realm of possibilities unfurl. Your quest awaits.
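
As a starting point for that extension, and purely as a sketch rather than part of the original code, new contenders can simply be appended to the models dictionary so they flow through the same pipeline loop:

# Hypothetical extension: two extra contenders added to the comparison
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

models.update({
    "SVM": SVC(probability=True, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
})

Re-running the pipeline construction and evaluation loop above will then score the newcomers alongside XGBoost and AdaBoost.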

For those interested in accessing the dataset, which was compiled by Reza Jafari, you can download it from the provided link. I would also like to extend my warm greetings to Luís Fernando Torres, a fellow data scientist whose insights and inspiration greatly contributed to the creation of this article.

If you have any questions or would like to connect further, feel free to reach out to me via LinkedIn or my website.
