Understanding Bullying Factors with Machine Learning and SHAP
Developing a classification model to understand the main factors that lead to being bullied
Unfortunately, bullying is still a problem today. According to researchers, one out of every five (20.2%) students reports being bullied [1]. Moreover, people who suffer bullying are at higher risk of depression, anxiety, sleep difficulties, lower academic achievement, and dropping out of school [1].
To better understand the main factors that increase the chance of being bullied, I trained a machine learning model and, using explainability techniques (SHAP), I show how these factors affect the probability of being bullied.
1. The Dataset and Data Cleaning
The dataset comes from a survey conducted in Argentina in 2018, in which 56,981 students answered different questions related to bullying. You can find this dataset on Kaggle. To build the target variable, I considered that a person was bullied (target = 1) if they were bullied on school property in the past 12 months, bullied not on school property in the past 12 months, or cyberbullied in the past 12 months (all of these are variables in the dataset).
There are some non-responses for different variables (represented by blank-space strings). So I first had to replace these values with numpy.nan and, to make the analysis easier, I simply opted to drop the null rows and move on.
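A minimal sketch of that cleaning step, on a tiny hypothetical slice of the survey data (the column values here are illustrative; the real dataset has many more columns and rows):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the survey data: blank-space strings stand for non-responses
df = pd.DataFrame({
    'Sex': ['Male', ' ', 'Female'],
    'Physically_attacked': ['0 times', '1 time', ' '],
})

# Replace blank-space strings with NaN, then drop the incomplete rows
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna().reset_index(drop=True)

print(len(df))  # only the fully answered rows remain
```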
2. Model Pipeline
To avoid data leakage, whenever possible I prefer to use scikit-learn's Pipeline class. It lets us chain the data preprocessing and modeling steps and fit them as a single object.
For the preprocessing step, I had to deal with the fact that the dataset contains only categorical variables. Some of them have an ordinal sense, so I started by separating the variables for One Hot Encoding from the variables for Ordinal Encoding. Moreover, I created a mapping for the ordinal columns to preserve the logical order of their values, since scikit-learn's OrdinalEncoder sorts categories alphabetically by default. Finally, I created two pipelines to encode the categorical columns with OneHotEncoder and the ordinal columns with OrdinalEncoder, combining everything into a single preprocessing pipeline with scikit-learn's ColumnTransformer class.
# Setting the columns to ordinal or categorical
# Here, categorical columns with a sense of order (age, number of times that something happened) were set to ordinal
categorical_columns = [
    'Sex',
    'Felt_lonely',
    'Other_students_kind_and_helpful',
    'Parents_understand_problems',
    'Most_of_the_time_or_always_felt_lonely',
    'Missed_classes_or_school_without_permission',
    'Were_underweight',
    'Were_overweight',
    'Were_obese'
]

ordinal_columns = [
    'Custom_Age',
    'Physically_attacked',
    'Physical_fighting',
    'Close_friends',
    'Miss_school_no_permission'
]

# Creating the mapping order for the ordinal columns
ordinal_cols_mapping = [
    ['11 years old or younger', '12 years old', '13 years old', '14 years old', '15 years old', '16 years old', '17 years old', '18 years old or older'],
    ['0 times', '1 time', '2 or 3 times', '4 or 5 times', '6 or 7 times', '8 or 9 times', '10 or 11 times', '12 or more times'],
    ['0 times', '1 time', '2 or 3 times', '4 or 5 times', '6 or 7 times', '8 or 9 times', '10 or 11 times', '12 or more times'],
    ['0', '1', '2', '3 or more'],
    ['0 days', '1 or 2 days', '3 to 5 days', '6 to 9 days', '10 or more days']
]

# Constructing the preprocessing pipeline
categorical_transformer = Pipeline(
    steps=[
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'))
    ]
)

ordinal_transformer = Pipeline(
    steps=[
        ('encoder', OrdinalEncoder(categories=ordinal_cols_mapping, handle_unknown='use_encoded_value', unknown_value=-1))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_columns),
        ('ord', ordinal_transformer, ordinal_columns)
    ]
)
Next, I trained the full model pipeline, combining the preprocessing pipeline, a StandardScaler, and a RandomForestClassifier. The model_pipeline_rf.set_output(transform='pandas') call keeps the feature names flowing through the pipeline, letting us track them later when we analyze the feature importances, for example.
# Defining the model pipeline
model_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    class_weight='balanced',
    n_jobs=-1,
    verbose=True
)

model_pipeline_rf = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('scaler', StandardScaler()),
        ('rf', model_rf)
    ]
)

# With the set_output API we are able to track the feature names that the pipeline outputs
model_pipeline_rf.set_output(transform='pandas')

# Fitting the model
model_pipeline_rf.fit(X_train, y_train)
3. Feature Importances
When we take a look at the model’s feature importances, we observe that features like ‘Physically attacked’, ‘Felt lonely’, and ‘Sex’, among others, are strong predictors of bullying.
However, we still have a lot of features to analyze, so I ran feature selection with Boruta to reduce the number of variables and make the model easier to explain (keeping in mind that we have to do this without significantly degrading our evaluation metrics). You can learn more about Boruta here and visit my GitHub repo (I’m going to leave a link at the end of the article) to check the project code and see how I used Boruta. After running the feature selection, we are left with fewer features, making the model more understandable.
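Boruta’s core idea — compare each real feature’s importance against shuffled “shadow” copies that destroy any relationship with the target — can be sketched with scikit-learn alone. This is a single-round approximation of what the real Boruta algorithm repeats many times, on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: only the first 2 of 6 features actually carry signal
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shadow features: each column shuffled independently, breaking any link with y
X_shadow = rng.permuted(X, axis=0)
X_all = np.hstack([X, X_shadow])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y)
real_imp = rf.feature_importances_[:6]
shadow_max = rf.feature_importances_[6:].max()

# Keep only features that beat the best shadow (one round; Boruta iterates this)
selected = np.where(real_imp > shadow_max)[0]
print(selected)
```

The real BorutaPy implementation adds repeated runs and a statistical test over them, but the accept/reject logic against shadow features is the same.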
4. Model Explanation with SHAP
Ok, now that we have our model and have checked the feature importances, I want to understand how the values of the variables affect the propensity to suffer bullying. To do this, I used SHAP, a great tool for bringing explainability to different kinds of black-box machine learning models. You can learn more about SHAP here. You can also check the project’s GitHub repo to see how I used SHAP.
The SHAP library provides different kinds of plots to explain a machine learning model. The beeswarm plot is one of my favorites because it shows how the values of the variables affect the probability of a given class.
Here is how to read the graph: the redder the points, the higher the observed value of a given variable; the bluer, the lower. On the x-axis we have the impact on the output probability for a given class: a negative value means less propensity toward that class, while a positive value means more. So if a variable has many red points on the right side of the graph, higher values of that variable increase the propensity toward the observed class. Note that the plot orders the features by importance, so the first one is the most important according to SHAP.
In the figure above, I show the SHAP beeswarm plot for our model for class 1 (the person was bullied). We can see that never feeling lonely (Felt_lonely_never = 1) means less propensity to suffer bullying. The more times a person was physically attacked, and the more times they were in physical fights, the greater the propensity to be bullied. Another interesting finding in this plot is that being male means a lower propensity to being bullied. The SHAP values also show that when other students are rarely kind or helpful to a person, that person has a greater propensity to be bullied.
5. Conclusion
In this article, I used machine learning and SHAP to understand the main factors that lead a person to be bullied. On the technical side, we saw that Boruta allowed us to reduce the number of variables, making the machine learning model easier to explain. Moreover, with SHAP I showed how the values of these variables affect the model’s probability output, helping us understand the drivers of being bullied.
Here are the main findings, according to the model:
- Feeling lonely at any frequency (sometimes, always) raises the propensity to suffer bullying;
- Being physically attacked raises the propensity to suffer bullying;
- Physical fighting indicates a greater propensity to suffer bullying;
- Females have a greater propensity to being bullied;
- Other students rarely helping, or rarely being kind, raises the propensity to suffer bullying.
I also deployed the model as a Streamlit web app on Render. I’m going to write another story to show how I did that, so stay tuned ;)
That’s all, folks! Thanks for reading!
You can check the whole code on my GitHub repo below!
Cheers!