All-Inclusive Imbalanced Classification for Beginners

Salim Kılınç
Oct 8, 2023 · 8 min read


Hello everybody.

We have completed our third project as Team-2 in Istanbul Data Science Academy’s Data Science Bootcamp. In this article, I would like to walk you through the project, in which I experienced every step of an imbalanced classification problem, written with beginners in mind.

In this project, which we call ‘User Type Prediction’ for short, citibike® asked us to develop an application that predicts whether a user is a Subscriber or a Customer from a handful of input features.

Data Storage

Tools

We used DBeaver as our database administration tool, PostgreSQL as our database management system, and Pandas together with Psycopg 2 to connect the database to Jupyter Notebook, which has been our workspace throughout the bootcamp.

Storage

After splitting the csv file we had downloaded from the citibike® system data platform into train and test datasets, we created our database, schema and tables in DBeaver and copied the values from the two csv files into the corresponding tables.
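The split itself happened before anything touched PostgreSQL. Here is a minimal sketch of that step; the file names and the 80/20 ratio are assumptions rather than the exact values we used.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw trip data exported from the citibike® system data platform (file name is illustrative)
df = pd.read_csv('citibike_tripdata.csv')

# Hold back a test set before loading anything into the database (the ratio is an assumption)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

With train.csv and test.csv in hand, the DDL and COPY commands below load the training data into PostgreSQL.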

CREATE DATABASE istdsa;

CREATE SCHEMA project03;

CREATE TABLE project03.train(
    tripduration INT,
    starttime TIMESTAMP,
    stoptime TIMESTAMP,
    "start station id" INT,
    "start station name" TEXT,
    "start station latitude" FLOAT,
    "start station longitude" FLOAT,
    "end station id" INT,
    "end station name" TEXT,
    "end station latitude " FLOAT, -- note the trailing space, which is matched again in the rename step below
    "end station longitude" FLOAT,
    bikeid INT,
    usertype TEXT,
    "birth year" INT,
    gender INT
);

COPY project03.train FROM '/Users/Shared/istdsa/project03/train.csv' DELIMITER ',' CSV HEADER;

Here is a preview of our tables.

We then connected to the database from Jupyter Notebook using Psycopg 2 and imported our tables with a combination of Pandas and SQL.

import pandas as pd
import psycopg2

params = {
    "host": "localhost",
    "user": "postgres",
    "port": 5432,
    "password": "336991"
}

connection = psycopg2.connect(**params, dbname="istdsa")
train = pd.read_sql("select * from project03.train;", connection)

And here is what our data frame looks like. Notice how unwieldy it is at the very beginning; we will make it useful in the following steps.

EDA & Feature Engineering

Tools

We used Jupyter Notebook as our workspace, Numpy and Pandas to clean and edit our data, matplotlib and seaborn to visualise our data, and pyproj to convert our coordinate columns into a distance column.

Simple Edits

Before moving on to EDA and feature engineering, we made some useful adjustments. The first and simplest of these was to change the column names to something we are more familiar with.

train.rename(
    columns={"tripduration": "trip_duration", "starttime": "start_time",
             "stoptime": "stop_time", "start station id": "start_station_id",
             "start station name": "start_station_name", "start station latitude": "start_station_latitude",
             "start station longitude": "start_station_longitude", "end station id": "end_station_id",
             "end station name": "end_station_name", "end station latitude ": "end_station_latitude",
             "end station longitude": "end_station_longitude", "bikeid": "bike_id",
             "usertype": "user_type", "birth year": "birth_year"},
    inplace=True)

Then we created new useful time columns using the start_time and stop_time columns.

train['start_hour'] = train['start_time'].dt.hour
train['start_day_of_week'] = train['start_time'].dt.dayofweek
train['stop_hour'] = train['stop_time'].dt.hour
train['stop_day_of_week'] = train['stop_time'].dt.dayofweek

It made sense to convert the birth year column into an age column for convenience in the next steps.

train['birth_year'] = 2023 - train['birth_year']
train.rename(columns={"birth_year": "age"}, inplace=True)

The gender column contained three values (in the citibike® data, 0 is unknown, 1 is male and 2 is female), so we folded the unknown values into the female class and re-encoded the column.

# Original citibike® encoding: 0 = unknown, 1 = male, 2 = female
train.gender = train.gender.mask(train.gender == 0, 2)  # fold unknown into female
train.gender = train.gender.mask(train.gender == 1, 0)  # male: 1 -> 0
train.gender = train.gender.mask(train.gender == 2, 1)  # female (including former unknown): 2 -> 1
# Final encoding
# Male: 0
# Female: 1

Since we cannot use categorical columns directly when modelling, we converted our target column into a numeric column. Here, Customer will be our positive label, so it takes the value 1.

ut_dict = {
'Subscriber': '0',
'Customer': '1'
}

train.user_type = train.user_type.map(ut_dict)
train.user_type = train.user_type.astype(int)

We applied the same adjustments to our test table, except for the user_type column: the test table does not contain user_type, our target, because we will use that table for prediction at the end.

EDA & Feature Engineering

At the beginning of EDA, our features were not promising at all. In the pairplot below, the yellow areas in the histograms represent our positive label, and they account for only about 2% of the values in the target column.

The extreme imbalance in the distribution of the data can also be recognised by looking at these histograms.
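The imbalance is easy to quantify in a couple of lines. This is a minimal sketch using the column names from the renamed data frame above; the sample size for the pairplot is an assumption to keep the plot manageable.

import seaborn as sns
import matplotlib.pyplot as plt

# Class shares in the target: roughly 98% Subscriber (0) vs 2% Customer (1)
print(train['user_type'].value_counts(normalize=True))

# Pairplot coloured by user type; a random sample keeps the plot responsive
sns.pairplot(train.sample(5000, random_state=42), hue='user_type', corner=True)
plt.show()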

Let’s look at the distribution of data over time. The increase in density at the beginning and end of working hours on Tuesdays, Wednesdays and Thursdays is familiar from somewhere: the MTA turnstile dataset from our first project.
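Here is a sketch of how such a view can be produced from the engineered time columns; the exact plot type we used may have differed.

import seaborn as sns
import matplotlib.pyplot as plt

# Trip counts by start hour, one line per day of the week (0 = Monday)
hourly = (train.groupby(['start_day_of_week', 'start_hour'])
               .size()
               .reset_index(name='trips'))

sns.lineplot(data=hourly, x='start_hour', y='trips', hue='start_day_of_week', palette='viridis')
plt.xlabel('Start hour')
plt.ylabel('Number of trips')
plt.show()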

Let’s take a look at the density at the stations. Again, as in the MTA turnstile dataset, we see that the density increases at certain stations.

Instead of start_time and stop_time columns, we have already created more useful columns. Instead of start_station_name and end_station_name, we can use their numeric copies, start_station_id and end_station_id. We can also see that the bike_id column will not be of any use. So we can say goodbye to the related columns.

data = [train, test]
for dataset in data:
    dataset.drop(columns=['start_time', 'stop_time', 'start_station_name',
                          'end_station_name', 'bike_id'], inplace=True)

We convert the coordinate columns, which are not useful in their initial form, into a new useful column, distance, and then say goodbye to them.

from pyproj import Geod

wgs84_geod = Geod(ellps='WGS84')

def Distance(lat1, lon1, lat2, lon2):
    # Inverse geodesic on the WGS84 ellipsoid: returns forward azimuth, back azimuth
    # and the distance in metres between the two points
    az12, az21, dist = wgs84_geod.inv(lon1, lat1, lon2, lat2)
    return dist

data = [train, test]
for dataset in data:
    dataset['distance'] = Distance(dataset['start_station_latitude'].tolist(),
                                   dataset['start_station_longitude'].tolist(),
                                   dataset['end_station_latitude'].tolist(),
                                   dataset['end_station_longitude'].tolist())
    dataset['distance'] = dataset['distance'].astype(int)
    dataset.drop(columns=['start_station_latitude', 'start_station_longitude',
                          'end_station_latitude', 'end_station_longitude'],
                 inplace=True)

Let’s also look at how the two user types are distributed across ages. We can see that the majority of the customers are 54 years old.

Before moving on to modelling, let’s take a look at the correlations between our features. Still, none of our features seem to have any significant effect on our target.
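A minimal sketch of how these correlations can be inspected as a heatmap at this point, when every remaining column is numeric; the colour map is an assumption.

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations between the remaining columns, including the target
corr = train.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()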

When we look at the correlation heatmap, we see that some of the features have high correlations between them, which may cause multicollinearity problems. This is because the start and stop hours and days are almost always the same. So we can say goodbye to the columns that concern the stop time.

data = [train, test]
for dataset in data:
    dataset.drop(columns=['stop_day_of_week', 'stop_hour'], inplace=True)

Modelling

Tools

We used Jupyter Notebook as our workspace, Numpy and Pandas to organise our data, seaborn to visualise outcomes, scikit-learn in several ways for splitting, scaling, testing, cross-validation and prediction, and LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier and XGBClassifier for training.

Modelling before Sampling

Before applying any sampling to our data, we wanted to see the cross-validated accuracy scores of five different models.
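A sketch of that baseline comparison, assuming 5-fold cross-validation and default hyperparameters; the actual settings may have differed.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X = train.drop(columns=['user_type'])
y = train['user_type']

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'decision_tree': DecisionTreeClassifier(),
    'random_forest': RandomForestClassifier(),
    'xgboost': XGBClassifier(),
}

# Cross-validated accuracy for each model trained on the imbalanced data
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')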

The accuracy scores were good, as expected. This is because we had trained our models on data with an extremely imbalanced distribution: a model that predicts ‘Subscriber’ every time is already about 98% accurate. But we didn’t let that mislead us. Here is the confusion matrix of the model trained with the Logistic Regression algorithm. This is not an acceptable outcome.

That was not how it was supposed to be. We even considered working with a different data set if sampling didn’t work. Nevertheless, we decided to give it a try.

Modelling after Sampling

We applied random oversampling, SMOTE, ADASYN and Borderline-SMOTE to our dataset, which increased the number of positive labels from around 200 to around 10,000, and random undersampling, which decreased the number of negative labels from around 10,000 to around 200. We then trained a total of 25 models with the logistic regression, KNN, decision tree, random forest and XGBoost algorithms, using each of the oversampled and undersampled datasets. Together with the models trained before sampling, we had 30 models in total. We sorted these 30 models first by their recall scores and then by their precision scores. Here are our top five!
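For readers who have not used resampling before, here is a minimal sketch of how those five samplers can be applied, assuming the imbalanced-learn package; X and y are the same feature matrix and target as in the baseline sketch above.

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

X = train.drop(columns=['user_type'])
y = train['user_type']

samplers = {
    'random_oversampling': RandomOverSampler(random_state=42),
    'smote': SMOTE(random_state=42),
    'adasyn': ADASYN(random_state=42),
    'borderline_smote': BorderlineSMOTE(random_state=42),
    'random_undersampling': RandomUnderSampler(random_state=42),
}

# Each sampler produces a rebalanced copy of the training data
resampled = {name: sampler.fit_resample(X, y) for name, sampler in samplers.items()}

for name, (X_res, y_res) in resampled.items():
    print(name, y_res.value_counts().to_dict())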

What was more important for us was the correct prediction of Customers, hence the recall score. That’s why we chose rf3 with the highest recall score among the top five. Here is the confusion matrix of the rf3 model that we trained with the random forest classifier using the SMOTE oversampled data. We finally got an acceptable outcome.
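Here is a sketch of how such a model could be trained and checked; holding out a validation split and oversampling only the training part is an assumption about the workflow, as are the default hyperparameters.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score, precision_score
from sklearn.model_selection import train_test_split

X = train.drop(columns=['user_type'])
y = train['user_type']

# Hold out part of the training data for validation and oversample only the training part
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_tr_sm, y_tr_sm = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

rf3 = RandomForestClassifier(random_state=42)
rf3.fit(X_tr_sm, y_tr_sm)

# Recall tells us how many true Customers we catch; precision how many predicted Customers are real
y_pred = rf3.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print('recall:   ', recall_score(y_val, y_pred))
print('precision:', precision_score(y_val, y_pred))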

Explainable AI

We used sophisticated models like random forest and XGBoost, but we have to explain to our clients how these models are trained and what goes on in the background when we make predictions. This is exactly where Explainable AI helps us.

Global Explainability

Global explainability means a general explanation of the effects of features on the target. We used Summary Plot for global explainability. Here we see that the feature that affects our target the most is age.
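The summary plot comes from an explainability library; assuming SHAP, which provides exactly this kind of plot for tree models, a sketch continuing from the random forest above might look like this.

import shap

# TreeExplainer handles tree ensembles such as the random forest chosen above
explainer = shap.TreeExplainer(rf3)
shap_values = explainer.shap_values(X_val)

# Depending on the SHAP version, a binary classifier returns either a list with one
# array per class or a single 3-D array; either way we select the Customer (1) class
sv_customer = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Global explainability: which features move the predicted user type the most
shap.summary_plot(sv_customer, X_val)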

Local Explainability

Local explainability refers to the explanation of the effects of features on the target if they take certain values. We used Force Plot for local explainability. Here we see how our target is affected if the features take certain values.
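Continuing from the summary-plot sketch above, again assuming SHAP, a force plot for a single trip might look like this; row 0 is an arbitrary example.

import numpy as np

# expected_value is per class for classifiers in most SHAP versions; index 1 = Customer
base_value = explainer.expected_value[1] if np.ndim(explainer.expected_value) > 0 else explainer.expected_value

# Local explainability: how each feature pushes one prediction towards Customer or Subscriber
shap.initjs()
shap.force_plot(base_value, sv_customer[0, :], X_val.iloc[0, :])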

Web Interface Development

After confirming that our model’s predictions on the test dataset were reasonable, we saved the model as a pkl file using joblib and built a Streamlit application around it, which you can try at https://istdsaproject03.streamlit.app
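As a closing illustration, here is a minimal sketch of what that save-and-serve step can look like; the file name, widgets, default values and feature list are assumptions, not the exact app we deployed.

import joblib
import pandas as pd
import streamlit as st

# The model was persisted once in the notebook, e.g. joblib.dump(rf3, 'rf3_model.pkl')
model = joblib.load('rf3_model.pkl')

st.title('User Type Prediction')

# Collect the engineered features from the user (widgets and defaults are illustrative)
trip_duration = st.number_input('Trip duration (seconds)', min_value=60, value=600)
start_station_id = st.number_input('Start station id', value=1, step=1)
end_station_id = st.number_input('End station id', value=2, step=1)
age = st.number_input('Age', min_value=16, max_value=100, value=35)
gender = st.selectbox('Gender (0 = male, 1 = female)', options=[0, 1])
start_hour = st.slider('Start hour', 0, 23, 8)
start_day_of_week = st.slider('Day of week (0 = Monday)', 0, 6, 2)
distance = st.number_input('Distance (metres)', min_value=0, value=1500)

# Assemble a single-row frame in the same column order the model was trained on
features = pd.DataFrame([{
    'trip_duration': trip_duration,
    'start_station_id': start_station_id,
    'end_station_id': end_station_id,
    'age': age,
    'gender': gender,
    'start_hour': start_hour,
    'start_day_of_week': start_day_of_week,
    'distance': distance,
}])

if st.button('Predict'):
    prediction = model.predict(features)[0]
    st.write('Predicted user type:', 'Customer' if prediction == 1 else 'Subscriber')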

Conclusion

In this project, where I experienced every step of an imbalanced classification problem as a beginner, I am happy to have obtained an acceptable outcome from a model trained on a dataset with unpromising features. What excited me most is that I kept asking myself, ‘What more can I learn?’, and then realised that what I have learned so far is just the beginning.

Thanks to Everybody

Thank you all for sparing your valuable time to read my article.

Please visit my GitHub repository for additional sources related to our project such as the project notebook, the csv file and the python file for the streamlit application: https://github.com/salimkilinc/istdsa_project03

