Define threshold of logistic regression in Python

Little Dino
2 min read · Apr 22, 2022

Introduction

As we discussed before, logistic regression predicts the probabilities of an object belonging to each class and makes binary classification based on the probabilities.

By default, the probability threshold of the LogisticRegression class in the scikit-learn package is 0.5. For example, a student with more than a 50% predicted chance of passing the exam will be classified as “pass” (class 1).
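To make the default concrete, here is a minimal sketch that checks whether thresholding the class 1 probability at 0.5 by hand gives the same labels as model.predict. The toy arrays hours and passed are made up for illustration and are not the data used in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours of study and pass (1) / fail (0)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Probability of class 1 is the second column of predict_proba
prob_pass = model.predict_proba(hours)[:, 1]

# Thresholding at 0.5 by hand should reproduce model.predict
manual = (prob_pass > 0.5).astype(int)
same_as_predict = np.array_equal(manual, model.predict(hours))
print(same_as_predict)
```

If the two agree, changing the 0.5 in the manual line is all it takes to use a different threshold.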

However, sometimes we might want to define our own threshold depending on various circumstances. In this article, we’ll look at how to define the threshold.

Example

Say we collect the data of 30 students and save the information in a data frame called df with 2 columns, hoursOfStudy and passing. We want to use logistic regression to predict whether a student will pass the final exam (y) based on hours of study (x).
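If you want to follow along without the original data, the sketch below builds a synthetic stand-in for df with the same two column names. The values, the random seed, and the assumed relationship between study hours and passing are all invented for illustration, so your numbers will differ from the ones reported in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for the article's data (NOT the original data set)
hours = rng.uniform(0, 10, size=30).round(1)

# Assumed relationship: more hours of study -> higher chance of passing
pass_prob = 1 / (1 + np.exp(-(hours - 5)))
passing = (rng.uniform(size=30) < pass_prob).astype(int)

df = pd.DataFrame({"hoursOfStudy": hours, "passing": passing})
print(df.head())
```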

Since we’ve gone through the procedure of logistic regression before, we’ll quickly implement the algorithm. Basically, we split the data into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the model performance. Then, we fit a logistic regression model on the training set by using the LogisticRegression class.

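# Imports (the data frame df is assumed to exist already)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Train/Test split
x_train, x_test, y_train, y_test = train_test_split(df.hoursOfStudy, df.passing, test_size=0.4, random_state=321)
# Fit logistic regression model on training set
# (scikit-learn expects a 2-D feature array, hence the reshape)
x_train_array = np.array(x_train).reshape(-1,1)
logistic = LogisticRegression()
model = logistic.fit(x_train_array, y_train)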

Typically, we’d use model.predict to get the classification result, but here we’ll use a little trick to define the threshold. Specifically, we use the model.predict_proba function.

This function does NOT return the binary classification result (0 or 1); instead, it returns the predicted probabilities. Thus, we can assign data points whose probability is larger than a specific value (e.g. 0.4) to class 1.

In this example, we’ll classify students whose predicted probability of passing the exam is larger than 0.4 as passing.

# Reshape the X for testing data
x_test_array = np.array(x_test).reshape(-1,1)
# Predicted probability
y_predict_prob = model.predict_proba(x_test_array)

In fact, model.predict_proba returns 2 probabilities for each data point, one column per class: the probability of class 0 and the probability of class 1. To define the threshold, we only need 1 of them, so we extract the predicted probability of class 1.

⚡ It’s also fine to use the predicted probability of class 0, just remember to flip the condition when you define the threshold (e.g. predicted probability of class 0 < 0.6).
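As a quick check of that equivalence, the sketch below applies both rules to a small, made-up predict_proba-style array. Because the two columns sum to 1, "class 1 probability > 0.4" and "class 0 probability < 0.6" pick out the same rows.

```python
import numpy as np

# Hypothetical predict_proba output:
# column 0 = P(class 0), column 1 = P(class 1); each row sums to 1
y_predict_prob = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.30, 0.70],
    [0.65, 0.35],
])

# Rule based on the class 1 column
from_class_1 = (y_predict_prob[:, 1] > 0.4).astype(int)
# Equivalent rule based on the class 0 column
from_class_0 = (y_predict_prob[:, 0] < 0.6).astype(int)

print(from_class_1)  # [0 1 1 0]
print(from_class_0)  # [0 1 1 0]
```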

# Extracting predicted probability of class 1
y_predict_prob_class_1 = y_predict_prob[:,1]

Now we can finally define our own threshold by using a list comprehension. Namely, the students with a predicted probability of class 1 larger than 0.4 will be assigned to class 1 (passing the exam).

# Define threshold 0.4
y_predict_class = [1 if prob > 0.4 else 0 for prob in y_predict_prob_class_1]
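Since the probabilities are already a NumPy array, the same thresholding can also be written as a vectorized comparison. This is just an alternative sketch; the probabilities below are made-up values standing in for y_predict_prob_class_1, and the name threshold is introduced here for illustration.

```python
import numpy as np

# Hypothetical class 1 probabilities standing in for y_predict_prob_class_1
y_predict_prob_class_1 = np.array([0.15, 0.41, 0.39, 0.80])
threshold = 0.4

# List comprehension, as in the article
y_predict_class_list = [1 if prob > threshold else 0 for prob in y_predict_prob_class_1]

# Vectorized alternative: boolean mask converted to 0/1 integers
y_predict_class_np = (y_predict_prob_class_1 > threshold).astype(int)

print(y_predict_class_list)         # [0, 1, 0, 1]
print(y_predict_class_np.tolist())  # [0, 1, 0, 1]
```

Both versions give the same labels; the vectorized one tends to be more convenient when you want to try many thresholds.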

After having the classification result, we can evaluate our model. Here we’ll simply look at the accuracy.

from sklearn.metrics import accuracy_score
print("Accuracy:", round(accuracy_score(y_test, y_predict_class), 3))

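It turns out the accuracy of this logistic regression model (with a self-defined threshold of 0.4) is 0.833 on the testing set. Keep in mind that the data set is small, so this is only a rough estimate, and accuracy alone doesn’t show which kind of mistakes the model makes. Of course more evaluation measures are required, but you get the idea of how to define your own threshold!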
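As one example of looking beyond accuracy, the sketch below compares accuracy, precision, and recall at a few thresholds. The labels in y_true and the probabilities in prob_1 are invented for illustration and are not the results from this article.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and predicted class 1 probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
prob_1 = np.array([0.10, 0.45, 0.42, 0.20, 0.55, 0.90, 0.60, 0.35])

results = {}
for threshold in (0.3, 0.4, 0.5):
    # Assign class 1 when the probability exceeds the threshold
    y_pred = (prob_1 > threshold).astype(int)
    results[threshold] = (
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    acc, prec, rec = results[threshold]
    print(f"threshold={threshold}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}")
```

Lowering the threshold labels more students as passing, so recall for class 1 can only stay the same or go up, usually at the cost of more false positives. Which trade-off is acceptable depends on the circumstances.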

References

  1. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
