Logistic Regression, Accuracy, and Cross-Validation

Lily Su
3 min read · May 14, 2019



Logistic regression is used to classify a value while keeping the prediction within a fixed range, 0 to 1, so it can be read as a probability. It does this with the sigmoid function; below is the sigmoid curve and function:

Image from a Towards Data Science article by Surya Remanan
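For reference, the sigmoid function itself is short enough to write out; here is a minimal sketch in Python (the function name is ours, not from any library):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))  # 0.5, the decision boundary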

We’re first going to take a selection of features and set the target to the Survived column, manually breaking the data into a training and testing set. There is no y_test because that information is held by Kaggle until the project is submitted.

features = ['Pclass', 'Age', 'Fare']  # numeric, imputed
target = 'Survived'
X_train = train[features]
y_train = train[target]  # survived
X_test = test[features]
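The comment above notes that these features are numeric and already imputed. As a minimal sketch of how that setup might look, assuming the standard Kaggle Titanic CSV files and simple mean imputation:

import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.read_csv('train.csv')  # Kaggle Titanic training data
test = pd.read_csv('test.csv')    # Kaggle Titanic test data, no Survived column

# 'Age' (and 'Fare' in the test set) contain missing values
imputer = SimpleImputer(strategy='mean')
train[['Age', 'Fare']] = imputer.fit_transform(train[['Age', 'Fare']])
test[['Age', 'Fare']] = imputer.transform(test[['Age', 'Fare']])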

Here is how we fit the logistic regression. Setting the threshold at 0.5 assumes we’re not making trade-offs between false positives and false negatives: a passenger is classified as surviving whenever the model gives better than a 50% chance of survival.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
threshold = 0.5
log_reg.predict(X_test)

This returns an array of discrete predictions, zeros and ones. We’re going to show manually how this array is derived.

The line below prints, for each row/person in the test set, the probability that they did not survive and the probability that they did survive; the two numbers in each row add to 1.

log_reg.predict_proba(X_test)

The line below prints True or False for each passenger, depending on whether the survival probability passes the threshold of 0.5 that we set.

log_reg.predict_proba(X_test)[:, 1] > threshold

Casting those booleans to integers prints a 1 if the chance of survival is greater than 0.5, and a 0 otherwise.

(log_reg.predict_proba(X_test)[:, 1] > threshold).astype(int)
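As a quick sanity check, this manual thresholding should reproduce predict() exactly at the default 0.5 cutoff (a sketch, using the log_reg fitted above):

import numpy as np

manual_preds = (log_reg.predict_proba(X_test)[:, 1] > threshold).astype(int)
# predict() uses the same 0.5 cutoff internally for binary classification
print(np.array_equal(manual_preds, log_reg.predict(X_test)))  # True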

With the numbers that we get, we can find the accuracy, a common evaluation metric for classification. To get accuracy:


accuracy = correct_predictions / total_predictions
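As a quick worked example with made-up counts:

correct_predictions = 780  # hypothetical: passengers classified correctly
total_predictions = 891    # hypothetical: total passengers scored
accuracy = correct_predictions / total_predictions
print(accuracy)  # 0.8754208754208754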

Accuracy is the proportion of correct predictions over total predictions. Here is how we can find the accuracy of the fitted logistic regression:

score = log_reg.score(X_test, y_test)
print('Test Accuracy Score', score)

We can’t run this line ourselves, since Kaggle withholds y_test, but the score would be computed as above.
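Since the real test accuracy only comes back from Kaggle, here is a minimal sketch of building a submission file (assuming the test DataFrame kept its PassengerId column, as the Titanic data has):

submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': log_reg.predict(X_test)
})
submission.to_csv('submission.csv', index=False)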

Here is another way to take the accuracy score, this time on the training set:

from sklearn.metrics import accuracy_score
y_pred = log_reg.predict(X_train)  # predictions on the training set
accuracy_score(y_train, y_pred)

Let’s say that we don’t have a lot of data points, and it doesn’t make sense to split our data into train, validation, and test sets. Scikit-learn has a cross_val_score function that allows us to see how well our model generalizes.


Here’s how to cross-validate:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', scores)

We can then see the range our scores fall in:

scores = pd.Series(scores)  # wrap in a pandas Series for quick summary stats
scores.min(), scores.mean(), scores.max()
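Since scores is now a pandas Series, describe() is another quick way to summarize the spread:

scores.describe()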

So our accuracy ranges from about 0.62 to 0.75, and is around 0.7 on average.
