Logistic Regression Using Python

Abin Joy
Nov 1 · 4 min read

Simply put, the logistic regression algorithm is used for solving binary classification problems. Logistic regression can be applied to many real-world problems: classifying an email as spam or not spam, flagging an online transaction as fraud or not fraud, or labelling a tumor as malignant or benign.

The outcome of logistic regression is dichotomous in nature. Dichotomous means there are only two possible classes.

Linear Regression Equation:

y = b0 + b1*X1 + b2*X2 + … + bn*Xn

y : dependent variable
X1, X2 … Xn : explanatory variables

Logistic Function:
The logistic function is also called the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

sigmoid(value) = 1 / (1 + e^(-value))

e : the base of the natural logarithms (Euler’s number), and value is the actual number that you want to transform.
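As a quick sketch, the sigmoid function above can be written in plain Python:

```python
import math

def sigmoid(value):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

print(sigmoid(0))   # exactly 0.5, the midpoint of the S-curve
print(sigmoid(4))   # close to 1, but never exactly 1
print(sigmoid(-4))  # close to 0, but never exactly 0
```

Note how large positive inputs approach 1 and large negative inputs approach 0, without ever reaching those limits.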

Types of Logistic Regression

  1. Binary Logistic Regression: the target has only two possible outcomes, e.g. yes or no.
  2. Multinomial Logistic Regression: the target has more than two nominal (unordered) categories.
  3. Ordinal Logistic Regression: the target has more than two categories, and those categories have a natural order.
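As a minimal sketch of the multinomial case, scikit-learn’s LogisticRegression handles a target with more than two classes automatically; here we use its built-in iris dataset (3 classes) purely as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has 3 flower species, so this is a multinomial problem
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

print(len(clf.classes_))   # 3 distinct classes in the target
print(clf.predict(X[:3]))  # predictions for the first three samples
```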

Let’s begin Coding!!!

Loading Data and Feature Selection

import pandas as pd

dataset = pd.read_csv("dataset/social_network_ads.csv")
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values

.iloc[] is an indexer that slices the dataset by integer position, selecting rows and columns.

.iloc[<Row selection>, <Column selection>]

X is the independent variable (or feature matrix)

Y is the dependent variable (or target variable)
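To make the .iloc[] slicing concrete, here is a tiny hypothetical frame standing in for the real dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-version of the Social Network Ads dataset
df = pd.DataFrame({
    "User ID":         [1, 2, 3],
    "Gender":          ["M", "F", "F"],
    "Age":             [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased":       [0, 0, 1],
})

X = df.iloc[:, [2, 3]].values  # all rows; columns 2 and 3 (Age, EstimatedSalary)
Y = df.iloc[:, 4].values       # all rows; column 4 (Purchased)

print(X.shape)  # (3, 2)
print(Y)        # [0 0 1]
```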

Splitting Data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Here the dataset is split into two parts: 80% for training and 20% for testing.
Why random_state = 0?
The answer is that train_test_split splits the arrays or matrices into random train and test subsets. That means every time you run it without specifying random_state, you will get a different result.

So if you use random_state = some_number, the output of Run 1 will be equal to the output of Run 2, i.e. the split is reproducible.
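A small sketch makes the reproducibility point concrete: splitting the same toy array twice with the same random_state yields identical subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# Same seed on both calls -> identical splits every time
a_train, a_test = train_test_split(data, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=0)

print((a_test == b_test).all())    # True
print((a_train == b_train).all())  # True
```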

Use of StandardScaler()?

The idea behind StandardScaler is that it transforms your data such that its distribution has a mean of 0 and a standard deviation of 1. Each value in the dataset has the sample mean subtracted from it, and is then divided by the standard deviation of the whole dataset.
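We can verify that behaviour directly on a tiny column of numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# After scaling: mean ~ 0, standard deviation ~ 1
print(round(float(X_scaled.mean()), 6))  # 0.0
print(round(float(X_scaled.std()), 6))   # 1.0
```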

Model training and prediction

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

Model Evaluation

Model evaluation is an important step to identify how accurate the model we trained is. A confusion matrix is a table that can be used to measure the performance of a machine learning algorithm, usually a supervised learning one.
The idea behind a confusion matrix is that the numbers of correct and incorrect predictions are summed up class-wise.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', cm)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()

plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

The confusion matrix has two classes 0 or 1.
From the above given figure, 56 and 17 are accurate predictions while 2 and 5 are inaccurate.

Accuracy = (TP + TN) / Total
Error Rate = (FP + FN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
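Plugging the numbers reported above into these formulas (assuming the matrix reads TN = 56, FP = 2, FN = 5, TP = 17; which off-diagonal cell is which depends on the actual run):

```python
# Counts taken from the confusion matrix discussed above
TN, FP, FN, TP = 56, 2, 5, 17
total = TN + FP + FN + TP  # 80 test samples

accuracy = (TP + TN) / total
error_rate = (FP + FN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(f"Accuracy:   {accuracy:.4f}")    # 0.9125
print(f"Error rate: {error_rate:.4f}")  # 0.0875
print(f"Precision:  {precision:.4f}")   # 0.8947
print(f"Recall:     {recall:.4f}")      # 0.7727
```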

from sklearn import metrics
from sklearn.metrics import classification_report

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
report = classification_report(y_test, y_pred)
print("Classification Report:", report)

The classification report provides the precision, recall, and f1-score for each class.

Precision is sort of like accuracy, but it looks only at the data you predicted positive (TP out of TP + FP). Recall is also sort of like accuracy, but it looks only at the data that is actually positive (TP out of TP + FN).

The closer the accuracy is to 1, the better the trained model.

The F1 score is needed when you want to seek a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).
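The F1 score is the harmonic mean of precision and recall; using the precision and recall values computed above as an illustration:

```python
# F1 = harmonic mean of precision and recall; it is pulled down
# sharply when either one of the two is low
precision, recall = 0.8947, 0.7727
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8292
```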

Visualizing Performance of Model

from matplotlib.colors import ListedColormap

# Build a grid covering the (scaled) feature space and colour each point
# by the model's prediction, then overlay the actual test points
X1, X2 = np.meshgrid(np.arange(start=X_test[:, 0].min() - 1, stop=X_test[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_test[:, 1].min() - 1, stop=X_test[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, logreg.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.50, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_test)):
    plt.scatter(X_test[y_test == j, 0], X_test[y_test == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

By analysing the evaluation results, such as the accuracy, the confusion matrix, and the decision-boundary plot, we can say that our model is performing well.

The full code is accessible on GitHub!!! :)
https://github.com/abinj/machine-learning-algorithms.git

You can also reach out to me on LinkedIn.


Written by

Abin Joy

Secretly loves the story behind the data. :)
