
Logistic regression is a simple algorithm for solving binary classification problems. It is used in many real-world applications, such as classifying email as spam or not spam, an online transaction as fraudulent or legitimate, or a tumor as malignant or benign.
The outcome of logistic regression is dichotomous in nature. Dichotomous means there are only two possible classes.
Linear Regression Equation:

y = b0 + b1*X1 + b2*X2 + … + bn*Xn

y : dependent variable
X1, X2, …, Xn : explanatory variables
Logistic Function:
The logistic function is also called the sigmoid function. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

p = 1 / (1 + e^(-y))

e : base of the natural logarithms (Euler's number), and y is the actual numerical value that you want to transform.
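The logistic function above can be sketched in a few lines of plain Python (the function name `sigmoid` is my own choice):

```python
import math

def sigmoid(x):
    """Map any real-valued number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The midpoint is exactly 0.5; large positive inputs approach 1
# and large negative inputs approach 0, never reaching either limit.
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```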

Types of Logistic Regression
- Binary Logistic Regression: It has only two possible outcomes, e.g. yes or no.
- Multinomial Logistic Regression: It has more than two nominal (unordered) categories.
- Ordinal Logistic Regression: It has more than two categories, and the target variable's categories are ordered.
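scikit-learn's LogisticRegression handles the multinomial case directly. A minimal sketch using the bundled Iris data, which has three nominal classes (the `max_iter` value is just a convergence safety margin):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three classes, so this is a multinomial problem.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)  # the three nominal categories: [0 1 2]
```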
Let’s begin Coding!!!
Loading Data and Feature Selection
import pandas as pd

dataset = pd.read_csv("dataset/social_network_ads.csv")
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values

.iloc[] is an indexer that slices the dataset by row and column position:
.iloc[<Row selection>, <Column selection>]
X is the independent variable (or feature variable)
Y is the dependent variable (or target variable)
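A tiny stand-in frame makes the .iloc slicing concrete (the column layout here mirrors the Social Network Ads file as assumed above: index 2 = Age, index 3 = EstimatedSalary, index 4 = Purchased):

```python
import pandas as pd

df = pd.DataFrame({
    "User ID": [1, 2, 3],
    "Gender": ["Male", "Female", "Male"],
    "Age": [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 1],
})

X = df.iloc[:, [2, 3]].values  # all rows, columns 2 and 3 -> features
Y = df.iloc[:, 4].values       # all rows, column 4 -> target
print(X.shape)  # (3, 2)
print(Y)        # [0 0 1]
```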
Splitting Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Here the dataset is split into two parts: 80% for training and 20% for testing.
Why random_state = 0?
The function train_test_split splits the arrays or matrices into random train and test subsets. That means every time you run it without specifying random_state, you will get a different result. If you set random_state to some fixed number, the output of Run 1 will be identical to the output of Run 2.
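A quick sketch of this reproducibility on toy data: two calls with the same random_state yield identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# Same random_state -> same shuffle -> identical train/test subsets.
X_tr1, X_te1, _, _ = train_test_split(X, Y, test_size=0.2, random_state=0)
X_tr2, X_te2, _, _ = train_test_split(X, Y, test_size=0.2, random_state=0)
print(np.array_equal(X_te1, X_te2))  # True
```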
Use of StandardScaler()?
The idea behind StandardScaler is that it transforms your data so that its distribution has a mean of 0 and a standard deviation of 1: each value has the sample mean subtracted and is then divided by the standard deviation of the whole dataset.
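This can be verified on a small toy array: after scaling, the mean is 0 and the standard deviation is 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaled = StandardScaler().fit_transform(data)

# (x - mean) / std for every value -> distribution with mean 0, std 1.
print(round(scaled.mean(), 6))  # 0.0
print(round(scaled.std(), 6))   # 1.0
```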
Model training and prediction
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

Model Evaluation
Model evaluation is an important step to identify how accurate the trained model is. A confusion matrix is a table that can be used to measure the performance of a machine learning algorithm, usually a supervised learning one.
The idea behind a confusion matrix is that the numbers of correct and incorrect predictions are summed up class-wise.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix', cm)
class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cm), annot=True, cmap = "YlGnBu", fmt = 'g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual Label')
plt.xlabel('Predicted label')
plt.show()
The confusion matrix has two classes 0 or 1.
In the figure above, 56 and 17 are accurate predictions, while 2 and 5 are inaccurate.
Accuracy = (TP + TN) / Total
Error Rate = (FP + FN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
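A worked example with the counts reported above (56, 17 correct and 2, 5 incorrect; assigning TN = 56, TP = 17, FP = 2, FN = 5 is one plausible reading of the matrix):

```python
TN, FP, FN, TP = 56, 2, 5, 17
total = TN + FP + FN + TP  # 80 test samples

accuracy   = (TP + TN) / total   # correct predictions over all predictions
error_rate = (FP + FN) / total   # incorrect predictions over all predictions
precision  = TP / (TP + FP)      # of those predicted positive, how many were
recall     = TP / (TP + FN)      # of the actual positives, how many were found

print(round(accuracy, 4))    # 0.9125
print(round(error_rate, 4))  # 0.0875
```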
from sklearn import metrics
from sklearn.metrics import classification_report

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
report = classification_report(y_test, y_pred)
print("Classification Report: ", report)

The classification report provides precision, recall, and the F1-score.
Precision is sort of like accuracy, but it looks only at the data you predicted positive. Recall is also sort of like accuracy, but it looks only at the data that is "relevant" in some way.
The closer the accuracy is to 1, the better the trained model.
The F1 score is needed when you want to seek a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).
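The F1 score is the harmonic mean of precision and recall, which is why it rewards balance; a small sketch with made-up precision/recall values shows how one low value drags it down:

```python
def f1(precision, recall):
    # Harmonic mean: unlike the arithmetic mean, a single low
    # input pulls the result strongly toward that low value.
    return 2 * precision * recall / (precision + recall)

print(round(f1(1.0, 1.0), 4))  # 1.0 when both are perfect
print(round(f1(1.0, 0.1), 4))  # 0.1818, dominated by the low recall
```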

Visualizing Performance of Model
from matplotlib.colors import ListedColormap

X1, X2 = np.meshgrid(
    np.arange(start=X_test[:, 0].min() - 1, stop=X_test[:, 0].max() + 1, step=0.01),
    np.arange(start=X_test[:, 1].min() - 1, stop=X_test[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, logreg.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.50, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_test)):
    plt.scatter(X_test[y_test == j, 0], X_test[y_test == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
By analysing the evaluation factors such as the accuracy, the confusion matrix, and the graph, we can see that our model is performing really well.
Full code accessible from GitHub!!! :)
https://github.com/abinj/machine-learning-algorithms.git
You can also reach out to me on LinkedIn
