Build a Logistic Regression model from scratch in Swift

Ocean Paradise (they/them) · Published in SwiftSafari · 10 min read · Jun 8, 2023

In the realm of machine learning, logistic regression holds a special place. Despite its name suggesting a regression algorithm, it’s primarily used for classification tasks, which sets it apart from its sibling, linear regression. Logistic regression is a simple yet powerful algorithm used for binary classification problems, i.e., problems with two possible outcomes.

Suppose you want to predict whether an email is spam or not, or if a student will pass or fail an exam based on their number of study hours. These are classic examples of binary classification problems, and logistic regression can be a great tool to solve them.

Linear regression, as we previously discussed here, is used for predicting a continuous outcome. For example, you could use linear regression to predict a house’s price based on various features like its size, location, age, etc. However, if we try to use linear regression for classification problems, we may run into issues. Linear regression could predict values less than 0 or greater than 1, which doesn’t make sense in a probability context, since probabilities range from 0 to 1.

This is where logistic regression comes in. Logistic regression uses the same basic idea as linear regression but transforms its output using the logistic function to ensure that it falls between 0 and 1. This output can be interpreted as the probability of a particular class or event.

In the following sections, we’ll dig deeper into the logistic function, understand the cost function used in logistic regression, discuss how gradient descent is used to minimize this cost function, and finally, build our own logistic regression model from scratch. Stick around to unravel the simplicity and elegance of logistic regression.

Understanding the Logistic Function

(If you want to get straight to the code, skip to the Implementation section below!)

The heart of logistic regression lies in the logistic function, also known as the sigmoid function. The logistic function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1.

The mathematical expression for the sigmoid function is:

σ(x) = 1 / (1 + e^-x)

The ‘e’ here is the base of the natural logarithm, Euler’s number, approximately equal to 2.71828.

Now, you might be wondering, why do we need this function in logistic regression? The answer lies in the nature of binary classification problems. The output we seek is a probability that the given input point belongs to a certain class. Probabilities range from 0 to 1. Therefore, we need a function that can map any real-valued number to the (0, 1) range.

Let’s look at a scenario where we want to predict whether a student passes (1) or fails (0) an exam based on the number of hours they studied. We use a linear function to represent the relationship between the number of study hours and the output:

y = m*x + b

However, this equation could produce values less than 0 or greater than 1, which wouldn’t make sense in the context of probabilities. Therefore, we pass this output through the sigmoid function:

p = 1 / (1 + e^-(m*x + b))

Now, ‘p’ gives us a value between 0 and 1, representing the probability of the student passing the exam. If ‘p’ is greater than or equal to 0.5, we can classify the student as passing; otherwise, we classify them as failing.

The logistic function not only ensures that the output is within the (0, 1) range but also has a unique property that makes it convenient for calculation: its derivative can be expressed in terms of the function itself, σ’(x) = σ(x) * (1 - σ(x)), which is highly beneficial when computing the gradient of the cost function during optimization.
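As a quick sanity check, here is a minimal sketch (the code for the full model comes later in the Implementation section; the sigmoid is redefined here only so the snippet is self-contained) that compares the identity above against a numerical derivative:

import Foundation

// Numerically check that σ'(x) = σ(x) * (1 - σ(x)) using a small central difference.
func sigmoid(_ z: Double) -> Double {
    return 1.0 / (1.0 + exp(-z))
}

let x = 0.7
let h = 1e-5
let numericalDerivative = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
let analyticalDerivative = sigmoid(x) * (1 - sigmoid(x))

print(numericalDerivative)   // ≈ 0.2217
print(analyticalDerivative)  // ≈ 0.2217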

In the next sections, we’ll dive into the cost function used in logistic regression and how we use gradient descent to optimize it.

Cost Function in Logistic Regression

In logistic regression, we don’t use the same cost function as in linear regression, for a compelling reason. If we were to use the mean squared error cost function from linear regression, the sigmoid transform would leave us with a non-convex function. Non-convex functions have multiple local minima, and gradient descent may not find the optimal global minimum.

Instead, we use a cost function known as “log loss” or “binary cross-entropy”. This function is convex, ensuring that gradient descent can find the global minimum.

For a single training example, the cost is defined as:

Cost(y, y_predicted) = -[y*log(y_predicted) + (1-y)*log(1 - y_predicted)]

In this formula, ‘y’ is the actual label (0 or 1), and ‘y_predicted’ is the predicted probability that y=1.

This cost function makes intuitive sense. When the actual ‘y’ is 1, the second term disappears because we’re multiplying by (1-y), and the cost becomes -log(y_predicted). As 'y_predicted' approaches 1 (our desired outcome), the cost goes to 0. But if 'y_predicted' approaches 0, the cost goes to infinity.

Similarly, when the actual ‘y’ is 0, the first term disappears, and we’re left with -log(1 - y_predicted). In this case, as 'y_predicted' approaches 0 (our desired outcome), the cost goes to 0, but as 'y_predicted' approaches 1, the cost goes to infinity.

To calculate the overall cost function for ‘n’ training examples, we simply average the cost across all of them:

Cost = -(1/n) * Σ[y^(i) * log(y_predicted^(i)) + (1-y^(i)) * log(1 - y_predicted^(i))]

This cost function is suitable for logistic regression because it penalizes confident but wrong predictions far more heavily than unconfident ones (i.e., those closer to 0.5).
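To see that penalty in action, here is a minimal sketch (the helper name singleExampleCost is made up for illustration) that evaluates the per-example log loss for an actual label of 1 at a few predicted probabilities:

import Foundation

// Per-example log loss. With y = 1 this reduces to -log(yPredicted),
// so confident but wrong predictions (probabilities near 0) are punished hardest.
func singleExampleCost(actual y: Double, predicted yPredicted: Double) -> Double {
    return -(y * log(yPredicted) + (1 - y) * log(1 - yPredicted))
}

print(singleExampleCost(actual: 1, predicted: 0.9))   // ≈ 0.105 — confident and correct
print(singleExampleCost(actual: 1, predicted: 0.5))   // ≈ 0.693 — unsure
print(singleExampleCost(actual: 1, predicted: 0.1))   // ≈ 2.303 — confident and wrong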

In the next section, we’ll look at how we can use gradient descent to minimize this cost function and find the optimal parameters for our logistic regression model.

Gradient Descent in Logistic Regression

Gradient descent is a widely-used optimization algorithm in machine learning that allows us to find the parameters of our function that minimize the cost. The “gradient” in the name refers to the derivative or the slope of the function. By finding the gradient of our cost function, we can move in the direction where the cost decreases most rapidly.

The procedure of gradient descent remains the same as in linear regression, but the cost function, and hence the gradient update equations, are different.

For logistic regression, our hypothesis function ‘h’ is the sigmoid function applied to the linear model:

h(x) = σ(m*x + b)

And the cost function ‘J’ is defined as (with ‘n’ the number of training examples, so it doesn’t clash with the slope ‘m’):

J(m, b) = -(1/n) * Σ[y^(i) * log(h(x^(i))) + (1-y^(i)) * log(1 - h(x^(i)))]

The derivatives of the cost function with respect to ‘m’ and ‘b’ are:

∂J/∂m = (1/n) * Σ[(h(x^(i)) - y^(i)) * x^(i)]

∂J/∂b = (1/n) * Σ(h(x^(i)) - y^(i))

These derivatives give the direction of the steepest ascent, i.e., the direction in which the cost function increases most rapidly. To minimize the cost function, we want to go in the opposite direction, so we subtract these derivatives from the current parameters ‘m’ and ‘b’:

m = m - α * ∂J/∂m

b = b - α * ∂J/∂b

In these equations, ‘α’ is the learning rate, a hyperparameter that determines how big a step we take in the direction of the steepest descent. A smaller learning rate could make the learning process slower but more likely to converge, while a larger learning rate could make learning faster but risk overshooting the minimum.

This update is repeated for a number of iterations until we reach a satisfactory solution where the cost function is minimized.

In the next section, we will put all these pieces together to build a logistic regression model from scratch. We’ll take a step-by-step approach and implement each part in code, making sure to explain each part and how it relates to the concepts we’ve discussed so far.

Implementation

Let’s now implement our logistic regression model in code. We’ll use Swift for our implementation. In this section, we’ll implement each part of the algorithm: the sigmoid function, the cost function, and the gradient descent algorithm.

First, let’s define our sigmoid function:

import Foundation // needed for exp and log

// Maps any real-valued number into a value between 0 and 1.
func sigmoid(_ z: Double) -> Double {
    return 1.0 / (1.0 + exp(-z))
}

As discussed, the sigmoid function is used to map any real-valued number into a value between 0 and 1.
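For instance, a few quick calls illustrate the mapping (the printed values below are approximate):

print(sigmoid(0.0))    // 0.5 — exactly in the middle
print(sigmoid(4.0))    // ≈ 0.982 — large positive inputs approach 1
print(sigmoid(-4.0))   // ≈ 0.018 — very negative inputs approach 0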

Next, let’s define our cost function:

func costFunction(_ m: Double, _ b: Double, _ x: [Double], _ y: [Double]) -> Double {
    var totalCost = 0.0
    let n = Double(x.count) // number of training examples (named 'n' so it doesn't shadow the slope 'm')

    for i in 0..<x.count {
        let h = sigmoid(m * x[i] + b)
        totalCost += -(y[i] * log(h) + (1 - y[i]) * log(1 - h))
    }

    return (1 / n) * totalCost
}

The cost function computes the average cost over all training examples.
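As a quick sanity check, we can evaluate the cost for some hypothetical parameters (the slope, intercept, and toy data below are made up purely for illustration):

// Hypothetical parameters and toy data, purely for illustration.
let toyX = [1.0, 2.0, 3.0, 4.0]
let toyY = [0.0, 0.0, 1.0, 1.0]
print(costFunction(1.0, -2.5, toyX, toyY))   // ≈ 0.34 for these made-up values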

Finally, we’ll implement our gradient descent algorithm:

func gradientDescent(_ x: [Double], _ y: [Double], learningRate: Double, numberOfIterations: Int) -> (m: Double, b: Double) {
    var m = 0.0
    var b = 0.0
    let N = Double(x.count)

    for _ in 0..<numberOfIterations {
        var sum_m = 0.0
        var sum_b = 0.0

        // Accumulate the gradients over all training examples.
        for i in 0..<x.count {
            let h = sigmoid(m * x[i] + b)
            sum_m += (h - y[i]) * x[i]
            sum_b += (h - y[i])
        }

        // Step in the direction opposite to the gradient.
        m -= learningRate * (1 / N) * sum_m
        b -= learningRate * (1 / N) * sum_b
    }

    return (m, b)
}

The gradientDescent function repeatedly adjusts the parameters 'm' and 'b' to minimize the cost function. The learning rate and the number of iterations are hyperparameters that you can adjust to achieve better results.
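If you want to watch the optimization make progress, one common variation is to log the cost every few hundred iterations. Here is a sketch of that idea (gradientDescentVerbose is a made-up name, and it assumes the sigmoid and costFunction defined above are in scope):

// A variant of gradientDescent that also logs the cost as it trains.
func gradientDescentVerbose(_ x: [Double], _ y: [Double], learningRate: Double, numberOfIterations: Int) -> (m: Double, b: Double) {
    var m = 0.0
    var b = 0.0
    let n = Double(x.count)

    for iteration in 0..<numberOfIterations {
        var sum_m = 0.0
        var sum_b = 0.0

        for i in 0..<x.count {
            let h = sigmoid(m * x[i] + b)
            sum_m += (h - y[i]) * x[i]
            sum_b += (h - y[i])
        }

        m -= learningRate * (1 / n) * sum_m
        b -= learningRate * (1 / n) * sum_b

        // Print the cost periodically so we can see it shrink over time.
        if iteration % 200 == 0 {
            print("Iteration \(iteration): cost = \(costFunction(m, b, x, y))")
        }
    }

    return (m, b)
}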

Evaluation

Model evaluation is a critical part of the machine learning process. It helps us understand how well our model performs. For logistic regression, we can use several metrics, including the confusion matrix, accuracy, precision, recall, and the F1 score. Let’s start by discussing two important ones: accuracy and the confusion matrix.

Accuracy

Accuracy is one of the most straightforward metrics used in machine learning. It gives us a general idea of how our model performs across all classes. It is the ratio of the number of correct predictions to the total number of predictions (or the total instances).

Let’s implement a function for this:

func accuracy(_ y: [Double], _ yPred: [Double]) -> Double {
    var correctCount = 0.0

    for i in 0..<y.count {
        if (yPred[i] >= 0.5 && y[i] == 1) || (yPred[i] < 0.5 && y[i] == 0) {
            correctCount += 1
        }
    }

    return correctCount / Double(y.count)
}

In the function above, y represents the actual labels, and yPred represents the predicted probabilities. If the predicted probability is greater than or equal to 0.5, we consider that the model predicts class 1; otherwise, it predicts class 0.

Confusion Matrix

While accuracy is a straightforward and easy-to-understand metric, it doesn’t always tell the full story, especially for imbalanced datasets. A confusion matrix provides a more detailed breakdown of a classifier’s performance.

A confusion matrix for a binary classifier consists of four values:

  • True positives (TP): The number of instances that were correctly classified as positive.
  • True negatives (TN): The number of instances that were correctly classified as negative.
  • False positives (FP): The number of instances that were incorrectly classified as positive (i.e., they are actually negative).
  • False negatives (FN): The number of instances that were incorrectly classified as negative (i.e., they are actually positive).

We can create a confusion matrix function as follows:

func confusionMatrix(_ y: [Double], _ yPred: [Double]) -> (tp: Int, tn: Int, fp: Int, fn: Int) {
    var tp = 0
    var tn = 0
    var fp = 0
    var fn = 0

    for i in 0..<y.count {
        let pred = yPred[i] >= 0.5 ? 1 : 0

        if pred == 1 && y[i] == 1 {
            tp += 1
        } else if pred == 1 && y[i] == 0 {
            fp += 1
        } else if pred == 0 && y[i] == 1 {
            fn += 1
        } else {
            tn += 1
        }
    }

    return (tp, tn, fp, fn)
}

By understanding these evaluation metrics, you can have a more comprehensive view of your model’s performance. Accuracy gives you a quick understanding of the overall performance, while the confusion matrix gives you a deeper insight into how your model is performing for each class.
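Precision, recall, and the F1 score mentioned earlier can all be derived from these four counts. Here is a minimal sketch of how that could look (the helper names precision, recall, and f1Score are made up for illustration; each guards against division by zero):

// Precision, recall, and F1 derived from confusion-matrix counts.
// These helpers return 0 when a denominator would otherwise be 0.
func precision(tp: Int, fp: Int) -> Double {
    let denominator = Double(tp + fp)
    return denominator == 0 ? 0 : Double(tp) / denominator
}

func recall(tp: Int, fn: Int) -> Double {
    let denominator = Double(tp + fn)
    return denominator == 0 ? 0 : Double(tp) / denominator
}

func f1Score(tp: Int, fp: Int, fn: Int) -> Double {
    let p = precision(tp: tp, fp: fp)
    let r = recall(tp: tp, fn: fn)
    return (p + r) == 0 ? 0 : 2 * p * r / (p + r)
}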

Example

Let’s use our newly built logistic regression model to predict whether a student will pass or fail an exam based on the number of hours they studied. This simple binary classification problem is perfect for logistic regression.

Assume we have the following data:

let hoursStudied = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
let passExam = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Here, each number in hoursStudied corresponds to the number of hours a student studied for the exam, and each value in passExam indicates whether that student passed (1) or failed (0).

First, we run gradient descent to find our parameters m and b:

let (m, b) = gradientDescent(hoursStudied, passExam, learningRate: 0.01, numberOfIterations: 1000)

Now that we have our parameters, we can use them to make predictions:

var predictions = [Double]()
for x in hoursStudied {
    let predictedProbability = sigmoid(m * x + b)
    predictions.append(predictedProbability)
}

Now, we can evaluate the accuracy of our model:

let modelAccuracy = accuracy(passExam, predictions)
print("Model Accuracy: \(modelAccuracy)")

And create a confusion matrix:

let (tp, tn, fp, fn) = confusionMatrix(passExam, predictions)
print("Confusion Matrix: TP \(tp), TN \(tn), FP \(fp), FN \(fn)")

This is a simple example, but it gives you an idea of how you can apply logistic regression to a real-world problem. Remember that your data may need to be preprocessed (e.g., normalized) before it’s ready to be used in a machine learning model. And when working with real-world data, it’s always a good idea to split it into a training set and a test set so you can evaluate the model’s performance on unseen data; a minimal sketch of such a split follows.
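As one possible (simplified) approach, an 80/20 train/test split in Swift could look like this sketch (trainTestSplit is a made-up helper, and it assumes x and y have the same length):

// A simple 80/20 train/test split on parallel feature and label arrays.
// In practice you would also want to fix a random seed for reproducibility.
func trainTestSplit(_ x: [Double], _ y: [Double], trainFraction: Double = 0.8)
    -> (xTrain: [Double], yTrain: [Double], xTest: [Double], yTest: [Double]) {
    let shuffledIndices = Array(0..<x.count).shuffled()
    let trainCount = Int(Double(x.count) * trainFraction)

    let trainIndices = shuffledIndices.prefix(trainCount)
    let testIndices = shuffledIndices.dropFirst(trainCount)

    return (trainIndices.map { x[$0] },
            trainIndices.map { y[$0] },
            testIndices.map { x[$0] },
            testIndices.map { y[$0] })
}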

You’ve learned the basic theory behind logistic regression, how it differs from linear regression, the role of the sigmoid function, and the concept of gradient descent for optimization. You’ve also explored how to evaluate your model using accuracy and a confusion matrix.

Subscribe for more data science in Swift tutorials and follow-along projects! If you can help me reach just 100 subscribers, I can start putting more time into these tutorials, and it’s a free way to support me!
