Simple Linear Regression from Scratch Using Kotlin

Yassin Hajaj
Jul 1 · 5 min read

In this tutorial, we’ll learn how to use Kotlin to train and test a simple linear regression model. Simple linear regression is one of the easiest models in machine learning, which makes it a great place to start.

This article doesn’t use any external library: the goal is to write everything from scratch to get a better understanding of the mechanics behind the scenes.

This article is partly inspired by this one.

Simple Linear Regression

Simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In other words, given the value of one variable, a simple linear regression model predicts, more or less accurately, the value of a second variable that depends on it.

There are many examples of how simple linear regression can be used:

  • Number of children in household -> Liters of milk consumed
  • Years of experience -> Salary
  • IQ -> Job Performance
  • etc.

Data

Data Set

Link to the dataset

Independent Variable & Dependent Variable

In the diagram below, the independent variable is x while the dependent one is y.

Simple Linear Regression

The goal of the exercise is to approximate the optimal values of β₀ & β₁ in the simple linear regression formula:

y = β₀ + β₁*x

In this formula, y is the dependent variable, x is the independent variable, β₀ is the constant (varying the position of our line on the y-axis) and β₁ is the coefficient of the independent variable (varying the slope of our line).
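To make the roles of the two coefficients concrete, here is a minimal sketch that evaluates the formula for a few inputs. The function name lineValue and the coefficient values are hypothetical, picked only for illustration; they are not fitted to the article's data set.

```kotlin
// Evaluates y = b0 + b1 * x for a single x.
// lineValue is a hypothetical helper, not part of the article's model class.
fun lineValue(b0: Double, b1: Double, x: Double): Double = b0 + b1 * x

fun main() {
    val b0 = 1.0 // intercept: where the line crosses the y-axis
    val b1 = 2.0 // slope: how much y changes when x grows by 1
    for (x in listOf(0.0, 1.0, 2.0)) {
        println("x = $x -> y = ${lineValue(b0, b1, x)}")
    }
    // prints:
    // x = 0.0 -> y = 1.0
    // x = 1.0 -> y = 3.0
    // x = 2.0 -> y = 5.0
}
```

Changing b0 shifts every output by the same amount, while changing b1 changes the gap between consecutive outputs.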

Build & Train

Read Files

import java.io.File

// Training set: each line of the file holds one "x,y" pair
val xTrain = mutableListOf<Double>()
val yTrain = mutableListOf<Double>()
val trainFileName = "train.csv"

File(trainFileName).forEachLine {
    val split = it.split(",")
    xTrain.add(split[0].toDouble())
    yTrain.add(split[1].toDouble())
}

// Test set: read the same way, kept apart to evaluate the model later
val xTest = mutableListOf<Double>()
val yTest = mutableListOf<Double>()
val testFileName = "test.csv"

File(testFileName).forEachLine {
    val split = it.split(",")
    xTest.add(split[0].toDouble())
    yTest.add(split[1].toDouble())
}
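The reading code above calls toDouble() on every field, which throws if a line is not made of two numbers. Whether the files contain a header row or incomplete lines is an assumption here, not something the article states. If they do, a more forgiving parser could look like the sketch below; parseRows is a hypothetical helper name.

```kotlin
// Hypothetical helper (not from the article): parses "x,y" rows and skips
// every row that is not exactly two numbers, e.g. a header or a blank line.
fun parseRows(lines: List<String>): Pair<List<Double>, List<Double>> {
    val xs = mutableListOf<Double>()
    val ys = mutableListOf<Double>()
    for (line in lines) {
        val parts = line.split(",").map { it.trim() }
        if (parts.size != 2) continue
        val x = parts[0].toDoubleOrNull() ?: continue
        val y = parts[1].toDoubleOrNull() ?: continue
        xs.add(x)
        ys.add(y)
    }
    return xs to ys
}

fun main() {
    // In-memory stand-in for a file; with a real file the lines could
    // come from java.io.File("train.csv").readLines().
    val sample = listOf("x,y", "1.0,2.0", "", "3.0,", "4.0,8.5")
    val (xs, ys) = parseRows(sample)
    println(xs) // [1.0, 4.0]
    println(ys) // [2.0, 8.5]
}
```

Silently skipping rows is a design choice: it keeps the example short, but on real data it may be worth counting the skipped lines so that a badly formatted file does not go unnoticed.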

Model

val model = SimpleLinearRegressionModel(independentVariables = xTrain, dependentVariables = yTrain)

I left out the code for SimpleLinearRegressionModel on purpose because we’ll discover it method by method, field by field. For now, we just need to understand that we’ve filled the two fields independentVariables & dependentVariables.

Mean X & Mean Y

private val meanX: Double = independentVariables.sum().div(independentVariables.count())
private val meanY: Double = dependentVariables.sum().div(dependentVariables.count())
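As a quick check of what these two fields compute, here is the same calculation on a tiny hypothetical list. The function name mean is an illustrative assumption; the standard library's average() gives the same result.

```kotlin
// Arithmetic mean: the sum of the values divided by how many there are.
fun mean(values: List<Double>): Double = values.sum() / values.size

fun main() {
    val xs = listOf(1.0, 2.0, 3.0, 4.0) // made-up values
    println(mean(xs))     // 2.5
    println(xs.average()) // 2.5, the standard-library equivalent
}
```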

Variance & Covariance

β₁ = covariance / variance

For us to get the value of β₁, we’ll have to calculate both of those.

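The variance is computed here as the sum of the squared differences between each independent value and their mean. Strictly speaking, that is the variance multiplied by the number of points, but since the covariance below skips the same division, the ratio β₁ is unaffected.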

private val variance: Double = independentVariables.sumOf { (it - meanX).pow(2) }
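The field above can be tried out on a small hypothetical list. The sketch below is a standalone version (the function name sumOfSquaredDeviations is an assumption) and also shows how the usual population variance relates to it.

```kotlin
import kotlin.math.pow

// Sum of the squared distances between each value and the mean.
fun sumOfSquaredDeviations(values: List<Double>): Double {
    val mean = values.average()
    return values.sumOf { (it - mean).pow(2) }
}

fun main() {
    val xs = listOf(1.0, 2.0, 3.0, 4.0) // made-up values, mean is 2.5
    val ss = sumOfSquaredDeviations(xs)
    println(ss)           // 5.0  (2.25 + 0.25 + 0.25 + 2.25)
    println(ss / xs.size) // 1.25, the population variance
}
```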

Calculating the covariance requires a bit more code but is still quite manageable. For each point of the graph, we multiply (x - meanX) by (y - meanY), and the covariance is the sum of those products.

Hope the code is easier to understand…

private fun covariance(): Double {
    var covariance = 0.0
    for (i in 0 until independentVariables.size) {
        val xPart = independentVariables[i] - meanX
        val yPart = dependentVariables[i] - meanY
        covariance += xPart * yPart
    }
    return covariance
}
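The same sum of products can be written without an index loop by pairing the two lists with zip. This is only an alternative sketch under the assumption that both lists have the same length; the function name sumOfCrossDeviations is hypothetical.

```kotlin
// Sum over all points of (x - meanX) * (y - meanY).
fun sumOfCrossDeviations(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size) { "xs and ys must have the same length" }
    val meanX = xs.average()
    val meanY = ys.average()
    return xs.zip(ys).sumOf { (x, y) -> (x - meanX) * (y - meanY) }
}

fun main() {
    val xs = listOf(1.0, 2.0, 3.0, 4.0)
    val ys = listOf(3.0, 5.0, 7.0, 9.0) // made-up points on y = 1 + 2x
    println(sumOfCrossDeviations(xs, ys)) // 10.0  (4.5 + 0.5 + 0.5 + 4.5)
}
```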

β₀ & β₁

As a reminder, their respective formulas are the following:

β₁ = covariance / variance

β₀ = meanY - (meanX * β₁)
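Written out in summation notation, the two formulas read as follows. The symbols are notational assumptions: n is the number of training points, x̄ stands for meanX and ȳ for meanY.

```latex
\[
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
\]
```

The numerator is what the covariance function returns and the denominator is the variance field.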

private val b1 = covariance().div(variance)
private val b0 = meanY - b1 * meanX
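Putting the pieces together, a compact standalone version of the fitting step might look like the sketch below. The class name SimpleLinearRegressionSketch is hypothetical and the code is not the article's SimpleLinearRegressionModel; it only follows the same formulas, on made-up points that lie exactly on y = 1 + 2x.

```kotlin
import kotlin.math.pow

// Hypothetical, condensed model: fits b0 and b1 when it is constructed.
class SimpleLinearRegressionSketch(xs: List<Double>, ys: List<Double>) {
    init {
        require(xs.isNotEmpty() && xs.size == ys.size) { "need matching, non-empty lists" }
    }

    private val meanX = xs.average()
    private val meanY = ys.average()

    // Both sums skip the division by n, which cancels out in the ratio.
    private val variance = xs.sumOf { (it - meanX).pow(2) }
    private val covariance = xs.zip(ys).sumOf { (x, y) -> (x - meanX) * (y - meanY) }

    val b1 = covariance / variance // slope
    val b0 = meanY - b1 * meanX    // intercept

    fun predict(x: Double) = b0 + b1 * x
}

fun main() {
    val model = SimpleLinearRegressionSketch(
        xs = listOf(1.0, 2.0, 3.0, 4.0),
        ys = listOf(3.0, 5.0, 7.0, 9.0)
    )
    println("b0 = ${model.b0}, b1 = ${model.b1}") // b0 = 1.0, b1 = 2.0
    println(model.predict(10.0))                  // 21.0
}
```

Note that if every x is identical the variance is 0 and b1 is not a usable number, so real code may want to reject that case explicitly.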

Test

We’ll also calculate the RMSE and the R² to evaluate the precision of our model.

fun test(xTest: List<Double>, yTest: List<Double>) {
    var errorSum = 0.0 // sum of squared errors, used for the RMSE
    var sst = 0.0      // total sum of squares, measured around meanY (the training mean)
    var ssr = 0.0      // sum of squared residuals (same quantity as errorSum)
    for (i in 0 until xTest.count()) {
        val x = xTest[i]
        val y = yTest[i]
        val yPred = predict(x)
        errorSum += (yPred - y).pow(2)
        sst += (y - meanY).pow(2)
        ssr += (y - yPred).pow(2)
    }
    println("RMSE = " + Math.sqrt(errorSum.div(xTest.size)))
    println("R² = " + (1 - (ssr / sst)))
}

// Predicts y for a single x using the fitted line
fun predict(independentVariable: Double) = b0 + b1 * independentVariable
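To see how the two metrics behave, here is a small standalone sketch on made-up values where only the last prediction is off by 2. The function names rmse and rSquared are hypothetical, and unlike the test method above, this version measures the total sum of squares around the mean of the actual values passed in, which is the more common convention.

```kotlin
import kotlin.math.pow
import kotlin.math.sqrt

// Root mean squared error: square root of the average squared error.
fun rmse(actual: List<Double>, predicted: List<Double>): Double =
    sqrt(actual.zip(predicted).sumOf { (y, p) -> (y - p).pow(2) } / actual.size)

// R² = 1 - (sum of squared residuals / total sum of squares).
fun rSquared(actual: List<Double>, predicted: List<Double>): Double {
    val mean = actual.average()
    val ssr = actual.zip(predicted).sumOf { (y, p) -> (y - p).pow(2) }
    val sst = actual.sumOf { (it - mean).pow(2) }
    return 1 - ssr / sst
}

fun main() {
    val actual = listOf(3.0, 5.0, 7.0, 9.0)
    val predicted = listOf(3.0, 5.0, 7.0, 11.0) // last prediction is off by 2
    println(rmse(actual, predicted))     // 1.0  (sqrt(4 / 4))
    println(rSquared(actual, predicted)) // 0.8  (1 - 4 / 20)
}
```

The RMSE is expressed in the same unit as y, so it only means something relative to the scale of the data, whereas R² is unitless.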

Now that we have everything set up, our model prints the following results for RMSE & R²:

RMSE = 3.07130626802983
R² = 0.9888226846629965

This is a great result for our model: the closer R² is to 1, the better, and an RMSE of 3.071 is more than OK in this case.


Conclusion

In the next articles, we’ll see how Multiple Linear Regression works and introduce the concept of Gradient Descent to minimize the errors of our model.
