GETTING STARTED | SIMPLE REGRESSION | KNIME ANALYTICS PLATFORM

Simple Linear Regression — KNIME

From data ingestion to model prediction and evaluation in just a few clicks

Zeynep Küçük
Low Code for Data Science

--


As first published on Woman Engineer

Linear regression is a statistical method that models the relationship between a scalar response and one or more explanatory variables. It looks at the data points and fits a trend line through them. Linear regression can be used to build a predictive model on apparently random data, revealing trends in the dataset, such as in cancer diagnoses or in stock prices.

Simple linear regression predicts a response from a single input feature: given a set of input data points (X) and responses (Y), it fits the line that minimizes the sum of squared distances between the observed responses and the values on the fitted line.

In simple terms, linear regression helps to find the relation between two variables and is a type of supervised algorithm.

A linear regression line has an equation of the form:

y = b0 + b1x + e

y = Dependent variable

x = Independent variable

The term b0 is the intercept, b1 is the slope of the regression line, x is the input variable, e is the error term, and y is the predicted value of the response variable.

The slope b1 tells us how much the output changes for a one-unit change in the input.
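Although KNIME handles the fitting for us, the textbook closed-form solution for b0 and b1 is short enough to sketch in plain Python. This is a minimal illustration of the formula above, not KNIME's implementation; it assumes x has nonzero variance.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = b0 + b1*x.

    b1 is the covariance of x and y divided by the variance of x,
    and the fitted line always passes through the point (x_bar, y_bar).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar  # intercept: line passes through the means
    return b0, b1
```

For a perfectly linear input such as y = 2x, the fit recovers b0 = 0 and b1 = 2 exactly.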

A No-Code Approach to Simple Regression with KNIME

KNIME, the Konstanz Information Miner, is an open-source data analytics, reporting, and integration platform. Let's work through an example using a simple regression tree in KNIME Analytics Platform.

Exploring the Dataset

The Iris flower data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class (Iris-setosa) is linearly separable from the other 2; whereas the latter are NOT linearly separable from each other.

The Iris Dataset we are using looks like this:

Attribute Information

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
— Iris Setosa
— Iris Versicolour
— Iris Virginica

The file is read into KNIME using a File Reader node. Usually, at this point we would apply a first pre-processing step to the data, but we don't need any pre-processing in this case because the Iris dataset is already clean. With the exception of the feature "class", our dataset contains only numeric features.

Train — Test Split

Our dataset is now in a form that can be used for training a simple regressor and testing it. The last thing we need to do is split the complete dataset into a train set and a test set. To do so, we use the Partitioning node. In its configuration, we specify that we want to split the data using "Stratified sampling", with 80% of the rows as our train data and the remainder as our test data.
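To make the Partitioning node's behavior concrete, here is what an 80/20 stratified split looks like as a pure-Python sketch. The column name, fraction, and seed are illustrative assumptions; stratifying on "class" keeps the three iris species in the same proportions in both partitions.

```python
import random

def stratified_split(rows, label_key, train_frac=0.8, seed=42):
    """Split a list of dict rows into train/test sets while
    preserving the class proportions of `label_key`.

    A sketch of stratified sampling; KNIME's Partitioning node
    does the equivalent internally.
    """
    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

On the 150-row Iris dataset this yields 40 train and 10 test rows per class, 120/30 overall.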

Training and Testing the Model

KNIME provides several options to perform simple regressions using the regression learners and regression predictor nodes. In this example, the Simple Regression Tree Learner node is used to learn a single regression tree, and the Simple Regression Tree Predictor node to apply the learnt model. We feed the Iris train data set coming from the Partitioning node to the learner node using “petal width” as the target column. The output of this node is a learnt Regression Tree model.

Next, we feed the output model and Iris test data set to the Simple Regression Tree Predictor node, which will return the predicted values for the target column.

You can see this part of the workflow in the figure below.
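Under the hood, a regression tree splits the input space and predicts the mean target value in each region. The toy "stump" below, a tree with a single split, sketches that idea in Python; it is not KNIME's actual algorithm (a real Simple Regression Tree splits recursively and handles many features), and all names here are illustrative.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree ('stump') by exhaustive search.

    Every midpoint between consecutive x values is tried as a
    threshold; the split minimizing the summed squared error around
    each side's mean wins. Returns a predict function.
    """
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best = None  # (error, threshold, left_mean, right_mean)
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for u, v in pairs if u <= thr]
        right = [v for u, v in pairs if u > thr]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, thr, sum(left) / len(left),
                    sum(right) / len(right))
    _, thr, left_mean, right_mean = best
    # Each leaf predicts the mean target value of its region.
    return lambda u: left_mean if u <= thr else right_mean
```

The learner node plays the role of `fit_stump`, and the returned predict function plays the role of the predictor node.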

Evaluating the Model

We are using the following metrics to evaluate the regression model:

  • R-Squared
  • Mean Absolute Error
  • Mean Squared Error
  • Root Mean Squared Error

All these metrics measure how much the predicted values deviate from the actual values. We can calculate them directly using the Numeric Scorer node, which computes statistics between the actual values of a numeric column and the predicted values.
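These four metrics have simple definitions, so we can sketch what the Numeric Scorer node computes in a few lines of Python. This follows the standard formulas; the function name and return format are our own.

```python
import math

def regression_metrics(actual, predicted):
    """Compute R-Squared, MAE, MSE, and RMSE from paired lists
    of actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e ** 2 for e in errors) / n          # mean squared error
    mean_a = sum(actual) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    # R^2: fraction of the target's variance explained by the model.
    r2 = 1 - sum(e ** 2 for e in errors) / ss_tot
    return {"R^2": r2, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}
```

Note that RMSE is just the square root of MSE, expressed in the same units as the target column.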

We see that our model has an R-Squared value of 0.883, which means the model explains 88.3% of the variance in the target column of our Iris dataset.

Line Plot

The final step is the visualization of the results. To do this, we use KNIME's Line Plot node, which plots the numeric columns of the input table as lines, to visualize the performance of the simple regression tree.

If we want to inspect the regression tree built by the Simple Regression Tree Learner node, we can easily do that by exploring the node's Regression Tree View.

Finally, we can inspect the predicted values for the test set output by the Simple Regression Tree Predictor in the form of a table.

Linear Regression Use Cases

The following are some common use cases for linear regression:

  • Given some demographic description of a person, predicting his or her monthly income.
  • Analyzing the effect of marketing, pricing, and promotions on the sales of a product.
  • Given some description of a house (bathrooms, condition, view, floors), predicting its cost.
  • Helping a company determine whether the funds it has invested in marketing a particular brand have produced a substantial return on investment.
  • Forecasting a company's sales in future months through a linear analysis of monthly sales data.
  • Analyzing the impact of rainfall totals on fruit harvest.
  • Analyzing the impact of product price on the number of sales.

Thanks for reading!
