Databricks: Setting up A Spark Dataframe for Linear Regression

Christopher Lewis
Published in Analytics Vidhya
7 min read · Apr 21, 2021


Before we begin this fun journey, a word of caution: the focus of this blog is NOT on cleaning the data and checking whether or not the assumptions of linear regression (briefly listed below) are met. Instead, the focus is on how to format the dataset so we can feed it into a linear regression model using PySpark!

The Assumptions of Linear Regression

Linear regression is an analysis that assesses whether one or more feature variables explain the target variable.

Linear regression has 5 key assumptions:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

If you’d like to know more about the above assumptions, you can visit this link for more information. Now that we’ve gotten that out of the way… let’s begin!

Databricks

In this blog, we are using Databricks as our platform to build our Linear Regression model through PySpark. If you have not used Databricks before, feel free to visit my blog here where I quickly walk you through the setup!

If you are already familiar with Databricks: log in, create a new cluster and a new notebook. Once you’ve done that, simply attach the cluster to the notebook:

Attaching cluster to notebook

Now we will load in a few libraries we need to access the data. The dataset we will be using in this blog will be the Boston Housing Prices dataset, which we can load from the sklearn.datasets library:

import pyspark
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
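
A quick note: load_boston worked when this blog was written, but it has since been deprecated (scikit-learn 1.0) and removed (1.2). If you are on a newer version, one rough sketch of an alternative is to pull the same data from OpenML; the dataset name, version, and column handling below are my assumptions, so verify them in your environment:

# Fetching the Boston housing data from OpenML instead of load_boston
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=True)
# Some columns may come back as categorical, so cast everything to float
df = boston.data.astype(float)
df['PRICE'] = boston.target.astype(float)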

Once we’ve run the above cell, we can access the dataset. We will load the data into a Pandas Dataframe, and then convert that dataframe into a Spark dataframe so we can use it with Spark’s MLlib library for Linear regression.

Step 1: Loading the Dataset into a Pandas Dataframe

# loading an instance of the boston dataset
boston = load_boston()
# Gathering the features of the dataset
feats = boston.data
# Putting the array of features into a pandas Dataframe
df = pd.DataFrame(feats)
# Naming the columns of the dataframe their respective feature names
df.columns = boston.feature_names
# Creating a new column in the Dataframe called 'PRICE' that holds
# the target values
df['PRICE'] = boston.target
# Viewing the df to make sure everything looks okay
df.head()
Viewing the first 5 rows of the Pandas Dataframe

Great, the dataframe looks good! Now we must convert this Pandas dataframe into a Spark dataframe.

Step 2: Converting the Pandas Dataframe into a Spark Dataframe

To convert the dataframe into a Spark dataframe, we just need to run one simple line:

# This line will create a spark dataframe from our pandas dataframe
spark_df = spark.createDataFrame(df)
# Viewing first 5 rows of the spark_df
spark_df.show(5)
Viewing the first 5 rows of the Spark Dataframe

If you compare the first 5 rows of the Pandas dataframe to the Spark dataframe, you’ll notice they are the same. So far, everything is going according to plan…
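
One quick, optional sanity check at this point is printSchema(), which confirms that every column came through as a numeric type:

# Confirming each column was inferred as a numeric (double) type
spark_df.printSchema()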

The Difference Between MLlib and Sklearn Linear Regression

In scikit-learn, each feature has its own column, and we don't need to apply any further transformations to the dataframe. In MLlib, however, we cannot keep an individual column for each feature: MLlib requires all of a row's feature values to be packed into a single array, and those arrays sit together in one column, conventionally called 'features'. Let's view an example of what I mean:

Notice the ‘features’ column on the end

The ‘features’ column contains arrays. Each of these arrays contains all the values of the features in a row. For example, the first row contains:

  • CRIM = 0.00632
  • ZN = 18.0
  • INDUS = 2.31
  • etc.

To make our Spark dataframe compatible with an MLlib linear regression model, we want to condense all the feature values within each row into an array that lives in a single column we will call 'features'. So how do we do that?

Enter: VectorAssembler

from pyspark.ml.feature import VectorAssembler

The tool we need for this job is PySpark's VectorAssembler. The VectorAssembler combines multiple columns into a single vector column: each row element of the newly generated column is a vector formed by concatenating that row's values from the specified input columns. We can apply it to our dataframe as follows:

# Creating an instance of the VectorAssembler
assembler = VectorAssembler(inputCols=['CRIM', 'ZN', 'INDUS', 'CHAS',
                                       'NOX', 'RM', 'AGE', 'DIS',
                                       'RAD', 'TAX', 'PTRATIO', 'B',
                                       'LSTAT'],
                            outputCol='features')

To use the VectorAssembler, we create an instance of it and assign it to a variable called 'assembler'. We pass in a list of strings, the column names of the features we want to include in the model, via the inputCols parameter.

Important note: Do not include any categorical columns whose values have not first been StringIndexed or OneHotEncoded (a minimal sketch of that step follows below).
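
The Boston dataset has no string-typed categorical columns, so we skip this step here, but as a minimal sketch, assuming a hypothetical categorical column called 'NEIGHBORHOOD', the indexing and encoding would look something like this:

from pyspark.ml.feature import StringIndexer, OneHotEncoder
# 'NEIGHBORHOOD' is a hypothetical column (not in the Boston dataset)
indexer = StringIndexer(inputCol='NEIGHBORHOOD', outputCol='NEIGHBORHOOD_idx')
indexed_df = indexer.fit(spark_df).transform(spark_df)
# One-hot encoding the indexed column into a sparse vector column
encoder = OneHotEncoder(inputCols=['NEIGHBORHOOD_idx'], outputCols=['NEIGHBORHOOD_vec'])
encoded_df = encoder.fit(indexed_df).transform(indexed_df)
# 'NEIGHBORHOOD_vec' would then be added to the VectorAssembler's inputCols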

Notice that we did not add our target variable into the list of inputCols. This VectorAssembler is designed only to take in the independent variables of our dataset, and then condense them down into a single column that contains an array of values. The outputCol parameter assigns a name to the condensed column that will be created — I recommend naming it ‘features’ (that is the default name many MLlib models use to identify feature columns). To transform our Spark dataframe:

# transforming our spark dataframe
df_out = assembler.transform(spark_df)
# Viewing the first 5 rows
df_out.show(5)

The only columns we need to train a linear regression model are the ‘features’ and ‘PRICE’ columns. Let’s create a new spark dataframe that contains only those columns:

# While we could simply select the two columns like this...
# clean_df = df_out.select(['features', 'PRICE'])
# ...I also want to rename the 'PRICE' column to 'label', so instead:
from pyspark.sql.functions import col
clean_df = df_out.select(['features', col('PRICE').alias('label')])
clean_df.show(5)
Nice and neat
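
If you prefer, an equivalent way to get the same result is withColumnRenamed():

# Same result: select both columns, then rename 'PRICE' to 'label'
clean_df = df_out.select(['features', 'PRICE']).withColumnRenamed('PRICE', 'label')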

Creating Train and Test sets

Our final step before fitting a model to the data: we must create train and test sets! In a nutshell, train and test sets are extremely important whenever you model data. If you happen to get an R-squared score of, say, 0.99 (extremely high) with no held-out data to validate that score, how would you know whether your model is overfitting? To put it simply, you wouldn't. If you'd like to know a little more about the importance of making train and test sets (and validation sets!), feel free to check out this article. Let's go ahead and create our train and test sets!

# Creating our train and test sets
train, test = clean_df.randomSplit([0.7, 0.3], seed=42)

Spark dataframes have a built-in method called randomSplit() that allows us to create train and test sets directly from a Spark dataframe. Above, we are saying "randomly split clean_df into 70% and 30% chunks to go into the train and test sets, respectively, and set the random seed to 42." The next step for us is to create and fit a linear regression model.
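
If you want to verify the split, counting the rows of each set is a cheap check. Note that the numbers will only be approximately 70/30: randomSplit() assigns each row independently rather than producing an exact cut.

# Row counts will be roughly, not exactly, 70/30
print(train.count(), test.count())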

Creating a Linear Regression Model

from pyspark.ml.regression import LinearRegression

The first step is to create an instance of the LinearRegression() model:

# creating an instance of a linear regression model
lr_model = LinearRegression(featuresCol='features', labelCol='label')

By default, the LinearRegression model assumes the names of featuresCol and labelCol are 'features' and 'label', respectively. As long as the features are correctly assembled into one column and the label column exists, it doesn't matter what their names are (e.g., if you kept your label column's name as 'PRICE', pass labelCol='PRICE'). Next, we fit the model to our train set:

# fitting the model to the train set
fit_model = lr_model.fit(train)

Now we can finally evaluate our model’s performance! We will create a new variable called ‘test_results’ that will hold the evaluation results of our test set. This is essentially a summary of the model’s performance on the test set:

test_results = fit_model.evaluate(test)

To access values, like R-squared, we can say:

test_results.r2
R-squared results of our test set
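
The evaluation summary exposes several other metrics besides R-squared, and you can generate per-row predictions on new data with transform():

# Other metrics available on the evaluation summary
print(test_results.rootMeanSquaredError)
print(test_results.meanAbsoluteError)
# Generating predictions (adds a 'prediction' column)
predictions = fit_model.transform(test)
predictions.show(5)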

Please remember that the purpose of this blog was to get our dataset formatted correctly so it may be used with MLlib’s linear regression. We did not inspect or clean any of the data while going through this blog, or check if any of the assumptions of linear regression had been violated. Thanks very much for reading! If you have any questions, feel free to ask!
