Linear Regression with PySpark

By Hiren Rupchandani and Abhinav Jangir

Photo by Scott Graham on Unsplash

In our previous article, we performed a basic EDA using PySpark. Now let’s try implementing a linear regression model and make some predictions.

You can find the corresponding notebook here.

Before we jump to linear regression, we also need to process our data and extract the relevant features for model building.

Data Post-Processing

  • We will use a VectorAssembler.
  • VectorAssembler, from the Spark ML library, combines numerical feature columns into a single vector column.
  • It takes a list of columns (features) and combines them into a single vector column (feature vector).
  • That vector column is then used as the input to the machine learning models in Spark ML.
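# Importing the assembler and listing the numerical columns to combine
from pyspark.ml.feature import VectorAssembler
featureassembler = VectorAssembler(inputCols=['Year', 'Present_Price', 'Kms_Driven', 'Owner'], outputCol='Independent')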
  • Using this assembler, we can transform the original dataset and take a look at the result:
output = featureassembler.transform(df)
  • This DataFrame can now be used for training models available in Spark ML by passing the Independent vector column as your input variable and Selling_Price as your target variable.
output.columns
# OUTPUT:
['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner', 'Independent']
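To make the assembling step concrete, here is a minimal plain-Python sketch of what happens to a single row. It does not use Spark at all: the `assemble` helper and the sample values are hypothetical, made up purely for illustration, and only the column names come from the article.

```python
# Plain-Python illustration of vector assembly (no Spark involved).
# `assemble` and the sample row are hypothetical, for illustration only.

def assemble(row, input_cols, output_col):
    """Return a copy of `row` with `input_cols` packed into one list under `output_col`."""
    assembled = dict(row)
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


# Made-up values; the column names follow the article.
sample_row = {"Year": 2014, "Present_Price": 5.59, "Kms_Driven": 27000, "Owner": 0}
input_cols = ["Year", "Present_Price", "Kms_Driven", "Owner"]

result = assemble(sample_row, input_cols, "Independent")
print(result["Independent"])
```

The original columns stay in place and the new column is simply added next to them, which matches the column listing above where Independent appears at the end.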

Feature Extraction

  • We will simply use the Independent vector column as our input and the Selling_Price column as the output for our model:
final_data = output.select("Independent", "Selling_Price")
Input and Output Vectors

Train Test Split

  • Finally, a classic train-test split to fit and evaluate our model:
# Train Test Split
train_data, test_data = final_data.randomSplit(weights=[0.75, 0.25], seed=42)
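As a rough mental model, a weighted random split assigns each row to one of the parts at random, so the 75/25 proportions hold only approximately, while a fixed seed makes the assignment repeatable. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper written for this example and is not Spark's actual implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part at random, in proportion to `weights`."""
    total = sum(weights)
    # Cumulative upper bounds of each part on [0, 1), e.g. [0.75, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed -> same split on every run
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last part also catches any floating-point rounding at the top end.
            if draw < upper or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


rows = list(range(300))  # stand-in for 300 DataFrame rows
train, test = random_split(rows, weights=[0.75, 0.25], seed=42)
train_again, test_again = random_split(rows, weights=[0.75, 0.25], seed=42)

# Sizes land near 225 and 75; the exact counts depend on the seed.
print(len(train), len(test))
print(train == train_again and test == test_again)
```

Because the split sizes are not exact, it is worth checking the row counts of the train and test sets before reading too much into the metrics.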

Linear Regression with PySpark

  • And we are finally here, the moment you have been waiting for.

Model Initialization and Training

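  • We will use the LinearRegression class from the pyspark.ml.regression module to initialize a baseline linear regression model:
# Initializing a Linear Regression model
from pyspark.ml.regression import LinearRegression
ss = LinearRegression(featuresCol='Independent', labelCol='Selling_Price')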
  • Let’s train the model:
# Training the model
ss = ss.fit(train_data)
  • Checking for the coefficient values:
ss.coefficients
# OUTPUT:
DenseVector([0.379, 0.5261, -0.0, -1.0682])
  • Checking for the intercept:
ss.intercept
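To see how these numbers turn into a prediction, recall that a linear regression model multiplies each feature by its coefficient, sums the results, and adds the intercept. The sketch below does this in plain Python using the coefficients printed above. The intercept value is not shown in the text, so the `intercept` used here is a hypothetical placeholder, and the feature values are made up as well.

```python
# Coefficients as printed above, in the order of the assembler's inputCols:
# Year, Present_Price, Kms_Driven, Owner.
coefficients = [0.379, 0.5261, -0.0, -1.0682]

# ASSUMPTION: the article's intercept value is not shown, so this number is a
# hypothetical placeholder chosen only to make the example runnable.
intercept = -760.0

# Hypothetical feature values for one car (made up for illustration).
features = [2014.0, 5.59, 27000.0, 0.0]


def predict(features, coefficients, intercept):
    """Linear regression prediction: dot(coefficients, features) + intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


prediction = predict(features, coefficients, intercept)
print(round(prediction, 4))
```

Note that the Kms_Driven coefficient is printed as -0.0, which presumably means a very small negative number after rounding rather than an exact zero.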

Model Evaluation on Test Set

  • Let’s see what the predictions look like:
pred = ss.evaluate(test_data)
pred.predictions.show()  # displaying the predictions DataFrame of the evaluation summary
Test Predictions
  • Let’s see a plot for actual vs predicted values for the test set:
Actual vs Predicted values
  • Let’s check for scores:
  1. MAE:
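# Printing MAE
# Assumption: pred_train is not defined in the text; it is taken to be the train-set evaluation summary
pred_train = ss.evaluate(train_data)
print('MAE for train set:', pred_train.meanAbsoluteError)
print('MAE for test set:', pred.meanAbsoluteError)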
MAE for train set: 1.2527194991798931
MAE for test set: 1.3513549412893398

2. MSE:

# Printing MSE
print('MSE for train set:', pred_train.meanSquaredError)
print('MSE for test set:', pred.meanSquaredError)
MSE for train set: 3.8817538005487515
MSE for test set: 4.083863351293766

3. RMSE:

# Printing RMSE
print('RMSE for train set:', pred_train.rootMeanSquaredError)
print('RMSE for test set:', pred.rootMeanSquaredError)
RMSE for train set: 1.9702166887296308
RMSE for test set: 2.020857083342057

4. R2 Score:

# Printing the R2 Score
print('R2-Score for train set:', pred_train.r2)
print('R2-Score for test set:', pred.r2)
R2-Score for train set: 0.8515799987327323
R2-Score for test set: 0.8308412358811239
  • With a test-set R2 of about 0.83 and a test-set RMSE of about 2.02, both close to the train-set values, we can see decent scores and fits across all the performance metrics, indicating that we indeed have a good baseline model!
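If you want to see what sits behind these four scores, here is a small plain-Python sketch of the standard definitions of MAE, MSE, RMSE, and R2. The `regression_metrics` helper and the toy values are hypothetical and are not taken from the car dataset; the last lines only check that the RMSE and MSE values reported above are consistent with each other.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up toy values, not taken from the car dataset.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Consistency check on the reported test-set numbers: RMSE is the square root of MSE.
reported_test_mse = 4.083863351293766
reported_test_rmse = 2.020857083342057
rmse_matches = abs(math.sqrt(reported_test_mse) - reported_test_rmse) < 1e-9
print(rmse_matches)
```

MAE and RMSE are in the same units as Selling_Price, so they are the easiest ones to interpret directly, while R2 tells us what share of the variance in the price the model explains.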


  • We have finally performed EDA on the car data, and extracted some important insights that can be useful for model building.
  • We used VectorAssembler for preparing our data for the machine learning model.
  • This was followed by training and evaluating a linear regression model, which showed a good fit given the current constraints of the data.

Final Thoughts and Closing Comments

There are some vital points many people miss while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, because it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).


