Linear Regression with PySpark
By Hiren Rupchandani and Abhinav Jangir
In our previous article, we performed a basic EDA using PySpark. Now let's implement a linear regression model and make some predictions.
You can find the corresponding notebook here.
Before we jump to linear regression, we also need to process our data and extract the relevant features for model building.
Data Post-Processing
- We will use a VectorAssembler.
- VectorAssembler from the Spark ML library is a feature transformer that combines numerical features into a single vector that machine learning models can consume.
- It takes a list of columns (features) and combines them into a single vector column (feature vector).
- This vector column is then used as the input to the machine learning models in Spark ML.
from pyspark.ml.feature import VectorAssembler
featureassembler = VectorAssembler(inputCols=['Year', 'Present_Price', 'Kms_Driven', 'Owner'], outputCol='Independent')
- Using this assembler, we can transform the original dataset and take a look at the result:
output = featureassembler.transform(df)
output.show()  # displays the DataFrame with the new Independent vector column appended

- This DataFrame can now be used for training models available in Spark ML by passing the Independent vector column as your input variable and Selling_Price as your target variable.
output.columns
# OUTPUT:
['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner', 'Independent']
Feature Extraction
- We will simply use the Independent vector as our input and the Selling_Price feature as the output for our model:
final_data = output.select("Independent", "Selling_Price")
final_data.show()

Train Test Split
- Finally, a classic train-test split to fit and evaluate our model:
# Train Test Split
train_data, test_data = final_data.randomSplit(weights=[0.75, 0.25], seed=42)
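- If you want to sanity-check the split, you can count the rows in each part. Here is a minimal sketch (note that randomSplit produces an approximate split, so the exact counts can vary slightly from 75/25):
# Sanity-checking the split sizes (randomSplit is approximate, so counts may vary)
print('Train rows:', train_data.count())
print('Test rows:', test_data.count())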
Linear Regression with PySpark
- And we are finally here, the moment you have been waiting for.
Model Initialization and Training
- We will use the pyspark.ml.regression library to initialize a baseline linear regression model:
# Initializing a Linear Regression model
from pyspark.ml.regression import LinearRegression
ss = LinearRegression(featuresCol='Independent', labelCol='Selling_Price')
- Let’s train the model:
# Training the model
ss = ss.fit(train_data)
- Checking for the coefficient values:
ss.coefficients
# OUTPUT:
DenseVector([0.379, 0.5261, -0.0, -1.0682])
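- The coefficients are returned in the same order as the assembler's input columns. To label each one, you can zip them together; this is a minimal sketch assuming the featureassembler and the fitted model ss from above:
# Pairing each input feature with its learned coefficient
for name, coef in zip(featureassembler.getInputCols(), ss.coefficients):
    print(name, '->', round(float(coef), 4))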
- Checking for the intercept:
ss.intercept
# OUTPUT:
-762.0797553301906
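- Putting the coefficients and intercept together, the fitted line is approximately Selling_Price ≈ 0.379*Year + 0.526*Present_Price - 0.0*Kms_Driven - 1.068*Owner - 762.08 (the Kms_Driven coefficient rounds to -0.0, so it contributes almost nothing on this scale).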
Model Evaluation on Test Set
- Let’s see what the predictions look like:
pred = ss.evaluate(test_data)
pred.predictions.show()  # displays the Independent features, the actual Selling_Price, and the prediction column

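- The evaluate() call above works when the labels are available. To score rows without labels, you can call transform() on the fitted model instead; here is a minimal sketch that reuses the test features:
# Scoring assembled features with the fitted model; transform() appends a 'prediction' column
unlabelled = test_data.select('Independent')
ss.transform(unlabelled).show(5)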
- Let's see a plot of actual vs predicted values for the test set:

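- The notebook shows this as a scatter plot. A minimal sketch of how such a plot can be produced, assuming matplotlib and pandas are available in the driver environment:
# Actual vs predicted scatter plot for the test set
import matplotlib.pyplot as plt
pdf = pred.predictions.select('Selling_Price', 'prediction').toPandas()
plt.scatter(pdf['Selling_Price'], pdf['prediction'])
plt.xlabel('Actual Selling_Price')
plt.ylabel('Predicted Selling_Price')
plt.title('Actual vs Predicted (test set)')
plt.show()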
- Let's check the scores:
1. MAE:
# Evaluating on the train set as well, so that train and test metrics can be compared
pred_train = ss.evaluate(train_data)

# Printing MAE
print('MAE for train set:', pred_train.meanAbsoluteError)
print('MAE for test set:', pred.meanAbsoluteError)
# OUTPUT:
MAE for train set: 1.2527194991798931
MAE for test set: 1.3513549412893398
2. MSE:
# Printing MSE
print('MSE for train set:', pred_train.meanSquaredError)
print('MSE for test set:', pred.meanSquaredError)
# OUTPUT:
MSE for train set: 3.8817538005487515
MSE for test set: 4.083863351293766
3. RMSE:
# Printing RMSE
print('RMSE for train set:', pred_train.rootMeanSquaredError)
print('RMSE for test set:', pred.rootMeanSquaredError)
# OUTPUT:
RMSE for train set: 1.9702166887296308
RMSE for test set: 2.020857083342057
4. R2 Score:
# Printing the R2 Score
print('R2-Score for train set:', pred_train.r2)
print('R2-Score for test set:', pred.r2)
# OUTPUT:
R2-Score for train set: 0.8515799987327323
R2-Score for test set: 0.8308412358811239
- We can see decent scores across all the performance metrics, and the train and test results are close, indicating that we indeed have a good baseline model!
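- If you prefer Spark ML's generic evaluator API, the same metrics can be computed directly from the prediction DataFrame; a minimal sketch using RegressionEvaluator:
# Computing the test-set metrics with RegressionEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol='Selling_Price', predictionCol='prediction')
for metric in ['mae', 'mse', 'rmse', 'r2']:
    print(metric, 'for test set:', evaluator.setMetricName(metric).evaluate(pred.predictions))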
Conclusion
- We have now performed EDA on the car data and extracted some important insights that are useful for model building.
- We used VectorAssembler to prepare our data for the machine learning model.
- This was followed by training and evaluating a linear regression model, which showed a good fit given the current constraints of the data.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, as it covers your foundations, machine learning algorithms, and deep neural networks (basic to advanced).