# Linear Regression with PySpark

By **Hiren Rupchandani** and **Abhinav Jangir**

In our **previous article**, we performed a basic **EDA** using **PySpark**. Now let’s implement a linear regression model and make some predictions.

You can find the corresponding notebook here.

Before we jump to linear regression, we also need to process our data and extract the relevant features for model building.

# Data Preprocessing

- We will use a **VectorAssembler**.
- VectorAssembler from the **Spark ML library** is a module that converts **numerical features into a single vector** that is used by the machine learning models.
- It takes a **list of columns** (features) and **combines them** into a single **vector column** (feature vector).
- This vector column is then used as an **input** to the **machine learning models** in Spark ML.

```python
from pyspark.ml.feature import VectorAssembler

featureassembler = VectorAssembler(
    inputCols=['Year', 'Present_Price', 'Kms_Driven', 'Owner'],
    outputCol='Independent'
)
```

- Using this assembler, we can transform the original dataset and take a look at the result:

```python
output = featureassembler.transform(df)
output.show()
```

- This DataFrame can now be used for training models available in Spark ML by passing the `Independent` vector column as your **input variable** and `Selling_Price` as your **target variable**.

```python
output.columns

# OUTPUT:
# ['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
#  'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner', 'Independent']
```

# Feature Extraction

- We will simply use the **Independent** vector as our **input** and the **Selling_Price** feature as the **output** for our model:

```python
final_data = output.select("Independent", "Selling_Price")
final_data.show()
```

# Train Test Split

- Finally, a classic train-test split to fit and evaluate our model:

```python
# Train Test Split
train_data, test_data = final_data.randomSplit(weights=[0.75, 0.25], seed=42)
```

# Linear Regression with PySpark

- And we are finally here, the moment you have been waiting for.

## Model Initialization and Training

- We will use the `pyspark.ml.regression` library to initialize a baseline linear regression model:

```python
# Initializing a Linear Regression model
from pyspark.ml.regression import LinearRegression

ss = LinearRegression(featuresCol='Independent', labelCol='Selling_Price')
```

- Let’s train the model:

```python
# Training the model
ss = ss.fit(train_data)
```

- Checking for the coefficient values:

```python
ss.coefficients

# OUTPUT:
# DenseVector([0.379, 0.5261, -0.0, -1.0682])
```

- Checking for the intercept:

```python
ss.intercept

# OUTPUT:
# -762.0797553301906
```
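A linear regression prediction is just the dot product of the coefficients with the feature vector, plus the intercept. A quick sketch with the fitted values printed above (the input row below is a hypothetical car, chosen only for illustration):

```python
# y_hat = dot(coefficients, features) + intercept
coefficients = [0.379, 0.5261, -0.0, -1.0682]  # [Year, Present_Price, Kms_Driven, Owner]
intercept = -762.0797553301906

# Hypothetical feature vector: a 2014 car, present price 7.5,
# 30,000 km driven, 0 previous owners (illustrative values)
features = [2014, 7.5, 30000, 0]

y_hat = sum(w * x for w, x in zip(coefficients, features)) + intercept
print(round(y_hat, 4))  # → 5.172
```

Note how the near-zero coefficient on `Kms_Driven` means that feature contributes almost nothing to the prediction at this scale.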

## Model Evaluation on Test Set

- Let’s see what the predictions look like:

```python
pred = ss.evaluate(test_data)
pred_train = ss.evaluate(train_data)  # train-set summary, used for comparison below
pred.predictions.show()
```

- Let’s see a **plot** of **actual vs predicted values** for the test set:
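The original plot image is not reproduced here; a sketch of how such a plot could be drawn with matplotlib follows. The `actual` and `predicted` lists below are illustrative stand-ins; in practice they would be collected from the predictions DataFrame as hinted in the comments:

```python
# In practice, the values would come from the predictions DataFrame, e.g.:
#   rows = pred.predictions.select("Selling_Price", "prediction").collect()
#   actual = [r["Selling_Price"] for r in rows]
#   predicted = [r["prediction"] for r in rows]
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Illustrative values standing in for the collected columns
actual = [3.35, 4.75, 7.25, 2.85, 4.60]
predicted = [3.10, 5.05, 6.90, 3.20, 4.40]

fig, ax = plt.subplots()
ax.scatter(actual, predicted)
lims = [min(actual + predicted), max(actual + predicted)]
ax.plot(lims, lims, linestyle="--")  # perfect-fit reference line
ax.set_xlabel("Actual Selling_Price")
ax.set_ylabel("Predicted Selling_Price")
ax.set_title("Actual vs Predicted (test set)")
fig.savefig("actual_vs_predicted.png")
```

Points hugging the dashed diagonal indicate predictions close to the true values.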

- Let’s check the scores:

1. MAE:

```python
# Printing MAE
print('MAE for train set:', pred_train.meanAbsoluteError)
print('MAE for test set:', pred.meanAbsoluteError)

# OUTPUT:
# MAE for train set: 1.2527194991798931
# MAE for test set: 1.3513549412893398
```

2. MSE:

```python
# Printing MSE
print('MSE for train set:', pred_train.meanSquaredError)
print('MSE for test set:', pred.meanSquaredError)

# OUTPUT:
# MSE for train set: 3.8817538005487515
# MSE for test set: 4.083863351293766
```

3. RMSE:

```python
# Printing RMSE
print('RMSE for train set:', pred_train.rootMeanSquaredError)
print('RMSE for test set:', pred.rootMeanSquaredError)

# OUTPUT:
# RMSE for train set: 1.9702166887296308
# RMSE for test set: 2.020857083342057
```

4. R2 Score:

```python
# Printing the R2 Score
print('R2-Score for train set:', pred_train.r2)
print('R2-Score for test set:', pred.r2)

# OUTPUT:
# R2-Score for train set: 0.8515799987327323
# R2-Score for test set: 0.8308412358811239
```

- The scores and fits are decent across all the performance metrics, and the train and test scores are close, indicating that we indeed have a good baseline model!
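These four metrics are all simple functions of the residuals, so their relationships are easy to verify by hand. A plain-Python sketch (the toy values below are illustrative, not the model’s actual residuals):

```python
import math

# Toy actual/predicted values (illustrative, not from the model above)
actual = [3.0, 5.0, 7.0, 4.0]
predicted = [2.5, 5.5, 6.0, 4.5]
n = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / n          # mean absolute error
mse = sum(e * e for e in errors) / n           # mean squared error
rmse = math.sqrt(mse)                          # RMSE is just sqrt(MSE)

mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)            # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot                       # fraction of variance explained

print(mae, mse, rmse, r2)
# MAE = 0.625, MSE = 0.4375, RMSE = sqrt(0.4375) ≈ 0.6614, R² = 0.8
```

This also explains why the article’s RMSE values are exactly the square roots of its MSE values.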

# Conclusion

- We have performed **EDA on the car data** and **extracted** some important insights that are **useful for model building**.
- We used **VectorAssembler** to **prepare our data** for the machine learning model.
- This was followed by **linear regression** training and evaluation, which showed **a good fit** of the model within the current constraints of the data.

# Final Thoughts and Closing Comments

There are **some vital points** many **people fail to understand** while pursuing their **Data Science** or **AI journey**. If you are one of them and are looking for a way to **counterbalance** these **cons**, check out the certification programs provided by **INSAID** on their website. If you liked this story, I recommend the **Global Certificate in Data Science & AI**, because it covers your foundations, machine learning algorithms, and deep neural networks (basic to advanced).