Logistic Regression with PySpark in 11 steps

Sanjjushri Varshini R
Published in featurepreneur · Feb 18, 2022

In the end, what’s any good reader really hoping for? That spark. That spell. That journey. — Victor LaValle

We will see how to build a Logistic Regression model using PySpark.

1. Install the required dependency:

pip install pyspark

2. Import the necessary Packages:

from pyspark.sql import SparkSession
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

3. Create a SparkSession with an app name:

spark = SparkSession.builder.appName("churn").getOrCreate()

appName sets the name of the application, which is displayed in the Spark UI.

getOrCreate() returns an existing SparkSession if one is already running; otherwise it creates a new one.

4. Read the Data set and print the Schema:

dataset = spark.read.csv("../datasets/customer_churn.csv", inferSchema=True, header=True)
dataset.printSchema()

inferSchema → automatically detects whether each column holds strings, integers, or doubles.

printSchema() → prints the column names and data types of the dataset.
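To get a feel for what schema inference does, here is a tiny plain-Python sketch (not Spark's actual implementation) that guesses a column's type from sample string values, trying the narrowest type first:

```python
def infer_type(values):
    """Guess a column type the way inferSchema conceptually works:
    try integer first, then double, and fall back to string."""
    for caster, name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                caster(v)  # every value must parse for the type to match
            return name
        except ValueError:
            continue
    return "string"

print(infer_type(["1", "2", "3"]))   # integer
print(infer_type(["1.5", "2.0"]))    # double
print(infer_type(["yes", "no"]))     # string
```

Spark's real inference scans the CSV and applies a far more complete version of this idea per column.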

5. Using VectorAssembler:

assembler = VectorAssembler(
    inputCols=['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage',
               'CustServCalls', 'DayMins', 'DayCalls', 'MonthlyCharge',
               'OverageFee'],
    outputCol='features')

VectorAssembler is a transformer that combines the given list of input columns into a single vector column.
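Conceptually, each row keeps its original columns and gains one extra column holding the packed feature vector. A minimal plain-Python sketch of the idea (not the Spark API):

```python
def assemble(row, input_cols, output_col="features"):
    """Mimic what VectorAssembler does to one row: copy it and add
    a single vector (here, a tuple) built from the input columns."""
    out = dict(row)
    out[output_col] = tuple(row[c] for c in input_cols)
    return out

row = {"AccountWeeks": 128, "DataUsage": 2.7, "CustServCalls": 1, "churn": 0}
print(assemble(row, ["AccountWeeks", "DataUsage", "CustServCalls"]))
# the row keeps 'churn' and gains a 'features' tuple (128, 2.7, 1)
```

Spark does this for every row at once and stores the result as an ML vector type rather than a tuple.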

6. Transform the dataset:

output = assembler.transform(dataset)

transform() returns a new DataFrame with the assembled vector appended as an extra 'features' column.

7. Selecting the input and output columns:

finalised_data = output.select('features', 'churn')
finalised_data.show()

Select the transformed input column and target column that should be predicted.

8. Splitting the data:

train, test = finalised_data.randomSplit([0.7, 0.3])

Split the data set such that the training set has 70% of data and the testing set has 30% of the total data.
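Note that randomSplit assigns each row to a split independently with the given probabilities, so the 70/30 proportions are approximate rather than exact. A plain-Python sketch of that idea (illustrative, not Spark's implementation):

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row to a bucket with probability proportional to
    its weight, like DataFrame.randomSplit does per row."""
    total = sum(weights)
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        r = rng.random() * total
        cum = 0.0
        for i, w in enumerate(weights):
            cum += w
            if r < cum:
                buckets[i].append(row)
                break
    return buckets

train, test = random_split(list(range(1000)), [0.7, 0.3])
print(len(train), len(test))  # roughly 700 / 300, not exactly
```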

9. Fit the model:

lr = LogisticRegression(labelCol="churn")
lrn = lr.fit(train)

Create the Logistic Regression estimator and fit it to the training data. Fitting is simply training.
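Under the hood, logistic regression models P(churn = 1) by passing a weighted sum of the features through the sigmoid function; training learns the weights. A minimal sketch of the prediction step, with made-up illustrative weights rather than ones learned from this dataset:

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], intercept=0.1)
print(round(p, 3))  # ≈ 0.525
```

Spark's fit() finds the weights and intercept that best explain the training labels; the model then thresholds this probability (at 0.5 by default) to produce the prediction column.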

10. Predict:

lrn_summary = lrn.summary
lrn_summary.predictions.show()

The model summary holds the predictions made on the training data.

The final output:

lrn_summary.predictions.describe().show()

11. Evaluate:

pred_labels = lrn.evaluate(test)
pred_labels.predictions.show()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="churn")
auc = evaluator.evaluate(pred_labels.predictions)
print(auc)

Evaluate the model on the test set using AUC (Area Under the ROC Curve).
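AUC can be read as the probability that the model scores a randomly chosen positive example above a randomly chosen negative one: 0.5 means random guessing, 1.0 means perfect ranking. A small plain-Python sketch of that pairwise definition (illustrative, not Spark's implementation):

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive
    example gets the higher score; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```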
