Role of StringIndexer and Pipelines in PySpark ML Feature

Nov 4, 2020

What is PySpark ML?

PySpark ML (the pyspark.ml package) provides DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.

What is StringIndexer?

class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc')

StringIndexer encodes a string column of labels into a column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in the range [0, numLabels).

By default, labels are ordered by frequency, so the most frequent label gets index 0. The ordering is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'. When two labels have equal frequency under 'frequencyDesc' or 'frequencyAsc', they are further sorted alphabetically.

Four ordering options are supported:

1. "frequencyDesc": descending order by label frequency (most frequent label assigned 0)

2. "frequencyAsc": ascending order by label frequency (least frequent label assigned 0)

3. "alphabetDesc": descending alphabetical order

4. "alphabetAsc": ascending alphabetical order

The default is "frequencyDesc".
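The four options can be illustrated without Spark. The function below is a plain-Python sketch of the ordering rules as described above, not Spark's implementation. The name index_labels is hypothetical, and breaking frequency ties in ascending alphabetical order is an assumption based on the "sorted alphabetically" wording.

```python
from collections import Counter


def index_labels(values, string_order_type="frequencyDesc"):
    """Return a label -> index dict following the described ordering rules.

    Hypothetical helper for illustration; it only mimics the documented
    behavior of the stringOrderType parameter.
    """
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif string_order_type == "frequencyAsc":
        # Least frequent first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported stringOrderType: {string_order_type}")
    return {label: index for index, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]
for option in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    print(option, index_labels(labels, option))
# frequencyDesc {'a': 0, 'c': 1, 'b': 2}
# frequencyAsc {'b': 0, 'c': 1, 'a': 2}
# alphabetDesc {'c': 0, 'b': 1, 'a': 2}
# alphabetAsc {'a': 0, 'b': 1, 'c': 2}
```

In PySpark itself, the option is selected by passing stringOrderType to the StringIndexer constructor, for example stringOrderType="alphabetAsc".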

Let us see an example

Create SparkSession

Written by Nutan