Role of StringIndexer and Pipelines in PySpark ML Feature

Nutan
4 min read · Nov 4, 2020

What is PySpark ML?

PySpark ML (the pyspark.ml package) provides DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.

What is StringIndexer?

class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc')

StringIndexer encodes a string column of labels to a column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels).

By default, the indices are ordered by label frequency, so the most frequent label gets index 0. The ordering behavior is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'. When two labels have equal frequency under 'frequencyDesc' or 'frequencyAsc', the strings are further sorted alphabetically.

Four ordering options are supported (the default is "frequencyDesc"):

1. "frequencyDesc": descending order by label frequency (most frequent label assigned 0)

2. "frequencyAsc": ascending order by label frequency (least frequent label assigned 0)

3. "alphabetDesc": descending alphabetical order

4. "alphabetAsc": ascending alphabetical order
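To make the four orderings concrete, here is a small pure-Python sketch of the index assignment. It is only an illustration, not Spark's implementation: the helper name index_labels is hypothetical, and breaking frequency ties in ascending alphabetical order is an assumption based on the description above.

```python
from collections import Counter


def index_labels(values, string_order_type="frequencyDesc"):
    """Hypothetical helper: map each label to an index in [0, numLabels)."""
    # Numeric inputs are cast to string first, as StringIndexer does.
    counts = Counter(str(v) for v in values)

    if string_order_type == "frequencyDesc":
        # Most frequent first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
    elif string_order_type == "frequencyAsc":
        # Least frequent first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda s: (counts[s], s))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"Unsupported stringOrderType: {string_order_type}")

    return {label: index for index, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]
for order in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    print(order, index_labels(labels, order))
```

With these six labels, "a" appears three times, "c" twice and "b" once, so "frequencyDesc" maps a to 0, c to 1 and b to 2, while "frequencyAsc" reverses that order.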

Let us see an example.

Create SparkSession
