What is PySpark ML?
PySpark ML is a set of DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.
What is StringIndexer?
class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc')
StringIndexer encodes a string column of labels to a column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels).
By default, the indices are ordered by label frequency, so the most frequent label gets index 0. The ordering is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'. When two labels have the same frequency under frequencyDesc or frequencyAsc, the tie is broken by sorting the strings alphabetically.
Four ordering options are supported:
1 "frequencyDesc": descending order by label frequency (most frequent label assigned 0). This is the default.
2 "frequencyAsc": ascending order by label frequency (least frequent label assigned 0)
3 "alphabetDesc": descending alphabetical order
4 "alphabetAsc": ascending alphabetical order
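The four ordering rules can be illustrated without Spark. The following is a plain-Python sketch of the rules as described above, not Spark's implementation: the function name `build_label_index` is hypothetical, ties under the two frequency orderings are assumed to be broken in ascending alphabetical order, and Python's string comparison and `str()` conversion are assumed to be close enough to Spark's for simple ASCII labels.

```python
from collections import Counter


def build_label_index(values, string_order_type="frequencyDesc"):
    """Return a {label: index} mapping that mimics the four ordering options."""
    # Non-string inputs are converted to strings before counting.
    counts = Counter(str(v) for v in values)

    if string_order_type == "frequencyDesc":
        # Most frequent first; equal counts fall back to alphabetical order.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif string_order_type == "frequencyAsc":
        # Least frequent first; equal counts fall back to alphabetical order.
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported ordering: {string_order_type!r}")

    # Indices run from 0 to numLabels - 1.
    return {label: index for index, label in enumerate(ordered)}


values = ["a", "b", "c", "a", "a", "c"]
freq_desc = build_label_index(values)                   # {'a': 0, 'c': 1, 'b': 2}
freq_asc = build_label_index(values, "frequencyAsc")    # {'b': 0, 'c': 1, 'a': 2}
alpha_desc = build_label_index(values, "alphabetDesc")  # {'c': 0, 'b': 1, 'a': 2}
alpha_asc = build_label_index(values, "alphabetAsc")    # {'a': 0, 'b': 1, 'c': 2}
```

Note that the alphabetical orderings ignore frequency entirely, which makes the resulting indices stable when label counts shift between datasets.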