Role of OneHotEncoder and Pipelines in PySpark ML Feature — Part 2

Nutan
4 min read · Nov 6, 2020

Part 1 — What is StringIndexer?

We have already discussed StringIndexer in Part 1 (link).

What is OneHotEncoder?

class pyspark.ml.feature.OneHotEncoder(inputCols=None, outputCols=None, handleInvalid='error', dropLast=True, inputCol=None, outputCol=None)

One-hot encoding is a technique for converting a categorical attribute into a binary vector.

OneHotEncoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.

For example, with 5 categories, an input value of 2.0 maps to the output vector [0.0, 0.0, 1.0, 0.0]. The vector has four entries rather than five because the last category is dropped by default.

The last category is not included by default (configurable via dropLast), because including it would make the vector entries always sum to one and hence be linearly dependent. So an input value of 4.0 maps to the all-zero vector [0.0, 0.0, 0.0, 0.0].
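
The mapping described above can be sketched in plain Python. This is only an illustration of the rule, not Spark's implementation: the function name `one_hot` and its parameters are hypothetical, and it returns a dense list where Spark returns a vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Illustrative one-hot mapping (hypothetical helper, not part of PySpark)."""
    position = int(index)
    if not 0 <= position < num_categories:
        raise ValueError(f"index {index} is outside 0..{num_categories - 1}")

    # With drop_last=True the vector has one entry fewer than the number
    # of categories, so the last category becomes the all-zero vector.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if position < size:
        vector[position] = 1.0
    return vector


print(one_hot(2.0, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4.0, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

With `drop_last=False` every category gets its own position, which is the case where the entries always sum to one.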

Let us look at an example.

Create SparkSession

# Import SparkSession
from pyspark.sql import SparkSession

SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, we need to…
