One transformation we can't model without in PySpark: VectorAssembler

Dawar Rohan
4 min read · Nov 2, 2022



Hello folks!

Databricks provides a unified, open platform for all your data. It gives data scientists, data engineers and data analysts a simple, collaborative environment for running interactive and scheduled data-analysis workloads, lets you spin up your very own workspace for experimentation, and supports analysis in a handful of programming languages.

In this post we will look at one transformation we can't live without: VectorAssembler. First, let me try to answer the question of why it is needed.

VectorAssembler exists for efficiency and scaling. It expresses the features compactly using Spark's vector types, which enables better data handling and more efficient memory management. This helps the modelling algorithms run efficiently even over large numbers of feature columns.

In other words, VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models. VectorAssembler accepts the following input column types:

  • All numeric types.
  • Boolean type.
  • Vector type.

You can set up your own Databricks edition for experimentation; check out the details on how to do it in my previous post.

Let's check out the basic case first.

  1. Start by importing the libraries
Importing Libraries

2. Let’s create a dummy dataset

Creating the dataset for this experiment

3. In this experiment, id is unique for every user and clicked is the target variable, so we will transform the remaining three columns.

Applying Transformation

As we can see, a new column named features has been created, which encapsulates all the feature values in a single vector column.

Final output, selecting only the transformed features and the target variable (clicked)

In the next experiment, we will transform a dataset that also has categorical variables. When we deal with categorical variables/features, we first have to convert them into numbers and then apply one-hot encoding. To do so, we take the help of StringIndexer and OneHotEncoder. Let's dig into their definitions:

StringIndexer: It encodes a string column of labels into a column of label indices; in other words, it converts strings into numbers. By default, the output is ordered by label frequency, so the most frequent label gets index 0. This behaviour can be controlled by the stringOrderType parameter, which supports four ordering options:

  • frequencyDesc: Descending order by label frequency (most frequent label assigned 0)
  • frequencyAsc: Ascending order by label frequency (least frequent label assigned 0)
  • alphabetDesc: Descending alphabetical order
  • alphabetAsc: Ascending alphabetical order

OneHotEncoder: A one-hot encoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row indicating the input category index.

Let's implement this procedure in an example. Also note that this is a precursor to the actual VectorAssembler step whenever we deal with categorical variables.

  1. Create a dummy dataset
Sample dataset creation for Experiment #2

2. Now, we have three string columns here, namely name, qualification and gender. A quick check on the distinct values of the categorical columns gives us the insights below.

Checking the available values for variable

3. We would be experimenting with qualification in this example.

String Indexer for one column

This process needs to be repeated for every categorical column. To keep the example simple, let's apply the one-hot encoding to the qualification example itself; here we use qualification_index as the input column and qualification_ohe as the output.

4. Performing the One Hot Encoding for the indexed columns

One Hot Encoding

As we have seen above, the two-step process (StringIndexer and OneHotEncoder) has to be repeated for every variable, so we use a concept known as a Pipeline, which performs this process for all the columns for us.

  1. As usual we start by creating a sample dataset
Sample Dataset for Experiment #3

This dataset is a mixture of numerical & categorical columns.

2. Then we define the stages we would like included in the pipeline. In this example we need three steps:

  • String indexing for categorical columns
  • One hot encoding for string index columns
  • Vector Assembling (encapsulating all the columns into a single Vector Column)

3. We start by defining the various stages, first filtering the categorical columns and applying the two-step process to them.

Handling Categorical Features

4. Then we filter the numerical columns, combine them with the one-hot-encoded categorical columns, and define a VectorAssembler over them.

5. Let's create a Pipeline to put all these pieces together.

The whole notebook for this experiment is available on my github HERE.

Stay tuned for more such content! See you in the next one! Happy learning! If you like what you read, follow me on LinkedIn.


Dawar Rohan

Data Scientist | Machine Learning Enthusiast | Learning through Sharing | Learning is fun! Connect with or follow me on LinkedIn: www.linkedin.com/in/rohandawar