Word Embedding — One hot encoding

Basics of one-hot encoding using NumPy, scikit-learn, Keras, and TensorFlow

Pema Grg
Zero Equals False
2 min read · Jan 8, 2019


A machine cannot understand words directly; it needs numerical values to process the data. Before applying any algorithm, we need to convert categorical data to numbers. One-hot encoding is one way to achieve this, as it converts categorical variables to binary vectors.

Example:

Suppose we have the sentence “Can I eat the Pizza”.

Looking at this, we can directly say that all the words are different from each other, but how will the machine know? This is where one-hot encoding comes in, i.e. converting the categories into numerical labels.

  1. First, convert the text to lower case and sort the words in ascending order, i.e. A–Z. Now we’ll have “can, eat, i, pizza, the”.
  2. Give each word a numerical label based on its position in the sorted list: can:0, eat:1, i:2, pizza:3, the:4.
  3. Transform to binary vectors.
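The three steps above can be sketched in a few lines of plain Python (word list taken from the example; no libraries needed):

```python
# Steps 1-3 on "Can I eat the Pizza", with no libraries
words = sorted("Can I eat the Pizza".lower().split())   # step 1: lower-case and sort
labels = {word: i for i, word in enumerate(words)}      # step 2: numerical labels
vectors = {word: [1 if i == labels[word] else 0         # step 3: binary vectors
                  for i in range(len(words))]
           for word in words}

print(labels)            # {'can': 0, 'eat': 1, 'i': 2, 'pizza': 3, 'the': 4}
print(vectors["pizza"])  # [0, 0, 0, 1, 0]
```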

Got some idea? Are you wondering what a categorical variable is now?

Well, categorical variables take one of a fixed set of values based on some qualitative property, such as the sex of an individual (male, female, or trans) or the weather (sunny, cloudy, or rainy).

Binary variables are nothing but variables that contain only 0s and 1s.

Convert Using NumPy

Steps to follow:

  1. Convert the text to lower case
  2. Tokenize the text
  3. Get the unique words
  4. Sort the word list
  5. Get the integer position of each word
  6. Create a vector for each word by marking its position as 1 and the rest as 0
  7. Create a matrix from the resulting vectors
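The seven steps above can be sketched as follows (a minimal version; the full code is in the repo linked at the end):

```python
import numpy as np

text = "Can I eat the Pizza"

words = text.lower().split()        # steps 1-2: lower-case and tokenize
vocab = sorted(set(words))          # steps 3-4: unique words, sorted
word_to_index = {w: i for i, w in enumerate(vocab)}  # step 5: positions

# step 6: one-hot vector for a word (1 at its position, 0 elsewhere)
def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

# step 7: stack the vectors into a matrix, one row per word
matrix = np.array([one_hot(w) for w in vocab])
print(matrix)
```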

Convert Using Sklearn

Steps to follow:

  1. Convert the text to lower case
  2. Word tokenize
  3. Get each word’s integer value, i.e. its position, using LabelEncoder()
  4. Get the one-hot encoding of each word from the label-encoded values using OneHotEncoder()
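A minimal sketch of those steps with scikit-learn. Note that OneHotEncoder() returns a sparse matrix by default, so `.toarray()` is used here to get a dense result:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

words = "Can I eat the Pizza".lower().split()   # steps 1-2

# step 3: integer-encode each word (its position in alphabetical order)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(words)
print(integer_encoded)  # [0 2 1 4 3]

# step 4: one-hot encode the integer labels
onehot_encoder = OneHotEncoder()
one_hot = onehot_encoder.fit_transform(integer_encoded.reshape(-1, 1)).toarray()
print(one_hot)
```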

Convert Using Keras

Steps to follow:

  1. Convert the text to lower case
  2. Word tokenize
  3. Get each word’s integer value, i.e. its position, using LabelEncoder()
  4. Get the one-hot encoding of each word from the label-encoded values using to_categorical()
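A minimal sketch of those steps, assuming the Keras bundled with TensorFlow (with standalone Keras the import would be `from keras.utils import to_categorical` instead):

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

words = "Can I eat the Pizza".lower().split()          # steps 1-2

integer_encoded = LabelEncoder().fit_transform(words)  # step 3
one_hot = to_categorical(integer_encoded)              # step 4
print(one_hot)
```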

Convert Using TensorFlow

Steps to follow:

  1. Convert the text to lower case
  2. Word tokenize
  3. Get each word’s integer value, i.e. its position
  4. Create a placeholder for the input
  5. Get the one-hot encoding using tf.one_hot()
  6. Run the session, feeding in the word ids as input
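The placeholder and session steps describe the TensorFlow 1.x workflow; under TensorFlow 2.x eager execution the same tf.one_hot() call can be run directly. A minimal 2.x-style sketch:

```python
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

words = "Can I eat the Pizza".lower().split()   # steps 1-2
word_ids = LabelEncoder().fit_transform(words)  # step 3

# steps 4-6 collapse to one eager call: tf.one_hot() maps each
# integer id to a one-hot row of length `depth`
one_hot = tf.one_hot(word_ids, depth=len(set(words)))
print(one_hot.numpy())
```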

You can get the full code from my GitHub repo: https://github.com/pemagrg1/one-hot-encoding

Enjoy coding! 😃



Currently an NLP Engineer @EKbana (Nepal) | previously worked @Awesummly (Bangalore) | internship @Meltwater, Bangalore | LinkedIn: https://www.linkedin.com/in/pemagrg/