Representation Always Matters: Machine Learning Part-8

Shubham Patni
5 min read · Aug 13, 2018

--

How well you can learn anything depends largely on how well the information is represented. Take a math teacher: even with awesome subject knowledge and material, if he doesn't have the trick of presenting it well, the lesson won't be the success he might have expected.

In machine learning, this trick of converting raw data into substantial, useful features is known as Feature Engineering. Here we focus more on data representation than on code, in contrast to general software writing.

Data values can be categorised as Categorical or Numerical.

Categorical Data: in text format.
1. A discrete set of possible values.
e.g. House_style: {Tudor, ranch, colonial}
e.g. Car_style: {hatchback, SUV, sedan}
e.g. Car_color: {red, white, black}

2. But sometimes the values are mutually exclusive.
e.g. Car: {Toyota, Tata}

Also, cars can come in multiple colours, i.e. one categorical feature can have one example (m) with multiple values (n). That means the model has m*n new possibilities to learn 🎉

Numerical Data:

Maths understands real numbers better than text. So we intend to convert text values into real-number identifiers.
e.g. map Street_1 to 1, map Street_2 to 2, and so on.

Why do we need to do this? Because in ML, to get a feature's contribution, feature values (FV) must be multiplied by the model weights (MW).
Remember Y = mX + b from linear equations!
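The mapping above can be sketched as a simple dictionary lookup (the street names and indices here are just the illustrative ones from the text):

```python
# Hypothetical vocabulary of street names mapped to integer identifiers,
# mirroring "map Street_1 to 1, map Street_2 to 2" from the text.
streets = ["Street_1", "Street_2", "Street_3"]
street_to_id = {name: i + 1 for i, name in enumerate(streets)}

print(street_to_id["Street_1"])  # 1
print(street_to_id["Street_3"])  # 3
```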

Good, now let's go through some useful terminology:

  1. Vocabulary
  2. Out of Vocabulary
  3. Binary Vector
  4. One-hot encoding
  5. Multi-hot encoding
  6. Sparse Representation.

and understand these in little detail.

  1. Vocabulary: from a large set of examples, we choose the relevant values that will be used to train the ML model.
  2. Out-of-vocabulary (OOV): the rest of the unused set.

Depending on your problem statement, you definitely want to remove unnecessary learning effort that can affect future results. This is one of the most important parts of model training and can take a lot of consideration.
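One common way to split vocabulary from OOV is by frequency. This is a minimal sketch, assuming a made-up list of values and an arbitrary "seen at least twice" cutoff; real cutoffs depend on your problem:

```python
from collections import Counter

# Hypothetical raw feature values; names are illustrative only.
examples = ["ranch", "tudor", "colonial", "ranch", "tudor", "igloo"]

# Keep values seen at least twice as the vocabulary;
# everything else collapses into a single OOV bucket.
counts = Counter(examples)
vocab = {value for value, count in counts.items() if count >= 2}

def encode(value):
    # Map unseen/rare values to the shared "OOV" token.
    return value if value in vocab else "OOV"

print([encode(v) for v in examples])
# ['ranch', 'tudor', 'OOV', 'ranch', 'tudor', 'OOV']
```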

The bold-marked values are the Vocabulary; the others are OOV.

3. Binary Vector: to remove bias at the initial state, we try to put every feature value on the same scale; that means no feature gets a higher weightage by default. Let's understand this with the following example:

Feature values (FV) must be multiplied by the model weights (MW). Suppose the weight is 5, and let's see the MW * FV values for each of the vocabulary streets we are using.

Street_1: 5 * 1 = 5
Street_2: 5 * 2 = 10
Street_3: 5 * 3 = 15 and so on.

The problem here is that the feature's contribution keeps increasing purely because of the index number we assigned. To fix this, it's important to put all feature values on the same scale while still keeping them distinguishable as identifiers. The binary vector is the solution.

You create a binary vector and set the identifier index to 1 (instead of assigning other numbers), and the rest to 0. This is also called One-hot encoding (term 4 above).

Now check the same MW * FV calculation again:

Street_1: 5 * 1 = 5
Street_2: 5 * 1 = 5
Street_3: 5 * 1 = 5 and so on. Wooow, we fixed it!
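Here is a small sketch of both encodings side by side, using the same weight of 5 from the example (the `one_hot` helper is just an illustrative name):

```python
weight = 5  # the model weight (MW) from the example above

# Integer encoding: the product grows with the arbitrary index.
integer_products = [weight * idx for idx in (1, 2, 3)]
print(integer_products)  # [5, 10, 15]

def one_hot(index, size):
    # Binary vector with a single 1 at the identifier index.
    vec = [0] * size
    vec[index] = 1
    return vec

# One-hot encoding: each street contributes weight * 1, no matter its index.
one_hot_products = [weight * one_hot(idx, 3)[idx] for idx in range(3)]
print(one_hot_products)  # [5, 5, 5]
```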

It is also possible to have multi-hot encoding (term 5 above) for a given feature example. Say a house sits at a street corner and can be identified by both street names. (Try to work out the binary vector representation yourself and add it in the comment section.)
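A multi-hot vector simply sets a 1 for every value that applies. A minimal sketch of the corner-house idea, with hypothetical street names:

```python
streets = ["Street_1", "Street_2", "Street_3", "Street_4"]
index = {name: i for i, name in enumerate(streets)}

def multi_hot(values, index):
    # Binary vector with a 1 at every applicable position.
    vec = [0] * len(index)
    for value in values:
        vec[index[value]] = 1
    return vec

# A corner house addressable by two streets gets two 1s.
corner_house = multi_hot(["Street_1", "Street_3"], index)
print(corner_house)  # [1, 0, 1, 0]
```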

Cool!! So we are getting the hang of Feature Engineering. What we have seen so far is the basic but important part of it; there are lots of other factors that can help represent data better.

Feature engineering helps create binary vectors for textual data.

Do you see any issue with binary vectors? If you look closely, you'll easily spot that all the zero values are redundant and cause increased memory use and processing overhead 😬 😬. Wait, wait; to fix this we have another technique called “Sparse Representation”.

6. Sparse Representation: this technique stores only the non-zero values (no more zeros 😉). It reduces memory and computing power drastically. There are two ways to do it:

  1. Using an array:
track each value using row-column indexing.

2. Using a linked list:

A linked list can be a powerful data structure here.

If we do so, our data representation will look like this:

Before → After :D :D
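The array-based idea can be sketched in a few lines: keep only the (index, value) pairs for the non-zeros, and rebuild the dense vector only when needed. This is a toy illustration, not a production sparse format:

```python
# Dense one-hot-style vector: mostly redundant zeros.
dense = [0, 0, 0, 1, 0, 0, 1, 0]

# Sparse representation: store only (index, value) pairs for non-zeros.
sparse = [(i, v) for i, v in enumerate(dense) if v != 0]
print(sparse)  # [(3, 1), (6, 1)]

# Reconstruct the dense vector when a consumer needs it.
restored = [0] * len(dense)
for i, v in sparse:
    restored[i] = v
print(restored == dense)  # True
```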

Nice, we are doing well so far. Let's also look at a couple of things you should notice in your data. They will help you understand your data better and improve model performance.

  1. Avoid rarely used discrete feature values:
    : house_type: victorian 👍
    : unique_house_id: 8SK982ZZ1242Z 👎
  2. Prefer clear and obvious meanings:
    : house_age: 27 👍
    : user_age: 227 👎 (who is this guy 😅 😅)

3. Don't mix “magic” values with actual data:
: Rating (0.05, 0.87) 👍
: Rating (-1) 👎

Specifically, in this scenario the rating feature can be broken into two features, say:
1. Rating given or not (1/-1)
2. The rating itself (0–1)
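The split above can be sketched like this, assuming -1 is the magic "no rating" value and 0.0 is an arbitrary neutral placeholder for the missing rating:

```python
def split_rating(raw):
    # Split a rating that abuses -1 as "no rating" into two features:
    # an is-rated indicator (1/-1, as in the text) and the 0-1 value.
    if raw == -1:
        return -1, 0.0  # not rated; 0.0 is just a neutral placeholder
    return 1, raw       # rated; keep the actual 0-1 value

print(split_rating(0.87))  # (1, 0.87)
print(split_rating(-1))    # (-1, 0.0)
```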

And that's all for this session. There are more techniques to filter your data and represent it better; we will discuss them next time.

Things to remember… 🤔 🤔 🤗 🤗


#HumanBeing #Programmer #Android #AWS #ML #MedicalDeviceSoftware