MLiterature: Dimensionality Reduction (1D, 2D,…1millionD)

Vikram Iyer
Aug 26, 2017 · 3 min read

Caveat: This post covers only the theory of Dimensionality Reduction. The math, code (Jupyter notebook), and geometric interpretation will be covered in later posts in this series.

Though the term “Dimensionality Reduction” may seem intimidating to a beginner at first, it is not. From what I have read of Machine Learning literature, many names seem to be used just to scare off newbies. Glory awaits those who are not afraid! :)

Let’s say you work as a Data Scientist for an Ad Marketing Agency, where your task is to analyze data about customer purchases and find patterns: what made a user buy something, and how can you make him/her buy more?

Where is my credit card ??

Let’s break down the process of finding patterns in users’ buying habits into the steps below:

  1. Get data about the user [we need not worry]
  2. Clean and format the data (RDBMS, CSV, etc.) [we need not worry]
  3. Find insights about a particular user [let’s worry about this at least]

Now, one thing you definitely need to worry about is what kind of data about a user is stored:

  1. time of buying
  2. sites used by the user
  3. device type (mobile, laptop)
  4. product last browsed
  5. ad type last clicked
  6. color of the product
  7. category of product
  8. id of the user
  9. id of the product
  10. time at which data is stored, etc.

Let’s call the above points “features” of a user. There is a plethora of this kind of data stored in data warehouses.
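To make the word “feature” concrete, here is a minimal Python sketch of what one stored record could look like. The key names and all the values are made up for illustration; only the ten features themselves come from the list above.

```python
# One hypothetical purchase record. Every value (and the snake_case
# key names) is invented for this sketch, not taken from real data.
record = {
    "time_of_buying": "2017-08-26 14:05",
    "site_used": "example-shop.com",
    "device_type": "mobile",
    "product_last_browsed": "running shoes",
    "ad_type_last_clicked": "facebook",
    "product_color": "blue",
    "product_category": "footwear",
    "user_id": "AbCDe123Xyz",
    "product_id": "XyZ987aBc",
    "stored_at": "2017-08-26 14:06",
}

# Each key is one feature, i.e. one dimension of the data.
n_dimensions = len(record)
print(n_dimensions)  # 10
```

So a single user record here lives in a 10-dimensional space, one dimension per feature.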

Too much (FUN) data!

Now, from the above set of features, we see that product id and user id do not tell us much about what a user has been purchasing or will purchase, because they look something like AbCDe123Xyz, which is just an alphanumeric string.

So, we decide to omit these features (product id, user id, time at which data is stored) and keep the ones that look helpful (time of buying, sites used by the user, device type (mobile, laptop), product last browsed, ad type last clicked, color of the product, category of product).
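As a rough sketch of that omission step in Python: the record below is hypothetical, and the key names and the helper `drop_bookkeeping` are my own choices for illustration, not part of any library.

```python
# The three features the post decides to omit.
# The snake_case names are this sketch's own convention.
BOOKKEEPING_FEATURES = {"user_id", "product_id", "stored_at"}


def drop_bookkeeping(record):
    """Return a copy of the record without the omitted features."""
    return {
        name: value
        for name, value in record.items()
        if name not in BOOKKEEPING_FEATURES
    }


# A made-up record with all ten features.
record = {
    "time_of_buying": "2017-08-26 14:05",
    "site_used": "example-shop.com",
    "device_type": "mobile",
    "product_last_browsed": "running shoes",
    "ad_type_last_clicked": "facebook",
    "product_color": "blue",
    "product_category": "footwear",
    "user_id": "AbCDe123Xyz",
    "product_id": "XyZ987aBc",
    "stored_at": "2017-08-26 14:06",
}

useful = drop_bookkeeping(record)
print(len(record), "->", len(useful))  # 10 -> 7
```

The original record is left untouched, so you can always go back and try a different set of features.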

Now let’s say that, with the remaining 7 features we have selected as useful, analyzing all 7 takes 1 day and leaves you 100% sure about a pattern and that the user will buy. On the other hand, using only 3 features takes maybe 5 minutes and leaves you 80% sure that the user will buy.

Which one would you choose?

And the answer is: 100% and 7 features.

Since no constraint has been given so far, we want to be 100% sure that the user buys what we offer.

But if your manager says you’ve got to do this every day, as there will be new data coming in every second, then with your choice of answer, you are screwed!

You may either speed up your analysis or go with 3 features and an 80% chance that the user will buy. Lazy people choose the second option. I did!
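The choice above can be written down as a tiny decision rule: pick the most confident option that still fits your time budget. This is only a sketch of the reasoning, not a real analysis. The `Option` class, the `choose` helper, and the 60-minute budget are assumptions of mine; the 7 features / 1 day / 100% and 3 features / 5 minutes / 80% numbers are the ones from the story.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Option:
    n_features: int
    minutes: float     # how long one analysis run takes
    confidence: float  # how sure we are that the user will buy (0 to 1)


# The two options from the post (1 day = 24 * 60 minutes).
OPTIONS = [
    Option(n_features=7, minutes=24 * 60, confidence=1.00),
    Option(n_features=3, minutes=5, confidence=0.80),
]


def choose(options: List[Option], budget_minutes: Optional[float] = None) -> Option:
    """Pick the most confident option that fits the time budget.

    budget_minutes=None means there is no time constraint.
    """
    feasible = [
        o for o in options
        if budget_minutes is None or o.minutes <= budget_minutes
    ]
    if not feasible:
        raise ValueError("no option fits the time budget")
    return max(feasible, key=lambda o: o.confidence)


# No constraint: we take all 7 features.
print(choose(OPTIONS).n_features)  # 7

# Hypothetical budget of 60 minutes per run: only 3 features fit.
print(choose(OPTIONS, budget_minutes=60).n_features)  # 3
```

Once the manager adds a time constraint, the 1-day option drops out and the 3-feature option is the best one left.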

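All of what you did above is called dimensionality reduction, where each of your features is a dimension. (Picking a subset of the existing features like this is one flavor of it, often called feature selection.) You used only 3 features/dimensions of a user and predicted with 80% accuracy a pattern that made a user buy.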

So, this pattern is something like:

  1. user buys only on weekends
  2. Facebook ads are more attractive to the user
  3. mobile is the most preferred way of buying, and so on
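Here is a minimal sketch of how keeping only 3 features and summarizing them could surface a pattern like this. The post does not say which 3 features were kept, so I am assuming time of buying, ad type last clicked, and device type because they line up with the pattern above. The three purchase records and the helper `reduce_dimensions` are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumption: these are the 3 features we keep.
KEPT = ("time_of_buying", "ad_type_last_clicked", "device_type")


def reduce_dimensions(record, kept=KEPT):
    """Keep only the chosen features of a record."""
    return {name: record[name] for name in kept}


# Made-up purchases (only some of the original features shown).
purchases = [
    {"time_of_buying": "2017-08-26 14:05", "ad_type_last_clicked": "facebook",
     "device_type": "mobile", "product_color": "blue"},
    {"time_of_buying": "2017-08-27 10:30", "ad_type_last_clicked": "facebook",
     "device_type": "mobile", "product_color": "red"},
    {"time_of_buying": "2017-08-23 19:45", "ad_type_last_clicked": "search",
     "device_type": "laptop", "product_color": "black"},
]

reduced = [reduce_dimensions(p) for p in purchases]

# weekday() returns 5 for Saturday and 6 for Sunday.
weekend_buys = sum(
    datetime.strptime(r["time_of_buying"], "%Y-%m-%d %H:%M").weekday() >= 5
    for r in reduced
)
top_ad = Counter(r["ad_type_last_clicked"] for r in reduced).most_common(1)[0][0]
top_device = Counter(r["device_type"] for r in reduced).most_common(1)[0][0]

print(weekend_buys, top_ad, top_device)  # 2 facebook mobile
```

Simple counting like this is of course not how the 80% figure would be obtained; it only shows that the reduced records still carry the weekend / Facebook / mobile signal.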

Now, a major part of this, namely how it is actually done, is something I did not explain here. That was not the intent of this post anyway, but I intend to cover it in later posts.

We will look at the points below in the upcoming posts in this series:

  1. Explaining Dimensionality Reduction with co-ordinate geometry
  2. Explaining the code and math behind the scenes
  3. One technique which will help us: “Principal Component Analysis”

Until then,

Happy Machine Learning!


Written by Vikram Iyer: Machine Learning Engineer, Time Series Analysis, Design Patterns, Clean Code, etc.
