
Matrix factorization works great for building recommender systems. I think it got pretty popular after the Netflix Prize competition. All you need to build one is information about which user bought or rated which items, and you’re good to go. I was surprised by how simple it is to build one with the PySpark ML libraries, so I’ll demonstrate how to code one up quickly using RDDs and DataFrames separately.
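To make the idea concrete before reaching for Spark, here is a minimal sketch of what matrix factorization does with (user, item, rating) triples. It is plain Python with stochastic gradient descent, not the PySpark approach this post builds, and the ratings, the function names and the hyperparameters are all made up for illustration.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, epochs=500,
              lr=0.01, reg=0.02, seed=0):
    """Toy SGD matrix factorization over (user, item, rating) triples."""
    rng = random.Random(seed)
    # One small random latent vector per user and per item.
    user_f = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    item_f = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(user_f[u], item_f[i]))
            err = r - pred
            for k in range(rank):
                pu, qi = user_f[u][k], item_f[i][k]
                # Gradient step on squared error with L2 regularization.
                user_f[u][k] += lr * (err * qi - reg * pu)
                item_f[i][k] += lr * (err * pu - reg * qi)
    return user_f, item_f

def predict(user_f, item_f, u, i):
    """Predicted rating is the dot product of the two latent vectors."""
    return sum(a * b for a, b in zip(user_f[u], item_f[i]))

def rmse(ratings, user_f, item_f):
    """Root mean squared error over the observed ratings."""
    sq = [(r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical ratings: (user_id, item_id, rating) for 4 users and 3 items.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]

# epochs=0 returns the untrained factors, which gives a baseline error.
init_users, init_items = factorize(ratings, 4, 3, epochs=0)
rmse_before = rmse(ratings, init_users, init_items)

user_factors, item_factors = factorize(ratings, 4, 3)
rmse_after = rmse(ratings, user_factors, item_factors)

print(f"RMSE before training: {rmse_before:.3f}, after: {rmse_after:.3f}")
# Score an unseen pair: user 0 never rated item 2.
print(f"Predicted rating for user 0 on item 2: {predict(user_factors, item_factors, 0, 2):.2f}")
```

Once the factors are learned, recommending is just scoring the items a user hasn’t rated and picking the highest ones.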

I’ll mostly focus on building the model in this tutorial. If you’re interested in learning more about matrix factorization or singular value decomposition, there are some amazing resources out there. …

In this blog, I’ll share some basic data preparation steps I find myself doing quite often, and I’m sure you do too. I’ll use PySpark and cover things like removing outliers and making your distributions closer to normal before you feed your data into any model, be it linear regression or a nearest neighbour search.

You don’t always need to remove outliers and skewness from your data. It depends heavily on how you’re going to use it. For example, algorithms like decision trees aren’t affected much by outliers, but algorithms like linear regression or even neural nets tend to work better when your data is roughly normally distributed. …
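Here’s a rough sketch of the two ideas on a small list of numbers. It uses plain Python rather than PySpark so it runs anywhere, and it assumes the common 1.5 × IQR rule for outliers and a log transform for skew, which are just one reasonable choice among several. The `amounts` values and the helper names are hypothetical.

```python
import math
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed column with a couple of extreme values.
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40, 1000]

cleaned = remove_outliers(amounts)
# log1p compresses the long right tail and is safe for zeros.
logged = [math.log1p(v) for v in cleaned]

print("kept:", cleaned)
print("skew before log:", round(skewness(cleaned), 2))
print("skew after log:", round(skewness(logged), 2))
```

On a real Spark DataFrame you’d compute the quartiles on the cluster instead of pulling the column into a Python list, but the logic stays the same.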

Sometimes histograms and scatterplots aren’t enough. Here I’ll cover some of the more complicated plots that you might need to use: violin plots, heatmaps and Sankey diagrams. I’ll mostly use Python, and I’ve picked up this data from here. It covers the startup investment scene in India, with funding information between January 2015 and August 2017.

First I’ll read the data and do some cleaning.

# importing some stuff
