Matrix factorization works great for building recommender systems. I think it became popular after the Netflix Prize competition. All you need to build one is information about which users bought or rated which items, and you’re good to go. I was surprised by how simple it is to build one with PySpark’s ML libraries, so I’ll demonstrate how to code one up quickly, using RDDs and DataFrames separately.

I’ll mostly focus on building the model in this tutorial. If you’re interested in learning more about matrix factorization or singular value decomposition, there are some amazing resources out there. …
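To make the idea concrete before reaching for Spark, here is a minimal sketch of matrix factorization trained with stochastic gradient descent in plain Python. It is not PySpark's ALS and not necessarily how the tutorial builds its model; the toy `ratings` triples, the `factorize` and `rmse` helpers, and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are all assumptions made up for illustration.

```python
import math
import random


def predict(p_u, q_i):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(a * b for a, b in zip(p_u, q_i))


def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.random() * 0.5 for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.random() * 0.5 for _ in range(rank)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - predict(P[u], Q[i])
            for k in range(rank):
                p, q = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * q - reg * p)
                Q[i][k] += lr * (err * p - reg * q)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings only."""
    total = sum((r - predict(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


# Hypothetical toy data: users 0 and 1 like items 0 and 1,
# users 2 and 3 like items 2 and 3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

P, Q = factorize(ratings, n_users=4, n_items=4)
print("training RMSE:", rmse(ratings, P, Q))
print("predicted rating for user 0, item 3:", predict(P[0], Q[3]))
```

The only input is the list of (user, item, rating) triples, which matches the point above that you just need to know who rated what. Spark's ALS solves the same kind of problem with alternating least squares and distributes the work, so treat this as a way to see the moving parts rather than something to run on real data.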

In this post, I’ll share some basic data preparation steps that I find myself doing quite often, and I’m sure you do too. I’ll use PySpark and cover things like removing outliers and making your distributions closer to normal before you feed your data into a model, be it linear regression or a nearest-neighbour search.

You don’t always need to remove outliers and skewness from your data. It depends heavily on how you’re going to use it. For example, algorithms like decision trees are largely unaffected by outliers, but algorithms like linear regression or even neural nets tend to work better when your data has a roughly normal distribution. …
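As a rough illustration of those two steps, here is a small sketch in plain Python rather than PySpark, so it runs without a cluster. It drops outliers with the common 1.5 × IQR rule and then applies a log transform to reduce right skew. The `amounts` column, the helper names, and the choice of the IQR rule and `log1p` are assumptions for the example, not necessarily what the full post uses.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical right-skewed column (say, purchase amounts) with extreme values.
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40, 1000]

trimmed = remove_outliers(amounts)
logged = [math.log1p(v) for v in trimmed]  # log1p(x) = log(1 + x), defined at zero

print("skewness after trimming:", skewness(trimmed))
print("skewness after log transform:", skewness(logged))
```

In PySpark the same idea would typically use approximate quantiles on a DataFrame column followed by a filter, but the logic is the same: compute the fences, drop what falls outside them, then transform what is left.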

Sometimes histograms and scatterplots aren’t enough. Here I’ll cover some of the more complicated plots that you might need: violin plots, heatmaps and Sankey diagrams. I’ll mostly use Python, and I picked up the data from here; it’s data on the startup investment scene in India. The dataset has Indian startup funding information between January 2015 and August 2017.

First I’ll read the data and do some cleaning.

# importing some stuff
import pandas as pd
import numpy as np
import math
from datetime import datetime

# importing stuff for plotting
import seaborn as sns
from matplotlib import pyplot as plt

from pylab import…