This guide elaborates the full pipeline, from data production setup to model creation and evaluation. For illustration, the MovieLens dataset is used, with Kafka as the producer.

Using PySpark, a machine learning model based on the alternating least squares (ALS) method is built, and its performance is compared with deep learning models built using the TensorFlow framework in Databricks.
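To make the idea of alternating least squares concrete, here is a minimal pure-Python sketch. It is not the PySpark implementation used in the guide: the toy ratings, the helper names (`solve`, `update`, `loss`, `als`) and the hyperparameters are all assumptions made for illustration. The user factors are solved for while the movie factors stay fixed, then the roles swap, and each step is a small regularized least-squares problem.

```python
import random

def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def update(fixed, grouped, rank, lam):
    """Recompute one side's factor vectors while the other side stays fixed."""
    out = {}
    for key, pairs in grouped.items():
        # Normal equations: (sum v v^T + lam * I) x = sum r * v
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in pairs:
            v = fixed[other]
            for i in range(rank):
                b[i] += r * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        out[key] = solve(a, b)
    return out

def loss(ratings, users, items, lam):
    """Regularized squared error over the observed ratings."""
    err = sum((r - sum(p * q for p, q in zip(users[u], items[i]))) ** 2
              for u, i, r in ratings)
    reg = sum(x * x for vec in list(users.values()) + list(items.values())
              for x in vec)
    return err + lam * reg

def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    users = {u: [rng.random() for _ in range(rank)] for u in by_user}
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    history = [loss(ratings, users, items, lam)]
    for _ in range(iterations):
        users = update(items, by_user, rank, lam)   # movies fixed, solve users
        items = update(users, by_item, rank, lam)   # users fixed, solve movies
        history.append(loss(ratings, users, items, lam))
    return users, items, history

# Made-up (user, movie, rating) triples, not taken from MovieLens.
ratings = [("u1", "m1", 5.0), ("u1", "m2", 3.0), ("u2", "m1", 4.0),
           ("u2", "m3", 1.0), ("u3", "m2", 2.0), ("u3", "m3", 5.0),
           ("u4", "m1", 5.0), ("u4", "m3", 2.0)]
users, items, history = als(ratings)
print(history[0], history[-1])
```

Because every half-step exactly minimizes the regularized loss over one side, the values in `history` should never go up from one iteration to the next.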

The dataset can be found on Kaggle here.

Intro to Collaborative Filtering

Collaborative filtering (CF) is a popular recommendation algorithm that bases its predictions and recommendations on the ratings or behavior of other users in the system. In simple words, if a user “Andrew” likes…
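A tiny user-based sketch of this idea, in plain Python, might look like the following. The ratings are made up, the users "Bob" and "Carol" and the movie titles are hypothetical, and cosine similarity is only one of several possible similarity measures; it is an assumption here, not necessarily what the article uses.

```python
from math import sqrt

# Made-up ratings on a 1-5 scale; only "Andrew" comes from the text above.
ratings = {
    "Andrew": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "Bob": {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "Carol": {"Movie A": 1, "Movie C": 5, "Movie D": 2},
}

def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        num += weight * theirs[movie]
        den += weight
    return num / den if den else None

prediction = predict("Andrew", "Movie D")
print(prediction)
```

Andrew's ratings look much more like Bob's than Carol's, so the prediction for "Movie D" is pulled toward Bob's rating of 4 rather than Carol's 2.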

Quitting a well-paying career to follow an interesting passion is one of the most difficult decisions. As humans, we are bound by various things that keep us from switching careers and make us feel secure on the path we are currently on. But to follow a dream, one needs to step out of the safe zone and explore. My journey to pursue this interest in and passion for data science started at the end of 2016. Until then, I had never heard of the field of data science.

An artist must not train only his eyes but also his soul

The above…

My journey to becoming an intern at a startup is what is discussed here. Personally, joining a startup transformed my life and made me a professional coder. Until this internship at the startup “Sensego, Paris”, I could write any programming logic in Python or apply machine learning algorithms to datasets from the internet. That is an easy task. But in a production environment, it is completely different. The dataset is quite complicated, and where the code is written to be part of bigger modules, it needs to be very efficient, mostly leveraging…

What are B0 and B1? These model parameters are sometimes referred to as theta0 and theta1. Basically, B0 represents the intercept and B1 represents the slope of the regression line.

We all know that the regression line is given by Y = B0 + B1·X.

To understand how Y is expressed as a function of X with these model parameters, and how the best-fit line is selected, this post derives the formulas for B0 and B1 step by step.
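For reference, the standard least-squares result that such a derivation arrives at is B1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and B0 = ȳ − B1·x̄. The short sketch below computes both in plain Python. The function name `fit_line` and the sample points are assumptions for illustration; they are not the data behind the B0 and B1 values quoted in the post.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up points lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

Since the sample points sit exactly on a line, the fit should recover an intercept of 1 and a slope of 2, up to floating-point rounding.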

Consider a problem as shown below, where the best regression line is selected with B0 = 19.969 and B1 = 0.00776, so how…

Regular expressions, fancily called regex, are one of the most important topics one needs to know to be a data scientist. Knowledge of regular expressions pays off while programming.

Data scientists use regular expressions in natural language processing tasks such as text mining, and in computer vision, to extract part of a sentence or strip away the desired words. Regular expressions form the basis of text mining, which is now mostly regarded as NLP.
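As a small illustration of both uses, extracting part of a sentence and stripping away chosen words, here is a sketch with Python's built-in `re` module. The sample sentences and both patterns are made up for the example; the email pattern in particular is deliberately simple and not a full address validator.

```python
import re

# Extract part of a sentence: pull out anything that looks like an email address.
sentence = "Contact us at sales@example.com or call 555-0100 before Friday."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", sentence)

# Strip away chosen words (and the whitespace after them), ignoring case.
text = "The quick brown fox jumps over the lazy dog"
stripped = re.sub(r"\b(?:the|over)\b\s*", "", text, flags=re.IGNORECASE)

print(emails)    # ['sales@example.com']
print(stripped)  # quick brown fox jumps lazy dog
```

`re.findall` returns every non-overlapping match as a list, while `re.sub` rewrites the string by replacing each match, here with an empty string.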

An example of where regular expressions are used is “Datascience applied to Ad industry”. With each advertisement we see…


A cheerful, enthusiastic data science professional, open to working with a dynamic team to share and work in harmony.
