Data Science

The minimal-effort, maximum-outcome way

Happiness is not pleasure. Happiness is the expansion of possibilities — Scott Young

I often wonder what the one thing I am most passionate about is. I know I like fitness, sports, e-commerce, and the advertising space, but I cannot come up with a single answer. Is it because I know I am not good at it, or am I scared to actually make a career out of it? I thought completing grad school and finding a job would help me find my passion. It has definitely made me happy, but I am far from finding my passion. I tried digging…

Real-life use of a machine learning model to combat disasters

Current Problem

In 2019, there were a total of 409 natural disasters worldwide. The irony is that we are right now in the middle of a global pandemic due to COVID-19. During or after a disaster, millions of people communicate, either directly or via social media, to get help from the government or from disaster relief and recovery services. If an affected person tweets or even sends a message to a helpline service, chances are that the message will be lost among the thousands of messages received. …


A real-life data science task for a mail-order sales company


In this project, a mail-order sales company in Germany wants to identify segments of the general population to target with its marketing in order to grow. Demographic information has been provided (by Arvato Financial Solutions, through Udacity) for both the general population and prior customers of the mail-order company, so that a model of the company's customer base can be built. The target dataset contains demographic information for the targets of a mailout marketing campaign.

The objective is to identify which individuals are most likely to respond to the campaign and become customers of the mail-order company.

Data Description


Extract data from different sources



It is often said that data scientists spend 80% of their time preprocessing data, so let's dive deep into the data preprocessing pipeline, also known as the ETL pipeline, and find out which stage takes the most time. In this blog post, we will learn how to extract data from different data sources. Let's use a real-life dataset so it's easier to follow.

This lesson uses data from the World Bank. The data comes from two sources:

  1. World Bank Indicator Data — This data contains socio-economic indicators for countries around the world. …
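As a rough illustration of the extract step, here is a minimal sketch that reads records from two different formats, CSV and JSON, into plain Python dictionaries. It uses only the standard library, and the sample text, field names, and values are made up for the example; they are not taken from the World Bank files.

```python
import csv
import io
import json

# Made-up sample inputs; field names and values are illustrative only.
CSV_TEXT = """country,year,population
CountryA,2000,100
CountryB,2000,250
"""

JSON_TEXT = '[{"country": "CountryA", "year": 2000, "indicator": 1.5}]'


def extract_csv(text):
    """Parse CSV text into a list of dict rows (every value stays a string)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Parse a JSON array of objects into a list of dicts (types preserved)."""
    return json.loads(text)


csv_rows = extract_csv(CSV_TEXT)
json_rows = extract_json(JSON_TEXT)

print(csv_rows[0])
print(json_rows[0])
```

Note that the CSV reader returns every field as a string while the JSON parser keeps numbers as numbers, so a later transform step would need to reconcile the types.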

Data Science Project Ideas

The buzz around data science in 2020 is like Tesla stock: it keeps rising every day. The field is so hot that everyone from mechanical engineers to doctors wants to be a data scientist. But how do you break into data science? Join a DS bootcamp? Do two or three MOOCs? Compete in Kaggle competitions? The list is endless. I am not disputing the advantages of MOOCs or even Kaggle competitions; they are great places to learn data science.

However, the issue is everyone is doing it! How frequently have we seen some post about their…

Wage analysis using Random Forest

Wage analysis is the process of comparing salaries based on the attributes of the employee. Of course, several factors, such as the company and the location, contribute to the wage. Here, we will analyze the Mid-Atlantic wage dataset, which is available here.

For execution, I used PySpark with the Apache Spark Docker Jupyter Notebook, but you can use Python with scikit-learn or other packages.

Let's read our data and see what it looks like:

How apps like Inshorts work

To understand the key topics of text summarization, I highly recommend you read Text Summarization-Key Concepts.

Features for Extractive Text Summarization

  1. Title Word Features: Sentences in the original document that contain words from the title have a greater chance of contributing to the final summary, since they serve as indicators of the document's topic. E.g., if the title of the document is “Automation in Healthcare Industry”, then words like “automation” and “healthcare” appearing in the content are given greater significance.
  2. Content Word Feature: Keywords are essential in identifying the importance of a sentence. The sentence that consists of primary keywords…
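The title word feature described above can be sketched in a few lines of Python. This is only one possible scoring scheme, assumed for illustration: the score is the fraction of title words (minus a small, hypothetical stopword list) that also appear in the sentence. The function names and the sample sentences are made up for the example.

```python
import re

# Small, hypothetical stopword list used only for this illustration.
STOPWORDS = {"in", "the", "a", "an", "of", "and", "is", "to", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def title_word_score(sentence, title):
    """Fraction of non-stopword title words that occur in the sentence."""
    title_words = set(tokenize(title)) - STOPWORDS
    if not title_words:
        return 0.0
    sentence_words = set(tokenize(sentence))
    return len(title_words & sentence_words) / len(title_words)


title = "Automation in Healthcare Industry"
sentences = [
    "Automation is changing the healthcare industry.",
    "The weather was pleasant yesterday.",
]
scores = [title_word_score(s, title) for s in sentences]
print(scores)
```

A sentence that mentions all of the title words gets the highest score, so it would be ranked ahead of unrelated sentences when the summary is assembled.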

How apps like Inshorts work


Have you ever condensed a lengthy document into a short paragraph? How long did it take? Manually producing a summary can be tedious and time-consuming. Automatic text summarization promises to overcome such difficulties and lets you extract the key ideas in a piece of writing easily. Or have you ever tried the mobile app Inshorts? It’s an innovative news app that converts news articles into 60-word summaries. And that is exactly what we will learn in this project: text summarization. Text summarization is the technique for generating a concise and precise summary…

Brace yourselves, the spam is coming

In this blog post, we are going to develop an SMS spam detector using logistic regression and PySpark. We will predict whether an SMS text is spam or not. This was one of the first use cases of data science and is still widely used to filter emails.

Dataset: The text file can be downloaded from here. This is what our dataset looks like:

Using PySpark and vanilla Python

Some statistical models 𝑓(𝑥) are learned by optimizing a loss function 𝐿(Θ) that depends on a set of parameters Θ. There are several ways of finding the optimal Θ for the loss function, one of which is to update Θ iteratively by following the gradient:

Then, compute the update:
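To make the iterative procedure concrete, here is a minimal vanilla-Python sketch of gradient descent. It assumes a common form of the update, Θ ← Θ − α∇𝐿(Θ) with learning rate α, and a hypothetical one-parameter model 𝑓(𝑥) = Θ𝑥 trained with a mean squared error loss. The data, the learning rate, and the function names are chosen only for illustration and are not taken from the original post.

```python
def gradient(theta, xs, ys):
    """Gradient of the mean squared error mean((theta*x - y)**2) w.r.t. theta."""
    n = len(xs)
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=200, theta=0.0):
    """Repeatedly step theta against the gradient of the loss."""
    for _ in range(steps):
        theta -= learning_rate * gradient(theta, xs, ys)
    return theta


# Toy data generated by y = 2 * x, so the optimal theta is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta_hat = gradient_descent(xs, ys)
print(theta_hat)
```

With this toy data the estimate approaches 2. A learning rate that is too large would make the iteration diverge instead, so the value used here is specific to this example.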

Harsh Darji

Data, Product & Automation @Zoro
