Artificial Intelligence in Plain English
New AI, ML and Data Science articles every day.

Data Science


A recommendation engine is a model that can predict what a user may be interested in. When we apply this to the context of movies, for example, this becomes a movie recommendation engine. We filter items in our database by predicting how the current user might rate them. This helps us in connecting the user to the right content in our dataset.
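One common way to predict how a user might rate an unseen item is to average other users' ratings for it, weighted by how similar those users are to the current one. The sketch below is a minimal illustration of that idea, not the method used in the article; the users, movies, and ratings are hypothetical, and cosine similarity is just one possible similarity measure.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale (user -> {movie: rating}); not real data.
ratings = {
    "ana":  {"Alien": 5, "Heat": 4, "Up": 1},
    "ben":  {"Alien": 4, "Heat": 5, "Up": 2, "Jaws": 5},
    "cara": {"Alien": 1, "Heat": 2, "Up": 5, "Jaws": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[movie]
        weight_total += sim
    # None signals that no other user has rated the movie.
    return weighted_sum / weight_total if weight_total else None


# "ana" has not rated "Jaws"; her tastes are closer to "ben" than to "cara",
# so the prediction is pulled towards ben's rating.
predicted = predict_rating("ana", "Jaws")
print(round(predicted, 2))
```

Items can then be filtered or ranked by their predicted rating for the current user.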

Why is this relevant? If you have a massive catalogue, then the user may or may not find all the content that is relevant to them. By recommending the right content, you increase consumption. …

When GME goes up, the market goes down. Read on!

Figure made by the author.

Disclaimer: This is a short article and does not intend to provide financial advice or to suggest anything whatsoever.


Recently there has been a lot of noise around GME, Reddit and the stock market.

My hypothesis was that the price time series of GME and the S&P 500 are significantly correlated.

I ran a simple correlation analysis and found a significant (p=0.05) negative correlation (rho = -0.319) between the GME and S&P 500 prices.
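The article does not say which correlation coefficient was used. Since the result is reported as rho, the sketch below assumes a Spearman rank correlation and computes it from scratch. The two series are hypothetical stand-ins, not GME or S&P 500 prices, and the p-value is not computed here (a statistics library would be needed for that).

```python
from math import sqrt


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical daily closing prices (assumption, for illustration only).
series_a = [10, 12, 15, 14, 20, 25]
series_b = [100, 98, 99, 95, 93, 90]

rho = spearman(series_a, series_b)
print(round(rho, 3))  # negative: one series tends to fall as the other rises
```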

As a reminder, this is what happened over the past couple of months:

In my previous article, I highlighted four algorithms for getting started in Machine Learning: Linear Regression, Logistic Regression, Decision Trees and Random Forest. I am now writing a series that covers each of them.


The simplest form of the regression equation, with one dependent and one independent variable, is y = mx + c.

Here, y is the estimated dependent variable, c is the constant (intercept), m is the regression coefficient (slope), and x is the independent variable.
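To make the roles of m and c concrete, here is a minimal sketch that estimates both by ordinary least squares. The data points are hypothetical and were chosen to lie exactly on y = 2x + 1, so the fitted values are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept c in y = mx + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, c = fit_line(xs, ys)
prediction = m * 6 + c  # estimated y for a new x = 6
print(m, c, prediction)
```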

Let's understand this with an example:

In my last post, I covered combining and merging datasets in the Pandas library. In this post, I will talk about reshaping them. Data sets are not always formatted as we would like. Sometimes we may want to convert long-format data to wide (tabular) format, or wide format to long format. I will explain how to change the format of a data set.
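As a small illustration of the long-to-wide and wide-to-long conversions, the sketch below uses the pandas `pivot` and `melt` methods. These are one of several ways to reshape data in pandas and may not be the functions the post itself goes on to use; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) observation.
long_df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Rome", "Rome"],
        "year": [2020, 2021, 2020, 2021],
        "sales": [10, 12, 20, 25],
    }
)

# Long -> wide: one row per city, one column per year.
wide_df = long_df.pivot(index="city", columns="year", values="sales")

# Wide -> long: turn the year columns back into rows.
back_to_long = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="sales"
)

print(wide_df)
print(back_to_long)
```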

Photo by Kaitlyn Baker on Unsplash

We can rearrange the data in a DataFrame using hierarchical indexing. In summary, this post covers:

  • How to rearrange the data set with hierarchical indexing?
  • What are the…

Logistic Regression is an algorithm that predicts the probability that an observation belongs to one of two classes. If the outcome being predicted is an event, the binary dependent variable is encoded as 1 if the event occurs and as 0 if it does not.
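The link between a predicted probability and a 0/1 class can be sketched as follows. The intercept, the coefficient, and the single "lead time" feature are hypothetical values chosen for illustration; they are not fitted to the article's dataset, and the 0.5 threshold is a conventional default rather than a requirement.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical model parameters (assumptions, not fitted values).
INTERCEPT = -2.0
COEF_LEAD_TIME_DAYS = 0.03


def cancellation_probability(lead_time_days):
    """Probability of class 1 (cancelled) for a given lead time in days."""
    return sigmoid(INTERCEPT + COEF_LEAD_TIME_DAYS * lead_time_days)


def predict_class(lead_time_days, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if cancellation_probability(lead_time_days) >= threshold else 0


for days in (10, 120):
    print(days, round(cancellation_probability(days), 3), predict_class(days))
```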

In this example, I implement a 5-step logistic regression to predict whether a reservation is likely to be cancelled so venues are prepared to find new bookings for empty space.

The dataset comes from a travel business and contains 4238 observations. Its categorical variables include destination country, property type…

Installing R, Performing Data Manipulation, up to Applying Machine Learning with R.

Image by Daria Nepriakhina from Unsplash

Data Scientists today have many options for producing statistical visualisations, with Python the most popular. With Python we can do almost anything, from Machine Learning classification and regression and Deep Learning for computer vision to NLP and audio analysis. Besides Python, Machine Learning algorithms can be run in many other languages, such as Java, Scala, Lisp, C++, or C#. However, R is the second most sought-after skill for Data Scientists, at least up to 2021 according to LinkedIn, as mentioned in the link below:

Let’s Get Started with the Download

Future events are far from certain in the business world. Most managers who use probabilities are concerned with two conditions:

1. The case when one event or another will occur.

2. The situation where two or more events will both occur.

We are interested in the first case when we ask: What is the probability that today’s demand will exceed our inventory? To illustrate the second situation, we could ask: What is the probability that today’s demand will exceed our inventory and that more than 10% of our sales force will not report for work? Probability is used throughout business to evaluate…
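The two conditions can be worked through with the addition and multiplication rules. In the sketch below, the two probabilities are hypothetical, and the events are assumed to be independent so that the joint probability is simply their product; with dependent events a conditional probability would be needed instead.

```python
from fractions import Fraction

# Hypothetical probabilities (assumptions for illustration only).
p_demand_exceeds_inventory = Fraction(1, 5)   # 0.20
p_many_staff_absent = Fraction(1, 10)         # 0.10

# Both events occur. Multiplication rule, assuming independence:
# P(A and B) = P(A) * P(B)
p_both = p_demand_exceeds_inventory * p_many_staff_absent

# One event or the other occurs. Addition rule:
# P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_demand_exceeds_inventory + p_many_staff_absent - p_both

print(float(p_both), float(p_either))
```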

One of the main applications of unsupervised learning is market segmentation. Labelled data is often unavailable, yet it is still important to segment the market so that businesses can target individual groups. This is very useful in advertising, inventory management, implementing strategies for distribution, and mass media. Let’s go ahead and apply unsupervised learning to one such use case to see how it can be useful.

Getting ready

We will be dealing with a wholesale vendor and his customers. We will be using the data available at …

The number of clusters as one of the input parameters

When we discussed the k-means algorithm, we saw that we had to give the number of clusters as one of the input parameters. In the real world, we won’t have this information available. We can definitely sweep the parameter space to find out the optimal number of clusters using the silhouette coefficient score, but this will be an expensive process! A method that returns the number of clusters in our data will be an excellent solution to the problem. DBSCAN does just that for us.
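A minimal sketch of this behaviour with `sklearn.cluster.DBSCAN` is shown below. The points are hypothetical, and the `eps` and `min_samples` values are assumptions picked to suit them. Note that although DBSCAN does not take the number of clusters as input, these two parameters still have to be chosen.

```python
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two tight groups plus one isolated point.
points = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, 0.0],
]

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
model = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = model.labels_

# DBSCAN labels noise points as -1, so they are excluded from the count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)
print(n_clusters)
```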

Getting ready

We will perform a DBSCAN analysis using the sklearn.cluster.DBSCAN function. We will use the same…

Measuring the Performance

We have built different clustering algorithms, but haven’t measured their performance.

  1. In supervised learning, the predicted values are compared with the original labels to calculate accuracy.
  2. In contrast, in unsupervised learning, we have no labels, so we need to find a way to measure the performance of our algorithms.

Getting ready

A good way to evaluate a clustering algorithm is to see how well the clusters are separated. Are the clusters well separated? Are the data points within a cluster tight enough?

We need a metric that can quantify this behaviour. We will use a metric called the silhouette…
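For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below computes the average over all points from scratch, using hypothetical points and Euclidean distance; in practice a library routine would normally be used instead.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient; expects at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a single-point cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical points forming two well-separated pairs.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_score(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mixed across the gap
print(round(good, 3), round(bad, 3))
```

Scores close to 1 indicate tight, well-separated clusters, while scores near 0 or below suggest overlapping or wrongly assigned clusters.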
