Image by Dimitris Effrosynidis

I created a simple web application with the Spotify API, Python Dash, and Flask. Spotify users can access the app by giving it permission to use their data. It then displays a lot of cool statistics!

You can visit the app here

You need a Spotify account to access it. Allow up to 20 seconds for it to load.

Code is available on GitHub

I am a Data Scientist with an academic background in Electrical and Computer Engineering. After completing university in 2017, I immediately started a Ph.D. Through the Ph.D. journey, I discovered Data Science. Books on machine learning and data science, YouTube videos, online courses, podcasts, and Kaggle all combined to make me a self-taught aspiring data scientist. So, after I completed my military service, I found a Data Science job (while still doing the Ph.D.). …


What is Outlier Detection?

Outlier detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition of an outlier. An early definition by Grubbs (1969) is: an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. A more recent definition by Barnett and Lewis (1994) is:

An observation which appears to be inconsistent with the remainder of that set of data.


Straight from this excellent article, the most common causes of outliers are:

  • Human errors — Data entry errors
  • Instrument errors — Measurement errors
  • Experimental errors — data extraction or experiment planning/executing…

Image source: Clker, via Pixabay

Normalization and standardization are similar: both rescale the features. They are used in data analysis to understand the data, and in machine learning to train certain algorithms more effectively.

This article includes:

  • Normalization. Why normalize?
  • Standardization. Why standardize?
  • Differences?
  • When to use each, and when not to
  • Python code for Simple Feature Scaling, Min-Max, Z-score, log1p transformation
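As a rough illustration of the four transformations listed above, here is a minimal sketch in plain Python. The function names and the sample prices are made up for this example and are not taken from the article's code or from the House Prices dataset. It assumes the feature is not constant and, for the log1p transform, that all values are greater than -1.

```python
import math
from statistics import mean, pstdev


def simple_feature_scaling(values):
    """Divide every value by the maximum, so the largest value becomes 1."""
    max_value = max(values)
    return [v / max_value for v in values]


def min_max(values):
    """Rescale to the [0, 1] range. Assumes max(values) != min(values)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the feature is not constant
    return [(v - mu) / sigma for v in values]


def log1p_transform(values):
    """Apply log(1 + x), which compresses large values and maps 0 to 0."""
    return [math.log1p(v) for v in values]


# Hypothetical sale prices, used only for illustration.
prices = [100_000.0, 150_000.0, 200_000.0, 400_000.0]

print(simple_feature_scaling(prices))
print(min_max(prices))
print(z_score(prices))
print(log1p_transform(prices))
```

Min-Max and Simple Feature Scaling keep the shape of the distribution and only change its range, while the Z-score centers it around zero. The log1p transform is the only one of the four that changes the shape, which is why it is often applied to skewed features such as prices.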

Import Libraries, Read Data

We use the House Prices dataset from Kaggle.


In about 20 minutes from now, you will have a playlist in Spotify that automatically receives songs from your favorite subreddits.

You only have to set this up once, and you can enjoy it forever. It is very easy and requires no coding knowledge.

Ok, but why do I want to create such a playlist?

If you are like me, then you love to discover new songs. Spotify and other services provide great recommendation systems driven by machine learning, but I have found that human recommendations are more diverse and more interesting.

One of the best places on the internet for human music recommendations is Reddit. There are numerous subreddits for every music taste. A list of the most popular ones can be seen here. Posts in these subreddits are mostly songs in the format: ‘Artist - Song’s…


In this article, we will:

  • Explore 11 Cross-Validation techniques.
  • Visualize the training and validation samples in each fold. This is the best way to instantly understand how that particular CV technique works.
  • Plot the distribution in each validation fold versus the distribution of the actual test set.

We have a dataset that is split into two parts: training and testing. Both have the same columns, except one: only the training set also contains the target.

Our task is to fit a model on the training data and predict the unknown target on the testing data.

We can’t just fit on the whole training data and expect things to go well on testing. We need to validate that our model captures the hidden patterns in the training data, is stable, does not overfit, and generalizes well on unknown data. …
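To make the idea of folds concrete, here is a minimal sketch of plain K-Fold splitting without shuffling, written in pure Python. K-Fold is assumed here as the most basic technique; this excerpt does not name the 11 techniques the article covers. The function name `k_fold_indices` and the sample sizes are hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for plain K-Fold, no shuffling."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Hypothetical example: 10 training rows split into 3 folds.
folds = list(k_fold_indices(n_samples=10, n_splits=3))
for fold_number, (train, validation) in enumerate(folds):
    print(f"fold {fold_number}: train={train} validation={validation}")
```

Every row lands in exactly one validation fold, so a model is always scored on rows it has not seen during fitting in that fold.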

Using Machine Learning and Python, I developed a model to classify tweets about COVID into positive, negative or neutral.


Twitter is the most successful microblogging service, with 150 million daily users. About 6,000 tweets are written every second. People tweet about everything that comes to mind and use hashtags to associate a tweet with a topic.

We can build a machine learning classifier to rate tweets based on their Sentiment. A tweet can express a positive, negative, or neutral Sentiment.

I will create a simple model to classify such tweets in real time and create a graph of the overall sentiment about COVID. What is people’s sentiment about the virus?

1. Dataset

First, I need a dataset of tweets that are already classified into one of the three categories to train my model. SemEval provides a relatively big dataset of 65,854 already labeled tweets. As there is no COVID-specific Twitter dataset, I will use a general Twitter dataset. …


We will use data from the Kaggle competition M5 Forecasting — Accuracy.

The task is to forecast, as precisely as possible, the unit sales (demand) of various products sold in the USA by Walmart.

More precisely, we have to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details.

The dataset is enormous, so for this demonstration I will use a subset of it: a single product with a lot of sales.

Our goal is to compare classical time series analysis techniques with machine learning algorithms.


Time Series Forecasting is the process where we try to do the impossible: predict the future.

If anyone claims to have constructed the perfect time series forecasting model, we should be cautious. Sure, some models are better than others and the error can be quite small for some observations, but overall the future is unpredictable. Something might happen in the future that never occurred in the past, so even this “perfect” model will fail.

In this article, we will go step-by-step through the time series forecasting procedure using three relatively simple forecasting methods and predict the unknown future using the Triple Exponential Smoothing model. …


In Part 1 we looked at:

  • What is a Time Series
  • The Basic Steps in a Forecasting Task
  • Time Series Graphics including time plot, seasonal plots, and seasonal subseries plots
  • Time Series Components and Decomposition.

Everything was accompanied by theory and code.

In Part 2 we will continue our journey with:

  • Stationarity
  • Autocorrelation
  • Lag Scatter Plot
  • Simple Moving Average
  • Exponentially Weighted Moving Average
  • Double and Triple Exponential Smoothing
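Two of the topics listed above, the Simple Moving Average and the Exponentially Weighted Moving Average, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the sample series, and the choice to start the exponential recursion at the first observation are not taken from the article.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` observations; None until the window is full."""
    result = []
    for i in range(len(series)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = series[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


def exponentially_weighted_moving_average(series, alpha):
    """Recursive form: s_t = alpha * x_t + (1 - alpha) * s_(t-1), with s_0 = x_0."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical daily sales, used only for illustration.
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

sma = simple_moving_average(sales, window=3)
ewma = exponentially_weighted_moving_average(sales, alpha=0.5)

print(sma)   # [None, None, 12.0, 14.0, 16.0]
print(ewma)  # [10.0, 11.0, 12.5, 14.25, 16.125]
```

The simple moving average weights the last `window` observations equally, while the exponentially weighted version gives the most recent observation weight `alpha` and lets older observations decay geometrically.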

Let’s remember our dataset with a glimpse of its first rows.


A. Some Theory First

1. What is a Time Series?

A time series is a sequence of observations recorded at regular time intervals.

Depending on the frequency of observations, a time series may be hourly, daily, weekly, monthly, quarterly, or annual. Sometimes you might have second-wise or minute-wise time series as well, such as the number of clicks and user visits every minute.

Many real-world problems involve time series data. Anything that is observed sequentially over time is a time series.

Examples of time series data include:

  • Daily stock prices
  • Monthly sea temperature
  • Quarterly sales for a company
  • Annual company profits

Time series analysis involves understanding various aspects of the inherent nature of the series so that you are better informed to create meaningful and accurate forecasts. …


Dimitris Effrosynidis

Data Scientist at Mathisys Technologies Hellas | Ph.D. Candidate at the Democritus University of Thrace.
