Gif by the author.


You can visit the final app online. The code is open-source on GitHub.
This is mostly a procedural article. The purpose is to describe the full process of creating an app without bombarding the reader with code snippets.

What’s the app about? The app is called Random Dose of Knowledge and looks like this:

Image by Faye Cornish on Unsplash

Most of the content of this article is from my recent paper entitled:
“An Evaluation of Feature Selection Methods for Environmental Data”, available here for anyone interested.

The 2 approaches for Dimensionality Reduction

There are two ways to reduce the number of features, otherwise known as dimensionality reduction.

The first way is called feature extraction and it aims to transform the features and create entirely new ones based on combinations of the raw/given ones.
The most popular approaches are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multidimensional Scaling. However, the new feature space tells us little about the original features, because each new feature is a combination of several raw ones.
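As a minimal sketch of feature extraction, the snippet below runs PCA through NumPy's SVD. The synthetic data, the random seed, and the choice of two components are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 5 raw features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]  # make two features strongly correlated

# PCA: center the data, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; keep the first two.
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

# Share of the total variance captured by each component.
explained_ratio = (S ** 2) / np.sum(S ** 2)

print(X_reduced.shape)
print(Vt[0])  # weights of every raw feature in the first component
```

Each row of `Vt` assigns a weight to every raw feature, so a single extracted feature mixes all of the originals. That is why the reduced space is hard to relate back to individual input features.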

Achieve a 66× speedup in read time, a 25× speedup in write time, and a 0.39× file size in your daily I/O operations.

Photo by Red Zeppelin on Unsplash

Reading and writing files using Pandas and NumPy is an everyday task for Data Scientists and Engineers.

Let’s compare the most common functions that these libraries provide to write/read tabular data.

We can make our code much faster in these I/O operations, save time, and make our boss and ourselves happy.

We can also save serious amounts of disk space by choosing the appropriate save function.

First, let's create a DataFrame of 10,000,000 rows and 2 columns.

to_csv() / pd.read_csv()

The most common approach to save a Pandas DataFrame.
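A rough sketch of the CSV round trip, assuming an integer column `a` and a float column `b` (the column names and contents are hypothetical). The row count is reduced from the article's 10,000,000 so the example runs quickly, and timings will vary by machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

N_ROWS = 100_000  # the article uses 10,000,000 rows; reduced here for speed

# Hypothetical two-column DataFrame: one integer column, one float column.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=N_ROWS),
    "b": rng.random(N_ROWS),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")

    # Time the write.
    start = time.perf_counter()
    df.to_csv(path, index=False)
    write_seconds = time.perf_counter() - start

    # Time the read.
    start = time.perf_counter()
    loaded = pd.read_csv(path)
    read_seconds = time.perf_counter() - start

    # File size on disk, in megabytes.
    size_mb = os.path.getsize(path) / 1e6

print(f"write: {write_seconds:.3f}s, read: {read_seconds:.3f}s, size: {size_mb:.2f} MB")
```

The same timing pattern can wrap any other save/load pair, which makes the formats directly comparable on read time, write time, and file size.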

Keep up with the latest trends and stay at the top of your field.

Photo by Juja Han on Unsplash

Podcasts are on the rise. They can be an alternative way for Data Scientists to learn and keep up with the latest news in the field. Immerse yourself in the industry and stay at the top of your field.

A podcast is a passive form of learning, so you can do other things at the same time. You can listen while taking a walk, exercising, cleaning the house, or relaxing.

I will recommend 5 active podcasts that post new episodes every week with durations ranging from 20 minutes to about an hour.

I recommend…


I rate all the movies I watch on IMDb and the website allows you to download a nice .csv with all your ratings. This .csv contains basic information about the movies. In order to perform topic modeling, I need the plots and/or summaries of the movies. I will grab this information from Wikipedia and use it to enrich the IMDb dataset. Then I will perform LDA for topic modeling on the plots+summaries of the movies to find 6 topics.

I will keep the article clean of code. The code is available here.

The purpose of this article is to:

  • Use…

Image by Dimitris Effrosynidis

I created a simple Web Application with the Spotify API, Python Dash, and Flask. Spotify users can access the app by giving it permission to use their data. A lot of cool statistics are displayed!

You need a Spotify Account to access it. Allow up to 20 seconds to load.

I am a Data Scientist, with an academic background in Electrical and Computer Engineering. After completing university in 2017, I immediately started a Ph.D. Through the Ph.D. journey, I discovered Data Science. Machine Learning and Data Science books, YouTube videos, online courses, podcasts, and Kaggle, all combined, made me a self-taught…

What is Outlier Detection?

Outlier Detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition. An early definition by Grubbs (1969) is: "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs." A more recent definition by Barnett and Lewis (1994) is:

An observation which appears to be inconsistent with the remainder of that set of data.


Straight from this excellent article, the most common causes of outliers are:

  • Human errors — Data entry errors
  • Instrument errors — Measurement errors
  • Experimental errors

source: Clker, via pixabay

Normalization and standardization are similar — they rescale the features. They are used in data analysis to understand the data, and in machine learning to perform better training with certain algorithms.

This article includes:

  • Normalization. Why normalize?
  • Standardization. Why standardize?
  • Differences?
  • When to use each, and when not to
  • Python code for Simple Feature Scaling, Min-Max, Z-score, log1p transformation

Import Libraries, Read Data

Using House Prices Dataset from Kaggle.
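The four rescaling methods listed above can be sketched in plain Python as follows. The `prices` values are made up for illustration and are not drawn from the Kaggle dataset; the function names are likewise hypothetical.

```python
import math
import statistics

# Made-up values for illustration (not from the House Prices dataset).
prices = [120_000.0, 150_000.0, 180_000.0, 250_000.0, 755_000.0]


def simple_feature_scaling(values):
    """Divide by the maximum, so the largest value becomes 1."""
    peak = max(values)
    return [v / peak for v in values]


def min_max(values):
    """Rescale linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def log1p_transform(values):
    """Apply log(1 + x), which compresses large values."""
    return [math.log1p(v) for v in values]


for name, func in [
    ("simple", simple_feature_scaling),
    ("min-max", min_max),
    ("z-score", z_score),
    ("log1p", log1p_transform),
]:
    print(name, [round(v, 3) for v in func(prices)])
```

Min-max and simple feature scaling bound the output range, while the z-score centers the values without bounding them. The log1p transformation changes the shape of the distribution rather than just its scale.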

In about 20 minutes from now, you will have a playlist in Spotify that automatically receives songs from your favorite subreddits.

You only have to set this up once, and then you can enjoy it forever. It is very easy and requires no coding knowledge.

Ok, but why do I want to create such a playlist?

If you are like me, then you love to discover new songs. Spotify and other services provide great recommendation systems powered by machine learning, but I have found that human recommendations are more diverse and more interesting.

One of the best human…

In this article, we will:

  • Explore 11 Cross-Validation techniques.
  • Visualize the training and validation samples in each fold. This is the best way to instantly understand how that particular CV technique works.
  • Plot the distribution in each validation fold versus the distribution of the actual test.

We have a dataset. It is split into two parts: one is called training and the other testing. They have the same columns, except one: the training set also contains the target.

Our task is to fit a model on the training data and predict the unknown target on the testing data.

We can’t…
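As a minimal sketch of how a cross-validation scheme carves the training data into folds, the snippet below implements plain K-fold without shuffling. The helper name `k_fold_indices` and the sample and fold counts are hypothetical, and this is only one of the many techniques the article covers.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for plain K-fold without shuffling."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Each sample lands in exactly one validation fold.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 3)):
    print(f"fold {fold}: train={train_idx} val={val_idx}")
```

Printing the indices per fold gives a quick view of which samples are used for training and which for validation in each round.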

Dimitris Effrosynidis

Data Scientist at Mathisys Technologies Hellas | Ph.D. Candidate at the Democritus University of Thrace.
