Selecting the right set of features for data modelling has been shown to improve the performance of supervised and unsupervised learning, to reduce computational costs such as training time or required resources and, in the case of high-dimensional input data, to mitigate the curse of dimensionality. Computing and using feature importance scores is also an important step towards model interpretability.
This post shares an overview of supervised and unsupervised methods for performing feature selection, which I acquired after researching the topic for a few days. For all depicted methods I also provide references to open-source Python implementations I…
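As a taste of what such methods look like, here is a minimal sketch (the dataset and parameters are illustrative, not taken from the post) of a supervised filter method using scikit-learn's univariate feature selection:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Univariate filter method: keep the k features with the highest
# ANOVA F-score with respect to the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # 500 samples, 5 retained features
```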
This post will show an alternative approach to clustering a dataset, which relies on graph-oriented techniques, namely on Louvain community detection.
A community is a group of nodes within a network that are more densely connected to each other than they are to the rest of the network. Community detection is thus equivalent to partitioning a network into densely connected communities such that nodes belonging to different communities are only sparsely connected. This is achieved by optimizing the modularity metric, which evaluates the quality of a partitioning by how much more densely connected the nodes in a community are by…
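As an illustration (not from the post itself), the following sketch runs Louvain community detection on the classic Zachary karate club graph using networkx (the `louvain_communities` function requires networkx >= 2.8):

```python
import networkx as nx

# Classic benchmark graph for community detection.
G = nx.karate_club_graph()

# Louvain community detection: returns a list of sets of nodes,
# one set per detected community.
communities = nx.community.louvain_communities(G, seed=42)

# Modularity quantifies how much denser the intra-community edges are
# compared to a random graph with the same degree distribution.
score = nx.community.modularity(G, communities)
print(len(communities), round(score, 3))
```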
Frequent patterns are collections of items which appear in a data set with significant frequency (usually greater than a predefined threshold) and can thus reveal association rules and relations between variables. Frequent pattern mining is a research area in data science applied to many domains, such as recommender systems (which sets of items are usually ordered together), bioinformatics (which genes are co-expressed in a given condition), decision making, clustering and website navigation.
This section will introduce the recurring terminology of the frequent pattern mining domain.
Input data is usually stored in a database or as a collection of transactions…
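To make the terminology concrete, here is a minimal brute-force sketch (the transaction data and threshold are made up for illustration) that counts the support of small itemsets over a collection of transactions:

```python
from itertools import combinations
from collections import Counter

# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

min_support = 3  # an itemset is "frequent" if it appears in >= 3 transactions

# Count the support of every 1- and 2-itemset (brute force; real miners
# such as Apriori or FP-Growth prune this search space).
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)
```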
This post addresses the following questions:
High-dimensional data consists of inputs having from a few dozen to many thousands of features (or dimensions). This context is typically encountered, for instance, in bioinformatics (all sorts of sequencing data) or in NLP, where the size of the vocabulary is very high. High-dimensional data is challenging because:
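One of those challenges, the concentration of pairwise distances, can be demonstrated with a short simulation (the sample sizes and dimensions below are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# As dimensionality grows, pairwise distances between random points
# concentrate around their mean: the relative spread std/mean shrinks,
# which degrades distance-based methods such as nearest neighbours.
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))   # 500 random points in [0, 1]^d
    dists = pdist(X)                 # all pairwise Euclidean distances
    ratios[d] = dists.std() / dists.mean()
    print(d, round(ratios[d], 3))
```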
This post will address the following questions:
The figure below is a simplification of the paper Reservoir computing approaches for representation and classification of multivariate time series but it captures well the gist of ESNs. Each component will be detailed in the following sections.
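As a minimal sketch of the reservoir component (the dimensions, weight scaling and input signal are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# Minimal echo state network reservoir: the input and reservoir weights
# are random and stay fixed; only a readout on the states is trained.
n_in, n_res = 1, 100
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))  # input weights
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))    # recurrent weights

# Rescale W so its spectral radius is below 1 (echo state property).
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(u):
    """Collect reservoir states for an input sequence u of shape (T, n_in)."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W_in @ u_t + W @ x)  # leaky-free state update
        states.append(x)
    return np.array(states)

u = np.sin(np.linspace(0, 8 * np.pi, 200))[:, None]  # toy input signal
states = run_reservoir(u)
print(states.shape)  # one n_res-dimensional state per time step
```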
This post will show you how to:
Let’s start by generating an input dataset consisting of 3 blobs:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
from sklearn.datasets import make_blobs

n_components = 3
X, truth = make_blobs(n_samples=300, centers=n_components,
                      cluster_std=[2, 1.5, 1],
                      random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=truth)
There are over 20 different types of data distributions (over continuous or discrete spaces) commonly used in data science to model various types of phenomena. They also have many interconnections, which allow us to group them into families of distributions. A great blog post proposes the following visualization, where continuous lines represent an exact relationship (special case, transformation or sum) and dashed lines indicate a limit relationship. The same post provides a detailed explanation of these relationships, and this paper provides a thorough analysis of the interactions between distributions.
The following section provides information about each…
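For example, one of those dashed limit relationships can be checked numerically: Binomial(n, p) approaches Poisson(λ) as n grows and p shrinks with n·p = λ held fixed (the values below are illustrative):

```python
import numpy as np
import scipy.stats as st

# Limit relationship: Binomial(n, p) -> Poisson(lam) when n -> infinity
# and p -> 0 with n * p = lam kept constant.
lam = 3.0
k = np.arange(15)
poisson_pmf = st.poisson.pmf(k, lam)

errors = []
for n in (10, 100, 1000):
    binom_pmf = st.binom.pmf(k, n, lam / n)
    errors.append(np.max(np.abs(binom_pmf - poisson_pmf)))
    print(n, errors[-1])  # the discrepancy shrinks as n grows
```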
This article presents a basic Python implementation of the expectation maximization algorithm applied to Gaussian distributions. The entire code has been made available as a notebook on GitHub.
Some datasets consist of a mixture of distributions (for simplicity we will consider Gaussians, but the same approach can be applied to other types of distributions):
dataset = N(mean1, sd1) + N(mean2, sd2) + … + N(mean_n, sd_n)
where N represents a normal distribution described by a mean and a standard deviation.
In the example below we generated a dataset containing samples from 3 distributions having random means and the…
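A minimal 1-D sketch of the EM iteration for two components (the data, initial guesses and iteration count are made up for illustration, a simplified stand-in for the full notebook):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Toy data drawn from two Gaussians with known (hidden) parameters.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

# Initial parameter guesses for the two components.
means = np.array([-1.0, 1.0])
sds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibility of each component for each data point.
    dens = np.stack([w * st.norm.pdf(data, m, s)
                     for w, m, s in zip(weights, means, sds)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=1)
    means = (resp * data).sum(axis=1) / nk
    sds = np.sqrt((resp * (data - means[:, None]) ** 2).sum(axis=1) / nk)
    weights = nk / len(data)

print(np.round(np.sort(means), 2))  # should be close to the true means -2 and 3
```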
In a nutshell, this post addresses the following 2 questions:
Typically we use both linear and log scales (the latter for capturing outliers), but here we will investigate the possibility of creating hybrid axes, with an arbitrary mixture of various types of scales applied to desired intervals.
We will propose a custom implementation of a violinboxplot offering a wide range of customization parameters which govern, for instance, the rendering of outliers, custom annotations for modes and counts, and a split axis between linear and log scales based on an arbitrary percentile. It handles both arrays of data and dataframes grouped by a list of columns.
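For comparison, matplotlib already ships one simple hybrid scale, `symlog`, which is linear near zero and logarithmic beyond a threshold (the data below is synthetic; the `linthresh` keyword requires matplotlib >= 3.3):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# The bulk of the data sits around 10, with a handful of heavy outliers.
data = np.concatenate([rng.normal(10, 3, 1000),
                       rng.lognormal(5, 1, 20)])

fig, ax = plt.subplots()
ax.boxplot(data, vert=False)
# Linear axis up to linthresh, logarithmic beyond: both the bulk
# and the outliers remain readable on one axis.
ax.set_xscale("symlog", linthresh=30)
fig.savefig("hybrid_axis_boxplot.png")
```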
This post explains the functioning of the spectral graph clustering algorithm, then looks at a variant named self-tuned graph clustering. This adaptation has the advantage of providing an estimation of the optimal number of clusters, as well as of the similarity measure between data points. Next, we will provide an implementation of the eigengap heuristic, which computes the optimal number of clusters in a dataset based on the largest distance between consecutive eigenvalues of the input data's Laplacian.
Now let’s start by introducing some basic graph theory notions.
Given a graph with n vertices and m edges, the…
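The eigengap heuristic mentioned above can be sketched as follows (the affinity construction with an RBF kernel, the blob centers and the chosen gamma are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Three well-separated blobs; centers chosen for illustration.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

A = rbf_kernel(X, gamma=0.1)    # affinity (similarity) matrix
np.fill_diagonal(A, 0)
D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # unnormalized graph Laplacian

eigvals = np.sort(np.linalg.eigvalsh(L))
# Eigengap heuristic: the index of the largest gap among the smallest
# eigenvalues estimates the number of clusters.
gaps = np.diff(eigvals[:10])
n_clusters = int(np.argmax(gaps)) + 1
print("estimated number of clusters:", n_clusters)
```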
Computer science engineer, bioinformatician, researcher in data science