Bimodal or trimodal data? Try a Gaussian Mixture Model

Jozsef Meszaros
Analytics Vidhya
Published in
4 min readFeb 6, 2022

If you are using MATLAB to analyze data containing two or more normally distributed populations, this tutorial could help you sort them out

Let’s hope that your data will cluster this cleanly!

Let’s begin by creating a hypothetical dataset containing data from three random distributions.

rng(5) % Initialize the random number generatormu = [5,10,15]; % Choose mean and standard deviation values
sigma = [.5,2,3];
Nsamples = [100,100,100]; % Note this %
data = cell2mat( arrayfun(@(i) ...
random('Normal',mu(i),sigma(i),Nsamples(i),1), [1:3],...
'UniformOutput', false )' );
figure; hist(data)

Creating the figure will get you something that looks like this. It’s hard to tell how many Gaussians you put in there using the automatic binning by MATLAB.

We can use the AIC estimator to check the optimal number of Gaussians in your data. Of course, we know the answer is three for our toy dataset, but this won’t be the case out in the wild.

rng(5)Nguess = 5; % Guess the data has at most 5 Gaussiansoptions = statset(); options.MaxIter = 500;% Fit Gaussian Mixture Models for each of the five models
gmmodels = arrayfun( @(x) fitgmdist( data, x, 'Options', options ),[1:Nguess],...
'UniformOutput', false );
% Extract the AIC values for each model
AIC_values = cellfun( @(x) x.AIC, gmmodels );
% Plot AIC values to see the number that gives you the smallest AIC
figure; ax=axes('NextPlot','add'); plot( ax, [1:5], AIC_values );[a,b] = min(AIC_values);
plot(b,a,'ro')

This will show you a shallow plateau with a minimum at 3, which is great because that’s how many we put into the artificial data.

Now, we’ll want to try and cluster the data based on the Gaussian Mixture Models we created above using arrayfun (if you need to brush up on your functions check out my post on Making MATLAB “Fun”).

Idx = cluster(gmmodels{3},data);% We will create artificially spaced bins
Nbins=100;
bins = linspace(0,max(data),Nbins);
% Create a Kernel Density Function for Each of the 3 Gaussians
% Note the multiplication by the normalization factor
% (sum(Idx==x))/numel(Idx)
% Entries roughly should match up with Nsamples/sum(Nsamples)
%
f = arrayfun( @(x) (sum(Idx==x)/numel(Idx))*ksdensity( data(Idx==x), bins, 'BandWidth', 1 ), [1:3], 'UniformOutput', false );
figure; ax=axes('nextplot','add');arrayfun( @(f_) plot(bins,f_{1}), f );

Now there are two areas here that you may need to adjust when using it with your own custom data.

(1) Binning. You may want to try a more principled approach, such as the Freedman-Diaconis rule (it’s very straightforward).

(2) Bandwidth. The parameter within the KS density function is arbitrarily set to 1. This is something you can play around with, or check out this formula.

Once you display the result, it should look pretty crisp and a huge improvement visually from the histogram we started out with.

Lastly, you’ll want to get that beautiful aesthetic that’s possible through using the legendary patch function (more patch magic in this post).

% I use two anonymous functions to create the x and y input to patch
% This way the Kernel Density Function is generated new for each
% of the three Gaussians
patchup_ksf_y = @(ksf) [ksf,zeros(1,numel(ksf)),ksf(1)]';
patchup_ksf_x = @(ksf) [bins,fliplr(bins),bins(1)]';
% Create a figure without lines or axes
% You can edit here if you wish to preserve any of the plot pieces
figure('color','w'); ax=axes('nextplot','add','xcolor','w','ycolor','w','box','off');
% Create an array that contains three patches
patches = cellfun( @(ksf) patch( patchup_ksf_x(ksf),patchup_ksf_y(ksf),...
'k','EdgeAlpha',0,'facealpha',0.5,'parent',ax ), f )
% A custom palette for the display
palette=lines(3);
% Re-coloring the patches
arrayfun( @(i) set(patches(i),'facecolor',palette(i,:)), [1:3] )

Note that this isn’t the same as fitting a Gaussian to your data. Here’s how you would do that.

gm = arrayfun( @(x) gmdistribution( gmmodels{3}.mu(x), gmmodels{3}.Sigma(x) ), [1:3], 'UniformOutput', false )figure('color','w');
ax = axes('nextplot','add','xlim',[min(bins),max(bins)]);
arrayfun( @(x) plot( bins, (sum(Idx==x)/numel(Idx))*pdf( gm{x}, bins' ) ), [1:3] )

Let’s take a look at the data, the Kernel Density plot, and the Gaussian fits together.

The raw data (left) is clustered using the cluster function and a Gaussian Mixture Model — giving us three colors. These data can then be smoothed over using ksdensity, or alternately, three Gaussian pdfs.

That’s all it takes to produce a nice graphic showing the diversity of your dataset. For another use of the ksdensity function, check out my post on creating a half-violin plot in MATLAB.

--

--