# Bimodal or trimodal data? Try a Gaussian Mixture Model

--

If you are using MATLAB to analyze data containing two or more normally distributed populations, this tutorial could help you sort them out

Let’s begin by creating a hypothetical dataset containing data from three random distributions.

rng(5) % Initialize the random number generatormu = [5,10,15]; % Choose mean and standard deviation values

sigma = [.5,2,3];

Nsamples = [100,100,100]; % Note this %data = cell2mat( arrayfun(@(i) ...

random('Normal',mu(i),sigma(i),Nsamples(i),1), [1:3],...

'UniformOutput', false )' );figure; hist(data)

Creating the figure will get you something that looks like this. It’s hard to tell how many Gaussians you put in there using the automatic binning by MATLAB.

We can use the AIC estimator to check the optimal number of Gaussians in your data. Of course, we know the answer is three for our toy dataset, but this won’t be the case out in the wild.

rng(5)Nguess = 5; % Guess the data has at most 5 Gaussiansoptions = statset(); options.MaxIter = 500;% Fit Gaussian Mixture Models for each of the five models

gmmodels = arrayfun( @(x) fitgmdist( data, x, 'Options', options ),[1:Nguess],...

'UniformOutput', false );% Extract the AIC values for each model

AIC_values = cellfun( @(x) x.AIC, gmmodels );% Plot AIC values to see the number that gives you the smallest AIC

figure; ax=axes('NextPlot','add'); plot( ax, [1:5], AIC_values );[a,b] = min(AIC_values);

plot(b,a,'ro')

This will show you a shallow plateau with a minimum at 3, which is great because that’s how many we put into the artificial data.

Now, we’ll want to try and cluster the data based on the Gaussian Mixture Models we created above using **arrayfun **(if you need to brush up on your **fun**ctions check out my post on Making MATLAB “Fun”).

Idx = cluster(gmmodels{3},data);% We will create artificially spaced bins

Nbins=100;

bins = linspace(0,max(data),Nbins);% Create a Kernel Density Function for Each of the 3 Gaussians

% Note the multiplication by the normalization factor

% (sum(Idx==x))/numel(Idx)

% Entries roughly should match up with Nsamples/sum(Nsamples)

%

f = arrayfun( @(x) (sum(Idx==x)/numel(Idx))*ksdensity( data(Idx==x), bins, 'BandWidth', 1 ), [1:3], 'UniformOutput', false );figure; ax=axes('nextplot','add');arrayfun( @(f_) plot(bins,f_{1}), f );

Now there are two areas here that you may need to adjust when using it with your own custom data.

**(1) Binning.** You may want to try a more principled approach, such as the Freedman-Diaconis rule (it’s very straightforward).

**(2) Bandwidth. **The parameter within the KS density function is arbitrarily set to 1. This is something you can play around with, or check out this formula.

Once you display the result, it should look pretty crisp and a huge improvement visually from the histogram we started out with.

Lastly, you’ll want to get that beautiful aesthetic that’s possible through using the legendary **patch **function (more **patch **magic in this post).

% I use two anonymous functions to create the x and y input to patch

% This way the Kernel Density Function is generated new for each

% of the three Gaussians

patchup_ksf_y = @(ksf) [ksf,zeros(1,numel(ksf)),ksf(1)]';

patchup_ksf_x = @(ksf) [bins,fliplr(bins),bins(1)]';% Create a figure without lines or axes

% You can edit here if you wish to preserve any of the plot pieces

figure('color','w'); ax=axes('nextplot','add','xcolor','w','ycolor','w','box','off');% Create an array that contains three patches

patches = cellfun( @(ksf) patch( patchup_ksf_x(ksf),patchup_ksf_y(ksf),...

'k','EdgeAlpha',0,'facealpha',0.5,'parent',ax ), f )% A custom palette for the display

palette=lines(3);% Re-coloring the patches

arrayfun( @(i) set(patches(i),'facecolor',palette(i,:)), [1:3] )

Note that this isn’t the same as fitting a Gaussian to your data. Here’s how you would do that.

gm = arrayfun( @(x) gmdistribution( gmmodels{3}.mu(x), gmmodels{3}.Sigma(x) ), [1:3], 'UniformOutput', false )figure('color','w');

ax = axes('nextplot','add','xlim',[min(bins),max(bins)]);arrayfun( @(x) plot( bins, (sum(Idx==x)/numel(Idx))*pdf( gm{x}, bins' ) ), [1:3] )

Let’s take a look at the data, the Kernel Density plot, and the Gaussian fits together.

That’s all it takes to produce a nice graphic showing the diversity of your dataset. For another use of the **ksdensity **function, check out my post on creating a half-violin plot in MATLAB.