# Bimodal or trimodal data? Try a Gaussian Mixture Model


If you are using MATLAB to analyze data containing two or more normally distributed populations, this tutorial could help you sort them out.

Let’s begin by creating a hypothetical dataset containing data from three random distributions.

```matlab
rng(5) % Initialize the random number generator
mu = [5,10,15]; % Choose mean and standard deviation values
sigma = [.5,2,3];
Nsamples = [100,100,100]; % Number of samples drawn from each distribution
data = cell2mat( arrayfun( @(i) ...
    random('Normal',mu(i),sigma(i),Nsamples(i),1), [1:3],...
    'UniformOutput', false )' );
figure; hist(data)
```

Creating the figure will get you something that looks like this. With MATLAB’s automatic binning, it’s hard to tell how many Gaussians you put in there.

We can use the Akaike Information Criterion (AIC) to estimate the optimal number of Gaussians in the data. Of course, we know the answer is three for our toy dataset, but this won’t be the case out in the wild.

```matlab
rng(5)
Nguess = 5; % Guess the data has at most 5 Gaussians
options = statset(); options.MaxIter = 500;
% Fit a Gaussian Mixture Model for each of the five candidate sizes
gmmodels = arrayfun( @(x) fitgmdist( data, x, 'Options', options ), [1:Nguess],...
    'UniformOutput', false );
% Extract the AIC value for each model
AIC_values = cellfun( @(x) x.AIC, gmmodels );
% Plot the AIC values to see which number of components gives the smallest AIC
figure; ax = axes('NextPlot','add');
plot( ax, [1:Nguess], AIC_values );
[a,b] = min(AIC_values);
plot( ax, b, a, 'ro' )
```

This will show you a shallow plateau with a minimum at 3, which is great because that’s how many we put into the artificial data.

Now we’ll want to cluster the data based on the Gaussian Mixture Models we created above with arrayfun (if you need to brush up on your functions, check out my post on Making MATLAB “Fun”).

```matlab
Idx = cluster( gmmodels{3}, data );
% We will create artificially spaced bins
Nbins = 100;
bins = linspace( 0, max(data), Nbins );
% Create a kernel density estimate for each of the 3 Gaussians
% Note the multiplication by the normalization factor sum(Idx==x)/numel(Idx)
% These proportions should roughly match Nsamples/sum(Nsamples)
f = arrayfun( @(x) (sum(Idx==x)/numel(Idx)) * ...
    ksdensity( data(Idx==x), bins, 'BandWidth', 1 ), [1:3], 'UniformOutput', false );
figure; ax = axes('NextPlot','add');
cellfun( @(f_) plot( ax, bins, f_ ), f );
```

There are two areas you may need to adjust when using this approach with your own data.

(1) Binning. You may want to try a more principled approach, such as the Freedman-Diaconis rule (it’s very straightforward).

(2) Bandwidth. The 'BandWidth' parameter passed to ksdensity is arbitrarily set to 1 here. This is something you can play around with, or you can compute it from a rule of thumb.
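As a sketch of both adjustments, the Freedman-Diaconis bin width and Silverman’s rule-of-thumb bandwidth can each be computed in a couple of lines. This assumes the `data` and `Idx` variables from the clustering step above; `iqr` ships with the Statistics and Machine Learning Toolbox this tutorial already relies on.

```matlab
% (1) Freedman-Diaconis rule: bin width = 2*IQR*n^(-1/3)
binwidth = 2 * iqr(data) * numel(data)^(-1/3);
Nbins_fd = ceil( (max(data) - min(data)) / binwidth );
bins_fd  = linspace( min(data), max(data), Nbins_fd );

% (2) Silverman's rule of thumb for the bandwidth of one cluster:
% bw = 0.9 * min( std(x), iqr(x)/1.34 ) * n^(-1/5)
x  = data(Idx == 1); % e.g., the first cluster
bw = 0.9 * min( std(x), iqr(x)/1.34 ) * numel(x)^(-1/5);
% then pass it along: ksdensity( x, bins_fd, 'BandWidth', bw )
```

Note that if you omit 'BandWidth' entirely, ksdensity defaults to a width that is theoretically optimal for normal densities, which is often a reasonable starting point for data like ours.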

Once you display the result, it should look pretty crisp: a huge visual improvement over the histogram we started with.

Lastly, you’ll want to get that beautiful aesthetic that’s possible with the legendary patch function (more patch magic in this post).

```matlab
% I use two anonymous functions to create the x and y inputs to patch
% This way each kernel density curve is closed into a polygon, one for
% each of the three Gaussians
patchup_ksf_y = @(ksf) [ksf, zeros(1,numel(ksf)), ksf(1)]';
patchup_ksf_x = @(ksf) [bins, fliplr(bins), bins(1)]';
% Create a figure without lines or axes
% You can edit here if you wish to preserve any of the plot pieces
figure('color','w');
ax = axes('NextPlot','add','XColor','w','YColor','w','Box','off');
% Create an array that contains three patches
patches = cellfun( @(ksf) patch( patchup_ksf_x(ksf), patchup_ksf_y(ksf),...
    'k', 'EdgeAlpha', 0, 'FaceAlpha', 0.5, 'Parent', ax ), f );
% A custom palette for the display
palette = lines(3);
% Re-color the patches
arrayfun( @(i) set( patches(i), 'FaceColor', palette(i,:) ), [1:3] )
```

Note that this isn’t the same as fitting a Gaussian to your data. Here’s how you would do that.

```matlab
gm = arrayfun( @(x) gmdistribution( gmmodels{3}.mu(x), gmmodels{3}.Sigma(x) ),...
    [1:3], 'UniformOutput', false );
figure('color','w');
ax = axes('NextPlot','add','XLim',[min(bins),max(bins)]);
arrayfun( @(x) plot( ax, bins, (sum(Idx==x)/numel(Idx)) * pdf( gm{x}, bins' ) ), [1:3] )
```

Let’s take a look at the data, the Kernel Density plot, and the Gaussian fits together.

That’s all it takes to produce a nice graphic showing the diversity of your dataset. For another use of the ksdensity function, check out my post on creating a half-violin plot in MATLAB.