A Brief Introduction: Data Modelling through LS and MLE (The Implementation)

Patrick Sandoval Romero
8 min read · Dec 9, 2023

This is the continuation of my first post, which covered the theory of least-squares and maximum-likelihood estimation. In this post I will set up a situation where we can implement both techniques and compare them to each other. I strongly recommend checking out the previous post before this one, as it covers important concepts behind each algorithm.

Generating Data and Choosing a Model

First off, we need to generate some synthetic data. This will allow us to know the true distribution of the data, so that once we determine the optimal parameters we can compare them to the true values.

import numpy as np

# Generate synthetic data
rng = np.random.default_rng(seed=42)
X = np.sort(rng.uniform(0, 5, 30))
A, B, C, D = rng.random(4)                # true parameters of the data
y = A*np.sin(X*B + C) + D + rng.normal(0, 0.1, len(X))
yerr = rng.random(len(X))*0.1             # per-point uncertainties (drawn from the same seeded generator)

Here we are setting the seed of the generator to 42 in order to have reproducible results. I randomly generate the true parameters of our data and store them in the variables A, B, C, and D. Additionally, I picked the data to follow a sinusoidal function, and I added noise to the data. Lastly, I generated some uncertainties for each data point. We can take an initial look at our generated data.
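The figure itself is not reproduced here, but a minimal sketch of how to take that first look (assuming matplotlib is available) could be:

import matplotlib.pyplot as plt

# Quick look at the synthetic data with its error bars
fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar(X, y, yerr=yerr, fmt='k.', capsize=2, label='Synthetic data')
ax.set_xlabel('X', fontsize=16)
ax.set_ylabel('Y', fontsize=16)
ax.legend()
plt.show()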

Now we must select a model that we believe represents the data, or at least describes the behavior we are observing. High-order polynomials are popular fitting models because they can fit almost any data set. However, they are not ideal, since they often lead to overfitting, loss of generalization, numerical instabilities, or extrapolation issues, to name a few problems. So it’s usually best to stay with simpler, more general models, which is why we pick a sinusoidal function here; either sine or cosine would work.

For this specific model, the parameter vector used by both techniques is shown below.
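Written out explicitly, matching the model function defined in the code further down, the model and its parameter vector are

$$ m(x;\,\theta) = A\,\sin(\omega x + \varphi) + C, \qquad \theta = (A,\ \omega,\ \varphi,\ C). $$

(In the MLE section this vector is extended with an extra nuisance parameter, log f, that accounts for uncertainty in the model itself.)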

Implementing Non-Linear Least Squares

For this section we will use ‘scipy.optimize.curve_fit()’ to perform the non-linear least squares routine. First, we need to define our model as a function in Python.

from scipy.optimize import curve_fit

# Define model function for curve_fit
def f(x, A, omega, phi, C):
    return A*np.sin(x*omega + phi) + C

Because the model is non-linear in its parameters, the solution curve_fit converges to depends on where it starts. To reduce this variation in the optimal parameters, we provide the function with an initial guess, which I randomly generate using a different seed than before.

# Define initial parameters 
guess_rng = np.random.default_rng(seed=1)
p0 = guess_rng.random(4)

Now we call curve_fit with the model function, the x-data, the y-data, the y-errors, and the initial guesses as follows.

popt, pcov = curve_fit(f, X, y, sigma=yerr, p0=p0,
                       absolute_sigma=True, maxfev=2000)

Here popt is an array that contains the optimal parameters, and pcov is the covariance matrix of these parameters. This matrix tells us how the parameter estimates vary with respect to one another, and its diagonal entries are the variances of the individual parameters. Furthermore, ‘absolute_sigma=True’ tells curve_fit to treat ‘sigma’ as absolute uncertainties (rather than relative weights), and ‘maxfev’ specifies the maximum number of calls to the function.

We see that the covariances between the parameters are very low, which means there is no strong association between any two parameters. Additionally, we can calculate the standard deviation of each individual parameter by taking the square root of the diagonal entries of the covariance matrix; remember that the diagonal entries are the parameter variances.
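As a quick sketch of that last step, the 1-sigma uncertainties can be read off the covariance matrix like this:

# 1-sigma uncertainties from the diagonal of the covariance matrix
perr = np.sqrt(np.diag(pcov))
for name, val, err in zip(['A', 'omega', 'phi', 'C'], popt, perr):
    print(f"{name} = {val:.3f} +/- {err:.3f}")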

We can plot the prediction from curve_fit on top of our data and also see how it compares to the true distribution.
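The original figure is not reproduced here, but a minimal sketch of such a two-panel plot (fit on top, residuals below) could look like the following; the fine grid X_fine and the residual convention (prediction minus data) are my assumptions, not taken from the original post.

# Sketch: model fit on top of the data, with residuals in a bottom panel
X_fine = np.linspace(0, 5, 200)   # fine grid for smooth curves (assumed)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 7), sharex=True,
                               gridspec_kw={'height_ratios': [3, 1]})
ax1.errorbar(X, y, yerr=yerr, fmt='k.', label='Data')
ax1.plot(X_fine, A*np.sin(X_fine*B + C) + D, color='gray', lw=3, alpha=0.5, label='True')
ax1.plot(X_fine, f(X_fine, *popt), 'g--', label='Least-Squares')
ax1.set_ylabel('Y', fontsize=16)
ax1.legend()
ax2.errorbar(X, f(X, *popt) - y, yerr=yerr, fmt='k.')   # residuals: prediction - data
ax2.axhline(0, color='g', ls='--')
ax2.set_xlabel('X', fontsize=16)
ax2.set_ylabel('Residuals', fontsize=16)
plt.show()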

We see that LS does a decent job of fitting the data; however, as the residual plot (bottom panel) shows, the prediction overestimates the observed data by roughly 0.1.

The best fit parameters from curve_fit are:

Implementing MLE through MCMC

As covered in the previous post, our objective is to sample the posterior distribution. To achieve this, we need to define a few important functions: the log-likelihood function, the log-prior function, and of course the model function itself.

# Define model function
def Model(x, theta):
    A, omega, phi, C, log_f = theta
    return A*np.sin(x*omega + phi) + C

# Define log likelihood function
def log_likelihood(theta, x, y, yerr):
    '''This definition of the log-likelihood function
    incorporates the fractional error from the model.'''
    log_f = theta[-1]
    model = Model(x, theta)
    sigma2 = yerr**2 + np.exp(2 * log_f) * model**2
    return -0.5 * np.sum((y - model) ** 2 / sigma2 + np.log(sigma2))

# Define log prior (uniform over each model parameter; log_f is left unconstrained)
def log_prior(theta):
    A, omega, phi, C, log_f = theta
    if 0 <= phi < 1.1 and 0 <= omega < 1.1 and 0 <= A < 1.1 and 0 <= C < 1.1:
        return 0.0
    return -np.inf

# Define log posterior to sample from
def log_posterior(theta, x, y, yerr):
    return log_prior(theta) + log_likelihood(theta, x, y, yerr)

There are a couple of important comments that need to be addressed before we move forward.

There are a couple of ways we can define the log-likelihood function; however, I find that the form above is the most robust. This likelihood accounts not only for the errors in the data but also for errors intrinsic to the model. The term ‘sigma2’ represents the total variance: the uncertainty in the data (yerr**2) plus the uncertainty in the model (np.exp(2 * log_f) * model**2).
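For reference, the quantity this function computes is

$$ \ln \mathcal{L}(\theta) = -\frac{1}{2}\sum_{n}\left[\frac{\left(y_n - m_n(\theta)\right)^2}{s_n^2} + \ln s_n^2\right], \qquad s_n^2 = \sigma_n^2 + e^{2\ln f}\, m_n(\theta)^2, $$

where m_n(θ) is the model evaluated at x_n and σ_n is the corresponding entry of yerr. The additive constant involving ln 2π is dropped in the code, since it does not depend on θ and so does not affect the optimization or the sampling.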

Additionally, the prior is defined to be uniform over a specific range for each parameter. Essentially, we are telling the algorithm that parameters outside those ranges have zero prior probability. (In the definition above, log f itself is left unconstrained, i.e. it has a flat prior over the whole real line.)

Lastly, the log-posterior function is defined as the sum of the log prior and the log likelihood. If this does not make too much sense, I would suggest revisiting the last equation of my previous post.
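In symbols, and up to a normalization constant that the sampler never needs,

$$ \ln p(\theta \mid y) = \ln p(\theta) + \ln \mathcal{L}(\theta) + \text{const}. $$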

With everything in place, all we have to do is set up the ensemble sampler by defining the number of walkers and the number of parameters. I will be using the ‘emcee’ library to run the MCMC algorithm.

import emcee

# We define the initial parameters the same as for LS,
# but we add a guess for log f. In this case we guess -1.
initial_params = np.append(p0, -1)

# Set up the EnsembleSampler
nwalkers, ndim = 60, len(initial_params)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(X, y, yerr))

It’s always recommended to run a burn-in before starting the actual production. What this really means is that we let the walkers take a short number of steps in parameter space and record their last positions, which we then use as the starting positions for the production run. This essentially helps our walkers start off in good regions of parameter space in case our initial guesses are not the best.

# Run initial burn-in: start the walkers in a small ball around the initial guess
print("Running burn-in")
start = initial_params + np.random.rand(nwalkers, ndim) * 0.1
p0, _, _ = sampler.run_mcmc(start, 100, progress=True)
sampler.reset()

# Run the MCMC sampling
print("Running production")
nsteps = 2000
pos, _, _ = sampler.run_mcmc(p0, nsteps, progress=True)
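Before trusting the chain, it is worth checking a couple of basic diagnostics; the discard and thin values used for the corner plot below are informed by this kind of check. This is only a sketch using emcee’s built-in helpers, and the exact numbers will depend on your run.

# Basic sanity checks on the chain
print("Mean acceptance fraction:", np.mean(sampler.acceptance_fraction))
try:
    tau = sampler.get_autocorr_time()
    print("Autocorrelation times:", tau)
except emcee.autocorr.AutocorrError:
    # Raised when the chain is too short for a reliable estimate
    print("Chain may be too short to estimate the autocorrelation time")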

We can visualize the posterior distribution of the parameters with a corner plot as follows.

import corner
import matplotlib.pyplot as plt

# Get the flattened chain (discard burn-in steps and thin the samples)
flat_samples = sampler.get_chain(discard=100, thin=15, flat=True)
# Define labels for the corner plot
labels = ['A', r'$\omega$', r'$\Phi$', 'C', 'log f']

# Plot corner plot
fig = corner.corner(flat_samples, labels=labels, quantiles=[0.16, 0.5, 0.84],
                    show_titles=True)
plt.show()

What’s useful about corner plots is that they show the marginal distribution of each individual parameter (the top panel of each column), as well as how the parameters relate to one another. They also provide confidence ranges for your parameters.

In our situation, we see that all our parameters have clear peaks, which indicates there are no parameter degeneracies in our model. We notice small bimodalities for the phase shift and for the model-uncertainty parameter, but there is a clear winner between those peaks. Additionally, we see that no parameters are strongly correlated with each other, consistent with the covariance matrix from the least-squares routine.

Lastly, we can plot the predicted model by taking the parameters with the highest likelihood and using a sample of walkers to plot confidence bands, as follows.

def sample_walkers(domain, nsamples, flattened_chain):
    '''Returns the median and 1-sigma spread of the predicted y
    by sampling (nsamples) walkers from the flattened chain.'''
    models = []
    draw = np.floor(np.random.uniform(0, len(flattened_chain), size=nsamples)).astype(int)
    thetas = flattened_chain[draw]
    for theta in thetas:
        mod = Model(domain, theta)
        models.append(mod)
    spread = np.std(models, axis=0)
    med_model = np.median(models, axis=0)
    return med_model, spread

# Function to plot the predicted lines
def plot_predicted_lines(sampler, x_range, med, spr, ax):
    '''Plots the highest-likelihood prediction and the 1-, 2- and 3-sigma
    confidence bands.'''
    samples = sampler.flatchain
    theta_max = np.array(samples[np.argmax(sampler.flatlnprobability)])
    best_pred = Model(x_range, theta_max)
    ax.plot(x_range, best_pred, color='k', ls=':', label='HL Parameters')
    ax.fill_between(x_range, med - spr, med + spr, color='r', alpha=0.2)
    ax.fill_between(x_range, med - 2*spr, med + 2*spr, color='r', alpha=0.2)
    ax.fill_between(x_range, med - 3*spr, med + 3*spr, color='r', alpha=0.2)

These are two helper functions I use to plot the predictions of MCMC.

# Fine grid of x values for plotting smooth curves
X_fine = np.linspace(0, 5, 200)

# Collect the median and 1-sigma spread of the model
Med, Spr = sample_walkers(X_fine, 500, flat_samples)

# Plot the predicted lines
fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar(X, y, yerr=yerr, fmt='k.')
ax.plot(X_fine, A*np.sin(X_fine*B + C) + D, color='gray', linewidth=3, label='True', alpha=0.5)
plot_predicted_lines(sampler, X_fine, Med, Spr, ax)
ax.plot(X_fine, f(X_fine, *popt), ls='--', c='g', label='Least-Squares')
ax.set_xlabel('X', fontsize=16)
ax.set_ylabel('Y', fontsize=16)
ax.set_title('Posterior Predictive Plots', fontsize=18)
ax.legend()
plt.show()

Comparing the predictive plots between LS and MLE, we see that the latter captures the true distribution of our data better. However, it’s important to highlight that MCMC should not primarily be used for finding optimal parameters; we have optimizers for that. MCMC in its essence is just a sampler, useful for parameter marginalization and for further exploration of parameter space in the case of degeneracies. We could have achieved an equally good fit by minimizing the negative log-likelihood function; however, that would not provide further insight into our parameters.

The best-fit parameters from the MCMC MLE technique are:
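The values themselves were shown as a figure in the original post; as a sketch, this is how one could pull the median and 16th/84th-percentile uncertainties out of the flattened chain:

# Report the median and the 16th/84th-percentile spread for each parameter
for i, name in enumerate(labels):
    lo, mid, hi = np.percentile(flat_samples[:, i], [16, 50, 84])
    print(f"{name}: {mid:.3f} (+{hi - mid:.3f} / -{mid - lo:.3f})")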

Now the moment of truth has come! Let’s see what the true values of these parameters are.

We see that MLE was closer to the true value for 3 out of 4 parameters in our model.
