Error in Astronomy or

how I learned to stop worrying and love the uncertainty

Joost VanderBorgh
nieuwsgierigheid
8 min read · Jul 25, 2019


These are my summarized notes from Andrae, Oser, Wittman, and Ly (their works are in the bibliography below): all the information and knowledge come from there! I implore you to read through them!

Scientists measure data and make inferences about their observations. To be technical: parameters are estimated from the data, and inferences are then drawn from those estimates.

You are a scientist and you want to plot the best-fit line through some data. Your boss asks you how certain you are of your best-fit line:

What error estimate should you assign to the slope and intercept of this fit? How do you approach this problem?

You are not given the error bars. You are not told of the distribution of the errors.

By the end of the article, you will be able to tell your boss how!

Part I: The Nature of Uncertainty

Where does uncertainty come from? Random errors arise when data are measured, and systematic errors can be introduced by the model (for example, through overfitting).

Terminology

A parameter is denoted by θ, whereas the estimate of that parameter is written with a hat, θ̂ (theta-hat).


For example, take the apparent magnitude of stars or galaxies: from an image, one first estimates the flux (the parameter: the amount of “stuff” flowing through an area), and only then can one infer the magnitude. A parameter is a stepping stone to an inference.

The Poisson distribution describes the behavior of many probability distributions in astronomy. For example, in photometry or spectroscopy, where the physical measurement process is counting photons in a given pixel, the photon counting is assumed to be a Poisson process that follows a corresponding Poisson distribution.

If you know the mean number of photons µ in a given pixel, the probability of measuring n photons in that pixel is:

P(n | µ) = (µ^n / n!) · e^(−µ)

Changing the mean value µ alters the Poisson distribution. One applied task is to estimate the mean given the number of photons counted in the pixel.

Notice the asymmetry of the Poisson distribution :)

If you increase the number of measurements (of photon counts, for example), the expected value µ also increases: this improves the signal-to-noise ratio, and the Poisson distribution approaches a Gaussian distribution.
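Concretely, for large µ the Poisson distribution is well approximated by a Gaussian with matching mean and variance (a standard limit, stated here for reference, not something extra from the notes):

$$ P(n \mid \mu) \approx \frac{1}{\sqrt{2\pi\mu}}\, e^{-(n-\mu)^2 / (2\mu)} $$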

Parameter Estimation

In optimization (or “fitting”) problems, the goal is to minimize the residuals or, more precisely, to maximize the likelihood function.

For a given set of measured data D and a model M with parameters θ, the likelihood function is defined as:

L(θ) = P(D | M, θ)

In words: the likelihood is the probability of measuring the data D we have, given the model M with parameters θ.

MLE (maximum-likelihood estimation) means choosing the parameters θ that make our measured data the most likely outcome. The set of parameters that does so is called the maximum-likelihood estimate and is denoted:

θ̂ (theta-hat, again!)

In Bayesian statistics, the goal is not to maximize the likelihood function as above, but to maximize the posterior probability.
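For reference (this is just standard Bayes' theorem, added here for context), the posterior is the likelihood weighted by a prior:

$$ P(\theta \mid D, M) = \frac{P(D \mid M, \theta)\, P(\theta \mid M)}{P(D \mid M)} $$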

Central Limit Theorem

This tells us that, given independent and identically distributed data, any likelihood function is asymptotically Gaussian near its maximum. Taylor-expanding log L(θ) around the maximum θ_max, the first-order term vanishes because the gradient is zero at the maximum; what is left is a quadratic form in θ near the optimum, i.e. a Gaussian shape.
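Written out (a standard expansion, with F denoting minus the matrix of second derivatives of log L at the maximum):

$$ \log L(\theta) \approx \log L(\theta_{\max}) - \tfrac{1}{2}\,(\theta - \theta_{\max})^{T} F\, (\theta - \theta_{\max}) $$

which is exactly the logarithm of a Gaussian centred on θ_max.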

Part II: Error Estimation for Model-Based Parameters

Brute Force Grids

In summary, this simple method evaluates the likelihood function at every point of a rectangular grid in parameter space. The only assumption this method makes is that the error distribution of the measured data is known correctly.

For the example data, brute-force grids estimate the error: panel (a) assumes a Poisson error distribution, whereas panel (b) assumes a Gaussian error distribution. Both likelihood functions L(µ) (blue curves) peak at µ = 10.07.
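As an illustration (a minimal sketch of the method, not the original figure's code; the photon counts and grid range here are made up), a brute-force grid over a single Poisson mean µ might look like this in Python:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical photon counts in a pixel (made-up data, for illustration only)
counts = np.array([8, 12, 9, 11, 10, 7, 13, 10])

# Brute-force grid: evaluate the likelihood at every point of a grid over mu
mu_grid = np.linspace(5.0, 15.0, 1001)
dmu = mu_grid[1] - mu_grid[0]
log_like = np.array([poisson.logpmf(counts, mu).sum() for mu in mu_grid])

like = np.exp(log_like - log_like.max())   # subtract the max for numerical stability
like /= like.sum() * dmu                   # normalize to unit area

mu_hat = mu_grid[np.argmax(like)]

# Read off a 68% interval directly from the gridded likelihood
cdf = np.cumsum(like) * dmu
lo, hi = np.interp([0.16, 0.84], cdf, mu_grid)
print(f"mu_hat = {mu_hat:.2f}, 68% interval = [{lo:.2f}, {hi:.2f}]")
```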

Varying χ²

Assuming the error distribution of the measured data is Gaussian, one can vary the model parameters around the minimum of χ². Mathematically, one examines where χ² = χ²_min + 1, which defines the error estimate of the model parameters.

The minimum χ² occurs at µ̂ = 10.09, where χ² = 12.4; adding unity to this minimum value (= 13.4) gives the second red dotted line. Reading off where χ² crosses this line, the resulting error estimate is 0.57.
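A sketch of the same procedure in Python, assuming Gaussian errors with a known σ on a handful of made-up measurements:

```python
import numpy as np

# Made-up measurements of a constant signal, with Gaussian errors of known sigma
y = np.array([9.8, 10.5, 9.2, 11.1, 10.4, 9.7])
sigma = 0.8

mu_grid = np.linspace(8.0, 12.0, 2001)
chi2 = np.array([np.sum(((y - mu) / sigma) ** 2) for mu in mu_grid])

i_min = np.argmin(chi2)
mu_hat, chi2_min = mu_grid[i_min], chi2[i_min]

# 1-sigma interval: where chi^2 has risen to chi2_min + 1
inside = mu_grid[chi2 <= chi2_min + 1.0]
err_lo, err_hi = mu_hat - inside.min(), inside.max() - mu_hat
print(f"mu_hat = {mu_hat:.2f} -{err_lo:.2f}/+{err_hi:.2f}")
```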

Fisher Matrix

The Fisher Matrix estimates errors under the assumption that one has fitted the model parameters by maximizing the logarithmic likelihood (which, in the case of Gaussian errors, means minimizing χ²). This method is founded upon the central limit theorem, which says that any i.i.d. likelihood function is asymptotically Gaussian near its maximum.

The Fisher Matrix allows one to engage in experimental design. It also allows one to understand the covariance between the estimates of the parameters: the diagonal of the covariance matrix contains the variance estimate of each parameter.

It assumes:

  1. The error distribution of measurements is known (i.e. the likelihood function is defined correctly)
  2. The second-order Taylor expansion is a good approximation

The Fisher Matrix, F, is as follows: for N model parameters p_1, p_2, …, p_N, F is an N × N symmetric matrix.

In this matrix, each element is a sum over observables (the things we measure!). Say there are B observables (f_1, f_2, …, f_B), each related to the model parameters by some function f. As a reminder, this is why the Fisher Matrix is a way to estimate errors for model-dependent parameters.

Each observable is related to the model parameters by f_b = f_b(p_1, p_2, …, p_N); the elements of the Fisher Matrix are then:

F_ij = Σ_b (1/σ_b²) (∂f_b/∂p_i)(∂f_b/∂p_j)

where σ_b is the measurement error on observable f_b.

For example, suppose hypothetically that you are analyzing families in the mythical land of Hillary-and-Bob-land, where families follow these rules:

The first child is called Hillary (thus, in families with an only child, there is only Hillary)
If there is another child, then the second child is called Bob

The resulting Fisher matrix for this case becomes (where h stands for Hillary and b for Bob):

Inverting this 2×2 matrix gives:

This is the covariance matrix. Bare-bones as this example is, it illustrates the following:

There are two tests to see whether this covariance matrix is valid. First, the determinant of the covariance matrix should be non-negative. Second, if you diagonalize the covariance matrix to determine its eigenvalues, then any eigenvalue that is not positive means the covariance matrix is not valid.
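As a generic sketch of the recipe (my own toy straight-line model with known errors σ_b, not the Hillary-and-Bob example), one can build the Fisher Matrix numerically, invert it, and run both validity checks:

```python
import numpy as np

# Hypothetical model: observables f_b = p1 + p2 * x_b, measured with known sigma_b
x = np.array([0.0, 1.0, 2.0, 3.0])
sigma = np.array([0.5, 0.5, 0.5, 0.5])

# Partial derivatives of each observable with respect to (p1, p2):
# df_b/dp1 = 1, df_b/dp2 = x_b
derivs = np.stack([np.ones_like(x), x], axis=1)   # shape (B, N)

# F_ij = sum_b (1/sigma_b^2) * (df_b/dp_i) * (df_b/dp_j)
F = (derivs.T / sigma**2) @ derivs

# The covariance matrix is the inverse of the Fisher matrix
cov = np.linalg.inv(F)

# Validity checks: non-negative determinant and positive eigenvalues
print("det(cov) =", np.linalg.det(cov))
print("eigenvalues:", np.linalg.eigvalsh(cov))
print("parameter errors:", np.sqrt(np.diag(cov)))
```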

Monte Carlo methods

Monte Carlo methods are the most intuitive way to estimate errors because they make a minimal number of assumptions; the only assumption is that the error distribution of the measured data is known correctly (i.e. that we are certain about the likelihood function).

Monte Carlo methods draw candidate samples and accept or reject them according to the probability given by the corresponding value of the likelihood function, so that the accepted samples are distributed like the likelihood itself.

In choosing a Monte Carlo method, the number of model parameters matters. If it is small (say one or two), we can use uniform sampling, importance sampling, or rejection sampling. If the number of parameters is large, one can use a Markov chain Monte Carlo (MCMC) algorithm.
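A minimal Metropolis MCMC sketch in Python for the one-parameter Poisson example; the data, step size, and chain length are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)
counts = np.array([8, 12, 9, 11, 10, 7, 13, 10])   # made-up photon counts

def log_like(mu):
    # Log-likelihood of the Poisson counts; -inf for non-physical mu
    return poisson.logpmf(counts, mu).sum() if mu > 0 else -np.inf

# Metropolis algorithm: propose a step, accept with probability min(1, L'/L)
mu, chain = 10.0, []
for _ in range(20000):
    prop = mu + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_like(prop) - log_like(mu):
        mu = prop
    chain.append(mu)

chain = np.array(chain[2000:])                      # discard burn-in
print(f"mu = {chain.mean():.2f} +/- {chain.std():.2f}")
```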

Part III: Error Estimation for Model-Independent Parameters

Error estimation for model-dependent parameters involves an optimization problem, which can be complicated. Model-independent parameters do not usually involve any optimization problem and are popular in astronomy :)

Resampling Data

For this, the only assumption made is that the correct error distribution of the measured data is known. For each data point x_n, a new value is drawn from a Gaussian with mean x_n and standard deviation σ_n. In essence, this simulates repeated measurements of the data. The result will differ slightly from the actual data, but the distribution of the re-estimated parameter (after resampling at least 100 times or so) allows one to infer the error.

This provides only an upper limit on the uncertainty, because the measured data are themselves already one noisy realization of the true values: resampling adds the measurement noise a second time, so the scatter of the resampled estimates overestimates the true error.
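A short Python sketch of the resampling idea, using made-up measurements and a weighted mean as the (stand-in) parameter of interest:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([10.2, 9.7, 10.9, 9.4, 10.5])      # made-up measurements
sigma = np.array([0.4, 0.5, 0.6, 0.4, 0.5])     # their (assumed known) errors

def estimate(values):
    # The parameter of interest: here simply a weighted mean, for illustration
    return np.average(values, weights=1.0 / sigma**2)

# Resample each point from a Gaussian with mean x_n and width sigma_n
estimates = []
for _ in range(1000):
    resampled = rng.normal(x, sigma)
    estimates.append(estimate(resampled))

print(f"estimate = {estimate(x):.2f}, resampling scatter = {np.std(estimates):.2f}")
```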

Bootstrapping

In the bootstrap method, one can calculate the distribution of the errors from the data itself, and then use it to calculate the errors on the fit.

You start with a set of N independent and identically distributed observations, X, and estimate some parameter θ(X).

Then, you make a new data set by selecting N random observations, with replacement. Some get picked more than once and some do not get picked at all. This new data set is X’.

For this new data set, you calculate a new parameter, θ(X’)

Repeat this at least 100 times.

The width of the distribution of the θ values calculated from the resampled data sets is the error on θ.

This simple process has its benefits:

  1. You do not need to generate a large Monte Carlo data set
  2. You do not need to know the true underlying error distribution
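Putting the recipe together, here is a compact Python sketch with the sample mean standing in for θ and made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(10.0, 2.0, size=200)          # made-up i.i.d. observations

n_boot = 1000
theta_boot = np.empty(n_boot)
for i in range(n_boot):
    # Draw N observations *with replacement* from the original data
    resample = rng.choice(data, size=data.size, replace=True)
    theta_boot[i] = resample.mean()             # recompute the parameter

# The width of the bootstrap distribution is the error estimate on theta
print(f"theta = {data.mean():.2f} +/- {theta_boot.std():.2f}")
```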

To answer the question from the beginning, you first calculate the residuals (how far the data points lie from the best-fit line).

Then you generate new bootstrap data sets by drawing residuals with replacement and adding them back onto the best-fit line.
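A minimal sketch of that residual bootstrap in Python (the 30 synthetic data points and the true line are made up; they merely stand in for the article's data set):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)                       # 30 points, as in the example
y = 1.5 * x + 3.0 + rng.normal(0, 1.0, x.size)   # synthetic data for illustration

# Best fit and residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Bootstrap: add resampled residuals back onto the best-fit line and refit
boot = []
for _ in range(1000):
    y_new = slope * x + intercept + rng.choice(residuals, size=x.size, replace=True)
    boot.append(np.polyfit(x, y_new, 1))
boot = np.array(boot)

print(f"slope     = {slope:.2f} +/- {boot[:, 0].std():.2f}")
print(f"intercept = {intercept:.2f} +/- {boot[:, 1].std():.2f}")
```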

This generates a distribution of bootstrapped slopes and intercepts, and the width of that distribution is the bootstrap error estimate. Running a full Monte Carlo analysis on the same data set gives somewhat different error estimates. The discrepancy is driven by the lack of a big data set: with only 30 data points, the bootstrap is not completely effective.

When not to use a bootstrap

Thus, there are instances when not to use a bootstrap:

  1. Small sample size (N<50)
  2. Distributions with infinite moments
  3. Using bootstraps to estimate extreme values

Part IV: Summary

In this article, we defined key terminology and summarized the two broad approaches to error estimation, for model-based and model-independent parameters, covering methods from the Fisher Matrix to the bootstrap.
