Describing your statistical data

Specialist Library Support
8 min read · Sep 4, 2019

This resource is no longer being updated

Introduction

We encounter statistics in all areas of our life and it is important that we understand how statistics extract meaning from data. This post will explore the methods and descriptives you can use to describe your data.

Before you continue with this post, you might first find it useful to read ‘Introduction to statistics’. It will give you a basic introduction to statistical terminology, charts and data types.

Describing your data

The main goal of any statistics is to condense data into easy-to-understand summaries. Being able to describe your data effectively will open up your data set to a wider audience, and will also allow you to conduct analysis more easily. This post will explore:

  • normal distribution
  • null and research hypotheses
  • independent and dependent variables
  • independent and repeated measures
  • one-sided and two-sided tests
  • significance and p-values
  • parametric and nonparametric tests
  • confidence intervals


Normal distribution

A normal distribution is a type of data distribution, sometimes referred to as a bell curve (after its shape) or Gaussian distribution, and is one of the easiest to work with mathematically. Many variables are normally distributed.

A data set that has a normal distribution can be described by two numbers:

  • mean: measure of location of the centre of the curve
  • standard deviation: measure of variation; the width of the curve

Here you can see a data set which has a normal distribution.

The variable on the horizontal (x) axis might be ‘height in centimetres’, and the vertical (y) axis ‘number of people’. If the height distribution for a sample of people looked like this chart, then we can say that the sample is normally distributed.


Null hypothesis and research hypothesis

A hypothesis is a proposed explanation for something which you want to test. When you are conducting a statistical test you will be addressing a hypothesis which you have proposed.

Null hypothesis

A null hypothesis is a hypothesis that there is no relationship between the parameters or groups you are investigating. It is generally assumed to be true until proven otherwise, and is often a commonly-accepted fact or common knowledge.

By conducting a test, you are looking to reject (or nullify/disprove) the null hypothesis. For example, in a questionnaire on commutes, we could look at distance to work, and having a valid driving licence. The null hypothesis could be that the mean distance travelled to work is the same for those with and without a valid driving licence, i.e. that there is no relationship.

Research hypothesis

The research hypothesis is the idea that we are looking to prove or disprove. For example, continuing with the commuting survey, the research hypothesis could be that the mean distance travelled to work is different for those with and without a valid driving licence.

In practice

In practice, it is difficult to prove a research hypothesis. In this case, we look to reject the null hypothesis. For example, our research hypothesis is that there is a difference in mean distance travelled to work for those who have and do not have a valid driving licence. We cannot prove this, as we do not have data on all commuters. However, we can survey a sample of commuters, and attempt to make inferences from this sample to the underlying population. In this example, if we found a significant difference in mean distance travelled between those with and without a valid driving licence, we could reject the null hypothesis (that there would be no difference).
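The logic above can be sketched with a simple permutation test, one way to compare two group means. The commute distances below are invented purely for illustration; a real analysis would use the actual survey data.

```python
import random
from statistics import mean

# Hypothetical commute distances in km (illustrative values only).
with_licence = [18, 25, 30, 22, 40, 35, 28, 33, 26, 31]
without_licence = [12, 9, 15, 11, 8, 14, 10, 13, 16, 7]

observed = mean(with_licence) - mean(without_licence)

# Permutation test: under the null hypothesis (no difference between the
# groups), the group labels are interchangeable, so we shuffle them and
# count how often a difference at least as large arises by chance.
random.seed(0)
pooled = with_licence + without_licence
n = len(with_licence)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:n]) - mean(pooled[n:])
    if abs(diff) >= abs(observed):  # two-sided: difference in either direction
        count += 1
p_value = count / trials
print(f"observed difference: {observed:.1f} km, p = {p_value:.4f}")
```

A small p-value here would lead us to reject the null hypothesis that the mean distances are the same.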


Independent and dependent variables

An independent variable is a variable that is being manipulated in an experiment or test. The outcome of that test or experiment is the dependent variable.

For example, if you throw a ball and measure how far it has travelled, the force with which you throw the ball is the independent variable. The dependent variable is the distance the ball travels as it is directly affected by the force of the throw.

Example chart of plotting independent and dependent variables.

The chart above shows independent and dependent variables. If you plot these two variables, the independent variable (force) conventionally goes on the horizontal (x) axis, and the dependent variable (distance) on the vertical (y) axis.


Independent and repeated measures

Independent (or unrelated) measures — not to be confused with independent variables — are readings that are taken from unique individuals. Repeated (or related) measures are readings gathered from the same individual(s) but at different times.

For example, if you took the same person’s blood pressure once a week for 4 weeks, you would be gathering a repeated measure of their blood pressure. If you used different individuals each week, you would have recorded independent measures.

Illustration of how Independent and repeated measures differ.


One-sided and two-sided tests

Sometimes you are looking to see if the value of a sample (such as a mean or proportion) is higher or lower than that of the population.

A one-sided test is used when the direction of the difference or relationship can be specified in advance. Otherwise, two-sided tests are used. For example, for the hypothesis that mean distance travelled to work is greater for those with a valid driving licence, we would use a one-sided test (because we are making an assumption that drivers travel further to work, perhaps because they are able to work further from home).

Example of one-sided and two sided tests plotted on a bell curve.

For the hypothesis that the mean distance travelled is different between these groups, we would use a two-sided test, because we are testing for difference (in either direction).
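The practical difference between the two can be seen numerically. For a test statistic compared against the standard normal distribution, the two-sided p-value is twice the one-sided tail probability; the z value below is made up for illustration.

```python
from math import erf, sqrt

def normal_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * (1 - erf(z / sqrt(2)))

z = 1.8  # hypothetical standardised test statistic
p_one_sided = normal_tail(z)           # tests 'greater than' only
p_two_sided = 2 * normal_tail(abs(z))  # tests difference in either direction
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

With this z value, the one-sided p-value falls below 0.05 while the two-sided one does not, which is why the choice of test must be justified before looking at the data.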


Significance and p-values

Imagine you suspect a coin is biased. You toss it twice and it lands on ‘heads’ both times. You could say that in your experiment, the coin landed on ‘heads’ 100% of the time (for two throws). Does this significantly support a hypothesis that the coin is biased? Or could your observations be merely down to chance?

If the null hypothesis (the coin is not biased) were true, there would be a 25% chance we would see this result (two heads out of two tosses) anyway.

The probability of getting the observed result (or more extreme) if the null hypothesis is true is called the p-value, and is a measure of statistical significance. P-values are written as decimals. For the above example, p=0.25, which is not considered significant.

If we tossed the coin five times, the p-value for it landing on heads at least four times is 0.1875. The most usual statistical significance level is p ≤ 0.05, meaning at most a 5% chance of getting our result (or more extreme) if the null hypothesis is true. If we want to be stricter, we can use a lower significance level, for example 0.01 (a 1% chance of wrongly rejecting the null hypothesis). In some software packages, p-values less than 0.001 may appear as zero but should be written up as p<0.001.
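The coin-toss p-values quoted above can be reproduced directly from the binomial distribution. This sketch uses only the standard library (`math.comb` requires Python 3.8+).

```python
from math import comb

def binom_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability of at least k
    heads in n tosses if the null hypothesis (a fair coin) is true."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(binom_p_at_least(2, 2))  # two heads in two tosses: 0.25
print(binom_p_at_least(4, 5))  # at least four heads in five tosses: 0.1875
```

Neither result reaches the usual p ≤ 0.05 threshold, so with so few tosses we cannot reject the null hypothesis that the coin is fair.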


Parametric and nonparametric tests

A parametric test is a type of test which makes assumptions about the parameters (defining properties) of the data in order to be more powerful. In contrast, a non-parametric test makes no such assumptions.

If a variable has scale (ratio or interval) data and an approximately normal distribution, you may use a parametric test. In all other cases, including ordinal and nominal variables, choose a nonparametric test.

Look at the questions below and think about the type of variables involved.

What method of transport do you use to get to work?

These variables are nominal, so use nonparametric tests. Consider the types of transport used when commuting. Unlike timings, there are no clear distributional assumptions we could make about method of transport. Also, because methods of transport have no natural order, the data set could not be modelled by a bell curve.

How many minutes does it take you to get to work?

These variables are scale, so if the distribution is approximately normal, use parametric tests, otherwise nonparametric tests.

Thinking about how long it takes people to get to work, we can guess that it might follow a normal distribution: the number of people will tail off at short and long times. It is plausible that many people’s commute times will lie somewhere around an average. Because of people’s different circumstances (mode of transport, distance from work), we might expect a bell curve. We should look at the data in question to determine whether this is the case.
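The decision rule described above can be captured in a toy helper function. The category labels and example test names here are illustrative, not an exhaustive guide.

```python
def recommend_test(data_type: str, approximately_normal: bool = False) -> str:
    """Sketch of the rule above: parametric tests need scale data with an
    approximately normal distribution; otherwise go nonparametric."""
    if data_type == "scale" and approximately_normal:
        return "parametric (e.g. a t-test)"
    return "nonparametric (e.g. a Mann-Whitney U test)"

print(recommend_test("nominal"))                            # method of transport
print(recommend_test("scale", approximately_normal=True))   # commute time, if normal
```

In practice the normality judgement comes from inspecting the data (for example with a histogram), not from a flag.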

Confidence intervals

Often, we want to know about the extent of our results, not just whether they are significant. For example, a new drug might be found to reduce pain compared to an existing drug, significant for p ≤ 0.05. This tells us that the drug is likely ‘better’, but nothing about its strength; it could have a significant, but very small effect.

Confidence intervals give us an idea of the strength of a difference or relationship. A confidence interval is a range of values in which the population parameter of interest is likely to lie, to a given level of confidence (usually 95%).

A 95% confidence interval tells us that if the study were to be repeated, the 95% confidence interval would contain the population parameter 95 times out of 100. This allows us to interpret both the significance and size of an effect.

For example, the effect of a new drug on those suffering moderate pain might be measured in terms of pain reduction on a ten-point scale. For people suffering a 5/10 level of pain, the drug might have a mean reduction of 2.8 points, with a 95% confidence interval of a 2–3 point reduction. The figure of 2.8 is merely an estimate, but if we repeated the study, we would expect the ‘real’ effect to lie in the range of a 2–3 point reduction 95 times out of 100.
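As a sketch, a 95% confidence interval for a mean can be computed with the normal approximation (multiplier 1.96). The pain-reduction scores below are invented for illustration; a real analysis of a sample this small would use the t distribution instead.

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical pain reductions on a ten-point scale (illustrative values only).
reductions = [2.5, 3.1, 2.8, 2.2, 3.4, 2.9, 2.6, 3.0, 2.7, 2.8]

n = len(reductions)
m = mean(reductions)
se = stdev(reductions) / sqrt(n)  # standard error of the mean
margin = 1.96 * se                # 1.96 is the z multiplier for 95% confidence
print(f"mean reduction: {m:.2f} points, "
      f"95% CI: ({m - margin:.2f}, {m + margin:.2f})")
```

The interval, not the point estimate alone, is what tells us about the size of the effect.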

Summary

The ability to apply the correct descriptive statistics and tests to your data will make it more accessible, enabling more people to understand and engage with your data. Before conducting any statistical test, ensure you understand your data; this will enable you to select the correct statistical tests.
