Little Crash-Course in Big Data – Analysing Data

A designer’s guide to quantitative data from one amateur to another (Part 2 of 3)

Stina Jonsson
11 min read · Mar 14, 2017

This article is the second in a series of three about how to collect, analyse and visualise quantitative data, targeted at designers who want to get into quantitative data. You can find the other parts here: Collecting Data and Visualising Data.

I’ll start with a confession. It’s been a long time since I got my hands dirty analysing data. Although I did get nerdy over Christmas and ran a Chi-squared test to test a hypothesis about gender attrition in the office, I had to watch a video to refresh my memory. A Chi-squared test is used to test whether there is a significant difference between groups when the variables are on the nominal scale.
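The mechanics of a Chi-squared test are fairly simple: you compare the counts you observed against the counts you would expect if the two nominal variables were independent. Here is a minimal sketch in Python with entirely made-up numbers (not the actual office figures):

```python
# Chi-squared test of independence on a 2x2 contingency table.
# The counts below are invented purely for illustration.
observed = [[10, 20],   # e.g. stayed:  women, men
            [20, 10]]   # e.g. left:    women, men

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the row and column variables were independent
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 3))  # 6.667 for this table
```

You would then compare the statistic against the critical value for the table’s degrees of freedom (1 for a 2×2 table), which statistical software does for you.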

Although I know enough to know when it’s appropriate to use a Student’s t-test over a Z-test, I’m pretty far from an expert. The good news is that you (and I) will likely work with an expert when it comes to analysing data.

This article will, however, give you a basic understanding of key terms and concepts that will help you communicate and collaborate more effectively with an expert.

For the curious: the time to use a t-test is when the sample size is less than 30 and it’s unknown whether the population is normally distributed. The t-test uses the variability in the sample to estimate the variability in the population.
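To make that concrete, here is a sketch of the t statistic for two small independent samples (this is Welch’s version, which doesn’t assume equal variances; the numbers are made up):

```python
import math
from statistics import mean, variance  # variance() is the sample variance

# Two small, made-up samples (e.g. ratings from two user groups).
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]

# Welch's t statistic: difference in means, scaled by the
# estimated standard error of that difference.
t = (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))
print(round(t, 3))  # -1.897
```

Software would then turn this statistic into a p-value using the t distribution; the point here is just that everything is estimated from the samples themselves.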

But let’s start with some more basic stuff.

Analysing the data

In the previous article, we covered the topic of collecting data. Whether you’re doing a survey early in the design process or capturing clicks on a high-fidelity prototype further down the line, you’re likely to have collected data at various levels of measurement. You might have used an ordinal scale for your survey, but you also captured gender, family status, the type of car people drive and so on, on the nominal scale.

It’s important to understand that different methods for analysing data are appropriate depending on your level of measurement.

AVERAGES

A simple way to analyse your data is to calculate averages. When measuring a single variable (univariate data), you can use the mode, median and mean.

3 6 6 6 7 9 11 11 13

Mode is the most common value, 6 in the case above. The mode works with all levels of measurement, including nominal. You can, for instance, say that Volvo was the most common car in the sample.

3 6 6 6 7 9 11 11 13

Median is the value in the middle, 7 in the case above. The median can’t be calculated for nominal scales but can be for the other three. If you’re testing a series of concepts and asking users to rate how likely they would be to use each one on a 1–10 scale, you can for instance say that the median for Concept A was 6 and the median for Concept B was 7.

Note that the median does not have to be a whole number. If there is an even number of values, the median is the mean of the two middle values: for middle values 6 and 7, the median would be 6.5.

(3 + 6 + 6 + 6 + 7 + 9 + 11 + 11 + 13 = 72) 72 / 9 = 8

Mean is the sum of all values divided by the number of values, 8 in the case above. Strictly speaking, the mean is reserved for the interval and ratio scales. However, as I discussed in the first part, Collecting Data, a mean is often calculated anyway for ordinal scales since it does provide a useful measurement, albeit a less stringent one than some would like. Read this to find out more.

Because the ‘mode’, and often the ‘median’, represents a real person, as opposed to a computed average, it can be very relevant to designers.
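All three averages can be computed directly with Python’s built-in statistics module, using the same numbers as above:

```python
from statistics import mean, median, mode

values = [3, 6, 6, 6, 7, 9, 11, 11, 13]

print(mode(values))    # 6  (the most common value)
print(median(values))  # 7  (the middle value)
print(mean(values))    # 8  (sum of all values / number of values)
```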

DISTRIBUTION

To learn more beyond the average, you can also look at the distribution of the values. Data can be distributed (spread out) in different ways.

In the graph below, we can see the distribution of answers to a survey question on a 1–10 scale: 2 people answered 1, 5 people answered 2 and so on. Nobody liked it enough to answer 10.

Note that samples with the same mean can have greatly varying distributions. Look at the four distributions below. Figures A and B have the same average, but A shows that most people fell in the middle and thought the concept was OK/neutral. Not many were very positive or very negative.

If we look at figure B, we can see that a lot of people answered low numbers like 1–3 as well as high numbers like 7–9. Not a lot of people seem to be neutral. This seems to be a polarising concept.

In many cases where the data tends to cluster around a central value with no bias to the left or right, it gets close to a normal distribution. Things that follow a normal distribution include people’s height and weight, blood pressure, and the sizes of things produced by machines. The normal distribution is sometimes informally called the bell curve.

Bell curve

VARIANCE

Variance is a measurement of how spread out the measured values are around the mean. Low variance means that values are close together, like curve A below. A and B have the same average, but A has a lower variance and B has a higher variance.

Curve A has a low variance meaning that most values are around the mean. Curve B has a higher variance meaning that more values are further away from the mean, both lower and higher.

If you’re curious how to calculate the variance, check this WikiHow post out.

Another term used to talk about how the values measured are spread out in relationship to the mean is standard deviation (SD). The SD is calculated by taking the square root of the variance.

You might see something like the image on the left. It shows a graph with the SD, which is written with the Greek letter σ (sigma). You can see, in dark blue, 1σ away from the mean to the right and -1σ to the left. 68% of all measured values fall within 1 standard deviation of the mean.

You can also see that 95% of values fall within 2 standard deviations (from -2σ to 2σ) of the mean.

A concrete example: Let’s say that the average height of a woman in Sweden is 167cm and the standard deviation is 6cm. This would mean that 68% of women in Sweden would be between 161cm (167cm-6cm) and 173cm (167cm+6cm). 95% of women in Sweden would be between 155cm (167cm-12cm) and 179cm (167cm+12cm).
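The height example is just the mean plus or minus one and two standard deviations, which is easy to check:

```python
mean_height = 167  # cm, average height of women in the example
sd = 6             # cm, standard deviation

# ~68% of values fall within 1 standard deviation of the mean:
print(mean_height - sd, mean_height + sd)          # 161 173

# ~95% fall within 2 standard deviations:
print(mean_height - 2 * sd, mean_height + 2 * sd)  # 155 179
```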

SIGNIFICANCE

The level of variance is closely related to how confident you can be in your findings: the more variance, the less confidence. This confidence is expressed through a significance threshold, called the alpha value, that a result’s p-value is compared against. With a 95% confidence level, the threshold is written p ≤ 0.05.

Let’s use an example here, because this is going to get a bit theoretical otherwise. Let’s say you want to find out whether apples, plums or oranges are the most expensive at farmer’s markets nationally. So you choose a couple of farmer’s markets at random, go there and find out the price of the three fruits. You then calculate the average price for each of them.

What we want to do now is to take this data from the sample (averages of the selected farmer’s markets) and make a claim about the population (national average of all farmer’s markets).

The results are as follows (see graph on the left). It looks like plums are the cheapest and oranges the most expensive. The visualisation looks clear. BUT it’s not that simple. We need to figure out whether the difference between these average prices is statistically significant.

Whether a result is significant is calculated using what is called a p-value. The lower the p-value, the better. There are different ways of calculating p-values depending on the level of measurement (nominal, ordinal, interval or ratio scale) and the method of analysis you’re using.

The good news is that if you’re using a computer program like SPSS or something similar, it’ll be automatically calculated. What’s more important is that you understand what influences the significance and what it means.

The bar below shows the measured average value, but it has been augmented with an error bar showing the 95% confidence interval (p ≤ 0.05). This means that the researcher, based on the variance, is 95% confident that another sample from the same population would yield an average within this range.

This means that if you randomly selected another sample of farmer’s markets, you’ll probably get an average that is higher or lower than the first one but it’s likely to be within the confidence interval.
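A common way to compute such an interval, assuming the sampling distribution is roughly normal, is mean ± 1.96 standard errors. A sketch with made-up apple prices:

```python
import math
from statistics import mean, stdev

# Hypothetical apple prices (per kg) sampled at 8 farmer's markets.
prices = [2.1, 2.4, 1.9, 2.6, 2.3, 2.0, 2.5, 2.2]

m = mean(prices)
# Standard error: how much the sample mean itself is expected to vary.
standard_error = stdev(prices) / math.sqrt(len(prices))

# 95% confidence interval: mean +/- 1.96 standard errors.
# (1.96 comes from the normal distribution; for a sample this small,
# software would use the slightly wider t distribution instead.)
low, high = m - 1.96 * standard_error, m + 1.96 * standard_error
print(round(low, 2), round(high, 2))  # 2.08 2.42
```

Notice that a bigger spread in the prices, or a smaller sample, widens the interval, which is exactly the variance-confidence relationship described above.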

The bigger the variance in the sample (like curve B in the example under VARIANCE above), the bigger the chance that you’ve picked farmer’s markets that skew your data.

Let’s say that the confidence intervals look like below in figure A. The confidence interval of the plums does not overlap with apples or oranges so we can be confident that plums are indeed cheaper. The result is significant.

Comparing apples and oranges is another story. The intervals overlap considerably. The price of oranges, especially, seems to have varied quite a bit between the different markets.

This means there is a chance that, had we drawn a different sample of farmer’s markets, the averages would have been different, perhaps even reversed. The difference is not significant.

If the results had shown error bars that don’t overlap, or only slightly overlap, like in figure B below, the researcher could have been (in this case) 95% certain that oranges are more expensive.

The error bars only overlap a tiny bit. There is still a 5% chance that in reality apples are equally or more expensive.

Remember, you don’t have to think about any of this if you’re measuring the entire population. If you’re, for example, reflecting a user’s walking data back to her, you can confidently say whether she walked more this month or last, because you’re not just taking a sample; you’re measuring all the data.

CORRELATION

Correlation is a statistical measurement of the relationship between two variables. Possible correlations range from –1 to 1.

A positive correlation means the two variables “move together”: when one increases, the other increases, and when one decreases, the other decreases. An example of a strong positive correlation would be .92. A perfect positive correlation would be 1.0.

Temperature and ice cream sales are positively correlated. When the temperature is high, people buy a lot of ice cream; when the temperature is low, people buy a lot less. The correlation might be .57, indicating a moderate positive relationship. (Strictly, it’s the squared correlation that measures variance explained: r² ≈ .32, so about 32% of the variance in ice cream sales can be explained by the temperature.)

An example of a negative correlation is exercise and body weight: an increase in exercise usually goes together with a decrease in body weight. Let’s say the correlation is -.32.

A correlation of 0 (zero) means that there is no relationship between the variables.
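The Pearson correlation coefficient behind these numbers is just the covariance of the two variables scaled by their spreads. A sketch with made-up temperature and ice cream figures:

```python
import math
from statistics import mean

# Made-up daily temperatures (C) and ice creams sold.
temps = [12, 15, 18, 22, 25, 30]
sales = [20, 26, 24, 40, 38, 52]

mx, my = mean(temps), mean(sales)

# How the two variables vary together...
cov = sum((x - mx) * (y - my) for x, y in zip(temps, sales))
# ...scaled by how much each varies on its own.
sx = math.sqrt(sum((x - mx) ** 2 for x in temps))
sy = math.sqrt(sum((y - my) ** 2 for y in sales))

r = cov / (sx * sy)
print(round(r, 2))  # close to 1: a strong positive correlation
```

(Python 3.10+ also ships `statistics.correlation`, which does the same calculation.)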

Strictly speaking, you can’t draw any conclusions about causality from a correlation. In other words, you can’t say which variable affects the other. You need to do an experiment to show causal relationships.

Correlations can, however, hint at a causal relationship.

One reason you can’t draw causal conclusions (determine which variable influences the other) is that some other variable could be the real cause, affecting both of the variables that co-vary.

For example, there is a correlation between the amount of serious crime committed and the amount of ice cream sold. I don’t think anyone would claim that the more ice cream is sold, the more crime we get, or vice versa, that more crime leads to more people hankering for ice cream.

Instead, high temperatures in the summertime mean that more people want ice cream and that, for some reason, more crimes are committed. When a confounding variable like this makes it look as if there is a causal relationship between two other variables, it’s sometimes called the third variable problem.

HYPOTHESIS

A hypothesis is a statement about the world. It says something about the relationship between two or more variables. A hypothesis is a statement, not a question.

If we’re designing an app to encourage people to walk more, our hypothesis might be something like “Visualising back to a person how much they walk increases the amount they walk.”

A good hypothesis is testable. Think about which variables are dependent and which are independent. We probably want to make our hypothesis more specific about what type of visualisation we think will increase how much people walk. You want to set up a test that proves* or disproves your hypothesis.

Best practice is to start with an hypothesis and not just to look at the data after you have collected it. It’s easy to fall into the trap of always interpreting what you see in favour of what you want to see.

A practical example: you design and launch a new website for a museum. You measure the time people spend on certain pages in the old design and in the new one, to compare.

If you see that people stay longer on the pages, you might say that people really value engaging with the content. Success! Or if you see that people spend a shorter amount of time on the new pages, you might say that people are finding what they need much quicker because of the improved UX. Success!

Whatever the result, you interpret it in favour of what you want: to convince yourself and others that your new design is a success.

*More hardcore science would claim that you should focus on trying to disprove the hypothesis, not trying to prove it. This is called falsification and comes from the philosopher Karl Popper. Most scientists stand behind this approach to the scientific method. A classic example is the hypothesis “all swans are white.” Even if you’ve looked at thousands of swans and all of them were white, you can’t prove that all swans are white, because there might be a black swan out there somewhere. As soon as you see one single black swan, however, you can disprove the hypothesis.

Wrap up

OK, that was a bit about analysing data. Please feel free to write to correct errors, suggest additions and ask questions.

Part 1: Collecting Data

Part 2: Analysing Data (this article)

Part 3: Visualising Data

A bit about me

My name is Stina Jonsson and I work at IDEO in our London office. I apply cognitive science to design challenges to fundamentally question and reframe the way we engage in a digital context.

PS. Sorry for the spelling mistakes.

