The Short And Sweet Basics of Data Analysis

Mia
Endless Forms Most Beautiful
7 min read · Dec 16, 2015

Whatever you do, don’t forget to look for the bigger picture. The main themes will emerge from among the numerical grades and quantitative data.

As laid down in Zapier’s Guide to Forms and Surveys, there are four main ways to collect responses and four main data types you may confront when analysing the results: categorical, ordinal, interval and ratio.

  • Categorical data is summarised by counting the responses in each category and dividing each count by the total number of responses. The shares should sum to 100%. Categorical data can be made more useful by grouping results by customer segment (new customers, long-time customers, geography…). Think about which categories are most meaningful to you.
  • A contingency table is created by first splitting the responses into groups that become rows. The groups are mutually exclusive (no overlap) and exhaustive (sum to 100%). Next, count the number of responses to one question. Finally, divide each count within each cell by the total number of responses to that question (including all groups).
  • Ordinal data is handled badly if you convert the responses to numbers and then calculate the average of those numbers. Instead, create a simple relative frequency table or a contingency table (described above). Deliberately avoid averages and describe the data instead.
  • Diverging bar charts are a great way to visualise ordinal data. A common baseline allows the eye to compare the length of each bar. You can show more than one piece of information in a single bar.
  • Interval data is most safely and usefully summarised when you handle it as if it were ordinal data. Using averages and standard deviations is possible, but only if the distance between intervals is even.
  • Ratio data is rich enough to support averages (arithmetic mean). The average gives a measure of where the data is centred. The standard deviation gives roughly the average distance from the centre of the data. Calculating it takes two steps: 1) calculate the variance statistic and 2) take the square root of the variance statistic. The variance statistic is: SUM((each value - mean)^2) / (N - 1). So, the result would be something like “the average number of sessions attended was 5 ± 2”.
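
As a rough illustration of the relative frequency table and contingency table described in the list above, here is a minimal Python sketch. The survey answers, the segment names and both function names are made up for the example; they are assumptions, not part of Zapier's guide.

```python
from collections import Counter


def relative_frequencies(responses):
    """Share of responses per category, as fractions that sum to 1."""
    counts = Counter(responses)
    total = len(responses)
    return {category: count / total for category, count in counts.items()}


def contingency_table(rows):
    """Cross-tabulate (group, answer) pairs.

    Each cell is the count for that group and answer divided by the
    total number of responses to the question, across all groups.
    """
    counts = Counter(rows)
    total = len(rows)
    table = {}
    for (group, answer), count in counts.items():
        table.setdefault(group, {})[answer] = count / total
    return table


# Hypothetical ordinal answers, each tagged with a customer segment.
survey = [
    ("new", "agree"), ("new", "agree"), ("new", "disagree"),
    ("long-time", "agree"), ("long-time", "neutral"),
    ("long-time", "neutral"), ("long-time", "disagree"), ("new", "neutral"),
]

answers = [answer for _, answer in survey]
print(relative_frequencies(answers))
print(contingency_table(survey))
```

Multiply the fractions by 100 if you prefer to report percentages. Note that the ordinal answers are only counted here, never averaged.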

Be thoughtful when creating visualisations of the data. Start with the largest differences. Round numbers to avoid communicating a false degree of precision. Keep it simple and consider whether tables or more visual formats like graphs would serve better.
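
For ratio data, the two-step standard deviation calculation from the list above, combined with the advice to round, might look like the following sketch. The attendance numbers are invented for illustration and the variable names are my own.

```python
import math
import statistics

# Hypothetical ratio data: sessions attended by each respondent.
sessions = [3, 4, 5, 5, 6, 7, 2, 8]

mean = sum(sessions) / len(sessions)

# Step 1: variance statistic, SUM((each value - mean)^2) / (N - 1).
variance = sum((x - mean) ** 2 for x in sessions) / (len(sessions) - 1)

# Step 2: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# The standard library gives the same sample standard deviation.
assert abs(std_dev - statistics.stdev(sessions)) < 1e-9

# Round before reporting to avoid a false degree of precision.
summary = f"average sessions attended: {round(mean)} +/- {round(std_dev)}"
print(summary)  # average sessions attended: 5 +/- 2
```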

Tools for data visualisation

Before you start, check out Stephen Few’s nice article about selecting the right graph for your message (PDF).

Every survey tool will give you some kind of report of the answers. Most often you can export the results to a spreadsheet, download them as a PDF or view them as charts and graphs. If, however, you want to build your own data visualisations, or just see how it goes, there are a lot of nice tools to use. The following tools are in no particular order.

  • Microsoft Excel, Google Sheets, Calc by OpenOffice and Calc by LibreOffice all give you the spreadsheet tools and resources to analyse, interpret and visualise your data in an easy way, without having to code a single line.
  • Raw is a free, easy-to-use and open-source data visualisation tool. Just drop in a plain text file or a spreadsheet and work your way through a few easy steps to create a nice data visualisation.
  • Chartblocks doesn’t require coding but instead lets you drop in your data and work from there with a simple interface. Chartblocks allows you to import data from all kinds of sources. Sharing and customising are easy and the charts are responsive. Chartblocks is free for up to 30 active, public charts.
  • Infogram is easy to use, allowing you to create interactive, responsive and clear infographics. It lets you import data from all kinds of sources, and you can also create maps, share the visualisations privately and export them for your own use. There is a limited free plan you can try out, and pricing starts from $15/month.

If learning data science interests you, the internet is full of great (and free!) resources to get you started. You might want to check out:

  • DataCamp: access to all material is $25/year but the free account is worth checking out.
  • A Crash Course in Data Science: a course by Johns Hopkins University that focuses on the basics of data science to get you familiar with the subject.
  • Statistics I: a beginner-friendly statistics course by Princeton University. No prior knowledge is needed. The course will also introduce you to the R programming language.
  • Statistics: Making Sense of Data is similar to Statistics I, but offered by the University of Toronto. It assumes no prior knowledge of the subject.
  • Data Visualisation: a University of Illinois course that does what it says on the tin and focuses on visualising your data. During the course, you learn about charts, graphs and interactivity. Not only that, but you also learn to showcase relationships, hierarchies, text and databases. No prior knowledge is necessary.

Sampling methods

Saul McLeod introduces the different sampling methods in his article.

  • Random sampling: every member of a population has an equal chance of being selected. The advantage is that your sample should represent the target population; the disadvantage is that this requires time and money.
  • Stratified sampling: divide the target population into important subcategories, selecting members in the proportion that they occur in the population. This is very time-consuming and difficult but highly representative of the target population.
  • Volunteer sampling: individuals who have chosen to be involved in a study. This is also called self-selecting.
  • Opportunity sampling: simply select those people that are available at the time. This is a quick and easy way of choosing participants but may not provide a representative sample.
  • Systematic sampling: choose subjects in a systematic way (like every nth participant). Divide the number of people in the population by the number of people you want in your sample. This gives you the n. Then, take every nth participant. This takes time and effort but is representative.
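
Three of the methods above (random, stratified and systematic sampling) are easy to sketch in Python. The snippet below is only an illustration: the population of 100 numbered respondents, the two segments and the function names are all assumptions made for the example.

```python
import random


def simple_random_sample(population, size, rng):
    """Every member has an equal chance of being selected."""
    return rng.sample(population, size)


def systematic_sample(population, size):
    """Take every nth member, where n = population size // sample size."""
    step = len(population) // size
    return population[::step][:size]


def stratified_sample(strata, size, rng):
    """Sample each subgroup in proportion to its share of the population.

    Quotas are rounded, so the total can differ slightly from `size`
    when the proportions do not divide evenly.
    """
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        quota = round(size * len(members) / total)
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 100 respondents identified by number.
population = list(range(100))
strata = {"new": population[:30], "long-time": population[30:]}
rng = random.Random(42)  # fixed seed so the example is repeatable

random_picks = simple_random_sample(population, 10, rng)
systematic_picks = systematic_sample(population, 10)
stratified_picks = stratified_sample(strata, 10, rng)
print(random_picks, systematic_picks, stratified_picks, sep="\n")
```

With these numbers the stratified sample contains 3 new and 7 long-time respondents, mirroring the 30/70 split in the population.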

Scott Smith writes about determining sample size. When determining the size, there are a few questions you should consider. First, how many people in total fit your demographic? Second, how much error do you allow? The margin of error (also known as the confidence interval) determines how much higher or lower than the population mean you’re willing to let your sample mean fall. Then, how confident do you want to be that the actual mean falls within your margin of error? The common confidence levels are 90%, 95% and 99%. You also need to think about how much variance you expect in your responses (this is the standard deviation). The safe decision is to use .5, as it is the most forgiving number and ensures that your sample will be large enough. Finally, your confidence level corresponds to a Z-score (PDF). This is a constant value needed for the equation. The Z-score for 90% is 1.645, for 95% it is 1.96 and for 99% it is 2.576.

Now, plug your Z-score, standard deviation and margin of error into this equation: Necessary sample size = (Z-score)^2 * Standard Deviation * (1 - Standard Deviation) / (margin of error)^2.

So, if you choose a 95% confidence level, a .5 standard deviation and a margin of error of +/- 5%, the equation goes like this:

((1.96)^2 * .5(.5)) / (.05)^2

(3.8416 * .25) / .0025

.9604 / .0025

384.16

The necessary sample size is 385 respondents (384.16 rounded up).
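
The same calculation can be wrapped in a small Python function if you want to try other confidence levels or margins of error. This is a sketch of the equation above, nothing more; the function and parameter names are my own, and only the three Z-scores listed earlier are included.

```python
import math

# Z-scores for the common confidence levels (in percent).
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}


def necessary_sample_size(confidence, std_dev=0.5, margin_of_error=0.05):
    """(Z-score)^2 * std_dev * (1 - std_dev) / (margin of error)^2, rounded up."""
    z = Z_SCORES[confidence]
    raw = (z ** 2) * std_dev * (1 - std_dev) / margin_of_error ** 2
    # Round up: you cannot survey a fraction of a person.
    return math.ceil(raw)


print(necessary_sample_size(95))  # 385
```

Tightening the margin of error or raising the confidence level increases the required sample size, which is easy to see by changing the arguments.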

Or, you know, you can use a sample size calculator like this one.

If you are working out a sample size for a smaller population, you might want to check out this PDF.

Writing A Survey Report

Samuel Hamilton outlines the basic stuff quite nicely in his article.

The main thing is to summarise your findings. Give a broad overview of the entire report at the beginning. This includes the date the survey was distributed, the methods used for calculating the responses and a list of some key findings. Then provide some background information: why you conducted the survey and composed the report, and what you hoped to gain from your research. Detail the problem or question, and elaborate on it with additional context (who the respondents were, what kinds of questions were asked, and so on).

Detail the methods and results. In your method section, include the survey and an explanation/analysis of why you asked the types of questions you did. Describe what you did with the information you generated and explain how and why you tallied and grouped the responses (either visually or spreadsheet style) the way you did.

Lastly, analyse results and recommend solutions. Analyse the implications of your results, especially any tallies or groups that seem out of the ordinary or go against your expectations. Offer 5–10 specific, actionable, clear and brief recommendations based on your results.

Keep in mind that there is a difference between the word “significant” (a finding that may have decision-making utility) and the term “statistical significance” (when you are very sure that the statistic is reliable but that’s all). Be critical about the results and don’t trick people. Instead, be transparent and clear. What kinds of people answered the questions? Who did not answer? How does that affect the results? Remember that sometimes people lie, exaggerate or underrate things in their answers. Also, what kind of average are you talking about? And what is the margin of error? If you make people hunt down this information, it may lead to wrong conclusions or they might think you dishonest.
