How do you build your data analysis after you’ve formulated the question?

Well, we’ve covered that the MOST IMPORTANT PART of any data analysis is the QUESTION.

And we’ve noted this cannot be emphasised enough. So…

Again with the question?

Why is this the most important? Just to repeat…

Too often people start with a pre-conceived idea that they want you to prove with a set of data (and the data may or may not be relevant to the pre-conceived idea that it is meant to prove…) or a well-meaning individual or team will just throw some data at an analyst and say, “what can we learn from this?” Fun — if the data sample is small and clearly focused. Which is rare. So rarely fun. Often, when I’m put in a situation like this, I want to revert to Hitchhiker’s Guide to the Galaxy mode and send whoever asked me to do the analysis this:

The joke being, for those that haven’t read the book, heard the radio show or seen the movie (where have you been?), that after seven and a half million years of computation, an advanced computer, Deep Thought, produced the answer 42, but no one could remember the initial question (other than that the question had something to do with “life, the universe…everything.”) Useful, yes?


Define your question. And document it. That’s the lesson here.

And to recap, questions and data analysis have six categories.

The question and corresponding analysis, as noted in an earlier post, can fall into one of six categories. They are, briefly:

  1. Descriptive. Basically, what is in this data we collected? In dataset Z, we have X. We also have Y. And A, and B, and C… An example is the U.S. Census. Another example might be that there are a lot of women with silky hair, toned bodies, symmetrical facial features, and good hygiene working as actresses in LA. This data looks like this. Nothing more — no explanations as to why, no inferences as to what more could be done, no correlations, and definitely no causations or predictions.
  2. Exploratory. In an exploratory data analysis, an analyst explores a data set for relationships in which you expressed an interest. We found these relationships and trends. In dataset Z, X and Y exist and appear together a lot. Skinny actresses with silky hair, toned bodies, symmetrical facial features, and good hygiene in LA appear to benefit from high employment in LA. We don’t know why, we don’t know how, and we don’t know if what we found can be applied to other data. In an exploratory analysis, we made some discoveries but we can’t confirm them.
  3. Inferential. Okay, so we found these relationships you asked about and we can imply or infer that these two variables we identified are involved in a pattern that seems to check out in other, similar data sets. In all datasets similar to Z, X and Y exist and appear together a lot. So in this analysis we have a definite pattern, but no explanations as to why that pattern exists or how it came to be and, once again, no predictions or causation can be confirmed. At best, we have a pattern with some correlating variables. You know those actresses we were looking at before? Well it seems they work more both in LA and in NYC and in Chicago and in Miami and in several other cities if they are blonde rather than brunette or raven-haired.
  4. Predictive. (Note how far down the list this one is? That is important. Predictions are not a part of any earlier data analysis.) Now we not only have a pattern that shows up in a lot of similar datasets, but we can show that one identified variable indeed predicts the behaviour or existence of another identified variable. X->Y. We still don’t know why, but we know it does. After repeating the inferential data analysis again and again across many cities where there is a significant actress population that fulfils our specifications, we can confidently predict that blonde actresses work more frequently as actresses than their brunette or raven-haired counterparts.
  5. Causal. Here we have “average” appear for the first time. With causal analysis and a corresponding data set, we can predict that on average, X impacts Y. For example, our experiments show that if an actress changes her hair colour to blonde (without changing anything else), she is more likely to find work as an actress more often in one of the cities studied than she did prior to the hair colour change. On average. (Individual results for different actresses may vary.)
  6. Mechanistic. Here we have to jettison the actress example as this type of question/analysis must be guaranteed (we would have to be able to guarantee that if an actress, any actress, changed her hair to blonde, she’d get more acting jobs. We can’t.) So think engineering — if we change the shape of a plane’s wing we definitely get more or less wind drag. Change X and you change Y in the exact same manner each time.
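The difference between the first two categories can be made concrete with a few lines of pandas. This is a minimal sketch using the running actress example; every value and column name below is invented purely for illustration, not a real statistic.

```python
import pandas as pd

# Toy data, invented purely for illustration. Each row is one
# hypothetical actress: her hair colour and jobs booked per year.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5, 6],
    "hair_colour": ["blonde", "brunette", "blonde", "raven", "blonde", "brunette"],
    "jobs_booked_per_year": [7, 3, 6, 2, 8, 4],
})

# 1. Descriptive: report what is in the data, nothing more.
counts = df["hair_colour"].value_counts()
print(counts)

# 2. Exploratory: look for relationships in this one dataset.
# A difference in the group means is a discovery, not a confirmation.
means = df.groupby("hair_colour")["jobs_booked_per_year"].mean()
print(means)

# Inferential, predictive, causal and mechanistic claims each need more:
# many similar datasets, held-out data, controlled experiments, and a
# guaranteed mechanism, respectively. A groupby alone gives you none of them.
```

Note that the code for categories 1 and 2 looks almost identical; what separates the six categories is not the tooling but the strength of the claim you are allowed to make afterwards.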

So you have your question, you have your data. Is it relevant? Here is how you can tell:

Tidy the data. Then check the data.

Tidy the data.

Jeffrey Leek of Johns Hopkins University has a checklist. (Of course he does.)

Any and all data sets provided to an analyst should have the following:

  • the raw data set,
  • a tidied data set,
  • a code book explaining each and every aspect of the tidied data set, and
  • easier-than-IKEA instructions describing exactly how the raw data was produced and how you tidied it and why you did what you did.

Leek notes that raw data is relative — you may not have access to the actual raw data, so do your best to provide exactly what you were given and why and describe whatever you did to it.

In general, with tidy data:

  • Every variable gets one column. Quick tangent: a variable is something like gender (male or female) or age range (0 to 100+) or hair colour (brown, black, blonde, blue…) or the unique number assigned to each individual subject (your social security number, for example.) By the way, most variables fall into a few categories, namely:
      • continuous: any number,
      • ordinal: fixed integer values that are ordered (e.g. poor, fair, good),
      • categorical: unordered, like male or female, brunette or blonde,
      • missing: you don’t have them, and you should make that explicit so nobody thinks you do,
      • censored: you don’t have them but you know why. Explain that you don’t have them and explain why.
  • One table = one kind of variable. Don’t put gender and age in the same table.
  • If you have different variables that are linked, include a column with what links them (e.g. the unique number assigned to the individual who is female, age 34, and brown-haired should be included in the gender table, the age table, and the hair table, and her unique number should have its own column so any analyst can link all these variables to that individual at any time.)
  • In the column above each variable, give the variable its full name. No guessing games, please. (e.g. is HC for hair colour or human colour? Just tell me, don’t make me guess. It will not end well.)
  • If you are working in Excel, use one file per table. Avoiding multiple sheets makes life easier for everyone.
  • Remember your units! Kilometers or miles or Lis or what-have-you. Not everyone counts in the same currency!
  • Make your code book: explain the variables, include the units, describe the design of the experiment, explain your choices. This code book is your data diary re-made as your public memoirs if scientists were going to clone you and you wanted the clone to use those memoirs to replicate your life down to the second you had to pee today.
  • Make your instructions explicit. Better yet, make them a script. I am not a programmer, but I’m learning R and Python so I can do this. Think having a piece of furniture from IKEA that comes with a button rather than an instruction booklet. You just push that button and, voila, the furniture puts itself together before your very eyes. You see how it’s done and you can reuse the button if needed. Screw the pre-information age stick figure hieroglyphics that make you secretly hate IKEA and furniture both as you spend precious minutes trying to figure out which bolt is A2 and which is A1.
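The linked-tables rule above can be sketched in a few lines of pandas. The column names and values here are made-up examples, but the layout follows the checklist: one table per kind of variable, a shared `subject_id` column linking them, and every column name spelled out in full, units included.

```python
import pandas as pd

# One table per kind of variable, each carrying the linking column.
gender = pd.DataFrame({
    "subject_id": [101, 102, 103],
    "gender": ["female", "male", "female"],
})
age = pd.DataFrame({
    "subject_id": [101, 102, 103],
    "age_in_years": [34, 51, 27],   # units in the column name, no guessing
})
hair = pd.DataFrame({
    "subject_id": [101, 102, 103],
    "hair_colour": ["brown", "blonde", "black"],
})

# Because the link column is explicit, any analyst can reassemble the
# full picture for one individual at any time.
merged = gender.merge(age, on="subject_id").merge(hair, on="subject_id")
print(merged[merged["subject_id"] == 101])
```

This is also a small taste of the “instructions as a script” idea: the merge line *is* the documentation of how the tables relate.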

Check the data.

Think of every possible way any tidying exercise could go wrong. And then think of all the ways you didn’t think of. See what I’m getting at? Check your data. Some ideas of how things could go wrong from the wonderful Leek:

  • Be clear whether missing data is censored or just missing. Different statistical models treat missing data differently than censored data. And, as mentioned above, explain why the data is censored.
  • Be clear on how you are identifying the missing value. Use NA, for example.
  • Don’t code categorical (remember: male, female / brown, blonde, blue) or ordinal (e.g. poor, fair, good) as numbers. Too much room for misinterpretation. Keep them text.
  • Spell out — in writing — any observations you have. If, for example, in Excel you highlight all the ages over 50 because these people don’t understand Snapchat and that is important information, software may not pick up the highlighted cells and thus this important information will be lost. Colours will not be exported. Text will. Put your observations in text.
  • Check for coding errors. In a column listing age, is there a -12? That’s an error. Did you list some women as “women” and others as “ladies”? Fix that.
  • Check labels. If an individual is coded as both age 24 and age 65, this is probably an error.
  • If the data exists in several files, make sure that the information that should be identical in those files is in fact identical. If a specific individual is coded as both male and female, this may be an error. In fact, don’t duplicate data, period.
  • Make sure all units are included. Use your units! People do not count in the same currency. Leek cites the measurement mess-up behind the $125 million Mars Climate Orbiter: one team used metric units, one team used English units, no one included their units, and there were expensive problems.
  • Make lots of charts. Look for outliers. Check everything again.
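A few of these checks can be automated. Here is a minimal sketch, with invented example data, of catching missing values, a coding error, and inconsistent labels in pandas:

```python
import numpy as np
import pandas as pd

# Invented example data containing deliberate problems.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "age_in_years": [24, -12, 65, np.nan],            # -12 is a coding error; NaN is missing
    "gender": ["female", "woman", "male", "female"],  # "woman" vs "female": inconsistent labels
})

# Missing values: count them explicitly rather than letting them hide.
missing_age = df["age_in_years"].isna().sum()

# Coding errors: an age cannot be negative.
bad_ages = df[df["age_in_years"] < 0]

# Label errors: every category should come from one agreed vocabulary.
allowed = {"female", "male"}
bad_labels = df[~df["gender"].isin(allowed)]

print(missing_age, len(bad_ages), len(bad_labels))
```

Checks like these belong in a script you can re-run every time the data changes, for the same reason the tidying instructions should be a script.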


Tidy data ≠ clean data.

Tidy data is NOT clean data. Think about your car — you have to throw out all your fast food wrappers and receipts before you can vacuum.

Leek cites this well-known video, which notes that, on average, about 80% of any data analysis is dedicated to actually cleaning the data and preparing it for use. You have to tidy the data before you can clean it.

But then you have to clean it.

Cleaning the data should make certain the data is correct and ready for use in answering your question. Underestimating how much time and effort it takes to clean data and ensure its accuracy can lead to pointless results. Just ask Dilbert.

Okay, so.

In conclusion…

  1. Start with a question.
  2. Categorise your question and your relevant analysis — don’t expect wine from raisins, and don’t expect predictions from an exploratory analysis.
  3. Make sure the data can answer the question (in theory.)
  4. Tidy the data.
  5. Check the data.
  6. Remember: tidy data ≠ clean data.