What is the story of your data?

How to transform your raw data into decisions

Joel Shuman
The Startup
13 min read · Apr 29, 2020

This question is the core of all data analytics work. By striving to answer this question, we will transform raw data into analysis and, ultimately, decisions. In order for analysis to catalyze decisions, it must convey not only the particular levels of each measurement, but also the framework by which these measurements can be understood. This creates a twofold challenge: dig for insights and then connect those insights to human processes.

So many stories to choose from
Photo by Robyn Budlender on Unsplash

The question “What is the story of your data?” is ultimately meaningless without context. Data analysis becomes a story only within a particular context. To develop this story, we start with the context of the organization and the person reading the analysis. What is their prior knowledge? What decisions are they trying to make? What comparisons would be meaningful to them?

Even the simplest data sets are open to hundreds of “correct” ways to find insights. Should we look at a daily or a monthly view? How do we categorize highly diverse inputs? That’s not to say these interpretations are contradictory; this isn’t about telling misleading stories with data. Instead, because there are so many different places to focus, we have to be very intentional about the information that goes into any report.

In this post, we are going to walk through developing a story by following these steps:

  1. Broaden out: Take a look at the questions our data set is able to answer. Without going deep on anything, take a look at what is in the realm of the possible.
  2. Develop questions: Because there are so many ways to go, we’ve got to limit ourselves. Figure out some interesting questions that we want to ask and write them down.
  3. Develop hypotheses: Make educated guesses about what we think we are going to find. This step is crucial; it provides the beginnings of a framework to make the analysis actionable. By forming a hypothesis, we are setting ourselves up to create sufficient information to make a decision: proving or disproving our theory.
  4. Prove or disprove your hypotheses: By finding information that convinces us that we have proven or disproven a hypothesis, we force ourselves to create decision-making-information (more on this below). As we support our hypotheses with evidence, we ready ourselves to synthesize a framework to understand the proof.
  5. Validate: Because there are often so many ways to arrive at the same analysis, take the time to try two or three of them on each hypothesis. It’s great if they support each other, and even better if they do not, as that can be the best way to find the inevitable bugs.
  6. Synthesize: Make the implicit framework you have been working on explicit. At this stage we will reorder the information and prune anything extraneous. The logic should flow at this point. The main message is clear, proven, and validated.

There is so much good data out in the world, but we can only make it interesting by telling a story. Using the question/hypothesis framework, we can develop any scenario into an amazing story. Let’s dive in!

Hallowed halls of a library
Photo by j zamora on Unsplash

Scenario: A billionaire philanthropist has just endowed the national public library foundation with an enormous gift. The library wants to use this money to better serve its patrons and encourage more people to borrow books and use other services.

There are so many ways we could tackle this! In this case, we are going to walk through the process above using ANES survey data. The ANES (American National Election Studies) is a long-running series of surveys on political and electoral preferences, but it also has many useful social science variables and questions. The 2019 Pilot Study, a survey of about 3,000 people conducted in December 2019, includes a Yes/No question regarding the public library:

“Please indicate whether you have done each of the following in the past year. [Yes/No] borrowed a book from the public library.”

Let’s use this question and the hundreds of other pieces of information in the survey to answer some questions for the public library.

[If you want to do your own digging into any of this, this is all public. Here is the script, the codebook, and the data.]

Step 1: Broaden Out

There are so many possible questions we can answer with this survey data, and far more that we cannot answer. Let’s take a quick inventory of what is possible with this survey. The ANES contains questions that cover:

  • Demographic information: like sex, location, education history and income.
  • Evaluations of leaders and countries: 100 point scales showing views on different topics.
  • Media: Some social media information but not much information on traditional media sources.
  • Partisanship and vote history: political behavior and views. This is actually the focus of the ANES, but since we don’t have a political lens for this question, we are going to tell a far different story with this data.
[Deep Breath]
Photo by Pawel Nolbert on Unsplash

There are so many more than this; I have barely begun to scratch the surface. By doing this exercise, questions should begin to form in your mind about topics that interest you. Our strategy is going to be to discover how the book borrowing question interacts with all of these other variables, but we obviously cannot spend the time to think critically about each and every possible interaction.
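One way to do this kind of broad, shallow first pass is to profile every column before going deep on any of them. The sketch below is only an illustration: the column names and the toy rows are invented stand-ins for the survey file, not the real ANES variable names or values, and `profile_columns` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def profile_columns(rows, missing=("", "NA")):
    """Summarize each column: answers given, distinct values, most common value."""
    profile = {}
    if not rows:
        return profile
    for column in rows[0].keys():
        # Ignore blanks and "NA" so skipped questions do not count as answers.
        values = [row[column] for row in rows if row[column] not in missing]
        counts = Counter(values)
        profile[column] = {
            "answered": len(values),
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return profile


# Toy stand-in for the survey file; column names and values are invented.
toy_csv = io.StringIO(
    "borrowed_book,density,employment\n"
    "Yes,City,Working\n"
    "No,Rural,Retired\n"
    "Yes,Suburb,\n"
    "Yes,City,Working\n"
)
rows = list(csv.DictReader(toy_csv))
summary = profile_columns(rows)
for column, stats in summary.items():
    print(column, stats)
```

A quick profile like this shows which questions most people answered and how many categories each has, which helps decide where to look more closely.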

An aside: I often feel overwhelmed at this moment. The “right” analysis would look in every nook and cranny; it leaves no stone unturned and goes down every rabbit hole. But, there is never enough time on any project to do that. I try to remind myself that looking deeply at the important things is so much more useful than looking at everything shallowly. [Deep Breath] It’s ok to leave some stones gathering moss for the next round.

How can we prioritize? What should we do to end up with decision-making-information? First let’s talk about what decision-making-information is.

Decision-Making-Information

Folks, it’s not more complicated than it sounds. It’s the minimal information that you would need to make a decision.

Two roads diverged in a yellow wood…
Photo by Vladislav Babienko on Unsplash

A couple tips to remember when presenting decision-making-information:

  • Define it near where it is presented: this prevents confusion and ensures everyone has the same understanding.
  • Present it with relevant comparisons: relevant to the stakeholders, that is. The CEO might get a market-wide comparison while a junior analyst might review how this week compared to last week.
  • Support it with validation: that is, present it with the information needed to prove that the decision-making-information is correct. This could also mean anticipating and answering the first questions that would come to mind on seeing the analysis.
  • Format it to show the main point: whatever story you are trying to tell should be obvious from the way you present it.
  • Keep it minimal: we don’t have to achieve the platonic ideal of minimal, but the less we present, the clearer the decision-making-information is.

Step 2: Develop Questions

Now that we know what questions we can ask, let’s suggest several. At this point we can stay fairly broad. The analysis is going to become specific as we formulate hypotheses; developing questions is really about choosing topic areas that interest us. If we only answer the questions we ask, we have narrowed the task tremendously from Step 1.

Steps 2 and 3 (Develop Hypotheses) are also the perfect places to seek input from a wide array of people. At this point in the analysis, the more viewpoints, the better the final product.

Looking through the codebook that I linked above and the topics in Step 1, I spent a couple minutes thinking up questions that the Library would be interested in answering.

  1. How does social media usage affect book borrowing?
  2. Do people in rural areas have trouble accessing the library?
  3. Who uses the library more, younger or older people?
  4. Do people who use the library have similar interests? Are there types of interest programs we could consider?

A few others came to my mind that we cannot answer:

  1. What services do people prefer at the library?
  2. What complaints do people have about the library experience?

I only point these out to say that we probably want the answers to these questions, but we are not going to get them through this exercise. We have to keep our analyses grounded in what we can achieve.

Step 3: Develop Hypotheses

Now that we have a few questions to answer, we can move on to the most undervalued part of this process: developing hypotheses. Remember our goal is to tell a story using decision-making-information and up to this point in time we don’t have a story, only different threads that we can tug on.

Dig into those questions

By creating a hypothesis, we set up a story to unfold as our thinking comes into conflict with the data. When we eventually prove or disprove our hypothesis, we will have been forced to make a decision and to perform the analysis that led to it. Without this step, we lose the structure and the analysis can unravel as we get distracted by every loose end.

What makes a good hypothesis? Most importantly the hypothesis being proven or disproven should lead to different actions on the part of the person reading the analysis. We want to look at things that have real world impact. It also needs to be able to be proven or disproven with the data at hand.

Some hypotheses for our stakeholders at the Library:

  1. Library users use social media more than average
  2. Library use is lowest in rural areas, highest in urban areas
  3. Library goers are older than the general population
  4. Library goers are interested in outdoor activities

All of these could lead to useful ways to spend the billions we’ve been given. If library use in rural areas is lowest, then we would want to spend money in rural areas to increase use. If library goers are interested in outdoor activities then we should create programs that focus on outdoor activities.

While these are simple statements, proving or disproving each one is a high bar. Next, in Step 4, we apply traditional statistical analysis to our hypotheses to create decision-making-information.

Step 4: Prove or Disprove your Hypotheses

Let’s take this hypothesis as an example:

“Library use is lowest in rural areas, highest in urban areas”

and create decision-making-information to support or disprove it.

First, the most obvious step: let’s look at the result of the survey question itself. Overall, 58% of people said they have borrowed a book in the last year.

Book borrowing by rurality
Borrowing a book from the public library in the last year

Right off the bat our hypothesis is weakened: suburb and city dwellers use the library at ~61% rates, rural dwellers are next at 55%, and small town dwellers use it the least at 52%. Both rural and small town dwellers use the library less than people living in dense areas, but the gap between rural and small town dwellers is also potentially significant. So now the question becomes: why are small town people, and not rural people, the ones losing out most on library access?
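Whether a gap like 55% versus 52% is more than noise depends on how many people are in each group. A minimal sketch of one common check, a pooled two-proportion z-test, is below. The group sizes are hypothetical (the counts are not taken from the survey), and the test ignores survey weights, so treat it as a rough first look rather than the method used in this analysis.

```python
import math


def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    # Under the null hypothesis both groups share one underlying rate.
    pooled = (yes_a + yes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts chosen only to match the 55% and 52% rates:
# 275 of 500 rural respondents vs. 260 of 500 small town respondents.
z, p_value = two_proportion_z_test(275, 500, 260, 500)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these made-up group sizes the gap would not clear a conventional 5% threshold; with larger groups the same percentages could. That is why the difference is best described as potentially significant until the real counts are checked.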

To understand this, let’s compare small town people to people overall, with a separate comparison to rural people. I’ll also take a quick look at other data that might be of interest that we saw during the broaden out phase. We can then formulate a hypothesis about the small town question and try to prove or disprove it.

Cross-tabs for important demographic groups
Cross-tabs for several demographics

A few trends I notice right off the bat:

  1. Small towns are also generally older, having more people over 65 than any other column.
  2. Book borrowers are less often retired or disabled than other groups, and retired or disabled people are more concentrated in small towns. These groups make up 39% of small towns compared with 28% of book borrowers.
  3. Rural areas actually have a similar number of retirees to small towns, but fewer disabled people.
  4. Small towns are better educated than rural areas; they mainly lack post-grads but have an average number of 4-year degrees.

Based on this data, I hypothesize that small towns have the fewest book borrowers because they have the highest proportion of disabled people. Both small towns and rural areas lag behind denser areas, but the unique make-up of small towns presents the steepest challenges to library access.
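A cross-tab like the one above compares overlapping groups (everyone, book borrowers, small town residents, rural residents) on the same trait. Here is a minimal sketch of that calculation. The field names, the toy rows, and the `trait_share` helper are all invented for illustration and do not reproduce the survey's figures.

```python
def trait_share(rows, in_group, has_trait):
    """Share of the rows selected by in_group that also satisfy has_trait."""
    members = [row for row in rows if in_group(row)]
    if not members:
        return None
    return sum(1 for row in members if has_trait(row)) / len(members)


# Toy respondents; field names and values are invented for illustration.
rows = [
    {"borrowed_book": "Yes", "density": "City", "employment": "Working"},
    {"borrowed_book": "Yes", "density": "Suburb", "employment": "Working"},
    {"borrowed_book": "Yes", "density": "Small town", "employment": "Retired"},
    {"borrowed_book": "No", "density": "Small town", "employment": "Disabled"},
    {"borrowed_book": "No", "density": "Small town", "employment": "Working"},
    {"borrowed_book": "No", "density": "Rural", "employment": "Retired"},
    {"borrowed_book": "Yes", "density": "Rural", "employment": "Working"},
    {"borrowed_book": "Yes", "density": "City", "employment": "Student"},
]

# Each column of the cross-tab is a (possibly overlapping) group of people.
groups = {
    "Everyone": lambda r: True,
    "Book borrowers": lambda r: r["borrowed_book"] == "Yes",
    "Small town": lambda r: r["density"] == "Small town",
    "Rural": lambda r: r["density"] == "Rural",
}


def retired_or_disabled(row):
    return row["employment"] in {"Retired", "Disabled"}


shares = {
    name: trait_share(rows, in_group, retired_or_disabled)
    for name, in_group in groups.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.0%} retired or disabled")
```

Defining each column as a filter keeps the comparison honest: every group is measured on the same trait with the same definition.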

Book borrowing by employment status

Indeed we see when looking at book borrowing by employment status, retired and disabled people are the least likely to have borrowed a book.
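The calculation behind a chart like this is a rate per group, sorted so the lowest groups stand out. A minimal sketch follows, assuming each respondent is a dictionary with invented field names; the toy rows are made up and are not survey results.

```python
from collections import defaultdict


def borrowing_rate_by(rows, group_key, outcome_key="borrowed_book", yes="Yes"):
    """Return (group, share answering yes) pairs, lowest rate first."""
    totals = defaultdict(int)
    yeses = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[outcome_key] == yes:
            yeses[row[group_key]] += 1
    rates = {group: yeses[group] / totals[group] for group in totals}
    return sorted(rates.items(), key=lambda item: item[1])


# Toy respondents; field names and values are invented for illustration.
rows = [
    {"employment": "Working", "borrowed_book": "Yes"},
    {"employment": "Working", "borrowed_book": "Yes"},
    {"employment": "Working", "borrowed_book": "No"},
    {"employment": "Working", "borrowed_book": "Yes"},
    {"employment": "Retired", "borrowed_book": "No"},
    {"employment": "Retired", "borrowed_book": "Yes"},
    {"employment": "Retired", "borrowed_book": "No"},
    {"employment": "Disabled", "borrowed_book": "No"},
    {"employment": "Disabled", "borrowed_book": "No"},
    {"employment": "Student", "borrowed_book": "Yes"},
]

ranked = borrowing_rate_by(rows, "employment")
for group, rate in ranked:
    print(f"{group}: {rate:.0%}")
```

The same function works for any grouping variable, so swapping `"employment"` for a density or age column gives the other breakdowns in this post.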

At this point I am tempted to talk about reasons behind this, like access to the library and types of services offered. But we already know we don’t have that information in this survey. Our goal now should be to try to communicate the above as a story to stakeholders who could decide to pursue a project to ameliorate this issue.

At this point we are nearing decision-making-information. People in less dense areas use the library less than other groups. The demographics of these areas reinforce this trend. The relatively larger proportion of disabled people in small towns could influence the lower book borrowing we see in that area. This makes them targets for our billions to encourage the use of this public good.

Step 5: Validate

Our next step will be to seek to validate this by proving it a second way. If this argument is worth taking into the field, then the data should point to it from multiple directions. This will also give us an opportunity to see if there is a more logical way to lay out our argument.

Book borrowing by rurality as a child

Let’s look at the same question, but this time broken down by their recollections about where they grew up. The survey asks people:

“Growing up, did you mostly live in a rural area, small town, suburb, or a city?”

We can compare that to their current density.

My hope was to see the same trend as previously. Instead what we see is that people from rural areas are the ones least likely to borrow a book. What’s with this reversal?

My first thought is that people from rural areas moved to small towns and the composition of rural areas has changed.

People in suburbs and urban areas mostly stay, rural and small towns mostly leave
For each person, the density of where they lived as a youth compared to now

Indeed, the composition of rural America has changed in that time: 39% of people who lived rurally in their youth still do now, whereas only 31% of those who grew up in small towns still live in one. So small towns have changed more in composition than rural areas, but this is not enough of a transfer or reversal to prove that rural people moving to small towns is associated with reduced book borrowing.
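The "share who stayed" figures come from comparing two answers from the same person: where they grew up and where they live now. A minimal sketch of that comparison is below; the pairs are toy data invented for illustration, and `stay_rates` is a hypothetical helper.

```python
from collections import Counter


def stay_rates(pairs):
    """For each youth density, the share of people still living in that density."""
    grew_up = Counter()
    stayed = Counter()
    for youth, current in pairs:
        grew_up[youth] += 1
        if youth == current:
            stayed[youth] += 1
    return {density: stayed[density] / grew_up[density] for density in grew_up}


# Toy (youth density, current density) pairs; values are invented.
pairs = [
    ("Rural", "Rural"),
    ("Rural", "Small town"),
    ("Rural", "City"),
    ("Rural", "Rural"),
    ("Small town", "Suburb"),
    ("Small town", "Small town"),
    ("Small town", "City"),
    ("Small town", "City"),
    ("City", "City"),
    ("City", "City"),
]

rates = stay_rates(pairs)
for density, rate in rates.items():
    print(f"{density}: {rate:.0%} still live there")
```

A low stay rate means today's residents of that kind of place are mostly people who grew up somewhere else, which is why the two rurality questions can tell different stories.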

Because we see that the youth rurality survey question contradicts the current rurality question, our hypothesis about what is causing access issues in small towns is disproven. However, we have still accomplished our goal. It is definitely true that rural and small town dwellers are less likely to have borrowed a book, confirmed by two different sources. We can move on to synthesizing how we arrived there.

Step 6: Synthesize

Our final synthesis will incorporate all the elements above into the framework by which we are able to understand the results. This final step is key to the story telling. A great story will make it much easier for the readers to follow the analysis to its logical conclusion.

The framework I’d follow for this would be the following:

  1. People in small towns and rural areas are less likely to have borrowed a book.
  2. Small towns and rural areas differ from urban and suburban areas on several important measures, most notably employment status (more residents are retired or disabled) and age (more residents are older).
  3. Retired and disabled people are much less likely to use the library.

We start out with the main point we were trying to prove, that location does affect how often people use the library. We prove that location has associated factors that might impact ability to use the library. Finally, we limit our story to only what we know. At this point we don’t know how to get more access to rural areas, just that it is an issue that shows up in the data.

We have not overstepped our hypothesis or gone down every rabbit-hole we could. Instead we set a goal and managed to prove it one way and nod at its directionality another. Someone reading this information and the above visuals could make a decision about whether or not rural and small town access is an issue.

Conclusion

Reports have to tell a story so that the person reading the report absorbs as much as someone who read the findings summarized in paragraphs. Anything less and the report will not be useful: it will require a technical resource to continually generate a human-readable interpretation of it, that is, a story. Beyond the obvious inefficiency, this process is prone to error and bias; it also creates a single point of failure, with only one person or a small group able to bring data into strategy discussions. Siloing the creation and understanding of reporting with a few people automatically generates office politics; a much better approach is to make reporting everyone’s responsibility.

These are the key things to keep in mind when creating reports:

  1. Show comparisons. Any number with no comparison is more confusing than useful. Showing comparisons over time, amongst different groups, or to industry standards is a must.
  2. Have a specific audience in mind. The larger the audience, the more general the report has to be — limiting its effectiveness.
  3. Have the report contain a framework. Think about the key information being displayed and show that first. Think up a couple scenarios for what that chart could show and show the charts that would possibly explain those.
  4. Make reading the report and understanding it part of a recurring meeting. The report itself is not action and the creator of the report is not necessarily the person in the best position to act on it.
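As a small illustration of the first point, here is a sketch that presents each group's borrowing rate next to the overall rate, using the rounded percentages from the rurality breakdown earlier in this post (the ~61% figure is approximate). The `gaps_versus_baseline` helper is a hypothetical name.

```python
def gaps_versus_baseline(rates, baseline):
    """Percentage-point gap between each group's rate and a baseline rate."""
    return {group: rate - baseline for group, rate in rates.items()}


# Rounded percentages quoted in the rurality breakdown above.
overall_rate = 58
rates_by_density = {"Suburb and city": 61, "Rural": 55, "Small town": 52}

gaps = gaps_versus_baseline(rates_by_density, overall_rate)
for group, rate in rates_by_density.items():
    print(f"{group}: {rate}% ({gaps[group]:+d} pts vs. {overall_rate}% overall)")
```

Printing the gap alongside the raw number means the reader never has to do the subtraction, or guess which comparison was intended.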

By following the framework we just walked through, you’ll be able to take any project and perform analysis that drives real world action. Always tell a story with your data so that your analysis lands with impact.

Joel Shuman
The Startup

Data scientist and pythonista, former fundraising analyst for Bernie 2020. I help non-profits improve their digital fundraising. For more go to shoveldata.me