How to analyze NAs in multiple response surveys

By Huey Fern Tay, with Greg Page



Most likely, you have encountered questionnaires that ask you to ‘select all that apply’ in response to a particular prompt. These multiple response questions have their drawbacks, but they are an efficient way of condensing several possible answers into a single question on one page.

An important part of analyzing survey data involves calculating non-response rates. When the non-response rate spikes for a particular question, that might suggest that respondents did not wish to answer it (perhaps it was too personal), that an unclear question needs rewording, or that the survey is simply too long.

When survey response data is stored in a CSV, each option of a multiple response question typically receives its own column — and from the standpoint of organizing the data for quick analysis, that makes sense.

Suppose we were interested in the responses to Question 7: “What programming languages do you use on a regular basis? (Select all that apply)”.


The complete breakdown of answers to all parts of this question is depicted in tabular form in the screenshot below:

[Table: breakdown of answers to each part of Question 7]

However, determining the true extent of the missing values for multiple response questions is not as simple as averaging the null values in each column. Such a process, sketched in the code below, produces a misleadingly high figure for multiple response questions.

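As a minimal sketch of that naive calculation, assume the responses are loaded into a pandas DataFrame named df, with Kaggle-style column names in which each part of a multiple response question gets its own column (e.g. Q7_Part_1); the file name here is illustrative:

```python
import pandas as pd

# Load the survey responses (file name is illustrative).
df = pd.read_csv("kaggle_survey_responses.csv")

# Naive approach: take the share of NaN values in every column.
# For a multiple response question, every unselected option counts
# as missing, so the per-question average comes out inflated.
naive_missing = df.isnull().mean().sort_values(ascending=False)
print(naive_missing.head(10))
```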

Chart 1, shown below, delivers a graphical representation of the per-question missing value averages.

Chart 1

This chart does not distinguish between single response and multiple response questions; instead, it lumps all questions together, registers a respondent’s answer as NaN whenever any part of the question goes unanswered, and then spits out the mean value of those NaNs per question.

Chart 1 would mislead us into thinking that the non-response rate for Question 7 was greater than 80%, when it was approximately 10.62% (the percentage of respondents who did not select any option at all).

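A respondent has truly skipped a multiple response question only when every one of its option columns is empty. Here is a sketch of that calculation, under the same assumed column naming as above:

```python
# Gather every option column belonging to Question 7
# (Kaggle-style names such as Q7_Part_1, Q7_Part_2, ...).
q7_cols = [c for c in df.columns if c.startswith("Q7")]

# A respondent truly skipped the question only if *all* options are NaN.
skipped_q7 = df[q7_cols].isnull().all(axis=1)

print(f"True non-response rate for Q7: {skipped_q7.mean():.2%}")
```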

A better approach for capturing the response rate patterns is to separate the single response questions from the multiple response questions.
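One way to make that split — a sketch, assuming the multi-part column naming used above, where single response questions occupy one column and multiple response questions occupy several:

```python
from collections import defaultdict

# Group columns by question prefix (e.g. "Q7_Part_1" -> "Q7").
parts = defaultdict(list)
for col in df.columns:
    parts[col.split("_")[0]].append(col)

# Questions with one column are single response;
# questions with several columns are multiple response.
single_qs = {q: cols[0] for q, cols in parts.items() if len(cols) == 1}
multi_qs = {q: cols for q, cols in parts.items() if len(cols) > 1}

# Non-response rate for a multiple response question: all parts empty.
multi_nonresponse = {
    q: df[cols].isnull().all(axis=1).mean() for q, cols in multi_qs.items()
}
```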

Chart 2


As you can see from Chart 2, most people did not ignore Question 7. In fact, 15,531 people — 77% of all respondents — selected Python as a language they use on a regular basis. But because fewer people went on to select other options such as Bash, C++, and JavaScript, the average non-response rate for this question was inflated.
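For reference, here is a sketch of how the per-option counts behind Chart 2 could be tallied, using the same assumed column names:

```python
# Count how many respondents selected each Q7 option; each non-null
# cell in an option column represents one selection.
option_counts = df[q7_cols].notna().sum().sort_values(ascending=False)

# Express each count as a share of all respondents.
option_share = option_counts / len(df)
print(option_share.head())
```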

What else does Chart 2 tell us about missing values in the questionnaire? Let’s examine it together with Chart 3, which shows the extent to which people answered single response questions such as “What is your age?” and “For how many years have you been programming / writing code?”

Chart 3


We see that response rates drop to about 50% around Question 25. The downward trend continues until an uptick at the very end (Questions 38 and 39).

What can we infer from these results?

By giving people the option of selecting from a range of alternatives including ‘none’, the survey designers reduced the probability of social desirability bias, which happens when people answer questions in a way that makes them look good according to social norms.

Questions that ask respondents to ‘select all that apply’ also shorten the time it takes to complete the survey, because all options are presented in one question on a single page. In contrast, even though single response questions may compel respondents to think a little harder about each answer, they may also cause people to quit prematurely because the survey takes longer to complete. Imagine what would happen if we broke every multiple response question down into something like this:

Do you use Python?
Yes
No

Do you use SQL?
Yes
No

Do you use R?
Yes
No

Repeat this format 9 more times for just one question. The survey would be far too tedious to complete.

That said, the survey was probably a little long for most people. Every multiple response question contained a ‘None’ option, which respondents could have checked if the question was not relevant to them. Since in many instances people selected neither an answer nor ‘None’, we can reasonably assume they skipped those questions.

At the same time, useful insights can be gleaned from the folks who completed all 39 questions. These people are enthusiastic individuals who genuinely want to help the survey designers understand developments within the data science community. The survey designers could consider conducting extensive qualitative interviews with some of these respondents to obtain a deeper insight into industry developments.

Data source: Kaggle
