How (Not) To Construct A Proper Questionnaire

Things to keep in mind when you create your own survey or questionnaire for your team or organization

Christiaan Verwijs · Published in The Liberators
16 min read · Aug 1, 2022


Have you ever set up a survey to assess how people feel about something? Perhaps tried to assess maturity, experience, or growth? If so, this post is for you.

While tools like SurveyMonkey and Google Forms make it easy to set up a questionnaire, setting up a proper survey is harder than you think. More often than not, the results are unreliable and don’t actually measure what you want to measure. The problem is that proper questionnaire development requires techniques, concepts, and software that most people have never heard of. This is problematic when the results are used to make decisions that impact real people and real teams.

In this post, I explain the process of questionnaire development through the Scrum Team Survey that I’ve been developing with support from Barry Overeem and Daniel Russo, Ph.D. This is an extensive questionnaire that you complete together with your team and its stakeholders to diagnose your Scrum team. You can use our questionnaire for free, and we offer advanced features that you can enable with a subscription.

Note: There are other approaches to survey construction, like Item-Response Theory (IRT). In IRT, the content of the questions matters less than their statistical ability to discriminate between the groups you want to tell apart. IRT is very powerful. At the same time, it has been criticized by many academics for its lack of face validity. It is also far more complex than the approach outlined in this post.

A bit of background: my experience with questionnaire development

“But wait, I know you as a Scrum Master and developer, not a statistician — why should I trust what you write?”. Good question, and great that you’re skeptical. So let me paint the background of my experience a bit.

The construction of questionnaires is a professional field in its own right. I ventured into this field while majoring in organizational psychology at the University of Utrecht. Contrary to my expectations, I greatly enjoyed statistics and joined every advanced course I could — structural equation modeling, multilevel modeling, questionnaire construction, item-response theory, and so on.

Since then, I’ve always found excuses to develop questionnaires. It was an integral part of my master thesis and (uncompleted) PhD trajectory, where I developed a large questionnaire for the Dutch Armed Forces. I didn’t do this alone, but with Frank van Boxmeer, Robert de Bruin, and our professor Martin Euwema. This questionnaire has been used by the Dutch Armed Forces for many years. I’m proud that it contributed to the prevention of burnout and PTSD in military personnel on many occasions and even brought relief to entire platoons in mission areas such as Afghanistan. The development and application of questionnaires also constituted a large part of my first professional job at the Institute for Work and Stress and TNO (the Netherlands Organisation for Applied Scientific Research). After that, my love for questionnaires and statistics seeped into my work with Scrum teams, initially with TeamMetrics, and now with Barry Overeem on the Scrum Team Survey. Recently I started publishing the data from the Scrum Team Survey, as well as its construction, with Daniel Russo, Ph.D.

Screenshot of (part of) the questionnaire we created for the Scrum Team Survey

Three design concepts of questionnaires

What makes a properly designed questionnaire? Although there are entire academic books on this topic, I think it boils down to three core concepts that set the stage for the rest of this post.

1. Questionnaires measure one or more latent factors

In organizational research, we are often interested in how people feel about themselves or their work. How satisfied is someone with their job? How much stress do they experience? How much autonomy do they feel? Since we can’t look inside people’s brains to measure how they feel, we have to ask them instead. One way to do this is by asking a series of questions that are intended to capture aspects of their experience of something (like their job, stress, etc). These questions all tap into a common underlying factor. We call such factors latent factors. We can’t measure them directly, only indirectly through how people respond to a series of questions.

Two latent factors that we measure with the Scrum Team Survey: Team Morale and Stakeholder Happiness. Both factors are measured with 3 questions that tap into different aspects of that factor. The way we measure them here is based on existing scientific literature.
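
To make this more concrete, here is a small simulated sketch in Python (the factor, the loadings, and the noise levels are all made up for illustration). A latent factor that we never observe directly drives the answers to three items, and that shared cause is exactly why the items end up correlating:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300  # hypothetical number of respondents

# A latent factor we cannot observe directly (e.g. "Team Morale").
morale = rng.normal(0, 1, n)

# Three observable items, each a noisy reflection of the latent factor.
items = pd.DataFrame({
    "TM1": 0.80 * morale + rng.normal(0, 0.60, n),
    "TM2": 0.70 * morale + rng.normal(0, 0.70, n),
    "TM3": 0.75 * morale + rng.normal(0, 0.65, n),
})

# Because the items share a common cause, they correlate with each other,
# even though the latent factor itself never appears in the data.
print(items.corr().round(2))
```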

2. Questionnaires must contain precise and concise questions

All students of questionnaire development quickly discover just how hard it is to write good questions. Here are some of the rules you quickly learn:

  • A good question asks one thing. A question like “How satisfied are you with your work and your colleagues?” might seem okay. But what is the participant responding to here? Their work or their colleagues? Or both? Or take a question like “How do you rate our skills and our ability to apply them?”. It is impossible to be certain what part of the question someone based their response on, so you can’t assume that the same response always means the same thing. This greatly complicates the interpretation.
  • A good question isn’t loaded or biased. A question like “How great do you think our service is? (1–5)” may seem innocuous. But its phrasing pushes people in a positive direction. Leading questions already contain adjectives that frame and direct the answer. This messes up your analyses because you don’t know what a score of 1 means. Does it mean that the service was “great” (but not better than great), or does it mean that the service was “horrible”? Plus, you may inadvertently encourage people to be overly positive by biasing them in that direction.
  • A good question doesn’t use jargon or complicated language. The problem with jargon is that some participants may not be familiar with a term. A complicated question, like one with a double negative (“do you not agree that it isn’t helpful to…”), may be hard to comprehend. The problem with questions that violate these principles is that they stop measuring the intended latent factor, and instead measure participants’ familiarity with jargon or their reading comprehension skills.

These are some examples of the considerations that come into play when writing proper questions for questionnaires. The pattern that connects these rules is that violating them changes the meaning of the answers in unexpected ways. This complicates the analyses and the interpretation and may lead you on a wild goose chase.

It is impossible to write perfect questions. You’re always balancing precision and pragmatism. For example, the questionnaire in the Scrum Team Survey still has some questions with Scrum-related jargon like “Sprint Review” or “Product Owner”. For example: “The Product Owner of this team uses the Sprint Review to collect feedback from stakeholders”. We tried to write jargon-free alternatives, but they always ended up as long, awkward sentences in which we were basically explaining a Sprint Review. We alleviated some of this by adding informative prompts for jargon terms in the questionnaire.

Fortunately, badly-written questions can often be identified with the statistical techniques we explore a bit further on.

3. Questionnaires use multiple questions per latent factor

The latent factors that we’re interested in are inherently subjective. This is simple to understand if I ask you: “How satisfied are you with your work?” (scale from 1 to 5). What do you take into consideration when you answer this question? Perhaps you consider your salary, the atmosphere at work, or the relationships with your colleagues. But a colleague may consider entirely different aspects, like how they like their desk, the customers, or their manager. You and your colleague may also interpret the words “work” and “satisfied” differently. The point here is that while you and your colleague answer the same question, we don’t know if the answer actually means the same thing. This is problematic if we want to calculate an average score based on what you and your colleagues answer.

This is why properly designed questionnaires use two or more questions to measure the same latent factor. Each question taps into a slightly different aspect of the same latent factor. So in the example of job satisfaction, we could ask questions like “How satisfied are you with your job?” and “How happy are you with your work?”. This collection of questions to measure a single latent factor is called a scale. The items in a single scale can appear quite similar, but they’re never precisely the same.

There is another reason for wanting multiple questions on a scale. When you have at least two questions that are slightly-similar-and-slightly-different, you can analyze the statistical variation of the answer patterns. Without diving too much into raw statistics, the key point is that this variation allows us to assess the quality and reliability of a questionnaire with the statistical techniques outlined in this post. None of that would be possible with only a single question per factor.
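
One of those techniques is a reliability analysis. As a minimal sketch with hypothetical responses to a three-item job satisfaction scale (the third item is made up for illustration), this is what estimating a scale’s reliability with Cronbach’s alpha looks like using the open-source pingouin package in Python:

```python
import pandas as pd
import pingouin as pg

# Hypothetical responses (1-7) from six people to three items of one scale.
scale = pd.DataFrame({
    "JS1": [6, 5, 7, 3, 4, 6],  # "How satisfied are you with your job?"
    "JS2": [6, 4, 7, 2, 4, 5],  # "How happy are you with your work?"
    "JS3": [5, 5, 6, 3, 3, 6],  # "How much do you enjoy your work?" (made up)
})

# Cronbach's alpha uses the shared variation between the items to estimate how
# reliably they measure a single underlying factor (> .70 is a common rule of thumb).
alpha, ci = pg.cronbach_alpha(data=scale)
print(f"alpha = {alpha:.2f}, 95% CI = {ci}")
```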

How to test the quality of your questionnaire

The design concepts I just addressed are useful to understand how proper questionnaires work. But how do you know if they’re actually measuring the right things in the right way?

1. Are you asking the right questions?

So how do you find the right questions to measure a latent factor? The first step is to research existing academic literature on a factor. For example, there are existing, validated scales for factors such as “group cohesion”, “work motivation”, “burnout” and many, many others. Some scales are available for free. Others require a license. Which is which often isn’t clear at all. An entire industry has sprung up around the development and licensing of well-established psychometric scales and questionnaires.

If existing scales are not available, or are too expensive, you can develop your own. In that case, existing academic literature can provide guidance on how to define and measure something like “psychological safety”. You can then create a set of potential questions and test it with participants. But then what?

There are several statistical techniques that you can then use to identify which questions actually measure the same factor and which seem to measure something else, even if it is related. One such technique is Factor Analysis, in either its exploratory form (EFA) or its confirmatory form (CFA). To do this, you load all response data into specialized statistical software, like IBM’s SPSS or SAS, to see which questions cluster together. Questions cluster together when responses to them correlate more strongly with each other than with responses to other questions. This indicates the presence of underlying latent factors.

Here is an example of a set of questions that we initially developed for a latent factor called “Stakeholder Collaboration”, or the degree to which a Scrum Team actively collaborates with users and customers (on a 7-point Likert).

The initial set of questions to measure the degree to which teams interact with stakeholders. We used a Likert scale (1–7).

When we processed the actual responses from 300 participants in SPSS with an Exploratory Factor Analysis (EFA), we actually found three separate latent factors. We also found that most of the questions were of questionable quality (as indicated by a factor loading below .60). This means that there are too many differences in how participants answer the questions, and we’re not getting consistently reliable results. For example, a question like “The Product Backlog of this team is easily accessible to stakeholders” is apparently too vague. “Easily” is very subjective. And what does “accessible” mean?

The picture below shows the “pattern matrix”. The numbers in columns 1, 2, and 3 represent the “factor loadings” of the questions. This is the strength with which a question loads on its latent factor. Stronger is better, and a value above .60 is generally preferable. I’ve suppressed items with a factor loading below .30.
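
We ran this analysis in SPSS. If you want to try something similar without a license, here is a rough sketch of the same idea using the open-source factor_analyzer package in Python. The file name and column setup are hypothetical:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Assumed: one row per participant, one column per question (e.g. 300 x 9, scored 1-7).
responses = pd.read_csv("stakeholder_collaboration_items.csv")  # hypothetical file

# Exploratory factor analysis with an oblique rotation, because we expect the
# latent factors to correlate with each other.
efa = FactorAnalyzer(n_factors=3, rotation="oblimin")
efa.fit(responses)

# Pattern matrix: the loading of every question on every factor.
pattern = pd.DataFrame(
    efa.loadings_,
    index=responses.columns,
    columns=["Factor 1", "Factor 2", "Factor 3"],
)

# Suppress small loadings (< .30) to make the structure easier to read.
print(pattern.where(pattern.abs() >= 0.30).round(2))

# Flag questions whose strongest loading still falls below .60.
weak = pattern.abs().max(axis=1) < 0.60
print("Questionable items:", list(pattern.index[weak]))
```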

So why is this finding important? If we hadn’t run the factor analysis, we would’ve assumed that these 9 questions measured a single latent factor. We would’ve then calculated a factor score (like an average) and presented that: “This team scores a 6.5 on ‘Stakeholder Collaboration’”. But this score has no real meaning when it is based on three separate latent factors. The fact that they are separate means that a team could score high on factor 1 but low on factors 2 and 3. Or vice versa. The high score on one factor would vanish when you average the scores, which means that you lose a lot of resolution.
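
Here is a tiny numerical illustration of that loss of resolution, with made-up factor scores:

```python
import numpy as np

# Two hypothetical teams with very different profiles on the three factors.
team_a = [6.8, 2.1, 2.3]  # strong on factor 1, weak on factors 2 and 3
team_b = [3.8, 3.7, 3.7]  # mediocre across the board

# Averaging across distinct factors hides that difference almost entirely.
print(np.mean(team_a), np.mean(team_b))  # both come out around 3.7
```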

So from the initial set of 9 questions, we kept 3 and removed the rest. We also created new questions and tested the scale again with new participants. Over a dozen iterations, we ended up with a scale of 5 items for “Stakeholder Collaboration” that clustered on a single factor:

2. Is the factorial structure of your questionnaire of good quality?

The questionnaire that we developed for the Scrum Team Survey consists of 19 factors at the time of writing. This includes factors such as “team morale”, “value focus” and “release automation”. We reached that point by iterating over potential questions and scales until we arrived at a stable solution. This is called a factorial structure, or a measurement model.

A solution is stable when the questions load primarily on their intended latent factors, and not too much on other factors. A perfect solution doesn’t exist, as all questions load on all factors to some extent. The world is much messier than our models. A good test of the entire questionnaire is to load all completed questionnaires into statistical software to see if you indeed get 19 factors out of a Confirmatory Factor Analysis (CFA). The pattern matrix for the Scrum Team Survey Questionnaire is shown below. I removed all values below .30 from the matrix for clarity, but you’d normally see values in every cell. The “stair” pattern, combined with the fact that each “step” generally has questions from the same scale, is a good indicator that the 19 factors are present in the data and are also distinct enough.

Results from a Confirmatory Factor Analysis. I suppressed factor loadings below .30 for clarity. I also show only the question names rather than the whole question to save space. But you can clearly see that questions for the same scale (e.g. TM1, TM2, TM3) cluster together rather than with other questions.
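
Again, we used dedicated statistical software for this. As a sketch of what such a CFA looks like in code, here is a two-factor slice of the model using the open-source semopy package in Python. The item names follow the naming above (TM1–TM3 for team morale; SH1–SH3 for stakeholder happiness is my shorthand here), and the data file is hypothetical:

```python
import pandas as pd
import semopy

# Hypothetical data file with one column per question.
data = pd.read_csv("scrum_team_survey_responses.csv")

# In a CFA you specify up front which questions belong to which latent factor
# (here only 2 of the 19 factors, to keep the sketch short).
model_desc = """
TeamMorale =~ TM1 + TM2 + TM3
StakeholderHappiness =~ SH1 + SH2 + SH3
"""

model = semopy.Model(model_desc)
model.fit(data)

# Factor loadings and factor correlations.
print(model.inspect())

# Fit indices (CFI, RMSEA, ...) tell you how well the hypothesized factorial
# structure reproduces the observed response patterns.
print(semopy.calc_stats(model))
```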

3. Are your questions valid?

A nice factorial structure is great. But it only tells you something about the structure of the data. It doesn’t tell you if the questions are actually valid or meaningful. There are different types of validity to consider.

Content validity is the degree to which the questions you use for a latent factor actually touch all relevant aspects of that factor. For example, if I’m interested in job satisfaction, it is not sufficient to only ask “How satisfied are you with your salary?”. Job satisfaction is probably a broader concept that includes more aspects of the job, and you’d want questions for each of those aspects. There are no statistical techniques to assess content validity. Instead, this is often a matter of qualitative and observational research. For example, you could interview several people to assess what job satisfaction consists of for them (e.g. do I like my colleagues? do I like my salary?), and use that to create questions. You can also rely on prior research by scientists. For example, if you want to measure burnout, it is well established that it consists of three components: emotional exhaustion, cynicism (or depersonalization), and a reduced sense of personal accomplishment. By assessing all these components in your questions, you increase content validity. Our academic paper on the Scrum Team Survey (under peer review) describes how we attempted to achieve content validity through observational case studies and reliance on existing theories.

Discriminant validity is the degree to which the factors in a questionnaire are meaningfully different. A great example of this was our attempt to measure conflict in teams. From the scientific literature, we learned that there are two types of conflict in teams: task conflict and relational conflict. So it is possible for a team to score high on one, but not necessarily equally high on the other. Many scientists expect that relational conflict is more damaging to teams than task conflict, which is why the distinction matters to us too.

So we added two scales based on existing literature. This is the 3-question scale for “Task conflict”:

1. There is often disagreement in this team about how to do the work.
2. In this team, there are often conflicting ideas about how to do the work.
3. There is often conflict about the work that I do in this team.

And this is the 3-question scale for “Relational conflict”:

1. The members of this team often experience moments of friction with each other.
2. Different personalities in this team often clash or disagree with each other.
3. There are often moments of tension between members of this team.

While both scales are ultimately about team conflict, we expected them to distinguish between two types of conflict. But a so-called heterotrait-monotrait (HTMT) analysis concluded that “Relational Conflict and Task Conflict are statistically indistinguishable”. This means that when teams score high on one, they score so similarly on the other that it doesn’t make sense to distinguish between the two. So we used these insights to drop the scale for “task conflict” altogether.
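
The HTMT ratio itself is not hard to compute once you have the item correlations. Here is a sketch in Python with hypothetical column names for the six conflict items; values above roughly .85 to .90 are commonly taken to mean that two factors are not empirically distinct:

```python
import numpy as np
import pandas as pd

def htmt(df: pd.DataFrame, items_a: list, items_b: list) -> float:
    """Heterotrait-monotrait ratio of correlations (Henseler et al., 2015)."""
    corr = df[items_a + items_b].corr().abs()

    # Average correlation between items of *different* constructs.
    hetero = corr.loc[items_a, items_b].values.mean()

    # Average correlation between items of the *same* construct
    # (upper triangle only, excluding the diagonal of 1s).
    def mean_within(items):
        c = corr.loc[items, items].values
        return c[np.triu_indices_from(c, k=1)].mean()

    return hetero / np.sqrt(mean_within(items_a) * mean_within(items_b))

# Assumed: a DataFrame with responses to the six conflict questions.
conflict = pd.read_csv("conflict_items.csv")  # hypothetical file
ratio = htmt(conflict, ["TC1", "TC2", "TC3"], ["RC1", "RC2", "RC3"])
print(f"HTMT = {ratio:.2f}")
```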

Criterion or predictive validity is the degree to which the results from a latent factor actually predict something meaningful. One example is the latent factor “Stakeholder Satisfaction” in the Scrum Team Survey Questionnaire. We measure this by asking team members to rate how satisfied they think their stakeholders are. This is obviously an indirect measure of what we’re really interested in. And it’s not hard to see how team members might be biased to inflate the results. So we added a short questionnaire to the Scrum Team Survey that teams can send to their actual stakeholders and that measures their actual satisfaction. We are now able to calculate the degree to which “stakeholder satisfaction as reported by the team” correlates with “actual stakeholder satisfaction”. Although we’re still building a sufficiently large dataset, we can already see that both scales are clearly correlated (between r=0.46 and r=0.72). If one is high, the other tends to be high too: in standardized terms, when one increases by 1 point, the other increases by somewhere between .46 and .72 points. This shows that our indirect measure of stakeholder satisfaction is not perfect, but it gives us an approximation of the real value. We aim to perform similar tests for other areas in the Scrum Team Survey.
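
Checking this kind of criterion validity ultimately comes down to correlating the indirect measure with the criterion it is supposed to predict. A minimal sketch with made-up paired scores:

```python
from scipy.stats import pearsonr

# Hypothetical paired scores per team: how satisfied the team thinks its
# stakeholders are, and how satisfied those stakeholders say they are (1-7).
team_reported = [5.3, 6.1, 4.2, 5.8, 3.9, 6.4, 5.0]
stakeholder_reported = [5.0, 6.3, 4.6, 5.5, 3.5, 6.0, 5.4]

# The correlation tells us how well the indirect (team-reported) measure
# tracks the criterion we actually care about.
r, p = pearsonr(team_reported, stakeholder_reported)
print(f"r = {r:.2f} (p = {p:.3f})")
```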

Finally, construct validity is the degree to which your scale of questions actually measures what you intended to measure. A good example of this is “psychological safety”. You may decide that good questions for this would be “I never get criticized by my team” and “People in my team compliment me”. While these questions may do well on the other types of validity, and work statistically, they don’t actually measure psychological safety as it is defined and understood in scientific studies. There, psychological safety is generally understood to mean “the degree to which people in a team feel safe to take interpersonal risks”, which specifically includes the ability to criticize each other constructively. In order to develop measures with good construct validity, you need to carefully look at how constructs are defined in studies and by experts who are very familiar with them. Otherwise, you may end up with meaningless results.

4. Are your questions free of response bias?

The final point regarding quality that I want to address is response bias. The reason why questionnaires are so prevalent is that they are often the only way to get a sense of how people feel about something. In order to know how happy someone is with their job, we have to ask them. To know whether people experience enough autonomy, we have to ask them. This data is inherently self-reported. When a participant reads a statement like “Most of the time, I have the freedom to decide how to do my work”, they effectively reflect on their recent experiences and pick an answer.

The problem with all self-reporting is that it is potentially biased. Some people have overly optimistic views of themselves or their team. Some people may have an agenda and fill in the questionnaire in a way that supports it. Other people may seek to protect their job and avoid any kind of criticism out of fear of retaliation. Even subtle factors such as how the survey is presented, what language is used, and how it is introduced will probably affect how people answer the questions.

So there is always bias. Fortunately, the techniques that I outlined in this post are helpful here too, because biases are also latent factors that can be measured. This means that questions that are prone to bias will jump out from the various analyses that I shared in this post. When one question in a scale is more prone to bias than the other questions on that scale, we will be able to identify and remove it with a Confirmatory Factor Analysis or a Reliability Analysis. With advanced statistical techniques like Structural Equation Modeling, we can even model and, to some extent, control for the measurement error that is caused by bias.

You can also control for some bias through the use of a so-called Common Method Factor (CMF or CLF). This is a work-in-progress screenshot of both the measurement model and our theorized causal model for the Scrum Team Survey Questionnaire. The CMF is visible below as a box, with arrows leading to all individual questions.
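
In model syntax, a CMF is simply one extra latent factor that is allowed to load on every question, next to the substantive factors. Here is a simplified sketch in the same lavaan-style syntax used by semopy (two factors instead of our full model, and SH1–SH3 are again assumed item names):

```python
# A common method factor (CMF) is an extra latent factor that loads on every
# question, so it can absorb variance the questions share only because they
# come from the same self-reported survey. In a full model, the CMF is usually
# also constrained to be uncorrelated with the substantive factors.
cmf_model_desc = """
TeamMorale =~ TM1 + TM2 + TM3
StakeholderHappiness =~ SH1 + SH2 + SH3
CMF =~ TM1 + TM2 + TM3 + SH1 + SH2 + SH3
"""
```

You would then fit this description in the same way as the CFA sketch above and compare the loadings and fit with and without the method factor.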

And there are other techniques too. For example, we use control questions in the Scrum Team Survey that are unrelated to what we’re interested in. In our case, we used three questions from the Social Desirability Scale (SDS-5). This scale measures how inclined a participant is to answer in a socially desirable way, which is one way to measure bias. We then use the score that each participant gets for this factor to statistically compensate all their other answers for it. So if someone scores very high on social desirability, we know that their other answers are probably biased upwards too. We statistically “correct” their answers for that bias. It’s certainly not perfect, and it certainly doesn’t prevent all forms of bias, but it is one way to reduce the bias that is inherent to questionnaires. The bias this addresses is usually called common method bias; the common method factor (or common latent factor) described above is one widely used way to control for it.
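
Our actual correction is more involved, but the general idea can be sketched as follows: remove the part of each scale score that is predictable from the social desirability score. The data file and column names below are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed: one row per participant, with their average scale scores and their
# score on the social desirability (SDS-5) items.
df = pd.read_csv("survey_scores.csv")  # hypothetical file and columns

def correct_for_bias(scores: pd.Series, bias: pd.Series) -> pd.Series:
    """Remove the part of `scores` that is predictable from the bias measure."""
    model = sm.OLS(scores, sm.add_constant(bias)).fit()
    # The residuals are the scores with the socially-desirable component taken
    # out; adding the mean back keeps them on a familiar scale.
    return model.resid + scores.mean()

df["team_morale_corrected"] = correct_for_bias(
    df["team_morale"], df["social_desirability"]
)
```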

Closing Words

After reading through this post, I hope you have developed a better understanding of what it takes to design a proper questionnaire, and why that matters. Many organizations dabble with questionnaire development without the skills, the knowledge, and the tools to do it right. This isn’t helpful, as you end up measuring the wrong things in the wrong way, and making the wrong decisions based on that information.

If you can, I would recommend using off-the-shelf questionnaires that are academically validated and supported. You can also hire a professional questionnaire developer to develop a questionnaire with you, write proper questions and analyze the results correctly.

This post may also address some of the worries that people have about self-reported questionnaires. Although surveys are biased, statisticians have developed a number of sophisticated techniques to reduce that bias. And since we often don’t have any other way to measure how people feel about things, it’s great to know that questionnaires remain helpful here.

Most of our content is entirely funded by the community. This includes our research. Please support our work too. You can do this by joining us at patreon.com/liberators or by subscribing to the Scrum Team Survey.

Christiaan Verwijs
The Liberators

I liberate teams & organizations from de-humanizing, ineffective ways of organizing work. Developer, organizational psychologist, scientist, and Scrum Master.