Greenhouse’s Scorecard Rating System

Mona Khalil
In the weeds
Published in
4 min read · Nov 4, 2019

If you’ve ever seen or filled out a scorecard in Greenhouse Recruiting, you likely noticed our unusual measurement scale for assessing candidate attributes and performance. We’re regularly asked why we chose a combination of colors and emojis instead of a traditional 5-point scale (i.e., 1 to 5, Strongly Disagree to Strongly Agree), and we’re met with a variety of feedback on our choice.

Our choice of measurement scale was far from random — in fact, it was heavily informed by research in the field of psychometrics. Psychometrics is the study of quantitative measurement practices in the social sciences. A psychometrician generally researches best practices in evaluating the quality of metrics (i.e., survey items), measurement scales (i.e., Strongly Disagree to Strongly Agree), and other related factors that affect how accurately you capture the behavior or process you’re trying to measure [1].

Throughout this article, we’ll explain our justification for excluding numeric and worded scales, and our ultimate choice to use color & symbolic scales in our product. We’ll also share some interesting findings about how even the smallest tweaks to a measurement scale change the way people respond to surveys.

Showing numbers on a scale can significantly impact how people respond to your questions.

Have you noticed the wide variety of numerical scales used in survey questions to represent the same set of response choices? A common 5-point Likert scale offers choices numbered 1 through 5, anchored from Strongly Disagree to Strongly Agree. A common variation keeps the same anchors but renumbers the choices (for example, from −2 to +2 instead of 1 to 5).

Several studies on the use of Likert scales have demonstrated that people’s responses will vary based on the numbers shown (or not shown) to survey participants [2]. Survey participants tend to evaluate and assign different weights to the numbers themselves, which can introduce bias into any survey scale. Traditional academic research tends to collect data from larger samples (i.e., 200–300 respondents), which helps balance out the variation arising from individual interpretations of a scale. That’s generally much higher than you’ll see in the hiring process — as a hiring manager, you’ll likely receive candidate ratings from 5 to 10 people, and even fewer responses on each individual attribute being evaluated. Any step we can take to reduce error and bias improves the quality of your candidate evaluations.
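A minimal sketch can make the sample-size point concrete: the standard error of a mean rating shrinks with the square root of the number of raters, so a hiring-sized panel carries far more uncertainty than a survey-sized sample. The numbers below are illustrative assumptions, not Greenhouse data.

```python
def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean rating: sd / sqrt(n)."""
    return sd / (n ** 0.5)

# Assumed spread of individual ratings on a 5-point scale.
rating_sd = 1.0

panel = standard_error(rating_sd, 7)     # a typical 5-10 person interview panel
survey = standard_error(rating_sd, 250)  # a typical academic survey sample

print(f"SE with 7 raters:   {panel:.2f}")    # ≈ 0.38
print(f"SE with 250 raters: {survey:.2f}")   # ≈ 0.06
```

With only a handful of raters, the average rating can easily swing by a third of a scale point in either direction, which is why removing sources of scale-interpretation bias matters so much more in hiring than in large-sample research.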

There are strong cultural and regional differences in how people choose to respond to worded measurement scales.

Did you know that residents of the United States tend to respond more positively to most survey questions than respondents in many other parts of the world? [3] These differences don’t reflect cultural differences in optimism or agreement — in fact, people in the United States tend to be less positive about, and less trusting of, institutions as a whole [4]. In the U.S., respondents will likely answer Strongly Agree unless they have a clear reason to disagree with the statement being made. In other parts of the world, such as mainland China [3], respondents will answer more neutrally unless they have a clear justification for strongly agreeing with the statement.

Excluding a worded scale ensures that your hiring managers can be confident that your candidate is being evaluated consistently across interviewers, and is not subject to bias associated with different individuals’ interpretations of Strongly Agree/Strongly Disagree.

People tend to respond more consistently and powerfully to color heuristics than they do to words.

Humans process colors and symbols in a different region of the brain than words or numbers [5]. We also more readily associate colors with a negative or positive valence [6]. By using shades of red, yellow, and green as heuristics to evaluate your candidates, we hope to provide you with more consistent results across individuals and teams as you make your hiring decisions.

Conclusion

Ultimately, our scorecard rating setup was created to provide customers with the best possible platform for reducing bias and individual differences. We’re always open to feedback and evidence-based approaches that bring us closer to that goal.

References

[1] Psychometric Society: What is psychometrics? https://www.psychometricsociety.org/content/what-psychometrics

[2] Weijters, B., Cabooter, E., & Schillewaert, N. (2010). The effect of rating scale format on response styles: The number of response categories and response category labels. https://doi.org/10.1016/j.ijresmar.2010.02.004

[3] Lee, J. W., Jones, P. S., Mineyama, Y., & Zhang, X. E. (2002). Cultural differences in responses to a Likert scale. Research in nursing & health, 25(4), 295–306. https://doi.org/10.1002/nur.10041

[4] Twenge, J. M., Campbell, W. K., & Carter, N. T. (2014). Declines in trust in others and confidence in institutions among American adults and late adolescents, 1972–2012. Psychological Science, 25(10), 1914–1923. https://doi.org/10.1177/0956797614545133

[5] Peterson, B. S., et al. (1999). An fMRI study of Stroop word-color interference: Evidence for cingulate subregions subserving multiple distributed attentional systems. Biological Psychiatry, 45(10), 1237–1258. https://doi.org/10.1016/S0006-3223(99)00056-6

[6] Piotrowski, C., & Armstrong, T. (2012). Color red: Implications for applied psychology and marketing research. Psychology and Education: An Interdisciplinary Journal, 49, 55–57.


Data Scientist @ Greenhouse. Co-host @ Bad Methods Podcast. Passionate about ethics in data science. Twitter @mona_kay_