So I guess we can instantly calibrate performance data now??? 🤨

My search for truth begins with the hypothesis that the current method for collecting performance data is flawed.

Jessica Zwaan
Incompass Labs


Something I presumed wasn’t really ‘improvable’ — performance calibration. It’s arduous, labor-intensive, and still struggles to remove the biases it aims to mitigate. That said, as far as I could work out, there just wasn’t really a better way. Heck, only a few months ago I wrote a three-part blog series that basically centered on the idea that you needed to run performance calibration in order to get to the “truth” (or as close as possible). I accepted it as a necessary evil.

That said, I’m not really the kind of person who likes to accept the status quo. I love prodding into things and working out how we can do things better. I specifically want to be someone to do this when it comes to hard problems like meritocracy and truth in the workplace. I want People teams to have good quality data, built on a foundation we can trust. Calibration was the path to get there, but I still knew it was deeply flawed.

A week or so ago I wrote a blog post about HR analytics:

Are managers capable of calibrating qualitative data and turning it into performance assessments (and, more pointedly, quantitative outcomes)?

On the answer to this question, HR leaders have constructed an entire Rube Goldberg machine of HR analytics. People teams can analyze pay, age, geographic information, and job details, but what do we have to weigh this against if we don’t have reliable performance data? The result is that CEOs do not trust performance data, the fundamental building block of HR analytics.
- Yeah, we’re nowhere near fixing HR analytics, September 2023

Within the blog I lamented the state of HR data if it’s built on the foundation of performance calibration. Not just for business success, but also because it’s just the right thing to do by your team. We should be rewarding and recognizing people with data we can deeply trust. Anything less than that feels like bad business, and bad human-ing, frankly.

It’s a hard truth. Capitalism being capitalism, if you’re working in a company you’re being assessed one way or another. Your pay changes, promotions, and growth opportunities are dictated by the systems we’re building in People Ops; many of which are deeply flawed, biased, and baked into the fabric of everyone’s working lives. I personally want those decisions to be as close to meritocracy as possible. The idea of a biased manager or a cultural preference for extroverts being the force that compounds inequalities isn’t tolerable for me. It shouldn’t be for you either, probably, but who am I to say.

Look, there is a better way to do this. There has to be. How can we get deep, unbiased insights into things like customer LTV, but not understand who in our team is genuinely doing the work that deserves recognition? Why are so many people feeling failed by performance assessment? It’s abysmal out there: 70% of people seem to agree it’s a waste of time! Most CEOs don’t trust what’s coming out of our efforts!

Well, whatever, I’m gonna pat myself on the back a bit here because I think we’ve finally cracked it… or something very, very close to it. We’ve mathed the math, modeled the models, piloted some impressive pilots, and you don’t need to just trust me — Dan, our Chief Statistician, and one of the team who helped build this, is probably the most scrupulous human on earth when it comes to stats. So, here comes a big claim: we can instantly and fairly calibrate performance data.

How on Earth?

It’s not magic, believe it or not (or at least Deniz, our CTO, gets cross when I say that…). We’ve built a machine-learning algorithm that incentivizes honest feedback, ensuring you make decisions based on merit, not on manipulation.

So how does it work? In order to calibrate performance data, our algorithm has four key features which make it highly reliable:

Statistical calibration

Our calibration algorithm attempts to fairly and objectively understand the macroscopic view of the scores as they apply to the entire population. It looks for relationships between an assessor’s own scores, their ratings of others, and all other observable data about them and their peers (more on this in Population profile, below). We do this for each behavior being assessed, estimating across the entire employee base to obtain a holistic view of the network structure and to increase our confidence in the validity of everyone’s scores relative to everyone else.
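To make the idea concrete, here is a minimal sketch (illustrative Python with made-up scores, not Peerful’s actual model) of one basic ingredient of statistical calibration: normalizing each assessor’s ratings against their own grading habits, so a harsh grader and a lenient grader become comparable.

```python
# A minimal sketch, assuming a simple per-assessor z-score normalization.
# The names and raw scores below are invented for illustration.
from statistics import mean, stdev

ratings = {  # assessor -> {peer: raw score on one behavior}
    "alice": {"bob": 4, "carol": 5, "dan": 5},  # lenient grader
    "erin":  {"bob": 2, "carol": 3, "dan": 1},  # harsh grader
}

def normalize(scores):
    """Convert one assessor's raw scores to z-scores (mean 0, spread 1)."""
    mu, sigma = mean(scores.values()), stdev(scores.values())
    return {peer: (s - mu) / sigma for peer, s in scores.items()}

calibrated = {assessor: normalize(s) for assessor, s in ratings.items()}
# After normalization, both assessors agree that carol sits above bob
# relative to their own baselines, despite very different raw numbers.
```

A real calibration model would go much further (using the whole network of scores and metadata at once), but even this toy version shows why raw averages across graders with different habits mislead.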

The way we work together as peers really tells us something we’re missing out on.

🔥 This, compared against current tools on the market, which use simple averages to give insights on qualitative performance data, puts us lightyears ahead of the quality of insights out there.

Relative Peer Grading

In traditional peer assessment, employees provide absolute grades or scores based on a rubric, and the absolute value of those scores is taken at face value, with each employee being evaluated almost independently of others. In contrast, in Relative Peer Grading (RPG), employees rate each other’s behaviors relative to one another, not allowing for ties.

Rationale: The underlying idea behind RPG is that while it may be difficult for employees to synthesize a large rubric and assign consistent absolute scores to individuals, they may find it easier to compare them. For example, it might be challenging for employees to decide whether a colleague is “exceeding expectations” or “greatly exceeding expectations” for a Software Engineer IC based on a large rubric of behaviors and competencies, but they might more easily determine that one colleague is better than another at, for example, Execution.

Turns out, people are really not made for doing this:

Synthesizing huge quantities of qualitative data like progression frameworks, day-to-day work, interactions, values… and then distilling it all into a summary is not what we’re made for

But we’re exceptionally good at doing this:

When looking at two of our peers, we’re actually quite effective at identifying who is doing better, with a pretty decent estimate of by how much

But wait — does this mean we’re just looking at two people’s comparisons and adding them up?! No. Our algorithm collects a rich volume of data points by requiring each employee to compare a number of their peers along a suggested five behaviors (though in theory you can use as many as you want, and whatever you want: impact, collaboration, adaptability, so long as you can define it succinctly). If each of your team compares ten peers (the average on our pilots with current customers), that yields forty-five pairwise comparisons per reviewer per behavior. Across five behaviors, that’s 225 comparative data points from each reviewer, which in a company of 1,000 headcount gives Peerful roughly a quarter of a million comparisons to assess. That’s strength in numbers, something People data can rarely boast.
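The arithmetic above, and one standard way to turn pairwise “A ranked above B” data into per-person scores, can be sketched like this. The Bradley-Terry model shown here is a well-known technique for ranking from pairwise outcomes; it is my illustrative stand-in, not necessarily Peerful’s exact model, and the names and win counts are invented.

```python
# Sketch: data-volume arithmetic, then a minimal Bradley-Terry fit
# (classic MM update) over invented pairwise outcomes.
import math

peers_ranked = 10        # average peers each reviewer compares (from the pilots)
behaviors = 5
headcount = 1000

pairs = math.comb(peers_ranked, 2)   # 45 pairwise comparisons per behavior
per_reviewer = pairs * behaviors     # 225 data points per reviewer
total = per_reviewer * headcount     # 225,000 comparisons company-wide

def bradley_terry(wins, names, iters=200):
    """wins[(i, j)] = times i was ranked above j; returns latent scores."""
    score = {n: 1.0 for n in names}
    for _ in range(iters):
        new = {}
        for i in names:
            num = sum(wins.get((i, j), 0) for j in names if j != i)
            den = sum((wins.get((i, j), 0) + wins.get((j, i), 0))
                      / (score[i] + score[j]) for j in names if j != i)
            new[i] = num / den if den else score[i]
        norm = sum(new.values())
        score = {n: s * len(names) / norm for n, s in new.items()}
    return score

names = ["ana", "ben", "cy"]
wins = {("ana", "ben"): 3, ("ben", "ana"): 1, ("ana", "cy"): 4,
        ("ben", "cy"): 3, ("cy", "ben"): 1}
scores = bradley_terry(wins, names)
# ana, who wins most comparisons, ends up with the highest latent score.
```

Note how this fits the no-ties design: every comparison produces a clean win for one side, which is exactly the kind of data these ranking models consume.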

Incentive Compatible Scoring

Incentive Compatible Scoring is a concept primarily discussed in the context of peer assessment, especially in large-scale educational settings like Massive Open Online Courses (MOOCs). The idea is to design a grading system such that employees are incentivized to score each other honestly and accurately, rather than in a biased or strategic manner that might be in their own self-interest.

The challenge with peer assessment in traditional 360 reviews is that if employees believe that they can benefit from grading others harshly (to make their own work look better in comparison) or leniently (hoping for reciprocation), they might be tempted to do so. This can distort the final outcomes and make the evaluation system unreliable.

Incentive Compatible Scoring tries to address this by creating mechanisms where the best strategy is to score as accurately and honestly as possible. Here are some ways this is achieved within Peerful:

  1. Reciprocity Blindness: Peers are made unaware of who is assessing their work. This deters folks from forming alliances or reciprocation strategies (“I pat your back, you pat mine”), and it also allows additional rankings to be considered — such as a person who worked with you on a cross-functional project and thinks you were incredible (rather than simply selecting who you would like to review you, as is standard in most 360 reviews).
  2. Score Weighting: A user who provides consistently unreliable reviews is inferred to have a lower reliability score, and we blunt the effect of their poor ratings. Time and again, we’ve found strong performers are more likely to give accurate reviews; if that holds true in your team, the algorithm will weight their scores accordingly.
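A minimal sketch of the score-weighting idea follows. The inverse-deviation weighting rule and the sample data are my illustrative assumptions, not Peerful’s actual formula; the point is only that a reviewer who consistently disagrees with the consensus moves the blended result less.

```python
# Sketch: weight reviewers by how closely they track the consensus,
# then recompute a weighted consensus. Names/scores are invented.
from statistics import mean

reviews = {  # reviewer -> {subject: score}
    "alice":   {"dan": 4, "erin": 2, "fay": 5},
    "bob":     {"dan": 4, "erin": 3, "fay": 5},
    "mallory": {"dan": 1, "erin": 5, "fay": 2},  # consistently off-consensus
}

subjects = {s for scores in reviews.values() for s in scores}
consensus = {s: mean(r[s] for r in reviews.values() if s in r) for s in subjects}

def reliability(scores):
    """Higher when a reviewer tracks the consensus; bounded in (0, 1]."""
    mad = mean(abs(v - consensus[s]) for s, v in scores.items())
    return 1.0 / (1.0 + mad)

weights = {r: reliability(scores) for r, scores in reviews.items()}

def weighted_consensus(subject):
    pairs = [(weights[r], sc[subject]) for r, sc in reviews.items() if subject in sc]
    return sum(w * v for w, v in pairs) / sum(w for w, _ in pairs)

# mallory's outlier rating of dan now drags the blended score down less
# than it would in a plain average.
```

In practice you would iterate (reweight, recompute consensus, repeat) and fold in the other signals described above, but the one-pass version shows the mechanism.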

Population profile

Peerful goes beyond just performance data and aims to give you insights based on a comprehensive overview of a population, including both demographic information (like age, gender, location, remote/hybrid status) and metadata (manager relationships, start date, last promotion). In short, anything you collect in your HRIS we can port over with an integration and use to offer additional insights.

We can identify where someone’s bias is hiding the otherwise strong (or weak) performance of their team.

We can review the scores given to and by those in specific groups or profiles, and assess how closely their scoring of behaviors aligns with the consensus or average score given by others. If a specific group’s grading deviates significantly from the average, its scores can be identified and normalized. Say you want to see whether women managed by men are rated more harshly, or whether leadership consistently rate each other too leniently: we can identify that trend for you to address.
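A minimal sketch of that group-deviation check (field names, numbers, and the flat 0.5-point flagging threshold are illustrative assumptions, not Peerful’s real configuration):

```python
# Sketch: flag demographic slices whose average given score deviates
# from the population average beyond a threshold. Data is invented.
from statistics import mean

given_scores = [  # (reviewer_group, score_given) pairs
    ("leadership", 4.8), ("leadership", 4.9), ("leadership", 4.7),
    ("ic", 3.4), ("ic", 3.9), ("ic", 3.6), ("ic", 3.7),
]

overall = mean(s for _, s in given_scores)

def flag_groups(rows, threshold=0.5):
    """Return {group: deviation} for slices that deviate past the threshold."""
    flags = {}
    for g in {grp for grp, _ in rows}:
        delta = mean(s for grp, s in rows if grp == g) - overall
        if abs(delta) > threshold:
            flags[g] = round(delta, 2)
    return flags

flags = flag_groups(given_scores)
# The leadership slice rates well above the company-wide average, so it
# is surfaced for a human to review (identify, not auto-control).
```

The same shape of check works for scores received by a group, or for any HRIS attribute you choose to slice on.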

⚠️ It is important to note that we do not hardcode corrections for these population profiles. Instead, we give you the data so you can choose to identify only, or to identify and normalize.

A whole new era of HR insights

I have to say I’m excited. Of course I am — I’m working on something I’m deeply passionate about, but also I’m genuinely convinced that this solves some pretty chronic problems in People Analytics:

  • Performance Data isn’t reliable and we’re building everything on it,
  • We’re losing insights on some of our best people, future leaders, and giving up on the chance of building true meritocracies, and
  • Our teams are being failed: we’re not able to demonstrate true ROI for the work we do in People Operations, nor to build businesses with the greatest chance of success.

Gone are the days of wading through murky waters of biases and skewed results. We want it to be a future where HR analytics gets some real legs. It’s time for business intelligence for People Ops.

Want to hear more? Sign up for our oversubscribed Beta waitlist here.

Me and my cat, looking professional

👉 Buy my book on Amazon! 👈
I talk plenty more about this way of working, and how to use product management methodologies day-to-day. I’ve been told it’s a good read, but I’m never quite sure.

Check out my LinkedIn
Check out the things I have done/do do
Follow me on twitter: @JessicaMayZwaan




G’day. 🐨 I am a person and I like to think I am good enough to do it professionally. So that’s what I do.