How Peerful became Peerful 🪄

The history of our game-changing data-driven solution to employee and workforce assessment, built for CEO insight.

Jessica Zwaan
Incompass Labs

--

When one thinks of a game-changing tool for CEOs to get insight into employee and workforce performance, the first thought probably isn’t grading MBA papers at Wharton. Yet, that’s precisely where Peerful’s story began. This fascinating journey from a mere grading tool to an advanced algorithmic system is a testament to innovation and evolution in the face of rapidly changing business needs.

Our founding advisors, Peter Fader and Asuka Nakahara, sit down to talk WHOOPPEE (Wharton Online Ordinal Peer Performance Evaluation Engine).

Peerful started as an algorithmic, incentive-compatible approach to peer grading at Wharton.

The Beginnings: Wharton MBA Papers

It all began at the Wharton School of the University of Pennsylvania, one of the world’s premier business schools. The challenge was grading MBA papers efficiently and fairly. But how could this process be streamlined and improved? Enter WHOOPPEE: Wharton Online Ordinal Peer Performance Evaluation Engine. This tool was initially created to grade Wharton MBA papers, leveraging a unique algorithm to ensure fairness and accuracy in grading.

Peerful’s Transformation

The transition from a grading tool to what Peerful is today began as a series of conversations that Pete held with a quant-minded, performance-culture-centric, public company CEO who saw the possibilities. With the CEO’s persistent nudges, it became evident that the underlying mechanics of WHOOPPEE could be utilized far beyond academia. The idea was simple: if we can grade MBA papers using an algorithm, why not use similar technology to evaluate employee performance and, ultimately, to calibrate it?

The WHOOPPEE code was rewritten for this new use case. The ability to algorithmically calibrate performance data quickly meant that businesses could now gain meaningful and otherwise invisible insights into their teams’ performance. This was a game-changer. Suddenly, CEOs had access to data-driven evaluations, giving them the power to make informed decisions about workforce development, management strategies, and more. Leveraging the wisdom of the crowd, Peerful could deliver more accurate, less biased performance metrics for individual and team growth and performance evaluation, allowing measurement over time and between populations.

Key takeaways:

Functionality of WHOOPPEE:

  • The system addresses issues in peer evaluation like accuracy and fairness by leveraging multiple evaluations for each paper.
  • A unique feature is its incentive-compatible grading, where the weight of an evaluator’s grade depends on the quality of their own paper. This means students who write better papers have a greater influence on grading, ensuring fairness and motivation.

Incentive-Compatible Grading:

  • The grading method ensures fairness by aligning the interests of the grader and the writer. The performance of a student as a grader is connected to their performance as a writer, creating a balanced and effective evaluation system.
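
To make these two takeaways concrete, here is a minimal sketch of how an incentive-compatible weighted average could work. This is illustrative only, not WHOOPPEE’s or Peerful’s actual code; the function name and the linear weighting rule are assumptions.

```python
# Illustrative sketch only: weight each evaluator's score for a paper by the
# quality of that evaluator's own paper. The linear weighting rule is an
# assumption for demonstration, not the actual WHOOPPEE/Peerful algorithm.

def weighted_paper_score(evaluations, evaluator_quality):
    """
    evaluations: dict mapping evaluator_id -> score they gave this paper
    evaluator_quality: dict mapping evaluator_id -> quality of their own paper
                       (e.g. on a 0-1 scale; higher = better own paper)
    Returns the incentive-compatible weighted average score for the paper.
    """
    total_weight = sum(evaluator_quality[e] for e in evaluations)
    if total_weight == 0:
        # Fall back to a plain average if no evaluator carries any weight.
        return sum(evaluations.values()) / len(evaluations)
    return sum(score * evaluator_quality[e] / total_weight
               for e, score in evaluations.items())

# Example: a strong writer's grade counts for more than a weak writer's.
scores = {"alice": 4, "bob": 2, "carol": 3}
quality = {"alice": 0.9, "bob": 0.3, "carol": 0.6}
print(weighted_paper_score(scores, quality))  # pulled toward Alice's 4, away from Bob's 2
```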

Crowdsourced Evaluation vs. Singular Expert Evaluation:

  • The WHOOPPEE approach relies on multiple students evaluating each paper rather than just a single professor.
  • Simulations revealed that taking a weighted average of five students’ evaluations produced a more accurate assessment of a paper than relying solely on one expert’s view.
  • This method is particularly useful when the single evaluator might have biases, showing that sometimes a broader perspective can yield more accurate results.
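
A quick simulation along the lines described above. The noise levels below are assumptions chosen for illustration, not the ones used in the Wharton simulations, but they show why an average of five noisier evaluations can beat one less noisy expert.

```python
# Illustrative Monte Carlo sketch: compare one lower-noise "expert" evaluation
# against the average of five higher-noise "student" evaluations.
# The noise levels are assumptions for demonstration only.
import random

random.seed(0)
TRIALS = 10_000
EXPERT_SD, STUDENT_SD, N_STUDENTS = 1.0, 2.0, 5

expert_err = crowd_err = 0.0
for _ in range(TRIALS):
    true_score = random.uniform(60, 100)          # unknown "true" quality of the paper
    expert = random.gauss(true_score, EXPERT_SD)  # one careful but imperfect grader
    students = [random.gauss(true_score, STUDENT_SD) for _ in range(N_STUDENTS)]
    crowd = sum(students) / N_STUDENTS            # simple (unweighted) average
    expert_err += abs(expert - true_score)
    crowd_err += abs(crowd - true_score)

print(f"mean expert error: {expert_err / TRIALS:.3f}")
print(f"mean crowd error:  {crowd_err / TRIALS:.3f}")
# With these assumed noise levels, the crowd of five beats the single expert
# (2.0 / sqrt(5) is roughly 0.89, which is below 1.0), matching the intuition above.
```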

Ranking vs. Grading:

  • A crucial distinction in the WHOOPPEE approach is that students are asked to rank, not grade. They need to position papers relative to each other.
  • The variability within and across paper batches helps ensure that the ranking isn’t biased by having all strong or all weak papers in a batch.
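
Here is one simple, illustrative way batches could be constructed so that every paper lands in several different batches and nobody ranks their own paper. The shuffled round-robin scheme below is an assumption for demonstration; the real system assigns random batches.

```python
# Illustrative sketch: put every paper into five different batches, with nobody
# ranking their own paper. The shuffled round-robin construction is an
# assumption for demonstration, not WHOOPPEE's actual assignment logic.
import random

def make_batches(student_ids, batch_size=5, seed=0):
    """Return {student_id: list of other students' papers to force-rank}."""
    order = list(student_ids)
    random.Random(seed).shuffle(order)   # randomize who sits next to whom
    n = len(order)
    batches = {}
    for i, student in enumerate(order):
        # Offsets 1..batch_size skip the student's own paper (offset 0).
        batches[student] = [order[(i + k) % n] for k in range(1, batch_size + 1)]
    return batches

batches = make_batches([f"s{j}" for j in range(12)])
print(batches["s0"])  # five other students' papers for s0 to force-rank
# Each paper ends up in exactly five different batches, so its forced ranks
# can be compared across batches of different strengths.
```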

Quality of Grading vs. Quality of Writing:

  • Pete mentions an observed correlation between the quality of a student’s paper and the weight of their grades in the system. The system allows for this correlation but does not impose it; it simply emerged from the data, suggesting that students who produce quality papers often offer valuable evaluations as well.
  • This realization strengthens the system’s credibility and reinforces its foundational ideas.

Connection to Super Forecasting:

  • Asuka draws a parallel between the collaborative ranking system and Phil Tetlock’s research on super forecasting. Tetlock found that forecasts drawn from the averaged predictions of several informed individuals tend to be more accurate. This supports the idea that collective insight can be more reliable than individual judgments.

Full interview transcript

Asuka Nakahara:

My name is Asuka Nakahara and I teach the Real Estate Development Course at Wharton. I’m very happy to be here with Pete Fader, a teaching colleague and senior marketing professor who specializes in using behavioral data to understand and forecast purchasing and shopping activities across many industries. He believes marketing is a data-driven field, and interestingly has applied his expertise in data work to a fascinating and I think remarkable in-class application. Pete and a team of collaborators have been working on WHOOPPEE (Wharton Online Ordinal Peer Performance Evaluation Engine) for several years now, and launched it formally during the spring 2015 term.

Over time, the algorithm and user interface have been tweaked and improved. I use WHOOPPEE in my real estate development course and I’m applying it to assignments this year. The purpose of this chat with Pete is to answer some basic questions about WHOOPPEE based on my experience and on student feedback and questions. Pete, welcome, thanks for doing this, and thanks for developing WHOOPPEE. First question: what is WHOOPPEE and how did it develop? Obviously there are issues with peer evaluation and peer grading, things like accuracy and fairness and so forth. Please tell us the story.

Peter Fader:

Sure thing. So what is WHOOPPEE? A strange name for a software platform: the Wharton Online Ordinal Peer Performance Evaluation Engine. There’s strength in numbers, and the way to overcome some of these issues is to have multiple motivated evaluators go through a paper and then find an incentive-compatible way to weight their evaluations and come up with an overall grade for that paper. That’s what WHOOPPEE’s all about. It’s collaborative grading, which by itself isn’t a new idea, but it’s done really well here, and I’m actually really proud of the way it’s come about and how successful it’s been in the various courses I’ve been teaching.

Asuka:

What’s the origin? I mean, how did you even come up with the idea to do this?

Pete:

It all came about from lunch with students. I was sitting around with students right after one of my assignments, and I had just posted the three or four best papers, and the students were saying, in kind of a casual way, it’s kind of depressing that you just show us the best papers because it makes us feel real bad, so why don’t you show us some average papers instead? Or why don’t we each get a chance to look at some random papers? And then we were kind of joking around saying, yeah, I could show you the random papers and then you could grade them, and then we could weight your grades with my grades. And it was just, “ha, very funny.” And then we kept on going with lunch, but this idea stayed with me. And actually months later, believe it or not, I was on a long flight back from South Africa and I kind of put together an algorithm that would do just that, where I would have students evaluate random papers and we’d put different weights on those evaluations based on how well each student did on their own paper.

You write a good paper, you carry more weight, and we include the faculty and the TAs in there as well. And if we can come up with this overall way to get people to evaluate other papers, and do so in a way that the weight on their evaluation reflects how well they did on their own paper, not only is that really fair to the students being evaluated, but it’s very incentive compatible for the graders, because of this connection between their performance as a grader and their performance as a student writing a paper. We want it all to be aligned, and it seems to have worked pretty well.

Asuka:

Okay. So the way it works is each student writes a paper, then that student will rank five randomly selected papers. Why five, and not four or six?

Pete:

Five turns out to be the magic number. When we put these algorithms together, we ran a bunch of simulations and we said, suppose we know the true underlying score of each paper, but then we’re going to have different kinds of people evaluate that paper in an error-filled way. Well, you could have the professor do it. The professor’s never going to be perfect. They might come close to the truth, but they’ll never get quite there. They’ll always be above or below; they like the paper too much or too little. But suppose we can get a certain number of even more error-filled people, the students, to take a shot at it, and there’s going to be much more variability there. But if we can find a way to take all of those scores and bring them together, it turns out that taking a weighted average of five more error-filled scores is better than taking the one less error-filled score from the professor. So sometimes students don’t like the idea that their peers are evaluating the papers, but the fact that enough of them are doing it, and the faculty are doing it as well, leads to better outcomes than if it’s the faculty alone.

Asuka:

So I’ve heard this algorithm described as running in iterations, sort of like the way a Google search engine works. Could you describe why it runs so often and what happens each time it runs?

Pete:

Yeah, there actually are some connections with the way the whole PageRank system works. I have to say that was coincidental. Maybe it’s just great minds thinking alike, I’m not sure. But the basic idea is this: as I said earlier, it’s under the presumption that if you write a good paper, you should carry more weight in your evaluation of others. So not only do you want to write the best paper possible, but if you’re a little bit too casual in the way you evaluate the other papers, if you say, eh, I’m just going to do it randomly, then it suggests that, well, maybe you’re not that good after all, and the grade on your own paper is going to go down. So part of it’s to create this correlation, this connection between your performance as a writer and as a grader. And a second piece that’s really important is that we’re not asking you to grade the papers.

It’s actually wrong for me to use that verb at all. We’re asking you to rank the papers. So we’re going to give you a set of five, and instead of grading them, even though we give you a very detailed rubric (that’s very important), all we’re asking you to do is a forced rank. You’ve got to give a one, a two, a three, a four, and a five. Now, you might have a really strong batch of papers, and it’s really hard to give someone a five when it’s a really outstanding paper, or you might have a really weak batch of papers, and it’s really hard to give someone a one when they all stink. But that’s the beauty of it: by having lots of variability within and across the batches, and by having your paper evaluated in different batches, it all kind of bubbles up. And we really find that the ultimate scores we come up with, after we run this algorithm and take into account all these weights and everything, much more than the raw ranks themselves, tend to provide a very good reflection of the true value of the paper.

Asuka:

And so the reason you have to iterate this thing so many times is because the first iteration probably just says, how many people got a lot of ones? And then it starts figuring out, well, the ones should get rated higher on their ranks, so it runs it again, and then it runs it again. Is that sort of the way it works?

Pete:

Exactly right. So what happens is the students will see the raw ranks that they get, and sometimes, if that’s all they know, they’ll say, hey, how come I got a bunch of ones and twos here, but my paper was only in the middle of the pack? Well, we need to take into account, like I said before, the strength of the batches that those papers are included in. Getting ones in a weak batch doesn’t mean that much. We also take into account the weights on the graders. So if someone gave you a one, but their paper was the worst in the class, that’s not going to carry nearly as much weight as someone else’s. So taking all of these factors into account means, well, here’s the way it works: we want to come up with the score that best reflects how well each paper did in the different batches that it was in, recognizing the differential weights on them.

And so the algorithm, the math behind it, is actually kind of nasty, but conceptually, I hope, it’s pretty clear. And it turns out that even though we have to go through these iterations, it’s done automatically, and in the end it definitely converges. So it’s not like we just stop and say, oh, that looks pretty good; it will converge. Sometimes it’ll take a couple of hours to run, to try different combinations of weights and scores and so on. But in the end, we’ll look at the numbers, and I’ll spend a great deal of time, especially at the extremes, not only on the papers that did best and did worst, but on the ones that appear to have the biggest change from the raw ranks to the ultimate WHOOPPEE score, just to make sure it’s all kosher. And it tends to be; it’s very rare that we’ll see cases where the algorithm doesn’t work.
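
To make the shape of that iteration a little more concrete, here is a heavily simplified sketch of a scores-and-weights fixed point. It is not WHOOPPEE’s actual math (which, as Pete says, is “kind of nasty”); the normalized rank scores, the linear weight rule, and the convergence test are all assumptions for illustration.

```python
# Minimal sketch of the scores <-> weights fixed point, under simplified
# assumptions (normalized rank scores, a linear weight rule, no explicit
# batch-strength adjustment). It only shows why the computation iterates.

def iterate_scores(rankings, tol=1e-6, max_iters=1000):
    """
    rankings: list of (evaluator_id, paper_id, rank, batch_size) tuples,
              where rank 1 is the best paper in that evaluator's batch.
    Assumes a student's own paper is keyed by their evaluator_id.
    Returns (paper_scores, evaluator_weights) as dicts.
    """
    papers = {p for _, p, _, _ in rankings}
    evaluators = {e for e, _, _, _ in rankings}
    scores = {p: 0.5 for p in papers}        # start every paper in the middle
    weights = {e: 1.0 for e in evaluators}   # start every evaluator equal

    for _ in range(max_iters):
        # Step 1: re-score each paper as a weighted average of its rank scores.
        # (A fuller version would also adjust for batch strength, as Pete describes.)
        new_scores = {}
        for p in papers:
            num = den = 0.0
            for e, paper, rank, m in rankings:
                if paper == p:
                    rank_score = (m - rank) / (m - 1)   # 1 = top of batch, 0 = bottom
                    num += weights[e] * rank_score
                    den += weights[e]
            new_scores[p] = num / den if den else scores[p]

        # Step 2: refresh each evaluator's weight from their own paper's score.
        # Faculty/TAs (no paper of their own) keep a middling default; the small
        # floor keeps weak writers from being silenced entirely.
        weights = {e: 0.1 + new_scores.get(e, 0.5) for e in evaluators}

        # Step 3: stop once the paper scores settle down.
        if max(abs(new_scores[p] - scores[p]) for p in papers) < tol:
            scores = new_scores
            break
        scores = new_scores

    return scores, weights
```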

Asuka:

So Pete, have you ever graded papers yourself and compared your grades to the grades generated from the rankings that WHOOPPEE has come up with?

Pete:

So one thing that I’ll do every time, like I said, is that myself and the TAs, we go through batches of papers ourselves. So each student is just going to do one batch of five papers, rank those papers, and that’s it. But myself and the TAs, we’re going to take everybody’s paper. We want to make sure that everyone gets a gold-standard evaluation. So I’ll be doing, depending on how many students are in the class, six or eight or ten separate batches. And then what I’ll do, after the fact, and I’ll always do this, is I’ll see: how did I evaluate those papers? The immediate thing I want to know is how it compared to the students’ own evaluations. But then, when we get the final scores after the algorithm is run, how well did the ranks that I gave match up with the relative ranks that those papers actually got post hoc? And sometimes it’s actually perfect. Sometimes the ranks that I give align perfectly with the true scores of those papers. Sometimes it’s off by a little bit. Sometimes I see things in papers that other students didn’t, or vice versa. But the alignment is usually quite good.

Asuka:

So how have you used WHOOPPEE in a classroom setting?

Pete:

So I used it in this big, nasty quantitative course, and I’ve used it, I think, about six to eight different times. And even though it is a quantitative course, there are different kinds of assignments in there. So in the first assignment, students have to go out there and find their own dataset and do their own analysis. And so when they’re evaluating others’ papers, they’re looking at five wildly different papers, on completely different topics, maybe coming at it with different analytical approaches. And by the way, that by itself is one of the big reasons, going back to that lunchtime conversation when we came up with this idea: just to expose them to a bunch of other papers, and to do so in kind of an incentive-compatible way. There’s a lot of learning that takes place right there. Like, huh, I never really thought about applying these models in this area.

I never really thought about doing that kind of analysis with this kind of data. So the first assignment is kind of a roll-your-own, and we get the students to evaluate five wildly different papers. In the second assignment, they’re all working off of a common data set. So on one hand it’s easier, because the papers are much more comparable to each other, but that also makes it a little bit more challenging. Sometimes the papers might tend to clump a little bit closer together, so you have to think a little bit more carefully about the analysis and how they differ from each other, and even the quality of the paper, the quality of the writing and the exhibits and so on. So even though I’ve used it in a single course, it’s been used for different kinds of assignments. And then a number of colleagues, including yourself, have used it in a wide variety of very, very different courses as well.

Asuka:

So Pete, you’ve talked a lot about how it works and some of the detail. What are some of the student dissatisfactions that you hear as you use it in your class, and as your colleagues report in on how the experience has gone?

Pete:

Sure. Number one, there’s the visceral reaction that, oh, the professor’s trying to get off the hook here; the professor doesn’t want to grade our papers, so they’re just going to make us do it. Well, that reaction is just wrong, because, first of all, the faculty and the TAs are indeed evaluating a whole bunch of papers. Every student’s paper is going to get touched by one of us. That’s not a requirement of the system, but it’s just part of my own values in the courses that I teach, and I know it’s true with you as well. And they end up getting better evaluations by having five-plus-one people doing it and taking the weighted average and all that. So concern number one is, I just want the professor to do it; but they’re better off having this broader, crowdsourced version of it.

Concern number two would be on the outputs: that sometimes it is a bit of a mystery, that they just see these ranks, and sometimes the raw ranks that they get don’t align well with the final score that comes out of the algorithm. And that’s something we’re working on, trying to figure out. We don’t want to overwhelm students by saying, here is the rank that the student gave you, here is the weight that they got due to the score on their paper, here is the strength of the batch that they looked at. Now, some students want all of it: give me more metrics, I really want to get the full story, I’m kind of curious about it. Other students don’t want to see any of that: just kind of tell me my score; you know what, don’t even show me those raw ranks.

So it’s figuring out the just-right balance of information. Number three is student feedback. We strongly encourage the students, as they’re evaluating the papers, to provide feedback; you’ve got to give to get. So let’s go in there and say a few things. Don’t be nasty about it, be constructive, but at the same time be kind of honest about it. I wish that there was more quality and quantity of feedback. We’re trying to work on ways that maybe, after all the papers are graded, you can go out there and kind of give some of your graders a thumbs up, or maybe a thumbs down, on the quality and quantity of feedback that they gave you. That’s still a work in progress, and I think that that’s an important piece as well.

Asuka:

Pete, you touched on this just now. So what are some other tweaks, modifications, and improvements that you see in the future?

Pete:

So in addition to some of the fine-tuning that I just mentioned, something else that sometimes happens is that, even though we make it incentive compatible for the students to evaluate the papers as carefully and effectively as possible, sometimes, for reasons that I don’t know, they do a terrible job. Maybe they’re demotivated, maybe they’re just marching to their own drum and they’re just kind of out of line, or who knows what. So right now, if some student’s evaluations of their batch are way out of line with the way the other students evaluated those same papers, we’ll just throw out their ranks completely. It would be great to automate that. It would be great to come up with some standards about what we mean by misaligned, and have that kind of woven into the algorithm, and maybe then send a signal to the faculty saying, here’s one you want to look at a little bit more carefully, because there’s one less touch on that paper. So we want to make sure that the numbers we provide are really robust, and that when we provide the data back to the student, they’ll really not only believe their overall score, but even the process that led to it.
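
One illustrative way the “misaligned grader” check Pete describes could be automated is a rank-correlation screen: compare each student’s ranking of their batch with the consensus ordering of the same papers and flag anyone whose agreement falls below a threshold. The Spearman correlation and the cutoff below are assumptions, not the rule WHOOPPEE actually uses.

```python
# Illustrative sketch: flag evaluators whose batch ranking disagrees strongly
# with the consensus scores of the same papers. The Spearman rank correlation
# and the -0.2 cutoff are assumptions for demonstration only.

def spearman(xs, ys):
    """Spearman correlation for two lists already expressed as ranks (no ties)."""
    n = len(xs)
    d2 = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def flag_misaligned(batch_ranks, consensus_scores, cutoff=-0.2):
    """
    batch_ranks: {evaluator_id: {paper_id: rank given (1 = best)}}
    consensus_scores: {paper_id: current consensus score (higher = better)}
    Returns the evaluators whose ranks are badly out of line with consensus.
    """
    flagged = []
    for evaluator, given in batch_ranks.items():
        papers = list(given)
        # Consensus order of the same batch, re-expressed as ranks (1 = best).
        by_consensus = sorted(papers, key=lambda p: -consensus_scores[p])
        consensus_rank = {p: i + 1 for i, p in enumerate(by_consensus)}
        rho = spearman([given[p] for p in papers],
                       [consensus_rank[p] for p in papers])
        if rho < cutoff:
            flagged.append(evaluator)
    return flagged
```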

Asuka:

That’s a great point, Pete. So right now, the way you do that is mechanical: you just set a fixed number, five or ten percent, and you eliminate those rankings because they’re just so out of whack.

Pete:

That’s right. And that’s arbitrary, and it’s something I look at very, very carefully after the fact: are we throwing out too many, and therefore losing perfectly good information that should be taken into account in the algorithm? I think a big part of this depends on the interface for the WHOOPPEE system itself. Initially we asked students to write in 1, 2, 3, 4, 5, and they would sometimes get confused about whether one is best or one is worst, a natural source of confusion. But with the new interface that we have, the students drag and drop the papers, so if a paper is at the top of the list, that paper is the best. So we think we’re going to get rid of a lot of the randomness, the misalignment, just by having a nicer, prettier, more obvious interface. The students will not only not make mistakes, but they’ll find it even more motivating. They’ll see that we’ve really put the effort in to build something carefully, and they’ll be equally careful as they use it.

Asuka:

Anything we haven’t talked about that you can think of, that you want to make sure we cover?

Pete:

Well, beyond grading, and our students may or may not be interested in this, but there’s been so much interest in the idea of this collaborative system and the idea of the differential weights and so on. We’ve been hearing from a lot of people who want to use it for other purposes as well. So, for instance, I had a conversation with the folks in the admissions office: instead of asking the admissions counselors to look at one application and say, does this person get in or not, why not give them batches of applications to rank, and have each counselor look at a bunch of different batches? So it’s something they’re exploring right now, whether it’s for admissions or maybe for employee evaluations: let’s look at all the employees and rank them, but have multiple people do that, and then once again run an algorithm on the back end to come up with these scores in order to give people their bonuses and so on.

So I think there are a lot of other applications for it, and I hope that some of them come to life, because I’d like our students to come into class kind of informed about what the system is, and to know that it’s not this one-off goofy thing that’s being done in this class and this class only, but that it’s just kind of part of life: having collaborative, ranked evaluations is just a way to get better, fairer answers. And I also hope that they’ll see that not only is it best for them in terms of the grade that they get, but that they look forward to reading some of those papers and providing feedback on them and having this conversation, even if it’s anonymous, about just what makes for good work in this course. Again, some take to it, others are a little bit resistant, but net, what I care most about is fairness and accuracy, and I really believe in the numbers that come out of the system.

Asuka:

It’s interesting. You’re probably familiar with Phil Tetlock’s research on superforecasting, where he makes the point that the best forecasts are the ones where you basically average the forecasts of a lot of people who have some knowledge. So I think this is right along those lines.

Pete:

It really is, and that’s fascinating work. And it’s weird; once again, it was just very parallel origins, that we’re trying to figure out who those super graders are, and a good sign of someone who’s going to be a super grader is someone who’s a super good paper writer. Let me just add one thing on that. I mentioned how we allow for a correlation between the score on your paper and the weight that your grades will get. We don’t impose that; we allow for it. And so, in the dataset, if there was a disconnect between the paper writing and the paper evaluating, then that correlation would be zero. And we’ve seen cases like that. It could even be negative, perhaps. So we let the data tell us how much weight we should be putting on people, based on the quality of their grades. It is not imposed there. It just turns out that, in pretty much every example we’ve looked at, there’s actually a very strong correlation between the quality of your paper and the weight that your grades should get. So it kind of reinforces this idea that we had going in, and I think it really helps the outcomes of the process.

Asuka:

Pete, thanks for taking the time to talk about WHOOPPEE, how it works, and what students can gain from the experience. I look forward to the next revisions.

Pete:

Thanks, Asuka. It’s always great talking to you, and I appreciate your support of this whole new way of grading papers.

What is this about Peerful?

Good question! I am working with a pretty incredible team of academics on developing a powerful algorithmic approach to performance assessment and calibration. One that is able to calibrate performance instantly (no more sitting in rooms arguing over performance frameworks) and which shows you trends across company, leadership, and individual performance, improving your ROI in People. I’m hoping we can build something that genuinely democratizes meritocracy and gives your leaders time back from calibration, time you can instead spend on surfacing insights and progression (or improvement) plans for your people.

The world of performance measurement is exciting and ever-evolving. Stay tuned for more insights and keep innovating! 🚀

Ok that’s all from me, folks. 👋

👉 Buy my book on Amazon! 👈
I talk plenty more about this way of working, and how to use product management methodologies day-to-day. I’ve been told it’s a good read, but I’m never quite sure.

Check out my LinkedIn
Check out the things I have done/do do
Follow me on twitter: @JessicaMayZwaan

--

Jessica Zwaan
Incompass Labs

G’day. 🐨 I am a person and I like to think I am good enough to do it professionally. So that’s what I do.