
How to approach McKinsey and BCG Aptitude testing (Imbellus, Pymetrics)

Part 4 of 7 Steps to a Management Consulting Offer

Sofie Yang 👩‍💻


There’s been a new development in the management consulting recruiting funnel in 2019: Aptitude Testing. These tests are basically a series of games or questions that score people on psychometric and cognitive qualities (e.g. risk-taking, mental processing speed, spatial reasoning, and even empathy).

The concept of aptitude testing isn’t really new. Unilever explored it years ago, but it was only recently picked up by McKinsey and BCG, which haven’t changed their recruiting practices in a long time. With it comes a whole set of new questions, concerns, and uncertainty for applicants.

Introduction

Source: McKinsey & Company
Source: Boston Consulting Group

McKinsey’s implementation of Imbellus and BCG’s of Pymetrics are used to screen candidates before the interview stage in conjunction with resumes, cover letters, and transcripts. To preface, I am not an expert in psychometrics, algorithmic justice, or recruiting in any sense — I have just enough undergraduate-level statistics and machine learning background to sniff around and get curious. This article doesn’t tell you how to succeed on these tests (because I doubt anyone knows). Instead, for general awareness and our entertainment, I’ll offer my experience and my treatise on the concept itself.

As a disclaimer, the details of this post might become outdated and only apply to the West Coast USA processes for undergraduate internships and full-time roles from a target/semi-target school (UC Berkeley). This post draws on my personal experience with McKinsey, Bain, EY-Parthenon-SSG, PwC (offers), and Deloitte S&O (final round) and my peers’ experiences with McKinsey, Bain, BCG, and other firms. I also don’t work for McKinsey anymore after my summer internship there, which gives me more license to share my thoughts.

I took the McKinsey Imbellus test (the beta version) for an hour and a half last summer with my intern class, but did not get any results back. I also took the BCG Pymetrics test as part of my application last year. They were both fun to play — almost too fun! I will share some of my results in this article.

These tests have a variety of stated goals. In no particular order:

  1. increase speed and accuracy of hiring by measuring cognitive performance, and
  2. level the playing field for non-traditional applicants.

In reality, the idea probably came from the test developers reaching out to a Senior Partner saying, “You don’t want the other firms to seem more disruptive and innovative, so better hop on board now!”

From the candidate’s perspective, I would treat it as a task worth spending a little time strategizing for. Neither Imbellus nor Pymetrics has practice tests, so just focus on following your plan and don’t panic!

There is far more info on Pymetrics than Imbellus, so I will start with a brief overview of Imbellus and then do a more researched explanation of Pymetrics.

Imbellus — McKinsey

https://www.imbellus.com/#/science

Time: 1 hour total (according to CaseCoach; my test was the beta version). You self-allocate time across two tasks.

Tools: Computer, note-taking tool

Players: Single player

Tactical notes

In case there was a non-disclosure agreement I forgot about … all the information I’m about to give is available online somewhere. CaseCoach described the four games and wrote an excellent overview. I only played two of them and will expand here.

General:

  1. Read the rules carefully and draw diagrams/take notes (almost like in a case interview)
  2. Make local optimizations — do not revisit the same information multiple times; nail down some choices early and follow through. I wasted a lot of time trying to find the global maximum.

I’ll mark some “local optimizations” you can make in italics. The game names are made up.

Game 1: Playing Poseidon

Source: Imbellus

This is an ecosystem creation game.

Goal: assemble a stable ecosystem.

Step 1: Choose a location in the area. First, click into some organisms and check for the main types of ecosystems (e.g. warm, cold). Perhaps one type has more animals to choose from, thereby increasing your chances of finding a stable set.

Step 2: There will be a ton of organisms. Prior knowledge says ecosystems are pyramid-shaped. Maybe pick your top predator first and then go down the pyramid (draw this diagram and include calorie counts).

Step 3: The hard part is that you might need to increase the size of your food chain to satisfy calorie counts, and the complexity essentially doubles at that node. This requires some long addition: round your numbers and keep count of unused calories. If it doesn’t entirely balance out when your allocated time is up but is close, that’s okay. Submit and move on.
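
To make the bookkeeping concrete, here’s a minimal sketch of the running tally I was doing on paper. The organism names, calorie numbers, and the simplified rule that each eater feeds only on the level directly below it are all made up for illustration; the real game’s eating rules are more nuanced.

```python
# Hypothetical calorie bookkeeping for a small food chain. Organism names,
# numbers, and the "each eater feeds only on the level below it" rule are
# made up; the real game's eating rules are more nuanced.

food_chain = [
    # (organism, calories_needed, calories_provided), top predator first
    ("top predator", 1200,    0),
    ("mid fish",      800, 1500),
    ("small fish",    400, 1000),
    ("kelp",            0,  900),
]

def check_chain(chain):
    """Walk down the chain: each eater draws from the provider below it."""
    for (eater, needed, _), (provider, _, provided) in zip(chain, chain[1:]):
        leftover = provided - needed
        status = "OK" if leftover >= 0 else "SHORT"
        print(f"{eater} needs {needed}, {provider} provides {provided} "
              f"-> leftover {leftover} ({status})")
        if leftover < 0:
            return False
    return True

check_chain(food_chain)
```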

Game 2: Plants vs. Zombies with Consequences

Source: Imbellus

This is an adversarial organism protection game.

Goal: Survive X number of turns. In my gameplay, I felt that it was fairly easy to survive the turns.

Step 1: You have a map with your native species near the center. You can place natural barriers (some slow invaders down, some block them) on the map, but only beside the same barrier type. Invader species appear on the map near the corners. Try to use your first moves to set up blockades around the corners, forcing new invaders to take a longer route. For one round, I was able to indefinitely block the invaders by sealing the corners with boulders.

Step 2: You can plan up to a few steps ahead using the interface. This turned out to be a waste of time because you could probably just keep that in your head. To show the system you’re thinking ahead, though, put a few placeholders in and swap them out as the adversary makes more plays. However, do plan ahead for barriers you should build in order to maximize your materials and adjacencies (e.g. build diagonally vs. perpendicularly to increase perimeter).

Step 3: Don’t panic and waste materials. If the predator gets close, take a second to evaluate how you can make the material count and where to place it first.
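
If you’re curious why the corner blockades work, here’s a toy sketch of the idea: a barrier near the invader’s corner either lengthens its shortest route to your species or, if the corner is fully sealed, blocks it entirely. The grid, the symbols, and the 4-directional movement are my own assumptions, not the actual game mechanics.

```python
# Toy illustration of barrier placement: a barrier near the invader's corner
# lengthens (or, if the corner is sealed, completely blocks) its shortest
# path to the native species. Grid, symbols, and 4-directional movement are
# assumptions, not the actual game mechanics.
from collections import deque

def shortest_path(grid, start, goal):
    """BFS over open cells ('.'); returns number of moves, or None if blocked."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == "." and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

invader, native = (0, 0), (2, 2)  # invader in a corner, natives near the center
open_map   = [".....", ".....", ".....", ".....", "....."]
detour_map = [".....", "..#..", ".#...", ".....", "....."]  # partial blockade
sealed_map = [".#...", "#....", ".....", ".....", "....."]  # corner sealed off

print(shortest_path(open_map, invader, native))    # 4 moves
print(shortest_path(detour_map, invader, native))  # 6 moves (longer route)
print(shortest_path(sealed_map, invader, native))  # None (blocked for good)
```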

Based on these descriptions, you can go in with some notion of your time allocation. If you’re slower at math, perhaps leave more time for Playing Poseidon. If you get frazzled with strategy games, leave more time for Plants vs. Zombies with Consequences. I used 3/4 of my time on Playing Poseidon because I was really interested in whether I could make a balanced ecosystem and I knew I could get through Plants vs. Zombies quickly.

Breaking Down the Concept of Imbellus

To recap, the stated goals of aptitude tests are: 1. increase speed and accuracy of hiring, and 2. level the playing field for non-traditional applicants by measuring “potential”. It has to be within this context that we evaluate the test’s efficacy because I can confidently say that there are no perfect measurements of human ability.

Increase speed and accuracy of hiring

Imbellus is a single-player assessment based on a knowable environment and a small set of stimuli — which omits the many benefits and challenges of constantly working with a team and client in consulting. Arguably, the most successful consultants (i.e. the ones that rise through the ranks the fastest) stand out with their client and team interactions, not their analytical skills.

Now, accuracy can be defined multiple ways. Two reasonable ones are:

  1. Higher Imbellus scores are associated with a higher chance of being hired (if Imbellus testing is an independent event)
  2. Higher Imbellus scores are associated with better performance reviews and faster promotions (longitudinal study required)

From Imbellus’ website,

“In high-stakes, summative assessment, success depends on how well the assessment measures key constructs and predicts outcomes (e.g. hiring process success, job performance, or college GPA) and the degree to which the variables behind that prediction can be explained and corroborated with both theoretical and data-driven models.”

This quote, taken from Imbellus’ website, implies that they define accuracy the first way, which is far easier — they would need years of longitudinal performance data to prove the second.

The correlation enables recruiters to make predictions as to whether someone will be worth interviewing given their score. Hypothetically, if someone with a below-average Imbellus score only has a 1% chance of getting an offer, why spend the additional time interviewing them? However, offer rates at McKinsey are so low that I’m not sure the association can be statistically significant.
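
For the curious, here’s what checking that first definition could look like. Every count below is invented (I have no access to real applicant data); the point is just the shape of the test: compare offer rates for above-median versus below-median scorers.

```python
# Hypothetical check of accuracy definition 1: do above-median Imbellus
# scorers receive offers at a higher rate? All counts below are invented.
from scipy.stats import fisher_exact

#                      offers  no offers
above_median_scorers = [  60,    1940]   # 3.0% offer rate (made up)
below_median_scorers = [  30,    1970]   # 1.5% offer rate (made up)

odds_ratio, p_value = fisher_exact([above_median_scorers, below_median_scorers])
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
# Whether a real applicant pool is large and clean enough to clear the
# p < .05 bar, given how few offers go out, is exactly the open question.
```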

The choice of accuracy metric is unfortunately the test’s downfall for achieving goal number 2: level the playing field for non-traditional applicants by measuring “potential.”

Level the playing field

After reviewing the materials on Imbellus, I came across a few densely-worded explanations of the underlying science, which boils down to machine learning (on small datasets) and interviews with individuals. At least they are honest here.

“The tension between traditional psychometrics and modern data science lies in the difference between optimizing assessment scoring for predictive versus explanatory power.” As an example, deploying something like a neural network that may outperform traditional models for prediction might require tens of thousands of test sessions. In practice, assessment providers often settle on less complex models, smaller data, or both in an effort to provide scores that mean something in the short term while building data needed to make greater claims with more sophisticated models later.

If the training data of this game is based on McKinsey hires (who are already disproportionately privileged), then the attributes of privileged hires are labeled “success,” while overlooked individuals are labeled “failure.” From my own experience, McKinsey’s new hires at the undergraduate level disproportionately come from wealthy backgrounds. As such, the game calibration is not immune to systemic bias if it gives high scores to people who think like the current wealthy McKinsey hires in the “success” group. I would love to see anything showing a statistically significant difference in the cognitive traits of successful McKinsey hires and the general pool of students they draw from at elite target schools (Harvard, Stanford, Princeton, etc.), which would give some credibility to the Imbellus model.

My question is: if a non-traditional applicant (e.g. a soccer coach) scores absurdly high, say two standard deviations above average, would McKinsey pull the person in for an interview or discount it as a fluke?

The bottom line is that I believe cognitive/aptitude assessments should never be used as a standalone elimination round because they could introduce more bias — except more veiled. Unlike a human, software cannot be held accountable.

Pymetrics — BCG

Source: SmartRecruiters

Time: 30 mins (12 mini-games each taking a few minutes to complete)

Tools: Computer (make sure all your keys work, including the space bar; if you’ve modified your computer inputs… maybe play on another computer)

Players: Single player

Tactical notes

General:

  1. You can prepare strategies for each of the 12 games beforehand.
  2. Performing extremely “well” on a certain task by the metric of the game (be it money or speed) will not necessarily translate into traits that are favorable for consulting.

I couldn’t remember all 12 games, but I was able to jog my memory on a few of them by watching this YouTube video.

I’ll mark some tips in italics, but they’re not guaranteed to be helpful. The game names are made up.

Games


Balloon Popping

You click to pump a balloon. The bigger it is, the more money it’s worth. At any size, you can choose to collect the winnings or keep pumping. However, if the balloon pops, you lose its worth.

Different colors correlate to an approximate maximum size — emphasis on approximate. So, even if you popped a blue balloon at $3, it doesn’t mean you know exactly when the next one will pop. There are also more colors than you can feasibly keep track of.

Looks like I’m a crazy risk-taker! But no! I’m most definitely a risk-averse and careful person, as evidenced by the fact that I chose to be a consultant.

I think this exercise maps to risk-taking, so you should 1. push the limits and pop some balloons only early on, and 2. rely on intuition about relative sizes when judging when to stop.
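
If you want the logic behind that intuition, here’s a toy expected-value framing of the “pump or collect” decision. The balloon value, payout increment, and pop probabilities are made up; only the comparison matters.

```python
# Toy expected-value framing of the balloon game. The balloon value, payout
# increment, and pop probabilities are invented; only the comparison matters.

def should_keep_pumping(current_value, increment, pop_probability):
    """Pump again only if the expected value of pumping beats cashing out."""
    ev_pump = (1 - pop_probability) * (current_value + increment)  # pop = lose it all
    return ev_pump > current_value

value, increment = 1.50, 0.25
for est_pop_prob in (0.05, 0.10, 0.15, 0.20):
    decision = "pump" if should_keep_pumping(value, increment, est_pop_prob) else "collect"
    print(f"estimated pop probability {est_pop_prob:.0%}: {decision}")
# At $1.50 with $0.25 per pump, pumping stops paying off once you believe the
# pop chance exceeds increment / (value + increment), i.e. roughly 14% here.
```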


Easy/hard tradeoff

Given a probability of getting the reward, you choose between easy and hard tasks, which both have a dollar value attached. An easy task takes 3 seconds, but a hard one takes more. To play more rounds, you want to balance out the easy and hard tasks as well as your processing time.

Pretty average results on these, but consulting firms are probably looking for “don’t exert effort if the reward is low.”

To try to deduce the better approach: let’s first ignore the relative dollar amounts assigned to each task.

Which should you choose when the probability of success is low? Say p ≤ 50%. Easy takes less time and the added value of “hard” is discounted by the probability. Therefore, I’d choose Easy.

Which should you choose when the probability of success is high? Say p > 50%. Hard takes more time, but also rewards more. It is also rare to get a high probability, so maximizing gains from a high probability would be smart. Therefore, I’d choose Hard.

Now, what should change given the relative dollar values attached is your probability cutoff between choosing easy and hard. I haven’t been able to figure out exactly how to calculate that (someone please jump in here; my rough attempt is sketched below), but it seems like if Hard is less than 5/3x the value of Easy, only do it in conjunction with a very high probability. That ratio comes from the time to complete Easy vs. Hard. If I followed this principle and chose Hard more sparingly, I’d probably find myself more on the left end of the spectrums pictured above, which seems to match consulting better.
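
Here is that rough attempt, under some big assumptions: the stated probability applies identically to both tasks in a round, Easy takes 3 seconds and Hard takes 5, and the goal is to maximize expected dollars per second over many rounds. The real game’s parameters are unknown to me, so treat this as a starting point rather than an answer.

```python
# Toy expected-reward-per-second comparison for the easy/hard tradeoff. Task
# times (3s and 5s), rewards, and the "same probability applies to both tasks"
# assumption are all simplifications; the real game's parameters are unknown.

def better_task(p_success, easy_reward, hard_reward, easy_time=3.0, hard_time=5.0):
    """Pick the task with the higher expected reward per second."""
    easy_rate = p_success * easy_reward / easy_time
    hard_rate = p_success * hard_reward / hard_time
    return "hard" if hard_rate > easy_rate else "easy"

# If the same probability applies to both tasks, it cancels out of the
# comparison, so the cutoff is purely reward ratio vs. time ratio: choose Hard
# whenever hard_reward / easy_reward > hard_time / easy_time (5/3 here). The
# probability matters again only if it differs between tasks, or if skipping
# a round entirely is an option.
print(better_task(p_success=0.7, easy_reward=2.00, hard_reward=3.00))  # easy (3/2 < 5/3)
print(better_task(p_success=0.7, easy_reward=2.00, hard_reward=4.00))  # hard (4/2 > 5/3)
```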

Other games

  1. Recall strings of digits (successively flashed) of increasing lengths
  2. Follow the arrow direction and ignore distractions
  3. Assess someone’s emotion given a situational story and a smiley face 🙂
  4. Decide between money now and more money later (basically, what are your short-term and long-term discount rates? A quick sketch follows this list)
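
For that last one, the implied discount rate is a quick back-of-the-envelope calculation. The dollar amounts and the 30-day horizon below are made up.

```python
# Hypothetical "money now vs. more money later" choice. The dollar amounts
# and the 30-day horizon are made up; the point is that your choices imply
# a discount rate.

def implied_monthly_discount_rate(amount_now, amount_later, days):
    """Rate at which you would be indifferent between the two offers."""
    months = days / 30.0
    return (amount_later / amount_now) ** (1 / months) - 1

# Preferring $50 today over $60 in 30 days implies a monthly discount rate
# of at least (60/50) - 1 = 20%, which is a fairly impatient choice.
print(f"{implied_monthly_discount_rate(50, 60, 30):.0%}")  # 20%
```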

Breaking Down the Concept of Pymetrics

Before I jump in here, I want to highlight a scary clause in Pymetrics’ Terms of Service. Aside from learning I cannot use any images or text from their website, I also learned that they have a Class Action Waiver. It prevents job-seekers (or even companies) who have been wronged by Pymetrics’ assessments from suing as a group. For disadvantaged groups such as those with disabilities and minorities, sometimes the only way for their case to be lucrative enough for a lawyer to take on is through class actions. Pymetrics has been clever to block this and will continue to make it infinitely harder for regular people to overthrow its perhaps undue authority in hiring.

Analyzing Pymetrics’ Claims and Science

List of my sources:

The starting point: This is amazing reporting by Sarah Todd and the Quartz team, presenting both sides of the story.

General overview of algorithm bias in many fields targeted at a management audience.

“Positive” results from Unilever’s implementation of Pymetrics.

Authoritative overview analyzing the science of hiring algorithms. Worth the long read.

Really awesome Cornell paper: Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices

If you have time, give them a read!

1 — “Pymetrics is based on science”

Pymetrics’ science page does not talk about neuroscience backing. It says it has a foundation in “Data Science,” which is a fancy way to say correlations…

I already explained why this “science” makes me incredibly angry in the Imbellus section on “Level the playing field.”

“The tension between traditional psychometrics and modern data science lies in the difference between optimizing assessment scoring for predictive versus explanatory power.” — Imbellus

If Pymetrics cannot say that a trait (e.g. risk-taking) CAUSES future success as a consulting Senior Partner, then the science is simply perpetuating the traits already in the firm for no reason.

Quartz drew on Suresh Venkatasubramanian’s expertise as a University of Utah Computer Science professor focusing on algorithmic fairness.

“I struggle with AI in hiring because I think there might be a case to be made for being more thoughtful about how we do hiring,” says Venkatasubramanian, noting that the persistence of old-boy networks means there is ‘definitely an argument to be made that we need to reform the process.’

“But if the solution to reform is merely to automate it, it’s not clear that’s addressing the root problem,” he adds. “Strong claims need strong evidence. I haven’t seen that yet.” — Quartz

2 — “Pymetrics results are stable”

“Pymetrics categories measure natural tendencies that are quite stable and tend not to change over time. Because of this, your profile will be saved for one year, after which you will have the opportunity to replay the games if you choose.”

I will tell you, for one thing, that I was late for class when I took my Pymetrics test and was rushed for time. Maybe that made me more of a risk-taker? A lot of these arrow/number memory tests are so brief (lasting no more than 2 minutes) that they in no way represent cognitive focus during a workday. Having just exercised, consumed alcohol the night before, or even had coffee could alter your state. These are not natural tendencies of the mind and personality. These are physical responses.

“Research suggests that personality is unstable, as is the very construct of a self, and that our behavior is highly dependent on the specifics of the situations that we find ourselves” — Quartz

If you’re reading this and you run recruiting, I would love it if you could send me a link for a new Pymetrics test! I want to see whether my results changed from last year. 👏 👏

3 — “Pymetrics is unbiased”

The same set of assessments is used to measure the unique attributes of your top performers. This data is used to build custom algorithms that represent success in a given role at your company. Every algorithm is rigorously tested for bias. — Pymetrics

Cool, but your top performers are rated subjectively. If your company is biased towards men when it comes to promotions, then their attributes are deemed successful.

Pymetrics has versions for color-blindness, dyslexia, and ADHD, but inherent human variation goes beyond these conditions. As part of my research into assistive tech, I saw how motor difficulties in older people slow down clicking time. Furthermore, I can see how flashing signals would bother veterans with PTSD. People who did not grow up with video games or computers wouldn’t be familiar with the layout.

The 4/5ths test

Pymetrics claims that it is unbiased because it passes the 4/5ths test, and it has built a tool called Audit-ai for calculating this.

What is the 4/5ths test?

The way this works is that Pymetrics will generate a machine learning algorithm for predicting success at a company based on scores. Then, the results (predictions for pass/fail) made by the algorithm are analyzed to sniff out bias.

“According to the Uniform Guidelines on Employee Selection Procedures (UGESP; EEOC et al., 1978), all assessment tools should comply to fair standard of treatment for all protected groups.

Within the hiring space, the EEOC often uses a statistical significance of p < .05 to determine bias, and a bias ratio below the 4/5ths rule to demonstrate practical significance.” — Audit-ai by Pymetrics

“The proportional pass rates of the highest-passing demographic group are compared to the lowest-passing group for each demographic category (gender and ethnicity). This proportion is known as the bias ratio.”

The 4/5ths rule effectively states that the lowest-passing group has to be within 4/5ths of the pass rate of the highest-passing group. — Audit-ai by Pymetrics

Fictional example: Suppose that a “pass” is denoted as having an A grade. In this test, women were the highest-passing gender — 50% of them passed. Meanwhile, men were the lowest-passing — 41% of them passed. The ratio 0.41/0.50 = 0.82 is greater than 4/5, so the algorithm is deemed unbiased.
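
Here is that fictional example as an explicit calculation. The pass rates come from the example above, and nothing here uses Pymetrics’ actual Audit-ai code; it’s just pass-rate arithmetic.

```python
# The 4/5ths check from the fictional example above. Pass rates come from the
# example; this is plain arithmetic, not Pymetrics' Audit-ai code.

pass_rates = {
    "women": 0.50,  # highest-passing group in the example
    "men":   0.41,  # lowest-passing group in the example
}

bias_ratio = min(pass_rates.values()) / max(pass_rates.values())
print(f"bias ratio = {bias_ratio:.2f}")            # 0.82
print("passes 4/5ths rule:", bias_ratio >= 4 / 5)  # True: deemed "unbiased"
```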

As you can see, this is an extremely arbitrary and rudimentary metric — and hardly a rigorous statistical test. It meets the bare minimum outlined by the EEOC.

Apparently, the way Pymetrics gets around this is by trial and error. Pymetrics CEO Frida Polli says, “if there’s a difference, Pymetrics labels the algorithm as biased and finds an alternative.”

The other providers fare no better. PredictiveHire makes the astounding claim that “AI bias is testable, hence fixable.” Knockri claims its “A.I. is unbiased because of its full spectrum database that ensures there’s no benchmark of what the ‘ideal candidate’ looks like.” Sorry, but just because you have a lot of data doesn’t mean your data and algorithms are unbiased.

Cornell paper: Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices Table 1

In closing, these aptitude tests have abstract claims but real impact on individuals.

Source: Reddit

Another commenter on Reddit said “I personally know someone that landed a brand manager job [at P&G]. The person had her roommate take the personality test/games for her.”

All this being said, there have been positive trends at companies. “According to Polli, in the first year after implementing the service, Unilever hired almost 20% more people of color in the roles for which it used Pymetrics.” This is definitely a great sign, but is probably due to higher awareness by human interviewers and HR.

Maybe there is a silver lining. As of right now, algorithms are not very good at fixing bias. Instead, why don’t we use algorithms to detect and trace bias first?

“Armed with a deeper understanding of the forces that may have shaped prior hiring decisions, new technologies, coupled with affirmative techniques to break entrenched patterns, could make employers more effective allies in promoting equity at scale.” — Upturn.org
