How good is Invisible’s AI, really?

Mar 1, 2022 · 13 min read


By Stephen Welch | Staff Computer Vision Engineer @ Invisible AI

State-of-the-art open source heavy model (left) vs. in-house lightweight model (right): who will win?! Video Source:

What am I doing here?

Gotta love starting a new ML engineering job. There’s nothing quite like that initial optimism for a new set of problems before the messiness, trade-offs, and complexity of really shipping ML models to real customers begins.

Coming into Invisible AI to lead Deep Learning six weeks ago, I was really looking forward to getting under the hood of a pretty slick stack. Invisible AI deploys cutting-edge Deep Learning models into complex manufacturing environments: models run on camera, on premise, and in some applications are even trained on device. Deployments can easily reach well into the hundreds of cameras at a single location.

A critical piece of the stack is a 2D Multi-Person Pose Estimation model. Many Invisible AI applications depend on this model to deliver critical information to customers about their manufacturing operations. A small improvement in pose estimation could be a big deal.

Naturally, one of the first questions that popped into my head when learning about the stack was: how good is this pose model, really?

While we had lots of anecdotal evidence that the pose model worked well and solved customer problems, we had no single objective performance measure to drive decision making and guide our ML strategy. Further, like any ML model, our pose model does make mistakes, and we had plenty of examples from our Apps team of things that our pose model could do better.

With lots of input from the team, I set out to do two things:

  1. Define and implement a meaningful and useful benchmark scoring methodology for pose models at Invisible AI. One number to rule them all.
  2. Make that number go up.

2D Pose Model Metrics

I’ve found that for 99.9% of the new problems I bump into, someone smarter than me has already thought about it and written something down, so I did the responsible thing and started with a quick literature review, and was happy to quickly find a couple of recent review papers [Chen et al 2020] [Zheng et al 2021].

Chen et al 2020 give a nice tabular summary of different ways to measure how good or bad your pose model is.

Table 1. Summary of pose metrics from Chen et al 2020. How do these line up with what Invisible AI cares about?

Object Keypoint Similarity (OKS)

We started by looking at a pretty standard metric used by the COCO team for model evaluation, Object Keypoint Similarity:

OKS = Σᵢ [ exp(−dᵢ² / (2s²kᵢ²)) · δ(vᵢ>0) ] / Σᵢ [ δ(vᵢ>0) ]

  • Like most equations in ML, this one looks a lot more complicated than it actually is!
  • dᵢ is the distance between the ith predicted keypoint and corresponding labeled keypoint.
  • vᵢ are the visibility flags for the ground truth labels. If a person’s left shoulder isn’t visible in an image, then this keypoint would not be labeled, and v₆=0 (keypoints follow a standard order in the COCO format, and left shoulder is 6th).
  • δ is an “indicator function”, yielding a 1 if its contents are true and 0 if false. I find that indicator functions always make equations look more complicated than they actually are! If a given keypoint is present in the labels, then δ(vᵢ>0)=1, otherwise δ(vᵢ>0)=0. Because the indicator function shows up in both the numerator and denominator, unlabeled keypoints are simply left out of the average.
  • s is the scale of the object — here equal to the square root of the object’s area.
  • kᵢ are a set of constants tuned to make OKS “perceptually meaningful and easy to interpret”. The basic idea here is that for keypoints like eyes, being a few pixels off may be very noticeable, while for hips this may still look like a perfectly acceptable label. The COCO authors capture this idea in a clever way: they take the same image, have it labeled by different people, and measure the spread (standard deviation) of the keypoint locations. As you may expect, these distributions can vary significantly between keypoints; for example, hips turn out to have significantly more variance than eyes. Each kᵢ is set to twice the observed standard deviation for that keypoint, so we’re effectively measuring error in units of the standard deviation seen in human-generated labels instead of just pixels.
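
To make the pieces above concrete, here is a minimal Python sketch of OKS for a single labeled/predicted person pair, following the standard COCO form (a Gaussian-like term per labeled keypoint, averaged over the labeled keypoints). The function name, the argument layout, and the k values in the example are illustrative assumptions, not the COCO reference implementation or Invisible AI's code.

```python
import math


def object_keypoint_similarity(pred, label, visibility, area, k):
    """OKS for one predicted/labeled person pair (illustrative sketch).

    pred, label: lists of (x, y) keypoints in the same order.
    visibility:  ground-truth flags v_i; only keypoints with v_i > 0 count.
    area:        object area in pixels^2, so the scale is s = sqrt(area).
    k:           per-keypoint constants k_i.
    """
    s_squared = area  # s = sqrt(area), so s^2 is just the area
    total, labeled = 0.0, 0
    for (px, py), (lx, ly), v, k_i in zip(pred, label, visibility, k):
        if v > 0:  # the indicator function: skip unlabeled keypoints
            d_squared = (px - lx) ** 2 + (py - ly) ** 2
            total += math.exp(-d_squared / (2 * s_squared * k_i ** 2))
            labeled += 1
    return total / labeled if labeled else 0.0


# Made-up example: two keypoints, the first one is 5 pixels off.
example_oks = object_keypoint_similarity(
    pred=[(3.0, 4.0), (10.0, 10.0)],
    label=[(0.0, 0.0), (10.0, 10.0)],
    visibility=[2, 2],
    area=100.0,
    k=[0.5, 0.5],  # hypothetical constants, not the COCO values
)
print(example_oks)
```

A perfect prediction scores 1.0, and each keypoint's contribution decays smoothly toward 0 as its error grows relative to s·kᵢ, which is why no hard threshold is needed at the keypoint level.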

Percent Correct Keypoints (PCK)

Next we looked at a simpler metric, Percent Correct Keypoints (PCK) [Yang and Ramanan 2013], also known as Percent Detected Joints (PDJ) [Toshev and Szegedy 2014]. The idea here is to imagine a little disk around each labeled keypoint. If our predicted keypoint falls within this disk, it counts as a detection; otherwise it’s a miss. In mathematical terms, borrowing the indicator function δ notation from COCO:

PCK = Σᵢ [ δ(dᵢ ≤ k·s) · δ(vᵢ>0) ] / Σᵢ [ δ(vᵢ>0) ]

  • dᵢ is the distance between the ith predicted keypoint and corresponding labeled keypoint, just as in the OKS equation
  • s is the scale of the object — in some implementations equal to the diagonal of the bounding box around the person, the maximum of the bounding box height and width, or various measurements of the head or torso size.
  • k is a constant fractional multiplier, often something like 0.05 → meaning that if a labeled and predicted keypoint are within 5% of the scale of the overall object, the point is counted as a valid detection.
  • vᵢ are the visibility flags for the ground truth labels. If a person’s left shoulder isn’t visible in an image, then this keypoint would not be labeled, and v₆=0 (keypoints follow a standard order in the COCO format, and left shoulder is 6th).
  • δ is an “indicator function”, yielding a 1 if its contents are true and 0 if false. If a given keypoint is present in the labels, then δ(vᵢ>0)=1, otherwise δ(vᵢ>0)=0. Notice that the indicator function shows up in the numerator and denominator.

The mechanics of PDJ are nice and intuitive: if a person has 10 labeled keypoints and our model detects 8 of them within a distance of k·s of each keypoint, then PDJ=0.8.
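
The same idea in a few lines of Python, as a rough sketch. The function name and data layout are assumptions made for illustration; a missing prediction is represented as None, and the scale s is passed in directly since implementations differ on how they define it.

```python
import math


def percent_correct_keypoints(pred, label, visibility, scale, k=0.05):
    """PCK/PDJ for one person (illustrative sketch).

    pred:       list of (x, y) predictions, or None where nothing was predicted.
    label:      list of (x, y) ground-truth keypoints in the same order.
    visibility: ground-truth flags v_i; only keypoints with v_i > 0 count.
    scale:      object scale s (e.g. bounding box diagonal).
    k:          fraction of the scale used as the disk radius.
    """
    radius = k * scale
    correct, labeled = 0, 0
    for p, l, v in zip(pred, label, visibility):
        if v > 0:
            labeled += 1
            if p is not None and math.dist(p, l) <= radius:
                correct += 1
    return correct / labeled if labeled else 0.0


# Made-up example: 10 labeled keypoints, 8 predicted inside the disk,
# one far outside it, and one not predicted at all.
label = [(10.0 * i, 0.0) for i in range(10)]
pred = list(label)
pred[8] = (label[8][0] + 50.0, 0.0)
pred[9] = None
example_pck = percent_correct_keypoints(pred, label, [2] * 10, scale=100.0)
print(example_pck)  # 8 of 10 inside the disk -> 0.8
```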

Percentage of Correct Parts (PCP)

Finally, we also looked at Percentage of Correct Parts (PCP) [Eichner et al 2012]. A part (also known as a limb in the literature) is a connection of two keypoints. For example, a left wrist and left elbow are combined to create a lower left arm limb/part. “An estimated body part is considered correct if its segment endpoints lie within fraction of the length of the ground-truth segment from their annotated location.” [Eichner et al 2012] A threshold value of 0.5 is often used.

As we’ll see shortly, limbs are a critical abstraction at Invisible AI, which makes PCP appealing. However, using limb length as the thresholding quantity in PCP has some potential drawbacks, as discussed in [Toshev and Szegedy 2014]: “PCP was the initially preferred metric for evaluation, however it has the drawback of penalizing shorter limbs, such as lower arms, which are usually harder to detect.”
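
A small sketch helps show where the short-limb penalty comes from. This follows the definition quoted above (both endpoints within a fraction of the ground-truth limb length); the function name, keypoint names, and coordinates are made up for illustration, and other PCP variants exist in the literature.

```python
import math


def percentage_of_correct_parts(pred, label, limbs, threshold=0.5):
    """PCP for one person (illustrative sketch).

    pred, label: dicts mapping keypoint name -> (x, y).
    limbs:       list of (keypoint_a, keypoint_b) name pairs.
    threshold:   allowed endpoint error as a fraction of the labeled limb length.
    """
    correct = 0
    for a, b in limbs:
        # The tolerance shrinks with the labeled limb length.
        radius = threshold * math.dist(label[a], label[b])
        if (math.dist(pred[a], label[a]) <= radius
                and math.dist(pred[b], label[b]) <= radius):
            correct += 1
    return correct / len(limbs) if limbs else 0.0


# Made-up arm: a long upper arm (100 px) and a short, foreshortened
# lower arm (10 px). Every predicted keypoint is off by the same 6 px.
label = {"shoulder": (0.0, 0.0), "elbow": (0.0, 100.0), "wrist": (0.0, 110.0)}
pred = {name: (x + 6.0, y) for name, (x, y) in label.items()}
limbs = [("shoulder", "elbow"), ("elbow", "wrist")]
example_pcp = percentage_of_correct_parts(pred, label, limbs)
print(example_pcp)  # upper arm passes (6 <= 50), lower arm fails (6 > 5)
```

The identical 6-pixel error is accepted on the long limb and rejected on the short one, which is the drawback Toshev and Szegedy point out.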

Visualization of some commonly used pose performance metrics on some industrial example video clips. PDJW=weighted PDJ, with more weight on elbows and wrists. Source:

Do these metrics work for Invisible AI?

After getting my head around the nuance of these metrics, I shopped them around the team a bit: do any of these metrics capture what our application engineers and customers care about? In general, I really hate re-inventing the wheel, but here it really didn’t seem like any of these off-the-shelf metrics was going to cut it. Here’s why:

  • We care disproportionately about lower arms. Much of the work Invisible AI tracks involves people doing things with their hands. If our pose model can’t keep track of people’s elbows and hands, we’re in trouble. Torsos of course matter as well, but not as much. Whatever metric we land on needs to be sensitive to arm issues.
  • Limbs > Keypoints. As discussed above, a limb (also referred to as a part in the literature) is just a connection of keypoints. I thought this distinction was trivial at first → aren’t limbs just linear combinations of keypoints? Yes and no. The extra complexity here comes from a few sources. For one, our downstream applications often use limbs, not keypoints → getting a left wrist keypoint detection but missing a left elbow doesn’t do us much good → it’s not 50% correct. Secondly, the pose model architecture, borrowing ideas from OpenPose and PifPaf [Kreiss et al 2019], not only learns keypoints, but also a notion of limbs. Finally, one of the most common errors we see in practice is when our keypoint detections are great, but our limbs get confused between people. All of this to say that while limbs are sort of a minor extension to keypoints, they’re what really matter in Invisible AI’s stack today, and our metric will likely be more useful if limbs are its atoms, not keypoints.
  • Recall and Precision matter equally. In many of the ML applications I’ve worked on (e.g. autonomous driving, defect detection in manufacturing) recall is significantly more important than precision. Sure, it’s ok to scare us with a false positive now and then, but for goodness’ sake, don’t let that defective part leave the factory. Don’t miss detecting that pedestrian. Chatting with the team, I’ve learned that for our applications false positives and false negatives are roughly equally problematic. For this reason, a thresholded Average Precision or Average Recall number alone probably wasn’t going to cut it.
  • Local and Global. A big part of the rationale behind the COCO OKS score is to serve the same role that IoU serves in object detection:

The core idea behind evaluating keypoint detection is to mimic the evaluation metrics used for object detection, namely average precision (AP) and average recall (AR) and their variants. At the heart of these metrics is a similarity measure between ground truth objects and predicted objects. In the case of object detection, the IoU serves as this similarity measure (for both boxes and segments). Thresholding the IoU defines matches between the ground truth and predicted objects and allows computing precision-recall curves. To adopt AP/AR for keypoints detection, we only need to define an analogous similarity measure. We do so by defining an object keypoint similarity (OKS) which plays the same role as the IoU.

In the COCO approach, OKS is computed for each label-prediction pairing, and then thresholded to determine if we have a match or not. From there, pose estimation can be treated exactly like bounding box detection. Each label and prediction can be classified as a true positive, false positive, or false negative, and precision and recall curves can be computed. There’s a lot to like here, but one thing we don’t love is that limb localization issues can be lost. If a label-prediction pair has a high enough OKS, the whole person gets counted as correct, even if localization is just barely good enough, so we potentially obscure localization issues in our global metric. COCO does address this by averaging performance across a range of OKS thresholds. We agree that this is an effective approach, but found it a bit difficult to reason about and visualize. Ideally, we would like our performance metric to be sensitive both to local localization errors at the person level and to global errors, e.g. did we detect that person at all?

  • We like the interpretability of PCK. Don’t get me wrong: I think that the COCO team’s implementation of OKS is brilliant, and not having to threshold localization errors is terrific. However, there’s something that really resonated with the team about the simplicity of drawing a little disk around each ground truth keypoint and checking if a predicted keypoint is within the disk.

The Invisible AI Weighted Limb f1 Score

Figure 1. Invisible AI Weighted Limb f1 Score. Source:

So, how do we turn all this information into math? After experimentation and discussion we’ve converged on what we’re calling a weighted limb f1 score. Here’s how it works.

  1. Start with Percent Correct Keypoints (PCK). We like the intuitiveness of a little “disk of correctness” around each ground truth keypoint, as shown in Figure 1b.
  2. Next, we need to match the labeled and predicted people. Matching two sets of things in computer vision always turns out to be more complicated than I expect. Constructing loss functions for bounding box object detection models is a great example: it gets really gross really fast. Our models don’t just output a bunch of keypoints, they also, critically, merge sets of keypoints into these things called “people” 😂. To match people, we compare each prediction to each label (this could become costly, but happily validation happens offline and monitoring crowds is not part of the scope of our products) and compute how well each set of predicted keypoints matches each set of labels. We then rank these matches (using the same limb f1 score we use for overall performance, as described below), and pair off predictions and labels by walking down the list. Figure 1c shows an example set of predicted keypoints paired to each labeled person.
  3. As discussed, we know limbs are important, so we next map keypoints to limbs: if both keypoints are correct, then the limb is correct, as shown in Figure 1d. This gets us around the issue with PCP discussed by [Toshev and Szegedy 2014] above. From here, predicted limbs fall into four categories:

Limb True Positives. Both labeled keypoints are detected and localized correctly.

Limb False Negatives. One or both labeled keypoints are either not detected or not localized correctly.

Limb False Positive. A limb is predicted where no labeled limb exists.

Limb Localization Error. This is a sort of strange edge case, and there may be a better way to think about it that I just haven’t bumped into yet. This happens when we have the same labeled and predicted limb, but the limb is not localized correctly (e.g. the lower right leg in Figure 1d). Since localization error, especially of workers’ arms when manipulating tools and work pieces, matters a lot for Invisible AI → we lump these localization errors in with False Negatives. If our limb detection is this far off, it might as well be wrong.

4. Now we’re almost there! Each limb is now a True Positive (TP), False Negative (FN), or False Positive (FP). All that’s left to do is compute overall image-level performance. As mentioned above, we want a “global and local” metric, so instead of taking an intermediate step where we compute the performance for each person, we simply concatenate all limbs together into a single image-level vector of TPs, FNs, and FPs. From there we compute precision, recall, and finally f1 score. This approach does have the side effect of weighting people with more visible limbs higher than those with fewer, but we think this is a reasonable bias to adopt.
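
The person-matching idea from step 2 (score every prediction against every label, rank the pairings, then pair off by walking down the list) can be sketched as a simple greedy matcher. This is a guess at the general shape, not the production code: the function name is hypothetical, the scoring function is passed in rather than fixed, and skipping zero-score pairings is an added assumption.

```python
def match_people(predictions, labels, score_fn):
    """Greedy one-to-one matching of predicted people to labeled people.

    score_fn(prediction, label) returns a similarity score (higher is better).
    Returns a list of (prediction_index, label_index, score) tuples.
    """
    # Score every prediction against every label: O(P * L) comparisons.
    candidates = [
        (score_fn(p, l), pi, li)
        for pi, p in enumerate(predictions)
        for li, l in enumerate(labels)
    ]
    # Best pairings first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_preds, used_labels, matches = set(), set(), []
    for score, pi, li in candidates:
        # Assumption: a pairing with no overlap at all is not a match.
        if score <= 0 or pi in used_preds or li in used_labels:
            continue
        matches.append((pi, li, score))
        used_preds.add(pi)
        used_labels.add(li)
    return matches


# Toy example: "people" are just 1D positions, scored by closeness.
def closeness(p, l):
    return max(0.0, 1.0 - abs(p - l) / 10.0)


example_matches = match_people([10, 52], [50, 11, 90], closeness)
print([(pi, li) for pi, li, _ in example_matches])
```

Labels left unmatched (index 2 in the toy example) would contribute only false negatives downstream, and unmatched predictions only false positives.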

Finally, as discussed above, we know that some limbs matter more than others to the team and to our customers. To address this we adopt a simple weighting procedure where some limbs effectively “count more” than others in Recall and Precision calculations.
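
Putting the limb classification and the weighting together, a minimal sketch might look like the following. All names, the limb definitions, and the weight values are hypothetical; the disk radius is passed in directly, and localization errors are counted as false negatives as described above.

```python
import math


def classify_limbs(pred, label, limbs, radius):
    """Classify each limb of one matched person as "TP", "FN", or "FP".

    pred, label: dicts mapping keypoint name -> (x, y); a missing key means
                 the keypoint was not predicted / not labeled.
    limbs:       dict mapping limb name -> (keypoint_a, keypoint_b).
    radius:      radius of the "disk of correctness" in pixels.
    """
    outcomes = []
    for name, (a, b) in limbs.items():
        labeled = a in label and b in label
        predicted = a in pred and b in pred
        if labeled:
            localized = predicted and all(
                math.dist(pred[j], label[j]) <= radius for j in (a, b)
            )
            # Missed limbs and badly localized limbs both count as FN.
            outcomes.append((name, "TP" if localized else "FN"))
        elif predicted:
            outcomes.append((name, "FP"))
    return outcomes


def weighted_limb_f1(outcomes, weights, default_weight=1.0):
    """F1 over image-level limb outcomes, where some limbs count more."""
    totals = {"TP": 0.0, "FN": 0.0, "FP": 0.0}
    for name, outcome in outcomes:
        totals[outcome] += weights.get(name, default_weight)
    tp, fn, fp = totals["TP"], totals["FN"], totals["FP"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up image-level outcomes, concatenated across all people.
outcomes = [("upper_arm", "TP"), ("lower_arm", "FN"),
            ("torso", "TP"), ("lower_leg", "FP")]
print(weighted_limb_f1(outcomes, weights={}))                  # unweighted
print(weighted_limb_f1(outcomes, weights={"lower_arm": 2.0}))  # arms count double
```

In this toy case the missed lower arm drags the score down further once it is weighted more heavily (from 2/3 to 4/7), which is the behavior we want from a metric that is sensitive to arm issues.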

The Benchmark

Alright, that was a lot of work! Now what? Now that we had a metric, the team and I set up a consistent internal test set to measure performance on from model to model. To validate that we had an effective metric and a reasonable test set, we visualized model performance on videos; examples are shown below.

Will SOTA Crush us?

The Invisible AI pose model is an interesting combination of ideas from academic papers, internal ideas, and technical solutions to various problems that pop up. Due to rapid inference time requirements and relatively low power deployment hardware, the model is also quite lightweight. The overall approach was put together circa 2018–2019, and uses ideas from OpenPose and PifPaf, plus some other clever ideas on the architecture and training sides.

This seemed like a strong approach, but I remained a bit skeptical. Do we really have a good sense for how we’re doing relative to State-of-the-Art (SOTA) models? The field of DL changes really quickly: performance on COCO has gone up something like 7 points since 2018.

Ah, PifPaf. What all pose models need to perform their best and keep those pesky bugs out.

I pulled in a top performing heavyweight model from mmpose, pose_hrnet_w48_udp, with an impressive 0.77 COCO val2017 AP score. So how did we do?

Table 2. Invisible AI vs SOTA. *On Titan Pascal GPU

As shown in Table 2, happily there was no bloodbath to be had, with the Invisible AI model holding its own well, especially factoring in inference time. Below are animations showing performance of the Invisible AI model relative to pose_hrnet_w48_udp on specific examples.

pose_hrnet_w48_udp on left, Invisible model on right. Time series on bottom of screen shows f1 over time as we look at various clips. Source=
pose_hrnet_w48_udp on left, Invisible AI model on right. Time series on bottom of screen shows f1 over time as we look at various clips. Source=

What’s Next?

Vacation 😎 ! Just kidding. While happily out-of-the-box SOTA doesn’t appear to be beating us on our own benchmark, there are a ton of new ideas that we’re looking at to take our benchmark performance to the next level, delivering more accurate outputs for our apps team and ultimately more reliable results to our customers.

We’re Hiring!

“If you’re the smartest person in the room, you’re in the wrong room.”

When I started the job search that led me to join Invisible AI, I decided to be a bit more analytical this time. I collected data on 100+ companies that could be interesting to work with, and sorted on a number of metrics. A metric that I really prioritized this time around was finding a team that was way smarter than me. I stalked founders and engineers on LinkedIn, and did my best to estimate a “team” score for each company. Invisible AI really impressed me: strong technical founders, great early hires. After 6 weeks on the team I’m happy to report that the team has really lived up to this initial impression. These are really smart and collaborative folks all rowing in the same direction, and it’s really been a pleasure to get to know everyone and see how things work.

If you’re interested in joining the team, have any questions or comments, or would just like to chat — please reach out to me directly at [my_first_name]