How Centaur Labs harnesses diverse opinions to make diagnostics more accessible.
Have you ever witnessed a group of people estimate the number of jelly beans in a jar? Chances are, each person individually is off the mark, but if you average the opinions of everyone in the group, they will get pretty close.
This is an illustration of a concept known as the “wisdom of the crowd.” The aggregate opinion of many individuals tends to be as accurate as, or more accurate than, any single opinion. This idea, while seemingly simple, has powerful applications.
Centaur Labs harnesses the power of the crowd and applies it to a crucial area of medicine: diagnostics. By aggregating the collective opinion of medical students, doctors, nurses, and others from around the world, Centaur Labs is capable of labeling medical images with a very high degree of accuracy.
Our users label medical images through our mobile app, DiagnosUs, and can practice analyzing a wide range of cases — from classifying skin lesions to spotting pneumonia in an X-ray. Fittingly, the app is valuable to medical students and professionals looking to practice their diagnostic skills. The tasks are gamified and allow users to compete against each other for cash prizes. While labeling images with known answers, users also contribute labels for data with unknown ground truths. These aggregate opinions can be combined with computer vision algorithms to form an overall estimate — this hybrid of human and algorithmic intelligence is known as a centaur intelligence, a topic that will be explored further in a future post.
Our first set of pilot customers is largely composed of AI-based companies aiming to label large medical image datasets for training and testing their models. This is not unlike how Google’s ReCAPTCHA program uses people to help teach autonomous vehicles to identify signs, trees, and cars. In the future, we envision our system also helping make image-based diagnoses more accessible to everyone through labeling user-submitted images.
How Accurate Is The Crowd?
In this post, we will take a look at how many users, on average, must review a case in order to have high assurance of accuracy. We use a dataset of skin images, each labeled by a large number of DiagnosUs users as showing psoriasis or not. This project was undertaken in collaboration with LEO Pharma’s Innovation lab. The “correct answers” of the dataset are determined by a panel of 2–12 professional dermatologists.
If we take the simple majority vote of all DiagnosUs users across these images and assume that the professional dermatologists are correct, the crowd achieves an accuracy of 97.1% and an AUC of over 0.99! It also turns out that in most of the problems that our crowd gets “incorrect,” there were fewer dermatologists (usually 2–3) determining the correct answer, suggesting that the crowd may actually be correct on some of these cases.
An Interesting Mathematical Question
How many opinions do we need on a particular image in this dataset to get a reasonable assurance of accuracy?
One approach to answering this question would be to repeatedly take a random sample of n users for each image, perform a simple majority vote, average the results, and then see how the accuracy of this method changes for different values of n. However, drawing many random samples of n users from the N who rated a particular image is relatively computationally expensive. Ideally, we would also want to account for every way of choosing n of the N users, not just a random handful. How do we do this efficiently?
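For reference, the brute-force sampling approach described above might look like the following sketch. The function name, the boolean vote encoding, and the 100-vote example image are assumptions made for illustration, not the code used in the actual analysis.

```python
import random

def sampled_majority_accuracy(votes, n, trials=10_000, seed=0):
    """Estimate the chance that a majority of n randomly chosen votes is correct.

    votes: list of booleans, True where a user voted for the correct answer.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        subset = rng.sample(votes, n)  # draw n users without replacement
        if sum(subset) > n / 2:        # strict majority voted correctly
            wins += 1
    return wins / trials

# Hypothetical image: 100 votes, 60 of which match the correct answer.
votes = [True] * 60 + [False] * 40
estimate = sampled_majority_accuracy(votes, n=5)
print(f"Estimated majority accuracy for n=5: {estimate:.3f}")
```

Every image and every value of n needs its own batch of trials, and the result is still only an estimate that varies with the random seed.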
This is where hypergeometric distributions come in. A hypergeometric distribution describes the probability of having k successes (random draws with a specific characteristic) in n draws, without replacement, from a finite population of N objects where K of them have that characteristic in total. For example, say we have 100 votes on an image, 60 of which are positive (presence of psoriasis), and we want to know the probability that we will get exactly 3 positive votes if we draw 5 of them at random (without replacement). The answer to this question would be (60 choose 3) * (40 choose 2) / (100 choose 5). This is the number of ways to choose 3 of the 60 positive votes and 2 of the remaining 40 non-positive votes, divided by the total number of ways to choose 5 out of 100 votes in general. The distribution of these probabilities is the probability mass function (PMF), and the cumulative sum of values of the PMF is called the cumulative distribution function (CDF). For example, we can deterministically express the probability of drawing 3 or fewer positive votes out of 5 as a CDF.
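The worked example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names hypergeom_pmf and hypergeom_cdf are hypothetical and are not SciPy's functions.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes) in n draws without replacement
    from a population of N objects, K of which are successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_cdf(k, N, K, n):
    """P(at most k successes): the cumulative sum of the PMF from 0 to k."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k + 1))

# 100 votes, 60 positive, draw 5 without replacement.
p_exactly_3 = hypergeom_pmf(3, N=100, K=60, n=5)  # (60 choose 3)*(40 choose 2)/(100 choose 5)
p_at_most_3 = hypergeom_cdf(3, N=100, K=60, n=5)
print(f"P(exactly 3 positive) = {p_exactly_3:.4f}")
print(f"P(3 or fewer positive) = {p_at_most_3:.4f}")
```

The probability of exactly 3 positive votes comes out to roughly 0.35, so even with a 60% positive crowd, a draw of 5 lands on exactly 3 positives only about a third of the time.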
So, back to our original question. In order to estimate the accuracy of a subset of n users on a particular image, we can use a CDF. SciPy has a built-in function called hypergeom.cdf that computes this relatively quickly. For the case of choosing a subset of n users, we can determine the probability that a majority of that subset (at least (n+1)/2 users for an odd value of n) voted for the correct answer by looking at the total number of users who voted at all, and within that group, how many voted correctly. To get the probability that at least a majority voted for the correct answer, we can compute 1 - CDF, where the CDF gives the probability that fewer than half of the subset of n users voted correctly. If we compute this for various odd values of n (1, 3, 5, etc.) we get the following diagram:
It would appear that the accuracy begins to level off at about n = 5 or n = 7, approaching the maximum accuracy of 97.1%. This indicates that we need surprisingly few of our users to obtain a reasonable assurance of accuracy on this dataset. We could improve the accuracy even further by weighting the opinions of individuals according to their skill level.
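The majority-vote calculation behind this curve can be sketched as follows. This is an illustrative standard-library version, not the code used for the analysis: the function names and the per-image vote counts are made up. With SciPy installed, hypergeom.sf(needed - 1, total_votes, correct_votes, n) from scipy.stats should give the same per-image value.

```python
from math import comb

def majority_correct_probability(total_votes, correct_votes, n):
    """Exact chance that a majority of n users, drawn without replacement
    from everyone who voted on an image, picked the correct answer."""
    needed = n // 2 + 1  # (n + 1) / 2 for odd n
    favorable = sum(
        comb(correct_votes, k) * comb(total_votes - correct_votes, n - k)
        for k in range(needed, n + 1)
    )
    return favorable / comb(total_votes, n)

def expected_accuracy(images, n):
    """Average the per-image majority probability over a dataset."""
    return sum(majority_correct_probability(N, K, n) for N, K in images) / len(images)

# Hypothetical (total votes, correct votes) pairs, one per image.
images = [(100, 60), (80, 72), (120, 114), (90, 40)]
curve = {n: expected_accuracy(images, n) for n in (1, 3, 5, 7)}
for n, acc in curve.items():
    print(f"n={n}: expected majority-vote accuracy {acc:.3f}")
```

Because the sum covers every possible subset of size n exactly, no random sampling is needed, and each image costs only a handful of binomial coefficients.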
At Centaur Labs, we are excited to see how the wisdom of the crowds will ultimately improve the ways we make decisions in crucial areas like medicine. We believe that the future of AI will involve intelligent collaboration between humans and computers, as both bring complementary strengths to the table. Ultimately, the diversity of our crowd is our greatest asset.