A Machine Learning Approach to Diagnosing Colon Cancer

Anirudh Kamath
Published in CounteractIO, May 27, 2017

A little bit of background:

A few weeks ago, a close family friend was diagnosed with stage 3 colon/rectal (colorectal) cancer. This amazing woman was someone whom I didn’t address by name, but by the term “amma,” the literal word for “mother” in most Indian languages. Eighteen years later, I still consider her nothing less than a second mother.

The diagnosis came as a shock to all of us. Her extended family had rarely seen cancer of any kind, let alone colon cancer. After months of thinking she just had hemorrhoids, she finally went to the doctor expecting to get treatment for just that.

One painful colonoscopy later, the doctors found a tumor that was starting to extend to nearby lymph nodes. They weren’t sure yet whether it was stage 2 or stage 3, though. Hell, they didn’t even know if she really had cancer to begin with. They had to confirm it with a biopsy and a CT scan. In the end, they performed surgery to remove the tumor and diagnosed her with stage 3 colon cancer.

It’s curable, they said. Hopefully it is. Except that, since a lymph node was affected, she now has to go through six months of chemotherapy.

Is colon cancer even that big? If it were, there would probably have been things done about it, right?

The largest cancers affecting both sexes

Colon/rectal cancer is the fourth most common cancer in the United States, according to the National Cancer Institute, behind breast, prostate, and lung cancer. It affects over 130,000 new people every year.

Oh but it gets worse. There are 50,000+ people who don’t survive. 50,000. Second only to lung cancer. That’s more than those who die from breast cancer and melanoma (skin cancer), combined.

Wow, okay, that’s a lot of people. Surely there’s been a ton of innovation in the field to improve diagnosis without causing so much pain to the patient, right?

The best tests I was able to find were flexible sigmoidoscopies, colonoscopies, and virtual colonoscopies. The first two, while accurate, cause a whole lot of pain. The virtual colonoscopy is as expensive as it is painless.

The least uncomfortable tests I could find that were still inexpensive were the fecal immunochemical test (FIT) and the guaiac-based fecal occult blood test (gFOBT). Wow, that’s a mouthful.

Both tests depend on a stool sample, and they measure the blood in the stool in order to come up with a diagnosis. They are the least invasive tests (doctors don’t have to stick a rod up your butt), and they can even be done at home.

BEFORE YOU PROCEED, IT’S CRUCIAL THAT YOU UNDERSTAND THAT MY ENTIRE EXPOSURE TO THE FIELD HAS BEEN THROUGH 10TH GRADE BIOLOGY. THIS ENTIRE PROJECT IS JUST WHAT I DID FROM MY OWN RESEARCH AND LEARNING. I’D MORE THAN WELCOME YOUR FEEDBACK.

So what’s the problem?

The best way to see it is with what’s known as a confusion matrix. It’s a nice visual that summarizes the effectiveness of a test, particularly a diagnostic test. Out of a certain number of tests done, how many positive cases did the test label positive (True Positive)? How many negative cases did the test label negative (True Negative)? How many positive cases did the test label negative (False Negative)? How many negative cases did the test label positive (False Positive)?

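That’s where the sensitivity and specificity metrics come in. Sensitivity tells you what fraction of the people with the condition were labeled as having it (true positives out of everyone who is actually positive). Specificity tells you what fraction of the people without the condition were labeled as not having it (true negatives out of everyone who is actually negative). The thing is, both of these metrics can easily be skewed.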
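To make those definitions concrete, here is a minimal sketch in Python. The counts are made up purely for illustration and do not come from FIT, gFOBT, or CounteractIO.

```python
def sensitivity(tp, fn):
    """Fraction of people WITH the condition that the test labels positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of people WITHOUT the condition that the test labels negative."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all tests that gave the right answer."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts for 200 tested people (illustration only).
tp, fn = 36, 14    # 50 people who actually have the condition
tn, fp = 141, 9    # 150 people who actually don't

print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("accuracy:   ", accuracy(tp, tn, fp, fn))
```

Notice that sensitivity only looks at the people who are actually sick, and specificity only looks at the people who are actually healthy, so neither number tells the whole story on its own.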

If the test returned negative every single time, even for people who actually have the condition, it would catch none of the real cases (0% sensitivity), yet it would have 100% specificity. Makes sense, right? If everyone is labelled as negative, then the people without the disease will definitely be labelled as negative, making a false positive outcome literally impossible.
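Here is that thought experiment as a small Python sketch. The population of 1,000 people is hypothetical, and `always_negative` is a deliberately useless stand-in for a test, not a model of any real one.

```python
def always_negative(patient_has_condition):
    """A useless 'test' that ignores the patient and always says negative."""
    return False


# Hypothetical population: 50 people with the condition, 950 without.
truth = [True] * 50 + [False] * 950
predicted = [always_negative(t) for t in truth]

# Count the four cells of the confusion matrix.
tp = sum(t and p for t, p in zip(truth, predicted))
fn = sum(t and not p for t, p in zip(truth, predicted))
tn = sum(not t and not p for t, p in zip(truth, predicted))
fp = sum(not t and p for t, p in zip(truth, predicted))

sens = tp / (tp + fn)   # share of sick people who were caught
spec = tn / (tn + fp)   # share of healthy people who were cleared

print(f"sensitivity: {sens:.0%}")   # prints "sensitivity: 0%"
print(f"specificity: {spec:.0%}")   # prints "specificity: 100%"
```

A perfect specificity score here says nothing good about the test: every one of the 50 people with the condition was missed.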

That’s the problem with modern tests. The FIT and gFOBT tests return negative a lot more than they should, leading to many people getting results that inaccurately say they don’t have cancer. At that point, they’re ignoring the cancer, and just letting it grow.

Look, you’ve said a lot, but what exactly was the point of writing all of this? Are you doing anything about it?

A working prototype of CounteractIO, a noninvasive diagnosis tool for colorectal cancer.

Enter CounteractIO.

What if there were a noninvasive way of diagnosing your current condition based on everyday things you can answer yourself?

CounteractIO uses various factors to come up with a probability-based diagnosis for your condition. This means that instead of outright telling you that you have stage 2 cancer, it will tell you that you have a 20% chance of having an adenoma, a 10% chance of not having anything at all, and a 70% chance of having cancer. Then it will tell you the percent chance of having each stage of cancer.
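This post doesn’t say which model sits behind those percentages, so the following is only a sketch of one common way to get a probability per outcome: a softmax over per-class scores. Both the softmax choice and the `raw_scores` values are assumptions made for illustration, not CounteractIO’s actual method or output.

```python
import math


def softmax(scores):
    """Turn arbitrary per-class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical raw scores a classifier might produce for one patient.
raw_scores = {"nothing": 0.0, "adenoma": 0.7, "cancer": 1.95}

probabilities = softmax(raw_scores)
for label, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.0%}")
```

The point of reporting the whole distribution rather than a single label is that a patient and their doctor can see how confident the tool is before deciding on a follow-up test.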

While CounteractIO still has a long way to go as far as clinical validation and FDA clearance, I wanted to share the progress I’ve achieved in the last few weeks.

The inputs I used were the amount of blood counted from FIT, history of colonic lesions, history of polyps, age, and BMI. All things that are extremely simple to provide.

I had two separate tests — one was meant for simply detecting cancer/no cancer, and the other distinguished cancer, adenoma, and nothing at all. The former tended to perform much better than the latter.

The following results come from a cross-validation test run 10 times.
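For anyone unfamiliar with cross-validation, here is a minimal sketch of the idea. It assumes "run 10 times" means 10 folds, which is only one possible reading; the `fit_majority` model, `score_accuracy` scorer, and toy data are hypothetical placeholders, not the classifier or dataset used for the results below.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, labels, fit, score, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores.append(
            score(model, [samples[i] for i in test], [labels[i] for i in test])
        )
    return sum(scores) / len(scores)


def fit_majority(xs, ys):
    """Placeholder 'model': remember the most common training label."""
    return max(set(ys), key=ys.count)


def score_accuracy(model, xs, ys):
    """Fraction of held-out labels that match the remembered label."""
    return sum(y == model for y in ys) / len(ys)


# Toy data (illustration only): 100 samples, one in four labelled positive.
samples = list(range(100))
labels = [i % 4 == 0 for i in samples]
mean_accuracy = cross_validate(samples, labels, fit_majority, score_accuracy)
print("mean accuracy over 10 folds:", mean_accuracy)
```

Because each score is measured on data the model never saw during training, the average is a fairer estimate of real-world performance than testing on the training data itself.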

And now, prepare yourself for graphs.

The ROC Curve for cancer detection

By far the best-performing classifier was the one whose inputs included the FIT results as well as the history of lesions and polyps. The accuracy was right at 90% (peak 98%), with specificity at 97.4%! The sensitivity was lacking, however, at only 72%.

The next problem was distinguishing between cancer/adenoma/nothing, which I labelled multi-output detection since there were more than just two outputs. Ironically, the accuracy was lower, at 85%, but the sensitivity was 92%!

The next part was rather challenging. The previous two included the history of lesions/polyps as inputs, but what if the patient had never been tested for polyps/lesions? This is the problem CounteractIO plans to address next. These are the results I have as of now:
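An ROC curve is traced by sweeping the cutoff at which a score counts as "positive" and recording the true positive rate (sensitivity) against the false positive rate (1 minus specificity) at each cutoff. The sketch below shows that calculation on made-up scores; the numbers are hypothetical and are not the data behind the curve in this post.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a cutoff above every score flags nobody
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y for s, y in zip(scores, labels))
        fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores and true labels (True = has cancer).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]

for fpr, tpr in roc_points(scores, labels):
    print(f"specificity {1 - fpr:.2f}  sensitivity {tpr:.2f}")
```

Lowering the cutoff can only raise sensitivity and lower specificity, so a result like high specificity with weaker sensitivity is one point on this curve, and a different cutoff would trade one for the other.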

This is again simply distinguishing cancer/no cancer, except this time the test doesn’t consider the history of lesions/polyps. To my surprise, the performance was not too bad at all. The accuracy was pretty decent, at 86%. The specificity remained solid, at 95%, while the sensitivity trailed behind at 73%.

Oh boy, this one was a doozy. This is the multi-output detection again, this time without the history of lesions/polyps. The accuracy was much worse than the others at only 77%, with peak performance barely reaching 88%. The specificity was laughable, at 45%. However, the sensitivity was not too bad, at 82%. Needless to say, this needs the most improvement.

What are your next steps?

As I said before, I definitely need to improve the multi-output detection without history of lesions/polyps. Beyond that, however, I want to spread this technology as much as I can, but I can’t do that without more legitimacy.

I’m looking for funding so I can employ senior data engineers who can help me further improve my analysis.

I’m looking for press so I can spread awareness about the cause and what I’m doing.

I’m looking for clinical validation so I know what I’m doing can actually make an impact.

If you read this and laughed at my naivete, or you genuinely took it seriously and want to help, I’d appreciate any kind of feedback. If you’d like to reach out to me personally, my Facebook is here, and my LinkedIn is here.
