Domain Expertise Crucial for Successful Medical AI

Lessons Learned from Past Attempts to Predict Epidemics

ICTC-CTIC
8 min read · Jun 22, 2020


By Rosina Hamoni, Olivia Lin, and Mairead Matthews

What do you do when you feel sick or experience the first symptoms of a cold or flu? Do you visit your pharmacy or doctor’s office, or do you go on your laptop or mobile phone and Google your symptoms first? If you chose the latter option, you’re not alone. Google has become the “go-to” search engine for almost everything in our daily lives, and caring for our health is no exception. In 2008, Google realized that its platform contained valuable information in the form of flu-related searches and created Google Flu Trends (GFT), an algorithm that used search data to track influenza-like illness (ILI) and predict its spread.

As the standard authority on ILIs, the Centers for Disease Control and Prevention (CDC) estimates flu prevalence on a weekly basis by tracking the number of ILI-related doctor visits. At the outset of the program, GFT was estimating flu prevalence up to two weeks ahead of the CDC, not merely tracking the flu’s spread but predicting it. Unfortunately, the model soon lost its predictive power: trained primarily on seasonal flu data, it completely missed the nonseasonal A/H1N1 pandemic in 2009. Many of its “flu predictors” were tied to winter, whereas A/H1N1 was most prevalent over the summer and fall months.[1] GFT’s algorithm was updated to account for this issue in 2009 but nonetheless missed the 2011 and 2012 flu seasons. Several additional changes were made to the model in 2013 and 2014, with no success: the revised model continued to overestimate the prevalence of the flu. In 2015, following six years of declining accuracy, Google made each revised version of the model available to researchers online before suspending the project indefinitely.

Why the Quality of GFT’s Predictions Deteriorated

At a high level, the 2009 to 2015 period still marked the early days of using AI/ML and big data for disease prediction, and GFT was venturing into unknown territory. The underlying model was also much too simple: GFT was a pure linear regression model, meaning it was designed to “model the relationship between two sets of variables by fitting a linear equation to the observed data.”[2] Of the two sets, one is considered the explanatory or independent variable(s) and the other the dependent variable(s). For example, a linear regression model can relate the height of a plant to its age (age being the explanatory variable). In the context of GFT, the dependent variable was the number of ILI physician visits, while the independent variables were the related search queries. Had the model been more complex, borrowing features from a latent variable model, for example, it might have produced more accurate outcomes. Unlike linear regression models, latent variable models can account for unobserved factors, such as a third, hidden variable that drives both the searches and the doctor visits.
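To make the idea concrete, a pure linear regression of the kind described above can be sketched in a few lines of Python. The numbers here are invented for illustration only (they are not GFT’s actual data): a single “query volume” predictor is fit against weekly ILI visit counts with ordinary least squares.

```python
import numpy as np

# Hypothetical weekly data: normalized search-query volume (the
# independent variable) and ILI-related physician visits (the
# dependent variable). Values are invented for illustration.
query_volume = np.array([0.2, 0.5, 0.9, 1.4, 1.8, 2.3])
ili_visits = np.array([110, 190, 270, 420, 500, 640])

# Fit y = a*x + b by ordinary least squares, as a pure linear
# regression model would.
a, b = np.polyfit(query_volume, ili_visits, deg=1)

# Predict ILI visits for a new week with query volume 1.0.
predicted = a * 1.0 + b
print(f"slope={a:.1f}, intercept={b:.1f}, prediction={predicted:.0f}")
```

A real GFT-style model would use many query predictors at once, but the mechanics are the same: a straight-line (or hyperplane) fit, with no way to represent hidden factors behind the data.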

Secondly, GFT’s algorithm was built on faulty data. The algorithm worked by identifying the 45 search queries, out of 50 million candidates, most correlated with flu activity, then using those queries to predict the flu. So, if people often searched “What do I do when my child has a fever?” during flu season, queries involving “child” and “fever” would likely be used as predictors for the flu. The problem was that the 44th and 45th most related queries were far less related than the 1st and 2nd. The algorithm drew from a wide selection of search queries unrelated to influenza-like illnesses, and so it included phrases like “high school basketball” as possible predictors. “High school basketball” is correlated with ILI physician visits because high school basketball season and flu season both run from fall to winter, but there is no causal relationship between them (correlation is different from causation). By including terms unrelated to the flu, such as “high school basketball,” GFT’s model overpredicted the flu compared to the CDC’s actual reported flu levels.
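The failure mode is easy to reproduce with synthetic data. In the sketch below (all signals and query names are invented for illustration), queries are ranked purely by their correlation with ILI visits, the way GFT’s automated selection worked: a flu-unrelated but equally seasonal query ranks just as highly as a genuinely flu-related one, because no medical knowledge is consulted.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(104)  # two years of weekly data

# Hypothetical signals: ILI visits peak each winter, and so does a
# seasonal but flu-unrelated query like "high school basketball".
season = np.cos(2 * np.pi * weeks / 52)  # winter-peaked yearly cycle
ili_visits = 300 + 200 * season + rng.normal(0, 20, weeks.size)

flu_query = season + rng.normal(0, 0.3, weeks.size)         # genuinely related
basketball_query = season + rng.normal(0, 0.3, weeks.size)  # spurious seasonal correlate
random_query = rng.normal(0, 1, weeks.size)                 # unrelated noise

queries = {
    "flu symptoms": flu_query,
    "high school basketball": basketball_query,
    "cat videos": random_query,
}

# Rank queries by absolute correlation with ILI visits; a purely
# automated selection keeps whichever terms correlate best.
ranked = sorted(
    queries, key=lambda q: -abs(np.corrcoef(queries[q], ili_visits)[0, 1])
)
print(ranked)
```

Both seasonal queries land at the top of the ranking, so a correlation-only filter would happily admit “high school basketball” as a flu predictor.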

Linear regression models can also become sensitive to predictors that are unique to a given dataset. This is known as “overfitting,” and it is especially disruptive when the number of predictors is high and the number of observations is low. When overfitting occurs, the model becomes unable to make accurate predictions on data beyond the original dataset. In short, keeping as many as 45 predictors meant accepting queries only weakly related to the flu, which in turn cost the model its ability to generalize: GFT was trained on one dataset, and when it was applied to new data, it performed poorly.
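Overfitting with many predictors and few observations can also be demonstrated directly. The toy setup below (dimensions chosen to echo GFT’s 45 predictors; the data is synthetic) fits an ordinary least squares model where only one of 45 predictors carries real signal, then compares error on the training data with error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup echoing GFT's shape: 45 predictors, with only a
# handful more observations than predictors in the training set.
n_train, n_test, p = 50, 50, 45
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))

# The true signal depends on just one predictor; the other 44 are noise.
def truth(X):
    return 3 * X[:, 0]

y_train = truth(X_train) + rng.normal(0, 1, n_train)
y_test = truth(X_test) + rng.normal(0, 1, n_test)

# Ordinary least squares happily fits coefficients for all 45 columns,
# absorbing the training noise into the 44 meaningless predictors.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_err = np.mean((X_train @ coef - y_train) ** 2)
test_err = np.mean((X_test @ coef - y_test) ** 2)
print(f"train MSE={train_err:.2f}, test MSE={test_err:.2f}")
```

The training error is near zero while the error on unseen data balloons: exactly the loss of generalization described above, produced by nothing more than too many weak predictors and too few observations.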

By December 2019, AI/ML Product Development and Sophistication Had Evolved Significantly

BlueDot, a Toronto-based digital health firm, developed AI-based “outbreak risk software” for global infectious disease early warning and spread prediction. Each day, the platform scans 100,000+ articles and global news reports in more than 65 languages, along with human, animal, and insect population data and information from local healthcare workers, to detect and forecast disease. Made possible by advances in big data and AI, the platform uses natural language processing and ML algorithms to parse data far too voluminous for humans to review manually. Armed with this information, part of BlueDot’s core service offering is post-analysis alerts sent to private sector and government clients to help identify the risk of new diseases.

BlueDot CEO Dr. Kamran Khan decided to start the company after working as a frontline, practising physician and epidemiologist during the SARS epidemic in 2003. Since its launch, BlueDot has done particularly well: the engine successfully predicted that the Ebola outbreak would end in West Africa in 2014 and that the Zika virus would spread to Florida in 2016.[3]

Most recently, nine days before the World Health Organization released its first official statement on COVID-19, BlueDot had already alerted its clients to several cases of “unusual pneumonia” occurring near Wuhan, China.[4] This made BlueDot one of the first organizations in the world to identify the emerging COVID-19 risk and predict where else in Asia the novel virus would spread. Due to this success, the Canadian federal government is now leveraging BlueDot’s disease analytics platform to track and monitor the virus nationally and to inform future decisions regarding COVID-19.[5]

Photo by Owen Beard on Unsplash

Lessons Learned: Domain Experts are a Core Component of AI/ML Product Teams

So, what do these two attempts to predict influenza-like illnesses tell us about the use of AI/ML in the medical space? For one, disease prediction is incredibly complex, and without the right domain expertise to inform design, strong ML knowledge and programming skills are not enough to guarantee accurate results. Airtight models may not be needed when applying ML in industries such as marketing or retail, but in fields with robust regulatory environments, where the health and safety of individuals is at risk, accuracy is crucial.

This caveat wasn’t as well understood in the early days of AI/ML for disease prediction as it is today. GFT was designed not by a multidisciplinary team, including medical professionals, software developers, and machine learning engineers, but by a team of developers and AI/ML researchers alone. The underlying model was based on assumptions about the significance and accuracy of search engine data that, in practice, turned out not to be true (the initial team used an automated method of selecting ILI-related search queries that, in their words, required no prior knowledge about influenza).[6] A medical professional with a practical understanding of patients’ subjective ability to self-diagnose or accurately identify symptoms might not have trusted search engine data so willingly. Likewise, an influenza specialist might have better understood the relevance of certain search queries for predicting the flu and, in turn, prevented the model from overfitting to seasonally popular but otherwise unrelated terms.

Today we know that domain experts are important because they provide product development teams with a deep understanding of the application area and inform technological approaches to design and implementation. Their expertise is used to manage the product’s research direction and ensure market relevance while technical expertise ensures that the technical foundation of ML applications is plausible. In addition to this, domain experts provide contextual meaning to the data used in prediction models and help determine the practical significance of results.

When looking at the composition of AI product development teams in financial services and healthcare, three roles tend to stand out: ideal teams typically consist of graduate-level financial or medical experts, experienced developers, and AI/ML systems experts. However, the mere presence of these team members is not enough: each must have some underlying knowledge of the others’ areas of expertise so that communication is effective and leads to successful teamwork. In any given phase of product development, each expert leads in their own focus area while drawing on the supporting knowledge of the other two. In other words, an overlapping understanding of one another’s knowledge is also crucial.

This type of team composition is increasingly evident in regulated sectors with substantial risk. BlueDot is no exception and is, in fact, a near-perfect replica of this AI product development team recipe. The company’s Chief Technology Officer, Mike Chmura, has worked on medical software for most of his career and on infectious disease software for a significant portion of it, demonstrating core technical knowledge in programming and machine learning with overlapping knowledge of the domain.[7] Similarly, the company has a comprehensive team of medical professionals on staff to guide their work, and several of those medical advisors have backgrounds in epidemiology and data science alongside medical qualifications.

Going forward, as AI/ML systems are deployed in new, riskier, and more impactful contexts, it will be increasingly important to build comprehensive development teams that have a deep understanding of the application area and can inform technological approaches for design and implementation.

[1] WHO, https://www.who.int/csr/disease/swineflu/updates/en/

[2] Sergei V. Chekanov, “Linear Regression and Curve Fitting,” Scientific Data Analysis using Python Scripting and Java, Springer Science & Business Media, 2010.

[3] Jerry Bowles, “How Canadian AI start-up BlueDot spotted Coronavirus before anyone else had a clue”, Diginomica, March 10, 2020, https://diginomica.com/how-canadian-ai-start-bluedot-spotted-coronavirus-anyone-else-had-clue

[4] Ibid.

[5] “Canada’s plan to mobilize science to fight COVID-19”, Prime Minister of Canada, March 23, 2020, https://pm.gc.ca/en/news/news-releases/2020/03/23/canadas-plan-mobilize-science-fight-covid-19

[6] Jeremy Ginsberg et al., “Detecting influenza epidemics using search engine query data”, February 2009, https://static.googleusercontent.com/media/research.google.com/en//archive/papers/detecting-influenza-epidemics.pdf

[7] BlueDot, https://bluedot.global/about/

Rosina Hamoni is a Research Analyst at the Information and Communications Technology Council of Canada (ICTC), a national centre of expertise on the digital economy. At ICTC, Rosina works on analyzing labour market information, statistical and econometric analysis, as well as economic forecasting. Rosina has contributed to ICTC reports on topics such as blockchain, smart cities, ICTC’s forecasting Outlook report, among other subjects. Rosina holds an MSc in epidemiology and biostatistics from the University of Leeds.
Olivia Lin is a Junior Data Analyst at the Information and Communications Technology Council of Canada. At ICTC, Olivia works with the data team, specifically on job postings and analysis. Olivia has contributed to ICTC reports on topics including cybersecurity and smart cities.
Mairead Matthews is a Research and Policy Analyst at the Information and Communications Technology Council of Canada (ICTC), a national centre of expertise on the digital economy. With ICTC, Mairead brings her longstanding interest in Canadian policy to the conversation on technology and 21st century regulatory challenges. Mairead’s areas of interest include internet policy, data governance, and the social and ethical impacts of emerging tech.
