AI4TB: Ground-truthing from machine-learning innovations
In the third “Sprint” of the AI4TB pilot, we evaluated the use of computer-assisted detection (CAD) to identify tuberculosis (TB) and silicosis in radiological images. Three key learnings emerged:
1- Not all AI providers could streamline assessments
We had assumed that both CAD companies would provide CAD scores that would support triage into “Normal or Pure TB” versus “Other”, streamlining the assessment by human readers. In fact, only one vendor’s system did this; the second only distinguished normal from abnormal.
2 — Operational procedures for triage may take time to set up successfully
We assumed that all normal and pure TB films would be triaged to a 2-person panel and all others to a 4-person panel. In fact, this did not happen, perhaps because of the problem above, compounded by communication difficulties between the team conducting the evaluation and the personnel actually sorting the films into piles to be read.
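The intended triage rule can be sketched in a few lines. This is a minimal illustration only: the category labels (`normal`, `pure_tb`, `other`) and the function names are hypothetical, and the sample list of films is made up to show the arithmetic, not taken from the pilot.

```python
from collections import Counter

# Hypothetical category labels; the real CAD outputs may be named differently.
TWO_READER_CATEGORIES = {"normal", "pure_tb"}


def panel_size(cad_category: str) -> int:
    """Return the number of human readers for a film, given its CAD category."""
    return 2 if cad_category in TWO_READER_CATEGORIES else 4


def reader_workload(cad_categories):
    """Count films per panel size and the total number of individual reads."""
    films_per_panel = Counter(panel_size(c) for c in cad_categories)
    total_reads = sum(size * n for size, n in films_per_panel.items())
    return films_per_panel, total_reads


# Made-up example: five films, three of which qualify for the smaller panel.
films = ["normal", "pure_tb", "other", "normal", "other"]
per_panel, reads = reader_workload(films)
print(per_panel[2], per_panel[4], reads)
```

With these made-up inputs, triage needs 14 individual reads, compared with 20 if every film went to a 4-person panel. A vendor that only separates normal from abnormal cannot supply the `pure_tb` label, so under this rule its TB films would all fall through to the larger panel.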
3 — We need to iterate with different image / film quality
Thirdly, and most importantly, when the CAD results were compared to Medical Bureau of Occupational Disease (MBOD) decisions as the reference standard, they were substantially and significantly worse than what was observed in the trial of 330 in a previous sprint. This seems to be attributable to the poor quality of the films, as in this trial the CAD was run on films that had first been scanned and then digitized.
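A comparison against a reference standard of this kind is often summarized as sensitivity and specificity. The sketch below assumes that framing; the metrics actually used in the pilot are not stated here, the function name is hypothetical, and the sample values are invented purely to show the calculation.

```python
def sensitivity_specificity(cad_positive, reference_positive):
    """Compare CAD calls to reference-standard decisions, film by film.

    Both arguments are equal-length sequences of booleans, where True means
    "disease present". Returns (sensitivity, specificity); a value is None
    when its denominator is zero.
    """
    if len(cad_positive) != len(reference_positive):
        raise ValueError("inputs must have the same length")

    tp = fp = tn = fn = 0
    for cad, ref in zip(cad_positive, reference_positive):
        if cad and ref:
            tp += 1  # CAD and reference both positive
        elif cad and not ref:
            fp += 1  # CAD positive, reference negative
        elif not cad and ref:
            fn += 1  # CAD missed a reference-positive film
        else:
            tn += 1  # both negative

    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Invented example values, not pilot data.
cad = [True, True, False, False, True, False]
mbod = [True, False, False, True, True, False]
sens, spec = sensitivity_specificity(cad, mbod)
print(sens, spec)
```

Running the same calculation separately on natively digital films and on scanned-then-digitized films would be one way to check whether film quality explains the drop in performance.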
Additionally, three new avenues were pursued to improve the algorithm we have been developing for prioritizing individuals for assessment:
i) adding in the data from the two more recent field studies (in Alice and Bizana) to the data from Stilfontein already employed in the algorithm development;
ii) instead of a dichotomous yes/no for assessment, we have devised a “likely not”, “maybe” and “likely yes” approach that may work better as an output of the algorithm; and
iii) we tried again to add job risk information to improve accuracy, but this has not yet been helpful.
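The three-tier output in (ii) can be illustrated as two cut-offs applied to a prioritization score. This is a sketch under stated assumptions: it supposes the algorithm produces a score between 0 and 1, and the cut-off values 0.2 and 0.6 and the function name are placeholders, not values from the project.

```python
def assessment_tier(score: float, low: float = 0.2, high: float = 0.6) -> str:
    """Map a score in [0, 1] to one of three tiers.

    The default cut-offs are placeholders for illustration; real values
    would have to be chosen from the field-study data.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if not low < high:
        raise ValueError("low cut-off must be below high cut-off")

    if score < low:
        return "likely not"
    if score < high:
        return "maybe"
    return "likely yes"


for s in (0.05, 0.4, 0.9):
    print(s, assessment_tier(s))
```

Compared with a single yes/no threshold, the middle tier gives borderline cases an explicit label instead of forcing them to one side.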
To build on these learnings in our next sprint, we plan several new experiments:
1 — Study the quality of the new, better-controlled scanning process;
2 — See whether, after more training with additional chest x-rays, the AI company that could not distinguish between TB and silicosis is able to do so; and concurrently
3 — Study how efficiency could be improved by triage to a 2-person versus a 4-person panel, as previously planned.
The last two questions will be addressed in a study of 500 x-rays from field studies that have already been conducted in Stilfontein, Bizana and Alice. Results from the CAD have already been obtained. What remains are the adjudications from the MBOD committee.
Meanwhile, we are synthesizing what we have learned to date about the case for and against the use of AI in this context. The aim is to identify concerns that could argue against going forward with such innovation, as well as approaches for addressing them. We plan to submit this synthesis for peer-reviewed publication soon so that it can inform the wider field.