Exploring the Horizon - Episode 2

Erik R. Ranschaert
10 min read · Jan 6, 2019

The Quest for Medical Data

Those interested in the 35-minute interview with Prof. Andreas Maier are invited to listen to the podcast, Episode 2 of “Exploring the Horizon”.

As you probably know, there is an ongoing quest for medical data and images to develop algorithms able to perform “simple” tasks such as image segmentation, detection of abnormalities and classification of findings, which is mostly the case in radiological applications. These tools are mainly based upon Deep Learning algorithms, for which the availability of curated and annotated medical data is an absolute prerequisite. You can read more about the challenges involved in collecting and curating data in my new book about the opportunities, applications and risks of Artificial Intelligence in Medical Imaging. Nevertheless, Artificial Intelligence (AI) is establishing itself as a new technique aiding radiologists in analysing medical images and finding the right diagnosis. It could even be seen as the new Formula E for radiologists, in which the algorithms are the AI co-pilots guiding the radiologists driving their cars. However, many cars are still stationary, waiting to be filled with fuel. The algorithm or racing machine might be designed already, but before the AI tool can be brought into the everyday workflow, lots of data are needed to train it and to get it into pole position. “To classify features in a CT scan, for example, you could need anywhere from 100 images to millions of images,” Keith Dreyer said. “That means you need a lot of storage and a lot of processing power at your disposal.”

Most AI applications in healthcare are based on supervised learning, so the data for fuelling these Radiology Formula E cars first need to be prepared (curated and annotated).

Picture from Techonomy

In this ongoing quest for “fuel”, several methods are being deployed to obtain it. In academic institutions the researchers developing DL algorithms can usually fall back on their own data, at least to start with. However, as soon as a research project makes the transition to a more commercial version or a so-called spin-off startup company, contracts have to be made with external partners to obtain more data, needed to develop new algorithms or to improve and upgrade the existing products. Hospitals make contracts not only with startups but also with big vendors such as Siemens, GE, Philips and others, which are also eager to develop new AI tools to make their portfolios more appealing. When the demand for fuel rises, its price also increases, so high sums are often paid by these vendors to obtain such data. Since many startups and academic institutions are not able to invest amounts of money similar to those of the big vendors, alternative energy sources are gradually popping up.

An example of such a more disruptive type of initiative for obtaining data is Medical Data Donors. Platforms such as this are trying to obtain medical data directly from patients. The concept, however, goes further than collecting data alone: the platform also wants to curate and annotate the data, i.e. prepare them for use in research. The idea for the Medical Data Donors concept comes from Professor Andreas Maier, a computer scientist with a background in medical imaging. He is currently the head of the Medical Reconstruction Group at the Pattern Recognition Lab of the Friedrich-Alexander-Universität in Erlangen, Germany. This group performs research in the field of automated processing of medical images, such as image reconstruction, image analysis and interpretation, image fusion, motion compensation etc.

Collecting data

From his experience with the development of speech recognition tools, Andreas Maier is well aware of the enormous amount of data needed to develop highly accurate speech recognition (SR) software. Thanks to the recent development of Deep Learning (DL) methods and the enormous increase in processing power, it became possible to use much more data and thus to increase the accuracy of SR software. By using up to 1 million hours of speech it was possible to increase the accuracy of SR from 96% to 99.7%. Similarly, for deep learning to be successful in radiology, it will be necessary to collect and annotate potentially thousands of cases representing a range of pathologies.

Prof. Andreas Maier

However, as a consequence of using such an enormous amount of data, a lot of working hours need to be budgeted, since the annotation of data is a manual process requiring a significant financial investment. In other words, to develop DL algorithms, not only must data be purchased, but that data must also be curated and annotated. “Deep learning requires deep work,” as stated by Carolina Lugo-Fagundo in a JACR opinion paper. The algorithms developed using deep learning are only as good as the training sets used to train the system; the networks need to be trained in a supervised way. In addition, those willing to develop algorithms need access to substantial processing power. The budgets of academic centers willing to do research in AI are usually less generous than those available to the big players in the league, such as DeepMind (Google) and Amazon.

The Medical Data Donors platform was founded in January 2018 as a non-profit organization, with the idea of asking patients to donate their data for research purposes. Maier is convinced that the general public finds non-profit organizations more trustworthy, because they focus on research rather than on directly implementing profit strategies. The goal of this platform is to set the stage for developing DL-based tools in medical diagnosis, which includes collecting data, annotating them, and making them available to non-commercial research groups. The project is currently based upon a collaboration between academia (the University of Erlangen-Nürnberg and the German Cancer Research Center DKFZ) and a private partner called Telepaxx, a German company specialized in secured cloud-based archiving of medical data. The fully encrypted data will remain stored on their servers. For collecting the data, however, an analogue solution with envelopes is preferred. Maier believes that too many patients undergoing the examinations are insufficiently acquainted with computer systems, so inviting them to donate and upload their data through a web-based portal might be too cumbersome. Patients can send their CD/DVD by using specially designed “envelopes” provided by the platform. Each envelope has a unique transaction number and includes an informed consent form that the patient needs to sign. The patient is advised to keep his/her transaction number for accessing or deleting the data (withdrawing consent), which is possible through a web interface provided by Telepaxx. Creating a web-based platform to upload the data is the next step, which will also be supported by the private partner. Currently patients are not reimbursed for donating their data.

Although technically it’s possible to obtain data from patients in any country, at this moment the legal issues have only been clarified for German citizens. Extensive legal analysis demonstrated that compliance with the GDPR (which protects the rights of the patient, the data subject) is not the only issue to be dealt with; ownership and copyright relating to the hospitals and (radiological) practices making the medical images must also be provided for. Two agreements are always necessary: one with the patient and one with the radiology department. Therefore the success of the platform strongly depends on collaboration and partnership with radiology practices. Maier is also actively looking for support from the radiological community. Material such as posters for the waiting room, donation boxes and envelopes is available for promoting the project. Radiologists are also invited to participate as annotators, which is another essential part of the project. Their engagement is encouraged by providing them with access to algorithms that will be developed with the data.

Annotation and gamification

The modality by which the images are made (CT, MRI or other) is not an issue for Maier; anonymization and annotation of the images are, however, key. Anonymization is needed for making the images accessible to third parties, such as the volunteers willing to annotate the data, the so-called “annotators”. Voluntary annotators are definitely needed, and they are invited to join the platform through the website. The plan is to make the annotated data available to everybody eager to use them for research purposes, mainly the universities and startups not able to invest large amounts of money in obtaining those data. “Data is the currency of the new century”, says Maier. Small and mid-size companies and universities should be able to get access to large amounts of data in an affordable way, so that they can contribute to state-of-the-art products and make them available to a wider public. To make it a bit more attractive, the annotation procedure is presented in a gamification model. A series of games is being developed, especially for organ segmentation and detection of anatomical landmarks.

Users are actually trained by progressing through different levels of annotation. Scores are provided, and leaderboards will be made available on which the best annotators are ranked. It will be possible to form groups that compete against each other, e.g. university A against university B. Social media will be integrated to share the results. As they go through the levels, users will be progressively confronted with more detailed anatomical landmarks, so whereas lay persons might be able to participate at the lower levels, medical students or radiologists will be needed for the higher levels.
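To make the idea of scores and leaderboards a bit more concrete, here is a minimal sketch of how such a ranking could work. It is purely illustrative: the interview does not describe how the platform actually computes scores. The sketch assumes that each annotation is a set of pixel coordinates, that an expert reference segmentation exists for each case, and that the Dice overlap is used as the score. The names `dice`, `leaderboard`, `alice` and `bob` are hypothetical.

```python
from collections import defaultdict


def dice(pred, ref):
    """Dice overlap between two sets of pixel coordinates (1.0 = identical)."""
    if not pred and not ref:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def leaderboard(submissions, reference):
    """Rank annotators by their mean Dice score, best first.

    submissions: iterable of (annotator, case_id, mask) tuples
    reference:   dict mapping case_id to the expert reference mask
    Both structures are assumptions made for this sketch.
    """
    scores = defaultdict(list)
    for annotator, case_id, mask in submissions:
        scores[annotator].append(dice(mask, reference[case_id]))
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy example: one case with a 2x2 reference region.
reference = {"case1": {(0, 0), (0, 1), (1, 0), (1, 1)}}
submissions = [
    ("alice", "case1", {(0, 0), (0, 1), (1, 0), (1, 1)}),  # full overlap
    ("bob", "case1", {(0, 0), (0, 1)}),                    # half the region
]
ranking = leaderboard(submissions, reference)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Group competitions (university A against university B) could be handled in the same way by averaging over group members instead of individual annotators.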

Competitors and future

Maier’s main focus is on developing algorithms for automated organ segmentation, landmark detection and identification of disease. For the future Maier is considering the option of asking patients to provide their ICD code(s), with the intention of getting access to validated diagnostic information. To automate the diagnostic process, access is needed to all codes of the same case, which can be a difficult task. A similar “competing” but larger initiative is the NIH DeepLesion dataset, which is currently publicly available. While most publicly available medical image datasets have fewer than a thousand lesions, this dataset has over 32,000 annotated lesions identified on CT images, coming from 4,400 unique patients. Previously the NIH released chest X-ray images from more than 30,000 people, including many with advanced lung disease. Although these are significant numbers, according to Maier the quality of data is still an issue. Data need to be of sufficient quality for developing high-quality and accurate algorithms, since the ground truth also determines the degree of accuracy of an algorithm; the veracity of the ground truth is critical because the deep learning algorithms can only be as truthful as this reference.

Andreas Maier hopes to be able to improve this, e.g. by providing labels for the cases that indicate the certainty or uncertainty of the diagnosis. Cases and images with a high-certainty label could also be used as prototype cases for educational purposes, e.g. of anatomical variations or rare diseases. Another well-known, very large project is the German National Cohort, in which not only images but also genomic data are collected. Genomic data are excluded from the Medical Data Donors project, since access to genomic data makes it easier to identify a person, which makes it impossible to anonymize the data for access by third parties, as would be needed to allow crowdsourcing.

Support

Supporting the project is possible by spreading the news and visiting the website. Membership is not a prerequisite to participate. Both individuals and institutions are able to register as members. A fan shop provides graphics that can be printed on T-shirts. For each purchase 20% is donated to the project, so “you can do good and simultaneously look good”, as Maier says. According to Prof. Maier this project might be a game changer, which can go far beyond the possibilities of working with only one or a limited number of hospitals. To make the project a success, the support of all stakeholders, including patients, hospitals, radiologists and many volunteers, will be needed, so awareness of the value of such projects will have to be increased.

Conclusion

To develop algorithms for radiology, an enormous amount of data is and will be needed. Both the quantity and the quality of the training set are critically important in the development of state-of-the-art deep learning. It is becoming clear that curating and annotating data is a topic that should get more attention, because it is crucial for the development of algorithms, now and in the future. In this context it is very likely that data collection and curation will become a new business in its own right.

The question, however, is whether the commercial branch will surpass the non-profit initiatives. In other words: will there be sufficient funds available for preparing the data, will it be possible to find a sufficient number of annotators (and if so, where), and what will be the best way to train and motivate them? Will it be necessary to adapt the training of radiology residents, and will the annotated data be available for educational purposes? What role should radiological scientific societies play in this chain? Should they develop standard processes and best practice guidelines for data segmentation and share these datasets so that more usable high-quality datasets become available? Will algorithms improve by themselves through application in clinical practice, much the same way voice recognition programs improve from their real-world deployment? Is the reimbursement of annotators an issue? Should patients be paid for submitting their data? Many questions still remain to be answered...

MD.ai-based image data collection “The Cancer Genome Atlas for Liver Hepatocellular Carcinoma”

In one of my next podcasts I will interview George Shih, who is one of the main drivers behind the dataset curation and annotation platform MD.ai. From this interview I hope to learn more about the role and value of crowdsourcing platforms such as Crowds Cure Cancer and Medical Data Donors compared with MD.ai and more commercial initiatives such as Embleema, in which patients are encouraged to submit their data for payment, in a blockchain model. I hope George will help me answer several of the remaining questions. More will follow soon, so stay tuned and subscribe to my newsletter!

Erik R. Ranschaert

Erik is a visionary radiologist, speaker and expert in the healthcare and imaging informatics arena. You can find him on www.erikranschaert.com