Viewing the world through a straw, Part 2
How lessons from computer vision applications in geo will impact bio image analysis
The previous post explored why automated image analysis (computer vision) has struggled to make the transition from everyday photographs to satellite imagery; in this post, we explore whether those lessons apply to the rapidly growing field of automated medical microscopy analysis. As it turns out, many of the challenges with satellite imagery (e.g., relative object size, number and variability of objects, size of images, lack of well-labeled data) also hold for biological imaging, specifically microscopy data. However, there are clear differences as well (e.g., market dynamics, higher thresholds for explainability, regulatory issues associated with diagnoses) that will directly impact the growth of automated medical image analysis. Despite these critical differences, biological imaging often looks far more similar to satellite imagery than to natural scenes. Look for yourself (Figure 1) — one of these things does not look like the other:
What is biological imaging?
Biological imaging visualizes organisms or component structures to understand biological processes and is used for basic research or medicine. Medical imaging is a subset of biological imaging that focuses on clinical uses of images to diagnose and monitor clinical conditions. The ability to image our internal body has been heralded as one of the most influential achievements in medicine in the past thousand years. Often, if we can see it then we can understand it, which has been a massive boon for advancing biological and medical understanding. Biological imaging includes a set of technologies that roughly fall into two categories: organ- or body-scale imaging (magnetic resonance imaging (MRI), computed tomography (CT), X-ray imaging, ultrasound, etc.) and micro-scale imaging (histopathology slides, fluorescence microscopy, diagnostic microscopy, etc.). Together, these methods are used in a range of medical disciplines — radiology, pathology, histology, dermatology, and ophthalmology, among others. For the purposes of this post, we will focus mainly on imagery derived from microscopy to highlight similarities with satellite imagery. The comparison to satellite imagery will be instructive and allow us to leverage IQT CosmiQ Works’ expertise on the opportunities and challenges of applying computer vision to overhead imagery (Figure 1).
How is biological imagery like satellite imagery?
For both microscopy and satellite imagery, simplistically, you are looking at the world through a straw. The straw (or rather lens) is either in orbit circling our planet or hovering a relatively short distance above a very small object sandwiched between two pieces of glass on a slide. The satellite or microscope allows us to see things that appear very small, either due to distance or actual size. Broadly, microscopy images are like satellite imagery in terms of the image size, number of objects, dataset size, and availability (or lack thereof) of high-quality image labels.
Scale and number of objects in microscopy and satellite images
Objects of interest are usually very small in both satellite images (e.g., buildings, cars, etc.) and microscopy images (e.g., cells and sub-cellular components). This contrasts sharply with everyday images you can find on the internet. We counted cells per image in the popular HeLa_S3 cellular imagery dataset (Figure 2) and found that they have approximately the same average object density as buildings in the SpaceNet overhead imagery dataset (see the previous blog post), and over 14x the object density of the COCO natural imagery dataset. The implication here is that similar approaches to handling object density may apply to both satellite imagery and microscopy.
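To make the density comparison concrete, object density is straightforward to compute from COCO-style annotations (a dict with "images" and "annotations" lists). The sketch below uses a toy in-memory annotation dict with made-up counts, not the actual HeLa_S3 or SpaceNet statistics:

```python
from collections import Counter

def mean_object_density(annotations):
    """Mean number of annotated objects per image for a COCO-style
    annotation dict ({"images": [...], "annotations": [...]})."""
    counts = Counter(a["image_id"] for a in annotations["annotations"])
    n_images = len(annotations["images"])
    return sum(counts.values()) / n_images if n_images else 0.0

# Toy example (illustrative numbers only): two images,
# one with 30 annotated objects and one with 10.
toy = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"image_id": 1}] * 30 + [{"image_id": 2}] * 10,
}
print(mean_object_density(toy))  # 20.0
```

Running the same computation over HeLa_S3, SpaceNet, and COCO annotation files is what yields the density comparison described above.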
Lack of well-labeled data
As with all AI applications, data remain the critical component for developing algorithms. For both types of imagery, there is a lack of high quality, open-source, easily accessible labeled data.
First, there is a shortage of publicly available images. Healthcare organizations have saved large amounts of medical imaging from past cases. Unfortunately, these images are not easily shared, made available, or accessible for use in research or developing algorithms, partially due to the challenge of curating datasets for machine learning. Satellite imagery datasets like SpaceNet provide an excellent template for well-curated, well-labeled datasets. The open source license, free download, and clear structure of SpaceNet datasets, along with the data science challenges that run in parallel to dataset releases, have helped enable more researchers to dig into computer vision applications to satellite imagery. Microscopy dataset producers could implement many of the principles learned there. While similar work is being done in biology by Medical ImageNet, a well-curated repository of radiology images, and others — see the resources section for other examples of biological imaging repositories for artificial intelligence applications — there remain challenges.
Just as was true five years ago for geospatial imagery, many microscopy datasets today are stored in esoteric and at times proprietary formats, with unclear cataloging of contents and inconsistent labeling techniques. The datasets are inconsistent in quality and quantity, with variable labels and at times small numbers of training examples. Licensing often prohibits or tightly restricts commercial use. For many datasets, access must be approved on a case-by-case basis, with a principal investigator (PI) and project clearly articulated. By contrast, common everyday imagery datasets can be freely downloaded from a website with the click of a button, with clear tutorials for acquisition. As a result, many ML researchers avoid delving into medical datasets because of the difficulty of acquisition. Making anonymized medical imagery datasets freely available for research, government, and commercial use, with standardized data formats and access protocols, would accelerate innovation. It could increase the number of students, small start-ups, and large commercial entities developing novel solutions to challenging medical problems. IQT CosmiQ Works observed such an increase in interest in geospatial computer vision applications after the release of its SpaceNet datasets.
Identifying and implementing means of sharing imagery for use in research and ML development would accelerate development of rapid, accurate means of analyzing microscopy data, and therefore should be a priority.
Second, there is a lack of well-labeled medical imaging due to the cost and time associated with expert labeling of medical imagery for AI. Highly trained individuals — biologically or medically trained experts — are needed to label biological imagery. Generally, medical imaging in key areas — radiology, pathology, histology, dermatology, ophthalmology — relies on expert, qualitative assessments of the imagery. The availability of trained individuals for labeling images will largely determine the usefulness of imagery repositories for developing algorithms that provide quantitative assessments. However, expert labeling is expensive, time-consuming, and often unavailable to dataset creators. In an attempt to overcome these barriers, many dataset creators turn to “crowd-sourcing,” or soliciting work on a project from a large number of people. Many crowd-sourced approaches happen online and asynchronously using a pool of non-experts. Crowd-sourcing has been used to develop geospatial maps, especially in times of crisis such as natural disasters. These approaches can be effective, though they still require substantial validation by experts, and attempts at crowd-sourcing satellite imagery labels without careful after-the-fact validation have resulted in low-quality datasets. This is likely to be even more true in the medical realm. Once again, applying crowd-sourcing approaches to medical imaging will be constrained by the availability of medically trained validators, though crowd-sourcing may reduce the time commitment required of these experts to generate high-quality data.
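One common way to extract reliable labels from a non-expert crowd is majority voting with an agreement threshold, routing low-agreement images to expert validators instead of asking experts to review everything. A minimal sketch — the threshold and label names are illustrative, not drawn from any particular labeling platform:

```python
from collections import Counter

def aggregate_labels(votes, agreement_threshold=0.8):
    """Majority-vote a list of crowd labels for one image.

    Returns (winning_label, confident), where confident is False when
    crowd agreement falls below the threshold and the image should be
    routed to an expert validator.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)
    return label, agreement >= agreement_threshold

# Four of five annotators agree: accept without expert review.
label, confident = aggregate_labels(["cell", "cell", "debris", "cell", "cell"])
print(label, confident)  # cell True
```

In this scheme experts only see the disputed tail of the dataset, which is how crowd-sourcing can reduce (but not eliminate) the expert time commitment described above.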
Hiring trained individuals to label is an alternative approach, though it is extremely expensive. SpaceNet uses this approach for its satellite imagery datasets at a cost of roughly $10,000–25,000 per city of building and road labels. This has resulted in a very complete, high-quality set of labels, but comes at substantially greater cost than crowd-sourcing solutions. DARPA used a similar approach for expert ultrasound data labeling to develop artificial intelligence models. Ultrasound represents a classic example where expert understanding is fundamentally important to generating high-quality image labels, which could easily translate to other medical image types. In short, expert labeling or validation may be essential to generate valuable microscopy datasets, though the inherent costs are daunting.
Technology approaches are being developed for automating data labeling in other areas. Whether these approaches could be used successfully on biological imaging would need to be tested. Regardless of the labeling method, quality labeled data are essential for computer vision algorithm development. Options and technologies for rapidly labeling imagery would significantly advance the field and should be a focus area for development.
Lack of easily accessible data
The accessibility of labeled datasets is another essential issue for algorithm development. At present, the academic and “open source” (publicly available code and methods, with no proprietary restrictions) research communities have outpaced many government and commercial solutions in domain-specific applications of computer vision. This was true in the early stages of geospatial computer vision research (and arguably remains true today in some areas), and providing open datasets like SpaceNet enabled those communities to rapidly improve analysis techniques. By analogy, providing easy-to-access, well-labeled, open source microscopy datasets may enhance critically important academic medical research. Furthermore, microscopy analysis companies have begun to leverage the open source computer vision community through Kaggle competitions, which rely on high-quality open source datasets. However, as discussed in the privacy section below, sharing healthcare data can be problematic given privacy regulations.
The dearth of ML experts with medical imaging or biology backgrounds
Though the number of data scientists, ML engineers, and AI experts is rapidly growing, there remains a paucity of domain experts who can also provide machine learning expertise in some fields. This has been a major barrier for geospatial research, where one needs to understand geographic coordinate reference systems, satellite collection details, and other domain topics to effectively pursue computer vision research. The same is likely even more true in medical microscopy research, where an understanding of disease phenotypes and cellular physiology is critical to scoping and implementing a computer vision solution. Being “bilingual” — able to speak biotechnology AND machine learning — is rare and valuable. Daphne Koller, CEO of the ML/biology unicorn Insitro, describes her team’s bilingual abilities as its secret sauce. That secret sauce has paid off: in early 2019, Insitro raised a $100 million Series A round, giving it plenty of resources. The availability of talent in both areas — ML and biology — will dictate how quickly artificial intelligence applications are developed and applied in biology generally and in biological imaging specifically.
How is biological imaging NOT like geospatial imagery?
There are a few ways biological imaging is not like geospatial imagery, including data sharing restrictions due to privacy regulations, higher thresholds for explainability in medical imaging, and the size of the commercial markets driving development in the respective areas.
Sharing healthcare data is restricted by privacy regulations, and this applies to medical imaging as well. Most ML research on medical imaging has been based on data from single institutions because privacy regulations limit the ability to share medical data. The limited quantity and diversity of data available at any single institution constrain the generality of the models developed this way. Finding creative approaches to share medical data among institutions would help move the field forward. Federated learning methods — where models and weights are shared among institutions, but data are not — may be one way to overcome the data stovepipes created by privacy regulations. Additional regulatory and technology approaches should be explored and tested.
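The simplest federated learning scheme, federated averaging, can be sketched in a few lines: each institution trains on its own private data and ships only model weights, and a central server averages those weights, typically weighted by local dataset size. The two "hospitals" and their one-parameter model below are hypothetical, purely to show the mechanics:

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """One round of federated averaging: combine locally trained weight
    vectors into a global model, weighting each site by its dataset size.
    No raw patient data ever leaves an institution."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Hypothetical weights from two hospitals after local training:
w_a = np.array([1.0])   # hospital A, 100 local images
w_b = np.array([3.0])   # hospital B, 300 local images
global_w = federated_average([w_a, w_b], site_sizes=[100, 300])
print(global_w)  # [2.5]
```

In practice the server would broadcast `global_w` back to the sites for another round of local training, but the privacy-preserving property is already visible here: only weights cross institutional boundaries.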
The “black box” concern — i.e., that it is impossible to understand how some complex AI models make their decisions — is more acute for AI in medicine than in geospatial imagery. For example, a computer vision algorithm could identify cancer on a pathology slide but not tell the pathologist why, or which features led to that result. This lack of understanding reduces confidence in results, particularly when models generate incorrect predictions, and will likely hamper the uptake of AI by clinicians and raise concerns among regulators. In contrast, AI applications that generate movie or book recommendations can tolerate both poor results and opaque reasoning. In medicine, however, if a model generates a result that cannot be understood by a human expert, fewer people will adopt the recommendation or the modeling approach, especially in such a highly regulated field. Explainable artificial intelligence (XAI) methods are being developed to highlight the sections or features of an image that were critical to a model’s result. For example, XAI approaches would enable a computer vision algorithm to identify cancer on a pathology slide AND demonstrate the features of the slide that drove that determination. Advances in XAI will be particularly useful for medical applications.
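One simple, model-agnostic technique in this family is occlusion sensitivity: slide a patch over the image, re-run the model on each occluded copy, and record how much the prediction score drops — regions whose occlusion hurts the score most are the regions the model relied on. A minimal sketch, with a toy "model" standing in for a real classifier:

```python
import numpy as np

def occlusion_map(image, predict, patch=8):
    """Crude occlusion-sensitivity map for a 2-D grayscale image.

    Slides a mean-valued patch over the image and records, per patch
    position, how much the model's score drops relative to baseline.
    High values mark regions the model depended on."""
    base = predict(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    fill = image.mean()
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = base - predict(occluded)
    return heat

# Toy "model" that only scores brightness of the top-left corner,
# and a toy image that is bright exactly there:
predict = lambda img: float(img[:8, :8].mean())
img = np.zeros((16, 16))
img[:8, :8] = 1.0
heat = occlusion_map(img, predict)  # only heat[0, 0] is nonzero
```

A real pathology workflow would substitute a trained classifier for `predict` and overlay `heat` on the slide, but the principle — show the clinician *where* the evidence is — is the same.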
The global medical imaging market is over $30 billion today and is estimated to grow at about 5.1% annually over the next five years. Computer vision can enhance the diagnosis of a wide variety of diseases and will impact the broader healthcare industry, which in the U.S. was estimated at over $3 trillion in 2018. This is markedly larger than the global satellite imagery and remote sensing market, which IQT CosmiQ Works’ internal market analyses place at about $3 billion with roughly 8.4% annual growth. Given the respective market sizes, one would naively expect more resources to be available for applying ML to medical imaging than to geospatial imagery. The size of the markets also makes it more likely that big players in the AI space, such as Google, Microsoft, Facebook, and Amazon, will continue to make substantial investments in healthcare.
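For reference, the cited growth rates compound as follows — a back-of-envelope projection using the figures above, not an independent forecast:

```python
def project(size_billion, annual_growth, years):
    """Compound annual growth: size * (1 + rate)^years."""
    return size_billion * (1 + annual_growth) ** years

# Figures cited in the text: $30B medical imaging at 5.1%/yr,
# ~$3B satellite imagery / remote sensing at 8.4%/yr.
medical = project(30, 0.051, 5)     # ≈ $38.5B in five years
geospatial = project(3, 0.084, 5)   # ≈ $4.5B in five years
print(round(medical, 1), round(geospatial, 1))
```

Even with its faster percentage growth, the geospatial market stays roughly an order of magnitude smaller in absolute terms, which is the resource asymmetry noted above.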
Indeed, the relatively limited investments some of these groups have made in geospatial have had large impacts on the community, highlighting how powerful such efforts can be.
What can AI on satellite imagery teach us about advancing AI in biological imaging?
Overall, biological imaging is similar enough to satellite imagery that lessons from geospatial computer vision can help us anticipate the challenges and opportunities of applying computer vision to biological imagery. Medical microscopy analysis faces many of the same barriers as geospatial analytics: a lack of well-labeled datasets for model development, divergence from everyday photographs that limits the usefulness of transfer learning, and substantial domain expertise requirements. The additional regulatory hurdles that AI for medical applications must overcome will further slow development; on the flip side, the commercial value of medical AI methods will encourage entrepreneurs, medical device makers, and pharmaceutical companies to invest in developing those technologies.
Resources
2. Medical ImageNet — https://aimi.stanford.edu/research/medical-imagenet
3. NIH Clinical Center publicly released 32,000 CT images — DeepLesion — https://www.nih.gov/news-events/news-releases/nih-clinical-center-releases-dataset-32000-ct-images; https://nihcc.app.box.com/v/DeepLesion
4. NIH Clinical Center publicly released chest X-ray dataset: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community; https://nihcc.app.box.com/v/ChestXray-NIHCC
5. The Cancer Imaging Archive — https://www.cancerimagingarchive.net