Exploration of the SIL Model and the Shuffle Instance Strategy

Bahjat
11 min read · Dec 10, 2023


In two previous blogs, we presented the main work and the experimental results of the shuffle instance (SI) strategy, respectively. Here, we describe in detail how the model and the shuffling strategy work.

I am currently a postdoctoral researcher at ETH Zürich, Switzerland, co-supervised by Prof. Radu Timofte. I mainly focus on medical image research, especially pathological image analysis.

If you are interested in this work, feel free to contact us to collaborate on research together.

Exploration of model and shuffling strategy

Here, we employed the pyramid-group datasets as test samples and designed a series of ablation studies. The primary objectives are to optimize the model design with the SI strategy, explore the optimal shuffling settings, and identify the most effective approach for model learning.

Previously published SI-ViT paper.

Introduction of shuffling

Inspired by the pathological instance pyramid, we introduced the SI strategy to better establish the model's understanding of instances and instance relationships, ultimately leading to performance improvements.

The original model (ViT) is equipped with a single classification head for categorization. In the SILM (RS), we extended the ViT architecture by incorporating an additional regression (REG) head, resulting in the REG-ViT model. In REG-ViT, the REG head regresses the soft labels of the input images, while the classification head continues to perform category classification. By incorporating the REG head, the model gains the ability to leverage patch-level supervision and obtain additional information: the REG head is connected to the patch tokens within the ViT structure, enabling regression on the soft labels associated with the input images, while category classification is still performed by a multi-layer perceptron (MLP) connected to the classification token, following the standard ViT design. By introducing supervision on the patches through the REG head, our experiments demonstrate a slight but significant average increase of 1.26% in accuracy and 1.14% in F1-score across the four datasets, validating the value of enhancing instance token representations within the model.
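To make the architecture concrete, here is a minimal PyTorch sketch of a ViT extended with a REG head. The class and parameter names (`RegViT`, `backbone`, and so on) are illustrative assumptions rather than the paper's exact implementation, and it assumes a backbone that returns the full token sequence (class token plus patch tokens).

```python
import torch
import torch.nn as nn

class RegViT(nn.Module):
    """Minimal sketch of a ViT with an added regression (REG) head.

    The classification head follows standard ViT practice (a linear/MLP
    layer on the class token); the REG head maps each patch token to a
    soft-label vector, providing patch-level supervision. Names are
    illustrative, not the authors' exact implementation.
    """

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # assumed to return all tokens: (B, 1 + N, D)
        self.cls_head = nn.Linear(embed_dim, num_classes)  # on the class token
        self.reg_head = nn.Linear(embed_dim, num_classes)  # on the patch tokens

    def forward(self, x: torch.Tensor):
        tokens = self.backbone(x)                 # (B, 1 + N, D)
        cls_token, patch_tokens = tokens[:, 0], tokens[:, 1:]
        logits = self.cls_head(cls_token)         # category prediction
        soft_preds = self.reg_head(patch_tokens)  # per-patch soft labels
        return logits, soft_preds
```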

To further explore the potential benefits of the SI strategy, we introduced a shuffle step, which aims to create novel distributions and instance relationships within the input image. This led to the development of the SF-ViT model, where the REG head regresses the soft labels of the shuffled images and the classification head performs category classification on the un-shuffled images. While SF-ViT showed improved performance compared to the original ViT, its performance fluctuated in comparison to REG-ViT. This fluctuation may be attributed to the lack of unified supervision during both the shuffle and un-shuffle steps.

To overcome this limitation, we propose the SILM. In SILM, we leverage the REG head for patch-level supervision on both shuffled and un-shuffled images, while the classification head remains active for category prediction on un-shuffled images. This comprehensive approach yields optimal results, achieving an average accuracy of 96.30% and an average F1-score of 96.70%. By involving the REG head in both steps, SILM enables the instance tokens, in addition to the classification token, to guide the model in simultaneously learning instances and instance relationships. Moreover, our findings highlight the ability of the model to effectively learn from these artificially created images, thereby emphasizing the importance of better information modeling for solving more challenging classification tasks.
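The following is a hedged sketch of what one SILM training step could look like under this scheme, assuming the `RegViT` interface above and a `shuffle_fn` that returns a shuffled batch together with per-patch soft-label targets; the equal loss weighting is a placeholder, not a reported setting.

```python
import torch
import torch.nn.functional as F

def silm_training_step(model, images, labels, shuffle_fn, num_classes):
    """Sketch of one SILM step: the REG head is supervised on both the
    shuffled and un-shuffled images, while the classification head is
    supervised only on the un-shuffled images."""
    # Un-shuffled pass: classification plus patch-level regression.
    logits, soft_preds = model(images)                       # soft_preds: (B, N, C)
    hard_soft = F.one_hot(labels, num_classes).float()       # (B, C)
    # Assumption: every patch of a single-class image shares its image label.
    patch_targets = hard_soft[:, None, :].expand_as(soft_preds)
    loss_cls = F.cross_entropy(logits, labels)
    loss_reg = F.mse_loss(soft_preds, patch_targets)

    # Shuffled pass: only the REG head is supervised.
    shuffled, shuffled_targets = shuffle_fn(images, labels)  # targets: (B, N, C)
    _, shuffled_soft_preds = model(shuffled)
    loss_shuffle = F.mse_loss(shuffled_soft_preds, shuffled_targets)

    return loss_cls + loss_reg + loss_shuffle
```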

Instance-level explorations

We assume that the performance gains of SILM stem from the shuffle instance procedure, which includes patch splitting, shuffling, and regrouping. Since instances are contained in patches, patch size is crucial for retaining instance information and constructing relationships. In general, a larger patch size contains more biological information but introduces fewer instance relationships, whereas a smaller patch size increases instance relationships but may sacrifice some internal details. Different patch sizes therefore represent different patterns and clearly affect the effectiveness of the SI strategy and the final model performance.

To investigate this, we tested SILMs with different patch sizes on the pyramid-group datasets. Model performance varied significantly with patch size, and the optimal results occurred at different patch sizes depending on the dataset category. When the patch size was set to 64, 48, 48, and 64, the best performance and the most precise model attention areas were observed on WBC, ROSE, pRCC, and CAM16, respectively. Notably, the model achieved its best classification performance when the patch size was close to the minimal instance size in each dataset, meaning that each patch could contain a single instance almost perfectly. For instance, a single cell can be precisely contained in a patch when the patch size is set to 32 for ROSE, and the patch size roughly matches the size of a tissue structure composed of cells for pRCC.

To illustrate the effect of different patch sizes on model learning, we also used Grad-CAM to visualize attention regions on the shuffled images. As shown below, when the patch size is set between 16 and 128, the model can reasonably identify the target area. However, when the patch size is relatively large, some critical local regions may be ignored. Conversely, a smaller patch size allows the model to focus on all cell regions, despite a higher possibility of errors. An appropriate patch size helps SILM better comprehend the information inherent in pathological images, resulting in more accurate attention regions and better classification performance.
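As a concrete illustration of the split–shuffle–regroup procedure, below is a minimal PyTorch sketch of a per-image patch shuffle. The function name and the assumption that image sides are divisible by the patch size are mine, not the paper's; for simplicity, one permutation is shared across the batch.

```python
import torch

def shuffle_patches(images: torch.Tensor, patch_size: int,
                    generator=None) -> torch.Tensor:
    """Split each image into square patches, permute them, and regroup.

    A minimal sketch: patches stay within their own image. Assumes H and
    W are divisible by `patch_size`.
    """
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    # (B, C, H, W) -> (B, ph*pw, C, P, P)
    patches = (images
               .unfold(2, patch_size, patch_size)
               .unfold(3, patch_size, patch_size)
               .permute(0, 2, 3, 1, 4, 5)
               .reshape(b, ph * pw, c, patch_size, patch_size))
    perm = torch.randperm(ph * pw, generator=generator)
    patches = patches[:, perm]  # same permutation applied to every image
    # Regroup the permuted patches back into full images.
    patches = patches.reshape(b, ph, pw, c, patch_size, patch_size)
    return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```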

As variations in patch size affect model performance and attention maps, we were also interested in whether the same class of pathological data at different scales exhibits differences when the patch size is held constant. For this purpose, we chose the Colonoscopy dataset as our experimental sample, as it covers 3 distinct scales: cell, cell-cluster, and local-tissue levels. The best performance occurred at patch sizes of 48, 32, and 48, respectively, for these 3 types of samples. These results are highly consistent with the previous finding that the optimal patch size is similar for datasets within the same level, thereby validating our hypothesis regarding the relationship between patch size and instance. Additionally, there is an obvious drop in accuracy and F1-score from the low image level to the high. We selected two patch sizes, 32 and 48, and used Grad-CAM to investigate the reasons. At the same patch size, cropped patches at the 3 scales contain different amounts of information: some pathological structures can be preserved in a patch from Colonoscopy (Low), while only some coloration features can be retained from Colonoscopy (High). Discrepancies in the information contained in a patch affect model behavior, whether they arise from different patch sizes within the same dataset or from different image levels sharing the same pathological category and patch size.

Puzzle-level explorations

Since instance information is determined by the shuffled patches, instance relationships are contingent on the shuffle strategy. Building the shuffle puzzle with different methods engenders distinct instance relationships, yielding different quantities and degrees of information for the model. Puzzle-level explorations are therefore valuable for identifying the most effective data augmentation strategy.

As instances are fed into the model in the form of puzzles, the most straightforward way to construct a puzzle is to partition a single image into patches and shuffle them. While this in-graph shuffle strategy cannot introduce instances from other images, it does facilitate a reorganization of their original information. The CAM results show that SILM can accurately locate the instance regions within the shuffled image. We evaluated this shuffle method and found that it improved the accuracy of the model by an average of 0.56% and the F1-score by 0.82% compared to a baseline without shuffling. These findings illustrate that, when label consistency is maintained and in-graph relationships are increased, the model can better learn the crucial information pertaining to each instance.

In-class shuffle, which entails shuffling patches from different images within the same category, further amplifies both in-graph and in-class relationships. The model must thoroughly observe patches from different images, and it evidently achieves accurate recognition of instances, which results in higher benchmark scores across all 4 datasets. This suggests that introducing this novel in-class relationship helps the model better recognize the similarities and differences among different images within the same category.

In addition to in-graph and in-class relationships, the cross-class relationship can be increased by artificially creating multi-class shuffled images with annotation information provided by soft labels. This cross-class shuffling strategy achieved the highest results on all 4 datasets, making it the most effective way to help the model recognize instances and distinguish pathological images of different categories. A sketch of this variant is given below.
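The following sketch illustrates the cross-class (batch-level) variant: patches are permuted across the whole batch, and each patch inherits a soft label from its source image, which can then serve as the REG target. Shapes and names are assumptions for illustration, matching the splitter sketch above.

```python
import torch
import torch.nn.functional as F

def cross_class_shuffle(patches, labels, num_classes):
    """Sketch of cross-class instance shuffling.

    `patches` has shape (B, N, C, P, P), e.g. from a splitter like the
    one above; patches are permuted across the whole batch, so a
    regrouped puzzle can mix several categories. Each patch keeps the
    one-hot soft label of its source image as its regression target.
    """
    b, n = patches.shape[:2]
    flat = patches.reshape(b * n, *patches.shape[2:])
    # One-hot soft label per patch, inherited from the source image.
    soft = F.one_hot(labels, num_classes).float()  # (B, C)
    soft = soft[:, None, :].expand(b, n, num_classes).reshape(b * n, num_classes)
    perm = torch.randperm(b * n)                   # mix patches across images
    flat, soft = flat[perm], soft[perm]
    return (flat.reshape(b, n, *patches.shape[2:]),
            soft.reshape(b, n, num_classes))
```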

However, more relationships do not necessarily equate to better performance. In the in-place shuffle strategy, certain patches of an image are retained in their original positions, while the others are exchanged. We tested two variations of this strategy, named group-shuffle and split-shuffle: the former exchanges patches within a group of only two images, while the latter shuffles patches across all images in a batch. When the cross-class relationship becomes too complex, i.e., the puzzle contains too many instances, the model may become biased towards individual instances or regions, compromising its ability to model the overarching interdependencies among samples. The ability of SILM to model instance relationships may then be impeded or even lost, which helps explain why the split-shuffle strategy recorded lower results than the group-shuffle approach.

Additionally, the fixed position ratio (FPR) of the shuffle process is also crucial in puzzle construction. A higher FPR means that more patches are held fixed during shuffling, resulting in less information confusion and a simpler training task. We varied the FPR for SILM within a range of 0.5–0.9 and found that the best performance occurred at a different FPR on each of the 4 datasets, providing empirical support for our hypothesis regarding the effectiveness of FPR variations. A sketch of in-place shuffling with an FPR follows.
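Below is a hedged sketch of in-place shuffling with an FPR: a fraction of spatial positions is kept fixed, and at the remaining positions patches are exchanged across the images in a group, preserving the spatial layout. The sampling rule is an assumption; the blog only specifies the FPR range of 0.5–0.9.

```python
import torch

def in_place_shuffle(patches, fpr: float = 0.7):
    """Sketch of in-place shuffling with a fixed position ratio (FPR).

    For each spatial position, with probability `fpr` the patch of every
    image is kept in place; otherwise the patches at that position are
    permuted across the images in the group, so spatial layout is
    preserved. `patches` has shape (G, N, C, P, P), where G is the group
    size (2 for group-shuffle, the full batch for split-shuffle).
    """
    g, n = patches.shape[:2]
    out = patches.clone()
    moved = torch.rand(n) >= fpr                 # positions to exchange
    for pos in moved.nonzero(as_tuple=True)[0]:
        perm = torch.randperm(g)                 # reassign across the group
        out[:, pos] = patches[perm, pos]
    return out
```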

Generator explorations

We have carefully explored the optimal way to convey instance information to the model and identified the most effective way to increase instance relationships. Finally, we explored the generator settings that help the model learn better from the shuffled images.

Based on the former ablation studies on patch size and FPR, a patch size scheduler and an FPR scheduler were further designed to improve model performance. They allow SILM to perceive input images shuffled at all kinds of patch sizes and FPRs and to learn fully from them. Guided by curriculum learning, the patch size and FPR were initially set high and were gradually reduced over the training procedure, so that new information and relationships increase over time. A loss-driven strategy was adopted to control the learning process and adjust the schedules; it performed better than other dynamic adjustment strategies, including reverse, linear, random, and loop schedules. According to the results, this dynamic adaptive strategy outperforms the best fixed patch size and FPR settings on both accuracy and F1-score benchmarks, showing the contribution of different feature scales to better information modeling. SILM can fully exploit the information in the image through this process, ultimately achieving optimal performance. A sketch of such a loss-driven schedule is given below.
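As an illustration of a loss-driven schedule, the sketch below advances to a smaller patch size and lower FPR whenever the training loss plateaus. The specific step rule, tolerance, and value lists are assumptions; the blog states only that the loss-driven strategy outperformed reverse, linear, random, and loop schedules.

```python
class LossDrivenScheduler:
    """Sketch of a loss-driven curriculum over patch size and FPR.

    Starting from a coarse, easy setting (large patch size, high FPR),
    the schedule advances to the next harder setting whenever the
    training loss stops improving by more than `tol`.
    """

    def __init__(self, patch_sizes=(128, 96, 64, 48, 32, 16),
                 fprs=(0.9, 0.8, 0.7, 0.6, 0.5), tol=1e-3):
        self.patch_sizes, self.fprs, self.tol = patch_sizes, fprs, tol
        self.stage, self.best_loss = 0, float("inf")

    def current(self):
        # Clamp each list independently once its stages are exhausted.
        p = self.patch_sizes[min(self.stage, len(self.patch_sizes) - 1)]
        f = self.fprs[min(self.stage, len(self.fprs) - 1)]
        return p, f

    def step(self, epoch_loss: float):
        if epoch_loss < self.best_loss - self.tol:
            self.best_loss = epoch_loss   # still improving: keep the setting
        else:
            self.stage += 1               # plateau: make the task harder
            self.best_loss = float("inf")
        return self.current()
```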

Additionally, we modified the patch size list used in model training and investigated its impact. Partial lists were selected from the complete patch size list (16/32/48/64/96/128/196) and divided into an even group (16/48/96/128) and an odd group (16/32/64/128). Whether trained on the even group or the odd group, the models showed a decline in performance compared with the full patch size list, as summarized below. This reveals that distinct patch sizes carry diverse modalities of instance information, which aligns with the findings of the instance-level ablation studies. The model learns effectively from all selected patch sizes during dynamic adaptation, resulting in optimal performance on the complete list. All these generator studies illustrate that SILM models relationships better when it processes more comprehensive feature scales.

Summary

In this study, we propose the shuffle instance learning model (SILM), which provides a new way to extend large models to pathological image analysis. The trend of enhancing model performance with more data and larger networks has become increasingly apparent in deep learning since the introduction of the Transformer. ViT retains the scalability of the Transformer to large models, making it possible to use larger models in the visual domain. However, the success achieved by large models on natural images is difficult to replicate directly on pathological images. The unique characteristics of pathological images make them challenging to collect and annotate in large quantities, resulting in a scarcity of well-annotated, large-scale public datasets. Furthermore, the preparation and imaging processes of pathological slides greatly influence the final images, limiting the generalizability of models. Therefore, this study aims to lower the threshold for significant improvements in the pathological image domain while preserving the large-scale expansion flexibility of ViT and the ability to transfer knowledge directly from the natural image domain, enabling better organization of and learning from limited datasets.

To achieve this, we conducted a detailed analysis of the diagnostic process of pathologists, integrating the unique features of pathological images with the neural network modeling process and proposing the concept of the instance pyramid. The instance pyramid treats the feature collections of different pathological images at specific spatial scales as instances, thereby summarizing the common characteristics of pathological images at different scales. Based on these common features, we propose a shuffle instance strategy that increases instance relationships to change the original data distribution, allowing for better organization and learning of the original information. To implement this strategy and further study its influencing factors, we designed two puzzle generators corresponding to two different shuffle methods. The puzzle generator is a plug-and-play module that retains the basic structure of ViT. This design maximizes the strong transferability and scalability of ViT, making it possible to transfer knowledge from the natural image domain and to expand to other pathological image-related tasks.

To thoroughly validate our hypotheses and explore how increasing instance relationships enhances modeling performance, we conducted experiments on 11 datasets with different scales and tasks. Compared with other models and data augmentation methods, our model achieved the best and most stable performance, and the Grad-CAM results also showed the most accurate attention areas.

Experiments related to instances explored the concept in depth. Testing a trained model on a dataset with obscured instances revealed the actual attention areas of the model in the CAM results, indicating that the model's understanding of categories is built upon the instance modeling process. A series of experiments on patch size explored the instance scale in different datasets, and the CAM of the model on shuffled images showed that the model established a perception of the instances themselves for each category, rather than relying on interfering features such as background information.

The study of instance relationships was conducted using different shuffling methods. Corresponding to in-graph, in-class, and cross-class instance relationships, we used in-graph, in-class, and cross-class shuffling methods to gradually increase the instance relationships in the input images. The experimental results showed that, on four datasets with different scales, the model's capability to represent categories was enhanced as the types of instance relationships increased, reaching its best performance when all three instance relationships were added.

Beyond instances and instance relationships, we conducted further experiments on how to increase instance relationships more effectively and on how the SI strategy works. The main difference between random shuffle and in-place shuffle is that in-place shuffle preserves the spatial relationships between instances, which is a more intuitive shuffling method; the results showed that in-place shuffle performs better than random shuffle in most cases. The design of the network structure and the parameters of the puzzle generator also influence the process of increasing instance relationships, with related experiments discussed in the Results section.

More work can be done around the concepts and models we have proposed. The instance pyramid and instance relationships provide a new way to study the patterns of pathological images. We designed two types of puzzle generators, utilizing random shuffle and in-place shuffle, but more reasonable and efficient shuffling methods may exist that could optimize and improve the puzzle generator. Because the puzzle generator is a plug-and-play module that retains the basic structure of the backbone, it has great potential for transfer learning and for other pathological image analysis tasks, such as image segmentation and object detection. Moreover, our module targets common issues in pathological images while preserving the scalability of ViT, paving the way for the development of universal diagnostic models in computational pathology.
