RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

Zilun Zhang
10 min read · Jan 3, 2024


Introduction

In this blog post, we introduce RS5M, GeoRSCLIP, and GeoRSSD.

RS5M is an image-text paired dataset for remote sensing (RS) containing 5 million RS images with English descriptions. It is built by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with pre-trained VLMs, and it constitutes the first large-scale RS image-text paired dataset.

GeoRSCLIP is a fine-tuned CLIP-like model. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by 3%~20% in Zero-shot Classification (ZSC) tasks, 3%~6% in Remote Sensing Cross-Modal Text–Image Retrieval (RSCTIR) and 4%~5% in Semantic Localization (SeLo) tasks.

GeoRSSD is a Stable Diffusion (2.1) model tuned with data from RS5M using DreamBooth (we provide versions tuned on 1% and 20% of the data). It achieves significantly improved FID scores compared with the vanilla SD and generates better RS imagery both qualitatively and quantitatively.

Overview of RS5M Construction

We constructed RS5M from two sources (see Figure 2).

  1. We gather 11 publicly available image-text paired datasets (PUB11) and filter them using RS-related keywords. We then deduplicate images using URLs and other tools, and use a pre-trained VLM and an RS image detector to remove non-RS images. (Filter Large-Scale Image-Text Paired Datasets)
  2. We utilize BLIP2 to generate captions for 3 large-scale RS datasets (RS3) that only have class-level labels. We apply a series of quality-assurance steps, including a self-supervised one, to obtain descriptive and suitable captions for RS images. Finally, we merge the results from both sources. (Caption Remote Sensing Image Datasets)

Filter Large-Scale Image-Text Paired Datasets

We have chosen 11 public large-scale English image-text paired datasets to build the PUB11 subset, including LAION2B-en, LAION400M, LAIONCOCO, COYO700M, CC3M, CC12M, YFCC15M, WIT, Redcaps, SBU, and Visual Genome. We collected 3 million image-text pairs with the steps below.

  1. We establish a set of keywords closely related to RS, consisting of two groups: RS-related nouns and RS-related applications & company names (see Appendix B2). We use regular expressions to identify image-text pairs whose text matches the keyword patterns, applying them to the label files of the datasets mentioned above.
  2. We download all relevant images from the internet.
  3. We utilize Fastdup for invalid-image checking and deduplication. We first filter out corrupted images and deduplicate based on URLs. Then, Fastdup is used to cluster duplicate images; we keep one image per cluster and discard the rest.
  4. We clean the dataset using a VLM and an RS image detector. First, we develop a set of n handcrafted RS-related text prompt templates t_j (refer to Appendix B3 for details). For each image x_i, we use the CNN-based CLIP-ConvNext-XXL model to compute the cosine similarity s_i between the average text feature f_t of the prompt templates and the image feature f_{image}(x_i).
  5. Then, we construct a classification dataset comprising two classes: RS images and non-RS images. Details on this classification dataset can be found in Appendix B4.
  6. We fine-tune a classifier on top of the ViTAE pre-trained model to serve as an RS image detector. We denote the probability that an image x_i is an RS image as c_i = P(c_{RS} | x_i).
  7. We filter the images in RS5M based on the joint score (s_i, c_i). We keep images with s_i > m and c_i > n, where m and n are thresholds chosen so that only image-text pairs within the top 90% of s_i scores and the top 80% of c_i scores are kept (a rough sketch of this filter is shown after the list). The resulting PUB11 subset contains both satellite-view and aerial-view images, with 3,007,809 image-text pairs in total.
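The sketch below illustrates steps 4-7 in spirit: score each image with the CLIP cosine similarity against averaged RS prompt features and with an RS-detector probability, then keep pairs above percentile-based thresholds. The backbone, the example prompts, the rs_detector_score() stub, the image paths, and the exact percentiles are assumptions for illustration; the actual pipeline uses CLIP-ConvNext-XXL and the fine-tuned ViTAE-based detector described above.

```python
# Hedged sketch of the joint-score filter (steps 4-7); not the exact pipeline settings.
import numpy as np
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any CLIP-style backbone works for the sketch; the pipeline uses CLIP-ConvNext-XXL.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Stand-ins for the Appendix B3 prompt templates.
rs_prompts = ["a satellite image", "an aerial photo of the ground"]

with torch.no_grad():
    text_feat = model.encode_text(tokenizer(rs_prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    f_t = text_feat.mean(dim=0, keepdim=True)            # averaged template feature
    f_t = f_t / f_t.norm(dim=-1, keepdim=True)

def clip_rs_score(path: str) -> float:
    """s_i: cosine similarity between the image feature and the averaged prompt feature."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        f_img = model.encode_image(image)
        f_img = f_img / f_img.norm(dim=-1, keepdim=True)
    return float(f_img @ f_t.T)

def rs_detector_score(path: str) -> float:
    """c_i = P(RS | x_i): placeholder for the fine-tuned ViTAE-based RS image detector."""
    return 1.0  # replace with the detector's predicted probability

paths = ["img_0001.jpg", "img_0002.jpg"]                  # hypothetical candidate images
s = np.array([clip_rs_score(p) for p in paths])
c = np.array([rs_detector_score(p) for p in paths])

# Keep pairs in the top 90% of s_i and the top 80% of c_i,
# i.e. above the 10th / 20th percentiles of the two scores.
m, n = np.percentile(s, 10), np.percentile(c, 20)
kept = [p for p, s_i, c_i in zip(paths, s, c) if s_i > m and c_i > n]
print(kept)
```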

Caption Remote Sensing Image Datasets

We employ the tuned BLIP2 model (tuning details can be found in Appendix B10) with the OPT 6.7B checkpoint in half precision from Huggingface for caption generation. We choose nucleus sampling as it generates more diverse captions (refer to Appendix B7). The selected datasets are BigEarthNet [32], FMoW [31], and MillionAID [29]. We use only the training sets of FMoW (727,144 images) and BigEarthNet (344,385 images), as some downstream tasks evaluate on their test sets. For the MillionAID dataset, we select the test set (990,848 images). In total, the RS3 subset contains 2,062,377 images.
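Before the ranking and selection steps below, here is a minimal sketch of the candidate-caption generation itself. It assumes the public Salesforce/blip2-opt-6.7b checkpoint (the actual model is further tuned as in Appendix B10) and a hypothetical image path.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", torch_dtype=torch.float16).to(device)

image = Image.open("example_rs_image.jpg").convert("RGB")     # hypothetical RS image
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)

# Nucleus sampling yields diverse candidates; 20 per image, as in step 1 below.
out = model.generate(**inputs, do_sample=True, top_p=0.9,
                     max_new_tokens=40, num_return_sequences=20)
captions = [c.strip() for c in processor.batch_decode(out, skip_special_tokens=True)]
```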

  1. We generate 20 candidate captions per image using the tuned BLIP2.
  2. We rank the 20 candidates with CLIP ViT-H/14 and keep the top 10.
  3. We re-rank these top 10 with CLIP ResNet-50x64 to obtain the top 5 captions.
  4. We enhance the dataset by converting meta information (geo-meta information, class labels, UTM, UTC, etc.) into readable sentences that become part of the image caption (more details in Appendix B9). This structured meta-caption, combined with the model-generated caption, offers a more comprehensive description.
  5. Rotation-invariant features are crucial in remote sensing: targets on the ground captured by satellites or drones, such as rivers, forests, and cultivated land, typically keep their shape, size, and color, but changes in shooting angle can rotate them in the image. We therefore aim to generate captions that accurately describe the image content regardless of the shooting angle, and design a rotation-invariant criterion for selecting high-quality captions. The criterion is as follows: for an image x with k candidate captions t^j from the previous steps, j in {1, …, k}, we augment the image by rotating it at 12 angles in 30-degree increments, yielding {x_n}, n in {1, …, 12}. We then pick the j whose caption t^j minimizes the variance of the cosine similarity between the text feature and the image features of the rotated views. In other words, regardless of the rotation angle, the matching score between the caption and the rotated images should be only negligibly affected (a sketch of this selection follows the list).
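A minimal sketch of this rotation-invariant selection, using an open_clip backbone for scoring (the ranking above uses ViT-H/14 and ResNet-50x64); the backbone choice and image path are assumptions.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def rotation_invariant_caption(image_path: str, candidates: list) -> str:
    """Pick the caption whose image-text similarity varies least over 12 rotated views."""
    image = Image.open(image_path).convert("RGB")
    # {x_n}: the image rotated in 30-degree increments.
    views = torch.stack(
        [preprocess(image.rotate(30 * n, expand=True)) for n in range(12)]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(views)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = model.encode_text(tokenizer(candidates).to(device))
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = img_feat @ txt_feat.T          # (12 rotations) x (k candidate captions)
    variances = sims.var(dim=0)           # variance across rotations, per caption
    return candidates[int(variances.argmin())]

# Hypothetical usage with the top-5 captions from the previous steps:
# best = rotation_invariant_caption("example_rs_image.jpg", top5_captions)
```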

Dataset Description

Figure 3 (left) shows the frequency statistics of keywords (listed in Appendix VII) appearing in the image captions. The phrase "aerial view" is predominant, resulting in a significant number of aerial-view remote sensing images in the RS5M dataset. The middle figure presents a word cloud of words extracted from the RS5M captions; all special characters and numbers, as well as most prepositions, have been removed. Frequently occurring words include "satellite", "field", "building", "road", and "farm". The right figure shows the distribution of caption lengths on a log scale: the distribution is long-tailed, with an average caption length of 49 words (maximum 3,248).

We then use CLIP's visual encoder (the ConvNext-XXL visual encoder from OpenCLIP's implementation) to extract image features from PUB11 and RS3 and visualize the results using PCA, sampling 1,000 images equally from PUB11 and RS3.
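A rough sketch of this feature extraction and PCA projection, assuming a smaller open_clip backbone (ViT-B-32 instead of ConvNext-XXL) and placeholder image paths.

```python
import numpy as np
import torch
import open_clip
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device)

def encode(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

pub11 = encode(["pub11_0001.jpg", "pub11_0002.jpg"])   # hypothetical PUB11 samples
rs3 = encode(["rs3_0001.jpg", "rs3_0002.jpg"])         # hypothetical RS3 samples

proj = PCA(n_components=2).fit_transform(np.concatenate([pub11, rs3]))
plt.scatter(proj[:len(pub11), 0], proj[:len(pub11), 1], s=5, label="PUB11")
plt.scatter(proj[len(pub11):, 0], proj[len(pub11):, 1], s=5, label="RS3")
plt.legend()
plt.show()
```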

Figure 4 (left) shows a clear domain difference between PUB11 and RS3, possibly due to the large amount of aerial images in PUB11 and satellite images in RS3. Figure 4 (middle) displays the PCA visualization for 2,200 samples from the 11 datasets in PUB11. Interestingly, no significant domain differences are observed among the RS images from these datasets, as the data points are intermingled.

Figure 4 (right) reveals a clear separation between BigEarthNet and the other two datasets (500 examples each), which may be attributed to the lower resolution (120 × 120) of all BigEarthNet images compared to the higher resolutions of the other two datasets.

Geographical Analysis and Potential Negative Social Impact

In our dataset, there are two potential concerns. The first is data overrepresentation and underrepresentation in some parts of the world. We analyzed the geolocation information of images in our dataset (based on 1,079,370 images with geo-information from FMoW, BEN, and YFCC). Our analysis reveals a long-tailed distribution for the "number of images per UTM zone" statistic.
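A small sketch of how the "images per UTM zone" statistic can be computed from longitude/latitude metadata; the field names and sample records are hypothetical, and the special zones around Norway and Svalbard are ignored.

```python
from collections import Counter

def utm_zone(lon: float, lat: float) -> str:
    """Standard UTM longitudinal zone (1-60) with a hemisphere suffix."""
    zone = int((lon + 180) // 6) + 1
    return f"{zone}{'N' if lat >= 0 else 'S'}"

# Hypothetical geo-tagged records drawn from FMoW / BEN / YFCC metadata.
records = [{"lon": -73.97, "lat": 40.78}, {"lon": 151.21, "lat": -33.87}]
counts = Counter(utm_zone(r["lon"], r["lat"]) for r in records)
print(counts.most_common(10))       # long-tailed in the full dataset
```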

In Figure 6, image density (the number of images per UTM zone) is sparse in Middle Africa and Southern Africa. This might be attributed to the presence of the Sahara Desert and the South African Plateau, which are less inhabited regions. Southern Indonesia and Australia also exhibit low image density, with the exception of Southern Australia, which is characterized by flat terrain and heightened human activity. Northern South America displays a reduced distribution of images, which is peculiar, as one would expect higher human activity in this region. Northern regions of Canada and Russia have low image density, which is understandable given their proximity to the Arctic Circle. High image density is observed in North America, Europe, and most parts of Asia and South America. The low-density areas overlap with many underdeveloped and sparsely inhabited regions, and this could introduce bias into models trained with RS5M. Second, the RS3 subset may contain wrong or misleading captions, which could lead to mistakes with real-world consequences.

Upon analysis, we discovered that the captions from the PUB11 subset contain a significant amount of location information. We therefore ran NER (Named Entity Recognition) extraction on the PUB11 subset, keeping entities labeled as "GPE" (geopolitical entities). The complete file has been uploaded to the Huggingface repo. In total, 880,354 images from PUB11 have captions with location information.
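A minimal sketch of this GPE extraction with spaCy; the pipeline name and the example caption are assumptions (any English NER model works).

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

captions = ["Aerial view of the harbour in Sydney, Australia"]   # hypothetical PUB11 caption
for caption in captions:
    doc = nlp(caption)
    gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    if gpes:
        print(caption, "->", gpes)
```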

Experiment

To verify the effectiveness of the RS5M dataset, we conducted experiments training models with RS5M. We selected CLIP ViT-B32, CLIP ViT-B16, CLIP ViT-L16, and CLIP ViT-H14 as base models. We fine-tuned these models on the RS5M dataset and also employed 4 different Parameter-Efficient Fine-Tuning (PEFT) methods on CLIP ViT-B32: the Pfeiffer adapter, LoRA, Prefix-tuning, and UniPELT (a vanilla adapter, a low-rank adapter, a prompt-based adapter, and a composite adapter, respectively).
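As an illustration of one PEFT variant, the sketch below attaches LoRA to a Huggingface CLIP ViT-B/32 with the peft library; the target modules and LoRA hyperparameters are assumptions rather than the paper's settings, and the Pfeiffer, Prefix-tuning, and UniPELT adapters would be configured analogously with an adapter library.

```python
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Attach low-rank adapters to the attention projections of both towers.
# Rank / alpha / dropout values are illustrative.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
clip_lora = get_peft_model(clip, lora_cfg)
clip_lora.print_trainable_parameters()   # only the adapter weights remain trainable
```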

We evaluated the domain generalizability of models tuned on the RS5M dataset on three vision-language tasks: zero-shot classification (ZSC), remote sensing cross-modal text-image retrieval (RSCTIR), and semantic localization (SeLo). The results are shown below.

Table I demonstrates that the CLIP-based methods, especially when fine-tuned (CLIP-FT), exhibit superior ZSC performance across the AID, RESISC45, and EuroSAT datasets. The highest top-1 accuracy is achieved by the CLIP-FT (ViT-H-14) model, indicating the effectiveness of fine-tuning larger models on specialized datasets like RS5M. Models enhanced with DVLM (CLIP-Pfeiffer, CLIP-Prefix-tuning, CLIP-LoRA, and CLIP-UniPELT) also show notable improvements over the baseline. In SeLo, the highest Rsu, Rda, and Rmi scores are again observed for the CLIP-FT (ViT-H-14) variant, suggesting its robustness in localizing semantic elements. Interestingly, CLIP-Pfeiffer attains the lowest Ras score, highlighting its strength in semantic localization. Across the SeLo metrics, the DVLM-enhanced CLIP models generally outperform the baseline and purely supervised models, demonstrating the value of fine-tuning with task-specific datasets. The RS5M dataset's role as a tuning dataset for DVLM implementations illustrates its potential to enhance model performance for both ZSC and SeLo tasks, and the results suggest that larger, fine-tuned models can more effectively leverage the rich information present in specialized datasets like RS5M.
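Before turning to retrieval, here is a hedged sketch of the ZSC protocol: encode prompt-wrapped class names with a CLIP-style model and predict the class with the highest image-text similarity. The backbone tag, prompt template, class names, and image path are placeholders; a GeoRSCLIP checkpoint would be loaded in place of the generic pretrained weights.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Swap in fine-tuned GeoRSCLIP weights here instead of the generic pretrained tag.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["airport", "forest", "river", "residential area"]   # RESISC45-style class names
prompts = [f"a satellite image of a {c}" for c in classes]

with torch.no_grad():
    txt = model.encode_text(tokenizer(prompts).to(device))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = preprocess(Image.open("test_scene.jpg").convert("RGB")).unsqueeze(0).to(device)
    feat = model.encode_image(img)
    feat = feat / feat.norm(dim=-1, keepdim=True)

print(classes[int((feat @ txt.T).argmax())])
```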

In Table II, we demonstrate the retrieval performance across both datasets (RSICD and RSITMD) and both tasks (image-to-text and text-to-image retrieval). This again indicates the effectiveness of GeoRSCLIP models. Moreover, the results show that models tuned with the RS5M dataset generally exhibit better performance compared to most of the baseline and other supervised methods. This suggests that the RS5M dataset can provide valuable domain-specific information that enhances model performance in RSCTIR tasks.

For more ablation studies and experiments, please refer to the original paper.

GeoRSSD

Given the impracticality of training a Stable Diffusion model from scratch with only 5M images, we present Stable Diffusion models tuned on 1% of the RS5M data, which we refer to as GeoRSSD. Specifically, we use DreamBooth from a modified Diffusers repository. The image resolution is set to 512, with a batch size of 50 for 50,000 steps; the text encoder is trained as well.

We generate 40,000 samples using different queries to compute the FID of the vanilla Stable Diffusion and the tuned Stable Diffusion (RS-SD, tuned with RS5M). The vanilla Stable Diffusion model yields an FID score of 36.86 on the RS domain generation task, whereas the RS-SD model achieves a significantly improved FID of 28.32. Overall, RS-SD outperforms vanilla SD in generating RS images both qualitatively and quantitatively: it generates more realistic RS images that better match the corresponding captions, regardless of whether the images are in satellite or aerial view.
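A sketch of the FID comparison with torchmetrics; the directories and the single-batch loading are placeholders (40,000 samples would be streamed in batches), and any standard FID implementation gives comparable numbers.

```python
import torch
from pathlib import Path
from PIL import Image
from torchvision import transforms
from torchmetrics.image.fid import FrechetInceptionDistance

to_uint8 = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.PILToTensor(),       # uint8 tensors, as the metric expects by default
])

def load_dir(folder: str) -> torch.Tensor:
    return torch.stack([to_uint8(Image.open(p).convert("RGB"))
                        for p in sorted(Path(folder).glob("*.jpg"))])

fid = FrechetInceptionDistance(feature=2048)
fid.update(load_dir("real_rs_images/"), real=True)     # hypothetical reference RS images
fid.update(load_dir("sd_generated/"), real=False)      # hypothetical generated samples
print(float(fid.compute()))
```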

As shown in Figure 24, for prompts containing "satellite", the vanilla SD tends to generate unrealistic or meteorological images, whereas RS-SD generates RS images that are more realistic and closer to the imagery used in common RS downstream tasks. Moreover, RS-SD's understanding of "snow-covered land", "building with some snow", and "surrounding fields" is significantly better than SD's.
