Abstract

Large vision-language models (VLMs) have demonstrated strong zero-shot classification and information retrieval capabilities on remote sensing (RS) images. Nevertheless, they remain limited in pixel-dense tasks such as semantic segmentation and pixel-wise change detection, which many applications require to be performed in a specific spatial context. To address this limitation, this study investigates the use of visual prompts combined with textual inputs to tackle open-vocabulary semantic segmentation and change detection on RS images. Specifically, a red boundary drawn around each object obtained from the Segment Anything Model is employed as the visual prompt, guiding the VLM’s focus towards the target of interest while preserving contextual information from its surroundings. Our experimental results demonstrate the potential of the proposed approach, showing promising zero-shot performance and robust reasoning ability in RS image open-vocabulary segmentation and change detection.

Introduction

Remote sensing (RS) images of diverse modalities and resolutions contain a wealth of information on land cover and land use, including high-level information about the relationships between objects. However, existing techniques for extracting this information are typically tailored to specific tasks and trained on a fixed set of object classes, which limits the flexibility with which information can be accessed from RS data. Consequently, the remote sensing community has become increasingly interested in the zero-shot capability offered by vision-language models (VLMs) [1]. Recently, Chen et al. introduced a conditional U-Net model that utilizes a pre-trained CLIP model and OpenStreetMap (OSM) data to predict segmentation masks based on text descriptions. Although this text-conditioned model allows objects of interest to be segmented from textual prompts, it is restricted to a few types of OSM object annotations and cannot answer detailed queries about the images. Seeking interactive dialogue ability, Lobry et al. [2] developed the first visual question answering (VQA) dataset for RS, combining queries about areas selected from OSM with Sentinel-2 images. Moreover, Yuan et al. [3] employed a VQA system for change detection on multitemporal aerial images by generating question-answer pairs automatically. To further advance the capabilities of RS vision-language models, Zhan et al. [4] curated the RS image-text dataset RSVG for the visual grounding task on RS data, comprising a large number of language expressions paired with RS images and object categories. However, vision-language models trained on general VQA and captioning tasks exhibit limited visual reasoning abilities and lack an appropriate visual prompt for dense-pixel downstream tasks.

To address these limitations, we propose to leverage segmentation results from the Segment Anything Model as a visual prompt, combined with text prompts, to extract useful information, such as object category and change status, from large VLMs in a zero-shot manner. Notably, we demonstrate that using a red boundary as the visual prompt guides the VLM to analyze the object of interest more effectively than cropping alone, since it preserves contextual information. We showcase the potential of zero-shot performance in RS image open-vocabulary segmentation and change detection, and we discuss how contextual visual prompts and tailored information queries help address the challenge of identifying small objects at different scales.

Fig. 1 Overview of the visual prompt guided open-vocabulary remote sensing image semantic segmentation and change detection. (a) We use the Segment Anything Model to obtain the boundary of each object in the image as our visual prompt. (b) The object-centric image crop with a red boundary around the object is input to the VLM together with a text prompt for extracting useful information. (c) Text prompts for semantic information queries, customized classification, and object-based change detection.

Methodology

Segment Anything Model

The Segment Anything Model (SAM) is an image segmentation model trained on the largest segmentation dataset available to date, and it supports both interactive and automatic segmentation. A noteworthy feature of SAM is its promptable interface: the model can be adapted to a wide range of segmentation tasks through appropriate prompt engineering, using clicks, boxes, or text. Its training relies on an extensive, high-quality dataset of over 1 billion masks collected as part of the project. This large-scale training gives the model strong generalization, allowing it to handle new object types and image scenarios beyond its original training data and reducing the need for practitioners to collect their own segmentation data and fine-tune for specific use cases. The SAM architecture is composed of three core components: an image encoder, a flexible prompt encoder, and a fast mask decoder. The image encoder uses a Vision Transformer (ViT) backbone that processes high-resolution 1024 × 1024 images and produces an image embedding of spatial size 64 × 64. The prompt encoder accepts sparse prompts (points, boxes, and text) as well as dense prompts (masks) and translates them into c-dimensional tokens. Finally, the lightweight mask decoder integrates the image and prompt embeddings and predicts segmentation masks in real time, enabling SAM to adapt to diverse prompts with minimal computational overhead.
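As a concrete illustration, the following minimal sketch generates class-agnostic masks automatically with the publicly released segment_anything package; the checkpoint filename and the score thresholds are assumptions and may need to be tuned for RS imagery.

# Minimal sketch of automatic mask generation with SAM (assumes the official
# `segment_anything` package and a downloaded ViT-H checkpoint).
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("rs_tile.png"), cv2.COLOR_BGR2RGB)  # HxWx3 RGB

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    pred_iou_thresh=0.88,         # assumed thresholds; adjust for RS imagery
    stability_score_thresh=0.92,
)
masks = mask_generator.generate(image)
# Each entry holds a binary mask and metadata, e.g. masks[0]["segmentation"]
# (HxW boolean array), masks[0]["bbox"] (XYWH), masks[0]["area"].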

Vision-Language Models

Large language models (LLMs) have demonstrated remarkable reasoning capabilities in diverse tasks, including chatbot interactions, mathematical calculation, and code generation. Nonetheless, a significant limitation of these models is their inability to reason about real-world visual inputs. Generative vision-language models, in contrast, build on pre-trained LLMs to comprehend both images and text, and thereby excel in visual reasoning tasks such as visual question answering, image captioning, optical character recognition, and object detection. For instance, Flamingo and BLIP are notable generative VLMs that perform well in captioning and visual question answering, but they fall short when confronted with pixel-level computer vision tasks. Training a large multimodal model end-to-end to bridge the gap between LLMs and images is computationally expensive. To overcome this challenge, the prevalent approach is to introduce a learnable interface between a frozen pre-trained visual encoder and a frozen pre-trained LLM: a set of learned query tokens extracts visual information through a query-based mechanism, with the interface parameters optimized by backpropagation through the frozen LLM (as in Flamingo, BLIP-2, and LLaVA). In this study, we investigate PaLM-E and LLaVA as the vision-language model backbone. These models map visual inputs into the same latent embedding space as language tokens and process them with the self-attention layers of a Transformer-based LLM, in the same way as text inputs. The inputs to PaLM-E comprise both text and (multiple) visual observations. Consequently, we can perform image segmentation on a single visual input or leverage multiple visual inputs for change detection on remote sensing images. An illustrative example of such usage is a question posed as follows: “What events occurred between \(<img_1>\) and \(<img_2>\)?”, where \(<img_i>\) represents the image acquired at time \(i\).
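Since PaLM-E is not publicly released, the sketch below illustrates the same idea of combining images and text with the open LLaVA model served through the Hugging Face transformers API; the checkpoint name, the prompt wording, and the side-by-side concatenation of the bi-temporal crops (LLaVA-1.5 accepts a single image) are assumptions for illustration only.

# Sketch: querying an open VLM (LLaVA via Hugging Face transformers) about a
# bi-temporal object crop. Checkpoint name and prompt wording are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"           # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

pre = np.array(Image.open("crop_t1.png").convert("RGB"))
post = np.array(Image.open("crop_t2.png").convert("RGB"))
# Assumes both crops have the same height; place the two epochs side by side.
pair = Image.fromarray(np.concatenate([pre, post], axis=1))

prompt = ("USER: <image>\nThe left and right halves show the same area at two "
          "dates; the object of interest is marked by a red boundary. "
          "What is the object and did it change? ASSISTANT:")
inputs = processor(text=prompt, images=pair,
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(out[0], skip_special_tokens=True))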

Visual Prompt

In the context of prompting VLMs, a prevalent technique is to incorporate a set of learnable tokens into either the text input or the visual input, effectively guiding a frozen VLM to perform zero-shot inference on downstream tasks. Commonly employed visual prompts include pixel-space augmentations such as image padding, patch modification, or image inpainting. Another notable strategy, known as “colorful prompt tuning,” highlights colored regions within an image and lets a captioning model predict which object a given expression refers to, based on its color. For instance, Shtedritski et al. [5] showed that a simple red circle drawn around an object in the image can serve as a visual prompt. In line with this concept, our method annotates a red closed boundary around the designated object, which proves more effective and flexible than a plain circle or rectangular bounding box. This red boundary acts as a powerful and versatile visual prompt, facilitating richer interactions with the VLM and enabling more accurate, context-aware zero-shot inference.
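To make the construction of this prompt concrete, the following sketch draws a red boundary from a SAM mask onto an object-centric crop using OpenCV; the crop padding and line thickness are assumptions.

# Sketch: turn a SAM binary mask into a red-boundary visual prompt on an
# object-centric crop (RGB numpy arrays; padding/thickness are assumptions).
import cv2
import numpy as np

def red_boundary_crop(image_rgb, mask, pad=32, thickness=2):
    """Crop around the mask with surrounding context and outline it in red."""
    ys, xs = np.where(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, image_rgb.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, image_rgb.shape[1])

    crop = image_rgb[y0:y1, x0:x1].copy()
    mask_crop = mask[y0:y1, x0:x1].astype(np.uint8)

    contours, _ = cv2.findContours(mask_crop, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(crop, contours, -1, (255, 0, 0), thickness)  # red in RGB
    return crop

# Example usage: prompt_img = red_boundary_crop(image, masks[0]["segmentation"])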

Experimental Results

Evaluation Setting

In our investigation of open-vocabulary semantic segmentation and change detection, we construct the evaluation dataset from OSM, an openly accessible repository of geolocalized information contributed by volunteers. This data allows us to automatically extract the relevant information used to generate text descriptions and semantic segmentation maps. The dataset construction proceeds in several steps. First, we identify the objects of interest and filter recognizable objects from the land cover and land use layers available in OSM, such as “silos,” “apartments,” and “roads.” The vector annotations of these selected objects are then rasterized into tiles, and by querying all objects that fall within each tile we obtain a semantic segmentation map covering all relevant objects. Furthermore, we carefully examine changes in various objects by considering the bi-temporal images together with the change labels provided by OSM. For semantic segmentation, we consider the Semantic Segment Anything model [6] as a future comparison; it builds on the Segment Anything Model and obtains rich-vocabulary semantic information from CLIPSeg, BLIP, and CLIP. For change detection, we consider CDVQA [3] as a potential comparison, although it only provides text descriptions of changes.
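As an illustration of the rasterization step, the sketch below burns OSM vector annotations (here assumed to be exported as a GeoJSON with a landuse attribute) into a per-tile label map with geopandas and rasterio; the file name, class mapping, and tile geometry are assumptions.

# Sketch: rasterize OSM vector annotations into a tile-sized label map.
# File name, class mapping, and tile size/transform are assumptions.
import geopandas as gpd
from rasterio import features
from rasterio.transform import from_bounds

CLASSES = {"residential": 1, "farmland": 2, "forest": 3}   # assumed mapping
TILE_SIZE = 512

gdf = gpd.read_file("osm_landuse.geojson")                  # assumed OSM export

def rasterize_tile(bounds, crs):
    """Burn all polygons intersecting `bounds` into a TILE_SIZE label map."""
    tile = gdf.to_crs(crs).cx[bounds[0]:bounds[2], bounds[1]:bounds[3]]
    transform = from_bounds(*bounds, TILE_SIZE, TILE_SIZE)
    shapes = [(geom, CLASSES.get(tag, 0))
              for geom, tag in zip(tile.geometry, tile["landuse"])]
    if not shapes:
        return None
    return features.rasterize(shapes, out_shape=(TILE_SIZE, TILE_SIZE),
                              transform=transform, fill=0, dtype="uint8")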

Results

We first evaluate the semantic segmentation results for objects of various sizes. In Fig. 2, we present RGB images alongside segmentations obtained from SAM and the semantic information derived from the VLM. To ensure contextual relevance and sufficient detail, we discard excessively small and excessively large objects relative to the image size, since object information cannot be reliably inferred without adequate context or detail. Regarding semantic information, compact, square-shaped objects yield better results than slim and elongated ones: the segmentation of slim objects includes more unrelated context, which confuses the VLM. The VLM also struggles with small objects that lack detail at the given image resolution, and it may return unrelated information for objects affected by shadows. For change detection (Fig. 3), we obtain segmentations of the bi-temporal images using SAM and then remove all objects whose segments overlap across the two dates, as such overlap indicates that no change has occurred. The remaining objects, accompanied by visual prompts, are fed into the VLM to identify their object types, which serves as context for text prompts querying whether a change has happened. Note that what counts as a change differs between object types; for example, a parking lot may be considered unchanged whether or not cars are present, while changes can still occur in other aspects. In change detection, the choice of text prompt is therefore critical. Overall, we observe that comparing segmentations from the bi-temporal images detects most of the changed objects, and the remaining false alarms can be reduced by querying the VLM, particularly in complex scenes, which helps refine the accuracy of the change detection results.
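A minimal sketch of this overlap-based filtering is given below: SAM masks from the two dates are compared by intersection-over-union, and only segments without a sufficiently overlapping counterpart are kept as change candidates; the IoU threshold is an assumption.

# Sketch: keep only bi-temporal SAM segments without an overlapping counterpart
# as change candidates. The IoU threshold is an assumption.
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def change_candidates(masks_t1, masks_t2, iou_thresh=0.5):
    """Return masks from each date that have no IoU >= threshold match."""
    segs_t1 = [m["segmentation"] for m in masks_t1]
    segs_t2 = [m["segmentation"] for m in masks_t2]
    cand_t1 = [s for s in segs_t1
               if all(mask_iou(s, t) < iou_thresh for t in segs_t2)]
    cand_t2 = [s for s in segs_t2
               if all(mask_iou(s, t) < iou_thresh for t in segs_t1)]
    return cand_t1, cand_t2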

Fig. 2 Examples of semantic segmentation results obtained by SAM and Google Bard, organized in three scenes. Col. 1: RGB images, Col. 2: segmentations from SAM, Col. 3: semantic information from Google Bard.
Fig. 3 Examples of change detection results obtained by SAM and Google Bard. Col. 1: pre-event images, Col. 2: post-event images, Col. 3: objects with change information.

Conclusion

In this study, we have shown that visual prompts derived from SAM can extract useful semantic and change status information from VLMs such as Google Bard and LLaVA in a zero-shot manner, enabling zero-shot open-vocabulary remote sensing image segmentation and change detection. Our analysis highlights that the visual prompt and the reasoning capabilities of these large multimodal language models can effectively extract pertinent information from complex geospatial scenes. However, models trained solely on natural images exhibit clear limitations when applied to remote sensing images, and our preliminary findings indicate that additional fine-tuning of SAM and the VLMs on remote sensing data is needed to further improve performance. In future work, we envision incorporating diverse sensor data, such as synthetic aperture radar (SAR), multispectral, and hyperspectral images, to enhance the inference performance of our approach and better accommodate the unique characteristics of remote sensing data.

References

[1] C. Wen, Y. Hu, X. Li, Z. Yuan, and X. X. Zhu, “Vision-language models in remote sensing: Current progress and future trends,” arXiv preprint arXiv:2305.05726, 2023.

[2] S. Lobry, D. Marcos, J. Murray, and D. Tuia, “RSVQA: Visual question answering for remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8555–8566, 2020.

[3] Z. Yuan, L. Mou, Z. Xiong, and X. X. Zhu, “Change detection meets visual question answering,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.

[4] Y. Zhan, Z. Xiong, and Y. Yuan, “RSVG: Exploring data and models for visual grounding on remote sensing data,” arXiv preprint arXiv:2210.12634, 2022.

[5] A. Shtedritski, C. Rupprecht, and A. Vedaldi, “What does CLIP know about a red circle? Visual prompt engineering for VLMs,” arXiv preprint arXiv:2304.06712, 2023.

[6] J. Chen, Z. Yang, and L. Zhang, “Semantic Segment Anything,” https://github.com/fudan-zvg/Semantic-Segment-Anything, 2023.