Why LISAt?

Segmentation models can recognize a pre-defined set of objects in images. However, segmentation models that can “reason” over complex user queries that implicitly refer to multiple objects of interest are still in their infancy. Recent advances in “reasoning segmentation”—generating segmentation masks from complex, implicit query text—show that vision-language models can reason across an open domain of objects and produce reasonable segmentation outputs. However, our experiments show that such models struggle when operating on complicated remote-sensing images.

Our Solution

In this work, we introduce LISAt, a vision-language model (VLM) designed to describe complex remote-sensing images, answer questions about those images, and identify and segment objects within them. We trained LISAt on a new curated geospatial reasoning-segmentation dataset, GRES, comprising 27,615 annotations across 9,205 images, and on a multimodal geospatial pre-training dataset, PreGRES, containing more than 1M question-answer pairs. LISAt outperforms existing geospatial foundation models such as RS-GPT4V by over 10.04% (BLEU-4) on remote-sensing visual description tasks, and outperforms state-of-the-art open-domain models on remote-sensing reasoning-segmentation tasks by 143.36% (gIoU). Our model, datasets, and code are available at .
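For reference, BLEU-4 (the metric used for the description results above) scores a generated caption by its 1- to 4-gram overlap with reference captions. Below is a minimal sketch using NLTK; the captions are made-up examples, not samples from our evaluation sets.

# Minimal BLEU-4 illustration with NLTK. The captions are invented
# examples, not drawn from the LISAt evaluation data.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["several cargo ships are docked along the harbor piers".split()]
candidate = "cargo ships are docked at the harbor".split()

# BLEU-4 is the geometric mean of 1- to 4-gram precisions;
# smoothing avoids zero scores on short captions.
score = sentence_bleu(
    reference,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")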

How was GRES constructed?

To generate synthetic data, we use the pipeline depicted below. We start with a seed detection dataset (xView) and filter its detections for those that are both visually interesting and highly distinguishable (A). For each of those detections, we then generate a natural-language description (B) and a pixel-wise segmentation mask (C). Finally, the natural-language description is used to generate a localization query (D). Together, the segmentation mask and the query form a ground-truth pair for LISAt's reasoning-segmentation fine-tuning.

GRES dataset overview
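To make the output of stages A–D concrete, here is a minimal sketch of the kind of record the pipeline yields. The field names and identifier are illustrative placeholders, not the released GRES schema.

# Sketch of a single GRES example as produced by stages A-D above.
# Field names are illustrative placeholders, not the released GRES schema.
from dataclasses import dataclass

import numpy as np


@dataclass
class GRESExample:
    image_id: str        # source xView image the detection came from
    description: str     # stage B: natural-language description of the detection
    query: str           # stage D: localization query derived from the description
    mask: np.ndarray     # stage C: pixel-wise binary segmentation mask (H x W)


example = GRESExample(
    image_id="xview_000123",  # hypothetical identifier
    description="a large cargo plane parked near the terminal",
    query="Segment the large cargo plane parked near the terminal.",
    mask=np.zeros((512, 512), dtype=bool),  # placeholder mask
)

# The (query, mask) pair is the ground truth used during
# reasoning-segmentation fine-tuning; the image itself comes from xView.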

For more information, please visit our GitHub repository.

What are Some Quantitative Results of LISAt-PRE and LISAt?

  • The choice of vision encoder and LLM matters for the MLLM (LISAt-PRE): The table below shows that models using LLaMA 2 as the base LLM are notably worse than those using Vicuna. It also shows that RemoteCLIP (which we use in LISAt) significantly outperforms both Geo-CLIP and Sat-CLIP on all domains, while slightly outperforming the base CLIP models.
Effect of the vision encoder / LLM combination
  • LISAt-PRE is competitive with existing methods: Captioning performance on datasets such as UCM-Captions, reported with BLEU-4 and CIDEr metrics, is shown in the table below.
LISAt-PRE (MLLM) is competitive
  • LISAt outperforms existing methods: The table below compares LISAt against LISA-7B-v1 and LISA-13B-Llama2-v1 on GRES across different object sizes; LISAt-7B consistently outperforms the baseline models, particularly in the small-object category.
LISAt is competitive
  • LISAt is scalable: While adding more data helps, even with 7K training images (the full GRES dataset) we observe the beginning of a performance plateau, particularly on cIoU scores (see the metric sketch after this list). This suggests that more of the same data alone may not be sufficient; instead, we may need additional data variance beyond the xView classes.
LISAt scaling behavior
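The two segmentation metrics reported above aggregate differently: gIoU averages the per-example IoU, while cIoU divides the cumulative intersection by the cumulative union over the whole set, so larger objects carry more weight. Below is a minimal NumPy sketch of both on synthetic masks; it is not the evaluation code from our repository.

# Minimal gIoU / cIoU sketch on synthetic binary masks
# (illustrative only, not the LISAt evaluation code).
import numpy as np


def giou_ciou(preds, gts):
    """preds, gts: lists of boolean (H, W) masks."""
    per_image_iou = []
    total_inter, total_union = 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        per_image_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_iou))  # mean of per-example IoUs
    ciou = total_inter / total_union if total_union > 0 else 1.0  # cumulative IoU
    return giou, ciou


# Tiny synthetic example: two 4x4 mask pairs.
pred = [np.eye(4, dtype=bool), np.ones((4, 4), dtype=bool)]
gt = [np.eye(4, dtype=bool), np.tril(np.ones((4, 4), dtype=bool))]
print(giou_ciou(pred, gt))  # (0.8125, 0.7)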

What are Some Qualitative Results of LISAt?

BibTeX

@article{TBD,
  title={LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery},
  author={Quenum, Jerome and Hsieh, Wen-Han and Wu, Tsung-Han and Gupta, Ritwik and Darrell, Trevor and Chan, David M},
  journal={TBD},
  year={2025},
  url={TBD}
}