Segmentation models can recognize a pre-defined set of objects in images. However, segmentation models that can “reason” over complex user queries that implicitly refer to multiple objects of interest are still in their infancy. Recent advances in “reasoning segmentation”—generating segmentation masks from complex, implicit query text—show that vision-language models can reason across an open domain of objects and produce reasonable segmentation outputs. However, our experiments show that such models struggle when operating on complicated remote-sensing images.
In this work, we introduce LISAT, a vision-language model (VLM) designed to describe complex remote-sensing images, answer questions about those images, and identify and segment objects within the scenes. We trained LISAT on a new curated geospatial reasoning-segmentation dataset, GRES, comprising 27,615 annotations across 9,205 images, and on a multi-modal geospatial pre-training dataset, PreGRES, containing over one million question-answer pairs. LISAT outperforms existing geospatial foundation models such as RS-GPT4V by over 10.04% (BLEU-4) on remote-sensing visual description tasks, and outperforms state-of-the-art open-domain models on remote-sensing reasoning-segmentation tasks by 143.36% (gIoU). Our model, datasets, and code are available on our GitHub repository.
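For readers unfamiliar with the metric: in the reasoning-segmentation literature, gIoU is typically the mean of per-image intersection-over-union scores between predicted and ground-truth masks. The short Python sketch below computes gIoU under that assumption; it is illustrative only (the function name `giou` and the empty-mask convention are ours), not the evaluation code used for LISAT.

```python
import numpy as np

def giou(pred_masks, gt_masks):
    """Mean per-image intersection-over-union over paired binary masks.

    Illustrative sketch: assumes gIoU = average of per-image IoUs, the usual
    definition in reasoning-segmentation benchmarks.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        # Convention choice: score an empty prediction of an empty target as 1.0.
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```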
To generate synthetic data, we use the pipeline depicted below. We start with a seed detection dataset (xView) and filter its detections for those that are both visually interesting and highly distinguishable (A). For each retained detection, we then generate a natural-language description (B) and a pixel-wise segmentation mask (C). Finally, the natural-language description is used to generate a localization query (D). Together, the segmentation mask and the query form a ground-truth pair for LISAT's reasoning-segmentation fine-tuning.
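To make the stage ordering concrete, here is a minimal Python sketch of the pipeline's control flow. All names here (`GroundTruthPair`, `build_pairs`, the `keep`, `describe`, `segment`, and `make_query` callables, and the `image_id` field) are placeholders for illustration, not the actual LISAT data-generation code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List
import numpy as np

@dataclass
class GroundTruthPair:
    """One reasoning-segmentation example: a localization query and its target mask."""
    image_id: str
    query: str         # localization query derived from the description (step D)
    mask: np.ndarray   # pixel-wise segmentation mask (step C)

def build_pairs(
    detections: Iterable[dict],
    keep: Callable[[dict], bool],            # (A) filter: visually interesting + distinguishable
    describe: Callable[[dict], str],         # (B) natural-language description of the detection
    segment: Callable[[dict], np.ndarray],   # (C) pixel-wise mask generation
    make_query: Callable[[str], str],        # (D) description -> localization query
) -> List[GroundTruthPair]:
    pairs = []
    for det in detections:
        if not keep(det):
            continue
        description = describe(det)
        mask = segment(det)
        query = make_query(description)
        pairs.append(GroundTruthPair(det["image_id"], query, mask))
    return pairs
```

Framing the four stages as interchangeable callables is our choice for readability; it simply emphasizes that filtering, description, mask generation, and query generation are independent steps composed in sequence.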
For more information, please visit our GitHub repository.
@article{TBD,
  title={LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery},
  author={Quenum, Jerome and Hsieh, Wen-Han and Wu, Tsung-Han and Gupta, Ritwik and Darrell, Trevor and Chan, David M},
  journal={TBD},
  year={2025},
  url={TBD}
}