RankSAM: Lightweight Adapters and Prompt Generation in Zero-Shot Semantic Segmentation

Yue Zhuo, Zhaocheng Xu, Di Zhou, Pengpeng Xu, Yan Tian

Zhejiang University

Figure 2: An illustration of RankSAM.

Given an RGB image, the coarse-to-fine rank modulation in the SAM encoder distinguishes the lower- and higher-rank subspaces and activates the corresponding components through a hierarchical gating mechanism. Automatic prompt generation then produces the essential prompts for precise mask output: prompt candidate generation identifies essential prompt locations, and redundant prompt deletion enhances efficiency in prompt generation.
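The redundant prompt deletion step in the caption can be illustrated with a minimal sketch. It assumes that two prompts are redundant when their predicted masks overlap almost completely, and it uses a greedy, confidence-ordered filter. The names `delete_redundant_prompts`, `predict_mask`, `toy_predict_mask`, and the `iou_threshold` value are hypothetical and not taken from the paper, and the actual method may detect redundancy without decoding every candidate.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def delete_redundant_prompts(prompts, predict_mask, iou_threshold=0.9):
    """Greedy filter (illustrative assumption, not the paper's exact rule).

    prompts: list of ((row, col), confidence) candidates.
    predict_mask: callable mapping a point prompt to a pixel-set mask.
    Candidates are visited in descending confidence; a candidate is kept
    only if its mask differs from every mask kept so far.
    """
    kept, kept_masks = [], []
    for point, _conf in sorted(prompts, key=lambda p: p[1], reverse=True):
        mask = predict_mask(point)
        if all(mask_iou(mask, m) < iou_threshold for m in kept_masks):
            kept.append(point)
            kept_masks.append(mask)
    return kept


# Toy stand-in for a mask decoder: two rectangular objects side by side.
OBJECT_A = frozenset((r, c) for r in range(4) for c in range(4))
OBJECT_B = frozenset((r, c) for r in range(4) for c in range(6, 10))


def toy_predict_mask(point):
    # Points in the left half select object A, the rest select object B.
    return OBJECT_A if point[1] < 5 else OBJECT_B


candidates = [((1, 1), 0.9), ((2, 3), 0.7), ((1, 7), 0.8), ((3, 8), 0.6)]
kept = delete_redundant_prompts(candidates, toy_predict_mask)
print(kept)  # one surviving prompt per object
```

In this toy setting four candidate points collapse to two, one per object, so later stages handle half as many prompts.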

Abstract

Zero-shot segmentation plays a crucial role in neurocomputing applications such as embodied intelligence systems and autonomous driving. However, current approaches struggle to preserve the intrinsic generalization ability of SAM as input quality declines. In addition, prompt generation still struggles to balance effectiveness and efficiency. Motivated by low-rank adaptation (LoRA), we design RankSAM, which integrates slim, adaptable modules into the middle layers of the frozen SAM framework. These modules dynamically adjust the operational rank of their weight matrices in response to the input data, using a trainable gating mechanism to selectively activate specific rank-1 matrix components as needed. In addition, a learnable prompt predictor is designed to generate prompt confidence maps and point prompts, and any remaining prompts that would produce the same mask are filtered out to enhance efficiency in prompt generation. Experimental results on multiple datasets indicate that our approach improves the mean intersection over union (mIoU) by a margin of 2.5-2.8% compared to the prevailing approaches.
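To make the idea of input-dependent rank concrete, the following is a minimal sketch of a LoRA-style adapter whose rank-1 components are switched on or off by a gate. It assumes a flat, per-component sigmoid gate with a hard threshold, which is simpler than a hierarchical coarse-to-fine scheme. The function `gated_lora_forward` and the parameters `gate_w`, `gate_b`, and `threshold` are hypothetical and not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_lora_forward(x, W, A, B, gate_w, gate_b, threshold=0.5):
    """Compute y = W x + sum_i g_i * B[i] * (A[i] . x) over active components.

    x: input vector of length d_in.
    W: frozen weight matrix as d_out rows of length d_in.
    A: r down-projection vectors of length d_in.
    B: r up-projection vectors of length d_out.
    gate_w, gate_b: per-component gate parameters (hypothetical).
    A component contributes only if its gate g_i reaches the threshold,
    so the effective rank of the update is the number of active components.
    """
    y = [dot(row, x) for row in W]
    active = []
    for i in range(len(A)):
        g = sigmoid(dot(gate_w[i], x) + gate_b[i])
        if g >= threshold:
            coeff = g * dot(A[i], x)
            y = [yj + coeff * bj for yj, bj in zip(y, B[i])]
            active.append(i)
    return y, active


# Toy example: identity base weight, two rank-1 components.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 1.0], [2.0, 0.0]]
gate_w = [[0.0, 0.0], [0.0, 0.0]]
gate_b = [10.0, -10.0]  # first gate nearly open, second nearly closed

y, active = gated_lora_forward(x, W, A, B, gate_w, gate_b)
print(active, y)
```

Here only the first component passes the gate, so the update applied to this input has effective rank 1 even though the adapter stores two components. In a trained adapter the gate would depend on the input through `gate_w`, which lets the rank vary from sample to sample.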

Example