A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data
Driven by the rapid development of Earth observation sensors, semantic segmentation based on multimodal fusion of remote sensing data has drawn substantial research attention in recent years. However, existing multimodal fusion methods based on convolutional neural networks cannot capture long-range dependencies across multiscale feature maps of remote sensing data in different modalities. To address this problem, this work proposes a crossmodal multiscale fusion network (CMFNet) that exploits the transformer architecture. In contrast to conventional early, late, or hybrid fusion networks, the proposed CMFNet fuses information from different modalities at multiple scales using the cross-attention mechanism. More specifically, CMFNet utilizes a novel crossmodal attention architecture to fuse multiscale convolutional feature maps of optical remote sensing images and digital surface model (DSM) data through a crossmodal multiscale transformer (CMTrans) and a multiscale context-augmented transformer (MCATrans). The CMTrans effectively models long-range dependencies across multiscale feature maps derived from multimodal data, while the MCATrans learns discriminative integrated representations for semantic segmentation. Extensive experiments on two large-scale fine-resolution remote sensing datasets, ISPRS Vaihingen and Potsdam, confirm the excellent performance of the proposed CMFNet compared with other multimodal fusion methods.
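The core mechanism described above, cross-attention between per-scale feature maps of two modalities, can be illustrated with a minimal NumPy sketch. This is not the authors' CMTrans/MCATrans implementation; the function names, the choice of DSM tokens as queries and optical tokens as keys/values, and the residual fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # q_feats: (N, d) query tokens from one modality;
    # kv_feats: (M, d) key/value tokens from the other modality.
    d = q_feats.shape[1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)      # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)              # each query attends over all kv tokens
    return weights @ kv_feats                       # (N, d) attended features

def multiscale_crossmodal_fusion(optical_maps, dsm_maps):
    # optical_maps / dsm_maps: lists of (H, W, C) feature maps, one per scale.
    # At each scale, DSM tokens query the optical tokens (an assumption here),
    # and the attended context is fused residually.
    fused = []
    for opt, dsm in zip(optical_maps, dsm_maps):
        h, w, c = opt.shape
        opt_tok = opt.reshape(-1, c)                # flatten spatial grid to tokens
        dsm_tok = dsm.reshape(-1, c)
        ctx = cross_attention(dsm_tok, opt_tok)     # long-range, crossmodal context
        fused.append((dsm_tok + ctx).reshape(h, w, c))
    return fused
```

In a real network each scale would also have learned query/key/value projections and multiple heads; the sketch keeps only the attention arithmetic that lets every DSM location aggregate information from all optical locations at that scale.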
Weakly Supervised Local-Global Anchor Guidance Network for Landslide Extraction With Image-Level Annotations
Weakly supervised learning with image-level annotations has become a popular choice for reducing the labeling effort of remote sensing object extraction. Existing methods exploit inter-pixel relations within an individual image patch for object localization. When facing large-scale remote sensing images, it is still challenging to obtain global semantic context across image patches for feature representation, resulting in inaccurate object localization. To remedy these issues, we propose a local-global anchor guidance network (LGAGNet) for weakly supervised landslide extraction. Specifically, a structure-aware object locating (SOL) module is developed to capture the spatial structure of landslide objects and extract local category anchors containing informative feature embeddings. Furthermore, we leverage a global anchor aggregation (GAA) module to mine semantic patterns across image patches based on a memory bank, which are then used as additional context cues to enhance the feature representation through a cross-attention mechanism. Finally, a hybrid loss function considering category-aware semantic contrast and local activation consistency is designed to guide network training. Experimental results on high-resolution aerial and satellite image datasets verify the effectiveness of the proposed approach for landslide extraction.
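The memory-bank-plus-cross-attention idea in the GAA module can be sketched as follows. This is a hypothetical minimal version, not the paper's implementation: the FIFO update policy, the class name, and the residual enhancement are assumptions made for illustration; local category anchors stored from previous patches serve as keys/values that current patch features query.

```python
import numpy as np

def _softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AnchorMemoryBank:
    """FIFO bank of category anchor embeddings collected across image patches."""

    def __init__(self, dim, capacity):
        self.bank = np.zeros((capacity, dim))
        self.ptr = 0       # next write position (ring buffer)
        self.filled = 0    # number of valid entries

    def update(self, anchors):
        # anchors: (k, dim) local category anchors from the current patch.
        for a in anchors:
            self.bank[self.ptr] = a
            self.ptr = (self.ptr + 1) % self.bank.shape[0]
            self.filled = min(self.filled + 1, self.bank.shape[0])

    def enhance(self, feats):
        # feats: (n, dim) current-patch features (queries); the stored global
        # anchors act as keys and values in a cross-attention step.
        if self.filled == 0:
            return feats                             # nothing stored yet
        mem = self.bank[:self.filled]
        scores = feats @ mem.T / np.sqrt(feats.shape[1])
        ctx = _softmax(scores, axis=-1) @ mem        # global context per feature
        return feats + ctx                           # residual fusion
```

The point of the bank is that `enhance` lets a single patch see semantic patterns aggregated from many other patches, which an individual patch's inter-pixel relations cannot provide.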