
A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data
Driven by the rapid development of Earth observation sensors, semantic segmentation using multimodal fusion of remote sensing data has drawn substantial research attention in recent years. However, existing multimodal fusion methods based on convolutional neural networks cannot capture long-range dependencies across multiscale feature maps of remote sensing data in different modalities. To circumvent this problem, this work proposes a crossmodal multiscale fusion network (CMFNet) by exploiting the transformer architecture. In contrast to conventional early, late, or hybrid fusion networks, the proposed CMFNet fuses information from different modalities at multiple scales using the cross-attention mechanism. More specifically, the CMFNet utilizes a novel crossmodal attention architecture to fuse multiscale convolutional feature maps of optical remote sensing images and digital surface model data through a crossmodal multiscale transformer (CMTrans) and a multiscale context augmented transformer (MCATrans). The CMTrans can effectively model long-range dependencies across multiscale feature maps derived from multimodal data, while the MCATrans can learn discriminative integrated representations for semantic segmentation. Extensive experiments on two large-scale fine-resolution remote sensing datasets, namely ISPRS Vaihingen and Potsdam, confirm the excellent performance of the proposed CMFNet compared with other multimodal fusion methods.
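
The core fusion step, cross-attention between optical and DSM feature maps at a given scale, can be illustrated with the minimal PyTorch sketch below. The module name CrossModalFusionBlock, the embedding size, the head count, and the two-stream query/key/value assignment are illustrative assumptions, not the authors' released CMTrans implementation.

# Minimal sketch of cross-attention fusion between two modalities at one scale.
# Assumptions: feature maps are flattened to token sequences; "CrossModalFusionBlock",
# dim=256, and heads=8 are illustrative choices, not the paper's exact design.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Optical tokens attend to DSM tokens and vice versa.
        self.attn_opt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_dsm = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_opt = nn.LayerNorm(dim)
        self.norm_dsm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f_opt, f_dsm):
        # f_opt, f_dsm: (B, C, H, W) feature maps from the two modality encoders.
        b, c, h, w = f_opt.shape
        t_opt = f_opt.flatten(2).transpose(1, 2)   # (B, H*W, C)
        t_dsm = f_dsm.flatten(2).transpose(1, 2)
        # Cross-attention: queries from one modality, keys/values from the other.
        o2d, _ = self.attn_opt(t_opt, t_dsm, t_dsm)
        d2o, _ = self.attn_dsm(t_dsm, t_opt, t_opt)
        t_opt = self.norm_opt(t_opt + o2d)
        t_dsm = self.norm_dsm(t_dsm + d2o)
        fused = self.proj(torch.cat([t_opt, t_dsm], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Usage: one such block would be applied per scale of the multiscale feature pyramid.
block = CrossModalFusionBlock(dim=256, heads=8)
fused = block(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))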
Weakly Supervised Local-Global Anchor Guidance Network for Landslide Extraction With Image-Level Annotations
Weakly supervised learning using image-level annotations has become a popular choice for reducing the labeling effort of remote sensing object extraction. Existing methods exploit inter-pixel relations within an individual image patch for object localization. When facing large-scale remote sensing images, it is still challenging to obtain global semantic contexts across image patches for feature representation, resulting in inaccurate object localization. To remedy these issues, we propose a local-global anchor guidance network (LGAGNet) for weakly supervised landslide extraction. Specifically, a structure-aware object locating (SOL) module is developed to capture the spatial structure of landslide objects and extract local category anchors containing informative feature embeddings. Furthermore, we leverage a global anchor aggregation (GAA) module to mine semantic patterns across image patches based on a memory bank, which are then used as additional context cues to enhance the feature representation through a cross-attention mechanism. Finally, a hybrid loss function is designed to guide the network training, considering category-aware semantic contrast and local activation consistency. Experimental results on high-resolution aerial and satellite image datasets verify the effectiveness of the proposed approach for landslide extraction.
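
The global anchor aggregation idea, keeping category anchors from many image patches in a memory bank and injecting them back into patch features via cross-attention, can be sketched roughly as follows. The class name GlobalAnchorAggregation, the bank size, and the momentum update rule are assumptions made for illustration and do not reflect the released LGAGNet implementation.

# Rough sketch of a memory-bank-based global anchor aggregation step.
# Assumptions: bank_size, the momentum update, and the single cross-attention
# layer are illustrative choices, not the paper's specification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAnchorAggregation(nn.Module):
    def __init__(self, dim=256, bank_size=64, momentum=0.99, heads=4):
        super().__init__()
        # Memory bank of global category anchors accumulated across image patches.
        self.register_buffer("bank", torch.randn(bank_size, dim))
        self.momentum = momentum
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    @torch.no_grad()
    def update_bank(self, local_anchors):
        # Momentum update of the bank with local category anchors from the current batch.
        k = min(local_anchors.shape[0], self.bank.shape[0])
        self.bank[:k] = self.momentum * self.bank[:k] + (1 - self.momentum) * local_anchors[:k]

    def forward(self, feats, local_anchors):
        # feats: (B, N, C) patch features; local_anchors: (M, C) anchors from the local locating stage.
        self.update_bank(F.normalize(local_anchors, dim=-1))
        bank = self.bank.unsqueeze(0).expand(feats.shape[0], -1, -1)  # (B, K, C)
        # Cross-attention: patch features query the global anchor bank as extra context.
        context, _ = self.attn(feats, bank, bank)
        return self.norm(feats + context)

# Usage: enhance patch features with global semantic context before the segmentation head.
gaa = GlobalAnchorAggregation(dim=256)
out = gaa(torch.randn(2, 1024, 256), torch.randn(8, 256))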