Driven by the rapid development of Earth observation sensors, semantic segmentation based on the multimodal fusion of remote sensing data has drawn substantial research attention in recent years. However, existing multimodal fusion methods built on convolutional neural networks cannot capture long-range dependencies across the multiscale feature maps of remote sensing data from different modalities. To address this problem, this work proposes a crossmodal multiscale fusion network (CMFNet) that exploits the transformer architecture. In contrast to conventional early, late, or hybrid fusion networks, the proposed CMFNet fuses information from different modalities at multiple scales using the cross-attention mechanism. More specifically, CMFNet employs a novel crossmodal attention architecture to fuse multiscale convolutional feature maps of optical remote sensing images and digital surface model data through a crossmodal multiscale transformer (CMTrans) and a multiscale context-augmented transformer (MCATrans). The CMTrans effectively models long-range dependencies across multiscale feature maps derived from multimodal data, while the MCATrans learns discriminative integrated representations for semantic segmentation. Extensive experiments on two large-scale fine-resolution remote sensing datasets, ISPRS Vaihingen and Potsdam, confirm that the proposed CMFNet outperforms other multimodal fusion methods.
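To illustrate the kind of cross-attention fusion described above, the following is a minimal sketch of attending from optical feature maps (queries) to digital surface model feature maps (keys/values) at a single scale. The module name, channel width, and head count are illustrative assumptions, not the paper's exact CMTrans/MCATrans design, which operates across multiple scales.

```python
# Illustrative sketch only: a single-scale cross-modal attention block,
# not the authors' CMTrans/MCATrans implementation.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuses two modality-specific feature maps with cross-attention.

    Optical features act as queries; DSM features provide keys/values,
    so each optical location can attend to all DSM locations, capturing
    long-range cross-modal dependencies.
    """

    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=num_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(channels)

    def forward(self, optical: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        # optical, dsm: (B, C, H, W) convolutional feature maps at one scale
        b, c, h, w = optical.shape
        q = optical.flatten(2).transpose(1, 2)  # (B, H*W, C) queries
        kv = dsm.flatten(2).transpose(1, 2)     # (B, H*W, C) keys/values
        fused, _ = self.cross_attn(q, kv, kv)   # attend optical -> DSM
        fused = self.norm(fused + q)            # residual + layer norm
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # Toy check with random tensors standing in for one feature scale.
    opt_feat = torch.randn(2, 256, 32, 32)
    dsm_feat = torch.randn(2, 256, 32, 32)
    out = CrossModalAttentionFusion()(opt_feat, dsm_feat)
    print(out.shape)  # torch.Size([2, 256, 32, 32])
```

In a multiscale setting such as the one the abstract describes, a block of this form would be applied at each encoder scale and the fused maps aggregated by a decoder for dense prediction.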