Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline

1University of Alberta, 2ByteDance, 3Johns Hopkins University
CVPR 2023

Real-world illustrations of Multispectral Video Semantic Segmentation in daytime and nighttime.

Abstract

Robust and reliable semantic segmentation in complex scenes is crucial for many real-life applications such as safe autonomous driving and nighttime rescue. Most approaches take RGB images as input; however, they work well only under favorable conditions and often fail to deliver satisfactory results when facing adverse conditions such as rain, overexposure, or low light. This has led to the recent investigation of multispectral semantic segmentation, where RGB and thermal infrared (RGBT) images are both utilized as input, giving rise to significantly more robust segmentation of image objects in complex scenes and under adverse conditions. Nevertheless, the current focus on single RGBT image input prevents existing methods from adequately addressing dynamic real-world scenes.

Motivated by these observations, in this paper we set out to address the relatively new task of semantic segmentation on multispectral video input, which we refer to as Multispectral Video Semantic Segmentation, or MVSS in short. An in-house MVSeg dataset is thus curated, consisting of 738 calibrated RGB and thermal videos, accompanied by 3,545 fine-grained pixel-level semantic annotations of 26 categories. Our dataset covers a wide range of challenging urban scenes in both daytime and nighttime. Moreover, we propose an effective MVSS baseline, dubbed MVNet, which is, to our knowledge, the first model to jointly learn semantic representations from multispectral and temporal contexts. Comprehensive experiments are conducted with various semantic segmentation models on the MVSeg dataset. Empirically, engaging multispectral video input is shown to lead to significant improvements in semantic segmentation, and the effectiveness of our MVNet baseline is also verified.
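To give a concrete sense of the kind of input an MVSS model consumes (short clips of calibrated RGB and thermal frames, with a pixel-level semantic mask on the query frame), here is a minimal loading sketch in PyTorch. The directory layout, file naming, and the MVSegClipDataset class are illustrative assumptions, not the official MVSeg data loader.

import os
from typing import List, Tuple

import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset


class MVSegClipDataset(Dataset):
    """Loads short clips of calibrated RGB and thermal frames plus the
    semantic mask of the last (query) frame. Layout is an assumption:
    root/<video>/{rgb,thermal,label}/<frame>.png"""

    def __init__(self, root: str, clip_len: int = 4):
        self.root = root
        self.clip_len = clip_len
        self.samples: List[Tuple[str, List[str]]] = []
        for video in sorted(os.listdir(root)):
            frames = sorted(os.listdir(os.path.join(root, video, "rgb")))
            for i in range(clip_len - 1, len(frames)):
                self.samples.append((video, frames[i - clip_len + 1 : i + 1]))

    def __len__(self) -> int:
        return len(self.samples)

    def _load(self, video: str, modality: str, name: str) -> torch.Tensor:
        img = Image.open(os.path.join(self.root, video, modality, name))
        arr = np.asarray(img, dtype=np.float32) / 255.0
        if arr.ndim == 2:  # thermal frames are single-channel
            arr = arr[..., None]
        return torch.from_numpy(arr).permute(2, 0, 1)  # C x H x W

    def __getitem__(self, idx: int):
        video, frames = self.samples[idx]
        rgb = torch.stack([self._load(video, "rgb", f) for f in frames])          # T x 3 x H x W
        thermal = torch.stack([self._load(video, "thermal", f) for f in frames])  # T x 1 x H x W
        label = np.asarray(Image.open(
            os.path.join(self.root, video, "label", frames[-1])), dtype=np.int64)
        return rgb, thermal, torch.from_numpy(label)                              # label: H x W, 26 classes

In the real dataset, annotations are provided only for a sparse subset of frames, so the clip sampling would be anchored on annotated query frames rather than on every frame as in this sketch.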

Video Demo

Visual Results

The following video demo provides visual results for different scenarios, day and night, in the MVSeg dataset, with key details highlighted by red boxes. The results from our MVNet model are noticeably more complete than those of the RGB-based DeepLabV3+, which we attribute to our method's ability to leverage complementary multispectral and temporal contexts.


Method Quickview

The following animation presents an overview of the proposed MVNet. Starting from the input multispectral video, its pipeline consists of four parts: (a) feature extraction to obtain the multispectral video features; (b) an MVFuse module to furnish the query features with the rich semantic cues of memory frames; (c) an MVRegulator loss to regularize the multispectral video embedding space; and (d) a cascaded decoder to generate the final segmentation mask.
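To make the four-stage pipeline concrete, here is a simplified PyTorch sketch. The module internals (encoder depth, attention form, decoder structure, loss details) and the names MVNetSketch and mvregulator_sketch are illustrative assumptions for exposition, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MVNetSketch(nn.Module):
    def __init__(self, num_classes: int = 26, dim: int = 256):
        super().__init__()
        # (a) Feature extraction: separate shallow encoders for RGB and thermal inputs.
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, dim, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(dim, dim, 3, 2, 1), nn.ReLU())
        self.thr_enc = nn.Sequential(nn.Conv2d(1, dim, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(dim, dim, 3, 2, 1), nn.ReLU())
        # (b) MVFuse-style memory read: cross-attention from query features to memory frames.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # (d) Cascaded decoder, reduced here to two stages ending in per-pixel logits.
        self.dec1 = nn.Sequential(nn.Conv2d(2 * dim, dim, 3, 1, 1), nn.ReLU())
        self.dec2 = nn.Conv2d(dim, num_classes, 1)

    def forward(self, rgb, thermal):
        # rgb: B x T x 3 x H x W, thermal: B x T x 1 x H x W; the last frame is the query (T >= 2).
        B, T, _, H, W = rgb.shape
        feats = [self.rgb_enc(rgb[:, t]) + self.thr_enc(thermal[:, t]) for t in range(T)]
        query, memory = feats[-1], torch.stack(feats[:-1], dim=1)
        h, w = query.shape[-2:]
        q = query.flatten(2).transpose(1, 2)                                    # B x hw x C
        m = memory.flatten(3).permute(0, 1, 3, 2).reshape(B, -1, q.shape[-1])   # B x (T-1)hw x C
        fused, _ = self.attn(q, m, m)                      # query enriched with memory-frame cues
        fused = fused.transpose(1, 2).reshape(B, -1, h, w)
        out = self.dec2(self.dec1(torch.cat([query, fused], dim=1)))
        logits = F.interpolate(out, size=(H, W), mode="bilinear", align_corners=False)
        return logits, fused


def mvregulator_sketch(embed, label, temperature: float = 0.1):
    """(c) Toy pixel-level contrastive regularizer in the spirit of the MVRegulator loss:
    pulls together embeddings of same-class pixels and pushes apart different classes."""
    B, C, h, w = embed.shape
    lab = F.interpolate(label[:, None].float(), size=(h, w), mode="nearest").long().flatten()
    emb = F.normalize(embed.permute(0, 2, 3, 1).reshape(-1, C), dim=1)
    idx = torch.randperm(emb.shape[0], device=emb.device)[:512]  # subsample pixels
    emb, lab = emb[idx], lab[idx]
    sim = emb @ emb.t() / temperature
    pos = (lab[:, None] == lab[None, :]).float() - torch.eye(len(lab), device=emb.device)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum() / pos.sum().clamp(min=1)


# Usage sketch with random tensors standing in for a loaded clip.
model = MVNetSketch()
rgb = torch.randn(1, 4, 3, 256, 320)
thermal = torch.randn(1, 4, 1, 256, 320)
label = torch.randint(0, 26, (1, 256, 320))
logits, fused = model(rgb, thermal)
loss = F.cross_entropy(logits, label) + 0.1 * mvregulator_sketch(fused, label)

In training, the per-pixel cross-entropy on the logits would be combined with the embedding regularizer on the fused features, mirroring the role the MVRegulator loss plays in shaping the multispectral video embedding space.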


Our BibTeX

@InProceedings{ji2023mvss,
      title     = {Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline},
      author    = {Ji, Wei and Li, Jingjing and Bian, Cheng and Zhou, Zongwei and Zhao, Jiaying and Yuille, Alan L. and Cheng, Li},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2023},
      pages     = {1094-1104}
}

Announcement

Our dataset and code are now available to the research community. If you have questions, please email us at wji3@ualberta.ca.

Our MVSeg dataset (Multispectral Video Segmentation) is built from diverse RGBT video sources, including OSU, INO, KAIST, and RGBT234, which have been annotated and adjusted to better fit the MVSS task. All data and annotations provided are strictly intended for non-commercial research purposes only. If you use our MVSeg dataset, we would sincerely appreciate a citation of our work, and we strongly encourage you to also cite the four source datasets mentioned above.