Unleashing Multispectral Video’s Potential in Semantic Segmentation: A Semi-supervised Viewpoint and New UAV-View Benchmark

Wei Ji1,2, Jingjing Li1, Wenbo Li3, Yilin Shen3, Li Cheng1, Hongxia Jin3
1University of Alberta, 2Yale University, 3Samsung AI Center-Mountain View
NeurIPS 2024

Abstract

Thanks to rapid progress in RGB & thermal imaging, also known as multispectral imaging, the task of multispectral video semantic segmentation, or MVSS in short, has recently drawn significant attention. Notably, it offers new opportunities for improving segmentation performance under unfavorable visual conditions such as poor lighting or overexposure. Unfortunately, very few datasets are currently available; for example, the MVSeg dataset focuses solely on eye-level views and carries only sparse annotations due to the labor-intensive labeling process. To confront these challenges, this paper presents two major contributions to advance MVSS: the introduction of MVUAV, a new MVSS benchmark dataset, and the development of a dedicated semi-supervised MVSS baseline - SemiMV. Our MVUAV dataset is captured via Unmanned Aerial Vehicles (UAVs), which offer a unique oblique bird’s-eye view complementary to the existing MVSS datasets; it also encompasses a broad range of day/night lighting conditions and over 30 semantic categories. Meanwhile, to better leverage the sparse annotations and extra unlabeled RGB-Thermal videos, a semi-supervised learning baseline, SemiMV, is proposed to enforce consistency regularization through a dedicated Cross-collaborative Consistency Learning (C3L) module and a denoised temporal aggregation strategy. Comprehensive empirical evaluations on both the MVSeg and MVUAV benchmark datasets showcase the efficacy of our SemiMV baseline.

New MVUAV Dataset

We introduce MVUAV, a new MVSS dataset containing a wide range of RGB-T videos captured by Unmanned Aerial Vehicles (UAVs) from an oblique bird’s-eye viewpoint. This viewpoint offers a complementary perspective to the eye-level viewpoint adopted by the existing MVSeg dataset.


MVUAV Examples

The MVUAV dataset captures diverse real-world scenarios such as roads, streets, bridges, parks, seas, beaches, courts and schools; it also spans different lighting conditions from daytime to low-light and even pitch-dark scenarios.

Semi-supervised MVSS Task

Illustrations of information used in the semi-supervised MVSS (Semi-MVSS) task and related semantic segmentation tasks.


Method Overview

Illustrations of the proposed semi-supervised MVSS framework, namely SemiMV.
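To give a flavor of the consistency-regularization idea behind SemiMV, below is a minimal, illustrative sketch (not the paper's actual implementation) of a cross-modal consistency loss on unlabeled RGB-Thermal frames: each stream's confident predictions serve as pseudo-labels for the other stream, with a confidence threshold acting as a simple denoising filter. The function name `cross_consistency_loss` and the threshold value are assumptions for illustration only.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_consistency_loss(rgb_logits, thermal_logits, conf_thresh=0.9):
    """Illustrative cross-modal consistency (NOT the paper's exact C3L):
    each modality's confident predictions pseudo-label the other, and
    low-confidence pixels are masked out as a crude denoising step.

    Logits have shape (..., num_classes), one prediction per pixel.
    """
    p_rgb = softmax(rgb_logits)
    p_thermal = softmax(thermal_logits)

    def one_way(teacher, student):
        conf = teacher.max(axis=-1)        # per-pixel confidence
        labels = teacher.argmax(axis=-1)   # per-pixel pseudo-labels
        mask = conf >= conf_thresh         # keep only confident pixels
        if not mask.any():
            return 0.0
        # cross-entropy of the student against the teacher's pseudo-labels
        picked = np.take_along_axis(student, labels[..., None], axis=-1)[..., 0]
        ce = -np.log(picked + 1e-8)
        return float((ce * mask).sum() / mask.sum())

    # symmetric: RGB teaches thermal, and thermal teaches RGB
    return 0.5 * (one_way(p_rgb, p_thermal) + one_way(p_thermal, p_rgb))
```

When the two streams agree confidently, the loss is near zero; when they disagree, the confident stream pulls the other toward its prediction, which is the basic mechanism consistency regularization exploits on unlabeled data.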


Visual Examples

We visualize several multispectral video sequences from both the MVSeg and MVUAV datasets, alongside the segmentation results obtained by the SupOnly baseline and our SemiMV method. Our SemiMV clearly produces more accurate segmentation predictions by effectively leveraging both labeled and unlabeled multispectral videos.

For more details about our dataset and method, please refer to our paper.