A Unified Latency-Flexible Framework with a Causal Mamba Core for Multichannel Speech Enhancement

Abstract

We propose a unified, latency-flexible framework for multichannel speech enhancement in CHiME-9 Task 2 (ECHI), built on a decoupled Shell-Core architecture. A latency-aware DenseNet-style Shell performs local spectral-spatial modeling across channels, while a causal Core composed of TF-Mamba blocks with unidirectional Time-Mamba captures long-range temporal dependencies. Because every configuration uses the same causal Mamba operators, the framework scales in a plug-and-play manner by adjusting only the Shell lookahead, supporting both low-latency streaming (16 ms updates, RTF 0.005) and high-performance offline configurations within a single model family. On the HA development set, the causal model improves upon the official baseline while operating within a 16 ms latency budget, and the offline configuration further improves objective metrics with a latency of 1.136 s.
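The role of a causal core in streaming can be illustrated with a toy example. The sketch below is not the TF-Mamba block described above: it uses a scalar linear recurrence as a stand-in for a unidirectional scan, and the names `causal_scan`, `stream` and the coefficients `a`, `b` are hypothetical. It only shows that a causal operator yields the same output whether the signal is processed in one pass or in small chunks with the state carried over.

```python
# Toy causal recurrence: a stand-in for a unidirectional state-space scan.
# This is NOT the paper's TF-Mamba block; a and b are hypothetical values.

def causal_scan(frames, state=0.0, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t; return (outputs, final_state)."""
    outputs = []
    for x in frames:
        state = a * state + b * x  # depends only on past state and current input
        outputs.append(state)
    return outputs, state


def stream(frames, chunk_size):
    """Process frames chunk by chunk, carrying the state across chunks."""
    state = 0.0
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        chunk_out, state = causal_scan(chunk, state)
        outputs.extend(chunk_out)
    return outputs


signal = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25]

# Offline: the whole signal in one pass.
offline, _ = causal_scan(signal)

# Streaming: two frames at a time, same operator, same state update.
streamed = stream(signal, chunk_size=2)

# Causality check: changing a future frame leaves earlier outputs untouched.
perturbed = signal[:4] + [99.0] + signal[5:]
perturbed_out, _ = causal_scan(perturbed)
```

In this toy setting `offline` and `streamed` are identical, and the first four entries of `perturbed_out` match those of `offline`. Under the same assumption, any lookahead would have to come from a separate non-causal stage, which is the part the abstract attributes to the Shell.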

Publication
ICASSP 2026 Workshop: 9th CHiME Speech Separation and Recognition Challenge (CHiME-9)

Related