A Multimodal Preprocessing Pipeline for Robust Audio-Visual Speech Separation and Recognition

Keywords: Audiovisual Speech Separation; Feature Normalization; MFCC; PCA; Lightweight Deep Learning; AVSpeech Dataset

Vol. 7 No. 2 (2026): June
Research Articles

Audiovisual speech separation aims to improve speech intelligibility in challenging real-world environments, such as noisy meetings and multi-speaker acoustic scenes. However, many existing approaches rely on computationally intensive architectures or unstable multimodal representations, which limits their robustness and practical deployment. This study proposes an audiovisual speech separation framework that combines a structured preprocessing pipeline with a lightweight hybrid deep learning architecture (HAVS-Net). The preprocessing pipeline enforces geometric, photometric, temporal, and statistical consistency across the audio and visual streams. The visual pathway applies grayscale conversion, histogram equalization, face detection, spatial normalization, and eigenface-based PCA feature extraction to obtain stable articulatory representations. The audio pathway applies pre-emphasis filtering, normalization, resampling, and MFCC-based feature extraction with vector-level equalization. The resulting representations are processed by a hybrid Conv1D–LSTM–GRU architecture for efficient temporal modeling. In experiments on the AVSpeech dataset, the method achieved an average SDR of 19.33 dB, SIR of 15.72 dB, and PESQ of 4.22. The proposed HAVS-Net architecture contains approximately 227K trainable parameters while maintaining robust performance and efficient computational behavior under diverse real-world conditions.
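As an illustration of the first two audio steps named above (pre-emphasis filtering and normalization), the following is a minimal sketch and not the authors' implementation. The abstract does not give the filter coefficient or the type of normalization, so the value 0.97 and the use of peak normalization are assumptions, and the function names `pre_emphasis`, `peak_normalize`, and `preprocess_audio` are hypothetical.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], coeff: float = 0.97) -> List[float]:
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    The coefficient 0.97 is a common default, assumed here; the abstract
    does not state the value used. The first sample is passed through.
    """
    if not signal:
        return []
    out = [float(signal[0])]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


def peak_normalize(signal: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale the signal so its largest absolute sample is 1.0.

    Peak normalization is an assumption; the abstract only says
    "normalization". A (near-)silent signal is returned unscaled
    to avoid dividing by zero.
    """
    peak = max((abs(s) for s in signal), default=0.0)
    if peak < eps:
        return [float(s) for s in signal]
    return [s / peak for s in signal]


def preprocess_audio(signal: Sequence[float], coeff: float = 0.97) -> List[float]:
    """Chain pre-emphasis and peak normalization (hypothetical helper)."""
    return peak_normalize(pre_emphasis(signal, coeff))
```

In a full pipeline the output of `preprocess_audio` would then be resampled and passed to an MFCC extractor. Those stages are omitted here because the abstract does not specify the sampling rate or the MFCC configuration.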