A Multimodal Preprocessing Pipeline for Robust Audio-Visual Speech Separation and Recognition
Audio-visual speech separation aims to improve speech intelligibility in challenging real-world environments, such as noisy meetings and multi-speaker acoustic scenes. However, many existing approaches rely on computationally intensive architectures or unstable multimodal representations, which limits their robustness and practical deployment. This study proposes a multimodal audio-visual speech separation framework that combines a structured preprocessing pipeline with a lightweight hybrid deep learning architecture, HAVS-Net. The preprocessing pipeline enforces geometric, photometric, temporal, and statistical consistency across the audio and visual streams. The visual pathway applies grayscale conversion, histogram equalization, face detection, spatial normalization, and eigenface-based PCA feature extraction to obtain stable articulatory representations. The audio pathway applies pre-emphasis filtering, normalization, resampling, and MFCC-based feature extraction with vector-level equalization. The resulting representations are processed by a hybrid Conv1D–LSTM–GRU architecture for efficient temporal modeling. In experiments on the AVSpeech dataset, the method achieved an average SDR of 19.33 dB, SIR of 15.72 dB, and PESQ of 4.22. HAVS-Net contains approximately 227K trainable parameters while maintaining robust performance and efficient computational behavior under diverse real-world conditions.
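As a rough illustration of three of the preprocessing steps named in the abstract (pre-emphasis filtering, histogram equalization, and eigenface-style PCA projection), the following NumPy sketch shows one plausible form each step could take. It is not the authors' implementation: the function names, the pre-emphasis coefficient of 0.97, the peak normalization, and the SVD-based PCA are all assumptions chosen for illustration, and face detection, resampling, and MFCC extraction are omitted.

```python
import numpy as np


def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    coeff=0.97 is a common default, assumed here; the source gives no value.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # The first sample has no predecessor and is passed through unchanged.
    return np.append(signal[:1], signal[1:] - coeff * signal[:-1])


def peak_normalize(signal, eps=1e-12):
    """Scale a waveform so its largest absolute sample is about 1 (assumed scheme)."""
    signal = np.asarray(signal, dtype=np.float64)
    return signal / (np.max(np.abs(signal)) + eps)


def equalize_histogram(gray):
    """Histogram-equalize an 8-bit grayscale image via its cumulative histogram."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied bin
    denom = cdf[-1] - cdf_min
    if denom == 0:                     # constant image: nothing to spread
        return gray.copy()
    lut = np.round((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]                   # map each pixel through the lookup table


def eigenface_features(faces, n_components):
    """Project face crops onto their top principal components ("eigenfaces").

    faces: array of shape (n_samples, height, width), already cropped and
    spatially normalized. Returns (features, mean_face, components).
    """
    X = np.asarray(faces, dtype=np.float64).reshape(len(faces), -1)
    mean_face = X.mean(axis=0)
    Xc = X - mean_face
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    return Xc @ components.T, mean_face, components


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = peak_normalize(pre_emphasis(rng.standard_normal(16000)))
    frames = rng.integers(0, 256, size=(20, 32, 32), dtype=np.uint8)
    frames = np.stack([equalize_histogram(f) for f in frames])
    feats, _, _ = eigenface_features(frames, n_components=8)
    print(audio.shape, feats.shape)
```

In a pipeline like the one described, the PCA basis would typically be fitted once on training crops and then reused at inference time, so that visual features stay comparable across recordings.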
This work (including HTML and PDF files) is licensed under a Creative Commons Attribution 4.0 International License.