Lip-Reading with Visual Form Classification using Residual Networks and Bidirectional Gated Recurrent Units
Doi: 10.28991/HIJ-2023-04-02-010
Copyright (c) 2023 Anni, Suharjito