Hybrid CNN-LSTM-Based Multimodal Framework for Dolphin Activity Recognition Using Visual and Acoustic Cues
Dolphin behavior research is crucial to the advancement of marine ecology, wildlife management, and conservation. Conventional methods of observing dolphin behavior (e.g., visual tagging and tracking) can be invasive, time-consuming, and limited by environmental constraints such as visibility and weather. This study develops a hybrid deep learning framework that uses both visual and acoustic data to identify dolphin behaviors automatically and accurately in natural underwater environments. The framework uses Convolutional Neural Networks (CNNs) to learn spatial features from images of dorsal fins, combined with Long Short-Term Memory (LSTM) networks trained on Mel-Frequency Cepstral Coefficients (MFCCs) to learn temporal changes in dolphin vocalizations. Two publicly available datasets were used: the Risso’s Dolphin Dataset for image data and the Dolphins Underwater Sounds Database for acoustic data. Behavioral labels were matched between the two modalities to support robust training. The model achieved an overall classification accuracy of 94.2%, significantly outperforming traditional machine learning classifiers such as SVM, Random Forest, and k-NN. A detailed evaluation using a confusion matrix and per-class performance metrics showed high precision and recall across the behavioral classes. The model performed best on silence and whistles, and showed minor confusion between burst pulses and clicks because of their spectral similarity. These results indicate that integrating spatial and temporal modalities improves the system’s ability to recognize complex behaviors, offering a scalable, non-invasive, and efficient solution for real-time monitoring of marine mammals. The proposed hybrid framework contributes toward the development of intelligent, ethical, and automated marine observation systems.
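The per-class evaluation mentioned above can be illustrated with a minimal sketch that derives precision, recall, and overall accuracy from a confusion matrix. The class names follow the abstract, but the counts in `CONFUSION` are made-up illustration values, not results from the study, and the helper names `per_class_metrics` and `overall_accuracy` are hypothetical.

```python
# Rows are the true class, columns are the predicted class.
# ASSUMPTION: these counts are invented for illustration only;
# they are not the confusion matrix reported by the study.
CLASSES = ["silence", "whistle", "burst_pulse", "click"]
CONFUSION = [
    [50, 0, 0, 0],
    [0, 48, 1, 1],
    [0, 1, 42, 7],
    [0, 1, 6, 43],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall)} for a square confusion matrix."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics[k] = (precision, recall)
    return metrics


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for k, (p, r) in per_class_metrics(CONFUSION).items():
        print(f"{CLASSES[k]:12s} precision={p:.3f} recall={r:.3f}")
    print(f"accuracy={overall_accuracy(CONFUSION):.3f}")
```

In this toy matrix the off-diagonal mass sits mostly between `burst_pulse` and `click`, which is the kind of pattern the abstract attributes to spectral similarity; precision and recall for those two classes drop accordingly, while `silence` stays perfect.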
This work (including HTML and PDF files) is licensed under a Creative Commons Attribution 4.0 International License.





















