Hybrid CNN-LSTM-Based Multimodal Framework for Dolphin Activity Recognition Using Visual and Acoustic Cues

Dolphin Activity Recognition, Multimodal Deep Learning, CNN-LSTM Hybrid Model, Underwater Acoustic Analysis, Marine Mammal Monitoring, Good Health and Well-Being, Process Innovation

Vol. 7 No. 2 (2026): June
Research Articles

Dolphin behavior research is crucial to marine ecology, wildlife management, and conservation. Conventional methods for observing dolphin behavior (e.g., visual tagging and tracking) can be invasive and time-consuming, and they are limited by environmental conditions such as visibility and weather. This study develops a hybrid deep learning framework that uses both visual and acoustic data to identify dolphin behaviors automatically and accurately in natural underwater environments. The framework uses Convolutional Neural Networks (CNNs) to learn spatial features from dorsal fin images, combined with Long Short-Term Memory (LSTM) networks trained on Mel-Frequency Cepstral Coefficients (MFCCs) to learn temporal patterns in dolphin vocalizations. Two publicly available datasets were used: the Risso’s Dolphin Dataset for image data and the Dolphins Underwater Sounds Dataset for acoustic data. Behavioral labels were matched across the two modalities to support robust training. The model achieved an overall classification accuracy of 94.2%, outperforming traditional machine learning classifiers such as SVM, Random Forest, and k-NN. A detailed evaluation using a confusion matrix and per-class performance metrics showed high precision and recall across the behavioral classes. Performance was strongest for silence and whistles, while burst pulses and clicks were occasionally confused with each other because of their spectral similarity. These results indicate that integrating spatial and temporal modalities improves the system’s ability to recognize complex behaviors, offering a scalable, non-invasive, and efficient approach to real-time monitoring of marine mammals. The proposed hybrid framework contributes toward the development of intelligent, ethical, and automated marine observation systems.
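The per-class evaluation mentioned above can be illustrated with a minimal Python sketch that derives precision, recall, and F1 from a confusion matrix. The four class names follow the abstract; the function names (`per_class_metrics`, `overall_accuracy`) are hypothetical, and the counts in `TOY_CONFUSION` are invented purely for illustration. They are not the study's results and are not meant to reproduce the reported 94.2% accuracy.

```python
from typing import Dict, List

# Class names follow the abstract. The counts are invented toy numbers,
# NOT results from the study.
CLASSES = ["silence", "whistle", "burst_pulse", "click"]

# Rows are true classes, columns are predicted classes.
TOY_CONFUSION = [
    [50, 0, 0, 0],
    [0, 48, 1, 1],
    [0, 1, 40, 9],
    [0, 1, 8, 41],
]


def per_class_metrics(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for each class."""
    n = len(labels)
    metrics: Dict[str, Dict[str, float]] = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_i = sum(matrix[r][i] for r in range(n))  # column sum
        actually_i = sum(matrix[i])                           # row sum
        precision = true_positives / predicted_as_i if predicted_as_i else 0.0
        recall = true_positives / actually_i if actually_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    for name, m in per_class_metrics(TOY_CONFUSION, CLASSES).items():
        print(f"{name:12s} P={m['precision']:.3f} "
              f"R={m['recall']:.3f} F1={m['f1']:.3f}")
    print(f"accuracy={overall_accuracy(TOY_CONFUSION):.3f}")
```

In this toy matrix most of the off-diagonal mass sits between `burst_pulse` and `click`, which mirrors the kind of confusion the abstract attributes to spectral similarity; the same functions would apply unchanged to a real confusion matrix.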