Biodiv Sci ›› 2023, Vol. 31 ›› Issue (7): 23087.  DOI: 10.17520/biods.2023087

• Technology and Methodology •

A deep feature fusion-based method for bird sound recognition and its interpretability analysis

Jianmin Cai1, Peiyu He1, Zhipeng Yang2, Luying Li2, Qijun Zhao3, Fan Pan1,*

  1. School of Electronic Information, Sichuan University, Chengdu 610065
    2. College of Electronic Engineering, Chengdu University of Information Technology, Chengdu 610225
    3. School of Computer Science (School of Software), Sichuan University, Chengdu 610065
  • Received:2023-03-08 Accepted:2023-06-14 Online:2023-07-20 Published:2023-06-29
  • Contact: *E-mail:


Background: Bird sound recognition is a crucial tool for ecological monitoring. However, current research still faces low recognition accuracy on complex datasets and a lack of robustness. Moreover, interpretability analysis of deep learning models is largely absent from the existing research.

Methods: First, we used deep feature extraction networks to extract deep features from the logarithmic Mel-spectrogram of bird sounds and from a supplementary acoustic feature set. These two types of deep features were then fused and fed into a light gradient boosting machine (LightGBM) classifier for classification. Class activation maps were applied to the deep learning models for interpretability analysis, to understand how the models recognize bird sounds.
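The feature-level fusion described above can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the two branch functions, the random projections, and the toy data are all hypothetical stand-ins for the trained spectrogram and supplementary-feature networks, and a nearest-centroid rule stands in for the LightGBM classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two deep feature branches: one would
# operate on log-Mel spectrograms, the other on the supplementary
# acoustic feature set. Here each is a fixed random projection with a
# tanh nonlinearity; in the paper they are trained networks.
def spectrogram_branch(x, proj):
    return np.tanh(x @ proj)          # (n_samples, 128) deep features

def supplementary_branch(x, proj):
    return np.tanh(x @ proj)          # (n_samples, 64) deep features

# Toy data: 3 "species", 30 clips each, with separable class means.
n_classes, n_per = 3, 30
means = rng.normal(0, 3, size=(n_classes, 40))
X = np.vstack([means[c] + rng.normal(0, 1, size=(n_per, 40))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per)

P1 = rng.normal(size=(40, 128))
P2 = rng.normal(size=(40, 64))

# Feature-level fusion: concatenate the two deep feature vectors before
# classification (LightGBM in the paper; nearest-centroid here keeps
# the sketch self-contained).
fused = np.concatenate([spectrogram_branch(X, P1),
                        supplementary_branch(X, P2)], axis=1)

centroids = np.stack([fused[y == c].mean(axis=0) for c in range(n_classes)])
pred = np.argmin(((fused[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
acc = (pred == y).mean()
print(f"fused dim = {fused.shape[1]}, toy accuracy = {acc:.2f}")
```

The design point is that fusion happens at the feature level (concatenation of the two deep feature vectors) before a single downstream classifier is trained, rather than combining the decisions of two separate classifiers.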

Results: The experimental results demonstrated that the proposed method achieved state-of-the-art results on the Beijing Bird Dataset, with an average accuracy of 98.70% and an average F1 score of 98.84%. Compared to traditional methods, the deep fusion features improved bird sound recognition accuracy by at least 5.62%. Additionally, the introduction of the LightGBM classifier contributed a 3.02% improvement in classification accuracy. Furthermore, the proposed method exhibited outstanding performance on the CLO-43SD and BirdCLEF2022 competition datasets, achieving average accuracies of 98.32% and 91.12%, respectively. The class activation maps revealed disparities in the attentional regions of the neural network across bird sound types.
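The class activation maps referenced above can be sketched in a few lines. This is an illustrative computation on synthetic arrays, assuming the standard CAM formulation (class-weighted sum of final convolutional feature maps, followed by ReLU and normalization); the feature maps and weights here are random placeholders, whereas in the paper they come from the trained network.

```python
import numpy as np

rng = np.random.default_rng(1)

# A: final conv feature maps (C channels over an H x W time-frequency
# grid of the spectrogram); w: classifier weights for one bird class.
# Both are synthetic placeholders in this sketch.
C, H, W = 8, 6, 10
A = rng.random((C, H, W))
w = rng.normal(size=C)

cam = np.maximum(np.tensordot(w, A, axes=1), 0.0)  # weighted sum + ReLU
cam /= cam.max() + 1e-8                            # normalize to [0, 1]

# High values mark the time-frequency regions the model attends to
# when predicting this class.
print(cam.shape)
```

Upsampling `cam` to the spectrogram's resolution and overlaying it gives the attention heatmaps used to compare attentional regions across bird sound types.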

Conclusion: The proposed method effectively improves the accuracy of bird sound recognition and performs well on all three datasets, offering strong technical support for ecological monitoring based on bird sound recognition. The interpretability analysis provides a theoretical foundation for subsequent feature selection and model optimization.

Key words: bird sound recognition, feature fusion, interpretability analysis, deep learning, LightGBM