Biodiversity Science ›› 2023, Vol. 31 ›› Issue (7): 23087.  DOI: 10.17520/biods.2023087

• Technology and Methods •


A deep feature fusion-based method for bird sound recognition and its interpretability analysis

Jianmin Cai1, Peiyu He1, Zhipeng Yang2, Luying Li2, Qijun Zhao3, Fan Pan1,*()   

  1. School of Electronic Information, Sichuan University, Chengdu 610065
    2. College of Electronic Engineering, Chengdu University of Information Technology, Chengdu 610225
    3. School of Computer Science (School of Software), Sichuan University, Chengdu 610065
  • Received: 2023-03-08   Accepted: 2023-06-14   Online: 2023-07-20   Published: 2023-06-29
  • Contact: *E-mail: panfan@scu.edu.cn
  • Supported by: National Natural Science Foundation of China (62066042); Fundamental Research Funds for the Central Universities (2022SCU12008); Key Research and Development Project of Sichuan Province (2022YFG0045)

Abstract

Background: Bird sound recognition is a crucial tool for ecological monitoring. However, current methods still suffer from low recognition accuracy on complex datasets and a lack of robustness. Moreover, interpretability analyses of deep learning models are largely absent from existing research.

Methods: We first used a deep feature extraction network to extract deep features from the log-Mel spectrograms of bird sounds and from a supplementary acoustic feature set. The two types of deep features were then fused and fed into a light gradient boosting machine (LightGBM) classifier. By separating feature extraction from classification, the method combines the feature extraction capability of deep neural networks with the classification performance of LightGBM. Class activation maps were then applied to the deep learning models to analyze how they recognize bird sounds.
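As a concrete illustration, the following is a minimal Python sketch of such a pipeline, assuming librosa for audio processing, a pretrained torchvision ResNet-18 standing in for the paper's (unspecified here) feature extraction network, and the lightgbm package. The supplementary feature set (MFCC means here), sample rate, and all hyperparameters are illustrative assumptions, and for brevity the supplementary features are fused directly rather than passed through a second deep network.

import librosa
import numpy as np
import torch
import torchvision.models as models
import lightgbm as lgb

# Pretrained CNN with its classification head removed, used purely as a
# fixed deep-feature extractor (a stand-in for the paper's network).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def deep_features(spec):
    """Map a (freq, time) spectrogram to a 512-d deep feature vector."""
    x = torch.tensor(spec, dtype=torch.float32)
    x = x.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # 1 x 3 x H x W
    with torch.no_grad():
        return backbone(x).squeeze(0).numpy()

def fused_features(path, sr=32000):
    """Fuse log-Mel deep features with a supplementary feature set (MFCCs)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), ref=np.max)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)
    return np.concatenate([deep_features(mel), mfcc])

# Feature extraction and classification are decoupled: the fused vectors
# are handed to a separately trained LightGBM classifier.
# X = np.stack([fused_features(p) for p in wav_paths]); y = species_labels
# clf = lgb.LGBMClassifier(n_estimators=500).fit(X, y)

Decoupling the two stages in this way means the CNN extractor can be reused unchanged while only the comparatively cheap LightGBM stage is retrained as datasets or species lists change.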

Results: The experimental results demonstrated that the proposed method achieved state-of-the-art results on the Beijing Bird Dataset, with an average accuracy of 98.70% and an average F1 score of 98.84%. Compared to traditional methods, the deep fusion features improved bird sound recognition accuracy by at least 5.62%, and the introduction of the LightGBM classifier contributed a further 3.02% improvement in classification accuracy. The proposed method also performed strongly on the CLO-43SD and BirdCLEF2022 competition datasets, achieving average accuracies of 98.32% and 91.12%, respectively. The class activation maps revealed disparities in the neural network's attention regions across the different types of bird sounds.
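To show how such attention maps can be generated, here is a hedged Grad-CAM sketch (one common class-activation-map technique; the exact CAM variant and backbone used in the paper may differ). It hooks the last convolutional stage of a ResNet-18 and projects class-specific gradients back onto the input spectrogram.

import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

acts, grads = {}, {}
layer = model.layer4  # last conv stage: a coarse spatial map over the spectrogram

layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(spec_batch, class_idx):
    """spec_batch: (1, 3, H, W) spectrogram image; returns an (H, W) heat map."""
    logits = model(spec_batch)
    model.zero_grad()
    logits[0, class_idx].backward()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)         # channel importance
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=spec_batch.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()

Overlaying the resulting heat map on the log-Mel spectrogram highlights which time-frequency regions drove the prediction, which is how attention differences between bird sound types can be compared.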

Conclusion: The proposed method effectively improves the accuracy of bird sound recognition and performs well on all three datasets, offering strong technical support for ecological monitoring based on bird sound recognition. The interpretability analysis provides a theoretical basis for subsequent feature selection and model optimization.

Key words: bird sound recognition, feature fusion, interpretability analysis, deep learning, LightGBM