A deep feature fusion-based method for bird sound recognition and its interpretability analysis

doi:10.17520/biods.2023087

Abstract

Background: Bird sound recognition is a crucial tool for ecological monitoring. However, current methods still suffer from low recognition rates on complex datasets and a lack of robustness. Moreover, existing research rarely analyzes the interpretability of the deep learning models it uses.

Methods: We first used a deep feature extraction network to extract deep features from both the logarithmic Mel-spectrogram of bird sound and a supplementary feature set. These two types of deep features were then fused and fed into a light gradient boosting machine (lightGBM) classifier for classification. Class activation maps were then applied to the deep learning models to analyze how they recognize bird sound.
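The fusion step can be illustrated with a minimal sketch. The abstract does not state how the two deep feature types are fused, so simple per-clip concatenation is assumed here; the function names `fuse_features` and `fuse_batch` and the toy vectors are hypothetical, and the feature extraction and lightGBM stages are omitted.

```python
from typing import List, Sequence


def fuse_features(spec_feats: Sequence[float], supp_feats: Sequence[float]) -> List[float]:
    """Fuse two deep feature vectors of one clip by concatenation (assumed rule)."""
    return list(spec_feats) + list(supp_feats)


def fuse_batch(
    spec_batch: Sequence[Sequence[float]],
    supp_batch: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Fuse the features of several clips; both branches must cover the same clips."""
    if len(spec_batch) != len(supp_batch):
        raise ValueError("both feature branches must describe the same clips")
    return [fuse_features(s, p) for s, p in zip(spec_batch, supp_batch)]


# Hypothetical deep features for two clips: 3 values from the
# log-Mel branch and 2 values from the supplementary branch.
spec_batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
supp_batch = [[1.0, 2.0], [3.0, 4.0]]

fused = fuse_batch(spec_batch, supp_batch)
```

Each fused row would then serve as one input sample for the classifier.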

Results: The proposed method achieved state-of-the-art results on the Beijing Bird Dataset, with an average accuracy of 98.70% and an average F₁ score of 98.84%. Compared to traditional methods, the deep fusion features improved bird sound recognition accuracy by at least 5.62%. Additionally, introducing the lightGBM classifier contributed a 3.02% improvement in classification accuracy. Furthermore, the proposed method performed well on the CLO-43SD and BirdCLEF2022 competition datasets, achieving average accuracies of 98.32% and 91.12%, respectively. The class activation maps revealed that the neural network attends to different regions for each bird sound type.
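The abstract does not say which class activation map variant was used. The reference list includes Grad-CAM (Selvaraju et al, 2017), so, assuming that variant, the attention map for class $c$ would be computed from the feature maps $A^k$ of a convolutional layer and the class score $y^c$ as follows, where $Z$ is the number of spatial positions in each feature map:

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
```

Under this assumption, regions of the spectrogram with large values of $L_{\text{Grad-CAM}}^c$ are those that most increase the score of class $c$, which is how differences in attended regions across bird sound types could be read off.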

Conclusion: The proposed method effectively improves the accuracy of bird sound recognition and performs well on three datasets, offering strong technical support for ecological monitoring based on bird sound recognition. The interpretability analysis provides a theoretical foundation for subsequent work on feature selection and model optimization.

Key words: bird sound recognition, feature fusion, interpretability analysis, deep learning, lightGBM

Jianmin Cai, Peiyu He, Zhipeng Yang, Luying Li, Qijun Zhao, Fan Pan. A deep feature fusion-based method for bird sound recognition and its interpretability analysis[J]. Biodiv Sci, 2023, 31(7): 23087.

URL: https://www.biodiversity-science.net/EN/10.17520/biods.2023087

https://www.biodiversity-science.net/EN/Y2023/V31/I7/23087

References

[1]	Adavanne S, Drossos K, Cakir E, Virtanen T (2017) Stacked convolutional and recurrent neural networks for bird audio detection. In: Proceedings of the 25th European Signal Processing Conference (EUSIPCO), Greek Island, Greece.
[2]	Bold N, Zhang C, Akashi T (2019) Cross-domain deep feature combination for bird species classification with audio-visual data. IEICE Transactions on Information and Systems, 102, 2033-2042.
[3]	Brandes TS (2008) Automated sound recording and analysis techniques for bird surveys and conservation. Bird Conservation International, 18, S163-S173.
[4]	Dan S (2022) Computational bioacoustics with deep learning: A review and roadmap. PeerJ, 10, e13152.
[5]	Eyben F, Scherer KR, Schuller BW, Sundberg J, André E, Busso C, Devillers LY, Epps J, Laukka P, Narayanan SS, Truong KP (2015) The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7, 190-202.
[6]	Eyben F, Wöllmer M, Schuller B (2010) openSMILE: The Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (eds del Bimbo A, Chang SF), pp. 1459-1462. Association for Computing Machinery, New York.
[7]	Gupta G, Kshirsagar M, Zhong M, Gholami S, Ferres JL (2021) Comparing recurrent convolutional neural networks for large scale bird species classification. Scientific Reports, 11, 17085.
[8]	Ji XS, Jiang K, Xie J (2022) Deep feature fusion of multi-dimensional neural network for bird call recognition. Journal of Signal Processing, 38, 844-853. (in Chinese with English abstract)
	[吉训生, 江昆, 谢捷 (2022) 基于多维神经网络深度特征融合的鸟鸣识别算法. 信号处理, 38, 844-853.]
[9]	Ke GL, Meng Q, Finley T, Wang TF, Chen W, Ma WD, Ye QW, Liu TY (2017) LightGBM: A highly efficient gradient boosting decision tree. In: NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems (eds von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R), pp. 3149-3157. Curran Associates Inc., New York.
[10]	Li HC, Yang DW, Wen ZF, Wang YN, Chen AB (2022) Inception-CSA deep learning model-based classification of bird sounds. Journal of Huazhong Agricultural University, 42(3), 97-104. (in Chinese with English abstract)
	[李怀城, 杨道武, 温治芳, 王亚楠, 陈爱斌 (2022) 基于Inception- CSA深度学习模型的鸟鸣分类. 华中农业大学学报, 42(3), 97-104.]
[11]	Salamon J, Bello JP, Farnsworth A, Robbins M, Keen S, Klinck H, Kelling S (2016) Towards the automatic classification of avian flight calls for bioacoustic monitoring. PLoS ONE, 11, e0166866.
[12]	Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV) (ed. O'Conner L), pp. 618-626. IEEE Computer Society Customer Service Center, California.
[13]	Sprengel E, Jaggi M, Kilcher Y, Hofmann T (2016) Audio based bird species identification using deep learning techniques. In: Conference and Labs of the Evaluation Forum (CLEF) 2016, pp. 547-559. Évora, Portugal.
[14]	Stowell D (2022) Computational bioacoustics with deep learning: A review and roadmap. PeerJ, 10, e13152.
[15]	Tan MX, Le QV (2021) EfficientNetV2: Smaller models and faster training. arXiv e-prints, doi: 10.48550/arXiv.2104.00298.
[16]	Xie J, Zhu MY (2019) Handcrafted features and late fusion with deep learning for bird sound classification. Ecological Informatics, 52, 74-81.
[17]	Yan N, Chen AB, Zhou GX, Zhang ZQ, Liu XY, Wang JW, Liu ZH, Chen WJ (2021) Birdsong classification based on multi-feature fusion. Multimedia Tools and Applications, 80, 36529-36547.
[18]	Zhang FY, Zhang LY, Chen HX, Xie JJ (2021) Bird species identification using spectrogram based on multi-channel fusion of DCNNs. Entropy, 23, 1507.

Model	Parameter name	Parameter value
Deep feature extraction network	Optimizer	Adam
	Learning rate	0.01
	Epochs	200
	Batch size	16
	Loss function	Categorical cross-entropy
Classifier	Learning rate	0.01
	Boosting method	GBDT
	Max depth	4
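As a rough illustration, the hyperparameters in the table above could be written down as Python configuration dictionaries. The classifier keys follow LightGBM's documented parameter names (`boosting_type`, `learning_rate`, `max_depth`); the keys of the feature-extractor dictionary and the Keras-style loss identifier are assumptions, since the page does not name the training framework.

```python
# Training settings for the deep feature extraction network.
# Key names and the loss identifier are hypothetical (Keras-style).
FEATURE_EXTRACTOR_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "epochs": 200,
    "batch_size": 16,
    "loss": "categorical_crossentropy",
}

# Classifier settings, using LightGBM parameter names.
LIGHTGBM_PARAMS = {
    "boosting_type": "gbdt",  # gradient boosting decision tree
    "learning_rate": 0.01,
    "max_depth": 4,
}
```

Under these assumptions, the second dictionary could be unpacked into the scikit-learn style estimator as `lightgbm.LGBMClassifier(**LIGHTGBM_PARAMS)`; all parameters not listed in the table would keep their library defaults.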

Model	Average accuracy (%)	Average F₁-score (%)
logMel + mobileNetV3	90.20	90.11
logMel + EfficientNetV2	93.08	93.20
logMelMAPS + EfficientNetV2	95.69	95.73
EGeMAPS + EfficientNetV2	77.41	77.45
LogEGeMAPS + EfficientNetV2	91.56	91.42
Deep logEGeMAPS + lightGBM	97.13	97.12
Deep logMelMAPS + lightGBM	98.71	98.69
Deep fusion features (mobileNetV3) + lightGBM	97.77	97.76
Deep fusion features + SVM	98.83	98.82
Deep fusion features + Random Forest	98.82	98.81
Deep fusion features + XGBoost	98.64	98.63
Deep fusion features + lightGBM	98.70	98.82
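The two metrics in the table can be sketched in plain Python. The page does not specify how the average F₁-score is formed, so macro-averaging over classes is assumed here; the function names and the toy labels are hypothetical.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of clips whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = set(y_true) | set(y_pred)
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three hypothetical species labels.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Because macro-averaging weights every class equally, accuracy and the averaged F₁-score can differ slightly, as they do in several rows of the table.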

Model	Average accuracy (%)	Average F₁-score (%)	Reference
GWO-KELM	91.16	88.54	Li DP, 2022①
LogMel + CRNN	92.89	89.64	Adavanne et al, 2017
LogMel + CNN	91.12	88.47	Bold et al, 2019
logMel + DSRN + DilatedSAM + BiLSTM	96.58	96.51	Li DP, 2022①
Deep fusion features + lightGBM	98.70	98.82	This study
① Li DP (2022) Research on Bird Sound Recognition Algorithms in Natural Scenes. Master's thesis, Nanjing University of Information Science and Technology, Nanjing. (in Chinese) [李大鹏 (2022) 自然场景下鸟鸣声识别算法研究. 硕士学位论文, 南京信息工程大学, 南京.]