Rare bird recognition method in Beijing based on TC-YOLO model

doi:10.17520/biods.2024056

Abstract

Aim: Bird recognition is an important tool for bird conservation. Traditional bird recognition relies mainly on manual work, which is costly, demands a high level of professional expertise, and has inherent limitations. With the development of artificial intelligence, automatic bird identification using deep learning has become an important means of bird survey and protection. However, real-world bird images have complex backgrounds, and birds of closely related families look similar, which leads to poor recognition accuracy.

Methods: To address these problems, this paper proposes a bird recognition method based on the TC-YOLO model. First, to reduce missed detections caused by complex backgrounds, the method incorporates CARAFE (content-aware reassembly of features), which adaptively generates an upsampling kernel for each feature point and aggregates contextual semantic information within a larger receptive field. This lets the model focus on the distribution of bird regions in the global feature map and improves the ability of the upsampling step to preserve bird features, so that bird targets are recognized accurately. Second, to reduce false detections caused by similar appearances, the method introduces TSCODE (task-specific context decoupling) to decouple the localization and classification tasks: information from multi-level feature maps is used to regress the target boundary, while features containing both low-level texture and higher-level semantics are used for species classification, which in turn improves the model's recognition accuracy.

Results: Comparative experiments were carried out to verify the performance of the model. On the self-built Beijing-28 dataset, which contains 28 species of national first-class protected birds in Beijing, and on the public bird dataset CUB200-2011, the TC-YOLO model reached a mean average precision (mAP@0.5:0.95) of 78.7% and 75.3%, respectively, both higher than the comparison methods. In addition, to verify generalization to other kinds of data, experiments were carried out on the public MS COCO dataset, where the TC-YOLO model also outperformed the comparison models, indicating strong generalization.

Conclusion: The proposed TC-YOLO model can effectively recognize birds in images with complex backgrounds or similar-looking species, with low missed-detection and false-detection rates and strong generalization. It can provide important technical support for bird conservation and thus has practical value for biodiversity conservation in Beijing.

Key words: rare birds, image recognition, YOLOv5s, upsampling, decoupling head

Baican Li, Junguo Zhang, Changchun Zhang, Lifeng Wang, Jiliang Xu, Li Liu. Rare bird recognition method in Beijing based on TC-YOLO model[J]. Biodiv Sci, 2024, 32(5): 24056.

URL: https://www.biodiversity-science.net/EN/10.17520/biods.2024056

https://www.biodiversity-science.net/EN/Y2024/V32/I5/24056

Figures/Tables 15

Fig. 1 Partial examples of the national first-class protected bird dataset in Beijing

Fig. 2 Example of generated images based on Cycle-GAN method

Fig. 3 TC-YOLO model structure. Conv and Conv2d, Convolution; C3, Feature extraction; SPPF, Spatial pyramid pooling (fast) structure; Concat and c, Feature fusion; CARAFE, CARAFE upsampling; C2, The feature map of layer 2 feature extraction; $P_l$, The feature map of the $l$-th pyramid level; Head, TSCODE decoupled head; NMS, Non-maximum suppression.

Fig. 4 CARAFE model structure. $\mathcal{X}$, Input feature map; $C$, The number of channels of the input feature map; $H$, The height of the input feature map; $W$, The width of the input feature map; $C_m$, Number of channels after channel compression; $\sigma$, Upsample ratio; $\sigma^{2} \times k_{up}^{2}$, Number of output channels after content encoding; $k_{up}^{2}$, Predicted upsampling kernel size; $l$, Source location; $l'$, Target location; $N\left(\mathcal{X}_{l}, k_{up}\right)$, A square area centered on the source position; $\mathcal{W}_{l'}$, Reassembly upsampling kernel; $\mathcal{X}^{\prime}$, Output feature map; $\sigma H$, The height of the output feature map; $\sigma W$, The width of the output feature map.
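
The reassembly step named in this caption can be sketched in plain Python. The sketch is illustrative only and is not the authors' implementation: it handles a single channel, uses zero padding at the borders (an assumption), and takes the raw kernel scores as an input instead of predicting them with the channel-compression and content-encoding layers. The names `carafe_reassemble`, `raw_kernels`, and `softmax` are hypothetical.

```python
import math


def softmax(values):
    """Normalize raw kernel scores so the weights sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(x, raw_kernels, sigma, k_up):
    """Upsample a single-channel H x W map `x` by the ratio `sigma`.

    raw_kernels[i2][j2] holds the k_up * k_up unnormalized scores for the
    target location l' = (i2, j2). In CARAFE these scores are predicted from
    the input features; here the caller supplies them (assumption).
    """
    h, w = len(x), len(x[0])
    r = k_up // 2
    out = [[0.0] * (sigma * w) for _ in range(sigma * h)]
    for i2 in range(sigma * h):
        for j2 in range(sigma * w):
            # Source location l that the target location l' maps back to.
            i, j = i2 // sigma, j2 // sigma
            weights = softmax(raw_kernels[i2][j2])
            acc = 0.0
            idx = 0
            # Weighted sum over the k_up x k_up square centered on l.
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    ii, jj = i + n, j + m
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding outside
                        acc += weights[idx] * x[ii][jj]
                    idx += 1
            out[i2][j2] = acc
    return out


# Toy input: a 2 x 2 map upsampled by sigma = 2 with 3 x 3 kernels.
x = [[1.0, 2.0], [3.0, 4.0]]
peaked = [-50.0] * 9
peaked[4] = 50.0  # almost all weight on the kernel center
kernels = [[peaked for _ in range(4)] for _ in range(4)]
up = carafe_reassemble(x, kernels, sigma=2, k_up=3)
```

With kernels sharply peaked at the center, the result reduces to nearest-neighbor upsampling, which makes a quick sanity check. Content-dependent kernels would instead blend each $k_{up} \times k_{up}$ neighborhood differently at every target location.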

Fig. 5 Detail-preserving encoding structure. $P_l$, The feature map of the $l$-th pyramid level; $G_l^{loc}$, The feature map for localization; $H$, The height of the feature map; $W$, The width of the feature map; $f_{loc}(\cdot)$, The feature projection function for localization; $\mathcal{R}(\cdot)$, The final layer for localization.

Fig. 6 Semantic context encoding structure. $P_l$, The feature map of the $l$-th pyramid level; $G_l^{cls}$, The feature map for classification; $H$, The height of the feature map; $W$, The width of the feature map; $C$, The number of channels of the feature map; $f_{cls}(\cdot)$, The feature projection function for classification; $\mathcal{C}(\cdot)$, The final layer for classification.

Table 1 Comparison of experimental results on the Beijing-28 dataset among different models

Algorithm model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Parameters (×10⁶)	Frames per second (frames/s)	References
TC-YOLO	90.6	78.7	28.89	56	This study
Faster R-CNN	67.9	45.2	28.56	16	Ren et al, 2016
SSD	85.3	68.3	26.29	47	Liu et al, 2016
YOLOv3-tiny	80.9	54.8	8.73	86	Adarsh et al, 2020
YOLOv4-tiny	75.1	48.6	6.06	70	Wang et al, 2021
YOLOv5s	90.2	76.3	7.09	85	Jocher, 2022
YOLOv6n	76.6	67.6	4.64	67	Li et al, 2022
YOLOv7-tiny	81.4	66.1	6.09	64	Wang CY et al, 2023
YOLOv8n	84.9	74.5	3.01	67	Jocher, 2023
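
The two mAP columns differ only in the IoU threshold used to decide whether a predicted box matches a ground-truth box. The sketch below assumes the usual COCO-style convention, in which mAP@0.5 uses a single threshold of 0.5 and mAP@0.5:0.95 averages AP over ten thresholds from 0.50 to 0.95 in steps of 0.05; the excerpt itself does not spell this out. The function names and the toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Thresholds 0.50, 0.55, ..., 0.95 (COCO-style convention, assumed here).
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def thresholds_passed(pred, truth):
    """IoU thresholds at which `pred` still counts as a match for `truth`."""
    overlap = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


def map_50_95(ap_per_threshold):
    """Average the per-threshold AP values into a single mAP@0.5:0.95 score."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)


# Toy example: a prediction that slightly undershoots the ground-truth box.
truth = (0.0, 0.0, 10.0, 10.0)
pred = (1.0, 1.0, 10.0, 10.0)
overlap = iou(pred, truth)               # 81 / 100 = 0.81
passed = thresholds_passed(pred, truth)  # matches up to the 0.80 threshold
```

A box with IoU 0.81 is a hit at 0.5 but a miss at 0.85 and above, which is why mAP@0.5:0.95 is the stricter of the two columns and rewards tighter localization.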

Fig. 7 Comparison of experimental results on the CUB200-2011 dataset among different models

Table 2 Comparison of experimental results on the Beijing-28 dataset among different improved YOLOv5s methods

Methods	Model composition	mAP@0.5 (%)	mAP@0.5:0.95 (%)	References
TC-YOLO	YOLOv5s + CARAFE + TSCODE	90.6	78.7	This study
Method 1	YOLOv5s	90.2	76.3	Jocher, 2022
Method 2	YOLOv5s + CBAM	89.9	76.2	Xue et al, 2022
Method 3	YOLOv5s + SA	89.8	75.2	Hao et al, 2023
Method 4	YOLOv5s + CA	89.3	74.4	Zhang et al, 2023
Method 5	YOLOv5s + WIoUv3	89.9	76.1	Zhao et al, 2023

Fig. 8 Comparison of experimental results on the CUB200-2011 dataset among different improved YOLOv5s methods

Fig. 9 Comparison of YOLOv5s recognition performance before and after improvement (missed detections)

Fig. 10 Comparison of YOLOv5s recognition performance before and after improvement (false detections)

Table 3 Ablation experimental results of TC-YOLO model

Datasets	Algorithm model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
Beijing-28	YOLOv5s	86.8	87.4	90.2	76.3
	C-YOLO	87.5	88.3	90.3	77.3
	T-YOLO	87.3	88.1	90.2	77.4
	TC-YOLO	89.4	89.5	90.6	78.7
CUB200-2011	YOLOv5s	81.7	82.2	85.4	72.8
	C-YOLO	81.9	82.5	85.6	73.5
	T-YOLO	84.7	82.0	85.4	74.7
	TC-YOLO	85.1	82.5	85.5	75.3
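
The ablation is easiest to read as gains over the YOLOv5s baseline. The short script below recomputes those gains for the mAP@0.5:0.95 column, using only values copied from Table 3. The helper name `gains_over_baseline` is hypothetical, and reading C-YOLO and T-YOLO as the CARAFE-only and TSCODE-only variants is an inference from the naming, not something this excerpt states.

```python
# mAP@0.5:0.95 (%) values copied from Table 3.
MAP_50_95 = {
    "Beijing-28": {"YOLOv5s": 76.3, "C-YOLO": 77.3, "T-YOLO": 77.4, "TC-YOLO": 78.7},
    "CUB200-2011": {"YOLOv5s": 72.8, "C-YOLO": 73.5, "T-YOLO": 74.7, "TC-YOLO": 75.3},
}


def gains_over_baseline(scores, baseline="YOLOv5s"):
    """Percentage-point gain of each variant over the baseline model."""
    base = scores[baseline]
    return {
        name: round(value - base, 1)
        for name, value in scores.items()
        if name != baseline
    }


gains = {dataset: gains_over_baseline(scores) for dataset, scores in MAP_50_95.items()}

for dataset, per_model in gains.items():
    for name, gain in per_model.items():
        print(f"{dataset}: {name} +{gain} points")
```

On both datasets the combined model gains more than either single-module variant: 2.4 points on Beijing-28 and 2.5 points on CUB200-2011.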

Fig. 11 Comparison of ablation study results (low-quality images)

Table 4 Comparison of experimental results on the MS COCO dataset among different models (* indicates results reported in the corresponding original paper)

Algorithm model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	References
TC-YOLO	61.0	42.1	This study
Faster R-CNN*	59.2	39.8	Ren et al, 2016
SSD*	43.1	25.1	Liu et al, 2016
YOLOv4-tiny*	42.1	24.9	Wang et al, 2021
YOLOv5s*	56.8	37.4	Jocher, 2022
YOLOv6n*	52.7	37.0	Li et al, 2022
YOLOv7-tiny*	52.8	35.2	Wang CY et al, 2023
YOLOv8n*	52.6	37.3	Jocher, 2023

References 47

[1]	Adarsh P, Rathi P, Kumar M (2020) YOLO v3-Tiny: Object detection and recognition using one stage improved model. In: 2020 International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 687-694. IEEE, Coimbatore.
[2]	Alqaysi H, Fedorov I, Qureshi FZ, O’Nils M (2021) A temporal boosted YOLO-based model for birds detection around wind farms. Journal of Imaging, 7, 227.
[3]	Cai JM, He PY, Yang ZP, Li LY, Zhao QJ, Pan F (2023) A deep feature fusion-based method for bird sound recognition and its interpretability analysis. Biodiversity Science, 31, 23087. (in Chinese with English abstract)
	[蔡建民, 何培宇, 杨智鹏, 李露莹, 赵启军, 潘帆 (2023) 基于深度特征融合的鸟鸣识别方法及其可解释性分析. 生物多样性, 31, 23087.]
[4]	Cheng G, Yuan X, Yao XW, Yan KB, Zeng QH, Xie XX, Han JW (2023) Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 13467-13488.
[5]	Feng CJ, Zhong YJ, Gao Y, Scott MR, Huang WL (2021) TOOD: Task-aligned one-stage object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3490-3499. IEEE, Montreal.
[6]	Ge Z, Liu ST, Wang F, Li ZM, Sun J (2021) YOLOX: Exceeding YOLO series in 2021. arXiv, doi: 10.48550/arXiv.2107.08430.
[7]	Gou JP, Xiong XS, Yu BS, Du L, Zhan YB, Tao DC (2023) Multi-target knowledge distillation via student self-reflection. International Journal of Computer Vision, 131, 1857-1874.
[8]	Hao WL, Zhang L, Han M, Zhang K, Li FZ, Yang GQ, Liu ZY (2023) YOLOv5-SA-FC: A novel pig detection and counting method based on shuffle attention and focal complete intersection over union. Animals, 13, 3201.
[9]	Hong YY, Lu XL, Zhao HP (2021) Bird diversity and interannual dynamics in different habitats of agricultural landscape in Huanghuai Plain. Acta Ecologica Sinica, 41, 2045-2055. (in Chinese with English abstract)
	[洪咏怡, 卢训令, 赵海鹏 (2021) 黄淮平原农业景观不同生境鸟类多样性特征及年际动态. 生态学报, 41, 2045-2055.]
[10]	Huang RR, Wang Y, Yang HZ (2022) Cross-layer attention network for fine-grained visual categorization. arXiv, doi: 10.48550/arXiv.2210.08784.
[11]	Jocher G (2022) YOLOv5 Release v6.0. https://github.com/ultralytics/yolov5/releases/tag/v6.0. (accessed on 2022-11-22)
[12]	Jocher G (2023) YOLOv8 Release v8.1.0. https://github.com/ultralytics/ultralytics/releases/tag/v8.1.0. (accessed on 2023-01-10)
[13]	Lei JL, Gao SH, Rasool MA, Fan R, Jia YF, Lei GC (2023) Optimized small waterbird detection method using surveillance videos based on YOLOv7. Animals, 13, 1929.
[14]	Li CY, Li LL, Jiang HL, Weng KH, Geng YF, Li L, Ke ZD, Li QY, Cheng M, Nie WQ, Li YD, Zhang B, Liang YF, Zhou LY, Xu XM, Chu XX, Wei XX, Wei XL (2022) YOLOv6: A single-stage object detection framework for industrial applications. arXiv, doi: 10.48550/arXiv.2209.02976.
[15]	Lin TY, Dollár P, Girshick R, He KM, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117-2125. IEEE, Honolulu.
[16]	Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C (2014) Microsoft COCO: Common objects in context. In: 2014 European Conference on Computer Vision (ECCV), pp. 740-755. Springer International Publishing, Zurich.
[17]	Liu S, Qi L, Qin HF, Shi JP, Jia JY (2018) Path aggregation network for instance segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759-8768. IEEE, Salt Lake City, UT.
[18]	Liu SL, Li YL, Qu JY, Wu RB (2022) Airport UAV and birds detection based on deformable DETR. Journal of Physics: Conference Series, 2253, 012024.
[19]	Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: Single shot MultiBox detector. In: 2016 European Conference on Computer Vision (ECCV), pp. 21-37. Springer International Publishing, Amsterdam.
[20]	Mokany K, Ware C, Harwood TD, Schmidt RK, Ferrier S (2022) Habitat-based biodiversity assessment for ecosystem accounting in the Murray-Darling Basin. Conservation Biology, 36, e13915.
[21]	Ren SQ, He KM, Girshick R, Sun J (2016) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149.
[22]	Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241. Springer International Publishing, Munich.
[23]	Roth K, Vinyals O, Akata Z (2022) Non-isotropy regularization for proxy-based deep metric learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7420-7430. IEEE, New Orleans.
[24]	Song GL, Liu Y, Wang XG (2020) Revisiting the sibling head in object detector. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11563-11572. IEEE, Seattle.
[25]	Sun HB, He XT, Peng YX (2022) SIM-Trans: Structure information modeling transformer for fine-grained visual categorization. In: 30th ACM International Conference on Multimedia (ACM MM), pp. 5853-5861. ACM, Lisboa.
[26]	Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD Birds-200-2011 Dataset, Technical Report CNS-TR-2011-001. California Institute of Technology, California, USA.
[27]	Wang CY, Bochkovskiy A, Liao HYM (2021) Scaled-YOLOv4: Scaling cross stage partial network. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13029-13038. IEEE, Nashville.
[28]	Wang CY, Bochkovskiy A, Liao HYM (2023) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7464-7475. IEEE, Vancouver.
[29]	Wang JQ, Chen K, Xu R, Liu ZW, Loy CC, Lin DH (2019) CARAFE: Content-aware reassembly of features. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3007-3016. IEEE, Seoul.
[30]	Wang JX, Su YH, Yao JH, Liu M, Du YR, Wu X, Huang L, Zhao MH (2023) Apple rapid recognition and processing method based on an improved version of YOLOv5. Ecological Informatics, 77, 102196.
[31]	Wang K, Yang F, Chen ZB, Chen YX, Zhang Y (2023) A fine-grained bird classification method based on attention and decoupled knowledge distillation. Animals, 13, 264.
[32]	Wu KY, Ruan WD, Zhou DF, Chen QC, Zhang CY, Pan XY, Yu S, Liu Y, Xiao RB (2023) Syllable clustering analysis-based passive acoustic monitoring technology and its application in bird monitoring. Biodiversity Science, 31, 22370. (in Chinese with English abstract)
	[吴科毅, 阮文达, 周棣锋, 陈庆春, 张承云, 潘新园, 余上, 刘阳, 肖荣波 (2023) 基于音节聚类分析的被动声学监测技术及其在鸟类监测中的应用. 生物多样性, 31, 22370.]
[33]	Wu Y, Chen YP, Yuan L, Liu ZC, Wang LJ, Li HZ, Fu Y (2020) Rethinking classification and localization for object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10186-10195. IEEE, Seattle.
[34]	Xiang WB, Song ZY, Zhang GX, Wu XC (2022) Birds detection in natural scenes based on improved faster RCNN. Applied Sciences, 12, 6094.
[35]	Xiao ZS, Xiao WH, Wang TM, Li S, Lian XM, Song DZ, Deng XQ, Zhou QH (2022) Wildlife monitoring and research using camera-trapping technology across China: The current status and future issues. Biodiversity Science, 30, 22451. (in Chinese with English abstract)
	[肖治术, 肖文宏, 王天明, 李晟, 连新明, 宋大昭, 邓雪琴, 周岐海 (2022) 中国野生动物红外相机监测与研究: 现状及未来. 生物多样性, 30, 22451.]
[36]	Xie JJ, Zhong YJ, Zhang JG, Liu S, Ding CQ, Triantafyllopoulos A (2023a) A review of automatic recognition technology for bird vocalizations in the deep learning era. Ecological Informatics, 73, 101927.
[37]	Xie JJ, Zhong YJ, Zhang JG, Zhang CC, Schuller BW (2023b) A weakly supervised spatial group attention network for fine-grained visual recognition. Applied Intelligence, 53, 23301-23315.
[38]	Xie ZF, Li DZ, Sun HX, Zhang AM (2023) Deep learning techniques for bird chirp recognition task. Biodiversity Science, 31, 22308. (in Chinese with English abstract)
	[谢卓钒, 李鼎昭, 孙海信, 张安民 (2023) 面向鸟鸣声识别任务的深度学习技术. 生物多样性, 31, 22308.]
[39]	Xue ZY, Lin HF, Wang F (2022) A small target forest fire detection model based on YOLOv5 improvement. Forests, 13, 1332.
[40]	Yang CHY, Huang ZH, Wang NY (2022) QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13668-13677. IEEE, New Orleans.
[41]	Zhang H, Shao FM, He XH, Zhang ZH, Cai YG, Bi SH (2023) Research on object detection and recognition method for UAV aerial images based on improved YOLOv5. Drones, 7, 402.
[42]	Zhao Q, Wei HL, Zhai XY (2023) Improving tire specification character recognition in the YOLOv5 network. Applied Sciences, 13, 7310.
[43]	Zhao YF, Li J, Chen XW, Tian YH (2021) Part-guided relational transformers for fine-grained visual recognition. IEEE Transactions on Image Processing, 30, 9470-9481.
[44]	Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2223-2232. IEEE, Venice.
[45]	Zhu SM (2024) The Number of Terrestrial Wild Animals in the City Has Increased to 612 Species. (in Chinese)
	[朱松梅 (2024) 全市陆生野生动物种类增至612种.] https://www.beijing.gov.cn/ywdt/yaowen/202404/t20240414_3617537.html. (accessed on 2024-04-14)
[46]	Zhuang JY, Qin Z, Yu H, Chen XC (2023) Task-specific context decoupling for object detection. arXiv, doi: 10.48550/arXiv.2303.01047.
[47]	Zou C, Liang YQ (2021) Bird detection of transmission line based on YOLO V3 algorithm. Computer Applications and Software, 38(10), 164-167, 241. (in Chinese with English abstract)
	[邹聪, 梁永全 (2021) 基于YOLO V3算法的输电线路鸟类检测. 计算机应用与软件, 38(10), 164-167, 241.]