生物多样性

基于GenBank数据库的真核生物遗传数据时空格局: 现状与展望

彭欣, 刘传, 黄晓磊*   

  1. 福建农林大学植物保护学院农林生物安全国家重点实验室, 福州 350002
  • 收稿日期:2025-05-19 修回日期:2025-08-24 接受日期:2025-09-12
  • 通讯作者: 黄晓磊

Spatiotemporal patterns of eukaryotic genetic data based on the GenBank database: Current status and future directions

Xin Peng, Chuan Liu, Xiaolei Huang*

  1. State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  • Received: 2025-05-19 Revised: 2025-08-24 Accepted: 2025-09-12
  • Contact: Xiaolei Huang

摘要: 遗传数据在生物多样性研究和保护实践中发挥着越来越重要的作用, 然而其应用时常面临数据质量缺陷、地理或类群分布不均等方面的制约, 尽管陆生脊椎动物的遗传数据格局已有较深入研究, 但全球植物、真菌和其他动物类群的遗传数据空间分布模式仍缺乏系统的实证研究。本文采用多尺度分析方法, 系统评估了动物、植物和真菌三大真核生物界的遗传数据现状、元数据完整性以及遗传数据的时空动态趋势。结果表明, 动物界拥有约2.7亿条序列和1.6万个基因组数据, 超过植物界(约1.4亿条, 0.7万个)和真菌界(约0.2亿条, 1.7万个)。遗传数据的地理元数据缺失现象普遍存在, 其中真菌ITS序列的经纬度缺失最为严重(缺失率92.07%), 其次是植物rbcL (83.19%)和动物COI (26.40%)。时空分布格局显示, 全球尺度上遗传数据呈现明显的“北半球中心化”特征, 北美、西欧和东亚地区占据主导地位, 而南半球普遍数据匮乏; 同时观察到动物COI和植物rbcL数据呈下降趋势, 而真菌ITS数据快速增长。中国区域则表现出独特的“南动物、东植物、北真菌”分布格局, 而西北地区数据积累明显不足; 时间维度上, 中国植物和真菌数据持续增长, 而动物数据保持稳定。这些发现揭示了遗传数据质量缺陷和分布失衡已成为制约生物多样性研究的重要瓶颈。为此, 我们建议建立严格的元数据存档标准, 重点加强南半球和中国西北部等数据薄弱区域的科研投入, 并通过构建国际科研合作网络促进全球数据资源的均衡配置, 从而提升遗传数据在生物多样性研究和保护实践中的应用价值。

关键词: 遗传数据, DNA条形码, 基因组, 时空分布格局

Abstract

Aims: Genetic data are playing an increasingly vital role in biodiversity research and conservation practices. However, their application is often constrained by data quality deficiencies and uneven geographical or taxonomic distributions. While the genetic data patterns of terrestrial vertebrates have been extensively studied, the spatial distribution patterns of genetic data for global plants, fungi, and other animal groups still lack systematic empirical research. This study aims to assess the current coverage of genetic data across three major eukaryotic kingdoms (Animalia, Plantae, and Fungi), focusing on representative molecular markers to analyze metadata completeness and spatiotemporal distribution patterns, thereby identifying key bottlenecks in biodiversity research applications. 

Methods: This study employed a multi-scale analytical approach to evaluate genetic data across the three eukaryotic kingdoms (Animalia, Plantae, and Fungi). First, we compiled summary statistics for the sequence and genome datasets of each kingdom. We then assessed metadata completeness for three standard DNA barcodes, together covering approximately 6 million sequences: cytochrome c oxidase subunit I (COI; Animalia), the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL; Plantae), and the internal transcribed spacer (ITS; Fungi). Finally, we analyzed the spatial distribution patterns and interannual trends of these genetic data using geographic grids of two resolutions (4° × 4° at the global scale and 2° × 2° for China) in combination with Natural Earth national boundary datasets.
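The gridding step described in the Methods can be illustrated with a minimal sketch. It assumes that each record carries decimal-degree coordinates and that cells are aligned to the origin at 90° S, 180° W; the function names and sample coordinates are hypothetical and are not taken from the study, and the assignment of records to countries with Natural Earth boundaries is not shown.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg):
    """Return the (row, col) index of the cell_deg x cell_deg cell holding a point.

    Rows count northward from 90 S and columns eastward from 180 W
    (an assumed alignment; the abstract does not state the grid origin).
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    n_rows = int(180 / cell_deg)
    n_cols = int(360 / cell_deg)
    # Clamp so that lat = 90 and lon = 180 fall into the last cell.
    row = min(math.floor((lat + 90.0) / cell_deg), n_rows - 1)
    col = min(math.floor((lon + 180.0) / cell_deg), n_cols - 1)
    return row, col


def count_per_cell(points, cell_deg):
    """Count records per grid cell for an iterable of (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


# Hypothetical coordinates, for illustration only.
records = [(26.08, 119.30), (26.90, 118.10), (39.90, 116.40), (-33.87, 151.21)]

global_counts = count_per_cell(records, 4.0)      # 4 x 4 degree global grid
china_counts = count_per_cell(records[:3], 2.0)   # 2 x 2 degree grid for China
```

Counting records per cell in this way gives the raw quantity that a density map or a per-cell time series would then be built on; an equal-area grid would be an alternative design, but the abstract specifies cells in degrees.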

Results: Animalia holds approximately 270 million sequences and 16,000 genomes, exceeding Plantae on both counts (approximately 140 million sequences, 7,000 genomes) and Fungi in sequence count (approximately 20 million sequences, 17,000 genomes). Missing geographic metadata was prevalent across all three standard barcode markers: ITS sequences (Fungi) showed the highest rate of missing geographic coordinates (92.07%), followed by rbcL (Plantae; 83.19%) and COI (Animalia; 26.40%). At the global scale, the data show a distinct "Northern Hemisphere centralization", with North America, Western Europe, and East Asia dominating while the Southern Hemisphere is generally data-poor. Over time, Animalia COI and Plantae rbcL data declined, whereas Fungi ITS data grew rapidly. In China, a distinctive pattern emerged, characterized by "Southern Animalia, Eastern Plantae, and Northern Fungi", with marked data shortages in the northwest. Over time, data for Plantae and Fungi in China have continued to grow, while data for Animalia have remained stable.
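The coordinate missing rates reported above are simple proportions. The sketch below shows one way such a rate could be computed from GenBank `lat_lon` source qualifiers, which are written as strings such as `26.08 N 119.30 E`. It assumes that a record counts as missing when the qualifier is absent or cannot be parsed; the abstract does not state the exact criterion, and the function names and sample values are hypothetical.

```python
import re

# Pattern for a lat_lon string such as "26.08 N 119.30 E".
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)


def parse_lat_lon(value):
    """Return (lat, lon) in signed decimal degrees, or None if unusable."""
    if not value:
        return None
    match = _LAT_LON.match(value)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def missing_rate(values):
    """Percentage of records whose lat_lon value is absent or unparseable."""
    values = list(values)
    if not values:
        raise ValueError("no records supplied")
    missing = sum(1 for v in values if parse_lat_lon(v) is None)
    return 100.0 * missing / len(values)


# Hypothetical qualifier values, for illustration only.
sample = ["26.08 N 119.30 E", None, "", "33.87 S 151.21 E", "unknown"]
rate = missing_rate(sample)
```

With this definition, three of the five sample records count as missing. A stricter or looser criterion (for example, accepting a country name in place of coordinates) would change the rate, so the definition should be stated alongside any reported figure.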

Conclusion: These findings highlight that deficiencies in genetic data quality and imbalances in spatial distribution have become important bottlenecks restricting biodiversity research. To address these issues, we recommend the establishment of stringent metadata archiving standards, increased scientific research investment in underrepresented areas such as the Southern Hemisphere and Northwest China, and the promotion of equitable global data resource allocation through the construction of an international scientific research cooperation network. These measures aim to enhance the application value of genetic data in biodiversity research and conservation practices.

Key words: genetic data, DNA barcode, genome, spatiotemporal distribution pattern