Biodiv Sci

Spatiotemporal patterns of eukaryotic genetic data based on the GenBank database: Current status and future directions

Xin Peng, Chuan Liu, Xiaolei Huang*

  1. State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  • Received: 2025-05-19; Revised: 2025-08-24; Accepted: 2025-09-12
  • Contact: Xiaolei Huang

Abstract:

Aims: Genetic data are playing an increasingly vital role in biodiversity research and conservation practices. However, their application is often constrained by data quality deficiencies and uneven geographical or taxonomic distributions. While the genetic data patterns of terrestrial vertebrates have been extensively studied, the spatial distribution patterns of genetic data for global plants, fungi, and other animal groups still lack systematic empirical research. This study aims to assess the current coverage of genetic data across three major eukaryotic kingdoms (Animalia, Plantae, and Fungi), focusing on representative molecular markers to analyze metadata completeness and spatiotemporal distribution patterns, thereby identifying key bottlenecks in biodiversity research applications. 

Methods: This study used a multi-scale analytical approach to evaluate genetic data across the three eukaryotic kingdoms (Animalia, Plantae, and Fungi). First, we compiled summary statistics for the sequence and genome datasets. We then assessed metadata completeness for three standard DNA barcodes: cytochrome c oxidase subunit I (COI; Animalia), the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL; Plantae), and the internal transcribed spacer (ITS; Fungi), covering approximately 6 million sequences. Finally, we analyzed the spatial distribution patterns and interannual trends of these genetic data using geographic grids of two resolutions (4° × 4° at the global scale and 2° × 2° for China), combined with Natural Earth national boundary datasets.
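The gridding step described in the Methods can be illustrated with a minimal sketch. It assumes each record is reduced to a (latitude, longitude) pair in decimal degrees, with `None` marking missing coordinates, and that the grid is an equal-angle grid anchored at (-90°, -180°). The function names `grid_cell` and `summarize` and the record layout are hypothetical; the abstract does not state the grid origin or the software used.

```python
import math
from collections import Counter


def grid_cell(lat, lon, res):
    """Return the (row, col) index of the res-degree cell containing a point.

    Assumption: rows count northward from -90 and columns eastward from -180.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of range: {lat}, {lon}")
    n_rows = int(round(180.0 / res))
    n_cols = int(round(360.0 / res))
    # Clamp so that lat == 90 or lon == 180 falls in the last cell
    # instead of opening an extra row or column.
    row = min(int(math.floor((lat + 90.0) / res)), n_rows - 1)
    col = min(int(math.floor((lon + 180.0) / res)), n_cols - 1)
    return row, col


def summarize(records, res):
    """Count georeferenced records per cell and report the missing fraction."""
    counts = Counter()
    missing = 0
    for lat, lon in records:
        if lat is None or lon is None:
            missing += 1  # record has no usable coordinates
            continue
        counts[grid_cell(lat, lon, res)] += 1
    missing_fraction = missing / len(records) if records else 0.0
    return counts, missing_fraction


# Hypothetical example records, not data from the study.
records = [(26.08, 119.30), (27.5, 118.0), (None, None), (-33.9, 151.2)]
global_counts, global_missing = summarize(records, res=4.0)
china_counts, _ = summarize(records[:2], res=2.0)
```

Calling `summarize` with `res=4.0` corresponds to the global grid and `res=2.0` to the finer grid used for China; the missing fraction is the same kind of quantity as the missing-coordinate rates reported per barcode.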

Results: Animalia holds the most sequences (approximately 270 million, with 16,000 genome datasets), ahead of Plantae (approximately 140 million sequences, 7,000 genomes) and Fungi (approximately 20 million sequences, 17,000 genomes). Geographic metadata deficiencies were prevalent across all three standard barcode markers: ITS (Fungi) sequences had the highest rate of missing geographic coordinates (92.07%), followed by rbcL (Plantae; 83.19%) and COI (Animalia; 26.40%). At the global scale, the data showed a distinct “Northern Hemisphere Centralization”, dominated by North America, Western Europe, and East Asia, while the Southern Hemisphere generally lacked data. Over time, Animalia COI and Plantae rbcL data declined, while Fungi ITS data grew rapidly. In China, a distinctive distribution emerged, characterized by “Southern Animalia, Eastern Plantae, and Northern Fungi”, with significant data shortages in the northwest. Over time, data for Plantae and Fungi in China continued to grow, while data for Animalia remained stable.

Conclusion: These findings highlight that deficiencies in genetic data quality and imbalances in spatial distribution have become important bottlenecks restricting biodiversity research. To address these issues, we recommend the establishment of stringent metadata archiving standards, increased scientific research investment in underrepresented areas such as the Southern Hemisphere and Northwest China, and the promotion of equitable global data resource allocation through the construction of an international scientific research cooperation network. These measures aim to enhance the application value of genetic data in biodiversity research and conservation practices.

Key words: genetic data, DNA barcode, genome, spatiotemporal distribution pattern