Biodiversity Science, 2026, Vol. 34, Issue (1): 25263.  DOI: 10.17520/biods.2025263

• Special Feature: Methods for Ecological Data Analysis •

An approach for estimating haplotype richness from sequences with unequal lengths

Yuan Jiang1, Beixi Huang1, Xueyuan Jia1, Si Liang1, Yutong Xie1, Ping Fan1*, Gang Song2   

  1 Shaanxi University of Chinese Medicine, Xianyang, Shaanxi 712046, China
  2 Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China

  • Received: 2025-07-06 Revised: 2025-09-28 Accepted: 2026-01-21 Online: 2026-01-20 Published: 2026-01-22
  • Contact: Ping Fan

Abstract

Aims: Traditional methods for calculating genetic diversity require all sequences within a species to have the same length. Sequences retrieved from public databases, however, often differ in length, which complicates haplotype identification and the assessment of genetic diversity. Methods already exist for estimating haplotype diversity and nucleotide diversity from sequences of unequal length, but no effective method is available for calculating haplotype richness.

Methods: We developed a method that estimates haplotype richness for DNA sequences of unequal length from the number of nucleotide differences between each pair of sequences (Kij). We validated its performance with three analyses: (1) for equal-length datasets, we compared its results with the output of DnaSP; (2) we generated simulated sequences of random length from equal-length datasets of birds, mammals, and amphibians to test performance on unequal-length sequences, and evaluated generalization capability using a medicinal plant dataset; (3) we applied the method to describe the latitudinal gradient patterns of haplotype richness in birds, mammals, and amphibians.
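To make the idea of working from pairwise differences concrete, the following is a minimal Python sketch and not the authors' algorithm, which the abstract does not spell out. It assumes that the sequences are already aligned to a common coordinate system, that shorter sequences are padded with `-`, and that `-`, `N`, and `?` mark missing sites. The function names `pairwise_difference` and `estimate_haplotype_richness` are hypothetical. Two sequences are linked when Kij = 0 over the sites they share, and the number of linked groups is taken as the richness estimate.

```python
from itertools import combinations

# Assumed symbols for gaps or unknown bases (an assumption, not from the paper).
MISSING = set("-N?")


def pairwise_difference(a: str, b: str) -> int:
    """K_ij: differing sites, counted only where both sequences have a base."""
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x not in MISSING and y not in MISSING and x != y
    )


def estimate_haplotype_richness(seqs: list[str]) -> int:
    """Count groups of sequences linked by K_ij == 0 (union-find over all pairs)."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if pairwise_difference(seqs[i], seqs[j]) == 0:
            parent[find(i)] = find(j)  # merge the two groups

    return len({find(i) for i in range(len(seqs))})


# Hypothetical toy data: two full-length sequences and two truncated ones.
seqs = ["ACGTACGT", "ACGTAC--", "ACGTTCGT", "--GTTCGT"]
richness = estimate_haplotype_richness(seqs)
```

In this toy dataset each truncated sequence matches exactly one full-length sequence, so `richness` is 2. A known limitation of the sketch is that Kij = 0 is not transitive when lengths differ: a short fragment can match two distinct long haplotypes and merge them, so the count can underestimate richness. How the published method resolves such cases is not stated in the abstract.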

Results: (1) For equal-length sequences, the results of the new method did not differ significantly from those of DnaSP (birds: W = 22,018, P = 0.845; mammals: W = 23,096, P = 0.990; amphibians: W = 3,518.5, P = 0.977), and the new method identified more haplotypes than DnaSP when base deletions were present (on average 1.333 ± 0.188 more haplotypes). (2) Random-length simulations showed that the method performs well on sequences of unequal length: overall accuracy, measured as the mean relative error, was 0.130 ± 0.106, and stability, measured as the variance of the relative error, was 0.007 ± 0.007. (3) In the latitudinal analysis, haplotype richness of birds and mammals decreased significantly in the Southern Hemisphere but changed only gradually in the Northern Hemisphere, whereas haplotype richness of amphibians declined continuously from south to north.
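The two simulation metrics reported above can be illustrated with a short Python sketch. It assumes that relative error is defined as |estimate - reference| / reference, with the equal-length result as the reference, and that the variance is the population variance across replicates; the abstract does not confirm either choice. All numbers below are hypothetical and are not taken from the study.

```python
from statistics import mean, pvariance


def relative_errors(estimates: list[float], reference: float) -> list[float]:
    """Relative error of each replicate estimate against the reference value."""
    return [abs(e - reference) / reference for e in estimates]


# Hypothetical example: reference richness from the equal-length dataset and
# estimates from four random-length replicates of the same dataset.
reference_richness = 10
replicate_estimates = [9, 10, 8, 11]

errors = relative_errors(replicate_estimates, reference_richness)
accuracy = mean(errors)        # mean relative error (lower is more accurate)
stability = pvariance(errors)  # variance of relative error (lower is more stable)
```

Under these assumptions, smaller values of both quantities indicate better performance, which is how the reported 0.130 and 0.007 should be read.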

Conclusions: This study contributes to the development of more precise quantitative methods and may provide a new analytical tool for genetic diversity research and conservation.

Key words: genetic diversity, unequal length sequences, haplotype richness, method