Biodiversity Science, 2015, Vol. 23, Issue (4): 550-555. doi: 10.17520/biods.2015120

NCBIminer: a new tool for batch downloading gene sequence data from GenBank

Xiaoting Xu1, Zhiheng Wang1,*, Dimitar Dimitrov2, Carsten Rahbek3,4

  1. 1 Department of Ecology, College of Urban and Environmental Sciences, and Key Laboratory for Earth Surface Processes of the Ministry of Education, Peking University, Beijing 100871
    2 Natural History Museum, University of Oslo, Oslo, Norway
    3 Center for Macroecology, Evolution and Climate, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
    4 Imperial College London, Grand Challenges in Ecosystems and the Environment Initiative, Silwood Park Campus, Berkshire, UK
  • Received: 2015-05-07 Accepted: 2015-07-09 Published: 2015-07-20
  • Corresponding author: Zhiheng Wang E-mail: zhiheng.wang@pku.edu.cn
  • Funding:
    National Natural Science Foundation of China (31470564, 31400467, 31321061) and China Postdoctoral Science Foundation (2014M550555)

Using NCBIminer to search and download nucleotide sequences from GenBank

Xiaoting Xu1, Zhiheng Wang1,*, Dimitar Dimitrov2, Carsten Rahbek3,4

  1. 1 Department of Ecology and Key Laboratory for Earth Surface Processes of the Ministry of Education, College of Urban and Environmental Sciences, Peking University, Beijing 100871
    2 Natural History Museum, University of Oslo, Oslo, Norway
    3 Center for Macroecology, Evolution and Climate, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
    4 Imperial College London, Grand Challenges in Ecosystems and the Environment Initiative, Silwood Park Campus, Berkshire, UK
  • Received: 2015-05-07 Accepted: 2015-07-09 Online: 2015-07-20
  • Contact: Wang Zhiheng E-mail: zhiheng.wang@pku.edu.cn

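Nucleotide sequences carry the genetic information of organisms and are fundamental data for modern biology and ecology. With advances in sequencing technology, large numbers of nucleotide sequences have been generated and stored in public data platforms, of which GenBank (http://www.ncbi.nlm.nih.gov/genbank/) is currently one of the largest. As of February 2015, GenBank held more than 180 million nucleotide sequences covering more than 300,000 species worldwide. However, finding and downloading the required data accurately and quickly from such a large volume has become one of the obstacles to the wide use of genetic data. To address this, we developed NCBIminer, a bioinformatics program for downloading GenBank data efficiently and accurately. Given nucleotide sequence names, data types, and one or more initial reference sequences supplied by the user, NCBIminer searches for and downloads the sequence data of specific genes for multiple user-specified species or taxa. The software can be downloaded from https://github.com/greengirl/NCBIminer/releases/ and is free to use on Windows, Linux, and Mac operating systems. It is simple to operate and requires no bioinformatics background. To make the software easier to use, this paper describes its workflow and algorithms, its installation, and the parameter settings involved in running it.

Key words: GenBank, bioinformatics, gene sequences, phylogenetic evolution, DNA, nucleotide sequences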

GenBank is the leading public database of genetic resources and currently contains over 10^12 base pairs from about 300,000 formally described species. It offers valuable resources for studies on the evolution of species, genes, and genomes. However, difficulties in mining GenBank data hinder its wider use for large-scale data collection. To address this issue, we introduce NCBIminer, a freely available, cross-platform, and user-friendly program for mining nucleotide sequences from GenBank. Its main purpose is to download sequences for user-specified genes and taxonomic groups based on gene names, types, and one or several reference sequences. The program's algorithms have been described elsewhere; here we focus on how to use the program, including how to install it, run it, and set its parameters.

Key words: GenBank, bioinformatics, gene, phylogenetic evolution, DNA, nucleotide sequences
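The kind of search the abstract describes (a gene name combined with a taxonomic group) can be approximated against NCBI's public E-utilities interface. The sketch below is not NCBIminer's own code and does not reproduce its reference-sequence step; it only builds an `esearch` URL for the nucleotide database. The function names and the example taxon and gene names are assumptions chosen for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public NCBI E-utilities search endpoint (independent of NCBIminer).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(taxon, gene_names):
    """Combine one taxon with alternative gene names into an Entrez query term."""
    genes = " OR ".join(f"{name}[Gene]" for name in gene_names)
    return f"{taxon}[Organism] AND ({genes})"


def build_esearch_url(taxon, gene_names, retmax=20):
    """Return an esearch URL that would list IDs of matching nucleotide records."""
    params = {
        "db": "nuccore",                        # NCBI nucleotide database
        "term": build_term(taxon, gene_names),  # fielded Entrez query
        "retmax": retmax,                       # maximum number of IDs returned
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical inputs, used only to show the shape of the request.
print(build_esearch_url("Quercus", ["rbcL", "matK"]))
```

Fetching the URL returns record identifiers only; retrieving the sequences themselves would take a further `efetch` request, which is omitted here.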

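Appendix 1

Sequence data format in GenBank. The box on the left shows the feature types defined by GenBank; the box on the right shows the annotation information for the sequence.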
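The split between feature type and annotation described in this caption can be illustrated with a small parser. This is a simplified sketch, not NCBIminer's code: the record fragment is a made-up example in GenBank flat-file layout, `parse_features` is a hypothetical helper, and the parser splits on whitespace rather than fixed columns and ignores qualifiers that wrap across lines.

```python
# Made-up feature-table fragment in GenBank flat-file layout (illustration only).
SAMPLE = """\
FEATURES             Location/Qualifiers
     source          1..1200
                     /organism="Quercus robur"
     gene            1..1200
                     /gene="rbcL"
     CDS             1..1200
                     /gene="rbcL"
                     /product="example protein"
"""


def parse_features(text):
    """Return a list of (feature_type, location, qualifiers) tuples."""
    features = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("FEATURES"):
            continue  # skip blank lines and the table header
        if stripped.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            name, _, value = stripped[1:].partition("=")
            if features:
                features[-1][2][name] = value.strip('"')
        else:
            # Feature line: feature type followed by its location.
            key, location = stripped.split(None, 1)
            features.append((key, location, {}))
    return features


for feature_type, location, qualifiers in parse_features(SAMPLE):
    print(feature_type, location, qualifiers)
```

Real records are better read with an established parser; the point here is only that each feature type on the left carries its own set of annotation qualifiers on the right.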

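Appendix 2

Workflow of NCBIminer. Panel a shows the main workflow of NCBIminer; panel b explains in detail the steps for building the optimized reference sequence set and for the multi-query merging algorithm. Modified from Xu et al. (2015).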

1 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
2 Chen ZD (陈之端), Li DZ (李德铢) (2013) On Barcode of Life and Tree of Life. Plant Diversity and Resources (植物分类与资源学报), 35, 675-681. (in Chinese with English abstract)
3 Driskell AC, Ané C, Burleigh JG, McMahon MM, O’Meara BC, Sanderson MJ (2004) Prospects for building the Tree of Life from large sequence databases. Science, 306, 1172-1174.
4 Holt B, Lessard JP, Borregaard MK, Fritz SA, Araujo MB, Dimitrov D, Fabre PH, Graham CH, Graves GR, Jonsson KA, Nogues-Bravo D, Wang ZH, Whittaker RJ, Fjeldsa J, Rahbek C (2013) An update of Wallace’s zoogeographic regions of the world. Science, 339, 74-78.
5 Jones M, Koutsovoulos G, Blaxter M (2011) iPhy: an integrated phylogenetic workbench for supermatrix analyses. BMC Bioinformatics, 12, 30.
6 Li DC (2013) Similarity analysis of DNA sequences based on CLZ complexity. Journal of Computational and Theoretical Nanoscience, 10, 481-487.
7 Li DZ, Gao LM, Li HT, Wang H, Ge XJ, Liu JQ, Chen ZD, Zhou SL, Chen SL, Yang JB, Fu CX, Zeng CX, Yan HF, Zhu YJ, Sun YS, Chen SY, Zhao L, Wang K, Yang T, Duan GW, Grp CPB (2011) Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants. Proceedings of the National Academy of Sciences, USA, 108, 19641-19646.
8 Lu LM (鲁丽敏), Sun M (孙苗), Zhang JB (张景博), Li HL (李洪雷), Lin L (林立), Yang T (杨拓), Chen M (陈闽), Chen ZD (陈之端) (2014) Tree of Life and its applications. Biodiversity Science (生物多样性), 22, 3-20. (in Chinese with English abstract)
9 Pearse WD, Purvis A (2013) phyloGenerator: an automated phylogeny generation tool for ecologists. Methods in Ecology and Evolution, 4, 692-698.
10 Pei NC (裴男才) (2015) Applications of DNA barcoding in evolutionary ecology. Biodiversity Science (生物多样性), 23, 291-292. (in Chinese)
11 Qiu Q, Zhang GJ, Ma T, Qian WB, Wang JY, Ye ZQ, Cao CC, Hu QJ, Kim J, Larkin DM, Auvil L, Capitanu B, Ma J, Lewin HA, Qian XJ, Lang YS, Zhou R, Wang LZ, Wang K, Xia JQ, Liao SG, Pan SK, Lu X, Hou HL, Wang Y, Zang XT, Yin Y, Ma H, Zhang J, Wang ZF, Zhang YM, Zhang DW, Yonezawa T, Hasegawa M, Zhong Y, Liu WB, Zhang Y, Huang ZY, Zhang SX, Long RJ, Yang HM, Wang J, Lenstra JA, Cooper DN, Wu Y, Wang J, Shi P, Wang J, Liu JQ (2012) The yak genome and adaptation to life at high altitude. Nature Genetics, 44, 946-949.
12 Ren BQ (任保青), Chen ZD (陈之端) (2010) DNA barcoding plant life. Chinese Bulletin of Botany (植物学报), 45, 1-12. (in Chinese with English abstract)
13 Sanderson M, Boss D, Chen D, Cranston K, Wehe A (2008) The PhyLoTA browser: processing GenBank for molecular phylogenetics research. Systematic Biology, 57, 335-346.
14 Xu X, Wang Z, Rahbek C, Lessard J-P, Fang J (2013) Evolutionary history influences the effects of water-energy dynamics on oak diversity in Asia. Journal of Biogeography, 40, 2146-2155.
15 Xu XT, Dimitrov D, Rahbek C, Wang ZH (2015) NCBIminer: sequences harvest from Genbank. Ecography, 38, 426-430.
16 Yang ZY, Ran JH, Wang XQ (2012) Three genome-based phylogeny of Cupressaceae s.l.: further evidence for the evolution of gymnosperms and southern hemisphere biogeography. Molecular Phylogenetics and Evolution, 64, 452-470.
17 Zanne AE, Tank DC, Cornwell WK, Eastman JM, Smith SA, FitzJohn RG, McGlinn DJ, O’Meara BC, Moles AT, Reich PB, Royer DL, Soltis DE, Stevens PF, Westoby M, Wright IJ, Aarssen L, Bertin RI, Calaminus A, Govaerts R, Hemmings F, Leishman MR, Oleksyn J, Soltis PS, Swenson NG, Warman L, Beaulieu JM (2013) Three keys to the radiation of angiosperms into freezing environments. Nature, 506, 89-92.