Biodiversity Science ›› 2019, Vol. 27 ›› Issue (5): 567-575. doi: 10.17520/biods.2018211

• Methods •

EPPS, an automated metabarcoding analysis pipeline built with Nextflow

Yiyuan Li*, David C. Molik, Michael E. Pfrender

  1. Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46554, USA
  • Received: 2018-08-01 Accepted: 2019-03-05 Published: 2019-05-20

EPPS, a metabarcoding bioinformatics pipeline using Nextflow

Yiyuan Li*, David C. Molik, Michael E. Pfrender

  1. Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46554, USA
  • Received: 2018-08-01 Accepted: 2019-03-05 Online: 2019-05-20

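Rapid species detection based on metabarcoding supports the assessment, prediction, and conservation of biodiversity. This paper describes the steps of a typical metabarcoding analysis and how to set their parameters. Using Nextflow, we built a metabarcoding analysis pipeline, EPPS, that runs automatically from quality control of the raw data through to the comparison of biodiversity among environments. Nextflow also provides pipeline monitoring, with visual output of the time and memory consumed by each process. Using a test dataset and a published dataset, we show that the platform analyzes metabarcoding data effectively and reliably assesses the similarity of biodiversity among environments.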

Keywords: environmental DNA, USEARCH, Trimmomatic, principal component analysis

Metabarcoding enables rapid assessment of biodiversity. In this study, we discuss popular metabarcoding analysis tools and their parameter settings. Using the pipeline-building program Nextflow, we also developed EPPS, a metabarcoding bioinformatics pipeline that processes data from quality control of raw reads through to biodiversity comparisons between samples. The EPPS pipeline summarizes the time and memory cost of each process. We applied the pipeline to a test dataset and to a public dataset from a previous study. The results suggest that the pipeline can reliably analyze metabarcoding data and can facilitate the sharing of pipelines among metabarcoding studies.

Key words: environmental DNA, USEARCH, Trimmomatic, principal component analysis

Fig. 1

Main analysis steps of EPPS. The OTU clustering step also includes dereplication, OTU clustering, and chimera detection.
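To make the dereplication step named in the Fig. 1 caption concrete, here is a minimal Python sketch. It is not the EPPS implementation, which delegates this work to external tools. The function name `dereplicate` and the toy reads are hypothetical and serve only to show the idea: identical reads are collapsed into one unique sequence with an abundance count, and the most abundant sequences come first.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs.

    Pairs are sorted by decreasing abundance; ties are broken
    alphabetically so that the output is deterministic.
    """
    counts = Counter(seq.upper() for seq in reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy reads, for illustration only.
toy_reads = ["ACGT", "acgt", "ACGA", "ACGT", "TTGA", "ACGA"]
unique_reads = dereplicate(toy_reads)

for seq, size in unique_reads:
    print(f"{seq}\tsize={size}")
```

Sorting by abundance matters because clustering tools commonly treat abundant unique sequences as candidate OTU centroids before rarer ones.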

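Fig. 2

Time consumed by each process of the EPPS pipeline. The horizontal axis shows time in seconds. The names in the leftmost column correspond to the steps of the metabarcoding analysis. filter: sequencing quality control; demultiplex: removal of PCR primers and, when several primers are present, separation of the reads by primer; merge: merging of forward and reverse reads; otu_clustering/map: OTU clustering; plot: principal component analysis. Because the test data contain four samples, each process name is followed by an index from 1 to 4 in parentheses. Light bars show the system time consumed by each process, and dark bars show its CPU time. Each bar carries two numbers: the first is the system time of the process and the second is its peak virtual memory.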
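The demultiplex step described in the caption (remove the PCR primer and, if several primers are present, separate the reads by primer) can be sketched as follows. This is an illustrative simplification, not the EPPS code: it assumes exact prefix matching, whereas real tools usually tolerate mismatches, and the primer names, primer sequences, and reads are all hypothetical.

```python
def demultiplex(reads, primers):
    """Assign each read to the primer it starts with and strip that primer.

    Only exact prefix matches are recognised (a simplifying assumption).
    Reads that match no primer are returned separately.
    """
    bins = {name: [] for name in primers}
    unassigned = []
    for read in reads:
        for name, primer in primers.items():
            if read.startswith(primer):
                bins[name].append(read[len(primer):])
                break
        else:
            # No primer matched this read.
            unassigned.append(read)
    return bins, unassigned


# Hypothetical primers and reads, for illustration only.
primers = {"marker_A": "ACGT", "marker_B": "TTGC"}
reads = ["ACGTGGGA", "TTGCAAAC", "ACGTCCCT", "GGGGGGGG"]
bins, unassigned = demultiplex(reads, primers)

for name, trimmed in bins.items():
    print(name, trimmed)
print("unassigned", unassigned)
```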

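Fig. 3

Principal component analysis of the test data. Each point represents one sample of the test data. The closer two points are, the more similar the species composition of the two samples. For example, test3 and test4 are more similar to each other than test3 and test1.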
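The reading rule in the caption (closer points mean more similar composition) can be illustrated without running a full PCA. For an unscaled PCA, distances between points in the plot approximate the Euclidean distances between the samples' OTU profiles, so the sketch below computes those distances directly. It is a swapped-in simplification rather than the analysis used for Fig. 3. The OTU counts are invented, the sample names are borrowed from the caption only for readability, and the conversion to relative abundance is an assumption about preprocessing.

```python
import math


def relative_abundance(counts):
    """Convert raw OTU read counts of one sample to proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


# Invented OTU counts (three OTUs per sample), for illustration only.
otu_counts = {
    "test1": [10, 0, 5],
    "test3": [2, 8, 1],
    "test4": [3, 7, 1],
}
profiles = {name: relative_abundance(c) for name, c in otu_counts.items()}

# Euclidean distance between composition profiles (math.dist needs Python 3.8+).
d_34 = math.dist(profiles["test3"], profiles["test4"])
d_31 = math.dist(profiles["test3"], profiles["test1"])

print(f"test3 vs test4: {d_34:.3f}")
print(f"test3 vs test1: {d_31:.3f}")
```

With these invented counts the test3 to test4 distance is the smaller of the two, which matches how the caption describes a pair of nearby points.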

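Fig. 4

Results for the public dataset. Sample numbers 1-8 denote the eight sampling sites, ordered from the most upstream to the most downstream. The suffixes a, b, and c denote three independent replicate samples taken at the same site. According to the figure, the most upstream sample, Location 1, has a distinctive fish diversity composition. Locations 3-6 have similar fish diversity compositions, as do the most downstream samples, Locations 7-8.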
