Biodiversity Science ›› 2019, Vol. 27 ›› Issue (5): 567-575. doi: 10.17520/biods.2018211

• Methodology •

EPPS, a metabarcoding bioinformatics pipeline using Nextflow

Yiyuan Li*, David C. Molik, Michael E. Pfrender

  1. Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46554, USA
  • Received: 2018-08-01 Accepted: 2019-03-05 Online: 2019-05-20

Metabarcoding enables rapid assessment of biodiversity. In this study, we discuss popular metabarcoding analytical tools and their parameter settings. We also develop EPPS, a metabarcoding bioinformatics pipeline built with the pipeline-building program Nextflow, which processes data from quality control of raw reads through to biodiversity comparisons between samples. EPPS can also summarize the time and memory cost of each process in the pipeline. We apply the pipeline to a test dataset and to a public dataset from a previous study. The results suggest that the pipeline can reliably analyze metabarcoding data and can make it easier to share pipelines between metabarcoding studies.

Key words: environmental DNA, USEARCH, Trimmomatic, principal component analysis

Fig. 1

The workflow of EPPS. The OTU clustering step includes dereplication, OTU clustering, and chimera detection.
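The dereplication sub-step named in this caption can be illustrated with a minimal Python sketch. It is not the EPPS implementation: the function name `dereplicate`, the `min_size` parameter, and the toy reads are all assumptions made for illustration. The sketch only shows the general idea of collapsing identical reads into unique sequences with abundance counts before clustering.

```python
from collections import Counter


def dereplicate(reads, min_size=1):
    """Collapse identical reads into (sequence, abundance) pairs.

    Hypothetical helper for illustration only; min_size is an assumed
    parameter that drops unique sequences seen fewer than min_size times.
    """
    # Treat upper- and lower-case bases as the same sequence.
    counts = Counter(seq.upper() for seq in reads)
    # Sort by decreasing abundance, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(seq, n) for seq, n in ordered if n >= min_size]


# Toy reads invented for this example.
reads = ["ACGT", "acgt", "ACGA", "ACGT", "TTTT"]
uniques = dereplicate(reads)
abundant = dereplicate(reads, min_size=2)
```

Sorting by abundance reflects a common convention in which the most abundant unique sequences are considered first as cluster seeds; whether EPPS relies on that ordering is not stated here.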

Fig. 2

The timeline chart of EPPS. The x-axis shows the time taken by each process, in seconds. Each row is labeled with a stage of the analysis: filter, data filtering; demultiplex, primer removal, plus demultiplexing if there are multiple primers; merge, merging of forward and reverse reads; otu_clustering and map, OTU clustering and mapping of reads; plot, drawing the PCA plot. Because the test dataset contains four samples, there are four processes (1 to 4) for each of the filter, demultiplex, merge, and map steps. Each bar indicates the time for one process. The dark area in each bar represents the real execution time. Each bar displays two numbers: the task duration and the peak virtual memory size.

Fig. 3

PCA plot based on the test data. Each dot in the figure represents a test sample. The distance between dots indicates the dissimilarity between samples. For example, the similarity between test3 and test4 is higher than that between test1 and test2.
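To make the reading of inter-dot distance concrete, here is a minimal Python sketch under stated assumptions. PCA is a rotation of the centered data, so distances between dots across all components equal the Euclidean distances between the sample profiles, and a two-dimensional plot approximates them. The sketch computes those underlying distances directly rather than performing PCA. The OTU counts, the sample names `s1` to `s4`, and the conversion to relative abundance are invented for illustration and are not taken from EPPS or its test data.

```python
import math
from itertools import combinations


def relative_abundance(counts):
    """Scale raw OTU counts so that each sample sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(otu_table):
    """Return {(sample_a, sample_b): distance} for every pair of samples."""
    profiles = {name: relative_abundance(c) for name, c in otu_table.items()}
    return {
        (a, b): euclidean(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical OTU counts (rows: samples, columns: OTUs).
otu_table = {
    "s1": [90, 10, 0, 0],
    "s2": [10, 90, 0, 0],
    "s3": [0, 0, 60, 40],
    "s4": [0, 0, 50, 50],
}
dist = pairwise_distances(otu_table)
closest_pair = min(dist, key=dist.get)
```

In this toy table `s3` and `s4` have nearly identical profiles, so their distance is the smallest and their dots would sit closest together on a PCA plot.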

Fig. 4

EPPS results based on the public dataset. Sampling locations are numbered 1 to 8 from upstream to downstream. The suffixes “a”, “b” and “c” indicate the three replicates from the same sampling location. Based on the PCA, the most upstream sample (Location 1) has a unique fish composition. Locations 3-6 have similar fish compositions. The downstream samples (Locations 7-8) share a similar fish composition.
