Applying cluster analysis and Google Maps in the study of large-scale species occurrence data

doi:10.17520/biods.2011131

Abstract

Primary species occurrence data comprise records of animal and plant specimens held in museums and herbaria, together with species observations. The TaiBIF (Taiwan Biodiversity Information Facility) data portal has so far integrated 26 datasets, yielding more than 1.5 million species occurrence records, 85% of which are geo-referenced. This study uses more than 8,800 Cyprinidae occurrence records from 11 datasets and applies three types of clustering algorithm (grid-based, partition-based, and density-based) to produce different spatial visualizations. It aims to address the poor efficiency and poor legibility that arise when large volumes of species occurrence data are displayed in Google Maps. The study also compares the results of the three clustering algorithms with expert-opinion range maps of Cyprinidae. The goal is to identify a quick and efficient way to present species distribution data, which in turn would help researchers extract knowledge from large amounts of data and use it as a reference for ecological conservation efforts.

Key words: species occurrence data, cluster analysis, biodiversity informatics, visualization

Kunchi Lai, Youhua Cheng, Yuehchih Chen, Yousheng Li, Kwangtsao Shao. Applying cluster analysis and Google Maps in the study of large-scale species occurrence data[J]. Biodiv Sci, 2012, 20(1): 76-85.

URL: https://www.biodiversity-science.net/EN/10.17520/biods.2011131

https://www.biodiversity-science.net/EN/Y2012/V20/I1/76

	Grid-based methods	Partitioning methods	Density-based methods
Parameter settings	Easy; only the grid cell side length for each level must be chosen	Easy; only the number of clusters (K) must be chosen	Hard; the MinPts and Eps parameters must be adjusted repeatedly
Computing efficiency	Fast; time complexity is O(n)	Slow; time complexity is O(kmn)	Moderate; time complexity is O(n log n)
Processing of noise or outliers	No	No	Yes
Visualization in Google Maps (ease of programming on Google Maps)	Hard	Easy	Easy
Clustering results change with map resolution (hierarchical levels)	Yes; this study provides three grid scales	No	No
Presentation of species distribution	Distribution is difficult to discern	Close to the actual distribution	Closer to the actual distribution
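
To make the parameter and complexity rows of the table concrete, the following is a minimal Python sketch of the grid-based and density-based approaches. It is not the authors' implementation: the function names `grid_cluster` and `dbscan`, the sample coordinates, and the chosen cell size, `eps`, and `min_pts` values are all hypothetical, and distances are taken as plain Euclidean distances in degrees for simplicity. The density-based function uses a brute-force neighbour search, so it runs in O(n^2); reaching the O(n log n) figure in the table would presumably require a spatial index, which is omitted here.

```python
from collections import defaultdict
from math import floor, hypot


def grid_cluster(points, cell_deg):
    """Bin (lat, lon) points into square cells of side cell_deg degrees.

    A single pass over the points, so the cost is O(n). Returns a dict that
    maps each occupied cell index to (count, mean_lat, mean_lon).
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for lat, lon in points:
        key = (floor(lat / cell_deg), floor(lon / cell_deg))
        cell = sums[key]
        cell[0] += 1
        cell[1] += lat
        cell[2] += lon
    return {k: (n, s_lat / n, s_lon / n) for k, (n, s_lat, s_lon) in sums.items()}


def dbscan(points, eps, min_pts):
    """Naive density-based clustering; returns one label per point, -1 is noise.

    A point is a core point when at least min_pts points (itself included)
    lie within distance eps of it.
    """
    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if hypot(x - xi, y - yi) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:
                queue.extend(more)   # j is itself a core point, keep expanding
    return labels


# Made-up occurrence coordinates (lat, lon), for illustration only.
points = [
    (24.10, 121.10), (24.12, 121.11), (24.11, 121.13), (24.13, 121.12),
    (23.00, 120.30), (23.02, 120.31), (23.01, 120.33),
    (25.00, 121.90),
]

cells = grid_cluster(points, cell_deg=1.0)     # one grid level; rerun per zoom level
labels = dbscan(points, eps=0.05, min_pts=3)   # isolated points are labelled -1
print(cells)
print(labels)
```

Calling `grid_cluster` once per cell size gives one set of aggregated markers per map scale, which is how a grid-based method can offer several levels, whereas the `dbscan` result depends only on `eps` and `min_pts` and does not change with zoom.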