A demonstration of unsupervised machine learning in species delimitation

Derkarabetian, S., Castillo, S., Koo, P. K., Ovchinnikov, S., Hedin, M. (October 2019) A demonstration of unsupervised machine learning in species delimitation. Mol Phylogenet Evol, 139. p. 106562. ISSN 1055-7903

URL: https://www.ncbi.nlm.nih.gov/pubmed/31323334
DOI: 10.1016/j.ympev.2019.106562

Abstract

One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space, the ability to accommodate various data types, and has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.

Item Type: Paper
Subjects: bioinformatics
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification
bioinformatics > genomics and proteomics > genetics & nucleic acid processing
bioinformatics > genomics and proteomics
bioinformatics > computational biology > algorithms
bioinformatics > computational biology
evolution
bioinformatics > computational biology > algorithms > machine learning
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > single nucleotide polymorphism
Communities: CSHL labs > Koo Lab
Depositing User: Matthew Dunn
Date: October 2019
Date Deposited: 16 Sep 2019 16:54
Last Modified: 02 Feb 2024 14:44
PMCID: PMC6880864
URI: https://repository.cshl.edu/id/eprint/38392
