Combining natural language processing and metabarcoding to reveal pathogen-environment associations.

Molik, David C, Tomlinson, DeAndre, Davitt, Shane, Morgan, Eric L, Sisk, Matthew, Roche, Benjamin, Meyers, Natalie, Pfrender, Michael E (April 2021) Combining natural language processing and metabarcoding to reveal pathogen-environment associations. PLoS Neglected Tropical Diseases, 15 (4). e0008755. ISSN 1935-2735

URL: https://www.ncbi.nlm.nih.gov/pubmed/33826634
DOI: 10.1371/journal.pntd.0008755

Abstract

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene-regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics, the number of topics, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.
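
The abstract states that the number of topics was chosen from the model coherence score but does not say which coherence measure was used. As a minimal sketch only, the following assumes the UMass document co-occurrence measure and a toy tokenized corpus. The function names, the corpus, and the candidate topic lists are all hypothetical illustrations, not the authors' code or data. Fitting the Latent Dirichlet Allocation model itself is omitted: the candidate topics are given directly as lists of top words.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (an assumed measure, not confirmed by the paper).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over word pairs with w_j ranked
    above w_i, where D counts the documents containing the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # always j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + 1) / d_j)
    return score


def mean_coherence(topics, documents):
    """Average the per-topic coherence across all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


def pick_num_topics(candidates, documents):
    """Return the topic count whose model has the highest mean coherence.

    `candidates` maps a topic count k to that model's topics, each topic
    being a list of its top words in descending order of weight.
    """
    return max(candidates, key=lambda k: mean_coherence(candidates[k], documents))


# Hypothetical toy corpus: each document is a list of tokens.
documents = [
    ["soil", "wood", "decay", "fungi"],
    ["soil", "wood", "fungi"],
    ["marine", "plankton", "water"],
    ["marine", "water", "plankton", "salinity"],
]

# Hypothetical top words from two candidate models (k = 1 and k = 2).
candidates = {
    1: [["soil", "marine", "wood", "water"]],
    2: [["soil", "wood", "fungi"], ["marine", "water", "plankton"]],
}

best_k = pick_num_topics(candidates, documents)
```

On this toy corpus the two-topic candidate scores higher, because its top words co-occur within the same documents, whereas the single mixed topic pairs words that never appear together. In a real pipeline the candidate topics would come from fitted topic models, and the subsequent Random Forest assignment step would be trained on the resulting topic labels.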

Item Type: Paper
Subjects: diseases & disorders > Bacterial Infections
bioinformatics
diseases & disorders
bioinformatics > computational biology > algorithms
organism description > bacteria
bioinformatics > computational biology
bioinformatics > computational biology > algorithms > machine learning
CSHL Authors:
Communities: CSHL labs > Martienssen lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: April 2021
Date Deposited: 11 May 2021 18:43
Last Modified: 25 Jan 2024 15:27
PMCID: PMC8055023
URI: https://repository.cshl.edu/id/eprint/40084
