Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data

Mo, Ziyi, Siepel, Adam (September 2023) Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. bioRxiv. (Submitted)

2023_Mo_Domain-adaptive_neural_networks_improve_preprint.pdf - Submitted Version
Available under License Creative Commons Attribution Non-commercial.

URL: https://www.ncbi.nlm.nih.gov/pubmed/36909514
DOI: 10.1101/2023.03.01.529396

Abstract

Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain-adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
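
For readers unfamiliar with the gradient reversal layer named in the abstract, the following is a minimal, dependency-free sketch of the general idea, not the authors' implementation. A GRL passes features through unchanged in the forward pass and flips the sign of the gradient (scaled by a factor, here called lam) in the backward pass, so that the feature extractor is pushed toward features a domain classifier cannot use to tell simulated from real data. The class name, the linear domain classifier, the squared-error loss, and all numeric values are assumptions made for illustration only.

```python
# Minimal sketch of a gradient reversal layer (GRL) in plain Python.
# All names and values are hypothetical; none are taken from dadaSIA.

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed scaling factor for the reversed gradient

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is sign-flipped and scaled.
        return [-self.lam * g for g in grad_output]


def domain_loss_grad(features, weights, label):
    """Squared-error loss of a toy linear domain classifier and its gradient w.r.t. the features."""
    pred = sum(w * f for w, f in zip(weights, features))
    err = pred - label
    loss = err ** 2
    grad = [2.0 * err * w for w in weights]
    return loss, grad


grl = GradientReversal(lam=0.5)
features = [1.0, -2.0]   # toy output of a feature extractor
weights = [0.5, 0.25]    # toy domain-classifier weights

passed = grl.forward(features)
loss, grad_features = domain_loss_grad(passed, weights, label=1.0)
reversed_grad = grl.backward(grad_features)
```

In this sketch the domain classifier itself would still be updated with the ordinary gradient; only the gradient passed further back to the feature extractor is reversed. In practice the same behavior is usually obtained through a deep-learning framework's custom-gradient mechanism rather than written by hand.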

Item Type: Paper
Subjects: bioinformatics > computational biology > algorithms > machine learning
organs, tissues, organelles, cell types and functions > tissues types and functions > neural networks
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > population genetics
Communities: CSHL labs > Siepel lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 6 September 2023
Date Deposited: 29 Sep 2023 17:53
Last Modified: 29 Sep 2023 17:53
PMCID: PMC10002701
URI: https://repository.cshl.edu/id/eprint/41071