Ziyi, Mo (January 2024) Scalable and robust deep-learning methods power evolutionary-genetic studies of biobank-scale population genomic data. PhD thesis, Cold Spring Harbor Laboratory.
Abstract
The advent of next-generation sequencing has brought forth an era in which datasets containing genomic sequences for thousands of individuals are common. The key to leveraging these rich datasets to generate impactful biomedical insights is high-quality computational tools for biological data analysis. The field of population genetics has a long tradition of using mathematical models to investigate how evolutionary forces shape the dynamics of genetic variants and their biological implications. More recently, artificial intelligence (AI) and machine learning (ML) methods have demonstrated state-of-the-art performance for a wide range of applications involving big data and are increasingly dominant in all areas of quantitative research. My thesis work addresses the unique promises and challenges of analyzing genomic data with AI/ML methods by pioneering rigorous, scalable and innovative deep-learning models for population-genetic inference tasks, which ultimately open up broad opportunities for this emerging field of research.

A fundamental pursuit in evolutionary genetics is to identify beneficial mutations and measure the strength of their selective advantage, based on patterns of genetic variation. Studies of positive selection have led to new insights into the biological relevance of particular genomic elements, such as the discovery of mutations involved in immunity or adaptation to extreme environments. Despite many advances, major limitations remain in the sensitivity and accuracy of computational methods for identifying and characterizing selection. These limitations stem, in part, from the difficulty of estimating selective effects directly from DNA sequences. We developed a novel deep-learning method called Selection Inference using the ARG (SIA), which makes use of a rich set of features extracted from a reconstructed ancestral recombination graph (ARG) to make accurate inferences about selection from large-scale genomic data.
The ARG can be thought of as a collection of local genealogies and therefore augments the raw sequences by encoding their complete evolutionary history. By exploiting both the richness of information in the ARG and the flexibility and scalability of deep-learning models, SIA offers notable improvements over a wide range of previous methods and therefore emerges as the state of the art for selection inference.

A defining feature of the new generation of AI/ML methods for applications in population genetics, including SIA, is that they generally rely on simulated data for supervised training. This simulate-and-train paradigm has the advantage of virtually unlimited and perfectly labeled training data, but the disadvantage that its performance depends strongly on the modeling assumptions of the simulations. These methods can fail catastrophically when the simulations are mis-specified, such as when a demographic model fails to include a bottleneck event or migration between populations. To go beyond the current ad hoc methods for handling this essential problem, we devised a domain-adaptive framework for deep-learning models trained on simulated population-genetic data. This approach uses domain adaptation (a specific form of transfer learning) to train models on one data distribution (simulated genomic data) so that they perform well when applied to datasets drawn from a different distribution (real genomic data). This framework is the first to effectively address the critical problem of simulation mis-specification, which has hitherto been the major concern about current applications of AI/ML approaches in population genetics.

Our novel methodological frameworks mark a pivotal step toward capitalizing on hardware and software advancements for AI/ML, but they are only the beginning of AI/ML approaches to evolutionary modeling.
Recently, large language models (LLMs) of protein and DNA have shown promising performance on a variety of problems in molecular biology, such as protein structure prediction and variant effect prediction. Similarly, large generative pre-trained evolutionary models based on genealogical embeddings of the ARG have the potential to revolutionize population-genetic research in the future. Such models can be trained in a self-supervised manner on a very wide range of simulations to learn the grammar and logic of how evolutionary processes manifest in different topologies of the ARG, much like the way LLMs “understand” natural languages. Generative models of evolution can subsequently be fine-tuned to perform diverse tasks such as inference of demography, population structure or admixture events. From this line of research, which my thesis helped to pioneer, many more powerful AI/ML methods will emerge in the coming years to advance population-genetic research and other areas of genomics.
Item Type: | Thesis (PhD) |
---|---|
Subjects: | bioinformatics; bioinformatics > genomics and proteomics > genetics & nucleic acid processing; bioinformatics > genomics and proteomics; bioinformatics > computational biology > algorithms; bioinformatics > computational biology; bioinformatics > computational biology > algorithms > machine learning; bioinformatics > genomics and proteomics > genetics & nucleic acid processing > population genetics |
CSHL Authors: | |
Communities: | CSHL labs > Siepel lab; School of Biological Sciences > Theses |
Depositing User: | Kathleen McGuire |
Date: | January 2024 |
Date Deposited: | 29 Aug 2024 18:34 |
Last Modified: | 29 Aug 2024 18:34 |
URI: | https://repository.cshl.edu/id/eprint/41642 |