Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Tang, Ziqi and Koo, Peter K (March 2024) Evaluating the representational power of pre-trained DNA language models for regulatory genomics. bioRxiv. (Public Dataset) (Submitted)

2024.02.29.582810v1.full.pdf - Submitted Version
Available under a Creative Commons Attribution-NonCommercial-NoDerivatives license.

Download (7MB)
DOI: 10.1101/2024.02.29.582810

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation of current gLMs, raising potential issues with conventional pre-training strategies for the non-coding genome.
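To make the comparison concrete, below is a minimal, hypothetical Python sketch of the kind of evaluation the abstract describes: features derived from (a) one-hot encoded sequences versus (b) embeddings extracted from a frozen, pre-trained gLM, each feeding the same simple downstream predictor. The function glm_embed is a placeholder and does not correspond to any specific model's API; the sequences and labels are random toy data. The authors' actual evaluation code is available at the linked GitHub repository.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a flattened L x 4 one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        if base in idx:          # unknown bases (e.g. N) stay all-zero
            x[pos, idx[base]] = 1.0
    return x.ravel()

def glm_embed(seq: str, dim: int = 128) -> np.ndarray:
    """Placeholder for extracting a fixed-length embedding from a frozen gLM.
    In practice this would, e.g., mean-pool the model's last-layer token embeddings."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.normal(size=dim).astype(np.float32)

# Toy data: random 200-nt sequences with random binary labels, purely illustrative.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(BASES), size=200)) for _ in range(500)]
labels = rng.integers(0, 2, size=500)

# Train the same downstream predictor on each feature set and compare AUROC.
for name, featurize in [("one-hot", one_hot), ("gLM embedding", glm_embed)]:
    X = np.stack([featurize(s) for s in seqs])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUROC = {auc:.3f}")

In the paper's setting, the downstream models are far richer (cell-type-specific functional genomics predictors) and the gLM embeddings come from real pre-trained models, but the structure of the comparison, identical task and predictor with only the input representation changed, is the same.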

Item Type: Paper
Subjects: bioinformatics
bioinformatics > quantitative biology
bioinformatics > quantitative biology > quantitative genetics
CSHL Authors:
Communities: CSHL labs > Koo Lab
School of Biological Sciences > Publications
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 4 March 2024
Date Deposited: 07 Mar 2024 15:12
Last Modified: 07 Mar 2024 15:12
Related URLs:
Dataset ID:
  • https://github.com/amberT15/LLM_eval
  • https://doi.org/10.5281/zenodo.8279716
URI: https://repository.cshl.edu/id/eprint/41454
