Tang, Ziqi, Somia, Nirali, Yu, Yiyang, Koo, Peter K (July 2025) Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Genome Biology, 26 (1). ISSN 1474-760X (Public Dataset)
PDF: 10.1186.s13059-025-03674-8.pdf (Published Version, 5MB). Available under License Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.

Results: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Nevertheless, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study.

Discussion: This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Item Type: | Paper |
---|---|
Subjects: | bioinformatics; bioinformatics > genomics and proteomics; bioinformatics > computational biology > algorithms; bioinformatics > computational biology |
Communities: | CSHL labs > Koo Lab |
SWORD Depositor: | CSHL Elements |
Depositing User: | CSHL Elements |
Date: | 14 July 2025 |
Date Deposited: | 15 Jul 2025 12:45 |
Last Modified: | 15 Jul 2025 12:45 |
URI: | https://repository.cshl.edu/id/eprint/41903 |