Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Tang, Ziqi, Somia, Nirali, Yu, Yiyang, Koo, Peter K (July 2025) Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Genome Biology, 26 (1). ISSN 1474-760X (Public Dataset)

Published Version (PDF): 10.1186.s13059-025-03674-8.pdf
Available under License Creative Commons Attribution Non-commercial No Derivatives.

DOI: 10.1186/s13059-025-03674-8

Abstract

Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested after fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.

Results: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Nevertheless, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study.

Discussion: This work highlights a major gap in current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Item Type:	Paper
Subjects:	bioinformatics; bioinformatics > genomics and proteomics; bioinformatics > computational biology > algorithms; bioinformatics > computational biology
CSHL Authors:	Tang, Ziqi; Koo, Peter K; Somia, Nirali
Communities:	CSHL labs > Koo Lab; School of Biological Sciences > Publications
SWORD Depositor:	CSHL Elements
Depositing User:	CSHL Elements
Date:	14 July 2025
Date Deposited:	15 Jul 2025 12:45
Last Modified:	30 Sep 2025 17:22
PMCID:	PMC12261763
Related URLs:	Publisher
Dataset ID:	https://doi.org/10.5281/zenodo.8279715; ENCODE: ENCSR463IRX; ENCODE: ENCSR460LZI; ENCODE: ENCSR022GQD; ENCODE: ENCSR382BVV; ENCODE: ENCSR244FWB; ENCODE: ENCSR405QCT; ENCODE: ENCSR203UFY; ENCODE: ENCSR336MKI; https://doi.org/10.5281/zenodo.7011631; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126550; http://ascot.cs.jhu.edu/; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178230
URI:	https://repository.cshl.edu/id/eprint/41903
