Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Tang, Ziqi, Somia, Nirali, Yu, Yiyang, Koo, Peter K (July 2025) Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Genome Biology, 26 (1). ISSN 1474-760X (Public Dataset)

10.1186.s13059-025-03674-8.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Abstract

Background: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Because the gLMs in these studies were evaluated only after their weights had been fine-tuned for each downstream task, whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question.

Results: Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Moreover, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study.

Discussion: This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
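
For readers unfamiliar with the baseline input representation named in the abstract, the following is a minimal sketch of one-hot encoding a DNA sequence. It is an illustration only, not the authors' code: the function name `one_hot_encode`, the A/C/G/T channel order, and the choice to map ambiguous bases such as N to all-zero rows are assumptions made here.

```python
import numpy as np

# Channel order is an assumption for illustration; the study may use a different one.
ALPHABET = "ACGT"


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 array with one row per base.

    Bases outside ACGT (for example N) are left as all-zero rows,
    which is one common convention and an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded


if __name__ == "__main__":
    x = one_hot_encode("ACGTN")
    print(x.shape)
```

An array of this shape is what a supervised model trained from scratch would typically consume, whereas a probing setup would instead feed a model the fixed embeddings produced by a pre-trained gLM.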

Item Type: Paper
Subjects: bioinformatics
bioinformatics > genomics and proteomics
bioinformatics > computational biology > algorithms
bioinformatics > computational biology
Communities: CSHL labs > Koo Lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 14 July 2025
Date Deposited: 15 Jul 2025 12:45
Last Modified: 15 Jul 2025 12:45
Dataset ID:
  • https://doi.org/10.5281/zenodo.8279715
  • ENCODE: ENCSR463IRX
  • ENCODE: ENCSR460LZI
  • ENCODE: ENCSR022GQD
  • ENCODE: ENCSR382BVV
  • ENCODE: ENCSR244FWB
  • ENCODE: ENCSR405QCT
  • ENCODE: ENCSR203UFY
  • ENCODE: ENCSR336MKI
  • https://doi.org/10.5281/zenodo.7011631
  • https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126550
  • http://ascot.cs.jhu.edu/
  • https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178230
URI: https://repository.cshl.edu/id/eprint/41903
