A community effort to optimize sequence-based deep learning models of gene regulation

Rafi, Abdul Muntakim, Nogina, Daria, Penzar, Dmitry, Lee, Dohoon, Lee, Danyeong, Kim, Nayeon, Kim, Sangyeup, Kim, Dohyeon, Shin, Yeojin, Kwak, Il-Youp, Meshcheryakov, Georgy, Lando, Andrey, Zinkevich, Arsenii, Kim, Byeong-Chan, Lee, Juhyun, Kang, Taein, Vaishnav, Eeshit Dhaval, Yadollahpour, Payman, Random Promoter DREAM Challenge Consortium, Kim, Sun, Albrecht, Jake, Regev, Aviv, Gong, Wuming, Kulakovskiy, Ivan V, Meyer, Pablo, de Boer, Carl G (October 2024) A community effort to optimize sequence-based deep learning models of gene regulation. Nature Biotechnology. ISSN 1087-0156 (Public Dataset)

PDF
10.1038.s41587-024-02414-w.pdf - Published Version
Available under License Creative Commons Attribution.

URL: https://www.ncbi.nlm.nih.gov/pubmed/39394483

DOI: 10.1038/s41587-024-02414-w

Abstract

A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
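The combinatorial step described in the abstract (splitting models into modular building blocks and testing every combination) can be illustrated with a minimal sketch. The slot names and model labels below are hypothetical placeholders rather than the actual modules or teams from the challenge, and the sketch only enumerates the combinations. It does not train or score any model.

```python
from itertools import product

# Hypothetical module slots, each offering one block per top model.
# Slot names and model labels are illustrative assumptions only.
BLOCK_CHOICES = {
    "input_stage": ["model_1", "model_2", "model_3"],
    "core_stage": ["model_1", "model_2", "model_3"],
    "output_stage": ["model_1", "model_2", "model_3"],
}


def enumerate_combinations(block_choices):
    """Return every way to pick one block per slot, as a list of dicts."""
    slots = list(block_choices)
    options = [block_choices[slot] for slot in slots]
    return [dict(zip(slots, combo)) for combo in product(*options)]


combinations = enumerate_combinations(BLOCK_CHOICES)
print(len(combinations))  # 3 choices in each of 3 slots gives 27
print(combinations[0])
```

Each dictionary in `combinations` describes one hybrid model assembled from blocks of possibly different source models. In a real evaluation, each such configuration would be trained and benchmarked separately.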

Item Type:	Paper
Subjects:	bioinformatics
	bioinformatics > genomics and proteomics
	bioinformatics > genomics and proteomics > genetics & nucleic acid processing
	bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification
	bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > genes, structure and function
	bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > genes, structure and function > gene regulation
CSHL Authors:	Koo, Peter K
Communities:	CSHL labs > Koo Lab
SWORD Depositor:	CSHL Elements
Depositing User:	CSHL Elements
Date:	11 October 2024
Date Deposited:	01 Nov 2024 12:24
Last Modified:	01 Nov 2024 12:24
Related URLs:	Publisher
Dataset ID:	GEO: GSE254493; https://doi.org/10.5281/zenodo.10633252; GEO: GSE183939; https://doi.org/10.5281/zenodo.8219231
URI:	https://repository.cshl.edu/id/eprint/41723
