Evolutionary-scale protein language models uncover beneficial variants in a Sorghum bicolor diversity panel

Johansen, Natasha H, Sendowski, Janek Sven-Ole, Nikolaidou, Eleni, Chatzivasileiou, Savvas, Wang, Shuai, Song, Baoxing, Olson, Andrew, Bataillon, Thomas, Ramstein, Guillaume P (April 2026) Evolutionary-scale protein language models uncover beneficial variants in a Sorghum bicolor diversity panel. bioRxiv. ISSN 2692-8205 (Submitted)

10.64898.2026.04.10.717708.pdf - Submitted Version
Available under License Creative Commons Attribution Non-commercial.

Abstract

Quantitative genetic approaches such as genome-wide association studies and genomic prediction are widely used to identify favourable genetic variation, but their resolution is limited by linkage disequilibrium. Comparative genomics approaches, especially protein language models (PLMs), have emerged as powerful alternatives that detect phylogenetic residue conservation (PRC) across evolutionary time scales. However, the extent to which these tools can guide the detection of impactful variants for field agronomic traits is still unclear. In this study, we used the pre-trained PLM ESM2 to predict PRC scores of nonsynonymous mutations segregating within a diverse panel of 387 sorghum accessions (SAP). The distribution of fitness effects (DFE) of the same set of nonsynonymous mutations was inferred using unfolded site frequency spectra to assess whether the DFE covaried with PRC scores. Furthermore, we estimated the load of putatively nonneutral mutations in SAP accessions and evaluated associations between this mutation load and phenotypic performance across multiple agronomic traits. Our results show that ESM2 can detect mutations associated with fitness-enhancing effects in SAP, as indicated by an enrichment of positive selection signatures among the variants with positive PRC scores. Significant associations were also detected between phenotypic performance and mutation load for several agronomic traits, indicating that PLMs can identify functionally important genetic variation. However, these signals were not consistent across all traits in the SAP population. Altogether, our findings suggest that large language models may support breeding efforts, as PLM predictions covaried with fitness effects and captured agronomic performance for some traits in plant populations.
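
The two quantities named in the abstract, a per-variant PRC score and a per-accession mutation load, can be illustrated with a minimal sketch. The abstract does not state how either is computed, so everything here is an assumption: the score is taken to be the log-odds of the alternate versus the reference residue under a model's predicted residue probabilities at that position (a common way to score substitutions with masked language models), and the load is taken to be the sum of alternate-allele dosages over variants scoring below a cutoff. The function names, the threshold, and the toy probabilities are hypothetical and are not ESM2 output.

```python
import math


def prc_score(pos_probs, ref, alt):
    """Log-odds of the alternate vs. reference residue at one position.

    Assumption: `pos_probs` maps residues to model-predicted probabilities.
    Negative values mean the model prefers the reference residue.
    """
    return math.log(pos_probs[alt]) - math.log(pos_probs[ref])


def mutation_load(dosages, scores, threshold):
    """Sum alternate-allele dosages over variants scoring below `threshold`.

    Assumption: one simple definition of load; the study may use another.
    """
    return sum(d for d, s in zip(dosages, scores) if s < threshold)


# Toy residue probabilities at one site (made-up values, not ESM2 output).
probs = {"A": 0.70, "V": 0.20, "T": 0.10}

# Variant 0 replaces the favoured residue; variant 1 restores it.
scores = [prc_score(probs, "A", "T"), prc_score(probs, "T", "A")]

# One accession homozygous for variant 0 and heterozygous for variant 1.
load = mutation_load([2, 1], scores, threshold=-1.0)
```

In this toy case only variant 0 falls below the cutoff, so the accession's load equals its dosage at that variant. Per-accession loads of this kind could then be regressed against trait values, which is the sort of association the abstract describes.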

Item Type: Paper
Subjects: organism description > plant
organism description > plant > sorghum
Communities: CSHL labs > Ware lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 13 April 2026
Date Deposited: 01 Jul 2026 14:23
Last Modified: 01 Jul 2026 14:23
URI: https://repository.cshl.edu/id/eprint/42257
