Defining the extent of gene function using ROC curvature

Fischer, Stephan, Gillis, Jesse (October 2022) Defining the extent of gene function using ROC curvature. Bioinformatics. btac692. ISSN 1367-4803

Available under License Creative Commons Attribution.

URL: https://www.ncbi.nlm.nih.gov/pubmed/36271855
DOI: 10.1093/bioinformatics/btac692

Abstract

MOTIVATION: Interactions between proteins help us understand how genes are functionally related and how they contribute to phenotypes. Experiments provide imperfect "ground truth" information about a small subset of potential interactions in a specific biological context, which can then be extended to the whole genome across different contexts, such as conditions, tissues, or species, through machine learning methods. However, evaluating the performance of these methods remains a critical challenge. Here, we propose to evaluate the generalizability of gene characterizations through the shape of performance curves.

RESULTS: We identify Functional Equivalence Classes (FECs), subsets of annotated and unannotated genes that jointly drive performance, by assessing the presence of straight lines in ROC curves built from gene-centric prediction tasks, such as function or interaction predictions. FECs are widespread across data types and methods, and they can be used to evaluate the extent and context-specificity of functional annotations in a data-driven manner. For example, FECs suggest that B cell markers can be decomposed into shared primary markers (10 to 50 genes) and tissue-specific secondary markers (100 to 500 genes). In addition, FECs suggest the existence of functional modules that span a wide range of the genome, with marker sets spanning at most 5% of the genome and data-driven extensions of Gene Ontology sets spanning up to 40% of the genome. Simple to assess visually and statistically, the identification of FECs in performance curves paves the way for novel functional characterization and increased robustness in the definition of functional gene sets.

AVAILABILITY: Code for analyses and figures is available at https://github.com/yexilein/pyroc.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
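
To make the idea of "straight lines in ROC curves" concrete, the following is a minimal sketch, not the authors' method and not the pyroc API. It assumes a ranked list of genes with binary annotations, builds the ROC curve with NumPy, and searches for the longest run of ROC points that stays within a small distance of the chord joining its endpoints. The function names, the tolerance `tol`, the minimum length `min_len`, and the synthetic ranking are all hypothetical choices made for illustration; the paper's own statistical assessment may differ.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (fpr, tpr) arrays for the ROC curve, starting at (0, 0).

    Genes are ranked by decreasing score; ties are broken by input order.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels, dtype=bool)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~y) / (~y).sum()])
    return fpr, tpr

def longest_straight_segment(fpr, tpr, tol=0.02, min_len=10):
    """Return (i, j), the longest index range whose ROC points all lie
    within `tol` (perpendicular distance) of the chord from point i to j.

    Brute-force search, meant for illustration on small inputs.
    """
    n = len(fpr)
    best = (0, 0)
    for i in range(n):
        for j in range(n - 1, i + min_len - 1, -1):
            if j - i <= best[1] - best[0]:
                break  # cannot beat the current best from this start
            dx, dy = fpr[j] - fpr[i], tpr[j] - tpr[i]
            # Distance of every point in [i, j] to the chord (i -> j).
            cross = dx * (tpr[i:j + 1] - tpr[i]) - dy * (fpr[i:j + 1] - fpr[i])
            if np.abs(cross).max() / np.hypot(dx, dy) <= tol:
                best = (i, j)
                break  # first hit from the top is the longest for this i
    return best

# Synthetic ranking (hypothetical): 20 top-ranked positives, then a block of
# 300 genes in which every third gene is a positive, then 100 negatives.
labels = np.array([1] * 20 + [1, 0, 0] * 100 + [0] * 100)
scores = np.arange(len(labels), 0, -1)

fpr, tpr = roc_points(scores, labels)
i, j = longest_straight_segment(fpr, tpr)
print(f"longest straight segment spans ranks {i} to {j}")
```

In this toy ranking, the middle block mixes annotated and unannotated genes at a constant rate, so it traces an approximately straight ROC segment; under the abstract's description, such a block would be a candidate FEC. The tolerance controls how much staircase noise is accepted before a segment is no longer treated as straight.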

Item Type: Paper
Subjects: bioinformatics
bioinformatics > genomics and proteomics
bioinformatics > genomics and proteomics > Mapping and Rendering
bioinformatics > computational biology > algorithms
bioinformatics > computational biology
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > genes, structure and function
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > genes, structure and function > genes: types
bioinformatics > computational biology > algorithms > machine learning
bioinformatics > genomics and proteomics > Mapping and Rendering > ontology
CSHL Authors:
Communities: CSHL labs > Gillis Lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 22 October 2022
Date Deposited: 13 Dec 2022 16:27
Last Modified: 11 Jan 2024 20:03
PMCID: PMC9750128
URI: https://repository.cshl.edu/id/eprint/40771
