SVCollector: Optimized sample selection for cost-efficient long-read population sequencing

Ranallo-Benavidez, Rhyker, Lemmon, Zachary, Soyk, Sebastian, Aganezov, Sergey, Salerno, William, McCoy, Rajiv, Lippman, Zachary, Schatz, Michael, Sedlazeck, Fritz (August 2020) SVCollector: Optimized sample selection for cost-efficient long-read population sequencing. BioRxiv. (Unpublished)

DOI: 10.1101/2020.08.06.240390

Abstract

An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g. microarrays, exome capture, short-read WGS), from which a few individuals are selected for resequencing using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, studies of human variation have historically focused on individuals with European ancestry, but this represents a small fraction of the overall diversity. To address this goal, SVCollector (https://github.com/fritzsedlazeck/SVCollector) identifies the optimal subset of individuals for resequencing. SVCollector analyzes a population-level VCF file from a low-resolution genotyping study. It then computes a ranked list of samples that maximizes the total number of variants present in a subset of a given size. To solve this optimization problem, SVCollector implements a fast greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector to simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3K Rice Genomes Project and show that the rankings it computes are more representative than widely used naive strategies. Notably, we show that when selecting an optimal subset of 100 samples in these two cohorts, SVCollector identifies individuals from every subpopulation, while naive methods yield an unbalanced selection. Finally, we show that the number of variants present in cohorts of different sizes selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
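
The greedy heuristic mentioned in the abstract can be illustrated with a minimal Python sketch. This is not SVCollector's own code: the function name `greedy_rank`, the dictionary input (sample name mapped to the set of variant IDs that sample carries), the alphabetical tie-breaking, and the toy cohort are all assumptions made for illustration. At each step the sketch picks the sample that adds the most variants not yet covered and records the cumulative total.

```python
from typing import Dict, List, Set, Tuple


def greedy_rank(sample_variants: Dict[str, Set[str]], k: int) -> List[Tuple[str, int]]:
    """Rank up to k samples; each pick adds the most variants not yet covered.

    Hypothetical helper for illustration, not part of SVCollector.
    Returns (sample, cumulative number of distinct variants) pairs.
    """
    covered: Set[str] = set()
    remaining = dict(sample_variants)
    ranking: List[Tuple[str, int]] = []
    while remaining and len(ranking) < k:
        # Iterating in sorted order makes ties resolve to the
        # alphabetically first sample, so the output is deterministic.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        covered |= remaining.pop(best)
        ranking.append((best, len(covered)))
    return ranking


# Toy cohort (made-up data): sample name -> variant IDs present in that sample.
cohort = {
    "A": {"v1", "v2", "v3"},
    "B": {"v1", "v2"},
    "C": {"v4", "v5"},
    "D": {"v3", "v4"},
}

result = greedy_rank(cohort, 2)
print(result)
```

On this toy input, sample B is never preferred over C even though both carry two variants, because B's variants are already covered by A. A greedy choice like this is fast but not guaranteed to be optimal, which is consistent with the abstract's mention of a separate exact algorithm based on integer linear programming.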

Item Type:	Paper
Subjects:	organism description > plant > Oryza; organism description > animal > mammal > primates > hominids > human; Investigative techniques and equipment > assays > long-read sequencing
CSHL Authors:	Lemmon, Zachary; Lippman, Zachary B.; Schatz, Michael C.
Communities:	CSHL labs > Lippman lab; CSHL labs > Schatz lab
SWORD Depositor:	CSHL Elements
Depositing User:	CSHL Elements
Date:	6 August 2020
Date Deposited:	24 May 2021 14:57
Last Modified:	24 May 2021 14:57
URI:	https://repository.cshl.edu/id/eprint/40136
