Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score

Lee, H., Schatz, M. C. (August 2012) Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics, 28 (16). pp. 2097-2105. ISSN 1367-4803

PDF (Paper): Schatz Bioinformatics 2012.pdf - Published Version (711kB)
URL: http://www.ncbi.nlm.nih.gov/pubmed/22668792
DOI: 10.1093/bioinformatics/bts330

Abstract

Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state of the art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here, we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position and thus measures the overall composition of the genome itself.

Results: We have developed the Genome Mappability Analyzer to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found that discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the 'dark matter' of the genome, including known clinically relevant variations in these regions.
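
The abstract describes the GMS only as a weighted probability that a read can be unambiguously mapped to a position. As a minimal sketch of one way such a per-position score could be computed, the Python below averages, over all reads overlapping a position, the probability that each read was mapped correctly, taking that probability from a Phred-scaled mapping quality as 1 - 10^(-Q/10). The function names, the (start, length, mapq) read tuples, and the 0-100 scaling are assumptions made for illustration; this is not the Genome Mappability Analyzer's code, and the exact weighting should be checked against the paper.

```python
from typing import Iterable, List, Optional, Tuple


def mapq_to_prob_correct(mapq: float) -> float:
    """Convert a Phred-scaled mapping quality to P(mapping is correct)."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


def mappability_scores(
    genome_length: int,
    reads: Iterable[Tuple[int, int, float]],
) -> List[Optional[float]]:
    """Illustrative per-position mappability score on a 0-100 scale.

    `reads` holds hypothetical (start, length, mapq) tuples with 0-based
    starts. Each position receives the mean probability of correct mapping
    over the reads that cover it, or None if no read covers it.
    """
    totals = [0.0] * genome_length
    counts = [0] * genome_length
    for start, length, mapq in reads:
        p = mapq_to_prob_correct(mapq)
        # Clip the read to the genome bounds before accumulating.
        for pos in range(max(start, 0), min(start + length, genome_length)):
            totals[pos] += p
            counts[pos] += 1
    return [100.0 * t / c if c else None for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # One confidently mapped read (MAPQ 30) and one ambiguous read (MAPQ 0).
    example_reads = [(0, 3, 30.0), (2, 3, 0.0)]
    print(mappability_scores(6, example_reads))
```

In this toy input, positions covered only by the MAPQ 30 read score close to 100, positions covered only by the MAPQ 0 read score 0, and the position where they overlap falls in between. Low-scoring positions correspond to the repeat-rich regions the abstract calls the 'dark matter' of the genome.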

Item Type: Paper
Uncontrolled Keywords: alignment ultrafast sequence
Subjects: bioinformatics
bioinformatics > genomics and proteomics > genetics & nucleic acid processing
bioinformatics > genomics and proteomics
bioinformatics > genomics and proteomics > Mapping and Rendering
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > genomes > genome rendering
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > genomes
Communities: CSHL labs > Schatz lab
CSHL Cancer Center Program > Cancer Genetics
Depositing User: Matt Covey
Date: 15 August 2012
Date Deposited: 30 Jan 2013 17:14
Last Modified: 14 Oct 2015 19:01
PMCID: PMC3413383
URI: https://repository.cshl.edu/id/eprint/26987
