Using quality scores and longer reads improves accuracy of Solexa read mapping

Smith, A. D., Xuan, Z. Y., Zhang, M. Q. (February 2008) Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 9 (128). ISSN 1471-2105

Zhang BMC Bioinformatics 2008.pdf - Published Version

URL: http://www.ncbi.nlm.nih.gov/pubmed/18307793
DOI: 10.1186/1471-2105-9-128

Abstract

Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g., ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25-50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/.

Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
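
The idea summarized in the Results, letting base-call quality scores decide which read positions count during mapping, can be illustrated with a minimal Python sketch. This is not RMAP's implementation: the function names, the brute-force scan, the quality cutoff `min_qual` and the mismatch limit `max_mismatches` are all assumptions chosen for illustration. Here, positions whose quality falls below the cutoff are simply skipped when counting mismatches, so an error-prone 3' end does not by itself disqualify an otherwise good alignment.

```python
def quality_aware_mismatches(read, quals, window, min_qual=10):
    """Count mismatches between a read and a reference window of equal length,
    skipping positions whose base-call quality is below min_qual.
    (Hypothetical helper; the cutoff value is an assumption.)"""
    if not (len(read) == len(quals) == len(window)):
        raise ValueError("read, quals and window must have the same length")
    return sum(
        1
        for base, qual, ref_base in zip(read, quals, window)
        if qual >= min_qual and base != ref_base
    )


def map_read(read, quals, reference, max_mismatches=2, min_qual=10):
    """Brute-force scan of the reference for the best unique hit.
    Returns (position, mismatches), or None if no hit is within
    max_mismatches or the best hit is ambiguous."""
    best = None
    ambiguous = False
    n = len(read)
    for pos in range(len(reference) - n + 1):
        mm = quality_aware_mismatches(read, quals, reference[pos:pos + n], min_qual)
        if mm > max_mismatches:
            continue
        if best is None or mm < best[1]:
            best, ambiguous = (pos, mm), False
        elif mm == best[1]:
            ambiguous = True  # a second location ties with the best one
    return None if best is None or ambiguous else best


# Toy example: the last two bases of the read are wrong but have low quality.
reference = "ACGTACGTTTGACCAGTAGC"
read = "TTGACCTT"
quals = [30, 30, 30, 30, 30, 30, 5, 5]

with_quality = map_read(read, quals, reference, max_mismatches=1, min_qual=10)
without_quality = map_read(read, quals, reference, max_mismatches=1, min_qual=0)
print(with_quality, without_quality)
```

In this toy case the read maps uniquely to position 8 when the two low-quality 3' bases are skipped, and fails to map when every position is counted equally. A real mapper would use an index rather than a linear scan; the scan is kept only to make the scoring rule easy to follow.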

Item Type: Paper
Uncontrolled Keywords: GENOME SEARCH
Subjects: bioinformatics
bioinformatics > genomics and proteomics
bioinformatics > genomics and proteomics > Mapping and Rendering
CSHL Authors:
Communities: CSHL labs > Zhang lab
Depositing User: Matt Covey
Date: February 2008
Date Deposited: 22 Feb 2013 19:35
Last Modified: 22 Feb 2013 19:35
PMCID: PMC2335322
Related URLs:
URI: https://repository.cshl.edu/id/eprint/27638
