Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

Kundeti, V. K., Rajasekaran, S., Dinh, H., Vaughn, M. W., Thapar, V. (November 2010) Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. BMC Bioinformatics, 11. p. 560.

URL: https://www.ncbi.nlm.nih.gov/pubmed/21078174

DOI: 10.1186/1471-2105-11-560

Abstract

Background: Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories, based on the data structures they employ. The first class uses an overlap/string graph and the second uses a de Bruijn graph. However, with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).

Results: In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model, and in this case it has an optimal I/O complexity of Θ((n log(n/B)) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.

Conclusions: The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on the Eulerian approach. Our algorithms for constructing bi-directed de Bruijn graphs are efficient in parallel and out-of-core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally, our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.

© 2010 Kundeti et al; licensee BioMed Central Ltd.
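
The sorting-based idea mentioned in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function names, the "+"/"-" orientation labels, and the edge record layout are assumptions made for illustration, and the sketch assumes an odd k so that no k-mer equals its own reverse complement. Each (k+1)-mer in a read yields one edge record between two canonical k-mers; sorting the records and merging equal neighbours gives the distinct bi-directed edges with their multiplicities, without enumerating every possible neighbour of a node.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return (canonical k-mer, sign); '+' means the k-mer was already canonical."""
    rc = reverse_complement(kmer)
    return (kmer, "+") if kmer <= rc else (rc, "-")


def flip(sign):
    return "-" if sign == "+" else "+"


def bidirected_edges(reads, k):
    """Illustrative sort-based edge construction (assumes odd k).

    Returns a sorted list of ((u, u_sign, v, v_sign), count) pairs.
    """
    records = []
    for read in reads:
        for i in range(len(read) - k):
            u, u_sign = canonical(read[i:i + k])
            v, v_sign = canonical(read[i + 1:i + k + 1])
            edge = (u, u_sign, v, v_sign)
            # The same overlap seen from the opposite strand.
            mirror = (v, flip(v_sign), u, flip(u_sign))
            # Keep one representative so both strands map to one record.
            records.append(min(edge, mirror))
    # Sorting brings identical edge records together; groupby merges them.
    records.sort()
    return [(edge, sum(1 for _ in group)) for edge, group in groupby(records)]


if __name__ == "__main__":
    for edge, count in bidirected_edges(["ACGTT", "AACGT"], 3):
        print(edge, count)
```

In this sketch the in-memory `sort` stands in for the parallel or external-memory sort that the abstract refers to; the per-read record generation is independent of the alphabet size.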

Item Type:	Paper
Subjects:	bioinformatics > genomics and proteomics > annotation > sequence annotation; bioinformatics > genomics and proteomics > analysis and processing > Sequence Data Processing; bioinformatics > genomics and proteomics > Mapping and Rendering > Sequence Rendering; bioinformatics > computational biology
CSHL Authors:	Thapar, Vishal; Vaughn, Matthew W.
Communities:	CSHL labs > Martienssen lab
Depositing User:	CSHL Librarian
Date:	15 November 2010
Date Deposited:	19 Oct 2011 17:46
Last Modified:	08 Mar 2018 17:05
PMCID:	PMC2996408
Related URLs:	Publisher
URI:	https://repository.cshl.edu/id/eprint/15460
