Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph

Zhenyu Li*, Yanxiang Chen*, Desheng Mu*, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, Bicheng Yang and Wei Fan

Advance Access publication date 19 December 2011
Since the completion of the cucumber and panda genome projects using Illumina sequencing in 2009, the global scientific community has had to pay much more attention to this new cost-effective approach to generating the draft sequence of large genomes. To allow new users to more easily understand the assembly algorithms and the optimum software packages for their projects, we make a detailed comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph, from how they match the Lander–Waterman model, to the required sequencing depth and read length. We also discuss the computational efficiency of each class of algorithm, the influence of repeats and heterozygosity, and points of note in the subsequent scaffold linkage and gap closure steps. We hope this review can help further promote the application of second-generation de novo sequencing, as well as aid the future development of assembly algorithms.

Keywords: OLC; DBG; de novo assembly; second-generation
One of the most important tasks in genome biology is to obtain a complete genome sequence, which is achieved through a combination of sequencing technology and assembly software [1–3]. The high cost of Sanger sequencing technology has long been a limiting factor for genome projects, as can be seen from the limited number of large genomes published before 2010. Fortunately, the second-generation sequencing technologies Roche/454 (www.454.com), Illumina/Solexa (www.illumina.com) and AB/SOLiD (www.appliedbiosystems.com), which arrived on the market in 2005 and have developed rapidly since then, have dramatically lowered the cost per sequenced nucleotide and increased throughput by orders of magnitude. However, although the second-generation technologies are comparatively very cheap, their application was mainly restricted to resequencing projects [4, 5] where a good reference sequence existed, owing to their much shorter read length (30–400 bp) in comparison with Sanger sequencing (500–1000 bp). In light of this, a major question that confronted us was: can we de novo sequence and assemble a large genome (>100 Mbp) using short reads? If so, sequencing cost is no longer a limiting factor for most de novo large genome projects, and sequence assembly becomes the major challenge. The evolution of assembly algorithms has accompanied the development of sequencing technologies. Currently, there are two widely used classes of
Corresponding authors. Bicheng Yang. E-mail: firstname.lastname@example.org; Wei Fan. Tel/Fax: +86 0755 22358672; E-mail: email@example.com
*These authors contributed equally to this work.
Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu and Wei Fan are from the DNA sequence assembly team at the Science and Technology department of Beijing Genomics Institute at Shenzhen (BGI-SZ). Bicheng Yang is from the Scientific Cooperation Department, BGI-SZ. BGI-SZ is a leading genomics research center in China.
© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please email: firstname.lastname@example.org
genome projects. To allow new users to more easily understand the assembly algorithms and choose the correct software for their projects, in this perspective, we make detailed comparisons of the two major classes of assembly algorithms: OLC and DBG. We hope this article can help promote the...