The January 2011 issue of Nature Methods highlights next-generation sequencing (NGS) de-novo assembly and sets expectations for the products of short-read assemblers, both in the editorial by Ewan Birney (whose name is synonymous with computational biology) and in an article from the Eichler Lab (a lab largely focused on studying genomic structural evolution).
The group’s focus was the recent SOAPdenovo assemblies of a Han Chinese individual and a Yoruban individual. They investigated novel genomic content, identified viral contamination, showed that repeats are under-represented because nearly identical sequences collapse, highlighted the difficulty of correctly representing segmental duplications, and assessed gene content and accuracy. As you might suspect, genes that fall in segmental duplications and that are part of highly similar paralogs/gene families are subject to misrepresentation or loss. This last area, gene content, is a primary focus of the vast majority of Cofactor’s de-novo sequencing projects. The editorial and paper are reminders that genomic assembly was not “solved” before next-generation sequencing came along and is certainly not without limitations today. Accurate assembly requires significant design and development in both the molecular biology and the computational approaches to arrive at the best possible representation of a genome, given the inherent limitations of the reads on which the assemblies are built.
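To make the repeat-collapse point concrete, here is a toy sketch in Python. It is not the SOAPdenovo algorithm or anything taken from the paper. It only illustrates one reason collapse can happen, under the assumption that an assembler indexes sequence by k-mers and stores each distinct k-mer once, as de Bruijn graph assemblers typically do. The sequences, the function names (kmers, kmer_collapse), and the choices of k are all made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_collapse(seq, k):
    """Return (total k-mer positions, distinct k-mers) for seq.

    A structure that keeps each distinct k-mer once cannot tell how many
    times a k-mer occurred, so total - distinct is the amount of sequence
    that has been folded onto an earlier copy.
    """
    counts = Counter(kmers(seq, k))
    return sum(counts.values()), len(counts)


# Hypothetical toy genome: a 12 bp repeat unit present twice,
# separated and flanked by unique sequence.
repeat = "ACGTTGCAAGCT"
genome = "TTAGGC" + repeat + "GATCCA" + repeat + "CCTGAA"

for k in (8, 14):
    total, distinct = kmer_collapse(genome, k)
    print(f"k={k}: {total} k-mer positions, {distinct} distinct, "
          f"{total - distinct} collapsed")
```

With k=8, shorter than the 12 bp repeat, the five k-mers that lie entirely inside the repeat occur twice but are stored once, so the second copy is indistinguishable from the first. With k=14, longer than the repeat, every k-mer includes some unique flanking sequence and nothing collapses. Real genomes have repeats far longer than any practical k or read length, which is why the under-representation described above is hard to avoid with short reads alone.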
The authors remind us that, when working with any genomic assembly, it is important to keep a healthy level of skepticism, especially about the characterization of low-complexity or repeat regions and about conclusions of gene loss.
With that said, the products of de-novo assemblies are extremely informative, providing insight into the characteristics of the representative genomes now being sequenced at an awesome rate thanks to the huge leaps in sequencing technology and algorithm development over the last 5-10 years. The authors conclude that it is the community’s responsibility to demand higher-quality assemblies, which come only through hybrid approaches that combine multi-platform/multi-library sequencing strategies with intelligent assembly algorithms that make the best use of these data.
Truly revolutionary times.
Jarret Glasscock
Birney, E. Assemblies: the good, the bad, the ugly. Nature Methods 8, 59-60 (2011).
Alkan, C., Sajjadian, S., Eichler, E.E. Limitations of next-generation genome sequence assembly. Nature Methods 8, 61-65 (2011).
*Cofactor Genomics is a firm providing unparalleled experimental design, sequencing on a variety of platforms, and complete analysis solutions tailored to project goals. The scientists at Cofactor have been involved in hundreds of de-novo assemblies over the last 12 years, initially as part of a genome center using Sanger data and, over the last 7 years, on the alpha, beta, and full commercial versions of the majority of next-gen platforms in use today. Cofactor has used those hundreds of de-novo assembly projects as a test bed to derive optimal library and mixed-platform approaches that produce the best de-novo assembly products.