
Clients ask us all the time to identify differential expression of transcript isoforms, and for good reason: we all know isoforms from the same gene can have significantly different structures from one another and thus significantly different function.

However, despite the success of Cuffdiff and other similar statistical approaches, our answer continues to be that effectively separating expression of isoforms is far from a solved problem.

Our results show that this ambiguity in isoform expression, and the lack of confidence that comes with it, is due to the unrealistic goal of trying to assign every read to an isoform. Rather than deny the inherent limitations of the current technology, we prefer to accept them: we would rather identify 90% of isoforms with high certainty than 100% of isoforms with doubt.

The limitations I’m referring to are tied to the amount of information contained in the sequence reads themselves: they’re simply not long enough to cover enough of an isoform to map to it uniquely.

We can see this with a simple experiment. Using a database of 78,631 human transcript isoforms, we generated three simulated datasets: one of 100-nucleotide reads, one of 400-nucleotide reads, and one of 500-nucleotide reads*. We aligned those reads back to the database and asked: how many of the isoforms can we identify using each dataset?

That’s almost 90% of the isoforms that we can characterize confidently simply by using longer reads. Put another way, that’s over 5000 more isoforms that we’re identifying with direct sequence evidence instead of inferring based on a model.
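To make the idea behind the experiment concrete, here is a minimal toy sketch in Python. It is not the pipeline we used: the exon sequences, isoform names, and the identifiable_isoforms function are all made up for illustration, the "reads" are error-free windows cut from every position of every isoform, and exact substring matching stands in for a real aligner. An isoform counts as identifiable if at least one read of the given length occurs in that isoform and no other.

```python
from collections import defaultdict

# Hypothetical 10 nt exons, invented for this illustration only.
E1, E2, E3, E4, E5 = (
    "ATCCATCACA", "CTTACCATCT", "GATTACAGAG", "CACTTCAATC", "TACCTAATAA",
)

# Four made-up isoforms of one gene: E2 and E4 are optional cassette exons.
ISOFORMS = {
    "both":    E1 + E2 + E3 + E4 + E5,
    "only_E2": E1 + E2 + E3 + E5,
    "only_E4": E1 + E3 + E4 + E5,
    "neither": E1 + E3 + E5,
}

def identifiable_isoforms(isoforms, read_len):
    """Return names of isoforms with at least one read found in no other isoform."""
    # Map every possible error-free read to the set of isoforms containing it.
    owners = defaultdict(set)
    for name, seq in isoforms.items():
        for start in range(len(seq) - read_len + 1):
            owners[seq[start:start + read_len]].add(name)
    # A read seen in exactly one isoform is direct evidence for that isoform.
    return {next(iter(names)) for names in owners.values() if len(names) == 1}

for read_len in (6, 12):
    found = identifiable_isoforms(ISOFORMS, read_len)
    print(f"{read_len} nt reads: {len(found)} of {len(ISOFORMS)} isoforms identifiable")
```

In this toy gene, every exon and every single exon junction is shared by at least two isoforms, so the 6 nt reads identify none of the four isoforms. A 12 nt read can span the middle exon together with a base on either side, which is enough to tell all four apart. Real transcriptomes are far messier, but the effect of read length is the same in kind.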

This is not to say that there won’t be improvements on the software side. On the contrary, we believe the hardware improvements have helped to inspire a golden age in algorithm development. But we’d always rather rely on more straightforward techniques that are easier to understand and that make it easier to tell when you’ve got a reliable result.

At the end of the day, we care a lot less about the particular technology or technique than about getting the most confident results to drive research forward.






Update Sept 11, 2013

Based on feedback on Twitter (Thanks Keith and @nextgenseek!), I’ve replotted the results using a zero baseline and showing the percent and number of isoforms missed.






* We are using these in current projects by merging overlapping paired 250 bp reads into single long reads, and paired 300 bp reads will be available later this month. If you’re interested in learning more, feel free to contact us.

Dave Messina
Dr. David Messina serves as Cofactor's Chief Operations Officer. He has spent the last 19 years in computational biology and genetics. He worked on the Human Genome Project at Washington University in Saint Louis, trained in molecular biology and human genetics at the University of Chicago, and earned his PhD in computational biology in Stockholm, Sweden.