It’s all about the isoforms

Clients ask us all the time to identify differential expression of transcript isoforms, and for good reason: isoforms of the same gene can differ substantially in structure and therefore in function.

However, despite the success of Cuffdiff and other similar statistical approaches, our answer continues to be that effectively separating expression of isoforms is far from a solved problem.

Our results show that this lack of confidence and ambiguity in isoform expression is due to the unrealistic goal of trying to categorize every read into an isoform. Instead of denying the inherent limitations of the current technology, we would rather accept them and be able to identify 90% of isoforms with high certainty than 100% of the isoforms with doubt.

The limitations I’m referring to are tied to the amount of information contained in the sequence reads themselves: they’re simply not long enough to cover enough of an isoform to map to it uniquely.

We can see this with a simple experiment. Using a database of 78,631 human transcript isoforms, we generated three simulated datasets: one of 100-nucleotide reads, one of 400-nucleotide reads, and one of 500-nucleotide reads*. We aligned those reads back to the database and asked: how many of the isoforms can we identify using each dataset?

That’s almost 90% of the isoforms that we can characterize confidently simply by using longer reads. Put another way, that’s over 5,000 more isoforms that we’re identifying with direct sequence evidence instead of inferring based on a model.
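To make the idea behind the experiment concrete, here is a minimal sketch in Python. It is not our production pipeline, and it makes several simplifying assumptions: reads are error-free, a read is taken from every possible start position, "alignment" is exact string matching, and an isoform counts as identified if at least one of its reads occurs in no other isoform. The function name `identifiable_isoforms`, the four toy isoforms, and the scaled-down read lengths (20 and 50 nt) are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def identifiable_isoforms(transcripts, read_len):
    """Return IDs of transcripts that have at least one error-free read of
    length read_len that occurs in no other transcript (exact matching)."""
    owners = defaultdict(set)  # read sequence -> IDs of transcripts containing it
    for tid, seq in transcripts.items():
        for start in range(len(seq) - read_len + 1):
            owners[seq[start:start + read_len]].add(tid)
    # A read owned by exactly one transcript identifies that transcript.
    return {next(iter(ids)) for ids in owners.values() if len(ids) == 1}


def random_exon(rng, length):
    """Build a random nucleotide string to stand in for an exon."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# Toy gene with five exons; exons b and d are each optionally skipped,
# so the two variable sites are separated by the 30 nt exon c.
rng = random.Random(0)
a, b, c, d, e = (random_exon(rng, n) for n in (40, 40, 30, 40, 40))

toy_isoforms = {
    "iso1": a + b + c + d + e,  # both optional exons included
    "iso2": a + c + d + e,      # exon b skipped
    "iso3": a + b + c + e,      # exon d skipped
    "iso4": a + c + e,          # both skipped
}

for read_len in (20, 50):
    found = identifiable_isoforms(toy_isoforms, read_len)
    print(f"{read_len} nt reads: {len(found)} of {len(toy_isoforms)} isoforms identified")
```

In this toy gene, every splice junction is shared by two isoforms, so a read only pins down an isoform if it spans both variable sites at once. The 20 nt reads are too short to bridge the 30 nt exon between those sites and should identify none of the four isoforms, while the 50 nt reads can bridge it and should identify all four.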

This is not to say that there won’t be improvements on the software side. On the contrary, we believe the hardware improvements have helped to inspire a golden age in algorithm development. But we’d always rather rely on more straightforward techniques that are easier to understand and that make it easier to tell when you’ve got a reliable result.

At the end of the day, we care a lot less about the particular technology or technique than about getting the most confident results to drive research forward.

Update Sept 11, 2013

Based on feedback on Twitter (Thanks Keith and @nextgenseek!), I’ve replotted the results using a zero baseline and showing the percent and number of isoforms missed.

*We use these in current projects by merging overlapping paired 250 bp reads into single long reads, and paired 300 bp reads will be available later this month. If you’re interested in learning more, feel free to contact us.

Dave Messina
Dr. David Messina has spent the last 17 years working in computational biology and genetics. He currently drives the computational R&D and production efforts as Cofactor's Director of Analysis. Outside of the lab, Dave enjoys driving fast in old BMWs.
