4. Align to the gene set and genome
Depending on a researcher's goals, it may be advantageous to align to both the gene set and the genome. Of course, a first step is to make sure a gene set and genome exist for your organism (or at least for a closely related organism). If not, you will want to have a de novo assembly of one or both generated for your organism prior to any RNA-seq experiments (pssst…. Cofactor can do this).
If you think critically about a gene set, two attributes stand out.
First, gene set calls and annotation may have come solely from gene predictions. In this case, even though gene predictors are very good at this point, and often “tuned” for the organism of interest, there will still be areas (potentially unique to the phenotype of your organism) that will not be annotated. Thus, when aligning to a predicted gene set, one may end up with more false negatives than with a more highly curated gene set. The best type of gene set would be one that combines both predictions and evidence from actual RNA sequencing data.
A second attribute of a gene set is that the set is fixed or closed, much like a microarray. In other words, you cannot discover differentially up- or down-regulated genes that are not part of the gene set. For many clients we work with, this is fine because they are looking for differentially expressed gene candidates that come with some information about their function and, optionally, the pathway they are in. They do not want to spend time chasing candidates where lots of additional molecular or bioinformatics work needs to be completed to understand what that transcript is and does. Also, if a gene set like human, mouse, rat, etc. is used, in most cases all known isoforms will be included in the gene set. With these, we are not only able to deliver differentially expressed genes, but also differentially expressed isoforms of the same gene. Pretty cool!! Many times, I think in analogies (much to the chagrin of the folks at Cofactor) and in this case I think of alignments to a gene set like fishing with thousands of hooks, each with a different bait. The references are the hooks and the raw reads are the fish (I know, I know, simplistic…. but still applicable).
Figure 1. Gene set or transcript reference based on Jon’s analogy above. The wavy blue line at the top represents the surface of the ocean and the curved lines represent hooks……. just kidding, jeeeez!
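To make the "closed set" point concrete, here is a toy sketch (not Cofactor's pipeline, and not a real aligner). It assumes each read already carries a label for the transcript it came from; the names `count_against_gene_set`, `read_origins`, and the gene IDs are made up for illustration. Reads whose origin is not one of the hooks never show up in the counts.

```python
def count_against_gene_set(read_origins, gene_set):
    """Tally reads per reference in a fixed gene set (toy model).

    read_origins: iterable of transcript IDs, one per read (hypothetical labels).
    gene_set: set of reference IDs we are "fishing" with.
    Returns (counts per reference, number of reads with no matching hook).
    """
    counts = {gene: 0 for gene in gene_set}
    unassigned = 0
    for origin in read_origins:
        if origin in counts:
            counts[origin] += 1
        else:
            # A read from a transcript outside the gene set has nowhere to land.
            unassigned += 1
    return counts, unassigned


gene_set = {"geneA", "geneB"}
read_origins = ["geneA", "geneA", "novel1", "geneB", "novel1"]
counts, unassigned = count_against_gene_set(read_origins, gene_set)
print(counts, unassigned)
```

In this toy run, the two `novel1` reads are simply lost, however strongly that transcript is regulated.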
Now, let's consider alignments to the genome. When aligning to a de novo assembly of the genome, as opposed to working with a gene set, we are fishing with a net and not hooks. We have references that correspond to, in the best case, chromosomes, and in the worst case, a bunch of unplaced contigs. Anyway, we are not driving the reads towards any specific “set” of references that were defined a priori; we are actually allowing the reads to “fall” across the genome. This situation is considered “hypothesis neutral”, in the sense that we will have reads that align to genic, intronic, and intergenic regions (due to stochastic transcription and missing, incorrect, or incomplete annotation, as well as poly(A) transcripts that are partially spliced or unspliced). A very interesting outcome of aligning to the genome is finding unannotated areas that are differentially expressed between samples and statistically significant (the replicates need to share this signal at a higher-than-noise threshold, so pay attention to your coverage noise cutoff). Many of our clients are VERY interested in these areas because they may have discovered a completely novel gene that has a large amount of regulatory control in their experiment and gives them a great competitive advantage in their field (can anyone spell G-R-E-E-N B-I-O-T-E-C-H?). As you will remember from above, gene sets are cool because your differential candidates will include some information about their purpose and pathway. In the case of unannotated areas that are differentially expressed, however, get out the molecular and proteomics toolboxes because there is a lot of work to be done now. We can help our clients with much of this work but it still prolongs the timeline for discovery → answers.
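As a rough sketch of the "net" idea, the snippet below scans genomic bins, skips those that overlap annotated genes, and keeps unannotated bins where every replicate of at least one condition sits above a coverage noise cutoff. Everything here is an assumption for illustration: the function names, the bin layout, the cutoff of 5, and the simple log2 fold change with a pseudocount of 1. It is not a statistical test; a real analysis would run a proper differential expression method on these regions.

```python
import math


def overlaps(start, end, intervals):
    """True if the half-open bin [start, end) overlaps any (g_start, g_end) interval."""
    return any(start < g_end and g_start < end for g_start, g_end in intervals)


def novel_candidates(bins, genes, noise_cutoff=5.0, min_abs_log2fc=1.0):
    """Return unannotated bins with replicate-supported signal and a large fold change.

    bins: dict mapping (start, end) -> (coverage per replicate in condition A,
                                        coverage per replicate in condition B)
    genes: list of annotated (start, end) intervals on the same reference.
    """
    candidates = []
    for (start, end), (cond_a, cond_b) in bins.items():
        if overlaps(start, end, genes):
            continue  # already annotated, handled by the gene set alignment
        # Replicates must share the signal: all of them above the noise cutoff
        # in at least one condition.
        a_ok = all(c > noise_cutoff for c in cond_a)
        b_ok = all(c > noise_cutoff for c in cond_b)
        if not (a_ok or b_ok):
            continue
        mean_a = sum(cond_a) / len(cond_a)
        mean_b = sum(cond_b) / len(cond_b)
        log2fc = math.log2((mean_b + 1) / (mean_a + 1))  # pseudocount avoids log(0)
        if abs(log2fc) >= min_abs_log2fc:
            candidates.append(((start, end), log2fc))
    return candidates


genes = [(100, 500)]
bins = {
    (0, 100): ([0, 1, 0], [1, 0, 2]),               # intergenic, noise only
    (100, 200): ([50, 55, 60], [200, 210, 190]),    # annotated gene
    (600, 700): ([1, 0, 2], [40, 38, 45]),          # unannotated, replicates agree
    (800, 900): ([0, 0, 1], [30, 0, 1]),            # unannotated, one replicate only
}
result = novel_candidates(bins, genes)
print(result)
```

With these made-up numbers only the (600, 700) bin survives: the annotated bin is skipped, and the bin where a single replicate spikes is rejected by the noise cutoff.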
Taking the above into account, Cofactor decided last year to start performing alignments to the gene set AND genome for all of our RNA-seq clients. We were usually performing them both anyway, just at separate times. So, thanks to Dave Messina and his team at Cofactor, our pipeline does both alignments automatically and spits all the information into our ActiveSite Expression Viewer for candidate discovery (thanks Dave!).
TOMORROW IS GOING TO BE AWESOME!!! (yes…. I just yelled that)
I am going to touch on some of the finer points of replication, and possibly refute some myths……maybe ruffle some feathers….how can this not be fun?