Downstream Validation of RNA-seq Data
“a short sequence read that contains one or more errors (wrong base calls) can coincidentally match—in its mutated form—to another existing sequence in the genome. This has the potential to create a specific kind of false-positive background in RNA-Seq that is of interest because it will preferentially affect gene families. It would thus cause greatest mischief if one family member were very highly expressed (therefore generating occasional mutant reads) and the other gene to which it might map were not, in reality, expressed at all.” – Mortazavi et al, 2008
In 2008, the early days of RNA-seq, Mortazavi et al made the above point about the significant danger sequencing errors could pose to a researcher’s results. If an error returned a base that just happened to change the read’s sequence into another, real sequence, that second sequence would get another hit, increasing its apparent expression. This, along with any number of other potentially problematic situations surrounding the quality of data returned by RNA-seq, led to the writing of many papers comparing the new technology with other, more established methods. Microarray and qPCR were highlighted as important pieces of the validation process for RNA-seq. Protein expression analysis was another. The message was that, while RNA-seq was a powerful new tool, it was indeed new. And like any new technology it needed external validation and benchmarking to ensure quality results.
Fortunately, the intervening scientific eternity of the past eight years has brought us to a point where downstream validation of RNA-seq data is nowhere close to the necessity it once was. In fact, despite repeated prompting (in a conversation putatively titled “Downstream Validation of RNA-seq Data”), Cofactor COO Dave Messina was quite comfortable explaining how RNA-seq data is built to stand on its own. This is a huge benefit to researchers, who can now focus on the meaning of their data and how it fits into their overall hypothesis rather than spending time and money running orthogonal experiments.
Rather than thinking about validation as an event downstream of sequencing, it might be better to view it as an extension, almost a synonym, of quality control. Phred scores are the first piece of this process. Each score expresses the confidence that a base call is accurate (a score of 30, for example, corresponds to a 1-in-1,000 chance that the call is wrong), so filtering on these scores reduces the chances that a false positive will slip through the analysis. Other QC steps help to limit the need for independent validation. Saturation curves provide insight into how many unique transcripts are identified with high enough coverage. Seeing the level of coverage doesn’t really change the way the data is used, but it does indicate how confident one can be in the data; that is, confident that the RNA-seq results accurately represent the actual RNA molecules in the original sample. Correlation plots that show the consistency between technical or biological replicates are yet another QC component of validation. Statistical analysis of samples from the same group allows for confident, accurate comparison between groups.
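To make two of these QC ideas concrete, here is a minimal Python sketch. It is an illustration only, not a description of Cofactor's pipeline. It assumes FASTQ quality strings in the common Phred+33 encoding, and the function names, the expected-error threshold of 1.0, and the log2(count + 1) transform used for the replicate comparison are all hypothetical choices made for this example.

```python
import math

# Assumption: quality strings use Phred+33 encoding (Sanger / Illumina 1.8+).
PHRED_OFFSET = 33


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def expected_errors(quality_string):
    """Sum of per-base error probabilities across one read."""
    return sum(phred_to_error_prob(q) for q in decode_qualities(quality_string))


def passes_filter(quality_string, max_expected_errors=1.0):
    """Keep a read only if its expected number of wrong bases is small.

    The 1.0 threshold is an illustrative default, not a recommendation.
    """
    return expected_errors(quality_string) <= max_expected_errors


def replicate_correlation(counts_a, counts_b):
    """Pearson correlation of log2(count + 1) values for two replicates.

    counts_a and counts_b are per-gene read counts in the same gene order.
    """
    xs = [math.log2(c + 1) for c in counts_a]
    ys = [math.log2(c + 1) for c in counts_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A read whose quality string is all "I" characters (Phred 40) passes the filter easily, while one made of "!" characters (Phred 0) does not. A replicate correlation close to 1 suggests the two libraries are measuring the same expression profile.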
Cofactor Genomics has spent years refining its processes so that the final results from an RNA-seq experiment can stand on their own and immediately be plugged into the larger research project. By putting correctly prepared samples into the sequencer, and then carrying out robust quality control at all stages, the need for additional experiments designed solely to confirm information from RNA-seq is greatly reduced, if not eliminated altogether.
There are times when additional orthogonal validation can be helpful. A researcher may wish to home in on a hit from RNA-seq through other means. In this context, rather than being seen as a burden, experimental validation adds new perspective to RNA-seq data. RNA-seq is powerful in part because it does not require a working hypothesis or a set of targets to start with. Instead, this technology is unbiased and can lead to the discovery of unexpected genes or molecules of interest. When this happens, tools such as qPCR or mass spectrometry may be employed to dig deeper into individual targets. For example, consider a gene that is found by RNA-seq to be overexpressed in a tumor sample. That valuable information could lead to further protein-centric experiments that reveal how the molecule encoded by the overexpressed gene interacts with its binding partner. In these instances, RNA-seq is the first step in a multi-faceted discovery process that spans from genetics to molecular mechanism.
Of course, if you have questions about any of this, give us a call! We’ll schedule a time for you to speak with one of our project scientists about validating RNA-seq data, how we approach this process, and whether you should consider adding orthogonal approaches to your experimental design.