What to do with RNA-seq data
“A crucial prerequisite for a successful RNA-seq study is that the data generated have the potential to answer the biological questions of interest. This is achieved by first defining a good experimental design, that is, by choosing the library type, sequencing depth and number of replicates appropriate for the biological system under study, and second by planning an adequate execution of the sequencing experiment itself, ensuring that data acquisition does not become contaminated with unnecessary biases.”
“Raw read counts alone are not sufficient to compare expression levels among samples, as these values are affected by factors such as transcript length, total number of reads, and sequencing biases.” – Conesa et al., 2016
Once the sequencing step of RNA-seq is complete, researchers are left with tens of millions of individual reads. There is a growing catalogue of bioinformatics tools to extract useful data out of this information, but it can still be an intimidating process. In this section, we turn to some of the key terms and issues researchers need to be aware of when processing data from RNA-seq experiments.
Coverage & Noise
As with any scientific endeavor, figuring out what to do with the data starts with setting up the experiment to obtain the best data in the first place. The first question a researcher must ask when setting up an RNA-seq experiment is how deep the coverage needs to be. Skim sequencing provides a high level overview of the sample, sort of an initial pass that will tell the researcher whether there is enough going on to warrant a deeper look. RNA-skim uses an entirely different set of tools and algorithms to rapidly evaluate transcript abundances, so contact a Cofactor project scientist for more if you are interested in this approach.
In most cases, much deeper coverage is desired to compare biologically complex conditions such as healthy tissue vs. a tumor. Saturation occurs when sequencing data has been acquired for every transcript within the input sample, so that additional depth yields no new information. For a relatively early look at the effect of sequencing depth on transcript detection, see the paper published by Tarazona et al. in 2011.
In many biological assays, noise tends to be a significant concern. For example, in a microarray experiment, nonspecific binding can lead to a baseline signal that might be interpreted as a false positive. On the other end of the spectrum, when a sample is saturated, no further information can be extracted. RNA-seq, in contrast, provides a much greater dynamic range than other assays, and the issue of noise looks very different. Unlike the way researchers often think about noise – as Jon Armstrong, Cofactor’s CSO, puts it, like an “accordion” that expands or contracts with the amount of sequencing – noise from RNA-seq tends to be more stable. Thus, deeper reads bring more transcripts above the baseline noise level, where they can be assessed.
Noise in RNA-seq falls into several categories. First, mechanical noise comes from the sequencer itself, but this can largely be calibrated out of the final analysis. Sequencing errors are a second type of machine-derived “noise” that accumulates as sequencing depth increases. However, sequencing errors increase roughly linearly with depth and can also be handled during data processing.
Biological noise is the same issue that will crop up in any experiment with biological replicates. Despite our best efforts, two identically treated samples will still display slight differences. To minimize this variation, Cofactor recommends a minimum of three biological replicates whenever possible.
Finally, molecular noise is an issue of sample preparation and handling. If the input sample undergoes too many cycles of PCR amplification before sequencing, there is a risk that duplicate molecules will make it to the flow cell and create an artificially high read count. In short, over-amplification can increase duplication levels, which then need to be cleaned up in later QC with software such as Picard Tools. This form of noise is worth being aware of, but continual refinement of RNA-seq protocols and sample prep allows this particular category to be minimized.
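As an illustration, the core idea of duplicate marking can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical version; real tools such as Picard also consider mate positions, orientation, and base qualities when choosing which copy to keep.

```python
def mark_duplicates(reads):
    """Flag reads sharing the same chromosome, start position, and strand
    as likely PCR duplicates, keeping the first occurrence of each.
    Simplified sketch of the idea behind duplicate-marking tools."""
    seen = set()
    kept, duplicates = [], []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key in seen:
            duplicates.append(read)
        else:
            seen.add(key)
            kept.append(read)
    return kept, duplicates

reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+"},
    {"chrom": "chr1", "pos": 100, "strand": "+"},  # same start site: flagged
    {"chrom": "chr1", "pos": 250, "strand": "-"},
]
kept, dups = mark_duplicates(reads)
print(len(kept), len(dups))  # → 2 1
```

Removing (or at least flagging) such duplicates keeps amplification artifacts from inflating apparent transcript abundance.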
After sequencing is completed, the next step is quality control of the resulting data. QC ties directly into the section above, as this is the step where those four types of noise are dealt with. Biological replicates are compared with each other to obtain a reasonable cutoff for defining a signal as real or not. Validated batch controls should be (and at Cofactor are) used with all experiments to provide a benchmark in case results from the experimental samples appear problematic. Putting a questionable sample up against a batch control can provide information as to whether there might be a problem with the input RNA such as degradation. Armstrong notes that the greatest variability tends to come from molecular noise, so researchers should focus on careful sample handling to avoid creating artificial differences.
It is important to note that QC is often more qualitative than quantitative. The right approach is not necessarily to assign a specific threshold number to a sample and call anything outside that threshold an outlier. Instead, it may be better to compare the scores of multiple samples and look for consistency (or the lack thereof).
Other aspects of QC include checking read alignment to ensure that each read has been mapped to the most likely location in the transcriptome. This is particularly important when multiple regions contain highly similar or identical sequences, because the short (~30 base) reads generated by RNA-seq increase the likelihood that any given sequence could map to more than one location. There are numerous algorithms for mapping reads. In general, though, the idea is to find the location on the reference transcriptome that gives the read the lowest mismatch score, taking into account the quality score of each base to help determine whether an observed mismatch is a sequencing error or a real variant. For more on mapping reads, see this early paper by Li et al. from 2008.
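As a toy illustration of this idea, the following Python sketch scores each candidate position by summing the quality values of mismatched bases (so a mismatch at a low-quality base is penalized less) and keeps the lowest-scoring one. This brute-force scan is not how production aligners work; they use indexed searches over the whole transcriptome.

```python
def mismatch_score(read, ref_window, quals):
    """Sum the quality scores of mismatched bases: a mismatch at a
    low-quality base (likely a sequencing error) costs less than one
    at a high-quality base (more likely a real variant)."""
    return sum(q for r, g, q in zip(read, ref_window, quals) if r != g)

def best_mapping(read, reference, quals):
    """Return the reference offset with the lowest mismatch score."""
    best_pos, best_score = None, float("inf")
    for i in range(len(reference) - len(read) + 1):
        score = mismatch_score(read, reference[i:i + len(read)], quals)
        if score < best_score:
            best_pos, best_score = i, score
    return best_pos, best_score

reference = "ACGTACGTTTGCAACGT"   # toy "transcriptome"
read = "TTGCA"
quals = [30, 30, 30, 30, 30]     # Phred-style base qualities
pos, score = best_mapping(read, reference, quals)
print(pos, score)  # → 8 0 (perfect match at offset 8)
```
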
At Cofactor, QC is built in to the entire process and based on extensive validation. As a result, researchers can focus more on their data and its meaning and worry less about the validity of the information coming back from the sequencer.
Once the data has been filtered for quality, the more interesting, experimentally relevant process of looking at relative transcript levels begins. The first step here is normalizing the data to eliminate biases such as differences in total read number. Most RNA-seq experiments are designed to be comparative rather than absolute, comparing gene expression between two or more conditions or over time. There may be instances when absolute quantification of transcript levels is desired, but the majority of RNA-seq setups look at how levels in one sample compare to another.
In some assays, such as Western blotting for comparing protein levels, expression of the control condition is set at 1, and all experimental samples are normalized to that control. Normalization in RNA-seq is a bit more complex because of inherent bias in the sequencing itself. In short, longer transcripts produce more reads. Reads from RNA-seq are all roughly the same length, in the neighborhood of 30 bases. If sequencing is carried out so that two transcripts, one 1,000 bases long and the other 2,000, are each covered at 100%, then the latter transcript will have twice as many reads as the former. Comparing raw counts, then, would make the 2,000-base transcript appear twice as abundant even if the two are expressed at the same level. Normalization also allows datasets with different sequencing depths to be compared without bias.
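This back-of-the-envelope claim can be checked with a quick simulation. The Python sketch below uses a hypothetical `expected_reads` helper and assumes perfectly uniform coverage, which real libraries never quite achieve:

```python
READ_LEN = 30  # typical short-read length discussed above

def expected_reads(transcript_len, coverage):
    """Reads needed to hit a target coverage: reads * read_len ≈ coverage * length."""
    return int(coverage * transcript_len / READ_LEN)

reads_short = expected_reads(1_000, coverage=1.0)  # 100% (1x) coverage
reads_long = expected_reads(2_000, coverage=1.0)

print(reads_long / reads_short)                   # → 2.0: twice the raw reads
print(reads_short / 1_000 == reads_long / 2_000)  # → True: length-scaled counts agree
```

Dividing each count by transcript length removes the bias, which is exactly what the normalization methods below do.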
There are three primary forms of normalization applied to RNA-seq data that help with this, and other differences:
- RPKM, or reads per kilobase million, was developed in a 2008 paper by Mortazavi et al. and is calculated as (raw # reads / exon length in kilobases) x (1,000,000 / # mapped reads). This method normalizes for transcript length and for the total number of mapped reads.
- FPKM, or fragments per kilobase million, counts paired-end fragments instead of individual reads. This is a more useful method with current technology, as most RNA-seq experiments now use paired-end sequencing. If both reads of a pair are high quality, FPKM counts them as one fragment. If only one read is good and the other ends up discarded, the fragment still counts once. In RPKM, those two cases would result in different read counts, potentially changing the apparent transcript level.
- TPM, or transcripts per million, is a newer method in which the first step is to divide the number of reads by the gene length; the results are then scaled so that they sum to one million. This reverses the order of operations in RPKM/FPKM, and it shows what proportion of total reads map to a given gene. Because TPM values sum to the same total in every sample, comparisons between samples are standardized, arguably making for a more traditional type of normalization.
For an excellent explanation of RPKM vs FPKM vs TPM, see this post from the RNA-seq Blog.
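To make the formulas above concrete, here is a minimal Python sketch of RPKM and TPM on toy numbers (FPKM is computed like RPKM, just with fragments as the counting unit):

```python
def rpkm(counts, lengths):
    """RPKM: reads, scaled per kilobase of transcript and per million mapped reads."""
    total_reads = sum(counts)
    return [c / (l / 1_000) / (total_reads / 1_000_000)
            for c, l in zip(counts, lengths)]

def tpm(counts, lengths):
    """TPM: length-normalize first, then scale so the values sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths)]
    scale = sum(rates)
    return [r / scale * 1_000_000 for r in rates]

counts = [100, 200, 300]         # toy read counts for three genes
lengths = [1_000, 2_000, 3_000]  # gene lengths in bases

# Counts scale exactly with length here, so all three genes are equally expressed:
print(rpkm(counts, lengths))  # three (near-)identical values
print(tpm(counts, lengths))   # three (near-)identical values summing to ~1,000,000
```

Note the design difference: TPM values always sum to the same constant per sample, while RPKM/FPKM totals vary between samples, which is the main argument for TPM in cross-sample comparisons.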
There are many bioinformatics tools that can be applied to RNA-seq data. Numerous software kits are available to handle this process, and Cofactor will take care of the analysis as part of their services. However, there are also several well-known statistical approaches for basic comparison of data once the reads have been mapped and normalized. As noted above, these – like any good experiment – require at least three biological replicates per condition. With that kind of input, the coefficient of variation can be calculated as an important first step toward ensuring that the numbers within a group are consistent enough for comparison to those of another. Of course, simple p-values are also important for quantifying the significance of any observed variation. Relationships between samples are often investigated by Pearson, and sometimes Spearman, correlation.
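A minimal Python sketch of these statistics on toy numbers follows; a real pipeline would typically use vetted library implementations (e.g., SciPy) rather than hand-rolled versions, and this Spearman sketch ignores tied ranks.

```python
import math

def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return sd / mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))

# Three biological replicates of one gene's normalized counts:
print(round(cv([10, 11, 9]), 3))   # low CV suggests consistent replicates

# Expression profiles of two samples across four genes:
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # linear relationship
print(round(spearman([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # monotonic relationship
```
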
Final thoughts on analyzing the data
This post barely scratches the surface of how to deal with data from RNA-seq. Part of the reason for this is space, as there are dozens of resources and published manuscripts describing specific software and other tools. However, another reason is that many RNA-seq services build the analysis into the project. At Cofactor, part of the process is speaking with researchers to determine their specific needs for an experiment, and then handling much of the bioinformatics work as part of the process. If you have questions about any aspect of this, or wish to have a detailed discussion of more complex issues such as mapping reads and data processing, get in touch with Cofactor to schedule a call with a project scientist.