What does a clinical workflow for RNA-seq look like?
Using RNA-seq for diagnostic purposes starts with proper collection of a biological sample. Whether the sample is liquid or solid tissue, extracting useful clinical information from its genetic profile requires careful handling. From there, the sequencing itself is relatively straightforward, followed by the computationally intensive QC and data analysis.
Overall, the process for RNA-seq can be broken down into three broad steps, as described by Han et al.: 1) Experimental biology, 2) Computational biology, and 3) Systems biology:
1) Sample acquisition and RNA extraction, preparation of the cDNA library, and sequencing.
2) Quality control, filtering, and alignment/assembly of reads. Simply eliminating low-quality reads or read fragments can go a long way in cleaning up raw sequencing results. By looking at Phred scores, analysis software can remove ambiguous sequences based on customizable cutoffs.
3) Statistical analysis, functional correlation, and clinical application.
We will take each of these steps (which are demarcated for convenience and are not formal categories) in turn to work through a general clinical workflow for RNA-seq.
1. Experimental biology (wet lab)
Regardless of how a sample is acquired or what the plan is for its use, the most important part of preparing for RNA-seq is to isolate the tissue and stabilize RNA as quickly as possible. Clinical samples are often fixed in formalin and embedded in paraffin for further analysis by traditional pathology. RNA from these formalin-fixed paraffin-embedded (FFPE) tissues tends to be more difficult to work with, because formalin modifies nucleic acids in ways that make downstream sequencing less effective and less accurate. However, methods for reversing these modifications have advanced in recent years, and sequencing technology can now handle the lower concentrations of RNA recovered from these highly processed samples. Indeed, two papers (from 2014 and 2015) demonstrated that gene expression profiles of FFPE samples looked very similar to those from matched fresh frozen controls. Hedegaard et al. noted:
RNA-Seq data showed high correlation of expression profiles in FF/FFPE pairs (Pearson Correlations of 0.90 +/- 0.05), irrespective of storage time (up to 244 months) and tissue type.
With reagents available to reverse some of the modifications, and tools like transcriptional exome capture available as a possible alternative to the more traditional poly(A) selection, clinicians now have better options for extracting useful information from non-ideal samples.
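To make the correlation figure quoted above more concrete, here is a minimal sketch of how a Pearson correlation between a fresh frozen (FF) profile and its matched FFPE profile could be computed. The expression values are invented for illustration and are not taken from Hedegaard et al.; real comparisons run over thousands of genes and depend on how the data were normalized.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 expression values for the same five genes
# measured in a matched FF / FFPE pair (illustrative numbers only).
fresh_frozen = [2.0, 5.1, 7.3, 1.2, 9.8]
ffpe = [1.8, 4.7, 7.9, 1.5, 9.1]

r = pearson(fresh_frozen, ffpe)
print(f"Pearson r between FF and FFPE profiles: {r:.3f}")
```

A value close to 1 means the two preparations rank and scale gene expression in much the same way, which is the sense in which FFPE profiles "look very similar" to fresh frozen ones.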
Once the tissue is preserved, library preparation and sequencing are handled by the testing laboratory. These labs, including Cofactor, have built pipelines that manage the entire process (including QC and alignment), which allows clinicians to focus on patient care.
2. QC and Assembly
Quality control of sequencing reads helps to deal with some of the potential problems caused by damaged RNA inputs, such as those from FFPE samples. Classic quality metrics such as Phred scores are used to highlight poor-quality base calls and reads. Removing duplicated reads is another important part of QC. This step ensures that over-sampling of a particular transcript does not result in false positives or inaccurate expression ratios in later analyses.
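As a rough illustration of Phred-based filtering, the sketch below decodes FASTQ quality strings and keeps only reads whose mean score clears a cutoff. It assumes the common Phred+33 encoding, and the cutoff of 20 is only an example value. The function names and the toy reads are hypothetical; production pipelines use dedicated QC tools rather than hand-written filters.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def passes_quality(quality_string, min_mean_q=20, offset=33):
    """True if the read's mean Phred score meets the (customizable) cutoff."""
    scores = phred_scores(quality_string, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q

# Toy reads as (name, sequence, quality string); illustrative only.
reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # 'I' decodes to Q40 (high confidence)
    ("read2", "ACGTTTGA", "########"),  # '#' decodes to Q2 (very low confidence)
]

kept = [name for name, seq, qual in reads if passes_quality(qual)]
print(kept)
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000, which is why cutoffs in this range are common starting points.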
Assembly (aligning the millions of ~30-base reads to a reference genome) is of course the foundational step in RNA-seq. Without knowing where the tens of millions of reads map, no downstream comparisons can be made. Because each read is so short, though, it’s not always a given that an individual read can be easily assigned a location on the genome. Thus, a significant challenge is aligning reads that have (or could have) multiple potential homes. Tandem repeats are one instance. Related but different genes can also contain stretches of identical sequence, making the mapping difficult. As Li et al. said in their 2008 paper on mapping short reads,
“Most genomes contain at least some sequence that is repetitive or close to repetitive on the length scale of the reads. As a consequence, some reads will map equally well to multiple positions.”
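The multi-mapping problem is easy to see on a toy example. The sketch below uses a made-up reference containing a short tandem repeat and looks for exact matches of two reads. This is plain string matching, used here only to illustrate the ambiguity; real aligners rely on indexed data structures and tolerate mismatches, and the sequences are invented.

```python
def find_exact_matches(read, reference):
    """Return every 0-based start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further along so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions

# Hypothetical reference: "GACA" repeated three times, flanked by unique sequence.
reference = "TT" + "GACA" * 3 + "GGCTTACCGGATCGA"

repeat_read = "GACAGA"   # falls inside the tandem repeat
unique_read = "CTTACC"   # falls in the non-repetitive flank

print(find_exact_matches(repeat_read, reference))  # more than one candidate position
print(find_exact_matches(unique_read, reference))  # a single position
```

The read drawn from the repeat matches at more than one position, so the aligner has to either pick one, distribute the read across candidates, or discard it. Each choice affects downstream expression estimates.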
3. Analysis
Once alignment and mapping have been completed, normalization is necessary to account for differences in transcript length and other possible biases. De-duplication is a process by which redundant reads are binned as either artificial noise caused by sequencing bias or biological duplication. The latter is where real changes in gene expression are found. At this point, after normalization and de-duplication, counting the number of reads per gene allows relative gene expression to be quantified. RNA-seq has a large dynamic range and returns low noise in the data. As a result, relative gene expression is among the most compelling applications for the technology, because it can support highly accurate clinical diagnosis.
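One widely used way to correct for transcript length is transcripts per million (TPM). The sketch below shows that calculation on invented counts. It is offered as an example of length normalization in general, not as the specific method used in any particular pipeline, and the gene names and numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths in base pairs.

    Step 1: divide each count by gene length in kilobases (reads per kilobase).
    Step 2: scale so the per-kilobase rates across all genes sum to one million.
    """
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads were counted for any gene")
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}

# Hypothetical read counts and gene lengths (illustrative only).
counts = {"GENE_A": 1000, "GENE_B": 1000, "GENE_C": 500}
lengths = {"GENE_A": 2000, "GENE_B": 1000, "GENE_C": 500}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(f"{gene}: {value:.0f} TPM")
```

GENE_A and GENE_B have the same raw count here, but GENE_A is twice as long, so its normalized expression comes out at half that of GENE_B. Without the length correction the two would look equally expressed.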
Of course, actual analysis of RNA-seq data is extremely complex and experiment-specific. As noted above, quantifying gene expression is dependent on everything from the input sample to the genome annotation used for quantification. Offering a full list of caveats and details is beyond the scope of this article. If you have questions, though, don’t hesitate to get in touch with Cofactor to discuss your options.
In clinical contexts, diagnosis based on gene expression profiles can take a few different forms. In some instances, changes in expression of a gene or a small set of genes relative to known healthy baseline levels will indicate a problem. In some tumors, the number of genes displaying shifts from healthy controls is in the hundreds or even thousands. Often, though, a few key genes are sufficient to nail down the particular cancer subtype being assessed.
In other situations, a disease state is marked by a change in expression of two or more genes relative to each other. In these cases, the problem can be diagnosed when the ratio of transcript A to transcript B in a patient sample varies significantly from the same ratio in healthy tissues.
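As a simple sketch of the ratio idea, the code below compares a patient's log2 ratio of transcript A to transcript B against the same ratio in a small set of healthy reference samples, expressed as a z-score. The expression values are invented, and the z-score is just one plausible way to express "varies significantly"; it is an assumption for illustration, not a validated clinical threshold.

```python
import math
import statistics

def log2_ratio(a, b):
    """log2 of the expression ratio of transcript A to transcript B."""
    return math.log2(a / b)

def ratio_z_score(patient_a, patient_b, healthy_pairs):
    """How many standard deviations the patient's log2(A/B) lies from the healthy mean."""
    healthy = [log2_ratio(a, b) for a, b in healthy_pairs]
    mean = statistics.mean(healthy)
    sd = statistics.stdev(healthy)  # sample standard deviation; needs >= 2 samples
    return (log2_ratio(patient_a, patient_b) - mean) / sd

# Hypothetical normalized expression of (transcript A, transcript B)
# in four healthy reference samples.
healthy_pairs = [(100, 50), (120, 55), (90, 48), (110, 52)]

z_shifted = ratio_z_score(400, 50, healthy_pairs)   # A strongly elevated relative to B
z_typical = ratio_z_score(100, 50, healthy_pairs)   # ratio similar to healthy tissue

print(f"shifted sample z-score: {z_shifted:.1f}")
print(f"typical sample z-score: {z_typical:.1f}")
```

Working with a ratio within one sample has a practical advantage: both transcripts are measured in the same library, so some sample-level technical variation affects them together.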
The level of diagnostic detail that can be extracted from RNA-seq data continues to increase. RNA-seq provides an excellent companion to traditional pathology, as the former can indicate the molecular subtype. By extension, in a growing number of instances, this information reveals drug sensitivity or resistance, thereby helping to suggest the appropriate clinical intervention.
That said, there are still concerns related to RNA-seq for diagnostic purposes. Some are discussed in a 2016 review of work carried out by the FDA-led Sequencing Quality Control Consortium. Accuracy and reproducibility are critically important to ensure that all test results provide appropriate diagnostic data. Recent studies found that cross-platform reproducibility was high for calculating relative gene expression, but less so when looking at absolute expression values of a particular gene. Additionally, RNA-seq and microarray provide similar results for “clinical endpoint prediction.” However, RNA-seq still has significant advantages, including the much greater dynamic range mentioned above, which is particularly useful for quantifying genes expressed at low levels. Interestingly, the FDA study found that a predictive model developed for RNA-seq could not necessarily be applied directly to microarray data without some variability in predictive performance. Microarray models, however, could be applied directly to RNA-seq. Thus, the FDA research group highlighted what they considered the “continued usefulness of legacy microarray data and established microarray biomarkers and predictive models in the forthcoming RNA-seq era.”
Though beyond the scope of this article, it is important to point out that the FDA is in the midst of a massive effort to determine how laboratory-developed tests (LDTs), which include most forms of genetic testing tools used for diagnostic purposes, will be regulated. Accuracy is a key component of CLIA regulation, which ensures that a technology or process is standardized and reproducible. The FDA is focused on whether a test provides accurate clinical information. That is, “is this diagnosis correct, and does it lead clinicians to take the appropriate next steps?”
Cofactor Genomics is a CAP-CLIA certified lab. Feel free to get in touch with questions about this designation, the new Pinnacle RNA-based cancer biomarker panel, or how Cofactor might be able to help you in your clinical RNA-seq efforts.