The Dashboard page is the first thing you’ll see when you open your Activesite URL. It contains basic QC information and is intended to give you a quick sense of the quality of the sequencing data produced.
Near the top of Dashboard is a row of colored boxes, each one reflecting a different QC metric.
Counts – This is the average number of reads per sample in millions.
Quality – This is the Phred quality score for all samples in the experiment, averaged over all positions in the reads. Click on the Quality box to see plots of Phred quality scores at each position in the read for each sample in the experiment.
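If you want to relate the Quality number to error rates yourself, here is a minimal Python sketch. It relies on two standard facts: a Phred score Q corresponds to an error probability of 10^(-Q/10), and most current FASTQ files store Q as the ASCII code of each quality character minus 33. The function names are made up for illustration, and the plain average over all bases is an assumption about how the Dashboard value is computed, not a description of Activesite's own code.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality(quality_strings):
    """Average Phred score over every base of every read (assumed averaging)."""
    scores = [q for s in quality_strings for q in decode_phred33(s)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    reads = ["II", "55"]              # 'I' decodes to Q40, '5' decodes to Q20
    print(mean_quality(reads))        # average of 40, 40, 20, 20
    print(error_probability(20))      # Q20 means roughly 1 error in 100 calls
```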
Saturation – This metric goes with the Information Saturation plot that is a little further down the page. The number in this box represents the number of samples in your experiment that have reached saturation. More on this in the Information Saturation section.
ERCC – Sometimes as part of the experiment, artificial RNAs are spiked into the RNA-seq libraries in known amounts and then aligned to an ERCC reference. Those alignments are then used to calculate a Pearson correlation coefficient for each sample. The number you see in the box is the average Pearson correlation coefficient for all samples.
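As a rough illustration of the ERCC metric, the sketch below computes a Pearson correlation coefficient per sample between the known spike-in amounts and the observed read counts, then averages across samples. The log2 transform (with +1 added to the counts), the function names and the input layout are assumptions made for this example; the text above does not say which transform, if any, Activesite applies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def ercc_score(known_amounts, observed_counts):
    """Correlate log2 known amounts with log2(count + 1) for one sample.

    The log transform is an assumption for this sketch.
    """
    x = [math.log2(a) for a in known_amounts]
    y = [math.log2(c + 1) for c in observed_counts]
    return pearson(x, y)


def average_ercc(known_amounts, counts_by_sample):
    """Mean of the per-sample correlations (the number shown in the box)."""
    scores = [ercc_score(known_amounts, c) for c in counts_by_sample.values()]
    return sum(scores) / len(scores)
```

A value close to 1 means the observed counts track the spiked-in amounts closely.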
Ribosomal – This is the average ribosomal content for all samples.
Information Saturation – For a given sample, transcripts at 10x is the number of transcripts that have at least 10 reads aligning to them. As more and more reads are sampled, the number of transcripts at 10x goes up, but it doesn’t go up forever. Beyond a certain depth of sequencing, you aren’t really seeing any more transcripts, and the curve flattens out. This plot, and the Complete, Intermediate and Incomplete metrics are meant to give you a sense of whether the amount of sequencing done adequately captures the diversity of transcripts present in the sample.
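The idea behind the saturation curve can be sketched in a few lines of Python: take growing random subsets of a sample's aligned reads and, for each subset, count how many transcripts have at least 10 reads. This is only an illustration. The input format (one transcript ID per aligned read), the function names and the subsampling scheme are assumptions, and the cutoffs behind the Complete, Intermediate and Incomplete labels are not reproduced here.

```python
import random
from collections import Counter


def transcripts_at_10x(read_assignments, min_reads=10):
    """Number of transcripts with at least `min_reads` reads aligned to them."""
    counts = Counter(read_assignments)
    return sum(1 for c in counts.values() if c >= min_reads)


def saturation_curve(read_assignments, fractions, seed=0):
    """Return (reads sampled, transcripts at 10x) for each sampling fraction.

    Reads are shuffled once and nested prefixes are taken, so the curve
    can only stay flat or rise as the fraction increases.
    """
    rng = random.Random(seed)
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n = int(len(shuffled) * fraction)
        curve.append((n, transcripts_at_10x(shuffled[:n])))
    return curve


if __name__ == "__main__":
    # Toy sample: transcript A has 50 reads, B has 20, C has only 5.
    reads = ["A"] * 50 + ["B"] * 20 + ["C"] * 5
    for n, found in saturation_curve(reads, [0.25, 0.5, 0.75, 1.0]):
        print(n, found)
```

When the second number stops growing as more reads are added, the sample is approaching saturation.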
Correlation Matrix – These are matrices of Pearson correlation coefficients comparing samples from within the same group. Notice that each group has its own tab.
5′ to 3′ Mean Coverage of Top 1000 Expressed Transcripts – This plot is useful for evaluating coverage along the length of a transcript. Generally, the more even the coverage, the better. Irregularities in the Mean Coverage plot could reflect biases in the preparation of the RNA-seq library, and these biases could affect downstream analysis.
Genome alignment metrics – These fields show where within the reference genome the reads from each sample are aligning: exonic, intronic or intergenic. Note the options to export the data to Excel and PDF formats.
Replicates – These are scatter plots of expression values (typically RPKM) of one sample versus another. In the Replicates tab, samples from the same experimental group are plotted against each other. In the Comparative Expression tab, samples from one experimental group are compared to samples from another group.
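For reference, RPKM stands for reads per kilobase of transcript per million mapped reads. A minimal sketch of the standard formula follows; the function and parameter names are chosen for this example, and details such as which reads count toward the total are assumptions that may differ from how Activesite computes its values.

```python
def rpkm(reads_on_transcript, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return reads_on_transcript / (kilobases * millions_of_reads)


if __name__ == "__main__":
    # 1,000 reads on a 2 kb transcript in a sample with 10 million mapped reads
    print(rpkm(1_000, 2_000, 10_000_000))
```

Because the value is normalized by both transcript length and sequencing depth, it can be compared between transcripts and between samples in the scatter plots.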
The Reads page contains information about the number of reads and bases sequenced for each sample, as well as the type of read and the read length.
The Alignments page presents alignment rates to the reference genome or transcriptome.
The Candidates page typically has three tabs: Transcriptome, Transcriptome Pairwise and Genome.
The Transcriptome tab contains comparative expression data for your sample groups, one row per transcript.
You can sort the data in a column in ascending or descending order by clicking on the column header. Also, each column has a search field you can use to filter your data.
Here’s a quick rundown of each column.
Gene – This is the gene name of the transcript. Click on the name to link out to the transcript record at NCBI.
Transcript ID – The transcript identifier. Clicking on the link takes you to this transcript in the UCSC genome browser.
Pathway – (For human and mouse) Pathway annotations are curated by expert biologists and provided by the Reactome database. The current version (49) annotates 7357 human genes in 1551 pathways.
Min P-value – In this view the displayed P-value is the minimum of all the pairwise P-values between groups. Each P-value comes from Welch’s t-test, that is, a test of the null hypothesis that the means of two normally distributed populations are equal, assuming unequal variance. Smaller values are more significant. P-values are then corrected for multiple testing using the two-step adaptive method of Benjamini and Hochberg (2000), which estimates the true number of null hypotheses to control the False Discovery Rate.
You can use the search box underneath the Min P-value column header to filter by P-value. For example, if you only want to see the rows with Min P-values less than 0.05, type “<0.05” in Min P-value’s search field. Note that when you filter using a P-value <0.05, the number of records returned is reduced from over 70,000 to 8,899.
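To make the statistics a little more concrete, here is a hedged sketch using only the Python standard library. The first function computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom; turning these into a P-value needs the t distribution, which the standard library lacks (if SciPy is installed, scipy.stats.ttest_ind(a, b, equal_var=False) returns the P-value directly). The second function applies the plain Benjamini-Hochberg step-up adjustment. Note that this is the simpler, non-adaptive procedure, shown for illustration only; it is not the two-step adaptive method described above, and all function names are invented for this example.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    var_a = statistics.variance(a) / len(a)   # squared standard error, group a
    var_b = statistics.variance(b) / len(b)   # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


def bh_adjust(p_values):
    """Plain Benjamini-Hochberg adjusted P-values (not the adaptive variant)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each group needs at least two samples, since the variance of a single value is undefined.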
Fold Change – This is the maximum pairwise fold change (ratio of expression) among the sample groups. Again you can use Fold Change’s search field to filter your results. Type “>10” in the Fold Change search field to see rows with Fold Change greater than 10.
Expression Max – This is the highest expression value among all sample groups. This is useful as a noise filter to remove low level stochastic genomic transcription products from your results page.
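If you export the table and want to reproduce these filters offline, something like the following Python sketch would do it. The row layout and the field names (min_p_value, group_means) are hypothetical, not an Activesite export format, and the handling of a zero in the lowest group is an assumption. For positive values, the maximum pairwise fold change is simply the highest group mean divided by the lowest.

```python
import math


def max_fold_change(group_means):
    """Largest ratio between any two group means (highest / lowest)."""
    highest, lowest = max(group_means), min(group_means)
    if lowest == 0:
        # Assumption for this sketch: infinite fold change when only the
        # lowest group is zero, and 1.0 when every group is zero.
        return math.inf if highest > 0 else 1.0
    return highest / lowest


def keep_row(row, max_p=0.05, min_fold=10, min_expression=1.0):
    """Mimic typing "<0.05" and ">10" in the search fields, plus a noise floor."""
    means = row["group_means"]
    return (
        row["min_p_value"] < max_p
        and max_fold_change(means) > min_fold
        and max(means) >= min_expression     # Expression Max as noise filter
    )


if __name__ == "__main__":
    rows = [
        {"gene": "geneA", "min_p_value": 0.01, "group_means": [0.5, 12.0]},
        {"gene": "geneB", "min_p_value": 0.01, "group_means": [11.29, 4.67]},
    ]
    print([r["gene"] for r in rows if keep_row(r)])
```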
Differentially Expressed? – The “Differentially Expressed?” value for a gene/transcript takes one of two values:
1 – Differentially Expressed
0 – Not Differentially Expressed
A gene is considered differentially expressed if two criteria are met. First, the sum of the highest and lowest expression values must be greater than or equal to a cutoff of 1 RPKM. Second, the fold change must be greater than or equal to the value selected in the “Min Fold Change” column (2, 5 or 10, or you can leave it blank).
For the gene MIR3652:
Highest expression = 11.29
Lowest expression = 4.67
11.29 + 4.67 = 15.96 > 1
The first criterion is met.
Fold Change = 2.42
The second criterion is met when Min Fold Change is set to 2, but not when it is set to 5 or 10.
Thus the Differentially Expressed? flag is 1 in the row with a Min Fold Change of 2, but it is 0 in the other two rows.
Note that the “Differentially Expressed?” column does not take the FDR-corrected P-value into account. A gene could have a 1 in the “Differentially Expressed?” column, but that doesn’t necessarily mean it has a significant P-value. Rather, the “Differentially Expressed?” column can act as an additional noise filter for your data set.
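The two criteria and the MIR3652 example above can be written as a short Python function. This is a sketch of the rule as described in the text, not Activesite's own code; the function name is invented, and treating a blank Min Fold Change as "no fold change requirement" is an assumption.

```python
def differentially_expressed(highest, lowest, fold_change,
                             min_fold_change=None, sum_cutoff=1.0):
    """Return 1 if both criteria are met, otherwise 0."""
    # Criterion 1: highest + lowest expression must reach the 1 RPKM cutoff.
    if highest + lowest < sum_cutoff:
        return 0
    # Criterion 2: fold change must reach the selected Min Fold Change.
    # Assumption: a blank Min Fold Change (None) imposes no requirement.
    if min_fold_change is not None and fold_change < min_fold_change:
        return 0
    return 1


if __name__ == "__main__":
    # MIR3652: highest 11.29, lowest 4.67, fold change 2.42
    for setting in (2, 5, 10):
        print(setting, differentially_expressed(11.29, 4.67, 2.42, setting))
```

Run on the MIR3652 values, the function gives 1 for a Min Fold Change of 2 and 0 for 5 and 10, matching the three rows described above.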
Expression String – You can use expression strings to search for genes/transcripts in your data that are highly expressed in some sample groups but expressed at a low level (or not at all) in others. The expression string consists of 1s and 0s, and its length equals the number of sample groups in your data set: 0 indicates that the gene has low expression in that group, and 1 indicates high expression. In this example, there are 4 sample groups: Control-1, Control-2, E2-1 and E2-2. An expression string of 0011 will filter for genes where expression is low/off for Control-1 and Control-2 and high/on for E2-1 and E2-2. This can be useful for sorting data by hypothesis. The threshold for a gene to be considered on is determined by the value of the Min Fold Change column.
Mean – This is the mean expression in RPKM for all samples within a group.
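To show how an expression string lines up with the sample groups, here is a small Python sketch that decodes a string such as 0011 and filters rows by it. The group names come from the example above; the row layout and the expression_string field name are hypothetical. The sketch deliberately does not try to compute the string from expression values, since the exact on/off rule is not spelled out here.

```python
GROUPS = ["Control-1", "Control-2", "E2-1", "E2-2"]


def decode_expression_string(expression_string, groups=GROUPS):
    """Map each sample group to 'high' (1) or 'low' (0)."""
    if len(expression_string) != len(groups):
        raise ValueError("expression string needs one digit per sample group")
    if set(expression_string) - {"0", "1"}:
        raise ValueError("expression string may only contain 0 and 1")
    return {
        group: "high" if digit == "1" else "low"
        for group, digit in zip(groups, expression_string)
    }


def filter_rows(rows, pattern):
    """Keep rows whose expression string matches the pattern exactly."""
    decode_expression_string(pattern)   # validates the pattern
    return [row for row in rows if row["expression_string"] == pattern]


if __name__ == "__main__":
    print(decode_expression_string("0011"))
```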
cv – This is the coefficient of variation. It is the standard deviation divided by the mean for that sample group.
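As a quick worked example, the coefficient of variation for one sample group can be computed like this in Python. The use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text above does not say which form is used, and the function name is invented for this sketch.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean for one sample group."""
    mean = statistics.mean(values)
    if mean == 0:
        return math.nan                  # cv is undefined when the mean is zero
    # Sample standard deviation (n - 1); this choice is an assumption.
    return statistics.stdev(values) / mean


if __name__ == "__main__":
    # Three replicates with RPKM 10, 12 and 14: mean 12, standard deviation 2
    print(coefficient_of_variation([10, 12, 14]))
```

A low cv means the replicates within the group agree well. At least two samples per group are needed for the standard deviation to be defined.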
The Transcriptome Pairwise tab is similar to the Transcriptome tab, but with a couple of extra columns, Group A and Group B. These columns allow you to see the results of pairwise tests between one sample group and another. On the Transcriptome tab, you only see the minimum p-value among all the pairwise comparisons made between different sample groups. In the Transcriptome Pairwise tab, you can see all p-values (by leaving Group A and Group B blank), or you can filter the data so that you only see the comparisons between two sample groups that you select. So if you’re interested in comparing two specific sample groups in an experiment that has several groups, you’ll get a lot of use out of the Transcriptome Pairwise tab.