Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Shian Su, Maria Doyle, Harriet Dashnow, Charity Law
Data files are available from: https://figshare.com/s/1d788fd384d33e913a2a. You should download the files listed below and place them into a folder called data
in your working directory.
Data files:
Data files were originally obtained from:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60450/suppl/GSE60450_Lactation-GenewiseCounts.txt.gz
http://bioinf.wehi.edu.au/MSigDB/v7.1/Mm.c2.all.v7.1.entrez.rds
http://bioinf.wehi.edu.au/MSigDB/v7.1/Mm.h.all.v7.1.entrez.rds
This material has been inspired by the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf (Lun, Chen, and Smyth 2016)
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/RNAseq_DE_analysis_with_R.html
Packages used:
To install the packages you can:
Install the latest release of R. This version of the tutorial uses R 4.0.
Get the latest version of Bioconductor and packages by starting R and entering the commands:
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install(c("limma", "edgeR", "Glimma", "org.Mm.eg.db", "gplots", "RColorBrewer", "NMF", "BiasedUrn"))
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today we will be starting from the count data and getting stuck into analysis.
First, let’s load all the packages we will need to analyse the data.
library(edgeR)
library(limma)
library(Glimma)
library(org.Mm.eg.db)
library(gplots)
library(RColorBrewer)
library(NMF)
The data for this tutorial comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (Fu et al. 2015). Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450.
This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates. We will first use the counts file as a starting point for our analysis. This data has already been aligned to the mouse genome. The command line tool featureCounts (Liao, Smyth, and Shi 2014) was used to count reads mapped to mouse genes from Refseq annotation (see the paper for details).
Set up an RStudio project specifying the directory where you have saved the /data
directory. Download and read in the data.
# Read the data into R
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
# Read the sample information into R
sampleinfo <- read.delim("data/SampleInfo.txt", stringsAsFactors = TRUE)
Let’s take a look at the data. You can use the head
command to see the first 6 lines. The dim
command will tell you how many rows and columns the data frame has.
head(seqdata)
EntrezGeneID Length MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1
1 497097 3634 438
2 100503874 3259 1
3 100038431 1634 0
4 19888 9747 1
5 20671 3130 106
6 27395 4203 309
MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1
1 300 65
2 0 1
3 0 0
4 1 0
5 182 82
6 234 337
MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1
1 237 354
2 1 0
3 0 0
4 0 0
5 105 43
6 300 290
MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1
1 287 0
2 4 0
3 0 0
4 0 10
5 82 16
6 270 560
MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1
1 0 0
2 0 0
3 0 0
4 3 10
5 25 18
6 464 489
MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1
1 0 0
2 0 0
3 0 0
4 2 0
5 8 3
6 328 307
MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
1 0
2 0
3 0
4 0
5 10
6 342
dim(seqdata)
[1] 27179 14
The seqdata
object contains information about genes (one gene per row), the first column has the Entrez gene id, the second has the gene length and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample. There are two replicates for each cell type and time point (detailed sample info can be found in file “GSE60450_series_matrix.txt” from the GEO website). The sampleinfo file contains basic information about the samples that we will need for the analysis today.
sampleinfo
FileName SampleName CellType Status
1 MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DG luminal virgin
2 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DH basal virgin
3 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DI basal pregnant
4 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DJ basal pregnant
5 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DK basal lactate
6 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.DL basal lactate
7 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LA basal virgin
8 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LB luminal virgin
9 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LC luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LD luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LE luminal lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1 MCL1.LF luminal lactate
We will be manipulating and reformatting the counts matrix into a suitable format for downstream analysis. The first two columns in the seqdata
dataframe contain annotation information. We need to make a new matrix containing only the counts, but we can store the gene identifiers (the EntrezGeneID
column) as rownames. We will add more annotation information about each gene later on in the workshop.
Let’s create a new data object, countdata
, that contains only the counts for the 12 samples.
# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]
# Look at the output
head(countdata)
MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1
1 438 300
2 1 0
3 0 0
4 1 1
5 106 182
6 309 234
MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1
1 65 237
2 1 1
3 0 0
4 0 0
5 82 105
6 337 300
MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1
1 354 287
2 0 4
3 0 0
4 0 0
5 43 82
6 290 270
MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1
1 0 0
2 0 0
3 0 0
4 10 3
5 16 25
6 560 464
MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1
1 0 0
2 0 0
3 0 0
4 10 2
5 18 8
6 489 328
MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
1 0 0
2 0 0
3 0 0
4 0 0
5 3 10
6 307 342
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
Take a look at the output
head(countdata)
MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1
497097 438 300
100503874 1 0
100038431 0 0
19888 1 1
20671 106 182
27395 309 234
MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1
497097 65 237
100503874 1 1
100038431 0 0
19888 0 0
20671 82 105
27395 337 300
MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1
497097 354 287
100503874 0 4
100038431 0 0
19888 0 0
20671 43 82
27395 290 270
MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1
497097 0 0
100503874 0 0
100038431 0 0
19888 10 3
20671 16 25
27395 560 464
MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1
497097 0 0
100503874 0 0
100038431 0 0
19888 10 2
20671 18 8
27395 489 328
MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
497097 0 0
100503874 0 0
100038431 0 0
19888 0 0
20671 3 10
27395 307 342
Now take a look at the column names
colnames(countdata)
[1] "MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1" "MCL1.DH_BC2CTUACXX_CAGATC_L002_R1"
[3] "MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1" "MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1"
[5] "MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1" "MCL1.DL_BC2CTUACXX_ATCACG_L002_R1"
[7] "MCL1.LA_BC2CTUACXX_GATCAG_L001_R1" "MCL1.LB_BC2CTUACXX_TGACCA_L001_R1"
[9] "MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1" "MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1"
[11] "MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1" "MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1"
These are the sample names which are pretty long so we’ll shorten these to contain only the relevant information about each sample. We will use the substr
command to extract the first 7 characters and use these as the colnames.
# using substr, you extract the characters starting at position 1 and stopping at position 7 of the colnames
colnames(countdata) <- substr(colnames(countdata),start=1,stop=7)
Take a look at the output.
head(countdata)
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097 438 300 65 237 354 287 0 0
100503874 1 0 1 1 0 4 0 0
100038431 0 0 0 0 0 0 0 0
19888 1 1 0 0 0 0 10 3
20671 106 182 82 105 43 82 16 25
27395 309 234 337 300 290 270 560 464
MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097 0 0 0 0
100503874 0 0 0 0
100038431 0 0 0 0
19888 10 2 0 0
20671 18 8 3 10
27395 489 328 307 342
Note that the column names are now the same as SampleName
in the sampleinfo
file. This is good because it means our sample information in sampleinfo
is in the same order as the columns in countdata
.
table(colnames(countdata)==sampleinfo$SampleName)
TRUE
12
Next we’ll create a DGEList
object. This is an object used by edgeR to store count data. It has a number of slots for storing various parameters about the data.
y <- DGEList(countdata)
# have a look at y
y
An object of class "DGEList"
$counts
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097 438 300 65 237 354 287 0 0
100503874 1 0 1 1 0 4 0 0
100038431 0 0 0 0 0 0 0 0
19888 1 1 0 0 0 0 10 3
20671 106 182 82 105 43 82 16 25
MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097 0 0 0 0
100503874 0 0 0 0
100038431 0 0 0 0
19888 10 2 0 0
20671 18 8 3 10
27174 more rows ...
$samples
group lib.size norm.factors
MCL1.DG 1 23227641 1
MCL1.DH 1 21777891 1
MCL1.DI 1 24100765 1
MCL1.DJ 1 22665371 1
MCL1.DK 1 21529331 1
7 more rows ...
# See what slots are stored in y
names(y)
[1] "counts" "samples"
# Library size information is stored in the samples slot
y$samples
group lib.size norm.factors
MCL1.DG 1 23227641 1
MCL1.DH 1 21777891 1
MCL1.DI 1 24100765 1
MCL1.DJ 1 22665371 1
MCL1.DK 1 21529331 1
MCL1.DL 1 20015386 1
MCL1.LA 1 20392113 1
MCL1.LB 1 21708152 1
MCL1.LC 1 22241607 1
MCL1.LD 1 21988240 1
MCL1.LE 1 24723827 1
MCL1.LF 1 24657293 1
We can also store the groups for the samples in the DGEList object.
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
# Take a look
group
[1] "luminal.virgin" "basal.virgin" "basal.pregnant" "basal.pregnant"
[5] "basal.lactate" "basal.lactate" "basal.virgin" "luminal.virgin"
[9] "luminal.pregnant" "luminal.pregnant" "luminal.lactate" "luminal.lactate"
# Convert to factor
group <- factor(group)
# Take another look.
group
[1] luminal.virgin basal.virgin basal.pregnant basal.pregnant
[5] basal.lactate basal.lactate basal.virgin luminal.virgin
[9] luminal.pregnant luminal.pregnant luminal.lactate luminal.lactate
6 Levels: basal.lactate basal.pregnant basal.virgin ... luminal.virgin
# Add the group information into the DGEList
y$samples$group <- group
y$samples
group lib.size norm.factors
MCL1.DG luminal.virgin 23227641 1
MCL1.DH basal.virgin 21777891 1
MCL1.DI basal.pregnant 24100765 1
MCL1.DJ basal.pregnant 22665371 1
MCL1.DK basal.lactate 21529331 1
MCL1.DL basal.lactate 20015386 1
MCL1.LA basal.virgin 20392113 1
MCL1.LB luminal.virgin 21708152 1
MCL1.LC luminal.pregnant 22241607 1
MCL1.LD luminal.pregnant 21988240 1
MCL1.LE luminal.lactate 24723827 1
MCL1.LF luminal.lactate 24657293 1
The only annotation we can see is the Entrez Gene ID, which is not very informative. We would like to add some annotation information. There are a number of ways to do this. We will demonstrate how to do this using the org.Mm.eg.db package
First we need to decide what information we want. In order to see what we can extract we can run the columns
function on the annotation database.
columns(org.Mm.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MGI" "ONTOLOGY"
[16] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE"
[21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
We definitely want gene symbols and perhaps the full gene name. Let’s build up our annotation information in a separate data frame using the select
function.
ann <- select(org.Mm.eg.db,keys=rownames(y$counts),columns=c("ENTREZID","SYMBOL","GENENAME"))
'select()' returned 1:1 mapping between keys and columns
# Have a look at the annotation
head(ann)
ENTREZID SYMBOL GENENAME
1 497097 Xkr4 X-linked Kx blood group related 4
2 100503874 Gm19938 predicted gene, 19938
3 100038431 Gm10568 predicted gene 10568
4 19888 Rp1 retinitis pigmentosa 1 (human)
5 20671 Sox17 SRY (sex determining region Y)-box 17
6 27395 Mrpl15 mitochondrial ribosomal protein L15
Let’s double check that the ENTREZID
column matches exactly to our y$counts
rownames.
table(ann$ENTREZID==rownames(y$counts))
TRUE
27179
We can slot in the annotation information into the genes
slot of y
. (Please note that if the select
function returns a 1:many mapping then you can’t just append the annotation to the y object. An alternative way to get annotation will be discussed during the analysis of the second dataset.)
y$genes <- ann
Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.
There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in this case we have a sample size of 2 in each group, we favour filtering on a minimum counts per million threshold present in at least 2 samples. Two represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least two samples.
We’ll use the cpm
function from the edgeR library (M. D. Robinson, McCarthy, and Smyth 2010) to generate the CPM values and then filter. Note that by converting to CPMs we are normalising for the different sequencing depths for each sample.
# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL
497097 18.85684388 13.77543859 2.69700983 10.45648006 16.442685 14.3389690
100503874 0.04305215 0.00000000 0.04149246 0.04412017 0.000000 0.1998463
100038431 0.00000000 0.00000000 0.00000000 0.00000000 0.000000 0.0000000
19888 0.04305215 0.04591813 0.00000000 0.00000000 0.000000 0.0000000
20671 4.56352843 8.35709941 3.40238163 4.63261775 1.997275 4.0968483
27395 13.30311589 10.74484210 13.98295863 13.23605071 13.469996 13.4896224
MCL1.LA MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
100503874 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
100038431 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
19888 0.4903857 0.1381969 0.4496078 0.09095771 0.0000000 0.0000000
20671 0.7846171 1.1516411 0.8092940 0.36383085 0.1213404 0.4055595
27395 27.4615975 21.3744588 21.9858214 14.91706476 12.4171715 13.8701357
# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
100503874 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
100038431 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
19888 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
20671 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
27395 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097 FALSE FALSE FALSE FALSE
100503874 FALSE FALSE FALSE FALSE
100038431 FALSE FALSE FALSE FALSE
19888 FALSE FALSE FALSE FALSE
20671 TRUE FALSE FALSE FALSE
27395 TRUE TRUE TRUE TRUE
# Summary of how many TRUEs there are in each row
# There are 11433 genes that have TRUEs in all 12 samples.
table(rowSums(thresh))
0 1 2 3 4 5 6 7 8 9 10 11 12
10857 518 544 307 346 307 652 323 547 343 579 423 11433
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
summary(keep)
Mode FALSE TRUE
logical 11375 15804
A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample. A requirement for expression in two or more libraries is used as each group contains two replicates. This ensures that a gene will be retained if it is only expressed in one group. Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5. You should filter with CPMs rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.
# Let's have a look and see whether our threshold of 0.5 does indeed correspond to a count of about 10-15
# We will look at the first sample
plot(myCPM[,1],countdata[,1])
# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)
Challenge
- Plot the counts-per-million versus counts for the second sample.
- Add a vertical line at 0.5 and a horizontal line at 10.
- Add the lines again, colouring them blue
HINT: use the
col
parameter.Solution
Now that we’ve checked our filtering method we will filter the DGEList object.
y <- y[keep, keep.lib.sizes=FALSE]
Note: Alternatively, the edgeR function filterByExpr
can be used to perform this method of filtering automatically, as demonstrated in the limma, Glimma and edgeR workflow article (Charity W. Law et al. 2018).
Now that we have got rid of the lowly expressed genes and have our counts stored in a DGEList
object, we can look at a few different plots to check that the data is good quality, and that the samples are as we would expect.
First, we can check how many reads we have for each sample in the y
.
y$samples$lib.size
[1] 23218026 21768136 24091588 22656713 21522033 20008326 20384562 21698793
[9] 22235847 21982745 24719697 24652963
We can also plot the library sizes as a barplot to see whether there are any major discrepancies between the samples more easily.
# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(y$samples$lib.size,names=colnames(y),las=2)
# Add a title to the plot
title("Barplot of library sizes")
# we can also adjust the labelling if we want
barplot(y$samples$lib.size/1e06, names=colnames(y), las=2, ann=FALSE, cex.names=0.75)
mtext(side = 1, text = "Samples", line = 4)
mtext(side = 2, text = "Library size (millions)", line = 3)
title("Barplot of library sizes")
Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. Next we’ll use box plots to check the distribution of the read counts on the log2 scale. We can use the cpm
function to get log2 counts per million, which are corrected for the different library sizes. The cpm
function also adds a small offset to avoid taking log of zero.
# Get log2 counts per million
logcounts <- cpm(y,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
title("Boxplots of logCPMs (unnormalised)")
From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further. Another kind of QC plot that is helpful in checking for dodgy samples is a relative log expression (RLE) plot, which can be generated with plotRLE
from the EDASeq package.
Discussion
Do any samples appear to be different compared to the others?
By far, one of the most important plots we make when we analyse RNA-Seq data are MDSplots. An MDSplot is a visualisation of a principle components analysis, which determines the greatest sources of variation in the data. A principle components analysis is an example of an unsupervised analysis, where we don’t need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the plotMDS
function to create the MDS plot.
plotMDS(y)
It is a bit difficult to see exactly what is going on with the default plot, although we do see samples grouping together in pairs. To make this plot more informative, we can colour the samples according to the grouping information. We can also change the labels, or instead of labels we can have points.
# We specify the option to let us plot two plots side-by-sde
par(mfrow=c(1,2))
# Let's set up colour schemes for CellType
# How many cell types and in what order are they stored?
levels(sampleinfo$CellType)
[1] "basal" "luminal"
## Let's choose purple for basal and orange for luminal
col.cell <- c("purple","orange")[sampleinfo$CellType]
data.frame(sampleinfo$CellType,col.cell)
sampleinfo.CellType col.cell
1 luminal orange
2 basal purple
3 basal purple
4 basal purple
5 basal purple
6 basal purple
7 basal purple
8 luminal orange
9 luminal orange
10 luminal orange
11 luminal orange
12 luminal orange
# Redo the MDS with cell type colouring
plotMDS(y,col=col.cell)
# Let's add a legend to the plot so we know which colours correspond to which cell type
legend("topleft",fill=c("purple","orange"),legend=levels(sampleinfo$CellType))
# Add a title
title("Cell type")
# Similarly for status
levels(sampleinfo$Status)
[1] "lactate" "pregnant" "virgin"
col.status <- c("blue","red","black")[sampleinfo$Status]
col.status
[1] "black" "black" "red" "red" "blue" "blue" "black" "black" "red"
[10] "red" "blue" "blue"
plotMDS(y,col=col.status)
legend("topleft",fill=c("blue","red","black"),legend=levels(sampleinfo$Status),cex=0.8)
title("Status")
Discussion
Look at the MDS plot coloured by cell type. Is there something strange going on with the samples? Identify the two samples that don’t appear to be in the right place.
# There is a sample info corrected file in your data directory
# Old sampleinfo
sampleinfo
FileName SampleName CellType Status
1 MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DG luminal virgin
2 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DH basal virgin
3 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DI basal pregnant
4 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DJ basal pregnant
5 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DK basal lactate
6 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.DL basal lactate
7 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LA basal virgin
8 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LB luminal virgin
9 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LC luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LD luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LE luminal lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1 MCL1.LF luminal lactate
# I'm going to write over the sampleinfo object with the corrected sample info
sampleinfo <- read.delim("data/SampleInfo_Corrected.txt", stringsAsFactors = TRUE)
sampleinfo
FileName SampleName CellType Status
1 MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DG basal virgin
2 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DH basal virgin
3 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DI basal pregnant
4 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DJ basal pregnant
5 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DK basal lactate
6 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.DL basal lactate
7 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LA luminal virgin
8 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LB luminal virgin
9 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LC luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LD luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LE luminal lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1 MCL1.LF luminal lactate
# We need to correct the info for the groups
group <- factor(paste(sampleinfo$CellType,sampleinfo$Status,sep="."))
y$samples$group <- group
# Redo the MDSplot with corrected information
par(mfrow=c(1,2))
col.cell <- c("purple","orange")[sampleinfo$CellType]
col.status <- c("blue","red","black")[sampleinfo$Status]
plotMDS(y,col=col.cell)
legend("topleft",fill=c("purple","orange"),legend=levels(sampleinfo$CellType))
title("Cell type")
plotMDS(y,col=col.status)
legend("topleft",fill=c("blue","red","black"),legend=levels(sampleinfo$Status),cex=0.8)
title("Status")
Discussion
What is the greatest source of variation in the data (i.e. what does dimension 1 represent)? What is the second greatest source of variation in the data?
Challenge
- Redo the plots choosing your own colours.
- Change the plotting character to a symbol instead of the column names HINT: use
pch
argument. Trypch=16
and see what happens.- Change the plotting characters such that basal samples have the value
1
and luminal samples have the value4
. Colour by status (lactate, pregnant, virgin)
Solution
[1] "#1B9E77" "#D95F02" "#7570B3"
The distance between each pair of samples in the MDS plot is calculated as the leading fold change, defined as the root-mean-square of the largest 500 log2-fold changes between that pair of samples. Replicate samples from the same group cluster together in the plot, while samples from different groups form separate clusters. This indicates that the differences between groups are larger than those within groups, i.e., differential expression is greater than the variance and can be detected. In the MDS plot, the distance between basal samples on the left and luminal cells on the right is about 6 units, corresponding to a leading fold change of about 64-fold (2^6 = 64) between basal and luminal. The expression differences between virgin, pregnant and lactating are greater for luminal cells than for basal.
Notes
plotMDS(y)
. The additional code is purely for aesthetics, to improve the visualization of the groups.plotMDS
plots the first two dimensions as a default, however you can plot higher dimensions using the dim
argument.
# Dimension 3 appears to separate pregnant samples from the rest. Dim4?
plotMDS(y,dim=c(3,4),col=col.status,pch=char.celltype,cex=2)
legend("topright",legend=levels(sampleinfo$Status),col=cols,pch=16)
legend("bottomright",legend=levels(sampleinfo$CellType),pch=c(1,4))
Another alternative is to generate an interactive MDS plot using the Glimma package. This allows the user to interactively explore the different dimensions.
labels <- paste(sampleinfo$SampleName, sampleinfo$CellType, sampleinfo$Status)
glMDSPlot(y, labels=labels, groups=group, folder="mds")
Glimma was created to make interactive versions of some of the popular plots from the limma package (Su et al. 2017). It can be used to obtain MDS plots, and also mean-difference (MD) and volcano plots which will be covered later. The output of glMDSPlot
is an html page (/mds/MDS-Plot.html) that shows the MDS plot on the left, and the amount of variation explained by each dimension in a barplot on the right. The user can hover over points to find out sample information, and switch between successive dimensions in the MDS plot by clicking on the bars in the barplot. The default MDS plots shows dimensions 1 and 2.
An alternative to plotMDS
for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the heatmap.2
function from the gplots package. In this example heatmap.2
calculates a matrix of euclidean distances from the logCPM (logcounts
object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using plotMDS
.)
The RColorBrewer package has nicer colour schemes, accessed using the brewer.pal
function. “RdYlBu” is a common choice, and “Spectral” is also nice.
Note:The png
function will create a png file to save the plots created straight after, and will close this file when dev.off()
is called. To see your plots interactively, simply omit those two lines.
Let’s select data for the 500 most variable genes and plot the heatmap
# We estimate the variance for each row in the logcounts matrix
var_genes <- apply(logcounts, 1, var)
head(var_genes)
497097 20671 27395 18777 21399 58175
13.6624115 2.7493077 0.1581944 0.1306781 0.3929526 4.8232522
# Get the gene names for the top 500 most variable genes
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
head(select_var)
[1] "22373" "12797" "11475" "11468" "14663" "24117"
# Subset logcounts matrix
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
[1] 500 12
head(highly_variable_lcpm)
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
22373 0.4660671 0.9113837 7.435080 7.807900 9.288318 9.318483 6.6277452
12797 10.0429713 9.6977966 11.046567 11.358857 11.605894 11.491773 0.8529774
11475 12.3849005 12.3247093 13.989573 14.180048 14.285489 14.032486 3.4823396
11468 7.1537287 6.8917703 9.325436 9.661942 9.491765 9.424803 -1.8086076
14663 1.9717614 1.9471846 9.091895 8.756261 9.539747 9.504098 6.1710357
24117 7.8378853 7.8995788 8.634622 8.582447 6.704706 6.777335 -1.3824015
MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
22373 6.6884102 12.1273320 13.1502579 15.6481000 15.569686
12797 1.6034598 2.3323371 2.4601069 0.8055694 1.288003
11475 4.3708241 5.2116574 5.0788442 3.6997655 3.965775
11468 -0.6387584 0.5244769 0.6694047 -0.4412496 -1.014878
14663 6.2328260 13.7571928 14.2506761 16.0020840 15.885390
24117 -0.5387838 -0.2280797 0.1243601 -2.9468278 -2.945610
## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[sampleinfo$CellType]
# Plot the heatmap
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=col.cell,scale="row")
# Save the heatmap
png(file="High_var_genes.heatmap.png")
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=col.cell,scale="row")
dev.off()
quartz_off_screen
2
Challenge
- Change the colour scheme to “PiYG” and redo the heatmap. Try
?RColorBrewer
and see what other colour schemes are available.- Change the sample names to
group
using thelabCol
argument- Redo the heatmap using the top 500 LEAST variable genes.
Solution
[1] "320714" "93684" "245638" "93747" "26433" "236792"
If we wanted to add more annotations, for example colours for both cell type and status, we can’t easily do that with heatmap.2
, but we could use the aheatmap
function from the NMF package. We can specify which columns we want to use from sampleinfo in annCol=
and coloured bars will be added for those.
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
aheatmap(highly_variable_lcpm,col=rev(morecols(50)),main="Top 500 most variable genes across samples",annCol=sampleinfo[, 3:4],labCol=group, scale="row")
TMM normalization is performed to eliminate composition biases between libraries (Mark D. Robinson and Oshlack 2010). This generates a set of normalization factors, where the product of these factors and the library sizes defines the effective library size. The calcNormFactors
function calculates the normalization factors between libraries. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.
# Apply normalisation to DGEList object
y <- calcNormFactors(y)
This will update the normalisation factors in the DGEList
object (their default values are 1). Take a look at the normalisation factors for these samples.
y$samples
group lib.size norm.factors
MCL1.DG basal.virgin 23218026 1.2368993
MCL1.DH basal.virgin 21768136 1.2139485
MCL1.DI basal.pregnant 24091588 1.1255640
MCL1.DJ basal.pregnant 22656713 1.0698261
MCL1.DK basal.lactate 21522033 1.0359212
MCL1.DL basal.lactate 20008326 1.0872153
MCL1.LA luminal.virgin 20384562 1.3684449
MCL1.LB luminal.virgin 21698793 1.3653200
MCL1.LC luminal.pregnant 22235847 1.0047431
MCL1.LD luminal.pregnant 21982745 0.9232822
MCL1.LE luminal.lactate 24719697 0.5291015
MCL1.LF luminal.lactate 24652963 0.5354877
The normalization factors multiply to unity across all libraries. A normalization factor below one indicates that the library size will be scaled down, as there is more suppression (i.e., composition bias) in that library relative to the other libraries. This is also equivalent to scaling the counts upwards in that sample. Conversely, a factor above one scales up the library size and is equivalent to downscaling the counts.
The last two samples have much smaller normalisation factors, and MCL1.LA and MCL1.LB have the largest. If we plot mean difference plots using the plotMD
function for these samples, we should be able to see the composition bias problem. We will use the logcounts
, which have been normalised for library size, but not for composition bias.
par(mfrow=c(1,2))
plotMD(logcounts,column = 7)
abline(h=0,col="grey")
plotMD(logcounts,column = 11)
abline(h=0,col="grey")
The mean-difference plots show average expression (mean: x-axis) against log-fold-changes (difference: y-axis). Because our DGEList
object contains the normalisation factors, if we redo these plots using y
, we should see the composition bias problem has been solved.
par(mfrow=c(1,2))
plotMD(y,column = 7)
abline(h=0,col="grey")
plotMD(y,column = 11)
abline(h=0,col="grey")
Challenge
Plot the biased and unbiased MD plots side by side for the same sample to see the before and after TMM normalisation effect.
Solution
We need to save a few data objects to use for Day 2 so we don’t have to rerun everything
# save(group,y,logcounts,sampleinfo,file="day1objects.Rdata")
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. The limma package (Ritchie et al. 2015) (since version 3.16.0) offers the voom
function, which transforms the read counts into logCPMs while taking into account the mean-variance relationship in the data (Charity W. Law et al. 2014). After vooming, users can apply a linear model to the voom transformed data to test for differentially expressed genes, using standard limma commands.
Load the objects into the workspace that we created yesterday
# load("day1objects.Rdata")
# objects()
First we need to create a design matrix for the groups (see the excellent limma user guide for more information on design matrices). There are many different ways to set up your design matrix, and it is dictated by what comparisons you would like to test. We will follow the set-up from pg 43 of the limma vignette (“Interaction models: 2X2 factorial designs”).
In this analysis let’s assume that we will be testing differences in status in the different cell types separately. For example, we want to know which genes are differentially expressed between pregnant and lactating in basal cells only. We have previously coded the group
variable, which is a concatenation of cell type and status. Coding the cell type and status in this way allows us to be flexible in specifying which comparisons we are interested in.
# Look at group variable again
group
[1] basal.virgin basal.virgin basal.pregnant basal.pregnant
[5] basal.lactate basal.lactate luminal.virgin luminal.virgin
[9] luminal.pregnant luminal.pregnant luminal.lactate luminal.lactate
6 Levels: basal.lactate basal.pregnant basal.virgin ... luminal.virgin
# Specify a design matrix without an intercept term
design <- model.matrix(~ 0 + group)
design
groupbasal.lactate groupbasal.pregnant groupbasal.virgin
1 0 0 1
2 0 0 1
3 0 1 0
4 0 1 0
5 1 0 0
6 1 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
groupluminal.lactate groupluminal.pregnant groupluminal.virgin
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 1
8 0 0 1
9 0 1 0
10 0 1 0
11 1 0 0
12 1 0 0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
## Make the column names of the design matrix a bit nicer
colnames(design) <- levels(group)
design
basal.lactate basal.pregnant basal.virgin luminal.lactate luminal.pregnant
1 0 0 1 0 0
2 0 0 1 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 1
10 0 0 0 0 1
11 0 0 0 1 0
12 0 0 0 1 0
luminal.virgin
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
Each column of the design matrix tells us which samples correspond to each group. The samples which come from basal cells from a lactating mouse correspond to columns 5 and 6 in the counts matrix, i.e. the samples which have 1s.
Once we have our design matrix ready to go, we can perform our voom transformation. Voom will automatically adjust the library sizes using the norm.factors
already calculated. The voom transformation uses the experiment design matrix, and produces an EList
object. We can add plot=TRUE
to generate a plot of the mean-variance trend. This plot can also tell us if there are any genes that look really variable in our data, and if we’ve filtered the low counts adequately.
par(mfrow=c(1,1))
v <- voom(y,design,plot = TRUE)
The voom normalised log2 counts can be found in v$E. Take a look at what is in the voom object.
v
An object of class "EList"
$genes
ENTREZID SYMBOL GENENAME
1 497097 Xkr4 X-linked Kx blood group related 4
5 20671 Sox17 SRY (sex determining region Y)-box 17
6 27395 Mrpl15 mitochondrial ribosomal protein L15
7 18777 Lypla1 lysophospholipase 1
9 21399 Tcea1 transcription elongation factor A (SII) 1
15799 more rows ...
$targets
group lib.size norm.factors
MCL1.DG basal.virgin 28718360 1.236899
MCL1.DH basal.virgin 26425396 1.213949
MCL1.DI basal.pregnant 27116624 1.125564
MCL1.DJ basal.pregnant 24238743 1.069826
MCL1.DK basal.lactate 22295130 1.035921
7 more rows ...
$E
MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
497097 3.932532 3.507368 1.272317 3.292541 3.9909851 3.724252 -5.8019424
20671 1.890808 2.787899 1.605217 2.121856 0.9642868 1.923156 -0.7575483
27395 3.429894 3.149591 3.637638 3.631978 3.7037376 3.636318 4.3286281
18777 4.505933 4.285975 5.128398 5.270351 5.3801014 5.185279 4.8889286
21399 5.804007 5.822559 5.988345 5.764344 5.7006304 5.615502 5.5801412
MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097 -5.8887821 -5.4816421 -5.343143 -4.709206 -4.7226146
20671 -0.2163568 -0.2721888 -1.255680 -1.901851 -0.3302972
27395 3.9707527 4.4535229 4.016606 4.555237 4.6973455
18777 4.8635986 4.9031416 4.993363 5.379583 5.4610208
21399 5.4087074 5.5797290 5.512504 5.280898 5.2446116
15799 more rows ...
$weights
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 42.85032 40.59638 19.78119 18.11020 38.62541 37.98661 0.7907909
[2,] 21.34610 19.99539 15.74020 14.40027 10.70068 10.49511 4.5401043
[3,] 35.05544 33.04583 39.72941 36.82081 35.31518 34.70612 50.4586968
[4,] 56.26175 53.86441 69.16366 66.65108 66.03797 65.46069 64.6692219
[5,] 76.92857 75.98600 76.79528 75.45282 71.62644 71.16558 73.5980084
[,8] [,9] [,10] [,11] [,12]
[1,] 0.7907909 0.7907909 0.7907909 0.7907909 0.7907909
[2,] 4.7399033 3.4277768 3.2224692 2.1314506 2.1428929
[3,] 52.1833006 45.7806579 43.1302387 38.6112610 38.8559918
[4,] 66.0984998 60.0161723 57.3174922 54.0693981 54.3410792
[5,] 74.5613544 70.1642210 68.1238307 50.8570652 51.1210163
15799 more rows ...
$design
basal.lactate basal.pregnant basal.virgin luminal.lactate luminal.pregnant
1 0 0 1 0 0
2 0 0 1 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 1 0 0 0 0
luminal.virgin
1 0
2 0
3 0
4 0
5 0
7 more rows ...
# What is contained in this object?
names(v)
[1] "genes" "targets" "E" "weights" "design"
Challenge
- What is in the
targets
slot ofv
and what does it correspond to iny
?- What are the dimensions of the
weights
slot inv
?
We can repeat the box plots for the normalised data to compare to before normalisation. The expression values in v$E
are already log2 values so we don’t need to log-transform.
par(mfrow=c(1,2))
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2,main="Unnormalised logCPM")
## Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
boxplot(v$E, xlab="", ylab="Log2 counts per million",las=2,main="Voom transformed logCPM")
## Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(v$E),col="blue")
Compare these box plots to the box plots we generated before performing the normalisation. Can you see any differences?
Now that we have the voom transformed data we can use limma to test for differential expression. First we fit a linear model for each gene using the lmFit
function in limma. lmFit
needs the voom object and the design matrix that we have already specified, which is stored within the voom object.
# Fit the linear model
fit <- lmFit(v)
names(fit)
[1] "coefficients" "stdev.unscaled" "sigma" "df.residual"
[5] "cov.coefficients" "pivot" "rank" "genes"
[9] "Amean" "method" "design"
lmFit
estimates group means according to the design matrix, as well as gene-wise variances. There are a number of items stored in the fit
object, most of which are specific to the statistical testing, and we won’t be discussing these in detail today.
Since we are interested in differences between groups, we need to specify which comparisons we want to test. The comparison of interest can be specified using the makeContrasts
function. Here, we are interested in knowing which genes are differentially expressed between the pregnant and lactating group in the basal cells. This is done by defining the null hypothesis as basal.pregnant - basal.lactate = 0 for each gene. Note that the group names must exactly match the column names of the design matrix.
cont.matrix <- makeContrasts(B.PregVsLac=basal.pregnant - basal.lactate,levels=design)
Take a look at the contrast matrix. The contrast matrix tells limma which columns of the design matrix we are interested in testing our comparison. Note that here we have specified only one comparison to test, but we can specify as many as we want in one go.
cont.matrix
Contrasts
Levels B.PregVsLac
basal.lactate -1
basal.pregnant 1
basal.virgin 0
luminal.lactate 0
luminal.pregnant 0
luminal.virgin 0
Now we can apply the contrasts matrix to the fit
object to get the statistics and estimated parameters of our comparison that we are interested in. Here we call the contrasts.fit
function in limma.
fit.cont <- contrasts.fit(fit, cont.matrix)
The final step is to call the eBayes
function, which performs empirical Bayes shrinkage on the variances, and estimates moderated t-statistics and the associated p-values.
fit.cont <- eBayes(fit.cont)
Check the dimensions of the fit object
dim(fit.cont)
[1] 15804 1
We can use the limma decideTests
function to generate a quick summary of DE genes for the contrasts.
summa.fit <- decideTests(fit.cont)
summary(summa.fit)
B.PregVsLac
Down 2635
NotSig 10464
Up 2705
Challenge
- Add another contrast to the contrasts matrix:
L.PregVsLac = luminal.pregnant - luminal.lactate
and re-run the code above. You should have two comparisons infit.cont
now.- Check out the
vennDiagram
function (HINT: type?vennDiagram
). Can you show the overlap of differentially expressed genes between the two comparisons? How many genes are commonly differentially expressed?
Solution
B.PregVsLac L.PregVsLac
Down 2635 3686
NotSig 10464 8451
Up 2705 3667
Let’s do a few plots to make sure everything looks good and that we haven’t made a mistake in the analysis. Genome-wide plots that are useful for checking are MAplots (or MDplots) and volcano plots. There are functions in limma for plotting these with fit.cont
as input.
# We want to highlight the significant genes. We can get this from decideTests.
par(mfrow=c(1,2))
plotMD(fit.cont,coef=1,status=summa.fit[,"B.PregVsLac"], values = c(-1, 1), hl.col=c("blue","red"))
# For the volcano plot we have to specify how many of the top genes to highlight.
# We can also specify that we want to plot the gene symbol for the highlighted genes.
# let's highlight the top 100 most DE genes
volcanoplot(fit.cont,coef=1,highlight=100,names=fit.cont$genes$SYMBOL, main="B.PregVsLac")
Challenge
Look at the MD plot and volcano plot for the second comparison,
L.PregVsLac
. Change the number of highlighted genes to 200 in the volcano plot.
Before following up on the DE genes with further lab work, it is recommended to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression using stripchart
. We can use the normalised log expression values in the voom object (v$E
).
par(mfrow=c(1,3))
# Let's look at the first gene in the topTable, Wif1, which has a rowname 24117
stripchart(v$E["24117",]~group)
# This plot is ugly, let's make it better
stripchart(v$E["24117",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,col=1:6,method="jitter")
# Let's use nicer colours
nice.col <- brewer.pal(6,name="Dark2")
stripchart(v$E["24117",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.col,method="jitter",ylab="Normalised log2 expression",main="Wif1")
Notice anything interesting about the expression of this gene?
Challenge
Take the top gene from the L.PregVsLactate comparison and make a stripchart of grouped expression as above. (Don’t forget to change the title of the plot.)
Solution
ENTREZID SYMBOL
12992 12992 Csn1s2b
13358 13358 Slc25a1
20531 20531 Slc34a2
11941 11941 Atp2b2
230810 230810 Slc30a2
100705 100705 Acacb
68801 68801 Elovl5
13645 13645 Egf
12683 12683 Cidea
26366 26366 Ceacam10
GENENAME
12992 casein alpha s2-like B
13358 solute carrier family 25 (mitochondrial carrier, citrate transporter), member 1
20531 solute carrier family 34 (sodium phosphate), member 2
11941 ATPase, Ca++ transporting, plasma membrane 2
230810 solute carrier family 30 (zinc transporter), member 2
100705 acetyl-Coenzyme A carboxylase beta
68801 ELOVL family member 5, elongation of long chain fatty acids (yeast)
13645 epidermal growth factor
12683 cell death-inducing DNA fragmentation factor, alpha subunit-like effector A
26366 carcinoembryonic antigen-related cell adhesion molecule 10
logFC AveExpr t P.Value adj.P.Val B
12992 -8.603611 3.5629500 -46.84744 4.172002e-15 6.593433e-11 23.84700
13358 -4.124175 5.7796989 -34.76897 1.526326e-13 1.206103e-09 21.38712
20531 -4.177812 4.2786290 -31.24354 5.524752e-13 2.910439e-09 20.23287
11941 -7.386986 1.2821431 -30.01476 8.946195e-13 3.534642e-09 18.54201
230810 -3.203118 2.6958115 -29.31342 1.188209e-12 3.654867e-09 19.42820
100705 -4.314320 4.4409137 -28.93689 1.387573e-12 3.654867e-09 19.30790
68801 -2.863304 6.4552045 -28.18823 1.900049e-12 3.828415e-09 19.08355
13645 -5.362664 0.7359047 -28.14183 1.937947e-12 3.828415e-09 17.22920
12683 -4.840655 3.3749575 -27.04271 3.123304e-12 5.484522e-09 18.58932
26366 -3.295621 1.8210142 -25.45612 6.437086e-12 1.005946e-08 17.82256
An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot
function in the Glimma package.
group2 <- group
levels(group2) <- c("basal.lactate","basal.preg","basal.virgin","lum.lactate", "lum.preg", "lum.virgin")
glXYPlot(x=fit.cont$coefficients[,1], y=fit.cont$lods[,1],
xlab="logFC", ylab="B", main="B.PregVsLac",
counts=v$E, groups=group2, status=summa.fit[,1],
anno=fit.cont$genes, side.main="ENTREZID", folder="volcano")
This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.
When there is a lot of differential expression, sometimes we may want to cut-off on a fold change threshold as well as a p-value threshold so that we follow up on the most biologically significant genes. However, it is not recommended to simply rank by p-value and then discard genes with small logFC’s, as this has been shown to increase the false discovery rate. In other words, you are not controlling the false discovery rate at 5% any more. There is a function called treat
in the limma package that performs this style of analysis correctly (McCarthy and Smyth 2009). treat
will simply take our fit.cont
object, as well as a user-specified log fold change cut-off, and recalculate the moderated t-statistics and p-values with the new information about logFC.
# Let's decide that we are only interested in genes that have a absolute logFC of 1.
# This corresponds to a fold change of 2, or 0.5 (i.e. double or half).
# We can perform a treat analysis which ranks our genes according to p-value AND logFC.
# This is easy to do after our analysis, we just give the treat function the fit.cont object and specify our cut-off.
fit.treat <- treat(fit.cont,lfc=1)
res.treat <- decideTests(fit.treat)
summary(res.treat)
B.PregVsLac L.PregVsLac
Down 53 631
NotSig 15677 14775
Up 74 398
topTable(fit.treat,coef=1,sort.by="p")
ENTREZID SYMBOL GENENAME logFC
211577 211577 Mrgprf MAS-related GPR, member F -5.146268
78896 78896 Ecrg4 ECRG4 augurin precursor 2.807548
16012 16012 Igfbp6 insulin-like growth factor binding protein 6 -2.896115
21953 21953 Tnni2 troponin I, skeletal, fast 2 -5.827889
14620 14620 Gjb3 gap junction protein, beta 3 3.600094
12992 12992 Csn1s2b casein alpha s2-like B -6.070143
270150 270150 Ccdc153 coiled-coil domain containing 153 -3.211148
226101 226101 Myof myoferlin -2.329744
381290 381290 Atp2b4 ATPase, Ca++ transporting, plasma membrane 4 -2.143885
231991 231991 Creb5 cAMP responsive element binding protein 5 -2.598105
AveExpr t P.Value adj.P.Val
211577 -0.9368335 -13.18562 7.329299e-09 6.669358e-05
78896 3.0365195 11.94135 2.234154e-08 6.669358e-05
16012 1.9784488 -11.92570 2.267536e-08 6.669358e-05
21953 0.3020716 -11.93183 2.295775e-08 6.669358e-05
14620 3.5252805 11.89242 2.342636e-08 6.669358e-05
12992 3.5629500 -11.83199 2.532027e-08 6.669358e-05
270150 -1.3408388 -10.67393 7.828123e-08 1.767367e-04
226101 6.2235246 -10.42714 1.012983e-07 2.001148e-04
381290 3.9440659 -10.17758 1.322118e-07 2.050763e-04
231991 4.2759295 -10.17157 1.330809e-07 2.050763e-04
# Notice that much fewer genes are highlighted in the MAplot
par(mfrow=c(1,2))
plotMD(fit.treat,coef=1,status=res.treat[,"B.PregVsLac"], values=c(-1,1), hl.col=c("blue","red"))
abline(h=0,col="grey")
plotMD(fit.treat,coef=2,status=res.treat[,"L.PregVsLac"], values=c(-1,1), hl.col=c("blue","red"))
abline(h=0,col="grey")
Challenge
Change the cut-off so that we are interested in genes that change at least 50% on the fold change scale.
HINT: what is the corresponding logFC value of 50% fold change? Assume basal.pregnant is 50% higher than basal.lactate
Solution
B.PregVsLac L.PregVsLac
Down 234 1328
NotSig 15250 13397
Up 320 1079
ENTREZID SYMBOL GENENAME logFC
78896 78896 Ecrg4 ECRG4 augurin precursor 2.807548
16012 16012 Igfbp6 insulin-like growth factor binding protein 6 -2.896115
211577 211577 Mrgprf MAS-related GPR, member F -5.146268
381290 381290 Atp2b4 ATPase, Ca++ transporting, plasma membrane 4 -2.143885
14620 14620 Gjb3 gap junction protein, beta 3 3.600094
226101 226101 Myof myoferlin -2.329744
24117 24117 Wif1 Wnt inhibitory factor 1 1.819943
231830 231830 Micall2 MICAL-like 2 2.253400
21953 21953 Tnni2 troponin I, skeletal, fast 2 -5.827889
231991 231991 Creb5 cAMP responsive element binding protein 5 -2.598105
AveExpr t P.Value adj.P.Val
78896 3.0365195 14.68324 2.122741e-09 1.107588e-05
16012 1.9784488 -14.53610 2.387463e-09 1.107588e-05
211577 -0.9368335 -14.50548 2.595874e-09 1.107588e-05
381290 3.9440659 -13.87032 4.068604e-09 1.107588e-05
14620 3.5252805 13.79074 4.435571e-09 1.107588e-05
226101 6.2235246 -13.68164 4.766155e-09 1.107588e-05
24117 2.9755445 13.64479 4.905791e-09 1.107588e-05
231830 4.7605970 13.34681 6.327022e-09 1.249903e-05
21953 0.3020716 -12.95757 9.721628e-09 1.399330e-05
231991 4.2759295 -12.81319 1.010560e-08 1.399330e-05
An interactive version of the mean-difference plots is possible via the glMDPlot
function in the Glimma package.
glMDPlot(fit.treat, coef=1, counts=v$E, groups=group2,
status=res.treat, side.main="ENTREZID", main="B.PregVsLac",
folder="md")
As with the volcano plot example above, this function creates an html page (./md/MD-Plot.html) that allows the user to search for their favourite gene.
Sometimes there is quite a long list of differentially expressed genes to interpret after a differential expression analysis, and it is usually infeasible to go through the list one gene at a time trying to understand it’s biological function. A common downstream procedure is gene set testing, which aims to understand which pathways/gene networks the differentially expressed genes are implicated in.
There are a number of different ways to go about testing for enrichment of biological pathways, and the test you choose usually depends on the question you’re asking. There are two kinds of tests: self-contained and competitive gene set tests. Self-contained tests, which include the ROAST
procedure, ask the question “Are the genes in the set/pathway differentially expressed as a whole?” Competitive gene set tests, like goana
and camera
ask the question whether the differentially expressed genes tend to be over-represented in the gene set, compared to all the other genes in the experiment. These different questions use different statistical methodology.
First, we will perform a gene ontology (GO) enrichment analysis using the goana
function in limma. There are approximately 20,000 GO terms, and they are split into three categories: BP (biological process), MF (molecular function) and CC (cellular component). goana
uses annotation from the appropriate Bioconductor package and can be used for any of the five species specified (Hs, Mm, Rn, Dm or Pt). goana
has an advantage over other methods, such as DAVID, in that there is the option to take into account the gene length bias inherent in RNA-Seq data.
Suppose we want to identify GO terms that are over-represented in the basal lactating group compared to the basal pregnancy group. This can be achieved by applying the goana function to the differential expression results of that comparison. goana
takes the fit.cont
object, the coefficient of interest and the species. The top set of most enriched GO terms can be viewed with the topGO function.
go <- goana(fit.cont, coef="B.PregVsLac",species = "Mm")
topGO(go, n=10)
Term Ont N Up Down P.Up
GO:0022613 ribonucleoprotein complex biogenesis BP 398 196 25 1.913118e-50
GO:0042254 ribosome biogenesis BP 290 155 12 8.927292e-46
GO:1990904 ribonucleoprotein complex CC 659 261 48 1.594455e-44
GO:0022626 cytosolic ribosome CC 102 75 2 4.361513e-36
GO:0006364 rRNA processing BP 204 112 4 9.206303e-35
GO:0016072 rRNA metabolic process BP 211 113 6 1.021692e-33
GO:0003723 RNA binding MF 1012 320 109 1.269701e-31
GO:0034470 ncRNA processing BP 351 145 18 3.695632e-27
GO:0005840 ribosome CC 214 104 6 1.479315e-26
GO:0022625 cytosolic large ribosomal subunit CC 52 44 0 2.739381e-26
P.Down
GO:0022613 1.0000000
GO:0042254 1.0000000
GO:1990904 1.0000000
GO:0022626 0.9999999
GO:0006364 1.0000000
GO:0016072 1.0000000
GO:0003723 1.0000000
GO:0034470 1.0000000
GO:0005840 1.0000000
GO:0022625 1.0000000
The row names of the output are the universal identifiers of the GO terms, with one term per row. The Term column gives the names of the GO terms. These terms cover three domains - biological process (BP), cellular component (CC) and molecular function (MF), as shown in the Ont column. The N column represents the total number of genes that are annotated with each GO term. The Up and Down columns represent the number of differentially expressed genes that overlap with the genes in the GO term. The P.Up and P.Down columns contain the p-values for over-representation of the GO term across the set of up- and down-regulated genes, respectively. The output table is sorted by the minimum of P.Up and P.Down by default.
An additional refinement is to supply goana
with the gene lengths using the covariate
argument. In the original data matrix that we loaded into R, there is a column called “Length.”
colnames(seqdata)
[1] "EntrezGeneID" "Length"
[3] "MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1" "MCL1.DH_BC2CTUACXX_CAGATC_L002_R1"
[5] "MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1" "MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1"
[7] "MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1" "MCL1.DL_BC2CTUACXX_ATCACG_L002_R1"
[9] "MCL1.LA_BC2CTUACXX_GATCAG_L001_R1" "MCL1.LB_BC2CTUACXX_TGACCA_L001_R1"
[11] "MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1" "MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1"
[13] "MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1" "MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1"
In order to get the gene lengths for every gene in fit.cont
, we can use the match
command. Note that the gene length supplied needs to be in the correct order.
m <- match(rownames(fit.cont),seqdata$EntrezGeneID)
gene_length <- seqdata$Length[m]
head(gene_length)
[1] 3634 3130 4203 2433 2847 2241
# Rerun goana with gene length information
go_length <- goana(fit.cont,coef="B.PregVsLac",species="Mm",
covariate=gene_length)
topGO(go_length, n=10)
Term Ont N Up Down P.Up
GO:0022613 ribonucleoprotein complex biogenesis BP 398 196 25 2.329797e-52
GO:0042254 ribosome biogenesis BP 290 155 12 2.649314e-48
GO:1990904 ribonucleoprotein complex CC 659 261 48 1.248381e-47
GO:0022626 cytosolic ribosome CC 102 75 2 7.020502e-40
GO:0006364 rRNA processing BP 204 112 4 1.430811e-36
GO:0016072 rRNA metabolic process BP 211 113 6 2.150090e-35
GO:0003723 RNA binding MF 1012 320 109 1.117780e-31
GO:0005840 ribosome CC 214 104 6 9.362522e-31
GO:0003735 structural constituent of ribosome MF 147 81 2 8.302045e-30
GO:0044391 ribosomal subunit CC 182 92 2 1.510361e-29
P.Down
GO:0022613 1.0000000
GO:0042254 1.0000000
GO:1990904 1.0000000
GO:0022626 0.9999987
GO:0006364 1.0000000
GO:0016072 1.0000000
GO:0003723 1.0000000
GO:0005840 1.0000000
GO:0003735 1.0000000
GO:0044391 1.0000000
Challenge
Perform GO analysis for the second comparison, “L.PregVsLac,” taking into account gene length information
Solution
Term Ont
GO:0031090 organelle membrane CC
GO:0005739 mitochondrion CC
GO:0005783 endoplasmic reticulum CC
GO:0005515 protein binding MF
GO:0005789 endoplasmic reticulum membrane CC
GO:0022626 cytosolic ribosome CC
GO:0042175 nuclear outer membrane-endoplasmic reticulum membrane network CC
GO:0050789 regulation of biological process BP
GO:0044281 small molecule metabolic process BP
GO:0005634 nucleus CC
N Up Down P.Up P.Down
GO:0031090 1946 390 709 9.998865e-01 1.233259e-43
GO:0005739 1676 291 597 1.000000e+00 1.405622e-35
GO:0005783 1379 263 519 9.999829e-01 1.637816e-34
GO:0005515 7422 2042 1672 1.518817e-29 9.998305e-01
GO:0005789 493 68 228 1.000000e+00 2.006631e-29
GO:0022626 102 75 7 2.095015e-29 9.999886e-01
GO:0042175 515 75 234 9.999998e-01 9.823731e-29
GO:0050789 8345 2257 1865 1.217543e-28 9.999991e-01
GO:0044281 1376 261 494 9.999722e-01 3.169760e-28
GO:0005634 6012 1693 1336 1.013031e-27 9.998554e-01
Notes
Other databases of gene sets that are available come from the Broad Institute’s Molecular Signatures Database (MSigDB). CAMERA is good option for testing a very large number of gene sets such as the MSigDB sets, as it is very fast. CAMERA is known as a competitive gene set test, however it has the advantage that it can take into account inter-gene correlation within each gene set (Wu and Smyth 2012). It also works seamlessly with a voom
object, taking into account the mean-variance relationship in RNA-Seq data.
Here we will be using the C2 gene sets for mouse, available as .rdata files from the WEHI bioinformatics page http://bioinf.wehi.edu.au/MSigDB/index.html. The C2 gene sets contain >5000 curated gene sets collected from a variety of places: BioCarta, KEGG, Pathway Interaction Database, Reactome as well as some published studies. It doesn’t include GO terms.
# Load in the mouse c2 gene sets as Mm.c2
Mm.c2 <- readRDS("data/Mm.c2.all.v7.1.entrez.rds")
# Have a look at the first few gene sets
names(Mm.c2)[1:5]
[1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
[2] "KEGG_CITRATE_CYCLE_TCA_CYCLE"
[3] "KEGG_PENTOSE_PHOSPHATE_PATHWAY"
[4] "KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS"
[5] "KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM"
# Number of gene sets in C2
length(Mm.c2)
[1] 5529
The gene identifiers are Entrez Gene ID, which is the same as the rownames of our voom object. We need to map the Entrez gene ids between the list of gene sets and our voom object. We can do this using the ids2indices
function.
c2.ind <- ids2indices(Mm.c2, rownames(v))
CAMERA takes as input the voom object v
, the indexed list of gene sets c2.ind
, the design matrix, the contrast being tested, as well as some other arguments. By default, CAMERA can estimate the correlation for each gene set separately. However, in practise, it works well to set a small inter-gene correlation of about 0.05 using the inter.gene.cor
argument.
gst.camera <- camera(v,index=c2.ind,design=design,contrast = cont.matrix[,1],inter.gene.cor=0.05)
CAMERA outputs a dataframe of the resulting statistics, with each row denoting a different gene set. The output is ordered by p-value so that the most significant should be at the top. Let’s look at the top 5 gene sets:
gst.camera[1:5,]
NGenes Direction
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 103 Up
KEGG_RIBOSOME 98 Up
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 137 Up
REACTOME_RRNA_PROCESSING 218 Up
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 112 Up
PValue
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 8.508347e-10
KEGG_RIBOSOME 1.067029e-09
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 2.226264e-08
REACTOME_RRNA_PROCESSING 2.345260e-08
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 3.997525e-08
FDR
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 2.946068e-06
KEGG_RIBOSOME 2.946068e-06
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 3.237631e-05
REACTOME_RRNA_PROCESSING 3.237631e-05
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 4.414867e-05
The total number of significant gene sets at 5% FDR is
table(gst.camera$FDR < 0.05)
FALSE TRUE
5472 50
You can write out the camera results to a csv file to open in excel.
write.csv(gst.camera,file="gst_BPregVsLac.csv")
Challenge
- Run
camera
on the second contrast in the contrast matrix.- Run
camera
on a different set of MSigDB gene sets, the hallmark datasets,mouse_H_v5.rdata
. You will need to load in the hallmark gene sets. Call the objectMm.H
.
Solution
FALSE TRUE
5472 50
NGenes
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 103
KEGG_RIBOSOME 98
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 137
REACTOME_RRNA_PROCESSING 218
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 112
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 130
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 132
REACTOME_SELENOAMINO_ACID_METABOLISM 128
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 67
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 60
Direction
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION Up
KEGG_RIBOSOME Up
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION Up
REACTOME_RRNA_PROCESSING Up
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY Up
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE Up
REACTOME_NONSENSE_MEDIATED_DECAY_NMD Up
REACTOME_SELENOAMINO_ACID_METABOLISM Up
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S Up
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL Up
PValue
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 8.508347e-10
KEGG_RIBOSOME 1.067029e-09
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 2.226264e-08
REACTOME_RRNA_PROCESSING 2.345260e-08
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 3.997525e-08
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 5.187911e-08
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 1.617747e-07
REACTOME_SELENOAMINO_ACID_METABOLISM 2.358693e-07
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 2.365774e-07
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 3.214483e-07
FDR
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 2.946068e-06
KEGG_RIBOSOME 2.946068e-06
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 3.237631e-05
REACTOME_RRNA_PROCESSING 3.237631e-05
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 4.414867e-05
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 4.774608e-05
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 1.276172e-04
REACTOME_SELENOAMINO_ACID_METABOLISM 1.451534e-04
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 1.451534e-04
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 1.775037e-04
ROAST is an example of a self-contained gene set test (Wu et al. 2010). It asks the question, “Do the genes in my set tend to be differentially expressed between my conditions of interest?” ROAST doesn’t care about what the other genes in the experiment are doing, which is different to camera
and goana
. ROAST is a good option for when you’re interested in a specific set, or a few sets. It is not really used to test thousands of sets at one time.
From the Hallmark gene sets, two MYC pathways were most significant.
H.camera[1:10,]
NGenes
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 103
KEGG_RIBOSOME 98
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 137
REACTOME_RRNA_PROCESSING 218
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 112
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 130
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 132
REACTOME_SELENOAMINO_ACID_METABOLISM 128
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 67
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 60
Direction
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION Up
KEGG_RIBOSOME Up
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION Up
REACTOME_RRNA_PROCESSING Up
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY Up
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE Up
REACTOME_NONSENSE_MEDIATED_DECAY_NMD Up
REACTOME_SELENOAMINO_ACID_METABOLISM Up
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S Up
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL Up
PValue
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 8.508347e-10
KEGG_RIBOSOME 1.067029e-09
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 2.226264e-08
REACTOME_RRNA_PROCESSING 2.345260e-08
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 3.997525e-08
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 5.187911e-08
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 1.617747e-07
REACTOME_SELENOAMINO_ACID_METABOLISM 2.358693e-07
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 2.365774e-07
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 3.214483e-07
FDR
REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 2.946068e-06
KEGG_RIBOSOME 2.946068e-06
REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 3.237631e-05
REACTOME_RRNA_PROCESSING 3.237631e-05
REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 4.414867e-05
REACTOME_SRP_DEPENDENT_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE 4.774608e-05
REACTOME_NONSENSE_MEDIATED_DECAY_NMD 1.276172e-04
REACTOME_SELENOAMINO_ACID_METABOLISM 1.451534e-04
REACTOME_ACTIVATION_OF_THE_MRNA_UPON_BINDING_OF_THE_CAP_BINDING_COMPLEX_AND_EIFS_AND_SUBSEQUENT_BINDING_TO_43S 1.451534e-04
REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 1.775037e-04
Let’s see if there are any MYC signalling pathways in MsigDB C2 collection. We can do this with the grep
command on the names of the gene sets.
grep("MYC_",names(c2.ind))
[1] 462 510 587 1327 1328 1518 1519 1753 1754 1755 1756 1896 1897 1902 1903
[16] 2129 2130 2200 2201 2294 2302 2312 2319 2362 2378 2390 2400 2410 2423 2462
[31] 2500 2968 3053 3451 3566 3567 3818 3819 3820 3821 4029 4030 4209
# Let's save these so that we can subset c2.ind to test all gene sets with MYC in the name
myc <- grep("MYC_",names(c2.ind))
# What are these pathways called?
names(c2.ind)[myc]
[1] "PID_MYC_ACTIV_PATHWAY"
[2] "PID_MYC_PATHWAY"
[3] "PID_MYC_REPRESS_PATHWAY"
[4] "ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP"
[5] "ODONNELL_TARGETS_OF_MYC_AND_TFRC_DN"
[6] "CAIRO_PML_TARGETS_BOUND_BY_MYC_UP"
[7] "CAIRO_PML_TARGETS_BOUND_BY_MYC_DN"
[8] "SCHLOSSER_MYC_AND_SERUM_RESPONSE_SYNERGY"
[9] "SCHLOSSER_MYC_TARGETS_REPRESSED_BY_SERUM"
[10] "SCHLOSSER_MYC_TARGETS_AND_SERUM_RESPONSE_DN"
[11] "SCHLOSSER_MYC_TARGETS_AND_SERUM_RESPONSE_UP"
[12] "CEBALLOS_TARGETS_OF_TP53_AND_MYC_UP"
[13] "CEBALLOS_TARGETS_OF_TP53_AND_MYC_DN"
[14] "KIM_MYC_AMPLIFICATION_TARGETS_UP"
[15] "KIM_MYC_AMPLIFICATION_TARGETS_DN"
[16] "BENPORATH_MYC_TARGETS_WITH_EBOX"
[17] "BENPORATH_MYC_MAX_TARGETS"
[18] "MORI_EMU_MYC_LYMPHOMA_BY_ONSET_TIME_UP"
[19] "MORI_EMU_MYC_LYMPHOMA_BY_ONSET_TIME_DN"
[20] "SCHUHMACHER_MYC_TARGETS_DN"
[21] "LEE_LIVER_CANCER_MYC_TGFA_DN"
[22] "LEE_LIVER_CANCER_MYC_UP"
[23] "LEE_LIVER_CANCER_MYC_TGFA_UP"
[24] "LEE_LIVER_CANCER_MYC_E2F1_DN"
[25] "LEE_LIVER_CANCER_MYC_E2F1_UP"
[26] "SCHUHMACHER_MYC_TARGETS_UP"
[27] "LEE_LIVER_CANCER_MYC_DN"
[28] "COLLER_MYC_TARGETS_UP"
[29] "COLLER_MYC_TARGETS_DN"
[30] "YU_MYC_TARGETS_UP"
[31] "YU_MYC_TARGETS_DN"
[32] "BILD_MYC_ONCOGENIC_SIGNATURE"
[33] "SANSOM_APC_MYC_TARGETS"
[34] "SOUCEK_MYC_TARGETS"
[35] "ELLWOOD_MYC_TARGETS_UP"
[36] "ELLWOOD_MYC_TARGETS_DN"
[37] "DANG_REGULATED_BY_MYC_UP"
[38] "DANG_REGULATED_BY_MYC_DN"
[39] "DANG_MYC_TARGETS_UP"
[40] "DANG_MYC_TARGETS_DN"
[41] "ACOSTA_PROLIFERATION_INDEPENDENT_MYC_TARGETS_UP"
[42] "ACOSTA_PROLIFERATION_INDEPENDENT_MYC_TARGETS_DN"
[43] "ALFANO_MYC_TARGETS"
Let’s use ROAST to see if these MYC related gene sets tend to be differentially expressed. Note that the syntax for camera
and roast
is almost identical.
myc.rst <- roast(v,index=c2.ind[myc],design=design,contrast=cont.matrix[,1],nrot=999)
myc.rst[1:15,]
NGenes PropDown PropUp
COLLER_MYC_TARGETS_UP 26 0.07692308 0.7692308
DANG_MYC_TARGETS_UP 159 0.11949686 0.7106918
PID_MYC_ACTIV_PATHWAY 101 0.15841584 0.6237624
SCHLOSSER_MYC_TARGETS_AND_SERUM_RESPONSE_UP 64 0.14062500 0.5781250
SCHLOSSER_MYC_AND_SERUM_RESPONSE_SYNERGY 46 0.19565217 0.4782609
BILD_MYC_ONCOGENIC_SIGNATURE 286 0.27622378 0.4650350
CAIRO_PML_TARGETS_BOUND_BY_MYC_DN 24 0.45833333 0.2083333
BENPORATH_MYC_TARGETS_WITH_EBOX 291 0.23024055 0.4295533
BENPORATH_MYC_MAX_TARGETS 1012 0.25395257 0.3745059
ELLWOOD_MYC_TARGETS_UP 19 0.10526316 0.3157895
CAIRO_PML_TARGETS_BOUND_BY_MYC_UP 28 0.07142857 0.6785714
DANG_REGULATED_BY_MYC_UP 89 0.12359551 0.6741573
KIM_MYC_AMPLIFICATION_TARGETS_UP 227 0.16740088 0.4977974
ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP 137 0.43065693 0.1970803
ACOSTA_PROLIFERATION_INDEPENDENT_MYC_TARGETS_UP 101 0.18811881 0.4059406
Direction PValue FDR
COLLER_MYC_TARGETS_UP Up 0.001 0.002150000
DANG_MYC_TARGETS_UP Up 0.001 0.002150000
PID_MYC_ACTIV_PATHWAY Up 0.001 0.002150000
SCHLOSSER_MYC_TARGETS_AND_SERUM_RESPONSE_UP Up 0.001 0.002150000
SCHLOSSER_MYC_AND_SERUM_RESPONSE_SYNERGY Up 0.001 0.002150000
BILD_MYC_ONCOGENIC_SIGNATURE Up 0.001 0.002150000
CAIRO_PML_TARGETS_BOUND_BY_MYC_DN Down 0.001 0.002150000
BENPORATH_MYC_TARGETS_WITH_EBOX Up 0.001 0.002150000
BENPORATH_MYC_MAX_TARGETS Up 0.001 0.002150000
ELLWOOD_MYC_TARGETS_UP Up 0.001 0.002150000
CAIRO_PML_TARGETS_BOUND_BY_MYC_UP Up 0.002 0.003794118
DANG_REGULATED_BY_MYC_UP Up 0.002 0.003794118
KIM_MYC_AMPLIFICATION_TARGETS_UP Up 0.002 0.003794118
ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP Down 0.002 0.003794118
ACOSTA_PROLIFERATION_INDEPENDENT_MYC_TARGETS_UP Up 0.002 0.003794118
PValue.Mixed FDR.Mixed
COLLER_MYC_TARGETS_UP 0.001 0.001
DANG_MYC_TARGETS_UP 0.001 0.001
PID_MYC_ACTIV_PATHWAY 0.001 0.001
SCHLOSSER_MYC_TARGETS_AND_SERUM_RESPONSE_UP 0.001 0.001
SCHLOSSER_MYC_AND_SERUM_RESPONSE_SYNERGY 0.001 0.001
BILD_MYC_ONCOGENIC_SIGNATURE 0.001 0.001
CAIRO_PML_TARGETS_BOUND_BY_MYC_DN 0.001 0.001
BENPORATH_MYC_TARGETS_WITH_EBOX 0.001 0.001
BENPORATH_MYC_MAX_TARGETS 0.001 0.001
ELLWOOD_MYC_TARGETS_UP 0.001 0.001
CAIRO_PML_TARGETS_BOUND_BY_MYC_UP 0.001 0.001
DANG_REGULATED_BY_MYC_UP 0.002 0.002
KIM_MYC_AMPLIFICATION_TARGETS_UP 0.001 0.001
ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP 0.001 0.001
ACOSTA_PROLIFERATION_INDEPENDENT_MYC_TARGETS_UP 0.002 0.002
Each row corresponds to a single gene set. The NGenes column gives the number of genes in each set. The PropDown and PropUp columns contain the proportions of genes in the set that are down- and up-regulated, respectively, with absolute fold changes greater than 2. The net direction of change is determined from the significance of changes in each direction, and is shown in the Direction column. The PValue provides evidence for whether the majority of genes in the set are DE in the specified direction, whereas the PValue.Mixed tests for differential expression in any direction. FDRs are computed from the corresponding p-values across all sets.
Challenge
- Test whether the MYC signalling pathways tend to be differentially expressed between luminal pregnant vs lactating (the second contrast).
- Look for gene sets containing “WNT” in the name and see whether they tend to be differentially expressed in basal pregnant vs lactating.
Solution
NGenes PropDown PropUp Direction PValue
ODONNELL_TARGETS_OF_MYC_AND_TFRC_DN 54 0.6481481 0.1481481 Down 0.001
PID_MYC_PATHWAY 35 0.2000000 0.5428571 Up 0.001
ELLWOOD_MYC_TARGETS_DN 61 0.5409836 0.2622951 Down 0.001
BENPORATH_MYC_TARGETS_WITH_EBOX 291 0.3024055 0.4329897 Up 0.001
BILD_MYC_ONCOGENIC_SIGNATURE 286 0.2902098 0.4195804 Up 0.001
PID_MYC_REPRESS_PATHWAY 93 0.2150538 0.3978495 Up 0.001
ELLWOOD_MYC_TARGETS_UP 19 0.2105263 0.3684211 Up 0.001
DANG_MYC_TARGETS_DN 43 0.2325581 0.5813953 Up 0.003
ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP 137 0.3576642 0.2262774 Down 0.003
ALFANO_MYC_TARGETS 365 0.2794521 0.4027397 Up 0.005
LEE_LIVER_CANCER_MYC_TGFA_DN 63 0.3650794 0.3492063 Down 0.005
YU_MYC_TARGETS_UP 47 0.7021277 0.1063830 Down 0.009
DANG_REGULATED_BY_MYC_DN 343 0.2565598 0.4169096 Up 0.010
SCHUHMACHER_MYC_TARGETS_UP 99 0.4646465 0.3131313 Down 0.011
PID_MYC_ACTIV_PATHWAY 101 0.2970297 0.4158416 Up 0.014
FDR PValue.Mixed FDR.Mixed
ODONNELL_TARGETS_OF_MYC_AND_TFRC_DN 0.003071429 0.001 0.001
PID_MYC_PATHWAY 0.003071429 0.001 0.001
ELLWOOD_MYC_TARGETS_DN 0.003071429 0.001 0.001
BENPORATH_MYC_TARGETS_WITH_EBOX 0.003071429 0.001 0.001
BILD_MYC_ONCOGENIC_SIGNATURE 0.003071429 0.001 0.001
PID_MYC_REPRESS_PATHWAY 0.003071429 0.001 0.001
ELLWOOD_MYC_TARGETS_UP 0.003071429 0.001 0.001
DANG_MYC_TARGETS_DN 0.011944444 0.001 0.001
ODONNELL_TARGETS_OF_MYC_AND_TFRC_UP 0.011944444 0.001 0.001
ALFANO_MYC_TARGETS 0.017590909 0.001 0.001
LEE_LIVER_CANCER_MYC_TGFA_DN 0.017590909 0.001 0.001
YU_MYC_TARGETS_UP 0.030458333 0.003 0.003
DANG_REGULATED_BY_MYC_DN 0.031423077 0.001 0.001
SCHUHMACHER_MYC_TARGETS_UP 0.032250000 0.001 0.001
PID_MYC_ACTIV_PATHWAY 0.038700000 0.001 0.001
NGenes
PID_WNT_CANONICAL_PATHWAY 29
PID_WNT_NONCANONICAL_PATHWAY 47
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 16
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 6
SANSOM_WNT_PATHWAY_REQUIRE_MYC 91
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 11
BIOCARTA_WNT_PATHWAY 41
KEGG_WNT_SIGNALING_PATHWAY 151
WILLERT_WNT_SIGNALING 38
PID_WNT_SIGNALING_PATHWAY 29
REACTOME_SIGNALING_BY_WNT_IN_CANCER 39
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 9
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 18
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 227
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 18
PropDown
PID_WNT_CANONICAL_PATHWAY 0.4482759
PID_WNT_NONCANONICAL_PATHWAY 0.4255319
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.2500000
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.1666667
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.2087912
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.4545455
BIOCARTA_WNT_PATHWAY 0.2926829
KEGG_WNT_SIGNALING_PATHWAY 0.3642384
WILLERT_WNT_SIGNALING 0.2631579
PID_WNT_SIGNALING_PATHWAY 0.2758621
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.4102564
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.2222222
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.3888889
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.3083700
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.3888889
PropUp
PID_WNT_CANONICAL_PATHWAY 0.13793103
PID_WNT_NONCANONICAL_PATHWAY 0.14893617
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.62500000
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.50000000
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.38461538
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.09090909
BIOCARTA_WNT_PATHWAY 0.36585366
KEGG_WNT_SIGNALING_PATHWAY 0.24503311
WILLERT_WNT_SIGNALING 0.47368421
PID_WNT_SIGNALING_PATHWAY 0.31034483
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.25641026
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.33333333
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.22222222
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.33039648
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.33333333
Direction
PID_WNT_CANONICAL_PATHWAY Down
PID_WNT_NONCANONICAL_PATHWAY Down
REACTOME_REPRESSION_OF_WNT_TARGET_GENES Up
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING Up
SANSOM_WNT_PATHWAY_REQUIRE_MYC Up
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION Down
BIOCARTA_WNT_PATHWAY Up
KEGG_WNT_SIGNALING_PATHWAY Down
WILLERT_WNT_SIGNALING Up
PID_WNT_SIGNALING_PATHWAY Up
REACTOME_SIGNALING_BY_WNT_IN_CANCER Down
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL Up
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING Down
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT Up
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS Up
PValue
PID_WNT_CANONICAL_PATHWAY 0.001
PID_WNT_NONCANONICAL_PATHWAY 0.001
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.002
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.002
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.002
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.004
BIOCARTA_WNT_PATHWAY 0.015
KEGG_WNT_SIGNALING_PATHWAY 0.026
WILLERT_WNT_SIGNALING 0.035
PID_WNT_SIGNALING_PATHWAY 0.054
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.065
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.115
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.224
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.258
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.307
FDR
PID_WNT_CANONICAL_PATHWAY 0.00550000
PID_WNT_NONCANONICAL_PATHWAY 0.00550000
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.00660000
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.00660000
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.00660000
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.01283333
BIOCARTA_WNT_PATHWAY 0.04557143
KEGG_WNT_SIGNALING_PATHWAY 0.07012500
WILLERT_WNT_SIGNALING 0.08433333
PID_WNT_SIGNALING_PATHWAY 0.11770000
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.12900000
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.20991667
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.37823077
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.40464286
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.44953333
PValue.Mixed
PID_WNT_CANONICAL_PATHWAY 0.001
PID_WNT_NONCANONICAL_PATHWAY 0.001
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.001
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.001
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.001
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.005
BIOCARTA_WNT_PATHWAY 0.001
KEGG_WNT_SIGNALING_PATHWAY 0.001
WILLERT_WNT_SIGNALING 0.001
PID_WNT_SIGNALING_PATHWAY 0.001
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.001
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.004
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.001
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.001
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.001
FDR.Mixed
PID_WNT_CANONICAL_PATHWAY 0.001
PID_WNT_NONCANONICAL_PATHWAY 0.001
REACTOME_REPRESSION_OF_WNT_TARGET_GENES 0.001
REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_WNT_SIGNALING 0.001
SANSOM_WNT_PATHWAY_REQUIRE_MYC 0.001
REACTOME_RNF_MUTANTS_SHOW_ENHANCED_WNT_SIGNALING_AND_PROLIFERATION 0.005
BIOCARTA_WNT_PATHWAY 0.001
KEGG_WNT_SIGNALING_PATHWAY 0.001
WILLERT_WNT_SIGNALING 0.001
PID_WNT_SIGNALING_PATHWAY 0.001
REACTOME_SIGNALING_BY_WNT_IN_CANCER 0.001
REACTOME_WNT_MEDIATED_ACTIVATION_OF_DVL 0.004
REACTOME_WNT_LIGAND_BIOGENESIS_AND_TRAFFICKING 0.001
REACTOME_TCF_DEPENDENT_SIGNALING_IN_RESPONSE_TO_WNT 0.001
REACTOME_NEGATIVE_REGULATION_OF_TCF_DEPENDENT_SIGNALING_BY_WNT_LIGAND_ANTAGONISTS 0.001
Notes
A barcode plot can be produced with the barcodeplot
function to visualize the results for any particular set. To display a barcodeplot, we need to decide what statistics to use, usually we choose either logFC or t-statistics for the comparison of interest. We also need to make sure we give the statistics in the correct order, such that subsetting the statistics vector will give the genes for the gene set we’re interested in. The best way to do this is to work with the fit object directly. The coefficients
slot contains the logFCs and the t
slot contains the t-statistics.
Let’s have a look at one of the top MYC related pathway in the ROAST test that you have already done, “DANG_REGULATED_BY_MYC_UP.”
# Have a look at the logFCs and t-statistics in fit.cont
names(fit.cont)
[1] "coefficients" "stdev.unscaled" "sigma" "df.residual"
[5] "cov.coefficients" "pivot" "rank" "genes"
[9] "Amean" "method" "design" "contrasts"
[13] "df.prior" "s2.prior" "var.prior" "proportion"
[17] "s2.post" "t" "df.total" "p.value"
[21] "lods" "F" "F.p.value"
head(fit.cont$coefficients)
Contrasts
B.PregVsLac L.PregVsLac
497097 -1.62084708 -0.6964826
20671 0.41298109 0.3652171
27395 -0.03540567 -0.3849389
18777 -0.08505647 -0.4731888
21399 0.21912852 0.2839044
58175 1.49852621 1.4736067
head(fit.cont$t)
Contrasts
B.PregVsLac L.PregVsLac
497097 -4.0541969 -0.3076765
20671 1.0030050 0.4055901
27395 -0.1936036 -2.2395329
18777 -0.7975228 -4.0746018
21399 2.1866392 2.5284968
58175 1.5788218 3.1387631
par(mfrow=c(1,1))
# barcode plot with logFCs
barcodeplot(fit.cont$coeff[,1], index=c2.ind[["DANG_REGULATED_BY_MYC_UP"]], main="LogFC: DANG_REGULATED_BY_MYC_UP")
# barcode plot using t-statistics
barcodeplot(fit.cont$t[,1], index=c2.ind[["DANG_REGULATED_BY_MYC_UP"]], main="T-statistic: DANG_REGULATED_BY_MYC_UP")
Here, genes are represented by bars and are ranked from left to right by decreasing log-fold change or t-statistic. This forms the barcode-like pattern. The line above the barcode shows the relative local enrichment of the vertical bars in each part of the plot. The barcodeplot shows that the genes in this gene set tend to be up-regulated between pregnant and lactating in basal cells.
Challenge
- Produce a barcodeplot for luminal pregnant vs lactating for this gene set. Does the pattern of enrichment look as strong?
- Choose one of the Wnt signalling pathways and produce a barcode plot for both comparisons.
- You can put two gene sets on one plot, for example a set that is up-regulated and one that is down-regulated, by adding a gene set to the
index2
argument. Produce a barcodeplot with two sets of your choosing.
Solution
The session information describes the versions of R and of the packages that were used in the analysis. This is useful for record-keeping purposes, and ensures that an analysis can be reproduced even when the software is updated over time.
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] NMF_0.23.0 cluster_2.1.0 rngtools_1.5
[4] pkgmaker_0.32.2 registry_0.5-1 RColorBrewer_1.1-2
[7] gplots_3.1.1 org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0
[10] IRanges_2.24.0 S4Vectors_0.28.0 Biobase_2.50.0
[13] BiocGenerics_0.36.0 Glimma_2.0.0 edgeR_3.32.0
[16] limma_3.46.0 knitr_1.30
loaded via a namespace (and not attached):
[1] bitops_1.0-6 matrixStats_0.57.0
[3] bit64_4.0.5 doParallel_1.0.16
[5] httr_1.4.2 GenomeInfoDb_1.26.1
[7] tools_4.0.3 R6_2.5.0
[9] KernSmooth_2.23-18 DBI_1.1.0
[11] colorspace_2.0-0 withr_2.3.0
[13] tidyselect_1.1.0 DESeq2_1.30.0
[15] bit_4.0.4 compiler_4.0.3
[17] BiasedUrn_1.07 DelayedArray_0.16.0
[19] caTools_1.18.0 scales_1.1.1
[21] genefilter_1.72.0 stringr_1.4.0
[23] digest_0.6.27 rmarkdown_2.5
[25] XVector_0.30.0 pkgconfig_2.0.3
[27] htmltools_0.5.0 MatrixGenerics_1.2.0
[29] htmlwidgets_1.5.2 rlang_0.4.9
[31] RSQLite_2.2.1 generics_0.1.0
[33] jsonlite_1.7.1 BiocParallel_1.24.1
[35] gtools_3.8.2 dplyr_1.0.2
[37] RCurl_1.98-1.2 magrittr_2.0.1
[39] GO.db_3.12.1 GenomeInfoDbData_1.2.4
[41] Matrix_1.2-18 Rcpp_1.0.5
[43] munsell_0.5.0 lifecycle_0.2.0
[45] stringi_1.5.3 yaml_2.2.1
[47] SummarizedExperiment_1.20.0 zlibbioc_1.36.0
[49] plyr_1.8.6 grid_4.0.3
[51] blob_1.2.1 crayon_1.3.4
[53] lattice_0.20-41 splines_4.0.3
[55] annotate_1.68.0 locfit_1.5-9.4
[57] pillar_1.4.7 GenomicRanges_1.42.0
[59] geneplotter_1.68.0 reshape2_1.4.4
[61] codetools_0.2-18 XML_3.99-0.5
[63] glue_1.4.2 evaluate_0.14
[65] vctrs_0.3.5 foreach_1.5.1
[67] gtable_0.3.0 purrr_0.3.4
[69] assertthat_0.2.1 ggplot2_3.3.2
[71] xfun_0.19 gridBase_0.4-7
[73] xtable_1.8-4 survival_3.2-7
[75] tibble_3.0.4 iterators_1.0.13
[77] memoise_1.1.0 ellipsis_0.3.1