Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law

Resources and data files

This material has been created using the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf (Lun, Chen, and Smyth 2016)
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html

Data files downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60450/suppl/GSE60450_Lactation-GenewiseCounts.txt.gz
http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5.rdata
http://bioinf.wehi.edu.au/software/MSigDB/mouse_H_v5.rdata

Data files are available from: https://figshare.com/s/1d788fd384d33e913a2a You should download these files and place them in your /data directory.

Data files:

Packages used:

Overview

Introduction and data import

Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.

There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today we will be starting from the count data and getting stuck into analysis.

First, let’s load all the packages we will need to analyse the data.

library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
library(RColorBrewer)

Mouse mammary gland dataset

The data for this tutorial comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (Fu et al. 2015). Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450.

This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates. We will first use the counts file as a starting point for our analysis. This data has already been aligned to the mouse genome. The command line tool featureCounts (Liao, Smyth, and Shi 2014) was used to count reads mapped to mouse genes from Refseq annotation (see the paper for details).

Reading in the data

Set up an RStudio project specifying the directory where you have saved the /data directory. Download and read in the data.

# Read the data into R
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
# Read the sample information into R
sampleinfo <- read.delim("data/SampleInfo.txt")

Let’s take a look at the data. You can use the head command to see the first 6 lines. The dim command will tell you how many rows and columns the data frame has.

head(seqdata)
  EntrezGeneID Length MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1
1       497097   3634                               438
2    100503874   3259                                 1
3    100038431   1634                                 0
4        19888   9747                                 1
5        20671   3130                               106
6        27395   4203                               309
  MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1
1                               300                                65
2                                 0                                 1
3                                 0                                 0
4                                 1                                 0
5                               182                                82
6                               234                               337
  MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1
1                               237                               354
2                                 1                                 0
3                                 0                                 0
4                                 0                                 0
5                               105                                43
6                               300                               290
  MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1
1                               287                                 0
2                                 4                                 0
3                                 0                                 0
4                                 0                                10
5                                82                                16
6                               270                               560
  MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1
1                                 0                                 0
2                                 0                                 0
3                                 0                                 0
4                                 3                                10
5                                25                                18
6                               464                               489
  MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1
1                                 0                                 0
2                                 0                                 0
3                                 0                                 0
4                                 2                                 0
5                                 8                                 3
6                               328                               307
  MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
1                                 0
2                                 0
3                                 0
4                                 0
5                                10
6                               342
dim(seqdata)
[1] 27179    14

The seqdata object contains information about genes (one gene per row), the first column has the Entrez gene id, the second has the gene length and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample. There are two replicates for each cell type and time point (detailed sample info can be found in file “GSE60450_series_matrix.txt” from the GEO website). The sampleinfo file contains basic information about the samples that we will need for the analysis today.

sampleinfo
                            FileName SampleName CellType   Status
1  MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1    MCL1.DG  luminal   virgin
2  MCL1.DH_BC2CTUACXX_CAGATC_L002_R1    MCL1.DH    basal   virgin
3  MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1    MCL1.DI    basal pregnant
4  MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1    MCL1.DJ    basal pregnant
5  MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1    MCL1.DK    basal  lactate
6  MCL1.DL_BC2CTUACXX_ATCACG_L002_R1    MCL1.DL    basal  lactate
7  MCL1.LA_BC2CTUACXX_GATCAG_L001_R1    MCL1.LA    basal   virgin
8  MCL1.LB_BC2CTUACXX_TGACCA_L001_R1    MCL1.LB  luminal   virgin
9  MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1    MCL1.LC  luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1    MCL1.LD  luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1    MCL1.LE  luminal  lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1    MCL1.LF  luminal  lactate

We will be manipulating and reformatting the counts matrix into a suitable format for downstream analysis. The first two columns in the seqdata dataframe contain annotation information. We need to make a new matrix containing only the counts, but we can store the gene identifiers (the EntrezGeneID column) as rownames. We will add more annotation information about each gene later on in the workshop.

Format the data

Let’s create a new data object, countdata, that contains only the counts for the 12 samples.

# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]
# Look at the output
head(countdata)
  MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1
1                               438                               300
2                                 1                                 0
3                                 0                                 0
4                                 1                                 1
5                               106                               182
6                               309                               234
  MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1
1                                65                               237
2                                 1                                 1
3                                 0                                 0
4                                 0                                 0
5                                82                               105
6                               337                               300
  MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1
1                               354                               287
2                                 0                                 4
3                                 0                                 0
4                                 0                                 0
5                                43                                82
6                               290                               270
  MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1
1                                 0                                 0
2                                 0                                 0
3                                 0                                 0
4                                10                                 3
5                                16                                25
6                               560                               464
  MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1
1                                 0                                 0
2                                 0                                 0
3                                 0                                 0
4                                10                                 2
5                                18                                 8
6                               489                               328
  MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
1                                 0                                 0
2                                 0                                 0
3                                 0                                 0
4                                 0                                 0
5                                 3                                10
6                               307                               342
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]

Take a look at the output

head(countdata)
          MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1
497097                                  438
100503874                                 1
100038431                                 0
19888                                     1
20671                                   106
27395                                   309
          MCL1.DH_BC2CTUACXX_CAGATC_L002_R1
497097                                  300
100503874                                 0
100038431                                 0
19888                                     1
20671                                   182
27395                                   234
          MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1
497097                                   65
100503874                                 1
100038431                                 0
19888                                     0
20671                                    82
27395                                   337
          MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1
497097                                  237
100503874                                 1
100038431                                 0
19888                                     0
20671                                   105
27395                                   300
          MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1
497097                                  354
100503874                                 0
100038431                                 0
19888                                     0
20671                                    43
27395                                   290
          MCL1.DL_BC2CTUACXX_ATCACG_L002_R1
497097                                  287
100503874                                 4
100038431                                 0
19888                                     0
20671                                    82
27395                                   270
          MCL1.LA_BC2CTUACXX_GATCAG_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                    10
20671                                    16
27395                                   560
          MCL1.LB_BC2CTUACXX_TGACCA_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                     3
20671                                    25
27395                                   464
          MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                    10
20671                                    18
27395                                   489
          MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                     2
20671                                     8
27395                                   328
          MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                     0
20671                                     3
27395                                   307
          MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1
497097                                    0
100503874                                 0
100038431                                 0
19888                                     0
20671                                    10
27395                                   342

Now take a look at the column names

colnames(countdata)
 [1] "MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1"
 [2] "MCL1.DH_BC2CTUACXX_CAGATC_L002_R1"
 [3] "MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1"
 [4] "MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1"
 [5] "MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1"
 [6] "MCL1.DL_BC2CTUACXX_ATCACG_L002_R1"
 [7] "MCL1.LA_BC2CTUACXX_GATCAG_L001_R1"
 [8] "MCL1.LB_BC2CTUACXX_TGACCA_L001_R1"
 [9] "MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1"
[10] "MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1"
[11] "MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1"
[12] "MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1"

These are the sample names which are pretty long so we’ll shorten these to contain only the relevant information about each sample. We will use the substr command to extract the first 7 characters and use these as the colnames.

# using substr, you extract the characters starting at position 1 and stopping at position 7 of the colnames
colnames(countdata) <- substr(colnames(countdata),start=1,stop=7)

Take a look at the output

head(countdata)
          MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097        438     300      65     237     354     287       0       0
100503874       1       0       1       1       0       4       0       0
100038431       0       0       0       0       0       0       0       0
19888           1       1       0       0       0       0      10       3
20671         106     182      82     105      43      82      16      25
27395         309     234     337     300     290     270     560     464
          MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097          0       0       0       0
100503874       0       0       0       0
100038431       0       0       0       0
19888          10       2       0       0
20671          18       8       3      10
27395         489     328     307     342

Note that the column names are now the same as SampleName in the sampleinfo file. This is good because it means our sample information in sampleinfo is in the same order as the columns in countdata.

table(colnames(countdata)==sampleinfo$SampleName)

TRUE 
  12 

Filtering to remove lowly expressed genes

Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in this case we have a sample size of 2 in each group, we favour filtering on a minimum counts per million threshold present in at least 2 samples. Two represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least two samples.

We’ll use the cpm function from the edgeR library (M D Robinson, McCarthy, and Smyth 2010) to generate the CPM values and then filter. Note that by converting to CPMs we are normalising for the different sequencing depths for each sample.

# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)
              MCL1.DG     MCL1.DH     MCL1.DI     MCL1.DJ   MCL1.DK
497097    18.85684388 13.77543859  2.69700983 10.45648006 16.442685
100503874  0.04305215  0.00000000  0.04149246  0.04412017  0.000000
100038431  0.00000000  0.00000000  0.00000000  0.00000000  0.000000
19888      0.04305215  0.04591813  0.00000000  0.00000000  0.000000
20671      4.56352843  8.35709941  3.40238163  4.63261775  1.997275
27395     13.30311589 10.74484210 13.98295863 13.23605071 13.469996
             MCL1.DL    MCL1.LA    MCL1.LB    MCL1.LC     MCL1.LD
497097    14.3389690  0.0000000  0.0000000  0.0000000  0.00000000
100503874  0.1998463  0.0000000  0.0000000  0.0000000  0.00000000
100038431  0.0000000  0.0000000  0.0000000  0.0000000  0.00000000
19888      0.0000000  0.4903857  0.1381969  0.4496078  0.09095771
20671      4.0968483  0.7846171  1.1516411  0.8092940  0.36383085
27395     13.4896224 27.4615975 21.3744588 21.9858214 14.91706476
             MCL1.LE    MCL1.LF
497097     0.0000000  0.0000000
100503874  0.0000000  0.0000000
100038431  0.0000000  0.0000000
19888      0.0000000  0.0000000
20671      0.1213404  0.4055595
27395     12.4171715 13.8701357
# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)
          MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097       TRUE    TRUE    TRUE    TRUE    TRUE    TRUE   FALSE   FALSE
100503874   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
100038431   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
19888       FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
20671        TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
27395        TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
          MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097      FALSE   FALSE   FALSE   FALSE
100503874   FALSE   FALSE   FALSE   FALSE
100038431   FALSE   FALSE   FALSE   FALSE
19888       FALSE   FALSE   FALSE   FALSE
20671        TRUE   FALSE   FALSE   FALSE
27395        TRUE    TRUE    TRUE    TRUE
# Summary of how many TRUEs there are in each row
# There are 11433 genes that have TRUEs in all 12 samples.
table(rowSums(thresh))

    0     1     2     3     4     5     6     7     8     9    10    11 
10857   518   544   307   346   307   652   323   547   343   579   423 
   12 
11433 
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
summary(keep)
   Mode   FALSE    TRUE 
logical   11375   15804 
dim(counts.keep)
[1] 15804    12

A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample. A requirement for expression in two or more libraries is used as each group contains two replicates. This ensures that a gene will be retained if it is only expressed in one group. Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5. You should filter with CPMs rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.

# Let's have a look and see whether our threshold of 0.5 does indeed correspond to a count of about 10-15
# We will look at the first sample
plot(myCPM[,1],countdata[,1])

# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)

Challenge

  1. Plot the counts-per-million versus counts for the second sample.
  2. Add a vertical line at 0.5 and a horizontal line at 10.
  3. Add the lines again, colouring them blue

HINT: use the col parameter.

Solution Note: When in doubt, a threshold of 1 CPM in at least minimum group sample size is a good rule of thumb.

Convert counts to DGEList object

Next we’ll create a DGEList object. This is an object used by edgeR to store count data. It has a number of slots for storing various parameters about the data.

y <- DGEList(counts.keep)
# have a look at y
y
An object of class "DGEList"
$counts
       MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA MCL1.LB
497097     438     300      65     237     354     287       0       0
20671      106     182      82     105      43      82      16      25
27395      309     234     337     300     290     270     560     464
18777      652     515     948     935     928     791     826     862
21399     1604    1495    1721    1317    1159    1066    1334    1258
       MCL1.LC MCL1.LD MCL1.LE MCL1.LF
497097       0       0       0       0
20671       18       8       3      10
27395      489     328     307     342
18777      668     646     544     581
21399     1068     926     508     500
15799 more rows ...

$samples
        group lib.size norm.factors
MCL1.DG     1 23218026            1
MCL1.DH     1 21768136            1
MCL1.DI     1 24091588            1
MCL1.DJ     1 22656713            1
MCL1.DK     1 21522033            1
7 more rows ...
# See what slots are stored in y
names(y)
[1] "counts"  "samples"
# Library size information is stored in the samples slot
y$samples
        group lib.size norm.factors
MCL1.DG     1 23218026            1
MCL1.DH     1 21768136            1
MCL1.DI     1 24091588            1
MCL1.DJ     1 22656713            1
MCL1.DK     1 21522033            1
MCL1.DL     1 20008326            1
MCL1.LA     1 20384562            1
MCL1.LB     1 21698793            1
MCL1.LC     1 22235847            1
MCL1.LD     1 21982745            1
MCL1.LE     1 24719697            1
MCL1.LF     1 24652963            1

Quality control

Now that we have got rid of the lowly expressed genes and have our counts stored in a DGEList object, we can look at a few different plots to check that the data is good quality, and that the samples are as we would expect.

Library sizes and distribution plots

First, we can check how many reads we have for each sample in the y.

y$samples$lib.size
 [1] 23218026 21768136 24091588 22656713 21522033 20008326 20384562
 [8] 21698793 22235847 21982745 24719697 24652963

We can also plot the library sizes as a barplot to see whether there are any major discrepancies between the samples more easily.

# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(y$samples$lib.size,names=colnames(y),las=2)
# Add a title to the plot
title("Barplot of library sizes")

Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. Next we’ll use box plots to check the distribution of the read counts on the log2 scale. We can use the cpm function to get log2 counts per million, which are corrected for the different library sizes. The cpm function also adds a small offset to avoid taking log of zero.

# Get log2 counts per million
logcounts <- cpm(y,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
title("Boxplots of logCPMs (unnormalised)")