RNA-seq analysis in R

Introduction to RStudio

We’ll be using RStudio: a free, open source R integrated development environment. It provides a built in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.

Basic layout

When you first open RStudio, you will be greeted by three panels:

The interactive R console (entire left)
Environment/History (tabbed in upper right)
Files/Plots/Packages/Help/Viewer (tabbed in lower right)

Once you open files, such as R scripts, an editor panel will also open in the top left.

Work flow within RStudio

There are two main ways one can work within RStudio.

Test and play within the interactive R console then copy code into a .R file to run later.
- This works well when doing small tests and initially starting off.
- It quickly becomes laborious
Start writing in an .R file and use RStudio’s command / short cut to push current line, selected lines or modified lines to the interactive R console.
- This is a great way to start; all your code is saved for later
- You will be able to run the file you create from within RStudio or using R’s source() function.

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can 1. click on the Run button just above the editor panel, or 2. select “Run Lines” from the “Code” menu, or 3. hit Ctrl-Enter in Windows or Linux or Command-Enter on OS X. (This shortcut can also be seen by hovering the mouse over the button). To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run, you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Your RStudio environment

Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you just typed in R in your commandline environment.

R Packages

It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 7,000 packages available on CRAN (the comprehensive R archive network). R and RStudio have functionality for managing packages:

You can see what packages are installed by typing installed.packages()
You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
You can update installed packages by typing update.packages()
You can remove a package with remove.packages("packagename")
You can make a package available for use with library(packagename)

For this workshop we will also be using packages from These can all be obtained from Bioconductor.

Open RStudio and run the following commands to install packages from Bioconductor. These are installed slightly differently. For example, to install the package limma:

source("http://bioconductor.org/biocLite.R")
biocLite("limma")

RStudio project management

The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.

Managing your projects in a reproducible fashion doesn’t just make your science reproducible, it makes your life easier.
— Vince Buffalo April 15, 2013

Most people tend to organize their projects like this:

There are many reasons why we should ALWAYS avoid this:

It is really hard to tell which version of your data is the original and which is the modified;
It gets really messy because it mixes files with various extensions together;
It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate it;

A good project layout will ultimately make your life easier:

It will help ensure the integrity of your data;
It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
It allows you to easily upload your code with your manuscript submission;
It makes it easier to pick the project back up after a break.

A possible solution

Fortunately, there are tools and packages which can help you manage your work effectively.

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.

Challenge: Creating a self-contained project

We’re going to create a new project in RStudio:

Click the “File” menu button, then “New Project”.

Click “New Directory”.

Click “Empty Project”.

Type in the name of the directory to store your project, e.g. “my_project”, or better yet use the name of this workshop.

Click the “Create Project” button.

Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.

Best practices for project organisation

Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. I find it useful to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.

There are lots of different was to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

Separate function definition and application

The most effective way I find to work in R, is to play around in the interactive session, then copy commands across to a script file when I’m sure they work and do what I want. You can also save all the commands you’ve entered using the history command, but I don’t find it useful because when I’m typing its 90% trial and error.

When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.

Tip: avoiding duplication

You may find yourself using data or analysis scripts across several projects. Typically you want to avoid duplication to save space and avoid having to make updates to code in multiple places.

In this case I find it useful to make “symbolic links”, which are essentially shortcuts to files somewhere else on a filesystem. On Linux and OS X you can use the ln -s command, and on windows you can either create a shortcut or use the mklink command from the windows terminal.

Save the data in the data directory

Now we have a good directory structure we will now place/save the data file in the data/ directory.

Download the RNAseq data for this workshop.

Mouse mammary data (counts): https://figshare.com/s/1d788fd384d33e913a2a
Drosophila data (counts): https://figshare.com/s/e08e71c42f118dbe8be6

Challenge

Create a /data directory. In the bottom right panel select the “Files” tab, then “New Folder”, then type “data” and click “OK”.

Download the RNAseq data using the links above (if you find the internet is slow, you can just download Day 1 for now).

Click “Download all” (this will download a zip file).

Unzip the file (usually double clicking on it will do the trick).

Move all the files inside into the data/ folder within your project.

Reading data into R

Our data is tab delimited, which is what read.table expect by default so we don’t need to specify delim. However, we do need header = TRUE.

# Read the data into R
small_counts <- read.table("data/small_counts.txt", header = TRUE)

print(small_counts)

        Sample_1 Sample_2 Sample_3 Sample_4
Xkr4         438      300       65      237
Sox17        106      182       82      105
Mrpl15       309      234      337      300
Lypla1       652      515      948      935
Tcea1       1604     1495     1721     1317
Rgs20          4        2       14        4
Atp6v1h      769      752     1062      987
Rb1cc1      1494     1412     1157      967
Pcmtd1      1344     1242     1374     1593
Rrs1        1691     1808     2127     1653

dim(small_counts)

[1] 10  4

Working with Data Frames

Subsetting

Let’s say we want the all the counts for Sample_1. There are three main ways we could do this:

$ notation with the column name

small_counts$Sample_1

 [1]  438  106  309  652 1604    4  769 1494 1344 1691

[row, column] notation with numeric indices

small_counts[, 1]

 [1]  438  106  309  652 1604    4  769 1494 1344 1691

[row, column] notation using the column name (in a vector)

small_counts[, c("Sample_1")]

 [1]  438  106  309  652 1604    4  769 1494 1344 1691

This is handy when you want to grab multiple columns

small_counts[, c("Sample_1", "Sample_3")]

        Sample_1 Sample_3
Xkr4         438       65
Sox17        106       82
Mrpl15       309      337
Lypla1       652      948
Tcea1       1604     1721
Rgs20          4       14
Atp6v1h      769     1062
Rb1cc1      1494     1157
Pcmtd1      1344     1374
Rrs1        1691     2127

You can even mix numerical and named indices

small_counts[1:3, c("Sample_1", "Sample_3")]

       Sample_1 Sample_3
Xkr4        438       65
Sox17       106       82
Mrpl15      309      337

Vectorisation

R arithmetic operators are vectorised by default.

small_counts$Sample_1 * 2

 [1]  876  212  618 1304 3208    8 1538 2988 2688 3382

So are many R functions.

log(small_counts)

        Sample_1  Sample_2 Sample_3 Sample_4
Xkr4    6.082219 5.7037825 4.174387 5.468060
Sox17   4.663439 5.2040067 4.406719 4.653960
Mrpl15  5.733341 5.4553211 5.820083 5.703782
Lypla1  6.480045 6.2441669 6.854355 6.840547
Tcea1   7.380256 7.3098815 7.450661 7.183112
Rgs20   1.386294 0.6931472 2.639057 1.386294
Atp6v1h 6.645091 6.6227363 6.967909 6.894670
Rb1cc1  7.309212 7.2527624 7.053586 6.874198
Pcmtd1  7.203406 7.1244783 7.225481 7.373374
Rrs1    7.433075 7.4999765 7.662468 7.410347

Let’s say we want to calculate the sum of counts for each sample. We can do this for each sample by subsetting and then using the sum function.

sum(small_counts$Sample_1)

[1] 8411

sum(small_counts$Sample_2)

[1] 7942

If you had a lot of samples, you can imagine this would get tedious. So instead, we can use an apply statement to apply the sum function to every column at once. In R, this is usually quicker than using a loop. Note the MARGIN argument tells apply which direction you want to go in. MARGIN = 1 will apply sum to each row, while MARGIN = 2 will apply sum to each column.

# Sum the counts for each sample
sample_sums = apply(small_counts, MARGIN = 2, sum)
print(sample_sums)

Sample_1 Sample_2 Sample_3 Sample_4 
    8411     7942     8887     8098

Data types

There are 5 main types: doubles, integers, complex, logical and character.

typeof(3.14)

[1] "double"

typeof(1L)

[1] "integer"

typeof(1+1i)

[1] "complex"

typeof(TRUE)

[1] "logical"

typeof('banana')

[1] "character"

Note the L suffix to insist that a number is an integer. No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.

Let’s read in some data that contains things of more than one type. This is actually a sample of a differential expression results table that you will generate later in the workshop.

ResultsTable_small <- read.table("data/ResultsTable_small.txt", header=TRUE)
print(ResultsTable_small)

   ENTREZID        SYMBOL     logFC     AveExpr         t      P.Value
1     24117          Wif1  1.819943  2.97554452  20.10780 1.063770e-10
2    381290        Atp2b4 -2.143885  3.94406593 -19.07495 1.982934e-10
3     78896 1500015O10Rik  2.807548  3.03651950  18.54773 2.758828e-10
4    226101          Myof -2.329744  6.22352456 -18.26861 3.297667e-10
5     16012        Igfbp6 -2.896115  1.97844876 -18.21525 3.413066e-10
6    231830       Micall2  2.253400  4.76059697  18.02627 3.858161e-10
7     16669         Krt19 -2.312721  8.74189184 -17.07937 7.264548e-10
8     55987         Cpxm2 -1.515469  2.83451194 -16.64333 9.829870e-10
9    231991         Creb5 -2.598105  4.27592952 -16.53634 1.059885e-09
10    14620          Gjb3  3.600094  3.52528051  16.46627 1.113755e-09
11   211577        Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
12    20317      Serpinf1  1.710380  3.38883490  15.77280 1.838727e-09
13    74747         Ddit4  2.180370  6.86479110  15.70145 1.938279e-09
14   270150       Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
15    11636           Ak1  2.766745  4.30347462  15.27694 2.664640e-09
16    20482          Skil  1.887561  8.49892507  14.65488 4.311334e-09
17   194126        Mtmr11 -1.708973  2.50804119 -14.48746 4.922928e-09
18    21953         Tnni2 -5.827889  0.30207159 -14.40327 5.265278e-09
19    12992       Csn1s2b -6.070143  3.56295004 -14.16565 6.377604e-09
20    17131         Smad7  1.972771  6.71751902  14.14348 6.493642e-09
21    20564         Slit3 -1.331437  3.44179493 -13.88522 8.026279e-09
22    16878           Lif  3.738933  6.68203417  13.73344 9.105708e-09
23   170761         Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
24    73847       Fam110a  1.474671  6.84086068  13.62251 9.993185e-09
25    12977          Csf1  2.835624  7.47759094  13.41902 1.187300e-08
26    12654         Chil1  2.342914  5.57645724  13.21976 1.408760e-08
27   545370         Hmcn1 -1.567424  3.10302591 -13.19053 1.444832e-08
28    60599     Trp53inp1 -1.258671  6.11839605 -13.16241 1.480464e-08
29   217166         Nr1d1  2.278879  6.26087761  13.12885 1.524242e-08
30    67111          Naaa -2.538596  3.29074575 -13.04083 1.645823e-08
31   232016       Ccdc129 -2.597044  5.00471484 -13.02266 1.672195e-08
32    76123         Gpsm2 -2.171896  4.99093472 -12.76344 2.102397e-08
33   329739       Fam102b -1.520291  4.18813047 -12.75357 2.120968e-08
34    79362       Bhlhe41  1.788601  6.18368494  12.70504 2.214908e-08
35    50786        Hs6st2  1.751340  0.53953600  12.43097 2.836919e-08
36    18019        Nfatc2 -2.011808  5.79499693 -12.27561 3.271067e-08
37    67971         Tppp3 -2.653398  4.90816305 -12.22845 3.416616e-08
38   235542       Ppp2r3a -1.109156  6.50105941 -12.22041 3.442139e-08
39    74134        Cyp2s1 -2.071469  1.40704575 -12.20154 3.502805e-08
40    20677          Sox4  1.522405  7.46932835  12.13548 3.724389e-08
      adj.P.Val
1  1.016240e-06
2  1.016240e-06
3  1.016240e-06
4  1.016240e-06
5  1.016240e-06
6  1.016240e-06
7  1.640127e-06
8  1.718703e-06
9  1.718703e-06
10 1.718703e-06
11 1.718703e-06
12 2.356351e-06
13 2.356351e-06
14 2.539851e-06
15 2.807465e-06
16 4.258521e-06
17 4.576586e-06
18 4.622914e-06
19 5.131276e-06
20 5.131276e-06
21 6.040348e-06
22 6.541210e-06
23 6.580512e-06
24 6.580512e-06
25 7.505634e-06
26 8.306595e-06
27 8.306595e-06
28 8.306595e-06
29 8.306595e-06
30 8.524957e-06
31 8.524957e-06
32 1.015751e-05
33 1.015751e-05
34 1.029541e-05
35 1.280991e-05
36 1.419445e-05
37 1.419445e-05
38 1.419445e-05
39 1.419445e-05
40 1.462109e-05

We can use str to reveal the structure of our ResultsTable_small.

str(ResultsTable_small)

'data.frame':   40 obs. of  7 variables:
 $ ENTREZID : int  24117 381290 78896 226101 16012 231830 16669 55987 231991 14620 ...
 $ SYMBOL   : Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...
 $ logFC    : num  1.82 -2.14 2.81 -2.33 -2.9 ...
 $ AveExpr  : num  2.98 3.94 3.04 6.22 1.98 ...
 $ t        : num  20.1 -19.1 18.5 -18.3 -18.2 ...
 $ P.Value  : num  1.06e-10 1.98e-10 2.76e-10 3.30e-10 3.41e-10 ...
 $ adj.P.Val: num  1.02e-06 1.02e-06 1.02e-06 1.02e-06 1.02e-06 ...

We can see that it’s a data frame, and str also tells us the type of each column. Everything in a single column must be of the same type, while each column can be a different type. This is because data frames are actually lists of vectors, and vectors must contain data of only one type.

When you try to mix types in a vector, strange things can happen.

my_vector = c(1, "hello", TRUE)
print(my_vector)

[1] "1"     "hello" "TRUE"

Notice that everything has been converted to a character. You can check this with typeof.

typeof(my_vector)

[1] "character"

This is called type coercion. When you try to create a vector containing a mixture of types R will automatically (and silently) convert everything to the same type. Generally it will chose a type that means you don’t lose any information.

The coercion rules go: logical -> integer -> numeric -> complex -> character.

Note, the same thing will happen if you try to read in data from a file that has a mixture of types within a single column.

Factors

You may have noticed that ResultsTable_small contained a column annotated as Factor.

str(ResultsTable_small)

'data.frame':   40 obs. of  7 variables:
 $ ENTREZID : int  24117 381290 78896 226101 16012 231830 16669 55987 231991 14620 ...
 $ SYMBOL   : Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...
 $ logFC    : num  1.82 -2.14 2.81 -2.33 -2.9 ...
 $ AveExpr  : num  2.98 3.94 3.04 6.22 1.98 ...
 $ t        : num  20.1 -19.1 18.5 -18.3 -18.2 ...
 $ P.Value  : num  1.06e-10 1.98e-10 2.76e-10 3.30e-10 3.41e-10 ...
 $ adj.P.Val: num  1.02e-06 1.02e-06 1.02e-06 1.02e-06 1.02e-06 ...

Factors usually look like character data, but are typically used to represent categorical information.

str(ResultsTable_small$SYMBOL)

 Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...

Instead of printing out the strings we gave it, we got a bunch of numbers instead. R has replaced our human-readable categories with numbered indices under the hood:

typeof(ResultsTable_small$SYMBOL)

[1] "integer"

In modelling functions, it’s important to know what the baseline levels are. This is assumed to be the first factor, but by default factors are labelled in alphabetical order. You can change this by specifying the levels:

mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)

 Factor w/ 2 levels "control","case": 2 1 1 2

In this case, we’ve explicitly told R that “control” should represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!

If you prefer not to work with factors, you can simply tell the read.table function to keep all strings as characters.

ResultsTable_small <- read.table("data/ResultsTable_small.txt", stringsAsFactors = FALSE, header=TRUE)

Sorting

If you just want to sort a vector from smallest to largest, you can use the sort function:

sort(ResultsTable_small$logFC)

 [1] -6.070143 -5.827889 -5.146268 -3.313648 -3.211148 -2.896115 -2.653398
 [8] -2.598105 -2.597044 -2.538596 -2.329744 -2.312721 -2.171896 -2.143885
[15] -2.071469 -2.011808 -1.708973 -1.567424 -1.520291 -1.515469 -1.331437
[22] -1.258671 -1.109156  1.474671  1.522405  1.710380  1.751340  1.788601
[29]  1.819943  1.887561  1.972771  2.180370  2.253400  2.278879  2.342914
[36]  2.766745  2.807548  2.835624  3.600094  3.738933

To reverse the sort order so it goes from largest to smallest:

sort(ResultsTable_small$logFC, decreasing = TRUE)

 [1]  3.738933  3.600094  2.835624  2.807548  2.766745  2.342914  2.278879
 [8]  2.253400  2.180370  1.972771  1.887561  1.819943  1.788601  1.751340
[15]  1.710380  1.522405  1.474671 -1.109156 -1.258671 -1.331437 -1.515469
[22] -1.520291 -1.567424 -1.708973 -2.011808 -2.071469 -2.143885 -2.171896
[29] -2.312721 -2.329744 -2.538596 -2.597044 -2.598105 -2.653398 -2.896115
[36] -3.211148 -3.313648 -5.146268 -5.827889 -6.070143

And this works with characters:

sort(ResultsTable_small$SYMBOL)

 [1] "1500015O10Rik" "Ak1"           "Atp2b4"        "Bhlhe41"      
 [5] "Ccdc129"       "Ccdc153"       "Chil1"         "Cpxm2"        
 [9] "Creb5"         "Csf1"          "Csn1s2b"       "Cyp2s1"       
[13] "Ddit4"         "Fam102b"       "Fam110a"       "Gjb3"         
[17] "Gpsm2"         "Hmcn1"         "Hs6st2"        "Igfbp6"       
[21] "Krt19"         "Lif"           "Micall2"       "Mrgprf"       
[25] "Mtmr11"        "Myof"          "Naaa"          "Nfatc2"       
[29] "Nr1d1"         "Pdzd3"         "Ppp2r3a"       "Serpinf1"     
[33] "Skil"          "Slit3"         "Smad7"         "Sox4"         
[37] "Tnni2"         "Tppp3"         "Trp53inp1"     "Wif1"

If you want to sort the entire data frame by the values in a single column, you can use the order function. This gives you the indexes for if the data was sorted. For example:

order(ResultsTable_small$logFC)

 [1] 19 18 11 23 14  5 37  9 31 30  4  7 32  2 39 36 17 27 33  8 21 28 38
[24] 24 40 12 35 34  1 16 20 13  6 29 26 15  3 25 10 22

You then use these indexes to sort the data:

ResultsTable_small$logFC[order(ResultsTable_small$logFC)]

 [1] -6.070143 -5.827889 -5.146268 -3.313648 -3.211148 -2.896115 -2.653398
 [8] -2.598105 -2.597044 -2.538596 -2.329744 -2.312721 -2.171896 -2.143885
[15] -2.071469 -2.011808 -1.708973 -1.567424 -1.520291 -1.515469 -1.331437
[22] -1.258671 -1.109156  1.474671  1.522405  1.710380  1.751340  1.788601
[29]  1.819943  1.887561  1.972771  2.180370  2.253400  2.278879  2.342914
[36]  2.766745  2.807548  2.835624  3.600094  3.738933

You can then use these indices to sort the rows:

ResultsTable_small[order(ResultsTable_small$logFC), ]

   ENTREZID        SYMBOL     logFC     AveExpr         t      P.Value
19    12992       Csn1s2b -6.070143  3.56295004 -14.16565 6.377604e-09
18    21953         Tnni2 -5.827889  0.30207159 -14.40327 5.265278e-09
11   211577        Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
23   170761         Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
14   270150       Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
5     16012        Igfbp6 -2.896115  1.97844876 -18.21525 3.413066e-10
37    67971         Tppp3 -2.653398  4.90816305 -12.22845 3.416616e-08
9    231991         Creb5 -2.598105  4.27592952 -16.53634 1.059885e-09
31   232016       Ccdc129 -2.597044  5.00471484 -13.02266 1.672195e-08
30    67111          Naaa -2.538596  3.29074575 -13.04083 1.645823e-08
4    226101          Myof -2.329744  6.22352456 -18.26861 3.297667e-10
7     16669         Krt19 -2.312721  8.74189184 -17.07937 7.264548e-10
32    76123         Gpsm2 -2.171896  4.99093472 -12.76344 2.102397e-08
2    381290        Atp2b4 -2.143885  3.94406593 -19.07495 1.982934e-10
39    74134        Cyp2s1 -2.071469  1.40704575 -12.20154 3.502805e-08
36    18019        Nfatc2 -2.011808  5.79499693 -12.27561 3.271067e-08
17   194126        Mtmr11 -1.708973  2.50804119 -14.48746 4.922928e-09
27   545370         Hmcn1 -1.567424  3.10302591 -13.19053 1.444832e-08
33   329739       Fam102b -1.520291  4.18813047 -12.75357 2.120968e-08
8     55987         Cpxm2 -1.515469  2.83451194 -16.64333 9.829870e-10
21    20564         Slit3 -1.331437  3.44179493 -13.88522 8.026279e-09
28    60599     Trp53inp1 -1.258671  6.11839605 -13.16241 1.480464e-08
38   235542       Ppp2r3a -1.109156  6.50105941 -12.22041 3.442139e-08
24    73847       Fam110a  1.474671  6.84086068  13.62251 9.993185e-09
40    20677          Sox4  1.522405  7.46932835  12.13548 3.724389e-08
12    20317      Serpinf1  1.710380  3.38883490  15.77280 1.838727e-09
35    50786        Hs6st2  1.751340  0.53953600  12.43097 2.836919e-08
34    79362       Bhlhe41  1.788601  6.18368494  12.70504 2.214908e-08
1     24117          Wif1  1.819943  2.97554452  20.10780 1.063770e-10
16    20482          Skil  1.887561  8.49892507  14.65488 4.311334e-09
20    17131         Smad7  1.972771  6.71751902  14.14348 6.493642e-09
13    74747         Ddit4  2.180370  6.86479110  15.70145 1.938279e-09
6    231830       Micall2  2.253400  4.76059697  18.02627 3.858161e-10
29   217166         Nr1d1  2.278879  6.26087761  13.12885 1.524242e-08
26    12654         Chil1  2.342914  5.57645724  13.21976 1.408760e-08
15    11636           Ak1  2.766745  4.30347462  15.27694 2.664640e-09
3     78896 1500015O10Rik  2.807548  3.03651950  18.54773 2.758828e-10
25    12977          Csf1  2.835624  7.47759094  13.41902 1.187300e-08
10    14620          Gjb3  3.600094  3.52528051  16.46627 1.113755e-09
22    16878           Lif  3.738933  6.68203417  13.73344 9.105708e-09
      adj.P.Val
19 5.131276e-06
18 4.622914e-06
11 1.718703e-06
23 6.580512e-06
14 2.539851e-06
5  1.016240e-06
37 1.419445e-05
9  1.718703e-06
31 8.524957e-06
30 8.524957e-06
4  1.016240e-06
7  1.640127e-06
32 1.015751e-05
2  1.016240e-06
39 1.419445e-05
36 1.419445e-05
17 4.576586e-06
27 8.306595e-06
33 1.015751e-05
8  1.718703e-06
21 6.040348e-06
28 8.306595e-06
38 1.419445e-05
24 6.580512e-06
40 1.462109e-05
12 2.356351e-06
35 1.280991e-05
34 1.029541e-05
1  1.016240e-06
16 4.258521e-06
20 5.131276e-06
13 2.356351e-06
6  1.016240e-06
29 8.306595e-06
26 8.306595e-06
15 2.807465e-06
3  1.016240e-06
25 7.505634e-06
10 1.718703e-06
22 6.541210e-06

Subsetting using logical statements

Let’s say we only want to look at the rows where the log fold change is greater than 3. We can ask if each logFC is greater than 3.

ResultsTable_small$logFC > 3

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This gives us a vector of TRUE/FALSE for each value stating if it is greater than 3. We can then use this to subset the data.

ResultsTable_small$logFC[ResultsTable_small$logFC > 3]

[1] 3.600094 3.738933

And to subset the rows of the data frame.

ResultsTable_small[ResultsTable_small$logFC > 3, ]

   ENTREZID SYMBOL    logFC  AveExpr        t      P.Value    adj.P.Val
10    14620   Gjb3 3.600094 3.525281 16.46627 1.113755e-09 1.718703e-06
22    16878    Lif 3.738933 6.682034 13.73344 9.105708e-09 6.541210e-06

In fact, we probably care about things that are -3 as well, so let’s include those too:

ResultsTable_small[ResultsTable_small$logFC > 3 | ResultsTable_small$logFC < -3, ]

   ENTREZID  SYMBOL     logFC     AveExpr         t      P.Value
10    14620    Gjb3  3.600094  3.52528051  16.46627 1.113755e-09
11   211577  Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
14   270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
18    21953   Tnni2 -5.827889  0.30207159 -14.40327 5.265278e-09
19    12992 Csn1s2b -6.070143  3.56295004 -14.16565 6.377604e-09
22    16878     Lif  3.738933  6.68203417  13.73344 9.105708e-09
23   170761   Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
      adj.P.Val
10 1.718703e-06
11 1.718703e-06
14 2.539851e-06
18 4.622914e-06
19 5.131276e-06
22 6.541210e-06
23 6.580512e-06

Or we can simplify this using the abs function to find the absolute value of each logFC before testing them.

ResultsTable_small[abs(ResultsTable_small$logFC) > 3, ]

   ENTREZID  SYMBOL     logFC     AveExpr         t      P.Value
10    14620    Gjb3  3.600094  3.52528051  16.46627 1.113755e-09
11   211577  Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
14   270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
18    21953   Tnni2 -5.827889  0.30207159 -14.40327 5.265278e-09
19    12992 Csn1s2b -6.070143  3.56295004 -14.16565 6.377604e-09
22    16878     Lif  3.738933  6.68203417  13.73344 9.105708e-09
23   170761   Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
      adj.P.Val
10 1.718703e-06
11 1.718703e-06
14 2.539851e-06
18 4.622914e-06
19 5.131276e-06
22 6.541210e-06
23 6.580512e-06

Tip: Logical Operators in R

Operator Description

< less than

<= less than or equal to

> greater than

>= greater than or equal to

== exactly equal to

!= not equal to

!x Not x

x | y x OR y

x & y x AND y

isTRUE(x) test if X is TRUE

Source: Quick-R

Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x \| y	x OR y
x & y	x AND y
isTRUE(x)	test if X is TRUE

We might also want to reduce our results table down to a set of genes that we are interested in. We can use similar logical subsetting logic to do this.

my_genes <- c("Smad7", "Wif1", "Fam102b", "Tppp3")

For each gene in ResultsTable_small$SYMBOL, does it appear in my_genes?

ResultsTable_small$SYMBOL %in% my_genes

 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[34] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

ResultsTable_small[ResultsTable_small$SYMBOL %in% my_genes, ]

   ENTREZID  SYMBOL     logFC  AveExpr         t      P.Value    adj.P.Val
1     24117    Wif1  1.819943 2.975545  20.10780 1.063770e-10 1.016240e-06
20    17131   Smad7  1.972771 6.717519  14.14348 6.493642e-09 5.131276e-06
33   329739 Fam102b -1.520291 4.188130 -12.75357 2.120968e-08 1.015751e-05
37    67971   Tppp3 -2.653398 4.908163 -12.22845 3.416616e-08 1.419445e-05

You can use match to get the same subset, but this time make sure they are in the same order as in my_genes. This can be useful if you are trying to merge together data from two sources.

match(my_genes, ResultsTable_small$SYMBOL)

[1] 20  1 33 37

This returns the index for where each of the genes in my_list appears in ResultsTable_small$SYMBOL. We can then use this to subset the columns from ResultsTable_small.

ResultsTable_small[match(my_genes, ResultsTable_small$SYMBOL), ]

   ENTREZID  SYMBOL     logFC  AveExpr         t      P.Value    adj.P.Val
20    17131   Smad7  1.972771 6.717519  14.14348 6.493642e-09 5.131276e-06
1     24117    Wif1  1.819943 2.975545  20.10780 1.063770e-10 1.016240e-06
33   329739 Fam102b -1.520291 4.188130 -12.75357 2.120968e-08 1.015751e-05
37    67971   Tppp3 -2.653398 4.908163 -12.22845 3.416616e-08 1.419445e-05

You can see they are in the same order as my_genes!

RNA-seq analysis in R

R for RNAseq

Introduction to RStudio

Work flow within RStudio

Tip: Running segments of your code

Your RStudio environment

R Packages

RStudio project management

A possible solution

Challenge: Creating a self-contained project

Best practices for project organisation

Treat data as read only

Data Cleaning

Treat generated output as disposable

Separate function definition and application

Tip: avoiding duplication

Save the data in the data directory

Challenge

Reading data into R

Working with Data Frames

Subsetting

Vectorisation

Data types

Factors

Sorting

Subsetting using logical statements

Tip: Logical Operators in R