We’ll be using RStudio: a free, open source R integrated development environment. It provides a built in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.
Basic layout
When you first open RStudio, you will be greeted by three panels:
Once you open files, such as R scripts, an editor panel will also open in the top left.
There are two main ways one can work within RStudio.
source()
function.Tip: Running segments of your code
RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can 1. click on the
Run
button just above the editor panel, or 2. select “Run Lines” from the “Code” menu, or 3. hit Ctrl-Enter in Windows or Linux or Command-Enter on OS X. (This shortcut can also be seen by hovering the mouse over the button). To run a block of code, select it and thenRun
. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section andRun
, you can use the next button along,Re-run the previous region
. This will run the previous code block including the modifications you have made.
Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you just typed in R
in your commandline environment.
It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 7,000 packages available on CRAN (the comprehensive R archive network). R and RStudio have functionality for managing packages:
installed.packages()
install.packages("packagename")
, where packagename
is the package name, in quotes.update.packages()
remove.packages("packagename")
library(packagename)
For this workshop we will also be using packages from These can all be obtained from Bioconductor.
Open RStudio and run the following commands to install packages from Bioconductor. These are installed slightly differently. For example, to install the package limma
:
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.
Managing your projects in a reproducible fashion doesn’t just make your science reproducible, it makes your life easier.
— Vince Buffalo April 15, 2013
Most people tend to organize their projects like this:
There are many reasons why we should ALWAYS avoid this:
A good project layout will ultimately make your life easier:
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.
Challenge: Creating a self-contained project
We’re going to create a new project in RStudio:
- Click the “File” menu button, then “New Project”.
- Click “New Directory”.
- Click “Empty Project”.
- Type in the name of the directory to store your project, e.g. “my_project”, or better yet use the name of this workshop.
- Click the “Create Project” button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.
In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. I find it useful to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.
Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.
There are lots of different was to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.
The most effective way I find to work in R, is to play around in the interactive session, then copy commands across to a script file when I’m sure they work and do what I want. You can also save all the commands you’ve entered using the history
command, but I don’t find it useful because when I’m typing its 90% trial and error.
When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
Tip: avoiding duplication
You may find yourself using data or analysis scripts across several projects. Typically you want to avoid duplication to save space and avoid having to make updates to code in multiple places.
In this case I find it useful to make “symbolic links”, which are essentially shortcuts to files somewhere else on a filesystem. On Linux and OS X you can use the
ln -s
command, and on windows you can either create a shortcut or use themklink
command from the windows terminal.
Now we have a good directory structure we will now place/save the data file in the data/
directory.
Download the RNAseq data for this workshop.
Challenge
- Create a
/data
directory. In the bottom right panel select the “Files” tab, then “New Folder”, then type “data” and click “OK”.- Download the RNAseq data using the links above (if you find the internet is slow, you can just download Day 1 for now).
- Click “Download all” (this will download a zip file).
- Unzip the file (usually double clicking on it will do the trick).
- Move all the files inside into the
data/
folder within your project.
Our data is tab delimited, which is what read.table expect by default so we don’t need to specify delim
. However, we do need header = TRUE
.
# Read the data into R
small_counts <- read.table("data/small_counts.txt", header = TRUE)
print(small_counts)
Sample_1 Sample_2 Sample_3 Sample_4
Xkr4 438 300 65 237
Sox17 106 182 82 105
Mrpl15 309 234 337 300
Lypla1 652 515 948 935
Tcea1 1604 1495 1721 1317
Rgs20 4 2 14 4
Atp6v1h 769 752 1062 987
Rb1cc1 1494 1412 1157 967
Pcmtd1 1344 1242 1374 1593
Rrs1 1691 1808 2127 1653
dim(small_counts)
[1] 10 4
Let’s say we want the all the counts for Sample_1. There are three main ways we could do this:
$
notation with the column name
small_counts$Sample_1
[1] 438 106 309 652 1604 4 769 1494 1344 1691
[row, column]
notation with numeric indices
small_counts[, 1]
[1] 438 106 309 652 1604 4 769 1494 1344 1691
[row, column]
notation using the column name (in a vector)
small_counts[, c("Sample_1")]
[1] 438 106 309 652 1604 4 769 1494 1344 1691
This is handy when you want to grab multiple columns
small_counts[, c("Sample_1", "Sample_3")]
Sample_1 Sample_3
Xkr4 438 65
Sox17 106 82
Mrpl15 309 337
Lypla1 652 948
Tcea1 1604 1721
Rgs20 4 14
Atp6v1h 769 1062
Rb1cc1 1494 1157
Pcmtd1 1344 1374
Rrs1 1691 2127
You can even mix numerical and named indices
small_counts[1:3, c("Sample_1", "Sample_3")]
Sample_1 Sample_3
Xkr4 438 65
Sox17 106 82
Mrpl15 309 337
R arithmetic operators are vectorised by default.
small_counts$Sample_1 * 2
[1] 876 212 618 1304 3208 8 1538 2988 2688 3382
So are many R functions.
log(small_counts)
Sample_1 Sample_2 Sample_3 Sample_4
Xkr4 6.082219 5.7037825 4.174387 5.468060
Sox17 4.663439 5.2040067 4.406719 4.653960
Mrpl15 5.733341 5.4553211 5.820083 5.703782
Lypla1 6.480045 6.2441669 6.854355 6.840547
Tcea1 7.380256 7.3098815 7.450661 7.183112
Rgs20 1.386294 0.6931472 2.639057 1.386294
Atp6v1h 6.645091 6.6227363 6.967909 6.894670
Rb1cc1 7.309212 7.2527624 7.053586 6.874198
Pcmtd1 7.203406 7.1244783 7.225481 7.373374
Rrs1 7.433075 7.4999765 7.662468 7.410347
Let’s say we want to calculate the sum of counts for each sample. We can do this for each sample by subsetting and then using the sum
function.
sum(small_counts$Sample_1)
[1] 8411
sum(small_counts$Sample_2)
[1] 7942
If you had a lot of samples, you can imagine this would get tedious. So instead, we can use an apply
statement to apply the sum
function to every column at once. In R, this is usually quicker than using a loop. Note the MARGIN
argument tells apply
which direction you want to go in. MARGIN = 1 will apply sum
to each row, while MARGIN = 2 will apply sum
to each column.
# Sum the counts for each sample
sample_sums = apply(small_counts, MARGIN = 2, sum)
print(sample_sums)
Sample_1 Sample_2 Sample_3 Sample_4
8411 7942 8887 8098
There are 5 main types: doubles, integers, complex, logical and character.
typeof(3.14)
[1] "double"
typeof(1L)
[1] "integer"
typeof(1+1i)
[1] "complex"
typeof(TRUE)
[1] "logical"
typeof('banana')
[1] "character"
Note the L
suffix to insist that a number is an integer. No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
Let’s read in some data that contains things of more than one type. This is actually a sample of a differential expression results table that you will generate later in the workshop.
ResultsTable_small <- read.table("data/ResultsTable_small.txt", header=TRUE)
print(ResultsTable_small)
ENTREZID SYMBOL logFC AveExpr t P.Value
1 24117 Wif1 1.819943 2.97554452 20.10780 1.063770e-10
2 381290 Atp2b4 -2.143885 3.94406593 -19.07495 1.982934e-10
3 78896 1500015O10Rik 2.807548 3.03651950 18.54773 2.758828e-10
4 226101 Myof -2.329744 6.22352456 -18.26861 3.297667e-10
5 16012 Igfbp6 -2.896115 1.97844876 -18.21525 3.413066e-10
6 231830 Micall2 2.253400 4.76059697 18.02627 3.858161e-10
7 16669 Krt19 -2.312721 8.74189184 -17.07937 7.264548e-10
8 55987 Cpxm2 -1.515469 2.83451194 -16.64333 9.829870e-10
9 231991 Creb5 -2.598105 4.27592952 -16.53634 1.059885e-09
10 14620 Gjb3 3.600094 3.52528051 16.46627 1.113755e-09
11 211577 Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
12 20317 Serpinf1 1.710380 3.38883490 15.77280 1.838727e-09
13 74747 Ddit4 2.180370 6.86479110 15.70145 1.938279e-09
14 270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
15 11636 Ak1 2.766745 4.30347462 15.27694 2.664640e-09
16 20482 Skil 1.887561 8.49892507 14.65488 4.311334e-09
17 194126 Mtmr11 -1.708973 2.50804119 -14.48746 4.922928e-09
18 21953 Tnni2 -5.827889 0.30207159 -14.40327 5.265278e-09
19 12992 Csn1s2b -6.070143 3.56295004 -14.16565 6.377604e-09
20 17131 Smad7 1.972771 6.71751902 14.14348 6.493642e-09
21 20564 Slit3 -1.331437 3.44179493 -13.88522 8.026279e-09
22 16878 Lif 3.738933 6.68203417 13.73344 9.105708e-09
23 170761 Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
24 73847 Fam110a 1.474671 6.84086068 13.62251 9.993185e-09
25 12977 Csf1 2.835624 7.47759094 13.41902 1.187300e-08
26 12654 Chil1 2.342914 5.57645724 13.21976 1.408760e-08
27 545370 Hmcn1 -1.567424 3.10302591 -13.19053 1.444832e-08
28 60599 Trp53inp1 -1.258671 6.11839605 -13.16241 1.480464e-08
29 217166 Nr1d1 2.278879 6.26087761 13.12885 1.524242e-08
30 67111 Naaa -2.538596 3.29074575 -13.04083 1.645823e-08
31 232016 Ccdc129 -2.597044 5.00471484 -13.02266 1.672195e-08
32 76123 Gpsm2 -2.171896 4.99093472 -12.76344 2.102397e-08
33 329739 Fam102b -1.520291 4.18813047 -12.75357 2.120968e-08
34 79362 Bhlhe41 1.788601 6.18368494 12.70504 2.214908e-08
35 50786 Hs6st2 1.751340 0.53953600 12.43097 2.836919e-08
36 18019 Nfatc2 -2.011808 5.79499693 -12.27561 3.271067e-08
37 67971 Tppp3 -2.653398 4.90816305 -12.22845 3.416616e-08
38 235542 Ppp2r3a -1.109156 6.50105941 -12.22041 3.442139e-08
39 74134 Cyp2s1 -2.071469 1.40704575 -12.20154 3.502805e-08
40 20677 Sox4 1.522405 7.46932835 12.13548 3.724389e-08
adj.P.Val
1 1.016240e-06
2 1.016240e-06
3 1.016240e-06
4 1.016240e-06
5 1.016240e-06
6 1.016240e-06
7 1.640127e-06
8 1.718703e-06
9 1.718703e-06
10 1.718703e-06
11 1.718703e-06
12 2.356351e-06
13 2.356351e-06
14 2.539851e-06
15 2.807465e-06
16 4.258521e-06
17 4.576586e-06
18 4.622914e-06
19 5.131276e-06
20 5.131276e-06
21 6.040348e-06
22 6.541210e-06
23 6.580512e-06
24 6.580512e-06
25 7.505634e-06
26 8.306595e-06
27 8.306595e-06
28 8.306595e-06
29 8.306595e-06
30 8.524957e-06
31 8.524957e-06
32 1.015751e-05
33 1.015751e-05
34 1.029541e-05
35 1.280991e-05
36 1.419445e-05
37 1.419445e-05
38 1.419445e-05
39 1.419445e-05
40 1.462109e-05
We can use str
to reveal the structure of our ResultsTable_small
.
str(ResultsTable_small)
'data.frame': 40 obs. of 7 variables:
$ ENTREZID : int 24117 381290 78896 226101 16012 231830 16669 55987 231991 14620 ...
$ SYMBOL : Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...
$ logFC : num 1.82 -2.14 2.81 -2.33 -2.9 ...
$ AveExpr : num 2.98 3.94 3.04 6.22 1.98 ...
$ t : num 20.1 -19.1 18.5 -18.3 -18.2 ...
$ P.Value : num 1.06e-10 1.98e-10 2.76e-10 3.30e-10 3.41e-10 ...
$ adj.P.Val: num 1.02e-06 1.02e-06 1.02e-06 1.02e-06 1.02e-06 ...
We can see that it’s a data frame, and str
also tells us the type of each column. Everything in a single column must be of the same type, while each column can be a different type. This is because data frames are actually lists of vectors, and vectors must contain data of only one type.
When you try to mix types in a vector, strange things can happen.
my_vector = c(1, "hello", TRUE)
print(my_vector)
[1] "1" "hello" "TRUE"
Notice that everything has been converted to a character. You can check this with typeof
.
typeof(my_vector)
[1] "character"
This is called type coercion. When you try to create a vector containing a mixture of types R will automatically (and silently) convert everything to the same type. Generally it will chose a type that means you don’t lose any information.
The coercion rules go: logical
-> integer
-> numeric
-> complex
-> character
.
Note, the same thing will happen if you try to read in data from a file that has a mixture of types within a single column.
You may have noticed that ResultsTable_small
contained a column annotated as Factor
.
str(ResultsTable_small)
'data.frame': 40 obs. of 7 variables:
$ ENTREZID : int 24117 381290 78896 226101 16012 231830 16669 55987 231991 14620 ...
$ SYMBOL : Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...
$ logFC : num 1.82 -2.14 2.81 -2.33 -2.9 ...
$ AveExpr : num 2.98 3.94 3.04 6.22 1.98 ...
$ t : num 20.1 -19.1 18.5 -18.3 -18.2 ...
$ P.Value : num 1.06e-10 1.98e-10 2.76e-10 3.30e-10 3.41e-10 ...
$ adj.P.Val: num 1.02e-06 1.02e-06 1.02e-06 1.02e-06 1.02e-06 ...
Factors usually look like character data, but are typically used to represent categorical information.
str(ResultsTable_small$SYMBOL)
Factor w/ 40 levels "1500015O10Rik",..: 40 3 1 26 20 23 21 8 9 16 ...
Instead of printing out the strings we gave it, we got a bunch of numbers instead. R has replaced our human-readable categories with numbered indices under the hood:
typeof(ResultsTable_small$SYMBOL)
[1] "integer"
In modelling functions, it’s important to know what the baseline levels are. This is assumed to be the first factor, but by default factors are labelled in alphabetical order. You can change this by specifying the levels:
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
Factor w/ 2 levels "control","case": 2 1 1 2
In this case, we’ve explicitly told R that “control” should represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!
If you prefer not to work with factors, you can simply tell the read.table
function to keep all strings as characters.
ResultsTable_small <- read.table("data/ResultsTable_small.txt", stringsAsFactors = FALSE, header=TRUE)
If you just want to sort a vector from smallest to largest, you can use the sort
function:
sort(ResultsTable_small$logFC)
[1] -6.070143 -5.827889 -5.146268 -3.313648 -3.211148 -2.896115 -2.653398
[8] -2.598105 -2.597044 -2.538596 -2.329744 -2.312721 -2.171896 -2.143885
[15] -2.071469 -2.011808 -1.708973 -1.567424 -1.520291 -1.515469 -1.331437
[22] -1.258671 -1.109156 1.474671 1.522405 1.710380 1.751340 1.788601
[29] 1.819943 1.887561 1.972771 2.180370 2.253400 2.278879 2.342914
[36] 2.766745 2.807548 2.835624 3.600094 3.738933
To reverse the sort order so it goes from largest to smallest:
sort(ResultsTable_small$logFC, decreasing = TRUE)
[1] 3.738933 3.600094 2.835624 2.807548 2.766745 2.342914 2.278879
[8] 2.253400 2.180370 1.972771 1.887561 1.819943 1.788601 1.751340
[15] 1.710380 1.522405 1.474671 -1.109156 -1.258671 -1.331437 -1.515469
[22] -1.520291 -1.567424 -1.708973 -2.011808 -2.071469 -2.143885 -2.171896
[29] -2.312721 -2.329744 -2.538596 -2.597044 -2.598105 -2.653398 -2.896115
[36] -3.211148 -3.313648 -5.146268 -5.827889 -6.070143
And this works with characters:
sort(ResultsTable_small$SYMBOL)
[1] "1500015O10Rik" "Ak1" "Atp2b4" "Bhlhe41"
[5] "Ccdc129" "Ccdc153" "Chil1" "Cpxm2"
[9] "Creb5" "Csf1" "Csn1s2b" "Cyp2s1"
[13] "Ddit4" "Fam102b" "Fam110a" "Gjb3"
[17] "Gpsm2" "Hmcn1" "Hs6st2" "Igfbp6"
[21] "Krt19" "Lif" "Micall2" "Mrgprf"
[25] "Mtmr11" "Myof" "Naaa" "Nfatc2"
[29] "Nr1d1" "Pdzd3" "Ppp2r3a" "Serpinf1"
[33] "Skil" "Slit3" "Smad7" "Sox4"
[37] "Tnni2" "Tppp3" "Trp53inp1" "Wif1"
If you want to sort the entire data frame by the values in a single column, you can use the order
function. This gives you the indexes for if the data was sorted. For example:
order(ResultsTable_small$logFC)
[1] 19 18 11 23 14 5 37 9 31 30 4 7 32 2 39 36 17 27 33 8 21 28 38
[24] 24 40 12 35 34 1 16 20 13 6 29 26 15 3 25 10 22
You then use these indexes to sort the data:
ResultsTable_small$logFC[order(ResultsTable_small$logFC)]
[1] -6.070143 -5.827889 -5.146268 -3.313648 -3.211148 -2.896115 -2.653398
[8] -2.598105 -2.597044 -2.538596 -2.329744 -2.312721 -2.171896 -2.143885
[15] -2.071469 -2.011808 -1.708973 -1.567424 -1.520291 -1.515469 -1.331437
[22] -1.258671 -1.109156 1.474671 1.522405 1.710380 1.751340 1.788601
[29] 1.819943 1.887561 1.972771 2.180370 2.253400 2.278879 2.342914
[36] 2.766745 2.807548 2.835624 3.600094 3.738933
You can then use these indices to sort the rows:
ResultsTable_small[order(ResultsTable_small$logFC), ]
ENTREZID SYMBOL logFC AveExpr t P.Value
19 12992 Csn1s2b -6.070143 3.56295004 -14.16565 6.377604e-09
18 21953 Tnni2 -5.827889 0.30207159 -14.40327 5.265278e-09
11 211577 Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
23 170761 Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
14 270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
5 16012 Igfbp6 -2.896115 1.97844876 -18.21525 3.413066e-10
37 67971 Tppp3 -2.653398 4.90816305 -12.22845 3.416616e-08
9 231991 Creb5 -2.598105 4.27592952 -16.53634 1.059885e-09
31 232016 Ccdc129 -2.597044 5.00471484 -13.02266 1.672195e-08
30 67111 Naaa -2.538596 3.29074575 -13.04083 1.645823e-08
4 226101 Myof -2.329744 6.22352456 -18.26861 3.297667e-10
7 16669 Krt19 -2.312721 8.74189184 -17.07937 7.264548e-10
32 76123 Gpsm2 -2.171896 4.99093472 -12.76344 2.102397e-08
2 381290 Atp2b4 -2.143885 3.94406593 -19.07495 1.982934e-10
39 74134 Cyp2s1 -2.071469 1.40704575 -12.20154 3.502805e-08
36 18019 Nfatc2 -2.011808 5.79499693 -12.27561 3.271067e-08
17 194126 Mtmr11 -1.708973 2.50804119 -14.48746 4.922928e-09
27 545370 Hmcn1 -1.567424 3.10302591 -13.19053 1.444832e-08
33 329739 Fam102b -1.520291 4.18813047 -12.75357 2.120968e-08
8 55987 Cpxm2 -1.515469 2.83451194 -16.64333 9.829870e-10
21 20564 Slit3 -1.331437 3.44179493 -13.88522 8.026279e-09
28 60599 Trp53inp1 -1.258671 6.11839605 -13.16241 1.480464e-08
38 235542 Ppp2r3a -1.109156 6.50105941 -12.22041 3.442139e-08
24 73847 Fam110a 1.474671 6.84086068 13.62251 9.993185e-09
40 20677 Sox4 1.522405 7.46932835 12.13548 3.724389e-08
12 20317 Serpinf1 1.710380 3.38883490 15.77280 1.838727e-09
35 50786 Hs6st2 1.751340 0.53953600 12.43097 2.836919e-08
34 79362 Bhlhe41 1.788601 6.18368494 12.70504 2.214908e-08
1 24117 Wif1 1.819943 2.97554452 20.10780 1.063770e-10
16 20482 Skil 1.887561 8.49892507 14.65488 4.311334e-09
20 17131 Smad7 1.972771 6.71751902 14.14348 6.493642e-09
13 74747 Ddit4 2.180370 6.86479110 15.70145 1.938279e-09
6 231830 Micall2 2.253400 4.76059697 18.02627 3.858161e-10
29 217166 Nr1d1 2.278879 6.26087761 13.12885 1.524242e-08
26 12654 Chil1 2.342914 5.57645724 13.21976 1.408760e-08
15 11636 Ak1 2.766745 4.30347462 15.27694 2.664640e-09
3 78896 1500015O10Rik 2.807548 3.03651950 18.54773 2.758828e-10
25 12977 Csf1 2.835624 7.47759094 13.41902 1.187300e-08
10 14620 Gjb3 3.600094 3.52528051 16.46627 1.113755e-09
22 16878 Lif 3.738933 6.68203417 13.73344 9.105708e-09
adj.P.Val
19 5.131276e-06
18 4.622914e-06
11 1.718703e-06
23 6.580512e-06
14 2.539851e-06
5 1.016240e-06
37 1.419445e-05
9 1.718703e-06
31 8.524957e-06
30 8.524957e-06
4 1.016240e-06
7 1.640127e-06
32 1.015751e-05
2 1.016240e-06
39 1.419445e-05
36 1.419445e-05
17 4.576586e-06
27 8.306595e-06
33 1.015751e-05
8 1.718703e-06
21 6.040348e-06
28 8.306595e-06
38 1.419445e-05
24 6.580512e-06
40 1.462109e-05
12 2.356351e-06
35 1.280991e-05
34 1.029541e-05
1 1.016240e-06
16 4.258521e-06
20 5.131276e-06
13 2.356351e-06
6 1.016240e-06
29 8.306595e-06
26 8.306595e-06
15 2.807465e-06
3 1.016240e-06
25 7.505634e-06
10 1.718703e-06
22 6.541210e-06
Let’s say we only want to look at the rows where the log fold change is greater than 3. We can ask if each logFC
is greater than 3.
ResultsTable_small$logFC > 3
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This gives us a vector of TRUE/FALSE for each value stating if it is greater than 3. We can then use this to subset the data.
ResultsTable_small$logFC[ResultsTable_small$logFC > 3]
[1] 3.600094 3.738933
And to subset the rows of the data frame.
ResultsTable_small[ResultsTable_small$logFC > 3, ]
ENTREZID SYMBOL logFC AveExpr t P.Value adj.P.Val
10 14620 Gjb3 3.600094 3.525281 16.46627 1.113755e-09 1.718703e-06
22 16878 Lif 3.738933 6.682034 13.73344 9.105708e-09 6.541210e-06
In fact, we probably care about things that are -3 as well, so let’s include those too:
ResultsTable_small[ResultsTable_small$logFC > 3 | ResultsTable_small$logFC < -3, ]
ENTREZID SYMBOL logFC AveExpr t P.Value
10 14620 Gjb3 3.600094 3.52528051 16.46627 1.113755e-09
11 211577 Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
14 270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
18 21953 Tnni2 -5.827889 0.30207159 -14.40327 5.265278e-09
19 12992 Csn1s2b -6.070143 3.56295004 -14.16565 6.377604e-09
22 16878 Lif 3.738933 6.68203417 13.73344 9.105708e-09
23 170761 Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
adj.P.Val
10 1.718703e-06
11 1.718703e-06
14 2.539851e-06
18 4.622914e-06
19 5.131276e-06
22 6.541210e-06
23 6.580512e-06
Or we can simplify this using the abs
function to find the absolute value of each logFC
before testing them.
ResultsTable_small[abs(ResultsTable_small$logFC) > 3, ]
ENTREZID SYMBOL logFC AveExpr t P.Value
10 14620 Gjb3 3.600094 3.52528051 16.46627 1.113755e-09
11 211577 Mrgprf -5.146268 -0.93683349 -16.36573 1.196263e-09
14 270150 Ccdc153 -3.211148 -1.34083882 -15.50126 2.249931e-09
18 21953 Tnni2 -5.827889 0.30207159 -14.40327 5.265278e-09
19 12992 Csn1s2b -6.070143 3.56295004 -14.16565 6.377604e-09
22 16878 Lif 3.738933 6.68203417 13.73344 9.105708e-09
23 170761 Pdzd3 -3.313648 -0.06019306 -13.62372 9.982985e-09
adj.P.Val
10 1.718703e-06
11 1.718703e-06
14 2.539851e-06
18 4.622914e-06
19 5.131276e-06
22 6.541210e-06
23 6.580512e-06
Tip: Logical Operators in R
Operator Description < less than <= less than or equal to > greater than >= greater than or equal to == exactly equal to != not equal to !x Not x x | y x OR y x & y x AND y isTRUE(x) test if X is TRUE Source: Quick-R
We might also want to reduce our results table down to a set of genes that we are interested in. We can use similar logical subsetting logic to do this.
my_genes <- c("Smad7", "Wif1", "Fam102b", "Tppp3")
For each gene in ResultsTable_small$SYMBOL, does it appear in my_genes?
ResultsTable_small$SYMBOL %in% my_genes
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[34] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
ResultsTable_small[ResultsTable_small$SYMBOL %in% my_genes, ]
ENTREZID SYMBOL logFC AveExpr t P.Value adj.P.Val
1 24117 Wif1 1.819943 2.975545 20.10780 1.063770e-10 1.016240e-06
20 17131 Smad7 1.972771 6.717519 14.14348 6.493642e-09 5.131276e-06
33 329739 Fam102b -1.520291 4.188130 -12.75357 2.120968e-08 1.015751e-05
37 67971 Tppp3 -2.653398 4.908163 -12.22845 3.416616e-08 1.419445e-05
You can use match
to get the same subset, but this time make sure they are in the same order as in my_genes
. This can be useful if you are trying to merge together data from two sources.
match(my_genes, ResultsTable_small$SYMBOL)
[1] 20 1 33 37
This returns the index for where each of the genes in my_list appears
in ResultsTable_small$SYMBOL
. We can then use this to subset the columns from ResultsTable_small
.
ResultsTable_small[match(my_genes, ResultsTable_small$SYMBOL), ]
ENTREZID SYMBOL logFC AveExpr t P.Value adj.P.Val
20 17131 Smad7 1.972771 6.717519 14.14348 6.493642e-09 5.131276e-06
1 24117 Wif1 1.819943 2.975545 20.10780 1.063770e-10 1.016240e-06
33 329739 Fam102b -1.520291 4.188130 -12.75357 2.120968e-08 1.015751e-05
37 67971 Tppp3 -2.653398 4.908163 -12.22845 3.416616e-08 1.419445e-05
You can see they are in the same order as my_genes
!