R and tidyverse
Nov 9, 2021
To interactively work with the code below, open lecture12.ipynb in VSCode. Make sure to select the kernel for R so that you can execute R code. You should have already set this up following the software installation instructions here.
R is the second programming language after Python that we will learn in this course. We will use R over the next 5 lectures.
R is particularly well suited for reading, manipulating, and visualizing data in tabular and biological sequence formats.
Many statistical tests are also available out of the box in R.
While “base” R is used widely, I almost exclusively use R for its two excellent package collections:
Today we will learn a few basic functions from tidyverse for working with tabular data.
Unlike pandas, which is a single package with a lot of functionality, tidyverse is a collection of packages that each focus on specific tasks.
You can load all the above packages in one go:
library(tidyverse)
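Because tidyverse is a collection of separate packages, each one can also be loaded on its own. The sketch below loads only dplyr and ggplot2 and then lists the packages that belong to the collection. It assumes the tidyverse package is installed; `tidyverse_packages()` is a helper from the tidyverse package itself.

```r
# Load only the packages needed for a given task
library(dplyr)    # data manipulation
library(ggplot2)  # plotting

# List the packages that are part of the tidyverse collection
# (called with :: so the tidyverse package itself need not be attached)
tidyverse_pkgs <- tidyverse::tidyverse_packages()
print(tidyverse_pkgs)
```

Loading individual packages keeps the startup messages short, while `library(tidyverse)` is more convenient for interactive work.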
Various options for reading and writing data are in package readr.
data <- read_tsv("data/example_dataset_1.tsv")
data
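As a rough sketch of both reading and writing with readr, the example below writes a small table to a temporary file and reads it back. The table `toy` and its values are made up for illustration and are not part of the lecture datasets.

```r
library(readr)
library(tibble)

# Hypothetical toy table; values are made up for illustration
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(0.5, 1.25, 2)
)

# Write to a temporary tab-separated file, then read it back
path <- tempfile(fileext = ".tsv")
write_tsv(toy, path)
toy_read <- read_tsv(path, show_col_types = FALSE)

toy_read
```

Here `show_col_types = FALSE` only suppresses the column type message that `read_tsv` prints by default.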
The tabular data structure is called a tibble in tidyverse, and is a souped-up version of the data.frame R data structure with additional nice features.
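To make the relationship between the two structures concrete, here is a minimal sketch that converts a plain data.frame into a tibble. The values are made up for illustration.

```r
library(tibble)

# A plain data.frame with made-up values
df <- data.frame(strain = c("s1", "s2"), mean_ratio = c(0.5, 1.25))

# Convert it to a tibble
tb <- as_tibble(df)

# A tibble is still a data.frame, so functions written for data.frames accept it
is.data.frame(tb)
is_tibble(tb)
is_tibble(df)

# Tibbles print their dimensions and column types along with the data
tb
```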
The ` <- ` assignment operator is equivalent to the ` = ` assignment operator when assigning variables, and the two can be used interchangeably. However, ` <- ` is more conventional in R.
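A small sketch of the two operators. The one place they differ is inside a function call, where ` = ` names an argument rather than creating a variable. The variable names are arbitrary.

```r
# Top-level assignment: both forms create a variable
x <- 5
y = 5
identical(x, y)

# Inside a function call, `=` names an argument and does not create
# or overwrite a variable, so x keeps its earlier value
m <- mean(x = c(1, 2, 3))
x
m
```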
ggplot(data, aes(x = kozak_region, y = mean_ratio)) +
  geom_point()
Anatomy of a ggplot2 plot
The ggplot function takes a tibble as its first argument.
aes specifies the variables to plot.
geom specifies the type of plot.
+ adds additional layers to the plot.
Key differences with Python
options(repr.plot.width = 5, repr.plot.height = 3)
Plotting a point graph with color
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence)) +
  geom_point()
Plotting a line graph
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line()
Plotting point and line graphs
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()
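Because ` + ` adds layers to a plot, a plot can also be stored in a variable and extended later. The sketch below uses a made-up table `toy` that borrows the lecture's column names; the values and the names `p_line` and `p_both` are only for illustration.

```r
library(ggplot2)
library(tibble)

# Made-up data with the same column names as the lecture dataset
toy <- tibble(
  kozak_region = rep(c("A", "B", "C"), times = 2),
  mean_ratio = c(1, 2, 3, 2, 3, 5),
  insert_sequence = rep(c("seq1", "seq2"), each = 3)
)

# Start with the data, the aesthetics, and one layer
p_line <- ggplot(toy, aes(x = kozak_region, y = mean_ratio,
                          color = insert_sequence,
                          group = insert_sequence)) +
  geom_line()

# Adding another geom with + returns a new plot with one more layer
p_both <- p_line + geom_point()

p_both
```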
options(repr.plot.width = 6, repr.plot.height = 3)
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 group = insert_sequence)) +
  geom_line() +
  geom_point() +
  facet_grid(~ insert_sequence)
ggplot2 exercises (20 min)
See https://ggplot2.tidyverse.org/reference/labs.html
ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()
classic theme
See https://ggplot2.tidyverse.org/reference/ggtheme.html
See https://ggplot2.tidyverse.org/reference/scale_continuous.html
Uses functions from the dplyr package.
data <- read_tsv("data/example_dataset_1.tsv")
data
select(data, strain, mean_ratio, insert_sequence, kozak_region)
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(strain, mean_ratio, insert_sequence, kozak_region)
The above is the same as the following:
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(., strain, mean_ratio, insert_sequence, kozak_region)
The %>% operator lets you chain different data analysis tasks together and makes the analysis logic easier to understand.
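To illustrate how chaining helps readability, the sketch below writes the same three steps once as nested function calls and once with ` %>% `. The table `toy` is made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(2, 0.5, 1.25),
  kozak_region = c("A", "B", "A")
)

# Nested calls read inside-out
nested <- arrange(filter(select(toy, strain, mean_ratio), mean_ratio > 1), mean_ratio)

# The same steps with %>% read top to bottom
piped <- toy %>%
  select(strain, mean_ratio) %>%
  filter(mean_ratio > 1) %>%
  arrange(mean_ratio)

# Both forms give the same table
identical(nested, piped)
```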
Side note: You can create keyboard shortcuts for ` <- ` and ` %>% ` in VSCode as explained here.
I use Alt + - for ` <- ` and Alt + Shift + m for ` %>% ` following RStudio convention.
You can view the transformed data by using print() as the last step in a chain of commands:
data <- read_tsv("data/example_dataset_1.tsv") %>%
  select(strain, mean_ratio, insert_sequence, kozak_region) %>%
  print()
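Ending a chain with print() works together with assignment because print() displays the tibble and then passes it through unchanged. A minimal sketch with made-up values:

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(strain = c("s1", "s2"), mean_yfp = c(10, 30), mean_rfp = c(5, 10))

# print() displays the tibble and returns it,
# so the assignment still receives the transformed data
result <- toy %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  print()

names(result)
```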
data <- read_tsv("data/example_dataset_1.tsv")
data %>%
  filter(kozak_region == "A")
data %>%
  filter(kozak_region == "A", insert_sequence == "10×AGA")
data %>%
  filter(kozak_region == "A") %>%
  filter(insert_sequence == "10×AGA")
data %>%
  arrange(mean_ratio)
mutate
data <- read_tsv("data/example_dataset_2.tsv") %>%
  print()
data <- data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  print()
Use mutate to modify existing columns
data %>%
  mutate(mean_ratio = round(mean_ratio, 2))
join functions
Variants: inner_join, left_join, right_join, full_join
See https://dplyr.tidyverse.org/reference/mutate-joins.html
annotations <- read_tsv("data/example_dataset_3.tsv")
annotations
data %>%
  inner_join(annotations, by = "strain")
data %>%
  left_join(annotations, by = "strain")
data %>%
  right_join(annotations, by = "strain")
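The join variants differ only in which unmatched rows they keep. The sketch below shows this on two made-up tables, `measurements` and `annots` (hypothetical names, not the lecture datasets), where one strain lacks an annotation and another lacks a measurement.

```r
library(dplyr)

# Made-up tables; s3 has no annotation and s4 has no measurement
measurements <- tibble(strain = c("s1", "s2", "s3"), mean_ratio = c(0.5, 1.25, 2))
annots <- tibble(strain = c("s1", "s2", "s4"), kozak_region = c("A", "B", "C"))

# Keep only strains present in both tables
inner <- inner_join(measurements, annots, by = "strain")
# Keep all rows of the left table; missing annotations become NA
left <- left_join(measurements, annots, by = "strain")
# Keep all rows of the right table; missing measurements become NA
right <- right_join(measurements, annots, by = "strain")
# Keep all rows from both tables
full <- full_join(measurements, annots, by = "strain")

inner$strain
left$strain
right$strain
full$strain
```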
dplyr and ggplot2 functions
But remember to use ` %>% ` in dplyr vs ` + ` in ggplot2!
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio,
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()
stringr functions to manipulate string columns
All functions are named nicely and begin with str_. I find them easier to use than the equivalent Python regular expression functions.
See https://stringr.tidyverse.org/reference/index.html
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  mutate(codon = str_extract(insert_sequence, "[A-Z]{3}$"))
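A few other str_ functions, sketched on a plain character vector so that they can be tried outside of mutate. The strings are made up in the style of the insert_sequence column.

```r
library(stringr)

# Made-up insert names in the style of the lecture data
insert_sequence <- c("10×AGA", "10×AAG", "8×CCG")

# Last three uppercase letters of each string
codon <- str_extract(insert_sequence, "[A-Z]{3}$")
# Does the string contain "AGA"?
has_aga <- str_detect(insert_sequence, "AGA")
# Replace the multiplication sign with a plain "x"
ascii_name <- str_replace(insert_sequence, "×", "x")

codon
has_aga
ascii_name
```

Each function takes the string vector as its first argument, which is why they slot easily into mutate and ` %>% ` chains.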
(20 min)
Google for log2 R to find the appropriate function
data <- read_tsv("data/example_dataset_2.tsv")
strain column into a new column and sort numerically by strain number
Extract the strain number using a stringr function.
Google for character to integer R to find appropriate function to use in mutate.
Then sort.
annotations <- read_tsv("data/example_dataset_3.tsv")
annotations
kozak_region but sorted by strain number
This requires a bit more reading and discussion, but it is a good example of how to learn new tidyverse functions on your own!
Use the fct_reorder function from the forcats package, in a mutate step, to sort kozak_region by the strain number you created above, and then feed the result into ggplot.
data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio,
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()
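As a hint for how fct_reorder behaves, here is a minimal sketch on made-up data that is unrelated to the exercise. The vectors `size` and `rank` are hypothetical.

```r
library(forcats)

# Made-up example unrelated to the lecture data
size <- c("medium", "small", "large")
rank <- c(2, 1, 3)

# Default factor levels are alphabetical
default_levels <- levels(factor(size))

# fct_reorder orders the levels by the values of a second variable
size_ordered <- fct_reorder(size, rank)
reordered_levels <- levels(size_ordered)

default_levels
reordered_levels
```

ggplot2 places a discrete axis in the order of the factor levels, so reordering the levels changes the order along the axis.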