tfcb_2021

Lecture 12: Working with tabular data using `R` and `tidyverse`

Nov 9, 2021

To interactively work with the code below, open lecture12.ipynb in VSCode. Make sure to select the kernel for R so that you can execute R code. You should have already set this up following the software installation instructions here.

R is the second programming language, after Python, that we will learn in this course. We will use R over the next 5 lectures.

R is particularly well suited for reading, manipulating, and visualizing data in tabular and biological sequence formats. Many statistical tests are also available out of the box in R.

While “base” R is used widely, I almost exclusively use R for its two excellent package collections:

Tidyverse - suited for tabular data
Bioconductor - suited for biology-aware analyses

Today we will learn a few basic functions from tidyverse for working with tabular data.

Unlike pandas, which is a single package with a lot of functionality, tidyverse is a collection of packages that each focus on a specific task.

ggplot2 - for plotting data
dplyr - for filtering, aggregating, and transforming data
readr - for reading and writing data
tidyr - for cleaning and transforming data
stringr - for manipulating strings
purrr - for manipulating lists of R objects
forcats - for manipulating categorical data

You can load all the above packages in one go:

library(tidyverse)

Reading data

The readr package provides various functions for reading and writing data.

data <- read_tsv("data/example_dataset_1.tsv")

data

The tabular data structure is called a tibble in tidyverse, and is a souped-up version of the data.frame R data structure with additional nice features.
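As a minimal sketch of both points (readr reads and writes, and the result is a tibble), the example below writes a small table to a temporary file and reads it back. The table `toy` and its values are made up for illustration and are not part of the lecture data.

```r
library(tidyverse)

# Hypothetical toy table; the values are made up for illustration
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(0.5, 1.2, 2.0)
)

# Write to a temporary TSV file and read it back with readr
toy_file <- tempfile(fileext = ".tsv")
write_tsv(toy, toy_file)
toy_read <- read_tsv(toy_file, show_col_types = FALSE)

# The result is a tibble, and a tibble is also a data.frame,
# so functions that expect a data.frame still work on it
is_tibble(toy_read)      # TRUE
is.data.frame(toy_read)  # TRUE
```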

For top-level assignment, the ` <- ` operator is equivalent to the ` = ` operator, and the two can be used interchangeably. However, ` <- ` is the more conventional choice in R code.
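The two operators behave the same for assignments at the top level of a script. The small base R sketch below (variable names are made up for illustration) also shows one place where they differ: inside a function call, ` = ` names an argument rather than assigning a variable.

```r
# Both forms assign at the top level
x <- 5
y = 5
identical(x, y)  # TRUE

# Inside a function call, `=` names the argument `x` of mean();
# it does not change the variable `x` defined above
m <- mean(x = c(1, 2, 3))
m  # 2
x  # still 5
```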

Plotting data

ggplot(data, aes(x = kozak_region, y = mean_ratio)) +
  geom_point()

Anatomy of a ggplot2 plot

Begins with the ggplot function, which takes a tibble as its first argument.
aes specifies which columns are mapped to x, y, color, and other visual properties.
A geom function, such as geom_point, specifies the type of plot.
+ adds additional layers to the plot.
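The pieces listed above can be sketched on a small made-up table. The tibble `toy` is hypothetical and only reuses the column names of the lecture dataset; `group = 1` is added here so that geom_line connects points across a categorical X axis.

```r
library(tidyverse)

# Hypothetical toy data; values are made up for illustration
toy <- tibble(
  kozak_region = c("A", "B", "C"),
  mean_ratio = c(0.5, 1.2, 2.0)
)

p <- ggplot(toy,                                      # tibble comes first
            aes(x = kozak_region, y = mean_ratio)) +  # columns to plot
  geom_point() +                                      # first layer: points
  geom_line(aes(group = 1))                           # second layer: one line through all points
```

Evaluating `p` on its own line in a notebook cell draws the plot.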

Key differences with Python

No need to put column names in quotes.
Indentation convention is different.

Change size of plot globally

options(repr.plot.width = 5, repr.plot.height = 3)

How do we show multiple experimental variables?

Plotting a point graph with color

ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence)) +
  geom_point()

Plotting a line graph

ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line()

Plotting point and line graphs

ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()

‘Faceting’ – Plotting in multiple panels

options(repr.plot.width = 6, repr.plot.height = 3)

ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 group = insert_sequence)) +
  geom_line() +
  geom_point() +
  facet_grid(~ insert_sequence)

In-class `ggplot2` exercises

(20 min)

1. Make X, Y, legend labels into nice strings

See https://ggplot2.tidyverse.org/reference/labs.html

ggplot(data, aes(x = kozak_region,
                 y = mean_ratio,
                 color = insert_sequence,
                 group = insert_sequence)) +
  geom_line() +
  geom_point()

2. Add title to the above plot

3. Change the plot look to a `classic` theme

See https://ggplot2.tidyverse.org/reference/ggtheme.html

4. Change Y axis to log scale

See https://ggplot2.tidyverse.org/reference/scale_continuous.html

5. Change Y scale to go linearly from 0 to 5

Transforming data

Uses functions from the dplyr package.

data <- read_tsv("data/example_dataset_1.tsv")

data

Select specific columns

select(data, strain, mean_ratio, insert_sequence, kozak_region)

Combine operations using the ` %>% ` operator

data <- read_tsv("data/example_dataset_1.tsv") %>% 
  select(strain, mean_ratio, insert_sequence, kozak_region)

The above is the same as the following:

data <- read_tsv("data/example_dataset_1.tsv") %>% 
  select(., strain, mean_ratio, insert_sequence, kozak_region)

The %>% operator lets you chain different data analysis tasks together and makes the analysis logic easier to understand.
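To see why chaining is easier to read, here is a small sketch comparing nested function calls with the piped form. The tibble `toy` is made up for illustration; `head` is the base R function that keeps the first rows.

```r
library(tidyverse)

# Hypothetical toy table; values are made up for illustration
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(2.0, 0.5, 1.2),
  notes = c("x", "y", "z")
)

# Nested calls have to be read from the inside out
nested <- head(select(toy, strain, mean_ratio), 2)

# The same steps chained with %>% read from top to bottom
piped <- toy %>%
  select(strain, mean_ratio) %>%
  head(2)

identical(nested, piped)  # TRUE
```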

Side note: You can create keyboard shortcuts for ` <- ` and ` %>% ` in VSCode as explained here.

I use Alt + - for ` <- ` and Alt + Shift + m for ` %>% ` following RStudio convention.

You can view the transformed data by using print() as the last step in a chain of commands:

data <- read_tsv("data/example_dataset_1.tsv") %>% 
  select(strain, mean_ratio, insert_sequence, kozak_region) %>% 
  print()

Filter rows

data <- read_tsv("data/example_dataset_1.tsv")

data %>% 
  filter(kozak_region == "A")

data %>%
  filter(kozak_region == "A", insert_sequence == "10×AGA")

data %>%
  filter(kozak_region == "A") %>% 
  filter(insert_sequence == "10×AGA")
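The last two forms above should give the same result: conditions separated by commas are combined with AND, just like consecutive filter steps. A minimal sketch on a made-up table (the rows in `toy` are hypothetical), which also shows `%in%` for matching any of several values:

```r
library(tidyverse)

# Hypothetical toy table; values are made up for illustration
toy <- tribble(
  ~strain, ~kozak_region, ~insert_sequence,
  "s1",    "A",           "10×AGA",
  "s2",    "A",           "10×AAG",
  "s3",    "B",           "10×AGA"
)

# Two conditions in one filter call (combined with AND)
both_at_once <- toy %>%
  filter(kozak_region == "A", insert_sequence == "10×AGA")

# The same two conditions as consecutive filter steps
one_by_one <- toy %>%
  filter(kozak_region == "A") %>%
  filter(insert_sequence == "10×AGA")

identical(both_at_once, one_by_one)  # TRUE

# %in% keeps rows whose value matches any element of a vector
a_or_b <- toy %>%
  filter(kozak_region %in% c("A", "B"))
```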

Arrange (sort) rows in a specific order

data %>%
  arrange(mean_ratio)
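By default arrange sorts in ascending order. As a short sketch on made-up data (the tibble `toy` is hypothetical), wrapping a column in dplyr's `desc` reverses the order:

```r
library(tidyverse)

# Hypothetical toy table; values are made up for illustration
toy <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(1.2, 0.5, 2.0)
)

# Smallest mean_ratio first
ascending <- toy %>%
  arrange(mean_ratio)

# Largest mean_ratio first
descending <- toy %>%
  arrange(desc(mean_ratio))

ascending$strain   # "s2" "s1" "s3"
descending$strain  # "s3" "s1" "s2"
```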

Create new columns using `mutate`

data <- read_tsv("data/example_dataset_2.tsv") %>% 
  print()

data <- data %>%
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  print()

Use mutate to modify existing columns

data %>%
  mutate(mean_ratio = round(mean_ratio, 2))
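Two details of mutate are worth a small sketch: a column created in a mutate call can be used by later expressions in the same call, and mutate returns a new tibble rather than changing the original, which is why the result is assigned back to `data` above. The tibble `toy` and the column `ratio_percent` are made up for illustration.

```r
library(tidyverse)

# Hypothetical toy table; values are made up for illustration
toy <- tibble(
  strain = c("s1", "s2"),
  mean_yfp = c(100, 300),
  mean_rfp = c(50, 100)
)

toy_ratio <- toy %>%
  mutate(
    mean_ratio = mean_yfp / mean_rfp,  # new column
    ratio_percent = mean_ratio * 100   # uses the column created one line above
  )

toy_ratio$mean_ratio     # 2 3
toy_ratio$ratio_percent  # 200 300

# The original tibble is unchanged
names(toy)  # "strain" "mean_yfp" "mean_rfp"
```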

Combine tables using `join` functions

Variants: inner_join, left_join, right_join, full_join

See https://dplyr.tidyverse.org/reference/mutate-joins.html

annotations <- read_tsv("data/example_dataset_3.tsv")

annotations

data %>% 
  inner_join(annotations, by = "strain")

data %>% 
  left_join(annotations, by = "strain")

data %>% 
  right_join(annotations, by = "strain")
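The join variants differ in which unmatched rows they keep. A minimal sketch with two made-up tables (`toy_data` and `toy_annotations` are hypothetical) that share only some strains:

```r
library(tidyverse)

# Hypothetical toy tables; s1 has no annotation, s4 has no measurement
toy_data <- tibble(
  strain = c("s1", "s2", "s3"),
  mean_ratio = c(0.5, 1.2, 2.0)
)
toy_annotations <- tibble(
  strain = c("s2", "s3", "s4"),
  kozak_region = c("A", "B", "C")
)

# Only strains present in both tables: s2, s3
inner <- inner_join(toy_data, toy_annotations, by = "strain")

# All strains from the left table; kozak_region is NA for s1
left <- left_join(toy_data, toy_annotations, by = "strain")

# All strains from the right table; mean_ratio is NA for s4
right <- right_join(toy_data, toy_annotations, by = "strain")

# All strains from either table: s1, s2, s3, s4
full <- full_join(toy_data, toy_annotations, by = "strain")
```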

You can combine `dplyr` and `ggplot2` functions

But remember to use ` %>% ` in dplyr vs ` + ` in ggplot2!

data %>% 
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio, 
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()

Use `stringr` functions to manipulate string columns

All functions are named nicely and begin with str_. I find them easier to use than the equivalent Python regular expression functions.

See https://stringr.tidyverse.org/reference/index.html

data %>% 
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>% 
  mutate(codon = str_extract(insert_sequence, "[A-Z]{3}$"))
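To see what the pattern above does, here is a small sketch that applies a few stringr functions to a plain character vector. The values in `inserts` are made up for illustration and only mimic the format of the insert_sequence column.

```r
library(tidyverse)

# Hypothetical insert names; values are made up for illustration
inserts <- c("10×AGA", "10×AAG", "8×CCG")

# "[A-Z]{3}$" matches three capital letters at the end of the string
codons <- str_extract(inserts, "[A-Z]{3}$")
codons  # "AGA" "AAG" "CCG"

# "^[0-9]+" matches one or more digits at the start of the string
repeats <- str_extract(inserts, "^[0-9]+")
repeats  # "10" "10" "8"

# str_detect returns TRUE or FALSE for each element
has_aga <- str_detect(inserts, "AGA")
has_aga  # TRUE FALSE FALSE
```

Because these functions take and return vectors, they can be used directly inside mutate.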

In-class data transformation exercises

(20 min)

1. Create log2-transformed YFP/RFP ratio as a new column

Google for log2 R to find the appropriate function

data <- read_tsv("data/example_dataset_2.tsv")

2. Extract strain number from the `strain` column into a new column and sort numerically by strain number

Extract the strain number using a stringr function.

Google for "character to integer R" to find the appropriate function to use in mutate.

Then sort.

annotations <- read_tsv("data/example_dataset_3.tsv")

annotations

3. Plot with the X axis as `kozak_region` but sorted by strain number

This requires a bit more reading and discussion, but it is a good example of how to learn new tidyverse functions on your own!

Use the fct_reorder function from the forcats package in a mutate step to sort kozak_region by the strain number you created above, and then feed the result into ggplot.

data %>% 
  mutate(mean_ratio = mean_yfp / mean_rfp) %>%
  left_join(annotations, by = "strain") %>%
  ggplot(aes(x = kozak_region, y = mean_ratio, 
             color = insert_sequence, group = insert_sequence)) +
  geom_line() +
  geom_point()