Chapter 2 RNA-seq to DESeq2

This chapter summarized our RNA-seq data preprocessing, read counts, and DESeq2 differential analysis. All the human, mouse, and canine models were using the same workflow.

Code blow are showing the workflow on the R workspace.

2.1 RNA-seq preprocessing

Low quality reads were filtered (based on the filter flag embedded in the sequence ID) prior to alignment to genome canine (canFam3), mouse (mm10) and human (hg38) genome by Tophat/2.1.0 and Bowtie/2.2.3. Gene features were collected from Ensembl database and converted to Bioconductor TxDb annotation packages; reads were counted toward gene features using the summarizeOverlaps function from Bioconductor’s GenomicAlignments package (mode=IntersectionStrict).

2.2 Reads counts

Code below presents an example of counting reads that completely overlap with exons of gene features (exons) using the summarizeOverlaps function - the count matrix, metadata, and gene features are stored into an object called SummarizedExperiments.

2.2.1 Define gene features

Example:

suppressPackageStartupMessages(library(DESeq2))
library(TxDb.Cfamiliaris.UCSC.canFam3.ensGene) # custom TxDb package
suppressPackageStartupMessages(library(GenomicAlignments))
suppressPackageStartupMessages(library(org.Cf.eg.db)) # Biocondutor annotation package for Canis familiaris
suppressPackageStartupMessages(library(BiocParallel))
mparam <- MulticoreParam(workers = 4L, type = "SOCK")
register(mparam, default = TRUE)
# (1) define gene features: extract exons and keep ones from standard chromosome
ens_exons <- exonsBy(TxDb.Cfamiliaris.UCSC.canFam3.ensGene, by="gene")
ens_exons <- keepStandardChromosomes(ens_exons, pruning.mode="fine")
# make sure that the sequence level is UCSC, not NCBI 
seqlevelsStyle(ens_exons)

2.2.2 Read counts

Example:

# (2) bamFiles -> summarizeOverlaps
# define bamFiles (an example)
pkgDir <- "~/tapscott/RNA-Seq/canFam3.DuxFamily"
dataDir <- file.path(pkgDir, "data")
ngsDir <- "/shared/ngs/illumina/atyler/170208_SN367_0857_AHF37NBCXY"
bamDir <- file.path(ngsDir, "tophat", "bam")
bamFiles <- list.files(bamDir, pattern="\\.bam$", full.names=TRUE)
bamFiles <- bamFiles[c(1, 5, 6, 7, 8, 9, 10, 11, 12, 2, 3, 4)]

# summarizeOverlaps -> SummerizeExpriment instance
canine.ens.SE <- 
  GenomicAlignments::summarizeOverlaps(features=ens_exons,
                                       reads=bamFiles,
                                       mode="IntersectionStrict",
                                       inter.feature=TRUE,
                                       singleEnd=TRUE,
                                       ignore.strand=TRUE,
                                       BPPARAM=mparam)

# clean up metadata and column names
colnames(canine.ens.SE) <- sub(".bam", "", colnames(canine.ens.SE))
canine.ens.SE$nick_name  <- 
pheno_type <- sapply(strsplit(basename(bamFiles), "_"), "[[", 2)
canine.ens.SE$pheno_type <-  
  factor(pheno_type, 
         levels=c("luciferase", "hDUX4CA", "cDUXCintac", "cDUXCinter"))
canine.ens.SE$nick_name <- 
  factor(c(rep("LinC", 3), rep("HinC", 3),
           rep("CinC", 3), rep("CALTinC", 3)), 
         levels=c("LinC", "HinC", "CinC", "CALTinC"))

2.3 DESeq2

We used stardard DESeq2 analysis to find genes affected by the transcription factors. Here we give an example using the canine.ens.SE, comparing DUXC (CinC) with Luciferase (LinC) samples:

# example: DUXC vs. luciferase
sub <- canine.ens.SE[, canine.ens.SE$nick_name %in% c("CinC", "LinC")]
dds <- DESeqDataSet(sub, design = ~ nickname)
dds <- DESeq(dds)

Below is the list of the DESeqDataSet instances, product of DESeq2, available in the data folder.

├── CALTinC.ens.dds.rda: DUXC-ALT transcriptome in canine skeletal muscle
├── CinC.ens.dds.rda: DUXC transcriptome in canine skeletal muscle
├── HinC.ens.dds.rda: DUX4 transcriptome in canine skeletal muscle
├── C2C12.ens.ddsl2.rda: DUX4 and Dux transcriptome in mouse myoblast
├── HinH.ens.dds.rda: DUX4 transcriptome in human myoblast