Chapter 2 Preprocessing and gene counts

The preprocessing pipeline for RNA-seq samples is as the following, Wang 2019(Wang and al. 2019):

  1. filter unqualified reads
  2. trim Illumina universal adapter by Trimmomatic-0.32
  3. align read to genome build hg38 using Tophat-2.1/Bowtie2-2.2.6
  4. profile gene expression by counting aligned reads to features annotated by Gencode v24 (R/Bioconductor)

For step 4, we convert Gencode v24 annotation to a Bioconductor TxDb package. To perform feature counts, we use Bioconductor function GenomicAlignments::summarizeOverlap(), a counterpart to a python package htseq-count. The count mode is set to IntersectionStrict in which restricts read count to those that fall completely within the range of exons. Ambiguous reads straddling multiple features are not counted. See below for the shell (step 1-3) and R scripts (step 4). The processed dataset (gene exrepssion matrix) is formated as a SummarizedExperiment and DESeqDataSet instances and saved in the data directory. The scripts and codes are available in the inst/scripts directories on the master branch. Below are code chunks showing the parameter setting for pre-processing and gene counts.

2.1 Shell script for pre-processing

The shell script for filering, trimming, fastqc, and alignment (step 1 to 3).

#!/bin/bash
#./do_tophat.sh
#SBATCH -n6 -t 1-0 -p campus --mail-type=ALL --mail-user=cwon2@fhcrc.org -A tapscott_s

sampleDir=$1
sampleName=$2
pkgDir=$3
tophatOutDir=$pkgDir/tophat
trimmedDir=$pkgDir/trimmed
fastqcDir=$pkgDir/fastqc
filteredDir=$pkgDir/filtered

echo $sampleName

bowtieVersion=/home/solexa/apps/bowtie/bowtie2-2.2.6
tophatVersion=/home/solexa/apps/tophat/tophat-2.1.0.Linux_x86_64
gtfFile=/shared/biodata/ngs/Reference/iGenomes/Homo_sapiens/UCSC/hg38/Annotation/Archives/archive-2015-08-14-08-18-15/Genes/genes.gtf
genomeBuild=/shared/biodata/ngs/Reference/iGenomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
maxIntronLength=500000
innerDist=80
segmentLength=25
libraryType=fr-unstranded

#umask 0002
export PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/moab/bin:$bowtieVersion:$tophatVersion:/home/solexa/apps/samtools/samtools-0.1.19:/home/solexa/apps/FastQC:/home/solexa/apps/trim_galore:/app/cutadapt/1.1/bin

#remove /app/bin from the PATH, as it will default to running bowtie 0.12.8
export PATH=${PATH/\/app\/bin:}

#define some directories
mkdir -p $filteredDir
filterPerSample=$filteredDir/$sampleName
mkdir $filterPerSample
mkdir -p $fastqcDir
mkdir -p $trimmedDir
mkdir -p $tophatOutDir
bamDir=$tophatOutDir/bam
mkdir -p $bamDir
tmpDir=$tophatOutDir/$sampleName
mkdir -p $tmpDir

# (1) by pass filter
cd $sampleDir
for i in *fastq.gz
    do
        i2=${i//.gz/}
        zgrep -A 3 '^@.*[^:]*:N:[^:]*:' $i | zgrep -v '^\-\-$' > $filterPerSample/#$i2
done
gzip -r $filterPerSample

# (2) zcat all the fastq.gz 
cd $filterPerSample
fqFile=$trimmedDir/$sampleName.filterd.fastq
zcat $(ls *.fastq.gz) > $fqFile

# (3) trim adapter and tail (TruSeq3-SE.fa is saved at $trimmedDir)
cd $trimmedDir
trimFq=$trimmedDir/$sampleName.trim.fastq
java -jar /home/solexa/apps/trimmomatic/Trimmomatic-0.32/trimmomatic-0.32.jar SE -threads 6 $fqFile $trimFq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 TRAILING:3 MINLEN:36 
rm $fqFile

# (5) fastqc again
fastqc -t 6 $trimFq -o $fastqcDir --casava
touch $fastqcDir/$sampleName\_fastqcDone.txt

# (6) tophat alignment
tophat --mate-inner-dist $innerDist --num-threads 4 -G $gtfFile --library-type $libraryType -I $maxIntronLength --segment-length $segmentLength --no-coverage-search -o $tmpDir $genomeBuild $trimFq

mv $tmpDir/accepted_hits.bam $bamDir/$sampleName.bam

# sort bam
cd $bamDir
samtools sort -@ 4 $sampleName.bam $sampleName.bam.sorted
mv $sampleName.bam.sorted.bam $sampleName.bam
samtools index $sampleName.bam
touch $sampleName.tophatDone.txt
exit 0

2.2 Annotation TxDb package

The TxDb class is a container for storing transcript annotation in Bioconductor. It is handy to bulid a TxDb package for the smoothness for profiling gene expression and the downstream analysis. There are several ways to make the package. Here we show a couple ways of retriving the GTF file and convert it to TxDb package.

  1. Download GTF file from Gencode website directly. Supppose gtf_fie is the downloaded GENCODE v24 annotation GTF file, we can (1) use GenomicFeatures::makeTxDbFromGFF() to convert the GTF file to a TxDb instance and then (2) GenomicFeatures::makeTxDbPackage() to build a TxDb package. Code chunk is an example:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file=gtf_file,
                        format="gtf",
                        dataSource="ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human",
                        organism="Homo sapiens")
genome(txdb) <- "hg38"
pkgName <- "hg38.HomoSapiens.Gencode.v24"
makeTxDbPackage(txdb, version="1.0.0", author="Chao-Jen Wong",
                pkgname=pkgName,
                maintainer="Chao-Jen Wong <cwon2@fredhutch.org>")
  1. Code chuck below is an example of how to use the AnnotationHub (Morgan 2018) package to download GTF files (Gencode or Ensembl) as a GRanges instance, convert to a TxDb instance and build then a TxDb package.
library(AnnotationHub)
ah <- AnnotationHub()
#' GENCODE v23
query(ah, c("GENCODE", "v23")) 
# query(ah, c("GRCh38", "ensembl", "92")) # or search for ensembl v92
gr <- ah[["AH49555"]] # download GTF GENCODE v23 as an GRanges instance
meta_data <- data.frame(name=c("organism", "source", "genome"), 
                        value=c("Homo sapiens", "Genocode v23", "hg38"))
txdb <- makeTxDbFromGRanges(gr=gr, metadata=meta_data)
pkgName <- "hg38.HomoSapiens.Gencode.v23"
makeTxDbPackage(txdb, version="1.0.0", author="Chao-Jen Wong",
                pkgname=pkgName,
                maintainer="Chao-Jen Wong <cwon2@fredhutch.org>")

2.3 R Script for gene counts

Once the TxDb package is build, we use GenomicAlignments::summarizeOverlaps() to count reads that overlap with features annotated by a TxDb package. Suppose file_bam is our bam files and hg38.HomoSapiens.Gencode.v24 is out customized TxDb package, we can build a summarizedExperiment instance containing gene counts as well as the feature annotation:

library(hg38.HomoSapiens.Gencode.v24)
library(org.Hs.eg.db)
library(BiocParallel)
library(GenomicAlignments)
mparam <- MulticoreParam(workers = 2, type = "SOCK")
features <- GenomicFeatures::exonsBy(hg38.HomoSapiens.Gencode.v24, by="gene")
# only keep standard chromosomes
features <- keepStandardChromosomes(features,
                                    species = "Homo_sapiens",
                                    pruning.mode="fine")
se <- GenomicAlignments::summarizeOverlaps(features=features,
                                           reads=file_bam,
                                           mode="IntersectionStrict",
                                           inter.feature=TRUE,
                                           singleEnd=TRUE,
                                           ignore.strand=TRUE,
                                           BPPARAM=mparam)

References

Morgan, Martin. 2018. AnnotationHub: Client to Access Annotationhub Resources.

Wang, Leo H., and et al. 2019. “MRI-Informed Muscle Biopsies Correlate Mri with Pathology and Dux4 Target Gene Expression in Fshd.” Human Molecular Genetics. doi:10.1093/hmg/dd7364.