2 Preprocessing Ribosome footprints sequencing data

Figure 2.1: Flowchart of Ribo-seq data preprocessing

The script provided performs the preprocessing steps of our ribo-seq data. The following sections outline the preprocessing steps included in the shell script.

Software:

FastQC/0.11.9-Java-11
cutadapt/2.7-GCCcore-8.3.0-Python-3.7.4
Bowtie2/2.4.1-GCCcore-8.3.0
SAMtools/1.10-GCCcore-8.3.0
STAR/2.7.6a-foss-2019b

2.1 Adapter trimming

Trimming forward 3’ adapter TGGAATTCTCGGGTGCCAAGG and removing 4 bases from the front and end after trimming:

# for forward read, trim 3' adapter TGGAATTCTCGGGTGCCAAGG 
cutadapt --cores=4 -O 13 -a N{4}TGGAATTCTCGGGTGCCAAGG \
  -u 4 -m 25 \
  --untrimmed-output ${sample_name}_R1_untrimmed.fastq.gz \ 
  -o ${sample_name}_R1_trimmed.fastq.gz \
  $fq1 > ${sample_name}_cutadapt_log.txt
  
fq=$trim_dir/${sample_name}_R1_trimmed.fastq.gz

The -O 13 is the minimum overlap for an adapter match, where the 13 is computed as 9 plus 4 (where 9 is the minimum overlap and 4 is the length of the unknown section). If you do not specify it, the adapter sequence would match the end of every read (because N matches anything), and ten bases would then be removed from every read.

This step yielded reads with 25 - 29 nt length.

2.2 Remove rRNA and non-rRNA contamination using Bowtie2

We first used RNA central database to identify rRNA sequences and built an rRNA reference genome rRNA_ref. Second, we used GFF of Gencode v35 to extract small RNA (tRNA, snoRNA, sRNA, microRNA, etc) sequences and customized a reference genome rsoRNA_ref. We then used bowtie2 to align the trimmed reads against the rRNA reference genome. Next, we took the un-mapped and aligned against the small RNA reference sequence. The final un-mapped were the reads of interest.

# rRNA filter
bowtie2 --seed=123 --threads=32 -x $rRNA_ref -U $fq -S ${sample_name}_R1.rRNA.sam \
  --un-gz=${sample_name}_R1_rFiltered.fq.gz \
  > ${sample_name}_R1_rRNA.bowtie2.log 2> ${sample_name}_R1_rRNA.bowtie2.log2
# tRNA and snoRNA filter
bowtie2 --seed=123 --threads=32 -x $tsoRNA_ref \
  -U ${sample_name}_R1_rFiltered.fq.gz \
  -S ${sample_name}_R1.tsoRNA.sam \
  --un-gz=${sample_name}_R1_rtsFiltered.fq.gz \
  > ${sample_name}_R1_tsoRNA.bowtie2.log 2>   ${sample_name}_R1_tsoRNA.bowtie2.log2

2.3 Alignment

We took the final un-mapped reads from previous step and aligned against GRCh38.p13 by STAR:

STAR --runThreadN 4 --outSAMattributes NH HI NM MD AS nM jM jI XS \
  --genomeDir $star_index \
  --readFilesIn $filtered_dir/${sample_name}_R1_rtsFiltered.fq.gz \
  --outFileNamePrefix $bam_dir/${sample_name}_ \
  --sjdbOverhang 25 \
  --outFilterScoreMinOverLread 0 \ 
  --outFilterMatchNminOverLread 0 \
  --outFilterMatchNmin 20 \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate

samtools index $bam_dir/${sample_name}_Aligned.sortedByCoord.out.bam

See next chapter for Ribo-seq specific quality control.

1 About

3 Ribosome footprints quality control