1 About
The objective of this book is to advocate for computational transparency and reproducibility of our FSHD bilateral cohort study, titled “Validation study of the association of MRI and FSHD gene signature reveals markers of whole muscle and systemic disease progression.” This book includes detailed explanations of our decision-making processes grounded in bioinformatics analysis, machine learning, and statistical inferences along with reproducible codes (in R) that generate results and figures on the fly.
In this FSHD bilateral cohort study, we recruited 34 FSHD patients and obtained their left and right tibialis anteria (TA) muscle biopsies. On the biopsies, we performed MRI characterization, histopathological scoring, RNA extraction for RNA sequencing, and DNA extraction for bisulfite sequencing of the 4qA permissive allele in the region of exon 2-3. Through the metrics yield by these processes, we studied the following topics:
- Transcript-based assessment of muscle cell type content of the FSHD biopsied muscle
- FSHD molecular signatures, including DUX4, extracurricular matrix, inflammatory, complement activation, and immunoglobulins
- ML classification based on the FSHD molecular signatures
- Identification of six baskets containing genes that exhibit specific signatures associated with FSHD
- Association of MRI characteristics with signatures of DUX4 expression
- Bilateral comparison analysis
- Verification of immune cell infiltrate signatures and immune cell type proportions
1.1 Software
- R (4.2.2) and packages from the Bioconductor (3.17) and Tidyverse projects. Most used packages include
GenomicAlignments
,GenomicFeatures
,DESeq2
,dplyr
,ggplot2
,caret
, and genomic-related BioC packages for annotation. -
fastqc
,Subread
,samTools
, andcutadapt
for RNA-seq data preprocessing, andPLIER
for immune cell type proportions.
1.2 Repos structure
The section outlines the structure of our repository, which is available at here.
\data:
|- longitudinal_dds.rda: a DESeqDataSet instance include the
longitudinal RNA-seq gene counts, TPM, and published metadata
including ML-based classification labels, MRI, histopathology
scores, and clinical data
|- bilateral_dds.rda: a DESeqDataSet instance; bilateral RNA-seq gene
counts, TPM, and metadata including ML-based classification labels
and published MRI and clinical data
|- comprehensive_df.rda: a data.frame instance obtaining the biletarl
cohort's clinical, MRI, pathology, FSHD molecular signature scores,
and DNA methylation data
|- all_baskets.rda: basekts of genes representing FSHD disease
signatures (DUX4-M, DUX4-M6, DUX4-M12, ECM, Inflamm, Complement,
and IG baskets)
|- bilat_MLpredict.rda: categorization of the bilateral biopsy
samples by machine learning models (random forest and
KNN) trained by the longitudinal cohort and the signature
genes in the baskets
cohort
\docs: *.html, the pages rendered by *.Rmd files in the \gitbook
directory
\gitbook: *.Rmd, source code of this book
\scripts: *.R; un-orgnaized R code of our initial data
exploration and bioinformatics analysis
\extdata: *.xlsx; extra files and supplemental tables
1.3 Annotation
The gene annotation for our RNA-seq analysis is based on Gencode version 35. To facilitate gene counts using Bioconductor packages, we customized a Bioconductor TxDB
package named hg38.HomoSapiens.Gencode.v35
. See Appendix A on how to build Bioconductor TxDB
or EsbDB
annotation packages.
1.4 RNA-seq preprocessing and gene counts
The pre-preprocessing of RNA-seq data, we filtered out unqualified raw reads, trimmed the Illumina universal adapters using Trimmomatic, and aligned the remaining reads to GRCh38.p13 with Rsubread.
The code chunk below provides an example of how we used GenomicAlignments
, BiocParallel
, and a customized TxDB
package (hg38.HomoSapiens.Gencode.v35
) to count reads and generate a RangedSummarizedExperiment
object. We used GenomicAlignments::summarizeOveralsp()
to count the reads. With Intersectionstrict
mode that we only count reads that are completely contained within the range of exons, and ignore any ambiguous reads that straddle different gene features.
-
sort_files
parameter below indicates the location of the sorted, indexed BAM files used in our analysis
# sort_files gives the location of our bam files
sort_files <- list.files(scratch_dir, full.name=TRUE, pattern="\\.bam$")
library(GenomicFeatures)
library(GenomicAlignments)
library(Rsamtools)
library(BiocParallel)
library(hg38.HomoSapiens.Gencode.v35)
bp_param <- BiocParallel::MulticoreParam(workers=12L)
register(bp_param)
features <- GenomicFeatures::exonsBy(hg38.HomoSapiens.Gencode.v35, by="gene")
features <- keepStandardChromosomes(features,
species = "Homo_sapiens",
pruning.mode="fine")
rse <-
GenomicAlignments::summarizeOverlaps(features=features,
reads =
Rsamtools::BamFileList(sort_file),
mode = "IntersectionStrict",
ignore.strand=TRUE,
singleEnd=TRUE,
inter.feature=TRUE,
BPPARAM=bp_param)