9 Immune cell type proportion by PLIER

The PLIER package1 provides an alternative to examine the immune cell infiltration, without involving in GO or differential analysis as we have investigated in @ref{immune-cell-infiltrates}. PLIER employes eigenvalue-like decomposition to estimate the immune cell contribution to the FSHD transcriptome. The top loading variables from the resulting eigenvalue-like latent variables (LV) are potential indicators of immune infiltrate signatures. We utilized the PLIER algorithm for analyzing both longitudinal and bilateral samples. The expression levels of the most influential loading variables are depicted in this chapter.

# define parameters and load datasets: bilat_dds and longitudinal_dds
pkg_dir <- "/Users/cwon2/CompBio/Wellstone_BiLateral_Biopsy"
fig_dir <- file.path(pkg_dir, "figures", "immune-infiltration")
source(file.path(pkg_dir, "scripts", "load_variables_and_datasets.R"))
# returns bilat_dds and longitudinal_dds

We ultitized the IRIS dataset2 from PLIER as the baseline of immnune cell types and markers.

library(PLIER)
data(bloodCellMarkersIRISDMAP)
# Extract the IRIS dataset from `bloodCellMarkersIRISDMAP`
blood_iris <- as.data.frame(bloodCellMarkersIRISDMAP) %>%
  rownames_to_column(var="gene_name") %>%
  dplyr::select(gene_name, starts_with("IRIS")) %>%
  dplyr::filter(rowSums(across(where(is.numeric))) > 0) 

bilat_coldata <- colData(bilat_dds) %>% as.data.frame()
longi_coldata <- colData(longitudinal_dds) %>% as.data.frame()

9.1 PLIER Longitudinal samples

In this section, we applied PLIER to estimate the immune cell type proportion for the longitudiname samples.

9.1.1 Steps

  1. run PLIER, the priorMat only includes the IRIS dataset (blood_iris)
  2. Select top LVs with AUC > 0.8 and FDR < 0.05
  3. Verify the top LVs of elevated proportion in FSHD and filter out the false positives
  4. Identify top loading varibles associated with the selected top LVs
# use RPKM as suggested by the PLIER authers; TPM yields similar results
longitudinal_rpkm <- assays(longitudinal_dds)[["RPKM"]]
longitudinal_tpm <- assays(longitudinal_dds)[["TPM"]]
rownames(longitudinal_rpkm) <- 
  rownames(longitudinal_tpm) <- rowData(longitudinal_dds)$gene_name

priorMat <- 
  bloodCellMarkersIRISDMAP[, names(blood_iris)[-1]] # remove col "gene_name" 
plier_longi <- PLIER(longitudinal_rpkm, # or longitudinal_rpkm
                     priorMat=priorMat,
                     scale = TRUE)
#> [1] 24.90828
#> [1] "L2 is set to 24.908276840206"
#> [1] "L1 is set to 12.454138420103"
mat_u <- plotU(plier_longi, auc.cutoff = 0.70, fdr.cutoff = 0.05, top = 3)
Significant LVs with AUC > 0.8 and FDR < 0.05.

Figure 9.1: Significant LVs with AUC > 0.8 and FDR < 0.05.

sig_sets <- names(mat_u$iirow)[mat_u$iirow]

Displayed below is the visualization of eight latent variables (LVs) that potentially capture the immune cell infiltrate signatures in the more affected muscles. It is important to note that there may be some false positives among them.

# proportion boxplots of all the sig LVs 
sig_proportion_longi <- plier_longi$B[mat_u$iicol, ] %>% as.data.frame() %>%
  rownames_to_column(var="LV_index") %>%
  tidyr::gather(key=sample_id, value=proportion, -LV_index) %>%
  left_join(dplyr::select(longi_coldata, sample_id, cluster), by="sample_id")
  
outliers <- sig_proportion_longi %>% dplyr::filter(proportion > 3)

sig_proportion_longi %>%
    dplyr::filter(proportion < 3) %>%
ggplot(aes(x=cluster, y=proportion)) +
  geom_boxplot(width=0.3, outlier.shape = NA) +
  geom_jitter(width=0.3, size=0.3, color="grey50") +
  facet_wrap(~LV_index, scale="free_y", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        axis.title.x = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())
The cell type proportion of all significant LVs

Figure 9.2: The cell type proportion of all significant LVs

9.1.2 Five best LVs showing immune cell infiltrat signatures

Based on the information extracted from the U, top Z, and B matrices, it can be concluded that LV30, LV4, LV5, LV12, and LV46 are the most reliable latent variables indicating heightened immune cell infiltration. Consequently, we have generated plots for our publication showcasing the following latent variables: LV30 (CD4 Tcell-Th1), LV4 and LV12 (Neutrophil resting), LV40 (IgG and IgA B memory cell), and LV5 (dendritic cell).

NOTE: I hide the code chuck for they are tidious, but the reader cand find them in the original Rmd files.

9.1.2.1 LV30,IRIS_CD4Tcell-Th1-restimulated48hour

9.1.2.2 LV4,IRIS_Neutrophil-Resting

9.1.2.3 LV12,IRIS_Neutrophil-Resting

9.1.2.4 LV5,IRIS_DendriticCell-LPSstimulated

9.1.2.5 LV40,IRIS_Bcell-Memory_IgG_IgA

9.1.3 Top loading variables Z of selected LVs

LV_index<- c(4, 5, 12, 30, 40)
names(LV_index) <- rownames(plier_longi$B[LV_index, ])
priorMat <- bloodCellMarkersIRISDMAP[, names(blood_iris)[-1]]

top = 15
sub_iris <- blood_iris %>% 
  dplyr::select(gene_name, `IRIS_CD4Tcell-Th1-restimulated48hour`,
                `IRIS_CD4Tcell-Th2-restimulated48hour`, 
                `IRIS_DendriticCell-Control`,
                `IRIS_DendriticCell-LPSstimulated`,
                `IRIS_CD4Tcell-Th2-restimulated12hour`,
                `IRIS_Bcell-Memory_IgG_IgA`,
                `IRIS_Neutrophil-Resting`)

longi_topZ <- map_dfr(LV_index, function(index) {
  plier_longi$Z[, index, drop=FALSE] %>% as.data.frame() %>%
    arrange(-V1) %>% top_n(top) %>%
    dplyr::select(-V1) %>%
    #dplyr::rename_with(~paste0("LV", as.character(index))) %>%
    rownames_to_column(var="gene_name") %>%
    left_join(sub_iris, by="gene_name") %>% # turn NA to 0
    replace(is.na(.), 0) %>%
    dplyr::mutate(in_markerSet = 
                    if_else(rowSums(across(where(is.numeric))) > 0,
                                    TRUE, FALSE), .after=gene_name)
    # which pathway?
}, .id="LV_index") %>% # append gene_id
  left_join(anno_ens88, by="gene_name") %>%
  left_join(dplyr::select(anno_gencode35, gencode35_id, gene_name), 
            by="gene_name")
longi_topZ %>% group_by(LV_index) %>% 
  summarize(n=sum(in_markerSet)) %>%
  knitr::kable(format = "html", 
               caption="The five most reliable LVs and the number of top loading variables associated with the LVs.")
Table 9.1: The five most reliable LVs and the number of top loading variables associated with the LVs.
LV_index n
12,IRIS_Neutrophil-Resting 7
30,IRIS_CD4Tcell-Th1-restimulated48hour 7
4,IRIS_Neutrophil-Resting 11
40,IRIS_Bcell-Memory_IgG_IgA 13
5,IRIS_DendriticCell-LPSstimulated 4

The code chunk below tidied up the top loading variables and attached annotation.

longi_topZ <- longi_topZ %>%
  dplyr::mutate(cell_type = str_replace(LV_index, "[0-9]*,IRIS_", ""))

longi_markers <- map_dfr(names(LV_index), function(lv) {
  longi_topZ %>% dplyr::filter(LV_index == lv, in_markerSet) %>%
    dplyr::select(LV_index, cell_type, ens_id)
}) %>% dplyr::arrange(cell_type) %>%
  left_join(dplyr::select(anno_ens88, ens88_id, ens_id, gene_name), 
            by="ens_id") %>%
  dplyr::rename(gene_id = ens88_id)

9.1.3.1 Heatmap of top loading variables

Below is the heatmap illustrating the z-scores of the log TPM (transcripts per million) values for the top loading variables associated with the five latent variables (LVs).
Heatmap of the top loading variables associated with the five LVs for longitudinal samples.

Figure 9.3: Heatmap of the top loading variables associated with the five LVs for longitudinal samples.

9.2 PLIER to the bilateral samples

We applied PLIER to the bilateral samples in the same matter. The cut off values for LVs are : AUC = 0.7 and FDR = 0.05

bilat_rpkm <- assays(bilat_dds)[["RPKM"]]
bilat_tpm <- assays(bilat_dds)[["TPM"]]

rownames(bilat_rpkm) <- rownames(bilat_tpm) <- rowData(bilat_dds)$gene_name
plier_bilat <- PLIER(bilat_rpkm, # or longitudinal_rpkm
                     priorMat=priorMat,
                     scale = TRUE)
#> [1] 29.88863
#> [1] "L2 is set to 29.8886299733202"
#> [1] "L1 is set to 14.9443149866601"
save(plier_bilat, file=file.path(pkg_dir, "data", "plier_bliat.rda"))
mat_u <- plotU(plier_bilat, auc.cutoff = 0.7, fdr.cutoff = 0.05, top = 3)
bilat_col_data <- colData(bilat_dds) %>% as.data.frame()

# define sig LV_index: 
LV_index <- c(2, 14, 18, 20, 26, 22)
names(LV_index) <- rownames(plier_bilat$B[LV_index, ])

#plier_bilat$B[LV_index, ] %>% as.data.frame() %>%
#  rownames_to_column(var="LV_index") %>%
#  tidyr::gather(key=sample_id, value=proportion, -LV_index) %>%
#  left_join(dplyr::select(bilat_col_data, sample_id, class), by="sample_id") %>%
#  dplyr::filter(proportion >= 3) %>%
#  knitr::kable(caption="Samples with extreme proportions. Remov ed from the #figure below.")

# boxplot for proportion of potential LVs
plier_bilat$B[mat_u$iicol, ] %>% as.data.frame() %>%
  rownames_to_column(var="LV_index") %>%
  tidyr::gather(key=sample_id, value=proportion, -LV_index) %>%
  left_join(dplyr::select(bilat_col_data, sample_id, class), by="sample_id") %>%
  dplyr::filter(proportion < 3) %>%
  ggplot(aes(x=class, y=proportion)) +
    geom_boxplot(width=0.3, outlier.shape = NA) +
    geom_jitter(width=0.3, size=0.3) +
    facet_wrap(~LV_index, scale="free_y", nrow=3) +
    #labs(title="Remove outliers with proportion > 3 ") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          axis.title.x = element_blank())

9.2.1 Three base LVs demonstrating elevated immune cell proportion

The significant LVs that show elevated immune cell type contribution in FSHD are LV2 (Bcell-Memory_IgG_IgA), LV22 (Neutrophil-resting) and LV20 (DendriticCell-LPSstimulated). LV18 and LV26 do have elevated proportion in moderatedly-affected muscles.

bilat_proportion <- plier_bilat$B[LV_index, ] %>% as.data.frame() %>%
  rownames_to_column(var="LV_index") %>%
  tidyr::gather(key=sample_id, value=proportion, -LV_index) %>%
  left_join(dplyr::select(bilat_col_data, sample_id, class), by="sample_id") 

9.2.1.1 LV2, IRIS_Bcell-Memory_IgG_IgA

9.2.1.2 LV22,IRIS_Neutrophil-Resting

9.2.1.3 LV20,IRIS_DendriticCell-LPSstimulated

9.2.1.4 Top loading variables (Z) of selected LVs

bilat_LV <- c(2, 20, 22)
names(bilat_LV) <- rownames(plier_bilat$B[bilat_LV, ])
priorMat <- bloodCellMarkersIRISDMAP[, names(blood_iris)[-1]]

top = 15
sub_iris <- blood_iris %>% 
  dplyr::select(gene_name, `IRIS_CD4Tcell-Th1-restimulated48hour`,
                `IRIS_CD4Tcell-Th2-restimulated48hour`, 
                `IRIS_Bcell-Memory_IgG_IgA`, `IRIS_Bcell-naive`,
                `IRIS_Bcell-naive`, `IRIS_Bcell-Memory_IgM`,
                `IRIS_Neutrophil-Resting`,
                `IRIS_DendriticCell-Control`,
                `IRIS_DendriticCell-LPSstimulated`)

bilat_topZ <- map_dfr(bilat_LV, function(index) {
  plier_bilat$Z[, index, drop=FALSE] %>% as.data.frame() %>%
    arrange(-V1) %>% top_n(top) %>%
    dplyr::select(-V1) %>%
    #dplyr::rename_with(~paste0("LV", as.character(index))) %>%
    rownames_to_column(var="gene_name") %>%
    left_join(sub_iris, by="gene_name") %>% # turn NA to 0
    replace(is.na(.), 0) %>%
    dplyr::mutate(in_markerSet = 
                    if_else(rowSums(across(where(is.numeric))) > 0,
                                    TRUE, FALSE), .after=gene_name)
}, .id="LV_index") %>% # append gene_id
  left_join(anno_ens88, by="gene_name") %>%
  left_join(dplyr::select(anno_gencode35, gencode35_id, gene_name), 
            by="gene_name")
#top_z <- bilat_topZ
bilat_topZ %>% group_by(LV_index) %>% 
  summarize(n=sum(in_markerSet)) %>%
  knitr::kable(caption="The three best LVs and the number of associated makers.")
Table 9.2: The three best LVs and the number of associated makers.
LV_index n
2,IRIS_Bcell-Memory_IgG_IgA 11
20,IRIS_DendriticCell-LPSstimulated 5
22,IRIS_Neutrophil-Resting 7
#bilat_LV <- c(2, 20, 22)
#names(bilat_LV) <- rownames(plier_bilat$B[bilat_LV, ])
bilat_topZ <- bilat_topZ %>%
    dplyr::mutate(cell_type = str_replace(LV_index, "[0-9]*,IRIS_", ""))

bilat_markers <-  map_dfr(names(bilat_LV), function(lv) {
  bilat_topZ %>% dplyr::filter(LV_index == lv, in_markerSet) %>%
    dplyr::select(LV_index, cell_type, ens_id)
}) %>% dplyr::arrange(cell_type) %>%
   left_join(dplyr::select(anno_gencode35, gencode35_id, ens_id, gene_name), 
            by="ens_id") %>%
    dplyr::rename(gene_id = gencode35_id)
plier_topZ_markers <- list(bilat_markers=bilat_markers,
                           longitudinal_markers = longi_markers)
save(plier_topZ_markers, file=file.path(pkg_dir, "data",
                                        "plier_topZ_markers.rda"))

9.2.1.5 Heatmap

Figure below is a heatmap depicted the row-wise z-score of the \(\log_{10} TPM\) of the top loading variables associated with the three LVs.

Heatmap of 18 top Z (loading variables) in the bilateral samples.

Figure 9.4: Heatmap of 18 top Z (loading variables) in the bilateral samples.