gimap tutorial

gimap performs analysis of dual-targeting CRISPR screening data, with the goal of aiding the identification of genetic interactions (e.g. cooperativity, synthetic lethality) in models of disease and other biological contexts. gimap analyzes functional genomic data generated by the pgPEN (paired guide RNAs for genetic interaction mapping) approach, quantifying growth effects of single and paired gene knockouts upon application of a CRISPR library. A multitude of CRISPR screen types can be used for this analysis, with helpful descriptions found in this review (https://www.nature.com/articles/s43586-021-00093-4). Use of pgPEN and GI-mapping in a paired gRNA format can be found here (https://pubmed.ncbi.nlm.nih.gov/34469736/).

library(gimap)
#> Warning: replacing previous import 'dplyr::group_rows' by
#> 'kableExtra::group_rows' when loading 'gimap'
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

First let’s create a folder we will save files to.

output_dir <- "output_data"
dir.create(output_dir, showWarnings = FALSE)

Data requirements

Let’s examine this example pgPEN counts table. It’s divided into columns containing:

  • id: an ID corresponding to the names of paired guides
  • seq_1: gRNA sequence 1, targeting “paralog A”
  • seq_2: gRNA sequence 2, targeting “paralog B”
  • Day00_RepA: Gene Counts from Day 00 for Replicate A
  • Day05_RepA: Gene Counts from Day 05 for Replicate A
  • Day22_RepA: Gene Counts from Day 22 for Replicate A
  • Day22_RepB: Gene Counts from Day 22 for Replicate B
example_data <- get_example_data("count")
#> Rows: 33170 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): id, seq_1, seq_2
#> dbl (5): Day00_RepA, Day05_RepA, Day22_RepA, Day22_RepB, Day22_RepC
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

The metadata you have may vary slightly from this but you’ll want to make sure you have the essential variables and information regarding how you collected your data.

colnames(example_data)
#> [1] "id"         "seq_1"      "seq_2"      "Day00_RepA" "Day05_RepA"
#> [6] "Day22_RepA" "Day22_RepB" "Day22_RepC"

This count data has …

We also want the construct metadata

example_pg_metadata <- get_example_data("meta")
#> Rows: 33170 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): pgRNA_ID, target1_sgRNA_seq, target2_sgRNA_seq, paralog_pair, targe...
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(example_pg_metadata)
#> [1] "pgRNA_ID"           "target1_sgRNA_seq"  "target2_sgRNA_seq" 
#> [4] "paralog_pair"       "target_type"        "target1"           
#> [7] "target2"            "target1_ensembl_id" "target2_ensembl_id"

And the meta data has …

Setting up data

We’re going to set up three datasets. Two are required; the first is the counts that the genetic interaction analysis will be used for.

The first data set contains the read counts from each sample type. Required for analysis is a Day 0 (or plasmid) sample, and at least one further timepoint sample. The T0 sample, or plasmid sample, will represent the entire library before any type of selection has occurred during the length of the screen. This is the baseline for guide RNA representation. The length of time cells should remain in culture throughout the screen is heavily dependent on the type of selection occurring, helpful advice can be found in (https://www.nature.com/articles/s43586-021-00093-4). QC analysis will follow to correlate replicates if inputted. Comparison of early and late timepoints is possible in this function, but not required if early timepoints were not taken.

example_counts <- example_data %>%
  select(c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC")) %>%
  as.matrix()

The next datasets are metadata that describe the dimensions of the count data. - These both need to be data frames. - The sizes of these metadata must correspond to the dimensions of the counts data.

pg_id are just the unique IDs listed in the same order/sorted the same way as the count data and can be used for mapping between the count data and the metadata.

example_pg_id <- example_data %>%
  dplyr::select("id")

One of these (example_pg_metadata) is required because it is necessary to know the IDs and be able to map them to pgRNA constructs.

example_pg_metadata <- example_data %>%
  select(c("id", "seq_1", "seq_2"))

Sample metadata is the information that describes the samples and is sorted the same order as the columns in the count data.

You need to have two columns in the metadata you provide (they must be named exactly this).

  • col_names - Must match the colnames of the counts data being submitted
  • timepoints - A numeric variable that describes the timepoints for these data.
  • replicates - A factor that describes which columns are replicates of each other. It doesn’t matter what you name these except that this column is a factor that will be used for calculations.
example_sample_metadata <- data.frame(
  col_names = c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC"),
  day = as.numeric(c("0", "5", "22", "22", "22")),
  rep = as.factor(c("RepA", "RepA", "RepA", "RepB", "RepC"))
)

We’ll need to provide example_counts, pg_ids and pg_metadata to setup_data(). We can provide sample_metadata, but it is not required at the moment.

Now let’s setup our data using setup_data(). Optionally we can provide the metadata in this function as well so that it is stored with the data.

gimap_dataset <- setup_data(counts = example_counts,
                            pg_ids = example_pg_id,
                            pg_metadata = example_pg_metadata,
                            sample_metadata = example_sample_metadata)

You’ll notice that this set up gives us a list of formatted data. This contains the original counts we gave setup_data() function but also normalized counts, and the total counts per sample.

  • raw_counts: The original counts data that illustrates the number of cells alive in the sample. This data has samples as the columns and the paired guide constructs as rows.
  • counts_per_sample: Add up all the counts for each sample over all of the paired guide designs.
  • Transformed data: This section contains the various types of normalized and adjusted data made from the raw counts data.
  • count_norm - For each sample, the data is normalized -log10(( counts +1) / total counts for the sample over all the pg designs ))
  • cpm - For each sample this is calculated by taking the counts / total counts for the sample over all the pg designs)*1e6
  • log2cpm: log-2 transformed counts per million this is calculated by log2(cpms + 1)
  • pg_metadata: paired guide metadata - information that describes the paired-guided RNA designs. This may include the sequences used in the CRISPR design as well as what genes are targeted.
  • sample_metadata: Metadata that describes the samples. This likely includes the time point information, replicates, sample IDs, and any other additional information that is needed regarding the experimental setup.
str(gimap_dataset)
#> List of 9
#>  $ raw_counts       : num [1:33170, 1:5] 951 1511 191 588 289 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>  $ counts_per_sample: Named num [1:5] 38471205 32028290 29144511 36628141 36448291
#>   ..- attr(*, "names")= chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>  $ transformed_data :List of 3
#>   ..$ count_norm: num [1:33170, 1:5] 4.61 4.41 5.3 4.82 5.12 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>   ..$ cpm       : num [1:33170, 1:5] 24.72 39.28 4.96 15.28 7.51 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>   ..$ log2_cpm  : num [1:33170, 1:5] 4.68 5.33 2.58 4.03 3.09 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>  $ metadata         :List of 2
#>   ..$ pg_ids         : tibble [33,170 × 1] (S3: tbl_df/tbl/data.frame)
#>   .. ..$ id: chr [1:33170] "AADAC_AADACL2_pg1" "AADAC_AADACL2_pg10" "AADAC_AADACL2_pg11" "AADAC_AADACL2_pg12" ...
#>   ..$ sample_metadata:'data.frame':  5 obs. of  3 variables:
#>   .. ..$ col_names: chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>   .. ..$ day      : num [1:5] 0 5 22 22 22
#>   .. ..$ rep      : Factor w/ 3 levels "RepA","RepB",..: 1 1 1 2 3
#>  $ filtered_data    :List of 5
#>   ..$ filter_step_run       : logi FALSE
#>   ..$ metadata_pg_ids       : NULL
#>   ..$ transformed_log2_cpm  : NULL
#>   ..$ removed_pg_ids        : NULL
#>   ..$ all_reps_zerocount_ids: NULL
#>  $ annotation       : NULL
#>  $ normalized_log_fc: NULL
#>  $ crispr_score     : NULL
#>  $ results          : NULL
#>  - attr(*, "class")= chr [1:2] "list" "gimap_dataset"

Quality Checks

The first step is running some quality checks on our data. The run_qc() function will create a report we can look at to assess this.

The report includes several visualizations of raw/unfiltered data:

  • distribution of normalized counts for each sample. The goal of determining this distribution is to identify pgRNA counts that are low at the start of the screen - before any type of phenotypic or growth selection is occurring, either in the T0 or plasmid sample. These low abundance pgRNAs should be removed from the analysis. [the goal section here doesn’t fit my expectations/understanding of this plot.]
  • histogram of log2cpm values for each individual sample: this helps users identify samples that do not have a normal distribution of reads and inform the upcoming filtering steps.
  • sample correlation heatmap: generates a heatmap using cpm values for each replicate/sample. This heatmap gives an overview on how similar samples are. Replicates should correlate well, and cluster together, while each timepoint sample should be different from the T0. This analysis will also allow users to identify potential sample swaps, if correlation scores between replicates is poor.
  • A histogram that shows the variance within replicates for each pgRNA construct. For each pgRNA construct, the variance among the 3 replicates is found and a distribution is constructed by looking at these variances together.
  • A histogram of the log2 CPM values of pgRNA constructs at the plasmid time point. This graph relates to a plasmid filter that can be applied to the data, but the plot displays all of the data prior to a filter being applied.

This report also includes several visualizations after filters are applied:

There is a filter that flags pgRNA constructs where any of the time points have a count of zero. - We include a bar plot that shows the number of pgRNA constructs which have counts of zero in either 0, 1, 2, or 3 replicates. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

run_qc(gimap_dataset,
       output_file = "example_qc_report.Rmd",
       overwrite = TRUE,
       quiet = TRUE)

You can take a look at an example QC report here.

Filtering the data

After considering the QC report and which filters are appropriate/desired for your data, you can apply filters to the data using the gimap_filter function.

gimap_dataset <- gimap_dataset %>%
  gimap_filter()

Filtered forms of the data can be seen in the $filtered_data entry

str(gimap_dataset$filtered_data)
#> List of 5
#>  $ filter_step_run       : logi TRUE
#>  $ metadata_pg_ids       : tibble [31,817 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ id: chr [1:31817] "AADAC_AADACL2_pg1" "AADAC_AADACL2_pg10" "AADAC_AADACL2_pg11" "AADAC_AADACL2_pg12" ...
#>  $ transformed_log2_cpm  : num [1:31817, 1:5] 4.68 5.33 2.58 4.03 3.09 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#>  $ removed_pg_ids        : tibble [1,353 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ id             : chr [1:1353] "ABCC1_ABCC3_pg12" "ABHD4_ABHD5_pg3" "ABL1_ABL2_pg12" "ACACA_ACACB_pg13" ...
#>   ..$ relevantFilters: chr [1:1353] "filter_zero_count" "filter_zero_count" "filter_zero_count" "filter_zero_count" ...
#>  $ all_reps_zerocount_ids: tibble [159 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ id: chr [1:159] "ACSL3_ACSL4_pg15" "ACSL3_ACSL4_pg4" "ACTL6A_ACTL6B_pg5" "ACTL6A_ACTL6B_pg6" ...

Let’s take a look at how many rows of data we have left:

nrow(gimap_dataset$filtered_data$transformed_log2_cpm)
#> [1] 31817

As you can see from the output above, there are fewer pgRNA constructs in the filtered dataset following completion of filtering.

The filtering step also stores two tables of information that you may want to use or report.

  • $filtered_data$removed_pg_ids is a table that has the pgRNA construct IDs that are removed following completion of filtering in the id column and the relevant filter(s) that led to removal as a comma separated list in the relevantFilters column
  • $filtered_data$all_reps_zerocount_ids is a table that lists the IDs of pgRNA constructs which had a count of 0 for all final timepoint replicates. These pgRNA constructs are NOT necessarily filtered out
gimap_dataset <- gimap_dataset %>%
  gimap_annotate() %>%
  gimap_normalize(
    timepoints = "day",
    replicates = "rep") %>%
  calc_crispr() %>%
  calc_gi()
#> Annotating Data
#> Rows: 1884 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): gene, gene_symbol, entrez_id
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Normalizing Log Fold Change
#> 
#> Calculating CRISPR score
#> 
#> Calculating Genetic Interaction scores

Take a look at the results.

Here’s what’s included in the GI Scores table:

head(gimap_dataset$gi_scores)
#> # A tibble: 6 × 5
#>   pgRNA_target_double rep       double_target_gi_score single_target_gi_score_1
#>   <chr>               <chr>                      <dbl>                    <dbl>
#> 1 AADAC_AADACL2       late_RepA                 0.277                  -0.00458
#> 2 AADAC_AADACL2       late_RepA                 0.277                  -0.00458
#> 3 AADAC_AADACL2       late_RepA                 0.277                  -0.165  
#> 4 AADAC_AADACL2       late_RepA                 0.277                  -0.165  
#> 5 AADAC_AADACL2       late_RepB                 0.0934                 -0.349  
#> 6 AADAC_AADACL2       late_RepB                 0.0934                 -0.349  
#> # ℹ 1 more variable: single_target_gi_score_2 <dbl>

Take a look at a preview of the crispr scores:

head(gimap_dataset$crispr_score)
#> # A tibble: 6 × 12
#>   target_type rep       double_crispr_score single_crispr_score_1
#>   <chr>       <chr>                   <dbl>                 <dbl>
#> 1 gene_gene   late_RepA               0.747                 0.465
#> 2 gene_gene   late_RepA               0.747                 0.465
#> 3 gene_gene   late_RepA               0.747                 0.305
#> 4 gene_gene   late_RepA               0.747                 0.305
#> 5 gene_gene   late_RepB               0.660                 0.218
#> 6 gene_gene   late_RepB               0.660                 0.218
#> # ℹ 8 more variables: single_crispr_score_2 <dbl>, pgRNA_target_double <chr>,
#> #   mean_single_target_crispr_1 <dbl>, mean_single_target_crispr_2 <dbl>,
#> #   pgRNA1_seq <chr>, pgRNA2_seq <chr>, mean_double_control_crispr_1 <dbl>,
#> #   mean_double_control_crispr_2 <dbl>

Take a look at the overall stats:

gimap_dataset$results$overall
#> # A tibble: 6 × 6
#> # Groups:   rep [3]
#>   rep       term                 estimate std.error statistic   p.value
#>   <chr>     <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
#> 1 late_RepA (Intercept)           0.00258    0.0134     0.192 8.48e-  1
#> 2 late_RepA mean_expected_crispr  0.626      0.0189    33.1   1.94e-164
#> 3 late_RepB (Intercept)           0.0101     0.0136     0.741 4.59e-  1
#> 4 late_RepB mean_expected_crispr  0.631      0.0188    33.5   6.17e-167
#> 5 late_RepC (Intercept)          -0.0293     0.0133    -2.20  2.82e-  2
#> 6 late_RepC mean_expected_crispr  0.633      0.0188    33.7   5.15e-168

Lastly, let’s take a look at the by target results. We can arrange by results using dplyr::arrange()

gimap_dataset$results$by_target %>% 
  dplyr::arrange(fdr_vals_wil_late_RepA)
#> # A tibble: 1,030 × 13
#>    targets     p_val_ttest_late_RepA p_val_ttest_late_RepB p_val_ttest_late_RepC
#>    <chr>                       <dbl>                 <dbl>                 <dbl>
#>  1 CNOT8_CNOT7           0.00000498           0.000735                0.0000910 
#>  2 SNAP25_SNA…           0.0000108            0.00971                 0.0271    
#>  3 UGGT2_UGGT1           0.0000324            0.00129                 0.0000828 
#>  4 WNT9B_WNT9A           0.000000517          0.0000000346            0.0000444 
#>  5 CDH1_CDH3             0.000133             0.00516                 0.000375  
#>  6 BOD1L2_BOD1           0.00000955           0.000774                0.000861  
#>  7 GREB1L_GRE…           0.0000156            0.00274                 0.00943   
#>  8 BCAT2_BCAT1           0.00000181           0.0142                  0.0428    
#>  9 CDKN2B_CDK…           0.0000122            0.00000231              0.00000472
#> 10 DSCAML1_DS…           0.0000799            0.00000599              0.0000421 
#> # ℹ 1,020 more rows
#> # ℹ 9 more variables: p_val_wil_late_RepA <dbl>, p_val_wil_late_RepB <dbl>,
#> #   p_val_wil_late_RepC <dbl>, fdr_vals_ttest_late_RepA <dbl>,
#> #   fdr_vals_ttest_late_RepB <dbl>, fdr_vals_ttest_late_RepC <dbl>,
#> #   fdr_vals_wil_late_RepA <dbl>, fdr_vals_wil_late_RepB <dbl>,
#> #   fdr_vals_wil_late_RepC <dbl>

Saving results to files

We can save the genetic interactions scores like this:

readr::write_tsv(gimap_dataset$gi_scores, file.path(output_dir, "gi_scores.tsv"))

Similarly let’s save the CRISPR scores.

readr::write_tsv(gimap_dataset$crispr_score, file.path(output_dir, "crispr_scores.tsv"))

We can extract and save the stats results to the TSV

readr::write_tsv(gimap_dataset$results$overall, file.path(output_dir, "overall_stats.tsv"))
readr::write_tsv(gimap_dataset$results$by_target, file.path(output_dir, "by_target_stats.tsv"))

We can save all these data as an RDS.

saveRDS(gimap_dataset, "gimap_dataset_final.RDS")

Session Info

This is just for provenance purposes.

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4 gimap_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.1        sass_0.4.9         utf8_1.2.4         generics_0.1.3    
#>  [5] xml2_1.3.6         stringi_1.8.4      hms_1.1.3          digest_0.6.36     
#>  [9] magrittr_2.0.3     RColorBrewer_1.1-3 evaluate_0.24.0    grid_4.4.0        
#> [13] timechange_0.3.0   fastmap_1.2.0      jsonlite_1.8.8     backports_1.5.0   
#> [17] purrr_1.0.2        fansi_1.0.6        viridisLite_0.4.2  scales_1.3.0      
#> [21] textshaping_0.4.0  jquerylib_0.1.4    cli_3.6.3          crayon_1.5.3      
#> [25] rlang_1.1.4        bit64_4.0.5        munsell_0.5.1      withr_3.0.0       
#> [29] cachem_1.1.0       yaml_2.3.10        parallel_4.4.0     tools_4.4.0       
#> [33] tzdb_0.4.0         memoise_2.0.1      colorspace_2.1-1   ggplot2_3.5.1     
#> [37] kableExtra_1.4.0   broom_1.0.6        curl_5.2.1         vctrs_0.6.5       
#> [41] R6_2.5.1           lifecycle_1.0.4    lubridate_1.9.3    snakecase_0.11.1  
#> [45] stringr_1.5.1      bit_4.0.5          fs_1.6.4           htmlwidgets_1.6.4 
#> [49] vroom_1.6.5        ragg_1.3.2         janitor_2.2.0      pkgconfig_2.0.3   
#> [53] desc_1.4.3         pkgdown_2.0.9      pillar_1.9.0       bslib_0.7.0.9000  
#> [57] gtable_0.3.5       glue_1.7.0         systemfonts_1.1.0  xfun_0.46         
#> [61] tibble_3.2.1       tidyselect_1.2.1   rstudioapi_0.16.0  knitr_1.48        
#> [65] htmltools_0.5.8.1  rmarkdown_2.27     svglite_2.1.3      readr_2.1.5       
#> [69] pheatmap_1.0.12    compiler_4.4.0