gimap performs analysis of dual-targeting CRISPR screening data, with the goal of aiding the identification of genetic interactions (e.g. cooperativity, synthetic lethality) in models of disease and other biological contexts. gimap analyzes functional genomic data generated by the pgPEN (paired guide RNAs for genetic interaction mapping) approach, quantifying growth effects of single and paired gene knockouts upon application of a CRISPR library. A multitude of CRISPR screen types can be used for this analysis, with helpful descriptions found in this review (https://www.nature.com/articles/s43586-021-00093-4). Use of pgPEN and GI-mapping in a paired gRNA format can be found here (https://pubmed.ncbi.nlm.nih.gov/34469736/).
library(gimap)
#> Warning: replacing previous import 'dplyr::group_rows' by
#> 'kableExtra::group_rows' when loading 'gimap'
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
First let’s create a folder we will save files to.
output_dir <- "output_timepoints"
dir.create(output_dir, showWarnings = FALSE)
In this example we are going to examine a dataset with timepoints.
Let’s examine this example pgPEN counts table. It’s divided into columns containing:
id
: an ID corresponding to the names of paired
guidesseq_1
: gRNA sequence 1, targeting “paralog A”seq_2
: gRNA sequence 2, targeting “paralog B”Day00_RepA
: Gene Counts from Day 00 for Replicate
ADay05_RepA
: Gene Counts from Day 05 for Replicate
ADay22_RepA
: Gene Counts from Day 22 for Replicate
ADay22_RepB
: Gene Counts from Day 22 for Replicate
BFor the purposes of this tutorial, we can grab example data from the package.
example_data <- get_example_data("count")
#> Rows: 33170 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): id, seq_1, seq_2
#> dbl (5): Day00_RepA, Day05_RepA, Day22_RepA, Day22_RepB, Day22_RepC
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The metadata you have may vary slightly from this but you’ll want to make sure you have the essential variables and information regarding how you collected your data.
colnames(example_data)
#> [1] "id" "seq_1" "seq_2" "Day00_RepA" "Day05_RepA"
#> [6] "Day22_RepA" "Day22_RepB" "Day22_RepC"
We’re going to set up three datasets that we will provide to the
set_up()
function to create a gimap
dataset
object.
counts
- the counts generated from pgPENpg_ids
- the IDs that correspond to the rows of the
counts and specify the constructsample_metadata
- metadata that describes the columns
of the counts including their timepointsFor this example we are using a timepoint dataset.
Required for a timepoint analysis is a Day 0 (or plasmid) sample, and at least one further timepoint sample. The T0 sample, or plasmid sample, will represent the entire library before any type of selection has occurred during the length of the screen. This is the baseline for guide RNA representation. The length of time cells should remain in culture throughout the screen is heavily dependent on the type of selection occurring, helpful advice can be found in (https://www.nature.com/articles/s43586-021-00093-4). QC analysis will follow to correlate replicates if inputted. Comparison of early and late timepoints is possible in this function, but not required if early timepoints were not taken.
counts <- example_data %>%
select(c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC")) %>%
as.matrix()
The next datasets are metadata that describe the dimensions of the count data.
pg_id
are just the unique IDs listed in the same
order/sorted the same way as the count data and can be used for mapping
between the count data and the metadata. This is required and very
important because it is necessary to know the IDs and be able to map
them to pgRNA constructs and counts data.
Sample metadata is the information that describes the samples and is sorted the same order as the columns in the count data.
You need to have two columns in the metadata you provide. You’ll need
to specify the names of these columns in the
gimap_annotate()
function.
col_names
- Must match the colnames of the counts data
being submittedtimepoints
- A numeric variable that describes the
timepoints for these data.
sample_metadata <- data.frame(
col_names = c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC"),
day = as.numeric(c("0", "5", "22", "22", "22")),
rep = as.factor(c("RepA", "RepA", "RepA", "RepB", "RepC"))
)
We’ll need to provide counts
, pg_ids
and
sample_metadata
to setup_data()
.
Now let’s setup our data using setup_data()
. Optionally
we can provide the metadata in this function as well so that it is
stored with the data.
gimap_dataset <- setup_data(
counts = counts,
pg_ids = pg_ids,
sample_metadata = sample_metadata
)
You’ll notice that this set up gives us a list of formatted data.
This contains the original counts we gave setup_data()
function but also normalized counts, and the total counts per
sample.
raw_counts
: The original counts data that illustrates
the number of cells alive in the sample. This data has samples as the
columns and the paired guide constructs as rows.counts_per_sample
: Add up all the counts for each
sample over all of the paired guide designs.count_norm
- For each sample, the data is normalized
-log10(( counts +1) / total counts for the sample over all the pg designs ))
cpm
- For each sample this is calculated by taking the
counts / total counts for the sample over all the pg designs)*1e6
log2cpm
: log-2 transformed counts per million this is
calculated by log2(cpms + 1)
sample_metadata
: Metadata that describes the samples.
This likely includes the time point information, replicates, sample IDs,
and any other additional information that is needed regarding the
experimental setup.
str(gimap_dataset)
#> List of 10
#> $ raw_counts : num [1:33170, 1:5] 951 1511 191 588 289 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> $ counts_per_sample: Named num [1:5] 38471205 32028290 29144511 36628141 36448291
#> ..- attr(*, "names")= chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> $ transformed_data :List of 3
#> ..$ count_norm: num [1:33170, 1:5] 4.61 4.41 5.3 4.82 5.12 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> ..$ cpm : num [1:33170, 1:5] 24.72 39.28 4.96 15.28 7.51 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> ..$ log2_cpm : num [1:33170, 1:5] 4.68 5.33 2.58 4.03 3.09 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> $ metadata :List of 2
#> ..$ pg_ids : tibble [33,170 × 1] (S3: tbl_df/tbl/data.frame)
#> .. ..$ id: chr [1:33170] "AADAC_AADACL2_pg1" "AADAC_AADACL2_pg10" "AADAC_AADACL2_pg11" "AADAC_AADACL2_pg12" ...
#> ..$ sample_metadata:'data.frame': 5 obs. of 3 variables:
#> .. ..$ col_names: chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> .. ..$ day : num [1:5] 0 5 22 22 22
#> .. ..$ rep : Factor w/ 3 levels "RepA","RepB",..: 1 1 1 2 3
#> $ filtered_data :List of 5
#> ..$ filter_step_run : logi FALSE
#> ..$ metadata_pg_ids : NULL
#> ..$ transformed_log2_cpm : NULL
#> ..$ removed_pg_ids : NULL
#> ..$ all_reps_zerocount_ids: NULL
#> $ comparisons : NULL
#> $ annotation : NULL
#> $ normalized_log_fc: NULL
#> $ crispr_score : NULL
#> $ results : NULL
#> - attr(*, "class")= chr [1:2] "list" "gimap_dataset"
The first step is running some quality checks on our data. The
run_qc()
function will create a report we can look at to
assess this.
The report includes several visualizations of raw/unfiltered data:
log2cpm
values for each individual sample:
this helps users identify samples that do not have a normal distribution
of reads and inform the upcoming filtering steps.This report also includes several visualizations after filters are applied:
There is a filter that flags pgRNA constructs where any of the time points have a count of zero. - We include a bar plot that shows the number of pgRNA constructs which have counts of zero in either 0, 1, 2, or 3 replicates. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.
There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.
There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.
run_qc(gimap_dataset,
output_file = "example_qc_report.Rmd",
overwrite = TRUE,
quiet = TRUE)
You can take a look at an example QC report here.
After considering the QC report and which filters are
appropriate/desired for your data, you can apply filters to the data
using the gimap_filter
function.
gimap_dataset <- gimap_dataset %>%
gimap_filter()
Filtered forms of the data can be seen in the
$filtered_data entry
str(gimap_dataset$filtered_data)
#> List of 5
#> $ filter_step_run : logi TRUE
#> $ metadata_pg_ids : tibble [31,817 × 1] (S3: tbl_df/tbl/data.frame)
#> ..$ id: chr [1:31817] "AADAC_AADACL2_pg1" "AADAC_AADACL2_pg10" "AADAC_AADACL2_pg11" "AADAC_AADACL2_pg12" ...
#> $ transformed_log2_cpm : num [1:31817, 1:5] 4.68 5.33 2.58 4.03 3.09 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:5] "Day00_RepA" "Day05_RepA" "Day22_RepA" "Day22_RepB" ...
#> $ removed_pg_ids : tibble [1,353 × 2] (S3: tbl_df/tbl/data.frame)
#> ..$ id : chr [1:1353] "ABCC1_ABCC3_pg12" "ABHD4_ABHD5_pg3" "ABL1_ABL2_pg12" "ACACA_ACACB_pg13" ...
#> ..$ relevantFilters: chr [1:1353] "filter_zero_count" "filter_zero_count" "filter_zero_count" "filter_zero_count" ...
#> $ all_reps_zerocount_ids: tibble [159 × 1] (S3: tbl_df/tbl/data.frame)
#> ..$ id: chr [1:159] "ACSL3_ACSL4_pg15" "ACSL3_ACSL4_pg4" "ACTL6A_ACTL6B_pg5" "ACTL6A_ACTL6B_pg6" ...
Let’s take a look at how many rows of data we have left:
nrow(gimap_dataset$filtered_data$transformed_log2_cpm)
#> [1] 31817
As you can see from the output above, there are fewer pgRNA constructs in the filtered dataset following completion of filtering.
The filtering step also stores two tables of information that you may want to use or report.
$filtered_data$removed_pg_ids
is a table that has the
pgRNA construct IDs that are removed following completion of filtering
in the id
column and the relevant filter(s) that led to
removal as a comma separated list in the relevantFilters
column$filtered_data$all_reps_zerocount_ids
is a table that
lists the IDs of pgRNA constructs which had a count of 0 for all final
timepoint replicates. These pgRNA constructs are NOT necessarily
filtered outNow that you’ve performed QC and filtering, the rest of the pipeline can be run
cell_line
your data uses so that the correct
corresponding DepMap data is used for annotation. This function is
gimap_annotate()
gimap_normalize()
function. timepoints
needs to be specified pointing to the
correct column names from the sample_data
passed to the
setup_data()
function earlier.calc_crispr()
function.calc_gi()
function.
gimap_dataset <- gimap_dataset %>%
gimap_annotate(cell_line = "HELA") %>%
gimap_normalize(
timepoints = "day"
) %>%
calc_crispr() %>%
calc_gi()
#> Annotating Data
#> Rows: 1884 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): gene, gene_symbol, entrez_id
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Normalizing Log Fold Change
#>
#> Calculating CRISPR score
#>
#> Calculating Genetic Interaction scores
Take a look at the results.
Here’s what’s included in the GI Scores table:
head(gimap_dataset$gi_scores)
#> # A tibble: 6 × 8
#> pgRNA_target_double rep double_target_gi_score single_target_gi_sco…¹
#> <chr> <chr> <dbl> <dbl>
#> 1 AADAC_AADACL2 Day05_RepA_… -9.44 -0.387
#> 2 AADAC_AADACL2 Day05_RepA_… -9.44 -0.387
#> 3 AADAC_AADACL2 Day05_RepA_… -9.44 -1.54
#> 4 AADAC_AADACL2 Day05_RepA_… -9.44 -1.54
#> 5 AADAC_AADACL2 Day22_RepA_… 0.496 0.228
#> 6 AADAC_AADACL2 Day22_RepA_… 0.496 0.228
#> # ℹ abbreviated name: ¹single_target_gi_score_1
#> # ℹ 4 more variables: single_target_gi_score_2 <dbl>,
#> # expected_crispr_single_1 <dbl>, expected_crispr_single_2 <dbl>,
#> # expected_crispr_double <dbl>
Take a look at a preview of the crispr scores:
head(gimap_dataset$crispr_score)
#> # A tibble: 6 × 13
#> pg_ids rep double_crispr_score single_crispr_score_1 single_crispr_score_2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 AADAC_A… Day0… -8.22 1.36 4.70
#> 2 AADAC_A… Day0… -8.22 1.36 1.46
#> 3 AADAC_A… Day0… -8.22 -0.971 4.70
#> 4 AADAC_A… Day0… -8.22 -0.971 1.46
#> 5 AADAC_A… Day2… 0.890 0.396 0.167
#> 6 AADAC_A… Day2… 0.890 0.396 0.641
#> # ℹ 8 more variables: pgRNA_target_double <chr>, gene_symbol_1 <chr>,
#> # gene_symbol_2 <chr>, mean_single_target_crispr_1 <dbl>,
#> # mean_single_target_crispr_2 <dbl>, pgRNA1_seq <chr>, pgRNA2_seq <chr>,
#> # mean_double_control_crispr <dbl>
Take a look at the overall stats:
gimap_dataset$results$overall
#> # A tibble: 8 × 6
#> # Groups: rep [4]
#> rep term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Day05_RepA_early (Intercept) 0.0170 0.00689 2.47 1.35e- 2
#> 2 Day05_RepA_early mean_expected_single_… 0.367 0.00226 162. 0
#> 3 Day22_RepA_late (Intercept) 0.0128 0.000879 14.6 5.33e- 48
#> 4 Day22_RepA_late mean_expected_single_… 0.681 0.00227 300. 0
#> 5 Day22_RepB_late (Intercept) 0.0206 0.000829 24.9 6.78e-137
#> 6 Day22_RepB_late mean_expected_single_… 0.661 0.00224 295. 0
#> 7 Day22_RepC_late (Intercept) 0.0205 0.000774 26.5 6.88e-155
#> 8 Day22_RepC_late mean_expected_single_… 0.695 0.00227 306. 0
Lastly, let’s take a look at the by target results. We can arrange by
results using dplyr::arrange()
gimap_dataset$results$by_target %>%
dplyr::arrange(fdr_vals_wil_Day22_RepA_late)
#> # A tibble: 1,030 × 17
#> targets p_val_ttest_Day05_Re…¹ p_val_ttest_Day22_Re…² p_val_ttest_Day22_Re…³
#> <chr> <dbl> <dbl> <dbl>
#> 1 CNOT8_C… 0.981 0.000000742 0.00000615
#> 2 AP2A1_A… 0.00184 0.00000515 0.00000396
#> 3 CABP7_C… 0.0214 0.000240 0.00382
#> 4 PFN2_PF… 0.0749 0.0000000770 0.357
#> 5 TRAPPC6… 0.0854 0.00000206 0.000435
#> 6 FES_FER 0.296 0.000000613 0.0688
#> 7 TIAL1_T… 0.202 0.0000220 0.000756
#> 8 CD209_C… 0.588 0.0000374 0.232
#> 9 UCK2_UC… 0.903 0.0000257 0.00795
#> 10 SEC23B_… 0.0759 0.000102 0.0000396
#> # ℹ 1,020 more rows
#> # ℹ abbreviated names: ¹p_val_ttest_Day05_RepA_early,
#> # ²p_val_ttest_Day22_RepA_late, ³p_val_ttest_Day22_RepB_late
#> # ℹ 13 more variables: p_val_ttest_Day22_RepC_late <dbl>,
#> # p_val_wil_Day05_RepA_early <dbl>, p_val_wil_Day22_RepA_late <dbl>,
#> # p_val_wil_Day22_RepB_late <dbl>, p_val_wil_Day22_RepC_late <dbl>,
#> # fdr_vals_ttest_Day05_RepA_early <dbl>, …
We can save the genetic interactions scores like this:
Similarly let’s save the CRISPR scores.
We can extract and save the stats results to the TSV
readr::write_tsv(gimap_dataset$results$overall, file.path(output_dir, "overall_stats.tsv"))
readr::write_tsv(gimap_dataset$results$by_target, file.path(output_dir, "by_target_stats.tsv"))
We can save all these data as an RDS.
saveRDS(gimap_dataset, "gimap_dataset_final_timepoint.RDS")
This is just for provenance purposes.
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS 15.0.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 gimap_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyr_1.3.1 sass_0.4.9 utf8_1.2.4 generics_0.1.3
#> [5] xml2_1.3.6 stringi_1.8.4 hms_1.1.3 digest_0.6.37
#> [9] magrittr_2.0.3 RColorBrewer_1.1-3 evaluate_1.0.1 grid_4.4.0
#> [13] timechange_0.3.0 fastmap_1.2.0 jsonlite_1.8.9 backports_1.5.0
#> [17] purrr_1.0.2 fansi_1.0.6 viridisLite_0.4.2 scales_1.3.0
#> [21] textshaping_0.4.0 jquerylib_0.1.4 cli_3.6.3 crayon_1.5.3
#> [25] rlang_1.1.4 bit64_4.5.2 munsell_0.5.1 withr_3.0.1
#> [29] cachem_1.1.0 yaml_2.3.10 parallel_4.4.0 tools_4.4.0
#> [33] tzdb_0.4.0 colorspace_2.1-1 ggplot2_3.5.1 kableExtra_1.4.0
#> [37] broom_1.0.7 curl_5.2.3 vctrs_0.6.5 R6_2.5.1
#> [41] lifecycle_1.0.4 lubridate_1.9.3 snakecase_0.11.1 stringr_1.5.1
#> [45] bit_4.5.0 fs_1.6.4 htmlwidgets_1.6.4 vroom_1.6.5
#> [49] ragg_1.3.3 janitor_2.2.0 pkgconfig_2.0.3 desc_1.4.3
#> [53] pkgdown_2.1.1 pillar_1.9.0 bslib_0.8.0.9000 gtable_0.3.6
#> [57] glue_1.8.0 systemfonts_1.1.0 xfun_0.48 tibble_3.2.1
#> [61] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.48 htmltools_0.5.8.1
#> [65] rmarkdown_2.28 svglite_2.1.3 readr_2.1.5 pheatmap_1.0.12
#> [69] compiler_4.4.0