This calculates the log fold change for a gimap dataset based on the annotation and metadata provided. gimap takes in a counts matrix that represents the number of cells that have each type of pgRNA. These data need to be normalized before CRISPR scores and Genetic Interaction scores can be calculated.
Normalization involves the following steps:
1. `Calculate log2CPM` - First we account for different read depths across samples and transform the data to log2 counts per million reads. `log2(((counts / total counts for sample) * 1 million) + 1)`
2. `Calculate log2 fold change` - This is done by subtracting the pretreatment log2CPM from the log2CPM of each sample. The pretreatment sample serves as the control: it is day 0 of CRISPR treatment, before CRISPR pgRNAs have taken effect. `log2FC = log2CPM for each sample - pretreatment log2CPM`
3. `Normalize by negative and positive controls` - Calculate a negative control median and a positive control median for each sample, then divide each log2FC by the difference between the two. `log2FC adjusted = log2FC / (median negative control for a sample - median positive control for a sample)`
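The steps listed above can be sketched in base R on a toy counts matrix. This is only an illustration of the formulas as written here, not the package's internal code: the counts, the sample names (`day0`, `day21_rep1`, `day21_rep2`), the pgRNA names, and the `is_neg` / `is_pos` control flags are all made-up assumptions.

```r
# Toy counts matrix (made-up values): rows are pgRNAs, columns are samples
counts <- cbind(
  day0       = c(500, 480, 520, 510, 490, 500),
  day21_rep1 = c(600, 590, 60, 50, 300, 580),
  day21_rep2 = c(1200, 1100, 130, 90, 640, 1150)
)
rownames(counts) <- c("neg_1", "neg_2", "pos_1", "pos_2", "target_1", "target_2")

# Hypothetical flags marking which pgRNAs are negative / positive controls
is_neg <- grepl("^neg_", rownames(counts))
is_pos <- grepl("^pos_", rownames(counts))

# Step 1: log2CPM = log2(((counts / total counts for sample) * 1 million) + 1)
log2_cpm <- log2(sweep(counts, 2, colSums(counts), "/") * 1e6 + 1)

# Step 2: log2FC = log2CPM for each sample - pretreatment log2CPM
# (the day0 vector is recycled down each column of the matrix)
log2_fc <- log2_cpm - log2_cpm[, "day0"]

# Drop the pretreatment column: its log2FC is 0 everywhere by construction
log2_fc <- log2_fc[, colnames(log2_fc) != "day0", drop = FALSE]

# Step 3: divide each log2FC by (negative control median - positive control median),
# computed separately for each sample
neg_median <- apply(log2_fc[is_neg, , drop = FALSE], 2, median)
pos_median <- apply(log2_fc[is_pos, , drop = FALSE], 2, median)
log2_fc_adjusted <- sweep(log2_fc, 2, neg_median - pos_median, "/")

log2_fc_adjusted
```

After this scaling, the gap between the negative control median and the positive control median is 1 in every sample, which puts samples with different dynamic ranges on a comparable scale.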
gimap_normalize(
  .data = NULL,
  gimap_dataset,
  normalize_by_unexpressed = TRUE,
  timepoints = NULL,
  treatments = NULL,
  control_name = NULL,
  num_ids_wo_annot = 20,
  rm_ids_wo_annot = TRUE,
  missing_ids_file = "missing_ids_file.csv",
  overwrite = TRUE
)
`.data`: Data can be piped in with a tidyverse pipe from function to function, but the data must still be a gimap_dataset.
`gimap_dataset`: A special dataset structure that is set up using the `setup_data()` function.
`normalize_by_unexpressed`: TRUE/FALSE; whether CRISPR data should be normalized so that the median of unexpressed controls is 0. Setting this to TRUE requires that TPM data were added in the `gimap_annotate` step using `cell_line_annotation` or `custom_tpm`.
`timepoints`: Specifies the column name of the metadata set up in `$metadata$sample_metadata` that has a factor that represents the timepoints. Timepoints will be made into three categories: plasmid for the earliest timepoint, early for all middle timepoints, and late for the latest timepoints. The late timepoints will be the focus for the calculations. The column used for timepoints must be numeric or at least ordinal.
`treatments`: Specifies the column name of the metadata set up in `$metadata$sample_metadata` that has a factor that represents the treatment applied to each sample. Replicates will be collapsed to an average.
`control_name`: A name that specifies the data in the treatments column that should be used as the control. This could be the Day 0 of treatment or an untreated sample. For timepoints testing, it will be assumed that the minimum timepoint is the control.
`num_ids_wo_annot`: Default is 20; the number of pgRNA IDs to display to the console if they don't have corresponding annotation data. If there are more IDs without annotation data than this number, the output will be sent to a file rather than the console.
`rm_ids_wo_annot`: Default is TRUE; whether or not to filter out pgRNA IDs from the input dataset that don't have corresponding annotation data available.
`missing_ids_file`: If there are missing IDs and a file is saved, where do you want this file to be saved? Provide a file path.
`overwrite`: Should existing normalized_log_fc data in the gimap_dataset be overwritten?
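To make the timepoint handling concrete, the sketch below shows one way the three categories described for the timepoints column (plasmid, early, late) could be derived from a numeric or ordered column. The helper `categorize_timepoints()` and the `sample_metadata` values are hypothetical and written only for illustration; they are not part of gimap and may differ from what the package does internally.

```r
# Hypothetical sample metadata; the "day" column and its values are made up
sample_metadata <- data.frame(
  sample = c("s1", "s2", "s3", "s4", "s5"),
  day = c(0, 7, 14, 21, 21)
)

# Hypothetical helper (not a gimap function): label the earliest timepoint
# "plasmid", the latest "late", and everything in between "early"
categorize_timepoints <- function(x) {
  # The column must be numeric or at least ordinal
  stopifnot(is.numeric(x) || is.ordered(x))
  ranks <- as.numeric(x) # ordered factors become their integer level codes
  category <- rep("early", length(x))
  category[ranks == min(ranks)] <- "plasmid"
  category[ranks == max(ranks)] <- "late"
  factor(category, levels = c("plasmid", "early", "late"))
}

sample_metadata$timepoint_category <- categorize_timepoints(sample_metadata$day)
sample_metadata
```

Under this sketch, samples `s4` and `s5` share the latest timepoint, so both fall into the late category that the calculations focus on.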
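if (FALSE) { # \dontrun{
  library(gimap)
  library(magrittr) # provides the %>% pipe if it is not already attached

  gimap_dataset <- get_example_data("gimap")

  # Highly recommended but not required
  run_qc(gimap_dataset)

  # Filter, annotate, then normalize using the "day" metadata column as timepoints
  gimap_dataset <- gimap_dataset %>%
    gimap_filter() %>%
    gimap_annotate(cell_line = "HELA") %>%
    gimap_normalize(
      timepoints = "day"
    )

  # To see results
  gimap_dataset$normalized_log_fc
} # }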