A function to run filtering — gimap

This function applies filters to the gimap data. By default it runs both the zero count filter (across all samples) and the low plasmid CPM filter, but users can select a subset of these filters or adjust the behavior of each filter.

gimap_filter(
  .data = NULL,
  gimap_dataset,
  filter_type = "both",
  cutoff = NULL,
  filter_zerocount_target_col = NULL,
  filter_plasmid_target_col = NULL,
  filter_replicates_target_col = NULL,
  min_n_filters = 1
)

Arguments

.data: Data can be piped in with tidyverse pipes from function to function, but the data must still be a gimap_dataset.
gimap_dataset: A special dataset structure that is set up using the `setup_data()` function.
filter_type: Can be one of the following: `zero_count_only`, `low_plasmid_cpm_only` or `both`. Potentially in the future also `rep_variation`, `zero_in_last_time_point` or a vector that includes multiple of these filters.
cutoff: default is NULL; applies to the low_plasmid_cpm filter. The cutoff below which log2 CPM values for the plasmid time period are considered low. If not specified, the lower outlier bound (the lower quartile minus 1.5 * the interquartile range) is used.
filter_zerocount_target_col: default is NULL; which sample column(s) should be used to check for counts of 0? If NULL, downstream analysis will select all sample columns.
filter_plasmid_target_col: default is NULL, in which case only the first column is selected; use this parameter to specify the plasmid column(s).
filter_replicates_target_col: default is NULL; which sample columns are the final time point replicates? If NULL, the last 3 sample columns are used. This function uses it only to save a list of which pgRNA IDs have a zero count in all of these samples.
min_n_filters: default is 1; when a combination of filters is used, the minimum number of independent filters that must flag a pgRNA construct before the construct is filtered out. You should decide on the appropriate filter based on the results of your QC report.
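
The default `cutoff` rule described above (lower quartile minus 1.5 * interquartile range) can be sketched in base R. This is only an illustration on made-up numbers, not gimap's internal code: the helper name `lower_outlier_cutoff()` is hypothetical, R's default quantile method is assumed, and whether gimap compares with `<` or `<=` is also an assumption.

```r
# Hypothetical helper: lower outlier bound = Q1 - 1.5 * IQR
lower_outlier_cutoff <- function(log2_cpm) {
  q <- stats::quantile(log2_cpm, probs = c(0.25, 0.75), names = FALSE)
  q[1] - 1.5 * (q[2] - q[1])
}

# Made-up log2 CPM values for a single plasmid column
plasmid_log2_cpm <- c(-3.0, 4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.4)

cutoff <- lower_outlier_cutoff(plasmid_log2_cpm)

# Constructs below the cutoff would be flagged as low plasmid CPM
# (strict "<" is an assumption made for this sketch)
low_plasmid_flag <- plasmid_log2_cpm < cutoff
```

With these toy values only the first construct falls below the bound. Passing a numeric `cutoff` to `gimap_filter()` replaces this data-driven bound with a fixed threshold.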

Value

A filtered version of the gimap_dataset, returned in the $filtered_data section:
filter_step_run: a boolean reporting whether the filter step was run (the step is optional).
metadata_pg_ids: the subset of the pgRNA IDs that remain in the dataset after filtering is complete.
transformed_log2_cpm: the subset of the log2_cpm data that remains in the dataset after filtering is complete.
removed_pg_ids: a record of which pgRNAs were filtered out once filtering is complete.
all_reps_zerocount_ids: not necessarily filtered data; instead, a record of which pgRNAs have a zero count in all final time point replicates.

Examples

# \donttest{

gimap_dataset <- get_example_data("gimap", data_dir = tempdir()) %>%
  gimap_filter()

# To see filtered data
# gimap_dataset$filtered_data

# If you want to only use a single filter or some subset,
# specify which using the filter_type parameter
gimap_dataset <- get_example_data("gimap", data_dir = tempdir()) %>%
  gimap_filter(filter_type = "zero_count_only")
# or
gimap_dataset <- get_example_data("gimap", data_dir = tempdir()) %>%
  gimap_filter(filter_type = "low_plasmid_cpm_only")

# If you use multiple filters and want more than one of them to flag a
# pgRNA construct before it is filtered out, use the `min_n_filters` argument
gimap_dataset <- get_example_data("gimap", data_dir = tempdir()) %>%
  gimap_filter(
    filter_type = "both",
    min_n_filters = 2
  )

# You can also specify which columns the filters will be applied to
gimap_dataset <- get_example_data("gimap", data_dir = tempdir()) %>%
  gimap_filter(
    filter_type = "zero_count_only",
    filter_zerocount_target_col = c(1, 2)
  )
# }
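
The effect of `min_n_filters` when filters are combined can be pictured with a small base R sketch. The flag table below is made-up data and the column names are hypothetical; it only illustrates the "at least this many filters must flag a construct" rule and is not how gimap stores its flags internally.

```r
# Made-up per-construct flags from two independent filters
flags <- data.frame(
  pgRNA_id = c("pg1", "pg2", "pg3", "pg4"),
  zero_count = c(TRUE, TRUE, FALSE, FALSE),
  low_plasmid_cpm = c(TRUE, FALSE, TRUE, FALSE)
)

# Count how many filters flagged each construct
n_flags <- rowSums(as.matrix(flags[, c("zero_count", "low_plasmid_cpm")]))

# min_n_filters = 1: a single flag is enough to remove a construct
removed_any <- flags$pgRNA_id[n_flags >= 1]

# min_n_filters = 2: both filters must agree before a construct is removed
removed_both <- flags$pgRNA_id[n_flags >= 2]
```

In this toy table, a threshold of 1 removes pg1, pg2 and pg3, while a threshold of 2 removes only pg1, so raising `min_n_filters` makes the combined filter more conservative.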