This function is provided for convenience. It runs all the functions required for computing the modeling_results. This is useful for finding marker genes in a new spatially-resolved transcriptomics dataset so that the dataset can then be used with run_app(). The results can also be used for spatial registration of sc/snRNA-seq datasets through layer_stat_cor() and related functions.

registration_wrapper(
  sce,
  var_registration,
  var_sample_id,
  covars = NULL,
  gene_ensembl = NULL,
  gene_name = NULL,
  suffix = "",
  min_ncells = 10,
  pseudobulk_rds_file = NULL
)

Arguments

sce

A SingleCellExperiment-class object or one that inherits its properties.

var_registration

A character(1) specifying the colData(sce) variable of interest against which the relevant statistics will be computed.

var_sample_id

A character(1) specifying the colData(sce) variable with the sample ID.

covars

A character() with names of sample-level covariates.

gene_ensembl

A character(1) specifying the rowData(sce_pseudo) column with the ENSEMBL gene IDs. This will be used by layer_stat_cor().

gene_name

A character(1) specifying the rowData(sce_pseudo) column with the gene names (symbols).

suffix

A character(1) specifying the suffix to use for the F-statistics column. This is particularly useful if you will run this function more than once and want to be able to merge the results.

min_ncells

An integer(1) greater than 0 specifying the minimum number of cells (for scRNA-seq) or spots (for spatial) that are combined when pseudo-bulking. Pseudo-bulked samples with fewer than min_ncells cells or spots, as recorded in sce_pseudo$ncells, will be dropped.

pseudobulk_rds_file

A character(1) specifying the path for saving an RDS file with the pseudo-bulked object. It's useful to specify this since pseudo-bulking can take hours to run on large datasets.

Value

A list() of data.frame() objects with the statistical results. This is similar to fetch_data("modeling_results").

Details

We chose a default of min_ncells = 10 based on section 4.3 of OSCA at http://bioconductor.org/books/3.15/OSCA.multisample/multi-sample-comparisons.html. OSCA cites https://doi.org/10.1038/s41467-020-19894-4 as the paper behind the definition of "very low" as 10. You might want to use registration_pseudobulk() and manually explore sce_pseudo$ncells to choose the best cutoff for your data.

Examples

## Ensure reproducibility of example data
set.seed(20220907)

## Generate example data
sce <- scuttle::mockSCE()

## Add some sample IDs
sce$sample_id <- sample(LETTERS[1:5], ncol(sce), replace = TRUE)

## Add a sample-level covariate: age
ages <- rnorm(5, mean = 20, sd = 4)
names(ages) <- LETTERS[1:5]
sce$age <- ages[sce$sample_id]

## Add gene-level information
rowData(sce)$ensembl <- paste0("ENSG", seq_len(nrow(sce)))
rowData(sce)$gene_name <- paste0("gene", seq_len(nrow(sce)))

## Compute all modeling results
example_modeling_results <- registration_wrapper(
    sce,
    var_registration = "Cell_Cycle",
    var_sample_id = "sample_id",
    covars = c("age"),
    gene_ensembl = "ensembl",
    gene_name = "gene_name",
    suffix = "wrapper"
)
#> 2024-07-26 23:49:13.890602 make pseudobulk object
#> 2024-07-26 23:49:14.048728 dropping 9 pseudo-bulked samples that are below 'min_ncells'.
#> 2024-07-26 23:49:14.070448 drop lowly expressed genes
#> 2024-07-26 23:49:14.125018 normalize expression
#> 2024-07-26 23:49:14.211229 create model matrix
#> 2024-07-26 23:49:14.221755 run duplicateCorrelation()
#> 2024-07-26 23:49:16.590766 The estimated correlation is: -0.0783081238514532
#> 2024-07-26 23:49:16.592969 computing enrichment statistics
#> 2024-07-26 23:49:16.720585 extract and reformat enrichment results
#> 2024-07-26 23:49:16.746398 running the baseline pairwise model
#> 2024-07-26 23:49:16.76428 computing pairwise statistics
#> 2024-07-26 23:49:16.837624 computing F-statistics
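
## Explore the structure of the returned list. This is a minimal sketch that
## only relies on the value being a list() of data.frame() objects.
class(example_modeling_results)
names(example_modeling_results)
lapply(example_modeling_results, dim)

## Sketch of choosing a min_ncells cutoff, as suggested in Details.
## Assumption: min_ncells = 1 is an illustrative value used here only to keep
## every pseudo-bulked sample so that the ncells distribution can be inspected.
sce_pseudo <- registration_pseudobulk(
    sce,
    var_registration = "Cell_Cycle",
    var_sample_id = "sample_id",
    covars = c("age"),
    min_ncells = 1
)

## Distribution of cells per pseudo-bulked sample
summary(sce_pseudo$ncells)

## Number of pseudo-bulked samples that the default cutoff of 10 would keep
table(sce_pseudo$ncells >= 10)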