Pseudo-bulk the gene expression, filter lowly-expressed genes, and normalize. This is the first step for spatial registration and for statistical modeling.
registration_pseudobulk(
    sce,
    var_registration,
    var_sample_id,
    covars = NULL,
    min_ncells = 10,
    pseudobulk_rds_file = NULL,
    filter_expr = TRUE,
    mito_gene = NULL
)
sce: A SingleCellExperiment-class object or one that inherits its properties.
var_registration: A character(1) specifying the colData(sce) variable of interest that will be used for computing the relevant statistics. This should be a categorical variable whose categories are all syntactically valid (usable as R variable names, with no special characters or leading numbers), e.g. 'L1.2' or 'celltype2', not 'L1/2' or '2'.
var_sample_id: A character(1) specifying the colData(sce) variable with the sample ID.
covars: A character() with names of sample-level covariates.
min_ncells: An integer(1) greater than 0 specifying the minimum number of cells (for scRNA-seq) or spots (for spatial) that are combined when pseudo-bulking. Pseudo-bulked samples with fewer than min_ncells cells or spots, as recorded in sce_pseudo$ncells, will be dropped.
pseudobulk_rds_file: A character(1) specifying the path for saving an RDS file with the pseudo-bulked object. Specifying this is useful since pseudo-bulking can take hours to run on large datasets.
filter_expr: A logical(1) specifying whether to filter pseudo-bulked counts with edgeR::filterByExpr. Defaults to TRUE; filtering is recommended for the spatial registration workflow.
mito_gene: An optional logical() vector indicating which genes are mitochondrial, used to calculate the pseudo-bulked mitochondrial expression rates expr_chrM and pseudo_expr_chrM. Its length must match nrow(sce).
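Before calling registration_pseudobulk(), it can help to check the inputs described above. The following is a minimal base R sketch and not part of spatialLIBD: the category labels and gene symbols are made up for illustration, and the "^MT-" prefix is an assumption that holds for human gene symbols but may differ for other species or annotations.

```r
## Hypothetical category labels, mirroring the examples given for var_registration
categories <- c("L1.2", "celltype2", "L1/2", "2")

## A label is syntactically valid if make.names() leaves it unchanged
is_valid <- categories == make.names(categories)
names(is_valid) <- categories
is_valid

## One way to repair invalid labels before pseudo-bulking
## (check afterwards that distinct categories did not collapse into one)
make.names(categories)

## Hypothetical gene symbols; in practice these would come from rowData(sce)
gene_names <- c("MT-ND1", "GAPDH", "MT-CO1", "SNAP25")

## Assumption: mitochondrial genes start with "MT-" (human gene symbols)
mito_gene <- grepl("^MT-", gene_names)

## mito_gene needs one entry per gene, i.e. length(mito_gene) == nrow(sce)
stopifnot(is.logical(mito_gene), length(mito_gene) == length(gene_names))
```

Running the validity check on the real colData(sce) column before pseudo-bulking makes it easier to rename problematic categories up front.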
A pseudo-bulked SingleCellExperiment-class object. The logcounts() assay contains log2-CPM values calculated with edgeR::cpm(log = TRUE). See https://github.com/LieberInstitute/spatialLIBD/issues/106 and https://support.bioconductor.org/p/9161754 for more details about the math behind scuttle::logNormCounts(), edgeR::cpm(), and their differences.
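As a rough illustration of what a log2-CPM value is, the sketch below computes counts per million by hand in base R on a made-up count matrix. It is a simplification under stated assumptions: the pseudocount of 1 is chosen here only for illustration, whereas edgeR::cpm(log = TRUE) applies its own prior count scaled to library size and can use normalization factors, so its output will not match these numbers exactly.

```r
## Made-up counts: 3 genes (rows) by 2 pseudo-bulked samples (columns)
counts <- matrix(
    c(
        10, 0, 990,
        200, 300, 1500
    ),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("A_G0", "B_G0"))
)

## Library size = total counts per sample
lib_size <- colSums(counts)

## Counts per million: scale each column by its library size
cpm_manual <- sweep(counts, 2, lib_size, "/") * 1e6

## log2 scale with an illustrative pseudocount of 1 to avoid log2(0)
log_cpm_manual <- log2(cpm_manual + 1)
log_cpm_manual
```

Scaling by library size is what makes pseudo-bulked samples built from different numbers of cells comparable on the same scale.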
Other spatial registration and statistical modeling functions: registration_block_cor(), registration_model(), registration_stats_anova(), registration_stats_enrichment(), registration_stats_pairwise(), registration_wrapper()
## Ensure reproducibility of example data
set.seed(20220907)
## Generate example data
sce <- scuttle::mockSCE()
## Add some sample IDs
sce$sample_id <- sample(LETTERS[1:5], ncol(sce), replace = TRUE)
## Add a sample-level covariate: age
ages <- rnorm(5, mean = 20, sd = 4)
names(ages) <- LETTERS[1:5]
sce$age <- ages[sce$sample_id]
## Add gene-level information
rowData(sce)$ensembl <- paste0("ENSG", seq_len(nrow(sce)))
rowData(sce)$gene_name <- paste0("gene", seq_len(nrow(sce)))
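## Pseudo-bulk by Cell Cycle
## Note: min_ncells = NULL is used here, so small pseudo-bulked samples
## (see the ncells column in the output below) are not dropped
sce_pseudo <- registration_pseudobulk(
    sce,
    var_registration = "Cell_Cycle",
    var_sample_id = "sample_id",
    covars = c("age"),
    min_ncells = NULL
)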
#> 2025-04-30 17:47:04.97592 make pseudobulk object
#> 2025-04-30 17:47:05.09133 drop lowly expressed genes
#> 2025-04-30 17:47:05.140169 normalize expression
colData(sce_pseudo)
#> DataFrame with 20 rows and 9 columns
#>      Mutation_Status  Cell_Cycle   Treatment   sample_id       age
#>          <character> <character> <character> <character> <numeric>
#> A_G0              NA          G0          NA           A   19.1872
#> B_G0              NA          G0          NA           B   25.3496
#> C_G0              NA          G0          NA           C   24.1802
#> D_G0              NA          G0          NA           D   15.5211
#> E_G0              NA          G0          NA           E   20.9701
#> ...              ...         ...         ...         ...       ...
#> A_S               NA           S          NA           A   19.1872
#> B_S               NA           S          NA           B   25.3496
#> C_S               NA           S          NA           C   24.1802
#> D_S               NA           S          NA           D   15.5211
#> E_S               NA           S          NA           E   20.9701
#>      registration_variable registration_sample_id    ncells pseudo_sum_umi
#>                <character>            <character> <integer>      <numeric>
#> A_G0                    G0                      A         8        2946915
#> B_G0                    G0                      B        13        4922867
#> C_G0                    G0                      C         9        3398888
#> D_G0                    G0                      D         7        2630651
#> E_G0                    G0                      E        10        3761710
#> ...                    ...                    ...       ...            ...
#> A_S                      S                      A        12        4516334
#> B_S                      S                      B         8        2960685
#> C_S                      S                      C         7        2595774
#> D_S                      S                      D        14        5233560
#> E_S                      S                      E        11        4151818
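
The ncells column printed above shows how many cells went into each pseudo-bulked sample. As a minimal sketch of the min_ncells rule described in the arguments, the base R snippet below applies the default threshold of 10 to the five G0 values shown in the output. It only mimics the comparison on copied values and is not the package's internal code.

```r
## ncells for the G0 pseudo-bulked samples, copied from the output above
ncells <- c(A_G0 = 8L, B_G0 = 13L, C_G0 = 9L, D_G0 = 7L, E_G0 = 10L)

## Default threshold from the function signature
min_ncells <- 10L

## Samples with fewer than min_ncells cells would be dropped
keep <- ncells >= min_ncells

## Samples that would be retained
names(ncells)[keep]

## Samples that would be dropped
names(ncells)[!keep]
```

With small example data like this, a threshold of 10 would remove most G0 samples, which is consistent with the example passing min_ncells = NULL.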