Spatial registration: pseudobulk

Pseudo-bulk the gene expression, filter lowly-expressed genes, and normalize. This is the first step for spatial registration and for statistical modeling.

registration_pseudobulk(
  sce,
  var_registration,
  var_sample_id,
  covars = NULL,
  min_ncells = 10,
  pseudobulk_rds_file = NULL,
  filter_expr = TRUE,
  mito_gene = NULL
)

Arguments

sce: A SingleCellExperiment-class object or one that inherits its properties.
var_registration: A character(1) specifying the colData(sce) variable of interest against which the relevant statistics will be computed. This should be a categorical variable whose categories are all syntactically valid R names (usable as an R variable name, with no special characters or leading numbers), e.g. 'L1.2' or 'celltype2', not 'L1/2' or '2'.
var_sample_id: A character(1) specifying the colData(sce) variable with the sample ID.
covars: A character() with names of sample-level covariates.
min_ncells: An integer(1) greater than 0 specifying the minimum number of cells (for scRNA-seq) or spots (for spatial) that are combined when pseudo-bulking. Pseudo-bulked samples with fewer than min_ncells cells or spots, as recorded in sce_pseudo$ncells, will be dropped. Passing NULL, as in the example below, skips this filter.
pseudobulk_rds_file: A character(1) specifying the path for saving an RDS file with the pseudo-bulked object. It's useful to specify this since pseudo-bulking can take hours to run on large datasets.
filter_expr: A logical(1) specifying whether to filter pseudo-bulked counts with edgeR::filterByExpr. Defaults to TRUE; filtering is recommended for the spatial registration workflow.
mito_gene: An optional logical() vector indicating which genes are mitochondrial, used to calculate the pseudo-bulked mitochondrial expression rates expr_chrM and pseudo_expr_chrM. Its length must match nrow(sce).
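
Two of the arguments above have constraints that are easy to check before calling the function. The following is a minimal sketch in base R, not part of the package: it uses make.names() to test whether category labels are syntactically valid, and builds a logical vector of the kind mito_gene expects by pattern matching. The example labels, the gene symbols, and the "^MT-" prefix (a common convention for human mitochondrial gene symbols) are assumptions for illustration only.

```r
## Hypothetical category labels for the var_registration variable
labels <- c("L1.2", "celltype2", "L1/2", "2")

## A label is syntactically valid if make.names() leaves it unchanged
is_valid <- labels == make.names(labels)
names(is_valid) <- labels
is_valid

## Hypothetical gene symbols; in practice these would come from rowData(sce)
gene_name <- c("MT-CO1", "GAPDH", "MT-ND1", "SNAP25")

## Logical vector with one entry per gene, as mito_gene requires
mito_gene <- grepl("^MT-", gene_name)
stopifnot(length(mito_gene) == length(gene_name))
```

make.names() can also be used to repair invalid labels, but check the result first: distinct labels such as 'L1/2' and 'L1.2' collapse into the same name.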

Value

A pseudo-bulked SingleCellExperiment-class object. The logcounts() assay contains log2-CPM values calculated with edgeR::cpm(log = TRUE). See https://github.com/LieberInstitute/spatialLIBD/issues/106 and https://support.bioconductor.org/p/9161754 for more details about the math behind scuttle::logNormCounts(), edgeR::cpm(), and their differences.
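
To make "log2-CPM" concrete, here is a simplified base R sketch on a hypothetical count matrix. It is not the package's implementation: edgeR::cpm(log = TRUE) scales the prior count in proportion to each library size and can use normalization factors, so its values will differ somewhat from this illustration. The matrix values and the prior_count of 1 are assumptions.

```r
## Hypothetical pseudo-bulked counts: 3 genes (rows) x 2 samples (columns)
counts <- matrix(
    c(10, 0, 90,
      200, 50, 750),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("A_G0", "B_G0"))
)

## Library size = total counts per pseudo-bulked sample
lib_size <- colSums(counts)

## Counts per million: divide each column by its library size
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

## Simplified log2-CPM with a fixed prior count to avoid log2(0)
prior_count <- 1
log_cpm <- log2(sweep(counts + prior_count, 2, lib_size, "/") * 1e6)
log_cpm
```

Dividing by the library size is what makes pseudo-bulked samples built from different numbers of cells or spots comparable on the same scale.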

Examples

## Load the package
library("spatialLIBD")

## Ensure reproducibility of example data
set.seed(20220907)

## Generate example data
sce <- scuttle::mockSCE()

## Add some sample IDs
sce$sample_id <- sample(LETTERS[1:5], ncol(sce), replace = TRUE)

## Add a sample-level covariate: age
ages <- rnorm(5, mean = 20, sd = 4)
names(ages) <- LETTERS[1:5]
sce$age <- ages[sce$sample_id]

## Add gene-level information
rowData(sce)$gene_id <- paste0("ENSG", seq_len(nrow(sce)))
rowData(sce)$gene_name <- paste0("gene", seq_len(nrow(sce)))

## Pseudo-bulk by Cell Cycle
sce_pseudo <- registration_pseudobulk(
    sce,
    var_registration = "Cell_Cycle",
    var_sample_id = "sample_id",
    covars = c("age"),
    min_ncells = NULL
)
#> 2026-01-09 17:23:15.800132 make pseudobulk object
#> 2026-01-09 17:23:15.915281 drop lowly expressed genes
#> 2026-01-09 17:23:15.989577 normalize expression
colData(sce_pseudo)
#> DataFrame with 20 rows and 9 columns
#>      Mutation_Status  Cell_Cycle   Treatment   sample_id       age
#>          <character> <character> <character> <character> <numeric>
#> A_G0              NA          G0          NA           A   19.1872
#> B_G0              NA          G0          NA           B   25.3496
#> C_G0              NA          G0          NA           C   24.1802
#> D_G0              NA          G0          NA           D   15.5211
#> E_G0              NA          G0          NA           E   20.9701
#> ...              ...         ...         ...         ...       ...
#> A_S               NA           S          NA           A   19.1872
#> B_S               NA           S          NA           B   25.3496
#> C_S               NA           S          NA           C   24.1802
#> D_S               NA           S          NA           D   15.5211
#> E_S               NA           S          NA           E   20.9701
#>      registration_variable registration_sample_id    ncells pseudo_sum_umi
#>                <character>            <character> <integer>      <numeric>
#> A_G0                    G0                      A         8        2946915
#> B_G0                    G0                      B        13        4922867
#> C_G0                    G0                      C         9        3398888
#> D_G0                    G0                      D         7        2630651
#> E_G0                    G0                      E        10        3761710
#> ...                    ...                    ...       ...            ...
#> A_S                      S                      A        12        4516334
#> B_S                      S                      B         8        2960685
#> C_S                      S                      C         7        2595774
#> D_S                      S                      D        14        5233560
#> E_S                      S                      E        11        4151818
rowData(sce_pseudo)
#> DataFrame with 2000 rows and 3 columns
#>               gene_id   gene_name        gene_search
#>           <character> <character>        <character>
#> Gene_0001       ENSG1       gene1       gene1; ENSG1
#> Gene_0002       ENSG2       gene2       gene2; ENSG2
#> Gene_0003       ENSG3       gene3       gene3; ENSG3
#> Gene_0004       ENSG4       gene4       gene4; ENSG4
#> Gene_0005       ENSG5       gene5       gene5; ENSG5
#> ...               ...         ...                ...
#> Gene_1996    ENSG1996    gene1996 gene1996; ENSG1996
#> Gene_1997    ENSG1997    gene1997 gene1997; ENSG1997
#> Gene_1998    ENSG1998    gene1998 gene1998; ENSG1998
#> Gene_1999    ENSG1999    gene1999 gene1999; ENSG1999
#> Gene_2000    ENSG2000    gene2000 gene2000; ENSG2000
