This function finds expression cutoffs based on estimating the second breakpoint using segmented applied to two curves. The first curve is the number of features passing the cutoff assuming an initial large drop, an intermediate section with moderate drop, and then a stable tail. The second curve is the mean number of expressed samples (non-zero expression) for all genes at each given cutoff. This curve increases rapidly then has a section with a moderate increase, and finally a long tail. Using segmented we find that second breakpoint. The suggested expression cutoff returned is the average of the two suggested cutoffs.
expression_cutoff(expr, max_cut = 1, seed = NULL, n.boot = 2000, k = 2)
A matrix with the expression values with features in the rows and
samples in the columns in RPKM format (typically). expr
can be any matrix
as expression_cutoff()
will work with the row means of the matrix. You'll
most likely want to use a normalized expression matrix, although this
function will work with any matrix.
Maximum expression cutoff to consider. Increasing this value
does then to increase the suggested cutoffs. If set to NULL
, this
will be chosen automatically from 1 to 5.
Set this argument for increased reproducibility of the results. This is passed to seg.control. We highly recommend specifiying this value.
This argument controls the stability of the suggested cutoffs. It is passed to seg.control.
The number of breakpoints to consider. If you increase this value the estimated breakpoints are less stable.
A two element vector with the suggested cutoffs based on the number of features expressed and the mean number of expressed samples across all features. The function also makes plots that visualize both curves and show the identified breakpoints as vertical grey lines.
## Simulate expression data. Based on the limma::lmFit() man page
sd <- 0.3 * sqrt(4 / rchisq(100, df = 4))
y <- matrix(rnorm(100 * 6, sd = sd), 100, 6)
rownames(y) <- paste("Gene", seq_len(100))
y[seq_len(2), seq_len(3) + 3] <- y[seq_len(2), seq_len(3) + 3] + 2
## Make the expression data look like RPKM values
y2 <- apply(y, 2, function(z) {
res <- z + abs(min(z))
res[sample(seq_len(100), 10)] <- 0
res
})
summary(y2)
#> V1 V2 V3 V4
#> Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.812 1st Qu.:0.5794 1st Qu.:0.6757 1st Qu.:0.7014
#> Median :1.061 Median :0.8119 Median :0.9338 Median :1.0135
#> Mean :1.002 Mean :0.8070 Mean :0.8705 Mean :0.9828
#> 3rd Qu.:1.264 3rd Qu.:1.0926 3rd Qu.:1.1058 3rd Qu.:1.2464
#> Max. :2.472 Max. :2.1606 Max. :1.9088 Max. :3.3155
#> V5 V6
#> Min. :0.000 Min. :0.000
#> 1st Qu.:1.107 1st Qu.:1.435
#> Median :1.470 Median :1.699
#> Mean :1.364 Mean :1.558
#> 3rd Qu.:1.710 3rd Qu.:1.894
#> Max. :3.563 Max. :3.829
expression_cutoff(y2)
#> 2024-03-27 13:28:49.133827 the suggested expression cutoff is 0.78
#> percent_features_cut samples_nonzero_cut
#> 0.79 0.77
## Or same the plots to a PDF file
pdf("test_expression_cutoff.pdf", width = 12)
expression_cutoff(y2)
#> 2024-03-27 13:28:49.81514 the suggested expression cutoff is 0.78
#> percent_features_cut samples_nonzero_cut
#> 0.79 0.77
dev.off()
#> agg_png
#> 2
## View the pdf with the following code
utils::browseURL("test_expression_cutoff.pdf")