This function takes a vector with cluster labels, recasts it as a factor(), and sorts the factor() levels by frequency such that the most frequent cluster is the first level and so on.
sort_clusters(clusters, map_subset = NULL)
A factor() version of clusters where the levels are ordered by frequency.
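The reordering described above can be approximated in base R. The sketch below is only an illustration under stated assumptions: `sketch_sort_by_freq()` is a hypothetical helper, not the package's implementation, and it ignores `map_subset`.

```r
## Hypothetical helper for illustration only; not the package's code.
sketch_sort_by_freq <- function(x) {
    ## Count each label (table() drops NAs by default), most frequent first
    counts <- sort(table(x), decreasing = TRUE)
    ## Rebuild the factor with the levels in that order
    factor(x, levels = names(counts))
}

example_labels <- c("d", "d", "d", "d", "c", "c", "c", "b", "b", "a")
levels(sketch_sort_by_freq(example_labels))
#> [1] "d" "c" "b" "a"
```

The values themselves are untouched; only the order of the levels changes, which is what controls the order of groups in tables and plots.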
## Build an initial set of cluster labels
clus <- letters[unlist(lapply(4:1, function(x) rep(x, x)))]
## In this case, it's a character vector
class(clus)
#> [1] "character"
## We see that we have 10 elements in this vector, which is
## an unnamed character vector
clus
#> [1] "d" "d" "d" "d" "c" "c" "c" "b" "b" "a"
## letter 'd' is the most frequent
table(clus)
#> clus
#> a b c d
#> 1 2 3 4
## Sort them and obtain a factor. The values are the same as in the
## original character vector; only the order of the factor levels
## is new.
sort_clusters(clus)
#> [1] d d d d c c c b b a
#> Levels: d c b a
## Since 'd' was the most frequent, it gets assigned to the first level
## in the factor variable.
table(sort_clusters(clus))
#>
#> d c b a
#> 4 3 2 1
## If we skip the first 3 values of clus (which are all 'd'), we can
## change the most frequent cluster, and thus the ordering of the
## factor levels. The skipped elements are still returned; they are
## only left out of the frequency count.
sort_clusters(clus, map_subset = seq_len(length(clus)) > 3)
#> [1] d d d d c c c b b a
#> Levels: c b a d
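## A rough base R sketch of the same idea (an assumption made for
## illustration, not necessarily how sort_clusters() works internally):
## count frequencies only within the subset, then apply that level
## order to every element. 'keep' and 'subset_counts' are hypothetical names.
keep <- seq_along(clus) > 3
subset_counts <- sort(table(clus[keep]), decreasing = TRUE)
levels(factor(clus, levels = names(subset_counts)))
#> [1] "c" "b" "a" "d"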
## Let's try with a factor variable
clus_factor <- factor(clus)
## sort_clusters() returns an identical result in this case
stopifnot(identical(sort_clusters(clus), sort_clusters(clus_factor)))
## What happens if you have a logical variable with NAs?
set.seed(20240712)
log_var <- sample(c(TRUE, FALSE, NA),
    1000,
    replace = TRUE,
    prob = c(0.3, 0.15, 0.55)
)
## Here, the NAs are the most frequent group.
table(log_var, useNA = "ifany")
#> log_var
#> FALSE TRUE <NA>
#> 135 304 561
## The NAs are not used for sorting. Since there are more 'TRUE' than
## 'FALSE' values, 'TRUE' becomes the first level.
table(sort_clusters(log_var), useNA = "ifany")
#>
#> TRUE FALSE <NA>
#> 304 135 561
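
The NA behaviour shown above is consistent with how base R treats missing values in `table()` and `factor()`. As a small sketch (base R only, using a made-up toy vector, and not a description of how `sort_clusters()` is implemented internally), NAs are left out of the counts and stay NA in the result rather than becoming a level:

```r
## Toy logical vector with more NAs than TRUE or FALSE values
toy <- c(TRUE, NA, FALSE, TRUE, NA, NA)

## table() ignores NAs by default, so only TRUE and FALSE are counted
counts <- sort(table(toy), decreasing = TRUE)

## Levels follow the counts; NA is never a level
res <- factor(toy, levels = names(counts))
levels(res)
#> [1] "TRUE"  "FALSE"

## The three NAs are preserved as missing values
sum(is.na(res))
#> [1] 3
```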