This function takes a vector of cluster labels, converts it to a factor(), and orders the factor() levels by decreasing frequency, so that the most frequent cluster is the first level, the second most frequent cluster is the second level, and so on.

sort_clusters(clusters, map_subset = NULL)

Arguments

clusters

A vector with cluster labels.

map_subset

A logical vector of the same length as clusters specifying which elements of clusters to use when ranking the clusters. With the default, NULL, all elements are used.

Value

A factor() version of clusters where the levels are ordered by frequency.
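
The ordering described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's actual implementation: the helper name sort_by_frequency(), its subset argument, and its tie-breaking rule (tied clusters keep their default sorted level order) are all hypothetical.

```r
## Hypothetical helper for illustration only; not part of the package.
sort_by_frequency <- function(x, subset = NULL) {
    ## By default, use every element to compute the frequencies
    if (is.null(subset)) {
        subset <- rep(TRUE, length(x))
    }
    stopifnot(is.logical(subset), length(subset) == length(x))

    ## Converting to a factor first keeps every label as a level, even
    ## if the subset excludes all of its occurrences (count of zero).
    ## NAs are dropped by table() and so do not take part in the ranking.
    x_factor <- factor(x)
    counts <- table(x_factor[subset])

    ## Decreasing frequency; ties are broken by the original level order
    ord <- order(-as.vector(counts), seq_along(counts))
    factor(as.character(x), levels = names(counts)[ord])
}

clus <- letters[unlist(lapply(4:1, function(x) rep(x, x)))]
levels(sort_by_frequency(clus))
#> [1] "d" "c" "b" "a"
levels(sort_by_frequency(clus, subset = seq_along(clus) > 3))
#> [1] "c" "b" "a" "d"
```

Only the level order depends on the subset; the values of the returned factor always come from the full input vector.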

Examples


## Build an initial set of cluster labels
clus <- letters[unlist(lapply(4:1, function(x) rep(x, x)))]

## In this case, it's a character vector
class(clus)
#> [1] "character"

## We see that we have 10 elements in this vector, which is
## an unnamed character vector
clus
#>  [1] "d" "d" "d" "d" "c" "c" "c" "b" "b" "a"

## letter 'd' is the most frequent
table(clus)
#> clus
#> a b c d 
#> 1 2 3 4 

## Sort them and obtain a factor. Notice that the values are the
## same as in the original character vector, while the levels are
## ordered by frequency.
sort_clusters(clus)
#>  [1] d d d d c c c b b a
#> Levels: d c b a

## Since 'd' was the most frequent, it gets assigned to the first level
## in the factor variable.
table(sort_clusters(clus))
#> 
#> d c b a 
#> 4 3 2 1 

## If we skip the first 3 values of clus (which are all 'd'), we
## change the most frequent cluster and thus the ordering of the
## factor levels. 'c' now ranks first, while 'd' ties with 'a' at
## one element each and ends up last.
sort_clusters(clus, map_subset = seq_along(clus) > 3)
#>  [1] d d d d c c c b b a
#> Levels: c b a d

## Let's try with a factor variable
clus_factor <- factor(clus)
## sort_clusters() returns an identical result in this case
stopifnot(identical(sort_clusters(clus), sort_clusters(clus_factor)))

## What happens if you have a logical variable with NAs?
set.seed(20240712)
log_var <- sample(c(TRUE, FALSE, NA),
    1000,
    replace = TRUE,
    prob = c(0.3, 0.15, 0.55)
)
## Here, the NAs are the most frequent group.
table(log_var, useNA = "ifany")
#> log_var
#> FALSE  TRUE  <NA> 
#>   135   304   561 

## The NAs are not used for sorting. Since we have more 'TRUE' than
## 'FALSE' values, 'TRUE' becomes the first level.
table(sort_clusters(log_var), useNA = "ifany")
#> 
#>  TRUE FALSE  <NA> 
#>   304   135   561
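
If the missing values should count as a cluster of their own, one option is to give them an explicit label before sorting. The sketch below uses base R only. The label "missing" is an arbitrary choice, and the expectation that sort_clusters() would then rank it like any other label is an assumption based on the frequency ordering described above, not something shown in the original examples.

```r
## Recreate the logical vector from the example above
set.seed(20240712)
log_var <- sample(c(TRUE, FALSE, NA),
    1000,
    replace = TRUE,
    prob = c(0.3, 0.15, 0.55)
)

## Recode the NAs with an explicit, arbitrary label so that they are
## counted like any other cluster
labels <- as.character(log_var)
labels[is.na(labels)] <- "missing"

## No NAs remain, so every element now contributes to the counts
table(labels, useNA = "ifany")

## Assumption: passing `labels` instead of `log_var` to sort_clusters()
## would let "missing" take part in the frequency ranking.
```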