Adds a cluster assignment to each row of a data.frame, with iterative variable selection during clustering.
Usage
add_cluster_assignment(
df,
v_cluster = NULL,
k = NULL,
maxit = 100,
elimination = c("backward", "bidirectional"),
max_vars_rm_or_add_each_it = Inf,
return_df_cluster_instead = FALSE,
new_var_name = "Cluster",
weights = NULL
)
Arguments
- df
data.frame
- v_cluster
(character) vector of variable names to use in clustering (if NULL, use all)
- k
(numeric) number of clusters (if NULL, determined optimally; see get_cluster())
- maxit
(numeric) maximum number of variable-selection iterations (default 100)
- elimination
(character) whether to use backward or bidirectional elimination (see Details; default backward)
- max_vars_rm_or_add_each_it
(positive int) max number of variables to add and remove (each) at each iteration (see Details; default Inf)
- return_df_cluster_instead
(logical) whether to return df with only final clustering variables and cluster assignment (default FALSE)
- new_var_name
(character) name of the new variable holding the cluster assignment (default "Cluster")
- weights
(numeric vector) variable weights for calculating Gower distances (default all 1)
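As a rough illustration of what variable weights mean for a Gower distance, the sketch below uses cluster::daisy() directly on the built-in iris data. Whether add_cluster_assignment() forwards weights to daisy() in exactly this way is an assumption; the variable subset and the weight values are chosen only for the example.

```r
library(cluster)  # provides daisy()

# Illustrative variable subset and weights (not package defaults)
vars <- c("Sepal.Length", "Petal.Length", "Species")
w    <- c(1, 1, 2)  # count Species twice as much as each numeric variable

# Gower distance handles mixed numeric/categorical columns;
# `weights` must have one entry per column
d_equal    <- daisy(iris[, vars], metric = "gower")
d_weighted <- daisy(iris[, vars], metric = "gower", weights = w)

# Both are dissimilarity objects over the same 150 rows
attr(d_weighted, "Size")
```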
Details
An initial set of clusters is determined using get_cluster() (Gower distance matrix with
clustering around medoids). Then, significant differences between clusters are determined using
get_sig_differences_between_groups() (ANOVA for numeric variables, Chi-square for categorical
variables), and clustering is performed again using the set of significant variables (and the same k).
This variable-selection method is repeated until all remaining variables are significant,
or until maxit is reached. If maxit is reached before convergence, a warning is
thrown.
Variable selection can be achieved through backward elimination, where the full set of
candidate variables is considered first, then only the significant variables are kept for
the next iteration, and so on until all remaining variables are significant. This guarantees
that the variable set can only shrink from one iteration to the next.
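The backward-elimination loop described above can be sketched with the cluster package and base R tests. This is a minimal illustration, not the package's implementation: the helpers cluster_once(), p_value_by_cluster() and backward_select() are hypothetical names, and the 0.05 significance threshold is an assumption that the text does not state.

```r
library(cluster)  # daisy() for Gower distances, pam() for clustering around medoids

# Hypothetical helper: cluster rows on `vars` with a fixed k
cluster_once <- function(df, vars, k) {
  d <- daisy(df[, vars, drop = FALSE], metric = "gower")
  factor(pam(d, k = k, diss = TRUE)$clustering)
}

# Hypothetical helper: p-value for one variable's difference across clusters
# (ANOVA for numeric variables, chi-square test for categorical ones)
p_value_by_cluster <- function(x, cl) {
  if (is.numeric(x)) {
    summary(aov(x ~ cl))[[1]][["Pr(>F)"]][1]
  } else {
    suppressWarnings(chisq.test(table(x, cl))$p.value)
  }
}

# Hypothetical sketch of backward elimination; alpha = 0.05 is an assumption
backward_select <- function(df, vars, k, maxit = 100, alpha = 0.05) {
  cl <- cluster_once(df, vars, k)
  for (it in seq_len(maxit)) {
    p    <- vapply(vars, function(v) p_value_by_cluster(df[[v]], cl), numeric(1))
    keep <- vars[!is.na(p) & p < alpha]
    if (length(keep) == length(vars)) {
      # Every remaining variable is significant: converged
      return(list(vars = vars, cluster = cl, converged = TRUE))
    }
    if (length(keep) == 0) stop("no significant variables left")
    vars <- keep                        # the set can only shrink
    cl   <- cluster_once(df, vars, k)   # re-cluster with the same k
  }
  warning("maxit reached before convergence")
  list(vars = vars, cluster = cl, converged = FALSE)
}

res <- backward_select(iris, names(iris), k = 3)
res$vars
```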
Alternatively, variable selection can be achieved through bidirectional elimination, where
at each step the full set of initial variables is tested for differences between clusters,
whether or not they were included in the previous iteration. All significant predictors are
included in the next iteration. This guarantees that the final set of variables will match
the output of get_sig_differences_between_groups() with test_vars=v_cluster and group=new_var_name
(provided convergence has been reached).
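The difference between the two modes comes down to how the next variable set is derived from the significance tests. The sketch below shows only that update step, assuming p-values are available for every initial candidate variable. The function next_vars(), the 0.05 threshold, and the toy p-values are all hypothetical and serve only as illustration.

```r
# Hypothetical update step: `p_values` is a named vector covering all
# initial candidate variables, `current` is the set used in this iteration
next_vars <- function(p_values, current,
                      elimination = c("backward", "bidirectional"),
                      alpha = 0.05) {
  elimination <- match.arg(elimination)
  significant <- names(p_values)[p_values < alpha]
  if (elimination == "backward") {
    # Only variables already in the set can survive
    intersect(current, significant)
  } else {
    # Any initial variable can re-enter if it is significant
    significant
  }
}

# Toy p-values for illustration only
p       <- c(age = 0.001, income = 0.20, region = 0.03)
current <- c("age", "income")

next_vars(p, current, "backward")       # "age"
next_vars(p, current, "bidirectional")  # "age" "region"
```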
If the number of variables to add or remove at a given iteration exceeds
max_vars_rm_or_add_each_it, the number of changes made to the variable set for that iteration
is capped at that value, separately for variables removed and variables added (additions apply
only if elimination is bidirectional). For example, if max_vars_rm_or_add_each_it is set to 1,
then at each iteration at most one variable can be removed and at most one variable can be
added. If more candidates are available than the cap allows, the variable(s) to add or remove
are selected randomly. This can be useful in determining cluster variable importances using
calc_cluster_importances().
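The capping rule can be illustrated with plain set operations. This is a sketch under assumptions: cap_changes() is a hypothetical function, and the way the package draws its random selection may differ.

```r
# Hypothetical sketch: limit removals and additions (each) to `max_changes`
cap_changes <- function(current, significant, max_changes = Inf) {
  to_remove <- setdiff(current, significant)  # no longer significant
  to_add    <- setdiff(significant, current)  # significant but not yet included
  # Randomly keep at most `max_changes` candidates; indexing via sample.int()
  # avoids the surprising behaviour of sample() on length-one input
  pick <- function(x) {
    if (length(x) > max_changes) x[sample.int(length(x), max_changes)] else x
  }
  union(setdiff(current, pick(to_remove)), pick(to_add))
}

set.seed(1)
current     <- c("a", "b", "c")
significant <- c("a", "d", "e")

# With a cap of 1: "a" stays, one of b/c is dropped, one of d/e is added
capped <- cap_changes(current, significant, max_changes = 1)

# Without a cap, the set jumps straight to the significant variables
uncapped <- cap_changes(current, significant)
```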
When maxit is set to 0, a warning is thrown indicating that no variable selection
is performed, and this is equivalent to using get_cluster() directly.
Examples
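if (FALSE) {
# Add a "Cluster" column to df, using all variables and an automatically chosen k
df <- df |> add_cluster_assignment()
# Keep only the final clustering variables and the cluster assignment, then plot
df |>
  add_cluster_assignment(return_df_cluster_instead=TRUE) |>
  plot_density_by_groups(group="Cluster")
}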