Adds a cluster assignment to each row of a data.frame, with iterative variable selection during clustering.

Usage

add_cluster_assignment(
  df,
  v_cluster = NULL,
  k = NULL,
  maxit = 100,
  elimination = c("backward", "bidirectional"),
  max_vars_rm_or_add_each_it = Inf,
  return_df_cluster_instead = FALSE,
  new_var_name = "Cluster",
  weights = NULL
)

Arguments

df

data.frame

v_cluster

(character) vector of variable names to use in clustering (if NULL, all variables are used)

k

(numeric) number of clusters (if NULL, determined optimally; see get_cluster())

maxit

(numeric) maximum number of variable-selection iterations, in which only variables that differ significantly between clusters are retained (see Details; default 100)

elimination

(character) whether to use backward or bidirectional elimination (see Details; default backward)

max_vars_rm_or_add_each_it

(positive int) max number of variables to add and remove (each) at each iteration (see Details; default Inf)

return_df_cluster_instead

(logical) whether to return df with only final clustering variables and cluster assignment (default FALSE)

new_var_name

(character) name of the new variable holding the cluster assignment (default "Cluster")

weights

(numeric vector) variable weights for calculating Gower distances (default all 1)
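
To illustrate what variable weights do to a Gower distance, here is a minimal hand-rolled sketch for numeric variables only. The helper gower_numeric() and the toy data are hypothetical and are not part of this package; how add_cluster_assignment() passes weights on internally is not shown here.

```r
# Hypothetical helper: Gower distance for all-numeric data, computed as a
# weighted mean of range-normalised absolute differences per variable.
gower_numeric <- function(df, weights = rep(1, ncol(df))) {
  ranges <- vapply(df, function(x) diff(range(x)), numeric(1))
  n <- nrow(df)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      per_var <- abs(unlist(df[i, ]) - unlist(df[j, ])) / ranges
      d[i, j] <- sum(weights * per_var) / sum(weights)
    }
  }
  d
}

# Toy data (made up for illustration)
toy <- data.frame(x = c(0, 5, 10), y = c(0, 10, 10))

d_equal    <- gower_numeric(toy)[1, 2]                     # (0.5 + 1) / 2     = 0.75
d_weighted <- gower_numeric(toy, weights = c(3, 1))[1, 2]  # (3*0.5 + 1*1) / 4 = 0.625
```

Giving x three times the weight of y pulls the distance between rows 1 and 2 towards the (smaller) difference in x.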

Value

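df with the new cluster variable added. If return_df_cluster_instead is TRUE, only the final clustering variables and the cluster assignment are returned.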

Details

An initial set of clusters is determined using get_cluster() (Gower distance matrix with clustering around medoids). Then, significant differences between clusters are determined using get_sig_differences_between_groups() (ANOVA for numeric variables, Chi-square for categorical variables), and clustering is performed again using the set of significant variables (and the same k). This variable-selection method is repeated until all remaining variables are significant, or until maxit is reached. If maxit is reached before convergence, a warning is thrown.
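
The loop described above can be sketched roughly as follows. This is not the package's implementation: it assumes that the clustering step resembles cluster::daisy() with metric = "gower" followed by cluster::pam(), that significance means p < 0.05, and that k is fixed at 3. The helpers p_value_by_cluster() and cluster_on() are hypothetical stand-ins for get_sig_differences_between_groups() and get_cluster().

```r
library(cluster)  # daisy() for Gower distances, pam() for medoid clustering

# Hypothetical helper: p-value for one variable's difference across clusters
# (ANOVA for numeric variables, Chi-square otherwise).
p_value_by_cluster <- function(x, cluster) {
  if (is.numeric(x)) {
    summary(aov(x ~ factor(cluster)))[[1]][["Pr(>F)"]][1]
  } else {
    suppressWarnings(chisq.test(table(x, cluster))$p.value)
  }
}

# Hypothetical helper: cluster around medoids on a Gower distance matrix.
cluster_on <- function(df, vars, k) {
  d <- daisy(df[, vars, drop = FALSE], metric = "gower")
  pam(d, k = k, diss = TRUE)$clustering
}

set.seed(1)
df <- iris
df$noise <- rnorm(nrow(df))  # a variable unrelated to the iris structure

vars <- names(df)
k <- 3
maxit <- 100
alpha <- 0.05                # assumed significance threshold
converged <- FALSE

for (it in seq_len(maxit)) {
  cl <- cluster_on(df, vars, k)
  p <- vapply(df[vars], p_value_by_cluster, numeric(1), cluster = cl)
  keep <- vars[p < alpha]
  if (length(keep) == length(vars)) {  # all remaining variables are significant
    converged <- TRUE
    break
  }
  if (length(keep) == 0) stop("no variable differs significantly between clusters")
  vars <- keep                         # re-cluster on significant variables only
}
if (!converged) warning("maxit reached before convergence")

df$Cluster <- factor(cl)
```

After the loop, vars holds the retained clustering variables and df$Cluster the assignment from the last clustering.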

Variable selection can be achieved through backward elimination, where the full set of candidate variables is considered first, only the variables that differ significantly between clusters are kept for the next iteration, and so on until all remaining variables are significant. Because variables are never re-added, the variable set can only shrink from one iteration to the next.

Alternatively, variable selection can be achieved through bidirectional elimination, where at each step the full set of initial variables is tested for differences between clusters, whether or not they were included in the previous iteration. All significant predictors are included in the next iteration. This guarantees that the final set of variables will match the output of get_sig_differences_between_groups() with test_vars=v_cluster and group=new_var_name (provided convergence has been reached).
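
The difference between the two modes comes down to how the next variable set is derived from the test results. A minimal sketch, where next_vars() is a hypothetical helper, the p-values are made up for illustration, and the 0.05 threshold is an assumption:

```r
# Hypothetical helper: choose the variable set for the next iteration.
# `p_all` holds p-values for every initial candidate variable.
next_vars <- function(current, p_all,
                      elimination = c("backward", "bidirectional"),
                      alpha = 0.05) {
  elimination <- match.arg(elimination)
  sig <- names(p_all)[p_all < alpha]
  if (elimination == "backward") {
    intersect(current, sig)  # only variables already in use can be kept
  } else {
    sig                      # any significant candidate is used, even if dropped earlier
  }
}

# Made-up p-values: "region" was dropped in an earlier iteration but is now significant
p_all   <- c(age = 0.001, income = 0.20, region = 0.03, score = 0.60)
current <- c("age", "income", "score")

vars_backward      <- next_vars(current, p_all, "backward")       # "age"
vars_bidirectional <- next_vars(current, p_all, "bidirectional")  # "age" "region"
```

In the bidirectional case, region re-enters the variable set even though it was not used in the previous clustering.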

If the number of variables to add or remove at a given iteration exceeds max_vars_rm_or_add_each_it, then max_vars_rm_or_add_each_it caps the number of changes made to the variable set for that iteration. The cap applies separately to removals and to additions (additions only occur when elimination is bidirectional). For example, if max_vars_rm_or_add_each_it is set to 1, then at each iteration at most one variable can be removed and at most one variable can be added. If more candidates are available than the cap allows, the variables to add or remove are selected at random. This can be useful in determining cluster variable importances using calc_cluster_importances().
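
A sketch of how such a cap might be applied, with removals and additions limited separately and excess candidates sampled at random. The helper cap_changes() and the variable names are hypothetical; the package may implement this differently.

```r
# Hypothetical helper: move `current` towards `target`, changing at most
# `max_changes` variables in each direction (removed and added separately).
cap_changes <- function(current, target, max_changes = Inf) {
  to_remove <- setdiff(current, target)
  to_add    <- setdiff(target, current)
  pick <- function(v) {
    if (length(v) > max_changes) v[sample.int(length(v), max_changes)] else v
  }
  union(setdiff(current, pick(to_remove)), pick(to_add))
}

set.seed(42)
current <- c("a", "b", "c", "d")
target  <- c("a", "e", "f")   # variables found significant in this iteration

capped   <- cap_changes(current, target, max_changes = 1)  # one removed, one added
uncapped <- cap_changes(current, target)                   # all changes applied
```

With max_changes = 1, one of b, c, d is dropped and one of e, f is added; with the default Inf, the result contains exactly the target variables.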

When maxit is set to 0, a warning is thrown indicating that no variable selection is performed, and this is equivalent to using get_cluster() directly.

Examples

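if (FALSE) {
# Add a "Cluster" column using all variables and an automatically chosen k
df <- df |> add_cluster_assignment()

# Keep only the final clustering variables plus "Cluster", then plot them by cluster
df |>
  add_cluster_assignment(return_df_cluster_instead = TRUE) |>
  plot_density_by_groups(group = "Cluster")
}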