This function runs a full AutoML pipeline: it splits a data.frame into training and test sets, fits an AutoML model, adds the model's predictions to the test set, and optionally reports variable importances.

Usage

run_automl_pipeline(
  df,
  target,
  config = get_automl_config(),
  ...,
  prop_train = 0.7,
  scale_numeric_features = FALSE,
  add_confidence_level = FALSE,
  plot_and_print_variable_importances = TRUE,
  nthreads = "half",
  shutdown_h2o = FALSE
)

Arguments

df

data.frame

target

(character) name of target variable

config

configuration object, as returned by get_automl_config()

...

additional arguments passed to get_automl_model()

prop_train

proportion of rows allocated to the training set (default 0.7)

scale_numeric_features

(logical) whether to scale numeric features (default FALSE; see scale_numeric_features_in_train_and_test())

add_confidence_level

(logical) whether to add confidence levels alongside predictions in the test set (default FALSE)

plot_and_print_variable_importances

(logical) whether to plot and print variable importances if they are available (default TRUE)

nthreads

(character) how many threads (cores) to dedicate to the H2O cluster: "half" (the default), "minus1", or "all"; see init_h2o()

shutdown_h2o

(logical) whether to shut down cluster at the end of the pipeline (default FALSE)
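
As a rough illustration of the nthreads options listed above, the sketch below maps each keyword to a thread count. resolve_nthreads() is a hypothetical helper written for this example; it is an assumption about what the keywords mean, and init_h2o() may resolve them differently.

```r
# Hypothetical helper: NOT part of the package. It only illustrates one
# plausible reading of the "half", "minus1", and "all" keywords.
resolve_nthreads <- function(nthreads = c("half", "minus1", "all"),
                             n_cores = parallel::detectCores()) {
  nthreads <- match.arg(nthreads)
  # detectCores() can return NA on some platforms; fall back to one core
  if (is.na(n_cores)) n_cores <- 1L
  n <- switch(nthreads,
    half   = n_cores %/% 2,  # half of the available cores, rounded down
    minus1 = n_cores - 1,    # leave one core free for other work
    all    = n_cores         # use every available core
  )
  max(1L, as.integer(n))     # never request fewer than one thread
}

resolve_nthreads("half", n_cores = 8)
resolve_nthreads("minus1", n_cores = 8)
resolve_nthreads("half", n_cores = 1)
```

A count computed this way could then be handed to something like h2o::h2o.init(nthreads = ...), though whether the package does so internally is not stated here.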

Value

list with elements:

tt

train-test split constructed from df, with predictions and possibly confidence levels added to the test set

automl

H2O AutoML model

importances

Variable importances (if requested)

Details

The df data.frame is first prepared by removing non-ASCII characters, casting any character variables to factors, and dropping blank levels from factors. It is then split into training and test sets and, if requested, numeric features are scaled using sample statistics computed on the training set. Next, a local H2O cluster is launched, the training and test sets are uploaded to the cluster, and an AutoML model is requested using the parameters specified in the configuration object. Predictions are added to the test set (with or without confidence levels), and variable importances may be plotted and reported. Finally, if requested, the H2O cluster is shut down.
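
The preparation, splitting, and scaling steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the treatment of blank strings as missing values, and the exclusion of the target from scaling are all choices made for this example.

```r
# Illustrative base-R sketch; the package's own helpers may differ in detail.
set.seed(42)  # make the random split reproducible

# Toy data (made up for this example) with a blank string and a non-ASCII value
df <- data.frame(
  x1 = c(1.5, 2.0, 3.5, 4.0, 5.5, 6.0, 7.5, 8.0, 9.5, 10.0),
  x2 = c("a", "b", "", "a", "b", "a", "caf\u00e9", "b", "a", "b"),
  y1 = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)
target <- "y1"
prop_train <- 0.7

# 1. Prepare: strip non-ASCII characters, then cast characters to factors.
#    Assumption: blank strings become NA so that "" is not kept as a level.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) {
  col <- gsub("[^\\x01-\\x7F]", "", col, perl = TRUE)
  col[col == ""] <- NA
  factor(col)
})

# 2. Split rows at random into training and test sets.
n_train <- round(prop_train * nrow(df))
train_idx <- sample(seq_len(nrow(df)), size = n_train)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# 3. Scale numeric features using training-set statistics only, so that no
#    information from the test set leaks into the transformation.
#    Assumption: the target column is left unscaled.
is_num <- vapply(df, is.numeric, logical(1))
num_cols <- setdiff(names(df)[is_num], target)
for (col in num_cols) {
  mu <- mean(train[[col]])
  sigma <- sd(train[[col]])
  train[[col]] <- (train[[col]] - mu) / sigma
  test[[col]] <- (test[[col]] - mu) / sigma
}
```

Because the mean and standard deviation come from the training set alone, the scaled test-set features will generally not have mean 0 and standard deviation 1.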

If no configuration object is provided, the default configuration returned by get_automl_config() is used.

Examples

if (FALSE) {
# df is assumed to be a data.frame with a target column named "y1"
pip <- df |> run_automl_pipeline(target = "y1")
}