This function runs a full AutoML pipeline: it splits a data.frame into training
and test sets, fits an AutoML model, adds the model's predictions to the test set,
and optionally reports variable importances.
Usage
run_automl_pipeline(
  df,
  target,
  config = get_automl_config(),
  ...,
  prop_train = 0.7,
  scale_numeric_features = FALSE,
  add_confidence_level = FALSE,
  plot_and_print_variable_importances = TRUE,
  nthreads = "half",
  shutdown_h2o = FALSE
)
Arguments
- df
data.frame
- target
(character) name of target variable
- config
Configuration object (from calling get_automl_config())
- ...
additional arguments passed to get_automl_model()
- prop_train
proportion of rows to be allocated to training
- scale_numeric_features
(logical) whether to scale numeric features (see scale_numeric_features_in_train_and_test())
- add_confidence_level
(logical) whether to add confidence levels alongside predictions in the test set (default FALSE)
- plot_and_print_variable_importances
(logical) whether to plot and print variable importances if they are available (default TRUE)
- nthreads
(character) how many threads (cores) to dedicate to the H2O cluster ("half", "minus1", or "all"; see init_h2o())
- shutdown_h2o
(logical) whether to shut down cluster at the end of the pipeline (default FALSE)
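The three nthreads options can be read as a mapping from a label to a thread count. The sketch below is one plausible interpretation, not the behavior of init_h2o() itself: the helper name resolve_nthreads and its n_cores argument are hypothetical, and the rounding and the floor of one thread are assumptions.

```r
# Hypothetical helper: one plausible reading of the "half", "minus1" and
# "all" options. This is NOT taken from init_h2o(); it is an assumption.
resolve_nthreads <- function(nthreads = c("half", "minus1", "all"),
                             n_cores = parallel::detectCores()) {
  nthreads <- match.arg(nthreads)
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(n_cores)) n_cores <- 1L
  n <- switch(nthreads,
    half   = n_cores %/% 2,  # integer half of the available cores
    minus1 = n_cores - 1,    # leave one core free
    all    = n_cores         # use every core
  )
  # Never request fewer than one thread.
  max(1L, as.integer(n))
}

resolve_nthreads("half", n_cores = 8)
resolve_nthreads("minus1", n_cores = 8)
```

Passing n_cores explicitly keeps the helper deterministic, which makes it easy to check on machines with different core counts.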
Value
list with elements:
- tt
train-test split constructed from df, with predictions and possibly confidence levels added to the test set
- automl
H2O AutoML model
- importances
Variable importances (if requested)
Details
The df data.frame is first prepared by removing non-ASCII characters,
casting all character variables to factors (if any), and removing blank factor
levels from factors. Then it is split into training and test sets, and if requested,
numeric features are scaled based on sample statistics in the training set. Next,
a local H2O cluster is launched, the train and test sets are uploaded to the cluster,
and an AutoML model is requested using parameters specified in the configuration
object. Predictions are added to the test set (with or without confidence levels),
and variable importances may be plotted and reported. Finally, if requested,
the H2O cluster is shut down.
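The preparation, splitting, and scaling steps described above can be approximated in base R. This is a minimal sketch, not the package's own code: the helper names (prepare_df, split_train_test, scale_with_train_stats) are hypothetical, the shape of the split (a list with train and test elements) is assumed, and "sample statistics" is assumed to mean the training-set mean and standard deviation.

```r
# Illustrative base R stand-ins; all helper names are hypothetical.

prepare_df <- function(df) {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      # Strip non-ASCII characters, then cast the column to a factor.
      ascii <- gsub("[^\\x01-\\x7F]", "", df[[col]], perl = TRUE)
      df[[col]] <- factor(ascii)
    }
    if (is.factor(df[[col]])) {
      # Drop the blank ("") level; affected values become NA.
      df[[col]] <- factor(df[[col]], levels = setdiff(levels(df[[col]]), ""))
    }
  }
  df
}

split_train_test <- function(df, prop_train = 0.7) {
  n_train <- round(prop_train * nrow(df))
  train_idx <- sample.int(nrow(df), n_train)
  test_idx <- setdiff(seq_len(nrow(df)), train_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

scale_with_train_stats <- function(tt, target) {
  is_num <- vapply(tt$train, is.numeric, logical(1))
  num_cols <- setdiff(names(tt$train)[is_num], target)
  for (col in num_cols) {
    # Assumption: z-score scaling with statistics from the training set only,
    # applied unchanged to the test set.
    mu <- mean(tt$train[[col]], na.rm = TRUE)
    sigma <- sd(tt$train[[col]], na.rm = TRUE)
    tt$train[[col]] <- (tt$train[[col]] - mu) / sigma
    tt$test[[col]] <- (tt$test[[col]] - mu) / sigma
  }
  tt
}

set.seed(1)
raw <- data.frame(
  city = c("Z\u00fcrich", "Oslo", "", "Lima", "Oslo", "Lima", "Oslo", "Lima"),
  size = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9, 3.3, 1.8),
  y = c(0, 1, 0, 1, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)
prepared <- prepare_df(raw)
tt <- split_train_test(prepared, prop_train = 0.75)
tt_scaled <- scale_with_train_stats(tt, target = "y")
```

Computing the scaling statistics on the training rows alone keeps information from the test set out of the features the model is trained on.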
If no configuration object is provided, the default configuration
returned by get_automl_config() is used.
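A call that relies on the default configuration might look like the sketch below. It assumes the package providing run_automl_pipeline() is attached and that the h2o package is installed; the guard skips the call otherwise. The choice of the built-in iris data and of "Species" as the target is purely illustrative.

```r
# Sketch only: runs the pipeline if run_automl_pipeline() and h2o are
# available, and is skipped otherwise.
df <- iris
target <- "Species"

if (exists("run_automl_pipeline") && requireNamespace("h2o", quietly = TRUE)) {
  result <- run_automl_pipeline(
    df,
    target,
    prop_train = 0.7,
    shutdown_h2o = TRUE  # release the local cluster when the pipeline ends
  )
  # Elements documented under Value: tt, automl, importances
  print(names(result))
  str(result$tt, max.level = 1)
}
```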