
This function replaces the target values in the test set with NA and stores the original values, mapped to their unique IDs, in a new variable in the global environment. This helps prevent target leakage (i.e. accidentally using the target value during testing) while keeping the true values available for later evaluation.

Usage

remove_target_from_test_and_add_ref_to_env(tt, target, unique_id, ref_name)

Arguments

tt

train-test list (see ttsplit())

target

(character) name of target variable

unique_id

(character) name of unique ID variable

ref_name

(character) name of new variable that stores mapping between ID and target

Value

The train-test list tt, with target replaced by NA in $test. As a side effect, a variable named ref_name holding the ID-to-target mapping is added to the global environment.
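
The behaviour described above can be sketched in base R. This is a hypothetical re-implementation for illustration only, not the package's source: the name sketch_remove_target, the envir argument, and the toy data are assumptions, and tt is assumed to be a list with data frames in $train and $test.

```r
# Hypothetical sketch, not the package's actual implementation.
sketch_remove_target <- function(tt, target, unique_id, ref_name,
                                 envir = globalenv()) {
  # Back up the ID-to-target mapping before touching the test set.
  ref <- tt$test[, c(unique_id, target), drop = FALSE]
  assign(ref_name, ref, envir = envir)
  # Blank out the target in the test set only; the train set is unchanged.
  tt$test[[target]] <- NA
  tt
}

# Toy train-test list (made-up values).
toy <- data.frame(id = 1:5,
                  x  = c(0.1, 0.4, 0.2, 0.9, 0.5),
                  y1 = c(1, 0, 1, 1, 0))
tt <- list(train = toy[1:3, ], test = toy[4:5, ])

# Use a scratch environment here so the sketch leaves the global one alone.
env <- new.env()
tt <- sketch_remove_target(tt, "y1", "id", "target_values", envir = env)
```

Writing to the global environment is a side effect, so the sketch exposes the destination as an argument; the documented function always writes to the global environment.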

Note

This function implements the method described in Preventing Target Leakage (D22 QuantCafé, 2021).

Examples

tt <- df |>
   dplyr::mutate(id = seq_len(nrow(df))) |>
   ttsplit() |>
   remove_target_from_test_and_add_ref_to_env("y1", "id", "target_values")
#> [1] "Split df into list with: train, test (proportion train = 0.7)"
#> [1] "Target column y1 replaced with NA in test set"
#> [1] "Added reference table 'target_values' to global environment"
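
Once a model has produced predictions for the test set, the reference table can be joined back by ID to score them. The following is a minimal sketch under stated assumptions: the reference table is assumed to be a data frame with the ID and target columns, and test_preds, its pred column, and all values shown are hypothetical stand-ins.

```r
# Stand-in for the reference table created by the function (made-up values).
target_values <- data.frame(id = c(4, 5, 6), y1 = c(1, 0, 1))

# Hypothetical model output: one prediction per test-set ID, in arbitrary order.
test_preds <- data.frame(id = c(6, 4, 5), pred = c(0, 1, 0))

# Join on the unique ID so row order in either table does not matter.
scored <- merge(test_preds, target_values, by = "id")

# Share of predictions that match the held-back target values.
accuracy <- mean(scored$pred == scored$y1)
```

Joining on the ID rather than relying on row position guards against silent misalignment if the test set is reordered or filtered between splitting and scoring.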