Find Maximum Effect Modifiers (RCT, Single Binary A) - Two Approaches

This function estimates effect modification in an RCT (with known randomization probability \(\alpha\)) for one of two target parameters, selected by rct_type:

If rct_type = "ate", we estimate a subject-level ATE, \(Q(1,W_i) - Q(0,W_i)\), then apply a TMLE-style update (with known \(\alpha\)) to obtain the influence function for the ATE. The function then performs a data-adaptive partition to find the subpopulations with the largest difference in ATE.
If rct_type = "incps", we use a two-stage incremental-propensity-shift approach, shifting the treatment probability from \(\alpha\) to \(\alpha + \delta\). We compute subject-level differences \(Q_{\alpha+\delta}(i) - Q_{\alpha}(i)\) and partition on this "shift effect."
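The two subject-level quantities described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the data are simulated, glm stands in for the sl3 outcome learner, the shifted mean is assumed to be \(p\,Q(1,W) + (1-p)\,Q(0,W)\) when treatment is assigned with probability \(p\), and the influence function is the standard plug-in form without the TMLE targeting step.

```r
# Simulated RCT data (all names and values here are hypothetical).
set.seed(1)
n <- 500
alpha <- 0.5   # known randomization probability
delta <- 0.1   # incremental shift
W <- rnorm(n)
A <- rbinom(n, 1, alpha)
Y <- 1 + 2 * A + 1.5 * A * (W > 0) + rnorm(n)
dat <- data.frame(W = W, A = A, Y = Y)

# Outcome regression Q(A, W); glm is a stand-in for mu_learner.
fit <- glm(Y ~ A * W, data = dat)
Q1 <- predict(fit, newdata = transform(dat, A = 1))
Q0 <- predict(fit, newdata = transform(dat, A = 0))

# rct_type = "ate": subject-level difference Q(1, W_i) - Q(0, W_i).
ate_i <- Q1 - Q0

# rct_type = "incps": assumed mean outcome when P(A = 1) equals p.
Q_shift <- function(p) p * Q1 + (1 - p) * Q0
shift_i <- Q_shift(alpha + delta) - Q_shift(alpha)

# Influence function for the ATE with known alpha (no targeting step).
H <- A / alpha - (1 - A) / (1 - alpha)
Qa <- ifelse(A == 1, Q1, Q0)
if_ate <- H * (Y - Qa) + ate_i - mean(ate_i)
```

Under the assumed form of the shifted mean, shift_i equals delta * ate_i, so in this sketch both settings rank subjects identically and differ only in scale.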

Note: For valid inference on the discovered subpopulations, perform sample splitting or cross-validation externally. In a single pass the same data are used both to find the regions and to test them, so the resulting p-values can be too optimistic.
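One way to set up the external sample splitting mentioned in the note is to build the training and validation folds yourself and pass each pair as at and av. The sketch below uses simulated data and base R only. The commented-out call shows how the folds might be passed to the function; the argument values there (delta = 0.1, min_obs = 20, a single sl3::Lrnr_glm learner) are illustrative assumptions, not recommended settings.

```r
# Simulated data with hypothetical column names W1, W2, A, Y.
set.seed(123)
n <- 200
dat <- data.frame(
  W1 = rnorm(n),
  W2 = rbinom(n, 1, 0.5),
  A  = rbinom(n, 1, 0.5)
)
dat$Y <- 1 + dat$A * (1 + dat$W2) + rnorm(n)

# Assign each row to one of k folds at random, with balanced fold sizes.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold f: train on all other folds (at), validate on fold f (av).
splits <- lapply(seq_len(k), function(f) {
  list(fold = f, at = dat[fold_id != f, ], av = dat[fold_id == f, ])
})

# Not run (requires the package and sl3):
# res <- lapply(splits, function(s) find_max_effect_mods_rct(
#   at = s$at, av = s$av, delta = 0.1, a_name = "A",
#   w_names = c("W1", "W2"), outcome = "Y", outcome_type = "continuous",
#   mu_learner = list(sl3::Lrnr_glm$new()), alpha = 0.5, seed = 123,
#   min_obs = 20, fold = s$fold, rct_type = "ate"
# ))
```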

find_max_effect_mods_rct(
  at,
  av,
  delta,
  a_name,
  w_names,
  outcome,
  outcome_type,
  mu_learner,
  alpha = NULL,
  top_n = 3,
  seed,
  min_obs,
  fold,
  max_depth = 2,
  pval_thresh = 0.05,
  rct_type = c("ate", "incps")
)

Arguments

at: A training fold data.frame, with columns w_names, a_name, outcome.
av: A validation fold data.frame (or the same set, if single pass).
delta: A numeric scalar for the incremental coverage shift, from \(\alpha\) to \(\alpha + \delta\).
a_name: Name of the binary exposure (e.g. "A").
w_names: Character vector of baseline covariate names.
outcome: Name of the outcome variable.
outcome_type: One of "continuous", "binary", or "count" (for sl3 tasks).
mu_learner: A list of sl3 learners for the outcome regression.
alpha: Known randomization probability; if NULL, it is estimated from at.
top_n: Number of top rules to return from the partition search.
seed: Random seed for reproducibility.
min_obs: Minimum number of observations in a valid split branch.
fold: Label for fold index (for cross-validation).
max_depth: Maximum depth of the partition search tree.
pval_thresh: p-value threshold for accepting a split.
rct_type: Either "ate" or "incps".

Value

A list with:

K_fold_EM_results: Data frame with 2 rows per discovered region: one for \(V\), one for \(V^c\).
av_q_estimates: Either the subject-level ATE difference (\(Q(1)-Q(0)\)) or the incremental-shift difference, for each subject in the validation set.
av_hn_estimates: The corresponding influence function (or the difference of the shift influence functions) for each subject in the validation set.
q_region_v: Vector of av_q_estimates in the discovered region \(V\) (for the first discovered partition).
q_region_vc: Vector of av_q_estimates outside that region (complement).
g_region_v: Vector of av_hn_estimates in \(V\).
g_region_vc: Vector of av_hn_estimates in \(V^c\).
data_region_v: The rows in \(V\).
data_region_vc: The rows in \(V^c\).
data: The full validation set (with appended columns) for post-hoc usage.
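As a rough illustration of post-hoc use, the region vectors can be summarised into an estimate and a Wald-type interval for \(V\) and \(V^c\). The sketch below works on a mock list with simulated values, not on real output, and it assumes that the g_region_* vectors hold influence-function values whose sample standard deviation divided by \(\sqrt{n}\) gives a standard error for the mean of the matching q_region_* vector. The helper summarise_region is hypothetical and not part of the package.

```r
# Mock object mimicking four elements of the documented return value.
set.seed(42)
res <- list(
  q_region_v  = rnorm(60,  mean = 2.0, sd = 0.3),
  q_region_vc = rnorm(140, mean = 0.5, sd = 0.3),
  g_region_v  = rnorm(60,  sd = 1.5),
  g_region_vc = rnorm(140, sd = 1.5)
)

# Hypothetical helper: region mean with an influence-function-based
# standard error and a 95% Wald interval.
summarise_region <- function(q, g) {
  n <- length(q)
  est <- mean(q)
  se <- sd(g) / sqrt(n)
  c(n = n, estimate = est, se = se,
    lower = est - 1.96 * se, upper = est + 1.96 * se)
}

region_summary <- rbind(
  V  = summarise_region(res$q_region_v,  res$g_region_v),
  Vc = summarise_region(res$q_region_vc, res$g_region_vc)
)
print(round(region_summary, 3))
```

As the note above says, intervals computed this way are only trustworthy when the region was discovered on data separate from the data used to evaluate it.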