Effect Modification Identification and Estimation using Data-Adaptive Stochastic Interventions

library(EffectXshift)
library(devtools)
#> Loading required package: usethis
library(kableExtra)
library(sl3)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:kableExtra':
#> 
#>     group_rows
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

seed <- 4291531
set.seed(seed)

Motivation

The motivation behind the package EffectXshift is to answer the question, “what subpopulations of individuals have the largest difference in impact in the event of an exposure intervention.” Researchers are often interested in understanding vulnerable subpopulations, or individuals with certain characteristics like sex or age, that make them more susceptible to the impact of a drug or pollutant.

Background

Methods like R-learners, S-learners, T-learners, and X-learners, have been pivotal in estimating Conditional Average Treatment Effects (CATE) across diverse settings. While S-learners apply a single model to the entire data, incorporating treatment as a feature, T-learners use separate models for treated and control groups to capture distinct patterns. X-learners, designed for settings with imbalanced treatment groups, first estimate treatment effects for both groups separately and then use the information from one group to refine the estimates for the other.

Despite their effectiveness, these methods often act as black boxes, offering limited insights into which specific subpopulations are most affected by a treatment or exposure. Additionally, many of these methods are limited to only estimating the differential effects of a binary treatment/exposure. Our work with EffectXshift addresses this gap by providing an interpretable framework that not only estimates the impact of interventions but also identifies subpopulations with the maximal effect difference. This approach enhances our understanding of treatment effects, facilitating targeted interventions and personalized treatment strategies.

Relationship to Other Methods

EffectXshift sits between stochastic-intervention estimators, heterogeneous-treatment-effect methods, and exposure-mixture methods. Stochastic-intervention estimators define and estimate mean outcomes under modified exposure policies (Dı́az and van der Laan 2012; Haneuse and Rotnitzky 2013), but typically do not search for the exposure and covariate region with maximal effect modification. Honest causal trees/forests and meta-learners focus on heterogeneous effects (Athey and Imbens 2016; Wager and Athey 2018; Künzel et al. 2019), but most commonly target binary or simpler treatments rather than additive stochastic shifts of continuous mixture components. Exposure-mixture methods such as quantile g-computation estimate overall mixture effects or component contributions (Keil et al. 2020), while EffectXshift searches for interpretable exposure-region pairs where the stochastic-shift effect is most different.

This data-adaptive search is useful but should be treated carefully. Users should review fold-level selection stability and positivity diagnostics before interpreting pooled estimates. Consistent exposure/rule selection across folds supports a more stable finding; large clever covariates indicate weak support for the requested shift.

Notes for Trialists

For a randomized trial with one binary treatment, set rct = TRUE and pass one 0/1 treatment column. In rct_type = "ate" mode, EffectXshift asks which baseline covariate-defined subgroup has the largest treatment-effect difference against its complement. It estimates (Q(1, W) - Q(0, W)), searches for the region V that maximizes the differential treatment effect, and reports held-out estimates for V, V^c, and the V - V^c contrast.

Trialists should pass the known allocation probability with alpha whenever it is known, especially under unequal randomization. If alpha = NULL, the probability is estimated within each training fold from the observed treatment proportion. Use target = "risk" only for a prognostic high-risk subgroup analysis; it reports held-out outcome risk and is not a treatment-effect modification estimand. Current scope is one binary treatment and a marginal randomization probability; cluster-randomized, adaptive, crossover, or strongly stratified designs may require design-specific handling.

For reporting, include treatment coding, the allocation probability, eligible baseline covariates, n_folds, min_obs, max_depth, pval_thresh, fold-level selected rules, the V, V^c, and V - V^c estimates with confidence intervals, selected-region arm counts and observed outcome summaries from Trial Region Diagnostics, and whether the analysis was prespecified or exploratory.

For fixed-time event endpoints, define the time point () and endpoint scale before subgroup discovery. The current randomized-trial workflow expects a fully observed scalar endpoint, for example event by (), survival at (), or a continuous endpoint at a visit. Informative censoring before () requires a censoring-adjusted estimator; do not code censored subjects as event-free unless that composite endpoint is the intended estimand. A time-to-event extension should estimate a censoring survival model, construct an IPCW/AIPW/TMLE pseudo-outcome for the fixed-time risk or survival contrast, and then search for effect-modifying baseline regions using that adjusted pseudo-outcome (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use 2019; Moore and van der Laan 2009; Brooks et al. 2013).

Target Parameter

EffectXshift uses data-adaptive machine learning methods to identify the exposure-covariate combination that has the maximum difference in the impact of intervention.

Given a set of exposures $\boldsymbol{A}$, a covariate space $W$, and a particular outcome $Y$, our goal is to identify the exposure $A_i \in \boldsymbol{A}$ and region $V \subseteq W$ that demonstrate the maximal difference in the stochastic-shift effect, compared across $V$ and its complementary region $V^c$. Our target parameter looks like:

\[\begin{align*} (\hat A_i, \hat V) &= \underset{A_i \in \boldsymbol{A}, \, V \subseteq W}{\arg\max} \left\{ \Psi_i(V) - \Psi_i(V^c) \right\}, \\ \Psi_i(V) &= E\left[ E\{Y \mid A_i + \delta_i, A_{\setminus i}, W\} - E\{Y \mid A_i, A_{\setminus i}, W\} \mid W \in V \right]. \end{align*}\]

$$ Here, $\delta_i$ denotes the additive intervention on exposure $A_i$, $\Psi_i(V)$ is the stochastic-shift effect in region $V$, and $V^c$ represents the complementary set of $V$ within $W$.

To elucidate the EffectXshift algorithm, we start by fitting an outcome model utilizing the Super Learner ensemble methodology. This model is tasked with predicting the outcome variable, (Y), based on a set of exposures, (), and covariates, (). The Super Learner algorithm, a machine learning ensemble, optimally combines predictions from a pre-specified library of candidate models to minimize prediction error, thereby offering a flexible and robust approach to outcome modeling.

Subsequently, for each exposure (A_i) within the vector of exposures (), we apply a shift, denoted by (_i), and predict the expected outcome under this perturbation, yielding (E[Y | A_i + _i, ]). This process is repeated for the baseline scenario without the shift to compute (E[Y | , ]). The difference between these two expected outcomes, (E[Y | A_i + _i, ] - E[Y | , ]), constitutes our individual treatment effect (ITE) vectors, capturing the expected change in outcome attributable to the intervention (_i) on exposure (A_i).

To identify subpopulations with maximally differential responses to the intervention, we employ a custom decision tree algorithm that partitions the covariate space () in a manner that maximizes the average difference in population intervention effects between the resulting regions. This tree-based partitioning is performed recursively, with splits chosen to greedily maximize the difference in mean ITE between subgroups defined by the covariates within (). This approach aims to approximate the oracle target parameter: the region in the covariate space where the intervention exhibits the maximal differential effect.

The final output of the EffectXshift algorithm is a set of rules defining covariate-based subpopulations alongside estimates of the population intervention effect within these subpopulations. This not only provides insights into the heterogeneity of treatment effects across different segments of the population but also guides the identification of individuals who stand to benefit most from targeted interventions, thus embodying a significant stride towards personalized treatment strategies.

Doubly Robust Estimation

The above g-computation approach to estimation the population intervention effect which is then regressed onto covariates using our custom decision tree to find the maximal subpopulation intervention effect is the data-adaptive discovery portion of the method. What comes out is an exposure-covariate region combination for each fold. We then estimate stochastic interventions, shifting the exposure in subregions of the covariate space found and applying targeted learning to debias our initial estimates. In order to get proper inference for the subpopulation intervention effect in the covariate regions we use targeted learning. Briefly, we fit a propensity score estimator to estimate the likelihood of exposure levels given covariates under the observed exposure level and intervened level, the ratio of these likelihoods consitutes the clever covariate for the influence curve of a stochastic intervention. We use this clever covariate to update our initial estimates of the expected outcome under shift which initially is done using Super Learner. This targeting step debiases our estimates and we are able to get confidence intervals.

Sample Splitting

Because we don’t know which exposure-covariate regions to estimate our effect modification parameter a priori, we need to split our data and in the training fold find the exposure-covariate region, using the above g-computation framework, and in an estimation sample, efficiently estimate the subpopulation intervention effect parameter for the identified exposure in the covariate regions so that we can get valid confidence intervals for these data-adaptively identified interactions. As discussed above, is done using targeted learning. This sample splitting is through a cross-validation framework, ensuring the identification of our oracle parameter, the maximizing exposure-covariate region, is independent of the estimation, given we don’t know the maximum effect modifier beforehand.

Targeted Learning

We use targeted minimum loss based estimation (TMLE) to debias our initial outcome estimates given a shift in exposure such that the resulting estimator is asymptotically unbiased and has the smallest variance. This procedure involves constructing a least favorable submodel which uses a ratio of conditional exposure densities under shift and now shift as a covariate to debias the initial estimates. More details of targeted learning for stochastic shifts are here Dı́az and van der Laan (2012)

Data Adaptive Delta

For each exposure, the user inputs a respective delta, for example for exposures $M1, M2, M3$ the analyst puts in the deltas vector

deltas <- c("M1" = 1, "M2" = 2.3, "M3" = 1.4)

Which assigns a delta shift amount to each exposure. However, because the user doesn’t know the underlying experimentation in the data, a delta that is too big may result in positivity violations which leads to bias and high variance of the estimator. That is, if the user asks for a shift in exposure density that is very unlikely given the data. To solve this, the analyst can also choose to set adaptive_delta = TRUE which then makes the shift amount a data adaptive parameter as well. The delta will be reduced until the max of the ratio of densities for each exposure is less than or equal to hn_trunc_thresh.

Inputs and Outputs

EffectXshift takes in covariates, exposures, an outcome, a list of deltas to shift each exposure by, the estimator (tmle or one step) number of folds for the CV procedure, parallelization parameters and if the delta should be data-adaptive. Also the user defines top_n which is the number of ranked effect modification relationships, where 1 is the max.

The package then outputs K-fold specific results, the result found in each fold, and the oracle result which is the parameter that is pooled across all the folds. For example, we will get the fold specific modification results for the rank 1 exposure-covariate region found to have the highest modification. The pooled parameter is the oracle rank 1, meaning we do a pooled TMLE for all the rank 1 findings across the folds.

If the same exposure-covariate reguion are used across the folds, the pooled estimate is interpretable and has greater power compared to k-fold. If not, the analyst must investigate the k-fold specific results and report consistency of findings.

NHANES Data

Here we will try and replicate the analysis by Gibson et. al.:

https://ehjournal.biomedcentral.com/articles/10.1186/s12940-019-0515-1

and Mitro et. al, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4858394/

Who used the NHANES 2001-2002 data to investigate 18 POP exposures on telomere length.

In this NHANES example, we will investigate if any of the included baseline covariates modify the impact of intervening on any of the persistant organic pollutants.

Let’s first load the data and investigate the variables:

data("NHANES_eurocim")

exposures <- c(
  "LBX074LA", # PCB74 Lipid Adj (ng/g)
  "LBX099LA", # PCB99 Lipid Adj (ng/g)
  "LBX118LA", # PCB118 Lipid Adj (ng/g)
  "LBX138LA", # PCB138 Lipid Adj (ng/g)
  "LBX153LA", # PCB153 Lipid Adj (ng/g)
  "LBX170LA", # PCB170 Lipid Adj (ng/g)
  "LBX180LA", # PCB180 Lipid Adj (ng/g)
  "LBX187LA", # PCB187 Lipid Adj (ng/g)
  "LBX194LA", # PCB194 Lipid Adj (ng/g)
  "LBXD03LA", # 1,2,3,6,7,8-hxcdd Lipid Adj (pg/g)
  "LBXD05LA", # 1,2,3,4,6,7,8-hpcdd Lipid Adj (pg/g)
  "LBXD07LA", # 1,2,3,4,6,7,8,9-ocdd Lipid Adj (pg/g)
  "LBXF03LA", # 2,3,4,7,8-pncdf Lipid Adj (pg/g)
  "LBXF04LA", # 1,2,3,4,7,8-hxcdf Lipid Adj (pg/g)
  "LBXF05LA", # 1,2,3,6,7,8-hxcdf Lipid Adj (pg/g)
  "LBXF08LA", # 1,2,3,4,6,7,8-hxcdf Lipid Adj (pg/g)
  "LBXHXCLA", # 3,3',4,4',5,5'-hxcb Lipid Adj (pg/g)
  "LBXPCBLA"
) # 3,3',4,4',5-pcnb Lipid Adj (pg/g)

NHANES_eurocim <- NHANES_eurocim[complete.cases(NHANES_eurocim[, exposures]), ]

outcome <- "TELOMEAN"

covariates <- c(
  "LBXWBCSI", # White blood cell count (SI)
  "LBXLYPCT", # Lymphocyte percent (%)
  "LBXMOPCT", # Monocyte percent (%)
  "LBXEOPCT", # Eosinophils percent (%)
  "LBXBAPCT", # Basophils percent (%)
  "LBXNEPCT", # Segmented neutrophils percent (%)
  "male", # Sex
  "age_cent", # Age at Screening, centered
  "race_cat", # race
  "bmi_cat3", # Body Mass Index (kg/m**2)
  "ln_lbxcot", # Cotinine (ng/mL), log-transformed
  "edu_cat"
) # Education Level - Adults 20+

To improve consistency in the region we find to be the maximizing region, it is best to remove an exposure of highly correlated sets, we do that here:

# Calculate the correlation matrix for the exposures
cor_matrix <- cor(NHANES_eurocim[, exposures], use = "complete.obs")

# Set a threshold for high correlation
threshold <- 0.75

# Find pairs of highly correlated exposures
highly_correlated_pairs <- which(abs(cor_matrix) > threshold & lower.tri(cor_matrix), arr.ind = TRUE)

# Initiate a vector to keep track of exposures to remove
exposures_to_remove <- c()

# Loop through the highly correlated pairs and decide which exposure to remove
for (pair in seq_len(nrow(highly_correlated_pairs))) {
  row <- highly_correlated_pairs[pair, "row"]
  col <- highly_correlated_pairs[pair, "col"]

  if (!(colnames(cor_matrix)[row] %in% exposures_to_remove)) {
    exposures_to_remove <- c(exposures_to_remove, colnames(cor_matrix)[row])
  }
}

# Keep only uncorrelated exposures
exposures_to_keep <- setdiff(exposures, exposures_to_remove)

Run EffectXshift


ptm <- proc.time()

w <- NHANES_eurocim[, covariates]
a <- NHANES_eurocim[, exposures_to_keep]
y <- NHANES_eurocim$TELOMEAN

deltas <- a %>%
  summarise(across(everything(), sd, na.rm = TRUE)) %>%
  as.list()

# Make the deltas negative
deltas <- lapply(deltas, floor)
# Assign the names to the list
names(deltas) <- names(a)


NHANES_results <- EffectXshift(
  w = w,
  a = a,
  y = y,
  deltas = deltas,
  n_folds = 5,
  outcome_type = "continuous",
  parallel = TRUE,
  parallel_type = "multi_session",
  num_cores = 6,
  seed = seed,
  adaptive_delta = FALSE,
  top_n = 1,
  min_obs = 300,
  density_classification = TRUE,
  max_depth = 2
)

proc.time() - ptm

Investigating Results

One note is that for this example we will get more variability than normal because we only run 5 fold CV due to computational time.

Let’s first look at the results for each fold:


k_fold_results <- NHANES_results$`Effect Modification K-Fold Results`

k_fold_results %>%
  kableExtra::kbl(caption = "K-Fold Effect Modification Results") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Psi is the comparison of the expected outcome under shift intervention compared to the mean outcome in each strata of the covariate. Variance estimates are from the influence function for stochastic interventions.

Above, the first row in each fold corresponds to the region v and the second row vc for the pooled results. The pooled results, leveraging the full data is below:

pooled_results <- rbind(NHANES_results$`Effect Modification Region V Pooled Results`, NHANES_results$`Effect Modification Region V^c Pooled Results`)

pooled_results %>%
  kableExtra::kbl(caption = "Pooled TMLE Results") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Of course, pooled results are only interpretable if the same exposure-covariate region was found in the definition of v and vc. When results are consistent, this indicates a shift in the same exposure in the same region of the covariate space and we are able to leverage more data in the estimation folds to get more precise estimates.

Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60. https://doi.org/10.1073/pnas.1510489113.

Brooks, Jordan C, Mark J van der Laan, Daniel E Singer, and Alan S Go. 2013. “Targeted Minimum Loss-Based Estimation of Causal Effects in Right-Censored Survival Data with Time-Dependent Covariates: Warfarin, Stroke, and Death in Atrial Fibrillation.” Journal of Causal Inference 1 (2): 235–54. https://doi.org/10.1515/jci-2013-0001.

Dı́az, Iván, and Mark J van der Laan. 2012. “Population Intervention Causal Effects Based on Stochastic Interventions.” Biometrics 68 (2): 541–49.

Haneuse, Sebastian, and Andrea Rotnitzky. 2013. “Estimation of the Effect of Interventions That Modify the Received Treatment.” Statistics in Medicine 32 (30): 5260–77.

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. 2019. “ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials.” https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf.

Keil, Alexander P, Jessie P Buckley, Katie M O’Brien, Kelly K Ferguson, Shanshan Zhao, and Alexandra J White. 2020. “A Quantile-Based g-Computation Approach to Addressing the Effects of Exposure Mixtures.” Environmental Health Perspectives 128 (4): 047004. https://doi.org/10.1289/EHP5838.

Künzel, Sören R, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences 116 (10): 4156–65. https://doi.org/10.1073/pnas.1804597116.

Moore, Kelly L, and Mark J van der Laan. 2009. “Increasing Power in Randomized Trials with Right Censored Outcomes Through Covariate Adjustment.” Journal of Biopharmaceutical Statistics 19 (6): 1099–1131. https://doi.org/10.1080/10543400903243017.

Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42. https://doi.org/10.1080/01621459.2017.1319839.

David McCoy

2026-06-24