# `R` packages for estimation of the causal (in)direct effects
```{r}
#| label: load-renv
#| echo: false
#| message: false
renv::autoload()
library(here)
```
We'll now turn to working through a few examples of estimating the natural and interventional direct and indirect effects. We will be using the [`crumble` `R` package](https://cran.r-project.org/web/packages/crumble/index.html), which provides a unified framework for estimating many of the common mediation estimands, and supports high-dimensional exposures, mediators, and mediator-outcome confounders. However, many software implementations exist in the `R` ecosystem for performing mediation analyses; notably, `crumble` was heavily inspired by [`medoutcon`](https://github.com/nhejazi/medoutcon), [`HDmediation`](https://github.com/nt-williams/HDmediation), and [`lcm`](https://github.com/nt-williams/lcm).
As our running example, we'll use a simple data set from an observational study of the relationship between BMI and kids' behavior, freely distributed with the [`mma` `R` package on CRAN](https://CRAN.R-project.org/package=mma). First, let's load the packages we'll be using and set a seed for reproducibility; then, load this data set and take a quick look.
```{r}
#| label: setup
#| code-fold: false
library(crumble)
library(mlr3extralearners)
library(fastDummies)
library(torch)
set.seed(4235243)
torch_manual_seed(657)
# load and examine data
data(weight_behavior, package = "mma")
# drop missing values
weight_behavior <- na.omit(weight_behavior)
# dummy code (aka one-hot-encode) factors
weight_behavior <- dummy_cols(
  weight_behavior, c("sex", "sports", "snack"),
  remove_selected_columns = TRUE,
  remove_first_dummy = TRUE
)
head(weight_behavior)
```
The documentation for the data set describes it as a "database obtained from the Louisiana State University Health Sciences Center, New Orleans, by Dr. Richard Scribner. He explored the relationship between BMI and kids' behavior through a survey at children, teachers and parents in Grenada in 2014. This data set includes 691 observations and 15 variables." Note that the data set contained several observations with missing values, which we removed above to simplify the demonstration of our analytic methods. In practice, we recommend instead using appropriate corrections (e.g., imputation, inverse weighting) to fully take advantage of the observed data.
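Since we dropped incomplete rows, the analytic sample is smaller than the original 691 observations; it's worth checking the remaining sample size before proceeding:
```{r}
#| code-fold: false
# sample size after dropping rows with missing values
nrow(weight_behavior)
```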
Following the motivation of the original study, we focus on the causal effects of participating in a sports team (`sports_2`) on the BMI of children (`bmi`), taking into consideration several mediators (`snack_2`, `exercises`); `c("age", "sex_F", "tvhours")` are taken to be potential baseline confounders.
## `crumble`: Flexible and general mediation analysis
- `crumble` implements a one-step estimator of the natural, interventional, and organic (in)direct effects, as well as path-specific effects using recanting twins.
- The estimator is capable of accommodating flexible modeling strategies (e.g., ensemble machine learning) for the initial estimation of nuisance parameters.
- To this end, `crumble` integrates with the [`mlr3superlearner` R package](https://cran.r-project.org/package=mlr3superlearner).
- The estimator uses cross-fitting of nuisance parameters.
The data on a single observational unit can be represented as $O = (W, A, M, Y)$, with the data pooled across all participants denoted $O_1, \ldots, O_n$, for a sample of $n$ i.i.d. observations of $O$. Recall the DAG [from an earlier chapter](#estimands), which represents the data-generating process:
```{tikz}
#| fig-cap: Directed acyclic graph under *no intermediate confounders* of the mediator-outcome relation affected by treatment
\dimendef\prevdepth=0
\pgfdeclarelayer{background}
\pgfsetlayers{background,main}
\usetikzlibrary{arrows,positioning}
\tikzset{
>=stealth',
punkt/.style={
rectangle,
rounded corners,
draw=black, very thick,
text width=6.5em,
minimum height=2em,
text centered},
pil/.style={
->,
thick,
shorten <=2pt,
shorten >=2pt,}
}
\newcommand{\Vertex}[2]
{\node[minimum width=0.6cm,inner sep=0.05cm] (#2) at (#1) {$#2$};
}
\newcommand{\VertexR}[2]
{\node[rectangle, draw, minimum width=0.6cm,inner sep=0.05cm] (#2) at (#1) {$#2$};
}
\newcommand{\ArrowR}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) to[bend right=30] (#2);
\end{pgfonlayer}
}
\newcommand{\ArrowL}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) to[bend left=45] (#2);
\end{pgfonlayer}
}
\newcommand{\EdgeL}[3]
{ \begin{pgfonlayer}{background}
\draw[dashed,#3] (#1) to[bend right=-45] (#2);
\end{pgfonlayer}
}
\newcommand{\Arrow}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) -- +(#2);
\end{pgfonlayer}
}
\begin{tikzpicture}
\Vertex{-4, 0}{W}
\Vertex{0, 0}{M}
\Vertex{-2, 0}{A}
\Vertex{2, 0}{Y}
\Arrow{W}{A}{black}
\Arrow{A}{M}{black}
\Arrow{M}{Y}{black}
\ArrowL{W}{Y}{black}
\ArrowL{A}{Y}{black}
\ArrowL{W}{M}{black}
\end{tikzpicture}
```
- In `crumble()`, $A$ is denoted `trt`, $Y$ is denoted `outcome`, $M$ is denoted `mediators`, and $W$ is denoted `covar`.
## Estimating nuisance parameters
- Recall that the one-step estimator can be thought of as a combination of a weighted estimator and a regression estimator.
- We'd like to rely on flexible, data-adaptive strategies for nuisance parameter estimation.
- Doing so minimizes opportunities for model mis-specification to compromise our analytic conclusions.
### Super learning
- For the regression function parts of the EIF (e.g., $\E[Y\mid A,M,W]$), `crumble` uses the Super Learner algorithm for ensemble machine learning [@vdl2007super], which is implemented in the [`mlr3superlearner` R package](https://cran.r-project.org/package=mlr3superlearner).
- Below, we demonstrate the construction of an ensemble learner based on a limited library of algorithms, including an intercept model, a main terms GLM, LASSO ($\ell_1$-penalized) regression, and random forest (`ranger`).
```{r}
#| code-fold: false
#| eval: false
ensemble <- list(
  "mean",                          # intercept-only model
  "glm",                           # main-terms GLM
  "cv_glmnet",                     # lasso with cross-validated penalty
  list("ranger", num.trees = 200)  # random forest
)
```
- Of course, there are many alternative learning algorithms that could be included in such a modeling library; one possibility is sketched below. Feel free to explore!
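For instance, a richer library might look like the following sketch. This is purely illustrative: each name must correspond to a learner available in your `mlr3` installation (several of these are provided through `mlr3extralearners`), so check availability before use.
```{r}
#| code-fold: false
#| eval: false
# an illustrative, richer library; availability of each learner depends
# on your mlr3/mlr3extralearners installation
ensemble_big <- list(
  "mean",                           # intercept-only benchmark
  "glm",                            # main-terms GLM
  "cv_glmnet",                      # lasso with cross-validated penalty
  "earth",                          # multivariate adaptive regression splines
  list("ranger", num.trees = 500),  # random forest with more trees
  list("xgboost", nrounds = 100)    # gradient-boosted trees
)
```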
### Riesz representers
In a subset of the semi-parametric and double-machine learning literature, the weights in the EIF are sometimes referred to as *Riesz representers*. To understand why, let's imagine that we want to estimate the total effect (i.e., the ATE) of a binary treatment $A$ on some outcome $Y$ accounting for confounders $W$. Under standard identification assumptions, this effect is identified from the observed data as
$$
\psi = \E[\E[Y\mid A= 1,W] - \E[Y\mid A= 0,W]]\text{.}
$$
Let's denote $Q(a,W) = \E[Y\mid A=a,W]$, and let $m(O;Q)$ be a continuous linear mapping of $Q(a,W)$ and the observed data. In the case of the total effect, $m(O;Q) = Q(1,W) - Q(0,W)$. We can then re-write the total effect as
$$
\E[m(O;Q)]\text{.}
$$
Now, consider that the total effect can equivalently be identified as
$$
\E\bigg[\bigg\{\frac{\I(A=1)}{\P(A=1 \mid W)} - \frac{\I(A=0)}{1 - \P(A=1 \mid
W)}\bigg\}Y \bigg]\text{.}
$$
Let's denote $\frac{\I(A=1)}{\P(A=1 \mid W)} - \frac{\I(A=0)}{1 - \P(A=1 \mid
W)}$ as $\alpha_0(W)$. Then,
$$
\E[m(O;Q)] = \E[\alpha_0(W) \cdot Y]\text{.}
$$
According to the Riesz representation theorem, if $m(O;Q)$ is a continuous linear functional of $Q$ (which it is), then there exists a unique function $\alpha_0(W)$--the so-called Riesz representer--such that
$$
\E[m(O;Q)] = \E[\alpha_0(W) \cdot Q(A,W)]\text{.}
$$
Using the tower rule, we have
$$
\begin{align*}
\E[m(O;Q)] &= \E[\alpha_0(W) \cdot Y]\\
&= \E\big[\E[\alpha_0(W) \cdot Y\mid A,W]\big] \\
&= \E\big[\alpha_0(W) \cdot \E[Y\mid A,W]\big] \\
& = \E[\alpha_0(W) \cdot Q(A,W)]\text{.}
\end{align*}
$$
Thus, $\frac{\I(A=1)}{\P(A=1 \mid W)} - \frac{\I(A=0)}{1 - \P(A=1 \mid W)}$ is the Riesz representer for estimating the total effect of $A$ on $Y$!
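We can verify this identity numerically with a quick toy simulation (our own sketch, separate from the workshop data analysis), where both $Q$ and the propensity score are known by construction:
```{r}
#| code-fold: false
# toy simulation checking E[m(O; Q)] = E[alpha_0(W) * Y] for the ATE
set.seed(1)
n <- 1e5
W <- runif(n)
pscore <- plogis(W - 0.5)           # P(A = 1 | W)
A <- rbinom(n, 1, pscore)
Y <- rnorm(n, mean = A + W)         # Q(a, W) = a + W, so E[m(O; Q)] = 1
alpha_0 <- A / pscore - (1 - A) / (1 - pscore)
mean(alpha_0 * Y)                   # approximately 1
```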
While $\alpha_0(W)$ for the total effect has a nice closed-form solution that can be easily estimated with off-the-shelf software, this isn't always the case. With that in mind, @chernozhukov2021automatic showed that the Riesz representer is the minimizer of the so-called Riesz loss:
$$
\alpha_0 = \text{arg min}_{\alpha} \E[\alpha(W)^2 - 2\cdot m(O;\alpha)]\text{.}
$$
This means that we can directly estimate $\alpha_0(W)$ by minimizing the above loss function! For the case of mediation analysis, the methods we proposed in @liu2024general and implemented in `crumble` estimate Riesz representers of the type:
$$
\frac{\I(A_i=1)}{\hat{\P}(A_i=1 \mid W_i)}
\frac{\hat{\P}(M_i \mid A_i=0, W_i)}{\hat{\P}(M_i \mid A_i=1, W_i)} -
\frac{\I(A_i=0)}{\hat{\P}(A_i=0 \mid W_i)}\text{.}
$$
`crumble` estimates this value (and others) for mediational effects directly by minimizing the Riesz loss with deep learning using `torch`. **Much of this is abstracted away from the analyst**, but we still need to specify some type of deep-learning architecture for `crumble` to use. Here's an example of a simple multilayer perceptron (MLP) with two hidden layers, each with 10 units, and a dropout rate of 0.1, which we create using the `sequential_module()` function from `crumble`.
```{r}
#| code-fold: false
#| eval: false
mlp <- sequential_module(layers = 2, hidden = 10, dropout = 0.1)
```
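To build intuition for what minimizing the Riesz loss accomplishes, here's a toy sketch (our own illustration, not `crumble`'s internals) that recovers the ATE's Riesz representer by minimizing the empirical Riesz loss over a saturated linear basis with `optim()`:
```{r}
#| code-fold: false
# toy sketch (not crumble's internals): recover the ATE's Riesz
# representer by minimizing the empirical Riesz loss over a basis
set.seed(1)
n <- 1e4
W <- rbinom(n, 1, 0.5)                   # binary confounder
pscore <- plogis(-0.5 + W)               # true P(A = 1 | W)
A <- rbinom(n, 1, pscore)
basis <- function(a, w) cbind(1, a, w, a * w)  # saturated for binary (A, W)
X  <- basis(A, W)
X1 <- basis(1, W)                        # basis evaluated at A = 1
X0 <- basis(0, W)                        # basis evaluated at A = 0
# empirical Riesz loss: mean[alpha(A, W)^2 - 2 * m(O; alpha)], where
# m(O; alpha) = alpha(1, W) - alpha(0, W) for the ATE
riesz_loss <- function(theta) {
  mean((X %*% theta)^2 - 2 * ((X1 - X0) %*% theta))
}
theta_hat <- optim(rep(0, 4), riesz_loss, method = "BFGS")$par
alpha_hat <- as.vector(X %*% theta_hat)
alpha_true <- A / pscore - (1 - A) / (1 - pscore)
summary(alpha_hat - alpha_true)          # differences should be small
```
`crumble` applies the same idea, with a neural network in place of the linear basis and the mediational functionals above in place of $m(O;\alpha)$.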
### On cross-fitting
- When using data-adaptive regression estimators, it is recommended to use cross-fitted estimators.
- Cross-fitting is similar to cross-validation:
  - Randomly split the sample into $K$ (e.g., $K=10$) subsets of equal size.
  - For each subset, fit the regression models on the remaining $K-1$ subsets (e.g., 9/10ths of the sample).
  - Use these out-of-sample fits to predict for the held-out subset (e.g., the remaining 1/10th of the sample).
- Cross-fitting further reduces the bias of the estimators.
- Cross-fitting aids in guaranteeing the correctness of the standard errors and confidence intervals.
- Cross-fitting is implemented by default in the `R` packages that you will see next; an example of how to control it in `crumble` follows this list.
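In `crumble`, the number of cross-fitting folds is set via the `crossfit_folds` argument of `crumble_control()`. The examples below use `crossfit_folds = 1L` (i.e., no sample splitting) only to keep the workshop runtime short; a real analysis would use settings more like the following sketch (the particular values here are suggestions only):
```{r}
#| code-fold: false
#| eval: false
# illustrative settings for a real analysis (values are suggestions only)
control <- crumble_control(
  crossfit_folds = 10L, epochs = 100L, learning_rate = 0.01
)
```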
## Efficient estimation of the natural (in)direct effects
To start, we will consider estimation of the *natural* direct and indirect effects, which, we recall, are defined via the following decomposition:
$$
\E[Y_{1,M_1} - Y_{0,M_0}] =
\underbrace{\E[Y_{\color{red}{1},\color{blue}{M_1}} -
Y_{\color{red}{1},\color{blue}{M_0}}]}_{\text{natural indirect effect}} +
\underbrace{\E[Y_{\color{blue}{1},\color{red}{M_0}} -
Y_{\color{blue}{0},\color{red}{M_0}}]}_{\text{natural direct effect}}.
$$
Let's use the `crumble()` function to estimate the natural direct and indirect effect.
- We'll use the ensemble of the intercept model, a main terms GLM, LASSO ($\ell_1$-penalized) regression, and random forest we defined above (`ensemble`) for the regression parameters
- We'll use the multilayer perceptron we defined above (`mlp`) to estimate the Riesz representers.
- To indicate we want to estimate the natural (in)direct effects, we set `effect = "N"`.
- `crumble` was designed to also work with continuous exposures. Because of this, we need to specify the treatment regimes to create the contrasts for the (in)direct effects.
- `d0` and `d1` are functions that return the treatment assignment under the two regimes being contrasted. In our case, we want to estimate the effect of participating in a sports team; since `sports_2 = 1` encodes *not* being on a team, we set `d1` to always return 0 (i.e., participation) and `d0` to always return 1 (i.e., no participation).
- Want to learn more about specifying causal effects using these `d` functions? Check out the workshop, *Beyond the ATE*, happening after this one!
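As a preview, a `d` function simply maps the observed data to an assigned treatment value for each unit. For instance, a purely hypothetical dynamic regime (not used in our analysis) could assign sports participation only to older children:
```{r}
#| code-fold: false
#| eval: false
# hypothetical dynamic regime: assign participation (sports_2 = 0) to
# children older than 12, leaving others at their observed treatment
d_dynamic <- function(data, trt) {
  ifelse(data$age > 12, 0, data[[trt]])
}
```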
```{r}
#| label: natural-os
#| code-fold: false
#| eval: false
# compute one-step estimates of the natural effects
crumble(
data = weight_behavior,
trt = "sports_2",
outcome = "bmi",
covar = c("age", "sex_F", "tvhours"),
mediators = c("snack_2", "exercises"),
d0 = function(data, trt) rep(1, nrow(data)),
d1 = function(data, trt) rep(0, nrow(data)),
effect = "N",
learners = ensemble,
nn_module = mlp,
control = crumble_control(crossfit_folds = 1L, epochs = 10L, learning_rate = 0.01)
)
# ✔ Fitting outcome regressions... 1/1 folds [5.4s]
# ✔ Computing alpha n density ratios... 1/1 folds [7s]
#
# ══ Results `crumble()` ════════════════════════════════════════════════════════════════
#
# ── Direct Effect
# Estimate: -1.08
# Std. error: 0.26
# 95% Conf. int.: -1.6, -0.57
#
# ── Indirect Effect
# Estimate: 0.06
# Std. error: 0.05
# 95% Conf. int.: -0.05, 0.16
#
# ── Average Treatment Effect
# Estimate: -1.03
# Std. error: 0.3
# 95% Conf. int.: -1.61, -0.44
```
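Recalling our regime coding (`d1` corresponds to sports participation), these estimates suggest that participation lowers BMI by about one unit, with essentially all of the total effect operating directly rather than through snacking or exercise: the estimated natural indirect effect is small and its confidence interval covers zero.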
## Efficient estimation of the interventional (in)direct effects
Since our knowledge of the system under study is incomplete, we might worry that one (or more) of the measured variables is not a mediator but, in fact, an intermediate confounder affected by treatment. While the natural (in)direct effects are not identified in this setting, their interventional (in)direct counterparts are, as we saw in an earlier section. Recall that both types of effects are defined by static interventions on the treatment; the interventional effects are distinguished by their use of a stochastic intervention on the mediator to aid in their identification.
```{tikz}
#| fig-cap: Directed acyclic graph under intermediate confounders of the mediator-outcome relation affected by treatment
\dimendef\prevdepth=0
\pgfdeclarelayer{background}
\pgfsetlayers{background,main}
\usetikzlibrary{arrows,positioning}
\tikzset{
>=stealth',
punkt/.style={
rectangle,
rounded corners,
draw=black, very thick,
text width=6.5em,
minimum height=2em,
text centered},
pil/.style={
->,
thick,
shorten <=2pt,
shorten >=2pt,}
}
\newcommand{\Vertex}[2]
{\node[minimum width=0.6cm,inner sep=0.05cm] (#2) at (#1) {$#2$};
}
\newcommand{\VertexR}[2]
{\node[rectangle, draw, minimum width=0.6cm,inner sep=0.05cm] (#2) at (#1) {$#2$};
}
\newcommand{\ArrowR}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) to[bend right=30] (#2);
\end{pgfonlayer}
}
\newcommand{\ArrowL}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) to[bend left=45] (#2);
\end{pgfonlayer}
}
\newcommand{\EdgeL}[3]
{ \begin{pgfonlayer}{background}
\draw[dashed,#3] (#1) to[bend right=-45] (#2);
\end{pgfonlayer}
}
\newcommand{\Arrow}[3]
{ \begin{pgfonlayer}{background}
\draw[->,#3] (#1) -- +(#2);
\end{pgfonlayer}
}
\begin{tikzpicture}
\Vertex{0, -1}{Z}
\Vertex{-4, 0}{W}
\Vertex{0, 0}{M}
\Vertex{-2, 0}{A}
\Vertex{2, 0}{Y}
\ArrowR{W}{Z}{black}
\Arrow{Z}{M}{black}
\Arrow{W}{A}{black}
\Arrow{A}{M}{black}
\Arrow{M}{Y}{black}
\Arrow{A}{Z}{black}
\Arrow{Z}{Y}{black}
\ArrowL{W}{Y}{black}
\ArrowL{A}{Y}{black}
\ArrowL{W}{M}{black}
\end{tikzpicture}
```
Recall that the interventional (in)direct effects are defined via the decomposition:
$$
\E[Y_{1,G_1} - Y_{0,G_0}] =
\underbrace{\E[Y_{\color{red}{1},\color{blue}{G_1}} -
Y_{\color{red}{1},\color{blue}{G_0}}]}_{\text{interventional indirect effect}} +
\underbrace{\E[Y_{\color{blue}{1},\color{red}{G_0}} -
Y_{\color{blue}{0},\color{red}{G_0}}]}_{\text{interventional direct effect}}.
$$
- In our data example, we'll consider snacking (`snack_2`) as a potential intermediate confounder, since one might reasonably hypothesize that participation on a sports team affects subsequent snacking, which in turn could affect the mediator, the amount of exercise (`exercises`).
- We can easily estimate the interventional direct and indirect effects by setting `effect = "RI"` and specifying the intermediate confounders in the `moc` argument.
```{r}
#| code-fold: false
#| eval: false
# compute one-step estimates of the interventional effects
crumble(
data = weight_behavior,
trt = "sports_2",
outcome = "bmi",
covar = c("age", "sex_F", "tvhours"),
mediators = "exercises",
moc = "snack_2",
d0 = function(data, trt) rep(1, nrow(data)),
d1 = function(data, trt) rep(0, nrow(data)),
effect = "RI",
learners = ensemble,
nn_module = mlp,
control = crumble_control(crossfit_folds = 1L, epochs = 10L, learning_rate = 0.01)
)
# ✔ Permuting Z-prime variables... 1/1 tasks [1.3s]
# ✔ Fitting outcome regressions... 1/1 folds [7.1s]
# ✔ Computing alpha r density ratios... 1/1 folds [13.3s]
#
# ══ Results `crumble()` ════════════════════════════════════════════════════════════════
#
# ── Randomized Direct Effect
# Estimate: -1.12
# Std. error: 0.43
# 95% Conf. int.: -1.97, -0.27
#
# ── Randomized Indirect Effect
# Estimate: 0
# Std. error: 0.03
# 95% Conf. int.: -0.05, 0.05
```
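The story is consistent with the natural-effects analysis above: the randomized direct effect is of similar magnitude to the natural direct effect, while the randomized indirect effect through exercise is approximately null once snacking is treated as an intermediate confounder.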