Introduction

This tutorial demonstrates g-computation (also known as the g-formula) for estimating causal effects with time-varying confounders. We replicate the analysis from Naimi, Cole, and Kennedy (2017).

From our knowledge of the data-generating process, we know the true average causal effect to be 50.

Load Required Libraries

library(dplyr)
library(knitr)
library(kableExtra)

Study Setting

The empirical setting is to treat HIV with a therapy regimen (A) in two time periods (t = 0, t = 1). We measure the time-varying confounder, HIV viral load (Z), at times t = 0 and t = 1. Our outcome is the CD4 count (cells/mm³) observed at t = 2.

Objective: Estimate the causal effect of always being treated vs. never being treated on HIV outcomes.

Variables:

  • A0, A1: Treatment at times 0 and 1 (0 = no treatment, 1 = treatment)
  • Z0 = 1 for all subjects (high viral load at baseline); 1 indicates worse baseline condition.
  • Z1: Time-varying confounder (HIV viral load at time 1; 0 = low, 1 = high)
  • Y: Outcome (CD4 count at time 2); higher values indicate better immune function.

Causal structure: Z0 → A0 → Z1 → A1 → Y

Target estimand: E[Y^{1,1}] - E[Y^{0,0}] (always treated vs. never treated)

Data Setup

# Aggregated data from Table 1 of Naimi et al. (2017)
data <- data.frame(
  a0 = c(0, 0, 0, 0, 1, 1, 1, 1),        # Treatment at time 0
  z1 = c(0, 0, 1, 1, 0, 0, 1, 1),        # Confounder at time 1
  a1 = c(0, 1, 0, 1, 0, 1, 0, 1),        # Treatment at time 1
  y  = c(87.288, 112.107, 119.654, 144.842, 
         105.282, 130.184, 137.720, 162.832), # Mean CD4 count
  n  = c(209271, 93779, 60657, 136293, 
         134781, 60789, 93903, 210527)     # Number of subjects
)

kable(data, 
      caption = "Aggregated Study Data",
      col.names = c("A0", "Z1", "A1", "Y (mean)", "N (count)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Aggregated Study Data
A0 Z1 A1 Y (mean) N (count)
0 0 0 87.288 209271
0 0 1 112.107 93779
0 1 0 119.654 60657
0 1 1 144.842 136293
1 0 0 105.282 134781
1 0 1 130.184 60789
1 1 0 137.720 93903
1 1 1 162.832 210527
cat("Total sample size:", sum(data$n), "subjects\n")
## Total sample size: 1e+06 subjects
cat("True causal effect (from data generation process): 50\n")
## True causal effect (from data generation process): 50

Method 1: Nonparametric G-Computation

The nonparametric approach directly estimates the g-formula using empirical probabilities:

\[E[Y^{a_0, a_1}] = \sum_{z_1} E[Y | A_1 = a_1, Z_1 = z_1, A_0 = a_0] \cdot P(Z_1 = z_1 | A_0 = a_0)\]

Step 1: Estimate P(Z1 | A0)

# Calculate empirical probabilities P(Z1 | A0)
z1_probs <- data %>%
  group_by(a0, z1) %>%
  summarise(count = sum(n), .groups = 'drop') %>%
  group_by(a0) %>%
  mutate(
    total = sum(count),
    prob = count / total
  ) %>%
  select(a0, z1, prob)

kable(z1_probs, 
      caption = "Empirical Probabilities P(Z1 | A0)",
      col.names = c("A0", "Z1", "P(Z1 \\mid A0)"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Empirical Probabilities P(Z1 | A0)
A0 Z1 P(Z1 A0)
0 0 0.6061
0 1 0.3939
1 0 0.3911
1 1 0.6089

Step 2: Calculate Potential Outcomes

# Function to calculate potential outcome under intervention (a0, a1)
calculate_potential_outcome <- function(a0_val, a1_val) {
  
  # Get relevant probabilities P(Z1 | A0 = a0_val)
  probs <- z1_probs %>% filter(a0 == a0_val)
  
  # Get outcome values E[Y | A0 = a0_val, Z1, A1 = a1_val]
  outcomes <- data %>% 
    filter(a0 == a0_val, a1 == a1_val) %>%
    select(z1, y)
  
  # Combine and calculate weighted expectation
  result <- probs %>%
    left_join(outcomes, by = "z1") %>%
    summarise(
      potential_outcome = sum(prob * y),
      .groups = 'drop'
    ) %>%
    pull(potential_outcome)
  
  return(result)
}

# Calculate potential outcomes
always_treated <- calculate_potential_outcome(a0_val = 1, a1_val = 1)
never_treated <- calculate_potential_outcome(a0_val = 0, a1_val = 0)

# Results
nonparam_results <- data.frame(
  Intervention = c("Always treated (1,1)", "Never treated (0,0)"),
  Potential_Outcome = c(always_treated, never_treated)
)

nonparam_ate <- always_treated - never_treated

kable(nonparam_results, 
      caption = "Nonparametric G-Computation Results",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Nonparametric G-Computation Results
Intervention Potential_Outcome
Always treated (1,1) 150.0621
Never treated (0,0) 100.0370
cat("Nonparametric ATE:", round(nonparam_ate, 4), "\n")
## Nonparametric ATE: 50.0251

Method 2: Parametric G-Computation

The parametric approach uses regression models to estimate each component of the g-formula.

Step 1: Fit Regression Models

# Fit models for each component (using sample weights!)
z1_model <- glm(z1 ~ a0, 
                family = binomial(), 
                data = data, 
                weights = n)

a1_model <- glm(a1 ~ z1, 
                family = binomial(), 
                data = data, 
                weights = n)

y_model <- lm(y ~ a0 + z1 + a1 + a0*a1 + z1*a1, 
              data = data, 
              weights = n)

# Display model fit
cat("Model summaries:\n\n")
## Model summaries:
cat("Z1 Model (P(Z1 | A0)):\n")
## Z1 Model (P(Z1 | A0)):
print(summary(z1_model)$coefficients)
##               Estimate  Std. Error   z value Pr(>|z|)
## (Intercept) -0.4309479 0.002894342 -148.8932        0
## a0           0.8734707 0.004095766  213.2619        0
cat("\nA1 Model (P(A1 | Z1)):\n") 
## 
## A1 Model (P(A1 | Z1)):
print(summary(a1_model)$coefficients)
##               Estimate  Std. Error   z value Pr(>|z|)
## (Intercept) -0.8001587 0.003062011 -261.3180        0
## z1           1.6083822 0.004327695  371.6487        0
cat("\nY Model (E[Y | A0, Z1, A1]):\n")
## 
## Y Model (E[Y | A0, Z1, A1]):
print(summary(y_model)$coefficients)
##                 Estimate Std. Error      t value     Pr(>|t|)
## (Intercept) 8.727925e+01 0.02611352 3.342301e+03 8.951769e-08
## a0          1.801633e+01 0.03697234 4.872921e+02 4.211322e-06
## z1          3.240493e+01 0.03983439 8.134912e+02 1.511100e-06
## a1          2.485141e+01 0.04407823 5.638024e+02 3.145894e-06
## a0:a1       4.970271e-04 0.05220134 9.521347e-03 9.932675e-01
## z1:a1       2.901257e-01 0.05628599 5.154492e+00 3.563828e-02
cat("\nY Model R²:", round(summary(y_model)$r.squared, 4), "\n")
## 
## Y Model R²: 1

Step 2: G-Computation via Simulation

# Simulation-based g-computation
set.seed(123)
n_sim <- 100000

# Function to simulate potential outcomes under intervention
simulate_intervention <- function(a0_intervention, a1_intervention, n_sim) {
  
  # Generate Z1 under the intervention A0 = a0_intervention
  z1_sim <- rbinom(n = n_sim, 
                   size = 1, 
                   prob = predict(z1_model, 
                                 newdata = data.frame(a0 = a0_intervention), 
                                 type = "response"))
  
  # Predict Y under the full intervention (A0, A1) and simulated Z1
  y_sim <- predict(y_model, 
                  newdata = data.frame(
                    a0 = a0_intervention,
                    z1 = z1_sim,
                    a1 = a1_intervention
                  ))
  
  return(mean(y_sim))
}

# Simulate potential outcomes
always_treated_param <- simulate_intervention(a0_intervention = 1, 
                                            a1_intervention = 1, 
                                            n_sim = n_sim)

never_treated_param <- simulate_intervention(a0_intervention = 0, 
                                           a1_intervention = 0, 
                                           n_sim = n_sim)

param_ate <- always_treated_param - never_treated_param

# Results
param_results <- data.frame(
  Intervention = c("Always treated (1,1)", "Never treated (0,0)"),
  Potential_Outcome = c(always_treated_param, never_treated_param)
)

kable(param_results, 
      caption = "Parametric G-Computation Results",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Parametric G-Computation Results
Intervention Potential_Outcome
Always treated (1,1) 150.0866
Never treated (0,0) 100.0555
cat("Parametric ATE:", round(param_ate, 4), "\n")
## Parametric ATE: 50.031

Step 3: Additional Intervention Scenarios

# Explore other intervention scenarios
other_interventions <- data.frame(
  Scenario = c("Treated at time 0 only", "Treated at time 1 only", "Natural course"),
  A0 = c(1, 0, "Natural"),
  A1 = c(0, 1, "Natural"),
  Potential_Outcome = c(
    simulate_intervention(1, 0, n_sim),
    simulate_intervention(0, 1, n_sim),
    mean(predict(y_model, newdata = data.frame(
      a0 = sample(c(0,1), n_sim, replace = TRUE),
      z1 = rbinom(n_sim, 1, 0.5),  # Natural Z1 distribution
      a1 = rbinom(n_sim, 1, 0.5)   # Natural A1 distribution
    )))
  )
)

kable(other_interventions, 
      caption = "Additional Intervention Scenarios",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Additional Intervention Scenarios
Scenario A0 A1 Potential_Outcome
Treated at time 0 only 1 0 125.0354
Treated at time 1 only 0 1 125.0488
Natural course Natural Natural 124.8908

Comparison and Validation

# Compare all methods
comparison <- data.frame(
  Method = c("Nonparametric G-computation", 
             "Parametric G-computation", 
             "True Effect"),
  ATE = c(round(nonparam_ate, 4),
          round(param_ate, 4),
          50.0),
  Difference_from_Truth = c(round(nonparam_ate - 50, 4),
                           round(param_ate - 50, 4),
                           0)
)

kable(comparison, 
      caption = "Method Comparison",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Method Comparison
Method ATE Difference_from_Truth
Nonparametric G-computation 50.0251 0.0251
Parametric G-computation 50.0310 0.0310
True Effect 50.0000 0.0000
# Model validation: Check if parametric models recover empirical probabilities
cat("\nModel Validation:\n")
## 
## Model Validation:
cat("Empirical P(Z1=1|A0=0):", round(z1_probs$prob[z1_probs$a0==0 & z1_probs$z1==1], 4), "\n")
## Empirical P(Z1=1|A0=0): 0.3939
cat("Model P(Z1=1|A0=0):    ", round(predict(z1_model, data.frame(a0=0), type="response"), 4), "\n")
## Model P(Z1=1|A0=0):     0.3939
cat("Empirical P(Z1=1|A0=1):", round(z1_probs$prob[z1_probs$a0==1 & z1_probs$z1==1], 4), "\n")
## Empirical P(Z1=1|A0=1): 0.6089
cat("Model P(Z1=1|A0=1):    ", round(predict(z1_model, data.frame(a0=1), type="response"), 4), "\n")
## Model P(Z1=1|A0=1):     0.6089

Key Insights

  • Nonparametric vs Parametric: The nonparametric approach uses empirical probabilities (unbiased), while the parametric approach uses regression models (flexible but assumption-dependent).
  • G-computation algorithm:
    1. Fit models for P(Z1|A0), P(A1|Z1), E[Y|A0,Z1,A1]
    2. Simulate under interventions
    3. Average over confounders
    4. Compare potential outcomes
  • Time-varying confounding: Standard methods fail when confounders are affected by treatment. G-computation handles this by modeling the temporal sequence.

Conclusion

This tutorial demonstrated g-computation for estimating causal effects with time-varying confounders. Both nonparametric and parametric approaches successfully recovered the true causal effect when properly implemented with sample weights and correct model specification.