Introduction

This tutorial demonstrates g-methods in a simplified static (cross-sectional) setting, based on concepts from Naimi, Cole, and Kennedy (2017). By removing the time-varying complexity, we can focus on the core principles of the g-formula for causal inference.

In this tutorial, we will use the g-formula to estimate the average causal effect of treatment compared to no treatment. We’ll demonstrate both non-parametric and parametric approaches to g-computation.

Our Setting

We have a simple cross-sectional study examining the effect of HIV treatment (\(A\)) on CD4 count (\(Y\)). We observe a single baseline confounder, HIV viral load (\(Z\)), measured before treatment assignment. Unlike the longitudinal setting, all variables are measured at a single time point.

Our variables are:

  • \(Z\): HIV viral load at baseline (0 = low, 1 = high); 1 indicates worse baseline condition
  • \(A\): Treatment status (0 = no treatment, 1 = treatment)
  • \(Y\): CD4 count outcome; higher values indicate better immune function

We want to estimate the average causal effect of treatment (\(a = 1\)) compared to no treatment (\(a = 0\)). In this setup, the true average treatment effect (ATE) is 50.

G-Formula in Static Setting

The g-formula allows us to estimate the average potential outcome under a hypothetical treatment intervention. In the static setting, the g-formula simplifies to:

\[ E(Y^a) = \sum_z E(Y | A = a, Z = z) \cdot P(Z = z) \]

This is essentially a standardization (adjustment) formula in which we:

  1. Compute the expected outcome for each level of the confounder \(Z\)
  2. Weight by the marginal distribution of \(Z\) in the population
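With a binary confounder, the sum expands to just two terms:

\[ E(Y^a) = E(Y \mid A = a, Z = 0) \cdot P(Z = 0) + E(Y \mid A = a, Z = 1) \cdot P(Z = 1) \]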

Our estimand of interest is the average treatment effect: \[\tau = E[Y^{a=1}] - E[Y^{a=0}]\]

Data Setup

# Load required packages
library(dplyr)      # data manipulation
library(knitr)      # kable()
library(kableExtra) # kable_styling()

# Simplified data for static setting
# Z: Baseline HIV viral load (0 = low, 1 = high)
# A: Treatment status (0 = no treatment, 1 = treatment)
# Y: CD4 count outcome
# N: Number of subjects in each stratum

z <- c(0, 0, 1, 1)
a <- c(0, 1, 0, 1)
y <- c(100, 150, 80, 130)
n <- c(300, 200, 150, 350)

data <- data.frame(z, a, y, n)

# Display the data
kable(data, 
      caption = "Table 1: Cross-sectional Data",
      col.names = c("Z (Viral Load)", "A (Treatment)", "Y (CD4 Count)", "N (Sample Size)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 1: Cross-sectional Data

  Z (Viral Load)   A (Treatment)   Y (CD4 Count)   N (Sample Size)
               0               0             100               300
               0               1             150               200
               1               0              80               150
               1               1             130               350
# Calculate marginal distribution of Z
total_n <- sum(n)
z_marginal <- data %>%
  group_by(z) %>%
  summarise(total = sum(n), prob = sum(n) / total_n, .groups = 'drop')

kable(z_marginal,
      caption = "Table 2: Marginal Distribution of Z",
      col.names = c("Z (Viral Load)", "Count", "Probability"),
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 2: Marginal Distribution of Z

  Z (Viral Load)   Count   Probability
               0     500           0.5
               1     500           0.5

Non-parametric G-Formula

Treated Intervention (A = 1)

# E[Y^{a=1}] - always treated
# Get expected outcomes for treated group at each level of Z
treated_outcomes <- data %>%
  filter(a == 1) %>%
  left_join(z_marginal %>% select(z, prob), by = "z") %>%
  mutate(weighted_y = y * prob)

expected_treated <- sum(treated_outcomes$weighted_y)

kable(treated_outcomes %>% select(z, y, prob, weighted_y),
      caption = "Table 3: Treated Intervention Calculation",
      col.names = c("Z", "$E[Y\\mid A=1, Z]$", "$P(Z)$", "Weighted Contribution"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 3: Treated Intervention Calculation

  Z   \(E[Y\mid A=1, Z]\)   \(P(Z)\)   Weighted Contribution
  0                   150        0.5                      75
  1                   130        0.5                      65
cat("Expected outcome under treatment: E[Y^{a=1}] =", round(expected_treated, 4), "\n")
## Expected outcome under treatment: E[Y^{a=1}] = 140
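As a quick arithmetic check, plugging the values from Table 3 into the expanded formula gives the same answer:

# Hand calculation of E[Y^{a=1}] from the stratum means and P(Z)
150 * 0.5 + 130 * 0.5  # = 140, matching expected_treated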

Untreated Intervention (A = 0)

# E[Y^{a=0}] - never treated
# Get expected outcomes for untreated group at each level of Z
untreated_outcomes <- data %>%
  filter(a == 0) %>%
  left_join(z_marginal %>% select(z, prob), by = "z") %>%
  mutate(weighted_y = y * prob)

expected_untreated <- sum(untreated_outcomes$weighted_y)

kable(untreated_outcomes %>% select(z, y, prob, weighted_y),
      caption = "Table 4: Untreated Intervention Calculation", 
      col.names = c("Z", "$E[Y\\mid A=0, Z]$", "$P(Z)$", "Weighted Contribution"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 4: Untreated Intervention Calculation

  Z   \(E[Y\mid A=0, Z]\)   \(P(Z)\)   Weighted Contribution
  0                   100        0.5                      50
  1                    80        0.5                      40
cat("Expected outcome without treatment: E[Y^{a=0}] =", round(expected_untreated, 4), "\n")
## Expected outcome without treatment: E[Y^{a=0}] = 90

Results Summary

# Calculate the average treatment effect
ate_nonparam <- expected_treated - expected_untreated

# Create results summary
results_nonparam <- data.frame(
  Intervention = c("Treated (A=1)", "Untreated (A=0)", "Average Treatment Effect"),
  Expected_Outcome = c(round(expected_treated, 4), round(expected_untreated, 4), round(ate_nonparam, 4))
)

kable(results_nonparam,
      caption = "Table 5: Non-parametric G-Formula Results",
      col.names = c("Intervention", "Expected Outcome")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 5: Non-parametric G-Formula Results

  Intervention               Expected Outcome
  Treated (A=1)                           140
  Untreated (A=0)                          90
  Average Treatment Effect                 50
cat("Non-parametric Average Treatment Effect:", round(ate_nonparam, 4), "\n")
## Non-parametric Average Treatment Effect: 50

This approach is the non-parametric maximum likelihood estimator (NPMLE) of the g-formula.
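For reference, here is the same standardization written as a single pipeline over both treatment arms; it is only a compact restatement of the calculations above, using the data and z_marginal objects already defined:

# Standardize E[Y | A = a, Z] over P(Z) for both arms at once
standardized_means <- data %>%
  left_join(z_marginal %>% select(z, prob), by = "z") %>%
  group_by(a) %>%
  summarise(mean_y = sum(y * prob), .groups = 'drop')

standardized_means               # 90 for a = 0, 140 for a = 1
diff(standardized_means$mean_y)  # reproduces the ATE of 50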

Parametric G-Formula

Now we’ll demonstrate the parametric approach using regression models. This is more practical when dealing with multiple confounders or continuous variables.

Step 1: Fit Models

# Fit models for each component
# Model for confounder Z (baseline characteristic)
# Since Z is a baseline variable, we model its marginal distribution
z_model_data <- data %>%
  slice(rep(row_number(), n)) # expand to individual-level rows (not used below; we sample Z from its empirical marginal instead)

# Model for treatment A given Z
a_model <- glm(a ~ z, family = binomial(), data = data, weights = n)

# Model for outcome Y given A and Z
# Note: with only 4 stratum means and 3 coefficients, and cell means that are
# exactly additive (treatment effect +50 and Z effect -20 in both strata),
# this model fits the data perfectly -- hence the R-squared of 1 and the
# essentially-zero residuals in the summary below
y_model <- lm(y ~ a + z, data = data, weights = n)

# Display model summaries
cat("Treatment Model (A ~ Z):\n")
## Treatment Model (A ~ Z):
summary(a_model)
## 
## Call:
## glm(formula = a ~ z, family = binomial(), data = data, weights = n)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.40547    0.09129  -4.442 8.93e-06 ***
## z            1.25276    0.13363   9.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1376.3  on 3  degrees of freedom
## Residual deviance: 1283.9  on 2  degrees of freedom
## AIC: 1287.9
## 
## Number of Fisher Scoring iterations: 4
cat("\nOutcome Model (Y ~ A + Z):\n") 
## 
## Outcome Model (Y ~ A + Z):
summary(y_model)
## 
## Call:
## lm(formula = y ~ a + z, data = data, weights = n)
## 
## Weighted Residuals:
##          1          2          3          4 
##  3.329e-14 -4.077e-14 -4.708e-14  3.082e-14 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.000e+02  4.012e-15  2.493e+16   <2e-16 ***
## a            5.000e+01  5.137e-15  9.734e+15   <2e-16 ***
## z           -2.000e+01  5.111e-15 -3.913e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.705e-14 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.79e+31 on 2 and 1 DF,  p-value: < 2.2e-16

Step 2: Generate Simulated Data

# Step 2: Generate simulated values for parametric g-formula
set.seed(12345)
sim_n <- 100000

# Simulate Z from its marginal distribution
z_sim <- sample(z_marginal$z, size = sim_n, replace = TRUE, prob = z_marginal$prob)

# Simulate A given Z (though we'll override this for interventions)
a_sim <- rbinom(n = sim_n, 
                size = 1,
                prob = predict(a_model, 
                              newdata = data.frame(z = z_sim),
                              type = "response"))

# Create simulated dataset
data_sim <- data.frame(z = z_sim, a = a_sim)

cat("Simulation completed with", sim_n, "observations\n")
## Simulation completed with 1e+05 observations
cat("Z distribution:", table(z_sim), "\n")
## Z distribution: 50084 49916
cat("A distribution:", table(a_sim), "\n")
## A distribution: 45020 54980

Step 3: Intervention and Prediction

# Intervene to set treatment status
# Always treated (A = 1)
y_treated <- predict(y_model,
                    newdata = data_sim %>% mutate(a = 1),
                    type = "response")

# Never treated (A = 0)  
y_untreated <- predict(y_model,
                      newdata = data_sim %>% mutate(a = 0),
                      type = "response")

# Calculate average potential outcomes
expected_treated_param <- mean(y_treated)
expected_untreated_param <- mean(y_untreated)
ate_param <- expected_treated_param - expected_untreated_param

# Results summary
results_param <- data.frame(
  Intervention = c("Treated (A=1)", "Untreated (A=0)", "Average Treatment Effect"),
  Expected_Outcome = c(round(expected_treated_param, 4), 
                      round(expected_untreated_param, 4), 
                      round(ate_param, 4))
)

kable(results_param,
      caption = "Table 6: Parametric G-Formula Results",
      col.names = c("Intervention", "Expected Outcome")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6: Parametric G-Formula Results

  Intervention               Expected Outcome
  Treated (A=1)                      140.0168
  Untreated (A=0)                     90.0168
  Average Treatment Effect            50.0000
cat("Parametric Average Treatment Effect:", round(ate_param, 4), "\n")
## Parametric Average Treatment Effect: 50
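Because the outcome model is linear in \(a\) with no \(a \times z\) interaction, the g-computation ATE must equal the treatment coefficient exactly, so the simulation step is redundant here; it becomes necessary once effects vary with covariates or the model is non-linear. A quick check:

# For a linear model without a treatment-covariate interaction, the
# g-computation ATE equals the coefficient on a
coef(y_model)["a"]  # = 50

The small deviations of the counterfactual means from 140 and 90 (both are 0.0168 too high) come from Monte Carlo error in the simulated Z distribution; because the shift affects both arms equally, it cancels exactly in the difference.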

Comparison with Simple Methods

Let’s compare our g-formula results with simpler approaches that don’t properly adjust for confounding.

# Crude (unadjusted) comparison
crude_treated <- data %>% filter(a == 1) %>% summarise(mean = weighted.mean(y, n)) %>% pull()
crude_untreated <- data %>% filter(a == 0) %>% summarise(mean = weighted.mean(y, n)) %>% pull()
crude_effect <- crude_treated - crude_untreated

# Simple stratified analysis (manual adjustment)
stratified_effect <- data %>%
  group_by(z) %>%
  summarise(
    treated_mean = y[a == 1],
    untreated_mean = y[a == 0], 
    diff = treated_mean - untreated_mean,
    n_total = sum(n),
    .groups = 'drop'
  ) %>%
  summarise(weighted_diff = weighted.mean(diff, n_total)) %>%
  pull()

# Summary comparison
comparison <- data.frame(
  Method = c("Crude (Unadjusted)", "Stratified Analysis", "Non-parametric G-Formula", "Parametric G-Formula"),
  ATE_Estimate = c(round(crude_effect, 4), round(stratified_effect, 4), 
                   round(ate_nonparam, 4), round(ate_param, 4)),
  Notes = c("Ignores confounding", "Manual standardization", "Direct standardization", "Model-based")
)

kable(comparison,
      caption = "Table 7: Comparison of Different Methods",
      col.names = c("Method", "ATE Estimate", "Notes")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 7: Comparison of Different Methods

  Method                       ATE Estimate   Notes
  Crude (Unadjusted)                43.9394   Ignores confounding
  Stratified Analysis               50.0000   Manual standardization
  Non-parametric G-Formula          50.0000   Direct standardization
  Parametric G-Formula              50.0000   Model-based
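The crude estimate is biased downward because treatment assignment is confounded: subjects with high viral load (\(Z = 1\)), whose CD4 counts are lower at every treatment level, are more likely to be treated. This is easy to verify from Table 1:

# Treatment prevalence by viral load stratum
data %>%
  group_by(z) %>%
  summarise(p_treated = sum(n[a == 1]) / sum(n), .groups = 'drop')
# z = 0: 200/500 = 0.40;  z = 1: 350/500 = 0.70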

Discussion

In this simplified static setting, we’ve demonstrated the core concepts of the g-formula:

  1. Non-parametric approach: Direct standardization using the empirical distribution of confounders
  2. Parametric approach: Model-based estimation allowing for more complex relationships
  3. Equivalence to stratified standardization: In this static setting with a binary confounder, the non-parametric g-formula is mathematically equivalent to what epidemiologists often call stratified standardization.

Both involve computing \(E[Y \mid A = a, Z = z]\) within levels of \(Z\) and then weighting by the marginal distribution \(P(Z = z)\). The terminology differs by discipline—standardization emphasizes adjustment for confounding, while g-computation emphasizes potential outcomes under hypothetical interventions.

In essence, they are two interpretations of the same procedure.

Key Insights

The g-formula provides several advantages:

  • Handles confounding: Unlike crude comparisons, it adjusts for measured confounders
  • Interpretable: Estimates population-level causal effects under hypothetical interventions
  • Flexible: Can accommodate complex treatment regimes and multiple confounders
  • Generalizable: The parametric version can extrapolate beyond observed data combinations

When Methods Agree vs. Disagree

In our example, the non-parametric and parametric g-formulas give identical results (an ATE of 50), indicating that our linear outcome model is correctly specified. When the two approaches disagree substantially, it often indicates (see the diagnostic sketch after this list):

  • Model misspecification in the parametric approach
  • Insufficient data for reliable non-parametric estimation
  • Complex non-linear relationships requiring more flexible models
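One simple diagnostic for the first case is to refit the outcome model with a full treatment-confounder interaction, which makes it saturated for binary \(A\) and \(Z\) and therefore guaranteed to agree with the non-parametric answer. A minimal sketch, reusing the objects defined above; a large shift in the ATE relative to the additive model would flag misspecification:

# Saturated outcome model: one parameter per (a, z) cell, so its
# g-computation ATE matches the non-parametric estimate by construction
y_model_sat <- lm(y ~ a * z, data = data, weights = n)

ate_sat <- mean(predict(y_model_sat, newdata = data_sim %>% mutate(a = 1))) -
           mean(predict(y_model_sat, newdata = data_sim %>% mutate(a = 0)))
ate_sat  # = 50 here; compare with ate_param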

Extensions

This static framework naturally extends to:

  • Multiple confounders: Simply expand the standardization
  • Continuous confounders: Use integration instead of summation (see the sketch after this list)
  • Machine learning: Replace simple regression with flexible algorithms
  • Longitudinal settings: Add time-varying confounders and treatments
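To illustrate the continuous-confounder extension, here is a minimal sketch on hypothetical simulated individual-level data (the variable names and data-generating values below are illustrative, not taken from the tables above). Rather than integrating over \(P(Z)\) analytically, we standardize over the empirical distribution of \(Z\) by predicting every subject's outcome under each intervention and averaging:

# Hypothetical continuous confounder (e.g., standardized log viral load)
set.seed(2023)
n_obs  <- 5000
z_cont <- rnorm(n_obs)                               # continuous confounder
a_obs  <- rbinom(n_obs, 1, plogis(0.5 * z_cont))     # treatment depends on Z
y_obs  <- 100 + 50 * a_obs - 20 * z_cont + rnorm(n_obs, sd = 10)

fit <- lm(y_obs ~ a_obs + z_cont)

# G-computation: average model predictions over the empirical distribution of Z
mean(predict(fit, newdata = data.frame(a_obs = 1, z_cont = z_cont))) -
  mean(predict(fit, newdata = data.frame(a_obs = 0, z_cont = z_cont)))
# close to the true effect of 50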

References

Naimi, A. I., Cole, S. R., & Kennedy, E. H. (2017). An introduction to g methods. International Journal of Epidemiology, 46(2), 756-762.