This tutorial demonstrates g-methods in a simplified static (cross-sectional) setting, based on concepts from Naimi, Cole, and Kennedy (2017). By removing the time-varying complexity, we can focus on the core principles of the g-formula for causal inference.
In this tutorial, we will use the g-formula to estimate the average causal effect of treatment compared to no treatment. We’ll demonstrate both non-parametric and parametric approaches to g-computation.
We have a simple cross-sectional study examining the effect of HIV treatment (\(A\)) on CD4 count (\(Y\)). We observe a single baseline confounder, HIV viral load (\(Z\)), measured before treatment assignment. Unlike the longitudinal setting, all variables are measured at a single time point.
Our variables are:

- \(Z\): baseline HIV viral load (0 = low, 1 = high)
- \(A\): treatment status (0 = no treatment, 1 = treatment)
- \(Y\): CD4 count (the outcome)
We want to estimate the average causal effect of treatment (\(a = 1\)) compared to no treatment (\(a = 0\)). In this setup, the true average treatment effect (ATE) is 50.
The g-formula allows us to estimate the average potential outcome under a hypothetical treatment intervention. In the static setting, the g-formula simplifies to:
\[ E(Y^a) = \sum_z E(Y | A = a, Z = z) \cdot P(Z = z) \]
This is essentially a standardization or adjustment formula, where we:

1. Compute the expected outcome for each level of the confounder \(Z\)
2. Weight by the marginal distribution of \(Z\) in the population
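With a binary confounder, the sum has only two terms:

\[ E(Y^a) = E(Y \mid A = a, Z = 0) \cdot P(Z = 0) + E(Y \mid A = a, Z = 1) \cdot P(Z = 1) \]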
Our estimand of interest is: \[\tau = E[Y^{a=1}] - E[Y^{a=0}]\]
# Load the packages used throughout this tutorial
library(dplyr)      # data manipulation
library(knitr)      # kable()
library(kableExtra) # kable_styling()

# Simplified data for static setting
# Z: Baseline HIV viral load (0 = low, 1 = high)
# A: Treatment status (0 = no treatment, 1 = treatment)
# Y: CD4 count outcome
# N: Number of subjects in each stratum
z <- c(0, 0, 1, 1)
a <- c(0, 1, 0, 1)
y <- c(100, 150, 80, 130)
n <- c(300, 200, 150, 350)
data <- data.frame(z, a, y, n)
# Display the data
kable(data,
caption = "Table 1: Cross-sectional Data",
col.names = c("Z (Viral Load)", "A (Treatment)", "Y (CD4 Count)", "N (Sample Size)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Z (Viral Load) | A (Treatment) | Y (CD4 Count) | N (Sample Size) |
---|---|---|---|
0 | 0 | 100 | 300 |
0 | 1 | 150 | 200 |
1 | 0 | 80 | 150 |
1 | 1 | 130 | 350 |
# Calculate marginal distribution of Z
total_n <- sum(n)
z_marginal <- data %>%
group_by(z) %>%
summarise(total = sum(n), prob = sum(n) / total_n, .groups = 'drop')
kable(z_marginal,
caption = "Table 2: Marginal Distribution of Z",
col.names = c("Z (Viral Load)", "Count", "Probability"),
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Z (Viral Load) | Count | Probability |
---|---|---|
0 | 500 | 0.5 |
1 | 500 | 0.5 |
# E[Y^{a=1}] - always treated
# Get expected outcomes for treated group at each level of Z
treated_outcomes <- data %>%
filter(a == 1) %>%
left_join(z_marginal %>% select(z, prob), by = "z") %>%
mutate(weighted_y = y * prob)
expected_treated <- sum(treated_outcomes$weighted_y)
kable(treated_outcomes %>% select(z, y, prob, weighted_y),
caption = "Table 3: Treated Intervention Calculation",
col.names = c("Z", "$E[Y\\mid A=1, Z]$", "$P(Z)$", "Weighted Contribution"),
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Z | \(E[Y\mid A=1, Z]\) | \(P(Z)\) | Weighted Contribution |
---|---|---|---|
0 | 150 | 0.5 | 75 |
1 | 130 | 0.5 | 65 |
## Expected outcome under treatment: E[Y^{a=1}] = 140
# E[Y^{a=0}] - never treated
# Get expected outcomes for untreated group at each level of Z
untreated_outcomes <- data %>%
filter(a == 0) %>%
left_join(z_marginal %>% select(z, prob), by = "z") %>%
mutate(weighted_y = y * prob)
expected_untreated <- sum(untreated_outcomes$weighted_y)
kable(untreated_outcomes %>% select(z, y, prob, weighted_y),
caption = "Table 4: Untreated Intervention Calculation",
col.names = c("Z", "$E[Y\\mid A=0, Z]$", "$P(Z)$", "Weighted Contribution"),
digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Z | \(E[Y\mid A=0, Z]\) | \(P(Z)\) | Weighted Contribution |
---|---|---|---|
0 | 100 | 0.5 | 50 |
1 | 80 | 0.5 | 40 |
## Expected outcome without treatment: E[Y^{a=0}] = 90
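Written out, the standardization arithmetic for the two interventions is:

\[ E[Y^{a=1}] = 150 \times 0.5 + 130 \times 0.5 = 140, \qquad E[Y^{a=0}] = 100 \times 0.5 + 80 \times 0.5 = 90. \]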
# Calculate the average treatment effect
ate_nonparam <- expected_treated - expected_untreated
# Create results summary
results_nonparam <- data.frame(
Intervention = c("Treated (A=1)", "Untreated (A=0)", "Average Treatment Effect"),
Expected_Outcome = c(round(expected_treated, 4), round(expected_untreated, 4), round(ate_nonparam, 4))
)
kable(results_nonparam,
caption = "Table 5: Non-parametric G-Formula Results",
col.names = c("Intervention", "Expected Outcome")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Intervention | Expected Outcome |
---|---|
Treated (A=1) | 140 |
Untreated (A=0) | 90 |
Average Treatment Effect | 50 |
## Non-parametric Average Treatment Effect: 50
This approach is the non-parametric maximum likelihood estimator of the g-formula.
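As a cross-check, the same standardization can be collapsed into a single pipeline. This is a minimal sketch reusing the `data` and `z_marginal` objects defined above, not part of the original workflow:

# Join P(Z) onto each stratum, then standardize within each treatment arm
ate_check <- data %>%
  left_join(z_marginal %>% select(z, prob), by = "z") %>%
  group_by(a) %>%
  summarise(expected_y = sum(y * prob), .groups = "drop")
diff(ate_check$expected_y) # 140 - 90 = 50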
Now we’ll demonstrate the parametric approach using regression models. This is more practical when dealing with multiple confounders or continuous variables.
# Fit models for each component
# Distribution of confounder Z (baseline characteristic)
# Since Z is a baseline variable, we use its empirical marginal
# distribution (z_marginal, computed above) rather than fitting a model
# Model for treatment A given Z
a_model <- glm(a ~ z, family = binomial(), data = data, weights = n)
# Model for outcome Y given A and Z
y_model <- lm(y ~ a + z, data = data, weights = n)
# Display model summaries
cat("Treatment Model (A ~ Z):\n")
summary(a_model)
cat("\nOutcome Model (Y ~ A + Z):\n")
summary(y_model)
## Treatment Model (A ~ Z):
##
## Call:
## glm(formula = a ~ z, family = binomial(), data = data, weights = n)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.40547 0.09129 -4.442 8.93e-06 ***
## z 1.25276 0.13363 9.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1376.3 on 3 degrees of freedom
## Residual deviance: 1283.9 on 2 degrees of freedom
## AIC: 1287.9
##
## Number of Fisher Scoring iterations: 4
##
## Outcome Model (Y ~ A + Z):
##
## Call:
## lm(formula = y ~ a + z, data = data, weights = n)
##
## Weighted Residuals:
## 1 2 3 4
## 3.329e-14 -4.077e-14 -4.708e-14 3.082e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+02 4.012e-15 2.493e+16 <2e-16 ***
## a 5.000e+01 5.137e-15 9.734e+15 <2e-16 ***
## z -2.000e+01 5.111e-15 -3.913e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.705e-14 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.79e+31 on 2 and 1 DF, p-value: < 2.2e-16
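Reading off the coefficients, the fitted outcome model is (to rounding)

\[ \hat{E}[Y \mid A, Z] = 100 + 50A - 20Z, \]

which reproduces the four stratum means in Table 1 exactly (hence the near-zero weighted residuals): the treatment effect is 50 in both strata of \(Z\), so the additive model with no \(A \times Z\) interaction fits these data perfectly.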
# Generate simulated values for the parametric g-formula
set.seed(12345)
sim_n <- 100000
# Simulate Z from its marginal distribution
z_sim <- sample(z_marginal$z, size = sim_n, replace = TRUE, prob = z_marginal$prob)
# Simulate A given Z (though we'll override this for interventions)
a_sim <- rbinom(n = sim_n,
size = 1,
prob = predict(a_model,
newdata = data.frame(z = z_sim),
type = "response"))
# Create simulated dataset
data_sim <- data.frame(z = z_sim, a = a_sim)
cat("Simulation completed with", sim_n, "observations\n")
## Simulation completed with 1e+05 observations
## Z distribution: 50084 49916
## A distribution: 45020 54980
# Intervene to set treatment status
# Always treated (A = 1)
y_treated <- predict(y_model,
newdata = data_sim %>% mutate(a = 1),
type = "response")
# Never treated (A = 0)
y_untreated <- predict(y_model,
newdata = data_sim %>% mutate(a = 0),
type = "response")
# Calculate average potential outcomes
expected_treated_param <- mean(y_treated)
expected_untreated_param <- mean(y_untreated)
ate_param <- expected_treated_param - expected_untreated_param
# Results summary
results_param <- data.frame(
Intervention = c("Treated (A=1)", "Untreated (A=0)", "Average Treatment Effect"),
Expected_Outcome = c(round(expected_treated_param, 4),
round(expected_untreated_param, 4),
round(ate_param, 4))
)
kable(results_param,
caption = "Table 6: Parametric G-Formula Results",
col.names = c("Intervention", "Expected Outcome")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Intervention | Expected Outcome |
---|---|
Treated (A=1) | 140.0168 |
Untreated (A=0) | 90.0168 |
Average Treatment Effect | 50.0000 |
## Parametric Average Treatment Effect: 50
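A quick sanity check, not part of the original workflow: because the outcome model is linear in \(A\) with no \(A \times Z\) interaction, the marginal ATE collapses to the treatment coefficient. The simulation step only becomes essential once the model includes interactions or a nonlinear link.

# With a linear, interaction-free outcome model, the marginal ATE
# equals the coefficient on a
coef(y_model)[["a"]]

## [1] 50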
Let’s compare our g-formula results with simpler approaches that don’t properly adjust for confounding.
# Crude (unadjusted) comparison
crude_treated <- data %>% filter(a == 1) %>% summarise(mean = weighted.mean(y, n)) %>% pull()
crude_untreated <- data %>% filter(a == 0) %>% summarise(mean = weighted.mean(y, n)) %>% pull()
crude_effect <- crude_treated - crude_untreated
# Simple stratified analysis (manual adjustment)
stratified_effect <- data %>%
group_by(z) %>%
summarise(
treated_mean = y[a == 1],
untreated_mean = y[a == 0],
diff = treated_mean - untreated_mean,
n_total = sum(n),
.groups = 'drop'
) %>%
summarise(weighted_diff = weighted.mean(diff, n_total)) %>%
pull()
# Summary comparison
comparison <- data.frame(
Method = c("Crude (Unadjusted)", "Stratified Analysis", "Non-parametric G-Formula", "Parametric G-Formula"),
ATE_Estimate = c(round(crude_effect, 4), round(stratified_effect, 4),
round(ate_nonparam, 4), round(ate_param, 4)),
Notes = c("Ignores confounding", "Manual standardization", "Direct standardization", "Model-based")
)
kable(comparison,
caption = "Table 7: Comparison of Different Methods",
col.names = c("Method", "ATE Estimate", "Notes")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Method | ATE Estimate | Notes |
---|---|---|
Crude (Unadjusted) | 43.9394 | Ignores confounding |
Stratified Analysis | 50.0000 | Manual standardization |
Non-parametric G-Formula | 50.0000 | Direct standardization |
Parametric G-Formula | 50.0000 | Model-based |
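To see where the crude figure comes from, trace the weighted means by hand:

\[ \bar{Y}_{A=1} = \frac{150 \times 200 + 130 \times 350}{550} \approx 137.27, \qquad \bar{Y}_{A=0} = \frac{100 \times 300 + 80 \times 150}{450} \approx 93.33, \]

giving \(137.27 - 93.33 \approx 43.94\). Treated subjects are drawn disproportionately from the high-viral-load stratum (350 of 550), which pulls their mean CD4 count down and understates the true effect of 50.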
In this simplified static setting, we've demonstrated the core concepts of the g-formula:

- the non-parametric g-formula, computed by direct standardization over the observed distribution of \(Z\), and
- the parametric g-formula (g-computation), computed by fitting regression models and simulating potential outcomes.
Both involve computing \(E[Y \mid A = a, Z = z]\) within levels of \(Z\) and then weighting by the marginal distribution \(P(Z = z)\). The terminology differs by discipline—standardization emphasizes adjustment for confounding, while g-computation emphasizes potential outcomes under hypothetical interventions.
In essence, they are two interpretations of the same procedure.
The g-formula provides several advantages:

- it targets marginal (population-average) causal effects under explicit hypothetical interventions,
- it can be implemented non-parametrically (direct standardization) or parametrically (regression plus simulation), and
- unlike conditioning-based adjustment, it extends to settings with time-varying treatments and confounders.
In our example, the non-parametric and parametric g-formula give essentially identical results, suggesting the linear outcome model is correctly specified. When they disagree substantially, it often indicates model misspecification, such as a missing \(A \times Z\) interaction or an incorrect functional form in the outcome model.
This static framework naturally extends to the time-varying setting of Naimi, Cole, and Kennedy (2017), where treatments and confounders are measured repeatedly and the g-formula sums over entire treatment and covariate histories rather than a single baseline confounder.
Naimi, A. I., Cole, S. R., & Kennedy, E. H. (2017). An introduction to g methods. International Journal of Epidemiology, 46(2), 756-762.