from pathlib import Path
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from lightgbm import LGBMRegressor
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="X does not have valid feature names")
pd.set_option("display.max_columns", 140)
pd.set_option("display.max_rows", 120)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")
sns.set_theme(style="whitegrid", context="notebook")
RANDOM_STATE = 42
Notebook 05: G-Computation for Long-Term Effects
Notebook 04 estimated the long-term effect of high-watch exposure using a marginal structural model. That approach modeled treatment assignment, created inverse probability weights, and then estimated the effect in a weighted pseudo-population.
This notebook estimates the same causal question using a different strategy: g-computation. Instead of modeling the probability of treatment, g-computation models the future outcome directly:
future_7day_interactions ~ treatment + pre-treatment history + calendar context
Once the outcome model is fit, we simulate two counterfactual predictions for every user-day:
- predicted future engagement if the user-day were treated;
- predicted future engagement if the same user-day were untreated.
The average difference between those two predictions estimates:
E[Y(1) - Y(0)]
This is a strong next step because it checks whether an outcome-modeling approach tells a similar story to the weighting approach from Notebook 04.
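For intuition before the real analysis, here is a minimal, self-contained sketch of that recipe on toy data (the column names, coefficients, and Ridge model below are illustrative placeholders, not the notebook's actual panel or pipeline):
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "treatment": rng.integers(0, 2, 500),
    "lag_interactions": rng.poisson(10, 500).astype(float),
})
toy["outcome"] = 3.0 * toy["treatment"] + 2.0 * toy["lag_interactions"] + rng.normal(0, 1, 500)
feature_cols = ["treatment", "lag_interactions"]
model = Ridge(alpha=1.0).fit(toy[feature_cols], toy["outcome"])
# G-computation step: toggle treatment for every row and average the predicted gap.
pred_if_treated = model.predict(toy.assign(treatment=1)[feature_cols])
pred_if_untreated = model.predict(toy.assign(treatment=0)[feature_cols])
print(f"Estimated E[Y(1) - Y(0)]: {(pred_if_treated - pred_if_untreated).mean():.2f}")  # near 3 by construction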
Why G-Computation Helps This Project
A marginal structural model and g-computation make different modeling bets.
- The MSM relies heavily on the treatment model and the resulting inverse probability weights.
- G-computation relies heavily on the outcome model and its ability to predict future engagement under both treatment states.
Neither approach magically solves unobserved confounding. But if both approaches, using the same estimand and the same valid pre-treatment covariates, produce a similar answer, the project becomes more credible. If they disagree, that disagreement is informative and should be discussed as model sensitivity.
Setup
The first cell imports the libraries used for outcome modeling, cross-fitting, diagnostics, bootstrap uncertainty, and plots. We use two outcome-model families: a regularized linear model for transparency and LightGBM for flexible nonlinear prediction.
The notebook environment is ready for outcome-model-based causal estimation. The important modeling choice is that the same pre-treatment feature set will be used for all models, so differences in estimates reflect model family rather than different covariate rules.
Locate the Weighted Panel and MSM Result
This notebook uses the weighted panel from Notebook 03 because it already contains the final estimand population and standardized columns. It also loads the primary MSM result from Notebook 04 so the two causal strategies can be compared directly.
WEIGHTED_PANEL_RELATIVE_PATH = Path("data/processed/kuairec_long_term_weighted_panel.parquet")
MSM_PRIMARY_RELATIVE_PATH = Path("data/processed/kuairec_long_term_msm_primary_statement.csv")
candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
(path for path in candidate_roots if (path / WEIGHTED_PANEL_RELATIVE_PATH).exists()),
None,
)
if PROJECT_ROOT is None:
raise FileNotFoundError(
f"Could not find {WEIGHTED_PANEL_RELATIVE_PATH}. Run Notebook 03 first or run this notebook inside the project."
)
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PANEL_PATH = PROJECT_ROOT / WEIGHTED_PANEL_RELATIVE_PATH
MSM_PRIMARY_PATH = PROJECT_ROOT / MSM_PRIMARY_RELATIVE_PATH
print(f"Project root: {PROJECT_ROOT}")
print(f"Weighted panel: {PANEL_PATH}")
print(f"MSM primary result: {MSM_PRIMARY_PATH}")
print(f"Processed output folder: {PROCESSED_DIR}")Project root: /home/apex/Documents/ranking_sys
Weighted panel: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weighted_panel.parquet
MSM primary result: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_msm_primary_statement.csv
Processed output folder: /home/apex/Documents/ranking_sys/data/processed
The notebook found both required inputs. The weighted panel supplies the analysis rows and covariates, while the MSM result provides the benchmark effect estimate from the previous notebook.
Load the Analysis Panel
The primary treatment column is treatment, and the primary outcome column is outcome, which records future 7-day interaction volume. The panel also includes repeated user-days, so model evaluation and bootstrap uncertainty should respect user_id groups.
panel = pd.read_parquet(PANEL_PATH)
panel["event_date"] = pd.to_datetime(panel["event_date"])
panel = panel.sort_values(["user_id", "event_date"]).reset_index(drop=True)
PRIMARY_TREATMENT = "treatment"
PRIMARY_OUTCOME = "outcome"
required_columns = ["user_id", "event_date", "day_of_week", PRIMARY_TREATMENT, PRIMARY_OUTCOME]
missing_required = [column for column in required_columns if column not in panel.columns]
if missing_required:
raise ValueError(f"Missing required columns: {missing_required}")
print(f"Panel shape: {panel.shape}")
print(f"Users: {panel['user_id'].nunique():,}")
print(f"Date range: {panel['event_date'].min().date()} to {panel['event_date'].max().date()}")
display(panel[required_columns + ["analysis_weight", "propensity_logistic"]].head())
Panel shape: (4698, 40)
Users: 91
Date range: 2020-07-08 to 2020-08-29
| | user_id | event_date | day_of_week | treatment | outcome | analysis_weight | propensity_logistic |
|---|---|---|---|---|---|---|---|
| 0 | 14 | 2020-07-08 | Wednesday | 1 | 279.0000 | 0.9131 | 0.5471 |
| 1 | 14 | 2020-07-09 | Thursday | 0 | 311.0000 | 1.3647 | 0.6333 |
| 2 | 14 | 2020-07-10 | Friday | 1 | 352.0000 | 1.2067 | 0.4140 |
| 3 | 14 | 2020-07-11 | Saturday | 1 | 437.0000 | 0.8836 | 0.5654 |
| 4 | 14 | 2020-07-12 | Sunday | 0 | 437.0000 | 1.2466 | 0.5986 |
The loaded table is the same final estimand population used by the MSM notebook. Because the treatment and outcome columns are standardized, this notebook can focus on the g-computation strategy rather than redefining the causal question.
Define the Outcome-Model Features
G-computation needs an outcome model that uses the treatment and pre-treatment covariates. The treatment is included because we want to estimate how predicted outcomes change when the treatment is toggled. The history and calendar features are included to adjust for observed confounding.
No future outcome columns are included as features.
base_numeric_covariates = [
"lag_1_active_day",
"lag_1_interactions",
"lag_1_total_play_duration_sec",
"lag_1_avg_watch_ratio",
"lag_1_high_watch_share",
"prior_3day_active_day",
"prior_3day_interactions",
"prior_3day_total_play_duration_sec",
"prior_3day_avg_watch_ratio",
"prior_3day_high_watch_share",
"calendar_day_index",
"panel_day_index",
]
numeric_features = [PRIMARY_TREATMENT] + base_numeric_covariates
categorical_features = ["day_of_week"]
outcome_features = numeric_features + categorical_features
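# Leakage guard (a light sanity check; it assumes future-looking columns follow
# the future_ naming prefix used in this project): the outcome window must not
# appear anywhere in the feature set.
leaky_features = [column for column in outcome_features if column.startswith("future_") or column == PRIMARY_OUTCOME]
assert not leaky_features, f"Future-looking columns leaked into features: {leaky_features}"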
feature_table = pd.DataFrame(
[
{"role": "treatment toggle", "columns": PRIMARY_TREATMENT, "reason": "Set to 1 and 0 to generate counterfactual predictions."},
{"role": "pre-treatment numeric history", "columns": ", ".join(base_numeric_covariates), "reason": "Adjust for recent activity and watch behavior."},
{"role": "calendar context", "columns": ", ".join(categorical_features), "reason": "Account for day-of-week patterns in engagement."},
]
)
display(feature_table)
| | role | columns | reason |
|---|---|---|---|
| 0 | treatment toggle | treatment | Set to 1 and 0 to generate counterfactual pred... |
| 1 | pre-treatment numeric history | lag_1_active_day, lag_1_interactions, lag_1_to... | Adjust for recent activity and watch behavior. |
| 2 | calendar context | day_of_week | Account for day-of-week patterns in engagement. |
The feature set respects the causal timing: it uses treatment and pre-treatment context, but not future information. This lets the outcome model simulate counterfactual treatment states for the same user-day history.
Outcome and Treatment Summary
Before fitting outcome models, we summarize the treatment split and outcome distribution. This gives a baseline sense of how hard the prediction problem is and how much variation exists in future engagement.
outcome_summary = pd.DataFrame(
[
{"metric": "rows", "value": len(panel)},
{"metric": "users", "value": panel["user_id"].nunique()},
{"metric": "treatment_rate", "value": panel[PRIMARY_TREATMENT].mean()},
{"metric": "outcome_mean", "value": panel[PRIMARY_OUTCOME].mean()},
{"metric": "outcome_std", "value": panel[PRIMARY_OUTCOME].std()},
{"metric": "outcome_min", "value": panel[PRIMARY_OUTCOME].min()},
{"metric": "outcome_median", "value": panel[PRIMARY_OUTCOME].median()},
{"metric": "outcome_max", "value": panel[PRIMARY_OUTCOME].max()},
]
)
display(outcome_summary)
display(
panel.groupby(PRIMARY_TREATMENT)[PRIMARY_OUTCOME]
.agg(["count", "mean", "std", "min", "median", "max"])
.rename_axis("treatment")
)
| | metric | value |
|---|---|---|
| 0 | rows | 4,698.0000 |
| 1 | users | 91.0000 |
| 2 | treatment_rate | 0.4996 |
| 3 | outcome_mean | 382.0826 |
| 4 | outcome_std | 156.3942 |
| 5 | outcome_min | 0.0000 |
| 6 | outcome_median | 391.0000 |
| 7 | outcome_max | 888.0000 |
| treatment | count | mean | std | min | median | max |
|---|---|---|---|---|---|---|
| 0 | 2351 | 378.5479 | 157.6583 | 18.0000 | 380.0000 | 888.0000 |
| 1 | 2347 | 385.6233 | 155.0705 | 0.0000 | 402.0000 | 851.0000 |
The outcome has meaningful spread, which makes outcome modeling feasible. The treated and control raw means remain descriptive only; g-computation will compare counterfactual predictions for the same covariate distribution.
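A toy example (all names and numbers below are illustrative) makes that distinction concrete: the true treatment effect is zero, a confounder drives both treatment and outcome, the naive treated-minus-control gap is badly biased, and g-computation with the confounder in the model recovers roughly zero:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)                         # confounder
t = (x + rng.normal(size=5_000) > 0).astype(int)   # confounder drives treatment
y = 5.0 * x + rng.normal(size=5_000)               # true treatment effect is zero
toy = pd.DataFrame({"t": t, "x": x, "y": y})
naive_gap = toy.loc[toy["t"] == 1, "y"].mean() - toy.loc[toy["t"] == 0, "y"].mean()
model = LinearRegression().fit(toy[["t", "x"]], toy["y"])
gcomp_ate = (model.predict(toy.assign(t=1)[["t", "x"]]) - model.predict(toy.assign(t=0)[["t", "x"]])).mean()
print(f"Naive gap: {naive_gap:.2f}, g-computation: {gcomp_ate:.2f}")  # biased versus near zero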
Define Outcome Model Pipelines
The linear model is transparent and stable. LightGBM is more flexible and can capture nonlinearities and interactions, including treatment-effect heterogeneity. Both models use the same features.
def make_linear_outcome_pipeline(numeric_cols, categorical_cols):
preprocessor = ColumnTransformer(
transformers=[
("numeric", StandardScaler(), numeric_cols),
("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
],
remainder="drop",
verbose_feature_names_out=False,
)
return Pipeline(
steps=[
("preprocessor", preprocessor),
("model", Ridge(alpha=1.0)),
]
)
def make_lgbm_outcome_pipeline(numeric_cols, categorical_cols):
preprocessor = ColumnTransformer(
transformers=[
("numeric", "passthrough", numeric_cols),
("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
],
remainder="drop",
verbose_feature_names_out=False,
)
return Pipeline(
steps=[
("preprocessor", preprocessor),
(
"model",
LGBMRegressor(
n_estimators=140,
learning_rate=0.04,
num_leaves=15,
min_child_samples=40,
subsample=0.9,
colsample_bytree=0.9,
random_state=RANDOM_STATE,
n_jobs=1,
verbose=-1,
),
),
]
)
outcome_models = {
"linear_ridge": make_linear_outcome_pipeline(numeric_features, categorical_features),
"lightgbm": make_lgbm_outcome_pipeline(numeric_features, categorical_features),
}
print("Defined outcome models:")
for model_name in outcome_models:
print(f"- {model_name}")Defined outcome models:
- linear_ridge
- lightgbm
The two model families give a useful contrast: linear ridge provides an interpretable baseline, while LightGBM tests whether flexible nonlinear outcome modeling changes the estimated causal effect.
Cross-Fit Outcome Models and Counterfactual Predictions
This cell uses user-grouped cross-fitting. Each user’s rows appear entirely in either the training fold or the held-out fold, so the model is never evaluated on days from a user it was trained on.
For each held-out row, the model predicts:
- the observed outcome under the actual treatment value;
- the counterfactual outcome if treatment = 1;
- the counterfactual outcome if treatment = 0.
The difference between the two counterfactual predictions is the row-level g-computation effect contribution.
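As a quick sanity check on the grouping logic, this tiny sketch (toy groups; the notebook's real grouping key is user_id) confirms that GroupKFold keeps every group entirely on one side of each split:
import numpy as np
from sklearn.model_selection import GroupKFold

toy_groups = np.repeat(np.arange(10), 5)  # 10 "users" with 5 rows each
toy_X = np.arange(50).reshape(-1, 1)
for train_idx, test_idx in GroupKFold(n_splits=5).split(toy_X, groups=toy_groups):
    assert not set(toy_groups[train_idx]) & set(toy_groups[test_idx])
print("No group appears in both the training and held-out folds.")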
def make_counterfactual_features(X, treatment_value):
X_cf = X.copy()
X_cf[PRIMARY_TREATMENT] = treatment_value
return X_cf
def evaluate_predictions(y_true, y_pred):
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
return rmse, mae, r2
def cross_fitted_gcomp(model, data, feature_cols, outcome_col, groups, model_name, n_splits=5):
group_kfold = GroupKFold(n_splits=n_splits)
X = data[feature_cols].copy()
y = data[outcome_col].astype(float).copy()
observed_pred = np.zeros(len(data), dtype=float)
y1_pred = np.zeros(len(data), dtype=float)
y0_pred = np.zeros(len(data), dtype=float)
fold_rows = []
for fold, (train_idx, test_idx) in enumerate(group_kfold.split(X, y, groups=groups), start=1):
fold_model = clone(model)
X_train = X.iloc[train_idx]
y_train = y.iloc[train_idx]
X_test = X.iloc[test_idx]
y_test = y.iloc[test_idx]
fold_model.fit(X_train, y_train)
pred_observed = fold_model.predict(X_test)
pred_y1 = fold_model.predict(make_counterfactual_features(X_test, 1))
pred_y0 = fold_model.predict(make_counterfactual_features(X_test, 0))
observed_pred[test_idx] = pred_observed
y1_pred[test_idx] = pred_y1
y0_pred[test_idx] = pred_y0
rmse, mae, r2 = evaluate_predictions(y_test, pred_observed)
fold_rows.append(
{
"model": model_name,
"fold": fold,
"rows": len(test_idx),
"users": data.iloc[test_idx]["user_id"].nunique(),
"rmse": rmse,
"mae": mae,
"r2": r2,
"mean_observed_outcome": y_test.mean(),
"mean_predicted_outcome": pred_observed.mean(),
"gcomp_ate_fold": (pred_y1 - pred_y0).mean(),
}
)
scored = data[["user_id", "event_date", PRIMARY_TREATMENT, outcome_col]].copy()
scored["model"] = model_name
scored["pred_observed"] = observed_pred
scored["pred_y1"] = y1_pred
scored["pred_y0"] = y0_pred
scored["cate_hat"] = y1_pred - y0_pred
scored["residual"] = scored[outcome_col] - scored["pred_observed"]
return scored, pd.DataFrame(fold_rows)
groups = panel["user_id"]
scored_frames = []
metric_frames = []
for model_name, model in outcome_models.items():
scored, metrics = cross_fitted_gcomp(
model,
panel,
outcome_features,
PRIMARY_OUTCOME,
groups,
model_name=model_name,
n_splits=5,
)
scored_frames.append(scored)
metric_frames.append(metrics)
gcomp_scored = pd.concat(scored_frames, ignore_index=True)
gcomp_fold_metrics = pd.concat(metric_frames, ignore_index=True)
display(gcomp_fold_metrics)
| | model | fold | rows | users | rmse | mae | r2 | mean_observed_outcome | mean_predicted_outcome | gcomp_ate_fold |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linear_ridge | 1 | 962 | 19 | 106.5616 | 82.3527 | 0.5154 | 381.5561 | 384.7669 | 2.3053 |
| 1 | linear_ridge | 2 | 934 | 18 | 103.2866 | 82.4002 | 0.5253 | 381.8961 | 382.3991 | 3.8276 |
| 2 | linear_ridge | 3 | 934 | 18 | 126.3958 | 94.4799 | 0.4328 | 381.4700 | 381.4646 | -1.8905 |
| 3 | linear_ridge | 4 | 934 | 18 | 107.8745 | 82.8412 | 0.5395 | 385.5064 | 379.2154 | 5.3646 |
| 4 | linear_ridge | 5 | 934 | 18 | 104.3184 | 82.4244 | 0.5257 | 380.0000 | 382.3423 | 2.0375 |
| 5 | lightgbm | 1 | 962 | 19 | 72.2918 | 51.5314 | 0.7770 | 381.5561 | 386.7467 | 0.0473 |
| 6 | lightgbm | 2 | 934 | 18 | 63.2139 | 48.2563 | 0.8222 | 381.8961 | 383.3526 | 0.2892 |
| 7 | lightgbm | 3 | 934 | 18 | 83.8233 | 58.7569 | 0.7505 | 381.4700 | 380.5458 | -0.3942 |
| 8 | lightgbm | 4 | 934 | 18 | 77.5274 | 54.6464 | 0.7621 | 385.5064 | 380.0921 | 0.1809 |
| 9 | lightgbm | 5 | 934 | 18 | 60.4394 | 45.9208 | 0.8408 | 380.0000 | 380.7179 | 0.0558 |
The cross-fitted table shows out-of-user prediction quality and fold-level g-computation estimates. This is the core diagnostic for whether the outcome models are credible enough to support counterfactual simulation.
Summarize Outcome Model Quality
This cell aggregates the fold metrics by model family. Lower RMSE and MAE indicate better prediction accuracy, while higher R-squared indicates more explained variation in future engagement.
outcome_model_metrics = (
gcomp_fold_metrics.groupby("model")
.agg(
rows=("rows", "sum"),
users=("users", "sum"),
rmse_mean=("rmse", "mean"),
rmse_std=("rmse", "std"),
mae_mean=("mae", "mean"),
r2_mean=("r2", "mean"),
observed_mean=("mean_observed_outcome", "mean"),
predicted_mean=("mean_predicted_outcome", "mean"),
fold_ate_mean=("gcomp_ate_fold", "mean"),
fold_ate_std=("gcomp_ate_fold", "std"),
)
.reset_index()
)
display(outcome_model_metrics)
| | model | rows | users | rmse_mean | rmse_std | mae_mean | r2_mean | observed_mean | predicted_mean | fold_ate_mean | fold_ate_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lightgbm | 4698 | 91 | 71.4592 | 9.7444 | 51.8224 | 0.7905 | 382.0857 | 382.2910 | 0.0358 | 0.2601 |
| 1 | linear_ridge | 4698 | 91 | 109.6874 | 9.5135 | 84.8997 | 0.5077 | 382.0857 | 382.0377 | 2.3289 | 2.7088 |
The model-quality table helps decide how much trust to place in each g-computation estimate. A flexible model may fit better, but the causal estimate should still be checked against the transparent linear model and the MSM result.
Predicted Versus Observed Outcomes
A scatter plot of predicted versus observed outcomes shows whether the outcome models capture the main signal or collapse toward the mean. The plot samples points for readability.
plot_scored = gcomp_scored.sample(min(2_000, len(gcomp_scored)), random_state=RANDOM_STATE)
fig, axes = plt.subplots(1, 2, figsize=(13, 5), sharex=True, sharey=True)
for ax, model_name in zip(axes, ["linear_ridge", "lightgbm"]):
model_plot = plot_scored.query("model == @model_name")
sns.scatterplot(
data=model_plot,
x=PRIMARY_OUTCOME,
y="pred_observed",
alpha=0.35,
s=18,
ax=ax,
color="#2A6F97",
)
min_value = min(model_plot[PRIMARY_OUTCOME].min(), model_plot["pred_observed"].min())
max_value = max(model_plot[PRIMARY_OUTCOME].max(), model_plot["pred_observed"].max())
ax.plot([min_value, max_value], [min_value, max_value], color="black", linestyle="--", linewidth=1)
ax.set_title(f"{model_name}: Predicted vs Observed")
ax.set_xlabel("Observed future interactions")
ax.set_ylabel("Predicted future interactions")
plt.tight_layout()
plt.show()
The scatter plots show how well each model tracks observed future engagement. Large vertical spread is expected because future behavior is noisy, but the model should still capture meaningful variation rather than predicting one constant value for everyone.
Calibration by Prediction Bin
Calibration checks whether rows with higher predicted outcomes actually have higher observed outcomes. This is a useful outcome-model diagnostic because g-computation relies on the level of predicted counterfactual outcomes.
def prediction_calibration_table(scored, n_bins=10):
frames = []
for model_name, model_data in scored.groupby("model"):
temp = model_data.copy()
temp["prediction_bin"] = pd.qcut(temp["pred_observed"], q=n_bins, duplicates="drop")
frame = (
temp.groupby("prediction_bin", observed=True)
.agg(
rows=(PRIMARY_OUTCOME, "size"),
mean_predicted=("pred_observed", "mean"),
mean_observed=(PRIMARY_OUTCOME, "mean"),
mean_residual=("residual", "mean"),
)
.reset_index()
)
frame.insert(0, "model", model_name)
frames.append(frame)
return pd.concat(frames, ignore_index=True)
calibration = prediction_calibration_table(gcomp_scored)
display(calibration)
| | model | prediction_bin | rows | mean_predicted | mean_observed | mean_residual |
|---|---|---|---|---|---|---|
| 0 | lightgbm | (77.664, 156.969] | 470 | 121.2450 | 118.3745 | -2.8705 |
| 1 | lightgbm | (156.969, 243.43] | 470 | 196.3886 | 195.4213 | -0.9673 |
| 2 | lightgbm | (243.43, 331.146] | 470 | 293.2714 | 294.3170 | 1.0457 |
| 3 | lightgbm | (331.146, 381.112] | 469 | 350.9291 | 347.0362 | -3.8928 |
| 4 | lightgbm | (381.112, 414.707] | 470 | 400.0981 | 401.1340 | 1.0359 |
| 5 | lightgbm | (414.707, 438.505] | 470 | 426.5840 | 427.0745 | 0.4904 |
| 6 | lightgbm | (438.505, 464.164] | 469 | 450.2288 | 448.0874 | -2.1413 |
| 7 | lightgbm | (464.164, 502.289] | 470 | 480.6242 | 480.8106 | 0.1864 |
| 8 | lightgbm | (502.289, 548.924] | 470 | 526.6655 | 536.8596 | 10.1940 |
| 9 | lightgbm | (548.924, 679.474] | 470 | 577.2190 | 571.7766 | -5.4424 |
| 10 | linear_ridge | (94.196, 213.272] | 470 | 173.3898 | 122.2617 | -51.1280 |
| 11 | linear_ridge | (213.272, 264.816] | 470 | 239.5195 | 207.8128 | -31.7068 |
| 12 | linear_ridge | (264.816, 317.465] | 470 | 287.8596 | 282.7468 | -5.1128 |
| 13 | linear_ridge | (317.465, 380.705] | 469 | 351.9200 | 394.5181 | 42.5981 |
| 14 | linear_ridge | (380.705, 411.384] | 470 | 397.3584 | 495.7298 | 98.3714 |
| 15 | linear_ridge | (411.384, 435.723] | 470 | 423.7958 | 507.6404 | 83.8446 |
| 16 | linear_ridge | (435.723, 462.498] | 469 | 449.1164 | 486.4222 | 37.3058 |
| 17 | linear_ridge | (462.498, 484.971] | 470 | 473.9972 | 453.0660 | -20.9312 |
| 18 | linear_ridge | (484.971, 505.903] | 470 | 495.0923 | 440.2553 | -54.8370 |
| 19 | linear_ridge | (505.903, 647.435] | 470 | 528.5690 | 430.6213 | -97.9478 |
The calibration table checks whether predicted outcome levels are systematically too high or too low across the prediction range. Good calibration makes the counterfactual averages easier to trust.
Plot Outcome Calibration
The dashed diagonal represents perfect calibration. Points above the line mean observed future engagement is higher than predicted; points below the line mean the model overpredicts.
fig, ax = plt.subplots(figsize=(6, 6))
sns.scatterplot(
data=calibration,
x="mean_predicted",
y="mean_observed",
hue="model",
size="rows",
sizes=(40, 160),
ax=ax,
)
min_axis = min(calibration["mean_predicted"].min(), calibration["mean_observed"].min())
max_axis = max(calibration["mean_predicted"].max(), calibration["mean_observed"].max())
ax.plot([min_axis, max_axis], [min_axis, max_axis], color="black", linestyle="--", linewidth=1)
ax.set_title("Outcome Model Calibration")
ax.set_xlabel("Mean predicted future interactions")
ax.set_ylabel("Mean observed future interactions")
plt.tight_layout()
plt.show()
The calibration plot turns model quality into a visual check. If the flexible model calibrates better than the linear model, its counterfactual estimates may be more credible, though they still need sensitivity checks.
Estimate G-Computation Effects
The g-computation estimate is the average predicted outcome under treatment minus the average predicted outcome under control. This cell summarizes both the average treatment effect and the distribution of row-level predicted effects.
gcomp_results = (
gcomp_scored.groupby("model")
.agg(
n_rows=(PRIMARY_OUTCOME, "size"),
observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
mean_pred_observed=("pred_observed", "mean"),
mean_y1_hat=("pred_y1", "mean"),
mean_y0_hat=("pred_y0", "mean"),
ate=("cate_hat", "mean"),
cate_std=("cate_hat", "std"),
cate_p05=("cate_hat", lambda values: values.quantile(0.05)),
cate_median=("cate_hat", "median"),
cate_p95=("cate_hat", lambda values: values.quantile(0.95)),
share_positive_cate=("cate_hat", lambda values: (values > 0).mean()),
)
.reset_index()
)
gcomp_results["relative_lift_vs_y0"] = gcomp_results["ate"] / gcomp_results["mean_y0_hat"]
display(gcomp_results)
| | model | n_rows | observed_outcome_mean | mean_pred_observed | mean_y1_hat | mean_y0_hat | ate | cate_std | cate_p05 | cate_median | cate_p95 | share_positive_cate | relative_lift_vs_y0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lightgbm | 4698 | 382.0826 | 382.3176 | 382.3398 | 382.3040 | 0.0359 | 0.7680 | -0.9658 | 0.0000 | 1.0999 | 0.1249 | 0.0001 |
| 1 | linear_ridge | 4698 | 382.0826 | 382.0539 | 383.2312 | 380.9024 | 2.3288 | 2.4159 | -1.8905 | 2.3053 | 5.3646 | 0.8012 | 0.0061 |
This is the main g-computation result table. The ate column estimates the average causal effect implied by each outcome model, while the CATE distribution columns show whether the model predicts similar or heterogeneous effects across user-days.
Plot Counterfactual Means
This plot compares the model-implied average outcome under treatment and under control. It is often easier to communicate g-computation this way: the model predicts two potential outcome means, and their gap is the estimated effect.
cf_means_plot = gcomp_results.melt(
id_vars="model",
value_vars=["mean_y0_hat", "mean_y1_hat"],
var_name="counterfactual_state",
value_name="predicted_mean_outcome",
)
cf_means_plot["counterfactual_state"] = cf_means_plot["counterfactual_state"].map(
{"mean_y0_hat": "Predicted if untreated", "mean_y1_hat": "Predicted if treated"}
)
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(
data=cf_means_plot,
x="model",
y="predicted_mean_outcome",
hue="counterfactual_state",
ax=ax,
)
ax.set_title("G-Computation Counterfactual Mean Outcomes")
ax.set_xlabel("Outcome model")
ax.set_ylabel("Predicted future 7-day interactions")
plt.tight_layout()
plt.show()
The counterfactual mean plot shows the estimated potential outcome gap directly. If the treated and untreated bars are close, the model is saying high-watch exposure has little average effect on future interaction volume.
Plot Row-Level Predicted Effects
The row-level cate_hat values are not experimental individual treatment effects, but they show how the outcome model expects treatment effects to vary across observed histories. LightGBM can produce heterogeneous effects; the additive ridge model produces a nearly constant effect, because without treatment interaction terms the treatment coefficient adds the same shift to every row (up to small differences across cross-fitting folds).
fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
data=gcomp_scored,
x="cate_hat",
hue="model",
bins=50,
stat="density",
common_norm=False,
element="step",
fill=False,
ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Distribution of G-Computation Row-Level Predicted Effects")
ax.set_xlabel("Predicted effect on future 7-day interactions")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
The CATE-style distribution helps distinguish a global average effect from heterogeneity. If LightGBM shows a wide distribution, the next project step could inspect which user histories are associated with larger positive or negative predicted effects.
Segment-Level G-Computation Effects
This cell summarizes predicted effects across simple history segments. Segment summaries are not the primary estimand, but they help explain whether the average effect is hiding important variation by prior engagement level.
segment_panel = panel[["user_id", "event_date", "prior_3day_interactions", "prior_3day_high_watch_share"]].copy()
segment_panel["prior_interactions_bucket"] = pd.qcut(
segment_panel["prior_3day_interactions"],
q=4,
duplicates="drop",
)
segment_panel["prior_high_watch_bucket"] = pd.qcut(
segment_panel["prior_3day_high_watch_share"],
q=4,
duplicates="drop",
)
segmented_scored = gcomp_scored.merge(
segment_panel[["user_id", "event_date", "prior_interactions_bucket", "prior_high_watch_bucket"]],
on=["user_id", "event_date"],
how="left",
)
segment_effects = []
for segment_col in ["prior_interactions_bucket", "prior_high_watch_bucket"]:
summary = (
segmented_scored.groupby(["model", segment_col], observed=True)
.agg(
rows=("cate_hat", "size"),
ate=("cate_hat", "mean"),
cate_std=("cate_hat", "std"),
observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
)
.reset_index()
.rename(columns={segment_col: "segment"})
)
summary.insert(1, "segment_type", segment_col)
segment_effects.append(summary)
segment_gcomp_effects = pd.concat(segment_effects, ignore_index=True)
display(segment_gcomp_effects)
| | model | segment_type | segment | rows | ate | cate_std | observed_outcome_mean |
|---|---|---|---|---|---|---|---|
| 0 | lightgbm | prior_interactions_bucket | (-0.001, 123.0] | 1183 | -0.0428 | 0.4879 | 276.2113 |
| 1 | lightgbm | prior_interactions_bucket | (123.0, 166.0] | 1182 | 0.0062 | 0.5888 | 386.9205 |
| 2 | lightgbm | prior_interactions_bucket | (166.0, 217.0] | 1164 | 0.0244 | 0.7948 | 420.7345 |
| 3 | lightgbm | prior_interactions_bucket | (217.0, 581.0] | 1169 | 0.1569 | 1.0627 | 445.8435 |
| 4 | linear_ridge | prior_interactions_bucket | (-0.001, 123.0] | 1183 | 2.1703 | 2.4964 | 276.2113 |
| 5 | linear_ridge | prior_interactions_bucket | (123.0, 166.0] | 1182 | 2.4662 | 2.3430 | 386.9205 |
| 6 | linear_ridge | prior_interactions_bucket | (166.0, 217.0] | 1164 | 2.3710 | 2.3921 | 420.7345 |
| 7 | linear_ridge | prior_interactions_bucket | (217.0, 581.0] | 1169 | 2.3081 | 2.4225 | 445.8435 |
| 8 | lightgbm | prior_high_watch_bucket | (-0.001, 1.127] | 1175 | 0.0551 | 0.6319 | 368.0732 |
| 9 | lightgbm | prior_high_watch_bucket | (1.127, 1.414] | 1174 | 0.1259 | 0.9780 | 386.3526 |
| 10 | lightgbm | prior_high_watch_bucket | (1.414, 1.708] | 1174 | -0.0251 | 0.8764 | 384.5980 |
| 11 | lightgbm | prior_high_watch_bucket | (1.708, 2.735] | 1175 | -0.0124 | 0.4720 | 389.3123 |
| 12 | linear_ridge | prior_high_watch_bucket | (-0.001, 1.127] | 1175 | 2.3254 | 2.4400 | 368.0732 |
| 13 | linear_ridge | prior_high_watch_bucket | (1.127, 1.414] | 1174 | 2.3538 | 2.3454 | 386.3526 |
| 14 | linear_ridge | prior_high_watch_bucket | (1.414, 1.708] | 1174 | 2.3703 | 2.4434 | 384.5980 |
| 15 | linear_ridge | prior_high_watch_bucket | (1.708, 2.735] | 1175 | 2.2658 | 2.4350 | 389.3123 |
The segment table shows where the outcome model predicts stronger or weaker effects. These are exploratory heterogeneity diagnostics, useful for interpretation but not a replacement for the primary average treatment effect.
User-Cluster Bootstrap for G-Computation
The analysis has repeated user-days, so bootstrap uncertainty should resample users rather than individual rows. This cell refits each outcome model on user-cluster bootstrap samples and recomputes the g-computation estimate.
The bootstrap is intentionally moderate in size to keep the notebook runnable while still giving a useful uncertainty check.
def fit_predict_gcomp_effect(model, data, feature_cols, outcome_col):
X = data[feature_cols].copy()
y = data[outcome_col].astype(float)
fitted_model = clone(model).fit(X, y)
pred_y1 = fitted_model.predict(make_counterfactual_features(X, 1))
pred_y0 = fitted_model.predict(make_counterfactual_features(X, 0))
return float(np.mean(pred_y1 - pred_y0))
N_BOOTSTRAP = 150
rng = np.random.default_rng(RANDOM_STATE)
user_ids = panel["user_id"].unique()
user_groups = {user_id: group.copy() for user_id, group in panel.groupby("user_id", sort=False)}
bootstrap_rows = []
for model_name, model in outcome_models.items():
for bootstrap_id in range(N_BOOTSTRAP):
sampled_users = rng.choice(user_ids, size=len(user_ids), replace=True)
bootstrap_sample = pd.concat(
[user_groups[user_id] for user_id in sampled_users],
ignore_index=True,
)
effect = fit_predict_gcomp_effect(
model,
bootstrap_sample,
outcome_features,
PRIMARY_OUTCOME,
)
bootstrap_rows.append(
{
"model": model_name,
"bootstrap_id": bootstrap_id,
"ate": effect,
}
)
gcomp_bootstrap = pd.DataFrame(bootstrap_rows)
gcomp_bootstrap_summary = (
gcomp_bootstrap.groupby("model")
.agg(
bootstrap_reps=("ate", "size"),
bootstrap_mean=("ate", "mean"),
bootstrap_std=("ate", "std"),
ci_95_lower=("ate", lambda values: values.quantile(0.025)),
ci_95_upper=("ate", lambda values: values.quantile(0.975)),
)
.reset_index()
)
display(gcomp_bootstrap_summary)
| | model | bootstrap_reps | bootstrap_mean | bootstrap_std | ci_95_lower | ci_95_upper |
|---|---|---|---|---|---|---|
| 0 | lightgbm | 150 | 0.1756 | 0.6123 | -0.6384 | 1.8656 |
| 1 | linear_ridge | 150 | 1.3476 | 4.3325 | -6.3238 | 9.3565 |
The bootstrap summary adds uncertainty to the g-computation estimates while respecting user clustering. Wide intervals mean the outcome-model estimate should be treated cautiously even if the point estimate looks meaningful.
Plot G-Computation Bootstrap Distributions
The bootstrap distributions show how much each model’s estimated effect changes when users are resampled. This gives a visual uncertainty check alongside the numeric confidence intervals.
fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
data=gcomp_bootstrap,
x="ate",
hue="model",
bins=35,
stat="density",
common_norm=False,
element="step",
fill=False,
ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("User-Cluster Bootstrap Distribution for G-Computation ATE")
ax.set_xlabel("Estimated effect on future 7-day interactions")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
The bootstrap plot shows whether the g-computation effect is consistently positive, consistently negative, or uncertain around zero. This will be important when comparing against the MSM result.
Compare G-Computation with the MSM Estimate
This cell loads the primary MSM effect from Notebook 04 and places it next to the g-computation estimates. This is the main cross-method comparison for the project.
msm_primary = pd.read_csv(MSM_PRIMARY_PATH)
msm_lookup = dict(zip(msm_primary["quantity"], msm_primary["value"]))
msm_comparison_row = {
"method": "MSM: weighted calendar-adjusted",
"model": "propensity_weighted_msm",
"estimate": msm_lookup["estimated_incremental_future_7day_interactions"],
"ci_95_lower": msm_lookup["bootstrap_ci_lower"],
"ci_95_upper": msm_lookup["bootstrap_ci_upper"],
"source": "Notebook 04 user-cluster bootstrap",
}
gcomp_comparison_rows = []
for _, row in gcomp_results.iterrows():
boot_row = gcomp_bootstrap_summary.query("model == @row.model").iloc[0]
gcomp_comparison_rows.append(
{
"method": "G-computation",
"model": row["model"],
"estimate": row["ate"],
"ci_95_lower": boot_row["ci_95_lower"],
"ci_95_upper": boot_row["ci_95_upper"],
"source": "Notebook 05 user-cluster bootstrap",
}
)
estimator_comparison = pd.DataFrame([msm_comparison_row, *gcomp_comparison_rows])
estimator_comparison["relative_lift_vs_msm_control_mean"] = (
estimator_comparison["estimate"] / msm_lookup["weighted_control_mean_future_7day_interactions"]
)
display(estimator_comparison)
| | method | model | estimate | ci_95_lower | ci_95_upper | source | relative_lift_vs_msm_control_mean |
|---|---|---|---|---|---|---|---|
| 0 | MSM: weighted calendar-adjusted | propensity_weighted_msm | -2.6877 | -12.2706 | 5.7291 | Notebook 04 user-cluster bootstrap | -0.0072 |
| 1 | G-computation | lightgbm | 0.0359 | -0.6384 | 1.8656 | Notebook 05 user-cluster bootstrap | 0.0001 |
| 2 | G-computation | linear_ridge | 2.3288 | -6.3238 | 9.3565 | Notebook 05 user-cluster bootstrap | 0.0062 |
The comparison table is the key conclusion check. If MSM and g-computation estimates are close and uncertain around zero, that supports an honest conclusion: the observed data does not show a clear long-term interaction-volume effect from high-watch exposure.
Plot Cross-Estimator Comparison
This plot compares the MSM estimate and the g-computation estimates on the same outcome scale. The intervals come from user-cluster bootstrap procedures in Notebooks 04 and 05.
plot_comparison = estimator_comparison.copy()
plot_comparison["label"] = plot_comparison["method"] + "\n" + plot_comparison["model"]
plot_comparison["lower_error"] = plot_comparison["estimate"] - plot_comparison["ci_95_lower"]
plot_comparison["upper_error"] = plot_comparison["ci_95_upper"] - plot_comparison["estimate"]
fig, ax = plt.subplots(figsize=(9, 5.5))
ax.errorbar(
x=np.arange(len(plot_comparison)),
y=plot_comparison["estimate"],
yerr=[plot_comparison["lower_error"], plot_comparison["upper_error"]],
fmt="o",
color="#2A6F97",
ecolor="black",
capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(plot_comparison)))
ax.set_xticklabels(plot_comparison["label"], rotation=20, ha="right")
ax.set_title("MSM Versus G-Computation Estimates")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Estimator")
plt.tight_layout()
plt.show()
The plot makes estimator agreement or disagreement easy to see. Cross-method consistency is more persuasive than any single estimate, especially in an observational sequential recommender log.
Secondary Outcome G-Computation
The primary outcome is future 7-day interaction volume. This cell repeats g-computation with the LightGBM outcome model for secondary outcomes, so we can see whether the story changes for active days or watch hours.
secondary_outcomes = [
"future_7day_active_days",
"future_7day_avg_daily_interactions",
"future_7day_play_hours",
"outcome_log1p",
]
secondary_rows = []
secondary_model = outcome_models["lightgbm"]
for outcome_col in secondary_outcomes:
scored_secondary, metrics_secondary = cross_fitted_gcomp(
secondary_model,
panel,
outcome_features,
outcome_col,
groups,
model_name=f"lightgbm_{outcome_col}",
n_splits=5,
)
secondary_rows.append(
{
"outcome": outcome_col,
"model": "lightgbm",
"ate": scored_secondary["cate_hat"].mean(),
"mean_y1_hat": scored_secondary["pred_y1"].mean(),
"mean_y0_hat": scored_secondary["pred_y0"].mean(),
"relative_lift_vs_y0": scored_secondary["cate_hat"].mean() / scored_secondary["pred_y0"].mean(),
"rmse_mean": metrics_secondary["rmse"].mean(),
"mae_mean": metrics_secondary["mae"].mean(),
"r2_mean": metrics_secondary["r2"].mean(),
}
)
secondary_gcomp_results = pd.DataFrame(secondary_rows)
display(secondary_gcomp_results)
| | outcome | model | ate | mean_y1_hat | mean_y0_hat | relative_lift_vs_y0 | rmse_mean | mae_mean | r2_mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | future_7day_active_days | lightgbm | -0.0176 | 6.8424 | 6.8601 | -0.0026 | 0.5294 | 0.2543 | -0.0250 |
| 1 | future_7day_avg_daily_interactions | lightgbm | 0.0051 | 54.6200 | 54.6149 | 0.0001 | 10.2085 | 7.4032 | 0.7905 |
| 2 | future_7day_play_hours | lightgbm | 0.0396 | 0.9374 | 0.8978 | 0.0441 | 0.2349 | 0.1654 | 0.6964 |
| 3 | outcome_log1p | lightgbm | 0.0003 | 5.8347 | 5.8344 | 0.0001 | 0.2169 | 0.1426 | 0.8291 |
The secondary outcome table checks whether the g-computation conclusion is specific to interaction volume or appears in other long-term behavior measures. Differences across outcomes should be reported as part of the product interpretation.
Save G-Computation Results
This cell saves the main g-computation estimates, model diagnostics, counterfactual predictions, bootstrap results, estimator comparison, segment summaries, and secondary outcome estimates.
gcomp_results_path = PROCESSED_DIR / "kuairec_long_term_gcomp_results.csv"
gcomp_metrics_path = PROCESSED_DIR / "kuairec_long_term_gcomp_model_metrics.csv"
gcomp_counterfactuals_path = PROCESSED_DIR / "kuairec_long_term_gcomp_counterfactuals.parquet"
gcomp_bootstrap_path = PROCESSED_DIR / "kuairec_long_term_gcomp_bootstrap.csv"
gcomp_bootstrap_summary_path = PROCESSED_DIR / "kuairec_long_term_gcomp_bootstrap_summary.csv"
estimator_comparison_path = PROCESSED_DIR / "kuairec_long_term_estimator_comparison.csv"
segment_gcomp_path = PROCESSED_DIR / "kuairec_long_term_gcomp_segment_effects.csv"
secondary_gcomp_path = PROCESSED_DIR / "kuairec_long_term_gcomp_secondary_outcomes.csv"
# Keep one row per user-day/model with the counterfactual predictions needed for later reporting.
gcomp_counterfactuals = gcomp_scored[[
"user_id",
"event_date",
"model",
PRIMARY_TREATMENT,
PRIMARY_OUTCOME,
"pred_observed",
"pred_y1",
"pred_y0",
"cate_hat",
"residual",
]].copy()
gcomp_results.to_csv(gcomp_results_path, index=False)
outcome_model_metrics.to_csv(gcomp_metrics_path, index=False)
gcomp_counterfactuals.to_parquet(gcomp_counterfactuals_path, index=False)
gcomp_bootstrap.to_csv(gcomp_bootstrap_path, index=False)
gcomp_bootstrap_summary.to_csv(gcomp_bootstrap_summary_path, index=False)
estimator_comparison.to_csv(estimator_comparison_path, index=False)
segment_gcomp_effects.to_csv(segment_gcomp_path, index=False)
secondary_gcomp_results.to_csv(secondary_gcomp_path, index=False)
print("Saved Notebook 05 artifacts:")
print(f"- {gcomp_results_path}")
print(f"- {gcomp_metrics_path}")
print(f"- {gcomp_counterfactuals_path}")
print(f"- {gcomp_bootstrap_path}")
print(f"- {gcomp_bootstrap_summary_path}")
print(f"- {estimator_comparison_path}")
print(f"- {segment_gcomp_path}")
print(f"- {secondary_gcomp_path}")Saved Notebook 05 artifacts:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_results.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_model_metrics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_counterfactuals.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_bootstrap.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_bootstrap_summary.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimator_comparison.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_segment_effects.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_secondary_outcomes.csv
The g-computation artifacts are saved for the final project report. The most important files are the estimator comparison table and the counterfactual predictions, because they let the next notebook summarize cross-method agreement and heterogeneity.
Takeaways and Next Step
This notebook estimated the primary long-term effect using g-computation. It modeled future engagement directly, generated counterfactual predictions under treatment and control, summarized average and row-level predicted effects, and compared those estimates with the MSM result from Notebook 04.
The key value of this notebook is cross-method triangulation. If the MSM and g-computation estimates both suggest a small or uncertain effect on future 7-day interactions, then the final project should say that clearly rather than forcing a positive story.
The next notebook should be the final sensitivity and report notebook. It should combine the MSM and g-computation outputs, state the final conclusion, document limitations, and create portfolio-ready tables/figures.