from pathlib import Path
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from lightgbm import LGBMRegressor
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="X does not have valid feature names")
pd.set_option("display.max_columns", 140)
pd.set_option("display.max_rows", 120)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")
sns.set_theme(style="whitegrid", context="notebook")
RANDOM_STATE = 42
Notebook 05: G-Computation for Long-Term Effects
Notebook 04 estimated the long-term effect of high-watch exposure using a marginal structural model. That approach modeled treatment assignment, created inverse probability weights, and then estimated the effect in a weighted pseudo-population.
This notebook estimates the same causal question using a different strategy: g-computation. Instead of modeling the probability of treatment, g-computation models the future outcome directly:
future_7day_interactions ~ treatment + pre-treatment history + calendar context
Once the outcome model is fit, we simulate two counterfactual predictions for every user-day:
- predicted future engagement if the user-day were treated;
- predicted future engagement if the same user-day were untreated.
The average difference between those two predictions estimates:
E[Y(1) - Y(0)]
This is a strong next step because it checks whether an outcome-modeling approach tells a similar story to the weighting approach from Notebook 04.
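For intuition before the real analysis, here is a minimal, self-contained sketch of that recipe on toy data (the column names, coefficients, and Ridge model below are illustrative placeholders, not the notebook's actual panel or pipeline):
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "treatment": rng.integers(0, 2, 500),
    "lag_interactions": rng.poisson(10, 500).astype(float),
})
toy["outcome"] = 3.0 * toy["treatment"] + 2.0 * toy["lag_interactions"] + rng.normal(0, 1, 500)
feature_cols = ["treatment", "lag_interactions"]
model = Ridge(alpha=1.0).fit(toy[feature_cols], toy["outcome"])
# G-computation step: toggle treatment for every row and average the predicted gap.
pred_if_treated = model.predict(toy.assign(treatment=1)[feature_cols])
pred_if_untreated = model.predict(toy.assign(treatment=0)[feature_cols])
print(f"Estimated E[Y(1) - Y(0)]: {(pred_if_treated - pred_if_untreated).mean():.2f}")  # near 3 by construction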
Why G-Computation Helps This Project
A marginal structural model and g-computation make different modeling bets.
- The MSM relies heavily on the treatment model and the resulting inverse probability weights.
- G-computation relies heavily on the outcome model and its ability to predict future engagement under both treatment states.
Neither approach magically solves unobserved confounding. But if both approaches, using the same estimand and the same valid pre-treatment covariates, produce a similar answer, the project becomes more credible. If they disagree, that disagreement is informative and should be discussed as model sensitivity.
Setup
The first cell imports the libraries used for outcome modeling, cross-fitting, diagnostics, bootstrap uncertainty, and plots. We use two outcome-model families: a regularized linear model for transparency and LightGBM for flexible nonlinear prediction.
The notebook environment is ready for outcome-model-based causal estimation. The important modeling choice is that the same pre-treatment feature set will be used for all models, so differences in estimates reflect model family rather than different covariate rules.
Locate the Weighted Panel and MSM Result
This notebook uses the weighted panel from Notebook 03 because it already contains the final estimand population and standardized columns. It also loads the primary MSM result from Notebook 04 so the two causal strategies can be compared directly.
WEIGHTED_PANEL_RELATIVE_PATH = Path("data/processed/kuairec_long_term_weighted_panel.parquet")
MSM_PRIMARY_RELATIVE_PATH = Path("data/processed/kuairec_long_term_msm_primary_statement.csv")
candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
(path for path in candidate_roots if (path / WEIGHTED_PANEL_RELATIVE_PATH).exists()),
None,
)
if PROJECT_ROOT is None:
raise FileNotFoundError(
f"Could not find {WEIGHTED_PANEL_RELATIVE_PATH}. Run Notebook 03 first or run this notebook inside the project."
)
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PANEL_PATH = PROJECT_ROOT / WEIGHTED_PANEL_RELATIVE_PATH
MSM_PRIMARY_PATH = PROJECT_ROOT / MSM_PRIMARY_RELATIVE_PATH
print(f"Project root: {PROJECT_ROOT}")
print(f"Weighted panel: {PANEL_PATH}")
print(f"MSM primary result: {MSM_PRIMARY_PATH}")
print(f"Processed output folder: {PROCESSED_DIR}")Project root: /home/apex/Documents/ranking_sys
Weighted panel: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weighted_panel.parquet
MSM primary result: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_msm_primary_statement.csv
Processed output folder: /home/apex/Documents/ranking_sys/data/processed
The notebook found both required inputs. The weighted panel supplies the analysis rows and covariates, while the MSM result provides the benchmark effect estimate from the previous notebook.
Load the Analysis Panel
The primary treatment column is treatment, and the primary outcome column is outcome, which records future 7-day interaction volume. The panel also includes repeated user-days, so model evaluation and bootstrap uncertainty should respect user_id groups.
panel = pd.read_parquet(PANEL_PATH)
panel["event_date"] = pd.to_datetime(panel["event_date"])
panel = panel.sort_values(["user_id", "event_date"]).reset_index(drop=True)
PRIMARY_TREATMENT = "treatment"
PRIMARY_OUTCOME = "outcome"
required_columns = ["user_id", "event_date", "day_of_week", PRIMARY_TREATMENT, PRIMARY_OUTCOME]
missing_required = [column for column in required_columns if column not in panel.columns]
if missing_required:
raise ValueError(f"Missing required columns: {missing_required}")
print(f"Panel shape: {panel.shape}")
print(f"Users: {panel['user_id'].nunique():,}")
print(f"Date range: {panel['event_date'].min().date()} to {panel['event_date'].max().date()}")
display(panel[required_columns + ["analysis_weight", "propensity_logistic"]].head())
Panel shape: (4698, 40)
Users: 91
Date range: 2020-07-08 to 2020-08-29
| | user_id | event_date | day_of_week | treatment | outcome | analysis_weight | propensity_logistic |
|---|---|---|---|---|---|---|---|
| 0 | 14 | 2020-07-08 | Wednesday | 1 | 279.0000 | 0.9131 | 0.5471 |
| 1 | 14 | 2020-07-09 | Thursday | 0 | 311.0000 | 1.3647 | 0.6333 |
| 2 | 14 | 2020-07-10 | Friday | 1 | 352.0000 | 1.2067 | 0.4140 |
| 3 | 14 | 2020-07-11 | Saturday | 1 | 437.0000 | 0.8836 | 0.5654 |
| 4 | 14 | 2020-07-12 | Sunday | 0 | 437.0000 | 1.2466 | 0.5986 |
The loaded table is the same final estimand population used by the MSM notebook. Because the treatment and outcome columns are standardized, this notebook can focus on the g-computation strategy rather than redefining the causal question.
Define the Outcome-Model Features
G-computation needs an outcome model that uses the treatment and pre-treatment covariates. The treatment is included because we want to estimate how predicted outcomes change when the treatment is toggled. The history and calendar features are included to adjust for observed confounding.
No future outcome columns are included as features.
base_numeric_covariates = [
"lag_1_active_day",
"lag_1_interactions",
"lag_1_total_play_duration_sec",
"lag_1_avg_watch_ratio",
"lag_1_high_watch_share",
"prior_3day_active_day",
"prior_3day_interactions",
"prior_3day_total_play_duration_sec",
"prior_3day_avg_watch_ratio",
"prior_3day_high_watch_share",
"calendar_day_index",
"panel_day_index",
]
numeric_features = [PRIMARY_TREATMENT] + base_numeric_covariates
categorical_features = ["day_of_week"]
outcome_features = numeric_features + categorical_features
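# Leakage guard (a light sanity check; it assumes future-looking columns follow
# the future_ naming prefix used in this project): the outcome window must not
# appear anywhere in the feature set.
leaky_features = [column for column in outcome_features if column.startswith("future_") or column == PRIMARY_OUTCOME]
assert not leaky_features, f"Future-looking columns leaked into features: {leaky_features}"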
feature_table = pd.DataFrame(
[
{"role": "treatment toggle", "columns": PRIMARY_TREATMENT, "reason": "Set to 1 and 0 to generate counterfactual predictions."},
{"role": "pre-treatment numeric history", "columns": ", ".join(base_numeric_covariates), "reason": "Adjust for recent activity and watch behavior."},
{"role": "calendar context", "columns": ", ".join(categorical_features), "reason": "Account for day-of-week patterns in engagement."},
]
)
display(feature_table)
| | role | columns | reason |
|---|---|---|---|
| 0 | treatment toggle | treatment | Set to 1 and 0 to generate counterfactual pred... |
| 1 | pre-treatment numeric history | lag_1_active_day, lag_1_interactions, lag_1_to... | Adjust for recent activity and watch behavior. |
| 2 | calendar context | day_of_week | Account for day-of-week patterns in engagement. |
The feature set respects the causal timing: it uses treatment and pre-treatment context, but not future information. This lets the outcome model simulate counterfactual treatment states for the same user-day history.
Outcome and Treatment Summary
Before fitting outcome models, we summarize the treatment split and outcome distribution. This gives a baseline sense of how hard the prediction problem is and how much variation exists in future engagement.
outcome_summary = pd.DataFrame(
[
{"metric": "rows", "value": len(panel)},
{"metric": "users", "value": panel["user_id"].nunique()},
{"metric": "treatment_rate", "value": panel[PRIMARY_TREATMENT].mean()},
{"metric": "outcome_mean", "value": panel[PRIMARY_OUTCOME].mean()},
{"metric": "outcome_std", "value": panel[PRIMARY_OUTCOME].std()},
{"metric": "outcome_min", "value": panel[PRIMARY_OUTCOME].min()},
{"metric": "outcome_median", "value": panel[PRIMARY_OUTCOME].median()},
{"metric": "outcome_max", "value": panel[PRIMARY_OUTCOME].max()},
]
)
display(outcome_summary)
display(
panel.groupby(PRIMARY_TREATMENT)[PRIMARY_OUTCOME]
.agg(["count", "mean", "std", "min", "median", "max"])
.rename_axis("treatment")
)
| | metric | value |
|---|---|---|
| 0 | rows | 4,698.0000 |
| 1 | users | 91.0000 |
| 2 | treatment_rate | 0.4996 |
| 3 | outcome_mean | 382.0826 |
| 4 | outcome_std | 156.3942 |
| 5 | outcome_min | 0.0000 |
| 6 | outcome_median | 391.0000 |
| 7 | outcome_max | 888.0000 |
| treatment | count | mean | std | min | median | max |
|---|---|---|---|---|---|---|
| 0 | 2351 | 378.5479 | 157.6583 | 18.0000 | 380.0000 | 888.0000 |
| 1 | 2347 | 385.6233 | 155.0705 | 0.0000 | 402.0000 | 851.0000 |
The outcome has meaningful spread, which makes outcome modeling feasible. The treated and control raw means remain descriptive only; g-computation will compare counterfactual predictions for the same covariate distribution.
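A toy example (all names and numbers below are illustrative) makes that distinction concrete: the true treatment effect is zero, a confounder drives both treatment and outcome, the naive treated-minus-control gap is badly biased, and g-computation with the confounder in the model recovers roughly zero:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)                         # confounder
t = (x + rng.normal(size=5_000) > 0).astype(int)   # confounder drives treatment
y = 5.0 * x + rng.normal(size=5_000)               # true treatment effect is zero
toy = pd.DataFrame({"t": t, "x": x, "y": y})
naive_gap = toy.loc[toy["t"] == 1, "y"].mean() - toy.loc[toy["t"] == 0, "y"].mean()
model = LinearRegression().fit(toy[["t", "x"]], toy["y"])
gcomp_ate = (model.predict(toy.assign(t=1)[["t", "x"]]) - model.predict(toy.assign(t=0)[["t", "x"]])).mean()
print(f"Naive gap: {naive_gap:.2f}, g-computation: {gcomp_ate:.2f}")  # biased versus near zero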
Define Outcome Model Pipelines
The linear model is transparent and stable. LightGBM is more flexible and can capture nonlinearities and interactions, including treatment-effect heterogeneity. Both models use the same features.
def make_linear_outcome_pipeline(numeric_cols, categorical_cols):
preprocessor = ColumnTransformer(
transformers=[
("numeric", StandardScaler(), numeric_cols),
("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
],
remainder="drop",
verbose_feature_names_out=False,
)
return Pipeline(
steps=[
("preprocessor", preprocessor),
("model", Ridge(alpha=1.0)),
]
)
def make_lgbm_outcome_pipeline(numeric_cols, categorical_cols):
preprocessor = ColumnTransformer(
transformers=[
("numeric", "passthrough", numeric_cols),
("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
],
remainder="drop",
verbose_feature_names_out=False,
)
return Pipeline(
steps=[
("preprocessor", preprocessor),
(
"model",
LGBMRegressor(
n_estimators=140,
learning_rate=0.04,
num_leaves=15,
min_child_samples=40,
subsample=0.9,
colsample_bytree=0.9,
random_state=RANDOM_STATE,
n_jobs=1,
verbose=-1,
),
),
]
)
outcome_models = {
"linear_ridge": make_linear_outcome_pipeline(numeric_features, categorical_features),
"lightgbm": make_lgbm_outcome_pipeline(numeric_features, categorical_features),
}
print("Defined outcome models:")
for model_name in outcome_models:
print(f"- {model_name}")Defined outcome models:
- linear_ridge
- lightgbm
The two model families give a useful contrast: linear ridge provides an interpretable baseline, while LightGBM tests whether flexible nonlinear outcome modeling changes the estimated causal effect.
Cross-Fit Outcome Models and Counterfactual Predictions
This cell uses user-grouped cross-fitting. Each user’s rows appear entirely in either the training fold or the held-out fold, so the model is never evaluated on days from a user it was trained on.
For each held-out row, the model predicts:
- the observed outcome under the actual treatment value;
- the counterfactual outcome if treatment = 1;
- the counterfactual outcome if treatment = 0.
The difference between the two counterfactual predictions is the row-level g-computation effect contribution.
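As a quick sanity check on the grouping logic, this tiny sketch (toy groups; the notebook's real grouping key is user_id) confirms that GroupKFold keeps every group entirely on one side of each split:
import numpy as np
from sklearn.model_selection import GroupKFold

toy_groups = np.repeat(np.arange(10), 5)  # 10 "users" with 5 rows each
toy_X = np.arange(50).reshape(-1, 1)
for train_idx, test_idx in GroupKFold(n_splits=5).split(toy_X, groups=toy_groups):
    assert not set(toy_groups[train_idx]) & set(toy_groups[test_idx])
print("No group appears in both the training and held-out folds.")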
def make_counterfactual_features(X, treatment_value):
X_cf = X.copy()
X_cf[PRIMARY_TREATMENT] = treatment_value
return X_cf
def evaluate_predictions(y_true, y_pred):
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
return rmse, mae, r2
def cross_fitted_gcomp(model, data, feature_cols, outcome_col, groups, model_name, n_splits=5):
group_kfold = GroupKFold(n_splits=n_splits)
X = data[feature_cols].copy()
y = data[outcome_col].astype(float).copy()
observed_pred = np.zeros(len(data), dtype=float)
y1_pred = np.zeros(len(data), dtype=float)
y0_pred = np.zeros(len(data), dtype=float)
fold_rows = []
for fold, (train_idx, test_idx) in enumerate(group_kfold.split(X, y, groups=groups), start=1):
fold_model = clone(model)
X_train = X.iloc[train_idx]
y_train = y.iloc[train_idx]
X_test = X.iloc[test_idx]
y_test = y.iloc[test_idx]
fold_model.fit(X_train, y_train)
pred_observed = fold_model.predict(X_test)
pred_y1 = fold_model.predict(make_counterfactual_features(X_test, 1))
pred_y0 = fold_model.predict(make_counterfactual_features(X_test, 0))
observed_pred[test_idx] = pred_observed
y1_pred[test_idx] = pred_y1
y0_pred[test_idx] = pred_y0
rmse, mae, r2 = evaluate_predictions(y_test, pred_observed)
fold_rows.append(
{
"model": model_name,
"fold": fold,
"rows": len(test_idx),
"users": data.iloc[test_idx]["user_id"].nunique(),
"rmse": rmse,
"mae": mae,
"r2": r2,
"mean_observed_outcome": y_test.mean(),
"mean_predicted_outcome": pred_observed.mean(),
"gcomp_ate_fold": (pred_y1 - pred_y0).mean(),
}
)
scored = data[["user_id", "event_date", PRIMARY_TREATMENT, outcome_col]].copy()
scored["model"] = model_name
scored["pred_observed"] = observed_pred
scored["pred_y1"] = y1_pred
scored["pred_y0"] = y0_pred
scored["cate_hat"] = y1_pred - y0_pred
scored["residual"] = scored[outcome_col] - scored["pred_observed"]
return scored, pd.DataFrame(fold_rows)
groups = panel["user_id"]
scored_frames = []
metric_frames = []
for model_name, model in outcome_models.items():
scored, metrics = cross_fitted_gcomp(
model,
panel,
outcome_features,
PRIMARY_OUTCOME,
groups,
model_name=model_name,
n_splits=5,
)
scored_frames.append(scored)
metric_frames.append(metrics)
gcomp_scored = pd.concat(scored_frames, ignore_index=True)
gcomp_fold_metrics = pd.concat(metric_frames, ignore_index=True)
display(gcomp_fold_metrics)
| | model | fold | rows | users | rmse | mae | r2 | mean_observed_outcome | mean_predicted_outcome | gcomp_ate_fold |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linear_ridge | 1 | 962 | 19 | 106.5616 | 82.3527 | 0.5154 | 381.5561 | 384.7669 | 2.3053 |
| 1 | linear_ridge | 2 | 934 | 18 | 103.2866 | 82.4002 | 0.5253 | 381.8961 | 382.3991 | 3.8276 |
| 2 | linear_ridge | 3 | 934 | 18 | 126.3958 | 94.4799 | 0.4328 | 381.4700 | 381.4646 | -1.8905 |
| 3 | linear_ridge | 4 | 934 | 18 | 107.8745 | 82.8412 | 0.5395 | 385.5064 | 379.2154 | 5.3646 |
| 4 | linear_ridge | 5 | 934 | 18 | 104.3184 | 82.4244 | 0.5257 | 380.0000 | 382.3423 | 2.0375 |
| 5 | lightgbm | 1 | 962 | 19 | 72.2918 | 51.5314 | 0.7770 | 381.5561 | 386.7467 | 0.0473 |
| 6 | lightgbm | 2 | 934 | 18 | 63.2139 | 48.2563 | 0.8222 | 381.8961 | 383.3526 | 0.2892 |
| 7 | lightgbm | 3 | 934 | 18 | 83.8233 | 58.7569 | 0.7505 | 381.4700 | 380.5458 | -0.3942 |
| 8 | lightgbm | 4 | 934 | 18 | 77.5274 | 54.6464 | 0.7621 | 385.5064 | 380.0921 | 0.1809 |
| 9 | lightgbm | 5 | 934 | 18 | 60.4394 | 45.9208 | 0.8408 | 380.0000 | 380.7179 | 0.0558 |
The cross-fitted table shows out-of-user prediction quality and fold-level g-computation estimates. This is the core diagnostic for whether the outcome models are credible enough to support counterfactual simulation.
Summarize Outcome Model Quality
This cell aggregates the fold metrics by model family. Lower RMSE and MAE indicate better prediction accuracy, while higher R-squared indicates more explained variation in future engagement.
outcome_model_metrics = (
gcomp_fold_metrics.groupby("model")
.agg(
rows=("rows", "sum"),
users=("users", "sum"),
rmse_mean=("rmse", "mean"),
rmse_std=("rmse", "std"),
mae_mean=("mae", "mean"),
r2_mean=("r2", "mean"),
observed_mean=("mean_observed_outcome", "mean"),
predicted_mean=("mean_predicted_outcome", "mean"),
fold_ate_mean=("gcomp_ate_fold", "mean"),
fold_ate_std=("gcomp_ate_fold", "std"),
)
.reset_index()
)
display(outcome_model_metrics)
| | model | rows | users | rmse_mean | rmse_std | mae_mean | r2_mean | observed_mean | predicted_mean | fold_ate_mean | fold_ate_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lightgbm | 4698 | 91 | 71.4592 | 9.7444 | 51.8224 | 0.7905 | 382.0857 | 382.2910 | 0.0358 | 0.2601 |
| 1 | linear_ridge | 4698 | 91 | 109.6874 | 9.5135 | 84.8997 | 0.5077 | 382.0857 | 382.0377 | 2.3289 | 2.7088 |
The model-quality table helps decide how much trust to place in each g-computation estimate. A flexible model may fit better, but the causal estimate should still be checked against the transparent linear model and the MSM result.
Predicted Versus Observed Outcomes
A scatter plot of predicted versus observed outcomes shows whether the outcome models capture the main signal or collapse toward the mean. The plot samples points for readability.
plot_scored = gcomp_scored.sample(min(2_000, len(gcomp_scored)), random_state=RANDOM_STATE)
fig, axes = plt.subplots(1, 2, figsize=(13, 5), sharex=True, sharey=True)
for ax, model_name in zip(axes, ["linear_ridge", "lightgbm"]):
model_plot = plot_scored.query("model == @model_name")
sns.scatterplot(
data=model_plot,
x=PRIMARY_OUTCOME,
y="pred_observed",
alpha=0.35,
s=18,
ax=ax,
color="#2A6F97",
)
min_value = min(model_plot[PRIMARY_OUTCOME].min(), model_plot["pred_observed"].min())
max_value = max(model_plot[PRIMARY_OUTCOME].max(), model_plot["pred_observed"].max())
ax.plot([min_value, max_value], [min_value, max_value], color="black", linestyle="--", linewidth=1)
ax.set_title(f"{model_name}: Predicted vs Observed")
ax.set_xlabel("Observed future interactions")
ax.set_ylabel("Predicted future interactions")
plt.tight_layout()
plt.show()
The scatter plots show how well each model tracks observed future engagement. Large vertical spread is expected because future behavior is noisy, but the model should still capture meaningful variation rather than predicting one constant value for everyone.
Calibration by Prediction Bin
Calibration checks whether rows with higher predicted outcomes actually have higher observed outcomes. This is a useful outcome-model diagnostic because g-computation relies on the level of predicted counterfactual outcomes.
def prediction_calibration_table(scored, n_bins=10):
frames = []
for model_name, model_data in scored.groupby("model"):
temp = model_data.copy()
temp["prediction_bin"] = pd.qcut(temp["pred_observed"], q=n_bins, duplicates="drop")
frame = (
temp.groupby("prediction_bin", observed=True)
.agg(
rows=(PRIMARY_OUTCOME, "size"),
mean_predicted=("pred_observed", "mean"),
mean_observed=(PRIMARY_OUTCOME, "mean"),
mean_residual=("residual", "mean"),
)
.reset_index()
)
frame.insert(0, "model", model_name)
frames.append(frame)
return pd.concat(frames, ignore_index=True)
calibration = prediction_calibration_table(gcomp_scored)
display(calibration)
| | model | prediction_bin | rows | mean_predicted | mean_observed | mean_residual |
|---|---|---|---|---|---|---|
| 0 | lightgbm | (77.664, 156.969] | 470 | 121.2450 | 118.3745 | -2.8705 |
| 1 | lightgbm | (156.969, 243.43] | 470 | 196.3886 | 195.4213 | -0.9673 |
| 2 | lightgbm | (243.43, 331.146] | 470 | 293.2714 | 294.3170 | 1.0457 |
| 3 | lightgbm | (331.146, 381.112] | 469 | 350.9291 | 347.0362 | -3.8928 |
| 4 | lightgbm | (381.112, 414.707] | 470 | 400.0981 | 401.1340 | 1.0359 |
| 5 | lightgbm | (414.707, 438.505] | 470 | 426.5840 | 427.0745 | 0.4904 |
| 6 | lightgbm | (438.505, 464.164] | 469 | 450.2288 | 448.0874 | -2.1413 |
| 7 | lightgbm | (464.164, 502.289] | 470 | 480.6242 | 480.8106 | 0.1864 |
| 8 | lightgbm | (502.289, 548.924] | 470 | 526.6655 | 536.8596 | 10.1940 |
| 9 | lightgbm | (548.924, 679.474] | 470 | 577.2190 | 571.7766 | -5.4424 |
| 10 | linear_ridge | (94.196, 213.272] | 470 | 173.3898 | 122.2617 | -51.1280 |
| 11 | linear_ridge | (213.272, 264.816] | 470 | 239.5195 | 207.8128 | -31.7068 |
| 12 | linear_ridge | (264.816, 317.465] | 470 | 287.8596 | 282.7468 | -5.1128 |
| 13 | linear_ridge | (317.465, 380.705] | 469 | 351.9200 | 394.5181 | 42.5981 |
| 14 | linear_ridge | (380.705, 411.384] | 470 | 397.3584 | 495.7298 | 98.3714 |
| 15 | linear_ridge | (411.384, 435.723] | 470 | 423.7958 | 507.6404 | 83.8446 |
| 16 | linear_ridge | (435.723, 462.498] | 469 | 449.1164 | 486.4222 | 37.3058 |
| 17 | linear_ridge | (462.498, 484.971] | 470 | 473.9972 | 453.0660 | -20.9312 |
| 18 | linear_ridge | (484.971, 505.903] | 470 | 495.0923 | 440.2553 | -54.8370 |
| 19 | linear_ridge | (505.903, 647.435] | 470 | 528.5690 | 430.6213 | -97.9478 |
The calibration table checks whether predicted outcome levels are systematically too high or too low across the prediction range. Good calibration makes the counterfactual averages easier to trust.
Plot Outcome Calibration
The dashed diagonal represents perfect calibration. Points above the line mean observed future engagement is higher than predicted; points below the line mean the model overpredicts.
fig, ax = plt.subplots(figsize=(6, 6))
sns.scatterplot(
data=calibration,
x="mean_predicted",
y="mean_observed",
hue="model",
size="rows",
sizes=(40, 160),
ax=ax,
)
min_axis = min(calibration["mean_predicted"].min(), calibration["mean_observed"].min())
max_axis = max(calibration["mean_predicted"].max(), calibration["mean_observed"].max())
ax.plot([min_axis, max_axis], [min_axis, max_axis], color="black", linestyle="--", linewidth=1)
ax.set_title("Outcome Model Calibration")
ax.set_xlabel("Mean predicted future interactions")
ax.set_ylabel("Mean observed future interactions")
plt.tight_layout()
plt.show()
The calibration plot turns model quality into a visual check. If the flexible model calibrates better than the linear model, its counterfactual estimates may be more credible, though they still need sensitivity checks.
Estimate G-Computation Effects
The g-computation estimate is the average predicted outcome under treatment minus the average predicted outcome under control. This cell summarizes both the average treatment effect and the distribution of row-level predicted effects.
gcomp_results = (
gcomp_scored.groupby("model")
.agg(
n_rows=(PRIMARY_OUTCOME, "size"),
observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
mean_pred_observed=("pred_observed", "mean"),
mean_y1_hat=("pred_y1", "mean"),
mean_y0_hat=("pred_y0", "mean"),
ate=("cate_hat", "mean"),
cate_std=("cate_hat", "std"),
cate_p05=("cate_hat", lambda values: values.quantile(0.05)),
cate_median=("cate_hat", "median"),
cate_p95=("cate_hat", lambda values: values.quantile(0.95)),
share_positive_cate=("cate_hat", lambda values: (values > 0).mean()),
)
.reset_index()
)
gcomp_results["relative_lift_vs_y0"] = gcomp_results["ate"] / gcomp_results["mean_y0_hat"]
display(gcomp_results)
| | model | n_rows | observed_outcome_mean | mean_pred_observed | mean_y1_hat | mean_y0_hat | ate | cate_std | cate_p05 | cate_median | cate_p95 | share_positive_cate | relative_lift_vs_y0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lightgbm | 4698 | 382.0826 | 382.3176 | 382.3398 | 382.3040 | 0.0359 | 0.7680 | -0.9658 | 0.0000 | 1.0999 | 0.1249 | 0.0001 |
| 1 | linear_ridge | 4698 | 382.0826 | 382.0539 | 383.2312 | 380.9024 | 2.3288 | 2.4159 | -1.8905 | 2.3053 | 5.3646 | 0.8012 | 0.0061 |
This is the main g-computation result table. The ate column estimates the average causal effect implied by each outcome model, while the CATE distribution columns show whether the model predicts similar or heterogeneous effects across user-days.
Plot Counterfactual Means
This plot compares the model-implied average outcome under treatment and under control. It is often easier to communicate g-computation this way: the model predicts two potential outcome means, and their gap is the estimated effect.
cf_means_plot = gcomp_results.melt(
id_vars="model",
value_vars=["mean_y0_hat", "mean_y1_hat"],
var_name="counterfactual_state",
value_name="predicted_mean_outcome",
)
cf_means_plot["counterfactual_state"] = cf_means_plot["counterfactual_state"].map(
{"mean_y0_hat": "Predicted if untreated", "mean_y1_hat": "Predicted if treated"}
)
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(
data=cf_means_plot,
x="model",
y="predicted_mean_outcome",
hue="counterfactual_state",
ax=ax,
)
ax.set_title("G-Computation Counterfactual Mean Outcomes")
ax.set_xlabel("Outcome model")
ax.set_ylabel("Predicted future 7-day interactions")
plt.tight_layout()
plt.show()
The counterfactual mean plot shows the estimated potential outcome gap directly. If the treated and untreated bars are close, the model is saying high-watch exposure has little average effect on future interaction volume.
Plot Row-Level Predicted Effects
The row-level cate_hat values are not experimental individual treatment effects, but they show how the outcome model expects treatment effects to vary across observed histories. LightGBM can produce heterogeneous effects; the additive ridge model produces a nearly constant effect, because without treatment interaction terms the treatment coefficient adds the same shift to every row (up to small differences across cross-fitting folds).
fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
data=gcomp_scored,
x="cate_hat",
hue="model",
bins=50,
stat="density",
common_norm=False,
element="step",
fill=False,
ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Distribution of G-Computation Row-Level Predicted Effects")
ax.set_xlabel("Predicted effect on future 7-day interactions")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
The CATE-style distribution helps distinguish a global average effect from heterogeneity. If LightGBM shows a wide distribution, the next project step could inspect which user histories are associated with larger positive or negative predicted effects.
Segment-Level G-Computation Effects
This cell summarizes predicted effects across simple history segments. Segment summaries are not the primary estimand, but they help explain whether the average effect is hiding important variation by prior engagement level.
segment_panel = panel[["user_id", "event_date", "prior_3day_interactions", "prior_3day_high_watch_share"]].copy()
segment_panel["prior_interactions_bucket"] = pd.qcut(
segment_panel["prior_3day_interactions"],
q=4,
duplicates="drop",
)
segment_panel["prior_high_watch_bucket"] = pd.qcut(
segment_panel["prior_3day_high_watch_share"],
q=4,
duplicates="drop",
)
segmented_scored = gcomp_scored.merge(
segment_panel[["user_id", "event_date", "prior_interactions_bucket", "prior_high_watch_bucket"]],
on=["user_id", "event_date"],
how="left",
)
segment_effects = []
for segment_col in ["prior_interactions_bucket", "prior_high_watch_bucket"]:
summary = (
segmented_scored.groupby(["model", segment_col], observed=True)
.agg(
rows=("cate_hat", "size"),
ate=("cate_hat", "mean"),
cate_std=("cate_hat", "std"),
observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
)
.reset_index()
.rename(columns={segment_col: "segment"})
)
summary.insert(1, "segment_type", segment_col)
segment_effects.append(summary)
segment_gcomp_effects = pd.concat(segment_effects, ignore_index=True)
display(segment_gcomp_effects)
| | model | segment_type | segment | rows | ate | cate_std | observed_outcome_mean |
|---|---|---|---|---|---|---|---|
| 0 | lightgbm | prior_interactions_bucket | (-0.001, 123.0] | 1183 | -0.0428 | 0.4879 | 276.2113 |
| 1 | lightgbm | prior_interactions_bucket | (123.0, 166.0] | 1182 | 0.0062 | 0.5888 | 386.9205 |
| 2 | lightgbm | prior_interactions_bucket | (166.0, 217.0] | 1164 | 0.0244 | 0.7948 | 420.7345 |
| 3 | lightgbm | prior_interactions_bucket | (217.0, 581.0] | 1169 | 0.1569 | 1.0627 | 445.8435 |
| 4 | linear_ridge | prior_interactions_bucket | (-0.001, 123.0] | 1183 | 2.1703 | 2.4964 | 276.2113 |
| 5 | linear_ridge | prior_interactions_bucket | (123.0, 166.0] | 1182 | 2.4662 | 2.3430 | 386.9205 |
| 6 | linear_ridge | prior_interactions_bucket | (166.0, 217.0] | 1164 | 2.3710 | 2.3921 | 420.7345 |
| 7 | linear_ridge | prior_interactions_bucket | (217.0, 581.0] | 1169 | 2.3081 | 2.4225 | 445.8435 |
| 8 | lightgbm | prior_high_watch_bucket | (-0.001, 1.127] | 1175 | 0.0551 | 0.6319 | 368.0732 |
| 9 | lightgbm | prior_high_watch_bucket | (1.127, 1.414] | 1174 | 0.1259 | 0.9780 | 386.3526 |
| 10 | lightgbm | prior_high_watch_bucket | (1.414, 1.708] | 1174 | -0.0251 | 0.8764 | 384.5980 |
| 11 | lightgbm | prior_high_watch_bucket | (1.708, 2.735] | 1175 | -0.0124 | 0.4720 | 389.3123 |
| 12 | linear_ridge | prior_high_watch_bucket | (-0.001, 1.127] | 1175 | 2.3254 | 2.4400 | 368.0732 |
| 13 | linear_ridge | prior_high_watch_bucket | (1.127, 1.414] | 1174 | 2.3538 | 2.3454 | 386.3526 |
| 14 | linear_ridge | prior_high_watch_bucket | (1.414, 1.708] | 1174 | 2.3703 | 2.4434 | 384.5980 |
| 15 | linear_ridge | prior_high_watch_bucket | (1.708, 2.735] | 1175 | 2.2658 | 2.4350 | 389.3123 |
The segment table shows where the outcome model predicts stronger or weaker effects. These are exploratory heterogeneity diagnostics, useful for interpretation but not a replacement for the primary average treatment effect.
User-Cluster Bootstrap for G-Computation
The analysis has repeated user-days, so bootstrap uncertainty should resample users rather than individual rows. This cell refits each outcome model on user-cluster bootstrap samples and recomputes the g-computation estimate.
The bootstrap is intentionally moderate in size to keep the notebook runnable while still giving a useful uncertainty check.
def fit_predict_gcomp_effect(model, data, feature_cols, outcome_col):
X = data[feature_cols].copy()
y = data[outcome_col].astype(float)
fitted_model = clone(model).fit(X, y)
pred_y1 = fitted_model.predict(make_counterfactual_features(X, 1))
pred_y0 = fitted_model.predict(make_counterfactual_features(X, 0))
return float(np.mean(pred_y1 - pred_y0))
N_BOOTSTRAP = 150
rng = np.random.default_rng(RANDOM_STATE)
user_ids = panel["user_id"].unique()
user_groups = {user_id: group.copy() for user_id, group in panel.groupby("user_id", sort=False)}
bootstrap_rows = []
for model_name, model in outcome_models.items():
for bootstrap_id in range(N_BOOTSTRAP):
sampled_users = rng.choice(user_ids, size=len(user_ids), replace=True)
bootstrap_sample = pd.concat(
[user_groups[user_id] for user_id in sampled_users],
ignore_index=True,
)
effect = fit_predict_gcomp_effect(
model,
bootstrap_sample,
outcome_features,
PRIMARY_OUTCOME,
)
bootstrap_rows.append(
{
"model": model_name,
"bootstrap_id": bootstrap_id,
"ate": effect,
}
)
gcomp_bootstrap = pd.DataFrame(bootstrap_rows)
gcomp_bootstrap_summary = (
gcomp_bootstrap.groupby("model")
.agg(
bootstrap_reps=("ate", "size"),
bootstrap_mean=("ate", "mean"),
bootstrap_std=("ate", "std"),
ci_95_lower=("ate", lambda values: values.quantile(0.025)),
ci_95_upper=("ate", lambda values: values.quantile(0.975)),
)
.reset_index()
)
display(gcomp_bootstrap_summary)
| | model | bootstrap_reps | bootstrap_mean | bootstrap_std | ci_95_lower | ci_95_upper |
|---|---|---|---|---|---|---|
| 0 | lightgbm | 150 | 0.1756 | 0.6123 | -0.6384 | 1.8656 |
| 1 | linear_ridge | 150 | 1.3476 | 4.3325 | -6.3238 | 9.3565 |
The bootstrap summary adds uncertainty to the g-computation estimates while respecting user clustering. Wide intervals mean the outcome-model estimate should be treated cautiously even if the point estimate looks meaningful.
Plot G-Computation Bootstrap Distributions
The bootstrap distributions show how much each model’s estimated effect changes when users are resampled. This gives a visual uncertainty check alongside the numeric confidence intervals.
fig, ax = plt.subplots(figsize=(9, 5))
sns.histplot(
data=gcomp_bootstrap,
x="ate",
hue="model",
bins=35,
stat="density",
common_norm=False,
element="step",
fill=False,
ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("User-Cluster Bootstrap Distribution for G-Computation ATE")
ax.set_xlabel("Estimated effect on future 7-day interactions")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
The bootstrap plot shows whether the g-computation effect is consistently positive, consistently negative, or uncertain around zero. This will be important when comparing against the MSM result.
Compare G-Computation with the MSM Estimate
This cell loads the primary MSM effect from Notebook 04 and places it next to the g-computation estimates. This is the main cross-method comparison for the project.
msm_primary = pd.read_csv(MSM_PRIMARY_PATH)
msm_lookup = dict(zip(msm_primary["quantity"], msm_primary["value"]))
msm_comparison_row = {
"method": "MSM: weighted calendar-adjusted",
"model": "propensity_weighted_msm",
"estimate": msm_lookup["estimated_incremental_future_7day_interactions"],
"ci_95_lower": msm_lookup["bootstrap_ci_lower"],
"ci_95_upper": msm_lookup["bootstrap_ci_upper"],
"source": "Notebook 04 user-cluster bootstrap",
}
gcomp_comparison_rows = []
for _, row in gcomp_results.iterrows():
boot_row = gcomp_bootstrap_summary.query("model == @row.model").iloc[0]
gcomp_comparison_rows.append(
{
"method": "G-computation",
"model": row["model"],
"estimate": row["ate"],
"ci_95_lower": boot_row["ci_95_lower"],
"ci_95_upper": boot_row["ci_95_upper"],
"source": "Notebook 05 user-cluster bootstrap",
}
)
estimator_comparison = pd.DataFrame([msm_comparison_row, *gcomp_comparison_rows])
estimator_comparison["relative_lift_vs_msm_control_mean"] = (
estimator_comparison["estimate"] / msm_lookup["weighted_control_mean_future_7day_interactions"]
)
display(estimator_comparison)
| | method | model | estimate | ci_95_lower | ci_95_upper | source | relative_lift_vs_msm_control_mean |
|---|---|---|---|---|---|---|---|
| 0 | MSM: weighted calendar-adjusted | propensity_weighted_msm | -2.6877 | -12.2706 | 5.7291 | Notebook 04 user-cluster bootstrap | -0.0072 |
| 1 | G-computation | lightgbm | 0.0359 | -0.6384 | 1.8656 | Notebook 05 user-cluster bootstrap | 0.0001 |
| 2 | G-computation | linear_ridge | 2.3288 | -6.3238 | 9.3565 | Notebook 05 user-cluster bootstrap | 0.0062 |
The comparison table is the key conclusion check. If MSM and g-computation estimates are close and uncertain around zero, that supports an honest conclusion: the observed data does not show a clear long-term interaction-volume effect from high-watch exposure.
Plot Cross-Estimator Comparison
This plot compares the MSM estimate and the g-computation estimates on the same outcome scale. The intervals come from user-cluster bootstrap procedures in Notebooks 04 and 05.
plot_comparison = estimator_comparison.copy()
plot_comparison["label"] = plot_comparison["method"] + "\n" + plot_comparison["model"]
plot_comparison["lower_error"] = plot_comparison["estimate"] - plot_comparison["ci_95_lower"]
plot_comparison["upper_error"] = plot_comparison["ci_95_upper"] - plot_comparison["estimate"]
fig, ax = plt.subplots(figsize=(9, 5.5))
ax.errorbar(
x=np.arange(len(plot_comparison)),
y=plot_comparison["estimate"],
yerr=[plot_comparison["lower_error"], plot_comparison["upper_error"]],
fmt="o",
color="#2A6F97",
ecolor="black",
capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(plot_comparison)))
ax.set_xticklabels(plot_comparison["label"], rotation=20, ha="right")
ax.set_title("MSM Versus G-Computation Estimates")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Estimator")
plt.tight_layout()
plt.show()
The plot makes estimator agreement or disagreement easy to see. Cross-method consistency is more persuasive than any single estimate, especially in an observational sequential recommender log.
Secondary Outcome G-Computation
The primary outcome is future 7-day interaction volume. This cell repeats g-computation with the LightGBM outcome model for secondary outcomes, so we can see whether the story changes for active days or watch hours.
secondary_outcomes = [
"future_7day_active_days",
"future_7day_avg_daily_interactions",
"future_7day_play_hours",
"outcome_log1p",
]
secondary_rows = []
secondary_model = outcome_models["lightgbm"]
for outcome_col in secondary_outcomes:
scored_secondary, metrics_secondary = cross_fitted_gcomp(
secondary_model,
panel,
outcome_features,
outcome_col,
groups,
model_name=f"lightgbm_{outcome_col}",
n_splits=5,
)
secondary_rows.append(
{
"outcome": outcome_col,
"model": "lightgbm",
"ate": scored_secondary["cate_hat"].mean(),
"mean_y1_hat": scored_secondary["pred_y1"].mean(),
"mean_y0_hat": scored_secondary["pred_y0"].mean(),
"relative_lift_vs_y0": scored_secondary["cate_hat"].mean() / scored_secondary["pred_y0"].mean(),
"rmse_mean": metrics_secondary["rmse"].mean(),
"mae_mean": metrics_secondary["mae"].mean(),
"r2_mean": metrics_secondary["r2"].mean(),
}
)
secondary_gcomp_results = pd.DataFrame(secondary_rows)
display(secondary_gcomp_results)
| | outcome | model | ate | mean_y1_hat | mean_y0_hat | relative_lift_vs_y0 | rmse_mean | mae_mean | r2_mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | future_7day_active_days | lightgbm | -0.0176 | 6.8424 | 6.8601 | -0.0026 | 0.5294 | 0.2543 | -0.0250 |
| 1 | future_7day_avg_daily_interactions | lightgbm | 0.0051 | 54.6200 | 54.6149 | 0.0001 | 10.2085 | 7.4032 | 0.7905 |
| 2 | future_7day_play_hours | lightgbm | 0.0396 | 0.9374 | 0.8978 | 0.0441 | 0.2349 | 0.1654 | 0.6964 |
| 3 | outcome_log1p | lightgbm | 0.0003 | 5.8347 | 5.8344 | 0.0001 | 0.2169 | 0.1426 | 0.8291 |
The secondary outcome table checks whether the g-computation conclusion is specific to interaction volume or appears in other long-term behavior measures. Differences across outcomes should be reported as part of the product interpretation.
Save G-Computation Results
This cell saves the main g-computation estimates, model diagnostics, counterfactual predictions, bootstrap results, estimator comparison, segment summaries, and secondary outcome estimates.
gcomp_results_path = PROCESSED_DIR / "kuairec_long_term_gcomp_results.csv"
gcomp_metrics_path = PROCESSED_DIR / "kuairec_long_term_gcomp_model_metrics.csv"
gcomp_counterfactuals_path = PROCESSED_DIR / "kuairec_long_term_gcomp_counterfactuals.parquet"
gcomp_bootstrap_path = PROCESSED_DIR / "kuairec_long_term_gcomp_bootstrap.csv"
gcomp_bootstrap_summary_path = PROCESSED_DIR / "kuairec_long_term_gcomp_bootstrap_summary.csv"
estimator_comparison_path = PROCESSED_DIR / "kuairec_long_term_estimator_comparison.csv"
segment_gcomp_path = PROCESSED_DIR / "kuairec_long_term_gcomp_segment_effects.csv"
secondary_gcomp_path = PROCESSED_DIR / "kuairec_long_term_gcomp_secondary_outcomes.csv"
# Keep one row per user-day/model with the counterfactual predictions needed for later reporting.
gcomp_counterfactuals = gcomp_scored[[
"user_id",
"event_date",
"model",
PRIMARY_TREATMENT,
PRIMARY_OUTCOME,
"pred_observed",
"pred_y1",
"pred_y0",
"cate_hat",
"residual",
]].copy()
gcomp_results.to_csv(gcomp_results_path, index=False)
outcome_model_metrics.to_csv(gcomp_metrics_path, index=False)
gcomp_counterfactuals.to_parquet(gcomp_counterfactuals_path, index=False)
gcomp_bootstrap.to_csv(gcomp_bootstrap_path, index=False)
gcomp_bootstrap_summary.to_csv(gcomp_bootstrap_summary_path, index=False)
estimator_comparison.to_csv(estimator_comparison_path, index=False)
segment_gcomp_effects.to_csv(segment_gcomp_path, index=False)
secondary_gcomp_results.to_csv(secondary_gcomp_path, index=False)
print("Saved Notebook 05 artifacts:")
print(f"- {gcomp_results_path}")
print(f"- {gcomp_metrics_path}")
print(f"- {gcomp_counterfactuals_path}")
print(f"- {gcomp_bootstrap_path}")
print(f"- {gcomp_bootstrap_summary_path}")
print(f"- {estimator_comparison_path}")
print(f"- {segment_gcomp_path}")
print(f"- {secondary_gcomp_path}")Saved Notebook 05 artifacts:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_results.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_model_metrics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_counterfactuals.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_bootstrap.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_bootstrap_summary.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimator_comparison.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_segment_effects.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_secondary_outcomes.csv
The g-computation artifacts are saved for the final project report. The most important files are the estimator comparison table and the counterfactual predictions, because they let the next notebook summarize cross-method agreement and heterogeneity.
Takeaways and Next Step
This notebook estimated the primary long-term effect using g-computation. It modeled future engagement directly, generated counterfactual predictions under treatment and control, summarized average and row-level predicted effects, and compared those estimates with the MSM result from Notebook 04.
The key value of this notebook is cross-method triangulation. If the MSM and g-computation estimates both suggest a small or uncertain effect on future 7-day interactions, then the final project should say that clearly rather than forcing a positive story.
The next notebook should be the final sensitivity and report notebook. It should combine the MSM and g-computation outputs, state the final conclusion, document limitations, and create portfolio-ready tables/figures.