Notebook 06: Doubly Robust Heterogeneous Effects

The project now has two complementary average-effect estimators:

  • a marginal structural model (MSM) with inverse probability weights from Notebook 04;
  • a g-computation outcome model from Notebook 05.

This notebook adds a third strategy: doubly robust estimation with augmented inverse probability weighting (AIPW) scores. AIPW combines the propensity model from Notebook 03 with the outcome models from Notebook 05.

The main question remains the same:

What is the effect of a high-watch-exposure day on future 7-day interaction volume?

The new question in this notebook is:

Even if the average effect is small, are there user-history segments where high-watch exposure appears more helpful or more harmful?

That second question matters in recommender systems. A global average can hide useful personalization signals. A treatment that has no average effect may still help some user states and hurt others.

Why Doubly Robust Estimation Helps

AIPW is useful because it combines two nuisance models:

  • a propensity model: P(treatment = 1 | pre-treatment history);
  • an outcome model: E[outcome | treatment, pre-treatment history].

The AIPW score for each row is:

m1(X) - m0(X)
+ A * (Y - m1(X)) / e(X)
- (1 - A) * (Y - m0(X)) / (1 - e(X))

Where:

  • A is treatment;
  • Y is the observed future outcome;
  • e(X) is the propensity score;
  • m1(X) is the predicted outcome under treatment;
  • m0(X) is the predicted outcome under control.

AIPW is called doubly robust because, under standard assumptions, it can remain consistent if either the propensity model or the outcome model is correctly specified. In practice, it is also a strong diagnostic: if MSM, g-computation, and AIPW all tell a similar story, the conclusion is more credible.
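A tiny synthetic check (separate from the project pipeline, with made-up data) makes the double robustness claim concrete: simulate confounded data with a true effect of exactly 2, then show the AIPW mean stays near 2 when either nuisance model is deliberately wrong.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))             # true propensity depends on x
a = rng.binomial(1, e_true)               # confounded treatment
y = 2.0 * a + x + rng.normal(size=n)      # true effect is exactly 2

# Correct outcome model, deliberately wrong (constant 0.5) propensity model.
m1, m0 = 2.0 + x, x
score_bad_e = (m1 - m0) + a * (y - m1) / 0.5 - (1 - a) * (y - m0) / 0.5

# Correct propensity model, deliberately wrong (all-zero) outcome model.
score_bad_m = a * y / e_true - (1 - a) * y / (1 - e_true)

print(score_bad_e.mean(), score_bad_m.mean())  # both land near 2.0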

Setup

This cell imports the tools used for AIPW scoring, clustered bootstrap uncertainty, segment summaries, and effect-smoothing models. The effect-smoothing model is not treated as a definitive individual treatment effect model; it is used to make noisy AIPW pseudo-effects easier to summarize.

from pathlib import Path
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

from lightgbm import LGBMRegressor
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="X does not have valid feature names")

pd.set_option("display.max_columns", 150)
pd.set_option("display.max_rows", 150)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")
RANDOM_STATE = 42

The notebook environment is ready for doubly robust estimation. The key inputs will be loaded from earlier notebooks, so this notebook focuses on combining nuisance predictions and summarizing heterogeneity.

Locate Project Artifacts

This notebook needs three saved artifacts:

  • the weighted panel with propensity scores from Notebook 03;
  • the counterfactual outcome predictions from Notebook 05;
  • the estimator comparison table from Notebook 05.

Loading these files keeps the project modular and avoids refitting nuisance models unnecessarily.

WEIGHTED_PANEL_RELATIVE_PATH = Path("data/processed/kuairec_long_term_weighted_panel.parquet")
GCOMP_COUNTERFACTUALS_RELATIVE_PATH = Path("data/processed/kuairec_long_term_gcomp_counterfactuals.parquet")
ESTIMATOR_COMPARISON_RELATIVE_PATH = Path("data/processed/kuairec_long_term_estimator_comparison.csv")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / WEIGHTED_PANEL_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {WEIGHTED_PANEL_RELATIVE_PATH}. Run Notebooks 03-05 first or run this notebook inside the project."
    )

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PANEL_PATH = PROJECT_ROOT / WEIGHTED_PANEL_RELATIVE_PATH
COUNTERFACTUALS_PATH = PROJECT_ROOT / GCOMP_COUNTERFACTUALS_RELATIVE_PATH
ESTIMATOR_COMPARISON_PATH = PROJECT_ROOT / ESTIMATOR_COMPARISON_RELATIVE_PATH

print(f"Project root: {PROJECT_ROOT}")
print(f"Weighted panel: {PANEL_PATH}")
print(f"G-computation counterfactuals: {COUNTERFACTUALS_PATH}")
print(f"Estimator comparison: {ESTIMATOR_COMPARISON_PATH}")
print(f"Processed output folder: {PROCESSED_DIR}")
Project root: /home/apex/Documents/ranking_sys
Weighted panel: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weighted_panel.parquet
G-computation counterfactuals: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_gcomp_counterfactuals.parquet
Estimator comparison: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimator_comparison.csv
Processed output folder: /home/apex/Documents/ranking_sys/data/processed

The required artifacts are present. This confirms the notebook can reuse the propensity and outcome nuisance predictions that were already created in a cross-fitted way.

Load Propensity Scores and Outcome Nuisance Predictions

The weighted panel contains the treatment, outcome, propensity scores, and pre-treatment history. The counterfactual table contains pred_y1 and pred_y0 from the linear and LightGBM g-computation models.

panel = pd.read_parquet(PANEL_PATH)
panel["event_date"] = pd.to_datetime(panel["event_date"])
panel = panel.sort_values(["user_id", "event_date"]).reset_index(drop=True)

counterfactuals = pd.read_parquet(COUNTERFACTUALS_PATH)
counterfactuals["event_date"] = pd.to_datetime(counterfactuals["event_date"])

estimator_comparison = pd.read_csv(ESTIMATOR_COMPARISON_PATH)

PRIMARY_TREATMENT = "treatment"
PRIMARY_OUTCOME = "outcome"

print(f"Weighted panel shape: {panel.shape}")
print(f"Counterfactual table shape: {counterfactuals.shape}")
print(f"Outcome nuisance models: {sorted(counterfactuals['model'].unique())}")

display(panel[["user_id", "event_date", PRIMARY_TREATMENT, PRIMARY_OUTCOME, "propensity_logistic", "propensity_lightgbm"]].head())
display(counterfactuals.head())
Weighted panel shape: (4698, 40)
Counterfactual table shape: (9396, 10)
Outcome nuisance models: ['lightgbm', 'linear_ridge']
user_id event_date treatment outcome propensity_logistic propensity_lightgbm
0 14 2020-07-08 1 279.0000 0.5471 0.5266
1 14 2020-07-09 0 311.0000 0.6333 0.5373
2 14 2020-07-10 1 352.0000 0.4140 0.5234
3 14 2020-07-11 1 437.0000 0.5654 0.6503
4 14 2020-07-12 0 437.0000 0.5986 0.7418
user_id event_date model treatment outcome pred_observed pred_y1 pred_y0 cate_hat residual
0 14 2020-07-08 linear_ridge 1 279.0000 519.6987 519.6987 515.8710 3.8276 -240.6987
1 14 2020-07-09 linear_ridge 0 311.0000 497.2763 501.1040 497.2763 3.8276 -186.2763
2 14 2020-07-10 linear_ridge 1 352.0000 520.0026 520.0026 516.1750 3.8276 -168.0026
3 14 2020-07-11 linear_ridge 1 437.0000 495.6651 495.6651 491.8375 3.8276 -58.6651
4 14 2020-07-12 linear_ridge 0 437.0000 497.5665 501.3941 497.5665 3.8276 -60.5665

The loaded tables line up at the user-day level. The counterfactual table has one row per user-day per outcome model, which lets us compute AIPW estimates for multiple propensity/outcome model combinations.

Merge Nuisance Predictions into an AIPW Base Table

The next cell joins propensity scores and pre-treatment covariates onto the counterfactual predictions. This produces the row-level table needed for AIPW scoring.

pre_treatment_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "lag_1_high_watch_share",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_avg_watch_ratio",
    "prior_3day_high_watch_share",
    "calendar_day_index",
    "panel_day_index",
    "day_of_week",
]

panel_for_merge = panel[
    [
        "user_id",
        "event_date",
        "propensity_logistic",
        "propensity_lightgbm",
        "analysis_weight",
        *pre_treatment_covariates,
    ]
].copy()

aipw_base = counterfactuals.merge(
    panel_for_merge,
    on=["user_id", "event_date"],
    how="left",
    validate="many_to_one",
)

required_aipw_columns = [
    PRIMARY_TREATMENT,
    PRIMARY_OUTCOME,
    "pred_y1",
    "pred_y0",
    "propensity_logistic",
    "propensity_lightgbm",
]
missing_rates = (
    aipw_base[required_aipw_columns]
    .isna()
    .mean()
    .rename("missing_rate")
    .reset_index()
    .rename(columns={"index": "column"})
)

display(missing_rates)
display(aipw_base.head())
column missing_rate
0 treatment 0.0000
1 outcome 0.0000
2 pred_y1 0.0000
3 pred_y0 0.0000
4 propensity_logistic 0.0000
5 propensity_lightgbm 0.0000
user_id event_date model treatment outcome pred_observed pred_y1 pred_y0 cate_hat residual propensity_logistic propensity_lightgbm analysis_weight lag_1_active_day lag_1_interactions lag_1_total_play_duration_sec lag_1_avg_watch_ratio lag_1_high_watch_share prior_3day_active_day prior_3day_interactions prior_3day_total_play_duration_sec prior_3day_avg_watch_ratio prior_3day_high_watch_share calendar_day_index panel_day_index day_of_week
0 14 2020-07-08 linear_ridge 1 279.0000 519.6987 519.6987 515.8710 3.8276 -240.6987 0.5471 0.5266 0.9131 1.0000 78.0000 655.4890 0.8415 0.4615 3.0000 127.0000 1,144.8080 2.9900 1.5602 3 3 Wednesday
1 14 2020-07-09 linear_ridge 0 311.0000 497.2763 501.1040 497.2763 3.8276 -186.2763 0.6333 0.5373 1.3647 1.0000 22.0000 201.9010 0.9828 0.5000 3.0000 123.0000 1,105.7340 2.8883 1.4833 4 4 Thursday
2 14 2020-07-10 linear_ridge 1 352.0000 520.0026 520.0026 516.1750 3.8276 -168.0026 0.4140 0.5234 1.2067 1.0000 55.0000 485.0390 0.8619 0.3636 3.0000 155.0000 1,342.4290 2.6862 1.3252 5 5 Friday
3 14 2020-07-11 linear_ridge 1 437.0000 495.6651 495.6651 491.8375 3.8276 -58.6651 0.5654 0.6503 0.8836 1.0000 52.0000 606.2440 1.1380 0.5769 3.0000 129.0000 1,293.1840 2.9827 1.4406 6 6 Saturday
4 14 2020-07-12 linear_ridge 0 437.0000 497.5665 501.3941 497.5665 3.8276 -60.5665 0.5986 0.7418 1.2466 1.0000 32.0000 284.7470 0.9337 0.5000 3.0000 139.0000 1,376.0300 2.9336 1.4406 7 7 Sunday

The merged table has the ingredients for doubly robust scores: observed treatment, observed outcome, propensity scores, and predicted potential outcomes. Missingness should be zero for the required columns before any AIPW estimate is trusted.
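A fail-fast guard (a hypothetical addition, not in the original run) can enforce that requirement before any scoring happens:

# Hypothetical guard: stop before AIPW scoring if any required input is missing.
missing_counts = aipw_base[required_aipw_columns].isna().sum()
if missing_counts.any():
    raise ValueError(f"AIPW inputs contain missing values:\n{missing_counts[missing_counts > 0]}")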

Define the AIPW Score

This helper computes the AIPW score for one propensity model and one outcome model. Propensities are clipped away from 0 and 1 to prevent a small number of near-impossible treatment assignments from dominating the residual correction.

def add_aipw_score(data, propensity_col, score_col, eps=0.01):
    result = data.copy()
    treatment = result[PRIMARY_TREATMENT].astype(float)
    outcome = result[PRIMARY_OUTCOME].astype(float)
    e_hat = result[propensity_col].astype(float).clip(eps, 1 - eps)
    m1_hat = result["pred_y1"].astype(float)
    m0_hat = result["pred_y0"].astype(float)

    result[f"{score_col}_e_clipped"] = e_hat
    result[score_col] = (
        (m1_hat - m0_hat)
        + treatment * (outcome - m1_hat) / e_hat
        - (1 - treatment) * (outcome - m0_hat) / (1 - e_hat)
    )
    return result

print("Defined AIPW score helper.")
Defined AIPW score helper.

The helper implements the doubly robust score formula. The score combines the model-implied treatment contrast with residual corrections from the observed treated and control rows.
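A quick hand-check with made-up numbers confirms the algebra: for a treated row the control-residual term drops out, leaving (m1 - m0) + (Y - m1) / e.

# One made-up treated row: (280 - 270) + (300 - 280) / 0.5 = 10 + 40 = 50.
demo = pd.DataFrame(
    {"treatment": [1], "outcome": [300.0], "pred_y1": [280.0], "pred_y0": [270.0], "e": [0.5]}
)
print(add_aipw_score(demo, "e", "demo_score")["demo_score"].iloc[0])  # 50.0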

Compute AIPW Scores for Nuisance Model Combinations

This cell computes AIPW scores for four combinations:

  • logistic propensity + linear outcome;
  • logistic propensity + LightGBM outcome;
  • LightGBM propensity + linear outcome;
  • LightGBM propensity + LightGBM outcome.

The primary doubly robust specification is logistic propensity plus LightGBM outcome, because Notebook 03 showed logistic propensities were stable and Notebook 05 showed LightGBM had better outcome prediction.

propensity_specs = {
    "logistic_propensity": "propensity_logistic",
    "lightgbm_propensity": "propensity_lightgbm",
}

score_frames = []
for outcome_model in sorted(aipw_base["model"].unique()):
    model_data = aipw_base.query("model == @outcome_model").copy()
    for propensity_name, propensity_col in propensity_specs.items():
        score_name = f"aipw_{propensity_name}_{outcome_model}_outcome"
        scored = add_aipw_score(model_data, propensity_col, score_name, eps=0.01)
        scored["propensity_model"] = propensity_name
        scored["outcome_model"] = outcome_model
        scored["aipw_score"] = scored[score_name]
        scored["aipw_score_name"] = score_name
        score_frames.append(scored)

aipw_scores = pd.concat(score_frames, ignore_index=True)
PRIMARY_AIPW_SCORE_NAME = "aipw_logistic_propensity_lightgbm_outcome"

print(f"AIPW score rows: {len(aipw_scores):,}")
print(f"Primary AIPW score: {PRIMARY_AIPW_SCORE_NAME}")
display(aipw_scores[["user_id", "event_date", "propensity_model", "outcome_model", "aipw_score"]].head())
AIPW score rows: 18,792
Primary AIPW score: aipw_logistic_propensity_lightgbm_outcome
user_id event_date propensity_model outcome_model aipw_score
0 14 2020-07-08 logistic_propensity lightgbm -182.3750
1 14 2020-07-09 logistic_propensity lightgbm 245.6311
2 14 2020-07-10 logistic_propensity lightgbm -144.5768
3 14 2020-07-11 logistic_propensity lightgbm 7.5475
4 14 2020-07-12 logistic_propensity lightgbm 25.5266

The AIPW table now contains one score per user-day for each nuisance-model combination. The average of each score column is the corresponding doubly robust estimate.
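As a concrete usage note, the doubly robust point estimate falls out of this table directly; a one-line check (illustrative) for the primary specification:

# The point estimate for any combination is just the mean of its score column.
primary_mask = aipw_scores["aipw_score_name"] == PRIMARY_AIPW_SCORE_NAME
print(f"Primary AIPW ATE: {aipw_scores.loc[primary_mask, 'aipw_score'].mean():.4f}")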

Summarize Doubly Robust Average Effects

This table averages the AIPW scores by nuisance model combination. It also reports the model-implied counterfactual means and the distribution of row-level scores.

aipw_results = (
    aipw_scores.groupby(["propensity_model", "outcome_model", "aipw_score_name"])
    .agg(
        n_rows=("aipw_score", "size"),
        n_users=("user_id", "nunique"),
        mean_y1_hat=("pred_y1", "mean"),
        mean_y0_hat=("pred_y0", "mean"),
        gcomp_component=("cate_hat", "mean"),
        aipw_ate=("aipw_score", "mean"),
        aipw_score_std=("aipw_score", "std"),
        aipw_p05=("aipw_score", lambda values: values.quantile(0.05)),
        aipw_median=("aipw_score", "median"),
        aipw_p95=("aipw_score", lambda values: values.quantile(0.95)),
        share_positive_score=("aipw_score", lambda values: (values > 0).mean()),
    )
    .reset_index()
)

aipw_results["relative_lift_vs_y0"] = aipw_results["aipw_ate"] / aipw_results["mean_y0_hat"]
aipw_results["is_primary_aipw"] = aipw_results["aipw_score_name"].eq(PRIMARY_AIPW_SCORE_NAME)

display(aipw_results)
propensity_model outcome_model aipw_score_name n_rows n_users mean_y1_hat mean_y0_hat gcomp_component aipw_ate aipw_score_std aipw_p05 aipw_median aipw_p95 share_positive_score relative_lift_vs_y0 is_primary_aipw
0 lightgbm_propensity lightgbm aipw_lightgbm_propensity_lightgbm_outcome 4698 91 382.3398 382.3040 0.0359 1.9929 210.8940 -230.0220 2.0653 218.2429 0.5109 0.0052 False
1 lightgbm_propensity linear_ridge aipw_lightgbm_propensity_linear_ridge_outcome 4698 91 383.2312 380.9024 2.3288 0.0236 325.4178 -353.3088 6.5009 329.7477 0.5153 0.0001 False
2 logistic_propensity lightgbm aipw_logistic_propensity_lightgbm_outcome 4698 91 382.3398 382.3040 0.0359 1.0940 325.5438 -227.4167 1.9645 214.8106 0.5109 0.0029 True
3 logistic_propensity linear_ridge aipw_logistic_propensity_linear_ridge_outcome 4698 91 383.2312 380.9024 2.3288 -4.2621 462.7484 -351.1020 6.3363 322.5791 0.5151 -0.0112 False

The AIPW estimates add a third view of the average effect. Agreement across nuisance combinations would suggest the result is not driven by one particular propensity or outcome model.
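A quick agreement check (illustrative) is the spread of point estimates across the four combinations:

# Range of AIPW point estimates across nuisance-model combinations.
estimates = aipw_results.set_index("aipw_score_name")["aipw_ate"]
print(f"AIPW estimate range: {estimates.min():.2f} to {estimates.max():.2f}")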

User-Cluster Bootstrap for AIPW Estimates

Each user contributes multiple days, so uncertainty should be clustered by user. This bootstrap resamples users and recomputes the average AIPW score for each nuisance combination. The nuisance predictions are kept fixed, so this is a practical uncertainty check for the score average rather than a full nuisance-refitting bootstrap.

def cluster_bootstrap_score_mean(data, score_col="aipw_score", group_col="user_id", n_bootstrap=400, random_state=RANDOM_STATE):
    rng = np.random.default_rng(random_state)
    user_ids = data[group_col].unique()
    user_groups = {user_id: group for user_id, group in data.groupby(group_col, sort=False)}
    rows = []
    for bootstrap_id in range(n_bootstrap):
        sampled_users = rng.choice(user_ids, size=len(user_ids), replace=True)
        sample = pd.concat([user_groups[user_id] for user_id in sampled_users], ignore_index=True)
        rows.append({"bootstrap_id": bootstrap_id, "estimate": sample[score_col].mean()})
    return pd.DataFrame(rows)

N_AIPW_BOOTSTRAP = 400
bootstrap_frames = []
for (propensity_model, outcome_model, score_name), group in aipw_scores.groupby(
    ["propensity_model", "outcome_model", "aipw_score_name"]
):
    boot = cluster_bootstrap_score_mean(group, n_bootstrap=N_AIPW_BOOTSTRAP)
    boot["propensity_model"] = propensity_model
    boot["outcome_model"] = outcome_model
    boot["aipw_score_name"] = score_name
    bootstrap_frames.append(boot)

aipw_bootstrap = pd.concat(bootstrap_frames, ignore_index=True)

aipw_bootstrap_summary = (
    aipw_bootstrap.groupby(["propensity_model", "outcome_model", "aipw_score_name"])
    .agg(
        bootstrap_reps=("estimate", "size"),
        bootstrap_mean=("estimate", "mean"),
        bootstrap_std=("estimate", "std"),
        ci_95_lower=("estimate", lambda values: values.quantile(0.025)),
        ci_95_upper=("estimate", lambda values: values.quantile(0.975)),
    )
    .reset_index()
)

aipw_results = aipw_results.merge(
    aipw_bootstrap_summary,
    on=["propensity_model", "outcome_model", "aipw_score_name"],
    how="left",
)

display(aipw_results)
propensity_model outcome_model aipw_score_name n_rows n_users mean_y1_hat mean_y0_hat gcomp_component aipw_ate aipw_score_std aipw_p05 aipw_median aipw_p95 share_positive_score relative_lift_vs_y0 is_primary_aipw bootstrap_reps bootstrap_mean bootstrap_std ci_95_lower ci_95_upper
0 lightgbm_propensity lightgbm aipw_lightgbm_propensity_lightgbm_outcome 4698 91 382.3398 382.3040 0.0359 1.9929 210.8940 -230.0220 2.0653 218.2429 0.5109 0.0052 False 400 1.8301 2.8635 -3.4280 7.5135
1 lightgbm_propensity linear_ridge aipw_lightgbm_propensity_linear_ridge_outcome 4698 91 383.2312 380.9024 2.3288 0.0236 325.4178 -353.3088 6.5009 329.7477 0.5153 0.0001 False 400 -0.3441 4.8857 -9.9116 8.8830
2 logistic_propensity lightgbm aipw_logistic_propensity_lightgbm_outcome 4698 91 382.3398 382.3040 0.0359 1.0940 325.5438 -227.4167 1.9645 214.8106 0.5109 0.0029 True 400 1.1991 5.0298 -7.3428 12.0178
3 logistic_propensity linear_ridge aipw_logistic_propensity_linear_ridge_outcome 4698 91 383.2312 380.9024 2.3288 -4.2621 462.7484 -351.1020 6.3363 322.5791 0.5151 -0.0112 False 400 -4.6326 7.2876 -21.6843 8.2386

The bootstrap intervals quantify the uncertainty around the doubly robust estimates. If an interval crosses zero, the corresponding estimate should be reported as uncertain even if the point estimate has a direction.
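One convenient way (illustrative, not saved as an artifact) to operationalize that reporting rule is to flag intervals that include zero:

# Flag nuisance combinations whose 95% bootstrap interval includes zero.
ci_view = aipw_results[["aipw_score_name", "aipw_ate", "ci_95_lower", "ci_95_upper"]].copy()
ci_view["ci_includes_zero"] = (ci_view["ci_95_lower"] <= 0) & (ci_view["ci_95_upper"] >= 0)
display(ci_view)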

Plot AIPW Estimates Across Nuisance Combinations

This plot shows the doubly robust estimates and bootstrap intervals. It helps diagnose whether one nuisance combination is an outlier.

aipw_plot = aipw_results.copy()
aipw_plot["label"] = aipw_plot["propensity_model"] + "\n" + aipw_plot["outcome_model"]
aipw_plot["lower_error"] = aipw_plot["aipw_ate"] - aipw_plot["ci_95_lower"]
aipw_plot["upper_error"] = aipw_plot["ci_95_upper"] - aipw_plot["aipw_ate"]

fig, ax = plt.subplots(figsize=(9, 5))
ax.errorbar(
    x=np.arange(len(aipw_plot)),
    y=aipw_plot["aipw_ate"],
    yerr=[aipw_plot["lower_error"], aipw_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(aipw_plot)))
ax.set_xticklabels(aipw_plot["label"], rotation=20, ha="right")
ax.set_title("AIPW Estimates Across Nuisance Model Combinations")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Propensity model / outcome model")
plt.tight_layout()
plt.show()

The plot makes cross-model stability easy to assess. If all AIPW intervals overlap heavily around zero, that reinforces the earlier MSM and g-computation conclusion that the average effect is small and uncertain.

Compare MSM, G-Computation, and AIPW

This cell appends the primary AIPW result to the existing estimator comparison table. This creates the main cross-estimator summary for the final report notebook.

primary_aipw = aipw_results.query("is_primary_aipw").iloc[0]

# Use the MSM comparison table's relative-lift denominator if available; otherwise use the AIPW m0 mean.
comparison_denominator = primary_aipw["mean_y0_hat"]
if "relative_lift_vs_msm_control_mean" in estimator_comparison.columns:
    msm_row = estimator_comparison.query("method.str.startswith('MSM')", engine="python").iloc[0]
    if pd.notna(msm_row["relative_lift_vs_msm_control_mean"]) and msm_row["relative_lift_vs_msm_control_mean"] != 0:
        comparison_denominator = msm_row["estimate"] / msm_row["relative_lift_vs_msm_control_mean"]

primary_aipw_row = pd.DataFrame(
    [
        {
            "method": "Doubly robust AIPW",
            "model": "logistic propensity + LightGBM outcome",
            "estimate": primary_aipw["aipw_ate"],
            "ci_95_lower": primary_aipw["ci_95_lower"],
            "ci_95_upper": primary_aipw["ci_95_upper"],
            "source": "Notebook 06 user-cluster bootstrap over AIPW scores",
            "relative_lift_vs_msm_control_mean": primary_aipw["aipw_ate"] / comparison_denominator,
        }
    ]
)

estimator_comparison_with_aipw = pd.concat(
    [estimator_comparison, primary_aipw_row],
    ignore_index=True,
)

display(estimator_comparison_with_aipw)
method model estimate ci_95_lower ci_95_upper source relative_lift_vs_msm_control_mean
0 MSM: weighted calendar-adjusted propensity_weighted_msm -2.6877 -12.2706 5.7291 Notebook 04 user-cluster bootstrap -0.0072
1 G-computation lightgbm 0.0359 -0.6384 1.8656 Notebook 05 user-cluster bootstrap 0.0001
2 G-computation linear_ridge 2.3288 -6.3238 9.3565 Notebook 05 user-cluster bootstrap 0.0062
3 Doubly robust AIPW logistic propensity + LightGBM outcome 1.0940 -7.3428 12.0178 Notebook 06 user-cluster bootstrap over AIPW s... 0.0029

The combined table now includes all three causal strategies. This is the most important result table for the final report: it shows whether weighting, outcome modeling, and doubly robust estimation tell a consistent story.

Plot Cross-Estimator Comparison Including AIPW

This plot places MSM, g-computation, and AIPW estimates on the same scale. The intervals come from the user-cluster bootstrap outputs in the relevant notebooks.

comparison_plot = estimator_comparison_with_aipw.copy()
comparison_plot["label"] = comparison_plot["method"] + "\n" + comparison_plot["model"]
comparison_plot["lower_error"] = comparison_plot["estimate"] - comparison_plot["ci_95_lower"]
comparison_plot["upper_error"] = comparison_plot["ci_95_upper"] - comparison_plot["estimate"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(comparison_plot)),
    y=comparison_plot["estimate"],
    yerr=[comparison_plot["lower_error"], comparison_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(comparison_plot)))
ax.set_xticklabels(comparison_plot["label"], rotation=20, ha="right")
ax.set_title("Long-Term Effect Estimates Across Causal Strategies")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Estimator")
plt.tight_layout()
plt.show()

The cross-estimator plot is a clean final-modeling diagnostic. Agreement near zero is not a failed project; it is evidence that the data does not support a clear average long-term lift from this treatment definition.

Create User-History Segments for Heterogeneity

The average effect may hide differences across user states. This cell creates readable segments based on recent interaction volume, recent high-watch share, recent watch ratio, and day of week.

segment_data = panel[
    [
        "user_id",
        "event_date",
        "day_of_week",
        "prior_3day_interactions",
        "prior_3day_high_watch_share",
        "prior_3day_avg_watch_ratio",
        "lag_1_interactions",
    ]
].copy()

segment_data["prior_interactions_bucket"] = pd.qcut(
    segment_data["prior_3day_interactions"],
    q=4,
    duplicates="drop",
).astype(str)
segment_data["prior_high_watch_bucket"] = pd.qcut(
    segment_data["prior_3day_high_watch_share"],
    q=4,
    duplicates="drop",
).astype(str)
segment_data["prior_watch_ratio_bucket"] = pd.qcut(
    segment_data["prior_3day_avg_watch_ratio"],
    q=4,
    duplicates="drop",
).astype(str)
segment_data["lag_interactions_bucket"] = pd.qcut(
    segment_data["lag_1_interactions"],
    q=4,
    duplicates="drop",
).astype(str)

segment_columns = [
    "prior_interactions_bucket",
    "prior_high_watch_bucket",
    "prior_watch_ratio_bucket",
    "lag_interactions_bucket",
    "day_of_week",
]

primary_scores = aipw_scores.query("aipw_score_name == @PRIMARY_AIPW_SCORE_NAME").copy()
primary_scores = primary_scores.merge(
    segment_data[["user_id", "event_date", *[col for col in segment_columns if col != "day_of_week"]]],
    on=["user_id", "event_date"],
    how="left",
    validate="one_to_one",
)

print("Created segment columns:")
for segment_col in segment_columns:
    print(f"- {segment_col}: {primary_scores[segment_col].nunique()} levels")
Created segment columns:
- prior_interactions_bucket: 4 levels
- prior_high_watch_bucket: 4 levels
- prior_watch_ratio_bucket: 4 levels
- lag_interactions_bucket: 4 levels
- day_of_week: 7 levels

The segment columns translate continuous user histories into product-readable groups. This lets us ask whether the doubly robust score is systematically higher or lower for different recent engagement states.

Estimate Segment-Level Doubly Robust Effects

Segment-level effects are computed by averaging the primary AIPW score within each segment. This is a descriptive heterogeneity view, not a new randomized experiment. Minimum row and user counts are included so tiny segments do not receive too much attention.

def summarize_segment_effects(data, segment_col, score_col="aipw_score", min_rows=100, min_users=10):
    summary = (
        data.groupby(segment_col, observed=True)
        .agg(
            rows=(score_col, "size"),
            users=("user_id", "nunique"),
            treatment_rate=(PRIMARY_TREATMENT, "mean"),
            observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
            aipw_ate=(score_col, "mean"),
            aipw_score_std=(score_col, "std"),
            gcomp_component=("cate_hat", "mean"),
            mean_propensity=("propensity_logistic", "mean"),
        )
        .reset_index()
        .rename(columns={segment_col: "segment"})
    )
    summary.insert(0, "segment_type", segment_col)
    summary["passes_size_filter"] = (summary["rows"] >= min_rows) & (summary["users"] >= min_users)
    return summary

segment_effect_tables = [summarize_segment_effects(primary_scores, segment_col) for segment_col in segment_columns]
segment_effects = pd.concat(segment_effect_tables, ignore_index=True)
segment_effects = segment_effects.sort_values(["segment_type", "aipw_ate"], ascending=[True, False])

display(segment_effects)
segment_type segment rows users treatment_rate observed_outcome_mean aipw_ate aipw_score_std gcomp_component mean_propensity passes_size_filter
21 day_of_week Tuesday 623 91 0.4912 390.4575 20.0321 750.2361 0.0096 0.4917 True
22 day_of_week Wednesday 711 91 0.4684 383.1899 1.8366 168.8665 0.0151 0.4673 True
16 day_of_week Friday 702 91 0.5199 370.7877 0.6318 205.5002 0.0605 0.5201 True
20 day_of_week Thursday 706 91 0.5057 375.7663 -0.8262 221.5878 0.0502 0.5056 True
18 day_of_week Saturday 710 91 0.4831 364.3127 -3.2785 165.9187 0.0524 0.4826 True
19 day_of_week Sunday 621 91 0.5523 400.9678 -3.4574 201.6585 0.0243 0.5517 True
17 day_of_week Monday 625 91 0.4800 393.7184 -6.4503 170.7878 0.0344 0.4809 True
15 lag_interactions_bucket (75.0, 375.0] 1146 91 0.4852 448.0070 16.2174 575.5850 0.0829 0.4860 True
13 lag_interactions_bucket (36.0, 53.0] 1158 91 0.5052 384.2660 -1.3738 193.1010 -0.0205 0.4985 True
12 lag_interactions_bucket (-0.001, 36.0] 1202 91 0.4917 290.3095 -4.3047 166.3770 -0.0194 0.5013 True
14 lag_interactions_bucket (53.0, 75.0] 1192 91 0.5159 409.1242 -5.6043 187.1179 0.1011 0.5115 True
4 prior_high_watch_bucket (-0.001, 1.127] 1175 73 0.1677 368.0732 12.5337 549.1163 0.0551 0.1685 True
5 prior_high_watch_bucket (1.127, 1.414] 1174 84 0.3424 386.3526 2.4564 154.8324 0.1259 0.3521 True
6 prior_high_watch_bucket (1.414, 1.708] 1174 85 0.6235 384.5980 -1.5016 147.9427 -0.0251 0.6104 True
7 prior_high_watch_bucket (1.708, 2.735] 1175 65 0.8647 389.3123 -9.1134 276.4264 -0.0124 0.8668 True
3 prior_interactions_bucket (217.0, 581.0] 1169 91 0.4688 445.8435 25.6170 548.5773 0.1569 0.4787 True
0 prior_interactions_bucket (-0.001, 123.0] 1183 91 0.4920 276.2113 -0.7281 163.7490 -0.0428 0.5010 True
1 prior_interactions_bucket (123.0, 166.0] 1182 91 0.5178 386.9205 -8.0145 233.7360 0.0062 0.5061 True
2 prior_interactions_bucket (166.0, 217.0] 1164 91 0.5198 420.7345 -12.4330 205.3765 0.0244 0.5120 True
8 prior_watch_ratio_bucket (-0.001, 2.26] 1175 75 0.2119 371.7200 12.7319 549.7271 0.1169 0.2241 True
9 prior_watch_ratio_bucket (2.26, 2.59] 1174 89 0.4421 376.9199 4.2636 157.1119 0.1366 0.4355 True
10 prior_watch_ratio_bucket (2.59, 3.004] 1174 83 0.6550 391.3833 0.6755 157.5448 -0.0435 0.6321 True
11 prior_watch_ratio_bucket (3.004, 11.372] 1175 73 0.6894 388.3106 -13.2926 268.3096 -0.0665 0.7063 True

The segment table shows where the AIPW score is more positive or more negative. These patterns are useful for hypothesis generation, but they should be interpreted with uncertainty and support diagnostics.

Bootstrap Segment-Level Effects

This bootstrap resamples users and recomputes segment-level AIPW means. It gives uncertainty intervals for segment effects while respecting repeated user-days.

def bootstrap_segment_effects(data, segment_col, score_col="aipw_score", n_bootstrap=250, random_state=RANDOM_STATE):
    rng = np.random.default_rng(random_state)
    user_ids = data["user_id"].unique()
    user_groups = {user_id: group for user_id, group in data.groupby("user_id", sort=False)}
    rows = []
    for bootstrap_id in range(n_bootstrap):
        sampled_users = rng.choice(user_ids, size=len(user_ids), replace=True)
        sample = pd.concat([user_groups[user_id] for user_id in sampled_users], ignore_index=True)
        boot_summary = sample.groupby(segment_col, observed=True)[score_col].mean().reset_index()
        boot_summary["bootstrap_id"] = bootstrap_id
        rows.append(boot_summary.rename(columns={segment_col: "segment", score_col: "aipw_ate"}))
    result = pd.concat(rows, ignore_index=True)
    result.insert(0, "segment_type", segment_col)
    return result

segment_bootstrap_frames = []
for segment_col in segment_columns:
    segment_bootstrap_frames.append(bootstrap_segment_effects(primary_scores, segment_col))

segment_bootstrap = pd.concat(segment_bootstrap_frames, ignore_index=True)
segment_bootstrap_summary = (
    segment_bootstrap.groupby(["segment_type", "segment"])
    .agg(
        bootstrap_reps=("aipw_ate", "size"),
        bootstrap_mean=("aipw_ate", "mean"),
        bootstrap_std=("aipw_ate", "std"),
        ci_95_lower=("aipw_ate", lambda values: values.quantile(0.025)),
        ci_95_upper=("aipw_ate", lambda values: values.quantile(0.975)),
    )
    .reset_index()
)

segment_effects_with_ci = segment_effects.merge(
    segment_bootstrap_summary,
    on=["segment_type", "segment"],
    how="left",
)

display(segment_effects_with_ci)
segment_type segment rows users treatment_rate observed_outcome_mean aipw_ate aipw_score_std gcomp_component mean_propensity passes_size_filter bootstrap_reps bootstrap_mean bootstrap_std ci_95_lower ci_95_upper
0 day_of_week Tuesday 623 91 0.4912 390.4575 20.0321 750.2361 0.0096 0.4917 True 250 20.4314 32.7418 -24.3380 90.8593
1 day_of_week Wednesday 711 91 0.4684 383.1899 1.8366 168.8665 0.0151 0.4673 True 250 2.1471 7.8816 -12.4635 18.6481
2 day_of_week Friday 702 91 0.5199 370.7877 0.6318 205.5002 0.0605 0.5201 True 250 0.1500 7.1439 -13.5674 13.8208
3 day_of_week Thursday 706 91 0.5057 375.7663 -0.8262 221.5878 0.0502 0.5056 True 250 -1.4476 8.9724 -19.8088 15.7126
4 day_of_week Saturday 710 91 0.4831 364.3127 -3.2785 165.9187 0.0524 0.4826 True 250 -3.6206 5.5464 -13.7823 7.3680
5 day_of_week Sunday 621 91 0.5523 400.9678 -3.4574 201.6585 0.0243 0.5517 True 250 -4.0013 7.8748 -20.5009 9.5415
6 day_of_week Monday 625 91 0.4800 393.7184 -6.4503 170.7878 0.0344 0.4809 True 250 -6.1623 5.8209 -18.4649 4.0985
7 lag_interactions_bucket (75.0, 375.0] 1146 91 0.4852 448.0070 16.2174 575.5850 0.0829 0.4860 True 250 16.0696 18.4124 -9.9926 57.4826
8 lag_interactions_bucket (36.0, 53.0] 1158 91 0.5052 384.2660 -1.3738 193.1010 -0.0205 0.4985 True 250 -1.6926 5.6757 -12.9051 9.4193
9 lag_interactions_bucket (-0.001, 36.0] 1202 91 0.4917 290.3095 -4.3047 166.3770 -0.0194 0.5013 True 250 -4.3555 3.9523 -11.8921 3.5914
10 lag_interactions_bucket (53.0, 75.0] 1192 91 0.5159 409.1242 -5.6043 187.1179 0.1011 0.5115 True 250 -5.6611 5.3270 -15.8618 5.4620
11 prior_high_watch_bucket (-0.001, 1.127] 1175 73 0.1677 368.0732 12.5337 549.1163 0.0551 0.1685 True 250 11.9216 17.3031 -11.3159 52.9259
12 prior_high_watch_bucket (1.127, 1.414] 1174 84 0.3424 386.3526 2.4564 154.8324 0.1259 0.3521 True 250 2.5918 4.6737 -6.3520 12.5274
13 prior_high_watch_bucket (1.414, 1.708] 1174 85 0.6235 384.5980 -1.5016 147.9427 -0.0251 0.6104 True 250 -1.5291 4.6744 -10.6043 7.8901
14 prior_high_watch_bucket (1.708, 2.735] 1175 65 0.8647 389.3123 -9.1134 276.4264 -0.0124 0.8668 True 250 -9.4140 9.0052 -27.9388 7.2418
15 prior_interactions_bucket (217.0, 581.0] 1169 91 0.4688 445.8435 25.6170 548.5773 0.1569 0.4787 True 250 25.7353 17.6855 1.8847 66.3297
16 prior_interactions_bucket (-0.001, 123.0] 1183 91 0.4920 276.2113 -0.7281 163.7490 -0.0428 0.5010 True 250 -1.0646 4.8685 -9.6046 8.7503
17 prior_interactions_bucket (123.0, 166.0] 1182 91 0.5178 386.9205 -8.0145 233.7360 0.0062 0.5061 True 250 -8.5057 6.1431 -20.4152 2.5367
18 prior_interactions_bucket (166.0, 217.0] 1164 91 0.5198 420.7345 -12.4330 205.3765 0.0244 0.5120 True 250 -12.2267 6.6019 -26.7004 0.3019
19 prior_watch_ratio_bucket (-0.001, 2.26] 1175 75 0.2119 371.7200 12.7319 549.7271 0.1169 0.2241 True 250 12.4139 17.3878 -9.3683 52.2110
20 prior_watch_ratio_bucket (2.26, 2.59] 1174 89 0.4421 376.9199 4.2636 157.1119 0.1366 0.4355 True 250 3.9070 5.0899 -5.4912 13.3243
21 prior_watch_ratio_bucket (2.59, 3.004] 1174 83 0.6550 391.3833 0.6755 157.5448 -0.0435 0.6321 True 250 0.8840 5.2983 -9.0288 11.3917
22 prior_watch_ratio_bucket (3.004, 11.372] 1175 73 0.6894 388.3106 -13.2926 268.3096 -0.0665 0.7063 True 250 -13.5921 7.0527 -27.1569 -0.8128

The segment bootstrap intervals show how noisy the heterogeneous effect patterns are. A segment with a large point estimate but a wide interval should be treated as a hypothesis, not an action rule.
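As a screening aid (a hypothesis filter, not an action rule), the segments whose bootstrap intervals exclude zero can be listed directly:

# Segments whose 95% user-cluster bootstrap interval excludes zero.
notable = segment_effects_with_ci.query("ci_95_lower > 0 or ci_95_upper < 0")
display(notable[["segment_type", "segment", "rows", "users", "aipw_ate", "ci_95_lower", "ci_95_upper"]])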

Plot Segment Effects by Recent Interaction Volume

Recent interaction volume is a natural recommender-system state variable. This plot shows whether users with lower or higher recent activity appear to have different doubly robust effects.

interaction_segment_plot = segment_effects_with_ci.query("segment_type == 'prior_interactions_bucket'").copy()
interaction_segment_plot = interaction_segment_plot.sort_values("segment")
interaction_segment_plot["lower_error"] = interaction_segment_plot["aipw_ate"] - interaction_segment_plot["ci_95_lower"]
interaction_segment_plot["upper_error"] = interaction_segment_plot["ci_95_upper"] - interaction_segment_plot["aipw_ate"]

fig, ax = plt.subplots(figsize=(9, 5))
ax.errorbar(
    x=np.arange(len(interaction_segment_plot)),
    y=interaction_segment_plot["aipw_ate"],
    yerr=[interaction_segment_plot["lower_error"], interaction_segment_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(interaction_segment_plot)))
ax.set_xticklabels(interaction_segment_plot["segment"], rotation=20, ha="right")
ax.set_title("AIPW Effects by Prior 3-Day Interaction Bucket")
ax.set_xlabel("Prior 3-day interactions")
ax.set_ylabel("Effect on future 7-day interactions")
plt.tight_layout()
plt.show()

This plot checks whether the average near-zero effect is hiding differences by recent activity level. If intervals remain wide or cross zero across buckets, the heterogeneity story should stay cautious.

Plot Segment Effects by Recent Watch Quality

This plot repeats the segment view for recent high-watch share. It asks whether users with recently satisfying watch behavior respond differently to another high-watch-exposure day.

watch_segment_plot = segment_effects_with_ci.query("segment_type == 'prior_high_watch_bucket'").copy()
watch_segment_plot = watch_segment_plot.sort_values("segment")
watch_segment_plot["lower_error"] = watch_segment_plot["aipw_ate"] - watch_segment_plot["ci_95_lower"]
watch_segment_plot["upper_error"] = watch_segment_plot["ci_95_upper"] - watch_segment_plot["aipw_ate"]

fig, ax = plt.subplots(figsize=(9, 5))
ax.errorbar(
    x=np.arange(len(watch_segment_plot)),
    y=watch_segment_plot["aipw_ate"],
    yerr=[watch_segment_plot["lower_error"], watch_segment_plot["upper_error"]],
    fmt="o",
    color="#5C946E",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(watch_segment_plot)))
ax.set_xticklabels(watch_segment_plot["segment"], rotation=20, ha="right")
ax.set_title("AIPW Effects by Prior High-Watch Share Bucket")
ax.set_xlabel("Prior 3-day high-watch share")
ax.set_ylabel("Effect on future 7-day interactions")
plt.tight_layout()
plt.show()

The watch-quality segment plot connects directly to the treatment definition. It helps answer whether high-watch exposure is more promising for users who were already having high-watch experiences recently.

Smooth Noisy AIPW Scores with a LightGBM Effect Model

AIPW scores are noisy row-level pseudo-outcomes. To summarize heterogeneity more smoothly, this cell fits a LightGBM regressor that predicts the primary AIPW score from pre-treatment covariates only.

This is not a definitive individual treatment effect model. It is a diagnostic tool for identifying which histories appear to moderate the doubly robust score.

effect_numeric_features = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "lag_1_high_watch_share",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_avg_watch_ratio",
    "prior_3day_high_watch_share",
    "calendar_day_index",
    "panel_day_index",
]
effect_categorical_features = ["day_of_week"]
effect_features = effect_numeric_features + effect_categorical_features

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", "passthrough", effect_numeric_features),
        ("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), effect_categorical_features),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
)

effect_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        (
            "model",
            LGBMRegressor(
                n_estimators=120,
                learning_rate=0.04,
                num_leaves=10,
                min_child_samples=60,
                subsample=0.9,
                colsample_bytree=0.9,
                random_state=RANDOM_STATE,
                n_jobs=1,
                verbose=-1,
            ),
        ),
    ]
)

X_effect = primary_scores[effect_features].copy()
y_effect = primary_scores["aipw_score"].astype(float)
groups = primary_scores["user_id"]

group_kfold = GroupKFold(n_splits=5)
effect_pred = np.zeros(len(primary_scores), dtype=float)
effect_metric_rows = []

for fold, (train_idx, test_idx) in enumerate(group_kfold.split(X_effect, y_effect, groups=groups), start=1):
    fold_model = clone(effect_model)
    fold_model.fit(X_effect.iloc[train_idx], y_effect.iloc[train_idx])
    pred = fold_model.predict(X_effect.iloc[test_idx])
    effect_pred[test_idx] = pred
    effect_metric_rows.append(
        {
            "fold": fold,
            "rows": len(test_idx),
            "users": primary_scores.iloc[test_idx]["user_id"].nunique(),
            "rmse": np.sqrt(mean_squared_error(y_effect.iloc[test_idx], pred)),
            "mae": mean_absolute_error(y_effect.iloc[test_idx], pred),
            "r2": r2_score(y_effect.iloc[test_idx], pred),
        }
    )

primary_scores["smoothed_aipw_effect"] = effect_pred
effect_model_metrics = pd.DataFrame(effect_metric_rows)

display(effect_model_metrics)
fold rows users rmse mae r2
0 1 962 19 186.9065 106.7323 -0.0361
1 2 934 18 236.1637 112.2666 0.0117
2 3 934 18 610.4023 132.5208 -0.0014
3 4 934 18 173.2584 108.6869 -0.1550
4 5 934 18 208.2852 99.8468 -0.0115

The effect-smoothing metrics show how predictable the noisy AIPW scores are from pre-treatment history. Modest predictive performance is normal here because AIPW scores include residual noise, but the model can still reveal useful moderator patterns.
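A pooled out-of-fold R² (an illustrative one-number summary, computed from the cross-fitted predictions above) complements the per-fold table:

# Pooled out-of-fold fit across all five GroupKFold test folds.
print(f"Pooled out-of-fold R2: {r2_score(y_effect, effect_pred):.4f}")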

Inspect the Distribution of Smoothed Effects

The smoothed effect distribution is easier to interpret than the raw AIPW score distribution. It shows where the effect model expects more positive or negative long-term impacts based on pre-treatment history.

fig, axes = plt.subplots(1, 2, figsize=(13, 4.8))

sns.histplot(primary_scores["aipw_score"], bins=60, color="#8ECAE6", ax=axes[0])
axes[0].axvline(0, color="black", linewidth=1)
axes[0].set_title("Raw AIPW Score Distribution")
axes[0].set_xlabel("AIPW score")

sns.histplot(primary_scores["smoothed_aipw_effect"], bins=40, color="#5C946E", ax=axes[1])
axes[1].axvline(0, color="black", linewidth=1)
axes[1].set_title("Smoothed Effect Prediction Distribution")
axes[1].set_xlabel("Predicted AIPW effect")

plt.tight_layout()
plt.show()

The raw AIPW scores are much noisier than the smoothed effect predictions. The smoothed distribution is better for describing heterogeneity, while the average AIPW score remains the main doubly robust estimate.

Feature Importance for Effect Moderation

The full effect model is fit on all rows so we can inspect feature importances. These importances describe which pre-treatment variables help predict variation in the AIPW pseudo-effect. They should not be read as causal effects of the covariates themselves.

full_effect_model = clone(effect_model).fit(X_effect, y_effect)
feature_names = full_effect_model.named_steps["preprocessor"].get_feature_names_out()
importances = full_effect_model.named_steps["model"].feature_importances_

effect_importance = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

display(effect_importance.head(15))

fig, ax = plt.subplots(figsize=(9, 5))
plot_importance = effect_importance.head(12).sort_values("importance")
sns.barplot(data=plot_importance, x="importance", y="feature", ax=ax, color="#2A6F97")
ax.set_title("Top Moderators in the AIPW Effect-Smoothing Model")
ax.set_xlabel("LightGBM feature importance")
ax.set_ylabel("Pre-treatment feature")
plt.tight_layout()
plt.show()
feature importance
0 prior_3day_high_watch_share 189
1 lag_1_total_play_duration_sec 150
2 prior_3day_interactions 141
3 lag_1_interactions 132
4 lag_1_avg_watch_ratio 130
5 lag_1_high_watch_share 103
6 prior_3day_total_play_duration_sec 96
7 prior_3day_avg_watch_ratio 78
8 calendar_day_index 43
9 day_of_week_Friday 6
10 day_of_week_Thursday 5
11 day_of_week_Monday 3
12 day_of_week_Wednesday 3
13 panel_day_index 1
14 lag_1_active_day 0

The importance table and plot identify which histories are most useful for predicting heterogeneous doubly robust scores. These variables are natural candidates for final-report discussion and future online experiment stratification.

Smoothed Effects by Predicted Effect Bucket

This cell groups rows by the smoothed effect prediction. It compares predicted effect buckets with observed AIPW scores and treatment rates. This checks whether the effect smoother is separating rows into meaningfully different pseudo-effect groups.

primary_scores["smoothed_effect_bucket"] = pd.qcut(
    primary_scores["smoothed_aipw_effect"],
    q=5,
    duplicates="drop",
).astype(str)

smoothed_bucket_summary = (
    primary_scores.groupby("smoothed_effect_bucket", observed=True)
    .agg(
        rows=("aipw_score", "size"),
        users=("user_id", "nunique"),
        treatment_rate=(PRIMARY_TREATMENT, "mean"),
        observed_outcome_mean=(PRIMARY_OUTCOME, "mean"),
        mean_raw_aipw_score=("aipw_score", "mean"),
        mean_smoothed_effect=("smoothed_aipw_effect", "mean"),
        mean_prior_interactions=("prior_3day_interactions", "mean"),
        mean_prior_high_watch_share=("prior_3day_high_watch_share", "mean"),
    )
    .reset_index()
)

display(smoothed_bucket_summary)
smoothed_effect_bucket rows users treatment_rate observed_outcome_mean mean_raw_aipw_score mean_smoothed_effect mean_prior_interactions mean_prior_high_watch_share
0 (-11.184, -2.468] 939 91 0.4696 374.1214 6.9171 -5.7183 161.5474 1.3829
1 (-2.468, 2.224] 953 91 0.4711 352.1794 -7.3628 -0.1343 151.4974 1.3917
2 (-447.053, -11.184] 940 91 0.5681 393.0787 0.0059 -45.8042 170.6777 1.5720
3 (11.278, 814.396] 940 91 0.4926 394.0340 9.0918 41.2670 201.3787 1.3727
4 (2.224, 11.278] 926 90 0.4968 397.6361 -3.1214 5.9372 180.8920 1.4077

The bucket summary turns the effect model into a readable heterogeneity diagnostic. If higher predicted-effect buckets also have higher raw AIPW scores, the smoother is capturing a real pattern rather than arbitrary noise.
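One caveat when eyeballing this table: the qcut labels are stored as strings and sort lexically, so the buckets above are not displayed in effect order. A small ordered view (illustrative) plus a rank correlation makes the monotonicity check explicit:

# Order buckets by their mean smoothed effect, then summarize agreement
# between smoothed and raw AIPW means with a Spearman rank correlation.
ordered = smoothed_bucket_summary.sort_values("mean_smoothed_effect")
display(ordered[["smoothed_effect_bucket", "mean_smoothed_effect", "mean_raw_aipw_score"]])
print(ordered["mean_smoothed_effect"].corr(ordered["mean_raw_aipw_score"], method="spearman"))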

Save Doubly Robust and Heterogeneity Artifacts

This cell saves the AIPW estimates, bootstrap intervals, row-level scores, segment effects, smoothed effect predictions, and updated cross-estimator comparison table.

aipw_results_path = PROCESSED_DIR / "kuairec_long_term_aipw_results.csv"
aipw_scores_path = PROCESSED_DIR / "kuairec_long_term_aipw_scores.parquet"
aipw_bootstrap_path = PROCESSED_DIR / "kuairec_long_term_aipw_bootstrap.csv"
comparison_with_aipw_path = PROCESSED_DIR / "kuairec_long_term_estimator_comparison_with_aipw.csv"
segment_effects_path = PROCESSED_DIR / "kuairec_long_term_aipw_segment_effects.csv"
segment_bootstrap_path = PROCESSED_DIR / "kuairec_long_term_aipw_segment_bootstrap.csv"
effect_predictions_path = PROCESSED_DIR / "kuairec_long_term_effect_model_predictions.parquet"
effect_importance_path = PROCESSED_DIR / "kuairec_long_term_effect_moderator_importance.csv"
effect_metrics_path = PROCESSED_DIR / "kuairec_long_term_effect_model_metrics.csv"
smoothed_bucket_path = PROCESSED_DIR / "kuairec_long_term_smoothed_effect_buckets.csv"

save_score_cols = [
    "user_id",
    "event_date",
    PRIMARY_TREATMENT,
    PRIMARY_OUTCOME,
    "propensity_model",
    "outcome_model",
    "aipw_score_name",
    "aipw_score",
    "pred_y1",
    "pred_y0",
    "cate_hat",
    "propensity_logistic",
    "propensity_lightgbm",
]

aipw_results.to_csv(aipw_results_path, index=False)
aipw_scores[save_score_cols].to_parquet(aipw_scores_path, index=False)
aipw_bootstrap.to_csv(aipw_bootstrap_path, index=False)
estimator_comparison_with_aipw.to_csv(comparison_with_aipw_path, index=False)
segment_effects_with_ci.to_csv(segment_effects_path, index=False)
segment_bootstrap.to_csv(segment_bootstrap_path, index=False)
primary_scores[[
    "user_id",
    "event_date",
    PRIMARY_TREATMENT,
    PRIMARY_OUTCOME,
    "aipw_score",
    "smoothed_aipw_effect",
    *segment_columns,
]].to_parquet(effect_predictions_path, index=False)
effect_importance.to_csv(effect_importance_path, index=False)
effect_model_metrics.to_csv(effect_metrics_path, index=False)
smoothed_bucket_summary.to_csv(smoothed_bucket_path, index=False)

print("Saved Notebook 06 artifacts:")
print(f"- {aipw_results_path}")
print(f"- {aipw_scores_path}")
print(f"- {aipw_bootstrap_path}")
print(f"- {comparison_with_aipw_path}")
print(f"- {segment_effects_path}")
print(f"- {segment_bootstrap_path}")
print(f"- {effect_predictions_path}")
print(f"- {effect_importance_path}")
print(f"- {effect_metrics_path}")
print(f"- {smoothed_bucket_path}")
Saved Notebook 06 artifacts:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_aipw_results.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_aipw_scores.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_aipw_bootstrap.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimator_comparison_with_aipw.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_aipw_segment_effects.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_aipw_segment_bootstrap.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_effect_model_predictions.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_effect_moderator_importance.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_effect_model_metrics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_smoothed_effect_buckets.csv

The doubly robust artifacts are saved for the final report notebook. The most important file is the updated estimator comparison table, because it now contains MSM, g-computation, and AIPW estimates side by side.

Takeaways and Next Step

This notebook added a doubly robust estimator and a heterogeneity layer to the long-term causal effects project.

The AIPW estimates should be interpreted alongside the MSM and g-computation estimates. If all three strategies remain small and uncertain, the final project conclusion should be honest: the current KuaiRec analysis does not show a clear average long-term interaction-volume gain from high-watch-exposure days.

The heterogeneity results are still useful. They show how the effect may vary by recent engagement state, and they provide product hypotheses for future online testing. The final notebook should now combine the estimator comparison, sensitivity checks, segment findings, limitations, and resume-ready summary into a polished report.