Notebook 07: Sensitivity, Final Report, and Portfolio Artifacts

This is the closing notebook for ** Long-Term Causal Effects in Recommendation Systems**.

The project asked whether short-term recommendation exposure, defined as a high-watch-exposure user-day, appears to change longer-term user engagement in KuaiRec. The previous notebooks built the analysis step by step:

This final notebook turns those modeling outputs into a concise project conclusion and saves portfolio-ready artifacts: figures, tables, final summary text, limitations, and resume bullets.

Final Project Question

The final causal question is:

Among active KuaiRec user-days with sufficient prior history and 7-day follow-up, what is the effect of a high-watch-exposure day on future 7-day interaction volume?

A high-watch-exposure day is not a randomized intervention. It is a constructed treatment from observed behavior. That means every result in this notebook depends on the sequential ignorability assumption: after conditioning on observed user history and calendar context, treated and untreated user-days are comparable enough for causal adjustment.

The final conclusion should therefore be framed as offline observational evidence, not as a substitute for an online experiment.

Setup

The first cell imports libraries for loading final artifacts, creating report tables, saving figures, and writing markdown summaries. The final notebook should be deterministic and lightweight because it does not refit the main causal models.

from pathlib import Path
import textwrap
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display

warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option("display.max_columns", 160)
pd.set_option("display.max_rows", 160)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")

The notebook environment is ready for reporting. No causal model is being refit here; the goal is to assemble and communicate the evidence produced by the earlier notebooks.

Locate Project Inputs and Report Folders

This cell finds the project root, checks for the final estimator comparison table, and creates the final writeup folders for figures and tables.

FINAL_COMPARISON_RELATIVE_PATH = Path("data/processed/kuairec_long_term_estimator_comparison_with_aipw.csv")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / FINAL_COMPARISON_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {FINAL_COMPARISON_RELATIVE_PATH}. Run Notebooks 04-06 first or run this notebook inside the project."
    )

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
WRITEUP_DIR = PROJECT_ROOT / "notebooks" / "long_term_causal_effects" / "writeup"
FIGURES_DIR = WRITEUP_DIR / "figures"
TABLES_DIR = WRITEUP_DIR / "tables"

FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Processed data folder: {PROCESSED_DIR}")
print(f"Writeup folder: {WRITEUP_DIR}")
print(f"Figures folder: {FIGURES_DIR}")
print(f"Tables folder: {TABLES_DIR}")
Project root: /home/apex/Documents/ranking_sys
Processed data folder: /home/apex/Documents/ranking_sys/data/processed
Writeup folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup
Figures folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures
Tables folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/tables

The writeup folder structure is now ready. The rest of the notebook will save final tables and figures into this project-specific report location.

Load Final Project Artifacts

The final report combines artifacts from all modeling notebooks. This cell loads the estimator comparison, weight diagnostics, balance diagnostics, sensitivity outputs, secondary outcome outputs, heterogeneity tables, and effect moderator importances.

def read_processed_csv(filename):
    path = PROCESSED_DIR / filename
    if not path.exists():
        raise FileNotFoundError(f"Missing required artifact: {path}")
    return pd.read_csv(path)

estimator_comparison = read_processed_csv("kuairec_long_term_estimator_comparison_with_aipw.csv")
weight_readiness = read_processed_csv("kuairec_long_term_weight_readiness_checks.csv")
balance_diagnostics = read_processed_csv("kuairec_long_term_balance_diagnostics.csv")
weight_diagnostics = read_processed_csv("kuairec_long_term_weight_diagnostics.csv")
msm_weight_sensitivity = read_processed_csv("kuairec_long_term_msm_weight_sensitivity.csv")
msm_secondary = read_processed_csv("kuairec_long_term_msm_secondary_outcomes.csv")
gcomp_secondary = read_processed_csv("kuairec_long_term_gcomp_secondary_outcomes.csv")
aipw_segment_effects = read_processed_csv("kuairec_long_term_aipw_segment_effects.csv")
effect_importance = read_processed_csv("kuairec_long_term_effect_moderator_importance.csv")
smoothed_effect_buckets = read_processed_csv("kuairec_long_term_smoothed_effect_buckets.csv")
propensity_metrics = read_processed_csv("kuairec_long_term_propensity_model_metrics.csv")
gcomp_metrics = read_processed_csv("kuairec_long_term_gcomp_model_metrics.csv")
aipw_results = read_processed_csv("kuairec_long_term_aipw_results.csv")

artifact_shapes = pd.DataFrame(
    [
        {"artifact": "estimator_comparison", "rows": len(estimator_comparison), "columns": estimator_comparison.shape[1]},
        {"artifact": "weight_readiness", "rows": len(weight_readiness), "columns": weight_readiness.shape[1]},
        {"artifact": "balance_diagnostics", "rows": len(balance_diagnostics), "columns": balance_diagnostics.shape[1]},
        {"artifact": "msm_weight_sensitivity", "rows": len(msm_weight_sensitivity), "columns": msm_weight_sensitivity.shape[1]},
        {"artifact": "msm_secondary", "rows": len(msm_secondary), "columns": msm_secondary.shape[1]},
        {"artifact": "gcomp_secondary", "rows": len(gcomp_secondary), "columns": gcomp_secondary.shape[1]},
        {"artifact": "aipw_segment_effects", "rows": len(aipw_segment_effects), "columns": aipw_segment_effects.shape[1]},
        {"artifact": "effect_importance", "rows": len(effect_importance), "columns": effect_importance.shape[1]},
    ]
)

display(artifact_shapes)
artifact rows columns
0 estimator_comparison 4 7
1 weight_readiness 5 3
2 balance_diagnostics 12 10
3 msm_weight_sensitivity 7 16
4 msm_secondary 4 14
5 gcomp_secondary 4 9
6 aipw_segment_effects 23 16
7 effect_importance 19 2

All final artifacts loaded successfully. The report can now synthesize the project instead of recomputing models.

Create the Final Estimator Comparison Table

The main project result is a cross-estimator comparison. This table includes MSM, g-computation, and doubly robust AIPW estimates on the same outcome scale: future 7-day interactions.

def clean_method_label(row):
    if row["method"].startswith("MSM"):
        return "MSM, weighted"
    if row["method"] == "G-computation" and row["model"] == "lightgbm":
        return "G-computation, LightGBM"
    if row["method"] == "G-computation" and row["model"] == "linear_ridge":
        return "G-computation, linear"
    if row["method"].startswith("Doubly robust"):
        return "AIPW, doubly robust"
    return f"{row['method']}, {row['model']}"

final_estimator_table = estimator_comparison.copy()
final_estimator_table["estimator"] = final_estimator_table.apply(clean_method_label, axis=1)
final_estimator_table["relative_lift_pct"] = 100 * final_estimator_table["relative_lift_vs_msm_control_mean"]
final_estimator_table = final_estimator_table[
    [
        "estimator",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "relative_lift_pct",
        "source",
    ]
].copy()

final_estimator_table["conclusion"] = np.where(
    (final_estimator_table["ci_95_lower"] <= 0) & (final_estimator_table["ci_95_upper"] >= 0),
    "interval crosses zero",
    "interval excludes zero",
)

final_estimator_table.to_csv(TABLES_DIR / "final_estimator_comparison.csv", index=False)
display(final_estimator_table)
estimator estimate ci_95_lower ci_95_upper relative_lift_pct source conclusion
0 MSM, weighted -2.6877 -12.2706 5.7291 -0.7164 Notebook 04 user-cluster bootstrap interval crosses zero
1 G-computation, LightGBM 0.0359 -0.6384 1.8656 0.0096 Notebook 05 user-cluster bootstrap interval crosses zero
2 G-computation, linear 2.3288 -6.3238 9.3565 0.6208 Notebook 05 user-cluster bootstrap interval crosses zero
3 AIPW, doubly robust 1.0940 -7.3428 12.0178 0.2916 Notebook 06 user-cluster bootstrap over AIPW s... interval crosses zero

The final comparison table shows the central project conclusion: every main estimator has an interval that crosses zero. The average effect is small and uncertain across weighting, outcome modeling, and doubly robust approaches.

Figure 1: Final Estimator Comparison

This figure is the most important visual artifact for the project. It shows the estimated effect of high-watch exposure on future 7-day interactions across all main estimators.

plot_df = final_estimator_table.copy()
plot_df["lower_error"] = plot_df["estimate"] - plot_df["ci_95_lower"]
plot_df["upper_error"] = plot_df["ci_95_upper"] - plot_df["estimate"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(plot_df)),
    y=plot_df["estimate"],
    yerr=[plot_df["lower_error"], plot_df["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(plot_df)))
ax.set_xticklabels(plot_df["estimator"], rotation=20, ha="right")
ax.set_title("Long-Term Effect Estimates Across Causal Strategies")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Estimator")
plt.tight_layout()
figure_path = FIGURES_DIR / "01_estimator_comparison.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/01_estimator_comparison.png

The figure makes the cross-method conclusion visually clear. The estimates cluster near zero and the uncertainty intervals cross zero, so the final report should not claim a reliable average long-term lift.

Weight and Overlap Readiness Summary

The treatment weighting notebook showed that observed balance improved after weighting, but overlap was not perfect. This table summarizes the readiness checks that matter for interpreting the MSM result.

weight_readiness_report = weight_readiness.copy()
weight_readiness_report["status"] = np.where(weight_readiness_report["passes"], "passes", "caution")
weight_readiness_report.to_csv(TABLES_DIR / "weight_readiness_checks.csv", index=False)

display(weight_readiness_report)

display(weight_diagnostics)
check value passes status
0 no propensity below 0.05 or above 0.95 for mos... 0.0664 False caution
1 analysis weight max below clipping cap 10.0000 True passes
2 effective sample size at least 70% of rows 0.5110 False caution
3 mean absolute SMD improves after weighting 0.3311 True passes
4 maximum absolute SMD improves after weighting 1.1915 True passes
weight mean std min p50 p90 p95 p99 max effective_sample_size ess_share_of_rows
0 ipw_logistic 2.0767 3.4301 1.0028 1.3945 3.2625 4.7814 11.4275 119.9903 1,260.3721 0.2683
1 sw_logistic 1.0384 1.7153 0.5018 0.6973 1.6307 2.3921 5.7092 59.9441 1,260.0838 0.2682
2 ipw_lightgbm 2.0541 2.2425 1.0216 1.3874 3.3732 5.1568 11.7539 45.0967 2,143.6149 0.4563
3 sw_lightgbm 1.0270 1.1213 0.5104 0.6935 1.6872 2.5799 5.8723 22.5676 2,143.4028 0.4562

The readiness table gives the correct caution language for the final report: weights greatly improved observed balance, but overlap and effective sample size were imperfect. That limits how strongly the MSM result should be interpreted.

Figure 2: Covariate Balance Before and After Weighting

This figure shows whether the stabilized weights reduced observed treated-control imbalance on pre-treatment histories.

balance_plot = balance_diagnostics[
    ["covariate", "abs_smd_unweighted", "abs_smd_weighted"]
].melt(
    id_vars="covariate",
    var_name="balance_type",
    value_name="absolute_smd",
)
balance_plot["balance_type"] = balance_plot["balance_type"].map(
    {
        "abs_smd_unweighted": "Before weighting",
        "abs_smd_weighted": "After weighting",
    }
)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=balance_plot, x="absolute_smd", y="covariate", hue="balance_type", ax=ax)
ax.axvline(0.1, color="darkred", linestyle="--", linewidth=1)
ax.set_title("Observed Covariate Balance Before and After Weighting")
ax.set_xlabel("Absolute standardized mean difference")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
figure_path = FIGURES_DIR / "02_weight_balance.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/02_weight_balance.png

The balance figure shows that the weighting step did what it was supposed to do on observed covariates. This supports the MSM analysis, while the overlap diagnostics still justify caution.

Weight Sensitivity Summary

This table shows how the MSM estimate changed across unweighted, unclipped, clipped, logistic-weighted, and LightGBM-weighted specifications.

weight_sensitivity_report = msm_weight_sensitivity[
    [
        "model",
        "weight_column",
        "treatment_effect",
        "ci_95_lower",
        "ci_95_upper",
        "relative_lift_vs_control",
        "effective_sample_size",
        "ess_share_of_rows",
    ]
].copy()
weight_sensitivity_report["relative_lift_pct"] = 100 * weight_sensitivity_report["relative_lift_vs_control"]
weight_sensitivity_report.to_csv(TABLES_DIR / "msm_weight_sensitivity.csv", index=False)

display(weight_sensitivity_report)
model weight_column treatment_effect ci_95_lower ci_95_upper relative_lift_vs_control effective_sample_size ess_share_of_rows relative_lift_pct
0 unweighted none -2.6193 -11.5474 6.3088 -0.0069 4,698.0000 1.0000 -0.6919
1 logistic_stabilized_unclipped sw_logistic 1.5616 -14.7943 17.9175 0.0042 1,260.0838 0.2682 0.4157
2 logistic_stabilized_clip_10 analysis_weight -2.6877 -11.8189 6.4435 -0.0072 2,400.8979 0.5110 -0.7164
3 logistic_stabilized_clip_5 sw_logistic_clip_5 -2.9769 -11.1777 5.2239 -0.0079 2,932.2499 0.6241 -0.7936
4 logistic_stabilized_clip_2 sw_logistic_clip_2 -2.2273 -9.8650 5.4105 -0.0059 3,797.3068 0.8083 -0.5918
5 lightgbm_stabilized_unclipped sw_lightgbm 0.9744 -8.5678 10.5166 0.0026 2,143.4028 0.4562 0.2564
6 lightgbm_stabilized_clip_10 sw_lightgbm_clip_10 -0.4889 -9.4432 8.4655 -0.0013 2,351.8214 0.5006 -0.1283

The sensitivity table shows that the estimated effect remains small across weight choices, although precision changes with clipping and effective sample size. This supports a cautious near-zero average-effect conclusion.

Figure 3: MSM Weight Sensitivity

This plot visualizes the MSM treatment effect across weight specifications. It is a useful final-report figure because it shows the result is not driven by one arbitrary clipping choice.

weight_plot = weight_sensitivity_report.copy()
weight_plot["lower_error"] = weight_plot["treatment_effect"] - weight_plot["ci_95_lower"]
weight_plot["upper_error"] = weight_plot["ci_95_upper"] - weight_plot["treatment_effect"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(weight_plot)),
    y=weight_plot["treatment_effect"],
    yerr=[weight_plot["lower_error"], weight_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=3,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(weight_plot)))
ax.set_xticklabels(weight_plot["model"], rotation=30, ha="right")
ax.set_title("MSM Sensitivity to Weight Specification")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Weight specification")
plt.tight_layout()
figure_path = FIGURES_DIR / "03_weight_sensitivity.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/03_weight_sensitivity.png

The sensitivity plot reinforces the main conclusion: the point estimates move somewhat, but they remain small relative to the outcome scale and uncertainty generally overlaps zero.

Secondary Outcome Summary

The primary outcome is future 7-day interaction volume. The project also checked active days, average daily interactions, watch hours, and log interaction volume. This table combines MSM and g-computation secondary outcome estimates.

msm_secondary_report = msm_secondary[
    ["outcome", "treatment_effect", "ci_95_lower", "ci_95_upper", "relative_lift_vs_control"]
].copy()
msm_secondary_report["estimator"] = "MSM"
msm_secondary_report = msm_secondary_report.rename(columns={"treatment_effect": "estimate", "relative_lift_vs_control": "relative_lift"})

gcomp_secondary_report = gcomp_secondary[
    ["outcome", "ate", "relative_lift_vs_y0"]
].copy()
gcomp_secondary_report["estimator"] = "G-computation, LightGBM"
gcomp_secondary_report = gcomp_secondary_report.rename(columns={"ate": "estimate", "relative_lift_vs_y0": "relative_lift"})
gcomp_secondary_report["ci_95_lower"] = np.nan
gcomp_secondary_report["ci_95_upper"] = np.nan

secondary_report = pd.concat(
    [msm_secondary_report, gcomp_secondary_report],
    ignore_index=True,
    sort=False,
)
secondary_report["relative_lift_pct"] = 100 * secondary_report["relative_lift"]
secondary_report.to_csv(TABLES_DIR / "secondary_outcome_summary.csv", index=False)

display(secondary_report)
outcome estimate ci_95_lower ci_95_upper relative_lift estimator relative_lift_pct
0 future_7day_active_days -0.0299 -0.0677 0.0079 -0.0044 MSM -0.4351
1 future_7day_avg_daily_interactions -0.3840 -1.6884 0.9205 -0.0072 MSM -0.7164
2 future_7day_play_hours 0.0598 0.0210 0.0986 0.0689 MSM 6.8882
3 outcome_log1p -0.0120 -0.0414 0.0175 -0.0021 MSM -0.2056
4 future_7day_active_days -0.0176 NaN NaN -0.0026 G-computation, LightGBM -0.2569
5 future_7day_avg_daily_interactions 0.0051 NaN NaN 0.0001 G-computation, LightGBM 0.0094
6 future_7day_play_hours 0.0396 NaN NaN 0.0441 G-computation, LightGBM 4.4138
7 outcome_log1p 0.0003 NaN NaN 0.0001 G-computation, LightGBM 0.0056

The secondary outcomes add nuance. Interaction volume does not show a clear average effect, while watch-hours estimates are more positive. The final report should distinguish these outcome families rather than collapse them into one engagement story.

Figure 4: Secondary Outcome Effects

This plot focuses on MSM secondary outcomes because they include uncertainty intervals. It shows whether the treatment appears different for retention-style, interaction-volume, and watch-time metrics.

secondary_msm_plot = msm_secondary_report.copy()
secondary_msm_plot["lower_error"] = secondary_msm_plot["estimate"] - secondary_msm_plot["ci_95_lower"]
secondary_msm_plot["upper_error"] = secondary_msm_plot["ci_95_upper"] - secondary_msm_plot["estimate"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(secondary_msm_plot)),
    y=secondary_msm_plot["estimate"],
    yerr=[secondary_msm_plot["lower_error"], secondary_msm_plot["upper_error"]],
    fmt="o",
    color="#5C946E",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(secondary_msm_plot)))
ax.set_xticklabels(secondary_msm_plot["outcome"], rotation=25, ha="right")
ax.set_title("MSM Effects on Secondary Long-Term Outcomes")
ax.set_ylabel("Effect in outcome units")
ax.set_xlabel("Outcome")
plt.tight_layout()
figure_path = FIGURES_DIR / "04_secondary_outcomes.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/04_secondary_outcomes.png

The secondary-outcome plot shows why the final conclusion should be precise. There is no clear interaction-volume gain, but watch-time-oriented outcomes may move differently and deserve separate validation.

Heterogeneous Effects Summary

Notebook 06 explored whether the near-zero average effect hides segment-level variation. This cell creates a final table of the most positive and most negative segment estimates that pass the size filter.

eligible_segments = aipw_segment_effects.query("passes_size_filter == True").copy()
eligible_segments["interval_crosses_zero"] = (eligible_segments["ci_95_lower"] <= 0) & (eligible_segments["ci_95_upper"] >= 0)

most_positive_segments = eligible_segments.sort_values("aipw_ate", ascending=False).head(8)
most_negative_segments = eligible_segments.sort_values("aipw_ate", ascending=True).head(8)
final_segment_report = pd.concat(
    [
        most_positive_segments.assign(direction="most_positive"),
        most_negative_segments.assign(direction="most_negative"),
    ],
    ignore_index=True,
)

final_segment_report.to_csv(TABLES_DIR / "heterogeneous_segment_effects.csv", index=False)
display(final_segment_report)
segment_type segment rows users treatment_rate observed_outcome_mean aipw_ate aipw_score_std gcomp_component mean_propensity passes_size_filter bootstrap_reps bootstrap_mean bootstrap_std ci_95_lower ci_95_upper interval_crosses_zero direction
0 prior_interactions_bucket (217.0, 581.0] 1169 91 0.4688 445.8435 25.6170 548.5773 0.1569 0.4787 True 250 25.7353 17.6855 1.8847 66.3297 False most_positive
1 day_of_week Tuesday 623 91 0.4912 390.4575 20.0321 750.2361 0.0096 0.4917 True 250 20.4314 32.7418 -24.3380 90.8593 True most_positive
2 lag_interactions_bucket (75.0, 375.0] 1146 91 0.4852 448.0070 16.2174 575.5850 0.0829 0.4860 True 250 16.0696 18.4124 -9.9926 57.4826 True most_positive
3 prior_watch_ratio_bucket (-0.001, 2.26] 1175 75 0.2119 371.7200 12.7319 549.7271 0.1169 0.2241 True 250 12.4139 17.3878 -9.3683 52.2110 True most_positive
4 prior_high_watch_bucket (-0.001, 1.127] 1175 73 0.1677 368.0732 12.5337 549.1163 0.0551 0.1685 True 250 11.9216 17.3031 -11.3159 52.9259 True most_positive
5 prior_watch_ratio_bucket (2.26, 2.59] 1174 89 0.4421 376.9199 4.2636 157.1119 0.1366 0.4355 True 250 3.9070 5.0899 -5.4912 13.3243 True most_positive
6 prior_high_watch_bucket (1.127, 1.414] 1174 84 0.3424 386.3526 2.4564 154.8324 0.1259 0.3521 True 250 2.5918 4.6737 -6.3520 12.5274 True most_positive
7 day_of_week Wednesday 711 91 0.4684 383.1899 1.8366 168.8665 0.0151 0.4673 True 250 2.1471 7.8816 -12.4635 18.6481 True most_positive
8 prior_watch_ratio_bucket (3.004, 11.372] 1175 73 0.6894 388.3106 -13.2926 268.3096 -0.0665 0.7063 True 250 -13.5921 7.0527 -27.1569 -0.8128 False most_negative
9 prior_interactions_bucket (166.0, 217.0] 1164 91 0.5198 420.7345 -12.4330 205.3765 0.0244 0.5120 True 250 -12.2267 6.6019 -26.7004 0.3019 True most_negative
10 prior_high_watch_bucket (1.708, 2.735] 1175 65 0.8647 389.3123 -9.1134 276.4264 -0.0124 0.8668 True 250 -9.4140 9.0052 -27.9388 7.2418 True most_negative
11 prior_interactions_bucket (123.0, 166.0] 1182 91 0.5178 386.9205 -8.0145 233.7360 0.0062 0.5061 True 250 -8.5057 6.1431 -20.4152 2.5367 True most_negative
12 day_of_week Monday 625 91 0.4800 393.7184 -6.4503 170.7878 0.0344 0.4809 True 250 -6.1623 5.8209 -18.4649 4.0985 True most_negative
13 lag_interactions_bucket (53.0, 75.0] 1192 91 0.5159 409.1242 -5.6043 187.1179 0.1011 0.5115 True 250 -5.6611 5.3270 -15.8618 5.4620 True most_negative
14 lag_interactions_bucket (-0.001, 36.0] 1202 91 0.4917 290.3095 -4.3047 166.3770 -0.0194 0.5013 True 250 -4.3555 3.9523 -11.8921 3.5914 True most_negative
15 day_of_week Sunday 621 91 0.5523 400.9678 -3.4574 201.6585 0.0243 0.5517 True 250 -4.0013 7.8748 -20.5009 9.5415 True most_negative

The segment table is best read as hypothesis generation. Some segments have larger point estimates, but many intervals are wide, so these are candidates for future experiment design rather than deployment rules.

Figure 5: Prior-Interaction Segment Effects

This figure summarizes AIPW effects across prior 3-day interaction buckets. It asks whether recent engagement level moderates the effect of high-watch exposure.

prior_interaction_plot = eligible_segments.query("segment_type == 'prior_interactions_bucket'").copy()
prior_interaction_plot = prior_interaction_plot.sort_values("segment")
prior_interaction_plot["lower_error"] = prior_interaction_plot["aipw_ate"] - prior_interaction_plot["ci_95_lower"]
prior_interaction_plot["upper_error"] = prior_interaction_plot["ci_95_upper"] - prior_interaction_plot["aipw_ate"]

fig, ax = plt.subplots(figsize=(9, 5))
ax.errorbar(
    x=np.arange(len(prior_interaction_plot)),
    y=prior_interaction_plot["aipw_ate"],
    yerr=[prior_interaction_plot["lower_error"], prior_interaction_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(prior_interaction_plot)))
ax.set_xticklabels(prior_interaction_plot["segment"], rotation=20, ha="right")
ax.set_title("AIPW Effects by Prior Interaction Bucket")
ax.set_xlabel("Prior 3-day interaction bucket")
ax.set_ylabel("Effect on future 7-day interactions")
plt.tight_layout()
figure_path = FIGURES_DIR / "05_prior_interaction_segments.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/05_prior_interaction_segments.png

This plot is a useful product diagnostic. It suggests where the effect may differ by recent activity, but wide intervals mean segment findings should be positioned as experimentation hypotheses.

Figure 6: Effect Moderator Importance

The effect-smoothing model from Notebook 06 predicts noisy AIPW scores from pre-treatment histories. Feature importance helps identify which histories are most useful for summarizing heterogeneity.

importance_plot = effect_importance.head(12).sort_values("importance")

fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=importance_plot, x="importance", y="feature", ax=ax, color="#2A6F97")
ax.set_title("Top Moderators in the Doubly Robust Effect Model")
ax.set_xlabel("LightGBM feature importance")
ax.set_ylabel("Pre-treatment feature")
plt.tight_layout()
figure_path = FIGURES_DIR / "06_effect_moderator_importance.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

effect_importance.to_csv(TABLES_DIR / "effect_moderator_importance.csv", index=False)

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/06_effect_moderator_importance.png

The moderator importance plot points to recent watch quality and recent activity as the main heterogeneity drivers. These variables are natural stratification candidates for any follow-up online test.

Final Limitations Table

A strong causal portfolio project should be explicit about what the analysis cannot prove. This cell writes the final limitations table for the project report.

limitations = pd.DataFrame(
    [
        {
            "limitation": "Observational logging",
            "why_it_matters": "High-watch exposure was not randomized, so causal validity depends on observed adjustment rather than experimental assignment.",
            "mitigation_in_project": "Used sequential history covariates, propensity weights, MSM, g-computation, and AIPW triangulation.",
        },
        {
            "limitation": "Sequential ignorability assumption",
            "why_it_matters": "Unobserved user intent or recommender state may affect both treatment and future engagement.",
            "mitigation_in_project": "Adjusted for lagged activity, prior 3-day engagement, watch behavior, and calendar context; documented remaining risk.",
        },
        {
            "limitation": "Imperfect overlap",
            "why_it_matters": "Some histories were much more likely to receive treatment, making weighted estimates less precise.",
            "mitigation_in_project": "Diagnosed positivity, clipped weights, reported effective sample size, and ran weight sensitivity checks.",
        },
        {
            "limitation": "Constructed treatment",
            "why_it_matters": "High-watch exposure is derived from observed consumption, not a product intervention directly assigned by the platform.",
            "mitigation_in_project": "Defined the estimand explicitly and framed conclusions as offline evidence for future experiment design.",
        },
        {
            "limitation": "Dataset slice",
            "why_it_matters": "The analysis uses a KuaiRec sample/panel derived for notebook runtime, so estimates may differ on the full platform log.",
            "mitigation_in_project": "Saved a reproducible panel and designed the workflow so the sample size can be expanded later.",
        },
        {
            "limitation": "Outcome choice",
            "why_it_matters": "Future interactions, active days, and watch hours capture different long-term product objectives.",
            "mitigation_in_project": "Reported primary and secondary outcomes separately instead of forcing one engagement narrative.",
        },
    ]
)

limitations.to_csv(TABLES_DIR / "limitations.csv", index=False)
display(limitations)
limitation why_it_matters mitigation_in_project
0 Observational logging High-watch exposure was not randomized, so cau... Used sequential history covariates, propensity...
1 Sequential ignorability assumption Unobserved user intent or recommender state ma... Adjusted for lagged activity, prior 3-day enga...
2 Imperfect overlap Some histories were much more likely to receiv... Diagnosed positivity, clipped weights, reporte...
3 Constructed treatment High-watch exposure is derived from observed c... Defined the estimand explicitly and framed con...
4 Dataset slice The analysis uses a KuaiRec sample/panel deriv... Saved a reproducible panel and designed the wo...
5 Outcome choice Future interactions, active days, and watch ho... Reported primary and secondary outcomes separa...

The limitations table keeps the final conclusion honest. The project provides strong offline causal analysis, but the right production next step would still be an online experiment or a logging-policy design with better randomization.

Final Recommendation Table

This table translates the analysis into a product-facing conclusion. It states what the evidence supports, what it does not support, and what a next experiment should test.

primary_aipw_estimate = final_estimator_table.query("estimator == 'AIPW, doubly robust'")["estimate"].iloc[0]
primary_aipw_low = final_estimator_table.query("estimator == 'AIPW, doubly robust'")["ci_95_lower"].iloc[0]
primary_aipw_high = final_estimator_table.query("estimator == 'AIPW, doubly robust'")["ci_95_upper"].iloc[0]

final_recommendation = pd.DataFrame(
    [
        {
            "decision_area": "Average long-term interaction effect",
            "recommendation": "Do not claim a clear positive average effect from high-watch-exposure days on future 7-day interactions.",
            "evidence": f"MSM, g-computation, and AIPW estimates are small; primary AIPW is {primary_aipw_estimate:.2f} with CI [{primary_aipw_low:.2f}, {primary_aipw_high:.2f}].",
        },
        {
            "decision_area": "Metric strategy",
            "recommendation": "Keep validating watch-time-like signals against longer-term outcomes rather than optimizing them alone.",
            "evidence": "Primary interaction-volume effects are uncertain, while secondary watch-hours results differ from interaction-volume results.",
        },
        {
            "decision_area": "Experiment design",
            "recommendation": "Use recent engagement and watch-quality histories as stratification variables in a future online test.",
            "evidence": "Heterogeneity diagnostics identify recent interactions and high-watch share as important effect moderators.",
        },
        {
            "decision_area": "Deployment posture",
            "recommendation": "Treat this as offline evidence for experiment prioritization, not as deployment proof.",
            "evidence": "The analysis is observational and overlap is imperfect, even after weighting improved observed balance.",
        },
    ]
)

final_recommendation.to_csv(TABLES_DIR / "final_recommendation.csv", index=False)
display(final_recommendation)
decision_area recommendation evidence
0 Average long-term interaction effect Do not claim a clear positive average effect f... MSM, g-computation, and AIPW estimates are sma...
1 Metric strategy Keep validating watch-time-like signals agains... Primary interaction-volume effects are uncerta...
2 Experiment design Use recent engagement and watch-quality histor... Heterogeneity diagnostics identify recent inte...
3 Deployment posture Treat this as offline evidence for experiment ... The analysis is observational and overlap is i...

The recommendation table is the executive version of the project. It avoids overselling the result and converts the causal evidence into concrete product and experimentation guidance.

Write the Final Project Summary

This cell writes a markdown summary that can be used as the final narrative for a portfolio page or README section.

msm_row = final_estimator_table.query("estimator == 'MSM, weighted'").iloc[0]
gcomp_lgbm_row = final_estimator_table.query("estimator == 'G-computation, LightGBM'").iloc[0]
aipw_row = final_estimator_table.query("estimator == 'AIPW, doubly robust'").iloc[0]

summary_text = f"""# Final Summary: Long-Term Causal Effects in Recommendation Systems

## Question

This project estimates whether a high-watch-exposure day in KuaiRec changes a user's future 7-day engagement. The target outcome is future 7-day interaction volume.

## Why This Matters

Short-term recommendation metrics such as watch ratio or completion can look attractive while failing to improve longer-term engagement. This project treats the problem as a sequential causal inference task, where prior user behavior affects today's exposure and future behavior.

## Methods

The project uses three complementary causal strategies:

- Marginal structural model with stabilized inverse probability weights.
- G-computation with linear and LightGBM outcome models.
- Doubly robust AIPW estimation with segment-level heterogeneity diagnostics.

## Main Result

The average effect is small and uncertain across estimators:

- MSM estimate: {msm_row['estimate']:.2f} future interactions, 95% CI [{msm_row['ci_95_lower']:.2f}, {msm_row['ci_95_upper']:.2f}].
- LightGBM g-computation estimate: {gcomp_lgbm_row['estimate']:.2f}, 95% CI [{gcomp_lgbm_row['ci_95_lower']:.2f}, {gcomp_lgbm_row['ci_95_upper']:.2f}].
- Doubly robust AIPW estimate: {aipw_row['estimate']:.2f}, 95% CI [{aipw_row['ci_95_lower']:.2f}, {aipw_row['ci_95_upper']:.2f}].

The evidence does not support claiming a clear positive average effect of high-watch-exposure days on future 7-day interaction volume.

## Sensitivity and Heterogeneity

Weighting improved observed covariate balance, but overlap was imperfect and effective sample size dropped after weighting. Secondary outcomes suggest the treatment may relate differently to watch-hours metrics than to interaction volume. Heterogeneity diagnostics point to recent engagement and recent watch-quality history as useful stratification variables for future experiments.

## Product Takeaway

High-watch exposure should not be treated as automatically beneficial for longer-term interaction volume. It may still be a useful short-term satisfaction signal, but it should be validated against long-term metrics and tested online with stratification by recent user history.

## Limitations

This is an observational analysis. The estimates rely on sequential ignorability, observed history adjustment, and constructed treatment definitions. The results are best used to prioritize and design online experiments, not to replace them.
"""

summary_path = WRITEUP_DIR / "final_project_summary.md"
summary_path.write_text(summary_text)

display(Markdown(summary_text))
print(f"Saved {summary_path}")

Final Summary: Long-Term Causal Effects in Recommendation Systems

Question

This project estimates whether a high-watch-exposure day in KuaiRec changes a user’s future 7-day engagement. The target outcome is future 7-day interaction volume.

Why This Matters

Short-term recommendation metrics such as watch ratio or completion can look attractive while failing to improve longer-term engagement. This project treats the problem as a sequential causal inference task, where prior user behavior affects today’s exposure and future behavior.

Methods

The project uses three complementary causal strategies:

  • Marginal structural model with stabilized inverse probability weights.
  • G-computation with linear and LightGBM outcome models.
  • Doubly robust AIPW estimation with segment-level heterogeneity diagnostics.

Main Result

The average effect is small and uncertain across estimators:

  • MSM estimate: -2.69 future interactions, 95% CI [-12.27, 5.73].
  • LightGBM g-computation estimate: 0.04, 95% CI [-0.64, 1.87].
  • Doubly robust AIPW estimate: 1.09, 95% CI [-7.34, 12.02].

The evidence does not support claiming a clear positive average effect of high-watch-exposure days on future 7-day interaction volume.

Sensitivity and Heterogeneity

Weighting improved observed covariate balance, but overlap was imperfect and effective sample size dropped after weighting. Secondary outcomes suggest the treatment may relate differently to watch-hours metrics than to interaction volume. Heterogeneity diagnostics point to recent engagement and recent watch-quality history as useful stratification variables for future experiments.

Product Takeaway

High-watch exposure should not be treated as automatically beneficial for longer-term interaction volume. It may still be a useful short-term satisfaction signal, but it should be validated against long-term metrics and tested online with stratification by recent user history.

Limitations

This is an observational analysis. The estimates rely on sequential ignorability, observed history adjustment, and constructed treatment definitions. The results are best used to prioritize and design online experiments, not to replace them.

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/final_project_summary.md

The final summary is now saved as markdown and displayed in the notebook. It states the core finding without burying the uncertainty or limitations.

Write Resume Bullets

The resume bullets focus on the technical and product value of the project. They are written to be useful for data science and causal inference roles in recommendation-system teams.

resume_bullets = """# Resume Bullets

- Built a sequential causal inference project on KuaiRec to estimate whether high-watch-exposure recommendation days affect future 7-day engagement.
- Defined a user-day estimand, engineered lagged user-history confounders, and diagnosed time-varying confounding and positivity in recommender-system logs.
- Estimated long-term effects using marginal structural models, g-computation, and doubly robust AIPW, with user-cluster bootstrap uncertainty.
- Found small, uncertain average effects on future interaction volume across estimators, while identifying recent engagement and watch-quality histories as heterogeneity drivers for future experiments.
- Produced portfolio-ready causal analysis artifacts, including estimator comparison figures, balance diagnostics, sensitivity tables, limitations, and final product recommendations.
"""

resume_path = WRITEUP_DIR / "resume_bullets.md"
resume_path.write_text(resume_bullets)

display(Markdown(resume_bullets))
print(f"Saved {resume_path}")

Resume Bullets

  • Built a sequential causal inference project on KuaiRec to estimate whether high-watch-exposure recommendation days affect future 7-day engagement.
  • Defined a user-day estimand, engineered lagged user-history confounders, and diagnosed time-varying confounding and positivity in recommender-system logs.
  • Estimated long-term effects using marginal structural models, g-computation, and doubly robust AIPW, with user-cluster bootstrap uncertainty.
  • Found small, uncertain average effects on future interaction volume across estimators, while identifying recent engagement and watch-quality histories as heterogeneity drivers for future experiments.
  • Produced portfolio-ready causal analysis artifacts, including estimator comparison figures, balance diagnostics, sensitivity tables, limitations, and final product recommendations.
Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/resume_bullets.md

The resume bullets emphasize the causal workflow and the product interpretation. They are intentionally honest about the result: the value is in rigorous long-term validation, not in manufacturing a positive effect.

Build the Artifact Index

The final cell lists every saved report artifact. This makes the writeup folder easy to navigate.

artifact_paths = sorted(WRITEUP_DIR.rglob("*"))
artifact_index = pd.DataFrame(
    [
        {
            "artifact": str(path.relative_to(WRITEUP_DIR)),
            "type": "directory" if path.is_dir() else path.suffix.replace(".", ""),
            "size_bytes": path.stat().st_size if path.is_file() else np.nan,
        }
        for path in artifact_paths
    ]
)

artifact_index = artifact_index.query("type != 'directory'").reset_index(drop=True)
artifact_index.to_csv(TABLES_DIR / "artifact_index.csv", index=False)

display(artifact_index)
print(f"Saved {TABLES_DIR / 'artifact_index.csv'}")
artifact type size_bytes
0 figures/01_estimator_comparison.png png 84,935.0000
1 figures/02_weight_balance.png png 131,917.0000
2 figures/03_weight_sensitivity.png png 127,032.0000
3 figures/04_secondary_outcomes.png png 87,650.0000
4 figures/05_prior_interaction_segments.png png 68,157.0000
5 figures/06_effect_moderator_importance.png png 107,326.0000
6 final_project_summary.md md 2,219.0000
7 resume_bullets.md md 853.0000
8 tables/effect_moderator_importance.csv csv 495.0000
9 tables/final_estimator_comparison.csv csv 709.0000
10 tables/final_recommendation.csv csv 968.0000
11 tables/heterogeneous_segment_effects.csv csv 4,351.0000
12 tables/limitations.csv csv 1,469.0000
13 tables/msm_weight_sensitivity.csv csv 1,318.0000
14 tables/secondary_outcome_summary.csv csv 1,025.0000
15 tables/weight_readiness_checks.csv csv 395.0000
Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/tables/artifact_index.csv

The artifact index closes the project loop. A reviewer can now open the writeup folder and see the final figures, tables, summary, limitations, and resume bullets without digging through every notebook.

Final Takeaway

The completed project provides a careful causal analysis of long-term recommendation effects. The strongest final statement is:

In this KuaiRec observational panel, high-watch-exposure days do not show a clear positive average effect on future 7-day interaction volume across MSM, g-computation, and doubly robust AIPW estimators.

That is a useful product conclusion. It says that short-term watch-quality signals should be validated against longer-term outcomes before being treated as optimization targets. It also gives a concrete next step: design an online experiment stratified by recent engagement and watch-quality history.