Notebook 07: Sensitivity, Final Report, and Portfolio Artifacts

This is the closing notebook for **Long-Term Causal Effects in Recommendation Systems**.

The project asked whether short-term recommendation exposure, defined as a high-watch-exposure user-day, appears to change longer-term user engagement in KuaiRec. The previous notebooks built the analysis step by step:

Notebook 01 created the sequential KuaiRec user-day panel.
Notebook 02 defined the estimand.
Notebook 03 modeled time-varying confounding and created stabilized weights.
Notebook 04 estimated a marginal structural model.
Notebook 05 estimated g-computation effects.
Notebook 06 estimated doubly robust AIPW effects and explored heterogeneity.

This final notebook turns those modeling outputs into a concise project conclusion and saves portfolio-ready artifacts: figures, tables, final summary text, limitations, and resume bullets.

Final Project Question

The final causal question is:

Among active KuaiRec user-days with sufficient prior history and 7-day follow-up, what is the effect of a high-watch-exposure day on future 7-day interaction volume?

A high-watch-exposure day is not a randomized intervention. It is a constructed treatment from observed behavior. That means every result in this notebook depends on the sequential ignorability assumption: after conditioning on observed user history and calendar context, treated and untreated user-days are comparable enough for causal adjustment.

The final conclusion should therefore be framed as offline observational evidence, not as a substitute for an online experiment.

Setup

The first cell imports libraries for loading final artifacts, creating report tables, saving figures, and writing markdown summaries. This notebook is meant to be deterministic and lightweight: it reads saved artifacts and does not refit the main causal models.

from pathlib import Path
import textwrap
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display

warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option("display.max_columns", 160)
pd.set_option("display.max_rows", 160)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")

The notebook environment is ready for reporting. No causal model is being refit here; the goal is to assemble and communicate the evidence produced by the earlier notebooks.

Locate Project Inputs and Report Folders

This cell finds the project root, checks for the final estimator comparison table, and creates the final writeup folders for figures and tables.

FINAL_COMPARISON_RELATIVE_PATH = Path("data/processed/kuairec_long_term_estimator_comparison_with_aipw.csv")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / FINAL_COMPARISON_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {FINAL_COMPARISON_RELATIVE_PATH}. Run Notebooks 04-06 first or run this notebook inside the project."
    )

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
WRITEUP_DIR = PROJECT_ROOT / "notebooks" / "projects" / "project_3_long_term_causal_effects" / "writeup"
FIGURES_DIR = WRITEUP_DIR / "figures"
TABLES_DIR = WRITEUP_DIR / "tables"

FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Processed data folder: {PROCESSED_DIR}")
print(f"Writeup folder: {WRITEUP_DIR}")
print(f"Figures folder: {FIGURES_DIR}")
print(f"Tables folder: {TABLES_DIR}")

Project root: /home/apex/Documents/ranking_sys
Processed data folder: /home/apex/Documents/ranking_sys/data/processed
Writeup folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup
Figures folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures
Tables folder: /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/tables

The writeup folder structure is now ready. The rest of the notebook will save final tables and figures into this project-specific report location.
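The upward search used to locate the project root can be wrapped in a small helper, which makes the behavior easy to check outside the notebook. This is a minimal sketch rather than the project's code: `find_project_root` and the marker file in the demo are hypothetical names introduced for illustration.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: Path) -> Path:
    """Return the first directory at or above `start` that contains `marker`."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains {marker}")


# Demo on a throwaway directory tree (all paths here are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    marker = Path("data/processed/marker.csv")
    (root / marker).parent.mkdir(parents=True)
    (root / marker).write_text("placeholder\n")

    # Start the search two levels below the root, as a notebook folder would be.
    nested = root / "notebooks" / "some_project"
    nested.mkdir(parents=True)

    found_root = find_project_root(nested, marker)
    root_matches = found_root == root
```

Raising inside the helper keeps the failure message next to the search logic, so the notebook cell only has to call it once.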

Load Final Project Artifacts

The final report combines artifacts from all modeling notebooks. This cell loads the estimator comparison, weight diagnostics, balance diagnostics, sensitivity outputs, secondary outcome outputs, heterogeneity tables, and effect moderator importances.

def read_processed_csv(filename):
    path = PROCESSED_DIR / filename
    if not path.exists():
        raise FileNotFoundError(f"Missing required artifact: {path}")
    return pd.read_csv(path)

estimator_comparison = read_processed_csv("kuairec_long_term_estimator_comparison_with_aipw.csv")
weight_readiness = read_processed_csv("kuairec_long_term_weight_readiness_checks.csv")
balance_diagnostics = read_processed_csv("kuairec_long_term_balance_diagnostics.csv")
weight_diagnostics = read_processed_csv("kuairec_long_term_weight_diagnostics.csv")
msm_weight_sensitivity = read_processed_csv("kuairec_long_term_msm_weight_sensitivity.csv")
msm_secondary = read_processed_csv("kuairec_long_term_msm_secondary_outcomes.csv")
gcomp_secondary = read_processed_csv("kuairec_long_term_gcomp_secondary_outcomes.csv")
aipw_segment_effects = read_processed_csv("kuairec_long_term_aipw_segment_effects.csv")
effect_importance = read_processed_csv("kuairec_long_term_effect_moderator_importance.csv")
smoothed_effect_buckets = read_processed_csv("kuairec_long_term_smoothed_effect_buckets.csv")
propensity_metrics = read_processed_csv("kuairec_long_term_propensity_model_metrics.csv")
gcomp_metrics = read_processed_csv("kuairec_long_term_gcomp_model_metrics.csv")
aipw_results = read_processed_csv("kuairec_long_term_aipw_results.csv")

artifact_shapes = pd.DataFrame(
    [
        {"artifact": "estimator_comparison", "rows": len(estimator_comparison), "columns": estimator_comparison.shape[1]},
        {"artifact": "weight_readiness", "rows": len(weight_readiness), "columns": weight_readiness.shape[1]},
        {"artifact": "balance_diagnostics", "rows": len(balance_diagnostics), "columns": balance_diagnostics.shape[1]},
        {"artifact": "msm_weight_sensitivity", "rows": len(msm_weight_sensitivity), "columns": msm_weight_sensitivity.shape[1]},
        {"artifact": "msm_secondary", "rows": len(msm_secondary), "columns": msm_secondary.shape[1]},
        {"artifact": "gcomp_secondary", "rows": len(gcomp_secondary), "columns": gcomp_secondary.shape[1]},
        {"artifact": "aipw_segment_effects", "rows": len(aipw_segment_effects), "columns": aipw_segment_effects.shape[1]},
        {"artifact": "effect_importance", "rows": len(effect_importance), "columns": effect_importance.shape[1]},
    ]
)

display(artifact_shapes)

	artifact	rows	columns
0	estimator_comparison	4	7
1	weight_readiness	5	3
2	balance_diagnostics	12	10
3	msm_weight_sensitivity	7	16
4	msm_secondary	4	14
5	gcomp_secondary	4	9
6	aipw_segment_effects	23	16
7	effect_importance	19	2

All final artifacts loaded successfully. The report can now synthesize the project instead of recomputing models.
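The loader confirms that each file exists, but later cells also select columns by name. A lightweight schema check right after loading would surface a stale artifact early. The sketch below assumes a hypothetical helper named `require_columns`; the toy frame and its values are made up for the demonstration.

```python
import pandas as pd


def require_columns(frame: pd.DataFrame, required: list[str], name: str) -> None:
    """Raise a clear error if `frame` lacks any column that later cells select."""
    missing = [column for column in required if column not in frame.columns]
    if missing:
        raise KeyError(f"{name} is missing expected columns: {missing}")


# Toy frame standing in for a loaded artifact (values are invented for the demo).
toy_comparison = pd.DataFrame(
    {"method": ["MSM"], "estimate": [0.0], "ci_95_lower": [-1.0], "ci_95_upper": [1.0]}
)

# Passes silently: every requested column is present.
require_columns(toy_comparison, ["estimate", "ci_95_lower", "ci_95_upper"], "toy_comparison")

# Fails loudly: "source" is not in the toy frame.
try:
    require_columns(toy_comparison, ["estimate", "source"], "toy_comparison")
    missing_detected = False
except KeyError:
    missing_detected = True
```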

Create the Final Estimator Comparison Table

The main project result is a cross-estimator comparison. This table includes MSM, g-computation, and doubly robust AIPW estimates on the same outcome scale: future 7-day interactions.

def clean_method_label(row):
    if row["method"].startswith("MSM"):
        return "MSM, weighted"
    if row["method"] == "G-computation" and row["model"] == "lightgbm":
        return "G-computation, LightGBM"
    if row["method"] == "G-computation" and row["model"] == "linear_ridge":
        return "G-computation, linear"
    if row["method"].startswith("Doubly robust"):
        return "AIPW, doubly robust"
    return f"{row['method']}, {row['model']}"

final_estimator_table = estimator_comparison.copy()
final_estimator_table["estimator"] = final_estimator_table.apply(clean_method_label, axis=1)
final_estimator_table["relative_lift_pct"] = 100 * final_estimator_table["relative_lift_vs_msm_control_mean"]
final_estimator_table = final_estimator_table[
    [
        "estimator",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "relative_lift_pct",
        "source",
    ]
].copy()

final_estimator_table["conclusion"] = np.where(
    (final_estimator_table["ci_95_lower"] <= 0) & (final_estimator_table["ci_95_upper"] >= 0),
    "interval crosses zero",
    "interval excludes zero",
)

final_estimator_table.to_csv(TABLES_DIR / "final_estimator_comparison.csv", index=False)
display(final_estimator_table)

	estimator	estimate	ci_95_lower	ci_95_upper	relative_lift_pct	source	conclusion
0	MSM, weighted	-2.6877	-12.2706	5.7291	-0.7164	Notebook 04 user-cluster bootstrap	interval crosses zero
1	G-computation, LightGBM	0.0359	-0.6384	1.8656	0.0096	Notebook 05 user-cluster bootstrap	interval crosses zero
2	G-computation, linear	2.3288	-6.3238	9.3565	0.6208	Notebook 05 user-cluster bootstrap	interval crosses zero
3	AIPW, doubly robust	1.0940	-7.3428	12.0178	0.2916	Notebook 06 user-cluster bootstrap over AIPW s...	interval crosses zero

The final comparison table shows the central project conclusion: every main estimator has an interval that crosses zero. The average effect is small and uncertain across weighting, outcome modeling, and doubly robust approaches.
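The relative lift column is easier to interpret once the baseline is explicit. Assuming, as the column name `relative_lift_vs_msm_control_mean` suggests, that lift is the estimate divided by the MSM control mean, the baseline can be backed out from the displayed values. The LightGBM row is left out because its displayed lift has too few significant digits for a stable ratio.

```python
# (estimate, relative_lift_pct) pairs copied from the displayed table.
rows = {
    "MSM, weighted": (-2.6877, -0.7164),
    "G-computation, linear": (2.3288, 0.6208),
    "AIPW, doubly robust": (1.0940, 0.2916),
}

# Assumed relationship: relative_lift_pct = 100 * estimate / control_mean.
implied_control_mean = {
    name: 100 * estimate / lift_pct for name, (estimate, lift_pct) in rows.items()
}

for name, value in implied_control_mean.items():
    print(f"{name}: implied control mean of about {value:.0f} interactions")
```

Under that assumption, every row implies a baseline of roughly 375 future 7-day interactions, so even the largest point estimate is well under 1% of the baseline.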

Figure 1: Final Estimator Comparison

This figure is the most important visual artifact for the project. It shows the estimated effect of high-watch exposure on future 7-day interactions across all main estimators.

plot_df = final_estimator_table.copy()
plot_df["lower_error"] = plot_df["estimate"] - plot_df["ci_95_lower"]
plot_df["upper_error"] = plot_df["ci_95_upper"] - plot_df["estimate"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(plot_df)),
    y=plot_df["estimate"],
    yerr=[plot_df["lower_error"], plot_df["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(plot_df)))
ax.set_xticklabels(plot_df["estimator"], rotation=20, ha="right")
ax.set_title("Long-Term Effect Estimates Across Causal Strategies")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Estimator")
plt.tight_layout()
figure_path = FIGURES_DIR / "01_estimator_comparison.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/01_estimator_comparison.png

The figure makes the cross-method conclusion visually clear. The estimates cluster near zero and the uncertainty intervals cross zero, so the final report should not claim a reliable average long-term lift.
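One detail of the plotting cell is worth spelling out: `Axes.errorbar` expects `yerr` as distances from the point estimate, not as interval endpoints, with the lower distances in the first row and the upper distances in the second. Percentile bootstrap intervals are not guaranteed to contain the point estimate, and recent Matplotlib versions reject negative distances. A small guard, sketched here under the hypothetical name `asymmetric_yerr`, makes that failure explicit.

```python
import numpy as np


def asymmetric_yerr(estimate, lower, upper):
    """Build the (2, N) array of non-negative distances that `Axes.errorbar` expects."""
    estimate = np.asarray(estimate, dtype=float)
    yerr = np.vstack(
        [
            estimate - np.asarray(lower, dtype=float),  # distance down to the lower bound
            np.asarray(upper, dtype=float) - estimate,  # distance up to the upper bound
        ]
    )
    if (yerr < 0).any():
        raise ValueError("A point estimate lies outside its interval; check the CI columns.")
    return yerr


# Values copied from the final estimator table.
estimates = [-2.6877, 0.0359, 2.3288, 1.0940]
ci_lower = [-12.2706, -0.6384, -6.3238, -7.3428]
ci_upper = [5.7291, 1.8656, 9.3565, 12.0178]

yerr = asymmetric_yerr(estimates, ci_lower, ci_upper)
```

The result can be passed directly as `yerr=yerr` in place of the two-element list.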

Weight and Overlap Readiness Summary

The treatment weighting notebook showed that observed balance improved after weighting, but overlap was not perfect. This table summarizes the readiness checks that matter for interpreting the MSM result.

weight_readiness_report = weight_readiness.copy()
weight_readiness_report["status"] = np.where(weight_readiness_report["passes"], "passes", "caution")
weight_readiness_report.to_csv(TABLES_DIR / "weight_readiness_checks.csv", index=False)

display(weight_readiness_report)

display(weight_diagnostics)

	check	value	passes	status
0	no propensity below 0.05 or above 0.95 for mos...	0.0664	False	caution
1	analysis weight max below clipping cap	10.0000	True	passes
2	effective sample size at least 70% of rows	0.5110	False	caution
3	mean absolute SMD improves after weighting	0.3311	True	passes
4	maximum absolute SMD improves after weighting	1.1915	True	passes

	weight	mean	std	min	p50	p90	p95	p99	max	effective_sample_size	ess_share_of_rows
0	ipw_logistic	2.0767	3.4301	1.0028	1.3945	3.2625	4.7814	11.4275	119.9903	1,260.3721	0.2683
1	sw_logistic	1.0384	1.7153	0.5018	0.6973	1.6307	2.3921	5.7092	59.9441	1,260.0838	0.2682
2	ipw_lightgbm	2.0541	2.2425	1.0216	1.3874	3.3732	5.1568	11.7539	45.0967	2,143.6149	0.4563
3	sw_lightgbm	1.0270	1.1213	0.5104	0.6935	1.6872	2.5799	5.8723	22.5676	2,143.4028	0.4562

The readiness table sets the caution language for the final report. Weighting improved observed balance and the analysis weights respect the clipping cap, but two checks fail: the extreme-propensity check and the effective-sample-size check (51% of rows against a 70% target). That limits how strongly the MSM result should be interpreted.
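The `effective_sample_size` values in the diagnostics table appear consistent with the Kish approximation, (sum of weights)^2 / (sum of squared weights). Whether Notebook 03 used exactly this formula is an assumption; the sketch below shows the calculation and checks it against the summary statistics displayed for `ipw_logistic`.

```python
import numpy as np


def kish_effective_sample_size(weights) -> float:
    """Kish approximation: (sum of weights)^2 / (sum of squared weights)."""
    weights = np.asarray(weights, dtype=float)
    return float(weights.sum() ** 2 / np.square(weights).sum())


# Equal weights lose nothing: ESS equals the row count.
ess_equal = kish_effective_sample_size(np.ones(100))

# One dominant weight shrinks ESS sharply (toy numbers, not project data).
ess_skewed = kish_effective_sample_size([1.0] * 99 + [50.0])

# Equivalent form n / (1 + cv^2), using the displayed ipw_logistic summary
# (mean 2.0767, std 3.4301) and the 4,698 rows of the unweighted analysis.
n_rows, mean_w, std_w = 4698, 2.0767, 3.4301
ess_from_summary = n_rows / (1 + (std_w / mean_w) ** 2)
```

The summary-based value lands close to the reported 1,260 for `ipw_logistic`, which is why a handful of very large weights costs so much precision.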

Figure 2: Covariate Balance Before and After Weighting

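This figure shows whether the stabilized weights reduced observed treated-control imbalance on pre-treatment histories. The dashed vertical line marks an absolute SMD of 0.1, a common rule-of-thumb threshold for adequate balance.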

balance_plot = balance_diagnostics[
    ["covariate", "abs_smd_unweighted", "abs_smd_weighted"]
].melt(
    id_vars="covariate",
    var_name="balance_type",
    value_name="absolute_smd",
)
balance_plot["balance_type"] = balance_plot["balance_type"].map(
    {
        "abs_smd_unweighted": "Before weighting",
        "abs_smd_weighted": "After weighting",
    }
)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=balance_plot, x="absolute_smd", y="covariate", hue="balance_type", ax=ax)
ax.axvline(0.1, color="darkred", linestyle="--", linewidth=1)
ax.set_title("Observed Covariate Balance Before and After Weighting")
ax.set_xlabel("Absolute standardized mean difference")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
figure_path = FIGURES_DIR / "02_weight_balance.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/02_weight_balance.png

The balance figure shows that the weighting step did what it was supposed to do on observed covariates. This supports the MSM analysis, while the overlap diagnostics still justify caution.
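For readers who want to see what a weighted standardized mean difference involves, here is a minimal sketch. It is not the Notebook 03 implementation: `weighted_smd` is a hypothetical helper, and scaling by the unweighted pooled standard deviation (so that before and after values share one scale) is one common convention assumed here.

```python
import numpy as np


def weighted_smd(x, treated, weights=None) -> float:
    """Difference in (weighted) group means divided by the unweighted pooled SD."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)

    # Numerator: difference in weighted group means.
    mean_t = np.average(x[treated], weights=w[treated])
    mean_c = np.average(x[~treated], weights=w[~treated])

    # Denominator: unweighted pooled SD (an assumed convention, see lead-in).
    pooled_sd = np.sqrt((x[treated].var() + x[~treated].var()) / 2)
    return float((mean_t - mean_c) / pooled_sd)


# Toy covariate where treated rows sit higher (not project data).
x = [0.0, 1.0, 2.0, 3.0]
treated = [False, False, True, True]
smd_unweighted = weighted_smd(x, treated)

# Weights that pull each group toward the overlapping values 1 and 2.
smd_weighted = weighted_smd(x, treated, weights=[1.0, 9.0, 9.0, 1.0])
```

In this toy case the weights shrink the imbalance but do not remove it, which mirrors the reading of the figure: improvement on observed covariates, not proof of comparability.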

Weight Sensitivity Summary

This table shows how the MSM estimate changed across unweighted, unclipped, clipped, logistic-weighted, and LightGBM-weighted specifications.

weight_sensitivity_report = msm_weight_sensitivity[
    [
        "model",
        "weight_column",
        "treatment_effect",
        "ci_95_lower",
        "ci_95_upper",
        "relative_lift_vs_control",
        "effective_sample_size",
        "ess_share_of_rows",
    ]
].copy()
weight_sensitivity_report["relative_lift_pct"] = 100 * weight_sensitivity_report["relative_lift_vs_control"]
weight_sensitivity_report.to_csv(TABLES_DIR / "msm_weight_sensitivity.csv", index=False)

display(weight_sensitivity_report)

	model	weight_column	treatment_effect	ci_95_lower	ci_95_upper	relative_lift_vs_control	effective_sample_size	ess_share_of_rows	relative_lift_pct
0	unweighted	none	-2.6193	-11.5474	6.3088	-0.0069	4,698.0000	1.0000	-0.6919
1	logistic_stabilized_unclipped	sw_logistic	1.5616	-14.7943	17.9175	0.0042	1,260.0838	0.2682	0.4157
2	logistic_stabilized_clip_10	analysis_weight	-2.6877	-11.8189	6.4435	-0.0072	2,400.8979	0.5110	-0.7164
3	logistic_stabilized_clip_5	sw_logistic_clip_5	-2.9769	-11.1777	5.2239	-0.0079	2,932.2499	0.6241	-0.7936
4	logistic_stabilized_clip_2	sw_logistic_clip_2	-2.2273	-9.8650	5.4105	-0.0059	3,797.3068	0.8083	-0.5918
5	lightgbm_stabilized_unclipped	sw_lightgbm	0.9744	-8.5678	10.5166	0.0026	2,143.4028	0.4562	0.2564
6	lightgbm_stabilized_clip_10	sw_lightgbm_clip_10	-0.4889	-9.4432	8.4655	-0.0013	2,351.8214	0.5006	-0.1283

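The sensitivity table shows that the estimated effect remains small across weight choices: point estimates range from -2.98 to 1.56 interactions, each under 1% of the control mean, and every interval includes zero. Precision varies with clipping and effective sample size; the unclipped logistic weights give the widest interval and the lowest effective sample size (27% of rows). This supports a cautious near-zero average-effect conclusion.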
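The trade-off behind the clipped rows can be illustrated with toy numbers. The sketch assumes the usual stabilized-weight form (marginal treatment probability over the conditional probability of the treatment actually received) and upper clipping only; the exact construction lives in Notebook 03, and the helper names here are hypothetical.

```python
import numpy as np


def stabilized_weights(treated, propensity):
    """Marginal treatment probability divided by the conditional one, per row."""
    treated = np.asarray(treated, dtype=bool)
    propensity = np.asarray(propensity, dtype=float)
    p_treated = treated.mean()
    return np.where(treated, p_treated / propensity, (1 - p_treated) / (1 - propensity))


def kish_ess(weights) -> float:
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    weights = np.asarray(weights, dtype=float)
    return float(weights.sum() ** 2 / np.square(weights).sum())


# Toy data: one treated row has a very low propensity (not project data).
treated = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
propensity = np.array([0.02, 0.60, 0.40, 0.50, 0.55, 0.45])

sw = stabilized_weights(treated, propensity)
sw_clipped = np.clip(sw, None, 10.0)  # cap at 10, mirroring the clip_10 label

ess_unclipped = kish_ess(sw)
ess_clipped = kish_ess(sw_clipped)
```

Clipping raises the effective sample size by taming the extreme row, at the cost of no longer fully reweighting that row, which is the bias-for-variance exchange the sensitivity table probes.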

Figure 3: MSM Weight Sensitivity

This plot visualizes the MSM treatment effect across weight specifications. It is a useful final-report figure because it shows the result is not driven by one arbitrary clipping choice.

weight_plot = weight_sensitivity_report.copy()
weight_plot["lower_error"] = weight_plot["treatment_effect"] - weight_plot["ci_95_lower"]
weight_plot["upper_error"] = weight_plot["ci_95_upper"] - weight_plot["treatment_effect"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(weight_plot)),
    y=weight_plot["treatment_effect"],
    yerr=[weight_plot["lower_error"], weight_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=3,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(weight_plot)))
ax.set_xticklabels(weight_plot["model"], rotation=30, ha="right")
ax.set_title("MSM Sensitivity to Weight Specification")
ax.set_ylabel("Effect on future 7-day interactions")
ax.set_xlabel("Weight specification")
plt.tight_layout()
figure_path = FIGURES_DIR / "03_weight_sensitivity.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/03_weight_sensitivity.png

The sensitivity plot reinforces the main conclusion: the point estimates move somewhat and change sign across specifications, but they remain small relative to the outcome scale and every interval includes zero.

Secondary Outcome Summary

The primary outcome is future 7-day interaction volume. The project also checked active days, average daily interactions, watch hours, and log interaction volume. This table combines MSM and g-computation secondary outcome estimates.

msm_secondary_report = msm_secondary[
    ["outcome", "treatment_effect", "ci_95_lower", "ci_95_upper", "relative_lift_vs_control"]
].copy()
msm_secondary_report["estimator"] = "MSM"
msm_secondary_report = msm_secondary_report.rename(columns={"treatment_effect": "estimate", "relative_lift_vs_control": "relative_lift"})

gcomp_secondary_report = gcomp_secondary[
    ["outcome", "ate", "relative_lift_vs_y0"]
].copy()
gcomp_secondary_report["estimator"] = "G-computation, LightGBM"
gcomp_secondary_report = gcomp_secondary_report.rename(columns={"ate": "estimate", "relative_lift_vs_y0": "relative_lift"})
gcomp_secondary_report["ci_95_lower"] = np.nan
gcomp_secondary_report["ci_95_upper"] = np.nan

secondary_report = pd.concat(
    [msm_secondary_report, gcomp_secondary_report],
    ignore_index=True,
    sort=False,
)
secondary_report["relative_lift_pct"] = 100 * secondary_report["relative_lift"]
secondary_report.to_csv(TABLES_DIR / "secondary_outcome_summary.csv", index=False)

display(secondary_report)

	outcome	estimate	ci_95_lower	ci_95_upper	relative_lift	estimator	relative_lift_pct
0	future_7day_active_days	-0.0299	-0.0677	0.0079	-0.0044	MSM	-0.4351
1	future_7day_avg_daily_interactions	-0.3840	-1.6884	0.9205	-0.0072	MSM	-0.7164
2	future_7day_play_hours	0.0598	0.0210	0.0986	0.0689	MSM	6.8882
3	outcome_log1p	-0.0120	-0.0414	0.0175	-0.0021	MSM	-0.2056
4	future_7day_active_days	-0.0176	NaN	NaN	-0.0026	G-computation, LightGBM	-0.2569
5	future_7day_avg_daily_interactions	0.0051	NaN	NaN	0.0001	G-computation, LightGBM	0.0094
6	future_7day_play_hours	0.0396	NaN	NaN	0.0441	G-computation, LightGBM	4.4138
7	outcome_log1p	0.0003	NaN	NaN	0.0001	G-computation, LightGBM	0.0056

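The secondary outcomes add nuance. Interaction-volume outcomes show no clear average effect, while watch hours look different: the MSM estimate is 0.0598 hours with a 95% interval of [0.0210, 0.0986], about a 6.9% relative lift, and the g-computation point estimate is also positive (0.0396, no interval reported). The final report should distinguish these outcome families rather than collapse them into one engagement story.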

Figure 4: Secondary Outcome Effects

This plot focuses on MSM secondary outcomes because they include uncertainty intervals. It shows whether the treatment appears different for retention-style, interaction-volume, and watch-time metrics.

secondary_msm_plot = msm_secondary_report.copy()
secondary_msm_plot["lower_error"] = secondary_msm_plot["estimate"] - secondary_msm_plot["ci_95_lower"]
secondary_msm_plot["upper_error"] = secondary_msm_plot["ci_95_upper"] - secondary_msm_plot["estimate"]

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.errorbar(
    x=np.arange(len(secondary_msm_plot)),
    y=secondary_msm_plot["estimate"],
    yerr=[secondary_msm_plot["lower_error"], secondary_msm_plot["upper_error"]],
    fmt="o",
    color="#5C946E",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(secondary_msm_plot)))
ax.set_xticklabels(secondary_msm_plot["outcome"], rotation=25, ha="right")
ax.set_title("MSM Effects on Secondary Long-Term Outcomes")
ax.set_ylabel("Effect in outcome units")
ax.set_xlabel("Outcome")
plt.tight_layout()
figure_path = FIGURES_DIR / "04_secondary_outcomes.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/04_secondary_outcomes.png

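The secondary-outcome plot shows why the final conclusion should be precise. There is no clear interaction-volume gain, but watch-time-oriented outcomes may move differently and deserve separate validation. Because the four outcomes are measured in different units on a shared axis, read the plot for sign and for whether each interval includes zero, not for relative magnitude.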

Heterogeneous Effects Summary

Notebook 06 explored whether the near-zero average effect hides segment-level variation. This cell creates a final table of the most positive and most negative segment estimates that pass the size filter.

eligible_segments = aipw_segment_effects.query("passes_size_filter == True").copy()
eligible_segments["interval_crosses_zero"] = (eligible_segments["ci_95_lower"] <= 0) & (eligible_segments["ci_95_upper"] >= 0)

most_positive_segments = eligible_segments.sort_values("aipw_ate", ascending=False).head(8)
most_negative_segments = eligible_segments.sort_values("aipw_ate", ascending=True).head(8)
final_segment_report = pd.concat(
    [
        most_positive_segments.assign(direction="most_positive"),
        most_negative_segments.assign(direction="most_negative"),
    ],
    ignore_index=True,
)

final_segment_report.to_csv(TABLES_DIR / "heterogeneous_segment_effects.csv", index=False)
display(final_segment_report)

	segment_type	segment	rows	users	treatment_rate	observed_outcome_mean	aipw_ate	aipw_score_std	gcomp_component	mean_propensity	passes_size_filter	bootstrap_reps	bootstrap_mean	bootstrap_std	ci_95_lower	ci_95_upper	interval_crosses_zero	direction
0	prior_interactions_bucket	(217.0, 581.0]	1169	91	0.4688	445.8435	25.6170	548.5773	0.1569	0.4787	True	250	25.7353	17.6855	1.8847	66.3297	False	most_positive
1	day_of_week	Tuesday	623	91	0.4912	390.4575	20.0321	750.2361	0.0096	0.4917	True	250	20.4314	32.7418	-24.3380	90.8593	True	most_positive
2	lag_interactions_bucket	(75.0, 375.0]	1146	91	0.4852	448.0070	16.2174	575.5850	0.0829	0.4860	True	250	16.0696	18.4124	-9.9926	57.4826	True	most_positive
3	prior_watch_ratio_bucket	(-0.001, 2.26]	1175	75	0.2119	371.7200	12.7319	549.7271	0.1169	0.2241	True	250	12.4139	17.3878	-9.3683	52.2110	True	most_positive
4	prior_high_watch_bucket	(-0.001, 1.127]	1175	73	0.1677	368.0732	12.5337	549.1163	0.0551	0.1685	True	250	11.9216	17.3031	-11.3159	52.9259	True	most_positive
5	prior_watch_ratio_bucket	(2.26, 2.59]	1174	89	0.4421	376.9199	4.2636	157.1119	0.1366	0.4355	True	250	3.9070	5.0899	-5.4912	13.3243	True	most_positive
6	prior_high_watch_bucket	(1.127, 1.414]	1174	84	0.3424	386.3526	2.4564	154.8324	0.1259	0.3521	True	250	2.5918	4.6737	-6.3520	12.5274	True	most_positive
7	day_of_week	Wednesday	711	91	0.4684	383.1899	1.8366	168.8665	0.0151	0.4673	True	250	2.1471	7.8816	-12.4635	18.6481	True	most_positive
8	prior_watch_ratio_bucket	(3.004, 11.372]	1175	73	0.6894	388.3106	-13.2926	268.3096	-0.0665	0.7063	True	250	-13.5921	7.0527	-27.1569	-0.8128	False	most_negative
9	prior_interactions_bucket	(166.0, 217.0]	1164	91	0.5198	420.7345	-12.4330	205.3765	0.0244	0.5120	True	250	-12.2267	6.6019	-26.7004	0.3019	True	most_negative
10	prior_high_watch_bucket	(1.708, 2.735]	1175	65	0.8647	389.3123	-9.1134	276.4264	-0.0124	0.8668	True	250	-9.4140	9.0052	-27.9388	7.2418	True	most_negative
11	prior_interactions_bucket	(123.0, 166.0]	1182	91	0.5178	386.9205	-8.0145	233.7360	0.0062	0.5061	True	250	-8.5057	6.1431	-20.4152	2.5367	True	most_negative
12	day_of_week	Monday	625	91	0.4800	393.7184	-6.4503	170.7878	0.0344	0.4809	True	250	-6.1623	5.8209	-18.4649	4.0985	True	most_negative
13	lag_interactions_bucket	(53.0, 75.0]	1192	91	0.5159	409.1242	-5.6043	187.1179	0.1011	0.5115	True	250	-5.6611	5.3270	-15.8618	5.4620	True	most_negative
14	lag_interactions_bucket	(-0.001, 36.0]	1202	91	0.4917	290.3095	-4.3047	166.3770	-0.0194	0.5013	True	250	-4.3555	3.9523	-11.8921	3.5914	True	most_negative
15	day_of_week	Sunday	621	91	0.5523	400.9678	-3.4574	201.6585	0.0243	0.5517	True	250	-4.0013	7.8748	-20.5009	9.5415	True	most_negative

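The segment table is best read as hypothesis generation. Only 2 of the 16 listed intervals exclude zero (prior_interactions_bucket (217.0, 581.0], positive, and prior_watch_ratio_bucket (3.004, 11.372], negative), and with this many segments examined a few exclusions can arise by chance. These segments are candidates for future experiment design rather than deployment rules.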
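A quick back-of-the-envelope calculation shows why isolated segment intervals that exclude zero deserve caution. It uses the 23 rows of `aipw_segment_effects` as the number of segments examined and treats segment estimates as independent, which is a simplifying assumption (segments built from the same user-days overlap).

```python
alpha = 0.05
n_segments = 23  # row count of aipw_segment_effects reported earlier

# Expected number of 95% intervals that exclude zero if every true effect were zero.
expected_false_exclusions = n_segments * alpha

# Chance of at least one such exclusion, under the simplifying assumption
# that segment estimates are independent (overlapping segments violate this).
prob_at_least_one = 1 - (1 - alpha) ** n_segments

print(f"Expected exclusions under the null: {expected_false_exclusions:.2f}")
print(f"P(at least one exclusion): {prob_at_least_one:.2f}")
```

Under those assumptions, roughly one spurious exclusion is expected and the chance of seeing at least one is about 0.69, so a couple of intervals that exclude zero are weak evidence on their own.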

Figure 5: Prior-Interaction Segment Effects

This figure summarizes AIPW effects across prior 3-day interaction buckets. It asks whether recent engagement level moderates the effect of high-watch exposure.

prior_interaction_plot = eligible_segments.query("segment_type == 'prior_interactions_bucket'").copy()
prior_interaction_plot = prior_interaction_plot.sort_values("segment")
prior_interaction_plot["lower_error"] = prior_interaction_plot["aipw_ate"] - prior_interaction_plot["ci_95_lower"]
prior_interaction_plot["upper_error"] = prior_interaction_plot["ci_95_upper"] - prior_interaction_plot["aipw_ate"]

fig, ax = plt.subplots(figsize=(9, 5))
ax.errorbar(
    x=np.arange(len(prior_interaction_plot)),
    y=prior_interaction_plot["aipw_ate"],
    yerr=[prior_interaction_plot["lower_error"], prior_interaction_plot["upper_error"]],
    fmt="o",
    color="#2A6F97",
    ecolor="black",
    capsize=4,
)
ax.axhline(0, color="black", linewidth=1)
ax.set_xticks(np.arange(len(prior_interaction_plot)))
ax.set_xticklabels(prior_interaction_plot["segment"], rotation=20, ha="right")
ax.set_title("AIPW Effects by Prior Interaction Bucket")
ax.set_xlabel("Prior 3-day interaction bucket")
ax.set_ylabel("Effect on future 7-day interactions")
plt.tight_layout()
figure_path = FIGURES_DIR / "05_prior_interaction_segments.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/05_prior_interaction_segments.png

This plot is a useful product diagnostic. It suggests where the effect may differ by recent activity, but wide intervals mean segment findings should be positioned as experimentation hypotheses.
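One caveat on bucket ordering: after a CSV round trip the `segment` labels are plain strings, so sorting them orders the buckets lexically. That happens to work for some bucket edges but not in general. A sketch of a more robust ordering, which parses the left edge of each label, is shown below; the last two labels are hypothetical and were added only to expose the problem.

```python
import pandas as pd

# Bucket labels as plain strings. The first three appear in the segment table;
# the last two are hypothetical, added to expose the ordering problem.
segments = pd.DataFrame(
    {
        "segment": [
            "(217.0, 581.0]",
            "(123.0, 166.0]",
            "(166.0, 217.0]",
            "(1000.0, 2000.0]",
            "(90.5, 123.0]",
        ]
    }
)

# Parse the left edge of each interval and sort on that number.
segments["left_edge"] = (
    segments["segment"]
    .str.extract(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,", expand=False)
    .astype(float)
)
numeric_order = segments.sort_values("left_edge")["segment"].tolist()

# Plain string sorting puts "(1000.0, ..." first and "(90.5, ..." last.
lexical_order = sorted(segments["segment"])
```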

Figure 6: Effect Moderator Importance

The effect-smoothing model from Notebook 06 predicts noisy AIPW scores from pre-treatment histories. Feature importance helps identify which histories are most useful for summarizing heterogeneity.

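# Select the 12 largest importances explicitly instead of assuming the CSV is pre-sorted.
# seaborn draws rows top to bottom in this order, so the strongest moderator appears first.
importance_plot = effect_importance.nlargest(12, "importance")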

fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=importance_plot, x="importance", y="feature", ax=ax, color="#2A6F97")
ax.set_title("Top Moderators in the Doubly Robust Effect Model")
ax.set_xlabel("LightGBM feature importance")
ax.set_ylabel("Pre-treatment feature")
plt.tight_layout()
figure_path = FIGURES_DIR / "06_effect_moderator_importance.png"
plt.savefig(figure_path, dpi=160, bbox_inches="tight")
plt.show()

print(f"Saved {figure_path}")

effect_importance.to_csv(TABLES_DIR / "effect_moderator_importance.csv", index=False)

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/figures/06_effect_moderator_importance.png

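The moderator importance plot points to recent watch quality and recent activity as the features the smoothing model relies on most to explain effect heterogeneity. Feature importance is predictive rather than causal, but these variables are natural stratification candidates for any follow-up online test.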

Final Limitations Table

A strong causal portfolio project should be explicit about what the analysis cannot prove. This cell writes the final limitations table for the project report.

limitations = pd.DataFrame(
    [
        {
            "limitation": "Observational logging",
            "why_it_matters": "High-watch exposure was not randomized, so causal validity depends on observed adjustment rather than experimental assignment.",
            "mitigation_in_project": "Used sequential history covariates, propensity weights, MSM, g-computation, and AIPW triangulation.",
        },
        {
            "limitation": "Sequential ignorability assumption",
            "why_it_matters": "Unobserved user intent or recommender state may affect both treatment and future engagement.",
            "mitigation_in_project": "Adjusted for lagged activity, prior 3-day engagement, watch behavior, and calendar context; documented remaining risk.",
        },
        {
            "limitation": "Imperfect overlap",
            "why_it_matters": "Some histories were much more likely to receive treatment, making weighted estimates less precise.",
            "mitigation_in_project": "Diagnosed positivity, clipped weights, reported effective sample size, and ran weight sensitivity checks.",
        },
        {
            "limitation": "Constructed treatment",
            "why_it_matters": "High-watch exposure is derived from observed consumption, not a product intervention directly assigned by the platform.",
            "mitigation_in_project": "Defined the estimand explicitly and framed conclusions as offline evidence for future experiment design.",
        },
        {
            "limitation": "Dataset slice",
            "why_it_matters": "The analysis uses a KuaiRec sample/panel derived for notebook runtime, so estimates may differ on the full platform log.",
            "mitigation_in_project": "Saved a reproducible panel and designed the workflow so the sample size can be expanded later.",
        },
        {
            "limitation": "Outcome choice",
            "why_it_matters": "Future interactions, active days, and watch hours capture different long-term product objectives.",
            "mitigation_in_project": "Reported primary and secondary outcomes separately instead of forcing one engagement narrative.",
        },
    ]
)

limitations.to_csv(TABLES_DIR / "limitations.csv", index=False)
display(limitations)

	limitation	why_it_matters	mitigation_in_project
0	Observational logging	High-watch exposure was not randomized, so cau...	Used sequential history covariates, propensity...
1	Sequential ignorability assumption	Unobserved user intent or recommender state ma...	Adjusted for lagged activity, prior 3-day enga...
2	Imperfect overlap	Some histories were much more likely to receiv...	Diagnosed positivity, clipped weights, reporte...
3	Constructed treatment	High-watch exposure is derived from observed c...	Defined the estimand explicitly and framed con...
4	Dataset slice	The analysis uses a KuaiRec sample/panel deriv...	Saved a reproducible panel and designed the wo...
5	Outcome choice	Future interactions, active days, and watch ho...	Reported primary and secondary outcomes separa...

The limitations table keeps the final conclusion honest. The project provides a careful offline causal analysis, but the right next step in production would still be an online experiment or a logging policy that randomizes exposure.

Final Recommendation Table

This table translates the analysis into a product-facing conclusion. It states what the evidence supports, what it does not support, and what a next experiment should test.

# Look up the AIPW row once and read the three reported values from it.
aipw_row = final_estimator_table.loc[final_estimator_table["estimator"] == "AIPW, doubly robust"].iloc[0]
primary_aipw_estimate = aipw_row["estimate"]
primary_aipw_low = aipw_row["ci_95_lower"]
primary_aipw_high = aipw_row["ci_95_upper"]

final_recommendation = pd.DataFrame(
    [
        {
            "decision_area": "Average long-term interaction effect",
            "recommendation": "Do not claim a clear positive average effect from high-watch-exposure days on future 7-day interactions.",
            "evidence": f"MSM, g-computation, and AIPW estimates are small; primary AIPW is {primary_aipw_estimate:.2f} with CI [{primary_aipw_low:.2f}, {primary_aipw_high:.2f}].",
        },
        {
            "decision_area": "Metric strategy",
            "recommendation": "Keep validating watch-time-like signals against longer-term outcomes rather than optimizing them alone.",
            "evidence": "Primary interaction-volume effects are uncertain, while secondary watch-hours results differ from interaction-volume results.",
        },
        {
            "decision_area": "Experiment design",
            "recommendation": "Use recent engagement and watch-quality histories as stratification variables in a future online test.",
            "evidence": "Heterogeneity diagnostics identify recent interactions and high-watch share as important effect moderators.",
        },
        {
            "decision_area": "Deployment posture",
            "recommendation": "Treat this as offline evidence for experiment prioritization, not as deployment proof.",
            "evidence": "The analysis is observational and overlap is imperfect, even after weighting improved observed balance.",
        },
    ]
)

final_recommendation.to_csv(TABLES_DIR / "final_recommendation.csv", index=False)
display(final_recommendation)

	decision_area	recommendation	evidence
0	Average long-term interaction effect	Do not claim a clear positive average effect f...	MSM, g-computation, and AIPW estimates are sma...
1	Metric strategy	Keep validating watch-time-like signals agains...	Primary interaction-volume effects are uncerta...
2	Experiment design	Use recent engagement and watch-quality histor...	Heterogeneity diagnostics identify recent inte...
3	Deployment posture	Treat this as offline evidence for experiment ...	The analysis is observational and overlap is i...

The recommendation table is the executive version of the project. It avoids overselling the result and converts the causal evidence into concrete product and experimentation guidance.
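
The first recommendation rests on a simple reading of the intervals: whether each 95% CI excludes zero. A minimal sketch of that check follows, using the point estimates and intervals reported in this notebook's final summary. The helper name `classify_interval` and its labels are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Estimates and 95% intervals as reported in this notebook's final summary.
estimates = pd.DataFrame(
    [
        {"estimator": "MSM, weighted", "estimate": -2.69, "ci_95_lower": -12.27, "ci_95_upper": 5.73},
        {"estimator": "G-computation, LightGBM", "estimate": 0.04, "ci_95_lower": -0.64, "ci_95_upper": 1.87},
        {"estimator": "AIPW, doubly robust", "estimate": 1.09, "ci_95_lower": -7.34, "ci_95_upper": 12.02},
    ]
)


def classify_interval(lower: float, upper: float) -> str:
    """Label an interval by its position relative to zero (hypothetical helper)."""
    if lower > 0:
        return "positive"
    if upper < 0:
        return "negative"
    return "includes zero"


estimates["ci_vs_zero"] = [
    classify_interval(lo, hi)
    for lo, hi in zip(estimates["ci_95_lower"], estimates["ci_95_upper"])
]
print(estimates[["estimator", "estimate", "ci_vs_zero"]].to_string(index=False))
```

All three intervals span zero, which is why the table recommends against claiming a positive average effect.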

Write the Final Project Summary

This cell writes a markdown summary that can be used as the final narrative for a portfolio page or README section.

msm_row = final_estimator_table.query("estimator == 'MSM, weighted'").iloc[0]
gcomp_lgbm_row = final_estimator_table.query("estimator == 'G-computation, LightGBM'").iloc[0]
aipw_row = final_estimator_table.query("estimator == 'AIPW, doubly robust'").iloc[0]

summary_text = f"""# Final Summary: Long-Term Causal Effects in Recommendation Systems

## Question

This project estimates whether a high-watch-exposure day in KuaiRec changes a user's future 7-day engagement. The target outcome is future 7-day interaction volume.

## Why This Matters

Short-term recommendation metrics such as watch ratio or completion can look attractive while failing to improve longer-term engagement. This project treats the problem as a sequential causal inference task, where prior user behavior affects today's exposure and future behavior.

## Methods

The project uses three complementary causal strategies:

- Marginal structural model with stabilized inverse probability weights.
- G-computation with linear and LightGBM outcome models.
- Doubly robust AIPW estimation with segment-level heterogeneity diagnostics.

## Main Result

The average effect is small and uncertain across estimators:

- MSM estimate: {msm_row['estimate']:.2f} future interactions, 95% CI [{msm_row['ci_95_lower']:.2f}, {msm_row['ci_95_upper']:.2f}].
- LightGBM g-computation estimate: {gcomp_lgbm_row['estimate']:.2f}, 95% CI [{gcomp_lgbm_row['ci_95_lower']:.2f}, {gcomp_lgbm_row['ci_95_upper']:.2f}].
- Doubly robust AIPW estimate: {aipw_row['estimate']:.2f}, 95% CI [{aipw_row['ci_95_lower']:.2f}, {aipw_row['ci_95_upper']:.2f}].

The evidence does not support claiming a clear positive average effect of high-watch-exposure days on future 7-day interaction volume.

## Sensitivity and Heterogeneity

Weighting improved observed covariate balance, but overlap was imperfect and effective sample size dropped after weighting. Secondary outcomes suggest the treatment may relate differently to watch-hours metrics than to interaction volume. Heterogeneity diagnostics point to recent engagement and recent watch-quality history as useful stratification variables for future experiments.

## Product Takeaway

High-watch exposure should not be treated as automatically beneficial for longer-term interaction volume. It may still be a useful short-term satisfaction signal, but it should be validated against long-term metrics and tested online with stratification by recent user history.

## Limitations

This is an observational analysis. The estimates rely on sequential ignorability, observed history adjustment, and constructed treatment definitions. The results are best used to prioritize and design online experiments, not to replace them.
"""

summary_path = WRITEUP_DIR / "final_project_summary.md"
summary_path.write_text(summary_text)

display(Markdown(summary_text))
print(f"Saved {summary_path}")

Final Summary: Long-Term Causal Effects in Recommendation Systems

Question

This project estimates whether a high-watch-exposure day in KuaiRec changes a user’s future 7-day engagement. The target outcome is future 7-day interaction volume.

Why This Matters

Short-term recommendation metrics such as watch ratio or completion can look attractive while failing to improve longer-term engagement. This project treats the problem as a sequential causal inference task, where prior user behavior affects today’s exposure and future behavior.

Methods

The project uses three complementary causal strategies:

Marginal structural model with stabilized inverse probability weights.
G-computation with linear and LightGBM outcome models.
Doubly robust AIPW estimation with segment-level heterogeneity diagnostics.

Main Result

The average effect is small and uncertain across estimators:

MSM estimate: -2.69 future interactions, 95% CI [-12.27, 5.73].
LightGBM g-computation estimate: 0.04, 95% CI [-0.64, 1.87].
Doubly robust AIPW estimate: 1.09, 95% CI [-7.34, 12.02].

The evidence does not support claiming a clear positive average effect of high-watch-exposure days on future 7-day interaction volume.

Sensitivity and Heterogeneity

Weighting improved observed covariate balance, but overlap was imperfect and effective sample size dropped after weighting. Secondary outcomes suggest the treatment may relate differently to watch-hours metrics than to interaction volume. Heterogeneity diagnostics point to recent engagement and recent watch-quality history as useful stratification variables for future experiments.

Product Takeaway

High-watch exposure should not be treated as automatically beneficial for longer-term interaction volume. It may still be a useful short-term satisfaction signal, but it should be validated against long-term metrics and tested online with stratification by recent user history.

Limitations

This is an observational analysis. The estimates rely on sequential ignorability, observed history adjustment, and constructed treatment definitions. The results are best used to prioritize and design online experiments, not to replace them.

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/final_project_summary.md

The final summary is now saved as markdown and displayed in the notebook. It states the core finding without burying the uncertainty or limitations.
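
The summary notes that effective sample size dropped after weighting. One common way to quantify that drop is Kish's approximation, the squared sum of weights divided by the sum of squared weights. The sketch below assumes that definition; the earlier notebooks may compute it differently. The weights are synthetic and only illustrate how skewed weights shrink the effective sample and how clipping recovers part of it.

```python
import numpy as np


def kish_effective_sample_size(weights) -> float:
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.square(w).sum())


# Synthetic weights for illustration only; these are not the project's weights.
uniform_weights = np.ones(1000)
rng = np.random.default_rng(0)
skewed_weights = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Clip at the 99th percentile, one possible clipping rule (an assumption here).
clipped_weights = np.clip(skewed_weights, None, np.quantile(skewed_weights, 0.99))

print(f"uniform: {kish_effective_sample_size(uniform_weights):.0f}")
print(f"skewed:  {kish_effective_sample_size(skewed_weights):.0f}")
print(f"clipped: {kish_effective_sample_size(clipped_weights):.0f}")
```

Equal weights return the nominal sample size, so the gap between the nominal and effective sizes is a direct measure of how much a few large weights dominate the estimate.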

Write Resume Bullets

The resume bullets focus on the technical and product value of the project. They target data science and causal inference roles on recommendation-system teams.

resume_bullets = """# Resume Bullets

- Built a sequential causal inference project on KuaiRec to estimate whether high-watch-exposure recommendation days affect future 7-day engagement.
- Defined a user-day estimand, engineered lagged user-history confounders, and diagnosed time-varying confounding and positivity in recommender-system logs.
- Estimated long-term effects using marginal structural models, g-computation, and doubly robust AIPW, with user-cluster bootstrap uncertainty.
- Found small, uncertain average effects on future interaction volume across estimators, while identifying recent engagement and watch-quality histories as heterogeneity drivers for future experiments.
- Produced portfolio-ready causal analysis artifacts, including estimator comparison figures, balance diagnostics, sensitivity tables, limitations, and final product recommendations.
"""

resume_path = WRITEUP_DIR / "resume_bullets.md"
resume_path.write_text(resume_bullets)

display(Markdown(resume_bullets))
print(f"Saved {resume_path}")

Resume Bullets

Built a sequential causal inference project on KuaiRec to estimate whether high-watch-exposure recommendation days affect future 7-day engagement.
Defined a user-day estimand, engineered lagged user-history confounders, and diagnosed time-varying confounding and positivity in recommender-system logs.
Estimated long-term effects using marginal structural models, g-computation, and doubly robust AIPW, with user-cluster bootstrap uncertainty.
Found small, uncertain average effects on future interaction volume across estimators, while identifying recent engagement and watch-quality histories as heterogeneity drivers for future experiments.
Produced portfolio-ready causal analysis artifacts, including estimator comparison figures, balance diagnostics, sensitivity tables, limitations, and final product recommendations.

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/resume_bullets.md

The resume bullets emphasize the causal workflow and the product interpretation. They are intentionally honest about the result: the value is in rigorous long-term validation, not in manufacturing a positive effect.

Build the Artifact Index

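The final cell lists every file saved under the writeup folder, which makes the folder easy to navigate. The index is built before `artifact_index.csv` itself is written, so on a first run that file does not appear in its own listing.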

artifact_paths = sorted(WRITEUP_DIR.rglob("*"))
artifact_index = pd.DataFrame(
    [
        {
            "artifact": str(path.relative_to(WRITEUP_DIR)),
            "type": "directory" if path.is_dir() else path.suffix.replace(".", ""),
            "size_bytes": path.stat().st_size if path.is_file() else np.nan,
        }
        for path in artifact_paths
    ]
)

artifact_index = artifact_index.query("type != 'directory'").reset_index(drop=True)
artifact_index.to_csv(TABLES_DIR / "artifact_index.csv", index=False)

display(artifact_index)
print(f"Saved {TABLES_DIR / 'artifact_index.csv'}")

	artifact	type	size_bytes
0	figures/01_estimator_comparison.png	png	84,935.0000
1	figures/02_weight_balance.png	png	131,917.0000
2	figures/03_weight_sensitivity.png	png	127,032.0000
3	figures/04_secondary_outcomes.png	png	87,650.0000
4	figures/05_prior_interaction_segments.png	png	68,157.0000
5	figures/06_effect_moderator_importance.png	png	107,326.0000
6	final_project_summary.md	md	2,219.0000
7	resume_bullets.md	md	853.0000
8	tables/effect_moderator_importance.csv	csv	495.0000
9	tables/final_estimator_comparison.csv	csv	709.0000
10	tables/final_recommendation.csv	csv	968.0000
11	tables/heterogeneous_segment_effects.csv	csv	4,351.0000
12	tables/limitations.csv	csv	1,469.0000
13	tables/msm_weight_sensitivity.csv	csv	1,318.0000
14	tables/secondary_outcome_summary.csv	csv	1,025.0000
15	tables/weight_readiness_checks.csv	csv	395.0000

Saved /home/apex/Documents/ranking_sys/notebooks/projects/project_3_long_term_causal_effects/writeup/tables/artifact_index.csv

The artifact index closes the project loop. A reviewer can now open the writeup folder and see the final figures, tables, summary, limitations, and resume bullets without digging through every notebook.
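
The index cell above records `np.nan` as the size of directories before filtering them out, which is why `size_bytes` displays as a float. A small variation, sketched here against a throwaway temporary directory, filters to files first so the column stays an integer. The function name `build_artifact_index` is a hypothetical helper, not something defined elsewhere in the project.

```python
from pathlib import Path
import tempfile

import pandas as pd


def build_artifact_index(root: Path) -> pd.DataFrame:
    """List every file under root with its extension and size in bytes."""
    rows = [
        {
            "artifact": path.relative_to(root).as_posix(),
            "type": path.suffix.lstrip("."),
            "size_bytes": path.stat().st_size,
        }
        for path in sorted(root.rglob("*"))
        if path.is_file()  # skipping directories keeps size_bytes an integer column
    ]
    return pd.DataFrame(rows, columns=["artifact", "type", "size_bytes"])


# Throwaway directory standing in for the writeup folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tables").mkdir()
    (root / "tables" / "example.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    (root / "summary.md").write_text("# Title\n", encoding="utf-8")
    demo_index = build_artifact_index(root)

print(demo_index)
```

Calling the same function on `WRITEUP_DIR` after the index file has been written would also include `artifact_index.csv` in the listing.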

Final Takeaway

The completed project provides a careful causal analysis of long-term recommendation effects. The strongest statement the evidence supports is:

In this KuaiRec observational panel, high-watch-exposure days do not show a clear positive average effect on future 7-day interaction volume across MSM, g-computation, and doubly robust AIPW estimators.

That is a useful product conclusion. It says that short-term watch-quality signals should be validated against longer-term outcomes before being treated as optimization targets. It also gives a concrete next step: design an online experiment stratified by recent engagement and watch-quality history.
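
As a rough illustration of that next step, the sketch below assigns treatment at random within strata defined by recent engagement and watch-quality history. Everything in it is an assumption made for illustration: the users are synthetic, the column names `recent_interactions` and `recent_high_watch_share` are hypothetical stand-ins for the project's history features, and tercile strata with a 50/50 split are only one possible design.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic users; both history columns are hypothetical stand-ins.
users = pd.DataFrame(
    {
        "user_id": np.arange(2000),
        "recent_interactions": rng.poisson(lam=30, size=2000),
        "recent_high_watch_share": rng.beta(a=2, b=5, size=2000),
    }
)

# Tercile strata on each history variable, crossed into up to nine cells.
users["engagement_stratum"] = pd.qcut(
    users["recent_interactions"], q=3, labels=False, duplicates="drop"
)
users["watch_stratum"] = pd.qcut(
    users["recent_high_watch_share"], q=3, labels=False, duplicates="drop"
)


def assign_within_stratum(group: pd.Series) -> pd.Series:
    """Randomly assign half of a stratum (rounded down) to treatment."""
    arms = np.zeros(len(group), dtype=int)
    arms[: len(group) // 2] = 1
    return pd.Series(rng.permutation(arms), index=group.index)


users["arm"] = (
    users.groupby(["engagement_stratum", "watch_stratum"])["user_id"]
    .transform(assign_within_stratum)
)

# Treated share per stratum should sit at or just below one half.
print(users.groupby(["engagement_stratum", "watch_stratum"])["arm"].agg(["size", "mean"]))
```

Randomizing within strata balances the arms on the history variables by construction, so segment-level comparisons in the online test do not depend on the weighting and overlap assumptions that limit the offline analysis.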