Notebook 03: Time-Varying Confounding and Propensity Weights

Notebook 01 turned KuaiRec into a sequential user-day panel. Notebook 02 defined the primary causal estimand:

Among active KuaiRec user-days with sufficient prior history and 7-day follow-up, estimate the effect of a high-watch-exposure day on future 7-day interaction volume.

This notebook handles the next causal-design problem: treatment assignment is not random. A user is more likely to have a high-watch-exposure day when their recent history, preferences, activity level, and calendar context make that kind of day more likely. Those same pre-treatment factors can also predict future engagement. That is the classic time-varying confounding problem in recommender systems.

The goal of this notebook is to estimate the probability of treatment from pre-treatment information, diagnose overlap and positivity, and create stabilized inverse probability weights. These weights are the handoff to the next notebook, where we can fit a marginal structural model for the long-term treatment effect.

Why This Notebook Matters

A naive comparison asks whether treated user-days have higher future engagement than untreated user-days. That comparison is easy, but it is not causal. The recommender and the user are both dynamic:

  • prior engagement affects today’s exposure pattern;
  • today’s exposure pattern may affect future engagement;
  • prior engagement also directly predicts future engagement;
  • therefore, prior engagement confounds the treatment-outcome relationship.

This notebook models the treatment assignment process using only information available before the treatment day. If the model can predict treatment from prior state, that is evidence that treatment was selected rather than randomized. The resulting propensity scores let us construct inverse probability weights that make treated and control user-days more comparable on observed histories.

Setup

The first code cell imports tools for data handling, visualization, propensity modeling, model evaluation, and weight diagnostics. Logistic regression is used as the primary transparent propensity model. LightGBM is included as a flexible comparison model to check whether nonlinear treatment assignment changes the overlap story.

from pathlib import Path
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

from lightgbm import LGBMClassifier
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="X does not have valid feature names")

pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")
RANDOM_STATE = 42

The modeling environment is ready. The imports cover both interpretable and flexible propensity models, which lets the notebook compare stability against model complexity before choosing weights.

Locate the Estimand Panel

Notebook 02 saved a modeling-ready estimand panel. This notebook starts from that file rather than rebuilding the KuaiRec sequence data. That keeps the workflow modular: the first two notebooks define the data and estimand, while this notebook focuses on treatment assignment.

ESTIMAND_PANEL_RELATIVE_PATH = Path("data/processed/kuairec_long_term_estimand_panel.parquet")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / ESTIMAND_PANEL_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {ESTIMAND_PANEL_RELATIVE_PATH}. Run Notebook 02 first or run this notebook inside the project."
    )

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PANEL_PATH = PROJECT_ROOT / ESTIMAND_PANEL_RELATIVE_PATH

print(f"Project root: {PROJECT_ROOT}")
print(f"Input panel: {PANEL_PATH}")
print(f"Processed output folder: {PROCESSED_DIR}")
Project root: /home/apex/Documents/ranking_sys
Input panel: /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimand_panel.parquet
Processed output folder: /home/apex/Documents/ranking_sys/data/processed

The notebook found the estimand panel for the long-term effect analysis and the processed-output folder. The next cell can now load the exact population defined in Notebook 02 without changing the causal target.

Load the Estimand Panel

The estimand panel already contains standardized treatment and outcome columns. Here, treatment = 1 means the user-day had high-watch exposure, and outcome is future 7-day interaction volume. The cell also sorts the panel by user and date so all time-based diagnostics are easy to read.

panel = pd.read_parquet(PANEL_PATH)
panel["event_date"] = pd.to_datetime(panel["event_date"])
panel = panel.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Panel shape: {panel.shape}")
print(f"Users: {panel['user_id'].nunique():,}")
print(f"Date range: {panel['event_date'].min().date()} to {panel['event_date'].max().date()}")

display(panel.head())
Panel shape: (4698, 30)
Users: 91
Date range: 2020-07-08 to 2020-08-29
user_id event_date day_of_week calendar_day_index panel_day_index active_day treatment_high_watch_exposure future_7day_interactions future_7day_active_days future_7day_avg_daily_interactions future_7day_play_hours log1p_future_7day_interactions treatment_high_intensity treatment_overcompletion_exposure treatment_high_watch_time lag_1_active_day lag_1_interactions lag_1_total_play_duration_sec lag_1_avg_watch_ratio lag_1_high_watch_share prior_3day_active_day prior_3day_interactions prior_3day_total_play_duration_sec prior_3day_avg_watch_ratio prior_3day_high_watch_share treatment outcome outcome_log1p primary_treatment_name primary_outcome_name
0 14 2020-07-08 Wednesday 3 3 1 1 279.0000 7.0000 39.8571 0.7500 5.6348 0 1 0 1.0000 78.0000 655.4890 0.8415 0.4615 3.0000 127.0000 1,144.8080 2.9900 1.5602 1 279.0000 5.6348 treatment_high_watch_exposure future_7day_interactions
1 14 2020-07-09 Thursday 4 4 1 0 311.0000 7.0000 44.4286 0.8534 5.7430 0 0 0 1.0000 22.0000 201.9010 0.9828 0.5000 3.0000 123.0000 1,105.7340 2.8883 1.4833 0 311.0000 5.7430 treatment_high_watch_exposure future_7day_interactions
2 14 2020-07-10 Friday 5 5 1 1 352.0000 7.0000 50.2857 0.9198 5.8665 0 1 0 1.0000 55.0000 485.0390 0.8619 0.3636 3.0000 155.0000 1,342.4290 2.6862 1.3252 1 352.0000 5.8665 treatment_high_watch_exposure future_7day_interactions
3 14 2020-07-11 Saturday 6 6 1 1 437.0000 7.0000 62.4286 1.1297 6.0822 0 0 0 1.0000 52.0000 606.2440 1.1380 0.5769 3.0000 129.0000 1,293.1840 2.9827 1.4406 1 437.0000 6.0822 treatment_high_watch_exposure future_7day_interactions
4 14 2020-07-12 Sunday 7 7 1 0 437.0000 7.0000 62.4286 1.1754 6.0822 0 0 0 1.0000 32.0000 284.7470 0.9337 0.5000 3.0000 139.0000 1,376.0300 2.9336 1.4406 0 437.0000 6.0822 treatment_high_watch_exposure future_7day_interactions

The loaded panel is the final analysis population from Notebook 02. Because every row is already eligible for the primary estimand, every diagnostic in this notebook speaks directly to the treatment and outcome we plan to estimate.

Re-State the Treatment, Outcome, and Allowed Covariates

This cell creates a compact design table. The most important rule is that the propensity model may use only pre-treatment information. Same-day treatment labels and future outcomes are not allowed as covariates because they would leak post-treatment information into the adjustment process.

PRIMARY_TREATMENT = "treatment"
PRIMARY_OUTCOME = "outcome"

numeric_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "lag_1_high_watch_share",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_avg_watch_ratio",
    "prior_3day_high_watch_share",
    "calendar_day_index",
    "panel_day_index",
]

categorical_covariates = ["day_of_week"]
propensity_features = numeric_covariates + categorical_covariates

design_table = pd.DataFrame(
    [
        {"role": "treatment", "columns": PRIMARY_TREATMENT, "description": "High-watch-exposure day selected in Notebook 02."},
        {"role": "outcome", "columns": PRIMARY_OUTCOME, "description": "Future 7-day interaction volume."},
        {"role": "numeric pre-treatment covariates", "columns": ", ".join(numeric_covariates), "description": "Prior user state and calendar progression."},
        {"role": "categorical pre-treatment covariates", "columns": ", ".join(categorical_covariates), "description": "Calendar context available before treatment."},
    ]
)

display(design_table)
role columns description
0 treatment treatment High-watch-exposure day selected in Notebook 02.
1 outcome outcome Future 7-day interaction volume.
2 numeric pre-treatment covariates lag_1_active_day, lag_1_interactions, lag_1_to... Prior user state and calendar progression.
3 categorical pre-treatment covariates day_of_week Calendar context available before treatment.

The design table separates treatment, outcome, and adjustment features. This separation is what makes the propensity model causal-design machinery rather than just another predictive model.
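One lightweight way to make the no-leakage rule executable is a guard cell like the sketch below. This is an added illustration, not part of the original pipeline; it simply asserts that no same-day treatment label or future-outcome column appears among the propensity features.

leakage_columns = (
    {PRIMARY_TREATMENT, PRIMARY_OUTCOME, "outcome_log1p"}
    | {column for column in panel.columns if column.startswith(("future_", "treatment_"))}
)
# The intersection must be empty: propensity features may only use pre-treatment state.
assert not leakage_columns & set(propensity_features), "post-treatment column leaked into propensity features"
print(f"Checked {len(propensity_features)} propensity features against {len(leakage_columns)} excluded columns.")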

Check Treatment and Outcome Scale

Before modeling treatment assignment, we confirm the basic treatment rate and outcome distribution. A balanced treatment is easier to model than a rare one, and a non-degenerate outcome is needed for the later marginal structural model.

basic_estimand_summary = pd.DataFrame(
    [
        {"metric": "rows", "value": len(panel)},
        {"metric": "users", "value": panel["user_id"].nunique()},
        {"metric": "treatment_rate", "value": panel[PRIMARY_TREATMENT].mean()},
        {"metric": "control_rows", "value": int((panel[PRIMARY_TREATMENT] == 0).sum())},
        {"metric": "treated_rows", "value": int((panel[PRIMARY_TREATMENT] == 1).sum())},
        {"metric": "outcome_mean", "value": panel[PRIMARY_OUTCOME].mean()},
        {"metric": "outcome_std", "value": panel[PRIMARY_OUTCOME].std()},
        {"metric": "outcome_min", "value": panel[PRIMARY_OUTCOME].min()},
        {"metric": "outcome_max", "value": panel[PRIMARY_OUTCOME].max()},
    ]
)

display(basic_estimand_summary)
metric value
0 rows 4,698.0000
1 users 91.0000
2 treatment_rate 0.4996
3 control_rows 2,351.0000
4 treated_rows 2,347.0000
5 outcome_mean 382.0826
6 outcome_std 156.3942
7 outcome_min 0.0000
8 outcome_max 888.0000

The treatment is close to balanced and the future-interaction outcome has substantial variation. That makes this a reasonable setting for propensity modeling and later weighted outcome regression.

Naive Treated-Control Difference

This table repeats the raw treated-versus-control comparison within the final estimand population. It is intentionally naive: it does not adjust for prior engagement. We keep it here as a reference point so later weighting diagnostics can be interpreted against the unadjusted story.

naive_group_summary = (
    panel.groupby(PRIMARY_TREATMENT)[PRIMARY_OUTCOME]
    .agg(["count", "mean", "std", "min", "median", "max"])
    .rename_axis("treatment")
)

naive_difference = naive_group_summary.loc[1, "mean"] - naive_group_summary.loc[0, "mean"]
naive_lift = naive_difference / naive_group_summary.loc[0, "mean"]

print(f"Naive treated-control difference: {naive_difference:,.3f} future interactions")
print(f"Naive relative lift: {naive_lift:.2%}")
display(naive_group_summary)
Naive treated-control difference: 7.075 future interactions
Naive relative lift: 1.87%
count mean std min median max
treatment
0 2351 378.5479 157.6583 18.0000 380.0000 888.0000
1 2347 385.6233 155.0705 0.0000 402.0000 851.0000

The naive comparison gives the unadjusted difference in future engagement. It is useful as a baseline, but it does not answer the causal question because treatment days may already differ in their prior histories.

Time-Varying Confounding Map

The next cell states the confounding logic in table form. Each row links a pre-treatment variable to the treatment assignment process and to the future outcome. The same prior behavior can influence both, which is why adjustment is needed.

confounding_map = pd.DataFrame(
    [
        {
            "pre_treatment_state": "lag_1_interactions",
            "why_it_predicts_treatment": "Recently active users are more likely to generate high-watch exposure days.",
            "why_it_predicts_outcome": "Recently active users are also more likely to interact over the next week.",
        },
        {
            "pre_treatment_state": "prior_3day_interactions",
            "why_it_predicts_treatment": "A short recent history of heavy use can influence the content and sessions observed today.",
            "why_it_predicts_outcome": "Recent heavy use is a strong signal of future engagement volume.",
        },
        {
            "pre_treatment_state": "prior_3day_avg_watch_ratio",
            "why_it_predicts_treatment": "Users with recent high watch ratios may be more likely to have another high-watch day.",
            "why_it_predicts_outcome": "Recent satisfaction can predict continued future engagement.",
        },
        {
            "pre_treatment_state": "calendar_day_index / day_of_week",
            "why_it_predicts_treatment": "The mix of content and user sessions can vary over calendar time.",
            "why_it_predicts_outcome": "Future engagement can also vary by calendar time and day-of-week usage patterns.",
        },
    ]
)

display(confounding_map)
pre_treatment_state why_it_predicts_treatment why_it_predicts_outcome
0 lag_1_interactions Recently active users are more likely to gener... Recently active users are also more likely to ...
1 prior_3day_interactions A short recent history of heavy use can influe... Recent heavy use is a strong signal of future ...
2 prior_3day_avg_watch_ratio Users with recent high watch ratios may be mor... Recent satisfaction can predict continued futu...
3 calendar_day_index / day_of_week The mix of content and user sessions can vary ... Future engagement can also vary by calendar ti...

This table makes the sequential causal problem explicit. The next cells test whether these pre-treatment states are actually related to treatment in the observed panel.

Treatment Rate Over Calendar Time

A treatment rate that varies over time may reflect changes in content supply, user behavior, or conditions specific to the logging period. This cell plots the daily treatment rate and the daily average outcome. Calendar patterns motivate including calendar time in the propensity model.

daily_summary = (
    panel.groupby("event_date")
    .agg(
        treatment_rate=(PRIMARY_TREATMENT, "mean"),
        mean_outcome=(PRIMARY_OUTCOME, "mean"),
        rows=(PRIMARY_TREATMENT, "size"),
    )
    .reset_index()
)

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)
sns.lineplot(data=daily_summary, x="event_date", y="treatment_rate", marker="o", ax=axes[0], color="#2A6F97")
axes[0].set_title("Daily High-Watch-Exposure Treatment Rate")
axes[0].set_ylabel("Treatment rate")
axes[0].yaxis.set_major_formatter(lambda value, _: f"{value:.0%}")

sns.lineplot(data=daily_summary, x="event_date", y="mean_outcome", marker="o", ax=axes[1], color="#5C946E")
axes[1].set_title("Daily Mean Future 7-Day Interactions")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Mean future interactions")
axes[1].tick_params(axis="x", rotation=35)

plt.tight_layout()
plt.show()

The time-series view shows whether treatment and future engagement drift across the sample window. Any visible calendar movement supports keeping time controls in the treatment assignment model.

Treatment Rate by Prior User State

This cell bins important prior-history variables into quartiles and computes the treatment rate in each bin. If treatment rate changes across prior-activity bins, then prior state is part of the treatment assignment process and should be adjusted for.

def make_quantile_bucket(series, label_prefix):
    return pd.qcut(series, q=4, duplicates="drop").astype(str).rename(label_prefix)

history_bucket_df = panel[[PRIMARY_TREATMENT]].copy()
history_bucket_df["prior_3day_interactions_bucket"] = make_quantile_bucket(
    panel["prior_3day_interactions"], "prior_3day_interactions_bucket"
)
history_bucket_df["lag_1_watch_ratio_bucket"] = make_quantile_bucket(
    panel["lag_1_avg_watch_ratio"], "lag_1_watch_ratio_bucket"
)
history_bucket_df["prior_3day_watch_share_bucket"] = make_quantile_bucket(
    panel["prior_3day_high_watch_share"], "prior_3day_watch_share_bucket"
)

bucket_summaries = []
for bucket_col in ["prior_3day_interactions_bucket", "lag_1_watch_ratio_bucket", "prior_3day_watch_share_bucket"]:
    summary = (
        history_bucket_df.groupby(bucket_col, observed=True)[PRIMARY_TREATMENT]
        .agg(treatment_rate="mean", rows="size")
        .reset_index()
        .rename(columns={bucket_col: "bucket"})
    )
    summary.insert(0, "history_variable", bucket_col)
    bucket_summaries.append(summary)

history_treatment_rates = pd.concat(bucket_summaries, ignore_index=True)
display(history_treatment_rates)
history_variable bucket treatment_rate rows
0 prior_3day_interactions_bucket (-0.001, 123.0] 0.4920 1183
1 prior_3day_interactions_bucket (123.0, 166.0] 0.5178 1182
2 prior_3day_interactions_bucket (166.0, 217.0] 0.5198 1164
3 prior_3day_interactions_bucket (217.0, 581.0] 0.4688 1169
4 lag_1_watch_ratio_bucket (-0.001, 0.738] 0.2153 1175
5 lag_1_watch_ratio_bucket (0.738, 0.85] 0.4438 1174
6 lag_1_watch_ratio_bucket (0.85, 1.011] 0.6388 1174
7 lag_1_watch_ratio_bucket (1.011, 6.491] 0.7004 1175
8 prior_3day_watch_share_bucket (-0.001, 1.127] 0.1677 1175
9 prior_3day_watch_share_bucket (1.127, 1.414] 0.3424 1174
10 prior_3day_watch_share_bucket (1.414, 1.708] 0.6235 1174
11 prior_3day_watch_share_bucket (1.708, 2.735] 0.8647 1175

The bucket table checks whether treatment is more common for users with different recent activity or watch behavior. This is the empirical version of the confounding story from the previous map.

Plot Treatment Rate by Prior State

The table above is useful, but a faceted plot makes the pattern easier to scan. The goal is not to prove causality. The goal is to see whether pre-treatment history predicts treatment assignment strongly enough to justify propensity modeling.

g = sns.catplot(
    data=history_treatment_rates,
    x="bucket",
    y="treatment_rate",
    col="history_variable",
    kind="bar",
    sharex=False,
    sharey=True,
    height=4,
    aspect=1.2,
    color="#2A6F97",
)
g.set_axis_labels("Prior-state bucket", "Treatment rate")
g.set_titles("{col_name}")
for ax in g.axes.flat:
    ax.tick_params(axis="x", rotation=30)
    ax.yaxis.set_major_formatter(lambda value, _: f"{value:.0%}")
g.fig.suptitle("Treatment Assignment Varies With Pre-Treatment History", y=1.08)
plt.show()

The plot helps reveal which prior-state variables are most associated with high-watch exposure. Stronger gradients indicate variables that the propensity model should capture to improve treated-control comparability.

Unweighted Covariate Balance

Before modeling propensities, we measure treated-control imbalance using standardized mean differences. A standardized mean difference compares means in standard deviation units. Values farther from zero indicate stronger pre-treatment imbalance.
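For reference, the pooled form implemented in the helper below is, for each covariate:

SMD = (mean_treated - mean_control) / sqrt((var_treated + var_control) / 2)

The weighted variant of the same helper replaces each mean and variance with its weight-adjusted counterpart, which is what the post-weighting balance check later in this notebook uses.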

def weighted_mean(values, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * values) / np.sum(weights)


def weighted_var(values, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = weighted_mean(values, weights)
    return np.sum(weights * (values - mean) ** 2) / np.sum(weights)


def standardized_mean_difference(data, covariate, treatment_col=PRIMARY_TREATMENT, weight_col=None):
    treated = data[treatment_col].eq(1)
    control = data[treatment_col].eq(0)

    if weight_col is None:
        treated_mean = data.loc[treated, covariate].mean()
        control_mean = data.loc[control, covariate].mean()
        treated_var = data.loc[treated, covariate].var(ddof=1)
        control_var = data.loc[control, covariate].var(ddof=1)
    else:
        treated_mean = weighted_mean(data.loc[treated, covariate], data.loc[treated, weight_col])
        control_mean = weighted_mean(data.loc[control, covariate], data.loc[control, weight_col])
        treated_var = weighted_var(data.loc[treated, covariate], data.loc[treated, weight_col])
        control_var = weighted_var(data.loc[control, covariate], data.loc[control, weight_col])

    pooled_sd = np.sqrt((treated_var + control_var) / 2)
    smd = (treated_mean - control_mean) / pooled_sd if pooled_sd > 0 else np.nan
    return treated_mean, control_mean, smd

balance_rows = []
for covariate in numeric_covariates:
    treated_mean, control_mean, smd = standardized_mean_difference(panel, covariate)
    balance_rows.append(
        {
            "covariate": covariate,
            "treated_mean": treated_mean,
            "control_mean": control_mean,
            "smd_unweighted": smd,
            "abs_smd_unweighted": abs(smd),
        }
    )

unweighted_balance = pd.DataFrame(balance_rows).sort_values("abs_smd_unweighted", ascending=False)
display(unweighted_balance)
covariate treated_mean control_mean smd_unweighted abs_smd_unweighted
9 prior_3day_high_watch_share 1.6434 1.2077 1.2197 1.2197
4 lag_1_high_watch_share 0.5545 0.3979 1.0634 1.0634
8 prior_3day_avg_watch_ratio 3.0092 2.5157 0.5652 0.5652
3 lag_1_avg_watch_ratio 1.0098 0.8337 0.4654 0.4654
7 prior_3day_total_play_duration_sec 1,605.1460 1,403.9936 0.2859 0.2859
2 lag_1_total_play_duration_sec 538.3737 466.2783 0.2329 0.2329
10 calendar_day_index 28.1542 29.5593 -0.0919 0.0919
11 panel_day_index 28.1542 29.5593 -0.0919 0.0919
6 prior_3day_interactions 171.3997 174.8337 -0.0487 0.0487
0 lag_1_active_day 0.9872 0.9830 0.0349 0.0349
1 lag_1_interactions 57.3464 58.2922 -0.0296 0.0296
5 prior_3day_active_day 2.9497 2.9481 0.0059 0.0059

The unweighted balance table shows which histories differ most between treated and control rows. Those imbalances are the main observed confounders that the propensity model and weights need to address.

Visualize Unweighted Balance

This plot ranks pre-treatment covariates by absolute imbalance. The reference lines at 0.1 and -0.1 are common practical thresholds for standardized mean differences, though they are guidelines rather than hard rules.

plot_balance = unweighted_balance.sort_values("smd_unweighted")

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=plot_balance, x="smd_unweighted", y="covariate", ax=ax, color="#8ECAE6")
ax.axvline(0, color="black", linewidth=1)
ax.axvline(0.1, color="darkred", linestyle="--", linewidth=1)
ax.axvline(-0.1, color="darkred", linestyle="--", linewidth=1)
ax.set_title("Unweighted Treated-Control Balance Before Propensity Adjustment")
ax.set_xlabel("Standardized mean difference")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
plt.show()

The balance plot makes the adjustment target visible. The next step is to estimate treatment probabilities from these histories so we can reweight the data toward better balance.

Define Propensity Model Pipelines

The primary propensity model is logistic regression because it is transparent and usually stable for weights. The comparison model is LightGBM because it can capture nonlinear treatment assignment and interactions. Both models use the same pre-treatment feature set.

def make_logistic_propensity_pipeline(numeric_cols, categorical_cols):
    preprocessor = ColumnTransformer(
        transformers=[
            ("numeric", StandardScaler(), numeric_cols),
            ("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
        ],
        remainder="drop",
        verbose_feature_names_out=False,
    )
    return Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            (
                "model",
                LogisticRegression(
                    max_iter=1_000,
                    solver="lbfgs",
                    random_state=RANDOM_STATE,
                ),
            ),
        ]
    )


def make_lgbm_propensity_pipeline(numeric_cols, categorical_cols):
    preprocessor = ColumnTransformer(
        transformers=[
            ("numeric", "passthrough", numeric_cols),
            ("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
        ],
        remainder="drop",
        verbose_feature_names_out=False,
    )
    return Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            (
                "model",
                LGBMClassifier(
                    n_estimators=150,
                    learning_rate=0.04,
                    num_leaves=15,
                    min_child_samples=50,
                    subsample=0.9,
                    colsample_bytree=0.9,
                    random_state=RANDOM_STATE,
                    n_jobs=1,
                    verbose=-1,
                ),
            ),
        ]
    )

propensity_models = {
    "logistic": make_logistic_propensity_pipeline(numeric_covariates, categorical_covariates),
    "lightgbm": make_lgbm_propensity_pipeline(numeric_covariates, categorical_covariates),
}

print("Defined propensity models:")
for model_name in propensity_models:
    print(f"- {model_name}")
Defined propensity models:
- logistic
- lightgbm

The two pipelines create a transparent baseline and a flexible comparison using the same allowed pre-treatment covariates. Comparing them helps distinguish a real overlap problem from a modeling artifact.

Cross-Fit Propensity Scores

Cross-fitting predicts treatment probabilities out of fold. This means each row’s propensity score is produced by a model that did not train on that row. Cross-fitting is useful because it reduces overfitting in nuisance predictions, which is especially important when those predictions become inverse probability weights.

def cross_fitted_propensity(model, X, y, model_name, n_splits=5):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
    out_of_fold_prob = np.zeros(len(y), dtype=float)
    metric_rows = []

    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
        fold_model = clone(model)
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        y_train = y.iloc[train_idx]
        y_test = y.iloc[test_idx]

        fold_model.fit(X_train, y_train)
        fold_prob = fold_model.predict_proba(X_test)[:, 1]
        out_of_fold_prob[test_idx] = fold_prob

        metric_rows.append(
            {
                "model": model_name,
                "fold": fold,
                "rows": len(test_idx),
                "treatment_rate": y_test.mean(),
                "roc_auc": roc_auc_score(y_test, fold_prob),
                "average_precision": average_precision_score(y_test, fold_prob),
                "brier_score": brier_score_loss(y_test, fold_prob),
                "log_loss": log_loss(y_test, fold_prob, labels=[0, 1]),
            }
        )

    full_model = clone(model).fit(X, y)
    return out_of_fold_prob, pd.DataFrame(metric_rows), full_model

X = panel[propensity_features].copy()
y = panel[PRIMARY_TREATMENT].astype(int)

propensity_outputs = {}
metric_frames = []
full_propensity_models = {}

for model_name, model in propensity_models.items():
    propensity, metrics, fitted_model = cross_fitted_propensity(model, X, y, model_name=model_name)
    propensity_outputs[model_name] = propensity
    metric_frames.append(metrics)
    full_propensity_models[model_name] = fitted_model
    panel[f"propensity_{model_name}"] = propensity

propensity_fold_metrics = pd.concat(metric_frames, ignore_index=True)
propensity_model_metrics = (
    propensity_fold_metrics.groupby("model")
    .agg(
        rows=("rows", "sum"),
        mean_treatment_rate=("treatment_rate", "mean"),
        roc_auc_mean=("roc_auc", "mean"),
        roc_auc_std=("roc_auc", "std"),
        average_precision_mean=("average_precision", "mean"),
        brier_score_mean=("brier_score", "mean"),
        log_loss_mean=("log_loss", "mean"),
    )
    .reset_index()
)

display(propensity_fold_metrics)
display(propensity_model_metrics)
model fold rows treatment_rate roc_auc average_precision brier_score log_loss
0 logistic 1 940 0.5000 0.8279 0.8183 0.1698 0.5132
1 logistic 2 940 0.5000 0.8174 0.8021 0.1741 0.5298
2 logistic 3 940 0.4989 0.8264 0.8222 0.1702 0.5110
3 logistic 4 939 0.4995 0.8436 0.8458 0.1617 0.4885
4 logistic 5 939 0.4995 0.8296 0.8285 0.1683 0.5092
5 lightgbm 1 940 0.5000 0.8285 0.8181 0.1702 0.5113
6 lightgbm 2 940 0.5000 0.8134 0.8055 0.1761 0.5336
7 lightgbm 3 940 0.4989 0.8286 0.8228 0.1692 0.5078
8 lightgbm 4 939 0.4995 0.8380 0.8371 0.1646 0.4964
9 lightgbm 5 939 0.4995 0.8315 0.8345 0.1674 0.5067
model rows mean_treatment_rate roc_auc_mean roc_auc_std average_precision_mean brier_score_mean log_loss_mean
0 lightgbm 4698 0.4996 0.8280 0.0090 0.8236 0.1695 0.5112
1 logistic 4698 0.4996 0.8290 0.0094 0.8234 0.1688 0.5103

The cross-fitted metrics show how predictable treatment is from observed pre-treatment history. Predictability is expected here; it confirms that high-watch exposure is selected by user state rather than randomized.

Summarize Propensity Distributions

The next table summarizes each model’s estimated treatment probabilities. Propensities near 0 or 1 are dangerous because they produce large weights. A good weighting setup needs enough overlap between treated and control rows.

propensity_summary_rows = []
for model_name in propensity_models:
    col = f"propensity_{model_name}"
    values = panel[col]
    propensity_summary_rows.append(
        {
            "model": model_name,
            "mean": values.mean(),
            "std": values.std(),
            "min": values.min(),
            "p01": values.quantile(0.01),
            "p05": values.quantile(0.05),
            "median": values.median(),
            "p95": values.quantile(0.95),
            "p99": values.quantile(0.99),
            "max": values.max(),
            "share_below_0_05": (values < 0.05).mean(),
            "share_above_0_95": (values > 0.95).mean(),
        }
    )

propensity_summary = pd.DataFrame(propensity_summary_rows)
display(propensity_summary)
model mean std min p01 p05 median p95 p99 max share_below_0_05 share_above_0_95
0 logistic 0.4995 0.2863 0.0028 0.0303 0.0776 0.4887 0.9457 0.9830 0.9947 0.0217 0.0447
1 lightgbm 0.4985 0.2930 0.0218 0.0382 0.0686 0.5017 0.9396 0.9689 0.9788 0.0243 0.0349

The propensity summary is the first positivity check. If a model produces many probabilities close to 0 or 1, weights from that model would be unstable and the causal estimate would depend on a thin region of overlap.

Plot Propensity Overlap

This plot compares propensity distributions for treated and control rows. Strong overlap means treated and control user-days have comparable predicted treatment probabilities. Weak overlap means the observed data contains few comparable rows for some histories.

propensity_long = panel[[PRIMARY_TREATMENT, "propensity_logistic", "propensity_lightgbm"]].melt(
    id_vars=PRIMARY_TREATMENT,
    var_name="model",
    value_name="propensity",
)
propensity_long["model"] = propensity_long["model"].str.replace("propensity_", "", regex=False)
propensity_long["group"] = np.where(propensity_long[PRIMARY_TREATMENT].eq(1), "treated", "control")

fig, axes = plt.subplots(1, 2, figsize=(14, 4.8), sharey=True)
for ax, model_name in zip(axes, ["logistic", "lightgbm"]):
    sns.histplot(
        data=propensity_long.query("model == @model_name"),
        x="propensity",
        hue="group",
        bins=35,
        stat="density",
        common_norm=False,
        element="step",
        fill=False,
        ax=ax,
    )
    ax.set_title(f"{model_name.title()} Propensity Overlap")
    ax.set_xlabel("Estimated propensity")
    ax.set_ylabel("Density")

plt.tight_layout()
plt.show()

The overlap plot shows whether treated and control histories occupy similar propensity ranges. The primary weights should come from a model that captures selection while preserving usable overlap.
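As a numeric complement to the plot, a minimal common-support check can bound the region where both groups are observed. The rule below (lower bound = the larger of the two group minima, upper bound = the smaller of the two group maxima) is one common convention, added here as a sketch rather than a step from the earlier notebooks.

for model_name in ["logistic", "lightgbm"]:
    col = f"propensity_{model_name}"
    group_min = panel.groupby(PRIMARY_TREATMENT)[col].min()
    group_max = panel.groupby(PRIMARY_TREATMENT)[col].max()
    lower = group_min.max()  # larger of the treated and control minima
    upper = group_max.min()  # smaller of the treated and control maxima
    outside_share = ((panel[col] < lower) | (panel[col] > upper)).mean()
    print(f"{model_name}: common support [{lower:.3f}, {upper:.3f}], {outside_share:.1%} of rows outside")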

Propensity Calibration by Decile

A propensity model should not only rank rows by treatment likelihood; its probability scale should be reasonable. The next cell bins predictions into deciles and compares predicted treatment probability with observed treatment frequency.

def calibration_table(data, propensity_col, treatment_col=PRIMARY_TREATMENT, n_bins=10):
    table = data[[propensity_col, treatment_col]].copy()
    table["propensity_bin"] = pd.qcut(table[propensity_col], q=n_bins, duplicates="drop")
    return (
        table.groupby("propensity_bin", observed=True)
        .agg(
            rows=(treatment_col, "size"),
            mean_predicted_propensity=(propensity_col, "mean"),
            observed_treatment_rate=(treatment_col, "mean"),
        )
        .reset_index()
    )

calibration_frames = []
for model_name in propensity_models:
    frame = calibration_table(panel, f"propensity_{model_name}")
    frame.insert(0, "model", model_name)
    calibration_frames.append(frame)

propensity_calibration = pd.concat(calibration_frames, ignore_index=True)
display(propensity_calibration)
model propensity_bin rows mean_predicted_propensity observed_treatment_rate
0 logistic (0.00178, 0.119] 470 0.0734 0.0723
1 logistic (0.119, 0.196] 470 0.1562 0.1532
2 logistic (0.196, 0.287] 470 0.2418 0.2213
3 logistic (0.287, 0.386] 469 0.3359 0.3582
4 logistic (0.386, 0.489] 470 0.4377 0.4383
5 logistic (0.489, 0.597] 470 0.5439 0.5723
6 logistic (0.597, 0.707] 469 0.6502 0.6183
7 logistic (0.707, 0.809] 470 0.7584 0.7851
8 logistic (0.809, 0.898] 470 0.8520 0.8574
9 logistic (0.898, 0.995] 470 0.9451 0.9191
10 lightgbm (0.0208, 0.102] 470 0.0682 0.0851
11 lightgbm (0.102, 0.175] 470 0.1359 0.1468
12 lightgbm (0.175, 0.274] 470 0.2211 0.2298
13 lightgbm (0.274, 0.388] 469 0.3325 0.3454
14 lightgbm (0.388, 0.502] 470 0.4464 0.4234
15 lightgbm (0.502, 0.612] 470 0.5581 0.5957
16 lightgbm (0.612, 0.706] 469 0.6580 0.6375
17 lightgbm (0.706, 0.814] 470 0.7621 0.7404
18 lightgbm (0.814, 0.904] 470 0.8632 0.8532
19 lightgbm (0.904, 0.979] 470 0.9397 0.9383

The calibration table checks whether predicted probabilities line up with observed treatment rates. Reasonable calibration is important because inverse probability weights use probability magnitudes, not just ranking.

Plot Propensity Calibration

The dashed diagonal represents perfect calibration. Points above the line mean the observed treatment rate is higher than predicted. Points below the line mean the model overpredicts treatment frequency.

fig, ax = plt.subplots(figsize=(6, 6))
sns.scatterplot(
    data=propensity_calibration,
    x="mean_predicted_propensity",
    y="observed_treatment_rate",
    hue="model",
    size="rows",
    sizes=(40, 160),
    ax=ax,
)
ax.plot([0, 1], [0, 1], color="black", linestyle="--", linewidth=1)
ax.set_title("Cross-Fitted Propensity Calibration")
ax.set_xlabel("Mean predicted propensity")
ax.set_ylabel("Observed treatment rate")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()

The calibration plot helps decide whether a model is safe to use for weighting. A model with good discrimination but poor probability calibration can still create misleading weights.

Define Stabilized Inverse Probability Weights

For a point-treatment marginal structural model, the unstabilized inverse probability weight is 1 / e(X) for treated rows and 1 / (1 - e(X)) for control rows, where e(X) is the estimated propensity. Stabilized weights replace the numerator 1 with the marginal probability of the observed treatment arm: P(T=1) for treated rows and P(T=0) for control rows. Stabilization keeps the target population closer to the observed sample and reduces variance.
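In symbols, with T the binary treatment indicator, the two weight versions described above are:

w  = T / e(X) + (1 - T) / (1 - e(X))
sw = T * P(T=1) / e(X) + (1 - T) * P(T=0) / (1 - e(X))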

This notebook uses logistic-regression propensities as the primary weights because the model is transparent and typically produces less extreme weights. LightGBM weights are kept as a diagnostic comparison.

def add_stabilized_weights(data, propensity_col, prefix, treatment_col=PRIMARY_TREATMENT, eps=1e-6):
    result = data.copy()
    treatment = result[treatment_col].astype(int)
    propensity = result[propensity_col].clip(eps, 1 - eps)
    marginal_treatment_rate = treatment.mean()

    result[f"ipw_{prefix}"] = np.where(
        treatment.eq(1),
        1 / propensity,
        1 / (1 - propensity),
    )
    result[f"sw_{prefix}"] = np.where(
        treatment.eq(1),
        marginal_treatment_rate / propensity,
        (1 - marginal_treatment_rate) / (1 - propensity),
    )
    return result

weighted_panel = panel.copy()
weighted_panel = add_stabilized_weights(weighted_panel, "propensity_logistic", "logistic")
weighted_panel = add_stabilized_weights(weighted_panel, "propensity_lightgbm", "lightgbm")

PRIMARY_PROPENSITY_COL = "propensity_logistic"
PRIMARY_WEIGHT_COL = "sw_logistic"

weight_preview_cols = [
    PRIMARY_TREATMENT,
    "propensity_logistic",
    "sw_logistic",
    "propensity_lightgbm",
    "sw_lightgbm",
]

display(weighted_panel[weight_preview_cols].head(10))
treatment propensity_logistic sw_logistic propensity_lightgbm sw_lightgbm
0 1 0.5471 0.9131 0.5266 0.9487
1 0 0.6333 1.3647 0.5373 1.0816
2 1 0.4140 1.2067 0.5234 0.9544
3 1 0.5654 0.8836 0.6503 0.7682
4 0 0.5986 1.2466 0.7418 1.9378
5 1 0.5141 0.9718 0.7059 0.7077
6 1 0.5511 0.9065 0.5565 0.8978
7 1 0.5693 0.8775 0.7866 0.6351
8 1 0.8006 0.6240 0.8249 0.6056
9 1 0.7779 0.6422 0.8345 0.5987

The panel now contains stabilized weights for both propensity models. The logistic stabilized weight is selected as the primary weight for the next notebook, while LightGBM remains available for sensitivity checks.

Weight Distribution Diagnostics

Weights should be centered near 1 and should not have extreme tails. This cell summarizes the unstabilized and stabilized weights from both propensity models. Extreme weights indicate limited overlap or overly confident propensity predictions.

def effective_sample_size(weights):
    weights = np.asarray(weights, dtype=float)
    return (weights.sum() ** 2) / np.sum(weights ** 2)

weight_columns = ["ipw_logistic", "sw_logistic", "ipw_lightgbm", "sw_lightgbm"]
weight_summary_rows = []
for weight_col in weight_columns:
    values = weighted_panel[weight_col]
    weight_summary_rows.append(
        {
            "weight": weight_col,
            "mean": values.mean(),
            "std": values.std(),
            "min": values.min(),
            "p50": values.quantile(0.50),
            "p90": values.quantile(0.90),
            "p95": values.quantile(0.95),
            "p99": values.quantile(0.99),
            "max": values.max(),
            "effective_sample_size": effective_sample_size(values),
            "ess_share_of_rows": effective_sample_size(values) / len(values),
        }
    )

weight_diagnostics = pd.DataFrame(weight_summary_rows)
display(weight_diagnostics)
weight mean std min p50 p90 p95 p99 max effective_sample_size ess_share_of_rows
0 ipw_logistic 2.0767 3.4301 1.0028 1.3945 3.2625 4.7814 11.4275 119.9903 1,260.3721 0.2683
1 sw_logistic 1.0384 1.7153 0.5018 0.6973 1.6307 2.3921 5.7092 59.9441 1,260.0838 0.2682
2 ipw_lightgbm 2.0541 2.2425 1.0216 1.3874 3.3732 5.1568 11.7539 45.0967 2,143.6149 0.4563
3 sw_lightgbm 1.0270 1.1213 0.5104 0.6935 1.6872 2.5799 5.8723 22.5676 2,143.4028 0.4562

The weight diagnostics show how much precision is lost after weighting. A large effective sample size and modest tails suggest the weighted analysis is numerically feasible.
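As a quick intuition check on the effective-sample-size formula, the hypothetical weight vectors below reuse the effective_sample_size helper defined above: equal weights keep the full sample, while a single dominant weight collapses it.

print(effective_sample_size(np.ones(100)))          # equal weights -> 100.0
print(effective_sample_size([100.0] + [1.0] * 99))  # one dominant weight -> about 3.9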

Plot Stabilized Weight Distributions

The weight plot visualizes the tails that are easy to miss in a summary table. Stabilized weights should ideally be compact. Long right tails warn that some rows carry much more influence than others.

weight_plot_df = weighted_panel[["sw_logistic", "sw_lightgbm"]].melt(
    var_name="weight_type",
    value_name="weight",
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(
    data=weight_plot_df,
    x="weight",
    hue="weight_type",
    bins=50,
    stat="density",
    common_norm=False,
    element="step",
    fill=False,
    ax=ax,
)
ax.set_title("Stabilized Weight Distributions")
ax.set_xlabel("Stabilized inverse probability weight")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()

The plot checks whether either model creates noticeably heavier weight tails. If the flexible model creates much wider weights, the transparent logistic model is a safer primary choice for the marginal structural model.

Weight Clipping Sensitivity

Even with reasonable overlap, it is common to inspect clipped weights. Clipping caps the largest weights to reduce variance, at the cost of changing the target estimand slightly. This cell creates clipped versions of the primary stabilized weight and records effective sample size.

clip_values = [None, 2, 5, 10]
clipping_rows = []

for clip_value in clip_values:
    if clip_value is None:
        clipped_col = "sw_logistic"
        clipped_values = weighted_panel[PRIMARY_WEIGHT_COL]
        label = "unclipped"
    else:
        clipped_col = f"sw_logistic_clip_{clip_value}"
        clipped_values = weighted_panel[PRIMARY_WEIGHT_COL].clip(upper=clip_value)
        weighted_panel[clipped_col] = clipped_values
        label = f"clip_at_{clip_value}"

    clipping_rows.append(
        {
            "weight_version": label,
            "weight_column": clipped_col,
            "mean_weight": clipped_values.mean(),
            "max_weight": clipped_values.max(),
            "effective_sample_size": effective_sample_size(clipped_values),
            "ess_share_of_rows": effective_sample_size(clipped_values) / len(clipped_values),
        }
    )

clipping_diagnostics = pd.DataFrame(clipping_rows)
display(clipping_diagnostics)
weight_version weight_column mean_weight max_weight effective_sample_size ess_share_of_rows
0 unclipped sw_logistic 1.0384 59.9441 1,260.0838 0.2682
1 clip_at_2 sw_logistic_clip_2 0.8798 2.0000 3,797.3068 0.8083
2 clip_at_5 sw_logistic_clip_5 0.9684 5.0000 2,932.2499 0.6241
3 clip_at_10 sw_logistic_clip_10 1.0014 10.0000 2,400.8979 0.5110

The clipping table shows whether capping extreme weights materially changes the effective sample size. If clipping barely changes ESS, the primary weights are already stable; if it changes ESS a lot, later estimates should include clipping sensitivity.
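As a quick preview of that sensitivity, the sketch below recomputes the weighted treated-control outcome difference under each clip value, reusing the weighted_mean helper from the balance section. The formal sensitivity analysis belongs to the next notebook; this is only an illustration.

treated_mask = weighted_panel[PRIMARY_TREATMENT].eq(1)
for weight_col in ["sw_logistic", "sw_logistic_clip_2", "sw_logistic_clip_5", "sw_logistic_clip_10"]:
    # Weighted mean outcome in each arm, then the treated-control gap.
    clip_diff = weighted_mean(
        weighted_panel.loc[treated_mask, PRIMARY_OUTCOME], weighted_panel.loc[treated_mask, weight_col]
    ) - weighted_mean(
        weighted_panel.loc[~treated_mask, PRIMARY_OUTCOME], weighted_panel.loc[~treated_mask, weight_col]
    )
    print(f"{weight_col}: weighted treated-control difference = {clip_diff:,.3f}")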

Choose the Primary Analysis Weight

The next notebook needs one default weight column. We use the logistic stabilized weight with a conservative cap at 10. If the raw weights are already below that cap, this column will match the raw logistic stabilized weights. Keeping an explicit primary column makes downstream notebooks simpler and more auditable.

PRIMARY_ANALYSIS_WEIGHT_COL = "analysis_weight"
WEIGHT_CLIP_VALUE = 10

weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL] = weighted_panel[PRIMARY_WEIGHT_COL].clip(upper=WEIGHT_CLIP_VALUE)

primary_weight_summary = pd.Series(
    {
        "source_propensity": PRIMARY_PROPENSITY_COL,
        "source_weight": PRIMARY_WEIGHT_COL,
        "analysis_weight": PRIMARY_ANALYSIS_WEIGHT_COL,
        "clip_value": WEIGHT_CLIP_VALUE,
        "mean": weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL].mean(),
        "max": weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL].max(),
        "effective_sample_size": effective_sample_size(weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL]),
        "ess_share_of_rows": effective_sample_size(weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL]) / len(weighted_panel),
    }
).rename("value")

display(primary_weight_summary)
source_propensity        propensity_logistic
source_weight                    sw_logistic
analysis_weight              analysis_weight
clip_value                                10
mean                                  1.0014
max                                  10.0000
effective_sample_size             2,400.8979
ess_share_of_rows                     0.5110
Name: value, dtype: object

The selected analysis_weight column is the practical handoff to the marginal structural model. It makes the next notebook cleaner because the default weighting choice is documented here.

Check Weighted Covariate Balance

The purpose of inverse probability weighting is to make treated and control rows more similar on observed pre-treatment history. The next cell recomputes standardized mean differences using the selected analysis weight and compares them with the original unweighted imbalance.

weighted_balance_rows = []
for covariate in numeric_covariates:
    treated_mean_w, control_mean_w, smd_w = standardized_mean_difference(
        weighted_panel,
        covariate,
        weight_col=PRIMARY_ANALYSIS_WEIGHT_COL,
    )
    weighted_balance_rows.append(
        {
            "covariate": covariate,
            "treated_mean_weighted": treated_mean_w,
            "control_mean_weighted": control_mean_w,
            "smd_weighted": smd_w,
            "abs_smd_weighted": abs(smd_w),
        }
    )

weighted_balance = pd.DataFrame(weighted_balance_rows)
balance_diagnostics = unweighted_balance.merge(weighted_balance, on="covariate", how="left")
balance_diagnostics["abs_smd_reduction"] = (
    balance_diagnostics["abs_smd_unweighted"] - balance_diagnostics["abs_smd_weighted"]
)

balance_diagnostics = balance_diagnostics.sort_values("abs_smd_unweighted", ascending=False)
display(balance_diagnostics)
covariate treated_mean control_mean smd_unweighted abs_smd_unweighted treated_mean_weighted control_mean_weighted smd_weighted abs_smd_weighted abs_smd_reduction
0 prior_3day_high_watch_share 1.6434 1.2077 1.2197 1.2197 1.4361 1.4354 0.0018 0.0018 1.2179
1 lag_1_high_watch_share 0.5545 0.3979 1.0634 1.0634 0.4795 0.4752 0.0266 0.0266 1.0369
2 prior_3day_avg_watch_ratio 3.0092 2.5157 0.5652 0.5652 2.7649 2.7524 0.0142 0.0142 0.5510
3 lag_1_avg_watch_ratio 1.0098 0.8337 0.4654 0.4654 0.9253 0.9159 0.0240 0.0240 0.4415
4 prior_3day_total_play_duration_sec 1,605.1460 1,403.9936 0.2859 0.2859 1,491.4685 1,471.8861 0.0282 0.0282 0.2577
5 lag_1_total_play_duration_sec 538.3737 466.2783 0.2329 0.2329 497.9161 495.5466 0.0075 0.0075 0.2254
6 calendar_day_index 28.1542 29.5593 -0.0919 0.0919 29.4525 29.6002 -0.0095 0.0095 0.0824
7 panel_day_index 28.1542 29.5593 -0.0919 0.0919 29.4525 29.6002 -0.0095 0.0095 0.0824
8 prior_3day_interactions 171.3997 174.8337 -0.0487 0.0487 171.6769 170.2261 0.0204 0.0204 0.0283
9 lag_1_active_day 0.9872 0.9830 0.0349 0.0349 0.9861 0.9863 -0.0010 0.0010 0.0339
10 lag_1_interactions 57.3464 58.2922 -0.0296 0.0296 57.1834 56.9702 0.0065 0.0065 0.0231
11 prior_3day_active_day 2.9497 2.9481 0.0059 0.0059 2.9518 2.9552 -0.0132 0.0132 -0.0074

The weighted balance table checks whether the chosen weights reduce observed confounding. Improvement here is a key prerequisite before using the weights in a marginal structural model.

Plot Balance Before and After Weighting

The plot compares absolute standardized mean differences before and after weighting. Good weighting should move most covariates closer to zero, ideally below the practical 0.1 threshold.

balance_plot_df = balance_diagnostics[["covariate", "abs_smd_unweighted", "abs_smd_weighted"]].melt(
    id_vars="covariate",
    var_name="balance_type",
    value_name="absolute_smd",
)
balance_plot_df["balance_type"] = balance_plot_df["balance_type"].map(
    {
        "abs_smd_unweighted": "Before weighting",
        "abs_smd_weighted": "After weighting",
    }
)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
    data=balance_plot_df,
    x="absolute_smd",
    y="covariate",
    hue="balance_type",
    ax=ax,
)
ax.axvline(0.1, color="darkred", linestyle="--", linewidth=1)
ax.set_title("Observed Covariate Balance Before and After Weighting")
ax.set_xlabel("Absolute standardized mean difference")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
plt.show()

The balance plot is the main diagnostic for whether the propensity weights are doing useful adjustment work. Remaining large imbalances would need to be addressed before trusting the weighted effect estimate.

Weighted Outcome Preview

This cell computes a simple weighted treated-control outcome difference using the analysis weights. It is still not the final marginal structural model, but it previews how much the naive difference changes after reweighting observed histories.

def group_weighted_mean(data, group_value, outcome_col=PRIMARY_OUTCOME, weight_col=PRIMARY_ANALYSIS_WEIGHT_COL):
    subset = data.loc[data[PRIMARY_TREATMENT].eq(group_value)]
    return weighted_mean(subset[outcome_col], subset[weight_col])

weighted_control_mean = group_weighted_mean(weighted_panel, 0)
weighted_treated_mean = group_weighted_mean(weighted_panel, 1)
weighted_difference = weighted_treated_mean - weighted_control_mean
weighted_lift = weighted_difference / weighted_control_mean

weighted_outcome_preview = pd.DataFrame(
    [
        {"estimate": "naive_difference", "value": naive_difference, "relative_lift": naive_lift},
        {"estimate": "weighted_difference_preview", "value": weighted_difference, "relative_lift": weighted_lift},
    ]
)

display(weighted_outcome_preview)
estimate value relative_lift
0 naive_difference 7.0755 0.0187
1 weighted_difference_preview -1.7389 -0.0046

The weighted preview shows whether adjustment changes the direction or size of the raw treated-control difference. The formal effect estimate still belongs in the next notebook, but this preview confirms that the weights have practical consequences.

Positivity and Weight Readiness Checklist

This checklist summarizes whether the weighted panel is ready for a marginal structural model. It does not prove causal identification, because unobserved confounding can still exist. It does check the observable requirements we can diagnose: overlap, stable weights, effective sample size, and improved covariate balance.

max_abs_smd_before = balance_diagnostics["abs_smd_unweighted"].max()
max_abs_smd_after = balance_diagnostics["abs_smd_weighted"].max()
mean_abs_smd_before = balance_diagnostics["abs_smd_unweighted"].mean()
mean_abs_smd_after = balance_diagnostics["abs_smd_weighted"].mean()
analysis_weight_ess = effective_sample_size(weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL])

readiness_checks = pd.DataFrame(
    [
        {
            "check": "no propensity below 0.05 or above 0.95 for most rows",
            "value": ((weighted_panel[PRIMARY_PROPENSITY_COL] < 0.05) | (weighted_panel[PRIMARY_PROPENSITY_COL] > 0.95)).mean(),
            "passes": ((weighted_panel[PRIMARY_PROPENSITY_COL] < 0.05) | (weighted_panel[PRIMARY_PROPENSITY_COL] > 0.95)).mean() < 0.05,
        },
        {
            "check": "analysis weight max below clipping cap",
            "value": weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL].max(),
            "passes": weighted_panel[PRIMARY_ANALYSIS_WEIGHT_COL].max() <= WEIGHT_CLIP_VALUE,
        },
        {
            "check": "effective sample size at least 70% of rows",
            "value": analysis_weight_ess / len(weighted_panel),
            "passes": analysis_weight_ess / len(weighted_panel) >= 0.70,
        },
        {
            "check": "mean absolute SMD improves after weighting",
            "value": mean_abs_smd_before - mean_abs_smd_after,
            "passes": mean_abs_smd_after < mean_abs_smd_before,
        },
        {
            "check": "maximum absolute SMD improves after weighting",
            "value": max_abs_smd_before - max_abs_smd_after,
            "passes": max_abs_smd_after < max_abs_smd_before,
        },
    ]
)

display(readiness_checks)
check value passes
0 no propensity below 0.05 or above 0.95 for mos... 0.0664 False
1 analysis weight max below clipping cap 10.0000 True
2 effective sample size at least 70% of rows 0.5110 False
3 mean absolute SMD improves after weighting 0.3311 True
4 maximum absolute SMD improves after weighting 1.1915 True

The readiness checklist turns the diagnostics into a clear decision point. Here three of the five checks pass: balance improves sharply, but about 6.6% of rows have extreme propensities and the effective sample size is roughly half the row count. The weighted panel is usable for Notebook 04, with the caveat that effect estimates there should include clipping and trimming sensitivity analyses; persistent failures would call for stronger treatment models or additional covariates.

Save the Weighted Panel and Diagnostics

The next notebook should not refit the propensity models. It should load the weighted panel saved here and focus on estimating the long-term effect. This cell saves the weighted panel plus supporting diagnostics so the modeling choices remain auditable.

weighted_panel_path = PROCESSED_DIR / "kuairec_long_term_weighted_panel.parquet"
propensity_metrics_path = PROCESSED_DIR / "kuairec_long_term_propensity_model_metrics.csv"
weight_diagnostics_path = PROCESSED_DIR / "kuairec_long_term_weight_diagnostics.csv"
balance_diagnostics_path = PROCESSED_DIR / "kuairec_long_term_balance_diagnostics.csv"
readiness_checks_path = PROCESSED_DIR / "kuairec_long_term_weight_readiness_checks.csv"

weighted_panel.to_parquet(weighted_panel_path, index=False)
propensity_model_metrics.to_csv(propensity_metrics_path, index=False)
weight_diagnostics.to_csv(weight_diagnostics_path, index=False)
balance_diagnostics.to_csv(balance_diagnostics_path, index=False)
readiness_checks.to_csv(readiness_checks_path, index=False)

print("Saved Notebook 03 artifacts:")
print(f"- {weighted_panel_path}")
print(f"- {propensity_metrics_path}")
print(f"- {weight_diagnostics_path}")
print(f"- {balance_diagnostics_path}")
print(f"- {readiness_checks_path}")
Saved Notebook 03 artifacts:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weighted_panel.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_propensity_model_metrics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weight_diagnostics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_balance_diagnostics.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_weight_readiness_checks.csv

The weighted panel and diagnostics are now persisted. This creates a clean handoff: Notebook 04 can estimate a marginal structural model using analysis_weight without repeating treatment-assignment modeling.

Takeaways and Next Step

This notebook showed that high-watch exposure is related to pre-treatment user history, which supports the claim that the problem involves time-varying confounding. It then estimated cross-fitted propensity scores, checked overlap and calibration, created stabilized inverse probability weights, and verified whether those weights improve observed covariate balance.

The next notebook should use data/processed/kuairec_long_term_weighted_panel.parquet to estimate the long-term causal effect with a marginal structural model. The default treatment column is treatment, the default outcome column is outcome, and the default weight column is analysis_weight.
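For concreteness, here is a minimal sketch of that handoff, assuming statsmodels is installed. The actual marginal structural model specification, covariance choices, and robustness checks belong to Notebook 04; this only shows how the saved columns plug together.

import pandas as pd
import statsmodels.api as sm

msm_panel = pd.read_parquet("data/processed/kuairec_long_term_weighted_panel.parquet")
design = sm.add_constant(msm_panel["treatment"].astype(float))
msm_fit = sm.WLS(msm_panel["outcome"], design, weights=msm_panel["analysis_weight"]).fit(
    cov_type="cluster", cov_kwds={"groups": msm_panel["user_id"]}  # repeated user-days share a user
)
print(msm_fit.params["treatment"])  # weighted treated-control effect on future 7-day interactions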