Robustness and Sensitivity for Discovery Quality Mediation

The previous notebook estimated the main discovery-quality mediation result: high discovery exposure was associated with higher future interaction volume, while the satisfaction-depth mediated pathway was small and slightly negative for the primary interaction-count outcome.

This notebook asks whether that story is fragile.

A good causal portfolio notebook should not stop after one preferred specification. It should ask what happens when reasonable choices change: the treatment threshold, the mediator definition, the outcome, the control set, the model form, and the placebo expectations.

The goal is not to prove the observational design is perfect. The goal is to show what is stable, what is sensitive, and what a careful reader should trust only with caveats.

1. Load Libraries and Paths

This cell imports the libraries used for robustness modeling and plotting. It also defines all inputs from the previous notebooks and all outputs created here.

from pathlib import Path
import os
import warnings

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("NUMEXPR_NUM_THREADS", "1")

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=FutureWarning)

sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (11, 6)
plt.rcParams["axes.titlesize"] = 13
plt.rcParams["axes.labelsize"] = 11
pd.set_option("display.max_columns", 180)
pd.set_option("display.max_colwidth", 140)

PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "data").exists() and PROJECT_ROOT.parent != PROJECT_ROOT:
    PROJECT_ROOT = PROJECT_ROOT.parent

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "discovery_quality_mediation"
WRITEUP_DIR = NOTEBOOK_DIR / "writeup"
FIGURE_DIR = WRITEUP_DIR / "figures"
TABLE_DIR = WRITEUP_DIR / "tables"

FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

ANALYSIS_PANEL_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_mediation_analysis_panel.parquet"
EFFECT_SUMMARY_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_effect_summary.csv"
ASSUMPTION_CHECKS_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_assumption_checks.csv"

THRESHOLD_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_robustness_thresholds.csv"
MEDIATOR_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_robustness_mediators.csv"
MODEL_SENSITIVITY_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_model_sensitivity.csv"
CONTINUOUS_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_continuous_score_sensitivity.csv"
PLACEBO_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_placebo_checks.csv"
ROBUSTNESS_SUMMARY_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_robustness_summary.csv"
LIMITATIONS_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_limitations.csv"

The thread limits keep the notebook predictable on a laptop or shared environment. This notebook runs many small linear models, so it gains little from aggressive parallelism in the underlying math libraries.
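One detail worth knowing: `os.environ.setdefault` only writes a value when the variable is unset, so an operator can still raise the caps from the shell, and the caps must be in place before NumPy is imported because the BLAS backends read them at load time. A minimal illustration of the setdefault behavior:

```python
import os

# setdefault only fills in a value when the variable is not already set,
# so a value exported from the shell wins over the notebook default.
os.environ["OMP_NUM_THREADS"] = "4"          # pretend the shell exported this
os.environ.setdefault("OMP_NUM_THREADS", "1")
print(os.environ["OMP_NUM_THREADS"])  # still "4"
```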

2. Load the Analysis Panel and Main Results

This cell loads the mediation analysis panel and the effect summary from notebook 04. The effect summary is the reference point for every stress test in this notebook.

analysis_panel = pd.read_parquet(ANALYSIS_PANEL_INPUT)
effect_summary = pd.read_csv(EFFECT_SUMMARY_INPUT)
assumption_checks = pd.read_csv(ASSUMPTION_CHECKS_INPUT)

load_summary = pd.DataFrame(
    {
        "artifact": ["analysis_panel", "effect_summary", "assumption_checks"],
        "rows": [len(analysis_panel), len(effect_summary), len(assumption_checks)],
        "columns": [analysis_panel.shape[1], effect_summary.shape[1], assumption_checks.shape[1]],
    }
)

main_primary = effect_summary.query(
    "outcome == 'Y_future_interactions' and estimand in ['gcomp_total_effect', 'natural_direct_effect', 'natural_indirect_effect']"
).copy()

display(load_summary)
display(main_primary[["outcome_label", "estimand", "estimate", "ci_95_lower", "ci_95_upper", "relative_effect"]].round(4))
artifact rows columns
0 analysis_panel 8199 103
1 effect_summary 21 12
2 assumption_checks 6 5
outcome_label estimand estimate ci_95_lower ci_95_upper relative_effect
2 Future 7-day interactions gcomp_total_effect 37.2781 32.2444 42.8135 0.1152
3 Future 7-day interactions natural_direct_effect 38.5858 33.3634 43.7154 0.1192
4 Future 7-day interactions natural_indirect_effect -1.3077 -2.3047 -0.4290 -0.0036

The main result is the anchor: a positive total effect on future interactions, a direct effect of similar size, and a small negative indirect pathway through the satisfaction-depth metric. The rest of this notebook checks how stable that pattern is.
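As a quick sanity check on the anchor numbers, the natural direct and indirect effects should sum to the total effect up to numerical noise. A minimal check using the point estimates copied from the table above:

```python
import numpy as np

# Point estimates copied from the effect-summary table above.
total_effect = 37.2781
direct_effect = 38.5858
indirect_effect = -1.3077

# Natural-effect decomposition: total = direct + indirect.
assert np.isclose(direct_effect + indirect_effect, total_effect, atol=1e-3)

# The mediated share is small and negative.
proportion_mediated = indirect_effect / total_effect
print(round(proportion_mediated, 4))
```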

3. Define Outcomes, Mediators, and Controls

This cell defines the variable sets used across robustness checks. The rich controls mirror earlier notebooks. The simple controls are intentionally smaller and help answer whether the result is only produced by a large adjustment set.

PRIMARY_OUTCOME = "Y_future_interactions"
OUTCOME_COLUMNS = ["Y_future_interactions", "Y_future_active_days", "Y_future_play_hours"]
OUTCOME_LABELS = {
    "Y_future_interactions": "Future 7-day interactions",
    "Y_future_active_days": "Future 7-day active days",
    "Y_future_play_hours": "Future 7-day play hours",
}
BASE_TREATMENT = "A_high_discovery"
BASE_MEDIATOR = "M_satisfaction_depth"
DISCOVERY_SCORE = "discovery_breadth_score"

base_numeric_covariates = [
    "calendar_day_index",
    "lag_1_active_day",
    "prior_3day_active_day",
    "lag_1_interactions",
    "prior_3day_interactions",
    "lag_1_total_play_duration_sec",
    "prior_3day_total_play_duration_sec",
    "lag_1_valid_play_share",
    "prior_3day_valid_play_share",
    "lag_1_high_satisfaction_share",
    "prior_3day_high_satisfaction_share",
    "lag_1_discovery_candidate_share",
    "prior_3day_discovery_candidate_share",
    "recent_activity_score",
    "register_days",
    "follow_user_num",
    "fans_user_num",
    "friend_user_num",
    "is_lowactive_period",
    "is_live_streamer",
    "is_video_author",
]
profile_onehot_covariates = [col for col in analysis_panel.columns if col.startswith("onehot_feat")]
rich_numeric_covariates = [col for col in base_numeric_covariates + profile_onehot_covariates if col in analysis_panel.columns]
rich_categorical_covariates = [
    col
    for col in [
        "user_active_degree",
        "follow_user_num_range",
        "fans_user_num_range",
        "friend_user_num_range",
        "register_days_range",
    ]
    if col in analysis_panel.columns
]

simple_numeric_covariates = [
    col
    for col in [
        "calendar_day_index",
        "lag_1_interactions",
        "prior_3day_interactions",
        "prior_3day_high_satisfaction_share",
        "prior_3day_discovery_candidate_share",
        "register_days",
    ]
    if col in analysis_panel.columns
]
simple_categorical_covariates = ["user_active_degree"] if "user_active_degree" in analysis_panel.columns else []

mediator_candidates = {
    "composite_satisfaction_depth": "M_satisfaction_depth",
    "high_watch_ratio_share": "high_satisfaction_share",
    "valid_play_share": "valid_play_share",
    "average_satisfaction_score": "avg_satisfaction_score",
    "completion_or_rewatch_share": "complete_or_rewatch_share",
}

control_summary = pd.DataFrame(
    {
        "control_set": ["rich", "simple"],
        "numeric_covariates": [len(rich_numeric_covariates), len(simple_numeric_covariates)],
        "categorical_covariates": [len(rich_categorical_covariates), len(simple_categorical_covariates)],
    }
)
mediator_summary = pd.DataFrame(
    {"mediator_label": list(mediator_candidates.keys()), "column": list(mediator_candidates.values())}
)

display(control_summary)
display(mediator_summary)
control_set numeric_covariates categorical_covariates
0 rich 39 5
1 simple 6 1
mediator_label column
0 composite_satisfaction_depth M_satisfaction_depth
1 high_watch_ratio_share high_satisfaction_share
2 valid_play_share valid_play_share
3 average_satisfaction_score avg_satisfaction_score
4 completion_or_rewatch_share complete_or_rewatch_share

The mediator candidates are all same-day quality signals, but they do not measure exactly the same thing. If the indirect pathway changes sign or size across these mediators, that tells us the mediation result is sensitive to how satisfaction is measured.

4. Build Reusable Estimation Helpers

This cell defines the robustness estimator. It follows the same g-computation logic as notebook 04 but adds options for treatment definitions, mediator definitions, control sets, weighting, and treatment-mediator interactions.

def build_covariate_matrix(frame, numeric_cols, categorical_cols):
    numeric = frame[numeric_cols].copy()
    for col in numeric.columns:
        numeric[col] = pd.to_numeric(numeric[col], errors="coerce")
        numeric[col] = numeric[col].fillna(numeric[col].median())

    if categorical_cols:
        categorical = frame[categorical_cols].copy()
        for col in categorical.columns:
            categorical[col] = categorical[col].astype("string").fillna("missing")
        encoded = pd.get_dummies(categorical, drop_first=True, dtype=float)
        design = pd.concat([numeric, encoded], axis=1)
    else:
        design = numeric

    design = design.loc[:, design.nunique(dropna=False) > 1]
    return design.astype(float)


def make_treatment_design(covariates, treatment_values, treatment_col):
    design = covariates.copy()
    design.insert(0, treatment_col, np.asarray(treatment_values, dtype=float))
    return design


def make_outcome_design(covariates, treatment_values, mediator_values, treatment_col, mediator_col, include_interaction=True):
    treatment_array = np.asarray(treatment_values, dtype=float)
    mediator_array = np.asarray(mediator_values, dtype=float)
    design = covariates.copy()
    if include_interaction:
        design.insert(0, "A_x_M", treatment_array * mediator_array)
    design.insert(0, mediator_col, mediator_array)
    design.insert(0, treatment_col, treatment_array)
    return design


def estimate_stabilized_weights(frame, treatment_col, covariates):
    treatment = frame[treatment_col].astype(int).to_numpy()
    treatment_rate = treatment.mean()
    if treatment_rate <= 0 or treatment_rate >= 1:
        return np.ones(len(frame)), pd.DataFrame([{"diagnostic": "degenerate_treatment", "value": 1.0}])

    propensity_model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, solver="lbfgs"),
    )
    propensity_model.fit(covariates, treatment)
    propensity = propensity_model.predict_proba(covariates)[:, 1].clip(0.02, 0.98)
    weights = np.where(treatment == 1, treatment_rate / propensity, (1 - treatment_rate) / (1 - propensity))
    cap = np.quantile(weights, 0.99)
    weights = np.clip(weights, None, cap)
    ess = weights.sum() ** 2 / np.square(weights).sum()

    diagnostics = pd.DataFrame(
        [
            {"diagnostic": "treatment_rate", "value": treatment_rate},
            {"diagnostic": "propensity_p05", "value": np.quantile(propensity, 0.05)},
            {"diagnostic": "propensity_p95", "value": np.quantile(propensity, 0.95)},
            {"diagnostic": "weight_max", "value": weights.max()},
            {"diagnostic": "effective_sample_size", "value": ess},
        ]
    )
    return weights, diagnostics


def average(values):
    return float(np.mean(np.asarray(values, dtype=float)))


def estimate_binary_mediation_spec(
    frame,
    treatment_col,
    mediator_col,
    outcome_col,
    numeric_cols,
    categorical_cols,
    spec_name,
    use_weights=True,
    include_interaction=True,
):
    working = frame.copy().reset_index(drop=True)
    covariates = build_covariate_matrix(working, numeric_cols, categorical_cols)
    A = working[treatment_col].astype(float).to_numpy()
    M = working[mediator_col].astype(float).to_numpy()
    Y = working[outcome_col].astype(float).to_numpy()

    if use_weights:
        weights, weight_diagnostics = estimate_stabilized_weights(working, treatment_col, covariates)
    else:
        weights = None
        weight_diagnostics = pd.DataFrame(
            [
                {"diagnostic": "treatment_rate", "value": A.mean()},
                {"diagnostic": "effective_sample_size", "value": len(working)},
            ]
        )

    mediator_X = make_treatment_design(covariates, A, treatment_col)
    mediator_model = LinearRegression().fit(mediator_X, M, sample_weight=weights)
    M0 = mediator_model.predict(make_treatment_design(covariates, np.zeros(len(working)), treatment_col)).clip(0, 1)
    M1 = mediator_model.predict(make_treatment_design(covariates, np.ones(len(working)), treatment_col)).clip(0, 1)

    outcome_X = make_outcome_design(covariates, A, M, treatment_col, mediator_col, include_interaction=include_interaction)
    outcome_model = LinearRegression().fit(outcome_X, Y, sample_weight=weights)

    y_0_m0 = outcome_model.predict(make_outcome_design(covariates, np.zeros(len(working)), M0, treatment_col, mediator_col, include_interaction=include_interaction))
    y_1_m0 = outcome_model.predict(make_outcome_design(covariates, np.ones(len(working)), M0, treatment_col, mediator_col, include_interaction=include_interaction))
    y_1_m1 = outcome_model.predict(make_outcome_design(covariates, np.ones(len(working)), M1, treatment_col, mediator_col, include_interaction=include_interaction))

    raw_high = working.loc[working[treatment_col].eq(1), outcome_col].mean()
    raw_low = working.loc[working[treatment_col].eq(0), outcome_col].mean()
    baseline_mean = average(y_0_m0)
    total_effect = average(y_1_m1 - y_0_m0)
    direct_effect = average(y_1_m0 - y_0_m0)
    indirect_effect = average(y_1_m1 - y_1_m0)
    mediator_shift = average(M1 - M0)

    result = {
        "spec_name": spec_name,
        "treatment_col": treatment_col,
        "mediator_col": mediator_col,
        "outcome": outcome_col,
        "outcome_label": OUTCOME_LABELS.get(outcome_col, outcome_col),
        "treatment_rate": A.mean(),
        "raw_high_minus_lower": raw_high - raw_low,
        "gcomp_total_effect": total_effect,
        "natural_direct_effect": direct_effect,
        "natural_indirect_effect": indirect_effect,
        "mediator_shift": mediator_shift,
        "proportion_mediated": indirect_effect / total_effect if not np.isclose(total_effect, 0) else np.nan,
        "reference_mean": baseline_mean,
        "relative_total_effect": total_effect / baseline_mean if not np.isclose(baseline_mean, 0) else np.nan,
        "uses_weights": use_weights,
        "include_interaction": include_interaction,
        "control_columns": covariates.shape[1],
        "mediator_model_r2": mediator_model.score(mediator_X, M, sample_weight=weights),
        "outcome_model_r2": outcome_model.score(outcome_X, Y, sample_weight=weights),
    }
    for _, row in weight_diagnostics.iterrows():
        result[f"weight_{row['diagnostic']}"] = row["value"]
    return result

print("Robustness estimator ready.")
Robustness estimator ready.

The estimator returns one row per specification. That row has the raw contrast, adjusted total effect, direct effect, indirect effect, mediator shift, treatment rate, model fit, and weight diagnostics.

5. Reproduce the Baseline Specification

Before stress-testing, this cell reruns the baseline specification from notebook 04 using the helper above. The goal is to confirm that the robustness machinery lands close to the saved main result.

baseline_spec = pd.DataFrame(
    [
        estimate_binary_mediation_spec(
            analysis_panel,
            treatment_col=BASE_TREATMENT,
            mediator_col=BASE_MEDIATOR,
            outcome_col=PRIMARY_OUTCOME,
            numeric_cols=rich_numeric_covariates,
            categorical_cols=rich_categorical_covariates,
            spec_name="baseline_median_high_discovery",
            use_weights=True,
            include_interaction=True,
        )
    ]
)

baseline_compare = baseline_spec[
    [
        "spec_name",
        "treatment_rate",
        "raw_high_minus_lower",
        "gcomp_total_effect",
        "natural_direct_effect",
        "natural_indirect_effect",
        "mediator_shift",
        "proportion_mediated",
        "relative_total_effect",
    ]
]

display(baseline_compare.round(4))
spec_name treatment_rate raw_high_minus_lower gcomp_total_effect natural_direct_effect natural_indirect_effect mediator_shift proportion_mediated relative_total_effect
0 baseline_median_high_discovery 0.5001 176.4257 37.2781 38.5858 -1.3077 0.0231 -0.0351 0.1152

The reproduced baseline should be very close to the saved notebook 04 estimates. Small differences can appear because this notebook re-estimates propensity weights internally, but the qualitative story should match.
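A lightweight way to formalize "very close" is a relative-tolerance comparison between the saved and reproduced estimates. This is a sketch, not part of the notebook's saved outputs; `close_enough` is a hypothetical helper, and the numbers are copied from the two tables above, where the reproduction happens to match exactly:

```python
def close_enough(saved, reproduced, rel_tol=0.05):
    """True when every reproduced estimate sits within rel_tol of the
    saved estimate, relative to the saved magnitude."""
    return all(
        abs(reproduced[key] - saved[key]) <= rel_tol * max(abs(saved[key]), 1e-8)
        for key in saved
    )

# Point estimates copied from the two tables above.
saved = {"total": 37.2781, "direct": 38.5858, "indirect": -1.3077}
reproduced = {"total": 37.2781, "direct": 38.5858, "indirect": -1.3077}
print(close_enough(saved, reproduced))
```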

6. Robustness to Alternative Discovery Thresholds

The baseline treatment uses the top half of discovery-breadth days. This cell re-runs that 50% baseline and adds stricter cutoffs: top 40%, top 30%, and top 25% of days. A stricter treatment asks whether the effect is still present when high discovery means a more extreme discovery day.

threshold_frame = analysis_panel.copy()
threshold_specs = [
    ("top_50_percent_baseline", 0.50),
    ("top_40_percent_discovery", 0.60),
    ("top_30_percent_discovery", 0.70),
    ("top_25_percent_discovery", 0.75),
]
threshold_rows = []

for spec_name, quantile_cutoff in threshold_specs:
    threshold_value = threshold_frame[DISCOVERY_SCORE].quantile(quantile_cutoff)
    treatment_col = f"A_{spec_name}"
    threshold_frame[treatment_col] = (threshold_frame[DISCOVERY_SCORE] >= threshold_value).astype("int8")
    result = estimate_binary_mediation_spec(
        threshold_frame,
        treatment_col=treatment_col,
        mediator_col=BASE_MEDIATOR,
        outcome_col=PRIMARY_OUTCOME,
        numeric_cols=rich_numeric_covariates,
        categorical_cols=rich_categorical_covariates,
        spec_name=spec_name,
        use_weights=True,
        include_interaction=True,
    )
    result["threshold_quantile"] = quantile_cutoff
    result["threshold_value"] = threshold_value
    threshold_rows.append(result)

threshold_results = pd.DataFrame(threshold_rows)

display(
    threshold_results[
        [
            "spec_name",
            "threshold_value",
            "treatment_rate",
            "gcomp_total_effect",
            "natural_direct_effect",
            "natural_indirect_effect",
            "mediator_shift",
            "relative_total_effect",
            "weight_effective_sample_size",
        ]
    ].round(4)
)
spec_name threshold_value treatment_rate gcomp_total_effect natural_direct_effect natural_indirect_effect mediator_shift relative_total_effect weight_effective_sample_size
0 top_50_percent_baseline 0.3300 0.5001 37.2781 38.5858 -1.3077 0.0231 0.1152 5701.3677
1 top_40_percent_discovery 0.3551 0.4000 40.0555 41.7088 -1.6532 0.0232 0.1227 6334.5853
2 top_30_percent_discovery 0.3814 0.3000 38.8190 40.4679 -1.6489 0.0230 0.1173 7016.2744
3 top_25_percent_discovery 0.3959 0.2500 39.7750 41.3340 -1.5590 0.0239 0.1196 7263.6365

This table checks whether the total effect disappears when the treatment becomes more selective. If effects are similar or larger under stricter definitions, the high-discovery story is not merely an artifact of the median split.
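One way to summarize the table in a single number: how far does the total effect move away from the 50% baseline, as a share of the baseline? A quick calculation on the point estimates copied from the threshold table above:

```python
import numpy as np

# Total effects copied from the threshold table above, 50% baseline first.
total_effects = np.array([37.2781, 40.0555, 38.8190, 39.7750])

# Largest move away from the baseline, as a share of the baseline estimate.
baseline = total_effects[0]
max_rel_deviation = float(np.max(np.abs(total_effects - baseline)) / baseline)
print(round(max_rel_deviation, 3))
```

The deviation stays under ten percent of the baseline, which is the "similar or larger under stricter definitions" pattern the paragraph above describes.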

7. Plot Threshold Sensitivity

This cell visualizes total, direct, and indirect effects across discovery thresholds. The plot makes it easier to see whether the effect decomposition changes smoothly or flips under stricter cutoffs.

threshold_plot = threshold_results.melt(
    id_vars=["spec_name", "threshold_quantile", "treatment_rate"],
    value_vars=["gcomp_total_effect", "natural_direct_effect", "natural_indirect_effect"],
    var_name="effect_type",
    value_name="estimate",
)
threshold_plot["effect_label"] = threshold_plot["effect_type"].map(
    {
        "gcomp_total_effect": "Total",
        "natural_direct_effect": "Direct",
        "natural_indirect_effect": "Indirect",
    }
)
threshold_plot["threshold_label"] = threshold_plot["spec_name"].str.replace("_", " ").str.title()

fig, ax = plt.subplots(figsize=(12, 6))
sns.lineplot(
    data=threshold_plot,
    x="treatment_rate",
    y="estimate",
    hue="effect_label",
    marker="o",
    linewidth=2,
    ax=ax,
)
ax.axhline(0, color="black", linewidth=1)
ax.invert_xaxis()
ax.set_title("Effect Sensitivity to High-Discovery Threshold")
ax.set_xlabel("Treatment rate, stricter thresholds to the right")
ax.set_ylabel("Effect on future 7-day interactions")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "17_threshold_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The threshold plot is a stability check. A jagged pattern would suggest that the result depends heavily on an arbitrary threshold. A smooth pattern is more reassuring.

8. Robustness to Alternative Mediator Definitions

The baseline mediator is a composite satisfaction-depth score. This cell reruns the mediation decomposition with simpler mediator candidates. The purpose is to test whether the indirect pathway is a property of satisfaction broadly or only of one engineered score.

mediator_rows = []
for mediator_label, mediator_col in mediator_candidates.items():
    result = estimate_binary_mediation_spec(
        analysis_panel,
        treatment_col=BASE_TREATMENT,
        mediator_col=mediator_col,
        outcome_col=PRIMARY_OUTCOME,
        numeric_cols=rich_numeric_covariates,
        categorical_cols=rich_categorical_covariates,
        spec_name=mediator_label,
        use_weights=True,
        include_interaction=True,
    )
    result["mediator_label"] = mediator_label
    mediator_rows.append(result)

mediator_results = pd.DataFrame(mediator_rows)

display(
    mediator_results[
        [
            "mediator_label",
            "mediator_col",
            "gcomp_total_effect",
            "natural_direct_effect",
            "natural_indirect_effect",
            "mediator_shift",
            "proportion_mediated",
            "mediator_model_r2",
        ]
    ].round(4)
)
mediator_label mediator_col gcomp_total_effect natural_direct_effect natural_indirect_effect mediator_shift proportion_mediated mediator_model_r2
0 composite_satisfaction_depth M_satisfaction_depth 37.2781 38.5858 -1.3077 0.0231 -0.0351 0.4653
1 high_watch_ratio_share high_satisfaction_share 37.2755 38.9900 -1.7145 0.0424 -0.0460 0.4301
2 valid_play_share valid_play_share 37.3502 37.2221 0.1281 0.0065 0.0034 0.5762
3 average_satisfaction_score avg_satisfaction_score 37.2560 38.5948 -1.3388 0.0174 -0.0359 0.4418
4 completion_or_rewatch_share complete_or_rewatch_share 37.2665 38.7399 -1.4734 0.0289 -0.0395 0.4126

This is one of the most important checks. If all satisfaction mediators show small mediated pathways, the notebook 04 conclusion is more stable. If one mediator dominates, the final report should be precise about which satisfaction definition matters.
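To make "small mediated pathways" concrete, the indirect estimates from the table above can be compared against the roughly 37-point total effect. The numbers below are copied from that table:

```python
import pandas as pd

# Indirect effects copied from the mediator table above.
indirect = pd.Series(
    {
        "composite_satisfaction_depth": -1.3077,
        "high_watch_ratio_share": -1.7145,
        "valid_play_share": 0.1281,
        "average_satisfaction_score": -1.3388,
        "completion_or_rewatch_share": -1.4734,
    }
)

# Every mediated pathway is under 5 percent of the ~37-point total effect,
# but valid_play_share flips sign, so the report should hedge its wording.
largest_share = float(indirect.abs().max() / 37.2781)
print(indirect.idxmin(), indirect.idxmax(), round(largest_share, 3))
```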

9. Plot Mediator Sensitivity

This cell plots the direct and indirect estimates across mediator definitions. It keeps the total effect visible as a reference but focuses attention on the mediated pathway.

mediator_plot = mediator_results.melt(
    id_vars=["mediator_label"],
    value_vars=["gcomp_total_effect", "natural_direct_effect", "natural_indirect_effect"],
    var_name="effect_type",
    value_name="estimate",
)
mediator_plot["effect_label"] = mediator_plot["effect_type"].map(
    {
        "gcomp_total_effect": "Total",
        "natural_direct_effect": "Direct",
        "natural_indirect_effect": "Indirect",
    }
)
mediator_plot["mediator_label_clean"] = mediator_plot["mediator_label"].str.replace("_", " ").str.title()

fig, ax = plt.subplots(figsize=(13, 6))
sns.barplot(
    data=mediator_plot,
    x="estimate",
    y="mediator_label_clean",
    hue="effect_label",
    ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Effect Decomposition Across Mediator Definitions")
ax.set_xlabel("Effect on future 7-day interactions")
ax.set_ylabel("Mediator definition")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "18_mediator_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The mediator plot makes the main sensitivity question visible. A consistent direct-effect bar with small indirect bars supports the idea that future interaction gains are not mainly routed through these satisfaction proxies.

10. Robustness Across Outcomes

This cell repeats the baseline decomposition for future interactions, future active days, and future play hours. Notebook 04 already estimated these with bootstrap intervals; here we keep the point-estimate grid alongside the other robustness checks.

outcome_rows = []
for outcome_col in OUTCOME_COLUMNS:
    result = estimate_binary_mediation_spec(
        analysis_panel,
        treatment_col=BASE_TREATMENT,
        mediator_col=BASE_MEDIATOR,
        outcome_col=outcome_col,
        numeric_cols=rich_numeric_covariates,
        categorical_cols=rich_categorical_covariates,
        spec_name=f"baseline_{outcome_col}",
        use_weights=True,
        include_interaction=True,
    )
    outcome_rows.append(result)

outcome_results = pd.DataFrame(outcome_rows)

display(
    outcome_results[
        [
            "outcome_label",
            "gcomp_total_effect",
            "natural_direct_effect",
            "natural_indirect_effect",
            "mediator_shift",
            "relative_total_effect",
            "proportion_mediated",
        ]
    ].round(4)
)
outcome_label gcomp_total_effect natural_direct_effect natural_indirect_effect mediator_shift relative_total_effect proportion_mediated
0 Future 7-day interactions 37.2781 38.5858 -1.3077 0.0231 0.1152 -0.0351
1 Future 7-day active days 0.0074 0.0021 0.0053 0.0231 0.0011 0.7175
2 Future 7-day play hours 0.0784 0.0739 0.0044 0.0231 0.1000 0.0562

Different outcomes answer different product questions. Future interactions measure volume, active days measure return frequency, and play hours measure attention. A result can be meaningful for one and weak for another.
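Because the outcomes live on different scales, the relative total effects are the comparable quantity. Reading them off the table above makes the contrast explicit:

```python
# Relative total effects copied from the outcome table above.
relative_effects = {
    "Future 7-day interactions": 0.1152,
    "Future 7-day active days": 0.0011,
    "Future 7-day play hours": 0.1000,
}

# Interaction volume and play hours move by roughly 10-12 percent of their
# baseline means, while return frequency barely moves at all.
strongest = max(relative_effects, key=relative_effects.get)
weakest = min(relative_effects, key=relative_effects.get)
print(strongest, "|", weakest)
```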

11. Sensitivity to Model Choices

This cell changes the model specification while holding the baseline treatment, mediator, and primary outcome fixed. It compares weighted versus unweighted regression, interaction versus no interaction, and rich versus simple controls.

model_specs = [
    {
        "spec_name": "weighted_rich_with_interaction",
        "use_weights": True,
        "include_interaction": True,
        "numeric_cols": rich_numeric_covariates,
        "categorical_cols": rich_categorical_covariates,
    },
    {
        "spec_name": "unweighted_rich_with_interaction",
        "use_weights": False,
        "include_interaction": True,
        "numeric_cols": rich_numeric_covariates,
        "categorical_cols": rich_categorical_covariates,
    },
    {
        "spec_name": "weighted_rich_no_interaction",
        "use_weights": True,
        "include_interaction": False,
        "numeric_cols": rich_numeric_covariates,
        "categorical_cols": rich_categorical_covariates,
    },
    {
        "spec_name": "weighted_simple_with_interaction",
        "use_weights": True,
        "include_interaction": True,
        "numeric_cols": simple_numeric_covariates,
        "categorical_cols": simple_categorical_covariates,
    },
    {
        "spec_name": "unweighted_simple_no_interaction",
        "use_weights": False,
        "include_interaction": False,
        "numeric_cols": simple_numeric_covariates,
        "categorical_cols": simple_categorical_covariates,
    },
]

model_rows = []
for spec in model_specs:
    result = estimate_binary_mediation_spec(
        analysis_panel,
        treatment_col=BASE_TREATMENT,
        mediator_col=BASE_MEDIATOR,
        outcome_col=PRIMARY_OUTCOME,
        numeric_cols=spec["numeric_cols"],
        categorical_cols=spec["categorical_cols"],
        spec_name=spec["spec_name"],
        use_weights=spec["use_weights"],
        include_interaction=spec["include_interaction"],
    )
    result["control_set"] = "rich" if spec["numeric_cols"] is rich_numeric_covariates else "simple"
    model_rows.append(result)

model_sensitivity = pd.DataFrame(model_rows)

display(
    model_sensitivity[
        [
            "spec_name",
            "uses_weights",
            "include_interaction",
            "control_set",
            "gcomp_total_effect",
            "natural_direct_effect",
            "natural_indirect_effect",
            "mediator_shift",
            "relative_total_effect",
        ]
    ].round(4)
)
spec_name uses_weights include_interaction control_set gcomp_total_effect natural_direct_effect natural_indirect_effect mediator_shift relative_total_effect
0 weighted_rich_with_interaction True True rich 37.2781 38.5858 -1.3077 0.0231 0.1152
1 unweighted_rich_with_interaction False True rich 36.4288 37.9141 -1.4853 0.0229 0.1130
2 weighted_rich_no_interaction True False rich 37.3139 38.4311 -1.1172 0.0231 0.1153
3 weighted_simple_with_interaction True True simple 37.6289 39.1871 -1.5581 0.0347 0.1169
4 unweighted_simple_no_interaction False False simple 34.8085 35.7445 -0.9360 0.0320 0.1077

This table tells us whether the result is mostly a modeling artifact. A stable total effect across these rows is reassuring. A highly variable indirect effect would tell us to be cautious about strong claims about the mediator pathway.
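"Stable across these rows" can be quantified as the range of the total effect relative to its mean over the five specifications. Using the point estimates copied from the model-sensitivity table above:

```python
import numpy as np

# Total effects copied from the model-sensitivity table above.
total_effects = np.array([37.2781, 36.4288, 37.3139, 37.6289, 34.8085])

# Range of the total effect as a share of its mean across the five specs.
spread = float((total_effects.max() - total_effects.min()) / total_effects.mean())
print(round(spread, 3))
```

A spread under ten percent of the mean supports the claim that the total effect is not driven by one narrow modeling choice.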

12. Plot Model-Specification Sensitivity

This cell visualizes the model-sensitivity table. It keeps direct and indirect pieces side by side for each model choice.

model_plot = model_sensitivity.melt(
    id_vars=["spec_name"],
    value_vars=["gcomp_total_effect", "natural_direct_effect", "natural_indirect_effect"],
    var_name="effect_type",
    value_name="estimate",
)
model_plot["effect_label"] = model_plot["effect_type"].map(
    {
        "gcomp_total_effect": "Total",
        "natural_direct_effect": "Direct",
        "natural_indirect_effect": "Indirect",
    }
)
model_plot["spec_label"] = model_plot["spec_name"].str.replace("_", " ").str.title()

fig, ax = plt.subplots(figsize=(13, 7))
sns.barplot(
    data=model_plot,
    x="estimate",
    y="spec_label",
    hue="effect_label",
    ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Effect Decomposition Across Model Choices")
ax.set_xlabel("Effect on future 7-day interactions")
ax.set_ylabel("Model choice")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "19_model_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The model-sensitivity plot is a compact way to see which part of the story is stable. In a strong result, the total effect should not depend on one narrow modeling choice.

13. Continuous Discovery Score Check

The binary treatment is easy to explain, but it loses information. This cell treats discovery_breadth_score as continuous and estimates adjusted associations per 0.10 increase in discovery breadth. This is not a natural-effect decomposition; it is a separate dose-style sensitivity check.

def estimate_continuous_association(frame, score_col, target_col, numeric_cols, categorical_cols, scale=0.10):
    """Fit a covariate-adjusted linear model and report the score's slope, rescaled per `scale` units."""
    covariates = build_covariate_matrix(frame, numeric_cols, categorical_cols)
    score = frame[score_col].astype(float).to_numpy()
    target = frame[target_col].astype(float).to_numpy()
    # Put the score in the first column so model.coef_[0] is its adjusted slope.
    design = covariates.copy()
    design.insert(0, score_col, score)
    model = LinearRegression().fit(design, target)
    coef = float(model.coef_[0])
    return {
        "target": target_col,
        "target_label": OUTCOME_LABELS.get(target_col, target_col),
        "score_col": score_col,
        "effect_per_0_10_score_increase": coef * scale,
        "raw_score_target_corr": frame[score_col].corr(frame[target_col], method="spearman"),
        "model_r2": model.score(design, target),
    }

continuous_rows = []
for target_col in OUTCOME_COLUMNS + [BASE_MEDIATOR]:
    continuous_rows.append(
        estimate_continuous_association(
            analysis_panel,
            DISCOVERY_SCORE,
            target_col,
            rich_numeric_covariates,
            rich_categorical_covariates,
        )
    )
continuous_sensitivity = pd.DataFrame(continuous_rows)

display(continuous_sensitivity.round(4))
target target_label score_col effect_per_0_10_score_increase raw_score_target_corr model_r2
0 Y_future_interactions Future 7-day interactions discovery_breadth_score 20.5518 0.5674 0.7220
1 Y_future_active_days Future 7-day active days discovery_breadth_score 0.0966 0.3980 0.5179
2 Y_future_play_hours Future 7-day play hours discovery_breadth_score 0.0447 0.5392 0.6999
3 M_satisfaction_depth M_satisfaction_depth discovery_breadth_score 0.0129 0.0930 0.4612

The continuous check asks whether more discovery breadth is associated with more future value without forcing a high-versus-low split. It should broadly agree with the binary result if the threshold is not doing all the work.
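One quick way to check that agreement is to multiply the continuous slope by the gap in mean discovery score between the high and lower groups and compare the result to the binary total effect. The numbers below are illustrative placeholders (the group means and slope are assumed, not taken from this notebook's tables):

```python
# Back-of-envelope agreement check between the continuous slope and the binary contrast.
# All values below are illustrative placeholders, not estimates from this notebook.
slope_per_unit_score = 205.5   # adjusted slope per 1.0 of discovery_breadth_score (assumed)
mean_score_high = 0.62         # mean score among high-discovery days (assumed)
mean_score_low = 0.41          # mean score among lower-discovery days (assumed)
binary_total_effect = 37.3     # g-computation total effect (assumed)

# If the binary split is not doing all the work, the slope times the score gap
# should land in the same ballpark as the binary estimate.
implied_binary_effect = slope_per_unit_score * (mean_score_high - mean_score_low)
ratio = implied_binary_effect / binary_total_effect
print(f"implied from slope: {implied_binary_effect:.1f}, ratio to binary estimate: {ratio:.2f}")
```

A ratio far from 1 would suggest the threshold choice, not the underlying dose-response, is driving the binary result.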

14. Placebo-Style Checks for Pre-Period Activity and Calendar Time

A dangerous pattern would be high discovery exposure strongly predicting variables that were measured before the treatment day. This cell checks raw and adjusted relationships between high discovery and pre-period activity, prior satisfaction, prior discovery, and calendar day.

placebo_targets = [
    "lag_1_interactions",
    "prior_3day_interactions",
    "prior_3day_high_satisfaction_share",
    "prior_3day_discovery_candidate_share",
    "calendar_day_index",
]

placebo_adjustment_numeric = [
    col
    for col in ["register_days", "follow_user_num", "fans_user_num", "friend_user_num", "is_lowactive_period", "is_live_streamer", "is_video_author"]
    if col in analysis_panel.columns
] + profile_onehot_covariates
placebo_adjustment_categorical = rich_categorical_covariates

placebo_rows = []
for target_col in placebo_targets:
    raw_high = analysis_panel.loc[analysis_panel[BASE_TREATMENT].eq(1), target_col].mean()
    raw_low = analysis_panel.loc[analysis_panel[BASE_TREATMENT].eq(0), target_col].mean()

    # Drop the placebo target from its own adjustment set so we never regress a variable on itself.
    numeric_cols = [col for col in placebo_adjustment_numeric if col != target_col]
    categorical_cols = [col for col in placebo_adjustment_categorical if col != target_col]
    covariates = build_covariate_matrix(analysis_panel, numeric_cols, categorical_cols)
    design = covariates.copy()
    design.insert(0, BASE_TREATMENT, analysis_panel[BASE_TREATMENT].astype(float).to_numpy())
    model = LinearRegression().fit(design, analysis_panel[target_col].astype(float).to_numpy())
    adjusted_difference = float(model.coef_[0])
    target_mean = float(analysis_panel[target_col].mean())

    placebo_rows.append(
        {
            "placebo_target": target_col,
            "raw_high_minus_lower": raw_high - raw_low,
            "adjusted_difference": adjusted_difference,
            "target_mean": target_mean,
            "relative_raw_difference": (raw_high - raw_low) / target_mean if not np.isclose(target_mean, 0) else np.nan,
        }
    )

placebo_checks = pd.DataFrame(placebo_rows)

display(placebo_checks.round(4))
placebo_target raw_high_minus_lower adjusted_difference target_mean relative_raw_difference
0 lag_1_interactions 18.3176 18.3670 51.0316 0.3589
1 prior_3day_interactions 49.0163 49.1383 152.5819 0.3212
2 prior_3day_high_satisfaction_share -0.0321 -0.0363 1.3416 -0.0239
3 prior_3day_discovery_candidate_share 0.3502 0.3493 1.0706 0.3271
4 calendar_day_index -15.6133 -15.6544 30.7969 -0.5070

The placebo-style checks are expected to show some pre-period differences because discovery exposure is not randomized. The point is to make that imbalance visible and connect it back to the need for adjustment and sensitivity analysis.
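A complementary, scale-free way to report that imbalance is the standardized mean difference (SMD), which divides the group gap by the pooled standard deviation so variables on very different scales become comparable. This is a minimal self-contained sketch with simulated data, not a computation on this notebook's panel:

```python
import numpy as np

def standardized_mean_difference(values, treated_mask):
    """Scale-free imbalance metric: mean difference divided by the pooled standard deviation."""
    treated = values[treated_mask]
    control = values[~treated_mask]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Toy example: a pre-period variable with a visible treated/control gap.
rng = np.random.default_rng(0)
pre_period = np.concatenate([rng.normal(55, 20, 500), rng.normal(45, 20, 500)])
treated = np.concatenate([np.ones(500, dtype=bool), np.zeros(500, dtype=bool)])
smd = standardized_mean_difference(pre_period, treated)
print(f"SMD: {smd:.3f}")
```

A common rule of thumb flags absolute SMDs above roughly 0.1 as meaningful imbalance, which makes it easy to rank the placebo targets by how worrying they are.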

15. Plot Placebo-Style Checks

This cell plots raw and adjusted pre-period differences. Large pre-period differences are not fatal by themselves, but they are a warning against overclaiming from observational logs.

placebo_plot = placebo_checks.melt(
    id_vars=["placebo_target"],
    value_vars=["raw_high_minus_lower", "adjusted_difference"],
    var_name="contrast_type",
    value_name="difference",
)
placebo_plot["contrast_type"] = placebo_plot["contrast_type"].map(
    {
        "raw_high_minus_lower": "Raw high - lower",
        "adjusted_difference": "Adjusted difference",
    }
)
placebo_plot["target_label"] = placebo_plot["placebo_target"].str.replace("_", " ").str.title()

fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(data=placebo_plot, x="difference", y="target_label", hue="contrast_type", ax=ax)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Placebo-Style Checks on Pre-Period Variables")
ax.set_xlabel("High-discovery minus lower-discovery difference")
ax.set_ylabel("Pre-period or calendar variable")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "20_placebo_preperiod_checks.png", dpi=160, bbox_inches="tight")
plt.show()

This plot helps communicate why the earlier balance and propensity diagnostics matter. The treatment is observational, so a final report should emphasize adjusted estimates and limitations rather than raw gaps.

16. Summarize Robustness Patterns

This cell creates a compact summary across robustness families. It focuses on whether the total effect stays positive and whether the indirect effect remains small relative to the total effect.

robustness_families = []

for family_name, frame in [
    ("threshold_sensitivity", threshold_results),
    ("mediator_sensitivity", mediator_results),
    ("model_sensitivity", model_sensitivity),
    ("outcome_sensitivity", outcome_results.query("outcome == @PRIMARY_OUTCOME")),
]:
    robustness_families.append(
        {
            "robustness_family": family_name,
            "specifications": len(frame),
            "total_effect_min": frame["gcomp_total_effect"].min(),
            "total_effect_max": frame["gcomp_total_effect"].max(),
            "total_effect_share_positive": (frame["gcomp_total_effect"] > 0).mean(),
            "indirect_effect_min": frame["natural_indirect_effect"].min(),
            "indirect_effect_max": frame["natural_indirect_effect"].max(),
            "indirect_effect_share_positive": (frame["natural_indirect_effect"] > 0).mean(),
            "median_abs_indirect_to_total_ratio": np.nanmedian(
                np.abs(frame["natural_indirect_effect"] / frame["gcomp_total_effect"].replace(0, np.nan))
            ),
        }
    )

robustness_summary = pd.DataFrame(robustness_families)

display(robustness_summary.round(4))
robustness_family specifications total_effect_min total_effect_max total_effect_share_positive indirect_effect_min indirect_effect_max indirect_effect_share_positive median_abs_indirect_to_total_ratio
0 threshold_sensitivity 4 37.2781 40.0555 1.0 -1.6532 -1.3077 0.0 0.0402
1 mediator_sensitivity 5 37.2560 37.3502 1.0 -1.7145 0.1281 0.2 0.0359
2 model_sensitivity 5 34.8085 37.6289 1.0 -1.5581 -0.9360 0.0 0.0351
3 outcome_sensitivity 1 37.2781 37.2781 1.0 -1.3077 -1.3077 0.0 0.0351

This summary turns many model runs into a few takeaways. The strongest claims are those that remain stable across robustness families; weaker claims move around or depend on a specific metric definition.
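Stability itself can be made concrete with a small helper that reports the share of specifications agreeing with the sign of the median estimate. This is a sketch on toy numbers chosen to illustrate the two patterns; the column names match this notebook's tables but the values are placeholders:

```python
import pandas as pd

def sign_stability(frame, effect_col):
    """Share of specifications where the effect keeps the sign of its median estimate."""
    effects = frame[effect_col].astype(float)
    median_sign = 1 if effects.median() >= 0 else -1
    return float((effects * median_sign > 0).mean())

# Toy values: a stable total effect and an indirect effect whose sign flips once.
demo = pd.DataFrame(
    {
        "gcomp_total_effect": [37.3, 34.8, 40.1, 37.6],
        "natural_indirect_effect": [-1.3, -1.7, 0.1, -0.9],
    }
)
print(sign_stability(demo, "gcomp_total_effect"))       # total effect: same sign everywhere
print(sign_stability(demo, "natural_indirect_effect"))  # indirect effect: one sign flip
```

A sign-stability of 1.0 for the total effect alongside a lower value for the indirect effect is exactly the "stable headline, fragile mechanism" pattern the summary table is meant to surface.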

17. Limitations Table

This cell writes a limitations table for the final report. These are not flaws to hide. They are the boundaries that make the analysis honest and more believable.

limitations = pd.DataFrame(
    [
        {
            "limitation": "observational_recommendation_logs",
            "why_it_matters": "High-discovery exposure was not randomized, so user preferences and recommender selection can confound estimates.",
            "mitigation_in_this_work": "Used pre-treatment history/profile controls, propensity diagnostics, and weighted sensitivity checks.",
            "remaining_risk": "Unobserved ranking-system state and user intent can still bias estimates.",
        },
        {
            "limitation": "mediator_outcome_confounding",
            "why_it_matters": "Satisfaction depth and future engagement may share unobserved causes.",
            "mitigation_in_this_work": "Checked residual mediator-outcome relationships and compared multiple mediator definitions.",
            "remaining_risk": "Natural direct and indirect effects rely on strong mediator assumptions.",
        },
        {
            "limitation": "satisfaction_proxy_measurement",
            "why_it_matters": "Watch ratio and valid play are proxies, not direct surveys of satisfaction.",
            "mitigation_in_this_work": "Compared composite satisfaction, high watch ratio, valid play, average score, and completion-style mediators.",
            "remaining_risk": "The chosen proxies may miss frustration, novelty value, or long-run satisfaction.",
        },
        {
            "limitation": "active_day_sample_selection",
            "why_it_matters": "The analysis conditions on active user-days, so it does not study days with no observed interactions.",
            "mitigation_in_this_work": "Defined the estimand clearly for active days and reported active-day alternate outcomes.",
            "remaining_risk": "Results may not generalize to dormant users or cold-start retention decisions.",
        },
        {
            "limitation": "calendar_and_history_imbalance",
            "why_it_matters": "High discovery days differ from lower discovery days in time and recent behavior.",
            "mitigation_in_this_work": "Included calendar/history controls and showed placebo-style pre-period checks.",
            "remaining_risk": "Residual time-varying confounding can remain after observed adjustment.",
        },
    ]
)

display(limitations)
limitation why_it_matters mitigation_in_this_work remaining_risk
0 observational_recommendation_logs High-discovery exposure was not randomized, so user preferences and recommender selection can confound estimates. Used pre-treatment history/profile controls, propensity diagnostics, and weighted sensitivity checks. Unobserved ranking-system state and user intent can still bias estimates.
1 mediator_outcome_confounding Satisfaction depth and future engagement may share unobserved causes. Checked residual mediator-outcome relationships and compared multiple mediator definitions. Natural direct and indirect effects rely on strong mediator assumptions.
2 satisfaction_proxy_measurement Watch ratio and valid play are proxies, not direct surveys of satisfaction. Compared composite satisfaction, high watch ratio, valid play, average score, and completion-style mediators. The chosen proxies may miss frustration, novelty value, or long-run satisfaction.
3 active_day_sample_selection The analysis conditions on active user-days, so it does not study days with no observed interactions. Defined the estimand clearly for active days and reported active-day alternate outcomes. Results may not generalize to dormant users or cold-start retention decisions.
4 calendar_and_history_imbalance High discovery days differ from lower discovery days in time and recent behavior. Included calendar/history controls and showed placebo-style pre-period checks. Residual time-varying confounding can remain after observed adjustment.

The limitations table gives the final report a professional tone. It shows that the analysis is careful about where the evidence is strong and where it remains assumption-dependent.

18. Save Robustness Outputs

This cell saves all robustness tables to data/processed and mirrors the main tables into the notebook writeup folder.

threshold_results.to_csv(THRESHOLD_OUTPUT, index=False)
mediator_results.to_csv(MEDIATOR_OUTPUT, index=False)
model_sensitivity.to_csv(MODEL_SENSITIVITY_OUTPUT, index=False)
continuous_sensitivity.to_csv(CONTINUOUS_OUTPUT, index=False)
placebo_checks.to_csv(PLACEBO_OUTPUT, index=False)
robustness_summary.to_csv(ROBUSTNESS_SUMMARY_OUTPUT, index=False)
limitations.to_csv(LIMITATIONS_OUTPUT, index=False)

threshold_results.to_csv(TABLE_DIR / "robustness_thresholds.csv", index=False)
mediator_results.to_csv(TABLE_DIR / "robustness_mediators.csv", index=False)
model_sensitivity.to_csv(TABLE_DIR / "model_sensitivity.csv", index=False)
continuous_sensitivity.to_csv(TABLE_DIR / "continuous_score_sensitivity.csv", index=False)
placebo_checks.to_csv(TABLE_DIR / "placebo_checks.csv", index=False)
robustness_summary.to_csv(TABLE_DIR / "robustness_summary.csv", index=False)
limitations.to_csv(TABLE_DIR / "limitations.csv", index=False)

saved_outputs = pd.DataFrame(
    {
        "artifact": [
            "threshold_robustness",
            "mediator_robustness",
            "model_sensitivity",
            "continuous_score_sensitivity",
            "placebo_checks",
            "robustness_summary",
            "limitations",
        ],
        "path": [
            str(THRESHOLD_OUTPUT),
            str(MEDIATOR_OUTPUT),
            str(MODEL_SENSITIVITY_OUTPUT),
            str(CONTINUOUS_OUTPUT),
            str(PLACEBO_OUTPUT),
            str(ROBUSTNESS_SUMMARY_OUTPUT),
            str(LIMITATIONS_OUTPUT),
        ],
    }
)

display(saved_outputs)
artifact path
0 threshold_robustness /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_robustness_thresholds.csv
1 mediator_robustness /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_robustness_mediators.csv
2 model_sensitivity /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_model_sensitivity.csv
3 continuous_score_sensitivity /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_continuous_score_sensitivity.csv
4 placebo_checks /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_placebo_checks.csv
5 robustness_summary /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_robustness_summary.csv
6 limitations /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_limitations.csv

The saved outputs make this robustness work reusable. A final report can pull directly from these tables without recomputing the sensitivity grid.

19. Notebook Takeaways

This notebook stress-tested the mediation result from several angles:

  • The total future-interaction effect is checked across stricter discovery thresholds.
  • The mediated pathway is checked across several satisfaction proxy definitions.
  • The result is compared across future interactions, active days, and play hours.
  • Model choices are varied through weighting, interactions, and control-set size.
  • Placebo-style checks make pre-period imbalance visible.
  • The limitations table records what the observational design cannot fully solve.

The next notebook should be the final report and figures notebook for this discovery-quality mediation project. It should turn the evidence into a concise portfolio-ready narrative with the key tables, plots, and caveats.