Metric Construction and Validation for Discovery Quality

The previous notebook created a user-day panel from KuaiRec and established a causal question: recommendation exposure can drive immediate activity, but product teams usually care about whether that exposure creates meaningful discovery, satisfaction, and durable return behavior. This notebook turns that setup into explicit measurement objects.

The measurement problem is subtle. A click-like metric is easy to increase, but it can reward low-value behavior: curiosity clicks, accidental plays, or short sessions that do not create satisfaction. A discovery-quality metric should do more than count immediate response. It should combine three ideas:

  • breadth of discovery: exposure to long-tail items and new categories rather than only familiar content;
  • depth of satisfaction: same-day signals that the user actually valued what was surfaced;
  • durable value: alignment with future return behavior, which is used only to validate a metric, never to define it.

A key causal discipline in this notebook is separation of roles. Exposure-like variables are candidates for treatments or policy levers. Satisfaction-like variables are plausible mediators. Future outcomes are used only for validation; they are never used to build the metric itself. That separation keeps later mediation analysis understandable.

1. Load Libraries and Paths

This cell imports the data, plotting, and utility libraries used throughout the notebook. It also defines the raw input paths from the setup notebook and the output paths for the metric panel, validation tables, and figures created here.

from pathlib import Path
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

warnings.filterwarnings("ignore", category=FutureWarning)

# Keep the visual style consistent with the earlier notebooks while staying readable in VS Code/Jupyter.
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (11, 6)
plt.rcParams["axes.titlesize"] = 13
plt.rcParams["axes.labelsize"] = 11
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_colwidth", 120)

# Detect the repository root whether the notebook is run from the repo root or from this notebook folder.
PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "data").exists() and PROJECT_ROOT.parent != PROJECT_ROOT:
    PROJECT_ROOT = PROJECT_ROOT.parent

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
NOTEBOOK_DIR = PROJECT_ROOT / "notebooks" / "discovery_quality_mediation"
WRITEUP_DIR = NOTEBOOK_DIR / "writeup"
FIGURE_DIR = WRITEUP_DIR / "figures"
TABLE_DIR = WRITEUP_DIR / "tables"

FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

PANEL_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_mediation_panel.parquet"
INTERACTIONS_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_interactions_sample.parquet"
ITEM_FEATURES_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_item_features.parquet"
READINESS_INPUT = PROCESSED_DIR / "kuairec_discovery_quality_readiness.csv"

METRIC_PANEL_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_metric_panel.parquet"
METRIC_REGISTRY_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_metric_registry.csv"
METRIC_VALIDATION_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_metric_validation.csv"
METRIC_CORRELATION_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_metric_correlations.csv"
METRIC_DECILE_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_metric_deciles.csv"
METRIC_SELECTION_OUTPUT = PROCESSED_DIR / "kuairec_discovery_quality_selected_metrics.csv"

The directory layout intentionally follows the pattern used by the rest of the repository: durable data artifacts go to data/processed, while notebook-specific tables and figures go to the local writeup folder. That keeps the notebook readable and lets downstream notebooks load a stable artifact instead of recreating everything.

2. Load the Setup Artifacts

This cell loads the active user-day mediation panel and the supporting interaction/item artifacts from the setup notebook. The panel is the main unit of analysis: one row is one active user on one calendar day, with same-day exposure and satisfaction signals plus future engagement outcomes.

metric_base = pd.read_parquet(PANEL_INPUT)
interactions = pd.read_parquet(INTERACTIONS_INPUT)
item_features = pd.read_parquet(ITEM_FEATURES_INPUT)
readiness = pd.read_csv(READINESS_INPUT)

load_summary = pd.DataFrame(
    {
        "artifact": ["mediation_panel", "interaction_sample", "item_features", "readiness_checks"],
        "rows": [len(metric_base), len(interactions), len(item_features), len(readiness)],
        "columns": [metric_base.shape[1], interactions.shape[1], item_features.shape[1], readiness.shape[1]],
    }
)

display(load_summary)
display(readiness)
artifact rows columns
0 mediation_panel 8199 81
1 interaction_sample 440788 37
2 item_features 3327 27
3 readiness_checks 6 3
check value notes
0 active_user_days 8199.000000 Rows available for active-day mediation setup.
1 sampled_users 133.000000 Users represented in the mediation panel.
2 treatment_rate 0.501525 Should be neither near 0 nor near 1.
3 mediator_satisfaction_std 0.176618 Mediator must vary across user-days.
4 future_7day_interactions_std 180.363332 Outcome must vary across user-days.
5 max_key_variable_missing_rate 0.000000 Key variables should be complete or nearly complete.

The readiness table is the first sanity check for this notebook. The treatment rate is close to balanced, the mediator and future outcomes vary, and key missingness is zero. That means metric validation can focus on substantive behavior rather than basic data repair.
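
If a future run drifts, it is better to fail loudly than to continue with a degenerate sample. Below is a minimal guard sketch, assuming the check names shown in the readiness table above; the thresholds are illustrative choices, not values from the setup notebook.

checks = readiness.set_index("check")["value"]

# Illustrative guardrails; tighten or loosen to match the project's tolerances.
assert 0.05 < checks["treatment_rate"] < 0.95, "Treatment is nearly one-sided."
assert checks["mediator_satisfaction_std"] > 0, "Mediator does not vary."
assert checks["future_7day_interactions_std"] > 0, "Outcome does not vary."
assert checks["max_key_variable_missing_rate"] < 0.01, "Key variables have gaps."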

3. State the Measurement Contract

Before building formulas, this cell creates a small data dictionary for the key columns used here. This prevents the notebook from becoming a bag of engineered features. Each metric has to map back to a causal role: exposure, mediator, outcome, or history control.

measurement_contract = pd.DataFrame(
    [
        {
            "role": "exposure",
            "column": "discovery_candidate_share",
            "meaning": "Share of same-day interactions that are platform long-tail or first category exposures for the user.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "exposure",
            "column": "long_tail_share",
            "meaning": "Share of same-day interactions with lower platform-level exposure items.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "exposure",
            "column": "new_category_share",
            "meaning": "Share of same-day interactions from categories not previously seen for that user in the sampled history.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "mediator",
            "column": "high_satisfaction_share",
            "meaning": "Share of interactions with watch ratio at least 0.8.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "mediator",
            "column": "valid_play_share",
            "meaning": "Share of interactions with enough watch time to look like a real play.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "mediator",
            "column": "short_abandon_share",
            "meaning": "Share of interactions with very low watch ratio, used as a quality guardrail.",
            "used_for_metric_building": True,
            "uses_future_information": False,
        },
        {
            "role": "outcome",
            "column": "outcome_future_7day_interactions",
            "meaning": "Seven-day future interaction count after the current day.",
            "used_for_metric_building": False,
            "uses_future_information": True,
        },
        {
            "role": "outcome",
            "column": "outcome_future_7day_active_days",
            "meaning": "Seven-day future active-day count after the current day.",
            "used_for_metric_building": False,
            "uses_future_information": True,
        },
        {
            "role": "history_control",
            "column": "prior_3day_interactions",
            "meaning": "Recent activity before the current day, used to detect whether a metric merely tracks user activity level.",
            "used_for_metric_building": False,
            "uses_future_information": False,
        },
    ]
)

display(measurement_contract)
role column meaning used_for_metric_building uses_future_information
0 exposure discovery_candidate_share Share of same-day interactions that are platform long-tail or first category exposures for the user. True False
1 exposure long_tail_share Share of same-day interactions with lower platform-level exposure items. True False
2 exposure new_category_share Share of same-day interactions from categories not previously seen for that user in the sampled history. True False
3 mediator high_satisfaction_share Share of interactions with watch ratio at least 0.8. True False
4 mediator valid_play_share Share of interactions with enough watch time to look like a real play. True False
5 mediator short_abandon_share Share of interactions with very low watch ratio, used as a quality guardrail. True False
6 outcome outcome_future_7day_interactions Seven-day future interaction count after the current day. False True
7 outcome outcome_future_7day_active_days Seven-day future active-day count after the current day. False True
8 history_control prior_3day_interactions Recent activity before the current day, used to detect whether a metric merely tracks user activity level. False False

The contract makes one important rule explicit: future outcomes validate metrics but do not define them. That is what makes the later evidence more credible. If a metric were built from future behavior, its validation would be circular.
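
The rule can also be enforced mechanically rather than by convention. A small sketch using only the measurement_contract table defined above: any column flagged as both a metric input and a user of future information fails the check.

# Enforce the no-future-information rule directly from the contract table.
leaky = measurement_contract.query("used_for_metric_building and uses_future_information")
assert leaky.empty, f"Metric inputs use future information: {leaky['column'].tolist()}"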

4. Create Normalization Helpers

Metrics are easier to compare when they live on similar scales. This cell defines small helpers for clipping rates, min-max scaling, Spearman correlation, and decile assignment. The helpers also handle constant columns gracefully so the notebook fails less often when the sample changes.

def clip_rate(series):
    """Keep a rate-style series within [0, 1] while preserving missing values."""
    return series.astype(float).clip(lower=0, upper=1)


def minmax_score(series):
    """Scale a numeric series to [0, 1]; return 0.5 if the column has no variation."""
    values = series.astype(float)
    min_value = values.min(skipna=True)
    max_value = values.max(skipna=True)
    if pd.isna(min_value) or pd.isna(max_value) or np.isclose(max_value, min_value):
        return pd.Series(0.5, index=series.index)
    return (values - min_value) / (max_value - min_value)


def safe_spearman(frame, left, right):
    """Spearman correlation with guardrails for constant or missing columns."""
    pair = frame[[left, right]].dropna()
    if len(pair) < 3 or pair[left].nunique() < 2 or pair[right].nunique() < 2:
        return np.nan
    return pair[left].corr(pair[right], method="spearman")


def add_decile(frame, metric):
    """Assign stable deciles even when many rows share the same metric value."""
    ranked = frame[metric].rank(method="first")
    return pd.qcut(ranked, q=10, labels=False, duplicates="drop") + 1

helper_summary = pd.DataFrame(
    {
        "helper": ["clip_rate", "minmax_score", "safe_spearman", "add_decile"],
        "purpose": [
            "Protect rate-like inputs from drifting outside [0, 1].",
            "Place count or score variables on a comparable [0, 1] scale.",
            "Compute rank correlation without breaking on degenerate columns.",
            "Create ordered metric groups for top-versus-bottom validation.",
        ],
    }
)

display(helper_summary)
helper purpose
0 clip_rate Protect rate-like inputs from drifting outside [0, 1].
1 minmax_score Place count or score variables on a comparable [0, 1] scale.
2 safe_spearman Compute rank correlation without breaking on degenerate columns.
3 add_decile Create ordered metric groups for top-versus-bottom validation.

These helpers are deliberately simple. The goal is not to hide modeling complexity; it is to make each metric definition readable and reproducible. Later notebooks can replace a hand-built score with learned weights, but this notebook starts with transparent measurement.

5. Build Metric Components

This cell prepares reusable components from the user-day panel. Some columns are already rates; others need transformation. For example, raw interaction counts are converted to a scaled log-volume score so high-volume days matter without letting extreme activity dominate every metric.

metric_panel = metric_base.copy()
metric_panel = metric_panel.sort_values(["user_id", "event_date"]).reset_index(drop=True)

# Category breadth is another discovery signal: a higher ratio of unique categories to interactions marks a broader day.
metric_panel["category_breadth_rate"] = (
    metric_panel["unique_categories"] / metric_panel["interactions"].replace(0, np.nan)
).fillna(0)
metric_panel["category_breadth_rate"] = clip_rate(metric_panel["category_breadth_rate"])

# Scaled volume keeps activity information available without treating raw clicks as the target metric.
metric_panel["engagement_volume_score"] = minmax_score(np.log1p(metric_panel["interactions"]))
metric_panel["recent_activity_score"] = minmax_score(np.log1p(metric_panel["prior_3day_interactions"]))

rate_inputs = [
    "discovery_candidate_share",
    "long_tail_share",
    "new_category_share",
    "valid_play_share",
    "high_satisfaction_share",
    "complete_or_rewatch_share",
    "short_abandon_share",
    "avg_satisfaction_score",
]
for col in rate_inputs:
    metric_panel[col] = clip_rate(metric_panel[col])

component_summary = metric_panel[
    [
        "category_breadth_rate",
        "engagement_volume_score",
        "recent_activity_score",
        "discovery_candidate_share",
        "long_tail_share",
        "new_category_share",
        "valid_play_share",
        "high_satisfaction_share",
        "complete_or_rewatch_share",
        "short_abandon_share",
        "avg_satisfaction_score",
    ]
].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]).T

display(component_summary)
count mean std min 10% 25% 50% 75% 90% max
category_breadth_rate 8199.0 0.426961 0.163345 0.105802 0.257732 0.313433 0.396226 0.500000 0.625000 1.0
engagement_volume_score 8199.0 0.608386 0.157664 0.000000 0.416686 0.535855 0.640961 0.715275 0.771506 1.0
recent_activity_score 8199.0 0.774028 0.143744 0.000000 0.629895 0.733243 0.808920 0.857085 0.892857 1.0
discovery_candidate_share 8199.0 0.364368 0.172628 0.000000 0.090909 0.271010 0.390244 0.480570 0.555556 1.0
long_tail_share 8199.0 0.351875 0.164697 0.000000 0.090909 0.260572 0.377551 0.466667 0.539474 1.0
new_category_share 8199.0 0.020453 0.085513 0.000000 0.000000 0.000000 0.000000 0.000000 0.029733 1.0
valid_play_share 8199.0 0.939417 0.095783 0.000000 0.830717 0.920000 0.975610 1.000000 1.000000 1.0
high_satisfaction_share 8199.0 0.468649 0.176618 0.000000 0.257489 0.349699 0.461538 0.575758 0.692308 1.0
complete_or_rewatch_share 8199.0 0.323119 0.171901 0.000000 0.128968 0.208333 0.302326 0.412944 0.533537 1.0
short_abandon_share 8199.0 0.101160 0.100637 0.000000 0.000000 0.037736 0.076923 0.133333 0.214286 1.0
avg_satisfaction_score 8199.0 0.524125 0.101305 0.000000 0.409583 0.460297 0.516842 0.578268 0.647112 1.0

The components cover three different measurement families: breadth of discovery, depth of satisfaction, and volume of immediate activity. Keeping these separate lets us later ask whether a composite metric is truly adding information or just renaming clicks.

6. Construct Candidate Discovery-Quality Metrics

This cell creates the candidate metrics. The formulas are transparent on purpose:

  • discovery_breadth_score is exposure-like and can support treatment definitions.
  • satisfaction_depth_score is mediator-like and should not be treated as pre-exposure.
  • quality_adjusted_discovery_score combines discovery and satisfaction, making it useful as a product metric but not as a clean treatment in mediation.
  • shallow_click_pressure_score is a guardrail: high immediate volume paired with weak satisfaction.
  • balanced_discovery_quality_score averages breadth and depth arithmetically, kept for comparison with the geometric version.
  • volume_weighted_quality_score scales quality-adjusted discovery by same-day log interaction volume, so it is more click-sensitive than the pure quality scores.

metric_panel["discovery_breadth_score"] = clip_rate(
    0.45 * metric_panel["discovery_candidate_share"]
    + 0.30 * metric_panel["long_tail_share"]
    + 0.15 * metric_panel["new_category_share"]
    + 0.10 * metric_panel["category_breadth_rate"]
)

metric_panel["satisfaction_depth_score"] = clip_rate(
    0.30 * metric_panel["high_satisfaction_share"]
    + 0.25 * metric_panel["valid_play_share"]
    + 0.20 * metric_panel["avg_satisfaction_score"]
    + 0.15 * metric_panel["complete_or_rewatch_share"]
    + 0.10 * (1 - metric_panel["short_abandon_share"])
)

metric_panel["quality_adjusted_discovery_score"] = np.sqrt(
    metric_panel["discovery_breadth_score"] * metric_panel["satisfaction_depth_score"]
)

metric_panel["balanced_discovery_quality_score"] = clip_rate(
    0.50 * metric_panel["discovery_breadth_score"]
    + 0.50 * metric_panel["satisfaction_depth_score"]
)

metric_panel["volume_weighted_quality_score"] = minmax_score(
    np.log1p(metric_panel["interactions"]) * metric_panel["quality_adjusted_discovery_score"]
)

metric_panel["shallow_click_pressure_score"] = clip_rate(
    metric_panel["engagement_volume_score"] * (1 - metric_panel["satisfaction_depth_score"])
)

# These binary flags are convenient for summaries. Only the discovery-breadth flag is exposure-like.
metric_panel["high_discovery_breadth_day"] = (
    metric_panel["discovery_breadth_score"] >= metric_panel["discovery_breadth_score"].median()
).astype("int8")
metric_panel["high_quality_adjusted_discovery_day"] = (
    metric_panel["quality_adjusted_discovery_score"] >= metric_panel["quality_adjusted_discovery_score"].median()
).astype("int8")

candidate_metrics = [
    "discovery_breadth_score",
    "satisfaction_depth_score",
    "quality_adjusted_discovery_score",
    "balanced_discovery_quality_score",
    "volume_weighted_quality_score",
    "shallow_click_pressure_score",
]

metric_registry = pd.DataFrame(
    [
        {
            "metric": "discovery_breadth_score",
            "role": "exposure_like",
            "formula_summary": "Weighted mix of discovery-candidate, long-tail, new-category, and category-breadth rates.",
            "causal_use": "Candidate exposure/treatment measure for later mediation.",
            "caution": "Still observational; high discovery days may reflect user preference and recommender selection.",
        },
        {
            "metric": "satisfaction_depth_score",
            "role": "mediator_like",
            "formula_summary": "Weighted mix of high watch ratio, valid plays, satisfaction score, completion, and low abandonment.",
            "causal_use": "Candidate mediator between discovery exposure and future value.",
            "caution": "Post-exposure signal; do not use as a pre-treatment control.",
        },
        {
            "metric": "quality_adjusted_discovery_score",
            "role": "composite_product_metric",
            "formula_summary": "Geometric mean of discovery breadth and satisfaction depth.",
            "causal_use": "Useful for monitoring and policy ranking; not a clean treatment in mediation.",
            "caution": "Combines exposure and mediator information.",
        },
        {
            "metric": "balanced_discovery_quality_score",
            "role": "composite_product_metric",
            "formula_summary": "Arithmetic average of discovery breadth and satisfaction depth.",
            "causal_use": "Transparent composite metric for comparison with the geometric version.",
            "caution": "Can look good when only one component is high.",
        },
        {
            "metric": "volume_weighted_quality_score",
            "role": "business_metric",
            "formula_summary": "Quality-adjusted discovery scaled by same-day log interaction volume.",
            "causal_use": "Useful for product prioritization where scale matters.",
            "caution": "More click-sensitive than the pure quality metrics.",
        },
        {
            "metric": "shallow_click_pressure_score",
            "role": "guardrail",
            "formula_summary": "High immediate volume combined with low satisfaction depth.",
            "causal_use": "Negative signal or guardrail for click-heavy recommendation behavior.",
            "caution": "A high value is bad; correlations should be read with the sign reversed.",
        },
    ]
)

metric_distribution = metric_panel[candidate_metrics].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]).T

display(metric_registry)
display(metric_distribution)
metric role formula_summary causal_use caution
0 discovery_breadth_score exposure_like Weighted mix of discovery-candidate, long-tail, new-category, and category-breadth rates. Candidate exposure/treatment measure for later mediation. Still observational; high discovery days may reflect user preference and recommender selection.
1 satisfaction_depth_score mediator_like Weighted mix of high watch ratio, valid plays, satisfaction score, completion, and low abandonment. Candidate mediator between discovery exposure and future value. Post-exposure signal; do not use as a pre-treatment control.
2 quality_adjusted_discovery_score composite_product_metric Geometric mean of discovery breadth and satisfaction depth. Useful for monitoring and policy ranking; not a clean treatment in mediation. Combines exposure and mediator information.
3 balanced_discovery_quality_score composite_product_metric Arithmetic average of discovery breadth and satisfaction depth. Transparent composite metric for comparison with the geometric version. Can look good when only one component is high.
4 volume_weighted_quality_score business_metric Quality-adjusted discovery scaled by same-day log interaction volume. Useful for product prioritization where scale matters. More click-sensitive than the pure quality metrics.
5 shallow_click_pressure_score guardrail High immediate volume combined with low satisfaction depth. Negative signal or guardrail for click-heavy recommendation behavior. A high value is bad; correlations should be read with the sign reversed.
count mean std min 10% 25% 50% 75% 90% max
discovery_breadth_score 8199.0 0.315292 0.122804 0.022222 0.122665 0.247618 0.330000 0.395932 0.452737 0.925000
satisfaction_depth_score 8199.0 0.618626 0.112600 0.000000 0.482869 0.547525 0.617436 0.685637 0.753525 1.000000
quality_adjusted_discovery_score 8199.0 0.429600 0.106185 0.000000 0.275625 0.374607 0.443178 0.499683 0.548292 0.921954
balanced_discovery_quality_score 8199.0 0.466959 0.085621 0.050000 0.359290 0.413900 0.470779 0.523202 0.569368 0.925000
volume_weighted_quality_score 8199.0 0.487750 0.167229 0.000000 0.227026 0.400079 0.518041 0.605890 0.674227 1.000000
shallow_click_pressure_score 8199.0 0.233236 0.089682 0.000000 0.116157 0.178974 0.236637 0.291770 0.341866 0.614382

The registry is the most important table in the notebook. It says which metrics are appropriate for causal exposure analysis and which ones are better treated as mediators or monitoring metrics. That distinction prevents post-treatment leakage in later notebooks.

7. Visualize Metric Distributions

A metric that is almost always the same value is not useful for causal analysis or product monitoring. This cell plots each candidate metric distribution so we can see spread, skew, and whether any score collapses to a narrow band.

metric_long = metric_panel[candidate_metrics].melt(var_name="metric", value_name="value")

# Plot the metrics in the same order they were defined in the candidate list.
metric_name_order = candidate_metrics

fig, axes = plt.subplots(3, 2, figsize=(14, 11), sharex=False)
axes = axes.flatten()
for ax, metric in zip(axes, metric_name_order):
    sns.histplot(
        data=metric_long.query("metric == @metric"),
        x="value",
        bins=35,
        color="steelblue",
        edgecolor="white",
        ax=ax,
    )
    ax.axvline(metric_panel[metric].median(), color="black", linestyle="--", linewidth=1)
    ax.set_title(metric.replace("_", " ").title())
    ax.set_xlabel("Score")
    ax.set_ylabel("User-days")

plt.tight_layout()
fig.savefig(FIGURE_DIR / "05_metric_distributions.png", dpi=160, bbox_inches="tight")
plt.show()

The distributions show whether the metrics provide enough variation for grouping and modeling. The median reference line is also a practical reminder that later binary contrasts should be based on a score with real spread, not on a degenerate indicator.
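
The visual check can be paired with a numeric one. A minimal sketch that flags near-degenerate scores; the interquartile-range cutoff of 0.05 is an illustrative threshold, not a calibrated standard.

# Flag candidate metrics whose spread is too narrow to support meaningful grouping.
spread_check = pd.DataFrame(
    {
        "iqr": metric_panel[candidate_metrics].quantile(0.75)
        - metric_panel[candidate_metrics].quantile(0.25),
        "std": metric_panel[candidate_metrics].std(),
    }
)
spread_check["degenerate"] = spread_check["iqr"] < 0.05  # illustrative cutoff
display(spread_check.round(3))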

8. Check Same-Day Metric Relationships

This cell calculates correlations among candidate metrics and the raw same-day components. The purpose is to understand what each metric is mostly measuring. A good composite should relate to discovery and satisfaction, while the guardrail should move in the opposite direction from satisfaction.

same_day_columns = candidate_metrics + [
    "interactions",
    "engagement_volume_score",
    "discovery_candidate_share",
    "long_tail_share",
    "new_category_share",
    "category_breadth_rate",
    "valid_play_share",
    "high_satisfaction_share",
    "complete_or_rewatch_share",
    "short_abandon_share",
]

same_day_correlation = metric_panel[same_day_columns].corr(method="spearman")

fig, ax = plt.subplots(figsize=(13, 10))
sns.heatmap(
    same_day_correlation,
    cmap="coolwarm",
    center=0,
    linewidths=0.4,
    cbar_kws={"label": "Spearman correlation"},
    ax=ax,
)
ax.set_title("Same-Day Metric and Component Relationships")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "06_same_day_metric_correlation_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

display(same_day_correlation.round(3))

discovery_breadth_score satisfaction_depth_score quality_adjusted_discovery_score balanced_discovery_quality_score volume_weighted_quality_score shallow_click_pressure_score interactions engagement_volume_score discovery_candidate_share long_tail_share new_category_share category_breadth_rate valid_play_share high_satisfaction_share complete_or_rewatch_share short_abandon_share
discovery_breadth_score 1.000 0.093 0.904 0.745 0.792 0.208 0.410 0.410 0.993 0.949 0.263 -0.344 -0.069 0.108 0.101 -0.013
satisfaction_depth_score 0.093 1.000 0.439 0.688 0.289 -0.713 -0.052 -0.052 0.090 0.083 0.033 0.019 0.535 0.955 0.864 -0.617
quality_adjusted_discovery_score 0.904 0.439 1.000 0.934 0.838 -0.073 0.373 0.373 0.898 0.859 0.251 -0.313 0.142 0.431 0.392 -0.241
balanced_discovery_quality_score 0.745 0.688 0.934 1.000 0.742 -0.310 0.255 0.255 0.737 0.700 0.208 -0.220 0.292 0.669 0.611 -0.405
volume_weighted_quality_score 0.792 0.289 0.838 0.742 1.000 0.268 0.774 0.774 0.833 0.852 0.183 -0.702 0.028 0.282 0.258 -0.152
shallow_click_pressure_score 0.208 -0.713 -0.073 -0.310 0.268 1.000 0.655 0.655 0.254 0.300 0.027 -0.569 -0.498 -0.682 -0.611 0.488
interactions 0.410 -0.052 0.373 0.255 0.774 0.655 1.000 1.000 0.482 0.550 0.063 -0.910 -0.160 -0.049 -0.041 0.060
engagement_volume_score 0.410 -0.052 0.373 0.255 0.774 0.655 1.000 1.000 0.482 0.550 0.063 -0.910 -0.160 -0.049 -0.041 0.060
discovery_candidate_share 0.993 0.090 0.898 0.737 0.833 0.254 0.482 0.482 1.000 0.956 0.269 -0.430 -0.079 0.104 0.098 -0.012
long_tail_share 0.949 0.083 0.859 0.700 0.852 0.300 0.550 0.550 0.956 1.000 0.105 -0.497 -0.085 0.097 0.090 -0.006
new_category_share 0.263 0.033 0.251 0.208 0.183 0.027 0.063 0.063 0.269 0.105 1.000 -0.032 -0.032 0.036 0.040 -0.020
category_breadth_rate -0.344 0.019 -0.313 -0.220 -0.702 -0.569 -0.910 -0.910 -0.430 -0.497 -0.032 1.000 0.133 0.017 0.010 -0.025
valid_play_share -0.069 0.535 0.142 0.292 0.028 -0.498 -0.160 -0.160 -0.079 -0.085 -0.032 0.133 1.000 0.391 0.199 -0.726
high_satisfaction_share 0.108 0.955 0.431 0.669 0.282 -0.682 -0.049 -0.049 0.104 0.097 0.036 0.017 0.391 1.000 0.857 -0.479
complete_or_rewatch_share 0.101 0.864 0.392 0.611 0.258 -0.611 -0.041 -0.041 0.098 0.090 0.040 0.010 0.199 0.857 1.000 -0.327
short_abandon_share -0.013 -0.617 -0.241 -0.405 -0.152 0.488 0.060 0.060 -0.012 -0.006 -0.020 -0.025 -0.726 -0.479 -0.327 1.000

This view is a measurement audit. If two metrics are nearly identical, one may be redundant. If the shallow-click guardrail is strongly related to volume and negatively related to satisfaction, it is behaving as intended.
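
Redundancy can be screened mechanically as well. A short sketch that lists candidate-metric pairs whose rank correlation exceeds 0.95; the threshold is an illustrative choice.

from itertools import combinations

# List near-duplicate candidate metrics; |Spearman| above 0.95 suggests redundancy.
redundant_pairs = [
    {"metric_a": a, "metric_b": b, "spearman": same_day_correlation.loc[a, b]}
    for a, b in combinations(candidate_metrics, 2)
    if abs(same_day_correlation.loc[a, b]) > 0.95
]
display(pd.DataFrame(redundant_pairs))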

9. Validate Against Future Outcomes

Future outcomes are not part of the metric formulas, so they can be used as validation targets. This cell measures how each candidate metric relates to future seven-day interactions, future active days, and future play hours using Spearman correlations.

future_outcomes = [
    "outcome_future_7day_interactions",
    "outcome_future_7day_active_days",
    "outcome_future_7day_play_hours",
]

future_validation_rows = []
for metric in candidate_metrics:
    for outcome in future_outcomes:
        future_validation_rows.append(
            {
                "metric": metric,
                "validation_target": outcome,
                "spearman_corr": safe_spearman(metric_panel, metric, outcome),
            }
        )

future_validation = pd.DataFrame(future_validation_rows)
future_validation_wide = future_validation.pivot(
    index="metric", columns="validation_target", values="spearman_corr"
).reset_index()

display(future_validation_wide.round(3))
validation_target metric outcome_future_7day_active_days outcome_future_7day_interactions outcome_future_7day_play_hours
0 balanced_discovery_quality_score 0.282 0.403 0.479
1 discovery_breadth_score 0.398 0.567 0.539
2 quality_adjusted_discovery_score 0.386 0.527 0.564
3 satisfaction_depth_score -0.009 0.009 0.152
4 shallow_click_pressure_score 0.341 0.392 0.262
5 volume_weighted_quality_score 0.450 0.665 0.672

These correlations are descriptive, not causal estimates. They answer a narrower question: do the candidate metrics point in the same direction as future value? A metric can pass this validation and still require causal adjustment later.
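
Because each correlation is a single point estimate, a rough uncertainty band helps calibrate how seriously to read small gaps between metrics. A minimal bootstrap sketch for one metric-outcome pair, resampling user-days with replacement; resampling whole users as blocks would be more honest given repeated days per user, so treat this band as optimistic.

# Rough bootstrap interval for one Spearman correlation (user-day resampling).
rng = np.random.default_rng(7)
pair = metric_panel[["discovery_breadth_score", "outcome_future_7day_interactions"]].dropna()
boot = [
    pair.sample(len(pair), replace=True, random_state=int(seed))
    .corr(method="spearman")
    .iloc[0, 1]
    for seed in rng.integers(0, 1_000_000, size=200)
]
print(np.percentile(boot, [2.5, 97.5]).round(3))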

10. Compare Metric Deciles to Future Behavior

Correlation is useful, but product teams often reason in ranked groups. This cell assigns each user-day to metric deciles and compares future outcomes across the ranked distribution. The top-minus-bottom contrast is an intuitive validation check: higher metric days should generally be followed by better future outcomes, except for the shallow-click guardrail where lower is preferable.

decile_frames = []
metric_panel_with_deciles = metric_panel.copy()

for metric in candidate_metrics:
    decile_col = f"{metric}_decile"
    metric_panel_with_deciles[decile_col] = add_decile(metric_panel_with_deciles, metric)
    decile_summary = (
        metric_panel_with_deciles.groupby(decile_col, observed=True)
        .agg(
            user_days=("user_id", "size"),
            metric_mean=(metric, "mean"),
            future_interactions_mean=("outcome_future_7day_interactions", "mean"),
            future_active_days_mean=("outcome_future_7day_active_days", "mean"),
            future_play_hours_mean=("outcome_future_7day_play_hours", "mean"),
            same_day_satisfaction_mean=("satisfaction_depth_score", "mean"),
            same_day_discovery_mean=("discovery_breadth_score", "mean"),
        )
        .reset_index()
        .rename(columns={decile_col: "decile"})
    )
    decile_summary["metric"] = metric
    decile_frames.append(decile_summary)

metric_deciles = pd.concat(decile_frames, ignore_index=True)

decile_lift_rows = []
for metric in candidate_metrics:
    current = metric_deciles.query("metric == @metric").sort_values("decile")
    bottom = current.iloc[0]
    top = current.iloc[-1]
    decile_lift_rows.append(
        {
            "metric": metric,
            "bottom_decile_future_interactions": bottom["future_interactions_mean"],
            "top_decile_future_interactions": top["future_interactions_mean"],
            "top_minus_bottom_future_interactions": top["future_interactions_mean"] - bottom["future_interactions_mean"],
            "top_minus_bottom_future_active_days": top["future_active_days_mean"] - bottom["future_active_days_mean"],
            "top_minus_bottom_future_play_hours": top["future_play_hours_mean"] - bottom["future_play_hours_mean"],
            "top_minus_bottom_same_day_satisfaction": top["same_day_satisfaction_mean"] - bottom["same_day_satisfaction_mean"],
        }
    )

decile_lift = pd.DataFrame(decile_lift_rows).sort_values(
    "top_minus_bottom_future_interactions", ascending=False
)

display(decile_lift.round(3))
display(metric_deciles.head(12).round(3))
metric bottom_decile_future_interactions top_decile_future_interactions top_minus_bottom_future_interactions top_minus_bottom_future_active_days top_minus_bottom_future_play_hours top_minus_bottom_same_day_satisfaction
4 volume_weighted_quality_score 52.346 484.152 431.806 3.621 1.082 0.082
0 discovery_breadth_score 58.174 426.412 368.238 2.918 0.860 0.022
2 quality_adjusted_discovery_score 66.829 416.454 349.624 2.648 0.902 0.150
5 shallow_click_pressure_score 137.513 421.743 284.229 2.672 0.504 -0.278
3 balanced_discovery_quality_score 137.749 402.717 264.968 1.763 0.749 0.270
1 satisfaction_depth_score 290.466 267.062 -23.404 -0.051 0.148 0.403
decile user_days metric_mean future_interactions_mean future_active_days_mean future_play_hours_mean same_day_satisfaction_mean same_day_discovery_mean metric
0 1 820 0.080 58.174 3.830 0.146 0.628 0.080 discovery_breadth_score
1 2 820 0.175 175.549 6.187 0.437 0.614 0.175 discovery_breadth_score
2 3 820 0.246 294.405 6.705 0.715 0.599 0.246 discovery_breadth_score
3 4 820 0.286 354.070 6.788 0.852 0.601 0.286 discovery_breadth_score
4 5 820 0.317 380.283 6.870 0.924 0.610 0.317 discovery_breadth_score
5 6 819 0.343 403.112 6.829 0.973 0.614 0.343 discovery_breadth_score
6 7 820 0.368 427.234 6.901 1.028 0.624 0.368 discovery_breadth_score
7 8 820 0.396 442.509 6.904 1.051 0.620 0.396 discovery_breadth_score
8 9 820 0.429 445.273 6.902 1.057 0.627 0.429 discovery_breadth_score
9 10 820 0.514 426.412 6.749 1.006 0.650 0.514 discovery_breadth_score
10 1 820 0.423 290.466 5.955 0.625 0.423 0.279 satisfaction_depth_score
11 2 820 0.509 331.652 6.530 0.720 0.509 0.305 satisfaction_depth_score

The decile table translates metrics into rank-order behavior. This is especially useful for storytelling: it shows what happens when we move from low-score user-days to high-score user-days without requiring the reader to parse model coefficients yet.

11. Plot Future Outcomes by Metric Decile

This cell turns the decile table into a compact visual comparison. A clean upward pattern suggests that the metric ranks user-days in a way that is aligned with future engagement. A flat or reversed pattern suggests the metric may be noisy, redundant, or potentially harmful as an optimization target.

plot_metrics = [
    "discovery_breadth_score",
    "satisfaction_depth_score",
    "quality_adjusted_discovery_score",
    "shallow_click_pressure_score",
]
plot_deciles = metric_deciles.query("metric in @plot_metrics").copy()
plot_deciles["metric_label"] = plot_deciles["metric"].str.replace("_", " ").str.title()

fig, axes = plt.subplots(2, 2, figsize=(15, 10), sharex=True)
axes = axes.flatten()
for ax, metric in zip(axes, plot_metrics):
    current = plot_deciles.query("metric == @metric")
    sns.lineplot(
        data=current,
        x="decile",
        y="future_interactions_mean",
        marker="o",
        color="steelblue",
        ax=ax,
    )
    ax.set_title(metric.replace("_", " ").title())
    ax.set_xlabel("Metric decile")
    ax.set_ylabel("Mean future 7-day interactions")
    ax.set_xticks(range(1, 11))

plt.tight_layout()
fig.savefig(FIGURE_DIR / "07_metric_decile_future_outcomes.png", dpi=160, bbox_inches="tight")
plt.show()

The decile curves make the validation question visible. For the guardrail metric, a high score means shallow click pressure, so a weaker or negative relationship with future outcomes can be a good sign rather than a failure.

12. Measure Stability Across Adjacent User Days

A useful product metric should not be pure noise. This cell computes lag-one stability within each user: how much a user-day's metric resembles that user's previous active day in the panel. High stability is not automatically better, but a completely unstable metric may be hard to interpret or optimize.

stability_panel = metric_panel.sort_values(["user_id", "event_date"]).copy()
stability_rows = []

for metric in candidate_metrics:
    lag_col = f"lag_1_{metric}"
    stability_panel[lag_col] = stability_panel.groupby("user_id")[metric].shift(1)
    pair = stability_panel[["user_id", metric, lag_col]].dropna()
    lag_corr = np.nan
    if len(pair) >= 3 and pair[metric].nunique() > 1 and pair[lag_col].nunique() > 1:
        lag_corr = pair[metric].corr(pair[lag_col], method="spearman")

    user_means = stability_panel.groupby("user_id")[metric].mean()
    user_stds = stability_panel.groupby("user_id")[metric].std()
    stability_rows.append(
        {
            "metric": metric,
            "lag_1_spearman": lag_corr,
            "mean_within_user_std": user_stds.mean(),
            "between_user_std_of_means": user_means.std(),
            "overall_std": stability_panel[metric].std(),
        }
    )

metric_stability = pd.DataFrame(stability_rows).sort_values("lag_1_spearman", ascending=False)

display(metric_stability.round(3))
metric lag_1_spearman mean_within_user_std between_user_std_of_means overall_std
4 volume_weighted_quality_score 0.683 0.165 0.029 0.167
5 shallow_click_pressure_score 0.645 0.078 0.041 0.090
1 satisfaction_depth_score 0.618 0.086 0.068 0.113
2 quality_adjusted_discovery_score 0.589 0.103 0.025 0.106
3 balanced_discovery_quality_score 0.573 0.078 0.035 0.086
0 discovery_breadth_score 0.551 0.123 0.010 0.123

Stability helps separate durable user preference signals from one-off daily noise. In later modeling, highly stable metrics may need stronger user-history adjustment, while very unstable metrics may need smoothing or larger samples.
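
For very unstable metrics, one simple smoothing option is an exponentially weighted within-user average. A sketch only, intended for monitoring rather than modeling; the halflife of three active days is an illustrative choice.

# Exponentially weighted within-user smoothing for a noisy daily metric.
smoothed = (
    metric_panel.sort_values(["user_id", "event_date"])
    .groupby("user_id")["quality_adjusted_discovery_score"]
    .transform(lambda s: s.ewm(halflife=3).mean())
)
print(smoothed.describe().round(3))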

13. Check Dependence on Past Activity

A metric can look predictive simply because active users stay active. This cell checks how strongly each metric correlates with recent user history. A metric that predicts future outcomes while only moderately tracking past activity is more interesting than one that is just a disguised activity count.

history_columns = [
    "prior_3day_interactions",
    "recent_activity_score",
    "prior_3day_high_satisfaction_share",
    "prior_3day_discovery_candidate_share",
]

history_rows = []
for metric in candidate_metrics:
    for history_col in history_columns:
        history_rows.append(
            {
                "metric": metric,
                "history_variable": history_col,
                "spearman_corr": safe_spearman(metric_panel, metric, history_col),
            }
        )

history_dependence = pd.DataFrame(history_rows)
history_dependence_wide = history_dependence.pivot(
    index="metric", columns="history_variable", values="spearman_corr"
).reset_index()

display(history_dependence_wide.round(3))
history_variable metric prior_3day_discovery_candidate_share prior_3day_high_satisfaction_share prior_3day_interactions recent_activity_score
0 balanced_discovery_quality_score 0.348 0.358 0.248 0.248
1 discovery_breadth_score 0.498 -0.027 0.401 0.401
2 quality_adjusted_discovery_score 0.463 0.203 0.357 0.357
3 satisfaction_depth_score 0.005 0.584 -0.046 -0.046
4 shallow_click_pressure_score 0.369 -0.410 0.444 0.444
5 volume_weighted_quality_score 0.598 0.177 0.565 0.565

This table is not trying to remove confounding yet. It is a diagnostic. If a candidate metric is very close to recent activity, future causal notebooks should be especially careful with user fixed effects, lag controls, or doubly robust adjustment.
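
As a preview of why that care matters, a rank-based partial correlation shows how much future alignment survives a crude activity adjustment. This is still a diagnostic sketch, not causal adjustment: it residualizes metric and outcome ranks on the rank of prior activity before correlating.

# Rank-based partial correlation: future alignment after linearly removing
# the rank of recent activity from both metric and outcome. Diagnostic only.
def partial_rank_corr(frame, metric, outcome, control):
    sub = frame[[metric, outcome, control]].dropna().rank()
    resid = {}
    for col in (metric, outcome):
        slope, intercept = np.polyfit(sub[control], sub[col], 1)
        resid[col] = sub[col] - (slope * sub[control] + intercept)
    return np.corrcoef(resid[metric], resid[outcome])[0, 1]

for m in candidate_metrics:
    adjusted = partial_rank_corr(
        metric_panel, m, "outcome_future_7day_interactions", "prior_3day_interactions"
    )
    print(f"{m}: {adjusted:.3f}")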

14. Summarize Metric Validation Evidence

This cell combines distribution quality, future-outcome alignment, decile lift, stability, and history dependence into one validation table. The table is not a final model ranking. It is a structured way to decide which metrics deserve to move forward.

distribution_checks = (
    metric_panel[candidate_metrics]
    .agg(["mean", "std", "min", "max"])
    .T.reset_index()
    .rename(columns={"index": "metric", "std": "metric_std"})
)
distribution_checks["missing_rate"] = metric_panel[candidate_metrics].isna().mean().values

future_score = future_validation_wide.rename(
    columns={
        "outcome_future_7day_interactions": "future_interactions_corr",
        "outcome_future_7day_active_days": "future_active_days_corr",
        "outcome_future_7day_play_hours": "future_play_hours_corr",
    }
)

history_score = history_dependence_wide.rename(
    columns={
        "prior_3day_interactions": "prior_interactions_corr",
        "recent_activity_score": "recent_activity_corr",
        "prior_3day_high_satisfaction_share": "prior_satisfaction_corr",
        "prior_3day_discovery_candidate_share": "prior_discovery_corr",
    }
)

validation_summary = (
    distribution_checks.merge(future_score, on="metric", how="left")
    .merge(history_score, on="metric", how="left")
    .merge(decile_lift, on="metric", how="left")
    .merge(metric_stability, on="metric", how="left")
)

validation_summary["future_alignment_score"] = validation_summary[
    ["future_interactions_corr", "future_active_days_corr", "future_play_hours_corr"]
].mean(axis=1)
validation_summary["history_dependence_score"] = validation_summary[
    ["prior_interactions_corr", "recent_activity_corr"]
].abs().mean(axis=1)
validation_summary["screening_score"] = (
    validation_summary["future_alignment_score"]
    - 0.35 * validation_summary["history_dependence_score"]
)

validation_summary = validation_summary.sort_values("screening_score", ascending=False)

display(
    validation_summary[
        [
            "metric",
            "metric_std",
            "future_alignment_score",
            "history_dependence_score",
            "screening_score",
            "top_minus_bottom_future_interactions",
            "lag_1_spearman",
            "missing_rate",
        ]
    ].round(3)
)
metric metric_std future_alignment_score history_dependence_score screening_score top_minus_bottom_future_interactions lag_1_spearman missing_rate
4 volume_weighted_quality_score 0.167 0.596 0.565 0.398 431.806 0.683 0.0
2 quality_adjusted_discovery_score 0.106 0.492 0.357 0.367 349.624 0.589 0.0
0 discovery_breadth_score 0.123 0.502 0.401 0.361 368.238 0.551 0.0
3 balanced_discovery_quality_score 0.086 0.388 0.248 0.301 264.968 0.573 0.0
5 shallow_click_pressure_score 0.090 0.332 0.444 0.176 284.229 0.645 0.0
1 satisfaction_depth_score 0.113 0.051 0.046 0.035 -23.404 0.618 0.0

The screening score is deliberately lightweight. It rewards future alignment and penalizes dependence on recent activity, but it is not a replacement for causal estimation. Its job is to narrow attention to the metrics that are worth modeling next.
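
Because the 0.35 penalty is a judgment call, it is worth checking whether the ranking depends on it. A short sketch that recomputes the top-ranked metric under a few alternative penalty weights; the weights are illustrative, not tuned.

# Check sensitivity of the screening ranking to the history penalty weight.
for penalty in (0.20, 0.35, 0.50):
    score = (
        validation_summary["future_alignment_score"]
        - penalty * validation_summary["history_dependence_score"]
    )
    top = validation_summary.loc[score.idxmax(), "metric"]
    print(f"penalty={penalty:.2f} -> top metric: {top}")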

15. Visualize the Validation Summary

This cell creates a figure that places future alignment and history dependence side by side. The best candidates are usually those with meaningful future alignment and tolerable dependence on prior activity. The guardrail metric should be read differently because high values represent worse quality pressure.

summary_plot = validation_summary.copy()
summary_plot["metric_label"] = summary_plot["metric"].str.replace("_", " ").str.title()
summary_long = summary_plot.melt(
    id_vars=["metric", "metric_label"],
    value_vars=["future_alignment_score", "history_dependence_score", "screening_score"],
    var_name="validation_dimension",
    value_name="score",
)
summary_long["validation_dimension"] = summary_long["validation_dimension"].map(
    {
        "future_alignment_score": "Future alignment",
        "history_dependence_score": "History dependence",
        "screening_score": "Screening score",
    }
)

fig, ax = plt.subplots(figsize=(13, 6))
sns.barplot(
    data=summary_long,
    x="score",
    y="metric_label",
    hue="validation_dimension",
    ax=ax,
)
ax.axvline(0, color="black", linewidth=1)
ax.set_title("Metric Validation Summary")
ax.set_xlabel("Score")
ax.set_ylabel("Metric")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "08_metric_validation_summary.png", dpi=160, bbox_inches="tight")
plt.show()

The visual summary is useful for portfolio communication because it shows the tradeoff in one place. A candidate metric is stronger when it has enough future signal without becoming just another activity-volume measure.

16. Select Metrics for Later Causal Notebooks

This cell chooses the metrics that will move forward. The choices preserve causal roles: an exposure-like metric for treatment, a satisfaction metric for mediation, a composite metric for product-level monitoring, and a guardrail for shallow engagement pressure.

selected_metrics = pd.DataFrame(
    [
        {
            "selected_for": "exposure_analysis",
            "metric": "discovery_breadth_score",
            "reason": "Purest discovery exposure score; does not use future outcomes or satisfaction depth as a defining component.",
        },
        {
            "selected_for": "mediator_analysis",
            "metric": "satisfaction_depth_score",
            "reason": "Aggregates several same-day quality signals and is appropriate as a mediator candidate.",
        },
        {
            "selected_for": "product_metric_monitoring",
            "metric": "quality_adjusted_discovery_score",
            "reason": "Requires both discovery breadth and satisfaction depth to be high, making it a useful composite quality metric.",
        },
        {
            "selected_for": "guardrail_monitoring",
            "metric": "shallow_click_pressure_score",
            "reason": "Flags high-volume days with low satisfaction, useful as a warning against click-only optimization.",
        },
    ]
)

selected_validation = selected_metrics.merge(
    validation_summary[
        [
            "metric",
            "future_alignment_score",
            "history_dependence_score",
            "screening_score",
            "top_minus_bottom_future_interactions",
        ]
    ],
    on="metric",
    how="left",
)

display(selected_validation.round(3))
selected_for metric reason future_alignment_score history_dependence_score screening_score top_minus_bottom_future_interactions
0 exposure_analysis discovery_breadth_score Purest discovery exposure score; does not use future outcomes or satisfaction depth as a defining component. 0.502 0.401 0.361 368.238
1 mediator_analysis satisfaction_depth_score Aggregates several same-day quality signals and is appropriate as a mediator candidate. 0.051 0.046 0.035 -23.404
2 product_metric_monitoring quality_adjusted_discovery_score Requires both discovery breadth and satisfaction depth to be high, making it a useful composite quality metric. 0.492 0.357 0.367 349.624
3 guardrail_monitoring shallow_click_pressure_score Flags high-volume days with low satisfaction, useful as a warning against click-only optimization. 0.332 0.444 0.176 284.229

The selected set gives the next notebooks a clean structure: exposure first, mediator second, product composite third, guardrail alongside them. That keeps the causal story easy to follow instead of mixing every engineered score into every role.

17. Save Metric Artifacts

This cell writes the metric panel and validation outputs to disk. The next notebooks can now load one metric panel rather than rebuilding the score definitions. The registry and selection files make the metric choices auditable.

metric_panel.to_parquet(METRIC_PANEL_OUTPUT, index=False)
metric_registry.to_csv(METRIC_REGISTRY_OUTPUT, index=False)
validation_summary.to_csv(METRIC_VALIDATION_OUTPUT, index=False)
same_day_correlation.to_csv(METRIC_CORRELATION_OUTPUT)
metric_deciles.to_csv(METRIC_DECILE_OUTPUT, index=False)
selected_validation.to_csv(METRIC_SELECTION_OUTPUT, index=False)

# Mirror the most important tables inside the notebook writeup directory for easier export.
metric_registry.to_csv(TABLE_DIR / "metric_registry.csv", index=False)
validation_summary.to_csv(TABLE_DIR / "metric_validation_summary.csv", index=False)
metric_deciles.to_csv(TABLE_DIR / "metric_deciles.csv", index=False)
selected_validation.to_csv(TABLE_DIR / "selected_metrics.csv", index=False)

saved_outputs = pd.DataFrame(
    {
        "artifact": [
            "metric_panel",
            "metric_registry",
            "metric_validation_summary",
            "metric_correlation_matrix",
            "metric_decile_table",
            "selected_metrics",
        ],
        "path": [
            str(METRIC_PANEL_OUTPUT),
            str(METRIC_REGISTRY_OUTPUT),
            str(METRIC_VALIDATION_OUTPUT),
            str(METRIC_CORRELATION_OUTPUT),
            str(METRIC_DECILE_OUTPUT),
            str(METRIC_SELECTION_OUTPUT),
        ],
    }
)

display(saved_outputs)
artifact path
0 metric_panel /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_metric_panel.parquet
1 metric_registry /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_metric_registry.csv
2 metric_validation_summary /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_metric_validation.csv
3 metric_correlation_matrix /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_metric_correlations.csv
4 metric_decile_table /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_metric_deciles.csv
5 selected_metrics /home/apex/Documents/ranking_sys/data/processed/kuairec_discovery_quality_selected_metrics.csv

The saved artifacts are the handoff point. The next notebook can focus on mediation estimands and assumptions because metric construction is now explicit, validated, and reproducible.

18. Notebook Takeaways

This notebook created a measurement layer for discovery quality. The main takeaways are:

  • Discovery-quality measurement should separate exposure, mediator, composite, and guardrail roles.
  • discovery_breadth_score is the cleanest exposure-like metric for later causal contrasts.
  • satisfaction_depth_score is the main mediator candidate because it aggregates watch-quality signals without using future outcomes.
  • quality_adjusted_discovery_score is useful as a product-facing metric, but it combines exposure and mediator information.
  • Future outcomes validate the direction of the metrics, while history-dependence checks remind us why causal adjustment is still needed.

The natural next notebook is 03_mediation_estimands_and_assumptions.ipynb, where these metrics can be mapped to direct, indirect, and total effect estimands.