Notebook 02: Defining the Long-Term Causal Estimand

Notebook 01 showed that KuaiRec can be organized as a sequential user-day panel. This notebook now turns that panel into a specific causal question. That step matters because long-term causal analysis can easily become vague: there are many possible exposure definitions, many future outcomes, and many tempting descriptive comparisons. A credible project should define the causal target before estimating it.

The problem handled here is: which daily recommendation exposure should be treated as the intervention, and which future user behavior should be treated as the long-term outcome?

In recommender systems, the short-term metric is often immediate engagement: click, watch, completion, like, or session depth. The long-term business question is broader: does today’s recommendation experience make the user more likely to return, consume more, or keep engaging over the next several days? This notebook creates that bridge. It compares candidate daily treatments and candidate future outcomes, checks their variation, checks whether they are usable in the sampled panel, and then selects one primary estimand for later causal modeling.

The selected estimand should be understood as a portfolio-ready causal statement, not just a metric name. By the end of this notebook, we want to be able to say something like:

Among active KuaiRec user-days with enough prior history and future follow-up, what is the causal effect of receiving a high-watch-exposure day on the user’s next 7 days of engagement?

This notebook still does not claim that the effect has been identified or estimated. Instead, it defines the treatment, outcome, population, and modeling table that later notebooks will use for inverse probability weighting, marginal structural models, and g-computation.

What This Notebook Decides

The notebook makes four design decisions.

Candidate treatments: compare several daily exposure patterns, such as high-intensity days and high-watch-exposure days.
Candidate outcomes: compare future activity metrics over 1-day, 3-day, and 7-day horizons.
Analysis population: restrict to rows where treatment is meaningful and future outcomes can be observed.
Primary estimand: choose one treatment-outcome pair as the main causal target for the rest of the long-term causal effects workflow.

This is the notebook where we deliberately avoid the common mistake of jumping straight into a model. A causal model is only useful after the target question is clean.

Important Modeling Language

The rest of the notebook uses the following causal vocabulary:

Term	Meaning in this project
Unit	A `user_id` observed on a specific `event_date`.
Time step	One calendar day.
Treatment	A daily exposure pattern derived from that day’s recommendation consumption.
Control condition	An active user-day that does not meet the treatment rule.
Outcome	Future user behavior after the treatment day, such as future interactions or active days.
Baseline/history state	Pre-treatment behavior such as yesterday’s activity and prior 3-day engagement.
Estimand	The causal effect we want to estimate for a defined population, treatment, comparator, and outcome.

The distinction between treatment and outcome is especially important. Today’s engagement pattern can be treated as exposure, but tomorrow’s engagement must remain future outcome. Mixing those up would leak post-treatment information into the adjustment set.

Setup

The first code cell imports the libraries used for metric summaries, visual diagnostics, and saving the final modeling table. The notebook uses only standard data science libraries so that the output remains easy to rerun.

from pathlib import Path
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")

The notebook environment is ready for estimand design: the libraries support tabular summaries, plots, and saved artifacts. The next step is to load the processed panel from Notebook 01 so this notebook can focus on causal definitions rather than raw data parsing.

Locate the Processed Panel

Notebook 01 saved a dense user-day panel to data/processed. This notebook loads that file rather than opening the raw nested KuaiRec archive again. Keeping this boundary clean makes the project sequence easier to follow: Notebook 01 handles raw data understanding; Notebook 02 handles causal target definition.

PROCESSED_PANEL_RELATIVE_PATH = Path("data/processed/kuairec_user_day_panel_sample.parquet")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / PROCESSED_PANEL_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {PROCESSED_PANEL_RELATIVE_PATH}. Run Notebook 01 first or run this notebook inside the project."
    )

PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PANEL_PATH = PROJECT_ROOT / PROCESSED_PANEL_RELATIVE_PATH

print(f"Project root: {PROJECT_ROOT}")
print(f"Input panel: {PANEL_PATH}")
print(f"Processed output folder: {PROCESSED_DIR}")

Project root: /home/apex/Documents/ranking_sys
Input panel: /home/apex/Documents/ranking_sys/data/processed/kuairec_user_day_panel_sample.parquet
Processed output folder: /home/apex/Documents/ranking_sys/data/processed

The printed paths confirm that Notebook 01’s processed panel is available. This creates a clean workflow boundary: the current notebook can start from a user-day panel and spend its attention on treatment, outcome, and population choices.

Load the User-Day Panel

The next cell loads the panel created in Notebook 01. Each row is a user-day. The table already contains daily engagement summaries, lagged history features, candidate treatment indicators, and future outcome columns.

user_day = pd.read_parquet(PANEL_PATH)
user_day["event_date"] = pd.to_datetime(user_day["event_date"])
user_day = user_day.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Panel shape: {user_day.shape}")
print(f"Users: {user_day['user_id'].nunique():,}")
print(f"Dates: {user_day['event_date'].nunique():,}")
display(user_day.head())

Panel shape: (5733, 58)
Users: 91
Dates: 63

	user_id	event_date	interactions	unique_videos	total_play_duration_ms	avg_play_duration_ms	avg_video_duration_ms	avg_watch_ratio	high_watch_count	complete_or_rewatch_count	total_play_duration_sec	avg_play_duration_sec	avg_video_duration_sec	high_watch_share	complete_or_rewatch_share	active_day	lag_1_active_day	prior_3day_active_day	lag_1_interactions	prior_3day_interactions	lag_1_total_play_duration_sec	prior_3day_total_play_duration_sec	lag_1_avg_watch_ratio	prior_3day_avg_watch_ratio	lag_1_high_watch_share	prior_3day_high_watch_share	lead_1_active_day	lead_2_active_day	lead_3_active_day	lead_4_active_day	lead_5_active_day	lead_6_active_day	lead_7_active_day	lead_1_interactions	lead_2_interactions	lead_3_interactions	lead_4_interactions	lead_5_interactions	lead_6_interactions	lead_7_interactions	lead_1_total_play_duration_sec	lead_2_total_play_duration_sec	lead_3_total_play_duration_sec	lead_4_total_play_duration_sec	lead_5_total_play_duration_sec	lead_6_total_play_duration_sec	lead_7_total_play_duration_sec	next_day_active	next_day_interactions	next_day_play_duration_sec	future_3day_active_days	future_3day_interactions	future_3day_play_duration_sec	future_7day_active_days	future_7day_interactions	future_7day_play_duration_sec	treatment_high_intensity	treatment_high_watch_exposure
0	14	2020-07-05	26.0000	26.0000	240,975.0000	9,268.2692	10,187.6538	1.0845	15.0000	12.0000	240.9750	9.2683	10.1877	0.5769	0.4615	1	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	23.0000	78.0000	22.0000	55.0000	52.0000	32.0000	42.0000	248.3440	655.4890	201.9010	485.0390	606.2440	284.7470	337.9180	1.0000	23.0000	248.3440	3.0000	123.0000	1,105.7340	7.0000	304.0000	2,819.6820	0	1
1	14	2020-07-06	23.0000	23.0000	248,344.0000	10,797.5652	14,615.2174	1.0640	12.0000	10.0000	248.3440	10.7976	14.6152	0.5217	0.4348	1	1.0000	1.0000	26.0000	26.0000	240.9750	240.9750	1.0845	1.0845	0.5769	0.5769	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	78.0000	22.0000	55.0000	52.0000	32.0000	42.0000	46.0000	655.4890	201.9010	485.0390	606.2440	284.7470	337.9180	502.1450	1.0000	78.0000	655.4890	3.0000	155.0000	1,342.4290	7.0000	327.0000	3,073.4830	0	1
2	14	2020-07-07	78.0000	78.0000	655,489.0000	8,403.7051	13,529.6410	0.8415	36.0000	27.0000	655.4890	8.4037	13.5296	0.4615	0.3462	1	1.0000	2.0000	23.0000	49.0000	248.3440	489.3190	1.0640	2.1485	0.5217	1.0987	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	22.0000	55.0000	52.0000	32.0000	42.0000	46.0000	42.0000	201.9010	485.0390	606.2440	284.7470	337.9180	502.1450	337.4890	1.0000	22.0000	201.9010	3.0000	129.0000	1,293.1840	7.0000	291.0000	2,755.4830	1	0
3	14	2020-07-08	22.0000	22.0000	201,901.0000	9,177.3182	12,657.1818	0.9828	11.0000	9.0000	201.9010	9.1773	12.6572	0.5000	0.4091	1	1.0000	3.0000	78.0000	127.0000	655.4890	1,144.8080	0.8415	2.9900	0.4615	1.5602	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	55.0000	52.0000	32.0000	42.0000	46.0000	42.0000	10.0000	485.0390	606.2440	284.7470	337.9180	502.1450	337.4890	146.5620	1.0000	55.0000	485.0390	3.0000	139.0000	1,376.0300	7.0000	279.0000	2,700.1440	0	1
4	14	2020-07-09	55.0000	55.0000	485,039.0000	8,818.8909	12,841.6727	0.8619	20.0000	15.0000	485.0390	8.8189	12.8417	0.3636	0.2727	1	1.0000	3.0000	22.0000	123.0000	201.9010	1,105.7340	0.9828	2.8883	0.5000	1.4833	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	52.0000	32.0000	42.0000	46.0000	42.0000	10.0000	87.0000	606.2440	284.7470	337.9180	502.1450	337.4890	146.5620	857.2160	1.0000	52.0000	606.2440	3.0000	126.0000	1,228.9090	7.0000	311.0000	3,072.3210	0	0

The panel shape and preview confirm that each row is a user-day with current behavior, history variables, treatments, and future outcomes. With this structure loaded, the notebook can now define which columns should be used for the primary causal question.

Field Guide for the Processed Panel

The processed panel has many columns because it combines current-day behavior, prior behavior, candidate treatments, and future outcomes. The next table groups the most important columns into modeling roles. This makes the rest of the notebook easier to read because every metric is tied to a causal purpose.

panel_field_guide = pd.DataFrame(
    [
        {"role": "identifier", "columns": "user_id, event_date", "description": "Define the user-day unit and time step."},
        {"role": "current-day behavior", "columns": "active_day, interactions, total_play_duration_sec, avg_watch_ratio", "description": "Describe what happened on the current day before creating treatment labels."},
        {"role": "daily exposure quality", "columns": "high_watch_share, complete_or_rewatch_share", "description": "Measure the share of consumed videos with high watch ratio or over-completion."},
        {"role": "history / confounders", "columns": "lag_1_*, prior_3day_*", "description": "Pre-treatment user state. These variables may affect both today's treatment and future outcomes."},
        {"role": "future outcomes", "columns": "next_day_*, future_3day_*, future_7day_*", "description": "Behavior after the current day. These are candidate long-term outcomes."},
        {"role": "existing treatment candidates", "columns": "treatment_high_intensity, treatment_high_watch_exposure", "description": "First-pass treatment labels created in Notebook 01."},
    ]
)

display(panel_field_guide)

	role	columns	description
0	identifier	user_id, event_date	Define the user-day unit and time step.
1	current-day behavior	active_day, interactions, total_play_duration_...	Describe what happened on the current day befo...
2	daily exposure quality	high_watch_share, complete_or_rewatch_share	Measure the share of consumed videos with high...
3	history / confounders	lag_1_, prior_3day_	Pre-treatment user state. These variables may ...
4	future outcomes	next_day_, future_3day_, future_7day_*	Behavior after the current day. These are cand...
5	existing treatment candidates	treatment_high_intensity, treatment_high_watch...	First-pass treatment labels created in Noteboo...

The field guide separates identifiers, current-day behavior, histories, treatments, and outcomes. This separation protects the causal design from leakage: future outcomes should not become adjustment variables, and treatment-day summaries should not be confused with baseline history.

Basic Panel Sanity Check

Before choosing an estimand, we check the panel’s size, date range, and activity rate. This gives context for what kinds of outcomes are feasible. If nearly everyone is active every day, for example, a binary next-day retention outcome may have too little variation to be the primary outcome.

panel_summary = pd.DataFrame(
    [
        {"metric": "rows", "value": len(user_day)},
        {"metric": "users", "value": user_day["user_id"].nunique()},
        {"metric": "dates", "value": user_day["event_date"].nunique()},
        {"metric": "first_date", "value": user_day["event_date"].min()},
        {"metric": "last_date", "value": user_day["event_date"].max()},
        {"metric": "active_day_rate", "value": user_day["active_day"].mean()},
        {"metric": "mean_daily_interactions", "value": user_day["interactions"].mean()},
        {"metric": "mean_daily_watch_minutes", "value": user_day["total_play_duration_sec"].mean() / 60},
    ]
)

display(panel_summary)

	metric	value
0	rows	5733
1	users	91
2	dates	63
3	first_date	2020-07-05 00:00:00
4	last_date	2020-09-05 00:00:00
5	active_day_rate	0.9658
6	mean_daily_interactions	50.4197
7	mean_daily_watch_minutes	7.3176

The sanity check gives the scale and activity level of the analysis panel. The high active-day rate hints that binary retention may have a ceiling effect, so the notebook next creates follow-up flags and compares richer outcome definitions.

Add Calendar and Follow-Up Flags

A future 7-day outcome is only valid if the row has 7 days of future data in the panel. Similarly, a history-adjusted analysis is cleaner if each row has at least a few prior days of observed history. The next cell creates day-index, history, and follow-up flags.

These flags will later define the analysis population. This is one of the most important safeguards in the notebook: we do not want to compare rows with full follow-up against rows at the edge of the panel where future activity is unknown.

user_day["panel_day_index"] = user_day.groupby("user_id").cumcount()
user_day["max_panel_day_index"] = user_day.groupby("user_id")["panel_day_index"].transform("max")
user_day["days_until_panel_end"] = user_day["max_panel_day_index"] - user_day["panel_day_index"]
user_day["day_of_week"] = user_day["event_date"].dt.day_name()
user_day["calendar_day_index"] = (user_day["event_date"] - user_day["event_date"].min()).dt.days

user_day["has_1day_followup"] = user_day["days_until_panel_end"] >= 1
user_day["has_3day_followup"] = user_day["days_until_panel_end"] >= 3
user_day["has_7day_followup"] = user_day["days_until_panel_end"] >= 7
user_day["has_3day_history"] = user_day["panel_day_index"] >= 3

followup_summary = user_day[
    ["has_1day_followup", "has_3day_followup", "has_7day_followup", "has_3day_history"]
].mean().rename("share_of_rows").reset_index().rename(columns={"index": "flag"})

display(followup_summary)

	flag	share_of_rows
0	has_1day_followup	0.9841
1	has_3day_followup	0.9524
2	has_7day_followup	0.8889
3	has_3day_history	0.9524

The follow-up and history flags tell us which rows can support a fair future-outcome comparison. Rows at the beginning lack enough history, and rows near the end lack enough future observation, so these flags will become part of the final analysis-population definition.

Create Additional Candidate Treatment Definitions

Notebook 01 already created two treatment candidates. This cell adds two more transparent candidates so we can compare options before choosing the primary treatment.

treatment_high_intensity: active day with unusually many interactions.
treatment_high_watch_exposure: active day with a high share of videos watched at least 80 percent.
treatment_overcompletion_exposure: active day with a high share of complete or over-complete watches.
treatment_high_watch_time: active day with unusually high total watch time.

The goal is not to keep all of them. The goal is to choose the one that best matches the causal question and has enough variation.

active_days = user_day["active_day"].eq(1)

overcompletion_threshold = user_day.loc[active_days, "complete_or_rewatch_share"].median()
watch_time_threshold = user_day.loc[active_days, "total_play_duration_sec"].quantile(0.75)

user_day["treatment_overcompletion_exposure"] = (
    active_days & (user_day["complete_or_rewatch_share"] >= overcompletion_threshold)
).astype(int)

user_day["treatment_high_watch_time"] = (
    active_days & (user_day["total_play_duration_sec"] >= watch_time_threshold)
).astype(int)

candidate_treatments = [
    "treatment_high_intensity",
    "treatment_high_watch_exposure",
    "treatment_overcompletion_exposure",
    "treatment_high_watch_time",
]

treatment_rules = pd.DataFrame(
    [
        {
            "treatment": "treatment_high_intensity",
            "plain_english_rule": "Active day with interaction count in the upper quartile of active days.",
            "threshold_or_definition": "Created in Notebook 01",
        },
        {
            "treatment": "treatment_high_watch_exposure",
            "plain_english_rule": "Active day with high-watch share at or above the active-day median.",
            "threshold_or_definition": "Created in Notebook 01",
        },
        {
            "treatment": "treatment_overcompletion_exposure",
            "plain_english_rule": "Active day with complete-or-rewatch share at or above the active-day median.",
            "threshold_or_definition": overcompletion_threshold,
        },
        {
            "treatment": "treatment_high_watch_time",
            "plain_english_rule": "Active day with total watch time in the upper quartile of active days.",
            "threshold_or_definition": watch_time_threshold,
        },
    ]
)

display(treatment_rules)

	treatment	plain_english_rule	threshold_or_definition
0	treatment_high_intensity	Active day with interaction count in the upper...	Created in Notebook 01
1	treatment_high_watch_exposure	Active day with high-watch share at or above t...	Created in Notebook 01
2	treatment_overcompletion_exposure	Active day with complete-or-rewatch share at o...	0.3148
3	treatment_high_watch_time	Active day with total watch time in the upper ...	608.0240

The treatment-rule table expands the choice set from Notebook 01. By defining several plausible exposure patterns before looking at the final estimand, the notebook makes the treatment choice explicit rather than accidental.

Summarize Candidate Treatment Variation

A usable treatment needs variation. If almost every active user-day is treated or almost none are treated, causal estimation becomes unstable or uninformative. This cell summarizes treatment prevalence across all user-days and across active user-days only.

treatment_summary_rows = []
for treatment in candidate_treatments:
    treatment_summary_rows.append(
        {
            "treatment": treatment,
            "share_all_user_days": user_day[treatment].mean(),
            "share_active_user_days": user_day.loc[active_days, treatment].mean(),
            "treated_user_days": int(user_day[treatment].sum()),
            "treated_users": int(user_day.loc[user_day[treatment].eq(1), "user_id"].nunique()),
            "untreated_active_user_days": int((active_days & user_day[treatment].eq(0)).sum()),
            "users_with_both_treated_and_control_days": int(
                user_day.groupby("user_id")[treatment].nunique().eq(2).sum()
            ),
        }
    )

treatment_summary = pd.DataFrame(treatment_summary_rows)
display(treatment_summary)

	treatment	share_all_user_days	share_active_user_days	treated_user_days	treated_users	untreated_active_user_days	users_with_both_treated_and_control_days
0	treatment_high_intensity	0.2501	0.2590	1434	91	4103	91
1	treatment_high_watch_exposure	0.4842	0.5014	2776	91	2761	91
2	treatment_overcompletion_exposure	0.4837	0.5008	2773	91	2764	91
3	treatment_high_watch_time	0.2416	0.2501	1385	91	4152	91

The treatment summary shows whether each candidate has enough treated and control examples, including within-user variation. This is a practical positivity check: a treatment with little variation would be difficult to model credibly.

Visualize Treatment Rates Over Time

The treatment rate should not be concentrated in only one date or one short window. The next plot shows daily prevalence for each candidate treatment. Stable but non-constant variation is useful because it leaves room for causal estimators to compare treated and untreated days across similar histories.

daily_treatment_rates = (
    user_day.groupby("event_date")[candidate_treatments]
    .mean()
    .reset_index()
    .melt(id_vars="event_date", var_name="treatment", value_name="daily_rate")
)

fig, ax = plt.subplots(figsize=(13, 5))
sns.lineplot(
    data=daily_treatment_rates,
    x="event_date",
    y="daily_rate",
    hue="treatment",
    linewidth=1.6,
    ax=ax,
)
ax.set_title("Daily Prevalence of Candidate Treatments")
ax.set_xlabel("Date")
ax.set_ylabel("Treatment rate")
ax.yaxis.set_major_formatter(lambda value, _: f"{value:.0%}")
ax.tick_params(axis="x", rotation=35)
plt.tight_layout()
plt.show()

The daily treatment-rate plot reveals whether treatment definitions are stable over the sample window or tied to calendar trends. This motivates keeping calendar time in the later modeling table as a possible adjustment variable.

Add Outcome Variants for Interpretability

The raw future outcomes are useful, but a few transformed versions make interpretation easier. This cell creates average daily future interactions and future watch hours. It also creates log-transformed outcomes for skewed engagement volume metrics.

The raw count outcome remains the main candidate because it is directly interpretable: how many interactions occur in the next 7 days?

user_day["future_3day_avg_daily_interactions"] = user_day["future_3day_interactions"] / 3
user_day["future_7day_avg_daily_interactions"] = user_day["future_7day_interactions"] / 7
user_day["future_3day_play_hours"] = user_day["future_3day_play_duration_sec"] / 3_600
user_day["future_7day_play_hours"] = user_day["future_7day_play_duration_sec"] / 3_600
user_day["log1p_future_7day_interactions"] = np.log1p(user_day["future_7day_interactions"])
user_day["log1p_future_7day_play_duration_sec"] = np.log1p(user_day["future_7day_play_duration_sec"])

candidate_outcomes = [
    "next_day_active",
    "future_3day_active_days",
    "future_7day_active_days",
    "future_3day_interactions",
    "future_7day_interactions",
    "future_7day_avg_daily_interactions",
    "future_7day_play_hours",
    "log1p_future_7day_interactions",
]

print("Created candidate outcome variants:")
for outcome in candidate_outcomes:
    print(f"- {outcome}")

Created candidate outcome variants:
- next_day_active
- future_3day_active_days
- future_7day_active_days
- future_3day_interactions
- future_7day_interactions
- future_7day_avg_daily_interactions
- future_7day_play_hours
- log1p_future_7day_interactions

The created outcomes turn raw future behavior into several interpretable targets: retention-like activity counts, interaction volume, watch hours, and log-transformed engagement. The next cell compares these candidates to decide which one is best suited as the primary outcome.

Summarize Candidate Outcomes

A primary long-term outcome should have enough variation, low missingness after applying follow-up rules, and a clear interpretation. The next table compares candidate outcomes on missingness, mean, standard deviation, and the share of rows at the maximum value.

The maximum-share column is especially important for active-day outcomes. If most rows are at the maximum, the outcome has a ceiling effect and may be weak as a primary metric.

outcome_summary_rows = []
for outcome in candidate_outcomes:
    values = user_day[outcome]
    nonmissing = values.dropna()
    outcome_summary_rows.append(
        {
            "outcome": outcome,
            "missing_rate": values.isna().mean(),
            "mean": nonmissing.mean(),
            "std": nonmissing.std(),
            "min": nonmissing.min(),
            "p25": nonmissing.quantile(0.25),
            "median": nonmissing.median(),
            "p75": nonmissing.quantile(0.75),
            "max": nonmissing.max(),
            "share_at_zero": (nonmissing == 0).mean(),
            "share_at_max": (nonmissing == nonmissing.max()).mean(),
            "unique_values": nonmissing.nunique(),
        }
    )

outcome_summary = pd.DataFrame(outcome_summary_rows)
display(outcome_summary)

	outcome	missing_rate	mean	std	p25	median	p75	max	share_at_zero	share_at_max	unique_values
0	next_day_active	0.0159	0.9661	0.1809	1.0000	1.0000	1.0000	1.0000	0.0339	0.9661	2
1	future_3day_active_days	0.0476	2.9117	0.3841	3.0000	3.0000	3.0000	3.0000	0.0073	0.9379	4
2	future_7day_active_days	0.1111	6.8055	0.7406	7.0000	7.0000	7.0000	7.0000	0.0051	0.8966	8
3	future_3day_interactions	0.0476	156.2579	80.1244	99.0000	153.0000	209.0000	581.0000	0.0073	0.0002	398
4	future_7day_interactions	0.1111	377.9823	155.8725	275.0000	383.0000	489.0000	956.0000	0.0051	0.0002	691
5	future_7day_avg_daily_interactions	0.1111	53.9975	22.2675	39.2857	54.7143	69.8571	136.5714	0.0051	0.0002	691
6	future_7day_play_hours	0.1111	0.9122	0.4294	0.6202	0.8982	1.1665	3.2961	0.0051	0.0002	5061
7	log1p_future_7day_interactions	0.1111	5.8030	0.6652	5.6204	5.9506	6.1944	6.8638	0.0051	0.0002	691

The outcome summary compares missingness, variation, and ceiling effects. Outcomes with low variation are still useful as secondary checks, but the primary outcome should give the model enough signal to distinguish future engagement across user-days.

Visualize Future Outcome Distributions

This cell plots the most important future outcome candidates. The active-days outcome is bounded between 0 and 7, while interaction volume and watch hours are continuous or count-like engagement measures. Seeing them together makes the tradeoff visible: retention-style outcomes are easy to explain, but in this sample they may have less variation than engagement-volume outcomes.

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

sns.histplot(data=user_day, x="future_7day_active_days", bins=np.arange(-0.5, 8.5, 1), ax=axes[0], color="#2A6F97")
axes[0].set_title("Future 7-Day Active Days")
axes[0].set_xlabel("Active days in next 7 days")

sns.histplot(data=user_day, x="future_7day_interactions", bins=50, ax=axes[1], color="#5C946E")
axes[1].set_title("Future 7-Day Interactions")
axes[1].set_xlabel("Interactions in next 7 days")

sns.histplot(data=user_day, x="future_7day_play_hours", bins=50, ax=axes[2], color="#C07F00")
axes[2].set_title("Future 7-Day Watch Hours")
axes[2].set_xlabel("Watch hours in next 7 days")

plt.tight_layout()
plt.show()

The distributions make the outcome tradeoff visible. Future active days are easy to explain but bounded, while future interactions and watch hours contain richer variation for modeling long-term engagement.

Check the Retention Ceiling Effect

Next-day and 7-day active outcomes are intuitively appealing because they sound like retention. However, the sampled panel is very active. If nearly all rows have next-day activity or full 7-day activity, then a retention outcome cannot distinguish users well.

This cell quantifies that ceiling effect. The result does not make retention irrelevant; it tells us retention should be secondary in this sample, while a richer engagement outcome may be better as the primary outcome.

retention_ceiling = pd.DataFrame(
    [
        {
            "outcome": "next_day_active",
            "available_rows": user_day["next_day_active"].notna().sum(),
            "mean": user_day["next_day_active"].mean(),
            "share_at_max": (user_day["next_day_active"] == 1).mean(),
        },
        {
            "outcome": "future_3day_active_days",
            "available_rows": user_day["future_3day_active_days"].notna().sum(),
            "mean": user_day["future_3day_active_days"].mean(),
            "share_at_max": (user_day["future_3day_active_days"] == 3).mean(),
        },
        {
            "outcome": "future_7day_active_days",
            "available_rows": user_day["future_7day_active_days"].notna().sum(),
            "mean": user_day["future_7day_active_days"].mean(),
            "share_at_max": (user_day["future_7day_active_days"] == 7).mean(),
        },
    ]
)

display(retention_ceiling)

	outcome	available_rows	mean	share_at_max
0	next_day_active	5642	0.9661	0.9508
1	future_3day_active_days	5460	2.9117	0.8932
2	future_7day_active_days	5096	6.8055	0.7970

The ceiling-effect table confirms that retention-style outcomes are highly saturated in this sample. This supports using future 7-day interactions as the primary outcome while keeping active-days metrics as secondary product-health checks.

Naive Treatment-Outcome Associations

The next table compares treated and untreated active user-days for each treatment-outcome pair. This is still not causal. It is a diagnostic table that helps us understand whether a candidate estimand produces an interpretable contrast.

A large difference here may reflect confounding, selection, or genuine treatment effects. Later notebooks will try to adjust for confounding. Here we only use the table to compare candidate definitions.

def summarize_naive_difference(data, treatment_col, outcome_col):
    subset = data.loc[data["active_day"].eq(1)].dropna(subset=[treatment_col, outcome_col]).copy()
    grouped = subset.groupby(treatment_col)[outcome_col].agg(["mean", "count", "std"])
    if not set(grouped.index) >= {0, 1}:
        return None

    control_mean = grouped.loc[0, "mean"]
    treated_mean = grouped.loc[1, "mean"]
    return {
        "treatment": treatment_col,
        "outcome": outcome_col,
        "control_mean": control_mean,
        "treated_mean": treated_mean,
        "difference": treated_mean - control_mean,
        "relative_lift": (treated_mean / control_mean - 1) if control_mean != 0 else np.nan,
        "control_days": grouped.loc[0, "count"],
        "treated_days": grouped.loc[1, "count"],
    }

naive_rows = []
for treatment in candidate_treatments:
    for outcome in candidate_outcomes:
        row = summarize_naive_difference(user_day, treatment, outcome)
        if row is not None:
            naive_rows.append(row)

naive_associations = pd.DataFrame(naive_rows)

display(
    naive_associations.sort_values(["outcome", "treatment"])
)

	treatment	outcome	control_mean	treated_mean	difference	relative_lift	control_days	treated_days
1	treatment_high_intensity	future_3day_active_days	2.9371	2.9477	0.0106	0.0036	3880	1434
9	treatment_high_watch_exposure	future_3day_active_days	2.9479	2.9321	-0.0158	-0.0054	2649	2665
25	treatment_high_watch_time	future_3day_active_days	2.9377	2.9465	0.0089	0.0030	3930	1384
17	treatment_overcompletion_exposure	future_3day_active_days	2.9450	2.9350	-0.0100	-0.0034	2637	2677
3	treatment_high_intensity	future_3day_interactions	138.8616	207.0342	68.1726	0.4909	3880	1434
11	treatment_high_watch_exposure	future_3day_interactions	157.7493	156.7700	-0.9794	-0.0062	2649	2665
27	treatment_high_watch_time	future_3day_interactions	141.6537	201.5686	59.9150	0.4230	3930	1384
19	treatment_overcompletion_exposure	future_3day_interactions	156.8654	157.6451	0.7797	0.0050	2637	2677
2	treatment_high_intensity	future_7day_active_days	6.8580	6.8562	-0.0017	-0.0003	3528	1433
10	treatment_high_watch_exposure	future_7day_active_days	6.8719	6.8430	-0.0289	-0.0042	2483	2478
26	treatment_high_watch_time	future_7day_active_days	6.8572	6.8582	0.0010	0.0001	3579	1382
18	treatment_overcompletion_exposure	future_7day_active_days	6.8719	6.8433	-0.0286	-0.0042	2459	2502
5	treatment_high_intensity	future_7day_avg_daily_interactions	49.7585	65.2750	15.5165	0.3118	3528	1433
13	treatment_high_watch_exposure	future_7day_avg_daily_interactions	53.7613	54.7207	0.9593	0.0178	2483	2478
29	treatment_high_watch_time	future_7day_avg_daily_interactions	50.4425	64.0765	13.6340	0.2703	3579	1382
21	treatment_overcompletion_exposure	future_7day_avg_daily_interactions	53.8189	54.6550	0.8361	0.0155	2459	2502
4	treatment_high_intensity	future_7day_interactions	348.3098	456.9253	108.6155	0.3118	3528	1433
12	treatment_high_watch_exposure	future_7day_interactions	376.3294	383.0448	6.7154	0.0178	2483	2478
28	treatment_high_watch_time	future_7day_interactions	353.0972	448.5355	95.4382	0.2703	3579	1382
20	treatment_overcompletion_exposure	future_7day_interactions	376.7320	382.5847	5.8527	0.0155	2459	2502
6	treatment_high_intensity	future_7day_play_hours	0.8446	1.0943	0.2497	0.2956	3528	1433
14	treatment_high_watch_exposure	future_7day_play_hours	0.8409	0.9927	0.1517	0.1804	2483	2478
30	treatment_high_watch_time	future_7day_play_hours	0.8178	1.1729	0.3551	0.4342	3579	1382
22	treatment_overcompletion_exposure	future_7day_play_hours	0.8212	1.0107	0.1895	0.2308	2459	2502
7	treatment_high_intensity	log1p_future_7day_interactions	5.7280	6.0881	0.3601	0.0629	3528	1433
15	treatment_high_watch_exposure	log1p_future_7day_interactions	5.8234	5.8408	0.0174	0.0030	2483	2478
31	treatment_high_watch_time	log1p_future_7day_interactions	5.7424	6.0642	0.3218	0.0560	3579	1382
23	treatment_overcompletion_exposure	log1p_future_7day_interactions	5.8225	5.8415	0.0190	0.0033	2459	2502
0	treatment_high_intensity	next_day_active	0.9750	0.9868	0.0117	0.0120	4048	1434
8	treatment_high_watch_exposure	next_day_active	0.9835	0.9728	-0.0107	-0.0109	2728	2754
24	treatment_high_watch_time	next_day_active	0.9761	0.9841	0.0080	0.0082	4097	1385
16	treatment_overcompletion_exposure	next_day_active	0.9787	0.9775	-0.0012	-0.0012	2724	2758

The naive association table helps compare candidate treatment-outcome pairs, but it is still descriptive. The table points toward promising contrasts; the next cells examine whether those contrasts are confounded by prior user behavior.

Visualize Candidate Treatment Effects Descriptively

This plot focuses on future 7-day interactions because that is the strongest primary-outcome candidate. Again, this is descriptive. The purpose is to compare treatment definitions, not to estimate causal effects yet.

plot_df = user_day.loc[user_day["active_day"].eq(1), candidate_treatments + ["future_7day_interactions"]].copy()
plot_df = plot_df.melt(
    id_vars="future_7day_interactions",
    value_vars=candidate_treatments,
    var_name="treatment",
    value_name="treated",
)
plot_df["group"] = np.where(plot_df["treated"].eq(1), "treated", "control")

fig, ax = plt.subplots(figsize=(12, 5))
sns.barplot(
    data=plot_df.dropna(subset=["future_7day_interactions"]),
    x="treatment",
    y="future_7day_interactions",
    hue="group",
    errorbar=("ci", 95),
    ax=ax,
)
ax.set_title("Descriptive Future 7-Day Interactions by Candidate Treatment")
ax.set_xlabel("Candidate treatment")
ax.set_ylabel("Mean future 7-day interactions")
ax.tick_params(axis="x", rotation=20)
plt.tight_layout()
plt.show()

The bar plot focuses attention on future 7-day interactions and shows how treated and control means differ for each candidate treatment. Because these are unadjusted means, they are best used to choose a coherent estimand rather than to claim an effect.

Check Pre-Treatment Imbalance

A candidate treatment is more credible if we can model its assignment using observed pre-treatment history. The next cell measures how different treated and untreated active days are before treatment occurs.

The standardized mean difference compares treated and control means in standard deviation units. Large imbalances are expected in recommender logs, and they are a warning that naive comparisons are not enough.

pre_treatment_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "lag_1_high_watch_share",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_avg_watch_ratio",
    "prior_3day_high_watch_share",
    "calendar_day_index",
]

def standardized_mean_differences(data, treatment_col, covariate_cols):
    analytic = data.loc[data["active_day"].eq(1)].dropna(subset=[treatment_col]).copy()
    rows = []
    for covariate in covariate_cols:
        treated = analytic.loc[analytic[treatment_col].eq(1), covariate].dropna()
        control = analytic.loc[analytic[treatment_col].eq(0), covariate].dropna()
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        rows.append(
            {
                "treatment": treatment_col,
                "covariate": covariate,
                "treated_mean": treated.mean(),
                "control_mean": control.mean(),
                "smd": (treated.mean() - control.mean()) / pooled_sd if pooled_sd and not np.isnan(pooled_sd) else np.nan,
            }
        )
    return pd.DataFrame(rows)

balance_tables = [
    standardized_mean_differences(user_day, treatment, pre_treatment_covariates)
    for treatment in candidate_treatments
]
confounding_balance = pd.concat(balance_tables, ignore_index=True)

display(
    confounding_balance.assign(abs_smd=confounding_balance["smd"].abs())
    .sort_values(["treatment", "abs_smd"], ascending=[True, False])
    .drop(columns="abs_smd")
)

	treatment	covariate	treated_mean	control_mean	smd
6	treatment_high_intensity	prior_3day_interactions	207.7497	134.4448	1.0026
1	treatment_high_intensity	lag_1_interactions	71.7364	44.2764	0.8406
7	treatment_high_intensity	prior_3day_total_play_duration_sec	1,794.1988	1,177.3717	0.8370
2	treatment_high_intensity	lag_1_total_play_duration_sec	617.3946	388.2157	0.7200
10	treatment_high_intensity	calendar_day_index	24.7789	32.7007	-0.5046
5	treatment_high_intensity	prior_3day_active_day	2.9303	2.8279	0.2198
9	treatment_high_intensity	prior_3day_high_watch_share	1.4207	1.3741	0.1029
0	treatment_high_intensity	lag_1_active_day	0.9770	0.9654	0.0694
8	treatment_high_intensity	prior_3day_avg_watch_ratio	2.7440	2.6835	0.0556
3	treatment_high_intensity	lag_1_avg_watch_ratio	0.9103	0.9208	-0.0193
4	treatment_high_intensity	lag_1_high_watch_share	0.4730	0.4697	0.0181
20	treatment_high_watch_exposure	prior_3day_high_watch_share	1.5894	1.1817	0.9546
15	treatment_high_watch_exposure	lag_1_high_watch_share	0.5439	0.3969	0.8586
19	treatment_high_watch_exposure	prior_3day_avg_watch_ratio	2.9314	2.4656	0.4026
14	treatment_high_watch_exposure	lag_1_avg_watch_ratio	1.0097	0.8259	0.3077
18	treatment_high_watch_exposure	prior_3day_total_play_duration_sec	1,424.6386	1,249.1272	0.2284
13	treatment_high_watch_exposure	lag_1_total_play_duration_sec	479.3493	415.6172	0.2007
21	treatment_high_watch_exposure	calendar_day_index	30.1452	31.1557	-0.0560
17	treatment_high_watch_exposure	prior_3day_interactions	151.6120	155.2572	-0.0452
12	treatment_high_watch_exposure	lag_1_interactions	50.8336	51.9457	-0.0332
16	treatment_high_watch_exposure	prior_3day_active_day	2.8483	2.8606	-0.0232
11	treatment_high_watch_exposure	lag_1_active_day	0.9672	0.9696	-0.0135
40	treatment_high_watch_time	prior_3day_total_play_duration_sec	1,905.5116	1,147.5201	1.0275
39	treatment_high_watch_time	prior_3day_interactions	202.2621	137.1404	0.8747
35	treatment_high_watch_time	lag_1_total_play_duration_sec	654.1906	378.6461	0.8578
34	treatment_high_watch_time	lag_1_interactions	69.4426	45.3656	0.7309
43	treatment_high_watch_time	calendar_day_index	25.0404	32.5200	-0.4719
41	treatment_high_watch_time	prior_3day_avg_watch_ratio	3.0432	2.5844	0.3956
42	treatment_high_watch_time	prior_3day_high_watch_share	1.5126	1.3440	0.3691
37	treatment_high_watch_time	lag_1_high_watch_share	0.5058	0.4588	0.2602
36	treatment_high_watch_time	lag_1_avg_watch_ratio	1.0144	0.8859	0.2294
38	treatment_high_watch_time	prior_3day_active_day	2.9227	2.8316	0.1927
33	treatment_high_watch_time	lag_1_active_day	0.9776	0.9653	0.0739
31	treatment_overcompletion_exposure	prior_3day_high_watch_share	1.5595	1.2123	0.7885
26	treatment_overcompletion_exposure	lag_1_high_watch_share	0.5346	0.4064	0.7328
30	treatment_overcompletion_exposure	prior_3day_avg_watch_ratio	2.9696	2.4278	0.4716
25	treatment_overcompletion_exposure	lag_1_avg_watch_ratio	1.0235	0.8123	0.3547
29	treatment_overcompletion_exposure	prior_3day_total_play_duration_sec	1,450.5065	1,223.3656	0.2969
24	treatment_overcompletion_exposure	lag_1_total_play_duration_sec	490.5345	404.4647	0.2722
32	treatment_overcompletion_exposure	calendar_day_index	29.5525	31.7493	-0.1219
27	treatment_overcompletion_exposure	prior_3day_active_day	2.8291	2.8799	-0.0965
22	treatment_overcompletion_exposure	lag_1_active_day	0.9618	0.9750	-0.0759
28	treatment_overcompletion_exposure	prior_3day_interactions	150.8615	156.0062	-0.0638
23	treatment_overcompletion_exposure	lag_1_interactions	50.7764	52.0018	-0.0366

The balance table shows how much treated and control user-days differ before treatment. These imbalances are the reason the project needs causal methods in later notebooks instead of a direct comparison of future engagement.

Visualize Confounding by Candidate Treatment

This heatmap shows which treatment definitions are most strongly associated with prior user state. No treatment definition will be perfectly randomized here. The point is to understand what adjustment will need to handle in later notebooks.

balance_heatmap = confounding_balance.pivot(index="covariate", columns="treatment", values="smd")

fig, ax = plt.subplots(figsize=(11, 6))
sns.heatmap(
    balance_heatmap,
    annot=True,
    fmt=".2f",
    cmap="vlag",
    center=0,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("Pre-Treatment Standardized Mean Differences")
ax.set_xlabel("Candidate treatment")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
plt.show()

The heatmap makes the confounding pattern easier to scan across treatment definitions. It shows which histories are most associated with treatment assignment and therefore should be included in propensity or outcome models later.

Select the Primary Estimand

Based on the diagnostics above, this notebook selects the following primary causal target:

Treatment: treatment_high_watch_exposure
An active user-day where the share of videos watched at least 80 percent is at or above the active-day median.

Comparator: active user-days that do not meet that high-watch-exposure rule.

Outcome: future_7day_interactions
The total number of interactions by the same user over the next 7 calendar days.

Population: active user-days with at least 3 prior days of history and at least 7 future days of follow-up.

This choice is practical and interpretable. The treatment captures the quality or satisfaction of the current recommendation day more directly than raw interaction volume. The outcome captures longer-term engagement with more variation than binary retention in this sampled panel. Retention-style outcomes remain secondary diagnostics because they are still important for product interpretation.

PRIMARY_TREATMENT = "treatment_high_watch_exposure"
PRIMARY_OUTCOME = "future_7day_interactions"
SECONDARY_OUTCOMES = [
    "future_7day_active_days",
    "future_7day_avg_daily_interactions",
    "future_7day_play_hours",
    "log1p_future_7day_interactions",
]

primary_estimand = pd.DataFrame(
    [
        {"component": "unit", "definition": "A KuaiRec user observed on one calendar day."},
        {"component": "population", "definition": "Active user-days with at least 3 prior days and 7 future days in the panel."},
        {"component": "treatment", "definition": PRIMARY_TREATMENT},
        {"component": "comparator", "definition": "Active user-days where treatment_high_watch_exposure = 0."},
        {"component": "primary_outcome", "definition": PRIMARY_OUTCOME},
        {"component": "time_horizon", "definition": "The 7 calendar days after the treatment day."},
        {"component": "adjustment_strategy_next", "definition": "Use observed prior behavior and calendar time to model treatment assignment and outcomes."},
    ]
)

display(primary_estimand)

	component	definition
0	unit	A KuaiRec user observed on one calendar day.
1	population	Active user-days with at least 3 prior days an...
2	treatment	treatment_high_watch_exposure
3	comparator	Active user-days where treatment_high_watch_ex...
4	primary_outcome	future_7day_interactions
5	time_horizon	The 7 calendar days after the treatment day.
6	adjustment_strategy_next	Use observed prior behavior and calendar time ...

The estimand table turns the exploratory choices into a formal causal target. From this point forward, later notebooks can use standardized treatment and outcome columns instead of repeatedly debating which metric is primary.

Define the Analysis Population

The analysis population applies the estimand rules. We require the current day to be active because the treatment is defined using consumed recommendation behavior. We require 3 prior days so lagged state is meaningful. We require 7 future days so the primary outcome is fully observed.

This cell creates an inclusion flag and a funnel table that shows how many rows remain after each rule.

user_day["eligible_active_day"] = user_day["active_day"].eq(1)
user_day["eligible_history"] = user_day["has_3day_history"]
user_day["eligible_followup"] = user_day["has_7day_followup"]
user_day["eligible_primary_outcome"] = user_day[PRIMARY_OUTCOME].notna()
user_day["eligible_primary_treatment"] = user_day[PRIMARY_TREATMENT].notna()

eligibility_steps = [
    ("all user-days", pd.Series(True, index=user_day.index)),
    ("active user-days", user_day["eligible_active_day"]),
    ("active + 3-day history", user_day["eligible_active_day"] & user_day["eligible_history"]),
    (
        "active + 3-day history + 7-day follow-up",
        user_day["eligible_active_day"] & user_day["eligible_history"] & user_day["eligible_followup"],
    ),
    (
        "final primary estimand rows",
        user_day["eligible_active_day"]
        & user_day["eligible_history"]
        & user_day["eligible_followup"]
        & user_day["eligible_primary_treatment"]
        & user_day["eligible_primary_outcome"],
    ),
]

funnel_rows = []
for step, mask in eligibility_steps:
    funnel_rows.append(
        {
            "step": step,
            "rows": int(mask.sum()),
            "users": int(user_day.loc[mask, "user_id"].nunique()),
            "share_of_all_rows": mask.mean(),
        }
    )

eligibility_funnel = pd.DataFrame(funnel_rows)
user_day["in_primary_estimand_population"] = eligibility_steps[-1][1]

display(eligibility_funnel)

	step	rows	users	share_of_all_rows
0	all user-days	5733	91	1.0000
1	active user-days	5537	91	0.9658
2	active + 3-day history	5274	91	0.9199
3	active + 3-day history + 7-day follow-up	4698	91	0.8195
4	final primary estimand rows	4698	91	0.8195

The eligibility funnel shows how the final population is formed step by step. This makes the sample restriction transparent: rows are kept because they are active, have enough history, and have full future follow-up for the 7-day outcome.

Check Final Treatment and Outcome Variation

After defining the analysis population, we need to verify that the primary treatment and outcome still have variation. This is the version that matters for modeling, because later estimators will use this filtered table.

primary_population = user_day.loc[user_day["in_primary_estimand_population"]].copy()

primary_variation = pd.DataFrame(
    [
        {"metric": "rows", "value": len(primary_population)},
        {"metric": "users", "value": primary_population["user_id"].nunique()},
        {"metric": "treated_share", "value": primary_population[PRIMARY_TREATMENT].mean()},
        {"metric": "treated_rows", "value": int(primary_population[PRIMARY_TREATMENT].sum())},
        {"metric": "control_rows", "value": int((1 - primary_population[PRIMARY_TREATMENT]).sum())},
        {"metric": "primary_outcome_mean", "value": primary_population[PRIMARY_OUTCOME].mean()},
        {"metric": "primary_outcome_std", "value": primary_population[PRIMARY_OUTCOME].std()},
        {"metric": "primary_outcome_min", "value": primary_population[PRIMARY_OUTCOME].min()},
        {"metric": "primary_outcome_max", "value": primary_population[PRIMARY_OUTCOME].max()},
    ]
)

display(primary_variation)

display(
    primary_population.groupby(PRIMARY_TREATMENT)[PRIMARY_OUTCOME]
    .agg(["count", "mean", "std", "min", "median", "max"])
    .rename_axis(PRIMARY_TREATMENT)
)

	metric	value
0	rows	4,698.0000
1	users	91.0000
2	treated_share	0.4996
3	treated_rows	2,347.0000
4	control_rows	2,351.0000
5	primary_outcome_mean	382.0826
6	primary_outcome_std	156.3942
7	primary_outcome_min	0.0000
8	primary_outcome_max	888.0000

	count	mean	std	min	median	max
treatment_high_watch_exposure
0	2351	378.5479	157.6583	18.0000	380.0000	888.0000
1	2347	385.6233	155.0705	0.0000	402.0000	851.0000

The final variation check confirms that the selected estimand still has a balanced treatment split and a non-degenerate outcome after all eligibility filters. That means the saved table is suitable for the next notebook on treatment assignment and weighting.

Visualize the Primary Outcome by Treatment Group

This plot shows the primary outcome distribution for treated and control rows in the final analysis population. The plot is descriptive, not causal. It helps us see skew, overlap, and whether both groups contain a meaningful range of future engagement.

fig, axes = plt.subplots(1, 2, figsize=(14, 4.8))

sns.boxplot(
    data=primary_population,
    x=PRIMARY_TREATMENT,
    y=PRIMARY_OUTCOME,
    ax=axes[0],
    color="#8ECAE6",
)
axes[0].set_title("Future 7-Day Interactions by Treatment Group")
axes[0].set_xlabel("High-watch exposure day")
axes[0].set_ylabel("Future 7-day interactions")

sns.histplot(
    data=primary_population,
    x="log1p_future_7day_interactions",
    hue=PRIMARY_TREATMENT,
    bins=40,
    common_norm=False,
    stat="density",
    element="step",
    fill=False,
    ax=axes[1],
)
axes[1].set_title("Log Future 7-Day Interactions Distribution")
axes[1].set_xlabel("log(1 + future 7-day interactions)")
axes[1].set_ylabel("Density")

plt.tight_layout()
plt.show()

The plots show overlap and skew in the selected primary outcome by treatment group. Overlap is important because later causal estimators rely on comparing treated and control observations with similar histories.

Build the Modeling Table

The modeling table keeps the primary treatment, primary outcome, secondary outcomes, and pre-treatment covariates needed by later notebooks. It also preserves user_id and date fields so later models can add user-level clustering, calendar controls, or fixed effects.

The table intentionally excludes post-treatment variables from the covariate list. Future outcomes and same-day treatment labels are included as targets, not as adjustment variables.

modeling_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "lag_1_high_watch_share",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_avg_watch_ratio",
    "prior_3day_high_watch_share",
    "calendar_day_index",
    "panel_day_index",
]

modeling_columns = [
    "user_id",
    "event_date",
    "day_of_week",
    "calendar_day_index",
    "panel_day_index",
    "active_day",
    PRIMARY_TREATMENT,
    PRIMARY_OUTCOME,
    *SECONDARY_OUTCOMES,
    *candidate_treatments,
    *modeling_covariates,
]

# Remove duplicates while preserving order because calendar_day_index and panel_day_index appear in multiple roles.
modeling_columns = list(dict.fromkeys(modeling_columns))

estimand_panel = primary_population[modeling_columns].copy()
estimand_panel["treatment"] = estimand_panel[PRIMARY_TREATMENT].astype(int)
estimand_panel["outcome"] = estimand_panel[PRIMARY_OUTCOME].astype(float)
estimand_panel["outcome_log1p"] = np.log1p(estimand_panel["outcome"])
estimand_panel["primary_treatment_name"] = PRIMARY_TREATMENT
estimand_panel["primary_outcome_name"] = PRIMARY_OUTCOME

print(f"Estimand panel shape: {estimand_panel.shape}")
display(estimand_panel.head())

Estimand panel shape: (4698, 30)

	user_id	event_date	day_of_week	calendar_day_index	panel_day_index	active_day	treatment_high_watch_exposure	future_7day_interactions	future_7day_active_days	future_7day_avg_daily_interactions	future_7day_play_hours	log1p_future_7day_interactions	treatment_overcompletion_exposure	lag_1_active_day	lag_1_interactions	lag_1_total_play_duration_sec	lag_1_avg_watch_ratio	lag_1_high_watch_share	prior_3day_active_day	prior_3day_interactions	prior_3day_total_play_duration_sec	prior_3day_avg_watch_ratio	prior_3day_high_watch_share	treatment	outcome	outcome_log1p	primary_treatment_name	primary_outcome_name
3	14	2020-07-08	Wednesday	3	3	1	1	279.0000	7.0000	39.8571	0.7500	5.6348	1	1.0000	78.0000	655.4890	0.8415	0.4615	3.0000	127.0000	1,144.8080	2.9900	1.5602	1	279.0000	5.6348	treatment_high_watch_exposure	future_7day_interactions
4	14	2020-07-09	Thursday	4	4	1	0	311.0000	7.0000	44.4286	0.8534	5.7430	0	1.0000	22.0000	201.9010	0.9828	0.5000	3.0000	123.0000	1,105.7340	2.8883	1.4833	0	311.0000	5.7430	treatment_high_watch_exposure	future_7day_interactions
5	14	2020-07-10	Friday	5	5	1	1	352.0000	7.0000	50.2857	0.9198	5.8665	1	1.0000	55.0000	485.0390	0.8619	0.3636	3.0000	155.0000	1,342.4290	2.6862	1.3252	1	352.0000	5.8665	treatment_high_watch_exposure	future_7day_interactions
6	14	2020-07-11	Saturday	6	6	1	1	437.0000	7.0000	62.4286	1.1297	6.0822	0	1.0000	52.0000	606.2440	1.1380	0.5769	3.0000	129.0000	1,293.1840	2.9827	1.4406	1	437.0000	6.0822	treatment_high_watch_exposure	future_7day_interactions
7	14	2020-07-12	Sunday	7	7	1	0	437.0000	7.0000	62.4286	1.1754	6.0822	0	1.0000	32.0000	284.7470	0.9337	0.5000	3.0000	139.0000	1,376.0300	2.9336	1.4406	0	437.0000	6.0822	treatment_high_watch_exposure	future_7day_interactions

The modeling table is the operational version of the estimand. It keeps identifiers, the primary treatment and outcome, secondary outcomes, and only pre-treatment covariates needed for adjustment.

Final Missingness Check

Before saving, we check missingness for the columns required by later causal models. Missing primary treatment, primary outcome, or covariate values would make downstream estimators fail or silently drop rows.

required_model_columns = ["treatment", "outcome", *modeling_covariates]

final_missingness = (
    estimand_panel[required_model_columns]
    .isna()
    .mean()
    .rename("missing_rate")
    .reset_index()
    .rename(columns={"index": "column"})
    .sort_values("missing_rate", ascending=False)
)

display(final_missingness)

if final_missingness["missing_rate"].max() > 0:
    raise ValueError("The estimand panel has missing values in required model columns.")

	column	missing_rate
0	treatment	0.0000
1	outcome	0.0000
2	lag_1_active_day	0.0000
3	lag_1_interactions	0.0000
4	lag_1_total_play_duration_sec	0.0000
5	lag_1_avg_watch_ratio	0.0000
6	lag_1_high_watch_share	0.0000
7	prior_3day_active_day	0.0000
8	prior_3day_interactions	0.0000
9	prior_3day_total_play_duration_sec	0.0000
10	prior_3day_avg_watch_ratio	0.0000
11	prior_3day_high_watch_share	0.0000
12	calendar_day_index	0.0000
13	panel_day_index	0.0000

The missingness check confirms that required modeling columns are complete. This prevents later causal estimators from silently dropping rows or estimating on a different population than the one defined in this notebook.

Save the Estimand Panel and Summaries

The saved panel is the main handoff to the next notebooks. Later notebooks should load kuairec_long_term_estimand_panel.parquet and use the standardized treatment and outcome columns unless they explicitly compare secondary outcomes.

The summary CSV files record the design choices made here so the project remains auditable.

estimand_panel_path = PROCESSED_DIR / "kuairec_long_term_estimand_panel.parquet"
estimand_summary_path = PROCESSED_DIR / "kuairec_long_term_estimand_summary.csv"
treatment_summary_path = PROCESSED_DIR / "kuairec_long_term_candidate_treatments.csv"
outcome_summary_path = PROCESSED_DIR / "kuairec_long_term_candidate_outcomes.csv"
eligibility_funnel_path = PROCESSED_DIR / "kuairec_long_term_eligibility_funnel.csv"

estimand_panel.to_parquet(estimand_panel_path, index=False)
primary_estimand.to_csv(estimand_summary_path, index=False)
treatment_summary.to_csv(treatment_summary_path, index=False)
outcome_summary.to_csv(outcome_summary_path, index=False)
eligibility_funnel.to_csv(eligibility_funnel_path, index=False)

print("Saved Notebook 02 artifacts:")
print(f"- {estimand_panel_path}")
print(f"- {estimand_summary_path}")
print(f"- {treatment_summary_path}")
print(f"- {outcome_summary_path}")
print(f"- {eligibility_funnel_path}")

Saved Notebook 02 artifacts:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimand_panel.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_estimand_summary.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_candidate_treatments.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_candidate_outcomes.csv
- /home/apex/Documents/ranking_sys/data/processed/kuairec_long_term_eligibility_funnel.csv

The saved files are the handoff to formal causal estimation. The parquet file contains the row-level modeling data, while the CSV summaries document the treatment, outcome, and eligibility decisions that produced it.

Takeaways and Next Step

This notebook defines the primary causal estimand for

Among active KuaiRec user-days with sufficient prior history and 7-day follow-up, estimate the effect of a high-watch-exposure day on the user’s future 7-day interaction volume.

The selected treatment is more about recommendation quality than raw session size, which makes it a good fit for a long-term recommender-system question. The selected outcome has more variation than binary retention in this sample, while still reflecting future user engagement. Retention-style outcomes remain in the saved panel as secondary checks.

The next notebook should examine time-varying confounding and treatment assignment in more detail. That means modeling the probability of treatment from prior user history, checking positivity, and preparing inverse probability weights for a marginal structural model.