Notebook 01: KuaiRec Sequence EDA for Long-Term Causal Effects

This notebook starts the Long-Term Causal Effects project. The central problem is a common tension in recommender systems: a recommendation policy can increase immediate engagement while still hurting longer-term user value. A short video feed, news feed, or streaming homepage may learn to maximize near-term clicks, watch time, or completion rate. Those metrics are useful, but they do not fully answer a more strategic data science question: does the exposure pattern caused by the recommender improve future user retention, future engagement, and healthier long-run consumption behavior?

This is a causal question because users are not randomly assigned to content exposure patterns in normal logs. A user sees videos because the platform selected them, because the user has a history, because the user has preferences, and because prior engagement changes what the recommender does next. If we compare users who received high-engagement exposures against users who did not, the difference may reflect the recommender’s targeting logic rather than the effect of the exposure itself.

The specific problem handled here is the sequential causal inference setup for KuaiRec: each user is followed over calendar days, daily exposure patterns play the role of treatments, prior engagement acts as a time-varying confounder, and future activity and engagement serve as the long-term outcomes.

This notebook does not estimate the final causal effect yet. Instead, it prepares the causal terrain. We inspect whether KuaiRec has enough repeated user activity, enough date coverage, enough treatment variation, and enough future outcomes to support later notebooks on marginal structural models and g-computation. In other words, this notebook answers: can this dataset support a credible long-term causal effects project, and what should the treatment and outcome definitions look like?

Notebook Roadmap

This notebook follows a causal data-understanding path rather than a generic EDA path. First, we load KuaiRec directly from the nested zip archive without extracting the full dataset. Then we inspect the raw interaction fields, user metadata, item category metadata, and item daily popularity features. After that, we aggregate interactions into a user-day panel, because long-term causal effects require a time-indexed panel rather than isolated impression rows.

The most important output of the notebook is a reusable user-day table. Later notebooks can use this table to define treatment histories, estimate inverse probability weights, fit marginal structural models, and compare g-computation style estimates.

Data Fields and Column Guide

This project uses KuaiRec as a sequential recommendation log. The raw data is spread across interaction, user, item-category, and item-daily-feature tables. The field guide below explains the columns before any analysis code runs, so the rest of the notebook is easier to read.

small_matrix.csv: User-Video Interaction Log

This is the main event-level table. Each row represents a user-video watch event.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| user_id | Unique user identifier. | Defines the panel unit. Long-term causal effects are studied by following each user over time. |
| video_id | Unique video identifier. | Identifies the item consumed or exposed in the recommendation environment. |
| play_duration | Amount of time the user played the video, stored in milliseconds. | Short-term engagement signal. Later aggregated into daily play duration and used to define future outcomes. |
| video_duration | Length of the video, stored in milliseconds. | Item context and denominator for watch-ratio calculations. |
| time | Human-readable event time string. | Used as a diagnostic event timestamp; the notebook also parses the Unix timestamp field. |
| date | Calendar date of the event, encoded like 20200705. | Becomes the daily time index for the user-day panel. |
| timestamp | Unix timestamp of the event in seconds. | Used to reconstruct event ordering within users. |
| watch_ratio | play_duration / video_duration; values above 1 can occur when users rewatch or exceed the nominal video length. | Core short-term engagement measure. Used to define high-watch exposure candidates and daily engagement summaries. |
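
As a concrete check of the watch-ratio definition, the small sketch below recomputes it for one interaction; the numbers are copied from a sample row shown later in the cleaned preview, not new data.

# Worked example: watch_ratio = play_duration / video_duration (both stored in milliseconds).
play_duration_ms = 11_635   # from a user 14 row shown later in this notebook
video_duration_ms = 6_100
watch_ratio = play_duration_ms / video_duration_ms
print(round(watch_ratio, 4))  # ~1.9074, above 1 because the user watched longer than the clip length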

user_features.csv: User Metadata

This table contains baseline user attributes and anonymized features. These columns can be used as baseline confounders because user characteristics may affect both what the recommender shows and how the user behaves later.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| user_id | Unique user identifier. | Join key from interaction rows to user-level attributes. |
| user_active_degree | Categorical user activity segment such as high-active or full-active. | Baseline activity segment; likely related to future retention and exposure patterns. |
| is_lowactive_period | Indicator for whether the user is in a low-activity period. | Baseline or period-level activity state. Useful for understanding heterogeneous retention risk. |
| is_live_streamer | Indicator for whether the user is a live streamer. | User role/context feature that may affect content preferences and future behavior. |
| is_video_author | Indicator for whether the user authors videos. | User role/context feature; creators may behave differently from pure consumers. |
| follow_user_num | Number of users this user follows. | Social graph intensity; possible proxy for platform embeddedness. |
| follow_user_num_range | Binned range version of follow_user_num. | Categorical version of follow count for modeling or stratified summaries. |
| fans_user_num | Number of fans/followers the user has. | Popularity or creator-context proxy. |
| fans_user_num_range | Binned range version of fans_user_num. | Categorical version of fan count. |
| friend_user_num | Number of friends connected to the user. | Social connectedness proxy; may predict retention. |
| friend_user_num_range | Binned range version of friend_user_num. | Categorical version of friend count. |
| register_days | Number of days since registration. | User tenure; important baseline confounder for future engagement. |
| register_days_range | Binned range version of register_days. | Categorical tenure feature. |
| onehot_feat0 to onehot_feat17 | Anonymized one-hot user features supplied by KuaiRec. | Potential baseline covariates; useful for prediction and adjustment, but their substantive meaning is intentionally hidden. |

item_categories.csv: Video Category Metadata

This table maps videos to one or more category identifiers.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| video_id | Unique video identifier. | Join key from interaction rows to video metadata. |
| feat | String representation of a list of category ids, such as [27, 9]. | Describes item content category. Later parsed into category count and first-category diagnostics. |

item_daily_features.csv: Daily Video Popularity and Engagement Context

The full table has many daily item metrics. This notebook loads a focused subset that is directly useful for sequential recommendation diagnostics. These features can become item-day context variables because item popularity may affect both recommendation exposure and downstream user behavior.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| video_id | Unique video identifier. | Join key to connect daily item context to consumed videos. |
| date | Calendar date for the item-day metrics. | Aligns item popularity with the same daily panel used for users. |
| show_cnt | Number of times the video was shown. | Visibility/popularity proxy. A highly shown item may be more likely to appear in recommendation logs. |
| show_user_num | Number of users who were shown the video. | User-level reach of the item on that day. |
| play_cnt | Number of plays. | Daily item consumption volume. |
| play_user_num | Number of users who played the video. | Daily user reach among players. |
| complete_play_cnt | Number of complete plays. | Daily completion signal for the item. |
| like_cnt | Number of likes. | Positive feedback signal for the item. |
| comment_cnt | Number of comments. | Deeper engagement signal for the item. |
| share_cnt | Number of shares. | Viral or social engagement signal for the item. |
| collect_cnt | Number of collections/saves. | Longer-intent engagement signal for the item. |

Derived Columns Created Later in This Notebook

The notebook also creates derived fields after loading the raw data. These are not separate raw KuaiRec columns, but they are central to the causal setup.

| Derived column family | Meaning |
| --- | --- |
| event_date, event_time, event_timestamp | Cleaned time fields used to order events and build daily panels. |
| play_duration_sec, video_duration_sec | Human-readable duration fields converted from milliseconds to seconds. |
| high_watch, complete_or_rewatch | Interaction-level engagement indicators based on watch-ratio thresholds. |
| active_day, interactions, total_play_duration_sec, avg_watch_ratio | User-day summaries used to describe daily state. |
| lag_1_*, prior_3day_* | Pre-treatment history variables used as time-varying confounder candidates. |
| next_day_*, future_3day_*, future_7day_* | Future outcome variables used to study longer-term engagement. |
| treatment_high_intensity, treatment_high_watch_exposure | First-pass daily treatment definitions used to inspect treatment variation and confounding. |
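
To make the timeline convention concrete before any real data loads, here is a minimal sketch of how the lag_1_* and next_day_* column families relate to the current day via pandas shift. It mirrors the groupby-shift logic used later in the notebook and reuses the first few user-day interaction counts for user 14 that appear further down; it is an illustration, not part of the pipeline.

import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": [14, 14, 14, 14],
        "event_date": pd.to_datetime(["2020-07-05", "2020-07-06", "2020-07-07", "2020-07-08"]),
        "interactions": [26, 23, 78, 22],
    }
)

# Yesterday's value is pre-treatment state; tomorrow's value is a candidate outcome.
toy["lag_1_interactions"] = toy.groupby("user_id")["interactions"].shift(1)
toy["next_day_interactions"] = toy.groupby("user_id")["interactions"].shift(-1)
print(toy)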

Setup

The first code cell imports the libraries used throughout the notebook. The plotting settings are kept simple and readable because this notebook is mostly about understanding temporal structure. We also make pandas show enough rows and columns for compact diagnostic tables.

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import ast
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 80)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")

The notebook environment is now ready for sequential EDA: pandas handles the panel operations, seaborn and matplotlib handle visual checks, and the display settings make the wide diagnostic tables readable. The next step is to locate the project data in a way that works whether the notebook is run from the repository root or from inside the notebook folder.

Locate the Project and Data

This notebook may be run from the repository root or from inside the notebooks/projects/project_3_long_term_causal_effects folder. The next cell searches upward until it finds the KuaiRec archive. Keeping this path logic inside the notebook makes it easier to rerun from Jupyter without manually changing directories.

KUAI_REC_ZIP_RELATIVE_PATH = Path("data/Kuairec/18164998.zip")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / KUAI_REC_ZIP_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {KUAI_REC_ZIP_RELATIVE_PATH}. Run this notebook from inside the ranking_sys project."
    )

KUAI_REC_ZIP = PROJECT_ROOT / KUAI_REC_ZIP_RELATIVE_PATH
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"KuaiRec archive: {KUAI_REC_ZIP}")
print(f"Processed output folder: {PROCESSED_DIR}")
Project root: /home/apex/Documents/ranking_sys
KuaiRec archive: /home/apex/Documents/ranking_sys/data/Kuairec/18164998.zip
Processed output folder: /home/apex/Documents/ranking_sys/data/processed

The printed paths confirm that the notebook found the project root, the KuaiRec archive, and the processed-data folder. With those locations fixed, the next cell can inspect the raw archive contents without hard-coding local machine paths.

Inspect the Nested Archive

The downloaded file is an outer archive that contains a second archive named KuaiRec.zip. The interaction and metadata CSV files live inside that inner archive. This cell lists the relevant files so we know exactly what data sources are available before reading any large tables.

with ZipFile(KUAI_REC_ZIP) as outer_zip:
    outer_inventory = pd.DataFrame(
        [
            {
                "member": member.filename,
                "size_mb": member.file_size / 1_000_000,
            }
            for member in outer_zip.infolist()
        ]
    )
    inner_zip_bytes = outer_zip.read("KuaiRec.zip")

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    inner_inventory = pd.DataFrame(
        [
            {
                "member": member.filename,
                "size_mb": member.file_size / 1_000_000,
            }
            for member in inner_zip.infolist()
            if member.filename.startswith("KuaiRec 2.0/data/")
        ]
    )

print("Outer archive contents")
display(outer_inventory.sort_values("size_mb", ascending=False))

print("Inner KuaiRec data files")
display(inner_inventory.sort_values("member"))
Outer archive contents
member size_mb
0 KuaiRec.zip 431.9649
1 kuairec_caption_category.csv 1.9646
2 video_raw_categories_multi.csv 1.7245
3 user_features_raw.csv 1.5416
Inner KuaiRec data files
member size_mb
0 KuaiRec 2.0/data/ 0.0000
1 KuaiRec 2.0/data/big_matrix.csv 1,083.5212
2 KuaiRec 2.0/data/item_categories.csv 0.1131
3 KuaiRec 2.0/data/item_daily_features.csv 85.8552
4 KuaiRec 2.0/data/kuairec_caption_category.csv 1.9646
5 KuaiRec 2.0/data/small_matrix.csv 406.1558
6 KuaiRec 2.0/data/social_network.csv 0.0069
7 KuaiRec 2.0/data/user_features.csv 0.7442

The archive inventory shows that KuaiRec is nested: the downloaded file contains an inner KuaiRec.zip, and the useful CSV files live inside that inner archive. This confirms that the notebook should read the data directly from the nested zip rather than assuming the dataset has already been extracted.

Choose the Tables and Sample Size

KuaiRec contains very large interaction matrices. For a first EDA notebook, we use a deterministic sample from small_matrix.csv. This is enough to study repeated users, daily activity, candidate treatments, and future outcomes while keeping the notebook fast enough to rerun interactively.

The sample is not the final modeling decision. It is a development-friendly slice used to design the causal panel. Later notebooks can increase the sample size or switch to a full extraction once the logic is stable.
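
If a later notebook needs the full table rather than a head sample, one option is a chunked read directly from the nested archive. The sketch below is illustrative only and is not executed here: it reuses the inner_zip_bytes object created in the archive-inspection cell, and the chunk size of 1,000,000 rows is an arbitrary assumption rather than a tuned value.

from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Illustrative chunked pass over the full small_matrix.csv.
# Assumes `inner_zip_bytes` from the archive-inspection cell is still in memory.
chunk_rows = 1_000_000
rows_per_user = None
with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open("KuaiRec 2.0/data/small_matrix.csv") as file:
        for chunk in pd.read_csv(file, chunksize=chunk_rows):
            counts = chunk.groupby("user_id").size()
            rows_per_user = counts if rows_per_user is None else rows_per_user.add(counts, fill_value=0)

print(rows_per_user.describe())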

SMALL_MATRIX_MEMBER = "KuaiRec 2.0/data/small_matrix.csv"
USER_FEATURES_MEMBER = "KuaiRec 2.0/data/user_features.csv"
ITEM_CATEGORIES_MEMBER = "KuaiRec 2.0/data/item_categories.csv"
ITEM_DAILY_FEATURES_MEMBER = "KuaiRec 2.0/data/item_daily_features.csv"

SAMPLE_ROWS = 300_000
ITEM_DAILY_SAMPLE_ROWS = 150_000

print(f"Interaction rows to read from small_matrix.csv: {SAMPLE_ROWS:,}")
print(f"Item daily feature rows to read: {ITEM_DAILY_SAMPLE_ROWS:,}")
Interaction rows to read from small_matrix.csv: 300,000
Item daily feature rows to read: 150,000

The selected files cover the ingredients needed for the project: interaction events, user metadata, item categories, and daily item context. The sample sizes keep this first notebook fast while still preserving a multi-user, multi-day sequential structure for causal design work.

Load and Normalize the Interaction Sample

The interaction table is the core event log. Each row records a user watching a video at a time, with the play duration, video duration, date, timestamp, and watch ratio. The next cell reads the sample and creates cleaner time columns.

The durations appear to be stored in milliseconds, so the cell also creates second-based versions that are easier to interpret. The watch_ratio is the observed play duration divided by video duration. Values above 1 can happen when users rewatch, loop, or otherwise spend longer than the nominal video duration.

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open(SMALL_MATRIX_MEMBER) as file:
        interactions = pd.read_csv(file, nrows=SAMPLE_ROWS)

# Convert KuaiRec's numeric date field into a proper datetime date.
date_numeric = pd.to_numeric(interactions["date"], errors="coerce").round().astype("Int64")
interactions["event_date"] = pd.to_datetime(
    date_numeric.astype("string"),
    format="%Y%m%d",
    errors="coerce",
)

# The timestamp column is Unix time in seconds. It is the most reliable event-time field.
interactions["event_timestamp"] = pd.to_datetime(
    pd.to_numeric(interactions["timestamp"], errors="coerce"),
    unit="s",
    errors="coerce",
)

# Keep a parsed version of the original time field as a backup and diagnostic.
interactions["event_time_from_text"] = pd.to_datetime(interactions["time"], errors="coerce")
interactions["event_time"] = interactions["event_time_from_text"].fillna(interactions["event_timestamp"])

# Add human-readable duration columns. The original millisecond fields are retained.
interactions["play_duration_sec"] = interactions["play_duration"] / 1_000
interactions["video_duration_sec"] = interactions["video_duration"] / 1_000
interactions["watch_ratio_capped_5"] = interactions["watch_ratio"].clip(lower=0, upper=5)

interactions = interactions.sort_values(["user_id", "event_time", "video_id"]).reset_index(drop=True)

print(f"Loaded interaction sample shape: {interactions.shape}")
display(interactions.head())
Loaded interaction sample shape: (300000, 15)
user_id video_id play_duration video_duration time date timestamp watch_ratio event_date event_timestamp event_time_from_text event_time play_duration_sec video_duration_sec watch_ratio_capped_5
0 14 148 4381 6067 2020-07-05 05:27:48.378 20,200,705.0000 1,593,898,068.3780 0.7221 2020-07-05 2020-07-04 21:27:48.378000021 2020-07-05 05:27:48.378 2020-07-05 05:27:48.378 4.3810 6.0670 0.7221
1 14 183 11635 6100 2020-07-05 05:28:00.057 20,200,705.0000 1,593,898,080.0570 1.9074 2020-07-05 2020-07-04 21:28:00.056999922 2020-07-05 05:28:00.057 2020-07-05 05:28:00.057 11.6350 6.1000 1.9074
2 14 3649 22422 10867 2020-07-05 05:29:09.479 20,200,705.0000 1,593,898,149.4790 2.0633 2020-07-05 2020-07-04 21:29:09.479000092 2020-07-05 05:29:09.479 2020-07-05 05:29:09.479 22.4220 10.8670 2.0633
3 14 5262 4479 7908 2020-07-05 05:30:43.285 20,200,705.0000 1,593,898,243.2850 0.5664 2020-07-05 2020-07-04 21:30:43.285000086 2020-07-05 05:30:43.285 2020-07-05 05:30:43.285 4.4790 7.9080 0.5664
4 14 8234 4602 11000 2020-07-05 05:35:43.459 20,200,705.0000 1,593,898,543.4590 0.4184 2020-07-05 2020-07-04 21:35:43.459000111 2020-07-05 05:35:43.459 2020-07-05 05:35:43.459 4.6020 11.0000 0.4184

The interaction sample is now sorted and enriched with clean event dates, timestamps, seconds-based duration fields, and a capped watch-ratio helper for plots. This makes the raw event log usable for both ordinary EDA and later user-day aggregation.

Load User, Item, and Daily Item Metadata

The interaction log is not enough for causal analysis because treatment assignment depends on user and item context. The next cell loads three metadata tables:

  • user_features.csv: user-level characteristics such as activity degree, follower counts, registration age, and anonymized one-hot features.
  • item_categories.csv: video category identifiers.
  • item_daily_features.csv: daily item popularity and engagement counts, which are useful proxies for how popular or visible an item was on a given day.

These features are not final confounders yet, but they are candidates for later treatment and outcome models.

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open(USER_FEATURES_MEMBER) as file:
        user_features = pd.read_csv(file)

    with inner_zip.open(ITEM_CATEGORIES_MEMBER) as file:
        item_categories = pd.read_csv(file)

    daily_usecols = [
        "video_id",
        "date",
        "show_cnt",
        "show_user_num",
        "play_cnt",
        "play_user_num",
        "complete_play_cnt",
        "like_cnt",
        "comment_cnt",
        "share_cnt",
        "collect_cnt",
    ]
    with inner_zip.open(ITEM_DAILY_FEATURES_MEMBER) as file:
        item_daily = pd.read_csv(file, usecols=daily_usecols, nrows=ITEM_DAILY_SAMPLE_ROWS)

# Convert category strings such as "[27, 9]" into counts and first-category diagnostics.
def parse_category_list(value):
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

category_lists = item_categories["feat"].map(parse_category_list)
item_categories["category_count"] = category_lists.map(len)
item_categories["first_category"] = category_lists.map(lambda values: values[0] if values else np.nan)

item_daily["event_date"] = pd.to_datetime(item_daily["date"].astype("string"), format="%Y%m%d", errors="coerce")
item_daily["item_play_rate"] = np.where(
    item_daily["show_cnt"] > 0,
    item_daily["play_cnt"] / item_daily["show_cnt"],
    np.nan,
)
item_daily["item_like_rate_per_play"] = np.where(
    item_daily["play_cnt"] > 0,
    item_daily["like_cnt"] / item_daily["play_cnt"],
    np.nan,
)

print(f"User features shape: {user_features.shape}")
print(f"Item categories shape: {item_categories.shape}")
print(f"Item daily feature sample shape: {item_daily.shape}")
User features shape: (7176, 31)
Item categories shape: (10728, 4)
Item daily feature sample shape: (150000, 14)

The metadata tables loaded successfully and can be joined to the interaction log by user_id, video_id, and date. This matters because long-term causal modeling needs context: users and items are not exchangeable, and popularity or user tenure may influence both exposure and future engagement.
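
The joins themselves are deferred to later notebooks, but a minimal sketch of what they would look like is shown below. It uses only columns created above; because item_daily is a row-limited sample, many item-day pairs will simply not match and will come back as missing values.

# Sketch of the user, category, and item-day joins (deferred to later notebooks).
interactions_enriched = (
    interactions
    .merge(user_features, on="user_id", how="left")
    .merge(item_categories[["video_id", "category_count", "first_category"]], on="video_id", how="left")
    .merge(
        item_daily[["video_id", "event_date", "show_cnt", "play_cnt", "item_play_rate"]],
        on=["video_id", "event_date"],
        how="left",
    )
)
print(interactions_enriched.shape)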

Raw Column Dictionary

Before aggregating anything, it helps to translate the key raw fields into causal language. This cell creates a compact dictionary for the columns we will use most often. This is especially useful in a portfolio notebook because reviewers can see the connection between raw logging fields and causal design components.

column_dictionary = pd.DataFrame(
    [
        {
            "table": "small_matrix",
            "column": "user_id",
            "meaning": "User identifier. This becomes the panel unit.",
            "causal_role": "Unit identifier",
        },
        {
            "table": "small_matrix",
            "column": "video_id",
            "meaning": "Recommended or consumed video identifier.",
            "causal_role": "Exposure content identifier",
        },
        {
            "table": "small_matrix",
            "column": "play_duration",
            "meaning": "Observed play duration in milliseconds.",
            "causal_role": "Short-term engagement outcome and future-state input",
        },
        {
            "table": "small_matrix",
            "column": "video_duration",
            "meaning": "Video duration in milliseconds.",
            "causal_role": "Item context and denominator for watch ratio",
        },
        {
            "table": "small_matrix",
            "column": "watch_ratio",
            "meaning": "Play duration divided by video duration; values above 1 indicate rewatch or over-completion.",
            "causal_role": "Short-term engagement and candidate exposure-quality measure",
        },
        {
            "table": "small_matrix",
            "column": "event_date",
            "meaning": "Calendar date derived from KuaiRec's date field.",
            "causal_role": "Time index for the user-day panel",
        },
        {
            "table": "user_features",
            "column": "user_active_degree",
            "meaning": "Categorical user activity segment supplied by KuaiRec.",
            "causal_role": "Baseline covariate and confounder candidate",
        },
        {
            "table": "item_daily_features",
            "column": "show_cnt / play_cnt",
            "meaning": "Daily item visibility and play counts.",
            "causal_role": "Item popularity context and confounder candidate",
        },
    ]
)

display(column_dictionary)
table column meaning causal_role
0 small_matrix user_id User identifier. This becomes the panel unit. Unit identifier
1 small_matrix video_id Recommended or consumed video identifier. Exposure content identifier
2 small_matrix play_duration Observed play duration in milliseconds. Short-term engagement outcome and future-state...
3 small_matrix video_duration Video duration in milliseconds. Item context and denominator for watch ratio
4 small_matrix watch_ratio Play duration divided by video duration; value... Short-term engagement and candidate exposure-q...
5 small_matrix event_date Calendar date derived from KuaiRec's date field. Time index for the user-day panel
6 user_features user_active_degree Categorical user activity segment supplied by ... Baseline covariate and confounder candidate
7 item_daily_features show_cnt / play_cnt Daily item visibility and play counts. Item popularity context and confounder candidate

The column dictionary translates raw logging fields into causal roles. That translation is important for the rest of the notebook: we are not just describing columns, we are deciding which fields can become units, treatments, covariates, and outcomes.

Basic Shape and Coverage

The first data-readiness question is whether the sample contains repeated observations over time. Long-term causal effects require enough users, enough dates, and enough future observations. The next cell summarizes the main coverage metrics for the interaction sample.

coverage_summary = pd.DataFrame(
    [
        {"metric": "interaction_rows", "value": len(interactions)},
        {"metric": "unique_users", "value": interactions["user_id"].nunique()},
        {"metric": "unique_videos", "value": interactions["video_id"].nunique()},
        {"metric": "unique_event_dates", "value": interactions["event_date"].nunique()},
        {"metric": "first_event_date", "value": interactions["event_date"].min()},
        {"metric": "last_event_date", "value": interactions["event_date"].max()},
        {"metric": "first_event_time", "value": interactions["event_time"].min()},
        {"metric": "last_event_time", "value": interactions["event_time"].max()},
    ]
)

display(coverage_summary)
metric value
0 interaction_rows 300000
1 unique_users 91
2 unique_videos 3327
3 unique_event_dates 63
4 first_event_date 2020-07-05 00:00:00
5 last_event_date 2020-09-05 00:00:00
6 first_event_time 2020-07-05 00:01:03.816000
7 last_event_time 2020-09-05 23:11:44.456000

The coverage table confirms that the sample contains repeated observations across users, videos, and calendar dates. Because the data spans multiple days per user, it is suitable for constructing histories and future outcomes rather than only one-shot engagement summaries.

Example Rows After Cleaning

This cell shows a few cleaned rows after date parsing and duration normalization. The goal is not to inspect every column manually but to verify that the event time, event date, duration, and watch-ratio fields look coherent before we build a panel from them.

preview_columns = [
    "user_id",
    "video_id",
    "event_date",
    "event_time",
    "play_duration_sec",
    "video_duration_sec",
    "watch_ratio",
]

display(interactions[preview_columns].head(10))
user_id video_id event_date event_time play_duration_sec video_duration_sec watch_ratio
0 14 148 2020-07-05 2020-07-05 05:27:48.378 4.3810 6.0670 0.7221
1 14 183 2020-07-05 2020-07-05 05:28:00.057 11.6350 6.1000 1.9074
2 14 3649 2020-07-05 2020-07-05 05:29:09.479 22.4220 10.8670 2.0633
3 14 5262 2020-07-05 2020-07-05 05:30:43.285 4.4790 7.9080 0.5664
4 14 8234 2020-07-05 2020-07-05 05:35:43.459 4.6020 11.0000 0.4184
5 14 6789 2020-07-05 2020-07-05 05:36:00.773 8.6070 13.2670 0.6488
6 14 1963 2020-07-05 2020-07-05 05:36:47.741 8.6130 9.5900 0.8981
7 14 175 2020-07-05 2020-07-05 05:49:27.965 11.6400 46.5140 0.2502
8 14 1973 2020-07-05 2020-07-05 05:49:41.762 4.5720 7.4000 0.6178
9 14 171 2020-07-05 2020-07-05 05:57:26.581 8.5180 5.2170 1.6327

The preview rows show that the cleaned event date, event time, durations, and watch ratio are aligned at the interaction level. Since these fields look coherent, the next checks can focus on missingness and distributional shape rather than basic parsing issues.

Missingness and Data Type Checks

Missing values matter because causal estimators usually require complete treatment, outcome, and confounder histories. This cell computes missing rates and shows the cleaned data types. A high missing rate in a key time or outcome column would be a blocker for later notebooks.

missingness = (
    interactions.isna()
    .mean()
    .sort_values(ascending=False)
    .rename("missing_rate")
    .reset_index()
    .rename(columns={"index": "column"})
)

dtypes = (
    interactions.dtypes.astype(str)
    .rename("dtype")
    .reset_index()
    .rename(columns={"index": "column"})
)

missingness_with_types = missingness.merge(dtypes, on="column", how="left")

display(missingness_with_types)
column missing_rate dtype
0 timestamp 0.0365 float64
1 date 0.0365 float64
2 time 0.0365 str
3 event_time 0.0365 datetime64[ns]
4 event_time_from_text 0.0365 datetime64[us]
5 event_timestamp 0.0365 datetime64[ns]
6 event_date 0.0365 datetime64[us]
7 video_duration 0.0000 int64
8 user_id 0.0000 int64
9 video_id 0.0000 int64
10 play_duration 0.0000 int64
11 watch_ratio 0.0000 float64
12 play_duration_sec 0.0000 float64
13 video_duration_sec 0.0000 float64
14 watch_ratio_capped_5 0.0000 float64

This check tells us whether any key variables would block later causal modeling. Low missingness in identifiers, timing fields, and engagement metrics means the sample can be safely aggregated into a daily panel without silently losing many rows.
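
One implication is worth quantifying: interaction rows whose date and timestamp failed to parse have a missing event_date and would drop out of any groupby on user and day. The small check below is an extra diagnostic, not part of the original flow; it only counts those rows.

# Rows without a parsed event_date cannot join the user-day panel.
missing_event_date = int(interactions["event_date"].isna().sum())
print(
    f"Interaction rows with missing event_date: {missing_event_date:,} "
    f"({missing_event_date / len(interactions):.1%} of the sample)"
)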

Engagement Distribution Summary

Short-term engagement is the raw material for both candidate treatments and future outcomes. The next cell summarizes play duration, video duration, and watch ratio. The percentile view is important because watch-time data is usually skewed: a small number of very long or repeated watches can dominate averages.

engagement_summary = interactions[
    ["play_duration_sec", "video_duration_sec", "watch_ratio", "watch_ratio_capped_5"]
].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T

display(engagement_summary)
count mean std min 1% 5% 25% 50% 75% 95% 99% max
play_duration_sec 300,000.0000 8.7096 12.7322 0.0000 0.5640 1.9380 5.7090 7.6695 9.6900 15.7380 32.5791 1,502.2620
video_duration_sec 300,000.0000 14.4755 20.4464 3.0670 4.8850 5.9000 7.5220 9.5930 11.9340 44.8500 139.8770 315.0720
watch_ratio 300,000.0000 0.9191 1.5511 0.0000 0.0227 0.1017 0.4717 0.7720 1.1220 1.9767 3.7086 333.8360
watch_ratio_capped_5 300,000.0000 0.8802 0.6698 0.0000 0.0227 0.1017 0.4717 0.7720 1.1220 1.9767 3.7086 5.0000

The percentile table shows the skew typical of watch-time data. Watch ratios and play durations have long tails, so later plots and models should be careful with extreme values rather than relying only on raw means.

Visualize Short-Term Engagement

The plots below show the distribution of watch ratio and play duration. The watch-ratio plot is capped at 5 for readability, because extreme rewatch values can make the main distribution hard to see. This is only a plotting cap; the raw watch_ratio remains available.

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

sns.histplot(
    data=interactions,
    x="watch_ratio_capped_5",
    bins=60,
    ax=axes[0],
    color="#2A6F97",
)
axes[0].axvline(0.8, color="black", linestyle="--", linewidth=1, label="0.8 threshold")
axes[0].axvline(1.0, color="darkred", linestyle="--", linewidth=1, label="1.0 threshold")
axes[0].set_title("Watch Ratio Distribution, Capped at 5")
axes[0].set_xlabel("Watch ratio")
axes[0].legend()

sns.histplot(
    data=interactions,
    x="play_duration_sec",
    bins=60,
    ax=axes[1],
    color="#5C946E",
)
axes[1].set_title("Play Duration Distribution")
axes[1].set_xlabel("Play duration, seconds")
axes[1].set_yscale("log")

plt.tight_layout()
plt.show()

The plots make the engagement skew easier to see. The watch-ratio thresholds around 0.8 and 1.0 are plausible anchors for high-watch and over-completion behavior, which motivates the candidate treatment definitions created later.
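
As a quick follow-up diagnostic (not in the original cell), the snippet below reports how common each threshold behavior is in the interaction sample, which previews the base rates behind the candidate treatment definitions created later.

# Share of interactions at or above each candidate watch-ratio anchor.
threshold_shares = pd.Series(
    {
        "watch_ratio >= 0.8": (interactions["watch_ratio"] >= 0.8).mean(),
        "watch_ratio >= 1.0": (interactions["watch_ratio"] >= 1.0).mean(),
    },
    name="share_of_interactions",
)
print(threshold_shares.round(3))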

User and Item Repeated Measures

Sequential causal inference needs repeated observations for the same users. If every user appeared once, we could not define histories, lagged confounders, or future retention. The next cell summarizes how many interactions and active days each sampled user contributes, and how concentrated video consumption is across items.

user_activity = (
    interactions.groupby("user_id")
    .agg(
        interactions=("video_id", "size"),
        unique_videos=("video_id", "nunique"),
        active_days=("event_date", "nunique"),
        first_date=("event_date", "min"),
        last_date=("event_date", "max"),
        avg_watch_ratio=("watch_ratio", "mean"),
        total_play_duration_sec=("play_duration_sec", "sum"),
    )
    .reset_index()
)

item_activity = (
    interactions.groupby("video_id")
    .agg(
        impressions=("user_id", "size"),
        unique_users=("user_id", "nunique"),
        avg_watch_ratio=("watch_ratio", "mean"),
        avg_play_duration_sec=("play_duration_sec", "mean"),
    )
    .reset_index()
    .sort_values("impressions", ascending=False)
)

print("User activity summary")
display(user_activity[["interactions", "unique_videos", "active_days", "avg_watch_ratio", "total_play_duration_sec"]].describe().T)

print("Most observed videos in the sample")
display(item_activity.head(10))
User activity summary
count mean std min 25% 50% 75% max
interactions 91.0000 3,296.7033 166.3290 1,729.0000 3,309.0000 3,315.0000 3,320.0000 3,326.0000
unique_videos 91.0000 3,296.7033 166.3290 1,729.0000 3,309.0000 3,315.0000 3,320.0000 3,326.0000
active_days 91.0000 60.8462 3.9831 31.0000 61.0000 62.0000 63.0000 63.0000
avg_watch_ratio 91.0000 0.9190 0.1953 0.6512 0.7941 0.8832 0.9841 2.0275
total_play_duration_sec 91.0000 28,712.8942 6,276.4357 13,711.2750 25,107.3275 28,289.4630 31,055.6880 63,349.6840
Most observed videos in the sample
video_id impressions unique_users avg_watch_ratio avg_play_duration_sec
31 186 91 91 1.1330 7.7613
30 183 91 91 1.6252 9.9136
0 103 91 91 0.7553 7.7798
1 109 91 91 1.0973 8.5228
2 120 91 91 1.9835 12.4414
3 122 91 91 0.8607 8.3776
4 128 91 91 0.7762 8.0989
5 130 91 91 0.8416 8.4340
29 180 91 91 0.8184 9.0079
28 179 91 91 1.2162 7.3785

The user and item summaries show how much repeated information exists per user and how concentrated consumption is across videos. Repeated user histories are the key ingredient for long-term causal effects, while item concentration reminds us that content popularity may be a confounder.

Plot Repeated User Activity

The next plots show whether users have enough repeated activity for a user-day panel. The left plot checks interaction volume per user. The right plot checks the number of active calendar days per user. More repeated days means more usable history for later marginal structural modeling.

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

sns.histplot(data=user_activity, x="interactions", bins=40, ax=axes[0], color="#2A6F97")
axes[0].set_title("Interactions per User")
axes[0].set_xlabel("Interaction rows")

sns.histplot(data=user_activity, x="active_days", bins=30, ax=axes[1], color="#C07F00")
axes[1].set_title("Active Days per User")
axes[1].set_xlabel("Active calendar days")

plt.tight_layout()
plt.show()

The activity plots help verify that users contribute multiple interaction rows and active days. That supports the next move: aggregating events into user-day records so that each row can represent one time step in a longitudinal causal panel.

Build the Observed User-Day Table

The raw data is interaction-level, but long-term causal effects are easier to reason about at a user-day level. This cell aggregates each user’s activity on each observed day. It creates daily metrics such as total interactions, unique videos, total play duration, average watch ratio, and the share of watched videos that crossed a high-watch threshold.

The high-watch threshold is set at watch_ratio >= 0.8 as a first-pass operational definition. This does not mean it is the final treatment. It is a transparent candidate that later notebooks can refine.

HIGH_WATCH_THRESHOLD = 0.8
COMPLETE_OR_REWATCH_THRESHOLD = 1.0

interactions["high_watch"] = (interactions["watch_ratio"] >= HIGH_WATCH_THRESHOLD).astype(int)
interactions["complete_or_rewatch"] = (interactions["watch_ratio"] >= COMPLETE_OR_REWATCH_THRESHOLD).astype(int)

user_day_observed = (
    interactions.groupby(["user_id", "event_date"])
    .agg(
        interactions=("video_id", "size"),
        unique_videos=("video_id", "nunique"),
        total_play_duration_ms=("play_duration", "sum"),
        avg_play_duration_ms=("play_duration", "mean"),
        avg_video_duration_ms=("video_duration", "mean"),
        avg_watch_ratio=("watch_ratio", "mean"),
        high_watch_count=("high_watch", "sum"),
        complete_or_rewatch_count=("complete_or_rewatch", "sum"),
    )
    .reset_index()
)

user_day_observed["total_play_duration_sec"] = user_day_observed["total_play_duration_ms"] / 1_000
user_day_observed["avg_play_duration_sec"] = user_day_observed["avg_play_duration_ms"] / 1_000
user_day_observed["avg_video_duration_sec"] = user_day_observed["avg_video_duration_ms"] / 1_000
user_day_observed["high_watch_share"] = user_day_observed["high_watch_count"] / user_day_observed["interactions"]
user_day_observed["complete_or_rewatch_share"] = user_day_observed["complete_or_rewatch_count"] / user_day_observed["interactions"]

user_day_observed = user_day_observed.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Observed user-day rows: {len(user_day_observed):,}")
display(user_day_observed.head(10))
Observed user-day rows: 5,537
user_id event_date interactions unique_videos total_play_duration_ms avg_play_duration_ms avg_video_duration_ms avg_watch_ratio high_watch_count complete_or_rewatch_count total_play_duration_sec avg_play_duration_sec avg_video_duration_sec high_watch_share complete_or_rewatch_share
0 14 2020-07-05 26 26 240975 9,268.2692 10,187.6538 1.0845 15 12 240.9750 9.2683 10.1877 0.5769 0.4615
1 14 2020-07-06 23 23 248344 10,797.5652 14,615.2174 1.0640 12 10 248.3440 10.7976 14.6152 0.5217 0.4348
2 14 2020-07-07 78 78 655489 8,403.7051 13,529.6410 0.8415 36 27 655.4890 8.4037 13.5296 0.4615 0.3462
3 14 2020-07-08 22 22 201901 9,177.3182 12,657.1818 0.9828 11 9 201.9010 9.1773 12.6572 0.5000 0.4091
4 14 2020-07-09 55 55 485039 8,818.8909 12,841.6727 0.8619 20 15 485.0390 8.8189 12.8417 0.3636 0.2727
5 14 2020-07-10 52 52 606244 11,658.5385 17,735.4038 1.1380 30 21 606.2440 11.6585 17.7354 0.5769 0.4038
6 14 2020-07-11 32 32 284747 8,898.3438 13,858.9688 0.9337 16 10 284.7470 8.8983 13.8590 0.5000 0.3125
7 14 2020-07-12 42 42 337918 8,045.6667 13,215.6429 0.7986 18 11 337.9180 8.0457 13.2156 0.4286 0.2619
8 14 2020-07-13 46 46 502145 10,916.1957 13,244.0000 1.0389 24 19 502.1450 10.9162 13.2440 0.5217 0.4130
9 14 2020-07-14 42 42 337489 8,035.4524 14,600.6190 0.9892 23 16 337.4890 8.0355 14.6006 0.5476 0.3810

The observed user-day table compresses raw events into daily engagement states. This is the first major causal-design transformation: the analysis unit shifts from an individual watch event to a user observed on a calendar day.

Densify the User-Day Panel

A long-term outcome like next-day retention requires knowing when a user was inactive. The observed table only contains days with at least one interaction, so it cannot directly distinguish missing days from inactive days. The next cell creates a dense user-date grid across the sample window and fills missing activity with zeros.

This dense panel lets us define outcomes such as next_day_active and future_7day_active_days. Those outcomes are central to the long-term causal question because they move beyond immediate watch behavior.

all_users = np.sort(user_day_observed["user_id"].unique())
all_dates = pd.date_range(
    user_day_observed["event_date"].min(),
    user_day_observed["event_date"].max(),
    freq="D",
)

dense_index = pd.MultiIndex.from_product(
    [all_users, all_dates],
    names=["user_id", "event_date"],
)

user_day = (
    user_day_observed.set_index(["user_id", "event_date"])
    .reindex(dense_index)
    .reset_index()
)

count_and_sum_columns = [
    "interactions",
    "unique_videos",
    "total_play_duration_ms",
    "high_watch_count",
    "complete_or_rewatch_count",
    "total_play_duration_sec",
]
rate_and_average_columns = [
    "avg_play_duration_ms",
    "avg_video_duration_ms",
    "avg_watch_ratio",
    "avg_play_duration_sec",
    "avg_video_duration_sec",
    "high_watch_share",
    "complete_or_rewatch_share",
]

user_day[count_and_sum_columns] = user_day[count_and_sum_columns].fillna(0)
user_day[rate_and_average_columns] = user_day[rate_and_average_columns].fillna(0)
user_day["active_day"] = (user_day["interactions"] > 0).astype(int)

user_day = user_day.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Dense user-day panel shape: {user_day.shape}")
print(f"Users in panel: {user_day['user_id'].nunique():,}")
print(f"Dates per user: {user_day['event_date'].nunique():,}")
display(user_day.head(10))
Dense user-day panel shape: (5733, 16)
Users in panel: 91
Dates per user: 63
user_id event_date interactions unique_videos total_play_duration_ms avg_play_duration_ms avg_video_duration_ms avg_watch_ratio high_watch_count complete_or_rewatch_count total_play_duration_sec avg_play_duration_sec avg_video_duration_sec high_watch_share complete_or_rewatch_share active_day
0 14 2020-07-05 26.0000 26.0000 240,975.0000 9,268.2692 10,187.6538 1.0845 15.0000 12.0000 240.9750 9.2683 10.1877 0.5769 0.4615 1
1 14 2020-07-06 23.0000 23.0000 248,344.0000 10,797.5652 14,615.2174 1.0640 12.0000 10.0000 248.3440 10.7976 14.6152 0.5217 0.4348 1
2 14 2020-07-07 78.0000 78.0000 655,489.0000 8,403.7051 13,529.6410 0.8415 36.0000 27.0000 655.4890 8.4037 13.5296 0.4615 0.3462 1
3 14 2020-07-08 22.0000 22.0000 201,901.0000 9,177.3182 12,657.1818 0.9828 11.0000 9.0000 201.9010 9.1773 12.6572 0.5000 0.4091 1
4 14 2020-07-09 55.0000 55.0000 485,039.0000 8,818.8909 12,841.6727 0.8619 20.0000 15.0000 485.0390 8.8189 12.8417 0.3636 0.2727 1
5 14 2020-07-10 52.0000 52.0000 606,244.0000 11,658.5385 17,735.4038 1.1380 30.0000 21.0000 606.2440 11.6585 17.7354 0.5769 0.4038 1
6 14 2020-07-11 32.0000 32.0000 284,747.0000 8,898.3438 13,858.9688 0.9337 16.0000 10.0000 284.7470 8.8983 13.8590 0.5000 0.3125 1
7 14 2020-07-12 42.0000 42.0000 337,918.0000 8,045.6667 13,215.6429 0.7986 18.0000 11.0000 337.9180 8.0457 13.2156 0.4286 0.2619 1
8 14 2020-07-13 46.0000 46.0000 502,145.0000 10,916.1957 13,244.0000 1.0389 24.0000 19.0000 502.1450 10.9162 13.2440 0.5217 0.4130 1
9 14 2020-07-14 42.0000 42.0000 337,489.0000 8,035.4524 14,600.6190 0.9892 23.0000 16.0000 337.4890 8.0355 14.6006 0.5476 0.3810 1

The dense panel explicitly includes inactive dates, which is necessary for future retention-style outcomes. Without this step, missing user-days would be invisible and the notebook could mistake absence from the log for absence from the analysis.
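
A small sanity check, added here as an extra diagnostic, confirms that densification added exactly the inactive user-days and nothing else.

# The rows added by reindexing should all be zero-activity days.
added_rows = len(user_day) - len(user_day_observed)
inactive_rows = int((user_day["active_day"] == 0).sum())
print(f"Rows added by densification: {added_rows:,}")
print(f"Inactive user-days in the dense panel: {inactive_rows:,}")
assert added_rows == inactive_rows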

Add Lagged State Variables

In sequential causal inference, the user’s past is part of the confounding structure. A user who watched heavily yesterday may be more likely to receive certain recommendations today and more likely to be active tomorrow. The next cell creates lagged and prior-window features that later notebooks can use as time-varying confounders.

These variables are not outcomes. They are the observed state before a daily treatment decision.

lag_columns = [
    "active_day",
    "interactions",
    "total_play_duration_sec",
    "avg_watch_ratio",
    "high_watch_share",
]

for column in lag_columns:
    user_day[f"lag_1_{column}"] = user_day.groupby("user_id")[column].shift(1)
    user_day[f"prior_3day_{column}"] = user_day.groupby("user_id")[column].transform(
        lambda series: series.shift(1).rolling(window=3, min_periods=1).sum()
    )

lagged_columns = [column for column in user_day.columns if column.startswith("lag_1_") or column.startswith("prior_3day_")]
user_day[lagged_columns] = user_day[lagged_columns].fillna(0)

state_preview_columns = [
    "user_id",
    "event_date",
    "active_day",
    "interactions",
    "lag_1_active_day",
    "lag_1_interactions",
    "prior_3day_active_day",
    "prior_3day_interactions",
]

display(user_day[state_preview_columns].head(15))
user_id event_date active_day interactions lag_1_active_day lag_1_interactions prior_3day_active_day prior_3day_interactions
0 14 2020-07-05 1 26.0000 0.0000 0.0000 0.0000 0.0000
1 14 2020-07-06 1 23.0000 1.0000 26.0000 1.0000 26.0000
2 14 2020-07-07 1 78.0000 1.0000 23.0000 2.0000 49.0000
3 14 2020-07-08 1 22.0000 1.0000 78.0000 3.0000 127.0000
4 14 2020-07-09 1 55.0000 1.0000 22.0000 3.0000 123.0000
5 14 2020-07-10 1 52.0000 1.0000 55.0000 3.0000 155.0000
6 14 2020-07-11 1 32.0000 1.0000 52.0000 3.0000 129.0000
7 14 2020-07-12 1 42.0000 1.0000 32.0000 3.0000 139.0000
8 14 2020-07-13 1 46.0000 1.0000 42.0000 3.0000 126.0000
9 14 2020-07-14 1 42.0000 1.0000 46.0000 3.0000 120.0000
10 14 2020-07-15 1 10.0000 1.0000 42.0000 3.0000 130.0000
11 14 2020-07-16 1 87.0000 1.0000 10.0000 3.0000 98.0000
12 14 2020-07-17 1 93.0000 1.0000 87.0000 3.0000 139.0000
13 14 2020-07-18 1 117.0000 1.0000 93.0000 3.0000 190.0000
14 14 2020-07-19 1 42.0000 1.0000 117.0000 3.0000 297.0000

The lagged and prior-window columns capture the user’s pre-treatment state. These variables are central to later causal adjustment because prior engagement can influence both today’s recommendation exposure and tomorrow’s behavior.

Add Forward-Looking Outcomes

The next cell creates future outcome variables. These include next-day activity and multi-day future engagement. We deliberately keep these outcomes separate from treatment definitions so later notebooks can estimate how candidate treatment patterns affect future behavior.

Rows near the end of the sample window do not have enough future days to define all outcomes. Those rows are marked as missing for the affected future horizons rather than silently treating unknown future activity as zero.

future_base_columns = ["active_day", "interactions", "total_play_duration_sec"]
future_horizons = [1, 2, 3, 4, 5, 6, 7]

for column in future_base_columns:
    for horizon in future_horizons:
        user_day[f"lead_{horizon}_{column}"] = user_day.groupby("user_id")[column].shift(-horizon)

user_day["next_day_active"] = user_day["lead_1_active_day"]
user_day["next_day_interactions"] = user_day["lead_1_interactions"]
user_day["next_day_play_duration_sec"] = user_day["lead_1_total_play_duration_sec"]

user_day["future_3day_active_days"] = user_day[[f"lead_{horizon}_active_day" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)
user_day["future_3day_interactions"] = user_day[[f"lead_{horizon}_interactions" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)
user_day["future_3day_play_duration_sec"] = user_day[[f"lead_{horizon}_total_play_duration_sec" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)

user_day["future_7day_active_days"] = user_day[[f"lead_{horizon}_active_day" for horizon in future_horizons]].sum(axis=1, min_count=7)
user_day["future_7day_interactions"] = user_day[[f"lead_{horizon}_interactions" for horizon in future_horizons]].sum(axis=1, min_count=7)
user_day["future_7day_play_duration_sec"] = user_day[[f"lead_{horizon}_total_play_duration_sec" for horizon in future_horizons]].sum(axis=1, min_count=7)

outcome_preview_columns = [
    "user_id",
    "event_date",
    "active_day",
    "interactions",
    "next_day_active",
    "future_3day_active_days",
    "future_3day_interactions",
    "future_7day_active_days",
]

display(user_day[outcome_preview_columns].head(15))
user_id event_date active_day interactions next_day_active future_3day_active_days future_3day_interactions future_7day_active_days
0 14 2020-07-05 1 26.0000 1.0000 3.0000 123.0000 7.0000
1 14 2020-07-06 1 23.0000 1.0000 3.0000 155.0000 7.0000
2 14 2020-07-07 1 78.0000 1.0000 3.0000 129.0000 7.0000
3 14 2020-07-08 1 22.0000 1.0000 3.0000 139.0000 7.0000
4 14 2020-07-09 1 55.0000 1.0000 3.0000 126.0000 7.0000
5 14 2020-07-10 1 52.0000 1.0000 3.0000 120.0000 7.0000
6 14 2020-07-11 1 32.0000 1.0000 3.0000 130.0000 7.0000
7 14 2020-07-12 1 42.0000 1.0000 3.0000 98.0000 7.0000
8 14 2020-07-13 1 46.0000 1.0000 3.0000 139.0000 7.0000
9 14 2020-07-14 1 42.0000 1.0000 3.0000 190.0000 7.0000
10 14 2020-07-15 1 10.0000 1.0000 3.0000 297.0000 7.0000
11 14 2020-07-16 1 87.0000 1.0000 3.0000 252.0000 7.0000
12 14 2020-07-17 1 93.0000 1.0000 3.0000 264.0000 7.0000
13 14 2020-07-18 1 117.0000 1.0000 3.0000 227.0000 7.0000
14 14 2020-07-19 1 42.0000 1.0000 3.0000 202.0000 7.0000

The lead columns and future-window outcomes give the project its long-term target. Current-day behavior can now be related to next-day, 3-day, and 7-day future engagement without using future information as a confounder.
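
A quick check, added as an extra diagnostic, verifies that the min_count guards leave end-of-window rows missing rather than silently zero-filled.

# The last days of the window cannot have complete 1-day, 3-day, or 7-day futures.
last_week = user_day["event_date"] >= user_day["event_date"].max() - pd.Timedelta(days=6)
future_cols = ["next_day_active", "future_3day_active_days", "future_7day_active_days"]
print(user_day.loc[last_week, future_cols].isna().mean().round(3))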

Candidate Treatment Definitions

For this project, a treatment is not a medical intervention or a one-time binary assignment. It is a daily exposure pattern generated by a recommender system. At this stage, we define two transparent candidate treatments:

  • treatment_high_intensity: the user had an active day with interaction count in the upper quartile of active days.
  • treatment_high_watch_exposure: the user had an active day where the share of high-watch videos was above the median active-day share.

These are deliberately simple first definitions. They give us a concrete way to inspect treatment variation and confounding before more sophisticated modeling.

active_user_days = user_day["active_day"] == 1

high_intensity_threshold = user_day.loc[active_user_days, "interactions"].quantile(0.75)
high_watch_share_threshold = user_day.loc[active_user_days, "high_watch_share"].median()

user_day["treatment_high_intensity"] = (
    active_user_days & (user_day["interactions"] >= high_intensity_threshold)
).astype(int)
user_day["treatment_high_watch_exposure"] = (
    active_user_days & (user_day["high_watch_share"] >= high_watch_share_threshold)
).astype(int)

thresholds = pd.DataFrame(
    [
        {
            "candidate_treatment": "treatment_high_intensity",
            "rule": "active day and interactions >= active-day 75th percentile",
            "threshold": high_intensity_threshold,
        },
        {
            "candidate_treatment": "treatment_high_watch_exposure",
            "rule": "active day and high_watch_share >= active-day median",
            "threshold": high_watch_share_threshold,
        },
    ]
)

treatment_summary = user_day[
    ["treatment_high_intensity", "treatment_high_watch_exposure"]
].agg(["mean", "sum"]).T.rename(columns={"mean": "share_of_user_days", "sum": "treated_user_days"})

display(thresholds)
display(treatment_summary)
candidate_treatment rule threshold
0 treatment_high_intensity active day and interactions >= active-day 75th... 70.0000
1 treatment_high_watch_exposure active day and high_watch_share >= active-day ... 0.4773
share_of_user_days treated_user_days
treatment_high_intensity 0.2501 1,434.0000
treatment_high_watch_exposure 0.4842 2,776.0000

The treatment summary shows whether the first two exposure definitions have usable variation. These definitions are intentionally simple at this stage: they let us test whether the data can support a treated-versus-control comparison before committing to a final estimand.

Plot Daily Treatment and Outcome Rates

This plot checks whether the candidate treatments and next-day activity vary over time. A useful causal study needs variation. If a treatment is always on, always off, or perfectly aligned with a single date, it is hard to estimate a meaningful effect.

daily_panel_summary = (
    user_day.groupby("event_date")
    .agg(
        active_rate=("active_day", "mean"),
        high_intensity_rate=("treatment_high_intensity", "mean"),
        high_watch_exposure_rate=("treatment_high_watch_exposure", "mean"),
        next_day_active_rate=("next_day_active", "mean"),
    )
    .reset_index()
)

plot_daily = daily_panel_summary.melt(
    id_vars="event_date",
    value_vars=["active_rate", "high_intensity_rate", "high_watch_exposure_rate", "next_day_active_rate"],
    var_name="metric",
    value_name="rate",
)

fig, ax = plt.subplots(figsize=(13, 5))
sns.lineplot(data=plot_daily, x="event_date", y="rate", hue="metric", marker="o", linewidth=1.5, ax=ax)
ax.set_title("Daily Treatment and Outcome Rates in the Dense User-Day Panel")
ax.set_xlabel("Date")
ax.set_ylabel("Rate")
ax.yaxis.set_major_formatter(lambda value, _: f"{value:.0%}")
ax.tick_params(axis="x", rotation=35)
plt.tight_layout()
plt.show()

The time-series plot checks whether treatment and outcome rates vary over the calendar window. If treatment prevalence changed sharply by date, later models would need calendar controls; this plot starts that diagnostic conversation.

Naive Associations Are Not Causal Effects

The next table compares future outcomes between treated and untreated active user-days. This is useful as a descriptive diagnostic, but it is not a causal estimate. Treated days are likely different before treatment happens. For example, users with high activity today may already be more engaged, and that prior engagement may predict future activity regardless of today’s exposure.

The purpose of this table is to create intuition and motivate the confounding checks that follow.

def summarize_naive_association(data, treatment_col, outcome_cols):
    rows = []
    analytic = data.loc[data["active_day"].eq(1)].copy()
    for outcome_col in outcome_cols:
        subset = analytic.dropna(subset=[treatment_col, outcome_col])
        grouped = subset.groupby(treatment_col)[outcome_col].agg(["mean", "count", "std"])
        if set(grouped.index) >= {0, 1}:
            control_mean = grouped.loc[0, "mean"]
            treated_mean = grouped.loc[1, "mean"]
            rows.append(
                {
                    "treatment": treatment_col,
                    "outcome": outcome_col,
                    "control_mean": control_mean,
                    "treated_mean": treated_mean,
                    "difference": treated_mean - control_mean,
                    "relative_lift": (treated_mean / control_mean - 1) if control_mean != 0 else np.nan,
                    "control_days": grouped.loc[0, "count"],
                    "treated_days": grouped.loc[1, "count"],
                }
            )
    return pd.DataFrame(rows)

outcome_columns = [
    "next_day_active",
    "future_3day_active_days",
    "future_3day_interactions",
    "future_7day_active_days",
]

naive_associations = pd.concat(
    [
        summarize_naive_association(user_day, "treatment_high_intensity", outcome_columns),
        summarize_naive_association(user_day, "treatment_high_watch_exposure", outcome_columns),
    ],
    ignore_index=True,
)

display(naive_associations)
treatment outcome control_mean treated_mean difference relative_lift control_days treated_days
0 treatment_high_intensity next_day_active 0.9750 0.9868 0.0117 0.0120 4048 1434
1 treatment_high_intensity future_3day_active_days 2.9371 2.9477 0.0106 0.0036 3880 1434
2 treatment_high_intensity future_3day_interactions 138.8616 207.0342 68.1726 0.4909 3880 1434
3 treatment_high_intensity future_7day_active_days 6.8580 6.8562 -0.0017 -0.0003 3528 1433
4 treatment_high_watch_exposure next_day_active 0.9835 0.9728 -0.0107 -0.0109 2728 2754
5 treatment_high_watch_exposure future_3day_active_days 2.9479 2.9321 -0.0158 -0.0054 2649 2665
6 treatment_high_watch_exposure future_3day_interactions 157.7493 156.7700 -0.9794 -0.0062 2649 2665
7 treatment_high_watch_exposure future_7day_active_days 6.8719 6.8430 -0.0289 -0.0042 2483 2478

The naive comparisons describe how future outcomes differ between treated and untreated days, but they should not be read as causal effects. Their real value is motivational: if there are visible differences, the next question is whether those differences remain after accounting for prior user state.

Check Time-Varying Confounding

A major reason this project needs causal methods is that prior user state can affect both today’s treatment and future outcomes. The next cell computes standardized mean differences for pre-treatment covariates between treated and untreated active days. Large imbalances mean that a naive treated-versus-control comparison is likely confounded.

A standardized mean difference is the treated-control difference in means divided by a pooled standard deviation. Values far from zero indicate imbalance in the pre-treatment state.
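
Written out, the quantity computed below for each covariate $x$ is

$$\mathrm{SMD}(x) = \frac{\bar{x}_{\text{treated}} - \bar{x}_{\text{control}}}{\sqrt{\left(s^{2}_{\text{treated}} + s^{2}_{\text{control}}\right)/2}}$$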

def standardized_mean_differences(data, treatment_col, covariate_cols):
    analytic = data.loc[data["active_day"].eq(1)].dropna(subset=[treatment_col]).copy()
    rows = []
    for covariate in covariate_cols:
        treated = analytic.loc[analytic[treatment_col].eq(1), covariate].dropna()
        control = analytic.loc[analytic[treatment_col].eq(0), covariate].dropna()
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        rows.append(
            {
                "treatment": treatment_col,
                "covariate": covariate,
                "treated_mean": treated.mean(),
                "control_mean": control.mean(),
                "smd": (treated.mean() - control.mean()) / pooled_sd if pooled_sd and not np.isnan(pooled_sd) else np.nan,
            }
        )
    return pd.DataFrame(rows)

pre_treatment_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_high_watch_share",
]

confounding_balance = pd.concat(
    [
        standardized_mean_differences(user_day, "treatment_high_intensity", pre_treatment_covariates),
        standardized_mean_differences(user_day, "treatment_high_watch_exposure", pre_treatment_covariates),
    ],
    ignore_index=True,
)

display(confounding_balance.sort_values(["treatment", "smd"], key=lambda col: col.abs() if col.name == "smd" else col))
treatment covariate treated_mean control_mean smd
3 treatment_high_intensity lag_1_avg_watch_ratio 0.9103 0.9208 -0.0193
0 treatment_high_intensity lag_1_active_day 0.9770 0.9654 0.0694
7 treatment_high_intensity prior_3day_high_watch_share 1.4207 1.3741 0.1029
4 treatment_high_intensity prior_3day_active_day 2.9303 2.8279 0.2198
2 treatment_high_intensity lag_1_total_play_duration_sec 617.3946 388.2157 0.7200
6 treatment_high_intensity prior_3day_total_play_duration_sec 1,794.1988 1,177.3717 0.8370
1 treatment_high_intensity lag_1_interactions 71.7364 44.2764 0.8406
5 treatment_high_intensity prior_3day_interactions 207.7497 134.4448 1.0026
8 treatment_high_watch_exposure lag_1_active_day 0.9672 0.9696 -0.0135
12 treatment_high_watch_exposure prior_3day_active_day 2.8483 2.8606 -0.0232
9 treatment_high_watch_exposure lag_1_interactions 50.8336 51.9457 -0.0332
13 treatment_high_watch_exposure prior_3day_interactions 151.6120 155.2572 -0.0452
10 treatment_high_watch_exposure lag_1_total_play_duration_sec 479.3493 415.6172 0.2007
14 treatment_high_watch_exposure prior_3day_total_play_duration_sec 1,424.6386 1,249.1272 0.2284
11 treatment_high_watch_exposure lag_1_avg_watch_ratio 1.0097 0.8259 0.3077
15 treatment_high_watch_exposure prior_3day_high_watch_share 1.5894 1.1817 0.9546

The standardized mean differences show whether treated and untreated days already looked different before treatment. Any meaningful imbalance here supports the central claim of the project: sequential recommender logs need causal adjustment, not just outcome comparison.
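One practical way to read the balance table is a rule-of-thumb screen: covariates with an absolute standardized mean difference above roughly 0.1 are conventionally treated as imbalanced. The short sketch below applies that screen to the table computed above; the 0.1 cutoff is a common convention, not a strict requirement.

# Flag covariates whose absolute SMD exceeds a conventional 0.1 threshold.
imbalance_flags = confounding_balance.assign(abs_smd=confounding_balance["smd"].abs())
imbalanced = imbalance_flags.loc[imbalance_flags["abs_smd"] > 0.1]

display(
    imbalanced.sort_values(["treatment", "abs_smd"], ascending=[True, False])[
        ["treatment", "covariate", "smd"]
    ]
)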

Visualize Covariate Imbalance

The heatmap below gives a quick visual read of confounding. Darker positive or negative values mean treated and untreated active days had different histories before the treatment day. This is exactly the kind of structure that motivates marginal structural models and g-computation in later notebooks.

balance_heatmap = confounding_balance.pivot(index="covariate", columns="treatment", values="smd")

fig, ax = plt.subplots(figsize=(9, 5.5))
sns.heatmap(
    balance_heatmap,
    annot=True,
    fmt=".2f",
    cmap="vlag",
    center=0,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("Standardized Mean Differences in Pre-Treatment State")
ax.set_xlabel("Candidate treatment")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
plt.show()

The heatmap turns the balance table into a quick diagnostic. Strong colors indicate pre-treatment differences that later notebooks must address with propensity models, marginal structural models, or g-computation.
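To make the connection to later notebooks concrete, the sketch below fits a simple logistic propensity model for one candidate treatment on the pre-treatment covariates and converts the predicted probabilities into stabilized inverse probability weights. This is only a single-time-point preview under simplified assumptions (scikit-learn available, no sequential structure); the marginal structural model notebooks will handle the sequential version properly.

# Sketch of a single-time-point propensity model and stabilized IP weights.
# Assumes scikit-learn is available; later notebooks replace this with a sequential model.
from sklearn.linear_model import LogisticRegression

propensity_sample = (
    user_day.loc[user_day["active_day"].eq(1)]
    .dropna(subset=["treatment_high_intensity"] + pre_treatment_covariates)
    .copy()
)

X = propensity_sample[pre_treatment_covariates]
a = propensity_sample["treatment_high_intensity"].astype(int)

propensity_model = LogisticRegression(max_iter=1000)
propensity_model.fit(X, a)
propensity = propensity_model.predict_proba(X)[:, 1]

# Stabilized weights: marginal treatment probability over the individual propensity.
marginal_rate = a.mean()
stabilized_weights = np.where(
    a.eq(1), marginal_rate / propensity, (1 - marginal_rate) / (1 - propensity)
)

print("mean stabilized weight:", round(stabilized_weights.mean(), 3))
print("max stabilized weight: ", round(stabilized_weights.max(), 3))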

Join Metadata for Diagnostic Checks

User metadata can help explain why treatment assignment differs across users. The next cell joins the sampled interaction log to the user activity-segment and item-category features. This is not the final feature-engineering step; it is a sanity check that the metadata keys connect cleanly to the interaction log.

user_metadata_columns = [
    "user_id",
    "user_active_degree",
    "is_lowactive_period",
    "is_live_streamer",
    "is_video_author",
    "follow_user_num",
    "fans_user_num",
    "friend_user_num",
    "register_days",
]

interaction_metadata_sample = (
    interactions.merge(user_features[user_metadata_columns], on="user_id", how="left")
    .merge(item_categories[["video_id", "category_count", "first_category"]], on="video_id", how="left")
)

metadata_diagnostics = pd.DataFrame(
    [
        {"metric": "rows", "value": len(interaction_metadata_sample)},
        {"metric": "missing_user_active_degree_rate", "value": interaction_metadata_sample["user_active_degree"].isna().mean()},
        {"metric": "missing_item_category_rate", "value": interaction_metadata_sample["category_count"].isna().mean()},
        {"metric": "unique_user_activity_segments", "value": interaction_metadata_sample["user_active_degree"].nunique(dropna=True)},
        {"metric": "unique_first_categories", "value": interaction_metadata_sample["first_category"].nunique(dropna=True)},
    ]
)

display(metadata_diagnostics)

display(
    interaction_metadata_sample.groupby("user_active_degree")
    .agg(
        rows=("video_id", "size"),
        users=("user_id", "nunique"),
        avg_watch_ratio=("watch_ratio", "mean"),
        avg_play_duration_sec=("play_duration_sec", "mean"),
    )
    .sort_values("rows", ascending=False)
)
metric value
0 rows 300,000.0000
1 missing_user_active_degree_rate 0.0000
2 missing_item_category_rate 0.0000
3 unique_user_activity_segments 3.0000
4 unique_first_categories 30.0000
rows users avg_watch_ratio avg_play_duration_sec
user_active_degree
full_active 230406 70 0.9199 8.7178
high_active 62965 19 0.9250 8.7592
UNKNOWN 6629 2 0.8329 7.9533

The join diagnostics confirm that user metadata and item categories connect to the sampled interaction log. This means later models can enrich the history-only panel with baseline user context or item context if the causal design needs stronger adjustment.
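For example, a later notebook could attach time-invariant user context to the panel with a single merge. The sketch below pulls two baseline fields onto the user-day panel and verifies that no rows are lost; the specific columns chosen here are illustrative.

# Illustrative merge of baseline user context onto the user-day panel.
user_day_with_baseline = user_day.merge(
    user_features[["user_id", "user_active_degree", "register_days"]],
    on="user_id",
    how="left",
)

# The merge should preserve the panel shape and introduce no missing baseline values.
print("rows before / after merge:", len(user_day), "/", len(user_day_with_baseline))
print("missing baseline rate:", user_day_with_baseline["user_active_degree"].isna().mean())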

Inspect Item Daily Popularity Features

Item daily features can be useful later because recommendation exposure is often influenced by item popularity. A popular item may be more likely to be shown, and popular items may also generate different future engagement. The next cell summarizes the sampled item daily table and checks the distribution of simple popularity rates.

item_daily_summary = item_daily[
    [
        "show_cnt",
        "show_user_num",
        "play_cnt",
        "play_user_num",
        "complete_play_cnt",
        "like_cnt",
        "comment_cnt",
        "share_cnt",
        "collect_cnt",
        "item_play_rate",
        "item_like_rate_per_play",
    ]
].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).T

display(item_daily_summary)

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
sns.histplot(data=item_daily, x="show_cnt", bins=60, ax=axes[0], color="#2A6F97")
axes[0].set_title("Daily Item Show Counts")
axes[0].set_xlabel("Show count")
axes[0].set_yscale("log")

sns.histplot(data=item_daily, x="item_play_rate", bins=60, ax=axes[1], color="#5C946E")
axes[1].set_title("Daily Item Play Rate")
axes[1].set_xlabel("Play count / show count")

plt.tight_layout()
plt.show()
count mean std min 5% 25% 50% 75% 95% max
show_cnt 150,000.0000 99,455.6275 565,375.6798 0.0000 3.0000 74.0000 1,762.0000 28,220.0000 415,838.5500 36,053,957.0000
show_user_num 150,000.0000 93,436.8595 542,642.7612 0.0000 3.0000 63.0000 1,474.0000 25,346.0000 387,015.8000 34,875,019.0000
play_cnt 150,000.0000 99,323.0334 584,500.5090 0.0000 0.0000 19.0000 892.0000 24,438.0000 413,801.5000 37,966,319.0000
play_user_num 150,000.0000 89,345.7759 534,001.9580 0.0000 0.0000 16.0000 736.0000 21,404.2500 369,398.3500 34,501,624.0000
complete_play_cnt 150,000.0000 54,362.3796 354,878.3167 0.0000 0.0000 5.0000 306.0000 10,966.2500 216,402.5500 23,892,867.0000
like_cnt 150,000.0000 2,972.5705 18,322.5665 0.0000 0.0000 0.0000 13.0000 521.2500 12,800.1000 1,286,185.0000
comment_cnt 150,000.0000 130.1410 1,613.5446 0.0000 0.0000 0.0000 0.0000 11.0000 363.0000 182,959.0000
share_cnt 150,000.0000 72.2995 1,050.6124 0.0000 0.0000 0.0000 0.0000 4.0000 157.0000 172,181.0000
collect_cnt 116,617.0000 14.5537 187.7763 0.0000 0.0000 0.0000 0.0000 1.0000 39.0000 41,399.0000
item_play_rate 149,659.0000 0.6062 0.4483 0.0000 0.0000 0.2485 0.6210 0.9876 1.0405 38.0000
item_like_rate_per_play 141,044.0000 0.0278 0.0529 0.0000 0.0000 0.0000 0.0138 0.0329 0.0997 2.0000

The item-daily summaries show that item popularity is highly uneven. That unevenness matters because popular videos may be more likely to be recommended and may also produce different engagement patterns, making popularity a potential confounder or stratification variable.
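One simple way to see this in the data already loaded is to bucket item-days into show-count quantiles and compare play rates across buckets; the sketch below does this on the sampled item-daily table. The quintile split is an arbitrary illustration, not a modeling decision.

# Bucket item-days into show-count quintiles and compare play rates across buckets.
popularity_buckets = item_daily.dropna(subset=["show_cnt", "item_play_rate"]).copy()
popularity_buckets["show_cnt_quintile"] = pd.qcut(
    popularity_buckets["show_cnt"], q=5, duplicates="drop"
)

display(
    popularity_buckets.groupby("show_cnt_quintile", observed=True)["item_play_rate"]
    .agg(["mean", "median", "count"])
)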

Causal Readiness Checklist

This checklist summarizes whether the current sample is adequate for the next stage of the project. The checks are intentionally practical: do we have repeated users, a time span, treatment variation, future outcome availability, and pre-treatment state variables? Passing these checks does not prove causal identification, but failing them would tell us to redesign the project before modeling.

active_days_per_user = user_day.groupby("user_id")["active_day"].sum()
future_3day_available = user_day["future_3day_active_days"].notna().mean()
future_7day_available = user_day["future_7day_active_days"].notna().mean()

treatment_rates = user_day[["treatment_high_intensity", "treatment_high_watch_exposure"]].mean()

readiness_checks = pd.DataFrame(
    [
        {
            "check": "at least 50 users in sample",
            "value": user_day["user_id"].nunique(),
            "passes": user_day["user_id"].nunique() >= 50,
        },
        {
            "check": "median user has at least 7 active days",
            "value": active_days_per_user.median(),
            "passes": active_days_per_user.median() >= 7,
        },
        {
            "check": "at least 30 calendar dates",
            "value": user_day["event_date"].nunique(),
            "passes": user_day["event_date"].nunique() >= 30,
        },
        {
            "check": "high-intensity treatment has non-trivial variation",
            "value": treatment_rates["treatment_high_intensity"],
            "passes": 0.01 < treatment_rates["treatment_high_intensity"] < 0.99,
        },
        {
            "check": "high-watch treatment has non-trivial variation",
            "value": treatment_rates["treatment_high_watch_exposure"],
            "passes": 0.01 < treatment_rates["treatment_high_watch_exposure"] < 0.99,
        },
        {
            "check": "future 3-day outcomes available for most rows",
            "value": future_3day_available,
            "passes": future_3day_available >= 0.80,
        },
        {
            "check": "future 7-day outcomes available for most rows",
            "value": future_7day_available,
            "passes": future_7day_available >= 0.70,
        },
        {
            "check": "lagged confounders have no missing values after filling panel starts",
            "value": user_day[pre_treatment_covariates].isna().mean().max(),
            "passes": user_day[pre_treatment_covariates].isna().mean().max() == 0,
        },
    ]
)

display(readiness_checks)
check value passes
0 at least 50 users in sample 91.0000 True
1 median user has at least 7 active days 62.0000 True
2 at least 30 calendar dates 63.0000 True
3 high-intensity treatment has non-trivial varia... 0.2501 True
4 high-watch treatment has non-trivial variation 0.4842 True
5 future 3-day outcomes available for most rows 0.9524 True
6 future 7-day outcomes available for most rows 0.8889 True
7 lagged confounders have no missing values afte... 0.0000 True

The readiness checklist condenses the EDA into go/no-go criteria for causal modeling. Passing these checks does not prove identification, but it shows that the sample has enough users, time, treatment variation, and future outcomes to justify moving to formal estimand definition.

Save Reusable Processed Files

The next cell saves two files for later notebooks:

  • kuairec_small_interactions_sample.parquet: cleaned interaction-level sample.
  • kuairec_user_day_panel_sample.parquet: dense user-day panel with lagged state, candidate treatments, and future outcomes.

Saving these files prevents later notebooks from repeatedly opening the nested zip archive. It also makes the project sequence cleaner: Notebook 01 handles data understanding, while later notebooks handle causal estimation.

interaction_output = PROCESSED_DIR / "kuairec_small_interactions_sample.parquet"
user_day_output = PROCESSED_DIR / "kuairec_user_day_panel_sample.parquet"
readiness_output = PROCESSED_DIR / "kuairec_sequence_eda_readiness.csv"

interactions.to_parquet(interaction_output, index=False)
user_day.to_parquet(user_day_output, index=False)
readiness_checks.to_csv(readiness_output, index=False)

print("Saved processed files:")
print(f"- {interaction_output}")
print(f"- {user_day_output}")
print(f"- {readiness_output}")
Saved processed files:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_small_interactions_sample.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_user_day_panel_sample.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_sequence_eda_readiness.csv

The saved files are the handoff from raw EDA to causal design. Future notebooks can load the cleaned interaction sample and dense user-day panel directly, which keeps the project modular and avoids repeating expensive archive parsing.
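As a minimal usage example, a later notebook only needs the two parquet reads below, assuming it defines the same PROCESSED_DIR.

# Sketch of how a later notebook would reload the processed artifacts.
interactions_reloaded = pd.read_parquet(PROCESSED_DIR / "kuairec_small_interactions_sample.parquet")
user_day_reloaded = pd.read_parquet(PROCESSED_DIR / "kuairec_user_day_panel_sample.parquet")

print(interactions_reloaded.shape, user_day_reloaded.shape)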

Takeaways and Next Step

This notebook turns KuaiRec from a raw interaction log into a sequential causal analysis dataset. The key object is the dense user-day panel. It contains the current day’s activity, prior user state, candidate treatment definitions, and future outcomes.

The descriptive comparisons in this notebook should not be interpreted causally. They are meant to show why causal methods are needed. The confounding checks make the main issue visible: treatment days and non-treatment days differ in their pre-treatment histories. The next notebook should formalize the long-term outcome definitions and decide which treatment-outcome pair will be the primary estimand for the project.