Notebook 01: KuaiRec Sequence EDA for Long-Term Causal Effects

This notebook starts the Long-Term Causal Effects project. The central problem is a common tension in recommender systems: a recommendation policy can increase immediate engagement while still hurting longer-term user value. A short video feed, news feed, or streaming homepage may learn to maximize near-term clicks, watch time, or completion rate. Those metrics are useful, but they do not fully answer a more strategic data science question: does the exposure pattern caused by the recommender improve future user retention, future engagement, and healthier long-run consumption behavior?

This is a causal question because users are not randomly assigned to content exposure patterns in normal logs. A user sees videos because the platform selected them, because the user has a history, because the user has preferences, and because prior engagement changes what the recommender does next. If we compare users who received high-engagement exposures against users who did not, the difference may reflect the recommender’s targeting logic rather than the effect of the exposure itself.

The specific problem handled here is the sequential causal inference setup for KuaiRec: each user is followed over calendar days, daily exposure patterns play the role of treatments, prior engagement acts as a time-varying confounder, and future activity and engagement serve as the long-term outcomes.

This notebook does not estimate the final causal effect yet. Instead, it prepares the causal terrain. We inspect whether KuaiRec has enough repeated user activity, enough date coverage, enough treatment variation, and enough future outcomes to support later notebooks on marginal structural models and g-computation. In other words, this notebook answers: can this dataset support a credible long-term causal effects project, and what should the treatment and outcome definitions look like?

Notebook Roadmap

This notebook follows a causal data-understanding path rather than a generic EDA path. First, we load KuaiRec directly from the nested zip archive without extracting the full dataset. Then we inspect the raw interaction fields, user metadata, item category metadata, and item daily popularity features. After that, we aggregate interactions into a user-day panel, because long-term causal effects require a time-indexed panel rather than isolated impression rows.

The most important output of the notebook is a reusable user-day table. Later notebooks can use this table to define treatment histories, estimate inverse probability weights, fit marginal structural models, and compare g-computation style estimates.

Data Fields and Column Guide

This project uses KuaiRec as a sequential recommendation log. The raw data is spread across interaction, user, item-category, and item-daily-feature tables. The field guide below explains the columns before any analysis code runs, so the rest of the notebook is easier to read.

small_matrix.csv: User-Video Interaction Log

This is the main event-level table. Each row represents a user-video watch event.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| user_id | Unique user identifier. | Defines the panel unit. Long-term causal effects are studied by following each user over time. |
| video_id | Unique video identifier. | Identifies the item consumed or exposed in the recommendation environment. |
| play_duration | Amount of time the user played the video, stored in milliseconds. | Short-term engagement signal. Later aggregated into daily play duration and used to define future outcomes. |
| video_duration | Length of the video, stored in milliseconds. | Item context and denominator for watch-ratio calculations. |
| time | Human-readable event time string. | Used as a diagnostic event timestamp; the notebook also parses the Unix timestamp field. |
| date | Calendar date of the event, encoded like 20200705. | Becomes the daily time index for the user-day panel. |
| timestamp | Unix timestamp of the event in seconds. | Used to reconstruct event ordering within users. |
| watch_ratio | play_duration / video_duration; values above 1 can occur when users rewatch or exceed the nominal video length. | Core short-term engagement measure. Used to define high-watch exposure candidates and daily engagement summaries. |
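
As a concrete check of the watch-ratio definition, the small sketch below recomputes it for one interaction; the numbers are copied from a sample row shown later in the cleaned preview, not new data.

# Worked example: watch_ratio = play_duration / video_duration (both stored in milliseconds).
play_duration_ms = 11_635   # from a user 14 row shown later in this notebook
video_duration_ms = 6_100
watch_ratio = play_duration_ms / video_duration_ms
print(round(watch_ratio, 4))  # ~1.9074, above 1 because the user watched longer than the clip length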

user_features.csv: User Metadata

This table contains baseline user attributes and anonymized features. These columns can be used as baseline confounders because user characteristics may affect both what the recommender shows and how the user behaves later.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| user_id | Unique user identifier. | Join key from interaction rows to user-level attributes. |
| user_active_degree | Categorical user activity segment such as high-active or full-active. | Baseline activity segment; likely related to future retention and exposure patterns. |
| is_lowactive_period | Indicator for whether the user is in a low-activity period. | Baseline or period-level activity state. Useful for understanding heterogeneous retention risk. |
| is_live_streamer | Indicator for whether the user is a live streamer. | User role/context feature that may affect content preferences and future behavior. |
| is_video_author | Indicator for whether the user authors videos. | User role/context feature; creators may behave differently from pure consumers. |
| follow_user_num | Number of users this user follows. | Social graph intensity; possible proxy for platform embeddedness. |
| follow_user_num_range | Binned range version of follow_user_num. | Categorical version of follow count for modeling or stratified summaries. |
| fans_user_num | Number of fans/followers the user has. | Popularity or creator-context proxy. |
| fans_user_num_range | Binned range version of fans_user_num. | Categorical version of fan count. |
| friend_user_num | Number of friends connected to the user. | Social connectedness proxy; may predict retention. |
| friend_user_num_range | Binned range version of friend_user_num. | Categorical version of friend count. |
| register_days | Number of days since registration. | User tenure; important baseline confounder for future engagement. |
| register_days_range | Binned range version of register_days. | Categorical tenure feature. |
| onehot_feat0 to onehot_feat17 | Anonymized one-hot user features supplied by KuaiRec. | Potential baseline covariates; useful for prediction and adjustment, but their substantive meaning is intentionally hidden. |

item_categories.csv: Video Category Metadata

This table maps videos to one or more category identifiers.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| video_id | Unique video identifier. | Join key from interaction rows to video metadata. |
| feat | String representation of a list of category ids, such as [27, 9]. | Describes item content category. Later parsed into category count and first-category diagnostics. |

item_daily_features.csv: Daily Video Popularity and Engagement Context

The full table has many daily item metrics. This notebook loads a focused subset that is directly useful for sequential recommendation diagnostics. These features can become item-day context variables because item popularity may affect both recommendation exposure and downstream user behavior.

| Column | Meaning | How it matters for this project |
| --- | --- | --- |
| video_id | Unique video identifier. | Join key to connect daily item context to consumed videos. |
| date | Calendar date for the item-day metrics. | Aligns item popularity with the same daily panel used for users. |
| show_cnt | Number of times the video was shown. | Visibility/popularity proxy. A highly shown item may be more likely to appear in recommendation logs. |
| show_user_num | Number of users who were shown the video. | User-level reach of the item on that day. |
| play_cnt | Number of plays. | Daily item consumption volume. |
| play_user_num | Number of users who played the video. | Daily user reach among players. |
| complete_play_cnt | Number of complete plays. | Daily completion signal for the item. |
| like_cnt | Number of likes. | Positive feedback signal for the item. |
| comment_cnt | Number of comments. | Deeper engagement signal for the item. |
| share_cnt | Number of shares. | Viral or social engagement signal for the item. |
| collect_cnt | Number of collections/saves. | Longer-intent engagement signal for the item. |

Derived Columns Created Later in This Notebook

The notebook also creates derived fields after loading the raw data. These are not separate raw KuaiRec columns, but they are central to the causal setup.

| Derived column family | Meaning |
| --- | --- |
| event_date, event_time, event_timestamp | Cleaned time fields used to order events and build daily panels. |
| play_duration_sec, video_duration_sec | Human-readable duration fields converted from milliseconds to seconds. |
| high_watch, complete_or_rewatch | Interaction-level engagement indicators based on watch-ratio thresholds. |
| active_day, interactions, total_play_duration_sec, avg_watch_ratio | User-day summaries used to describe daily state. |
| lag_1_*, prior_3day_* | Pre-treatment history variables used as time-varying confounder candidates. |
| next_day_*, future_3day_*, future_7day_* | Future outcome variables used to study longer-term engagement. |
| treatment_high_intensity, treatment_high_watch_exposure | First-pass daily treatment definitions used to inspect treatment variation and confounding. |
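
To make the timeline convention concrete before any real data loads, here is a minimal sketch of how the lag_1_* and next_day_* column families relate to the current day via pandas shift. It mirrors the groupby-shift logic used later in the notebook and reuses the first few user-day interaction counts for user 14 that appear further down; it is an illustration, not part of the pipeline.

import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": [14, 14, 14, 14],
        "event_date": pd.to_datetime(["2020-07-05", "2020-07-06", "2020-07-07", "2020-07-08"]),
        "interactions": [26, 23, 78, 22],
    }
)

# Yesterday's value is pre-treatment state; tomorrow's value is a candidate outcome.
toy["lag_1_interactions"] = toy.groupby("user_id")["interactions"].shift(1)
toy["next_day_interactions"] = toy.groupby("user_id")["interactions"].shift(-1)
print(toy)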

Setup

The first code cell imports the libraries used throughout the notebook. The plotting settings are kept simple and readable because this notebook is mostly about understanding temporal structure. We also make pandas show enough rows and columns for compact diagnostic tables.

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import ast
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 80)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

sns.set_theme(style="whitegrid", context="notebook")

The notebook environment is now ready for sequential EDA: pandas handles the panel operations, seaborn and matplotlib handle visual checks, and the display settings make the wide diagnostic tables readable. The next step is to locate the project data in a way that works whether the notebook is run from the repository root or from inside the notebook folder.

Locate the Project and Data

This notebook may be run from the repository root or from inside the notebooks/projects/project_3_long_term_causal_effects folder. The next cell searches upward until it finds the KuaiRec archive. Keeping this path logic inside the notebook makes it easier to rerun from Jupyter without manually changing directories.

KUAI_REC_ZIP_RELATIVE_PATH = Path("data/Kuairec/18164998.zip")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_ROOT = next(
    (path for path in candidate_roots if (path / KUAI_REC_ZIP_RELATIVE_PATH).exists()),
    None,
)

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"Could not find {KUAI_REC_ZIP_RELATIVE_PATH}. Run this notebook from inside the ranking_sys project."
    )

KUAI_REC_ZIP = PROJECT_ROOT / KUAI_REC_ZIP_RELATIVE_PATH
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"KuaiRec archive: {KUAI_REC_ZIP}")
print(f"Processed output folder: {PROCESSED_DIR}")
Project root: /home/apex/Documents/ranking_sys
KuaiRec archive: /home/apex/Documents/ranking_sys/data/Kuairec/18164998.zip
Processed output folder: /home/apex/Documents/ranking_sys/data/processed

The printed paths confirm that the notebook found the project root, the KuaiRec archive, and the processed-data folder. With those locations fixed, the next cell can inspect the raw archive contents without hard-coding local machine paths.

Inspect the Nested Archive

The downloaded file is an outer archive that contains a second archive named KuaiRec.zip. The interaction and metadata CSV files live inside that inner archive. This cell lists the relevant files so we know exactly what data sources are available before reading any large tables.

with ZipFile(KUAI_REC_ZIP) as outer_zip:
    outer_inventory = pd.DataFrame(
        [
            {
                "member": member.filename,
                "size_mb": member.file_size / 1_000_000,
            }
            for member in outer_zip.infolist()
        ]
    )
    inner_zip_bytes = outer_zip.read("KuaiRec.zip")

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    inner_inventory = pd.DataFrame(
        [
            {
                "member": member.filename,
                "size_mb": member.file_size / 1_000_000,
            }
            for member in inner_zip.infolist()
            if member.filename.startswith("KuaiRec 2.0/data/")
        ]
    )

print("Outer archive contents")
display(outer_inventory.sort_values("size_mb", ascending=False))

print("Inner KuaiRec data files")
display(inner_inventory.sort_values("member"))
Outer archive contents
member size_mb
0 KuaiRec.zip 431.9649
1 kuairec_caption_category.csv 1.9646
2 video_raw_categories_multi.csv 1.7245
3 user_features_raw.csv 1.5416
Inner KuaiRec data files
member size_mb
0 KuaiRec 2.0/data/ 0.0000
1 KuaiRec 2.0/data/big_matrix.csv 1,083.5212
2 KuaiRec 2.0/data/item_categories.csv 0.1131
3 KuaiRec 2.0/data/item_daily_features.csv 85.8552
4 KuaiRec 2.0/data/kuairec_caption_category.csv 1.9646
5 KuaiRec 2.0/data/small_matrix.csv 406.1558
6 KuaiRec 2.0/data/social_network.csv 0.0069
7 KuaiRec 2.0/data/user_features.csv 0.7442

The archive inventory shows that KuaiRec is nested: the downloaded file contains an inner KuaiRec.zip, and the useful CSV files live inside that inner archive. This confirms that the notebook should read the data directly from the nested zip rather than assuming the dataset has already been extracted.

Choose the Tables and Sample Size

KuaiRec contains very large interaction matrices. For a first EDA notebook, we use a deterministic sample from small_matrix.csv. This is enough to study repeated users, daily activity, candidate treatments, and future outcomes while keeping the notebook fast enough to rerun interactively.

The sample is not the final modeling decision. It is a development-friendly slice used to design the causal panel. Later notebooks can increase the sample size or switch to a full extraction once the logic is stable.
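
If a later notebook needs the full table rather than a head sample, one option is a chunked read directly from the nested archive. The sketch below is illustrative only and is not executed here: it reuses the inner_zip_bytes object created in the archive-inspection cell, and the chunk size of 1,000,000 rows is an arbitrary assumption rather than a tuned value.

from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Illustrative chunked pass over the full small_matrix.csv.
# Assumes `inner_zip_bytes` from the archive-inspection cell is still in memory.
chunk_rows = 1_000_000
rows_per_user = None
with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open("KuaiRec 2.0/data/small_matrix.csv") as file:
        for chunk in pd.read_csv(file, chunksize=chunk_rows):
            counts = chunk.groupby("user_id").size()
            rows_per_user = counts if rows_per_user is None else rows_per_user.add(counts, fill_value=0)

print(rows_per_user.describe())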

SMALL_MATRIX_MEMBER = "KuaiRec 2.0/data/small_matrix.csv"
USER_FEATURES_MEMBER = "KuaiRec 2.0/data/user_features.csv"
ITEM_CATEGORIES_MEMBER = "KuaiRec 2.0/data/item_categories.csv"
ITEM_DAILY_FEATURES_MEMBER = "KuaiRec 2.0/data/item_daily_features.csv"

SAMPLE_ROWS = 300_000
ITEM_DAILY_SAMPLE_ROWS = 150_000

print(f"Interaction rows to read from small_matrix.csv: {SAMPLE_ROWS:,}")
print(f"Item daily feature rows to read: {ITEM_DAILY_SAMPLE_ROWS:,}")
Interaction rows to read from small_matrix.csv: 300,000
Item daily feature rows to read: 150,000

The selected files cover the ingredients needed for the project: interaction events, user metadata, item categories, and daily item context. The sample sizes keep this first notebook fast while still preserving a multi-user, multi-day sequential structure for causal design work.

Load and Normalize the Interaction Sample

The interaction table is the core event log. Each row records a user watching a video at a time, with the play duration, video duration, date, timestamp, and watch ratio. The next cell reads the sample and creates cleaner time columns.

The durations appear to be stored in milliseconds, so the cell also creates second-based versions that are easier to interpret. The watch_ratio is the observed play duration divided by video duration. Values above 1 can happen when users rewatch, loop, or otherwise spend longer than the nominal video duration.

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open(SMALL_MATRIX_MEMBER) as file:
        interactions = pd.read_csv(file, nrows=SAMPLE_ROWS)

# Convert KuaiRec's numeric date field into a proper datetime date.
date_numeric = pd.to_numeric(interactions["date"], errors="coerce").round().astype("Int64")
interactions["event_date"] = pd.to_datetime(
    date_numeric.astype("string"),
    format="%Y%m%d",
    errors="coerce",
)

# The timestamp column is Unix time in seconds. It is the most reliable event-time field.
interactions["event_timestamp"] = pd.to_datetime(
    pd.to_numeric(interactions["timestamp"], errors="coerce"),
    unit="s",
    errors="coerce",
)

# Keep a parsed version of the original time field as a backup and diagnostic.
interactions["event_time_from_text"] = pd.to_datetime(interactions["time"], errors="coerce")
interactions["event_time"] = interactions["event_time_from_text"].fillna(interactions["event_timestamp"])

# Add human-readable duration columns. The original millisecond fields are retained.
interactions["play_duration_sec"] = interactions["play_duration"] / 1_000
interactions["video_duration_sec"] = interactions["video_duration"] / 1_000
interactions["watch_ratio_capped_5"] = interactions["watch_ratio"].clip(lower=0, upper=5)

interactions = interactions.sort_values(["user_id", "event_time", "video_id"]).reset_index(drop=True)

print(f"Loaded interaction sample shape: {interactions.shape}")
display(interactions.head())
Loaded interaction sample shape: (300000, 15)
user_id video_id play_duration video_duration time date timestamp watch_ratio event_date event_timestamp event_time_from_text event_time play_duration_sec video_duration_sec watch_ratio_capped_5
0 14 148 4381 6067 2020-07-05 05:27:48.378 20,200,705.0000 1,593,898,068.3780 0.7221 2020-07-05 2020-07-04 21:27:48.378000021 2020-07-05 05:27:48.378 2020-07-05 05:27:48.378 4.3810 6.0670 0.7221
1 14 183 11635 6100 2020-07-05 05:28:00.057 20,200,705.0000 1,593,898,080.0570 1.9074 2020-07-05 2020-07-04 21:28:00.056999922 2020-07-05 05:28:00.057 2020-07-05 05:28:00.057 11.6350 6.1000 1.9074
2 14 3649 22422 10867 2020-07-05 05:29:09.479 20,200,705.0000 1,593,898,149.4790 2.0633 2020-07-05 2020-07-04 21:29:09.479000092 2020-07-05 05:29:09.479 2020-07-05 05:29:09.479 22.4220 10.8670 2.0633
3 14 5262 4479 7908 2020-07-05 05:30:43.285 20,200,705.0000 1,593,898,243.2850 0.5664 2020-07-05 2020-07-04 21:30:43.285000086 2020-07-05 05:30:43.285 2020-07-05 05:30:43.285 4.4790 7.9080 0.5664
4 14 8234 4602 11000 2020-07-05 05:35:43.459 20,200,705.0000 1,593,898,543.4590 0.4184 2020-07-05 2020-07-04 21:35:43.459000111 2020-07-05 05:35:43.459 2020-07-05 05:35:43.459 4.6020 11.0000 0.4184

The interaction sample is now sorted and enriched with clean event dates, timestamps, seconds-based duration fields, and a capped watch-ratio helper for plots. This makes the raw event log usable for both ordinary EDA and later user-day aggregation.

Load User, Item, and Daily Item Metadata

The interaction log is not enough for causal analysis because treatment assignment depends on user and item context. The next cell loads three metadata tables:

  • user_features.csv: user-level characteristics such as activity degree, follower counts, registration age, and anonymized one-hot features.
  • item_categories.csv: video category identifiers.
  • item_daily_features.csv: daily item popularity and engagement counts, which are useful proxies for how popular or visible an item was on a given day.

These features are not final confounders yet, but they are candidates for later treatment and outcome models.

with ZipFile(BytesIO(inner_zip_bytes)) as inner_zip:
    with inner_zip.open(USER_FEATURES_MEMBER) as file:
        user_features = pd.read_csv(file)

    with inner_zip.open(ITEM_CATEGORIES_MEMBER) as file:
        item_categories = pd.read_csv(file)

    daily_usecols = [
        "video_id",
        "date",
        "show_cnt",
        "show_user_num",
        "play_cnt",
        "play_user_num",
        "complete_play_cnt",
        "like_cnt",
        "comment_cnt",
        "share_cnt",
        "collect_cnt",
    ]
    with inner_zip.open(ITEM_DAILY_FEATURES_MEMBER) as file:
        item_daily = pd.read_csv(file, usecols=daily_usecols, nrows=ITEM_DAILY_SAMPLE_ROWS)

# Convert category strings such as "[27, 9]" into counts and first-category diagnostics.
def parse_category_list(value):
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

category_lists = item_categories["feat"].map(parse_category_list)
item_categories["category_count"] = category_lists.map(len)
item_categories["first_category"] = category_lists.map(lambda values: values[0] if values else np.nan)

item_daily["event_date"] = pd.to_datetime(item_daily["date"].astype("string"), format="%Y%m%d", errors="coerce")
item_daily["item_play_rate"] = np.where(
    item_daily["show_cnt"] > 0,
    item_daily["play_cnt"] / item_daily["show_cnt"],
    np.nan,
)
item_daily["item_like_rate_per_play"] = np.where(
    item_daily["play_cnt"] > 0,
    item_daily["like_cnt"] / item_daily["play_cnt"],
    np.nan,
)

print(f"User features shape: {user_features.shape}")
print(f"Item categories shape: {item_categories.shape}")
print(f"Item daily feature sample shape: {item_daily.shape}")
User features shape: (7176, 31)
Item categories shape: (10728, 4)
Item daily feature sample shape: (150000, 14)

The metadata tables loaded successfully and can be joined to the interaction log by user_id, video_id, and date. This matters because long-term causal modeling needs context: users and items are not exchangeable, and popularity or user tenure may influence both exposure and future engagement.
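
The joins themselves are deferred to later notebooks, but a minimal sketch of what they would look like is shown below. It uses only columns created above; because item_daily is a row-limited sample, many item-day pairs will simply not match and will come back as missing values.

# Sketch of the user, category, and item-day joins (deferred to later notebooks).
interactions_enriched = (
    interactions
    .merge(user_features, on="user_id", how="left")
    .merge(item_categories[["video_id", "category_count", "first_category"]], on="video_id", how="left")
    .merge(
        item_daily[["video_id", "event_date", "show_cnt", "play_cnt", "item_play_rate"]],
        on=["video_id", "event_date"],
        how="left",
    )
)
print(interactions_enriched.shape)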

Raw Column Dictionary

Before aggregating anything, it helps to translate the key raw fields into causal language. This cell creates a compact dictionary for the columns we will use most often. This is especially useful in a portfolio notebook because reviewers can see the connection between raw logging fields and causal design components.

column_dictionary = pd.DataFrame(
    [
        {
            "table": "small_matrix",
            "column": "user_id",
            "meaning": "User identifier. This becomes the panel unit.",
            "causal_role": "Unit identifier",
        },
        {
            "table": "small_matrix",
            "column": "video_id",
            "meaning": "Recommended or consumed video identifier.",
            "causal_role": "Exposure content identifier",
        },
        {
            "table": "small_matrix",
            "column": "play_duration",
            "meaning": "Observed play duration in milliseconds.",
            "causal_role": "Short-term engagement outcome and future-state input",
        },
        {
            "table": "small_matrix",
            "column": "video_duration",
            "meaning": "Video duration in milliseconds.",
            "causal_role": "Item context and denominator for watch ratio",
        },
        {
            "table": "small_matrix",
            "column": "watch_ratio",
            "meaning": "Play duration divided by video duration; values above 1 indicate rewatch or over-completion.",
            "causal_role": "Short-term engagement and candidate exposure-quality measure",
        },
        {
            "table": "small_matrix",
            "column": "event_date",
            "meaning": "Calendar date derived from KuaiRec's date field.",
            "causal_role": "Time index for the user-day panel",
        },
        {
            "table": "user_features",
            "column": "user_active_degree",
            "meaning": "Categorical user activity segment supplied by KuaiRec.",
            "causal_role": "Baseline covariate and confounder candidate",
        },
        {
            "table": "item_daily_features",
            "column": "show_cnt / play_cnt",
            "meaning": "Daily item visibility and play counts.",
            "causal_role": "Item popularity context and confounder candidate",
        },
    ]
)

display(column_dictionary)
table column meaning causal_role
0 small_matrix user_id User identifier. This becomes the panel unit. Unit identifier
1 small_matrix video_id Recommended or consumed video identifier. Exposure content identifier
2 small_matrix play_duration Observed play duration in milliseconds. Short-term engagement outcome and future-state...
3 small_matrix video_duration Video duration in milliseconds. Item context and denominator for watch ratio
4 small_matrix watch_ratio Play duration divided by video duration; value... Short-term engagement and candidate exposure-q...
5 small_matrix event_date Calendar date derived from KuaiRec's date field. Time index for the user-day panel
6 user_features user_active_degree Categorical user activity segment supplied by ... Baseline covariate and confounder candidate
7 item_daily_features show_cnt / play_cnt Daily item visibility and play counts. Item popularity context and confounder candidate

The column dictionary translates raw logging fields into causal roles. That translation is important for the rest of the notebook: we are not just describing columns, we are deciding which fields can become units, treatments, covariates, and outcomes.

Basic Shape and Coverage

The first data-readiness question is whether the sample contains repeated observations over time. Long-term causal effects require enough users, enough dates, and enough future observations. The next cell summarizes the main coverage metrics for the interaction sample.

coverage_summary = pd.DataFrame(
    [
        {"metric": "interaction_rows", "value": len(interactions)},
        {"metric": "unique_users", "value": interactions["user_id"].nunique()},
        {"metric": "unique_videos", "value": interactions["video_id"].nunique()},
        {"metric": "unique_event_dates", "value": interactions["event_date"].nunique()},
        {"metric": "first_event_date", "value": interactions["event_date"].min()},
        {"metric": "last_event_date", "value": interactions["event_date"].max()},
        {"metric": "first_event_time", "value": interactions["event_time"].min()},
        {"metric": "last_event_time", "value": interactions["event_time"].max()},
    ]
)

display(coverage_summary)
metric value
0 interaction_rows 300000
1 unique_users 91
2 unique_videos 3327
3 unique_event_dates 63
4 first_event_date 2020-07-05 00:00:00
5 last_event_date 2020-09-05 00:00:00
6 first_event_time 2020-07-05 00:01:03.816000
7 last_event_time 2020-09-05 23:11:44.456000

The coverage table confirms that the sample contains repeated observations across users, videos, and calendar dates. Because the data spans multiple days per user, it is suitable for constructing histories and future outcomes rather than only one-shot engagement summaries.

Example Rows After Cleaning

This cell shows a few cleaned rows after date parsing and duration normalization. The goal is not to inspect every column manually but to verify that the event time, event date, duration, and watch-ratio fields look coherent before we build a panel from them.

preview_columns = [
    "user_id",
    "video_id",
    "event_date",
    "event_time",
    "play_duration_sec",
    "video_duration_sec",
    "watch_ratio",
]

display(interactions[preview_columns].head(10))
user_id video_id event_date event_time play_duration_sec video_duration_sec watch_ratio
0 14 148 2020-07-05 2020-07-05 05:27:48.378 4.3810 6.0670 0.7221
1 14 183 2020-07-05 2020-07-05 05:28:00.057 11.6350 6.1000 1.9074
2 14 3649 2020-07-05 2020-07-05 05:29:09.479 22.4220 10.8670 2.0633
3 14 5262 2020-07-05 2020-07-05 05:30:43.285 4.4790 7.9080 0.5664
4 14 8234 2020-07-05 2020-07-05 05:35:43.459 4.6020 11.0000 0.4184
5 14 6789 2020-07-05 2020-07-05 05:36:00.773 8.6070 13.2670 0.6488
6 14 1963 2020-07-05 2020-07-05 05:36:47.741 8.6130 9.5900 0.8981
7 14 175 2020-07-05 2020-07-05 05:49:27.965 11.6400 46.5140 0.2502
8 14 1973 2020-07-05 2020-07-05 05:49:41.762 4.5720 7.4000 0.6178
9 14 171 2020-07-05 2020-07-05 05:57:26.581 8.5180 5.2170 1.6327

The preview rows show that the cleaned event date, event time, durations, and watch ratio are aligned at the interaction level. Since these fields look coherent, the next checks can focus on missingness and distributional shape rather than basic parsing issues.

Missingness and Data Type Checks

Missing values matter because causal estimators usually require complete treatment, outcome, and confounder histories. This cell computes missing rates and shows the cleaned data types. A high missing rate in a key time or outcome column would be a blocker for later notebooks.

missingness = (
    interactions.isna()
    .mean()
    .sort_values(ascending=False)
    .rename("missing_rate")
    .reset_index()
    .rename(columns={"index": "column"})
)

dtypes = (
    interactions.dtypes.astype(str)
    .rename("dtype")
    .reset_index()
    .rename(columns={"index": "column"})
)

missingness_with_types = missingness.merge(dtypes, on="column", how="left")

display(missingness_with_types)
column missing_rate dtype
0 timestamp 0.0365 float64
1 date 0.0365 float64
2 time 0.0365 str
3 event_time 0.0365 datetime64[ns]
4 event_time_from_text 0.0365 datetime64[us]
5 event_timestamp 0.0365 datetime64[ns]
6 event_date 0.0365 datetime64[us]
7 video_duration 0.0000 int64
8 user_id 0.0000 int64
9 video_id 0.0000 int64
10 play_duration 0.0000 int64
11 watch_ratio 0.0000 float64
12 play_duration_sec 0.0000 float64
13 video_duration_sec 0.0000 float64
14 watch_ratio_capped_5 0.0000 float64

This check tells us whether any key variables would block later causal modeling. Low missingness in identifiers, timing fields, and engagement metrics means the sample can be safely aggregated into a daily panel without silently losing many rows.
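
One implication is worth quantifying: interaction rows whose date and timestamp failed to parse have a missing event_date and would drop out of any groupby on user and day. The small check below is an extra diagnostic, not part of the original flow; it only counts those rows.

# Rows without a parsed event_date cannot join the user-day panel.
missing_event_date = int(interactions["event_date"].isna().sum())
print(
    f"Interaction rows with missing event_date: {missing_event_date:,} "
    f"({missing_event_date / len(interactions):.1%} of the sample)"
)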

Engagement Distribution Summary

Short-term engagement is the raw material for both candidate treatments and future outcomes. The next cell summarizes play duration, video duration, and watch ratio. The percentile view is important because watch-time data is usually skewed: a small number of very long or repeated watches can dominate averages.

engagement_summary = interactions[
    ["play_duration_sec", "video_duration_sec", "watch_ratio", "watch_ratio_capped_5"]
].describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T

display(engagement_summary)
count mean std min 1% 5% 25% 50% 75% 95% 99% max
play_duration_sec 300,000.0000 8.7096 12.7322 0.0000 0.5640 1.9380 5.7090 7.6695 9.6900 15.7380 32.5791 1,502.2620
video_duration_sec 300,000.0000 14.4755 20.4464 3.0670 4.8850 5.9000 7.5220 9.5930 11.9340 44.8500 139.8770 315.0720
watch_ratio 300,000.0000 0.9191 1.5511 0.0000 0.0227 0.1017 0.4717 0.7720 1.1220 1.9767 3.7086 333.8360
watch_ratio_capped_5 300,000.0000 0.8802 0.6698 0.0000 0.0227 0.1017 0.4717 0.7720 1.1220 1.9767 3.7086 5.0000

The percentile table shows the skew typical of watch-time data. Watch ratios and play durations have long tails, so later plots and models should be careful with extreme values rather than relying only on raw means.

Visualize Short-Term Engagement

The plots below show the distribution of watch ratio and play duration. The watch-ratio plot is capped at 5 for readability, because extreme rewatch values can make the main distribution hard to see. This is only a plotting cap; the raw watch_ratio remains available.

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

sns.histplot(
    data=interactions,
    x="watch_ratio_capped_5",
    bins=60,
    ax=axes[0],
    color="#2A6F97",
)
axes[0].axvline(0.8, color="black", linestyle="--", linewidth=1, label="0.8 threshold")
axes[0].axvline(1.0, color="darkred", linestyle="--", linewidth=1, label="1.0 threshold")
axes[0].set_title("Watch Ratio Distribution, Capped at 5")
axes[0].set_xlabel("Watch ratio")
axes[0].legend()

sns.histplot(
    data=interactions,
    x="play_duration_sec",
    bins=60,
    ax=axes[1],
    color="#5C946E",
)
axes[1].set_title("Play Duration Distribution")
axes[1].set_xlabel("Play duration, seconds")
axes[1].set_yscale("log")

plt.tight_layout()
plt.show()

The plots make the engagement skew easier to see. The watch-ratio thresholds around 0.8 and 1.0 are plausible anchors for high-watch and over-completion behavior, which motivates the candidate treatment definitions created later.
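
As a quick follow-up diagnostic (not in the original cell), the snippet below reports how common each threshold behavior is in the interaction sample, which previews the base rates behind the candidate treatment definitions created later.

# Share of interactions at or above each candidate watch-ratio anchor.
threshold_shares = pd.Series(
    {
        "watch_ratio >= 0.8": (interactions["watch_ratio"] >= 0.8).mean(),
        "watch_ratio >= 1.0": (interactions["watch_ratio"] >= 1.0).mean(),
    },
    name="share_of_interactions",
)
print(threshold_shares.round(3))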

User and Item Repeated Measures

Sequential causal inference needs repeated observations for the same users. If every user appeared once, we could not define histories, lagged confounders, or future retention. The next cell summarizes how many interactions and active days each sampled user contributes, and how concentrated video consumption is across items.

user_activity = (
    interactions.groupby("user_id")
    .agg(
        interactions=("video_id", "size"),
        unique_videos=("video_id", "nunique"),
        active_days=("event_date", "nunique"),
        first_date=("event_date", "min"),
        last_date=("event_date", "max"),
        avg_watch_ratio=("watch_ratio", "mean"),
        total_play_duration_sec=("play_duration_sec", "sum"),
    )
    .reset_index()
)

item_activity = (
    interactions.groupby("video_id")
    .agg(
        impressions=("user_id", "size"),
        unique_users=("user_id", "nunique"),
        avg_watch_ratio=("watch_ratio", "mean"),
        avg_play_duration_sec=("play_duration_sec", "mean"),
    )
    .reset_index()
    .sort_values("impressions", ascending=False)
)

print("User activity summary")
display(user_activity[["interactions", "unique_videos", "active_days", "avg_watch_ratio", "total_play_duration_sec"]].describe().T)

print("Most observed videos in the sample")
display(item_activity.head(10))
User activity summary
count mean std min 25% 50% 75% max
interactions 91.0000 3,296.7033 166.3290 1,729.0000 3,309.0000 3,315.0000 3,320.0000 3,326.0000
unique_videos 91.0000 3,296.7033 166.3290 1,729.0000 3,309.0000 3,315.0000 3,320.0000 3,326.0000
active_days 91.0000 60.8462 3.9831 31.0000 61.0000 62.0000 63.0000 63.0000
avg_watch_ratio 91.0000 0.9190 0.1953 0.6512 0.7941 0.8832 0.9841 2.0275
total_play_duration_sec 91.0000 28,712.8942 6,276.4357 13,711.2750 25,107.3275 28,289.4630 31,055.6880 63,349.6840
Most observed videos in the sample
video_id impressions unique_users avg_watch_ratio avg_play_duration_sec
31 186 91 91 1.1330 7.7613
30 183 91 91 1.6252 9.9136
0 103 91 91 0.7553 7.7798
1 109 91 91 1.0973 8.5228
2 120 91 91 1.9835 12.4414
3 122 91 91 0.8607 8.3776
4 128 91 91 0.7762 8.0989
5 130 91 91 0.8416 8.4340
29 180 91 91 0.8184 9.0079
28 179 91 91 1.2162 7.3785

The user and item summaries show how much repeated information exists per user and how concentrated consumption is across videos. Repeated user histories are the key ingredient for long-term causal effects, while item concentration reminds us that content popularity may be a confounder.

Plot Repeated User Activity

The next plots show whether users have enough repeated activity for a user-day panel. The left plot checks interaction volume per user. The right plot checks the number of active calendar days per user. More repeated days means more usable history for later marginal structural modeling.

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

sns.histplot(data=user_activity, x="interactions", bins=40, ax=axes[0], color="#2A6F97")
axes[0].set_title("Interactions per User")
axes[0].set_xlabel("Interaction rows")

sns.histplot(data=user_activity, x="active_days", bins=30, ax=axes[1], color="#C07F00")
axes[1].set_title("Active Days per User")
axes[1].set_xlabel("Active calendar days")

plt.tight_layout()
plt.show()

The activity plots help verify that users contribute multiple interaction rows and active days. That supports the next move: aggregating events into user-day records so that each row can represent one time step in a longitudinal causal panel.

Build the Observed User-Day Table

The raw data is interaction-level, but long-term causal effects are easier to reason about at a user-day level. This cell aggregates each user’s activity on each observed day. It creates daily metrics such as total interactions, unique videos, total play duration, average watch ratio, and the share of watched videos that crossed a high-watch threshold.

The high-watch threshold is set at watch_ratio >= 0.8 as a first-pass operational definition. This does not mean it is the final treatment. It is a transparent candidate that later notebooks can refine.

HIGH_WATCH_THRESHOLD = 0.8
COMPLETE_OR_REWATCH_THRESHOLD = 1.0

interactions["high_watch"] = (interactions["watch_ratio"] >= HIGH_WATCH_THRESHOLD).astype(int)
interactions["complete_or_rewatch"] = (interactions["watch_ratio"] >= COMPLETE_OR_REWATCH_THRESHOLD).astype(int)

user_day_observed = (
    interactions.groupby(["user_id", "event_date"])
    .agg(
        interactions=("video_id", "size"),
        unique_videos=("video_id", "nunique"),
        total_play_duration_ms=("play_duration", "sum"),
        avg_play_duration_ms=("play_duration", "mean"),
        avg_video_duration_ms=("video_duration", "mean"),
        avg_watch_ratio=("watch_ratio", "mean"),
        high_watch_count=("high_watch", "sum"),
        complete_or_rewatch_count=("complete_or_rewatch", "sum"),
    )
    .reset_index()
)

user_day_observed["total_play_duration_sec"] = user_day_observed["total_play_duration_ms"] / 1_000
user_day_observed["avg_play_duration_sec"] = user_day_observed["avg_play_duration_ms"] / 1_000
user_day_observed["avg_video_duration_sec"] = user_day_observed["avg_video_duration_ms"] / 1_000
user_day_observed["high_watch_share"] = user_day_observed["high_watch_count"] / user_day_observed["interactions"]
user_day_observed["complete_or_rewatch_share"] = user_day_observed["complete_or_rewatch_count"] / user_day_observed["interactions"]

user_day_observed = user_day_observed.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Observed user-day rows: {len(user_day_observed):,}")
display(user_day_observed.head(10))
Observed user-day rows: 5,537
user_id event_date interactions unique_videos total_play_duration_ms avg_play_duration_ms avg_video_duration_ms avg_watch_ratio high_watch_count complete_or_rewatch_count total_play_duration_sec avg_play_duration_sec avg_video_duration_sec high_watch_share complete_or_rewatch_share
0 14 2020-07-05 26 26 240975 9,268.2692 10,187.6538 1.0845 15 12 240.9750 9.2683 10.1877 0.5769 0.4615
1 14 2020-07-06 23 23 248344 10,797.5652 14,615.2174 1.0640 12 10 248.3440 10.7976 14.6152 0.5217 0.4348
2 14 2020-07-07 78 78 655489 8,403.7051 13,529.6410 0.8415 36 27 655.4890 8.4037 13.5296 0.4615 0.3462
3 14 2020-07-08 22 22 201901 9,177.3182 12,657.1818 0.9828 11 9 201.9010 9.1773 12.6572 0.5000 0.4091
4 14 2020-07-09 55 55 485039 8,818.8909 12,841.6727 0.8619 20 15 485.0390 8.8189 12.8417 0.3636 0.2727
5 14 2020-07-10 52 52 606244 11,658.5385 17,735.4038 1.1380 30 21 606.2440 11.6585 17.7354 0.5769 0.4038
6 14 2020-07-11 32 32 284747 8,898.3438 13,858.9688 0.9337 16 10 284.7470 8.8983 13.8590 0.5000 0.3125
7 14 2020-07-12 42 42 337918 8,045.6667 13,215.6429 0.7986 18 11 337.9180 8.0457 13.2156 0.4286 0.2619
8 14 2020-07-13 46 46 502145 10,916.1957 13,244.0000 1.0389 24 19 502.1450 10.9162 13.2440 0.5217 0.4130
9 14 2020-07-14 42 42 337489 8,035.4524 14,600.6190 0.9892 23 16 337.4890 8.0355 14.6006 0.5476 0.3810

The observed user-day table compresses raw events into daily engagement states. This is the first major causal-design transformation: the analysis unit shifts from an individual watch event to a user observed on a calendar day.

Densify the User-Day Panel

A long-term outcome like next-day retention requires knowing when a user was inactive. The observed table only contains days with at least one interaction, so it cannot directly distinguish missing days from inactive days. The next cell creates a dense user-date grid across the sample window and fills missing activity with zeros.

This dense panel lets us define outcomes such as next_day_active and future_7day_active_days. Those outcomes are central to the long-term causal question because they move beyond immediate watch behavior.

all_users = np.sort(user_day_observed["user_id"].unique())
all_dates = pd.date_range(
    user_day_observed["event_date"].min(),
    user_day_observed["event_date"].max(),
    freq="D",
)

dense_index = pd.MultiIndex.from_product(
    [all_users, all_dates],
    names=["user_id", "event_date"],
)

user_day = (
    user_day_observed.set_index(["user_id", "event_date"])
    .reindex(dense_index)
    .reset_index()
)

count_and_sum_columns = [
    "interactions",
    "unique_videos",
    "total_play_duration_ms",
    "high_watch_count",
    "complete_or_rewatch_count",
    "total_play_duration_sec",
]
rate_and_average_columns = [
    "avg_play_duration_ms",
    "avg_video_duration_ms",
    "avg_watch_ratio",
    "avg_play_duration_sec",
    "avg_video_duration_sec",
    "high_watch_share",
    "complete_or_rewatch_share",
]

user_day[count_and_sum_columns] = user_day[count_and_sum_columns].fillna(0)
user_day[rate_and_average_columns] = user_day[rate_and_average_columns].fillna(0)
user_day["active_day"] = (user_day["interactions"] > 0).astype(int)

user_day = user_day.sort_values(["user_id", "event_date"]).reset_index(drop=True)

print(f"Dense user-day panel shape: {user_day.shape}")
print(f"Users in panel: {user_day['user_id'].nunique():,}")
print(f"Dates per user: {user_day['event_date'].nunique():,}")
display(user_day.head(10))
Dense user-day panel shape: (5733, 16)
Users in panel: 91
Dates per user: 63
user_id event_date interactions unique_videos total_play_duration_ms avg_play_duration_ms avg_video_duration_ms avg_watch_ratio high_watch_count complete_or_rewatch_count total_play_duration_sec avg_play_duration_sec avg_video_duration_sec high_watch_share complete_or_rewatch_share active_day
0 14 2020-07-05 26.0000 26.0000 240,975.0000 9,268.2692 10,187.6538 1.0845 15.0000 12.0000 240.9750 9.2683 10.1877 0.5769 0.4615 1
1 14 2020-07-06 23.0000 23.0000 248,344.0000 10,797.5652 14,615.2174 1.0640 12.0000 10.0000 248.3440 10.7976 14.6152 0.5217 0.4348 1
2 14 2020-07-07 78.0000 78.0000 655,489.0000 8,403.7051 13,529.6410 0.8415 36.0000 27.0000 655.4890 8.4037 13.5296 0.4615 0.3462 1
3 14 2020-07-08 22.0000 22.0000 201,901.0000 9,177.3182 12,657.1818 0.9828 11.0000 9.0000 201.9010 9.1773 12.6572 0.5000 0.4091 1
4 14 2020-07-09 55.0000 55.0000 485,039.0000 8,818.8909 12,841.6727 0.8619 20.0000 15.0000 485.0390 8.8189 12.8417 0.3636 0.2727 1
5 14 2020-07-10 52.0000 52.0000 606,244.0000 11,658.5385 17,735.4038 1.1380 30.0000 21.0000 606.2440 11.6585 17.7354 0.5769 0.4038 1
6 14 2020-07-11 32.0000 32.0000 284,747.0000 8,898.3438 13,858.9688 0.9337 16.0000 10.0000 284.7470 8.8983 13.8590 0.5000 0.3125 1
7 14 2020-07-12 42.0000 42.0000 337,918.0000 8,045.6667 13,215.6429 0.7986 18.0000 11.0000 337.9180 8.0457 13.2156 0.4286 0.2619 1
8 14 2020-07-13 46.0000 46.0000 502,145.0000 10,916.1957 13,244.0000 1.0389 24.0000 19.0000 502.1450 10.9162 13.2440 0.5217 0.4130 1
9 14 2020-07-14 42.0000 42.0000 337,489.0000 8,035.4524 14,600.6190 0.9892 23.0000 16.0000 337.4890 8.0355 14.6006 0.5476 0.3810 1

The dense panel explicitly includes inactive dates, which is necessary for future retention-style outcomes. Without this step, missing user-days would be invisible and the notebook could mistake absence from the log for absence from the analysis.
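
A small sanity check, added here as an extra diagnostic, confirms that densification added exactly the inactive user-days and nothing else.

# The rows added by reindexing should all be zero-activity days.
added_rows = len(user_day) - len(user_day_observed)
inactive_rows = int((user_day["active_day"] == 0).sum())
print(f"Rows added by densification: {added_rows:,}")
print(f"Inactive user-days in the dense panel: {inactive_rows:,}")
assert added_rows == inactive_rows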

Add Lagged State Variables

In sequential causal inference, the user’s past is part of the confounding structure. A user who watched heavily yesterday may be more likely to receive certain recommendations today and more likely to be active tomorrow. The next cell creates lagged and prior-window features that later notebooks can use as time-varying confounders.

These variables are not outcomes. They are the observed state before a daily treatment decision.

lag_columns = [
    "active_day",
    "interactions",
    "total_play_duration_sec",
    "avg_watch_ratio",
    "high_watch_share",
]

for column in lag_columns:
    user_day[f"lag_1_{column}"] = user_day.groupby("user_id")[column].shift(1)
    user_day[f"prior_3day_{column}"] = user_day.groupby("user_id")[column].transform(
        lambda series: series.shift(1).rolling(window=3, min_periods=1).sum()
    )

lagged_columns = [column for column in user_day.columns if column.startswith("lag_1_") or column.startswith("prior_3day_")]
user_day[lagged_columns] = user_day[lagged_columns].fillna(0)

state_preview_columns = [
    "user_id",
    "event_date",
    "active_day",
    "interactions",
    "lag_1_active_day",
    "lag_1_interactions",
    "prior_3day_active_day",
    "prior_3day_interactions",
]

display(user_day[state_preview_columns].head(15))
user_id event_date active_day interactions lag_1_active_day lag_1_interactions prior_3day_active_day prior_3day_interactions
0 14 2020-07-05 1 26.0000 0.0000 0.0000 0.0000 0.0000
1 14 2020-07-06 1 23.0000 1.0000 26.0000 1.0000 26.0000
2 14 2020-07-07 1 78.0000 1.0000 23.0000 2.0000 49.0000
3 14 2020-07-08 1 22.0000 1.0000 78.0000 3.0000 127.0000
4 14 2020-07-09 1 55.0000 1.0000 22.0000 3.0000 123.0000
5 14 2020-07-10 1 52.0000 1.0000 55.0000 3.0000 155.0000
6 14 2020-07-11 1 32.0000 1.0000 52.0000 3.0000 129.0000
7 14 2020-07-12 1 42.0000 1.0000 32.0000 3.0000 139.0000
8 14 2020-07-13 1 46.0000 1.0000 42.0000 3.0000 126.0000
9 14 2020-07-14 1 42.0000 1.0000 46.0000 3.0000 120.0000
10 14 2020-07-15 1 10.0000 1.0000 42.0000 3.0000 130.0000
11 14 2020-07-16 1 87.0000 1.0000 10.0000 3.0000 98.0000
12 14 2020-07-17 1 93.0000 1.0000 87.0000 3.0000 139.0000
13 14 2020-07-18 1 117.0000 1.0000 93.0000 3.0000 190.0000
14 14 2020-07-19 1 42.0000 1.0000 117.0000 3.0000 297.0000

The lagged and prior-window columns capture the user’s pre-treatment state. These variables are central to later causal adjustment because prior engagement can influence both today’s recommendation exposure and tomorrow’s behavior.

Add Forward-Looking Outcomes

The next cell creates future outcome variables. These include next-day activity and multi-day future engagement. We deliberately keep these outcomes separate from treatment definitions so later notebooks can estimate how candidate treatment patterns affect future behavior.

Rows near the end of the sample window do not have enough future days to define all outcomes. Those rows are marked as missing for the affected future horizons rather than silently treating unknown future activity as zero.

future_base_columns = ["active_day", "interactions", "total_play_duration_sec"]
future_horizons = [1, 2, 3, 4, 5, 6, 7]

for column in future_base_columns:
    for horizon in future_horizons:
        user_day[f"lead_{horizon}_{column}"] = user_day.groupby("user_id")[column].shift(-horizon)

user_day["next_day_active"] = user_day["lead_1_active_day"]
user_day["next_day_interactions"] = user_day["lead_1_interactions"]
user_day["next_day_play_duration_sec"] = user_day["lead_1_total_play_duration_sec"]

user_day["future_3day_active_days"] = user_day[[f"lead_{horizon}_active_day" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)
user_day["future_3day_interactions"] = user_day[[f"lead_{horizon}_interactions" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)
user_day["future_3day_play_duration_sec"] = user_day[[f"lead_{horizon}_total_play_duration_sec" for horizon in [1, 2, 3]]].sum(axis=1, min_count=3)

user_day["future_7day_active_days"] = user_day[[f"lead_{horizon}_active_day" for horizon in future_horizons]].sum(axis=1, min_count=7)
user_day["future_7day_interactions"] = user_day[[f"lead_{horizon}_interactions" for horizon in future_horizons]].sum(axis=1, min_count=7)
user_day["future_7day_play_duration_sec"] = user_day[[f"lead_{horizon}_total_play_duration_sec" for horizon in future_horizons]].sum(axis=1, min_count=7)

outcome_preview_columns = [
    "user_id",
    "event_date",
    "active_day",
    "interactions",
    "next_day_active",
    "future_3day_active_days",
    "future_3day_interactions",
    "future_7day_active_days",
]

display(user_day[outcome_preview_columns].head(15))
user_id event_date active_day interactions next_day_active future_3day_active_days future_3day_interactions future_7day_active_days
0 14 2020-07-05 1 26.0000 1.0000 3.0000 123.0000 7.0000
1 14 2020-07-06 1 23.0000 1.0000 3.0000 155.0000 7.0000
2 14 2020-07-07 1 78.0000 1.0000 3.0000 129.0000 7.0000
3 14 2020-07-08 1 22.0000 1.0000 3.0000 139.0000 7.0000
4 14 2020-07-09 1 55.0000 1.0000 3.0000 126.0000 7.0000
5 14 2020-07-10 1 52.0000 1.0000 3.0000 120.0000 7.0000
6 14 2020-07-11 1 32.0000 1.0000 3.0000 130.0000 7.0000
7 14 2020-07-12 1 42.0000 1.0000 3.0000 98.0000 7.0000
8 14 2020-07-13 1 46.0000 1.0000 3.0000 139.0000 7.0000
9 14 2020-07-14 1 42.0000 1.0000 3.0000 190.0000 7.0000
10 14 2020-07-15 1 10.0000 1.0000 3.0000 297.0000 7.0000
11 14 2020-07-16 1 87.0000 1.0000 3.0000 252.0000 7.0000
12 14 2020-07-17 1 93.0000 1.0000 3.0000 264.0000 7.0000
13 14 2020-07-18 1 117.0000 1.0000 3.0000 227.0000 7.0000
14 14 2020-07-19 1 42.0000 1.0000 3.0000 202.0000 7.0000

The lead columns and future-window outcomes give the project its long-term target. Current-day behavior can now be related to next-day, 3-day, and 7-day future engagement without using future information as a confounder.
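
A quick check, added as an extra diagnostic, verifies that the min_count guards leave end-of-window rows missing rather than silently zero-filled.

# The last days of the window cannot have complete 1-day, 3-day, or 7-day futures.
last_week = user_day["event_date"] >= user_day["event_date"].max() - pd.Timedelta(days=6)
future_cols = ["next_day_active", "future_3day_active_days", "future_7day_active_days"]
print(user_day.loc[last_week, future_cols].isna().mean().round(3))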

Candidate Treatment Definitions

For this project, a treatment is not a medical intervention or a one-time binary assignment. It is a daily exposure pattern generated by a recommender system. At this stage, we define two transparent candidate treatments:

  • treatment_high_intensity: the user had an active day with interaction count in the upper quartile of active days.
  • treatment_high_watch_exposure: the user had an active day where the share of high-watch videos was above the median active-day share.

These are deliberately simple first definitions. They give us a concrete way to inspect treatment variation and confounding before more sophisticated modeling.

active_user_days = user_day["active_day"] == 1

high_intensity_threshold = user_day.loc[active_user_days, "interactions"].quantile(0.75)
high_watch_share_threshold = user_day.loc[active_user_days, "high_watch_share"].median()

user_day["treatment_high_intensity"] = (
    active_user_days & (user_day["interactions"] >= high_intensity_threshold)
).astype(int)
user_day["treatment_high_watch_exposure"] = (
    active_user_days & (user_day["high_watch_share"] >= high_watch_share_threshold)
).astype(int)

thresholds = pd.DataFrame(
    [
        {
            "candidate_treatment": "treatment_high_intensity",
            "rule": "active day and interactions >= active-day 75th percentile",
            "threshold": high_intensity_threshold,
        },
        {
            "candidate_treatment": "treatment_high_watch_exposure",
            "rule": "active day and high_watch_share >= active-day median",
            "threshold": high_watch_share_threshold,
        },
    ]
)

treatment_summary = user_day[
    ["treatment_high_intensity", "treatment_high_watch_exposure"]
].agg(["mean", "sum"]).T.rename(columns={"mean": "share_of_user_days", "sum": "treated_user_days"})

display(thresholds)
display(treatment_summary)
candidate_treatment rule threshold
0 treatment_high_intensity active day and interactions >= active-day 75th... 70.0000
1 treatment_high_watch_exposure active day and high_watch_share >= active-day ... 0.4773
share_of_user_days treated_user_days
treatment_high_intensity 0.2501 1,434.0000
treatment_high_watch_exposure 0.4842 2,776.0000

The treatment summary shows whether the first two exposure definitions have usable variation. These definitions are intentionally simple at this stage: they let us test whether the data can support a treated-versus-control comparison before committing to a final estimand.

Plot Daily Treatment and Outcome Rates

This plot checks whether the candidate treatments and next-day activity vary over time. A useful causal study needs variation. If a treatment is always on, always off, or perfectly aligned with a single date, it is hard to estimate a meaningful effect.

daily_panel_summary = (
    user_day.groupby("event_date")
    .agg(
        active_rate=("active_day", "mean"),
        high_intensity_rate=("treatment_high_intensity", "mean"),
        high_watch_exposure_rate=("treatment_high_watch_exposure", "mean"),
        next_day_active_rate=("next_day_active", "mean"),
    )
    .reset_index()
)

plot_daily = daily_panel_summary.melt(
    id_vars="event_date",
    value_vars=["active_rate", "high_intensity_rate", "high_watch_exposure_rate", "next_day_active_rate"],
    var_name="metric",
    value_name="rate",
)

fig, ax = plt.subplots(figsize=(13, 5))
sns.lineplot(data=plot_daily, x="event_date", y="rate", hue="metric", marker="o", linewidth=1.5, ax=ax)
ax.set_title("Daily Treatment and Outcome Rates in the Dense User-Day Panel")
ax.set_xlabel("Date")
ax.set_ylabel("Rate")
ax.yaxis.set_major_formatter(lambda value, _: f"{value:.0%}")
ax.tick_params(axis="x", rotation=35)
plt.tight_layout()
plt.show()

The time-series plot checks whether treatment and outcome rates vary over the calendar window. If treatment prevalence changed sharply by date, later models would need calendar controls; this plot starts that diagnostic conversation.

Naive Associations Are Not Causal Effects

The next table compares future outcomes between treated and untreated active user-days. This is useful as a descriptive diagnostic, but it is not a causal estimate. Treated days are likely different before treatment happens. For example, users with high activity today may already be more engaged, and that prior engagement may predict future activity regardless of today’s exposure.

The purpose of this table is to create intuition and motivate the confounding checks that follow.

def summarize_naive_association(data, treatment_col, outcome_cols):
    rows = []
    analytic = data.loc[data["active_day"].eq(1)].copy()
    for outcome_col in outcome_cols:
        subset = analytic.dropna(subset=[treatment_col, outcome_col])
        grouped = subset.groupby(treatment_col)[outcome_col].agg(["mean", "count", "std"])
        if set(grouped.index) >= {0, 1}:
            control_mean = grouped.loc[0, "mean"]
            treated_mean = grouped.loc[1, "mean"]
            rows.append(
                {
                    "treatment": treatment_col,
                    "outcome": outcome_col,
                    "control_mean": control_mean,
                    "treated_mean": treated_mean,
                    "difference": treated_mean - control_mean,
                    "relative_lift": (treated_mean / control_mean - 1) if control_mean != 0 else np.nan,
                    "control_days": grouped.loc[0, "count"],
                    "treated_days": grouped.loc[1, "count"],
                }
            )
    return pd.DataFrame(rows)

outcome_columns = [
    "next_day_active",
    "future_3day_active_days",
    "future_3day_interactions",
    "future_7day_active_days",
]

naive_associations = pd.concat(
    [
        summarize_naive_association(user_day, "treatment_high_intensity", outcome_columns),
        summarize_naive_association(user_day, "treatment_high_watch_exposure", outcome_columns),
    ],
    ignore_index=True,
)

display(naive_associations)
treatment outcome control_mean treated_mean difference relative_lift control_days treated_days
0 treatment_high_intensity next_day_active 0.9750 0.9868 0.0117 0.0120 4048 1434
1 treatment_high_intensity future_3day_active_days 2.9371 2.9477 0.0106 0.0036 3880 1434
2 treatment_high_intensity future_3day_interactions 138.8616 207.0342 68.1726 0.4909 3880 1434
3 treatment_high_intensity future_7day_active_days 6.8580 6.8562 -0.0017 -0.0003 3528 1433
4 treatment_high_watch_exposure next_day_active 0.9835 0.9728 -0.0107 -0.0109 2728 2754
5 treatment_high_watch_exposure future_3day_active_days 2.9479 2.9321 -0.0158 -0.0054 2649 2665
6 treatment_high_watch_exposure future_3day_interactions 157.7493 156.7700 -0.9794 -0.0062 2649 2665
7 treatment_high_watch_exposure future_7day_active_days 6.8719 6.8430 -0.0289 -0.0042 2483 2478

The naive comparisons describe how future outcomes differ between treated and untreated days, but they should not be read as causal effects. Their real value is motivational: if there are visible differences, the next question is whether those differences remain after accounting for prior user state.

Check Time-Varying Confounding

A major reason this project needs causal methods is that prior user state can affect both today’s treatment and future outcomes. The next cell computes standardized mean differences for pre-treatment covariates between treated and untreated active days. Large imbalances mean that a naive treated-versus-control comparison is likely confounded.

A standardized mean difference is the treated-control difference in means divided by a pooled standard deviation. Values far from zero indicate imbalance in the pre-treatment state.
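
Written out, the quantity computed below for each covariate $x$ is

$$\mathrm{SMD}(x) = \frac{\bar{x}_{\text{treated}} - \bar{x}_{\text{control}}}{\sqrt{\left(s^{2}_{\text{treated}} + s^{2}_{\text{control}}\right)/2}}$$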

def standardized_mean_differences(data, treatment_col, covariate_cols):
    analytic = data.loc[data["active_day"].eq(1)].dropna(subset=[treatment_col]).copy()
    rows = []
    for covariate in covariate_cols:
        treated = analytic.loc[analytic[treatment_col].eq(1), covariate].dropna()
        control = analytic.loc[analytic[treatment_col].eq(0), covariate].dropna()
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        rows.append(
            {
                "treatment": treatment_col,
                "covariate": covariate,
                "treated_mean": treated.mean(),
                "control_mean": control.mean(),
                "smd": (treated.mean() - control.mean()) / pooled_sd if pooled_sd and not np.isnan(pooled_sd) else np.nan,
            }
        )
    return pd.DataFrame(rows)

pre_treatment_covariates = [
    "lag_1_active_day",
    "lag_1_interactions",
    "lag_1_total_play_duration_sec",
    "lag_1_avg_watch_ratio",
    "prior_3day_active_day",
    "prior_3day_interactions",
    "prior_3day_total_play_duration_sec",
    "prior_3day_high_watch_share",
]

confounding_balance = pd.concat(
    [
        standardized_mean_differences(user_day, "treatment_high_intensity", pre_treatment_covariates),
        standardized_mean_differences(user_day, "treatment_high_watch_exposure", pre_treatment_covariates),
    ],
    ignore_index=True,
)

display(confounding_balance.sort_values(["treatment", "smd"], key=lambda col: col.abs() if col.name == "smd" else col))
treatment covariate treated_mean control_mean smd
3 treatment_high_intensity lag_1_avg_watch_ratio 0.9103 0.9208 -0.0193
0 treatment_high_intensity lag_1_active_day 0.9770 0.9654 0.0694
7 treatment_high_intensity prior_3day_high_watch_share 1.4207 1.3741 0.1029
4 treatment_high_intensity prior_3day_active_day 2.9303 2.8279 0.2198
2 treatment_high_intensity lag_1_total_play_duration_sec 617.3946 388.2157 0.7200
6 treatment_high_intensity prior_3day_total_play_duration_sec 1,794.1988 1,177.3717 0.8370
1 treatment_high_intensity lag_1_interactions 71.7364 44.2764 0.8406
5 treatment_high_intensity prior_3day_interactions 207.7497 134.4448 1.0026
8 treatment_high_watch_exposure lag_1_active_day 0.9672 0.9696 -0.0135
12 treatment_high_watch_exposure prior_3day_active_day 2.8483 2.8606 -0.0232
9 treatment_high_watch_exposure lag_1_interactions 50.8336 51.9457 -0.0332
13 treatment_high_watch_exposure prior_3day_interactions 151.6120 155.2572 -0.0452
10 treatment_high_watch_exposure lag_1_total_play_duration_sec 479.3493 415.6172 0.2007
14 treatment_high_watch_exposure prior_3day_total_play_duration_sec 1,424.6386 1,249.1272 0.2284
11 treatment_high_watch_exposure lag_1_avg_watch_ratio 1.0097 0.8259 0.3077
15 treatment_high_watch_exposure prior_3day_high_watch_share 1.5894 1.1817 0.9546

The standardized mean differences show whether treated and untreated days already looked different before treatment. Any meaningful imbalance here supports the central claim of the project: sequential recommender logs need causal adjustment, not just outcome comparison.
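One practical way to read the balance table is a rule-of-thumb screen: covariates with an absolute standardized mean difference above roughly 0.1 are conventionally treated as imbalanced. The short sketch below applies that screen to the table computed above; the 0.1 cutoff is a common convention, not a strict requirement.

# Flag covariates whose absolute SMD exceeds a conventional 0.1 threshold.
imbalance_flags = confounding_balance.assign(abs_smd=confounding_balance["smd"].abs())
imbalanced = imbalance_flags.loc[imbalance_flags["abs_smd"] > 0.1]

display(
    imbalanced.sort_values(["treatment", "abs_smd"], ascending=[True, False])[
        ["treatment", "covariate", "smd"]
    ]
)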

Visualize Covariate Imbalance

The heatmap below gives a quick visual read of confounding. Darker positive or negative values mean treated and untreated active days had different histories before the treatment day. This is exactly the kind of structure that motivates marginal structural models and g-computation in later notebooks.

balance_heatmap = confounding_balance.pivot(index="covariate", columns="treatment", values="smd")

fig, ax = plt.subplots(figsize=(9, 5.5))
sns.heatmap(
    balance_heatmap,
    annot=True,
    fmt=".2f",
    cmap="vlag",
    center=0,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("Standardized Mean Differences in Pre-Treatment State")
ax.set_xlabel("Candidate treatment")
ax.set_ylabel("Pre-treatment covariate")
plt.tight_layout()
plt.show()

The heatmap turns the balance table into a quick diagnostic. Strong colors indicate pre-treatment differences that later notebooks must address with propensity models, marginal structural models, or g-computation.
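To make the connection to later notebooks concrete, the sketch below fits a simple logistic propensity model for one candidate treatment on the pre-treatment covariates and converts the predicted probabilities into stabilized inverse probability weights. This is only a single-time-point preview under simplified assumptions (scikit-learn available, no sequential structure); the marginal structural model notebooks will handle the sequential version properly.

# Sketch of a single-time-point propensity model and stabilized IP weights.
# Assumes scikit-learn is available; later notebooks replace this with a sequential model.
from sklearn.linear_model import LogisticRegression

propensity_sample = (
    user_day.loc[user_day["active_day"].eq(1)]
    .dropna(subset=["treatment_high_intensity"] + pre_treatment_covariates)
    .copy()
)

X = propensity_sample[pre_treatment_covariates]
a = propensity_sample["treatment_high_intensity"].astype(int)

propensity_model = LogisticRegression(max_iter=1000)
propensity_model.fit(X, a)
propensity = propensity_model.predict_proba(X)[:, 1]

# Stabilized weights: marginal treatment probability over the individual propensity.
marginal_rate = a.mean()
stabilized_weights = np.where(
    a.eq(1), marginal_rate / propensity, (1 - marginal_rate) / (1 - propensity)
)

print("mean stabilized weight:", round(stabilized_weights.mean(), 3))
print("max stabilized weight: ", round(stabilized_weights.max(), 3))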

Join Metadata for Diagnostic Checks

User metadata can help explain why treatment assignment differs across users. The next cell joins the sampled interaction log to the user activity-segment and item-category features. This is not the final feature-engineering step; it is a sanity check that the metadata keys connect cleanly to the interaction log.

user_metadata_columns = [
    "user_id",
    "user_active_degree",
    "is_lowactive_period",
    "is_live_streamer",
    "is_video_author",
    "follow_user_num",
    "fans_user_num",
    "friend_user_num",
    "register_days",
]

interaction_metadata_sample = (
    interactions.merge(user_features[user_metadata_columns], on="user_id", how="left")
    .merge(item_categories[["video_id", "category_count", "first_category"]], on="video_id", how="left")
)

metadata_diagnostics = pd.DataFrame(
    [
        {"metric": "rows", "value": len(interaction_metadata_sample)},
        {"metric": "missing_user_active_degree_rate", "value": interaction_metadata_sample["user_active_degree"].isna().mean()},
        {"metric": "missing_item_category_rate", "value": interaction_metadata_sample["category_count"].isna().mean()},
        {"metric": "unique_user_activity_segments", "value": interaction_metadata_sample["user_active_degree"].nunique(dropna=True)},
        {"metric": "unique_first_categories", "value": interaction_metadata_sample["first_category"].nunique(dropna=True)},
    ]
)

display(metadata_diagnostics)

display(
    interaction_metadata_sample.groupby("user_active_degree")
    .agg(
        rows=("video_id", "size"),
        users=("user_id", "nunique"),
        avg_watch_ratio=("watch_ratio", "mean"),
        avg_play_duration_sec=("play_duration_sec", "mean"),
    )
    .sort_values("rows", ascending=False)
)
metric value
0 rows 300,000.0000
1 missing_user_active_degree_rate 0.0000
2 missing_item_category_rate 0.0000
3 unique_user_activity_segments 3.0000
4 unique_first_categories 30.0000
rows users avg_watch_ratio avg_play_duration_sec
user_active_degree
full_active 230406 70 0.9199 8.7178
high_active 62965 19 0.9250 8.7592
UNKNOWN 6629 2 0.8329 7.9533

The join diagnostics confirm that user metadata and item categories connect to the sampled interaction log. This means later models can enrich the history-only panel with baseline user context or item context if the causal design needs stronger adjustment.
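For example, a later notebook could attach time-invariant user context to the panel with a single merge. The sketch below pulls two baseline fields onto the user-day panel and verifies that no rows are lost; the specific columns chosen here are illustrative.

# Illustrative merge of baseline user context onto the user-day panel.
user_day_with_baseline = user_day.merge(
    user_features[["user_id", "user_active_degree", "register_days"]],
    on="user_id",
    how="left",
)

# The merge should preserve the panel shape and introduce no missing baseline values.
print("rows before / after merge:", len(user_day), "/", len(user_day_with_baseline))
print("missing baseline rate:", user_day_with_baseline["user_active_degree"].isna().mean())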

Inspect Item Daily Popularity Features

Item daily features can be useful later because recommendation exposure is often influenced by item popularity. A popular item may be more likely to be shown, and popular items may also generate different future engagement. The next cell summarizes the sampled item daily table and checks the distribution of simple popularity rates.

item_daily_summary = item_daily[
    [
        "show_cnt",
        "show_user_num",
        "play_cnt",
        "play_user_num",
        "complete_play_cnt",
        "like_cnt",
        "comment_cnt",
        "share_cnt",
        "collect_cnt",
        "item_play_rate",
        "item_like_rate_per_play",
    ]
].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).T

display(item_daily_summary)

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
sns.histplot(data=item_daily, x="show_cnt", bins=60, ax=axes[0], color="#2A6F97")
axes[0].set_title("Daily Item Show Counts")
axes[0].set_xlabel("Show count")
axes[0].set_yscale("log")

sns.histplot(data=item_daily, x="item_play_rate", bins=60, ax=axes[1], color="#5C946E")
axes[1].set_title("Daily Item Play Rate")
axes[1].set_xlabel("Play count / show count")

plt.tight_layout()
plt.show()
count mean std min 5% 25% 50% 75% 95% max
show_cnt 150,000.0000 99,455.6275 565,375.6798 0.0000 3.0000 74.0000 1,762.0000 28,220.0000 415,838.5500 36,053,957.0000
show_user_num 150,000.0000 93,436.8595 542,642.7612 0.0000 3.0000 63.0000 1,474.0000 25,346.0000 387,015.8000 34,875,019.0000
play_cnt 150,000.0000 99,323.0334 584,500.5090 0.0000 0.0000 19.0000 892.0000 24,438.0000 413,801.5000 37,966,319.0000
play_user_num 150,000.0000 89,345.7759 534,001.9580 0.0000 0.0000 16.0000 736.0000 21,404.2500 369,398.3500 34,501,624.0000
complete_play_cnt 150,000.0000 54,362.3796 354,878.3167 0.0000 0.0000 5.0000 306.0000 10,966.2500 216,402.5500 23,892,867.0000
like_cnt 150,000.0000 2,972.5705 18,322.5665 0.0000 0.0000 0.0000 13.0000 521.2500 12,800.1000 1,286,185.0000
comment_cnt 150,000.0000 130.1410 1,613.5446 0.0000 0.0000 0.0000 0.0000 11.0000 363.0000 182,959.0000
share_cnt 150,000.0000 72.2995 1,050.6124 0.0000 0.0000 0.0000 0.0000 4.0000 157.0000 172,181.0000
collect_cnt 116,617.0000 14.5537 187.7763 0.0000 0.0000 0.0000 0.0000 1.0000 39.0000 41,399.0000
item_play_rate 149,659.0000 0.6062 0.4483 0.0000 0.0000 0.2485 0.6210 0.9876 1.0405 38.0000
item_like_rate_per_play 141,044.0000 0.0278 0.0529 0.0000 0.0000 0.0000 0.0138 0.0329 0.0997 2.0000

The item-daily summaries show that item popularity is highly uneven. That unevenness matters because popular videos may be more likely to be recommended and may also produce different engagement patterns, making popularity a potential confounder or stratification variable.
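One simple way to see this in the data already loaded is to bucket item-days into show-count quantiles and compare play rates across buckets; the sketch below does this on the sampled item-daily table. The quintile split is an arbitrary illustration, not a modeling decision.

# Bucket item-days into show-count quintiles and compare play rates across buckets.
popularity_buckets = item_daily.dropna(subset=["show_cnt", "item_play_rate"]).copy()
popularity_buckets["show_cnt_quintile"] = pd.qcut(
    popularity_buckets["show_cnt"], q=5, duplicates="drop"
)

display(
    popularity_buckets.groupby("show_cnt_quintile", observed=True)["item_play_rate"]
    .agg(["mean", "median", "count"])
)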

Causal Readiness Checklist

This checklist summarizes whether the current sample is adequate for the next stage of the project. The checks are intentionally practical: do we have repeated users, a time span, treatment variation, future outcome availability, and pre-treatment state variables? Passing these checks does not prove causal identification, but failing them would tell us to redesign the project before modeling.

active_days_per_user = user_day.groupby("user_id")["active_day"].sum()
future_3day_available = user_day["future_3day_active_days"].notna().mean()
future_7day_available = user_day["future_7day_active_days"].notna().mean()

treatment_rates = user_day[["treatment_high_intensity", "treatment_high_watch_exposure"]].mean()

readiness_checks = pd.DataFrame(
    [
        {
            "check": "at least 50 users in sample",
            "value": user_day["user_id"].nunique(),
            "passes": user_day["user_id"].nunique() >= 50,
        },
        {
            "check": "median user has at least 7 active days",
            "value": active_days_per_user.median(),
            "passes": active_days_per_user.median() >= 7,
        },
        {
            "check": "at least 30 calendar dates",
            "value": user_day["event_date"].nunique(),
            "passes": user_day["event_date"].nunique() >= 30,
        },
        {
            "check": "high-intensity treatment has non-trivial variation",
            "value": treatment_rates["treatment_high_intensity"],
            "passes": 0.01 < treatment_rates["treatment_high_intensity"] < 0.99,
        },
        {
            "check": "high-watch treatment has non-trivial variation",
            "value": treatment_rates["treatment_high_watch_exposure"],
            "passes": 0.01 < treatment_rates["treatment_high_watch_exposure"] < 0.99,
        },
        {
            "check": "future 3-day outcomes available for most rows",
            "value": future_3day_available,
            "passes": future_3day_available >= 0.80,
        },
        {
            "check": "future 7-day outcomes available for most rows",
            "value": future_7day_available,
            "passes": future_7day_available >= 0.70,
        },
        {
            "check": "lagged confounders have no missing values after filling panel starts",
            "value": user_day[pre_treatment_covariates].isna().mean().max(),
            "passes": user_day[pre_treatment_covariates].isna().mean().max() == 0,
        },
    ]
)

display(readiness_checks)
check value passes
0 at least 50 users in sample 91.0000 True
1 median user has at least 7 active days 62.0000 True
2 at least 30 calendar dates 63.0000 True
3 high-intensity treatment has non-trivial varia... 0.2501 True
4 high-watch treatment has non-trivial variation 0.4842 True
5 future 3-day outcomes available for most rows 0.9524 True
6 future 7-day outcomes available for most rows 0.8889 True
7 lagged confounders have no missing values afte... 0.0000 True

The readiness checklist condenses the EDA into go/no-go criteria for causal modeling. Passing these checks does not prove identification, but it shows that the sample has enough users, time, treatment variation, and future outcomes to justify moving to formal estimand definition.

Save Reusable Processed Files

The next cell saves two files for later notebooks:

  • kuairec_small_interactions_sample.parquet: cleaned interaction-level sample.
  • kuairec_user_day_panel_sample.parquet: dense user-day panel with lagged state, candidate treatments, and future outcomes.

Saving these files prevents later notebooks from repeatedly opening the nested zip archive. It also makes the project sequence cleaner: Notebook 01 handles data understanding, while later notebooks handle causal estimation.

interaction_output = PROCESSED_DIR / "kuairec_small_interactions_sample.parquet"
user_day_output = PROCESSED_DIR / "kuairec_user_day_panel_sample.parquet"
readiness_output = PROCESSED_DIR / "kuairec_sequence_eda_readiness.csv"

interactions.to_parquet(interaction_output, index=False)
user_day.to_parquet(user_day_output, index=False)
readiness_checks.to_csv(readiness_output, index=False)

print("Saved processed files:")
print(f"- {interaction_output}")
print(f"- {user_day_output}")
print(f"- {readiness_output}")
Saved processed files:
- /home/apex/Documents/ranking_sys/data/processed/kuairec_small_interactions_sample.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_user_day_panel_sample.parquet
- /home/apex/Documents/ranking_sys/data/processed/kuairec_sequence_eda_readiness.csv

The saved files are the handoff from raw EDA to causal design. Future notebooks can load the cleaned interaction sample and dense user-day panel directly, which keeps the project modular and avoids repeating expensive archive parsing.
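As a minimal usage example, a later notebook only needs the two parquet reads below, assuming it defines the same PROCESSED_DIR.

# Sketch of how a later notebook would reload the processed artifacts.
interactions_reloaded = pd.read_parquet(PROCESSED_DIR / "kuairec_small_interactions_sample.parquet")
user_day_reloaded = pd.read_parquet(PROCESSED_DIR / "kuairec_user_day_panel_sample.parquet")

print(interactions_reloaded.shape, user_day_reloaded.shape)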

Takeaways and Next Step

This notebook turns KuaiRec from a raw interaction log into a sequential causal analysis dataset. The key object is the dense user-day panel. It contains the current day’s activity, prior user state, candidate treatment definitions, and future outcomes.

The descriptive comparisons in this notebook should not be interpreted causally. They are meant to show why causal methods are needed. The confounding checks make the main issue visible: treatment days and non-treatment days differ in their pre-treatment histories. The next notebook should formalize the long-term outcome definitions and decide which treatment-outcome pair will be the primary estimand for the project.