This notebook starts the discovery-quality mediation workflow using KuaiRec.
The causal problem is that short-term interaction metrics do not necessarily measure durable user value. A recommendation exposure can create immediate engagement but still fail to improve satisfaction or future activity. For discovery systems, this distinction matters because a metric such as CTR or interaction volume can reward curiosity, novelty, or low-friction behavior without proving that users were satisfied.
This first notebook does not estimate mediation effects yet. It prepares the data and clarifies the measurement problem:
What signals exist in KuaiRec?
Which columns can represent exposure, immediate engagement, satisfaction, and future value?
How can we define a discovery-oriented exposure using long-tail or new-category content?
Is there enough variation to support a later mediation analysis?
Dataset Field Guide
KuaiRec is stored locally as a zip file that contains another KuaiRec.zip archive. The main files used in this workflow are below.
small_matrix.csv and big_matrix.csv
These are user-video interaction matrices. small_matrix.csv is used here because it is large enough for realistic EDA but small enough to process quickly.
user_id: anonymized user identifier.
video_id: anonymized video identifier.
play_duration: watched time in milliseconds.
video_duration: video length in milliseconds.
time: interaction timestamp as a readable datetime string.
date: calendar date in YYYYMMDD form.
timestamp: Unix timestamp for the interaction.
watch_ratio: play_duration / video_duration. Values can exceed 1 when a user rewatches, loops, or spends longer than the nominal video duration.
user_features.csv
This file contains anonymized user profile and activity features.
user_id: anonymized user identifier.
user_active_degree: categorical user activity level.
is_lowactive_period: whether the user is in a low-activity period.
is_live_streamer: whether the user is marked as a live streamer.
is_video_author: whether the user has authored videos.
follow_user_num: count of followed users.
follow_user_num_range: binned follow count.
fans_user_num: count of fans.
fans_user_num_range: binned fan count.
friend_user_num: count of friends.
friend_user_num_range: binned friend count.
register_days: days since registration.
register_days_range: binned registration age.
onehot_feat0 through onehot_feat17: anonymized categorical or profile features encoded as integers. The raw feature meanings are not provided, so they should be used as controls rather than interpreted directly.
item_categories.csv
This file maps videos to category feature IDs.
video_id: anonymized video identifier.
feat: list-like category feature IDs for the video.
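Because feat arrives as a list-like string, it usually has to be parsed before categories can be counted or joined. The sketch below is one way to do that on synthetic rows; the string format shown and the helper columns primary_category and n_categories are illustrative assumptions, not columns of the dataset.

```python
import ast

import pandas as pd

# Synthetic stand-in for item_categories.csv (values are made up).
item_categories = pd.DataFrame(
    {"video_id": [0, 1, 2], "feat": ["[8]", "[27, 9]", "[9]"]}
)

# Parse the list-like string into a real Python list of category IDs.
item_categories["feat"] = item_categories["feat"].apply(ast.literal_eval)

# Hypothetical helper columns: first listed category and number of categories.
item_categories["primary_category"] = item_categories["feat"].apply(lambda ids: ids[0])
item_categories["n_categories"] = item_categories["feat"].apply(len)

print(item_categories)
```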
kuairec_caption_category.csv
This file provides richer video metadata.
video_id: anonymized video identifier.
manual_cover_text: text shown on the video cover when available.
For this project, the first version of the mediation setup uses:
Treatment candidate: high discovery exposure on a user-day, measured by a high share of long-tail or new-category videos.
Immediate engagement mediator: valid play share or interaction intensity on the same day.
Satisfaction mediator: average watch ratio, high-watch share, or completion/rewatch share.
Future outcome: future 7-day interactions, active days, and play time.
The later mediation notebooks will formalize assumptions and estimate direct, indirect, and total effects.
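As orientation only, one common way to write the decomposition uses natural direct and indirect effects. The notation is an assumption of this sketch (the later notebooks may formalize it differently): T is the binary high-discovery treatment, M(t) is the mediator value under treatment level t, and Y(t, m) is the future outcome under treatment t and mediator value m.

```latex
\begin{aligned}
\mathrm{TE}  &= \mathbb{E}\big[\,Y(1, M(1)) - Y(0, M(0))\,\big] \\
\mathrm{NDE} &= \mathbb{E}\big[\,Y(1, M(0)) - Y(0, M(0))\,\big] \\
\mathrm{NIE} &= \mathbb{E}\big[\,Y(1, M(1)) - Y(1, M(0))\,\big] \\
\mathrm{TE}  &= \mathrm{NDE} + \mathrm{NIE}
\end{aligned}
```

Under this reading, the indirect effect is the part of the exposure effect that runs through same-day engagement or satisfaction, and the direct effect is everything else.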
1. Environment and Paths
This cell imports the libraries used for the setup notebook. It also finds the repository root by searching for the local KuaiRec zip file, then creates project-specific writeup folders for figures and tables.
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_DIR = next(
    root for root in candidate_roots
    if (root / "data" / "Kuairec" / "18164998.zip").exists()
)
DATA_DIR = PROJECT_DIR / "data"
RAW_ZIP = DATA_DIR / "Kuairec" / "18164998.zip"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

NOTEBOOK_DIR = PROJECT_DIR / "notebooks" / "discovery_quality_mediation"
WRITEUP_DIR = NOTEBOOK_DIR / "writeup"
FIGURE_DIR = WRITEUP_DIR / "figures"
TABLE_DIR = WRITEUP_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

RAW_ZIP.exists(), RAW_ZIP
The path check should return True. The notebook will read from the local KuaiRec archive and save processed discovery-quality artifacts under data/processed.
2. Inspect the KuaiRec Archive
KuaiRec is distributed as an outer zip that contains a nested KuaiRec.zip. This cell lists the outer files and the nested files so the notebook documents exactly which local data assets are being used.
with ZipFile(RAW_ZIP) as outer_zip:
    outer_files = pd.DataFrame(
        [
            {
                "file": info.filename,
                "uncompressed_mb": info.file_size / 1_000_000,
                "compressed_mb": info.compress_size / 1_000_000,
            }
            for info in outer_zip.infolist()
            if not info.is_dir()
        ]
    )
    nested_bytes = outer_zip.read("KuaiRec.zip")

with ZipFile(BytesIO(nested_bytes)) as inner_zip:
    nested_files = pd.DataFrame(
        [
            {
                "file": info.filename,
                "uncompressed_mb": info.file_size / 1_000_000,
                "compressed_mb": info.compress_size / 1_000_000,
            }
            for info in inner_zip.infolist()
            if not info.is_dir()
        ]
    ).sort_values("uncompressed_mb", ascending=False)

display(outer_files)
display(nested_files)
   file                            uncompressed_mb  compressed_mb
0  KuaiRec.zip                            431.9649       431.9649
1  kuairec_caption_category.csv             1.9646         1.9646
2  video_raw_categories_multi.csv           1.7245         1.7245
3  user_features_raw.csv                    1.5416         1.5416

    file                                           uncompressed_mb  compressed_mb
2   KuaiRec 2.0/data/big_matrix.csv                     1,083.5212       292.5965
6   KuaiRec 2.0/data/small_matrix.csv                     406.1558       119.1462
4   KuaiRec 2.0/data/item_daily_features.csv               85.8552        19.0098
5   KuaiRec 2.0/data/kuairec_caption_category.csv           1.9646         0.6010
8   KuaiRec 2.0/data/user_features.csv                      0.7442         0.1223
1   KuaiRec 2.0/Statistics_KuaiRec.ipynb                    0.3142         0.1893
9   KuaiRec 2.0/figs/KuaiRec.png                            0.3002         0.2551
3   KuaiRec 2.0/data/item_categories.csv                    0.1131         0.0315
0   KuaiRec 2.0/LICENSE                                     0.0201         0.0060
7   KuaiRec 2.0/data/social_network.csv                     0.0069         0.0030
10  KuaiRec 2.0/figs/colab-badge.svg                        0.0024         0.0011
11  KuaiRec 2.0/loaddata.py                                 0.0012         0.0004
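The archive inspection shows why small_matrix.csv is the right starting point. At about 406 MB uncompressed it is a substantial interaction table, while big_matrix.csv is more than twice that size at about 1,084 MB. The metadata tables are far smaller (the largest, item_daily_features.csv, is about 86 MB), so they can be loaded fully or with selected columns.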
3. Load Metadata Tables
This cell loads user features, video category metadata, and selected daily item aggregate columns. These tables provide context for discovery-quality analysis: user controls, video categories, and item popularity signals.
The metadata tables give the controls and content context needed for mediation. User features help describe who is active. Category metadata lets us define new-category or diverse-discovery exposure. Item daily aggregates help define long-tail content more carefully than row counts alone.
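A minimal sketch of reading one CSV from a zip nested inside another zip, which is the layout described above. The helper read_nested_csv and the tiny in-memory archive are assumptions made so the example runs without the real data; the notebook's own loading code may differ.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd


def read_nested_csv(outer_zip_bytes, inner_zip_name, member, **read_csv_kwargs):
    """Read one CSV member from a zip archive stored inside another zip (hypothetical helper)."""
    with ZipFile(BytesIO(outer_zip_bytes)) as outer_zip:
        inner_bytes = outer_zip.read(inner_zip_name)
    with ZipFile(BytesIO(inner_bytes)) as inner_zip:
        with inner_zip.open(member) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


# Build a tiny stand-in archive in memory (synthetic rows, not KuaiRec data).
inner_buffer = BytesIO()
with ZipFile(inner_buffer, "w") as inner_zip:
    inner_zip.writestr(
        "KuaiRec 2.0/data/user_features.csv",
        "user_id,user_active_degree,register_days\n1,full_active,224\n2,high_active,90\n",
    )
outer_buffer = BytesIO()
with ZipFile(outer_buffer, "w") as outer_zip:
    outer_zip.writestr("KuaiRec.zip", inner_buffer.getvalue())

# usecols keeps only the columns needed, which matters for the larger tables.
demo_user_features = read_nested_csv(
    outer_buffer.getvalue(),
    "KuaiRec.zip",
    "KuaiRec 2.0/data/user_features.csv",
    usecols=["user_id", "user_active_degree"],
)
print(demo_user_features)
```

With the real archive, the first argument would be the bytes of RAW_ZIP (for example RAW_ZIP.read_bytes()).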
4. Build a Deterministic Interaction Sample
This cell scans small_matrix.csv in chunks and keeps complete interaction histories for users whose user_id is divisible by a fixed modulus. Sampling complete users is better than sampling random rows because later mediation analysis needs user-day histories and future outcomes.
The scan also records full-file row counts, unique users, unique videos, and timestamp coverage.
The sample keeps a meaningful number of users and interactions while remaining fast to work with. Because we sampled users rather than isolated rows, each sampled user can contribute a coherent daily sequence for future outcome construction.
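The chunked, user-level sampling described above can be sketched as follows. The modulus value, the chunk size, and the synthetic CSV are assumptions for illustration; the notebook does not state the modulus it uses.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Synthetic stand-in for small_matrix.csv: 40 users with 5 interactions each.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "user_id": np.repeat(np.arange(40), 5),
        "video_id": rng.integers(0, 100, size=200),
        "timestamp": 1_593_900_000 + np.arange(200) * 60.0,
    }
)
csv_text = demo.to_csv(index=False)

USER_MODULUS = 10  # hypothetical value; the notebook's modulus is not stated

kept_chunks = []
total_rows = 0
all_users, all_videos = set(), set()
min_ts, max_ts = np.inf, -np.inf

for chunk in pd.read_csv(StringIO(csv_text), chunksize=64):
    # Full-file statistics are accumulated before any filtering.
    total_rows += len(chunk)
    all_users.update(chunk["user_id"].unique())
    all_videos.update(chunk["video_id"].unique())
    min_ts = min(min_ts, chunk["timestamp"].min())
    max_ts = max(max_ts, chunk["timestamp"].max())
    # Keep every row of the selected users so their histories stay complete.
    kept_chunks.append(chunk[chunk["user_id"] % USER_MODULUS == 0])

interaction_sample = pd.concat(kept_chunks, ignore_index=True)
print(total_rows, len(all_users), interaction_sample["user_id"].nunique())
```

Filtering on user_id rather than on row position is what makes the sample deterministic and keeps each sampled user's full history, regardless of how rows fall across chunk boundaries.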
5. Clean Interaction Time and Watch Signals
This cell turns raw interaction fields into analysis-ready variables. Watch ratio is capped for plotting and robust summaries, while the uncapped version is preserved. The notebook also creates first-pass proxies for immediate engagement and satisfaction.
The proxy variables are not final causal estimands yet. They are measurement candidates. The later metric-construction notebook can decide which satisfaction proxy is most defensible, but this first notebook makes the available signals explicit.
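A minimal sketch of the cleaning step on synthetic rows. Only the 0.8 watch-ratio cutoff for high satisfaction and the caps at 2 and 5 appear elsewhere in this notebook; the thresholds for valid_play, complete_or_rewatch, and short_abandon below are assumptions chosen for illustration, and timezone handling is omitted.

```python
import numpy as np
import pandas as pd

# Synthetic interactions; durations are in milliseconds as in the field guide.
interactions = pd.DataFrame(
    {
        "play_duration": [500, 6_000, 9_600, 12_000, 60_000],
        "video_duration": [12_000, 12_000, 12_000, 12_000, 12_000],
        "timestamp": [1_593_907_200.0, 1_593_910_800.0, np.nan, 1_593_993_600.0, 1_593_997_200.0],
    }
)

# Time fields: missing timestamps become NaT instead of raising.
interactions["event_timestamp"] = pd.to_datetime(interactions["timestamp"], unit="s")
interactions["event_date"] = interactions["event_timestamp"].dt.floor("D")

# Watch signals: keep the uncapped ratio and add capped versions.
interactions["play_duration_sec"] = interactions["play_duration"] / 1000
interactions["watch_ratio"] = interactions["play_duration"] / interactions["video_duration"]
interactions["watch_ratio_capped_2"] = interactions["watch_ratio"].clip(upper=2)
interactions["watch_ratio_capped_5"] = interactions["watch_ratio"].clip(upper=5)

# First-pass binary proxies.
interactions["valid_play"] = (interactions["play_duration_sec"] >= 1).astype("int8")  # assumed cutoff
interactions["high_satisfaction"] = (interactions["watch_ratio"] >= 0.8).astype("int8")  # cutoff from this notebook
interactions["complete_or_rewatch"] = (interactions["watch_ratio"] >= 1.0).astype("int8")  # assumed cutoff
interactions["short_abandon"] = (interactions["watch_ratio"] < 0.1).astype("int8")  # assumed cutoff

print(interactions.filter(like="watch_ratio"))
```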
6. Basic Interaction EDA
This cell summarizes the interaction sample. It focuses on watch time, video duration, watch ratio, and the proxy outcomes that could become mediators.
time 0.0400
date 0.0400
event_timestamp 0.0400
event_date 0.0400
event_time 0.0400
timestamp 0.0400
Name: missing_rate, dtype: float64
The distribution table shows why watch ratio is useful but needs care. It can exceed 1 because of rewatches or loops, and extreme values can dominate averages. Capped versions are useful for plotting and robust summaries, while uncapped values remain available for diagnostics.
7. Plot Watch Ratio and Satisfaction Proxies
This cell visualizes the main same-day engagement and satisfaction signals. The watch-ratio plot is capped at 5 for readability; the proxy-rate plot summarizes binary measurement candidates.
The plots show that satisfaction is richer than simple play occurrence. Most sampled rows are plays, so the more useful mediators are quality measures such as valid play, completion, high watch ratio, and abandonment.
8. Create Item Discovery Features
This cell builds item-level features from the interaction sample and metadata. The important feature is long_tail_item, a proxy for discovery-oriented content. Because KuaiRec’s small matrix can be close to dense across sampled users, sample interaction counts alone are not a good popularity measure here. We therefore define long-tail status from platform-level daily exposure counts (avg_show_cnt, with avg_play_cnt as a fallback), which better represents whether an item is broadly popular in the product environment.
The long-tail definition uses platform exposure rather than interaction counts from the dense sampled matrix. This matters because a causal setup needs treatment variation. If every active day had the same discovery exposure value, later mediation notebooks would have nothing meaningful to compare.
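One way to express the long-tail rule is sketched below on synthetic daily item rows. The raw column names show_cnt and play_cnt, the intermediate column platform_popularity, and the quantile cutoff are assumptions of this sketch; the notebook only states that avg_show_cnt is used with avg_play_cnt as a fallback.

```python
import numpy as np
import pandas as pd

# Synthetic daily item aggregates; video 4 has no exposure counts at all.
item_daily = pd.DataFrame(
    {
        "video_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "show_cnt": [1000, 3000, 40, 60, 500, 700, np.nan, np.nan],
        "play_cnt": [800, 2000, 30, 50, 300, 500, 10, 30],
    }
)

item_popularity = (
    item_daily.groupby("video_id")
    .agg(avg_show_cnt=("show_cnt", "mean"), avg_play_cnt=("play_cnt", "mean"))
    .reset_index()
)

# Prefer average exposure; fall back to average plays where exposure is missing.
item_popularity["platform_popularity"] = item_popularity["avg_show_cnt"].fillna(
    item_popularity["avg_play_cnt"]
)

LONG_TAIL_QUANTILE = 0.5  # hypothetical cutoff, not taken from the notebook
cutoff = item_popularity["platform_popularity"].quantile(LONG_TAIL_QUANTILE)
item_popularity["long_tail_item"] = (
    item_popularity["platform_popularity"] <= cutoff
).astype("int8")

print(item_popularity)
```

The cutoff is the main design choice here: a lower quantile gives a stricter long tail and less treatment variation, so it is worth checking the resulting share of long-tail items before fixing it.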
9. Enrich Interactions with Discovery Context
This cell joins item features back to each interaction and creates row-level discovery indicators. A row is a discovery candidate if the video is long-tail or if the video category is new to that user at the time of interaction.
This row-level discovery marker is the bridge from EDA to causal design. It represents exposure to content that is either less popular or less familiar to the user. The later mediation notebook can test whether this exposure works through immediate engagement and satisfaction.
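The "new to that user at the time of interaction" rule can be sketched by marking the first time each (user, category) pair appears in time order. The data and the helper columns row_id and category_is_new are assumptions; videos with several categories count as new-category if any of their categories is new.

```python
import pandas as pd

# Synthetic interactions with already-parsed category lists.
interactions = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2],
        "video_id": [10, 11, 12, 13, 10, 12],
        "timestamp": [100, 200, 300, 400, 150, 250],
        "feat": [[8], [8, 9], [9], [9], [8], [9]],
        "long_tail_item": [0, 0, 1, 0, 0, 1],
    }
)

# Time order within user defines what the user has already seen.
interactions = interactions.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
interactions["row_id"] = interactions.index

# One row per (interaction, category), still in time order within user.
exploded = interactions[["row_id", "user_id", "feat"]].explode("feat", ignore_index=True)

# A category is new the first time the (user, category) pair appears.
exploded["category_is_new"] = ~exploded.duplicated(["user_id", "feat"], keep="first")

# An interaction is new-category if any of its categories is new.
new_category = exploded.groupby("row_id")["category_is_new"].any()
interactions["new_category"] = (
    new_category.reindex(interactions["row_id"]).to_numpy().astype("int8")
)

# Discovery candidate: long-tail item OR new category for this user.
interactions["discovery_candidate"] = (
    interactions["long_tail_item"].eq(1) | interactions["new_category"].eq(1)
).astype("int8")

print(interactions[["user_id", "video_id", "new_category", "discovery_candidate"]])
```

Note that every user's first interactions are mechanically new-category under this rule, so early days in the sample window will show inflated discovery shares.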
10. Plot Content Popularity and Discovery Exposure
This cell visualizes item popularity and discovery exposure. The popularity distribution shows whether there is a meaningful long tail, and the category plot shows where discovery-candidate interactions occur.
The plots confirm that discovery exposure has real variation. That is necessary for mediation: if every user-day had the same discovery exposure level, we could not study whether it changes engagement, satisfaction, or future activity.
11. Build a Balanced User-Day Panel
Mediation will be easier to define at the user-day level. This cell aggregates interactions into daily measures, then creates a balanced user-date panel so inactive future days are counted as zero activity rather than disappearing from the data.
The balanced panel is the core data structure for this project. It preserves inactive days, which matters for future engagement outcomes. It also summarizes discovery exposure and satisfaction on the same user-day.
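A minimal sketch of the aggregation and balancing step on synthetic rows. Column names follow the panel shown later in this notebook; leaving share columns undefined on inactive days is a choice made for this sketch, not necessarily what the notebook does.

```python
import pandas as pd

# Synthetic row-level interactions for two users over three calendar days.
rows = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "event_date": pd.to_datetime(
            ["2020-07-05", "2020-07-05", "2020-07-07", "2020-07-06", "2020-07-06"]
        ),
        "play_duration_sec": [5.0, 7.0, 6.0, 4.0, 8.0],
        "discovery_candidate": [1, 0, 1, 0, 0],
    }
)

# Daily measures exist only for days with at least one interaction.
daily = rows.groupby(["user_id", "event_date"], as_index=False).agg(
    interactions=("play_duration_sec", "size"),
    total_play_duration_sec=("play_duration_sec", "sum"),
    discovery_candidate_share=("discovery_candidate", "mean"),
)

# Balanced panel: every user crossed with every calendar date in the window.
all_dates = pd.date_range(daily["event_date"].min(), daily["event_date"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [daily["user_id"].unique(), all_dates], names=["user_id", "event_date"]
)
user_day = daily.set_index(["user_id", "event_date"]).reindex(full_index).reset_index()

# Inactive days become explicit zero-activity rows instead of disappearing.
user_day["active_day"] = user_day["interactions"].notna().astype("int8")
count_cols = ["interactions", "total_play_duration_sec"]
user_day[count_cols] = user_day[count_cols].fillna(0)
user_day["calendar_day_index"] = (user_day["event_date"] - all_dates[0]).dt.days

print(user_day)
```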
12. Define Future Outcomes and History Controls
This cell creates future outcomes and lagged history controls. Later mediation models should adjust for prior user behavior because users with different histories may receive different exposure and have different future engagement.
def future_sum_by_user(frame, column, horizon):
    # Sum of `column` over the next `horizon` rows of the same user (current day excluded).
    grouped = frame.groupby("user_id", sort=False)[column]
    total = pd.Series(0.0, index=frame.index)
    for step in range(1, horizon + 1):
        # Days beyond the end of the panel contribute 0.
        total = total + grouped.shift(-step).fillna(0)
    return total


for col in ["active_day", "interactions", "total_play_duration_sec"]:
    user_day[f"lead_1_{col}"] = user_day.groupby("user_id")[col].shift(-1).fillna(0)
    user_day[f"future_3day_{col}"] = future_sum_by_user(user_day, col, 3)
    user_day[f"future_7day_{col}"] = future_sum_by_user(user_day, col, 7)

for col in [
    "active_day",
    "interactions",
    "total_play_duration_sec",
    "avg_watch_ratio_capped_2",
    "valid_play_share",
    "high_satisfaction_share",
    "discovery_candidate_share",
]:
    user_day[f"lag_1_{col}"] = user_day.groupby("user_id")[col].shift(1).fillna(0)
    # Rolling 3-day sum shifted by one day so it only uses prior days.
    # For share columns this is a sum of daily shares, not an average.
    user_day[f"prior_3day_{col}"] = (
        user_day.groupby("user_id")[col]
        .rolling(window=3, min_periods=1)
        .sum()
        .reset_index(level=0, drop=True)
        .groupby(user_day["user_id"])
        .shift(1)
        .fillna(0)
    )

future_summary = user_day[
    [
        "future_7day_active_day",
        "future_7day_interactions",
        "future_7day_total_play_duration_sec",
        "lag_1_interactions",
        "prior_3day_interactions",
    ]
].describe(percentiles=[0.25, 0.5, 0.75, 0.9]).T
display(future_summary)
                                          count        mean         std     min         25%         50%         75%         90%          max
future_7day_active_day               8,379.0000      6.4157      1.5444  0.0000      7.0000      7.0000      7.0000      7.0000       7.0000
future_7day_interactions             8,379.0000    338.3823    182.2699  0.0000    202.0000    364.0000    470.0000    561.0000     899.0000
future_7day_total_play_duration_sec  8,379.0000  2,926.6982  1,698.1722  0.0000  1,726.1070  2,953.5820  4,037.2785  5,032.7692  10,442.1170
lag_1_interactions                   8,379.0000     50.4739     32.3121  0.0000     27.0000     47.0000     69.0000     93.0000     293.0000
prior_3day_interactions              8,379.0000    151.1738     79.6062  0.0000     92.0000    149.0000    203.0000    254.0000     497.0000
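The future outcomes provide the retention side of the mediation pathway. Two features of the summary are worth carrying forward. First, future_7day_active_day is close to its ceiling (mean 6.42, with the 25th percentile already at 7), so interaction counts and play time carry more variation than active days. Second, leads are filled with 0 beyond the last observed date, so outcomes for the final seven days of each user's panel are truncated rather than truly low. The lagged variables are not causal results; they are adjustment candidates that help later models compare similar user-days.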
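The forward-window logic can be checked on a toy panel. The function below mirrors future_sum_by_user from the cell above; the toy data and the full_window flag are assumptions added here to show which rows have a complete look-ahead window.

```python
import pandas as pd


def future_sum_by_user(frame, column, horizon):
    # Sum of `column` over the next `horizon` rows of the same user.
    grouped = frame.groupby("user_id", sort=False)[column]
    total = pd.Series(0.0, index=frame.index)
    for step in range(1, horizon + 1):
        total = total + grouped.shift(-step).fillna(0)
    return total


# Toy panel: one row per user-day, already sorted by user and date.
toy = pd.DataFrame(
    {
        "user_id": [1] * 5 + [2] * 3,
        "interactions": [1, 2, 3, 4, 5, 10, 20, 30],
    }
)

HORIZON = 2
toy["future_2day_interactions"] = future_sum_by_user(toy, "interactions", HORIZON)

# Hypothetical flag: True when at least HORIZON later rows exist for the user.
rows_remaining = toy.groupby("user_id").cumcount(ascending=False)
toy["full_window"] = rows_remaining >= HORIZON

print(toy)
```

Rows where full_window is False have outcomes built from fewer than HORIZON future days, which is one reason a later notebook might restrict estimation to rows with a complete window.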
13. Define Treatment, Mediators, and Outcome Candidates
This cell creates first-pass variables for the mediation analysis. The treatment is high discovery exposure on an active day. The mediators are same-day engagement quality and satisfaction. The outcome is future 7-day engagement.
active_days = user_day.query("active_day == 1").copy()
discovery_threshold = active_days["discovery_candidate_share"].median()
long_tail_threshold_day = active_days["long_tail_share"].median()

user_day["treatment_high_discovery_exposure"] = (
    (user_day["active_day"].eq(1))
    & (user_day["discovery_candidate_share"] >= discovery_threshold)
).astype("int8")
user_day["treatment_high_long_tail_exposure"] = (
    (user_day["active_day"].eq(1))
    & (user_day["long_tail_share"] >= long_tail_threshold_day)
).astype("int8")

user_day["mediator_valid_play_share"] = user_day["valid_play_share"]
user_day["mediator_high_satisfaction_share"] = user_day["high_satisfaction_share"]
user_day["mediator_avg_satisfaction_score"] = user_day["avg_satisfaction_score"]
user_day["outcome_future_7day_interactions"] = user_day["future_7day_interactions"]
user_day["outcome_future_7day_active_days"] = user_day["future_7day_active_day"]
user_day["outcome_future_7day_play_hours"] = user_day["future_7day_total_play_duration_sec"] / 3600

mediation_panel = user_day.query("active_day == 1").copy()
mediation_panel = mediation_panel.merge(
    user_features,
    on="user_id",
    how="left",
)

candidate_summary = pd.DataFrame(
    [
        {
            "role": "treatment",
            "variable": "treatment_high_discovery_exposure",
            "mean": mediation_panel["treatment_high_discovery_exposure"].mean(),
            "description": "Active user-day has discovery-candidate share at or above the active-day median, where discovery combines platform long-tail status and first category exposure.",
        },
        {
            "role": "treatment_alt",
            "variable": "treatment_high_long_tail_exposure",
            "mean": mediation_panel["treatment_high_long_tail_exposure"].mean(),
            "description": "Active user-day has long-tail share at or above the active-day median.",
        },
        {
            "role": "mediator_engagement",
            "variable": "mediator_valid_play_share",
            "mean": mediation_panel["mediator_valid_play_share"].mean(),
            "description": "Share of interactions that look like valid plays.",
        },
        {
            "role": "mediator_satisfaction",
            "variable": "mediator_high_satisfaction_share",
            "mean": mediation_panel["mediator_high_satisfaction_share"].mean(),
            "description": "Share of interactions with watch ratio at least 0.8.",
        },
        {
            "role": "outcome",
            "variable": "outcome_future_7day_interactions",
            "mean": mediation_panel["outcome_future_7day_interactions"].mean(),
            "description": "Future 7-day interaction count after the current day.",
        },
        {
            "role": "outcome_alt",
            "variable": "outcome_future_7day_active_days",
            "mean": mediation_panel["outcome_future_7day_active_days"].mean(),
            "description": "Future 7-day active-day count after the current day.",
        },
    ]
)
display(candidate_summary)
display(mediation_panel.head())
   role                   variable                               mean  description
0  treatment              treatment_high_discovery_exposure    0.5015  Active user-day has discovery-candidate share ...
1  treatment_alt          treatment_high_long_tail_exposure    0.5001  Active user-day has long-tail share at or abov...
2  mediator_engagement    mediator_valid_play_share            0.9394  Share of interactions that look like valid plays.
3  mediator_satisfaction  mediator_high_satisfaction_share     0.4686  Share of interactions with watch ratio at leas...
4  outcome                outcome_future_7day_interactions   340.6945  Future 7-day interaction count after the curre...
5  outcome_alt            outcome_future_7day_active_days      6.4664  Future 7-day active-day count after the curren...
First five rows of mediation_panel, shown transposed (one column per row index) because the table has 81 columns:

variable                                        0            1            2            3            4
user_id                                       120          120          120          120          120
event_date                             2020-07-05   2020-07-06   2020-07-07   2020-07-08   2020-07-09
interactions                              32.0000      20.0000      16.0000      37.0000      34.0000
unique_videos                             32.0000      20.0000      16.0000      37.0000      34.0000
unique_categories                         16.0000      15.0000      10.0000      23.0000      14.0000
total_play_duration_sec                  163.9700     130.9860     100.9200     222.7200     230.6110
avg_play_duration_sec                      5.1241       6.5493       6.3075       6.0195       6.7827
avg_video_duration_sec                    11.9891      13.4448      10.1208      21.1131      11.9327
avg_watch_ratio                            0.5813       0.6965       0.6580       0.5842       0.6982
avg_watch_ratio_capped_2                   0.5813       0.6837       0.6580       0.5639       0.6982
valid_play_share                           0.9375       0.9500       1.0000       0.8919       1.0000
high_satisfaction_share                    0.1562       0.3500       0.1875       0.2162       0.3235
complete_or_rewatch_share                  0.0938       0.2000       0.0625       0.0541       0.2059
short_abandon_share                        0.0625       0.1500       0.0000       0.2703       0.0294
avg_satisfaction_score                     0.4031       0.4584       0.4301       0.3775       0.4760
long_tail_share                            0.4062       0.3500       0.2500       0.3514       0.3824
new_category_share                         0.5000       0.2500       0.3125       0.1351       0.0000
discovery_candidate_share                  0.6875       0.4500       0.4375       0.4324       0.3824
active_day                                      1            1            1            1            1
calendar_day_index                              0            1            2            3            4
lead_1_active_day                          1.0000       1.0000       1.0000       1.0000       1.0000
future_3day_active_day                     3.0000       3.0000       3.0000       3.0000       3.0000
future_7day_active_day                     7.0000       7.0000       7.0000       7.0000       7.0000
lead_1_interactions                       20.0000      16.0000      37.0000      34.0000      57.0000
future_3day_interactions                  73.0000      87.0000     128.0000     144.0000     209.0000
future_7day_interactions                 316.0000     345.0000     393.0000     436.0000     448.0000
lead_1_total_play_duration_sec           130.9860     100.9200     222.7200     230.6110     335.9890
future_3day_total_play_duration_sec      454.6260     554.2510     789.3200     894.1200   1,260.3930
future_7day_total_play_duration_sec    1,945.6300   2,111.7460   2,417.3040   2,725.9950   2,809.6920
lag_1_active_day                           0.0000       1.0000       1.0000       1.0000       1.0000
prior_3day_active_day                      0.0000       1.0000       2.0000       3.0000       3.0000
lag_1_interactions                         0.0000      32.0000      20.0000      16.0000      37.0000
prior_3day_interactions                    0.0000      32.0000      52.0000      68.0000      73.0000
lag_1_total_play_duration_sec              0.0000     163.9700     130.9860     100.9200     222.7200
prior_3day_total_play_duration_sec         0.0000     163.9700     294.9560     395.8760     454.6260
lag_1_avg_watch_ratio_capped_2             0.0000       0.5813       0.6837       0.6580       0.5639
prior_3day_avg_watch_ratio_capped_2        0.0000       0.5813       1.2650       1.9230       1.9056
lag_1_valid_play_share                     0.0000       0.9375       0.9500       1.0000       0.8919
prior_3day_valid_play_share                0.0000       0.9375       1.8875       2.8875       2.8419
lag_1_high_satisfaction_share              0.0000       0.1562       0.3500       0.1875       0.2162
prior_3day_high_satisfaction_share         0.0000       0.1562       0.5062       0.6937       0.7537
lag_1_discovery_candidate_share            0.0000       0.6875       0.4500       0.4375       0.4324
prior_3day_discovery_candidate_share       0.0000       0.6875       1.1375       1.5750       1.3199
treatment_high_discovery_exposure               1            1            1            1            0
treatment_high_long_tail_exposure               1            0            0            0            1
mediator_valid_play_share                  0.9375       0.9500       1.0000       0.8919       1.0000
mediator_high_satisfaction_share           0.1562       0.3500       0.1875       0.2162       0.3235
mediator_avg_satisfaction_score            0.4031       0.4584       0.4301       0.3775       0.4760
outcome_future_7day_interactions         316.0000     345.0000     393.0000     436.0000     448.0000
outcome_future_7day_active_days            7.0000       7.0000       7.0000       7.0000       7.0000
outcome_future_7day_play_hours             0.5405       0.5866       0.6715       0.7572       0.7805
user_active_degree                    full_active  full_active  full_active  full_active  full_active
is_lowactive_period                             0            0            0            0            0
is_live_streamer                                0            0            0            0            0
is_video_author                                 0            0            0            0            0
follow_user_num                                 7            7            7            7            7
follow_user_num_range                      (0,10]       (0,10]       (0,10]       (0,10]       (0,10]
fans_user_num                                   3            3            3            3            3
fans_user_num_range                        [1,10)       [1,10)       [1,10)       [1,10)       [1,10)
friend_user_num                                 0            0            0            0            0
friend_user_num_range                           0            0            0            0            0
register_days                                 224          224          224          224          224
register_days_range                       181-365      181-365      181-365      181-365      181-365
onehot_feat0                                    0            0            0            0            0
onehot_feat1                                    1            1            1            1            1
onehot_feat2                                   24           24           24           24           24
onehot_feat3                                  876          876          876          876          876
onehot_feat4                               1.0000       1.0000       1.0000       1.0000       1.0000
onehot_feat5                                    0            0            0            0            0
onehot_feat6                                    1            1            1            1            1
onehot_feat7                                    4            4            4            4            4
onehot_feat8                                   98           98           98           98           98
onehot_feat9                                    6            6            6            6            6
onehot_feat10                                   0            0            0            0            0
onehot_feat11                                   0            0            0            0            0
onehot_feat12                              0.0000       0.0000       0.0000       0.0000       0.0000
onehot_feat13                              0.0000       0.0000       0.0000       0.0000       0.0000
onehot_feat14                              0.0000       0.0000       0.0000       0.0000       0.0000
onehot_feat15                              0.0000       0.0000       0.0000       0.0000       0.0000
onehot_feat16                              0.0000       0.0000       0.0000       0.0000       0.0000
onehot_feat17                              0.0000       0.0000       0.0000       0.0000       0.0000
This table is the first explicit mediation setup. It does not prove causality yet. It defines a plausible treatment, mediators, and future outcomes so the next notebook can validate and refine the metric choices.
14. Compare Same-Day Metrics by Discovery Exposure
This cell compares mediators and future outcomes between high-discovery and lower-discovery active days. These are raw differences, not adjusted causal effects, but they help us see whether the variables have meaningful variation.
comparison_metrics = [
    "interactions",
    "valid_play_share",
    "high_satisfaction_share",
    "complete_or_rewatch_share",
    "avg_satisfaction_score",
    "outcome_future_7day_interactions",
    "outcome_future_7day_active_days",
    "outcome_future_7day_play_hours",
    "lag_1_interactions",
    "prior_3day_interactions",
]

raw_means = mediation_panel.groupby("treatment_high_discovery_exposure")[comparison_metrics].mean()
for arm in [0, 1]:
    if arm not in raw_means.index:
        raw_means.loc[arm] = np.nan
raw_means = raw_means.sort_index()

raw_comparison = pd.DataFrame(
    {
        "metric": comparison_metrics,
        "lower_discovery_days": raw_means.loc[0, comparison_metrics].to_numpy(),
        "high_discovery_days": raw_means.loc[1, comparison_metrics].to_numpy(),
    }
)
raw_comparison["high_minus_lower"] = raw_comparison["high_discovery_days"] - raw_comparison["lower_discovery_days"]
display(raw_comparison)
   metric                            lower_discovery_days  high_discovery_days  high_minus_lower
0  interactions                                   39.6538              63.4929           23.8392
1  valid_play_share                                0.9377               0.9411            0.0033
2  high_satisfaction_share                         0.4538               0.4834            0.0296
3  complete_or_rewatch_share                       0.3138               0.3324            0.0185
4  avg_satisfaction_score                          0.5187               0.5295            0.0108
5  outcome_future_7day_interactions              247.8894             432.9353          185.0459
6  outcome_future_7day_active_days                 6.0595               6.8709            0.8114
7  outcome_future_7day_play_hours                  0.6040               1.0326            0.4286
8  lag_1_interactions                             41.1461              60.8570           19.7109
9  prior_3day_interactions                       126.5388             178.4667           51.9279
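Raw differences are useful for orientation only. High-discovery days also differ sharply in prior activity: lag_1_interactions averages 60.9 versus 41.1 on lower-discovery days, and prior_3day_interactions averages 178.5 versus 126.5. Much of the gap in future 7-day interactions (432.9 versus 247.9) could therefore reflect pre-existing activity levels rather than exposure, so later mediation models need to adjust for history. The next notebooks should not interpret these simple differences as causal effects.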
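One common way to quantify how different the two groups are on a history control is a standardized mean difference. This diagnostic is not part of the notebook; the helper name, the toy data, and the pooled-standard-deviation form are assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(frame, treatment_col, covariate):
    """Difference in group means divided by the pooled standard deviation (hypothetical helper)."""
    treated = frame.loc[frame[treatment_col] == 1, covariate]
    control = frame.loc[frame[treatment_col] == 0, covariate]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd


# Toy user-days: treated days follow heavier prior activity.
toy = pd.DataFrame(
    {
        "treatment_high_discovery_exposure": [0, 0, 0, 0, 1, 1, 1, 1],
        "lag_1_interactions": [20, 30, 40, 50, 40, 50, 60, 70],
    }
)

smd = standardized_mean_difference(
    toy, "treatment_high_discovery_exposure", "lag_1_interactions"
)
print(round(smd, 4))
```

Applied to mediation_panel, the same function could be looped over the lag_1_ and prior_3day_ columns to see which history controls are most imbalanced before any adjustment.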
15. Plot Treatment, Mediators, and Future Outcome Relationships
This cell creates a compact visual summary of how discovery exposure relates to same-day satisfaction and future engagement. It is meant as EDA, not causal evidence.
The plots show the measurement pathway in one place: discovery exposure, satisfaction-like mediators, and future engagement. This helps check whether the later mediation analysis will be numerically meaningful.
16. Correlation Map for Candidate Variables
This cell computes correlations among treatment, mediators, outcomes, and prior-history controls. Correlation is not causation, but it is a useful diagnostic for variable redundancy and expected model behavior.
The correlation map helps identify which variables are likely to act as confounders or redundant mediators. In particular, prior activity variables should be treated carefully because they can predict both current exposure and future outcomes.
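A minimal sketch of the correlation diagnostic on simulated data. The simulated relationships, the use of Pearson correlation (the pandas default), and the pairs table are assumptions; the notebook does not state which correlation method it uses.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Simulated user-days in which prior activity drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 200
prior = rng.poisson(150, size=n).astype(float)
toy = pd.DataFrame(
    {
        "prior_3day_interactions": prior,
        "treatment_high_discovery_exposure": (prior + rng.normal(0, 20, n) > 150).astype(int),
        "mediator_high_satisfaction_share": rng.uniform(0, 1, n),
    }
)
toy["outcome_future_7day_interactions"] = 2 * prior + rng.normal(0, 30, n)

# Pairwise correlation matrix across all candidate variables.
corr = toy.corr()

# Long-form list of unique variable pairs, strongest absolute correlation first.
pairs = pd.DataFrame(
    [
        {"left": a, "right": b, "correlation": corr.loc[a, b]}
        for a, b in combinations(corr.columns, 2)
    ]
).sort_values("correlation", key=lambda s: s.abs(), ascending=False)

print(pairs)
```

The corr matrix can be passed to a heatmap for the visual version, while the pairs table makes it easier to scan for history controls that correlate with both treatment and outcome.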
17. Readiness Checks for Mediation Analysis
This cell summarizes whether the constructed panel is ready for the next notebook. It checks sample size, treatment variation, mediator variation, future-outcome variation, and missingness in key variables.
key_variables = [
    "treatment_high_discovery_exposure",
    "mediator_valid_play_share",
    "mediator_high_satisfaction_share",
    "mediator_avg_satisfaction_score",
    "outcome_future_7day_interactions",
    "outcome_future_7day_active_days",
    "lag_1_interactions",
    "prior_3day_interactions",
]

readiness_checks = pd.DataFrame(
    [
        {
            "check": "active_user_days",
            "value": len(mediation_panel),
            "notes": "Rows available for active-day mediation setup.",
        },
        {
            "check": "sampled_users",
            "value": mediation_panel["user_id"].nunique(),
            "notes": "Users represented in the mediation panel.",
        },
        {
            "check": "treatment_rate",
            "value": mediation_panel["treatment_high_discovery_exposure"].mean(),
            "notes": "Should be neither near 0 nor near 1.",
        },
        {
            "check": "mediator_satisfaction_std",
            "value": mediation_panel["mediator_high_satisfaction_share"].std(),
            "notes": "Mediator must vary across user-days.",
        },
        {
            "check": "future_7day_interactions_std",
            "value": mediation_panel["outcome_future_7day_interactions"].std(),
            "notes": "Outcome must vary across user-days.",
        },
        {
            "check": "max_key_variable_missing_rate",
            "value": mediation_panel[key_variables].isna().mean().max(),
            "notes": "Key variables should be complete or nearly complete.",
        },
    ]
)
display(readiness_checks)
   check                               value  notes
0  active_user_days               8,199.0000  Rows available for active-day mediation setup.
1  sampled_users                    133.0000  Users represented in the mediation panel.
2  treatment_rate                     0.5015  Should be neither near 0 nor near 1.
3  mediator_satisfaction_std          0.1766  Mediator must vary across user-days.
4  future_7day_interactions_std     180.3633  Outcome must vary across user-days.
5  max_key_variable_missing_rate      0.0000  Key variables should be complete or nearly com...
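The readiness checks support moving to metric construction. The panel has 8,199 active user-days from 133 users, a treatment rate of 0.5015 (close to one half by construction, since the threshold is the active-day median), a satisfaction-mediator standard deviation of 0.1766, a future 7-day interaction standard deviation of 180.4, and no missing values in the key variables. If treatment, mediators, or future outcomes lacked variation, mediation would be weak. Here the panel has enough structure for the next notebook to compare candidate discovery-quality metrics.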
18. Save Processed Discovery-Quality Artifacts
This cell saves the processed interaction sample, item features, user-day panel, mediation panel, candidate variable summary, and readiness checks. Later notebooks can load these directly.
These saved files are the handoff to the next notebook. The mediation panel is the most important artifact because it contains treatment candidates, mediator candidates, future outcomes, and history controls in one user-day table.
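A minimal sketch of the save-and-reload handoff. The file name, the CSV format, and the temporary directory are assumptions so the example runs anywhere; in the notebook the target would be PROCESSED_DIR, and the actual format may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the mediation panel.
mediation_panel = pd.DataFrame(
    {
        "user_id": [120, 120, 120],
        "event_date": pd.to_datetime(["2020-07-05", "2020-07-06", "2020-07-07"]),
        "treatment_high_discovery_exposure": [1, 1, 1],
        "outcome_future_7day_interactions": [316, 345, 393],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    processed_dir = Path(tmp_dir) / "processed"
    processed_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name for the handoff artifact.
    panel_path = processed_dir / "discovery_quality_mediation_panel.csv"
    mediation_panel.to_csv(panel_path, index=False)

    # Reload the way a later notebook might, restoring the date column.
    reloaded = pd.read_csv(panel_path, parse_dates=["event_date"])

columns_match = list(reloaded.columns) == list(mediation_panel.columns)
rows_match = len(reloaded) == len(mediation_panel)
print(columns_match, rows_match)
```

CSV does not preserve dtypes such as int8 or datetime, which is why the reload passes parse_dates explicitly; a typed format would avoid that step.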
19. Notebook Takeaways
This notebook established the data foundation for discovery-quality mediation:
KuaiRec is a strong fit because it has watch duration, watch ratio, user features, item categories, and sequential user activity.
The project should not treat play occurrence alone as satisfaction. Watch ratio, completion/rewatch, high-satisfaction share, and abandonment are richer signals.
Discovery exposure can be represented by long-tail content and new-category content, then aggregated to a user-day treatment candidate.
Future 7-day engagement can be measured with interactions, active days, and play hours.
The next notebook should validate and refine the discovery-quality metric before formal mediation estimation.