This notebook starts the interference and spillover effects analysis. The causal problem is different from standard treatment-effect estimation because recommendation items compete with one another. If one movie is promoted into a visible slate position, another movie may lose visibility, attention, clicks, ratings, or watch time. That means an item’s outcome may depend not only on its own treatment status, but also on the treatment assignments of other nearby items.
The purpose of this first notebook is to understand the MovieLens data and prepare a clean foundation for later spillover notebooks. MovieLens does not contain true production impressions or randomized promotion assignments, so this project will use MovieLens as a realistic preference dataset and then simulate recommendation slates, promotion assignments, and item competition structures.
Dataset Field Guide
MovieLens 32M is distributed as four main CSV files inside ml-32m.zip.
ratings.csv
Each row is a user-movie rating event.
userId: anonymized user identifier. A user can rate many movies.
movieId: MovieLens movie identifier. This links to movies.csv, links.csv, and tags.csv.
rating: explicit star rating on a 0.5 to 5.0 scale. In this project it is a preference signal, not a direct exposure or click outcome.
timestamp: Unix timestamp for when the rating was created. This lets us study recency, user histories, and time ordering.
movies.csv
Each row is a movie catalog item.
movieId: MovieLens movie identifier.
title: movie title, usually with a release year in parentheses.
genres: pipe-delimited genre string such as Comedy|Romance. A movie can belong to multiple genres. We will use genres to create substitute groups because movies in similar genres plausibly compete for user attention.
tags.csv
Each row is a user-applied free-text tag for a movie.
userId: anonymized user identifier for the tag event.
movieId: tagged movie identifier.
tag: user-generated text label such as an actor, theme, mood, franchise, or descriptive keyword.
timestamp: Unix timestamp for when the tag was applied.
Tags are useful for richer item similarity, but they are sparse and noisy. In this first notebook we use them mainly for understanding catalog semantics.
links.csv
Each row maps a MovieLens movie to external identifiers.
movieId: MovieLens movie identifier.
imdbId: IMDb identifier.
tmdbId: TMDb identifier.
This project does not need external metadata immediately, but the file is useful if we later want posters, richer genres, cast, crew, or production metadata.
Causal Setup Preview
Later notebooks will convert this preference dataset into a simulated recommendation setting:
A slate is a set of movies shown together to a user.
A treated item is a movie promoted into a more visible position in the slate.
A spillover-exposed item is another item in the same slate, especially a similar or substitutable movie.
A direct effect measures what happens to the promoted item.
An indirect or spillover effect measures what happens to competing items.
A total effect combines promoted-item gains and displaced-item losses at the slate or cluster level.
This notebook does not estimate those effects yet. It builds the data understanding and processed inputs needed to do that carefully.
1. Environment and Paths
This cell imports the libraries used for dataset inspection, plotting, and processed-data export. It also defines paths to the MovieLens zip file and the processed-data folder. The notebook reads directly from the zip file so we do not need to permanently extract a very large dataset into the repository.
```python
from pathlib import Path
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 60)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

# Walk up from the working directory until the MovieLens zip is found.
candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_DIR = next(
    root for root in candidate_roots
    if (root / "data" / "movieLens" / "ml-32m.zip").exists()
)
DATA_DIR = PROJECT_DIR / "data"
RAW_ZIP = DATA_DIR / "movieLens" / "ml-32m.zip"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

NOTEBOOK_DIR = PROJECT_DIR / "notebooks" / "interference_spillover_effects"
WRITEUP_DIR = NOTEBOOK_DIR / "writeup"
FIGURE_DIR = WRITEUP_DIR / "figures"
TABLE_DIR = WRITEUP_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

RAW_ZIP.exists(), RAW_ZIP
```
The path check should return True. If it does, the notebook has found the compressed MovieLens file and can read the CSVs directly from it. The writeup folders are also created here so later cells can save lightweight tables and figures without scattering files across the project.
2. Inspect the MovieLens Archive
Before loading data, we inspect the zip archive itself. This confirms which files are present, how large they are, and whether the archive matches the standard MovieLens 32M layout. This is a useful first notebook habit because many downstream errors come from assuming a folder layout that differs from the local download.
```python
with ZipFile(RAW_ZIP) as zf:
    archive_rows = []
    for info in zf.infolist():
        if not info.is_dir():
            archive_rows.append(
                {
                    "file": info.filename,
                    "compressed_mb": info.compress_size / 1_000_000,
                    "uncompressed_mb": info.file_size / 1_000_000,
                }
            )

archive_df = pd.DataFrame(archive_rows).sort_values("uncompressed_mb", ascending=False)
display(archive_df)
```
| file | compressed_mb | uncompressed_mb |
| --- | --- | --- |
| ml-32m/ratings.csv | 218.7859 | 877.0762 |
| ml-32m/tags.csv | 17.8637 | 72.3539 |
| ml-32m/movies.csv | 1.4578 | 4.2429 |
| ml-32m/links.csv | 0.8377 | 1.9507 |
| ml-32m/README.txt | 0.0037 | 0.0092 |
| ml-32m/checksums.txt | 0.0001 | 0.0002 |
The ratings file is by far the largest, so the rest of the notebook treats it carefully. The movie metadata is small enough to load fully, while ratings need chunked reading and a deterministic user sample. That keeps the EDA reproducible without requiring the notebook to hold all 32 million interactions in memory.
3. Load and Enrich the Movie Catalog
The movie catalog is the natural item table for the interference problem. This cell loads every movie, parses the release year from the title when available, splits the pipe-delimited genre string into a list, and creates a primary genre for simple grouping. Later notebooks can use these genre groups as item clusters where spillovers are most plausible.
The catalog gives us a broad item universe and a simple substitute structure through genres. This is important because interference is not expected to be equally strong between all movies. Promotion of one comedy is more likely to displace another comedy than an unrelated documentary, so genre-based grouping is a defensible first exposure model.
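The catalog enrichment described above can be sketched on toy rows. This is a minimal illustration, not the notebook's exact cell: the toy `movies` frame and the derived column names (`release_year`, `genre_list`, `primary_genre`) are assumptions based on the section's description.

```python
import pandas as pd

# Toy rows mimicking movies.csv; column names follow the field guide above.
movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Heat (1995)", "Untitled Film"],
        "genres": ["Adventure|Animation|Comedy", "Action|Crime|Thriller", "(no genres listed)"],
    }
)

# Parse a trailing "(YYYY)" release year from the title when present;
# titles without a year become missing values rather than errors.
year_str = movies["title"].str.extract(r"\((\d{4})\)\s*$")[0]
movies["release_year"] = pd.to_numeric(year_str, errors="coerce").astype("Int64")

# Split the pipe-delimited genre string into a list and keep the first
# entry as a simple primary genre for grouping.
movies["genre_list"] = movies["genres"].str.split("|")
movies["primary_genre"] = movies["genre_list"].str[0]

print(movies[["movieId", "release_year", "primary_genre"]])
```

Using a nullable `Int64` year keeps movies with no parseable year in the catalog instead of dropping them.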
4. Load a Tag Sample
Tags are optional for this first project stage, but they help us understand whether MovieLens has enough semantic information to support richer item similarity later. Because tags.csv is larger than the movie catalog and free-text tags can be messy, this cell reads a bounded sample and normalizes tags to lowercase for frequency checks.
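The bounded read and lowercase normalization can be sketched as follows. The inline CSV text stands in for `tags.csv`; in the notebook the frame would come from a `pd.read_csv` on the zip member with a row bound such as `nrows`.

```python
import io

import pandas as pd

# Toy stand-in for a bounded sample of tags.csv.
raw = io.StringIO(
    "userId,movieId,tag,timestamp\n"
    "1,1,Pixar,1139045764\n"
    "2,1,pixar,1139045800\n"
    "3,2,Al Pacino,1139046000\n"
)
tags = pd.read_csv(raw)

# Normalize tags to lowercase and strip whitespace before counting,
# so case variants collapse into one frequency bucket.
tags["tag_clean"] = tags["tag"].str.lower().str.strip()
tag_counts = tags["tag_clean"].value_counts()
print(tag_counts.head())
```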
The tag sample confirms that MovieLens contains semantic item annotations beyond genres. We will not rely on tags as a core causal variable yet because they are user-generated and unevenly distributed, but they are useful evidence that richer similarity modeling is available if genre clusters become too coarse.
5. Build a Deterministic Rating Sample
The ratings table has more than 32 million rows, so loading it all at once is unnecessary for exploratory work. Instead, this cell scans the ratings file in chunks and keeps all ratings for users whose userId is divisible by a fixed modulus. This creates a deterministic user-level sample: selected users keep their full rating histories, which is better for sequence and slate construction than randomly sampling isolated rows.
The scan also collects full-file summary statistics such as total row count, unique users, unique movies, rating distribution, and timestamp range.
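The chunked scan can be sketched on a tiny in-memory file. The modulus value and variable names here are illustrative; the notebook fixes its own modulus and collects more statistics than this sketch shows.

```python
import io

import pandas as pd

SAMPLE_MODULUS = 3  # hypothetical; the notebook uses its own fixed modulus

# Toy stand-in for ratings.csv.
raw = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,100\n"
    "3,10,3.5,101\n"
    "3,11,5.0,102\n"
    "4,12,2.0,103\n"
    "6,10,4.5,104\n"
)

sampled_chunks = []
total_rows = 0
rating_counts = pd.Series(dtype="int64")

# Stream the file in chunks: keep complete histories for users whose id
# is divisible by the modulus, while accumulating full-file statistics.
for chunk in pd.read_csv(raw, chunksize=2):
    total_rows += len(chunk)
    rating_counts = rating_counts.add(chunk["rating"].value_counts(), fill_value=0)
    sampled_chunks.append(chunk[chunk["userId"] % SAMPLE_MODULUS == 0])

ratings_sample = pd.concat(sampled_chunks, ignore_index=True)
print(total_rows, len(ratings_sample))
```

Because selection depends only on `userId`, rerunning the scan always produces the same sample, and every selected user keeps all of their rows.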
The sample keeps complete histories for a manageable set of users. That matters for this project because interference simulation needs realistic user-level candidate sets, not disconnected individual ratings. The full-file statistics also let us describe the original dataset honestly even though later modeling uses a smaller processed sample.
6. Check Rating Scale and Preference Signal
This cell compares the full-file rating distribution collected during chunked loading with the sampled-user distribution. A close match is reassuring because it means the deterministic user sample is not obviously distorting the basic preference signal.
The rating distribution is the first sanity check for the sampled data. If the sample had very different shares of high or low ratings, later simulated outcomes would inherit that distortion. Small differences are fine because the goal is not survey-grade sampling; the goal is a stable working dataset that preserves the main preference structure.
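The comparison amounts to aligning two share tables and checking the largest gap. The share values below are made up for illustration; only the alignment pattern reflects the check described above.

```python
import pandas as pd

# Hypothetical rating-share tables for the full file and the user sample.
full_share = pd.Series({3.0: 0.20, 4.0: 0.27, 5.0: 0.15})
sample_share = pd.Series({3.0: 0.21, 4.0: 0.26, 5.0: 0.16})

# Align on rating value and look at the largest absolute share gap;
# a small gap suggests the deterministic sample preserves the signal.
comparison = pd.DataFrame({"full": full_share, "sample": sample_share})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
max_gap = comparison["abs_diff"].max()
print(comparison)
```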
7. Plot the Rating Distribution
The table above is precise, but the plot makes the rating scale easier to read. We show full-file and sampled-user shares side by side. The liked outcome used later is based on ratings of 4.0 or higher, so the mass around 4.0 and 5.0 is especially important.
The sampled-user distribution should track the full distribution closely. This makes the sampled ratings acceptable for EDA, user-history features, and initial slate construction. The concentration of positive ratings also reminds us that MovieLens ratings are explicit preference events, not random impressions.
8. Join Ratings to Movie Metadata
For interference analysis, ratings alone are not enough. We need item context so we can ask whether promoted movies displace similar movies. This cell joins the sampled ratings to movie metadata and checks whether any sampled ratings lack catalog information.
The join quality check tells us whether MovieLens identifiers are internally consistent. A clean join means later notebooks can safely use movie genres, release years, and titles when defining substitute groups and explaining spillover mechanisms.
9. User Activity Distribution
Interference simulation needs users with enough history to form realistic candidate slates. This cell summarizes how many ratings each sampled user has, their average rating, their share of high ratings, and the time span of their activity.
The user activity distribution tells us how many users can support slate construction. Users with very few ratings are less useful because we cannot build a credible set of competing candidate items for them. Users with broader genre histories are especially useful because they create slates with both substitutes and non-substitutes.
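The per-user summary can be sketched with a grouped aggregation. The toy frame and the `user_stats` name are illustrative (a distinct name is used here so the sketch does not shadow the notebook's own `user_features` table).

```python
import pandas as pd

# Toy sampled ratings with timestamps.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2],
        "rating": [5.0, 4.0, 2.0, 3.0],
        "timestamp": [100, 200, 300, 150],
    }
)

# Per-user activity features: rating count, mean rating, share of
# "liked" ratings (>= 4.0), and the span of active time.
user_stats = ratings.groupby("userId").agg(
    n_ratings=("rating", "size"),
    mean_rating=("rating", "mean"),
    liked_share=("rating", lambda s: (s >= 4.0).mean()),
    first_ts=("timestamp", "min"),
    last_ts=("timestamp", "max"),
)
user_stats["active_span"] = user_stats["last_ts"] - user_stats["first_ts"]
print(user_stats)
```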
10. Plot User Activity and Rating Tendencies
This cell visualizes user heterogeneity. The left plot shows the long-tailed number of ratings per sampled user. The right plot shows the distribution of each user’s average rating. Both matter because recommendation simulations should account for heavy users and naturally generous or strict raters.
```python
fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
sns.histplot(user_features["n_ratings"], bins=60, ax=axes[0])
axes[0].set_xscale("log")
axes[0].set_title("Ratings per Sampled User")
axes[0].set_xlabel("Number of ratings, log scale")
axes[0].set_ylabel("Users")
sns.histplot(user_features["mean_rating"], bins=40, ax=axes[1], color="tab:green")
axes[1].set_title("Average Rating per Sampled User")
axes[1].set_xlabel("Mean rating")
axes[1].set_ylabel("Users")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "02_user_activity.png", dpi=160, bbox_inches="tight")
plt.show()
```
The log scale is intentional because user activity is usually very skewed. That skew matters for interference: a small number of very active users may create many plausible slates, while less active users may only support a few. Later simulations should avoid letting the heaviest users dominate all estimates.
11. Item Popularity and Quality Signals
This cell creates item-level features from the sampled ratings. These features describe how often each movie appears, its average rating, and its high-rating rate. Popularity is important for simulated ranking because promoted items are rarely assigned from a uniform catalog; recommendation systems tend to draw from relevant and popular candidates.
The top movies are not just descriptive; they reveal the attention distribution that a simulated recommender would inherit. If we promote already-popular movies, displacement may mostly affect other popular substitutes. If we promote niche movies, spillovers may look different. This is why popularity buckets are saved as item features.
12. Plot the Item Popularity Long Tail
Recommendation catalogs usually have a long tail: a few items receive many interactions, while most items receive few. This plot checks whether the MovieLens sample has that shape. A long tail is useful here because spillovers can be studied across popular, medium, and niche items.
```python
fig, ax = plt.subplots(figsize=(9, 4.5))
sns.histplot(item_features["sample_rating_count"], bins=80, ax=ax, color="tab:purple")
ax.set_xscale("log")
ax.set_title("Movie Popularity in Sampled Ratings")
ax.set_xlabel("Number of sampled ratings, log scale")
ax.set_ylabel("Movies")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "03_item_popularity_long_tail.png", dpi=160, bbox_inches="tight")
plt.show()
```
The long-tail pattern supports the simulation plan. Interference is partly about scarce attention, and scarce attention is most interesting when items vary widely in baseline popularity. This also gives later notebooks a reason to report effects separately by item popularity tier.
13. Genre Coverage and Substitute Groups
Genres are the first approximation to item clusters. This cell explodes the multi-genre movie table so each movie contributes to every genre it belongs to, then computes catalog size, rating volume, average rating, and liked rate by genre.
The genre summary gives the first map of possible spillover neighborhoods. Large genres such as drama or comedy can support many within-genre substitute comparisons. Smaller genres may need to be combined or treated cautiously because sparse clusters can create noisy spillover estimates.
14. Plot Rating Volume by Genre
This visualization highlights which genres dominate the sampled preference data. For interference modeling, this helps decide where simulated slates will have enough similar items to create meaningful within-cluster competition.
The largest genres are natural starting points for substitute clusters. Later, when we simulate a promoted item, we can define spillover exposure as the number or share of same-genre items promoted nearby in the same slate. That turns genre EDA into a causal exposure model.
15. Time Coverage in the Rating Sample
MovieLens spans many years. This cell summarizes rating volume over calendar time so we can see whether the sample covers the same broad period as the full file. Time matters because user preferences, catalog composition, and platform behavior can all drift.
The yearly table checks temporal breadth and gives a first look at drift. We do not need to solve time drift in the first notebook, but later simulations should avoid accidentally mixing very old and very recent behavior without acknowledging it.
16. Plot Yearly Rating Activity
The plot makes the time trend easier to see than the table. We show sampled rating volume and average rating by year. This helps identify whether a small number of years dominate the sample.
The time trend is useful for deciding whether future notebooks should include calendar controls. Even in a simulation project, realistic time structure matters because old and new catalog items may have different popularity and competition patterns.
17. Top Tags in the Sample
This cell inspects the most common cleaned tags. Tags often capture actors, franchises, moods, themes, and other semantics that genres miss. We will not use them as the first cluster definition, but seeing them helps explain how the project could be extended beyond genre-level spillovers.
The tag sample suggests richer semantic neighborhoods are possible. For example, two movies in different genres might still compete if they share an actor, franchise, mood, or theme. The first version of the project will stay with genres for clarity, but tags are a credible future refinement.
18. Define an Interference-Ready Item Table
This cell creates the item table that later notebooks can reuse. It combines catalog metadata with sampled popularity and preference features. It also defines spillover_cluster, a first-pass substitute group based on primary genre. This is intentionally simple and explainable.
The spillover_cluster column is the bridge from EDA to causal design. In later notebooks, a movie’s potential outcomes can depend on its own promotion and on the promotion intensity among other movies in the same cluster or slate.
19. Build a Seed Dataset for Simulated Slates
MovieLens does not tell us which movies were shown together, so later notebooks need to simulate slates. This cell creates a seed table by selecting active users and taking a bounded set of their recent, highly rated movies. Each user’s seed slate is not yet a randomized experiment; it is a realistic candidate set that later notebooks can rank, promote, and use for spillover exposure construction.
The seed slate table is the key handoff from this notebook. It gives later notebooks user-specific candidate sets with item metadata and relevance labels. The next notebook can now focus on exposure mapping: which items are promoted, which same-cluster items are spillover-exposed, and how promotion changes simulated slate outcomes.
20. Save Processed Inputs
This cell saves the cleaned sample tables and readiness summaries. Saving these files keeps later notebooks fast and consistent: they can load the same sampled users, item features, and seed slates instead of repeating the expensive zip scan.
```python
ratings_output = PROCESSED_DIR / "movielens_interference_ratings_sample.parquet"
items_output = PROCESSED_DIR / "movielens_interference_items.parquet"
users_output = PROCESSED_DIR / "movielens_interference_user_features.parquet"
slates_output = PROCESSED_DIR / "movielens_interference_slate_seed.parquet"
readiness_output = PROCESSED_DIR / "movielens_interference_setup_readiness.csv"

ratings_enriched.to_parquet(ratings_output, index=False)
interference_items.to_parquet(items_output, index=False)
user_features.to_parquet(users_output, index=False)
slate_seed.to_parquet(slates_output, index=False)

readiness_checks = pd.DataFrame(
    [
        {
            "check": "raw_zip_found",
            "value": bool(RAW_ZIP.exists()),
            "notes": "MovieLens zip file is available locally.",
        },
        {
            "check": "ratings_sample_rows",
            "value": len(ratings_enriched),
            "notes": "Sample contains complete histories for selected users.",
        },
        {
            "check": "sample_unique_users",
            "value": ratings_enriched["userId"].nunique(),
            "notes": "Users available for user-level slate simulation.",
        },
        {
            "check": "sample_unique_movies",
            "value": ratings_enriched["movieId"].nunique(),
            "notes": "Movies available for item-level spillover analysis.",
        },
        {
            "check": "complete_seed_slates",
            "value": slate_seed["slate_id"].nunique(),
            "notes": "Complete seed slates available for exposure simulation.",
        },
        {
            "check": "seed_slate_rows",
            "value": len(slate_seed),
            "notes": "Rows in the slate seed table; should equal complete slates times slate size.",
        },
        {
            "check": "spillover_clusters",
            "value": interference_items["spillover_cluster"].nunique(),
            "notes": "Initial substitute clusters based on primary genre.",
        },
    ]
)
readiness_checks.to_csv(readiness_output, index=False)

saved_files = pd.DataFrame(
    {
        "artifact": [
            "ratings_sample",
            "interference_items",
            "user_features",
            "slate_seed",
            "readiness_checks",
        ],
        "path": [
            str(ratings_output),
            str(items_output),
            str(users_output),
            str(slates_output),
            str(readiness_output),
        ],
    }
)
display(readiness_checks)
display(saved_files)
```
| check | value | notes |
| --- | --- | --- |
| raw_zip_found | True | MovieLens zip file is available locally. |
| ratings_sample_rows | 617851 | Sample contains complete histories for selecte... |
| sample_unique_users | 4018 | Users available for user-level slate simulation. |
| sample_unique_movies | 22313 | Movies available for item-level spillover anal... |
| complete_seed_slates | 3000 | Complete seed slates available for exposure si... |
| seed_slate_rows | 36000 | Rows in the slate seed table; should equal com... |
| spillover_clusters | 20 | Initial substitute clusters based on primary g... |

| artifact | path |
| --- | --- |
| ratings_sample | /home/apex/Documents/ranking_sys/data/processe... |
| interference_items | /home/apex/Documents/ranking_sys/data/processe... |
| user_features | /home/apex/Documents/ranking_sys/data/processe... |
| slate_seed | /home/apex/Documents/ranking_sys/data/processe... |
| readiness_checks | /home/apex/Documents/ranking_sys/data/processe... |
The processed files make the interference project reproducible and modular. The important output is the seed slate table because it gives the next notebook a realistic starting point for simulating promotion assignments and spillover exposure. The user and item feature tables provide adjustment and segmentation variables.
21. Notebook Takeaways
This notebook established the foundation for an interference/spillover analysis using MovieLens:
MovieLens has strong user-item preference data but no true impression logs, so the causal design must be simulated.
Ratings provide a preference/relevance signal, while genres provide a transparent first substitute-cluster definition.
The sampled-user strategy keeps complete histories for selected users, which is more useful than isolated row sampling.
The seed slate table creates realistic user-specific candidate sets for later promotion and spillover simulations.
The next notebook should define exposure mappings: direct promotion, same-slate spillover exposure, same-cluster spillover exposure, and slate-level outcome construction.