This notebook starts the interference and spillover effects analysis. The causal problem is different from standard treatment-effect estimation because recommendation items compete with one another. If one movie is promoted into a visible slate position, another movie may lose visibility, attention, clicks, ratings, or watch time. That means an item’s outcome may depend not only on its own treatment status, but also on the treatment assignments of other nearby items.
The purpose of this first notebook is to understand the MovieLens data and prepare a clean foundation for later spillover notebooks. MovieLens does not contain true production impressions or randomized promotion assignments, so this project will use MovieLens as a realistic preference dataset and then simulate recommendation slates, promotion assignments, and item competition structures.
Dataset Field Guide
MovieLens 32M is distributed as four main CSV files inside ml-32m.zip.
ratings.csv
Each row is a user-movie rating event.
userId: anonymized user identifier. A user can rate many movies.
movieId: MovieLens movie identifier. This links to movies.csv, links.csv, and tags.csv.
rating: explicit star rating on a 0.5 to 5.0 scale. In this project it is a preference signal, not a direct exposure or click outcome.
timestamp: Unix timestamp for when the rating was created. This lets us study recency, user histories, and time ordering.
movies.csv
Each row is a movie catalog item.
movieId: MovieLens movie identifier.
title: movie title, usually with a release year in parentheses.
genres: pipe-delimited genre string such as Comedy|Romance. A movie can belong to multiple genres. We will use genres to create substitute groups because movies in similar genres plausibly compete for user attention.
tags.csv
Each row is a user-applied free-text tag for a movie.
userId: anonymized user identifier for the tag event.
movieId: tagged movie identifier.
tag: user-generated text label such as an actor, theme, mood, franchise, or descriptive keyword.
timestamp: Unix timestamp for when the tag was applied.
Tags are useful for richer item similarity, but they are sparse and noisy. In this first notebook we use them mainly for understanding catalog semantics.
links.csv
Each row maps a MovieLens movie to external identifiers.
movieId: MovieLens movie identifier.
imdbId: IMDb identifier.
tmdbId: TMDb identifier.
This project does not need external metadata immediately, but the file is useful if we later want posters, richer genres, cast, crew, or production metadata.
Causal Setup Preview
Later notebooks will convert this preference dataset into a simulated recommendation setting:
A slate is a set of movies shown together to a user.
A treated item is a movie promoted into a more visible position in the slate.
A spillover-exposed item is another item in the same slate, especially a similar or substitutable movie.
A direct effect measures what happens to the promoted item.
An indirect or spillover effect measures what happens to competing items.
A total effect combines promoted-item gains and displaced-item losses at the slate or cluster level.
This notebook does not estimate those effects yet. It builds the data understanding and processed inputs needed to do that carefully.
1. Environment and Paths
This cell imports the libraries used for dataset inspection, plotting, and processed-data export. It also defines paths to the MovieLens zip file and the processed-data folder. The notebook reads directly from the zip file so we do not need to permanently extract a very large dataset into the repository.
```python
from pathlib import Path
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 60)
pd.set_option("display.float_format", lambda value: f"{value:,.4f}")

# Walk up from the working directory until the MovieLens zip is found.
candidate_roots = [Path.cwd(), *Path.cwd().parents]
PROJECT_DIR = next(
    root for root in candidate_roots
    if (root / "data" / "movieLens" / "ml-32m.zip").exists()
)
DATA_DIR = PROJECT_DIR / "data"
RAW_ZIP = DATA_DIR / "movieLens" / "ml-32m.zip"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

NOTEBOOK_DIR = PROJECT_DIR / "notebooks" / "interference_spillover_effects"
WRITEUP_DIR = NOTEBOOK_DIR / "writeup"
FIGURE_DIR = WRITEUP_DIR / "figures"
TABLE_DIR = WRITEUP_DIR / "tables"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
TABLE_DIR.mkdir(parents=True, exist_ok=True)

RAW_ZIP.exists(), RAW_ZIP
```
The path check should return True. If it does, the notebook has found the compressed MovieLens file and can read the CSVs directly from it. The writeup folders are also created here so later cells can save lightweight tables and figures without scattering files across the project.
2. Inspect the MovieLens Archive
Before loading data, we inspect the zip archive itself. This confirms which files are present, how large they are, and whether the archive matches the standard MovieLens 32M layout. This is a useful first notebook habit because many downstream errors come from assuming a folder layout that differs from the local download.
```python
with ZipFile(RAW_ZIP) as zf:
    archive_rows = []
    for info in zf.infolist():
        if not info.is_dir():
            archive_rows.append(
                {
                    "file": info.filename,
                    "compressed_mb": info.compress_size / 1_000_000,
                    "uncompressed_mb": info.file_size / 1_000_000,
                }
            )

archive_df = pd.DataFrame(archive_rows).sort_values("uncompressed_mb", ascending=False)
display(archive_df)
```
| file | compressed_mb | uncompressed_mb |
| --- | --- | --- |
| ml-32m/ratings.csv | 218.7859 | 877.0762 |
| ml-32m/tags.csv | 17.8637 | 72.3539 |
| ml-32m/movies.csv | 1.4578 | 4.2429 |
| ml-32m/links.csv | 0.8377 | 1.9507 |
| ml-32m/README.txt | 0.0037 | 0.0092 |
| ml-32m/checksums.txt | 0.0001 | 0.0002 |
The ratings file is by far the largest, so the rest of the notebook treats it carefully. The movie metadata is small enough to load fully, while ratings need chunked reading and a deterministic user sample. That keeps the EDA reproducible without requiring the notebook to hold all 32 million interactions in memory.
3. Load and Enrich the Movie Catalog
The movie catalog is the natural item table for the interference problem. This cell loads every movie, parses the release year from the title when available, splits the pipe-delimited genre string into a list, and creates a primary genre for simple grouping. Later notebooks can use these genre groups as item clusters where spillovers are most plausible.
The catalog gives us a broad item universe and a simple substitute structure through genres. This is important because interference is not expected to be equally strong between all movies. Promotion of one comedy is more likely to displace another comedy than an unrelated documentary, so genre-based grouping is a defensible first exposure model.
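The catalog enrichment described above can be sketched on toy rows. This is a minimal illustration, not the notebook's exact cell: the toy `movies` frame and the derived column names (`release_year`, `genre_list`, `primary_genre`) are assumptions based on the section's description.

```python
import pandas as pd

# Toy rows mimicking movies.csv; column names follow the field guide above.
movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Heat (1995)", "Untitled Film"],
        "genres": ["Adventure|Animation|Comedy", "Action|Crime|Thriller", "(no genres listed)"],
    }
)

# Parse a trailing "(YYYY)" release year from the title when present;
# titles without a year become missing values rather than errors.
year_str = movies["title"].str.extract(r"\((\d{4})\)\s*$")[0]
movies["release_year"] = pd.to_numeric(year_str, errors="coerce").astype("Int64")

# Split the pipe-delimited genre string into a list and keep the first
# entry as a simple primary genre for grouping.
movies["genre_list"] = movies["genres"].str.split("|")
movies["primary_genre"] = movies["genre_list"].str[0]

print(movies[["movieId", "release_year", "primary_genre"]])
```

Using a nullable `Int64` year keeps movies with no parseable year in the catalog instead of dropping them.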
4. Load a Tag Sample
Tags are optional for this first project stage, but they help us understand whether MovieLens has enough semantic information to support richer item similarity later. Because tags.csv is larger than the movie catalog and free-text tags can be messy, this cell reads a bounded sample and normalizes tags to lowercase for frequency checks.
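The bounded read and lowercase normalization can be sketched as follows. The inline CSV text stands in for `tags.csv`; in the notebook the frame would come from a `pd.read_csv` on the zip member with a row bound such as `nrows`.

```python
import io

import pandas as pd

# Toy stand-in for a bounded sample of tags.csv.
raw = io.StringIO(
    "userId,movieId,tag,timestamp\n"
    "1,1,Pixar,1139045764\n"
    "2,1,pixar,1139045800\n"
    "3,2,Al Pacino,1139046000\n"
)
tags = pd.read_csv(raw)

# Normalize tags to lowercase and strip whitespace before counting,
# so case variants collapse into one frequency bucket.
tags["tag_clean"] = tags["tag"].str.lower().str.strip()
tag_counts = tags["tag_clean"].value_counts()
print(tag_counts.head())
```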
The tag sample confirms that MovieLens contains semantic item annotations beyond genres. We will not rely on tags as a core causal variable yet because they are user-generated and unevenly distributed, but they are useful evidence that richer similarity modeling is available if genre clusters become too coarse.
5. Build a Deterministic Rating Sample
The ratings table has more than 32 million rows, so loading it all at once is unnecessary for exploratory work. Instead, this cell scans the ratings file in chunks and keeps all ratings for users whose userId is divisible by a fixed modulus. This creates a deterministic user-level sample: selected users keep their full rating histories, which is better for sequence and slate construction than randomly sampling isolated rows.
The scan also collects full-file summary statistics such as total row count, unique users, unique movies, rating distribution, and timestamp range.
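The chunked scan can be sketched on a tiny in-memory file. The modulus value and variable names here are illustrative; the notebook fixes its own modulus and collects more statistics than this sketch shows.

```python
import io

import pandas as pd

SAMPLE_MODULUS = 3  # hypothetical; the notebook uses its own fixed modulus

# Toy stand-in for ratings.csv.
raw = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,100\n"
    "3,10,3.5,101\n"
    "3,11,5.0,102\n"
    "4,12,2.0,103\n"
    "6,10,4.5,104\n"
)

sampled_chunks = []
total_rows = 0
rating_counts = pd.Series(dtype="int64")

# Stream the file in chunks: keep complete histories for users whose id
# is divisible by the modulus, while accumulating full-file statistics.
for chunk in pd.read_csv(raw, chunksize=2):
    total_rows += len(chunk)
    rating_counts = rating_counts.add(chunk["rating"].value_counts(), fill_value=0)
    sampled_chunks.append(chunk[chunk["userId"] % SAMPLE_MODULUS == 0])

ratings_sample = pd.concat(sampled_chunks, ignore_index=True)
print(total_rows, len(ratings_sample))
```

Because selection depends only on `userId`, rerunning the scan always produces the same sample, and every selected user keeps all of their rows.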
The sample keeps complete histories for a manageable set of users. That matters for this project because interference simulation needs realistic user-level candidate sets, not disconnected individual ratings. The full-file statistics also let us describe the original dataset honestly even though later modeling uses a smaller processed sample.
6. Check Rating Scale and Preference Signal
This cell compares the full-file rating distribution collected during chunked loading with the sampled-user distribution. A close match is reassuring because it means the deterministic user sample is not obviously distorting the basic preference signal.
The rating distribution is the first sanity check for the sampled data. If the sample had very different shares of high or low ratings, later simulated outcomes would inherit that distortion. Small differences are fine because the goal is not survey-grade sampling; the goal is a stable working dataset that preserves the main preference structure.
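The comparison amounts to aligning two share tables and checking the largest gap. The share values below are made up for illustration; only the alignment pattern reflects the check described above.

```python
import pandas as pd

# Hypothetical rating-share tables for the full file and the user sample.
full_share = pd.Series({3.0: 0.20, 4.0: 0.27, 5.0: 0.15})
sample_share = pd.Series({3.0: 0.21, 4.0: 0.26, 5.0: 0.16})

# Align on rating value and look at the largest absolute share gap;
# a small gap suggests the deterministic sample preserves the signal.
comparison = pd.DataFrame({"full": full_share, "sample": sample_share})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
max_gap = comparison["abs_diff"].max()
print(comparison)
```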
7. Plot the Rating Distribution
The table above is precise, but the plot makes the rating scale easier to read. We show full-file and sampled-user shares side by side. The liked outcome used later is based on ratings of 4.0 or higher, so the mass around 4.0 and 5.0 is especially important.
The sampled-user distribution should track the full distribution closely. This makes the sampled ratings acceptable for EDA, user-history features, and initial slate construction. The concentration of positive ratings also reminds us that MovieLens ratings are explicit preference events, not random impressions.
8. Join Ratings to Movie Metadata
For interference analysis, ratings alone are not enough. We need item context so we can ask whether promoted movies displace similar movies. This cell joins the sampled ratings to movie metadata and checks whether any sampled ratings lack catalog information.
The join quality check tells us whether MovieLens identifiers are internally consistent. A clean join means later notebooks can safely use movie genres, release years, and titles when defining substitute groups and explaining spillover mechanisms.
9. User Activity Distribution
Interference simulation needs users with enough history to form realistic candidate slates. This cell summarizes how many ratings each sampled user has, their average rating, their share of high ratings, and the time span of their activity.
The user activity distribution tells us how many users can support slate construction. Users with very few ratings are less useful because we cannot build a credible set of competing candidate items for them. Users with broader genre histories are especially useful because they create slates with both substitutes and non-substitutes.
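The per-user summary can be sketched with a grouped aggregation. The toy frame and the `user_stats` name are illustrative (a distinct name is used here so the sketch does not shadow the notebook's own `user_features` table).

```python
import pandas as pd

# Toy sampled ratings with timestamps.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2],
        "rating": [5.0, 4.0, 2.0, 3.0],
        "timestamp": [100, 200, 300, 150],
    }
)

# Per-user activity features: rating count, mean rating, share of
# "liked" ratings (>= 4.0), and the span of active time.
user_stats = ratings.groupby("userId").agg(
    n_ratings=("rating", "size"),
    mean_rating=("rating", "mean"),
    liked_share=("rating", lambda s: (s >= 4.0).mean()),
    first_ts=("timestamp", "min"),
    last_ts=("timestamp", "max"),
)
user_stats["active_span"] = user_stats["last_ts"] - user_stats["first_ts"]
print(user_stats)
```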
10. Plot User Activity and Rating Tendencies
This cell visualizes user heterogeneity. The left plot shows the long-tailed number of ratings per sampled user. The right plot shows the distribution of each user’s average rating. Both matter because recommendation simulations should account for heavy users and naturally generous or strict raters.
```python
fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
sns.histplot(user_features["n_ratings"], bins=60, ax=axes[0])
axes[0].set_xscale("log")
axes[0].set_title("Ratings per Sampled User")
axes[0].set_xlabel("Number of ratings, log scale")
axes[0].set_ylabel("Users")
sns.histplot(user_features["mean_rating"], bins=40, ax=axes[1], color="tab:green")
axes[1].set_title("Average Rating per Sampled User")
axes[1].set_xlabel("Mean rating")
axes[1].set_ylabel("Users")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "02_user_activity.png", dpi=160, bbox_inches="tight")
plt.show()
```
The log scale is intentional because user activity is usually very skewed. That skew matters for interference: a small number of very active users may create many plausible slates, while less active users may only support a few. Later simulations should avoid letting the heaviest users dominate all estimates.
11. Item Popularity and Quality Signals
This cell creates item-level features from the sampled ratings. These features describe how often each movie appears, its average rating, and its high-rating rate. Popularity is important for simulated ranking because promoted items are rarely assigned from a uniform catalog; recommendation systems tend to draw from relevant and popular candidates.
The top movies are not just descriptive; they reveal the attention distribution that a simulated recommender would inherit. If we promote already-popular movies, displacement may mostly affect other popular substitutes. If we promote niche movies, spillovers may look different. This is why popularity buckets are saved as item features.
12. Plot the Item Popularity Long Tail
Recommendation catalogs usually have a long tail: a few items receive many interactions, while most items receive few. This plot checks whether the MovieLens sample has that shape. A long tail is useful here because spillovers can be studied across popular, medium, and niche items.
```python
fig, ax = plt.subplots(figsize=(9, 4.5))
sns.histplot(item_features["sample_rating_count"], bins=80, ax=ax, color="tab:purple")
ax.set_xscale("log")
ax.set_title("Movie Popularity in Sampled Ratings")
ax.set_xlabel("Number of sampled ratings, log scale")
ax.set_ylabel("Movies")
plt.tight_layout()
fig.savefig(FIGURE_DIR / "03_item_popularity_long_tail.png", dpi=160, bbox_inches="tight")
plt.show()
```
The long-tail pattern supports the simulation plan. Interference is partly about scarce attention, and scarce attention is most interesting when items vary widely in baseline popularity. This also gives later notebooks a reason to report effects separately by item popularity tier.
13. Genre Coverage and Substitute Groups
Genres are the first approximation to item clusters. This cell explodes the multi-genre movie table so each movie contributes to every genre it belongs to, then computes catalog size, rating volume, average rating, and liked rate by genre.
The genre summary gives the first map of possible spillover neighborhoods. Large genres such as drama or comedy can support many within-genre substitute comparisons. Smaller genres may need to be combined or treated cautiously because sparse clusters can create noisy spillover estimates.
14. Plot Rating Volume by Genre
This visualization highlights which genres dominate the sampled preference data. For interference modeling, this helps decide where simulated slates will have enough similar items to create meaningful within-cluster competition.
The largest genres are natural starting points for substitute clusters. Later, when we simulate a promoted item, we can define spillover exposure as the number or share of same-genre items promoted nearby in the same slate. That turns genre EDA into a causal exposure model.
15. Time Coverage in the Rating Sample
MovieLens spans many years. This cell summarizes rating volume over calendar time so we can see whether the sample covers the same broad period as the full file. Time matters because user preferences, catalog composition, and platform behavior can all drift.
The yearly table checks temporal breadth and gives a first look at drift. We do not need to solve time drift in the first notebook, but later simulations should avoid accidentally mixing very old and very recent behavior without acknowledging it.
16. Plot Yearly Rating Activity
The plot makes the time trend easier to see than the table. We show sampled rating volume and average rating by year. This helps identify whether a small number of years dominate the sample.
The time trend is useful for deciding whether future notebooks should include calendar controls. Even in a simulation project, realistic time structure matters because old and new catalog items may have different popularity and competition patterns.
17. Top Tags in the Sample
This cell inspects the most common cleaned tags. Tags often capture actors, franchises, moods, themes, and other semantics that genres miss. We will not use them as the first cluster definition, but seeing them helps explain how the project could be extended beyond genre-level spillovers.
The tag sample suggests richer semantic neighborhoods are possible. For example, two movies in different genres might still compete if they share an actor, franchise, mood, or theme. The first version of the project will stay with genres for clarity, but tags are a credible future refinement.
18. Define an Interference-Ready Item Table
This cell creates the item table that later notebooks can reuse. It combines catalog metadata with sampled popularity and preference features. It also defines spillover_cluster, a first-pass substitute group based on primary genre. This is intentionally simple and explainable.
The spillover_cluster column is the bridge from EDA to causal design. In later notebooks, a movie’s potential outcomes can depend on its own promotion and on the promotion intensity among other movies in the same cluster or slate.
19. Build a Seed Dataset for Simulated Slates
MovieLens does not tell us which movies were shown together, so later notebooks need to simulate slates. This cell creates a seed table by selecting active users and taking a bounded set of their recent, highly rated movies. Each user’s seed slate is not yet a randomized experiment; it is a realistic candidate set that later notebooks can rank, promote, and use for spillover exposure construction.
The seed slate table is the key handoff from this notebook. It gives later notebooks user-specific candidate sets with item metadata and relevance labels. The next notebook can now focus on exposure mapping: which items are promoted, which same-cluster items are spillover-exposed, and how promotion changes simulated slate outcomes.
20. Save Processed Inputs
This cell saves the cleaned sample tables and readiness summaries. Saving these files keeps later notebooks fast and consistent: they can load the same sampled users, item features, and seed slates instead of repeating the expensive zip scan.
```python
ratings_output = PROCESSED_DIR / "movielens_interference_ratings_sample.parquet"
items_output = PROCESSED_DIR / "movielens_interference_items.parquet"
users_output = PROCESSED_DIR / "movielens_interference_user_features.parquet"
slates_output = PROCESSED_DIR / "movielens_interference_slate_seed.parquet"
readiness_output = PROCESSED_DIR / "movielens_interference_setup_readiness.csv"

ratings_enriched.to_parquet(ratings_output, index=False)
interference_items.to_parquet(items_output, index=False)
user_features.to_parquet(users_output, index=False)
slate_seed.to_parquet(slates_output, index=False)

readiness_checks = pd.DataFrame(
    [
        {
            "check": "raw_zip_found",
            "value": bool(RAW_ZIP.exists()),
            "notes": "MovieLens zip file is available locally.",
        },
        {
            "check": "ratings_sample_rows",
            "value": len(ratings_enriched),
            "notes": "Sample contains complete histories for selected users.",
        },
        {
            "check": "sample_unique_users",
            "value": ratings_enriched["userId"].nunique(),
            "notes": "Users available for user-level slate simulation.",
        },
        {
            "check": "sample_unique_movies",
            "value": ratings_enriched["movieId"].nunique(),
            "notes": "Movies available for item-level spillover analysis.",
        },
        {
            "check": "complete_seed_slates",
            "value": slate_seed["slate_id"].nunique(),
            "notes": "Complete seed slates available for exposure simulation.",
        },
        {
            "check": "seed_slate_rows",
            "value": len(slate_seed),
            "notes": "Rows in the slate seed table; should equal complete slates times slate size.",
        },
        {
            "check": "spillover_clusters",
            "value": interference_items["spillover_cluster"].nunique(),
            "notes": "Initial substitute clusters based on primary genre.",
        },
    ]
)
readiness_checks.to_csv(readiness_output, index=False)

saved_files = pd.DataFrame(
    {
        "artifact": [
            "ratings_sample",
            "interference_items",
            "user_features",
            "slate_seed",
            "readiness_checks",
        ],
        "path": [
            str(ratings_output),
            str(items_output),
            str(users_output),
            str(slates_output),
            str(readiness_output),
        ],
    }
)
display(readiness_checks)
display(saved_files)
```
| check | value | notes |
| --- | --- | --- |
| raw_zip_found | True | MovieLens zip file is available locally. |
| ratings_sample_rows | 617851 | Sample contains complete histories for selecte... |
| sample_unique_users | 4018 | Users available for user-level slate simulation. |
| sample_unique_movies | 22313 | Movies available for item-level spillover anal... |
| complete_seed_slates | 3000 | Complete seed slates available for exposure si... |
| seed_slate_rows | 36000 | Rows in the slate seed table; should equal com... |
| spillover_clusters | 20 | Initial substitute clusters based on primary g... |

| artifact | path |
| --- | --- |
| ratings_sample | /home/apex/Documents/ranking_sys/data/processe... |
| interference_items | /home/apex/Documents/ranking_sys/data/processe... |
| user_features | /home/apex/Documents/ranking_sys/data/processe... |
| slate_seed | /home/apex/Documents/ranking_sys/data/processe... |
| readiness_checks | /home/apex/Documents/ranking_sys/data/processe... |
The processed files make the interference project reproducible and modular. The important output is the seed slate table because it gives the next notebook a realistic starting point for simulating promotion assignments and spillover exposure. The user and item feature tables provide adjustment and segmentation variables.
21. Notebook Takeaways
This notebook established the foundation for an interference/spillover analysis using MovieLens:
MovieLens has strong user-item preference data but no true impression logs, so the causal design must be simulated.
Ratings provide a preference/relevance signal, while genres provide a transparent first substitute-cluster definition.
The sampled-user strategy keeps complete histories for selected users, which is more useful than isolated row sampling.
The seed slate table creates realistic user-specific candidate sets for later promotion and spillover simulations.
The next notebook should define exposure mappings: direct promotion, same-slate spillover exposure, same-cluster spillover exposure, and slate-level outcome construction.