01 Open Bandit EDA

This notebook starts Off-Policy Evaluation of Recommendation Systems.

What Is The Off-Policy Evaluation Problem?

In the bandit framing used here, a recommendation system is driven by a policy. A policy is a rule that maps a user context to a probability distribution over possible actions. In this project, the action is the item shown to the user, the context is the information available before the recommendation, and the reward is whether the user clicked.
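
To make the abstraction concrete, here is a minimal sketch of two toy policies (the three-action space and the score-based rule are illustrative, not part of the Open Bandit data):

import numpy as np

def uniform_policy(context, n_actions=3):
    """Ignores the context entirely: equal probability on every action."""
    return np.full(n_actions, 1.0 / n_actions)

def greedy_policy(scores):
    """Puts all probability on the highest-scoring action for this context."""
    probs = np.zeros(len(scores))
    probs[int(np.argmax(scores))] = 1.0
    return probs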

The central product question is: How would a new recommendation policy perform if we deployed it? In an online experiment, we could answer this by randomly assigning users to the new policy and directly observing clicks. But online experiments can be expensive, slow, risky, or unavailable during early research. Off-policy evaluation, usually shortened to OPE, tries to estimate the value of a new policy using historical logs collected by a different policy.

The historical policy is called the behavior policy or logging policy. It is the policy that actually chose the items in the data. The new policy we want to evaluate is called the evaluation policy. The difficulty is that each logged row only shows one action and one reward: the item that was actually recommended and whether it was clicked. We do not observe what the same user would have done if a different item had been shown. That missing counterfactual is the causal inference problem.

Open Bandit is useful because it logs the behavior-policy probability for the selected action. This probability is called the propensity score. If the behavior policy gave an action probability p, then rows with small p represent actions that were unlikely under the logger. Importance-based OPE estimators use these propensities to reweight logged outcomes so the historical data can mimic the action distribution of a different evaluation policy.

The simplest idea is inverse propensity scoring. If the evaluation policy would choose the logged action with probability pi_e and the behavior policy chose it with probability pi_b, then the row receives weight pi_e / pi_b. Rows that the evaluation policy likes more than the behavior policy get upweighted. Rows that the evaluation policy would rarely choose get downweighted. Averaging these weighted rewards estimates the value of the evaluation policy.
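
As a minimal sketch with synthetic numbers (not Open Bandit values):

import numpy as np

# Five synthetic logged rows: the observed reward, the behavior policy's
# probability for the logged action, and the evaluation policy's probability
# for that same action.
rewards = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
pi_b = np.array([0.25, 0.25, 0.50, 0.10, 0.40])
pi_e = np.array([0.50, 0.10, 0.25, 0.20, 0.40])

ips_weights = pi_e / pi_b
ips_value = (ips_weights * rewards).mean()  # IPS estimate of the evaluation policy's value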

This only works under important assumptions. First, the logged propensities must be correct. Second, the behavior policy must have positive probability for actions the evaluation policy might choose; this is called support or positivity. Third, the logged context must be rich enough that comparing policy choices is meaningful. Even when these assumptions hold, IPS can have high variance when propensities are small, so later notebooks will compare IPS with self-normalized IPS and doubly robust estimators.

The goal of this first notebook is not to estimate a new policy yet. The goal is to verify that the Open Bandit logs contain the ingredients required for OPE:

  • a logged action, item_id
  • an observed reward, click
  • a logging-policy probability, propensity_score
  • context features available before the recommendation was made
  • enough action support so a future evaluation policy can be compared against the behavior policy

We will use the random/men campaign first because the random behavior policy is easier to reason about than an adaptive bandit policy. This gives us a clean foundation before moving to IPS, SNIPS, and doubly robust policy value estimation.

Notebook Setup

This cell imports the libraries used throughout the notebook. pandas handles the tabular logs, numpy supports numerical summaries, matplotlib and seaborn create plots, and zipfile lets us read the Open Bandit data directly from the downloaded zip without manually extracting the full archive.

The display settings make wide Open Bandit tables easier to inspect because the dataset contains many user-item affinity columns.

from pathlib import Path
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 80)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

This cell prepares the notebook environment for exploring the Open Bandit data. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are in place for the OPE diagnostics that follow.

Locate The Dataset

This cell finds the repository root by walking upward from the current working directory until it sees the downloaded Open Bandit zip. This makes the notebook work whether it is launched from the repository root, from notebooks/, or from notebooks/projects/project_2_off_policy_evaluation/.

We keep the zip file as the source of truth. The Open Bandit files are large, so reading only the campaign we need is cleaner than extracting everything.

OPEN_BANDIT_ZIP_RELATIVE_PATH = Path("data/open_bandit/open_bandit_dataset.zip")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / OPEN_BANDIT_ZIP_RELATIVE_PATH).exists()
)

OPEN_BANDIT_ZIP = PROJECT_ROOT / OPEN_BANDIT_ZIP_RELATIVE_PATH
PROCESSED_DIR = PROJECT_ROOT / "data/processed"

OPEN_BANDIT_ZIP
PosixPath('/home/apex/Documents/ranking_sys/data/open_bandit/open_bandit_dataset.zip')

The printed path is a reproducibility checkpoint. Once the notebook can locate the raw zip and the processed-data folder, the rest of the analysis can run without manual path edits.

Inspect The Zip Contents

Before loading data, we inspect the archive inventory. This confirms which behavior policies and campaigns are available.

Open Bandit includes two behavior policies:

  • random: a randomized logging policy, useful as the cleanest starting point for OPE
  • bts: a Bernoulli Thompson Sampling policy, useful later for comparing adaptive logging behavior

It also includes three campaigns: all, men, and women. We begin with random/men because it is smaller and has a simple action space.

with ZipFile(OPEN_BANDIT_ZIP) as zf:
    zip_inventory = pd.DataFrame(
        [
            {
                "path": info.filename,
                "size_mb": info.file_size / 1_000_000,
                "is_dir": info.is_dir(),
            }
            for info in zf.infolist()
        ]
    )

zip_inventory.query("not is_dir").sort_values("path")
path size_mb is_dir
0 open_bandit_dataset/README 0.002659 False
11 open_bandit_dataset/VERSION 0.000034 False
2 open_bandit_dataset/bts/all/all.csv 6321.017454 False
1 open_bandit_dataset/bts/all/item_context.csv 0.010041 False
4 open_bandit_dataset/bts/men/item_context.csv 0.004287 False
5 open_bandit_dataset/bts/men/men.csv 1332.891122 False
7 open_bandit_dataset/bts/women/item_context.csv 0.005792 False
8 open_bandit_dataset/bts/women/women.csv 2913.666740 False
13 open_bandit_dataset/random/all/all.csv 695.501426 False
12 open_bandit_dataset/random/all/item_context.csv 0.010041 False
15 open_bandit_dataset/random/men/item_context.csv 0.004287 False
16 open_bandit_dataset/random/men/men.csv 151.946449 False
18 open_bandit_dataset/random/women/item_context.csv 0.005792 False
19 open_bandit_dataset/random/women/women.csv 331.703218 False

The archive inventory confirms which Open Bandit files are available and where they live inside the zip. This prevents accidental use of the wrong campaign, behavior policy, or file split before OPE begins.

Choose The First Analysis Slice

This cell defines the campaign files for Notebook 1. We use:

  • behavior policy: random
  • campaign: men
  • logged data file: open_bandit_dataset/random/men/men.csv
  • item context file: open_bandit_dataset/random/men/item_context.csv

The full random/men file is already large, so the first notebook reads a sample of rows. This is enough for EDA and support checks, while keeping iteration fast. Later notebooks can increase SAMPLE_ROWS or read the full file if needed.

BEHAVIOR_POLICY = "random"
CAMPAIGN = "men"
SAMPLE_ROWS = 200_000

LOG_MEMBER = f"open_bandit_dataset/{BEHAVIOR_POLICY}/{CAMPAIGN}/{CAMPAIGN}.csv"
ITEM_CONTEXT_MEMBER = f"open_bandit_dataset/{BEHAVIOR_POLICY}/{CAMPAIGN}/item_context.csv"

pd.DataFrame(
    {
        "setting": ["behavior_policy", "campaign", "sample_rows", "log_member", "item_context_member"],
        "value": [BEHAVIOR_POLICY, CAMPAIGN, SAMPLE_ROWS, LOG_MEMBER, ITEM_CONTEXT_MEMBER],
    }
)
setting value
0 behavior_policy random
1 campaign men
2 sample_rows 200000
3 log_member open_bandit_dataset/random/men/men.csv
4 item_context_member open_bandit_dataset/random/men/item_context.csv

This selection fixes the first behavior-policy and campaign slice for inspection. OPE depends heavily on the logging policy, so choosing the slice explicitly keeps later propensity and support checks easy to interpret.

Read The Dataset Documentation

The zip includes a short README describing the dataset. We print the relevant excerpt so the notebook itself records why this dataset is appropriate for off-policy evaluation.

The key phrase for this project is that each logged row contains a selected action, a reward, and the true propensity score from the behavior policy. Those are exactly the three quantities required for basic OPE estimators.

with ZipFile(OPEN_BANDIT_ZIP) as zf:
    readme_text = zf.read("open_bandit_dataset/README").decode("utf-8")

print(readme_text[:1800])
# Open Bandit Dataset

This is the full size version of *Open Bandit Dataset* that can be used for research on bandit algorithms and off-policy evaluation.
The small size example version of our data is available at https://github.com/st-tech/zr-obp/tree/master/obd

This dataset is released along with the paper:

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita.
A Large-scale Open Dataset for Bandit Algorithms. https://arxiv.org/abs/2008.07146

When using this dataset, please cite the paper with following bibtex:

@article{saito2020large,
  title={A Large-scale Open Dataset for Bandit Algorithms},
  author={Saito, Yuta, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita},
  journal={arXiv preprint arXiv:2008.07146},
  year={2020}
}


## Data description
Open Bandit Dataset is constructed in an A/B test of two multi-armed bandit policies in a large-scale fashion e-commerce platform, ZOZOTOWN (https://zozo.jp/).
It currently consists of a total of 26M rows, each one representing a user impression with some feature values, selected items as actions, true propensity scores, and click indicators as an outcome.
This is especially suitable for evaluating *off-policy evaluation* (OPE), which attempts to predict the counterfactual performance of hypothetical algorithms using data generated by a different algorithm.


## Fields
Here is a detailed description of the fields (they are comma-separated in the CSV files):

{behavior_policy}/{campaign}.csv (behavior_policy in (bts, random), campaign in (all, men, women))
- timestamp: timestamps of impressions.
- item_id: index of items as arms (index ranges from 0-80 in "All" campaign, 0-33 for "Men" campaign, and 0-46 "Women" campaign).
- position: the position of an item being recommended. 1, 2, or 3 correspond to left, cente

Reading the documentation anchors the column meanings before modeling. For OPE, this is especially important because reward, action, position, and propensity fields each have a specific estimator role.

Load Logged Impressions

This cell loads the first SAMPLE_ROWS from the selected Open Bandit campaign. The first column in the CSV is an unnamed index created when the dataset was exported, so we load it with index_col=0.

Each row is a logged recommendation event at a specific position. The important causal fields are:

  • item_id: the action chosen by the behavior policy
  • click: the observed reward
  • propensity_score: the probability that the behavior policy assigned to the logged action
  • user and affinity features: context available for modeling rewards or evaluation policies

with ZipFile(OPEN_BANDIT_ZIP) as zf:
    with zf.open(LOG_MEMBER) as f:
        df = pd.read_csv(f, nrows=SAMPLE_ROWS, index_col=0)

df.head()
timestamp item_id position click propensity_score user_feature_0 user_feature_1 user_feature_2 user_feature_3 user-item_affinity_0 user-item_affinity_1 user-item_affinity_2 user-item_affinity_3 user-item_affinity_4 user-item_affinity_5 user-item_affinity_6 user-item_affinity_7 user-item_affinity_8 user-item_affinity_9 user-item_affinity_10 user-item_affinity_11 user-item_affinity_12 user-item_affinity_13 user-item_affinity_14 user-item_affinity_15 user-item_affinity_16 user-item_affinity_17 user-item_affinity_18 user-item_affinity_19 user-item_affinity_20 user-item_affinity_21 user-item_affinity_22 user-item_affinity_23 user-item_affinity_24 user-item_affinity_25 user-item_affinity_26 user-item_affinity_27 user-item_affinity_28 user-item_affinity_29 user-item_affinity_30 user-item_affinity_31 user-item_affinity_32 user-item_affinity_33
0 2019-11-24 00:00:03.800821+00:00 0 1 0 0.029412 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 2019-11-24 00:00:03.801019+00:00 25 3 0 0.029412 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 2019-11-24 00:00:03.801099+00:00 23 2 0 0.029412 cef3390ed299c09874189c387777674a 03a5648a76832f83c859d46bc06cb64a 2723d2eb8bba04e0362098011fa3997b c39b0c7dd5d4eb9a18e7db6ba2f258f8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 2019-11-24 00:00:17.634355+00:00 25 1 0 0.029412 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 2019-11-24 00:00:17.634998+00:00 30 2 0 0.029412 1a2b2ad3a7f218a0d709dd9c656fda27 e3528f5280f04c0031d337da1def86ea 398773dacf8501ee8f76e3706ccafbba 47e7dd7d9ccbe31d57ce716dba831d44 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

The loaded table shape and preview confirm that the expected cached data is available. This check matters because all later OPE estimates depend on using the correct logged actions, rewards, contexts, and behavior propensities.

Load Item Context

The log contains user-side context and action identifiers. The item context table gives item-level features for the available actions in this campaign.

We load this separately because later notebooks can use item features when learning reward models for doubly robust OPE. In this notebook, the item context mostly helps us understand the action space and confirm that each item_id has metadata.

with ZipFile(OPEN_BANDIT_ZIP) as zf:
    with zf.open(ITEM_CONTEXT_MEMBER) as f:
        item_context = pd.read_csv(f, index_col=0)

item_context.head()
item_id item_feature_0 item_feature_1 item_feature_2 item_feature_3
0 0 -0.677183 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
1 1 -0.720300 3c2985d744e0d57c261abd7e541e4263 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
2 2 0.745662 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
3 3 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d
4 4 1.651109 01fe2f187e459e6ada960671d2942dfe b4b5879029fb5f64eeec63cf4f73ef0e b61cfaadd526b816e3aeb9b7be4b4759

The item context preview confirms that every action in this campaign has item-level metadata. This matters later because doubly robust OPE needs a reward model that can score candidate items, and item features make that model possible.

Basic Shape

This cell summarizes the size of the sample and the item context table. For OPE, the number of rows matters because importance weighting can have high variance. The number of actions matters because a larger action space usually makes overlap harder.

In the men campaign, the action space is much smaller than the all campaign, which makes it a good first notebook for clean reasoning.

basic_shape = pd.DataFrame(
    {
        "object": ["logged_sample", "item_context"],
        "rows": [len(df), len(item_context)],
        "columns": [df.shape[1], item_context.shape[1]],
    }
)

basic_shape
object rows columns
0 logged_sample 200000 43
1 item_context 34 5

The shape summary gives the scale of the logged recommendation data. OPE estimators can be noisy, so knowing the number of impressions, actions, and rewards gives context for every later uncertainty and support diagnostic.

Column Groups

Open Bandit has a compact schema, but the many affinity columns can make the table look intimidating. This cell groups the columns into interpretable families.

The affinity columns are especially important for later modeling. They are precomputed user-item affinity scores, meaning they can help predict clicks without using post-treatment information.

user_feature_cols = [col for col in df.columns if col.startswith("user_feature_")]
affinity_cols = [col for col in df.columns if col.startswith("user-item_affinity_")]
core_cols = ["timestamp", "item_id", "position", "click", "propensity_score"]
item_feature_cols = [col for col in item_context.columns if col.startswith("item_feature_")]

column_groups = pd.DataFrame(
    {
        "group": ["core logged fields", "user features", "user-item affinity features", "item features"],
        "n_columns": [len(core_cols), len(user_feature_cols), len(affinity_cols), len(item_feature_cols)],
        "columns": [core_cols, user_feature_cols, affinity_cols[:5] + ["..."], item_feature_cols],
    }
)

column_groups
group n_columns columns
0 core logged fields 5 [timestamp, item_id, position, click, propensi...
1 user features 4 [user_feature_0, user_feature_1, user_feature_...
2 user-item affinity features 34 [user-item_affinity_0, user-item_affinity_1, u...
3 item features 4 [item_feature_0, item_feature_1, item_feature_...

This output clarifies which columns describe users, items, actions, rewards, and propensities. That separation keeps the OPE setup clean: actions and rewards are targets of evaluation, while context fields support modeling and diagnostics.

Column Dictionary

This cell creates a lightweight dictionary for the main columns in the logged data. The point is to make the causal roles explicit:

  • item_id is the action
  • click is the reward
  • propensity_score is the behavior-policy probability
  • the remaining features are context

That mapping will carry through the rest of the off-policy evaluation notebooks.

column_dictionary = pd.DataFrame(
    [
        ("timestamp", "time", "When the recommendation impression happened."),
        ("item_id", "action", "The item selected by the behavior policy."),
        ("position", "slot/context", "The UI position where the item was shown: 1, 2, or 3."),
        ("click", "reward", "Binary outcome: 1 if the user clicked, 0 otherwise."),
        ("propensity_score", "logging probability", "Probability that the behavior policy selected the logged action."),
        ("user_feature_*", "context", "Categorical user-side features available before recommendation."),
        ("user-item_affinity_*", "context", "Precomputed affinity scores between user context and candidate items."),
        ("item_feature_*", "action context", "Item metadata that can be joined by item_id."),
    ],
    columns=["field", "causal_role", "description"],
)

column_dictionary
field causal_role description
0 timestamp time When the recommendation impression happened.
1 item_id action The item selected by the behavior policy.
2 position slot/context The UI position where the item was shown: 1, 2...
3 click reward Binary outcome: 1 if the user clicked, 0 other...
4 propensity_score logging probability Probability that the behavior policy selected ...
5 user_feature_* context Categorical user-side features available befor...
6 user-item_affinity_* context Precomputed affinity scores between user conte...
7 item_feature_* action context Item metadata that can be joined by item_id.

The dictionary pins down each field's causal role. Carrying this mapping through the series keeps estimator code readable: item_id, click, and propensity_score plug directly into IPS-style formulas, while everything else is context.

Data Types

This cell checks the raw data types. The user feature columns are hashed categorical values, the affinity columns are numeric, and the reward and propensity fields are numeric.

This matters because later reward models will need preprocessing: categorical columns should be encoded, while numeric affinity features can be passed directly into many models.

dtype_summary = (
    df.dtypes.astype(str)
    .rename("dtype")
    .reset_index()
    .rename(columns={"index": "column"})
)

dtype_summary
column dtype
0 timestamp str
1 item_id int64
2 position int64
3 click int64
4 propensity_score float64
5 user_feature_0 str
6 user_feature_1 str
7 user_feature_2 str
8 user_feature_3 str
9 user-item_affinity_0 float64
10 user-item_affinity_1 float64
11 user-item_affinity_2 float64
12 user-item_affinity_3 float64
13 user-item_affinity_4 float64
14 user-item_affinity_5 float64
15 user-item_affinity_6 float64
16 user-item_affinity_7 float64
17 user-item_affinity_8 float64
18 user-item_affinity_9 float64
19 user-item_affinity_10 float64
20 user-item_affinity_11 float64
21 user-item_affinity_12 float64
22 user-item_affinity_13 float64
23 user-item_affinity_14 float64
24 user-item_affinity_15 float64
25 user-item_affinity_16 float64
26 user-item_affinity_17 float64
27 user-item_affinity_18 float64
28 user-item_affinity_19 float64
29 user-item_affinity_20 float64
30 user-item_affinity_21 float64
31 user-item_affinity_22 float64
32 user-item_affinity_23 float64
33 user-item_affinity_24 float64
34 user-item_affinity_25 float64
35 user-item_affinity_26 float64
36 user-item_affinity_27 float64
37 user-item_affinity_28 float64
38 user-item_affinity_29 float64
39 user-item_affinity_30 float64
40 user-item_affinity_31 float64
41 user-item_affinity_32 float64
42 user-item_affinity_33 float64

The dtype summary confirms the expected split: hashed categorical user features, numeric affinity scores, and numeric reward and propensity fields. This tells us exactly which columns will need encoding before any reward model is fit.
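
As a sketch of what that encoding split could look like in a later notebook (assuming scikit-learn is available; reward_model_preprocessor is an illustrative name, and the column lists come from the Column Groups cell above):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hashed user categories get one-hot encoded with unseen values ignored;
# the numeric affinity columns pass through unchanged.
reward_model_preprocessor = ColumnTransformer(
    [
        ("user_categories", OneHotEncoder(handle_unknown="ignore"), user_feature_cols),
        ("affinities", "passthrough", affinity_cols),
    ]
)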

Missing Values

This cell computes the missing-value rate for every column and keeps only columns with nonzero missingness.

For OPE, missingness is not just a data-cleaning detail. If important context fields are missing systematically, a learned evaluation policy or reward model may be biased toward the rows where those features are present. A clean missingness check is therefore part of the causal diagnostics.

missing = df.isna().mean().sort_values(ascending=False).rename("missing_rate")
missing[missing > 0]
Series([], Name: missing_rate, dtype: float64)

The missingness check shows whether key OPE fields are complete. Missing rewards, actions, or propensity scores would be a serious blocker because IPS, SNIPS, and DR all depend on those logged quantities.

Parse Time Features

The raw timestamp is useful, but Open Bandit stores some timestamps with fractional seconds and some without them. For EDA, we parse these mixed timestamp strings and create simple calendar features such as date and hour. These are context features because they are known at recommendation time.

Time features can matter in recommendation systems because user intent and traffic patterns change over a day. Later, if an evaluation policy performs well only during certain hours, that would be a product-relevant heterogeneity finding.

df = df.assign(
    timestamp=pd.to_datetime(df["timestamp"], utc=True, format="mixed"),
)
df = df.assign(
    date=df["timestamp"].dt.date,
    hour=df["timestamp"].dt.hour,
)

df[["timestamp", "date", "hour"]].head()
timestamp date hour
0 2019-11-24 00:00:03.800821+00:00 2019-11-24 0
1 2019-11-24 00:00:03.801019+00:00 2019-11-24 0
2 2019-11-24 00:00:03.801099+00:00 2019-11-24 0
3 2019-11-24 00:00:17.634355+00:00 2019-11-24 0
4 2019-11-24 00:00:17.634998+00:00 2019-11-24 0

The parsed time features and coverage summary show when the logging data was collected. Time matters because policy behavior, item popularity, and reward rates can drift across the log window.

Reward And Action Summary

This cell summarizes the core bandit quantities:

  • sample size
  • click rate
  • number of distinct actions
  • number of positions
  • propensity range

A good first OPE dataset should have observed rewards, positive propensities, and enough action variation. The click rate also tells us how sparse the reward signal is.

summary = pd.Series(
    {
        "rows": len(df),
        "click_rate": df["click"].mean(),
        "clicked_rows": int(df["click"].sum()),
        "unique_items": df["item_id"].nunique(),
        "unique_positions": df["position"].nunique(),
        "min_propensity": df["propensity_score"].min(),
        "max_propensity": df["propensity_score"].max(),
        "mean_propensity": df["propensity_score"].mean(),
    }
).to_frame("value")

summary
value
rows 200000.000000
click_rate 0.005190
clicked_rows 1038.000000
unique_items 34.000000
unique_positions 3.000000
min_propensity 0.029412
max_propensity 0.029412
mean_propensity 0.029412

The reward and action summaries establish the base click rate and action space. Since clicks are sparse, later OPE estimates need careful variance and effective-sample-size diagnostics.

Click Distribution

Clicks are usually rare in recommendation logs. This cell counts clicked and non-clicked rows so the class imbalance is visible.

This matters for the next notebooks because rare rewards make policy value estimates noisy. Even with known propensities, a policy that places high weight on a small number of clicked events can have unstable IPS estimates.

click_counts = (
    df["click"]
    .value_counts()
    .rename_axis("click")
    .reset_index(name="rows")
    .assign(rate=lambda x: x["rows"] / x["rows"].sum())
)

click_counts
click rows rate
0 0 198962 0.994810
1 1 1038 0.005190

The counts make the imbalance concrete: roughly one impression in two hundred ends in a click. Policy-value differences will therefore be measured on a very small base rate.

Plot Click Distribution

This plot visualizes the reward imbalance from the previous table. The bar for click = 0 will be much larger than the bar for click = 1, which is expected for click logs.

The purpose is not to make a causal claim yet. It simply tells us that future policy-value estimators will be estimating a small expected reward.

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=click_counts, x="click", y="rows", ax=ax, color="#4C78A8")
ax.set_title("Click Outcome Counts")
ax.set_xlabel("Click")
ax.set_ylabel("Rows")
plt.show()

The plot confirms the sparsity visually. A handful of clicked rows can dominate weighted averages, which is exactly why later notebooks track estimator variance so closely.

Position Distribution

The position column describes where the selected item was displayed. In this interface, positions 1, 2, and 3 correspond to the available recommendation slots.

For off-policy evaluation, position is best treated as context rather than the main treatment. The action is the recommended item, while position can affect click probability and should be considered in reward modeling.

position_summary = (
    df.groupby("position")
    .agg(rows=("click", "size"), click_rate=("click", "mean"), avg_propensity=("propensity_score", "mean"))
    .reset_index()
    .assign(row_share=lambda x: x["rows"] / x["rows"].sum())
)

position_summary
position rows click_rate avg_propensity row_share
0 1 66653 0.005746 0.029412 0.333265
1 2 66679 0.005114 0.029412 0.333395
2 3 66668 0.004710 0.029412 0.333340

The position summary checks where logged recommendations appeared in the slate. Position can affect reward, so understanding its distribution helps interpret later policy-value comparisons.

Plot Position-Level CTR

This plot compares click rates across recommendation positions. A position effect is common in recommender systems because users see and react to slots differently.

This is useful context for later OPE: if an evaluation policy changes which items appear in which positions, the reward model should account for position rather than treating every impression as exchangeable.

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(data=position_summary, x="position", y="click_rate", ax=ax, color="#72B7B2")
ax.set_title("Click Rate By Recommendation Position")
ax.set_xlabel("Position")
ax.set_ylabel("Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

The position CTR plot shows whether click rates vary by slate position. This is a useful reminder that logged reward is shaped by presentation context, not only by the item identity.

Action Distribution

This cell counts how often each item was recommended in the sample. Under a random logging policy, we expect the item distribution to be relatively even, though small sample variation is normal.

Action support is central to OPE. If an evaluation policy chooses items that the behavior policy almost never selected, importance weights become large and estimates become unreliable.

action_summary = (
    df.groupby("item_id")
    .agg(rows=("click", "size"), click_rate=("click", "mean"), avg_propensity=("propensity_score", "mean"))
    .reset_index()
    .assign(row_share=lambda x: x["rows"] / x["rows"].sum())
    .sort_values("rows", ascending=False)
)

action_summary.head(15)
item_id rows click_rate avg_propensity row_share
7 7 6473 0.005562 0.029412 0.032365
22 22 6316 0.006175 0.029412 0.031580
11 11 6234 0.004491 0.029412 0.031170
17 17 6171 0.006482 0.029412 0.030855
25 25 6150 0.004065 0.029412 0.030750
19 19 6091 0.005254 0.029412 0.030455
6 6 6082 0.006084 0.029412 0.030410
2 2 6064 0.005607 0.029412 0.030320
32 32 6048 0.003638 0.029412 0.030240
0 0 6024 0.005644 0.029412 0.030120
21 21 5997 0.005836 0.029412 0.029985
18 18 5982 0.004179 0.029412 0.029910
9 9 5978 0.004015 0.029412 0.029890
8 8 5955 0.003862 0.029412 0.029775
16 16 5924 0.003038 0.029412 0.029620

The action distribution shows how often each item was logged. OPE relies on support: evaluation policies are only credible where the behavior policy logged enough comparable actions.

Plot Action Exposure Shares

This plot shows how much exposure each item receives in the random-policy sample. A nearly flat profile supports the idea that the behavior policy is exploring the action space broadly.

This is one reason Open Bandit is excellent for OPE research: the random policy creates a strong baseline for learning and evaluating alternative recommendation policies.

fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=action_summary.sort_values("item_id"), x="item_id", y="row_share", ax=ax, color="#F58518")
ax.set_title("Action Exposure Share Under Random Policy")
ax.set_xlabel("Item ID")
ax.set_ylabel("Share of Logged Rows")
ax.tick_params(axis="x", rotation=90)
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.1%}")
plt.show()

The action exposure plot reveals whether the logging policy spreads probability broadly or concentrates on a few items. Concentration creates support risk for policies that choose rarely logged actions.

Propensity Score Summary

The propensity score is the probability assigned by the behavior policy to the logged action. For basic IPS, each row receives weight evaluation_policy_probability / behavior_policy_probability.

This cell summarizes the logged propensities. Positive propensities are required for support. Very small propensities imply potentially large inverse-propensity weights, which can inflate estimator variance.

propensity_summary = df["propensity_score"].describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]).to_frame("propensity_score")
propensity_summary
propensity_score
count 200000.000000
mean 0.029412
std 0.000000
min 0.029412
1% 0.029412
5% 0.029412
50% 0.029412
95% 0.029412
99% 0.029412
max 0.029412

The propensity diagnostics show how much probability the behavior policy assigned to logged actions. Small propensities imply large inverse-propensity weights, which can make IPS estimates unstable.
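
For intuition, consider a purely hypothetical deterministic evaluation policy that always shows item 7 (the item choice is arbitrary). Its per-row IPS weight on this log would be:

# Probability 1 on rows where item 7 was logged, probability 0 elsewhere,
# so matching rows get weight 1/propensity and every other row gets 0.
pi_e_always_item_7 = (df["item_id"] == 7).astype(float)
hypothetical_weights = pi_e_always_item_7 / df["propensity_score"]

hypothetical_weights.max()  # 1 / 0.029412, i.e. 34: the largest possible weight here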

Plot Propensity Scores

This histogram shows the distribution of logged propensities. In a simple random policy over a fixed action set, this distribution should be concentrated around a constant value.

If this plot had a long left tail near zero, IPS would be risky. A stable random-policy propensity distribution is a good sign for the first OPE notebook sequence.

fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(df["propensity_score"], bins=30, ax=ax, color="#54A24B")
ax.set_title("Logged Propensity Score Distribution")
ax.set_xlabel("Propensity Score")
ax.set_ylabel("Rows")
plt.show()

For this random slice the histogram collapses to a single spike at 1/34. That is the best case for IPS: no logged row carries an unusually large inverse-propensity weight.

Check Random-Policy Uniformity

For the random/men campaign, the behavior policy should assign roughly equal probability across the available actions for a given position. This cell compares the observed propensity score to 1 / number_of_actions.

This is a sanity check, not a replacement for the logged propensities. In OPE we should use the propensities provided by the logging system, but this check helps build trust that the data slice behaves as expected.

n_actions = df["item_id"].nunique()
expected_uniform_propensity = 1 / n_actions

uniformity_check = pd.Series(
    {
        "n_actions": n_actions,
        "expected_uniform_propensity": expected_uniform_propensity,
        "observed_min_propensity": df["propensity_score"].min(),
        "observed_max_propensity": df["propensity_score"].max(),
        "max_abs_difference_from_uniform": (df["propensity_score"] - expected_uniform_propensity).abs().max(),
    }
).to_frame("value")

uniformity_check
value
n_actions 34.000000
expected_uniform_propensity 0.029412
observed_min_propensity 0.029412
observed_max_propensity 0.029412
max_abs_difference_from_uniform 0.000000

The uniformity check tests whether the random logging policy behaves as expected. Random logs are especially valuable for OPE because they usually provide broader support than production-style adaptive policies.

Inverse Propensity Weight Risk

Even before defining an evaluation policy, we can inspect the inverse behavior propensity 1 / propensity_score. This is the largest multiplier that appears when an evaluation policy puts all its probability on a logged action.

For random logging over many actions, inverse propensities can still be sizable. This is why later notebooks will compare IPS with self-normalized IPS and doubly robust estimators.

df = df.assign(inverse_behavior_propensity=1 / df["propensity_score"])

inverse_propensity_summary = df["inverse_behavior_propensity"].describe(
    percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]
).to_frame("inverse_behavior_propensity")

inverse_propensity_summary
inverse_behavior_propensity
count 200000.000000
mean 34.000000
std 0.000000
min 34.000000
1% 34.000000
5% 34.000000
50% 34.000000
95% 34.000000
99% 34.000000
max 34.000000

The inverse-propensity summary translates logging probabilities into variance risk. Large inverse propensities warn that a small number of rows may receive large weight in IPS-style estimators.
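
As a preview of the self-normalized fix, here is a sketch on synthetic weights and rewards (not values estimated from this log):

import numpy as np

# Plain IPS divides by the row count, so sparse large weights can push the
# estimate far above any plausible click rate. SNIPS divides by the realized
# weight sum, keeping the estimate on the reward scale.
weights = np.array([34.0, 34.0, 0.0, 0.0, 34.0])
rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

ips = (weights * rewards).mean()                   # 6.8, far above any click rate
snips = (weights * rewards).sum() / weights.sum()  # ~0.33, bounded by the max reward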

Effective Sample Size Illustration

Effective sample size is a rough diagnostic for weight concentration. The formula used here is:

ESS = (sum weights)^2 / sum(weights^2)

For this first check, we use inverse behavior propensities only. This is not the final OPE weight because we have not defined an evaluation policy yet. It simply illustrates the kind of diagnostic we will reuse once evaluation-policy probabilities are available.

weights = df["inverse_behavior_propensity"].to_numpy()
effective_sample_size = weights.sum() ** 2 / np.square(weights).sum()

pd.Series(
    {
        "rows": len(df),
        "illustrative_effective_sample_size": effective_sample_size,
        "ess_share_of_rows": effective_sample_size / len(df),
    }
).to_frame("value")
value
rows 200000.000000
illustrative_effective_sample_size 200000.000000
ess_share_of_rows 1.000000

Effective sample size turns weight concentration into an intuitive sample-size diagnostic. A low ESS means the estimator has less usable information than the raw row count suggests.
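
To see the diagnostic react, reuse the hypothetical always-item-7 policy from the propensity section. Its weights are zero except on item-7 rows, so the ESS collapses to roughly the number of rows where item 7 was actually logged:

# Weight 1/propensity on item-7 rows and 0 everywhere else.
concentrated_weights = np.where(df["item_id"] == 7, 1 / df["propensity_score"], 0.0)
ess_concentrated = concentrated_weights.sum() ** 2 / np.square(concentrated_weights).sum()

ess_concentrated / len(df)  # about 0.03: only ~3% of the rows carry information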

User Feature Cardinality

The user feature columns are hashed categorical features. This cell counts their distinct values in the sample.

High-cardinality categorical features can be useful for prediction, but they require careful encoding. For Notebook 1, we only inspect them. In later notebooks, simple baselines may use a smaller feature set before moving to richer models.

user_feature_cardinality = pd.DataFrame(
    {
        "column": user_feature_cols,
        "unique_values": [df[col].nunique(dropna=False) for col in user_feature_cols],
        "missing_rate": [df[col].isna().mean() for col in user_feature_cols],
    }
)

user_feature_cardinality
column unique_values missing_rate
0 user_feature_0 4 0.000000
1 user_feature_1 6 0.000000
2 user_feature_2 10 0.000000
3 user_feature_3 10 0.000000

The cardinality table shows that the user context features are coarse in this sample, with at most ten distinct values each. That keeps encoding simple for reward models and contextual policies, though it also bounds how finely personalization can slice the audience.

Affinity Feature Summary

The user-item affinity columns are numeric scores. This cell summarizes them in a compact way by reporting average, standard deviation, and the share of zeros.

The zero share matters because many affinity features may be sparse. Sparse affinity signals are still valuable, but they influence what kind of reward model is appropriate later.

affinity_summary = pd.DataFrame(
    {
        "column": affinity_cols,
        "mean": [df[col].mean() for col in affinity_cols],
        "std": [df[col].std() for col in affinity_cols],
        "zero_share": [(df[col] == 0).mean() for col in affinity_cols],
    }
).sort_values("mean", ascending=False)

affinity_summary.head(15)
column mean std zero_share
14 user-item_affinity_14 0.011320 0.120756 0.989925
29 user-item_affinity_29 0.006325 0.095263 0.994605
0 user-item_affinity_0 0.005415 0.079282 0.994900
11 user-item_affinity_11 0.003345 0.059278 0.996745
3 user-item_affinity_3 0.002580 0.052472 0.997510
7 user-item_affinity_7 0.002565 0.052043 0.997510
30 user-item_affinity_30 0.002340 0.049240 0.997705
18 user-item_affinity_18 0.002130 0.048944 0.998005
2 user-item_affinity_2 0.001835 0.045185 0.998270
10 user-item_affinity_10 0.001695 0.041499 0.998320
1 user-item_affinity_1 0.001560 0.039466 0.998440
23 user-item_affinity_23 0.001245 0.037329 0.998805
24 user-item_affinity_24 0.001155 0.034405 0.998860
27 user-item_affinity_27 0.001095 0.038129 0.999070
32 user-item_affinity_32 0.001005 0.035693 0.999085

The affinity summaries show which user preference signals vary meaningfully in the log. These features can help reward models learn context-dependent click patterns in later notebooks.

Plot Top Affinity Signals

This plot shows the affinity columns with the largest average values. It helps identify which affinity dimensions carry the most signal in the sample.

This is not feature importance yet. It is only a descriptive check of the logged context features.

top_affinity = affinity_summary.head(12).sort_values("mean")

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=top_affinity, x="mean", y="column", ax=ax, color="#B279A2")
ax.set_title("Top User-Item Affinity Features By Mean Value")
ax.set_xlabel("Mean Affinity")
ax.set_ylabel("Affinity Feature")
plt.show()

The plot isolates the affinity dimensions with the most activity. Even the strongest signals are zero on more than 98% of rows, which will shape the choice of reward model later.

Join Item Context

This cell merges item metadata onto the logged rows by item_id. A successful merge confirms that each logged action has item-side context.

Item features are useful for later reward modeling and policy learning because they let us generalize beyond raw item IDs. For example, a policy might learn that certain item feature patterns perform better for certain user contexts.

df_with_items = df.merge(item_context, on="item_id", how="left", validate="many_to_one")

merge_check = pd.Series(
    {
        "rows_before_merge": len(df),
        "rows_after_merge": len(df_with_items),
        "item_feature_missing_rows": df_with_items[item_feature_cols].isna().any(axis=1).sum(),
    }
).to_frame("value")

merge_check
value
rows_before_merge 200000
rows_after_merge 200000
item_feature_missing_rows 0

The item-context output confirms that item metadata can be attached to logged actions. This improves later reward models because OPE needs to predict rewards for candidate actions, not just the action that happened to be logged.

Item Feature Missingness

After the join, we check missingness in the item feature columns. Low or zero missingness means the item context table cleanly covers the actions appearing in the logged data.

This is important because doubly robust OPE depends on a reward model. Missing item features can make the reward model weaker or force extra imputation choices.

item_missing = (
    df_with_items[item_feature_cols]
    .isna()
    .mean()
    .sort_values(ascending=False)
    .rename("missing_rate")
)

item_missing
item_feature_0   0.000000
item_feature_1   0.000000
item_feature_2   0.000000
item_feature_3   0.000000
Name: missing_rate, dtype: float64

The item features are complete for every logged action, so doubly robust reward modeling will not need item-side imputation in later notebooks.

Item Context Preview

This cell shows one row per item after joining item-level metadata to action-level click summaries. It gives a compact view of which items are common, which items have high click rates, and what metadata is attached to them.

This is useful for product storytelling: later we can explain that the evaluation policy is choosing among real items with observable item features, not abstract treatment labels.

item_profile = (
    action_summary.merge(item_context, on="item_id", how="left")
    .sort_values("click_rate", ascending=False)
)

item_profile.head(15)
item_id rows click_rate avg_propensity row_share item_feature_0 item_feature_1 item_feature_2 item_feature_3
30 31 5526 0.010496 0.029412 0.027630 -0.461600 55fe518d85813954c7d9b8a875ff2453 7c63a6aa72e655abd1787c2e64385e6f bbf748c6c978938bc63d432efa60191c
20 27 5834 0.009085 0.029412 0.029170 -0.849649 61c5d8c2524684aa047e15e172c7e92f 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d
31 23 5519 0.008154 0.029412 0.027595 -0.569392 55fe518d85813954c7d9b8a875ff2453 cc75031396a5aa830885915aa93f49d0 b61cfaadd526b816e3aeb9b7be4b4759
28 14 5619 0.007475 0.029412 0.028095 -1.000557 9874ffb54e9b0a269e29bbb2f5328735 3f1feafd79578bedf199c459fecc378b bbf748c6c978938bc63d432efa60191c
32 13 5348 0.007105 0.029412 0.026740 0.616313 9874ffb54e9b0a269e29bbb2f5328735 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759
25 12 5735 0.006800 0.029412 0.028675 1.198386 ce58bf66d7e62186e6ce01bafeea9d39 697cbf60c7c4b8569c149721231538c3 b61cfaadd526b816e3aeb9b7be4b4759
3 17 6171 0.006482 0.029412 0.030855 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d
1 22 6316 0.006175 0.029412 0.031580 -0.698741 61c5d8c2524684aa047e15e172c7e92f 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
6 6 6082 0.006084 0.029412 0.030410 1.651109 ce58bf66d7e62186e6ce01bafeea9d39 7082af732502f0981a9fe77d7ba1ae8a b61cfaadd526b816e3aeb9b7be4b4759
10 21 5997 0.005836 0.029412 0.029985 -0.698741 9874ffb54e9b0a269e29bbb2f5328735 ce1abd8b5d914ba8fe719b453bc5ba3b 5bc9c86cd1f08a9991670ea97b34f86d
9 0 6024 0.005644 0.029412 0.030120 -0.677183 ce58bf66d7e62186e6ce01bafeea9d39 7c5498711d69681385d21c0e26923e7e bbf748c6c978938bc63d432efa60191c
7 2 6064 0.005607 0.029412 0.030320 0.745662 3c2985d744e0d57c261abd7e541e4263 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
0 7 6473 0.005562 0.029412 0.032365 2.858372 01fe2f187e459e6ada960671d2942dfe 2b851c0a9c4a961da8760d5dc747c5a3 b61cfaadd526b816e3aeb9b7be4b4759
18 5 5868 0.005453 0.029412 0.029340 0.142031 01fe2f187e459e6ada960671d2942dfe c43671ed6855a6fe2e2a6030cba64366 bbf748c6c978938bc63d432efa60191c
5 19 6091 0.005254 0.029412 0.030455 -0.763416 3c2985d744e0d57c261abd7e541e4263 5d5dd3635cb3f84d3a70f5874a132d44 5bc9c86cd1f08a9991670ea97b34f86d

The per-item profile ties click performance to concrete metadata, which will make later policy comparisons easier to explain in product terms.

Time Coverage

This cell checks the time range covered by the sampled rows. Time coverage matters because logged recommendation behavior can shift over time.

For Notebook 1 we use the first rows in the file, so this is not a random sample across the full archive. If later notebooks need stronger time coverage, we can sample chunks from different parts of the file or process the full campaign into parquet.
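
One possible pattern, sketched with an illustrative helper name, keeps a fraction of every chunk so the sampled rows span the whole file:

# Sketch: stream the CSV in chunks and keep a small fraction of each chunk,
# so the sample covers the whole log window instead of only the first rows.
def sample_across_file(zip_path, member, frac=0.01, chunk_rows=500_000, seed=0):
    pieces = []
    with ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            for chunk in pd.read_csv(f, index_col=0, chunksize=chunk_rows):
                pieces.append(chunk.sample(frac=frac, random_state=seed))
    return pd.concat(pieces, ignore_index=True)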

time_coverage = pd.Series(
    {
        "min_timestamp": df["timestamp"].min(),
        "max_timestamp": df["timestamp"].max(),
        "unique_dates": df["date"].nunique(),
        "unique_hours": df["hour"].nunique(),
    }
).to_frame("value")

time_coverage
value
min_timestamp 2019-11-24 00:00:03.800821+00:00
max_timestamp 2019-11-27 02:50:16.027289+00:00
unique_dates 4
unique_hours 24

The sample spans about three days at the start of the log window. That is adequate for EDA, but it is not a random slice of the full campaign, so drift-sensitive analyses in later notebooks should sample more broadly.

Click Rate By Hour

This plot checks whether reward rates vary by hour. Time-of-day variation is common in recommendation systems, and it can matter for policy evaluation if a policy performs differently across traffic windows.

Because this notebook reads the first rows in the file, the hour coverage may be limited. The plot is still useful as a template for the full-data version.

hourly_ctr = (
    df.groupby("hour")
    .agg(rows=("click", "size"), click_rate=("click", "mean"))
    .reset_index()
)

fig, ax = plt.subplots(figsize=(9, 4))
sns.lineplot(data=hourly_ctr, x="hour", y="click_rate", marker="o", ax=ax, color="#E45756")
ax.set_title("Click Rate By Hour")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

hourly_ctr

hour rows click_rate
0 0 8070 0.004337
1 1 9888 0.003843
2 2 9163 0.003711
3 3 8882 0.005855
4 4 7405 0.004727
5 5 7190 0.004590
6 6 7800 0.006154
7 7 8429 0.002966
8 8 9212 0.004776
9 9 9971 0.004513
10 10 10800 0.004074
11 11 13803 0.004854
12 12 17439 0.005792
13 13 17475 0.004807
14 14 15737 0.006418
15 15 12067 0.006713
16 16 5833 0.005657
17 17 3266 0.010410
18 18 2044 0.004403
19 19 1509 0.004639
20 20 1622 0.003699
21 21 2927 0.004441
22 22 4413 0.009517
23 23 5055 0.005341

The hourly CTR table checks whether reward rates drift by time of day. If time patterns are visible, they should be considered in diagnostics or model features rather than ignored.

OPE Readiness Checklist

This cell turns the main EDA findings into a checklist. A dataset is ready for introductory OPE if:

  • every row has an action
  • every row has a reward
  • every row has a positive logged propensity
  • the action space has repeated observations
  • the logged policy explores the available actions

Passing these checks does not prove any evaluation policy will be easy to estimate. It only tells us the basic data ingredients are present.

ope_checks = pd.DataFrame(
    [
        {
            "check": "logged action is present",
            "value": df["item_id"].notna().all(),
            "detail": f"{df['item_id'].isna().sum()} missing item_id values",
        },
        {
            "check": "observed reward is present",
            "value": df["click"].notna().all(),
            "detail": f"{df['click'].isna().sum()} missing click values",
        },
        {
            "check": "propensity score is present",
            "value": df["propensity_score"].notna().all(),
            "detail": f"{df['propensity_score'].isna().sum()} missing propensity values",
        },
        {
            "check": "propensity score is positive",
            "value": bool((df["propensity_score"] > 0).all()),
            "detail": f"minimum propensity = {df['propensity_score'].min():.6f}",
        },
        {
            "check": "multiple actions observed",
            "value": df["item_id"].nunique() > 1,
            "detail": f"{df['item_id'].nunique()} unique items",
        },
        {
            "check": "each action has repeated support",
            "value": action_summary["rows"].min() > 1,
            "detail": f"minimum rows per item = {action_summary['rows'].min()}",
        },
        {
            "check": "item context covers logged actions",
            "value": df_with_items[item_feature_cols].isna().any(axis=1).sum() == 0,
            "detail": f"{df_with_items[item_feature_cols].isna().any(axis=1).sum()} rows missing item features",
        },
    ]
)

ope_checks
check value detail
0 logged action is present True 0 missing item_id values
1 observed reward is present True 0 missing click values
2 propensity score is present True 0 missing propensity values
3 propensity score is positive True minimum propensity = 0.029412
4 multiple actions observed True 34 unique items
5 each action has repeated support True minimum rows per item = 5342
6 item context covers logged actions True 0 rows missing item features

The readiness checklist summarizes whether the log has the essentials for OPE: rewards, actions, propensities, support, and enough sample size. Passing these checks justifies moving from EDA to estimator design.

Baseline Behavior Policy Value

This cell estimates the observed value of the behavior policy in the sample. Because the logged data comes from the behavior policy itself, the average click rate is the direct sample estimate of behavior-policy value.

In later notebooks, off-policy evaluation will ask a harder question: what would the value have been under a different policy that selected different item probabilities for each context?

behavior_policy_value = pd.Series(
    {
        "behavior_policy": BEHAVIOR_POLICY,
        "campaign": CAMPAIGN,
        "sample_rows": len(df),
        "observed_policy_value_click_rate": df["click"].mean(),
        "standard_error": df["click"].std(ddof=1) / np.sqrt(len(df)),
    }
).to_frame("value")

behavior_policy_value
value
behavior_policy random
campaign men
sample_rows 200000
observed_policy_value_click_rate 0.005190
standard_error 0.000161

The observed behavior-policy value is the logged baseline. It is not the value of a new policy, but it gives a reference point for whether evaluated policies appear better or worse offline.

Cache A Lightweight Analysis Sample

This final code cell writes the sampled and joined table to data/processed/open_bandit_random_men_sample.parquet. The next notebook can read this parquet file quickly instead of repeatedly scanning the zip archive.

The cached file is a convenience artifact, not a new source of truth. The original downloaded zip remains the raw dataset.

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
processed_sample_path = PROCESSED_DIR / "open_bandit_random_men_sample.parquet"

df_with_items.to_parquet(processed_sample_path, index=False)

processed_sample_path
PosixPath('/home/apex/Documents/ranking_sys/data/processed/open_bandit_random_men_sample.parquet')

This cell saves reusable outputs for downstream notebooks or the final writeup. Persisting these artifacts makes the project modular and prevents later notebooks from repeating expensive or fragile setup work.

Notebook 1 Takeaways

The random/men Open Bandit slice is suitable for off-policy evaluation because it has the core OPE ingredients: logged actions, binary rewards, positive behavior-policy propensities, and context features.

The most important practical finding is that the random policy gives broad action support. That makes it a strong starting point for IPS and SNIPS in Notebook 2. The reward is sparse, so later notebooks should pay close attention to estimator variance and compare direct IPS estimates against more stable doubly robust estimates.