03 IPS And SNIPS Policy Evaluation

This is the first off-policy evaluation notebook that actually estimates counterfactual policy values.

Notebook 1 established that Open Bandit has logged actions, rewards, context, and propensities. Notebook 2 studied the behavior policy and showed why the random/men slice is the cleanest starting point. Notebook 3 now asks:

Given logged data from a behavior policy, what would the click rate have been under a different recommendation policy?

We will estimate that value using two classic off-policy evaluation estimators:

  • IPS (inverse propensity scoring), which reweights each logged reward by pi_e / pi_b
  • SNIPS (self-normalized IPS), which uses the same weights but normalizes by their sum

The notebook uses simple evaluation policies on purpose. The goal is to make the mechanics of OPE clear before introducing reward models and doubly robust estimation in the next notebook.

Estimator Framing

For each logged recommendation event, we observe:

  • X: context available before recommendation
  • A: logged action, here the shown item_id
  • Y: reward, here click
  • pi_b(A|X): behavior-policy probability for the logged action, stored as propensity_score

For an evaluation policy pi_e, IPS estimates its value with:

V_IPS(pi_e) = mean((pi_e(A|X) / pi_b(A|X)) * Y)

The ratio pi_e(A|X) / pi_b(A|X) is the importance weight. It converts rewards collected under the behavior policy into an estimate of rewards under the evaluation policy.

SNIPS uses the same weights, but normalizes by their sum:

V_SNIPS(pi_e) = sum(w_i * Y_i) / sum(w_i)

where w_i = pi_e(A_i|X_i) / pi_b(A_i|X_i) is the importance weight of row i.

IPS is unbiased under standard assumptions when propensities are correct and support holds, but it can have high variance. SNIPS is usually biased in finite samples, but often more stable because it forces the average weight to behave like 1.
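To make the mechanics concrete before touching real data, here is a minimal numpy sketch on a made-up four-row log (toy numbers, not Open Bandit data):

import numpy as np

pi_b = np.array([0.25, 0.25, 0.25, 0.25])  # behavior-policy probability of each logged action
pi_e = np.array([0.50, 0.10, 0.30, 0.10])  # evaluation-policy probability of the same actions
y = np.array([1.0, 0.0, 1.0, 0.0])         # observed rewards (clicks)

w = pi_e / pi_b                      # importance weights: [2.0, 0.4, 1.2, 0.4]
v_ips = np.mean(w * y)               # (2.0 + 1.2) / 4 = 0.8
v_snips = np.sum(w * y) / np.sum(w)  # 3.2 / 4.0 = 0.8

Here the weights happen to average exactly 1, so the two estimators coincide. Whenever the realized mean weight drifts away from 1, they diverge, and SNIPS corrects for that drift at the cost of a small finite-sample bias.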

Notebook Setup

This cell imports the libraries used for policy construction, estimation, diagnostics, and plotting.

The notebook intentionally uses plain pandas and numpy functions rather than an OPE library. This makes the estimator formulas transparent, which is useful for a portfolio project and for interview conversations.

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

This cell prepares the notebook environment for IPS and SNIPS off-policy evaluation. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are ready for the OPE diagnostics that follow.

Locate Processed Open Bandit Samples

This cell finds the repository root and defines the cached sample paths produced by the earlier notebooks.

The main estimation work uses open_bandit_random_men_sample.parquet. We also load open_bandit_bts_men_sample.parquet later to show how the same candidate policies behave when evaluated from adaptive BTS logs.

RANDOM_SAMPLE_RELATIVE_PATH = Path("data/processed/open_bandit_random_men_sample.parquet")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / RANDOM_SAMPLE_RELATIVE_PATH).exists()
)

RANDOM_SAMPLE_PATH = PROJECT_ROOT / RANDOM_SAMPLE_RELATIVE_PATH
BTS_SAMPLE_PATH = PROJECT_ROOT / "data/processed/open_bandit_bts_men_sample.parquet"

pd.DataFrame(
    {
        "path_name": ["project_root", "random_sample", "bts_sample"],
        "path": [PROJECT_ROOT, RANDOM_SAMPLE_PATH, BTS_SAMPLE_PATH],
        "exists": [PROJECT_ROOT.exists(), RANDOM_SAMPLE_PATH.exists(), BTS_SAMPLE_PATH.exists()],
    }
)
path_name path exists
0 project_root /home/apex/Documents/ranking_sys True
1 random_sample /home/apex/Documents/ranking_sys/data/processe... True
2 bts_sample /home/apex/Documents/ranking_sys/data/processe... True

The printed paths are a reproducibility checkpoint. Once the notebook can find the cached sample files, the rest of the analysis can run without manual path edits.

Load The Random Behavior-Policy Log

This cell loads the random-policy sample that Notebook 1 cached. It then sorts by timestamp.

Sorting by time lets us create a simple train/evaluation split that mimics a realistic workflow: use earlier logged data to design a candidate policy, then estimate that policy’s value on later held-out logged data. This avoids learning and evaluating a policy on the exact same rows.

random_df = pd.read_parquet(RANDOM_SAMPLE_PATH).sort_values("timestamp").reset_index(drop=True)

random_df[["timestamp", "item_id", "position", "click", "propensity_score"]].head()
timestamp item_id position click propensity_score
0 2019-11-24 00:00:03.800821+00:00 0 1 0 0.029412
1 2019-11-24 00:00:03.801019+00:00 25 3 0 0.029412
2 2019-11-24 00:00:03.801099+00:00 23 2 0 0.029412
3 2019-11-24 00:00:17.634355+00:00 25 1 0 0.029412
4 2019-11-24 00:00:17.634998+00:00 30 2 0 0.029412

The preview confirms that the expected cached data is available. This check matters because all later OPE estimates depend on using the correct logged actions, rewards, contexts, and behavior propensities.

Create A Time-Based Train And Evaluation Split

This cell splits the random-policy sample into two halves:

  • train_df: used to define simple evaluation policies
  • eval_df: used to estimate policy values with IPS and SNIPS

This is important because a policy that is tuned on the same rewards used for OPE can look artificially good. The split keeps the policy-design step separate from the policy-evaluation step.

SPLIT_FRACTION = 0.50
split_idx = int(len(random_df) * SPLIT_FRACTION)

train_df = random_df.iloc[:split_idx].copy()
eval_df = random_df.iloc[split_idx:].copy()

time_split_summary = pd.DataFrame(
    {
        "split": ["train", "evaluation"],
        "rows": [len(train_df), len(eval_df)],
        "min_timestamp": [train_df["timestamp"].min(), eval_df["timestamp"].min()],
        "max_timestamp": [train_df["timestamp"].max(), eval_df["timestamp"].max()],
        "observed_click_rate": [train_df["click"].mean(), eval_df["click"].mean()],
    }
)

time_split_summary
split rows min_timestamp max_timestamp observed_click_rate
0 train 100000 2019-11-24 00:00:03.800821+00:00 2019-11-25 10:01:18.392921+00:00 0.005400
1 evaluation 100000 2019-11-25 10:01:18.393450+00:00 2019-11-27 02:50:16.027289+00:00 0.004980

The split separates policy construction from policy evaluation. This prevents using the same rows to design a policy and evaluate it, which would make the offline result too optimistic.

Define The Action Space

The action space is the set of items that a policy may recommend. For this notebook, we use the items observed in the random-policy sample.

Using a fixed action space makes policy probabilities easy to audit: every candidate policy must assign nonnegative probabilities that sum to 1 across these items. This is also where support matters: if an evaluation policy assigns probability to an item that never appears in the logged data, IPS cannot evaluate it. A minimal support check is sketched after this cell.

action_space = np.array(sorted(random_df["item_id"].unique()))
n_actions = len(action_space)

action_space_summary = pd.Series(
    {
        "n_actions": n_actions,
        "min_item_id": action_space.min(),
        "max_item_id": action_space.max(),
        "train_unique_items": train_df["item_id"].nunique(),
        "eval_unique_items": eval_df["item_id"].nunique(),
    }
).to_frame("value")

action_space_summary
value
n_actions 34
min_item_id 0
max_item_id 33
train_unique_items 34
eval_unique_items 34

The action-space definition fixes the set of items that evaluation policies are allowed to choose. OPE estimates are only meaningful for policies whose action probabilities live inside this logged support.
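As a minimal sketch of that support check (unsupported_actions is a hypothetical helper, not one of the notebook's cells), a policy is safely evaluable only if every action it can choose appears in the log:

def unsupported_actions(policy_probs, policy_actions, logged_actions):
    # Actions the evaluation policy can choose that never appear in the log.
    logged = set(logged_actions)
    return [a for a, p in zip(policy_actions, policy_probs) if p > 0 and a not in logged]

# Toy check: a policy that puts mass on a fictional item 99 fails the check.
unsupported_actions([0.9, 0.1], [0, 99], random_df["item_id"])  # -> [99]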

Estimate Item Statistics On The Train Split

This cell computes item-level exposure and click statistics using only the training rows.

The raw item click rate can be noisy because clicks are sparse. We therefore create a smoothed click rate using a simple empirical-Bayes style formula:

smoothed_ctr = (item_clicks + alpha * global_ctr) / (item_impressions + alpha)

The smoothing constant alpha shrinks low-exposure items toward the global click rate so a tiny number of lucky clicks does not dominate the policy.
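As a quick worked example with round numbers: with alpha = 50 and a global click rate near 0.0054, an item with 0 clicks in 100 impressions gets smoothed_ctr = (0 + 50 * 0.0054) / (100 + 50) ≈ 0.0018 rather than exactly 0, while an item with 1 lucky click in 10 impressions is pulled from a raw 0.10 down to (1 + 0.27) / 60 ≈ 0.021.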

SMOOTHING_ALPHA = 50
train_global_ctr = train_df["click"].mean()

item_stats = (
    train_df.groupby("item_id")
    .agg(train_impressions=("click", "size"), train_clicks=("click", "sum"), train_ctr=("click", "mean"))
    .reindex(action_space, fill_value=0)
    .rename_axis("item_id")
    .reset_index()
)
item_stats["smoothed_ctr"] = (
    item_stats["train_clicks"] + SMOOTHING_ALPHA * train_global_ctr
) / (item_stats["train_impressions"] + SMOOTHING_ALPHA)
item_stats["train_exposure_share"] = item_stats["train_impressions"] / item_stats["train_impressions"].sum()

item_stats.sort_values("smoothed_ctr", ascending=False).head(10)
item_id train_impressions train_clicks train_ctr smoothed_ctr train_exposure_share
31 31 2581 39 0.015110 0.014926 0.025810
27 27 3029 35 0.011555 0.011455 0.030290
23 23 2602 26 0.009992 0.009906 0.026020
14 14 2800 24 0.008571 0.008516 0.028000
21 21 3039 25 0.008226 0.008181 0.030390
0 0 2920 22 0.007534 0.007498 0.029200
22 22 3330 25 0.007508 0.007476 0.033300
20 20 2867 20 0.006976 0.006949 0.028670
12 12 2902 20 0.006892 0.006867 0.029020
2 2 3207 22 0.006860 0.006838 0.032070

These item statistics are computed from the training split alone. They are the only quality signals the candidate policies are allowed to see, which keeps the held-out evaluation rewards untouched.

Plot Training Item Quality Signals

This plot shows the smoothed training click rates for the top items. It is not an OPE result yet. It is only the signal used to build candidate evaluation policies.

A realistic recommender policy would usually be context-aware. Here we start with context-free item policies because they make IPS and SNIPS easy to inspect. Later notebooks can add reward models and context-aware policies.

top_items = item_stats.sort_values("smoothed_ctr", ascending=False).head(12).sort_values("smoothed_ctr")

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=top_items, x="smoothed_ctr", y="item_id", orient="h", ax=ax, color="#4C78A8")
ax.set_title("Top Items By Smoothed Training CTR")
ax.set_xlabel("Smoothed Training CTR")
ax.set_ylabel("Item ID")
ax.xaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

The item statistics estimate simple quality signals from the training split. These signals define benchmark evaluation policies while keeping the held-out evaluation log untouched.

Define Simple Evaluation Policies

This cell defines four simple candidate policies:

  • uniform: recommends every item with equal probability
  • exposure_popularity: recommends items in proportion to their training exposure share
  • ctr_weighted: recommends items in proportion to smoothed training CTR
  • epsilon_greedy_top_ctr: assigns most probability to the top smoothed-CTR item, while keeping exploration probability on all items

These policies are intentionally simple. They create clear differences in probability distributions, which makes IPS and SNIPS diagnostics easier to understand.

def normalize_probabilities(values):
    values = np.asarray(values, dtype=float)
    values = np.clip(values, 0, None)
    total = values.sum()
    if total <= 0:
        raise ValueError("Policy scores must have positive total mass.")
    return values / total

uniform_probs = np.full(n_actions, 1 / n_actions)
exposure_popularity_probs = normalize_probabilities(item_stats["train_exposure_share"].to_numpy())
ctr_weighted_probs = normalize_probabilities(item_stats["smoothed_ctr"].to_numpy())

epsilon = 0.15
epsilon_greedy_probs = np.full(n_actions, epsilon / n_actions)
top_ctr_index = int(item_stats["smoothed_ctr"].to_numpy().argmax())
epsilon_greedy_probs[top_ctr_index] += 1 - epsilon

policy_probability_df = pd.DataFrame(
    {
        "item_id": action_space,
        "uniform": uniform_probs,
        "exposure_popularity": exposure_popularity_probs,
        "ctr_weighted": ctr_weighted_probs,
        "epsilon_greedy_top_ctr": epsilon_greedy_probs,
        "smoothed_ctr": item_stats["smoothed_ctr"],
        "train_exposure_share": item_stats["train_exposure_share"],
    }
)

policy_probability_df.head()
item_id uniform exposure_popularity ctr_weighted epsilon_greedy_top_ctr smoothed_ctr train_exposure_share
0 0 0.029412 0.029200 0.040498 0.004412 0.007498 0.029200
1 1 0.029412 0.027590 0.025514 0.004412 0.004724 0.027590
2 2 0.029412 0.032070 0.036929 0.004412 0.006838 0.032070
3 3 0.029412 0.029650 0.020188 0.004412 0.003738 0.029650
4 4 0.029412 0.028410 0.017318 0.004412 0.003207 0.028410

The policy definitions create several offline candidates with different levels of targeting. Comparing simple policies first makes it easier to understand IPS, SNIPS, and DR behavior before moving to contextual policies.

Audit Policy Probability Mass

Every evaluation policy must assign a valid probability distribution over the action space. This cell checks that each policy sums to 1, has no negative probabilities, and gives every action at least some support.

The positive-support check is useful because deterministic policies can create brittle OPE. If a policy assigns zero probability to most actions, the estimate may depend on a very small subset of logged rows.

policy_cols = ["uniform", "exposure_popularity", "ctr_weighted", "epsilon_greedy_top_ctr"]

policy_audit = pd.DataFrame(
    [
        {
            "policy": policy,
            "probability_sum": policy_probability_df[policy].sum(),
            "min_probability": policy_probability_df[policy].min(),
            "max_probability": policy_probability_df[policy].max(),
            "actions_with_positive_probability": int((policy_probability_df[policy] > 0).sum()),
        }
        for policy in policy_cols
    ]
)

policy_audit
policy probability_sum min_probability max_probability actions_with_positive_probability
0 uniform 1.000000 0.029412 0.029412 34
1 exposure_popularity 1.000000 0.025810 0.033300 34
2 ctr_weighted 1.000000 0.009558 0.080613 34
3 epsilon_greedy_top_ctr 1.000000 0.004412 0.854412 34

The policy audit verifies that each evaluation policy assigns valid probability mass over the action space. This is a basic but important check: OPE formulas assume probabilities sum to one and remain nonnegative.

Plot Evaluation Policy Distributions

This plot shows how each candidate policy distributes probability across items. The items are ordered by smoothed training CTR.

The uniform and exposure-popularity policies should be relatively flat. The CTR-weighted policy should tilt toward better training items. The epsilon-greedy policy is intentionally concentrated on the top item while preserving some exploration mass on everything else.

policy_plot_df = (
    policy_probability_df.sort_values("smoothed_ctr", ascending=False)
    .assign(action_rank=lambda x: np.arange(1, len(x) + 1))
    .melt(
        id_vars=["item_id", "action_rank", "smoothed_ctr"],
        value_vars=policy_cols,
        var_name="policy",
        value_name="probability",
    )
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(data=policy_plot_df, x="action_rank", y="probability", hue="policy", marker="o", ax=ax)
ax.set_title("Evaluation Policy Probability By Item Rank")
ax.set_xlabel("Item Rank By Smoothed Training CTR")
ax.set_ylabel("Evaluation Policy Probability")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.1%}")
plt.show()

The policy distribution plot shows how concentrated each candidate policy is. More concentrated policies can have higher expected reward, but they usually create larger importance weights and higher variance.

Attach Evaluation Policy Probabilities To Logged Rows

IPS needs pi_e(A|X), the evaluation-policy probability of the action that was actually logged. Since our policies are context-free item policies, this means mapping each logged item_id to its policy probability.

The behavior-policy probability pi_b(A|X) is already in the data as propensity_score. The importance weight is pi_e / pi_b.

eval_scored = eval_df[["timestamp", "item_id", "position", "click", "propensity_score"]].copy()

for policy in policy_cols:
    probability_map = policy_probability_df.set_index("item_id")[policy]
    eval_scored[f"pi_e_{policy}"] = eval_scored["item_id"].map(probability_map)
    eval_scored[f"w_{policy}"] = eval_scored[f"pi_e_{policy}"] / eval_scored["propensity_score"]

missing_policy_probability = eval_scored[[f"pi_e_{policy}" for policy in policy_cols]].isna().sum()
missing_policy_probability
pi_e_uniform                   0
pi_e_exposure_popularity       0
pi_e_ctr_weighted              0
pi_e_epsilon_greedy_top_ctr    0
dtype: int64

Attaching evaluation-policy probabilities to logged rows creates the numerator of the importance weight. Once this is joined to the behavior propensity, IPS and SNIPS can estimate each policy’s value.

Preview The Scored Evaluation Log

This cell shows the held-out logged rows after adding evaluation-policy probabilities and importance weights.

A row can have a different weight under each evaluation policy. If a policy likes the logged item more than the behavior policy did, the weight is above 1; if it likes it less, the weight is below 1. For example, in the first previewed row the ctr_weighted policy assigns item 22 probability 0.0404 against the behavior propensity 0.0294, giving a weight of about 1.37.

preview_cols = ["timestamp", "item_id", "position", "click", "propensity_score"]
for policy in policy_cols:
    preview_cols.extend([f"pi_e_{policy}", f"w_{policy}"])

eval_scored[preview_cols].head()
timestamp item_id position click propensity_score pi_e_uniform w_uniform pi_e_exposure_popularity w_exposure_popularity pi_e_ctr_weighted w_ctr_weighted pi_e_epsilon_greedy_top_ctr w_epsilon_greedy_top_ctr
100000 2019-11-25 10:01:18.393450+00:00 22 2 0 0.029412 0.029412 1.000000 0.033300 1.132200 0.040379 1.372880 0.004412 0.150000
100001 2019-11-25 10:01:19.400109+00:00 22 2 0 0.029412 0.029412 1.000000 0.033300 1.132200 0.040379 1.372880 0.004412 0.150000
100002 2019-11-25 10:01:19.400113+00:00 10 3 0 0.029412 0.029412 1.000000 0.025860 0.879240 0.035384 1.203070 0.004412 0.150000
100003 2019-11-25 10:01:19.400175+00:00 25 1 0 0.029412 0.029412 1.000000 0.031250 1.062500 0.012367 0.420470 0.004412 0.150000
100004 2019-11-25 10:01:25.806862+00:00 0 1 0 0.029412 0.029412 1.000000 0.029200 0.992800 0.040498 1.376917 0.004412 0.150000

The scored-log preview confirms that each logged row now has behavior propensity, evaluation-policy probability, and reward information. This is the minimum row-level structure needed for IPS-family estimators.

Define IPS And SNIPS Helper Functions

This cell defines the estimator functions used throughout the notebook.

For IPS, the estimated value is the average of weight * reward. For SNIPS, the estimated value is the sum of weighted rewards divided by the sum of weights. The helper also computes an approximate standard error and 95% confidence interval.

For SNIPS, the standard error uses a common influence-function approximation: w * (Y - V_SNIPS) / mean(w). Dividing by mean(w) accounts for the random denominator in the SNIPS ratio. This is not a substitute for a full statistical treatment, but it is a useful and transparent diagnostic for this project.

def effective_sample_size(weights):
    # Kish effective sample size: (sum w)^2 / sum(w^2).
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


def estimate_ips_snips(reward, weight, policy_name, clip=None):
    reward = np.asarray(reward, dtype=float)
    weight = np.asarray(weight, dtype=float)
    if clip is not None:
        # Cap extreme importance weights before estimating anything.
        weight = np.minimum(weight, clip)

    # IPS: plain mean of weighted rewards, with the usual sample-mean SE.
    weighted_reward = weight * reward
    ips = weighted_reward.mean()
    ips_se = weighted_reward.std(ddof=1) / np.sqrt(len(weighted_reward))

    # SNIPS: weighted rewards normalized by the total weight, with an
    # influence-function approximation for the SE.
    weight_sum = weight.sum()
    snips = weighted_reward.sum() / weight_sum
    snips_influence = weight * (reward - snips) / weight.mean()
    snips_se = snips_influence.std(ddof=1) / np.sqrt(len(snips_influence))

    return {
        "policy": policy_name,
        "clip": "none" if clip is None else clip,
        "n": len(reward),
        "ips": ips,
        "ips_se": ips_se,
        "ips_ci_95_lower": ips - 1.96 * ips_se,
        "ips_ci_95_upper": ips + 1.96 * ips_se,
        "snips": snips,
        "snips_se": snips_se,
        "snips_ci_95_lower": snips - 1.96 * snips_se,
        "snips_ci_95_upper": snips + 1.96 * snips_se,
        "mean_weight": weight.mean(),
        "max_weight": weight.max(),
        "p99_weight": np.percentile(weight, 99),
        "ess": effective_sample_size(weight),
        "ess_share": effective_sample_size(weight) / len(weight),
    }

The helper functions encode the estimator formulas and diagnostics used repeatedly in the notebook. Defining them once keeps the later policy comparisons consistent and easier to audit.

Estimate Policy Values With IPS And SNIPS

This cell applies the estimators to each candidate evaluation policy on the held-out evaluation split.

The behavior policy here is the random Open Bandit policy. Since the random logging policy has broad support and stable propensities, this is the cleanest place to introduce IPS and SNIPS.

results = []
for policy in policy_cols:
    results.append(
        estimate_ips_snips(
            reward=eval_scored["click"],
            weight=eval_scored[f"w_{policy}"],
            policy_name=policy,
        )
    )

estimate_results = pd.DataFrame(results)
estimate_results
policy clip n ips ips_se ips_ci_95_lower ips_ci_95_upper snips snips_se snips_ci_95_lower snips_ci_95_upper mean_weight max_weight p99_weight ess ess_share
0 uniform none 100000 0.004980 0.000223 0.004544 0.005416 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000 100000.000000 1.000000
1 exposure_popularity none 100000 0.004971 0.000223 0.004535 0.005407 0.004968 0.000223 0.004532 0.005404 1.000591 1.132200 1.132200 99639.920774 0.996399
2 ctr_weighted none 100000 0.005172 0.000261 0.004660 0.005685 0.005188 0.000262 0.004675 0.005701 0.996987 2.740842 2.740842 79062.817039 0.790628
3 epsilon_greedy_top_ctr none 100000 0.006238 0.001267 0.003756 0.008720 0.006231 0.001261 0.003759 0.008703 1.001105 29.050000 29.050000 4029.027734 0.040290

The IPS and SNIPS estimates are the first direct counterfactual policy values. IPS is unbiased under correct propensities but can be high variance; SNIPS normalizes weights to trade some bias for stability.

Add The Observed Behavior-Policy Baseline

The observed click rate on the evaluation split estimates the value of the behavior policy that generated those rows. Because this split comes from random logging, the observed click rate is the random-policy baseline.

This baseline is not an off-policy estimate. It is the direct on-policy sample average for the behavior policy. We include it so IPS/SNIPS estimates have an intuitive reference point.

behavior_value = eval_scored["click"].mean()
behavior_se = eval_scored["click"].std(ddof=1) / np.sqrt(len(eval_scored))

behavior_baseline = pd.DataFrame(
    [
        {
            "policy": "behavior_random_observed",
            "estimator": "observed_mean",
            "estimate": behavior_value,
            "se": behavior_se,
            "ci_95_lower": behavior_value - 1.96 * behavior_se,
            "ci_95_upper": behavior_value + 1.96 * behavior_se,
            "ess_share": 1.0,
        }
    ]
)

behavior_baseline
policy estimator estimate se ci_95_lower ci_95_upper ess_share
0 behavior_random_observed observed_mean 0.004980 0.000223 0.004544 0.005416 1.000000

Adding the observed behavior-policy baseline gives a familiar reference point. Evaluation policies can now be interpreted as offline alternatives to the policy that generated the held-out log.

Reshape Estimates For Comparison

This cell reshapes the IPS and SNIPS results into one tidy table. A tidy table makes plotting and sorting easier.

The key columns are the policy name, estimator name, estimated value, confidence interval, and effective sample size share. A higher estimated value is better only if the estimate is stable and the policy is plausible.

comparison_rows = []
for _, row in estimate_results.iterrows():
    for estimator in ["ips", "snips"]:
        comparison_rows.append(
            {
                "policy": row["policy"],
                "estimator": estimator.upper(),
                "estimate": row[estimator],
                "se": row[f"{estimator}_se"],
                "ci_95_lower": row[f"{estimator}_ci_95_lower"],
                "ci_95_upper": row[f"{estimator}_ci_95_upper"],
                "ess_share": row["ess_share"],
                "mean_weight": row["mean_weight"],
                "max_weight": row["max_weight"],
            }
        )

estimate_comparison = pd.concat([behavior_baseline, pd.DataFrame(comparison_rows)], ignore_index=True)
estimate_comparison.sort_values(["policy", "estimator"])
policy estimator estimate se ci_95_lower ci_95_upper ess_share mean_weight max_weight
0 behavior_random_observed observed_mean 0.004980 0.000223 0.004544 0.005416 1.000000 NaN NaN
5 ctr_weighted IPS 0.005172 0.000261 0.004660 0.005685 0.790628 0.996987 2.740842
6 ctr_weighted SNIPS 0.005188 0.000262 0.004675 0.005701 0.790628 0.996987 2.740842
7 epsilon_greedy_top_ctr IPS 0.006238 0.001267 0.003756 0.008720 0.040290 1.001105 29.050000
8 epsilon_greedy_top_ctr SNIPS 0.006231 0.001261 0.003759 0.008703 0.040290 1.001105 29.050000
3 exposure_popularity IPS 0.004971 0.000223 0.004535 0.005407 0.996399 1.000591 1.132200
4 exposure_popularity SNIPS 0.004968 0.000223 0.004532 0.005404 0.996399 1.000591 1.132200
1 uniform IPS 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000
2 uniform SNIPS 0.004980 0.000223 0.004544 0.005416 1.000000 1.000000 1.000000

The reshaped table puts policy values, uncertainty, and diagnostics into a comparison-friendly format. This is the table that later plots and recommendation decisions build from.

Plot IPS And SNIPS Policy Value Estimates

This plot compares the estimated click rate for each candidate policy. The error bars show approximate 95% confidence intervals.

A policy can have a higher point estimate but also much wider uncertainty if its weights are concentrated. This is why OPE should always report both value estimates and diagnostics.

plot_estimates = estimate_comparison.copy()
plot_estimates["lower_error"] = plot_estimates["estimate"] - plot_estimates["ci_95_lower"]
plot_estimates["upper_error"] = plot_estimates["ci_95_upper"] - plot_estimates["estimate"]

policy_order = plot_estimates["policy"].unique()
estimator_order = ["observed_mean", "IPS", "SNIPS"]
colors = {
    "observed_mean": "#4C78A8",
    "IPS": "#F58518",
    "SNIPS": "#54A24B",
}
offsets = {
    "observed_mean": 0.0,
    "IPS": -0.16,
    "SNIPS": 0.16,
}

fig, ax = plt.subplots(figsize=(11, 5))

for estimator in estimator_order:
    subset = plot_estimates[plot_estimates["estimator"] == estimator]

    for _, row in subset.iterrows():
        x_base = list(policy_order).index(row["policy"])
        x = x_base + offsets[estimator]

        ax.errorbar(
            x=x,
            y=row["estimate"],
            yerr=[[row["lower_error"]], [row["upper_error"]]],
            fmt="o",
            color=colors[estimator],
            ecolor=colors[estimator],
            capsize=4,
            linewidth=1.5,
            markersize=6,
            label=estimator if row["policy"] == subset["policy"].iloc[0] else None,
        )

ax.set_xticks(range(len(policy_order)))
ax.set_xticklabels(policy_order, rotation=25, ha="right")
ax.set_title("Estimated Policy Values On Held-Out Random Logs")
ax.set_xlabel("Policy")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
ax.legend(title="Estimator")
plt.tight_layout()
plt.show()

The comparison plot makes the variance trade-off visible: epsilon_greedy_top_ctr has the highest point estimate but by far the widest interval, while the flatter policies sit near the observed random baseline with tight intervals.

Weight Diagnostics By Policy

This cell summarizes the importance weights for each candidate policy. The estimate is only as credible as the weights are stable.

The most important fields are:

  • mean_weight: should be near 1 when the evaluation policy is well supported
  • max_weight: flags extreme individual rows
  • ess_share: effective sample size as a fraction of rows

Low ESS means the estimate is being driven by a smaller effective number of observations.
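A back-of-envelope check shows why epsilon_greedy_top_ctr lands near 4%. Under random logging each of the 34 items is shown about 1/34 of the time, so in a typical 100 rows roughly 3 carry the large weight 0.854412 / 0.029412 ≈ 29.05 and the other 97 carry 0.004412 / 0.029412 = 0.15. That gives ESS = (97 * 0.15 + 3 * 29.05)^2 / (97 * 0.15^2 + 3 * 29.05^2) ≈ 101.7^2 / 2534 ≈ 4.1 effective rows per 100, matching the 4% ESS share in the table.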

weight_diagnostics = estimate_results[
    ["policy", "mean_weight", "max_weight", "p99_weight", "ess", "ess_share"]
].sort_values("ess_share")

weight_diagnostics
policy mean_weight max_weight p99_weight ess ess_share
3 epsilon_greedy_top_ctr 1.001105 29.050000 29.050000 4029.027734 0.040290
2 ctr_weighted 0.996987 2.740842 2.740842 79062.817039 0.790628
1 exposure_popularity 1.000591 1.132200 1.132200 99639.920774 0.996399
0 uniform 1.000000 1.000000 1.000000 100000.000000 1.000000

The weight diagnostics show whether each policy value estimate is supported by enough logged data. Policies with large weights or low ESS may look attractive but carry higher offline evaluation risk.

Plot Weight Risk With Percentiles And ESS

A single overlapping weight histogram is hard to read for distributions this skewed, so this plot uses two simpler diagnostics instead.

The left panel shows selected weight percentiles for each policy. This answers: how large do the typical and tail weights get? The right panel shows effective sample size share. This answers: how much of the held-out log is effectively being used after weighting?

This view is clearer for OPE because the exact shape of the full weight distribution matters less than the tail size and the amount of usable support. High p99 or max weights plus low ESS are the warning signs.

weight_summary_rows = []
for policy in policy_cols:
    weights = eval_scored[f"w_{policy}"].to_numpy()
    weight_summary_rows.append(
        {
            "policy": policy,
            "p50": np.percentile(weights, 50),
            "p90": np.percentile(weights, 90),
            "p95": np.percentile(weights, 95),
            "p99": np.percentile(weights, 99),
            "max": weights.max(),
            "ess_share": effective_sample_size(weights) / len(weights),
        }
    )

weight_summary_plot = pd.DataFrame(weight_summary_rows).sort_values("ess_share", ascending=False)
percentile_plot_df = weight_summary_plot.melt(
    id_vars="policy",
    value_vars=["p50", "p90", "p95", "p99", "max"],
    var_name="weight_stat",
    value_name="weight",
)

fig, axes = plt.subplots(1, 2, figsize=(13, 5), gridspec_kw={"width_ratios": [1.7, 1]})

sns.scatterplot(
    data=percentile_plot_df,
    x="weight",
    y="policy",
    hue="weight_stat",
    style="weight_stat",
    s=90,
    ax=axes[0],
)
axes[0].set_xscale("log")
axes[0].set_title("Weight Percentiles By Evaluation Policy")
axes[0].set_xlabel("Importance Weight, log scale")
axes[0].set_ylabel("Evaluation Policy")
axes[0].axvline(1, color="black", linestyle="--", linewidth=1, alpha=0.6)
axes[0].legend(title="Statistic", bbox_to_anchor=(1.02, 1), loc="upper left")

sns.barplot(
    data=weight_summary_plot,
    x="ess_share",
    y="policy",
    ax=axes[1],
    color="#4C78A8",
)
axes[1].set_title("Effective Sample Size Share")
axes[1].set_xlabel("ESS / rows")
axes[1].set_ylabel("")
axes[1].xaxis.set_major_formatter(lambda x, _: f"{x:.0%}")
axes[1].set_xlim(0, 1.05)

for container in axes[1].containers:
    axes[1].bar_label(container, fmt=lambda x: f"{x:.0%}", padding=3)

plt.tight_layout()
plt.show()

weight_summary_plot

policy p50 p90 p95 p99 max ess_share
0 uniform 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
1 exposure_popularity 0.998580 1.062500 1.130160 1.132200 1.132200 0.996399
2 ctr_weighted 0.852995 1.563756 2.103486 2.740842 2.740842 0.790628
3 epsilon_greedy_top_ctr 0.150000 0.150000 0.150000 29.050000 29.050000 0.040290

The percentile panel shows where the tail risk lives: epsilon_greedy_top_ctr has a median weight of just 0.15 but a maximum near 29, exactly the pattern that collapses its ESS share to about 4%.

Compare IPS And SNIPS Directly

This cell computes the difference between IPS and SNIPS for each policy. When mean weights are close to 1 and the sample is large, the two estimators are usually similar.

Large gaps between IPS and SNIPS are warning signs. They can indicate weight instability, finite-sample noise, or a policy that is poorly matched to the behavior logs.

ips_snips_gap = estimate_results.assign(
    snips_minus_ips=lambda x: x["snips"] - x["ips"],
    relative_gap_pct=lambda x: 100 * (x["snips"] - x["ips"]) / x["ips"],
)[["policy", "ips", "snips", "snips_minus_ips", "relative_gap_pct", "mean_weight", "ess_share"]]

ips_snips_gap
policy ips snips snips_minus_ips relative_gap_pct mean_weight ess_share
0 uniform 0.004980 0.004980 -0.000000 -0.000000 1.000000 1.000000
1 exposure_popularity 0.004971 0.004968 -0.000003 -0.059031 1.000591 0.996399
2 ctr_weighted 0.005172 0.005188 0.000016 0.302162 0.996987 0.790628
3 epsilon_greedy_top_ctr 0.006238 0.006231 -0.000007 -0.110378 1.001105 0.040290

The IPS-SNIPS gap is a stability diagnostic. Large gaps suggest that weight normalization materially changes the estimate, which usually means the policy has support or variance concerns.

Weight Clipping Sensitivity

Weight clipping caps large weights before estimating policy value. Clipping can reduce variance, but it introduces bias: with nonnegative rewards like clicks, capping weights can only pull the IPS estimate downward.
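Formally, clipping at threshold M replaces each weight with min(w_i, M), so clipped IPS is:

V_IPS_clip(pi_e) = mean(min(w_i, M) * Y_i)

This is exactly the np.minimum step inside estimate_ips_snips.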

This cell evaluates several clipping thresholds. A stable policy should not change dramatically when weights are clipped. A sensitive policy deserves caution in the final writeup.

clip_values = [None, 5, 10, 20]
clipping_rows = []
for policy in policy_cols:
    for clip in clip_values:
        clipping_rows.append(
            estimate_ips_snips(
                reward=eval_scored["click"],
                weight=eval_scored[f"w_{policy}"],
                policy_name=policy,
                clip=clip,
            )
        )

clipping_sensitivity = pd.DataFrame(clipping_rows)
clipping_sensitivity[["policy", "clip", "ips", "snips", "mean_weight", "max_weight", "ess_share"]]
policy clip ips snips mean_weight max_weight ess_share
0 uniform none 0.004980 0.004980 1.000000 1.000000 1.000000
1 uniform 5 0.004980 0.004980 1.000000 1.000000 1.000000
2 uniform 10 0.004980 0.004980 1.000000 1.000000 1.000000
3 uniform 20 0.004980 0.004980 1.000000 1.000000 1.000000
4 exposure_popularity none 0.004971 0.004968 1.000591 1.132200 0.996399
5 exposure_popularity 5 0.004971 0.004968 1.000591 1.132200 0.996399
6 exposure_popularity 10 0.004971 0.004968 1.000591 1.132200 0.996399
7 exposure_popularity 20 0.004971 0.004968 1.000591 1.132200 0.996399
8 ctr_weighted none 0.005172 0.005188 0.996987 2.740842 0.790628
9 ctr_weighted 5 0.005172 0.005188 0.996987 2.740842 0.790628
10 ctr_weighted 10 0.005172 0.005188 0.996987 2.740842 0.790628
11 ctr_weighted 20 0.005172 0.005188 0.996987 2.740842 0.790628
12 epsilon_greedy_top_ctr none 0.006238 0.006231 1.001105 29.050000 0.040290
13 epsilon_greedy_top_ctr 5 0.001669 0.005698 0.292833 5.000000 0.113115
14 epsilon_greedy_top_ctr 10 0.002619 0.005950 0.440083 10.000000 0.065279
15 epsilon_greedy_top_ctr 20 0.004519 0.006151 0.734583 20.000000 0.045723

The clipping sensitivity check shows how estimates change when extreme weights are capped. Stable estimates across clipping thresholds are more reassuring than estimates that depend strongly on a few high-weight rows.

Plot Clipping Sensitivity

This plot shows how IPS and SNIPS estimates move as the clipping threshold changes. The unclipped estimate is labeled none.

If a policy estimate changes sharply after clipping, then a few large-weight rows were materially influencing the estimate. That is not automatically wrong, but it should be discussed as a limitation.

clipping_plot_df = clipping_sensitivity.melt(
    id_vars=["policy", "clip"],
    value_vars=["ips", "snips"],
    var_name="estimator",
    value_name="estimate",
)
clipping_plot_df["clip_label"] = clipping_plot_df["clip"].astype(str)

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(
    data=clipping_plot_df,
    x="clip_label",
    y="estimate",
    hue="policy",
    style="estimator",
    marker="o",
    ax=ax,
)
ax.set_title("Policy Value Sensitivity To Weight Clipping")
ax.set_xlabel("Weight Clip")
ax.set_ylabel("Estimated Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

In the plot, only epsilon_greedy_top_ctr moves materially: its IPS estimate falls from about 0.62% unclipped to 0.17% at clip 5, while SNIPS stays comparatively stable because renormalization partially compensates for the capped weights. The value of that policy clearly rides on a few rows with weight near 29.

Estimated Lift Relative To Random Behavior

This cell converts estimates into lift over the observed random-policy baseline.

Lift is easier to discuss with product stakeholders because it answers: how much higher or lower is the estimated click rate relative to the current logged policy? Here lift is expressed in both percentage points and relative percent.

lift_table = estimate_comparison.query("estimator != 'observed_mean'").copy()
lift_table["baseline_click_rate"] = behavior_value
lift_table["lift_pp"] = 100 * (lift_table["estimate"] - behavior_value)
lift_table["relative_lift_pct"] = 100 * (lift_table["estimate"] / behavior_value - 1)

lift_table[["policy", "estimator", "estimate", "baseline_click_rate", "lift_pp", "relative_lift_pct", "ess_share"]].sort_values(
    "estimate", ascending=False
)
policy estimator estimate baseline_click_rate lift_pp relative_lift_pct ess_share
7 epsilon_greedy_top_ctr IPS 0.006238 0.004980 0.125800 25.261044 0.040290
8 epsilon_greedy_top_ctr SNIPS 0.006231 0.004980 0.125111 25.122784 0.040290
6 ctr_weighted SNIPS 0.005188 0.004980 0.020806 4.178004 0.790628
5 ctr_weighted IPS 0.005172 0.004980 0.019244 3.864167 0.790628
1 uniform IPS 0.004980 0.004980 0.000000 0.000000 1.000000
2 uniform SNIPS 0.004980 0.004980 0.000000 0.000000 1.000000
3 exposure_popularity IPS 0.004971 0.004980 -0.000904 -0.181598 0.996399
4 exposure_popularity SNIPS 0.004968 0.004980 -0.001198 -0.240522 0.996399

The lift table translates policy values into improvement over the random logging baseline. This framing is easier to communicate, but it should always be read alongside uncertainty and weight diagnostics.

Optional BTS Log Comparison

Notebook 2 showed that BTS has more variable propensities than random logging. This section evaluates the same candidate policies on a held-out BTS split to show how logging-policy choice affects estimator stability.

This is a diagnostic comparison, not a clean claim that one log is universally better than another. Random and BTS logs may cover different traffic patterns. The point is to see how adaptive logging changes weight behavior.

if not BTS_SAMPLE_PATH.exists():
    raise FileNotFoundError(
        f"Missing {BTS_SAMPLE_PATH}. Run 02_behavior_policy_and_propensities.ipynb first to create the BTS cache."
    )

bts_df = pd.read_parquet(BTS_SAMPLE_PATH).sort_values("timestamp").reset_index(drop=True)
bts_eval_df = bts_df.iloc[int(len(bts_df) * SPLIT_FRACTION) :].copy()

bts_scored = bts_eval_df[["timestamp", "item_id", "position", "click", "propensity_score"]].copy()
for policy in policy_cols:
    probability_map = policy_probability_df.set_index("item_id")[policy]
    bts_scored[f"pi_e_{policy}"] = bts_scored["item_id"].map(probability_map)
    bts_scored[f"w_{policy}"] = bts_scored[f"pi_e_{policy}"] / bts_scored["propensity_score"]

bts_scored[["timestamp", "item_id", "position", "click", "propensity_score", "w_uniform"]].head()
timestamp item_id position click propensity_score w_uniform
100000 2019-11-24 03:28:23.210755+00:00 10 3 0 0.022395 1.313318
100001 2019-11-24 03:28:23.233904+00:00 19 2 0 0.011710 2.511679
100002 2019-11-24 03:28:23.234656+00:00 1 3 0 0.025370 1.159313
100003 2019-11-24 03:28:23.234692+00:00 13 1 0 0.410980 0.071565
100004 2019-11-24 03:28:23.370371+00:00 1 3 0 0.025370 1.159313

The BTS comparison illustrates how a more adaptive logging policy can change support and estimator stability. This is a useful contrast because production logs often look more like BTS than a clean randomized policy.

Estimate The Same Policies From BTS Logs

This cell computes IPS and SNIPS estimates using the BTS held-out rows. The evaluation policies are the same policies learned from the random-policy training split.

Comparing the random-log estimates and BTS-log estimates helps illustrate an important OPE lesson: the estimator is not only about the target policy. It is also about how well the behavior logs support that target policy.

bts_results = []
for policy in policy_cols:
    bts_results.append(
        estimate_ips_snips(
            reward=bts_scored["click"],
            weight=bts_scored[f"w_{policy}"],
            policy_name=policy,
        )
    )

bts_estimate_results = pd.DataFrame(bts_results).assign(log_source="bts")
random_estimate_results = estimate_results.assign(log_source="random")

log_source_comparison = pd.concat([random_estimate_results, bts_estimate_results], ignore_index=True)
log_source_comparison[["log_source", "policy", "ips", "snips", "mean_weight", "max_weight", "ess_share"]]
log_source policy ips snips mean_weight max_weight ess_share
0 random uniform 0.004980 0.004980 1.000000 1.000000 1.000000
1 random exposure_popularity 0.004971 0.004968 1.000591 1.132200 0.996399
2 random ctr_weighted 0.005172 0.005188 0.996987 2.740842 0.790628
3 random epsilon_greedy_top_ctr 0.006238 0.006231 1.001105 29.050000 0.040290
4 bts uniform 0.004483 0.004779 0.937988 125.156446 0.098302
5 bts exposure_popularity 0.004458 0.004736 0.941207 123.489362 0.095822
6 bts ctr_weighted 0.004490 0.004752 0.944894 157.810018 0.098642
7 bts epsilon_greedy_top_ctr 0.006092 0.006408 0.950666 69.492620 0.036405

Estimated from BTS logs, the same candidate policies become much harder to evaluate: ESS shares collapse to roughly 4-10%, and maximum weights run from about 69 to 158 because the adaptive BTS propensities can be very small. The point estimates also shift somewhat, which is expected when two logging policies cover different traffic.

Plot ESS By Log Source

This plot compares effective sample size shares across random and BTS logs for the same evaluation policies.

When ESS is lower, the estimate effectively relies on fewer observations: the uniform policy's roughly 100,000 BTS rows behave like fewer than 10,000 effective observations. This is often the clearest visual explanation of why randomized exploration data is so valuable for off-policy evaluation.

fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=log_source_comparison, x="policy", y="ess_share", hue="log_source", ax=ax)
ax.set_title("Effective Sample Size Share By Log Source")
ax.set_xlabel("Evaluation Policy")
ax.set_ylabel("ESS Share")
ax.tick_params(axis="x", rotation=25)
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.0%}")
plt.tight_layout()
plt.show()

The side-by-side ESS bars make the cost of adaptive logging concrete: the same uniform policy that uses 100% of the random log effectively uses only about 10% of the BTS log.

Main Results Table

This cell creates a compact final table for Notebook 3. It includes the random-log estimates, confidence intervals, weight diagnostics, and estimated lift over the observed random behavior baseline.

This is the table that should be referenced in a project writeup before moving to doubly robust estimation.

final_results = lift_table.merge(
    weight_diagnostics,
    on="policy",
    how="left",
    suffixes=("", "_diagnostic"),
)
final_results = final_results[
    [
        "policy",
        "estimator",
        "estimate",
        "ci_95_lower",
        "ci_95_upper",
        "lift_pp",
        "relative_lift_pct",
        "mean_weight",
        "max_weight",
        "p99_weight",
        "ess_share",
    ]
].sort_values(["estimate", "ess_share"], ascending=[False, False])

final_results
policy estimator estimate ci_95_lower ci_95_upper lift_pp relative_lift_pct mean_weight max_weight p99_weight ess_share
6 epsilon_greedy_top_ctr IPS 0.006238 0.003756 0.008720 0.125800 25.261044 1.001105 29.050000 29.050000 0.040290
7 epsilon_greedy_top_ctr SNIPS 0.006231 0.003759 0.008703 0.125111 25.122784 1.001105 29.050000 29.050000 0.040290
5 ctr_weighted SNIPS 0.005188 0.004675 0.005701 0.020806 4.178004 0.996987 2.740842 2.740842 0.790628
4 ctr_weighted IPS 0.005172 0.004660 0.005685 0.019244 3.864167 0.996987 2.740842 2.740842 0.790628
0 uniform IPS 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
1 uniform SNIPS 0.004980 0.004544 0.005416 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
2 exposure_popularity IPS 0.004971 0.004535 0.005407 -0.000904 -0.181598 1.000591 1.132200 1.132200 0.996399
3 exposure_popularity SNIPS 0.004968 0.004532 0.005404 -0.001198 -0.240522 1.000591 1.132200 1.132200 0.996399

This final table is the headline deliverable of Notebook 3: each candidate policy's estimate, confidence interval, lift over the random baseline, and weight diagnostics in one place, ready to cite in the project writeup.

Notebook 3 Takeaways

This notebook introduced the first actual off-policy estimators in the project.

The most important lessons are:

  • IPS estimates policy value by reweighting rewards with pi_e / pi_b.
  • SNIPS uses the same weights but normalizes by the total weight, often improving stability.
  • Weight diagnostics are not optional; they are part of the estimate.
  • Randomized logging makes simple OPE much cleaner because the behavior policy explores broadly.
  • Concentrated policies can have attractive point estimates but lower effective sample size.
  • BTS logs are useful, but their adaptive propensities make weight behavior more delicate.

The next notebook introduces doubly robust OPE, which combines importance weighting with a reward model. That combination helps reduce variance while preserving some protection against model misspecification.
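As a preview, the standard doubly robust estimator augments the importance-weighted reward with a learned reward model q_hat:

V_DR(pi_e) = mean( sum_a pi_e(a|X) * q_hat(X, a) + w * (Y - q_hat(X, A)) )

The model term carries most of the signal, and the importance-weighted residual corrects whatever the model gets wrong; that correction is where the robustness comes from.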