02 Behavior Policy And Propensity Diagnostics

This notebook is the bridge between basic EDA and actual off-policy estimators.

Notebook 1 showed that Open Bandit has the core ingredients for off-policy evaluation: logged actions, observed rewards, context features, and behavior-policy propensities. Notebook 2 asks a more focused question:

Can we trust the logging-policy information enough to use it in IPS, self-normalized IPS, and doubly robust OPE?

That means we will study the behavior policy itself. We will look at action support, position support, propensity score distributions, inverse-propensity weight risk, and a preview comparison between the random/men and bts/men behavior policies.

This notebook still does not try to make a final product claim about a new recommendation policy. Its job is diagnostic: understand where OPE will be stable, where it may be noisy, and why the random/men campaign is the safest starting point.

OPE Recap: Why Propensities Matter

In off-policy evaluation, we use data generated by one policy to estimate the value of another policy.

The behavior policy or logging policy generated the historical data.
The evaluation policy is the new policy we want to evaluate offline.
The action is the item shown to the user.
The reward is the observed click.
The propensity score is the probability that the behavior policy assigned to the logged action.

The key reweighting idea is:

weight = evaluation_policy_probability / behavior_policy_probability

If the evaluation policy would choose the logged item more often than the behavior policy did, that row receives a weight greater than 1. If the evaluation policy would choose it less often, the weight is below 1, and it is 0 when the evaluation policy never chooses that item.

This creates two practical requirements. First, the behavior propensity must be positive whenever the evaluation policy might choose that action. Second, very small behavior propensities create large weights, which can make estimates unstable. This notebook focuses on those two issues.
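The reweighting idea above can be sketched on a handful of made-up rows. This is a minimal illustration, not the estimator code used later in the project: the toy data, the function names, and the uniform evaluation probability of 0.25 are all assumptions.

```python
# Toy logged rows: (click, behavior_propensity, evaluation_probability).
# Every number here is invented for illustration.
logged = [
    (1, 0.50, 0.25),
    (0, 0.25, 0.25),
    (1, 0.05, 0.25),  # rarely logged action, so its weight is large
    (0, 0.20, 0.25),
]


def importance_weights(rows):
    """Return evaluation_probability / behavior_propensity for each row."""
    weights = []
    for _, behavior_p, evaluation_p in rows:
        if behavior_p <= 0:
            raise ValueError("behavior propensity must be positive")
        weights.append(evaluation_p / behavior_p)
    return weights


def ips_estimate(rows):
    """Plain IPS: the average of weight * reward over all logged rows."""
    weights = importance_weights(rows)
    return sum(w * row[0] for w, row in zip(weights, rows)) / len(rows)


def snips_estimate(rows):
    """Self-normalized IPS: divide by the sum of weights instead of the row count."""
    weights = importance_weights(rows)
    return sum(w * row[0] for w, row in zip(weights, rows)) / sum(weights)


weights = importance_weights(logged)
ips = ips_estimate(logged)
snips = snips_estimate(logged)
```

In this toy log the row with behavior propensity 0.05 gets a weight of 5 and pushes the plain IPS estimate to 1.375, which is impossible for a click rate. Self-normalization keeps the estimate between 0 and 1, but the single heavy row still dominates it.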

Notebook Setup

This cell imports the libraries used for diagnostics. We use zipfile to read the BTS comparison sample directly from the Open Bandit archive, while the random-policy sample comes from the parquet cache written by Notebook 1.

The plotting settings match the rest of the project so the notebook reads like one coherent portfolio artifact.

from pathlib import Path
from zipfile import ZipFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.6f}".format)

sns.set_theme(style="whitegrid", context="notebook")

This cell prepares the notebook environment for the behavior-policy propensity and support diagnostics. There is no estimator output yet; the main value is that the imports, display settings, and plotting defaults are ready for the OPE diagnostics that follow.

Locate Project Files

This cell finds the repository root by walking upward until it sees the Open Bandit zip. It then defines the paths used in this notebook.

The random-policy sample is loaded from data/processed/open_bandit_random_men_sample.parquet, which was created by Notebook 1. The BTS sample will be read from the raw zip and cached to data/processed/open_bandit_bts_men_sample.parquet for later notebooks.

OPEN_BANDIT_ZIP_RELATIVE_PATH = Path("data/open_bandit/open_bandit_dataset.zip")
PROJECT_ROOT = next(
    path
    for path in [Path.cwd(), *Path.cwd().parents]
    if (path / OPEN_BANDIT_ZIP_RELATIVE_PATH).exists()
)

OPEN_BANDIT_ZIP = PROJECT_ROOT / OPEN_BANDIT_ZIP_RELATIVE_PATH
PROCESSED_DIR = PROJECT_ROOT / "data/processed"
RANDOM_SAMPLE_PATH = PROCESSED_DIR / "open_bandit_random_men_sample.parquet"
BTS_SAMPLE_PATH = PROCESSED_DIR / "open_bandit_bts_men_sample.parquet"

pd.DataFrame(
    {
        "path_name": ["project_root", "open_bandit_zip", "random_sample", "bts_sample_cache"],
        "path": [PROJECT_ROOT, OPEN_BANDIT_ZIP, RANDOM_SAMPLE_PATH, BTS_SAMPLE_PATH],
        "exists": [PROJECT_ROOT.exists(), OPEN_BANDIT_ZIP.exists(), RANDOM_SAMPLE_PATH.exists(), BTS_SAMPLE_PATH.exists()],
    }
)

	path_name	path	exists
0	project_root	/home/apex/Documents/ranking_sys	True
1	open_bandit_zip	/home/apex/Documents/ranking_sys/data/open_ban...	True
2	random_sample	/home/apex/Documents/ranking_sys/data/processe...	True
3	bts_sample_cache	/home/apex/Documents/ranking_sys/data/processe...	True

The printed paths are a reproducibility checkpoint. Once the notebook can find the project root, the raw Open Bandit archive, and the cached samples, the rest of the analysis can run without manual path edits.

Load The Random-Policy Sample

This cell loads the cached random/men sample from Notebook 1. The random campaign is our clean starting point because the behavior policy explores the action space broadly and assigns nearly uniform propensities.

We add a behavior_policy label so that tables and plots can compare random logging with BTS logging later in the notebook.

if not RANDOM_SAMPLE_PATH.exists():
    raise FileNotFoundError(
        f"Missing {RANDOM_SAMPLE_PATH}. Run 01_open_bandit_eda.ipynb first to create the cached sample."
    )

random_df = pd.read_parquet(RANDOM_SAMPLE_PATH).assign(behavior_policy="random")

random_df.head()

	timestamp	item_id	position	propensity_score	user_feature_0	user_feature_1	user_feature_2	user_feature_3	date	inverse_behavior_propensity	item_feature_0	item_feature_1	item_feature_2	item_feature_3	behavior_policy
0	2019-11-24 00:00:03.800821+00:00	0	1	0.029412	cef3390ed299c09874189c387777674a	03a5648a76832f83c859d46bc06cb64a	2723d2eb8bba04e0362098011fa3997b	c39b0c7dd5d4eb9a18e7db6ba2f258f8	2019-11-24	34.000000	-0.677183	ce58bf66d7e62186e6ce01bafeea9d39	7c5498711d69681385d21c0e26923e7e	bbf748c6c978938bc63d432efa60191c	random
1	2019-11-24 00:00:03.801019+00:00	25	3	0.029412	cef3390ed299c09874189c387777674a	03a5648a76832f83c859d46bc06cb64a	2723d2eb8bba04e0362098011fa3997b	c39b0c7dd5d4eb9a18e7db6ba2f258f8	2019-11-24	34.000000	-0.461600	9874ffb54e9b0a269e29bbb2f5328735	7c5498711d69681385d21c0e26923e7e	bbf748c6c978938bc63d432efa60191c	random
2	2019-11-24 00:00:03.801099+00:00	23	2	0.029412	cef3390ed299c09874189c387777674a	03a5648a76832f83c859d46bc06cb64a	2723d2eb8bba04e0362098011fa3997b	c39b0c7dd5d4eb9a18e7db6ba2f258f8	2019-11-24	34.000000	-0.569392	55fe518d85813954c7d9b8a875ff2453	cc75031396a5aa830885915aa93f49d0	b61cfaadd526b816e3aeb9b7be4b4759	random
3	2019-11-24 00:00:17.634355+00:00	25	1	0.029412	1a2b2ad3a7f218a0d709dd9c656fda27	e3528f5280f04c0031d337da1def86ea	398773dacf8501ee8f76e3706ccafbba	47e7dd7d9ccbe31d57ce716dba831d44	2019-11-24	34.000000	-0.461600	9874ffb54e9b0a269e29bbb2f5328735	7c5498711d69681385d21c0e26923e7e	bbf748c6c978938bc63d432efa60191c	random
4	2019-11-24 00:00:17.634998+00:00	30	2	0.029412	1a2b2ad3a7f218a0d709dd9c656fda27	e3528f5280f04c0031d337da1def86ea	398773dacf8501ee8f76e3706ccafbba	47e7dd7d9ccbe31d57ce716dba831d44	2019-11-24	34.000000	-0.914324	61c5d8c2524684aa047e15e172c7e92f	3f1feafd79578bedf199c459fecc378b	bbf748c6c978938bc63d432efa60191c	random

The preview confirms that the expected cached sample is available. This check matters because all later OPE estimates depend on using the correct logged actions, rewards, contexts, and behavior propensities.

Load A BTS Comparison Sample

Open Bandit also includes logs from a Bernoulli Thompson Sampling policy, abbreviated as bts. BTS is adaptive: it uses prior feedback to put more probability on items it currently believes are promising.

This cell reads the first SAMPLE_ROWS rows from bts/men/men.csv, so the BTS sample has the same number of rows as the random sample. We parse timestamps using format="mixed" (available in pandas 2.0 and later) because Open Bandit contains both fractional-second and whole-second timestamp strings. We also compute inverse_behavior_propensity, which is the raw 1 / propensity_score diagnostic used later.

SAMPLE_ROWS = len(random_df)
BTS_LOG_MEMBER = "open_bandit_dataset/bts/men/men.csv"

with ZipFile(OPEN_BANDIT_ZIP) as zf:
    with zf.open(BTS_LOG_MEMBER) as f:
        bts_df = pd.read_csv(f, nrows=SAMPLE_ROWS, index_col=0)

bts_df = bts_df.assign(
    timestamp=pd.to_datetime(bts_df["timestamp"], utc=True, format="mixed"),
    behavior_policy="bts",
)
bts_df = bts_df.assign(
    date=bts_df["timestamp"].dt.date,
    hour=bts_df["timestamp"].dt.hour,
    inverse_behavior_propensity=1 / bts_df["propensity_score"],
)

bts_df.head()

	timestamp	item_id	position	propensity_score	user_feature_0	user_feature_1	user_feature_2	user_feature_3	behavior_policy	date	inverse_behavior_propensity
0	2019-11-24 00:00:00.020518+00:00	10	1	0.019455	81ce123cbb5bd8ce818f60fb3586bba5	03a5648a76832f83c859d46bc06cb64a	9b2d331c329ceb74d3dcfb48d8798c78	06128286bcc64b6a4b0fb7bc0328fe17	bts	2019-11-24	51.400668
1	2019-11-24 00:00:00.025815+00:00	13	2	0.329710	81ce123cbb5bd8ce818f60fb3586bba5	03a5648a76832f83c859d46bc06cb64a	9b2d331c329ceb74d3dcfb48d8798c78	06128286bcc64b6a4b0fb7bc0328fe17	bts	2019-11-24	3.032968
2	2019-11-24 00:00:00.026514+00:00	0	3	0.044465	81ce123cbb5bd8ce818f60fb3586bba5	03a5648a76832f83c859d46bc06cb64a	9b2d331c329ceb74d3dcfb48d8798c78	06128286bcc64b6a4b0fb7bc0328fe17	bts	2019-11-24	22.489599
3	2019-11-24 00:00:00.972404+00:00	13	2	0.329710	cef3390ed299c09874189c387777674a	03a5648a76832f83c859d46bc06cb64a	9b2d331c329ceb74d3dcfb48d8798c78	06128286bcc64b6a4b0fb7bc0328fe17	bts	2019-11-24	3.032968
4	2019-11-24 00:00:00.972817+00:00	6	3	0.051515	cef3390ed299c09874189c387777674a	03a5648a76832f83c859d46bc06cb64a	9b2d331c329ceb74d3dcfb48d8798c78	06128286bcc64b6a4b0fb7bc0328fe17	bts	2019-11-24	19.411822
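To make "adaptive" concrete, here is a minimal sketch of how a Bernoulli Thompson Sampling policy can induce non-uniform propensities like the ones in the preview. It is not the Open Bandit production implementation: the Beta(1, 1) prior, the single display slot, the Monte Carlo estimate, and the click and impression counts are all assumptions made for illustration.

```python
import random


def bts_propensities(clicks, impressions, n_draws=20_000, seed=0):
    """Estimate, per item, the probability of having the largest sampled click rate.

    Assumes an independent Beta(1, 1) prior per item and one display slot.
    """
    rng = random.Random(seed)
    n_items = len(clicks)
    wins = [0] * n_items
    for _ in range(n_draws):
        # Draw one plausible click rate per item from its Beta posterior.
        samples = [
            rng.betavariate(1 + clicks[i], 1 + impressions[i] - clicks[i])
            for i in range(n_items)
        ]
        # The item with the largest sampled rate would be shown in this draw.
        wins[samples.index(max(samples))] += 1
    return [w / n_draws for w in wins]


# Hypothetical feedback counts for three items with equal exposure so far.
clicks = [30, 20, 10]
impressions = [1000, 1000, 1000]
propensities = bts_propensities(clicks, impressions)
```

Under these assumptions the item with the best click history receives most of the probability mass, and the weakest item receives very little. That is the mechanism behind the small propensities and large inverse propensities examined in the rest of this notebook.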

Align The Comparison Columns

The random sample includes item features joined in Notebook 1, while the BTS sample currently contains only the raw logged fields. For policy-behavior diagnostics, we only need fields that exist in both tables.

This cell creates a combined comparison table with shared columns. It keeps the diagnostics focused on logging behavior: actions, positions, rewards, propensities, and timestamps.

comparison_cols = [
    "behavior_policy",
    "timestamp",
    "date",
    "hour",
    "item_id",
    "position",
    "click",
    "propensity_score",
    "inverse_behavior_propensity",
]

comparison_df = pd.concat(
    [random_df[comparison_cols], bts_df[comparison_cols]],
    ignore_index=True,
)

comparison_df.head()

	behavior_policy	timestamp	date	item_id	position	propensity_score	inverse_behavior_propensity
0	random	2019-11-24 00:00:03.800821+00:00	2019-11-24	0	1	0.029412	34.000000
1	random	2019-11-24 00:00:03.801019+00:00	2019-11-24	25	3	0.029412	34.000000
2	random	2019-11-24 00:00:03.801099+00:00	2019-11-24	23	2	0.029412	34.000000
3	random	2019-11-24 00:00:17.634355+00:00	2019-11-24	25	1	0.029412	34.000000
4	random	2019-11-24 00:00:17.634998+00:00	2019-11-24	30	2	0.029412	34.000000

Aligning columns makes the random and BTS logs comparable. With a common schema, the next diagnostics can focus on policy differences instead of table-format differences.

Basic Policy-Level Summary

This cell summarizes each behavior policy at a high level. The key comparison is not just click rate. We also care about how many actions were explored, how propensities vary, and how large inverse propensity weights can become.

A behavior policy with broad support and stable propensities is easier to use for OPE. A policy with concentrated actions or tiny propensities can still be useful, but estimators will usually have higher variance.

def effective_sample_size(weights):
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()

policy_summary = (
    comparison_df.groupby("behavior_policy")
    .agg(
        rows=("click", "size"),
        click_rate=("click", "mean"),
        unique_items=("item_id", "nunique"),
        unique_positions=("position", "nunique"),
        min_propensity=("propensity_score", "min"),
        max_propensity=("propensity_score", "max"),
        mean_propensity=("propensity_score", "mean"),
        propensity_std=("propensity_score", "std"),
        max_inverse_propensity=("inverse_behavior_propensity", "max"),
    )
    .reset_index()
)

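# Map ESS by policy name so values stay aligned with the policy_summary rows.
ess_by_policy = comparison_df.groupby("behavior_policy")["inverse_behavior_propensity"].apply(
    effective_sample_size
)
policy_summary["inverse_propensity_ess"] = policy_summary["behavior_policy"].map(ess_by_policy)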
policy_summary["inverse_propensity_ess_share"] = policy_summary["inverse_propensity_ess"] / policy_summary["rows"]

policy_summary

	behavior_policy	rows	click_rate	unique_items	unique_positions	min_propensity	max_propensity	mean_propensity	propensity_std	max_inverse_propensity	inverse_propensity_ess	inverse_propensity_ess_share
0	bts	200000	0.006030	34	3	0.000235	0.501705	0.124856	0.109383	4255.319149	23285.403834	0.116427
1	random	200000	0.005190	34	3	0.029412	0.029412	0.029412	0.000000	34.000000	200000.000000	1.000000

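The policy-level comparison shows how random and adaptive logs differ. Both samples cover 34 items and 3 positions, but random propensities are constant at 0.029412 while BTS propensities range from 0.000235 to 0.501705, which cuts the inverse-propensity ESS to about 12% of the rows. Broad randomized support usually produces safer OPE estimates than concentrated, production-style support.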

Behavior Policy Interpretation

This cell turns the summary table into a short interpretation table. It is useful for the final writeup because it separates a statistical fact from its OPE meaning.

The key expected pattern is that random logging should have almost constant propensities, while BTS should have more variable propensities because it adaptively favors some actions over others.

interpretation = pd.DataFrame(
    [
        {
            "diagnostic": "Random policy propensity variation",
            "what_to_look_for": "Near-zero standard deviation in propensity_score.",
            "why_it_matters": "Uniform logging makes IPS weights stable for simple evaluation policies.",
            "observed": policy_summary.loc[policy_summary["behavior_policy"] == "random", "propensity_std"].iloc[0],
        },
        {
            "diagnostic": "BTS policy propensity variation",
            "what_to_look_for": "Wider propensity range than random logging.",
            "why_it_matters": "Adaptive logging can create larger IPS weights for actions BTS rarely chose.",
            "observed": policy_summary.loc[policy_summary["behavior_policy"] == "bts", "propensity_std"].iloc[0],
        },
        {
            "diagnostic": "Maximum inverse propensity",
            "what_to_look_for": "Large values flag possible high-variance weights.",
            "why_it_matters": "Rows with small behavior propensity can dominate IPS estimates.",
            "observed": policy_summary["max_inverse_propensity"].max(),
        },
    ]
)

interpretation

	diagnostic	what_to_look_for	why_it_matters	observed
0	Random policy propensity variation	Near-zero standard deviation in propensity_score.	Uniform logging makes IPS weights stable for s...	0.000000
1	BTS policy propensity variation	Wider propensity range than random logging.	Adaptive logging can create larger IPS weights...	0.109383
2	Maximum inverse propensity	Large values flag possible high-variance weights.	Rows with small behavior propensity can domina...	4255.319149

Positivity Checks

The positivity assumption says the behavior policy must assign positive probability to actions that the evaluation policy may choose. Without positivity, historical data cannot tell us what would happen under that evaluation policy.

This cell checks the most basic version of positivity: every logged row must have a positive propensity score. Later, when we define an actual evaluation policy, we will also check whether that evaluation policy asks for actions that are sufficiently represented in the logs.

positivity_checks = (
    comparison_df.groupby("behavior_policy")
    .agg(
        rows=("propensity_score", "size"),
        missing_propensities=("propensity_score", lambda x: x.isna().sum()),
        nonpositive_propensities=("propensity_score", lambda x: (x <= 0).sum()),
        min_propensity=("propensity_score", "min"),
        pct_below_0_01=("propensity_score", lambda x: (x < 0.01).mean()),
        pct_below_0_02=("propensity_score", lambda x: (x < 0.02).mean()),
    )
    .reset_index()
)

positivity_checks

	behavior_policy	rows	missing_propensities	nonpositive_propensities	min_propensity	pct_below_0_01	pct_below_0_02
0	bts	200000	0	0	0.000235	0.065080	0.139180
1	random	200000	0	0	0.029412	0.000000	0.000000

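Both logs pass the row-level check: neither policy has missing or nonpositive propensities. Under BTS, however, about 6.5% of logged rows have propensities below 0.01 and about 13.9% have propensities below 0.02, while random logging has none. Positivity is a core OPE requirement: without support, no amount of modeling can fully recover missing counterfactual rewards.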
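The later check against an actual evaluation policy could look roughly like the sketch below. It is a hypothetical helper, not code from this project: the function name, the dictionary representation of each policy, and the toy probabilities are assumptions. The 0.01 floor mirrors the pct_below_0_01 column above.

```python
def positivity_violations(evaluation_probs, behavior_probs, min_propensity=0.0):
    """List actions the evaluation policy may choose but the behavior policy barely supports.

    With min_propensity=0.0 this is the strict positivity condition.
    A positive floor flags actions that are technically supported but
    would produce very large importance weights.
    """
    violations = []
    for action, evaluation_p in evaluation_probs.items():
        behavior_p = behavior_probs.get(action, 0.0)
        if evaluation_p > 0 and behavior_p <= min_propensity:
            violations.append(action)
    return sorted(violations)


# Toy policies over four actions; all numbers are invented.
behavior_probs = {"a": 0.70, "b": 0.2995, "c": 0.0005, "d": 0.0}
evaluation_probs = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

strict = positivity_violations(evaluation_probs, behavior_probs)
practical = positivity_violations(evaluation_probs, behavior_probs, min_propensity=0.01)
```

Here the strict check flags only action "d", which the behavior policy never plays. The practical check also flags "c", which is supported in theory but would carry a weight of 0.25 / 0.0005 = 500.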

Propensity Distribution By Policy

This cell computes detailed propensity percentiles for each behavior policy. Percentiles are often more informative than the mean because OPE variance is driven by the lower tail of behavior propensities.

For random logging, the percentiles should be almost identical. For BTS, the percentiles should spread out because some actions receive much higher probability than others.

propensity_percentiles = (
    comparison_df.groupby("behavior_policy")["propensity_score"]
    .describe(percentiles=[0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99])
    .reset_index()
)

propensity_percentiles

	behavior_policy	count	mean	std	min	1%	5%	10%	50%	90%	95%	99%	max
0	bts	200000.000000	0.124856	0.109383	0.000235	0.002905	0.008460	0.014485	0.086295	0.305480	0.319375	0.439950	0.501705
1	random	200000.000000	0.029412	0.000000	0.029412	0.029412	0.029412	0.029412	0.029412	0.029412	0.029412	0.029412	0.029412

The propensity diagnostics show how much probability the behavior policy assigned to logged actions. Small propensities imply large inverse-propensity weights, which can make IPS estimates unstable.

Plot Propensity Distributions

This plot compares the random and BTS propensity distributions. The x-axis uses a log scale so small propensities are visible.

This is one of the most important diagnostics in the notebook. A narrow distribution means simple importance weighting is likely to be stable. A wide distribution means the policy placed very different probabilities on different logged actions, increasing the risk of high-variance OPE.

fig, ax = plt.subplots(figsize=(9, 4))
sns.histplot(
    data=comparison_df,
    x="propensity_score",
    hue="behavior_policy",
    bins=60,
    stat="density",
    common_norm=False,
    alpha=0.45,
    ax=ax,
)
ax.set_xscale("log")
ax.set_title("Behavior Propensity Score Distributions")
ax.set_xlabel("Logged Propensity Score")
ax.set_ylabel("Density")
plt.show()

Read the plot against the percentile table above: random logging collapses to a single value at 0.029412, while BTS propensities span more than three orders of magnitude, from 0.000235 to 0.501705. The left tail of the BTS distribution is where large importance weights will come from.

Random Policy Uniformity Check

For the random/men campaign, the behavior policy should be close to uniform over the available items. If there are K actions, the expected random propensity is approximately 1 / K.

This cell compares the logged random propensities to that expected uniform value. We still use the logged propensities in OPE, but this check helps confirm that the random-policy slice behaves as advertised.

random_n_actions = random_df["item_id"].nunique()
random_expected_uniform_propensity = 1 / random_n_actions

random_uniformity = pd.Series(
    {
        "random_unique_items": random_n_actions,
        "expected_uniform_propensity": random_expected_uniform_propensity,
        "observed_min_propensity": random_df["propensity_score"].min(),
        "observed_max_propensity": random_df["propensity_score"].max(),
        "observed_mean_propensity": random_df["propensity_score"].mean(),
        "max_abs_difference_from_uniform": (random_df["propensity_score"] - random_expected_uniform_propensity).abs().max(),
    }
).to_frame("value")

random_uniformity

	value
random_unique_items	34.000000
expected_uniform_propensity	0.029412
observed_min_propensity	0.029412
observed_max_propensity	0.029412
observed_mean_propensity	0.029412
max_abs_difference_from_uniform	0.000000

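The uniformity check confirms that the random logging policy behaves as expected: with 34 observed items, every logged propensity matches 1 / 34 ≈ 0.029412 to the six decimal places shown. Random logs are especially valuable for OPE because they usually provide broader support than production-style adaptive policies.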

Action Distribution By Behavior Policy

This cell counts how often each item appears under each behavior policy. Broad action coverage is crucial because an evaluation policy can only be evaluated where the historical behavior policy collected useful data.

Random logging should be flatter. BTS should be more concentrated because it is learning which items seem promising and assigning them higher probability.

action_distribution = (
    comparison_df.groupby(["behavior_policy", "item_id"])
    .agg(
        rows=("click", "size"),
        click_rate=("click", "mean"),
        avg_propensity=("propensity_score", "mean"),
    )
    .reset_index()
)
action_distribution["row_share"] = action_distribution["rows"] / action_distribution.groupby("behavior_policy")[
    "rows"
].transform("sum")

action_distribution.sort_values(["behavior_policy", "rows"], ascending=[True, False]).head(20)

	behavior_policy	item_id	rows	click_rate	avg_propensity	row_share
13	bts	13	50055	0.007472	0.271668	0.250275
23	bts	23	29450	0.006961	0.155400	0.147250
9	bts	9	18410	0.006355	0.136031	0.092050
21	bts	21	12097	0.006283	0.080557	0.060485
3	bts	3	7635	0.005108	0.055114	0.038175
31	bts	31	7198	0.007502	0.047918	0.035990
27	bts	27	7158	0.004750	0.042910	0.035790
26	bts	26	6916	0.005350	0.039247	0.034580
0	bts	0	6236	0.003849	0.049725	0.031180
1	bts	1	5608	0.009807	0.040800	0.028040
6	bts	6	5154	0.004075	0.056829	0.025770
19	bts	19	5109	0.005872	0.047239	0.025545
7	bts	7	3925	0.002803	0.031182	0.019625
2	bts	2	3806	0.004204	0.029649	0.019030
33	bts	33	3698	0.004327	0.039191	0.018490
11	bts	11	3623	0.004416	0.034436	0.018115
10	bts	10	3029	0.003962	0.020609	0.015145
29	bts	29	2871	0.002786	0.019926	0.014355
16	bts	16	2187	0.005944	0.016189	0.010935
22	bts	22	2114	0.001892	0.028410	0.010570

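The action distribution shows how often each item was logged. Under BTS, item 13 alone accounts for about 25% of the sampled rows, and the three most frequent items account for about 49%. OPE relies on support: evaluation policies are only credible where the behavior policy logged enough comparable actions.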

Action Concentration Metrics

This cell summarizes how concentrated each behavior policy is across actions.

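The Herfindahl index is the sum of squared action shares. It is higher when exposure is concentrated in fewer actions. The top-5 share tells us how much traffic goes to the five most common items. Normalized entropy is the Shannon entropy of the action shares divided by the log of the number of observed actions; it equals 1 for perfectly even exposure and falls toward 0 as exposure concentrates. These metrics help explain why adaptive logging can be harder for OPE than random logging.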

def normalized_entropy(shares):
    shares = np.asarray(shares, dtype=float)
    shares = shares[shares > 0]
    return -(shares * np.log(shares)).sum() / np.log(len(shares))

concentration_rows = []
for policy, policy_actions in action_distribution.groupby("behavior_policy"):
    shares = policy_actions["row_share"].to_numpy()
    concentration_rows.append(
        {
            "behavior_policy": policy,
            "unique_items": policy_actions["item_id"].nunique(),
            "min_rows_per_item": policy_actions["rows"].min(),
            "max_rows_per_item": policy_actions["rows"].max(),
            "max_item_share": policy_actions["row_share"].max(),
            "top_5_item_share": policy_actions.nlargest(5, "row_share")["row_share"].sum(),
            "herfindahl_index": np.square(shares).sum(),
            "normalized_entropy": normalized_entropy(shares),
        }
    )

concentration = pd.DataFrame(concentration_rows)
concentration

	behavior_policy	unique_items	min_rows_per_item	max_rows_per_item	max_item_share	top_5_item_share	herfindahl_index	normalized_entropy
0	bts	34	145	50055	0.250275	0.588235	0.107264	0.785435
1	random	34	5342	6473	0.032365	0.156720	0.029465	0.999743

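The concentration metrics quantify how spread out the logging policy is across actions. Under BTS the five most common items take about 59% of rows, compared with about 16% under random logging, and the least-logged BTS item has only 145 rows compared with 5,342 for random. Lower entropy or higher concentration means a few actions dominate the log, increasing support risk.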
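A small worked example helps calibrate these numbers. The sketch below uses only the standard library and two invented share vectors over 34 actions; the helper names and the "concentrated" profile are assumptions for illustration.

```python
import math


def herfindahl(shares):
    """Sum of squared shares: 1/K for K equal shares, 1.0 for a single action."""
    return sum(s * s for s in shares)


def normalized_entropy(shares):
    """Shannon entropy of the positive shares divided by log(number of positive shares)."""
    positive = [s for s in shares if s > 0]
    if len(positive) < 2:
        return 0.0  # a single action has no spread to measure
    return -sum(s * math.log(s) for s in positive) / math.log(len(positive))


k = 34
uniform = [1 / k] * k                 # perfectly even exposure
concentrated = [0.67] + [0.01] * 33   # one dominant action, invented numbers

hhi_uniform = herfindahl(uniform)
hhi_concentrated = herfindahl(concentrated)
entropy_uniform = normalized_entropy(uniform)
entropy_concentrated = normalized_entropy(concentrated)
```

For 34 equally used items the Herfindahl floor is 1 / 34 ≈ 0.029412, so the observed random value of 0.029465 is essentially at the floor. The reciprocal of the index is sometimes read as an effective number of actions: the BTS value of 0.107264 corresponds to roughly nine equally used items.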

Plot Action Exposure Shares

This plot shows the action exposure profile under each behavior policy. The bars are sorted separately within each policy by exposure share.

A flatter random profile means it explores items evenly. A steeper BTS profile means some items receive much more traffic than others. That concentration can improve online reward for BTS, but it can make offline evaluation of very different policies more difficult.

plot_action_distribution = (
    action_distribution.sort_values(["behavior_policy", "row_share"], ascending=[True, False])
    .assign(action_rank=lambda x: x.groupby("behavior_policy").cumcount() + 1)
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(
    data=plot_action_distribution,
    x="action_rank",
    y="row_share",
    hue="behavior_policy",
    marker="o",
    ax=ax,
)
ax.set_title("Action Exposure Concentration By Behavior Policy")
ax.set_xlabel("Action Rank Within Policy")
ax.set_ylabel("Share of Logged Rows")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.1%}")
plt.show()

The action exposure plot reveals whether the logging policy spreads probability broadly or concentrates on a few items. Concentration creates support risk for policies that choose rarely logged actions.

Position Distribution By Behavior Policy

Open Bandit logs the position where the item was shown. Position is not the main action in this project, but it affects click probability and can interact with item choice.

This cell checks whether the two behavior policies use the three positions similarly. If one policy used certain positions much more often, reward comparisons would need to account for that context difference.

position_distribution = (
    comparison_df.groupby(["behavior_policy", "position"])
    .agg(
        rows=("click", "size"),
        click_rate=("click", "mean"),
        avg_propensity=("propensity_score", "mean"),
    )
    .reset_index()
)
position_distribution["row_share"] = position_distribution["rows"] / position_distribution.groupby("behavior_policy")[
    "rows"
].transform("sum")

position_distribution

	behavior_policy	position	rows	click_rate	avg_propensity	row_share
0	bts	1	66643	0.006467	0.140928	0.333215
1	bts	2	66674	0.006614	0.134425	0.333370
2	bts	3	66683	0.005009	0.099227	0.333415
3	random	1	66653	0.005746	0.029412	0.333265
4	random	2	66679	0.005114	0.029412	0.333395
5	random	3	66668	0.004710	0.029412	0.333340

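The position summary checks where logged recommendations appeared in the slate. Both policies place almost exactly one third of rows in each position, so position exposure is balanced between the two logs. Position can still affect reward, so understanding its distribution helps interpret later policy-value comparisons.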

Plot Position Click Rates

This plot compares click rates by position across the two behavior policies. Position-level CTR differences should be treated descriptively here.

A higher click rate under BTS does not automatically mean BTS is causally better in this sample. BTS may choose different items, with different propensities, for different contexts. The point is to understand the logged data before using causal estimators.

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=position_distribution, x="position", y="click_rate", hue="behavior_policy", ax=ax)
ax.set_title("Observed Click Rate By Position And Behavior Policy")
ax.set_xlabel("Position")
ax.set_ylabel("Observed Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

The position CTR plot shows whether click rates vary by slate position. In this sample, position 3 has the lowest observed click rate under both policies. This is a useful reminder that logged reward is shaped by presentation context, not only by the item identity.

Action-Position Support Matrix

An evaluation policy might choose different items in different positions. That means we should check not only whether each item appears, but also whether each item-position pair appears.

This cell builds action-position count matrices for each behavior policy. Missing or very small cells are warning signs for policies that rely heavily on those action-position combinations.

random_action_position = random_df.pivot_table(
    index="item_id", columns="position", values="click", aggfunc="size", fill_value=0
).astype(int)
bts_action_position = bts_df.pivot_table(
    index="item_id", columns="position", values="click", aggfunc="size", fill_value=0
).astype(int)

support_summary = pd.DataFrame(
    [
        {
            "behavior_policy": "random",
            "observed_action_position_cells": int((random_action_position > 0).sum().sum()),
            "total_action_position_cells": int(random_action_position.size),
            "min_nonzero_cell_count": int(random_action_position[random_action_position > 0].min().min()),
            "zero_cells": int((random_action_position == 0).sum().sum()),
        },
        {
            "behavior_policy": "bts",
            "observed_action_position_cells": int((bts_action_position > 0).sum().sum()),
            "total_action_position_cells": int(bts_action_position.size),
            "min_nonzero_cell_count": int(bts_action_position[bts_action_position > 0].min().min()),
            "zero_cells": int((bts_action_position == 0).sum().sum()),
        },
    ]
)

support_summary

	behavior_policy	observed_action_position_cells	total_action_position_cells	min_nonzero_cell_count	zero_cells
0	random	102	102	1634	0
1	bts	102	102	35	0

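The action-position support table shows whether items are observed across the presentation contexts where evaluation policies might place them. All 102 item-position cells are observed under both policies, but the thinnest cell holds 1,634 rows under random logging and only 35 rows under BTS. Sparse cells warn that value estimates may rely on extrapolation or heavy weights.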
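One way to turn "very small cells" into an explicit list is sketched below. This is a hypothetical helper using only the standard library; the function name, the threshold of 10 rows, and the toy log are assumptions, and a sensible threshold depends on the evaluation policy.

```python
from collections import Counter


def sparse_cells(logged_pairs, items, positions, min_count):
    """Return {(item, position): count} for every cell with fewer than min_count rows.

    The full items x positions grid is enumerated explicitly, so cells
    that never appear in the log are reported with a count of 0.
    """
    counts = Counter(logged_pairs)
    return {
        (item, position): counts[(item, position)]
        for item in items
        for position in positions
        if counts[(item, position)] < min_count
    }


# Toy log of (item_id, position) pairs; item 2 is never shown.
logged_pairs = [(0, 1)] * 50 + [(0, 2)] * 40 + [(1, 1)] * 3 + [(1, 2)] * 60
flagged = sparse_cells(logged_pairs, items=[0, 1, 2], positions=[1, 2], min_count=10)
```

Enumerating the grid from a known item list is a deliberate choice. A pivot table built from the log only contains items that were observed at least once, so an item that never appears would not show up as a zero cell there.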

Plot Action-Position Support

These heatmaps show logged counts for each item-position pair. The color scale uses log1p(count) so both small and large counts are visible.

The random policy should show broad, even support. BTS may show uneven support because it adaptively allocates more exposure to some item-position pairs.

fig, axes = plt.subplots(1, 2, figsize=(10, 8), sharey=True)

sns.heatmap(np.log1p(random_action_position), cmap="Blues", cbar=True, ax=axes[0])
axes[0].set_title("Random Policy\nlog1p(rows)")
axes[0].set_xlabel("Position")
axes[0].set_ylabel("Item ID")

sns.heatmap(np.log1p(bts_action_position), cmap="Oranges", cbar=True, ax=axes[1])
axes[1].set_title("BTS Policy\nlog1p(rows)")
axes[1].set_xlabel("Position")
axes[1].set_ylabel("")

plt.tight_layout()
plt.show()

Inverse Behavior Propensity Diagnostics

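Inverse behavior propensity, 1 / propensity_score, is not yet an OPE weight because we have not defined an evaluation policy. However, an evaluation-policy probability is at most 1, so 1 / propensity_score is the largest weight a row could ever receive. That makes it a useful worst-case diagnostic.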

When behavior propensities are small, inverse propensities are large. Large weights can make IPS estimates noisy because a few rows can dominate the average.

inverse_propensity_percentiles = (
    comparison_df.groupby("behavior_policy")["inverse_behavior_propensity"]
    .describe(percentiles=[0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99])
    .reset_index()
)

inverse_propensity_percentiles

	behavior_policy	count	mean	std	min	1%	5%	10%	50%	90%	95%	99%	max
0	bts	200000.000000	32.824107	90.424960	1.993203	2.272986	3.131115	3.273537	11.588157	69.036935	118.203310	344.234079	4255.319149
1	random	200000.000000	34.000000	0.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000

For BTS, the median inverse propensity is about 11.6, well below the constant 34 under random logging, but the tail is heavy: the 99th percentile is about 344 and the maximum is about 4,255. The BTS mean of about 32.8 is close to the random value, so the mean alone hides the tail risk.
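A complementary way to express tail risk is the share of total weight carried by the largest rows. The sketch below is an illustrative helper, not part of this notebook's pipeline: the function name, the 1% cutoff, and both toy weight vectors are assumptions.

```python
def top_weight_share(weights, top_fraction=0.01):
    """Fraction of the total weight carried by the largest top_fraction of rows."""
    if not weights:
        raise ValueError("weights must not be empty")
    ordered = sorted(weights, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Constant weights, as under uniform random logging.
constant = [34.0] * 1000
# Invented heavy-tailed weights: most rows are small, ten rows are huge.
heavy_tailed = [2.0] * 990 + [4000.0] * 10

constant_share = top_weight_share(constant)
heavy_share = top_weight_share(heavy_tailed)
```

With constant weights the top 1% of rows carry exactly 1% of the total. In the heavy-tailed toy example the same ten rows carry more than 95% of the total, which is the kind of dominance a long right tail makes possible.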

Plot Inverse Propensity Distributions

This plot shows how large 1 / propensity_score can become under each behavior policy. The x-axis uses a log scale because inverse propensities can vary substantially under adaptive logging.

For OPE, this is the variance warning light. A long right tail means later IPS estimates may be sensitive to a small number of logged events.

fig, ax = plt.subplots(figsize=(9, 4))
sns.histplot(
    data=comparison_df,
    x="inverse_behavior_propensity",
    hue="behavior_policy",
    bins=60,
    stat="density",
    common_norm=False,
    alpha=0.45,
    # Bin in log space so the bars have even width on the log axis.
    log_scale=True,
    ax=ax,
)
ax.set_title("Inverse Behavior Propensity Distributions")
ax.set_xlabel("1 / Logged Propensity Score")
ax.set_ylabel("Density")
plt.show()

In this sample, random logging contributes a single value, 34, while BTS inverse propensities run from about 2 to about 4,255. Rows in the BTS right tail are the ones that can dominate an IPS average.
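One way to quantify a long right tail is the share of total weight carried by the largest weights. The sketch below is illustrative only: `top_weight_share` is a hypothetical helper and the arrays are synthetic, not Open Bandit data.

```python
import numpy as np


def top_weight_share(weights, top_fraction=0.01):
    """Share of the total weight carried by the largest `top_fraction` of rows."""
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # largest first
    k = max(1, int(round(top_fraction * w.size)))
    return w[:k].sum() / w.sum()


# Synthetic weights (assumption): constant weights mimic uniform logging,
# while a handful of very large weights mimic an adaptive log's right tail.
constant_weights = np.full(1000, 34.0)
heavy_tail_weights = np.concatenate([np.full(990, 5.0), np.full(10, 4000.0)])

constant_share = top_weight_share(constant_weights)
heavy_tail_share = top_weight_share(heavy_tail_weights)

print(f"constant weights, top 1% share:   {constant_share:.3f}")
print(f"heavy-tail weights, top 1% share: {heavy_tail_share:.3f}")
```

With constant weights the top 1% of rows carries 1% of the weight. With the heavy-tailed array the same 1% carries most of it, which is the situation the log-scale histogram is meant to expose.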

Effective Sample Size For Candidate Weights

Effective sample size, or ESS, summarizes weight concentration:

ESS = (sum weights)^2 / sum(weights^2)

If all weights are equal, ESS equals the number of rows. If a few rows have huge weights, ESS becomes much smaller. This cell computes ESS for inverse behavior propensities as a generic risk diagnostic.

ess_rows = []
for policy, policy_df in comparison_df.groupby("behavior_policy"):
    weights = policy_df["inverse_behavior_propensity"].to_numpy()
    ess = effective_sample_size(weights)
    ess_rows.append(
        {
            "behavior_policy": policy,
            "rows": len(policy_df),
            "inverse_propensity_ess": ess,
            "ess_share_of_rows": ess / len(policy_df),
            "max_inverse_propensity": weights.max(),
            "p99_inverse_propensity": np.percentile(weights, 99),
        }
    )

inverse_ess = pd.DataFrame(ess_rows)
inverse_ess

	behavior_policy	rows	inverse_propensity_ess	ess_share_of_rows	max_inverse_propensity	p99_inverse_propensity
0	bts	200000	23285.403834	0.116427	4255.319149	344.234079
1	random	200000	200000.000000	1.000000	34.000000	34.000000

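Effective sample size turns weight concentration into an intuitive sample-size diagnostic. A low ESS means the estimator has less usable information than the raw row count suggests. Here the BTS inverse propensities have an ESS of about 23,285, roughly 11.6% of the 200,000 rows, while random logging keeps the full 200,000.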
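The ESS formula above can be sketched in a few lines. `ess_sketch` is a hypothetical stand-in written for illustration; the notebook's own `effective_sample_size` helper is defined elsewhere and may differ in details such as input validation.

```python
import numpy as np


def ess_sketch(weights):
    """Effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()


# Synthetic weights (assumption), chosen to show the two extremes.
equal_weights = np.full(1000, 34.0)                                   # all rows count equally
concentrated_weights = np.concatenate([np.full(999, 1.0), [1000.0]])  # one row dominates

ess_equal = ess_sketch(equal_weights)
ess_concentrated = ess_sketch(concentrated_weights)

print(ess_equal)         # equals the number of rows
print(ess_concentrated)  # far below the number of rows
```

The second array has 1,000 rows, but a single weight of 1,000 pulls its ESS down to about 4.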

Uniform Evaluation Policy Weight Preview

This cell previews a real OPE weight for a simple evaluation policy: a uniform policy over the observed action set.

For random logging, this evaluation policy coincides with the behavior policy as long as every item appears in the sample, so weights should equal 1. For BTS logging, the same uniform evaluation policy can create more variable weights because BTS does not choose all actions uniformly.

This is still a diagnostic preview. The next notebook will define evaluation policies more formally and estimate IPS/SNIPS policy values.

uniform_weight_frames = []
for policy, policy_df in comparison_df.groupby("behavior_policy"):
    n_actions = policy_df["item_id"].nunique()
    uniform_eval_prob = 1 / n_actions
    weights = uniform_eval_prob / policy_df["propensity_score"]
    uniform_weight_frames.append(
        policy_df[["behavior_policy", "item_id", "position", "click", "propensity_score"]].assign(
            uniform_eval_prob=uniform_eval_prob,
            uniform_policy_weight=weights,
        )
    )

uniform_weight_df = pd.concat(uniform_weight_frames, ignore_index=True)

uniform_weight_summary = (
    uniform_weight_df.groupby("behavior_policy")["uniform_policy_weight"]
    .describe(percentiles=[0.01, 0.05, 0.50, 0.95, 0.99])
    .reset_index()
)
uniform_weight_summary["ess"] = uniform_weight_df.groupby("behavior_policy")["uniform_policy_weight"].apply(
    effective_sample_size
).values
uniform_weight_summary["ess_share_of_rows"] = uniform_weight_summary["ess"] / uniform_weight_summary["count"]

uniform_weight_summary

	behavior_policy	count	mean	std	min	1%	5%	50%	95%	99%	max	ess	ess_share_of_rows
0	bts	200000.000000	0.965415	2.659558	0.058624	0.066853	0.092092	0.340828	3.476568	10.124532	125.156446	23285.403834	0.116427
1	random	200000.000000	1.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	200000.000000	1.000000

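Random-policy weights are exactly 1 on every row. BTS weights range from about 0.06 to about 125 with a median near 0.34, so most BTS rows are down-weighted and a small number are heavily up-weighted. The BTS mean weight (about 0.97) is close to 1, which is what valid propensities should produce for a fully supported evaluation policy. The ESS values match the inverse-propensity ESS table because, within each log, the uniform weights are a constant multiple of 1 / propensity_score and ESS does not change when weights are rescaled.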
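Two properties of these weights can be checked on a toy example: the expected weight under the logging policy is 1 when the evaluation policy is fully supported, and ESS is unchanged when all weights are multiplied by a constant. The four-action behavior policy below is hypothetical and context-free, which is a simplification of real BTS logging.

```python
import numpy as np

# Hypothetical adaptive logging policy over four actions (assumption).
behavior_probs = np.array([0.55, 0.30, 0.10, 0.05])
uniform_prob = 1 / behavior_probs.size

# Exact expectation of the weight under the logging policy:
# sum_a pi_b(a) * pi_e(a) / pi_b(a) = sum_a pi_e(a) = 1.
weights_per_action = uniform_prob / behavior_probs
expected_weight = float(np.sum(behavior_probs * weights_per_action))

# Simulated log drawn from the behavior policy.
rng = np.random.default_rng(0)
logged_actions = rng.choice(behavior_probs.size, size=50_000, p=behavior_probs)
inverse_propensity = 1 / behavior_probs[logged_actions]
uniform_weights = uniform_prob * inverse_propensity


def ess(w):
    return w.sum() ** 2 / np.square(w).sum()


mean_weight = uniform_weights.mean()
ess_inverse = ess(inverse_propensity)
ess_uniform = ess(uniform_weights)

print(expected_weight)           # 1 up to floating-point error
print(mean_weight)               # close to 1 in a large sample
print(ess_inverse, ess_uniform)  # identical up to floating-point error
```

A sample mean weight far from 1 would be a reason to recheck the logged propensities or the evaluation policy's support before trusting any IPS estimate.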

Plot Uniform Policy Weights

This plot visualizes the candidate weights for evaluating a uniform policy from each behavior log.

The random-policy weights should all sit at 1. The BTS weights should be more dispersed. This is a concrete example of why the same evaluation policy can be easy to estimate from randomized logs and harder to estimate from adaptive logs.

fig, ax = plt.subplots(figsize=(9, 4))
sns.histplot(
    data=uniform_weight_df,
    x="uniform_policy_weight",
    hue="behavior_policy",
    bins=60,
    stat="density",
    common_norm=False,
    alpha=0.45,
    # Bin in log space so the bars have even width on the log axis.
    log_scale=True,
    ax=ax,
)
ax.set_title("Candidate Weights For A Uniform Evaluation Policy")
ax.set_xlabel("uniform evaluation probability / behavior propensity")
ax.set_ylabel("Density")
plt.show()

Naive Observed Click Rate Comparison

This cell compares raw observed click rates under random and BTS logging. This is intentionally labeled naive.

A raw click-rate comparison is not an off-policy estimate because the two behavior policies choose different item distributions. BTS may look better because it learned to favor better items, because it served different contexts, or because of both. Proper policy evaluation requires reweighting or modeling.

naive_policy_values = (
    comparison_df.groupby("behavior_policy")
    .agg(
        rows=("click", "size"),
        observed_click_rate=("click", "mean"),
        # Standard error of the mean with ddof=1, i.e. std(ddof=1) / sqrt(n).
        standard_error=("click", "sem"),
    )
    .reset_index()
)

naive_policy_values

	behavior_policy	rows	observed_click_rate	standard_error
0	bts	200000	0.006030	0.000173
1	random	200000	0.005190	0.000161

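BTS logs show a click rate of about 0.60% versus about 0.52% for random logs, a gap of roughly 0.08 percentage points against standard errors of about 0.02 percentage points each. These are the values of the two logging policies on their own traffic. They serve as baselines for later comparisons, not as estimates for any new policy.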
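A small simulation shows why the raw click rate of an adaptive log is not an estimate for a different policy. Everything here is synthetic and hypothetical: two items with assumed click rates, a logging policy that favors the better item, and a uniform evaluation policy. Context is ignored to keep the sketch short.

```python
import numpy as np

# Assumptions for illustration only.
true_ctr = np.array([0.010, 0.002])      # per-item click probabilities
adaptive_policy = np.array([0.9, 0.1])   # logging policy favoring item 0
uniform_policy = np.array([0.5, 0.5])    # evaluation policy of interest

rng = np.random.default_rng(0)
n_rows = 400_000
actions = rng.choice(2, size=n_rows, p=adaptive_policy)
clicks = rng.binomial(1, true_ctr[actions])

# Naive click rate estimates the value of the logging policy itself.
naive_rate = clicks.mean()

# IPS reweights each row by evaluation probability / behavior probability.
weights = uniform_policy[actions] / adaptive_policy[actions]
ips_uniform = np.mean(weights * clicks)

true_adaptive_value = float(adaptive_policy @ true_ctr)
true_uniform_value = float(uniform_policy @ true_ctr)

print(f"naive click rate:     {naive_rate:.4f} (logging policy value {true_adaptive_value:.4f})")
print(f"IPS for uniform:      {ips_uniform:.4f} (uniform policy value {true_uniform_value:.4f})")
```

The naive rate tracks the logging policy's own value, while the reweighted estimate tracks the uniform policy's value from the same rows.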

Plot Naive Observed Click Rates

This plot is useful because it shows why adaptive policies are tempting: they may have higher observed reward in the logs.

But the notebook should not stop at this plot. The causal question is not simply which log has higher click rate. The causal question is how a target evaluation policy would perform under the same population and context distribution.

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(data=naive_policy_values, x="behavior_policy", y="observed_click_rate", ax=ax)
ax.set_title("Naive Observed Click Rate By Behavior Policy")
ax.set_xlabel("Behavior Policy")
ax.set_ylabel("Observed Click Rate")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:.2%}")
plt.show()

Support Risk By Action

This cell identifies actions with the least support under each behavior policy. These actions are where an evaluation policy would be hardest to evaluate precisely.

For example, if a future policy heavily recommends an item that BTS rarely selected, BTS logs may yield high-variance estimates for that policy. Random logs are less likely to have this issue because exposure is more even.

least_supported_actions = (
    action_distribution.sort_values(["behavior_policy", "rows"], ascending=[True, True])
    .groupby("behavior_policy")
    .head(8)
    .reset_index(drop=True)
)

least_supported_actions

	behavior_policy	item_id	rows	click_rate	avg_propensity	row_share
0	bts	12	145	0.000000	0.004161	0.000725
1	bts	30	321	0.000000	0.003403	0.001605
2	bts	4	327	0.000000	0.004933	0.001635
3	bts	20	638	0.000000	0.005284	0.003190
4	bts	28	665	0.000000	0.004773	0.003325
5	bts	8	829	0.001206	0.011478	0.004145
6	bts	32	865	0.005780	0.010043	0.004325
7	bts	5	1118	0.008945	0.009086	0.005590
8	random	10	5342	0.005241	0.029412	0.026710
9	random	13	5348	0.007105	0.029412	0.026740
10	random	23	5519	0.008154	0.029412	0.027595
11	random	31	5526	0.010496	0.029412	0.027630
12	random	4	5574	0.003409	0.029412	0.027870
13	random	14	5619	0.007475	0.029412	0.028095
14	random	1	5722	0.003495	0.029412	0.028610
15	random	28	5729	0.003840	0.029412	0.028645

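Under BTS, the least-supported item (item 12) appears in only 145 of 200,000 rows, about 0.07% of the sample, and the five least-supported items have no observed clicks at all. Under random logging, even the least-supported item has more than 5,300 rows. Any evaluation policy that places substantial probability on the thinly covered BTS items would depend on very few logged events.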
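Support risk depends on the evaluation policy as well as the log. A minimal sketch, assuming a hypothetical item-count table and a hypothetical context-free evaluation policy, compares how often the evaluation policy wants each item with how often the log actually shows it. The threshold of 5 is arbitrary.

```python
import pandas as pd

# Hypothetical logged counts per item and evaluation-policy probabilities (assumptions).
logged_counts = pd.DataFrame({"item_id": [0, 1, 2, 3], "rows": [9000, 800, 150, 50]})
eval_policy = pd.Series({0: 0.10, 1: 0.20, 2: 0.30, 3: 0.40})

support = logged_counts.assign(
    row_share=lambda d: d["rows"] / d["rows"].sum(),
    eval_prob=lambda d: d["item_id"].map(eval_policy),
)

# A ratio above 1 means the evaluation policy wants the item more often
# than the log shows it, so those rows will be up-weighted.
support["eval_to_log_ratio"] = support["eval_prob"] / support["row_share"]

RATIO_THRESHOLD = 5  # arbitrary cut-off for this illustration
at_risk = support[support["eval_to_log_ratio"] > RATIO_THRESHOLD]

print(support)
print(at_risk["item_id"].tolist())
```

Items flagged this way are where the evaluation policy's estimate rests on the fewest logged rows.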

Support Risk By Propensity

Action counts are useful, but propensity scores are even more directly tied to OPE. This cell identifies rows with the smallest logged propensities under each behavior policy.

Small propensities do not automatically invalidate OPE, but they warn us that estimators may need clipping, self-normalization, doubly robust modeling, or careful evaluation-policy design.

smallest_propensity_rows = (
    comparison_df.sort_values(["behavior_policy", "propensity_score"], ascending=[True, True])
    .groupby("behavior_policy")
    .head(10)
    .reset_index(drop=True)
)

smallest_propensity_rows[["behavior_policy", "timestamp", "item_id", "position", "propensity_score", "inverse_behavior_propensity"]]

	behavior_policy	timestamp	item_id	position	propensity_score	inverse_behavior_propensity
0	bts	2019-11-24 03:59:00.913324+00:00	12	2	0.000235	4255.319149
1	bts	2019-11-24 03:59:01.983928+00:00	12	2	0.000235	4255.319149
2	bts	2019-11-24 03:59:02.199113+00:00	12	2	0.000235	4255.319149
3	bts	2019-11-24 03:59:05.032877+00:00	12	2	0.000235	4255.319149
4	bts	2019-11-24 03:59:06.902389+00:00	12	2	0.000235	4255.319149
5	bts	2019-11-24 03:59:07.450618+00:00	12	2	0.000235	4255.319149
6	bts	2019-11-24 03:59:08.711859+00:00	12	2	0.000235	4255.319149
7	bts	2019-11-24 03:59:12.348441+00:00	12	2	0.000235	4255.319149
8	bts	2019-11-24 02:07:08.701458+00:00	12	3	0.000300	3333.333333
9	bts	2019-11-24 02:07:09.225994+00:00	12	3	0.000300	3333.333333
10	random	2019-11-24 00:00:03.800821+00:00	0	1	0.029412	34.000000
11	random	2019-11-24 00:00:03.801019+00:00	25	3	0.029412	34.000000
12	random	2019-11-24 00:00:03.801099+00:00	23	2	0.029412	34.000000
13	random	2019-11-24 00:00:17.634355+00:00	25	1	0.029412	34.000000
14	random	2019-11-24 00:00:17.634998+00:00	30	2	0.029412	34.000000
15	random	2019-11-24 00:00:17.635198+00:00	17	3	0.029412	34.000000
16	random	2019-11-24 00:00:18.196917+00:00	25	1	0.029412	34.000000
17	random	2019-11-24 00:00:18.196934+00:00	30	2	0.029412	34.000000
18	random	2019-11-24 00:00:18.197239+00:00	17	3	0.029412	34.000000
19	random	2019-11-24 00:00:19.236231+00:00	30	2	0.029412	34.000000

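The smallest BTS propensities all belong to item 12: 0.000235 at position 2 and 0.000300 at position 3, which correspond to inverse propensities of about 4,255 and 3,333. These rows cluster within a few seconds of each other, so the extreme weights arrive in bursts rather than being spread evenly through the log. For random logging every row has the same propensity (0.029412), so the rows shown carry no more risk than any others.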
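The remedies mentioned above, clipping and self-normalization, can be sketched on a tiny synthetic example. The function names, the rewards, the weights, and the clipping threshold are all hypothetical; later notebooks may define these estimators differently.

```python
import numpy as np


def ips(rewards, weights):
    """Plain inverse-propensity-score estimate: mean of weight * reward."""
    return float(np.mean(weights * rewards))


def clipped_ips(rewards, weights, max_weight):
    """IPS with weights capped at `max_weight`; lower variance, some bias."""
    return float(np.mean(np.minimum(weights, max_weight) * rewards))


def snips(rewards, weights):
    """Self-normalized IPS: weighted average of rewards."""
    return float(np.sum(weights * rewards) / np.sum(weights))


# Synthetic rows (assumption): one clicked row carries a very large weight.
rewards = np.array([0, 0, 1, 0, 1, 0, 0, 0])
weights = np.array([0.5, 0.8, 1.2, 0.6, 40.0, 0.7, 0.9, 0.3])

ips_value = ips(rewards, weights)
clipped_value = clipped_ips(rewards, weights, max_weight=10.0)
snips_value = snips(rewards, weights)

print(ips_value, clipped_value, snips_value)
```

In this example the plain IPS estimate exceeds 1 even though clicks are binary, because a single large weight dominates. Clipping reduces that influence at the cost of bias, and the self-normalized estimate always stays within the range of the observed rewards.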

OPE Readiness Checklist

This cell converts the diagnostics into a compact checklist for the next notebook. It distinguishes between basic feasibility and estimator stability.

The main takeaway should be: both policies contain valid propensities, but random logging is cleaner for first-pass OPE because support is broader and weights are more stable.

readiness_checks = pd.DataFrame(
    [
        {
            "check": "Random propensities are positive",
            "status": bool((random_df["propensity_score"] > 0).all()),
            "why_it_matters": "Required for IPS on logged random-policy rows.",
        },
        {
            "check": "BTS propensities are positive",
            "status": bool((bts_df["propensity_score"] > 0).all()),
            "why_it_matters": "Required for IPS on logged BTS-policy rows.",
        },
        {
            "check": "Random action support is broad",
            "status": concentration.loc[concentration["behavior_policy"] == "random", "min_rows_per_item"].iloc[0] > 100,
            "why_it_matters": "Each item has many observations in the sample.",
        },
        {
            "check": "Random weights are stable for uniform policy",
            "status": uniform_weight_summary.loc[
                uniform_weight_summary["behavior_policy"] == "random", "ess_share_of_rows"
            ].iloc[0] > 0.95,
            "why_it_matters": "Uniform evaluation from random logs should be low variance.",
        },
        {
            "check": "BTS requires more careful weighting",
            "status": uniform_weight_summary.loc[
                uniform_weight_summary["behavior_policy"] == "bts", "ess_share_of_rows"
            ].iloc[0] < 0.95,
            "why_it_matters": "Adaptive logging creates more variable weights.",
        },
    ]
)

readiness_checks

	check	status	why_it_matters
0	Random propensities are positive	True	Required for IPS on logged random-policy rows.
1	BTS propensities are positive	True	Required for IPS on logged BTS-policy rows.
2	Random action support is broad	True	Each item has many observations in the sample.
3	Random weights are stable for uniform policy	True	Uniform evaluation from random logs should be ...
4	BTS requires more careful weighting	True	Adaptive logging creates more variable weights.

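The checklist covers three things: positive propensities in both logs, broad action support under random logging, and weight stability as measured by ESS share. The first four checks confirm feasibility and stability. The fifth is a warning flag rather than a pass: True means BTS weights are variable enough to need more careful handling than random-policy weights. Together they justify moving from EDA to estimator design, starting with the random/men log.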

Cache The BTS Sample

This cell saves the parsed BTS sample to parquet. The next notebooks can use this cache for quick diagnostics or comparisons without repeatedly scanning the large zip file.

The raw zip remains the source of truth. This parquet file is a local convenience artifact for faster iteration.

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
bts_df.to_parquet(BTS_SAMPLE_PATH, index=False)

BTS_SAMPLE_PATH

PosixPath('/home/apex/Documents/ranking_sys/data/processed/open_bandit_bts_men_sample.parquet')

The printed path confirms where the parquet cache was written. Later notebooks can load the BTS sample from this file instead of parsing the zip again.

Notebook 2 Takeaways

The Open Bandit logs pass the basic feasibility checks for off-policy evaluation: actions, rewards, and positive propensities are present.

The random/men sample is the best starting point for the first OPE estimators because its action support is broad and its propensities are uniform: every logged row has propensity 0.029412, or 1/34. That makes simple IPS and SNIPS easier to explain and less sensitive to extreme weights.

The bts/men sample is valuable as a comparison because it shows the realistic complication introduced by adaptive logging. BTS concentrates more traffic on some actions and produces more variable propensities: in this sample, inverse propensities reach about 4,255 and the ESS is about 12% of the row count. That does not make BTS unusable, but it means later estimates need stronger diagnostics, possible weight clipping, and doubly robust modeling.

Notebook 3 should now define one or two simple evaluation policies and estimate their values with IPS and self-normalized IPS.